Using WordNet with Bonsai

An explanation of WordNet and the two ways to use it with Bonsai.
Last updated
July 7, 2023

WordNet is a huge lexical database that collects and orders English words into groups of synonyms. It can offer major improvements in relevancy, but it is not at all necessary for many use cases. Make sure you understand the tradeoffs (discussed below) well before setting it up.

There are two ways to use WordNet with Bonsai. Users can add a subset of the list using the Elasticsearch API, or use the WordNet file that comes standard with all Bonsai clusters.

First, a brief background on synonyms and WordNet. If you want to jump around, the main sections of this document are:

  • What Are Synonyms in Elasticsearch?
  • How Does WordNet Improve Synonyms?
  • Why Wouldn’t Everyone Want WordNet?
  • Using WordNet via the Elasticsearch API
  • Using the WordNet List File, wn_s.pl
  • Resources

What Are Synonyms in Elasticsearch?

Let’s say that you have an online store with a lot of products. You want users to be able to search for those products, but you want that search to be smart. For example, say that your user searches for “bowtie pasta.” You may have a product called “Funky Farfalle” which is related to their search term but which would not be returned in the results because the title has “farfalle” instead of “bowtie pasta”. How do you address this issue?

Elasticsearch has a mechanism for defining custom synonyms, through the Synonym Token Filter. This lets search administrators define groups of related terms and even corrections to commonly misspelled terms. A solution to this use case might look like this:

<div class="code-snippet w-richtext"><pre><code fs-codehighlight-element="code" class="hljs language-javascript">{
   "settings": {
       "index" : {
           "analysis" : {
               "filter" : {
                   "synonym" : {
                       "type" : "synonym",
                       "synonyms" : [
                           "bowtie pasta, farfalle"
                       ]
                   }
               }
           }
       }
   }
}</code></pre></div>

This is great for solving the proximate issue, but it can get extremely tedious to define every group of related words in your index.

How Does WordNet Improve Synonyms?

WordNet is essentially a text database which places English words into synsets - groups of synonyms - and can be considered as something of a cross between a dictionary and a thesaurus. An entry in WordNet looks something like this:

<div class="code-snippet w-richtext"><pre><code fs-codehighlight-element="code" class="hljs language-javascript"></code></pre></div>

Let’s break it down:

This line expresses that the word ‘kitty’ is a noun, and the first word in synset 102122298 (which includes other terms like “kitty-cat,” “pussycat,” and so on). The line also indicates ‘kitty’ is the fourth most commonly used term according to semantic concordance texts. You can read more about the structure and precise definitions of WordNet entries in the documentation.
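
To make the field layout concrete, here is a minimal Python sketch that splits one entry into its parts. The field names (synset_id, w_num, word, ss_type, sense_number, tag_count) follow the WordNet Prolog database documentation. The function name parse_wordnet_line and the sample line are illustrative assumptions: the sample is pieced together from the description above, and its final tag count of 0 is a placeholder rather than a value taken from the real file.

```python
import re
from typing import NamedTuple, Optional


class WordNetEntry(NamedTuple):
    synset_id: int     # identifier of the synset the word belongs to
    w_num: int         # position of the word within the synset
    word: str          # the word or phrase itself
    ss_type: str       # n, v, a, s or r (noun, verb, adjective, satellite, adverb)
    sense_number: int  # rank of this sense among the word's senses
    tag_count: int     # times the sense was tagged in the concordance texts


# One entry per line: s(synset_id,w_num,'word',ss_type,sense_number,tag_count).
_ENTRY_RE = re.compile(r"^s\((\d+),(\d+),'(.*)',([nvasr]),(\d+),(\d+)\)\.$")


def parse_wordnet_line(line: str) -> Optional[WordNetEntry]:
    """Parse a single wn_s.pl line, or return None if it is not an entry."""
    match = _ENTRY_RE.match(line.strip())
    if match is None:
        return None
    synset_id, w_num, word, ss_type, sense_number, tag_count = match.groups()
    return WordNetEntry(
        int(synset_id),
        int(w_num),
        word.replace("''", "'"),  # Prolog escapes a quote by doubling it
        ss_type,
        int(sense_number),
        int(tag_count),
    )


# Hypothetical sample line, reconstructed from the description above.
entry = parse_wordnet_line("s(102122298,1,'kitty',n,4,0).")
```

Lines that do not match the expected shape come back as None, so a caller can skip them instead of failing on a stray comment or blank line.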

WordNet has become extremely useful in text processing applications, including data storage and retrieval. Some use cases require features like synonym processing, for which a lexical grouping of tokens is invaluable.

Why Wouldn’t Everyone Want WordNet?

Relevancy tuning can be a deeply complex subject, and WordNet – especially when the complete file is used – has tradeoffs, just like any other strategy. Synonym expansion can be really tricky and can result in unexpected sorting, lower performance and more disk use. WordNet can introduce all of these issues with varying severity.

When synonyms are expanded at index time, Elasticsearch uses WordNet to generate all tokens related to a given token, and writes everything out to disk. This has several consequences: slower indexing speed, higher load during indexing, and significantly more disk use. Larger index sizes often correspond to memory issues as well.

There is also the problem of updating. If you ever want to change your synonym list, you’ll need to reindex everything from scratch. And WordNet includes multi-term synonyms in its database, which can break phrase queries.

Expanding synonyms at query time resolves some of those issues, but introduces others. Namely, performing expansion and matching at query time adds overhead to your queries in terms of server load and latency. And it still doesn’t really address the problem of multi word synonyms.
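
In configuration terms, the choice between the two strategies comes down to which analyzer a field uses when documents are indexed and which it uses when queries are parsed. The sketch below builds both variants as Python dictionaries. The analyzer and search_analyzer mapping parameters are standard Elasticsearch; the field name title, the analyzer names, the helper build_index_body, and the typeless mapping layout (Elasticsearch 7 and later) are assumptions made for illustration.

```python
import json


def build_index_body(expand_at: str) -> dict:
    """Build index settings that expand synonyms at 'index' or 'query' time."""
    if expand_at not in ("index", "query"):
        raise ValueError("expand_at must be 'index' or 'query'")

    analysis = {
        "filter": {
            "wn_synonym_filter": {
                "type": "synonym",
                "format": "wordnet",
                "synonyms_path": "analysis/wn_s.pl",
            }
        },
        "analyzer": {
            # Hypothetical analyzer names; only the second one expands synonyms.
            "plain_text": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase"],
            },
            "with_synonyms": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "wn_synonym_filter"],
            },
        },
    }

    if expand_at == "index":
        # Expanded tokens are written to disk; queries are analyzed plainly.
        field = {"type": "text", "analyzer": "with_synonyms",
                 "search_analyzer": "plain_text"}
    else:
        # The index stays small; each query pays the expansion cost.
        field = {"type": "text", "analyzer": "plain_text",
                 "search_analyzer": "with_synonyms"}

    return {
        "settings": {"analysis": analysis},
        "mappings": {"properties": {"title": field}},
    }


query_time_body = build_index_body("query")
print(json.dumps(query_time_body["mappings"], indent=2))
```

Switching from one variant to the other changes only the field mapping, which makes it straightforward to try both against the same data and compare index size and query latency.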

The Elasticsearch documentation has some really great examples of what this means. The takeaway is that WordNet is not a panacea for relevancy tuning, and it may introduce unexpected results unless you’re doing a lot of preprocessing or additional configuration.

tl;dr: Do not simply assume that chucking a massive synset collection at your cluster will make it faster with more relevant results.

Using WordNet via the Elasticsearch API

Elasticsearch supports several different list formats, including the WordNet format. WordNet synonyms are maintained in a Prolog file called <span class="inline-code"><pre><code>wn_s.pl</code></pre></span>. To use these in your cluster, you’ll need to download the WordNet archive and extract the <span class="inline-code"><pre><code>wn_s.pl</code></pre></span> file. You’ll then need to create your synonyms list by reading this file into a request to your cluster.

The target index could be created with settings like the following, where the synonyms array holds the entries read from the wn_s.pl file:

<div class="code-snippet w-richtext"><pre><code fs-codehighlight-element="code" class="hljs language-javascript">PUT https://randomuser:randompass@something-12345.us-east-1.bonsai.io/some_index

   {
     "settings": {
       "analysis": {
         "filter": {
           "wn_synonym_filter": {
             "type": "synonym",
             "format" : "wordnet",
             "synonyms" : [
                 "s(100000001,1,'abstain',v,1,0).",
                 "s(100000001,2,'refrain',v,1,0).",
                 "s(100000001,3,'desist',v,1,0)."
             ]
           }
         },
         "analyzer": {
           "my_synonyms": {
             "tokenizer": "standard",
             "filter": [
               "lowercase",
               "wn_synonym_filter"
             ]
           }
         }
       }
     }
   }</code></pre></div>

There are a number of ways to generate this request. You could do it programmatically with a language like Python, or Bash scripts with <span class="inline-code"><pre><code>curl</code></pre></span>, or any language with which you feel comfortable.

A benefit of using a subset of the list would be more control over your mappings and data footprint. Depending on when your analyzer is running, you could save IO by not computing unnecessary expansions for terms not in your corpus or search parameters. Reducing the overhead will improve performance overall.
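
As one possible approach in Python, the sketch below keeps only the synsets that contain at least one word from a vocabulary you care about, then builds the request body shown earlier. The function names, the sample vocabulary, and the second sample synset (id 100000002) are hypothetical; a real run would read the lines from the extracted wn_s.pl file instead of the inline list.

```python
import json
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Set

# Captures the synset id and the quoted word of one wn_s.pl entry.
_ENTRY_RE = re.compile(r"^s\((\d+),\d+,'(.*)',[nvasr],\d+,\d+\)\.$")


def select_synonym_lines(lines: Iterable[str], vocabulary: Set[str]) -> List[str]:
    """Return all entries of every synset that has a word in `vocabulary`."""
    by_synset: Dict[str, List[str]] = defaultdict(list)
    wanted: Set[str] = set()
    for raw in lines:
        line = raw.strip()
        match = _ENTRY_RE.match(line)
        if match is None:
            continue  # skip anything that is not a synonym entry
        synset_id = match.group(1)
        word = match.group(2).replace("''", "'")
        by_synset[synset_id].append(line)
        if word.lower() in vocabulary:
            wanted.add(synset_id)
    selected: List[str] = []
    for synset_id, entries in by_synset.items():
        # A synset with a single word has nothing to expand to.
        if synset_id in wanted and len(entries) > 1:
            selected.extend(entries)
    return selected


def build_settings(synonym_lines: List[str]) -> dict:
    """Build the index-creation body around the selected entries."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "wn_synonym_filter": {
                        "type": "synonym",
                        "format": "wordnet",
                        "synonyms": synonym_lines,
                    }
                },
                "analyzer": {
                    "my_synonyms": {
                        "tokenizer": "standard",
                        "filter": ["lowercase", "wn_synonym_filter"],
                    }
                },
            }
        }
    }


# Inline sample standing in for the contents of wn_s.pl.
sample_lines = [
    "s(100000001,1,'abstain',v,1,0).",
    "s(100000001,2,'refrain',v,1,0).",
    "s(100000001,3,'desist',v,1,0).",
    "s(100000002,1,'kitty',n,1,0).",
    "s(100000002,2,'pussycat',n,1,0).",
]
body = build_settings(select_synonym_lines(sample_lines, {"refrain"}))
print(json.dumps(body, indent=2))
```

The resulting JSON can be sent as the body of the PUT request with whatever HTTP client you prefer. The vocabulary could come from your product catalog or from search logs, depending on whether the analyzer runs at index time or query time.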

Using the WordNet List File, wn_s.pl

If you would rather use the official WordNet list, it is part of our Elasticsearch deployment. You can follow the official Elasticsearch documentation for WordNet synonyms, and link to the file with <span class="inline-code"><pre><code>analysis/wn_s.pl</code></pre></span>. For example:

<div class="code-snippet w-richtext"><pre><code fs-codehighlight-element="code" class="hljs language-javascript">PUT https://username:password@my-awesome-cluster.us-east-1.bonsai.io/some_index
{
   "settings": {
       "index" : {
           "analysis" : {
               "analyzer" : {
                   "synonym" : {
                       "tokenizer" : "whitespace",
                       "filter" : ["synonym"]
                   }
               },
               "filter" : {
                   "synonym" : {
                       "type" : "synonym",
                       "format" : "wordnet",
                       "synonyms_path" : "analysis/wn_s.pl"
                  }
               }
           }
       }
   }
}</code></pre></div>

Resources

WordNet is a large subject and a great topic to delve deeper into. Here are some links for further reading:
