
The Ideal Elasticsearch Index

Rob Sears · April 27, 2018

At the core of Elasticsearch’s popularity is the ease and simplicity of setting up a cluster. And with hosted search options like Bonsai.io, powerful, multi-node clusters can be created instantly.

In a lot of ways, ease of use is both a blessing and a curse. When the learning curve isn’t a barrier to entry, it’s easy to start on a path that causes problems later. Many users find that decisions they made early in the product cycle regarding Elasticsearch don’t scale well in production. Some early decisions can become extremely painful or difficult to change later.

At Bonsai, we’ve had the opportunity to work directly with thousands of users around the world. We’ve had the unique experience of getting a peek into projects integrating Elasticsearch at scale, and helping people get the absolute most out of search. This experience has been instrumental in crafting some guidelines for users just starting out.

This is the guide a lot of users wished they’d had in the beginning.

Why Focus on the Index? Why Not the Cluster?

The short answer is that you should let your data decide the cluster specifications, not vice-versa. The index will determine how the data is stored and retrieved. If you haven’t thought through what your indices will look like and how they will be configured, you may end up underprovisioning the cluster (hurting performance) or overprovisioning the cluster (hurting your wallet).

This guide will help you understand and reason about the factors that will impact the quality of your search experience over the long term.

Elasticsearch Has So Many Use Cases, How Relevant is this Advice to Me?

People use Elasticsearch in so many ways. And each use case is going to benefit from a different set of settings and low level tweaks. However, these are typically changes that can be made at any time, in contrast to immutable settings like primary sharding. In other words, the fine tuning can happen a little later.

Instead, we’re going to focus on a couple of high level characteristics shared by all well-configured indices, and leave the more esoteric stuff for a later series.

That said, the Ideal Elasticsearch Index is…

  1. Backed by a database
  2. Sharded intelligently
  3. Replicated, but not overreplicated
  4. Built on well-thought-out data structures

Let’s look into those in a bit more detail.

Backing Elasticsearch with a Database

Most developers have at least a casual understanding of what a database is: some software that stores data in a structured format, and allows applications to interact with that data using a structured query language. And Elasticsearch is basically the same thing, right?

While that is an accurate description of what Elasticsearch does, the devil is in the details. Database software aims for something called ACID compliance. ACID (Atomicity, Consistency, Isolation, Durability) is a set of properties that guarantee validity of data in the event of unexpected service interruptions.

An “ACID-compliant data store” is a fancy way of saying that the technology is highly resistant to corruption and data loss. Some examples of ACID-compliant data stores you may be familiar with: Postgres, SQLite, Oracle, MySQL (with InnoDB), and MongoDB (4.0+). Not listed: Elasticsearch.

Elasticsearch uses a search engine technology called Lucene. Lucene is an information retrieval technology built for speed, not redundancy. It has a radically different architecture that gives it blazing fast performance, at the expense of being more susceptible to data loss.

Data loss can happen in a number of ways, and you need to be able to recreate the data when it does. True, Elasticsearch has a snapshot/restore feature, but it will only ever partially recover you in the event of data loss. Updates made between the most recent snapshot and the outage will be lost unless you have another system in place to queue them. Snapshot/restore also won’t help in the event of split brain, because there is no mechanism for reconciling updates made to each partition. Those updates will simply be lost.

This is why every Elasticsearch index you create should be backed by an ACID-compliant data store. The database should be the ultimate source of truth, from which the index is populated. If you ever lose the index — via node loss, outage, or simple fat-finger — you’ll be able to completely recover it.
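To make the pattern concrete, here’s a minimal sketch in Python, using an in-memory SQLite table as a stand-in for the source-of-truth database and a plain dict as a stand-in for the search index (the `reindex_all` helper and the `products` schema are made up for illustration):

```python
import sqlite3

def reindex_all(db, index):
    """Rebuild the search index entirely from the source of truth."""
    index.clear()
    for doc_id, title in db.execute("SELECT id, title FROM products"):
        index[doc_id] = {"title": title}  # stand-in for a bulk index request
    return index

# The database is the source of truth...
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, title TEXT)")
db.executemany("INSERT INTO products VALUES (?, ?)",
               [(1, "red wagon"), (2, "blue sled")])

# ...so losing the index is recoverable: just rebuild it.
index = {}
reindex_all(db, index)
print(index)
```

The key design point: the application writes to the database first, and the index is always a derived, disposable artifact.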

Intelligent Sharding

Are you brand new to Elasticsearch? Start by reading this article on Elasticsearch shards and replicas.

Every Elasticsearch index is composed of one or more shards. There are two types of shards: primary and replica. Primary shards are responsible for managing updates to the index, while replica shards are simply copies of the primary that live on other nodes. The idea is that if a primary shard is taken offline, a replica can fill the role and keep search from going down.

Shards are really just abstractions for Lucene indices. When an Elasticsearch index has several primary shards, it can be thought of as having the data spread out over several different search engines. This is useful, for example, when you have millions of documents, because it allows queries to be served in parallel, which can save time and IO.

“Oh,” I hear you say, “well then I’ll just create my index with 100 primary shards and be done with it!” That certainly is one approach, but it’s a terrible one. Shards come with overhead; even empty shards require memory and disk. And tf-idf relevance is calculated on a per-shard basis. Oh, and if you have more shards than CPU cores, they will end up fighting each other for system resources on every request.

In other words, this approach would give you the worst of all worlds unless you already have a lot of data and a lot of nodes.

Another thing to understand is that Elasticsearch uses a naive hashing algorithm to route documents to a given primary shard. This design choice allows documents to be randomly distributed in a reproducible way, avoiding both the “hot spots” that hurt performance and the overallocation of any one shard. However, it has one major downside: the number of primary shards cannot be changed after an index has been created. Replicas can be added and removed at will, but the number of primary shards is basically written in stone.
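A toy sketch of why that is, in Python (using MD5 as a stand-in for the hash Elasticsearch actually uses; `route` is a hypothetical helper, not part of any client library):

```python
import hashlib

def route(routing_value: str, num_primaries: int) -> int:
    # Conceptually what Elasticsearch does:
    #   shard = hash(_routing) % number_of_primary_shards
    digest = hashlib.md5(routing_value.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_primaries

# Routing is reproducible: the same id always maps to the same shard.
assert route("doc-42", 5) == route("doc-42", 5)

# But the mapping only holds for a fixed primary count. Change the
# modulus and most documents would land on a different shard than the
# one they were indexed to -- which is why the count is immutable.
for doc_id in ("a", "b", "c"):
    print(doc_id, "-> shard", route(doc_id, 5), "with 5 primaries vs",
          "shard", route(doc_id, 7), "with 7")
```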

This means that we need to have some sense of our scale before even creating the index. We need a shard scheme that makes sense for our present needs, but with enough headroom to scale up.

Benchmarking is key

Benchmarking a subset of your data before you decide on your production settings will give you a sense of how much data will be used per document, both on-disk and in memory. From there, you can use some heuristics to figure out a good launch point.

Something else to think about is the number of nodes you will be using, and what their capabilities are. A production cluster should not have fewer than three nodes, and should almost always have an odd number. With that in mind, here are some heuristics:

If you don’t expect data to grow significantly, then:

  • One primary shard is fine if you have fewer than 100K documents
  • One primary shard per node is good if you have over 100K documents
  • One primary shard per CPU core is good if you have a couple million documents

If you do anticipate significant growth, then you’ll want to balance your present needs with your expected needs. At a minimum, you’ll want a number of primary shards equal to the number of nodes you would scale to when horizontal scaling is required.

In other words, if you’re starting out on a three node cluster and expect to scale to five nodes, create the index with at least five primary shards. If you have a large enough collection of documents, you could also go with the product of your current and future number of nodes (in this case, 3 nodes today × 5 nodes at scale-up = 15 primary shards). That assumes the nodes in your three node cluster have at least five CPU cores.
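These rules of thumb can be sketched as a small Python helper (the function and its decision logic are our own illustration, not an official formula):

```python
def primary_shard_heuristic(current_nodes: int, scale_up_nodes: int,
                            cores_per_node: int) -> int:
    # At minimum, match the node count you expect to scale up to.
    minimum = scale_up_nodes
    # With enough data, the product of current and future node counts
    # also works -- but only if each node has the cores to serve the
    # shards it will host today.
    product = current_nodes * scale_up_nodes
    shards_per_node_today = product // current_nodes
    return product if shards_per_node_today <= cores_per_node else minimum

print(primary_shard_heuristic(3, 5, cores_per_node=5))  # -> 15
print(primary_shard_heuristic(3, 5, cores_per_node=4))  # -> 5
```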

See why this is an interesting problem?

Thinking about your needs and planning a little ahead will go a long way in creating an Elasticsearch index that performs well over the long term.

Replicating shards, but not Overreplicating them

Replica shards are copies of primary shards that live on any node other than the primary’s. Like primary shards, replicas can serve search requests if needed. Unlike primary shards, replicas do not handle update operations; they simply receive updates forwarded from the primary. A nice benefit of this design is that, unlike the primary shard count, the replica count is not immutable: you can add or remove replicas at any time.

Replicas reduce stress on primary shards, and provide protection against data loss, node loss, network partitions, etc. However, too many replicas lead to wasted resources, because shards aren’t free. The cost-benefit ratio of replication gets worse with each new replica shard.

The ideal Elasticsearch index has a replication factor of at least 1. This means for every primary shard — however many there may be — there is at least one replica. And the maximum number of replicas never exceeds (n-1), where n is the number of nodes in the cluster. (This is a common newbie mistake: creating 10 replicas on a three node cluster and then wondering why the cluster state is perpetually “yellow.” Where are all of those extra replicas supposed to go?)
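A quick sanity check of those bounds, sketched in Python (`replica_ok` is a hypothetical helper, not an Elasticsearch API):

```python
def replica_ok(num_nodes: int, replicas: int) -> bool:
    # At least one replica for redundancy; at most n - 1, because a
    # replica never shares a node with its primary, so anything beyond
    # n - 1 has nowhere to live and sits unassigned ("yellow").
    return 1 <= replicas <= num_nodes - 1

print(replica_ok(3, 1))   # -> True: three nodes, one replica
print(replica_ok(3, 10))  # -> False: the classic newbie mistake
```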

How do you know how many replicas is “enough”? With a minimum of 1 and maximum of (n-1), what is the “sweet spot?” Incremental benchmarking is a good way to narrow in on that. Add replicas incrementally and measure performance improvements with each new replica. If improvements are not noticeable, replication is probably too high.

Thinking through data structures

What are “data structures,” and how are they pertinent to Elasticsearch? Elasticsearch has a concept known as “mappings.” Mappings specify how data will be ingested and queried. They define analyzer chains composed of tokenizers and filters, which allow data (and queries) to be tokenized and mutated before being used.

While a well considered mapping can improve your search results, it’s wise to start simple, benchmark, and iterate later. Tweaking mappings and then promoting them to production is easier than you think. That means you don’t have to index everything in your database. Don’t try to do All The Things™. Think about what features your application actually needs right now, and worry about new features later.

The ideal mapping is also carefully considered: don’t use mappings that you don’t fully understand, and don’t just copy/paste from the Internet. Read the documentation, and know what each component does and how it will affect performance and system resources. If you don’t know what your mappings look like, take an hour to look at them and comprehend how they will work with your data.

The Ideal Elasticsearch Index doesn’t necessarily just use the default data structures; it has mappings that were honed in small-scale testing. A subset of production data can be used to benchmark the performance and resource demands of a mapping. Figure these things out before taking it to scale.

Understanding how a mapping can affect performance

If you’re familiar with the Pipe and Filter Pattern, then this will make a lot of sense. But just in case, let’s look at the standard tokenizer for a moment. This component takes a text input and breaks it up using the Unicode Text Segmentation algorithm. The short explanation is that this splits the input on whitespace and discards most punctuation. So an input like:

It was the best of times, it was the BLURST of times

is broken up into an array like so:

[It, was, the, best, of, times, it, was, the, BLURST, of, times]

The elements of this array are called tokens, and they can be filtered out and mutated before reaching Lucene. We can use a stopwords filter to get rid of tokens that aren’t likely to be relevant to a search:

[best, times, BLURST, times]

Then we can pass these tokens through a lowercase filter to get rid of the all-caps, and then through a synonyms filter, which maps blurst => worst. That will leave us with:

[best, times, worst, times]
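The whole chain above can be simulated in a few lines of Python (the tokenizer and filters here are crude stand-ins for the real Lucene components, but good enough to follow the token flow):

```python
STOPWORDS = {"it", "was", "the", "of"}
SYNONYMS = {"blurst": "worst"}

def standard_tokenizer(text: str) -> list[str]:
    # Roughly: split on whitespace and strip punctuation.
    return [t.strip(",.!?") for t in text.split()]

def stopword_filter(tokens):
    return [t for t in tokens if t.lower() not in STOPWORDS]

def lowercase_filter(tokens):
    return [t.lower() for t in tokens]

def synonym_filter(tokens):
    return [SYNONYMS.get(t, t) for t in tokens]

def analyze(text: str) -> list[str]:
    tokens = standard_tokenizer(text)
    for f in (stopword_filter, lowercase_filter, synonym_filter):
        tokens = f(tokens)
    return tokens

print(analyze("It was the best of times, it was the BLURST of times"))
# -> ['best', 'times', 'worst', 'times']
```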

These tokens are now stored in the inverted index along with the corresponding document ID, ensuring queries will be much more relevant. This pattern gives developers a lot of flexibility, however some seemingly innocuous choices will have enormous consequences at scale.

Why? Well, mappings are easy to understand conceptually, but can take time to master in practice. Here’s a classic example: n-grams. This is a filter which breaks up tokens into lots of sub-tokens, with set minimum and maximum sizes. Developers often like to use n-grams for substring matching at search time, and subsequently give the tokens a size from 1 to 100 (or something obnoxiously large).

To see why this is a problem, look at what a token like “terrific” turns into:

["t", "e", "r", "r", "i", "f", "i", "c", "te", "er", "rr", "ri", "if", "fi", "ic", "ter", "err", "rri", "rif", "ifi", "fic", "terr", "erri", "rrif", "rifi", "ific", "terri", "errif", "rrifi", "rific", "terrif", "errifi", "rrific", "terrifi", "errific", "terrific"]

If you work it out, you find that there is a quadratic relationship between the token length and the number of n-grams it generates: with a wide-open gram range, a token of length L produces L × (L + 1) / 2 grams.
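A quick sketch of that growth in Python (a simplified n-gram generator, not the actual Lucene filter):

```python
def ngrams(token: str, min_gram: int, max_gram: int) -> list[str]:
    out = []
    for size in range(min_gram, min(max_gram, len(token)) + 1):
        for start in range(len(token) - size + 1):
            out.append(token[start:start + size])
    return out

grams = ngrams("terrific", 1, 100)
print(len(grams))  # -> 36, i.e. 8 * 9 / 2: quadratic in token length

# And the substring problem: "scatter" really does emit the
# unrelated token "cat".
print("cat" in ngrams("scatter", 1, 100))  # -> True
```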

Worse, this pattern is generated per token. A standard 500-token document will create 500 times as many grams, many of them garbage. How many words can you think of with substrings that are also unrelated words? “Scatter” contains the totally unrelated word “cat.” If a user searches for “cat,” a document with the term “scatter” could plausibly be a match in this scenario.

And if your gram size is 1, then you’re basically indexing every single character in the alphabet to every single document. Pro-tip: this is not a good way to get relevant results or use system resources intelligently. But it’s a great way to consume lots of disk and memory for no reason.

How about another example? Synonym matching is another feature that is easy to implement, and even easier to screw up. Synonyms allow developers to create rules that group or translate related tokens. You might have a rule that replaces “ram” with “memory,” or one that takes the token “cat” and expands it out to the tokens “cat,” “kitty,” and “kitten.”

Elasticsearch allows developers to perform synonym expansion at index time, query time, or both. Expanding at index time means that the tokens are expanded before being written out to disk. This leads to larger indices, and has the downside of requiring a full reindex every time the synonyms change.

That said, many developers opt instead for query time expansion. The user searches for a term that the synonym filter matches and expands, and those new tokens become part of the Lucene query. But then in production, the users report lots of unexpected and irrelevant results.
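Query-time expansion can be sketched like this in Python (the `SYNONYMS` rule and `expand_query` helper are illustrative only):

```python
SYNONYMS = {"cat": ["cat", "kitty", "kitten"]}

def expand_query(terms: list[str]) -> list[str]:
    # Query-time expansion: each matching term fans out into its
    # synonym group before the query reaches Lucene.
    expanded = []
    for term in terms:
        expanded.extend(SYNONYMS.get(term, [term]))
    return expanded

print(expand_query(["grumpy", "cat"]))
# -> ['grumpy', 'cat', 'kitty', 'kitten']
```

Note that the expanded query now contains rare terms like “kitten” that the user never typed, which is one of the ways scoring starts to surprise you.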

Why? Because multi-word synonyms break phrase queries, and the tf-idf scores for rare terms lead to score boosting in ways that are not immediately obvious. The low-level details of why this happens are well beyond the scope of this post, but needless to say, you could spend weeks wrangling with relevancy tuning because of an ill-considered choice made while working out the mappings.

These are just a few examples; broader coverage of the kinds of subtle mistakes and inefficiencies introduced in early stage data structures would be a whole other series of articles. But it’s a taste of what can happen when choices are made without understanding the consequences and trade-offs.

Conclusion

Elasticsearch and search engines are incredibly broad topics, and mastering them can take years. Fortunately, the fundamentals are easy to grasp and implement. The Ideal Elasticsearch Index is a set of design principles common to most use cases.