The ideal Elasticsearch index

Rob Sears · January 11, 2016
14 minute read

A key reason for the popularity of Elasticsearch is the ease and simplicity of setting up a cluster. It's trivial to run the binary and create an index, and the learning curve required to get started with Elasticsearch is very approachable.

In a lot of ways, this is both a blessing and a curse. Many users find that decisions they made early in the product cycle regarding their Elasticsearch cluster don't scale well and aren't performant in production. There may also be issues with data partitioning that make a proper fix difficult.

At Bonsai, we've had the opportunity to work directly with hundreds of users facing these kinds of issues, and the lessons learned from these interactions have helped us to develop processes and some key best practices for setting up your cluster.
 

Index Design Goals

There are four key things to think about when designing your index.
 

Goal #1: Minimize reindexing time

First, consider initial bulk import performance. Creating an index is trivially easy; the tricky part is populating it with data. Let's say you've integrated Elasticsearch into your application and are now ready to index your data into a live cluster. Depending on your integration details and volume of data, this might take anywhere from a few seconds to many hours.

[Image: danger zone for reindexing]

If you're experiencing the latter, that is a potential problem. Imagine what would happen if you needed to reindex for some reason in production: your app would be down for hours, affecting your users and your business. Our goal here is to leverage Elasticsearch intelligently to minimize the time required for a full reindex.
 

Goal #2: Resilient to write operation spikes

Another thing to think about is update activity over time. Are you updating the index on a regular basis, or do updates occur on the fly? Are they batched, or are you using single-document updates? These are important questions because write operations are resource intensive. If you anticipate that this kind of traffic will rise over time (or will remain constant on average, with brief periods of extreme spikes), then that's something to consider before going into production. Our goal here is to design an index that is resilient against jumps in write operations and can scale easily with traffic.
[Image: Don't experiment with update risk tolerance.]
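
For illustration, here is a rough sketch of the difference between single-document writes and batched writes (the "products" index, type, and fields are hypothetical). The _bulk API amortizes much of the per-request overhead, which matters when write traffic spikes:

PUT /products/product/1
{ "name": "Red widget", "price": 10 }

POST /_bulk
{ "index": { "_index": "products", "_type": "product", "_id": "2" } }
{ "name": "Blue widget", "price": 12 }
{ "index": { "_index": "products", "_type": "product", "_id": "3" } }
{ "name": "Green widget", "price": 15 }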

Goal #3: Efficient Queries

Similarly, you'll need to ask yourself the same question about search activity, albeit for different reasons. Read operations are often not as resource intensive as writes, but that depends on the efficiency of your queries. If your queries are inefficient, then performance will suffer regardless of your index settings. Our goal here is to design an index that is optimized for read traffic and uses resources efficiently, so it can absorb spikes in read operations and the occasional poor query.
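
As a rough sketch of what "efficient" can mean (the index and fields here are hypothetical), one common optimization is to put exact-match conditions in a filter clause, where Elasticsearch can skip relevance scoring and cache the results, rather than in the scoring part of the query:

GET /products/_search
{
  "query": {
    "bool": {
      "must":   { "match": { "name": "widget" } },
      "filter": { "term":  { "in_stock": true } }
    }
  }
}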
 

Goal #4: Scale capacity without index recreation

Finally, there's capacity. Some applications have a relatively fixed set of documents, while others might be adding millions of documents a day. Our goal in this regard is to design an index that can support additional data capacity as needed, without needing to be recreated entirely.
 

Index design best practices

With this in mind, our best practices for index design are:

  • Don't treat Elasticsearch like a database
  • Know your use case BEFORE you jump in!
  • Organize your data wisely
  • Make smart use of replicas
  • Base your capacity plans on experimentation
     

Best Practice #1: Elasticsearch is not your database

[Image: just don't do it]

One of the most important things to keep in mind when forming a mental picture of Elasticsearch: it's not a database! Most developers are familiar with databases like Postgres, MySQL, and MongoDB. Lucene-based search engines like Elasticsearch are fundamentally different. Rather than focusing on data integrity, Lucene is focused on speed (see the inverted index); data is stored in a way that favors fast retrieval over redundancy and resiliency. As a result, Lucene indices are more susceptible to corruption and data loss. The Lucene developers have done a great job mitigating this over the years, but it's still a good idea to populate your Elasticsearch index from a database to hedge against data loss.

Another important point is that creating an index in Elasticsearch involves defining a logical namespace, which is assigned one or more shards across the cluster. Each shard is actually a Lucene instance. When you query the index, Elasticsearch handles the task of routing the request to the proper shards, which then perform the query locally, before returning the results to the coordinator, where they are collated, analyzed, then returned to the user.

There's a lot of information there, and a fuller explanation is beyond the scope of this post. The Elasticsearch documentation is an invaluable resource for gaining a deeper understanding into how the system works. Keep this in mind as we continue with the remaining best practices, so you can get the most from Elasticsearch.
 

Best Practice #2: Know your use case BEFORE you jump in!

Before you even create your first index, take some time to think about your application. What purpose does Elasticsearch serve in your app? Some users want to use the ELK stack to analyze logs, while other users have a product catalog they want searchable by their customers. Other users have a social media-based application that needs to index user activity. These are just a few of the more common use cases for Elasticsearch, but they all have very specific requirements that need to be addressed before data can be added. Here are some examples:
 

  1. Catalog-based data

    Perhaps you need Elasticsearch for its most common purpose: searching a body of documents. Maybe those documents are products in your online store, academic papers from your university's library, or a set of restaurants with reviews and their GPS coordinates. For this use case, the data footprint is not expected to grow significantly over time. You may have other considerations, like frequent updates, spikes in traffic, faceting, etc.

    If this is your use case, it's not a terrible idea to use aliases with your index. For example, if you want an index called "products," it's a good idea to name it something like "products_v1," then supply an alias linking the name "products" to "products_v1." Like this:

    PUT /products_v1
    PUT /products_v1/_alias/products
    

    Consider this: you decide to implement a new feature, and you want to add a "reviews" field to your "products" index. In order for this to take effect, you'll need to completely reindex the data. Depending on how much data you have, this process could take anywhere from a few seconds to a few hours. The latter downtime is probably not acceptable.

    However, if you create a new index, say, "products_v2," you can populate that with the data containing the "reviews" field without affecting the existing v1 index. Once v2 is ready, you can simply swap the aliases and delete the old index:

    POST /_aliases
    {
        "actions": [
            { "remove": { "index": "products_v1", "alias": "products" }},
            { "add":    { "index": "products_v2", "alias": "products" }}
        ]
    }
    
    DELETE /products_v1
    

    This process is atomic, so as far as the users and application are concerned, the transition was seamless and instantaneous.
     

  2. Index per tenant

    Some developers want their Elasticsearch cluster to handle search for a large number of users. One example might be an application that provides business search to a number of customers. Rather than set up a cluster for each customer, the developer sets up a single cluster that gives each customer their own index. Another example might be a social media site, where a user's activity is stored in an index created specifically for that user.

    This so-called index-per-tenant model is a common one, but it doesn't scale well. Every index comes with system overhead, so the more indices you create, the more resources you waste. There are much more efficient methods for handling this, and Elasticsearch has some great articles dealing with this situation. Essentially, it's a better idea to have one large index and use filtered aliases to "fake" the index-per-tenant setup. This method is easier to scale and modify later if needed.
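
    Here's a minimal sketch of that approach (the index, alias, and field names are hypothetical): one shared index, with a filtered alias per tenant so each customer's searches are automatically scoped to their own documents:

    POST /_aliases
    {
        "actions": [
            { "add": {
                "index": "businesses",
                "alias": "tenant_acme",
                "filter": { "term": { "tenant_id": "acme" } }
            }}
        ]
    }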
     

  3. Time-based data

    If you plan to have data that is time-based (log events, for example), you should consider avoiding a single, large index. It's a much smarter idea to index by timeframe, say by appending something like '-YYYY.MM.DD' to your index name. Aliases can then be used so that the application can keep sending PUT requests to the same name. Something like this should work:

    PUT /events-20150428
    PUT /events-20150428/_alias/events
    
    PUT /events/event/123 ...
    PUT /events/event/124 ...
    PUT /events/event/125 ...
    
    PUT /events-20150429
    POST /_aliases
    {
        "actions": [
            { "remove": { "index": "events-20150428", "alias": "events" }},
            { "add":    { "index": "events-20150429", "alias": "events" }}
        ]
    }
    
    # Close the month-old index to free up resources (its data stays on disk):
    POST /events-20150328/_close
    
    PUT /events/event/126 ...
    PUT /events/event/127 ...
    PUT /events/event/128 ...
    

    The reason this works is due to the nature of time-based data. In practice, the older an event is, the less likely it is to be relevant to our immediate interests. By indexing this way, we can get better response times for recent events and have the flexibility to close indices after a certain time frame. An index that is a month old can be closed, thereby freeing up system resources while keeping the data handy. If there's ever a need to search for events older than a month, those indices can simply be re-opened, queried, then closed again.
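
    For example, pulling up a closed index for an ad hoc query might look like this (a sketch, reusing the hypothetical index names from above):

    POST /events-20150328/_open

    GET /events-20150328/_search
    { "query": { "match_all": {} } }

    POST /events-20150328/_close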

    In practice, it's a good idea to start with an index per day, and increase the frequency of index creation (hourly, for example) if a single day's index becomes "hot" enough to demand it.
     

Best Practice #3: Organize your data wisely

Once you have considered your use case and have a good idea of different data organization schemes in the abstract, the next step is thinking about how that data will be physically partitioned across the cluster. In practice, splitting your index into several primary shards across the cluster will improve your indexing throughput.

There are a few possibilities here. To achieve basic load balancing, use at least one primary shard per node to distribute load evenly across the cluster. If you need even more power, you can get the maximum performance out of the cluster by splitting the index into one shard per CPU core, per node. That is, if you have 3 nodes with 8 CPUs each, then you can split your index into 24 primary shards (8 shards per node) to get the absolute most from the cluster.

If you think that you may need to scale horizontally (or employ autoscaling) at some point in the future, partition the index over the least common multiple (LCM) of the current and expected node counts. For example, if you are starting with a 3 node cluster with plans to eventually scale to a 5 node cluster, give the index 15 primary shards. You'll start off with 5 primary shards per node, which drops to 3 primary shards per node when you add the next two nodes to the cluster. If you instead maxed out the primary shards on the 3 node cluster and then added two nodes, the load would be distributed unevenly.
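
The number of primary shards is fixed at index creation time, so it has to be declared up front. Continuing the hypothetical example above, creating a 15-shard index might look like this (a sketch; the index name is made up and reuses the earlier "products_v1" convention):

PUT /products_v1
{
  "settings": {
    "number_of_shards": 15,
    "number_of_replicas": 1
  }
}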
 

Best Practice #4: Make smart use of replicas

Replication is one of the killer features of Elasticsearch. It allows you to automatically create copies of your shards, which are then distributed across the cluster. That way, if a node goes down, no data is lost, because copies of its primary shards exist on the remaining nodes; those replicas are then promoted to primary shards. The nice thing about replication is that, unlike the primary shard count, the number of replicas can be changed on the fly. Replicas can serve read operations as well, and so can help handle the load of searches.
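
For instance, the replica count on a live index can be changed with a single settings update (the index name here is hypothetical):

PUT /products/_settings
{
  "index": { "number_of_replicas": 2 }
}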

A single replica is all you need for basic redundancy, but you can gain additional search throughput by adding more replicas. Elasticsearch has an index setting, auto_expand_replicas, that can be set to "1-all" like so:

PUT /<index>/_settings
{
  "auto_expand_replicas" : "1-all"
}

This tells Elasticsearch to put a replica on every other node. So if you have 5 nodes, with the primary shard on node 1, then Elasticsearch would place replica shards on nodes 2 through 5. The nice thing about this is that when new nodes join the cluster, Elasticsearch will automatically create a replica of the primary shard there.

You could also consider using custom search routing to make efficient reuse of filter caches. By default, queries to an Elasticsearch index must hit every shard in the index, which can introduce additional latency to the response time. With custom routing, Elasticsearch only needs to hit one shard, which gives a noticeable performance boost. When combined with smart use of replication, this strategy maximizes your search performance. See this post on the Elasticsearch blog for more info.
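
A minimal sketch of custom routing (the index, type, and routing values are hypothetical): documents indexed with a routing value are stored on a single shard, and searches that pass the same value only hit that shard:

PUT /businesses/business/1?routing=user42
{ "name": "Joe's Pizza", "owner_id": "user42" }

GET /businesses/_search?routing=user42
{
  "query": { "match": { "name": "pizza" } }
}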
 

Best Practice #5: Base your capacity plans on experimentation

The best way to find out how many shards your new index will need is to run some tests. We've found a reliable way to do this is to spin up a virtual machine (locally or on a cloud provider like AWS), allocate memory and disk capacity equal to what is offered on your Bonsai plan, then fire up Elasticsearch. From there, create an index consisting of a single, unreplicated shard.

Begin simulating real production usage (adding documents, running searches, issuing updates, etc.) and increase the intensity of the behavior until the cluster breaks. "Break" is a somewhat subjective term here, but generally it means the point at which response times become unacceptable. Once this happens, stop and measure how much data the experimental shard has accumulated.

After you've measured the data footprint of the experimental shard, take the total amount of production data you need to index and add 20%. Divide this by the data footprint of the experimental shard. This will give you a rough estimate of how many primary shards you'll need for your index.

Next, multiply the number of primary shards by the number of replicas you want in order to get the total number of replicas you'll need. Add these two numbers together to get the total number of shards.

Multiply the total number of shards you'll need by the data footprint of the experimental shard. That will give you the data footprint of your cluster. Finally, multiply that number by 2.5 to account for the overhead of large merges and other index maintenance. This will tell you the total amount of data you'll need for your cluster.
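
As a worked example with made-up numbers: suppose the experimental shard held 10 GB when it broke, you have 300 GB of production data, and you want one replica per primary:

300 GB * 1.2   = 360 GB to index (with 20% headroom)
360 GB / 10 GB = 36 primary shards
36 * 1 replica = 36 replica shards
36 + 36        = 72 shards total
72 * 10 GB     = 720 GB cluster data footprint
720 GB * 2.5   = 1,800 GB of total cluster capacity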
 

Summary

Elasticsearch is a powerful and multifaceted search engine that makes it easy to get up and running fast. If you want to get the most out of it, however, you need to do some preliminary work. Elasticsearch was built to scale to insane dimensions, but lack of forethought in the initial stages can make scaling much more difficult. To recap:

  • Don't use Elasticsearch as a primary data store! Lucene has come a long way over the years, but it still can't match the resiliency of a database. There are a handful of situations that can cause you to lose data in Elasticsearch, so make sure it's backed by a database.
  • Think about your use case! This will help you configure the index in an intelligent, scalable way.
  • Don't under- or over-allocate shards! Run some experiments to get a plausible estimate for how many shards your index needs.
  • Optimize your queries in development! Bad queries can't be fixed by throwing money at the problem.
  • Replication aids in failover and search performance. On Bonsai, we handle these details for you, so you can focus on your application.