The Importance of Shard Math in Elasticsearch
Maddie Jones, Customer Success Engineer • in Search • 9 min read
As Customer Success Engineers for Bonsai, we often interface with people experiencing problems with their cluster. Poor performance, relevancy problems, and data loss are some of the most common issues we encounter. Often, these problems can be traced back to an inappropriate sharding decision early in the product cycle.
In this post, we’re going to explore shards and some basic heuristics for configuring them. Keep in mind that no single heuristic works equally well for all deployments. Enterprise-grade and other large-scale clusters involve some different logic and will be covered in a future post. Here, we’re going to look at settings and heuristics that apply to small- and medium-sized production clusters.
What Are Shards?
To start things off, let’s define shards. Elasticsearch is built on top of Lucene, which is a data storage and retrieval engine. What are called “shards” in Elasticsearch are technically collections of Lucene segments (the files in which the cluster’s data is stored). Elasticsearch manages these different segments, spreading data across its nodes, and automatically balancing the data as evenly as it can.
Our Shard Primer document has plenty of visuals to help explain shard allocation and the difference between primary and replica shards. Definitely check that out if these concepts are new to you.
With that out of the way, here are the most common mistakes we see when it comes to sharding.
Not Enough Replicas
Sometimes people decide to reduce their shard usage by avoiding replicas altogether. Usually, this is done as a cost-saving measure, but it creates a significant risk of data loss.
When an index has no replica shards, this means that there is no redundant copy of the data online. So, if a node is terminated or otherwise dropped from the cluster, whatever data was on that node is now gone. This means that Elasticsearch can no longer read or write to the affected index.
The only time it is acceptable to have 0 replicas is if you have a 1-node cluster. That would likely be something that you’re running locally, or as a dev/staging instance where the availability of the data isn’t important. Production grade clusters need 3 or more nodes for stability, which means a single replica shard per primary is the minimum needed.
Too Many Replicas
So if 0 replicas is bad, and 1 replica is good… then many replicas must be great. Right?
Replica shards are a case where more is only sometimes better, and even then only up to a point. The case for more than one replica is primarily to facilitate high-traffic search applications. But there’s still an upper limit.
A replica cannot occupy the same node as its primary or as another replica of the same primary. So if an index has too many replicas, Elasticsearch will not be able to allocate them all, leaving the cluster in a permanent “yellow” state.
To put it more concretely, if you have 3 nodes, then no index can have more than 2 replicas. The primary shard of data will be on one node, and no replica can be on the same node as the primary or the other replica. This leaves space for two replicas. Similarly, if your cluster has 5 nodes, you can have up to 4 replicas. A 10 node cluster can have indices with up to 9 replicas. And so on.
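The rule above is simple enough to sketch in a few lines of Python. This is an illustrative helper, not part of any Elasticsearch client: the function names `max_replicas` and `unassigned_replicas` are hypothetical, and the sketch assumes every node in the count is a data node that is allowed to hold the index.

```python
def max_replicas(data_nodes: int) -> int:
    """Largest replica count per primary that can be fully allocated.

    Each copy of a shard (the primary plus its replicas) needs its own
    node, so the ceiling is one less than the number of data nodes.
    """
    if data_nodes < 1:
        raise ValueError("a cluster needs at least one data node")
    return data_nodes - 1


def unassigned_replicas(data_nodes: int, primaries: int, replicas: int) -> int:
    """Number of replica shards that can never be placed (hypothetical helper).

    Any replicas beyond the ceiling stay unassigned for every primary,
    which is what keeps the cluster "yellow".
    """
    extra_per_primary = max(0, replicas - max_replicas(data_nodes))
    return extra_per_primary * primaries


print(max_replicas(3))               # 2
print(max_replicas(10))              # 9
print(unassigned_replicas(3, 1, 3))  # 1
```

In the last call, a 3-node cluster asked for 3 replicas of a single primary leaves one replica with nowhere to go.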
The other thing to consider is that every change to a primary shard also has to be written to each of its replicas. With many replicas, this creates a lot of overhead work for your cluster, which takes resources away from search.
The other problem we commonly see is when an index has either too many or too few primary shards. This leads to “hotspots” (nodes where there is significantly more data or traffic than on the other nodes). Hotspots prevent a cluster from performing optimally and can be difficult to resolve at scale.
For example, suppose you have a cluster with 3 nodes and an index with 2 primary shards, each with a single replica. This means you’ll end up with 4 total shards (2 primary + 2 replica), spread out over 3 nodes. Because 4 is not evenly divisible by 3, two of the nodes will have one shard, and one node will have two shards.
This is a problem because the node with 2 shards will have less disk available, and its shards might compete for system resources. On the other nodes, there would be no such competition.
Further, suppose each node has 50GB of disk available for your data; the 3 node cluster should be able to host 150GB of data. In theory, all shards would be about the same size, and the cluster should be able to host up to 75GB of primary data. But because two shards are sharing a node, the maximum size for each shard is 25GB. This means that the index itself can only have 50GB of primary data, and 100GB total.
In other words, the primary shard choice has made it impossible to utilize 50GB (1/3) of the disk available to the cluster!
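The arithmetic in this example can be checked with a short Python sketch. It rests on two simplifying assumptions that mirror the example: all shards are the same size, and shards are spread across nodes as evenly as possible by count. The function names and the `stranded_gb` label are ours, for illustration only.

```python
def shards_per_node(total_shards: int, nodes: int) -> list[int]:
    """Spread shards across nodes as evenly as possible, by count."""
    base, extra = divmod(total_shards, nodes)
    return [base + 1] * extra + [base] * (nodes - extra)


def usable_capacity_gb(primaries: int, replicas: int, nodes: int,
                       disk_per_node_gb: float) -> dict:
    """Estimate how much data an index can hold before its busiest node fills.

    Assumes equal-sized shards; the busiest node caps the size of every shard.
    """
    if primaries < 1 or replicas < 0 or nodes < 1:
        raise ValueError("need >= 1 primary, >= 0 replicas, >= 1 node")
    total_shards = primaries * (1 + replicas)
    layout = shards_per_node(total_shards, nodes)
    max_shard_gb = disk_per_node_gb / max(layout)
    total_gb = max_shard_gb * total_shards
    return {
        "layout": layout,
        "max_shard_gb": max_shard_gb,
        "primary_gb": max_shard_gb * primaries,
        "total_gb": total_gb,
        "stranded_gb": nodes * disk_per_node_gb - total_gb,
    }


# The example above: 2 primaries, 1 replica each, 3 nodes with 50GB apiece.
print(usable_capacity_gb(primaries=2, replicas=1, nodes=3, disk_per_node_gb=50))
```

For the example above, this reports a layout of 2, 1, and 1 shards per node, a 25GB ceiling per shard, 50GB of primary data, 100GB in total, and 50GB of disk that cannot be used. Changing `primaries` to 3 gives every node two shards and brings the unusable disk down to zero.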
Having more primary shards will allow you to increase your write throughput, to a point. Once your number of primary shards on a node exceeds the number of CPUs on that node, then write operations can start to be queued, leading to higher latency. If you’re finding that your application is unable to keep up with your users’ demands, then you may want to look at adding some additional nodes and sharding across them.
Elasticsearch also supports dedicated node roles, such as master-only and data-only nodes. This allows you to split a cluster into groups with distinct jobs: dedicated master nodes that manage cluster state, and data nodes that hold shards and handle indexing and search traffic. This way, the work is partitioned into groups that can be scaled independently to meet demand.
The trade-off to vertical and horizontal scaling of clusters is additional cost and complexity. Fortunately, this is an approach that most people will never need.
After all this discussion of how people do it wrong, let’s focus on how to do it right!
The best way to approach shard planning is to first consider the disk footprint of the primary data. If you have 10GB or less, then you’ll absolutely want to keep the index to 1 primary shard. There’s no value in splitting it up at that size.
If you have more than 10GB of primary data, but less than 50 GB, things are a little more fuzzy. There are a variety of things to take into consideration. Sharding does introduce a network tax, and there’s going to be an inflection point where this overhead is surpassed by the performance hit of a single large shard.
Where that inflection point occurs will vary widely. For example, if you have 3 c5d.large instances with 50 GB of NVMe SSD each, then letting a shard get near 50 GB of disk will result in a huge reduction in performance (as two of the three nodes will be nearly full, while the remaining node is mostly empty).
On the other hand, if your cluster is running on a few c5d.9xlarge instances, then a 50 GB shard is going to be no problem in terms of disk. But it still may be slower for Elasticsearch to query a single ~50 GB shard, than it is to query a few ~16 GB shards with network tax. (Want more clarity on where this inflection point lies for your app? Ask us!)
If you have more than 50 GB of primary data, you’re almost certainly going to want multiple shards. You will want to create your index with enough primary shards such that:
- Each primary shard is 10 - 50 GB
- The primary shards can be evenly distributed across the cluster
If the total number of shards for an index (primaries plus replicas) is not evenly divisible by the number of data nodes, that is a sign that your cluster is imbalanced.
When planning the number of primary shards, it is important to realize that this setting is difficult to change later and will require a full reindex. So carefully consider what you want to optimize for. Do you need fast indexing, fast search, or both? (We’ll talk about those topics in a later post.)
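The two criteria above, together with the 10GB rule of thumb for small indices, can be turned into a rough Python sketch. The function `candidate_primary_counts` is a hypothetical helper rather than an Elasticsearch feature, and it treats "evenly distributed" as meaning that primaries plus replicas divide evenly by the number of data nodes.

```python
import math


def candidate_primary_counts(primary_data_gb: float, data_nodes: int,
                             replicas: int = 1,
                             min_shard_gb: float = 10,
                             max_shard_gb: float = 50) -> list[int]:
    """List primary shard counts that satisfy both heuristics.

    - each primary shard ends up between min_shard_gb and max_shard_gb
    - total shards (primaries + replicas) divide evenly across data nodes
    Indices at or below min_shard_gb simply get one primary shard.
    """
    if data_nodes < 1 or replicas < 0:
        raise ValueError("need >= 1 data node and >= 0 replicas")
    if primary_data_gb <= min_shard_gb:
        return [1]
    fewest = math.ceil(primary_data_gb / max_shard_gb)   # keeps shards <= max
    most = math.floor(primary_data_gb / min_shard_gb)    # keeps shards >= min
    return [
        p for p in range(fewest, most + 1)
        if (p * (1 + replicas)) % data_nodes == 0
    ]


print(candidate_primary_counts(120, data_nodes=3))  # [3, 6, 9, 12]
print(candidate_primary_counts(8, data_nodes=3))    # [1]
```

An empty list means no count satisfies both heuristics at once, in which case one of them has to give. The smallest candidate is usually the natural starting point, since fewer shards means less network overhead per query.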
Consider a cluster with more nodes (or an array of dedicated master nodes) if you need higher write throughput and possibly some data-only nodes if you need more search throughput.
The good news is that with Bonsai’s Evergreen Promise, we take a lot of guesswork out of the process for you. All of our clusters have 3 nodes at a minimum (even on our free plans!), and we default to a single primary shard and a single replica for every index created. If you don’t specify these settings when the index is created, then we’ll use these defaults. We’ve found them to be ideal for >90% of users. You can also take a peek at our post on the Ideal Elasticsearch Index, which has a lot of details on this as well.
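If you do want to set these values yourself, they go in the settings of the request body sent when the index is created. Below is a minimal Python sketch that only builds and prints that body. The `number_of_shards` and `number_of_replicas` settings are standard Elasticsearch index settings, while the helper name `index_settings` is ours, and sending the body to your cluster is left out on purpose.

```python
import json


def index_settings(primaries: int = 1, replicas: int = 1) -> dict:
    """Build a create-index request body with explicit shard settings.

    The defaults mirror the single primary / single replica layout
    described above.
    """
    if primaries < 1 or replicas < 0:
        raise ValueError("need >= 1 primary shard and >= 0 replicas")
    return {
        "settings": {
            "index": {
                "number_of_shards": primaries,
                "number_of_replicas": replicas,
            }
        }
    }


# Body for a create-index request (for example, PUT /<your-index-name>).
print(json.dumps(index_settings(), indent=2))
```

Remember that the replica count can be changed on a live index, but the primary shard count is fixed at creation time.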
Getting the shard math right is equal parts math, science, and solid planning. But it’s worth thinking about now to ensure your cluster operates at peak efficiency as you scale.
Does this post offer any epiphanies? Hate this advice? Love it? Let us know your thoughts on Twitter: @bonsaisearch.