Mar 20, 2023
The ability to partition and replicate shards throughout a cluster is particularly helpful in search. A search need tends to exist when there is too much data for users to reasonably scroll or browse through. Scaling search means shards, so it’s helpful that Elasticsearch and OpenSearch make that accessible.
Accessible doesn’t necessarily mean easy. I’ve worked closely with dozens of teams over the years who are searching at scale. Collaborating on the data partitioning and replication strategy is an important area of focus.
It’s a very common intuition that more scale must mean more shards. And the ease of creating more indices and more shards means that many use cases… do just that. It’s a common enough pattern that we see teams who create their indices with dozens or hundreds of primary shards.
So while it’s interesting to look at when and how to use more shards to achieve more scale, I’m going to examine the opposite. When there are too many shards it can be necessary to reduce them. This is an operation we’ve worked through twice with clients in recent weeks, so we’ll look at how we approached their planning and execution.
We’ve already alluded to one source of excessive shards: premature optimization. This can show up when a client chooses an arbitrarily large number of shards that are decoupled from the partitioning needs of the underlying index. We see this when there are several indices, with data sizes that span multiple orders of magnitude, all with the same number of shards and replicas. The larger indices may need more primary shards, while many of the smaller indices could be combined into fewer, often just one, primary shard.
This matters because the shards themselves are not free. Each carries some amount of overhead in carrying out the work of search. Moving from one primary shard to two allows a query to be split in half, with each half running in parallel. However, splitting that workload, waiting for the responses, and merging them adds overhead that works against the gains of the parallel work.
In computer science, we have Amdahl’s law, which describes a theoretical upper limit on the relative benefits of parallelization. Grappling with the implications of parallel work in a distributed system can be heady, but the short takeaway of this law is that even in the best cases, with workloads that are highly parallelizable, the benefit of additional parallelization diminishes rapidly.
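Amdahl’s law is easy to see with a few concrete numbers. Here is a minimal sketch, assuming a query whose work is 95% parallelizable across shards (the fraction is illustrative):

```python
def amdahl_speedup(parallel_fraction: float, n_shards: int) -> float:
    """Theoretical speedup when `parallel_fraction` of a query's work
    can be split evenly across `n_shards` shards."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_shards)

# Even at 95% parallelizable, speedup flattens out far below the shard count:
for shards in (1, 2, 4, 8, 16, 64):
    print(f"{shards:>2} shards -> {amdahl_speedup(0.95, shards):.2f}x")
```

The serial 5% caps the speedup at 20x no matter how many shards we add, and each doubling of shards buys less than the doubling before it.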
When we see a set of indices with an arbitrary fixed number of shards, we start to have… questions about whether they are appropriately sized for their workloads.
A hot spot is when one or more data nodes host more data (and, in turn, more shards) than the others. Hot spots are a problem because the work happening on a hot node will start to slow down. And because each query has to wait for every shard to respond, the slowest node slows down the rest of the work that’s happening. As this compounds, it can even create bottlenecks upstream in the system, such as exhausted thread and connection pools.
This one is a bit easier to conceptualize: imagine an index with 13 shards, running on a cluster with 12 data nodes. When evenly distributed, 11 data nodes will host a single shard, leaving a single data node to host two shards. Assuming each shard is generating an equal amount of work, that one node with two shards will be handling double the workload compared to the others.
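That arithmetic can be sketched in a few lines, assuming a naive round-robin allocation:

```python
def shards_per_node(n_shards: int, n_nodes: int) -> list[int]:
    """Shard count on each node under an even round-robin allocation."""
    counts = [0] * n_nodes
    for shard in range(n_shards):
        counts[shard % n_nodes] += 1
    return counts

counts = shards_per_node(13, 12)
print(counts)                     # [2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
print(max(counts) / min(counts))  # 2.0 -- the hot node does double the work
```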
There are a handful of heuristics floating around in various blog posts and documentation about the sizing and number of shards. We find in practice that there isn’t a clear formula for either.
However, we have a few useful principles:
With these principles in mind, figuring out how many shards an index needs is an art. Hopefully these principles get you closer to the answer, but you will still have to test to find the right number of shards for an index.
Now that you have a number in mind to shrink shards down to, let’s explore three useful approaches to reducing shards in Elasticsearch and OpenSearch.
The tradeoffs of these are notable because changing shards is an inherently costly operation. In the best case, we are transferring a notable portion of the contents of the index across the network. And because many shards tend to go hand in hand with large indices, just moving the data around the cluster will consume time and resources.
There are other constraints and complexities that come from the context of the work itself. In most cases, a re-sharding operation will need to be carried out alongside ongoing workloads which are adding or changing or removing documents from within the index. These overlaps with upstream business systems mean that re-sharding is not often an operation that can be performed ‘out of the box’ by the search engine, and that some capabilities will have to exist in the upstream systems which interface with it.
We start with the assumption that your search engine is not being used as a primary data store. That means there exists some capability in a different area of the application architecture to send documents and changes to the search engine and maintain synchronization. Often that will include the ability to perform a bulk load of all data.
I can’t overstate the usefulness of being able to perform a bulk load of all the documents! Not only does it come in handy for re-sharding, but it serves a core requirement of having a search engine in the first place. The users’ needs from search change over time — whether through new functions in the product, or through developing a better understanding and measurement of search’s ability to serve the user’s needs.
In nearly all search-focused use cases, evolution and improvement over time will require that the index be rebuilt, in order to add new values and signals to the index. New values can include changing business goals, seasonal updates, and featured inventory. New signals can include insights from how your end users change their search patterns or from their evolving interests.
Being good at doing this is a worthwhile capability that will pay dividends. Indeed, high-functioning search teams will maintain the ability to write to multiple different destinations in parallel, in order to facilitate A/B testing, which means they get the ability to iterate on optimal sharding schemes essentially for free.
The main downside of reindexing from source is that it tends to be the slowest process, with the most redundant resource consumption. Again, the existence of shards and a need to rebuild them implies a large workload at nontrivial scale. A reindex from source requires fetching data from the primary data store, processing and transforming the documents, batching them and sending them to the search engine, and finally transforming those documents into the structure of the inverted index.
As valuable as it is to be able to build an index from the primary data from scratch, that’s a lot to plan for, both for engineering capabilities and server capacity. I think it’s worth the investment, and we recommend these capabilities early and often to the teams we work with.
But it’s not always a practical approach when there is an urgent need to recompose a cluster to meet performance, availability, or cost requirements.
Given the very common need to rebuild an index, Elasticsearch introduced a Reindex API back in version 2.3, which is present in Elasticsearch and OpenSearch to this day.
The Reindex API is a fairly straightforward action that runs a series of scrolling searches, fetches the raw, original documents from the source index, and inserts them into a target index. For our purposes, the target index is simply created with fewer shards before the reindex runs.
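As a sketch (the index names and shard counts here are hypothetical), the target index is created first with the reduced shard count, and then a single reindex call copies the documents over:

```python
import json

# Create the target index with the reduced primary shard count up front;
# the Reindex API itself does not set shard counts.
create_target = {
    "settings": {"index": {"number_of_shards": 2, "number_of_replicas": 1}}
}  # PUT /products_v2

# Then copy every document from source to target.
reindex_body = {
    "source": {"index": "products_v1"},
    "dest": {"index": "products_v2"},
}  # POST /_reindex

print(json.dumps(reindex_body, indent=2))
```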
The main advantage of this tactic is that the capability already exists within the search engine itself. It’s not a component that needs to be created from scratch and maintained on the application side. For a team under a time and team constraint, this is a very pragmatic distinction.
Using the Reindex API does require that the index be configured to store each original document’s source. That’s the default behavior, and opting out of it is something we would only recommend for a use case that is dangerously close to running out of storage space. By the same token, you shouldn’t choose this approach if you can’t add more storage to the cluster, because the Reindex API will roughly double the storage the index consumes while the operation completes.
Besides the pragmatism of a capability that exists, there is also a benefit from reducing some of the potential upstream bottlenecks. A reindex in place means there is no additional load fetching from a primary data store, and any document transformation or enrichment will already be represented. Resource usage will be limited to that of the cluster itself, and transferring across its local network will likely be less constrained than running through several layers of a data pipeline.
The first drawback is the timing and synchronization. Running a reindex in place will generate a new index whose contents are in sync from the point in time at which the reindex started. Depending on how long it takes to scroll through the source and reindex all the documents, there may be a substantial gap by the time the first pass completes. This necessitates either a series of “delta” synchronizations, a maintenance window where updates are paused until the new index is brought into sync, or a combination of the two.
The second drawback is getting the reindex to fully synchronize. Synchronization depends on the size of the index, the timing of scrolling through the index, and the rate of incoming updates. If the cluster is sufficiently well-resourced, a dry run should be performed of an initial sync and a subsequent delta to measure the timing of each, and also determine the timing of a final maintenance window.
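One way to script those delta passes is to reuse the Reindex API with a filter, assuming the documents carry a last-modified timestamp (the `updated_at` field and index names here are hypothetical):

```python
# Timestamp recorded just before the initial full pass started.
initial_pass_started = "2023-03-20T00:00:00Z"

# Re-copy only documents modified since then. "conflicts": "proceed" lets
# the pass continue past version conflicts on documents already copied.
delta_body = {
    "conflicts": "proceed",
    "source": {
        "index": "products_v1",
        "query": {"range": {"updated_at": {"gte": initial_pass_started}}},
    },
    "dest": {"index": "products_v2"},
}  # POST /_reindex
```

Each delta pass records its own start time, so the next pass filters against a progressively smaller window until a final maintenance window closes the gap.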
If there are performance and resource concerns, the Reindex API also supports pulling from a “remote” source. The bulk of the resource impact is in processing the documents and building the index itself, so this is a helpful way to minimize the impact to the production workload. However, reindexing into a separate cluster does require that the application also have the ability to quickly reconfigure its connection to shift its workload to the other cluster at the appropriate time.
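A reindex-from-remote sketch, with hypothetical host and index names; note that the source host must also appear in the destination cluster’s `reindex.remote.whitelist` setting before this call will be accepted:

```python
# The destination cluster pulls documents from the source cluster over HTTP,
# keeping the heavy index-building work off the production cluster.
remote_reindex_body = {
    "source": {
        "remote": {"host": "https://source-cluster.example.com:9200"},
        "index": "products_v1",
    },
    "dest": {"index": "products_v2"},
}  # POST /_reindex  (issued against the destination cluster)
```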
It’s also worth noting that each pass at reindexing, whether the initial run or the subsequent “delta” passes, requires scanning the entire source index. And on the delta synchronization runs, there is either repeated work in reindexing, or coordination within Elasticsearch or OpenSearch to determine whether a document needs to be updated at all. In some large indices we find that this process can start quickly, but then taper off asymptotically. Dry runs are necessary to determine a true average rate of progress before making estimates about time to completion.
Our gold standard for re-sharding is backfilling from the primary data source and maintaining synchronization in parallel to multiple destinations. Relative to this standard, the Reindex API is a pragmatic tool that can be activated quickly without a lot of engineering work on the application side. However, there is still a nontrivial amount of resource utilization on the cluster, and the final synchronization needs coordination effort, while risking some discontinuity in search experience during the final traffic cutover.
The Shrink API deals with the main constraint of the Reindex API: speed. While the Reindex API offers the pragmatism of an off-the-shelf function, it still duplicates a lot of the work of reindexing from source. Documents must be fetched and the index rebuilt, which is computationally expensive.
By contrast, the Shrink API works directly with the binary files on disk, saving nearly all of the work involved in transforming the documents into that structure in the first place. It’s not without tradeoffs, but the reduction in overall work makes for a notable speedup in the operation.
First, an extremely brief primer on the mechanics involved. An Elasticsearch or OpenSearch shard is a Lucene index. That shard-index is divided into smaller segments, each of which is itself a fully self-contained structure. These segments are tracked in a single file that maintains pointers to those currently part of the Lucene index and responsible for serving work. Shrinking Elasticsearch shards means combining many Lucene indices into fewer, and combining Lucene indices means merging their segments.
Thus, shrinking down to a single-primary-shard Elasticsearch index has two steps:
Step 1: Relocate all primary shards onto the same server
Step 2: Create a new Lucene index whose segments file points to all of the combined Lucene segments, effectively merging many Lucene indices into one.
And if we’re shrinking to more than one final shard, we have a final step:
Step 3: From each merged copy, delete the documents that no longer belong in that particular shard.
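The steps above map onto a small number of API calls. A sketch with hypothetical index and node names, shrinking down to a single primary shard (the target shard count must be a factor of the source’s):

```python
# Step 1 -- PUT /products_v1/_settings
# Require every shard copy to live on one node, and block writes so the
# segment files stop changing.
prepare_settings = {
    "index.routing.allocation.require._name": "data-node-1",
    "index.blocks.write": True,
}

# Step 2 -- POST /products_v1/_shrink/products_v1_shrunk
# Create the shrunken index from the co-located shard copies.
shrink_body = {
    "settings": {
        "index.number_of_shards": 1,
        "index.number_of_replicas": 1,
        # Lift the allocation requirement on the new index so its shards
        # can spread back out across the cluster.
        "index.routing.allocation.require._name": None,
    }
}
```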
The main constraint to keep in mind here: all of the primary data must fit on the same node. This can be a tricky one to satisfy when we are reminded, again, that sharding exists as a response to large amounts of data. Shrinking shards in this manner will require a bit of capacity planning, as well as a bit of allocation work to get the shards in the right place.
Another key constraint is that this is an “all or nothing” operation. There is no feasible incremental update. In order to maintain synchronization, the entire operation has to be carried out during some kind of maintenance window, when updates to the index are suspended. This requirement reintroduces the need for some kind of capability to buffer or replay updates, which will tend to overlap with the first strategy of backfilling from the primary source data.
Fortunately, a shard shrink, by virtue of working on the binary files directly, is multiple orders of magnitude faster than reindexing. Once properly prepared, a shard shrink can take minutes where a reindex may take hours to complete. If disk capacity is not a constraint, it’s a highly pragmatic approach compared to the Reindex API.
Rock, meet paper, meet scissors. Each of these approaches has its own relative benefits and constraints. And by the time we arrive at the last, we’re back to evaluating tradeoffs against the first.
To be clear, we have a strong preference here for any change to shards: reindexing from source. In our experience, investing in the data pipeline correlates directly with the effectiveness and value of search, in particular the ability to backfill from the beginning and to send writes to multiple arbitrary destinations. These capabilities allow teams to change and add to the search features that are present in the index, as well as design a testing and change management regimen as search is iterated on in response to user engagement.
When the data pipeline is invested in (from the search capability perspective), the ability to experiment with sharding and partitioning and performance becomes essentially free. Especially if coupled with the capability to send a percentage of search traffic — live or dark — to test the results.
But where constraints prohibit a more mature approach, there are alternatives that allow for some forward progress. These have some of their own tradeoffs and involve a little more operational coordination around synchronization or cutovers.
At the end of the day, pragmatism will win out. The ideal approach is the one that generates forward progress. We’ve used each of these three strategies regularly with clients to help optimize their workload, whether reducing hot spots, or just reducing the costs of too many servers.