Dec 18, 2024 • Best Practices
At Bonsai, we provide fully managed Elasticsearch and OpenSearch infrastructure across all cluster tiers, streamlining operations for customers with all sorts of managed and unmanaged search implementations and workloads, ranging from small deployments to tens of thousands of searches per minute.
We're proud to offer 7 nines of uptime and availability, and infrastructure isn't the only contributing factor to that story: managing indexes and shards is a hard problem. When one of our customers ran into trouble with their search workload, we jumped into action and worked together to identify and resolve the underlying problems. Today, we're sharing technical details on how to identify a potential shard performance issue, and how to create a solid sharding strategy, to help your cluster reach some additional availability nines.
Our customer was experiencing a degradation in search performance, hiccups in cluster communication, and inconsistent search results. The infrastructure layer, we knew, was performing well - these were intermittent issues in the index architecture and implementation.
Diving into metrics and logs, our engineers were able to quickly identify a likely source of our customer's woes: a search index with too few shards for its multi-terabyte document coverage, which meant that each individual shard grew to well over 1TB in size.
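If you want to check your own cluster for outsized shards, the _cat/shards API reports each shard's on-disk size. Below is a minimal sketch using Python's requests library; the endpoint, credentials, and size ceiling are placeholders to adjust for your own cluster (the remediation described later in this post targeted shards under 150GB).

import requests

CLUSTER_URL = "https://localhost:9200"  # placeholder endpoint; use your cluster URL
SIZE_CEILING_GB = 150                   # per-shard ceiling; adjust for your workload

# Ask the cat API for every shard's index, shard number, role, node, and store size (in GB).
resp = requests.get(
    f"{CLUSTER_URL}/_cat/shards",
    params={"h": "index,shard,prirep,node,store", "bytes": "gb"},
    # auth=("user", "password"),  # placeholder credentials, if your cluster requires them
)
resp.raise_for_status()

# Print any shard whose on-disk size exceeds the ceiling.
for line in resp.text.splitlines():
    parts = line.split()
    if len(parts) < 5:
        continue  # skip unassigned shards with missing columns
    index, shard, prirep, node, store = parts
    if float(store) > SIZE_CEILING_GB:
        print(f"{index} shard {shard} ({prirep}) on {node}: {store}gb")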
What is a shard?
Think of a crystal shattering into multiple shards: each contains some part of the whole. Shards can be distributed to individual nodes (that is, members or servers) of a search cluster, increasing both performance and durability, as now multiple nodes can service requests for information contained in any given shard.
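Shard and replica counts are fixed per index at creation time (the primary count can only be changed later by shrinking, splitting, or reindexing), so the distribution described above starts with the index settings. A minimal sketch, with a hypothetical index name and arbitrary counts:

import requests

CLUSTER_URL = "https://localhost:9200"  # placeholder endpoint

# Hypothetical index: 6 primary shards, each with one replica, so documents
# can be spread across multiple nodes and survive the loss of any one of them.
resp = requests.put(
    f"{CLUSTER_URL}/products",
    json={
        "settings": {
            "index": {
                "number_of_shards": 6,
                "number_of_replicas": 1,
            }
        }
    },
)
print(resp.json())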
When an index's shards grow too large, they begin to transition from a source of durability to a source of fragility, but understanding why requires some consideration of the underlying mechanics of Elasticsearch and OpenSearch.
"...resilience in the face of node failure, a key benefit of distributed systems, is eliminated when
shard
s grow too large."
Shards, which hold a search cluster's data, are backed up, shared between nodes, appended to, and read from. If these operations begin to slow down, any one of them can become a source of instability and, in the worst case, can lead to a sort of domino effect, bringing the entire cluster to a standstill.
Essentially, resilience in the face of node failure, a key benefit of distributed systems, is eliminated when shards grow too large.
Potential consequences for clusters with unmanageably large shards include:
Shards grow too large for any given node in the cluster to hold more than one of them.
If any node in the cluster fails, there are no other nodes which can operate as an immediately online backup, and so the cluster will be in recovery mode, unable to serve some search requests, until the data can be fully loaded from an out-of-cluster storage source (that is, a backup). In this case, our estimate was that the cluster would be in recovery mode for more than 3 hours.
If any node enters recovery mode, there is a high likelihood that all of the remaining nodes will be on the brink of being overwhelmed.
Thankfully, Elasticsearch and OpenSearch have a mechanism to address this situation, termed the flood stage. If any node enters the flood stage, it is unable to accept new writes of data, and will remain in read-only mode until it exits the flood stage (the settings behind this are shown in the sketch below).
While read-only mode is great for the overall resilience of write operations in the cluster, the nodes which are still available for writing will experience dramatically higher write workloads, causing further instability.
Additionally, if, as a result of node failure, other nodes are unable to recover data quickly, they are identified as "incomplete" replicas, and all read requests will be forced to be served from the data's primary shard. If a node holding a primary shard were to become unstable during this time, it could cause a catastrophic cluster failure, requiring recovery from a recent snapshot.
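For reference, the flood stage described above is driven by the cluster's disk watermark settings, and an index that crosses it receives an index.blocks.read_only_allow_delete block until disk pressure is relieved. Here is a minimal sketch of inspecting those watermarks and clearing the block afterwards; the endpoint and index name are placeholders:

import requests

CLUSTER_URL = "https://localhost:9200"  # placeholder endpoint

# Inspect the disk watermark settings that control the flood stage
# (the low/high/flood_stage thresholds typically default to 85%/90%/95%).
settings = requests.get(
    f"{CLUSTER_URL}/_cluster/settings",
    params={"include_defaults": "true", "flat_settings": "true"},
).json()
for key, value in settings.get("defaults", {}).items():
    if "watermark" in key:
        print(key, "=", value)

# Once disk pressure is relieved, remove the read-only block that the flood
# stage placed on the affected index ("my-index" is a placeholder name).
requests.put(
    f"{CLUSTER_URL}/my-index/_settings",
    json={"index.blocks.read_only_allow_delete": None},
)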
Increasing the memory and disk configuration for individual nodes can alleviate performance issues temporarily and buy your team time, at a potential 50-100% increase in cloud costs.
Operating a few large nodes is, on most clouds, much more expensive than operating many smaller instances. Unfortunately, as these new larger nodes allow the problematic shards to grow ever larger, the potential for catastrophic cluster failure only increases along with your cloud bill.
Our team was able to provide a roadmap to sustainability and increased performance for the customer's cluster. The most critical components of the overall solution included:
Increasing the index's shard count, such that each shard now holds less than 150GB of data (a roughly 90% reduction in per-shard size); see the reindexing sketch after this list.
Enabling zstd_no_dict compression on the target index (also shown in the sketch below).
Setting pre_filter_shard_size on search requests, which helps to prevent queries from reaching shards that are unlikely to hold relevant data; a query sketch follows this list.
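One way to carry out the first two steps is to create a new index with a higher primary shard count and the zstd_no_dict codec (available in recent OpenSearch releases), then reindex into it. This is a rough sketch under those assumptions; the index names and shard count are illustrative and should be sized so each shard stays under the 150GB target.

import requests

CLUSTER_URL = "https://localhost:9200"  # placeholder endpoint

# Create the target index with more primaries and zstd_no_dict compression.
# The codec is an OpenSearch feature; the shard count here is illustrative.
requests.put(
    f"{CLUSTER_URL}/search-index-v2",
    json={
        "settings": {
            "index": {
                "number_of_shards": 24,
                "number_of_replicas": 1,
                "codec": "zstd_no_dict",
            }
        }
    },
)

# Copy documents from the old index into the new one. Running the reindex
# asynchronously returns a task ID to poll instead of holding the connection open.
task = requests.post(
    f"{CLUSTER_URL}/_reindex",
    params={"wait_for_completion": "false"},
    json={
        "source": {"index": "search-index-v1"},
        "dest": {"index": "search-index-v2"},
    },
).json()
print("reindex task:", task.get("task"))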
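pre_filter_shard_size is a query-string parameter on the search API. Lowering it makes the cluster run its inexpensive pre-filter (can-match) phase even for requests that target only a few shards, so shards that cannot possibly contain matching documents, for example based on a date range, are skipped before the query fans out. A minimal sketch with hypothetical index and field names:

import requests

CLUSTER_URL = "https://localhost:9200"  # placeholder endpoint

# pre_filter_shard_size=1 asks the cluster to always run the cheap can-match
# phase, skipping shards whose data cannot satisfy the query (here, a date
# range on a hypothetical "created_at" field).
resp = requests.post(
    f"{CLUSTER_URL}/search-index-v2/_search",
    params={"pre_filter_shard_size": 1},
    json={"query": {"range": {"created_at": {"gte": "now-7d"}}}},
)
print(resp.json()["hits"]["total"])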
If your cluster is in a similar situation, you can expect similar performance outcomes from following the same resolution strategy.
Search is complex, and so are distributed systems. Going it alone often means asking some members of your team to begin the multi-year path toward expertise in multiple domains, or bringing in external talent at a premium.
The good news is: you don't have to go it alone. Find out how Bonsai can help you achieve your search goals, and contact our team.
Schedule a free consultation to see how we can create a customized plan to meet your search needs.