
Dec 18, 2024

Case Study: Big Shards Cause Big Problems

Best Practices


At Bonsai, we provide fully managed Elasticsearch and OpenSearch infrastructure across all cluster tiers, streamlining operations for customers with all kinds of search implementations, managed and unmanaged, and workloads ranging from small to tens of thousands of searches per minute.

We're proud to offer seven nines of uptime and availability - and infrastructure isn't the only contributing factor to that story: managing indexes and shards is a hard problem. When one of our customers ran into trouble with their search workload, we jumped into action and worked together to identify and resolve the underlying problems. Today, we're sharing technical details on how to identify a potential shard performance issue and how to build a solid sharding strategy, to help your cluster reach some additional availability nines.

The Challenge

Our customer was experiencing degraded search performance, hiccups in cluster communication, and inconsistent search results. We knew the infrastructure layer was performing well - these were intermittent issues rooted in the index architecture and implementation.

A cluster out of balance is at risk

Diving into metrics and logs, our engineers quickly identified a likely source of our customer's woes: a search index with too few shards for its multi-terabyte document volume, which meant that each individual shard had grown to well over 1TB in size.
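If you want to check whether your own cluster is drifting toward the same problem, the _cat/shards API reports each shard's on-disk size. Here's a minimal sketch using Python and the requests library; the cluster URL and credentials are placeholders for your own endpoint.

```python
import requests

# Placeholder endpoint; substitute your own cluster URL and credentials.
CLUSTER_URL = "https://user:password@my-cluster.example.com:443"

# _cat/shards lists every shard with its on-disk size; sorting by store size
# (descending) surfaces oversized shards immediately.
resp = requests.get(
    f"{CLUSTER_URL}/_cat/shards",
    params={
        "v": "true",                      # include column headers
        "h": "index,shard,prirep,store",  # index, shard number, primary/replica, size
        "s": "store:desc",                # largest shards first
        "bytes": "gb",                    # report sizes in gigabytes
    },
)
print(resp.text)
```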

Info

What is a shard?

Think of a crystal shattering into multiple shards: each contains some part of the whole. Shards can be distributed to individual nodes (that is, members or servers) of a search cluster, increasing both performance and durability, as now multiple nodes can service requests for information contained in any given shard.
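For context, the number of primary shards is fixed when an index is created, while the replica count can be adjusted later. A minimal sketch of creating an index with explicit shard and replica counts (the endpoint and the "products" index name are illustrative):

```python
import requests

CLUSTER_URL = "https://my-cluster.example.com:443"  # illustrative endpoint

# Create an index whose data is split across 3 primary shards, with one
# replica of each primary stored on a different node for durability.
resp = requests.put(
    f"{CLUSTER_URL}/products",  # "products" is an illustrative index name
    json={
        "settings": {
            "index": {
                "number_of_shards": 3,
                "number_of_replicas": 1,
            }
        }
    },
)
print(resp.json())
```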

When an index's shards grow too large, they begin to transition from a source of durability to a source of fragility, but understanding why requires some consideration of the underlying mechanics of Elasticsearch and OpenSearch.

"...resilience in the face of node failure, a key benefit of distributed systems, is eliminated when shard s grow too large."

Shards, which hold a search cluster's data, are backed up, shared between nodes, appended to, and read from. If these operations begin to slow down, any one of them can become a source of instability and, in the worst case, lead to a domino effect that brings the entire cluster to a standstill.

Essentially, resilience in the face of node failure, a key benefit of distributed systems, is eliminated when shards grow too large.

Potential consequences for clusters with unmanageably large shards include:

Individual shards are too large for any given node in the cluster to hold more than one

If any node in the cluster fails, there are no other nodes that can operate as an immediately online backup, and so the cluster will be in recovery mode - and unable to serve some search requests - until the data can be fully loaded from an out-of-cluster storage source (that is, a backup). In this case, our estimate was that the cluster would be in recovery mode for more than 3 hours.

Recovery mode triggers cascading, rolling node locking

If any node enters recovery mode, there is a high likelihood that the remaining nodes will be on the brink of being overwhelmed.

Thankfully, Elasticsearch and OpenSearch have a mechanism to address this situation: the flood stage disk watermark. If any node crosses the flood stage threshold, indexes with shards on that node stop accepting new writes and remain in read-only mode until the node exits the flood stage.
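The flood stage threshold is a cluster-level disk watermark setting, and the resulting write block is applied per index. As a rough sketch (the endpoint and index name are placeholders), you can inspect the watermark defaults and, once disk pressure is relieved, clear the read-only block:

```python
import requests

CLUSTER_URL = "https://my-cluster.example.com:443"  # placeholder endpoint

# List the disk watermark settings, including the flood stage threshold.
settings = requests.get(
    f"{CLUSTER_URL}/_cluster/settings",
    params={"include_defaults": "true", "flat_settings": "true"},
).json()
for key, value in settings.get("defaults", {}).items():
    if "watermark" in key:
        print(key, "=", value)

# After freeing disk space, the read-only block that the flood stage applies
# can be cleared on an affected index ("products" is illustrative).
requests.put(
    f"{CLUSTER_URL}/products/_settings",
    json={"index.blocks.read_only_allow_delete": None},  # null resets the block
)
```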

While read-only mode is great for the overall resilience of write operations in the cluster, the nodes that are still available for writing will experience dramatically higher write workloads, causing further instability.

Additionally, if, as a result of node failure, other nodes are unable to recover data quickly, their copies are identified as "incomplete" replicas, forcing all read requests to be served from the data's primary shard. If a node holding a primary shard were to become unstable during this time, it could cause a catastrophic cluster failure, requiring recovery from a recent snapshot.

Exponentially increasing cloud costs and increased future risk

Increasing the memory and disk configuration for individual nodes can alleviate performance issues temporarily - buying your team time - at a potential 50-100% increase in cloud costs.

On most clouds, operating a few large nodes is much more expensive than operating a greater number of smaller instances. Unfortunately, as these new larger nodes allow the problematic shards to grow ever larger, the potential for catastrophic cluster failure only increases along with your cloud bill.

Outlining solution options and offering assistance

Our team was able to provide a roadmap to sustainability and increased performance for the customer's cluster. The most critical components of the overall solution included:

  • Re-indexing the data across 18 primary shards, so that each shard holds less than 150GB of data (a roughly 90% reduction in per-shard size); see the sketch after this list.
  • Enabling zstd_no_dict compression on the target index.
  • Migrating to a six-node cluster, supporting better distribution of workloads.
  • Upgrading the cluster to OpenSearch 2.17.
  • Enabling some newer OpenSearch optimizations, like pre_filter_shard_size, which helps prevent queries from reaching shards that are unlikely to hold relevant data.
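Here is a minimal sketch of how these steps might look against a cluster's REST API, again using Python and requests; the endpoint, index names, and the timestamp field are all illustrative, and the exact settings should be tuned to your own data:

```python
import requests

CLUSTER_URL = "https://my-cluster.example.com:443"   # illustrative endpoint
SOURCE, TARGET = "events", "events_v2"               # illustrative index names

# 1. Create the target index with 18 primaries and the zstd_no_dict codec
#    (index.codec is a static setting, so it is applied at creation time).
requests.put(
    f"{CLUSTER_URL}/{TARGET}",
    json={
        "settings": {
            "index": {
                "number_of_shards": 18,
                "number_of_replicas": 1,
                "codec": "zstd_no_dict",
            }
        }
    },
)

# 2. Copy documents into the new index; wait_for_completion=false returns a
#    task ID so a multi-terabyte reindex can run in the background.
task = requests.post(
    f"{CLUSTER_URL}/_reindex",
    params={"wait_for_completion": "false"},
    json={"source": {"index": SOURCE}, "dest": {"index": TARGET}},
).json()
print("reindex task:", task.get("task"))

# 3. pre_filter_shard_size lowers the threshold at which the coordinating node
#    pre-filters shards that cannot match the query (based on shard metadata
#    such as field ranges), so irrelevant shards are skipped entirely.
resp = requests.post(
    f"{CLUSTER_URL}/{TARGET}/_search",
    params={"pre_filter_shard_size": 1},
    json={"query": {"range": {"timestamp": {"gte": "now-1d"}}}},
)
print(resp.json().get("hits", {}).get("total"))
```

Once the reindex completes and the new index is verified, an index alias can be swapped to point traffic at the new index without application changes.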

If your cluster is in a similar situation, here are the sorts of performance outcomes you can expect from following the same resolution strategy:

  • Index-specific configuration changes resulting in an immediate 15% reduction in cloud disk usage.
  • Reduced storage and memory requirements due to index compression, resulting in 30% less cloud disk usage.
  • Improved node-failure resilience, with an estimated recovery time over 600% faster.
  • Elimination of read-only downtime risk: under recovery circumstances, the cluster would no longer be at risk of catastrophically entering a read-only state.
  • Dramatically reduced query times.

Search is complex, and so are distributed systems. Going it alone often means asking some members of your team to begin the multi-year path toward expertise in multiple domains, or bringing in external talent at a premium.

The good news is: you don't have to go it alone. Find out how Bonsai can help you achieve your search goals, and contact our team.

Find out how we can help you.

Schedule a free consultation to see how we can create a customized plan to meet your search needs.