On Monday, March 5, 2013, we experienced a substantial outage of our production Elasticsearch cluster. We are profoundly sorry for the outage and its effect on our customers and the users and customers of their services.
At approximately 12:05pm PST (UTC-0800), we began receiving alerts of high load on one of our Elasticsearch nodes. We noted a high level of CPU usage by Elasticsearch on this node, coupled with higher-than-normal usage of system memory. Our logs are still unclear on the precise cause of the excess CPU usage, but we were able to rule out other common system causes of high load, such as high disk IO latency.
At this time we removed the node from our front end load balancers, redistributed its shards throughout the cluster, and shut down the node in an attempt to redistribute the load throughout the rest of the cluster. However, the node was not able to successfully reconnect to the cluster, and logs on other servers indicated that other nodes were having trouble connecting to the cluster as well.
At this time, around 12:40pm PST, we decided to perform a restart of the Elasticsearch cluster to temporarily suspend index traffic and allow nodes to reconnect to each other. After the restart, cluster health checks were reporting all but the first node to be online and connected. As indexes gradually reloaded into the cluster, we noticed some errors persisting, particularly with index provisioning.
Around 1:00pm PST we identified that the previous cluster restart had not been entirely successful. Rather, the cluster had partitioned itself into two clusters: one with half of the nodes present, and another with all but one of the remaining nodes. The last node was correctly failing to join either. This is a so-called "split-brain" state, wherein a cluster partitioning event may incorrectly lead one subset of the cluster to elect a new master.
Much of the normal index traffic would have been successful at this time, having been likely to route to the larger cluster subset. However, index deletion and creation were still failing, and the single node was still refusing to join the cluster, prompting our continued investigation.
By about 2:21pm PST, we had identified the split brain state of the cluster, as well as the precise discrepancies in the cluster state among all of our nodes. At this time, we decided to perform another full restart of the cluster. This time we waited briefly after stopping all nodes before starting them again, to allow cluster discovery to proceed without any potential issues.
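As an illustration of the kind of cross-node comparison described above, here is a minimal sketch that groups nodes by the master each one reports. It is not the tooling we used during the outage: the node names and the `reported_master` mapping are hypothetical, and how each node's view is collected (for example, from the `master_node` field of each node's cluster state) is left out.

```python
from collections import defaultdict


def group_by_master(reported_master):
    """Group nodes by the master each one reports.

    `reported_master` maps a node name to the master that node believes
    is elected, or None if the node has not joined any cluster.
    """
    groups = defaultdict(list)
    for node, master in sorted(reported_master.items()):
        groups[master].append(node)
    return dict(groups)


def is_split_brain(reported_master):
    """True when joined nodes disagree about which master is elected."""
    masters = {m for m in reported_master.values() if m is not None}
    return len(masters) > 1


# Hypothetical six-node cluster: two sub-clusters plus one unjoined node.
views = {
    "node1": None,
    "node2": "node2",
    "node3": "node2",
    "node4": "node2",
    "node5": "node5",
    "node6": "node5",
}

if __name__ == "__main__":
    print(is_split_brain(views))
    print(group_by_master(views))
```

A healthy cluster yields a single group keyed by one master; more than one distinct master among joined nodes is the split-brain signature.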
This restart was successful in reuniting a correct and complete cluster. All nodes connected and successfully elected a master node. Indexes were gradually recovered throughout the cluster. By 3:00pm PST, the cluster had recovered 98–99% of all indexes.
While troubleshooting the missing indexes, and cross-referencing them with accounts and user activity, we were continuing to receive reports from customers of failed attempts to recreate their indexes. These customers had generally attempted to self-troubleshoot by deleting and recreating their indexes, only to find that index deletion requests would fail with a 502 error and then apparently succeed a few minutes later, while index creation requests failed entirely.
Based on the logs, we saw that servers within the Elasticsearch cluster were rapidly and repeatedly trying to recover the 1% of unassigned indexes. These attempts, and their subsequent failures, were generating a storm of cluster state updates, which effectively crowded out normal traffic that changes the cluster state, such as index creation and deletion.
Ultimately, we were able to verify that the missing indexes had all been previously deprovisioned by customer activity, and were awaiting their final deletion. Removing these indexes on the filesystem prevented Elasticsearch from attempting to recover the indexes, quieting internal cluster state traffic, and restoring the ability for customers to delete and recreate indexes.
During this outage, we communicated in some detail with Shay from Elasticsearch, who graciously provided assistance troubleshooting and recovering our cluster. Shay was able to confirm that a few of the issues we faced in our outage have been fixed in the upcoming 0.90 release of Elasticsearch, and he made some recommendations for mitigating similar events in the future. We also provided excerpts of our logs for his more thorough analysis.
By 6:50pm PST our cluster returned to normal operation, reporting a green state, with normal load levels, CPU and memory usage across all nodes.
Total duration of this outage was approximately 20 minutes of hard downtime for an average index, due to cluster restarts. In addition, there were approximately six hours of degraded service, primarily in the form of failures to create and destroy indexes, as well as high latency and various intermittent 404, 500, 502 and 503 errors.
Lessons learned & fixes to implement
This outage identified a number of lessons for us to apply, from simple growing pains, to outright misconfigurations of our cluster. Following yesterday's outage we have a number of changes to implement to harden our cluster against similar future outages.
Isolate master-only routing nodes
It is a recommended practice for high-scale Elasticsearch clusters to use a set of dedicated nodes which do not store data, and whose only responsibilities are routing and master eligibility. This helps avoid the problem of splitting a cluster when a node is subjected to excessive load. It has long been our plan to evolve to this kind of cluster design, and we now see that our cluster has reached a size where this is necessary.
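To make the split between node roles concrete, here is a small sketch of the per-node settings involved. The `node.master` and `node.data` settings are Elasticsearch's own; the role names and helper functions are hypothetical and only illustrate the idea, not our actual provisioning code.

```python
def node_settings(role):
    """Return elasticsearch.yml-style settings for a hypothetical role name."""
    if role == "master":
        # Master-eligible routing node that holds no shard data.
        return {"node.master": True, "node.data": False}
    if role == "data":
        # Data node that is never eligible to be elected master.
        return {"node.master": False, "node.data": True}
    raise ValueError(f"unknown role: {role!r}")


def render_yaml(settings):
    """Render flat settings as simple `key: value` lines."""
    return "\n".join(
        f"{key}: {str(value).lower()}" for key, value in sorted(settings.items())
    )


if __name__ == "__main__":
    print(render_yaml(node_settings("master")))
```

Keeping master eligibility off the data nodes means a data node under heavy load cannot disrupt master election.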
Enforce more precise connection throttling
We believe the original root cause of the high system load was an excessively high number of expensive requests being allowed through to the cluster. We are in the process of making our connection throttling more sophisticated to better match our desired traffic levels.
In our experience, scaling effectively in the cloud means building up and planning cluster capacity from known quantities, including the number of connections allowed per customer and resource. We have a design which should help protect fair and reasonable usage within the cluster, which should go mostly unnoticed by the vast majority of normal usage.
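One way to cap connections per customer is a simple in-flight counter, sketched below. This is an assumption-laden illustration rather than our actual design: the `ConnectionLimiter` class, its method names, and the limit of two are all hypothetical.

```python
import threading
from collections import defaultdict


class ConnectionLimiter:
    """Caps concurrent in-flight requests per customer (hypothetical design)."""

    def __init__(self, max_per_customer):
        self.max_per_customer = max_per_customer
        self._in_flight = defaultdict(int)
        self._lock = threading.Lock()

    def try_acquire(self, customer_id):
        """Reserve a slot; return False if the customer is at its limit."""
        with self._lock:
            if self._in_flight[customer_id] >= self.max_per_customer:
                return False
            self._in_flight[customer_id] += 1
            return True

    def release(self, customer_id):
        """Free a slot once the request has finished."""
        with self._lock:
            if self._in_flight[customer_id] > 0:
                self._in_flight[customer_id] -= 1


if __name__ == "__main__":
    limiter = ConnectionLimiter(max_per_customer=2)
    print([limiter.try_acquire("customer-a") for _ in range(3)])
```

A request that fails to acquire a slot would be rejected at the front end before it reaches the cluster, so one customer's burst cannot consume capacity planned for others.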
Set minimum master nodes to a correct value
Our minimum master nodes setting was incorrectly set to half of our master-eligible nodes, rather than the recommended half plus one. This was likely a significant contributing factor to our cluster partitioning. We have already corrected this setting to prevent the possibility of future split brain scenarios.
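The arithmetic behind this setting (`discovery.zen.minimum_master_nodes`) is short enough to sketch. The function names and the six-node example below are hypothetical; the point is that a value of exactly half lets two disjoint groups each believe they have enough nodes to elect a master.

```python
def minimum_master_nodes(master_eligible):
    """Quorum size: strictly more than half of the master-eligible nodes."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


def can_split(master_eligible, minimum):
    """True if two disjoint groups of nodes could each reach `minimum`."""
    return 2 * minimum <= master_eligible


if __name__ == "__main__":
    nodes = 6  # hypothetical cluster size
    print(can_split(nodes, nodes // 2))                    # half
    print(can_split(nodes, minimum_master_nodes(nodes)))   # half plus one
```

With six master-eligible nodes, a minimum of three allows two groups of three to elect separate masters, while a minimum of four makes that impossible.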
More frequent purging of old index data to help improve cluster restart
When a customer deletes an index, we issue a "soft" delete which leaves the index in place within the cluster, until we periodically run a manually verified process to purge old data. This lets us recover large indexes from accidental deletion, which does happen from time to time. However, all of the extra index data simply served to slow down cluster restarts as old indexes were loaded alongside the new.
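As a rough sketch of how purge candidates could be selected more frequently, the function below lists soft-deleted indexes older than a retention window. The retention window, the index names, and the `purgeable` helper are assumptions for illustration; the manual verification step described above would still sit between this list and any actual deletion.

```python
from datetime import datetime, timedelta


def purgeable(soft_deleted_at, now, retention):
    """Return names of indexes soft-deleted at least `retention` ago.

    `soft_deleted_at` maps an index name to the time it was soft-deleted.
    """
    return sorted(
        name
        for name, deleted_at in soft_deleted_at.items()
        if now - deleted_at >= retention
    )


if __name__ == "__main__":
    now = datetime(2013, 3, 6, 12, 0)
    deleted = {
        "old-index": datetime(2013, 2, 1, 9, 0),
        "recent-index": datetime(2013, 3, 5, 9, 0),
    }
    print(purgeable(deleted, now, retention=timedelta(days=7)))
```

Recently deleted indexes stay recoverable, while older ones stop adding to the data the cluster has to load on restart.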
Sunsetting our beta indexes
In the same spirit as purging deleted index data more frequently, we will soon announce the sunsetting of our public beta indexes and ask customers to upgrade to one of our production accounts. Moving beta accounts to our newer production plans will allow us to remove some deprecated internal architecture, as well as more accurately plan our cluster capacity and provide for faster recoveries during server and cluster restarts.
We are continually grateful for all of our customers' patience and understanding. Thank you for your support! We take these outages very seriously. We have much higher standards for ourselves and our business, and we are committed to improving ourselves and serving you better in the future.