October 2013 Post-Mortem

Nick Zadrozny · November 11, 2013
12 minute read

Summary

On Tuesday, October 29, we experienced an outage on our primary Elasticsearch cluster in our US East region. Up to 20% of our customers in this cluster experienced a total downtime for 24 hours. The remaining 80% experienced several windows of downtime during which we performed emergency maintenance to the cluster. All customers have since been migrated to one of several replacement clusters.

This outage was the most severe in our four-year history, which we profoundly regret. The following post-mortem provides a summary of events, our analysis of them, and the actions we have taken and continue to take to prevent this from ever happening again.

Background

At the time of the outage, our main US East Elasticsearch shared cluster had been hosting over 10,000 shards for many thousands of indices. We fully expected this size to become a scalability challenge, and already had taken steps and released some back-end tools to help deploy and manage a set of smaller, isolated Elasticsearch clusters.

As part of this effort, we have relatively recently developed a new generation of our operational provisioning and configuration management scripts. This latest revision of these tools represent an accumulation of our last few years of experience managing production data services in AWS, and they have proven to be much leaner, more maintainable, and more efficient to operate.

These new tools had already seen fairly extensive usage and testing over the last few months. They are used for all of our staging and development clusters, as well as in production for all new customer dedicated clusters. We have also since migrated over our younger and smaller EU shared cluster without incident.

Our plan for the older US East cluster was to migrate it to use these new scripts, and then use the improved toolset to help partition and migrate its data to a set of replacements.

Monday, 28 Oct 2013

On Monday evening we had seen some signs of cluster load. The symptoms we observed were an increase in the average maximum response times of requests, as well as occasional timeouts when performing cluster state operations.

This triggered an alert internally to investigate. The engineer on call interpreted the symptoms as a result of the cluster being under-provisioned, and decided to add new servers. This is similar to previous incidents, where similar symptoms had been alleviated by adding new servers to the cluster.

Our newer servers joined the cluster successfully, and symptoms seemed to subside.

We had deployed the replacement servers using our newer generation operations tools, with the reasoning that it was sufficiently tested, and we could begin the migration by gradually replacing servers over time. So in this case, we spent some extra time monitoring the logs as a team following the deploy.

In this case, we began to see an unfamiliar error message in the logs. This message included a cause of TransportSerializationException[Failed to deserialize exception response from stream]. We investigated these messages for a few hours, testing and ruling out a handful of early hypotheses of its cause.

With only a few of these errors, and with no clear indication of the source, and with no other signs of a negative impact to the cluster, we decided to defer more in-depth analysis until the next morning.

Tuesday, 29 Oct 2013

Around 8:30am EST / 5:30am on Tuesday morning, we were alerted of continued degradation of cluster performance. The morning showed similar signs of errors that we had seen the previous evening.

Our engineer on call interpreted these symptoms as continued cluster load, and decided to add another large batch of extra servers to our cluster as temporary extra capacity. With the new servers coming online, we only saw cluster state operations sharply degrade, along with a corresponding increase in the RecoveryFailedException and TransportSerializationException errors in the logs.

At this time we realized that there was a more serious error happening than simple cluster load from customer traffic, and promptly paged the rest of the team to have all hands available to respond.

Our first priority was to immediately get an accurate assessment of the state of our cluster, so that we could logically search our way back toward identifying contributing factors.

The largest and most obvious source of recent change was the set of new servers, which had been provisioned using a newer version of our cluster provisioning and configuration management tools. Our first step in troubleshooting was to simply shut down Elasticsearch on all of the recently provisioned servers, in order to minimize any negative impact they may have been contributing.

Shutting down these servers caused a subset of indexes which had already been successfully relocated to the new servers to go offline. However, it did buy us some room to further troubleshoot the new servers without worrying about propagating a more serious error.

On a closer look, we immediately noticed one error. A handful of the new servers experienced a bug in their provisioning scripts which caused them to prepare their data volumes incorrectly. This left the Elasticsearch data directory to be set up on the wrong device, generating disk full errors on those nodes, causing corruption on a small number of shards that had attempted and failed to be relocated to those nodes. These nodes had their data volumes repaired in place, and the affected shards were later recovered from a replica or a recent data snapshot.

Continuing to troubleshoot, we returned to our earlier TransportSerializationException message.

At this time, most of our servers in the older cluster were running Java 1.7.0_17. Due to an oversight in our configuration management scripts, we hadn't "pinned" the patch version of Java. The newer servers had been built using a more recent minor version of Java, 1.7.0_45. Some more focused research eventually indicated this version mismatch as a problem.

The full description of that error is described in this GitHub issue for Elasticsearch: https://github.com/elasticsearch/elasticsearch/issues/3145

Once we learned about this problem, we updated our provisioning scripts to rollback and replace the more recent versions of Java to the already widely deployed version. Once completed, we would be able to bring these servers back online one by one to recover the shards that had previously been taken offline in their shutdown.

While starting these servers, we observed that the cluster was taking a much longer time to recover the missing shards than we would expect. We spent some time iterating on Elasticsearch recovery throttling settings, however the master node would occasionally become unresponsive, forcing us to perform an emergency full cluster restart for new settings to take effect.

Over the course of the afternoon, we continued to observe the cluster recovery process gradually slow down and hang as the cluster approached 80% online. We continued to develop, research and test a number of hypotheses to overcome this problem and recover the cluster to a green state.

Some of our hypotheses included tuning cluster recovery throttling settings, allocation rack-awareness settings, and possibly-corrupt transaction logs. Each of these attempts generally required a full cluster stop and restart, which was very time consuming, continuing through the afternoon and well into the night.

Around 2:30am PST / 5:30am EST on Wednesday, we still had roughly 80% indices online and no clear ETA on full recovery of the Elasticsearch cluster. At this point we decided to call for a short break to sleep and regroup.

Wednesday, 30 Oct 2013

At 6:30am PST / 9:30am EST we decide that it is time to deprecate the old cluster and abandon our attempts to recover it in place.

Within the next hour, we provisioned the first of a new set of replacement clusters, and deployed a handful of systems we had developed to help manage a set of partitioned, independent clusters. After a quick round of testing, we began allocating all new accounts on this replacement cluster, and updated all our open support cases about the availability of a quick fix to migrate their data.

Over the next few hours, we helped a large number of our customers still impacted by the outage migrate over to the newer cluster. By 7:30am PST we were "through the worst of it," with over 95% of indexes online. Throughout the rest of the morning, we provisioned another replacement cluster where we were able to manually migrate the data of anyone who was still offline and unable to reindex.

Thursday & Friday

Throughout the rest of the day on Wednesday, and into the rest of the week, we continued to migrate our customers off the older cluster and onto our newer architecture. We worked painstakingly in batches of decreasing apparent impact. Over the course of Wednesday and Thursday we developed our capability to mass-migrate customers in large batches, and quickly verify the success of those migrations.

Ultimately, we completed the last of our migrations on Friday afternoon, bringing our service back to 100% operational status for production accounts.

Today

All of our clusters are running well. We have sufficiently partitioned our original monolithic cluster to avoid the scalability issues we recently encountered. This is the culmination of a longer-term effort which we think will greatly serve our long-term scalability.

Our older cluster is still running, however at this point it is only serving our much older beta accounts. These accounts were left in place due to their dependence on legacy system compontents that were not migrated. Anyone still running a beta index has since been contacted directly with our plans to end-of-life support for beta accounts, and instructions for a smooth upgrade.

Elasticsearch Lessons Learned

We did a fair amount of digging into the source code for Elasticsearch itself while trying to recover our monolithic cluster in place, and learned a few things about how the ES master works.

The ES master runs in a single thread. In addition to performing recovery, this thread also makes adjustments to the resources alotted to running indices. This has some consequences for recovery and failover:

With a low value of node_concurrent_recoveries, recovery runs in N^2 time. Each loop iteration recovers a constant number of indices, but that loop takes time proportional to the number of running indices. With a high value of node_concurrent_recoveries, the garbage collector can have a hard time keeping up.

Also, at certain points in the loop, updates to cluster settings are not accepted immediately, and may timeout entirely. Updating the cluster settings via API is often impossible, and you should be prepared to patch config files and/or the binary itself to affect the cluster settings you need. Expect some failovers to require a full cluster restart.

For example, we were unable to set index.recovery.initial_shards via the api, so we were forced to deploy a patched build of Elasticsearch.

Handling of cluster state traffic has had a number of improvements in the 0.90 series, and we expect those improvements to continue in future versions of Elasticsearch. We're also looking forward to the cleaner backup and restore APIs being implemented in 1.0.

Until some of these improvements happen, we don't currently recommend running individual clusters with much more than 1,000 shards. Use cases which require a large number of shards should run many smaller clusters where possible. Clusters with many shards should run dedicated master-only nodes on servers with node_concurrent_recoveries set to at least 10% of the total number of shards. A large amount of RAM on the master will help relieve some pressure from the garbage collector.

Lessons Learned for Humans

Call for help when making changes to production under duress. Just because it looks like something you've seen before doesn't mean it is, and it's much easier to misinterpret the circumstances or commit errors in judgement when you're operating alone. Develop a culture of asking for help, because by definition it's impossible to tell in advance if you're about to make things worse. (Otherwise, you wouldn't be doing it, right?)

Businesses grow, and change happens. Embrace it, but always minimize the changes you make to a running production system. Even code that is well tested when running in isolation can cause problems if it introduces too many changes at once in a running system.

Just because you have a horizontally scalable system doesn't mean that you should run an arbitrarily large cluster. In many cases, it's beneficial to partition after an empirically or even arbitrarily determined size, both to isolate failure and to minimize time-to-recovery. This is something we have historically done well with our sister search hosting service, Websolr, and was the basis for what we've now implemented with Bonsai.

Conclusion

This outage was the most severe in our company's four years, and we are humbled by the experience. We are also humbled by the gracious response of our customers. We can only hope to make up to all of you with our continued efforts to learn and adapt improve our service on your behalf.

The silver lining from last month's outage is that we have since completed our long-intended migration forward onto a much more robust architecture. This has been working very well, and will serve as the foundation for many improvements to come.