According to our status page, our cluster availability was approximately 99.99% in January 2013, due to a five-minute outage. This was a partial outage with limited impact, so that five-minute figure is slightly less, and slightly more, serious than it looks.
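For reference, the arithmetic behind that figure can be sketched as follows. This is a minimal illustration that assumes availability is measured against every minute of the calendar month; the function name is hypothetical, not part of any monitoring tool.

```python
import calendar


def availability_percent(year: int, month: int, downtime_minutes: float) -> float:
    """Return availability for one calendar month as a percentage."""
    # Assumption: the measurement window is the whole calendar month.
    days_in_month = calendar.monthrange(year, month)[1]
    total_minutes = days_in_month * 24 * 60
    return 100.0 * (1 - downtime_minutes / total_minutes)


# January 2013 has 31 days, or 44,640 minutes; five of them were down.
january = availability_percent(2013, 1, 5)
print(f"{january:.3f}%")  # prints 99.989%
```

Rounded to two decimal places, that is the 99.99% quoted above.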
Let me start from the beginning.
Our first indication of a problem came when an index failed to provision successfully, causing the index status to go red. This is what registered as an outage with Pingdom, and it marks the start of the five-minute outage in our uptime report for January.
We were paged right away, and within a few minutes had identified the affected index as the cause of the red cluster status. Recovering the cluster status was a matter of re-creating the index. We also contacted the affected customer to let them know of the issue that we discovered, and to ask for any feedback they could provide on their end.
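A minimal sketch of a check that turns a red status into a page, assuming the monitor reads the `status` field of an Elasticsearch cluster health response (`GET /_cluster/health`). The function name, the sample payload, and the rule "page only on red" are illustrative assumptions, not a description of the actual Pingdom configuration.

```python
import json


def should_page(health_json: str) -> bool:
    """Return True when a cluster health response reports a red status."""
    health = json.loads(health_json)
    # Elasticsearch reports "green", "yellow", or "red" in the "status" field.
    return health.get("status") == "red"


# Illustrative payload only; a real response carries more fields.
sample = '{"cluster_name": "example", "status": "red", "timed_out": false}'
print(should_page(sample))  # prints True
```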
Initial recovery, and continued investigation
Fixing the single affected index let us restore the cluster state back to green. However, our tests, and some tests from our helpful customer, showed that some index creation attempts were still failing randomly.
The random index creation failures seemed to occur in proportion to a single node's share of our cluster. Based on this, and some output from our logs, we identified a node that we thought was having issues connecting to the rest of the cluster.
After restarting this node, however, it was not able to join the cluster successfully. We promptly removed this node from our load balancing to further troubleshoot the problem. Around this time, some requests routed to this particular node would have returned a 404 error.
Bringing in the big guns
With one node offline, and index creation still failing intermittently, we enlisted the help of Elasticsearch creator Shay Banon to troubleshoot the situation.
Shay helped direct our attention to the relevant log output, and helped us interpret it, which ultimately let us identify an anomaly during cluster discovery. One of the nodes in the cluster was returning outdated information about the other nodes, causing attempts by a new node to join the cluster to fail.
After cross-referencing each node's picture of the rest of the cluster, we identified the node that was most likely to be the source of our cluster connection problems. Restarting this node went smoothly, and allowed us to bring our other node back online as well.
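The cross-referencing step can be sketched roughly as follows. This assumes each node's view of cluster membership has already been collected into a dictionary (for example, by asking each node directly for its own cluster state). The node names, the data, and the helper name are all hypothetical, and the majority rule is a simplification of what was a manual comparison.

```python
from collections import Counter


def find_suspect_nodes(views: dict[str, frozenset[str]]) -> list[str]:
    """Return nodes whose view of cluster membership differs from the majority view."""
    # Assumption: most nodes agree, so the most common view is treated as correct.
    # In a tie, Counter keeps the first view it encountered.
    majority_view, _ = Counter(views.values()).most_common(1)[0]
    return sorted(node for node, view in views.items() if view != majority_view)


# Hypothetical data: node-c still lists a stale member ("node-x").
views = {
    "node-a": frozenset({"node-a", "node-b", "node-c", "node-d"}),
    "node-b": frozenset({"node-a", "node-b", "node-c", "node-d"}),
    "node-c": frozenset({"node-a", "node-b", "node-c", "node-x"}),
    "node-d": frozenset({"node-a", "node-b", "node-c", "node-d"}),
}
print(find_suspect_nodes(views))  # prints ['node-c']
```

A node flagged this way is only a candidate: it tells you where to look in the logs, not that a restart is guaranteed to help.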
After a few minutes monitoring the cluster's recovery and rebalancing, we added all nodes back to load balancing.
The total impact
During this incident, index creation and deletion would have been intermittently unsuccessful for approximately 90 minutes. We were able to recover affected indexes once the cluster had been recovered.
Furthermore, some requests would have failed with a 404 error or similar when inadvertently routed to a node with an incomplete view of the rest of the cluster. These happened in one- or two-minute intervals for a total of less than ten minutes throughout our debugging.
Those issues aside, the rest of our cluster traffic should not have been impacted. Production indexes were replicated across multiple nodes, and recovered seamlessly from the loss of a single node. Staging and beta indexes were similarly running at an extra level of replication in preparation for a later cluster expansion, and were likewise unaffected.
Looking toward the future
Having soft-launched our Heroku Elasticsearch add-on to production this month, we're counting this as our first production outage. As such, we decided it would be worthwhile to write up a more comprehensive post-mortem on the event.
Going forward, we would like to maintain a habit of posting a monthly uptime summary, including the details of any similar outages, if they have not already been published.
We take uptime and availability very seriously. We recognize that our service is being used in production by hundreds of applications, whose owners are quite literally staking their business on the quality of our service. We're humbled and grateful for that trust, and will keep working hard to earn it.
Are you enamored with Elasticsearch? Do you enjoy some ops? Would you like to work more closely with customers? We're looking for another engineer to join our team. If you're a junior- to mid-level developer who wants to jump into the deep end of running an exciting infrastructure service, introduce yourself to Nick Zadrozny at firstname.lastname@example.org.