Our uptime in September, as reported by external polling, was 100%. But sometimes we like to take a stricter view for the sake of self-improvement, and for September, that number might more accurately be 99.98%. Here’s a story of how we think about availability in general.
We leverage a number of tools to measure our total system uptime, one of which is an external polling system. This mechanism exercises all of the layers in our technology stack, and most realistically emulates real customer traffic.
Over time, our system has become increasingly insulated and isolated on a per-customer basis. This is a good thing, because it ensures that the problems of one tenant don’t become the problems of many, a core concern of managing a multi-tenant cloud hosting service.
However, this starts to become something of a problem when the primary system of measuring uptime is external polling. External polling suffers from the fact that it is coarse-grained, and so will only detect outages that affect broad segments of customers for longer than its polling interval.
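To make the polling-interval limitation concrete, here is a minimal sketch. It assumes a single poller that fires at a fixed interval; the function name and parameters are hypothetical and do not describe our actual polling setup. It lists which polls, if any, land inside an outage window.

```python
import math

def polls_during_outage(outage_start_s, outage_duration_s,
                        poll_interval_s, first_poll_s=0.0):
    """Return timestamps (in seconds) of polls that fall inside the outage.

    Hypothetical model: polls fire at first_poll_s, first_poll_s + interval, ...
    and the outage covers [outage_start_s, outage_start_s + outage_duration_s).
    """
    outage_end_s = outage_start_s + outage_duration_s
    # Index of the first poll at or after the start of the outage.
    k = max(0, math.ceil((outage_start_s - first_poll_s) / poll_interval_s))
    hits = []
    t = first_poll_s + k * poll_interval_s
    while t < outage_end_s:
        hits.append(t)
        k += 1
        t = first_poll_s + k * poll_interval_s
    return hits

# A 40-second outage that falls between two 60-second polls goes unseen.
missed = polls_during_outage(outage_start_s=70, outage_duration_s=40, poll_interval_s=60)
# The same outage shifted 20 seconds earlier is caught by exactly one poll.
caught = polls_during_outage(outage_start_s=50, outage_duration_s=40, poll_interval_s=60)
```

Even when a poll does land inside the window, it only observes the endpoint it is checking, so an outage confined to a few tenants can still go undetected.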
Recognizing this, we have built much more fine-grained measurement and analytics tools for our own internal usage, as any good server operator must. Our systems can track response times and error rates for all traffic on a global, regional, and per-customer basis.
We refer to these analytics constantly to ensure that all of our customers’ clusters are healthy, and to help us identify where proactive maintenance is needed before a small issue becomes a problem.
The trick for uptime reporting, though, is simplifying these data into an answer for a deceptively simple question, “Is the system available?”
The coarseness of a failed external health check makes for a fairly strong signal of “no.” And so we definitely take advantage of these measurements, treating them as an upper bound on our availability for a given month.
As your measurements become more and more fine-grained, the judgement call of what constitutes “downtime” becomes a little more subjective. It starts with the underlying root cause. Is a “404 Not Found” response due to legitimate customer usage, or is a routing proxy working with an outdated routing table? Is a “504 Gateway Timeout” the result of performance problems, or an unusually large document hitting a timeout? Does temporary slowness count as downtime? Is the problem affecting a single tenant, a certain segment of tenants, or some percentage of all traffic in a region?
One way that we simplify this analysis is to be strict. Our current policy internally is that if a tenant is experiencing degraded performance or increased server-side errors for a period of time, then we count that tenant as down for that period of time. These are the kinds of incidents that get reported in our Status History, each of which receives an internal post-mortem for the sake of this kind of analysis.
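A strict policy like this could be sketched roughly as follows. The per-minute sample shape, the field names, and both thresholds are illustrative assumptions, not our real schema or alerting limits. The idea is that any minute in which a tenant shows degraded latency or elevated server-side errors counts as a full minute of downtime for that tenant.

```python
from dataclasses import dataclass

@dataclass
class MinuteSample:
    tenant: str
    minute: int             # minute offset within the reporting period
    p95_latency_ms: float   # hypothetical latency metric
    error_rate: float       # fraction of responses that were 5xx

# Hypothetical thresholds, chosen only for illustration.
LATENCY_LIMIT_MS = 1000.0
ERROR_RATE_LIMIT = 0.01

def down_minutes(samples):
    """Count, per tenant, the minutes that the strict policy treats as down."""
    totals = {}
    for s in samples:
        degraded = (s.p95_latency_ms > LATENCY_LIMIT_MS
                    or s.error_rate > ERROR_RATE_LIMIT)
        totals[s.tenant] = totals.get(s.tenant, 0) + (1 if degraded else 0)
    return totals

samples = [
    MinuteSample("tenant-a", 0, 120.0, 0.000),   # healthy
    MinuteSample("tenant-a", 1, 2400.0, 0.000),  # slow: counted as down
    MinuteSample("tenant-a", 2, 150.0, 0.050),   # 5% errors: counted as down
    MinuteSample("tenant-b", 0, 90.0, 0.000),    # healthy
]
totals = down_minutes(samples)
```

Note that the sketch makes no attempt to distinguish root causes; being strict means a degraded minute counts against us regardless of why it happened.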
For a hypothetical example, say that some regression causes service degradation for 10% of our customers for 15 minutes. That’s a very serious issue for a hosting provider such as ourselves, which we would want to accurately represent in our uptime reports. For a thirty-day month (720 hours), we would measure that as 10% × 0.25h / 720h ≈ 0.0035% downtime, or a lower bound of roughly 99.9965% availability for the month.
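The same arithmetic generalizes to any number of incidents. The sketch below assumes each incident can be summarized as a fraction of affected tenants and a duration in hours; the function name and that input shape are hypothetical.

```python
def strict_availability(incidents, hours_in_period):
    """Tenant-weighted availability for a reporting period.

    incidents: iterable of (fraction_of_tenants_affected, duration_hours).
    Each incident costs fraction * duration "tenant-hours" of downtime.
    """
    lost_hours = sum(fraction * hours for fraction, hours in incidents)
    return 1.0 - lost_hours / hours_in_period

# The hypothetical incident above: 10% of tenants degraded for 15 minutes
# in a thirty-day (720-hour) month.
availability = strict_availability([(0.10, 0.25)], hours_in_period=720)
print(f"{availability:.4%}")  # prints 99.9965%
```

Weighting by the fraction of affected tenants is what lets a small, short incident register at all; a coarse up-or-down check would likely record the month as 100%.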
It may not seem like much, but it adds up, and it helps us stay focused on providing the best hosted Elasticsearch service available.
So while our external polling might report 100% uptime for September, and while strictly speaking everything was available during the month, we have somewhat higher standards, and count this month as 99.98%.