We recently found and fixed a small regression in our routing layer, which caused some complications with index creation over the past 24 hours. We felt the issue deserved a post-mortem writeup with more details.
On the evening of February 19th, we deployed an update to our routing layer to support basic multi-index routing syntax. However, there was a subtle bug in this update which affected index deletion, in turn causing complications with subsequent index creation.
First, some background.
Our architecture has an additional routing proxy layer that runs in front of Elasticsearch. Among other things, it is this layer's responsibility to handle load balancing, log traffic metrics, whitelist and pre-process certain APIs, enforce account limits, and authorize requests.
Toward this end, one major function of the proxy is to map the "logical" or "public" URL which identifies a specific index belonging to a specific account. This URL is transformed into a "private" URL for an index within the flat, global namespace of our per-region shared cluster. This mapping functions as a kind of whitelisting for request authorization, and also allows us to run a single more efficient cluster and pass on some beneficial economies of scale to our customers.