
How We Built & Designed Operational Metrics

Allison Zadrozny · January 02, 2018
9 minute read

The newest feature to join the dashboard family is the one we’ve been most excited about: Bonsai’s Elasticsearch operational metrics, available now on your cluster dashboard! All accounts have access to cluster metrics, including those that provision our add-ons through Heroku or Manifold.

These performance metrics are fast, live-updating, and retain up to a month of request data. From platform to UI, they are a technical feat built entirely in-house with great care and attention to detail. You can toggle between local and UTC time for debugging with app logs, explore time ranges, and click and drag sections of time to focus your time window.

You can see your cluster’s metrics by navigating to the Metrics tab in your cluster dashboard.

Metrics are built for problem-solving

At its core, the metrics page gives you visibility into what is going in and out of your cluster. It reveals not just how users are interacting with your cluster, but how your app is interacting with it too.

The graphs on this page are visual answers, and what they tell you depends on the question you’re asking. A question like “Is it fast?” is answered by the performance heat map and percentiles. For “When are we under the most load during the week?”, zoom out the timeline and look at request count patterns.
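To make the percentile idea concrete, here is a minimal sketch of how p50, p95, and p99 can be derived from a list of request durations. It uses the nearest-rank method, which is an assumption for illustration and not necessarily how the dashboard computes its percentiles; the function names and sample durations are hypothetical.

```python
import math


def percentile(sorted_values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def latency_summary(durations_ms):
    """Summarize request durations (milliseconds) as p50, p95, and p99."""
    ordered = sorted(durations_ms)
    return {f"p{p}": percentile(ordered, p) for p in (50, 95, 99)}


# Hypothetical sample: mostly fast requests with two slow outliers.
durations_ms = [12, 15, 11, 14, 13, 250, 16, 12, 18, 900]
summary = latency_summary(durations_ms)
print(summary)
```

In this sample the median stays low while the upper percentiles are dominated by the outliers, which is why percentiles answer “Is it fast?” better than an average would.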

The most valuable puzzle these graphs unravel is the question of causation. Each graph retains the same timeline so that trends can be linked across multiple graphs. For example, an increase in request counts can reveal load handling problems, which may mean that you need more resources. An increase in queue time might be linked to higher write traffic, perhaps from an ill-timed indexing task during peak read times.

Hovering over a section of any graph reveals information overlays for all graphs.

Worth the wait

We’ve long wanted to empower Bonsai customers with visibility into their cluster, but it’s been difficult to find a solution that would scale affordably and remain performant. For many years our operations and support teams have used tools like Grafana and Kibana to manage, debug, and optimize clusters. But the databases that house our huge volume of request data are under limited load: only a small set of engineers have access to them, and during analysis they have the patience to wait for complicated aggregation requests to complete (sometimes up to 30s, depending on the time window length). There was no way we could scale these tools to the thousands of clusters we manage without spending a ton of money on hardware to keep up with our client base. Moreover, we doubted that those existing platforms would support a good UI experience (namely, a fast one). In fact, our first prototypes worked well if we only showed requests from the last hour. But every time we zoomed out, things slowed to a glacial pace. That was unacceptable.

In many ways, building metrics was a waiting game. Tooling is much better now than when we first launched Bonsai back in 2012. Our current infrastructure supports the high standards we had when we set out to ship this project.

The technical stuff (or, Make It Fast)

From the outset of this project, our team had three goals to measure success:

1. The metrics database must be operationally dependable, affordable, and fast.

The choice of platform has huge cost implications, because we need to store requests for 10k+ clusters and read metrics off of them quickly. That’s around 1B request insertions per day. The first metrics prototypes used Elasticsearch, but we ran into performance issues once we started running larger aggregations for bigger time slices. Our metrics engineer ultimately replaced Elasticsearch with Cassandra, which is extremely fast and consistent, and scales to long periods of time.
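For a rough sense of scale, the figures above can be turned into average rates. This is back-of-the-envelope arithmetic only: it assumes exactly 1B insertions per day and exactly 10,000 clusters (the lower bound of “10k+”), and it ignores the fact that real traffic is bursty and unevenly spread across clusters.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

insertions_per_day = 1_000_000_000  # "around 1B" from the text
clusters = 10_000                   # assumed lower bound of "10k+"

# Average sustained write rate across the whole platform.
avg_insertions_per_second = insertions_per_day / SECONDS_PER_DAY

# Average daily request volume for one cluster.
avg_insertions_per_cluster_per_day = insertions_per_day / clusters

print(f"{avg_insertions_per_second:,.0f} insertions per second on average")
print(f"{avg_insertions_per_cluster_per_day:,.0f} insertions per cluster per day on average")
```

Even as an average, that is more than eleven thousand writes every second before any read traffic from the dashboard is counted.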

2. The UI must be fast and responsive, and integrate easily with our current JavaScript environment.

Web users have high expectations for interface usability. All graphs need to load into the DOM quickly and respond to interaction smoothly. A lot of our UI is React-based, so we first tried building our graphs with open-source React libraries. During prototyping, we weren’t able to customize and simplify the axes and labels, so we developed a whole new React-based charting library in-house to give us all the tools we needed to keep our charts Tuftian. Keep on the lookout for more news from our team, as we’ll be open-sourcing this library soon.

3. The graphical design must be easy to read and relay information accurately.

Charts must be easy to read — the design should avoid ‘getting in the way’ of answering any questions people have when they come to the Metrics page. In order to build something we’d be proud to look at, we took some lessons from Tufte to keep our designs straightforward and readable.

Tufte as a baseline

If you’ve studied data visualization, you know the work of statistician Edward Tufte. Our charts’ design takes two directives from Tufte very seriously. First, never lie. Second, don’t add chart junk.

1. Never Lie

Every graph has a measurable lie factor. The biggest contributor is vertical space that leads to incorrect comparisons within the data set. If the y-axis doesn’t start at zero, the same data can look drastically different depending on where the baseline sits.

Lie factor can easily fly under the radar to an unsuspecting viewer. While most people are trained from a young age to read charts and graphs, they aren’t taught about designing them or how to choose the appropriate graph type given a data set. It’s easy to give a lot of absolute weight to graphs, because we’re taught that they’re a picture of some truth. But what truth are they telling? Some lies are easy to spot, like this price of crude oil graph:

A graph with a pretty bad lie factor.

Others are harder to determine, usually because the display is unconventional.

An example of how to calculate Lie factor. (Tufte)

This is an example of our request count graph, with a y-axis minimum of 0:

Request counts with a zero minimum.

Now, let’s take the same graph and push the minimum closer to the minimum of the data set (700):

Request counts with a non-zero minimum.

Forcing a non-zero minimum not only distorts how the total requests per minute compare over time, but also lies about the ratio of write to read requests.

We mitigate our lie factor by always starting at zero. This means that the ratio between two data points is neither (a) given undue significance or (b) visually downplayed.
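Tufte defines the lie factor as the size of the effect shown in the graphic divided by the size of the effect in the data, where the size of an effect is the relative change between two values. The sketch below applies that definition to a bar or area chart with a truncated y-axis. The request counts (800 and 1,000 per minute) are hypothetical, and the function names are ours for illustration.

```python
def effect_size(first, second):
    """Relative change from first to second, per Tufte's definition."""
    return (second - first) / first


def lie_factor(first, second, axis_min=0.0):
    """Lie factor of two plotted values when the y-axis starts at axis_min.

    The drawn height of each value is its distance above the axis minimum,
    so truncating the axis changes the effect the graphic shows.
    """
    if first <= axis_min or second <= axis_min:
        raise ValueError("both values must sit above the axis minimum")
    if first == second:
        raise ValueError("no effect in the data to compare against")
    shown = effect_size(first - axis_min, second - axis_min)
    actual = effect_size(first, second)
    return shown / actual


# Hypothetical request counts per minute at two points in time.
honest = lie_factor(800, 1000)                    # y-axis starts at zero
truncated = lie_factor(800, 1000, axis_min=700)   # y-axis starts at 700

print(honest, truncated)
```

With a zero baseline the graphic shows exactly the 25% increase that is in the data, so the lie factor is 1. With the axis starting at 700, the same increase is drawn as a 200% jump, for a lie factor of 8.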

2. Don’t Add Chart Junk

The ‘stuff’ around charts and graphs can become an eye-catching distraction from the thing that matters most: the data. Sometimes, it’s the work of an overeager designer looking to add more visual interest (something I’ve often been guilty of):

Excess chart junk in the form of pattern and color.

Other times, it’s simpler charting conventions: plain lines or grids that we assume come with the territory, but whose removal creates clarity and improves focus and readability.

Excess chart junk - the seemingly innocuous version.

In each of the Bonsai metrics graphs, the labels are discreet and the same grey color. We don’t use gridlines. The palette is a limited set that each graph inherits. Overlays convey focused information only on hover. When we do use color, it’s because it has a significant meaning that operations-focused engineers inherently understand. We chose to display and label 5xx requests as red in the Response Codes graph because, at least in the western world where we have the majority of customers, red equates to danger or problems. In the first design passes, using an info-centric color like blue for 5xx requests looked like a chart lie.

Our design is minimalistic: color is used sparingly and with purpose.
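As an illustration of color used with purpose, the sketch below groups HTTP status codes into classes and assigns each class a color from one small shared palette, reserving red for 5xx. The hex values and names are hypothetical and are not taken from the Bonsai dashboard.

```python
# Hypothetical palette: muted tones for routine traffic, red only for server errors.
PALETTE = {
    "2xx": "#4c78a8",
    "3xx": "#9ecae9",
    "4xx": "#f2cf5b",
    "5xx": "#d62728",  # red signals danger or problems
}
NEUTRAL_GREY = "#9e9e9e"


def status_class(status_code):
    """Bucket an HTTP status code into its class, e.g. 503 -> '5xx'."""
    if 100 <= status_code <= 599:
        return f"{status_code // 100}xx"
    return "other"


def color_for(status_code):
    """Look up the class color, falling back to grey for anything unlisted."""
    return PALETTE.get(status_class(status_code), NEUTRAL_GREY)


print(status_class(503), color_for(503))
print(status_class(200), color_for(200))
```

Keeping the mapping in a single place means every graph that shows response codes inherits the same meaning for each color.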

Bonus: Perceptually Uniform Color Spaces

Whilst deciding on colors, our metrics engineer stumbled upon an article about perceptually uniform color spaces.

Perceptually uniform color spaces “are human-friendly alternatives to color spaces such as sRGB, and they are incredibly helpful for designers working in code” (Madsen). When using a color space like sRGB, you cannot programmatically render colors that progress linearly. See this example of a 10 step transition from green to blue using sRGB:

Decidedly not a smooth transition. (Madsen)

The swatches in the first half are almost indistinguishable from each other, while the second half looks like a series of blue tints instead of a linear transition from green.

This happens because the default sRGB color space (and any color model built on it like HSV and HSL) is irregular, which means that even though the rectangles have evenly spaced hue values, the corresponding effect is not linear to the human eye.

(Madsen)
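The unevenness can be checked numerically. The sketch below builds a 10-step green-to-blue ramp with evenly spaced HSL hues and computes the relative luminance of each step using the standard sRGB formula. Relative luminance is only a rough proxy for perceived lightness, and the ramp parameters (hue 120° to 240°, full saturation, 50% lightness) are our assumptions about how such an example would be built, not Madsen’s exact values.

```python
import colorsys


def srgb_to_linear(channel):
    """Undo the sRGB transfer curve for one channel in the range 0..1."""
    if channel <= 0.04045:
        return channel / 12.92
    return ((channel + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb):
    """Relative luminance of an sRGB color, from 0 (black) to 1 (white)."""
    r, g, b = (srgb_to_linear(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def hue_ramp(start_deg, end_deg, steps):
    """Evenly spaced hues at full saturation and 50% lightness, as sRGB tuples."""
    ramp = []
    for i in range(steps):
        hue = start_deg + (end_deg - start_deg) * i / (steps - 1)
        # colorsys takes hue, lightness, saturation, each in the range 0..1.
        ramp.append(colorsys.hls_to_rgb(hue / 360, 0.5, 1.0))
    return ramp


ramp = hue_ramp(120, 240, 10)  # green to blue
luminances = [relative_luminance(color) for color in ramp]

first_half_spread = max(luminances[:5]) - min(luminances[:5])
second_half_spread = max(luminances[5:]) - min(luminances[5:])

for color, lum in zip(ramp, luminances):
    print([round(c, 2) for c in color], round(lum, 3))
```

The luminance barely moves across the first five steps and then falls steeply across the last five, which matches what the eye sees: a run of near-identical greens followed by rapidly darkening blues.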

If we instead step evenly through a perceptually uniform color space, we get something like this:

Perceptually uniform green to blue. (Madsen)

In short, using color spaces that are perceptually uniform means that the graphs are easier to look at, account for color blindness, and ensure that any bars or boxes have appropriate color contrast. Our performance heatmap uses the colormap npm package to generate its palette from the viridis colormap (an open-source perceptually uniform colormap from matplotlib, https://matplotlib.org/users/colormaps.html):

Performance heatmap

Perceptually uniform color is incredibly useful for any designer, and we really suggest reading more about it and watching this lecture. It makes all graphs better.

What’s Next

Metrics will continue to grow and become more powerful. Planned additions include additional timezone selection, date scrubbing, stateful links for collaboration, and request filters.

We’d love to hear any and all feedback, so let us know how you’re using your cluster metrics by dropping us a line at info@bonsai.io.

Happy searching, and happy holidays!

— ❤️ The Bonsai Product Team


Bibliography

Madsen, Rune. “Programming Design Systems.” Programming Design Systems, programmingdesignsystems.com.

Tufte, Edward Rolfe. The Visual Display of Quantitative Information. Graphics Press, 2001.