Of Millies and Minutes

Getting data in and out of your search cluster is superficially a straightforward process. Data in, data out, right? Records are indexed into the cluster, and then returned when they match a query. But there is a significant amount of nuance and esoterica that can easily frustrate a developer just trying to implement a basic app search. Relatively small changes in code can have a serious impact on performance, specifically indexing speed.

In this post, we’re going to focus on the importance of indexing speed and touch on a few common gotchas.

Why Indexing Speed is Important

Whenever I talk about speed (in terms of computing and networking), I always think about Adm. Grace Hopper’s famous lecture at MIT:

https://youtu.be/9eyFDBPk4Yw

Adm. Hopper uses lengths of wire to show the difference between a nanosecond and a microsecond, and with it the “cost” of inefficient code.

A quick note before moving on: like most programmers, I’m lazy and prefer shorthand for words I say a lot. I’m going to talk a lot about milliseconds in the paragraphs to come. For brevity, from here on out I’ll refer to them as “millie” (singular) or “millies” (plural).

For our purposes, consider the “cost” of indexing 10 million documents. Suppose you can achieve an indexing speed of 5,000 documents per second. Reindexing all 10 million documents would take about 33 minutes. Now suppose we make a change to a mapping which adds one extra millie to render a JSON document.

A developer working on a local machine testing out the change would never notice the material difference in speed that a single millie adds, especially when working with small batches of test data. But when the change goes into production, that single millie per document drops our indexing speed to about 833 documents per second. Now our indexing job takes 3 hours and 20 minutes!

Whoa. A single millie increase per document added almost 3 hours to the time required to reindex all the data. Consider the production impact of this change. Imagine needing to wait more than 3 hours to deploy a mapping update. Or imagine an intern accidentally deletes the production index (as we’ve seen more than once!) and it needs to be rebuilt. What’s the business impact of more than 3 hours of downtime?

All due to a single millie of overhead.

When you index data into Elasticsearch, your app is generally reading records from a database, translating them into JSON objects, and packaging those objects into a batch payload to be sent to Elasticsearch. Anything that increases the time the app spends rendering each document into JSON will increase the time needed to index your document.
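To make the “batch payload” step a bit more concrete, here is a minimal sketch of how rendered documents could be packed into a Bulk API request body using only Ruby’s standard library. The index name, the sample records and the bulk_payload helper are all made up for illustration; in a real app your client library normally builds this body for you.

```ruby
require 'json'

# Hypothetical helper: turn an array of hashes into a Bulk API body.
# Each document becomes two newline-delimited JSON lines: an action line
# followed by the document source. The body ends with a trailing newline.
def bulk_payload(index_name, records)
  lines = records.flat_map do |record|
    [
      JSON.generate({ index: { _index: index_name, _id: record[:id] } }),
      JSON.generate(record)
    ]
  end
  lines.join("\n") + "\n"
end

# Made-up sample data standing in for rows read from the database.
users = [
  { id: 1, name: 'Ada' },
  { id: 2, name: 'Grace' }
]

payload = bulk_payload('users', users)
puts payload
```

Every millie spent producing one of those source lines is paid once per document, which is why the rendering step matters so much.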

Each millie added to that time will cost you an extra hour of reindexing time, per 3.6M documents in your corpus. In other words, if you have to index 3.6 million documents, adding a millie to the JSON rendering time will add 1 hour to your reindexing time. Adding two millies translates to an extra 2 hours of reindexing time. And so on.
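If you want to play with these numbers yourself, the arithmetic fits in a few lines of Ruby. This is only a sketch: reindex_seconds and format_duration are names I made up, and the model assumes documents are rendered one after another with nothing else in the way.

```ruby
# Estimate total reindexing time when every document costs a few extra
# milliseconds to render. Assumes purely sequential processing.
def reindex_seconds(doc_count, docs_per_second, extra_ms_per_doc = 0)
  base_ms_per_doc = 1000.0 / docs_per_second
  doc_count * (base_ms_per_doc + extra_ms_per_doc) / 1000.0
end

# Format a number of seconds as hours, minutes and seconds.
def format_duration(seconds)
  total = seconds.round
  format('%dh %02dm %02ds', total / 3600, (total % 3600) / 60, total % 60)
end

baseline = reindex_seconds(10_000_000, 5_000)
one_millie = reindex_seconds(10_000_000, 5_000, 1)

puts format_duration(baseline)   # 0h 33m 20s
puts format_duration(one_millie) # 3h 20m 00s
```

The same function reproduces the rule of thumb above: for 3.6 million documents, the difference between zero and one extra millie comes out to 3,600 seconds.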

The bottom line is that the more computation and the more records that must be loaded to produce a single JSON document, the longer indexing will take.

A Contrived Example

Suppose you have a Rails app with 3 models: User, Post and Comment. The models look like this:


class User < ApplicationRecord
  has_many :comments
  has_many :posts
end

class Comment < ApplicationRecord
  belongs_to :user
end

class Post < ApplicationRecord
  belongs_to :user
end

For the purposes of demonstration, I made just such an app and used the awesome Faker gem to generate 1000 users, ~500K comments and ~30K posts. Adding the elasticsearch gem to this app and indexing into a local Elasticsearch cluster took me 2 minutes. Not great, but not bad for about 530K records.

So our baseline is 2 minutes to index around 530K records.

Now, suppose that I want to have a global app search: when my users enter a query, I want it to search all models. No problem, easy breezy.

But after testing it out, I realize that several users have a similar name. One is a prolific poster and commentator; the other two haven’t posted in over a year. But for some reason, Elasticsearch keeps putting the prolific poster near the bottom of the results and the inactive users at the top.

We can fix that by having Elasticsearch use post volume and frequency to determine the sort order. Let’s add a couple of fields to our models that show post count and the most recent post. That way, we can have Elasticsearch use those fields for boosting and tie breaking, and ensure our most active users are near the top of the results.

To do this, we need to extend the default mappings:


class User < ApplicationRecord
  include Elasticsearch::Model
  has_many :comments
  has_many :posts

  def as_indexed_json(options = {})
    as_json.merge({
                    'post_count' => posts.count,
                    'last_post_at' => posts.last.published_at,
                    'comment_count' => comments.count,
                    'last_comment_at' => comments.last.published_at,
                  })
  end
end

class Comment < ApplicationRecord
  include Elasticsearch::Model
  belongs_to :user

  def as_indexed_json(options = {})
    as_json.merge({
                    'author_name' => user.name,
                    'author_comment_count' => user.comments.count,
                    'author_last_post_at' => user.posts.last.published_at,
                    'author_last_comment_at' => user.comments.last.published_at,
                  })
  end
end

class Post < ApplicationRecord
  include Elasticsearch::Model
  belongs_to :user

  def as_indexed_json(options = {})
    as_json.merge({
                    'author_name' => user.name,
                    'author_comment_count' => user.comments.count,
                    'author_last_post_at' => user.posts.last.published_at,
                    'author_last_comment_at' => user.comments.last.published_at,
                  })
  end
end
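For context, here is roughly how a user search could take advantage of the new fields once they are in the index. Treat it as a sketch rather than the query this app actually runs: the user_search_body helper, the choice of the log1p modifier and the sum boost mode are my assumptions.

```ruby
require 'json'

# Hypothetical helper: build a search body that boosts users by post_count
# and breaks ties on the most recent post.
def user_search_body(query_text)
  {
    query: {
      function_score: {
        query: { match: { name: query_text } },
        # Add log10(1 + post_count) to the text relevance score.
        # Documents without a post_count are treated as having 0 posts.
        field_value_factor: { field: 'post_count', modifier: 'log1p', missing: 0 },
        boost_mode: 'sum'
      }
    },
    # Highest score first; equal scores fall back to the newest post.
    sort: [
      '_score',
      { last_post_at: { order: 'desc' } }
    ]
  }
end

puts JSON.pretty_generate(user_search_body('smith'))
```

With elasticsearch-model, a hash like this can be passed to the model’s search method, for example User.search(user_search_body('smith')).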

We’ve changed the mappings, so now we need to reindex everything. Let’s benchmark it and see how much this change costs:

Benchmark.realtime do
  User.import force: true
  Comment.import force: true
  Post.import force: true
end

=> 31174.95103677502 # 8 hours, 39 minutes, 35 seconds

Wow, that is a long time. We went from 2 minutes to over 8 and a half hours! Why? Because we now have to make multiple calls to our database per document, and these calls add nearly 59 millies per document, which kills our reindexing time.
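A quick back-of-the-envelope check shows where those millies go. The sketch below assumes that each as_indexed_json call triggers about four extra queries (one per merged field) and uses the approximate record counts from earlier, so the results are estimates, not measurements.

```ruby
# Approximate inputs taken from the post; none of these are exact.
records = 1_000 + 500_000 + 30_000 # users + comments + posts
extra_queries_per_doc = 4          # assumption: one query per merged field
slow_run_seconds = 31_174.95
baseline_seconds = 120.0

# Time attributable to the new fields, spread across every document.
extra_seconds = slow_run_seconds - baseline_seconds
extra_ms_per_doc = extra_seconds * 1000 / records

# Total number of extra queries and their average cost.
total_extra_queries = records * extra_queries_per_doc
avg_ms_per_query = extra_ms_per_doc / extra_queries_per_doc

puts "extra millies per document: #{extra_ms_per_doc.round(1)}"
puts "extra queries overall:      #{total_extra_queries}"
puts "average millies per query:  #{avg_ms_per_query.round(1)}"
```

Under those assumptions, the reindex issues a little over two million extra queries, and each one only has to cost a handful of millies for the total to balloon.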

If all we cared about was tie-breaking for user search, we could store these metrics directly on the User model with a migration that looks like this:


class AddFieldsToUser < ActiveRecord::Migration[7.0]
  def up
    add_column :users, :post_count, :integer
    add_column :users, :comment_count, :integer
    add_column :users, :last_post_at, :datetime
    add_column :users, :last_comment_at, :datetime

    # Make sure the model sees the columns we just added.
    User.reset_column_information

    # Backfill in batches instead of loading every user into memory at once.
    User.find_each do |user|
      user.update(
        post_count: user.posts.count,
        comment_count: user.comments.count,
        # &. guards against users who have no posts or comments yet.
        last_post_at: user.posts.last&.published_at,
        last_comment_at: user.comments.last&.published_at
      )
    end
  end

  def down
    remove_column :users, :post_count
    remove_column :users, :comment_count
    remove_column :users, :last_post_at
    remove_column :users, :last_comment_at
  end
end

We can then get rid of the as_indexed_json methods in our models:

class User < ApplicationRecord
  include Elasticsearch::Model
  has_many :comments
  has_many :posts
end

class Comment < ApplicationRecord
  include Elasticsearch::Model
  belongs_to :user
end

class Post < ApplicationRecord
  include Elasticsearch::Model
  belongs_to :user
end

Let’s see how this single migration speeds up indexing:

Benchmark.realtime do
  User.import force: true
  Comment.import force: true
  Post.import force: true
end

=> 122.32768263696926 # 2 minutes, 2 seconds

Wow! Huge improvement! We went from 16.8 records per second (taking over 8.5 hours to complete) to a blazing 4,312 records per second (which took about 2 minutes). That’s a 99.6% reduction in overall indexing time, and roughly a 255x increase in throughput!
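The two benchmark timings are all you need to sanity-check the time reduction and the speedup. A small sketch, using only the numbers reported above:

```ruby
# Wall-clock times from the two Benchmark.realtime runs.
slow_seconds = 31_174.95103677502
fast_seconds = 122.32768263696926

# Share of the original indexing time that was eliminated.
time_reduction_pct = (1 - fast_seconds / slow_seconds) * 100

# How many times faster the second run was. With the same number of
# records in both runs, this is also the ratio of the two throughputs.
speedup_factor = slow_seconds / fast_seconds

puts "time reduction: #{time_reduction_pct.round(1)}%"
puts "speedup factor: #{speedup_factor.round}"
```

Note that the speedup is a ratio between the two runs, not a percentage.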

But at what cost?

What did the increase in indexing speed cost?

We’ve sped up indexing significantly, but what did it take to get us there? Well, our User model is fatter, which means the index itself now requires a bigger data footprint. And if we wanted to introduce similar tie-breaking for the other models, we’d end up with a significantly larger footprint and duplicated data.

As you develop and refine your app’s search interface to better align with business objectives, you will inevitably stumble across these kinds of trade-offs. Requirements that seem simple on the surface can have a dramatic and surprising impact on performance and operational costs.

Keeping Millie Costs Down

Regardless of how your models are designed, you want to implement a handful of different strategies to save as many millies as you can. Elastic even has documentation on this subject.

Most of Elastic’s server-side recommendations are already in place on Bonsai, but there are still several things Bonsai users should be doing to save every last millie possible:

Use the Bulk API. Indexing documents one at a time costs more millies than indexing the same number of documents in batches. You can also benchmark with different batch sizes to find what’s optimal for ingesting your data.
Use multiple threads/workers where possible. Making use of concurrency is generally a great way to save a ton of millies. But remember that Elasticsearch can only handle a finite number of write operations at once. There are a lot of variables affecting how much concurrency the cluster can tolerate, and you will likely need to perform some benchmarking to find the optimal number of workers.
Unset / increase the refresh interval. The refresh interval controls how often Elasticsearch writes a new segment and makes recent changes searchable. You want a small value like “1s” in production in order to get near real-time (NRT) search, but during a full reindex, frequent refreshes cost a lot of millies while providing no benefit.
Disable replicas during the initial indexing. When indexing documents, Elasticsearch needs to push out the data to replica shards. That costs millies unnecessarily during the initial ingest operation. It’s generally more efficient to push data into an unreplicated index, then turn on replication when indexing is complete.
Utilize HTTP Keep-Alive. Establishing and closing an HTTP connection (especially over TLS) costs a lot of millies, and keep-alive mitigates that by reusing a connection. Ruby developers, for example, should be using the typhoeus gem, which supports HTTP keep-alive, is faster than the default Net::HTTP library, and has support for connection pooling.
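The refresh interval and replica tips can be combined into a small wrapper around a reindex. What follows is a sketch under a few assumptions: with_bulk_load_settings and RecordingClient are names I invented, the “serving” values of 1s and one replica are placeholders, and the client is expected to respond to indices.put_settings the way the Ruby Elasticsearch client does. RecordingClient is just a stand-in so the example runs without a cluster.

```ruby
# Settings for the duration of a full reindex: no refreshes, no replicas.
def bulk_load_settings
  { index: { refresh_interval: '-1', number_of_replicas: 0 } }
end

# Settings to restore afterwards. Both defaults are assumptions; use
# whatever your index normally runs with.
def serving_settings(refresh_interval: '1s', replicas: 1)
  { index: { refresh_interval: refresh_interval, number_of_replicas: replicas } }
end

# Apply the bulk-load settings, run the block, and restore the serving
# settings even if the block raises.
def with_bulk_load_settings(client, index_name)
  client.indices.put_settings(index: index_name, body: bulk_load_settings)
  yield
ensure
  client.indices.put_settings(index: index_name, body: serving_settings)
end

# Stand-in for a real client so the sketch runs without a cluster.
# It only records the settings calls it receives.
class RecordingClient
  attr_reader :calls

  def initialize
    @calls = []
  end

  def indices
    self
  end

  def put_settings(index:, body:)
    @calls << [index, body]
  end
end

client = RecordingClient.new
with_bulk_load_settings(client, 'users') do
  # The bulk indexing work for an existing index would go here.
end
client.calls.each { |index, body| puts "#{index}: #{body}" }
```

One caveat: anything that deletes and recreates the index inside the block would discard the temporary settings, so this pattern fits best when you are filling an index that already exists.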

Additionally, I usually recommend developers use a queue of some kind for their bulk requests, and implement catch-and-retry with exponential backoff for indexing a bulk payload. It’s neither a cost increase nor a savings, and it’s a little more complex. But it’s also more stable and resistant against concurrency issues.
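Here is one way the queue, the workers and the retry could fit together, using nothing but Ruby’s standard library. It is a sketch, not a drop-in implementation: with_backoff, index_in_parallel and the sender callable are hypothetical names, and a real app would rescue only the errors it knows are transient rather than every StandardError.

```ruby
# Retry a block with exponential backoff. `sleeper` is injectable so the
# delays can be observed or skipped when experimenting.
def with_backoff(max_attempts: 5, base_delay: 0.5, sleeper: ->(s) { sleep(s) })
  attempt = 0
  begin
    attempt += 1
    yield
  rescue StandardError
    # Give up once the attempt budget is spent.
    raise if attempt >= max_attempts

    # Wait base_delay, then 2x, 4x, ... before trying again.
    sleeper.call(base_delay * (2**(attempt - 1)))
    retry
  end
end

# Drain a queue of bulk payloads with a small pool of worker threads.
# `sender` is a hypothetical callable that performs one bulk request
# and raises if the request fails.
def index_in_parallel(batches, sender, workers: 2, **backoff_opts)
  queue = Queue.new
  batches.each { |batch| queue << batch }
  queue.close # pop returns nil once the queue is empty

  threads = Array.new(workers) do
    Thread.new do
      while (batch = queue.pop)
        with_backoff(**backoff_opts) { sender.call(batch) }
      end
    end
  end
  threads.each(&:join)
end

# Demo with a fake sender that just records what it was given.
sent = Queue.new
index_in_parallel([%w[a b], %w[c d], %w[e f]], ->(batch) { sent << batch })
puts "batches sent: #{sent.size}"
```

The number of workers and the backoff parameters are exactly the kind of thing worth benchmarking against your own cluster.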

Wrapping Up

Hopefully this post has been enlightening and given you some things to think about.

What strategies did you implement to see improvements in speed? Share your thoughts on Twitter: @bonsaisearch. If you have any questions, feel free to reach out at [email protected].
