Jan 1, 2020 • Nick Zadrozny • 5 min read
Getting data in and out of your search cluster seems straightforward on the surface. Data in, data out, right? Records are indexed into the cluster, then returned when they match a query. But there is a significant amount of nuance and esoterica that can easily frustrate a developer just trying to implement basic app search. Relatively small changes in code can have a serious impact on performance, specifically indexing speed.
In this post, we’re going to focus on the importance of indexing speed and touch on a few common gotchas.
Whenever I talk about speed (in terms of computing and networking), I always think about Adm. Grace Hopper’s famous lecture at MIT:
https://youtu.be/9eyFDBPk4Yw
Adm. Hopper uses physical lengths of wire to illustrate the difference between nanoseconds and milliseconds, and with it the “cost” of inefficient code.
A quick note before moving on: like most programmers, I’m lazy and prefer shorthand for words I say a lot. I’m going to talk a lot about milliseconds in the paragraphs to come. For brevity, from here on out I’ll refer to them as “millie” (singular) or “millies” (plural).
For our purposes, consider the “cost” of indexing 10 million documents. Suppose you can achieve an indexing speed of 5,000 documents per second. Reindexing all 10 million documents would take about 33 minutes. Now suppose we make a change that adds one extra millie to the time it takes to render each JSON document.
A developer working on a local machine testing out the change would never notice the material difference in speed that a single millie adds, especially when working with small batches of test data. But when the change goes into production, that single millie per document drops our indexing speed to about 833 documents per second. Now our indexing job takes 3 hours and 20 minutes!
Whoa. A single extra millie per document added almost three hours to the time required to reindex all the data. Consider the production impact of this change. Imagine needing to wait over three hours to deploy a mapping update. Or imagine an intern accidentally deleting the production index (as we’ve seen more than once!) and needing to rebuild it. What’s the business impact of over three hours of downtime?
All due to a single millie of overhead.
When you index data into Elasticsearch, your app is generally reading records from a database, translating them into JSON objects, and packaging those objects into a batch payload to be sent to Elasticsearch. Anything that increases the time the app spends rendering each document into JSON will increase your total indexing time.
Each millie added to that time will cost you an extra hour of reindexing time, per 3.6M documents in your corpus. In other words, if you have to index 3.6 million documents, adding a millie to the JSON rendering time will add 1 hour to your reindexing time. Adding two millies translates to an extra 2 hours of reindexing time. And so on.
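To make the arithmetic concrete, here’s the back-of-the-envelope math as a quick Ruby sketch:

docs = 10_000_000
rate = 5_000.0                  # documents per second

docs / rate                     # => 2000.0 seconds, about 33 minutes

per_doc = 1 / rate              # => 0.0002 seconds (0.2 millies) per document
docs * (per_doc + 0.001)        # => 12000.0 seconds, about 3 hours 20 minutes
1 / (per_doc + 0.001)           # => ~833 documents per second

# The rule of thumb: one millie spread across 3.6M documents is an hour.
3_600_000 * 0.001               # => 3600.0 seconds, exactly one hour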
The bottom line: the more computation and database loading it takes to produce a single JSON document, the longer indexing will take.
Suppose you have a Rails app with 3 models: User, Post and Comment. The models look like this:
class User < ApplicationRecord
  has_many :comments
  has_many :posts
end

class Comment < ApplicationRecord
  belongs_to :user
end

class Post < ApplicationRecord
  belongs_to :user
end
For the purposes of demonstration, I made just such an app and used the awesome Faker gem to generate 1,000 users, ~500K comments, and ~30K posts. Adding the elasticsearch-model gem to this app and indexing into a local Elasticsearch cluster took me 2 minutes. Not great, but not bad for about 530K records.
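Here’s roughly what that looked like; a minimal sketch, assuming the elasticsearch-model integration and its import helper:

class User < ApplicationRecord
  include Elasticsearch::Model  # adds .search, .import, and index helpers
  has_many :comments
  has_many :posts
end
# (Comment and Post get the same treatment.)

# In a Rails console, recreate the indices and bulk-import everything:
Benchmark.realtime do
  User.import force: true       # force: true drops and recreates the index
  Comment.import force: true
  Post.import force: true
end
# => ~120 seconds for about 530K records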
So our baseline is 2 minutes to index around 530K records.
Now, suppose that I want to have a global app search: when my users enter a query, I want it to search all models. No problem, easy breezy.
But after testing it out, I realize that three users have similar names. One is a prolific poster and commenter; the other two haven’t posted in over a year. But for some reason, Elasticsearch keeps putting the prolific poster near the bottom of the results and the inactive users at the top.
We can fix that by having Elasticsearch use post volume and recency to determine the sort order. Let’s add a couple of fields to our models that expose each user’s post count and most recent post. That way, Elasticsearch can use those fields for boosting and tie-breaking, ensuring our most active users land near the top of the results, as in the query sketch below.
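As an aside, here’s a rough sketch of what such a query could look like using function_score with field_value_factor; the field names, modifier, and sort here are illustrative choices, not the app’s final query:

# Illustrative only: boost by post_count, tie-break on last_post_at.
response = User.search(
  query: {
    function_score: {
      query: { match: { name: 'nick' } },
      field_value_factor: {
        field: 'post_count',
        modifier: 'log1p',  # dampen very large counts
        missing: 0          # users with no posts get no boost
      },
      boost_mode: 'sum'     # add the factor to the relevance score
    }
  },
  sort: ['_score', { last_post_at: 'desc' }]
)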
To do this, we override as_indexed_json, which controls the JSON document each record renders for Elasticsearch:
class User < ApplicationRecord
  include Elasticsearch::Model
  has_many :comments
  has_many :posts

  def as_indexed_json(options = {})
    as_json.merge({
      'post_count' => posts.count,
      'last_post_at' => posts.last&.published_at,       # safe-navigate: a user may have no posts
      'comment_count' => comments.count,
      'last_comment_at' => comments.last&.published_at, # likewise for comments
    })
  end
end
class Comment < ApplicationRecord
  include Elasticsearch::Model
  belongs_to :user

  def as_indexed_json(options = {})
    as_json.merge({
      'author_name' => user.name,
      'author_comment_count' => user.comments.count,
      'author_last_post_at' => user.posts.last&.published_at,       # the author may have no posts
      'author_last_comment_at' => user.comments.last&.published_at,
    })
  end
end
class Post < ApplicationRecord
  include Elasticsearch::Model
  belongs_to :user

  def as_indexed_json(options = {})
    as_json.merge({
      'author_name' => user.name,
      'author_comment_count' => user.comments.count,
      'author_last_post_at' => user.posts.last&.published_at,
      'author_last_comment_at' => user.comments.last&.published_at, # the author may have no comments
    })
  end
end
We’ve changed the mappings, so now we need to reindex everything. Let’s benchmark it and see how much this change costs:
Benchmark.realtime do
  User.import force: true
  Comment.import force: true
  Post.import force: true
end
=> 31174.95103677502 # 8 hours, 39 minutes, 35 seconds
Wow, that is a long time. We went from 2 minutes to over 8 and a half hours! Why? Because we now have to make multiple calls to our database per document, and these calls add nearly 59 millies per document, which kills our reindexing time.
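This is the classic N+1 query problem: serializing each Comment and Post now triggers a lookup for its user, plus several COUNT and “last record” queries. Eager-loading the association can trim the per-document user lookup, though it can’t help with the aggregate queries. A hedged sketch, assuming elasticsearch-model’s query: option on import:

# Illustrative mitigation only: eager-load the author so serializing each
# comment doesn't issue its own SELECT for the user. The counts and
# "last record" lookups still hit the database once per document.
Comment.import force: true, query: -> { includes(:user) }
Post.import force: true, query: -> { includes(:user) }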
If all we cared about was tie-breaking for user search, we could denormalize and store these metrics directly on the User model with a migration like this:
class AddFieldsToUser < ActiveRecord::Migration[7.0]
  def self.up
    add_column :users, :post_count, :integer
    add_column :users, :comment_count, :integer
    add_column :users, :last_post_at, :datetime
    add_column :users, :last_comment_at, :datetime

    # Make the model aware of the columns we just added
    User.reset_column_information

    # find_each processes users in batches instead of loading them all into memory
    User.find_each do |user|
      user.update(
        post_count: user.posts.count,
        comment_count: user.comments.count,
        last_post_at: user.posts.last&.published_at,       # a user may have no posts
        last_comment_at: user.comments.last&.published_at, # ...or no comments
      )
    end
  end

  def self.down
    remove_column :users, :post_count
    remove_column :users, :comment_count
    remove_column :users, :last_post_at
    remove_column :users, :last_comment_at
  end
end
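One aside: keeping these denormalized columns fresh is now our responsibility. For the counts specifically, Rails’ built-in counter_cache can do the bookkeeping automatically, assuming you name the columns to its convention (posts_count and comments_count rather than post_count and comment_count):

class Post < ApplicationRecord
  # Rails increments/decrements users.posts_count on create/destroy
  belongs_to :user, counter_cache: true
end

The last_post_at and last_comment_at timestamps would still need to be maintained by hand, for example in an after_create callback.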
We can then get rid of the as_indexed_json methods in our models:
class User < ApplicationRecord
  include Elasticsearch::Model
  has_many :comments
  has_many :posts
end

class Comment < ApplicationRecord
  include Elasticsearch::Model
  belongs_to :user
end

class Post < ApplicationRecord
  include Elasticsearch::Model
  belongs_to :user
end
Let’s see how this single migration speeds up indexing:
Benchmark.realtime do
  User.import force: true
  Comment.import force: true
  Post.import force: true
end
=> 122.32768263696926 # 2 minutes, 2 seconds
Wow! Huge improvement! We went from 16.8 records per second (taking over 8.5 hours to complete) to a blazing 4,312 records per second (about 2 minutes). That’s a 99.6% reduction in overall indexing time, and a more than 250-fold increase in throughput!
But at what cost?
We’ve sped up indexing significantly, but what did it take to get us there? Well, our User model is fatter: the users table now carries denormalized data that has to be kept in sync, and every indexed User document carries those extra fields too. And if we wanted to introduce similar tie-breaking for the other models, we’d end up with a significantly larger footprint and more duplicated data.
As you develop and refine your app’s search interface to better align with business objectives, you will inevitably stumble across these kinds of trade-offs. Requirements that seem simple on the surface can have a dramatic and surprising impact on performance and operational costs.
Regardless of how your models are designed, you want to implement a handful of different strategies to save as many millies as you can. Elastic even has documentation on this subject.
Most of Elastic’s server-side recommendations are already in place on Bonsai, but there is still plenty Bonsai users can do on the client side to save every last millie possible.
Additionally, I usually recommend developers use a queue of some kind for their bulk requests, and implement catch-and-retry with exponential backoff when indexing a bulk payload. It’s neither a cost increase nor a savings, and it’s a little more complex. But it’s also more stable and more resilient to concurrency issues.
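Here’s a minimal sketch of that pattern, assuming the elasticsearch-ruby client; the index name, batch size, backoff base, and the exact error classes you rescue will vary with your client version and transport:

require 'elasticsearch'

client = Elasticsearch::Client.new(url: ENV['ELASTICSEARCH_URL'])

# Send one bulk payload, retrying with exponential backoff on throttling
# or timeouts. max_attempts and the backoff schedule are illustrative.
def bulk_with_retry(client, payload, max_attempts: 5)
  attempts = 0
  begin
    attempts += 1
    client.bulk(body: payload)
  rescue Elasticsearch::Transport::Transport::Errors::TooManyRequests,
         Faraday::TimeoutError
    raise if attempts >= max_attempts
    sleep(2**attempts) # back off: 2s, 4s, 8s, 16s...
    retry
  end
end

# Work through records in batches instead of one giant request:
Comment.find_in_batches(batch_size: 500) do |batch|
  payload = batch.map do |comment|
    { index: { _index: 'comments', _id: comment.id, data: comment.as_json } }
  end
  bulk_with_retry(client, payload)
end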
Hopefully this post has been enlightening and given you some things to think about.
What strategies did you implement to see improvements in speed? Share your thoughts on Twitter: @bonsaisearch. If you have any questions, feel free to reach out at support@bonsai.io.
Schedule a free consultation to see how we can create a customized plan to meet your search needs.