We sometimes hear from users concerned about their shard counts. They ask how they can instruct Elasticsearch to add multiple indices to a single shard. This is a surprisingly difficult question to answer in full because the terminology can confuse the issue quite a bit.
In Elasticsearch, every index maps to one or more shards, but the reverse is not true for some pretty sound technical reasons. However, changes to data mapping do afford users the ability to logically separate data on a single shard, thus producing the effect of multiple indices per shard for many use cases.
If that explanation sounds confusing, don’t worry. In this post, we’re going to lay out the issue and the solution as simply as possible.
Lucene is the Java-based information retrieval software that forms the backbone of Elasticsearch. Elasticsearch adds a few layers of abstraction on top of Lucene to provide features that are extremely useful to production applications: aggregations, analysis, distributed search, and more, all controlled through a well-documented and easy-to-use RESTful API. However, with these features comes a richer glossary of terms, with slightly more subtle definitions.
The biggest point of confusion for users is the word “index.” This word means different things, depending on the context. It can be used:
- As a noun in the context of Lucene
- As a noun in the context of Elasticsearch
- As a verb in the context of either Lucene or Elasticsearch
Catch that? A Lucene index and an Elasticsearch index are very different things, even though Elasticsearch is built on top of Lucene. In the context of Lucene, an index is basically a Lucene instance. In the context of Elasticsearch, an index is… well, read on.
I mentioned earlier that Elasticsearch offers a rich glossary, with terms like “shard,” “index,” and “type,” which are closely related but distinct concepts. So what distinguishes a shard from an index from a type in Elasticsearch? In the Elasticsearch lexicon:
- A shard is a single Lucene instance
- An index is a logical namespace that points to one or more shards
- A type is a piece of metadata used to logically distinguish between documents of different types within an index
- A document in Elasticsearch is essentially a Lucene document with some injected metadata
To distill this all down:
An Elasticsearch shard is a Lucene index. An Elasticsearch index is a collection of one or more Lucene indices. An Elasticsearch type is a bit of metadata injected into the document which allows users to organize different types of documents within a single collection of Lucene indices.
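The definitions above can be sketched as a toy data model. This is a minimal illustration, not how Elasticsearch is implemented: the `Shard` and `Index` classes, the `put` and `search` methods, and the rule for picking a shard are all hypothetical names and choices made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Shard:
    """Stand-in for one Lucene index (a single Lucene instance)."""
    docs: list = field(default_factory=list)


@dataclass
class Index:
    """An Elasticsearch-style index: a name that points at one or more shards."""
    name: str
    shards: list

    def put(self, doc_type, doc_id, fields):
        # The type is just metadata stored alongside the user's fields.
        doc = {"_index": self.name, "_type": doc_type, "_id": doc_id, "_source": fields}
        # Hypothetical placement rule for this toy: choose a shard from the id.
        shard = self.shards[hash(doc_id) % len(self.shards)]
        shard.docs.append(doc)
        return doc

    def search(self, doc_type=None):
        # Look in every shard, optionally keeping only documents of one type.
        hits = [d for s in self.shards for d in s.docs]
        if doc_type is not None:
            hits = [d for d in hits if d["_type"] == doc_type]
        return hits


# One index, one shard, two types living side by side.
accounts = Index("accounts", [Shard()])
accounts.put("administrator", 1, {"name": "Ada"})
accounts.put("customer", 1, {"name": "Cy"})
customers = accounts.search("customer")
```

Both documents end up in the same shard, and the type only matters when a search asks to filter on it.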
Pedantry! Just tell me what to do!!
Let’s say I have an application and I want to be able to search all of my users. My application has three kinds of users: administrators, moderators and customers. There are a couple of ways I could organize my data. I could create an Elasticsearch index for each type:
PUT localhost:9200/administrators/administrator/1 -d <some fields>
PUT localhost:9200/moderators/moderator/1 -d <some fields>
PUT localhost:9200/customers/customer/1 -d <some fields>
# Search for customers:
GET localhost:9200/customers/_search
In this configuration, I have three indices, one for each type of user. Elasticsearch therefore requires a minimum of three shards, one for each index. This is kind of a waste, because each index only contains documents of a single type, and Elasticsearch offers the ability to have multiple types in an index.
If I want to use fewer shards, I could instead create a single index and use types to distinguish between different users. Maybe my index could be called accounts, and my types would be the same as my user types:
PUT localhost:9200/accounts/administrator/1 -d <some fields>
PUT localhost:9200/accounts/moderator/1 -d <some fields>
PUT localhost:9200/accounts/customer/1 -d <some fields>
# Search for customers:
GET localhost:9200/accounts/customer/_search
In the second configuration, I have one index containing all three types of users. Elasticsearch therefore only requires a minimum of one shard, and uses the types to scope search results to a particular class of user. My goal of reducing the shard count has been achieved.
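To make the comparison concrete, here is a small sketch that counts the minimum number of shards each layout needs, using the rule from this post that every index needs at least one shard. The helper names `index_of` and `minimum_shards` are made up for this example, and the paths are the ones from the requests above.

```python
def index_of(path):
    """Return the index name from an '/index/type/id' style path."""
    return path.strip("/").split("/")[0]


def minimum_shards(paths):
    """Each distinct index needs at least one shard of its own."""
    return len({index_of(p) for p in paths})


one_index_per_type = [
    "/administrators/administrator/1",
    "/moderators/moderator/1",
    "/customers/customer/1",
]
one_index_with_types = [
    "/accounts/administrator/1",
    "/accounts/moderator/1",
    "/accounts/customer/1",
]

print(minimum_shards(one_index_per_type))    # 3
print(minimum_shards(one_index_with_types))  # 1
```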
In this post, we focused more on the basic problem and solution to a particular concern without really addressing the underlying question of why you can’t have multiple indices on a shard. The short answer is that Elasticsearch is magic, and you should just accept this as a reality.
A slightly longer, more technical answer is that Elasticsearch spins up fresh Lucene instances (shards) when an index is created, and handles the coordination of requests between those shards. Elasticsearch takes this approach to make sure that data is spread out for both performance and integrity reasons. That’s why it doesn’t make sense to have multiple indices per shard: asking for that is essentially asking for multiple Lucene instances inside a single Lucene instance, when the Lucene instance is the lowest possible worker unit. That’s also why types exist. Types are used to achieve the effect of multiple indices per shard, but the terminology can confuse the issue by masking what’s going on internally.
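The idea of fresh shards per index, with requests coordinated across them, can be sketched roughly as follows. This is a toy under stated assumptions, not Elasticsearch's actual code: `ToyCluster`, its methods, the `number_of_shards` parameter as used here, and the modulo placement rule are all hypothetical.

```python
class ToyCluster:
    """Toy coordinator that only mimics the shape of the idea."""

    def __init__(self):
        # index name -> list of shards; each shard is a plain list of documents
        self.indices = {}

    def create_index(self, name, number_of_shards=1):
        # Creating an index spins up fresh shards; none are shared with other indices.
        self.indices[name] = [[] for _ in range(number_of_shards)]

    def put(self, index, doc_type, doc_id, fields):
        shards = self.indices[index]
        # Hypothetical placement rule: spread documents across shards by integer id.
        shards[doc_id % len(shards)].append({"_type": doc_type, "_id": doc_id, **fields})

    def search(self, index, doc_type=None):
        # Fan the request out to every shard of the index, then merge the results.
        hits = []
        for shard in self.indices[index]:
            hits.extend(d for d in shard if doc_type in (None, d["_type"]))
        return hits


cluster = ToyCluster()
cluster.create_index("accounts", number_of_shards=2)
cluster.put("accounts", "customer", 1, {"name": "Cy"})
cluster.put("accounts", "customer", 2, {"name": "Di"})
cluster.put("accounts", "moderator", 3, {"name": "Mo"})
found = cluster.search("accounts", "customer")
```

The two customers land on different shards here, yet a single search on the index returns both, which is the coordination work the index layer does on top of the individual Lucene instances.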
Hopefully this has all helped to clarify the issue. If you have questions, concerns, hate mail or songs of praise, feel free to give us a shout!