Throughout the Bonsai documentation, we use a standard terminology for discussing Elasticsearch-related subjects. This terminology is described in the Elasticsearch documentation, but we’ll highlight and discuss some very common terms here.
A node is simply a running instance of Elasticsearch. While it’s possible to run several instances of Elasticsearch on the same hardware, the best practice is to limit each server to a single running instance. On Bonsai, a node is a virtual server running in the cloud, and Bonsai follows the practice of one Elasticsearch instance per server.
A cluster is a collection of nodes running Elasticsearch. From the Elasticsearch documentation: “A cluster consists of one or more nodes which share the same cluster name. Each cluster has a single master node which is chosen automatically by the cluster and which can be replaced if the current master node fails.”
In Elasticsearch parlance, the word “index” can be used as either a verb or a noun. This can be confusing for users new to Elasticsearch, especially those for whom English is not a first language. The intended meaning is usually clear from syntax and context.
From the Elasticsearch documentation: “An index is like a table in a relational database… [It] is a logical namespace which maps to one or more primary shards and can have zero or more replica shards.”
To add to this, an index is something of an abstraction, because the shards (discussed in the section on Shards) are the “real” search engines. Queries against an index are routed to its shards, each of which is actually a Lucene instance. Because this happens in the background, it can appear that the index is doing the actual work and that the shards are more akin to a Heroku dyno, or are simply there to add compute resources.
Because an Elasticsearch index is both abstract and opaque with regard to data storage and retrieval, new users sometimes miss the connection between indices and shards. A common support question is “How do I put more indices on my shard?” The answer is that an index is composed of shards, not the other way around.
In other words, an index can have many shards, but each shard can only belong to one index. This relationship precludes collocating multiple indices on a single shard.
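The one-to-many relationship can be pictured with a small Python sketch. The `Index` and `Shard` classes and the index names are hypothetical; they only model the relationship described above and are not part of Elasticsearch or any client library.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Shard:
    """Stands in for one Lucene instance; it records the single index that owns it."""
    index_name: str
    number: int


@dataclass
class Index:
    """A logical namespace that fans out to its own set of shards."""
    name: str
    primary_shards: int
    shards: list = field(init=False)

    def __post_init__(self):
        # Every shard is created for exactly one index.
        self.shards = [Shard(self.name, n) for n in range(self.primary_shards)]


products = Index("products", primary_shards=3)
users = Index("users", primary_shards=1)

# Map each shard to its owner: a shard names exactly one index, so none is shared.
owners = {shard: shard.index_name for idx in (products, users) for shard in idx.shards}
print(len(owners))
```

Because a shard carries the name of its one owning index, there is no way in this model to place two indices on the same shard, which mirrors the constraint above.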
As a verb, “to index” refers to the process of populating an Elasticsearch index (noun) with data. It is most commonly used as a transitive verb with the data as the direct object, rather than the index (noun) being populated. In other words, the process is performed on the data, so you would say “I need to index my data,” and not “I need to index my index.”
Indexing is a critical step in running a search engine. Without indexing your content, you will not be able to query it using Elasticsearch, or take advantage of any of the powerful search features Elasticsearch offers.
Elasticsearch does not have any mechanism for extracting the contents of your website or database on its own. Indexing is something that must be managed by the application and/or the Elasticsearch client (defined below). Refer to the documentation for your platform as necessary for details.
Elasticsearch is a search engine, not a content crawler. If your use case involves scanning domains or websites, you’ll need to use a crawler like Apache Nutch. In this case, the crawler would be responsible for scraping content and indexing it into your cluster.
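To make the application’s role concrete, here is a minimal sketch of preparing records for Elasticsearch’s `_bulk` endpoint, which accepts newline-delimited JSON: an action line followed by the document itself. The index name `posts`, the sample records, and the `build_bulk_body` helper are assumptions for illustration. In practice a client library usually builds this body for you.

```python
import json


def build_bulk_body(index_name, records):
    """Return a newline-delimited JSON body suitable for the _bulk endpoint."""
    lines = []
    for record in records:
        # Action line: which index and document ID to write to.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": record["id"]}}))
        # Source line: the document itself.
        lines.append(json.dumps(record["fields"]))
    # The bulk format requires the body to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical application data.
records = [
    {"id": "1", "fields": {"title": "Hello, World!", "author": "John Doe"}},
    {"id": "2", "fields": {"title": "Second post", "author": "Jane Doe"}},
]

body = build_bulk_body("posts", records)
print(body)
```

The resulting string would be sent by the application (or its client) as the body of a `POST /_bulk` request; nothing here contacts a cluster.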
From the Elasticsearch documentation: “Each document is stored in a single primary shard. When you index a document, it is indexed first on the primary shard, then on all replicas of the primary shard.”
To add to this, each shard is technically a standalone search engine. There are two types of shards: primary and replica. We have some documentation on what shards are and where they come from in “What are shards and replicas?”, but we’ll also elaborate on some of the differences here.
According to the Elasticsearch documentation: “Each document is stored in a single primary shard. When you index a document, it is indexed first on the primary shard, then on all replicas of the primary shard.”
Another way to think about primary shards is “the number of ways your data is split up.” One reason to have an index composed of X primary shards is that each shard will contain 1/X of your data. Queries can then be performed in parallel. This is useful and can definitely improve performance if you have millions of documents.
Elasticsearch has an entire article dedicated to the tradeoffs and limitations of a large number of primary shards. The really short version is that primary shards offer diminishing returns while the overhead increases sharply.
The takeaway is that you never want to use a large number of primary shards without some serious capacity planning and testing. If you are wondering how many primary shards you’ll need, you can check out The Ideal Elasticsearch Index (specifically the benchmarking section), or simply shoot us an email.
From one of our more popular blog entries: “Elasticsearch uses a naive hashing algorithm to route documents to a given primary shard. This design choice allows documents to be randomly distributed in a reproducible way. This avoids ‘hot spots’ that affect performance and overallocation. However, it has one major downside, which is that the number of primary shards can not be changed after an index has been created. Replicas can be added and removed at will, but the number of primary shards is basically written in stone.”
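The routing idea in that quote can be sketched as a hash of the document ID taken modulo the number of primary shards. This is a simplified illustration: `zlib.crc32` is a stand-in chosen because it is deterministic and in the standard library, not the hash function Elasticsearch uses, and the `route` helper and document IDs are hypothetical.

```python
import zlib


def route(doc_id, num_primary_shards):
    # Stand-in hash: crc32 is NOT Elasticsearch's hash function. It is used
    # here only because it gives the same result on every run.
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


doc_ids = [f"doc-{n}" for n in range(1000)]

# The same ID always routes to the same shard, so lookups are reproducible.
assert route("doc-42", 5) == route("doc-42", 5)

# If the shard count changed from 5 to 6, many IDs would map somewhere else,
# so previously indexed documents would no longer be found where expected.
moved = sum(1 for d in doc_ids if route(d, 5) != route(d, 6))
print(f"{moved} of {len(doc_ids)} documents would map to a different shard")
```

This is why the primary shard count is fixed at index creation: the modulus is part of how every document is located.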
The Elasticsearch definition for replica shards sums it up nicely:
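A replica is a copy of the primary shard, and has two purposes: to increase failover (a replica shard can be promoted to a primary shard if the primary fails), and to increase performance (get and search requests can be handled by primary or replica shards).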
Another way to think about replica shards is “the number of redundant copies of your data.” If your index has 1 primary shard and 2 replica shards, then you can think of the cluster as having 3 total copies of the data. If the primary shard is lost – for example, if the server running it dies, or there is a network partition – then the cluster can recover automatically by using one of the replicas.
If you’re running a production application, all of your indices should have a replication factor of at least 1. Otherwise, you’re exposed to data loss if anything unexpected happens.
In contrast to primary shards, which cannot be added or removed after the index is created, replicas can be added and removed at any time.
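The arithmetic and the settings change can be sketched as follows. The `index.number_of_replicas` setting is the standard Elasticsearch index setting for this; the helper function names are hypothetical, and the sketch only builds the request body rather than sending it.

```python
import json


def total_shard_copies(primary_shards, replicas):
    # Each primary shard has `replicas` copies in addition to itself.
    return primary_shards * (1 + replicas)


def replica_settings_body(replicas):
    # JSON body for an update-settings request (PUT /<index>/_settings).
    return json.dumps({"index": {"number_of_replicas": replicas}})


print(total_shard_copies(1, 2))   # 3
print(replica_settings_body(2))   # {"index": {"number_of_replicas": 2}}
```

There is no equivalent body for changing `number_of_shards` on a live index, which matches the contrast drawn above.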
From the Elasticsearch documentation: “A document is a JSON document which is stored in Elasticsearch. It is like a row in a table in a relational database.”
As referenced, an analogue for an Elasticsearch document would be a database record. It is a collection of related fields and values. For example, it might look something like this:
{
  "title" : "Hello, World!",
  "author" : "John Doe",
  "description" : "This is a JSON document!"
}
You may be thinking to yourself: “My data is sitting in Postgres/MySQL/whatever, which is most certainly not in a JSON format! How do I turn the contents of a DB table into something Elasticsearch can read?”
Normally this work is handled by a client (see below). Most users will never need to worry about translating database contents into Elasticsearch documents, because the client does it “behind the scenes,” so to speak.
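For readers curious about what that translation amounts to, here is a minimal sketch using an in-memory SQLite table. The `posts` table, its columns, and the `row_to_document` helper are assumptions for illustration; real clients and framework integrations do considerably more (type mapping, batching, callbacks).

```python
import json
import sqlite3

# Hypothetical table standing in for application data.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets rows be read by column name
conn.execute(
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT, author TEXT, description TEXT)"
)
conn.execute(
    "INSERT INTO posts (title, author, description) VALUES (?, ?, ?)",
    ("Hello, World!", "John Doe", "This is a JSON document!"),
)


def row_to_document(row):
    """Turn one database row into (document ID, JSON-serializable document)."""
    doc = dict(row)
    doc_id = doc.pop("id")  # the primary key becomes the document ID
    return doc_id, doc


rows = conn.execute("SELECT * FROM posts").fetchall()
documents = [row_to_document(r) for r in rows]

for doc_id, doc in documents:
    print(doc_id, json.dumps(doc))
```

Each row becomes one document: column names turn into field names, and the primary key is a natural choice for the document ID.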
Sometimes new users are confused about the term “document,” because their mental model (and possibly even the data they want to index) involves file formats like Word, PDF, Excel, RTF, PPT, and others. In Elasticsearch terminology, these formats are sometimes called “rich text documents,” and are completely different from Elasticsearch documents.
One reason this can be confusing is that Elasticsearch can index and search rich text documents. Some plugins – mapper-attachments and ingest-attachment (both supported on Bonsai) – use the Apache Tika toolkit to extract the contents of rich text documents and push them into Elasticsearch.
That said, if the data you’re trying to index resides in a rich text format, you can still index it into Elasticsearch. But when reading the documentation, keep in mind that “document” refers to the JSON representation of the content, and some variant of “rich text” refers to the file you’re trying to index.
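The distinction can be seen in a small sketch based on the ingest-attachment plugin: the rich text file is base64-encoded and placed inside an ordinary JSON document, and an ingest pipeline with an `attachment` processor extracts its text. The field names `data` and `filename`, the placeholder bytes, and the `file_to_document` helper are assumptions for illustration, and the exact options available depend on your Elasticsearch version.

```python
import base64
import json

# Pipeline definition using the ingest-attachment plugin's `attachment` processor.
# "data" is the (assumed) field that will hold the base64-encoded file.
pipeline = {
    "description": "Extract text from a base64-encoded file",
    "processors": [{"attachment": {"field": "data"}}],
}


def file_to_document(raw_bytes, filename):
    """Wrap a rich text file's bytes in a JSON-serializable Elasticsearch document."""
    return {
        "filename": filename,
        "data": base64.b64encode(raw_bytes).decode("ascii"),
    }


# Placeholder bytes; a real application would read an actual PDF or Word file.
doc = file_to_document(b"%PDF-1.4 placeholder bytes", "report.pdf")

print(json.dumps(pipeline))
print(json.dumps(doc))
```

Here `doc` is the Elasticsearch document, while `report.pdf` is the rich text file it carries.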
A client is software that sits between your application and your Elasticsearch cluster and facilitates communication between the two. At a minimum, it takes data from the application, translates it into something Elasticsearch can understand, and then pushes that data into the cluster.
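That minimum can be sketched in a few lines. `TinyClient` is a hypothetical class written only to illustrate the translate-and-push steps; it is not a real library. The transport is passed in as a function so the sketch runs without a cluster, the URL is a placeholder, and the `/_doc/` path assumes a reasonably recent Elasticsearch version.

```python
import json


class TinyClient:
    """Hypothetical, minimal client: translate app data and hand it to a transport."""

    def __init__(self, base_url, send):
        self.base_url = base_url.rstrip("/")
        self.send = send  # callable(method, url, body) supplied by the caller

    def index(self, index_name, doc_id, record):
        # Translate the application record into a JSON document...
        body = json.dumps(record)
        # ...and push it to the document endpoint for that index and ID.
        url = f"{self.base_url}/{index_name}/_doc/{doc_id}"
        return self.send("PUT", url, body)


# A fake transport that records requests instead of sending them over the network.
sent = []
client = TinyClient("https://example.invalid:9200", lambda m, u, b: sent.append((m, u, b)))
client.index("posts", 1, {"title": "Hello, World!"})
print(sent[0])
```
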
Most clients will also handle a lot of other Elasticsearch-related tasks, such as:
It is unlikely that you will need to create your own client. Elasticsearch maintains a list of language-specific clients that are well-documented and in wide use. Clients also exist for popular frameworks and content management systems, like:
In short, there is probably already an open source, well-documented client available.