Mar 20, 2023
Leo Shue Schuster
Elasticsearch is a distributed, RESTful search and analytics engine that is used to query, index, and store JSON data.
Distributed means Elasticsearch can scale horizontally: as your data grows, you add nodes and the cluster spreads the data across them. It's also fault-tolerant. If a node goes down, a correctly configured cluster can continue to function with no loss of data or functionality.
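RESTful means that Elasticsearch exposes its functionality over an HTTP API that accepts and returns JSON. Any modern web application framework (such as Ruby on Rails or Django) can talk to it with a standard HTTP client or an existing client library, without a custom protocol or custom interfacing code.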
Elasticsearch provides extensive APIs for performing searches and aggregations on your data set out of the box.
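To make the RESTful part concrete, here is a minimal sketch of what a search with an aggregation looks like as a plain HTTP request. The cluster address, the index name `articles`, and the `author.keyword` field are assumptions made up for illustration. The request is only built, not sent, so the snippet runs without a cluster.

```python
import json
import urllib.request

# Hypothetical values for illustration only.
BASE_URL = "http://localhost:9200"  # assumed address of a local cluster
INDEX = "articles"                  # assumed index name


def build_search_request(text):
    """Build (but do not send) a search request with one query and one aggregation."""
    body = {
        # Full-text match on the "title" field.
        "query": {"match": {"title": text}},
        # Count matching documents per author. "author.keyword" assumes the
        # index has a keyword sub-field for "author".
        "aggs": {"by_author": {"terms": {"field": "author.keyword"}}},
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/{INDEX}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request("hello")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing `request` to `urllib.request.urlopen` would send it, which requires a reachable cluster. In practice, a language client usually builds and sends these requests for you.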
Elasticsearch is built on top of Apache Lucene, the open-source text search library written in Java. Lucene has been around for more than 20 years and is used in many applications.
While Elasticsearch has many use cases, it's most often used as an enterprise-level search tool that enables users to find information quickly in large quantities of structured data to get relevant results and powerful analytics.
Let’s review Elasticsearch core concepts.
A node is simply a running instance of Elasticsearch. While it’s possible to run several instances of Elasticsearch on the same server hardware, it really is a best practice to limit a server to a single running instance of Elasticsearch. On Bonsai, a node is technically a virtual server running in the cloud. Bonsai follows the best practice of one Elasticsearch instance per server.
A cluster is a collection of one or more nodes running Elasticsearch that share the same cluster name. Each cluster has a single elected primary node (called the master node in the Elasticsearch documentation), which is chosen automatically by the cluster and can be replaced if the current primary node fails.
In Elasticsearch parlance, the word “index” can be used as either a verb or a noun. This can be confusing for users new to Elasticsearch, but the intended meaning is usually clear from syntax and context.
An index is like a table in a relational database. It’s a logical namespace that maps to one or more primary shards and can have zero or more replica shards.
To add to this, an index is sort of an abstraction because the shards are the “real” search engines. Queries to an index’s contents are routed to its shards, each of which is actually a Lucene instance. In other words, an index can have many shards, but each shard can only belong to one index. This relationship precludes collocating multiple indices on a single shard.
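As a rough sketch of how the shard layout is expressed, the body of an index-creation request carries the number of primary shards and the number of replicas per primary. The index name `articles` and the helper function are hypothetical. The `number_of_shards` and `number_of_replicas` settings are the standard index settings.

```python
import json


def index_settings(primary_shards, replicas_per_primary):
    """Return the request body for creating an index with the given shard layout."""
    if primary_shards < 1:
        raise ValueError("an index needs at least one primary shard")
    if replicas_per_primary < 0:
        raise ValueError("the number of replicas cannot be negative")
    return {
        "settings": {
            "number_of_shards": primary_shards,
            "number_of_replicas": replicas_per_primary,
        }
    }


# The body that would accompany a request such as: PUT /articles
body = index_settings(primary_shards=3, replicas_per_primary=1)
print(json.dumps(body, indent=2))
```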
As a verb, “to index” means to populate an Elasticsearch index (noun) with data. It is most commonly used as a transitive verb with the data as the direct object, rather than the index (noun) being populated. In other words, the process is performed on the data, so you would say “I need to index my data,” and not “I need to index my index.”
Indexing is a critical step in running a search engine. Without indexing your content, you will not be able to query it using Elasticsearch, or take advantage of any of the powerful search features Elasticsearch offers.
Elasticsearch does not have any mechanism for extracting the contents of your website or database on its own. Indexing is something that must be managed by the application and/or the Elasticsearch client.
A shard is technically a standalone search engine. There are two types of shards: primary and replica. We have some documentation on what shards are and where they come from in What are shards and replicas?, but we’ll also elaborate on some of the differences here.
According to the Elasticsearch documentation: “Each document is stored in a single primary shard. When you index a document, it is indexed first on the primary shard, then on all replicas of the primary shard.”
Another way to think about primary shards is “the number of ways your data is split up.” One reason to have an index composed of X primary shards is that each shard will contain 1/X of your data. Queries can then be performed in parallel. This is useful and can definitely improve performance if you have millions of documents.
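To illustrate the “1/X of your data” idea, here is a small sketch of how documents can be spread across primary shards by hashing a routing value (the document ID here) and taking it modulo the number of primaries. This is conceptual only: `zlib.crc32` is a stand-in chosen to keep the example deterministic, not the hash function Elasticsearch uses, and the document IDs are made up.

```python
import zlib


def pick_shard(doc_id, primary_shards):
    """Map a document ID to a primary shard number (conceptual stand-in)."""
    # Stand-in hash for illustration; Elasticsearch uses its own hash of the
    # routing value.
    return zlib.crc32(doc_id.encode("utf-8")) % primary_shards


PRIMARY_SHARDS = 4
counts = [0] * PRIMARY_SHARDS
for i in range(10_000):
    counts[pick_shard(f"doc-{i}", PRIMARY_SHARDS)] += 1

# Each entry is the number of documents that landed on that shard.
print(counts)
```

Because the shard is derived from the number of primaries, changing that number on an existing index would send the same ID to a different shard. This is one reason the primary shard count is chosen when the index is created.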
The Elasticsearch definition for replica shards sums it up nicely:
A replica is a copy of the primary shard, and has two purposes: to increase failover (a replica shard can be promoted to a primary shard if the primary fails) and to increase performance (get and search requests can be handled by primary or replica shards). By default, each primary shard has one replica, but the number of replicas can be changed dynamically on an existing index. A replica shard will never be started on the same node as its primary shard.
Another way to think about replica shards is “the number of redundant copies of your data.” If your index has 1 primary shard and 2 replica shards, then you can think of the cluster as having 3 total copies of the data. If the primary shard is lost – for example, if the server running it dies, or there is a network partition – then the cluster can recover automatically by using one of the replicas.
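The arithmetic is simple enough to sketch. The function names below are hypothetical helpers, not part of any Elasticsearch API.

```python
def copies_of_data(replicas_per_primary):
    """Total copies of the data: the primary plus its replicas."""
    return 1 + replicas_per_primary


def total_shards(primary_shards, replicas_per_primary):
    """Total number of shards (primaries and replicas) the cluster must host."""
    return primary_shards * (1 + replicas_per_primary)


def copies_that_can_be_lost(replicas_per_primary):
    """How many copies of a shard can be lost while one copy still survives."""
    return replicas_per_primary


# The example from the text: 1 primary shard with 2 replicas.
print(copies_of_data(2))           # 3
print(total_shards(1, 2))          # 3
print(copies_that_can_be_lost(2))  # 2
```

Keep in mind that a replica is never started on the same node as its primary, so extra replicas only add redundancy when there are enough nodes to host them.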
From the Elasticsearch documentation: “A document is a JSON document which is stored in Elasticsearch. It is like a row in a table in a relational database.”
As referenced, an analogue for an Elasticsearch document would be a database record. It is a collection of related fields and values. For example, it might look something like this:
{
  "title" : "Hello, World!",
  "author" : "John Doe",
  "description" : "This is a JSON document!"
}
You may be thinking to yourself: “My data is sitting in Postgres/MySQL/whatever, which is most certainly not in a JSON format! How do I turn the contents of a DB table into something Elasticsearch can read?”
Normally this work is handled automatically by a client (see below). Most users will never need to worry about translating the database contents into the Elasticsearch documents, as it will be handled automatically and “behind the scenes,” so to speak.
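If you are curious what that “behind the scenes” translation amounts to, here is a minimal sketch using an in-memory SQLite table. The `posts` table and its columns are made up for illustration, and real clients do considerably more than this.

```python
import json
import sqlite3

# A throwaway in-memory database standing in for Postgres/MySQL/whatever.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets us read rows by column name
conn.execute(
    "CREATE TABLE posts ("
    "id INTEGER PRIMARY KEY, title TEXT, author TEXT, description TEXT)"
)
conn.execute(
    "INSERT INTO posts (title, author, description) VALUES (?, ?, ?)",
    ("Hello, World!", "John Doe", "This is a JSON document!"),
)


def row_to_document(row):
    """Split a database row into a document ID and a JSON-ready dict."""
    document = dict(row)
    doc_id = document.pop("id")  # the primary key becomes the document ID
    return doc_id, document


row = conn.execute("SELECT * FROM posts").fetchone()
doc_id, document = row_to_document(row)
print(doc_id)
print(json.dumps(document, indent=2))
```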
Sometimes new users are confused about the term “document,” because their mental model (and possibly even the data they want to index) involves file formats like Word, PDF, Excel, RTF, PPT, and others. In Elasticsearch terminology, these formats are sometimes called “rich text documents,” and are completely different from Elasticsearch documents.
One reason this can be confusing is that Elasticsearch can index and search rich text documents. There are some plugins – mapper-attachments and ingest-attachment (both supported on Bonsai) – which use the Apache Tika toolkit to extract the contents of rich text documents and push them into Elasticsearch.
That said, if the data you’re trying to index resides in a rich text format, you can still index it into Elasticsearch. But when reading the documentation, keep in mind that “document” refers to the JSON representation of the content, and some variant of “rich text” refers to the file you’re trying to index.
A client is software that sits between your application and Elasticsearch cluster. It’s used to facilitate communication between your application and Elasticsearch. At a minimum, it will take data from the application, translate it into something Elasticsearch can understand, and then push that data into the cluster.
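As a sketch of the “translate and push” step, the function below turns application records into the newline-delimited JSON body that the bulk endpoint (`POST /_bulk`, sent with `Content-Type: application/x-ndjson`) expects. The index name and the records are hypothetical, and nothing is sent over the network.

```python
import json


def to_bulk_body(index_name, records):
    """Translate application records into a bulk request body (NDJSON)."""
    lines = []
    for record in records:
        source = dict(record)        # copy so the caller's data is untouched
        doc_id = source.pop("id")    # assumes every record carries an "id"
        # Action line: says what to do and where.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": str(doc_id)}}))
        # Source line: the document itself.
        lines.append(json.dumps(source))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


records = [
    {"id": 1, "title": "Hello, World!", "author": "John Doe"},
    {"id": 2, "title": "Second post", "author": "Jane Doe"},
]
bulk_body = to_bulk_body("articles", records)
print(bulk_body)
```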
Most clients will also handle a lot of other Elasticsearch-related tasks.
It is unlikely that you will need to create your own client. Elasticsearch maintains a list of language-specific clients that are well-documented and in wide use. Clients also exist for popular frameworks and content management systems.
In short, there is probably already an open sourced, well-documented client available.
Elasticsearch is a great tool to have in your developer arsenal. It has a fast time-to-value and can be used for a variety of applications, from real-time analytics to search engine optimization (SEO). Beyond these benefits, it also has complementary tooling and plugins that make it easy to integrate with other technologies such as Kibana.
Elasticsearch is used in many different applications. It's used as a search engine for media, e-commerce, social networks, and internal applications. Companies that use Elasticsearch include Netflix, Walmart, eBay, and many others.
Elasticsearch is a search engine, not a content crawler. If your use case involves scanning domains or websites, you’ll need to use a crawler like Apache Nutch. In this case, the crawler would be responsible for scraping content and indexing it into your cluster.
Elasticsearch can integrate with Java, JavaScript (Node.js), Ruby, Python, Go, PHP, .NET (C#), and Perl, to name a few.
Elasticsearch can be hosted in many different ways. You have the option to host Elasticsearch yourself on a server, or you can use one of the many cloud services that offer Elasticsearch as a service (SaaS).
The most popular SaaS options are:
Kibana can be used to visualize data from Elasticsearch, or you can use it to create custom visualizations and dashboards that leverage the full power of Elasticsearch.
As you can see, Elasticsearch is a powerful tool that can be used for many different applications. It's not just for search—you can use it to build analytics engines and data pipelines as well. If you're looking for something new to add to your developer arsenal and want some help getting started with Elasticsearch, we've got some great resources available in our documentation!