Apr 11, 2023
Leo Shue Schuster
OpenSearch is a community-driven, open-source search and analytics suite derived from the Apache 2.0-licensed Elasticsearch 7.10.2 and Kibana 7.10.2.
It consists of a search engine daemon, OpenSearch, and a visualization and user interface, OpenSearch Dashboards. OpenSearch enables people to easily ingest, secure, search, aggregate, view, and analyze data. These capabilities are popular for use cases such as application search, log analytics, and more.
With OpenSearch, people benefit from having an open source product they can use, modify, extend, monetize, and resell however they want. At the same time, OpenSearch will continue to provide a secure, high-quality search and analytics suite with a rich roadmap of new and innovative functionality.
OpenSearch was created as a response to Elastic’s decision to switch from the Apache 2.0 license to the highly restrictive and anti-competition SSPL. In short, the final open source versions of Elasticsearch and Kibana (7.10.2) were forked, and those forks became OpenSearch and OpenSearch Dashboards 1.0.
The OpenSearch Project is community-driven, so innovation and relevant new features are always being proposed and worked on for ever-changing search needs. AWS Open Source Blog’s take on staying open source: Keeping Open Source Open – Open Distro for Elasticsearch.
OpenSearch is a great tool to have in your developer arsenal. It offers fast time-to-value and can be used for a variety of applications, from real-time analytics to search engine optimization (SEO). Not only does OpenSearch offer these benefits, but it also has complementary tooling and plugins that make it easy to integrate with other technologies such as OpenSearch Dashboards.
OpenSearch is used in many different applications. It's used as a search engine for media, e-commerce, social networks, and internal applications. The OpenSearch project is led by Amazon Web Services (AWS) and publicly sponsored by Bonsai, SAP, CapitalOne, RedHat, Logz, Aiven, Logit, InstaCluster, and BAInsight.
OpenSearch is an open-source product that is Apache 2.0-licensed. That means you are free to use, change, and monetize it however you see fit.
For product teams that don’t want to keep up with the licensing stipulations of Elasticsearch versions after 7.10.2, OpenSearch is a great alternative.
When OpenSearch 1.0 was forked from Elasticsearch 7.10.2, it provided feature parity. Since then, OpenSearch has made many releases. Differing features and plugins include:
Yes, OpenSearch is backwards compatible with Elasticsearch 7.10.2.
When Amazon launched OpenSearch 1.0, they stated:
The Amazon OpenSearch Service APIs will be backwards compatible with the existing service APIs to eliminate any need for customers to update their current client code or applications. Additionally, just as we did for previous versions of Elasticsearch, we will provide a seamless upgrade path from existing Elasticsearch 6.x and 7.x managed clusters to OpenSearch.
Let’s review OpenSearch core concepts.
A node is simply a running instance of OpenSearch. While it’s possible to run several instances of OpenSearch on the same server hardware, it really is a best practice to limit a server to a single running instance of OpenSearch. On Bonsai, a node is technically a virtual server running in the cloud. Bonsai follows the best practice of one OpenSearch instance per server.
A cluster is a collection of nodes running OpenSearch. A cluster consists of one or more nodes which share the same cluster name. Each cluster has at least one primary node, which is chosen automatically by the cluster and can be replaced if the current primary node fails.
In OpenSearch parlance, the word “index” can either be used as a verb or a noun. This can be confusing for users new to OpenSearch. The intended meaning is usually understood through syntax and context clues.
An index is like a table in a relational database. It’s a logical namespace that maps to one or more primary shards and can have zero or more replica shards.
To add to this, an index is sort of an abstraction because the shards are the “real” search engines. Queries to an index’s contents are routed to its shards, each of which is actually a Lucene instance. In other words, an index can have many shards, but each shard can only belong to one index. This relationship precludes collocating multiple indices on a single shard.
To index (verb) is to populate an OpenSearch index (noun) with data. It is most commonly used as a transitive verb with the data as the direct object, rather than the index (noun) being populated. In other words, the process is performed on the data, so you would say “I need to index my data,” and not “I need to index my index.”
Indexing is a critical step in running a search engine. Without indexing your content, you will not be able to query it using OpenSearch, or take advantage of any of the powerful search features OpenSearch offers.
OpenSearch does not have any mechanism for extracting the contents of your website or database on its own. Indexing is something that must be managed by the application and/or the OpenSearch client.
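To make "managed by the application" a little more concrete, here is a minimal sketch that builds (but does not send) the kind of HTTP request an application could use to index a single document. The cluster URL, the index name `articles`, and the helper `build_index_request` are assumptions made up for illustration; in practice an OpenSearch client usually does this work for you.

```python
import json
import urllib.request


def build_index_request(base_url, index, doc_id, document):
    """Build a PUT /<index>/_doc/<id> request without sending it."""
    url = f"{base_url.rstrip('/')}/{index}/_doc/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical local cluster and index name.
request = build_index_request(
    "http://localhost:9200",
    "articles",
    1,
    {"title": "Hello, World!", "author": "John Doe"},
)
print(request.get_method(), request.full_url)
# To actually send it against a running cluster:
# urllib.request.urlopen(request)
```

Nothing is sent until `urlopen` is called, so the sketch can be run without a cluster.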
Each document is stored in a single primary shard. When you index a document, it is indexed first on the primary shard, then on all replicas of the primary shard.
To add to this, each shard is technically a standalone search engine. There are two types of shards: primary and replica. We have some documentation on what shards are and where they come from in What are shards and replicas?, but we’ll also elaborate on some of the differences here.
When you index a document, it is indexed first to the primary shard, then on all replicas of the primary shard.
Another way to think about primary shards is “the number of ways your data is split up.” One reason to have an index composed of X primary shards is that each shard will contain 1/X of your data. Queries can then be performed in parallel. This is useful and can definitely improve performance if you have millions of documents.
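The "1/X of your data" idea can be sketched with a toy router. OpenSearch chooses a shard roughly by hashing a routing value (the document ID by default) and taking the remainder modulo the number of primary shards. The sketch below uses `zlib.crc32` as a stand-in hash, which is an assumption for illustration and not the hash OpenSearch actually uses; `pick_shard` is a hypothetical helper.

```python
import zlib
from collections import Counter


def pick_shard(doc_id, num_primary_shards):
    """Toy routing: stand-in hash of the ID, modulo the primary shard count."""
    return zlib.crc32(str(doc_id).encode("utf-8")) % num_primary_shards


num_primary_shards = 4  # hypothetical index with X = 4 primary shards
doc_ids = [f"doc-{i}" for i in range(10_000)]

# Count how many documents land on each shard.
distribution = Counter(pick_shard(d, num_primary_shards) for d in doc_ids)
for shard, count in sorted(distribution.items()):
    print(f"shard {shard}: {count} documents")
```

With a well-distributed hash, each shard should end up holding about 1/X of the documents, and the same ID always routes to the same shard.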
However, adding more primary shards offers diminishing returns while the overhead increases sharply.
The takeaway is that you never want to use a large number of primary shards without some serious capacity planning and testing. If you are wondering how many primary shards you’ll need, you can check out The Ideal Elasticsearch Index (specifically the benchmarking section), or simply shoot us an email.
The Elasticsearch definition for replica shards sums it up nicely:
A replica is a copy of the primary shard, and has two purposes:
1. Increase failover: a replica shard can be promoted to a primary shard if the primary fails.
2. Increase performance: get and search requests can be handled by primary or replica shards.
By default, each primary shard has one replica, but the number of replicas can be changed dynamically on an existing index. A replica shard will never be started on the same node as its primary shard.
Another way to think about replica shards is “the number of redundant copies of your data.” If your index has 1 primary shard and 2 replica shards, then you can think of the cluster as having 3 total copies of the data. If the primary shard is lost – for example, if the server running it dies, or there is a network partition – then the cluster can recover automatically by using one of the replicas.
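The arithmetic behind "3 total copies" can be sketched as follows. `number_of_shards` and `number_of_replicas` are the index setting names; the helper functions and the example values are hypothetical and only mirror the 1-primary, 2-replica case described above.

```python
import json


def total_copies(number_of_replicas):
    """Data lives once on the primary plus once on each replica."""
    return 1 + number_of_replicas


def total_shards(number_of_shards, number_of_replicas):
    """Primary shards plus all of their replica shards."""
    return number_of_shards * total_copies(number_of_replicas)


# Settings body for a hypothetical index with 1 primary shard and 2 replicas.
index_settings = {
    "settings": {
        "index": {
            "number_of_shards": 1,
            "number_of_replicas": 2,
        }
    }
}

index = index_settings["settings"]["index"]
print(json.dumps(index_settings, indent=2))
print("copies of the data:", total_copies(index["number_of_replicas"]))
print(
    "shards in the cluster:",
    total_shards(index["number_of_shards"], index["number_of_replicas"]),
)
```

The same functions show how quickly shard counts grow: 3 primary shards with 1 replica each already means 6 shards for the cluster to manage.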
A document is a JSON file that is stored in an OpenSearch cluster.
As referenced, an analogue for an OpenSearch document would be a database record. It is a collection of related fields and values. For example, it might look something like this:
{
  "title" : "Hello, World!",
  "author" : "John Doe",
  "description" : "This is a JSON document!"
}
You may be thinking to yourself: “My data is sitting in Postgres/MySQL/whatever, which is most certainly not in a JSON format! How do I turn the contents of a DB table into something OpenSearch can read?”
Normally this work is handled by a client (see below). Most users will never need to worry about translating database contents into OpenSearch documents, as it happens automatically and “behind the scenes,” so to speak.
A client is software that sits between your application and your OpenSearch cluster. It’s used to facilitate communication between your application and OpenSearch. At a minimum, it will take data from the application, translate it into something OpenSearch can understand, and then push that data into the cluster.
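For the curious, here is a rough sketch of what that behind-the-scenes translation might look like. It uses an in-memory SQLite table as a stand-in for Postgres/MySQL, turns each row into a JSON document, and serializes the documents in the newline-delimited format accepted by the `_bulk` API. The table, the index name `posts`, and both helper functions are assumptions for illustration; real clients and framework integrations do this differently and handle far more edge cases.

```python
import json
import sqlite3


def rows_to_documents(connection, query):
    """Run a query and return each row as a dict keyed by column name."""
    connection.row_factory = sqlite3.Row
    return [dict(row) for row in connection.execute(query)]


def to_bulk_body(index, documents, id_field="id"):
    """Serialize documents as newline-delimited JSON for the _bulk API."""
    lines = []
    for doc in documents:
        # Action line, followed by the document itself.
        action = {"index": {"_index": index, "_id": str(doc[id_field])}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # a bulk body must end with a newline


# Hypothetical table standing in for your real database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT, author TEXT)")
db.execute("INSERT INTO posts (title, author) VALUES ('Hello, World!', 'John Doe')")

documents = rows_to_documents(db, "SELECT id, title, author FROM posts")
bulk_body = to_bulk_body("posts", documents)
print(bulk_body)
```

The resulting string is what would be sent as the body of a `POST /_bulk` request.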
Most clients will also handle a lot of other OpenSearch-related tasks, such as:
It is unlikely that you will need to create your own client. OpenSearch maintains a list of language-specific clients that are well-documented and in wide use. Clients also exist for popular frameworks and content management systems, like:
OpenSearch can be hosted in many different ways. You have the option to host OpenSearch yourself on a server, or you can use one of the many cloud services that offer OpenSearch as a service (SaaS).
The most popular SaaS options are:
OpenSearch Dashboards is used to visualize data from OpenSearch. You can also use it to create custom visualizations and dashboards that leverage the full power of OpenSearch.
The OpenSearch project is always creating new features and maintaining versions. Check out their developer roadmap.
As you can see, OpenSearch is a powerful tool that can be used for many different applications. It's not just for search—you can use it to build analytics engines and data pipelines as well. If you're looking for something new to add to your developer arsenal and want some help getting started with OpenSearch, we've got some great resources available in our documentation!