Bonsai makes upgrading major versions of Elasticsearch as painless as possible. There is no need to manage the operational details of deploying software upgrades to nodes or spinning up new servers. With Bonsai, version upgrades are instant and can be performed with zero downtime.
In this document, we’ll cover some general best practices and offer some Bonsai-specific guidelines for migrating your app to a new version of Elasticsearch.
This means having one cluster for production, and another for staging, another for development, and so on. Some users like to put staging indices alongside production indices on the same cluster in order to ensure identical behaviors between staging and production applications. First, this is a terrible idea in general; you should never run staging/dev applications on the same resources as production. This is a recipe for disaster.
Second, separating out your environments allows for upgrading them one at a time. While this may sound tedious, it is the most prudent approach and allows you to discover potential problems before they impact production.
Make sure to perform your due diligence by reading the release notes and breaking changes that accompany the version you’re targeting for the upgrade.
Another thing to investigate is whether your application’s Elasticsearch client supports the upgrade candidate. There have been cases where a popular client or framework was several support versions behind the official Elasticsearch release. Some of these have resulted in hours of down time for users who upgraded the production Elasticsearch cluster beyond the version supported by their Elasticsearch client.
This is one of many reasons we recommend upgrading non-production environments first.
Upgrading across major versions sometimes comes with breaking changes, new dependencies, and tweaks in behavior. It is important to validate that the upgrade is safe before pushing it out to production. We advise starting by upgrading the least critical environment first. A variation of the blue-green deployment strategy is useful here.
The process looks like this:
1. The application communicates with the old Elasticsearch cluster via a supported Elasticsearch client. Upgrading the cluster to a newer version of Elasticsearch will also generally require upgrading the client:
2. A new cluster running a later version of Elasticsearch is provisioned for the application. This cluster is not initially connected to the application via the client, because the client needs to be upgraded too:
3. The Elasticsearch client is upgraded, and the new version is updated to point to the new cluster. The application then performs a reindex, which populates the new cluster with data.
4. If the reindex worked as expected, and the application is behaving normally, then the old cluster can be destroyed. If there are any issues, then the client can be downgraded and configured to point back to the old cluster.
Ensure it works as expected, and make any changes as needed. Deploy those changes to the next least critical environment and then upgrade the cluster for that environment. Continue in this fashion until reaching the production environment. By that point, you should be fairly confident that the application and search will work as expected.
Make sure to validate that searches will work as expected in terms of relevancy. Also make sure to test full deletion and reindexing in the least critical environments before upgrading production. Reindexing is something you should be familiar with anyway as a part of normal usage of Elasticsearch, such as changing analyzer settings, backfilling a new field, or - in this case - upgrading to a new major version.
Once you are satisfied that the candidate version will work in production as expected, the final step is to take it live. This last step is usually complicated by the constraint that search must not go down at all, and data loss is unacceptable. Because of this constraint, planning and possibly additional infrastructure (like message queues), are required to ensure a zero-downtime switch and a fallback path in case something breaks.
Of course, if you're fortunate enough to have a use case where the production app can be put into maintenance mode while the new cluster is repopulated, then you can simply use the same process as outlined in the previous step.
For everyone else, the basic process is to have the old application and cluster serve traffic while the new application is populating the new cluster from the source database. Once the new cluster is populated with the same data as the old cluster, the new application is promoted to a production role and begins serving traffic. This strategy allows developers to quickly roll back to a known working state in the event that there is a serious issue with the new system.
The exact steps will vary considerably by application and use case. A typical strategy for this is outlined below:
1. The application communicates with the old Elasticsearch cluster via a supported Elasticsearch client. Upgrading the cluster to a newer version of Elasticsearch will also generally require upgrading the client:
2. A new cluster running a later version of Elasticsearch is provisioned for the application. This cluster is not initially connected to the application via the client, because the client needs to be upgraded too:
3. A fork is made of the production application. This fork will eventually be promoted to the production role. The fork is identical to the production application, except for the Elasticsearch client and any associated dependencies, which are updated to match the new version of Elasticsearch.
The fork is configured to read from the production DB and the client is configured to point to the new cluster:
4. The new cluster is indexed from the production DB. This allows the new cluster to "catch up" with the old cluster. If the production application handles a high volume of updates, it may be necessary to push updates into a queue of some kind instead of saving to the DB.
At the end of this process, the old and new clusters will have the same data:
5. The forked application is promoted to production. If updates were queued, those changes should be flushed to each cluster. Each cluster now has the same data, and the new Elasticsearch cluster is handling searches. No data was lost in the transition:
6. The old production application is still connected to the old cluster, but is not handling traffic. It exists only as an emergency fallback option. If the new production application and Elasticsearch version introduce a regression, the old production app can be promoted back to the production role.
In that event, updates would again need to be queued, and the old production app would index any updates made since the initial cutover. When that is done, the queue is again flushed, ensuring that no data has been lost:
7. If everything worked as expected, and a rollback is not needed, the old cluster and the old production application can be destroyed. The application is now running the latest version of Elasticsearch without data loss or downtime.
You will need to adapt this to your specific use case when planning out your blue-green strategy.
Upgrading major versions of Elasticsearch while running a production application can be tricky. If you’re unsure of what to do, are concerned about an edge case or special circumstance, or simply want to sanity check a plan, please do not hesitate to reach out to support@bonsai.io. We’re here to help ensure the smoothest upgrade possible.