At Bonsai, we run search clusters for thousands of customers, and some of them have billions of documents. These clusters require dozens of large instances totaling thousands of CPUs, petabytes of disk, and terabytes of RAM.
At this scale, lots can go wrong. We’re here to make sure everything runs smoothly so our customers can focus on delivering business value to their customers, and not worry about the intricacies of keeping such a large cluster healthy.
We’re also keen on making sure our customers can be flexible in what they do. We don’t lock everything down, but we’ve found that most people don’t know about the dangerous side of running certain operations. Some things can get particularly unsavory when run against a gigantic cluster that ingests millions of documents per day, and executes even more queries.
So, dear reader, welcome to the scary side of search. In this post we outline several features that exist in Elasticsearch and OpenSearch that can outright ruin your week if you’re not careful.
While it doesn’t cover absolutely everything, this is a good start to learning more about the many potential footguns in your engine. We’ve broken them up into some groups: Dangerous, Destructive, Heavy Read/Write, Spatial, and Mappings/Settings. Enjoy!
Dangerous
These operations are called out first because they are non-obvious one-liners that will make things bad for everyone if you run them without realizing the implications. You should probably never use these unless you have a very specific reason, and only with careful understanding and planning.
Name/Operation | Description & Risks | Docs |
---|---|---|
Force Merge `POST [/{index}]/_forcemerge` | Force segment merges to reduce segment count (optionally expunge deletes). Heavy CPU/IO; can create very large segments and can run for DAYS. | Elasticsearch: Force a merge · OpenSearch: Force Merge API |
Clear Cache `POST [/{index}]/_cache/clear` | Clear request/query/fielddata caches for one or more indices. Drops hot caches; expect immediate latency/CPU spikes as caches rebuild. | Elasticsearch: Clear Cache · OpenSearch: Clear Cache |
Refresh `POST [/{index}]/_refresh` | Explicitly refresh one or more indices to make recent writes searchable. Synchronous and resource-intensive; rely on the periodic refresh controlled by `refresh_interval` instead. | Elasticsearch: Refresh API · OpenSearch: Refresh Index API |
Flush `POST [/{index}]/_flush` | Force a Lucene commit: fsync segments and rotate the translog. Burst I/O and segment churn if run broadly/frequently. It's usually unnecessary; Lucene will do this for you. | Elasticsearch: Flush API · OpenSearch: Flush API |
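If you do have a deliberate reason to force merge, construct the call explicitly rather than firing it off ad hoc. Here's a minimal Python sketch of what such a request looks like; the index name and segment count are illustrative, and the endpoint is the one from the table above:

```python
# Sketch: build a force merge request deliberately rather than sending it blind.
# The index name ("logs-2024") and target segment count are illustrative.

def force_merge_request(index, max_num_segments=1, only_expunge_deletes=False):
    """Return (method, path, params) for POST /{index}/_forcemerge."""
    params = {}
    if only_expunge_deletes:
        # Only reclaim space from deleted docs; don't merge down to N segments.
        params["only_expunge_deletes"] = "true"
    else:
        # Merging down to a single segment is the heaviest option; on a big
        # index this can run for days, so schedule it off-peak.
        params["max_num_segments"] = str(max_num_segments)
    return "POST", f"/{index}/_forcemerge", params

method, path, params = force_merge_request("logs-2024")
```

Reserve this for indices that are no longer being written to; force merging an index that is still receiving writes largely wastes the work.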
Destructive
This section outlines ways to say goodbye to data. Obviously, you can delete an entire index in a single curl command, and that is typically run in very purposeful scenarios. But others are more devious. For example, you can delete by query, but triple-check by running the query without the delete first, and don't YOLO your way into a week-long recovery effort!
Name/Operation | Description & Risks | Docs |
---|---|---|
Delete Index `DELETE /{index}` | Delete an index (irreversible unless you have snapshots). Destructive; can break aliases and dependent apps; requires careful RBAC. | Elasticsearch: Delete Index · OpenSearch: Delete Index |
Delete By Query `POST /{index}/_delete_by_query` | Delete all docs matching a query. Full/large scan; can delete massive volumes; heavy merges; difficult to roll back. | Elasticsearch: Delete By Query · OpenSearch: Delete By Query |
Close Index `POST /{index}/_close` | Close an index (no read/write/search; frees some resources). Operational risk: apps fail on a closed index; ingestion will fail. | Elasticsearch: Close Index · OpenSearch: Close Index |
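The "triple check first" discipline for delete by query can be made mechanical: run the exact same query body through a count endpoint, sanity-check the number, and only then send the delete. A hedged sketch; the index ("orders") and field ("status") are illustrative:

```python
# Sketch of the "count before you delete" habit for _delete_by_query.
# The index ("orders") and field ("status") are illustrative.

def preview_then_delete(index, query):
    """Return two requests sharing one body: a count preview, then the delete."""
    body = {"query": query}
    preview = ("GET", f"/{index}/_count", body)
    delete = ("POST", f"/{index}/_delete_by_query", body)
    return preview, delete

query = {"term": {"status": "expired"}}
preview, delete = preview_then_delete("orders", query)
# Send `preview` first and sanity-check the count before sending `delete`.
```

Because both requests share the same body, there's no chance of the query drifting between the check and the delete.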
Heavy Read/Write/Compute
When you’re serving lots of queries in a live environment and decide you want to run some ad-hoc reports, gather comprehensive stats, take a snapshot, or reindex, you can slow things down for your customer application. Likewise, if you’ve got a new large dataset to toss into the index that exceeds your typical daily ingest, plan carefully before doing so.
Name/Operation | Description & Risks | Docs |
---|---|---|
Reindex `POST /_reindex` | Copy documents from one index/alias/data stream to another (optionally filtered/transformed). Very heavy read+write workload; can saturate I/O, heap, and network; can create version conflicts; best throttled or run off-peak. | Elasticsearch: Reindex API · OpenSearch: Reindex API |
Update By Query `POST /{index}/_update_by_query` | Scan + script-update matching docs in place. Expensive full/large scan; scripts execute per hit; version conflicts; large translog and segment churn. | Elasticsearch: Update By Query · OpenSearch: Update By Query |
Create a Snapshot `POST /_snapshot/{repository}/{snapshot}` | Filesystem/object-store snapshot of indices/cluster state. Heavy I/O and repository load; long-running; can contend with indexing. | Elasticsearch: Create Snapshot · OpenSearch: Create Snapshot |
Restore a Snapshot `POST /_snapshot/.../_restore` | Restore indices/cluster metadata from snapshot. Cluster-wide writes and shard allocations; can overwhelm nodes and disrupt routing. | Elasticsearch: Restore Snapshot · OpenSearch: Restore Snapshot |
Disk Usage API `POST /{index}/_disk_usage?run_expensive_tasks=true` | Analyze per-field on-disk footprint. Expensive offline analysis; can be very resource-intensive on large indices. | Elasticsearch: Analyze Index Disk Usage · OpenSearch: No Equivalent |
Scroll Search `GET /_search?scroll=...` | Long-lived search contexts to page through large result sets. Holds resources per context; forgetting to clear can leak heap and file handles. | Elasticsearch: Scroll API · OpenSearch: Scroll API |
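The "best throttled or run off-peak" advice for reindex maps to two real reindex parameters: `requests_per_second` (a throttle) and `wait_for_completion=false` (run it as a background task and poll instead of holding the connection open). A sketch; the index names and throttle value are illustrative:

```python
# Sketch of a throttled, task-based reindex. requests_per_second and
# wait_for_completion are real reindex parameters; the index names and
# the throttle value here are illustrative.

def reindex_request(source, dest, requests_per_second=500):
    """Return (method, path, params, body) for a throttled POST /_reindex."""
    body = {"source": {"index": source}, "dest": {"index": dest}}
    params = {
        # Throttle so the reindex doesn't saturate I/O for live traffic.
        "requests_per_second": str(requests_per_second),
        # Run as a background task; poll the Tasks API instead of
        # holding the HTTP connection open for hours.
        "wait_for_completion": "false",
    }
    return "POST", "/_reindex", params, body

req = reindex_request("logs-v1", "logs-v2")
```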
Spatial (Disk & RAM consumption)
Have you ever run out of disk or memory? Talk about fun: if you're bored, these are a really great way to find lots of work to do for the next 24 hours. Some operations unwittingly produce far more data than you realize, sometimes double or more the size of your index. If you don't have enough room, you'll find out when the machines start complaining.
Name/Operation | Description & Risks | Docs |
---|---|---|
Reindex `POST /_reindex` | Beyond the read/write load covered above, reindex writes a complete second copy of your documents to disk; the source stays intact until you delete it, so plan for roughly double the space. | Elasticsearch: Reindex API · OpenSearch: Reindex API |
Shrink Index `POST /{index}/_shrink/{target}` | Rewrites an index into fewer primary shards. Requires a read-only state; creates a new index; heavy reindex-like copy and segment rewrite. | Elasticsearch: Shrink Index · OpenSearch: Shrink Index |
Split Index `POST /{index}/_split/{target}` | Rewrites an index into more primary shards (multiples only). Requires a read-only source; full rewrite; heavy disk/CPU. | Elasticsearch: Split Index · OpenSearch: Split Index |
Clone Index `POST /{index}/_clone/{target}` | Clone an index to a new one (same shard count). Faster than reindex, but still creates a full copy and transient resource spikes. | Elasticsearch: Clone Index · OpenSearch: Clone Index |
Point In Time `POST /{index}/_pit` | Consistent snapshot for paginating searches. Keeps segments pinned; too many or long-lived PITs prevent old segments from being freed, inflating disk and heap use. | Elasticsearch: Point-in-Time API · OpenSearch: Point-in-Time API |
Restore a Snapshot `POST /_snapshot/.../_restore` | Restore indices/cluster metadata from a snapshot. Requires you to close, delete, or rename any existing index first. The former two may result in data loss; the latter requires enough space for both copies. | Elasticsearch: Restore Snapshot · OpenSearch: Restore Snapshot |
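Since PITs pin segments for as long as they're open, treat them as a strict open/use/close lifecycle: short `keep_alive`, and an explicit close as soon as pagination finishes. A sketch of that shape; the paths shown are Elasticsearch's (OpenSearch's PIT endpoints live under `_search/point_in_time` instead), and the index name is illustrative:

```python
# Sketch of a point-in-time lifecycle: open with a short keep_alive,
# and always close explicitly so pinned segments can be released.
# Paths are Elasticsearch's; OpenSearch uses /{index}/_search/point_in_time.

def open_pit(index, keep_alive="2m"):
    """Open a PIT with a deliberately short keep_alive."""
    return "POST", f"/{index}/_pit?keep_alive={keep_alive}"

def close_pit(pit_id):
    """Close the PIT explicitly rather than waiting for expiry."""
    return "DELETE", "/_pit", {"id": pit_id}
```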
Mappings, Settings, and Aggregations
This section doesn't cover explicit operations, but rather things that you can add to mappings and settings that might result in things you don't want.
First things first: don't rely on dynamic mappings. Dynamic mapping kicks in when a document contains data with no corresponding mapping property/field, and the engine has to guess a type for it. Those guesses are often poorly optimized, so always declare your data explicitly in mappings with full coverage.
Name/Operation | Description & Risks | Docs |
---|---|---|
Add a document without mappings `POST /{index}/_doc/{id}` | Triggers automatic dynamic mapping updates. Field explosion and poor types from dynamic mapping; mapping growth increases heap use and slows queries. | Elasticsearch: Dynamic mappings · OpenSearch: Dynamic mappings |
Update Mappings `PUT /{index}/_mapping` | Update index mappings (add fields, parameters). Some changes are irreversible without a reindex. | Elasticsearch: Put mapping · OpenSearch: Put mapping |
Update Settings `PUT /{index}/_settings` | Change the settings for an index. Some changes can alter infrastructure layout and impact runtime performance. | Elasticsearch: Update Settings · OpenSearch: Update Settings |
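One way to enforce the "no dynamic mappings" rule is to set `dynamic` to `strict` at index creation, so documents with unmapped fields are rejected outright instead of silently growing the mapping. A sketch with illustrative field names:

```python
# Explicit mapping with dynamic mapping disabled. Field names are
# illustrative; "dynamic": "strict" makes the engine reject any
# document containing a field not declared under "properties".
mapping = {
    "mappings": {
        "dynamic": "strict",
        "properties": {
            "title": {"type": "text"},
            "status": {"type": "keyword"},
            "created_at": {"type": "date"},
        },
    }
}
# Send as the body of PUT /{index} at index-creation time.
```

A middle ground is `"dynamic": false`, which stores unmapped fields without indexing them rather than rejecting the document.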
Mapping footguns
Some mapping parameters enable incredible things, like highlighting and aggregations (more on aggs later). Here are some specific properties you should use sparingly.
Property | Description & Risks | Docs |
---|---|---|
term_vectors"term_vector": "with_positions_offsets" | Term vectors are used with position offsets to enable highlighting, and can also be used to enable payloads. Enabling with_positions_offsets will increase disk and heap use for the field on which it is enabled by a significant factor. | Elasticsearch: term_vector OpenSearch: term_vector |
copy_to"copy_to": ["other field", ...] | Copy_to allows you to duplicate a field into another for alternate index and query analysis configuration. Using copy_to on large text fields to multiple destinations with unoptimized analyzis can grow your index significantly | Elasticsearch: copy_to OpenSearch: copy_to |
Get to know all your mapping field parameters!
Settings footguns
Index settings are vast. Many of them you should just leave at their defaults unless you know what you are doing, and some can be changed on live indices to trigger infrastructure changes with deep implications.
As a general guideline, I encourage you to read through your respective engine's index settings guide.
Property | Description & Risks | Docs |
---|---|---|
number_of_replicas"number_of_replicas": integer | This will set the number of primary shard replicas for your index. Can trigger mass shard movement and recovery (network + IO heavy), degrading search/indexing. | Elasticsearch: Index Settings OpenSearch: Index Settings |
refresh_interval"refresh_interval": integer | Sets the interval time (in seconds) that the engine will make recently added documents available for search. The default is 1 second, but in large clusters with high ingest rates, consider changing this to a higher number, upwards of 10 seconds maximum. | Elasticsearch: Index Settings OpenSearch: Index Settings |
Settings is also where you configure field analysis that your mapping properties will use. In general, avoid ngrams and shingles unless you need them for a specific purpose, as they will significantly increase spatial requirements of the fields using them.
Property | Description & Risks | Docs |
---|---|---|
N-gram token filter `"type": "ngram"` | Breaks words down into smaller pieces to assist with partial matching and fuzzy search. This will grow your index vocabulary in size, impacting the disk and memory requirements for the field. | Elasticsearch: N-gram Token Filter · OpenSearch: N-gram Token Filter |
Shingle token filter `"type": "shingle"` | Generates word n-grams ("shingles"), which assist with phrase search. With the default shingle size of two, every adjacent word pair is stored as an extra token alongside the originals, multiplying the terms indexed for the field. | Elasticsearch: Shingle Token Filter · OpenSearch: Shingle Token Filter |
Get to know all your analysis types!
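To get a feel for the ngram blowup, here's a rough Python imitation of what a character ngram filter emits with the documented defaults of `min_gram: 1` and `max_gram: 2`; a single six-letter token becomes eleven:

```python
# Rough imitation of a character ngram token filter, to show how the
# vocabulary grows. min_gram=1 and max_gram=2 match the engine's
# documented defaults for the ngram filter.

def ngrams(token, min_gram=1, max_gram=2):
    out = []
    for n in range(min_gram, max_gram + 1):
        # Slide a window of width n across the token.
        for i in range(len(token) - n + 1):
            out.append(token[i:i + n])
    return out

grams = ngrams("search")  # one token in, eleven tokens out
```

Multiply that across every word of every document in a big text field and the disk and memory numbers in the table above stop being abstract.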
Aggregations
I saved the best for last, because this is a query-time footgun that I see all too often. Aggregations are complex counting operations with high I/O and CPU use, and when used without care they will saturate your CPU and make latency terrible for everyone. When serving large corpora at high query loads, be sure to optimize your aggregations and use them only where necessary. Also couple them with strict matching/filter criteria in your query to ensure you're not aggregating across too much data.
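What "strict matching/filter criteria" looks like in practice: put cheap, cacheable `filter` clauses in the query so the aggregation only ever touches a narrow slice of the index, and set `size: 0` when you don't need the hits themselves. A sketch; all field names and values are illustrative:

```python
# Sketch of an aggregation constrained by filters so it only touches a
# narrow slice of the index. All field names and values are illustrative.
body = {
    "size": 0,  # we only want the aggregation, not the matching docs
    "query": {
        "bool": {
            "filter": [
                # Filter clauses don't score and are cacheable, so they're
                # the cheap way to shrink the doc set before aggregating.
                {"term": {"store_id": "eu-west-42"}},
                {"range": {"sold_at": {"gte": "now-7d"}}},
            ]
        }
    },
    "aggs": {"weekly_revenue": {"sum": {"field": "price"}}},
}
# Send as the body of POST /{index}/_search.
```

The same aggregation run with `"query": {"match_all": {}}` would walk every document in the index; the filters are what keep it from doing so.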
Conclusion
Well, that's all for now. Remember, Bonsai is here to take away the pain. Stay green, stay happy, and stay safe out there folks!