At Bonsai, we run search clusters for thousands of customers, and some of them have billions of documents. These clusters require dozens of large instances totaling thousands of CPUs, petabytes of disk, and terabytes of RAM.
At this scale, lots can go wrong. We’re here to make sure everything runs smoothly so our customers can focus on delivering business value to their customers, and not worry about the intricacies of keeping such a large cluster healthy.
We’re also keen on making sure our customers can be flexible in what they do. We don’t lock everything down, but we’ve found that most people don’t know about the dangerous side of running certain operations. Some things can get particularly unsavory when run against a gigantic cluster that ingests millions of documents per day, and executes even more queries.
So, dear reader, welcome to the scary side of search. In this post we outline several features that exist in Elasticsearch and OpenSearch that can outright ruin your week if you’re not careful.
While it doesn’t cover absolutely everything, this is a good start to learning more about the many potential footguns in your engine. We’ve broken them up into some groups: Dangerous, Destructive, Heavy Read/Write, Spatial, and Mappings/Settings. Enjoy!
Dangerous
These operations are called out first because they are non-obvious one-liners that will make things bad for everyone if you run them without realizing the implications. You should probably never use these unless you have a very specific reason, and only with careful understanding and planning.
Name/Operation | Description & Risks | Docs |
---|---|---|
Force Merge `POST [/{index}]/_forcemerge` | Force segment merges to reduce segment count (optionally expunge deletes). Heavy CPU/IO; can create very large segments and can run for DAYS. | Elasticsearch: Force a merge · OpenSearch: Force Merge API |
Clear Cache `POST [/{index}]/_cache/clear` | Clear request/query/fielddata caches for one or more indices. Drops hot caches; expect immediate latency/CPU spikes as caches rebuild. | Elasticsearch: Clear Cache · OpenSearch: Clear Cache |
Refresh `POST [/{index}]/_refresh` | Explicitly refresh one or more indices to make recent writes searchable. Synchronous and resource-intensive; rely on the periodic refresh controlled by `refresh_interval` instead. | Elasticsearch: Refresh API · OpenSearch: Refresh Index API |
Flush `POST [/{index}]/_flush` | Force a Lucene commit: fsync segments and rotate the translog. Burst I/O and segment churn if run broadly/frequently. It's usually unnecessary; Lucene will do this for you. | Elasticsearch: Flush API · OpenSearch: Flush API |
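If you do have a deliberate reason to force merge, construct the call explicitly rather than firing it off ad hoc. Here's a minimal Python sketch of what such a request looks like; the index name and segment count are illustrative, and the endpoint is the one from the table above:

```python
# Sketch: build a force merge request deliberately rather than sending it blind.
# The index name ("logs-2024") and target segment count are illustrative.

def force_merge_request(index, max_num_segments=1, only_expunge_deletes=False):
    """Return (method, path, params) for POST /{index}/_forcemerge."""
    params = {}
    if only_expunge_deletes:
        # Only reclaim space from deleted docs; don't merge down to N segments.
        params["only_expunge_deletes"] = "true"
    else:
        # Merging down to a single segment is the heaviest option; on a big
        # index this can run for days, so schedule it off-peak.
        params["max_num_segments"] = str(max_num_segments)
    return "POST", f"/{index}/_forcemerge", params

method, path, params = force_merge_request("logs-2024")
```

Reserve this for indices that are no longer being written to; force merging an index that is still receiving writes largely wastes the work.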
Destructive
This section outlines ways to say goodbye to data. Obviously, you can delete an entire index in a single curl command, and that is typically run in very purposeful scenarios. But others are more devious. For example, you can delete by query, but triple-check by running the query without the delete first, and don't YOLO your way into a week-long recovery effort!
Name/Operation | Description & Risks | Docs |
---|---|---|
Delete Index `DELETE /{index}` | Delete an index (irreversible unless you have snapshots). Destructive; can break aliases and dependent apps; requires careful RBAC. | Elasticsearch: Delete Index · OpenSearch: Delete Index |
Delete By Query `POST /{index}/_delete_by_query` | Delete all docs matching a query. Full/large scan; can delete massive volumes; heavy merges; difficult to roll back. | Elasticsearch: Delete By Query · OpenSearch: Delete By Query |
Close Index `POST /{index}/_close` | Close an index (no read/write/search; frees some resources). Operational risk: apps fail on a closed index; ingestion will fail. | Elasticsearch: Close Index · OpenSearch: Close Index |
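The "triple check first" discipline for delete by query can be made mechanical: run the exact same query body through a count endpoint, sanity-check the number, and only then send the delete. A hedged sketch; the index ("orders") and field ("status") are illustrative:

```python
# Sketch of the "count before you delete" habit for _delete_by_query.
# The index ("orders") and field ("status") are illustrative.

def preview_then_delete(index, query):
    """Return two requests sharing one body: a count preview, then the delete."""
    body = {"query": query}
    preview = ("GET", f"/{index}/_count", body)
    delete = ("POST", f"/{index}/_delete_by_query", body)
    return preview, delete

query = {"term": {"status": "expired"}}
preview, delete = preview_then_delete("orders", query)
# Send `preview` first and sanity-check the count before sending `delete`.
```

Because both requests share the same body, there's no chance of the query drifting between the check and the delete.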
Heavy Read/Write/Compute
When you’re serving lots of queries in a live environment and decide you want to run some ad-hoc reports, gather comprehensive stats, take a snapshot, or reindex, you can slow things down for your customer application. Likewise, if you’ve got a new large dataset to toss into the index that exceeds your typical daily ingest, plan carefully before doing so.
Name/Operation | Description & Risks | Docs |
---|---|---|
Reindex `POST /_reindex` | Copy documents from one index/alias/data stream to another (optionally filtered/transformed). Very heavy read+write workload; can saturate I/O, heap, and network; can create version conflicts; best throttled or run off-peak. | Elasticsearch: Reindex API · OpenSearch: Reindex API |
Update By Query `POST /{index}/_update_by_query` | Scan + script-update matching docs in place. Expensive full/large scan; scripts execute per hit; version conflicts; large translog and segment churn. | Elasticsearch: Update By Query · OpenSearch: Update By Query |
Create a Snapshot `POST /_snapshot/{repository}/{snapshot}` | Filesystem/object-store snapshot of indices/cluster state. Heavy I/O and repository load; long-running; can contend with indexing. | Elasticsearch: Create Snapshot · OpenSearch: Create Snapshot |
Restore a Snapshot `POST /_snapshot/.../_restore` | Restore indices/cluster metadata from snapshot. Cluster-wide writes and shard allocations; can overwhelm nodes and disrupt routing. | Elasticsearch: Restore Snapshot · OpenSearch: Restore Snapshot |
Disk Usage API `POST /{index}/_disk_usage?run_expensive_tasks=true` | Analyze per-field on-disk footprint. Expensive offline analysis; can be very resource-intensive on large indices. | Elasticsearch: Analyze Index Disk Usage · OpenSearch: No Equivalent |
Scroll Search `GET /_search?scroll=...` | Long-lived search contexts to page through large result sets. Holds resources per context; forgetting to clear can leak heap and file handles. | Elasticsearch: Scroll API · OpenSearch: Scroll API |
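The "best throttled or run off-peak" advice for reindex maps to two real reindex parameters: `requests_per_second` (a throttle) and `wait_for_completion=false` (run it as a background task and poll instead of holding the connection open). A sketch; the index names and throttle value are illustrative:

```python
# Sketch of a throttled, task-based reindex. requests_per_second and
# wait_for_completion are real reindex parameters; the index names and
# the throttle value here are illustrative.

def reindex_request(source, dest, requests_per_second=500):
    """Return (method, path, params, body) for a throttled POST /_reindex."""
    body = {"source": {"index": source}, "dest": {"index": dest}}
    params = {
        # Throttle so the reindex doesn't saturate I/O for live traffic.
        "requests_per_second": str(requests_per_second),
        # Run as a background task; poll the Tasks API instead of
        # holding the HTTP connection open for hours.
        "wait_for_completion": "false",
    }
    return "POST", "/_reindex", params, body

req = reindex_request("logs-v1", "logs-v2")
```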
Spatial (Disk & RAM consumption)
Have you ever run out of disk or memory? Talk about fun: if you're bored, these are a really great way to find lots of work to do for the next 24 hours. Some operations unwittingly produce far more data than you realize, sometimes double or more the size of your index. If you don't have enough room, you'll find out when the machines start complaining.
Name/Operation | Description & Risks | Docs |
---|---|---|
Reindex `POST /_reindex` | Beyond the read/write load covered above, reindex writes a complete second copy of your documents to disk; the source stays intact until you delete it, so plan for roughly double the space. | Elasticsearch: Reindex API · OpenSearch: Reindex API |
Shrink Index `POST /{index}/_shrink/{target}` | Rewrites an index into fewer primary shards. Requires a read-only state; creates a new index; heavy reindex-like copy and segment rewrite. | Elasticsearch: Shrink Index · OpenSearch: Shrink Index |
Split Index `POST /{index}/_split/{target}` | Rewrites an index into more primary shards (multiples only). Requires a read-only source; full rewrite; heavy disk/CPU. | Elasticsearch: Split Index · OpenSearch: Split Index |
Clone Index `POST /{index}/_clone/{target}` | Clone an index to a new one (same shard count). Faster than reindex, but still creates a full copy and transient resource spikes. | Elasticsearch: Clone Index · OpenSearch: Clone Index |
Point In Time `POST /{index}/_pit` | Consistent snapshot for paginating searches. Keeps segments pinned; too many or long-lived PITs prevent old segments from being freed, inflating disk and heap use. | Elasticsearch: Point-in-Time API · OpenSearch: Point-in-Time API |
Restore a Snapshot `POST /_snapshot/.../_restore` | Restore indices/cluster metadata from a snapshot. Requires you to close, delete, or rename any existing index first. The former two may result in data loss; the latter requires enough space for both copies. | Elasticsearch: Restore Snapshot · OpenSearch: Restore Snapshot |
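Since PITs pin segments for as long as they're open, treat them as a strict open/use/close lifecycle: short `keep_alive`, and an explicit close as soon as pagination finishes. A sketch of that shape; the paths shown are Elasticsearch's (OpenSearch's PIT endpoints live under `_search/point_in_time` instead), and the index name is illustrative:

```python
# Sketch of a point-in-time lifecycle: open with a short keep_alive,
# and always close explicitly so pinned segments can be released.
# Paths are Elasticsearch's; OpenSearch uses /{index}/_search/point_in_time.

def open_pit(index, keep_alive="2m"):
    """Open a PIT with a deliberately short keep_alive."""
    return "POST", f"/{index}/_pit?keep_alive={keep_alive}"

def close_pit(pit_id):
    """Close the PIT explicitly rather than waiting for expiry."""
    return "DELETE", "/_pit", {"id": pit_id}
```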
Mappings, Settings, and Aggregations
This section doesn't cover explicit operations, but rather things that you can add to mappings and settings that might result in things you don't want.
First things first: don't rely on dynamic mappings. Dynamic mapping kicks in when a document contains data with no corresponding mapping property/field, and the engine has to guess a type for it. Those guesses are often poorly optimized, so always declare your data explicitly in mappings with full coverage.
Name/Operation | Description & Risks | Docs |
---|---|---|
Add a document without mappings `POST /{index}/_doc/{id}` | Triggers automatic dynamic mapping updates. Field explosion and poor types from dynamic mapping; mapping growth increases heap use and slows queries. | Elasticsearch: Dynamic mappings · OpenSearch: Dynamic mappings |
Update Mappings `PUT /{index}/_mapping` | Update index mappings (add fields, parameters). Some changes are irreversible without a reindex. | Elasticsearch: Put mapping · OpenSearch: Put mapping |
Update Settings `PUT /{index}/_settings` | Change the settings for an index. Some changes can alter infrastructure layout and impact runtime performance. | Elasticsearch: Update Settings · OpenSearch: Update Settings |
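One way to enforce the "no dynamic mappings" rule is to set `dynamic` to `strict` at index creation, so documents with unmapped fields are rejected outright instead of silently growing the mapping. A sketch with illustrative field names:

```python
# Explicit mapping with dynamic mapping disabled. Field names are
# illustrative; "dynamic": "strict" makes the engine reject any
# document containing a field not declared under "properties".
mapping = {
    "mappings": {
        "dynamic": "strict",
        "properties": {
            "title": {"type": "text"},
            "status": {"type": "keyword"},
            "created_at": {"type": "date"},
        },
    }
}
# Send as the body of PUT /{index} at index-creation time.
```

A middle ground is `"dynamic": false`, which stores unmapped fields without indexing them rather than rejecting the document.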
Mapping footguns
Some mapping parameters enable incredible things, like highlighting and aggregations (more on aggs later). Here are some specific properties you should use sparingly.
Property | Description & Risks | Docs |
---|---|---|
term_vectors"term_vector": "with_positions_offsets" | Term vectors are used with position offsets to enable highlighting, and can also be used to enable payloads. Enabling with_positions_offsets will increase disk and heap use for the field on which it is enabled by a significant factor. | Elasticsearch: term_vector OpenSearch: term_vector |
copy_to"copy_to": ["other field", ...] | Copy_to allows you to duplicate a field into another for alternate index and query analysis configuration. Using copy_to on large text fields to multiple destinations with unoptimized analyzis can grow your index significantly | Elasticsearch: copy_to OpenSearch: copy_to |
Get to know all your mapping field parameters!
Settings footguns
Index settings are vast. Many of them you should just leave at their defaults unless you know what you are doing, and some can be changed on live indices to trigger infrastructure changes with deep implications.
As a general guideline, I encourage you to read through your respective engine's index settings guide.
Property | Description & Risks | Docs |
---|---|---|
number_of_replicas"number_of_replicas": integer | This will set the number of primary shard replicas for your index. Can trigger mass shard movement and recovery (network + IO heavy), degrading search/indexing. | Elasticsearch: Index Settings OpenSearch: Index Settings |
refresh_interval"refresh_interval": integer | Sets the interval time (in seconds) that the engine will make recently added documents available for search. The default is 1 second, but in large clusters with high ingest rates, consider changing this to a higher number, upwards of 10 seconds maximum. | Elasticsearch: Index Settings OpenSearch: Index Settings |
Settings is also where you configure field analysis that your mapping properties will use. In general, avoid ngrams and shingles unless you need them for a specific purpose, as they will significantly increase spatial requirements of the fields using them.
Property | Description & Risks | Docs |
---|---|---|
N-gram token filter `"type": "ngram"` | Breaks words down into smaller pieces to assist with partial matching and fuzzy search. This will grow your index vocabulary in size, impacting the disk and memory requirements for the field. | Elasticsearch: N-gram Token Filter · OpenSearch: N-gram Token Filter |
Shingle token filter `"type": "shingle"` | Generates word n-grams ("shingles"), which assist with phrase search. With the default shingle size of two, every adjacent word pair is stored as an extra token alongside the originals, multiplying the terms indexed for the field. | Elasticsearch: Shingle Token Filter · OpenSearch: Shingle Token Filter |
Get to know all your analysis types!
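To get a feel for the ngram blowup, here's a rough Python imitation of what a character ngram filter emits with the documented defaults of `min_gram: 1` and `max_gram: 2`; a single six-letter token becomes eleven:

```python
# Rough imitation of a character ngram token filter, to show how the
# vocabulary grows. min_gram=1 and max_gram=2 match the engine's
# documented defaults for the ngram filter.

def ngrams(token, min_gram=1, max_gram=2):
    out = []
    for n in range(min_gram, max_gram + 1):
        # Slide a window of width n across the token.
        for i in range(len(token) - n + 1):
            out.append(token[i:i + n])
    return out

grams = ngrams("search")  # one token in, eleven tokens out
```

Multiply that across every word of every document in a big text field and the disk and memory numbers in the table above stop being abstract.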
Aggregations
I saved the best for last, because this is a query-time footgun that I see all too often. Aggregations are complex counting operations with high I/O and CPU use, and when used without care they will saturate your CPU and make latency terrible for everyone. When serving large corpora at high query loads, be sure to optimize your aggregations and use them only where necessary. Also couple them with strict matching/filter criteria in your query to ensure you're not aggregating across too much data.
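What "strict matching/filter criteria" looks like in practice: put cheap, cacheable `filter` clauses in the query so the aggregation only ever touches a narrow slice of the index, and set `size: 0` when you don't need the hits themselves. A sketch; all field names and values are illustrative:

```python
# Sketch of an aggregation constrained by filters so it only touches a
# narrow slice of the index. All field names and values are illustrative.
body = {
    "size": 0,  # we only want the aggregation, not the matching docs
    "query": {
        "bool": {
            "filter": [
                # Filter clauses don't score and are cacheable, so they're
                # the cheap way to shrink the doc set before aggregating.
                {"term": {"store_id": "eu-west-42"}},
                {"range": {"sold_at": {"gte": "now-7d"}}},
            ]
        }
    },
    "aggs": {"weekly_revenue": {"sum": {"field": "price"}}},
}
# Send as the body of POST /{index}/_search.
```

The same aggregation run with `"query": {"match_all": {}}` would walk every document in the index; the filters are what keep it from doing so.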
Conclusion
Well, that's all for now. Remember, Bonsai is here to take away the pain. Stay green, stay happy, and stay safe out there folks!