Apr 22, 2021
In the past week, we had the opportunity to help a customer diagnose a fairly convoluted issue that led to increasingly stale results. The problem began when we received system alerts and Slack messages from the customer about poor performance and a write queue that was growing rapidly and uncontrollably. The node would slow to a crawl, leading to a growing write queue. Rebooting the node would cause the issue to migrate around the cluster, while writes would only ever be partially applied. From their users’ perspective, the application’s search results were stale.
After our team spent several hours diagnosing the problem, the root cause was found to be a regular expression (RegEx) the customer was using to extract patterns for possessive tokens. This particular RegEx worked pretty well for most data, but it didn’t have any protections against edge cases. So when a certain document was indexed with a field containing data that caused RegEx backtracking to spend several minutes attempting to process. This resulted in lock contention for indexing to a specific index, which blocked all indexing threads, causing the queue to grow until ejected execution errors occurred. Only a single CPU core was pegged.
What made this incident harder to troubleshoot was that the data being ingested was coming from user inputs. Somewhere in a workflow, someone had posted something in an unusual format with control characters, which was unexpected (and unsanitized). This is what ultimately triggered the RegEx to be pathological.
Part of the reason this took so long to diagnose was that it’s a very unusual error case. This customer has highly variable traffic, and many TB of data. Often when we see performance problems, it can be addressed by adding some temporary capacity by way of additional nodes. When that didn’t resolve the issue, our operations team explored whether the hardware on the master node was failing (another, less common but still plausible situation). Replacing the node didn’t help, and the problem spread to the new master node.
When we realized that the normal remedies were not effective, we started exploring other possibilities. It turned out the smoking gun was in the hot threads API, which showed something like this:
<pre><code>99.9% (499.3ms out of 500ms) cpu usage by thread 'elasticsearch[ip-10-10-21-146][generic][T#4]'
7/10 snapshots sharing following 47 elements
What this tells us is that Elasticsearch was spending more than 99.9% of the thread’s time on processing a regular expression. That’s a clear indicator of something pathological. Additionally, we noticed all other indexing threads were blocked, waiting for their turn to write data for that index.
Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems. - Jamie Zawinski
Simply put, regular expressions are a way for humans to explain patterns to machines. They aren’t useful when you’re looking for something specific, but very useful when what you’re looking for follows a given pattern. For example, if you have a system log and want to find all the IP addresses it contains, you can “describe” what an IP address looks like to the computer, which can then process the file and extract all the matches it find (which, hopefully, are all actual IP addresses).
Regex is very useful at finding these patterns, but this speed and efficiency comes with a cost: it’s really easy to break things or hit edge cases. It’s also possible to craft a pattern that the computer can understand most of the time. But rarely, the pattern, applied to a certain piece of data can result in the one thinking about whether or not the data matches the pattern. When this happens, the computer can’t do anything else and the system locks up.
Imagine a bank teller (this represents the CPU) is tasked with handling customer transactions (this represents the data input). The transactions are all different, but usually follow a simple pattern: put X dollars into Y account. So the teller is given instructions: increase/decrease the balance of account Y by X. This works perfectly, and the teller is able to handle 100 customers per day.
Then one day a customer walks in and says: “I would like to X^Y dollars to account Y^X.” How does the poor teller process this using her training? The dependency of the contribution on the account, and vice versa, far exceed the teller’s training, and she spends all her time trying to work out what to do, while more and more customers are filling up the bank.
Eventually the bank reaches its maximum capacity and the fire marshall (the SRE) kicks them all out.
Under the hood, Elasticsearch is using Java’s regular expression engine to perform pattern analysis. The way this engine is implemented allows for blazing speed at the cost of safety checks like early termination. Pathological RegEx, like the kind that overwhelmed the poor teller in the previous example, are possible, and the onus for mitigating against such patterns is on the developer.
Once we realized that this was a regular expression at the root of the problem, we had to review the slow indexing logs to figure out which document was actually causing the RegEx engine to choke. When we found the document, we were able to overwrite it by updating the problematic field to be empty, and setting the document version to a large number. This solution was fortunate because this customer controlled their versions externally. Most customers don’t do this.
Updating the version field was a stop gap. Ultimately, we worked with the customer to refactor their analyzer to put boundary conditions on the expression ^ and $ simplistically. This made the expression less “greedy” and mitigated against the root problem.
RegEx is powerful, but it’s very difficult to do it right. It can be tricky to write one for real world data that has 0 edge cases. This is especially true when dealing with customer inputs; it’s common knowledge these days to sanitize database inputs, but it’s less common to think about sanitizing RegEx inputs, especially with engines that don’t provide safety mechanisms like early termination.
These edge cases can lead to serious problems and downtime later on. Keep this in mind when trying to solve an issue with RegEx. Or be prepared to have two problems.