Dec 17, 2021
“The [Log4J] vulnerability is particularly dangerous, cybersecurity experts say, because it impacts such a wide range of programs—nearly everything written in Java or that relies on software written in Java, ranging from products made by Amazon to Apple.”
Developers worldwide worked through the weekend to defend against the Log4J vulnerability.
It all started when Minecraft discovered a potential zero-day exploit and made the findings public. The discovery has ramifications for almost every technology company and potentially affects hundreds of millions of devices and apps that use the logging library Log4J. Fortunately, news about the exploit spread quickly throughout the hacker community, allowing teams like ours to respond immediately to the threat.
At Bonsai, we learned about the vulnerability on Thursday night. Our CEO Nick Zadrozny saw tweets about the incident. “I noticed some people on Twitter writing about a possible zero-day, remote code execution vulnerability in Log4J. This was just a few posts, there was no official statement yet, but those words together are very scary.”
Nick texted Bonsai’s CTO, Dan Simpson, and shared a post in the community Slack to let everyone know something was going on. We didn’t know much yet. But when zero-days like this occur, it’s best to prepare for the worst.
It was late in the evening. Nick contacted Sam Oen, Bonsai’s platform engineer in Australia, who was just beginning his day. Nick asked him to review what people were posting and assess what kind of exposure Bonsai might have to the vulnerability.
It was still late Thursday evening. By now, other companies had created example exploits, which Dan tested. Early tests were promising. We were able to narrow our focus to the subsets of our deployments that use Log4J and were most likely affected, the places where an exploit was most likely to succeed. Dan wasn’t able to reproduce the flaw.
A good sign, but not a satisfying conclusion. As Nick describes it, “Absence of evidence in this case did not warrant evidence of absence.” But the negative tests at least provided our team enough confidence to take the night off and regroup in the morning.
Friday morning we dug into the research as a team. More details had emerged overnight, and it seemed that most companies began their response that morning. By then, official statements were being published, offering better insight. The cause of the flaw was clearer. We ran additional tests, all of which came back negative. Again, this wasn’t enough to keep us feeling safe.
We decided to go with a full mitigation, updating our entire fleet, to remove any possibility of a vulnerability. Within the first hour or two of our working day, we had four people on call at our company working through the incident: getting oriented on what was happening, creating a plan of action, divvying up the tasks, and communicating with clients.
We created a status page on our website to let clients know we were investigating the issue and provide public updates about the analysis we were performing along the way.
The mitigation itself was straightforward, just one configuration setting. But the most sensitive and nontrivial task in our business is restarting servers. Bonsai’s services are built around firm guarantees of uptime and availability. We design our system so that any server can be restarted without causing downtime. But we can only restart one server at a time, per cluster. And we have hundreds of clusters that service thousands of virtualized customer environments.
All to say: planning the sequencing of rollouts takes a coordinated effort. So that became the biggest part of our initial response. We created a plan to divide and conquer to get this rolled out. Four team members worked in parallel to apply the configuration changes, push those changes to the servers, and initiate a restart—all while monitoring the stability of the systems.
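To make the constraint concrete, here is a minimal sketch of that kind of rollout: servers within a cluster restart strictly one at a time, while several clusters proceed in parallel. It is an illustration only, not our actual tooling. The names `rolling_restart`, `restart_node`, `wait_until_healthy`, and `max_parallel_clusters` are hypothetical, and the two callbacks stand in for whatever applies the configuration change and checks cluster health.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def rolling_restart(
    clusters: Dict[str, List[str]],
    restart_node: Callable[[str, str], None],
    wait_until_healthy: Callable[[str], None],
    max_parallel_clusters: int = 4,
) -> Dict[str, List[str]]:
    """Restart every node, one at a time per cluster, several clusters at once.

    All names here are hypothetical placeholders for illustration.
    """

    def restart_cluster(name: str) -> List[str]:
        restarted: List[str] = []
        for node in clusters[name]:
            # Only one node per cluster is ever down at a time.
            restart_node(name, node)
            # Block until the cluster reports healthy before touching the next node.
            wait_until_healthy(name)
            restarted.append(node)
        return restarted

    # Parallelism is across clusters only, capped by max_parallel_clusters.
    with ThreadPoolExecutor(max_workers=max_parallel_clusters) as pool:
        futures = {name: pool.submit(restart_cluster, name) for name in clusters}
        return {name: future.result() for name, future in futures.items()}
```

The health check between restarts is what makes the process slow: a large cluster has to recover fully after each server comes back before the next one can go down.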
The process was time-consuming. In a search engine like Solr or Elasticsearch, it can take anywhere from a few minutes to a few hours to restart a cluster, depending on the number of servers, the volume of data, and resource utilization and traffic.
Even running multiple restarts in parallel, the process took all of Friday, overnight into Saturday (with the help of Sam in Australia), all day Saturday, and half of Sunday. We completed all the updates by midday. We started with the most at-risk systems and even scheduled some of our clients for specific windows during their lowest amount of traffic (thereby providing the lowest risk to their operations).
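The ordering described above can be sketched roughly as follows. This is a simplified illustration under assumptions, not our real scheduler: the `Cluster` fields, the integer `risk` score, and the `quiet_hour_utc` window are all hypothetical. Clusters without a requested window go first, highest risk first; clusters with a window are deferred and ordered by how soon their low-traffic hour arrives.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Cluster:
    name: str
    risk: int  # hypothetical score: higher means more exposed
    quiet_hour_utc: Optional[int] = None  # hypothetical low-traffic hour (0-23)


def plan_rollout(
    clusters: List[Cluster], now_hour_utc: int
) -> Tuple[List[Cluster], List[Cluster]]:
    """Split clusters into (restart now, restart later) in rollout order."""
    # No requested window: restart immediately, most at-risk first.
    immediate = sorted(
        (c for c in clusters if c.quiet_hour_utc is None),
        key=lambda c: -c.risk,
    )
    # Requested window: wait for it, soonest window first.
    deferred = sorted(
        (c for c in clusters if c.quiet_hour_utc is not None),
        key=lambda c: (c.quiet_hour_utc - now_hour_utc) % 24,
    )
    return immediate, deferred


now, later = plan_rollout(
    [
        Cluster("a", risk=1),
        Cluster("b", risk=5),
        Cluster("c", risk=3, quiet_hour_utc=2),
        Cluster("d", risk=9, quiet_hour_utc=23),
    ],
    now_hour_utc=22,
)
print([c.name for c in now])    # ['b', 'a']
print([c.name for c in later])  # ['d', 'c']
```

In this sketch a requested window always takes precedence over the risk score. A real plan would have to weigh the two against each other.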
And that’s how it happened.
We continue to research the flaw. The Log4J vulnerability is still new, and information is still emerging. It looks like Elasticsearch has a relatively small amount of exposure. Nonetheless, we didn’t see that as a reason to ignore the possible risks. A data exfiltration bug is always a high-priority security problem, and this one couldn’t wait for more information.
At the end of the day, we consider ourselves lucky. We’re lucky that Elasticsearch is strict about how it uses Java. And we’re lucky to have invested in good tooling to apply these kinds of updates reliably and with a high degree of confidence.
We use Java in other parts of our business. Fortunately, we use other logging libraries that aren’t Log4J. Nick explained, “Some deliberate homogeneity in our infrastructure served as a blast zone to focus the impact of this vulnerability to a very limited area.”
We are satisfied to report at this point that we don’t have any lingering exposure to the Log4J vulnerability.
Nick said, “This was a stress test the likes of which we don’t get to face often.” It’s been more than five years since we had to conduct a full restart of our platform. So at the end of the day, while we endured Log4J unscathed, the response was a good exercise for our team to go through. We are happy with how the process went and also found new ways to enhance our platform to be even more prepared for future security scares.
In summary, we find ourselves very fortunate. While no exploitation was discovered within our system, we’re proud of our team’s response and hard work through the weekend.
Zero-day vulnerabilities at this scale are always a race against the clock. Our philosophy is to bias toward action. When warning signs emerge, as they did on Twitter Thursday evening, it’s better to assume the worst than to wait around for more information. Every day without a response increases your risk.