The Challenge: Cross-matching two massive, constantly changing data sets
Outvote is an app that encourages voter turnout using a tried-and-true technology: SMS. Instead of relying on spammy anonymous texts, Outvote matches users to their contacts who are likely to vote for progressive candidates in upcoming elections. These users can then send texts encouraging those contacts to vote.
Each Outvote user’s contacts were cross-checked against the DNC’s voter file, a data set of 260 million American voters and their basic information. Outvote’s Lead Engineer Jeff Quinn recognized that this match had to happen quickly to preserve the user experience. “We needed to process all of a user’s contacts and run them against ten Elasticsearch queries, all within the signup period. We have about five seconds to do all of that.” On top of this, Outvote saw massive user spikes whenever the press covered the service. This meant more concurrent connections between users’ data and the DNC voter file data in Elasticsearch. “At peak, we needed 256 concurrent connections,” said Quinn.
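To make the constraint concrete, here is a minimal sketch of fanning out queries under a concurrency cap and an overall deadline. It is not Outvote's code. The function run_query is a hypothetical stand-in for a real Elasticsearch request, the assumption that ten queries run per contact is one reading of Quinn's description, and the figures of 256 and five seconds are simply borrowed from the quotes above as illustrative limits.

```python
import asyncio

MAX_CONCURRENCY = 256     # illustrative cap, borrowed from the peak figure quoted above
QUERIES_PER_CONTACT = 10  # assumption: ten queries are run for each contact
DEADLINE_SECONDS = 5.0    # illustrative budget, borrowed from the signup window quoted above


async def run_query(contact: str, query_id: int) -> dict:
    """Hypothetical stand-in for one Elasticsearch search request."""
    await asyncio.sleep(0.01)  # simulate network latency
    return {"contact": contact, "query": query_id, "hit": query_id == 0}


async def match_contacts(contacts, max_concurrency=MAX_CONCURRENCY,
                         deadline=DEADLINE_SECONDS):
    """Run every (contact, query) pair, never exceeding max_concurrency at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(contact, query_id):
        # The semaphore limits how many requests are in flight at the same time.
        async with semaphore:
            return await run_query(contact, query_id)

    jobs = [bounded(c, q) for c in contacts for q in range(QUERIES_PER_CONTACT)]
    # wait_for cancels outstanding work and raises a timeout error
    # if the whole batch does not finish within the deadline.
    return await asyncio.wait_for(asyncio.gather(*jobs), timeout=deadline)


def match_contacts_sync(contacts, **kwargs):
    """Convenience wrapper for callers that are not already in an event loop."""
    return asyncio.run(match_contacts(contacts, **kwargs))
```

Calling match_contacts_sync(["alice", "bob"]) returns one result per contact and query, in submission order. A real service would more likely share one cap across all signed-in users rather than create one per call, which is closer to the cluster-wide connection limit Quinn describes.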
Quinn had his hands full with the DNC’s data set, which is over half a terabyte in size. “A standard upload of this data set would take a week,” said Quinn. The DNC voter file data set also changed at least once a month, and sometimes every few weeks.
The Solution: Heroku, Bonsai, Hadoop, and Spark
It was gratifying work for Quinn, but he knew it would not be easy. Although Quinn had worked in data engineering his entire career, at Outvote, he was “an Ops team of one.” Due to time and personnel constraints, he opted to use Heroku and Bonsai. “The plug-and-playability of Bonsai and Heroku was pretty valuable,” said Quinn. “Just being able to click that button or write a Heroku CLI command and get your cluster up and running was pretty nice.”
Rather than waiting a week each month to upload the DNC voter file data set, he opted to use a combination of Hadoop, Apache Spark, and Elasticsearch to speed things up. Quinn loaded the data into the Hadoop Distributed File System, and then used Apache Spark scripts to populate the data into his index on Bonsai.
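Quinn's Spark scripts are not shown, so the following is only a sketch of the kind of work a bulk load involves: splitting records into batches and serializing each batch in the newline-delimited format that the Elasticsearch Bulk API accepts. The index name, the voter_id field, and the batch size are hypothetical. In a Spark job, each partition would typically do something along these lines, or hand the work to a connector library.

```python
import json
from itertools import islice
from typing import Iterable, Iterator, List


def batched(records: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield successive lists of at most `size` records."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def bulk_body(index: str, records: List[dict], id_field: str = "voter_id") -> str:
    """Serialize records as a Bulk API payload: an action line, then a source line."""
    lines = []
    for record in records:
        # `id_field` is a hypothetical key; using a stable ID makes reloads idempotent.
        lines.append(json.dumps({"index": {"_index": index, "_id": record[id_field]}}))
        lines.append(json.dumps(record))
    # The Bulk API requires the payload to end with a newline.
    return "\n".join(lines) + "\n"
```

Each string returned by bulk_body would be sent as the body of a POST to the cluster's _bulk endpoint with the content type application/x-ndjson. Sending many such batches in parallel from a cluster of workers is what turns a week-long sequential upload into a much shorter job.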
Outvote’s data was quite large, so they opted for a customized Enterprise cluster. Their needs were unusual enough that a member of Bonsai’s Ops team reached out to them. “Bonsai saw that our data set was extraordinary, so they provisioned a special cluster for our use case. They also increased our concurrency limits for when our traffic spikes hit.”
Quinn took advantage of Bonsai’s production-ready cluster configuration frequently. When the DNC’s data set would change, he would perform a blue-green deployment.
“Whenever I had to do an update, I would just spin up another Bonsai cluster, load the new data on that, and then do a quick switchover without any downtime.”
— Jeff Quinn, Lead Engineer, Outvote
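The switchover Quinn describes can be sketched in a few lines. This is an illustration of the blue-green pattern in general, not of Outvote's or Bonsai's tooling: the Cluster and Router classes, the example URLs, and the readiness check are all hypothetical. The idea is that live traffic follows a single pointer, and the pointer only moves once the freshly loaded cluster has passed a check.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Cluster:
    name: str
    url: str            # hypothetical endpoint; a real one would come from configuration
    doc_count: int = 0  # stand-in for whatever readiness signal is available


class Router:
    """Holds the cluster that live traffic is currently sent to."""

    def __init__(self, active: Cluster):
        self.active = active

    def switch_to(self, candidate: Cluster,
                  is_ready: Callable[[Cluster], bool]) -> Cluster:
        """Point traffic at `candidate` if it passes the check; return the old cluster."""
        if not is_ready(candidate):
            # Traffic stays on the current cluster, so a bad load causes no outage.
            raise RuntimeError(
                f"{candidate.name} failed readiness check; staying on {self.active.name}"
            )
        previous, self.active = self.active, candidate
        return previous
```

In use, the old cluster keeps serving requests while the new voter file loads into the second one. After switch_to succeeds, the returned cluster can be kept briefly as a rollback target and then torn down.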
The Results: Hockey stick growth, without hockey stick growth problems
Outvote took off months before the 2018 midterm elections. “The growth was pretty horrifying,” joked Quinn. Fortunately, he and the Bonsai team had planned accordingly. Outvote was featured in TechCrunch, the New York Times, moveon.org’s mailing list, and many other mainstream publications. “Whoopi Goldberg was even talking about Outvote on ‘The View,’” said Quinn. He relied on Bonsai’s Metrics Dashboard to ensure the cluster never faltered through the user spikes.
“We stayed online throughout these situations without any degradation in service.”
— Jeff Quinn, Lead Engineer, Outvote
Outvote’s “Ops team of one” managed to maintain its service throughout the midterms. Growth never became an issue, so Quinn is optimistic that his infrastructure can survive future election cycles. “Even though usage was spiking, latency was not.”