Mar 25, 2025

Winning the Relevancy Game

Nick Zadrozny
10 min read

You did it. You found the right tweak to the right part of the query, and it just shipped. Now you're running that test query in production, and there it is: the missing document shows up right where you thought it should, and the irrelevant one is buried.

You close the ticket. Tomorrow, a dozen more take its place.

Welcome to the game of Search Relevancy Whac-A-Mole.

A lot of thinking about relevancy happens in terms of where we want to make changes. Customers send in surveys with feedback about the things they can't find. Execs want to see charts move up and to the right. People working on search have their own pet queries that they like to test.

This is a rite of passage for engineers working on search. You dive in with some example queries, and example documents that should match. You figure out some tweaks to the query, or the document's contents, or both, and you make those queries better. Only to notice that improving one query is starting to make another worse.

There is plenty of search work to be done where you don't want results to change. That may well be all the other queries you're not paying attention to right now. Or it may be a major version upgrade that you expect to behave similarly. Or it may be a careful migration to another search engine entirely, where you'd like the baseline relevancy not to degrade.

Managing change over time is not necessarily a new problem for complex software systems. How do we ensure that a change in one place doesn't cause problems in another? We write a test. We run those tests.

Testing relevancy can be a pretty heady topic (and one I want to write about), but for those who are overwhelmed by the possibilities, some basic regression testing in CI is a great place to start. So where do we start?

  1. First, we need some useful data.
  2. Next, we need some queries.
  3. Last, we need a baseline for the results of those queries.

Data

Data is the reason people are searching your site, not Google or ChatGPT. You have something they don't: structure, originality, authority. Perhaps your site fits within a domain that can benefit from some kind of standardized testing, but for most clients I work with, there is no substitute for your own data.

Developers will typically have some scripts to seed your local database. Or perhaps a curated sample database, or test environment. But relevancy testing may want more, and I think it's best to get as close to the full corpus as possible.

Some apps may be able to index the full data set regularly. I don't know about you, but I have a comfortable 100+ GB of capacity on my four-year-old laptop ready to be put to work.

If that's not your situation, you may need to sample or simulate. Perhaps that means building some kind of ETL from prod, sanitizing anything sensitive. I like to pg_dump, sanitize, sample, and save a test set from a primary database.
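
Here's a minimal sketch of that idea in Ruby, using the pg gem. The database, table, and column names are hypothetical, and the query only pulls columns that are safe to keep; adapt it to your own schema and to whatever counts as sensitive in your data.

require "pg"
require "json"

# Connect to a replica or a restored pg_dump copy, not the live primary.
conn = PG.connect(dbname: "myapp_replica")

File.open("test_corpus.jsonl", "w") do |out|
  # ORDER BY random() is fine for modest tables; consider TABLESAMPLE for huge ones.
  sql = "SELECT id, title, description FROM products ORDER BY random() LIMIT 50000"
  conn.exec(sql) do |result|
    result.each do |row|
      # Only non-sensitive columns are selected above; mask anything else you keep.
      out.puts(JSON.generate(row))
    end
  end
end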

Does your data refresh itself continuously? Yeah, you can go down a whole rabbit hole there if your use case is, say, the news; or primarily user-generated activity. That said, remember that your first job is to build a reasonably general-purpose model. It's useful to assume that search results which work well on a snapshot of your corpus will probably work well enough on new documents as well.

Ultimately, don't worry about starting with something you feel is imperfect, so long as you are starting.

Queries

Queries come from your query logs. (You do have query logs… right?)

Similar to your data, you're going to want to use enough queries to have a useful volume of testing. This doesn't need to be comprehensive: beyond a certain size, any additional time and expense spent here may slow down velocity, or could be better budgeted toward more valuable kinds of tests.

So let's assume some sampling here as well. As a quick recap, head queries are the high-volume stuff you already pay attention to. I like to call "face" queries the "in-your-face" (or "face-palm") queries that stand out for some particular reason; together with the rest of the head, they're probably what you're using to test manually.

With modest traffic, the torso and tail can get to be quite long, and are where you really start to exercise your model. We don't want to neglect them, but we also don't want to over-emphasize them. I like weighted random sampling for this kind of job. If you like reading papers, you can check out Weighted Random Sampling (2005; Efraimidis, Spirakis) (PDF).

But the punchline is this: for each query, draw a uniform random number and combine it with the query's weight (say, its frequency in your logs) to get a sampling key:

rand ^ (1.0 / weight)

And then sort your queries by that key, and select as many as you like in descending order. That's a plenty useful sample of queries to test against.

And if you want to do this in Ruby, I have a very old and simple gem that you can use: omc/enumerable-weighted-sample. Don't be dismayed by the age: this thing has been making many tens of thousands of decisions for me all day, every day, since its inception.
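
If you'd rather sketch it inline, here's roughly what that looks like in Ruby (note that Ruby spells exponentiation as **, not ^). The query counts below are made up for illustration; in practice you'd load them from your query logs.

query_counts = {
  "for whom the bell tolls" => 1200,
  "the sun also rises"      => 450,
  "the old man and the sea" => 87,
}

sample_size = 2

sampled = query_counts
  .map { |query, weight| [query, rand ** (1.0 / weight)] } # one key per query
  .sort_by { |_query, key| -key }                          # descending by key
  .first(sample_size)
  .map { |query, _key| query }

puts sampled

Queries with big weights end up with keys near 1.0 most of the time, so the head reliably makes it into the sample, while the torso and tail still get a fair shot.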

Capture the Baseline

Capturing a baseline means: run the query, and save the results. We don't have to over-think this one; you can easily serialize this as a two-column CSV.

query,ids
"for whom the bell tolls","1,2,3,4"

Fiddle with this in whatever way seems most helpful as your testing evolves; I'm just trying to get things started. You could have a column for each of the, say, top 10 results. But I like a serialized list here because results lists may vary in their lengths.

So load up your test queries, run them, and save the results. I personally don't mind committing these as static files in a git repo, and letting git compress the contents and subsequent diffs. Any changes here will need to be available in CI, and made visible in whatever PR process changes the relevancy model, whether intentionally (a diff) or inadvertently (a test failure).
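
As a sketch, and assuming a run_search helper that wraps however you actually query your engine (an Elasticsearch client call, an HTTP request to your app, whatever), capturing that CSV can be as small as this. The file names here are placeholders too.

require "csv"

# One query per line, from the weighted sample above.
queries = File.readlines("test_queries.txt", chomp: true)

CSV.open("baseline.csv", "w") do |csv|
  csv << ["query", "ids"]
  queries.each do |query|
    ids = run_search(query, limit: 10) # hypothetical helper returning an array of IDs
    csv << [query, ids.join(",")]
  end
end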

A more interesting step is what happens next.

Measuring against the baseline

Let's first remind ourselves that relevancy has an interesting property that we're trying to protect. Our systems are trying to return certain results to users, in a certain order.

If we re-run our test queries, we might start by testing whether the new results are identical to the previously captured baseline. Which is to say: the same exact results, in the same exact order.

By all means, feel free to start there. It is the highest possible standard. And in some respects it is a property worth noticing.

But remember also: we don't run tests because nothing is changing. We run tests because something changed, and we expect the output of some other system not to be materially influenced by that change. Unexpected differences could come from an accidental change, or from some other source of nondeterministic behavior.

A good example is upgrading your search engine's version. Changes to underlying libraries may subtly change the formulas helping to calculate relevancy under the hood. Or even migrating between two search engines entirely: like Elasticsearch to OpenSearch, or Algolia to Elasticsearch, SQL LIKE to Apache Solr, and so on. Ostensibly these search engines have the same job — return certain documents in a certain order — but are going to go about those jobs in different ways.

In my experience, teams who start with exact matching quickly start to notice that a degree of difference between their baseline and a test can be acceptable. Say for a given query, the document in position six is merely transposed with the document in position five. It's not identical, but is it really such a regression?

I like Rank-Biased Overlap

If you are a fellow relevancy nerd like me, this is the point where it is oh-so-tempting to get into the weeds about different comparisons. But if you're a pragmatist who needs to get some real work done, then, by all means, start with Rank-Biased Overlap.

Here's another reference for the paper-inclined: A Similarity Measure For Indefinite Rankings. And you can search for libraries in your language of choice which implement it. I've been using the rbo crate in Rust.

The nice properties of RBO, to me:

  • Measurement is normalized on a scale from 0.0 (perfectly dissimilar) to 1.0 (identical), so you get your identicality checks, if you want to capture that.
  • It captures changes in the ordering of results.
  • It does not require that both sets be of identical length.
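
To make that concrete, here's a small Ruby sketch of extrapolated RBO, following the Webber, Moffat, and Zobel paper. It's simplified: it truncates both lists to a common depth and assumes result IDs are unique within each list; the paper covers uneven lengths more carefully if you need that. The p parameter is RBO's persistence, which weights attention toward the top of the list (p = 0.9 puts most of the weight on roughly the first ten results).

def rbo(list_a, list_b, p: 0.9)
  depth = [list_a.length, list_b.length].min
  return 1.0 if depth.zero?

  seen_a = {}
  seen_b = {}
  overlap = 0
  sum = 0.0

  (1..depth).each do |d|
    a, b = list_a[d - 1], list_b[d - 1]
    # Grow the overlap count as each prefix gains one more element.
    overlap += 1 if seen_b[a]
    overlap += 1 if seen_a[b]
    overlap += 1 if a == b
    seen_a[a] = true
    seen_b[b] = true
    sum += (1 - p) * p ** (d - 1) * (overlap.to_f / d)
  end

  # Extrapolate by assuming the agreement at the final depth carries on.
  sum + (overlap.to_f / depth) * p ** depth
end

baseline = %w[1 2 3 4 5]
current  = %w[1 2 3 5 4]    # last two positions transposed
puts rbo(baseline, current) # roughly 0.98, rather than exactly 1.0

In CI, you can compare each query's fresh results against its baseline row and flag anything whose RBO drops below a threshold you're comfortable with.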

There are some other useful algorithms in this space. For example, Jaccard similarity is a simpler comparison based on set intersection and union. It measures differences in the contents of the result set, but not ordering, and may be faster to compute at large scales. Or it may make more sense for use cases that care more about the entirety of the result set than the specific ordering.
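
For comparison, a Jaccard sketch over the same two lists ignores order entirely, so the transposition in the example above scores a perfect 1.0:

require "set"

def jaccard(list_a, list_b)
  a = list_a.to_set
  b = list_b.to_set
  return 1.0 if a.empty? && b.empty?
  (a & b).size.to_f / (a | b).size
end

puts jaccard(%w[1 2 3 4 5], %w[1 2 3 5 4]) # => 1.0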

When is this useful?

Very focused changes. I started this article with the example of relevancy "whac-a-mole." Sometimes you have the results of a very specific query, or set of queries, that you want to influence without regressing others. So go make that focused change, and update its baseline to match, iterating with your regression test on the rest until any changes are within acceptable ranges.

Apples-to-apples upgrades. Use this for major version upgrades of your search engine, to ensure that you have an early heads up of any subtle changes under the hood.

Search engine migrations are worth particular attention here as well. You know there will be underlying changes, both in the technology, and likely also in how the logic of queries is expressed. You can start from the worst offenders in your regression tests, troubleshoot why they're different, and gradually close the gap toward something acceptable.

General drift over time. We assume our changes in one area won't influence another, because it's nice to forget about the little complexities and focus all of our attention on one particular problem at a time. I don't know about you, but for me, running a regression test periodically lets me stay in that flow because I trust the system just that little bit more.

Further resources

Want to learn some more about selecting documents for economies of testing at scale? I liked this Haystack presentation from Karel Bergmann talking about offline evaluation at Getty Images.

And the OpenSource Connections blog has this deeper dive article from Nate Day on Probability-Proportional-to-Size query sampling. With more helpful diagrams than I included here!

Help me out?

Just like your search relevancy gets better with user queries, content only gets better with readers' questions! If you're reading this far, you may have some questions or challenges or experiences with search relevancy. So hit me up with those, either on LinkedIn, or email me at [email protected].

