May 11, 2023
Search is a perpetually frustrating feature.
With the rise of ChatGPT and AI-based chat – on top of a rising tide of ML-based search techniques – people are talking about search more than ever. Google feels under threat for the first time in a decade and companies large and small are reassessing how customers find information on their sites and apps.
Despite that renewed interest, even the standard search experience remains hard to build, hard to maintain, and hard to improve. The real frustration, however, comes from a contradiction: developers and CTOs alike agree that search is important, yet most search experiences remain mediocre and improving them appears impossible.
Over the past ten years, I’ve worked with developers the world over to help them improve the search experiences they build and their companies offer. My goal, always, has been to help engineers build highly functional and highly effective search experiences. But over those years, I’ve found one of the biggest problems is one that occurs almost before the process really begins.
The pattern is simple: Engineers decide to improve search – maybe an outside team needs a feature the current experience doesn’t provide; maybe an explosive growth in users has demanded new ways to scale; or maybe the feature is showing its age and users are unhappy – but whatever the reason for the improvement is, a cascade of problems almost inevitably occurs when the engineers even touch search.
This cascade is overwhelming but produces a bigger problem: A misframing of why you’re stuck. On March 30, I spoke to Haystack Live about why improving search is so hard and what’s really going on when you get stuck before you begin.
Search is rarely a feature teams are building from the ground up. Engineers get stuck, and ask for my help, when they reach a breaking point with their current search system (much of which they inherited), decide to finally fix it, and break under the weight of the almost inevitable technical debt.
A technical problem likely prompted the motivation to improve on search but the first problem engineers will actually face is an organizational one: Fragmentation.
Even engineers with well-scoped problems and seemingly simple improvements in mind will likely run into organizational problems first. And if you don’t prioritize the organizational challenge, that first change will be difficult, if not impossible, and every subsequent improvement will be just as difficult.
Most companies we’ve worked with are dealing with a fragmented system. And a fragmented system isn’t merely confusing – it creates friction and blockers.
A consequence of search’s consensus importance is that its capabilities are often spread across a wide variety of teams. Very rarely, I’ve found, has a company assigned an individual or team to “own” the search experience. It’s a feature that everyone wants input on but no one is accountable for.
That’s why, even if the initial problem appears small and technical, the top priority will quickly become project management and cross-departmental organization.
Here, the instinct to work on small improvements won’t work as well as you might imagine because even small improvements risk causing a domino effect of dependent changes and scope creep. Take these three scenarios as representative examples.
Scenario 1: Let’s say you decide to take on a seemingly small improvement, such as changing your mappings. But despite your efforts to scope this work down, questions will multiply:
If you pull on one thread of the search experience, a whole knot of problems will come with it.
Scenario 2: Let’s say the marketing team wants search to capture a metric they need. Again, this seems simple, but tracking a new metric without built-in support can be a heavy lift:
Don’t underestimate the cultural effects of scenarios like these. You want your marketing team involved and they’ll be discouraged if you can’t add that metric.
Scenario 3: Let’s say your company has grown and you have many more users to serve. Search is breaking down and you need to improve it but scaling your current search experience is more complex than you anticipated.
If Scenario 2 affected the marketing team, imagine how Scenario 3 will affect the executive team. You don’t want the company to be suffering from success.
These scenarios are bad enough but they all involve some amount of agency. What happens when an outside issue forces or reveals these cascading problems? What happens if a zero day like Log4j suddenly drops and you need to upgrade or risk exposing insecure infrastructure?
It’s tempting to misdiagnose these scenarios as instances of scope creep and to assume the solution is better prioritization, but the inevitability of the scope creep points toward an upstream problem: Architecture.
The above scenarios are so common that we realized there must be an upstream issue. To figure it out, we examined the history of the search industry and the histories of each search experience.
We were then able to work backward to see the architecture that typically emerges over time – layer on top of layer, feature on top of feature. And we saw a problem: Sprawl.
“Architecture” is the term here but urban planning provides a useful metaphor. Some cities (think Washington, D.C.) were intentionally designed with easily navigable grids and other cities (think Boston and London) have become incomprehensible knots of roads after centuries of haphazard development. Search architecture is much more like Boston, with cow paths paved into roads resulting in a messy, confusing sprawl.
If we take the typical search architecture as it stands and re-examine it all from first principles, two essentials emerge: The data pipeline and the search service. By understanding these first principles, we can rethink search architecture and find solutions to the upstream problems that make search development so frustrating.
The data pipeline involves all the work it takes to build your documents and the methods you use for indexing them.
Data pipelines are most effective when you run them through a buffer, such as Kafka. Consumers read the documents from the buffer and, in an ideal world, write to multiple indexes instead of just one.
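As a minimal sketch of that shape, the snippet below uses an in-memory queue in place of a real buffer and plain dictionaries in place of real indexes. The names `InMemoryBuffer`, `run_consumer`, `products_v1`, and `products_v2` are assumptions made for illustration; nothing here is Kafka's API.

```python
from collections import deque


class InMemoryBuffer:
    """Hypothetical stand-in for a durable buffer such as a Kafka topic."""

    def __init__(self):
        self._messages = deque()

    def publish(self, document):
        self._messages.append(document)

    def poll(self):
        """Return the next document, or None when the buffer is empty."""
        return self._messages.popleft() if self._messages else None


def run_consumer(buffer, indexes):
    """Read every buffered document and write it to each index."""
    written = 0
    while (document := buffer.poll()) is not None:
        for index in indexes.values():
            index[document["id"]] = document
        written += 1
    return written


buffer = InMemoryBuffer()
buffer.publish({"id": "1", "title": "red shoes"})
buffer.publish({"id": "2", "title": "blue shoes"})

# Two indexes fed from the same stream, e.g. a live index and one being rebuilt.
indexes = {"products_v1": {}, "products_v2": {}}
written = run_consumer(buffer, indexes)
```

Because the producer only knows about the buffer, adding a second or third index is a change to the consumer alone.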
Ineffective data pipelines will treat the search engine as the primary data store, which slows down the entire process and experience. More effective data pipelines are able to backfill the search engine from a separate, primary data store.
Even better data pipelines don’t just offer greater efficiency and scalability. The better your data pipeline is, the more easily you can manage change. If you want new document-level features, for example, an effective data pipeline would give you the ability to build a new version of an index so you can manage those feature changes.
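One way to picture that versioning is the sketch below, which assumes a primary store you can replay and an alias that readers query instead of a concrete index name. `PRIMARY_STORE`, `backfill`, and the `products` alias are hypothetical names, and dictionaries stand in for real indexes.

```python
# Hypothetical primary data store: the system of record, not the search engine.
PRIMARY_STORE = {
    "1": {"title": "Red Shoes", "price": 40},
    "2": {"title": "Blue Shoes", "price": 55},
}


def build_v1(record):
    return {"title": record["title"]}


def build_v2(record):
    # New document-level feature: a price field for filtering and sorting.
    return {"title": record["title"], "price": record["price"]}


def backfill(primary_store, build_document):
    """Rebuild a complete index from the primary store."""
    return {doc_id: build_document(rec) for doc_id, rec in primary_store.items()}


indexes = {"products_v1": backfill(PRIMARY_STORE, build_v1)}
alias = {"products": "products_v1"}  # readers only ever use the alias


def get(doc_id):
    return indexes[alias["products"]][doc_id]


before = get("1")

# Build the new version alongside the old one, then switch the alias.
indexes["products_v2"] = backfill(PRIMARY_STORE, build_v2)
alias["products"] = "products_v2"
after = get("1")
```

The old index stays in place after the switch, so rolling back is a matter of pointing the alias at `products_v1` again.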
The search service component, at its core, takes the user intent – as the user expresses it via the user interface – and rewrites it into a query language the search engine can understand and use.
Similar to the data pipeline component, seeing the search service as a component unto itself reveals effective and ineffective versions.
An ineffective search service is simplistic and merely passes intent to query.
An effective search service, in contrast, can write intent into two kinds of queries – one for supporting the query as is and one for fueling experimentation and iteration. The first query is a control and runs as previous queries ran. The second query supports iterative experiments, such as A/B testing.
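To make the control and experiment split concrete, here is a small sketch under a few assumptions: users are assigned to an arm by hashing a user id, and queries are plain dictionaries loosely modeled on a JSON query DSL. The `Intent` type, the field names, and the 50/50 split are illustrative choices, not a prescribed design.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Intent:
    text: str
    user_id: str


def control_query(intent):
    # The query as it has always run.
    return {"match": {"title": intent.text}}


def experiment_query(intent):
    # Hypothetical variant under test: also search descriptions, boost titles.
    return {"multi_match": {"query": intent.text,
                            "fields": ["title^2", "description"]}}


def bucket(user_id, experiment_share=0.5):
    """Assign a user to an arm deterministically, so they always see the same one."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    slot = int.from_bytes(digest[:8], "big") % 10_000
    return "experiment" if slot < experiment_share * 10_000 else "control"


def rewrite(intent, experiment_share=0.5):
    """Turn an intent into (arm, query) so results can be logged per arm."""
    arm = bucket(intent.user_id, experiment_share)
    query = experiment_query(intent) if arm == "experiment" else control_query(intent)
    return arm, query


arm, query = rewrite(Intent(text="red shoes", user_id="user-42"))
```

Returning the arm alongside the query matters: engagement can only be attributed to the right variant if the arm is logged with every search.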
Ideally, you can support an almost constant level of A/B testing so you can always be making data-backed improvements. But beyond step-by-step iteration, the ability to support multiple queries is also important for bigger changes.
If you want to migrate search engines, for example, you’ll want to have a search service that can send a query to your current search engine and send a second query to a search engine you’re testing.
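A rough sketch of that dual-query setup follows. It assumes each engine can be represented as a function from query text to a ranked list of ids, and that a simple overlap ratio is enough for a first comparison; `search_with_shadow` and the sample engines are hypothetical.

```python
def search_with_shadow(text, current_engine, candidate_engine, comparison_log):
    """Serve results from the current engine and record how the candidate compares."""
    current_ids = current_engine(text)
    try:
        candidate_ids = candidate_engine(text)
    except Exception as error:  # the candidate must never break live search
        comparison_log.append({"query": text, "error": repr(error)})
        return current_ids
    shared = set(current_ids) & set(candidate_ids)
    overlap = len(shared) / len(current_ids) if current_ids else 1.0
    comparison_log.append({"query": text, "overlap": overlap})
    return current_ids


# Hypothetical engines, represented as functions from text to ranked ids.
def current_engine(text):
    return ["a", "b", "c", "d"]


def candidate_engine(text):
    return ["b", "a", "e", "c"]


def broken_engine(text):
    raise TimeoutError("candidate unavailable")


log = []
served = search_with_shadow("red shoes", current_engine, candidate_engine, log)
fallback = search_with_shadow("red shoes", current_engine, broken_engine, log)
```

Users only ever see the current engine's results here; the candidate's output goes to the log, where it can be reviewed before any traffic is switched.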
With these two primary essentials in mind, the data pipeline and the search service, we can determine the baseline expectations a search experience has to provide.
Establishing a baseline lets us set expectations for any involved teams and hold ourselves accountable to what a minimally effective search experience should offer. But a baseline also gives us a foundation to plan out further improvements and work toward long-standing improvability.
The end goal isn’t perfect search, which isn’t really possible, but easily-improvable search – a search experience that doesn’t collapse into a cascade of problems whenever you attempt a small improvement.
The baseline search experience builds on a series of processes, each chaining into the next, and they all need to be functional. Many of the cascading problems I previously alluded to come from an incomplete baseline or a baseline with dysfunctional components.
To start: The search interface.
The search interface is what expresses search intent. Your application needs to be able to interpret user input and format the interpreted intent into a query language that it can then send to the search engine.
The search engine then needs to be able to run that query through an index that’s hosted on a server (wherever the server might be). The search engine also needs to be able to enrich that query with information about the user, such as the user’s location and any relevant information from their previous session.
But building a search interface that effectively expresses intent requires building an index.
An index is, fundamentally, a pre-computation of expected future queries. Because of that, when you build an index, you need to start from an understanding of the query’s end state – the results that a given user will eventually see.
Starting at the end will help you structure the features you build backward from, ensuring that your index includes everything that’s necessary to reach that endpoint.
The search interface and the index are only useful if users can use them, so scalability becomes another baseline component. Scalability sounds deceptively simple: You often need to shard the index so that several computers can share work that a single computer couldn’t do on its own.
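Sharding can be sketched in a few lines, assuming documents are routed by a stable hash of their id and queries are sent to every shard before the hits are merged. The functions below are illustrative, with dictionaries standing in for the shards of a real engine.

```python
import hashlib


def shard_for(doc_id, shard_count):
    """Map a document id to a shard with a stable hash."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def index_documents(documents, shard_count):
    shards = [{} for _ in range(shard_count)]
    for document in documents:
        shard = shards[shard_for(document["id"], shard_count)]
        shard[document["id"]] = document
    return shards


def search(shards, term):
    """Scatter the query to every shard, then gather and merge the hits."""
    hits = []
    for shard in shards:  # in production these calls would run in parallel
        hits.extend(d["id"] for d in shard.values() if term in d["title"].split())
    return sorted(hits)


titles = ["red shoes", "blue shoes", "red hat", "green scarf"]
documents = [{"id": str(i), "title": title} for i, title in enumerate(titles)]
shards = index_documents(documents, shard_count=3)
red_ids = search(shards, "red")
```

The "deceptive" part shows up as soon as `shard_count` changes: every document's shard assignment can move, which is one reason the ability to rebuild an index from a primary store matters.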
This baseline can’t be complete, of course, if users don’t see the results. Businesses rarely want raw results though, so even prior to presenting the results, a baseline system will need to be able to filter and re-rank results according to business requirements.
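As an illustration of that filter-and-re-rank step, the sketch below assumes two made-up business requirements: hide out-of-stock items and give a modest boost to certain brands. The field names, the boost value, and `apply_business_rules` itself are hypothetical.

```python
def apply_business_rules(results, boost_brands=frozenset(), boost=0.2,
                         in_stock_only=True):
    """Filter raw engine results, then re-rank them by an adjusted score."""
    kept = [r for r in results if r["in_stock"] or not in_stock_only]

    def adjusted_score(result):
        if result["brand"] in boost_brands:
            return result["score"] * (1 + boost)
        return result["score"]

    return sorted(kept, key=adjusted_score, reverse=True)


raw_results = [
    {"id": "a", "score": 1.0, "brand": "acme", "in_stock": True},
    {"id": "b", "score": 0.9, "brand": "house", "in_stock": True},
    {"id": "c", "score": 1.5, "brand": "acme", "in_stock": False},
]
ranked = apply_business_rules(raw_results, boost_brands={"house"})
ranked_ids = [r["id"] for r in ranked]
```

Keeping these rules in their own step, after the engine returns and before results are presented, means the business can change them without touching the index or the query.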
From there, you can present the filtered results to the user and the user can then engage with the results through a user interface and with the help of UI affordances.
This loop, which connects the interface to the query to the results and back to the interface, feels complete, but even a baseline experience requires some level of improvability.
Questions will inevitably arise:
Without a baseline level of improvability, even an initially effective search experience will eventually decay.
Thinking through the baseline above reveals many of the necessary processes and components of a baseline experience, but zooming out also reveals a big-picture problem: Tightly-coupled systems.
The question of whether a system is too tightly coupled or not has started too many engineering debates to count. Arguably, the whole debate relies on a misframing of the problem. Marianne Bellotti, a legacy-systems expert, argues in her book Kill It With Fire that software tends to benefit from being tightly coupled early in a company’s life but as the company evolves and grows, greater benefits are to be gained from refactoring to a system that is more loosely coupled.
That’s where many companies with fragile search experiences find themselves. When a search system is tightly coupled, separate components are dependent on each other and it’s likely a change in one component requires a change in another. This level of coupling limits both improvability and repair, meaning improvements to one component require improvements to others and changes meant to fix isolated flaws can create a cascade of further flaws.
Let’s say a company has built a user interface that – when it expresses an intent as a query – serializes that intent directly into the query language of the search engine. That directness produces a tight coupling between the interface and the search engine. That process might work well enough for the company now, but if they later on want to change or replace the search engine, they’ll have to change the user interface code too.
The takeaway here is that when you’re planning, building, or mapping the state of your search experience, you need to consider the practical realities of improvability. In the above scenario, replacing the search engine is still possible but it’s difficult, and if there are other priorities – and there always are – then that improvement might be delayed. At a certain point, a system can be so tightly coupled that it’s no longer effectively or efficiently improvable.
Dividing the search experience in two – into the data pipeline and search service – allows you to keep your system decoupled. The end goal is a system you can iterate on, a system with components you can improve without having to orchestrate the entire search infrastructure.
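One way to avoid the interface-to-engine coupling described above is to have the interface emit an engine-neutral intent and leave translation to the search service. The sketch below assumes two made-up targets, a JSON-style query DSL and a SQL-like store; `SearchIntent` and both translator classes are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass(frozen=True)
class SearchIntent:
    """What the user asked for, with no engine-specific syntax."""
    text: str
    filters: dict = field(default_factory=dict)


class QueryTranslator(Protocol):
    def translate(self, intent: SearchIntent) -> object: ...


class JsonDslTranslator:
    def translate(self, intent):
        terms = [{"term": {k: v}} for k, v in sorted(intent.filters.items())]
        return {"query": {"bool": {"must": [{"match": {"title": intent.text}}],
                                   "filter": terms}}}


class SqlLikeTranslator:
    def translate(self, intent):
        keys = sorted(intent.filters)
        clauses = ["title LIKE ?"] + [f"{key} = ?" for key in keys]
        params = [f"%{intent.text}%"] + [intent.filters[key] for key in keys]
        return "SELECT id FROM products WHERE " + " AND ".join(clauses), params


# The interface builds this once; only the translator knows the engine.
intent = SearchIntent(text="red shoes", filters={"brand": "acme"})
dsl_query = JsonDslTranslator().translate(intent)
sql_query, sql_params = SqlLikeTranslator().translate(intent)
```

With this split, replacing the search engine means writing a new translator rather than changing user interface code.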
With the right architecture, the right baseline features, and a design that prioritizes iteration, you can start to improve search over time and – maybe even more importantly – feel confident that you can make changes without causing cascading issues.
This feeling is important because the more doable improvement feels, the more likely you are to actually make those improvements. And the more often you propose improvements and follow through, the better the search experience gets and the more confident the team and company feel.
It’s not necessarily an easy path and the work I’ve described so far involves upfront effort for the sake of delayed gratification and reward. But it’s worth it and it’s helpful to imagine some of the improvements that will become possible.
An effective search service can translate user intent into different queries, which allows you to A/B test the accuracy of those queries. With these parallel queries constantly running, you can measure user engagement over time and fine-tune your search service so that it’s always getting better at identifying and translating intent.
A common failure mode is a search that returns redundant results. Redundancy can make users feel like the search feature they’re using isn’t accurate or that the result they’re searching for doesn’t exist (even if it does). With the right metrics and monitoring in place, you can find and log anomalies like these so that you can examine them, replicate them, and trace back the redundancy to whatever bug might have caused it.
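A simple redundancy check might look like the sketch below, which assumes that results whose titles match after normalization count as redundant and that only the first page matters. Real systems would likely need a richer notion of "duplicate"; `normalize` and `redundant_titles` are illustrative.

```python
import re
from collections import Counter


def normalize(title):
    """Lowercase a title and collapse punctuation and extra whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def redundant_titles(results, top_k=10):
    """Return normalized titles that appear more than once in the top results."""
    counts = Counter(normalize(r["title"]) for r in results[:top_k])
    return {title: n for title, n in counts.items() if n > 1}


results = [
    {"id": "1", "title": "Red Running Shoes"},
    {"id": "2", "title": "red running-shoes"},
    {"id": "3", "title": "Blue Running Shoes"},
    {"id": "4", "title": "RED  RUNNING SHOES!"},
]
anomalies = redundant_titles(results)
# anomalies == {"red running shoes": 3}
```

Logging the query alongside any non-empty `anomalies` gives you the material to replicate the problem later.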
There is a multitude of ways to identify and measure similarity (some of which Elasticsearch covers in its documentation). The goal of measuring similarity, typically, is to ensure you’re not presenting users with results that only seem to match intent given superficial similarities (such as similar text). With a robust search experience, you can implement sorting algorithms that score results according to a more granular similarity that includes syntactic and semantic content.
Performance is one of those values that many people tend to agree is important but few tend to prioritize. Performance is hard to improve and it’s not clear what kind of appreciable effects that effort will deliver.
I try to remind people, though, that even small improvements in performance can create lasting UX changes. It’s arguably more important, in some contexts, to be fast than to be accurate, because a responsive search experience encourages users to iterate on their search inputs and land on the right combination sooner.
Performance can quickly become a big issue if there’s an influx of data or users. But with a better-architected search, you can, for example, more easily change server types and shift sharding approaches. Eventually, it becomes relatively easy to either make changes in isolation or rebuild entire clusters as necessary.
A better search architecture also gives you flexibility. In the first example, I cited the ability to run A/B tests but that’s not always practical – no matter how good your search feature is. The primary pain point with A/B testing is that you need a significant amount of data for any result to be statistically meaningful.
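To get a feel for how much data "significant" means, here is a back-of-the-envelope sketch using the standard two-proportion sample size approximation. The click-through rates, the 5% significance level, and the 80% power are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, expected_rate, alpha=0.05, power=0.8):
    """Approximate searches needed per arm to detect a change in a rate."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_power = NormalDist().inv_cdf(power)
    variance = (baseline_rate * (1 - baseline_rate)
                + expected_rate * (1 - expected_rate))
    effect = expected_rate - baseline_rate
    return math.ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# Detecting a lift from a 10% to an 11% click-through rate...
small_lift = sample_size_per_arm(0.10, 0.11)
# ...versus a much larger lift from 10% to 15%.
large_lift = sample_size_per_arm(0.10, 0.15)
```

Under these assumptions, the one-point lift needs on the order of 15,000 searches in each arm, while the five-point lift needs well under a thousand. Small improvements, which are the common case, are the expensive ones to prove.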
But with a well-designed architecture, you can interleave results, which offers a cheaper form of iteration. By interleaving results, you can run a query on the control side and experimental side so that you can merge the results and present both to the user. You still get worthwhile results but the process requires less engagement and traffic to produce a meaningful signal.
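The merge step can be done in several ways. The sketch below uses a simplified team-draft style of interleaving, which is one common approach and an assumption on my part rather than the only option: in each round a coin flip decides which side picks first, each side contributes its best result not already shown, and clicks are credited to the side that contributed the clicked result.

```python
import random


def team_draft_interleave(control, experiment, length, rng):
    """Merge two rankings and remember which side contributed each result."""
    rankings = {"control": control, "experiment": experiment}
    interleaved, teams = [], {}

    def next_unseen(ranking):
        return next((doc for doc in ranking if doc not in teams), None)

    while len(interleaved) < length:
        order = ["control", "experiment"]
        rng.shuffle(order)  # coin flip: who picks first this round
        added = False
        for team in order:
            doc = next_unseen(rankings[team])
            if doc is not None and len(interleaved) < length:
                interleaved.append(doc)
                teams[doc] = team
                added = True
        if not added:  # both rankings are exhausted
            break
    return interleaved, teams


def credit_clicks(clicked, teams):
    """Count clicks per side; the side with more credit wins the comparison."""
    wins = {"control": 0, "experiment": 0}
    for doc in clicked:
        if doc in teams:
            wins[teams[doc]] += 1
    return wins


control = ["a", "b", "c", "d"]
experiment = ["b", "e", "a", "f"]
shown, teams = team_draft_interleave(control, experiment, 4, random.Random(7))
wins = credit_clicks([shown[0]], teams)
```

Every user sees results from both sides in a single list, so each search contributes to the comparison instead of only half the traffic.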
Throughout this post and the talk I gave at Haystack, I tried to emphasize the optimistic angle of re-architecting search: It’s more possible than you think and given the right thinking and planning, you can get to a point where search is no longer a feature to fear.
But I want to close by pointing out some of the costs of not improving search, many of which have clear but difficult-to-quantify consequences:
The authors of The Phoenix Project evoke this dynamic well, writing that:
“Everything becomes just a little more difficult, bit by bit [as] our work becomes more tightly coupled. Smaller actions cause bigger failures, and we become more fearful and less tolerant of making changes. Work requires more communication, coordination, and approvals; teams must wait just a little longer for their dependent work to get done; and our quality keeps getting worse. The wheels begin grinding slower and require more effort to keep turning.”
When search experiences are as decrepit as they often are, problems become hard to identify and improvements hard to make – the wheels grind and slow to a halt. Caught between those two gaps, it’s difficult to see just how good search can be.
That’s why I ultimately return to optimism: If you take the time to improve search from first principles, the payoff will be greater than you can even predict.