Ah, the birth of a new search. It is a glorious thing. I recently had the pleasure of building out our internal admin search for Bonsai's new SaaS application with OpenSearch, and wanted to share my insights.
The problem we're solving here is that we've got lots of data spread around in various relational tables, and tuning relevance across all these structures is quite tricky. The techniques in this post work for Postgres, SQLite, MSSQL, MySQL, MariaDB, Oracle, DB2, and any other RDMBS. In this post, you'll learn why search is a problem best solved by a search engine, and not just tacked on to your tables in the database using some full text feature or plugin. You'll also learn why vector search may be a poor choice for highly structured relational data.
I've recently done lots of big texty content and vectors and hybrid, but SaaS search is a horse of a different color. It's been a while since I've designed search for an application like this, and in my research I once again had some eyebrow raising revelations that there just isn't a good guide out there for this use case. In this article, I'm going to tell you how to do it right (and how to do it wrong).
For our example, we're going to use an open music database to search. Let's see a quick demo:
Before we get into how to do it right, I'll tell you how to do it wrong. Most people make two design mistakes when implementing search for their database:
- They make a different index per table (or use in-database indexing on a table)
- They use nested fields, to mimic hierarchical structure in their database
I blame these common mistakes on actual recommendations that Elasticsearch was espousing 15 or so years ago, before we knew better. Back when doc types were still a thing, that was all the rage - a new doc type for each table! After doc types finally got removed, for some reason, people switched to assuming a separate index was the thing to do. As for nested fields, well, if your data is hierarchical, then why not? But little do most people know - nested fields are terrible. They are slow and make things over-complicated.
These mistakes are easy to make if you're not a battle hardened search nerd like yours truly, because it's natural to think relationally about data if that's how your data is structured. But search is not relational! It's a relevance bucket that works best when you design for discovery and not relationships. And, if you put different tables in different indexes and nest fields, tuning for relevance and experience becomes very difficult, especially as your database grows and evolves.
One Index to Rule Them All
Having a single index for your entire database is not only good design for search, but it also follows good software architecture principles: separation of concerns. We don't couple the design of the search index to the design of the database. Instead we focus on the information needs of the person who needs to find things, design the index appropriately, and add data and tune the query.
But weirdly, SaaS info needs are pretty consistent in most cases at a base level - people need to find records of information! When you think about it, you can design a search index that hits most of these information needs without even knowing much about the application, and then grow from there.
So I'm going to do something now that I usually don't do. I'm going to hit you with a schema right off the bat. I don't even know what your database looks like, but this will get you most of the way there, and seeing an end result will make the above point crystal clear. After we review the schema, we'll look at a database example and show how the design really clicks.
Baseline Search Schema for any Database
| Field | Type | Description |
|---|---|---|
| id | keyword | The unique identifier to look stuff up |
| type | keyword | The name of the table the record came from |
| permissions | keyword | Who has access to see it? |
| url | url | An easy url to point to the actual record page from results |
| names | entity | Entities (column values) that people want to find |
| emails | If your table has an email address, it goes here | |
| notes | text | If your table has long text, it goes here |
| aka | entity | "Also Known As" - alternate names and recall for the record |
| address | entity | If your table has an address, it goes here |
| amount | numeric | The most important number in your table - i.e. prices, stock, downloads, etc. |
| created | date | Matches your table's created timestamp |
| updated | date | Matches your table's updated timestamp |
| deleted | date | Matches your table's soft-deleted timestamp |
| details | object | A json object that keeps unsearchable details that you want to show in results |
Yup, that's it. That is what gets you started. Just 14 fields.
For context the Bonsai admin search index has a very similar index. But the beauty of it is that if we want to include more tables in our search, it's really, really easy, and we can do it in 10 lines of code.
There's no secret or mystery here. The technique is straightforward: we more or less have one field per use case of how people want to find things. Want to search the name of a record? Check. Email? Check. Some texty notes? Check. Location? Check. It's likely that each table has things like this, and we're just making the bucket.
Well, maybe there is a bit of a trick: multivalued fields. If your table has 6 varchars that are all short, you can stick them all into one of the fields as an array per record. I call it a trick, but really it's just the process of negotiating the difference between how relational data is represented and how search engines work.
The Index Definition
Here's the full index settings and mappings for our 14-field schema. The analyzers are tailored for each field type: entities get light stemming to preserve proper nouns, text fields get full English analysis, and structured fields like emails and URLs get specialized tokenizers.
PUT /music
{
"mappings": {
"properties": {
"id": { "type": "keyword" },
"type": { "type": "keyword" },
"permissions": { "type": "keyword" },
"url": { "type": "text", "analyzer": "analyze_urls" },
"names": { "type": "text", "analyzer": "analyze_entities" },
"emails": { "type": "text", "analyzer": "analyze_emails" },
"notes": { "type": "text", "analyzer": "analyze_text" },
"aka": { "type": "text", "analyzer": "analyze_entities" },
"address": { "type": "text", "analyzer": "analyze_entities" },
"amount": { "type": "numeric" },
"created": { "type": "date" },
"updated": { "type": "date" },
"deleted": { "type": "date" },
"details": { "type": "object", "enabled": false, "dynamic": true }
}
},
"settings": {
"analysis": {
"filter": {
"english_stop": { "type": "stop", "stopwords": "_english_" },
"english_stem": { "type": "stemmer", "language": "english" },
"english_light": { "type": "stemmer", "language": "possessive_english" }
},
"char_filter": {
"protocol_strip": {
"type": "pattern_replace",
"pattern": "^(http|https)(://)",
"replacement": ""
}
},
"tokenizer": {
"path_hierarchy_tokenizer": { "type": "path_hierarchy" },
"email_tokenizer": { "type": "uax_url_email" }
},
"analyzer": {
"analyze_text": {
"type": "custom",
"tokenizer": "standard",
"filter": ["lowercase", "english_stop", "english_stem"]
},
"analyze_entities": {
"type": "custom",
"tokenizer": "standard",
"filter": ["lowercase", "english_light"]
},
"analyze_urls": {
"type": "custom",
"char_filter": ["protocol_strip"],
"tokenizer": "path_hierarchy_tokenizer",
"filter": ["lowercase"]
},
"analyze_emails": {
"type": "custom",
"tokenizer": "email_tokenizer",
"filter": ["lowercase"]
}
}
}
}
}
A few things to call out:
typeis the name of the table. This will be used to drive filtering/faceting and display for specific record types. We even use it to show icons that help the user visually identify what kind of record it is.permissionsis a keyword. No analysis needed and we need it for filtering. We'll usetermsqueries to check whether the current user's role intersects with the document's permissions array. We do this for every query.detailsis a bag for display data. It's set to"enabled": falseso OpenSearch stores it but doesn't index it. You can't search or filter on anything insidedetails! That's by design. It's only there to be returned with_sourceso your UI can display things in the search results. The"dynamic": truemeans each document type can put whatever it wants in there without a mapping update.- The analyzers are deliberately conservative. We're not doing heavy NLP here. Entity analysis preserves case-insensitive exact forms. Text analysis adds stop words and stemming. Emails and URLs get specialized tokenizers that understand their structure.
Implementation Example
Let's walk through a real example. For illustration, we're using the Chinook database, which is a well-known sample database that models a digital music store. It might have some of your favorite artists in there (but it's a small catalog so don't be dissappointed if it doesn't). It has lots of tables that we can use to infer varying information needs. Also, it sits perfectly next to our existing books search based on the Gutenberg project. Who doesn't love listening to music while reading books (or reading books while listening to music)?
The Chinook database schema. Copyright (c) 2008-2024 Luis Rocha, MIT License.
Chinook has 11 tables split across two functional areas
The catalog side: Artist, Album, Track, Genre, MediaType, Playlist, and PlaylistTrack. This is the music collection: Artists have albums. Albums have tracks. Tracks have genres and media types. Playlists group tracks together.
The sales side: Employee, Customer, Invoice, and InvoiceLine. This is the store operations. Employees support customers. Customers buy stuff. Invoices track the purchases.
Here's a quick look at the row counts so you get a sense of scale:
| Table | Rows |
|---|---|
| PlaylistTrack | 8,715 |
| Track | 3,503 |
| InvoiceLine | 2,240 |
| Invoice | 412 |
| Album | 347 |
| Artist | 275 |
| Customer | 59 |
| Genre | 25 |
| Playlist | 18 |
| Employee | 8 |
| MediaType | 5 |
Not the chonkiest dataset I know, but it has real variety across its tables: short names, long names, emails, addresses, dates, prices, free-text fields, foreign keys.
We're actually going to end up with fewer tables than this. Because MediaType and Genre are just lookup enums, and PlaylistTrack & InvoiceLine are join tables. We won't index those separately when we de-normalize. That leaves us with 7 tables to combine into our index for search.
What Do People Actually Need to Find?
Before we map anything, we need to think about who's using this search and what they're looking for. This is a classic information needs exercise. We don't pay attention to the structure of the database, we pay attention to the ways in which peole will use the search.
A visitor to our music store might need to:
- Find a track, album, or artist
- Discover or explore by genre
- Find their account or invoice details
An admin will do those things too, but they'll also:
- Find an invoice for a particular customer or billing city
- Find a customer by name, email, or company
- Look up an employee by name
These are all discovery tasks. The visitor executes a query, and they want to relevant results regardless of the source tables. "John" should return John Mayer the artist, John Gordon the customer, and any invoices billed to someone named John. One search box, one index, many types of results.
Ready to power your search with AI?
Launch a fully managed Elasticsearch or OpenSearch cluster, with built-in vector search and AI capabilities.
Mapping Chinook to Our Baseline Schema
We're going to take every column from every Chinook table and map it into our 14-field schema. Some columns map directly. Some get combined into multivalued fields. Some we skip entirely because they're internal foreign keys that have no meaning to a person searching.
Catalog Tables
Before we get started here are some basic conventions.
- We use the table name in the
typefield (Artist-->"type":"artist"). - The id for each table is a serial integer. We'll use that plus the table name prefix for our search document id (
Artist.ArtistId=247-->"id":"artist-247"). - We denormalize the foreign keys to resolve to their text values. For example,
Album.ArtistIdwill need to be resolved toArtist.Name. - "aka" is short for "Also Known As". We use this field as a bucket to match alternate names.
- Permissions is a column that is meant as a filter for who can see it - basically
all,admin, and the customer that owns a private record. A more detailed explanation is included with theCustomertable.
Otherwise, we map columns to fields for each table in our database.
Artist
Artist is really simple. Just 2 columns: ArtistId and Name.
| Chinook Column | Index Field |
|---|---|
| ArtistId | id (prefixed: artist-247) |
| (table name) | type = "artist" |
| Name | names |
| (implicit) | permissions = ["all"] |
The id gets a prefix so we can distinguish it from other record types (and we use this pattern for all tables). The type field tells us what kind of thing it is. The name goes into names. And permissions gets ["all"] because this is a public-facing catalog item visible to everyone.
Album
Album has three columns: AlbumId, Title, and ArtistId.
| Chinook Column | Index Field |
|---|---|
| AlbumId | id (prefixed: album-42) |
| (table name) | type = "album" |
| Title | names |
| ArtistId | aka (resolved to artist name) |
| (implicit) | permissions = ["all"] |
As noted above in our conventions, we resolve the ArtistId foreign key to the actual artist name and put it in aka. When someone searches "AC/DC" they should find the artist AND all the AC/DC albums. The aka field is perfect for this. It's our "also known as" bucket for associated names, aliases, and cross-references.
Track
Track is the big one. Nine columns.
| Chinook Column | Index Field |
|---|---|
| TrackId | id (prefixed: track-1) |
| (table name) | type = "track" |
| Name | names |
| AlbumId | aka (resolved to album title) |
| GenreId | aka (resolved to genre name) |
| MediaTypeId | details.media*type *(resolved)_ |
| Composer | aka |
| Milliseconds | details.duration_ms |
| Bytes | details.size_bytes |
| UnitPrice | amount |
| (implicit) | permissions = ["all"] |
The track name goes in names. The album title, genre name, and composer all land in aka as an array. Someone searching "Miles Davis" will find tracks where he's the composer. Someone searching "Jazz" will find tracks in that genre. Someone searching "Let There Be Rock" will find the album AND its tracks.
Duration and file size both go into details.
However, we come to our first architecture decision - because even if you are not searching for "4 minutes and 32 seconds" you may very well want to facet/filter on duration!
Should we add a column? I chose not to for this - but this is where you should start experimenting
Genre, MediaType, and Playlist are all tiny lookup tables. You could index them if you wanted, but honestly, they're more useful resolved into the records that reference them. Genre names live inside track documents via aka. Same with media types. Playlists could be their own indexed type if your admin needs to find playlists by name, but for this example we'll keep things simple.
PlaylistTrack is a junction table. It has no searchable content of its own. We skip it entirely.
People Tables
Customer
Customer has 13 columns and maps beautifully.
| Chinook Column | Index Field |
|---|---|
| CustomerId | id (prefixed: customer-1) |
| (table name) | type = "customer" |
| FirstName + LastName | names |
| Company | aka |
| emails | |
| Address, City, State, Country, PostalCode | address |
| Phone, Fax | notes |
| SupportRepId | details.support*rep *(resolved)_ |
| (implicit) | permissions = ["customer-1", "admin"] |
First and last name get concatenated into names. Company goes in aka so searching the company name surfaces the customer. Email goes in emails (obviously). All the address columns get combined into a single address array value. Phone and fax go into notes because people do search for phone numbers and it's close enough to free text.
Notice the permissions field: it contains the customer's own prefixed ID and "admin". This means customer-1 can find their own record, and any admin can find it too. A different customer searching wouldn't match — their ID isn't in this document's permissions array.
Employee
Employee is almost identical to Customer, plus a few extra fields.
| Chinook Column | Index Field |
|---|---|
| EmployeeId | id (prefixed: employee-1) |
| (table name) | type = "employee" |
| FirstName + LastName | names |
| Title (job title) | aka |
| emails | |
| Address, City, State, Country, PostalCode | address |
| Phone, Fax | notes |
| BirthDate | details.birth_date |
| HireDate | created |
| ReportsTo | details.reports*to *(resolved)_ |
| (implicit) | permissions = ["admin"] |
Job title goes in aka so you can search "Sales Manager" and find the right person. Hire date maps to created because that's when this employee record came into existence. Birth date goes in details for display purposes.
And here's where permissions really shows its value. Employees get ["admin"] — only admin users see employee records in search results. Customers get their own ID plus "admin", so only that customer and admins can see the record. Public catalog items like artists, albums, and tracks get ["all"]. At query time, you pass the current user's identifier into a terms filter on permissions, and the index handles visibility for you. An admin passes ["all", "admin"]. Customer-1 passes ["all", "customer-1"]. The terms filter intersects with the document's permissions array, and only matching documents come back.
Transaction Tables
Invoice
Invoice has 9 columns.
| Chinook Column | Index Field |
|---|---|
| InvoiceId | id (prefixed: invoice-1) |
| (table name) | type = "invoice" |
| CustomerId | names (resolved to customer name) |
| InvoiceDate | created |
| Billing Address, City, State, Country, PostalCode | address |
| Total | amount |
| CustomerId | permissions (prefixed: customer-1) |
| (implicit) | permissions += ["admin"] |
For invoices, we resolve the customer name into names. When someone searches "Luis" they'll find Luis Goncalves the customer AND his invoices. The billing address goes into address. Invoice date maps to created. Notice that CustomerId pulls double duty here: it resolves to the customer name for names, and the prefixed ID goes into permissions so only that customer (and admins) can see the invoice.
InvoiceLine is a line-item detail table. Like PlaylistTrack, it doesn't have much standalone search value. The interesting stuff (what was bought, for how much) lives on Track and Invoice. Skip it.
What the Documents Look Like
Here are three examples as they'd look going into the index.
A Track:
{
"id": "track-1",
"type": "track",
"permissions": ["all"],
"names": ["For Those About To Rock (We Salute You)"],
"aka": [
"AC/DC",
"For Those About To Rock We Salute You",
"Rock",
"Angus Young, Malcolm Young, Brian Johnson"
],
"amount": 0.99,
"details": {
"media_type": "MPEG audio file",
"duration_ms": 343719,
"size_bytes": 11170334
}
}
A Customer:
{
"id": "customer-1",
"type": "customer",
"permissions": ["customer-1", "admin"],
"names": ["Luis Goncalves"],
"aka": ["Embraer - Empresa Brasileira de Aeronautica S.A."],
"emails": ["luisg@embraer.com.br"],
"address": [
"Av. Brigadeiro Faria Lima, 2170",
"Sao Jose dos Campos",
"SP",
"Brazil",
"12227-000"
],
"notes": ["+55 (12) 3923-5555", "+55 (12) 3923-5566"],
"created": "2026-01-01T00:00:00Z",
"details": {
"support_rep": "Jane Peacock"
}
}
An Invoice:
{
"id": "invoice-98",
"type": "invoice",
"permissions": ["customer-1", "admin"],
"names": ["Luis Goncalves"],
"address": [
"Av. Brigadeiro Faria Lima, 2170",
"Sao Jose dos Campos",
"SP",
"Brazil",
"12227-000"
],
"amount": 3.98,
"created": "2009-03-11T00:00:00Z"
}
All three documents live in the same index. They share the same field names. They use different fields based on what makes sense for their type. And they're all discoverable with the same query.
The Query
The query structure we use is the same for everyone, and can easily be made into a template.
POST /music/_search
{
"size": 10,
"from": 0,
"query": {
"bool": {
"should": [
{
"multi_match": {
"query": "{{QUERYSTRING}}",
"type": "cross_fields",
"fields": ["names^3", "aka^2", "emails", "notes", "address", "url"],
"boost": 1.0
}
}
],
"minimum_should_match": 1,
"filter": [
{ "terms": { "permissions": ["{{PERMISSIONS}}"] } }
]
}
},
"aggs": {
"type": { "terms": { "field": "type", "size": 10 } }
}
}
For the search bar, we take the user's query and substitute {{QUERYSTRING}}, and then add all to the permissions. If the user is logged in, we add their customer id to the permissions. If the logged in user is an internal admin, then we add admin to the permissions.
Let's say our admin types "Goncalves" into the search bar:
POST /music/_search
{
"size": 10,
"from": 0,
"query": {
"bool": {
"should": [
{
"multi_match": {
"query": "Goncalves",
"type": "cross_fields",
"fields": ["names^3", "aka^2", "emails", "notes", "address", "url"],
"boost": 1.0
}
}
],
"minimum_should_match": 1,
"filter": [
{ "terms": { "permissions": ["all", "admin"] } }
]
}
},
"aggs": {
"type": { "terms": { "field": "type", "size": 10 } }
}
}
We boost names and aka because those are the most likely intent when searching. The permissions filter ensures this admin user only sees documents they're allowed to see — which in this case is everything, since "admin" intersects with ["all"], ["customer-1", "admin"], and ["admin"]. And what comes back?
- customer-1: Luis Goncalves (the customer record)
- invoice-98, invoice-121, etc.: invoices billed to Luis Goncalves
This demonstrates the effectiveness of the pattern - we see the customer record at the top, followed by their invoices. Facets are available for the record type on the left.
If Luis Gonçalves (customer-1) searched the same thing, you'd pass ["all", "customer-1"] as the permissions filter. He'd see catalog items (which have "all"), his own customer record and invoices (which have "customer-1"), but not other customers' records, and not employee records. The permissions field acts as an access control list per document.
Ready to power your search with AI?
Launch a fully managed Elasticsearch or OpenSearch cluster, with built-in vector search and AI capabilities.
Relevance Tuning the Query
Database search is primarily a search for named entities. The main driver is people trying to find stuff by their unique names. In our music example, this means that users will search for the names of albums, songs, artists, employees, invoice identifiers, etc.
This is an important note because it shapes how we index and query the information. One thing that we often have trouble with in named entity search is recall - if you don't know the name or how to spell the name, you can't find it!
Take this trivial example: I want to find Stairway to Heaven, so I search with the query stair.
Let's improve our query. This is easier than it sounds for this basic use case - and we'll just search prefixes so stair will match stairway. Note that this also means that heave will match heaven, and sing will match single. All we need is to add another should clause for a match_phrase_prefix:
{
"match_phrase_prefix": {
"names": {
"query": "{{QUERYSTRING}}",
"boost": 0.1
}
}
}
Note the very low boost! We don't want to pollute the top with name fragments when we get a good match on the full name, so these will show up at the bottom. sing is a good example of this, since single is a common modifier for track names, and we want those tracks to appear below Sing Joyfully.
Here's the full modified query:
{
"size": 10,
"from": 0,
"query": {
"bool": {
"should": [
{
"multi_match": {
"query": "{{QUERYSTRING}}",
"type": "cross_fields",
"fields": ["names^3", "aka^2", "emails", "notes", "address", "url"],
"boost": 1.0
}
},
{
"match_phrase_prefix": {
"names": {
"query": "{{QUERYSTRING}}",
"boost": 0.1
}
}
}
],
"minimum_should_match": 1,
"filter": [{ "terms": { "permissions": ["{{PERMISSIONS}}"] } }]
}
},
"aggs": {
"type": { "terms": { "field": "type", "size": 10 } }
}
}
We can go much further with improving the query, but that would be tied closer to your data and information needs. For example, you may also want to have a recency boost using created/updated dates. You might want to use fuzzy search or phrase slop. You might want to implement full querystring support with boolean clauses and parens. You might want to implement autocomplete. All these things and more can make your database search fit your customer needs.
Adding More Tables
Remember when I said adding a new table takes about 10 lines of code? Here's what I mean: If you add a Venue table to your music database, you write a small function that maps venue columns to the schema fields (name goes in names, city goes in address, description goes in notes), and you're done.
This scales the same way at Bonsai. We started with a handful of tables in our admin search and kept adding more. Each one was a trivial addition because the index design didn't change. We use Rails, and that made it even easier - we architected to use a concern Searchable that we could attach to the model, and just define an index_features method that mapped the model's row data to the target search schema.
What about vectors?
Vectors are meant to improve recall, and I wrote about this in When to Use Vector Search? with domain-specific guidelines. But beyond partial names this can result in lots of noise when your main info needs are driven by entity search. In that article you will notice that the entities are precisely what you should avoid with vector search.
Therefore, I recommend NOT using vectors for your SaaS database search unless you have columns that contain rich sententences/paragraphs/free-text. Only if you are really struggling with recall, and lexical hacks like prefixes, ngrams, and fuzzy search don't solve them.
Also remember to always rely on your info needs. If you planned those up front, and see strong representation in data and queries that warrant vectors, then go right ahead!
Try it out!
The source for the example search is available on GitHub if you want to try this yourself. You'll need the ChinookData.json file from the Chinook repo. Index it, query it, play with the boosts, then try it on your own project. You'll be surprised how far 14 fields can take you.
See you next time!
Ready to power your search with AI?
Launch a fully managed Elasticsearch or OpenSearch cluster, with built-in vector search and AI capabilities.
Learn how a managed service works and why it’s valuable to dev teams
You won’t be pressured or used in any manipulative sales tactics
We’ll get a deep understanding of your current tech stack and needs
Get the information you need to decide whether to go with Bonsai
Or, schedule a consultation: