Efficient sorting of geo distances in Elasticsearch

Rob Sears · April 14, 2015
5 minute read

Elasticsearch is much more than just a search engine: it's also a powerful analytics tool. One of the awesome things that Elasticsearch provides out of the box is the ability to calculate the distance between geographic points, and order the results by proximity. A common use case for this is an application where a user wants to see search results that are near a given point.

Take an app like OpenTable, for example, which uses Elasticsearch to power its mobile application. Someone using the iPhone app might want to see all of the best nearby restaurants. The application uses the iPhone's GPS to get a latitude and longitude for the user, then uses Elasticsearch to quickly return all nearby results within a region, sorted by the closest ones.

However, some users building a similar capability find that their search is lightning fast in development, but slows to a crawl in production. We see this at Bonsai from time to time, when users open support tickets asking what gives.

The answer has to do with how search is being implemented. Consider the following snippet, and see if you can spot the problem:

Restaurant.search(
  "*",
  sort: {
    _geo_distance: {
      location: "#{lat},#{lon}",
      order: "asc",
      unit: "km"
    }
  },
  size: 20
)

Seems pretty straightforward, right? The code block returns the 20 nearest restaurants to lat,lon, ordered by nearest to furthest. But there is a subtle problem.

See how we're searching for all documents (*)? The code passes a lat/lon point to Elasticsearch, and a geo_distance is calculated between that point and the point in every document in the index. If there are a large number of documents, the time to compute geo_distances for all of them becomes non-trivial.
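To get a sense of the per-document cost, here is a plain haversine great-circle distance in Ruby. This is an illustration only, not Elasticsearch's code: it shows the kind of trigonometry that has to run once per document when every document in the index is a candidate.

```ruby
# Haversine great-circle distance between two lat/lon points, in kilometers.
# Illustrative only: Elasticsearch uses faster approximations of this math,
# but some version of it still runs once per candidate document.
def haversine_km(lat1, lon1, lat2, lon2)
  rad  = Math::PI / 180
  dlat = (lat2 - lat1) * rad
  dlon = (lon2 - lon1) * rad
  a = Math.sin(dlat / 2)**2 +
      Math.cos(lat1 * rad) * Math.cos(lat2 * rad) * Math.sin(dlon / 2)**2
  2 * 6371 * Math.asin(Math.sqrt(a)) # mean Earth radius ~ 6371 km
end

# San Francisco to Los Angeles:
haversine_km(37.7749, -122.4194, 34.0522, -118.2437) # ≈ 559 km
```

A handful of sines and cosines is cheap for one pair of points, but multiplied across millions of documents on every query, it adds up fast.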

This is a direct consequence of how search engines sort results. In order to get the top 20 results, the engine has to calculate a score for every document in the result set. Once that's done, it needs to pass the scored results through a sorting algorithm before the top results can be returned.

This process is generally pretty well optimized in Elasticsearch. Calculating geo_distances is more computationally complex than basic TF-IDF relevancy scoring, but the default sloppy_arc algorithm is blazing fast, and over 99% accurate. The sorting algorithms are smart too. For example, they stop sorting after the first 20 results (or whatever size is specified), since there's no need to know exactly where a document ranks if it's not going to be returned to the user anyway.
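That early-exit behavior is essentially top-k selection. Here's a minimal sketch in plain Ruby to show the idea — illustrative only, since Lucene actually uses a fixed-size priority queue, but the effect is the same: only a k-sized structure ever gets sorted, no matter how large the input is.

```ruby
# Top-k selection: keep only the k smallest distances seen so far,
# instead of fully sorting the entire list. (Plain Ruby sketch; Lucene
# uses a fixed-size priority queue to the same effect.)
def top_k(distances, k)
  best = []
  distances.each do |d|
    if best.size < k
      best << d
      best.sort!            # sorting a tiny array of size <= k
    elsif d < best.last     # a smaller distance beats the current worst
      best[-1] = d
      best.sort!
    end
  end
  best
end

distances = [42.0, 3.1, 17.5, 0.8, 99.9, 5.6]
top_k(distances, 3) # => [0.8, 3.1, 5.6]
```

The sort stays cheap, but note what doesn't go away: every distance in the input still has to be computed and examined once. That's why shrinking the candidate set matters more than the sort itself.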

This explains why sorting by geo_distance is often blazing fast in development, when the document count is low, but sluggish in production: the cluster burns CPU computing a distance for every document in the index (and may become memory-bound as well).

What can I do about it?

The short answer is to reduce the number of documents you're sorting over. Your particular use case will drive exactly how you want to do this, but it typically involves dropping the wildcard character and/or using filters.

Going back to the OpenTable example, if a user in San Francisco wants to see nearby Chinese restaurants, there's no reason to calculate a geo_distance for a bistro in Paris or a steakhouse in Texas. The search should be limited to San Francisco, and specifically Chinese joints. Consider the following adaptation of the code above:

Restaurant.search(
  "*",
  filter: {
    bool: {
      must: [
        { term: { category: "chinese" }},
        { term: { city: "san francisco" }}
      ]
    }
  },
  sort: {
    _geo_distance: {
      location: "#{lat},#{lon}",
      order: "asc",
      unit: "mi"
    }
  },
  size: 20
)

This would limit the scope of the search to documents with a category of "chinese" and a city of "san francisco," which is all the user wants to see anyway. That would require computing geo_distances and sorting on perhaps a few dozen entries in a specific area rather than millions around the world.

Another solution is to use the geo_bounding_box filter. This is a filter specifically designed for isolating geo_points in a result set. It allows you to define the lat/lon corners of a rectangle, and keeps only the results that fall inside that rectangle. This method is a little more elegant, but comes with a cost.

There's some discussion over the efficiency of this approach, since it seems to have mixed results (performance-wise) in production. Using a bounding box still requires every document's coordinates to be checked against the box when the filter is applied. In theory, there is less overhead in determining whether a point falls within an area than in computing the distance between two points, but it's still overhead. So applying this filter to a wildcard query and nothing else will not be much more efficient than the original code.
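For reference, here's what a geo_bounding_box filter looks like in the raw query DSL, expressed as a Ruby hash. The "location" field name and the corner coordinates (a rough box around San Francisco) are placeholders for this sketch; in practice you'd combine this with term filters inside a bool filter, as in the example above.

```ruby
# Sketch of a geo_bounding_box filter body (raw Elasticsearch DSL as a
# Ruby hash). Field name and coordinates are placeholders.
bounding_box_filter = {
  geo_bounding_box: {
    location: {
      top_left:     { lat: 37.83, lon: -122.52 }, # NW corner of the box
      bottom_right: { lat: 37.70, lon: -122.35 }  # SE corner of the box
    }
  }
}
```

The box check is cheap per document (two latitude and two longitude comparisons), but it still runs against every candidate — which is why it helps most when paired with filters that shrink the candidate set first.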

Questions? Comments?

Using geo_distances in your app? We'd love to hear how you're using it and what kind of performance you're seeing. If you have any questions, comments or concerns, shoot us an email and we'll see how we can help.