Efficiently Sort Geo-points by Distance in Elasticsearch | by Aryella Lacerda | Jul, 2022

Querying geo-distance


There are a lot of use cases for calculating the distance between two points in a search. If you’re working with geographical data, regardless of what business you’re in, this is bound to come up. And then you’ll probably need to take a distance into account when sorting these points because… well, why not?

So here are a few different ways you could do that. Though I try to explain everything in as much detail as possible, I’m assuming you have a beginner’s understanding of Elasticsearch (ES) and its basic queries.

For this example, let’s assume we’re a food delivery startup. Something like Uber Eats or DoorDash, maybe.

We have a mobile app into which the user types their search term (e.g. “Chinese food”). Our app then lists all the establishments that contain that term, in whatever random order they’re found in our database.

We probably have every establishment’s latitude and longitude saved in the database.

If not, then we probably calculated every establishment’s geohash and saved that.

To begin making the most of ES’s geo queries, however, we should transform these values into geopoints.

Conveniently, Elasticsearch allows you to upload geopoints in whatever format you happen to have saved: latitude/longitude objects, geohashes, strings, arrays of strings, WKT POINT primitives, etc. Take a look:
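As a rough sketch of what those formats look like, here is the same point written several ways as Python dictionaries that could be serialized to JSON and indexed. The text field and the variable names are made up for illustration. The coordinates are the ones used later in this article, and the geohash is the value the Elasticsearch geo_point documentation uses for approximately the same point.

```python
# Hypothetical documents: one point (lat 41.12, lon -71.34) expressed in
# several of the formats a geo_point field accepts.
LAT, LON = 41.12, -71.34

docs = [
    # Object with explicit lat/lon keys.
    {"text": "object", "location": {"lat": LAT, "lon": LON}},
    # String: latitude first, then longitude.
    {"text": "string", "location": f"{LAT},{LON}"},
    # Array: longitude first, then latitude (note the reversed order).
    {"text": "array", "location": [LON, LAT]},
    # WKT POINT primitive: longitude first, then latitude.
    {"text": "wkt", "location": f"POINT ({LON} {LAT})"},
    # Geohash (value borrowed from the geo_point documentation example).
    {"text": "geohash", "location": "drm3btev3e86"},
]

for doc in docs:
    print(doc["text"], "->", doc["location"])
```

The thing to watch is the ordering: the string form is lat,lon, while the array and WKT forms are lon,lat.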

There are a few things to note here.

Firstly, location is an arbitrary name; we can call our geopoint field whatever we’d like.

Secondly, we need to declare the location field’s type before uploading any documents. This is because, unless we explicitly tell Elasticsearch that "41.12,-71.34" is a geopoint, for example, it’ll interpret it as text. Similarly, it’ll consider [-71.34, 41.12] an array of numbers.

From the geopoint documentation, let me point out two very important notes:

For those who are very new to Elasticsearch: an “index, document, and field” in Elasticsearch is comparable to an SQL database’s “table, row, and column”.

Every field has a type (or a “mapping”), which is important because each type of data needs to be stored in a specific way for lightning-fast searching. Elasticsearch can generate mappings dynamically as you upload new documents, but it’s sometimes desirable to declare them explicitly.

For our example, let’s create a simple index called establishments.
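A minimal sketch of such an index definition, written as a Python dictionary that could be sent as the body of a PUT /establishments request. Only the location field and its geo_point type come from this article; the name field is an assumption added for illustration.

```python
import json

index_name = "establishments"

# Explicit mapping: without it, Elasticsearch would not treat
# "location" values as geopoints.
create_index_body = {
    "mappings": {
        "properties": {
            "name": {"type": "text"},           # hypothetical field
            "location": {"type": "geo_point"},  # note the underscore
        }
    }
}

print(f"PUT /{index_name}")
print(json.dumps(create_index_body, indent=2))
```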

❗ Careful not to type it wrong: it’s geo_point, not geopoint. Let’s also set up some sample data:

This is our map of Queens, New York with our eight establishments (in blue) and our hypothetical client (in orange):

A view of Queens, New York, USA. Here we can see 8 establishments (in blue) and 1 client (in orange).

Our first task is to establish a maximum distance between the client and the establishments that we’ll return from the query. In other words, we should only search for establishments within a certain radius of the client. I’ll leave the secondary task of retrieving the client’s coordinates up to you, but we’ll definitely need them.

There’s actually a simple geo_distance query for this:
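A sketch of what the request body for that query might look like, built as a Python dictionary. The helper name and the client coordinates are placeholders invented for this example (a point somewhere in Queens), not the values used for the map above.

```python
import json

# Hypothetical client position; substitute the real coordinates.
CLIENT = {"lat": 40.7282, "lon": -73.7949}


def geo_distance_query(origin, radius="1km", field="location"):
    """Build a search body matching documents whose `field` lies
    within `radius` of `origin`."""
    return {
        "query": {
            "geo_distance": {
                "distance": radius,
                field: origin,
            }
        }
    }


body = geo_distance_query(CLIENT)
print("GET /establishments/_search")
print(json.dumps(body, indent=2))
```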

We can use all sorts of units to establish the radius: miles, yards, feet, inches, meters, kilometers, centimeters, millimeters, and even nauticalmiles. We can also format our location object in various ways, just like we could when we first created the documents.

This is the result of our query:

In other words, only establishments 1, 2, 6, and 7 are within a 1-kilometer radius of our client.

If you take a close look at our map above, though, you’ll notice that Establishment 2 is actually closest to the client, so our results aren’t sorted by distance. By default, Elasticsearch sorts its results by relevance score, found in the _score field of every document.

However, you’ll notice that in the query above, all the establishments were returned with the same relevance score. When every document is equally “relevant”, their relative order is essentially arbitrary. But then… why are the scores identical?

It’s because the geo_distance query is a yes-or-no type of thing. Either the establishment is in the radius or it isn’t. All four establishments are “equally” inside the radius and therefore they all have the same score.

One way to validate this is to use the explain: true parameter when we run our query:
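In a request body, this is a single top-level flag next to the query. A small sketch, reusing the same placeholder client coordinates (an assumption, not the article's actual values):

```python
# Hypothetical client position; substitute the real coordinates.
CLIENT = {"lat": 40.7282, "lon": -73.7949}

query_body = {
    "query": {
        "geo_distance": {"distance": "1km", "location": CLIENT}
    }
}

# Same search, but asking Elasticsearch to attach a score
# explanation to every hit.
explained_body = {**query_body, "explain": True}

print(explained_body)
```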

The explain parameter appends to every document an explanation of how that document’s score was calculated. For the query above, notice that every single document has the same explanation and therefore the same score.

However, a lot of Elasticsearch queries are carefully constructed such that the first results are the most relevant to the user. That might mean prioritizing establishments whose names and descriptions contain the exact keyword, or the newest establishments, or the ones with the highest ratings or most reviews.

In our case, we’ll want to prioritize the places closest to our client. That’s what the distance_feature query is for:
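One way to combine the two, sketched as a Python dictionary: a bool query whose must clause keeps the 1 km radius and whose should clause adds a distance_feature score. The clause layout, the 1km pivot, and the client coordinates are assumptions made for this example; the pivot is the distance at which the distance_feature score drops to half of its boost.

```python
import json

# Hypothetical client position; substitute the real coordinates.
CLIENT = {"lat": 40.7282, "lon": -73.7949}


def nearby_ranked_query(origin, radius="1km", pivot="1km", field="location"):
    """Match documents within `radius` and score closer ones higher."""
    return {
        "query": {
            "bool": {
                # Yes-or-no part: is the document inside the radius?
                "must": {
                    "geo_distance": {"distance": radius, field: origin}
                },
                # Scoring part: the closer to `origin`, the higher the score.
                "should": {
                    "distance_feature": {
                        "field": field,
                        "origin": origin,
                        "pivot": pivot,
                    }
                },
            }
        }
    }


body = nearby_ranked_query(CLIENT)
print(json.dumps(body, indent=2))
```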

These are the results we get, now in a different order (2, 1, 7, 6). Notice that the relevance scores aren’t the same anymore.

Add an explain: true parameter to the query and take a look at the explanation field of our first result. Two queries are now computed separately (is the establishment within a 1 km radius of the client, and how close is the establishment to the client?), so the document’s final score (1.7851406) is the sum of the scores returned by each query (1 + 0.78514063).

The distance_feature computation is a little more complicated than the geo_distance computation, but it’s still simple enough to understand:
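According to the Elasticsearch documentation, the distance_feature score is boost * pivot / (pivot + distance). The sketch below reproduces that arithmetic in plain Python; the function names are invented here, and the default boost of 1 and the pivot of 1000 meters used in the example are assumptions.

```python
def distance_feature_score(distance_m, pivot_m, boost=1.0):
    """Score contributed by distance_feature: boost * pivot / (pivot + distance)."""
    return boost * pivot_m / (pivot_m + distance_m)


def implied_distance(score, pivot_m, boost=1.0):
    """Invert the formula: the distance that would produce `score`."""
    return pivot_m * (boost / score - 1.0)


# A document exactly one pivot away scores half the boost.
print(distance_feature_score(1000, 1000))

# A document at the origin scores the full boost.
print(distance_feature_score(0, 1000))

# Distance implied by a given score, assuming a 1000 m pivot and boost 1.
print(implied_distance(0.78514063, 1000))
```

Under those assumptions, a score of 0.78514063 would correspond to a distance of roughly 274 meters.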

In the score explanations above, you’ll find a distance object for every establishment. But that’s a very roundabout way of fetching the distance between two points. I wouldn’t recommend it for a few reasons:

  • The shape of the explanation field changes every time you adjust the query, which makes retrieving the distance from there a pretty flaky operation.
  • The explanation field stores a lot of information other than the distance so you’ll be using up resources returning needless data to the client.
  • Semantically, that’s just not what the explanation field ought to be used for. It’s a debugging tool, not a query.

There are a couple of other ways to do it.

We can use a script to generate a new distance field at run-time. Fair warning: script queries are usually more expensive than built-in queries, but they can be optimized if necessary. Try to avoid premature optimization if you can; Elasticsearch really is blazingly fast.
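A sketch of how such a script field might be declared, again as a Python dictionary. The field name distance, the helper name, and the client coordinates are assumptions for illustration. Because script_fields normally suppresses the document source in the response, the sketch asks for _source explicitly.

```python
import json

# Hypothetical client position; substitute the real coordinates.
CLIENT = {"lat": 40.7282, "lon": -73.7949}


def search_with_distance_field(origin, radius="1km", field="location"):
    """Radius filter plus a computed `distance` field (in meters) per hit."""
    return {
        "_source": True,  # keep the original document alongside the script field
        "query": {
            "geo_distance": {"distance": radius, field: origin}
        },
        "script_fields": {
            "distance": {
                "script": {
                    "lang": "painless",
                    # arcDistance takes latitude first, then longitude.
                    "source": f"doc['{field}'].arcDistance(params.lat, params.lon)",
                    "params": {"lat": origin["lat"], "lon": origin["lon"]},
                }
            }
        },
    }


body = search_with_distance_field(CLIENT)
print(json.dumps(body, indent=2))
```

Passing the coordinates through params rather than interpolating them into the script source lets Elasticsearch reuse the compiled script across requests.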

The arcDistance function is built into Elasticsearch and returns the distance in meters. Our results now tell us exactly how far away each establishment is from the client:

There’s a third option as well, for situations where the score isn’t important. The sort parameter does as advertised and sorts the results by the given criterion. In our case, that criterion can be the distance between the client and the establishment.
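A sketch of a request body that sorts by distance using the _geo_distance sort, built as a Python dictionary. The helper name, the ascending order, the unit, and the client coordinates are choices made for this example.

```python
import json

# Hypothetical client position; substitute the real coordinates.
CLIENT = {"lat": 40.7282, "lon": -73.7949}


def search_sorted_by_distance(origin, radius="1km", field="location"):
    """Radius filter with hits ordered from nearest to farthest."""
    return {
        "query": {
            "geo_distance": {"distance": radius, field: origin}
        },
        "sort": [
            {
                "_geo_distance": {
                    field: origin,
                    "order": "asc",  # nearest first
                    "unit": "m",     # unit of the sort values in the response
                }
            }
        ],
    }


body = search_sorted_by_distance(CLIENT)
print(json.dumps(body, indent=2))
```

Each hit in the response then carries its computed distance in the sort array, so the distance is available without a script.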

These are the results. Notice that the order remains the same as the query above (2, 1, 7, 6) but the relevance score of every document is now null. On the other hand, since we’re not using a script, this search will likely be faster than the one above.

And that’s it! Thanks for reading and please let me know if you have any other ideas on how to calculate and sort by distance in Elasticsearch.
