Radar with Jeff Kao
About building a high-performance geocoding platform with Rust
2026-01-08 63 min
Description & Show Notes
Radar processes billions of location events daily, powering geofencing and location APIs for companies like Uber, Lyft, and thousands of other apps. When their existing infrastructure started hitting performance and cost limits, they built HorizonDB, a specialized database that replaced both Elasticsearch and MongoDB with a single custom binary written in Rust and backed by RocksDB.
In this episode, we dive deep into the technical journey from prototype to production. We talk about RocksDB internals, finite-state transducers, the intricacies of geospatial indexing with Hilbert curves, and why Rust's type system and performance characteristics made it the perfect choice for rewriting critical infrastructure that processes location data at massive scale.
About Radar
Radar is the leading geofencing and location platform, trusted by companies like Uber, Lyft, and thousands of apps to power location-based experiences. Processing billions of location events daily, Radar provides geofencing APIs, geocoding, and location tracking that enables developers to build powerful location-aware applications. Their infrastructure handles massive scale with a focus on accuracy, performance, and reliability.
About Jeff Kao
Jeff Kao is a Staff Engineer at Radar, where he led the development of HorizonDB, Radar's geospatial database written in Rust. His work replaced Elasticsearch and MongoDB with a custom Rust stack built on RocksDB, achieving dramatic improvements in performance and cost efficiency. Jeff has deep experience with geospatial systems and previously open-sourced Node.js TypeScript bindings for Google's S2 library. He holds a degree from the University of Waterloo.
Links From The Episode
- Radar Blog: High-Performance Geocoding in Rust - The blog post describing Radar's migration from Elasticsearch and MongoDB to Rust and RocksDB
- Foursquare - The company Jeff worked at before Radar
- Ruby - The basis for Rails
- PagerDuty - Another company Jeff worked at. He's been around!
- CoffeeScript - The first big JavaScript alternative before TypeScript
- Scala - A functional JVM based language
- Wikipedia: MapReduce - Distributed application of functional programming
- Wikipedia: Algebraic Data Types - The concept behind Rust's Enums, also present in e.g. Scala
- Kotlin - Easier than Scala, better than Java
- Apache Lucene - The core of Elasticsearch
- Discord Blog: Why Discord is switching from Go to Rust - Always the #1 result in searches for Rust migrations
- Radar Blog: Introducing HorizonDB - A really nice write up of Horizon's architecture
- RocksDB - The primary storage layer used in HorizonDB
- MyRocks - A MySQL Storage Engine using RocksDB, written by Facebook
- MongoRocks - A MongoDB Storage Layer using RocksDB
- CockroachDB - PostgreSQL compatible, distributed, SQL Database
- InfluxDB - A time-series database that used RocksDB at one point, and our very first guest on this podcast!
- sled - An embedded database written in Rust
- rocksdb - Rust bindings for RocksDB
- H3 - Uber's geo-hashing library, using hexagons
- S2 - Google's geo-hashing library
- Wikipedia: Hilbert Curve - A way to map 2 dimensions onto 1 while retaining proximity
- Wikipedia: Finite-State Transducer - A state machine used for efficiently looking up if a word exists in the data set
- Rust in Production: Astral - Our episode with Charlie Marsh about tooling for the Python ecosystem
- burntsushi - A very prolific Rust developer, now working at Astral
- fst - FST crate from burntsushi
- Wikipedia: Trie - A tree-structure using common prefixes
- Wikipedia: Levenshtein Distance - The number of letters you have to change, add, or remove to turn word a into word b
- tantivy - A full-text search engine, written in Rust, inspired by Lucene
- LightGBM - A gradient-boosted decision tree library
- fastText - A text classification library from Meta
- Wikipedia: Inverted Index - An index used for e.g. full text search
- Wikipedia: Okapi BM25 - The ranking algorithm used in tantivy
- Wikipedia: tf-idf - A classic and simple ranking algorithm
- Roaring Bitmaps - A very fast bitset library used in many places
- corrode.dev: Be Simple - A sentiment right down Matthias' alley
- loco-rs - Rust on Rails
Transcript
Here's Rust in production, a podcast about companies who use Rust to shape the
future of infrastructure.
My name is Matthias Endler from corrode, and today we talk to Jeff Kao from Radar
about building a high-performance geocoding platform with Rust.
Jeffrey, thanks a lot for taking the time today for the interview.
Can you say a few words about yourself and about Radar?
Yeah, happy to be on this podcast. So my name is Jeff, and I'm a principal engineer at Radar Labs.
We are an enterprise geolocation tech company.
So that's a variety of things, spanning from maps and routing to search and
geocoding and geofencing, as well as fraud detection.
And a bit more about myself, I've been programming for quite some time now.
I've worked at a variety of different startups. I've been a freelancer before
as well, and even started my own small indie company a while ago.
But these days, most of my engineering focus, I would say, is largely on backend
and data infrastructure engineering.
What's your programming background? What other languages do you know besides Rust?
Yeah, so I would say, actually, it's kind of funny.
When I first graduated from university, I joined a company called Foursquare.
And I think at that time, it was maybe around the 2010s, where there were a
lot of companies that were moving from Ruby,
because Ruby on Rails, I think at the time was like, and it still is,
you know, but I think like, it was almost like peak Ruby on Rails.
And then there's the peak migrating from Ruby on Rails to something more quote
unquote scalable by companies like Twitter and like Foursquare and like SoundCloud.
So it's funny because I worked at two of those companies. And so I've done a lot of work in Scala.
You know, in university, I prototyped and I was doing freelancing.
And then I used a lot of Ruby on Rails at that time.
And so, you know, a lot of like web technology is involved with that as well.
So I did a lot of JavaScript.
And even at that time, there wasn't TypeScript, but I think people were using
CoffeeScript at that time. So I played around with that quite a bit.
We actually, it's funny, we migrated when I worked at PagerDuty,
we moved some workloads from JavaScript to actually CoffeeScript.
But I think it's pretty much defunct now. Nobody really uses it.
So those are probably my main languages, you know, dabbled in Python and some
C and C++, but never in like a really full-time sort of manner.
And so these days at Radar, mostly working with TypeScript, some Scala for some
Spark pipelines, of course, a lot of Rust.
And yeah, what else? Maybe some Python as well.
That's pretty impressive: Ruby, Scala, JavaScript, CoffeeScript, Python, TypeScript.
That's a lot, and now Rust. What could people learn from the philosophy of Ruby?
I think this is sort of like an interesting question because technology comes in cycles, I would say.
And that's definitely something I learned with some of the more senior principal
engineers at companies I've worked at before,
where really having an understanding and appreciation of things that are old
gives you insight into what's coming next, because technology really is cyclical.
So I think there's a lot to learn from and like about Ruby.
And usually, I would say the entry point for most people with Ruby is really Rails.
So it's really that concept of making programming languages work for you, in some sense.
And so what does that entail? It's almost like, especially in modern languages.
A lot of us have come to really be spoiled with these concepts of very powerful
list collections that you see from functional languages.
So mapping and filtering and reducing.
And these concepts, you can see, are applied to things that aren't even just
programming languages, even in distributed computation frameworks.
Literally, MapReduce is that concept, and you see that in Spark and in
all the new sort of data infrastructure that's written in Rust, or even in SQL.
Because if you think about it, the where clause in SQL is like a filter, and
you can do reductions with grouping.
There are all sorts of ways to express how you process data that are
very elegant and commonly used throughout all these different paradigms.
So I really think that, you know, on the Ruby side, for me, when I first started,
I was like, wow, this is like Python, but better.
But, you know, that's that's going to make some people angry, obviously.
And that was me being a naive college student looking at languages and seeing
like, oh, where's the sort of trend going? And why is everybody using Ruby on Rails?
And it's really just understanding that it's a very pleasant experience and
very expressive language.
And having all of those sort of batteries built in gives you a lot of productivity
and almost brings more joy to programming. Yeah.
For really expressing your ideas. Because at the end of the day,
programming is about creating things.
It's almost a creative profession, which I think most people from the outside
don't really assume is creative. But you're trying to solve problems and build
things, and being able to express things in a very terse manner is very
conducive to getting into the flow state as a programmer. Yeah.
When I think of Ruby, I think of elegance, the joy of programming, expressiveness.
Where would you see yourself: more on the side of programming as a craft or an art,
or more on the side of a discipline of engineering?
Where would you see yourself on that scale?
I think it's a mix of the two things. You have the tools, and the tools help
you express something. So maybe, drawing parallels to some other hobbies I
have: say you're a musician, right?
Especially like in jazz music, you know, like jazz musicians spend a lot of
time practicing because they want to express like, oh, I want to play the chord
changes in certain ways and like work on that. Or I want to learn these certain skills.
And, you know, at the end of the day, music isn't about scales or chords,
but those are the tools you use to express yourself.
And once you sort of have those tools ingrained, then they get out of the way,
they come naturally to you. Then you can express your music, how you write music.
So it's sort of a mix of the two things, right?
You want to understand the tools and what's out there and they should help you
really ship the thing you're trying to do.
And the thing you're trying to do can be very technical. So in a lot of cases,
you can get really deep into the foundation, depending on what you're
doing. There's a lot of systems programming, you know, and this is a Rust podcast,
right? So there's that side of things.
But on the other hand, there's a whole, you know, ecosystem of digital products
that don't necessarily need, you know, the best tooling or anything like that.
I think back to when, in college,
I remember creating some small websites by uploading
things to an FTP server at
one of these hosting companies that
just give you a box: here you go, upload index.php, and then,
wow, suddenly you have something that's connected to the internet. It
can be something as simple as that. So I can see the appreciation for
tools, but I really don't think it's
necessarily productive to just focus on them.
It's a mix. It's a mix.
Coming back to Rust: with your background in Ruby, Scala, and so on, you have
a lot of the functional part down. You have the expressiveness down. That part must have
been really easy for you to learn.
Algebraic data types do exist in Scala. And now you come to Rust.
How was that first impression to you?
Yeah, it was very refreshing. Rust really feels modern.
And there are so many things to like, coming from... I guess at Radar,
our main programming language is TypeScript.
We actually migrated to TypeScript from JavaScript a couple of years ago.
But looking at the JavaScript ecosystem where essentially there's a library
for everything. There's a joke even on Stack Overflow.
It's like, how do you add two plus two? Just add this NPM package to add two numbers.
So even at the time when we first started building this,
and we're definitely not early adopters...
We started building our Rust project maybe two and a half years ago or so.
There's a rich cargo crate ecosystem. There's a formatter,
flame graphs, and the paradigms are very functional, but you're not forced to use those either.
So having a rich data structure ecosystem in the standard library,
being able to process vectors with all of the functions that many developers
are used to these days really felt refreshing.
And when we were starting to build out HorizonDB, our Rust geo service project,
those were some of the characteristics we were really looking for,
especially for a team with largely a background in writing TypeScript.
Take us back to that time. What was the tech stack like before you started HorizonDB?
What was the team like, the team dynamics?
I guess most of them would be TypeScript developers, but there might be other people in the team.
I guess at that time, maybe to give some background on maybe the business side
of things, we're sort of tasked to build essentially an address validation API.
And so that's slightly different from geocoding. And we can talk about these two things.
So geocoding, which is generally synonymous with forward geocoding,
and is what most people assume geocoding is, is essentially searching for
any geo entity, whether that's an address, a place, or a region.
Places and regions tend to be, you know, more specifically called a coarse geocode.
So for an address geocode, say I live at, you know, 123 Broadway.
The task is then to understand this query, to know that this is an address,
and you're looking for the number 123 and the street Broadway,
and then return a latitude and longitude.
And with our APIs, we offer like some other metadata such as,
oh, this is in New York in this postal code and things like that.
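To make that concrete, here is a minimal, hypothetical sketch in Rust of that first parsing step; a production query parser handles abbreviations, units, directions, and much more:

```rust
// Hypothetical sketch: split a query like "123 Broadway" into a house
// number and a street name. Real geocoders do far more than this.
fn parse_address(query: &str) -> Option<(u32, String)> {
    let mut parts = query.split_whitespace();
    let number: u32 = parts.next()?.parse().ok()?; // leading house number
    let street: String = parts.collect::<Vec<_>>().join(" ");
    (!street.is_empty()).then_some((number, street))
}

fn main() {
    assert_eq!(parse_address("123 Broadway"), Some((123, "Broadway".into())));
    assert_eq!(parse_address("Broadway"), None); // no house number: not an address query
}
```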
And so for address validation, there's a slight difference.
Address validation is more about understanding the deliverability of mail rather
than being focused on, you know, here's the latitude and longitude.
It's more like, does this address exist in the post office's database?
And so for us to accomplish that, the data format that we got from the post
office, essentially they give you ranges of addresses. So Broadway goes from,
let's say, 1 to 100, and it has this postal code.
Or 841 Broadway has floor 1 to floor 12, and they have these specific postal codes.
And it gets even more pedantic with the post office.
The rabbit hole goes very deep with address validation.
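A minimal sketch of that range-based validation idea in Rust; the struct, fields, and data are hypothetical stand-ins for the post office's much richer format:

```rust
// Hypothetical, simplified postal range record, following the example above.
struct StreetRange {
    street: &'static str,
    low: u32,
    high: u32,
    postal_code: &'static str,
}

// An address is "deliverable" if it falls inside a known range; if so,
// return the postal code attached to that range.
fn validate(street: &str, number: u32, ranges: &[StreetRange]) -> Option<&'static str> {
    ranges
        .iter()
        .find(|r| r.street.eq_ignore_ascii_case(street) && (r.low..=r.high).contains(&number))
        .map(|r| r.postal_code)
}

fn main() {
    let ranges = [StreetRange { street: "Broadway", low: 1, high: 100, postal_code: "10007" }];
    assert_eq!(validate("broadway", 42, &ranges), Some("10007"));
    assert_eq!(validate("broadway", 841, &ranges), None); // outside the known range
}
```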
So at that time, we were using an open source service for
geocoding, but there wasn't really a way for us to essentially ingest these
ranges and validate addresses.
So that was sort of the main motivation for us to come up with something new. And so...
Our tech stack, you know, at that time didn't have any Rust.
We have some front-end, you know, services written with React and Next and JavaScript.
And then our API server was written in TypeScript.
We had some, like, data processing jobs in Spark and Scala.
And then we used Airflow, so some Python as well. But nothing really,
you know, statically compiled or...
I mean, there's TypeScript, but more in the sense of like, you know,
these things translate into some bytecode, sort of like a JVM or into native instructions.
And we were sort of expecting, we had more constraints about like, oh,
if we want to build a service that does things like this and it overlaps so much with geocoding,
we might as well just sort of replace the service because we had some operational
issues with our existing geocoder, which we can talk about later. And so...
There was sort of a motivation to choose between an external
service, something like Elasticsearch, and having something all in one.
Operationally, we sort of got burnt by having so many external
services that we were almost motivated to have something that would
just let us do everything in one package.
Was Rust the only language that you considered for that project?
So we were considering a couple of different options. And it's funny because at our company,
we write like tech specs or essentially design documents before we sort of go
into building some, you know, larger projects or like features or things that
will have like big downstream impacts.
So actually, looking back at the doc, we were considering a couple of languages,
and we sort of discussed the trade-offs there.
So we were thinking about Kotlin, and the thinking around there is that the
Java and JVM ecosystem is very rich.
From my personal experience, I think Scala is very complicated.
And, you know, onboarding the whole team to that might have been a little tricky.
But Kotlin is a little closer, it's sort of in between,
and I do feel like there's a bit more backing around it now in
the 2020s versus Scala.
Scala really just seems a little more niche, sort of tied to Spark.
And even then, I have some opinions about Spark and
how Rust might play into a world like that.
So Kotlin felt like a good in-between. And there were just a whole ton of
interesting ecosystem aspects, such as the fact that
Elasticsearch is written in Java, and
essentially Elasticsearch is just a distributed wrapper around Lucene.
I mean, obviously it's much more than that, there's an entire company behind it, so
that's a very broad stroke. But there's a whole ton of work around
text processing, which is essentially a lot of what this is. And we felt like, okay,
that is a very rich ecosystem that we could potentially use, with the trade-off
that we knew we were probably going to have to store large text indexes and all of these things,
even synonyms or different spellings or spell correction.
A lot of this is like dictionary lookups of strings and things like that.
So you can imagine, even from the onset, we're probably going to store giant
hash maps or a lot of things in memory.
And we'd have to deal with a garbage collector. One of the motivating factors,
well, I guess before I talk about that:
we were also considering Golang,
which is similar in that it's garbage collected.
And we've seen that a lot of companies
have adopted it because it's really trivial to write web services,
and it's relatively simple in the sense that there's not a whole lot of different
ways to do something, versus something like Scala,
where you can express how to solve your problem in many ways.
So those were sort of the two biggest contenders.
And that was largely because we didn't feel too comfortable transitioning
some JavaScript developers to something like C or C++.
And I think there's a hurdle around setting up C and C++ projects.
I know there's been so much innovation, especially around C++,
but it just didn't seem like such a natural transition for our developers.
And sort of similar with Golang: I think Golang sort of evolved
at Google from their large C++ monorepo.
And a lot of these things that I mentioned that make programming enjoyable
are things like mapping and filtering and all of these expressive aspects.
We felt like, because we had a bit of a short timeline, it almost felt like a
step back in terms of programmer joy and being able to express things,
especially concepts where you have to process strings in very precise ways.
In a lot of ways, a lot of search is... if you ever do these LeetCode problems,
there's so much string processing.
And having more expression really helps you to make the code very clear about
some very potentially complicated ways that you process strings.
Which leaves us with Kotlin versus Rust.
Right, right. And so...
We saw, like, there was this very famous blog post.
If you go on Hacker News and look up, like, Rust or migrating to Rust,
I think the first result is always this one about Discord moving from Golang to Rust.
Yeah.
And so we, it almost sort of motivated us to see, like, oh, they had occasional
garbage collecting issues.
And even working at like other companies where we use Scala and the JVM,
like there was consistently like issues with like the JVM and like the garbage collector.
And, you know, there's so much innovation around the garbage collector.
But we knew like for a lot of what we were doing, text processing and indexing,
things like that, we're going to store large data structures just in memory.
I know there's like these concepts of like off heap memory, but that's already
sort of off the happy path of the language we're using. So we're almost like
putting a barrier or like something impeding us before we even got started.
So we really did want to have something where we had a lot more control over the memory.
But we also didn't want a language like C or C++ where you sort of have to expect
developers to understand these like concepts.
Like they still need to understand these concepts to some extent.
But it's you have to be really really up front and like be very almost very
very strict about these things so that would require significant like senior
engineering talent within that had like c and c++ knowledge which we didn't have.
There's a lot of ceremony around allocating and deallocating memory, not
so much in C++ nowadays, but definitely still in C. And when you say garbage
collectors, yeah, that is also true for both Go and Kotlin, or Java to a wider extent.
I remember that in the past, we ran a very large Elasticsearch cluster,
and those were different times.
We tried to migrate that to containers, and it wasn't really easy because operationally,
Java had some unfortunate default decisions where it would try to allocate as
much memory as it could possibly get on the machine.
And if it's co-hosted with a lot of other services on the same machine,
that sometimes caused some problems.
So did you also run into, or did you also consider these kind of concerns,
the operational part of it, the deployment process?
I guess specifically for, you know, how you mentioned Elasticsearch,
actually one of the things that, you know,
also sort of bit us and also why we decided not to use a JVM language was the
fact that we did have to maintain an Elasticsearch cluster to power geocoding
with our, you know, old iteration of the geocoder.
And we realized we just had to put everything on one machine,
which is still the case, but we didn't even use really the sort of distributed
aspect of Elasticsearch because of the nature of the workload.
It would just end up, you know, essentially fanning out all the queries to all
the nodes, even though they're sharded.
So at that point, we essentially just got one really fat box, and
all the shards of the Elasticsearch cluster just lived on that one box.
So it almost defeated the purpose.
And so, you know, maybe this is more of a design thing, more so than Rust.
But, you know, there was that motivation: Elasticsearch contained, let's say,
a bunch of geodata, addresses and places and things like that.
And there were a couple of other microservices that we had powering our old geocoder.
What we really wanted was not just like a deployment story that was simple for
essentially the binary, but our data assets as well.
And so what that meant was that,
It would be much simpler where the old world is essentially,
oh, hey, I want to do a migration.
And, you know, developers deal with this all the time. If you want to do a database
migration, you essentially have to reason about what's the existing state of your database.
You know, do I want to introduce a column? You've got to
make sure that everything can handle null to begin with.
And then maybe you have a state change. Maybe you want to increment something.
So you have to understand that the state of your database looks like this.
You're going to apply this operation. What happens if your database script fails?
Then you've got to code in this concept of idempotency where you can rerun the
script many times. And that's a lot of things to reason about.
So in terms of a deployment story, for sure, we wanted something that was very simple to deploy.
And Rust just gives you a simple binary. And, you know, we essentially compile
it in CI and essentially ship up the binary to our servers.
But also the data assets, we wanted it all to be, you know, self-contained within the same server.
And even the way that we process it, actually, we rebuild the data sets from scratch.
We never backfill anything. We just take all the raw sources and then,
you know, distribute the compute with something like Spark.
And essentially compile it to, you know, some data assets.
And there are some things that are more specific to, you know,
the data formats that we use on our server.
So we process those later and then we ship those data assets to all of these boxes.
And so, you know, what that means is because everything is self-contained,
it's actually very trivial to roll back, which is something that you might not
be able to do if you had, say, a very simple
two-tier architecture, even a web app and, say, a SQL server.
If you do a data migration there,
you have to reason about rolling the data migration back and forward,
as well as rolling the binary back and forward.
But with our new service, it all sort of goes in one package and lockstep.
So if you need a rollback, you just switch everything over and all the data
pointers also switch back.
And so that's a way to roll back. And you don't have to reason about these many
states. It's just one self-contained thing.
And for that to work, you really do need something that's very efficient,
essentially, because you're shipping so many data assets.
I see, that's an aspect I never heard before. Because Rust is so fast,
you can afford to simplify operations by doing more at startup, and it still
performs. So you can cut some corners thanks to Rust.
Right, because it's almost like, hey, you don't even need an external database
for something like you would typically grab for in web apps like a Postgres or MySQL.
It's, hey, I'm going to use an embedded database or have a large in-memory index,
and that's your sort of state.
And so all that ships together in one whole unit or package.
What do you use for the storage layer?
So we use a couple of things. And we talk about this in the blog post that describes HorizonDB.
And we link to it in the show notes. It's a really, really nice write-up.
Yeah. I would say our primary storage mechanism is this thing called RocksDB,
which is an embedded storage or embedded database.
And the way people think about it is it's essentially the database that you use to build databases.
So, for example, Facebook or Meta,
they originally used this to power the storage layer of MySQL for them,
and I think that project is called MyRocks. MongoDB, for example,
actually used RocksDB for some time. If you look at CockroachDB,
their storage layer for a long time was also RocksDB.
Same for InfluxDB.
Right. There's so many projects around there and so much community backing that
we felt it was such a mature project that we could use it for our purposes.
I know for sure there are a lot of Rust crates now, things like,
I believe, sled, stuff like that.
There's many people who write these sort of embedded databases.
And essentially, RocksDB is built around this data structure called a log-structured merge tree.
And it's really designed for high write throughput, which is sort of different from our use case.
But as I mentioned before, technology is so cyclical where, you know,
if somebody builds a database that's like high write throughput,
well, they actually adopt a lot of these concepts and ways to tune it so that,
you know, it's also very well tuned for read throughputs.
And so we just felt that, yeah, it had that community backing. And it's a
project that's not written in Rust, right? But that was sort of another nice thing:
there were Rust bindings to RocksDB, and it
was a very simple integration to be
able to just pull in that project. And that
is our primary storage layer for
all of our entities, so addresses, places, and
different regions. And it serves a number of different purposes.
Obviously, primary key fetches: when our services have, say, an event,
and they have an ID for a place, they'll fetch that from our service.
But we also index it in a way that makes it really easy to do geo lookups.
So given this lat long, I want to
be able to fetch all the relevant geo entities. So am I inside the city?
Am I inside this country? Things like that.
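A minimal sketch of those two access patterns, using the rocksdb crate's Rust bindings mentioned here; the keys, values, and cell IDs are hypothetical:

```rust
use rocksdb::{Direction, IteratorMode, Options, DB};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut opts = Options::default();
    opts.create_if_missing(true);
    let db = DB::open(&opts, "/tmp/geo-sketch")?;

    // Primary-key fetch: entity id -> serialized geo entity.
    db.put(b"place:841-broadway", br#"{"name":"841 Broadway"}"#)?;
    if let Some(bytes) = db.get(b"place:841-broadway")? {
        println!("{}", String::from_utf8_lossy(&bytes));
    }

    // Because an LSM tree keeps keys sorted, a geo lookup becomes a range
    // scan: seek to the first cell id in the range and iterate forward.
    let lo = 0x89c2_5a10_0000_0000u64.to_be_bytes();
    let hi = 0x89c2_5a1f_ffff_ffffu64.to_be_bytes();
    for item in db.iterator(IteratorMode::From(&lo, Direction::Forward)) {
        let (key, _value) = item?;
        if key.as_ref() > hi.as_slice() {
            break; // past the end of the covering cell range
        }
        // ...decode and collect the entity stored under this cell id
    }
    Ok(())
}
```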
RocksDB is very fast and very write-focused, but it's also effective in
terms of storage. And I guess that plays in your favor, because if you try to
geocode the world, you need a lot of storage. And once the storage
is quite optimized, you get really decent cache locality on top of it for free.
So it might have been a really great choice.
Yeah, there are a lot of nice aspects to a log-structured merge tree.
And I think most modern database implementations use one in some shape or another.
And it's largely this concept of, obviously there's so much engineering around
it, but everything is sorted.
So that gives you a huge advantage, and that is something that many people take advantage of.
So in our blog post, we talk about, you know, very fast geo lookups using this
library called S2, which is essentially a geo hashing library.
And so what that means is you have a latitude and you have a longitude.
That's two dimensions. Right.
And so it's not clear at first how you can use sorting to help you make lookups faster.
And so, I mean, there is literally a thing called geo hashing.
And there are other implementations: from Uber, there's this thing
called H3. From Google, there's this thing called S2.
And essentially it collapses the latitude and longitude into a single
64-bit integer, a u64, I guess, in Rust terms.
And it has many nice properties. Obviously, there are boundary conditions,
because longitude and latitude go from negative 180 to 180 and negative 90 to 90,
respectively, so there are some edge cases to take into account. But for the most part,
because somebody smarter than me came up with this idea of using a Hilbert curve,
you maintain locality with adjacent IDs,
which means that things that are close to each other will,
if you sort them, be next to each other.
Obviously barring some boundary conditions. But that fits really
well into a system that keeps things sorted, like a log-structured merge tree.
So you're able to make very efficient range and geo queries from something like that.
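To illustrate the idea of collapsing two dimensions into one sortable integer, here is a sketch using a Z-order (Morton) curve rather than S2's actual Hilbert curve on the sphere; the quantization is simplified and hypothetical:

```rust
// Illustrative only: S2 maps points on the sphere to u64 cell ids along a
// Hilbert curve. This sketch uses the simpler Z-order curve to show the same
// core trick: collapse (lat, lng) into one u64 so nearby points get nearby keys.
fn spread_bits(v: u32) -> u64 {
    // Spread the 32 bits of v out to every other bit of a u64.
    let mut v = v as u64;
    v = (v | (v << 16)) & 0x0000_FFFF_0000_FFFF;
    v = (v | (v << 8)) & 0x00FF_00FF_00FF_00FF;
    v = (v | (v << 4)) & 0x0F0F_0F0F_0F0F_0F0F;
    v = (v | (v << 2)) & 0x3333_3333_3333_3333;
    v = (v | (v << 1)) & 0x5555_5555_5555_5555;
    v
}

fn cell_id(lat: f64, lng: f64) -> u64 {
    // Quantize each coordinate into a 32-bit grid, then interleave the bits.
    let x = (((lng + 180.0) / 360.0) * u32::MAX as f64) as u32;
    let y = (((lat + 90.0) / 180.0) * u32::MAX as f64) as u32;
    spread_bits(x) | (spread_bits(y) << 1)
}

fn main() {
    // Two nearby points in Manhattan produce numerically close cell ids,
    // so a sorted store can answer "what's near me?" with a range scan.
    println!("{}", cell_id(40.7128, -74.0060));
    println!("{}", cell_id(40.7130, -74.0061));
}
```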
And all of that combines into a very efficient single-binary lookup service.
What do you use for the fuzzy searching part that you mentioned?
Right, so this is more related to the forward geocoding side, which is
essentially translating your text query into some sort of geo entity.
And so one of the requirements we had to deal with was essentially being able
to handle a little bit of typo tolerance from our address validation service.
And that comes in many different forms. There are so many failure
cases for search, which is a little bit different from more typical web applications,
where you click through a couple of things that you expected.
Really, the different use cases are
literally every single character that a user can type. Those are all
the potential use cases, so the cardinality is extremely high, and essentially
the number of failure cases is almost unbounded in some sense. There are
just so many combinations, so many ways to type something in.
So we deal with fuzzy search in a couple of ways.
We use a library called fst. And I remember there's an episode you had with
Charlie Marsh from Astral. And I think
there's an engineer who works there now. I only remember his GitHub name because
it's very memorable: burntsushi.
But he works at Astral, if I remember correctly.
And he's come up with a lot of really interesting Rust crates.
And I think he even did the regex implementation for Rust.
Yeah, exactly. Andrew Gallant.
Right. And yeah, so we make a lot of use of his libraries. fst is essentially a character graph.
There's this concept of a trie that's very typically used for prefix queries.
And you can sort of think of an FST as a trie where all the prefixes
that are shared compress, but it also compresses the suffixes.
So now you have almost double compression, right? And essentially,
doing text lookups is a traversal of the graph.
And you can see how that sort of primitive lets you do many things.
It lets you implement like a regex because the regex sort of works the same way.
And so for fuzzy search, you can implement Levenshtein distance.
If you type something incorrectly,
you can have certain thresholds. We use an edit distance of one, because
more than that is essentially combinatorially expensive.
And we have other techniques for typo tolerance.
But we use this library essentially for, one, obviously prefix search.
But then two, it does have some built-in ways to do Levenshtein distance,
which we're actually starting to move off of a little bit.
Because now we're actually, as the project is starting to succeed,
we're getting more queries.
And so because you know what users typed and what their end result actually
was, we're able to, at scale, sort of figure out, oh, these typos actually map
to this type of, this actual correct spelling.
And so we're able just to do those lookups much quicker. But the funny part
is for some of these cases, we actually also indexed that into the FST.
So in a lot of ways, you can almost think of the FST as a
hash map, or maybe even a BTreeMap, of a string to a u64.
It's maybe a compressed version of that. But more than that,
it's really just a way to cache high cardinality text in a very compressed way.
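A minimal sketch of this with burntsushi's fst crate (compiled with its "levenshtein" feature); the street names and IDs are hypothetical:

```rust
use fst::automaton::Levenshtein;
use fst::{IntoStreamer, Map, Streamer};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Keys must be inserted in lexicographic order; values are u64s,
    // for example entity ids.
    let map = Map::from_iter(vec![
        ("bleecker street", 1u64),
        ("broadway", 2),
        ("prince street", 3),
    ])?;

    // Edit distance of one, as in the episode: "brodway" still finds "broadway".
    let lev = Levenshtein::new("brodway", 1)?;
    let mut stream = map.search(lev).into_stream();
    while let Some((key, id)) = stream.next() {
        println!("{} -> {}", String::from_utf8_lossy(key), id);
    }
    Ok(())
}
```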
Yeah. The way I always think of an FST, and I might be wrong here,
is as a very fast state machine for looking up
the existence of words very efficiently and very quickly.
So you mentioned the trie data structure, which is similar in that it stores all of the inputs in a tree,
but it knows when there is no possible way for a word to exist in the data storage
because you reached a leaf node in the tree and there are no child nodes.
So you know, for example, but this is not a hit in your data structure.
And in your blog post, you mentioned a few other crates.
We just wanted to quickly do a quick shout out here.
tantivy would be one, LightGBM, and fastText. What are these?
Right, so I'll give a quick rundown of these. So for any sort of search system,
the most fundamental data structure is this thing called an inverted index.
So that implies like a forward index. So maybe I'll explain what that is first.
And so forward index is more like a traditional database. So record one maps
to Broadway, record two maps to Prince Street.
The inverted index sort of switches that around. So you first tokenize,
and that is its own whole topic.
People research on how to tokenize text, especially with all AI and machine learning trend now.
But you can then say, oh, Broadway, the token, maps to document one, and then Prince
maps to document two, and Street maps to document two.
So when you type in Prince, then, I mean, it's not a hash map in these
implementations, but essentially you just look up the word Prince and then you
get all the documents that are related.
And once you have these documents, you can perform
set operations to essentially narrow down which documents are relevant.
And then Tantivy offers this thing called BM25,
which you can think of as being like tf-idf.
But essentially, once you have these documents, and this is the difference
between a database and a search engine,
how do you rank which of these documents is most relevant?
That is its own whole topic:
there's a lot of work around figuring out what is most relevant,
and it's different for every company and every use case.
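For reference, the textbook Okapi BM25 term score looks like this, sketched with the common k1 and b defaults; this is the general formula, not Tantivy's exact implementation:

```rust
fn bm25_term_score(tf: f64, df: f64, n_docs: f64, dl: f64, avgdl: f64) -> f64 {
    let (k1, b) = (1.2, 0.75);
    // Inverse document frequency: rare tokens ("841") outweigh common ones ("street").
    let idf = ((n_docs - df + 0.5) / (df + 0.5) + 1.0).ln();
    // Term frequency saturates with k1, normalized by document length.
    idf * (tf * (k1 + 1.0)) / (tf + k1 * (1.0 - b + b * dl / avgdl))
}

fn main() {
    // A rare token scores much higher than a very common one.
    println!("{:.3}", bm25_term_score(1.0, 10.0, 1_000_000.0, 4.0, 6.0));
    println!("{:.3}", bm25_term_score(1.0, 800_000.0, 1_000_000.0, 4.0, 6.0));
}
```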
So Tantivy, you know, that's a library that came from, I think,
these people from France.
And it's funny, they were essentially building an Elasticsearch replacement,
which we've been talking a little bit about, called Quickwit.
which is, we've been talking a little bit about that, called QuickWit.
And so there's just a lot of primitives that, you know, that came along with
this sort of Rust ecosystem. And that was one of the libraries that really drew us to using Rust.
We're actually migrating to a different implementation that doesn't
use any of these libraries now.
We use this thing called Roaring Bitmap, but it's the same concept.
And we're doing that largely because of the more structured nature of
some of the geo entities we have.
You know, we have addresses and regions, and people tend to type addresses in
a certain way. So we can take more advantage of that and fine-tune how we do our search workloads.
But at the core of it really is this concept of an inverted index.
So that is, I would say, the core of geocoding.
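A minimal sketch of an inverted index with the roaring crate mentioned here; the documents and the whitespace tokenizer are hypothetical stand-ins:

```rust
use roaring::RoaringBitmap;
use std::collections::HashMap;

fn main() {
    // Forward data: doc 1 = "841 broadway", doc 2 = "prince street".
    let docs = [(1u32, "841 broadway"), (2, "prince street")];

    // Inverted index: token -> bitmap of document ids containing it.
    let mut index: HashMap<&str, RoaringBitmap> = HashMap::new();
    for (doc_id, text) in docs {
        for token in text.split_whitespace() {
            index.entry(token).or_default().insert(doc_id);
        }
    }

    // Query "prince street": intersect the posting lists of both tokens.
    let mut candidates: Option<RoaringBitmap> = None;
    for token in ["prince", "street"] {
        let postings = index.get(token).cloned().unwrap_or_default();
        candidates = Some(match candidates {
            Some(acc) => acc & postings,
            None => postings,
        });
    }
    // Doc 2 is the only candidate; ranking would decide among many.
    println!("{:?}", candidates.unwrap().iter().collect::<Vec<u32>>());
}
```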
And then we can talk a little bit about LightGBM and fastText.
And these things always move very quickly.
So we're actually considering moving off of these libraries,
but we're still doing something that would serve what these libraries do.
And so LightGBM is a gradient-boosted tree.
I think if you take any basic machine learning courses, you'll learn about decision
trees. So the concepts are really the same.
And essentially, it's a way to do a couple of different functions. The first is classification:
if I receive an email, does it seem like a positive sentiment, a negative
sentiment, or a neutral sentiment?
It does interpolation.
So given the name and the country and I don't know, the sports that somebody
does, try and guess their height. You know, it's like, you know,
four feet to, I don't know, seven feet, something like that.
And that's interpolation.
It does learn to rank, which is particularly interesting.
And as I mentioned, we do, you know, from the inverted index,
we do ranking, which in some sense is like really just a primitive of classification.
But essentially what it's saying is like, given this list of entities,
which one has the highest rank and which one has the lowest rank?
And you can really just think of that as like pairwise comparisons.
So like if I'm given, if I type in 841 Broadway and I'm in this latitude and
longitude, is it referring to 841 Broadway in Brooklyn or 841 Broadway in Manhattan?
And so that is essentially what
Learn to Rank is, but you can apply it to more entities than just two.
But really it is just sort of classification if you think about it.
There's a Broadway in Brooklyn?
Yes, there is. I think when you start to learn about the world,
I don't think people are really all that creative with naming things.
So it's not just programmers that have problems with naming things.
Everybody has those issues.
Nice. Well, it's literally a "broad way." It's just a wide street, so it totally makes sense. Right.
It is also a very wide street in Brooklyn, so we consistently have issues with
people trying to figure out which Broadway we're talking about.
Okay, we have all of those libraries. Just to recap that part,
when the service boots up, it starts to ingest all of that data that is more
or less stored in some flat storage.
You start to ingest it, you start to build your understanding, your model of
the world, quite literally.
And then you have that highly efficient lookup functionality.
And through the API, then you can find addresses really quickly and map addresses
to certain geolocations and maybe even vice versa if you needed to.
Right.
Yeah, and you can already see that because there are so many of these assets,
the startup time does take a little bit of a hit.
Can you roughly tell us how long it takes?
So we download all these assets in parallel from S3, and we do actually try
and sort of cache these assets.
So we deploy in Kubernetes, and generally when we're making new releases,
only a couple of things change, like, oh, maybe you only updated places.
So we only need to pull those changes from S3.
And so it's maybe a couple of minutes.
But really, the future of this is we want
to get it to less than a minute, or even
30 seconds or so. And
this is another area where I think Rust, and all the work
around the data infrastructure area, helps. Before, it was
a lot of these Java services from all the Apache projects, they're all using
Java. We're seeing a whole ecosystem of these data infrastructure pieces
being moved to Rust.
And so one of those things is we do want to start to separate the storage and
compute in a way that our database can start to come up right away or download
the minimum amount of assets.
And the fact that we almost memory-map some of these assets
translates in a pretty straightforward
way to S3, because it offers a range API.
And because people have built libraries to essentially cache S3 on disk,
what we're thinking about is building a storage layer that almost caches S3.
Then, while the service is booting up and hasn't pulled in the assets to the local NVMe disk,
it can temporarily pull in ranges of bytes from these storage services,
so it can immediately start serving queries.
Obviously, there's going to be a latency hit, but in terms of being able to
deal with things like, oh, we have a traffic spike, that's sort of one of the
problems we're trying to solve.
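A sketch of such a range read, assuming the aws-sdk-s3 and tokio crates; the bucket, key, and offsets are hypothetical:

```rust
use aws_sdk_s3::Client;

// Serve a byte range of a data asset straight from S3 while the full local
// copy is still downloading.
async fn read_range(
    client: &Client,
    bucket: &str,
    key: &str,
    start: u64,
    end: u64,
) -> Result<Vec<u8>, Box<dyn std::error::Error>> {
    let resp = client
        .get_object()
        .bucket(bucket)
        .key(key)
        .range(format!("bytes={start}-{end}")) // S3 honors HTTP Range headers
        .send()
        .await?;
    Ok(resp.body.collect().await?.into_bytes().to_vec())
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = aws_config::load_from_env().await;
    let client = Client::new(&config);
    // For example, read an index header without waiting for the whole asset.
    let bytes = read_range(&client, "geo-assets", "places.idx", 0, 4095).await?;
    println!("fetched {} bytes", bytes.len());
    Ok(())
}
```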
And I think the fact that there's so much development around data infrastructure
projects, we use Spark at Radar, and so many companies use Spark,
this whole concept of data infrastructure in the cloud and using object storage.
We're seeing new streaming frameworks written in Rust, and a lot of these new
Iceberg tools in Rust. That's going to help us a lot,
it's going to accelerate essentially what we want to build, where we separate
storage and compute.
How long did it take you to build and ship the first version of HorizonDB?
Right. So as I mentioned before, our first use case was address validation.
We had a customer who was asking for it, and we built it in
about a month, actually. I think less than that
to actually get something working. But to get it closer to full production
with this customer, maybe it was two months, with a lot of
QA and a lot of feedback, things like that. But I would say it took about two months.
Did you have a lot of Rust production outages?
Not really. One, because we were only onboarding a single customer, and
the QPS wasn't that high to begin with. And two, because we had this property
that everything was self-contained,
it was very trivial for us to do blue-green deploys, which also include the
data as well. So, hey, you processed a string in the wrong way and you're
starting to see that: you just flip back to the previous service, and it's
all self-contained. That might be harder to do with a single
backing database or things like that.
Can you even remember a single outage or a single error or a panic in production with Rust?
Yes. So I will say, we do make use of this geo library from Rust, which
is very popular, and one of the top issues is that it panics.
It doesn't handle errors, which is un-Rust-like, I guess.
We have this service called Sentry that we use for error logging.
And there's essentially no exceptions normally apart from that library.
But typically, we don't see stack traces.
We do see outages. And that's just due to spikes or essentially queries of death.
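One way to contain a panicking dependency like that, sketched under the assumption that the library in question is the popular geo crate; the guard function itself is hypothetical:

```rust
use geo::{polygon, Contains, Point, Polygon};
use std::panic;

// Run the geometry predicate under catch_unwind so a bad polygon degrades to
// "no answer" instead of killing the whole request handler.
fn contains_safe(polygon: &Polygon<f64>, point: Point<f64>) -> Option<bool> {
    panic::catch_unwind(panic::AssertUnwindSafe(|| polygon.contains(&point))).ok()
}

fn main() {
    let square: Polygon<f64> = polygon![
        (x: 0.0, y: 0.0), (x: 4.0, y: 0.0), (x: 4.0, y: 4.0), (x: 0.0, y: 4.0),
    ];
    assert_eq!(contains_safe(&square, Point::new(2.0, 2.0)), Some(true));
}
```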
And there, you know, there's a
whole podcast we could do about reliability and things like that. But
the service itself is pretty efficient, and our
rewrite of the search indexes is going to make the service even faster. And,
as I mentioned, the storage and compute separation is
going to introduce another level of scalability and being able to
handle spikes. Generally, I would say, if we're looking at the proportion,
there are really two main backend servers:
our API server, and then HorizonDB.
I would say there is not a significant amount of outages from that.
And talking about scale, can you give us a few numbers? Number of requests,
number of nodes, et cetera?
Right. So at a steady state, we're looking at about 18,000 queries per second.
That's a lot of queries. Yeah, I would say it's a lot.
Maybe at other companies it's not super high, because there are some cases
where things fan out and suddenly you get hundreds of thousands of TPS.
But yeah, it's enough that if you make a certain optimization,
you can quickly tell if things went right or went horribly wrong.
How many nodes do you need to handle that traffic?
Right. So we have 30 nodes. It's significantly over-provisioned now,
which is why we're looking into how we can be more efficient about dealing with spikes.
We actually already shard out the database servers by different use cases and
by different tiers of customers, because we are a SaaS company,
essentially to isolate certain workloads and prevent a bad
query from one customer causing outages that affect others.
So all that's to say, we are quite over-provisioned there, but we're working on it.
Well, that's great, you could do with less. That's always a good thing that managers
want to hear, because saving costs becomes much easier if you can just shut
down some of those nodes when you don't need them at the moment.
Yeah, that's right. Cost cutting, I wouldn't say it was the primary
motivation for getting this started, but we did see a lot of benefit from that.
So even running 30 nodes, we were able to replace a number of Elasticsearch instances.
And we had represented geo queries in many different ways, denormalized across
different stores like Redis and Mongo as well.
In total, I would say, even with the 30 nodes and the over-provisioning,
we probably saved maybe
mid-to-high five figures in monthly spending for some of these things.
Impressive. How many queries per second does one box do, and how much CPU does it use, roughly?
Right, so on average, which is a misleading metric, because geocoding,
which is search, is a very different workload from reverse geocoding:
we do 600 queries per second per box, and it's about 20 to 30% CPU at steady
state, which means it can go higher.
And I would say that our search workloads are probably 10x,
sometimes even 100x the cost of reverse lookups as well as primary key lookups.
Would you say it's rather a CPU- or compute-bound problem, or a storage- or IO-bound problem?
Because on one side you have hashing, which can
be very expensive, but maybe that's just a thing that you do at the beginning
when you boot up, and then it's not really as compute-intense later on during
operations. Or would you say, no, actually that is kind of the bottleneck?
Or would you say, no, the storage part is the bottleneck?
So, you know, breaking it down into the different use cases: geocoding,
which is a search query; reverse geocoding, which is a lat-long lookup; and
then maybe even primary key lookups.
I would say the latter two are more IO-bound.
Maybe you pay some cost of serialization and deserialization from some of the storage
systems, which will change soon, because there are a lot of very nice zero-copy
libraries in Rust.
The CPU-intensive queries tend to be search.
I would say it's actually both, because if you think about it,
as I mentioned, with this inverted index structure, you have to do many key lookups, right?
You tokenize your query: 123 Fake Street, Brooklyn, New York.
And in fact, there are many ways to express it with synonyms. Maybe you expand the
query: instead of 123 Fake St, it could be Street.
Or if you know the user is in Germany, it's Straße, and we haven't indexed or
normalized everything in a certain way.
So we do expand the queries and you can see how it can fan out to many keys being pulled back.
And once you have all these candidates, you essentially have to rank them.
And so there is a little bit of CPU intensiveness in that, but we're working
on optimizing these things.
And I will say that if you do decide to build this sort of analytics or search
system: we're very big fans of this new approach with Roaring Bitmaps,
which is actually an implementation that you see across many different languages.
But we found, at least for the candidate fetching, where we're essentially
in the inverted index trying to pull out which candidates match a criterion,
those tend to be microsecond operations for most of our geocoding queries.
So that's why we're moving over to a more dedicated and specific system for our use case.
Yeah. You said it yourself, lots of moving parts.
What's the median latency of a forward geocoding request and maybe a reverse
geocoding request in comparison?
Yeah, so the median latency is about 50 milliseconds.
And that's because, when we were building this system and seeing customer queries,
we saw that most customers tend to enter the right thing more or less:
they tend to enter their address correctly,
even if there are some spelling
mistakes, and they tend to have a more predictable query pattern.
And that's sort of why I describe the FST as a high-cardinality text cache.
Because almost everything, I would say 70% to 80% of the queries,
especially for US and Canada addresses, will just hit the FST.
And all of that, we have the luxury of being able to store
in memory, because it's so compressed.
And in Rust, we're writing it as a single process with multi-threading
for concurrency, versus something like Node or Python, where,
I mean, there are starting to be thread concepts,
but the classical way is just to have multiple processes.
And that means you need multiple copies of the data structure in memory.
And so oftentimes we're hitting that. And those, at least that side of things,
those operations are usually at most single-digit milliseconds.
Depends on the query, but a lot of times we will see things show up in less
than a millisecond for that side of things.
For reverse geocoding, as I mentioned,
to speed up your algorithms, oftentimes the question is: can you sort it or can you hash it?
And essentially, we have hashed it by taking the S2 library where we convert
the latitude and longitude, two dimensions, into one dimension.
And that's essentially just a key that you look up in RocksDB.
So many of our workloads look like that. And once it's like that,
it's almost no different from a primary key lookup.
It's just a key value lookup. So we're seeing less than a millisecond for those types of workloads.
Nice. And all of that with how many lines of Rust code in total for the server
implementation at the moment?
Right. So in terms of our code base, it's about 150,000 lines,
which I would say is actually not very much.
And I haven't taken the time to analyze which code line is serving which use case.
But I will also say, there is a good percentage of stuff that's
kind of hard-coded right now that we could probably simplify. I think,
to get a minimal implementation, maybe you could...
when we first started doing address validation, it was a couple thousand
lines. It wasn't anything too big.
And that sort of speaks to how expressive Rust is for
expressing certain concepts.
Are there any tips and tricks
that you could share with people who are planning
to build a larger Rust application? I'm
thinking of composition patterns, usage of traits, how you structure your
data objects, immutability. Anything that comes to mind that you learned,
maybe something that is uncommon in Ruby or Scala or any other language that you knew before?
I would almost take the approach, especially when you're starting
out, of not overthinking these things.
I do think that Rust still has like so many different concepts.
And we're always like, honestly, as we're writing, I don't even feel like I'm
an expert in Rust at all. Like I always learn something new.
So with that in mind, especially like if you're first starting out with Rust,
just try and get something to work and then sort of optimize after.
Because I guess they always say premature optimization, you know, is...
It's the root of all evil.
Basically, yeah. So try and get the simplest thing to work and then only apply
design patterns if you're starting to find you're really repeating yourself
or it becomes very finicky.
I will say that, especially in this age of AI editors that will essentially
copy and paste everything for you, I don't know.
But yeah, the way to think about it is really,
rather than taking very high-level approaches that can actually inhibit
or prevent you from doing the thing you want to do,
just do the simplest thing, and then apply more optimizations or introduce more
design patterns as you start to see the problems arise.
And then you really do get a sense of what problem you're trying to solve.
Yeah, perfect. That's right down my alley, even in
a geocoding sense, because I recently wrote an article called Be Simple, and I'm
very much in favor of that. Unfortunately, we're getting close to the end. We had
a blast, so the time flew by. The last question is: what's your message to the Rust community?
I would say, obviously, keep doing what you're doing. I think the Rust ecosystem is very vibrant.
And I think there's an interesting trend that I've seen,
and it's almost Hacker News driven, where it's like, oh,
I built this service and it's written in Rust.
But we're actually seeing that
almost as a marketing approach for real businesses.
There are so many new, interesting pieces of infrastructure that are becoming
businesses that are open core.
I wouldn't say fully open source, because I think the open source purists
would say it really depends on licenses.
But I think it's such a flexible and modern programming language that
you get a lot of the productivity and enjoyment aspects that you would
traditionally get from some scripting languages. I would love to see
more of these businesses that are really open source and open core,
to really inspire what you can do with the language.
And I would also love to see more
work around even just the sort of 80% of things that most developers
are doing these days, which is really just having a nice web app experience.
I think there's still a lot to do there.
I know people brand it as an infrastructure or systems language,
but it really is a very enjoyable language to use for many different use cases.
Even, like, I think, I know Elektron, they're in Sweden, and they
make a lot of music production equipment, and a lot of their code is written
in Rust. So you really see all these different aspects, because
it's a language that's very expressive, but on the other hand
it lets you get really deep in the weeds, and so it lets you do so many things.
So I would love to see more
support and work around even just basic things, like
a lot of web apps. Who knows, maybe there will be a Rails for Rust
in the future. That would be super interesting.
There's loco-rs. Yeah, I guess that's true.
Jeff,
that was amazing. Thanks so much for taking the time today.
Yeah, likewise. It was a very fun chat.
Rust
in Production is a podcast by corrode. It is hosted by me, Matthias Endler, and
produced by Simon Brüggen. For show notes, transcripts, and to learn more about
how we can help your company make the most of Rust, visit corrode.dev.
Thanks for listening to Rust in Production.