WEBVTT

00:00:01.630 --> 00:00:06.250
<v Matthias>Here's Rust in Production, a podcast about companies who use Rust to shape the

00:00:06.250 --> 00:00:07.170
<v Matthias>future of infrastructure.

00:00:07.490 --> 00:00:12.590
<v Matthias>My name is Matthias Endler from corrode, and today we talk to Jeff Kao from Radar

00:00:12.590 --> 00:00:16.790
<v Matthias>about building a high-performance geocoding platform with Rust.

00:00:19.250 --> 00:00:22.630
<v Matthias>Jeffrey, thanks a lot for taking the time today for the interview.

00:00:23.050 --> 00:00:25.610
<v Matthias>Can you say a few words about yourself and about Radar?

00:00:26.650 --> 00:00:34.550
<v Jeff>Yeah, happy to be on this podcast. So my name is Jeff, and I'm a principal engineer at Radar Labs.

00:00:34.770 --> 00:00:38.790
<v Jeff>We are an enterprise geolocation tech company.

00:00:39.030 --> 00:00:47.390
<v Jeff>So that's a variety of things, spanning from maps and routing to search and

00:00:47.390 --> 00:00:51.150
<v Jeff>geocoding and geofencing, as well as fraud detection.

00:00:51.470 --> 00:00:56.550
<v Jeff>And a bit more about myself, I've been programming for quite some time now.

00:00:56.550 --> 00:01:00.210
<v Jeff>I've worked at a variety of different startups. I've been a freelancer before

00:01:00.210 --> 00:01:03.710
<v Jeff>as well, and even started my own small indie company a while ago.

00:01:03.930 --> 00:01:09.590
<v Jeff>But these days, most of my engineering focus, I would say, is largely on backend

00:01:09.590 --> 00:01:11.370
<v Jeff>and data infrastructure engineering.

00:01:11.790 --> 00:01:16.590
<v Matthias>What's your programming background? What other languages do you know besides Rust?

00:01:17.160 --> 00:01:21.460
<v Jeff>Yeah, so I would say, actually, it's kind of funny.

00:01:21.620 --> 00:01:27.220
<v Jeff>When I first graduated from university, I joined a company called Foursquare.

00:01:27.520 --> 00:01:32.740
<v Jeff>And I think at that time, it was maybe around the 2010s, where there were a

00:01:32.740 --> 00:01:35.220
<v Jeff>lot of companies that were moving from Ruby,

00:01:35.480 --> 00:01:39.080
<v Jeff>because Ruby on Rails, I think at the time was like, and it still is,

00:01:39.220 --> 00:01:43.060
<v Jeff>you know, but I think like, it was almost like peak Ruby on Rails.

00:01:43.460 --> 00:01:47.680
<v Jeff>And then there's the peak migrating from Ruby on Rails to something more quote

00:01:47.680 --> 00:01:52.100
<v Jeff>unquote scalable by companies like Twitter and like Foursquare and like SoundCloud.

00:01:52.740 --> 00:01:57.940
<v Jeff>So it's funny because I worked at two of those companies. And so I've done a lot of work in Scala.

00:01:58.600 --> 00:02:03.000
<v Jeff>You know, in university, I prototyped and I was doing freelancing.

00:02:03.000 --> 00:02:06.160
<v Jeff>And then I used a lot of Ruby on Rails at that time.

00:02:06.380 --> 00:02:10.600
<v Jeff>And so, you know, a lot of like web technology is involved with that as well.

00:02:10.840 --> 00:02:12.060
<v Jeff>So I did a lot of JavaScript.

00:02:13.180 --> 00:02:17.340
<v Jeff>And even at that time, there wasn't TypeScript, but I think people were using

00:02:17.340 --> 00:02:21.060
<v Jeff>CoffeeScript at that time. So I played around with that quite a bit.

00:02:21.360 --> 00:02:24.720
<v Jeff>We actually, it's funny, we migrated when I worked at PagerDuty,

00:02:24.720 --> 00:02:28.820
<v Jeff>we moved some workloads from JavaScript to actually CoffeeScript.

00:02:29.100 --> 00:02:32.840
<v Jeff>But I think it's pretty much defunct now. Nobody really uses it.

00:02:33.400 --> 00:02:38.780
<v Jeff>So those are probably my main languages, you know, dabbled in Python and some

00:02:38.780 --> 00:02:44.160
<v Jeff>C and C++, but never in like a really full-time sort of manner.

00:02:44.380 --> 00:02:52.020
<v Jeff>And so these days at Radar, mostly working with TypeScript, some Scala for some

00:02:52.020 --> 00:02:54.440
<v Jeff>Spark pipelines, of course, a lot of Rust.

00:02:55.320 --> 00:02:59.800
<v Jeff>And yeah, what else? Maybe some Python as well, as a matter of fact.

00:02:59.800 --> 00:03:08.800
<v Matthias>That's pretty impressive: Ruby, Scala, JavaScript, CoffeeScript, Python, TypeScript.

00:03:08.800 --> 00:03:16.620
<v Matthias>That's a lot, and now Rust. What could people learn from the philosophy of Ruby?

00:03:17.610 --> 00:03:25.130
<v Jeff>I think this is sort of like an interesting question because technology comes in cycles, I would say.

00:03:25.570 --> 00:03:30.830
<v Jeff>And that's definitely something I learned with some of the more senior principal

00:03:30.830 --> 00:03:33.330
<v Jeff>engineers at companies I've worked at before,

00:03:33.350 --> 00:03:38.350
<v Jeff>where really having an understanding and appreciation of things that are old

00:03:38.350 --> 00:03:42.890
<v Jeff>really gives you insight onto what's coming next because technology really is cyclical.

00:03:44.200 --> 00:03:47.880
<v Jeff>So I think there's a lot to learn and like about Ruby.

00:03:48.580 --> 00:03:54.960
<v Jeff>And usually, I would say the entry point for most people with Ruby is really Rails.

00:03:55.280 --> 00:04:02.860
<v Jeff>So it's really that concept of making programming languages work for you, in some sense.

00:04:03.700 --> 00:04:08.580
<v Jeff>And so what does that entail? It's almost like, especially in modern languages.

00:04:09.120 --> 00:04:15.140
<v Jeff>A lot of us have come to really be spoiled with these concepts of very powerful

00:04:15.140 --> 00:04:17.880
<v Jeff>list collections that you see from functional languages.

00:04:17.880 --> 00:04:22.860
<v Jeff>So mapping and filtering and reducing.

00:04:23.300 --> 00:04:28.620
<v Jeff>And even these concepts you can see are applied to things that aren't even just

00:04:28.620 --> 00:04:32.680
<v Jeff>programming languages, even in like distributed computation frameworks,

00:04:32.900 --> 00:04:38.360
<v Jeff>like literally MapReduce is that concept and you see that in Spark and with

00:04:38.360 --> 00:04:42.760
<v Jeff>all the new sort of data infrastructure that's written in Rust or even like SQL,

00:04:42.940 --> 00:04:49.100
<v Jeff>because if you think about, you know, like the where clause in SQL is like a filter or,

00:04:49.320 --> 00:04:53.220
<v Jeff>you know, you can do reductions with grouping,

00:04:53.480 --> 00:04:57.560
<v Jeff>like there are all these sorts of ways to express, you know,

00:04:58.080 --> 00:05:03.200
<v Jeff>how you process data in a way that's like very elegant and like it's commonly

00:05:03.200 --> 00:05:05.160
<v Jeff>used throughout all these different paradigms.
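
NOTE
Illustrative sketch, not from the conversation: the parallel between filter/map/reduce and SQL's WHERE and GROUP BY, written with Rust iterators.
The Place type, its fields, and the visits_by_city function are hypothetical names invented for this example.
```rust
use std::collections::HashMap;
// Hypothetical record type, invented for this sketch.
struct Place {
    city: &'static str,
    visits: u32,
}
// Roughly: SELECT city, SUM(visits) FROM places WHERE visits >= min_visits GROUP BY city
fn visits_by_city(places: &[Place], min_visits: u32) -> HashMap<&'static str, u32> {
    places
        .iter()
        .filter(|p| p.visits >= min_visits) // plays the role of the WHERE clause
        .map(|p| (p.city, p.visits)) // plays the role of the SELECT projection
        .fold(HashMap::new(), |mut totals, (city, visits)| {
            *totals.entry(city).or_insert(0) += visits; // the GROUP BY reduction
            totals
        })
}
```
Calling visits_by_city(&places, 2) would drop rows with fewer than two visits and sum the rest per city.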

00:05:05.260 --> 00:05:10.320
<v Jeff>So I really think that, you know, on the Ruby side, for me, when I first started,

00:05:10.440 --> 00:05:12.680
<v Jeff>I was like, wow, this is like Python, but better.

00:05:13.020 --> 00:05:16.120
<v Jeff>But, you know, that's going to make some people angry, obviously.

00:05:16.120 --> 00:05:23.180
<v Jeff>And that was me being a naive college student looking at languages and seeing

00:05:23.180 --> 00:05:28.660
<v Jeff>like, oh, where's the sort of trend going? And why is everybody using Ruby on Rails?

00:05:28.860 --> 00:05:33.160
<v Jeff>And it's really just understanding that it's a very pleasant experience and

00:05:33.160 --> 00:05:34.120
<v Jeff>very expressive language.

00:05:34.660 --> 00:05:41.200
<v Jeff>And having all of those sort of batteries built in gives you a lot of productivity

00:05:41.200 --> 00:05:45.420
<v Jeff>and almost brings more joy to programming, yeah,

00:05:45.980 --> 00:05:49.480
<v Jeff>for really expressing, like, your ideas. Because

00:05:49.480 --> 00:05:52.380
<v Jeff>really, at the end of the day, programming is about creating things.

00:05:52.380 --> 00:05:57.320
<v Jeff>It's almost like a creative profession that I think most people don't

00:05:57.320 --> 00:06:01.000
<v Jeff>really assume is creative from the outside. But you're trying to solve

00:06:01.000 --> 00:06:07.940
<v Jeff>problems and build things, and being able to express things in a very, I guess,

00:06:07.940 --> 00:06:10.980
<v Jeff>like, terse manner is very

00:06:12.200 --> 00:06:17.680
<v Jeff>conducive to you sort of getting into the flow state as a programmer, yeah.

00:06:17.680 --> 00:06:20.860
<v Matthias>When I think of Ruby, I think of elegance. I

00:06:20.860 --> 00:06:26.540
<v Matthias>think of the joy of programming, expressiveness. Where

00:06:26.540 --> 00:06:35.380
<v Matthias>would you see yourself: more on the side of programming as a craft or an art, or

00:06:35.380 --> 00:06:41.120
<v Matthias>would it be more like a discipline of engineering? Where would you see yourself on that scale?

00:06:41.120 --> 00:06:43.880
<v Jeff>I think it's a mix of the two

00:06:43.880 --> 00:06:51.160
<v Jeff>things. You have the tools, but it shouldn't necessarily be, like... You know, the

00:06:51.160 --> 00:06:56.460
<v Jeff>tools help you express something. So maybe, like, drawing parallels

00:06:56.460 --> 00:07:01.780
<v Jeff>to maybe some other hobbies I have, like, say you're a musician, right?

00:07:02.730 --> 00:07:07.730
<v Jeff>Especially like in jazz music, you know, like jazz musicians spend a lot of

00:07:07.730 --> 00:07:11.390
<v Jeff>time practicing because they want to express like, oh, I want to play the chord

00:07:11.390 --> 00:07:15.410
<v Jeff>changes in certain ways and like work on that. Or I want to learn these certain skills.

00:07:16.710 --> 00:07:20.190
<v Jeff>And, you know, at the end of the day, music isn't about scales or chords,

00:07:20.250 --> 00:07:23.030
<v Jeff>but those are the tools you use to express yourself.

00:07:23.250 --> 00:07:26.830
<v Jeff>And once you sort of have those tools ingrained, then they get out of the way,

00:07:26.970 --> 00:07:30.850
<v Jeff>they come natural to you. Then you can express, you know, your music,

00:07:31.010 --> 00:07:34.870
<v Jeff>how you, you know, write music and add these things.

00:07:35.390 --> 00:07:39.790
<v Jeff>So it's sort of like a mix of the two things, right?

00:07:39.870 --> 00:07:45.610
<v Jeff>You want to understand the tools and what's out there and they should help you

00:07:45.610 --> 00:07:47.870
<v Jeff>really ship the thing you're trying to do.

00:07:48.010 --> 00:07:51.530
<v Jeff>And the thing you're trying to do can be very technical. So in a lot of cases,

00:07:51.550 --> 00:07:54.910
<v Jeff>you can get really down like deep into the foundation, depending on what you're

00:07:54.910 --> 00:08:00.010
<v Jeff>doing, a lot of systems programming, you know, and this is a Rust podcast,

00:08:00.230 --> 00:08:02.550
<v Jeff>right? So there's that side of things.

00:08:03.010 --> 00:08:11.390
<v Jeff>But on the other hand, there's a whole, you know, ecosystem of digital products

00:08:11.390 --> 00:08:17.250
<v Jeff>that don't necessarily need, you know, the best tooling or anything like that.

00:08:17.410 --> 00:08:21.410
<v Jeff>I think, you know, back to when, you know, in college,

00:08:21.710 --> 00:08:25.450
<v Jeff>like, I remember just creating some small websites by uploading

00:08:25.450 --> 00:08:28.670
<v Jeff>things to, like, an FTP server from

00:08:28.670 --> 00:08:31.510
<v Jeff>some of these, like, you know, hosting companies that

00:08:31.510 --> 00:08:36.350
<v Jeff>just give you a box, and it's like, here you go, upload, like, index.php. And then,

00:08:36.350 --> 00:08:40.590
<v Jeff>wow, suddenly you have something that's connected to the internet. Like, it

00:08:40.590 --> 00:08:45.510
<v Jeff>can be something as simple as that. So I can see the appreciation for

00:08:45.510 --> 00:08:49.050
<v Jeff>tools, but I really don't think it's necessarily going to,

00:08:49.170 --> 00:08:52.690
<v Jeff>I don't think it's necessarily productive to just focus on it,

00:08:52.730 --> 00:08:54.150
<v Jeff>but it's a mix. It's a mix.

00:08:55.050 --> 00:09:01.010
<v Matthias>Coming back to Rust, with your background, Ruby, Scala, and so on, you have a lot of the

00:09:01.450 --> 00:09:06.370
<v Matthias>functional part down. You have the expressiveness down. That part must have

00:09:06.370 --> 00:09:08.490
<v Matthias>been really easy for you to learn.

00:09:09.090 --> 00:09:13.990
<v Matthias>Algebraic data types do exist in Scala. And now you come to Rust.

00:09:14.150 --> 00:09:16.530
<v Matthias>How was that first impression to you?

00:09:17.390 --> 00:09:23.130
<v Jeff>Yeah, it was very refreshing. Rust really feels modern.

00:09:23.750 --> 00:09:27.670
<v Jeff>And there are so many things to like, coming from, I guess, at Radar,

00:09:27.850 --> 00:09:30.850
<v Jeff>our main programming language, which is TypeScript.

00:09:31.110 --> 00:09:35.030
<v Jeff>We actually migrated to TypeScript from JavaScript a couple of years ago.

00:09:35.230 --> 00:09:40.550
<v Jeff>But looking at the JavaScript ecosystem where essentially there's a library

00:09:40.550 --> 00:09:43.630
<v Jeff>for everything. There's a joke even on Stack Overflow.

00:09:43.770 --> 00:09:47.950
<v Jeff>It's like, how do you add two plus two? Just add this npm package to add two numbers.

00:09:49.290 --> 00:09:56.470
<v Jeff>So there's having really, even at the time where we first started building this

00:09:56.470 --> 00:09:59.050
<v Jeff>and we're definitely not early adopters.

00:09:59.050 --> 00:10:05.590
<v Jeff>We started building our Rust project maybe two and a half years ago or so.

00:10:06.570 --> 00:10:11.330
<v Jeff>There's a rich Cargo crate ecosystem. There's a formatter,

00:10:12.250 --> 00:10:21.730
<v Jeff>flame graphs, and the paradigms are very functional, but you're not forced to use those either.

00:10:21.730 --> 00:10:29.370
<v Jeff>So having a rich data structure ecosystem in the standard library,

00:10:29.750 --> 00:10:36.150
<v Jeff>being able to process vectors with all of the functions that many developers

00:10:36.150 --> 00:10:39.910
<v Jeff>are used to these days really felt refreshing.

00:10:40.730 --> 00:10:49.610
<v Jeff>And when we were starting to build out HorizonDB, our Rust geo service project,

00:10:50.480 --> 00:10:53.880
<v Jeff>Those were some of the characteristics we were really looking for,

00:10:54.160 --> 00:10:59.140
<v Jeff>especially for a team with, like, largely a background in writing, you know, TypeScript.

00:11:01.100 --> 00:11:07.880
<v Matthias>Take us back to that time. What was the tech stack like before you started HorizonDB?

00:11:08.100 --> 00:11:11.920
<v Matthias>What was the team like, the team dynamics?

00:11:12.460 --> 00:11:18.800
<v Matthias>I guess most of them would be TypeScript developers, but there might be other people in the team.

00:11:18.800 --> 00:11:26.560
<v Jeff>I guess at that time, maybe to give some background on maybe the business side

00:11:26.560 --> 00:11:32.680
<v Jeff>of things, we're sort of tasked to build essentially an address validation API.

00:11:33.160 --> 00:11:38.260
<v Jeff>And so that's slightly different from geocoding. And we can talk about these two things.

00:11:38.460 --> 00:11:43.000
<v Jeff>So geocoding, or generally it's synonymous with forward geocoding,

00:11:43.160 --> 00:11:47.160
<v Jeff>that's what most people assume geocoding is, is essentially searching for

00:11:47.160 --> 00:11:52.580
<v Jeff>any geo entity. Whether that's an address, that can be a place or a region.

00:11:52.940 --> 00:11:58.260
<v Jeff>So those tend to be, you know, more specifically called, like, a coarse geocode.

00:11:58.440 --> 00:12:04.040
<v Jeff>So for an address geocode, say I live at, you know, 123 Broadway.

00:12:05.220 --> 00:12:10.140
<v Jeff>The task is then to understand this query, to know that this is an address,

00:12:10.140 --> 00:12:13.460
<v Jeff>and you're looking for the number 123 and the street Broadway,

00:12:13.720 --> 00:12:16.380
<v Jeff>and then return a latitude and longitude.

00:12:16.380 --> 00:12:20.000
<v Jeff>And with our APIs, we offer like some other metadata such as,

00:12:20.100 --> 00:12:23.920
<v Jeff>oh, this is in New York in this postal code and things like that.

00:12:25.060 --> 00:12:28.900
<v Jeff>And so for address validation, there's a slight difference.

00:12:29.040 --> 00:12:35.200
<v Jeff>Address validation is more about understanding the deliverability of mail rather

00:12:35.200 --> 00:12:38.880
<v Jeff>than being focused on, you know, here's the latitude and longitude.

00:12:39.160 --> 00:12:44.140
<v Jeff>It's more like, does this address exist in the post office's database?

00:12:45.080 --> 00:12:50.000
<v Jeff>And so for us to accomplish that, the data format that we got from the post

00:12:50.000 --> 00:12:55.160
<v Jeff>office, essentially they give you ranges of addresses. So Broadway goes from,

00:12:55.420 --> 00:12:57.980
<v Jeff>let's say, 1 to 100, and it has this postal code.

00:12:58.480 --> 00:13:05.900
<v Jeff>Or 841 Broadway has floor 1 to floor 12, and they have these specific postal codes.

00:13:06.000 --> 00:13:10.020
<v Jeff>And it gets even more pedantic with the post office.

00:13:10.320 --> 00:13:16.820
<v Jeff>And the rabbit hole goes very deep with address validation.
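
NOTE
Illustrative sketch, not from the conversation: validating a house number against address ranges like the ones described above.
The AddressRange record shape and the validate function are assumptions made up for this example; the real post office format is not described in the episode.
```rust
// Hypothetical shape of one range record; field names are invented for this sketch.
struct AddressRange {
    street: &'static str,
    low: u32,
    high: u32,
    postal_code: &'static str,
}
// Returns the postal code if the house number falls inside a known range for the street.
fn validate(ranges: &[AddressRange], street: &str, number: u32) -> Option<&'static str> {
    ranges
        .iter()
        .find(|r| r.street.eq_ignore_ascii_case(street) && (r.low..=r.high).contains(&number))
        .map(|r| r.postal_code)
}
```
With a single range covering Broadway 1 to 100, a lookup for number 42 would return that range's postal code, while 123 would return None.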

00:13:16.820 --> 00:13:23.500
<v Jeff>So at that time, we didn't really, we were using an open source service for

00:13:23.500 --> 00:13:28.920
<v Jeff>geocoding, but there wasn't really a way for us to essentially ingest these

00:13:28.920 --> 00:13:30.840
<v Jeff>ranges and validate addresses.

00:13:31.160 --> 00:13:36.360
<v Jeff>So that was sort of the main motivation for us to come up with something new. And so...

00:13:37.260 --> 00:13:40.540
<v Jeff>Our tech stack, you know, at that time didn't have any Rust.

00:13:40.980 --> 00:13:47.840
<v Jeff>We have some front-end, you know, services written with React and Next and JavaScript.

00:13:48.440 --> 00:13:52.480
<v Jeff>And then our API server was written in TypeScript.

00:13:52.860 --> 00:13:57.000
<v Jeff>We had some, like, data processing jobs in Spark and Scala.

00:13:57.260 --> 00:14:01.360
<v Jeff>And then we used Airflow, so some Python as well. But nothing really,

00:14:01.360 --> 00:14:04.960
<v Jeff>you know, statically compiled or...

00:14:04.960 --> 00:14:09.120
<v Jeff>I mean, there's TypeScript, but more in the sense of like, you know,

00:14:09.460 --> 00:14:14.720
<v Jeff>these things translate into some bytecode, sort of like a JVM or into native instructions.

00:14:15.920 --> 00:14:21.700
<v Jeff>And we were sort of expecting, we had more constraints about like, oh,

00:14:21.920 --> 00:14:27.920
<v Jeff>if we want to build a service that does things like this and it overlaps so much with geocoding,

00:14:28.120 --> 00:14:31.880
<v Jeff>we might as well just sort of replace the service because we had some operational

00:14:31.880 --> 00:14:37.220
<v Jeff>issues with our existing geocoder, which we can talk about later. And so...

00:14:38.410 --> 00:14:44.690
<v Jeff>There was sort of a motivation to use something that, whether it's like an external

00:14:44.690 --> 00:14:50.590
<v Jeff>service, like something like Elasticsearch or, you know, having something like all in one.

00:14:51.250 --> 00:14:55.450
<v Jeff>Operationally, like we sort of got burnt by like having so many like external

00:14:55.450 --> 00:14:59.970
<v Jeff>services that we were almost sort of motivated to have something that would

00:14:59.970 --> 00:15:03.290
<v Jeff>just let us do everything in almost one package.

00:15:04.070 --> 00:15:07.090
<v Matthias>Was Rust the only language that you considered for that project?

00:15:07.090 --> 00:15:12.050
<v Jeff>So we were considering a couple of different options. And it's funny because at our company,

00:15:12.250 --> 00:15:16.950
<v Jeff>we write like tech specs or essentially design documents before we sort of go

00:15:16.950 --> 00:15:21.330
<v Jeff>into building some, you know, larger projects or like features or things that

00:15:21.330 --> 00:15:23.710
<v Jeff>will have like big downstream impacts.

00:15:23.710 --> 00:15:29.410
<v Jeff>So actually, looking back at the doc, we were considering a couple of languages,

00:15:29.410 --> 00:15:32.750
<v Jeff>and we sort of discussed the trade-offs there.

00:15:33.010 --> 00:15:38.170
<v Jeff>So we were thinking about Kotlin, and the thinking around there is that the

00:15:38.170 --> 00:15:40.890
<v Jeff>Java and JVM ecosystem is very rich.

00:15:41.850 --> 00:15:45.530
<v Jeff>From my personal experience, I think Scala is very complicated.

00:15:46.330 --> 00:15:50.050
<v Jeff>And, you know, onboarding the whole team to that might have been a little tricky.

00:15:50.050 --> 00:15:55.190
<v Jeff>But Kotlin is a little bit closer to, like, you know, it's sort of in between.

00:15:55.190 --> 00:16:02.190
<v Jeff>And I do feel like there's a little bit more of a backing around it now in, you

00:16:02.190 --> 00:16:04.650
<v Jeff>know, the 2020s versus Scala.

00:16:04.810 --> 00:16:08.510
<v Jeff>Scala really just seems like a little bit more niche and like sort of Spark

00:16:08.510 --> 00:16:13.090
<v Jeff>and even then, like, I think with Spark, I have some opinions about that and

00:16:13.090 --> 00:16:15.790
<v Jeff>like how Rust might play into a world like that.

00:16:16.810 --> 00:16:22.150
<v Jeff>So Kotlin felt like a good in-between. And there was just a whole ton of these sort of

00:16:23.010 --> 00:16:26.130
<v Jeff>interesting ecosystem aspects, such as, you know,

00:16:26.130 --> 00:16:29.850
<v Jeff>Elasticsearch is written in Java. And, you

00:16:29.850 --> 00:16:33.090
<v Jeff>know, essentially Elasticsearch is just a distributed wrapper around Lucene.

00:16:33.090 --> 00:16:35.930
<v Jeff>I mean, obviously it's much more than that, you know, it's an entire company, so

00:16:35.930 --> 00:16:40.530
<v Jeff>it's a very broad stroke. But, you know, there's a whole ton of work around, like,

00:16:40.530 --> 00:16:45.450
<v Jeff>text processing, which is essentially a lot of this. And we felt like, okay,

00:16:45.450 --> 00:16:50.350
<v Jeff>that is a very rich ecosystem that we could potentially use, with the trade-off

00:16:50.350 --> 00:16:55.610
<v Jeff>of, we know we're probably going to have to store large text indexes and all of these things,

00:16:55.750 --> 00:17:00.230
<v Jeff>even synonyms or different spellings or spell correction.

00:17:00.510 --> 00:17:03.810
<v Jeff>A lot of this is like dictionary lookups of strings and things like that.

00:17:03.910 --> 00:17:08.230
<v Jeff>So you can imagine, even from the onset, we're probably going to store giant

00:17:08.230 --> 00:17:10.010
<v Jeff>hash maps or a lot of things in memory.

00:17:10.190 --> 00:17:15.210
<v Jeff>And we'll have to deal with a garbage collector. And one of the motivating factors

00:17:15.210 --> 00:17:20.250
<v Jeff>from something like that as well, I guess before I talk about that,

00:17:20.390 --> 00:17:21.710
<v Jeff>we were also considering Golang,

00:17:23.650 --> 00:17:27.030
<v Jeff>which is sort of similar in that it's garbage collected.

00:17:29.510 --> 00:17:31.790
<v Jeff>And we've seen that a lot of companies

00:17:31.790 --> 00:17:36.070
<v Jeff>have adopted it because it's really trivial to write web services,

00:17:38.030 --> 00:17:44.730
<v Jeff>and it's relatively simple in the sense that there's not a whole lot of different

00:17:44.730 --> 00:17:48.070
<v Jeff>ways to do something versus something like a Scala.

00:17:48.230 --> 00:17:52.350
<v Jeff>You can express your problem, how to do it in many ways.

00:17:52.570 --> 00:17:55.090
<v Jeff>So those were sort of the two biggest contenders.

00:17:55.530 --> 00:18:01.370
<v Jeff>And that was largely because we didn't feel too comfortable about transitioning

00:18:01.370 --> 00:18:05.130
<v Jeff>some JavaScript developers to something like C or C++.

00:18:06.510 --> 00:18:10.710
<v Jeff>And I think the hurdle around like setting up like C and C++ projects,

00:18:10.830 --> 00:18:14.230
<v Jeff>I know there's been so much innovation around, especially C++,

00:18:14.250 --> 00:18:19.130
<v Jeff>but it just didn't seem like such a natural transition for our developers.

00:18:19.670 --> 00:18:24.290
<v Jeff>And sort of similar with Golang, you know, like I think Golang was sort of evolved

00:18:24.290 --> 00:18:29.310
<v Jeff>from Google from like their large C++ monorepo.

00:18:29.310 --> 00:18:36.790
<v Jeff>And so a lot of these things that I mentioned that make programming enjoyable

00:18:36.790 --> 00:18:42.350
<v Jeff>are things like mapping and filtering and all of these sort of expressive aspects.

00:18:42.350 --> 00:18:48.110
<v Jeff>We felt like because we had a bit of a short timeline, it almost felt like a

00:18:48.110 --> 00:18:53.650
<v Jeff>step back in terms of the joy of a programmer being able to express,

00:18:53.810 --> 00:19:01.230
<v Jeff>especially concepts where you have to process strings in very precise ways.

00:19:01.230 --> 00:19:07.270
<v Jeff>In a lot of ways, a lot of search is... If you ever do these LeetCode problems,

00:19:07.390 --> 00:19:08.930
<v Jeff>there's so much string processing.

00:19:09.210 --> 00:19:17.110
<v Jeff>And having more expression really helps you to make the code very clear about

00:19:17.110 --> 00:19:21.330
<v Jeff>some very potentially complicated ways that you process strings.

00:19:22.430 --> 00:19:25.390
<v Matthias>Which leaves us with Kotlin versus Rust.

00:19:25.970 --> 00:19:27.890
<v Jeff>Right, right. And so...

00:19:29.620 --> 00:19:33.860
<v Jeff>We saw, like, there was this, I think there's a very famous blog post.

00:19:33.960 --> 00:19:36.940
<v Jeff>If you go on Hacker News and look up, like, Rust or migrating to Rust,

00:19:37.140 --> 00:19:43.200
<v Jeff>I think the first result is always this one about Discord moving from Golang to Rust.

00:19:43.760 --> 00:19:43.920
<v Matthias>Yeah.

00:19:44.080 --> 00:19:50.580
<v Jeff>And so we, it almost sort of motivated us to see, like, oh, they had occasional

00:19:50.580 --> 00:19:52.340
<v Jeff>garbage collection issues.

00:19:52.340 --> 00:19:56.880
<v Jeff>And even working at like other companies where we use Scala and the JVM,

00:19:57.020 --> 00:20:02.080
<v Jeff>like there was consistently like issues with like the JVM and like the garbage collector.

00:20:02.280 --> 00:20:04.900
<v Jeff>And, you know, there's so much innovation around the garbage collector.

00:20:05.060 --> 00:20:09.480
<v Jeff>But we knew like for a lot of what we were doing, text processing and indexing,

00:20:09.620 --> 00:20:13.820
<v Jeff>things like that, we're going to store large data structures just in memory.

00:20:14.640 --> 00:20:18.360
<v Jeff>I know there's like these concepts of like off heap memory, but that's already

00:20:18.360 --> 00:20:22.720
<v Jeff>sort of off the happy path of the language we're using. So we're almost like

00:20:22.720 --> 00:20:28.520
<v Jeff>putting a barrier or like something impeding us before we even got started.

00:20:28.740 --> 00:20:33.920
<v Jeff>So we really did want to have something where we had a lot more control over the memory.

00:20:34.180 --> 00:20:40.640
<v Jeff>But we also didn't want a language like C or C++ where you sort of have to expect

00:20:40.640 --> 00:20:43.800
<v Jeff>developers to understand these like concepts.

00:20:44.180 --> 00:20:47.020
<v Jeff>Like they still need to understand these concepts to some extent.

00:20:47.020 --> 00:20:53.340
<v Jeff>But you have to be really, really up front and, like, be almost very,

00:20:53.340 --> 00:20:58.120
<v Jeff>very strict about these things. So that would require significant, like, senior

00:20:58.120 --> 00:21:02.340
<v Jeff>engineering talent within that had, like, C and C++ knowledge, which we didn't have.

00:21:02.340 --> 00:21:09.900
<v Matthias>There's a lot of ceremony around allocating and deallocating memory in C,

00:21:09.900 --> 00:21:15.240
<v Matthias>not so much in C++ nowadays, but definitely in C still. And when you say garbage

00:21:15.240 --> 00:21:21.240
<v Matthias>collectors, yeah, that is also true for both Go and Kotlin, or Java to a wider extent.

00:21:22.250 --> 00:21:26.350
<v Matthias>I remember that in the past, we ran a very large Elasticsearch cluster,

00:21:26.530 --> 00:21:28.190
<v Matthias>and those were different times.

00:21:28.430 --> 00:21:34.310
<v Matthias>We tried to migrate that to containers, and it wasn't really easy because operationally,

00:21:34.530 --> 00:21:42.830
<v Matthias>Java had some unfortunate default decisions where it would try to allocate as

00:21:42.830 --> 00:21:45.050
<v Matthias>much memory as it could possibly get on the machine.

00:21:45.230 --> 00:21:49.170
<v Matthias>And if it's co-hosted with a lot of other services on the same machine,

00:21:49.550 --> 00:21:51.430
<v Matthias>that sometimes caused some problems.

00:21:52.590 --> 00:21:58.370
<v Matthias>So did you also run into, or did you also consider these kinds of concerns,

00:21:58.630 --> 00:22:02.150
<v Matthias>the operational part of it, the deployment process?

00:22:02.650 --> 00:22:06.530
<v Jeff>I guess specifically for, you know, how you mentioned Elasticsearch,

00:22:06.790 --> 00:22:08.750
<v Jeff>actually one of the things that, you know,

00:22:09.030 --> 00:22:13.830
<v Jeff>also sort of bit us and also why we decided not to use a JVM language was the

00:22:13.830 --> 00:22:18.450
<v Jeff>fact that we did have to maintain an Elasticsearch cluster to power geocoding

00:22:18.450 --> 00:22:21.910
<v Jeff>with our, you know, old iteration

00:22:22.250 --> 00:22:24.530
<v Jeff>of the geocoder.

00:22:25.910 --> 00:22:29.410
<v Jeff>And we realized we just had to put everything on one machine,

00:22:29.590 --> 00:22:35.390
<v Jeff>which is still the case, but we didn't even use really the sort of distributed

00:22:35.390 --> 00:22:40.110
<v Jeff>aspect of Elasticsearch because of the nature of the workload.

00:22:40.170 --> 00:22:43.670
<v Jeff>It would just end up, you know, essentially fanning out all the queries to all

00:22:43.670 --> 00:22:44.990
<v Jeff>the nodes, even though they're sharded.

00:22:45.230 --> 00:22:50.810
<v Jeff>So at that point, we essentially just got one really, like, fat box, and

00:22:51.810 --> 00:22:54.610
<v Jeff>all the shards of the Elasticsearch cluster just lived on one box.

00:22:54.710 --> 00:22:56.370
<v Jeff>So it almost defeated the purpose of that.

00:22:57.030 --> 00:23:01.430
<v Jeff>And so, you know, maybe this is more of a design thing, more so than Rust.

00:23:01.690 --> 00:23:05.630
<v Jeff>But, you know, that motivation of, you know, Elasticsearch was,

00:23:06.070 --> 00:23:07.230
<v Jeff>you know, contains, let's say,

00:23:07.350 --> 00:23:10.890
<v Jeff>a bunch of geodata and like addresses and places and things like that.

00:23:11.350 --> 00:23:15.750
<v Jeff>And there were a couple of other microservices that we had powering our old geocoder.

00:23:16.750 --> 00:23:21.970
<v Jeff>What we really wanted was not just like a deployment story that was simple for

00:23:21.970 --> 00:23:24.970
<v Jeff>essentially the binary, but our data assets as well.

00:23:26.050 --> 00:23:27.530
<v Jeff>And so what that meant was that

00:23:28.560 --> 00:23:31.760
<v Jeff>it would be much simpler, where the old world is essentially,

00:23:32.000 --> 00:23:33.560
<v Jeff>oh, hey, I want to do a migration.

00:23:33.960 --> 00:23:37.220
<v Jeff>And, you know, developers deal with this all the time. If you want to do a database

00:23:37.220 --> 00:23:41.900
<v Jeff>migration, you essentially have to reason about what's the existing state of your database.

00:23:42.260 --> 00:23:48.280
<v Jeff>You know, do I want to introduce a column? Then you've

00:23:48.280 --> 00:23:52.440
<v Jeff>got to make sure that everything can handle null to begin with.

00:23:52.940 --> 00:23:59.380
<v Jeff>And then, you know, maybe you have like a state change. Maybe you want to increment something.

00:23:59.620 --> 00:24:03.880
<v Jeff>So you have to understand that the state of your database looks like this.

00:24:04.060 --> 00:24:08.180
<v Jeff>You're going to apply this operation. What happens if your database script fails?

00:24:09.320 --> 00:24:15.280
<v Jeff>Then you've got to code in this concept of idempotency where you can rerun the

00:24:15.280 --> 00:24:17.640
<v Jeff>script many times. And that's a lot of things to reason about.

00:24:18.920 --> 00:24:24.940
<v Jeff>So in terms of a deployment story, for sure, we wanted something that was very simple to deploy.

00:24:25.100 --> 00:24:29.460
<v Jeff>And Rust just gives you a simple binary. And, you know, we essentially compile

00:24:29.460 --> 00:24:34.840
<v Jeff>it in CI and essentially ship up the binary to our servers.

00:24:34.840 --> 00:24:43.000
<v Jeff>But also the data assets, we wanted it all to be, you know, self-contained within the same server.

00:24:43.580 --> 00:24:48.840
<v Jeff>And even the way that we process it, actually, we rebuild the data sets from scratch.

00:24:49.000 --> 00:24:53.240
<v Jeff>We never backfill anything. We just take all the raw sources and then,

00:24:53.280 --> 00:24:56.820
<v Jeff>you know, distribute the compute with something like Spark.

00:24:57.300 --> 00:25:00.400
<v Jeff>And essentially compile it to, you know, some data assets.

00:25:00.720 --> 00:25:03.940
<v Jeff>And there are some things that are more specific to, you know,

00:25:04.000 --> 00:25:05.700
<v Jeff>the data formats that we use on our server.

00:25:05.720 --> 00:25:10.260
<v Jeff>So we process those later and then we ship those data assets to all of these boxes.

00:25:10.680 --> 00:25:15.580
<v Jeff>And so, you know, what that means is because everything is self-contained,

00:25:16.200 --> 00:25:19.320
<v Jeff>it's actually very trivial to roll back, which is something that you might not

00:25:19.320 --> 00:25:23.720
<v Jeff>be able to do if you had, say, even a very simple

00:25:23.740 --> 00:25:28.100
<v Jeff>two-tier architecture, like a web app and, say, a SQL server.

00:25:28.680 --> 00:25:30.380
<v Jeff>If you do a data migration...

00:25:31.950 --> 00:25:35.110
<v Jeff>then you have to reason about rolling the data migration back and forward,

00:25:35.290 --> 00:25:37.670
<v Jeff>as well as rolling the binary back and forward.

00:25:37.830 --> 00:25:42.930
<v Jeff>But with our new service, it all sort of goes in one package and lockstep.

00:25:43.170 --> 00:25:46.510
<v Jeff>So if you need a rollback, you just switch everything over and all the data

00:25:46.510 --> 00:25:48.250
<v Jeff>pointers also switch back.

00:25:48.470 --> 00:25:52.670
<v Jeff>And so that's a way to roll back. And you don't have to reason about these many

00:25:52.670 --> 00:25:54.930
<v Jeff>states. It's just one self-contained thing.
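
NOTE
A minimal sketch of the lockstep rollback described above, assuming each release bundles one binary with the data assets that were built for it. The names `Release` and `Deployment` and any version strings are hypothetical, not Radar's actual tooling.
```rust
// One self-contained unit: the binary plus the data assets built for it.
#[derive(Debug, Clone, PartialEq)]
struct Release {
    binary: String,
    data_assets: Vec<String>,
}
// Keeps every shipped release and a pointer to the one serving traffic.
struct Deployment {
    releases: Vec<Release>,
    active: usize,
}
impl Deployment {
    fn new(first: Release) -> Self {
        Deployment { releases: vec![first], active: 0 }
    }
    // Ship a new release and point traffic at it.
    fn deploy(&mut self, release: Release) {
        self.releases.push(release);
        self.active = self.releases.len() - 1;
    }
    // Point back at the previous release; returns false if there is none.
    fn rollback(&mut self) -> bool {
        if self.active == 0 {
            return false;
        }
        self.active -= 1;
        true
    }
    // The release currently serving traffic.
    fn current(&self) -> &Release {
        &self.releases[self.active]
    }
}
```
Because the binary and its data move together, a rollback is a single pointer move and there is no reverse data migration to reason about.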

00:25:55.050 --> 00:25:58.690
<v Jeff>And for that to work, you really do need something that's like very efficient,

00:25:58.690 --> 00:26:03.770
<v Jeff>essentially because you're shipping so many data assets.

00:26:03.770 --> 00:26:11.870
<v Matthias>I see. That's an aspect that I never heard before: because Rust is so fast,

00:26:11.870 --> 00:26:20.650
<v Matthias>you can afford to simplify operations by doing more at startup, but it's still

00:26:20.650 --> 00:26:23.730
<v Matthias>performant, so you can cut some corners thanks to Rust.

00:26:24.550 --> 00:26:29.190
<v Jeff>Right, because it's almost like, hey, you don't even need an external database

00:26:29.190 --> 00:26:34.470
<v Jeff>for something like you would typically grab for in web apps like a Postgres or MySQL.

00:26:34.470 --> 00:26:40.370
<v Jeff>It's, hey, I'm going to use an embedded database or have a large in-memory index,

00:26:40.390 --> 00:26:42.450
<v Jeff>and that's your sort of state.

00:26:42.710 --> 00:26:48.550
<v Jeff>And so all that ships together in one whole unit or package.

00:26:49.530 --> 00:26:51.950
<v Matthias>What do you use for the storage layer?

00:26:53.230 --> 00:27:00.610
<v Jeff>So we use a couple of things. And so we talk about this in the blog post that describes HorizonDB.

00:27:01.150 --> 00:27:04.770
<v Matthias>And we link to it in the show notes. It's a really, really nice write-up.

00:27:05.250 --> 00:27:10.610
<v Jeff>Yeah. I would say our primary storage mechanism is this thing called RocksDB,

00:27:10.830 --> 00:27:17.130
<v Jeff>which is an embedded storage or embedded database.

00:27:17.330 --> 00:27:22.910
<v Jeff>And the way people think about it is it's essentially the database that you use to build databases.

00:27:23.230 --> 00:27:26.650
<v Jeff>So, for example, Facebook or Meta,

00:27:26.810 --> 00:27:32.090
<v Jeff>they originally used this to power the storage layer of MySQL for them.

00:27:32.270 --> 00:27:36.510
<v Jeff>And I think there's a project called MyRocks. MongoDB, for example,

00:27:36.870 --> 00:27:41.110
<v Jeff>they actually used RocksDB for some time. If you look at CockroachDB,

00:27:41.730 --> 00:27:45.030
<v Jeff>their storage layer for a long time was also RocksDB.

00:27:45.390 --> 00:27:46.630
<v Matthias>Same for InfluxDB.

00:27:47.050 --> 00:27:52.650
<v Jeff>Right. There's so many projects around there and so much community backing that

00:27:52.650 --> 00:27:57.070
<v Jeff>we felt it was such a mature project that we could use it for our purposes.

00:27:58.070 --> 00:28:01.390
<v Jeff>I know for sure there are a lot of Rust crates now, things like,

00:28:01.390 --> 00:28:04.310
<v Jeff>I believe, sled, stuff like that.

00:28:04.430 --> 00:28:08.110
<v Jeff>There are many people who write these sort of embedded databases.

00:28:08.710 --> 00:28:15.010
<v Jeff>And essentially, RocksDB is built on this data structure called a log-structured merge tree.

00:28:16.180 --> 00:28:20.740
<v Jeff>And it's really designed for high write throughput, which is sort of different from our use case.

00:28:20.960 --> 00:28:26.020
<v Jeff>But as I mentioned before, technology is so cyclical where, you know,

00:28:26.140 --> 00:28:29.040
<v Jeff>if somebody builds a database that's like high write throughput,

00:28:29.360 --> 00:28:32.680
<v Jeff>well, they actually adopt a lot of these concepts and ways to tune it so that,

00:28:32.780 --> 00:28:35.620
<v Jeff>you know, it's also very well tuned for read throughputs.

00:28:35.620 --> 00:28:43.960
<v Jeff>And so we just felt that, yeah, that, you know, community backing and the sort

00:28:43.960 --> 00:28:49.080
<v Jeff>of, and it's a project that's written not in Rust, right? So that was a sort of another nice thing.

00:28:49.220 --> 00:28:53.160
<v Jeff>We had all the Rust bindings to RocksDB, and it

00:28:53.160 --> 00:28:55.860
<v Jeff>was a very simple integration to be

00:28:55.860 --> 00:28:59.140
<v Jeff>able to just pull in that project. And that

00:28:59.140 --> 00:29:02.480
<v Jeff>is our sort of primary storage layer for, you

00:29:02.480 --> 00:29:06.120
<v Jeff>know, all of our entities: addresses, places, and

00:29:06.120 --> 00:29:12.300
<v Jeff>different regions. And so it serves, you know, a number of different purposes. So

00:29:12.300 --> 00:29:17.840
<v Jeff>obviously, like, primary key fetches. So when our services have, say, like, an event

00:29:17.840 --> 00:29:22.560
<v Jeff>and they have an ID for a place, they'll fetch that from our service,

00:29:22.620 --> 00:29:28.240
<v Jeff>but we also index it in a way that makes it really easy to do geo lookups.

00:29:28.660 --> 00:29:31.080
<v Jeff>So given this lat long, I want to

00:29:31.080 --> 00:29:37.460
<v Jeff>be able to fetch all the relevant geo entities. So am I inside the city?

00:29:37.660 --> 00:29:41.340
<v Jeff>Am I inside this country? Things like that.

00:29:42.420 --> 00:29:46.100
<v Matthias>RocksDB is very fast and very

00:29:46.100 --> 00:29:49.240
<v Matthias>write-focused, but it's also efficient in

00:29:49.240 --> 00:29:54.740
<v Matthias>terms of storage. And I guess that plays in your favor, because if you try to

00:29:54.740 --> 00:30:01.220
<v Matthias>geocode the world, you need a lot of storage. And on top of it, once the storage

00:30:01.220 --> 00:30:06.760
<v Matthias>is quite optimized, you get really decent cache locality for free.

00:30:07.970 --> 00:30:09.650
<v Matthias>So it might have been a really great choice.

00:30:10.070 --> 00:30:15.570
<v Jeff>Yeah, there are a lot of nice aspects about a log-structured merge tree.

00:30:15.910 --> 00:30:21.530
<v Jeff>And I think most modern database implementations use that in some shape or another.

00:30:22.370 --> 00:30:30.350
<v Jeff>And it's largely this concept of, obviously, there's so much engineering around

00:30:30.350 --> 00:30:32.990
<v Jeff>it, but everything is sorted.

00:30:33.090 --> 00:30:38.810
<v Jeff>So that gives you a huge advantage. And like that is something that many people take advantage of.

00:30:39.170 --> 00:30:43.990
<v Jeff>So in our blog post, we talk about, you know, very fast geo lookups using this

00:30:43.990 --> 00:30:47.490
<v Jeff>library called S2, which is essentially a geo hashing library.

00:30:47.750 --> 00:30:51.150
<v Jeff>And so what that means is you have a latitude and you have a longitude.

00:30:51.370 --> 00:30:52.830
<v Jeff>That's two dimensions. Right.

00:30:53.390 --> 00:30:59.890
<v Jeff>And so it's not clear at first how you can use sorting to help you make lookups faster.

00:31:00.410 --> 00:31:04.590
<v Jeff>And so, I mean, there is literally a thing called geo hashing.

00:31:04.590 --> 00:31:08.450
<v Jeff>And there are other like implementations such as from Uber, there's this thing

00:31:08.450 --> 00:31:11.310
<v Jeff>called H3. From Google, there's this thing called S2.

00:31:11.830 --> 00:31:17.510
<v Jeff>And essentially it collapses, you know, the latitude and longitude into a single

00:31:17.510 --> 00:31:21.410
<v Jeff>64-bit integer, or a u64, I guess, in Rust terms.

00:31:22.570 --> 00:31:27.470
<v Jeff>And it has many nice properties because it tends to be the case,

00:31:27.750 --> 00:31:33.190
<v Jeff>obviously there are boundary conditions, because latitude and longitude

00:31:33.190 --> 00:31:36.170
<v Jeff>go from negative 90 to 90 and negative 180 to 180, respectively.

00:31:36.370 --> 00:31:39.570
<v Jeff>So there are some boundary conditions to take into account. But for the most part.

00:31:40.910 --> 00:31:44.750
<v Jeff>Because of, you know, somebody who's smarter than me who came

00:31:44.750 --> 00:31:50.890
<v Jeff>up with the idea of using a Hilbert curve, you maintain locality with adjacent IDs,

00:31:51.270 --> 00:31:54.910
<v Jeff>which means that things that are close to each other will be,

00:31:55.110 --> 00:31:57.870
<v Jeff>you know, if you sort it will be next to each other.

00:31:58.630 --> 00:32:02.850
<v Jeff>Obviously, you know, barring some boundary conditions. But that fits really

00:32:02.850 --> 00:32:08.310
<v Jeff>well into a system that has, you know, things sorted, like a log-structured merge tree.

00:32:08.390 --> 00:32:12.810
<v Jeff>So you're able to make very efficient range and geo queries from something like that.
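
NOTE
A minimal sketch of why a sortable 64-bit cell ID turns a geo lookup into a range scan over sorted keys. This is not S2: it swaps in a simpler Z-order (Morton) curve for the Hilbert curve, and a standard `BTreeMap` stands in for the sorted key space of a log-structured merge tree. The names `cell_id`, `cell_range`, and `nearby` are hypothetical.
```rust
use std::collections::BTreeMap;
// Spread the 32 bits of `v` onto the even bit positions of a u64.
fn spread(v: u32) -> u64 {
    let mut x = v as u64;
    x = (x | (x << 16)) & 0x0000_FFFF_0000_FFFF;
    x = (x | (x << 8)) & 0x00FF_00FF_00FF_00FF;
    x = (x | (x << 4)) & 0x0F0F_0F0F_0F0F_0F0F;
    x = (x | (x << 2)) & 0x3333_3333_3333_3333;
    x = (x | (x << 1)) & 0x5555_5555_5555_5555;
    x
}
// Map a coordinate in [min, max] onto the full u32 range.
fn quantize(value: f64, min: f64, max: f64) -> u32 {
    let t = ((value - min) / (max - min)).clamp(0.0, 1.0);
    (t * u32::MAX as f64) as u32
}
// Interleave latitude and longitude bits into one sortable u64.
fn cell_id(lat: f64, lng: f64) -> u64 {
    spread(quantize(lat, -90.0, 90.0)) | (spread(quantize(lng, -180.0, 180.0)) << 1)
}
// All IDs inside the level-`level` cell containing `id` form one contiguous range.
fn cell_range(id: u64, level: u32) -> (u64, u64) {
    assert!((1..=32).contains(&level));
    let shift = 64 - 2 * level;
    let low_bits = if shift == 0 { 0 } else { u64::MAX >> (64 - shift) };
    let lo = id & !low_bits;
    (lo, lo | low_bits)
}
// Fetch every entity whose ID falls in the same cell as the query point.
fn nearby(index: &BTreeMap<u64, String>, lat: f64, lng: f64, level: u32) -> Vec<String> {
    let (lo, hi) = cell_range(cell_id(lat, lng), level);
    index.range(lo..=hi).map(|(_, name)| name.clone()).collect()
}
```
A point near a cell edge can have neighbors in an adjacent cell, so a fuller version would scan several covering ranges; that is the boundary condition mentioned above.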

00:32:12.810 --> 00:32:20.650
<v Matthias>And all of that combines into a very efficient single-binary lookup service.

00:32:20.650 --> 00:32:25.410
<v Matthias>What do you use for the fuzzy searching part that you mentioned?

00:32:25.410 --> 00:32:30.910
<v Jeff>Right. So this is more related to the forward geocoding side, which is, you know,

00:32:30.910 --> 00:32:36.270
<v Jeff>essentially translating your text query into some sort of, you know, geo entity.

00:32:37.870 --> 00:32:44.470
<v Jeff>And so one of the requirements we had to deal with was essentially being able

00:32:44.470 --> 00:32:48.490
<v Jeff>to handle a little bit of typo tolerance from our address validation service.

00:32:48.670 --> 00:32:52.890
<v Jeff>And that comes in many different forms. Like there's so many like sort of failure

00:32:52.890 --> 00:32:57.290
<v Jeff>cases for search, which is a little bit different from like more typical web applications.

00:32:57.510 --> 00:33:00.690
<v Jeff>There, you click through a couple of things and get what you expected. Here,

00:33:00.690 --> 00:33:03.490
<v Jeff>really, all the different use cases are

00:33:03.490 --> 00:33:07.550
<v Jeff>literally every single character that a user can type. Those are all

00:33:07.550 --> 00:33:12.690
<v Jeff>the potential use cases, so the cardinality is extremely high, and essentially

00:33:12.690 --> 00:33:17.230
<v Jeff>the number of failure cases is almost unbounded in some sense. There are

00:33:17.230 --> 00:33:21.490
<v Jeff>just so many combinations; there are so many ways to type something in.

00:33:22.290 --> 00:33:27.990
<v Jeff>So we deal with fuzzy search in a couple of ways.

00:33:28.590 --> 00:33:36.670
<v Jeff>We use a library called FST. And I remember there's an episode you had with

00:33:36.670 --> 00:33:39.010
<v Jeff>Charlie Marsh from UV. And I think...

00:33:39.820 --> 00:33:45.100
<v Jeff>There's an engineer who works there now. I only remember his GitHub name because

00:33:45.100 --> 00:33:46.420
<v Jeff>it's very memorable, like BurntSushi.

00:33:46.620 --> 00:33:49.680
<v Jeff>But he works at UV, if I remember correctly.

00:33:49.900 --> 00:33:53.420
<v Jeff>And he's come up with a lot of really interesting Rust crates.

00:33:53.680 --> 00:33:58.220
<v Jeff>And I think he even implemented the regex crate for Rust.

00:33:58.640 --> 00:34:00.820
<v Matthias>Yeah, exactly. Andrew Gallant.

00:34:01.380 --> 00:34:11.480
<v Jeff>Right. And yeah, so we make a lot of use of his libraries. But FST is essentially a character graph.

00:34:12.620 --> 00:34:19.060
<v Jeff>And so there's this concept of, like, a trie that's very typically used for prefix queries.

00:34:19.520 --> 00:34:27.560
<v Jeff>And so you can sort of think of an FST as like a trie where all the prefixes

00:34:27.560 --> 00:34:31.800
<v Jeff>that are shared are compressed, but it also compresses the suffixes.

00:34:31.800 --> 00:34:37.380
<v Jeff>So now you have almost like double compression, right? And essentially...

00:34:37.830 --> 00:34:42.550
<v Jeff>Doing text lookups is a traversal of the graph.

00:34:42.790 --> 00:34:46.650
<v Jeff>And you can see how that sort of primitive lets you do many things.

00:34:46.810 --> 00:34:51.110
<v Jeff>It lets you implement like a regex because the regex sort of works the same way.

00:34:51.550 --> 00:34:57.370
<v Jeff>And so for fuzzy search, you can implement, like, Levenshtein distance by essentially

00:34:57.370 --> 00:35:02.710
<v Jeff>keeping track of dropped or mistyped characters, like if you type something incorrectly.

00:35:03.510 --> 00:35:08.610
<v Jeff>You can have certain thresholds there. We use an edit distance of one because

00:35:08.610 --> 00:35:11.910
<v Jeff>more than that, it's essentially combinatorially expensive.

00:35:11.970 --> 00:35:15.330
<v Jeff>And we have other techniques for typo tolerance.

00:35:15.870 --> 00:35:20.390
<v Jeff>But we use this library essentially for, one, obviously prefix search.

00:35:20.650 --> 00:35:25.830
<v Jeff>But then two, it does have some built-in ways to do Levenshtein distance,

00:35:25.830 --> 00:35:28.830
<v Jeff>which we're actually starting to move off of a little bit.

00:35:29.030 --> 00:35:32.770
<v Jeff>Because now we're actually, as the project is starting to succeed,

00:35:33.010 --> 00:35:34.890
<v Jeff>we're getting more queries.

00:35:34.890 --> 00:35:40.530
<v Jeff>And so because you know what users typed and what their end result actually

00:35:40.530 --> 00:35:45.990
<v Jeff>was, we're able to, at scale, sort of figure out, oh, these typos actually map

00:35:45.990 --> 00:35:49.490
<v Jeff>to this type of, this actual correct spelling.

00:35:49.710 --> 00:35:53.050
<v Jeff>And so we're able just to do those lookups much quicker. But the funny part

00:35:53.050 --> 00:35:58.710
<v Jeff>is for some of these cases, we actually also indexed that into the FST.

00:35:59.210 --> 00:36:07.350
<v Jeff>So in a lot of ways, you can almost think of the FST as a

00:36:07.350 --> 00:36:11.710
<v Jeff>hash map or maybe even a B-tree map of a string to a u64.

00:36:13.810 --> 00:36:20.830
<v Jeff>It's maybe a compressed version of that. But more than that,

00:36:20.930 --> 00:36:27.810
<v Jeff>it's really just a way to cache high cardinality text in a very compressed way.
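
NOTE
A minimal sketch of the "sorted map of string to u64" view of an FST, with prefix search and an edit distance of one. It does not use the fst crate and does no compression: a standard `BTreeMap` stands in for the FST, and the fuzzy lookup is a plain linear scan rather than an automaton traversal. The names `prefix_search`, `within_one_edit`, and `fuzzy_lookup` are hypothetical.
```rust
use std::collections::BTreeMap;
// All keys starting with `prefix`, found by seeking into the sorted key space.
fn prefix_search(map: &BTreeMap<String, u64>, prefix: &str) -> Vec<(String, u64)> {
    map.range(prefix.to_string()..)
        .take_while(|(k, _)| k.starts_with(prefix))
        .map(|(k, v)| (k.clone(), *v))
        .collect()
}
// True if `a` and `b` differ by at most one insertion, deletion, or substitution.
fn within_one_edit(a: &str, b: &str) -> bool {
    let (a, b): (Vec<char>, Vec<char>) = (a.chars().collect(), b.chars().collect());
    let (short, long) = if a.len() <= b.len() { (&a, &b) } else { (&b, &a) };
    if long.len() - short.len() > 1 {
        return false;
    }
    // Skip the common prefix.
    let mut i = 0;
    while i < short.len() && short[i] == long[i] {
        i += 1;
    }
    if short.len() == long.len() {
        // Same length: identical, or one substitution at position i.
        i == short.len() || short[i + 1..] == long[i + 1..]
    } else {
        // Lengths differ by one: skip one character of the longer string.
        short[i..] == long[i + 1..]
    }
}
// Typo-tolerant lookup with an edit distance of one (linear scan for clarity).
fn fuzzy_lookup(map: &BTreeMap<String, u64>, query: &str) -> Vec<(String, u64)> {
    map.iter()
        .filter(|(k, _)| within_one_edit(k, query))
        .map(|(k, v)| (k.clone(), *v))
        .collect()
}
```
With a small map of street names, a query like "prinse" would resolve to the "prince" entry, while "prints" is two edits away and would not match.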

00:36:28.590 --> 00:36:33.690
<v Matthias>Yeah. The way I always think of an FST, and I might be wrong here,

00:36:33.810 --> 00:36:38.950
<v Matthias>is it's a very fast state machine for looking up,

00:36:39.900 --> 00:36:43.380
<v Matthias>the existence of words very efficiently and very quickly.

00:36:43.620 --> 00:36:52.000
<v Matthias>So you mentioned the trie data structure that is similar, where it stores all of the inputs in a tree,

00:36:52.660 --> 00:36:59.860
<v Matthias>but it knows when there is no possible way for a word to exist in the data storage

00:36:59.860 --> 00:37:05.840
<v Matthias>because you reached a leaf node in the tree and there are no child nodes.

00:37:05.840 --> 00:37:09.960
<v Matthias>So you know, for example, that this is not a hit in your data structure.

00:37:10.220 --> 00:37:13.600
<v Matthias>And in your blog post, you mentioned a few other crates.

00:37:13.920 --> 00:37:17.820
<v Matthias>We just wanted to quickly do a quick shout out here.

00:37:18.220 --> 00:37:23.660
<v Matthias>Tantivy would be one, LightGBM, and fastText. What are these?

00:37:24.340 --> 00:37:29.580
<v Jeff>Right, so I'll give a quick rundown of these. So for any sort of search system,

00:37:29.880 --> 00:37:33.500
<v Jeff>the most fundamental data structure is this thing called an inverted index.

00:37:34.000 --> 00:37:38.520
<v Jeff>So that implies like a forward index. So maybe I'll explain what that is first.

00:37:38.780 --> 00:37:43.120
<v Jeff>And so forward index is more like a traditional database. So record one maps

00:37:43.120 --> 00:37:46.080
<v Jeff>to Broadway, record two maps to Prince Street.

00:37:46.520 --> 00:37:51.740
<v Jeff>The inverted index sort of switches that around. So you first tokenize,

00:37:51.880 --> 00:37:54.000
<v Jeff>and that is its own whole topic.

00:37:54.200 --> 00:37:59.800
<v Jeff>People research how to tokenize text, especially with the whole AI and machine learning trend now.

00:37:59.800 --> 00:38:07.100
<v Jeff>But you can then say, oh, Broadway, the token, maps to ID one, and then Prince

00:38:07.100 --> 00:38:11.240
<v Jeff>maps to document two and Street maps to document two.

00:38:11.540 --> 00:38:18.900
<v Jeff>So when you type in Prince, then it's a... I mean, it's not a hash map in these

00:38:18.900 --> 00:38:23.100
<v Jeff>implementations, but essentially you just look up the word Prince and then you

00:38:23.100 --> 00:38:24.700
<v Jeff>get all the documents that are related.

00:38:24.700 --> 00:38:29.240
<v Jeff>And so, once you have these documents, you can sort of perform these

00:38:29.240 --> 00:38:33.960
<v Jeff>set operations to essentially narrow down which documents are relevant.
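
NOTE
A minimal sketch of the inverted index and the set operations described above, using only the standard library. The whitespace tokenizer, the `InvertedIndex` type, and the AND-only `search` are simplifying assumptions and are not how Tantivy is implemented.
```rust
use std::collections::{BTreeSet, HashMap};
// Token -> sorted set of document IDs containing that token.
struct InvertedIndex {
    postings: HashMap<String, BTreeSet<u32>>,
}
// Deliberately naive tokenizer: lowercase and split on whitespace.
fn tokenize(text: &str) -> Vec<String> {
    text.split_whitespace().map(|t| t.to_lowercase()).collect()
}
impl InvertedIndex {
    fn new() -> Self {
        InvertedIndex { postings: HashMap::new() }
    }
    // Record every token of `text` as pointing at `doc_id`.
    fn add(&mut self, doc_id: u32, text: &str) {
        for token in tokenize(text) {
            self.postings.entry(token).or_default().insert(doc_id);
        }
    }
    // AND query: intersect the posting sets of all query tokens.
    fn search(&self, query: &str) -> BTreeSet<u32> {
        let mut result: Option<BTreeSet<u32>> = None;
        for token in tokenize(query) {
            let docs = self.postings.get(&token).cloned().unwrap_or_default();
            result = Some(match result {
                None => docs,
                Some(acc) => acc.intersection(&docs).copied().collect(),
            });
        }
        result.unwrap_or_default()
    }
}
```
Indexing record one as "Broadway" and record two as "Prince Street", a search for "prince" returns document two, and adding more tokens to the query can only narrow the candidate set.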

00:38:34.340 --> 00:38:39.200
<v Jeff>And then there's, you know, Tantivy offers this thing called BM25,

00:38:39.280 --> 00:38:41.860
<v Jeff>which you can think of as being like TF-IDF.

00:38:41.980 --> 00:38:43.780
<v Jeff>But essentially, once you have these documents.

00:38:44.930 --> 00:38:47.650
<v Jeff>And this is the difference between a database and a search engine.

00:38:47.950 --> 00:38:51.870
<v Jeff>How do you essentially rank which of these documents is most relevant?

00:38:52.450 --> 00:38:55.830
<v Jeff>I mean, that is its own, you know, that is its own, like, sort of,

00:38:56.030 --> 00:39:00.190
<v Jeff>there's a lot of work around figuring out what is the most relevant.

00:39:00.450 --> 00:39:03.750
<v Jeff>I mean, it's different for every company and every use case.
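
NOTE
A minimal sketch of BM25 scoring, to show how candidate documents from an inverted index can be ranked. It follows the common Okapi BM25 formula with k1 = 1.2 and b = 0.75, which are widely used defaults assumed here; Tantivy's own implementation may differ in details. The function name `bm25` and the pre-tokenized inputs are hypothetical, and the corpus must be non-empty.
```rust
// Score one document against a query, given the whole (non-empty) corpus.
fn bm25(query: &[&str], doc: &[&str], corpus: &[Vec<&str>]) -> f64 {
    const K1: f64 = 1.2;
    const B: f64 = 0.75;
    let n = corpus.len() as f64;
    let avg_len = corpus.iter().map(|d| d.len()).sum::<usize>() as f64 / n;
    // Longer-than-average documents are penalized, shorter ones boosted.
    let norm = 1.0 - B + B * doc.len() as f64 / avg_len;
    let mut score = 0.0;
    for term in query {
        // Term frequency: occurrences of the term in this document.
        let tf = doc.iter().filter(|t| *t == term).count() as f64;
        // Document frequency: number of documents containing the term.
        let df = corpus.iter().filter(|d| d.contains(term)).count() as f64;
        // Rare terms get a higher inverse document frequency.
        let idf = (1.0 + (n - df + 0.5) / (df + 0.5)).ln();
        score += idf * tf * (K1 + 1.0) / (tf + K1 * norm);
    }
    score
}
```
For the query "prince street", a "prince street" document scores higher than a "spring street" one, because it matches both terms and "prince" is the rarer token.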

00:39:04.450 --> 00:39:08.830
<v Jeff>So Tantivy, you know, that's a library that came from, I think,

00:39:09.230 --> 00:39:11.310
<v Jeff>these people from France.

00:39:11.850 --> 00:39:15.630
<v Jeff>And they were building essentially, it's funny, it's an Elasticsearch replacement,

00:39:15.810 --> 00:39:18.470
<v Jeff>which is, we've been talking a little bit about that, called Quickwit.

00:39:18.710 --> 00:39:23.110
<v Jeff>And so there's just a lot of primitives that, you know, that came along with

00:39:23.110 --> 00:39:28.190
<v Jeff>this sort of Rust ecosystem. And that was one of the libraries that really drew us to using Rust.

00:39:28.510 --> 00:39:32.410
<v Jeff>We're actually using, we're migrating to a different implementation that doesn't

00:39:32.410 --> 00:39:33.790
<v Jeff>use any of these libraries now.

00:39:33.930 --> 00:39:36.390
<v Jeff>We use this thing called Roaring Bitmap, but it's the same concept.

00:39:36.390 --> 00:39:42.550
<v Jeff>And we're doing that largely because of sort of the more structured nature of

00:39:42.550 --> 00:39:45.350
<v Jeff>some of the geo entities we have.

00:39:45.590 --> 00:39:49.450
<v Jeff>You know, we have addresses and regions and people tend to type addresses in

00:39:49.450 --> 00:39:54.350
<v Jeff>a certain way. So we can take more advantage of that and fine tune how we do our search workloads.

00:39:54.570 --> 00:39:58.590
<v Jeff>But at the core of it really is this concept of an inverted index.

00:39:58.930 --> 00:40:01.470
<v Jeff>So that is, I would say, the core of geocoding.

00:40:02.130 --> 00:40:04.710
<v Jeff>And then we can talk a little bit about LightGBM and fastText.

00:40:04.910 --> 00:40:06.650
<v Jeff>And these things always move very quickly.

00:40:06.790 --> 00:40:09.730
<v Jeff>So we're actually considering moving off of these libraries,

00:40:09.830 --> 00:40:14.730
<v Jeff>but we're still doing, you know, doing something that would serve what these libraries do.

00:40:15.030 --> 00:40:18.770
<v Jeff>And so LightGBM is a gradient-boosted tree.

00:40:19.050 --> 00:40:23.630
<v Jeff>I think if you take any basic machine learning courses, you'll learn about decision

00:40:23.630 --> 00:40:26.450
<v Jeff>trees. So the concepts are really the same.

00:40:27.410 --> 00:40:33.870
<v Jeff>And essentially, it's a way to do a couple of different functions. First is classification.

00:40:34.370 --> 00:40:40.170
<v Jeff>So if I receive an email, does it seem like a positive sentiment or a negative

00:40:40.170 --> 00:40:42.010
<v Jeff>sentiment or a neutral sentiment?

00:40:42.450 --> 00:40:44.150
<v Jeff>It does interpolation.

00:40:44.630 --> 00:40:50.550
<v Jeff>So given the name and the country and I don't know, the sports that somebody

00:40:50.550 --> 00:40:53.790
<v Jeff>does, try and guess their height. You know, it's like, you know,

00:40:54.070 --> 00:40:57.150
<v Jeff>four feet to, I don't know, seven feet, something like that.

00:40:57.290 --> 00:40:58.530
<v Jeff>And that's interpolation.

00:40:59.370 --> 00:41:03.070
<v Jeff>It does learn to rank, which is particularly interesting.

00:41:03.270 --> 00:41:05.950
<v Jeff>And as I mentioned, we do, you know, from the inverted index,

00:41:06.090 --> 00:41:10.970
<v Jeff>we do ranking, which in some sense is like really just a primitive of classification.

00:41:11.230 --> 00:41:15.490
<v Jeff>But essentially what it's saying is like, given this list of entities,

00:41:15.910 --> 00:41:19.490
<v Jeff>which one has the highest rank and which one has the lowest rank?

00:41:19.490 --> 00:41:22.430
<v Jeff>And you can really just think of that as like pairwise comparisons.

00:41:22.430 --> 00:41:29.230
<v Jeff>So like if I'm given, if I type in 841 Broadway and I'm in this latitude and

00:41:29.230 --> 00:41:34.350
<v Jeff>longitude, is it referring to 841 Broadway in Brooklyn or 841 Broadway in Manhattan?

00:41:34.690 --> 00:41:36.710
<v Jeff>And so that is essentially what

00:41:36.710 --> 00:41:41.410
<v Jeff>Learn to Rank is, but you can apply it to more entities than just two.

00:41:41.650 --> 00:41:44.830
<v Jeff>But really it is just sort of classification if you think about it.

00:41:45.510 --> 00:41:47.010
<v Matthias>There's a Broadway in Brooklyn? Yeah.

00:41:47.560 --> 00:41:52.140
<v Jeff>Yes, there is. I think when you start to learn about the world,

00:41:52.160 --> 00:41:56.680
<v Jeff>I don't think people are really all that creative with naming things.

00:41:57.520 --> 00:42:00.600
<v Jeff>So it's not just programmers that have problems with naming things.

00:42:01.160 --> 00:42:03.360
<v Jeff>Everybody has those issues.

00:42:04.560 --> 00:42:10.120
<v Matthias>Nice. Well, it's just a wide way. It's just a wide street, so it totally makes sense. Right.

00:42:10.300 --> 00:42:16.120
<v Jeff>It is also a very wide street in Brooklyn, so we consistently have issues with

00:42:16.120 --> 00:42:19.240
<v Jeff>people trying to figure out which Broadway we're talking about.

00:42:21.140 --> 00:42:25.120
<v Matthias>Okay, we have all of those libraries. Just to recap that part,

00:42:25.420 --> 00:42:31.840
<v Matthias>when the service boots up, it starts to ingest all of that data that is more

00:42:31.840 --> 00:42:34.180
<v Matthias>or less stored in some flat storage.

00:42:34.180 --> 00:42:40.440
<v Matthias>You start to ingest it, you start to build your, let's say, understanding, your

00:42:40.440 --> 00:42:42.800
<v Matthias>model of the world, quite literally.

00:42:43.320 --> 00:42:49.400
<v Matthias>And then you have that highly efficient lookup functionality.

00:42:50.180 --> 00:42:55.500
<v Matthias>And through the API, then you can find addresses really quickly and map addresses

00:42:55.500 --> 00:42:59.480
<v Matthias>to certain geolocations and maybe even vice versa if you needed to.

00:43:00.380 --> 00:43:00.560
<v Jeff>Right.

00:43:01.850 --> 00:43:06.330
<v Jeff>Yeah, and you can already see that because there are so many of these assets,

00:43:06.650 --> 00:43:08.890
<v Jeff>the startup time does take a little bit of a hit.

00:43:09.470 --> 00:43:11.750
<v Matthias>Can you roughly tell us how long it takes?

00:43:12.790 --> 00:43:17.890
<v Jeff>So we download all these assets in parallel from S3, and we do actually try

00:43:17.890 --> 00:43:19.630
<v Jeff>and sort of cache these assets.

00:43:19.790 --> 00:43:23.750
<v Jeff>So we deploy in Kubernetes, and generally when we're making new releases,

00:43:23.770 --> 00:43:26.930
<v Jeff>only a couple of things change, like, oh, maybe you only updated places.

00:43:26.930 --> 00:43:29.310
<v Jeff>So we only need to pull those changes from S3.

00:43:29.390 --> 00:43:32.410
<v Jeff>And so it's maybe, like, a couple of minutes.

00:43:32.410 --> 00:43:35.110
<v Jeff>But really, the future of this is we want

00:43:35.110 --> 00:43:38.130
<v Jeff>to get it to, you know, less than a minute, or even,

00:43:38.130 --> 00:43:41.010
<v Jeff>like, 30 seconds or so. And so

00:43:41.010 --> 00:43:45.770
<v Jeff>this is another area where I think Rust helps, with all the work

00:43:45.770 --> 00:43:50.890
<v Jeff>around the data infrastructure area. Before, it was

00:43:50.890 --> 00:43:54.170
<v Jeff>a lot of these Java services from all the Apache projects; they're all using

00:43:54.170 --> 00:43:58.650
<v Jeff>Java. We're seeing a whole ecosystem of these data infrastructure pieces

00:43:58.650 --> 00:44:00.790
<v Jeff>being moved to Rust.

00:44:01.110 --> 00:44:05.510
<v Jeff>And so one of those things is we do want to start to separate the storage and

00:44:05.510 --> 00:44:12.150
<v Jeff>compute in a way that our database can start to come up right away or download

00:44:12.150 --> 00:44:13.470
<v Jeff>the minimum amount of assets.

00:44:14.690 --> 00:44:19.250
<v Jeff>And given the fact that we almost memory-map some of these assets,

00:44:20.290 --> 00:44:24.750
<v Jeff>you can actually translate that, you know, in a pretty straightforward

00:44:24.750 --> 00:44:28.410
<v Jeff>way to S3, because it offers, like, a range API.

00:44:28.650 --> 00:44:34.650
<v Jeff>And because like people have built libraries to essentially cache S3 on disk,

00:44:34.930 --> 00:44:39.870
<v Jeff>what we're thinking about is we're going to build a storage layer that almost caches S3.

00:44:40.030 --> 00:44:45.350
<v Jeff>And then while the service is booting up and hasn't pulled in the assets to the local NVMe disk,

00:44:46.640 --> 00:44:52.340
<v Jeff>It can temporarily sort of pull in ranges of bytes from these storage services

00:44:52.340 --> 00:44:54.340
<v Jeff>so it can immediately start serving queries.
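
NOTE
A minimal sketch of the read-through range cache idea described above: serve byte ranges from local blocks when they are present and fetch only the missing blocks from remote storage. Everything here is hypothetical (`ObjectStore`, `FakeRemote`, `RangeCache`, the 4-byte block size); a real version would issue ranged GET requests against S3 and keep blocks on local disk rather than in memory.
```rust
use std::cell::Cell;
use std::collections::HashMap;
// Stand-in for an object store that supports ranged reads.
trait ObjectStore {
    // Returns up to `len` bytes starting at `start` (fewer at the end of the object).
    fn get_range(&self, key: &str, start: u64, len: u64) -> Vec<u8>;
}
// In-memory fake that counts how many remote requests were made.
struct FakeRemote {
    objects: HashMap<String, Vec<u8>>,
    requests: Cell<u32>,
}
impl ObjectStore for FakeRemote {
    fn get_range(&self, key: &str, start: u64, len: u64) -> Vec<u8> {
        self.requests.set(self.requests.get() + 1);
        let data = self.objects.get(key).map(|v| v.as_slice()).unwrap_or(&[]);
        let begin = (start as usize).min(data.len());
        let end = (begin + len as usize).min(data.len());
        data[begin..end].to_vec()
    }
}
// Tiny block size so the example is easy to trace by hand.
const BLOCK: u64 = 4;
struct RangeCache<S: ObjectStore> {
    remote: S,
    blocks: HashMap<(String, u64), Vec<u8>>,
}
impl<S: ObjectStore> RangeCache<S> {
    fn new(remote: S) -> Self {
        RangeCache { remote, blocks: HashMap::new() }
    }
    // Read `len` bytes at `offset`, fetching only blocks that are not cached yet.
    fn read(&mut self, key: &str, offset: u64, len: u64) -> Vec<u8> {
        let mut out = Vec::new();
        if len == 0 {
            return out;
        }
        let first = offset / BLOCK;
        let last = (offset + len - 1) / BLOCK;
        for block in first..=last {
            let remote = &self.remote;
            let bytes = self
                .blocks
                .entry((key.to_string(), block))
                .or_insert_with(|| remote.get_range(key, block * BLOCK, BLOCK));
            // Clip the requested range to this block, then to the bytes actually present.
            let block_start = block * BLOCK;
            let from = offset.max(block_start) - block_start;
            let to = (offset + len).min(block_start + BLOCK) - block_start;
            let from = (from as usize).min(bytes.len());
            let to = (to as usize).min(bytes.len());
            out.extend_from_slice(&bytes[from..to]);
        }
        out
    }
}
```
The first read of a range pays the remote latency; repeating it is served entirely from cached blocks, which is the trade-off that would let a freshly booted instance answer queries before its assets are fully downloaded.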

00:44:54.560 --> 00:44:57.720
<v Jeff>Obviously, there's going to be a latency hit, but in terms of being able to

00:44:57.720 --> 00:45:02.380
<v Jeff>deal with things like, oh, we have a traffic spike, that's sort of one of the

00:45:02.380 --> 00:45:03.280
<v Jeff>problems we're trying to solve.

00:45:03.440 --> 00:45:08.360
<v Jeff>And I think the fact that there's so much development around data infrastructure

00:45:08.360 --> 00:45:14.060
<v Jeff>projects and we use Spark at Radar and so many companies use Spark.

00:45:14.060 --> 00:45:19.480
<v Jeff>Just this whole concept of data infrastructure in the cloud and using object storage.

00:45:19.520 --> 00:45:26.560
<v Jeff>We're seeing new streaming frameworks written in Rust or a lot of these new

00:45:26.560 --> 00:45:31.220
<v Jeff>Iceberg tools in Rust, and that's going to help us a lot for building something

00:45:31.220 --> 00:45:32.160
<v Jeff>where we want to separate.

00:45:32.460 --> 00:45:36.840
<v Jeff>Or it's going to accelerate essentially what we want to build where we separate storage and compute.

00:45:38.580 --> 00:45:43.460
<v Matthias>How long did it take you to build and ship the first version of HorizonDB?

00:45:44.180 --> 00:45:48.500
<v Jeff>Right. So as I mentioned before, like our first use case was address validation.

00:45:48.500 --> 00:45:53.260
<v Jeff>We had a customer who was asking for it, and we built it in

00:45:54.670 --> 00:45:57.750
<v Jeff>about a month. Actually, I think less than that,

00:45:57.750 --> 00:46:03.610
<v Jeff>to actually get something working. But to get it closer to full production

00:46:03.610 --> 00:46:11.190
<v Jeff>with this customer, maybe it was like two months, but that was with a lot of

00:46:11.190 --> 00:46:17.050
<v Jeff>QA and a lot of feedback, things like that. But I would say it took about two months.

00:46:17.050 --> 00:46:21.130
<v Matthias>Did you have a lot of outages due to Rust? Production outages?

00:46:21.130 --> 00:46:28.110
<v Jeff>Not really. One, because we were only onboarding a single customer and

00:46:28.110 --> 00:46:33.130
<v Jeff>the QPS wasn't that high to begin with. And two, because we had this property

00:46:33.130 --> 00:46:35.050
<v Jeff>where everything was sort of self-contained,

00:46:35.510 --> 00:46:40.430
<v Jeff>it was very trivial for us to do blue-green deploys, which also include the

00:46:40.430 --> 00:46:43.870
<v Jeff>data. So, like, hey, you processed a string in the wrong way and you're

00:46:43.870 --> 00:46:47.150
<v Jeff>starting to see that: you just flip it back to the previous service, and it's

00:46:47.150 --> 00:46:50.810
<v Jeff>all self-contained, whereas that might be harder to do with, like, a single

00:46:50.810 --> 00:46:55.070
<v Jeff>backing database or things like that.

00:46:55.070 --> 00:46:59.870
<v Matthias>Can you even remember a single outage or a single error or a panic in production with Rust?

00:46:59.870 --> 00:47:06.350
<v Jeff>Yes. So I will say, we do make use of this geo library from Rust, which

00:47:06.350 --> 00:47:11.430
<v Jeff>is very popular, and one of the top issues is that it panics and

00:47:12.840 --> 00:47:16.940
<v Jeff>doesn't handle errors, which is un-Rust-like, I guess.

00:47:18.260 --> 00:47:23.260
<v Jeff>We have this service called Sentry that we use for error logging.

00:47:23.520 --> 00:47:27.420
<v Jeff>And there's essentially no exceptions normally apart from that library.

00:47:27.580 --> 00:47:30.640
<v Jeff>But typically, we don't see stack traces.

00:47:30.820 --> 00:47:36.200
<v Jeff>We do see outages. And that's just due to spikes or essentially queries of death.

00:47:37.520 --> 00:47:40.260
<v Jeff>And there, you know, there's a

00:47:40.260 --> 00:47:43.660
<v Jeff>whole podcast to talk about reliability and things like that. But

00:47:43.660 --> 00:47:46.580
<v Jeff>the service itself is pretty efficient, and our

00:47:46.580 --> 00:47:51.520
<v Jeff>rewrite of the search indexes is going to make the service even faster. And

00:47:51.520 --> 00:47:55.440
<v Jeff>as I mentioned, with the storage and compute separation, that's

00:47:55.440 --> 00:48:00.140
<v Jeff>going to introduce another level of scalability and being able to

00:48:00.140 --> 00:48:06.040
<v Jeff>handle spikes. Generally, I would say, if we're looking at the proportion,

00:48:06.460 --> 00:48:08.700
<v Jeff>and really there's two main backend servers.

00:48:08.880 --> 00:48:12.220
<v Jeff>There's our API server, and then HorizonDB.

00:48:12.800 --> 00:48:16.600
<v Jeff>I would say there is not a significant amount of outages from that.

00:48:17.960 --> 00:48:23.740
<v Matthias>And talking about scale, can you give us a few numbers, number of requests,

00:48:24.040 --> 00:48:25.240
<v Matthias>number of nodes, et cetera?

00:48:26.240 --> 00:48:31.460
<v Jeff>Right. So at a steady state, we're looking at about 18,000 queries per second.

00:48:32.420 --> 00:48:37.160
<v Jeff>That's a lot of queries. Yeah, I would say it's a lot.

00:48:37.520 --> 00:48:43.040
<v Jeff>Maybe at other companies it's not super high, because there are some cases

00:48:43.040 --> 00:48:46.900
<v Jeff>where things will fan out and suddenly you get hundreds of thousands of TPS.

00:48:47.960 --> 00:48:52.240
<v Jeff>But yeah, it's enough to know that like if you made like a certain optimization,

00:48:52.920 --> 00:48:56.180
<v Jeff>like you can quickly tell if things went right or went horribly wrong.

00:48:56.680 --> 00:48:58.960
<v Matthias>How many nodes do you need to handle that traffic?

00:48:59.500 --> 00:49:04.020
<v Jeff>Right. So we have 30 nodes. It's significantly over-provisioned now,

00:49:04.140 --> 00:49:08.720
<v Jeff>which is why we're looking into like how we can be more efficient about dealing with spikes.

00:49:08.720 --> 00:49:15.420
<v Jeff>We already actually shard out the database servers by different use cases and

00:49:15.420 --> 00:49:19.780
<v Jeff>by different tiers of customers because we are a SaaS company just to reduce

00:49:19.780 --> 00:49:25.300
<v Jeff>or essentially just to isolate certain workloads and prevent outages from a bad

00:49:25.300 --> 00:49:27.940
<v Jeff>query from one customer affecting others.

00:49:28.360 --> 00:49:35.040
<v Jeff>So all that's to say is we are quite over-provisioned there, but we're working on it.

00:49:35.900 --> 00:49:40.140
<v Matthias>Well, it's great. You could do with less. That's always a good thing that managers

00:49:40.140 --> 00:49:45.740
<v Matthias>want to hear, because saving costs becomes much easier if you can just shut

00:49:45.740 --> 00:49:48.500
<v Matthias>down some of those nodes if you don't need them at the moment.

00:49:49.500 --> 00:49:53.520
<v Jeff>Yeah, that's right. Cost cutting, I wouldn't say it was the primary

00:49:53.520 --> 00:49:56.980
<v Jeff>thing for getting this started, but we did see a lot of benefit from that.

00:49:57.040 --> 00:50:02.140
<v Jeff>So even running 30 nodes, we're able to replace a number of Elasticsearch instances.

00:50:02.580 --> 00:50:06.940
<v Jeff>And we represented geoqueries in many different ways, like denormalized across

00:50:06.940 --> 00:50:10.520
<v Jeff>different stores like Redis and Mongo as well.

00:50:10.900 --> 00:50:18.160
<v Jeff>And then in total, I would say actually even with the 30 nodes and the over-provisioning,

00:50:18.320 --> 00:50:20.360
<v Jeff>we probably saved maybe...

00:50:21.840 --> 00:50:27.360
<v Jeff>mid to high five-figure monthly spending for some of these things.

00:50:28.120 --> 00:50:35.960
<v Matthias>Impressive. How many queries per second does one box do, and how much CPU does it use, roughly?

00:50:36.460 --> 00:50:42.440
<v Jeff>Right, so on average, which is a misleading metric, because forward geocoding,

00:50:42.600 --> 00:50:45.520
<v Jeff>which is search, is a very different workload from reverse geocoding,

00:50:45.600 --> 00:50:51.480
<v Jeff>we do 600 queries per second per box, and it's about 20 to 30% CPU at the steady

00:50:51.480 --> 00:50:53.460
<v Jeff>state, which means it can go higher.

00:50:53.900 --> 00:50:58.740
<v Jeff>And I would say that, you know, our search workloads are probably like a 10x,

00:50:59.080 --> 00:51:04.020
<v Jeff>sometimes even 100x of like reverse lookups as well as like primary key lookups.

00:51:04.420 --> 00:51:12.340
<v Matthias>Would you say it's rather a CPU- or compute-bound problem, or a storage- or IO-bound problem?

00:51:12.340 --> 00:51:15.200
<v Matthias>Because on one side you have hashing, which can

00:51:15.200 --> 00:51:19.460
<v Matthias>be very expensive, but maybe that's just a thing that you do at the beginning

00:51:19.460 --> 00:51:24.780
<v Matthias>when you boot up, and then it's not really as compute-intense later on during

00:51:24.780 --> 00:51:29.700
<v Matthias>operations. Or would you say, well, no, actually that is kind of the bottleneck?

00:51:29.700 --> 00:51:32.720
<v Matthias>Or would you say, no, the storage part is the bottleneck?

00:51:32.720 --> 00:51:37.280
<v Jeff>Right. So, breaking it down into the different use cases: geocoding,

00:51:37.280 --> 00:51:40.900
<v Jeff>which is a search query; reverse geocoding, which is like a lat-long lookup; and

00:51:40.900 --> 00:51:43.660
<v Jeff>then maybe even like primary key lookups.

00:51:44.380 --> 00:51:48.080
<v Jeff>I would say the latter two are more IO bound.

00:51:48.300 --> 00:51:52.040
<v Jeff>Maybe you pay some cost of serialization, deserialization from some of the storage

00:51:52.040 --> 00:51:55.760
<v Jeff>systems, which will change soon because there's a lot of very nice zero copy

00:51:55.760 --> 00:51:59.040
<v Jeff>libraries from Rust. And so

00:52:00.460 --> 00:52:04.460
<v Jeff>the CPU-intensive queries tend to be search.

00:52:04.800 --> 00:52:07.280
<v Jeff>I would say it's actually both, because if you think about it,

00:52:07.420 --> 00:52:12.720
<v Jeff>as I mentioned, this inverted index structure, you have to do many key lookups, right?

00:52:12.820 --> 00:52:15.940
<v Jeff>You tokenize your query, 123 Fake Street, Brooklyn, New York.

00:52:16.140 --> 00:52:19.760
<v Jeff>And in fact, there's many ways to express it in synonyms. Maybe you expand the

00:52:19.760 --> 00:52:25.100
<v Jeff>query, and maybe instead of 123 Fake St, it could be Street.

00:52:25.440 --> 00:52:31.360
<v Jeff>Or if you know the user's in Germany, it's Straße. And we haven't indexed or

00:52:31.360 --> 00:52:32.920
<v Jeff>normalized everything in a certain way.

00:52:33.300 --> 00:52:38.400
<v Jeff>So we do expand the queries and you can see how it can fan out to many keys being pulled back.

00:52:38.600 --> 00:52:41.960
<v Jeff>And once you have all these candidates, you essentially have to rank them.

00:52:42.260 --> 00:52:47.960
<v Jeff>And so there is a little bit of CPU intensiveness in that, but we're working

00:52:47.960 --> 00:52:49.140
<v Jeff>on optimizing these things.
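
NOTE
A minimal sketch of the query fan-out described above, assuming a hypothetical
in-memory inverted index. The names tokenize, expand, build_index and candidates,
and the tiny synonym table, are made up for illustration and are not Radar's code.
Ranking of the returned candidates is left out.
```rust
use std::collections::{BTreeSet, HashMap};
// Posting lists: token -> sorted set of document IDs.
type Index = HashMap<String, BTreeSet<u32>>;
// Lowercase the text and split it on anything that is not a letter or digit.
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}
// Hypothetical synonym table; a real one would be far larger and locale-aware.
fn expand(token: &str) -> Vec<String> {
    match token {
        "st" | "street" => vec!["st".to_string(), "street".to_string()],
        other => vec![other.to_string()],
    }
}
// Index every token of every document under its ID.
fn build_index(docs: &[(u32, &str)]) -> Index {
    let mut index = Index::new();
    for (id, text) in docs {
        for token in tokenize(text) {
            index.entry(token).or_default().insert(*id);
        }
    }
    index
}
// Documents that match every query token, where a token may match via any synonym.
fn candidates(index: &Index, query: &str) -> BTreeSet<u32> {
    let mut result: Option<BTreeSet<u32>> = None;
    for token in tokenize(query) {
        // Fan-out: union the postings of every variant of this token.
        let mut matches = BTreeSet::new();
        for variant in expand(&token) {
            if let Some(postings) = index.get(&variant) {
                matches.extend(postings.iter().copied());
            }
        }
        // Keep only documents that also matched all earlier tokens.
        result = Some(match result {
            None => matches,
            Some(acc) => acc.intersection(&matches).copied().collect(),
        });
    }
    result.unwrap_or_default()
}
```
With an index built from "123 Fake Street Brooklyn", the query "123 Fake St, Brooklyn"
still matches, because "st" is expanded to "street" before the key lookups.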

00:52:49.720 --> 00:52:55.340
<v Jeff>And I will say that if you do decide to build this sort of analytics or search

00:52:55.340 --> 00:52:59.820
<v Jeff>system, we're very big fans of this new approach with Roaring Bitmap,

00:53:00.060 --> 00:53:04.700
<v Jeff>which is, it's actually an implementation that you see across many different languages.

00:53:05.020 --> 00:53:09.900
<v Jeff>But we found, at least from the sort of candidate fetching, where we're essentially

00:53:09.900 --> 00:53:14.920
<v Jeff>in the inverted index and trying to pull out which candidates match a criteria,

00:53:15.240 --> 00:53:21.920
<v Jeff>those tend to be like microsecond operations for most of our geocoding queries.

00:53:21.920 --> 00:53:28.280
<v Jeff>So that's why we're sort of moving over to a more like dedicated and specific system for our use case.
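
NOTE
A rough sketch of why bitmap-based candidate fetching can be so cheap, assuming a
plain uncompressed bitset rather than a real Roaring implementation. The Bitset type
and its methods are hypothetical. Roaring additionally splits the ID space into
chunks and stores each chunk in a compact container; that part is omitted here.
```rust
// Plain fixed-size bitset of candidate IDs, one bit per ID.
struct Bitset {
    words: Vec<u64>,
}
impl Bitset {
    fn new(capacity: usize) -> Self {
        Bitset { words: vec![0; (capacity + 63) / 64] }
    }
    // Panics if id is beyond the capacity given to new().
    fn insert(&mut self, id: usize) {
        self.words[id / 64] |= 1u64 << (id % 64);
    }
    // IDs present in both sets: one AND instruction per 64 candidates.
    fn and(&self, other: &Bitset) -> Bitset {
        let words: Vec<u64> = self.words.iter().zip(&other.words).map(|(a, b)| a & b).collect();
        Bitset { words }
    }
    // List the set bits in ascending order.
    fn ids(&self) -> Vec<usize> {
        let mut out = Vec::new();
        for (i, &word) in self.words.iter().enumerate() {
            let mut w = word;
            while w != 0 {
                out.push(i * 64 + w.trailing_zeros() as usize);
                w &= w - 1; // clear the lowest set bit
            }
        }
        out
    }
}
```
Intersecting the bitmap for one token with the bitmap for another yields the IDs
that match both criteria without touching the documents themselves.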

00:53:29.320 --> 00:53:33.400
<v Matthias>Yeah. You said it yourself, lots of moving parts.

00:53:34.180 --> 00:53:41.680
<v Matthias>What's the median latency of a forward geocoding request and maybe a reverse

00:53:41.680 --> 00:53:43.520
<v Matthias>geocoding request in comparison?

00:53:43.520 --> 00:53:48.380
<v Jeff>Yeah, so the median latency is about 50 milliseconds.

00:53:48.760 --> 00:53:53.860
<v Jeff>And that's because when we were building this system, and we were seeing customer queries,

00:53:54.120 --> 00:53:58.740
<v Jeff>we're seeing that most customers tend to enter the right thing more or less,

00:53:58.920 --> 00:54:01.340
<v Jeff>like they tend to enter their address correctly,

00:54:01.600 --> 00:54:02.840
<v Jeff>even if there's like some spelling

00:54:02.840 --> 00:54:09.600
<v Jeff>mistakes, but they tend to have a more predictable sort of query pattern.

00:54:09.600 --> 00:54:14.760
<v Jeff>And that's sort of why I describe the FST as a high cardinality text cache.

00:54:15.140 --> 00:54:22.680
<v Jeff>Because almost everything, I would say 70% to 80% of the queries,

00:54:22.860 --> 00:54:26.600
<v Jeff>especially for US and Canada addresses, will just hit the FST.

00:54:26.920 --> 00:54:32.900
<v Jeff>And all of that, we have the sort of generosity of being able to store all of

00:54:32.900 --> 00:54:34.540
<v Jeff>that in memory because it's so compressed.

00:54:35.840 --> 00:54:40.440
<v Jeff>And in Rust, we're writing it in a way that's a single process with multi-threading

00:54:40.440 --> 00:54:42.940
<v Jeff>for concurrency, versus something like Node

00:54:42.980 --> 00:54:46.840
<v Jeff>or Python, where, I mean, there are starting to be thread concepts,

00:54:47.040 --> 00:54:49.940
<v Jeff>but the classical way is just to have multiple processes.

00:54:50.180 --> 00:54:54.880
<v Jeff>And then that means you need multiple copies of the data structure in memory.

00:54:55.320 --> 00:55:01.000
<v Jeff>And so oftentimes we're hitting that. And those, at least that side of things,

00:55:01.000 --> 00:55:05.400
<v Jeff>those operations are usually at most single-digit milliseconds.

00:55:06.080 --> 00:55:10.840
<v Jeff>Depends on the query, but a lot of times we will see things show up in less

00:55:10.840 --> 00:55:12.760
<v Jeff>than a millisecond for that side of things.

00:55:13.960 --> 00:55:16.680
<v Jeff>For reverse geocoding, as I mentioned, we sort of...

00:55:18.900 --> 00:55:24.880
<v Jeff>And to speed up your algorithms, oftentimes it's like, can you sort it or can you hash it?

00:55:25.060 --> 00:55:30.760
<v Jeff>And essentially, we have hashed it by taking the S2 library where we convert

00:55:30.760 --> 00:55:34.300
<v Jeff>the latitude and longitude, two dimensions, into one dimension.

00:55:34.400 --> 00:55:37.780
<v Jeff>And that's essentially just a key that you look up in RocksDB.

00:55:38.200 --> 00:55:41.660
<v Jeff>So many of our workloads look like that. And once it's like that,

00:55:41.780 --> 00:55:43.960
<v Jeff>it's almost no different from a primary key lookup.

00:55:44.440 --> 00:55:49.440
<v Jeff>It's just a key value lookup. So we're seeing less than a millisecond for those types of workloads.
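
NOTE
A minimal sketch of turning a latitude/longitude pair into a single lookup key.
This is not the S2 library: as a simplified stand-in it quantizes each coordinate
to 16 bits and interleaves the bits (a Z-order curve), whereas real S2 projects
onto a cube and uses a Hilbert curve. cell_key and reverse_geocode are hypothetical
names, and a BTreeMap stands in for the key-value store (RocksDB in the interview).
```rust
use std::collections::BTreeMap;
// Two dimensions in, one dimension out: interleave the bits of the quantized coordinates.
fn cell_key(lat: f64, lng: f64) -> u32 {
    let y = ((lat + 90.0) / 180.0 * 65535.0) as u32;
    let x = ((lng + 180.0) / 360.0 * 65535.0) as u32;
    let mut key = 0u32;
    for bit in 0..16 {
        key |= ((x >> bit) & 1) << (2 * bit);
        key |= ((y >> bit) & 1) << (2 * bit + 1);
    }
    key
}
// Once the point is a key, reverse geocoding is an ordinary key-value lookup.
fn reverse_geocode(store: &BTreeMap<u32, String>, lat: f64, lng: f64) -> Option<&String> {
    store.get(&cell_key(lat, lng))
}
```
Points that fall into the same quantized cell produce the same key, which is what
makes the lookup behave like a primary key lookup.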

00:55:49.800 --> 00:55:56.260
<v Matthias>Nice. And all of that with how many lines of Rust code in total for the server

00:55:56.260 --> 00:55:57.420
<v Matthias>implementation at the moment?

00:55:58.180 --> 00:56:04.400
<v Jeff>Right. So in terms of our code base, it's about 150,000 lines,

00:56:04.480 --> 00:56:06.500
<v Jeff>which I would say is actually not very much.

00:56:06.720 --> 00:56:13.400
<v Jeff>And I haven't taken the time to analyze which code line is serving which use case.

00:56:13.400 --> 00:56:17.880
<v Jeff>But I will also say, like, there is like a good percentage of stuff that's like

00:56:17.880 --> 00:56:22.600
<v Jeff>kind of hard coded right now that we could probably, I think,

00:56:22.660 --> 00:56:25.160
<v Jeff>to get like a minimal implementation, maybe like you could,

00:56:25.400 --> 00:56:29.320
<v Jeff>when we first started doing address validation, it was like a couple thousand

00:56:29.320 --> 00:56:31.720
<v Jeff>lines. It wasn't anything too big.

00:56:32.040 --> 00:56:35.980
<v Jeff>So and that sort of speaks to like how expressive, you know,

00:56:36.480 --> 00:56:38.960
<v Jeff>being able to express certain concepts in Rust is.

00:56:38.960 --> 00:56:41.760
<v Matthias>Are there any tips and tricks

00:56:41.760 --> 00:56:45.180
<v Matthias>that you could share with people who are planning

00:56:45.180 --> 00:56:49.680
<v Matthias>to build a larger Rust application? I'm

00:56:49.680 --> 00:56:56.460
<v Matthias>thinking of composition patterns, usage of traits, how you structure your data

00:56:56.460 --> 00:57:02.940
<v Matthias>objects, immutability, anything that comes to mind that you learned,

00:57:02.940 --> 00:57:08.680
<v Matthias>maybe that is uncommon in Ruby or Scala or any other language that you knew before?

00:57:09.930 --> 00:57:15.490
<v Jeff>I would almost like take the approach of almost, especially when you're starting

00:57:15.490 --> 00:57:17.610
<v Jeff>out, not to overthink these things.

00:57:17.830 --> 00:57:21.870
<v Jeff>I do think that Rust still has like so many different concepts.

00:57:22.250 --> 00:57:26.550
<v Jeff>And we're always like, honestly, as we're writing, I don't even feel like I'm

00:57:26.550 --> 00:57:29.950
<v Jeff>an expert in Rust at all. Like I always learn something new.

00:57:30.250 --> 00:57:34.290
<v Jeff>So with that in mind, especially like if you're first starting out with Rust,

00:57:34.610 --> 00:57:39.350
<v Jeff>just try and get something to work and then sort of optimize after.

00:57:39.350 --> 00:57:43.430
<v Jeff>Because I guess they always say like premature optimization, you know, is, is.

00:57:44.960 --> 00:57:46.120
<v Matthias>It's the root of all evil.

00:57:46.820 --> 00:57:54.200
<v Jeff>Basically, yeah. So try and get the simplest thing to work and then only apply

00:57:54.200 --> 00:57:58.420
<v Jeff>design patterns if you're starting to find you're really repeating yourself

00:57:58.420 --> 00:57:59.680
<v Jeff>or it becomes very finicky.

00:57:59.960 --> 00:58:04.400
<v Jeff>I will say that, especially in this age of AI editors that will essentially

00:58:04.400 --> 00:58:08.200
<v Jeff>copy and paste everything for you, I don't know.

00:58:08.680 --> 00:58:13.080
<v Jeff>But yeah, that's the way to think about it is really just to,

00:58:13.460 --> 00:58:18.980
<v Jeff>rather than take very high-level approaches where it can actually sort of inhibit

00:58:18.980 --> 00:58:22.440
<v Jeff>or prevent you from being able to do the thing you want it to do,

00:58:22.520 --> 00:58:27.840
<v Jeff>just do the simplest thing and then apply more optimizations or introduce more

00:58:27.840 --> 00:58:33.820
<v Jeff>sort of design patterns as you start to see the problems arise.

00:58:33.980 --> 00:58:37.680
<v Jeff>And then you really do get a sense of what problem you're trying to solve.

00:58:38.220 --> 00:58:41.880
<v Matthias>Yeah, perfect. That's right down my alley, even in

00:58:41.880 --> 00:58:47.200
<v Matthias>a geocoding sense, because I recently wrote an article called Be Simple, and I'm

00:58:47.200 --> 00:58:52.740
<v Matthias>very much in favor of that. Unfortunately, we're getting close to the end. We had

00:58:52.740 --> 00:59:01.080
<v Matthias>a blast, so the time flew by. The last question is: what's your message to the Rust community?

00:59:01.080 --> 00:59:07.900
<v Jeff>I would say, obviously, keep doing what you're doing. I think the Rust ecosystem is very vibrant.

00:59:08.640 --> 00:59:12.740
<v Jeff>And like, I think there's an interesting trend that I've seen where,

00:59:13.300 --> 00:59:16.500
<v Jeff>and it's almost like Hacker News-driven, where it's like, oh,

00:59:16.740 --> 00:59:18.860
<v Jeff>I built this service and it's written in Rust.

00:59:19.140 --> 00:59:23.360
<v Jeff>But we're actually even seeing, not even like a...

00:59:24.690 --> 00:59:28.750
<v Jeff>We're seeing that in almost a marketing approach for real businesses.

00:59:29.010 --> 00:59:34.990
<v Jeff>There are so many new, interesting pieces of infrastructure that are becoming

00:59:34.990 --> 00:59:37.310
<v Jeff>businesses that are open core.

00:59:37.710 --> 00:59:40.970
<v Jeff>I wouldn't say it's fully open source because I think the open source purists

00:59:40.970 --> 00:59:43.150
<v Jeff>would be, it really depends on licenses.

00:59:43.150 --> 00:59:49.270
<v Jeff>But I think it's such a flexible programming language and it's very modern that

00:59:49.270 --> 00:59:55.330
<v Jeff>you get a lot of these sort of productivity or enjoyment aspects that you would

00:59:55.330 --> 01:00:02.030
<v Jeff>get from maybe some other like scripting languages traditionally that I would love to see, you know,

01:00:02.190 --> 01:00:07.770
<v Jeff>more of these like businesses that are really like open source and open core

01:00:07.770 --> 01:00:12.170
<v Jeff>to really inspire what you can really do with the language.

01:00:12.170 --> 01:00:16.150
<v Jeff>And I would also love to just see more...

01:00:17.120 --> 01:00:23.420
<v Jeff>More work around even just doing the sort of 80% of things that most developers

01:00:23.420 --> 01:00:27.300
<v Jeff>are doing these days, which is really just having a nice web app experience.

01:00:27.300 --> 01:00:29.440
<v Jeff>I think there's still a lot to do there.

01:00:29.720 --> 01:00:35.800
<v Jeff>I know people brand it as an infrastructure or systems language,

01:00:35.800 --> 01:00:40.780
<v Jeff>but it really is a very enjoyable language to use for many different use cases.

01:00:41.340 --> 01:00:44.360
<v Jeff>Even like, I think, so not this instrument,

01:00:44.620 --> 01:00:47.280
<v Jeff>but I know Elektron, they're in Sweden, and they

01:00:47.280 --> 01:00:51.500
<v Jeff>make a lot of music production equipment, and a lot of their code is written

01:00:51.500 --> 01:00:56.940
<v Jeff>in Rust. So you really see all these different aspects, because

01:00:56.940 --> 01:01:00.940
<v Jeff>it's a language that's very expressive, but on the other hand,

01:01:00.940 --> 01:01:04.740
<v Jeff>it lets you get really deep in the weeds, and so it lets you do so many things.

01:01:04.740 --> 01:01:06.600
<v Jeff>So I would love to see more,

01:01:07.080 --> 01:01:11.140
<v Jeff>you know, support and work around even just basic things, like a

01:01:11.140 --> 01:01:15.360
<v Jeff>lot of web apps. Who knows, maybe there will be a Rails for Rust

01:01:15.360 --> 01:01:22.080
<v Jeff>in the future. That would be super interesting.</v>
<v Matthias>There's Loco.rs.</v>
<v Jeff>Yeah, I guess that's true.</v>

01:01:22.080 --> 01:01:23.980
<v Matthias>Jeff,

01:01:23.980 --> 01:01:28.220
<v Matthias>that was amazing. Thanks so much for taking the time today.

01:01:28.220 --> 01:01:30.740
<v Jeff>Yeah, likewise. It was a very fun chat.

01:01:30.740 --> 01:01:31.700
<v Matthias>Rust

01:01:31.700 --> 01:01:36.580
<v Matthias>in Production is a podcast by corrode. It is hosted by me, Matthias Endler, and

01:01:36.580 --> 01:01:41.100
<v Matthias>produced by Simon Brüggen. For show notes, transcripts, and to learn more about

01:01:41.100 --> 01:01:45.560
<v Matthias>how we can help your company make the most of Rust, visit corrode.dev.

01:01:45.780 --> 01:01:48.160
<v Matthias>Thanks for listening to Rust in Production.