Rust in Production

Matthias Endler

Sentry with Arpad Borsos

Matthias Endler discusses enhancing a Python platform with Rust at Sentry with guest Arpad Borsos. They cover Rust challenges, async development, and integrating Rust with other languages. Arpad encourages companies to try Rust.

2024-02-22 75 min

Description & Show Notes

In an ever-expanding world of microservices, APIs, and devices, maintaining an overview of application states and the myriad errors that can occur is challenging. For years, Sentry has been a go-to choice for developers to monitor their applications and receive notifications about issues within their code.

Traditionally, Sentry was predominantly a Python shop, but they became one of the early adopters of Rust in their technology stack. They have been utilizing Rust for a few years now (since at least 2017), starting with sentry-cli, a command-line utility to work with Sentry, and continuing with their source map parsing pipeline, which reduced processing times from 20 seconds to less than 500 milliseconds.

More recently, they have been developing two new projects in Rust: relay and symbolicator. Relay acts as a proxy for sending events to Sentry, while symbolicator is a service for handling the symbolication of stack traces. Both projects are open source and available on GitHub.

Arpad Borsos (swatinem), Senior Native Platform Engineer at Sentry, sat down with me to discuss their journey with Rust and how it has enabled them to build a cutting-edge monitoring platform for large-scale applications.

Our conversation covered topics such as 100x speedups, the Rust-Python interface, and the comparison between actor-based and task-based concurrency.

In this episode, we talk to Arpad Borsos, Systems Software Engineer at Sentry, about how they use Rust to build a modern error monitoring platform for developers.

We discuss the challenges of building a high-performance, low-latency platform for processing and analyzing large amounts of data (like stack traces and source maps) in real-time. 

Arpad maintains the `symbolic` crate for stack trace symbolication, which is used on the Sentry platform.

About Sentry:
Sentry provides application performance monitoring and error tracking software for JavaScript, Python, Ruby, Go, and more. Their platform also supports session replay, profiling, cron monitoring, code coverage, and more.

About Arpad Borsos:
Arpad Borsos works on high-performance, low-latency systems and maintains open source projects like the popular `rust-cache` GitHub Action. He is an expert in asynchronous programming and spoke about async functions at EuroRust 2023.

Links From The Show:
- https://github.com/getsentry/relay
- https://github.com/getsentry/symbolic
- https://crates.io/crates/tracing
- https://actix.rs/
- https://firefox-source-docs.mozilla.org/js/index.html
- https://en.wikipedia.org/wiki/Name_mangling
- https://neon-bindings.com/
- https://github.com/PyO3/maturin
- https://docs.rs/sentry/latest/sentry/
- https://github.com/getsentry/sentry-rust
- https://en.wikipedia.org/wiki/Minification_(programming)#Source_mapping
- https://github.com/Swatinem/rust-cache
- https://tokio.rs/
- http://formats.kaitai.io/windows_minidump/

Official Links:
- https://sentry.io/
- https://github.com/getsentry/sentry
- https://sentry.engineering/
- https://www.linkedin.com/in/swatinem/
- https://github.com/Swatinem
- https://swatinem.de/

Transcript

This is Rust in Production, a podcast about companies who use Rust to shape the future of infrastructure. My name is Matthias Endler from corrode, and today we talk to Arpad Borsos from Sentry about enhancing a massive Python platform with Rust. Welcome to the show.
Arpad
00:00:20
Hi, thanks for having me.
Matthias
00:00:24
Can you quickly introduce yourself and Sentry, the company you work for?
Arpad
00:00:30
Yeah, absolutely. Well, I'm Arpad and I've been with Sentry for four years and Sentry is in the application monitoring business. So we do error handling or rather error tracking. We do performance monitoring, we do profiling, and we are also about to, well, in the future, launch a metrics product as well.
Matthias
00:00:56
And I know you from a GitHub repository that I personally use in a lot of Rust projects, because you are the maintainer of rust-cache and your handle is Swatinem. It is very, very popular, I can tell. So first off, thanks for this GitHub Action.
Arpad
00:01:21
My pleasure.
Matthias
00:01:22
How did you get started with that action before we go into the main content?
Arpad
00:01:28
Well, Sentry itself uses GitHub. And for people who don't know, Sentry is fully open source, depending on the definition of open source you want to use. So parts of Sentry are completely liberally licensed. Other parts use the … well, I would say open source license, but other people might call it a source-available license. But anyway, all of the source is available on GitHub, and Sentry is using GitHub and GitHub Actions for all of the CI. Back in those days, before I actually started writing my GitHub Action, there was some recommendation about using GitHub's own caching action, but you had to configure it in a weird way. You had to define a cache key that depended on the operating system you run and on the Rust version you run, and you had to make sure to bump this cache key manually. And also the recommended way of which files to cache was rather suboptimal for Rust. So my action had basically two goals. One was to make it extremely simple to use: no manually defining cache keys by default. If people want to do it, they can, of course. The other was to find the right balance of how many files to cache, right? Because you have limited space for the cache itself, and of course there's also download time and extraction time. So it tries to strike the right balance between the overhead of the cache itself versus the gains you get from caching.
Matthias
00:03:11
I can tell it is quite a sophisticated project by now, and it is very well received. That's all I can say. For how long have you been working at Sentry so far?
Arpad
00:03:23
It's my fourth anniversary already, so I'm four-plus years here, and I'm very lucky to have been writing Rust as my day job for basically the whole time.
Matthias
00:03:38
And you also write articles on your blog about Rust, right?
Arpad
00:03:40
Yeah, exactly. So every time I discover some interesting fun facts or some problems and challenges that I face, I tend to write about it. And the good thing here is that Sentry is on GitHub. It's open source. So I can actually write about all of this stuff and point directly to the code that has been causing issues. I can point directly to the fixes that I've applied. So people can really take a hard look at stuff.
Matthias
00:04:11
One of the benefits of working in the open, I guess.
Arpad
00:04:14
Exactly, yeah.
Matthias
00:04:16
Has Sentry always been open source?
Arpad
00:04:20
Yeah, as far as I know, Sentry has been open source since the very beginning.
Matthias
00:04:26
Amazing. Now, going back to the days when you started at Sentry, can you take us back and explain to us what the infrastructure looked like, what the application looked like, which languages you used, and all those things? Can you maybe just give us some context?
Arpad
00:04:45
Yeah, back when I joined Sentry, there was already quite a bit of Rust in production. And I initially started rewriting the native SDK. So I spent lots of time in C code before then moving on to the Rust SDK, which already existed back then, and revamped that a little bit. The things that already existed back in those days were Symbolicator, which is part of Sentry's processing pipeline and deals with processing native events, applying all of the debug files to make stack traces actually readable, and also Relay, which is the first part, the ingestion part, where events actually come in from customers. So those two pieces have been written in Rust for more than four years. I started out with the native SDK, moved on to the Rust SDK, and then eventually took over the maintenance of the processing pipeline and specifically Symbolicator, which then later on also gained the ability to process JavaScript events.
Matthias
00:05:59
You have a strong background in C?
Arpad
00:06:01
So before starting at Sentry, I was actually a JavaScript and TypeScript developer. I did contribute to C and also to Rust before, so I had some hobby projects in Rust, and I've also contributed to C code. One of the contributions I'm actually quite proud of was contributing to Firefox and Mozilla. So there are some contributions in Firefox, things that are touching C code, things that have to do with the SpiderMonkey JavaScript engine.
Matthias
00:06:37
How did you get started with that? I imagine it is quite a sophisticated piece of technology, and it sometimes can be intimidating. How did you even dare to touch the Firefox codebase?
Arpad
00:06:50
Yeah, it actually is quite intimidating, and I can remember that I was actually shivering a little bit opening my first code contribution to Firefox. And I would say I always felt quite welcome to do so. Back in those days, Mozilla had this Bugzilla bug tracker and there was tons of stuff in there. Some things might have been labeled as good first bugs for people to pick up, and I just chose one of those. I tried my hand at it, and then one thing led to another.
Matthias
00:07:33
Firefox is a tool that is used by millions of users, and back in the day it was probably more in the hundreds of millions of users. I would personally be scared that I introduce a soundness bug in the code. Parsing or handling input from various sources could lead to dangling pointers and access to memory that maybe should not be accessed. Were you scared of that? Were there any safety hatches in the code base?
Arpad
00:08:08
Yeah, this was always in the back of my mind. And I believe I might have caused some issues as well, especially with the contributions to the JavaScript engine. There were things that were discovered with fuzzing. Otherwise, well, it's not only me as someone who comes to the project as a new contributor. It's also people who have been working with the project for a long time. They can also make mistakes, right? Either you have a really good test suite that covers this, or you might have fuzzing that discovers things like memory corruptions that no one would have thought of. And now, luckily, we have the Rust compiler, which hopefully prevents most if not all of these issues.
Matthias
00:09:01
But before we get to that, take us back to this moment when you started working on the C SDK and explain to us what you felt when you looked at the code and the patterns. Was it well tested? Was there any fuzzing going on? Did you have ways to check that, for example, the type system would check for bugs? What was it like?
Arpad
00:09:27
Hmm, I would say there was a test suite, but it wasn't as extensive. Some things I also pushed for quite heavily were to introduce fuzzing, which then found quite a few bugs in the C JSON parser, for example, and I also pushed for introducing more things like static code analyzers, code coverage, and then Clang Analyzer and all these kinds of things that are in the C world, which are quite difficult to use, I would say, in comparison to things like Clippy or Miri, which are just a breeze. So, yeah, it was quite interesting working on it. And there was also lots of stuff that was going wrong. So as I was saying, with fuzzing, we found some memory corruption related to the JSON parser. There was also a very serious bug that might have caused data loss on customers' computers related to some problems in the C SDK.
Matthias
00:10:38
What did an error report look like back in the day? Did you really get the entire context of what happened? Did you have to take a lot of guesses? Could you kind of locate the issue in the code base? Was it a pleasant experience?
Arpad
00:10:54
It was complicated. So it was often either a GitHub issue where people reported these issues or some internal customer communications. And figuring this out, it all depends on the issue report, right? So if someone opens an issue on GitHub, it should have enough context for the maintainers to actually figure out what's happening. Oftentimes, the context might be very good and you find the issues immediately or you actually have to dig deeper. You have to figure out how can I reproduce this? And if you reproduce this, then comes the way of actually figuring out where is the bug and how can I fix it?
Matthias
00:11:41
And at the same time you look over your shoulder and your colleague is working on the Rust SDK. Did that spur some interest from you, or were you even maybe a little envious that they used this new modern language?
Arpad
00:11:59
Um, yeah, a little bit. So when I started, it was always advertised to me that I might be able to work with Rust. And I was really eager to actually move on from C to writing more Rust code as well.
Matthias
00:12:15
Was that a reason why you applied as well?
Arpad
00:12:19
Yes, it was definitely a very big factor. So like I said earlier, I used to be a JavaScript and TypeScript developer. And actually, I intentionally made this move and cut in my career to move to Rust, and this was a very good decision and very liberating as well.
Matthias
00:12:40
Does it mean at some point you looked for companies that used Rust in production so that you could do more Rust in your day job?
Arpad
00:12:50
Not specifically, actually. It was, how you Austrians would say, Freunderlwirtschaft. So I was basically referred by a friend. A previous colleague who I worked with joined Sentry first, and she immediately referred me because she found out that Sentry was working with Rust, and she knows me. That's how I got to Sentry eventually. So I did not actually search specifically for it, but it kind of happened through connections, yeah.
Matthias
00:13:27
A very common pattern.
Arpad
00:13:28
Yeah.
Matthias
00:13:31
Referrals between good colleagues, yeah. Makes sense. But eventually you moved on and you moved over to Rust. That means you were the maintainer of the Rust SDK, is that correct, or did you share that responsibility with others?
Arpad
00:13:45
Yes, I still am mostly the maintainer of the Sentry Rust SDK. There isn't too much work going on with it. It's quite stable; there isn't much that we need to add or change right now. If I had the chance, I would do one major rewrite yet again. There are some things that we did back when I first kind of rewrote the Rust SDK which I would do differently today.
Matthias
00:14:18
What would that be, and how would you tackle that rewrite?
Arpad
00:14:24
So one thing I did early on was to split the Rust SDK into different sub-crates. And I faintly remember that Tokio early on did the same thing, and they moved back to a single crate with a bunch of feature flags. I would probably do the same thing: instead of having like 10 different sub-crates for each integration, like `sentry-anyhow`, `sentry-tracing`, sentry-blah-blah-blah, whatever, I would rather move back to a single crate, for a variety of reasons. One is that maintaining it should be easier. The other reason is that people get confused using this. We had a couple of problems reported on GitHub where people had version mismatches, right? So they had one Sentry crate in one version, but another Sentry crate in a different version, and they're incompatible with each other. Because one good thing about Rust is that you can have a crate in different versions compiled into the same program. So it's a good thing that it works, but for Sentry the problem is that you configure Sentry version, I don't know, version X to actually send things to Sentry, but then the Sentry that actually captures panics is a different version, which has never been configured to actually send events upstream. I've come across this GitHub issue a couple of times, and when this happens to people, they are confused why panics never land in Sentry.
Matthias
00:16:13
Well, couldn't you fix that retroactively by merging all of them and releasing a breaking change?
Arpad
00:16:19
We could definitely do it. And one of the problems with the Rust SDK is that we have breaking change releases all of the time, because our types are a bit too detailed. And Sentry itself as a product is changing also quite a bit. So when I joined, we only had the error reporting. And then we also added performance tracing. So the tracing integration, which is, by the way, really awesome, and I can highly recommend this to people. You just `tracing::instrument` a function, you just slap the attribute on there, and these things will then be shown in Sentry. So adding this to the SDK means we have new types, and back in those days, you didn't even have this `non_exhaustive` attribute. This was added later. So at one point, we also added `non_exhaustive` to a lot of enums and lots of types that we have.
Matthias
00:17:19
Just for context, before you continue: non-exhaustive, for the uninitiated, is a feature where you add an attribute to a type like an enum, which prevents your type from getting matched exhaustively at a call site. For example, if you have an enum with two variants, you cannot match on both variants and be sure that you covered all the cases; you also need a wildcard arm. That is a safety or an escape hatch for adding further variants down the line. That feature was not available back then, and I assume it would be quite hard to recreate such behavior because it's outside of the standard library, right?
Arpad
00:18:02
Mm-hmm, mm-hmm.
Matthias
00:18:04
It could have potentially been done with external crates, but then you would have to know about these crates first.
Arpad
00:18:12
Yeah, I mean, I think especially for enums, there used to be this pattern where you have a non-documented variant that never shows up in any kind of publicly visible documentation. But for the type checker, it exists, right? So that was back then, I believe, the main escape hatch to make something like this possible. And for structs as well, if you have a single private member, then you cannot use struct literals, for example. And the main thing with non-exhaustive is also that you cannot use a struct literal. You have to use a constructor, right?
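[Editor's sketch] The two approaches described here might look roughly like the following. The `EventKind` and `LegacyKind` types and both `describe` functions are hypothetical and not taken from the Sentry SDK. Note that `#[non_exhaustive]` is only enforced across crate boundaries, so in a single file the wildcard arm is written voluntarily.

```rust
/// Hypothetical event kinds. The attribute lets a library add variants
/// later without that being a breaking change for downstream crates.
#[non_exhaustive]
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum EventKind {
    Error,
    Transaction,
}

/// The older workaround: an extra variant hidden from rustdoc that users
/// are not supposed to name, so they have to write a wildcard arm anyway.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum LegacyKind {
    Error,
    Transaction,
    #[doc(hidden)]
    __Nonexhaustive,
}

/// In a downstream crate the `_` arm is mandatory for `EventKind`.
/// Inside the defining crate the compiler sees all variants, so the arm
/// is flagged as unreachable unless the lint is silenced.
#[allow(unreachable_patterns)]
pub fn describe(kind: EventKind) -> &'static str {
    match kind {
        EventKind::Error => "error",
        EventKind::Transaction => "transaction",
        _ => "unknown",
    }
}

/// With the hidden-variant pattern the wildcard arm is what catches
/// `__Nonexhaustive` and any variant added later.
pub fn describe_legacy(kind: LegacyKind) -> &'static str {
    match kind {
        LegacyKind::Error => "error",
        LegacyKind::Transaction => "transaction",
        _ => "unknown",
    }
}
```

Both functions keep compiling when a new variant is added, which is the property the attribute is meant to guarantee for users of a library.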
Matthias
00:18:52
A struct literal is when you don't use a constructor, but you explicitly name the fields in the struct when you initialize it, right?
Arpad
00:19:00
Exactly, exactly. So in this case, you have to define every field separately, or you just do `..Default::default()`. And then basically you get all the default values, and you define the ones you want to override. And one of the things I wish Rust could support is actually to combine these `..Default::default()` struct literals with structs that are non-exhaustive. Because right now, it's not supported. So the Rust compiler will scream at you.
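[Editor's sketch] A minimal illustration of struct literals versus the `..Default::default()` form. The `Options` struct and its fields are hypothetical, only loosely modelled on an SDK configuration type.

```rust
/// Hypothetical client options.
#[derive(Debug, Clone, PartialEq)]
pub struct Options {
    pub dsn: Option<String>,
    pub sample_rate: f32,
    pub debug: bool,
}

impl Default for Options {
    fn default() -> Self {
        Options { dsn: None, sample_rate: 1.0, debug: false }
    }
}

/// Plain struct literal: every field is named explicitly, so adding a
/// field to `Options` later breaks this code.
pub fn explicit() -> Options {
    Options { dsn: None, sample_rate: 0.5, debug: true }
}

/// Struct update syntax: override one field and take the rest from
/// `Default`. If `Options` were `#[non_exhaustive]` and defined in another
/// crate, this expression would be rejected as well, and a constructor or
/// builder would be needed instead.
pub fn with_defaults() -> Options {
    Options { sample_rate: 0.5, ..Default::default() }
}
```

The second form survives new fields being added, which is why it would be convenient if it also worked for non-exhaustive structs from other crates.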
Matthias
00:19:37
Interesting. I did not know that that would have been a possibility, but yeah, it makes sense. Now, moving on from the early Rust days for you at Sentry and from the Rust SDK, which I assume is in good shape now because you took care of it: what was the next step for you in this Rust journey at Sentry?
Arpad
00:20:04
Yeah, the next step was taking over ownership of Symbolicator, which is a big part of the whole processing pipeline in Sentry. And I remember one of the big early challenges was updating Symbolicator from things like the very, very old Actix Web and the Tokio 0.1 days, and then updating this to Tokio 0.2. And then afterwards, we also went to Tokio 1.0. And since then, it's stable. So that's been very good. But those early days, they were a bit rough. Right. And back then, Actix Web had really good support and kind of an out-of-the-box solution for both having an HTTP client and also an HTTP server, which I believe was one of the reasons people were such big fans of Actix Web back in those days. And moving from this to Tokio was quite a challenge. So we actually had to find an HTTP client, and we chose reqwest, which I believe lots of people in the Rust community use. But it did have some differences in behavior to how the Actix Web client used to work. So that was a challenge, making it compatible and having the exact same behavior that we were relying on. And then also to change all of the internals from the Actix model over to the Tokio model.
Matthias
00:21:39
For some context, just because people might be wondering, Actix is a web framework and Tokio is an async runtime. But back in the day, or even to this day, I'm not sure, Actix was built on an actor framework. So that was a different model of running asynchronous code, I assume. And I think this is what you meant: moving away from this actor model that Actix imposed towards Tokio plus...
Arpad
00:22:08
Mm-hmm.
Matthias
00:22:09
...a more lightweight HTTP client. Is that correct?
Arpad
00:22:11
Exactly, yeah. And back then, writing async Rust wasn't really as comfortable as today. You didn't even have syntax for async functions. You just had future types from the futures crate, and it was futures 0.1. You actually had to chain each step onto the other using `and_then`, `and_then`, `and_then`, and sharing state between those always involved things like `Arc`s. Compare that with the way we write async Rust today, with just `async fn`, where the compiler takes care of everything. You can have local references, you can have self-referential structs, you can have `&mut` references, and all of these nice things.
Matthias
00:23:12
Although if you use the default Tokio runtime, you still need to use `Arc` and `Mutex`.
Arpad
00:23:18
Mm-hmm.
Matthias
00:23:19
You could use a local executor, but the default would be to still lock things, correct?
Arpad
00:23:25
Yes, absolutely. But what I rather mean is you can write an async function just as you would a normal function. So you can have a `&mut` reference and hold it across await points. This is perfectly possible with an async function, but back then, when you had to chain futures together with `and_then`, this was really, really challenging.
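[Editor's sketch] To make the point about holding a `&mut` reference across an await point concrete, here is a standard-library-only example. `YieldOnce`, `record`, and the toy `block_on` executor are hypothetical helpers written for this illustration; a real program would use a runtime such as Tokio instead of the busy-polling loop.

```rust
use std::future::Future;
use std::pin::{pin, Pin};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// A future that returns `Pending` once before completing, which gives
/// the example a real suspension point.
struct YieldOnce(bool);

impl Future for YieldOnce {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            Poll::Ready(())
        } else {
            self.0 = true;
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

/// The `&mut` borrow is held across the `.await`. The compiler stores it
/// in the state machine it generates for this function.
async fn record(log: &mut Vec<&'static str>) {
    log.push("before await");
    YieldOnce(false).await;
    log.push("after await");
}

/// Waker that does nothing; sufficient because `block_on` polls in a loop.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Toy executor: polls the future repeatedly until it is ready.
fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}
```

Running `block_on(record(&mut log))` on an empty `Vec` leaves both entries in `log`, in order. With futures 0.1 combinators, the same sharing would typically have required moving the vector into each closure or wrapping it in a shared pointer.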
Matthias
00:23:55
Essentially what the compiler does for us automatically now, we had to do by hand writing generator or continuation-like syntax. Oh, yes. A state machine in the end. Yeah. I'm glad that these days are mostly over.
Arpad
00:24:12
Oh, yes.
Matthias
00:24:14
Do you have any guidelines for that inside of Sentry on how to write Rust code, how to write idiomatic Rust code, how to maybe also improve your skills and improve your Rust prowess over time?
Arpad
00:24:30
So we do have Rust development guidelines at Sentry, but I would say they're mostly for new contributors to the Rust code bases we have. Me, for example, or other team members who have been writing Rust for a very long time at Sentry, we kind of predate these guidelines in a way. But for the things that are written down, I would definitely say there are very good guidelines, also about how you write things like panic-safe code and all of these kinds of things. So one thing I would definitely recommend to everyone is to be wary of slicing syntax.
Matthias
00:25:14
Can you elaborate?
Arpad
00:25:17
Well, with slicing syntax, if you're out of bounds, you're panicking. And also, if you're dealing with strings and you happen to index into the middle of a Unicode character instead of on a character boundary, then you're also panicking. If you use the `get` function call, you can specifically check for this because you get an `Option`.
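[Editor's sketch] The difference between slicing and `get` can be shown with two small helpers. The function names `prefix` and `byte_at` are hypothetical; `str::get` and `slice::get` are the standard library methods being referred to.

```rust
/// Returns the first `n` bytes of `s` as a string slice, or `None` if `n`
/// is out of bounds or falls inside a multi-byte character.
/// The equivalent `&s[..n]` would panic in both of those cases.
pub fn prefix(s: &str, n: usize) -> Option<&str> {
    s.get(..n)
}

/// Same idea for byte buffers: a bad index yields `None` instead of a panic.
pub fn byte_at(data: &[u8], index: usize) -> Option<u8> {
    data.get(index).copied()
}
```

For the string `"héllo"`, where `é` occupies two bytes, `prefix(s, 2)` returns `None` because byte 2 is in the middle of `é`, while `prefix(s, 3)` returns `Some("hé")`.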
Matthias
00:25:37
Is there a rule inside Sentry which prevents or discourages slice syntax?
Arpad
00:25:45
I wouldn't say it's a hard rule. It rather depends on the reviewer reviewing the code and on the circumstances. Especially because we're dealing with lots of untrusted data, we tend to index or slice into things where the index isn't trusted, and in those cases I would definitely flag this in reviews as: hey, please use `get` instead of slicing into this.
Matthias
00:26:12
But sometimes you might miss a case, and maybe there is a tool for that. I don't know if Clippy supports you with that. I don't know.
Arpad
00:26:21
I think Clippy supports you with that, yes. But the tool is called Sentry. And we capture those panics that happen in production.
Matthias
00:26:29
Nice segue. Are there generally any rules on how you review Rust code? Things that you look out for? Things that you keep in mind? Things that you usually check, which people often get wrong? You mentioned the slice syntax, for example. Are there any other things? How do you review Rust code?
Arpad
00:26:51
That's an interesting question. I mean, there are a couple of things I tend to be very cautious about, mostly because it's things that happened in production. For example, what I said, slicing into Unicode and getting a panic because you slice in the middle of a character rather than on a boundary. Those things happened to us and they panicked in production, and that's how you learn. I would say there are certain things I tend to focus on, things like unwraps, things like slicing. Maybe in some cases it might be worth using checked math, especially if you're dealing with untrusted data, but those are rare cases. Other things I try to review, especially now that we have a lot more engineers onboarding to Rust code, is to maybe look at clones a little bit more. Especially if you have vectors or strings, cloning is actually O(n) and it allocates, and it might be better to just avoid it by using string slices.
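[Editor's sketch] Two of the review points above, checked math on untrusted values and borrowing instead of cloning, could look like this. `read_range` and `module_of` are hypothetical helpers, and the `"module::function"` naming format is an assumption made for the example.

```rust
/// Extracts `len` bytes starting at `offset` from an untrusted buffer.
/// `checked_add` guards against integer overflow, and `get` guards against
/// an out-of-bounds range, so malformed input yields `None`, not a panic.
pub fn read_range(data: &[u8], offset: usize, len: usize) -> Option<&[u8]> {
    let end = offset.checked_add(len)?;
    data.get(offset..end)
}

/// Returns the module part of a "module::function" style name as a
/// borrowed slice. Nothing is allocated or copied, unlike returning a
/// cloned `String`.
pub fn module_of(name: &str) -> &str {
    name.rsplit_once("::").map(|(module, _)| module).unwrap_or("")
}
```

Both functions return data that borrows from the input, so the caller decides whether an owned copy is ever needed.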
Matthias
00:28:00
What I found was that if you use clones a lot, then usually that indicates that you have a problem with your architecture or with your design. There can be situations when you cannot fully lean into the Rust borrow checker because of a fundamental issue with your code base. And if you try to remove those clones, you will find the root cause of it. And maybe you can improve the design of the entire application to not work against the borrow checker all the time, because it is certainly a warning sign if you have a lot of clones, at least to me. It's not bad, unless you are not aware of what's going on and how you could improve the situation.
Arpad
00:28:47
I think you're right. There's definitely lots of cases where clones can show you that you're doing something wrong. For us, though, one of the problems might be that in lots of cases, we actually want to have things that have the `'static` lifetime. Or to phrase it differently, we want to have owned types without any lifetime, right? So for example, as we were discussing Tokio previously, if you want to `tokio::spawn` something, it better have the `'static` lifetime, right? Because Tokio takes it, and now Tokio is responsible for it. Tokio will move it across threads, and Tokio will just, you know, drop it once it's finished, and…
Matthias
00:29:38
Isn't that also one thing that you see as well, where you have normal, let's say synchronous, Rust code and you freely borrow things and you can move things around, and all of a sudden you move into an async ecosystem where maybe the runtime imposes that you think about such things like allocations or maybe ownership as well, and then some of the patterns don't work anymore? So you need to clone where before you would be able to move, for example.
Arpad
00:30:16
I think the very interesting thing here is that you actually have a choice. What I mean by this is that async doesn't automatically mean that you cannot have references or that everything needs to be `Send` and `'static` and all these kinds of things. Because you can have structured concurrency, I believe that's the term people use: you can create a bunch of futures and you can use `join_all`. But `join_all` means that those futures still execute within, well, your async function. It's just spawn where you actually have to have something that is `Send` and `'static`, right? And there are different trade-offs here. If you use structured concurrency with `join_all`, or if you join two futures or a vector of futures, then those run concurrently, but they do not run in parallel. So it's still a single CPU at a time working on all of these futures. But you have the advantage that you can actually have references without any kind of problems. On the other hand, if you use `tokio::spawn`, you do have parallelism, because individual CPU cores can execute all of this code, but you have the disadvantage that you have to use things like `Arc`s and mutexes all over.
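[Editor's sketch] The trade-off described here can be illustrated without any runtime. Below, `join2` is a hand-written stand-in for a join combinator (real code would more likely use something like `join_all` from the futures crate), and `yield_now`, `count_lines`, `count_words`, `analyze`, and the toy `block_on` are all hypothetical helpers. The two joined futures borrow the same `&str`, which works because they are polled inside the caller's task.

```rust
use std::future::{poll_fn, Future};
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Suspends once, to simulate waiting on I/O.
async fn yield_now() {
    let mut yielded = false;
    poll_fn(|cx| {
        if yielded {
            Poll::Ready(())
        } else {
            yielded = true;
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    })
    .await
}

/// Polls both futures within the current task until both are finished.
/// They make progress concurrently, but never in parallel.
async fn join2<A: Future, B: Future>(a: A, b: B) -> (A::Output, B::Output) {
    let mut a = pin!(a);
    let mut b = pin!(b);
    let (mut out_a, mut out_b) = (None, None);
    poll_fn(|cx| {
        if out_a.is_none() {
            if let Poll::Ready(v) = a.as_mut().poll(cx) {
                out_a = Some(v);
            }
        }
        if out_b.is_none() {
            if let Poll::Ready(v) = b.as_mut().poll(cx) {
                out_b = Some(v);
            }
        }
        if out_a.is_some() && out_b.is_some() {
            Poll::Ready((out_a.take().unwrap(), out_b.take().unwrap()))
        } else {
            Poll::Pending
        }
    })
    .await
}

async fn count_lines(text: &str) -> usize {
    yield_now().await;
    text.lines().count()
}

async fn count_words(text: &str) -> usize {
    yield_now().await;
    text.split_whitespace().count()
}

struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Toy executor: polls the future in a loop until it is ready.
fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}

/// Both futures borrow `text`: no `Arc`, no `'static` bound. Handing
/// either future to `tokio::spawn` would not compile for a borrowed
/// `text`, since spawned futures must be `Send + 'static`.
pub fn analyze(text: &str) -> (usize, usize) {
    block_on(join2(count_lines(text), count_words(text)))
}
```

The price of this convenience is the one named above: everything runs on whichever thread polls the outer future.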
Matthias
00:31:50
Very true. Although, as a small caveat, there are other runtimes which don't depend on work stealing, where you could potentially have multiple things running concurrently on different cores, like thread-per-core models, where you don't need synchronization.
Arpad
00:32:09
You might not need synchronization, but you definitely have to make these things `'static`, because they might potentially outlive the place where you actually spawn the future, right?
Matthias
00:32:23
Yes, yes. So you might get rid of the `Mutex`, but not necessarily the `Arc`.
Arpad
00:32:29
Mhm.
Matthias
00:32:31
Well, it's kind of an interesting discussion, but we diverged a little bit. Let's come back to the business side of things a little. You mentioned that you worked on symbolication, and you also touched on Relay real quick. Maybe we can start with symbolication: what is it about, what does it do, how is it built, and so on?
Arpad
00:32:57
Symbolication, to put it in simple terms, means we make stack traces readable. Right now symbolication is done for native events and it's done for JavaScript events. For native events, you might know also from Rust code that if you have a panic in release mode built without debug symbols, you see a lot less detail than if you are panicking in a debug build, right? With Sentry, you build your release code with debug symbols, but then you split it apart. One thing you ship to production, the other thing with all the debug data you upload to Sentry. And we put those two parts together again to give you a readable stack trace. That's what the processing pipeline does. For JavaScript, it's a bit similar, but it's called source maps, and JavaScript and TypeScript developers might be familiar with them. In JavaScript, you also have your minified JavaScript that runs somewhere in some browser or maybe also in the cloud. And then you have a source map, which actually turns the stack traces into something that's readable. And that's also what we're doing in the processing pipeline.
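[Editor's sketch] At its core, making a native stack frame readable means mapping an instruction address back to the function that contains it. The following is a deliberately simplified illustration with a hypothetical `Symbol` table; it is not how Symbolicator is implemented, and real debug formats carry far more information (inlined functions, file names, line numbers).

```rust
/// One entry of a hypothetical symbol table extracted from a debug file.
pub struct Symbol {
    pub start: u64,
    pub size: u64,
    pub name: &'static str,
}

/// Finds the function containing `addr`.
/// `symbols` must be sorted by `start` address.
pub fn symbolicate(symbols: &[Symbol], addr: u64) -> Option<&'static str> {
    // Number of symbols whose start address is at or before `addr`.
    let idx = symbols.partition_point(|s| s.start <= addr);
    // The last of those is the only candidate; none means `addr` lies
    // before the first symbol.
    let candidate = symbols.get(idx.checked_sub(1)?)?;
    // Reject addresses that fall into a gap after the candidate's end.
    (addr - candidate.start < candidate.size).then_some(candidate.name)
}
```

With a table containing `main` at `0x1000` (size `0x20`) and `parse` at `0x1020` (size `0x10`), the address `0x1004` resolves to `main`, and `0x1030` resolves to nothing because it lies past the end of `parse`.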
Matthias
00:34:19
Which languages are supported for these stack traces?
Arpad
00:34:25
It's JavaScript and everything that boils down to native code. And also .NET to a certain degree.
Matthias
00:34:35
That's a lot of supported platforms. For example, that would also support Rust, I guess, and Go and C?
Arpad
00:34:43
Absolutely, yeah.
Matthias
00:34:44
And I only vaguely know that you have a concept called demangling in some of these languages, where you have certain identifiers for functions and types, and they get mangled in some form by the compiler. Is that also something that you do? Do you kind of reverse that process to get the underlying symbol or type that you had in your code?
Arpad
00:35:15
Yes, exactly. So demangling is especially relevant for all the native platforms. As you mentioned, the symbols that might be shipped in your executable are compressed using a certain scheme for how the names of these symbols are encoded. And demangling basically reverses this and gives you a readable function name that also has things like the generic parameters and all these kinds of things in it.
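[Editor's sketch] As a toy illustration of what demangling reverses, the function below handles only the simplest nested-name form used by the Itanium C++ ABI, `_ZN<len><ident>...E`. The name `demangle_nested` is hypothetical. Templates, argument types, substitutions, and everything else real demanglers deal with are out of scope, and anything after the closing `E` is ignored.

```rust
/// Demangles names of the form `_ZN<len><ident>...E`, for example
/// `_ZN3foo3barE`. Returns `None` for anything outside this tiny subset.
pub fn demangle_nested(mangled: &str) -> Option<String> {
    let mut rest = mangled.strip_prefix("_ZN")?;
    let mut parts = Vec::new();
    while !rest.starts_with('E') {
        // Read the decimal length prefix of the next identifier.
        let digits = rest.bytes().take_while(u8::is_ascii_digit).count();
        let len: usize = rest.get(..digits)?.parse().ok()?;
        // Take exactly `len` bytes as the identifier, without panicking
        // on truncated or malformed input.
        let end = digits.checked_add(len)?;
        parts.push(rest.get(digits..end)?);
        rest = rest.get(end..)?;
    }
    if parts.is_empty() {
        None
    } else {
        Some(parts.join("::"))
    }
}
```

For the input `_ZN3foo3barE` this yields `foo::bar`. A production pipeline would rely on dedicated demangling libraries for each scheme instead.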
Matthias
00:35:51
And how do you even build such a platform? Walk us through the components which take some input and then in the end you have some output. Like, what are the steps required?
Arpad
00:36:04
The first step is you get the events, and I would say even the most difficult part then is finding the corresponding debug files.
Matthias
00:36:18
On the server, shouldn't they be right there?
Arpad
00:36:20
They are right there, but you need to know which files to use, and there are many files. There needs to be some kind of identifier for it, right? And this is also something I talked about on my blog a little bit: the files need to have some kind of unique identifier, because we're talking about a pair of files, right? So you always have one file with the executable code that runs in production. And then you have a different file that has all of the debug information in it, right? And they have to share some kind of unique identifier, right? And if a crash or a panic happens in production, we attach this unique identifier to the event we send to Sentry. And with this unique identifier, we can then look up the corresponding debug file. And there are also things like system libraries, and there is a whole symbol server specification that I believe was pioneered by Microsoft. And for example, all of the Microsoft DLLs, ntdll or however they are called, are accessible on Microsoft's official symbol server. And they all have a unique ID with which we can download them from there. And then we can show you the function name in the Windows system call.
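The lookup described here can be sketched as a map keyed by the shared identifier. Everything below is hypothetical (the `DebugId` and `DebugFileIndex` types, and the `<name>/<id>/<name>` path layout, which is assumed here to resemble how symbol servers commonly lay out files); it only illustrates why the executable and its debug file must carry the same ID.

```rust
use std::collections::HashMap;

/// Hypothetical identifier shared by an executable and its debug file.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct DebugId(String);

/// Hypothetical index from debug ID to the location of the uploaded debug file.
struct DebugFileIndex {
    by_id: HashMap<DebugId, String>,
}

impl DebugFileIndex {
    fn new() -> Self {
        DebugFileIndex { by_id: HashMap::new() }
    }

    /// Registers an uploaded debug file under its ID.
    fn insert(&mut self, id: DebugId, location: &str) {
        self.by_id.insert(id, location.to_string());
    }

    /// Finds the debug file for the ID attached to an incoming event.
    fn lookup(&self, id: &DebugId) -> Option<&str> {
        self.by_id.get(id).map(|location| location.as_str())
    }
}

/// Builds a symbol-server style relative path (assumed layout:
/// file name, then ID, then file name again).
fn symbol_server_path(file_name: &str, id: &DebugId) -> String {
    format!("{file_name}/{}/{file_name}", id.0)
}
```

An event that carries an ID with no matching upload simply gets `None` from `lookup`, which mirrors the case where a stack trace cannot be symbolicated.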
Matthias
00:37:55
Do you also sometimes have to guess if you're not exactly sure what a symbol might be? Or do you have to reverse engineer this process? Or do you say the specification for symbols is sufficient?
Arpad
00:38:11
I would say, well, it's a tricky question, because are things really specified well enough? Sometimes maybe yes, sometimes maybe no. So the symbol server specification, where we know depending on things like unique identifiers and file names how to download corresponding debug files, that's quite well done. And we can use it quite well. Things like demangling, they are so-so, I would say. It is kind of specified, but the specification lags behind reality a little bit. Because especially in the C++ world, Clang tends to add support for newer C++ language features, but those are not really specified in the demangling scheme yet.
Matthias
00:39:15
On the plus side, you'll learn a lot about new C++ language features like that.
Arpad
00:39:27
Absolutely, yeah. And also, what has been quite challenging, I believe it's a year or half a year roughly since we added support for source maps here. And source maps are especially difficult in this regard, because JavaScript files do not have a unique identifier. And we are kind of championing this idea to give JavaScript files and their corresponding source maps a shared unique identifier. Because usually a JavaScript file just has a source map reference, or not even this, so it only refers to its source map by name. But as you might be aware, on the web, if you redeploy and the file names don't change, they have different versions, right? So you don't know if you are using the right source map, or maybe it's an outdated source map, so you're applying the wrong debug info and you get the wrong results. So especially in JavaScript, the process of actually finding the correct source map for a JavaScript file is the most challenging thing.
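As a rough illustration of the two lookup strategies, the sketch below reads trailing `//# key=value` comments from a JavaScript file. The `sourceMappingURL` comment is the established by-name reference; the `debugId` key is used here as an assumed spelling of the shared-identifier idea, and the helper name `magic_comment` is hypothetical.

```rust
/// Returns the value of the last `//# <key>=<value>` comment in `source`,
/// or `None` if no such comment exists.
fn magic_comment<'a>(source: &'a str, key: &str) -> Option<&'a str> {
    source.lines().rev().find_map(|line| {
        // Only lines of the form `//# key=value` are considered.
        let rest = line.trim().strip_prefix("//#")?.trim_start();
        let value = rest.strip_prefix(key)?.strip_prefix('=')?;
        Some(value.trim())
    })
}
```

A name such as `app.min.js.map` stays the same across deployments, so it cannot tell two versions apart. An identifier embedded in both the JavaScript file and its source map can, because a mismatch is detectable before the wrong mapping is applied.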
Matthias
00:40:45
Were this source map analysis and the demangling part completely separate? Were they separate code bases? Did you later integrate source map support? How did that process work?
Arpad
00:40:58
It's not completely separate. It's all built into Symbolicator. And Symbolicator internally has lots of shared code. Things like caching: we do very sophisticated caching on different kinds of layers. And the JavaScript and source map processing definitely took lots of these things and extended on them. And afterwards, we also tried splitting up Symbolicator into a workspace with different crates as well. So we have a crate for native processing, we have a crate for source map processing, and then we have some other shared code that's used by all of those.
Matthias
00:41:43
And where do you deploy the code? What cloud provider do you use?
Arpad
00:41:48
We are using Google, and we have a whole team responsible for all of these operations topics, and they use Kubernetes under the hood. And for me personally, I sometimes take a look at all of these Kubernetes definition files, and I'm quite overwhelmed. I try to do some modifications here and there, but oftentimes I don't know what I'm doing.
Matthias
00:42:17
It's funny that you mention that, because it feels like Rust is easier for you to grasp, or it feels more natural to you than that domain. And all of these parts can be scaled horizontally? Let's say you had a lot of traffic; probably there was some way to scale it up to handle the traffic, right?
Arpad
00:42:41
Mhm, absolutely. So we do have horizontal scalability. We have load balancers, and behind the load balancers there's a couple of servers that are deployed with the Symbolicator Rust code. And the load balancers just take care of distributing this to all of the nodes running the Rust code.
Matthias
00:43:04
Now, the other project that you mentioned is Relay. Can you quickly talk about the differences between Relay and Symbolicator when it comes to the architecture of those two applications?
Arpad
00:43:22
So Relay is really the first entry point when an event reaches Sentry. So a client application panics, it sends a report, and it goes to Relay first. Relay handles a lot, a lot, a lot of traffic. And Relay does a couple of things. First, it actually checks whether your DSN is correct. And just as a side note, the DSN is basically your token with which you send events to Sentry. So Relay validates that there is actually a project that exists. It validates that the project has quota to actually ingest those events. Relay parses the initial JSON payloads, or depending on the architecture it's JSON, or for native oftentimes it's also minidumps. So Relay validates that they have the correct format, it rejects stuff that's invalid, it does things like spike protection, all these kinds of things. So Relay definitely has to handle a ton more traffic than Symbolicator does. Relay is the big filter in front of the whole rest of the Sentry pipeline.
Matthias
00:44:45
And since it has to handle all of that traffic, and it feels like it's rather monolithic, correct me if I'm wrong here, does that mean it runs on a big box and you have one or two instances in front like a proxy? Or is it also horizontally scalable somehow?
Arpad
00:45:02
It is also horizontally scalable, but it's also vertically scalable, so there are different data centers. And one thing that Sentry has invested a lot of energy in recently was opening a new data center location in the EU. I'm not quite sure if it's available yet for customers, but it's definitely on the horizon. And these are called points of presence. We have relays running horizontally scaled there. And they do what I talked about, this initial filtering. And then they forward stuff to other relays, which then do a little bit more processing of the event. And then they forward things into Kafka, which is kind of the source of truth when it comes to ingesting customer traffic or customer events.
Matthias
00:45:59
And these other relays, are they the same application, just different configurations?
Arpad
00:46:04
Exactly. Yeah. So it's the same application. It can run either in proxy mode or in processing mode. In proxy mode, it just checks spike protection, it checks that the token is valid and that the customer has quota, and if those things match, it's forwarded, and otherwise it's rejected. So that's just a Relay in proxy mode. And a Relay in, I think, processing mode does more things. It might split up certain events into different parts. It splits up Sentry envelopes into different payloads. And as a short explanation here, Sentry envelopes are a custom data format to group different things together. So an envelope can have some metadata for the event itself. It can have attachments. For example, if you tell the SDK, hey, please attach this log file every time a panic happens, then that's a different payload. And Relay takes this whole envelope that contains everything, and then it splits it up into the different components, it validates those, it applies things like quota and all of these kinds of things, and then throws them into Kafka.
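The two modes can be sketched in a few lines. All types and names below (`Envelope`, `Item`, `Project`, `proxy_check`, `split_envelope`) are hypothetical simplifications for illustration; Relay's real envelope format and checks are considerably richer.

```rust
use std::collections::HashMap;

/// One payload inside an envelope (hypothetical, simplified shape).
#[derive(Debug, PartialEq)]
enum Item {
    Event(String),
    Attachment { name: String, bytes: Vec<u8> },
}

/// A group of payloads sent together under one project key.
struct Envelope {
    project_key: String,
    items: Vec<Item>,
}

/// Per-project state the proxy consults (hypothetical).
struct Project {
    remaining_quota: u32,
}

#[derive(Debug, PartialEq)]
enum Decision {
    Forward,
    Reject(&'static str),
}

/// Proxy-mode style check: the project must exist and have quota left.
fn proxy_check(envelope: &Envelope, projects: &HashMap<String, Project>) -> Decision {
    match projects.get(&envelope.project_key) {
        None => Decision::Reject("unknown project key"),
        Some(project) if project.remaining_quota == 0 => Decision::Reject("quota exhausted"),
        Some(_) => Decision::Forward,
    }
}

/// Processing-mode style step: split an envelope into events and attachments
/// so each kind of payload can be validated and forwarded separately.
fn split_envelope(envelope: Envelope) -> (Vec<String>, Vec<(String, Vec<u8>)>) {
    let mut events = Vec::new();
    let mut attachments = Vec::new();
    for item in envelope.items {
        match item {
            Item::Event(json) => events.push(json),
            Item::Attachment { name, bytes } => attachments.push((name, bytes)),
        }
    }
    (events, attachments)
}
```

The cheap check runs first so that invalid or over-quota traffic is rejected before any of the more expensive splitting and validation happens.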
Matthias
00:47:25
Okay, if you allow me this interruption: early on you also mentioned minidumps. What are these? Because I've never heard that term either.
Arpad
00:47:35
Minidumps are a way to ingest native crashes. And I believe the format has been pioneered by Microsoft, because it's part of the Windows API. But there have also been some open source projects to generate the same format on Linux and on Mac. And minidumps basically contain the CPU context for all of the threads that are running, plus maybe memory regions for all of the stack memory. If you want, you can have other memory regions, and also the list of libraries that are loaded into the process. All of this is contained within this minidump. And the processing pipeline can then take all of this information: the list of loaded libraries, which contain, by the way, the unique IDs we talked about previously, and the CPU context, and the stack memory. And from there, we actually extract the stack trace on our processing end. So that's different from a panic where, for example, Rust itself or the Rust SDK already has a stack trace at runtime and then just puts it into a JSON payload to send it to Sentry. So that's the big difference.
Matthias
00:49:11
Okay, a bit like a superset of a stack trace with much more context. And you kind of have wrappers around the application or business logic, or maybe the error that happened, and you process it at various stages. You have the stack traces, minidumps, and then you have envelopes, and at some point you need to go through each of these layers in order to route and process the errors that you encounter.
Arpad
00:49:40
Mm-hmm.
Matthias
00:49:42
In layman's terms. Yeah. Now comparing the architecture of the Symbolicator and Relay, are there any differences? Do you also use tokio in Relay? Do you use something else? What are the differences?
Arpad
00:49:59
Like Symbolicator, Relay has also made this change from Actix to tokio, and also changed the front-end HTTP server to Axum. And also the HTTP client that sends things further, to reqwest. So this change has also been done in Relay. But one difference, I believe, is that Relay still uses the actor model in some parts of its code base. And the actor model has then been extended with some niceties around it. It looks and feels like it's just an async function, but internally it's actually sending messages to actors and receiving replies and all these kinds of things.
Matthias
00:50:45
One thing I heard about actor models in parallel processing, which some people say you need to be cautious of, is the queue or the message handling. So every actor has an input queue or some sort of inbox, and this could grow. And depending on your settings, it could either grow indefinitely or it could reach a limit where it stops accepting messages. Did you ever run into such issues with Relay? Or is there anything that you say you need to be cautious of when you use the actor model in your context?
Arpad
00:51:25
I believe one of the big reasons that Relay still uses the actor model is exactly this point, that you do have these inboxes and queues. But in a positive sense, this also gives you the possibility to apply back pressure and to react to back pressure a little bit better than if you just had async code. So I believe that is one of the main reasons Relay is still using this internally. But otherwise I'm not sure I can really answer this question.
Matthias
00:52:01
How would you handle back pressure with tokio?
Arpad
00:52:05
Ah, that's a good question. I believe you gotta have some kind of queue or semaphore or something somewhere.
Matthias
00:52:14
At the end, you build your own inbox of some sort, and you need to...
Arpad
00:52:18
Yeah, basically.
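The "queue or semaphore somewhere" idea can be sketched with a bounded inbox. This example uses the standard library's `sync_channel` so it runs without dependencies; Tokio's bounded `mpsc` channel and `Semaphore` play the analogous role in async code. The names `bounded_inbox`, `submit`, and `Submit` are hypothetical, and this is not Relay's actual mechanism.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Outcome of trying to hand an event to a worker's inbox.
#[derive(Debug, PartialEq)]
enum Submit {
    Accepted,
    /// The inbox is full: the caller should slow down, retry later, or shed load.
    Overloaded,
    /// The worker is gone and nobody will ever read the inbox.
    Closed,
}

/// Creates an inbox that holds at most `capacity` pending events.
fn bounded_inbox(capacity: usize) -> (SyncSender<String>, Receiver<String>) {
    sync_channel(capacity)
}

/// Non-blocking submit: a full queue is reported instead of growing without limit.
fn submit(inbox: &SyncSender<String>, event: String) -> Submit {
    match inbox.try_send(event) {
        Ok(()) => Submit::Accepted,
        Err(TrySendError::Full(_)) => Submit::Overloaded,
        Err(TrySendError::Disconnected(_)) => Submit::Closed,
    }
}
```

The design choice is the capacity limit itself: once the inbox is bounded, an overloaded consumer becomes visible to the producer, which is what makes back pressure possible in the first place.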
Matthias
00:52:21
Yeah. I know that from previous projects, there were a lot of queues in such systems. And of course, there were also a lot of constraints and definitions of these queues. So how long they can be and what sort of traffic they handle and so on. It's a very, very tough problem, I guess. Interesting problem space, though. Now, given that Sentry handles a lot of traffic and a massive volume of events and transactions daily, how would you say, from your perspective, has Rust contributed to handling this high throughput? And maybe also, what performance gains have you observed?
Arpad
00:53:06
I would say Rust actually enables us to handle such volumes of traffic in lots of cases. So the whole processing pipeline that we talked extensively about, that I maintain, doesn't handle as much traffic. It's just about 2000, 3000 events per second. But here, actually processing them is the hard part or the expensive part. Other parts of the pipeline handle a lot more, and I'm actually not quite sure about the exact number of total events that we get, also considering things that are just rate limited and thrown away immediately versus things that actually get ingested into Sentry. But having this Rust code and having this high performance is actually what enables us to build new products that handle a lot more traffic. And maybe I've hinted at this previously, that Sentry is also working on a metrics product. And metrics especially wouldn't be possible without having Rust code in between to handle the amounts of traffic. And to give you some numbers as well, there's been a couple of initiatives internally to rewrite some parts of the whole pipeline. The whole pipeline is enormous. It has lots of moving parts. And there's tons of Python code still there. And we try to move little bits of this to Rust, things that are especially critical or things that are costing a lot in terms of cloud costs. And here, one team has rewritten one critical part of the infrastructure in Rust. And they've observed a 10 to 20x improvement in throughput and eventually in cloud cost as well, because you can then just scale down the number of horizontally scaled workers and the CPU and memory allocations they get. And that has given 10 to 20x improvements here.
Matthias
00:55:20
And what is the amount of traffic that this new service gets?
Arpad
00:55:25
Depending on the workloads, there is one specific use case where it handles more than 100k Kafka messages per second.
Matthias
00:55:37
So that's massive.
Arpad
00:55:40
Yeah. Those are being handled by Rust code. They are being batched up and then processed further.
Matthias
00:55:46
Pretty impressive. What are, in your opinion, some of the more unique challenges that you faced when integrating Rust with other languages, given that Sentry has a very homogeneous stack, but also multiple languages in the stack, like Python as well? What are some things that you encountered that maybe are very specific to your problem domain here?
Arpad
00:56:17
It's very interesting. So Sentry itself is a really big Python application and we are interfacing Python and Rust on different layers. So we have some extension modules written in Rust that are directly called from Python. We also have things like Symbolicator, for example, which expose an HTTP interface. And the HTTP interface is a completely different topic. It could be something else, but it's a completely different service, which is scaled differently, which is stateful in the sense that it's long running and it has some internal caching that is long lived. But the other way we interface with Python is via extension modules. So you write an extension module with Rust. And Sentry has been doing this for a long time already. And just recently, one of the things I did specifically was to pioneer PyO3. So we created a new Python package written using PyO3 and Maturin. And I'm actually really, really satisfied with how this turned out eventually. There were some really arcane ways of writing Python extension modules previously, where you had to use cbindgen to generate C headers and then from there use, I don't know what the Python module is that actually consumes C headers and C FFI bindings, and then just have tons more Python glue code on top of it. And PyO3 just does all of this for you. And that's really liberating. And I believe doing this one project now, actually these last couple of weeks, we've demonstrated that this actually works, and it works really well, and it's productive. And you can use this to write even more code in Rust that's being used from Python directly.
Matthias
00:58:31
And can you give me an example of what such a Python extension might do? What might such a Python module do for you at Sentry?
Arpad
00:58:43
There's different things we use this for. One thing we use existing Python extension modules for right now is validating native debug files. So we talked extensively about debug files, that they need to have a unique ID and all these kinds of things. And when customers upload them to Sentry, they are validated from within Python code that calls Rust code to do this validation. And there are other parts of the processing pipeline that are not moved into Symbolicator as its own stateful service written in Rust, but which are still written in Rust and called from Python. An example here is ProGuard processing. And for people who don't know, ProGuard processing is basically a similar demangling or deobfuscation that happens for Java and for Android. So Java code is being obfuscated in a certain way, and you have a ProGuard file which acts as some form of debug information with which you can turn this back into readable function names and stack traces. Handling these ProGuard files and parsing this ProGuard format, which is a text-based format, is done in Rust, but the way we drive it is still via Python. So Python says, please parse this file, please deobfuscate this function name. But the actual logic to do so, that's written in Rust.
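To give a feel for the text-based format, here is a minimal sketch that reads only the class lines of a ProGuard-style mapping file, assuming the common `original.Name -> obfuscated.name:` layout with indented member lines and `#` comments. The function names are hypothetical, member and line-number mappings are ignored, and this is not Sentry's parser.

```rust
use std::collections::HashMap;

/// Collects `obfuscated -> original` class names from mapping text.
/// Indented member lines and `#` comment lines are skipped.
fn parse_class_mappings(mapping: &str) -> HashMap<String, String> {
    let mut classes = HashMap::new();
    for line in mapping.lines() {
        if line.starts_with(char::is_whitespace) || line.starts_with('#') {
            continue;
        }
        // Class lines end with a colon: `original -> obfuscated:`.
        if let Some(entry) = line.trim_end().strip_suffix(':') {
            if let Some((original, obfuscated)) = entry.split_once(" -> ") {
                classes.insert(obfuscated.to_string(), original.to_string());
            }
        }
    }
    classes
}

/// Maps an obfuscated class name back to its original name,
/// returning the input unchanged when no mapping is known.
fn deobfuscate_class<'a>(classes: &'a HashMap<String, String>, name: &'a str) -> &'a str {
    classes.get(name).map_or(name, |original| original.as_str())
}
```

In the setup described above, a Python caller would drive functions like these through an extension module, asking Rust to parse the file once and then to deobfuscate individual names.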
Matthias
01:00:27
Yeah, and you get that input from users and it could be whatever, it could be anything. And I would imagine that Rust helps you treat such unsafe input, like untyped JSON, as something that maybe will turn out to be safe from the Python ecosystem's perspective. Is that correct?
Arpad
01:00:52
Yes, yes. So one thing that Rust definitely helps with is dealing with untrusted user inputs.
Matthias
01:01:01
Yeah. And with regards to that, given that the Rust ecosystem is still maturing and it's probably still a little rough around the edges for very specific use cases, have you encountered any gaps or missing features that you had to work around or contribute back to the community?
Arpad
01:01:26
In Rust specifically, what kind of missing features? There's different things I would say. One part, as you asked specifically about parsing things and handling files: one thing I miss from Rust as it exists today is self-referential data types. And we have some workarounds for this. So you can write unsafe code to have a wrapper that actually owns a buffer, but also holds a type that references this buffer. Right. And it's an owned type. It has a static lifetime on the outside, even though it has self references on the inside. Right. But for this, you still have to do some arcane trickery and unsafe, and I would love this to be eventually possible, and I hope it will be someday. The other thing from the Rust ecosystem point of view, more on the side of Sentry as a product, would be the whole thing about Rust error handling, and stack traces in particular.
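For contrast with the unsafe wrapper described here, the sketch below shows a different, safe workaround: the owning type stores byte ranges into its own buffer instead of references, and hands out borrowed slices on demand. The `OwnedLines` type is hypothetical. It sidesteps the problem for simple cases but does not help when the parsed type itself must hold references, which is the situation the unsafe approach targets.

```rust
use std::ops::Range;

/// Owns a text buffer plus the byte ranges of its lines.
/// Storing ranges instead of `&str` avoids a self-referential struct.
struct OwnedLines {
    buffer: String,
    lines: Vec<Range<usize>>,
}

impl OwnedLines {
    /// Parses the buffer once and remembers where each line starts and ends.
    fn new(buffer: String) -> Self {
        let mut lines = Vec::new();
        let mut start = 0;
        for piece in buffer.split('\n') {
            lines.push(start..start + piece.len());
            // Skip over the newline separator that `split` removed.
            start += piece.len() + 1;
        }
        OwnedLines { buffer, lines }
    }

    /// Re-creates the borrowed view on demand, tied to `&self`.
    fn line(&self, index: usize) -> Option<&str> {
        self.lines.get(index).map(|range| &self.buffer[range.clone()])
    }
}
```

Because `OwnedLines` contains no references, it can be moved, stored in a cache, or sent to another thread like any other owned value.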
Matthias
01:02:41
Isn't that one of the big advantages of rust?
Arpad
01:02:44
It is, it is. But it's both an advantage and also a disadvantage. It's an advantage in a way that it's very explicit, and I believe also this split between panics, which in theory you can catch but in practice you don't, versus results and errors as a type that you return, and you have this question mark operator that just propagates this. So from a user's perspective, this is really, really nice. It's a lot better to reason about than exceptions, where you never know which types you get. So Rust makes it extremely clear that if you have a result, you know which types to expect, even though in reality, no one really matches on the error type. The cases in which you really, really match on the error type are extremely rare. Oftentimes you just question mark and propagate the results. And from Sentry as a product perspective, what I would like here is really better integration for stack traces. And this is happening. There is an initiative in the Rust community to bring stack traces to errors. It's still very experimental and there's a lot of progress that still needs to be made. But one issue with this still, even if we had a way to have stack traces from Rust errors, is the problem of where you capture the stack trace, right? So Rust is a language which makes it extremely clear what the costs of certain operations are. Right. And capturing a stack trace is quite expensive. So you do not want to capture a stack trace and return it with every result in a tight loop.
Matthias
01:05:04
Because you would have to allocate things on the heap and that could potentially be expensive or almost impossible on some platforms.
Arpad
01:05:12
Yeah, exactly. So allocations, for example, are one thing, but also capturing the stack trace is quite an expensive process.
Matthias
01:05:23
Yes.
Arpad
01:05:23
But it has advantages and disadvantages. The advantage is that Rust results or errors, for example if you're using the thiserror crate to just define an enum that has a bunch of variants, are easy to do. And oftentimes it's quite cheap in the sense that it's just a couple of bytes. Those things can balloon quite a bit if you wrap an error in an error in an error. But the problem with stack traces in particular is that the location where you initially return an error versus the location where you capture it, or where you attach a stack trace to it, might be completely different, right? From a developer's perspective, I want to know what is the original location where an error was first returned. But oftentimes I only get this wherever I unwrap, or wherever I capture it in Sentry. Or if you use anyhow with the backtrace feature, whenever you convert a more detailed error type into an anyhow error and anyhow attaches a stack trace to it, that's kind of the code location you get. And oftentimes it's not detailed enough, right?
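One cheap way to remember where an error was first created, without capturing a full stack trace, is the standard library's `#[track_caller]` attribute together with `std::panic::Location`. The sketch below is an illustration of that idea only; the `LocatedError` type and `parse_port` function are hypothetical, and a single source location is of course much less information than a backtrace.

```rust
use std::fmt;
use std::panic::Location;

/// An error that records the source location where it was constructed.
#[derive(Debug)]
struct LocatedError {
    message: String,
    origin: &'static Location<'static>,
}

impl LocatedError {
    /// `#[track_caller]` makes `Location::caller()` report the call site
    /// of `new`, not a line inside this function.
    #[track_caller]
    fn new(message: impl Into<String>) -> Self {
        LocatedError {
            message: message.into(),
            origin: Location::caller(),
        }
    }
}

impl fmt::Display for LocatedError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "{} (first returned at {}:{})",
            self.message,
            self.origin.file(),
            self.origin.line()
        )
    }
}

impl std::error::Error for LocatedError {}

/// The recorded location points into this function, where the error
/// originates, no matter how far the result is propagated with `?`.
fn parse_port(input: &str) -> Result<u16, LocatedError> {
    input
        .parse()
        .map_err(|_| LocatedError::new(format!("invalid port: {input}")))
}
```

Recording a `Location` is a pointer-sized copy, so it is affordable even on hot paths where capturing a backtrace for every error would not be.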
Matthias
01:06:54
That would be one area where you, more specifically, would want to see some improvements, maybe in the standard library or maybe also outside the standard library in some crates. How would you try to improve the situation here?
Arpad
01:07:08
Mhm. I would say anyhow, as a crate, or there's also alternatives to it, of course, but I'm just using it as one example, is already very good because it's quite a generic error type that you can throw or return from anywhere. And depending on feature flags, it does have a stack trace on it. So that's very good. So some ecosystem support here exists. The big question is basically the question of anyhow versus thiserror. Do you want to have an opaque error type that can be anything, which by the way is also cheap in the sense that it's just a pointer to some heap-allocated thing, versus do I want to have a really detailed error that I can match on, but which might be an enum with sub-enums and which balloons to 200 bytes in the error case, and that also has some other costs. So there is this difference between these two approaches. And I'm not quite sure what the best way here is. I believe in the Rust community, it's often said that if you're writing a crate that you publish to crates.io, you better have some detailed error enums. But on the other hand, if you have an application and you never match on those enums anyway, or just in, I don't know, a tiny percentage of cases, then you better use anyhow for this. So there's these two worlds and they have pros and cons.
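The trade-off can be sketched with the standard library alone. Below, a hand-written enum stands in for what one would typically derive with thiserror, and `Box<dyn Error + Send + Sync>` stands in for an opaque error in the style of anyhow (the real `anyhow::Error` has a different internal layout). All type and function names are hypothetical.

```rust
use std::error::Error;
use std::fmt;
use std::mem::size_of;

/// Detailed, matchable error: every variant carries its own data inline.
#[derive(Debug)]
enum DetailedError {
    MissingDebugFile { debug_id: String, file_name: String },
    Malformed { offset: u64, reason: String },
}

impl fmt::Display for DetailedError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            DetailedError::MissingDebugFile { debug_id, file_name } => {
                write!(f, "no debug file {file_name} for id {debug_id}")
            }
            DetailedError::Malformed { offset, reason } => {
                write!(f, "malformed input at offset {offset}: {reason}")
            }
        }
    }
}

impl Error for DetailedError {}

/// Opaque error: a pointer to a heap allocation that can hold any error type.
type OpaqueError = Box<dyn Error + Send + Sync>;

/// Library-style function: callers can match on the exact variant.
fn load_detailed() -> Result<(), DetailedError> {
    Err(DetailedError::Malformed {
        offset: 16,
        reason: "unexpected end of file".to_string(),
    })
}

/// Application-style function: `?` converts any error into the opaque type.
fn load_opaque() -> Result<(), OpaqueError> {
    load_detailed()?;
    Ok(())
}

/// Sizes in bytes of the detailed enum and of the opaque pointer.
fn error_sizes() -> (usize, usize) {
    (size_of::<DetailedError>(), size_of::<OpaqueError>())
}
```

The enum grows with its largest variant and travels inside every `Result`, while the opaque error stays pointer-sized at the cost of a heap allocation and of needing a downcast to inspect the concrete type.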
Matthias
01:08:54
I agree. One other strategy that I heard about was that if you build a library that other people depend on, you might want to have an explicit error type with thiserror, whereas if you build an application, you can use anyhow, as you explained. I mean, this is more or less what you said anyway, but I think it's a reasonable guideline. But certainly it's great to have the choice and to say, oh yeah, I need more control over my errors, or no, in fact I don't really care. I want to have the option to handle this error if I want to in the future. I can search for it, I can grep for it, but I don't have to do it right away. And kind of deferring this error handling and maybe making a decision in the future is also something really great that I miss from other languages, where things happen and then an exception gets thrown and maybe it's not exactly apparent from the code. Whereas in Rust it's almost like a stop sign that makes you think about the code for a second, even if you decide not to handle it at this point. And with that, I wonder if you have any final message to the Rust community, anything that you want to mention, things that you say would be important going forward with the Rust 2024 edition or anything else in the ecosystem? The stage is yours.
Arpad
01:10:28
So I'm definitely very excited about Rust 2024. There is a bunch of proposals here. There's lots of compiler internal work. There's work to teach the borrow checker new things with Polonius, which hopefully finally gets into the compiler, and I hope it makes it to Rust 2024. Other things here are a rewrite of the trait system, which might be just details in the compiler itself, but I'm also very excited about it. And all of these things hopefully enable support for self-referential data types, which I mentioned before. Right now you can't do it, or rather you need arcane unsafe code to do it. And I hope this might be built into the language at some point. Another thing that's slowly being built, and I'm following along the commits to the compiler every now and then, and one thing I notice is being worked on actively right now, is support for first-class generator syntax. So things like writing an iterator, but with generator syntax. So that's also definitely something I'm looking forward to.
Matthias
01:11:50
Yes, and the generator syntax is also very close to my heart, coming from Python. I really, really use a lot of generators. And yes, it is something that I also would like to see in Rust. And it's always fun to have something to look forward to in this language. It's pretty exciting. Yeah, I hope that Rust 2024 will happen. And if so, I will upgrade immediately.
Arpad
01:12:19
I mean, it will definitely happen. The question is what's included.
Matthias
01:12:22
Yeah, yeah.
Arpad
01:12:24
But I'm definitely hopeful. And for other companies trying to adopt Rust, I would say just try it, give it a shot. But take a project, a part of the code base, which is quite isolated, in which you can just experiment with writing Rust. And Rust, by the way, as we talked about, we use it from Python. And from Python, I can really recommend PyO3. It's a game changer compared to the ways we did it before. So it's actually very easy to use Rust from within other languages. And I believe there's also a PyO3 equivalent in the JavaScript and Node ecosystem as well. I'm not quite sure what it is exactly.
Matthias
01:13:09
It's called Neon. Or maybe there might be multiple, but the one that I know is called Neon, and I don't know if it is used for any bigger project, but it exists. Yes, it's one of the great advantages of Rust that we have such nice ways to write wrappers for other languages.
Arpad
01:13:30
Yeah.
Matthias
01:13:31
And with that, we're getting very close to the end. Arpad, it was an honor to talk to you today to learn more about Sentry internals, monitoring, error handling, observability. I think if anyone out there wants to give Sentry a try or wants to integrate it into their Rust application, where can they go?
Arpad
01:13:57
They can just head to sentry.io or to docs.rs and take a look at the Sentry SDK. Both are fine, I would say. And especially in Rust, we have support for just capturing panics, which I believe is the most important thing. But as I mentioned previously, we also have integration for tracing. So if you use the tracing error macro, or if you just instrument things with the tracing instrument attribute, we capture those. We give you a great overview over traces and we also have support for distributed tracing. So if you want to integrate a Rust backend with a JavaScript frontend, that's covered end-to-end with the various language ecosystem SDKs that we have. So I would say just give it a shot.
Matthias
01:14:52
You heard it there. Head over to Sentry and check it out. Arpad, thanks again for your time and see you next time.
Arpad
01:15:01
Thanks for having me. Bye bye.
Matthias
01:15:04
Ciao. Rust in Production is a podcast by corrode and hosted by me, Matthias Endler. For show notes, transcripts and to learn more about how I can help your company make the most of Rust, visit corrode.dev. Thanks for listening to Rust in Production.