Rust in Production

Matthias Endler

Cloudsmith with Cian Butler

About oxidizing Python backends with Rust

2026-04-09 74 min

Description & Show Notes

Rust adoption can be loud, like when companies such as Microsoft, Meta, and Google announce their use of Rust in high-profile projects. But there are countless smaller teams quietly using Rust to solve real-world problems, sometimes without even noticing it. This episode tells one such story. Cian and his team at Cloudsmith have been adopting Rust in their Python monolith not because they wanted to rewrite everything in Rust, but because Rust extensions were simply best-in-class for the specific performance problems they were trying to solve in their Django application. As they had these initial successes, they gained more confidence in Rust and started using it in more and more areas of their codebase.

About Cloudsmith

Made with love in Belfast and trusted around the world. Cloudsmith is the fully-managed solution for controlling, securing, and distributing software artifacts. They analyze every package, container, and ML model in an organization's supply chain, allow blocking bad packages before they reach developers, and build an ironclad chain of custody.

About Cian Butler

Cian is a Site Reliability Engineer located in Dublin, Ireland. He has been working with Rust for 10 years and has a history of helping companies build reliable and efficient software. He has a BA in Computer Programming from Dublin City University.

Links From The Episode

  • Lee Skillen's blog - The blog of Lee Skillen, Cloudsmith's co-founder and CTO
  • Django - Python on Rails
  • Django Mixins - Great for scaling up, not great for long-term maintenance
  • SBOM - Software Bill of Materials
  • Microservice vs Monolith - Martin Fowler's canonical explanation
  • Jaeger - "Debugger" for microservices
  • PyO3 - Rust-to-Python and Python-to-Rust FFI crate
  • orjson - Pretty fast JSON handling in Python using Rust
  • drf-orjson-renderer - Simple orjson wrapper for Django REST Framework
  • Rust in Python cryptography - Parsing complex data formats is just safer in Rust!
  • jsonschema-py - jsonschema in Python with Rust, mentioned in the PyO3 docs
  • WSGI - Python's standard for HTTP server interfaces
  • uWSGI - An application server providing a WSGI interface
  • rustimport - Simply import Rust files as modules in Python, great for prototyping
  • granian - WSGI/ASGI application server written in Rust with tokio and hyper
  • hyper - HTTP parsing and serialization library for Rust
  • HAProxy - Feature rich reverse proxy with good request queue support
  • nginx - Very common reverse proxy with very nice and readable config
  • locust - Fantastic load-test tool with configuration in Python
  • goose - Locust, but in Rust
  • Podman - Daemonless container engine
  • Docker - Container platform
  • buildx - Docker CLI plugin for extended build capabilities with BuildKit
  • OrbStack - Faster Docker for Desktop alternative
  • Rust in Production: curl with Daniel Stenberg - Talking about hyper's strictness being at odds with curl's permissive design
  • axum - Ergonomic and modular web framework for Rust
  • rocket - Web framework for Rust

Official Links

Transcript

Hello and welcome to Season 6 of Rust in Production, a podcast about companies who use Rust to shape the future of infrastructure. My name is Matthias Endler from corrode, and today I chat with Cian Butler from Cloudsmith about oxidizing Python backends with Rust. Cian, thanks so much for taking the time for the interview today. Can you say a few words about yourself?
Cian
00:00:24
Yep. I'm a performance engineer and SRE at Cloudsmith. I've been doing Rust in some form or another for the last 10 years, mostly as side projects, but I have been doing it professionally for nearly three years now, working at Cloudsmith on the Edge team, where we work on our CDN and all that fun networking stuff. Cloudsmith is a package management company, so we do package management as a SaaS. We support something like 36 different package formats: Node, Cargo, Python, all the big ones. We do public repositories, private repositories, and open source repositories. We're growing pretty fast. We've got some big customers, but I don't know who I can mention, so I won't mention anyone just in case. Because of that, we process about 110 million API requests daily, which equates to petabytes of packages downloaded every day. A lot of that is done in Python right now. We have a very old Django monolith that we've had since day one, which is 10 years ago. It's grown, and as we attempt to scale it, we needed to find new ways to scale it. So we started looking at Rust as a way of making it faster and more efficient.
Matthias
00:01:54
Great. That means the monolith is exactly as old as your Rust experience. So it's 10 years for the monolith and 10 years of Rust for you.
Cian
00:02:08
Yeah, I hadn't even thought about it, but yeah, it's a nice little commonality there.
Matthias
00:02:13
And I could imagine you want to use Cloudsmith in a situation where you have an organization that manages a bunch of packages, maybe packages in different ecosystems, and you want a hosted version of that which is secure and safe, like we're talking about supply chain security. Or are there any other reasons for using Cloudsmith?
Cian
00:02:38
Oh, 100%. Supply chain security is one of those things we're very big on, very focused on. It's not just security, though. You could run multiple different package formats or just one format, and you'd use us as a proxy to your upstreams. So you pull all your packages through Cloudsmith, and that gets you better caching on them, because you get access to our CDN, and then you can apply a security posture on top: don't download any packages that have these vulnerabilities or CVEs published on them. We have decision engines for that kind of tooling. But as well, you might just publish your own packages for internal use. If you are a big company building lots of packages that you use internally for other services, say a logging library with your custom logs, you push it up there, it gets pulled in by all your microservices or CLIs, and they can get built. That's much more the traditional case: people have private packages they don't want to put on the internet, and they don't want the insane tooling of putting all their packages in one repo, so they have private repositories. A lot more of the focus in the industry now is supply chain security, so that's where you see a lot of our development happening right now, in securing different supply chains. I won't say I'm an expert on that side of it; we have people who are a lot smarter about that, who focus on it. I mostly focus on the low-level networking stuff and the data processing side of it all.
Matthias
00:04:15
I realize that you might not have been around, but can you maybe, from conversations with other employees, remember why Python was chosen to start the project in the first place?
Cian
00:04:30
I think it's a comfort situation. We had two founders who started it. Their story is not one I'd be the best person to repeat, so I won't. If you want to look it up, we've definitely done some posts on it; our CTO, Lee Skillen, likes to talk about his history, so look up his blog and his LinkedIn. But the reason we chose Python is just familiarity. It's a really good language. It's powerful in that you can write so much code so easily. It's very business friendly, it's not overly verbose, so it leads to rapid prototyping very quickly. The same is true of Django. Django makes it so easy to spin up a web server, hook it up to a database, and start playing around and getting your proof of concept built. I don't think, I know, that we wouldn't have scaled as fast as we did without Django and Python, because we wouldn't have been able to roll features out as quickly as we did. They definitely helped us scale the company and get to where we are today. Saying that, after 10 years of code being written, I think there's something like 200,000 lines of code, somewhere over 20,000 files in our monolith. That's a lot of code. It's a lot of code that not everyone understands, and we're constantly going back and reading it, trying to figure out how it works. Even this morning I was trying to read a set of mixins, trying to figure out what path a request goes through, as we have multiple different layers of Python code to process it. It adds up over time. So what made it really good for scaling on day one has caught up with us and made it really difficult to understand and handle now. A double-edged sword of Python there.
Matthias
00:06:54
Yeah, and a lot of people might say, let's just remove everything, start from scratch, rewrite it in Rust. But what people forget is that those 200,000 lines of code contain a lot of business logic and a lot of value. I'm assuming right now, please correct me if I'm wrong here, that a lot of the logic is also about handling different package manager formats: file formats, lots of parsing, lots of error handling, and so on. Can you talk a little bit about what's in there? What's the bread and butter for you to make that infrastructure work?
Cian
00:07:31
Yeah, yeah, it's all that kind of stuff. Each package format is its own distinct concept. They all have a lot of similarities under the hood; the data types are all very similar in our infrastructure, but each one has its own idiosyncratic ways of being handled, and its own request flows. The flow for uploading a package, under the hood, is: we take a binary and we store it somewhere. But the handshake you do for that, and the metadata you store, differ for each format. Which means you could go into our code base, into the slash packages folder, and you'll just see 36 different code bases in there that are similar. They have shared bits of code for logging, for metadata processing, for tracking of events used internally, and all that kind of business logic that's shared, but each format is different and their code paths are different. So we could sit down and very quickly scaffold out a brand new service in Go or Rust that hits those same things. But we then have the weird edge case of, how does that interact with our SBOM generation?

And then we need to store that in a way that can be queried by our API to be displayed in our UI. We also need to track all those bytes; you care about how many bytes are being downloaded, so we need to ensure that all that data is being tracked correctly. We're still in that scale-up phase of startup life, so we're hiring, we're bringing on new engineers, but we're still a small enough team. Lee, our CTO, made the joke that one day he's going to wake up and everything's going to be Rust after hiring me. We all laugh and it's funny, but we know it's not really going to happen. We're going to have some core bits that are Rust, but there's still going to be that core Python code that's not changing, because everyone in our shop knows Python. We have a couple of people who know Go. We have me, who knows Rust. We have some people willing to learn, who have tried Rust and Go at different times, but they're not ready to jump in on a project and start developing today or tomorrow.
Matthias
00:10:21
But also, even if you were, let's say, an expert in Go, it would be harder to integrate Go into the project, because Go has its own runtime and a garbage collector. You could do it across the network boundary, but not necessarily by integrating it into the existing project, as you could do with, for example, PyO3.
Cian
00:10:48
100%. We have actually experimented with Go, and that's where it ended up. We moved logic for doing specific things out into a Go microservice previously, nothing core to the business; it was specifically supporting one format and scaling that format. And yeah, it's nice, it works, it's there and it's solid, but it is a separate microservice, and it goes against that belief we have that everything should be in the monolith. This is one of those core tenets we have: we should scale our monolith, we should focus on making sure code is in the monolith.
Matthias
00:11:30
Interesting point, because at least in the last decade or so, monoliths were sort of frowned upon, weirdly enough, and now it feels like the industry is circling back to that idea. Can you maybe explain, from your perspective, what's so great about having a monolith?
Cian
00:11:47
Yeah, yeah. I think I came into the industry when we were just heading for that peak of microservices, or on our way up to it, and I've never liked them. I'm a big hater of them; I've always hated them. Maybe I got cut very early on and I've never recovered from it. But the thing about scaling microservices is that it seems like a really easy thing to do. You can just throw a little service at it and everything works: I just call that service and it gives me a response, and that's great. When you're running one box talking to another box, that does scale pretty nicely, and when you have a small bit of traffic, it scales really nicely, because you have a small bit of traffic. But in the real world it's never that simple. Say you deploy 10 replicas of one service and you have 20 replicas of another service. You need to ensure that you're properly load balancing across those 10 replicas. You need to account for the network delay in your one service as it waits on the other. You start running into issues around managing connection pools and blocking IO resources. This is one of those things that we actually ran into a lot in our monolith: the way Python blocks can be quite problematic, because it doesn't just go to sleep and poll, it can just sit there and wait, and then you have resources that are blocked waiting for that. You need to know how to sleep and pick up more work in the background while you wait on resources to free up. But if you never have to call across that network boundary, if you have all your logic in a monolith, you can avoid the overhead of a network, and you have a much simpler cognitive design that you can account for.
Matthias
00:13:56
I fully agree and also refactoring across microservices is never fun.
Cian
00:14:01
No, no. And this is a problem we're running into; I keep saying problems we're running into. We don't have microservices, but we do have a CDN, and how we roll code out to that CDN versus how it interacts with the monolith is a core part of what we do on the Edge team. You really need to ensure you have that staged rollout: you add a feature in the monolith, then in the CDN so it can start using it, then you enable the feature in the monolith, and then you can remove the old legacy path. So you have that three-step deploy phase, and it's such a hassle, is the only way to say it, remembering all of that. And if you don't do it, you end up with all these dead code paths, which we have; we have hundreds of lines of dead code paths in our edge because we just didn't go back and clean them up.
Matthias
00:15:00
Yeah, it's such a bespoke process to make releases across microservices, all the ceremony: adding a feature, but also putting it behind a feature flag, making sure that the other service is bumped up to the correct version, and then slowly migrating over. Whereas if you have a monolith, you can just make all of those changes in one pull request, then review all of those changes, and your debugger still works, and your linter still works, and all of those niceties.
Cian
00:15:39
Yep. And I think the debugger still working is one of the nicest ones as well. I'm not a big debugger fan myself, but I know that a lot of people in our industry love debuggers. And it's the fact that you don't have to pull out something like Jaeger or Datadog to do that debugging because you're calling across different services. Tracing is great; I love tracing tooling, all the OpenTelemetry kind of stuff. But when you need to run a dedicated OpenTelemetry stack to debug one simple request, that's a lot of overkill on my laptop. And I have a nice laptop, but I don't know that I need to be running a data center on my laptop just to do a little bit of debugging.
Matthias
00:16:23
Coming back to Rust, because that's kind of what I want to talk about. It's nicer in Rust. Yes, you can integrate Rust with PyO3, but I'm not sure how that process went for you. Did you even use PyO3 for that work, or did you decide on doing it a different way?
Cian
00:16:47
So let's step back a little. I came into Cloudsmith last year as a performance engineer. We had decided as a company that we wanted to focus on building and scaling, and it was known that I was a Rust developer coming from a Rust shop, so there was a known likelihood that I was probably going to write some Rust at some point. But we didn't sit down and say, how can we bring Rust in to scale this service? I sat down and just started looking at those traces. I started looking at Datadog, started looking at where the bottlenecks in our service were. We had load tests running; we were getting information back about what was slow, what our slowest endpoints were, all that kind of stuff. The things that came out when you looked at that data were that we would sit waiting on IO, and serialization. These were two of our biggest problems. The IO was two different types of IO. One is our database: we query the database a lot, probably too much, and it eats up a lot of resources. The other side is the network. We call out to upstreams like PyPI and Cargo to pull in information, and then we have the inbound requests, requests from our customers to us, and how many requests per second we can pull in from the network and process concurrently. The other bit is serialization: serializing large JSON payloads, large XML payloads, that kind of stuff. So we sat down and said, how can we go about fixing this? And it wasn't a one-shot of, we need to fix it all at once, or switch everything up at once, or build it ourselves.

We try not to be a shop that suffers from not-invented-here syndrome. We like to use open source software where possible, or SaaSes where possible, because there are only so many people we have. So I started Googling, because I already knew a solution to the JSON serialization. Two jobs ago, back when I worked in video games, we had a very large logging pipeline where we would serialize everything to JSON across the whole fleet. We were also a Python shop, and I was working on the metrics team, and we rolled out a logging change that switched how we serialized JSON in all of our microservices with a Rust library called orjson.
Matthias
00:19:51
Oh, yeah.
Cian
00:19:52
It's a great library. Well, it's a Rust library and a Python library. It's written in Rust, and it's got nice Python bindings that look similar enough to the normal Python JSON bindings. So I knew from then that the speedup varies somewhere between 7 and 10x, depending on what you're doing and what your data looks like. And I know that when we did the change at that company, I saw about a one to two percent change in CPU usage across our data center over a couple of weeks; it takes time for changes to go out, but we definitely saw improvements, and at that scale it was really important. Those small gains really add up over time. So I reached for that library because I'd had such success with it before, and when we went to reach for it, it turned out Django REST Framework already had a wrapper; it was even easier than that. So we installed the drf-orjson-renderer library, and it swapped out our JSON serialization, which was just the normal Python JSON serialization, for a Rust-based one. We then had to go through the code base and find every place we imported json and replace it with orjson. And we did that in incremental steps; we didn't flip the switch all at once. We updated the call sites one at a time, until I think one day I just got very bored, sat down on a train ride, and banged through every single one. I grabbed everywhere we imported the json library and iterated through those files, making sure they were all correct.
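For readers who want to see what that drop-in swap looks like, here is a minimal sketch. It only assumes orjson may or may not be installed; the one visible API difference is that `orjson.dumps()` returns `bytes` rather than `str`:

```python
import json

# Hedged sketch of the orjson swap: same call shape as the stdlib json
# module, with a fallback when orjson is not installed. Note the one
# visible difference: orjson.dumps() returns bytes, not str.
try:
    import orjson

    def dumps(obj) -> bytes:
        return orjson.dumps(obj)

    def loads(data):
        return orjson.loads(data)
except ImportError:
    def dumps(obj) -> bytes:
        # Encode to bytes so both code paths have the same return type.
        return json.dumps(obj).encode("utf-8")

    def loads(data):
        return json.loads(data)

payload = {"package": "demo-pkg", "version": "1.2.3"}
assert loads(dumps(payload)) == payload
```

For simple call sites, the swap can be as small as changing `import json` to `import orjson as json`, plus handling the `bytes` return value wherever a `str` was expected.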
Matthias
00:21:58
It's always the train rides, right?
Cian
00:22:00
Yeah, it's...
Matthias
00:22:01
Always. That's when we get the work done. But...
Cian
00:22:03
Yeah, also...
Matthias
00:22:04
What I find particularly interesting about that story is: if you didn't know it was written in Rust, you might not even have cared, because it was yet another Python package that you just integrate into your workflow. And it was a drop-in replacement. But I wonder how many organizations out there run Rust without even knowing it this way, because orjson happens to be written in Rust.
Cian
00:22:31
I think there are probably so many places. If you asked my previous employer if they run Rust, they would say, nope, we have no Rust, and I know for a fact there's Rust in every service, because I put it there through that Python library. And I think that's actually a nice thing. It's also true of the cryptography library in Python; it's Rust-based now. There's Rust in Linux now; Rust is everywhere, it's getting rolled out everywhere. But it's those nice places like orjson where someone has sat down and said: how can I make this faster without breaking the API, or in such a way that it doesn't take a massive lift to switch it out?
Matthias
00:23:22
Yes. But also, as some sort of counterargument, someone might listen to this and think: well, JSON is a nice, easy interface to integrate with, because there's a nice API surface. But how often does it happen in practice that you can just use a drop-in replacement? What would you say to that?
Cian
00:23:44
Yeah, not as often as I would like. It's totally not as often as I'd like. I've talked about this before in previous talks at FOSDEM about our experience with it. We switched orjson first, and it worked great. Well, it worked great except that one customer broke, because they were parsing JSON with bash and grep and sed and all those things. Don't do that. It's bad. They realized it was bad and they moved on. So orjson: great drop-in replacement. After the success of orjson, I knew PyO3 existed, so I sat down and asked, where could PyO3 be used next? The one I wanted to look at was XML serialization, not parsing. We have to serialize very large XML payloads, so I was interested in seeing how I could come up with a more efficient way of doing this for our use case using Rust and PyO3. But I got distracted when I went onto PyO3's docs and noticed that they mentioned a jsonschema library, and I thought, oh cool, I wonder if this is faster than our jsonschema library. So I went to look at our usage of the jsonschema library and found out we were already using the Rust one, but we were also using the Python one. We had both installed, and we were using them both in different parts of the code. And I just looked at myself going: what happened here? Did someone just not look at our dependencies and ask, do we have a jsonschema library already? Or were we planning to do the migration?
Matthias
00:25:32
My suggestion would be to use Cloudsmith because they handle package management for you. And this is how you could avoid the problem.
Cian
00:25:41
Yes, yeah, totally. Well, I think you could at least catch it sooner; maybe we wouldn't have been running the two things for so long. But saying that, it gave us another opportunity to continue the rollout of switching to Rust, because we clearly knew it worked for us. We'd had success already. So all I needed to do for that one was, again, just switch the imports everywhere and remove the pure Python implementation. And we rolled it out, and it was smooth as butter. I didn't change any code; I just changed the import statements. So there definitely is that ability to do those drop-in replacements that work so well.
Matthias
00:26:25
What I find cool about that story is that these initial quick wins gave you a lot of confidence in integrating Rust into the stack without really requiring a lot of backing from the entire organization. You can just go step by step, and you can see the success right away. But then eventually you might have hit a wall where this is no longer possible, because all of the quick wins are gone. So I wonder how you transitioned from there to maybe introducing more Rust, because, well, obviously it was kind of a success.
Cian
00:27:04
Yeah, like I said, I was playing around with PyO3 and different ideas, and when I started looking at our bottleneck for the network, I started thinking about how we manage work in the service. The way our request model worked was: we were using WSGI, W-S-G-I, and effectively, when requests come in, we hand them to a Python worker, which runs the request to completion and then hands the response back. Rust developers might look at that and see that the model is very similar to a Tokio service, and that was my instant thought about it. I looked at it and said, that looks like a Tokio service that has one event loop that does some processing, hands it off to a background task, waits for the task to complete, gets the results back onto the main event loop, and throws them back over the wire. Of course, it doesn't use serialization to bytes or any of that kind of stuff, but it looks like it. One of the bottlenecks I found was that we were wasting cycles doing work for connections that had already closed.
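The request model described here can be sketched with the standard library's `wsgiref` helpers (a toy app, not Cloudsmith's code): the server calls the app, the app runs to completion, and only then does a response come back. Nothing in that contract lets the server cancel the handler for a connection that has already closed:

```python
from wsgiref.util import setup_testing_defaults

def app(environ, start_response):
    # A synchronous WSGI handler: it runs to completion no matter what,
    # even if the client hung up while the request sat in a queue.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"package metadata"]

# Drive the app the way a WSGI server would, using a synthetic environ.
environ = {}
setup_testing_defaults(environ)

captured = {}

def start_response(status, headers):
    captured["status"] = status

body = b"".join(app(environ, start_response))
print(captured["status"], body)  # 200 OK b'package metadata'
```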
Matthias
00:28:18
Oh, wow.
Cian
00:28:19
Yeah.
Matthias
00:28:19
Why is that?
Cian
00:28:21
It's a little to do with our queuing model and a little to do with request management in uWSGI, the server we were using. Effectively, if a request had sat in the queue for too long, it would still be handed over to uWSGI. uWSGI would process it, and it would time out in the upstream because it had been processing for longer than a minute, but there's no way to cancel the request once it's in flow. We would still benefit from it, because we do all the work and cache the result. The request would have been retried and would be back in the queue, and by the time it got to the front of the queue, all its results were cached. So it was a nasty flow, but we had kind of optimized for it. Still, I thought to myself, this feels insane. Most of the time the result is going to be cached, or the request is going to be re-driven. Someone recently described some of our clients as some of the best and worst clients in the world, because they're designed for public infrastructure, since they're all package management clients: they have a lot of retries, but they have a lot of weird formats. So we know a lot of things are going to be retried and attempted again. But it's also not a perfect cache, because some of our caches are in memory and some of them are in memcache. Things that were in memcache, those were quick. But if it was in an in-memory cache, unless you hit the exact same node again, that in-memory cache is useless, and like I said, we're running lots of replicas, so there's no real guarantee of that.
Matthias
00:30:06
That's a thing I've heard a couple of times already: if you think about a highly performant service that does not waste a lot of CPU cycles, then you need fewer instances, which means you have higher cache locality. If you have a service that is not as fast, you need more instances, so you lose the ability to have things in your in-memory cache. So that's kind of another way in which more performant languages, or more performant code, is effective
Cian
00:30:42
And.
Matthias
00:30:43
Helps with performance.
Cian
00:30:44
Yeah. I'm a big believer that in-memory caches are only good when you can have a small footprint, because they effectively build up in that small footprint. And if you need lots of replicas for whatever reason, be that budgetary, or a limit of having only one CPU mapped to a process or something like that, you end up with these very disparate caches that hold different information, and your load ends up going all over the place.
Matthias
00:31:19
But wouldn't you have been able to query the in-memory cache and then, if that fails, go to memcache right away?
Cian
00:31:28
Yes, you would think that. But the issue isn't that we have one caching mechanism; it's that we have different caching mechanisms. We were using a Python caching library for the in-memory cache, and then we were using memcache with our database to cache responses from the database. So these are actually two different caches. The memcache one is just: could we stop ourselves from going to the database? And we would totally check that on every request. So if we had done a very expensive DB query, it should be in that memcache, and on the retry it would come from memcache. What wasn't being cached were those pure functions we were running inside the monolith that were in the Python cache.
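The "pure functions in the Python cache" part can be illustrated with `functools.lru_cache` (an assumed stand-in; the transcript doesn't name the exact library Cloudsmith used). The cache lives inside one process, so a retried request that lands on a different replica gets no benefit from it:

```python
from functools import lru_cache

call_count = 0  # tracks how often the expensive work actually runs

@lru_cache(maxsize=1024)
def parse_version(spec: str) -> tuple:
    # Stand-in for an expensive pure function memoized in process memory.
    global call_count
    call_count += 1
    return tuple(int(part) for part in spec.split("."))

assert parse_version("1.2.3") == (1, 2, 3)
assert parse_version("1.2.3") == (1, 2, 3)  # second call hits the cache
assert call_count == 1  # the work ran only once, in this process
```

A second replica running the same code starts with an empty cache, which is why a fleet of many small replicas dilutes this kind of caching.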
Matthias
00:32:11
Got it.
Cian
00:32:12
Yeah.
Matthias
00:32:12
So the new bottleneck right now is between the network layer, which was your uWSGI, and the Django monolith. That's where you lose a lot of the performance now.
Cian
00:32:27
Yeah. And my goal, something we're still working on, was that I wanted to be able to do request cancellation. I wanted to be able to say: that's timed out upstream, I want to cancel it. Something I had previously done in a Tokio service, so I was like, totally, let's do this. So I sat down to try to figure out how I could map a Tokio-style request-managed service onto our WSGI app. I was reading the PyO3 docs, and I was playing around with a library called rustimport, which lets you very quickly write PyO3 bindings for your Rust libraries. You can get very rough-and-ready code in 20 lines with some macros, and you get this very rough importing of Rust code directly into your Python code without a lot of overhead. Great for prototyping. I had found some places where I thought, I would probably change this if I wanted to bring it to prod, and just use PyO3 to create the interface exactly as I wanted. But it was definitely great for prototyping. Saying that, while prototyping I started looking at prior art, and I found someone had had this idea already. The best thing about open source is that sometimes you go and look and ask, has someone already had this idea? And more often than not, someone has.
Matthias
00:34:07
Yeah and also you could have gone and completely ignored that and not have done any more research and you would have that liability on your side whereas now you looked at prior art as you said and you found a thing that someone else worked on before so that also shows that you took a very level-headed approach to that.
Cian
00:34:33
Yeah, 100%. The project we found, it was called Granian, or I might mispronounce it a handful of times because I got so used to call it Granian at one point. But it's effectively a replacement for that WSGI service we were using that is written 100% in Rust. It's a Tokyo event loop that hands off to Python processes for doing the actual processing of the code. So all your business logic runs there and it just ensures that all the network logic is done inside Rust. This was really cool for me because I was like, cool, here's a project that does exactly what I wanted to do. And I started reading the code and I learned that the concept of request cancellation, the thing I was doing all of this to was not possible in new WSGI at all like there was never going to be a chance of doing it in WSGI because it's just not supported by a protocol Gradian does only supports it if you're using ASGI which is an async version of SGI that's more like a traditional event loop style of async await mm-hmm.
Matthias
00:35:50
Similar to an io_uring or so.
Cian
00:35:52
Exactly.
Matthias
00:35:52
Completion-based.
Cian
00:35:54
It's exactly the same kind of design. And you get to reuse all that kind of code that's designed for those IOU loops. But we sat down and we'd already started looking at it. So it saved me a lot of time in that concept of prior art walking down paths that we could have lost so much time if I had spent working on it. But it did have a feature that I loved, and that was it had a built-in queue for managing the requests. So right now, to this day and at the time, we were running HAProxy in front of uWSGI to allow us to scale. HA Proxy was effectively doing the queuing for us, managing work in a queue, and then handing it off to a uWSGI process that would hand it off to a Django process and do the request. And for reasons that elude me of why an engineer decided to do this, we also are running an Nginx in front of the HA Proxy to do very light routing control and optimizations. Nothing that couldn't have been done in HAProxy, but it was just being done in Nginx for some reason. And there was a ticket on a backlog for years of merge HAProxy and Nginx together and just have HAProxy.
Matthias
00:37:26
It's interesting that you make that decision. One could have made the decision to go with Nginx. Personally I find the nginx config to be easier to read and write in comparison to the HAproxy config maybe that was the reason for nginx you're.
Cian
00:37:44
Probably right it's like I, nginx is a really nice config it's super like readable and simple and probably of all the tools i'm going to talk about it's just it's the easiest to work with and was pretty bulletproof and in doing some amazing things for us but the ha proxy one was i think ha proxy is just a better queuing tool or at least my experience of using of doing request management in ha proxy has been better, but i think when no one knew which way we wanted to actually go the idea of like let's replace them both let's replace one with the other was this was the idea and when i found granian i looked at it and said oh this can not only replace our WSGI management interface but it can also replace ha-proxy because it can do that queuing internally and it has dials for tuning that queuing as we needed it to work so it there was and we also had this intense dislike of the of uWSGI because uWSGI is quite difficult to tune uWSGI being the tool we use for managing WSGI requests. So... So I started chatting to a principal and said, have a look at this. What do you think about this? And I got the thumbs up of, ah, sure, let's try it out and see what happens, which is a very Irish way of going, let's run a load test and see how it performs. So we threw up a version into our load test environment that replaced uWSGI with Granian. Granian. Granian.
Matthias
00:39:37
One of these.
Cian
00:39:38
One of these two. So we threw up a version into our load test environment that replaced uWSGI with Granian and we began load testing it. We just started throwing lots of different types of requests. We have some nice load testing tooling that simulates some request flows. So we just had it run. and the numbers we got back were marginally better it wasn't like a night and day like oh my god this thing is going to save us we found the savior of scaling no nothing like that but what it did say was it changed the numbers in our in our percent in our p50s and our p90s our p50s went down and our p90s went up which meant we just had a lot more outliers and our Averages were better, which was enough of a signal for us to sit down and go, there's something here. Don't, it could just be a better tool for us to be able to tune. It could just be more cues is helping us scale in some way or another. But it was definitely, it was a signal that we said, we need to test a little bit more with this. This isn't something we need to just walk away from.
Matthias
00:40:53
Right, because if you see that your P50 is better, that means the outliers are now more prominent. So there might be things in your business logic or timeouts with upstream, which mean that they drive up the P50. P90 or p95 signal but overall this is also a thing that you see a lot with replacing code with faster code on on the back end side is if you do it right then the outliers become more prominent.
Cian
00:41:28
Yeah no 100 we were definitely seeing that where it was these very slow paths that were blocking us were still the slow ones. But the very quick paths, they just became quicker. And there's a lot of differences in how uWSGI and Gradian were configured in those early load tests that I now know were silently masking different things about. They were handling switching contexts differently, how tread management worked. So the memory footprint was, little more stable in one while it correlated to workload better in the other that's got good and bad it meant that previously we would have like the memory which and cpu would stay flat but now like as requests went up you could actually see the cpu was going up and down because we were doing more work and we're like that's a good signal for us scaling now we could use that to do some where previously we couldn't do that auto-scaling.
Matthias
00:42:40
Yeah, because you could never go down to zero.
Cian
00:42:43
Exactly, yeah. So we sat down and we drew up a testing scenario, like some numbers we wanted to see, some testing we wanted to do. Which parts of the stack could we try removing now that we just, and could we just replace it with Gradian? So we did a lot of different load tests to the point we actually managed to bottleneck in the load test tooling. We hadn't scaled the load test tooling up high enough that it could push enough throughput in one of our tests that we needed to step back and change the load test tooling out. We were previously using Locust, which is a fantastic load test tool where you write your load test in Python, and then you spin up lots of Python workers that are managed and it does the load test from different places. But those workers were becoming our bottleneck. So, well, they're not really a bottleneck. How much money we were willing to spend on those workers became the bottleneck. Like how many workers could you spin up for a load test was the bottleneck. So we switched out for a tool called Goose, which was a reimagining of that in Rust. Managed to push the same amount of workers, we were able to push more requests, like, I think 100 or 1,000 X more requests per worker, which meant that bottleneck was out the window.
Matthias
00:44:10
It's somewhat funny that in the process of oxidization, you also have to swap out the load testing tool.
Cian
00:44:19
I think that was the biggest signal of we can push more was when we had to swap out the load testing tool because that was what was being saturated. Yeah. Yeah, and it was really good. At the end of it all, we had a test scenario that showed we were able to push about 2x, per compute resources than we previously were. And there's a lot of reasons for that. One is we were running less intermediate services. We weren't running Nginx after this. We weren't running HAProxy. And Gradian was effectively doing all of that for us in a nice Rust event loop and handing it off to background processes in Python. And the Python was that original P50 gain was adding up along with all these less resources having to be run.
Matthias
00:45:15
It's great because yeah as a first step you could say you handled twice the load which means you could have half the servers if you wanted to but then on top of it you have better memory locality now so maybe you even need less cache servers if you had those and on top of it even before the request even hits your monolith you can also optimize a lot because now you don't need nginx and ha proxy you could replace all of that with one service and.
Cian
00:45:51
I think that was the biggest one for us management saw that i what we're saying we could squeeze more requests out of what we're already paying that's we could scale We said we could scale down, but we knew scaling down was not going to be what we were going to do. We're signing customers on every day. We're growing every day. We're scaling up. So the idea of scaling down, of compressing the amount of work we can do in compute is big for us we it got us the time to experiment more and continue our testing and see what's next for us what's what can we improve and.
Matthias
00:46:29
You need that time because i'm assuming that there are differences between the old stack and the new stack especially if you deal with a lot of real world http traffic.
Cian
00:46:41
Yeah yeah there was two big differences for us that caused two annoying outages for us as well, the one that's gonna is burned into my brain was to do with docker we we have so docker has a lot of interesting clients is the best way i can describe it and it's a it's a standard of how you do stuff But every client can kind of implement, do the implementation slightly differently and handles edge cases slightly differently than each other. So for scaling reasons of our cdn we would often respond with trio sevens and say and say the resources over in this other location for storage go get it and and you download it yourself rather than me downloading it for you and handing it off like you don't want to be you don't want a python service doing a download and sending it back over the wire you want something that's built to scale and serve those requests. So it's our CDN out of the edge.
Matthias
00:47:48
The Docker clients that you meant are things like the implementations of things on your local machine. Like if you do Docker pull or you use Podman or...
Cian
00:48:00
No, yeah, it's... When I say Docker clients, I mean Podman versus Docker versus BuildX versus...
Matthias
00:48:09
OrbStack.
Cian
00:48:09
OrbStack, yeah. And there's hundreds more multiple... You work in a company, you'll be running different versions and different developer machines sometimes. And you'll be, so one developer is doing one thing and that could be different to prod because you're not running in prod, you're actually running Kubernetes, which is different again to Docker. Like the Docker clients are all different and unique and there's many of them with different edge cases. right.
Matthias
00:48:36
So back to your story we were at a point where you don't want to handle the requests for the clients instead you tell them look elsewhere for the resource that you're trying to pull yeah.
Cian
00:48:48
Exactly so we'd give them a nice 307 to our cdn location and they respect it and they pull it it's it's part of the protocol that they can do that but for reasons that are very legacy and to do it how go implemented it's a first HTTP client they were accepting bodies they accepted trio sevens with content lengths that were not zero and because of that they would have the docker client for some reason used that first content length it saw as the metadata as the content size of the image it was eventually going to be and it would so it would look at the response and say cool i have a content length of two of 200 megabytes going to put that in the met in the metadata for my for my docker my eventual docker image so it then follows the 307 and goes and grabs all the other layers and it says, and then it signs it and says, here you go, this is your built image. The issue came in when we were getting, so we're sending back this 307 saying it's got a content-like length of 200 megabytes, let's say. The ALB we were using, the load balancer we were using, started to have errors on this. It started saying, nope, that's an invalid request. I don't remember exactly what error it started returning, but it started throwing random errors that were not the correct error as well. So it was processing something internally and it broke its serialization. It's kind of scary when I kind of start saying it internally because these were SaaS products. We didn't have proper logs for them. We just had metrics of error rates going up and down. So we sat down, started digging in, and we managed to map the error rates to the Docker requests. And we decided we needed to flip some, we needed to move some stuff around and try some stuff out. So we started encoding, we said, oh, this is encoding 307s as, it's saying these 307s have a content length of 200 megabytes or whatever the eventual image size is going to be. Mm-hmm. Let's not do that. That's what's breaking this. 
Let's respond with actual valid HTTP and say the content length is zero. So we did that and Docker freaked out. It started, well, actually we ran tests and they were working. Like, we're like, great, our end-to-end tests are still working in this, this is fine. And then one of our developers came in and said, hey, I can't get my local dev to start. So we started debugging it. and it turned out that their local dev was getting the wrong metadata. And my local dev was working completely fine. And that's where it became really weird. I was using BuildX and they were not using BuildX for building their Docker images and running their Docker images. And that's when we realized it was very specific clients were doing stuff differently. Some of them were checking the metadata data from the header, and some of them were doing the maths themselves and putting it in there. We rolled back the change of the header, and we moved the logic around. We moved the validation of the content length out to the edge network so we could do some like after our load balancers had done all their work and hyper had changed. Nginx was just handling that 307 completely differently and arguably incorrectly it was doing it was massaging it into a way that the load balancer was accepting it yeah and we needed to work around all of all of those kind of weird edge cases that we had previously just got nginx working on nginx was just doing stuff in we moved it out to our cdn layer our so our request processing was at the edge then and it works like once you got once you move those things around you can see the that it does work but like there's so many weird edge cases in in hp that i that i can't like say this is a drop in replacement yeah it's one of those you really have to test them.
Matthias
00:53:32
Yeah, I remember that in one of our earlier episodes with the maintainer of cURL, Daniel Stenberg, he mentioned a very similar problem, which is that Hyper was very strict about certain ways HTTP traffic should be handled. And cURL needs to be extremely permissive because people expect it. That's kind of the API of the command line tool. And he needed people to go in and either soften the edges on hyper or make parts of that transition layer a bit more permissive on the cURL side. But that was a tough job for them. And eventually they removed the Rust backend because of that. So because they couldn't fix it or there were not enough people who wanted to put in the work. And you hit this because, I just want to reemphasize that, you hit that because Granian, the WSGI server, uses hyper correct.
Cian
00:54:34
Yeah it's exactly it's the exact same stuff hyper was doing everything technically correct the the fun sentence of everything is technically correct granian uses hyper and tokyo and pyotree it's just core libraries so it it was using hyper to serialized a response and it was just and.
Matthias
00:54:59
Do you believe that we need more permissive libraries more permissive rust crates for real world hdp usage or other areas where things have historically grown to make those you know rust adaptations easier for people, Or would you rather say, well, no, instead we should work with better standards and maybe fix our code? Yeah.
Cian
00:55:30
As I've noted, I work with some of the best and worst clients. They do retries, they expect really good responses, but I don't own the API contract on them. I have to just follow the API contract. I would love to say that we as an industry should be following the standards being so strict to them and I can totally see that if I look back at me five years ago I would be there shouting no no follow the standards we should make everyone who doesn't follow the standards feel the pain, the issue is there that's a lot of people that's a lot of pain and it's not something you can fix overnight like i think we i know because i work in a package company a lot of people run a lot of different versions of the same software so even if like we started making tools stricter every everyone on december on february 28th decided to do one launch where everything switched to strict mode the in every library we then have to get that rolled out to every version of that software, it's not going to be, it's going to be a painful rollout. You need to have a level of permissiveness in the clients. Saying that, I don't want the default to be permissive. The default should be perfect. It should be the best way a client should run. The client should have timeouts. It should have sane defaults and should follow the standard. But when you run a legacy system, you're going to have a lot of weird legacy issues. And you need to be able to flip those switches off to mean that you can enable these things.
Matthias
00:57:13
Otherwise.
Cian
00:57:14
You're going to end up with a lot of duct tape around your very strict system to flip those switches off.
Matthias
00:57:20
Yeah be very strict initially and then lower the guard yeah exactly now when you look back on the project what would you say were your key learnings i'm talking about things that you would have done differently but also things where you believe rust is a good fit, how did that project go? Maybe you can summarize it in a few sentences.
Cian
00:57:46
The project could have gone a lot better. It's still underway. We're using it in specific environments now. We haven't rolled out 100% everywhere because of these weird edge cases we found with Docker. And the other issue we found was about connection management to our database. It's a big problem. We need to do some upgrades, which means we've held off and we haven't got there. That's and that was the biggest things about the project that was the unknown unknowns we sat we sat down and i keep saying we there was maybe me a principal to review my work and a manager to like sign off on it and and set out like would you we'd leverage our end-to-end test to do stuff. We'd use our load tests to validate our request throughput and that kind of stuff. But we never had a plan. And we had rollback and rollout plans. We had rollout plans that were like, well, canary in lower environments, raise them up. We'll do it in off regions in quiet times. Following the SRE handbook of how do you roll out changes safely. But we we had issues with like that that business logic like at the start of all this we started pulling in rust tools to speed up python because we didn't want to do a full rewrite for many different reasons we wanted to use. 
Small bits of rust in our stack to speed it up or small bits of sea as well if that was not if that was going to be there as well we were very much just looking for faster ways to do what we were currently doing but what we were currently doing wasn't well wasn't understood enough by myself and, others because we have that 10 years of legacy there's edge cases where the person who worked done it has come and gone or that same person has come and gone to the company three times he's my principal and he's like this is kind of reminding me of an outage i had five years ago and he and he's trying to remember it and we're trying to fix it these are the things i would have loved to know beforehand i would have loved to have known that we were going to run into these, weird edge cases and i don't know how i would have known how i would have got there how much more time researching could we have come up with how much more testing could we have done were these things we're only going to find in prod probably but i wish there was better we we had a better way of validating these things like a better test suite for like hp testing better test suite for different clients.
Matthias
01:00:53
I guess the question I would ask to myself is, had I known all of these things before would I have made any difference? Would I have made a different choice? And maybe the outcome was kind of still worth it?
Cian
01:01:08
Yeah. No, we retroed. Like I said, we're still in process, but we do regular retros. And that question came up. Was this the right choice? Should we have made a different choice? And I said, and we all agreed this is the right choice. There is something here that is worth testing. It's worth using. If we're not moving as fast as we want to and we've just introduced a new thing, we still know where we're going. We all agreed on a roadmap. The roadmap was just a lot longer than we thought we'd agreed upon. But it's still worth it. The speed increases we're seeing and they're totally worth it.
Matthias
01:01:54
Now, looking back at your 10-year of Rust experience, three of them professionally how has your perception of rust changed over time remember that maybe when you started you might have been enthusiastic about the language just trying to explore what's there but now that you use it professionally what would you say has shaped your perception on rust in in the last couple years.
Cian
01:02:29
Rust has changed a lot over 10 years. Like, I can remember a time when you'd get a clippy warning that would tell you, don't do that, do this. You'd do that, and it would produce a different clippy warning. And you could be 10 clippy warnings deep before you had the working code. Rust today is a lot different than that was. Like, you get one clippy warning, and then you're fixed. Or maybe you get one clippy warning, and your fix is ever so slightly different because you didn't turn on pedantic mode or something like that. But like rust is a lot friendlier now than it used to be and when i started writing rust i was very much just looking for at the cool hip language i think i think i first found rust out of fosdom going like going full circle in my life.
Matthias
01:03:16
Me too, by the way.
Cian
01:03:18
Yeah it's like Mozilla was so big on it and it seemed so interesting and I had just come off learning Go and I had and Google, was going Go is great, Go is great and I was talking to SREs who were like this Go thing seems really cool but I was like I was just there was some idiosyncratic things about Go that I was never a big fan of so I so that's why I started learning Rust and I, It's, I think we, it's got some rough edges still that are not fully sanded out or fully well. The story's not there yet. Like, when to choose a framework is still like an interesting problem you have in Rust. You have the issue of, do you use Hyper or do you use Axum or do you use Rocket? And I'm not even sure is Rocket still a thing. Like, I remember when that came out and it was really good, but I've never used it professionally. I think I've always reached for Axum and Hyper professionally. Because you the smaller projects don't move as fast in the right in sometimes but saying that hyper hyper moves very slowly hyper was only went v1 a year and a bit ago it took a long time to get to v1 and v1 was a big change as well so that was a that switch was like was almost a full rewrite of services and i think that's the thing i appreciate though about rust we took our time to get to get an api that was going to be stable that wasn't going to change a lot and was and it's going and you can work against but when you look but how many of the projects have never hit v1 is scary i look at my my cargo lock file or my cargo toml and a lot of my projects are still 0.8 0.7 0.1239 like these values that i'm like they could break at any time but i need to keep track of these things because i work at a package management company where people need to track we We need to track stable and secure versions constantly.
Matthias
01:05:43
I would say... Yes, a lot of versions are still not 1.0, or there are a lot of unstable crates out there. But at least from my experience, they break less often than in other ecosystems. Even though there might be a feature release bump or so, rarely do I need to go in and make any bigger sweeping changes. It's rather just minor things, or sometimes I'm not even affected by that. And so like i don't want to devalue your point but it's just to give more people some perspective maybe people that don't work with rust a lot it it might not be the biggest problem right now in the ecosystem.
Cian
01:06:24
No i i'd say i don't think it's the biggest problem in the ecosystem right now i think it's i think it's one of those problems that is a bit of a perception problem and it's very much one that you might see newbies might interrupt might feel a lot more like i i say that i said that i'm like bumping crates and i i'm maybe being apprehensive but like i have the same experience you have i rarely have to go in and actually change an api i i don't i did a very big bump on some stuff recently and we're using a new framework for we're testing a new framework for writing parts of our edge code and rust and it went from version 0.6 to 0.7 and i think it just added a lot of optional args to a lot of stuff so we had to try to read the docs and add those optional arcs that was that was not a big change it took maybe an hour of my time to just do that and that was fine and at saying that it came with improvements it came with cache improvements and all that kind of stuff so taking in those changes was good it was obviously feature changes is not bug fixes and that. So I'm like happy to take that stuff in. But when I... We're trying out more Rust and I'm bringing more people in to look at Rust. They, who are coming from a Python world and coming from different worlds. And they look at a lock file and they say, why are none of these things stable? I have to have that conversation with them about why we're still using pre-release software and why it might be years before that pre-release software comes in. And I don't think it's a problem you need to fix, but maybe it's a problem of education. And how do we talk about the v0 of packages to make people understand that this is, should this be production or should this not be production? It's not, a v1 isn't a signal that this should be production or not. It's just a signal of stability of the API.
Matthias
01:08:30
Do you think you will use Rust in 10 years?
Cian
01:08:33
I hope so. Like, there's an answer of, I hope so. I think languages change a lot, and the language ecosystem change a lot. I didn't think 10 years ago I'd be still writing Python or JavaScript, but I'm still writing Python and JavaScript. But you look at them, and they're a lot different to the Python and JavaScript you wrote 10 years ago. So I think Rust is here to stay. It's, I said earlier, it's in the Linux kernel now. It's in low-level libraries for Python. It's in UV. It's in ty. It's becoming a core part of our industry. But how will I be writing it? Or will someone else be writing it? I don't know. Maybe we'll have got to a point where we have saturated the amount of rust we need to write. And we can use... Higher level tooling built on top of that rust could we have a language that's less verbose than rust that is gives us the same memory safety could we take the lessons we learned from the borrow checker and apply that to an a language that looks something like a python for business logic and call in and out of it and maybe that's better for us maybe that's actually what i want is a language that takes all the learnings from Rust and takes the stability from Rust, but is a little friendlier for newcomers or a little easier for people fresh, for graduates fresh out of college to get started with without feeling like they're writing a systems language. Because that's something you always hear. Rust is a systems language. It's for systems programming. It's for systems problems, which isn't true. You can write anything. you. Rust is a language. It's a tool. You can do whatever you want with that tool. I've written business APIs in it. I've written load balancers in it. I've written CLIs in it. It's great for all of those things. And we've learned a lot from it that we could apply to other places. So will I be writing Rust? I hope so. Will everyone be writing Rust? Probably not. Will there be a new language that hopefully isn't inspired by Rust? Probably. 
Will there be a new language? Definitely.
Matthias
01:11:00
And finally what's your message to the rust community as a whole.
Cian
01:11:05
I think it's gonna start with a thanks because i wouldn't be doing what i enjoy right now without the rust community like they that knowledge sharing the rust books people willing to talk about it and whenever someone's willing to talk about it it's always that very enthusiastic talking about it like i was lucky enough at for them to go for dinner with a lot of the other speakers of for the rust room the enthusiasm people have about their projects and not the language it's great and people are always willing to have a very open conversation and talk about different things lessons learned and all that kind of stuff and i gotta say that's such a great thing and we need to keep that so it's so important i don't think i would have got into rust without that because it's what led it's But reading the Rust book is what let me learn Rust. It's such a nice way to learn. And I think we have to keep focusing on ways to make it easy to get new people into learning the language, to make it a better language, and to make people not think of it as a fad or a systems programming language. We have to focus on that path for beginners. Tools like Clippy have done massive improvements there. like that it's it's more than just a linter it's a tool for helping you learn how to write good an idiomatic rust like i and when we focus on tooling that's natural to humans i think we just come up with a better language and i think we have to keep that in mind when we develop rust is it's tooling to make you as a human enjoy writing rust and make sure it's not a pain Where.
Matthias
01:12:50
Can people learn more about Cloudsmith?
Cian
01:12:53
So cloudsmith.com is our website. You can, if you want to use Cloudsmith or think that you need better package management, check it out. If you are interested in joining us, we are always hiring. My team is experimenting with Rust. So if you're a Rust developer and want to write some Rust in production, reach out, reach out to me. I'll get my email dropped in the show notes so people can reach out. And if they want to just talk about Cloudsmith, or package management or Rust, you can also reach out.
Matthias
01:13:27
Amazing. Cian, thanks so much for taking the time for the interview today.
Cian
01:13:31
Thank you. It's been a very pleasurable chat.
Matthias
01:13:35
Rust in Production is a podcast by corrode. It is hosted by me, Matthias Endler, and produced by Simon Brüggen. For show notes, transcripts, and to learn more about how we can help your company make the most of Rust, visit corrode.dev. Thanks for listening to Rust in Production.