Rust in Production

Matthias Endler

KSAT with Vegard Sandengen

About talking to satellites with Rust

2025-07-10 47 min

Description & Show Notes

As a kid, I was always fascinated by space tech. That fascination has only grown as I've learned more about the engineering challenges involved in space exploration.

In this episode, we talk to Vegard Sandengen, a Rust engineer at KSAT, a company that provides ground station services for satellites. They use Rust to manage the data flow from hundreds of satellites, ensuring that data is received, processed, and stored efficiently. This data is then made available to customers around the world, enabling them to make informed decisions based on real-time satellite data.

We dive deep into the technical challenges of building reliable, high-performance systems that operate 24/7 to capture and process satellite data. Vegard shares insights into why Rust was chosen for these mission-critical systems, how they handle the massive scale of data processing, and the unique reliability requirements when dealing with space-based infrastructure.

From ground station automation to data pipeline optimization, this conversation explores how modern systems programming languages are enabling the next generation of space technology infrastructure.

About KSAT

KSAT, or Kongsberg Satellite Services, is a global leader in providing ground station services for satellites. The company slogan is "We Connect Space And Earth," and their mission-critical services are used by customers around the world to access satellite data for a wide range of applications, including weather monitoring, environmental research, and disaster response.

About Vegard Sandengen

Vegard Sandengen is a Rust engineer at KSAT, where he works on the company's data management systems. He has a Master's degree in computer science and has been working in the space industry for several years.
At KSAT, Vegard focuses on building high-performance data processing pipelines that handle satellite telemetry and payload data from ground stations around the world. His work involves optimizing real-time data flows and ensuring system reliability for mission-critical space operations.

Links From The Episode

  • SpaceX - Private space exploration company revolutionizing satellite launches
  • CCSDS - Space data systems standardization body
  • Ground Station
  • Polar Orbit - Orbit with usually limited ground station visibility
  • TrollSat - Remote Ground Station in Antarctica
  • OpenStack - Build-your-own-cloud software stack
  • RustConf 2024: K2 Space Lightning Talk - K2 Space's sponsored lightning talk, talking about 100% Rust based satellites
  • K2 Space - Space company building satellites entirely in Rust
  • Blue Origin - Space exploration company focused on reusable rockets
  • Rocket Lab - Small satellite launch provider
  • AWS Ground Station - Cloud-based satellite ground station service
  • Strangler Pattern - A software design pattern to replace legacy applications step-by-step
  • Rust by Example: New Type Idiom - Creating new wrapper types to leverage Rust's type system guarantees for correct code
  • serde - Serialization and deserialization framework for Rust
  • utoipa - OpenAPI specification generation from Rust code
  • serde-json - The go-to solution for parsing JSON in Rust
  • axum - Ergonomic web framework built on tokio and tower
  • sqlx - Async SQL toolkit with compile-time checked queries
  • rayon - Data parallelism library for Rust
  • tokio - Asynchronous runtime for Rust applications
  • tokio-console - Debugger for async Rust applications
  • tracing - Application-level tracing framework for async-aware diagnostics
  • W3C Trace Context - Standard for distributed tracing context propagation
  • OpenTelemetry - Observability framework for distributed systems
  • Honeycomb - Observability platform for complex distributed systems
  • Azure Application Insights - Application performance monitoring service

Official Links

Transcript

This is Rust in Production, a podcast about companies who use Rust to shape the future of infrastructure. My name is Matthias Endler from corrode, and today we talk to Vegard Sandengen from KSAT about talking to satellites with Rust. Vegard, can you introduce yourself and KSAT, the company you work for?
Vegard
00:00:24
Thanks for having me. My name is Vegard Sandengen. I have a master's in computer science, I have worked most of my professional career in the space domain, even though it's usually on the ground, and I've been working at KSAT now for the last four years.
Matthias
00:00:40
And recently, you became a father, so there's one more rustacean in this world. Congratulations.
Vegard
00:00:47
Thank you.
Matthias
00:00:47
So, can you say a few words about KSAT? I know that the slogan is "We Connect Space And Earth", and I really like that, but what is it about?
Vegard
00:00:56
KSAT is the abbreviation of the company, which is Kongsberg Satellite Services. So we're getting data from space to Earth, and then we're using that data: ground network operations and Earth observation networks. I work in the ground network, which is our distributed network of antennas situated all around the world. We enable satellite owners to talk with their satellites and get their data.
Matthias
00:01:24
A lot of people only know about satellite technology from television or from popular science. And the knowledge they have is probably rooted in the 60s and 70s. But a lot has happened since then. What has happened since the 60s?
Vegard
00:01:45
Yeah, the satellite industry was traditionally operated by satellite companies, using their software to just deliver on whatever their satellite had. The satellite itself started in the 60s, with the Russians launching Sputnik, and it was very expensive. I mean, launching a satellite took a government agency. All the way until basically SpaceX and a lot of the other newcomers in the satellite business, or in the launch business, came along, it was extremely expensive to launch satellites, so it was mostly just agencies and government entities that could afford to put satellites into orbit. Some of those satellites are geostationary satellites delivering your satellite communications for TV, or for your sat phone, if you had that. From the old days it was mostly communication-based, but NASA and ESA also launched scientific instruments to monitor the Earth, or to monitor the sun, or sent probes into outer space to do other readings. And the way satellites communicate is almost exclusively through radio frequency communication, on different wavelengths of the radio frequency spectrum. The wavelength you use determines the quality of your transmission. Earth observation satellites, very close to the Earth, orbit the Earth maybe 14 or 15 times a day, and they can produce a lot of data. As the instruments have gotten better, the resolution of whatever measurements they're doing is getting higher, and the amount of data is getting higher. The main way to actually get the data down is to have contact with a ground station, and you have limited visibility over a ground station. You only get like 10 to 15 minutes of visibility maximum, that's peak, and you have to push down gigabytes of data. So the amount of data we're talking about is ever increasing.
Matthias
00:04:00
Yeah, thanks for the overview. But one thing I always wondered, as somewhat of a bystander, is what the standardization of the communication protocols looks like. Do we keep using the same protocols since the 60s, or does every satellite have its own protocol, or is it something in between?
Vegard
00:04:20
It's everything and nothing, unfortunately. There is a standardization body called CCSDS that a lot of the government agencies have contributed to since the early days, the 80s-ish, if I remember correctly. A lot of the hardware-related radio frequency protocols, and how to handle data on the physical link, have a lot of different standards. In order to push data over the air, you also need some error correction, and you need to be able to sequence your data; just like TCP/IP, there's an equivalent standard in the space industry. Coming into the new space era, there are a lot of new contenders on the market that are software companies using spacecraft, not spacecraft companies using software. They're also not really following some of these standards from the agency era, so you get a lot of compatibility issues where you're basically having to custom-fit: okay, how do we talk to this spacecraft? Because this is a new software company that has just looked at the standard and said, ah, we don't really need this, we'll do it our way. And it works for them. But at some level, you have a minimum viable product that you can share on a radio frequency level, and most people are compatible with that. But after that, all bets are off.
Matthias
00:05:56
Sounds like that approach would generate a ton of legacy code in a very short time.
Vegard
00:06:03
Yeah.
Matthias
00:06:05
Now, let's talk about the size of operations at KSAT.
Vegard
00:06:09
KSAT started off 25 years ago as a company, and we started off with one antenna and one customer, and that's about it. As KSAT grew, this market shift into new space, with all these new software actors, really exploded the number of satellites launched into space. KSAT followed suit and built up both its antenna park, how many antennas we have, and how many employees and engineers we have to deal with this. At this point in time, we're roughly in the ballpark of between 100 and 300 active antennas. KSAT is one of the biggest providers of commercial ground station services.
Matthias
00:06:53
I think the official website mentions 23 sites worldwide, which sounds crazy to me. What is a site, specifically, and what goes into maintaining one?
Vegard
00:07:05
An antenna site for us is mostly a place where we need a lot of power, and we need a fiber optic cable, hopefully. We don't have that at every site, but what qualifies as a good site is that it's far enough apart from any other site we have, and that it covers a lot of ground we don't actually get from other sites in the vicinity. The placement of the sites usually depends a bit on what orbit the satellites go in. Satellites usually have two orbits that are relevant: the polar orbit, where they go from pole to pole, and the other one, where they just follow the equator. If you only have a ground station at the equator and you have a polar-orbiting satellite, you only get visibility twice a day. But if you have a ground station near the poles, you get 10, 12, 14 contacts a day. So it really depends. Each contact has a duration of anything between 5 and 15 minutes, really. And that can generate anything from a few gigabytes to 100 gigabytes per contact.
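To put those figures in perspective, here is a quick back-of-envelope calculation using the upper numbers mentioned here (100 GB downlinked in a roughly 10-minute pass):

```rust
fn main() {
    // Upper-end figures from the conversation: ~100 GB per contact,
    // downlinked during a pass of roughly 10 minutes.
    let bytes: f64 = 100.0 * 1e9; // 100 GB
    let seconds: f64 = 10.0 * 60.0; // one 10-minute pass
    let gbit_per_s = bytes * 8.0 / seconds / 1e9;
    // About 1.33 Gbit/s of sustained downlink for the whole pass.
    println!("required sustained rate: {:.2} Gbit/s", gbit_per_s);
}
```

So a single busy antenna needs well over a gigabit of sustained throughput during a contact, before any backhaul to the customer even starts.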
Matthias
00:08:10
Data processing can come later, but the data exchange happens during that time frame.
Vegard
00:08:16
The data exchange between the satellite and the ground station, yes. And because the volume of data is increasing so much, our main concern going forward is not really building enough antennas, it's actually just building enough infrastructure to handle all this data. Because there's so much data, and you need to push it around, and you need to provide it to the customer in a reliable fashion. And the networking can be quite unreliable between a remote site in Canada or New Zealand and a customer on the west coast of the US. That's really a challenge going forward.
Matthias
00:08:57
Okay, so to summarize, the setup is a bit like this. You have a ton of satellites circling the Earth on a regular basis. They go around the Earth 10 to 15 times a day or so, roughly like that. And then on the ground, you have antennas at ground stations. These antennas connect with the satellites, do the data exchange, and then you need to send the data over, say, fiber to a central place.
Vegard
00:09:30
Usually it's delivered straight to the customer, but due to our volume of data, we also have to temporarily store it on the site itself, while not losing data in the process. So yeah.
Matthias
00:09:45
Two things come to mind. First, it needs to be extremely reliable, because if you lose the data, that is a big outage, and probably a loss to the customer as well. The second part is: how often can you make changes to that code? How often can you modify code that also needs to be reliable? I'm guessing you probably even have limitations on how often you can access these ground stations and make changes.
Vegard
00:10:13
Yes, that is correct, it has to be really reliable. And you're right on point with how we update the code. It's not that we're using the antenna 100% of the time, but the ecosystem around the antenna, with our software running on different hardware close to the antenna, is not easy to access. I don't have access to it, for instance, so I just have to push code and hope that someone else deploys it. Worst case, it can take weeks before something is deployed worldwide. That is a process we're obviously trying to optimize and get better at, but it is a pain point, because these are also inaccessible sites. The most inaccessible site we have is probably the Troll station in Antarctica, which also doesn't have a fiber optic cable. So anything you put down there, we also have to beam up to a geostationary satellite, so we can beam it down to Earth again at a place where we have fiber.
Matthias
00:11:10
I guess the huge advantage here is that for code that is written in Rust, you could just deploy a static binary and people would just be able to run it on the deploy target.
Vegard
00:11:21
It's generally that easy. I mean, everything we do nowadays is dockerized. On all our ground stations, we're running some variant of Kubernetes and just running it on OpenStack or Kubernetes directly.
Matthias
00:11:34
One could think that since you operate the ground station, and you probably have access to a rack or so, you're not resource-constrained. But one thing people might forget is that you don't do constant updates to the hardware over there.
Vegard
00:11:52
We're definitely resource-constrained at a lot of our sites. Not all of them, but a lot of them. It can take us eight months just to get a new computer ordered from our vendor, and then we have to ship it anywhere in the world, and you have to get people there on-site to install it. So we are resource-constrained in the sense that we don't want to over-provision every data center around the world near all our antennas at our ground station sites. First of all, we don't necessarily have the resources to do that, and at some point we don't have the ability to do it. So it's nice to use something that doesn't hog all the resources.
Matthias
00:12:31
Wouldn't it then be super easy to fall into a trap of being extremely conservative about tech decisions? People might associate space technology with a lot of very old conservative technology, and maybe for a good reason, because it's tried and tested.
Vegard
00:12:50
I think the satellite industry, or space industry, is definitely very conservative. It takes a lot of effort to qualify something to run in space. I know at RustConf last year, one of the sponsors was K2 Space. They're a space company with a lot of recruits from AWS and SpaceX, and they wanted to do everything in Rust. They wanted to build the satellite, the firmware, all the ground resources, 100% in Rust. They had a lightning talk at RustConf; it's probably out on YouTube. So there are definitely contenders out there that don't want to be so conservative. But old space is very conservative. I wouldn't say that's necessarily true on the ground, though. The ground is a bit more like: we can touch this, we can fix it. It's not the same in space.
Matthias
00:13:42
You said there was a shift in the industry, so we moved from space companies using software to software companies doing space things. Two companies come to mind right away: one would be SpaceX, and the other would be Blue Origin. But I'm assuming that's just a tiny little slice of the picture, and maybe there are other software companies that I might have heard of that pushed into the space.
Vegard
00:14:12
Into the space, yes. You also have a few other providers up there trying, and successfully doing so, like Rocket Lab. But these are launch providers; they're facilitating the software companies to launch something into space. Otherwise, AWS is actually going right at it, and they're going after the data primarily. They want the data, because data is AWS's business model, and there's a lot of data in space. A couple of years back, or three or four years back, they launched a ground station service, which I wouldn't say is a direct competitor to us, but they are definitely a competitor. And we have made a strategic partnership with AWS to be a ground network of network providers. So people can come to us and they can use the resources in AWS, their antennas, their setup, but they can do it through us. But the business model is a bit different, because AWS, as I said, is a data company. They really just care about getting the data into the AWS data center, so you can do whatever you want with it there. The space part is just a means to an end, really.
Matthias
00:15:22
So we move from space exploration to data exploration. What has changed on the language side? How did the story go at KSAT?
Vegard
00:15:34
Initially, everything was engineers writing Perl scripts and just making it work. And that has scaled very well, but it's still written in Perl, and it's not the newest version. And at some point, we needed to have a bit more control of whatever is running on our antennas. And that was developed in Java in the mid-2000s with an Oracle database. And that has scaled well. We're very thankful for the legacy that was provided to us so that we can even be here today to do something else at a bigger scale. Because that would not be possible without the humble beginnings.
Matthias
00:16:12
The 2000s were definitely the time of Java. It has some really nice traits, and I think it resonated well with the challenges of its time. But then what happened in the 2010s at KSAT?
Vegard
00:16:28
Yeah, so at some point we scaled up with a few more developers and a bit more modern scripting, and Python kind of took over. We have multiple Python applications still in production today from that era. But we started to see that the way that Java application worked, and not necessarily Java in itself, but just the database and all the Perl integrations that unfortunately had direct database access, meant that we had a distributed network all over the world with scripts being able to access the raw contents of our database. And that was not very scalable. We launched an initiative to move away from this world into a more modern world, where we have more control over the life cycle of the data that we put in the database. 20-25 years ago, everything was FTP and XML drop boxes, you can call that an API as well, but we decided to try to offload responsibility into a segregated new Postgres database, where access to the data is tightly controlled through an HTTP API. So we're employing a sort of strangler pattern on that, trying to rope in responsibilities and rewrite and repurpose them, and we have now successfully launched a competing solution in-house, where half of the antennas are on the old system. The old API was written in Perl, and it was strangled at the HTTP layer into a Kotlin application. Then we nibbled at it and moved different responsibilities and endpoints around, and now, from a responsibility point of view, it's about 40/60 in Rust right now, but a lot of the boring parts are in the Kotlin application. We're actively working to migrate the remaining Kotlin portions over to Rust as well.
Matthias
00:18:32
You mentioned the strangler pattern. How does it work?
Vegard
00:18:35
The strangler pattern is very convenient when you have a code base or an interface layer where you know very well what's going in and what's going out, and you know that everything below this is just a complete mess and you don't understand anything, but you understand the interfaces, or the boundaries. You can replace the implementation under each boundary with very great control, see the differences in implementation and behavior, and make it entirely seamless to all consumers that you have actually done anything. Which is very nice, but it requires that you have some sort of abstraction that actually makes this feasible. At an HTTP API layer it's very easy, because the contract is in how the API responds and what parameters it takes. And you can replace that in any language. It's not really that hard. It's just a lot of verification that you've actually replicated all the behavior.
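The strangling-at-the-HTTP-layer idea can be sketched in a few lines. This is a toy dispatcher, not KSAT's code; the route names and handler functions are hypothetical. Migrated endpoints are served by the new implementation, and everything else is forwarded untouched to the legacy application, so consumers never notice the cutover:

```rust
// New Rust implementation for the endpoints that have been migrated.
fn new_impl(path: &str) -> String {
    format!("rust: handled {}", path)
}

// Stand-in for proxying the request to the legacy application.
fn forward_to_legacy(path: &str) -> String {
    format!("legacy: handled {}", path)
}

fn dispatch(path: &str) -> String {
    // The list of migrated endpoints grows one route at a time,
    // invisible to API consumers.
    const MIGRATED: &[&str] = &["/passes", "/antennas"];
    if MIGRATED.iter().any(|prefix| path.starts_with(prefix)) {
        new_impl(path)
    } else {
        forward_to_legacy(path)
    }
}

fn main() {
    println!("{}", dispatch("/passes/42")); // served by the new code
    println!("{}", dispatch("/billing/7")); // still the legacy app
}
```

In a real deployment this dispatch would live in a reverse proxy or in the web framework's router, but the shape is the same: one boundary, two implementations, and a shrinking legacy side.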
Matthias
00:19:41
Now, let's focus on the API for a second. You mentioned that it's a split between Kotlin and Rust at the moment. Where do you draw the line?
Vegard
00:19:52
I don't think there's a natural split now, other than whatever developer or team took that responsibility and what they were comfortable with. We've had a very open policy at KSAT on what languages we use to solve whatever problem. And there has definitely been pushback on introducing Rust in some capacity by some team members in different teams. I'm not necessarily sure all of their concerns are what I would call valid, but there are definitely concerns. Some of the pushback I've heard is usually that it's not mature enough, or the ecosystem is not there. And I feel that is a sentiment often held about Rust that I'm not sure is true anymore, because I feel the ecosystem is very much present. I can do everything I want in the Rust ecosystem today. The other part is maybe just a lack of knowledge of how to use such complex concepts, because Rust comes from a systems background. A lot of things around borrows and lifetimes and stuff like that can seem a bit intimidating to someone that's usually very happy in their Java or .NET environment, where that is not a concern for 99% of what they're doing. There are also positive receptions of Rust, and I have personally been able to, I don't know, convert a couple of teams to use Rust. So yeah, we're at approximately three or four teams now using Rust in production at KSAT, with maybe four-ish people in each team actively writing Rust.
Matthias
00:21:30
How does that usually go for you when you approach a team and they are curious about Rust, but they are not entirely convinced yet?
Vegard
00:21:39
The conversation often goes in the direction of what is very good about Rust, and that's what I start with. And you have to make some concessions. The concessions are obviously just: is it a good team fit? Because I don't think Rust is hard to use once you've gotten over that initial, whoa, what happened here? It's a shock. But a lot of teams have their experiences and their toolboxes in other languages and know how to solve problems there. And if you don't have a champion on the team itself, I don't think it's possible to really introduce Rust into a team, because the team has to embrace it themselves. It's a no-go if the team is not championed from within, really. So my job is more just trying to do some good mentoring, have some common guidelines, curate some crates, and make some internal crates that help the process along internally, with the tooling and the way we do things. But ultimately, you require that team champion to be on your side as well.
Matthias
00:22:46
What's your success rate here? Have you lost some of these battles?
Vegard
00:22:51
Not on a team level, but maybe on an individual level, yes. But the general vibe is that it's going more and more toward Rust for a lot of our distributed systems, just because it's so nice to use once you actually get to know it. So it's just that hurdle of inviting in people that haven't used it before.
Matthias
00:23:12
I'm almost too afraid to ask it, but has Go ever come up in that conversation?
Vegard
00:23:19
Go has come up multiple times, and we have production code in Go as well. I'm a bit annoyed at that sentiment as well, because, well, maybe annoyed is not the right word, I'm a bit intrigued by the "why don't we just do it in Go?" Because Go was released in March 2012. It's three years older than Rust; at this point they're 13 and 10. It's not that big of a difference.
Matthias
00:23:45
In terms of age, but in terms of functionality and in terms of developer ergonomics, maybe?
Vegard
00:23:52
Yeah, but Go was a very simple language to begin with, so it was very easy to get going with Go. But I also think that while there is an ecosystem in Go, that ecosystem is harder to engage with than the Rust ecosystem, because the tooling, with cargo and kin, is just miles above any tooling you have in Go. So that makes it, for me, also a no-brainer. Disregarding the language itself and its features and ergonomics, just the tooling and the ecosystem around using the language is what makes Rust the number one contender on the market.
Matthias
00:24:34
Go is very much a day-one language. Starting a project and getting to your first production version is usually very ergonomic, very quick, very elegant. The problems start to arise on day two. Not exactly day two, but when you have a larger code base, you feel the limitations of the language and the ecosystem. It's trying to constrain you somehow. It almost feels like it's strangling you.
Vegard
00:25:08
And you're not strangling it.
Matthias
00:25:10
I probably would have made the same decision in your position, of course. Obviously, I'm biased, but you have to maintain this software for a very long time.
Vegard
00:25:21
Yeah, so definitely, from my experience point of view, it's just being able to model your code in a way where you just know where the boundaries of what you've made are, and it's very easy to move things along and refactor. Back eons ago, I was a C and C++ developer, and I did a bit of this and a bit of that. And trying to refactor a C++ code base while having confidence that you've actually done it correctly, I have never had that. But Rust, "if it compiles, it works", it basically is that. That sentiment is overused, I think, but it still feels very true at some point, because the compiler is so powerful. Whenever it compiles, I'm confident. And I also have a few tests here and there, and when the tests pass as well, which they do 99 percent of the time after I've done a major refactor, I'm confident. I will push, no problem.
Matthias
00:26:25
Funny that you say you have a few tests here and there. Does that mean you lean into Rust's strong type system a lot as well, and maybe you don't have to write that many tests that you would have to write in other, more dynamic languages like Python?
Vegard
00:26:39
Oh, definitely. For our tests, I think there's a concept called diamond-shaped testing or something, where you basically have very few unit tests, very few system tests, but a lot of integration tests. And those integration tests are placed on the boundaries of the network layer, so HTTP. All my tests are basically HTTP-related API tests, because I don't really care how the structs or functionality within the Rust code base behave; what's important is the contract at the HTTP boundaries. So we have a few tests down to the database over the HTTP layer, but from a unit test point of view, almost nothing.
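Boundary-level testing like this can be illustrated with a tiny sketch. A plain function stands in for a real HTTP handler here (at KSAT this would go through axum; the `/passes/{id}` route and response shape are hypothetical). The point is that the tests assert only on the wire-level contract, status code and body, never on internal structs:

```rust
// A stand-in for an HTTP handler: takes the wire-level input (the path),
// returns the wire-level output (status code and JSON body).
fn get_pass(path: &str) -> (u16, String) {
    match path.strip_prefix("/passes/") {
        // A numeric ID is a valid request.
        Some(id) if !id.is_empty() && id.chars().all(|c| c.is_ascii_digit()) => {
            (200, format!("{{\"pass_id\":{}}}", id))
        }
        // Anything else falls through to 404.
        _ => (404, String::from("{\"error\":\"not found\"}")),
    }
}

fn main() {
    // The tests live at the boundary: only status and body are checked.
    assert_eq!(
        get_pass("/passes/42"),
        (200, String::from("{\"pass_id\":42}"))
    );
    assert_eq!(get_pass("/passes/abc").0, 404);
    println!("contract tests passed");
}
```

Because nothing in the assertions mentions internal types, the implementation behind the boundary can be refactored, or rewritten in another language, as in the strangler pattern, without touching a single test.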
Matthias
00:27:28
But in order for that to work, you would have to lean very heavily into the Rust mechanics, into the type system, and you would have to rely on it. Are there patterns that you commonly use to fully embrace that part of Rust?
Vegard
00:27:48
Yeah, so I use quite a lot of newtypes. For instance, a UUID, I will newtype into a variant that represents this specific resource. That means the API layer is very communicative about what it's actually expecting, or not just the API layer, but the code base itself that serves it. It's very easy to modularize different components that work in some form of hierarchy, because the types are so strong that you can convey so much with both the primitive types themselves and with sum types in the form of enums. The one thing I miss every time I go to any other language is the enum. I think: this I could model very well in an enum, and I don't have this capability. And it saddens me.
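The newtype idiom mentioned here is small enough to show in full. This is a generic sketch, not KSAT code; the ID types and the function are made up. Each resource gets its own wrapper type, so passing a satellite ID where a contact ID is expected becomes a compile error instead of a runtime bug:

```rust
use std::fmt;

// Hypothetical newtype wrappers: same underlying integer,
// but the types are distinct and cannot be mixed up.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct SatelliteId(u64);

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct ContactId(u64);

impl fmt::Display for SatelliteId {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "sat-{}", self.0)
    }
}

// The signature documents exactly which kind of ID it needs.
fn schedule_contact(satellite: SatelliteId, contact: ContactId) -> String {
    format!("scheduling contact {} for {}", contact.0, satellite)
}

fn main() {
    let sat = SatelliteId(42);
    let contact = ContactId(7);
    // schedule_contact(contact, sat) would not compile: the types differ.
    println!("{}", schedule_contact(sat, contact));
}
```

In practice these wrappers usually hold a UUID rather than a `u64`, but the guarantee is the same: the compiler, not a test suite, rejects mixed-up identifiers.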
Matthias
00:28:33
Do you have an example of an enum that comes to mind, where modeling some particular business logic was very ergonomic?
Vegard
00:28:43
So I'm a big fan of the oneOf pattern. It is represented in, for instance, OpenAPI definitions; there is a oneOf you can express there. Doing code gen for oneOf in OpenAPI to any other language is horrible, but to Rust it was very easy. Being able to represent the fact that this resource has different properties depending on which kind it is, is very powerful. Because even though at some level you're talking about the same resource, it has one resource ID, it can manifest itself in different forms, different versions, or represent different physical attributes on the network. On some abstractions you don't really care about those properties, but on others you do. It's very nice to be able to represent exactly the properties that are present, and not a load of optionals that are only present if this is true and this other thing is true. Having to carry that logic throughout the code makes it harder to refactor as well. If you know that this can only be set if this other value is set, those are invariants in your code that you can encode with the enums instead.
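Here is a minimal sketch of what modeling a oneOf as a Rust enum looks like. The domain details are invented for illustration (KSAT's actual resources are not public): one resource, one ID, but the kind-specific properties only exist on the variant where they make sense, so there are no half-valid optional fields to carry around:

```rust
// Hypothetical resource whose properties depend on its kind,
// mirroring an OpenAPI `oneOf`: invalid combinations cannot be constructed.
enum AntennaKind {
    // A fixed dish has a diameter; a minimum elevation only makes sense here.
    Dish { diameter_m: f64, min_elevation_deg: f64 },
    // A phased array has a beam count instead.
    PhasedArray { beams: u32 },
}

struct Antenna {
    id: u64,
    kind: AntennaKind,
}

fn describe(a: &Antenna) -> String {
    // The compiler forces every variant to be handled,
    // and each arm only sees the fields that actually exist.
    match a.kind {
        AntennaKind::Dish { diameter_m, .. } => {
            format!("antenna {}: {}m dish", a.id, diameter_m)
        }
        AntennaKind::PhasedArray { beams } => {
            format!("antenna {}: phased array with {} beams", a.id, beams)
        }
    }
}

fn main() {
    let a = Antenna {
        id: 1,
        kind: AntennaKind::Dish { diameter_m: 3.7, min_elevation_deg: 5.0 },
    };
    println!("{}", describe(&a));
}
```

With serde, such an enum can be serialized as a tagged union (for example `#[serde(tag = "kind")]`), which is exactly the shape an OpenAPI oneOf with a discriminator describes.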
Matthias
00:29:57
I'm not too familiar with it, but I know that in a schema you can say: this is one of these variants, one of these kinds. And I guess it maps really well to enums. If you go one step further, you're probably also using the serde ecosystem and saying: this is my input type, and I convert it from the schema.
Vegard
00:30:20
So we're leaning heavily into serde. It's an excellent library.
Matthias
00:30:25
Any other crates that you personally like for that sort of work?
Vegard
00:30:29
You usually have to do some customizations on top of serde, with serde_with or stuff like that, to actually do the proper transformations. I've also been experimenting now with Utopia to generate OpenAPI specifications.
Matthias
00:30:45
It's called utoipa. It's a very common misspelling, unfortunately. I made it a dozen times until someone pointed it out.
Vegard
00:30:53
Yeah, I will probably continue to misspell it.
Matthias
00:30:56
The reason why it's called utoipa, by the way, is that IPA is API backwards. And it's also a good beer. That's from the README.
Vegard
00:31:06
Of course. Sorry. Yeah. One slight issue I have with serde is that it's very versatile, but it doesn't really give you a great, structured way of accessing errors. And that boggles me a bit, because I really want to give good structured feedback on our API surfaces. And I don't want to fork serde just to fix that, because then I'm incompatible with everything. I'm not entirely sure how to solve that on an ecosystem level. But right now, I've just wrapped the outputs and parsed the strings to extract the vital information that I want. I would definitely like to see a bit more structured error responses on what went wrong in the serialization process.
Matthias
00:31:53
I personally see serde more as a contract. You have the value type you have to serialize, you have these traits, and that's your building block. So what keeps you from building structured error messages from these smaller building blocks?
Vegard
00:32:10
Because the serde error type doesn't give you that. Well, I think it's possible, but we're using serde_json, because JSON is what we communicate over, and the serde_json error type erases any reference to which field the error occurred at. So you either have to parse the stringified message to extract "it was at this field", or you have to fork serde_json and fix it there. I could probably do that as well, but I've seen in multiple JSON parsing libraries that the level of programmatic access to the error variants is not that great. Other than that, the serde ecosystem is amazing. You can do a lot of stuff with it. You just have to be a bit more forgiving in how you output the errors to the end user, because that's kind of what matters here. I mean, for me as a programmer, I don't really care, but it's the consumer of the API that cares.
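The string-scraping workaround described here can be sketched in pure std. serde_json's error messages are formatted like `missing field `name` at line 1 column 26`; that Display format is not a stable API, which is exactly the pain point, but scraping it back apart is what you are left with. The `StructuredError` type and `parse_serde_message` helper below are hypothetical names for illustration.

```rust
// Sketch of recovering structure from a serde_json-style error string.
// The message format being parsed here is an implementation detail of
// the library, not a contract - which is the complaint above.
#[derive(Debug, PartialEq)]
struct StructuredError {
    field: Option<String>,
    line: Option<u32>,
    column: Option<u32>,
}

fn parse_serde_message(msg: &str) -> StructuredError {
    // Field names are quoted in backticks, e.g. missing field `name`.
    let field = msg.split('`').nth(1).map(str::to_owned);
    // Positions appear as "... at line L column C".
    let num_after = |key: &str| -> Option<u32> {
        msg.split(key)
            .nth(1)?
            .split_whitespace()
            .next()?
            .parse()
            .ok()
    };
    StructuredError {
        field,
        line: num_after("line "),
        column: num_after("column "),
    }
}
```

A first-class, typed error path from the deserializer itself would make this scraping unnecessary, which is the ecosystem-level wish being expressed.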
Matthias
00:33:15
From what I can tell from our conversation so far, stability is the main focus.
Vegard
00:33:20
Listening to a lot of your other guests on this podcast, they're doing a lot of cool shit, and it's very fun to listen to, but I get the feeling that our usage is, at first glance, boring. We're just using the top-level web frameworks, sqlx and axum and serde, putting it all together and making it work. I have a good example of that: a couple of months back we needed to make some changes in a few of our running services, and I went into the repository for one of those services to fix it, and I saw the last commit was a year and a half ago. It had just been running for a year and a half, and I hadn't touched it. I have never had that experience in my professional career. That service was the main authentication and authorization service; it authenticated and managed every API key and principal, so it was used on every request. It's really chugging along, so it's amazing. I've had only good experiences on that front.
Matthias
00:34:23
Did you also have any bad experiences with Rust?
Vegard
00:34:28
You could call it a bad experience, but I would camouflage it as a good one. We've been running an on-prem cluster for many years, and that cluster hasn't really gotten much love and attention, so it's just chugging along with the resources it had six years ago when it was installed. We also do a lot of calculations regarding satellite trajectories and visibility from our ground stations and things like that. One of the things I wanted to calculate was simply: when is a satellite visible over our ground stations? We support quite a lot of satellites and we have a lot of ground stations, so there's a lot of math to figure out when you are where, and when I can talk to you. I naively just put everything in a loop, slammed rayon on it, and pushed it to production. A couple of days later, someone from my DevOps team came over: our production cluster is running at 80% CPU, it's struggling a bit, and the majority of it is from the service I just updated. The computations worked fine, but they had a wider impact on our other production services.
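Dialing this back usually means capping how many cores the parallel loop may grab; with rayon, the real-world knob is `ThreadPoolBuilder::num_threads` (or running the work inside a dedicated, smaller pool). The sketch below shows the same idea in pure std, with a stand-in for the visibility math; all names are illustrative, not KSAT's code.

```rust
use std::thread;

// Stand-in for the real trajectory/visibility computation per satellite.
fn visibility_windows(satellite: u64) -> u64 {
    (0..1_000).map(|i| (satellite * 31 + i) % 97).sum()
}

// Process satellites with at most `max_threads` worker threads alive at
// once, instead of letting the loop saturate every core on the cluster.
fn compute_capped(satellites: &[u64], max_threads: usize) -> Vec<u64> {
    let mut results = Vec::new();
    for chunk in satellites.chunks(max_threads) {
        // Spawn one worker per satellite in this chunk...
        let handles: Vec<_> = chunk
            .iter()
            .map(|&s| thread::spawn(move || visibility_windows(s)))
            .collect();
        // ...and join them all before starting the next chunk.
        for h in handles {
            results.push(h.join().expect("worker panicked"));
        }
    }
    results
}
```

The chunked join keeps results in input order and bounds CPU usage, which is the trade-off being made: slower wall-clock time for the batch job, but no starvation of neighboring production services.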
Matthias
00:35:40
So it's too performant?
Vegard
00:35:42
Too good. I had to dial that back.
Matthias
00:35:48
Okay, I can see how that might also be a benefit, or how I could see it as a win. But are there any other issues with the wider Rust ecosystem that come to mind?
Vegard
00:36:01
Yeah. I mean, we're a big user of async, because we're using axum and everything is on a tokio runtime, and it works very well for basic futures that handle HTTP requests and futures that send database queries and get responses. But when you try to combine that with a future in the HTTP layer that also does some computation, we ran into issues. A few months back, someone used our API in a way that we hadn't anticipated, and there was too much traffic on something that blocked. Everything just stagnates, response times spike, and it affects everything. Trying to hunt down where we actually block, or compute for so long that we're starving the tokio runtime, was very challenging.
Matthias
00:36:57
What I see a lot is teams using their development laptop, with say 16 or 32 cores, to start a larger tokio application. And then when they deploy the same service to production, it ends up running on a two-core node, and obviously that's a completely different environment. Was it one of these cases, where the production system was very resource constrained and the development system was not?
Vegard
00:37:29
The problem manifested itself when the traffic increased enough to actually trigger it. We could reproduce it locally at some point, once we knew what traffic to induce. We had some inklings when stuff went wrong, but it was quite a goose chase down this set of futures: where do you actually block, and how do you measure what blocks? And trying to use tooling like tokio-console, it's a great project, but it's just not insightful enough at that level yet. So I would say the tooling is probably not right for the abstractions we need to efficiently bisect where the issue is and how to solve it. Solving it is very easy in tokio: you just spawn the work on the blocking thread pool and it's fine. But it's definitely something to be aware of; it's a pitfall for newer developers, and it got us as well.
Matthias
00:38:22
The typical pattern is that you see a spike on the CPU while there's really not much traffic coming in anymore; it blocks at the API layer, but in reality your CPU is super busy with some computation. And even then it doesn't tell you where that computation happens. You just need to dig deeper and understand the business logic of it all.
Vegard
00:38:49
That's also where distributed tracing in an application, and the insight you have into it, comes to mind. I like the tracing ecosystem a lot; it's very good, love it. But figuring out how you actually use tracing in a distributed sense is a learning curve, where you basically have to puzzle the pieces together yourself to figure out how to get the correct level of tracing within and across applications. That's probably also an area where there would be a good fit for some higher-level abstraction crates, for server applications that just need good defaults on everything.
Matthias
00:39:32
Do you use tracing across language boundaries or just within the Rust context?
Vegard
00:39:37
We use the W3C Trace Context standard to send traceparent headers to correlate tracing information across applications, and that works fine. We set up our own tracing infrastructure using the tracing crate with a custom subscriber for Azure App Insights. App Insights is a good service, but it's also quite expensive. But just knowing where to wire up what, what you need to call when and where in tracing, and how you model that onto whatever subscriber you have: OpenTelemetry versus Honeycomb versus App Insights all behave differently in how you open spans, when you close them, how you annotate them, and when you actually send the event. It's a learning curve; employing correct tracing in your application is not something that's extremely easy to understand. So you usually spend a few months on it.
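The W3C Trace Context header mentioned here has a small, fixed wire format: `traceparent` is `version-traceid-parentid-flags`, e.g. `00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01`. The sketch below parses it with only std; the `TraceParent` struct and its field names are our own, only the wire format comes from the spec.

```rust
// Parsed view of a W3C `traceparent` header.
#[derive(Debug, PartialEq)]
struct TraceParent {
    trace_id: String,  // 32 hex chars; correlates spans across services
    parent_id: String, // 16 hex chars; the caller's span ID
    sampled: bool,     // lowest bit of the trace-flags byte
}

fn parse_traceparent(header: &str) -> Option<TraceParent> {
    let mut parts = header.split('-');
    let version = parts.next()?;
    let trace_id = parts.next()?;
    let parent_id = parts.next()?;
    let flags = parts.next()?;
    // Reject obviously malformed headers by field width.
    if version.len() != 2 || trace_id.len() != 32 || parent_id.len() != 16 {
        return None;
    }
    let flags = u8::from_str_radix(flags, 16).ok()?;
    Some(TraceParent {
        trace_id: trace_id.to_owned(),
        parent_id: parent_id.to_owned(),
        sampled: flags & 0x01 != 0,
    })
}
```

In practice a middleware extracts this header on the way in and re-injects a new `traceparent` (same trace ID, fresh parent ID) on outgoing requests, which is what lets the backends stitch spans from different services into one trace.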
Matthias
00:40:38
From our conversation so far, it feels like a lot of services run on Azure, or the cloud in more general terms. But how does that relate to whatever you do on the ground stations?
Vegard
00:40:56
The API layers we've developed over the years have been primarily, as you say, in a cloud setting. But due to the widespread nature of our antennas and where they are, we're also resource constrained, as we touched on earlier, in the resources we have at each antenna. And our challenges there are often about having code running that can run forever and not have any downtime, really. Many years ago, we deployed at least one service on each antenna stack, which is written in Rust. It's responsible for ping-ponging back and forth with the cloud: what should I do on this antenna? It's what we call our scheduler; it schedules everything and synchronizes what we have there. And that has been running flawlessly on 120 antennas or so for three years now. I think I've had two bugs in it, and they've been purely logic bugs. The problem with bugs there is that a bug affects everything, because nothing happens on the antenna if the scheduler is down. Other than that, we also have data distribution, and pushing metrics from our baseband equipment out from the antennas. Everyone wants to consume those: customers, system engineers. So a big part of our infrastructure is also just having the correct tooling on each antenna to send that data out, and we're using Rust for that as well.
Matthias
00:42:20
It's incredible how far along you are in your Rust journey already; I had no idea, really. About the scheduler: what inputs does it take and what outputs does it generate?
Vegard
00:42:30
It's running an in-house protocol to synchronize whatever schedule is available in the cloud. The cloud database is the source of truth, and the scheduler on each antenna site just figures out what to synchronize from the cloud, so it can operate autonomously in case of network failure. Without network connectivity, we can still operate and take contacts. It synchronizes over a custom HTTP protocol, and as a contact is about to begin, it kicks off an event to another service, which we call the controller. The controller runs the entire contact: it controls all the baseband equipment and whatever firewalls and whatnot need to be opened and controlled. It's a just-in-time scheduler.
Matthias
00:43:25
And the reason why it doesn't pull everything is resource constraints again?
Vegard
00:43:30
Yeah, and it also doesn't need the full state of the entire database, because from the cloud it only needs to know: what do I need to do? It would be very inefficient to synchronize the whole remote state from the cloud to every antenna; that would not be feasible.
Matthias
00:43:47
But the calculation for knowing what it needs, is that CPU bound, or is the focus again on reliability here?
Vegard
00:43:58
Solely on reliability. The synchronization horizon is configurable per scheduler, but usually it's deployed with one to three days ahead. So it can run for a while if we lose network connectivity, and we can still salvage a lot of data even without connectivity to the cloud.
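The scheme described over the last few answers, a cloud source of truth, a local copy limited to a configurable horizon, and just-in-time hand-off to the controller, can be sketched in a few lines. Everything below is a hypothetical illustration of the idea; the `Contact` shape, names, and protocol are ours, not KSAT's.

```rust
// A contact: a window in which a satellite is reachable from this antenna.
#[derive(Debug, Clone, PartialEq)]
struct Contact {
    satellite: String,
    start_s: u64, // contact start, in seconds from "now"
}

// The antenna's local schedule: only contacts inside the horizon are kept,
// so the antenna can keep operating if cloud connectivity drops.
struct LocalSchedule {
    horizon_s: u64,
    contacts: Vec<Contact>,
}

impl LocalSchedule {
    // Sync step: pull from the cloud (the source of truth), keeping only
    // contacts within the horizon, ordered by start time.
    fn sync(&mut self, cloud: &[Contact]) {
        self.contacts = cloud
            .iter()
            .filter(|c| c.start_s <= self.horizon_s)
            .cloned()
            .collect();
        self.contacts.sort_by_key(|c| c.start_s);
    }

    // Just-in-time step: find the next contact due at or after `now_s`,
    // to be handed off to the controller service.
    fn next_due(&self, now_s: u64) -> Option<&Contact> {
        self.contacts.iter().find(|c| c.start_s >= now_s)
    }
}
```

The key design point survives even in this toy form: after one successful `sync`, `next_due` needs no network at all, which is what makes the one-to-three-day horizon a reliability feature rather than an optimization.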
Matthias
00:44:20
Very impressive. But that means the entire chain, from the satellite all the way to the customer, is at least in parts written in Rust nowadays. What is your message to the Rust community?
Vegard
00:44:38
I think my primary message to the Rust community is: polish up async. Get it to be the best experience it can ever be. There are some pitfalls now, even though the Rust 2024 edition stabilized async closures, and I'm very happy about that. But there are still open questions around observability of what is happening within an async context and how you navigate that. And getting to the bottom of issues like the blocking issues we had, cancellation safety, drop safety, async drop, all these paper cuts that are not completely answered. That would be my message: really polish that up. That would make selling Rust to others much easier.
Matthias
00:45:30
Yes, I could get behind this. Vegard, thanks so much for taking the time and for being a guest today.
Vegard
00:45:37
It's my pleasure. Thank you for having me.
Matthias
00:45:40
Rust in Production is a podcast by corrode. It is hosted by me, Matthias Endler, and produced by Simon Brüggen. For show notes, transcripts, and to learn more about how we can help your company make the most of Rust, visit corrode.dev. Thanks for listening to Rust in Production.