KSAT with Vegard Sandengen
About talking to satellites with Rust
2025-07-10 47 min
Description & Show Notes
As a kid, I was always fascinated by space tech. That fascination has only grown as I've learned more about the engineering challenges involved in space exploration.
In this episode, we talk to Vegard Sandengen, a Rust engineer at KSAT, a company that provides ground station services for satellites. They use Rust to manage the data flow from hundreds of satellites, ensuring that data is received, processed, and stored efficiently. This data is then made available to customers around the world, enabling them to make informed decisions based on real-time satellite data.
We dive deep into the technical challenges of building reliable, high-performance systems that operate 24/7 to capture and process satellite data. Vegard shares insights into why Rust was chosen for these mission-critical systems, how they handle the massive scale of data processing, and the unique reliability requirements when dealing with space-based infrastructure.
From ground station automation to data pipeline optimization, this conversation explores how modern systems programming languages are enabling the next generation of space technology infrastructure.
About KSAT
KSAT, or Kongsberg Satellite Services, is a global leader in providing ground station services for satellites. The company slogan is "We Connect Space And Earth," and their mission-critical services are used by customers around the world to access satellite data for a wide range of applications, including weather monitoring, environmental research, and disaster response.
About Vegard Sandengen
Vegard Sandengen is a Rust engineer at KSAT, where he works on the company's data management systems. He has a Master's degree in computer science and has been working in the space industry for several years.
At KSAT, Vegard focuses on building high-performance data processing pipelines that handle satellite telemetry and payload data from ground stations around the world. His work involves optimizing real-time data flows and ensuring system reliability for mission-critical space operations.
Links From The Episode
- SpaceX - Private space exploration company revolutionizing satellite launches
- CCSDS - Space data systems standardization body
- Ground Station
- Polar Orbit - Orbit with usually limited ground station visibility
- TrollSat - Remote Ground Station in Antarctica
- OpenStack - Build-your-own-cloud software stack
- RustConf 2024: K2 Space Lightning Talk - K2 Space's sponsored lightning talk, talking about 100% Rust based satellites
- K2 Space - Space company building satellites entirely in Rust
- Blue Origin - Space exploration company focused on reusable rockets
- Rocket Lab - Small satellite launch provider
- AWS Ground Station - Cloud-based satellite ground station service
- Strangler Pattern - A software design pattern to replace legacy applications step-by-step
- Rust by Example: New Type Idiom - Creating new wrapper types to leverage Rust's type system guarantees for correct code
- serde - Serialization and deserialization framework for Rust
- utoipa - OpenAPI specification generation from Rust code
- serde-json - The go-to solution for parsing JSON in Rust
- axum - Ergonomic web framework built on tokio and tower
- sqlx - Async SQL toolkit with compile-time checked queries
- rayon - Data parallelism library for Rust
- tokio - Asynchronous runtime for Rust applications
- tokio-console - Debugger for async Rust applications
- tracing - Application-level tracing framework for async-aware diagnostics
- W3C Trace Context - Standard for distributed tracing context propagation
- OpenTelemetry - Observability framework for distributed systems
- Honeycomb - Observability platform for complex distributed systems
- Azure Application Insights - Application performance monitoring service
Transcript
This is Rust in Production, a podcast about companies who use Rust to shape
the future of infrastructure.
My name is Matthias Endler from corrode, and today we talk to Vegard Sandengen
from KSAT about talking to satellites with Rust.
Vegard, can you introduce yourself and KSAT, the company you work for?
Thanks for having me. My name is Vegard Sandengen. I have a master's in computer
science, I have worked most of my professional career in the space domain,
even though it's usually on the ground, and I've been working at KSAT now for the last four years.
And recently, you became a father, so there's one more Rustacean in this world. Congratulations.
Thank you.
So, can you say a few words about KSAT? I know that the slogan is
"We Connect Space And Earth," and I really like that, but what is it about?
KSAT is the abbreviation of the company, which is... Kongsberg Satellite Services.
So we're getting data from space to Earth, and then we're using that data.
So ground network operations and Earth observation networks.
I work in the ground network, which
is our distributed network of antennas situated all around the world.
And we enable satellite owners to talk with their satellites and get their data.
A lot of people only know about satellite technology from television or from popular science.
And the knowledge they have is probably rooted in the 60s and 70s.
But a lot has happened since then.
What has happened since the 60s?
Yeah, the satellite industry was traditionally operated by satellite companies,
and they were using their software to just deliver on whatever their satellite had.
And satellites themselves started back in the late 50s, with the Soviets launching
Sputnik, and it was very expensive. I mean, launching a satellite took a government agency.
So all the way until basically SpaceX and a lot of the other newcomers in the satellite
business, or in the launch business, came along, it was extremely expensive to
launch satellites. So it was mostly just agencies and government entities that
could afford to put satellites into orbit.
And some of those satellites are geostationary satellites delivering your
satellite communications for TV or for your sat phone, if you had that.
And in the old days, it was mostly communication-based, but NASA and ESA also
launched scientific instruments to monitor the Earth or to monitor the Sun or
to send probes into outer space to do some other readings.
And the way satellites communicate is almost exclusively through radio frequency
communication, on different wavelengths of the radio frequency spectrum.
And the type of wavelength you use determines the quality of your transmission.
And Earth observation satellites, very close to the Earth, they orbit the Earth
maybe 14 or 15 times a day, and they can produce a lot of data.
And as the instruments have gotten better, the resolution of whatever measurements
they're doing is getting higher.
The amount of data is getting higher. And the main way to actually
get data down is to have contact with a ground station.
And you have limited visibility over a ground station.
So you only get like 10 to 15 minutes of visibility maximum, that's peak, and
you have to push down gigabytes of data. So the amount of data we're talking
about is ever increasing.
Yeah, thanks for the overview, but one thing I always wondered as somewhat
of a bystander is: what is the standardization of the communication protocols? Do
we keep using the same protocols since the 60s, or does every satellite have
its own protocol, or is it something in between?
It's everything and nothing, unfortunately. So there is a standardization body
called CCSDS that a lot of the government agencies have contributed to since the
early days, the 80s-ish, if I remember correctly.
So a lot of the hardware-related radio frequency protocols and how to handle
data on the physical link have a lot of different standards.
And in order to push data over the air, you also need some error correction
and you need to be able to sequence your data. Just like TCP/IP,
there's an equivalent standard in the space industry.
Coming into the new space era, there are a lot of
new contenders on the market that are software
companies using spacecraft, and
not spacecraft companies using software. They're
also not really following some
of these standards from the agency era, so you get a lot of compatibility issues
where you're basically having to custom fit: okay, how do we talk to this spacecraft?
Because this is a new software company that has just looked at the standard and said,
ah, we don't really need this. We'll do it our way. And it works for them.
But at some level, you have a minimum viable product that you can share on a radio frequency level.
And most people are compatible with that. But after that, all bets are off.
Sounds like that approach would generate a ton of legacy code in a very short time.
Yeah.
Now, let's talk about the size of operations at KSAT.
KSAT started off 25 years ago as a company, and we started off with one antenna
and one customer, and that's about it.
And as the market shifted into new space, with all these new software actors really
exploding the number of satellites launched into space, KSAT followed suit and built
up both its antenna park, in how many antennas we have, and how many employees and
engineers we have to deal with this.
And at this point in time, we're roughly in the ballpark of between 100 and 300 active antennas.
It is one of the biggest providers of commercial
ground station services.
I think the official website mentions 23 sites worldwide, which sounds crazy to me.
What is a site specifically, and what goes into maintaining it?
An antenna site for us is mostly a place where we need a lot of power and
we need a fiber optic cable, hopefully.
We don't have that at every site, but what qualifies as a good site is that
it's far enough apart from any other site we have, and that it covers a lot
of ground we don't actually get from other sites in the vicinity.
And the placement of the sites usually depends a bit on what orbits the satellites go in.
So the satellites usually have two orbits that are relevant.
There's the polar orbit, where they go from pole to pole, and then there's the other
one where they just follow the equator.
And if you only have a ground station at the equator and you have a polar orbiting
satellite, you only get visibility twice a day.
But if you have a ground station near the poles, you get 10,
12, 14 contacts a day. So it really depends.
Each contact has a duration of anything between 5 and 15 minutes, really.
And that can generate anything from a few gigabytes to 100 gigabytes per contact.
Data processing can come later, but the data exchange happens during that time frame.
The data exchange between the satellite and the ground station, yes.
And because of the volume of data increasing so much, our main concern going
forward is not really building enough antennas, it's actually just building
enough infrastructure to handle all this data.
Because there's so much data and you need to push it around and you need to
provide it to the customer in a reliable fashion.
And the networking can be quite unreliable between a remote site in Canada or
in New Zealand and a customer on the west coast of the US.
That's a challenge really going forward.
Okay, so to summarize, the setup is a bit like this.
You have a ton of satellites circling the Earth on a regular basis.
They go around the Earth 10 to 15 times a day or so, roughly like that.
And then on the ground, you have antennas on ground stations.
And then these antennas, they connect with the satellites, do the data exchange,
and then you need to send the data over, say, fiber to a central place.
Usually it's delivered straight to the customer, but due to our volume of data we
also have to temporarily store it on the site itself, without losing data in the process. So, yeah.
Two things come to mind. First, it needs
to be extremely reliable, because if you lose the data, that is a big outage and
probably a loss to the customer as well. And the second part is: how often can
you make changes to that code? How often can you modify the code? That also needs to be reliable.
I'm guessing you probably even have limitations as to how often you can access
these ground stations and make changes.
Yes, that is correct, that has to be really reliable. But you're actually right on
point with how we update the code, because it's not
that we're using the antenna 100% of the time, but the ecosystem around the antenna,
with our software running on different hardware close to the antenna, is not
easy to access. I don't have access to it, for instance, so I just have to
push code and hope that someone else deploys it. Worst case, it can take
weeks before something is deployed worldwide. That
is a process we're obviously trying to optimize and get better at,
but it is a pain point, because we also have
inaccessible sites. And the most inaccessible site we have is probably our
Troll station in Antarctica, which also doesn't have a fiber optic cable, so anything
that you put down there we also have to beam up to a geostationary satellite
so we can beam it down to Earth again at a place where we have fiber.
I guess the huge advantage here is that for code that is written in Rust,
you could just deploy a static binary and people would just be able to run it on the deploy target.
It's generally that easy. I mean, everything we do nowadays is dockerized.
On all our ground stations, we're running some variant of Kubernetes and just
running it on OpenStack or Kubernetes directly.
One could think that since you operate the ground stations and you probably
have access to a rack or so, you're not resource-constrained.
But one thing that people might forget is that you don't do constant updates
to the hardware over there.
We're definitely resource-constrained on a lot of our sites.
Not all of them, but a lot of them. It can take us eight months to get a new
computer just ordered from our vendor, and then we have to ship it to anywhere
in the world, and you have to get people there on-site to install it.
So we are resource constrained in the fact that we don't want to over-provision
every data center around the world near to all our antennas on our ground station sites.
Because, first of all, we don't necessarily have the resources to do that,
and we don't have the ability to do it at some point.
So it's nice to use something that doesn't hog all the resources.
Wouldn't it then be super easy
to fall into a trap of being extremely conservative about tech decisions?
People might associate space technology with a lot of very old conservative
technology, and maybe for a good reason, because it's tried and tested.
I think the satellite industry, or space industry, is definitely very conservative.
It takes a lot of effort to qualify something to run in space. I know at
RustConf last year, one of the sponsors was K2 Space.
They're actually a space company, with a lot of recruits from AWS
and SpaceX, that wanted to do everything in Rust. They wanted to build the satellite.
They wanted to build the firmware. They wanted to build all the ground resources.
100% in Rust. They had a lightning talk at RustConf. It's probably out on YouTube.
So there are definitely contenders out there that don't want to be so conservative.
But old space, they are very conservative. I wouldn't say that's
necessarily true on the ground. The ground is a bit more like: we can touch this,
we can fix it. It's not the same in space.
Earlier, you said there was a shift in the industry, so
we moved from space companies using software to
software companies doing space things. Mostly two
companies come to mind right away: one would be SpaceX and the other one would
be Blue Origin. But I'm assuming that's just a tiny little slice of the picture,
and maybe there are other software companies that I might not have heard of that
pushed into the space domain.
Into the space domain, yes. You also have a few other providers out there
trying, and successfully doing
so, like Rocket Lab. But these
are launch providers; they're facilitating the software
companies to launch something into space. But otherwise,
AWS is actually going right
at it, and they're going after the data primarily. They want the data, because
data is AWS's business model, and there's a lot of data in space. A couple
of years back, or three or four years back, they launched a ground station service,
which I wouldn't say is a direct competitor to us, but they are definitely a competitor.
And we have made a strategic partnership with AWS to be
a ground network of network providers. So people can come to us and they can
use the resources in AWS, their antennas, their setup, but they can do it through us.
But the business model is a bit different, because AWS, as I said, is a data company.
They really just care about getting the data into the AWS data center,
so you can do whatever you want with it there.
So the space part is just a means to an end, really.
So we move from space exploration to data exploration.
What has changed on the language side? How did the story go at KSAT?
Initially, everything was engineers writing Perl scripts and just making it work.
And that has scaled very well, but it's still written in Perl,
and it's not the newest version.
And at some point, we needed to have a bit more control of whatever is running
on our antennas. And that was developed in Java in the mid-2000s with an Oracle
database. And that has scaled well.
We're very thankful for the legacy that was provided to us so that we can even
be here today to do something else at a bigger scale.
Because that would not be possible without the humble beginnings.
The 2000s were definitely the time of Java. It has some really nice traits,
and I think it resonated well with the challenges of its time.
But then what happened in the 2010s at KSAT?
Yeah, so at some point we kind of scaled up, with a few more developers and
a bit more modern scripting, and Python kind of took over.
We have multiple Python applications still in production today from that era.
But yeah, we started to see that that Java application,
and not necessarily Java in itself, but just the database and all the Perl integrations
that unfortunately had direct database access,
meant that we had a distributed network all over the world with scripts being
able to access the raw contents of our database.
And that was not very scalable. We launched
an initiative to move away from this world into
a more modern world where we can have more
control over the life cycle of the data that
we put in the database. 20-25 years ago, everything was FTP and XML drop boxes,
and you can call that an API as well, but we decided that we could try to offload
responsibility into a segregated new Postgres database where access to the
data is tightly controlled through an HTTP API.
Yeah, so we're employing a sort of strangler pattern on that, just trying
to rope in responsibilities and kind of rewrite and repurpose them, and we have
successfully launched a competing solution in-house now, where half of the antennas
are on the old system. The old API was written in Perl, and it was strangled on the
HTTP layer into a Kotlin application,
and then we nipped at it and moved different responsibilities and endpoints around. And now,
from a responsibility point of view, I would say it's 40/60 in Rust
right now, but a lot of the boring parts are in the Kotlin application. We're
actively working to migrate the remaining Kotlin portions over to Rust as well.
Earlier, you mentioned the strangler pattern. How does it work?
So the strangler pattern is very
convenient when you have a code base
or an interface layer where you
know very well what's
going in and what's going out, and you know that everything below
this is just a complete mess and you don't
understand anything, but you understand the interfaces or
the boundaries. And you can replace the implementation
under each boundary with very
great control, see the differences in implementation and behavior, and
make it entirely seamless to all consumers that you have actually done anything,
which is very nice. But it requires that you have some sort of abstractions
that actually make this feasible.
And at an HTTP API layer, it's very easy, because the contract is in how the API
responds or what parameters it takes.
And you can replace that in any language.
It's not really that hard. It's just a lot of verification that you've actually
replicated all the behavior.
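(Editor's note: a minimal sketch of that idea at the HTTP layer, assuming a hypothetical `/antennas` endpoint that has already been rewritten in Rust; everything else falls through to a placeholder standing in for the legacy application.)

```rust
use axum::{http::StatusCode, response::IntoResponse, routing::get, Router};

// Hypothetical endpoint that has already been migrated to the new Rust service.
async fn list_antennas() -> impl IntoResponse {
    "[]"
}

// Stand-in for forwarding to the legacy application. A real setup would proxy
// the request; this stub only exists so the sketch compiles and runs.
async fn forward_to_legacy() -> impl IntoResponse {
    (StatusCode::BAD_GATEWAY, "would proxy to the legacy service")
}

fn app() -> Router {
    Router::new()
        .route("/antennas", get(list_antennas))
        // Every path not yet migrated keeps hitting the old implementation.
        .fallback(forward_to_legacy)
}

#[tokio::main]
async fn main() {
    let listener = tokio::net::TcpListener::bind("0.0.0.0:8080").await.unwrap();
    axum::serve(listener, app()).await.unwrap();
}
```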
Now, let's focus on the API for a second.
You mentioned that it's a split between Kotlin and Rust at the moment.
Where do you draw the line?
I don't think there's a natural split now, other than whatever developer or
team took that responsibility and what they were comfortable with.
So we've had a very open policy at KSAT on what languages we would use to solve
whatever problem. And there has definitely been pushback against introducing Rust in
some capacity by some team members in different teams.
And I'm not necessarily sure all of their concerns are what I would call
valid, but there are definitely concerns.
And some of the pushback I've heard is usually that it's not mature enough or the ecosystem is not there.
And I feel that is a sentiment that is often held about Rust that I'm not necessarily
sure is true anymore, because I feel the ecosystem is very much present.
I can do everything I want in the ecosystem in Rust today.
And the other part is maybe just a lack of knowledge of how you use such
complex terms, because it comes from a systems background.
And a lot regarding borrows and lifetimes and stuff like that
can seem a bit intimidating
to someone that's usually just very happy in their Java or .NET environment,
where that is not necessarily a concern for 99% of what they're doing.
There are also positive receptions of Rust.
And I have personally been able to, I don't know, convert a couple of teams to use Rust.
So yeah, we're approximately three or four teams now using Rust in production
at KSAT with maybe four-ish people in each team that's actively writing Rust.
How does that usually go for you when you approach a team and they are curious
about Rust, but they are not entirely convinced yet?
The conversation often goes in the direction of this is what is very good about
Rust, and that's what I start with.
And you have to make some concessions. And the concessions are obviously just, is it a good team fit?
Because I don't think Rust is hard to use once you've gotten over that initial,
whoa, what happened here? It's a shock.
But a lot of teams have their experiences in their toolboxes in other languages
and know how to solve them.
And if you don't really have a champion on that team itself,
I don't think it's possible to really introduce Rust into a team because the
team has to embrace it themselves.
So it's a no-go if the team is not championed from within, really.
So my job is more just like I try to do some good mentoring and try to have
some common guidelines and try to curate some crates and make some internal
crates that help the process along internally with the tooling and the way we do things.
But ultimately, you require that team champion as well to be on your team.
What's your success rate here? Have you lost some of these battles?
Not on a team level, but maybe on an individual level, yes.
But the general vibe is that it's going more and more into Rust for a lot of our distributed systems.
And just because it's so nice to use once you actually get to know it.
So it's just that hurdle of inviting people in that haven't used it before.
I'm almost too afraid to ask it, but has Go ever come up in that conversation?
Go has come up multiple times, and we have production code in Go as well.
I'm a bit annoyed at that sentiment as well, because Go is... maybe annoyed is
not the right word, I'm a bit intrigued by the "why don't we just do it in Go?"
Because Go 1.0 was released in March 2012.
It's three years older than Rust; at this point they are 13 and 10 years old.
It's not that big of a difference.
In terms of age, but in terms of functionality and in terms of developer ergonomics, maybe?
Yeah, but Go was a very simple language to begin with. So it was very easy to get going with Go.
But I also think that there is an ecosystem in Go,
but the ecosystem is harder to engage with than the Rust ecosystem, because
the tooling, with cargo and its kin, is just miles above any tooling you have in Go.
So that makes it, for me, also a no-brainer, just because,
disregarding the language itself and the features and ergonomics of the
language, just the tooling and the ecosystem around using the language is what
makes Rust the number one contender on the market.
Go is very much a day-one language. Starting a project and getting to your
first production version is usually very ergonomic, very quick, very elegant.
The problems start to arise on day two. Not exactly day two,
but when you have a larger code base, you feel the limitations of the language and of the ecosystem.
It's trying to constrain you somehow. It almost feels like it's strangling you.
And you're not strangling it.
I probably would have made the same decision in your position, of course.
Obviously, I'm biased, but you have to maintain this software for a very long time.
Yeah, so definitely. From my experience point of view, it's just being able to model
your code in a way where you just know where the boundaries of what you've made are.
And it's very easy to move that along and refactor it.
So back, eons ago, I was a C and C++ developer, and I did a bit of this and a bit of that.
And just trying to refactor a C++ code base and having confidence that you've
actually done it correctly, I have never had that.
But Rust, if it compiles, it works. It basically is that.
And that sentiment is overused, I think,
but it still feels very true at some point, because the
compiler is so powerful. Whenever it compiles, I'm confident. And I also have
a few tests here and there, and when the tests pass as well, which they do 99
percent of the time after I've done a major refactor, I'm confident I will push, no problem.
Funny that you say you have a few tests here and there. Does that mean you lean into
Rust's strong type system a lot as well?
And maybe you don't have to write as many tests as you would have to write
in other, more dynamic languages like Python.
Oh, definitely. With our tests, I think there's a concept called diamond-shaped
testing or something, where you basically have very few unit tests,
you have very few system tests, but you have a lot of integration tests.
And those integration tests are placed on the boundaries of the network layer, so HTTP.
All my
tests are basically just HTTP-related API tests,
because I don't really care how the structs
or functionality within the Rust code base behave. What's important
is just the contract at the HTTP boundaries. So we have a few tests that go down
to the database over the HTTP layer, but from a unit test point of view, almost nothing.
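(Editor's note: as an illustration of testing at that boundary, here is a minimal sketch of the common axum pattern of driving the router in-process with tower's `oneshot`; the `/health` route is a made-up stand-in for a real endpoint.)

```rust
use axum::{body::Body, http::{Request, StatusCode}, routing::get, Router};
use tower::ServiceExt; // provides `oneshot` for calling the router without a socket

fn app() -> Router {
    // Hypothetical endpoint standing in for a real API route.
    Router::new().route("/health", get(|| async { "ok" }))
}

#[tokio::test]
async fn health_endpoint_returns_ok() {
    let response = app()
        .oneshot(Request::builder().uri("/health").body(Body::empty()).unwrap())
        .await
        .unwrap();

    // The test only asserts on the HTTP contract, not on internal types.
    assert_eq!(response.status(), StatusCode::OK);
}
```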
But in order for that to work, you would have to lean very heavily into the
Rust mechanics, into the type system, and you would have to rely on it.
Are there patterns that you commonly use to fully embrace that part of Rust?
Yeah, so I use quite a lot of newtypes.
For instance, a UUID, I will newtype it into a variant that represents this resource.
Meaning that the API layer is very communicative of what it's actually expecting,
or not the API layer, but the code base itself that serves it.
So it's very easy to modularize different components that work in some form
of hierarchy, because the types are so strong that you can convey so much with
both the primitive types themselves, but also sum types in the form of enums.
both the primitive types themselves, but also some types in form of enums.
The one thing I miss every time I go to any other language is just the enum.
I think this I could model very well in an enum, and I don't have this capability. And it saddens me.
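(Editor's note: a minimal sketch of that newtype idea, with hypothetical resource names, wrapping `uuid::Uuid` so an antenna ID can never be passed where a contact ID is expected.)

```rust
use uuid::Uuid;

// Newtypes over Uuid: same runtime representation, but the compiler keeps them apart.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct AntennaId(pub Uuid);

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct ContactId(pub Uuid);

// The signature now documents exactly which resources it expects;
// mixing up the arguments is a compile error, not a runtime bug.
pub fn schedule_contact(antenna: AntennaId, contact: ContactId) {
    println!("scheduling contact {contact:?} on antenna {antenna:?}");
}

fn main() {
    let antenna = AntennaId(Uuid::new_v4());
    let contact = ContactId(Uuid::new_v4());
    schedule_contact(antenna, contact);
}
```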
Do you have an example of an enum that comes to mind, where modeling
some certain business logic was very ergonomic?
So I'm a big fan of the oneOf pattern.
It is represented in, for instance, OpenAPI definitions.
There is a oneOf you can represent there. Doing code
gen for oneOfs in OpenAPI to any other
language is horrible, but code
gen to Rust was very easy to use. And being
able to represent the fact that this resource has different
properties depending on which kind it is, is very powerful. Because even though
at some level you're talking about this one resource, it has one resource ID, but
it can manifest itself in different forms, or different versions, or represent
different physical attributes on the network.
And in some abstractions, you don't really care about those properties, but in others you do.
It's very nice to be able to represent just the exact properties that are present,
and not a load of optionals that are present only if this is true and that is true.
Then you have to carry that logic throughout the code, and that makes it harder to refactor as well.
If you know that this can only be set if this other value is set,
that's an invariant in your code that you can encode with the enums instead.
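(Editor's note: a hedged sketch of what that can look like with a serde-tagged enum, using made-up fields; each variant carries only the properties that exist for that kind, and the `tag` attribute plays the role of the oneOf discriminator you'd express in OpenAPI.)

```rust
use serde::{Deserialize, Serialize};

// Hypothetical resource: one ID, but the payload differs per kind.
#[derive(Debug, Serialize, Deserialize)]
#[serde(tag = "kind", rename_all = "snake_case")]
pub enum AntennaLink {
    // A fiber uplink only has a bandwidth figure...
    Fiber { bandwidth_gbps: f64 },
    // ...while an RF link has band and polarization, and neither variant
    // needs a pile of Option fields guarded by runtime invariants.
    RadioLink { band: String, polarization: String },
}

fn main() -> Result<(), serde_json::Error> {
    let json = r#"{ "kind": "radio_link", "band": "X", "polarization": "RHCP" }"#;
    let link: AntennaLink = serde_json::from_str(json)?;
    println!("{link:?}");
    Ok(())
}
```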
I'm not too familiar with it, but I know that in a schema you can say this is
one of these variants, one of these kinds, and I guess it maps really well to enums.
If you go one step further, you're probably also using the serde ecosystem
and saying: this is my input type, and so I convert it from the schema.
So we're leaning heavily into serde. It's an excellent library.
Any other crates that you personally like for that sort of work?
You usually have to do some customizations on top of serde, with serde_with or
stuff like that, to actually do the proper transformations.
I've also been experimenting now with Utopia to generate OpenAPI specifications.
It's called utoipa. It's a very common misspelling, unfortunately.
I made it a dozen times until someone pointed it out.
Yeah, I will probably continue to misspell it.
The reason why it's called utoipa, by the way, is that IPA is API backwards.
And it's also a good beer. That's from the README.
Of course. Sorry. Yeah.
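(Editor's note: for context, a minimal sketch of how utoipa derives an OpenAPI schema alongside serde; the `Pass` type, fields, and path are hypothetical.)

```rust
use serde::Serialize;
use utoipa::{OpenApi, ToSchema};

// The same struct drives serialization (serde) and the OpenAPI schema (utoipa).
#[derive(Serialize, ToSchema)]
struct Pass {
    // Hypothetical fields for a scheduled satellite contact.
    satellite: String,
    duration_minutes: u32,
}

// The path attribute documents the handler in the generated spec.
#[utoipa::path(get, path = "/passes",
    responses((status = 200, description = "Upcoming passes", body = [Pass])))]
async fn list_passes() -> axum::Json<Vec<Pass>> {
    axum::Json(vec![])
}

// Collects the annotated paths and schemas into a spec you can serve or export.
#[derive(OpenApi)]
#[openapi(paths(list_passes), components(schemas(Pass)))]
struct ApiDoc;

fn main() {
    println!("{}", serde_json::to_string_pretty(&ApiDoc::openapi()).unwrap());
}
```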
One slight issue I have with serde is that it's very versatile,
but it doesn't really give you that great of a structured way of accessing errors.
And that bugs me a bit, because I really want to give good structured feedback on our API surfaces.
And I don't want to fork serde just to fix that, because then I'm incompatible with everything.
I'm not entirely sure how to solve that on an ecosystem level.
But right now, I've just wrapped the outputs and parsed the strings to extract
the vital information that I want.
But I would definitely like to see a bit more structured error responses on
what went wrong in the serialization process.
I personally see serde more as a contract.
You have the value type you have to serialize, you have these traits,
and that's your building block.
So what keeps you from building structured error messages from these smaller building blocks?
Because the serde error type doesn't give you... well, I think it's possible,
but we're using JSON, serde_json,
because that's what we communicate over, and the serde_json error type eradicates
any references to which field, for instance, the error was at.
So you have to parse the stringified message to extract that it was at this
field, or you have to fork serde_json and fix it there.
I could probably do that as well, but I've seen in multiple JSON parsing libraries
that the level of programmatic access to the variants is not that great.
But other than that, the serde ecosystem is amazing. You can do a lot of stuff with it.
You just have to be a bit more forgiving in how you output the errors to the end
user, because that's kind of what matters here. I mean, for me as a programmer,
I don't really care, but it's the consumer of the API that cares.
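(Editor's note: a small sketch of the kind of wrapping this means in practice. `serde_json::Error` exposes line and column programmatically, but the offending field name only lives in the rendered message, so the hypothetical `missing_field` helper below scrapes it out of the string.)

```rust
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct Contact {
    satellite: String,
    duration_minutes: u32,
}

// Hypothetical helper: serde_json's message for a missing field looks like
// "missing field `duration_minutes` at line 1 column 27", so we pull the
// backtick-quoted name out of the rendered string.
fn missing_field(err: &serde_json::Error) -> Option<String> {
    let msg = err.to_string();
    let start = msg.find('`')? + 1;
    let end = msg[start..].find('`')? + start;
    Some(msg[start..end].to_string())
}

fn main() {
    let input = r#"{ "satellite": "NORSAT-1" }"#;
    match serde_json::from_str::<Contact>(input) {
        Ok(contact) => println!("parsed: {contact:?}"),
        Err(err) => println!(
            "bad request at line {}, column {}, field {:?}",
            err.line(),
            err.column(),
            missing_field(&err)
        ),
    }
}
```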
From what I can tell from our conversation so far, stability is the main focus.
From listening to a lot of your other guests
on this podcast, they're doing a lot of cool shit, and
it's very fun to listen to. But I get the feeling that our Rust usage
is boring: we're just using the top-level web frameworks, and sqlx and axum
and serde, and just putting it all together and making it work. I have a good example of that, because
a couple of months back we needed to do some changes in a few of our running services,
and I went into the repository for that service to actually fix it,
and I saw the last commit was one and a half years ago, and it's just been running. One and a half years.
I haven't touched it, and I have never had that experience in my professional career.
That service was the main authentication and authorization service that authenticated
and managed every API key and principal, so it was used on every request.
It's really chugging along, so it's amazing. I've had only good experiences on that front.
Did you also have any bad experiences with Rust?
You can call it a bad experience, but I would camouflage it as a good experience.
So we've been running an on-prem cluster for many years, and that on-prem cluster
hasn't really gotten that much love and attention.
So it's just chugging along with the resources it had six years ago when it was installed.
We also do a lot of calculations regarding satellite trajectories and visibilities
over our ground stations and stuff like that.
So one of the things I wanted to calculate was just:
okay, when is a satellite visible over
our ground stations? And we support quite
a lot of satellites, and we have a lot of ground stations, so there's a
lot of math to figure out when you are where,
and when can I talk to you. And I naively
just put everything in a loop, and then
I slammed rayon on it, and I pushed
it to production. And a couple of days later, one of
my DevOps team came and said: our production cluster is running at
80% CPU, it's struggling a bit, and the majority of it is from the service I just
updated. And yeah, the computations worked fine, but it had a wider impact on our
other production services.
So it's too performant.
Too good. I had to dial that back.
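(Editor's note: a hedged sketch of what "dialing it back" can look like with rayon: instead of the global pool grabbing every core on the node, a dedicated pool with a capped thread count runs the visibility loop. The satellite and station types and the `visibility_windows` function are made up.)

```rust
use rayon::prelude::*;

// Made-up stand-ins for the real orbital math.
struct Satellite { id: u32 }
struct GroundStation { id: u32 }

fn visibility_windows(sat: &Satellite, station: &GroundStation) -> usize {
    // Placeholder for the expensive trajectory/visibility calculation.
    (sat.id as usize * station.id as usize) % 7
}

fn main() {
    let satellites: Vec<Satellite> = (0..500).map(|id| Satellite { id }).collect();
    let stations: Vec<GroundStation> = (0..120).map(|id| GroundStation { id }).collect();

    // Cap the pool so the computation cannot eat the whole production node.
    let pool = rayon::ThreadPoolBuilder::new()
        .num_threads(4)
        .build()
        .expect("failed to build rayon pool");

    let total: usize = pool.install(|| {
        satellites
            .par_iter()
            .map(|sat| stations.iter().map(|st| visibility_windows(sat, st)).sum::<usize>())
            .sum()
    });

    println!("total visibility windows: {total}");
}
```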
Okay, I can see how that might also be a benefit, or how I could see it as a win.
But are there any other issues with the wider Rust ecosystem that come to mind?
Yeah, I mean, we're a big user of async, because we're using axum and everything
is just on a tokio runtime, and it just works very well, just using basic futures
to handle HTTP requests and futures to send database queries and get responses.
And that just works very well. But when you're trying to combine that with a
future in the HTTP layer that also does some computations, we ran into some issues.
So a few months back, someone used our API in a way that we hadn't anticipated.
And there was too much traffic on something that blocked, and just everything,
everything just stagnates, and response times spike, and it affects everything.
And just trying to hunt down where we actually block or do computation for so
long that you're starving the tokio runtime, that was very challenging.
What I see a lot is teams using their development laptop to start a larger tokio
application with, say, 16 or 32 cores.
And then when they deploy the same service to production, it ends up running
on a two-core node, and obviously that's a completely different environment.
Was it one of these cases where
the production system was very resource-constrained, and when you tested it in development, it was not?
The problem manifested itself when the traffic increased
enough to actually trigger it. So we
didn't really trigger it ourselves; we could reproduce it locally at some point once we
actually knew what traffic to induce. We had some inklings of when stuff
went wrong, but it was quite a goose chase down this set of futures: where
do we actually, how do you measure what blocks?
And trying to use tooling like tokio-console, it's a great project,
but it's just not insightful enough at that level yet.
So I would say the tooling is probably not there yet for the abstractions we need
to be able to efficiently bisect where the issue is and how do I solve it.
Solving it is very easy in tokio. You just spawn it on the blocking runtime
and it's fine. But it's definitely something to be aware of. So it's a pitfall
for newer developers, and it got us as well.
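(Editor's note: a minimal sketch of that fix inside an axum handler, with a made-up `heavy_visibility_calculation`: the CPU-bound work moves onto tokio's blocking pool so the async worker threads keep serving requests.)

```rust
use axum::{routing::get, Json, Router};

// Stand-in for a CPU-heavy computation that would otherwise starve the runtime.
fn heavy_visibility_calculation() -> u64 {
    (0..10_000_000u64).sum()
}

async fn visibilities() -> Json<u64> {
    // Run the blocking work on the dedicated blocking thread pool instead of
    // holding up one of the async worker threads.
    let result = tokio::task::spawn_blocking(heavy_visibility_calculation)
        .await
        .expect("blocking task panicked");
    Json(result)
}

#[tokio::main]
async fn main() {
    let app = Router::new().route("/visibilities", get(visibilities));
    let listener = tokio::net::TcpListener::bind("0.0.0.0:8080").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}
```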
The typical pattern is that you see a spike
in CPU, and there's really
not much traffic coming in anymore; it blocks at the API layer. But in reality,
your CPU is super busy with some computation. But then it still doesn't tell
you where that computation happens. You just need to dig deeper and understand
the business logic of it all.
So that's also where the distributed tracing you have in an application,
and how you have insight into that, comes to mind. I also like the tracing
ecosystem, it's very good, love it. But figuring out how you actually use tracing in
a distributed sense,
it's a learning curve where you basically have to puzzle the pieces together
yourself to figure out how you actually get the correct level of tracing in
the applications and across applications.
That's also probably an area where there would be a good fit for some higher-level
abstraction crates for server applications that just need to have good
defaults on everything.
Do you use tracing across language boundaries or just within the Rust context?
We use the W3C Trace Context standard to send traceparent headers to correlate
tracing information across applications, and that works fine.
We set up our own tracing infrastructure using the tracing crate with a custom
subscriber for Azure App Insights.
App Insights is a good service, but it's also quite expensive. But
just knowing where to wire up what, what you need
to call when, and where in
tracing, and how you model that into whatever
subscriber you have: using, for
instance, OpenTelemetry versus Honeycomb versus App
Insights, they all have different behavior in how you open spans and when you
close them and how you annotate them and when you actually send the event. It's
a learning curve. Just employing correct tracing in your application is not
something that's extremely easy to understand.
So you usually spend a few months on it.
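(Editor's note: as a small illustration of traceparent propagation, a hedged sketch of an axum middleware that copies the W3C header onto a tracing span. The span and field names are assumptions; a full setup would hand the header to an OpenTelemetry propagator instead.)

```rust
use axum::{extract::Request, middleware::{self, Next}, response::Response, routing::get, Router};
use tracing::Instrument;

// Copy the incoming W3C `traceparent` header onto a tracing span so downstream
// log lines can be correlated with the caller's trace.
async fn trace_context(req: Request, next: Next) -> Response {
    let traceparent = req
        .headers()
        .get("traceparent")
        .and_then(|value| value.to_str().ok())
        .unwrap_or("")
        .to_owned();

    let span = tracing::info_span!("http_request", traceparent = %traceparent);
    next.run(req).instrument(span).await
}

#[tokio::main]
async fn main() {
    tracing_subscriber::fmt::init();

    let app = Router::new()
        .route("/health", get(|| async { "ok" }))
        .layer(middleware::from_fn(trace_context));

    let listener = tokio::net::TcpListener::bind("0.0.0.0:8080").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}
```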
From our conversation so far, it feels like a lot of services run on Azure, or the
cloud in more general terms. But how does that relate to whatever you do on the ground stations?
The API layers we've developed over the years, it's been primarily,
as you say, in a cloud setting, but due to the widespread nature of our antennas
and where they are, we're also resource constrained, as we touched on earlier,
on the resources we have on each antenna.
And our challenges are often related to having code running there that can run
forever and not have any downtime, really.
Many years ago, we deployed at least one service on each antenna stack throughout,
which is written in Rust.
And it's just responsible for ping-ponging back whatever is in the cloud:
what should I do on this antenna? So it's what we call our scheduler.
We schedule anything and synchronize what we have there. And that has been running
flawlessly on 120 antennas or something for three years now.
I think I've had two bugs in it, and they've been purely logic bugs.
The problem with bugs in that is that when there's a bug there,
it affects everything. Because, yeah, nothing is happening on the antenna if the scheduler is down.
Other than that, we also have data distribution, and just pushing metrics from
our baseband equipment at the antennas. And everyone wants to consume those:
our customers, system engineers. A
big part of our infrastructure is also just having the correct tooling on each
antenna to be able to send out this information, and we're using Rust for that as well.
It's incredible how far you are in your Rust journey already. I had no idea, really.
About the scheduler: what inputs does it take and what outputs does it generate?
So it's running an in-house
protocol to synchronize whatever schedule is available
in the cloud, and the cloud database is
the source of truth. The scheduler
on each antenna site is just figuring out what to
synchronize from the cloud, so it can operate autonomously in
case of network failure. Without network connectivity,
we can still operate and take your contacts. And yeah,
so it just synchronizes whatever it needs
there over a custom HTTP protocol, and
as a contact is about to begin, it kicks off an event to another service, which
we call the controller. It's the controller of the entire contact; it just controls
all the baseband equipment and whatever firewalls and whatnot need to be opened
and controlled. It's a just-in-time scheduler.
And the reason why it doesn't pull everything is resource constraints again.
Yeah, and it also doesn't need to have the full state of the entire database,
because from the cloud it only needs to know: what do I need to do? It would be very
inefficient to synchronize the remote state from the cloud to every antenna.
That would not be feasible.
But the calculation for knowing what it needs: is that CPU-bound, or is the focus
again on reliability here?
Solely on reliability. So for scheduling or synchronizing,
it's configurable per scheduler, but usually it's deployed one to three
days ahead, so it can run for a while if we're
losing network connectivity, and we can still salvage a lot of data even if we
don't have connectivity to the cloud.
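(Editor's note: to make the shape of that concrete, a purely illustrative sketch of a just-in-time scheduler loop. The endpoint, the `Contact` shape, and the controller call are all hypothetical, not KSAT's actual protocol.)

```rust
use serde::Deserialize;
use std::time::Duration;

// Hypothetical contact entry as it might be synced down from the cloud.
#[derive(Debug, Deserialize, Clone)]
struct Contact {
    id: String,
    starts_in_secs: u64,
}

// Stand-in for kicking off the controller that drives the baseband equipment.
async fn notify_controller(contact: &Contact) {
    println!("starting contact {}", contact.id);
}

async fn fetch_schedule(client: &reqwest::Client) -> reqwest::Result<Vec<Contact>> {
    // Hypothetical cloud endpoint returning the next one to three days of contacts.
    client
        .get("https://cloud.example.com/api/schedule?window_days=3")
        .send()
        .await?
        .json()
        .await
}

#[tokio::main]
async fn main() {
    let client = reqwest::Client::new();
    let mut local_schedule: Vec<Contact> = Vec::new();
    let mut sync_tick = tokio::time::interval(Duration::from_secs(60));

    loop {
        sync_tick.tick().await;

        // Refresh the local copy when the cloud is reachable; otherwise keep
        // operating on what was synced earlier.
        match fetch_schedule(&client).await {
            Ok(schedule) => local_schedule = schedule,
            Err(err) => eprintln!("sync failed, running on cached schedule: {err}"),
        }

        // Kick off any contact that is about to begin.
        for contact in &local_schedule {
            if contact.starts_in_secs < 60 {
                notify_controller(contact).await;
            }
        }
    }
}
```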
Very impressive. But that means the entire chain, from the satellite all the way
to the customer, is at least in
parts written in Rust nowadays. What is your message to the Rust community?
I think my primary message to the Rust community is just polish up async.
Get it to be the best experience it can ever be.
There are some pitfalls now, even though the Rust 2024 edition stabilized async closures.
Very happy about that. But there are still some questions around observability
of what is happening within an async context, and how do you navigate that?
And just, yeah, getting to the bottom of issues related to,
as I said, the blocking issues we have, and cancellation safety and drop
safety and async drop and all these paper cuts that are just not completely answered.
That would be my message: really polish that up. That would make selling Rust to others much easier.
Yes, I could get behind this. Vegard, thanks so much for taking the time and for being a guest today.
It's my pleasure. Thank you for having me.
Rust in Production is a podcast by corrode. It is hosted by me,
Matthias Endler, and produced by Simon Brüggen.
For show notes, transcripts, and to learn more about how we can help your company
make the most of Rust, visit corrode.dev.
Thanks for listening to Rust in Production.