Rust in Production

Matthias Endler

Oxide with Steve Klabnik

In this episode, I talk to Steve Klabnik, a software engineer at Oxide and renowned Rustacean, about the advantages of building hardware and software in tandem, the benefits of using Rust for systems programming, and the state of the Rust ecosystem.

2024-11-14 113 min

Description & Show Notes

What's even cooler than writing your own text editor or your own operating system? Building your own hardware from scratch with all the software written in Rust -- including firmware, the scheduler, and the hypervisor. Oxide Computer Company is one of the most admired companies in the Rust community. They are building "servers as they should be" with a focus on security and performance to serve the needs of modern on-premise data centers.

About Oxide Computer Company

Founded by industry giants Bryan Cantrill, Jessie Frazelle, and Steve Tuck, Oxide Computer Company is a beloved name in the Rust community. They took on the daunting task of rethinking how servers are built -- starting all the way from the hardware and boot process (and no, there is no BIOS). Their 'On The Metal' podcast is a treasure trove of systems programming stories and proudly served as a role model for 'Rust in Production.'

About Steve Klabnik

In the Rust community, Steve does not need any introduction. He is a prolific writer, speaker, and software engineer who has contributed to the Rust ecosystem in many ways -- including writing the first version of the official Rust book. If you sent a tweet about Rust in the early days, chances are Steve was the one who replied. Previously, he worked at Mozilla and was a member of the Rust and Ruby core teams.

Transcript

This is Rust in Production, a podcast about companies who use Rust to shape the future of infrastructure. My name is Matthias Endler from corrode, and today we're talking to Steve Klabnik from Oxide Computer about working on the hardware-software interface with Rust. Steve, I don't think you need a lot of introduction. A lot of Rustaceans know you, but maybe there's someone out there who doesn't yet. Can you quickly introduce yourself and Oxide, the company you work for?
Steve
00:00:30
Absolutely. Hi everybody, I'm Steve Klabnik. I am most known in the Rust community for having co-authored The Rust Programming Language, which is the book that most people learn the language from, or at least many people learn the language from. I was on the Rust core team for almost 10 years, and, you know, I've been doing a bunch of Rust stuff for a very long time. I started using Rust in December 2012, so I'm one of the few people that can say I have 10 years of Rust experience, so.
Matthias
00:00:54
You can finally apply to these job adverts which.
Steve
00:00:57
Yeah, exactly. Oh yeah, and Oxide. So Oxide Computer Company is the startup that I work for. We're basically building servers that you can buy, and it's like a very old-fashioned business: you give us money, we give you a server, you install it in your data center, you run your jobs on it. But the key is, like, you get a cloud-like deployment experience while it's actually on-prem. So you're not renting a cloud, you're buying a cloud. And it's partially named Oxide because we use Rust for 99% of the code that gets written at the company.
Matthias
00:01:25
And that's what we're here for. And when you explain it like that, it sounds so simple. But in reality, it's very complex to build servers, I guess. We'll get to that in a second. But I just want to thank you personally, because you introduced me to Rust many, many years ago when you gave a talk at FOSDEM. And I was there and we happened to have a very, very brief chat after the talk. I don't even know if you remember, but that was kind of the starting point for my Rust journey. And I think a lot of people got inspired by what you did. So thanks for that.
Steve
00:01:57
That's awesome. That talk is really meaningful to me. And I was actually very sad because I only intended to give that talk one time, at FOSDEM. So, you know, as you remember, but like other people probably don't, I kind of was chronicling how Rust changed before 1.0, because that was around the time of the Rust 1.0 release. And so I thought it'd be cool at FOSDEM to kind of recap all of the development history. But then they ended up losing the recording. Something happened with the recording and it didn't work. So luckily the ACM had me give it one more time, and so we got a recording then, and so that was like a future thing. But that was also a very special conference for me because that was when Larry Wall announced that Perl 6 was being, like, released finally or whatever. And so I got to speak after him. And I did Ruby before Rust, but I did Perl before Ruby, and so I've always had a small soft spot in my heart for Perl stuff. So that was also kind of fun about that conference specifically.
Matthias
00:02:48
For the people who haven't been there, it was a really magical moment. The audience was cheering you on; you could feel the atmosphere, you could feel the energy in the room, and it was a very special moment. So it's sad to hear that they lost the tape, but in any case, you managed to inspire a lot of people, also because you have a way with words and you always start from first principles. And I think that's unique, because many engineers are either very technical or, you know, very, very high level, but I think you can switch between those two levels of abstraction. You can talk to everyone, and I wanted to capture some of that today. Now, on that premise, you had a blog post on your website once where you said that you were not planning on quitting Cloudflare, where you worked before, but you just waited for a good opportunity, and it seems to have presented itself in Oxide. And I do wonder what specifically about Oxide and their mission attracted you to make this move.
Steve
00:03:54
So it's kind of a combination of two things the first is i was born in 1986 i started programming when i was like seven years old so like 93 ish roughly about and i also like a Mac was the first computer I ever used and I was a really big like little Mac fanboy back in the day and so there's like a ton of. No, I guess maybe I'm getting old. So I don't know if this is still necessarily true. But like, at least in the early 2000s, late 90s, there was a lot of like, I want to say like sort of reverence held for the sort of like 80s in computers, like that's when computing stuff really took off. I mean, obviously, there were computers, you know, many years before, but like the 80s is when the sort of PC revolution started happening. And when people started having computers in their home, instead of just having them in universities and stuff like that. And so i was really really big on learning kind of about apple's history and a lot of the stories that are on say like folklore.org where you can like read a lot of early apple history were stories that i kind of grew up with and were really like meaningful and impactful to me and there was like one in particular i mean there's a lot of them i could recount but like a one that specifically kind of draws to oxide in in specifics is that like there's the story about how, as they were putting the first Mac together, Steve Jobs got the entire team to sign the inside of the case of the original Mac. And they ended up putting that in all the computers. And when they asked why that's the case, he was like, real artists sign their work. You get a painting and there's always a signature in the bottom corner. And that's what we're doing here. There was also a book that I used to read every year, and I haven't read in a couple years now, but it's called The Soul of a New Machine by Tracy Kidder. And I think it really sort of captures a lot of that. 
It was about a company called Data General in the 80s and their journey to go from, like, the 16-bit microcomputer to the 32-bit minicomputer, and them building this computer and how much effort the team had to put in, and their struggles and trials and tribulations while they're making all this work. And so I've kind of always sort of been like, I wish I was 10 years older so I could have experienced that era of the industry, you know. Because, like, I started high school in 2000, I was graduating high school in 2004, and I was graduating college in 2008, 2009. And so by then it was a good 30 years after all that stuff had happened, and so people weren't, like, starting computer companies anymore. And so I'd always kind of thought that I wouldn't really get to experience that sort of thing. And so then, lo and behold, Oxide starts up, and it's a new computer company, and so that was really cool. They also talked about The Soul of a New Machine a lot, and I was like, oh, there's a very big cultural similarity here. And they were using Rust, which is a tool I loved and had worked on, and so I was like, oh, that makes sense. And I had known two of the three co-founders, Bryan and Jessie, before Oxide, and I had respected their technical opinions and, you know, knew them, and so that also seemed good. And so it was just all those things kind of aligning at the same time. I was employee 17. So, you know, I didn't start the company, but I kind of got in relatively on the ground floor. And, you know, it was all of that kind of stuff that really sort of appealed to me personally. It's very, very rare that you get the opportunity to get in on the ground floor of building a new kind of computer. And so that was just, like, too good to pass up.
Matthias
00:07:32
It sounds like a very lofty dream. But at the same time, it feels like building a new server is such a hard problem so what does it feel like on a day-to-day to work on this.
Steve
00:07:43
Yeah the thing is is it's not just a hard problem it's like a thousand hard problems like we keep joking that like oxide is actually a thousand startups in a trench coat like there are so many things that we are doing that like could conceivably be their own company or like if we were trying to do something for more than just us could be their own company. So like, for example, you know, we want to give people that own the rack, like very good way of monitoring how the system is doing. And so we need observability and metrics. And because we're on prem, like that doesn't mean, you know, you're not just going to hook up some sort of cloud metric service and ship information off to them, right? So we have like, are currently working on like internal metrics for the rack. And that's like a thing that whole companies do separately. And we kind of have to do on our own. You know, like there's just tons and tons of stuff that we're doing. So, so yeah, so it's not just one problem, it's a ton of problems. And so that also means, you know, it is a group effort. I think a really big difference between the eighties and now is like part of the reason why it was so easy to start a computer company in the eighties and why so many people did is because it was feasible for one person to design and code like the whole thing from top to bottom, all the hardware and all the software could be meaningfully understood by a single person. And we are far, far beyond that level complexity now. So, you know, I said I was employee 17, because it's now around 60, 70 people. And, you know, all of them are super necessary to get this done. And so, you know, it's, it's very much, in some ways, it's very much like working in a normal company in the sense of like, you know, you have the thing you're working on, and other people have the thing they're working on. And obviously, there's like a lot of cross pollination. But you know, it's also like tons of people are working on really cool stuff. 
You know, I'm not a hardware guy, personally, I'm a software person, mostly and so it's been really neat to like learn about how the hardware folks do their job and you know vice versa like you know they'll ask some software people for help with software stuff and so like there's a lot of cool collaboration that occurs since the product is so broad but you know at the end of the day you're still like working on you know your little part of the thing and oftentimes many little parts of the thing like another thing with a startup life is that you kind of have to wear many hats and so you know people move around and work on a variety of different things too but yeah i don't know that's definitely a big part of it.
Matthias
00:10:02
A naive person might say shouldn't it be easier today to build such a computer company than just thinking about the 80s because now we have standardization and we have standardized hardware components and all of that stuff but the complexity of course, is still very much existent. And I guess the rabbit hole runs very deep. What are some of the other hard problems that you have to solve to build such a computer?
Steve
00:10:34
Yeah, so I mean, I think the biggest, most straightforward one is, like, it is fundamentally a distributed system. So, to be clear, when you buy an Oxide rack and you plug it in and turn it on, you're not managing individual servers yourself. You get an API that gives you the ability to, say, spin up a virtual machine, but it's not like you're like, OK, I have 10 VMs running on this sled and I have 15 VMs running on this sled. It's just presented to you as one whole stack of compute. And what that means is that our control plane software is the one that's multiplexing those VMs across all the hardware in the rack. And so that means, at a high level, it's a distributed system in a box. And distributed systems are never easy in the first place. You know, and then you're talking about virtual machines that have to manage hardware and do all that kind of stuff. And so, you know, I think that's another example of a really hard problem. People think it's easy because the stuff is standardized. That's sort of true, but the standardization is also, in some ways, what we're rejecting. And so we're not using a lot of those standard interfaces and we're writing our own firmware instead. And so that is also a really big problem, and by problem I mean problem to be solved. So, for example, Bryan gave a talk called, like, "I've come to bury the BIOS". I'm forgetting a couple of words in that title, but it's like "I haven't come to save the BIOS, but to bury it" or something like that. And so, you know, we've thrown out the concept of a BIOS, and the operating system boots up the hardware just like in the old days. In order to do that, we had to write our own firmware, and that's a really, really hard problem. And so, you know, there's a lot of stuff like that. So yeah, it's.
Matthias
00:12:20
A bit like is it a bit like with rust where some of the old knowledge was rediscovered with rust and suddenly you were able to use those concepts in the real production systems level language is it similar with oxide where you uncover some of the old truths that you know, OGs, the original creators of computing knew about and then we forgot about them over time?
Steve
00:12:49
I think that's true in a certain sense but also not entirely because like a lot of these things aren't, necessarily truths that are lost but it's more about the way that the computer industry evolved over time so like standardized interfaces and parts being swappable is like what made the pc platform succeed right like the whole the standardization is what allows a giant ecosystem of companies to be able to work together productively and you know be able to like ship things and so that was like you know again being like a mac fanboy in the 90s people be like oh yeah i need to buy a new hard drive i have 15 different options for buying new hard drives but you have one you know from apple and so you take what you can get with them and that's it you know oh you want to update your graphics card like cool you know that's not really as viable as it is in sort of the pc platform and so i and i think that like that made sense for the economics of the time and i mean it still makes sense in many contexts it's just it doesn't necessarily make sense in servers anymore you know i mean like my pc is still a very standard pc built from you know it's running all that firmware that those manufacturers have built pulled together from 20 different companies that are you know creating that together but i think there's sort of like. The wisdom is not so much like lost as so much as it is that like as computers at home like the industry changed a lot whenever it went from like a computer is a thing a university has that you get to timeshare on sometimes, to a computer is something that companies have, or maybe they have a couple of them, to a computer is something that every individual person has. And so... You know, it's the same sort of thing where with especially like the web and the internet being such a big thing. Now it's not like my company has a single server in a rack of servers in a data center. It's now my company has a whole rack. 
And then it became my company has a data center. And then it became my company rents some servers from, you know, another company that runs the data center. And so those kind of like economic changes happened over time and made different configurations of these things work. Now, part of the additional kind of evolution that happened there is if you'd read a lot of stuff about Google as they were sort of like growing back in the early days, there was a lot of talk about how they use like commodity servers instead of buying big iron. So instead of buying big old giant mainframes or servers, they were just using regular PCs and hooking up a lot of them together. You know, back in the Slashdot days, we'd made jokes about Beowulf clusters, but like that's kind of a thing that they supposedly, well, they didn't supposedly, they did do that. In fact, for a while they ran on commodity hardware, but they kind of learned that like the style and architecture of a computer that works well for people at home is not really well suited for running when you need like hundreds of thousands of computers, because it's sort of designed for the, you know, running on your desk, like my tower is like right here outside of the frame. But like, you know, the thing that's running on my desk has very different needs than when I have an entire data center full of computers. And so they started designing their own hardware. And so did all the other, we call them the hyperscalers. So like the AWSs of the world and all the other people that are running big clouds are. They sort of moved away from the sort of PC model because that just like doesn't really make sense economically speaking for a number of different reasons. And so, but like Amazon is not in the business of selling computers. They're in the business of renting them. And so you can't like say you're a Fortune 100 company because here in the States, companies are people. No, I kid. But like, you know, imagine you're a company, right? 
And you need some servers. So you go to try to buy some. You can't go to Amazon and be like, hey, I need one rack of the same servers that you've custom built to live in an AWS data center, right? That's just not an option that's available to you. These big clouds aren't in the selling-computers business, they're in the renting-computers business. And so you can rent them, but you can't buy them. And so there are other companies that are selling servers, but they're not really in the custom hardware business. They're still fundamentally selling you a big PC. And so what Oxide is doing, I sort of liken it to Prometheus. You know, the legend of Prometheus, who went and stole the fire from the gods and brought it back down to humans. We'll ignore the part where Prometheus was then tortured for all of eternity for, you know, doing what he did. But we're sort of taking the concept. We're not literally selling the same designs as Amazon's, it's not like we're sneaking into the data centers and, you know, whatever. But we are selling that style of servers to companies that, you know, aren't necessarily going to be building their own. There's tons of, you know, big companies, and smaller ones necessarily, but when you're talking about a whole rack of servers, you kind of inherently are going after larger organizations. So, you know, big companies that aren't going to spin up their own server design division. Even though they may have the money, they just don't have the expertise or the culture to be able to do that, right? And they want to be able to buy computers instead. And so that's kind of what we're doing: taking these sort of hardware and software concepts from the hyperscalers and then selling them to companies that could use that amount of compute but just literally can't buy the same amount of computer anywhere else.
Matthias
00:18:15
I think this analogy with Apple is really fitting for two reasons. First, they democratized computing for a lot more people, but second, they also focused on vertical integration. In fact, when Steve Jobs moved away from Apple to build NeXT, he also built some sort of workstation, which was a different type of computer with different requirements. And he chose Objective-C for various reasons to build it, I guess. But of course, Oxide did not choose Objective-C; they chose Rust. What are some advantages that Rust offers for Oxide's unique hardware-software integration challenges?
Steve
00:18:55
Totally. So I should also really briefly say that, like, I always make the Apple analogy because it makes sense to me. But Bryan literally worked at Sun forever, and a lot of the folks are also from Sun, and so Sun and Apple are kind of very similar and very different in different ways. And so, you know, they might also pitch it as kind of like a, you know, next version of Sun in some ways, too. So anyway, in terms of the why Rust. So, I mean, you know, fundamentally, okay, so first you decide you're going to get into this business. So, you know, the folks who started it were previously at Joyent, and they ran a public cloud, and so they kind of knew what it's like to be on the customer side of that relationship. And so they were like, okay, this sort of business needs to exist. Well, how are we going to enable it? And so, you know, for a very long time, a lot of the early folks at Oxide were very big C people. Obviously, like I said, Bryan started at Sun working on the kernel in, like, the late 90s or whatever, at the same time I was learning programming, basically. Sorry, Bryan, you're a little older than me. But, you know, so they've been using C for a really, really long time. And so, naturally, I would say they were initially a little skeptical of Rust conceptually, but Bryan has a great series of blog posts on him personally coming to realize why he ended up liking Rust. But I think that sort of an interesting thing about Rust's positioning in this domain overall is that Rust is kind of the first time in a very long time that there has actually been a new language that can meaningfully replace C and C++ for the use cases where they're still in. And what I mean by this, it's a little complicated, but, like, when I started programming, C and C++ were sort of the default choice for everything. 
Even then, I'm waving my hands a little bit. Like, the Mac OS I started with used Pascal as its calling convention, right? Like, there were more options in many ways in the early nineties, but C and C++ were very dominant in many domains, you know, not just network programming and operating systems programming, but also application development. You would be writing stuff in that way. And then, you know, Java comes along and it takes out a large chunk of the application use case. And then scripting languages come along and they take out a lot of the sort of web development use cases. And so the overall scope of where C and C++ have been the sort of dominant language has actually been declining for a really long time. But the sort of operating system and systems case is one where there truly has not been a challenger for a very, very long time. I mean, some people wanted to make Java OS happen. It didn't happen. And, like, there were some other people that came along and tried to do various things. So, for example, I didn't write the D programming language, but I used the D programming language in college, back in that 2004 to 2010 kind of era, but it had a garbage collector. And that meant that it wasn't really, you know, even though me and my friends in college were writing an operating system in it, although we did not get very far in the end, but we were able to get the beginnings of that going. And while it worked, it still felt like you were kind of fighting with the language. And so a lot of sort of dedicated C fans over the last 30 or 40 years have kind of seen people show up and be like, "We have a programming language that can do all the stuff C can do." And that's mostly been like, well, actually, Java's too high level. And that sort of pattern has repeated itself over the years. 
And so I think a lot of them have become kind of jaded against the idea that there's a language that can truly replace a lot of the low-level use cases of C and C++. And in fact, even early Rust had a garbage collector. Like, it was not necessarily actually a truly good fit for C and C++ use cases. And so I bring this up partially because it was in late 2014 that Rust made the decision. So Rust used to have a pretty big runtime, and there was the option to use native threads, like operating system threads, or green threads. And in sort of late 2014 there was the decision to pull out green threads as a concept and move purely to native threads. And that's kind of when Rust truly came into, like, you can really use Rust for low-level tasks, because we don't need this runtime anymore and all that kind of stuff. And so it's funny, because a lot of people now think about Rust as having always been in that use case, but it was honestly really less than a year before 1.0 that it actually decided to truly go after that space. And so I bring that up partially because that is when Bryan specifically, so I've mentioned him several times, Bryan Cantrill, the CTO and one of the co-founders of Oxide, saw that decision being made. And he has said before, that is the moment where he decided to truly look into Rust, because he's a big anti-green-threads person, or at least for the purposes of systems stuff. He's like, it's always been ripped out of every system they've ever tried. 
And so that moment was when he was like, oh, this is a language I need to take seriously, because they are actually finally deciding to truly do the low-level stuff that I care about. And so that was, I know, when Rust truly first caught his eye, or at least, he had seen it before then, but that was the moment where he was like, I need to investigate this truly. And so it's a very, very long-winded answer, but part of that is then, okay, so once you can do the low-level stuff, cool, that makes sense, but sort of what value does it bring? And a lot of that is in all the classic standard Rust things of, you know, we can have a strong type system and we can eliminate certain classes of errors at compile time. But I think a lot of people only focus on stuff that's unique and not the holistic package. So, for example, the tooling, like Cargo, is really, really important for a lot of people. And stuff like enums: you know, once you use enums, you're like, why is this not in every programming language, basically? Like, we have structs, or, you know, product types, to get a little fancy with it, and so you need enums, or the sum type, to sort of complement that. And it's so interesting that many, many languages have one half of this equation but not both halves. And so there's a lot of stuff like that holistically that makes Rust, you know, kind of very attractive. But I think another one that's really interesting is that I don't necessarily like to think of the stack as being purely up and down. Like, it's not always exactly one way or the other. But the important part is that Rust, because of its commitment to being able to solve low-level use cases, can also make its way up the stack a surprising amount. And so a lot of times, you know, if you don't start with a language that can address the lowest-level needs that you have, you end up at least investing in two different languages. 
And I'll say that, you know, at Oxide, we definitely don't use Rust for a hundred percent of the things, but, like, for 90% of the things. And it's cool that we can even get to 90, because a lot of times it's like, okay, we're going to write our web app, it's going to have JavaScript on the front end, and then Python or Ruby on the back end, and then maybe we'll need to do some low-level stuff in C. And you kind of have at least three languages that are sort of necessary to fill in the stuff that you need to be able to do. And so Rust enables us to get away with using one language for a lot more of that stuff than other things, most importantly at the lowest level itself. But then also, if you look at my comments from a long time ago, back in the 2014 era, I would have told you I would never write a web app in Rust. But we have, and do, many web apps actually at Oxide, and I think it works totally fine for those use cases. Not necessarily for everyone, but for us it does work really well. And so that versatility is super, super useful and kind of, you know, enables us to do all sorts of stuff. So anyway, I don't know if that fully covers it necessarily, but I think it's just important that if we decided to go with C for the lowest levels of the Oxide stack, we would need to introduce another language much sooner than we do with Rust. Rust gives us the ability to address things at the high level of the stack as well as at the low level of the stack. And so I think that's a very valuable thing when you're talking about the scope of, you know, we do everything from the firmware up to the front end of a website running in the browser. You know, a lot of people say full stack is like, I can write some front-end and back-end web code, but we're like full, full stack. And so being able to address all those use cases with fewer technologies is a legitimately valuable thing.
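For readers who have not used Rust, the point about structs and enums complementing each other can be made concrete with a small sketch. Everything in it is hypothetical and invented for illustration: the names InstanceSpec, InstanceState, and describe are assumptions and are not taken from Oxide's code. The struct is a product type (a value holds all of its fields at once), the enum is a sum type (a value is exactly one of its variants), and the match has to handle every variant or the program does not compile.

```rust
// Hypothetical types for illustration only; not Oxide's actual code.

/// Product type: a value has a name AND a CPU count AND a memory size.
#[derive(Debug, Clone, PartialEq)]
struct InstanceSpec {
    name: String,
    cpus: u32,
    memory_gib: u32,
}

/// Sum type: a value is in exactly ONE of these states,
/// and each state carries only the data that makes sense for it.
#[derive(Debug, Clone, PartialEq)]
enum InstanceState {
    Stopped,
    Starting { attempt: u32 },
    Running { sled: u32 },
    Failed(String),
}

/// Builds a human-readable status line. The `match` must cover every
/// variant; adding a new variant later turns a forgotten case into a
/// compile-time error rather than a runtime surprise.
fn describe(spec: &InstanceSpec, state: &InstanceState) -> String {
    let label = format!(
        "{} ({} vCPUs, {} GiB)",
        spec.name, spec.cpus, spec.memory_gib
    );
    match state {
        InstanceState::Stopped => format!("{} is stopped", label),
        InstanceState::Starting { attempt } => {
            format!("{} is starting (attempt {})", label, attempt)
        }
        InstanceState::Running { sled } => {
            format!("{} is running on sled {}", label, sled)
        }
        InstanceState::Failed(reason) => format!("{} failed: {}", label, reason),
    }
}

fn main() {
    let spec = InstanceSpec {
        name: "web-1".to_string(),
        cpus: 4,
        memory_gib: 16,
    };
    println!("{}", describe(&spec, &InstanceState::Running { sled: 7 }));
    println!("{}", describe(&spec, &InstanceState::Failed("out of memory".to_string())));
}
```

In a language with only the struct half, the same state would typically be modeled as a tag field plus several optional fields, and nothing would stop code from reading a sled number on a stopped instance.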
Matthias
00:27:48
But some listeners who have a c background might say you folks just make it very hard on yourself there are things that you could have gotten for free if you chose c what would you say to them.
Steve
00:28:00
I don't really know what we would get for free necessarily. Like, I mean, maybe not needing to write some code, but also sort of the fundamental premise of the company in many ways is, so Alan Kay has said in the past, like, you can't meaningfully build your own software without making your own hardware. Or I forget exactly how he phrased it. Maybe it's the other way around, but we sort of need to do both. And what that means on many levels is that we're willing to write our own stuff because, you know, sometimes, if you're building something custom, you need a lot of custom stuff. And so it's true that maybe we'd get some stuff for free, sort of, kind of, but then it also means, you know, when it comes to things like support: we really care a lot about making sure that everything works and works well together. And it's much, much harder to support a giant pile of other code that other people have written than it is code that you've written that's custom for purpose. 
And, you know, bugs can appear sometimes because something is solving a problem for someone else's use case and not yours. A custom piece of software that only does what you need it to do and nothing else ends up being easier to understand and easier to take care of. And so, I mean, I definitely don't think it would be impossible without Rust. Someone could make an Oxide where C is used more. But I also think that, at least for a seasoned Rust developer, Rust overall saves me time implementing stuff, because I have to do so much less checking up after the fact that what I'm doing is reasonably correct. And there are so many things that Rust kind of gives you for free that you don't get with C. So, you know, I don't think it's impossible, but someone is welcome to try, let's put it that way. We are demonstrating it can be done with Rust. Sure, maybe it could have been done with C, but that's for somebody else to prove out, you know.
Matthias
00:29:59
This Alan Kay quote was really nice, where you say software informs the hardware and hardware informs the software. You can't build one without the other. How does that look at Oxide? How do the hardware part of Oxide and the software part of Oxide, which is written in Rust, inform each other?
Steve
00:30:16
Yeah, so one of my favorite stories, and this is not any work that I personally did, but I love talking about my coworkers doing cool things. So an example of this is, we talked a little bit earlier about the sort of the standardization layers. And so here's an example of how they can kind of like get in the way. When you're, say you have an entire room full of servers, right? And you, you know, maybe, you know, you're running tons and tons of jobs every day. What that means is that small failure rates happen. Like, I used to say this about living in New York City. It's like people be like, oh, there's crime in New York City. And it's like, sure. But that's also because there's millions and millions of people. So a one tenth of a percent chance that something happens means that it happens hundreds of times a day in New York City or whatever. I probably have the order of magnitudes off. But like the point is, is at scale, things that don't happen very often start to happen and happen often. And what that means is say that your hard drive controllers firmware has a bug and that bug only manifests 1% of the time. Well, if you have 100,000 servers, that 1% of the time is going to be happening all the time. Now, obviously, 1% is a very high rate for a bug in firmware. Like, I'm not saying that firmware bugs happen literally 1% of the time, but just like you will run across obscure edge cases in the software and hardware that you use, and those problems will occur. And so one of the things that standardization enables, which is great, is having, as we said earlier, tons of organizations come together and build all the stuff that goes into a computer. But the problem with that is that like, so say you buy a server from Dell, and you come across a firmware bug in the motherboard. Well, Dell didn't write the firmware that's running on that motherboard. It's some other company, you know, that ended up writing that firmware. 
And so if they need to fix that problem, they will file a ticket with their vendor. But then it's like, good luck getting that prioritized. You know what I mean? As a customer, you are now dealing not just with the company you bought it from, but the companies that they bought their stuff from. And you're not the customer of that motherboard vendor. So why should they care about you over someone else or whatever? And so one of the things that we've done is thrown out a lot of those layers because we don't need... The fundamental purpose of stuff like BIOS and UEFI, for example, is to enable the operating system people to write against the UEFI or BIOS spec, and then for all the hardware vendors to produce the APIs in their hardware that fit into that BIOS or UEFI spec. But we don't need to support 75 different manufacturers of RAM, you know, or five different manufacturers of hard drives or whatever. At the moment, since we have the first revision, or we're working on the second revision of the rack now, we know physically what hardware is in the machine. And so all of that extra standardization interface is written for a benefit that we don't actually see. We don't need to be able to build all of these different variants of all this different stuff. What that means is all that code is being written to serve a purpose that we don't need anymore, and so we've actually completely thrown out that layer. In the Oxide rack, there is no BIOS, there is no UEFI. The operating system boots the hardware, sort of like in the very old days before that stuff even existed. And so what that means is that we had to throw out that firmware. Specifically, for AMD, there's a thing called AGESA that's part of this firmware package that you get, that you use to boot up AMD's CPU.
And we said no, we're not going to use that, and we wrote our own. At first, AMD was kind of asking us, like, why are you guys asking questions about this? That's just in the firmware. And we're like, yeah, we're writing our own firmware. And they're like, we don't really believe you, or, you know, they were kind of like, okay guys, sure, whatever you say. And eventually, once we got it to boot, we're like, hey, by the way, here's an example of this booting. And they were like, oh, that's really cool, we didn't expect you to do that, because literally no one else does this. And so what that means on some level is that stuff boots really quickly. But how much does that matter, because you're not really booting a server all the time? More importantly, by throwing away all of that stuff, we've eliminated a ton of possibilities for things to go wrong. We have eliminated security issues. Another thing that we got rid of is the BMC, or baseboard management controller. Server-grade hardware has a ton of other computers running inside it to make sure that the computer is running correctly. So, for example, you really want your main CPU to be running the jobs that you're running on the server. You don't want it to be running stuff to manage the server itself, right? And so server-grade hardware has this additional CPU and other stuff in it called a BMC. And that's usually making sure that, oh, if something crashed on the main CPU, we can reboot it. Or we're able to log into and monitor stuff from behind. But for a lot of server vendors, those BMCs have full operating systems running on them. And those full operating systems have bugs, and they can have problems, and there are security issues.
And it's like, it's really, really hard when you're buying a server today to even know what code is running on it at all. And so we have an equivalent thing, which I joke is the totally not a BMC, we call it a service processor, but it serves the same general idea of like, okay, it is a good thing to have a little mini extra computer monitoring the main big computer to make sure that all that stuff works. But like instead of just accepting the one that would come with the like motherboard manufacturer that we would buy something for since we designed our own motherboard we designed our own replacement system and it is much much smaller and it runs an os called hubris that we wrote from scratch instead and so you know we know that you're not running a full linux inside of the mini computer inside of the thing that's running your big computer and that means we can be more efficient means we can have less attack surface area it means we can audit everything because we're not just like accepting whatever the external manufacturer is giving us. And so that provides like a ton, a ton of benefit for, you know, doing this sort of stuff. And so anyway, that's like an example of like, we need to write the software that manages the hardware in that way. And it's only because we're designing our own hardware that we're able to write the software in that way, because, you know, we sort of know what we're putting in the rack. So if we didn't know, we would need a lot more of those standardized interfaces, you know? 
And so that's kind of an example of those two things informing each other very deeply. Now, the downside is, you know, we're releasing a new version of the rack, and new versions of CPUs mean we need to write more firmware, right? Whereas before, we would just be able to update it to whatever, and someone else wrote that code for us. So I'm not going to say that there's no trade-off there. There obviously is a very large trade-off, but that's one that we're willing to make, and we think it is the only way to deliver on a lot of the quality promises and support promises and reliability promises that we want to give customers.
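Editor's note: the point earlier in this answer about rare bugs becoming routine at scale can be made concrete with a little arithmetic. The sketch below is only an illustration. The function names are made up for this note, the 1% rate is the deliberately exaggerated figure from the conversation, and machines are assumed to fail independently, which real fleets do not guarantee.

```rust
/// Expected number of machines that hit a bug, given a fleet size and a
/// per-machine probability `p` of hitting it in some time window.
fn expected_hits(machines: u64, p: f64) -> f64 {
    machines as f64 * p
}

/// Probability that at least one machine in the fleet hits the bug,
/// assuming (hypothetically) that machines fail independently:
/// 1 - (1 - p)^n.
fn prob_at_least_one(machines: u64, p: f64) -> f64 {
    1.0 - (1.0 - p).powf(machines as f64)
}
```

With 100,000 machines and a 1% rate, `expected_hits` gives about a thousand affected machines. Even at a one-in-a-million rate, `prob_at_least_one` is still roughly 10% for a fleet that size.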
Matthias
00:37:16
So I knew that you had your own operating system called Hubris, and I wanted to talk a little bit about that because some people might wonder, why did you start your own operating system if something like TockOS already existed even back in the day?
Steve
00:37:31
Yeah.
Matthias
00:37:32
What's so special about Hubris that you needed to write that thing yourself?
Steve
00:37:37
So I've actually known the Tock people for a very long time. They're great, and I really like their project in general. Oxide, before I joined, had investigated using Tock instead of Hubris and decided not to. And basically what that boils down to is, a thing about embedded use cases is that diversity is the rule, not the exception. What I mean by that is every application tends to be different, and because you're literally often running on different hardware, you have different needs. And so Tock is very focused on this use case where you kind of load a variety of different programs at runtime, and they also care a lot about supporting programs that are written in C. They have other goals too. I should at least frame this, you know, it's been a couple of years since the decision was made, so I don't necessarily want to speak fully to what Tock's goals are right now. But at the time Oxide was looking into it, it was like, okay, they're interested in dynamic program stuff and they're interested in supporting C programs, and we didn't really have the same needs. And so, the way that Hubris is different, and kind of the reason we ended up writing our own: first of all, saying you write your own OS sounds like a massive undertaking, but the kernel in Hubris is like 4000 lines of Rust code. It's not very big, actually. And I mean, a lot of the drivers and the stuff that you need to actually make it useful and meaningful is a little bit more. But it is feasible, and it was largely designed by one person, Cliff Biffle.
Many other people have also helped significantly, but he was kind of the person who did a lot of the initial design work. And what makes Hubris special is that it is aggressively static. By that, what I mean is, when you think about an operating system, you're like, the operating system's on my computer, and then I install programs, then I run programs, right? Like, I click on Discord, it boots up and it runs. I click on Firefox or Chrome or whatever, and it boots up and it runs. With Hubris, when you make the image to even install on the hardware at all, you say up front, here are the programs that I am running, and those programs run all of the time. There is no dynamic list of "here's how many programs are currently running". There's no "hey, let me load a program at runtime that wasn't running initially". It boots up, it starts all the programs, they're all running, and there's one instance of each program. It's not even like, okay, maybe I have three Firefox windows running or whatever, and obviously those share some code, whatever, blah, blah, blah. But the point is, on a regular computer, I can be running five instances of Bash all at the same time. In Hubris, you have one instance of every program, and they're running all the time. And that's something that, for our purposes, works really well, but does not necessarily work for other people's use cases. And what's cool about stuff being static is it means that you can make, I won't even say shortcuts exactly, but you can skip some design decisions that other people have to deal with. So, for example, because we know every program that's running and that it's always running, Hubris doesn't really use virtual memory. We can actually pre-allocate at build time. OK, in the image, this is where all the programs are running individually.
And we know the memory map statically, where everything lives, and we can tell, OK, this will fit. We do have to worry about programs crashing at runtime to some degree. But, I mean, in the days of virtual memory on desktop machines this doesn't really happen anymore, but back in the day, you know, when I was using the first Macs, you'd be like, I have this much RAM, and my programs are currently using this much RAM, and if I start up a new program, it's going to literally run out of RAM and not work, right? And Hubris is closer to those systems, since we don't use virtual memory. But we don't have to worry about the problem of "I start up a program and there's not enough RAM to run it", because we know at build time there is enough RAM to run the programs that you're trying to run, for the most part. And so those kinds of aggressive design simplifications mean that we can get away with doing a lot less. Another great example is that there's no global memory allocator or global heap in Hubris, because we don't need dynamically sized lists, because we know at build time how many programs are running. So the OS does not need to keep track of a dynamic list of, here's how many programs are currently executing. You know what I mean? And so these kinds of design decisions all build on top of each other and enable us to do something a little different. And that also means, likewise, I would never suggest that the Tock people stop building Tock and use Hubris for everything, because they're just literally trying to do something different than we're trying to do. And so, yeah, I would say that at the highest level, that's what Hubris is trying to do. Everything is aggressively static. It's all at compile time. It's all at build time. Another thing is that Hubris is very much a message-passing OS, but is also fully synchronous.
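Editor's note: the "aggressively static" idea described above can be sketched in a few lines of Rust. This is not Hubris code. The task names, sizes, `TaskDesc`, `TOTAL_RAM_BYTES`, and both functions are hypothetical, chosen only to show a fixed task table, a memory map computed at compile time, and a build failure when the tasks do not fit, all without a heap.

```rust
/// One entry per task, fixed when the image is built (hypothetical layout).
#[derive(Clone, Copy)]
struct TaskDesc {
    name: &'static str,
    ram_bytes: usize,
}

/// Assumed RAM budget for the whole image, for illustration only.
const TOTAL_RAM_BYTES: usize = 64 * 1024;

/// The complete, fixed set of tasks: a plain array, no heap, no dynamic list.
const TASKS: [TaskDesc; 3] = [
    TaskDesc { name: "supervisor", ram_bytes: 8 * 1024 },
    TaskDesc { name: "net", ram_bytes: 16 * 1024 },
    TaskDesc { name: "thermal", ram_bytes: 4 * 1024 },
];

/// Sum of all task RAM requirements, evaluated by the compiler.
const fn total_ram(tasks: &[TaskDesc]) -> usize {
    let mut sum = 0;
    let mut i = 0;
    while i < tasks.len() {
        sum += tasks[i].ram_bytes;
        i += 1;
    }
    sum
}

// If the tasks do not fit, the build fails instead of the device.
const _: () = assert!(total_ram(&TASKS) <= TOTAL_RAM_BYTES, "tasks do not fit in RAM");

/// Give each task a fixed, non-overlapping region [start, end) of RAM.
const fn memory_map<const N: usize>(tasks: &[TaskDesc; N]) -> [(usize, usize); N] {
    let mut map = [(0, 0); N];
    let mut next_free = 0;
    let mut i = 0;
    while i < N {
        map[i] = (next_free, next_free + tasks[i].ram_bytes);
        next_free += tasks[i].ram_bytes;
        i += 1;
    }
    map
}

/// The whole memory map is a compile-time constant.
const MEMORY_MAP: [(usize, usize); 3] = memory_map(&TASKS);
```

Raising any `ram_bytes` value so that the total exceeds `TOTAL_RAM_BYTES` turns the mistake into a compile error, which is the spirit of "we know at build time there is enough RAM".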
So we do use async Rust a lot at higher levels in the stack. And we don't necessarily think that async Rust is bad in the lower levels either, but for Hubris specifically, synchronous is much, much simpler than asynchronous. And so when we're talking about the firmware that runs at the lowest levels of our entire thing, we're really, really going for a simple system that can be understood and reasoned about very easily. And so Hubris is also aggressively synchronous. That's a thing that works for us, but does not necessarily work for everybody else.
Matthias
00:43:07
When you explained that you have one instance of one program at a given time, I thought, oh, that would make scheduling extremely easy. Because essentially, do you even context switch between those processes? Do you have some time-sharing mechanism? How does that work?
Steve
00:43:24
Yeah, I mean, we do still have to do some of that, but the scheduler can be really, really simple. I forget exactly how many are running on the real thing, but we're talking on the order of tens, not on the order of hundreds of processes, right? So the scheduler itself can be really, really simple. If I remember correctly, it's basically just a round-robin scheduler, but there are some interesting bits around, you know, if a program is waiting on an interrupt to come in, then we don't need to start it up again, because we know that it's waiting on something like that. But it can be aggressively simple because we don't need to handle, you know, what happens if someone spins up a thousand instances of something and how do we deal with that. Right. And so it's all very, very straightforward and very, very simple, but that ends up being also reliable and easy to debug and stuff like that. So yeah, we have a little bit of process priority stuff, and there are some important things there. A problem once you have process priority and scheduling is a problem called priority inversion, where a program with a low priority ends up waiting on a program with a high priority that's blocked, or maybe vice versa. I forget if I said that backwards, but the point is that you get stuck in loops, you know. And so we can actually validate that that doesn't happen, because everything is so straightforward. And because we don't do a lot of dynamic stuff, we can actually sort of lint against, hey, you may have a program that's going to block execution of this other program, and stuff like that.
Yeah, it's much more straightforward than you would see in a commercial desktop operating system solely because we're able to pare down those requirements so much because of the limited context of like what this is actually trying to do. And it turns out that like, you know, aggressively scoping down requirements is what allows you to have simplicity in a system. Sometimes if you have complicated requirements, you're going to have a complicated system. I don't think simplicity is always better than complexity. It's about how simple can you make something within a given set of requirements. And sometimes you can over-scope requirements down too simple, and then the thing doesn't work correctly. But it turns out for embedded devices or use cases like this, you can be really, really aggressive with how you scope stuff. And then that can make a system simpler and easier to understand and therefore more reliable.
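Editor's note: a scheduler of the kind Steve describes from memory (round-robin over a fixed set of tasks, skipping any task that is waiting on an interrupt) might be sketched as follows. This is an illustration under those stated assumptions, not Hubris's actual scheduler. The `Scheduler` type and its method names are invented for this note, and the priority handling he mentions is left out.

```rust
/// Whether a task can be given CPU time right now.
#[derive(Clone, Copy, PartialEq, Debug)]
enum TaskState {
    Runnable,
    WaitingForInterrupt,
}

/// Round-robin scheduler over a fixed number of tasks `N`.
/// The task set never grows or shrinks, so a plain array is enough.
struct Scheduler<const N: usize> {
    states: [TaskState; N],
    /// Index of the task that ran most recently.
    last: usize,
}

impl<const N: usize> Scheduler<N> {
    fn new() -> Self {
        // Start "just before" task 0 so that task 0 is picked first.
        Self { states: [TaskState::Runnable; N], last: N.saturating_sub(1) }
    }

    /// Pick the next runnable task after the last one, wrapping around.
    /// Returns `None` if every task is waiting on an interrupt.
    fn next_task(&mut self) -> Option<usize> {
        for offset in 1..=N {
            let candidate = (self.last + offset) % N;
            if self.states[candidate] == TaskState::Runnable {
                self.last = candidate;
                return Some(candidate);
            }
        }
        None
    }

    /// Mark a task as blocked until its interrupt arrives.
    fn block_on_interrupt(&mut self, task: usize) {
        self.states[task] = TaskState::WaitingForInterrupt;
    }

    /// Mark a task as runnable again once its interrupt has fired.
    fn interrupt_arrived(&mut self, task: usize) {
        self.states[task] = TaskState::Runnable;
    }
}
```

Because `N` is a compile-time constant, there is no question of what happens when someone spins up a thousand instances of something. The array simply cannot grow.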
Matthias
00:45:37
Must be a bliss to develop such an operating system.
Steve
00:45:40
You know, it's nice and it's got its problems too. Like, you know, I mentioned Cargo being really useful before. It is true that Cargo is wonderful and it's definitely very useful, but also we basically had to write a build system on top of Cargo to make this work, because Cargo has several deficiencies in dealing with this kind of problem. It's actually one of the first things that I worked on at Oxide, sort of rewriting the build system yet again on top of Cargo. It is definitely a very different thing than developing a web app or developing something that's at a much higher level. And, you know, some people enjoy it and some people don't. I used to work on that stuff and I found that it was a pleasure, but obviously everybody has their different levels of expertise.
Matthias
00:46:22
Tell me a little bit about that. What were some of the limitations that you ran into with Cargo? And also, why did you have to build your own build system? What does that look like?
Steve
00:46:28
Yeah, so a very simple and straightforward one is that Cargo very deliberately does not include any post-processing of stuff, and we need to build an operating system image. So on some level there is a step after Cargo runs. You know, Cargo produces a program or a couple of programs, and then we need to assemble that into a thing that you would actually load onto the microprocessor. So just even in those pure cases, we need something more than Cargo, because Cargo deliberately leaves that to some sort of other process. So a lot of code in the Hubris build system is like, okay, now that we've gotten Cargo to spit out all the final programs, how do we actually assemble that into an image for the OS to boot?
Matthias
00:47:12
How did you do that?
Steve
00:47:13
So there's a pattern called the xtask pattern. That was started by matklad, and it's kind of this idea that you have a program in your workspace called xtask, and you write sort of scripts in Rust that work that way. And so we have a very extensive set of xtasks, so you basically rarely invoke cargo build itself. You invoke, it used to be, I'm not sure if this has changed lately, but cargo xtask dist to generate a distribution. And that will go, okay, cool, I need to build these five programs, so let me invoke Cargo five times to build those five programs, and then take the final binaries they produce and run some code to figure out, okay, how do I need to tweak this stuff, and whatever. And so it's all in Rust code, but it's written as this kind of build system on top of things. Another area where Cargo is kind of a little weak is that when you're doing embedded use cases, you often need to build some things for the host but some things for the target. So if I want to build some code that's supposed to be running on my local machine, but then I also want to build some programs to be cross-compiled, Cargo is not very good at that. It kind of understands, you know, you get one target flag, and so it'll try to build everything for the thing that you say you want to target. And so there are a lot of nitty-gritty issues that become annoying when, say, you want to be able to say, okay, these three programs are meant to be built for 32-bit ARMv8 and 32-bit ARMv7, and then this is built for AArch64, and then this is built for x86.
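Editor's note: the xtask idea described above, a small Rust program in the workspace that invokes Cargo once per program with the right target, can be sketched like this. It is a minimal illustration, not the Hubris build system. The `Program` struct, the package names, and the functions are hypothetical. The `--release`, `--package`, and `--target` flags and the target triples shown are standard Cargo and rustc ones.

```rust
use std::io;
use std::process::Command;

/// A program in the image and the target triple it must be built for.
/// Package names used with this struct here are made up for illustration.
struct Program {
    package: &'static str,
    target: &'static str,
}

/// Arguments for one `cargo build` invocation for a single program.
fn cargo_build_args(p: &Program) -> Vec<String> {
    ["build", "--release", "--package", p.package, "--target", p.target]
        .iter()
        .map(|s| s.to_string())
        .collect()
}

/// One Cargo invocation per program, each with its own target.
fn plan_dist(programs: &[Program]) -> Vec<Vec<String>> {
    programs.iter().map(cargo_build_args).collect()
}

/// Run every planned invocation, stopping at the first failure.
/// A real xtask would follow this with a post-processing step that
/// assembles the resulting binaries into a bootable image.
fn run_plan(plan: &[Vec<String>]) -> io::Result<()> {
    for args in plan {
        let status = Command::new("cargo").args(args).status()?;
        if !status.success() {
            let msg = format!("cargo {} failed", args.join(" "));
            return Err(io::Error::new(io::ErrorKind::Other, msg));
        }
    }
    Ok(())
}
```

Separating the plan from its execution keeps the per-target logic easy to inspect and test without actually running Cargo.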
Cargo is kind of not as good at those kinds of use cases. This comes up in a higher-level use case if you ever try to build a project with Wasm, where you want to build, say, a server for your desktop and you also want to build Wasm for your front end. It's a little awkward at some of that kind of stuff. And yeah, just a lot of nitty-gritty things like that. Cargo is fantastic when you're in the normal use case, but the more you diverge into weird cases, like, you know, sometimes we need to pass in interesting configuration stuff, it can get a little gnarly sometimes to do things like that. So yeah, obviously, like I said, we built this on top of Cargo. We're not going so far as to throw Cargo away, although I am very interested in the Buck build system. But I have not tried to actually port Hubris's build system over to it, although I've joked about trying it a number of times. At some point you kind of end up outgrowing Cargo to some degree, but a lot of times it comes down to these things where Cargo is deliberately saying that's out of scope. So it's not that Cargo is inherently bad at those things; they are choosing not to do them. It's more like what I said earlier: if you can aggressively scope down your requirements, it makes things easier. Cargo determined that pre- or post-processing of build stuff is just not something Cargo wants to do. They want you to layer another tool on top of it. And that's just something we've had come up in a number of instances, where we need either pre- or post-processing of stuff that Cargo is going to do, and so we have to put something over top of it to make it work. And, you know, I don't think that's something that the Cargo folks would find objectionable. I think that's what they would advise us to do in those cases, because it's just not trying to do that.
Matthias
00:50:26
It makes sense to keep the scope of Cargo sort of small and focused, and there are also a couple of escape hatches that people can use if they don't want to go forward and build their own tooling. For example, you can have your own little build.rs for pre-processing, I guess, and then you have workspaces, which is a very nice feature that a lot of enterprise customers use. I guess once you reach a certain level of maturity with your project and a certain scale, then you end up bumping against those limitations from time to time. For example, I also know that you have your own little CI service. I don't know if that is what you talked about, but the thing is called Buildomat. And what is that about?
Steve
00:51:08
So, okay. So we talked about us writing our own operating system in the sense of Hubris, but that's used for the embedded use cases. The actual control plane, the thing that's scheduling your VMs to run on the hardware, that's a repository we call Omicron, which was started before COVID and kind of became an awkward name. Oh, well, we're kind of past Omicron in terms of COVID stuff now. I've joked occasionally, like, I wonder if we're ever going to accidentally name something after another COVID variant. That'd be unfortunate. But anyway, that needs to run on something called the host OS. And so we do have a more fully featured OS that's running on top of the embedded stuff, and that is the thing that's actually scheduling your VMs. And so for that, you know, many places use Linux with KVM to do those kinds of things. But we decided to use Illumos and bhyve. And Illumos, for those of you who have not heard of it, which is probably most of you, to be honest, is a descendant of Solaris and SunOS, which is a descendant of the BSDs. So if you go back through Unix history, they're kind of like cousins to Linux. There is a common ancestor way back there, but Linux came as a separate re-implementation of Unix, and Illumos is one of the many descendants of the tree of actual BSD, actual Unix. So, yeah, Illumos and bhyve is kind of like Linux and KVM, if you want to think about making analogies to that stuff. And so at those levels of the stack, we are running a full OS that is largely written in C. Illumos has been around for, you know, it is, again, literally descended from that code. So there's some very old code in there.
We do, you know, maintain the Illumos port of Rust, for example, to make sure that Rust programs work well on it. And we do write some Rust code, and there is some Rust being put into Illumos a little bit here and there. There are a lot of reasons why the decision is a good one that makes sense for us. But again, a classic thing that comes up in Oxide-related discussions is, if you go your own way and build something custom, it also means you need to go your own way and build something custom. So there is no built-in GitHub Actions runner that works with Illumos. You can say, please give me Linux, please give me Mac, please give me Windows, but you can't say, please give me Illumos. And GitHub does have self-hosted runners, as they call them, but that requires you to have a certain kind of setup for some things. I don't fully work on this, so I can't speak to all the details super specifically, but basically, at that point, you're already doing a bunch of work to make this work out. And so effectively, Buildomat is used for Illumos-native jobs. Because we're doing systems work, you know, a lot of stuff you can test on Mac, Linux and Windows and make sure that it works well. A lot of our developer tooling, for example, runs on regular old GitHub Actions with a bunch of that stuff. But occasionally we need to test on an honest-to-goodness Illumos system that the thing actually works with Illumos, because we're making Illumos-specific system calls and dealing with Illumos-specific functionality. And so Buildomat was created as a way that we can plug Illumos-specific jobs into these CI systems and make sure that that works.
And so if you look at some of our repositories, you'll see GitHub Actions. Sometimes there'll be normal Actions, and there'll also be Buildomat, and that's basically for those Illumos things. But another reason why it kind of makes sense for us: GitHub Actions is running on some servers that GitHub rents somewhere, but we make servers. We've got a bunch of servers, you know, not just because it's fun and because dogfooding is good, but also because we do want to make sure our stuff works. Having Buildomat run on an Oxide rack in the office that's testing the code we are using to build Oxide is a thing that makes more sense for us than it would for other companies. You know what I mean? We would spend money to get CI servers, and we already have servers, so why not use our own servers for our own CI? So being able to do that is also really useful and helpful. But the core of it, the reason why we started doing that, really does boil down to this: we have very specific needs for our CI that it is not reasonable for an upstream provider to be able to offer, or to offer in a way that totally makes sense for us. And so on some level that means needing to dig in and write our own stuff. A lot of people say rewriting software is bad, and those people are not necessarily wrong, but rules of thumb are only rules of thumb. They're not laws of truth, you know what I mean? And so Oxide is a place where we very, very often find legitimate needs to rewrite some software. And I'm not going to say that it's perfect or that it never introduces problems, but it works far better than the people who say never, ever rewrite your own software would lead you to believe, basically.
Because it turns out that writing a basic CI system is not an impossible task. It's a project that one or two people work on, along with their other responsibilities, and it serves us well. You may think that's an impossible thing to do, but some people think writing your own operating system is an impossible thing to do, and we did that too. So, you know, it's just how it goes. It does mean, and this is part of the reason, that it took us four years to get from starting the company to shipping servers. I'm not going to say that it's easy necessarily, but not everything that's worth doing is easy. What we're doing is very ambitious and hard. And sometimes that means you just have to commit to doing the work.
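Editor's note: the reason a native Illumos CI runner matters, as described above, is that platform-specific code paths are only compiled and exercised on that platform. The sketch below shows the general mechanism with Rust's `cfg` attributes. The function names and messages are hypothetical. `target_os = "illumos"` is the value rustc uses for Illumos targets.

```rust
/// Compiled only when building for an Illumos target. In real code this is
/// where Illumos-specific system calls would live.
#[cfg(target_os = "illumos")]
fn platform_note() -> &'static str {
    "illumos: the illumos-specific code path is compiled in"
}

/// Compiled on every other platform. A Linux, macOS, or Windows CI runner
/// only ever builds and tests this branch.
#[cfg(not(target_os = "illumos"))]
fn platform_note() -> &'static str {
    "other OS: the illumos-specific code path was compiled out"
}

/// True only when the binary was built for Illumos.
fn runs_illumos_path() -> bool {
    cfg!(target_os = "illumos")
}
```

On a hosted Linux runner the first function never even reaches the type checker, so a bug in it can go unnoticed until the code is built and run on an actual Illumos machine.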
Matthias
00:56:45
You say it as if it was easy, but you have some of the best people in the world. It's an excellent team that you assembled there. And what does it take to write Rust at that level? Do you have any coding guidelines internally that you stick with? Is there anything internal where you say, hey, other companies could also profit from that knowledge? Things that we might even avoid, for example, overusing generics or mixing sync and async? What are some of the patterns that have evolved when using Rust at that level?
Steve
00:57:17
Yeah. I mean, I love my team, everyone who works at Oxide is great, but also, while we are very senior engineers almost universally, there's not something super magic that necessarily makes us head and shoulders better than everyone else. I appreciate the kind words, and I don't want to say my co-workers are bad at their jobs, because they're not. They're definitely very accomplished. But also, it's engineering, it's not magic. So there are also many, many good Rust engineers in other places as well. What I will say is that when I started four and a half years ago, there were a lot fewer experienced Rust engineers in general, period, just in the industry. And so a lot of the folks that we hired at the beginning were not necessarily fantastic Rust programmers. They were experienced and fantastic engineers in general, but they sort of, not necessarily picked up Rust on the job entirely, but let's just say we didn't have the luxury of demanding that people were good at Rust before joining Oxide in the earlier days. And so part of me coming on relatively early was kind of replicating that. If you commented about Rust on Hacker News in the last 10 years, you probably got an answer from me personally at some point, because I just spent my time doing that. And so in the early days of me working at Oxide, I would spend a lot more time answering Rust questions, helping people with the Rust stuff, giving advice, and choosing packages. We also did have a lot of people who had deep Rust experience when I started. But part of me coming on initially was to help make sure that if people had Rust issues, I was able to guide them with stuff like that. By now, years later, there are many more experienced Rust engineers out there.
And so I would say that one of the factors is that the overall quality of the Rust hiring pool has gone up. At this point we are much more likely to hire someone who knows Rust than someone who doesn't, simply because there are enough people who know Rust at a high enough level that we're able to find and hire those people. So that's changed over time to some degree. But there are also some things that we do that any company doing Rust can replicate to help out with these sorts of things. We have a dedicated channel in our chat system that's purely for Rust questions. People will ask Rust-specific questions, and I and many others who have a lot of expertise in Rust will make sure to pay attention to that channel and answer questions as they come up. I think that's really important. Having space for people to ask things is really a big deal. We have a biweekly meeting that's optional. Other than one-on-ones, all employees can go to all meetings, so I want to say it's available to everyone, but that's just kind of true of stuff at Oxide in general. But if your company doesn't do that, I would still recommend a sort of open-to-everyone meeting; we call it the Rust study group. Basically, it's a meeting where, every week, me and a couple of other people who love Rust stuff have time blocked off on our calendars to make space for if somebody has a question that's maybe not hyper urgent. Or sometimes people say, oh, I was working on this hobby project and I came across something, and I didn't want to bring it up at work in work hours, because it's not really about work.
But I've been learning this thing about Rust and I don't really know this detail, or whatever. That maybe helps them with their job. Maybe it's not on a project that's job related, but leveling up at Rust is still going to be good for your job. So people will come and bring questions like, oh hey, I saw this thing go by earlier in the week where people were talking about this new feature, and I was curious what people thought or how it works, I don't really understand why people care about this. Or maybe they're like, I have this tricky bit of code and I can't really figure out why the borrow checker's mad at me, or whatever. And so we have both of those forms of explicit time and space to help people with the Rust-related problems that they have. In terms of specific things that have popped up, I think the biggest thing, and we've talked about it a little bit, but we haven't really done as great a job of getting our perspective on the conversation out there, a really big thing that's popped up with us, and that some people are talking about for sure, is async cancellation. For those of you who aren't familiar, async Rust has what I think is a really great cancellation model conceptually, which is that futures don't do anything until they're polled.
And to cancel a future, all you've got to do is just drop it, and then it won't ever be polled again, and so it's effectively canceled. And that's cool, but that also means that there are subtle issues where you don't realize that a future can be canceled, because dropping something is not obvious in Rust all the time. And so there are some patterns you can get into where something will cancel and you don't realize it, and that will lead to logic bugs. It's a thing that surprises people, because "if it compiles, it works" is not literally true, but it definitely can feel like it's true sometimes. A lot of people's experience with Rust early on is, oh my god, it catches all these problems at compile time, and I don't have these bugs that I used to always have to worry about, because now the compiler catches them for me, and that's great. And then they go into async and they hit their first time where something gets canceled where they don't expect it to, and it feels like walking that back. They're back in the land of, oh my god, the compiler doesn't catch this problem for me anymore, and now I feel like these promises of Rust catching all these things are a little not as true as they used to be. I think some of that is just a natural counter-reaction to how much Rust handles for you in most cases, but it is true that async cancellation issues are generally not something you can find out statically. And they can also be surprising. And that becomes itself surprising when you're so used to Rust catching things aggressively. And so we've had to try to figure out, okay, here are some patterns to be avoided, or here are some libraries for trying to be a little more explicit about cancellation.
I'm totally drawing a blank off the top of my head, but there have been some experiments that some of my co-workers have tried to do to figure out some patterns like that. We've wanted to talk a little more about these problems publicly, but have not actually had the time, because it's always super busy when you're doing a zillion things, to get out there with all our experiences. But it's definitely a thing that the async folks are familiar with being an issue overall, and I think that anyone who does Rust async at scale has eventually either found out about these bugs, or maybe you've had some mystery bugs you can't track down that you'll eventually discover are cancellation related. That's definitely an area where we've found stuff to be a little surprising, and we have started to get out there a little bit and talk about it sometimes.
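The cancellation model described above (a future does nothing until polled, and dropping it cancels it) can be sketched with only the standard library. This is a minimal illustration, not Oxide's code: `NoopWake`, `YieldOnce`, and `cancel_by_drop` are hypothetical names invented for the example, and the future is polled by hand instead of by a real executor.

```rust
use std::cell::Cell;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Waker that does nothing; enough for polling a future by hand in a demo.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Hypothetical helper: returns `Pending` once, then `Ready`.
struct YieldOnce {
    yielded: bool,
}

impl Future for YieldOnce {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        if self.yielded {
            Poll::Ready(())
        } else {
            self.yielded = true;
            Poll::Pending
        }
    }
}

/// Returns the step counter seen before the first poll, after it, and after the drop.
fn cancel_by_drop() -> (u32, u32, u32) {
    let steps = Rc::new(Cell::new(0u32));
    let inner = Rc::clone(&steps);

    let fut = async move {
        inner.set(1); // runs on the first poll
        YieldOnce { yielded: false }.await; // suspension point
        inner.set(2); // only runs if the future is polled again
    };
    let before_poll = steps.get(); // creating the future ran nothing

    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    assert!(fut.as_mut().poll(&mut cx).is_pending());
    let after_poll = steps.get();

    drop(fut); // "cancellation": the rest of the async block never runs
    let after_drop = steps.get();

    (before_poll, after_poll, after_drop)
}
```

Calling `cancel_by_drop()` returns `(0, 1, 1)`: nothing ran before the first poll, the first half ran when polled, and the second half silently never ran once the future was dropped. Nothing in the type system flags that skipped second half, which is the kind of surprise described above.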
Matthias
01:04:00
You said the drops are sometimes not obvious in Rust and that can lead to subtle bugs with async execution. Can you give me an example of that?
Steve
01:04:09
Yeah, so I'm going to be very vague because, to be honest, this is not an area of the product I work on personally, and when I was in these conversations was more like six months ago, so I'm a little bit out of the loop. But a big concept in this area is whether something is cancel safe. What that means is, sometimes you have some code where, if you drop it in the middle of its operation, you need some sort of cleanup, or you need to notify someone or something. And that won't necessarily happen if an operation is canceled in the middle of it occurring. So for example, say that you're doing a select between some sort of timeout and some sort of receive on a channel. If the timeout finishes first, while you're calculating some sort of value or waiting on the receiver, then that means the other future in that select gets dropped. If you're doing a select, it kind of inherently means that you're going to drop the other things that did not finish first. And so it depends on how you've written those futures. Obviously, if you're waiting on a channel and you drop it, that's fine, you're just no longer listening on the channel. And say you're just sleeping and you drop that, it's fine, you're no longer sleeping. But say that the thing that happens first is the timeout. Maybe that calculation needed some sort of cleanup step, and now, because it's timed out, that future just gets dropped. Well, if you didn't do something to ensure that the cleanup would happen properly, then maybe that doesn't necessarily work. And so this is code that is totally fine if all the futures are, say, cancel safe, if they have some sort of ability to do that kind of thing, but, you know, if
they don't, then that can be an issue. And so, yeah, you can lose data sometimes. So for example, say in one future you're writing some data inside of a loop, and then that gets canceled in the middle of it. Well, that maybe means you only wrote part of the data, but you never wrote the final part of the data. And so that's an example of a problem that can occur if something is not fully cancel safe. This is where a lot of the discussion about async drop comes in, and a lot of that stuff, if you've seen that discussion in the Rust async world. And so, yeah, I don't know. I guess that's a high-level example of what's going on.
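The select-versus-timeout scenario and the partial write can be sketched together. This is an illustration under stated assumptions, using only the standard library: `select_with_timeout` is a hand-rolled stand-in for a real `select` (it polls both futures in turn and drops the loser), and `YieldTimes` and `write_all` are hypothetical helpers that simulate waiting on I/O or a timer.

```rust
use std::cell::RefCell;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Returns `Pending` the given number of times, then `Ready`.
struct YieldTimes(u32);

impl Future for YieldTimes {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        if self.0 == 0 {
            Poll::Ready(())
        } else {
            self.0 -= 1;
            Poll::Pending
        }
    }
}

/// Writes three chunks, suspending after each one as if waiting on I/O.
async fn write_all(sink: Rc<RefCell<Vec<&'static str>>>) {
    for chunk in ["header", "body", "footer"] {
        sink.borrow_mut().push(chunk);
        YieldTimes(1).await;
    }
}

/// Hand-rolled two-way select: true if `work` finished first, false if `timeout` did.
/// Whichever future loses is dropped, unfinished, when this function returns.
fn select_with_timeout(
    work: impl Future<Output = ()>,
    timeout: impl Future<Output = ()>,
) -> bool {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut work = Box::pin(work);
    let mut timeout = Box::pin(timeout);
    loop {
        if work.as_mut().poll(&mut cx).is_ready() {
            return true;
        }
        if timeout.as_mut().poll(&mut cx).is_ready() {
            return false;
        }
    }
}

fn partial_write_demo() -> (bool, Vec<&'static str>) {
    let sink = Rc::new(RefCell::new(Vec::new()));
    let finished = select_with_timeout(write_all(Rc::clone(&sink)), YieldTimes(1));
    let written = sink.borrow().clone();
    (finished, written)
}
```

Here `partial_write_demo()` returns `(false, vec!["header", "body"])`: the timeout wins on the second round of polling, the writer is dropped mid-loop, and `"footer"` is never written. The writer in this sketch is not cancel safe in the sense described above, since nothing records or repairs the half-finished write.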
Matthias
01:06:33
Can you quickly explain what async drop means?
Steve
01:06:35
Yeah, so there's a Drop trait in Rust, and when something goes out of scope for the final time, then the Drop trait gets called and that runs some code. This is akin to a destructor in many other languages, especially OOP languages, and definitely conceptually sort of similar. But in the async world, if a future doesn't get polled anymore, there's no equivalent hook, right? The Drop trait kind of gives you the ability to hook into, hey, this object is about to be destructed or destroyed, so do something then. A classic example is Box. You allocate memory up front, and then when drop runs, it deallocates the memory. A future doesn't have any sort of similar mechanism. If you were to allocate some memory manually in a future and then it no longer gets polled, now that memory is leaked, because there's no equivalent hook point to say, destroy that memory that was allocated. And so there have been a lot of different proposals for what an async drop trait could look like and how that might happen, or how we might want to solve the problem in a different way. There's been a discussion in the async working group for a couple of years now.
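The synchronous half of that description, a `Drop` implementation that runs when a value goes out of scope, looks roughly like the following sketch. `Resource` and `scope_demo` are hypothetical names used only for illustration; the shared log exists so the order of events can be inspected afterwards.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// A value that records its own destruction in a shared log.
struct Resource {
    name: &'static str,
    log: Rc<RefCell<Vec<String>>>,
}

impl Drop for Resource {
    // Runs automatically when the value goes out of scope.
    fn drop(&mut self) {
        self.log.borrow_mut().push(format!("released {}", self.name));
    }
}

fn scope_demo() -> Vec<String> {
    let log = Rc::new(RefCell::new(Vec::new()));
    {
        let _a = Resource { name: "a", log: Rc::clone(&log) };
        let _b = Resource { name: "b", log: Rc::clone(&log) };
        log.borrow_mut().push("end of scope reached".to_string());
    } // `_b` is dropped first, then `_a` (reverse declaration order)
    let entries = log.borrow().clone();
    entries
}
```

`scope_demo()` returns `["end of scope reached", "released b", "released a"]`. The point where each `drop` runs is fixed by the scope structure of the code, which is what makes the synchronous case comparatively easy to reason about.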
Matthias
01:07:42
Why can't you just implement drop for the future?
Steve
01:07:45
Because drop is synchronous. It is a function, not an async function, and therefore you're sort of calling something synchronous inside of the async code. And plus, the other reason is that the definition of dropping it, that it's no longer being polled anymore, is kind of a fuzzy definition. If I don't call poll for an hour and then I call it again later, technically I've called it again, but during that hour, I may never know if it's ever going to get called again. You know what I mean? And so there's also a little bit of fuzziness there in the definition. It's much more straightforward with synchronous code, I mean, you might have a sleep call or something, but the point is you can always statically know, okay, eventually this point is going to happen where it's never going to be used again. In async stuff that can be a little more tricky or dynamic. But also, this is not an area I work on specifically, so I can only really give a higher-level answer on that, yeah.
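To make the "drop is synchronous" point concrete: a value held inside a future can still implement `Drop`, and that code runs when the future is dropped, but `drop` is an ordinary `fn`, so it cannot `.await` anything. The sketch below assumes hypothetical names (`CleanupGuard`, `cancelled_future_cleanup`) and polls the future by hand with a no-op waker.

```rust
use std::cell::RefCell;
use std::future::Future;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Wake, Waker};

struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Guard whose cleanup is plain synchronous code.
struct CleanupGuard {
    log: Rc<RefCell<Vec<&'static str>>>,
}

impl Drop for CleanupGuard {
    // `drop` is a normal function: no `.await` is possible in here,
    // so any cleanup that itself needs async I/O cannot be expressed.
    fn drop(&mut self) {
        self.log.borrow_mut().push("sync cleanup ran");
    }
}

fn cancelled_future_cleanup() -> Vec<&'static str> {
    let log = Rc::new(RefCell::new(Vec::new()));
    let guard = CleanupGuard { log: Rc::clone(&log) };
    let task_log = Rc::clone(&log);

    let mut fut = Box::pin(async move {
        let _guard = guard; // held across the await, so it lives in the future's state
        task_log.borrow_mut().push("started");
        std::future::pending::<()>().await; // never completes
        task_log.borrow_mut().push("finished");
    });

    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    assert!(fut.as_mut().poll(&mut cx).is_pending());

    drop(fut); // cancel: the guard's synchronous `drop` runs here
    let entries = log.borrow().clone();
    entries
}
```

`cancelled_future_cleanup()` returns `["started", "sync cleanup ran"]`. The guard does fire, but only synchronously, and `"finished"` is never logged. Cleanup that would itself need to await (flushing a socket, notifying a peer) has no place to run, which is the gap the async drop proposals are trying to address.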
Matthias
01:08:44
I think the slippery slope here is that when you go forward with that proposal or you go forward with that idea of introducing async equivalents of sync traits, then you end up maybe replicating parts of the standard library. We see something similar with read and async read or write and async write.
Steve
01:09:05
Yeah, I mean, the thing is that programmers love DRY, and they love trying to unify concepts that seem similar but are different. And I think that's true, but I also think that sometimes similar operations are just fundamentally different and you can't really abstract over them in the same way. Read, AsyncRead, and AsyncWrite are great examples, where we have the sync Write and sync Read APIs. The thing is that for synchronous read and write, those APIs are very straightforward and have existed for a very long time, and they make sense, and there's only one implementation of them. But one of the reasons why AsyncRead and AsyncWrite have not been stabilized yet in Rust is because there are at least two meaningfully different proposals on how to implement those APIs. And so, yeah, while conceptually it's just "read, but async", I think that "but" is really load-bearing. This is maybe stretching an analogy a little bit too far, but addition and subtraction are both conceptually the same thing at some high level. They're both a binary operation with an operator in the middle. And you want to be able to say, well, maybe those are the same thing, and so we should abstract over both of them. And that's how you get monoids. But that doesn't always mean that abstraction is useful, necessarily, because sometimes it can paper over the details that matter. At a high level, the idea that I'm applying a binary operation to two things and it does something is, sure, sometimes working at that level of abstraction makes sense. But sometimes you really care if it's actually addition or actually subtraction. You know what I mean? And so I think that a lot of people want to reach for the idea that we should inherently be able to abstract over sync and async.
But I think that they are different enough things, with different enough semantics, that doing so, at least, let's put it this way: for Rust specifically, a language that cares about the low-level implications of what you're doing, where you need to be able to integrate with an underlying system, where the details matter to you, I think that over-abstracting these things is a mistake. I think that in a language like Haskell, which is the reason I reached for monoid a second ago, in Haskell you abstract over sync and async because conceptually they're the same thing. But Haskell does not have the same performance requirements and low-level requirements and low-level commitments that Rust does. And so it can afford that abstraction, because that abstraction costs you something. But in Rust, I think that the details are significant enough, and the processes are different enough, that it is important and meaningful to keep them separate. And so I don't think it's inherently a bad thing that you're sort of redoing, in async, some of the sync stuff that's in the standard library. Because also, drop conceptually makes sense. I'm not sure if I fully think that async drop specifically is a good idea, but some sort of analog or way to solve that problem makes sense. I think that read and write are there too, but that's like 95% of it. There's not a whole lot more. The other stuff that we may make async is still, I don't know, it's not a brand new concept. I don't think it's the end of the world, let's put it that way. Some duplication is totally fine and meaningful. And sometimes abstractions are good because they let you work with stuff at a high level, but you also need to be able to do stuff at a low level. And I just think that sometimes that trade-off is that you don't get to use the high-level abstractions.
So, but this is definitely like a really big debate in the Rust world right now.
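The "read, but async" gap shows up in the signatures themselves. `std::io::Read::read` takes a buffer and returns a result; a poll-based asynchronous read also needs pinning, a task context, and a "not ready yet" state. The `PollRead` trait below is a hypothetical one defined only for this sketch (it resembles the poll-based style used by ecosystem crates, but is not any crate's actual trait), and `SlowReader` is an invented type that is not ready on its first poll.

```rust
use std::io::{self, Read};
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Hypothetical poll-based read trait, for illustration only.
trait PollRead {
    fn poll_read(
        self: Pin<&mut Self>,
        cx: &mut Context<'_>,
        buf: &mut [u8],
    ) -> Poll<io::Result<usize>>;
}

/// Reports "not ready" on the first poll, then delegates to a sync reader.
struct SlowReader<R> {
    inner: R,
    ready: bool,
}

impl<R: Read + Unpin> PollRead for SlowReader<R> {
    fn poll_read(
        mut self: Pin<&mut Self>,
        cx: &mut Context<'_>,
        buf: &mut [u8],
    ) -> Poll<io::Result<usize>> {
        if !self.ready {
            self.ready = true;
            cx.waker().wake_by_ref(); // ask to be polled again
            return Poll::Pending;
        }
        Poll::Ready(self.inner.read(buf))
    }
}

/// Returns (bytes read synchronously, whether the first poll was pending,
/// bytes read on the second poll).
fn read_both_ways() -> (usize, bool, usize) {
    let mut buf = [0u8; 4];

    // Synchronous read: one call, one answer.
    let mut sync_reader: &[u8] = b"abcdef";
    let n_sync = sync_reader.read(&mut buf).unwrap();

    // Poll-based read: the caller must handle `Pending` and poll again.
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut slow = SlowReader { inner: &b"xy"[..], ready: false };
    let first_pending = Pin::new(&mut slow).poll_read(&mut cx, &mut buf).is_pending();
    let n_async = match Pin::new(&mut slow).poll_read(&mut cx, &mut buf) {
        Poll::Ready(Ok(n)) => n,
        _ => 0,
    };

    (n_sync, first_pending, n_async)
}
```

`read_both_ways()` returns `(4, true, 2)`. Even in this toy form, the asynchronous version has design choices the synchronous one never faces (who owns the buffer while the read is pending, how readiness is signalled), which is one way to see why a single abstraction over both is contested.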
Matthias
01:12:42
But at the same time, I couldn't agree more. You really summarized it super well, because this is one thing that I really love about Rust: its explicitness, and not having these leaky abstractions, because you make it explicit that there is a difference between those abstractions. And I really like that part. But at the same time, the async ecosystem is pretty new. The rest of Rust, the sync part, has matured in the last 10 years. And since I have you, I might as well just ask you, because you have a lot of experience and you've been in this community for a long time: what would you say is one big mistake that the Rust language made? Something in the standard library, or anything regarding the syntax or its semantics, that you would see as a historical mistake and would want to change.
Steve
01:13:32
Yeah, so I have a joke answer that's funny, and then I have a time where I thought that happened but I was wrong, and then I definitely have a good answer for one that's real. But I'm going to tell the first two anyway, to let me think about what it actually is, because I want to make sure I'm giving a good answer. So the joke answer I always give is that String should have been named StrBuf. And that's just because we have Path and PathBuf. I think that naming, and the fact that it's capital-S String, I feel like there'd be a lot less "Rust has 36 different string types" if we acknowledged a little more that it's a buffer, it's a mutable, growable string, as opposed to some other kind of string or whatever. And so that's what I've always joked: I want Rust 2.0, I don't want anything changed, except I definitely want StrBuf instead of String. So that's kind of a silly answer. There was a time where I thought Rust was making a huge mistake and I was definitely wrong, and that is the postfix await syntax. For people who weren't around when we were doing async/await's design, there was a point in time where it was a hugely contentious topic how to write await. Should it be a prefix thing like JavaScript, should it be some sort of other thing, or should it be what it is today, which is the dot await? And I am really conservative when it comes to programming language design, actually. At the time, I was like, we have a lot of people who are coming from JavaScript, and JavaScript and C# both do this prefix await shenanigans. And I think that when it's not clear which way you should go, you should choose to be conservative when it comes to language design.
In many other areas of life, I do not believe this, but when it comes to programming language stuff, I think being conservative is generally pretty good. And so I thought it was a really big mistake to add this weird syntax for async/await. However, after having had to write a whole ton of async code, prefix await would have been a huge mistake. And I'm really glad they did not listen to me personally and went with the postfix await anyway, because it's just clearly superior in every possible way. And brilliantly enough, people figured out how to address that major concern that I had, which is what happens when JavaScript people come and write the wrong thing. And that is diagnostics in the compiler. A really cool thing that Rust does, that it doesn't have to do, is that sometimes the Rust compiler will parse code that's wrong in order to give you a great error message. And so if, instead of foo.bar.await, you write await foo.bar, the Rust compiler knows how to parse that, even though it's not correct Rust, just so it can deliberately say, hey, this is not how you write await, you would write it like this, foo.bar.await, do that instead. And that is a really trivial way that anyone who is writing in their old style, coming from another language, won't get confused. They will get helped into writing the correct thing. And so that's an area where I definitely was, at the time, on the wrong side. And I think I've thoroughly admitted that that was a mistake. In terms of where Rust made mistakes that I think are a little more serious or real, there were a couple of things, and I don't want to say that no one cared about them at 1.0, but there are some things that were designed in a certain way, and there was so much work to do that they never really got fully, completely, totally thought out.
And I think the biggest one of those is the module system. I really like Rust's module system. I think a lot of stuff that it does is good, but it is a common problem for people coming to Rust, and it has been. In Rust 2018, we made some changes to some things that made it a little bit easier to understand. But it is the number one thing: when people read the book and they say, I was confused by something, they say the module system. And I don't know why that is. I don't necessarily have a constructive answer for how we should fix things instead. But it was something that was built a very, very long time ago. I don't even remember by whom initially. And then there was just so much other stuff to do leading up to Rust 1.0. There was never a moment where it was like, okay, we need to really make sure to think through, is this how we want the module system to work? And then even in 2018, when we did, there were a bunch of legacy constraints, like, okay, we want to change how some of this stuff works, but we have to make sure that it's not too different for all the people who currently know Rust, because it would be very, very bad if there were two completely different systems. And so some of that was tied up that way. And I think in general it's not really just the module system, but also name resolution in general, which is, when you type an identifier, how does Rust determine what identifier you mean? The module system is part of that whole situation. And a lot of that just was made and never really had the time to be fully thought through and designed by someone before 1.0 came out. And I think there's a way that you could simplify a lot of that stuff. This gets into stuff that most Rust programmers never even really think about, but types and modules are two separate namespaces.
And so you could have a module with the same name as a type, and the Rust language will disambiguate between the two, and it's totally fine. And I think there's another one too that I'm totally drawing a blank on, so there are like three different versions of namespaces, and they can all be overloaded. And this leads to very confusing outcomes if people name things in strange ways. And that's all complexity that the compiler has to deal with: when it's trying to look up and figure out a name, it's looking through all those things. I think all of that probably could have been pretty radically simplified, and it's something that's basically impossible to deal with at this point. It's just kind of baked in. Macros are also kind of like that. Nick Cameron really wanted to do a rebuild of macros. And so that's why, before Rust 1.0, we renamed the macro keyword to macro_rules and reserved the macro keyword, because the idea was that someday there would be a new macro system that would be built. And that's just never happened. That all disappeared at some point. And so that's definitely a thing that's kind of similar. Although I don't think macros are inherently bad, there definitely are other options that could have been considered a little more.
And proc macros are sort of wonderful, but they are also really complicated and have some really weird technical angles. Like, you have to make a proc macro its own crate. Why do you have to do that? Well, because proc macros are kind of an outgrowth of compiler extensions, which existed a long time ago. And why were compiler extensions the way they were? Well, back in the day, it was basically: you can ask the compiler to just open up a library you write and mess with the compiler's internal data structures and produce something. Proc macros are a crazy strong aspect of the Rust ecosystem, but that doesn't mean the feature had to be implemented the way that it's implemented, and now that's kind of just there forever. And also, it's good enough that there's nobody who's truly invested in making a better one. And so that whole area is, again, sort of similar. Is it the worst? No. Could it have been better? Yes. And I think those are some things that a Rust++ could meaningfully address. I don't think they're necessarily enough to justify having a whole separate new language, but they're definitely areas that I think are big warts that maybe could have been fixed, but just aren't really possible to fix now.
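The namespace overloading mentioned in this answer can be tried directly. As far as the Rust Reference describes it, the main split is between type, value, and macro namespaces, with modules living alongside types in the type namespace; so the sketch below reuses one name for a module, a function, and a `macro_rules!` macro rather than for a module and a struct. The name `thing` is arbitrary and chosen only for illustration.

```rust
// One identifier, three namespaces.

mod thing {
    // `thing` as a module (type namespace)
    pub fn name() -> &'static str {
        "module"
    }
}

// `thing` as a function (value namespace)
fn thing() -> &'static str {
    "function"
}

// `thing` as a macro (macro namespace)
macro_rules! thing {
    () => {
        "macro"
    };
}

/// The syntax at each use site tells the compiler which namespace to search.
fn namespaces_demo() -> (&'static str, &'static str, &'static str) {
    (thing::name(), thing(), thing!())
}
```

`namespaces_demo()` returns `("module", "function", "macro")`. The code compiles without complaint, which illustrates both halves of the point above: the compiler can always disambiguate, and a human reader may find the result confusing.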
Matthias
01:21:30
All right, that was a really nice side quest into the Rust specifics. Before we close, I also wanted to quickly come back to Oxide because we covered the hardware, the firmware, we covered Hubris, we went up the stack to illumos and bhyve. But I wonder what's above that, the user-facing things, the interface with the system. Can you talk a little bit about that?
Steve
01:21:55
Yeah. So we're super big believers in the OpenAPI spec. Is it perfect? No. Is it good enough? Yes. And so the way that you actually interact with the Oxide rack is via an API. The same system that's running the control plane, that's determining what VMs get scheduled where and things, is exposing an HTTP API to users. And what that means is, not only can you use client libraries to, say, spin up a VM, you can use the Oxide CLI and be like, give me a VM that looks like this, and it will make one, and it does that by making HTTP requests. But it also means we have a web console, which is just like the AWS console or anything else, where it's a website you can load up, and you can click buttons and manage your rack that way. And all of that is built off of this idea of the rack exposing an OpenAPI definition, and then being able to generate clients on top of it. And so we are using TypeScript for the front end of the website, of the console. We're not using Rust and Wasm or any of the front-end Rust technologies. And that's basically because TypeScript gives us like 85% of what we would want out of Rust over JavaScript. Having strong typing is really, really important to us, and all those kinds of things. And the Rust front-end web ecosystem is even younger than many other parts of the Rust ecosystem. I definitely don't want to say it's not usable at all, but we didn't necessarily want to bet on something that was that young while we're already doing so many other things. And plus, a lot of the people who do front-end web stuff are already familiar with TypeScript, and so it's easier for them to pick up that sort of stuff, although many of them also do a little bit of Rust on the side, because they interact with other parts of the stack too.
Yeah. This has kind of become a very common pattern for building applications at Oxide in general: a Rust server back end, and then a TypeScript front end, with an API layer expressly in the middle. And so we actually wrote a server framework called Dropshot. A lot of people use Axum, or whatever other web framework in Rust, and we wrote our own specifically because, at the time that we were building it, there was not a lot of stuff that had deep OpenAPI integration. A part that stinks about OpenAPI is trying to write the definition by hand. So we just don't do that. With Dropshot, our server framework, you write the endpoints yourself, and then you can pass a command line flag to the server that says, hey, please generate an OpenAPI document for me. And it will look at all the code that you wrote for your endpoints, and it will generate the full OpenAPI specification document for you. So you don't need to write it by hand. And then we have a generator that's able to read in an OpenAPI document and spit out a client library, not just for TypeScript, but in the web context that's the most important one. It spits out a TypeScript library that knows how to interface with things because of that OpenAPI document. And that means that, in practice, when I'm building web stuff at Oxide, I write my server-side definition.
I say, hey, please generate stuff, and regenerate the client in TypeScript. And when I switch back over to my TypeScript file, it'll give me a type error that says, hey, you're not passing this correctly, or whatever. And so I get full type safety the whole way up through the stack, which is really, really cool and useful. And so, yeah, we've been very happy with using TypeScript. And that's just one area of the product where we're not using Rust for everything, but that's based on the pragmatic decision to engage with that ecosystem deeply too. So yeah, it's been very, very nice.
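The workflow described above (endpoints defined once in server code, with the API description derived from them instead of written by hand) can be sketched in miniature. This is not Dropshot's API: `Endpoint`, `openapi_document`, and the example path and operation names are all hypothetical, and the output is a tiny JSON fragment in the general shape of an OpenAPI document, not a complete or validated one (it omits responses and schemas, for instance).

```rust
/// Hypothetical endpoint description. A real framework would derive this
/// from handler signatures and attributes instead of a hand-filled struct.
struct Endpoint {
    method: &'static str,
    path: &'static str,
    operation_id: &'static str,
}

/// Builds a minimal OpenAPI-shaped JSON string from the endpoint list.
/// Assumes each path appears once; a fuller version would group methods by path.
fn openapi_document(title: &str, endpoints: &[Endpoint]) -> String {
    let paths: Vec<String> = endpoints
        .iter()
        .map(|e| {
            format!(
                r#"    "{}": {{ "{}": {{ "operationId": "{}" }} }}"#,
                e.path, e.method, e.operation_id
            )
        })
        .collect();

    format!(
        "{{\n  \"openapi\": \"3.0.3\",\n  \"info\": {{ \"title\": \"{}\", \"version\": \"0.1.0\" }},\n  \"paths\": {{\n{}\n  }}\n}}",
        title,
        paths.join(",\n")
    )
}

/// Example with made-up endpoints.
fn example_document() -> String {
    let endpoints = [
        Endpoint { method: "get", path: "/v1/instances", operation_id: "instance_list" },
        Endpoint { method: "post", path: "/v1/instances/create", operation_id: "instance_create" },
    ];
    openapi_document("Example API", &endpoints)
}
```

Because the document is computed from the same data that defines the endpoints, adding or renaming an endpoint changes the generated description automatically. A client generator reading that description would then surface the change as a type error on the client side, which is the end-to-end effect described above.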
Matthias
01:25:41
People who are not familiar with OpenAPI might think, oh, that's just a lot of extra work that you do on top of what you already do, which is an infinite number of other yaks to shave. So I wonder, what are the practical benefits of having an OpenAPI spec? What can you do with that?
Steve
01:25:57
Yeah, so some of it is just very straightforward: type safety is cool. We like type safety. It helps; it doesn't solve every problem. But also, some of it is that people need to write applications against our API, and everybody is using a bunch of different libraries. The TypeScript one is interesting for its own reasons, and the non-TypeScript ones are interesting for their own reasons. So I'm going to talk about those ones first, because they're simpler. In general, people definitely want Go, they definitely want Rust, and then there's maybe some other stuff, but those are the two big ones. But we do need to support a bunch of different languages that people are going to want to write applications in, and every company has their own stack. Maybe they're a Java shop, and so they really want a Java library. So one of the benefits is being able to meet customers where they are. But a lot of the, I'm going to say, non-TypeScript APIs are not very interesting, because it's just a general REST client, right? You're making HTTP calls. There's nothing super novel there. The benefit is in not having to handwrite every single one in every single language, and having something reasonable pop out in each language. I mean, obviously, handwritten ones can be nicer than the more generic ones. But we're also not using the generic OpenAPI tooling. We wrote our own generator to generate ones that only need the features that we use, and that therefore produces something that's a little bit nicer. But, okay, a slightly nicer REST API is not that interesting. One of the things that I think is really cool is on the TypeScript and front-end side.
Is if you go to the console repo, which is what we call the front end to this thing, the web console, if you go to that on GitHub, there'll actually be a link to a little Vercel thing that lets you play with the console in your browser. And the reason that that works is that we can use the OpenAPI definition to also generate a mock server and then run it in a web worker in the browser. And so you're able to play around and say like, spin up a server. And then it will like pretend to spin up a server by running that in the mock worker in your browser. And then when you go to list all servers, it'll have the one that you spun up. And when you say spin up another one, even though there's fake page loads in there, basically, it's able to remember all that stuff and you're able to get some very basic logic like that kind of working. And I just think that's such a cool demo to be able to actually play around and you can see what is going on or what it feels like to do it without needing to spin up a backend at all. 
And it's only possible... it's kind of funny: way earlier, I was talking about how at the firmware level, having interface layers is no good. But here, on the very front end, at the very highest levels, we're actually interfacing with the external world. That is not something we can control, and it's somewhere other people want to use other technologies to interact with us. So at that level it is worth it to actually do the work to put in that kind of universal interface. OpenAPI, like I said before: is it the best API description language that ever existed? No. But it is one that a lot of people use and know, and it works well enough. So that's a good example, I think, of us making a totally different trade-off at a different part of the stack. Internally speaking, we don't need to collaborate with folks, and so those layers don't make sense. But at the highest levels, externally facing, we do need to do that, and so it's worth putting the time and effort in at that layer.
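To make the mock-server idea concrete, here is a minimal sketch of a stateful in-memory mock backend. It is not Oxide's implementation: their mock is generated from the OpenAPI definition and runs in a browser web worker, whereas this sketch is hand-written plain Rust. The `MockApi` type, the `/instances` route, and the `Instance` fields are all hypothetical names chosen for illustration. What it shows is only the behavior described above: a "create" request is remembered, so a later "list" request returns it.

```rust
// Hypothetical in-memory mock backend. All names and routes are
// illustrative assumptions, not Oxide's real API.
#[derive(Debug, Clone, PartialEq)]
struct Instance {
    id: u32,
    name: String,
}

#[derive(Debug, PartialEq)]
enum Response {
    Created(Instance),
    List(Vec<Instance>),
    NotFound,
}

#[derive(Default)]
struct MockApi {
    instances: Vec<Instance>, // state kept between requests
    next_id: u32,
}

impl MockApi {
    /// Route a request to a handler. Nothing leaves the process:
    /// "creating" an instance only records it in memory.
    fn handle(&mut self, method: &str, path: &str, body: &str) -> Response {
        match (method, path) {
            ("POST", "/instances") => {
                self.next_id += 1;
                let instance = Instance {
                    id: self.next_id,
                    name: body.to_string(),
                };
                self.instances.push(instance.clone());
                Response::Created(instance)
            }
            ("GET", "/instances") => Response::List(self.instances.clone()),
            _ => Response::NotFound,
        }
    }
}

fn main() {
    let mut api = MockApi::default();
    api.handle("POST", "/instances", "web-1");
    api.handle("POST", "/instances", "web-2");
    // The mock remembers both instances across requests.
    println!("{:?}", api.handle("GET", "/instances", ""));
}
```

Because the front end only ever talks to the interface, it cannot tell this kind of mock from a real backend, which is what makes a backend-free demo possible.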
Matthias
01:29:18
That's so incredible to even think about having such a use case, because I don't think you had that in mind from the very beginning. It just evolved over time, and then you had the ability to put that test server on the web. It must have been a really magical moment when you as a company realized that you could do that.
Steve
01:29:39
Yeah, I think that happened before I showed up, personally. But definitely when I was introduced to it, when I got shown it, I was like, this is so cool, I want to talk about this all the time, because I think it's just a really, really neat way of doing things, for sure.
Matthias
01:29:50
And no one else does it. These are some of the things that are so exceptional; you rarely hear about them. Another thing that I think you do, which not many other companies necessarily do, is to be very open about discussions and wiring in the community. You have this RFD process; I think it stands for Request for Discussion.
Steve
01:30:16
Yeah.
Matthias
01:30:17
It reminded me of Rust's RFC process. Is it modeled after that?
Steve
01:30:22
So it definitely takes some influence from that, but it's also my understanding that Joyent used RFDs to talk about stuff internally as well. And both things were definitely inspired by the IETF RFC process. I think just in general, the idea that there is a written document that you come to consensus around, and then use to move forward, is the high-order bit. So I definitely think it takes some inspiration from a bunch of different places. But at its core, all those processes are about the same sort of thing: when you need to get a lot of people on the same page to do something, you need some way to achieve consensus. And we really value the written word at Oxide, because that's the thing that truly scales. If you have a meeting, you can only have a meeting with so many people before it really falls apart and doesn't work. But a bunch of different people commenting on some text don't have to be there at the same time; it's asynchronous instead of synchronous. We're a distributed company and we have people that live all over the place, so that matters a lot. And it lets people work at their own pace. In a synchronous meeting, if I want to think about something, give it a half an hour's worth of thought, that's not really feasible. In a meeting with 10 people, I can't just be like, well, I want to sit here and think about this for an hour before I say what I want to say. It doesn't work, because you're wasting everyone's time. Whereas with an RFD, you're able to say, okay, I'm going to sit with this and really think about it, and you're not blocking anyone else. So I think there are a lot of advantages to doing stuff that way, for sure. But yeah.
Matthias
01:31:53
I think we covered a lot of ground, going from hardware all the way up to the software interface and web applications. It has been a crazy tour. The one thing that I wonder about, and maybe some listeners might also be curious about: if Oxide started in 2024, now that the programming language landscape has changed a bit, would you write Oxide in Zig?
Steve
01:32:19
No, and there are a couple of reasons for that. I really like Zig conceptually. I've known Andrew for many years; I consider him a friend. He himself says don't use Zig for production yet. It is still changing massively all the time. You have TigerBeetle and you have Bun, and I think there's maybe one other company, but that's it. At this point, Rust is still the safe choice. There are millions and millions of lines of Rust in production that everyone touches every day. Even just Cloudflare: 10% of the internet hits Rust code all the time. There are also lots of cool things about Zig that I wish Rust would steal, honestly. But for me personally, the ironclad memory safety guarantee that Rust has, versus the "we probably fix most memory safety bugs" that Zig has, is a deep philosophical difference. For me personally, 80% of the problem being solved is not worth it, but 100% of the problem solved is. Obviously Rust is more like 99.99% of the time, but I still think it's a very significant advantage. And more and more, we are finding that 100% memory safe by default with an escape hatch is the correct choice for languages. Obviously, I would say that, because I'm a Rust person and not a Zig person, but I do think that's meaningful. So I do think that the exact same choices would still be made today. While it is true that there is Zig, and there are a couple of other upcoming languages that people are working on, they're still very young and very early, and they are not attempting to solve memory safety fully in the same way that Rust is.
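As an illustration of "memory safe by default with an escape hatch", here is a small Rust sketch. The function names `get_checked` and `get_unchecked_byte` are made up for this example; the standard-library calls `slice::get` and `slice::get_unchecked` are real. The safe path turns an out-of-bounds index into `None`, while the escape hatch skips the bounds check and must be explicitly marked `unsafe` at both the definition and the call site.

```rust
/// Safe by default: an out-of-range index yields `None`
/// instead of reading past the end of the slice.
fn get_checked(data: &[u8], index: usize) -> Option<u8> {
    data.get(index).copied()
}

/// Escape hatch: skips the bounds check.
///
/// # Safety
/// The caller must guarantee `index < data.len()`;
/// anything else is undefined behavior.
unsafe fn get_unchecked_byte(data: &[u8], index: usize) -> u8 {
    // SAFETY: the caller guarantees that `index` is in bounds.
    unsafe { *data.get_unchecked(index) }
}

fn main() {
    let data = [10u8, 20, 30];

    // Ordinary code stays on the checked path.
    println!("{:?}", get_checked(&data, 1));
    println!("{:?}", get_checked(&data, 9));

    // Opting out is visible in the source and easy to audit.
    // SAFETY: 2 < data.len().
    let last = unsafe { get_unchecked_byte(&data, 2) };
    println!("{last}");
}
```

The point of the design is that the unchecked operation cannot be reached by accident: every use is spelled `unsafe`, so reviewers know exactly where the compiler's guarantees stop.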
Matthias
01:34:06
I think a lot of people are really excited about Oxide because you do so many things right. You have vertical integration, you have top-notch branding, you have your own podcast, you wire in the community, the RFD process. Everything is done well and it's done with purpose. And this other thing that you always do, and I think that's amazing, is to contribute back to the open source community. A lot of the tools that we talked about today are open source, so people can check out the source code. Can you maybe list a few projects that are open source? And in general, what is the methodology at Oxide to decide how and when something can be open sourced?
Steve
01:34:56
Yeah, so hilariously, I co-wrote an RFD with Bryan on open source policy at Oxide, and I've forgotten to mark that as completed. I think it's one of the ones that's not public, so I should also get on making that public at some point, because it's kind of funny that it itself is not. But essentially, the default position of Oxide is that everything should be open sourced to the extent that we can possibly open source it, and there are a couple of different reasons for that. We also have a little bit of an interesting relationship with open source sometimes. Because we are a company and we're doing so much, we often don't have time to build a community around our projects. I mentioned Dropshot before: it is open source, it is on GitHub, we do accept pull requests, but we're not trying to make it the Rails of Rust, because we don't really have the time to run a full community-managed project. It's more like, this is a thing that we built that's useful for us, and if it's useful for you too, then that's great. Hubris, for example: we take pull requests in the sense that they are open and we'll accept them sometimes, but we also don't really have the time to review everything. If someone were to refactor a major component or something, that's just not a thing that we could accept, because we really need to stay focused on what's good for us. So there's this interesting balance. We want to make stuff available, and sometimes libraries are easier to share with people in a way that makes sense conceptually. But a lot of our open source is kind of like... it's not a source dump. It's not like we're just throwing it over the wall, but we really can't accept a ton of external contributions.
But it's really important to us that it is open source, because one of the big problems, as I mentioned a long time ago in the firmware part of this discussion, is that you may not even know what is running on your computer. If you buy a computer from another vendor, there are whole operating systems hiding in the nooks and crannies of your computer. And we think it's really, really important that when you buy a server from us, you are buying the server from us, and that it is yours. Therefore you get to know what is running on it, because it's your hardware. We shouldn't have secret computer stuff running on the hardware that you purchased from us. That also includes that we don't do software licensing fees. If you buy a server from Dell, you're going to buy the hardware, but you're also going to license the software, and that means it's an ongoing cost. So did you really buy the computer, or are you just paying a big amount up front and renting some of it, you know what I mean? So, a thing that happened recently: we use CockroachDB in the control plane, and Cockroach announced that they're moving from the BSL, the Business Source License, to a proprietary license. They're going sort of closed source again, or, yeah, source available. And a lot of people said, oh, what does that mean you're going to do? And we're like, well, we're sticking with the last version that was Apache licensed, and we're just going to keep doing that. And people were like, oh, why didn't you negotiate a license with upstream? Maybe you could have paid for it, gotten a discount or whatever. And it's like, well, first of all, it's not like we didn't talk to them at all, but the idea that we would be paying a per-rack license fee for that doesn't make sense, because our customers own the hardware after we sell it to them.
So what, are they going to pay Cockroach a licensing fee? That wouldn't be fair on their behalf. So open sourcing the software in general means that our interests are aligned with our customers' interests, which is that we're trying to sell you a computer. I keep going back to that because it's just so funny: on some level, Oxide's business is so straightforward, in a world where so many tech companies' business models are really complicated. We really want to sell you a computer, and then it's your computer. Obviously, support contracts are an ongoing thing that we do, and stuff like that. But it's really important that you should know what's running on the computer that you get. That's true in an ethical sense, but it's also just true in a security sense. It's important to us. We feel that if you install this rack in your data center, it's important that nobody's trying to steal your stuff, you know what I mean? Privacy is probably pretty important to you if you're buying your own computers and racking them in your own data center. You care about making sure that you're the only one that gets to know what's running on your stuff. So being auditable is a really important part of that. Being open source is not necessarily a precondition to being auditable, but it certainly makes being auditable a lot easier, because it means you can literally see the code. Now, there still are a couple of binary blobs that we have to ship. While we did rewrite a lot of the firmware, there still are occasionally bits of the firmware that we can't actually fully open source. And occasionally there's stuff that maybe has an NDA or whatever.
So I'm not going to say it's fully 100% there, but in general, we try to open source everything that we do to the greatest extent possible. And we've also found that there's a benefit there, because sometimes it's just more annoying for stuff to be closed source. If you've ever had to deal with Cargo and Git dependencies because you depend on some sort of closed source library, it's kind of a pain in the butt, right? So it's just easier, even if you never intend for anyone else to actually read the code, to be like, hey, this is actually open source. And we use the MPL by default, which is a really interesting license. Most stuff in the Rust world is Apache 2 slash MIT licensed, and the MPL is the Mozilla Public License. It basically is a weird hybrid between the MIT and the GPL. It says you still need to be open source when you use MPL-based code, but it applies at the file level instead of the project level. So we feel like it's a really nice trade-off between copyleft and, you know, the more Libre licenses. That's sort of our default choice. But if you're integrating with a code base that already has made some sort of choice there, then we need to stick with the same license. So we don't mandate that everything has the exact same license, but we do feel like it's really important in many ways. That's something that we try to do, even if we're not necessarily on the big community-building end of open source, because we're doing so much. We just don't have the time. But we do think it's important to align our business interests with our customers' interests, and open source is a great way to do that.
Matthias
01:41:22
If I had to play the cynic for a moment, I would ask, couldn't you also change the license like Cockroach did and lock in your users?
Steve
01:41:30
In theory. But we also don't really do copyright assignments at all. So we don't have the ability if we've accepted stuff from other people, or if we're building on another project; we couldn't necessarily change it. I think a more cynical question would be, well, why does it matter? Because it's custom hardware anyway, so who's going to be running that on a different computer? And hilariously, a lot of it is actually pretty easy to run on other stuff. But yeah, no, I mean, we could do it. After all, anyone could make any decision at any time. But currently, that's where we're at.
Matthias
01:42:07
That's very fair, and you never know where this approach leads you, because maybe someone finds a very interesting way to use that source code in a completely different context at some point, maybe 10 or 20 years down the road. And this is another thing that I was wondering about: what makes you confident that Rust will be around in 10 years, or might even be relevant? Or, I don't know, let's think about 20 or 30 years down the road. A lot of languages go away over time.
Steve
01:42:38
Yeah. I think that a lot of people don't realize how much production Rust is out there and how many companies truly depend on it. And then there's the idea that, if Rust were to suddenly implode, people wouldn't step up to fix it. Just Meta alone has, I think, on the order of 10 million lines of Rust. It's not 100 million lines, but I think it's more than single-digit millions; it's like between one and 10 million lines of Rust. Amazon: Rust is now involved in S3, in EC2; tons of different AWS services all have Rust at really key points. I mean, Rust is in the Windows kernel already, while people were talking about, oh, is Rust mature enough for the Linux kernel? Which, I mean, Asahi Linux has already built graphics drivers on it. And while it's not in the upstream kernel fully yet, and while there's been a little bit of discussion about that, people are like, is it actually mature there? It's in the Windows kernel. I'm talking to you on a Windows machine. There is Rust code running in my kernel right now. It's just a little bit, but it's expanding. Microsoft is rewriting some legacy Windows stuff in Rust, like GDI, the Graphics Device Interface. They have a port of that to Rust.
And so there are millions upon millions of lines of Rust out there being used for real, important things. It's also worth thinking about what it would mean for Rust to die. Okay, that would mean that the Rust team would somehow not exist anymore, but the code base of Rust itself would still exist. At that point, for those companies, it would be an existential threat to their business for that technology to implode, and so they would have to come up with some alternate way of making that work, because it can't just stop. At this point, Rust has reached escape velocity and is a language that will survive. Now, will it survive in the sense of VB.NET, where it technically exists but not a lot of people use it? Or is it a COBOL, where it's used for some things but not for anything else? I have no idea if it's that kind of legacy. Or is it like C/C++, where it's used for smaller and smaller amounts of things, but still for very important things? I don't know. But it's definitely past the point where it will just disappear at some point, because it is used in too many things that are too meaningful. The United States government is talking about preferring Rust over other languages when procuring things. Is that a thing I feel great about? I don't know. I feel complicated about it. But it's at that level of maturity. So I think a lot of the people who think Rust is a fad are just uninformed about the degree to which Rust is used in industry for real, meaningful production applications. It is absolutely going to live on. I don't know if it's going to be 30 years or 100 years, but it's definitely going to be 30 years.
And maybe in 30 years there'll be the Rust++ and the cool kids will be using that or whatever, but there will still be jobs for Rust programmers, in the same way that Python, super, super hot now, itself started 20 or 30 years ago. So I think that's kind of where Rust will find itself in the future, yeah.
Matthias
01:45:47
That was in 91 or 92 somewhere around that time.
Steve
01:45:50
Yeah, I think so, yeah.
Matthias
01:45:53
Before we close, you now have the opportunity to mention any tools that you want people to check out. They can be tools from Oxide or external open source tools that people might or might not know, and things that you find interesting in the Rust ecosystem.
Steve
01:46:10
Yeah, the two things that I'm most interested in at the moment are not Oxide projects, but they are both written in Rust. The first one is Buck2, which is a build system from Facebook slash Meta. Buck1, without getting into the full history, is kind of similar to Bazel or Blaze, if you're familiar with those from Google. At one point, because I was doing, as I mentioned earlier, the build system stuff for Hubris, I got it in my head: okay, if I were going to make a build system from scratch, what would I do? I read a bunch of papers and I learned about the space. And I said, okay, cool, I would want this general design. And then I was like, okay, well, maybe I'll start writing that in Rust. So I started toying around with the idea of doing it. And then I learned that Buck2 existed. What I found was that it was all the people from all the papers that I read, doing the design I wanted to build, and writing it in Rust. Now, what I'll say is that it's a little hard to use Buck if you're not familiar with those tools already. I think the developer documentation, the introductory documentation, is not very great. And I'm not going to say that it's a flawless tool, but it is one that I'm very interested in learning more about. I still have not used it a whole ton. And like I said, I haven't ported any of my work projects over to it yet. We are using it at Oxide on a project for FPGA shenanigans. But it's definitely a tool that I'm very interested in, because tools like Cargo and npm and RubyGems are build tools that are good for the small, easy cases, as we talked about. If you're on the straightforward path, they work really, really well. And tools like Buck and Bazel are really good for the monorepo, Google style, where everything in your whole company, hundreds of millions of lines of source code, is being built with these kinds of things, but they're hard to use.
And so I'm really interested in how we can bridge these two worlds. Is there a way for a build system to be good in the small and in the large? I think it's going to be better to bring the conceptually correct large build systems down to be easier to use for the smaller cases than it is to scale the small, easy-to-use build systems up to the big cases. But I don't think we're there yet. So I want to shout it out as a tool I'm interested in. I don't think it's perfect. I'm not saying you should switch to it tomorrow or anything, but I do think it's an interesting space to watch. The second tool is called jj, which is from Google. It is written in Rust. It is a new source control system that's Git compatible. I've actually written a tutorial on it that I haven't publicized a super ton yet, but it is actually going to become the upstream tutorial sometime soon. We've been talking about it with the team. It is a version control system that is Git on the back end, but it's not Git on the front end. And, you know, I love Git. I've used Git for a very long time. I've loved the Git CLI. Whenever people in the past said the Git CLI is bad, I would be like, I understand that you struggle with it, but I do not. I think it's totally fine. I like it, actually. And jj is the first time I've ever... I haven't used Git in months at this point, because I just use jj instead, and it is both simpler and more powerful than Git at the same time. And like the C people, I think the Git people have been used to people saying that to them for a long time. So I was very skeptical when I first heard about it, but jj has a lot of really interesting things going for it. It ends up being a smaller set of primitives that are more orthogonal, and that's why it ends up being easier and more powerful than Git at the same time. So definitely check that out.
And if you stalk my GitHub, you can find my tutorial, or maybe someday, if you're watching this, it will be the upstream tutorial. But I definitely think it's really super powerful, and it's a tool I use every day and love. Even though it's a pre-release tool, it's got a lot going for it. I'm very, very excited about it.
Matthias
01:49:42
What I liked about JJ was the fact that you can avoid naming branches and you have a blog post on that.
Steve
01:49:50
Yeah, just wrote a blog post about that. Absolutely. That's another one of those things where I never really understood. I was like, what do you mean? How do you work with branches if you don't name them? And then at some point you're like, oh, I haven't named a branch in like a long time. I actually don't need to do that. It's cool. So yeah, absolutely.
Matthias
01:50:04
It was a very beautiful thought because you bridged the gap between version control, front-end and back-end technology with TypeScript and Rust. That was impressive.
Steve
01:50:14
Thank you.
Matthias
01:50:15
So many people asked me, and I almost forgot, one important thing. When will you start hiring in Europe?
Steve
01:50:23
So the funny thing I want to say is that, first of all, we would have been hiring in Europe, except that Britain left Europe. So we have some people in the UK. But we are willing to hire people in Europe. The only thing is that, because the team is still so small, we try to have working hours that overlap with San Francisco, and that's really hard for most people in Europe. There are some people who stay up late or wake up early and can make that work, but it's definitely a little bit of a challenge. As the company grows, we broaden and make that requirement less and less. So I would say that Europe is definitely not a guaranteed no right now, and it'll just become more and more common as time goes by. I don't have a great timeline for when exactly that is, but that's the overall philosophy: it's not that Europe is a no, it's that time zones are hard, as programmers know. So we'll get there.
Matthias
01:51:17
In closing, the traditional question: do you have a message to the wider Rust community? Anything that you want to share? The stage is yours.
Steve
01:51:26
Yeah, I think that, you know, when I started using Rust, it was like 40 people in an IRC room. And now it is millions of developers, literally all across the globe. And so growing is hard, and there are lots of changes that have happened. If you've been around Rust for a long time, you've seen a lot of change. I think change will continue to happen in the future. The best thing that we can do is continue to write good software, continue to try to treat each other well, and continue to build cool stuff in Rust and share it with the world. Not everybody's going to love Rust, and that's totally okay. But I still think there's a lot of growth in the Rust world, and I still think there's a lot of work to do. So I'm excited for us to keep building cool stuff in Rust and just keep chugging along. So yeah, I don't know, that's, I think, what I have to say right now.
Matthias
01:52:17
Steve, that was amazing. I have to thank you so much. If just a single listener starts learning Rust because of this, I think I've achieved my goal, and you did the same a thousandfold. And I'm so happy to have you as an ambassador of the language I love.
Steve
01:52:36
Awesome. Thank you so much. That's very kind.
Matthias
01:52:39
Rust in Production is a podcast by Corrode. It is hosted by me, Matthias Endler, and produced by Simon Brüggen. For show notes, transcripts, and to learn more about how we can help your company make the most of Rust, visit corrode.dev. Thanks for listening to Rust in Production.