Oxide with Steve Klabnik
In this episode, I talk to Steve Klabnik, a software engineer at Oxide and renowned Rustacean, about the advantages of building hardware and software in tandem, the benefits of using Rust for systems programming, and the state of the Rust ecosystem.
2024-11-14 113 min
Description & Show Notes
What's even cooler than writing your own text editor or your own operating system? Building your own hardware from scratch with all the software written in Rust -- including firmware, the scheduler, and the hypervisor. Oxide Computer Company is one of the most admired companies in the Rust community. They are building "servers as they should be" with a focus on security and performance to serve the needs of modern on-premise data centers.
In this episode, I talk to Steve Klabnik, a software engineer at Oxide and renowned Rustacean, about the advantages of building hardware and software in tandem, the benefits of using Rust for systems programming, and the state of the Rust ecosystem.
About Oxide Computer Company
Founded by industry giants Bryan Cantrill, Jessie Frazelle, and Steve Tuck, Oxide Computer Company is a beloved name in the Rust community. They took on the daunting task of rethinking how servers are built -- starting all the way from the hardware and boot process (and no, there is no BIOS). Their 'On The Metal' podcast is a treasure trove of systems programming stories and proudly served as a role model for 'Rust in Production.'
About Steve Klabnik
In the Rust community, Steve does not need any introduction. He is a prolific writer, speaker, and software engineer who has contributed to the Rust ecosystem in many ways -- including writing the first version of the official Rust book. If you sent a tweet about Rust in the early days, chances are Steve was the one who replied. Previously, he worked at Mozilla and was a member of the Rust and Ruby core teams.
Links From The Episode (In Chronological Order)
- The Rust Programming Language (No Starch Press version) - The official Rust book
- The Story of Rust - FOSDEM / The History of Rust - ACM - Early history of Rust
- Signing Party - Story from Macintosh development
- The Soul of a New Machine by Tracy Kidder - Classic book on computer engineering
- I have come to bury the BIOS - Bryan's talk on firmware
- Beowulf cluster - Early parallel computing architecture
- Bryan's blog post on Rust - Journey of a systems programmer to Rust
- JavaOS - Operating system written in Java
- D Programming Language - Systems programming language
- Garbage Collection in early Rust - Historical Rust development
- Removing green threads RFC - Major change in Rust's concurrency model
- Hubris - Oxide's embedded operating system
- Tock OS - Embedded operating system in Rust predating Hubris
- cargo-xtask - Build automation for Rust projects
- Hubris Build Documentation - Building Hubris using cargo xtask
- Buck Build System - Facebook's build system
- Buildomat - Oxide's build system
- Omicron - Oxide's control plane
- illumos - Unix operating system
- bhyve - BSD VM hypervisor
- About Self-hosted Runners - GitHub Actions documentation
- Async Drop Initiative - Rust async development
- Rust Playground Example - Demonstrating helpful error when using prefix await operator
- Rust Book - Modules - Rust module system
- OpenAPI Specification - API documentation standard
- Dropshot - Oxide's OpenAPI server framework
- Axum - Web framework for Rust
- Oxide Console - Oxide's web interface
- Oxide Console Preview - Demo of Oxide Console using a mocked backend
- Oxide RFD 1 - Request for Discussion process
- Rust RFCs - Rust's design process
- IETF RFCs - Internet standards process
- Zig - Systems programming language
- TigerBeetle - Financial accounting database written in Zig
- Bun - JavaScript toolkit written in Zig
- CockroachDB - Distributed SQL database used in Oxide's control plane
- Oxide and Friends: Whither CockroachDB? - Discussing Cockroach's switch away from BUSL
- Mozilla Public License - Oxide's default software license
- Asahi Linux - Linux on Apple Silicon with Rust drivers
- Buck2 - Meta's build system
- Jujutsu (jj) - Git replacement
- Steve's Jujutsu Tutorial - Guide to jj
- Steve's blog post on not naming branches
Official Links
- Oxide Computer Company - Building servers as they should be
- On The Metal Podcast - Stories from the hardware/software boundary
- Steve Klabnik's Blog - Thoughts on programming, Rust, and more
- Steve Klabnik on Bluesky - Follow Steve for Rust updates and more
Transcript
This is Rust in Production, a podcast about companies who use Rust to shape
the future of infrastructure.
My name is Matthias Endler from corrode, and today we're talking to Steve Klabnik
from Oxide Computer, about working on the hardware-software interface with Rust.
Steve, I don't think you need a lot of introduction. A lot of Rustaceans know you, but maybe there's someone out there who doesn't yet.
Can you quickly introduce yourself and Oxide, the company you work for?
Absolutely. Hi everybody, I'm Steve Klabnik. I am most known in the Rust community for having co-authored The Rust Programming Language, which is the book that most people learn the language from, or at least many people learn the language from. I was on the Rust core team for almost 10 years, and I've been doing a bunch of Rust stuff for a very long time. I started using Rust in December 2012, so I'm one of the few people that can say I have 10 years of Rust experience.
You can finally apply to those job adverts.
Yeah, exactly. Oh yeah, and Oxide: Oxide Computer Company is the startup that I work for. We're basically building servers that you can buy, so it's a very old-fashioned business: you give us money, we give you a server, you install it in your data center, you run your jobs on it. But the key is you get a cloud-like deployment experience while it's actually on-prem. So you're not renting a cloud, you're buying a cloud.
And it's partially named Oxide because we use Rust for 99% of the code that gets written at the company.
And that's what we're here for. And when you explain it like that, it sounds so simple. But in reality, it's very complex to build servers, I guess.
We'll get to that in a second. But I just want to thank you personally, because you introduced me to Rust many, many years ago when you gave a talk at FOSDEM.
I was there, and we happened to have a very, very brief chat after the talk. I don't even know if you remember, but that was kind of the starting point for my Rust journey. And I think a lot of people got inspired by what you did. So thanks for that.
That's awesome. That talk is really meaningful to me. And I was actually very sad, because I only intended to give that talk one time, at FOSDEM.
As you remember, but other people probably don't, I was chronicling how Rust changed before 1.0, because that was around the time of the Rust 1.0 release. And so I thought it'd be cool at FOSDEM to recap all of the development history. But then they ended up losing the recording. Something happened with the recording and it didn't work. So luckily the ACM had me give it one more time, and we got a recording then.
That was also a very special conference for me because that was when Larry Wall announced that Perl 6 was finally being released, and so I got to speak after him. I did Ruby before Rust, but I did Perl before Ruby, so I've always had a small soft spot in my heart for Perl stuff. That was also kind of fun about that conference specifically.
For the people who haven't been there: it was a really magical moment. The audience was cheering you on; you could feel the atmosphere, you could feel the energy in the room. It was a very special moment, so it's sad to hear that they lost the tape. But in any case, you managed to inspire a lot of people, also because you have a way with words and you always start from first principles. And I think that's unique, because many engineers are either very technical or very high-level, but you can switch between those two levels of abstraction; you can talk to everyone. And I wanted to capture some of that today.
Now, on that premise: you had a blog post on your website once where you said that you were not planning on quitting Cloudflare, where you worked before, but you just waited for a good opportunity, and it seemed to have presented itself in Oxide. And I do wonder: what specifically about Oxide and their mission attracted you to make this move?
Yeah. So it's kind of a combination of two things. The first is: I was born in 1986, and I started programming when I was like seven years old, so '93-ish, roughly. And a Mac was the first computer I ever used; I was a really big little Mac fanboy back in the day.
I guess maybe I'm getting old, so I don't know if this is still necessarily true. But at least in the early 2000s and late 90s, there was a lot of, I want to say, reverence held for the 80s in computers. That's when computing stuff really took off. I mean, obviously there were computers many years before, but the 80s is when the PC revolution started happening, and when people started having computers in their homes instead of just having them in universities and stuff like that.
And so I was really, really big on learning about Apple's history, and a lot of the stories that are on, say, folklore.org, where you can read a lot of early Apple history, were stories that I kind of grew up with, and they were really meaningful and impactful to me. And there's one in particular (I mean, there's a lot of them I could recount) that specifically connects to Oxide: the story about how, as they were putting the first Mac together, Steve Jobs got the entire team to sign the inside of the case of the original Mac. And they ended up putting that in all the computers.
And when they asked why that's the case, he was like, real artists sign their work.
You get a painting and there's always a signature in the bottom corner.
And that's what we're doing here.
There was also a book that I used to read every year, though I haven't read it in a couple years now; it's called The Soul of a New Machine by Tracy Kidder. And I think it really captures a lot of that.
It was about a company called Data General in the 80s, and their journey to go from the 16-bit minicomputer to the 32-bit minicomputer, and them building this computer, and how much effort the team had to put in, and their struggles and trials and tribulations while they made all this work.
And so I've kind of always been like: I wish I was 10 years older, so I could have experienced that era of the industry. Because by the time I was graduating high school it was 2004 (I started high school in 2000), and I was graduating college in 2008, 2009. And so by then it was a good 30 years after all that stuff had happened, and people weren't starting computer companies anymore. So I'd always kind of thought that I wouldn't really get to experience that sort of thing.
And then, lo and behold, Oxide starts up, and it's a new computer company. So that was really cool. They also talked about The Soul of a New Machine a lot, and I was like, oh, there's a very big cultural similarity here. And they were using Rust, which is a tool I loved and had worked on, so I was like, oh, that makes sense. And I had known two of the three co-founders, Bryan and Jessie, before Oxide, and I respected their technical opinions and knew them, so that also seemed good. It was just all those things aligning at the same time. I was employee 17.
So, you know, I didn't start the company, but I got in relatively on the ground floor. And it was all of that kind of stuff that really appealed to me personally. It's very, very rare that you get the opportunity to get in on the ground floor of building a new kind of computer. And so that was just too good to pass up.
It sounds like a very lofty dream. But at the same time, it feels like building a new server is such a hard problem. So what does it feel like, day to day, to work on this?
Yeah, the thing is, it's not just a hard problem; it's like a thousand hard problems. We keep joking that Oxide is actually a thousand startups in a trench coat. There are so many things we are doing that could conceivably be their own company, or, if we were trying to do them for more than just us, could be their own company.
So, for example, we want to give people that own the rack a very good way of monitoring how the system is doing, and so we need observability and metrics. And because we're on-prem, you're not just going to hook up some sort of cloud metrics service and ship information off to them, right? So we are currently working on internal metrics for the rack. And that's a thing that whole companies do separately, and we kind of have to do on our own. There's just tons and tons of stuff like that. So yeah, it's not just one problem, it's a ton of problems.
And so that also means it is a group effort. I think a really big difference between the eighties and now is that part of the reason why it was so easy to start a computer company in the eighties, and why so many people did, is that it was feasible for one person to design and code the whole thing from top to bottom. All the hardware and all the software could be meaningfully understood by a single person. And we are far, far beyond that level of complexity now.
So, you know, I said I was employee 17; it's now around 60, 70 people. And all of them are super necessary to get this done. In some ways, it's very much like working at a normal company, in the sense that you have the thing you're working on, and other people have the thing they're working on. And obviously there's a lot of cross-pollination. But tons of people are working on really cool stuff.
I'm not a hardware guy personally; I'm a software person, mostly. And so it's been really neat to learn about how the hardware folks do their job, and vice versa: they'll ask some software people for help with software stuff. So there's a lot of cool collaboration that occurs, since the product is so broad. But at the end of the day, you're still working on your little part of the thing, and oftentimes many little parts of the thing. Another thing about startup life is that you kind of have to wear many hats, so people move around and work on a variety of different things too. But yeah, that's definitely a big part of it.
A naive person might say: shouldn't it be easier today to build such a computer company than in the 80s? Because now we have standardization, and we have standardized hardware components and all of that stuff. But the complexity, of course, still very much exists. And I guess the rabbit hole runs very deep.
What are some of the other hard problems that you have to solve to build such a computer?
Yeah, so I mean, I think the biggest, most straightforward one is that it is fundamentally a distributed system. To be clear: when you buy an Oxide rack and you plug it in and turn it on, you're not managing individual servers yourself. You get an API that gives you the ability to, say, spin up a virtual machine, but it's not like you're saying, OK, I have 10 VMs running on this sled and 15 VMs running on that sled. You're just presented with one whole stack of compute.
And what that means is that our control plane software is the one that's multiplexing
those VMs across all the hardware in the rack.
And so that means at a high level, it's a distributed system in a box.
And distributed systems are never easy in the first place.
And then you're talking about virtual machines that have to manage hardware and do all that kind of stuff. So I think that's another example of a really hard problem. People think it's easy because the stuff is standardized. That's sort of true, but the standardization is also, in some ways, what we're rejecting. We're not using a lot of those standard interfaces, and we're writing our own firmware instead. And that is also a really big problem to be solved.
So, for example, Bryan gave a talk called, I think, "I have come to bury the BIOS" (I'm forgetting a couple words in that title, but it's like: I haven't come to save the BIOS, but to bury it, or something like that). We've thrown out the concept of a BIOS, and the operating system boots up the hardware just like in the old days. In order to do that, we had to write our own firmware, and that's a really, really hard problem. So there's a lot of stuff like that.
Is it a bit like with Rust, where some of the old knowledge was rediscovered, and suddenly you were able to use those concepts in a real production systems-level language? Is it similar with Oxide, where you uncover some of the old truths that the OGs, the original creators of computing, knew about, and then we forgot about them over time?
I think that's true in a certain sense, but also not entirely. Because a lot of these things aren't necessarily truths that were lost; it's more about the way the computer industry evolved over time. Standardized interfaces and swappable parts are what made the PC platform succeed, right? The standardization is what allows a giant ecosystem of companies to work together productively and ship things.
Again, being a Mac fanboy in the 90s: people would be like, oh yeah, I need to buy a new hard drive, and I have 15 different options for buying new hard drives. But you have one, from Apple, so you take what you can get, and that's it. Oh, you want to upgrade your graphics card? Cool; that's not really as viable as it is on the PC platform. And I think that made sense for the economics of the time, and it still makes sense in many contexts. It just doesn't necessarily make sense in servers anymore. I mean, my PC is still a very standard PC, running all that firmware that those manufacturers have built, pulled together from 20 different companies creating it together.
But I think the wisdom is not so much lost as it is that the industry changed a lot as it went from "a computer is a thing a university has that you get to timeshare on sometimes," to "a computer is something that companies have, or maybe they have a couple of them," to "a computer is something that every individual person has."
And it's the same sort of thing with the web and the internet becoming such a big thing. It went from "my company has a single server in a rack of servers in a data center," to "my company has a whole rack," to "my company has a data center," and then to "my company rents some servers from another company that runs the data center." Those kinds of economic changes happened over time and made different configurations of these things work.
Now, part of the additional evolution that happened there: if you read a lot of stuff about Google as they were growing back in the early days, there was a lot of talk about how they used commodity servers instead of buying big iron. So instead of buying big old giant mainframes or servers, they were just using regular PCs and hooking a lot of them together. Back in the Slashdot days, we made jokes about Beowulf clusters, but that's a thing that they did actually do.
For a while they ran on commodity hardware, but they learned that the style and architecture of a computer that works well for people at home is not really well suited to running when you need hundreds of thousands of computers, because it's designed for running on your desk (my tower is right here, just outside of the frame). The thing that's running on my desk has very different needs than an entire data center full of computers. And so they started designing their own hardware.
And so did all the others; we call them the hyperscalers. The AWSs of the world, and all the other people that are running big clouds. They moved away from the PC model, because that just doesn't really make sense economically for a number of different reasons.
But Amazon is not in the business of selling computers; they're in the business of renting them. So say you're a Fortune 100 company (because here in the States, companies are people; no, I kid). Imagine you're a company, right? And you need some servers, so you go to try to buy some. You can't go to Amazon and be like, hey, I need one rack of the same servers that you've custom-built to live in an AWS data center. That's just not an option that's available to you. These big clouds aren't in the selling-computers business; they're in the renting-computers business. So you can rent them, but you can't buy them. And there are other companies that are selling servers, but they're not really in the custom hardware business; they're still fundamentally selling you a big PC. And so what Oxide is doing, I sort of liken it to Prometheus.
You know the legend of Prometheus, who went and stole fire from the gods and brought it back down to humans. (We'll ignore the part where Prometheus was then tortured for all of eternity for doing what he did.) We're taking the concept. We're not literally selling the same designs as Amazon's, like we're sneaking into the data centers or whatever, but we are selling that style of servers to companies that aren't necessarily going to be building their own.
There are tons of big companies (and smaller ones too, but when you're talking about a whole rack of servers, you kind of inherently are going after larger organizations), big companies that aren't going to spin up their own server design division. Even though they may have the money, they just don't have the expertise or the culture to be able to do that, and they want to be able to buy computers instead. And so that's what we're doing: taking these hardware and software concepts from the hyperscalers and then selling them to companies that could use that amount of compute but just literally can't buy that kind of computer anywhere else.
I think this analogy with Apple is really fitting for two reasons. First, they democratized computing for a lot more people. But second, they also focused on vertical integration. In fact, when Steve Jobs moved away from Apple to build NeXT, he also built some sort of workstation, which was a different type of computer with different requirements. And he chose Objective-C for various reasons to build it, I guess. But of course, Oxide did not choose Objective-C; you chose Rust.
What are some advantages that Rust offers for Oxide's unique hardware-software integration challenges?
Yeah, totally. So I should also really briefly say that I always make the Apple analogy because it makes sense to me. But Bryan literally worked at Sun forever, and a lot of the other folks are also from Sun, and Sun and Apple are kind of very similar and very different in different ways. So they might also pitch it as kind of a next version of Sun, in some ways, too.
So anyway, in terms of the "why Rust": first you decide you're going to get into this business. The folks who started it were previously at Joyent, and they ran a public cloud, so they knew what it's like to be on the customer side of that relationship. And so they were like, OK, this sort of business needs to exist; well, how are we going to enable it?
For a very long time, a lot of the early folks at Oxide were very big C people. Like I said, Bryan started at Sun working on the kernel in the late 90s or whatever, at the same time I was learning programming, basically (sorry Bryan, you're a little older than me). So they'd been using C for a really, really long time, and they were, I would say, initially a little skeptical of Rust conceptually. But Bryan has a great series of blog posts on him personally coming to realize why he ended up liking Rust.
I think an interesting thing about Rust's positioning in this domain overall is that Rust is kind of the first time in a very long time that there has actually been a new language that can meaningfully replace C and C++ for the use cases where they still dominate. And what I mean by this is a little complicated. When I started programming, C and C++ were the default choice for everything. Even then, I'm waving my hands a little bit: the Mac OS I started with used Pascal as its calling convention, right? There were more options in many ways in the early nineties. But C and C++ were very dominant in many domains, not just network programming and operating systems programming, but also application development. You would be writing stuff in that way.
And then Java comes along and it takes out a large chunk of the application use case. And then scripting languages come along and they take out a lot of the web development use cases. And so the overall scope of where C and C++ are the dominant languages has actually been declining for a really long time. But the operating system and systems case is one where there truly has not been a challenger for a very, very long time.
I mean, some people wanted to make JavaOS happen. It didn't happen. And there were some other people that came along and tried to do various things.
So, for example, the D programming language. I didn't write the D programming language; I used the D programming language in college, back in that 2004-to-2010 kind of era. But it had a garbage collector. And that meant that it wasn't really a fit, even though me and my friends in college were writing an operating system in it. We did not get very far in the end, but we were able to get the beginnings of that going.
And while it worked, it still felt like you were kind of fighting with the language. And so a lot of dedicated C fans over the last 30 or 40 years have seen people show up and say: we have a programming language that can do all the stuff C can do. And the response has mostly been: well, actually, Java's too high-level. And that sort of pattern has repeated itself over the years. And so I think a lot of them have become kind of jaded against the idea that there's a language that can truly replace a lot of the low-level use cases of C and C++.
And in fact, even early Rust had a garbage collector. It was not necessarily a truly good fit for C and C++ use cases. And I bring this up partially because it was in late 2014 that Rust made the decision: Rust used to have a pretty big runtime, and there was the option to use native threads, meaning operating system threads, or green threads. And in late 2014 there was the decision to pull out green threads as a concept and move purely to native threads. And that's when Rust truly came into: you can really use Rust for low-level tasks, because we don't need this runtime anymore, and all that kind of stuff. And so it's funny, because a lot of people now think of Rust as having always been in that use case, but it was honestly less than a year before 1.0 that it really decided to truly go after that space.
I bring that up partially because that is when Bryan specifically (the CTO, whom I've mentioned several times: Bryan Cantrill, one of the co-founders of Oxide) saw that decision being made. And he has said before that that is the moment where he decided to truly look into Rust, because he's a big anti-green-threads person, at least for the purposes of systems stuff. He's like: they've always been ripped out of every system that's ever tried them. And so that moment was when he was like, oh, this is a language I need to take seriously, because they are actually finally deciding to truly do the low-level stuff that I care about. So that was when Rust first truly caught his eye. Or at least, he had seen it before then, but that was the moment where he was like, I need to investigate this properly.
It's a very, very long-winded answer, but part of it is: OK, so once you can do the low-level stuff, cool, that makes sense. But what value does it bring? A lot of that is in all the classic standard Rust things: we have a strong type system, and we can eliminate certain classes of errors at compile time. But I think a lot of people only focus on the stuff that's unique, and not the holistic package. So, for example, the tooling: cargo is really, really important for a lot of people. And stuff like enums: once you use enums, you're like, why is this not in every programming language? We have structs, or product types, to get a little fancy with it, and so you need enums, or sum types, to complement that. It's so interesting that many, many languages have one half of this equation but not both halves. And so there's a lot of stuff like that which, holistically, makes Rust very attractive.
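(To make that structs-and-enums point concrete, here is a tiny, self-contained Rust illustration; the types are invented for the example. A struct is a product type, a value that carries all of its fields at once, while an enum is a sum type, a value that is exactly one of its variants, and `match` forces you to handle every case.)

```rust
// Illustrative only; these types are invented for the example.
// A struct is a product type: a value has ALL of its fields at once.
struct Header {
    version: u8,
    length: u16,
}

// An enum is a sum type: a value is exactly ONE of its variants.
enum Packet {
    Ping,
    Data(Header, Vec<u8>),
    Error { code: u32 },
}

fn describe(p: &Packet) -> String {
    // `match` must cover every variant, or this will not compile.
    match p {
        Packet::Ping => "ping".to_string(),
        Packet::Data(h, body) => format!("v{} packet, {} bytes", h.version, body.len()),
        Packet::Error { code } => format!("error {}", code),
    }
}

fn main() {
    let p = Packet::Data(Header { version: 1, length: 3 }, vec![1, 2, 3]);
    println!("{}", describe(&p));
}
```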
I also don't necessarily like to think of the stack as being purely up and down; it's not always exactly one way or the other. But the important part is that Rust, because of its commitment to being able to solve low-level use cases, can also make its way up the stack a surprising amount. A lot of times, if you don't start with a language that can address the lowest-level needs that you have, you end up investing in at least two different languages. And I'll say that at Oxide, we definitely don't use Rust for a hundred percent of the things, but we do for 90% of the things. And it's cool that we can even get to 90.
Because a lot of times it's like: OK, we're going to write our web app with JavaScript on the front end, and then Python or Ruby on the back end, and then maybe we'll need to do some low-level stuff in C. And you kind of have at least three languages that are necessary to fill in the sort of stuff that you need to be able to do. And so Rust enables us to get away with using one language for a lot more of a chunk of that stuff than other things would.
Most importantly at the lowest level itself, but then also: if you look at my comments from a long time ago, back in the 2014 era, I would have told you I would never write a web app in Rust. But we have, and do, many web apps at Oxide, and I think it works totally fine for those use cases. Not necessarily for everyone, but for us it does work really well. So that versatility is super, super useful and enables us to do all sorts of stuff.
Anyway, I don't know if that fully covers it, but I think it's just important that, if we had decided to go with C for the lowest levels of the Oxide stack, we would need to introduce another language much sooner than we do with Rust. Rust gives us the ability to address things at the high level of the stack as well as at the low level of the stack, and I think that's a very valuable thing when you're talking about our scope: we do everything from the firmware up to the front end of the website running in the browser. A lot of people say "full stack" to mean "I can write some front-end and back-end web code," but we're like full full stack. And so being able to address all those use cases with fewer technologies is a legitimately valuable thing.
But some listeners who have a C background might say: you folks just make it very hard on yourselves; there are things that you could have gotten for free if you chose C. What would you say to them?
I don't really know what we would get for free, necessarily. I mean, maybe not needing to write some code. But the fundamental premise of the company, in many ways, is something Alan Kay has said in the past: you can't meaningfully build your own software without making your own hardware. Or I forget exactly how he phrased it; maybe it's the other way around. But we sort of need to do both.
And what that means, on many levels, is that we're willing to write our own stuff, because if you're building something custom, you need a lot of custom stuff. So it's true that maybe we get some stuff for free, sort of, kind of. But it also means, when it comes to things like support: we really care a lot about making sure that everything works, and works well together. And it's much, much harder to support a giant pile of other code that other people have written than it is code that you've written that's custom for the purpose.
Bugs can appear sometimes because something is solving a problem for someone else's use case and not yours, and a custom piece of software that only does what you need it to do, and nothing else, ends up being easier to understand and easier to take care of. I definitely don't think it would be impossible without Rust; someone could make an Oxide where C is used more. But I also think that, at least for a seasoned Rust developer, Rust overall saves me time implementing stuff, because I have to do so much less checking up after the fact that what I'm doing is reasonably correct. There are so many things that Rust gives you for free that you don't get with C. So I don't think it's impossible, but someone is welcome to try; let's put it that way. We are demonstrating it can be done with Rust. Sure, maybe it could have been done with C, but that's for somebody else to prove out.
That Alan Kay quote was really nice: software informs the hardware, and hardware informs the software; you can't build one without the other. How does it look at Oxide? How do the hardware part of Oxide and the software part of Oxide, which is written in Rust, inform each other?
Yeah, so one of my favorite stories (and this is not any work that I personally did, but I love talking about my coworkers doing cool things) is an example of this. We talked a little earlier about the standardization layers, and here's an example of how they can get in the way. Say you have an entire room full of servers, right? And maybe you're running tons and tons of jobs every day. What that means is that small failure rates happen.
like, oh, there's crime in New York City. And it's like, sure.
But that's also because there's millions and millions of people.
So a one tenth of a percent chance that something happens means that it happens
hundreds of times a day in New York City or whatever.
I probably have the order of magnitudes off. But like the point is,
is at scale, things that don't happen very often start to happen and happen often.
And what that means is: say your hard drive controller's firmware has a bug, and that bug only manifests 1% of the time. Well, if you have 100,000 servers, that 1% of the time is going to be happening all the time.
Now, obviously, 1% is a very high rate for a bug in firmware.
Like, I'm not saying that firmware bugs happen literally 1% of the time,
but just like you will run across obscure edge cases in the software and hardware
that you use, and those problems will occur.
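(To put rough numbers on that: New York City has roughly 8.5 million people, so a 0.1% daily chance of something works out to 0.001 × 8,500,000 ≈ 8,500 occurrences a day; likewise, a failure mode that bites 1% of machines across 100,000 servers is on the order of 1,000 affected machines at any given time.)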
And so one of the things that standardization enables, which is great,
is having, as we said earlier, tons of organizations come together and build
all the stuff that goes into a computer.
But the problem with that is: say you buy a server from Dell, and you come across a firmware bug in the motherboard. Well, Dell didn't write the firmware that's running on that motherboard; some other company ended up writing that firmware. And so if Dell needs to fix that problem, they will file a ticket with their vendor.
But then it's like, good luck getting that prioritized. You know what I mean?
As a customer, you are now dealing not just with the company you bought it from,
but the companies that they bought their stuff from.
And so you're not the customer of that motherboard vendor.
And so why should they care about you over someone else or whatever?
And so one of the things that we've done is thrown out a lot of those layers, because we don't need them. The fundamental purpose of stuff like BIOS and UEFI, for example, is to let the operating system people write against the BIOS or UEFI spec, and for all the hardware vendors to produce the APIs in their hardware that fit into that spec.
But we don't need to support 75 different manufacturers of RAM, or five different manufacturers of hard drives, or whatever. We have the first revision (we're working on the second revision of the rack now), and we know physically what hardware is in the machine. And so all of that extra standardization interface is written for a benefit that we don't actually see.
We don't need to be able to build all of these different variants of all this different stuff, and so all that code is written to serve a purpose that we don't need anymore. So we've actually completely thrown out that layer. In the Oxide rack there is no BIOS, there is no UEFI; the operating system boots the hardware, just like in the very old days before that stuff even existed. And what that means is that we got to throw out that firmware. Specifically, for AMD there's a thing called AGESA, which is part of the firmware package that you get, that you use to boot up AMD's CPU.
And we said, no, we're not going to use that, and we wrote our own. At first AMD was kind of asking us, why are you guys asking questions about this? That's just, you know, in the firmware. And we're like, yeah, we're writing our own firmware. And they were like, we don't really believe you; OK guys, sure, whatever you say. And eventually, once we got it to boot, we were like, hey, by the way, here's an example of this booting. And they were like, oh, that's really cool, because they didn't expect us to actually do that; literally no one else does this.
So what that means, on some level, is that stuff boots really quickly. But how much does that matter? You're not really booting a server all the time. More importantly, by throwing away all of that stuff, we've eliminated a ton of possibilities for things to go wrong, and we have eliminated security issues.
Another thing we got rid of is the BMC, the baseboard management controller. Server-grade hardware has a ton of other computers running inside it to make sure that the computer is running correctly. For example, you really want your main CPU to be running the jobs that you're running on the server; you don't want it to be running stuff to manage the server itself, right?
So server-grade hardware has this additional CPU and other stuff in it, called a BMC. And that's usually making sure that, oh, if something crashed on the main CPU, we can reboot it; or we're able to log in behind it and monitor stuff. But with a lot of server vendors, those BMCs have full operating systems running on them. And those full operating systems have bugs, and they can have problems, and there are security issues. It's really, really hard, when you're buying a server today, to even know what code is running on it at all.
And so we have an equivalent thing, which I joke is the "totally not a BMC"; we call it a service processor. But it serves the same general idea: it is a good thing to have a little mini extra computer monitoring the main big computer to make sure that all that stuff works.
But instead of just accepting the one that would come with the motherboard manufacturer we would buy from, since we designed our own motherboard, we designed our own replacement system. It is much, much smaller, and it runs an OS called Hubris that we wrote from scratch instead. So we know that you're not running a full Linux inside of the mini computer inside of the thing that's running your big computer. And that means we can be more efficient, it means we can have less attack surface area, and it means we can audit everything, because we're not just accepting whatever the external manufacturer is giving us. And so that provides a ton of benefit.
Anyway, that's an example of: we need to write the software that manages the hardware in that way, and it's only because we're designing our own hardware that we're able to write the software in that way, because we know what we're putting in the rack. If we didn't know, we would need a lot more of those standardized interfaces.
And so that's an example of those two things informing each other very deeply. Now, the downside is, when we're releasing a new version of the rack, new versions of CPUs mean we need to write more firmware, whereas before we would just be able to update to whatever someone else wrote for us. So I'm not going to say that there's no trade-off; there obviously is a very large trade-off. But it's one that we're willing to make, and we think it's the only way to deliver on a lot of the quality promises, support promises, and reliability promises that we want to give customers.
So I knew that you had your own operating system called Hubris, and I wanted to talk a little bit about that, because some people might wonder: why did you start your own operating system if something like Tock OS already existed back in the day?
Yeah.
What's so special about Hubris that you needed to write that thing yourself?
So I've actually known the Tock people for a very long time. They're great, and I really like their project in general. Oxide, before I joined, had investigated using Tock instead of Hubris and decided not to. Basically, what that boils down to is that a thing about embedded use cases is that diversity is the rule, not the exception. What I mean by that is: every application tends to be different, and because you're often literally running on different hardware, you have different needs.
Tock is very focused on a use case where you load a variety of different programs at runtime, and they also care a lot about supporting programs that are written in C. They have other goals too, but, at the time (and I should frame this: it's been a couple years since the decision was made, so I don't necessarily want to speak fully to what Tock's goals are right now), when Oxide was looking into it, it was like: OK, they're interested in dynamic program stuff, and they're interested in supporting C programs, and we didn't really have the same needs.
So, the way that Hubris is different, and kind of the reason we ended up writing our own. First of all, saying "we write our own OS" sounds like a massive undertaking, but the kernel in Hubris is like 4,000 lines of Rust code. It's not very big, actually. I mean, the drivers and the stuff that you need to actually make it useful and meaningful are a bit more. But it is feasible, and it was largely designed by one person, Cliff Biffle.
Many other people have also helped significantly, but he was the person who did a lot of the initial design work. And what makes Hubris special is that it is aggressively static. What I mean by that: when you think about an operating system, you're like, the operating system's on my computer, and then I install programs, and then I run programs, right? I click on Discord, it boots up and runs; I click on Firefox or Chrome or whatever, it boots up and runs. With Hubris, when you make the image to install on the hardware at all, you say up front: here are the programs that I am running, and those programs run all of the time. There is no dynamic list of "here's how many programs are currently running"; there's no "hey, let me load a program at runtime that wasn't running initially."
It boots up, it starts all the programs, they're all running,
and there's one instance of each program.
It's not even like, okay, maybe I have three Firefox windows running or whatever,
and obviously those share some code, whatever, blah, blah, blah.
But the point is, on a regular computer, I can be running five instances of
Bash all at the same time.
And so in Hubris, you have one instance of every program, and they're running all the time. And that's something that, for our purposes, works really well, but does not necessarily work for other people's use cases.
And what's cool about stuff being sort of static is it means that you can make,
I don't even say shortcuts exactly, but you can like skip some design decisions
that other people have to deal with.
So, for example, because we know every program that's running, and that it's always running, Hubris doesn't really use virtual memory. We can pre-allocate at build time: OK, in the image, this is where all the programs live, individually. We know the memory map statically, where everything lives, and we can tell: OK, this will fit.
We do have to worry about programs crashing at runtime to some degree. But, I mean, in the days of virtual memory on desktop machines this doesn't really happen anymore; back in the day, when I was using those first Macs, you'd be like: I have this much RAM, my programs are currently using this much RAM, and if I start up a new program it's going to literally run out of RAM and not work. Hubris is closer to those systems, since we don't use virtual memory. But we don't have to worry about the problem of "I start up a program and there's not enough RAM to run it," because we know at build time there is enough RAM to run the programs that you're trying to run, for the most part.
And those kinds of aggressive design simplifications mean that we can get away with doing a lot less. Another great example: there's no global memory allocator or global heap in Hubris, because we don't need dynamically sized lists. We know at build time how many programs are running, so the OS does not need to keep track of a dynamic list of how many programs are currently executing. These kinds of design decisions all build on top of each other and enable us to do something a little different.
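(A purely hypothetical sketch of that idea follows; this is not Hubris's actual configuration format or code. It imagines the task table as a const array whose memory budget is checked at build time, so "not enough RAM to start a program" becomes a compile error rather than a runtime failure.)

```rust
// Hypothetical sketch only -- NOT Hubris's real configuration or code.
// It illustrates the "aggressively static" idea: the full task list and
// each task's memory budget are fixed at build time, so there is no
// allocator and no dynamic process table.
struct TaskDesc {
    name: &'static str,
    ram_bytes: usize,
}

// One entry per program; exactly one instance of each, running forever.
const TASKS: [TaskDesc; 3] = [
    TaskDesc { name: "supervisor", ram_bytes: 8 * 1024 },
    TaskDesc { name: "net", ram_bytes: 32 * 1024 },
    TaskDesc { name: "spi-driver", ram_bytes: 4 * 1024 },
];

const TOTAL_RAM: usize = 64 * 1024;

// A build-time check: if the tasks don't fit in RAM, compilation fails,
// instead of a program failing to launch at runtime.
const _: () = {
    let mut used = 0;
    let mut i = 0;
    while i < TASKS.len() {
        used += TASKS[i].ram_bytes;
        i += 1;
    }
    assert!(used <= TOTAL_RAM, "tasks do not fit in RAM");
};

fn main() {
    println!("static task table with {} tasks", TASKS.len());
}
```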
Likewise, that also means I would never suggest that the Tock people stop building Tock and use Hubris for everything, because they're just literally trying to do something different than we are. So yeah, at the highest level, that's what Hubris is trying to do. Everything is aggressively static. It's all at compile time; it's all at build time.
Another thing is that Hubris is very much a message-passing OS, but it is also fully synchronous. We do use async Rust a lot higher in the stack, and we don't necessarily think that async Rust is bad in the lower levels either, but for Hubris specifically, synchronous is much, much simpler than asynchronous. And when we're talking about the firmware that runs at the lowest levels of our entire thing, we're really going for a simple system that can be understood and reasoned about very easily. And so Hubris is also aggressively synchronous.
That's a thing that works for us, but does not necessarily work for everybody else.
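(Here is a rough model of what fully synchronous message passing means, in plain std Rust; it is not Hubris's real IPC API. The point is that a send followed by a blocking wait for the reply behaves like a single function call, so there is never a growing queue of in-flight messages.)

```rust
// A hypothetical model of fully synchronous message passing -- NOT
// Hubris's real IPC API. A "send" blocks the caller until the
// recipient replies.
use std::sync::mpsc;
use std::thread;

// Each request carries its own reply channel; the caller blocks on it.
struct Request {
    payload: u32,
    reply: mpsc::Sender<u32>,
}

fn main() {
    let (tx, rx) = mpsc::channel::<Request>();

    // "Server" task: handles exactly one message at a time and replies.
    let server = thread::spawn(move || {
        for req in rx {
            let _ = req.reply.send(req.payload * 2);
        }
    });

    // "Client" task: a send immediately followed by a blocking recv
    // behaves like a single synchronous call into the server.
    let (reply_tx, reply_rx) = mpsc::channel();
    tx.send(Request { payload: 21, reply: reply_tx }).unwrap();
    let answer = reply_rx.recv().unwrap();
    assert_eq!(answer, 42);

    drop(tx); // close the channel so the server loop ends
    server.join().unwrap();
}
```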
When you explained that you have one instance of each program at a given time, I thought: oh, that would make scheduling extremely easy. Do you even context-switch between those processes? Do you have some time-sharing mechanism? How does that work?
Yeah, I mean, we do still have to do some of that, but the scheduler can be really, really simple. I forget exactly how many tasks are running on the real hardware, but we're talking on the order of tens, not on the order of hundreds of processes. So the scheduler itself can be really, really simple. If I remember correctly, it's basically just a round-robin scheduler. There are a few interesting bits around, say, if a program is waiting on an interrupt to come in, then we don't need to start it up again, because we know that it's waiting on something. But it can be aggressively simple, because we don't need to handle what happens if someone spins up a thousand instances of something and how we deal with that. And so it's all very, very straightforward and very, very simple, but that ends up also being reliable and easy to debug, and stuff like that.
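(A hypothetical sketch in that spirit, not Hubris's actual scheduler: a fixed, build-time-known task table and a single round-robin pass that skips tasks blocked on an interrupt.)

```rust
// Hypothetical sketch -- not Hubris's actual scheduler code.
#[derive(Clone, Copy, PartialEq)]
enum TaskState {
    Ready,
    WaitingForInterrupt,
}

struct Task {
    name: &'static str,
    state: TaskState,
}

const NUM_TASKS: usize = 3;

// Walk the table once, starting after the current task, and pick the
// first task that is ready to run; None means everything is blocked.
fn next_task(tasks: &[Task; NUM_TASKS], current: usize) -> Option<usize> {
    (1..=NUM_TASKS)
        .map(|offset| (current + offset) % NUM_TASKS)
        .find(|&i| tasks[i].state == TaskState::Ready)
}

fn main() {
    let tasks = [
        Task { name: "net", state: TaskState::Ready },
        Task { name: "spi", state: TaskState::WaitingForInterrupt },
        Task { name: "supervisor", state: TaskState::Ready },
    ];
    // From task 0 ("net"), the pass skips "spi" and picks "supervisor".
    assert_eq!(next_task(&tasks, 0), Some(2));
    println!("next: {}", tasks[2].name);
}
```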
So yeah, we do have a little bit of process-priority stuff, and there are some important things there. A problem, once you have process priorities and scheduling, is something called priority inversion, where a high-priority program ends up stuck waiting on a lower-priority one that holds something it needs. The point is that you can get stuck, and we can actually validate that that doesn't happen, because everything is so straightforward. Since we don't do a lot of dynamic stuff, we can actually lint against it: hey, you may have a program here that's going to block execution of this other program, and so on.
Yeah, it's much more straightforward than what you would see in a commercial desktop operating system, solely because we're able to pare down those requirements so much, given the limited context of what this is actually trying to do. And it turns out that aggressively scoping down requirements is what allows you to have simplicity in a system. If you have complicated requirements, you're going to have a complicated system. I don't think simplicity is always better than complexity; it's about how simple you can make something within a given set of requirements. And sometimes you can scope requirements down too far, and then the thing doesn't work correctly. But it turns out that for embedded devices, or use cases like this, you can be really, really aggressive with how you scope stuff. And that can make a system simpler and easier to understand, and therefore more reliable.
It must be bliss to develop such an operating system.
You know, it's nice, and it's got its problems too. I mentioned cargo being really useful before. It is true that cargo is wonderful, and it's definitely very useful, but we basically had to write a build system on top of cargo to make this work, because cargo has several deficiencies in dealing with this kind of problem.
It's actually one of the first things that I worked on at Oxide: rewriting the build system, yet again, on top of cargo. It is definitely a very different thing than developing a web app or developing something at a much higher level. Some people enjoy it and some people don't; I used to work on that stuff and I found it a pleasure, but obviously everybody has their different areas of expertise.
Tell me a little bit about that. What were some of the limitations that you ran into with cargo? And why did you have to build your own build system? What does that look like?
Yeah. A very simple and straightforward one is that cargo, quite deliberately, does not include any post-processing of stuff, and we need to build an operating system image. So on some level there is a step after cargo runs: cargo produces a program, or a couple of programs, and then we need to assemble that into a thing that you would actually load onto the microprocessor. So even in those pure cases, we need something more than cargo, because cargo deliberately leaves that to some other process. That's one example: a lot of code in the Hubris build system is, OK, now that we've gotten cargo to spit out all the final programs, how do we actually assemble that into an image for the OS to boot?
How did you do that?
So there's a pattern called the xtask pattern, started by matklad. It's this idea that you have a program in your workspace called xtask, and you write scripts in Rust that work that way. We have a very extensive set of xtasks, and you basically rarely invoke cargo build itself. You invoke (it used to be, I'm not sure if this has changed lately) cargo xtask dist to generate a distribution, and that will go: OK, cool, I need to build these five programs, so let me invoke cargo five times to build those five programs, and then take the final binaries they produce and run some code to figure out, OK, how do I need to tweak this stuff, and so on. So it's all Rust code, but it's written as this kind of build system layered on top of cargo.
Another area where cargo is a little weak here: when you're doing embedded use cases, you often need to build some things for the host but some things for the target. So if I want to build some code that's supposed to run on my local machine, but I also want some programs to be cross-compiled, cargo is not very good at that. You get one target flag, and it'll try to build everything for the target that you name. So there are a lot of nitty-gritty issues that become annoying when, say, you want to be able to say: OK, these three programs are meant to be built for 32-bit Armv8 and 32-bit Armv7, and then this one is built for AArch64, and this one for x86.
Cargo is just not as good at those kinds of use cases. This comes up in higher-level use cases too: if you ever try to build a project with Wasm, where you want to build, say, a server for your desktop but also Wasm for your front end, it's a little awkward. There are a lot of nitty-gritty things like that. Cargo is fantastic when you're in the normal use case, but the more you diverge into weird cases, like when we need to pass in interesting configuration, it can get a little gnarly. We're definitely not going so far as to throw cargo away; as I said, we built this on top of cargo. Although I am very interested in the Buck build system, I have not tried to actually port Hubris's build system over to it, though I've joked about trying a number of times. At some point you kind of end up outgrowing cargo to some degree, but a lot of the time it comes down to things where cargo is deliberately saying: that's out of scope.
So it's not that cargo is inherently bad at those things while choosing not to do them. It's more, like I said earlier about aggressively scoping down your requirements to make things easier: cargo determined that pre- and post-processing of builds is just not something it wants to do; it wants you to layer another tool on top. That has come up in a number of instances where we need either pre- or post-processing of what cargo produces, and so we have to put something on top of it to make it work. I don't think the cargo folks would find that objectionable; I think it's exactly what they would advise us to do in those cases, because cargo is just not trying to do that.
It makes sense to keep the scope of cargo small and focused. And there are also a couple of escape hatches that people can use if they don't want to go all the way and build their own tooling: you can have your own little build.rs for pre-processing, I guess, and then there are workspaces, which is a very nice feature that a lot of enterprise customers use.
I guess once you reach a certain level of maturity with your project and a certain
scale, then you end up bumping against those limitations from time to time.
For example, I also know that you have your own little CI service. I don't know if that is what you talked about, but the thing is called Buildomat. What is that about?
So, okay. We talked about us writing our own operating system in the sense of Hubris, but that's used for the embedded use cases. The actual control plane, the thing that's scheduling your VMs to run on the hardware, that's a repository we call Omicron, which was started before COVID and kind of became an awkward name.
Oh well, we're kind of past Omicron in terms of COVID stuff now. I've joked occasionally: I wonder if we're ever going to accidentally name something after another COVID variant. That'd be unfortunate.
But anyway, that needs to run on something called the host OS. So we do have a more fully featured OS running on top of the embedded stuff, and that is the thing that's actually scheduling your VMs. For that, many places use Linux with KVM to do those kinds of things, but we decided to use illumos and bhyve.
illumos, for those of you who have not heard of it, which is probably most of you, to be honest, is a descendant of Solaris and SunOS, which descends from the BSDs. So if you go back through Unix history, they're kind of like cousins to Linux: there is a common ancestor way back there, but Linux came about as a separate re-implementation of Unix, while illumos is one of the many descendants of the actual BSD, actual Unix tree.
So illumos and bhyve are kind of like Linux and KVM, if you want to think in analogies. At those levels of the stack we are running a full OS that is largely written in C. illumos has been around forever; it is, again, literally descended from that code.
So there's some very old code in there. We maintain the illumos port of Rust, for example, to make sure that Rust programs work well on it, and we do write some Rust code; there is a little Rust being put into illumos here and there. There are a lot of reasons why that decision is a good one for us. But, again, a classic thing that comes up in Oxide-related discussions: if you go your own way and build something custom, it also means you need to keep going your own way and building more custom things. There is no built-in GitHub Actions runner that works with illumos. You can say "please give me Linux," "please give me Mac," "please give me Windows," but you can't say "please give me illumos." GitHub does have what they call self-hosted runners, but that requires you to have a certain kind of setup, and I don't fully work on this, so I can't speak to all the details specifically, but basically, at that point you're already doing a bunch of work to make it all work out.
So effectively, Buildomat is used for illumos-native jobs. Because we're doing systems work, a lot of stuff you can test on Mac, Linux, and Windows and make sure it works well; a lot of our developer tooling, for example, runs on regular old GitHub Actions. But occasionally we need to test on an honest-to-goodness illumos system, to verify the thing actually works on illumos, because we're making illumos-specific system calls and dealing with illumos-specific functionality. So Buildomat was created as a way to plug illumos-specific jobs into these CI systems and make sure that works. If you look at some of our repositories, you'll see GitHub Actions; sometimes there'll be normal Actions jobs alongside Buildomat jobs, and the latter are basically for those illumos things.
But there's another reason why it makes sense for us. GitHub Actions runs on some servers that GitHub rents somewhere, but we make servers; we've got a bunch of servers. Not just because it's fun and because dogfooding is good, but also because we want to make sure our stuff works. Having Buildomat run on an Oxide rack in the office, testing the code we are using to build Oxide, makes more sense for us than it would for other companies, you know what I mean? We'd otherwise spend money on CI servers; we already have servers, so why not use our own servers for our own CI? Being able to do that is really useful and helpful.
But the core of it, the reason why we started doing it, really does boil down to: we have very specific needs for our CI that it is not reasonable for an upstream provider to offer, or at least to offer in a way that totally makes sense for us. On some level that means needing to dig in and write our own stuff. A lot of people say rewriting software is bad, and those people are not necessarily wrong, but rules of thumb are only rules of thumb; they're not laws of nature, you know what I mean? And Oxide is a place where we very, very often find legitimate needs to rewrite some software. I'm not going to say that it's perfect or that it never introduces problems, but it works far better than the people who say "never, ever rewrite your own software" would lead you to believe. Because it turns out that writing a basic CI system is not an impossible task. It's a project that one or two people work on alongside their other responsibilities, and it serves us well. You may think that's an impossible thing to do, but some people think writing your own operating system is an impossible thing to do, and we did that too. It does mean, and this is part of the reason why it took us four years to get from starting the company to shipping servers, well, I'm not going to say it's easy, but not everything that's worth doing is easy. What we're doing is very ambitious and hard, and sometimes that means you just have to commit to doing the work.
You say it as if it were easy, but you have some of the best people in the world; it's an excellent team that you've assembled there. What does it take to write Rust at that level? Do you have any coding guidelines internally that you stick to? Is there anything internal where you'd say, hey, other companies could profit from that knowledge too? Things that you might even avoid, for example overusing generics, or mixing sync and async. What are some of the patterns that have evolved when using Rust at that level?
Yeah. I mean, I love my team, everyone who works at Oxide is great, but also, while we are almost universally very senior engineers, there's not something super magic that makes us head and shoulders better than everyone else. I appreciate the kind words, and I don't want to say my co-workers are bad at their jobs, because they're not; they're definitely very accomplished. But it's engineering, it's not magic. There are many, many good Rust engineers in other places as well.
What I will say is that when I started four and a half years ago, there were a lot fewer experienced Rust engineers in general, period, just in the industry. So a lot of the folks we hired at the beginning were not necessarily fantastic Rust programmers. They were experienced and fantastic engineers in general, and they, not necessarily picked up Rust on the job entirely, but let's just say we didn't have the luxury of demanding that people were good at Rust before joining Oxide in the earlier days.
Part of me coming on relatively early was kind of replicating the... if you commented on Rust on Hacker News in the last 10 years, you probably got an answer from me personally at some point, because that's just how I spent my time. So in the early days of me working at Oxide, I would spend a lot more time answering Rust questions, helping people with Rust stuff, giving advice, and choosing packages. We also did have a lot of people with deep Rust experience when I started, but part of me coming on initially was to help make sure that if people had Rust issues, I was able to guide them through that kind of thing.
By now, years later, there are many more experienced Rust engineers out there. One of the things that has happened is that the overall quality of the Rust hiring pool has gone up, and at this point we are much more likely to hire someone who already knows Rust, simply because there are enough people who know Rust at a high enough level that we're able to find and hire them. So that has changed over time. But there are also some things we do that any company doing Rust can replicate.
We have a dedicated channel in our chat system that's purely for Rust questions. People will ask Rust-specific questions, and I and many others with a lot of Rust expertise make sure to pay attention and answer them as they come up. I think that's really important.
Having space for people to ask things is really a big deal. We also have a biweekly meeting that's optional. Other than one-on-ones, all employees can go to all meetings, so I want to say it's available to everyone, but that's just true of stuff at Oxide in general. If your company doesn't do that, I would still recommend an open-to-everyone meeting; we call ours the Rust study group. Basically, it's a meeting where me and a couple of other people who love Rust have a standing block on our calendars, to make space for questions that are maybe not hyper-urgent. Sometimes people say: oh, I was working on this hobby project and I came across something, and I didn't want to bring it up during work hours because it's not really about work, but I've been learning this thing about Rust and I don't really understand this detail. Maybe it's not on a job-related project, but leveling up at Rust is still going to be good for your job. So people come and bring questions like: hey, I saw this thing go by earlier in the week where people were talking about this new feature, and I was curious what people thought, or how it works, or I don't really understand why people care about this. Or maybe it's: I have this tricky bit of code and I can't figure out why the borrow checker is mad at me, or whatever. So we have both of those forms of explicit time and space to help people with the Rust-related problems they have.
In terms of specific things that have popped up, I think the biggest one, we've talked about it a little bit, but we haven't done a great job of getting our perspective on the conversation out there, is async cancellation. For those of you who aren't familiar: async Rust has what I think is conceptually a really great cancellation model, which is that futures don't do anything until they're polled. To cancel a future, all you have to do is drop it; then it won't ever be polled again, so it's effectively cancelled. That's cool, but it also means there are subtle issues where you don't realize that a future can be cancelled, because dropping something is not always obvious in Rust. There are some patterns you can get into where something will cancel and you don't realize it, and that leads to logic bugs.
This is an area that surprises people, because "if it compiles, it works" is not literally true, but it definitely can feel true sometimes. A lot of people's early experience with Rust is: oh my god, it catches all these problems at compile time, and I don't have these bugs I used to always worry about, because the compiler catches them for me. Then they go into async, they hit their first time where something gets cancelled when they don't expect it to, and it feels like walking that back. They're back in the land of: oh my god, the compiler doesn't catch this problem for me anymore, and now the promises of Rust catching all these things feel a little less true than they used to be. I think some of that is just a natural counter-reaction to how much Rust handles for you in most cases, but it is true that async cancellation issues are generally not statically findable. And they can be surprising, which becomes itself surprising when you're so used to Rust catching things aggressively.
So we've had to find some places where we can figure out: okay, here are some patterns to be avoided, or here are some libraries for being a little more explicit about cancellation. I'm totally drawing a blank off the top of my head, but there have been some experiments that some of my co-workers have tried, to figure out patterns like that. We've wanted to talk a little more about these problems publicly but haven't had the time, because it's always super busy when you're doing a zillion things. But it's definitely an issue the async folks are familiar with overall, and I think anyone who does async Rust at scale has eventually either found out about these bugs, or has some mystery bugs they can't track down that they'll eventually discover are cancellation-related. That's definitely an area where we've found things to be a little surprising, and we have started to get out there and talk about it a bit.
You said that drops are sometimes not obvious in Rust, and that can lead to subtle bugs with async execution. Can you give me an example of that?
Yeah, so I'm going to be very vague, because to be honest this is not an area of the product I work on personally, and when I was in these conversations it was more like six months ago, so I'm a tiny bit out of the loop. But a big concept in this area is whether something is "cancel safe." What that means is: sometimes you have some code where, if you drop it in the middle of its operation, you need some sort of cleanup, or you need to notify someone, or something like that. And that won't necessarily happen if the operation is cancelled mid-flight.
So for example, say you're doing a select between receiving on a channel and some sort of timeout. Doing a select inherently means that you're going to drop the other futures that did not finish first. Depending on how those futures are written, that may be fine: if you're waiting on a channel and you drop that future, fine, you're just no longer listening on the channel; if you're sleeping and you drop that, fine, you're no longer sleeping. But say the thing that finishes first is the timeout, and the other future, the one calculating some value while waiting on the receiver, needed some sort of cleanup step. Because the select timed out, that future just gets dropped, and if you didn't do something to ensure the cleanup would happen properly, then it doesn't happen. So this is code that is totally fine if all the futures in the select are cancel safe, if they have the ability to handle that kind of thing, but if they don't, it can be an issue.
You can lose data sometimes. For example, say that in one future you're writing some data inside of a loop, and the future gets cancelled in the middle of it. That maybe means you only wrote part of the data and never wrote the final part. That's an example of a problem that can occur when something is not fully cancel safe.
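To make that concrete, here is a minimal sketch of the select-plus-timeout pattern, assuming the tokio runtime (the episode doesn't name a runtime, and write_chunks is an invented stand-in for the kind of worker being described):

```rust
use std::time::Duration;
use tokio::time::sleep;

// Hypothetical worker: writes data in chunks inside a loop.
async fn write_chunks() {
    for chunk in 0..10 {
        // ... write one chunk to a file or socket ...
        sleep(Duration::from_millis(100)).await; // a yield point
        println!("wrote chunk {chunk}");
        // If this future is dropped at the yield point above, the
        // remaining chunks -- and any final flush -- simply never happen.
    }
}

#[tokio::main]
async fn main() {
    tokio::select! {
        _ = write_chunks() => println!("finished cleanly"),
        // If the timeout wins, the write_chunks() future is dropped
        // mid-loop: no compile error, just a silent partial write.
        _ = sleep(Duration::from_millis(350)) => println!("timed out"),
    }
}
```

Nothing here fails to compile; the partial write is purely a runtime logic bug, which is exactly the kind of surprise being described.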
And this is where a lot of the discussion about async drop comes in, if you've seen that discussion in the Rust async world. So yeah, I guess that's a high-level example of what's going on.
Can you quickly explain what async drop means?
Yeah. So there's a Drop trait in Rust, and when something goes out of scope for the final time, its drop method gets called, and that runs some code. This is like a destructor in many other languages, especially OOP languages; it's definitely conceptually similar. But in the async world, if a future doesn't get polled anymore, there's no equivalent hook. The Drop trait gives you the ability to hook into "hey, this object is about to be destroyed, so do something now." A classic example is Box: you allocate memory up front, and when drop runs, it deallocates the memory.
A future doesn't have any similar mechanism. If you were to allocate some memory manually in a future and then it never gets polled again, that memory is leaked, because there's no equivalent hook, no point at which you're told to free it. So there have been a lot of different proposals for what an async drop trait could look like, how that might happen, or how we might want to solve the problem a different way. There's been discussion in the async working group for a couple of years now.
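For reference, the synchronous hook being discussed is the standard Drop trait; TempFile below is an invented example of why a sync-only hook is limiting:

```rust
struct TempFile {
    path: std::path::PathBuf,
}

// Drop is a synchronous hook: it runs when the value goes out of scope.
impl Drop for TempFile {
    fn drop(&mut self) {
        // Fine for synchronous cleanup...
        let _ = std::fs::remove_file(&self.path);
        // ...but there is no way to `.await` here, so async cleanup
        // (say, notifying a remote server over the network) has no
        // natural home. That gap is what the async drop proposals
        // are trying to fill.
    }
}

fn main() {
    let _tmp = TempFile { path: "/tmp/demo".into() };
    // drop() runs automatically here, at the end of scope.
}
```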
Why can't you just implement Drop for the future?
Because drop is synchronous: it is a plain function, not an async function. So you'd be calling something synchronous inside an async context. Plus, the definition of dropping a future, "it's no longer being polled," is kind of fuzzy. If I don't call poll for an hour and then I call it again later, technically I've called it again, but during that hour you never know whether it's ever going to be polled again, you know what I mean? So there's some fuzziness in the definition there. It's much more straightforward with synchronous code, because, I mean, you might have a sleep call or something, but the point is you can always statically know: okay, eventually this point will be reached where the value is never used again. In async code that can be more tricky, or dynamic. But again, this is not an area I work on specifically, so I can only really give a higher-level answer on that.
I think the slippery slope here is that when you go forward with that proposal, with the idea of introducing async equivalents of sync traits, you maybe end up replicating parts of the standard library. We see something similar with Read and AsyncRead, or Write and AsyncWrite.
Yeah. The thing is, programmers love DRY, and they love trying to unify concepts that seem similar but are different. I think that's true, but I also think that sometimes similar operations are just fundamentally different, and you can't really abstract over them in the same way. Read and AsyncRead, and Write and AsyncWrite, are great examples. For synchronous read and write, those APIs are very straightforward, have existed for a very long time, make sense, and have only one reasonable implementation. But one of the reasons AsyncRead and AsyncWrite have not been stabilized in Rust yet is that there are at least two meaningfully different proposals for how to implement those APIs. So yes, conceptually it's "just read, but async," but I think that "but" is really load-bearing.
This is maybe stretching an analogy a little too far, but addition and subtraction are both conceptually the same thing at some high level: they're both a binary operation with an operator in the middle. You might want to say, well, maybe those are the same thing, and we should abstract over both of them. That's how you get monoids. But that doesn't always mean the abstraction is useful, because sometimes it papers over the details that matter. At a high level, the idea that I'm applying a binary operation to two things and it does something, sure, sometimes working at that level of abstraction makes sense. But sometimes you really care whether it's actually addition or actually subtraction.
You know what I mean? I think a lot of people want to reach for the idea that we should inherently be able to abstract over sync and async. But I think they are different enough things, with different enough semantics, that doing so is a mistake, at least, let's put it this way, for Rust specifically: a language that cares about the low-level implications of what you're doing, where you need to be able to integrate with an underlying system, where the details matter to you. I think over-abstracting these things is a mistake there. In a language like Haskell, and there's a reason I reached for monoids a second ago, you can abstract over sync and async, because conceptually they're the same thing. But Haskell does not have the same performance requirements and low-level commitments that Rust does, so it can afford that abstraction, even though the abstraction costs you something. In Rust, I think the details are significant enough, and the two kinds of operation different enough, that it is important and meaningful to keep them separate. So I don't think it's inherently a bad thing that you end up redoing async versions of some sync stuff that's in the standard library.
Also, Drop conceptually makes sense; I'm not sure I fully think that async drop specifically is a good idea, but some analog, some way to solve that problem, makes sense. Read and Write are there too, true, but that's like 95% of it; there's not a whole lot more. The other stuff we might make async isn't a brand new concept either; I don't think it's the end of the world, let's put it that way. Some duplication is totally fine and meaningful. Sometimes abstractions are good because they let you work with stuff at a high level, but you also need to be able to do stuff at a low level, and sometimes the trade-off is that you don't get to use the high-level abstractions. But this is definitely a really big debate in the Rust world right now.
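To see why "read, but async" is load-bearing, here is a sketch of the differing trait shapes. The async signatures are paraphrased from memory of the futures and tokio crates, and renamed here so the file is self-contained; consult those crates' docs for the authoritative definitions.

```rust
use std::io;
use std::pin::Pin;
use std::task::{Context, Poll};

// std's synchronous Read: one obvious shape, stable since Rust 1.0.
trait SyncRead {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize>;
}

// Roughly the `futures` crate's shape: poll-based, filling a plain
// byte slice and returning how many bytes were read.
trait FuturesStyleAsyncRead {
    fn poll_read(
        self: Pin<&mut Self>,
        cx: &mut Context<'_>,
        buf: &mut [u8],
    ) -> Poll<io::Result<usize>>;
}

// Roughly tokio's shape: a ReadBuf that tracks initialized-but-unfilled
// memory, returning () because the buffer itself records progress.
struct ReadBuf<'a> {
    _storage: &'a mut [u8], // stand-in for tokio's real bookkeeping
}
trait TokioStyleAsyncRead {
    fn poll_read(
        self: Pin<&mut Self>,
        cx: &mut Context<'_>,
        buf: &mut ReadBuf<'_>,
    ) -> Poll<io::Result<()>>;
}

fn main() {} // nothing to run; the point is the differing signatures
```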
At the same time, I couldn't agree more. You really summarized it super well, because this is one thing that I really love about Rust: its explicitness, and not having these leaky abstractions, because you make it explicit that there is a difference between them. I really like that part. But at the same time, the async ecosystem is pretty new, while the rest of Rust, the sync part, has matured over the last 10 years. And since I have you, I might as well ask, because you have a lot of experience and you've been in this community for a long time: what would you say is one big mistake that the Rust language made? Something in the standard library, or anything regarding its syntax or semantics, that you would see as a historical mistake and would want to change?
Yeah, so I have a joke answer that's funny, and I have a time where I thought a mistake was being made but I was wrong, and then I definitely have a good answer for one that's real. But I'm going to tell the first two anyway, to give myself time to think about what the real one actually is, because I want to make sure I give a good answer.
The joke answer I always give is that String should have been named StrBuf. That's because we have Path and PathBuf. I think that naming, the fact that it's capital-S String, I feel like there'd be a lot fewer "Rust has 36 different string types" complaints if we acknowledged a little more that it's a buffer: a mutable, growable string, as opposed to some other kind of string. So I've always joked: I want Rust 2.0, I don't want anything changed, except I definitely want StrBuf instead of String. So that's the silly answer.
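For readers who haven't compared the types, this is the naming symmetry the joke is about:

```rust
use std::path::{Path, PathBuf};

// Path is the borrowed view, PathBuf the owned, growable buffer:
fn borrow_path(_p: &Path) {}
fn own_path() -> PathBuf { PathBuf::from("/tmp/demo") }

// str and String have exactly the same relationship, but the names
// don't say so -- hence the joke that String "should" be StrBuf:
fn borrow_str(_s: &str) {}
fn own_str() -> String { String::from("hello") }

fn main() {
    borrow_path(&own_path());
    borrow_str(&own_str());
}
```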
There was a time when I thought Rust was making a huge mistake, and I was definitely wrong, and that is the postfix await syntax. For people who weren't around when we were doing async/await's design: there was a point in time where how to write await was a hugely contentious topic. Should it be a prefix keyword like JavaScript's? Should it be some other syntax? Or should it be what it is today, the .await? And I am actually really conservative when it comes to programming language design. At the time, I thought: we have a lot of people coming from JavaScript, and JavaScript and C# both do prefix await. I think that when it's not clear which way you should go, you should choose to be conservative in language design. In many other areas of life I do not believe this, but for programming language stuff I think being conservative is generally pretty good. So I thought it was a really big mistake to add this weird syntax for async/await.
However, after having had to write a whole ton of async code: prefix await would have been a huge mistake. I'm really glad they did not listen to me personally and went with postfix await anyway, because it's just clearly superior in every possible way. And brilliantly enough, people figured out how to address the major concern I had, which is what happens when JavaScript people come and write the wrong thing, and that is diagnostics in the compiler.
A really cool thing that Rust does, that it doesn't have to do, is that sometimes the Rust compiler will parse code that's wrong just to give you a great error message. So if, instead of foo.bar.await, you write await foo.bar, the Rust compiler knows how to parse that, even though it's not valid Rust, just so it can deliberately say: hey, this is not how you write await, write it like this, foo.bar.await, do that instead. That's a really trivial way to make sure that anyone writing in the old style of another language won't get confused; they will get helped into writing the correct thing.
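A quick sketch of both halves of that: postfix chaining, and the compiler parsing the invalid prefix form just to suggest the fix. The fetch function is invented, and the quoted diagnostic is approximate:

```rust
// Hypothetical async fn for illustration.
async fn fetch() -> u32 { 42 }

async fn demo() -> u32 {
    // Postfix await chains naturally, left to right:
    let value = fetch().await;

    // Writing it JavaScript-style is a syntax error, but rustc parses
    // it anyway so it can point you at the fix. Roughly:
    //     let value = await fetch();
    //     error: incorrect use of `await`
    //     help: `await` is a postfix operation: `fetch().await`
    value
}

fn main() {
    // demo() is never polled here; this just makes the file compile.
    let _ = demo;
}
```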
So that's an area where, at the time, I was definitely on the wrong side, and I think I've thoroughly admitted that I was mistaken.
My biggest answer, in terms of where Rust made mistakes that I think are a little more serious and real, is that there were a couple of things that, I don't want to say nobody cared about them at 1.0, but there was so much work to do that they never got fully, completely thought out. And I think the biggest of those is the module system. I actually like Rust's module system; I think a lot of what it does is good. But it is a common problem for people coming to Rust, and it has been for a long time. In Rust 2018 we made some changes that made it a little easier to understand, but it is still the number one thing people name when they read the book and say "I was confused by something": the module system. I don't necessarily have a constructive answer for how we should have done it instead. It was built a very, very long time ago, I don't even remember by whom initially, and then there was so much other stuff to do leading up to Rust 1.0 that there was never a moment where we said: okay, we need to really think through whether this is how we want the module system to work.
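For readers who haven't hit the confusion themselves, the usual stumbling block is that the module tree is declared rather than inferred from the file system. A minimal single-file sketch:

```rust
// Modules form a tree you declare explicitly. In a real project,
// `mod network;` would pull in src/network.rs, but the declared tree
// is what matters either way.
mod network {
    pub mod server {
        pub fn connect() {
            println!("connecting");
        }
    }
}

// Name resolution then walks that declared tree:
use crate::network::server; // absolute path from the crate root (2018 style)

fn main() {
    server::connect();          // via the `use` above
    network::server::connect(); // or spelled out from the root
}
```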
Even in 2018, when we did revisit it, there were a bunch of legacy constraints: okay, we want to change how some of this works, but we have to make sure it's not too different for all the people who currently know Rust, because having two completely different systems would be very, very bad. So some of it was tied up that way. And in general it's not really just the module system but also name resolution overall, meaning: when you type an identifier, how does Rust determine which identifier you mean? The module system is part of that whole situation, and a lot of it never really had the time to be fully thought through and designed before 1.0 came out. I think there's a way you could have simplified a lot of that.
This gets into stuff that most Rust programmers never even really think about, but different kinds of items live in separate namespaces, so you can have two items with the same name and the Rust language will disambiguate between them, and it's totally fine. I think there are three different versions of namespaces, and they can all be overloaded like that. This leads to very confusing outcomes if people name things in strange ways, and it's all complexity the compiler has to deal with when trying to look up a name: it has to look through all of those namespaces. I think all of that probably could have been pretty radically simplified, and it's something that's basically impossible to deal with at this point; it's just baked in.
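Steve is recalling the details from memory; the easy-to-verify version of the overloading he's describing is that Rust resolves names in separate type, value, and macro namespaces, so a single identifier can be three different items at once:

```rust
#![allow(non_camel_case_types)]

struct thing { _x: u8 }  // `thing` in the *type* namespace
fn thing() {}            // `thing` in the *value* namespace
macro_rules! thing {     // `thing` in the *macro* namespace
    () => { thing() };
}

fn main() {
    let _t: thing = thing { _x: 0 }; // resolves to the struct
    thing();                         // resolves to the function
    thing!();                        // resolves to the macro
}
```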
Macros are similar. Nick Cameron really wanted to do a rebuild of macros, and that's why, before Rust 1.0, we renamed the macro keyword to macro_rules! and reserved the macro keyword: the idea was that someday a new macro system would be built. That just never happened; it all disappeared at some point. So that's a similar story, although I don't think macro_rules! macros are inherently bad; there were just other options that could have been considered a little more.
And proc macros are sort of wonderful, but they are also really complicated and have some really weird technical angles. Why do you have to make a proc macro its own crate? Well, because proc macros are kind of an outgrowth of compiler extensions, which existed a long time ago. And why were compiler extensions the way they were? Well, back in the day it was basically: you could ask the compiler to open up a library you wrote, mess with the compiler's internal data structures, and produce something. Proc macros are a crazy strong aspect of the Rust ecosystem, but that doesn't mean the feature had to be implemented the way it's implemented, and now that's just kind of there forever. It's also good enough that nobody is truly invested in making a better one. So that whole area, again, is similar: is it the worst? No. Could it have been better? Yes. I think those are some things that a Rust++ could meaningfully address. I don't think they're necessarily enough to justify a whole separate new language, but they're definitely big warts that maybe could have been fixed, but just aren't really possible to change now.
All right, that was a really nice side quest into Rust specifics. Before we close, I wanted to quickly come back to Oxide, because we covered the hardware and the firmware, we covered Hubris, we went up the stack to illumos and bhyve. But I wonder what's above that: the user-facing things, the interface to the system. Can you talk a little bit about that?
Yeah. So we're super big believers in the OpenAPI spec. Is it perfect? No. Is it good enough? Yes. The way that you actually interact with the Oxide rack is via an API. The same system that's running the control plane, determining which VMs get scheduled where and so on, exposes an HTTP API to users. What that means is that not only can you use client libraries to, say, spin up a VM, you can use the Oxide CLI and say "give me a VM that looks like this," and it will make one, and it does that by making HTTP requests, but it also means we have a web console, just like the AWS console or anything else: a website you can load up where you click buttons and manage your rack that way. All of that is built on the idea of the rack exposing an OpenAPI definition and then being able to generate clients on top of it.
For the front end of the console website we are using TypeScript, not Rust and Wasm or any of the front-end Rust technologies. Basically, TypeScript gives us like 85% of what we would want out of Rust over JavaScript; having strongly typed stuff is really, really important to us, and all those kinds of things. The Rust front-end web ecosystem is even younger than many other parts of the Rust ecosystem, and, I definitely don't want to say it's not usable at all, but we didn't want to bet on something that young while we're already doing so many other things. Plus, a lot of the people who do front-end web work are already familiar with TypeScript, so it's easier for them to pick up, although many of them also do a little Rust on the side, because they interact with other parts of the stack too.
This has become a very common pattern for building applications at Oxide in general: a Rust server backend, a TypeScript front end, and an API layer expressly in the middle.
And so we actually wrote a server framework called Dropshot. At the time, a lot of people used Axum or whatever other web frameworks exist in Rust, and we wrote our own specifically because, back when we were building it, there was not a lot of stuff with deep OpenAPI integration. What this means is: a part that stinks about OpenAPI is writing the definition by hand, so we just don't do that. With Dropshot, our server framework, you write the endpoints yourself, and then you can pass a command-line flag to the server that says: hey, please generate an OpenAPI document for me. It will look at all the code you wrote for your endpoints and generate the full OpenAPI specification document for you, so you don't need to write it by hand.
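As a rough sketch of what that looks like, with signatures paraphrased from memory and liable to differ across Dropshot versions; the endpoint itself is invented, so treat this as a shape, not the exact API:

```rust
use dropshot::{endpoint, ApiDescription, HttpError, HttpResponseOk, RequestContext};

/// List the names of all instances. (Invented example endpoint.)
#[endpoint {
    method = GET,
    path = "/instances",
}]
async fn list_instances(
    _rqctx: RequestContext<()>,
) -> Result<HttpResponseOk<Vec<String>>, HttpError> {
    Ok(HttpResponseOk(vec!["wendell".to_string()]))
}

fn main() {
    // Register the handlers; Dropshot now knows every path, method,
    // and response type from the Rust definitions alone...
    let mut api = ApiDescription::new();
    api.register(list_instances).unwrap();

    // ...so it can emit the OpenAPI document instead of you writing it.
    api.openapi("Example API", semver::Version::new(0, 1, 0))
        .write(&mut std::io::stdout())
        .unwrap();
}
```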
Then we have a generator that can read in an OpenAPI document and spit out a client library; not just for TypeScript, but in the web context that's the most important one. It spits out a TypeScript library that knows how to interface with things because of that OpenAPI document. In practice, when I'm building web stuff at Oxide, that means I write my server-side definition, I say "hey, please regenerate the client in TypeScript," and when I switch back over to my TypeScript file, it gives me a type error that says "hey, you're not passing this correctly" or whatever. So I get full type safety the whole way up the stack, which is really, really cool and useful. We've been very happy using TypeScript. It's one area of the product where we're not using Rust for everything, but that's a pragmatic decision to engage deeply with that ecosystem too. So yeah, it's been very, very nice.
People who are not familiar with OpenAPI might think: oh, that's just a lot of extra work on top of what you already do, which is an infinite number of other yaks to shave. So I wonder: what are the practical benefits of having an OpenAPI spec? What can you do with it?
Yeah, so some of it is very straightforward: type safety is cool. We like type safety; it helps, even if it doesn't solve every problem. But also, people need to write applications against our API, and they use a bunch of different languages and libraries. I think the TypeScript side is interesting for its own reasons and the non-TypeScript side for its own reasons, so I'll talk about the latter first, because it's simpler. In general, people definitely want Go, they definitely want Rust, and then there's maybe some other stuff, but those are the two big ones. We do need to support a bunch of different languages that people will want to write applications in, and every company has its own stack, so we can't assume; maybe they're a Java shop and they really want a Java library. So one benefit is being able to meet customers where they are. But most of the non-TypeScript clients are not very interesting, because each is just a general REST client, right? You're making HTTP calls; there's nothing super novel there. The benefit is in not having to handwrite every single client in every single language, and still having something reasonable pop out in each one. Obviously, handwritten clients can be nicer than generated ones, but we're also not using the generic OpenAPI tooling: we wrote our own generator, targeting only the features we actually use, so it produces something a little nicer. Still, "a slightly nicer REST client" is not that interesting.
One of the things that I think is really cool on the TypeScript and front-end side: if you go to the console repo on GitHub, which is the front end to this thing, the web console, there's a link to a little Vercel deployment that lets you play with the console in your browser. The reason that works is that we can use the OpenAPI definition to also generate a mock server, and then run it in a web worker in the browser. So you're able to play around and say "spin up a server," and it will pretend to spin up a server by running that in the mock worker in your browser. When you go to list all servers, it'll show the one you spun up, and when you spin up another one, even though there are fake page loads in between, it remembers all that state, so you get some very basic logic working. I just think that's such a cool demo: you can actually play around and see what it feels like without needing to spin up a backend at all.
And it's kind of funny: way earlier I was talking about how at the firmware level, interface layers are no good. But here, at the very front end, at the highest levels, we're interfacing with the external world, which is not something we can control, and other people want to use other technologies to interact with us. At that level it is worth doing the work to put in that kind of universal interface. Like I said before, is OpenAPI the best API description language that ever existed? No. But it's one that a lot of people use and know, and it works well enough. So that's a good example of us making a totally different trade-off at a different part of the stack: internally we don't need to collaborate with outside folks, so those layers don't make sense; at the highest, externally facing levels we do, and so it's worth putting in the time and effort at that layer.
It's so incredible to even think about having such a use case, because I don't think you had that in mind from the very beginning. It just evolved over time, and then you had the ability to put that test server on the web. It must have been a really magical moment when you, as a company, realized that you could do that.
Yeah, I think that happened before I showed up personally, but definitely when I was introduced to it, when I got shown it, I was like: this is so cool, I want to talk about this all the time, because I think it's just a really, really neat way of doing things.
And no one else does it. These are some of the things that are so exceptional, yet you rarely hear about them.
Another thing that I think you do, which not many other companies do, is to be very open about discussions and to involve the community. You have this RFD process; I think it stands for "request for discussion."
Yeah.
It reminded me of Rust's RFC process. Is it modeled after that?
It definitely takes some influence from that, but it's also my understanding that Joyent used RFDs to talk about stuff internally, and both were definitely inspired by the IETF RFC process. I think the high-order bit is the general idea that there is a written document that you come to consensus around and then use to move forward. So it takes inspiration from a bunch of different places, but at its core, all those processes are about the same thing: when you need to get a lot of people on the same page to do something, you need some way to achieve consensus. And we really value the written word at Oxide, because that's the thing that truly scales.
If you have a meeting, you can only have it with so many people before it falls apart and doesn't work, but a bunch of different people commenting on a text doesn't have to happen at the same time: it's asynchronous instead of synchronous. We're a distributed company with people living all over the place, so that matters a lot. It also lets people work at their own pace. In a synchronous meeting, if I want to think about something, give it half an hour's worth of thought, that's not really feasible; in a meeting with 10 people I can't just say, well, I want to sit here and think about this for an hour before I say what I want to say, because I'd be wasting everyone's time. Whereas with an RFD you're able to say: okay, I'm going to sit with this and really think about it, and you're not blocking anyone else. So I think there are a lot of advantages to doing things that way, for sure.
I think we covered a lot of ground, going from hardware all the way up to the software interface and web applications; it has been a crazy tour. The one thing that I still wonder about, and maybe some listeners might be curious about too: if Oxide had started in 2024, now that the programming language landscape has changed a bit, would you write Oxide in Zig?
No, and there are a couple of reasons for that. I really like Zig conceptually; I've known Andrew for many years and I consider him a friend. He himself says: don't use Zig in production yet. It is still changing massively all the time. You have TigerBeetle and you have Bun, and I think there's maybe one other company, but that's about it. At this point Rust is still the safe choice. There are millions and millions of lines of Rust in production that everyone touches every day; even just at Cloudflare, 10% of the internet hits Rust code all the time. There are lots of cool things about Zig that I wish Rust would steal, honestly. But for me personally, the ironclad memory safety guarantee that Rust has, versus the "we probably fixed most memory safety bugs" that Zig has, is a deep philosophical difference. For me, 80% of the problem being solved is not worth it, but 100% of the problem being solved is, and obviously Rust is more like 99.99%, but I still think it's a very significant advantage. More and more, we are finding that 100% memory safe by default, with an escape hatch, is the correct choice for languages. Obviously I would say that, because I'm a Rust person rather than a Zig person, but I do think it's meaningful. So I think the exact same choices would still be made today. While it's true that Zig exists, along with a couple of other upcoming languages people are working on, they're still very young and very early, and they are not attempting to solve memory safety as fully as Rust is.
I think a lot of people are really excited about Oxide because you do so many things right. You have vertical integration, you have top-notch branding, you have your own podcast, you involve the community through the RFD process. Everything is done well, and it's done with purpose. And this other thing that you always do, which I think is amazing, is contribute back to the open source community. A lot of the tools we talked about today are open source, so people can check out the source code. Can you maybe list a few projects that are open source? And in general, what is the methodology at Oxide for deciding how and when something can be open sourced?
Yeah, so, hilariously, I co-wrote an RFD with Bryan on the open source policy at Oxide, and, also hilariously, I've forgotten to mark it as completed, and I think it's one of the ones that's not public. So I should get on making that public at some point, because it's kind of funny that it itself isn't. But essentially, the default position of Oxide is that everything should be open sourced, to the extent that we can possibly open source it. There are a couple of different reasons for that.
We do have a bit of an interesting relationship with open source sometimes. Because we are a company and we're doing so much, we often don't have time to build communities around our projects. I mentioned Dropshot before: it is open source, it is on GitHub, we do accept pull requests, but we're not trying to make it the Rails of Rust, because we don't really have the time to run a fully community-managed project. It's more: this is a thing we built that's useful for us, and if it's useful for you too, that's great. Or Hubris, for example: we take pull requests in the sense that they are open and we'll accept them sometimes, but we don't really have the time to review, say, someone refactoring a major component; that's just not something we could accept, because we really need to stay focused on what's good for us. So there's this interesting balance: we want to make stuff available, and sometimes libraries are easier to share with people in a way that makes sense conceptually.
But a lot of our open source is kind of, it's not a source dump, it's not like we're just throwing it over the wall, but we really can't accept a ton of external contributions. Still, it's really important to us that it is open source, because one of the big problems, as I mentioned a long time ago in the firmware part of this discussion, is that you may not even know what is running on your computer. If you buy a computer from another vendor, there are whole operating systems hiding in the nooks and crannies of your computer. We think it's really, really important that when you buy a server from us, you are actually buying the server, and it is yours. You get to know what is running on it, because it's your hardware; we shouldn't have secret computer stuff running on hardware that you purchased from us.
That also means we don't do software licensing fees. If you buy a server from Dell, you're going to buy the hardware, but you're also going to license the software, and that means an ongoing cost. So did you really buy the computer, or did you just pay a big amount up front and you're renting some of it? You know what I mean? And so, a thing that happened recently: we use CockroachDB in the control plane, and Cockroach announced that they're moving from the BSL, the Business Source License, to a proprietary license; they're going sort of closed source, or, yeah, source-available. A lot of people asked: oh, what does that mean you're going to do? And we said: well, we're sticking with the last version that was Apache-licensed, and we're just going to keep doing that.
And people were like: oh, why didn't you negotiate a license with upstream? Maybe you could have paid for it, gotten a discount or whatever. Well, first of all, it's not like we didn't talk to them at all. But the idea that we would be paying a per-rack license fee doesn't make sense, because our customers own the hardware after we sell it to them. So what, are they going to pay Cockroach a licensing fee? That wouldn't be fair to them.
Open sourcing the software in general means that our interests are aligned with our customers' interests, which is that we're trying to sell you a computer. I keep going back to that because it's just so funny: on some level, Oxide's business is so straightforward in a world where so many tech companies' business models are really complicated. We really want to sell you a computer, and then it's your computer. Obviously support contracts are an ongoing thing that we do, and stuff like that, but it's really important that you should know what's running on the computer that you get. That's true in an ethical sense, but it's also true in a security sense. If you install this rack in your data center, it's important that nobody's trying to steal your stuff. Privacy is probably pretty important to you if you're buying your own computers and racking them in your own data center; you care about making sure that you're the only one who gets to know what's running on your stuff. Being auditable is a really important part of that, and being open source is not necessarily a precondition to being auditable, but it certainly makes it a lot easier, because it means you can literally see the code.
There still are a couple of binary blobs that we have to do.
Like while we did rewrite a lot of the firmware, there still are occasionally
bits of the firmware that we can't actually fully open source.
And occasionally there's stuff that, you know, maybe has an NDA or whatever.
So I'm not going to say it's fully 100% there, but in general,
like we try to open source everything that we do to the greatest extent possible.
And we've also found that there's a benefit there, because sometimes it's just more annoying for stuff to be closed source. If you've ever had to deal with Cargo and Git dependencies because you depend on some sort of closed-source library, it's kind of a pain in the butt, right? So it's just easier, even if you never intend for anyone else to actually read the code, to be like, hey, this is actually open source.
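To make that pain point concrete, here is a sketch of what depending on a closed-source crate over Git looks like in Cargo.toml; the crate name and repository URL are hypothetical:

```toml
# Cargo.toml -- depending on a private, closed-source crate via Git.
# Every developer machine and every CI job now needs credentials
# for this repository, and you lose the usual crates.io versioning story.
[dependencies]
internal-widgets = { git = "ssh://git@github.com/example-corp/internal-widgets.git", tag = "v0.3.1" }
```

An openly published crate makes all of that credential plumbing disappear.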
And so we use the MPL by default, which is a really interesting license. Most stuff in the Rust world is Apache 2.0 slash MIT licensed. The MPL is the Mozilla Public License, and it's basically a weird hybrid between the MIT and the GPL: code based on MPL code needs to stay open source, but the copyleft applies at the file level instead of the project level. And so we feel like there's a really nice trade-off there between copyleft and the more permissive licenses.
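Concretely, the MPL's file-level copyleft works through a standard notice at the top of each covered source file. This is the stock MPL 2.0 header, shown here in a Rust file; nothing about it is Oxide-specific:

```rust
// This Source Code Form is subject to the terms of the Mozilla Public
// License, v. 2.0. If a copy of the MPL was not distributed with this
// file, You can obtain one at https://mozilla.org/MPL/2.0/.
```

Changes to a file carrying this notice must stay under the MPL, but the file can sit alongside differently licensed files in the same project.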
And so that's sort of our default choice. But if you're integrating with a code base that has already made some sort of choice there, then we stick with the same license; we don't mandate that everything has the exact same license. But we do feel like it's really important in many ways. And so that's something that we try to do, even if we're not necessarily on the big community-building end of open source stuff, because we're doing so much, we just don't have the time. But we do think it's important to align our business interests with our customers' interests, and open source is a great way to do that.
If I had to play the cynic for a moment, I would ask, couldn't you also change
the license like Cockroach did and lock in your users?
In theory. But we also don't really do copyright assignments at all, so if we've accepted stuff from other people, or if we're building on another project, we couldn't necessarily change the license. But, you know, I think a more cynical question would be: well, why does it matter? It's custom hardware anyway, so who's going to be running that software on a different computer? And hilariously, a lot of it is actually pretty easy to run on other stuff. But yeah, no, I mean, we could do it; anyone could make any decision at any time. Currently, though, that's where we're at.
That's very fair. And you never know where this approach leads you, because maybe someone finds a very interesting way to use that source code in a completely different context at some point, maybe 10 or 20 years down the road. This is another thing I was wondering about: what makes you confident that Rust will be around in 10 years, or will even be relevant? Or, I don't know, let's think about 20 or 30 years down the road. A lot of languages go away over time.
Yeah. I think that a lot of people don't realize how much production Rust is out there, how many companies truly depend on it, and how many people would step up to fix things if Rust were to suddenly implode. Meta alone has, I think, on the order of 10 million lines of Rust. I don't think it's 100 million lines, but it's somewhere between one and ten million lines of Rust. And Amazon: Rust is now involved in S3, in EC2; tons of different AWS services have Rust at really key points. I mean, Rust is in the Windows kernel already. People were talking about, oh, is Rust mature enough for the Linux kernel, which, I mean, Asahi Linux has already built graphics drivers on it. And while it's not in the upstream kernel fully yet, and while there's been a little bit of discussion about that, people are like, is it actually mature there? It's in the Windows kernel. I'm talking to you on a Windows machine; there is Rust code running in my kernel right now. It's just a little bit, but it's expanding. Microsoft is rewriting some legacy Windows stuff in Rust, like GDI, the Graphics Device Interface. They have a port of that to Rust.
So there are millions upon millions of lines of Rust out there, being used for real, important things. If something were to happen to the Rust team, it's also worth thinking about what it would mean for Rust to die. Okay, that would mean that the Rust team would somehow not exist anymore, but the code base of Rust itself would still exist. And at that point, for those companies, it would be an existential threat to their business for that technology to implode, and so they would have to come up with some alternate way of making it work. It can't just stop at this point. Rust has reached escape velocity and is a language that will survive now. Will it survive in the sense of VB.NET, where it technically exists but not a lot of people use it? Or is it a COBOL, where it's used for some things but not for anything else? I have no idea if it's that kind of legacy. Or is it like a C/C++, where it's used for smaller and smaller amounts of things, but still for very important things? I don't know. But it's definitely past the point where it will just disappear at some point, because it is used in too many things that are too meaningful. The United States government is talking about, you know, preferring Rust over other languages when procuring things. And is that a thing I feel great about? I don't know, I feel complicated about it. But it's at that level of maturity. And so I think a lot of the people who sort of think Rust is a fad are just uninformed about the degree to which Rust is used in industry for real, meaningful production applications.
It is absolutely going to live on. I don't know if it's going to be 30 years or 100 years, but it's definitely going to be 30 years. And maybe in 30 years there'll be the Rust++ and the cool kids will be using that or whatever, but there will still be jobs for Rust programmers. In the same way that, you know, Python is super, super hot now and started 30-some years ago itself. I think that's kind of where Rust will find itself in the future.
That was in '91 or '92, somewhere around that time.
Yeah, I think so, yeah.
Before we close, you have the opportunity to mention any tools that you want people to check out. They can be tools from Oxide or external open source tools that people might or might not know, anything you find interesting in the Rust ecosystem.
Yeah. The two things that I'm most interested in at the moment are not Oxide projects, but they are both written in Rust. The first one is Buck2, which is a build system from Facebook slash Meta. Buck 1 is, without getting into the full history, kind of similar to Bazel or Blaze, if you're familiar with those from Google. At one point, because I was doing the build system stuff for Hubris, as I mentioned earlier, I got it in my head: okay, if I was going to make a build system from scratch, what would I do? And I read a bunch of papers and I learned about the space. And I said, okay, cool.
I would want this general design. And then I was like, okay, well, maybe I'll start writing that in Rust. And so I started toying around with the idea of doing it, and then I learned that Buck2 existed. And what I found was that it was all the people from all the papers that I had read, doing the design I wanted to build, and writing it in Rust. Now, what I'll say is that it's a little hard to use Buck if you're not familiar with those tools already. I think the developer documentation, the introductory documentation, is not very good. And I'm not going to say that it's a flawless tool, but it is one that I'm very interested in and learning more about. I still have not used it a whole ton, and like I said, I haven't ported any of my work projects over to it yet. We are using it at Oxide on a project for FPGA shenanigans. But it's definitely a tool that I'm very interested in.
Tools like Cargo and npm and RubyGems are build tools that are good for the small, easy cases, as we talked about: if you're on the straightforward path, they work really, really well. And tools like Buck and Bazel are really good for the monorepo, Google-style case, where everything in your whole company, hundreds of millions of lines of source code, is being built with these kinds of things. But they're hard to use. And so I'm really interested in: how can we bridge these two worlds? Is there a way for a build system to be good in the small and in the large?
And I think it's going to be better to bring the conceptually correct large build systems down to be easier to use for the smaller cases than it is to scale the small, easy-to-use build systems up to the big cases. But I don't think we're there yet. So I want to shout it out as a tool I'm interested in. I don't think it's perfect, and I'm not saying you should switch to it tomorrow or anything, but I do think it's an interesting sort of space to watch.
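For a taste of what Buck2 looks like in practice, here is a minimal sketch of a BUCK file using the Rust rules from Buck2's open source prelude; the target and file names are illustrative:

```python
# BUCK -- a minimal Buck2 target for a Rust binary (Starlark syntax).
rust_binary(
    name = "hello",        # build it with: buck2 build //:hello
    srcs = ["main.rs"],
    edition = "2021",
)
```

Every target is declared explicitly like this, which is part of what makes the model scale to huge monorepos, and part of what makes it feel heavier than Cargo for a small project.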
The second tool is called jj. It is from Google, it is written in Rust, and it is a new source control system that's Git-compatible.
I've actually written a tutorial on it that I haven't publicized a whole lot yet, but it is actually going to become the upstream tutorial sometime soon; we've been talking about that with the team. It is a version control system that is Git on the back end, but it's not Git on the front end.
And, you know, I love Git. I've used Git for a very long time. I love the Git CLI. Whenever people in the past said the Git CLI is bad, I would be like, I understand that you struggle with it, but I do not; I think it's totally fine. I like it, actually. And jj is the first time: I haven't used Git in months at this point, because I just use jj instead, and it is both simpler and more powerful than Git at the same time. And, like the C people, I think the Git people have been used to hearing that kind of claim for a long time. So I was very skeptical when I first heard about it. But jj has a lot of really interesting things going for it: it ends up being a smaller set of primitives that are more orthogonal, and that's why it ends up being easier and more powerful than Git at the same time.
So definitely check that out. If you stalk my GitHub, you can find my tutorial, or maybe, someday, if you're watching this, it will be the upstream tutorial. I definitely think it's really super powerful, and it's a tool I use every day and love. Even though it's a pre-release tool, it's got a lot going for it. I'm very, very excited about it.
What I liked about JJ was the fact that you can avoid naming branches and you have a blog post on that.
Yeah, I just wrote a blog post about that. Absolutely. That's another one of those things where I never really understood it. I was like, what do you mean? How do you work with branches if you don't name them? And then at some point you're like, oh, I haven't named a branch in a long time; I actually don't need to do that. It's cool. So yeah, absolutely.
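For the curious, a minimal sketch of that branchless flow, using a few basic jj commands (jj is still pre-release, so details may shift):

```console
$ jj new                       # start a new, anonymous change on top of the working copy
$ jj describe -m "fix typo"    # give it a description; no branch name required
$ jj log                       # changes are addressed by change IDs, not branch names
```

Instead of naming a branch up front, you just start a change and refer to it later by its change ID if you need to.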
It was a very beautiful thought, because you bridged the gap between the version control front end and back end, the way you did with TypeScript and Rust. That was impressive.
Thank you.
So many people asked me, and I almost forgot, one important thing: when will you start hiring in Europe?
So the funny thing I want to say is that, first of all, we would have been hiring in Europe, except Britain left Europe. So we have some people in the UK. But we are willing to hire people in Europe. The only thing is that, because the team is still so small, we try to have working hours that overlap with San Francisco, and that's really hard for most people in Europe.
There are some people who definitely, you know, stay up late or wake up early and can make that work, but it's definitely a bit of a challenge. As the company grows, we broaden, and that requirement gets relaxed more and more. So I would say that Europe is definitely not a guaranteed no right now, and it'll just become more and more common as time goes by. I don't have a great timeline for exactly when, but that's kind of the overall philosophy: it's not that Europe is a no, it's that the time zones are hard, as programmers know. So we'll get there.
In closing, the traditional question: do you have a message to the wider Rust community? Anything that you want to share? The stage is yours.
Yeah. I think that, you know, when I started using Rust, it was like 40 people in an IRC room, and now it is millions of developers, literally all across the globe. And so growing is hard, and there are lots of changes that have happened. If you've been around Rust for a long time, you've seen a lot of change, and I think change will continue to happen in the future. The best thing that we can do is continue to write good software, continue to try to treat each other well, and continue to build cool stuff in Rust and share it with the world. Not everybody's going to love Rust, and that's totally okay. But I still think there's a lot of growth in the Rust world, and I still think there's a lot of work to do. So I'm excited for us to keep continuing to just build cool stuff in Rust and keep chugging along. Yeah, I don't know, that's, I think, what I have to say right now.
Steve, that was amazing. I have to thank you so much. If just a single listener starts learning Rust because of this, I think I've achieved my goal, and you did the same a thousandfold. I'm so happy to have you as an ambassador of the language I love.
Awesome. Thank you so much. That's very kind.
Rust in Production is a podcast by Corrode. It is hosted by me,
Matthias Endler, and produced by Simon Brüggen.
For show notes, transcripts, and to learn more about how we can help your company
make the most of Rust, visit corrode.dev.
Thanks for listening to Rust in Production.