Arroyo with Micah Wylde
This episode of Rust in Production explores Arroyo, a real-time data processing engine built in Rust. Micah Wylde from Arroyo shares insights on benefits, challenges, and future potential. Visit Arroyo's website for more.
2024-01-25 56 min
Description & Show Notes
In this episode, we have Micah Wylde from Arroyo as our guest. Micah introduces us to Arroyo, a real-time data processing engine that simplifies stream processing for data engineers using Rust. He explains how Arroyo enables users to write SQL queries with Rust user-defined functions on top of streaming data, highlighting the advantages of real-time data processing and discussing the challenges posed by competitors like Apache Flink.

Moving on, we dive into the use of Rust in Arroyo and its benefits in terms of performance and memory safety. We explore how workflow engines and stream processors complement each other and examine Arroyo's approach to real-time SQL and its compatibility with Postgres. Micah delves into memory and lifetime concerns and elaborates on how Arroyo manages them in its storage layer. Shifting gears, we explore the use of the Tokio framework in the Arroyo system and how it has enhanced speed and efficiency. Micah shares insights into the challenges and advantages of utilizing Rust, drawing from his experience with Arroyo.

Looking ahead, we discuss the future of the Rust ecosystem, addressing the current state of the Rust core and standard library, as well as the challenges of interacting with other languages using FFI or dynamically loading code. We touch upon Rust's limitations regarding a stable ABI and explore potential solutions like WebAssembly. We also touch upon industry perceptions of Rust, investor perspectives, and the hiring process for Rust engineers. The conversation takes us through the crates used in the Arroyo system, our wishlist for Rust ecosystem improvements, and the cost-conscious nature of companies that makes Rust an attractive choice in the current macroeconomic environment.

As we wrap up, we discuss how hard it is for slower Java systems to compete with Rust and ponder the potential for new languages to disrupt the trend in the future. We touch upon efficiency challenges in application software and the potential for a new language to emerge in this space. We delve into the increasing interest in using Rust in data science and the promising prospects of combining Rust with higher-level languages. Finally, we discuss the importance of fostering a welcoming and drama-free Rust community.

I would like to thank Micah for joining us today and sharing his insights. To find more resources related to today's discussion, please refer to the show notes. Stay tuned for our next episode, and thank you for listening!
About Arroyo
Arroyo was founded in 2022 by Micah Wylde and is based in San Francisco, CA. It is backed by Y Combinator (https://www.ycombinator.com/) (YC W23). The company's mission is to accelerate the transition from batch-processing to a streaming-first world.
About Micah Wylde
Micah was previously tech lead for streaming compute at Splunk and Lyft, where he built real-time data infra powering Lyft's dynamic pricing, ETA, and safety features. He spends his time rock climbing, playing music, and bringing real-time data to companies that can't hire a streaming infra team.
Tools and Services Mentioned
- Apache Flink: https://flink.apache.org/
- Tokio Discord: https://discord.gg/tokio
- Clippy: https://github.com/rust-lang/rust-clippy
- Zero to Production in Rust by Luca Palmieri: https://www.zero2prod.com/
- Apache DataFusion: https://github.com/apache/arrow-datafusion
- Axum web framework: https://github.com/tokio-rs/axum
- `sqlx` crate: https://github.com/launchbadge/sqlx
- `log` crate: https://github.com/rust-lang/log
- `tokio tracing` crate: https://github.com/tokio-rs/tracing
- wasmtime - A standalone runtime for WebAssembly: https://github.com/bytecodealliance/wasmtime
References To Other Episodes
- Rust in Production Season 1 Episode 1: InfluxData: https://corrode.dev/podcast/s01e01-influxdata
Official Links
- Arroyo Homepage: https://www.arroyo.dev/
- Arroyo Streaming Engine: https://github.com/ArroyoSystems/arroyo
- Blog Post: Rust Is The Best Language For Data Infra: https://www.arroyo.dev/blog/rust-for-data-infra
- Micah Wylde on LinkedIn: https://www.linkedin.com/in/wylde/
- Micah Wylde on GitHub: https://github.com/mwylde
- Micah Wylde's Personal Homepage: https://www.micahw.com/
Transcript
This is Rust in Production, a podcast about companies who use Rust to shape
the future of infrastructure.
My name is Matthias Endler from corrode, and today we are talking to Micah Wylde
from Arroyo about how they simplified stream processing for data engineers with Rust.
Micah, welcome to the show. Can you tell us a few words about yourself and Arroyo,
the company you founded?
Thanks so much for having me. Yeah. So I am a Rust engineer and the creator
of the Arroyo streaming engine.
So Arroyo is a real-time data processing engine that allows you to write SQL
queries with Rust user-defined functions on top of streaming data.
For example, data you might have in Kafka or another streaming system.
And I come to that problem and company after spending five years leading streaming
teams at companies like Splunk and Lyft, which is a rideshare company in the US.
And more broadly, I've been in the big data space working on data systems for
pretty much my entire career starting out in ad tech, working on real-time ad bidding systems,
and then leading data teams and building data systems. So yeah, that's a brief background about me.
At Splunk, you were a principal engineer and the team lead of the streaming compute team, so that makes you somewhat of an expert in stream processing, I would say.
Maybe for the uninitiated, could you give us a just very quick,
very brief introduction of what stream processing is in your own words?
Yeah. So traditionally, when people have wanted to process data,
we do it in what's called batch mode, which means you take all the data in through
whatever data sources those are, whether it's coming from logs you're reading,
from API requests that are ending up somewhere, or wherever that data is coming,
it all kind of filters through your system and eventually lands in traditionally
a database or today maybe like a data lake or a data warehouse.
And then once all that data is there, you run a really big data processing job
on top of all of that rest, that data at rest.
Often this means you wait, you know, an hour or a day for all the data to land
before you can kind of analyze it or learn anything about it.
Stream processing, in contrast, does this data processing as the data actually
arrives in your system. So in real time.
And the advantages there, obviously, latency is much better.
You can process the data within milliseconds or seconds instead of waiting hours or days.
But it also can give you a much kind of easier way to build these like end-to-end
data systems where you need to consider like different properties around like
timeliness and completeness in order to kind of build your higher level analytics or data products.
And for kind of real-time companies like at Lyft, this becomes really crucial
to be able to basically know things about your world really quickly.
In rideshare, you need to understand kind of where your users are, where your drivers are.
You need to understand traffic speeds in order to do routing.
You need to be able to do dynamic pricing based on supply and demand.
And all of this stuff really demands that you be able to do complex analysis
on data really quickly instead of waiting, you know, a day for it all to land in your data warehouse.
So that's kind of a high-level view of the problem stream processing is solving, and how it fits in.
Yeah, so stream processing has existed before.
There were other companies that did a lot of groundwork. You mentioned at some
point Hadoop, you mentioned BigQuery in your seminal article that we will get to in a second.
But I think maybe you can just quickly explain what makes Arroyo special in
this case and also what the competitors are lacking right now that maybe is
a nice niche for Arroyo.
Yeah. So BigQuery and Hadoop are both kind of in that batch paradigm,
where you let all the data collect, and then you do a big data processing job over that data at rest.
In the streaming world, traditionally, the most popular system has been one called Apache Flink.
This is about a decade old, but it was really the first system that found a good programming model for streaming and, I would say, made it work at a level of correctness that allowed it to be applied to a lot of these problems. Before Flink, we really had very simple systems that couldn't guarantee anything about correctness or completeness and were sort of just orchestration systems around your own logic.
So I spent my career in streaming working on Flink.
And I think that's true of most of the other people who are kind of doing new things now.
And for all of us, we kind of have this perspective on Flink that it solved
this problem really well for people who are able to invest a ton of energy into
becoming experts in Flink.
So, at the companies I've worked for, that meant staffing up teams that were 10 to 30 people,
full of people working on Flink, building infrastructure and tooling around
it, and then especially supporting end users who were actually building these streaming pipelines.
And I think while Flink was really successful allowing sophisticated companies
to roll out this technology in a way that would have been dramatically harder a few years earlier,
it never really got to that point of ease of use where you could hand Flink off to a data scientist,
to a data engineer, or a product engineer, and allow them to be successful building
these real-time pipelines on their own.
We always needed a lot of hands-on support from the Flink experts of the company.
And that's really what we're trying to innovate around in Arroyo.
We're trying to build a system that is easy enough for any engineer or data
scientist at your company to kind of pick up and build these correct,
reliable, performant real-time data pipelines.
So how do you see the relationship between stream processing on one end and
these new workflow engines that pop up nowadays like Windmill,
which is coincidentally also written in Rust? Do you see an overlap?
Do you see the industry converge to something that maybe encompasses both?
Or would you say these are fundamentally different areas of expertise?
Yeah, I think they're very different systems and they are good at different kinds of problems.
So workflow engines are really excellent at these very long-running tasks.
We have a bunch of things we need to do based on fairly simple criteria over the course of a day.
For example, a user signs up, we need to send them this email.
Depending on what they do in response to that, we need to do this other sequence of events.
And that's the sort of thing that streaming engines like Flink or Arroyo are actually pretty bad at.
It's hard to specify that type of logic, that kind of conditional logic over
all of these different states.
And they also architecturally are kind of way overpowered to do that kind of stuff.
I think these systems actually work together quite well because streaming systems,
stream processors are really good at data oriented problems.
So often, this will mean you put your like really big feed of data,
your millions of events per second feed into your streaming engine.
And that produces features or
events that can then be consumed by the much lower scale workflow system.
So that that's actually a pretty common pattern for these to kind of work together.
But at least in the near future, I don't see them as being kind of in the same space at all.
Mm-hmm. On your website you have a very nice example where you take a Kafka stream and then you write some, I think it's SQL, or maybe some other syntax that's similar to SQL, to pipe events through your system and then see the results in real time. This was a pretty impressive demo.
So is Arroyo's language SQL-like, or is it more than that? Is it different? If so, in what sense?
So the main way you program Arroyo is through SQL.
We have a slightly customized dialect of our own, but we aim to be pretty Postgres-compatible.
To do real-time SQL, you do need to extend it in some way. There's different approaches for this.
But SQL, as originally defined, was really designed for these batch computations. To do like a group-by or an aggregate or a join, you really need all of the data to be available. Otherwise, you know, in a join, there might be more data coming in on one side or the other in the future, so you can't ever return that result.
So different streaming systems that use SQL have come up with different answers
for basically how can we decide that we're done, we're able to return a result for these expressions.
In Arroyo, that looks like this: we introduce these time-oriented window functions, like a tumbling window and a sliding window and a session window.
And these rely on a notion of what's called watermarking, which is this concept
of basically estimated completeness.
A watermark is a special value that flows through the data flow of the pipeline
and tells all of the operators that we have seen all of the data,
or we believe we've seen all the data from before a certain time.
And that tells us if we have like a window that closes at time t and we get
a watermark that is after t, it tells us that we can close that window,
that we've seen all the data that will be in that window and we're able to process
it and return the results to the user.
So this is a common pattern in certain types of stream processors, like Flink and Arroyo.
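To make the watermark mechanics concrete, here is a minimal, self-contained sketch of watermark-driven window closing. It is illustrative only, not Arroyo's actual implementation; the names and the simple count aggregate are made up:

```rust
use std::collections::BTreeMap;

/// Illustrative tumbling-window state. Event times and watermarks are
/// milliseconds since some epoch.
struct TumblingWindows {
    width_ms: u64,
    // window end time -> accumulated count for that window
    windows: BTreeMap<u64, u64>,
}

impl TumblingWindows {
    fn new(width_ms: u64) -> Self {
        Self { width_ms, windows: BTreeMap::new() }
    }

    /// Assign the event to its window and update the aggregate.
    fn on_event(&mut self, event_time: u64) {
        let window_end = (event_time / self.width_ms + 1) * self.width_ms;
        *self.windows.entry(window_end).or_insert(0) += 1;
    }

    /// A watermark `w` promises we've seen all data with time < w, so any
    /// window ending at or before `w` can be closed and its result emitted.
    fn on_watermark(&mut self, watermark: u64) -> Vec<(u64, u64)> {
        let still_open = self.windows.split_off(&(watermark + 1));
        let closed: Vec<_> = self.windows.iter().map(|(&end, &n)| (end, n)).collect();
        self.windows = still_open;
        closed
    }
}

fn main() {
    let mut w = TumblingWindows::new(1_000);
    w.on_event(150);
    w.on_event(980);
    w.on_event(1_200);
    // Watermark at t=1000: the [0, 1000) window is complete and can be emitted.
    for (end, count) in w.on_watermark(1_000) {
        println!("window ending at {end}: {count} events");
    }
}
```

The key invariant is the one Micah describes: a window only closes, and only emits its result, once the watermark passes its end time.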
There's other approaches to this, which Arroyo and other systems like Materialize
also support, which is based around a more incremental style of computation,
where we actually decide we're never going to be done.
We never know that we have all the data for a particular time period.
So every time an event comes in, we're going to update the state of that window
and emit the new result there.
So depending on the kind of problem, you may kind of want one style of SQL or the other style of SQL.
But yeah, in any case, it's all SQL.
You wrote an article called "Rust Is The Best Language For Data Infra", which is kind of a catchy title. I read the article, and one thing I wondered about was:
Was Rust your first choice when you started? Have you looked into,
for example, the solutions that came before you?
And also, was it around the time where Zig also became popular?
And where do you see yourself in this space?
Would you say, okay, Rust was just there at the right point in time?
Or would you also say, well, there would be alternative realities,
so to say, where Arroyo was written in C++ or maybe Zig in a different world.
Yeah, so I mean, kind of setting this historically: the very earliest systems of this kind in this space, like the original Google systems that established a lot of how we think about big data today, like MapReduce and BigTable and GFS, those were all written in C++.
And then we had a long history of writing systems in Java, like Hadoop and HBase.
And Flink itself was originally written in Scala, and then rewritten in Java.
And then we had a whole period of doing Go, like CockroachDB,
and a handful of other big data systems.
And yeah, I think now, definitely, we would not have chosen Java or Go for Arroyo.
I think in many ways, the current era of systems is a reaction to the previous
era of writing these systems in Java.
A lot of people are finding that you can get much better performance,
much easier operations, literally just by rewriting these systems in a non-managed
language like C++ or Rust.
So we're kind of following in the footsteps of projects like Red Panda,
which did this with Kafka, and ScyllaDB that did this with Cassandra.
So I think we could have done some of the things we're trying to do in Java
or Go, but it would have been much harder to accomplish our goals.
So in a world without Rust, I think we probably would have ended up choosing C++.
But I'm very grateful that we are in a world with Rust.
It has definitely made our lives
much easier than it would have been if we had to choose C++ for this.
Especially, I assume, to optimize the platform you would have to avoid a lot of copies, and in C++ passing references around can be a bit of a nightmare sometimes if you don't know exactly what you're doing, and even if you do, there can be issues. I just wondered: do you have a lot of lifetimes in your code as well, or is that something that the Rust compiler elides completely, so you don't even have to think about lifetimes at all?
So the most memory-oriented or lifetime-oriented part of our system is the storage layer.
So maybe to give a little bit of architectural insight here,
the way these systems look, they are these directed acyclic graphs of data flow.
You take a SQL statement, compile it into a SQL plan, and then eventually optimize that into this data flow graph.
Each node of this graph is some kind of potentially stateful operator.
So for example, doing a filter or a map or a stateful function like a window or a join.
And between these operators, the events and process data flow over queues or over network sockets.
So within these stateful operators, we potentially have to store data for long periods of time.
So if you imagine you have like a 30-day sliding window, we need to store some
representation of that data for 30 days.
And we do that in a mix of offline S3 storage and local disk cache and then in-memory cache.
And managing that in-memory cache brings into issue these lifetime concerns,
managing the data as it flows from that cache into the processor in order to be used.
Fortunately, in these systems, the architecture constrains that problem somewhat.
So at the semantic layer, you're kind of processing one event at a time in each of these operators.
So you don't really have to deal with concurrency issues at the direct processing layer.
And that ends up simplifying the kind of lifetime management that you might
have in a more traditional database where you're kind of dealing with a bunch
of different requests to the same data.
So in Rust lingo, that would be: your types are not Sync, or they don't have to be?
That's correct. Yeah, we're always accessing a particular...
You can think logically, each of these operators is single-threaded.
This is all implemented in tokio, so what's happening under the hood is much
more complicated than that.
But as a programmer, you can really think of it as synchronous processing on a single thread.
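Here is a rough sketch of that model: each operator is a tokio task that exclusively owns its state and processes one event at a time, connected to its neighbors by channels. The names and the `Event` type are invented for illustration; this is not Arroyo's API:

```rust
use tokio::sync::mpsc;

#[derive(Debug)]
enum Event {
    Record(i64),
    Watermark(u64),
}

// One operator = one task. The state (`dropped`) is owned by this task
// alone, so it never needs to be Sync; events arrive strictly one at a time.
async fn filter_operator(mut rx: mpsc::Receiver<Event>, tx: mpsc::Sender<Event>) {
    let mut dropped: u64 = 0;
    while let Some(event) = rx.recv().await {
        match event {
            Event::Record(v) if v < 0 => dropped += 1, // filter out negatives
            other => {
                let _ = tx.send(other).await; // forward everything else downstream
            }
        }
    }
    println!("filter dropped {dropped} records");
}

#[tokio::main]
async fn main() {
    let (tx_in, rx_in) = mpsc::channel(128);
    let (tx_out, mut rx_out) = mpsc::channel(128);
    let op = tokio::spawn(filter_operator(rx_in, tx_out));

    for v in [3, -1, 7] {
        tx_in.send(Event::Record(v)).await.unwrap();
    }
    tx_in.send(Event::Watermark(1_000)).await.unwrap();
    drop(tx_in); // close the input channel so the operator finishes

    while let Some(e) = rx_out.recv().await {
        println!("downstream got {e:?}");
    }
    op.await.unwrap();
}
```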
Speaking of tokio, it feels like this is an ideal use case for it, because you're leaning into things that are inherently concurrent. They don't really have to be sequential; at least parts of them can be executed concurrently, sometimes maybe even in parallel. But I wonder what you think about tokio: your experiences with the framework, the ergonomics of it, and also the recent discussion about async Rust, Send and Sync bounds, work-stealing schedulers, all of that stuff.
Yeah, so at a high level, a system like Arroyo doesn't really need a complex scheduler like tokio.
As I mentioned, each of these operators essentially acts as a single thread.
It receives one event, it does all the processing for that event,
and then it sends it on to to its next destination.
And all this has to happen in order to uphold the correctness guarantees of the system.
And because of that, the first version of Arroyo actually was built around threads and thread processing.
At some point, it migrated to tokio and AsyncRust, actually pretty early on.
And the core reason for that was that so much of the ecosystem is in async Rust at this point that if you want to use common network libraries or database drivers or almost anything from the network programming ecosystem, you do have to deal with async at some point.
And at some point, it's easier just to move your whole system over to async.
And that was definitely a challenging migration. Actually, for me,
I had never worked with async Rust before.
So it involves a lot of learning, a lot of time on the tokio Discord channel,
which is extremely helpful.
But in the end, actually, the surprise was that it ended up being a lot faster.
Just purely doing that migration made the system like 30% faster,
which was not my expectation at all.
But it turns out that the tokio scheduler is really, really effective at this
class of problems, where even though it looks at a high level,
like all this processing is single-threaded,
there's a lot more going on under the hood, a lot more work that has to be coordinated.
You actually have threads, in our case, talking to S3 or talking to other systems. We have a lot of queues involved. So even though we have only a smallish number of actual processing threads, there's a lot of network exchange happening on other threads, talking to the coordination system over gRPC. And tokio is really good at organizing all of this work efficiently and really maxing out the use of your cores.
I think that the most surprising thing for us is that we're able to run the
system at extremely high utilization,
like above 95% CPU utilization,
and everything remains responsive and reactive and is able to,
to really work effectively at that extremely high level of CPU thrash,
which has never been my experience with the systems written in other paradigms.
And then in terms of, I guess, kind of how I think about the async Rust,
I guess, drama, if we want to say that word, I think the Rust community has
a higher level of drama in general, and I don't fully understand why that is.
But I think maybe the technology just works so well that we sort of have to invent other stuff to be upset about.
But I will say, async Rust definitely has a learning curve. Coming from being, I would say, a pretty strong Rust programmer already, it took me maybe a month to really be an effective async Rust programmer. And it's definitely been the edge of the system that other people who contributed to it have the most trouble with. The requirement that values held across await points be Send can definitely be frustrating if you aren't experienced in the strategies for dealing with that.
And the sometimes bad error messages in the compiler don't help with that either.
It can make it really hard to figure out where exactly that problem is introduced
in a large amount of code.
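For readers who haven't hit this yet, here's a tiny illustration of the Send-across-await issue Micah mentions. `tokio::spawn` requires a `Send` future, so a non-`Send` value like `Rc` that is alive across an `.await` makes the whole task fail to compile; one common fix is to drop it before the await point:

```rust
use std::rc::Rc;

async fn do_io() {} // stand-in for any real await point

fn main() {
    let rt = tokio::runtime::Runtime::new().unwrap();

    // This version does NOT compile: `data` (an Rc) is alive across the
    // .await, so the whole future is !Send and tokio::spawn rejects it.
    //
    // rt.block_on(async {
    //     tokio::spawn(async {
    //         let data = Rc::new(42);
    //         do_io().await;
    //         println!("{data}");
    //     });
    // });

    // Fix: make sure the non-Send value is dropped before the await point
    // (or use a Send type like Arc instead).
    rt.block_on(async {
        let handle = tokio::spawn(async {
            let value = {
                let data = Rc::new(42);
                *data
            }; // Rc dropped here, before the await
            do_io().await;
            println!("{value}");
        });
        handle.await.unwrap();
    });
}
```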
But I'd say overall, tokio has been a huge boon to us.
And it's really remarkable what, you know, what allows us to do in terms of
just not having to think very much about how we schedule work.
It just does a really remarkably good job on its own.
Well, Rust protects you from memory safety problems. It does not protect
you from race conditions.
So I wonder if you, as someone that uses Rust and tokio at scale,
have run into any sort of data races or things that you encountered at runtime,
which maybe were a bit of an issue for your platform, or did you never ever
have any outages in production?
So, not specifically from race conditions, because, again, the architecture of our system makes the high-level concurrency pretty straightforward. Although a lot more complexity creeps in in the details, especially when you try to get to that next level of performance. So for example, the storage system is extremely complex and has a lot of concurrency.
But Rust does really help a lot with managing that complexity.
In terms of issues in production, the issues we've seen are much more around the high level: the ways that all the different pieces of this distributed system interact with each other, and wrong assumptions in different pieces about what other things are doing.
Unfortunately, Rust definitely does not fix distributed systems issues. But in terms of the micro level, it's remarkable how well things work once you get them to compile. An example I brought up in that blog post, which still kind of blows my mind, is that I wrote the entire network stack, the piece of software that allows this system to be distributed.
I wrote that in like a two-day push, basically like two 12-hour days.
And basically just coded that straight. And then at the end,
spent maybe an hour trying to get it to compile.
And from there, it just worked perfectly the very first time.
I took a single-node system and made it a distributed system without any testing, any iteration on that.
And it basically hasn't changed since that initial implementation.
I've definitely never experienced that writing network software in C++ or even Java for that matter.
That's pretty impressive, yes. Pretty awesome that you could pull that off.
And it's a testament to the Rust type system, and also the borrow checker and all of the things that make Rust development and the developer ergonomics pretty awesome.
I wondered, though, even if you didn't have that many runtime problems, whether you had any compile-time problems, in the sense that maybe parts of the ecosystem were not aligned: compatibility issues with, say, different versions of tokio, or maybe different libraries that were sometimes more mature, sometimes less.
Yeah, it's never been a huge issue. And just the Rust crate ecosystem in general has been such a boon to us from a productivity perspective, compared to the C++ world, where using dependencies is so challenging and you don't have this incredible, rich ecosystem that we already have in Rust after such a relatively small amount of time.
So there are occasional compatibility issues; we've had to fork a few open-source projects we rely on. But I would say dealing with any of that stuff is a very small part of my day.
All of the things that you mentioned are, at least to an average developer, pretty low-level; or at least you need a lot of expertise on how to structure or architect such systems in order to perform well.
And I wondered, what do you think, how much does Rust guide you towards an idiomatic
solution? And what is your own expertise?
Yeah, I think Rust definitely guides you towards a correct solution. I don't know that it always helps you that much with being idiomatic, although the tooling around it is very helpful. Cargo Clippy is really helpful.
So my co-founder had never used Rust before working on this project. He's a really experienced distributed systems engineer and has worked on a bunch of query systems, but was new to Rust.
And tools like Clippy really helped him pick up the idiomatic style of Rust pretty quickly.
Beyond that, I think the Rust community is also really helpful.
I mentioned already the tokio Discord, which was super useful when I was trying
to get up to speed with async Rust.
But in general, the Rust community is extremely useful in helping you solve
problems or figure out why some weird compile issue is happening.
Did you use any resources outside of the official Rust book and maybe the community
to help you get started with Rust?
Or did you start on a project and learn on the job?
So, I've actually been using Rust since like 2014, but I'd never convinced a company we should do a major project in Rust until now.
It was always a big uphill battle trying to introduce Rust into a large organization.
But I've been using it for all of my kind of like personal projects for a really long time.
I've been a fan of the language since basically I first learned about it.
But so in terms of my own development, there's been a lot of resources like over that time.
The first version of the Rust book, which I have on my bookshelf back there, that was very helpful.
But also it just changed so much in the early days that it was a full-time job
just kind of keeping track of the updates to the language.
Today, it's much easier. It's been pretty stable for a number of years.
And I think the quality and quantity of resources has also increased a lot.
But I know there's a really good book, actually, on running Rust in production that I've looked at a fair bit for the more practical details: how do you actually run Rust, what does logging look like in Rust, how do we do metrics, these kinds of things that aren't necessarily part of an intro book.
What book is that?
It's called Zero to Production in Rust, by Luca Palmieri.
Awesome. And you mentioned that it's a bit tricky sometimes to convince bigger companies and organizations to move towards Rust and introduce Rust at these companies. Why is that, in your experience?
Yeah, I think large companies tend not to be that ambitious in their technical choices.
A lot of it is built around minimizing risk rather than maximizing reward.
And Rust definitely seems risky to a CTO today.
They worry, will it be too hard for engineers to learn how to do Rust?
Will we be able to, if we restructure teams, will we be able to pass off this
project to another team? Will they have to figure out how to use it?
Will we be able to hire enough Rust engineers? And if you're Google and you need to hire 10,000 engineers, I think you should be rightly concerned about hiring 10,000 Rust engineers. I doubt there are that many Rust engineers in the world.
But for a smaller company, that's not an issue at all, right?
Hiring three Rust engineers is pretty easy.
And I think especially for a small company, it's an advantage in a way that
it maybe isn't for a big company to be using Rust.
Because as a small company, you can attract people because they want to work in Rust.
And that's a big incentive to work for you.
And I think that's the upside of working in maybe a slightly obscure language: you get those people who are really excited about it. And that can be a big boon to you.
But for big companies, they kind of just see the risk side of that equation.
How do you hire Rust engineers? Do you reach out in your network or do you post
job announcements somewhere?
Yeah, well, I guess actually for us initially, we've been hiring more on the
streaming expertise side.
There's actually maybe more overlap there now than there was maybe two years ago.
A lot of the newer streaming systems are also in Rust. But historically,
as I mentioned, streaming systems have been largely in Java.
So that's where most people have expertise.
But I definitely anticipate, as we try to hire more broadly, that hiring from that pool of Rust engineers will be pretty productive, especially as a non-cryptocurrency Rust company.
There's, I think, a lot of demand for those jobs right now. So we'll be able to tap into that.
Very true. What sort of other crates do you use to get your job done?
I guess in the blog post, you mentioned Data Fusion. Maybe that's one that you
can talk about, but feel free also to talk about any other crate that you like.
Yeah, so Data Fusion is probably the most critical one to us.
Data Fusion is a number of things. This comes from the arrow-rs ecosystem.
We use it primarily as a SQL parser, so it takes SQL text and turns it into an AST, and then as a planner, taking that AST and turning it into a graph-oriented plan that describes what that SQL is supposed to do.
SQL is an extremely complex language with like 30 years of history and a bunch
of different equivalent ways to express stuff.
So having a library that deals with a lot of that complexity for you is extremely helpful when you're building a SQL engine. We get a nice clean plan out of that, which we're able to then optimize in our own way and compile into our own set of operators.
So DataFusion has been extremely critical to us being able to build this thing as quickly as we have.
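As a rough illustration of that flow, here is a sketch using DataFusion's high-level API (exact names shift between releases, and Arroyo drives the parser and planner layers more directly than this):

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    // Hypothetical table and file, just for illustration.
    ctx.register_csv("events", "events.csv", CsvReadOptions::new()).await?;

    // SQL text -> AST -> logical plan, all handled by DataFusion.
    let df = ctx
        .sql("SELECT user_id, COUNT(*) AS n FROM events GROUP BY user_id")
        .await?;

    // Inspect the plan (what a system like Arroyo would then compile into
    // its own streaming operators)...
    println!("{}", df.logical_plan().display_indent());

    // ...or just execute it in batch mode and print the results.
    df.show().await?;
    Ok(())
}
```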
Beyond that, I guess I'll also call out something maybe a little bit lower-level, or higher-level: I really appreciate the Rust web ecosystem.
So we rely on Axum and SQLX, which is a really great SQL library.
This is not like the core of our product at all. This is like to power our API and our web interface.
But it's remarkable that even in a domain that maybe Rust isn't natively as well suited to, we still have these incredibly high-quality libraries that make it actually really easy to build good products. So that's been an impressive discovery for us.
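A minimal sketch of that API-layer pattern, assuming axum 0.7-style serving and a Postgres pool; the route, table, and connection string are made up for illustration:

```rust
use axum::{extract::State, routing::get, Router};
use sqlx::postgres::PgPoolOptions;

// Handler: pull the shared connection pool out of application state and
// run a query. (sqlx also offers compile-time-checked `query!` macros.)
async fn count_pipelines(State(pool): State<sqlx::PgPool>) -> String {
    let n: i64 = sqlx::query_scalar("SELECT count(*) FROM pipelines")
        .fetch_one(&pool)
        .await
        .unwrap_or(0);
    format!("{n} pipelines\n")
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let pool = PgPoolOptions::new()
        .connect("postgres://localhost/arroyo_example")
        .await?;
    let app = Router::new()
        .route("/pipelines/count", get(count_pipelines))
        .with_state(pool);
    let listener = tokio::net::TcpListener::bind("127.0.0.1:3000").await?;
    axum::serve(listener, app).await?;
    Ok(())
}
```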
The crates that you mentioned: I cannot speak about DataFusion, but the other ones are definitely top of their class in any language, literally, I would say, at least from my experience. I used Axum and SQLx before, and I think they are really awesome.
But I wonder about the future of this ecosystem.
Do you see that we kind of reached a point where crates are starting to more
or less stabilize and there's one go-to crate that you pick for your job?
Or would you say the ecosystem is still so young that I can see myself switching,
let's say, to a different web framework in a year or maybe a different parser
or whatever, if it comes up?
Yeah, I mean, I think it's probably too early to say that things have stabilized.
A year ago, your choices in a lot of these areas would have been different.
Definitely three years ago, none of these crates existed.
Axum itself is still changing quite a lot from release to release.
So I think even these crates are not fully stabilized.
But I think we will be hitting more of a period of stability,
especially with async rust becoming more feature complete.
A lot of these libraries have had to work around limitations in the async ecosystem
and implementation, like missing the ability to use async functions and traits.
Which has just landed or is about to land.
In 1.74, yeah.
Yeah. So I think that will allow things to stabilize their APIs in a way that
has been challenging so far.
And I do expect more kind of stability going forward and more obvious choices
around which crate we use to solve different problems.
And something that's been impressive to me about the Rust ecosystem is that
there maybe were opportunities to stabilize earlier.
Just to give you a random example, for logging, we had the log crate that was
like the obvious crate for a long time to use for logging.
And we could have just decided that was good enough. But actually,
it turns out there was a better option and a better design.
And we ended up with the tracing crate instead.
And the ecosystem was able to move to kind of like this better option,
rather than getting bogged down in kind of like a local optimum.
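A small sketch of what the move bought the ecosystem: where `log` emits flat text lines, `tracing` records structured fields and spans that follow async tasks across await points (assumes the `tracing` and `tracing-subscriber` crates):

```rust
use tracing::{info, info_span, Instrument};

async fn checkpoint(epoch: u64) {
    // Structured fields instead of formatting everything into the message.
    info!(epoch, "checkpoint complete");
}

#[tokio::main]
async fn main() {
    // Install a subscriber that prints events to stdout.
    tracing_subscriber::fmt::init();

    // The span's fields are attached to every event inside it, even across
    // await points -- something the log crate has no notion of.
    checkpoint(42)
        .instrument(info_span!("pipeline", job_id = "demo-1"))
        .await;
}
```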
And you've seen this in a lot of different areas where like there was like an
early consensus around a crate as the solution to this class of problems.
But the ecosystem was able to move on to something that solved it better.
And I think that's not a property you have in all ecosystems. It's something I really appreciate about the Rust community: we're able to move fairly quickly, and in a pretty consensus-driven way, to better options in the ecosystem.
So I think we'll continue to see that happening. I don't know if Axum,
for example, is the end state of Rust web programming.
I think we'll continue to see iteration happening.
What about Rust itself, the standard library? What about stabilization of the Rust core?
Would you say this is already in a very satisfactory state?
Or would you say that for your use case, there would be things that you would wish for?
I think everyone has their own wish list of RFCs that they hope will finally get merged.
I think for me personally, the lack of completeness around async has been the biggest frustration.
Missing async functions in traits, for example, has required a lot of somewhat
ugly workarounds for us.
And even the version of this that's going to be stabilized isn't quite complete
enough for all of our use cases.
But I appreciate that Rust takes time to get these solutions right.
And I think we've seen that process play out with async.
Overall, I think the Rust programming language is in a really good place.
And I think it has stabilized over the past couple of years compared to the previous five years.
And we'll continue to see that stabilization with hopefully a few nice improvements,
like the work we're getting out of GATs or the improvements to async we're
seeing right now.
I fully agree. Where I see some issues is on the edges of the Rust standard library.
So where you talk to other languages with FFI or where you load code dynamically.
And I guess for a streaming platform, that is also an interesting use case,
maybe where you can hook stuff into your engine at runtime.
And of course, there are technologies like WebAssembly and that sort of stuff
getting pushed forward.
I wonder if you already experimented with that and what's your impression on
the current state of the ecosystem around that?
Yeah, actually, maybe I should have called that out. I called that out in my
blog post as a frustration.
Rust does not have a stable ABI, Application Binary Interface,
which is challenging if you're trying to build anything that looks like a plugin system.
In our case, for example, we support user-defined functions, so users can write Rust code that then gets loaded into the engine at runtime.
If you're writing this in C or C++, there's a stable C API that you're able
to use to basically dynamically link software at runtime.
Rust doesn't have this. So if you want to compile a library and a host application,
and link them, you have to do that with the exact same version of the Rust compiler.
And in many cases, like the same settings for those compilers.
So it makes it really hard to distribute basically binary software separately
from the thing that is consuming that library.
So today, the solution you basically have to use is to use the C API,
which means giving up a lot of like the features and power of Rust,
at least at your interfaces.
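Here's a condensed sketch of that workaround. Conceptually these are two crates, a plugin compiled as a `cdylib` and a host that loads it with the `libloading` crate; the function name and library path are hypothetical:

```rust
// --- plugin side (compiled as a cdylib) ---
// Because Rust has no stable ABI, the export falls back to the C ABI:
#[no_mangle]
pub extern "C" fn udf_double(x: i64) -> i64 {
    x * 2
}

// --- host side ---
// Load the shared library at runtime; only C-level types cross the boundary.
fn call_plugin() -> Result<i64, Box<dyn std::error::Error>> {
    unsafe {
        let lib = libloading::Library::new("./libudf.so")?;
        let f: libloading::Symbol<unsafe extern "C" fn(i64) -> i64> =
            lib.get(b"udf_double")?;
        Ok(f(21))
    }
}

fn main() {
    // Rich Rust types (String, Vec, trait objects) can't cross this boundary.
    println!("{:?}", call_plugin());
}
```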
You also mentioned Wasm, which is another class of solutions to this problem.
In some ways, this is even worse from an interface perspective, because there's no real standard way for hosts and plugins to interact in the Wasm ecosystem.
So every application sort of has to figure this out for themselves.
We have explored Wasm as a solution to kind of this class of problems.
We actually have an integration with Wasmtime, which is a great Rust Wasm runtime.
And I think for systems like ours, that probably will be the direction that we take going forward.
It's particularly great for integrating with other language ecosystems. And there's a lot of energy in the Wasm world to figure out these integration problems: how does a Rust program talk to a Python program over shared Wasm memory, how can we build these kinds of unified interfaces that mean individual projects like ours don't have to keep solving this class of problems over and over again. But it would be really nice if Rust were better at this kind of interacting dynamically with other compiled code.
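For a flavor of the Wasmtime route, here's a minimal host that loads and calls a guest function. The API follows recent wasmtime releases (details vary by version), and the inline module stands in for a compiled UDF:

```rust
use wasmtime::{Engine, Instance, Module, Store};

fn main() -> wasmtime::Result<()> {
    let engine = Engine::default();
    // A tiny module defined inline via WAT; a real UDF would be Rust
    // compiled to Wasm and loaded from disk.
    let module = Module::new(
        &engine,
        r#"(module
             (func (export "double") (param i64) (result i64)
               local.get 0
               i64.const 2
               i64.mul))"#,
    )?;
    let mut store = Store::new(&engine, ());
    let instance = Instance::new(&mut store, &module, &[])?;
    let double = instance.get_typed_func::<i64, i64>(&mut store, "double")?;

    // The host calls guest code without sharing a Rust ABI with it.
    println!("{}", double.call(&mut store, 21)?); // prints 42
    Ok(())
}
```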
Are you aware of any RFCs that propose the stabilization of the Rust ABI?
Yeah, there are a couple RFCs in this area with different approaches,
but I haven't seen a lot of progress in the past couple of years or real interest
in solving this problem.
I think it is somewhat niche. Most Rust projects are distributed as source code.
Most libraries are compiled at build time.
But for projects like ours, or anything that's dealing with plugin ecosystems,
yeah, that's not really good enough.
I guess we covered a lot of the technicalities of the project.
I also wanted to touch on a few things that were a bit more business-related.
I guess the first question I had
along these lines would be Arroyo is backed by Y Combinator as far as I'm aware.
Did investors ever care about your choice of programming language, or was it never up for debate?
Or maybe it was even a good thing and maybe they encouraged you to use Rust.
Yeah, I would say Rust has only helped us when talking to investors.
There are a number of systems that have come before us that have proved Rust can work commercially.
Investors know it's like the hot language in the data space.
And so you definitely seem more attractive in that sense if you're using Rust.
But honestly, most investors do not care about your language choice.
That's just not the level that they operate on.
And if you come in as an expert in this area and you say, like,
"I think this is the right technical choice," investors are not going to second-guess you on that. They're much more interested in the commercial side of the question: how are you going to sell this, who are your users, why are they going to choose you over a more established company in this space. They're definitely not grilling you about why you're using tokio or async-io or whatever.
Yeah. What would you recommend to people that are in the same space and are considering using Rust? Maybe they dabbled in it, but they are not sure if they should fully commit to it for their next project.
Yeah, well, I think in this space Rust is just the obvious choice today. You know, we went through this whole era of building these systems in Java or Go or whatever.
But today, especially in the current macroeconomic environment,
companies are much more cost-conscious.
And when you can write something in Rust that takes half the resources or a
quarter of the resources of the Java version of it, that's a huge, huge selling point.
And it's really hard to compete with these much slower Java systems.
And I mean, the Java systems are responding by rewriting core pieces in C++; we saw Spark rewrite their core engine in C++, and Confluent, the Kafka people, have been rewriting stuff as well. So I think it's just really hard to compete if you're not in either C++ or Rust. And even though there's maybe a larger pool of C++ developers today, I think it's much easier to teach someone to become a good Rust programmer than to teach them to be a good C++ programmer.
And the Rust compiler helps you so much with people who aren't really experienced
dealing with memory management.
It makes it much harder to make these classes of mistakes.
So I think it is very much the obvious choice.
Maybe there's some newer languages that you could explore. You mentioned Zig earlier.
But all these kind of new, I guess, Rust replacements are so much less mature today that you really would have to be very ambitious to experiment with them.
So yeah, so I think either C++ or Rust and really, unless you have a strong
reason to use C++, I think Rust is just the default choice today.
Taking the example of Confluent, they took parts of their code base and rewrote
it in C++, if I understood correctly.
And I wonder why they chose C++ instead of Rust, because maybe that's already
a very mature alternative.
Why didn't they pick Rust then? Or was it before Rust even became that mature?
Yeah, maybe I'll speak of Spark. I maybe have a little more background on...
But yeah, so Spark historically was written in Scala and then mostly written in Java.
And then Databricks rewrote their core engine in C++.
And that's something that they've kept closed-source. I think it's just that the project started like six years ago, when Rust was much less mature than it is today.
Do you know of any other companies that are currently planning to rewrite parts
of their codebase in Rust in that space?
Well, yeah, a great example is InfluxDB, which was originally written in Java,
then they rewrote in Go, and have just completed a major rewrite of their core storage engine in Rust.
And actually, we've benefited a lot from that because they're big supporters
of Data Fusion and the Arrow project.
TiKV, I'm not exactly sure how you say that, another example where they started
in Go and rewrote their core engine in Rust.
So that's, I think, been a pretty common trend in recent years.
If you're curious about InfluxDB's usage of Rust, then you should check out
episode number one, where we had Paul Dix on the show.
And yeah, he talks a lot about the reasoning for InfluxDB moving to Rust.
And I think this wasn't planned, but it's a nice segue into promoting this other episode, if people are interested.
Very interesting. I think looking forward, maybe in the next three, four, or five years, and looking at the projects that might get started along the way and the things that exist and are evolving over time: what is your perspective, what is your vision for the future? Where do you see the industry moving?
Yeah, I feel like I'm a broken record, but I think for people starting new data systems or new large-scale systems, most are going to choose Rust going forward. There are still people starting new C++ systems, but just looking at my own space, three quarters of the new systems are in Rust and one quarter are in C++. And I think that trend is only going to increase as Rust becomes less risky from a technical perspective and from a hiring perspective.
I mean, you know, maybe we'll see disruption from these other newer languages
that are able to become more mature and start attracting projects.
At some point, I'm sure Rust will become boring and people will want to use
more exciting languages.
But that would be, I think, an extremely successful outcome for the Rust project.
For now, we have not regretted our technical choice at all.
We're about a little over a year into this, and Rust has proved an extremely
successful technology choice.
I think maybe an interesting question is how much Rust adoption happens in the
more application space.
So there's kind of a divide here between infrastructural software and application software.
So infrastructure software, like the stuff we're working on,
or a database, for example, is something that's written by a small team and
then run by a much larger group of people.
So definitely, it makes sense to put a lot of effort into making it really efficient and fast, because it's going to run on so many CPU cores over its lifetime.
For application software, where the development costs are much closer to, or much greater than, the runtime costs, you don't necessarily have that same financial pressure to make it really efficient.
And today, I think Rust is a much harder sell in that space, because of the additional complexity of writing stuff in Rust and the additional difficulty of hiring Rust engineers or training people in Rust. So I wonder how much Rust will grow in that space through just more maturity in the language and ecosystem, and maybe a growing set of people who use Rust or want to use Rust. But to me that's a big open area for a language to expand into, or for a new language to come and move into, because I think we can do better than Java and Go for application-level programming. So much of the ergonomics of Rust are great for that, but there are the sharper corners of Rust around memory management and lifetime issues, where it just feels like, if you don't care that much about performance, you shouldn't need to deal with this for that class of problems.
If you look at a related field like data science, it feels like they are also
starting to experiment with some ideas from the Rust world,
if not even rewrite parts of their libraries in Rust to use it in less performant,
higher-level languages like Python.
You have Parquet files, and then you have parsers around that, and you have pandas, and all of this is inherently an interesting space for Rust, because it's analysis where performance is also relevant, right? Do you agree, in general?
Yeah, so I think that's definitely a trend that will continue. And this is really taking a Rust core and wrapping it in a higher-level language like Python. And that's been really, really successful. We see that in the JavaScript ecosystem as well, where a lot of these JavaScript tools have rewritten their cores in Rust and gotten 10x or more performance out of that.
I'm not personally a Python person, but people obviously really love it.
And it's very hard to convince data scientists to use anything besides Python.
So if you do want to give them better performance, I think this approach of
writing the core in Rust and wrapping it in your higher-level language is something
that has been really successful.
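A tiny sketch of that "Rust core, Python wrapper" pattern using PyO3 (0.21+ style; the module and function names are invented). Built with maturin, this becomes a module Python can import like any other:

```rust
use pyo3::prelude::*;

// The hot loop lives in Rust; Python just sees a normal function.
#[pyfunction]
fn rolling_sum(values: Vec<f64>, window: usize) -> PyResult<Vec<f64>> {
    let out = values
        .windows(window.max(1)) // guard against a zero-width window
        .map(|w| w.iter().sum())
        .collect();
    Ok(out)
}

// Module definition: exposes `rolling_sum` to Python as `fastcore.rolling_sum`.
#[pymodule]
fn fastcore(m: &Bound<'_, PyModule>) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(rolling_sum, m)?)
}
```

From Python, this is just `import fastcore; fastcore.rolling_sum([1.0, 2.0, 3.0, 4.0], 2)`, with the inner loop running at native speed.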
I guess the other fascinating approach, I don't know if you're familiar with Mojo: this is Chris Lattner's new language, from the creator of Swift.
They're creating a Python-like language that actually compiles into LLVM and MLIR and aims to provide C++ performance with somewhat Python compatibility. That, to me, is extremely ambitious given the semantics of Python and how hard that is to optimize. But if you don't want to take this Rust approach, that's the only other way you can really get acceptable performance with these Python APIs. For us, since we're starting with SQL, it's very easy to optimize SQL into whatever implementation you want, and that gives us a lot of advantages for providing really high performance. Because SQL is declarative, you're able to rewrite the expressions in ways that make them much faster to actually execute. But when you have something like Python, you're much more limited in how much you can really optimize, even with a Rust core. So I think it'll be interesting to see, as our data volumes increase and the complexity of the processing we're doing increases, how financial pressures will push data science into high-performance paradigms. But yeah, for now I think the Polars approach of the Rust core is something we'll see in a lot of these data science ecosystems.
I agree. I think we're getting towards the end, and it has been somewhat of a tradition around here to ask people this final question: if there was one thing that you could say to the Rust community as a whole, a statement, a message that you have for the community, what would it be?
I think my message to the Rust community would be: chill out a little bit.
Rust is an incredible language, an incredible ecosystem and community.
And yet we seem to have 10 times the drama of any other language community I've been part of.
And I don't really understand why or where it all comes from.
But I think that that level of drama can only hurt adoption when people look
at the Rust Reddit and are like, this is a shit show. Why would I want to be part of this?
So yeah, I think hopefully we can look back at the past year and just say,
we all just need to calm down a little bit and figure out how to work with everyone
else and stop driving people out of the community.
That's a great final statement. Really love it. Micah, it has been a pleasure to have you on the show.
Where can people learn more about you, about Arroyo? How can they get started
with the platform?
Yeah. So I think we have some pretty good docs.
If you head to our website, arroyo.dev, we link to those there.
We have a Docker image, super easy to run it and play around,
get a nice web UI where you can write SQL, you can talk to like WebSocket APIs
and HTTP APIs, and it's easy to play around with publicly available streaming data.
And then we have a really friendly Discord community, so if you head to our website, we have a link to that and you can join.
You heard it, there's nothing more to say. Again, thanks a lot, Micah, for coming on the show. And yeah, thank you.
Thanks so much for having me! This was great.
Rust in Production is a podcast by corrode, hosted by me, Matthias Endler. For show notes, transcripts, and to learn more about how I can help your company make the most of Rust, visit corrode.dev.
Thanks for listening to Rust in Production.