Rust in Production

Matthias Endler

Arroyo with Micah Wylde

This episode of Rust in Production explores Arroyo, a real-time data processing engine built in Rust. Micah Wylde, founder of Arroyo, shares insights on its benefits, challenges, and future potential. Visit Arroyo's website to learn more.

2024-01-25 56 min

Description & Show Notes

In this episode, we have Micah Wylde from Arroyo as our guest. Micah introduces us to Arroyo, a real-time data processing engine that simplifies stream processing for data engineers using Rust. He explains how Arroyo enables users to write SQL queries with Rust user-defined functions on top of streaming data, highlighting the advantages of real-time data processing and discussing the challenges posed by competitors like Apache Flink.

Moving on, we dive into the use of Rust in Arroyo and its benefits in terms of performance and memory safety. We explore how workflow engines and stream processors complement each other and examine Arroyo's approach to real-time SQL and its compatibility with Postgres. Micah delves into memory and lifetime concerns and elaborates on how Arroyo manages them in its storage layer. Shifting gears, we explore the use of the Tokio framework in the Arroyo system and how it has enhanced speed and efficiency. Micah shares insights into the challenges and advantages of using Rust, drawing from his experiences with the Arroyo project.

Looking ahead, we discuss the future of the Rust ecosystem, addressing the current state of the Rust core and standard library, as well as the challenges of interacting with other languages using FFI or dynamically loading code. We touch upon Rust's lack of a stable ABI and explore potential solutions like WebAssembly. We also cover industry perceptions of Rust, investor perspectives, and the hiring process for Rust engineers. The conversation takes us through the crates used in the Arroyo system, our wishlist for Rust ecosystem improvements, and the cost-conscious nature of companies that makes Rust an attractive choice in the current macroeconomic environment.

As we wrap up, we discuss how Rust competes with slower Java systems and ponder the potential for new languages to disrupt this trend in the future. We touch upon efficiency challenges in application software and the potential for a new language to emerge in this space. We delve into the increasing interest in using Rust in data science and the promising prospects of combining Rust with higher-level languages. Finally, we discuss the importance of fostering a welcoming and drama-free Rust community.

I would like to thank Micah for joining us today and sharing his insights. To find more resources related to today's discussion, please refer to the show notes. Stay tuned for our next episode, and thank you for listening!

About Arroyo
Arroyo was founded in 2022 by Micah Wylde and is based in San Francisco, CA. It is backed by Y Combinator (https://www.ycombinator.com/) (YC W23). The company's mission is to accelerate the transition from batch processing to a streaming-first world.

About Micah Wylde
Micah was previously tech lead for streaming compute at Splunk and Lyft, where he built real-time data infra powering Lyft's dynamic pricing, ETA, and safety features. He spends his time rock climbing, playing music, and bringing real-time data to companies that can't hire a streaming infra team.

Tools and Services Mentioned
- Apache Flink: https://flink.apache.org/
- Tokio Discord: https://discord.gg/tokio
- Clippy: https://github.com/rust-lang/rust-clippy
- Zero to Production in Rust by Luca Palmieri: https://www.zero2prod.com/
- Apache DataFusion: https://github.com/apache/arrow-datafusion
- Axum web framework: https://github.com/tokio-rs/axum
- `sqlx` crate: https://github.com/launchbadge/sqlx
- `log` crate: https://github.com/rust-lang/log
- `tracing` crate (tokio-rs): https://github.com/tokio-rs/tracing
- wasmtime - A standalone runtime for WebAssembly: https://github.com/bytecodealliance/wasmtime

References To Other Episodes
- Rust in Production Season 1 Episode 1: InfluxData: https://corrode.dev/podcast/s01e01-influxdata

Official Links
- Arroyo Homepage: https://www.arroyo.dev/
- Arroyo Streaming Engine: https://github.com/ArroyoSystems/arroyo
- Blog Post: Rust Is The Best Language For Data Infra: https://www.arroyo.dev/blog/rust-for-data-infra
- Micah Wylde on LinkedIn: https://www.linkedin.com/in/wylde/
- Micah Wylde on GitHub: https://github.com/mwylde
- Micah Wylde's Personal Homepage: https://www.micahw.com/

Transcript

This is Rust in Production, a podcast about companies who use Rust to shape the future of infrastructure. My name is Matthias Endler from corrode, and today we are talking to Micah Wylde from Arroyo about how they simplified stream processing for data engineers with Rust. Micah, welcome to the show. Can you tell us a few words about yourself and Arroyo, the company you founded?
Micah
00:00:26
Thanks so much for having me. Yeah. So I am a Rust engineer and the creator of the Arroyo streaming engine. Arroyo is a real-time data processing engine that allows you to write SQL queries with Rust user-defined functions on top of streaming data. For example, data you might have in Kafka or another streaming system. And I come to that problem and company after spending five years leading streaming teams at companies like Splunk and Lyft, which is a rideshare company in the US. And more broadly, I've been in the big data space working on data systems for pretty much my entire career, starting out in ad tech, working on real-time ad bidding systems, and then leading data teams and building data systems. So yeah, that's a brief background about me.
Matthias
00:01:28
At Splunk, you were a principal engineer and the team lead of the streaming compute team, so that makes you somewhat of an expert in stream processing, I would say. Maybe for the uninitiated, could you give us a very quick, very brief introduction to what stream processing is, in your own words?
Micah
00:01:50
Yeah. So traditionally, when people have wanted to process data, we do it in what's called batch mode, which means you take all the data in through whatever data sources those are, whether it's coming from logs you're reading, from API requests that are ending up somewhere, or wherever that data is coming from; it all kind of filters through your system and eventually lands in, traditionally, a database, or today maybe a data lake or a data warehouse. And then once all that data is there, you run a really big data processing job on top of all of that data at rest. Often this means you wait, you know, an hour or a day for all the data to land before you can analyze it or learn anything about it. Stream processing, in contrast, does this data processing as the data actually arrives in your system. So in real time. And the advantages there: obviously, latency is much better. You can process the data within milliseconds or seconds instead of waiting hours or days. But it also can give you a much easier way to build these end-to-end data systems, where you need to consider different properties around timeliness and completeness in order to build your higher-level analytics or data products. And for real-time companies like Lyft, this becomes really crucial, to be able to basically know things about your world really quickly. In rideshare, you need to understand where your users are, where your drivers are. You need to understand traffic speeds in order to do routing. You need to be able to do dynamic pricing based on supply and demand. And all of this stuff really demands that you be able to do complex analysis on data really quickly instead of waiting, you know, a day for it all to land in your data warehouse. So that's kind of a high level of the problem stream processing is solving, or kind of how it fits.
Matthias
00:03:55
Yeah, so stream processing has existed before. There were other companies that did a lot of groundwork. You mentioned at some point Hadoop, you mentioned BigQuery in your seminal article that we will get to in a second. But I think maybe you can just quickly explain what makes Arroyo special in this case, and also what the competitors are lacking right now that maybe leaves a nice niche for Arroyo.
Micah
00:04:27
Yeah, well, so BigQuery and Hadoop are both kind of in that batch paradigm, where you let all the data collect, and then you do a big data processing job over that data at rest. In the streaming world, traditionally, the most popular system has been one called Apache Flink. This is about a decade old, but it was really the first system that found a good programming model for streaming and, I would say, really made it work at a level of correctness that allowed it to be applied to a lot of these problems. Before Flink, we really had very simple systems that couldn't really guarantee anything about correctness or completeness and were sort of just orchestration systems around your own logic. So I spent my career in streaming working on Flink. And I think that's true of most of the other people who are doing new things now. And for all of us, we kind of have this perspective on Flink: that it solved this problem really well for people who were able to invest a ton of energy into becoming experts in Flink. So, at the companies I've worked for, that meant staffing up teams of 10 to 30 people working on Flink, building infrastructure and tooling around it, and then especially supporting end users who were actually building these streaming pipelines. And I think while Flink was really successful in allowing sophisticated companies to roll out this technology in a way that would have been dramatically harder a few years earlier, it never really got to that point of ease of use where you could hand Flink off to a data scientist, a data engineer, or a product engineer, and allow them to be successful building these real-time pipelines on their own. We always needed a lot of hands-on support from the Flink experts of the company. And that's really what we're trying to innovate around in Arroyo. We're trying to build a system that is easy enough for any engineer or data scientist at your company to pick up and build these correct, reliable, performant real-time data pipelines.
Matthias
00:06:43
So how do you see the relationship between stream processing on one end and these new workflow engines that pop up nowadays, like Windmill, which is coincidentally also written in Rust? Do you see an overlap? Do you see the industry converging on something that maybe encompasses both? Or would you say these are fundamentally different areas of expertise?
Micah
00:07:07
Yeah, I think they're very different systems, and they are good at different kinds of problems. So workflow engines are really excellent at these very long-running tasks: we have a bunch of things we need to do based on fairly simple criteria over the course of a day. For example, a user signs up, we need to send them this email. Depending on what they do in response to that, we need to do this other sequence of events. And that's the sort of thing streaming engines like Flink or Arroyo are actually pretty bad at. It's hard to specify that type of conditional logic over all of these different states. And they also, architecturally, are kind of way overpowered to do that kind of stuff. I think these systems actually work together quite well, because stream processors are really good at data-oriented problems. So often, this will mean you put your really big feed of data, your millions-of-events-per-second feed, into your streaming engine. And that produces features or events that can then be consumed by the much lower-scale workflow system. So that's actually a pretty common pattern for these to work together. But at least in the near future, I don't see them as being in the same space at all.
Matthias
00:08:32
Mm-hmm. On your website you have a very nice example where you take a Kafka stream and then you write some, I think it's SQL, might be Arroyo's own syntax, but it's similar to SQL, to pipe events through your system and then see the results in real time. And this was a pretty impressive demo. So is Arroyo SQL-like, or is it more than that? Is it different? If so, in what sense?
Micah
00:09:03
So the main way you program Arroyo is through SQL. We have slightly our own dialect, but we aim to be pretty Postgres-compatible. To do real-time SQL, you do need to extend it in some way. There are different approaches to this. But SQL, as originally defined, was really designed for these batch computations: to do, like, a group-by or an aggregate or a join, you really need all of the data to be available. Otherwise, how do you know? In a join, there might be more data coming in on one side or the other in the future, so you can't ever return that result. So different streaming systems that use SQL have come up with different answers for, basically, how can we decide that we're done, that we're able to return a result for these expressions. In Arroyo, that looks like this: we introduce these time-oriented window functions, like a tumbling window and a sliding window and a session window. And these rely on a notion of what's called watermarking, which is this concept of, basically, estimated completeness. A watermark is a special value that flows through the data flow of the pipeline and tells all of the operators that we have seen all of the data, or we believe we've seen all the data, from before a certain time. And that tells us, if we have a window that closes at time t and we get a watermark that is after t, that we can close that window: that we've seen all the data that will be in that window and we're able to process it and return the results to the user. So this is a common pattern in certain types of stream processors like Flink and Arroyo. There's another approach to this, which Arroyo and other systems like Materialize also support, which is based around a more incremental style of computation, where we actually decide we're never going to be done. We never know that we have all the data for a particular time period. So every time an event comes in, we're going to update the state of that window and emit the new result. So depending on the kind of problem, you may want one style of SQL or the other style of SQL. But yeah, in any case, it's all SQL.
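To make the watermark mechanics concrete, here is a minimal, self-contained Rust sketch of watermark-driven tumbling windows. It is illustrative only, not Arroyo's actual implementation, and all names are made up for the example:

```rust
use std::collections::BTreeMap;

/// Tumbling windows keyed by their start time. The aggregate here is a
/// simple event count, standing in for real per-window state.
struct TumblingWindows {
    width_ms: u64,
    windows: BTreeMap<u64, u64>, // window start -> event count
}

impl TumblingWindows {
    fn new(width_ms: u64) -> Self {
        Self { width_ms, windows: BTreeMap::new() }
    }

    /// Assign an event to its window based on event time.
    fn on_event(&mut self, event_time_ms: u64) {
        let start = event_time_ms - (event_time_ms % self.width_ms);
        *self.windows.entry(start).or_insert(0) += 1;
    }

    /// A watermark asserts "no more events before this time", so every
    /// window ending at or before it can be finalized and emitted.
    fn on_watermark(&mut self, watermark_ms: u64) -> Vec<(u64, u64)> {
        let closed: Vec<u64> = self
            .windows
            .keys()
            .copied()
            .take_while(|start| start + self.width_ms <= watermark_ms)
            .collect();
        closed
            .into_iter()
            .map(|start| (start, self.windows.remove(&start).unwrap()))
            .collect()
    }
}

fn main() {
    let mut w = TumblingWindows::new(1_000);
    w.on_event(100);
    w.on_event(900);
    w.on_event(1_200);
    // A watermark at t=1000 closes the [0, 1000) window, while the
    // [1000, 2000) window stays open awaiting more data.
    for (start, count) in w.on_watermark(1_000) {
        println!("window starting at {start} ms: {count} events");
    }
}
```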
Matthias
00:11:25
You wrote an article called "Rust Is The Best Language For Data Infra", which is kind of a catchy title. I read the article, and one thing I wondered about was: was Rust your first choice when you started? Have you looked into, for example, the solutions that came before you? And also, was it around the time when Zig also became popular? And where do you see yourself in this space? Would you say, okay, Rust was just there at the right point in time? Or would you also say, well, there would be alternative realities, so to say, where Arroyo was written in C++ or maybe Zig in a different world?
Micah
00:12:12
Yeah, so, I mean, kind of setting this historically: the very earliest systems in this space, like the original Google systems that established a lot of how we think about big data today, like MapReduce and BigTable and GFS, those were all written in C++. And then we had a long history of writing systems in Java, like Hadoop and HBase. And Flink itself was originally written in Scala, and then rewritten in Java. And then we had a whole period of doing Go, like CockroachDB and a handful of other big data systems. And yeah, I think now we definitely would not have chosen Java or Go for Arroyo. I think in many ways, the current era of systems is a reaction to the previous era of writing these systems in Java. A lot of people are finding that you can get much better performance and much easier operations literally just by rewriting these systems in a non-managed language like C++ or Rust. So we're kind of following in the footsteps of projects like Redpanda, which did this with Kafka, and ScyllaDB, which did this with Cassandra. So I think we could have done some of the things we're trying to do in Java or Go, but it would have been much harder to accomplish our goals. So in a world without Rust, I think we probably would have ended up choosing C++. But I'm very grateful that we are in a world with Rust. It has definitely made our lives much easier than it would have been if we had to choose C++ for this.
Matthias
00:13:58
Especially, I assume, to optimize the platform, you would have to avoid a lot of copies, and in C++, passing references around can be a bit of a nightmare sometimes if you don't know exactly what you're doing; and even if you do, there can be issues with that. And I just wondered: do you have a lot of lifetimes in your code as well, or is that something that the Rust compiler completely elides, so you don't even have to think about lifetimes at all?
Micah
00:14:29
So the most memory-oriented or lifetime-oriented part of our system is the storage layer. So maybe to give a little bit of architectural insight here: the way these systems look, they are these directed acyclic graphs of data flow. You take a SQL statement, you compile it into a SQL plan, and then eventually optimize that into this data flow graph. Each node of this graph is some kind of potentially stateful operator, so for example, doing a filter or a map, or a stateful function like a window or a join. And between these operators, the events and processed data flow over queues or over network sockets. So within these stateful operators, we potentially have to store data for long periods of time. So if you imagine you have, like, a 30-day sliding window, we need to store some representation of that data for 30 days. And we do that in a mix of offline S3 storage, local disk cache, and then in-memory cache. And managing that in-memory cache brings these lifetime concerns into play, managing the data as it flows from that cache into the processor in order to be used. Fortunately, in these systems, the architecture constrains that problem somewhat. So at the semantic layer, you're processing one event at a time in each of these operators. So you don't have to deal with concurrency issues at the direct processing layer. And that ends up simplifying the kind of lifetime management that you might have in a more traditional database, where you're dealing with a bunch of different requests to the same data.
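As a rough illustration of the operator graph Micah describes, here is a minimal Rust sketch: a toy trait with hypothetical names, not Arroyo's actual API, where each node consumes one event at a time and keeps private state:

```rust
use std::collections::HashMap;

#[derive(Debug, Clone)]
struct Event {
    key: String,
    value: i64,
}

/// Each node of the dataflow graph: process one event, possibly
/// updating private state, and emit zero or more downstream events.
trait Operator {
    fn process(&mut self, event: Event) -> Vec<Event>;
}

/// A stateless map operator.
struct MapDouble;
impl Operator for MapDouble {
    fn process(&mut self, mut event: Event) -> Vec<Event> {
        event.value *= 2;
        vec![event]
    }
}

/// A stateful aggregate: a running sum per key, the kind of state that
/// would spill to local disk or S3 in a real engine.
struct RunningSum {
    sums: HashMap<String, i64>,
}
impl Operator for RunningSum {
    fn process(&mut self, event: Event) -> Vec<Event> {
        let sum = self.sums.entry(event.key.clone()).or_insert(0);
        *sum += event.value;
        vec![Event { key: event.key, value: *sum }]
    }
}

fn main() {
    // A two-node "graph", map -> aggregate, driven one event at a time.
    let mut map = MapDouble;
    let mut agg = RunningSum { sums: HashMap::new() };
    for e in [
        Event { key: "a".into(), value: 1 },
        Event { key: "a".into(), value: 2 },
    ] {
        for mapped in map.process(e) {
            for out in agg.process(mapped) {
                println!("{out:?}");
            }
        }
    }
}
```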
Matthias
00:16:36
So in Rust lingo, that would be: your types are not Sync, or they don't have to be?
Micah
00:16:44
That's correct. Yeah, we're always accessing a particular... You can think of it logically: each of these operators is single-threaded. This is all implemented in Tokio, so what's happening under the hood is much more complicated than that. But as a programmer, you can really think of it as synchronous processing on a single thread.
Matthias
00:17:04
Speaking of Tokio, it feels like this is an ideal use case for it, because you're kind of leaning into things that are inherently concurrent. They don't really have to be sequential; at least parts of them can be executed concurrently, sometimes maybe even in parallel. But I wonder what you think about Tokio: your experiences with the framework, the ergonomics of it, and also the recent discussion about async Rust, Send, Sync, and so on, and work-stealing schedulers, all of that stuff.
Micah
00:17:40
Yeah, so at a high level, a system like Arroyo doesn't really need a complex scheduler like Tokio. As I mentioned, each of these operators essentially acts as a single thread: it receives one event, it does all the processing for that event, and then it sends it on to its next destination. And all this has to happen in order, to uphold the correctness guarantees of the system. And because of that, the first version of Arroyo actually was built around threads and thread-based processing. At some point, it migrated to Tokio and async Rust, actually pretty early on. And the core reason for that was that so much of the ecosystem is async Rust at this point that if you want to use common network libraries or database drivers or almost anything from the network programming ecosystem, you do have to deal with async at some point. And at some point, it's easier just to move your whole system over to async. And that was definitely a challenging migration. Actually, for me, I had never worked with async Rust before. So it involved a lot of learning, a lot of time on the Tokio Discord channel, which is extremely helpful. But in the end, actually, the surprise was that it ended up being a lot faster. Just purely doing that migration made the system like 30% faster, which was not my expectation at all. But it turns out that the Tokio scheduler is really, really effective at this class of problems. Even though at a high level it looks like all this processing is single-threaded, there's a lot more going on under the hood, a lot more work that has to be coordinated between actual threads: in our case, talking to S3 or talking to other systems. We have a lot of queues involved. So even though we have only a smallish number of actual processing threads, there's a lot of network exchange happening on other threads, like talking to the coordination system over gRPC. And Tokio is really good at, you know, organizing all of this work efficiently and really maxing out the use of your cores. I think the most surprising thing for us is that we're able to run the system at extremely high utilization, like above 95% CPU utilization, and everything remains responsive and reactive and is able to really work effectively at that extremely high level of CPU thrash, which has never been my experience with systems written in other paradigms. And then in terms of, I guess, how I think about the async Rust, I guess, drama, if we want to use that word: I think the Rust community has a higher level of drama in general, and I don't fully understand why that is. But I think maybe the technology just works so well that we sort of have to invent other stuff to be upset about. But I will say async Rust definitely has a learning curve. It took me, coming from being, I would say, a pretty strong Rust programmer already, maybe a month to really be an effective async Rust programmer. And it's definitely been the edge of the system that other people who contributed to it have the most trouble with. The requirement that values held across an await be Send definitely can be frustrating if you aren't experienced in the strategies for dealing with that. And the sometimes bad error messages in the compiler don't help with that either. It can make it really hard to figure out where exactly that problem is introduced in a large amount of code. But I'd say overall, Tokio has been a huge boon to us.
And it's really remarkable what it allows us to do in terms of just not having to think very much about how we schedule work. It just does a really remarkably good job on its own.
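For readers unfamiliar with the pattern, here is a minimal, hedged sketch of the style described above: an operator as a Tokio task that owns its state and processes one event at a time from a channel. This is illustrative, not Arroyo's code:

```rust
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    let (tx_in, mut rx_in) = mpsc::channel::<i64>(128);
    let (tx_out, mut rx_out) = mpsc::channel::<i64>(128);

    // "Operator" task: logically single-threaded over its own state,
    // even though Tokio may move it between worker threads.
    let operator = tokio::spawn(async move {
        let mut running_sum = 0i64; // private state, no locks needed
        while let Some(value) = rx_in.recv().await {
            running_sum += value;
            if tx_out.send(running_sum).await.is_err() {
                break; // downstream hung up
            }
        }
    });

    // Feed a few events in, then drop the sender to end the stream.
    for v in 1..=3 {
        tx_in.send(v).await.unwrap();
    }
    drop(tx_in);

    while let Some(sum) = rx_out.recv().await {
        println!("running sum: {sum}");
    }
    operator.await.unwrap();
}
```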
Matthias
00:21:47
Well, Rust protects you from memory safety problems. It does not protect you from race conditions. So I wonder if you, as someone who uses Rust and Tokio at scale, have run into any sort of data races or things that you encountered at runtime, which maybe were a bit of an issue for your platform. Or did you never ever have any outages in production?
Micah
00:22:14
So not specifically from race conditions, because, again, the architecture of our system makes the high-level concurrency pretty straightforward, although a lot more complexity creeps in in the details, especially when you try to get to that next level of performance. So, for example, the storage system is extremely complex and has a lot of concurrency. But Rust does really help a lot with managing that complexity. In terms of issues in production: the issues we've seen are much more around the high level, the ways that all the different pieces of this distributed system interact with each other, and wrong assumptions in different pieces about what other things are doing. Unfortunately, Rust definitely does not fix distributed systems issues. But at the micro level, it's remarkable how well things work once you get them to compile. An example I brought up in that blog post, which still kind of blows my mind, is that I wrote the entire network stack, the piece of software that allows this system to be distributed, in like a two-day push, basically two 12-hour days. I basically just coded that straight, and then at the end spent maybe an hour trying to get it to compile. And from there, it just worked perfectly the very first time. I turned a single-node system into a distributed system without any testing, any iteration on that. And it basically hasn't changed since that initial implementation. I've definitely never experienced that writing network software in C++, or even Java for that matter.
Matthias
00:24:14
That's pretty impressive, yes. Pretty awesome that you could pull that off. And it's a testament to the Rust type system, which is helpful, and also the borrow checker and all of the things that make Rust development and the developer ergonomics pretty awesome. I wondered, though, although maybe you didn't have that many runtime problems, whether you had any compile-time problems, in the sense that maybe parts of the ecosystem were not aligned: compatibility issues with, say, different versions of Tokio, or maybe different libraries that were sometimes more mature, sometimes less.
Micah
00:24:55
Yeah, it's never been a huge issue. And just the Rust crate ecosystem in general has been such a boon to us from a productivity perspective, compared to the C++ world, where using dependencies is so challenging and you don't have this incredible, rich ecosystem that we already have in Rust after such a relatively small amount of time. So there are occasional compatibility issues. We've had to fork a few open source projects we rely on, but I would say dealing with any of that stuff is a very small part of my day.
Matthias
00:25:30
All of the things that you mentioned are, at least to an average developer, pretty low-level; or at least you need a lot of expertise on how to structure or architect such systems in order for them to perform well. And I wondered: what do you think, how much does Rust guide you towards an idiomatic solution, and how much is your own expertise?
Micah
00:25:54
Yeah, I think Rust definitely guides you towards a correct solution. I don't know that it always helps you that much with being idiomatic. Although the tooling around it is very helpful; like, cargo clippy is really helpful. So, my co-founder had never used Rust before working on this project. He's a really experienced distributed systems engineer and has worked on a bunch of query systems, but was new to Rust. And tools like Clippy really helped him pick up the idiomatic style of Rust pretty quickly. Beyond that, I think the Rust community is also really helpful. I mentioned already the Tokio Discord, which was super useful when I was trying to get up to speed with async Rust. But in general, the Rust community is extremely useful in helping you solve problems or figure out why some weird compile issue is happening.
Matthias
00:26:49
Did you use any resources outside of the official Rust book and maybe the community to help you get started with Rust? Or did you start on a project and learn on the job?
Micah
00:27:03
So, I've actually been using Rust since like 2014. I'd never convinced a company we should do a major project in Rust until now. It was always a big uphill battle trying to introduce Rust into a large organization. But I've been using it for all of my personal projects for a really long time. I've been a fan of the language since basically I first learned about it. So in terms of my own development, there have been a lot of resources over that time. The first version of the Rust book, which I have on my bookshelf back there, that was very helpful. But it also just changed so much in the early days that it was a full-time job just keeping track of the updates to the language. Today, it's much easier. It's been pretty stable for a number of years. And I think the quality and quantity of resources has also increased a lot. I know there's a really good book, actually, on running Rust in production that I've looked at a fair bit for the more, kind of, how-do-you-actually-run-Rust details: what does logging look like in Rust, how do we do metrics, these kinds of things that aren't necessarily part of an intro book. It's called Zero to Production in Rust, by Luca Palmieri.
Matthias
00:28:26
Awesome. And you mentioned that it's a bit tricky sometimes to convince bigger companies and organizations to move towards Rust and introduce Rust at these companies. Why is that, in your experience?
Micah
00:28:42
Yeah, I think large companies tend not to be that ambitious in their technical choices. A lot of it is built around minimizing risk rather than maximizing reward. And Rust definitely seems risky to a CTO today. They worry: will it be too hard for engineers to learn Rust? If we restructure teams, will we be able to pass off this project to another team? Will they have to figure out how to use it? Will we be able to hire enough Rust engineers? And if you're Google and you need to hire 10,000 engineers, I think you should be rightly concerned about hiring 10,000 Rust engineers. I doubt there are that many Rust engineers in the world. But for a smaller company, that's not an issue at all, right? Hiring three Rust engineers is pretty easy. And I think especially for a small company, it's an advantage in a way that it maybe isn't for a big company to be using Rust. Because as a small company, you can attract people because they want to work in Rust, and that's a big incentive to work for you. And I think that's the upside of working in a maybe slightly obscure language: you get those people who are really excited about it, and that can be a big boon to you. But big companies kind of just see the risk side of that equation.
Matthias
00:30:07
How do you hire Rust engineers? Do you reach out in your network or do you post job announcements somewhere?
Micah
00:30:15
Yeah, well, I guess actually for us initially, we've been hiring more on the streaming expertise side. There's actually maybe more overlap there now than there was two years ago; a lot of the newer streaming systems are also in Rust. But historically, as I mentioned, streaming systems have been largely in Java, so that's where most people have expertise. But I definitely anticipate, as we try to hire more broadly, that hiring from that pool of Rust engineers will be pretty productive, especially as a non-cryptocurrency Rust company. There's, I think, a lot of demand for those jobs right now. So we'll be able to tap into that.
Matthias
00:30:56
Very true. What sort of other crates do you use to get your job done? I guess in the blog post, you mentioned DataFusion. Maybe that's one that you can talk about, but feel free also to talk about any other crate that you like.
Micah
00:31:14
Yeah, so DataFusion is probably the most critical one for us. DataFusion is a number of things; it comes from the arrow-rs ecosystem. We use it primarily as a SQL parser, so it takes SQL text and turns it into an AST, and then as a planner, taking that AST and turning it into a graph-oriented plan that describes what that SQL is supposed to do. SQL is an extremely complex language with like 30 years of history and a bunch of different equivalent ways to express stuff. So having a library that deals with a lot of that complexity for you is extremely helpful when you're building a SQL engine. We get a nice clean plan out of that, which we're able to then optimize in our own way and compile into our own set of operators. So DataFusion has been extremely critical to us being able to build this thing as quickly as we have. Beyond that, I guess I'll also call out, maybe a little bit lower level or higher level, but I really appreciate the Rust web ecosystem. So we rely on Axum and SQLx, which is a really great SQL library. This is not the core of our product at all; this is to power our API and our web interface. But it's remarkable that even in a domain that maybe Rust isn't natively as well suited to, we still have these incredibly high-quality libraries that make it actually really easy to build good products in that domain. So that's been an impressive discovery for us.
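As a rough sketch of the parse-then-plan flow Micah describes, here is a minimal DataFusion example. It is a simplified illustration, not Arroyo's pipeline, and assumes the datafusion and tokio crates:

```rust
use datafusion::error::Result;
use datafusion::prelude::*;
use datafusion::sql::parser::DFParser;

#[tokio::main]
async fn main() -> Result<()> {
    let sql = "SELECT 1 + 1 AS two";

    // 1. SQL text -> AST (a list of parsed statements).
    let statements = DFParser::parse_sql(sql).expect("valid SQL");
    println!("parsed {} statement(s)", statements.len());

    // 2. SQL text -> logical plan -> execution, via a session context.
    // (Arroyo takes the plan and compiles it into its own streaming
    // operators instead of executing it directly like this.)
    let ctx = SessionContext::new();
    let df = ctx.sql(sql).await?;
    df.show().await?;
    Ok(())
}
```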
Matthias
00:33:03
The crates that you mentioned: I cannot speak about DataFusion, but the other ones are definitely top of class in any language, literally, I would say, at least from my experience. I used Axum and SQLx before, and I think they are really awesome. But I wonder about the future of this ecosystem. Do you see that we have kind of reached a point where crates are starting to more or less stabilize, and there's one go-to crate that you pick for your job? Or would you say the ecosystem is still so young that I could see myself switching, let's say, to a different web framework in a year, or maybe a different parser or whatever, if it comes up?
Micah
00:33:44
Yeah, I mean, I think it's probably too early to say that things have stabilized. A year ago, your choices in a lot of these areas would have been different. Definitely three years ago, none of these crates existed. Axum itself is still changing quite a lot from release to release. So I think even these crates are not fully stabilized. But I think we will be hitting more of a period of stability, especially with async Rust becoming more feature-complete. A lot of these libraries have had to work around limitations in the async ecosystem and implementation, like missing the ability to use async functions in traits, which has just landed or is about to land.
Matthias
00:34:30
In 1.74, yeah.
Micah
00:34:32
Yeah. So I think that will allow things to stabilize their APIs in a way that has been challenging so far. And I do expect more stability going forward, and more obvious choices around which crate to use to solve different problems. And something that's been impressive to me about the Rust ecosystem is that there maybe were opportunities to stabilize earlier. Just to give you a random example: for logging, we had the log crate, which for a long time was the obvious crate to use for logging. And we could have just decided that was good enough. But actually, it turns out there was a better option and a better design, and we ended up with the tracing crate instead. And the ecosystem was able to move to this better option, rather than getting bogged down in a local optimum. And you've seen this in a lot of different areas, where there was an early consensus around a crate as the solution to a class of problems, but the ecosystem was able to move on to something that solved it better. And I think that's not a property you have in all ecosystems, and it's something I really appreciate about the Rust community: we're able to move fairly quickly, and also in a pretty consensus-driven way, to better options in the ecosystem. So I think we'll continue to see that happening. I don't know if Axum, for example, is the end state of Rust web programming. I think we'll continue to see iteration happening.
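To illustrate the log-to-tracing shift in a few lines, here is a small sketch, assuming the tracing and tracing-subscriber crates: tracing covers plain logging but adds structured fields and spans:

```rust
use tracing::{info, info_span};

fn main() {
    // Install a simple formatting subscriber (tracing-subscriber crate).
    tracing_subscriber::fmt::init();

    // A span gives every event recorded inside it shared context.
    let span = info_span!("pipeline", job_id = 42);
    let _guard = span.enter();

    // Structured fields instead of interpolated strings.
    info!(events = 128, "checkpoint completed");
}
```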
Matthias
00:36:02
What about Rust itself, the standard library? What about stabilization of the Rust core? Would you say this is already in a very satisfactory state? Or would you say that for your use case, there would be things that you would wish for?
Micah
00:36:20
I think everyone has their own wish list of RFCs that they hope finally get merged. For me personally, the lack of completeness around async has been the biggest frustration. Missing async functions in traits, for example, has required a lot of somewhat ugly workarounds for us. And even the version of this that's going to be stabilized isn't quite complete enough for all of our use cases. But I appreciate that Rust takes time to get these solutions right, and I think we've seen that process play out with async. Overall, I think the Rust programming language is in a really good place. And I think it has stabilized over the past couple of years, compared to the previous five years. And we'll continue to see that stabilization, with hopefully a few nice improvements, like the work we're getting out of GATs or the improvements to async we're seeing right now.
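For context, here is a minimal example of the feature in question, async functions in traits (stable as of Rust 1.75), which previously needed workarounds like the async-trait macro. The names are made up for illustration:

```rust
// Note: traits with async fns are not yet object-safe, one of the
// remaining gaps alluded to above (no `dyn EventSource` here).
trait EventSource {
    async fn next_event(&mut self) -> Option<u64>;
}

struct Countdown {
    remaining: u64,
}

impl EventSource for Countdown {
    async fn next_event(&mut self) -> Option<u64> {
        if self.remaining == 0 {
            return None;
        }
        self.remaining -= 1;
        Some(self.remaining)
    }
}

#[tokio::main]
async fn main() {
    let mut source = Countdown { remaining: 3 };
    while let Some(event) = source.next_event().await {
        println!("event: {event}");
    }
}
```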
Matthias
00:37:25
I fully agree. Where I see some issues is on the edges of the Rust standard library. So where you talk to other languages with FFI or where you load code dynamically. And I guess for a streaming platform, that is also an interesting use case, maybe where you can hook stuff into your engine at runtime. And of course, there are technologies like WebAssembly and that sort of stuff getting pushed forward. I wonder if you already experimented with that and what's your impression on the current state of the ecosystem around that?
Micah
00:38:01
Yeah, actually, maybe I should have mentioned that; I called it out in my blog post as a frustration. Rust does not have a stable ABI, Application Binary Interface, which is challenging if you're trying to build anything that looks like a plugin system. In our case, we support user-defined functions, so users can write Rust code that then gets loaded into the engine at runtime. If you're writing this in C or C++, there's a stable C ABI that you're able to use to dynamically link software at runtime. Rust doesn't have this. So if you want to compile a library and a host application and link them, you have to do that with the exact same version of the Rust compiler, and in many cases, the same settings for those compilers. So it makes it really hard to distribute binary software separately from the thing that is consuming that library. So today, the solution you basically have to use is the C ABI, which means giving up a lot of the features and power of Rust, at least at your interfaces. You also mentioned Wasm, which is another class of solutions to this problem. In some ways, this is even worse from an interface perspective, because there's no real standard way to interact between hosts and plugins in the Wasm ecosystem. So every application sort of has to figure this out for themselves. We have explored Wasm as a solution to this class of problems. We actually have an integration with Wasmtime, which is a great Rust Wasm runtime. And I think for systems like ours, that probably will be the direction we take going forward. It's particularly great for integrating with other language ecosystems, and there's a lot of energy in the Wasm world to figure out these integration problems: how does a Rust program talk to a Python program over shared Wasm memory, how can we build these kinds of unified interfaces so that individual projects like ours don't have to keep solving this class of problems over and over again. But it would be really nice if Rust were better at this kind of dynamic interaction with other compiled code.
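Here is a minimal sketch of the Wasmtime-style plugin loading Micah alludes to: the host loads a Wasm module and calls an exported function. It is illustrative only (a toy inline module, not Arroyo's UDF interface) and assumes the wasmtime and anyhow crates:

```rust
use wasmtime::{Engine, Instance, Module, Store};

fn main() -> anyhow::Result<()> {
    // A tiny module exporting `double(i32) -> i32`, inlined as WAT so
    // the example is self-contained; a real UDF would be a compiled
    // .wasm artifact loaded from disk.
    let wat = r#"
        (module
          (func (export "double") (param i32) (result i32)
            local.get 0
            i32.const 2
            i32.mul))
    "#;

    let engine = Engine::default();
    let module = Module::new(&engine, wat)?;
    let mut store = Store::new(&engine, ());
    let instance = Instance::new(&mut store, &module, &[])?;

    // Look up the export with a typed signature and call it.
    let double = instance.get_typed_func::<i32, i32>(&mut store, "double")?;
    println!("double(21) = {}", double.call(&mut store, 21)?);
    Ok(())
}
```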
Matthias
00:40:25
Are you aware of any RFCs that propose the stabilization of the Rust ABI?
Micah
00:40:32
Yeah, there are a couple of RFCs in this area with different approaches, but I haven't seen a lot of progress in the past couple of years, or real interest in solving this problem. I think it is somewhat niche: most Rust projects are distributed as source code, and most libraries are compiled at build time. But for projects like ours, or anything that's dealing with plugin ecosystems, yeah, that's not really good enough.
Matthias
00:41:05
I guess we covered a lot of the technicalities of the project. I also wanted to touch on a few things that are a bit more business-related. I guess the first question I had along these lines would be: Arroyo is backed by Y Combinator, as far as I'm aware. Did investors ever care about your choice of programming language, or was it never up for debate? Or maybe it was even a good thing, and maybe they encouraged you to use Rust?
Micah
00:41:32
Yeah, I would say Rust has only helped us in talking to investors. There are a number of systems that have come before us that have proved Rust can work commercially. Investors know it's, like, the hot language in the data space. And so you definitely seem more attractive in that sense if you're using Rust. But honestly, most investors do not care about your language choice. That's just not the level that they operate on. And if you come in as an expert in this area and you say, I think this is the right technical choice, investors are not going to second-guess you on that. They're much more interested in the commercial side of the question: how are you going to sell this, who are your users, why are they going to choose you over a more established company in this space. They're definitely not grilling you about why you're using Tokio or async-io or whatever.
Matthias
00:42:28
Yeah. What would you recommend to people who are in the same space and are considering using Rust? Maybe they have dabbled in it, but they are not sure if they should fully commit to it for their next project.
Micah
00:42:44
Yeah, well, I think in this space, Rust is just the obvious choice today. You know, we went through this whole era of building these systems in Java or Go or whatever. But today, especially in the current macroeconomic environment, companies are much more cost-conscious. And when you can write something in Rust that takes half the resources, or a quarter of the resources, of the Java version of it, that's a huge, huge selling point. And it's really hard to compete with these much slower Java systems. And the Java systems are responding by rewriting core pieces in C++: we saw Spark rewrote their core engine in C++, and Confluent, the Kafka people, have been rewriting stuff as well. So I think it's just really hard to compete if you're not in either C++ or Rust. And I think you're going to find that even though there's maybe a larger pool of C++ developers today, it's much easier to teach someone to become a good Rust programmer than to teach them to be a good C++ programmer. The Rust compiler helps people who aren't really experienced with memory management so much; it makes it much harder to make these classes of mistakes. So I think it is very much the obvious choice. Maybe there are some newer languages that you could explore; you mentioned Zig earlier. But all these kind of new, I guess, Rust replacements are so much less mature today that you really would have to be very ambitious to experiment with that. So yeah, I think either C++ or Rust, and really, unless you have a strong reason to use C++, I think Rust is just the default choice today.
Matthias
00:44:38
Taking the example of Confluent: they took parts of their code base and rewrote it in C++, if I understood correctly. And I wonder why they chose C++ instead of Rust, because Rust might already be a very mature alternative. Why didn't they pick Rust then? Or was it before Rust even became that mature?
Micah
00:45:02
Yeah, maybe I'll speak of Spark; I maybe have a little more background on... But yeah, so Spark historically was written in Scala, and then mostly in Java. And then Databricks rewrote their core engine in C++, and that's something that they've kept closed source. I think it was just that this project started like six years ago, when Rust was much less mature than it is today.
Matthias
00:45:27
Do you know of any other companies that are currently planning to rewrite parts of their codebase in Rust in that space?
Micah
00:45:36
Well, yeah, a great example is InfluxDB, which was originally written in Java, then rewritten in Go, and they have just completed a major rewrite of their core storage engine in Rust. And actually, we've benefited a lot from that, because they're big supporters of DataFusion and the Arrow project. TiKV, I'm not exactly sure how you say that, is another example, where they started in Go and rewrote their core engine in Rust. So that's, I think, been a pretty common trend in recent years.
Matthias
00:46:10
If you're curious about InfluxDB's usage of Rust, then you should check out episode number one, where we had Paul Dix on the show. He talks a lot about the reasoning for InfluxDB moving to Rust. This wasn't planned, but it's a nice segue into promoting that other episode, if people are interested. Very interesting, I think. Looking forward, maybe to the next three, four, or five years, and looking at the projects that might get started along the way and the things that exist and are evolving over time: what is your perspective? What is your vision for the future? Where do you see the industry move?
Micah
00:46:55
Yeah, I definitely see... I feel like I'm a broken record, but I think for people starting new data systems or new large-scale systems, most are going to choose Rust going forward. There are still people starting new C++ systems, but just looking in my own space, three-quarters of the new systems are in Rust and one-quarter of them are in C++. And I think that trend is only going to increase as Rust becomes less risky from a technical perspective and from a hiring perspective. I mean, you know, maybe we'll see disruption from these other, newer languages that are able to become more mature and start attracting projects. At some point, I'm sure Rust will become boring and people will want to use more exciting languages. But that would be, I think, an extremely successful outcome for the Rust project. For now, we have not regretted our technical choice at all. We're a little over a year into this, and Rust has proved an extremely successful technology choice. I think maybe an interesting question is how much Rust adoption happens in the more application space. So there's kind of a divide here between infrastructure software and application software. Infrastructure software, like the stuff we're working on, or a database, for example, is written by a small team and then run by a much larger group of people. So it definitely makes sense to put a lot of effort into making it really efficient and fast, because it's going to run on so many CPU cores over its lifetime. For application software, where the development costs are much closer to, or much greater than, the runtime costs, you don't necessarily have that same financial pressure to make it really efficient. And today, I think Rust is a much harder sell in that space, because of the additional complexity of writing stuff in Rust, and the additional difficulty of, you know, hiring Rust engineers or training people in Rust. So I wonder how much Rust will grow in that space through just more maturity in the language and ecosystem, and maybe a growing set of people who use Rust or want to use Rust. But to me, that's a big open area for a language to expand into, or for a new language to come and move into, because I think we can do better than Java and Go for application-level programming. So much of the ergonomics of Rust is great for that, but there are the sharper corners of Rust around the memory management and lifetime issues, where it just feels like, if you don't care that much about performance, you shouldn't need to deal with this for that class of problems.
Matthias
00:50:12
If you look at a related field like data science, it feels like they are also starting to experiment with some ideas from the Rust world, if not even rewrite parts of their libraries in Rust to use them from less performant, higher-level languages like Python. You have Parquet files, and then you have parsers around that, and you have pandas, and all of this is inherently an interesting space for Rust, because it's a mix of analysis and a domain where performance is also relevant, right? Do you agree in general?
Micah
00:50:51
Yeah, so I think that's definitely a trend that will continue. And this is really taking a Rust core and wrapping it in a higher-level language like Python. And that's been really, really successful. We see that in the Java ecosystem as well, where a lot of these Java tools have rewritten their cores in Rust and gotten 10x or more performance out of that. I'm not personally a Python person, but people obviously really love it, and it's very hard to convince data scientists to use anything besides Python. So if you do want to give them better performance, I think this approach of writing the core in Rust and wrapping it in your higher-level language is something that has been really successful. I guess the other fascinating approach, I don't know if you're familiar with Mojo: this is the new language from Chris Lattner, the creator of Swift. This is creating a Python-like language that actually compiles into LLVM and MLIR and aims to provide C++ performance and somewhat Python compatibility. And that, to me, is extremely ambitious, given the semantics of Python and how hard that is to optimize. But if you don't want to take this Rust approach, that's the only other way you can really get acceptable performance with these Python APIs. For us, since we're starting with SQL, it's very easy to optimize SQL into whatever implementation you want, and that gives us a lot of advantages for providing really, really high performance, because SQL is declarative: you're able to rewrite the expressions in ways that make it much faster to actually execute. But when you have something like Python, you're much more limited in how much you can really optimize that, even with a Rust core. So I think it'll be interesting to see, as our data volumes increase and the complexity of the processing we're doing increases, how those financial pressures will push data science into high-performance paradigms. But yeah, for now, I think the Polars approach of kind of the Rust core is something we'll see in a lot of these data science ecosystems.
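As a sketch of the Rust-core-wrapped-in-Python pattern, here is what a minimal PyO3 extension module can look like. The module and function names are hypothetical, the API is roughly as of PyO3 0.20, and such a crate is typically built with maturin and then imported from Python as `import fastcore`:

```rust
use pyo3::prelude::*;

/// A hot inner loop implemented in Rust and exposed to Python.
#[pyfunction]
fn sum_of_squares(values: Vec<f64>) -> f64 {
    values.iter().map(|v| v * v).sum()
}

/// The extension module Python sees as `fastcore`.
#[pymodule]
fn fastcore(_py: Python<'_>, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(sum_of_squares, m)?)?;
    Ok(())
}
```

From Python, calling `fastcore.sum_of_squares([1.0, 2.0, 3.0])` then runs the loop at native speed, which is the same design Polars and similar libraries use at much larger scale.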
Matthias
00:53:23
I agree. I think we're getting towards the end, and it has become somewhat of a tradition around here to ask this final question: if there was one thing that you could say to the Rust community as a whole, a statement, a message that you have for the community, what would it be?
Micah
00:53:44
I think my message to the Rust community would be: chill out a little bit. Rust is an incredible language, an incredible ecosystem and community. And yet we seem to have ten times the drama of any other language community I've been part of. And I don't really understand why, or where it all comes from. But I think that level of drama can only hurt adoption, when people look at the Rust Reddit and are like, this is a shit show; why would I want to be part of this? So yeah, I think hopefully we can look back at the past year and just say: we all need to calm down a little bit, figure out how to work with everyone else, and stop driving people out of the community.
Matthias
00:54:32
That's a great final statement. I really love it. Micah, it has been a pleasure to have you on the show. Where can people learn more about you and about Arroyo? How can they get started with the platform?
Micah
00:54:45
Yeah. So I think we have some pretty good docs. If you head to our website, arroyo.dev, we link to those there. We have a Docker image, so it's super easy to run it and play around: you get a nice web UI where you can write SQL, you can talk to WebSocket APIs and HTTP APIs, and it's easy to play around with publicly available streaming data. And then we have a really friendly Discord community; if you head to our website, we have a link to that, and you can join.
Matthias
00:55:17
You heard it, there's nothing more to say. Again, thanks a lot, Micah, for coming on the show, and yeah, thank you.
Micah
00:55:26
Thanks so much for having me! This was great.
Matthias
00:55:28
Rust in Production is a podcast by corrode, hosted by me, Matthias Endler. For show notes, transcripts, and to learn more about how I can help your company make the most of Rust, visit corrode.dev. Thanks for listening to Rust in Production.