InfinyOn with Deb Chowdhury
Matthias Endler talks with Deb Chowdhury from InfinyOn about Fluvio, a Rust-based data streaming platform, its impact on data management, and the potential of Rust and WebAssembly in modern applications.
2024-10-31 56 min
Description & Show Notes
Picture this: Your organization's data infrastructure resembles a busy kitchen with too many cooks. You're juggling Kafka for messaging, Flink for processing, Spark for analytics, Airflow for orchestration, and various Lambda functions scattered about. Each tool is excellent at its job, but together they create a complex feast of integration challenges. Your data teams spend more time managing tools than extracting value from data.
InfinyOn reimagines this chaos with a radically simple approach: a unified system for data streaming that runs everywhere. Unlike traditional solutions that struggle at the edge, InfinyOn gracefully handles data streams from IoT devices to cloud servers. And instead of cobbling together different tools, developers can build complete data pipelines using their preferred languages - be it Rust, Python, or SQL - with built-in state management.
At the heart of InfinyOn is Fluvio, a Rust-based data streaming platform that's fast, reliable, and easy to use.
About InfinyOn
Data pipelines are often slow, unreliable, and complex. InfinyOn, the creators of Fluvio, aims to fix this. Built in Rust, Fluvio offers fast, reliable data streaming. It lets you build event-driven pipelines quickly, running as a single 37 MB binary. With features like SmartModules, it handles various data types efficiently. Designed for developers, it offers a clean API and intuitive CLI. Streamline your data infrastructure at infinyon.com/rustinprod.
About Deb Roy Chowdhury
For fifteen years, Deb has been a behavioral detective, piecing together human decision-making through conversations, data, and research. His passion lies in product innovation—finding that sweet spot where desirability, viability, and feasibility converge. From 7-person startups to tech giants with 165,000 employees, he has helped build products that people love. Deb is currently the VP of Product Management at InfinyOn, where he leads the product strategy and roadmap for Fluvio, a Rust-based data streaming platform.
Links From The Show
- Polars - Fast DataFrame library implemented in Rust
- Apache Arrow - Cross-language development platform for in-memory data
- Arroyo - SQL-based data streaming platform in Rust
- Arroyo Podcast Episode with Micah Wylde
- NATS - High-performance messaging system written in Go
- Memphis - Go-based streaming stack built with NATS as its core
- Four Horsemen Of Bad Rust Code (FOSDEM 2024) - Matthias' talk on writing bad Rust code
About corrode
"Rust in Production" is a podcast by corrode, a company that helps teams adopt Rust. We offer training, consulting, and development services to help you succeed with Rust. If you want to learn more about how we can help you, please get in touch.
Transcript
This is Rust in Production, a podcast about companies who use Rust to shape
the future of infrastructure.
My name is Matthias Endler from corrode, and today we are talking to Debadyuti Roy Chowdhury
from InfinyOn about building a distributed streaming engine in Rust.
Deb, so happy to have you. Can you say a few words about yourself?
Oh yeah, absolutely. I am talking to you from the Toronto area. I've worked in healthcare,
automotive manufacturing, surveillance tech, e-commerce, and so on.
So I'm like what you would call a technical product manager.
So that's really my background.
And that background has everything to do with my current role as the vice president of product at InfinyOn.
Before we go into that, I did wonder, how do you go from health to data?
Where's the connection here?
Oh, that's a good clarification question. As more and more healthcare systems got digitized,
there were quality-of-care questions, which were analytical in nature,
around population health and quality of care,
and which have expanded into telemedicine and other things that you see today,
like matching patients to proper care and so on.
Yeah, that makes sense. So essentially, a lot of those healthcare businesses
are nowadays also tech companies.
Somehow, they need to manage a lot of data. And this is where data came in and
where you maybe found a way into more of the data pipeline world, I guess.
Yeah, yeah, yeah. Initially, it was mostly descriptive statistics.
We were trying to figure out what the average age of people with certain
conditions was. And those patterns then led to something proactive in terms of
health practices, right?
And obviously, you need to first collect the data, build the system to have
all of that information go in.
And if you look at it, it has happened pretty recently, probably in the last 15 years or so.
Nothing existed before that. Everything was paper and pen. And I remember working
with scanned faxes and things like that, which were forms filled out.
And then you have to... It's like human-assisted, in that some part of it is automated,
some part of it is manual, and you still need to make sure that all of these
things add up and so on. Yeah.
You would love it here in Germany. We still have that system.
So once you found your way into data, what was the first real job working with lots of data?
So initially, I was working at a smaller firm, which was dealing with insurance providers in the US.
So Medicare, Medicaid, and other health plans.
Then I moved into a software company which sold software to hospitals across
the United States and in Canada.
And there were practices in Australia and across the globe in some sense.
We had a SQL Server implementation of the core database that supported the application.
And that had 4,200 tables.
So it's a pretty complex database. And then I worked in the analytics team.
So the analytics stack was actually built by one of the bigger software companies
using what was the state-of-the-art Microsoft stack, if you will.
But it was incredibly slow and complicated to process any of these things.
So someone is trying to run a report for what they did for one year,
and they're trying to run it for a few hundred doctors in their practice,
and it's taking days. And if it breaks and there is some problem in the data and some mapping,
those are all basically running stored procedures or extremely large SQL queries,
which are like dissertation papers: if you printed one of them
in a Word doc, it would probably take up five to seven pages.
And you're trying to troubleshoot across all the joined tables, you know,
a few hundred at times, to find where things are going wrong.
And none of it is optimized. So like during that period, I would pretty much
always have a profiler running.
And then you learn how not to write queries that make or break the system,
whether it's an issue with the schema, or mapping some wrong data set or
wrong data point to another data point, or just the performance of queries
that are accurate but need optimization, so that it doesn't take three
days for people to get a report, only to say there's something wrong with the data,
rerun this. It's like a fool's errand, right?
We were actually working on moving to an MPP store, still an enterprise solution,
moving from Microsoft to Teradata for performance reasons.
But then it was like, okay, we have this quote unquote, better way in the big
data realm with Hadoop and Spark and Flume and Hive and all of this entire ecosystem, if you will,
of replicated storage and parallel processing.
And that's how I got into distributed systems.
And it was a bunch of Java, Scala, and making sure these systems work.
So you came from a mostly proprietary stack with some Microsoft stuff in there.
And then you ended up in an open source ecosystem with lots of new tools.
And some of those tools I still remember.
To me, at least,
they were a revelation, because eventually you had a system that was composable.
You had multiple different, very well-thought-out components; a lot of those things
were really well done. You remember Hadoop and then Spark and so on. Was that
the same for you, or was it also still lacking to some extent?
All the papers that you're reading coming from Google or big tech companies seemed
like a revelation, and there's an incredible halo effect around those
shiny objects.
Because fundamentally, if you separate or compartmentalize the storage layer
and the processing layer and the infrastructure layer, and you're able to access
via neat levels of abstraction, different aspects that you need to,
it actually, at least in theory,
enables more teams to work together and collaborate more effectively.
And then that was actually predicated on the promise of the JVM,
and obviously Java was huge, and Hadoop is how I got into Java,
and the promise was write once, run anywhere, right?
That was one of the key expectations, which became, in reality,
write once, debug everywhere.
So that was the disillusionment, but I think initially, the theory was airtight,
and there is certainly promise to a composable system, which interoperates.
And then the learnings were about, you know, what it takes to manage that infrastructure
and integrations and what have you.
And the other thing is the human behavior side of preferences.
We were always using databases, whether it's Oracle or SQL Server or whatever.
Everyone wants to write SQL queries.
Now you're all of a sudden telling all of those people that they have to write
some Java or Spark, which is not something that they are mostly doing.
I think the challenge was that the foundation layer being rock solid and the
write once, run anywhere promise did not really work as well as anticipated.
And the performance challenges made it a little bit esoteric whereby certain people,
like the people who have built Kafka and Flink, they know how to wield the garbage
collector code in Java better than the average Java developer,
and that makes it a little bit challenging for everyone else to use it,
right? You're stuck with the API.
What essentially happened, as is true for any programming paradigm that gains
maturity and popularity, is that you get a lot of projects which do similar things.
There were projects to manage the infrastructure, projects to manage jobs and
orchestration and failures and things like that.
You're happy to compose with a bunch of different tools and make it work.
But the flip side of it is shoving a root problem under the rug,
which requires you to do some hard work of optimization.
If you're working at the level of abstraction of the API,
then that actually hamstrings you to how much optimization you can do or what
your infrastructure management challenges are going to be.
When I think of Hadoop, the
one word that comes to mind is that it's complex. It's
a very bespoke black box, so to
speak. I could look at all of the components, I could look at HDFS, and I might be
able to understand it, but it's complex. At least you need to understand that
there's a runtime behind it, and there are various things which could fail or
could be flaky, and that's a little tedious. Yeah.
And really every distributed system at the end of the day, you need to really
think about a couple of things.
One is all of the points of failures, right? Because you want it to be reliable,
highly available and computing the right way.
But the other piece of fault tolerance is, okay, you can have failures.
And most of the errors that you would find in that paradigm are out-of-memory
exceptions; that's the typical error that you see.
You're processing a large amount of data and, for whatever reason,
you have to troubleshoot it. That's the paradigm. And it's quite onerous, right?
It's cognitively burdensome to even do the troubleshooting;
even if you might have the code solution to the problem, just
finding where the problem is and where it broke
is pretty time-consuming. If you're playing at the API abstraction level,
all you can do is run the query again and run the job again, and thereby incur
more costs of using cloud resources if you're cloud hosting.
In order to troubleshoot, you have to go like double-click a couple of times,
and that's not an option, right?
So that was 2014. We had batch processing. We had big boxes.
We had all of that heavy machinery. When I think of data processing in 2024,
I think of stream processing.
And maybe some lambdas, some smaller things that can be combined,
more of a Unix-y sort of way of thinking about things and also smaller machines.
But then again, I want to hear it from you because you're still working in data.
So there's still hope, I guess.
What has changed in the last decade and how did we end up with Rust in data pipelines?
I came across it because in my last role, we were aggregating data from,
let's say, a billion products on Amazon.
And we would scrape those at a certain frequency, anywhere from a few hours
for very popular high-selling products to maybe a few days to a few weeks,
depending on the sales frequency and so on.
And based on that data, we would make calls to a few other APIs to enrich it.
Basically, we wanted a time series of price history, time series of rank history,
and so on to see how the changes in those values relate to sales.
So when I went in, when I joined this role,
very quickly, as a product person in data,
we had four or five teams in data that were doing collection,
pre-processing, building machine learning models, and the serving infrastructure
and so on, maintaining this entire system.
I built my roadmap for the data infrastructure team.
And then within a couple of months, everything was like, okay,
forget whatever you did in your first few weeks, because our cost of onboarding
every customer is more than the customer is willing to pay us.
For every dollar they paid us, we were actually spending maybe $1.60 or so.
Everyone wants insights, but data processing is not cheap.
And the reason why it's not cheap is because you need to use five different systems.
You're using a hot data store, a data warehouse, a
lake, multiple regions in a lake, and people are like,
oh, storage is cheap. But the complexity of troubleshooting and
all of that, all of the cost adds up; none of this is free. You can only go
so far by decommissioning old data, putting it in cold storage, and maintaining
just the basic hygiene of it, right? So then, with an architect friend of mine, a little
more than two years back,
we were like, okay, let's diagram what we are going to build.
And basically what I was trying to build is: hey, here is an abstraction to
put a connection to your data source.
Get it in. Some magic happens where you choose what your columns are and stuff like that.
And I give you one, not three tiers or four solutions or whatever,
but I give you one system that gives you standard data on the other end,
which you can immediately visualize.
And that's how I came across Fluvio and InfinyOn.
I had heard of Rust, but I was just in my happy place of Python,
Golang, and that level of optimization and whatever machine learning stack is
there, PyTorch, TensorFlow, SageMaker.
It's interesting because Python and Go are usually mentioned when people talk about data pipelines.
What was your first impression when you saw Rust?
So I just went into, okay, let me find out what is this entire thing about Rust
and WebAssembly, right?
So I just dove in a little bit into the book at a conceptual level,
tried to figure things out, built a calculator, and looked at how things work.
And I'm like, okay, this is actually amazing. I don't think people will start
writing data pipelines in Rust just because most of the data people go back
to the inertia of preferring SQL over writing Java.
They have adopted Python after a lot of work. So I don't think it is so easy
to convert all these people to start using Rust. So I didn't look at it that way.
I actually remember reading that, oh, async support is coming.
They're working on it. And then I looked at Wasm, and okay, Wasm certainly seemed
to be promising the same thing that the JVM promised, in some sense.
But the browser is way more standardized. Yeah.
In the modern-day user experience, there are so many mobile devices,
so many compute devices; everyone is browsing web pages.
So it seemed like, okay, conceptually, this seems like there is a very,
very strong probability that this would be a safer system, a faster system,
all the benchmarks are showing that.
And at the same time, if Wasm evolves, and there was a big asterisk there,
if Wasm evolves to be robust enough, where the security and other things are there,
and it becomes functional enough
that it can be general purpose for everyone.
Then I thought this would be like a system that would be miles ahead of any
existing system there is, right?
And when I looked at Fluvio, I was like, okay, this seems very interesting.
And then I spoke with A.J. and Sehyo, right?
They've been around in the Silicon Valley building products,
building software for decades, three or four now.
And they built a management layer, if you will, a control plane for Docker containers
before Kubernetes existed.
They got acquired by Nginx and they were running service mesh.
So large-scale microservices. You can think of travel booking sites where they're
running all these services to aggregate all the tickets and the prices and the
availability and the inventory, with users who are trying to book from all over.
Fairly complicated systems.
And that's how they came across using Kafka and Flink, because you need to do
all of this processing real time.
And five years ago, they started Fluvio because they had endured maybe a few
years trying to make Kafka and Flink work, not functionally.
Functionally, it was like, fine, you could write your Scala code or your Java
code or your Python code and make it all work.
But the infrastructure layer that they needed to do for maintaining this stuff
was very hard. So that's how they built Fluvio.
They were trying to build a solution that will serve as the backend system for
processing data to enable intelligent applications.
And I'm like, okay, this actually makes a lot of sense. I still know that people
in the Python and SQL world would struggle to just come into Rust.
Although in the last couple of years, there's been a lot more hype with Rust in the data space.
But I was skeptical of that. But I was convinced that this is a bet that I want
to place because it seems like this is the solution we need.
If I understand you correctly, using Python and Go is a very fine choice for data pipelines.
However, there are things that we can learn from using Rust as a platform.
In your own words, what is the big idea behind InfinyOn now?
What people are trying to build today for any intelligent application is basically:
get data from different sources,
enrich it, count and present materialized results, and add some machine learning
or AI sprinkles in between.
Right now we believe that what we need precisely is an infrastructure that is modular and composable,
and an infrastructure that is lean, meaning you don't need five or seven different
tools; you use a few. So if you think about all of the data integration, streaming,
and event processing, you should not need three or four tools; you should be able to
do it with one. And that's really our core proposition here:
that InfinyOn is trying to build the single system for end-to-end event stream processing.
And so we do that with Fluvio, which is a distributed streaming engine.
We do that with Stateful Data Flow, which is a framework for building end-to-end
stream processing pipelines.
So you connect with your data source, you can send data anywhere,
you can process it the way you need.
And in the consumer layer, you serve it to the application that you need to, right?
Which is different in the sense that today you would have a three-tier application
with your front end, back end, and database.
Then you do change data capture, write it to a warehouse, do the standard
medallion architecture on your data lake or lakehouse or whatever,
and store it in an optimized format so that you can run fast queries and reverse ETL.
Now, all of a sudden, you've got multiple tools before you can get value of your data.
If I have to interact with five, seven systems, it's going to cost me more.
It's going to be hard to maintain.
And I'll have infrastructure overhead, my dev velocity will be slower or it
would require a lot more people.
Our thesis is that people need to compute data on demand.
It doesn't necessarily need to be real-time, but on demand, concurrently, from
different sources; enrich it and materialize that data.
And you need a simple infrastructure to do that, to implement event-driven architecture.
And that's what we are building.
And when you explained that, I wondered, how would you scale that?
Am I responsible for adding new worker nodes, or do you do that on your end? Yeah.
So the cool thing is that, again, with Rust and with Wasm, and this is where Rust and Wasm shine, right?
Our core project, Fluvio, is a little over 120,000 lines of code and
compiles to a 37 MB binary, which I can compile for an ARMv7 Raspberry Pi.
This is the benefit that we get from Rust and Wasm: it enables running it
on a compatible device, which could be tiny, which could be resource-constrained.
So why do I say that? Because you can scale up quite a bit when you're running
it, let's say, on a MacBook Pro M1 or M2 style machine.
It would scale up a lot more than a
current Java-based system would. And then,
for scaling out the workers, you essentially allocate
the system, right? When you're running a service, you initialize a worker and you
initialize the topic, and we can partition the topics and scale the partitions
vertically and horizontally. We can also scale the workers that are processing different
service operations on a topic,
both vertically and horizontally. We take care of that.
And the other question I have is, which language can I use to write stream processors?
Yeah, so Fluvio's core streaming engine currently supports
three clients: Rust, Python, and JavaScript, or Node.
And it generates all the boilerplate code or the dependencies.
And you just need to focus on writing the business logic in an infrastructure as code type pattern.
Let's say I wanted to use JavaScript. How would I even interact with the incoming data?
Do I import some dependency that you provide, some sort of standard library?
Or do I get an object that has a certain name into my scope automatically somewhat globally?
And then I pass it on to the next layer. How does that work?
What you need to define is the schema with the data types.
So you define that.
And we have primitives, which are things like maps, filters, groups,
joins, whatever, those primitives.
That's all that you need to do. The system actually brings in all of the relevant dependencies.
And how about sharing state between those different stages?
Let's say I want to store some value globally so that some other component can access it.
Do you have any dependency on, say, a key value store in there or something like a Redis?
When I joined, we had some tight integrations with etcd as the key value store.
But that actually proved to be a blocker because people wanted to deploy on Docker,
on Nomad, and so on, right? So
you're like, okay, fine, we decoupled our
dependency from etcd, which had another
impact, because when I joined, the Fluvio binary
was maybe 150 to 200 MB or something, and
we were like, no, we want to just give you a small,
tiny binary. And we worked a
significant period of time, I think maybe about
six developer months, if
you will, to decouple etcd. So we
have our own key value system. It's not a full-fledged database, but everything
that is required to manage offsets, timestamps, anything you can think of in
streaming topics, we actually built that. And then it became a 14 MB
binary, and now we have added on top of that, and now it's 37 MB with all of
the bells and whistles: deduplication at multiple levels,
watermarking, offset management, all the things that you need to be certain that
you have control over your time and completeness in the API.
Yeah, long story short, we don't have any dependency on any key value store
and we interface with etcd when we are doing cloud deployment if we need to,
but most of that functionality we have built ourselves because of that.
The ideal experience that we are going for is:
you run the cluster locally, which is just downloading a binary and saying
fluvio cluster start, and it starts.
And with SDF, or Stateful Data Flow, you write your business logic and your operators
with your data model, and you iteratively build packages.
Now, you should just be able to allocate a worker and deploy it.
You don't have to worry about anything else on the cloud.
You just initialize a cluster on the cloud, you authenticate,
and you deploy it. That's pretty
much all you need to do. The rest of the stuff we take care of.
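For reference, the application side of that workflow is a small amount of client code. A minimal sketch of producing a record to a Fluvio cluster with the Rust client might look like the following, assuming the fluvio, tokio, and anyhow crates as dependencies; the topic name and payload are invented for illustration:

```rust
use fluvio::{Fluvio, RecordKey};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // Connect to the current Fluvio cluster (local or cloud, based on your profile).
    let fluvio = Fluvio::connect().await?;

    // Produce a JSON payload to an illustrative topic.
    let producer = fluvio.topic_producer("sensor-readings").await?;
    producer
        .send(RecordKey::NULL, r#"{"device":"flow-meter-1","rate":4.2}"#)
        .await?;
    producer.flush().await?;

    Ok(())
}
```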
Earlier, you mentioned that when you first looked into Rust, there was no async Rust.
We were actually doing some async stuff that
was built in, so we wrote some custom code to do async things. Then, when
async-std came up, and there were a bunch of async runtimes coming
up at the time, we went on to async-std because it was easier, right?
And then I think over a period of time,
what happened is Tokio became the go-to,
and you can see both async-std and Tokio as dependencies.
That's because we essentially first developed a few pieces of our own,
which were pretty easy to port to async-std.
And then we figured the API that async-std had, the interfaces and abstractions,
was working well, and Tokio was more functional.
So essentially, today async-std is the interface and Tokio is the runtime
for most of our workload.
And this is one of those things like whenever libraries are coming up,
these decisions are not easy.
It's like an art and a science. You look at community and sometimes things could go one way or the other.
But yeah, it's worked out pretty well, and we are pretty big
on async because we want to support all
of these services asynchronously. If you want to make an API call out and you have to
wait for the result to come back, then it becomes synchronous, and you
don't want that, right? You want it to run record by record through the log. So
it's a hard problem to solve, but async-std and Tokio have worked out pretty well for us.
Do you still support both async-std and Tokio, or how does that work internally?
Do you have a shim on top of Tokio that acts like the old async-std layer, or do you
have both as dependencies? How does that work?
We don't use the async-std runtime all that much anymore.
There are still some parts which we are essentially in the
process of, let's say, migrating over to Tokio. But we have still very much maintained
the async-std interface.
So I would say most of our asynchronous runtime stuff is being handled by Tokio.
How does it feel as a product person, someone that needs to market a tool that
is based on Rust and maybe a fast-moving target as well?
The buyer doesn't really care. As long as they get their
data, guaranteed, they are okay, right? And in
the technical buyer space, they will anchor you to a
Kafka or a Flink and then say, oh, do you have dead letter queues, do
you have this XYZ connector, do you have this, do you have that. It's
always hard to compete against mature tooling
because the cost of migration is pretty high, right? I think the biggest problem
is essentially shaping up or right-sizing a proof of concept which shows them
that, hey, this is a system that actually eliminates your need to run at
least three or four other systems,
which means that you're going to have better development velocity.
You're going to have less infrastructure overhead, and you're going to be able to innovate faster.
It's not going to need as much resources. So it's also going to cost you less
because you're doing more with less, not figuratively, quite literally.
And so the conversation there
becomes: how quickly can we show a proof of concept or a proof of value,
with examples and tutorials, and contextualize it, like, hey, let's scope out a POC and build something?
And then in the top layer, if you will, we communicate some other things around
the security and the delivery guarantees, which obviously comes from leaning on Rust.
And also the polyglot thing, which is not a completely realized dream,
but the Wasm write once, run everywhere promise, with the component model, allows us to
natively support other programming languages,
which obviously means you don't have to all of a sudden become a Rust shop
in order to use our product, right?
So all of those things put together make a decent proposal.
And I haven't really had any conversations with people who hate the idea.
It just becomes a thing of, okay, what is the value equation?
And when do we have the capacity to go for it?
What's the biggest argument for Fluvio and maybe to a larger extent for Rust
when you talk to customers?
Do they care about security, safety, scalability, reducing costs? What is it?
In the current market landscape, a lot of people would think of cost,
which was also my motivation to look at a leaner system.
Now, cost is obviously directly tied with efficiency.
You can never really sell performance because the proverbial notion is,
hey, you don't necessarily need
real time if you're making decisions at a different cadence or whatever.
But when you're doing system-to-system services that don't have a human in
the loop, then it has to be asynchronous and real-time.
Most people that actually are active on the platform today,
they are coming to us because they want an intuitive interface that allows them
to build 80% of the use cases without having to manage three or five different tools.
And that's really the core thing that they tell us because I don't tell them.
They're like, yeah, I currently use A, B, C, D, and they will list out a few
of our competitors and so on.
And they will invariably list out three to five tools, including a streaming
engine or message queue, a key value store, and an OLAP store or database.
And okay, then we've gotten into five or six different things.
And they appreciate the ability to do 80% of that stuff, on demand,
for the application, in one place.
And you said that they also come for the user interface because apparently it's
nicer than what the competitors offer.
What does the user interface look like?
Yeah, the user experience, which is broader than the interface,
is around transparency and debuggability and things like that.
So for a developer, the user experience is all about control and debuggability
and ease of finding issues and so on. So if you need to debug today, let's say,
a Flink job and you find an error, then you have to look at
the different workers that failed, and then you find the memory exception, and
then you go to your Kafka partitions and figure out the partitions went out of balance.
And this is still, to a certain extent, the pattern, right? And then you find out,
okay, it ran out of memory; we have some obvious suspects, if you will. Now, that
is a non-starter in our scenario, because you don't need to deal with the
impedance mismatch, I would call it, across different tools,
whereby, oh, they're applying this mode of processing at the low level,
serialization, deserialization, and the interfaces are now struggling to integrate with each other.
So we give you an intuitive CLI where you can just run the commands, and over
a period of time, developers are like, okay, I'll just write this command and I get it.
And inside of the different CLIs, the Fluvio CLI and the SDF CLI, you have ways to
look at the metrics, look at the logs, and if things are not working,
it gives you a good debugging experience. And then on top of that,
we have built out a UI for SDF.
A DAG is a directed acyclic graph; what is an SDF?
A stateful data flow.
And in regard to the technology that you use for the front end, is that also written in Rust?
No... so, yeah, of course it is written in Rust. I was going to say no because
it was not a library, is what I was thinking. We actually had our own framework
that we called Heaven. We built our own web framework to do visualization.
So on InfinyOn Cloud, when people log in, there is a web UI
which, when you're running a connector and
applying transforms, gives you a continuously updating real-time graph of
how many data points you're getting, how many you are processing,
what the back pressure is, and so on.
But we moved recently, actually
it's been a little while, from Heaven to Leptos, because Leptos
became the more popular go-to
framework, or library, whatever you want
to call it, for the web paradigm. So it is in Rust. But
we also have some examples which use, let's
say, something like Apache ECharts, which is a real-time visualization library,
and that runs a JavaScript server to read from a WebSocket, consume the
data from the topics that are materialized, and just show some visualization
and so on, to show other ways of doing it.
Whatever we are building as components, obviously Rust, we are a predominantly Rust shop.
Why did you move away from your own UI library?
Oh, it was like, we are a small team, 12 or so people.
The overall repository has maybe 65 or so contributors over the years, and
it was just taking up a lot of capacity to build out components that are reusable.
So it was a build-versus-reuse decision in this case.
And Leptos had matured and a couple of developers who work a lot on the middle
layer and the front end, they do some infrastructure server side,
but a lot of front end contributions are from a couple of developers in our team.
And they really loved Leptos, the way it works. And it would have been faster
for us to iterate and build.
Actually, if we were to build what we are building as the SDF UI in Heaven,
I believe it would have probably taken a lot longer.
It gave us speed and allowed us to focus on the infrastructure and the distributed
sides of things, because those are obviously more complicated problems to solve.
Yeah, and we have really appreciated using Leptos so far.
Yeah, sounds like a smart decision to focus on the core part of the product.
Which data formats does Fluvio support?
When you're actually reading events, if
you're reading from an API, you're typically thinking of an application, so you're
probably going to get JSON for the most part. You
can obviously support anything binary: if you're getting data from IoT sensors
and so on, you can load that payload; if you have a raw binary format that you want
to capture, that's doable. Different customers or prospects will come and say, okay,
can you do Avro, can you do Protobuf, where you take the schema from the file format itself.
Most of our examples are JSON, but we can read from S3, CSV, XML, whatever.
And we have smart modules on our connectors, which you can then use to convert
that to JSON, or to something else if you wanted.
And then on the materialization side, things become a little more interesting
because typically when you are materializing data, you're saying,
okay, here's the payload.
You need it to be in some form of table or data frame, if you will.
And there are some aggregates or the records for whatever it is that you want to do.
So in that, we actually integrated Apache Arrow in the materialization side.
Arrow is pretty popular; there are many projects around it.
An in-memory data frame library is how I would describe it,
which is fast and works very well with real-time data.
And then we found that there was an interface which was built on top of Arrow
called Polars, which is a very popular library as well. We actually also played
around with other engines there.
We have a DuckDB connector, because people want to materialize data in different ways.
We looked at Graphite to visualize using Grafana and things like that.
So we've got a few connectors there.
These are probably 500 to 2,000 lines of code each, in separate repositories.
And for Stateful Data Flow, for all of the materialization piece, we were actually
using a fair bit of Polars as the interface, the data frame interface, to interact with Apache Arrow.
Your topic data comes in, you run your operations, and then when it becomes
a table of top five, top 10 aggregates,
whatever it is that you want to visualize, that goes into Arrow,
the Stateful DataFrame, and then we were using the Polars interface on top of that.
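As a rough illustration of what that materialization target looks like at the Arrow level, here is a minimal sketch that builds an Arrow RecordBatch with the arrow crate; the column names and values are invented for the example:

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Float64Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

// Build a small "top products" table as an Arrow RecordBatch,
// the kind of columnar result a materialized aggregate maps onto.
fn top_products_batch() -> Result<RecordBatch, ArrowError> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("product_id", DataType::Utf8, false),
        Field::new("avg_price", DataType::Float64, false),
    ]));

    let product_ids: ArrayRef = Arc::new(StringArray::from(vec!["p-1", "p-2", "p-3"]));
    let avg_prices: ArrayRef = Arc::new(Float64Array::from(vec![19.99, 42.00, 7.50]));

    RecordBatch::try_new(schema, vec![product_ids, avg_prices])
}
```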
We have since had to build our own data frame layer.
When we were using Arrow and Polars in Stateful Data Flow, we built it out that way.
One of the nagging problems was that we are using M1 or M2 MacBooks right across our team.
We were trying to build Stateful Data Flow as is, with Arrow and Polars,
and we found that interface was actually pushing our build time, or our compile
time, to around two and a half or three minutes on my laptop.
And that's a 64 GB RAM M2 MacBook Pro, I think.
And we were like, okay, these are pretty expensive laptops. We can't expect
every developer to be working with that.
What about the person working with a Replit REPL or wanting to do this
on a Raspberry Pi or something? So we tried it on an Orange Pi.
I think SDF took maybe 45 minutes or so just to compile. To run, it doesn't take as much.
This is no good. It was this nagging problem that we had,
and there seemed to be no way to solve it.
But Polars is great, so you're like, okay, what can we do?
It was just like this OCD-type thing: we don't want
this dependency if it bloats up our binary and
increases the build time every time we try to use it. So
we basically spent, again, a couple of developer months to build our own data
frame API abstraction to interact with Apache Arrow, and that made our product
a little more over-engineered, a little more feature-rich, a little more independent,
or what have you. And now, on my machine, the build time has dropped to 15 seconds.
Does that mean you still support Apache Arrow and Polars?
Yeah, we are also very deliberately minimalistic because we don't want to become
yet another iteration of what looked like the big data ecosystem.
So we are trying to be very deliberate about what is the most necessary.
And what is most necessary is perhaps support for Python because everyone in
data knows Python, support for SQL because everyone knows SQL.
We support the serialization formats on the way in. So, you know,
you do JSON, you do binary, you do Protobuf, you do Avro.
Obviously, that will help people not have to think about how they can load data into Fluvio.
And then on the materialization side, I believe right now all the evidence suggests
Apache Arrow is something that has matured to the point where large companies
want to build on top of it and so on.
So we kept Arrow and Polars support. And on the materialization side, or the consumption
side, we will enable it to be as low level or as high level as you want.
The thing that you don't want to do is to put too many different tools to confuse
people. Like, I don't want five tools to do the same thing.
But Polars as a data frame is an iteration or evolution on top of pandas to
give them a fast data frame.
And all data scientists were already used to pandas and they want the Polars interface.
So if they want the Polars interface, have at it.
My impression is that you integrate with a lot of different technologies.
And the one thing I would wonder about as a Rust developer is how mature is
the ecosystem to interface with all of these systems?
Because you're putting the company on the line here by being,
maybe, confident enough to say that you can provide a certain level of quality
with all of these different integrations.
And that's not always a given because the ecosystem is still very young and
people expect a certain quality of code when they trust you with their data.
Absolutely. The different things that we are integrating with are
a small enough list for my liking, because if you look at the current state
of the data stack, there are probably 2,500 tools out there that you can integrate with.
But there are critical places where we integrate with mature technology.
And again, like I mentioned: Avro, Protobuf, Arrow. Polars is
not as mature as Arrow, let's say, but these are considered to be mature.
Now, the rest of the ecosystem, let's say the Wasm component model,
for example, is, you know, Dev Preview 3 running right now.
And, you know, there is a lot of work that goes into managing the infrastructure
and making sure that there is reliability and so on.
This is where the engineering experience of the entire team comes in.
Like our head of engineering was working with, you know, embedded aerospace,
autonomous aircraft and storage, disk storage systems and stuff like that.
So we've got a lot of very solid systems people to make sure that
whatever we are building is robust. We are not building it in a loose way.
Our CI/CD and our integration testing costs are probably as high as our infrastructure
costs for running things.
So we obviously make sure that we build a robust solution. Functionally, though,
you cannot do more than whatever you're depending on allows.
So for example, we were using WASM initially for smart modules,
and you could only do transformations using smart modules on streams of data.
And it was really difficult to do time-bound window operations like the stateful computation.
You could do it. But then WASM was enhanced, if you will, because WASI is very much an extension of it.
So now we have all these new interfaces that allows us to do more with it.
And then came the component model, right? So in this paradigm,
functionally, we are constrained, let's say, to deliver whatever the leading
edge of technology can deliver.
But that's actually quite a bit of critical mass because we are working with
IoT companies which are embedding Fluvio on edge devices, let's say,
flow meters that are monitoring flow rate, right?
And then using Fluvio on the cloud and stateful data flow to compute and run
their analytics, right?
And it takes care of your use cases around
predictive maintenance, detecting anomalies, fraud,
and stuff like that. And we have customers that
have now come on board to start implementing, which was not the case a couple
of years back. So, obviously, it's been a five-year journey to try and get something
over the line, and all databases take decades to mature; you can see
all of the challenges in the ecosystem there.
So certainly everything is not figured out, right? It's really figuring out
how to make the mature and opinionated,
components, jive well with the somewhat newer radioactive components,
and ensure that you don't pass on your tech debt or the challenges of this particular
synergy, if you will, to the customer or the user.
So that's why we exist to make the system.
Did you hit any runtime bugs with Rust?
Since we launched the beta, it's typically been something around,
oh, you're using a certain version of Python or a certain version of Node at
the interface level, and so the client is struggling to give you this version.
And so typically updating the user side to the latest version has solved it,
which is what you would ideally want, right?
Or you have to give them an environment, okay, these are all the requirements
or whatever, but it's been pretty smooth.
If you talk to a company about using Rust, for example, as part of their startup,
what advice would you give them?
In the context of: you are a non-vendor company, like you're not a company like
InfinyOn trying to build a streaming tool;
you are a healthcare company or an e-commerce company or whatever,
with a data team there, and you're trying to build a better, more performant system using Rust.
You have to iterate and find the parts of your system which require the most concurrency,
the most asynchronous pieces, and, piece by piece,
look at existing repositories that can help you, and build it. For any other company
that is trying to build Rust-based tooling, right, startups like us:
I think one of the things you recognize, especially in the data infrastructure
space, is that it is a long game.
You cannot come in here and think, okay, let's just put a wrapper on something.
A wrapper would not be as valuable, because there's already a Kafka wire-protocol wrapper
and a few things like that, and you can do code gen.
We were doing this stuff where we would use PyO3 to generate Python code and
generate the bindings and run it.
Now, those things are great for POC type scenarios, but in production,
you want some robustness to the solution that you build, right?
So that makes the build process a little bit longer.
And finally, I would say it's well worth it. The co-founders
were actually initially debating whether to use Golang or Rust to build Fluvio, five years ago.
And it was a fairly contentious decision. The CTO won that debate, and Rust was selected.
They had done a bunch of server-side Golang work, and Golang is pretty fast, right?
But the things that are most difficult to learn in Rust for beginners,
like the borrow checker and async traits, are the ones that guarantee safety
and security and guarantee that you get your data the way you want it.
And this is why we made, or the CTO and the CEO, when they were making the decision,
made the choice to write the Fluvio project in Rust.
In Season 1, we had Micah Wylde from Arroyo as a guest.
Do you see them as competition?
We know Micah. Micah is on our Discord.
They built a Fluvio connector or integration before we even knew about Arroyo.
I think their play is really amazing.
I don't see the entire streaming ecosystem as a competitor because everyone has their angle.
The market then decides what they want to do.
There is room for many players to take a slice of the market.
And it's not time for consolidation in this part of the data market. Micah is from Lyft.
They were running large-scale Flink operations there. And so they recognize,
again, the ability to run these data flows, which are stateful,
managed state, time versus completeness trade-offs, and give a SQL interface.
I think they are also building on top of Arrow. So I'd say in that sense, we are friends.
The difference between Arroyo and our thing would be that most engines that
you would come across, their thing would start from a distributed streaming
engine, right? Like they will start at, oh, here is a Kafka topic.
Now we are going to do the processing, right? But we built the Kafka alternative
first, which is an extremely hard market to penetrate, because Kafka has
a pretty solid stronghold there, right?
We have a slightly different proposition. It's like one system which does a
bunch of different things.
And most people that are adopting us are really bought in on that unification
and the lean proposition.
Like they are willing to write some Rust. And a lot of our current users are actually Rust shops.
And if they are not, they have at least a few developers who are picking up and writing Rust code.
And so that's slightly different. In our case, giving inline SQL or Python is
like the icing on the cake and the cherry on top. It's not the core, right?
But I believe Arroyo's core proposition is, hey, SQL, making it easy.
And I really applaud that project. I think it's a solid project.
I don't see them as a competitor necessarily, although we might have customers
to whom we provide something together. We have actually had open source users
who have come and said, oh, you know what? We are building this. That was before we had SDF:
we have Fluvio, and Arroyo has the Fluvio integration to get data from topics.
They were building this, and we have helped a few open source projects like that
to get something off the ground.
So I at least see it as more collaborative and not necessarily competitive.
What about other products like NATS, which seems to be thrown around quite
a lot lately and is written in Go? What are your thoughts on that?
Yeah, we have found a bunch of people who are NATS users. We've had on the open
source project people who have asked for the NATS connector and things like that.
We had a NATS connector built, but it was not fully functional against the latest version of the API and so on.
And if I remember correctly, there are a couple of other products on top of NATS.
I think there is another one called memphis.dev, which was NATS plus, you can say.
You could call that a competitor in some sense. But NATS, again,
like any of the competitors, right, any smaller, newer project, is going after
the incumbent, which is Kafka, right?
And the path they take is to first create an integration, which we did as well.
And then the integration becomes tighter, and then you have a diversity of options and workflows.
I think the NATS core product is pretty solid, and the buying decision
for the user would really become one of preference.
There is no way to convince one or the other.
I think we exist in a parallel ecosystem. I would perhaps be a tiny bit arrogant
and say that the alternatives
which are in the distributed streaming space, not the processing part,
would struggle to be as lean and nimble as us for the IoT
use cases, with the API callouts and enrichment.
I think of the ecosystem, by the way, in a slightly different way, whereby it's: okay,
what are the other systems that I can integrate with which could create meaningful
technology partnerships? I'm looking at
Rust-based data systems that we have synergy with.
So for example, like Arroyo is a Rust-based system and they built a Fluvio connector.
We didn't even ask them. And then, okay, great. We built a relationship with
Micah and we're happy to collaborate and move the community towards this.
Similarly, there are a bunch of Rust-based vector databases which are for Gen AI stuff.
And we are working with a couple to create something there. The team at Qdrant,
the vector database, built a Fluvio connector.
This is another thing that came as a pleasant surprise. That's okay, we have built this.
Do you want to integrate it and put it on your hub? We didn't know about it.
And they have done pretty well. OpenAI is using them. Twitter is using them, right?
That is a big enough vision, I guess, where multiple players would need to come
together to fulfill that need.
And I don't really have much time to worry about competition in that case. I'd rather
make sure that we are giving developers the features they need to implement
and give them a reliable infrastructure that works for them, right?
There are many companies in the low-code ETL space, which we don't play in,
but people tend to ask us like, oh, are you like another ETL?
They give you, out of the box, 500 connectors, but those connectors are at
a level of abstraction where you cannot do much with them other than connect,
authenticate, and access the data.
So you can do EL with it, but you cannot really do much of a T, because the T requires
you to know the model and the data types, and all this complexity comes in immediately.
And that's the thing that the data world is stuck in; it's similar to batch processing.
Another parallel mindset, like SQL, is ETL: everyone wants that out of the box, give me something.
And we're like, no, it's a secure thing that belongs to you.
I'd rather teach you how to fish than give you a fish.
It will not take you a long
time to build and maintain a connector yourself,
because it's core to your data operation. You want to own that: let us
alleviate the infrastructure, but you own anything to do with
your secrets, your authentication, your connections to your data.
I talk to clients who
run their pipelines in those no-code or low-code platforms, and their problem
is that it works really well in the beginning, but after a while, when the
company grows, they feel locked into the platform. The runtime is not
controlled by them, and they cannot really
work their way around it. And even if they
talk to the companies that they integrate with, it doesn't necessarily mean that
they have the same interests as these growing companies. And then eventually you
find yourself somehow locked into these platforms, and then, yeah, they look for
alternatives. So that seems to be a very real problem.
Yeah, someone on our Discord said that it's like a golden cage that they're locked
in; they can't escape the golden cage and fly.
I think we're getting close to the end,
but one thing I was really curious about was when you explained how you can
visualize the different stream processors and how the data would flow through
your engine and you as a user could see that,
that automatically rang a bell because I thought of logging and tracing.
And that seems to be one other area where people wanted to have more quality
of service, I guess, with OpenTelemetry and with observability in general.
What is Fluvio's story there?
We have logging at multiple levels where you can troubleshoot and debug your system.
And we also bake in a fairly exhaustive amount of metrics that can be piped into
their own topics for monitoring and observability.
And then, if you want to do more with that, like if you wanted to do an OpenTelemetry
connector to then pipe data out and have a graph on your dashboard of how that system is running,
we have prototypes, designs, stuff like that.
Most of the things that we already thought needed to be monitored and visualized,
we have already built in.
And anything additional, we have the integration capabilities that will allow
you to build a connector and pipe data out and take it to visualize it if you
wanted to visualize it with other data and so on.
We don't have an OpenTelemetry connector yet, but it has come up
often enough; if a few more people ask, we would build it.
What would be your message to the Rust community?
I really loved the enthusiasm of the Rust community overall.
And I would say, all the conferences, and your talk, even, that's how we connected:
the Four Horsemen of Bad Rust Code, if I remember that correctly.
Yeah, I think the community has done really a lot of amazing work.
And yeah, if you're looking for a distributed streaming system,
a messaging system, enterprise service bus, whatever it may be,
different people call it different things, I would say check out Fluvio.
We are happy to grow this and integrate it with other solutions,
like the few I mentioned that we have had the opportunity to integrate with.
So I'm really looking forward to partaking in the journey with the rest of the Rust community.
Deb, thank you so much for being a guest. And I hope to talk to you soon.
Yeah, thank you for having me.
Rust in Production is a podcast by corrode. It is hosted by me,
Matthias Endler, and produced by Simon Brüggen.
For show notes, transcripts, and to learn more about how we can help your company
make the most of Rust, visit corrode.dev.
Thanks for listening to Rust in Production.