Exploring Apache DataFusion as a Foundation for a Streaming Framework
Discoveries about using Apache DataFusion in a streaming fashion.
Rust, Arrow, and DataFusion
Over the past few years, “Rewrite Bigdata in Rust” has become a real movement. The idea is that Rust can bring modern systems-level performance to data processing without the usual trade-offs. One of the key enablers in this space is Apache Arrow, an in-memory columnar format that has rapidly become the go-to for high-performance analytics.
Apache DataFusion is a query engine written in Rust that heavily leverages Arrow. It’s primarily used for building databases and query engines, but there’s also plenty of excitement in the community around real-time data processing. DataFusion was designed to be very extensible. You can add your own connectors, formats, and operators.
I’ve been working on a new stream-processing framework on top of DataFusion for the past few months¹. As you may remember, I shared some thoughts on what a new stream-processing framework might look like:
This post looks at DataFusion and tries to answer the following question: Can it serve as a solid foundation for a streaming framework? Or, put differently, how close does it come to offering what systems like Apache Flink provide out of the box?
What Does a Streaming Framework Need?
A reliable and performant stream-processing framework does more than just shuffle Kafka records around in real time. It typically addresses several key areas:
First, it needs an internal data format that can efficiently store different data types. Historically, streaming frameworks used row-based formats, but lately, I have seen more adoption of columnar formats.
Second, it provides an execution runtime that can handle unbounded data. This means the runtime must keep up with a continuous flow of records. It usually operates in a distributed fashion, so you can scale horizontally across many nodes. The runtime also tends to include backpressure mechanisms so that if downstream operators get overwhelmed, upstream producers can slow down or queue data more gracefully.
Third, connectors are essential to bring data in from external sources - like Kafka, Pulsar, Redpanda, or various databases via CDC - and to send processed results to sinks such as data lakes and OLAP databases. Connectors also need to handle the offset or checkpoint mechanism so that a crashed job doesn’t start from the very beginning unless explicitly told to do so.
Another area is fault tolerance. Continuous data processing systems are expected to run for days, weeks, or even months without interruption. Yet everything fails eventually, and patches and updates need to happen. A typical approach is Chandy–Lamport checkpointing, which involves periodically saving the progress of the streaming pipeline so it can be reconstructed after a restart.
Finally, state management and time-based semantics are required to support stateful transformations like joins and aggregations. I previously wrote about it here:
Prior Art
I know of a few projects that build stream-processing frameworks on top of DataFusion:
Arroyo. It is one of the earliest projects in the space (btw, Arroyo’s blog is top-notch, and I highly recommend following it). However, as far as I know, it mostly uses DataFusion for SQL parsing and generating logical plans. It does use Arrow as its data format. The execution runtime is custom-built, as are the connectors.
Synnada. I don’t think any of their work on top of DataFusion is open-sourced. They do give a lot of great talks, especially in the context of building a unified engine, not just a stream-processing one.
Denormalized. The latest addition. Denormalized tries to be a “DuckDB for streaming” by focusing on single-node execution.
Data Format and Execution Runtime
Apache DataFusion operates by processing Arrow batches of records (RecordBatch) through a pipeline of operators.
Arrow is a columnar data format, meaning your intermediate data is stored as a set of arrays. Each RecordBatch holds 8192 records by default. Some purists may say: isn’t that an indicator of a micro-batch system? Well, I don’t think so. Even in the case of Apache Flink, data is always batched at many levels. Processing semantics are more important in this case.
Does batching increase latency? Well, it may, but there are ways to work around it. Micah Wylde, CEO and Founder of Arroyo, explained it really well here.
DataFusion is designed as a pull-based engine. Conceptually, it means that each operator runs a tight loop that pulls data from the upstream sources. In practice, DataFusion uses Tokio Streams. I want to highlight two observations:
Tokio Stream (kinda like an iterator of Futures) is the primary abstraction, even when it comes to bounded sources (e.g. reading a bunch of Parquet files).
Pull-based execution doesn’t offer much control over backpressure. This makes it very different from Apache Flink, which can offer reliable backpressure, fine-grained flow control and adaptive buffers between operators. These things are not as important in the context of a query engine (whose goal is to read a bunch of files as fast as possible), but they do matter a lot for a streaming engine.
Anyway, having the Stream abstraction as a default way of dealing with data seems like a huge help when building a stream-processing engine!
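To make the pull model concrete, here’s a minimal sketch that builds a stream of Arrow RecordBatches and drains it in a loop, using the arrow and futures crates directly. This is an illustration of the pattern, not DataFusion’s actual operator API (which wraps such streams in its SendableRecordBatchStream type):

```rust
use std::sync::Arc;

use arrow::array::Int64Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use futures::stream::{self, StreamExt};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![Field::new("v", DataType::Int64, false)]));

    // An upstream "source": a stream of small RecordBatches.
    let batches = (0..3i64).map(|i| {
        let values = Int64Array::from(vec![i, i + 1, i + 2]);
        RecordBatch::try_new(schema.clone(), vec![Arc::new(values)])
    });
    let mut source = stream::iter(batches);

    // A downstream "operator": a tight loop pulling batches as they become ready.
    while let Some(batch) = source.next().await {
        let batch = batch?;
        println!("pulled a batch with {} rows", batch.num_rows());
    }
    Ok(())
}
```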
Does it mean that DataFusion’s runtime can easily be used as a streaming runtime out of the box? Almost, but not quite. Each operator in DataFusion defines its execution mode: Bounded, Unbounded, or PipelineBreaking. The first two should be pretty straightforward: use Unbounded operators (and build your custom ones as Unbounded) and avoid Bounded operators in a streaming context. As far as I can see, Bounded is mostly used by source operators that scan files (and by things like the LIMIT clause, the EXPLAIN command, etc.).
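For reference, the mode is an enum reported by each physical operator. Roughly (variant names as in recent DataFusion releases; the comments below are my paraphrase, not the original doc comments):

```rust
// Execution modes a DataFusion physical operator can report.
pub enum ExecutionMode {
    /// Inputs are finite; the operator eventually terminates.
    Bounded,
    /// Inputs are unbounded, but the operator can emit results
    /// incrementally; this is what a streaming pipeline needs end to end.
    Unbounded,
    /// Inputs are unbounded and the operator must buffer them (potentially
    /// with unbounded memory) before emitting; unusable for streaming.
    PipelineBreaking,
}
```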
The PipelineBreaking operators are more complicated. Here’s the description of this execution mode:
Some of the operator's input stream(s) are unbounded, but the operator cannot generate streaming results from these streaming inputs. In this case, the execution mode will be pipeline breaking, e.g. the operator requires unbounded memory to generate results. This information is used by the planner when performing sanity checks on plans processing unbounded data sources.
As you probably guessed, some types of joins and aggregations are PipelineBreaking operators, which means they can’t be used in a streaming context!
The only workaround in this case is implementing your own operator. This is what Denormalized has done to support windowed aggregations.
DataFusion operators also support partitioning. By default, since DataFusion is a single-node engine, it uses the number of CPU cores as the number of partitions. It’s possible to define the number of partitions and the partitioning strategy in your custom operators.
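The default can also be overridden at the session level. A tiny sketch (the value 4 is arbitrary; with_target_partitions is the relevant knob):

```rust
use datafusion::prelude::{SessionConfig, SessionContext};

fn main() {
    // Defaults to the number of CPU cores; override it explicitly here.
    let config = SessionConfig::new().with_target_partitions(4);
    let _ctx = SessionContext::new_with_config(config);
}
```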
Connectors: Building Your Own
For a streaming framework to be truly useful, it has to integrate with real data sources. DataFusion doesn’t provide an extensive set of built-in connectors at the moment, mostly because it was never designed specifically for data integration use cases. Out of the box, you can read different types of files (locally or from object storage). There is no standard Kafka connector.
DataFusion Table Providers is an excellent source of community-provided connectors. However, some of those connectors were not designed to run in streaming environments, so some tweaks may be needed.
You can create a custom connector by implementing the TableProvider trait. Check this handy guide. I’ve had a chance to build both source and sink connectors, and I’d say that the experience is quite nice².
Overall, it doesn’t seem too hard to write a Rust connector for common platforms like Kafka, but it becomes trickier when you factor in the need for failure recovery.
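As an illustration, here’s a hedged skeleton of a source connector. The trait’s method signatures shift between DataFusion versions (this follows a recent release where scan takes &dyn Session); KafkaTopicTable and its fields are hypothetical placeholders, and the actual unbounded ExecutionPlan is left as the hard part:

```rust
use std::any::Any;
use std::sync::Arc;

use async_trait::async_trait;
use datafusion::arrow::datatypes::SchemaRef;
use datafusion::catalog::Session;
use datafusion::datasource::TableProvider;
use datafusion::error::Result;
use datafusion::logical_expr::{Expr, TableType};
use datafusion::physical_plan::ExecutionPlan;

/// Hypothetical Kafka-backed table; a real connector would also hold
/// consumer configuration, the topic name, deserializers, etc.
struct KafkaTopicTable {
    schema: SchemaRef,
}

#[async_trait]
impl TableProvider for KafkaTopicTable {
    fn as_any(&self) -> &dyn Any {
        self
    }

    fn schema(&self) -> SchemaRef {
        self.schema.clone()
    }

    fn table_type(&self) -> TableType {
        TableType::Base
    }

    async fn scan(
        &self,
        _state: &dyn Session,
        _projection: Option<&Vec<usize>>,
        _filters: &[Expr],
        _limit: Option<usize>,
    ) -> Result<Arc<dyn ExecutionPlan>> {
        // The real work: return a custom ExecutionPlan whose stream polls
        // the topic and yields RecordBatches forever (declared Unbounded).
        todo!("wire up an unbounded ExecutionPlan over the Kafka consumer")
    }
}
```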
Checkpointing and Fault Tolerance
Checkpointing is one of the biggest differences between a “batch query engine” and a genuine streaming framework. Systems like Apache Flink store a snapshot of the entire pipeline state at regular intervals so they can resume from the last good snapshot if something goes wrong. DataFusion just doesn't have anything like that.
I’d say that correctly implementing checkpointing and failure recovery would probably be the hardest part of building an MVP of a streaming engine on top of DataFusion.
I’d encourage you to check out the checkpointing implementations in Arroyo and Denormalized.
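To give a flavor of the Chandy–Lamport idea mentioned earlier, here’s a toy sketch: checkpoint barriers travel through the same channel as data, and an operator snapshots its state whenever a barrier arrives. Everything below is illustrative; a real system aligns barriers across multiple inputs and persists snapshots durably:

```rust
// Toy Chandy-Lamport-style checkpointing: barriers flow with the data, and
// each operator snapshots its state when it sees one.
enum Message {
    Data(i64),
    Barrier { checkpoint_id: u64 },
}

struct CountingOperator {
    count: u64,
    // (checkpoint_id, count at that point); a real system would persist this.
    snapshots: Vec<(u64, u64)>,
}

impl CountingOperator {
    fn process(&mut self, msg: Message) {
        match msg {
            Message::Data(_) => self.count += 1,
            Message::Barrier { checkpoint_id } => {
                // Snapshot local state, then (in a real pipeline) forward
                // the barrier to downstream operators.
                self.snapshots.push((checkpoint_id, self.count));
            }
        }
    }
}

fn main() {
    let mut op = CountingOperator { count: 0, snapshots: Vec::new() };
    let messages = [
        Message::Data(7),
        Message::Data(8),
        Message::Barrier { checkpoint_id: 1 },
        Message::Data(9),
    ];
    for msg in messages {
        op.process(msg);
    }
    // On restart, the pipeline would resume from snapshot (1, 2).
    assert_eq!(op.snapshots, vec![(1, 2)]);
}
```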
Scaling Out Beyond a Single Node
DataFusion runs on a single node by default. While that might be sufficient for some workloads, many streaming use cases can easily exceed the capacity of one machine. Two main paths come to mind:
You can run multiple instances of DataFusion in a Kafka Streams–style model, where a consumer group is used to coordinate partition assignment across different instances (see the sketch after this list). Of course, it means you can only support Kafka-compatible sources.
You could look at distributed frameworks like Ballista (part of the DataFusion ecosystem) or general-purpose engines like Ray for parallel execution across many nodes. Both Ballista and Ray have been getting a lot of attention recently from folks wanting to run DataFusion at scale. However, this option means your system now needs to support shuffles, which brings a lot of complexity. I haven’t had a chance to explore these tools yet, but the first thing I’d evaluate is support for streaming execution: just having a shuffle operator is not enough, you need a streaming shuffle.
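To make the first option concrete, here’s a sketch of a worker joining a shared consumer group using the rdkafka crate. The broker address, group id, and topic are placeholders; Kafka assigns each running instance a subset of the topic’s partitions automatically:

```rust
use rdkafka::config::ClientConfig;
use rdkafka::consumer::{Consumer, StreamConsumer};
use rdkafka::Message;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Every worker uses the same group.id, so the broker spreads the
    // topic's partitions across all running instances.
    let consumer: StreamConsumer = ClientConfig::new()
        .set("bootstrap.servers", "localhost:9092")
        .set("group.id", "datafusion-streaming-job")
        .set("auto.offset.reset", "earliest")
        .create()?;
    consumer.subscribe(&["events"])?;

    loop {
        let msg = consumer.recv().await?;
        // Feed the payload into the local DataFusion pipeline here.
        if let Some(payload) = msg.payload() {
            println!("partition {} -> {} bytes", msg.partition(), payload.len());
        }
    }
}
```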
No Typical Stateful Streaming Features
It looks like DataFusion doesn’t have built-in operators that provide arbitrary state access (something similar to ValueState in Flink). Creating such functionality feels straightforward - “just” add a wrapper around RocksDB or SlateDB. However, again, I think a lot of complexity around checkpointing and failure recovery will need to be addressed.
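As a sketch of what such an abstraction could look like, here’s a toy ValueState over a hypothetical KvStore trait (an in-memory map stands in for RocksDB/SlateDB; none of this is a DataFusion or Flink API):

```rust
use std::collections::HashMap;

// Hypothetical key-value store interface; a real one would wrap RocksDB or
// SlateDB and participate in checkpointing.
trait KvStore {
    fn get(&self, key: &[u8]) -> Option<Vec<u8>>;
    fn put(&mut self, key: &[u8], value: Vec<u8>);
}

// In-memory stand-in for the real store.
struct MemStore(HashMap<Vec<u8>, Vec<u8>>);

impl KvStore for MemStore {
    fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
        self.0.get(key).cloned()
    }
    fn put(&mut self, key: &[u8], value: Vec<u8>) {
        self.0.insert(key.to_vec(), value);
    }
}

/// Per-key state handle, roughly analogous to Flink's ValueState<u64>.
struct ValueState<'a, S: KvStore> {
    store: &'a mut S,
    key: Vec<u8>,
}

impl<'a, S: KvStore> ValueState<'a, S> {
    fn value(&self) -> Option<u64> {
        self.store
            .get(&self.key)
            .map(|b| u64::from_le_bytes(b.try_into().expect("8-byte value")))
    }
    fn update(&mut self, v: u64) {
        self.store.put(&self.key, v.to_le_bytes().to_vec());
    }
}

fn main() {
    let mut store = MemStore(HashMap::new());
    let mut count = ValueState { store: &mut store, key: b"user-42".to_vec() };
    let next = count.value().unwrap_or(0) + 1;
    count.update(next);
    assert_eq!(count.value(), Some(1));
}
```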
Many streaming scenarios revolve around time-based aggregations and handling out-of-order events with watermarks. DataFusion currently doesn’t support these concepts at all (there is no need for them in batch execution). Tokio timers would probably be one way to implement windowing.
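For intuition, here’s a toy event-time tumbling-window counter with watermarks. The window size, lateness, and types are all illustrative, and a real implementation would also need timers to close windows when events stop arriving:

```rust
use std::collections::BTreeMap;

const WINDOW_MS: i64 = 10_000;
const ALLOWED_LATENESS_MS: i64 = 2_000;

struct TumblingWindows {
    counts: BTreeMap<i64, u64>, // window start -> event count
    max_event_time: i64,
}

impl TumblingWindows {
    /// Ingest one event; return the windows closed by the new watermark.
    fn on_event(&mut self, event_time_ms: i64) -> Vec<(i64, u64)> {
        // Assign the event to its tumbling window.
        let window_start = event_time_ms - event_time_ms.rem_euclid(WINDOW_MS);
        *self.counts.entry(window_start).or_insert(0) += 1;
        self.max_event_time = self.max_event_time.max(event_time_ms);

        // Watermark: max event time seen, minus the allowed lateness.
        let watermark = self.max_event_time - ALLOWED_LATENESS_MS;

        // Close every window whose end is at or behind the watermark
        // (split_off keeps keys below the cutoff in self.counts).
        let open = self.counts.split_off(&(watermark - WINDOW_MS + 1));
        let closed = std::mem::replace(&mut self.counts, open);
        closed.into_iter().collect()
    }
}

fn main() {
    let mut windows = TumblingWindows { counts: BTreeMap::new(), max_event_time: 0 };
    for t in [1_000, 5_000, 12_000, 25_000] {
        for (start, count) in windows.on_event(t) {
            println!("window [{start}, {}) closed with {count} events", start + WINDOW_MS);
        }
    }
}
```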
Conclusion
Building on top of DataFusion is not for the faint of heart. There is very little documentation, and most of the time, you end up reading the source to understand what’s going on. Thankfully, the source code is very friendly!
If you’re serious about building a production-grade stream-processing framework, you’ll need to invest a lot:
make sure the operators you want to support can be executed in a streaming environment.
have some form of failure recovery.
build source and sink connectors.
find a way to scale out (either via something like Kafka consumer groups or by adopting Ballista/Ray).
(if you need stateful streaming) introduce a key-value store and windowing abstractions.
Sounds like a lot of work! 🙂 However, building an engine from scratch is way more work. DataFusion is an excellent foundation, and it’s highly customizable - you can change almost any aspect of the system if you don’t like it. The list of known users is very impressive!
Also, the “Rewrite Big Data in Rust” movement has momentum, and plenty of folks are intrigued by the idea of a next-generation streaming engine that combines safety, speed, and a modern language ecosystem. Watch this space.
¹ It’s part of a consulting project I’m working on for a client. There is no appetite for open-sourcing it anytime soon.
² Unless you need to deal with Avro. Unfortunately, its support is still not great.