I’ve been building data streaming systems for the past 8 years, and I feel like, as an industry, we haven’t made much progress on stream-processing adoption. Yes, Kafka 4.0 and Flink 2.0 are very different from the versions of 8 years ago (in a really good way). Streaming databases are here, and streaming SQL is a thing. ML / “AI” requirements have made near-realtime data pipelines very important.
And yet, data streaming is struggling. I know many startups in the space have a hard time attracting customers. Confluent’s growth is steady but not that impressive.
Why?
I don’t have a simple answer. But I’m trying to come up with a few themes explaining the challenges of modern data streaming. I’m going to use Apache Flink for my examples since it’s the most popular stream-processing technology out there.
Today, I want to talk about efficiency.
Large-Scale Systems Are Not Necessarily Efficient
Many large-scale distributed systems like Hadoop, Spark, Kafka, and Flink are very impressive in many dimensions, but they were built to solve the scalability problem first. They’re not necessarily efficient, and it’s important to understand the difference. For example, just because Flink can scale to thousands of task slots doesn’t mean that each task slot performs its computation efficiently. A system that is both scalable and efficient is actually much more challenging to build.
Here are just a few examples from Flink’s Slack and mailing list. One person wrote:
I wanted to discuss our current system setup and see if Apache Flink would be a good fit to replace one of our consumer groups. Right now, we have two Kafka consumer groups: the raw-consumer group, which is scalable and handles the input from vehicles, and the parsed-consumer group, which handles the post-processing tasks like decoding and filtering. The issue we’re facing is that while the raw-consumer group can scale effectively with the number of vehicles, the parsed-consumer group starts to slow down when we increase the load, especially as the tasks it performs are quite CPU-intensive.
Currently, we’re running the system with 20 vehicles, and here’s the resource breakdown: ... However, when scaling to 10,000 vehicles, we’re expecting a significant increase in resource usage. …
Given this, we’re wondering if replacing the parsed-consumer group with Apache Flink would be a good solution. Flink is known for handling heavy processing tasks like decoding and filtering efficiently, and we believe it could scale better to meet the needs of our growing vehicle fleet. Would Flink be able to handle this load more effectively than our current parsed-consumer group? And if we were to make this transition, would it provide the scalability we need without running into performance issues or excessive resource consumption?
The person asking this question has a CPU-bound Kafka Consumer with straightforward stateless logic. But for some reason, they think that Flink could be more efficient for this workload. This doesn’t make sense: a Flink job will need to do at least the same amount of work, since it leverages the same Kafka Consumer logic. In practice, it’ll do much more: checkpointing, possibly additional serialization/deserialization, shuffling, etc.
Flink excels at stateful computations like joins and aggregation, but if you have a simple stateless Kafka Consumer that’s CPU bound, you should focus on profiling it, not rewriting it to a Flink job.
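To make that concrete, here’s a minimal sketch (plain Java, no Flink) of scaling a CPU-bound stateless step with a thread pool. The `decode` function is a hypothetical stand-in for the real parsing/filtering logic:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class ParallelDecode {
    // Hypothetical CPU-bound, stateless decode step (stands in for the
    // real per-record decoding/filtering logic from the quote).
    static String decode(byte[] raw) {
        return new String(raw).toUpperCase();
    }

    public static void main(String[] args) throws Exception {
        List<byte[]> batch = IntStream.range(0, 8)
                .mapToObj(i -> ("record-" + i).getBytes())
                .collect(Collectors.toList());

        // Fan the CPU-bound work out across cores; the consumer thread
        // itself would only poll Kafka and hand batches over.
        ExecutorService pool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        List<Future<String>> futures = batch.stream()
                .map(raw -> pool.submit(() -> decode(raw)))
                .collect(Collectors.toList());

        for (Future<String> f : futures) {
            System.out.println(f.get());
        }
        pool.shutdown();
    }
}
```

Profiling the decode step and fanning it out across cores like this is usually a much cheaper experiment than standing up a Flink cluster.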
Here’s another one:
I have a job entirely written in Flink SQL. The first part of the program processes 10 input topics and generates one output topic with normalized messages and some filtering applied (really simple: some WHERE clauses on fields and substring matching). Nine of the topics produce between hundreds and thousands of messages per second, with an average of 4–10 partitions each. The other topic produces 150K messages per second and has 500 partitions. They are unioned to the output topic.
The average output rate needed to avoid lag after filtering messages should be around 60K messages per second. I’ve been testing different configurations of parallelism, slots and pods (everything runs on Kubernetes), but I’m far from achieving those numbers.
In the latest configuration, I used 20 pods, a parallelism of 120, with 4 slots per taskmanager. With this setup, I achieve approximately 20K messages per second, but I’m unable to consume the largest topic at the rate messages are being produced. Additionally, setting parallelism to 120 creates hundreds of subtasks for the smaller topics, which don’t do much but still consume some resources even when idle.
…
When I check the CPU and memory usage of the pods, I don’t see any problems; they are far from their limits. Each taskmanager has 4 GB and 2 CPUs, and they never come close to using the CPU.
…. How can I improve the throughput rate? Should I be concerned about the hundreds of subtasks created for the smaller topics?
The author later mentions that they perform a join, which probably (at least partially) explains the performance degradation. Unfortunately, it’s very easy to write a poorly performing join in Flink, especially in Flink SQL.
However, it’s pretty interesting to highlight another inefficiency: because Flink doesn’t allow fine-grained parallelism tuning for Kafka sources in Flink SQL¹, the author had to apply an unnecessarily high level of parallelism to every source, wasting a lot of resources.
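To see how wasteful this gets, here’s a back-of-the-envelope sketch (plain Java; the partition counts are illustrative, loosely based on the numbers in the quote). A Kafka source subtask with no partition assigned simply sits idle:

```java
public class IdleSubtasks {
    public static void main(String[] args) {
        int parallelism = 120;                       // single job-wide setting
        int[] sourcePartitions = {4, 6, 8, 10, 500}; // assumed per-topic partition counts

        for (int partitions : sourcePartitions) {
            // Each source subtask gets zero or more partitions; with more
            // subtasks than partitions, the surplus subtasks have no work.
            int busy = Math.min(parallelism, partitions);
            int idle = parallelism - busy;
            System.out.printf("partitions=%d -> busy subtasks=%d, idle subtasks=%d%n",
                    partitions, busy, idle);
        }
    }
}
```

Only the 500-partition topic actually benefits from parallelism 120; every small topic carries over a hundred idle subtasks along for the ride.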
So, repeat after me: scalability != efficiency. My advice here is to try building good mental models for the technologies that you use and choosing the right tool for the job.
Efficiency == 💰💰💰
You may ask: but why do you care about efficiency so much?
Well, efficiency translates to cost savings. From Meta’s blog:
So, the engineer typed an “&” after the auto keyword to indicate we want a reference instead of a copy. It was a one-character commit, which, after it was shipped to production, equated to an estimated 15,000 servers in capacity savings per year!
Not bad for a slightly more efficient system, right?
Or look at Apache DataFusion Comet. It reimplements a bunch of Spark operators in a more efficient language (Rust) and runtime (Arrow/DataFusion) with ~2x overall speedup, which could mean ~2x lower bill. And it’s a drop-in replacement!
Cost efficiency is especially important in the post-ZIRP era and for teams with reduced headcounts.
But Is It Really a Big Deal?
Cost savings are nice, sure, but is it really a deal-breaker?
OK, maybe not, I agree. Even setting the cost aside, inefficiency is just one of several factors contributing to the overall problem.
However, inefficiencies can be multiplicative.
Having a single poorly performing join might be fine, but it’s hard to deal with eight of them. A slow Kafka Avro source could be ok to use, but if you need to regularly reprocess petabytes of data, it could even affect the way you design your overall architecture.
Another way to look at this: a vendor that sells a somewhat inefficient product needs to adjust its margins accordingly, which affects the final cost and experience. This gets passed on to you. If you do the same for your customers, the effect multiplies.
By the way, I also think it’s the reason why the Lambda architecture is still alive. It’s still much more efficient to process a LakeHouse table / a bunch of Parquet files than the same amount of data sitting in a Kafka topic.
Learn From Databases
I hope you’re convinced by now. What can we do?
I think the solutions are out there. “Just” copy the ideas from database research of the past 5–10 years.
Storage/compute separation is an example that everyone understands: tiered storage in Kafka, the disaggregated state store in Flink. The idea originally appeared in the database world.
Another idea is applying query-engine optimizations like predicate pushdown, which can make a massive impact on query performance and efficiency. Unfortunately, standard Kafka storage is not designed to support it. That’s why I’m very bullish on LakeHouse support for Kafka, like Confluent Tableflow and Iceberg Topics in Redpanda. Combine that with a Hybrid Source in Flink, and you get a very efficient engine.
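To illustrate why lakehouse-style storage helps, here’s a toy sketch (plain Java, not a real Parquet reader) of pushdown against per-chunk min/max statistics, the kind of metadata columnar files carry and a raw Kafka log segment doesn’t:

```java
import java.util.ArrayList;
import java.util.List;

public class PushdownSketch {
    // Toy stand-in for a Parquet row group: values plus min/max statistics.
    record Chunk(long[] values, long min, long max) {
        static Chunk of(long... values) {
            long min = Long.MAX_VALUE, max = Long.MIN_VALUE;
            for (long v : values) { min = Math.min(min, v); max = Math.max(max, v); }
            return new Chunk(values, min, max);
        }
    }

    public static void main(String[] args) {
        List<Chunk> file = List.of(
                Chunk.of(1, 5, 9), Chunk.of(12, 15, 19), Chunk.of(101, 105, 199));
        long threshold = 100; // predicate: value > 100

        int scanned = 0;
        List<Long> matches = new ArrayList<>();
        for (Chunk c : file) {
            // Pushdown: the stats alone prove the chunk can't match,
            // so we never decode its values.
            if (c.max() <= threshold) continue;
            scanned++;
            for (long v : c.values()) if (v > threshold) matches.add(v);
        }
        System.out.println("chunks scanned: " + scanned + " of " + file.size());
        System.out.println("matches: " + matches);
    }
}
```

With a plain Kafka topic, the equivalent query has to fetch and deserialize every record before it can evaluate the predicate.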
A more interesting one is columnar data processing and vectorization. I keep linking to this blog from Arroyo because it’s so good. Alibaba is working on a lot of innovations here with Fluss (columnar streaming storage) and Flash (vectorized Flink-compatible engine).
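A toy illustration of the layout difference (plain Java; real engines operate on Arrow-style vectors, not this simplification):

```java
public class ColumnarSketch {
    // Row-oriented: each record is a heap object; summing one field
    // chases a pointer per record.
    record Event(long userId, double amount, String payload) {}

    public static void main(String[] args) {
        int n = 1_000;
        Event[] rows = new Event[n];
        double[] amountColumn = new double[n]; // columnar: one flat primitive array
        for (int i = 0; i < n; i++) {
            rows[i] = new Event(i, i * 0.5, "p" + i);
            amountColumn[i] = i * 0.5;
        }

        double rowSum = 0;
        for (Event e : rows) rowSum += e.amount();

        // The tight loop over a primitive array is cache-friendly and a
        // natural candidate for JIT auto-vectorization (SIMD).
        double colSum = 0;
        for (double a : amountColumn) colSum += a;

        System.out.println(rowSum == colSum); // same answer, different memory layout
    }
}
```

Same result either way; the columnar layout is just far friendlier to caches and SIMD units, which is where the vectorized engines get their speedups.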
Finally, code specialization / compilation. It’s frequently used to optimize queries, but why not apply it more widely? We deal with a lot of schemas, so let’s leverage them!
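A minimal sketch of the idea (plain Java; real engines generate bytecode per schema, so the method reference here is just a stand-in for generated code):

```java
import java.lang.reflect.Field;
import java.util.function.ToLongFunction;

public class SpecializationSketch {
    static class Event { long userId = 42; }

    public static void main(String[] args) throws Exception {
        Event e = new Event();

        // Generic path: reflective field lookup on every record, the way a
        // fully schema-agnostic engine would access data.
        Field f = Event.class.getDeclaredField("userId");
        long viaReflection = f.getLong(e);

        // Specialized path: once the schema is known, the engine can bind a
        // direct accessor one time and reuse it for every record.
        ToLongFunction<Event> accessor = ev -> ev.userId;
        long viaSpecialized = accessor.applyAsLong(e);

        System.out.println(viaReflection == viaSpecialized);
    }
}
```

The specialized accessor does the same work per record as a hand-written loop would, instead of paying the interpretation overhead on every single field access.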
Another thing worth mentioning is Polymorphic Table Functions (PTFs) in Flink. A declarative language like SQL can make writing some simple things very hard (or inefficient). PTFs allow you to stay in the SQL ecosystem but create new, highly customizable operators.
¹ Actually, it looks like the support for specifying individual Kafka source parallelism in Flink SQL was merged a month ago! Still not released, though.