Disclaimer: the views and opinions expressed in this post are solely my own and do not reflect the views or opinions of my employer.
About a year ago, I started using Redpanda in production. Redpanda is a foundational building block of the platform (together with Apache Flink), supporting most of the functionality provided to our customers.
I know many people are interested in exploring Redpanda nowadays, so I wanted to write a retrospective: what went well, what didn’t go well and what can be improved.
Two important bits of context before I start:
In production, we use a fully managed cloud offering, so I don’t have experience with actually running Redpanda infrastructure (but I have a pretty good understanding of what it may look like by now).
The cloud offering that we used most of the time is called “Cloud V1”. Recently, we switched to “Cloud V2”, which was announced at the end of last year. It looks like it’s just called “Redpanda Cloud” in all marketing materials.
Wait, what is Redpanda exactly?
In my words: Redpanda is an Apache Kafka alternative written in C++. It supports most of the Kafka API and all the operational workflows you’d expect (e.g. rebalancing). Redpanda Console is an amazing UI for any Kafka-compatible product, including Redpanda.
Both are open-source and free. Redpanda Enterprise and Cloud also come with Tiered Storage, managed connectors, integrated Console and various networking and security features.
Why would you consider it over Apache Kafka?
The biggest claim that Redpanda makes is that it offers superior performance: higher throughput and lower latencies with a smaller infrastructure footprint, which translates to significant cost savings.
They published several convincing benchmarks: one, two. A few days after the publication of the second benchmark, Jack Vanlightly from Confluent replied with a critique and his own benchmarks, which Redpanda then addressed here. This has not yet escalated to a Databricks-vs-Snowflake-level war, but any competition is good for the end customer, so I hope to see more 🍿.
Now you should have all the context, let’s continue with the retrospective.
What went well
Kafka API compatibility. This feels almost magical, but it’s true: all Kafka clients just work with Redpanda without any changes. I’ve used Java, Python and Golang consumers/producers, as well as the Admin and Schema Registry APIs. The only exception to this was the DELETE_RECORDS feature, but it looks like it’s now addressed.
Redpanda Console. OMG, it’s beautiful! See for yourself here. The interface for consuming / seeking / filtering messages is extremely thoughtful. In the past, I had to use various tools like Apache Zeppelin or Apache Drill, or write custom ad-hoc applications, but it’s no longer needed.
Packaged as a single binary. Aside from the Console, everything else you may need for operating a streaming platform is packaged into a single binary: the broker itself, the Schema Registry (compatible with the Confluent one), HTTP proxy, metrics endpoint, and probably something else I’m forgetting. Not only does this reduce operational complexity, but it also makes Redpanda very easy to use for integration testing. This is what Decodable does for testing their newest SDK, for example.
Lower cost. There is no pricing calculator to show, but I found that it can be a few times cheaper than other products. The “Cut your Confluent bill by 50%!” challenge makes sense. Obviously, this is possible because of the smaller infrastructure footprint.
Tiered Storage. To this day, vanilla Apache Kafka doesn’t support Tiered Storage. Confluent added support a while ago and made it its competitive advantage, so the development for Apache Kafka is actually led by Uber.
This means that, for a while, only Confluent and Redpanda provided Tiered Storage functionality for Kafka APIs. Recently it was added to AWS MSK as well. And out of all three, I believe Redpanda implements Tiered Storage in the best way:
It supports different types of retention configuration: retention.bytes and retention.ms set total topic retention (local + remote), and retention.local.target.bytes and retention.local.target.ms set local retention. This enables fine-grained control and unlocks some interesting use-cases. AWS MSK supports the same configuration, but Confluent Cloud does not (you can only control total topic retention).
It allows using Tiered Storage for compacted topics. Note that allowing doesn’t mean supporting: there are quite a few gotchas with the way compaction works, but it’s theoretically possible to enable both compaction and Tiered Storage for a topic in Redpanda (effectively using it is a different story). Other tools don’t support it at the control plane level, so there is no way to make it work. Honestly, enabling compaction to work seamlessly with Tiered Storage is A LOT of work, but I hope to see it working one day.
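To make the split between total and local retention more concrete, here is a minimal Python sketch. It only models the time-based pair (retention.ms and retention.local.target.ms) and deliberately simplifies how Tiered Storage behaves: the TopicRetention class, the placement function and the example values are all hypothetical and not part of any Redpanda API. “local” here just means “still within the local retention target”; such data may already have been uploaded to object storage as well.

```python
from dataclasses import dataclass

HOUR_MS = 60 * 60 * 1000
DAY_MS = 24 * HOUR_MS


@dataclass(frozen=True)
class TopicRetention:
    """Hypothetical helper holding the two time-based retention settings."""

    retention_ms: int               # retention.ms: total retention (local + remote)
    retention_local_target_ms: int  # retention.local.target.ms: local retention

    def as_topic_config(self) -> dict[str, str]:
        """Render the settings as topic-level properties (string values)."""
        return {
            "retention.ms": str(self.retention_ms),
            "retention.local.target.ms": str(self.retention_local_target_ms),
        }


def placement(record_age_ms: int, cfg: TopicRetention) -> str:
    """Simplified model of where a record of a given age would be read from."""
    if record_age_ms >= cfg.retention_ms:
        return "expired"  # older than total retention, eligible for deletion
    if record_age_ms < cfg.retention_local_target_ms:
        return "local"    # still within the local retention target
    return "remote"       # only kept in object storage


# Example: keep 30 days in total, but only the last 6 hours on local disks.
cfg = TopicRetention(retention_ms=30 * DAY_MS, retention_local_target_ms=6 * HOUR_MS)
for age in (1 * HOUR_MS, 2 * DAY_MS, 31 * DAY_MS):
    print(age, placement(age, cfg))
```

The dictionary returned by as_topic_config is the kind of key/value map you would pass as topic properties through whichever Kafka admin client or CLI you already use.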
Kafka Connect support. Redpanda supports several Kafka Connect connectors. Interestingly, it looks like it actually runs Apache Kafka Connect internally (it wasn’t rewritten in C++ 🙂), so all your standard configuration options just work.
Vision. Finally, this is a big one for me. I love where Redpanda is going, I like their vision, and I believe that they can truly innovate in the space.
For example, they recently (and very quietly) announced plans to expose their Tiered Storage in the Iceberg format.
Another example is their work on WASM transformations. Redpanda doesn’t currently implement a streaming framework (no “Redpanda Streams” yet), but they’ve been working on a WASM engine for a while. I haven’t researched this enough, but it looks like an extremely novel approach, so I’m excited to try it out in the future.
What didn’t go well
Schema Registry in Cloud V1. I think that almost all customers have been migrated to Cloud V2, but this is still worth sharing. As I mentioned above, the Schema Registry API is shipped in the same binary as the Redpanda broker itself (with other features). This also means that whenever a Redpanda broker is down, its Schema Registry endpoint is down too. And in Cloud V1, the cluster-level Schema Registry endpoint that you interact with was just a DNS record pointing at all Redpanda brokers. In practice, that meant that if one broker out of five went down, ~20% of Schema Registry requests would also fail. Losing a single broker with the right replication factor is usually not a big deal, but it causes reconnects and more Schema Registry requests! Lesson learned: put a load balancer in front of your Schema Registry and only route to healthy endpoints (it seems like this is what Cloud V2 does).
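The ~20% figure and the “only route to healthy endpoints” lesson can be illustrated with a small simulation. This is a rough sketch, not a description of how Cloud V1 or Cloud V2 actually route traffic: the broker names, the health map and the uniform-random selection are assumptions made for illustration, and a real load balancer would rely on active health checks rather than a static dictionary.

```python
import random


def expected_failure_rate(total_brokers: int, down_brokers: int) -> float:
    """Share of requests hitting a dead endpoint if traffic is spread evenly."""
    return down_brokers / total_brokers


def pick_endpoint(endpoints: dict[str, bool], rng: random.Random) -> str:
    """Pick a random endpoint among those currently marked healthy."""
    healthy = [name for name, is_up in endpoints.items() if is_up]
    if not healthy:
        raise RuntimeError("no healthy Schema Registry endpoints")
    return rng.choice(healthy)


def simulate(endpoints: dict[str, bool], requests: int,
             health_aware: bool, seed: int = 42) -> float:
    """Return the fraction of simulated requests that land on a down endpoint."""
    rng = random.Random(seed)
    names = list(endpoints)
    failures = 0
    for _ in range(requests):
        if health_aware:
            target = pick_endpoint(endpoints, rng)  # load-balancer-like routing
        else:
            target = rng.choice(names)              # plain DNS-style spreading
        if not endpoints[target]:
            failures += 1
    return failures / requests


# Hypothetical five-broker cluster with one broker down.
cluster = {f"broker-{i}": True for i in range(5)}
cluster["broker-3"] = False

print("expected:", expected_failure_rate(5, 1))
print("dns-style:", simulate(cluster, 10_000, health_aware=False))
print("health-aware:", simulate(cluster, 10_000, health_aware=True))
```

With plain spreading, roughly one request in five lands on the dead broker; with health-aware selection, none do, as long as the health information is current.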
Rare bugs in Tiered Storage. Tiered Storage is a relatively new Redpanda feature, and we got unlucky and were affected by some bugs. It’d be great to see more Jepsen-like testing for Tiered Storage.
Understanding source code is hard. This is very subjective. I don’t have much experience with C++, and I find it hard to navigate the project’s source code. Anyone got a “modern C++ tutorial” they can send my way?
Overloaded nodes crash with OOMs. In my personal experience, when Apache Kafka brokers are overloaded, they become CPU starved. Redpanda nodes crash with OutOfMemory errors instead. One is not better than the other, but I’d love to see more throttling and rate-limiting mechanisms in place, especially for the Cloud offering. But please don’t ask me to track 5+ metrics for that like Confluent does.
What can be improved
Schema Registry Console UI. Right now, it’s practically read-only. Please allow me to manage my schemas from the UI!
Metered pricing and serverless. This is probably the biggest gap at the moment. I’d like to be able to start (and resize) clusters in minutes and know exactly how much I’ll pay for them. This also likely prevents many small companies from trying out the Cloud quickly. Yes, they can start with the open-source version, but the difference is significant.
Make Tiered Storage free? I hope that some Redpanda employees will read this post, so let me ask something - could you please consider making Tiered Storage free (part of the Community edition)? As far as I can see, it’s already open-source, but requires an Enterprise license or a Cloud environment.
In my opinion, Tiered Storage is a fundamental part of modern streaming data pipelines. And it’s the missing piece required for Kappa architecture to become truly widespread. Requiring an enterprise license to make the architecture work is not a great place to be - it limits innovation and adoption, especially for smaller companies. Imagine what kind of impact this will make!
Conclusion
I think that Redpanda is an amazing piece of tech. There are almost no excuses not to use it in local development and CI environments: it’ll be faster and easier to use.
When it comes to running Redpanda in production, I believe it’s ready for prime time. However, the absence of serverless and metered billing can be a blocker for many people.
Rewriting a system from scratch is rarely a good idea, but Redpanda shows us that sometimes it’s worth it. Should you switch from Apache Kafka to Redpanda? I really can’t say, but it’s definitely worth checking out.
Enjoy reading it. Quick question for
"Tiered Storage. To this day, vanilla Apache Kafka doesn’t support Tiered Storage. Confluent added support a while ago and made it its competitive advantage, so the development for Apache Kafka is actually led by Uber." Do you mean the development of Tiered Storage in Apache Kafka is led by Uber? Since you mentioned "to this day, vanilla Apache Kafka doesn’t support Tiered Storage. " so this is WIP in the OSS version, I guess.