5 Comments

Nice write-up!

I can relate to the same pain from my experience dealing with exactly-once Flink applications. If Flink fails to commit the transaction before the timeout for whatever reason, the app just gets stuck in a crashloop, because it keeps trying to commit a transaction that doesn't exist anymore.
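For anyone hitting the same thing: the usual mitigation is giving the transaction more room than your worst-case checkpoint, i.e. raising `transaction.timeout.ms` on the sink while keeping it below the broker's `transaction.max.timeout.ms`. A rough sketch with Flink's KafkaSink builder — the topic, broker address, and the 15-minute value are just placeholders:

```java
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;

public class ExactlyOnceSinkExample {

    public static KafkaSink<String> buildSink() {
        return KafkaSink.<String>builder()
                .setBootstrapServers("broker:9092")                // placeholder address
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("events")                        // placeholder topic
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
                // Transactional id prefix must stay stable across restarts
                // so recovery can find (and commit/abort) old transactions.
                .setTransactionalIdPrefix("my-app")
                // Must cover the longest checkpoint you expect, and must not
                // exceed the broker's transaction.max.timeout.ms; otherwise
                // recovery tries to commit transactions the broker already aborted.
                .setProperty("transaction.timeout.ms", "900000")   // 15 minutes
                .build();
    }
}
```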

Re: better solutions. I think upsert syntax is generally the preferable option here. For ledger-like applications that have zero tolerance for duplicates, implementing a stateful deduplication process likely requires a global window (i.e. one without a TTL). That means you end up with infinitely growing state, which leads to a whole other set of problems: storage, recovery time, latency.
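To make the trade-off concrete, here's roughly what the stateful dedup looks like in Flink — the 90-day TTL and the key extractor in the usage comment are made up for illustration; drop the TTL and you get the "global window" whose state never shrinks:

```java
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

/**
 * Emits each element only the first time its key is seen.
 * Usage (hypothetical key extractor): events.keyBy(e -> e.getId()).process(new Dedup<>())
 */
public class Dedup<T> extends KeyedProcessFunction<String, T, T> {

    private transient ValueState<Boolean> seen;

    @Override
    public void open(Configuration parameters) {
        ValueStateDescriptor<Boolean> desc =
                new ValueStateDescriptor<>("seen", Boolean.class);
        // Bounded dedup window: per-key entries expire after 90 days.
        // Removing the TTL gives exact "global" dedup at the cost of
        // state that only ever grows.
        StateTtlConfig ttl = StateTtlConfig.newBuilder(Time.days(90))
                .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
                .build();
        desc.enableTimeToLive(ttl);
        seen = getRuntimeContext().getState(desc);
    }

    @Override
    public void processElement(T event, Context ctx, Collector<T> out) throws Exception {
        if (seen.value() == null) {   // first time we see this key: emit and remember it
            seen.update(true);
            out.collect(event);
        }                             // otherwise it's a duplicate: drop it
    }
}
```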


True! "Global state" can be very painful, but you can make it work.

In general, it's rare for me to see ledger-like applications that require zero tolerance for duplicates specifically in Kafka; IMO you can usually get away with upserts.


About upserts: yes, they're better from the streaming platform's point of view because they push the problem into the next system. The "global window" problem is exactly the same on the database side. With a small amount of data it's not an issue, but once you push more than 50-100k events/s, doing upserts is not just hard, it makes queries really slow (and you can see how OLAP databases that support upserts are generally slower).

I don't have a lot of experience with Flink or other streaming technologies, but "global state" at 100-200k events/s doesn't sound easy when you're holding 3-4 months of data. We have clients sending duplicated entries across a window of more than a year.


Wow, you set yourself a very high bar with the first post on this Substack 👏

Showing the number of bugs related to exactly-once in Flink/Spark/Beam/Kafka is an interesting and convincing angle. Just like 100% uptime, it's a goal, not really something that can be achieved in the real world. I'm with you: sending upserts downstream, or applying dedup during stream processing, is much more practical. For streaming databases like ours, one or more columns can be set as primary key(s) for a data stream, so you can just send the duplicated data in and write SQL the clean way. No need to configure Kafka like a rocket scientist.
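For comparison, the same idea expressed with Flink SQL's upsert-kafka connector (not our syntax — the table, topic, and broker names are placeholders): declare the key once, send duplicates in, and queries only ever see the latest row per key.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class UpsertKeyExample {

    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Declare the primary key on the stream: repeated writes for the same
        // key are folded into the latest version instead of piling up as duplicates.
        tEnv.executeSql(
                "CREATE TABLE accounts (" +
                "  account_id STRING," +
                "  balance    DECIMAL(18, 2)," +
                "  PRIMARY KEY (account_id) NOT ENFORCED" +
                ") WITH (" +
                "  'connector' = 'upsert-kafka'," +
                "  'topic' = 'accounts'," +                          // placeholder topic
                "  'properties.bootstrap.servers' = 'broker:9092'," + // placeholder address
                "  'key.format' = 'json'," +
                "  'value.format' = 'json'" +
                ")");

        // Queries read the keyed, deduplicated view; no manual dedup logic needed.
        tEnv.executeSql("SELECT account_id, balance FROM accounts").print();
    }
}
```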


Thanks for the kind words!
