Discussion about this post

Bin H:

Nice write up!

I can relate to the same pain from my experience dealing with exactly-once Flink applications. If Flink fails to commit a transaction before the timeout for whatever reason, the app just gets stuck in a crash-looping state, because it keeps trying to commit a transaction that doesn't exist anymore.
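For anyone hitting this: the usual mitigation is to raise the producer transaction timeout well above the checkpoint interval, while keeping it under the broker's limit. A minimal sketch of an exactly-once KafkaSink in Flink's Java API (broker address, topic, and id prefix are placeholders):

```java
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;

public class ExactlyOnceSinkSketch {
    static KafkaSink<String> buildSink() {
        return KafkaSink.<String>builder()
                .setBootstrapServers("broker:9092")          // placeholder
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("events-out")              // placeholder
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
                // Must stay stable across restarts so recovery can find and
                // resolve lingering transactions from the previous run.
                .setTransactionalIdPrefix("my-app")
                // Must exceed the checkpoint interval, and must not exceed the
                // broker's transaction.max.timeout.ms (15 minutes by default);
                // otherwise a commit attempted after recovery can hit exactly
                // the "transaction no longer exists" crash loop above.
                .setProperty("transaction.timeout.ms", "600000")
                .build();
    }
}
```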

Re: better solutions. I think upsert syntax is generally the preferable option here. For ledger-like applications that have zero tolerance for duplicates, implementing a stateful deduplication process likely requires a global window (i.e., one without a TTL). This means you end up with an infinitely growing state size, which leads to a whole other set of problems, e.g. storage, recovery time, latency.
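To make that trade-off concrete, here is a minimal sketch of such a stateful dedup in Flink's Java API, assuming a hypothetical Event type keyed upstream by its unique id. The TTL line is where the tension sits: drop it and the "seen" state grows without bound; keep it and a duplicate arriving after the TTL slips through.

```java
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Hypothetical event type with a unique id; keyed via keyBy(e -> e.id).
class Event {
    public String id;
    public String payload;
}

public class DedupSketch extends KeyedProcessFunction<String, Event, Event> {
    private transient ValueState<Boolean> seen;

    @Override
    public void open(Configuration parameters) {
        ValueStateDescriptor<Boolean> desc =
                new ValueStateDescriptor<>("seen", Boolean.class);
        // The crux: without this TTL, the state keeps one entry per key
        // forever (the "global window" case); with it, duplicates older
        // than 7 days are emitted again.
        desc.enableTimeToLive(StateTtlConfig.newBuilder(Time.days(7)).build());
        seen = getRuntimeContext().getState(desc);
    }

    @Override
    public void processElement(Event e, Context ctx, Collector<Event> out)
            throws Exception {
        if (seen.value() == null) {  // first time this key is seen
            seen.update(true);
            out.collect(e);          // emit once, swallow later duplicates
        }
    }
}
```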

Jove Zhong:

Wow, you set yourself a very high bar with the first post on this Substack 👏

Showing the number of bugs related to ExactlyOnce in Flink/Spark/Beam/Kafka is an interesting and convincing angle. Just like 100% uptime, it's a goal, not really something that can be achieved in the real world. I'm with you: sending upserts downstream, or applying dedup during stream processing, is much more practical. For streaming databases like ours, one or more columns can be set as the primary key(s) of a data stream, so you can just send duplicated data in and write the SQL in a clean way. No need to configure Kafka like a rocket scientist.
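The idea generalizes beyond any one product: if each row carries a primary key and the newest row per key wins, redelivered duplicates collapse on their own. A toy sketch of those upsert semantics in plain Java (the Row type and values are made up for illustration):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class UpsertSketch {
    // Hypothetical row; "key" plays the role of the stream's primary key.
    record Row(String key, long amount) {}

    public static void main(String[] args) {
        Map<String, Row> table = new LinkedHashMap<>();
        Row[] stream = {
                new Row("order-1", 100),
                new Row("order-2", 250),
                new Row("order-1", 100), // redelivered duplicate
        };
        for (Row r : stream) {
            table.put(r.key(), r); // upsert: insert or overwrite by key
        }
        // Duplicates collapsed: 2 rows, not 3, so no double counting.
        System.out.println(table.size());
    }
}
```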

3 more comments...
