This was a great post on Change Data Capture. I’m a growing data engineer, and I’ve had my fair share of experiences with siloed product engineering teams producing data in a normalized format, which then creates issues when it comes time to perform machine learning tasks like EDA and feature transformation, because of all the messy joins needed to answer or model a niche problem. I’m building a project where I’ll be coding the full stack, and I was trying to think of ways to ingest important user activity from the front-end React application with strong reliability without pushing to a Kafka topic directly; the outbox pattern, with an OLTP database and a CDC pull from the outbox, seems very grokkable.
Awesome post. CDC is not a silver bullet; always abstract the final layer exposed to the end user.
Streaming windowed joins are a lot harder to manage because of the hard resource limits imposed by the window end and state store size, so join/de-normalization pipelines should be batch-oriented in nature. For CDC I would use a mix of batch and stream transformations.
The OLAP side can comprise Raw, Curated, and Aggregate zones. The end-user serving layer should include only the de-normalized Curated zone and the Aggregate zone.
OLTP DB → CDC → [ Raw Zone (stream) → De-normalized (batch) + De-duped (stream) Curated Zone → Summary/Aggregate Layer (batch) ] OLAP
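The flow above can be sketched in a few lines of plain Python. This is only an illustration of the zone layout, not a real pipeline: the record shapes and function names are made up for the example, and in practice the stream half would run in a stream processor and the batch half in a batch engine.

```python
# Minimal sketch of the zones: stream-side de-dup of raw CDC events,
# batch-side de-normalization join, then a batch aggregate layer.
# All names and record shapes here are hypothetical.
from collections import defaultdict

def dedupe_stream(raw_events, seen=None):
    """Stream-side de-dup: drop events whose id was already processed."""
    seen = set() if seen is None else seen
    for event in raw_events:
        if event["event_id"] in seen:
            continue  # duplicate delivery from CDC, skip it
        seen.add(event["event_id"])
        yield event

def denormalize_batch(orders, customers):
    """Batch-side join: attach customer attributes to each order row."""
    by_id = {c["customer_id"]: c for c in customers}
    return [{**o, "customer_name": by_id[o["customer_id"]]["name"]}
            for o in orders]

def aggregate_batch(curated):
    """Summary/aggregate layer: total order amount per customer."""
    totals = defaultdict(float)
    for row in curated:
        totals[row["customer_name"]] += row["amount"]
    return dict(totals)

# Raw zone: CDC events, including one duplicate delivery of "e1".
raw = [
    {"event_id": "e1", "customer_id": 1, "amount": 10.0},
    {"event_id": "e1", "customer_id": 1, "amount": 10.0},
    {"event_id": "e2", "customer_id": 2, "amount": 5.0},
]
customers = [{"customer_id": 1, "name": "Ada"},
             {"customer_id": 2, "name": "Bob"}]

curated = denormalize_batch(list(dedupe_stream(raw)), customers)
print(aggregate_batch(curated))  # {'Ada': 10.0, 'Bob': 5.0}
```

Note that only the curated and aggregate outputs would be exposed to end users; the `raw` list stands in for the raw zone that stays internal.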
Thanks! Yes, I think your approach is fairly reasonable. However, you need to make sure that the raw zone is not directly available to most users. This can be somewhat awkward organizationally.
Thanks for the wonderful post. It really touches on some pain areas related to real-world problems. One point I’d like more clarity on: what about the latency of the transition from a raw stream event to a domain stream event? What options are available in the current tech stack for this transformation (DB triggers, stored procedures, or some other batch option)?
I still think that the Transactional Outbox pattern is the best approach. The latency is not a problem because it relies on the same low-latency mechanism (like log-based Change Data Capture), and you avoid the complexity of dual writes.
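For anyone new to the pattern, here is a minimal sketch of the transactional outbox using sqlite3. The table and column names are invented for the example; in production the outbox table would be read by a log-based CDC connector (e.g. Debezium) rather than polled by application code as shown here.

```python
# Transactional outbox sketch: the business write and the event write
# share ONE transaction, so there is no dual-write problem -- either
# both rows commit or neither does.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.execute(
    "CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " topic TEXT, payload TEXT)"
)

def place_order(order_id, amount):
    # `with conn` opens a transaction and commits on success
    # (or rolls back on exception), covering both inserts atomically.
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, amount))
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("orders", json.dumps({"order_id": order_id, "amount": amount})),
        )

place_order(1, 9.99)

# Stand-in for the CDC side: drain pending outbox rows in insert order.
events = [json.loads(p) for (p,) in
          conn.execute("SELECT payload FROM outbox ORDER BY id")]
print(events)  # [{'order_id': 1, 'amount': 9.99}]
```

The key design point is that the event is derived from the same committed transaction as the state change, so the downstream stream can never see an event for a write that was rolled back.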
Thank you!