Discussion about this post

Andrew Bempah

This was a great post on Change Data Capture. I'm a growing data engineer, and I've had my fair share of experiences with siloed product engineering teams producing data in a normalized format, which then creates issues when it comes time to perform machine learning tasks like EDA and feature transformation, because of all the messy joins needed to answer or model a niche problem. I'm building a project where I'll be coding the full stack, and I was trying to think of ways to ingest important user activity from the front-end React application with strong reliability, without the app pushing to a Kafka topic directly. The outbox pattern, where the OLTP database writes are followed by a CDC pull from the outbox table, seems very grokkable.
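To make the outbox idea concrete, here is a minimal sketch of the transactional outbox, assuming hypothetical `orders` and `outbox` tables (sqlite3 is used only to keep it self-contained; in practice this would be something like Postgres with Debezium tailing the outbox and publishing to Kafka). The key property is that the business row and its event are written in one transaction, so the event stream can never disagree with the database:

```python
# Minimal sketch of the transactional outbox pattern (illustrative only).
# Table and column names are hypothetical; in production this would
# typically be Postgres, with a CDC connector such as Debezium tailing
# the outbox table and publishing each new row to Kafka.
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        id TEXT PRIMARY KEY,
        user_id TEXT NOT NULL,
        total_cents INTEGER NOT NULL
    );
    CREATE TABLE outbox (
        id TEXT PRIMARY KEY,
        aggregate_id TEXT NOT NULL,
        event_type TEXT NOT NULL,
        payload TEXT NOT NULL          -- JSON event body
    );
""")

def place_order(user_id: str, total_cents: int) -> str:
    """Write the business row and its event in ONE transaction, so the
    event exists if and only if the state change committed."""
    order_id = str(uuid.uuid4())
    with conn:  # sqlite3 context manager = a single atomic transaction
        conn.execute(
            "INSERT INTO orders (id, user_id, total_cents) VALUES (?, ?, ?)",
            (order_id, user_id, total_cents),
        )
        conn.execute(
            "INSERT INTO outbox (id, aggregate_id, event_type, payload) "
            "VALUES (?, ?, ?, ?)",
            (
                str(uuid.uuid4()),
                order_id,
                "OrderPlaced",
                json.dumps({"user_id": user_id, "total_cents": total_cents}),
            ),
        )
    return order_id

place_order("user-42", 1999)
print(conn.execute("SELECT event_type, payload FROM outbox").fetchall())
```

The front end only ever calls the normal write API; the CDC connector does the publishing, which is where the reliability comes from.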

Naveen Nagarajan

Awesome post. CDC is not a silver bullet; always abstract the final layer exposed to the end user.

Streaming windowed joins are a lot harder to manage because of the hard resource limits imposed by window end times and state-store size, so join/de-normalization pipelines should be batch-oriented in nature. For CDC I would use a mix of batch and stream transformations.
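To illustrate the state pressure I mean, here is a toy interval-join sketch (event shapes are hypothetical): each side of the join keeps a per-key buffer, and every event must be held until its window closes, so state grows with throughput times window length. A batch join over fully materialized tables has no such constraint.

```python
# Toy windowed stream join (illustrative only). The per-key buffers are
# the "state store": they grow with event rate x window length, which is
# the resource limit that makes streaming joins hard to manage.
from collections import defaultdict

WINDOW_SECS = 60.0  # join window; state older than this is evicted

left_buf = defaultdict(list)   # key -> [(event_time, event), ...]
right_buf = defaultdict(list)

def evict(buf, watermark):
    """Drop events whose window has closed, bounding state-store size."""
    for key in list(buf):
        buf[key] = [(ts, e) for ts, e in buf[key] if ts > watermark - WINDOW_SECS]
        if not buf[key]:
            del buf[key]

def on_event(side, key, ts, event):
    """Buffer the event, evict expired state, emit in-window matches.
    (A real engine also handles out-of-order and late-arriving data.)"""
    mine, other = (left_buf, right_buf) if side == "L" else (right_buf, left_buf)
    mine[key].append((ts, event))
    evict(left_buf, ts)
    evict(right_buf, ts)
    return [{**event, **o} for ots, o in other.get(key, [])
            if abs(ts - ots) <= WINDOW_SECS]

print(on_event("L", "user-1", 100.0, {"click": "/home"}))     # [] - no match yet
print(on_event("R", "user-1", 130.0, {"page_load_ms": 240}))  # joined record
print(on_event("R", "user-1", 500.0, {"page_load_ms": 90}))   # [] - window closed
```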

The OLAP side can comprise Raw, Curated, and Aggregate zones. The end-user serving layer should include only the de-normalized Curated zone and the Aggregate zone.

OLTP DB → CDC → [ Raw Zone (stream) → De-normalized (batch) + De-duped (stream) Curated Zone → Summary/Aggregate Layer (batch) ] → OLAP
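As one reading of the "De-duped (stream)" step in the diagram above, here is a minimal last-write-wins compaction sketch (the field names `pk` and `lsn` are hypothetical): keep only the newest CDC change per primary key, ordered by log position, so the Curated zone reflects current row state rather than the full change history.

```python
# Illustrative sketch of stream de-duplication of CDC events: last-write-wins
# compaction keyed on the row's primary key, ordered by log position (LSN).
def dedupe_cdc(events):
    """Keep only the newest change per primary key."""
    latest = {}
    for e in events:
        key = e["pk"]
        if key not in latest or e["lsn"] > latest[key]["lsn"]:
            latest[key] = e
    return list(latest.values())

raw_zone = [
    {"pk": 7, "lsn": 101, "op": "INSERT", "email": "a@x.com"},
    {"pk": 7, "lsn": 205, "op": "UPDATE", "email": "b@x.com"},
    {"pk": 9, "lsn": 150, "op": "INSERT", "email": "c@x.com"},
]
print(dedupe_cdc(raw_zone))
# pk 7 keeps only the lsn-205 UPDATE; the Curated zone sees current state
```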

