This was a great post on Change Data Capture. I’m a growing data engineer, and I’ve had my fair share of experiences with siloed product engineering teams producing data in a normalized format, which then creates issues when it comes time to perform machine learning tasks like EDA and feature transformation, because of all the messy joins needed to answer or model a niche problem. I’m building a project where I’ll be coding the full stack, and I was trying to think of ways to ingest important user activity from the front-end React application with strong reliability without pushing to a Kafka topic directly; the outbox pattern, with an OLTP database and a CDC pull from the outbox, seems very grokkable.
Awesome post. CDC is not a silver bullet; always abstract the final layer exposed to the end user.
Streaming windowed joins are a lot harder to manage because of the hard resource limits imposed by the window end and state store size, so join/de-normalization pipelines should be batch-oriented in nature. For CDC I would use a mix of batch and stream transformations.
The OLAP side can comprise Raw, Curated, and Aggregate zones. The end-user serving layer should include only the de-normalized Curated zone and the Aggregate zone.
OLTP DB → CDC → [ Raw Zone (stream) → De-normalized (batch) + De-duped (stream) Curated Zone → Summary/Aggregate Layer (batch) ] OLAP
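The flow above can be sketched in a few lines of plain Python. This is only an illustration of the zone layout, not a real pipeline: the record shapes and function names are made up for the example, and in practice the stream half would run in a stream processor and the batch half in a batch engine.

```python
# Minimal sketch of the zones: stream-side de-dup of raw CDC events,
# batch-side de-normalization join, then a batch aggregate layer.
# All names and record shapes here are hypothetical.
from collections import defaultdict

def dedupe_stream(raw_events, seen=None):
    """Stream-side de-dup: drop events whose id was already processed."""
    seen = set() if seen is None else seen
    for event in raw_events:
        if event["event_id"] in seen:
            continue  # duplicate delivery from CDC, skip it
        seen.add(event["event_id"])
        yield event

def denormalize_batch(orders, customers):
    """Batch-side join: attach customer attributes to each order row."""
    by_id = {c["customer_id"]: c for c in customers}
    return [{**o, "customer_name": by_id[o["customer_id"]]["name"]}
            for o in orders]

def aggregate_batch(curated):
    """Summary/aggregate layer: total order amount per customer."""
    totals = defaultdict(float)
    for row in curated:
        totals[row["customer_name"]] += row["amount"]
    return dict(totals)

# Raw zone: CDC events, including one duplicate delivery of "e1".
raw = [
    {"event_id": "e1", "customer_id": 1, "amount": 10.0},
    {"event_id": "e1", "customer_id": 1, "amount": 10.0},
    {"event_id": "e2", "customer_id": 2, "amount": 5.0},
]
customers = [{"customer_id": 1, "name": "Ada"},
             {"customer_id": 2, "name": "Bob"}]

curated = denormalize_batch(list(dedupe_stream(raw)), customers)
print(aggregate_batch(curated))  # {'Ada': 10.0, 'Bob': 5.0}
```

Note that only the curated and aggregate outputs would be exposed to end users; the `raw` list stands in for the raw zone that stays internal.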
Thanks! Yes, I think your approach is fairly reasonable. However, you need to make sure that the raw zone is not directly available to most users. This can be somewhat awkward organizationally.
Thanks for the wonderful post. It really touches on some pain areas related to real-world problems. One point I’d like more clarity on: what about the latency of the transition from a raw stream event to a domain stream event? What options are available in the current tech stack for this transformation (DB triggers, stored procedures, or some other batch option)?
I still think that the Transactional Outbox pattern is the best approach. The latency is not a problem because it relies on the same low-latency mechanism (like log-based Change Data Capture), and you avoid the complexity of dual writes.
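For anyone new to the pattern, here is a minimal sketch of the transactional outbox using sqlite3. The table and column names are invented for the example; in production the outbox table would be read by a log-based CDC connector (e.g. Debezium) rather than polled by application code as shown here.

```python
# Transactional outbox sketch: the business write and the event write
# share ONE transaction, so there is no dual-write problem -- either
# both rows commit or neither does.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.execute(
    "CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " topic TEXT, payload TEXT)"
)

def place_order(order_id, amount):
    # `with conn` opens a transaction and commits on success
    # (or rolls back on exception), covering both inserts atomically.
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, amount))
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("orders", json.dumps({"order_id": order_id, "amount": amount})),
        )

place_order(1, 9.99)

# Stand-in for the CDC side: drain pending outbox rows in insert order.
events = [json.loads(p) for (p,) in
          conn.execute("SELECT payload FROM outbox ORDER BY id")]
print(events)  # [{'order_id': 1, 'amount': 9.99}]
```

The key design point is that the event is derived from the same committed transaction as the state change, so the downstream stream can never see an event for a write that was rolled back.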
Thank you!