Change Data Capture Is Still an Anti-pattern. And You Still Should Use It.
The transactional outbox pattern is too good to be ignored.
Change Data Capture (CDC) is a data ingestion technique used to get data out of databases (primarily relational ones). There are different types of CDC, but the most powerful and efficient one is log-based CDC. It’s usually a pull-based approach that leverages the database’s write-ahead logging (WAL) mechanism (e.g. the binlog in MySQL or logical replication slots in Postgres), receives all database operations as soon as they happen, and writes them to a destination, typically a streaming platform like Apache Kafka.
When considered purely in the context of data ingestion, CDC may seem like a great approach. But if we zoom out and consider, for example, the role of CDC in data modelling, it arguably does more harm than good. But first, let’s explore why it’s used in the first place.
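To make the mechanics concrete, here’s a minimal sketch of hand-rolled log-based CDC against Postgres using psycopg2’s logical replication support. It assumes a pre-created replication slot called cdc_slot and an illustrative connection string; in practice you’d reach for a tool like Debezium rather than writing this yourself.

```python
# Minimal log-based CDC sketch: stream decoded WAL changes from a Postgres
# logical replication slot. Connection string and slot name are assumptions.
import psycopg2
import psycopg2.extras

conn = psycopg2.connect(
    "dbname=shop user=cdc_reader",  # hypothetical connection string
    connection_factory=psycopg2.extras.LogicalReplicationConnection,
)
cur = conn.cursor()
cur.start_replication(slot_name="cdc_slot", decode=True)

def forward_change(msg):
    # Each message is one decoded WAL change (insert/update/delete).
    # A real pipeline would produce it to Kafka; here we just print it.
    print(msg.payload)
    # Acknowledge the LSN so the slot doesn't retain WAL indefinitely.
    msg.cursor.send_feedback(flush_lsn=msg.data_start)

cur.consume_stream(forward_change)  # blocks, receiving changes as they happen
```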
Application Events
Once upon a time, if you needed to get data from an application for some kind of analytics, you’d need to instrument an event. This required many things:
A message broker or a streaming platform as a scalable data ingestion system.
Some form of eventing SDK containing an event producer which simplifies the creation and submission of an event.
Modifying core business logic to include calls to the eventing SDK in relevant places, e.g. user signup, order shipment, etc.
Sometimes you can get away with webhooks instead of events, but modifying core business logic is still necessary.
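For illustration, here’s roughly what that instrumentation tends to look like inside the business logic. The EventProducer SDK and the sign-up function below are hypothetical stand-ins, not any particular library.

```python
# Sketch of an "application event" emitted from core business logic.
# EventProducer stands in for a hypothetical eventing SDK wrapping a Kafka client.
from datetime import datetime, timezone

class EventProducer:
    def __init__(self, topic: str):
        self.topic = topic

    def emit(self, event_type: str, payload: dict) -> None:
        record = {
            "type": event_type,
            "occurred_at": datetime.now(timezone.utc).isoformat(),
            "payload": payload,
        }
        # Stand-in for the actual produce call to the streaming platform.
        print(f"producing to {self.topic}: {record}")

producer = EventProducer(topic="app-events")

def sign_up_user(email: str, plan: str) -> int:
    user_id = 42  # stand-in for the real signup logic writing to the database
    # The analytics instrumentation added to the critical path:
    producer.emit("user_signed_up", {"user_id": user_id, "plan": plan})
    return user_id
```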
And this is probably the biggest challenge. It’s not a technical one, but rather organizational: the teams that worked on the core business logic and the teams that needed to collect and use application events were historically very different teams. Depending on the level of politics and bureaucracy in an organization, getting those events instrumented and collected could range from a mild inconvenience to an impossible mission (“This is a critical path! We can’t add your code here…”).
And then, thanks to microservices and the wide adoption of Debezium and similar technologies, Change Data Capture offered a different solution.
Change Data Capture to the Rescue
One of the key promises of log-based change data capture is reliability. The Application Events collection approach described above has a major flaw: dual writes. You need to write the same information to two different data stores: your operational database and a streaming platform like Kafka. Kafka and similar technologies can’t participate in a distributed transaction with your database, so you have no guarantee that the two writes are atomic. This means your operational database write may succeed while your Kafka write fails (and you may not be able to recover it).
But because log-based CDC reads the database’s WAL and typically supports at-least-once delivery, you have strong guarantees that every change will be delivered (at least eventually).
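Here’s a small sketch of what that dual write looks like in practice; the connection and producer objects are generic stand-ins rather than a specific library’s API.

```python
# The dual-write hazard: the database transaction and the Kafka produce are two
# independent operations, so a crash between them silently loses the event.
def ship_order(db_conn, kafka_producer, order_id: int) -> None:
    with db_conn:  # transaction commits when the block exits successfully
        db_conn.execute(
            "UPDATE orders SET status = 'shipped' WHERE id = ?", (order_id,)
        )
    # The database change is already committed. If the process dies here, or the
    # produce below fails for good, downstream consumers never learn about it.
    kafka_producer.produce(
        "order-events", key=str(order_id), value=b'{"type": "order_shipped"}'
    )
```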
However, here’s another “advantage” that’s not widely advertised: you don’t need to talk to the team working on the core business logic (well, at least this is what you think in the beginning)! Just point your CDC tool to the operational database; that’s it! All tables end up in your data lake or data warehouse of choice.
But here’s a problem.
You Forgot About Data Modelling
Data captured from relational tables should not be used as is. It’s heavily normalized, it may leak unnecessary implementation details, and, in the end, it can be just too verbose.
But in the age of Query-Driven Modeling, very few resources are dedicated to converting this raw data into a proper data model. Data engineers or data scientists see the tables available in the data lake and happily use them right away. And this is the moment when normalization becomes a huge issue.
This is bad for batch workloads, but it’s truly horrible for streaming. A few years ago, I gave a talk (video, slides) about building a very challenging Flink pipeline at Shopify that performed a streaming join of nine topics. In most of the resources online, you read about joining two streams (essentially a fact and a dimension) in a streaming fashion, and that alone comes with challenges. Ensuring that we could join nine topics took a lot of effort, but ultimately, we had a successful prototype. Can you guess what the source of those nine topics was? Yes, you got it right: all of them were topics captured from the operational MySQL database using CDC! All were heavily normalized.
You may want to push joins to an analytical data store instead, but not all of them have great support for joins (especially when it comes to OLAP engines).
Integration Database All Over Again?
Using databases for data integration is a well-known anti-pattern. In the world of microservices, we have learned that we should never expose our application databases to other services directly - it leads to all kinds of nasty problems with encapsulation.
Does it look like CDC tries to do exactly that? I think it does. It may use better tech, but logically it’s the same pattern.
Of course, sometimes you must use a database for integration, e.g. when replicating data.
Is There a Better Way?
Yes. Aside from changes on the technical side, you also need to talk to other humans more. I know, but it’s actually useful.
The idea of domain events is not new; it’s something that Domain-Driven Design has been advocating for years. There are a lot of online resources about this, so I’m going to highlight the difference between the application events concept I described in the beginning and domain events.
Application events typically mirror business logic: I have a user signup function over here, so I’m going to add a callback and emit an event when it’s successful.
Properly implemented domain events attempt to do some data modelling beforehand. Conceptually it’s driven by domains as boundaries, not implementation details like classes or modules. In the case of that Shopify project, we could’ve had a nice Sale domain event encapsulating data from many tables like orders, sales, line items, addresses, etc. That would’ve eliminated the need to perform most of the streaming joins, saving us months of work.
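As a rough illustration, such a Sale domain event could look something like the sketch below; the field names are made up, not the real schema from that project.

```python
# A denormalized "Sale" domain event: one message carries data that would
# otherwise live in separate orders, line_items and addresses tables.
from dataclasses import dataclass, field
from typing import List

@dataclass
class LineItem:
    sku: str
    quantity: int
    unit_price: str  # money as a decimal string, never a float

@dataclass
class SaleEvent:
    event_id: str
    order_id: str
    customer_id: str
    shipping_country: str
    line_items: List[LineItem] = field(default_factory=list)
    total_price: str = "0.00"
```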
So why didn’t we model it as a single domain event? The business logic was actually owned by a team from another department, and convincing them to invest in proper data modelling was nearly impossible… Oh, and what about reliable delivery?
Transactional Outbox
I’m not saying you should never use CDC. In fact, it solves one of the hardest problems with domain events: their reliable delivery. The Outbox pattern is a simple and elegant way to leverage CDC for delivering application or domain events. In this approach, a special “event log” table is created for writing events, so that writes to an operational table and the event log table can be performed transactionally and atomically. Then only the event log table is ingested via CDC.
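Here’s a minimal sketch of the pattern, using sqlite3 purely for illustration; the table and column names are assumptions, not a standard schema. A CDC tool such as Debezium would then ingest only the outbox table.

```python
# Transactional outbox sketch: the business row and the event row are written in
# one transaction, so either both land or neither does.
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT);
    CREATE TABLE outbox (
        id TEXT PRIMARY KEY,
        aggregate_id TEXT,
        event_type TEXT,
        payload TEXT
    );
    INSERT INTO orders VALUES ('order-1', 'created');
""")

def ship_order(order_id: str) -> None:
    with conn:  # a single transaction covering both statements
        conn.execute("UPDATE orders SET status = 'shipped' WHERE id = ?", (order_id,))
        conn.execute(
            "INSERT INTO outbox (id, aggregate_id, event_type, payload) VALUES (?, ?, ?, ?)",
            (str(uuid.uuid4()), order_id, "order_shipped",
             json.dumps({"order_id": order_id, "status": "shipped"})),
        )

ship_order("order-1")
```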
This seems great on paper, but transforming big organizations to use it is very hard. Why? You still need to talk to other humans…
About the Humans
Even with a best-in-class domain event framework and a transactional outbox up your sleeve, it takes a lot of effort to adopt them. Why? In most companies, the engineering teams responsible for the core business logic still don’t care too much about data pipelines. Data is a second-class citizen. In order to change this, you have to change the organization.
A Data Mesh is one possible solution. Despite what many vendors say, just like DevOps, Data Mesh is about organizational principles, not the underlying tech. In a Data Mesh scenario, data engineers responsible for defining domain events and building relevant data pipelines would be embedded into core engineering teams. In fact, those teams become producers and consumers of the data products they deliver, just like their application products. But as with any other organizational change, it’s much harder to pull off than switching to another framework.
Instead of a Summary
Forgive me for a slightly clickbait-y title 🙂 I truly believe that Change Data Capture, Domain Events and Data Mesh can all work together and level up our data engineering practices.
But I’d be very cautious about using CDC for “raw” table ingestion. If you do that, understand that the end result will always be second best to well-thought-out domain events.
This was a great post on Change Data Capture. I’m a growing data engineer, and I’ve had my fair share of experiences with siloed product engineering teams producing data in a normalized format and then creating issues when it comes time to perform machine learning tasks like EDA, feature transformation, etc., because of all the messy joins that need to happen to answer or model a niche problem. I’m building a project where I’ll be coding the full stack and was trying to think about ways I could ingest important user activity from the front-end React application with strong reliability without pushing to a Kafka topic directly; the outbox pattern, with an OLTP database and a CDC pull from the outbox, seems very grokkable.
Awesome post. CDC is not a silver bullet; always abstract the final layer exposed to the end user.
Streaming windowed joins are a lot harder to manage due to the hard resource limits imposed by window ends and state store size; for join/de-normalization tasks, the pipeline should be batch-oriented in nature. For CDC I would use a mix of batch and stream transformations.
The OLAP side can comprise Raw, Curated and Aggregate zones. The end-user serving layer should include only the de-normalized Curated and Aggregate zones.
OLTP DB → CDC → [ RAW ZONE (Stream) → De-normalized (Batch) + De-duped (Stream) Curated Zone → Summary/Aggregate Layer (Batch) ] OLAP