The idea of changelog topics or data streams is not new: the stream-table duality promoted by Kafka Streams was popularized quite a bit in 2018 (paper, blog). Related concepts have appeared regularly over the past 30 years (the earliest paper I could find is from 1992). And nowadays you can see the idea everywhere, not just in Kafka Streams:
Apache Flink has a concept of Dynamic Tables.
Decodable supports Change streams.
Timeplus supports Changelog streams.
Arroyo explains streams-table duality in the context of streaming SQL.
Changelog data streams became extremely popular with the wider adoption of Debezium and Change Data Capture. But how should one use them? And is the stream-table duality still relevant in 2023? Let’s see.
Append-Only Data Streams
Let’s start with regular, append-only data streams. Clickstream data, e.g. user activity data from web pages, is a typical example of an append-only data stream. A piece of JavaScript captures user clicks, page views and even mouse movements and reports them to an ingestion endpoint.
Most clickstream data is anonymized or pseudonymized, and there is usually a notion of a “session”, but that doesn’t necessarily map to a specific user 1:1 - one user may have many sessions during the day (sessions are typically defined with session windows).
So the main characteristics of clickstream data are:
Time-series
Anonymized or pseudonymized
Tracks user activity, but rarely contains a specific user id
Finally, it’s also somewhat ephemeral: once it’s captured, there is no way to replay it from the source of truth (the user has already left the page).
And that’s why once it’s captured, it’s usually written to a data store in an append-only fashion: there is no need to support updates or deletes. Each user action is isolated and pretty much irreversible.
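To make the append-only shape concrete, here’s a minimal sketch of producing such an event to Kafka. The broker address, topic name and event fields are invented for illustration; the point is that the record carries no key and no operation - it’s just appended.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.time.Instant;
import java.util.Properties;

public class ClickstreamProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // An append-only event: no primary key, no operation, just a fact with a timestamp.
            String event = String.format(
                "{\"session_id\":\"s-42\",\"event\":\"page_view\",\"url\":\"/pricing\",\"ts\":\"%s\"}",
                Instant.now());
            // No message key: per-key ordering doesn't matter for append-only data.
            producer.send(new ProducerRecord<>("clickstream", event));
        }
    }
}
```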
Other examples of append-only data streams include telemetry and IoT data.
Changelog Data Streams
Changelog data streams have two key characteristics:
The notion of a “primary key” (PK) borrowed from the database world
The notion of an operation (like insert, update or delete)… also borrowed from the database world 🙂
For example, a changelog payload can contain user information with the user PK and an update operation (indicating to all consumers that they should update the record instead of inserting a new one).
Changelog data doesn’t even need to be time-series data (but it’s beneficial to understand at least the relative ordering of changes with something like logical timestamps; you’ll see why in a bit).
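To make the shape of such a record concrete, here’s an illustrative envelope - the field names are invented for this post and don’t follow any particular CDC format: a primary key, an operation, a logical version for per-key ordering, and the payload itself.

```java
import java.util.Map;

/** A minimal, illustrative changelog envelope (field names invented for this post). */
record ChangeEvent(
        String key,                    // primary key, e.g. the user id
        Op op,                         // what the consumer should do with the payload
        long version,                  // logical timestamp: orders changes for the same key
        Map<String, Object> payload) { // the actual data

    enum Op { INSERT, UPDATE, DELETE }
}
```

For example, “update user 42’s email” could be expressed as new ChangeEvent("user-42", ChangeEvent.Op.UPDATE, 7, Map.of("email", "new@example.com")).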
Change Data Capture made changelog streams extremely popular. Data from tables in an operational database can be represented as changelog streams, so it can be replicated to a streaming platform like Apache Kafka or Redpanda, data lake, analytical data store, etc. However, there are some gotchas that I’ve covered here:
And this is exactly why stream-table duality is so important: a Postgres table can be represented as a Kafka stream and then mirrored to a table in Apache Pinot, for example. Quietly, the stream-table duality became the foundation of modern data integration and replication, and it’s leveraged by many frameworks like Kafka Connect and Apache Flink.
It’s somewhat funny to read the stream-table duality critique from 3 years ago: “it’s not a true duality”, “Kafka is not a database”, etc. Stream-table duality is not about replacing databases; it’s about integrating databases and stream-processing platforms and frameworks together. And it’s clearly working! Just check the links from the beginning of this post.
Changelog Data Streams as a Default Mode
I like the idea of treating any data stream as a changelog stream, even if it’s not coming from databases in the form of CDC. Even typically append-only data like clickstream and telemetry data can be modelled this way. Just treat everything as an insert message to start.
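As a rough sketch of that “insert by default” idea (reusing the illustrative ChangeEvent envelope from above, with an invented helper name), wrapping an append-only event could look like this:

```java
import java.util.Map;
import java.util.UUID;

public class ClickToChangelog {
    // Wrap an append-only click event into an "insert" changelog message.
    static ChangeEvent toInsert(Map<String, Object> clickEvent) {
        // No natural primary key here, so generate one - and keep it, in case an
        // update or delete for this record ever needs to be emitted later.
        String key = UUID.randomUUID().toString();
        return new ChangeEvent(key, ChangeEvent.Op.INSERT, 1, clickEvent);
    }
}
```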
Pros
Changelog data streams allow you to design your overall data platform with upserts in mind. This means you can emit updates and deletes for any kind of data correction, retraction, backfill or reprocessing (even if the source data doesn’t support updates). I described why upserts are very important here:
This is a huge benefit that significantly simplifies data reprocessing.
You can leverage Kafka topic compaction to store these data streams efficiently in a streaming platform.
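For instance, a compacted topic can be created with the Kafka AdminClient like this (topic name, partition/replica counts and broker address are placeholders):

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class CreateCompactedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker

        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic topic = new NewTopic("users-changelog", 6, (short) 3)
                    // Compaction keeps the latest record per message key (the primary key),
                    // so storage grows with the number of live keys, not with total traffic.
                    .configs(Map.of("cleanup.policy", "compact"));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```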
Cons
You must always have a way to generate or obtain a primary key field. This can be problematic with traditionally append-only data streams. However, it should generally be possible to fall back to a randomly generated UUID in this case - you just need to make sure to use it consistently when emitting updates or deletes.
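One illustrative way to handle this (the field choices are mine, purely for demonstration): derive a deterministic UUID from fields that identify the record, or generate a random one and persist it alongside the record so that later corrections can target the same key.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

public class StableKeys {
    // Deterministic: the same (session, event type, timestamp) triple always yields the
    // same key, so a later correction for this event can reference the same primary key.
    static String deterministicKey(String sessionId, String eventType, long tsMillis) {
        String raw = sessionId + "|" + eventType + "|" + tsMillis;
        return UUID.nameUUIDFromBytes(raw.getBytes(StandardCharsets.UTF_8)).toString();
    }

    // Random: also fine, as long as the key is stored and reused for updates and deletes.
    static String randomKey() {
        return UUID.randomUUID().toString();
    }
}
```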
Ordering within the same primary key becomes very important. With append-only data streams, the ordering of specific messages is almost always irrelevant. But if you have an insert message, an update message and a delete message for the same key, the end result once the data is materialized in a data store depends entirely on their order: you only end up with the expected outcome (no data) if the delete message is applied last. When using Kafka topics, it’s very common to use the primary key as the message key, so you get guaranteed ordering within a partition. But even with that guarantee, getting producer writes in order can be challenging. Here are just a few examples demonstrating how easy it is to get it wrong (a safer producer setup is sketched after these examples):
A Java producer and a Golang producer writing data to the same topic can end up emitting records with the same message key to different partitions (their default partitioners use different hashing algorithms), breaking the ordering guarantees.
A default Kafka producer before version 3.x may produce data out of order; see the documentation for max.in.flight.requests.per.connection:
“Note that if this configuration is set to be greater than 1 and enable.idempotence is set to false, there is a risk of message reordering after a failed send due to retries (i.e., if retries are enabled).”
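For reference, here’s a sketch of a producer setup that avoids this reordering pitfall (topic name, key and broker address are placeholders): enable idempotence explicitly on pre-3.0 clients, and key every record by the primary key so all changes for that key land in the same partition.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class OrderedChangelogProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Idempotence prevents reordering (and duplicates) on retries within a partition.
        // It is NOT the default before 3.0, so set it explicitly; alternatively, set
        // max.in.flight.requests.per.connection=1 at the cost of throughput.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keying by the primary key routes all changes for "user-42" to one partition,
            // which is what provides per-key ordering in the first place.
            producer.send(new ProducerRecord<>("users-changelog", "user-42",
                    "{\"op\":\"UPDATE\",\"email\":\"new@example.com\"}"));
        }
    }
}
```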
Some data stores (e.g. most data warehouses) still only effectively support append-only writes. In this case, you can either capture the operation as one of the table columns (which complicates queries quite a bit) or use a stream processing engine like Apache Flink to “materialize” the results for you and only write the final “view”. Unfortunately, the latter usually requires setting an explicit limit on how much the data can be delayed.
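To illustrate what “materializing” the final view means, independent of any particular engine, here’s a toy sketch (reusing the illustrative ChangeEvent from earlier, and ignoring versions and late data) that folds a changelog into the latest state per key:

```java
import java.util.HashMap;
import java.util.Map;

public class ChangelogMaterializer {
    private final Map<String, Map<String, Object>> state = new HashMap<>();

    // Apply one change event in order; conflict resolution via versions is omitted here.
    void apply(ChangeEvent event) {
        switch (event.op()) {
            case INSERT, UPDATE -> state.put(event.key(), event.payload());
            case DELETE -> state.remove(event.key()); // a delete retracts the record
        }
    }

    // The "final view" that would be written to an append-only warehouse table.
    Map<String, Map<String, Object>> snapshot() {
        return Map.copyOf(state);
    }
}
```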
Anyway, I think this idea is very exciting if you have decent support for upserts. Changelog data streams should not only be used for CDC. In fact, we’ve been using them quite a bit at my current job to power fairly generic datasets. It’s fun to watch how a record with a delete operation marker can retract data.