Most modern data streaming frameworks are built around the idea of a dataflow (probably inspired by papers from Google). You define sources, transformations, and sinks; the system compiles them into a dataflow topology and runs it. You could even build a layer of relational algebra on top and end up with a “streaming database”. Yet most streaming SQL flavours out there still define the same sources, transformations (or materialized views), and sinks, just declaratively.
Is there a different paradigm?
Actors 101
A recent discovery got me thinking about Actors again. The actor model is a fairly high-level programming model that I used a lot early in my career. Erlang/OTP and Akka are the most widely used implementations I know of. You may never have heard of either, but they’re quite popular in some niches (Erlang shows up anywhere you need millions of concurrent users and high reliability, e.g. WhatsApp or the Call of Duty networking stack).
Here’s a simple actor example written in Scala using the Akka framework (this uses the Classic Actors version, which I think is simpler to showcase than the current version):
import akka.actor.Actor

// The messages this actor understands
case object Hello
case object Hi
case object Inc
case object Report
case class Total(count: Int)

class MyActor extends Actor {
  var counter = 0
  def receive = {
    case Hello  => sender() ! Hi
    case Inc    => counter += 1
    case Report => sender() ! Total(counter)
  }
}
An actor can receive messages, send messages, and use an internal state. An actor typically represents an entity: a user, an account, a device, etc. You get concurrency for free: the actor system guarantees that each actor processes messages serially, but you get high concurrency by running millions of actors in parallel.
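For completeness, here’s roughly how you’d create and talk to that actor; a minimal sketch (the surrounding setup, like an enclosing App object, is omitted):

  import akka.actor.{ActorSystem, Props}
  import akka.pattern.ask
  import akka.util.Timeout
  import scala.concurrent.duration._

  val system = ActorSystem("demo")
  val actor  = system.actorOf(Props[MyActor](), "my-actor")

  actor ! Inc                               // tell: fire-and-forget
  actor ! Inc                               // processed strictly one at a time

  implicit val timeout: Timeout = 3.seconds
  val total = (actor ? Report).mapTo[Total] // ask: the reply arrives as a Future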
Actors And Messaging
Since actors exchange messages, putting a messaging system in between can make a lot of sense. This is what we did at Bench Accounting ~10 years ago: Akka Camel allowed actors in different microservices to exchange messages over ActiveMQ. It worked beautifully and forever changed the way I think about microservices and inter-service communication.
Anyhow, messaging and streaming are very similar, so…
Actors And Streaming?
Akka Streams immediately comes to mind. It combines dataflow programming with actors: an actor can be a source or a sink in your topology.
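As a rough sketch (against the classic, Akka 2.6-era API, where an implicit ActorSystem provides the materializer; Printer here stands in for any actor):

  import akka.Done
  import akka.actor.{Actor, ActorSystem, Props}
  import akka.stream.OverflowStrategy
  import akka.stream.scaladsl.{Sink, Source}

  implicit val system: ActorSystem = ActorSystem("streams-demo")

  class Printer extends Actor {
    def receive = { case msg => println(msg) }
  }
  val target = system.actorOf(Props[Printer]())

  // Materializing the graph yields an ActorRef that feeds the source;
  // every message sent to it flows through the stream into `target`.
  val in = Source
    .actorRef[String](bufferSize = 64, overflowStrategy = OverflowStrategy.dropHead)
    .map(_.toUpperCase)
    .to(Sink.actorRef(target, onCompleteMessage = Done))
    .run()

  in ! "hello"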
You can arrive at a similar paradigm with Restate (and even Flink StateFun and other tools like that, if you squint a bit).
But is it useful? Is it more expressive or better than SQL or the DataStream/DataFrame APIs?
I can’t say it’s better; it’s just different. Subjectively, it could be friendlier to engineers with little or no data experience, like web developers or back-end engineers.
I feel like the industry has been busy making streaming easier for the data scientist / data analyst persona: streaming databases, Flink SQL, etc. But streaming will never become truly ubiquitous until it’s easy enough for any engineer to use. Maybe I’m too ambitious here, but isn’t that a great goal?
Ok, let’s look at a concrete example. I’m going to use Scala/Akka-inspired pseudocode below.
Let’s say we need to enrich a stream of orders with shipping information. In Flink or other dataflow frameworks, we’d need to implement a lookup join. In SQL, you’d express it as a LEFT JOIN. However, it can also be modelled as an actor:
class EnrichedOrder(orderId: String) extends Actor {
  var orderStatus
  var orderProducts
  var subTotal
  var shippingCountry
  var shippingAddress
  var shippingPrice

  def total = subTotal + shippingPrice

  @Event("order.created")
  def handleOrderCreated(order) {
    orderStatus = order.status.getOrElse("CREATED")
    orderProducts = order.products
    subTotal = order.subTotal
  }

  @Event("shippingLabel.created")
  def handleShippingLabelCreated(shippingInfo) {
    shippingCountry = shippingInfo.country
    shippingAddress = shippingInfo.address
    shippingPrice = shippingInfo.price
  }
}
This is a very simple example, but it should be enough to get a high-level idea. Instead of thinking about data streams and data pipelines, topologies and DAGs, you just get a class that reacts to some events. Surely, this is not a new paradigm! In fact, the browser you’re reading this in uses tons of events.
Of course, this example doesn’t explain message ordering, state management, observability, etc. But neither do simple examples presented in Flink or streaming database documentation.
Actors can also change their behaviour when reacting to messages (context.become in Akka), which makes them great candidates for implementing Finite-State Machines (FSMs). And FSMs can be great at reducing complexity. For example, the previous example can have two states (a sketch follows the list):
Joining state:
accepts the order and shipping info
reply with something like NotReady if you try to get total until both events are received
could even stash all pending total requests and reply to them later when the data is available
Ready state:
doesn’t accept updates
returns the total value
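Here is a minimal sketch of those two states in classic Akka; the message types are made up for illustration, and pending total requests are stashed rather than rejected:

  import akka.actor.{Actor, Stash}

  case class OrderCreated(subTotal: BigDecimal)
  case class ShippingLabelCreated(price: BigDecimal)
  case object GetTotal

  class EnrichedOrder extends Actor with Stash {
    def receive = joining(None, None)

    // Joining state: collect both events before answering anything.
    def joining(subTotal: Option[BigDecimal], shipping: Option[BigDecimal]): Receive = {
      case OrderCreated(st)        => advance(Some(st), shipping)
      case ShippingLabelCreated(p) => advance(subTotal, Some(p))
      case GetTotal                => stash() // or reply with a NotReady message
    }

    private def advance(subTotal: Option[BigDecimal], shipping: Option[BigDecimal]): Unit =
      (subTotal, shipping) match {
        case (Some(st), Some(p)) =>
          unstashAll()                  // answer the queries stashed while joining
          context.become(ready(st + p))
        case _ =>
          context.become(joining(subTotal, shipping))
      }

    // Ready state: no more updates, just answer queries.
    def ready(total: BigDecimal): Receive = {
      case GetTotal => sender() ! total
    }
  }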
With or without FSMs, it should also be possible to implement aggregations (e.g. by incrementing an internal variable), windowing (with timers) and enrichments (RPC or async style).
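As a taste, a tumbling one-minute count needs nothing more than an internal counter plus a timer. This sketch uses the classic Akka 2.6 Timers API; the Event, EmitWindow, and WindowResult types are invented for illustration:

  import scala.concurrent.duration._
  import akka.actor.{Actor, ActorRef, Timers}

  case class Event(value: Long)
  case object EmitWindow
  case class WindowResult(count: Long)

  // Aggregation = bump an internal variable; windowing = flush on a timer.
  class WindowedCounter(downstream: ActorRef) extends Actor with Timers {
    var count = 0L
    timers.startTimerWithFixedDelay("window", EmitWindow, 1.minute)

    def receive = {
      case Event(_) => count += 1
      case EmitWindow =>
        downstream ! WindowResult(count)
        count = 0L
    }
  }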
Actors are also a great abstraction for serverless computing: process an event, hibernate (by storing internal state externally), wake up when a new event is received, etc. Of course, this only makes sense in low-throughput scenarios.
Finally, actors can also emit events. These events can be consumed by other actors, but also sent outside. By the way, regarding that…
With this paradigm, data sources are available as events and actors handle data transformations, but what about sinks? Or, more generally, what can be done with the accumulated data? I see two applications:
Actors can use their state to reply to queries directly. Flink has Queryable State and Kafka Streams has Interactive Queries; the same idea applies here. This can work really well for data products or any data-intensive backend applications.
Actors can define a way to represent their state as a changelog stream (Akka Persistence is a great example, sketched below), which could be connected to a sink. So we do end up with a kind of dataflow after all, just one that’s centred on actors.
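Here’s roughly what that looks like with classic Akka Persistence; Inc and Incremented are illustrative types:

  import akka.persistence.PersistentActor

  case object Inc
  case class Incremented(by: Int)

  // Every state change is persisted as an event first; the resulting journal
  // is effectively a changelog stream that downstream consumers can tail.
  class Counter extends PersistentActor {
    override def persistenceId = "counter-1"
    var count = 0

    override def receiveCommand = {
      case Inc => persist(Incremented(1)) { ev => count += ev.by }
    }

    override def receiveRecover = {
      case Incremented(by) => count += by // replay the changelog to rebuild state
    }
  }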
Is It That Different?
One can argue that Flink’s KeyedProcessFunction is not that different (see the sketch after this list):
processElement is how you process events.
you can create state variables.
you have access to a timer.
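For comparison, here’s a rough sketch of the same shape in Flink, using the Java API from Scala; the running count and the arbitrary 60-second timer are purely illustrative:

  import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
  import org.apache.flink.streaming.api.functions.KeyedProcessFunction
  import org.apache.flink.util.Collector

  // Per-key state plus a timer: roughly what the actor example gives you,
  // minus the message-passing model.
  class CountingFunction extends KeyedProcessFunction[String, Long, Long] {

    lazy val count: ValueState[java.lang.Long] = getRuntimeContext.getState(
      new ValueStateDescriptor("count", classOf[java.lang.Long]))

    override def processElement(
        value: Long,
        ctx: KeyedProcessFunction[String, Long, Long]#Context,
        out: Collector[Long]): Unit = {
      val current = Option(count.value()).map(_.longValue).getOrElse(0L)
      count.update(current + value)                    // a keyed state variable
      ctx.timerService().registerProcessingTimeTimer(  // access to a timer
        ctx.timerService().currentProcessingTime() + 60 * 1000)
    }

    override def onTimer(
        timestamp: Long,
        ctx: KeyedProcessFunction[String, Long, Long]#OnTimerContext,
        out: Collector[Long]): Unit =
      out.collect(Option(count.value()).map(_.longValue).getOrElse(0L))
  }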
Which is true! But I think it all comes down to developer experience in the end. Flink has its own challenges that go beyond the syntax. If the overall system is easier to use and simpler to reason about, it could be a game changer.
What do you think? I’d love to hear any feedback and welcome any ideas. Is this approach worth exploring?
My experience is that there's a big difference between actors and Flink's KeyedCoProcessFunction, both in terms of expressiveness and the problem it addresses. It's true that Lightbend made huge progress in bringing actors to the mainstream. However, it doesn't solve the problem of state across streams with timeouts. Also, durability and scalability rely on Akka Persistence plus Akka Clustering, which in my understanding only Lightbend understands well, and that feels far-fetched. It doesn't even extend to replaying messages without duplicates, or to scaling up/down.
I used KeyedProcessFunctions to develop stateful microservices for trading and to persist risk factors. I've also seen other teams using Flink StateFun, and the infrastructure that needs to be leveraged is considerably smaller. Restate is a spin-off of Flink StateFun, but with the advantage of a simpler setup.
https://restate.dev/blog/why-we-built-restate/
"""Stateful Functions (in Apache Flink): Our thoughts started a while back, and our early experiments created StateFun. These thoughts and ideas then grew to be much much more now, resulting in Restate. Of course, you can still recognize some of the StateFun roots in Restate."""
Love what this team is doing, and I encourage you to follow their work.