This is part 2 of the Modern Data Streaming Challenges series. Part 1 is available here.
Today, I want to talk about developer experience.
What is Developer Experience, Exactly?
Ask ten different people; I bet you’d get ten different answers. In my opinion, Developer Experience (DX) includes everything that’s related to how you interact with a piece of technology: writing code, debugging, deploying, etc. You want to be able to iterate fast. Testing should be easy. Low coupling, high cohesion. And so on and so forth. Overall, you want the development process to be clear and efficient.
When talking about DX, some people focus on technicalities: you need to implement a CLI this way, local development should be done like this, or you need to be able to deploy that way, etc. Of course, a lot of these things are very important, but I think they’re somewhat secondary. In my opinion, everything originates from the programming language and programming model. Choosing the right language and model makes it much easier to have good DX.
Programming Languages and Models
I don’t want to give you a history lesson or theorize about which language is supposedly the best for data streaming: history is history, and choosing a language is often subjective. However, what’s clear to me is that making it possible to author data streaming pipelines in more languages is going to be a huge win for the industry. So, whenever a company decides to adopt data streaming to support some new AI data use case, they don’t necessarily need to learn a new language, new tooling, or a whole new DX!
I want to live in a world where the most demanding data streaming abstractions can be expressed in TypeScript. Or Ruby. Or whatever works for you.
Of course, you probably immediately hear an inner voice saying: but what about using the right tool for the right job? Sure, but how do we decide what “right” even means in this case? What do you actually need?
In my opinion, you can identify two distinct levels of abstraction for any data streaming technology:
Dataflow definition: creating a graph of connected sources, transforms and sinks. This can be expressed quite easily in any high-level language, either with functional composition, OOP, or a Builder pattern. A DataFrame-like interface is quite popular in modern data processing frameworks.
Optionally, message processing: how each individual message should be transformed, enriched, etc. This is marked as optional since it’s not always needed (e.g. in the case of a simple count without any transformations).
As you can see, there is nothing special in either of those. Any general high-level programming language should be capable of expressing both abstraction levels.
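To make this concrete, here is a minimal TypeScript sketch of what the two levels could look like. It’s a toy builder written just for this post, not any real framework: the Pipeline class, the method names, and the URI-like source/sink strings are all made up for illustration.

```typescript
// Toy example only: a dataflow definition as a declarative graph of steps,
// with message processing expressed as ordinary functions.

type Message = Record<string, unknown>;

type Step =
  | { kind: "source"; name: string }
  | { kind: "map"; fn: (msg: Message) => Message }    // message-processing level
  | { kind: "filter"; fn: (msg: Message) => boolean } // message-processing level
  | { kind: "sink"; name: string };

class Pipeline {
  private steps: Step[] = [];

  source(name: string): this {
    this.steps.push({ kind: "source", name });
    return this;
  }
  map(fn: (msg: Message) => Message): this {
    this.steps.push({ kind: "map", fn });
    return this;
  }
  filter(fn: (msg: Message) => boolean): this {
    this.steps.push({ kind: "filter", fn });
    return this;
  }
  sink(name: string): this {
    this.steps.push({ kind: "sink", name });
    return this;
  }
  describe(): string {
    return this.steps.map((s) => s.kind).join(" -> ");
  }
}

// The dataflow definition (level 1) reads declaratively; the per-message
// logic (level 2) is just plain functions.
const pipeline = new Pipeline()
  .source("kafka://orders")
  .filter((order) => (order.amount as number) > 0)
  .map((order) => ({ ...order, processedAt: Date.now() }))
  .sink("s3://enriched-orders");

console.log(pipeline.describe()); // source -> filter -> map -> sink
```

Nothing about this is TypeScript-specific: the same shape falls out of a builder or functional composition in Ruby, Go, or whatever works for you.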
So, if this is so straightforward, why don’t we see more data streaming tech implemented in various languages? In my mind, it’s a combination of historical decisions (betting on Java and Python), architectural gravity (it’s hard to migrate away) and a lot of recent focus on SQL…
And just to clarify: there are definitely many new data streaming frameworks and stream-processing libraries written in Rust, Go, TypeScript, and some exotic languages. However, none of them is even close to getting the same level of adoption as Flink, Spark, or Kafka Streams.
In fact, you can even get by with YAML for the dataflow definition (see Flink CDC, Redpanda Connect). This is where the right programming model helps a lot: typically, in this case, you want a strictly declarative language with no side effects. It’s funny how even a slight mismatch between the model and how users perceive it makes things much worse: Spark and Flink both need to serialize method closures in order to send them over the network to workers / task managers, which breaks the “this is just a declarative dataflow definition” abstraction. Naturally, it’s one of the first issues any engineer faces when they start using Spark or Flink.
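For illustration, here is a rough TypeScript analogue of that gotcha (Spark and Flink hit it through JVM serialization; the `DatabaseClient` class and the `enrich` function below are hypothetical stand-ins):

```typescript
// Illustration only: why a seemingly declarative transform stops being
// declarative the moment it captures live state from its environment.

class DatabaseClient {
  // stands in for a connection pool / live socket - inherently non-serializable
  lookup(id: string): string {
    return `customer-${id}`;
  }
}

const db = new DatabaseClient();

// Looks like a pure description of a transformation...
const enrich = (order: { customerId: string }) => ({
  ...order,
  customer: db.lookup(order.customerId), // ...but it silently captures `db`
});

// A framework that wants to run `enrich` on a remote worker has to serialize
// the closure and everything it captured. That's exactly where it breaks
// (structuredClone is available in Node 17+ / modern browsers):
try {
  structuredClone(enrich); // throws DataCloneError: functions cannot be cloned
} catch (e) {
  console.log("cannot ship this closure to a worker:", (e as Error).name);
}
```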
Finally, I want to highlight that I’m not advocating for any specific language here: I really want to avoid a “Node overload” kind of situation. The ability to choose is what matters.
Shifting Left to Make It Right
The discussion about programming languages can’t be complete without the “shift left” trend. Historically, data teams have been treated as pretty much a separate organization: most data pipelines populated internal datasets not exposed to users, so application developers never cared about the data world. This led to database migrations and back-end changes routinely breaking data pipelines.
“Shift left” prescribes that data developers be more involved in the application domain, e.g. by partnering with application teams and crafting end-to-end data products. As usual, this is mostly a people problem, so easier said than done… But this is the best approach I can think of.
Perhaps surprisingly, in practice “shift left” works both ways. You also want your application developers to care about datasets and data pipelines. And this can be extremely challenging if all your data pipelines are implemented in a different language (and maybe even a completely different programming model).
The “AI” craze we’re experiencing right now will make this even more important: you can’t build a decent, personalized “AI” experience without real-time data.
About SQL
I feel like I’m starting to sound repetitive, so bear with me. I love streaming databases and streaming SQL. They mean that people without a software engineering background, but with some SQL knowledge, can build streaming workloads (at least in theory). Incremental view maintenance (IVM) databases have made a lot of progress. TimePlus is great at combining historical and real-time data. DeltaStream has announced support for unified batch/streaming/real-time SQL. Arroyo got acquired.
And yet, I don’t think we’ll ever be able to satisfy all data streaming use cases with SQL. It comes back to having the right programming model. And, sometimes, a declarative dataflow is not expressive enough. Or not precise enough to get the right semantics or the right performance optimizations. For example, no matter how smart and powerful your query optimizer is, sometimes we know exactly what level of parallelism a certain operator needs. Or that you really need to add an extra shuffle over there. It can also be hard to reuse, refactor, and unit-test your SQL logic. dbt templating can only go so far.
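As a purely hypothetical sketch (again, no real framework here), this is the kind of physical-plan knowledge that a programmatic pipeline definition can carry, and that plain SQL has nowhere to put except vendor-specific hints:

```typescript
// Illustration only: operator-level knobs as first-class parts of the plan,
// instead of something you hope the query optimizer infers on its own.

interface OpNode {
  name: string;
  parallelism?: number; // "I know this operator needs exactly this many subtasks"
}

const plan: OpNode[] = [
  { name: "source: kafka://clicks" },
  { name: "map: parse-event", parallelism: 8 },
  { name: "shuffle: by user_id" },                 // the "extra shuffle over there"
  { name: "window: sessionize", parallelism: 32 }, // hand-tuned on purpose
  { name: "sink: s3://sessions" },
];

console.log(plan.map((op) => op.name).join(" -> "));
```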
User-defined functions (UDFs) help, but they only operate at the message-processing level. And, unfortunately, they’re frequently a black box for the runtime.
And then, getting the DX right for streaming SQL is hard! I gave this talk a few years ago, thinking I had figured out most of the challenges. But that was just the tip of the iceberg. There are still really hard, unsolved problems, like state reuse and evolution in streaming queries.
SQL does try to evolve (e.g. polymorphic table functions look really cool), so, in a way, it feels like we’re just getting started (even though the language is 40+ years old!). Some companies don’t want to wait and are extending streaming SQL with useful abstractions like changelogs right now¹².
Conclusion
So, what am I trying to say? We need to build more! More declarative (SQL) and imperative. More programming models and abstractions. More languages. Time will tell what works.
Here are a few things that have been on my mind recently.
Cloudflare Pipelines made a lot of noise when it was announced, but it doesn’t look like it can support any advanced use cases (the documentation is just sad to look at). Integrating Arroyo actually feels like a step back: we want streaming JavaScript, not SQL!
AquaLang is a great attempt to properly bridge the dataflow and message-processing levels. I like this project a lot; it shows the kind of innovation we need.
Server-side WebAssembly is finally reaching the point where it can be widely used.
And I still think about actors, especially with all the new durable execution tech: