Dataflow CPUs
9/27/2025
An exploration of dataflow architectures, initially inspired by this paper.

These days, most CPU vendors differentiate themselves in a few ways:
- More cores,
- Specialized cores,
- Wider vectors, different vector schemes (SVE/SME),
- Lower power consumption
(Arm’s scalable vector extensions are probably my favorite from this list.)
None of these, though, actually breaks the basic assumption behind the machines: a program is (conceptually) a sequence of instructions executed in some order.
That assumption is convenient. It maps directly to how we think about writing code and how compilers and OSes are organized. But it’s not the only way to organize computation. Enter: dataflow architectures.
Dataflow vs Von Neumann
In a von Neumann machine you have a program counter, you fetch instructions, and you execute them (with pipelines, speculation, out-of-order tricks, etc.). Dependencies are enforced by registers, memory and the compiler’s choices. The result: when the hardware waits for data, you get wasted cycles.
Dataflow flips the control model. Instructions don’t wait for a program counter — they fire when their inputs arrive. Computation is driven by data tokens; tokens carry the readiness that triggers work. You can think of it like a kitchen where dishes get cooked whenever their ingredients show up, not when some head chef calls the next order.
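The firing rule can be sketched in a few lines of Python. Everything here (the `Instruction` class, the `ready` list standing in for the hardware's pool of fireable instructions) is an illustrative toy, not any real machine's mechanism:

```python
class Instruction:
    def __init__(self, name, op, n_inputs, consumers):
        self.name = name            # label, for tracing
        self.op = op                # the actual computation
        self.n_inputs = n_inputs    # how many operand tokens we need
        self.consumers = consumers  # (instruction, port) pairs fed by our result
        self.waiting = {}           # port -> token value received so far

    def receive(self, port, value, ready):
        self.waiting[port] = value
        if len(self.waiting) == self.n_inputs:  # all operands arrived?
            ready.append(self)                  # eligible to fire

    def fire(self, ready):
        args = [self.waiting[p] for p in sorted(self.waiting)]
        result = self.op(*args)
        for consumer, port in self.consumers:   # emit result tokens downstream
            consumer.receive(port, result, ready)
        return result

# Build (a + b) * (a - b). Arrival of tokens, not textual order,
# decides when each node runs -- there is no program counter.
mul = Instruction("mul", lambda x, y: x * y, 2, [])
add = Instruction("add", lambda x, y: x + y, 2, [(mul, 0)])
sub = Instruction("sub", lambda x, y: x - y, 2, [(mul, 1)])

ready = []
add.receive(0, 7, ready); add.receive(1, 3, ready)  # tokens for a, b
sub.receive(0, 7, ready); sub.receive(1, 3, ready)
while ready:
    last = ready.pop(0).fire(ready)
print(last)  # (7 + 3) * (7 - 3) = 40
```

Notice that `mul` never appears in the "schedule" until both of its operand tokens exist; that's the whole model.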
A pretty good mental model for this can be found in an old parallel programming workbook:
The Linda Model
How to Write Parallel Programs, a charming little workbook, uses the Linda model to explain parallel programming concepts. I think this model is a very good fit for understanding dataflow architectures:
The Linda model is a memory model. Linda memory (called tuple space) consists of a collection of logical tuples.
There are two kinds of tuples. Process tuples are under active evaluation; data tuples are passive. The process tuples (which are all executing simultaneously) exchange data by generating, reading and consuming data tuples. A process tuple that is finished executing turns into a data tuple, indistinguishable from other data tuples.
Linda exposes a shared tuple space: data tuples sit in the space, and process tuples (the active things) consume and produce them.
Parallelism is more naturally exposed; you might imagine Linda as a giant bowl of soup with data and instructions floating around, and when an instruction’s data are ready, it triggers the instruction to execute, which might then generate more data and trigger more instructions. That maps very naturally onto dataflow: tokens in, tokens out; no central program counter; execution happens opportunistically.
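A toy tuple space makes the soup metaphor concrete. This sketch only models `out` (produce a data tuple) and a destructive, pattern-matching `in_` (consume one); real Linda also has `rd` and `eval`, and its operations block rather than return `None`:

```python
space = []  # the shared tuple space: a bag of passive data tuples

def out(*tup):
    """Drop a data tuple into the space."""
    space.append(tup)

def in_(*pattern):
    """Consume and return the first tuple matching pattern; None is a wildcard."""
    for tup in space:
        if len(tup) == len(pattern) and all(
            p is None or p == v for p, v in zip(pattern, tup)
        ):
            space.remove(tup)
            return tup
    return None  # real Linda would block here until a match appears

out("sum", 1, 2)
out("name", "linda")
print(in_("sum", None, None))  # ('sum', 1, 2)
print(in_("name", None))       # ('name', 'linda')
```

Any number of processes could be doing `out` and `in_` against the same space concurrently; none of them needs to know which process produced the tuple it consumes, just as a dataflow instruction doesn't care where its tokens came from.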
Challenges
Why haven’t dataflow processors eaten the market? There are some challenges. Namely:
- Matching data with their instructions and the instruction lifecycle can be pretty expensive,
- resource allocation is hard, and
- handling data structures is also hard.
A fully pure dataflow processor sorta assumes instruction purity and idempotency, which forces immutability, and that doesn't mesh well with how real programs update data in place.
Another formidable problem is the management of data structures. The dataflow functionality principle implies that all operations are side-effect free; that is, when a scalar operation is performed, new tokens are generated after the input tokens have been consumed. However, absence of side effects implies that if tokens are allowed to carry vectors, arrays, or other complex structures, an operation on a structure element must result in an entirely new structure.
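The cost the quote describes is easy to see in miniature: with no side effects, "writing" one element means materializing a whole new structure. This sketch uses Python tuples as stand-ins for structure tokens:

```python
def pure_update(vec, i, value):
    # O(n) copy for a single-element write: the pure-dataflow price
    return vec[:i] + (value,) + vec[i + 1:]

v0 = (0, 0, 0, 0)           # a "structure token"
v1 = pure_update(v0, 2, 9)
print(v0)  # (0, 0, 0, 0) -- the original token is untouched
print(v1)  # (0, 0, 9, 0) -- an entirely new structure
```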
This poses a bit of a challenge, to put it lightly. The paper suggests hybrid approaches that seem far more plausible to me.
This hybrid model involves grouping elements of programs into grains (or macroactors). Within a single grain, operations are performed sequentially (to the extent that you consider modern CPUs to execute instructions sequentially), and the grains themselves are scheduled in a dataflow manner.
This convergence combines the power of the dataflow model for exposing parallelism with the execution efficiency of the control-flow model. Although the spectrum of dataflow/von Neumann hybrids is very broad, two key features supporting this shift are sequential scheduling and use of registers to temporarily buffer the results between instructions.
[H]ybrid dataflow architectures can be viewed as von Neumann machines extended to support fine-grained interleaving of multiple threads.
In either approach, you need lots of coprocessors to do things like match data with instruction tags and move memory around, since there may not be registers outside the local scope of a grain or macroactor. The hybrid model also addresses another downside of pure dataflow: exceptions and interrupts are not well-ordered. If exceptions are well-ordered within the context the user expects, then the parallel execution is transparent to the user; in a full dataflow model, exceptions may fire in an unpredictable order. (This is also an issue in the control-flow model when the user opts into additional out-of-order execution, like vectorization.)
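To make the grain idea concrete, here's a minimal sketch (all names, like `schedule`, are made up for illustration): each grain's body is ordinary sequential code, while a counter of outstanding inputs decides when a grain may fire, dataflow-style:

```python
def schedule(grains):
    # grains: name -> (list of input grain names, sequential body)
    pending = {n: len(deps) for n, (deps, _) in grains.items()}
    ready = [n for n, c in pending.items() if c == 0]
    values = {}
    while ready:
        name = ready.pop()  # any ready grain may fire, in any order
        deps, body = grains[name]
        values[name] = body(*[values[d] for d in deps])  # sequential inside
        for n, (d, _) in grains.items():  # deliver the result token
            if name in d:
                pending[n] -= 1
                if pending[n] == 0:
                    ready.append(n)
    return values

# "mean" and "peak" can fire in either order (or in parallel);
# "spread" waits for both of its input tokens.
out = schedule({
    "load":   ([], lambda: [3, 1, 4, 1, 5]),
    "mean":   (["load"], lambda xs: sum(xs) / len(xs)),
    "peak":   (["load"], lambda xs: max(xs)),
    "spread": (["mean", "peak"], lambda m, p: p - m),
})
print(out["spread"])  # peak - mean = 5 - 2.8
```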
Visualization
You can sort of imagine instructions and grains in dataflow processors working like coroutines that await values put on the instruction's or grain's input ports. You might think this sounds like plain ol' out-of-order execution on the CPU in my phone. What's so special? Great question!
I think register scheduling is probably the biggest difference. In your phone's A19, for example, Apple's compiler has already decided which registers will be used to render the animations on this website. In a dataflow processor, however, the compiler can pretend it has infinite registers, like an SSA IR, and they'll all get mapped to the ports available on the processor and scheduled dynamically. This is much closer to the Linda model with an infinitely large tuple space.
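You can sketch that "infinite registers" idea directly: every result gets a fresh SSA-style tag, and a tag-to-value map plays the role that the token store plays in hardware. All of the names here are illustrative, not a real ISA:

```python
import itertools

fresh = map(lambda n: f"v{n}", itertools.count())
tokens = {}  # tag -> value: an unbounded "register file"

def emit(value):
    tag = next(fresh)  # every result gets a brand-new name
    tokens[tag] = value
    return tag

# a = 7; b = 3; c = a + b; d = a * c -- each value gets its own tag,
# so nothing is ever overwritten and every definition has one name.
a = emit(7)
b = emit(3)
c = emit(tokens[a] + tokens[b])
d = emit(tokens[a] * tokens[c])
print(tokens)  # {'v0': 7, 'v1': 3, 'v2': 10, 'v3': 70}
```

A physical register file would force the compiler to reuse names, serializing instructions that merely happen to share a register; with tags, only true data dependencies constrain the schedule.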
Take this animation:
The input data are ready when the program starts. Once every input datum for an instruction is ready, the hardware can pick up the instruction and fire it, no matter where the instruction's data dependencies physically sit. The ports are really just data, not dictated by a static register file.
In a von Neumann architecture, the compiler may well reorder some of your instructions depending on the flags you used, but those decisions are statically determined; in a dataflow architecture, nearly everything is out-of-order and potentially parallel.
Links
- The Dataflow Abstract Machine Simulator Framework
- Dataflow Architectures and Multithreading
- HPC Gets A Reconfigurable Dataflow Engine To Take On CPUs And GPUs
- Startup Claims up to 100x Better Embedded Computing Efficiency
- Wikipedia: Dataflow architecture
- How to Write Parallel Programs: A First Course
- MIT Tagged-Token Dataflow Architecture - SpringerLink
- Executing a Program on the MIT Tagged-Token Dataflow Architecture - IEEE
- MIT CSG Dataflow Research Papers
- Dataflow: Passing the Token - Arvind’s Research
- Resource Management for Tagged Token Dataflow Architecture - MIT
- SambaNova Reconfigurable Dataflow Architecture Whitepaper
- SambaNova Architecture Documentation
- Accelerating Scientific Applications with SambaNova RDA
- SambaNova vs Nvidia Architecture Comparison
- SambaNova SN10 RDU: A 7nm Dataflow Architecture - IEEE
- Ultra-fast RNNs with SambaNova’s RDA
- SambaNova SN40L: Scaling AI Memory Wall - ArXiv
- Design and Implementation of the TRIPS EDGE Architecture
- Simulator for heterogeneous dataflow architectures