Apache Flink [Java] -
system for high-throughput, low-latency data stream processing that
supports stateful computation, data-driven windowing semantics and
iterative stream processing.
Apache Heron
(incubating) [Java] - a realtime, distributed, fault-tolerant stream
processing engine from Twitter.
Apache Samza
[Scala/Java] - distributed stream processing framework that builds on
Kafka (messaging, storage) and YARN (fault tolerance, processor isolation,
security and resource management).
Apache Spark Streaming
[Scala] - makes it easy to build scalable fault-tolerant streaming
applications.
Apache Storm
[Clojure/Java] - distributed real-time computation system. Storm is to
stream processing what Hadoop is to batch processing.
ArkFlow [Rust] -
High-performance Rust stream processing engine, providing powerful data
stream processing capabilities, supporting multiple input/output sources
and processors.
Arroyo [Rust]
- a distributed stream processing engine. Supports SQL and Rust
pipelines. Scales up to millions of events per second. Supports stateful
operations like windows and joins, state checkpointing for
fault-tolerance and recovery of pipelines. Uses the Timely Dataflow
model.
AthenaX [Java] -
Uber’s Stream Analytics Framework used in production
Bytewax [Python] -
data parallel, distributed, stateful stream processing framework.
CocoIndex
[Rust/Python] - ETL framework to build fresh index for AI, with realtime
incremental updates.
Faust [Python] -
stream processing library, porting the ideas from Kafka Streams to
Python
Gearpump [Scala]
- lightweight real-time distributed streaming engine built on Akka.
Hazelcast
Jet [Java] - A general purpose distributed data processing engine,
built on top of Hazelcast.
hailstorm
[Haskell] - distributed stream processing with exactly-once semantics
based on Storm.
Maki Nage
[Python] - A stream processing framework for data scientists, based on
Kafka and ReactiveX.
mantis [Java] -
Netflix’s platform to build an ecosystem of realtime stream processing
applications
mupd8(muppet)
[Scala/Java] - mapReduce-style framework for processing fast/streaming
data.
Numaflow
[Java/Python/Go/Rust] - Kubernetes-native stream processing platform
with a language-agnostic framework. Scalable and cost-efficient.
Onyx [Clojure] -
Distributed, masterless, high performance, fault tolerant data
processing.
Pathway [Python]
- The fastest data processing engine supporting unified workflows for
batch, streaming data, and LLM applications.
s4 [Java] -
general-purpose, distributed, scalable, fault-tolerant, pluggable
platform that allows programmers to easily develop applications for
processing continuous unbounded streams of data.
Scramjet
Cloud Platform [Python/JavaScript/Node.js] - data processing engine
for running multiple data processing apps (sequences) written in Python,
JavaScript or TypeScript
SPQR [Java] -
dynamic framework for processing high-volume data streams through
pipelines.
tigon [C++/Java] -
high throughput real-time streaming processing framework built on Hadoop
and HBase.
Teknek
[Java] - Simple, elegant stream processing with an interactive prototyping
shell, SOL (Stream Operator Language).
Trill [.NET/C#] -
Trill is a high-performance one-pass in-memory streaming analytics
engine from Microsoft Research.
Wallaroo
[Python] - A fast, stream-processing framework. Wallaroo makes it easy
to react to data in real-time. By eliminating infrastructure complexity,
going from prototype to production has never been simpler.
HStreamDB
[Haskell] - The streaming database built for IoT data storage and
real-time processing.
Kuiper [Golang] - A
lightweight IoT edge analytics/streaming software implemented in
Golang that can run on all kinds of resource-constrained edge
devices.
WindFlow [C++] -
A C++17 Data Stream Processing Parallel Library for Multicores and
GPUs.
RisingWave
[Rust] - A PostgreSQL-compatible streaming database that is designed to
build event-driven applications, real-time ETL pipelines, continuous
analytics services, and feature stores for AI applications. It excels in
extracting fresh and consistent insights from real-time event streams,
database CDC, and time series data within sub-seconds. It unifies
streaming and batch processing, enabling users to ingest, join, and
analyze both live and historical data at a cloud scale.
Streaming Library
Apache Kafka Streams
[Java] - lightweight stream processing library included in Apache Kafka
(since version 0.10).
Streamiz
[C#] - a .NET Stream Processing Library for Apache Kafka
Akka Streams [Scala] -
stream processing library on Akka Actors.
Daggy [C++] -
real-time streams aggregation and catching.
Benthos [Go] -
Benthos is a high performance and resilient message streaming service,
able to connect various sources and sinks and perform arbitrary actions,
transformations and filters on payloads
FastStream
[Python] - powerful and easy-to-use Python library simplifying the
process of writing producers and consumers for message queues, handling
all the parsing, networking and documentation generation automatically.
Supports multiple protocols such as Apache Kafka, RabbitMQ and
the like.
monix [Scala] -
high-performance Scala / Scala.js library for composing asynchronous and
event-based programs.
Quix Streams
[Python] - a streaming library originally designed for the McLaren
Formula 1 racing team that can process high volumes of time-series data
with up to nanosecond precision using Apache Kafka as a message
broker.
Scramjet
Python [Python] - functional reactive stream programming framework
written from scratch operating on object, string and buffer
streams.
Scramjet
C++ [C++] - functional reactive stream programming framework written
on top of Node.js object streams.
Streamline
[Java] - Stream Analytics Framework by Hortonworks, designed as a
wrapper around existing streaming solutions like Storm. Aimed to allow
users to drag-and-drop streaming components to focus on business
logic.
StreamAlert
[Python] - Airbnb’s Real-time Data Analysis and Alerting.
Swave [Scala] - A
lightweight Reactive Streams Infrastructure Toolkit for Scala.
Streamz
[Python] - A lightweight library for building pipelines to manage
continuous streams of data; supports complex pipelines that involve
branching, joining, flow control, feedback, back pressure, and so
on.
Stream Ops
[Java] - A fully embeddable data streaming engine and stream processing
API for Java.
Substation [Go] -
Substation is a cloud native data pipeline and transformation toolkit
written in Go.
SwimOS [Rust] - A
framework for building real-time streaming data processing applications
written in Rust.
Tributary
[Python] - A python library for constructing dataflow graphs. Supports
synchronous, reactive data streams built using python generators that
mimic complex event processors, as well as lazily-evaluated acyclic
graphs and functional currying streams.
YoMo [Go] - An open
source Streaming Serverless Framework for building low-latency
geo-distributed systems. YoMo is built atop the QUIC transport protocol
and a Functional Reactive Programming interface.
Mediapipe -
Cross-platform, customizable ML solutions for live and streaming
media.
Streaming Application
javactrl-kafka
[Java] - An application of stateful stream processing for workflows as
Java code (microservices orchestration, business process automation, and
more).
straw [Python/Java] - A
platform for real-time streaming search.
storm-crawler
[Java] - Web crawler SDK based on Apache Storm.
Zilla [Java] -
Cross-platform API gateway built for event-driven architectures and
streaming that supports standard protocols such as HTTP, SSE, gRPC, MQTT
and the native Kafka protocol.
IoT
sensorbee [Go]
- lightweight stream processing engine for IoT.
Apache
Edgent [Java] - a programming model and runtime that enables
continuous streaming analytics on gateways and edge devices which can
work with centralized systems to provide efficient and timely analytics
across the whole IoT ecosystem: from the center to the edge, open
sourced by IBM.
Apache
StreamPipes [Java] - a self-service (Industrial) IoT toolbox to
enable non-technical users to connect, analyze and explore IoT data
streams.
DSL
Apache Beam [Java,
Python, SQL, Scala, Go] - unified model and set of language-specific
SDKs for defining and executing data processing workflows, and also data
ingestion and integration flows, supporting Enterprise Integration
Patterns (EIPs) and Domain Specific Languages (DSLs), open sourced by
Google.
coast [Scala] - a DSL
that builds DAGs on top of Samza and provides exactly-once
semantics.
Esper [Java] -
component for complex event processing (CEP) and event series
analysis.
Streamparse
[Python] - lets you run Python code against real-time streams of data
via Apache Storm.
summingbird
[Scala] - library that lets you write MapReduce programs that look like
native Scala or Java collection transformations and execute them on a
number of well-known distributed MapReduce platforms, including Storm
and Scalding.
Data Pipeline
Apache Kafka
[Scala/Java] - distributed, partitioned, replicated commit log service,
which provides the functionality of a messaging system, but with a
unique design.
Apache
Pulsar [Java] - distributed pub-sub messaging platform with a very
flexible messaging model and an intuitive client API.
Apache RocketMQ
[Java] - distributed messaging and streaming platform with low latency,
high performance and reliability, trillion-level capacity and flexible
scalability.
AutoMQ [Scala/Java] -
cloud-first alternative to Kafka by decoupling durability to S3 and EBS.
100% Kafka compatible. 10x cost-effective. Autoscale in seconds.
Single-digit ms latency.
brooklin [Java]
- a distributed system intended for streaming data between various
heterogeneous source and destination systems with high reliability and
throughput at scale, from LinkedIn (replaced databus).
databus [Java] -
LinkedIn’s source-agnostic distributed change data capture system.
flume [Java] -
distributed, reliable, and available service for efficiently collecting,
aggregating, and moving large amounts of log data.
fluvio [Rust/WASM]
- Real-time programmable data streaming platform with in-line
computation capabilities.
Gazette [golang] -
Distributed streaming infrastructure built on cloud storage which makes
it easy to mix and match batch and streaming paradigms.
LogDevice [C++] - a
high-performance distributed system by Facebook for streaming and storing
sequential data, using a log structure.
metaq
[Java] - Taobao’s highly available, high-performance distributed messaging
system
NATS
streaming [Go] - fast disk-backed messaging solution
nsq [Go] - realtime
distributed messaging platform designed to operate at scale, handling
billions of messages per day.
Redpanda
[C++] - Redpanda is Kafka compatible, ZooKeeper-free, JVM-free and
source available.
RudderStack [Go]
- an open source customer data infrastructure (Segment, mParticle
alternative).
suro [Java] - data
pipeline service for collecting, aggregating, and dispatching large
volumes of application events including log data.
StreamSets
Data Collector [Java] - continuous big data ingestion infrastructure
that reads from and writes to a large number of end-points, including
S3, JDBC, Hadoop, Kafka, Cassandra and many others.
Online Machine Learning
Apache Samoa
[Java] - distributed streaming machine learning (ML) framework that
contains a programming abstraction for distributed streaming ML
algorithms.
DataSketches
[Java] - sketches library from Yahoo!.
[Numalogic](https://github.com/numaproj/numalogic) [Python] -
Collection of ML models and libraries for real-time anomaly detection
and forecasting on time series data. Built on Numaflow, a K8s native
stream processing platform
streamDM
[Scala] - mining Big Data streams using Spark Streaming from
Huawei.
StreamingBandit
[Python] - Provides a webserver to quickly set up and evaluate possible
solutions to contextual multi-armed bandit (cMAB) problems.
StormCV [Java]
- enables the use of Apache Storm for video processing by adding
computer vision (CV) specific operations and data model.
trident-ml
[Java] - realtime online machine learning library based on Trident.
yurita [Scala] -
Anomaly detection framework built on Spark Structured Streaming from
PayPal.
Streaming SQL
pipelinedb
[C] - An open-source relational database that runs SQL queries
continuously on streams, incrementally storing results in tables.
squall [Java] -
Squall executes SQL queries on top of Storm for doing online
processing.
StreamCQL
[Java] - Continuous Query Language on RealTime Computation System.
ksqlDB [Java] - A
cloud-native, source-available database
purpose-built for stream processing applications
Materialize [Rust] - A
source-available streaming SQL engine for maintaining materialized views
on data from message brokers and databases.
Siddhi [Java] - A
cloud native Streaming and Complex Event Processing engine that
understands Streaming SQL queries in order to capture events from
diverse data sources, process them, detect complex conditions, and
publish output to various endpoints in real time.
Proton [C++] - A
unified streaming and historical data analytics database in a single
binary, powered by ClickHouse.
Benchmark
storm-perf-test
[Java] - a simple storm performance/stress test.
streaming-benchmarks
[Java] - Benchmarks for Low Latency (Streaming) solutions including
Apache Storm, Apache Spark, Apache Flink, etc.
flotilla [Go] -
Automated message queue orchestration for scaled-up benchmarking.
Toolkit
akka [Scala] - toolkit
and runtime for building highly concurrent, distributed, and resilient
message-driven applications on the JVM.
Apache Pekko
[Scala, Java] - Fork of Akka 2.6.x, prior to the Akka project’s adoption
of the Business Source License.
pulsar [Python] -
Actor based event driven concurrent framework for Python.
aeron [Java/C++] -
efficient reliable unicast and multicast message transport.
StreamFlow [Java] -
stream processing tool designed to help build and monitor processing
workflows.
samza-luwak
[Java] - uses Luwak, a stored-query engine built on Lucene, to implement
full-text search on streams.
Streamdal [Go/Node.js/Python] -
A tool to embed privacy controls in your application code to detect PII
as it enters and leaves your systems, preventing it from reaching
unintended data streams or pipelines.
Turbine [Java] -
tool for aggregating streams of Server-Sent Event (SSE) JSON data into a
single stream.
Nussknacker
[Scala] - A visual tool to define and run real-time decision
algorithms.
Closed Source
Amazon Kinesis Streams
[Java] - real-time, fully managed and scalable data stream engine
provided by AWS.
Azure
Stream Analytics [.NET] - a massively scalable, fully managed,
real-time data stream engine provided by Microsoft Azure.
Cloud
Dataflow [Java, Python, SQL, Scala] - Google’s managed stream and
batch data processing engine. Supports running Beam pipelines.
concord
[C++] - a distributed stream processing framework built in C++ on top of
Apache Mesos, designed for high performance data processing jobs that
require flexibility & control.
IBM
Streams [Python/Java/Scala] - platform for distributed processing
and real-time analytics. Provides toolkits for advanced analytics like
geospatial, time series, etc. out of the box.