<h1 id="awesome-big-data">Awesome Big Data</h1>
<p><a href="https://github.com/sindresorhus/awesome"><img
src="https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg"
alt="Awesome" /></a></p>
<p>A curated list of awesome big data frameworks, resources and other
awesomeness. Inspired by <a
href="https://github.com/ziadoz/awesome-php">awesome-php</a>, <a
href="https://github.com/vinta/awesome-python">awesome-python</a>, <a
href="https://github.com/Sdogruyol/awesome-ruby">awesome-ruby</a>, <a
href="http://hadoopecosystemtable.github.io/">hadoopecosystemtable</a>
&amp; <a href="http://usefulstuff.io/big-data/">big-data</a>.</p>
<p>Your contributions are always welcome!</p>
<ul>
<li><a href="#awesome-big-data">Awesome Big Data</a>
<ul>
<li><a href="#rdbms">RDBMS</a></li>
<li><a href="#frameworks">Frameworks</a></li>
<li><a href="#distributed-programming">Distributed Programming</a></li>
<li><a href="#distributed-filesystem">Distributed Filesystem</a></li>
<li><a href="#distributed-index">Distributed Index</a></li>
<li><a href="#document-data-model">Document Data Model</a></li>
<li><a href="#key-map-data-model">Key Map Data Model</a></li>
<li><a href="#key-value-data-model">Key-value Data Model</a></li>
<li><a href="#graph-data-model">Graph Data Model</a></li>
<li><a href="#columnar-databases">Columnar Databases</a></li>
<li><a href="#newsql-databases">NewSQL Databases</a></li>
<li><a href="#time-series-databases">Time-Series Databases</a></li>
<li><a href="#sql-like-processing">SQL-like processing</a></li>
<li><a href="#data-ingestion">Data Ingestion</a></li>
<li><a href="#service-programming">Service Programming</a></li>
<li><a href="#scheduling">Scheduling</a></li>
<li><a href="#machine-learning">Machine Learning</a></li>
<li><a href="#benchmarking">Benchmarking</a></li>
<li><a href="#security">Security</a></li>
<li><a href="#system-deployment">System Deployment</a></li>
<li><a href="#applications">Applications</a></li>
<li><a href="#search-engine-and-framework">Search engine and
framework</a></li>
<li><a href="#mysql-forks-and-evolutions">MySQL forks and
evolutions</a></li>
<li><a href="#postgresql-forks-and-evolutions">PostgreSQL forks and
evolutions</a></li>
<li><a href="#memcached-forks-and-evolutions">Memcached forks and
evolutions</a></li>
<li><a href="#embedded-databases">Embedded Databases</a></li>
<li><a href="#business-intelligence">Business Intelligence</a></li>
<li><a href="#data-visualization">Data Visualization</a></li>
<li><a href="#internet-of-things-and-sensor-data">Internet of things and
sensor data</a></li>
<li><a href="#interesting-readings">Interesting Readings</a></li>
<li><a href="#interesting-papers">Interesting Papers</a>
<ul>
<li><a href="#2015---2016">2015 - 2016</a></li>
<li><a href="#2013---2014">2013 - 2014</a></li>
<li><a href="#2011---2012">2011 - 2012</a></li>
<li><a href="#2001---2010">2001 - 2010</a></li>
</ul></li>
<li><a href="#videos">Videos</a></li>
<li><a href="#books">Books</a>
<ul>
<li><a href="#streaming">Streaming</a></li>
<li><a href="#distributed-systems">Distributed systems</a></li>
<li><a href="#graph-based-approach">Graph Based approach</a></li>
<li><a href="#data-visualization-1">Data Visualization</a></li>
</ul></li>
</ul></li>
<li><a href="#other-awesome-lists">Other Awesome Lists</a></li>
</ul>
<h2 id="rdbms">RDBMS</h2>
<ul>
<li><a href="https://www.mysql.com/">MySQL</a> - the world's most popular
open source database.</li>
<li><a href="https://www.postgresql.org/">PostgreSQL</a> - the world's
most advanced open source database.</li>
<li><a
href="http://www.oracle.com/us/corporate/features/database-12c/index.html">Oracle
Database</a> - object-relational database management system.</li>
<li><a
href="http://www.teradata.com/products-and-services/teradata-database/">Teradata</a>
- high-performance MPP data warehouse platform.</li>
</ul>
<h2 id="frameworks">Frameworks</h2>
<ul>
<li><a href="https://github.com/facebook/bistro">Bistro</a> -
general-purpose data processing engine for both batch and stream
analytics. It is based on a novel data model, which represents data via
<em>functions</em> and processes data via <em>column operations</em> as
opposed to having only set operations in conventional approaches like
MapReduce or SQL.</li>
<li><a
href="https://www.ibm.com/analytics/us/en/technology/stream-computing/">IBM
Streams</a> - platform for distributed processing and real-time
analytics. Integrates with many of the popular technologies in the Big
Data ecosystem (Kafka, HDFS, Spark, etc.)</li>
<li><a href="http://hadoop.apache.org/">Apache Hadoop</a> - framework
for distributed processing. Integrates MapReduce (parallel processing),
YARN (job scheduling) and HDFS (distributed file system).</li>
<li><a href="https://github.com/caskdata/tigon">Tigon</a> - High
Throughput Real-time Stream Processing Framework.</li>
<li><a href="http://pachyderm.io/">Pachyderm</a> - a data
storage platform built on Docker and Kubernetes to provide reproducible
data processing and analysis.</li>
<li><a href="https://github.com/polyaxon/polyaxon">Polyaxon</a> - A
platform for reproducible and scalable machine learning and deep
learning.</li>
<li><a href="https://github.com/smooks/smooks">Smooks</a> - An
extensible Java framework for building XML and non-XML (CSV, EDI, Java,
etc…) streaming applications.</li>
</ul>
<h2 id="distributed-programming">Distributed Programming</h2>
<ul>
<li><a href="https://github.com/addthis/hydra">AddThis Hydra</a> -
distributed data processing and storage system originally developed at
AddThis.</li>
<li><a href="http://databricks.github.io/simr/">AMPLab SIMR</a> - run
Spark on Hadoop MapReduce v1.</li>
<li><a href="https://apex.apache.org/">Apache APEX</a> - a unified,
enterprise platform for big data stream and batch processing.</li>
<li><a href="https://beam.apache.org/">Apache Beam</a> - a unified
model and set of language-specific SDKs for defining and executing data
processing workflows.</li>
<li><a href="http://crunch.apache.org/">Apache Crunch</a> - a simple
Java API for tasks like joining and data aggregation that are tedious to
implement on plain MapReduce.</li>
<li><a href="http://incubator.apache.org/projects/datafu.html">Apache
DataFu</a> - collection of user-defined functions for Hadoop and Pig
developed by LinkedIn.</li>
<li><a href="http://flink.apache.org/">Apache Flink</a> - stream and
batch processing framework with a high-performance runtime and automatic
program optimization.</li>
<li><a href="http://gearpump.apache.org/">Apache Gearpump</a> -
real-time big data streaming engine based on Akka.</li>
<li><a href="http://gora.apache.org/">Apache Gora</a> - framework for
in-memory data model and persistence.</li>
<li><a href="http://hama.apache.org/">Apache Hama</a> - BSP (Bulk
Synchronous Parallel) computing framework.</li>
<li><a href="https://wiki.apache.org/hadoop/MapReduce/">Apache
MapReduce</a> - programming model for processing large data sets with a
parallel, distributed algorithm on a cluster.</li>
<li><a href="https://pig.apache.org/">Apache Pig</a> - high level
language to express data analysis programs for Hadoop.</li>
<li><a href="http://reef.apache.org/">Apache REEF</a> - retainable
evaluator execution framework to simplify and unify the lower layers of
big data systems.</li>
<li><a href="http://incubator.apache.org/projects/s4.html">Apache S4</a>
- framework for stream processing, implementation of S4.</li>
<li><a href="http://spark.apache.org/">Apache Spark</a> - framework
for in-memory cluster computing.</li>
<li><a
href="https://spark.apache.org/docs/latest/streaming-programming-guide.html">Apache
Spark Streaming</a> - framework for stream processing, part of
Spark.</li>
<li><a href="http://storm.apache.org">Apache Storm</a> - framework for
stream processing by Twitter, which also runs on YARN.</li>
<li><a href="http://samza.apache.org/">Apache Samza</a> - stream
processing framework, based on Kafka and YARN.</li>
<li><a href="http://tez.apache.org/">Apache Tez</a> - application
framework for executing a complex DAG (directed acyclic graph) of tasks,
built on YARN.</li>
<li><a href="https://incubator.apache.org/projects/twill.html">Apache
Twill</a> - abstraction over YARN that reduces the complexity of
developing distributed applications.</li>
<li><a href="http://bigflow.cloud/en/index.html">Baidu Bigflow</a> - an
interface that allows for writing distributed computing programs
providing lots of simple, flexible, powerful APIs to easily handle data
of any scale.</li>
<li><a href="http://cascalog.org/">Cascalog</a> - data processing and
querying library.</li>
<li><a
href="http://vldbarc.org/pvldb/vldb2010/pvldb_vol3/I08.pdf">Cheetah</a>
- High Performance, Custom Data Warehouse on Top of MapReduce.</li>
<li><a href="http://www.cascading.org/">Concurrent Cascading</a> -
framework for data management/analytics on Hadoop.</li>
<li><a href="https://github.com/damballa/parkour">Damballa Parkour</a> -
MapReduce library for Clojure.</li>
<li><a href="https://github.com/datasalt/pangool">Datasalt Pangool</a> -
alternative MapReduce paradigm.</li>
<li><a href="https://www.datatorrent.com/">DataTorrent StrAM</a> -
real-time engine designed to enable distributed, asynchronous, real-time
in-memory big-data computations in as unblocked a way as possible,
with minimal overhead and impact on performance.</li>
<li><a
href="https://www.facebook.com/notes/facebook-engineering/under-the-hood-scheduling-mapreduce-jobs-more-efficiently-with-corona/10151142560538920">Facebook
Corona</a> - Hadoop enhancement which removes single point of
failure.</li>
<li><a href="http://peregrine_mapreduce.bitbucket.org/">Facebook
Peregrine</a> - Map Reduce framework.</li>
<li><a
href="https://www.facebook.com/notes/facebook-engineering/under-the-hood-data-diving-with-scuba/10150599692628920">Facebook
Scuba</a> - distributed in-memory datastore.</li>
<li><a
href="https://googledevelopers.blogspot.it/2014/06/cloud-platform-at-google-io-new-big.html">Google
Dataflow</a> - create data pipelines to ingest, transform and
analyze data.</li>
<li><a href="https://research.google.com/archive/mapreduce.html">Google
MapReduce</a> - map reduce framework.</li>
<li><a href="https://research.google.com/pubs/pub41378.html">Google
MillWheel</a> - fault tolerant stream processing framework.</li>
<li><a
href="https://www.ibm.com/analytics/us/en/technology/stream-computing/">IBM
Streams</a> - platform for distributed processing and real-time
analytics. Provides toolkits for advanced analytics like geospatial,
time series, etc. out of the box.</li>
<li><a href="https://code.google.com/p/jaql/">JAQL</a> - declarative
programming language for working with structured, semi-structured and
unstructured data.</li>
<li><a href="http://kitesdk.org/docs/current/">Kite</a> - a set of
libraries, tools, examples, and documentation focused on making it
easier to build systems on top of the Hadoop ecosystem.</li>
<li><a href="http://druid.io/">Metamarkets Druid</a> - framework for
real-time analysis of large datasets.</li>
<li><a href="https://github.com/Netflix/PigPen">Netflix PigPen</a> -
map-reduce for Clojure which compiles to Apache Pig.</li>
<li><a href="http://discoproject.org/">Nokia Disco</a> - MapReduce
framework developed by Nokia.</li>
<li><a href="http://www.onyxplatform.org/">Onyx</a> - Distributed
computation for the cloud.</li>
<li><a
href="https://medium.com/@Pinterest_Engineering/pinlater-an-asynchronous-job-execution-system-b8664cb8aa7d">Pinterest
Pinlater</a> - asynchronous job execution system.</li>
<li><a href="http://crs4.github.io/pydoop/">Pydoop</a> - Python
MapReduce and HDFS API for Hadoop.</li>
<li><a href="https://github.com/ray-project/ray">Ray</a> - A fast and
simple framework for building and running distributed applications.</li>
<li><a href="http://blueflood.io/">Rackerlabs Blueflood</a> -
multi-tenant distributed metric processing system</li>
<li><a href="https://github.com/skale-me/skale-engine">Skale</a> - High
performance distributed data processing in NodeJS.</li>
<li><a href="http://stratosphere.eu/">Stratosphere</a> - general purpose
cluster computing framework.</li>
<li><a href="https://streamdrill.com/">Streamdrill</a> - useful for
counting activities of event streams over different time windows and
finding the most active one.</li>
<li><a
href="https://github.com/IBMStreams/streamsx.topology">streamsx.topology</a>
- Libraries to enable building IBM Streams applications in Java, Python
or Scala.</li>
<li><a href="https://github.com/UnderstandLingBV/Tuktu">Tuktu</a> -
Easy-to-use platform for batch and streaming computation, built using
Scala, Akka and Play!</li>
<li><a href="https://github.com/twitter/heron">Twitter Heron</a> - Heron
is a realtime, distributed, fault-tolerant stream processing engine from
Twitter replacing Storm.</li>
<li><a href="https://github.com/twitter/scalding">Twitter Scalding</a> -
Scala library for Map Reduce jobs, built on Cascading.</li>
<li><a href="https://github.com/twitter/summingbird">Twitter
Summingbird</a> - Streaming MapReduce with Scalding and Storm, by
Twitter.</li>
<li><a
href="https://blog.twitter.com/engineering/en_us/a/2014/tsar-a-timeseries-aggregator.html">Twitter
TSAR</a> - TimeSeries AggregatoR by Twitter.</li>
<li><a href="http://www.wallaroolabs.com/community">Wallaroo</a> - The
ultrafast and elastic data processing engine. Big or fast data - no
fuss, no Java needed.</li>
</ul>
<h2 id="distributed-filesystem">Distributed Filesystem</h2>
<ul>
<li><a href="https://github.com/linkedin/ambry">Ambry</a> - a
distributed object store that supports storage of trillions of small
immutable objects as well as billions of large objects.</li>
<li><a href="http://hadoop.apache.org/">Apache HDFS</a> - a way to store
large files across multiple machines.</li>
<li><a href="http://kudu.apache.org/">Apache Kudu</a> - Hadoop's storage
layer to enable fast analytics on fast data.</li>
<li><a href="https://www.beegfs.io/content/">BeeGFS</a> - formerly
FhGFS, parallel distributed file system.</li>
<li><a href="http://ceph.com/ceph-storage/file-system/">Ceph
Filesystem</a> - POSIX-compliant distributed file system built on the
Ceph storage platform.</li>
<li><a
href="http://disco.readthedocs.org/en/latest/howto/ddfs.html">Disco
DDFS</a> - distributed filesystem.</li>
<li><a
href="https://www.facebook.com/note.php?note_id=76191543919">Facebook
Haystack</a> - object storage system.</li>
<li><a
href="http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf">Google
GFS</a> - distributed filesystem.</li>
<li><a href="https://research.google.com/pubs/pub36971.html">Google
Megastore</a> - scalable, highly available storage.</li>
<li><a href="https://www.gridgain.com/">GridGain</a> - GGFS, Hadoop
compliant in-memory file system.</li>
<li><a href="http://wiki.lustre.org/">Lustre file system</a> -
high-performance distributed filesystem.</li>
<li><a
href="https://hadoop.apache.org/docs/current/hadoop-azure-datalake/index.html">Microsoft
Azure Data Lake Store</a> - HDFS-compatible storage in Azure cloud</li>
<li><a
href="https://www.quantcast.com/about-us/quantcast-file-system/">Quantcast
File System QFS</a> - open-source distributed file system.</li>
<li><a href="http://gluster.org/">Red Hat GlusterFS</a> - scale-out
network-attached storage file system.</li>
<li><a href="https://github.com/chrislusf/seaweedfs">Seaweed-FS</a> -
simple and highly scalable distributed file system.</li>
<li><a href="http://www.alluxio.org/">Alluxio</a> - reliable file
sharing at memory speed across cluster frameworks.</li>
<li><a href="https://www.tahoe-lafs.org/trac/tahoe-lafs">Tahoe-LAFS</a>
- decentralized cloud storage system.</li>
<li><a href="https://github.com/baidu/bfs">Baidu File System</a> -
distributed filesystem.</li>
</ul>
<h2 id="distributed-index">Distributed Index</h2>
<ul>
<li><a href="https://github.com/pilosa/pilosa">Pilosa</a> - open source
distributed bitmap index that dramatically accelerates queries across
multiple, massive data sets.</li>
</ul>
<h2 id="document-data-model">Document Data Model</h2>
<ul>
<li><a
href="https://www.actian.com/data-management/ingres-sql-rdbms/">Actian
Versant</a> - commercial object-oriented database management
system.</li>
<li><a href="https://crate.io/">Crate Data</a> - an open source,
massively scalable data store that requires zero administration.</li>
<li><a href="http://www.infoq.com/news/2014/06/facebook-apollo">Facebook
Apollo</a> - Facebook's Paxos-like NoSQL database.</li>
<li><a href="http://comsysto.github.io/jumbodb/">jumboDB</a> - document
oriented datastore over Hadoop.</li>
<li><a href="https://engineering.linkedin.com/data">LinkedIn
Espresso</a> - horizontally scalable document-oriented NoSQL data
store.</li>
<li><a href="http://www.marklogic.com/">MarkLogic</a> - Schema-agnostic
Enterprise NoSQL database technology.</li>
<li><a
href="https://azure.microsoft.com/en-us/services/cosmos-db/">Microsoft
Azure DocumentDB</a> - NoSQL cloud database service with protocol
support for MongoDB</li>
<li><a href="https://www.mongodb.com/">MongoDB</a> - Document-oriented
database system.</li>
<li><a href="https://ravendb.net/">RavenDB</a> - A transactional,
open-source Document Database.</li>
<li><a href="https://rethinkdb.com/">RethinkDB</a> - document database
that supports queries like table joins and group by.</li>
</ul>
<h2 id="key-map-data-model">Key Map Data Model</h2>
<p><strong>Note</strong>: There is some term confusion in the industry,
and two different things are called “Columnar Databases”. Some, listed
here, are distributed, persistent databases built around the “key-map”
data model: all data has a (possibly composite) key, with which a map of
key-value pairs is associated. In some systems, multiple such value maps
can be associated with a key, and these maps are referred to as “column
families” (with value map keys being referred to as “columns”).</p>
<p>Another group of technologies that can also be called “columnar
databases” is distinguished by how it stores data, on disk or in memory:
rather than storing data the traditional way, where all column values
for a given key are stored next to each other, “row by row”, these
systems store all values of a given <em>column</em> next to each other.
So more work is needed to get all columns for a given key, but less work
is needed to get all values for a given column.</p>
<p>The former group is referred to as “key map data model” here. The
line between these and the <a href="#key-value-data-model">Key-value
Data Model</a> stores is fairly blurry.</p>
<p>The latter, being more about the storage format than about the data
model, is listed under <a href="#columnar-databases">Columnar
Databases</a>.</p>
<p>You can read more about this distinction on Prof. Daniel Abadi's
blog: <a
href="http://dbmsmusings.blogspot.com/2010/03/distinguishing-two-major-types-of_29.html">Distinguishing
two major types of Column Stores</a>.</p>
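<p>As a rough illustration of the note above, the following Python
sketch models the three ideas with plain dictionaries and lists: a
key-map record with column families, a row-oriented layout, and a
column-oriented layout. All names and sample values
(<code>key_map_store</code>, <code>get_column</code>,
<code>column_total</code>, the user records) are hypothetical and chosen
only for illustration; this is not the API or storage format of any
system listed here.</p>
<pre><code class="language-python"># Key-map data model: a key maps to one or more "column families",
# and each family is itself a map of column name to value.
# (Hypothetical sample data, for illustration only.)
key_map_store = {
    "user:1": {
        "profile": {"name": "Ada", "city": "London"},
        "stats": {"logins": 3},
    },
    "user:2": {
        "profile": {"name": "Linus", "city": "Helsinki"},
        "stats": {"logins": 7},
    },
}


def get_column(store, key, family, column):
    """Look up one value by key, column family and column name."""
    return store[key][family][column]


# Row-oriented layout: all column values for a key sit together.
rows = [
    ("user:1", "Ada", 3),
    ("user:2", "Linus", 7),
]

# Column-oriented layout: all values of one column sit together.
columns = {
    "key": ["user:1", "user:2"],
    "name": ["Ada", "Linus"],
    "logins": [3, 7],
}


def row_lookup(rows, key):
    """Fetching a full record is a single step in the row layout."""
    for row in rows:
        if row[0] == key:
            return row
    return None


def column_lookup(columns, key):
    """Fetching a full record in the column layout touches every column."""
    position = columns["key"].index(key)
    return tuple(values[position] for values in columns.values())


def column_total(columns, name):
    """Aggregating one column only reads that column's values."""
    return sum(columns[name])
</code></pre>
<p>In this toy model, <code>column_total(columns, "logins")</code> reads
a single list, while <code>column_lookup</code> has to visit every
column to rebuild one record, which mirrors the trade-off described
above.</p>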
<ul>
<li><a href="http://accumulo.apache.org/">Apache Accumulo</a> -
distributed key/value store, built on Hadoop.</li>
<li><a href="http://cassandra.apache.org/">Apache Cassandra</a> -
column-oriented distributed datastore, inspired by BigTable.</li>
<li><a href="http://hbase.apache.org/">Apache HBase</a> -
column-oriented distributed datastore, inspired by BigTable.</li>
<li><a href="https://github.com/baidu/tera">Baidu Tera</a> - an
Internet-scale database, inspired by BigTable.</li>
<li><a
href="https://code.facebook.com/posts/321111638043166/hydrabase-the-evolution-of-hbase-facebook/">Facebook
HydraBase</a> - evolution of HBase made by Facebook.</li>
<li><a
href="http://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf">Google
BigTable</a> - column-oriented distributed datastore.</li>
<li><a
href="https://cloud.google.com/datastore/docs/concepts/overview">Google
Cloud Datastore</a> - a fully managed, schemaless database for
storing non-relational data over BigTable.</li>
<li><a href="http://www.hypertable.org/">Hypertable</a> -
column-oriented distributed datastore, inspired by BigTable.</li>
<li><a href="https://github.com/infinidb/infinidb/">InfiniDB</a> -
accessed through a MySQL interface; uses massively parallel processing
to parallelize queries.</li>
<li><a href="https://github.com/caskdata/tephra">Tephra</a> -
Transactions for HBase.</li>
<li><a
href="https://blog.twitter.com/engineering/en_us/a/2014/manhattan-our-real-time-multi-tenant-distributed-database-for-twitter-scale.html">Twitter
Manhattan</a> - real-time, multi-tenant distributed database for Twitter
scale.</li>
<li><a href="http://www.scylladb.com/">ScyllaDB</a> - column-oriented
distributed datastore written in C++, totally compatible with Apache
Cassandra.</li>
</ul>
<h2 id="key-value-data-model">Key-value Data Model</h2>
<ul>
<li><a href="http://www.aerospike.com/">Aerospike</a> - NoSQL
flash-optimized, in-memory. Open source and “Server code in C (not
Java or Erlang) precisely tuned to avoid context switching and memory
copies.”</li>
<li><a href="https://aws.amazon.com/dynamodb/">Amazon DynamoDB</a> -
distributed key/value store, implementation of Dynamo paper.</li>
<li><a href="https://open.dgraph.io/post/badger/">Badger</a> - a fast,
simple, efficient, and persistent key-value store written natively in
Go.</li>
<li><a href="https://github.com/boltdb/bolt">Bolt</a> - an embedded
key-value database for Go.</li>
<li><a href="https://github.com/Bobris/BTDB">BTDB</a> - Key Value
Database in .Net with Object DB Layer, RPC, dynamic IL and much
more</li>
<li><a href="https://github.com/tidwall/buntdb">BuntDB</a> - a fast,
embeddable, in-memory key/value database for Go with custom indexing and
geospatial support.</li>
<li><a href="https://github.com/cbd/edis">Edis</a> - a
protocol-compatible server replacement for Redis.</li>
<li><a href="https://github.com/nathanmarz/elephantdb">ElephantDB</a> -
Distributed database specialized in exporting data from Hadoop.</li>
<li><a href="https://geteventstore.com/">EventStore</a> - distributed
time series database.</li>
<li><a href="https://github.com/jakekgrog/GhostDB">GhostDB</a> - a
distributed, in-memory, general purpose key-value data store that
delivers microsecond performance at any scale.</li>
<li><a href="https://github.com/deroproject/graviton">Graviton</a> - a
simple, fast, versioned, authenticated, embeddable key-value store
database in pure Go(lang).</li>
<li><a href="https://github.com/griddb/griddb_nosql">GridDB</a> -
suitable for sensor data stored in a timeseries.</li>
<li><a href="https://github.com/rescrv/HyperDex">HyperDex</a> - a
scalable, next generation key-value and document store with a wide array
of features, including consistency, fault tolerance and high
performance.</li>
<li><a href="https://ignite.apache.org/index.html">Ignite</a> - an
in-memory key-value data store providing full SQL-compliant data access
that can optionally be backed by disk storage.</li>
<li><a
href="https://github.com/linkedin-sna/sna-page/tree/master/krati">LinkedIn
Krati</a> - a simple persistent data store with very low latency and
high throughput.</li>
<li><a href="http://www.project-voldemort.com/voldemort/">LinkedIn
Voldemort</a> - distributed key/value storage system.</li>
<li><a
href="http://www.oracle.com/technetwork/database/database-technologies/nosqldb/overview/index.html">Oracle
NoSQL Database</a> - distributed key-value database by Oracle
Corporation.</li>
<li><a href="https://redis.io/">Redis</a> - in-memory key-value
datastore.</li>
<li><a href="https://github.com/basho/riak">Riak</a> - a decentralized
datastore.</li>
<li><a href="https://github.com/twitter/storehaus">Storehaus</a> -
library to work with asynchronous key value stores, by Twitter.</li>
<li><a href="https://github.com/tidwall/summitdb">SummitDB</a> - an
in-memory, NoSQL key/value database, with disk persistence and using the
Raft consensus algorithm.</li>
<li><a href="https://github.com/tarantool/tarantool">Tarantool</a> - an
efficient NoSQL database and a Lua application server.</li>
<li><a href="https://github.com/pingcap/tikv">TiKV</a> - a distributed
key-value database powered by Rust and inspired by Google Spanner and
HBase.</li>
<li><a href="https://github.com/tidwall/tile38">Tile38</a> - a
geolocation data store, spatial index, and realtime geofence, supporting
a variety of object types including latitude/longitude points, bounding
boxes, XYZ tiles, Geohashes, and GeoJSON</li>
<li><a href="https://github.com/Treode/store">TreodeDB</a> - key-value
store that's replicated and sharded and provides atomic multirow
writes.</li>
</ul>
<h2 id="graph-data-model">Graph Data Model</h2>
<ul>
<li><a href="http://www.agensgraph.com/">AgensGraph</a> - a new
generation multi-model graph database for the modern complex data
environment.</li>
<li><a href="http://giraph.apache.org/">Apache Giraph</a> -
implementation of Pregel, based on Hadoop.</li>
<li><a
href="http://spark.apache.org/docs/0.7.3/bagel-programming-guide.html">Apache
Spark Bagel</a> - implementation of Pregel, part of Spark.</li>
<li><a href="https://www.arangodb.com/">ArangoDB</a> - multi-model
distributed database.</li>
<li><a href="https://github.com/dgraph-io/dgraph">DGraph</a> - A
scalable, distributed, low latency, high throughput graph database aimed
at providing Google production level scale and throughput, with low
enough latency to be serving real time user queries, over terabytes of
structured data.</li>
<li><a href="https://github.com/krotik/eliasdb">EliasDB</a> - a
lightweight graph based database that does not require any third-party
libraries.</li>
<li><a
href="https://www.facebook.com/notes/facebook-engineering/tao-the-power-of-the-graph/10151525983993920">Facebook
TAO</a> - the distributed data store that is widely used at
Facebook to store and serve the social graph.</li>
<li><a href="https://github.com/gchq/Gaffer">GCHQ Gaffer</a> - Gaffer by
GCHQ is a framework that makes it easy to store large-scale graphs in
which the nodes and edges have statistics.</li>
<li><a href="https://github.com/cayleygraph/cayley">Google Cayley</a> -
open-source graph database.</li>
<li><a href="http://kowshik.github.io/JPregel/pregel_paper.pdf">Google
Pregel</a> - graph processing framework.</li>
<li><a href="https://turi.com/products/create/docs/">GraphLab
PowerGraph</a> - a core C++ GraphLab API and a collection of
high-performance machine learning and data mining toolkits built on top
of the GraphLab API.</li>
<li><a
href="https://amplab.cs.berkeley.edu/publication/graphx-grades/">GraphX</a>
- resilient Distributed Graph System on Spark.</li>
<li><a href="https://github.com/tinkerpop/gremlin">Gremlin</a> - graph
traversal Language.</li>
<li><a href="https://github.com/paulhoule/infovore">Infovore</a> -
RDF-centric Map/Reduce framework.</li>
<li><a href="https://01.org/graphbuilder/">Intel GraphBuilder</a> -
tools to construct large-scale graphs on top of Hadoop.</li>
<li><a href="http://janusgraph.org">JanusGraph</a> - open-source,
distributed graph database with multiple options for storage backends
(Bigtable, HBase, Cassandra, etc.) and indexing backends (Elasticsearch,
Solr, Lucene).</li>
<li><a
href="https://www.blazegraph.com/mapgraph-technology/">MapGraph</a> -
Massively Parallel Graph processing on GPUs.</li>
<li><a href="https://github.com/Microsoft/GraphEngine">Microsoft Graph
Engine</a> - a distributed in-memory data processing engine, underpinned
by a strongly-typed in-memory key-value store and a general distributed
computation engine.</li>
<li><a href="https://neo4j.com/">Neo4j</a> - graph database written
entirely in Java.</li>
<li><a href="http://orientdb.com/">OrientDB</a> - document and graph
database.</li>
<li><a href="https://github.com/xslogic/phoebus">Phoebus</a> - framework
for large scale graph processing.</li>
<li><a href="http://thinkaurelius.github.io/titan/">Titan</a> -
distributed graph database, built over Cassandra.</li>
<li><a href="https://github.com/twitter-archive/flockdb">Twitter
FlockDB</a> - distributed graph database.</li>
<li><a href="https://nodexl.codeplex.com/">NodeXL</a> - A free,
open-source template for Microsoft® Excel® 2007, 2010, 2013 and 2016
that makes it easy to explore network graphs.</li>
</ul>
<h2 id="columnar-databases">Columnar Databases</h2>
<p><strong>Note</strong>: please read the note in the <a
href="#key-map-data-model">Key Map Data Model</a> section.</p>
<ul>
<li><a href="http://the-paper-trail.org/blog/columnar-storage/">Columnar
Storage</a> - an explanation of what columnar storage is and when you
might want it.</li>
<li><a href="http://www.actian.com/">Actian Vector</a> - column-oriented
analytic database.</li>
<li><a href="https://clickhouse.yandex/">ClickHouse</a> - an open-source
column-oriented database management system that allows generating
analytical data reports in real time.</li>
<li><a href="http://eventql.io/">EventQL</a> - a distributed,
column-oriented database built for large-scale event collection and
analytics.</li>
<li><a href="https://www.monetdb.org/">MonetDB</a> - column store
database.</li>
<li><a href="http://parquet.apache.org/">Parquet</a> - columnar storage
format for Hadoop.</li>
<li><a href="https://pivotal.io/pivotal-greenplum">Pivotal Greenplum</a>
- purpose-built, dedicated analytic data warehouse that offers a
columnar engine as well as a traditional row-based one.</li>
<li><a href="https://www.vertica.com/">Vertica</a> - designed to
manage large, fast-growing volumes of data and provide very fast query
performance when used for data warehouses.</li>
<li><a href="http://sqream.com/">SQream DB</a> - A GPU powered big data
database, designed for analytics and data warehousing, with ANSI-92
compliant SQL, suitable for data sets from 10TB to 1PB.</li>
<li><a href="https://cloud.google.com/bigquery/what-is-bigquery">Google
BigQuery</a> - Google's cloud offering backed by their pioneering work
on Dremel.</li>
<li><a href="https://aws.amazon.com/redshift/">Amazon Redshift</a> -
Amazon's cloud offering, also based on a columnar datastore
backend.</li>
<li><a href="https://github.com/shunfei/indexr">IndexR</a> - an
open-source columnar storage format for fast, real-time analytics
on big data.</li>
<li><a href="https://github.com/cswinter/LocustDB">LocustDB</a> - an
experimental analytics database aiming to set a new standard for query
performance on commodity hardware.</li>
</ul>
<h2 id="newsql-databases">NewSQL Databases</h2>
<ul>
<li><a
href="http://www.actian.com/products/operational-databases/">Actian
Ingres</a> - commercially supported, open-source SQL relational database
management system.</li>
<li><a href="https://github.com/biokoda/actordb">ActorDB</a> - a
distributed SQL database with the scalability of a KV store, while
keeping the query capabilities of a relational database.</li>
<li><a href="http://aws.amazon.com/redshift/">Amazon RedShift</a> - data
warehouse service, based on PostgreSQL.</li>
<li><a href="https://github.com/probcomp/BayesDB">BayesDB</a> -
statistics-oriented SQL database.</li>
<li><a href="http://bedrockdb.com/">Bedrock</a> - a simple, modular,
networked and distributed transaction layer built atop SQLite.</li>
<li><a href="https://www.citusdata.com/">CitusDB</a> - scales out
PostgreSQL through sharding and replication.</li>
<li><a href="https://github.com/cockroachdb/cockroach">Cockroach</a> -
Scalable, Geo-Replicated, Transactional Datastore.</li>
<li><a href="https://github.com/bloomberg/comdb2">Comdb2</a> - a
clustered RDBMS built on optimistic concurrency control techniques.</li>
<li><a href="http://www.datomic.com/">Datomic</a> - distributed database
designed to enable scalable, flexible and intelligent applications.</li>
<li><a href="https://foundationdb.com/">FoundationDB</a> - distributed
database, inspired by F1.</li>
<li><a href="https://research.google.com/pubs/pub41344.html">Google
F1</a> - distributed SQL database built on Spanner.</li>
<li><a href="https://research.google.com/archive/spanner.html">Google
Spanner</a> - globally distributed semi-relational database.</li>
<li><a href="http://hstore.cs.brown.edu/">H-Store</a> - is an
experimental main-memory, parallel database management system that is
optimized for on-line transaction processing (OLTP) applications.</li>
<li><a href="https://github.com/VCNC/haeinsa">Haeinsa</a> - linearly
scalable multi-row, multi-table transaction library for HBase based on
Percolator.</li>
<li><a
href="https://www.percona.com/doc/percona-server/5.5/performance/handlersocket.html">HandlerSocket</a>
- NoSQL plugin for MySQL/MariaDB.</li>
<li><a href="http://www.infinisql.org/">InfiniSQL</a> - infinitely
scalable RDBMS.</li>
<li><a href="https://github.com/rayokota/kareldb">KarelDB</a> - a
relational database backed by Apache Kafka.</li>
<li><a href="https://www.mapd.com/">Map-D</a> - GPU in-memory database,
big data analysis and visualization platform.</li>
<li><a href="http://www.memsql.com/">MemSQL</a> - in-memory SQL database
with optimized columnar storage on flash.</li>
<li><a href="http://www.nuodb.com/">NuoDB</a> - SQL/ACID compliant
distributed database.</li>
<li><a
href="http://www.oracle.com/technetwork/database/database-technologies/timesten/overview/index.html">Oracle
TimesTen in-Memory Database</a> - in-memory, relational database
management system with persistence and recoverability.</li>
<li><a href="http://gemfirexd.docs.pivotal.io/latest/">Pivotal GemFire
XD</a> - Low-latency, in-memory, distributed SQL data store. Provides
SQL interface to in-memory table data, persistable in HDFS.</li>
<li><a href="https://hana.sap.com/abouthana.html">SAP HANA</a> - is an
in-memory, column-oriented, relational database management system.</li>
<li><a href="http://senseidb.github.io/sensei/">SenseiDB</a> -
distributed, realtime, semi-structured database.</li>
<li><a href="http://skydb.io/">Sky</a> - database used for flexible,
high performance analysis of behavioral data.</li>
<li><a href="http://www.symmetricds.org/">SymmetricDS</a> - open source
software for both file and database synchronization.</li>
<li><a href="https://github.com/pingcap/tidb">TiDB</a> - a distributed
SQL database inspired by the design of Google F1.</li>
<li><a href="https://www.voltdb.com/">VoltDB</a> - claims to be the fastest
in-memory database.</li>
<li><a href="https://github.com/YugaByte/yugabyte-db">yugabyteDB</a> -
open source, high-performance, distributed SQL database compatible with
PostgreSQL.</li>
</ul>
<h2 id="time-series-databases">Time-Series Databases</h2>
<ul>
<li><a
href="http://axibase.com/products/axibase-time-series-database/">Axibase
Time Series Database</a> - Integrated time series database on top of
HBase with built-in visualization, rule-engine and SQL support.</li>
<li><a href="http://chronix.io/">Chronix</a> - a time series storage
built to store time series highly compressed and for fast access
times.</li>
<li><a href="http://square.github.io/cube/">Cube</a> - uses MongoDB to
store time series data.</li>
<li><a href="https://spotify.github.io/heroic/#!/index">Heroic</a> - is
a scalable time series database based on Cassandra and
Elasticsearch.</li>
<li><a href="https://www.influxdata.com/">InfluxDB</a> - a time series
database with optimised IO and queries, supports pgsql and influx wire
protocols.</li>
<li><a href="https://questdb.io/">QuestDB</a> - high-performance,
open-source SQL database for applications in financial services, IoT,
machine learning, DevOps and observability.</li>
<li><a href="https://www.circonus.com/irondb/">IronDB</a> - scalable,
general-purpose time series database.</li>
<li><a href="https://github.com/kairosdb/kairosdb">KairosDB</a> -
similar to OpenTSDB but allows for Cassandra.</li>
<li><a href="http://m3db.github.io/m3/m3db/">M3DB</a> - a distributed
time series database that can be used for storing realtime metrics at
long retention.</li>
<li><a href="https://opennms.github.io/newts/">Newts</a> - a time series
database based on Apache Cassandra.</li>
<li><a href="https://github.com/taosdata/TDengine/">TDengine</a> - a
time series database in C utilizing unique features of IoT to improve
read/write throughput and reduce the space needed to store data.</li>
<li><a href="http://opentsdb.net">OpenTSDB</a> - distributed time series
database on top of HBase.</li>
<li><a href="https://prometheus.io/">Prometheus</a> - a time series
database and service monitoring system.</li>
<li><a href="https://github.com/facebookincubator/beringei">Beringei</a>
- Facebooks in-memory time-series database.</li>
<li><a href="http://traildb.io/">TrailDB</a> - an efficient tool for
storing and querying series of events.</li>
<li><a href="https://github.com/druid-io/druid/">Druid</a> -
column-oriented distributed data store ideal for powering interactive
applications.</li>
<li><a href="http://basho.com/products/riak-ts/">Riak-TS</a> - Riak TS is
the only enterprise-grade NoSQL time series database optimized
specifically for IoT and Time Series data.</li>
<li><a href="https://github.com/akumuli/Akumuli">Akumuli</a> - Akumuli is
a numeric time-series database. It can be used to capture, store and
process time-series data in real-time. The word “akumuli” can be
translated from Esperanto as “accumulate”.</li>
<li><a href="https://github.com/Pardot/Rhombus">Rhombus</a> - A
time-series object store for Cassandra that handles all the complexity
of building wide row indexes.</li>
<li><a href="https://github.com/dalmatinerdb/dalmatinerdb">Dalmatiner
DB</a> - Fast distributed metrics database.</li>
<li><a href="https://github.com/rackerlabs/blueflood">Blueflood</a> - A
distributed system designed to ingest and process time series data.</li>
<li><a
href="https://github.com/NationalSecurityAgency/timely">Timely</a> -
Timely is a time series database application that provides secure access
to time series data based on Accumulo and Grafana.</li>
<li><a
href="https://github.com/transceptor-technology/siridb-server">SiriDB</a> -
Highly-scalable, robust and fast, open source time series database with
cluster functionality.</li>
<li><a href="https://github.com/improbable-eng/thanos">Thanos</a> -
Thanos is a set of components to create a highly available metric system
with unlimited storage capacity using multiple (existing) Prometheus
deployments.</li>
<li><a
href="https://github.com/VictoriaMetrics/VictoriaMetrics">VictoriaMetrics</a>
- fast, scalable and resource-effective open-source TSDB compatible with
Prometheus. Single-node and cluster versions included.</li>
</ul>
<h2 id="sql-like-processing">SQL-like processing</h2>
<ul>
<li><a
href="http://www.actian.com/analytic-database/vectorh-sql-hadoop">Actian
SQL for Hadoop</a> - high performance interactive SQL access to all
Hadoop data.</li>
<li><a href="http://drill.apache.org/">Apache Drill</a> - framework for
interactive analysis, inspired by Dremel.</li>
<li><a
href="https://cwiki.apache.org/confluence/display/Hive/HCatalog">Apache
HCatalog</a> - table and storage management layer for Hadoop.</li>
<li><a href="http://hive.apache.org/">Apache Hive</a> - SQL-like data
warehouse system for Hadoop.</li>
<li><a href="http://calcite.apache.org/">Apache Calcite</a> - framework
that allows efficient translation of queries involving heterogeneous and
federated data.</li>
<li><a href="http://phoenix.apache.org/index.html">Apache Phoenix</a> -
SQL skin over HBase.</li>
<li><a
href="http://www.teradata.com/products-and-services/Teradata-Aster/teradata-aster-database">Aster
Database</a> - SQL-like analytic processing for MapReduce.</li>
<li><a
href="https://www.cloudera.com/products/apache-hadoop/impala.html">Cloudera
Impala</a> - framework for interactive analysis, inspired by
Dremel.</li>
<li><a href="http://www.cascading.org/projects/lingual/">Concurrent
Lingual</a> - SQL-like query language for Cascading.</li>
<li><a href="http://www.datasalt.com/products/splout-sql/">Datasalt
Splout SQL</a> - full SQL query engine for big datasets.</li>
<li><a href="https://www.dremio.com/">Dremio</a> - an open-source,
SQL-like Data-as-a-Service Platform based on Apache Arrow.</li>
<li><a href="https://prestodb.io/">Facebook PrestoDB</a> - distributed
SQL query engine.</li>
<li><a href="https://research.google.com/pubs/pub36632.html">Google
BigQuery</a> - framework for interactive analysis, implementation of
Dremel.</li>
<li><a href="https://iceberg.apache.org/">Iceberg</a> - an open table
format for huge analytic datasets. Iceberg adds tables to Trino and
Spark that use a high-performance format that works just like a SQL
table.</li>
<li><a
href="https://github.com/materializeinc/materialize">Materialize</a> -
is a streaming database for real-time applications using SQL for queries
and supporting a large fraction of PostgreSQL.</li>
<li><a
href="https://documentation.invantive.com/2017R2/invantive-sql-grammar/invantive-sql-grammar-17.30.html">Invantive
SQL</a> - SQL engine for online and on-premise use with integrated local
data replication and 70+ connectors.</li>
<li><a href="https://www.pipelinedb.com/">PipelineDB</a> - an
open-source relational database that runs SQL queries continuously on
streams, incrementally storing results in tables.</li>
<li><a href="https://pivotal.io/pivotal-hdb">Pivotal HDB</a> - SQL-like
data warehouse system for Hadoop.</li>
<li><a
href="http://rainstor.com/products/rainstor-database/">RainstorDB</a> -
database for storing petabyte-scale volumes of structured and
semi-structured data.</li>
<li><a href="https://github.com/apache/spark/tree/master/sql">Spark
Catalyst</a> - a query optimization framework for Spark and
Shark.</li>
<li><a
href="https://databricks.com/blog/2014/03/26/spark-sql-manipulating-structured-data-using-spark-2.html">SparkSQL</a>
- Manipulating Structured Data Using Spark.</li>
<li><a href="https://www.splicemachine.com/">Splice Machine</a> - a
full-featured SQL-on-Hadoop RDBMS with ACID transactions.</li>
<li><a href="https://hortonworks.com/innovation/stinger/">Stinger</a> -
interactive query for Hive.</li>
<li><a href="http://tajo.apache.org/">Tajo</a> - distributed data
warehouse system on Hadoop.</li>
<li><a
href="https://wiki.trafodion.org/wiki/index.php/Main_Page">Trafodion</a>
- enterprise-class SQL-on-HBase solution targeting big data
transactional or operational workloads.</li>
</ul>
<h2 id="data-ingestion">Data Ingestion</h2>
<ul>
<li><a href="https://vectorized.io/redpanda">redpanda</a> - A Kafka®
replacement for mission critical systems; 10x faster. Written in
C++.</li>
<li><a href="https://aws.amazon.com/kinesis/">Amazon Kinesis</a> -
real-time processing of streaming data at massive scale.</li>
<li><a href="https://aws.amazon.com/glue/">Amazon Web Services Glue</a>
- serverless fully managed extract, transform, and load (ETL)
service.</li>
<li><a href="https://getcensus.com/">Census</a> - A reverse ETL product
that lets you sync data from your data warehouse to SaaS applications. No
engineering favors required, just SQL.</li>
<li><a href="http://chukwa.apache.org/">Apache Chukwa</a> - data
collection system.</li>
<li><a href="http://flume.apache.org/">Apache Flume</a> - service to
manage large amounts of log data.</li>
<li><a href="http://kafka.apache.org/">Apache Kafka</a> - distributed
publish-subscribe messaging system.</li>
<li><a href="https://nifi.apache.org/">Apache NiFi</a> - Apache NiFi is
an integrated data logistics platform for automating the movement of
data between disparate systems.</li>
<li><a href="https://github.com/apache/pulsar">Apache Pulsar</a> - a
distributed pub-sub messaging platform with a very flexible messaging
model and an intuitive client API.</li>
<li><a href="http://sqoop.apache.org/">Apache Sqoop</a> - tool to
transfer data between Hadoop and a structured datastore.</li>
<li><a href="http://www.embulk.org">Embulk</a> - open-source bulk data
loader that helps data transfer between various databases, storages,
file formats, and cloud services.</li>
<li><a href="https://github.com/facebookarchive/scribe">Facebook
Scribe</a> - streamed log data aggregator.</li>
<li><a href="http://www.fluentd.org">Fluentd</a> - tool to collect
events and logs.</li>
<li><a href="https://github.com/gazette/core">Gazette</a> - Distributed
streaming infrastructure built on cloud storage which makes it easy to
mix and match batch and streaming paradigms.</li>
<li><a href="https://research.google.com/pubs/pub41318.html">Google
Photon</a> - geographically distributed system for joining multiple
continuously flowing streams of data in real-time with high scalability
and low latency.</li>
<li><a href="https://github.com/mozilla-services/heka">Heka</a> - open
source stream processing software system.</li>
<li><a href="https://github.com/sonalgoyal/hiho">HIHO</a> - framework
for connecting disparate data sources with Hadoop.</li>
<li><a href="https://github.com/papertrail/kestrel">Kestrel</a> -
distributed message queue system.</li>
<li><a href="https://engineering.linkedin.com/data">LinkedIn Databus</a>
- stream of change capture events for a database.</li>
<li><a href="https://github.com/linkedin/kamikaze">LinkedIn Kamikaze</a>
- utility package for compressing sorted integer arrays.</li>
<li><a href="https://github.com/linkedin/white-elephant">LinkedIn White
Elephant</a> - log aggregator and dashboard.</li>
<li><a href="https://www.elastic.co/products/logstash">Logstash</a> - a
tool for managing events and logs.</li>
<li><a href="https://github.com/Netflix/suro">Netflix Suro</a> - log
aggregator like Storm and Samza, based on Chukwa.</li>
<li><a href="https://github.com/pinterest/secor">Pinterest Secor</a> -
a service implementing Kafka log persistence.</li>
<li><a href="https://github.com/linkedin/gobblin">LinkedIn Gobblin</a> -
LinkedIn’s universal data ingestion framework.</li>
<li><a href="https://github.com/skizzehq/skizze">Skizze</a> - sketch
data store to deal with all problems around counting and sketching using
probabilistic data-structures.</li>
<li><a href="https://github.com/streamsets/datacollector">StreamSets
Data Collector</a> - continuous big data ingest infrastructure with a
simple to use IDE.</li>
<li><a href="https://www.alooma.com/integrations/mysql">Alooma</a> -
data pipeline as a service enabling moving data sources such as MySQL
into data warehouses.</li>
<li><a
href="https://github.com/rudderlabs/rudder-server">RudderStack</a> - an
open source customer data infrastructure (Segment, mParticle
alternative) written in Go.</li>
<li><a href="https://github.com/aklivity/zilla">Zilla</a> - An API
gateway built for event-driven architectures and streaming that supports
standard protocols such as HTTP, SSE, gRPC, MQTT and the native Kafka
protocol.</li>
</ul>
<h2 id="service-programming">Service Programming</h2>
<ul>
<li><a href="http://akka.io/">Akka Toolkit</a> - runtime for
distributed and fault-tolerant event-driven applications on the
JVM.</li>
<li><a href="http://avro.apache.org/">Apache Avro</a> - data
serialization system.</li>
<li><a href="http://curator.apache.org/">Apache Curator</a> - Java
libraries for Apache ZooKeeper.</li>
<li><a href="http://karaf.apache.org/">Apache Karaf</a> - OSGi runtime
that runs on top of any OSGi framework.</li>
<li><a href="http://thrift.apache.org//">Apache Thrift</a> - framework
to build binary protocols.</li>
<li><a href="http://zookeeper.apache.org/">Apache ZooKeeper</a> -
centralized service for process management.</li>
<li><a href="https://research.google.com/archive/chubby.html">Google
Chubby</a> - a lock service for loosely-coupled distributed
systems.</li>
<li><a href="https://github.com/Hydrospheredata/mist">Hydrosphere
Mist</a> - a service for exposing Apache Spark analytics jobs and
machine learning models as realtime, batch or reactive web
services.</li>
<li><a href="https://engineering.linkedin.com/data">LinkedIn Norbert</a>
- cluster manager.</li>
<li><a href="https://github.com/mara/data-integration">Mara</a> - A
lightweight opinionated ETL framework, halfway between plain scripts and
Apache Airflow.</li>
<li><a href="https://www.open-mpi.org/">OpenMPI</a> - message passing
framework.</li>
<li><a href="https://www.serf.io/">Serf</a> - decentralized solution for
service discovery and orchestration.</li>
<li><a href="https://github.com/spotify/luigi">Spotify Luigi</a> - a
Python package for building complex pipelines of batch jobs. It handles
dependency resolution, workflow management, visualization, handling
failures, command line integration, and much more.</li>
<li><a href="https://github.com/spring-projects/spring-xd">Spring XD</a>
- distributed and extensible system for data ingestion, real time
analytics, batch processing, and data export.</li>
<li><a href="https://github.com/twitter/elephant-bird">Twitter Elephant
Bird</a> - libraries for working with LZOP-compressed data.</li>
<li><a href="https://twitter.github.io/finagle/">Twitter Finagle</a> -
asynchronous network stack for the JVM.</li>
</ul>
<h2 id="scheduling">Scheduling</h2>
<ul>
<li><a href="https://github.com/apache/incubator-airflow">Apache
Airflow</a> - a platform to programmatically author, schedule and
monitor workflows.</li>
<li><a href="http://aurora.apache.org/">Apache Aurora</a> - is a service
scheduler that runs on top of Apache Mesos.</li>
<li><a href="http://falcon.apache.org/">Apache Falcon</a> - data
management framework.</li>
<li><a href="http://oozie.apache.org/">Apache Oozie</a> - workflow job
scheduler.</li>
<li><a
href="https://docs.microsoft.com/en-us/azure/data-factory/data-factory-introduction">Azure
Data Factory</a> - cloud-based pipeline orchestration for on-prem, cloud
and HDInsight.</li>
<li><a href="http://mesos.github.io/chronos/">Chronos</a> - distributed
and fault-tolerant scheduler.</li>
<li><a href="https://github.com/jhuckaby/Cronicle">Cronicle</a> -
distributed, easy-to-install, Node.js-based task scheduler.</li>
<li><a href="https://github.com/dagster-io/dagster">Dagster</a> - a data
orchestrator for machine learning, analytics, and ETL.</li>
<li><a href="https://azkaban.github.io/">LinkedIn Azkaban</a> - batch
workflow job scheduler.</li>
<li><a href="https://github.com/ottogroup/schedoscope">Schedoscope</a> -
Scala DSL for agile scheduling of Hadoop jobs.</li>
<li><a href="https://github.com/radlab/sparrow">Sparrow</a> - scheduling
platform.</li>
</ul>
<h2 id="machine-learning">Machine Learning</h2>
<ul>
<li><a href="https://studio.azureml.net/">Azure ML Studio</a> -
Cloud-based AzureML, R, Python Machine Learning platform.</li>
<li><a href="https://github.com/harthur/brain">brain</a> - Neural
networks in JavaScript.</li>
<li><a href="https://github.com/OryxProject/oryx">Oryx</a> - Lambda
architecture on Apache Spark, Apache Kafka for real-time large scale
machine learning.</li>
<li><a href="http://www.cascading.org/projects/pattern/">Concurrent
Pattern</a> - machine learning library for Cascading.</li>
<li><a href="https://github.com/karpathy/convnetjs">convnetjs</a> - Deep
Learning in JavaScript. Train Convolutional Neural Networks (or ordinary
ones) in your browser.</li>
<li><a href="https://github.com/deeplearning4j/DataVec">DataVec</a> - A
vectorization and data preprocessing library for deep learning in Java
and Scala. Part of the Deeplearning4j ecosystem.</li>
<li><a href="https://github.com/deeplearning4j">Deeplearning4j</a> -
Fast, open deep learning for the JVM (Java, Scala, Clojure). A neural
network configuration layer powered by a C++ library. Uses Spark and
Hadoop to train nets on multiple GPUs and CPUs.</li>
<li><a href="https://github.com/danielsdeleo/Decider">Decider</a> -
Flexible and Extensible Machine Learning in Ruby.</li>
<li><a href="http://www.heatonresearch.com/encog/">ENCOG</a> - machine
learning framework that supports a variety of advanced algorithms, as
well as support classes to normalize and process data.</li>
<li><a href="http://www.etcml.com/">etcML</a> - text classification with
machine learning.</li>
<li><a href="https://github.com/etsy/Conjecture">Etsy Conjecture</a> -
scalable Machine Learning in Scalding.</li>
<li><a href="https://github.com/gojek/feast">Feast</a> - A feature store
for the management, discovery, and access of machine learning features.
Feast provides a consistent view of feature data for both model training
and model serving.</li>
<li><a href="https://dato.com/products/create/">GraphLab Create</a> - A
machine learning platform in Python with a broad collection of ML
toolkits, data engineering, and deployment tools.</li>
<li><a href="https://github.com/h2oai/h2o-3/">H2O</a> - statistical,
machine learning and math runtime for Hadoop, with R and Python APIs.</li>
<li><a href="https://github.com/benedekrozemberczki/karateclub">Karate
Club</a> - An unsupervised machine learning library for graph-structured
data, written in Python.</li>
<li><a href="https://github.com/fchollet/keras">Keras</a> - An intuitive
neural net API inspired by Torch that runs atop Theano and
TensorFlow.</li>
<li><a href="https://github.com/johnsonc/lambdo">Lambdo</a> - Lambdo is
a workflow engine which significantly simplifies the analysis process by
unifying feature engineering and machine learning operations.</li>
<li><a
href="https://github.com/benedekrozemberczki/littleballoffur">Little
Ball of Fur</a> - A subsampling library for graph-structured data,
written in Python.</li>
<li><a href="http://mahout.apache.org/">Mahout</a> - An Apache-backed
machine learning library for Hadoop.</li>
<li><a href="http://www.mlbase.org/">MLbase</a> - distributed machine
learning libraries for the BDAS stack.</li>
<li><a
href="https://github.com/nikolaypavlov/MLPNeuralNet">MLPNeuralNet</a> -
Fast multilayer perceptron neural network library for iOS and Mac OS
X.</li>
<li><a href="https://github.com/ml-tooling/ml-workspace">ML
Workspace</a> - All-in-one web-based IDE specialized for machine
learning and data science.</li>
<li><a href="http://moa.cms.waikato.ac.nz">MOA</a> - MOA performs big
data stream mining in real time, and large scale machine learning.</li>
<li><a href="https://monkeylearn.com/">MonkeyLearn</a> - Text mining
made easy. Extract and classify data from text.</li>
<li><a href="https://github.com/deeplearning4j/nd4j">ND4J</a> - A matrix
library for the JVM. Numpy for Java.</li>
<li><a href="https://github.com/numenta/nupic">nupic</a> - Numenta
Platform for Intelligent Computing: a brain-inspired machine
intelligence platform, and biologically accurate neural network based on
cortical learning algorithms.</li>
<li><a
href="http://predictionio.incubator.apache.org/index.html">PredictionIO</a>
- machine learning server built on Hadoop, Mahout and Cascading.</li>
<li><a
href="https://github.com/benedekrozemberczki/pytorch_geometric_temporal">PyTorch
Geometric Temporal</a> - a temporal extension library for PyTorch
Geometric.</li>
<li><a href="https://github.com/deeplearning4j/rl4j">RL4J</a> -
Reinforcement learning for Java and Scala. Includes Deep-Q learning and
A3C algorithms, and integrates with OpenAI’s Gym. Runs in the
Deeplearning4j ecosystem.</li>
<li><a href="http://samoa.incubator.apache.org/">SAMOA</a> - distributed
streaming machine learning framework.</li>
<li><a
href="https://github.com/scikit-learn/scikit-learn">scikit-learn</a> -
machine learning in Python.</li>
<li><a href="https://github.com/benedekrozemberczki/shapley">Shapley</a>
- A data-driven framework to quantify the value of classifiers in a
machine learning ensemble.</li>
<li><a href="http://spark.apache.org/docs/0.9.0/mllib-guide.html">Spark
MLlib</a> - a Spark implementation of some common machine learning (ML)
functionality.</li>
<li><a
href="https://users.soe.ucsc.edu/~niejiazhong/slides/chandra.pdf">Sibyl</a>
- System for Large Scale Machine Learning at Google.</li>
<li><a href="https://github.com/tensorflow/tensorflow">TensorFlow</a> -
Library from Google for machine learning using data flow graphs.</li>
<li><a href="https://github.com/theano">Theano</a> - A Python-focused
machine learning library supported by the University of Montreal.</li>
<li><a href="https://github.com/torch">Torch</a> - A deep learning
library with a Lua API, supported by NYU and Facebook.</li>
<li><a href="https://github.com/amplab/velox-modelserver">Velox</a> -
System for serving machine learning predictions.</li>
<li><a href="https://github.com/JohnLangford/vowpal_wabbit/wiki">Vowpal
Wabbit</a> - learning system sponsored by Microsoft and Yahoo!.</li>
<li><a href="http://www.cs.waikato.ac.nz/ml/weka/">WEKA</a> - suite of
machine learning software.</li>
<li><a href="https://github.com/BIDData/BIDMach">BIDMach</a> - CPU and
GPU-accelerated Machine Learning Library.</li>
</ul>
<h2 id="benchmarking">Benchmarking</h2>
<ul>
<li><a
href="https://issues.apache.org/jira/browse/MAPREDUCE-3561">Apache
Hadoop Benchmarking</a> - micro-benchmarks for testing Hadoop
performance.</li>
<li><a href="https://github.com/SWIMProjectUCB/SWIM/wiki">Berkeley SWIM
Benchmark</a> - real-world big data workload benchmark.</li>
<li><a href="https://github.com/intel-hadoop/HiBench">Intel HiBench</a>
- a Hadoop benchmark suite.</li>
<li><a href="https://issues.apache.org/jira/browse/MAPREDUCE-5116">PUMA
Benchmarking</a> - benchmark suite for MapReduce applications.</li>
<li><a
href="http://yahoohadoop.tumblr.com/post/98294079296/gridmix3-emulating-production-workload-for">Yahoo
Gridmix3</a> - Hadoop cluster benchmarking from the Yahoo engineering
team.</li>
<li><a
href="https://github.com/deeplearning4j/dl4j-benchmark">Deeplearning4j
Benchmarks</a></li>
<li><a href="https://github.com/unum-cloud/ucsb">UCSB</a> - extended
Yahoo Cloud Serving Benchmark for NoSQL databases.</li>
</ul>
<h2 id="security">Security</h2>
<ul>
<li><a href="http://ranger.apache.org/">Apache Ranger</a> - Central
security admin &amp; fine-grained authorization for Hadoop.</li>
<li><a href="http://eagle.apache.org/">Apache Eagle</a> - real-time
monitoring solution.</li>
<li><a href="http://knox.apache.org/">Apache Knox Gateway</a> - single
point of secure access for Hadoop clusters.</li>
<li><a href="http://incubator.apache.org/projects/sentry.html">Apache
Sentry</a> - security module for data stored in Hadoop.</li>
<li><a href="https://github.com/kotobukki/BDA/">BDA</a> - a
vulnerability detector for Hadoop and Spark.</li>
</ul>
<h2 id="system-deployment">System Deployment</h2>
<ul>
<li><a href="http://ambari.apache.org/">Apache Ambari</a> - operational
framework for Hadoop management.</li>
<li><a href="http://bigtop.apache.org//">Apache Bigtop</a> - system
deployment framework for the Hadoop ecosystem.</li>
<li><a href="http://helix.apache.org/">Apache Helix</a> - cluster
management framework.</li>
<li><a href="http://mesos.apache.org/">Apache Mesos</a> - cluster
manager.</li>
<li><a href="https://github.com/apache/incubator-slider">Apache
Slider</a> - is a YARN application to deploy existing distributed
applications on YARN.</li>
<li><a href="http://whirr.apache.org/">Apache Whirr</a> - set of
libraries for running cloud services.</li>
<li><a href="https://hortonworks.com/hadoop/yarn/">Apache YARN</a> -
Cluster manager.</li>
<li><a href="http://brooklyncentral.github.io/">Brooklyn</a> - library
that simplifies application deployment and management.</li>
<li><a href="http://buildoop.github.io/">Buildoop</a> - similar to
Apache Bigtop, based on the Groovy language.</li>
<li><a href="http://gethue.com/">Cloudera HUE</a> - web application for
interacting with Hadoop.</li>
<li><a href="http://www.wired.com/2012/08/facebook-prism/">Facebook
Prism</a> - multi-datacenter replication system.</li>
<li><a
href="https://www.wired.com/2013/03/google-borg-twitter-mesos/all/">Google
Borg</a> - job scheduling and monitoring system.</li>
<li><a href="https://www.youtube.com/watch?v=0ZFMlO98Jkc">Google
Omega</a> - job scheduling and monitoring system.</li>
<li><a
href="https://hortonworks.com/blog/introducing-hoya-hbase-on-yarn/">Hortonworks
HOYA</a> - application that can deploy an HBase cluster on YARN.</li>
<li><a href="https://kubernetes.io/">Kubernetes</a> - a system for
automating deployment, scaling, and management of containerized
applications.</li>
<li><a href="https://github.com/mesosphere/marathon">Marathon</a> -
Mesos framework for long-running services.</li>
<li><a href="https://github.com/WeBankFinTech/Linkis">Linkis</a> -
Linkis helps easily connect to various back-end computation/storage
engines.</li>
</ul>
<h2 id="applications">Applications</h2>
<ul>
<li><a href="https://github.com/etsy/411">411</a> - a web application
for alert management resulting from scheduled searches into
Elasticsearch.</li>
<li><a href="https://github.com/adobe-research/spindle">Adobe
spindle</a> - Next-generation web analytics processing with Scala,
Spark, and Parquet.</li>
<li><a href="http://metron.apache.org/">Apache Metron</a> - a platform
that integrates a variety of open source big data technologies in order
to offer a centralized tool for security monitoring and analysis.</li>
<li><a href="http://nutch.apache.org/">Apache Nutch</a> - open source
web crawler.</li>
<li><a href="http://oodt.apache.org/">Apache OODT</a> - capturing,
processing and sharing of data for NASAs scientific archives.</li>
<li><a href="https://tika.apache.org/">Apache Tika</a> - content
analysis toolkit.</li>
<li><a href="https://github.com/salesforce/Argus">Argus</a> - Time
series monitoring and alerting platform.</li>
<li><a href="https://github.com/uber/AthenaX">AthenaX</a> - a streaming
analytics platform that enables users to run production-quality, large
scale streaming analytics using Structured Query Language (SQL).</li>
<li><a href="https://github.com/Netflix/atlas">Atlas</a> - a backend for
managing dimensional time series data.</li>
<li><a href="https://count.ly/">Countly</a> - open source mobile and web
analytics platform, based on Node.js &amp; MongoDB.</li>
<li><a href="https://www.comet.com/site/">Comet</a> - Comet provides an
end-to-end model evaluation platform for AI developers, with best in
class LLM evaluations, experiment tracking, and production
monitoring.</li>
<li><a href="https://www.dominodatalab.com/">Domino</a> - Run, scale,
share, and deploy models — without any infrastructure.</li>
<li><a href="http://www.eclipse.org/birt/">Eclipse BIRT</a> -
Eclipse-based reporting system.</li>
<li><a href="https://github.com/Yelp/elastalert">ElastAlert</a> -
a simple framework for alerting on anomalies, spikes, or
other patterns of interest from data in Elasticsearch.</li>
<li><a href="https://github.com/Codecademy/EventHub">Eventhub</a> - open
source event analytics platform.</li>
<li><a href="https://hash.ai">HASH</a> - open source simulation and
visualization platform.</li>
<li><a href="https://github.com/allegro/hermes">Hermes</a> -
asynchronous message broker built on top of Kafka.</li>
<li><a href="https://www.splunk.com/en_us/download/hunk.html">Hunk</a> -
Splunk analytics for Hadoop.</li>
<li><a href="http://opensource.indeedeng.io/imhotep/">Imhotep</a> -
Large scale analytics platform by Indeed.</li>
<li><a href="https://www.indicative.com/">Indicative</a> - Web &amp;
mobile analytics tool, with data warehouse (AWS, BigQuery)
integration.</li>
<li><a href="https://jupyter.org/">Jupyter</a> - Notebook and project
application for interactive data science and scientific computing across
all programming languages.</li>
<li><a href="http://madlib.incubator.apache.org/community/">MADlib</a> -
data-processing library for analyzing data inside an RDBMS.</li>
<li><a href="https://github.com/influxdata/kapacitor">Kapacitor</a> - an
open source framework for processing, monitoring, and alerting on time
series data.</li>
<li><a href="http://kylin.apache.org/">Kylin</a> - open source
Distributed Analytics Engine from eBay.</li>
<li><a href="https://github.com/pivotalsoftware/PivotalR">PivotalR</a> -
R on Pivotal HD / HAWQ and PostgreSQL.</li>
<li><a href="https://www.comet.com/site/products/opik/">Opik</a> -
Debug, evaluate, and monitor your LLM applications, RAG systems, and
agentic workflows with comprehensive tracing, automated evaluations, and
production-ready dashboards.</li>
<li><a href="https://github.com/rakam-io/rakam">Rakam</a> - open-source
real-time custom analytics platform powered by PostgreSQL, Kinesis and
PrestoDB.</li>
<li><a href="https://www.qubole.com/">Qubole</a> - auto-scaling Hadoop
cluster, built-in data connectors.</li>
<li><a href="https://github.com/SnappyDataInc/snappydata">SnappyData</a>
- a distributed in-memory data store for real-time operational
analytics, delivering stream analytics, OLTP (online transaction
processing) and OLAP (online analytical processing) built on Spark in a
single integrated cluster.</li>
<li><a href="https://github.com/snowplow/snowplow">Snowplow</a> -
enterprise-strength web and event analytics, powered by Hadoop, Kinesis,
Redshift and Postgres.</li>
<li><a href="http://amplab-extras.github.io/SparkR-pkg/">SparkR</a> - R
frontend for Spark.</li>
<li><a href="https://www.splunk.com/">Splunk</a> - analyzer for
machine-generated data.</li>
<li><a href="https://www.sumologic.com/">Sumo Logic</a> - cloud based
analyzer for machine-generated data.</li>
<li><a href="https://github.com/brexhq/substation">Substation</a> -
Substation is a cloud native data pipeline and transformation toolkit
written in Go.</li>
<li><a href="http://www.talend.com/products/big-data/">Talend</a> -
unified open source environment for YARN, Hadoop, HBase, Hive, HCatalog
&amp; Pig.</li>
</ul>
<h2 id="search-engine-and-framework">Search engine and framework</h2>
<ul>
<li><a href="http://lucene.apache.org/">Apache Lucene</a> - Search
engine library.</li>
<li><a href="http://lucene.apache.org/solr/">Apache Solr</a> - Search
platform for Apache Lucene.</li>
<li><a href="https://github.com/strapdata/elassandra">Elassandra</a> -
is a fork of Elasticsearch modified to run on top of Apache Cassandra in
a scalable and resilient peer-to-peer architecture.</li>
<li><a href="https://www.elastic.co/">ElasticSearch</a> - Search and
analytics engine based on Apache Lucene.</li>
<li><a href="https://www.enigma.com/">Enigma.io</a> - Freemium robust
web application for exploring, filtering, analyzing, searching and
exporting massive datasets scraped from across the Web.</li>
<li><a
href="https://googleblog.blogspot.it/2010/06/our-new-search-index-caffeine.html">Google
Caffeine</a> - continuous indexing system.</li>
<li><a href="https://research.google.com/pubs/pub36726.html">Google
Percolator</a> - continuous indexing system.</li>
<li><a
href="https://blogs.apache.org/hbase/entry/coprocessor_introduction">HBase
Coprocessor</a> - implementation of Percolator, part of HBase.</li>
<li><a href="http://ngdata.github.io/hbase-indexer/">Lily HBase
Indexer</a> - quickly and easily search for any content stored in
HBase.</li>
<li><a href="http://senseidb.github.io/bobo/">LinkedIn Bobo</a> - is a
Faceted Search implementation written purely in Java, an extension to
Apache Lucene.</li>
<li><a href="https://github.com/linkedin/cleo">LinkedIn Cleo</a> - is a
flexible software library for enabling rapid development of partial,
out-of-order and real-time typeahead search.</li>
<li><a
href="https://engineering.linkedin.com/search/did-you-mean-galene">LinkedIn
Galene</a> - search architecture at LinkedIn.</li>
<li><a href="https://github.com/senseidb/zoie">LinkedIn Zoie</a> - is a
realtime search/indexing system written in Java.</li>
<li><a href="http://mg4j.di.unimi.it/">MG4J</a> - MG4J (Managing
Gigabytes for Java) is a full-text search engine for large document
collections written in Java. It is highly customisable, high-performance
and provides state-of-the-art features and new research algorithms.</li>
<li><a href="http://sphinxsearch.com/">Sphinx Search Server</a> -
fulltext search engine.</li>
<li><a href="http://vespa.ai/">Vespa</a> - is an engine for low-latency
computation over large data sets. It stores and indexes your data such
that queries, selection and processing over the data can be performed at
serving time.</li>
<li><a href="https://github.com/facebookresearch/faiss">Facebook
Faiss</a> - is a library for efficient similarity search and clustering
of dense vectors. It contains algorithms that search in sets of vectors
of any size, up to ones that possibly do not fit in RAM. It also
contains supporting code for evaluation and parameter tuning. Faiss is
written in C++ with complete wrappers for Python/numpy.</li>
<li><a href="https://github.com/spotify/annoy">Annoy</a> - is a C++
library with Python bindings to search for points in space that are
close to a given query point. It also creates large read-only file-based
data structures that are mmapped into memory so that many processes may
share the same data.</li>
<li><a href="https://github.com/semi-technologies/weaviate">Weaviate</a>
- Weaviate is a GraphQL-based semantic search engine with built-in
(word) embeddings.</li>
</ul>
<h2 id="mysql-forks-and-evolutions">MySQL forks and evolutions</h2>
<ul>
<li><a href="https://aws.amazon.com/rds/">Amazon RDS</a> - MySQL
databases in Amazon's cloud.</li>
<li><a href="http://www.drizzle.org/">Drizzle</a> - evolution of MySQL
6.0.</li>
<li><a href="https://cloud.google.com/sql/docs/">Google Cloud SQL</a> -
MySQL databases in Google's cloud.</li>
<li><a href="https://mariadb.org/">MariaDB</a> - enhanced, drop-in
replacement for MySQL.</li>
<li><a href="https://www.mysql.com/products/cluster/">MySQL Cluster</a>
- MySQL implementation using NDB Cluster storage engine.</li>
<li><a
href="https://www.percona.com/software/mysql-database/percona-server">Percona
Server</a> - enhanced, drop-in replacement for MySQL.</li>
<li><a href="https://github.com/renecannao/proxysql">ProxySQL</a> - High
Performance Proxy for MySQL.</li>
<li><a href="https://www.percona.com/">TokuDB</a> - TokuDB is a storage
engine for MySQL and MariaDB.</li>
<li><a href="http://webscalesql.org/">WebScaleSQL</a> - is a
collaboration among engineers from several companies that face similar
challenges in running MySQL at scale.</li>
</ul>
<h2 id="postgresql-forks-and-evolutions">PostgreSQL forks and
evolutions</h2>
<ul>
<li><a href="http://db.cs.yale.edu/hadoopdb/hadoopdb.html">HadoopDB</a>
- hybrid of MapReduce and DBMS.</li>
<li><a href="http://www-01.ibm.com/software/data/netezza/">IBM
Netezza</a> - high-performance data warehouse appliances.</li>
<li><a href="http://www.postgres-xl.org/">Postgres-XL</a> - Scalable
Open Source PostgreSQL-based Database Cluster.</li>
<li><a href="http://www-users.cs.umn.edu/~sarwat/RecDB/">RecDB</a> -
Open Source Recommendation Engine Built Entirely Inside PostgreSQL.</li>
<li><a href="http://www.stormdb.com/community/stado">Stado</a> - open
source MPP database system solely targeted at data warehousing and data
mart applications.</li>
<li><a
href="https://www.scribd.com/doc/3159239/70-Everest-PGCon-RT">Yahoo
Everest</a> - multi-petabyte database / MPP derived from PostgreSQL.</li>
<li><a href="http://www.timescale.com/">TimescaleDB</a> - An open-source
time-series database optimized for fast ingest and complex queries</li>
<li><a href="https://www.pipelinedb.com/">PipelineDB</a> - The Streaming
SQL Database. An open-source relational database that runs SQL queries
continuously on streams, incrementally storing results in tables</li>
</ul>
<h2 id="memcached-forks-and-evolutions">Memcached forks and
evolutions</h2>
<ul>
<li><a
href="https://www.facebook.com/notes/facebook-engineering/mcdipper-a-key-value-cache-for-flash-storage/10151347090423920">Facebook
McDipper</a> - key/value cache for flash storage.</li>
<li><a
href="https://www.facebook.com/notes/facebook-engineering/scaling-memcache-at-facebook/10151411410803920">Facebook
Memcached</a> - fork of Memcached.</li>
<li><a href="https://github.com/twitter/twemproxy">Twemproxy</a> - A
fast, light-weight proxy for memcached and redis.</li>
<li><a href="https://github.com/twitter/fatcache">Twitter Fatcache</a> -
key/value cache for flash storage.</li>
<li><a href="https://github.com/twitter/twemcache">Twitter Twemcache</a>
- fork of Memcached.</li>
</ul>
<h2 id="embedded-databases">Embedded Databases</h2>
<ul>
<li><a
href="http://www.actian.com/products/operational-databases/">Actian
PSQL</a> - ACID-compliant DBMS developed by Pervasive Software,
optimized for embedding in applications.</li>
<li><a
href="https://www.oracle.com/database/berkeley-db/index.html">BerkeleyDB</a>
- a software library that provides a high-performance embedded database
for key/value data.</li>
<li><a href="https://github.com/krestenkrab/hanoidb">HanoiDB</a> -
Erlang LSM BTree Storage.</li>
<li><a href="https://github.com/google/leveldb">LevelDB</a> - a fast
key-value storage library written at Google that provides an ordered
mapping from string keys to string values.</li>
<li><a href="https://symas.com/mdb/">LMDB</a> - ultra-fast,
ultra-compact key-value embedded data store developed by Symas.</li>
<li><a href="http://rocksdb.org/">RocksDB</a> - embeddable persistent
key-value store for fast storage based on LevelDB.</li>
</ul>
<h2 id="business-intelligence">Business Intelligence</h2>
<ul>
<li><a href="https://www.bimeanalytics.com/?lang=en">BIME Analytics</a>
- business intelligence platform in the cloud.</li>
<li><a href="https://github.com/ankane/blazer">Blazer</a> - business
intelligence made simple.</li>
<li><a href="https://chartio.com">Chartio</a> - lean business
intelligence platform to visualize and explore your data.</li>
<li><a href="https://count.co">Count</a> - notebook-based analytics and
visualisation platform using SQL or drag-and-drop.</li>
<li><a href="https://www.datapine.com/">datapine</a> - self-service
business intelligence tool in the cloud.</li>
<li><a href="https://dekart.xyz/">Dekart</a> - Large scale geospatial
analytics for Google BigQuery based on Kepler.gl.</li>
<li><a href="https://www.gooddata.com/">GoodData</a> - platform for data
products and embedded analytics.</li>
<li><a href="https://www.jaspersoft.com/">Jaspersoft</a> - powerful
business intelligence suite.</li>
<li><a href="https://www.jedox.com/en/">Jedox Palo</a> - customisable
Business Intelligence platform.</li>
<li><a href="https://jethro.io/">Jethrodata</a> - Interactive Big Data
Analytics.</li>
<li><a href="https://intermix.io/">intermix.io</a> - Performance
Monitoring for Amazon Redshift</li>
<li><a href="https://github.com/lightdash/lightdash">Lightdash</a> - The
open source Looker alternative built on dbt</li>
<li><a href="https://github.com/metabase/metabase">Metabase</a> - The
simplest, fastest way to get business intelligence and analytics to
everyone in your company.</li>
<li><a
href="http://www.microsoft.com/en-us/server-cloud/solutions/business-intelligence/default.aspx">Microsoft</a>
- business intelligence software and platform.</li>
<li><a href="https://www.microstrategy.com/">Microstrategy</a> -
software platforms for business intelligence, mobile intelligence, and
network applications.</li>
<li><a href="https://numeracy.co/">Numeracy</a> - Fast, clean SQL client
and business intelligence.</li>
<li><a href="http://www.pentaho.com/">Pentaho</a> - business
intelligence platform.</li>
<li><a href="http://www.qlik.com/us/">Qlik</a> - business intelligence
and analytics platform.</li>
<li><a href="https://redash.io/">Redash</a> - Open source business
intelligence platform, supporting multiple data sources and planned
queries.</li>
<li><a href="https://www.meteorite.bi/">Saiku Analytics</a> - Open
source analytics platform.</li>
<li><a href="https://www.knowage-suite.com/">Knowage</a> - open source
business intelligence platform. (former <a
href="http://www.spagobi.org/">SpagoBi</a>)</li>
<li><a href="http://sparklinedata.com/">SparklineData SNAP</a> - modern
BI platform powered by Apache Spark.</li>
<li><a href="https://www.tableau.com/">Tableau</a> - business
intelligence platform.</li>
<li><a href="https://www.zoomdata.com/">Zoomdata</a> - Big Data
Analytics.</li>
</ul>
<h2 id="data-visualization">Data Visualization</h2>
<ul>
<li><a href="https://github.com/airbnb/airpal">Airpal</a> - Web UI for
PrestoDB.</li>
<li><a href="http://www.anychart.com">AnyChart</a> - fast, simple and
flexible JavaScript (HTML5) charting library featuring pure JS API.</li>
<li><a href="https://github.com/samizdatco/arbor">Arbor</a> - graph
visualization library using web workers and jQuery.</li>
<li><a href="https://github.com/LucidWorks/banana">Banana</a> -
visualize logs and time-stamped data stored in Solr. Port of
Kibana.</li>
<li><a href="https://github.com/ufukomer/bloomery">Bloomery</a> - Web UI
for Impala.</li>
<li><a href="http://bokeh.pydata.org/en/latest/">Bokeh</a> - A powerful
Python interactive visualization library that targets modern web
browsers for presentation, with the goal of providing elegant, concise
construction of novel graphics in the style of D3.js, but also
delivering this capability with high-performance interactivity over very
large or streaming datasets.</li>
<li><a href="http://c3js.org/">C3</a> - D3-based reusable chart
library</li>
<li><a href="https://github.com/CartoDB/cartodb">CartoDB</a> -
open-source or freemium hosting for geospatial databases with powerful
front-end editing capabilities and a robust API.</li>
<li><a href="http://chartd.co/">chartd</a> - responsive,
retina-compatible charts with just an img tag.</li>
<li><a href="http://www.chartjs.org/">Chart.js</a> - open source HTML5
Charts visualizations.</li>
<li><a href="https://github.com/gionkunz/chartist-js">Chartist.js</a> -
another open source HTML5 Charts visualization.</li>
<li><a href="http://square.github.io/crossfilter/">Crossfilter</a> -
JavaScript library for exploring large multivariate datasets in the
browser. Works well with dc.js and d3.js.</li>
<li><a href="https://github.com/square/cubism">Cubism</a> - JavaScript
library for time series visualization.</li>
<li><a href="http://cytoscape.github.io/">Cytoscape</a> - JavaScript
library for visualizing complex networks.</li>
<li><a href="http://dc-js.github.io/dc.js/">DC.js</a> - Dimensional
charting built to work natively with crossfilter rendered using d3.js.
Excellent for connecting charts/additional metadata to hover events in
D3.</li>
<li><a href="https://d3js.org/">D3</a> - JavaScript library for
manipulating documents.</li>
<li><a href="https://github.com/CSNW/d3.compose">D3.compose</a> -
Compose complex, data-driven visualizations from reusable charts and
components.</li>
<li><a href="http://d3plus.org">D3Plus</a> - A fairly robust set of
reusable charts and styles for d3.js.</li>
<li><a href="https://github.com/plotly/dash">Dash</a> - Analytical Web
Apps for Python, R, Julia, and Jupyter. Built on top of plotly, no JS
required</li>
<li><a href="https://dekart.xyz/">Dekart</a> - Large scale geospatial
analytics for Google BigQuery based on Kepler.gl.</li>
<li><a
href="https://devexpress.github.io/devextreme-reactive/react/chart/">DevExtreme
React Chart</a> - High-performance plugin-based React chart for
Bootstrap and Material Design.</li>
<li><a href="https://github.com/ecomfe/echarts">Echarts</a> - Baidu's
enterprise charts.</li>
<li><a
href="https://github.com/HumbleSoftware/envisionjs">Envisionjs</a> -
dynamic HTML5 visualization.</li>
<li><a href="https://metrictools.org/">FnordMetric</a> - write SQL
queries that return SVG charts rather than tables</li>
<li><a href="https://frappe.io/charts">Frappe Charts</a> -
GitHub-inspired simple and modern SVG charts for the web with zero
dependencies.</li>
<li><a href="https://github.com/Freeboard/freeboard">Freeboard</a> - open
source real-time dashboard builder for IoT and other web mashups.</li>
<li><a href="https://github.com/gephi/gephi">Gephi</a> - An
award-winning open-source platform for visualizing and manipulating
large graphs and network connections. It's like Photoshop, but for
graphs. Available for Windows and Mac OS X.</li>
<li><a href="https://developers.google.com/chart/">Google Charts</a> -
simple charting API.</li>
<li><a href="https://grafana.com/">Grafana</a> - Graphite dashboard
frontend, editor and graph composer.</li>
<li><a href="http://graphiteapp.org/">Graphite</a> - scalable Realtime
Graphing.</li>
<li><a href="https://www.highcharts.com/">Highcharts</a> - simple and
flexible charting API.</li>
<li><a href="http://ipython.org/">IPython</a> - provides a rich
architecture for interactive computing.</li>
<li><a href="https://www.elastic.co/products/kibana">Kibana</a> -
visualize logs and time-stamped data</li>
<li><a href="http://lumify.io/">Lumify</a> - open source big data
analysis and visualization platform</li>
<li><a href="https://github.com/matplotlib/matplotlib">Matplotlib</a> -
plotting with Python.</li>
<li><a href="https://metricsgraphicsjs.org/">MetricsGraphics.js</a> - a
library built on top of D3 that is optimized for time-series data</li>
<li><a href="http://nvd3.org/">NVD3</a> - chart components for
d3.js.</li>
<li><a href="https://github.com/benpickles/peity">Peity</a> -
Progressive SVG bar, line and pie charts.</li>
<li><a href="https://plot.ly/">Plot.ly</a> - Easy-to-use web service
that allows for rapid creation of complex charts, from heatmaps to
histograms. Upload data to create and style charts with Plotly's online
spreadsheet. Fork others' plots.</li>
<li><a href="https://github.com/plotly/plotly.js">Plotly.js</a> - The open
source JavaScript graphing library that powers Plotly.</li>
<li><a href="https://github.com/okfn/recline">Recline</a> - simple but
powerful library for building data applications in pure JavaScript and
HTML.</li>
<li><a href="https://github.com/getredash/redash">Redash</a> -
open-source platform to query and visualize data.</li>
<li><a href="http://recharts.org/">ReCharts</a> - A composable charting
library built on React components</li>
<li><a href="http://shiny.rstudio.com/">Shiny</a> - a web application
framework for R.</li>
<li><a href="https://github.com/jacomyal/sigma.js">Sigma.js</a> -
JavaScript library dedicated to graph drawing.</li>
<li><a href="https://github.com/apache/incubator-superset">Superset</a>
- a data exploration platform designed to be visual, intuitive and
interactive, making it easy to slice, dice and visualize data and
perform analytics at the speed of thought.</li>
<li><a href="https://github.com/vega/vega">Vega</a> - a visualization
grammar.</li>
<li><a href="https://github.com/ZEPL/zeppelin">Zeppelin</a> - a
notebook-style collaborative data analysis.</li>
<li><a href="https://www.zingchart.com/">Zing Charts</a> - JavaScript
charting library for big data.</li>
<li><a
href="https://github.com/WeBankFinTech/DataSphereStudio">DataSphere
Studio</a> - one-stop data application development management
portal.</li>
</ul>
<h2 id="internet-of-things-and-sensor-data">Internet of things and
sensor data</h2>
<ul>
<li><a href="http://edgent.apache.org/">Apache Edgent (Incubating)</a> -
a programming model and micro-kernel style runtime that can be embedded
in gateways and small footprint edge devices enabling local, real-time,
analytics on the edge devices.</li>
<li><a href="https://azure.microsoft.com/en-us/services/iot-hub/">Azure
IoT Hub</a> - Cloud-based bi-directional monitoring and messaging
hub</li>
<li><a href="https://www.tempoiq.com/">TempoIQ</a> - Cloud-based sensor
analytics.</li>
<li><a href="http://2lemetry.com/">2lemetry</a> - Platform for Internet
of things.</li>
<li><a href="https://www.pubnub.com/">Pubnub</a> - Data stream
network</li>
<li><a href="https://www.thingworx.com/">ThingWorx</a> - Rapid
development and connection of intelligent systems</li>
<li><a href="https://ifttt.com/">IFTTT</a> - If this then that</li>
<li><a href="https://evrythng.com/">EVRYTHNG</a> - Making products
smart</li>
<li><a href="https://github.com/marty90/netlytics/">NetLytics</a> -
Analytics platform to process network data on Spark.</li>
<li><a href="https://ably.com/">Ably</a> - Pub/sub messaging platform
for IoT</li>
</ul>
<h2 id="interesting-readings">Interesting Readings</h2>
<ul>
<li><a href="https://amplab.cs.berkeley.edu/benchmark/">Big Data
Benchmark</a> - Benchmark of Redshift, Hive, Shark, Impala and
Stinger/Tez.</li>
<li><a
href="https://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis">NoSQL
Comparison</a> - Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs
HBase vs Couchbase vs Neo4j vs Hypertable vs ElasticSearch vs Accumulo
vs VoltDB vs Scalaris comparison.</li>
<li><a
href="https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics?ref=awesome">Monitoring
Kafka performance</a> - Guide to monitoring Apache Kafka, including
native methods for metrics collection.</li>
<li><a
href="https://www.datadoghq.com/blog/monitor-hadoop-metrics?ref=awesome">Monitoring
Hadoop performance</a> - Guide to monitoring Hadoop, with an overview of
Hadoop architecture, and native methods for metrics collection.</li>
<li><a
href="https://www.datadoghq.com/blog/how-to-monitor-cassandra-performance-metrics/?ref=awesome">Monitoring
Cassandra performance</a> - Guide to monitoring Cassandra, including
native methods for metrics collection.</li>
</ul>
<h2 id="interesting-papers">Interesting Papers</h2>
<h3 id="section">2015 - 2016</h3>
<ul>
<li><a href="http://www.vldb.org/pvldb/vol8/p1804-ching.pdf">2015</a> -
<strong>Facebook</strong> - One Trillion Edges: Graph Processing at
Facebook-Scale.</li>
</ul>
<h3 id="section-1">2013 - 2014</h3>
<ul>
<li><a href="http://infolab.stanford.edu/~ullman/mmds/book.pdf">2014</a>
- <strong>Stanford</strong> - Mining of Massive Datasets.</li>
<li><a
href="https://amplab.cs.berkeley.edu/wp-content/uploads/2013/03/eurosys13-paper83.pdf">2013</a>
- <strong>AMPLab</strong> - Presto: Distributed Machine Learning and
Graph Processing with Sparse Matrices.</li>
<li><a
href="https://amplab.cs.berkeley.edu/wp-content/uploads/2013/01/dmx1.pdf">2013</a>
- <strong>AMPLab</strong> - MLbase: A Distributed Machine-learning
System.</li>
<li><a
href="https://amplab.cs.berkeley.edu/wp-content/uploads/2013/02/shark_sigmod2013.pdf">2013</a>
- <strong>AMPLab</strong> - Shark: SQL and Rich Analytics at Scale.</li>
<li><a
href="https://amplab.cs.berkeley.edu/wp-content/uploads/2013/05/grades-graphx_with_fonts.pdf">2013</a>
- <strong>AMPLab</strong> - GraphX: A Resilient Distributed Graph System
on Spark.</li>
<li><a
href="http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/40671.pdf">2013</a>
- <strong>Google</strong> - HyperLogLog in Practice: Algorithmic
Engineering of a State of The Art Cardinality Estimation Algorithm.</li>
<li><a
href="http://research.microsoft.com/pubs/200169/now-vldb.pdf">2013</a> -
<strong>Microsoft</strong> - Scalable Progressive Analytics on Big Data
in the Cloud.</li>
<li><a href="http://static.druid.io/docs/druid.pdf">2013</a> -
<strong>Metamarkets</strong> - Druid: A Real-time Analytical Data
Store.</li>
<li><a
href="http://db.disi.unitn.eu/pages/VLDBProgram/pdf/industry/p764-rae.pdf">2013</a>
- <strong>Google</strong> - Online, Asynchronous Schema Change in
F1.</li>
<li><a
href="http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/41344.pdf">2013</a>
- <strong>Google</strong> - F1: A Distributed SQL Database That
Scales.</li>
<li><a
href="http://db.disi.unitn.eu/pages/VLDBProgram/pdf/industry/p734-akidau.pdf">2013</a>
- <strong>Google</strong> - MillWheel: Fault-Tolerant Stream Processing
at Internet Scale.</li>
<li><a
href="http://db.disi.unitn.eu/pages/VLDBProgram/pdf/industry/p767-wiener.pdf">2013</a>
- <strong>Facebook</strong> - Scuba: Diving into Data at Facebook.</li>
<li><a
href="http://db.disi.unitn.eu/pages/VLDBProgram/pdf/industry/p871-curtiss.pdf">2013</a>
- <strong>Facebook</strong> - Unicorn: A System for Searching the Social
Graph.</li>
<li><a
href="https://www.usenix.org/system/files/conference/nsdi13/nsdi13-final170_update.pdf">2013</a>
- <strong>Facebook</strong> - Scaling Memcache at Facebook.</li>
</ul>
<h3 id="section-2">2011 - 2012</h3>
<ul>
<li><a
href="http://vldb.org/pvldb/vol5/p1771_georgelee_vldb2012.pdf">2012</a>
- <strong>Twitter</strong> - The Unified Logging Infrastructure for Data
Analytics at Twitter.</li>
<li><a
href="https://amplab.cs.berkeley.edu/wp-content/uploads/2013/04/blinkdb_vldb12_demo.pdf">2012</a>
- <strong>AMPLab</strong> - Blink and It's Done: Interactive Queries on
Very Large Data.</li>
<li><a
href="https://www.usenix.org/system/files/login/articles/zaharia.pdf">2012</a>
- <strong>AMPLab</strong> - Fast and Interactive Analytics over Hadoop
Data with Spark.</li>
<li><a
href="https://amplab.cs.berkeley.edu/wp-content/uploads/2012/03/mod482-xin1.pdf">2012</a>
- <strong>AMPLab</strong> - Shark: Fast Data Analysis Using
Coarse-grained Distributed Memory.</li>
<li><a
href="https://www.usenix.org/legacy/event/nsdi11/tech/full_papers/Bolosky.pdf">2011</a>
- <strong>Microsoft</strong> - Paxos Replicated State Machines as the
Basis of a High-Performance Data Store.</li>
<li><a
href="http://research.microsoft.com/pubs/178045/ppaoxs-paper29.pdf">2012</a>
- <strong>Microsoft</strong> - Paxos Made Parallel.</li>
<li><a href="https://arxiv.org/pdf/1203.5485.pdf">2012</a> -
<strong>AMPLab</strong> - BlinkDB: Queries with Bounded Errors and
Bounded Response Times on Very Large Data.</li>
<li><a
href="http://vldb.org/pvldb/vol5/p1436_alexanderhall_vldb2012.pdf">2012</a>
- <strong>Google</strong> - Processing a trillion cells per mouse
click.</li>
<li><a
href="http://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf">2012</a>
- <strong>Google</strong> - Spanner: Google's Globally-Distributed
Database.</li>
<li><a
href="https://amplab.cs.berkeley.edu/wp-content/uploads/2011/06/euro118-ananthanarayanan.pdf">2011</a>
- <strong>AMPLab</strong> - Scarlett: Coping with Skewed Popularity
Content in MapReduce Clusters.</li>
<li><a
href="https://amplab.cs.berkeley.edu/wp-content/uploads/2011/06/Mesos-A-Platform-for-Fine-Grained-Resource-Sharing-in-the-Data-Center.pdf">2011</a>
- <strong>AMPLab</strong> - Mesos: A Platform for Fine-Grained Resource
Sharing in the Data Center.</li>
<li><a
href="http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36971.pdf">2011</a>
- <strong>Google</strong> - Megastore: Providing Scalable, Highly
Available Storage for Interactive Services.</li>
</ul>
<h3 id="section-3">2001 - 2010</h3>
<ul>
<li><a
href="https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf">2010</a>
- <strong>Facebook</strong> - Finding a needle in Haystack: Facebook's
photo storage.</li>
<li><a
href="https://amplab.cs.berkeley.edu/wp-content/uploads/2011/06/Spark-Cluster-Computing-with-Working-Sets.pdf">2010</a>
- <strong>AMPLab</strong> - Spark: Cluster Computing with Working
Sets.</li>
<li><a href="http://kowshik.github.io/JPregel/pregel_paper.pdf">2010</a>
- <strong>Google</strong> - Pregel: A System for Large-Scale Graph
Processing.</li>
<li><a
href="http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36726.pdf">2010</a>
- <strong>Google</strong> - Large-scale Incremental Processing Using
Distributed Transactions and Notifications, the basis of Percolator and
Caffeine.</li>
<li><a
href="http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf">2010</a>
- <strong>Google</strong> - Dremel: Interactive Analysis of Web-Scale
Datasets.</li>
<li><a href="http://leoneu.github.io/">2010</a> - <strong>Yahoo</strong>
- S4: Distributed Stream Computing Platform.</li>
<li><a href="http://www.cs.umd.edu/~abadi/papers/hadoopdb.pdf">2009</a>
- HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies
for Analytical Workloads.</li>
<li><a
href="https://cwiki.apache.org/confluence/download/attachments/120729877/chukwa_cca08.pdf?version=1&amp;modificationDate=1562667399000&amp;api=v2">2008</a>
- <strong>AMPLab</strong> - Chukwa: A large-scale monitoring
system.</li>
<li><a
href="http://www.read.seas.harvard.edu/~kohler/class/cs239-w08/decandia07dynamo.pdf">2007</a>
- <strong>Amazon</strong> - Dynamo: Amazon's Highly Available Key-value
Store.</li>
<li><a
href="http://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf">2006</a>
- <strong>Google</strong> - The Chubby lock service for loosely-coupled
distributed systems.</li>
<li><a
href="http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/bigtable-osdi06.pdf">2006</a>
- <strong>Google</strong> - Bigtable: A Distributed Storage System for
Structured Data.</li>
<li><a
href="http://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf">2004</a>
- <strong>Google</strong> - MapReduce: Simplified Data Processing on Large
Clusters.</li>
<li><a
href="http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf">2003</a>
- <strong>Google</strong> - The Google File System.</li>
</ul>
<h2 id="videos">Videos</h2>
<ul>
<li><a href="https://www.manning.com/livevideo/spark-in-motion">Spark in
Motion</a> - Spark in Motion teaches you how to use Spark for batch and
streaming data analytics.</li>
<li><a
href="https://www.manning.com/livevideo/machine-learning-data-science-and-deep-learning-with-python">Machine
Learning, Data Science and Deep Learning with Python</a> - LiveVideo
tutorial that covers machine learning, Tensorflow, artificial
intelligence, and neural networks.</li>
<li><a href="https://snir.dev/talks/data-warehouse-schema-design">Data
warehouse schema design - dimensional modeling and star schema</a> -
Introduction to schema design for data warehouse using the star schema
method.</li>
<li><a
href="https://www.manning.com/livevideo/elasticsearch-7-and-elastic-stack">Elasticsearch
7 and Elastic Stack</a> - LiveVideo tutorial that covers searching,
analyzing, and visualizing big data on a cluster with Elasticsearch,
Logstash, Beats, Kibana, and more.</li>
</ul>
<h2 id="books">Books</h2>
<h4 id="streaming">Streaming</h4>
<ul>
<li><a
href="https://www.manning.com/books/data-science-at-scale-with-python-and-dask">Data
Science at Scale with Python and Dask</a> - Data Science at Scale with
Python and Dask teaches you how to build distributed data projects that
can handle huge amounts of data.</li>
<li><a href="https://www.manning.com/books/streaming-data">Streaming
Data</a> - Streaming Data introduces the concepts and requirements of
streaming and real-time data systems.</li>
<li><a href="https://www.manning.com/books/storm-applied">Storm
Applied</a> - Storm Applied is a practical guide to using Apache Storm
for the real-world tasks associated with processing and analyzing
real-time data streams.</li>
<li><a
href="http://www.cambridge.org/us/academic/subjects/engineering/communications-and-signal-processing/fundamentals-stream-processing-application-design-systems-and-analytics">Fundamentals
of Stream Processing: Application Design, Systems, and Analytics</a> -
This comprehensive, hands-on guide combining the fundamental building
blocks and emerging research in stream processing is ideal for
application designers, system builders, analytic developers, as well as
students and researchers in the field.</li>
<li><a href="http://www.springer.com/us/book/9780387710020">Stream Data
Processing: A Quality of Service Perspective</a> - Presents a new
paradigm suitable for stream and complex event processing.</li>
<li><a
href="https://www.manning.com/books/event-streams-in-action">Unified Log
Processing</a> - Unified Log Processing is a practical guide to
implementing a unified log of event streams (Kafka or Kinesis) in your
business.</li>
<li><a
href="https://www.manning.com/books/kafka-streams-in-action">Kafka
Streams in Action</a> - Kafka Streams in Action teaches you everything
you need to know to implement stream processing on data flowing into
your Kafka platform, allowing you to focus on getting more from your
data without sacrificing time or effort.</li>
<li><a href="https://www.manning.com/books/big-data">Big Data</a> - Big
Data teaches you to build big data systems using an architecture that
takes advantage of clustered hardware along with new tools designed
specifically to capture and analyze web-scale data.</li>
<li><a href="https://www.manning.com/books/spark-in-action">Spark in
Action</a> &amp; <a
href="https://www.manning.com/books/spark-in-action-second-edition">Spark
in Action 2nd Ed.</a> - Spark in Action teaches you the theory and
skills you need to effectively handle batch and streaming data using
Spark. Fully updated for Spark 2.0.</li>
<li><a href="https://www.manning.com/books/kafka-in-action">Kafka in
Action</a> - Kafka in Action is a fast-paced introduction to every
aspect of working with Kafka you need to really reap its benefits.</li>
<li><a href="https://www.manning.com/books/fusion-in-action">Fusion in
Action</a> - Fusion in Action teaches you to build a full-featured data
analytics pipeline, including document and data search and distributed
data clustering.</li>
<li><a
href="https://www.manning.com/books/reactive-data-handling">Reactive
Data Handling</a> - Reactive Data Handling is a collection of five
hand-picked chapters, selected by Manuel Bernhardt, that introduce you
to building reactive applications capable of handling real-time
processing with large data loads. Free eBook!</li>
<li><a href="https://www.manning.com/books/azure-data-engineering">Azure
Data Engineering</a> - A book about data engineering in general and the
Azure platform specifically.</li>
<li><a
href="https://www.manning.com/books/grokking-streaming-systems">Grokking
Streaming Systems</a> - Grokking Streaming Systems helps you unravel
what streaming systems are, how they work, and whether they're right for
your business. Written to be tool-agnostic, it lets you apply what
you learn no matter which framework you choose.</li>
</ul>
<h4 id="distributed-systems">Distributed systems</h4>
<ul>
<li><a href="http://book.mixu.net/distsys/">Distributed Systems for fun
and profit</a> - Theory of distributed systems. Includes parts about time
and ordering, replication, and impossibility results.</li>
</ul>
<h4 id="graph-based-approach">Graph Based approach</h4>
<ul>
<li><a
href="https://www.manning.com/books/graph-powered-machine-learning">Graph-Powered
Machine Learning</a> - Alessandro Negro. Combines graph theory and models
to improve machine learning projects.</li>
</ul>
<h3 id="data-visualization-1">Data Visualization</h3>
<ul>
<li><a href="https://www.youtube.com/watch?v=5Zg-C8AAIGg">The beauty of
data visualization</a></li>
<li><a href="https://www.youtube.com/watch?v=R-oiKt7bUU8">Designing Data
Visualizations with Noah Iliinsky</a></li>
<li><a href="https://www.youtube.com/watch?v=jbkSRLYSojo">Hans Rosling's
200 Countries, 200 Years, 4 Minutes</a></li>
<li><a href="https://www.youtube.com/watch?v=qTEchen97rQ">Ice Bucket
Challenge Data Visualization</a></li>
</ul>
<h1 id="other-awesome-lists">Other Awesome Lists</h1>
<ul>
<li>Other awesome lists <a
href="https://github.com/bayandin/awesome-awesomeness">awesome-awesomeness</a>.</li>
<li>Even more lists <a
href="https://github.com/sindresorhus/awesome">awesome</a>.</li>
<li>Another list? <a href="https://github.com/jnv/lists">lists</a>.</li>
<li>WTF! <a
href="https://github.com/t3chnoboy/awesome-awesome-awesome">awesome-awesome-awesome</a>.</li>
<li>Analytics <a
href="https://github.com/onurakpolat/awesome-analytics">awesome-analytics</a>.</li>
<li>Public Datasets <a
href="https://github.com/awesomedata/awesome-public-datasets">awesome-public-datasets</a>.</li>
<li>Graph Classification <a
href="https://github.com/benedekrozemberczki/awesome-graph-classification">awesome-graph-classification</a>.</li>
<li>Network Embedding <a
href="https://github.com/chihming/awesome-network-embedding">awesome-network-embedding</a>.</li>
<li>Community Detection <a
href="https://github.com/benedekrozemberczki/awesome-community-detection">awesome-community-detection</a>.</li>
<li>Decision Tree Papers <a
href="https://github.com/benedekrozemberczki/awesome-decision-tree-papers">awesome-decision-tree-papers</a>.</li>
<li>Fraud Detection Papers <a
href="https://github.com/benedekrozemberczki/awesome-fraud-detection-papers">awesome-fraud-detection-papers</a>.</li>
<li>Gradient Boosting Papers <a
href="https://github.com/benedekrozemberczki/awesome-gradient-boosting-papers">awesome-gradient-boosting-papers</a>.</li>
<li>Monte Carlo Tree Search Papers <a
href="https://github.com/benedekrozemberczki/awesome-monte-carlo-tree-search-papers">awesome-monte-carlo-tree-search-papers</a>.</li>
<li>Kafka <a
href="https://github.com/monksy/awesome-kafka">awesome-kafka</a>.</li>
<li><a href="https://github.com/zrosenbauer/awesome-bigtable">Google
Bigtable</a>.</li>
</ul>
<p><a href="https://github.com/0xnr/awesome-bigdata">bigdata.md
GitHub</a></p>