<h1 id="awesome-big-data">Awesome Big Data</h1>
<p><a href="https://github.com/sindresorhus/awesome"><img
src="https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg"
alt="Awesome" /></a></p>
<p>A curated list of awesome big data frameworks, resources and other
awesomeness. Inspired by <a
href="https://github.com/ziadoz/awesome-php">awesome-php</a>, <a
href="https://github.com/vinta/awesome-python">awesome-python</a>, <a
href="https://github.com/Sdogruyol/awesome-ruby">awesome-ruby</a>, <a
href="http://hadoopecosystemtable.github.io/">hadoopecosystemtable</a>
&amp; <a href="http://usefulstuff.io/big-data/">big-data</a>.</p>
<p>Your contributions are always welcome!</p>
<ul>
<li><a href="#awesome-big-data">Awesome Big Data</a>
<ul>
<li><a href="#rdbms">RDBMS</a></li>
<li><a href="#frameworks">Frameworks</a></li>
<li><a href="#distributed-programming">Distributed Programming</a></li>
<li><a href="#distributed-filesystem">Distributed Filesystem</a></li>
<li><a href="#distributed-index">Distributed Index</a></li>
<li><a href="#document-data-model">Document Data Model</a></li>
<li><a href="#key-map-data-model">Key Map Data Model</a></li>
<li><a href="#key-value-data-model">Key-value Data Model</a></li>
<li><a href="#graph-data-model">Graph Data Model</a></li>
<li><a href="#columnar-databases">Columnar Databases</a></li>
<li><a href="#newsql-databases">NewSQL Databases</a></li>
<li><a href="#time-series-databases">Time-Series Databases</a></li>
<li><a href="#sql-like-processing">SQL-like processing</a></li>
<li><a href="#data-ingestion">Data Ingestion</a></li>
<li><a href="#service-programming">Service Programming</a></li>
<li><a href="#scheduling">Scheduling</a></li>
<li><a href="#machine-learning">Machine Learning</a></li>
<li><a href="#benchmarking">Benchmarking</a></li>
<li><a href="#security">Security</a></li>
<li><a href="#system-deployment">System Deployment</a></li>
<li><a href="#applications">Applications</a></li>
<li><a href="#search-engine-and-framework">Search engine and
framework</a></li>
<li><a href="#mysql-forks-and-evolutions">MySQL forks and
evolutions</a></li>
<li><a href="#postgresql-forks-and-evolutions">PostgreSQL forks and
evolutions</a></li>
<li><a href="#memcached-forks-and-evolutions">Memcached forks and
evolutions</a></li>
<li><a href="#embedded-databases">Embedded Databases</a></li>
<li><a href="#business-intelligence">Business Intelligence</a></li>
<li><a href="#data-visualization">Data Visualization</a></li>
<li><a href="#internet-of-things-and-sensor-data">Internet of things and
sensor data</a></li>
<li><a href="#interesting-readings">Interesting Readings</a></li>
<li><a href="#interesting-papers">Interesting Papers</a>
<ul>
<li><a href="#2015---2016">2015 - 2016</a></li>
<li><a href="#2013---2014">2013 - 2014</a></li>
<li><a href="#2011---2012">2011 - 2012</a></li>
<li><a href="#2001---2010">2001 - 2010</a></li>
</ul></li>
<li><a href="#videos">Videos</a></li>
<li><a href="#books">Books</a>
<ul>
<li><a href="#streaming">Streaming</a></li>
<li><a href="#distributed-systems">Distributed systems</a></li>
<li><a href="#graph-based-approach">Graph Based approach</a></li>
<li><a href="#data-visualization-1">Data Visualization</a></li>
</ul></li>
</ul></li>
<li><a href="#other-awesome-lists">Other Awesome Lists</a></li>
</ul>
<h2 id="rdbms">RDBMS</h2>
<ul>
<li><a href="https://www.mysql.com/">MySQL</a> - The world’s most popular
open source database.</li>
<li><a href="https://www.postgresql.org/">PostgreSQL</a> - The world’s
most advanced open source database.</li>
<li><a
href="http://www.oracle.com/us/corporate/features/database-12c/index.html">Oracle
Database</a> - object-relational database management system.</li>
<li><a
href="http://www.teradata.com/products-and-services/teradata-database/">Teradata</a>
- high-performance MPP data warehouse platform.</li>
</ul>
<h2 id="frameworks">Frameworks</h2>
<ul>
<li><a href="https://github.com/facebook/bistro">Bistro</a> -
general-purpose data processing engine for both batch and stream
analytics. It is based on a novel data model, which represents data via
<em>functions</em> and processes data via <em>column operations</em> as
opposed to having only set operations in conventional approaches like
MapReduce or SQL.</li>
<li><a
href="https://www.ibm.com/analytics/us/en/technology/stream-computing/">IBM
Streams</a> - platform for distributed processing and real-time
analytics. Integrates with many of the popular technologies in the Big
Data ecosystem (Kafka, HDFS, Spark, etc.).</li>
<li><a href="http://hadoop.apache.org/">Apache Hadoop</a> - framework
for distributed processing. Integrates MapReduce (parallel processing),
YARN (job scheduling) and HDFS (distributed file system).</li>
<li><a href="https://github.com/caskdata/tigon">Tigon</a> - High
Throughput Real-time Stream Processing Framework.</li>
<li><a href="http://pachyderm.io/">Pachyderm</a> - a data
storage platform built on Docker and Kubernetes to provide reproducible
data processing and analysis.</li>
<li><a href="https://github.com/polyaxon/polyaxon">Polyaxon</a> - A
platform for reproducible and scalable machine learning and deep
learning.</li>
<li><a href="https://github.com/smooks/smooks">Smooks</a> - An
extensible Java framework for building XML and non-XML (CSV, EDI, Java,
etc.) streaming applications.</li>
</ul>
<h2 id="distributed-programming">Distributed Programming</h2>
<ul>
<li><a href="https://github.com/addthis/hydra">AddThis Hydra</a> -
distributed data processing and storage system originally developed at
AddThis.</li>
<li><a href="http://databricks.github.io/simr/">AMPLab SIMR</a> - run
Spark on Hadoop MapReduce v1.</li>
<li><a href="https://apex.apache.org/">Apache APEX</a> - a unified,
enterprise platform for big data stream and batch processing.</li>
<li><a href="https://beam.apache.org/">Apache Beam</a> - a unified
model and set of language-specific SDKs for defining and executing data
processing workflows.</li>
<li><a href="http://crunch.apache.org/">Apache Crunch</a> - a simple
Java API for tasks like joining and data aggregation that are tedious to
implement on plain MapReduce.</li>
<li><a href="http://incubator.apache.org/projects/datafu.html">Apache
DataFu</a> - collection of user-defined functions for Hadoop and Pig
developed by LinkedIn.</li>
<li><a href="http://flink.apache.org/">Apache Flink</a> -
stream and batch processing framework with a high-performance runtime
and automatic program optimization.</li>
<li><a href="http://gearpump.apache.org/">Apache Gearpump</a> -
real-time big data streaming engine based on Akka.</li>
<li><a href="http://gora.apache.org/">Apache Gora</a> - framework for
in-memory data model and persistence.</li>
<li><a href="http://hama.apache.org/">Apache Hama</a> - BSP (Bulk
Synchronous Parallel) computing framework.</li>
<li><a href="https://wiki.apache.org/hadoop/MapReduce/">Apache
MapReduce</a> - programming model for processing large data sets with a
parallel, distributed algorithm on a cluster.</li>
<li><a href="https://pig.apache.org/">Apache Pig</a> - high-level
language to express data analysis programs for Hadoop.</li>
<li><a href="http://reef.apache.org/">Apache REEF</a> - retainable
evaluator execution framework to simplify and unify the lower layers of
big data systems.</li>
<li><a href="http://incubator.apache.org/projects/s4.html">Apache S4</a>
- framework for stream processing, implementation of S4.</li>
<li><a href="http://spark.apache.org/">Apache Spark</a> - framework
for in-memory cluster computing.</li>
<li><a
href="https://spark.apache.org/docs/latest/streaming-programming-guide.html">Apache
Spark Streaming</a> - framework for stream processing, part of
Spark.</li>
<li><a href="http://storm.apache.org">Apache Storm</a> - framework for
stream processing, originally from Twitter; also runs on YARN.</li>
<li><a href="http://samza.apache.org/">Apache Samza</a> - stream
processing framework, based on Kafka and YARN.</li>
<li><a href="http://tez.apache.org/">Apache Tez</a> - application
framework for executing a complex DAG (directed acyclic graph) of tasks,
built on YARN.</li>
<li><a href="https://incubator.apache.org/projects/twill.html">Apache
Twill</a> - abstraction over YARN that reduces the complexity of
developing distributed applications.</li>
<li><a href="http://bigflow.cloud/en/index.html">Baidu Bigflow</a> - an
interface for writing distributed computing programs, providing
simple, flexible, powerful APIs to easily handle data
of any scale.</li>
<li><a href="http://cascalog.org/">Cascalog</a> - data processing and
querying library.</li>
<li><a
href="http://vldbarc.org/pvldb/vldb2010/pvldb_vol3/I08.pdf">Cheetah</a>
- High Performance, Custom Data Warehouse on Top of MapReduce.</li>
<li><a href="http://www.cascading.org/">Concurrent Cascading</a> -
framework for data management/analytics on Hadoop.</li>
<li><a href="https://github.com/damballa/parkour">Damballa Parkour</a> -
MapReduce library for Clojure.</li>
<li><a href="https://github.com/datasalt/pangool">Datasalt Pangool</a> -
alternative MapReduce paradigm.</li>
<li><a href="https://www.datatorrent.com/">DataTorrent StrAM</a> -
real-time engine designed to enable distributed, asynchronous,
real-time in-memory big-data computations in as unblocked a way as possible,
with minimal overhead and impact on performance.</li>
<li><a
href="https://www.facebook.com/notes/facebook-engineering/under-the-hood-scheduling-mapreduce-jobs-more-efficiently-with-corona/10151142560538920">Facebook
Corona</a> - Hadoop enhancement which removes the single point of
failure.</li>
<li><a href="http://peregrine_mapreduce.bitbucket.org/">Facebook
Peregrine</a> - Map Reduce framework.</li>
<li><a
href="https://www.facebook.com/notes/facebook-engineering/under-the-hood-data-diving-with-scuba/10150599692628920">Facebook
Scuba</a> - distributed in-memory datastore.</li>
<li><a
href="https://googledevelopers.blogspot.it/2014/06/cloud-platform-at-google-io-new-big.html">Google
Dataflow</a> - create data pipelines that ingest, transform and
analyze data.</li>
<li><a href="https://research.google.com/archive/mapreduce.html">Google
MapReduce</a> - map reduce framework.</li>
<li><a href="https://research.google.com/pubs/pub41378.html">Google
MillWheel</a> - fault tolerant stream processing framework.</li>
<li><a
href="https://www.ibm.com/analytics/us/en/technology/stream-computing/">IBM
Streams</a> - platform for distributed processing and real-time
analytics. Provides toolkits for advanced analytics like geospatial,
time series, etc. out of the box.</li>
<li><a href="https://code.google.com/p/jaql/">JAQL</a> - declarative
programming language for working with structured, semi-structured and
unstructured data.</li>
<li><a href="http://kitesdk.org/docs/current/">Kite</a> - a set of
libraries, tools, examples, and documentation focused on making it
easier to build systems on top of the Hadoop ecosystem.</li>
<li><a href="http://druid.io/">Metamarkets Druid</a> - framework for
real-time analysis of large datasets.</li>
<li><a href="https://github.com/Netflix/PigPen">Netflix PigPen</a> -
map-reduce for Clojure which compiles to Apache Pig.</li>
<li><a href="http://discoproject.org/">Nokia Disco</a> - MapReduce
framework developed by Nokia.</li>
<li><a href="http://www.onyxplatform.org/">Onyx</a> - Distributed
computation for the cloud.</li>
<li><a
href="https://medium.com/@Pinterest_Engineering/pinlater-an-asynchronous-job-execution-system-b8664cb8aa7d">Pinterest
Pinlater</a> - asynchronous job execution system.</li>
<li><a href="http://crs4.github.io/pydoop/">Pydoop</a> - Python
MapReduce and HDFS API for Hadoop.</li>
<li><a href="https://github.com/ray-project/ray">Ray</a> - A fast and
simple framework for building and running distributed applications.</li>
<li><a href="http://blueflood.io/">Rackerlabs Blueflood</a> -
multi-tenant distributed metric processing system.</li>
<li><a href="https://github.com/skale-me/skale-engine">Skale</a> - High
performance distributed data processing in NodeJS.</li>
<li><a href="http://stratosphere.eu/">Stratosphere</a> - general purpose
cluster computing framework.</li>
<li><a href="https://streamdrill.com/">Streamdrill</a> - useful for
counting activities of event streams over different time windows and
finding the most active one.</li>
<li><a
href="https://github.com/IBMStreams/streamsx.topology">streamsx.topology</a>
- Libraries to enable building IBM Streams applications in Java, Python
or Scala.</li>
<li><a href="https://github.com/UnderstandLingBV/Tuktu">Tuktu</a> -
Easy-to-use platform for batch and streaming computation, built using
Scala, Akka and Play!</li>
<li><a href="https://github.com/twitter/heron">Twitter Heron</a> -
a realtime, distributed, fault-tolerant stream processing engine from
Twitter replacing Storm.</li>
<li><a href="https://github.com/twitter/scalding">Twitter Scalding</a> -
Scala library for Map Reduce jobs, built on Cascading.</li>
<li><a href="https://github.com/twitter/summingbird">Twitter
Summingbird</a> - Streaming MapReduce with Scalding and Storm, by
Twitter.</li>
<li><a
href="https://blog.twitter.com/engineering/en_us/a/2014/tsar-a-timeseries-aggregator.html">Twitter
TSAR</a> - TimeSeries AggregatoR by Twitter.</li>
<li><a href="http://www.wallaroolabs.com/community">Wallaroo</a> - The
ultrafast and elastic data processing engine. Big or fast data - no
fuss, no Java needed.</li>
</ul>
<h2 id="distributed-filesystem">Distributed Filesystem</h2>
<ul>
<li><a href="https://github.com/linkedin/ambry">Ambry</a> - a
distributed object store that supports storage of trillions of small
immutable objects as well as billions of large objects.</li>
<li><a href="http://hadoop.apache.org/">Apache HDFS</a> - a way to store
large files across multiple machines.</li>
<li><a href="http://kudu.apache.org/">Apache Kudu</a> - Hadoop’s storage
layer to enable fast analytics on fast data.</li>
<li><a href="https://www.beegfs.io/content/">BeeGFS</a> - formerly
FhGFS, parallel distributed file system.</li>
<li><a href="http://ceph.com/ceph-storage/file-system/">Ceph
Filesystem</a> - software storage platform.</li>
<li><a
href="http://disco.readthedocs.org/en/latest/howto/ddfs.html">Disco
DDFS</a> - distributed filesystem.</li>
<li><a
href="https://www.facebook.com/note.php?note_id=76191543919">Facebook
Haystack</a> - object storage system.</li>
<li><a
href="http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf">Google
GFS</a> - distributed filesystem.</li>
<li><a href="https://research.google.com/pubs/pub36971.html">Google
Megastore</a> - scalable, highly available storage.</li>
<li><a href="https://www.gridgain.com/">GridGain</a> - GGFS, Hadoop
compliant in-memory file system.</li>
<li><a href="http://wiki.lustre.org/">Lustre file system</a> -
high-performance distributed filesystem.</li>
<li><a
href="https://hadoop.apache.org/docs/current/hadoop-azure-datalake/index.html">Microsoft
Azure Data Lake Store</a> - HDFS-compatible storage in Azure cloud.</li>
<li><a
href="https://www.quantcast.com/about-us/quantcast-file-system/">Quantcast
File System QFS</a> - open-source distributed file system.</li>
<li><a href="http://gluster.org/">Red Hat GlusterFS</a> - scale-out
network-attached storage file system.</li>
<li><a href="https://github.com/chrislusf/seaweedfs">Seaweed-FS</a> -
simple and highly scalable distributed file system.</li>
<li><a href="http://www.alluxio.org/">Alluxio</a> - reliable file
sharing at memory speed across cluster frameworks.</li>
<li><a href="https://www.tahoe-lafs.org/trac/tahoe-lafs">Tahoe-LAFS</a>
- decentralized cloud storage system.</li>
<li><a href="https://github.com/baidu/bfs">Baidu File System</a> -
distributed filesystem.</li>
</ul>
<h2 id="distributed-index">Distributed Index</h2>
<ul>
<li><a href="https://github.com/pilosa/pilosa">Pilosa</a> - Open source
distributed bitmap index that dramatically accelerates queries across
multiple, massive data sets.</li>
</ul>
<h2 id="document-data-model">Document Data Model</h2>
<ul>
<li><a
href="https://www.actian.com/data-management/ingres-sql-rdbms/">Actian
Versant</a> - commercial object-oriented database management
system.</li>
<li><a href="https://crate.io/">Crate Data</a> - an open source
massively scalable data store. It requires zero administration.</li>
<li><a href="http://www.infoq.com/news/2014/06/facebook-apollo">Facebook
Apollo</a> - Facebook’s Paxos-like NoSQL database.</li>
<li><a href="http://comsysto.github.io/jumbodb/">jumboDB</a> - document
oriented datastore over Hadoop.</li>
<li><a href="https://engineering.linkedin.com/data">LinkedIn
Espresso</a> - horizontally scalable document-oriented NoSQL data
store.</li>
<li><a href="http://www.marklogic.com/">MarkLogic</a> - Schema-agnostic
Enterprise NoSQL database technology.</li>
<li><a
href="https://azure.microsoft.com/en-us/services/cosmos-db/">Microsoft
Azure DocumentDB</a> - NoSQL cloud database service with protocol
support for MongoDB.</li>
<li><a href="https://www.mongodb.com/">MongoDB</a> - Document-oriented
database system.</li>
<li><a href="https://ravendb.net/">RavenDB</a> - A transactional,
open-source Document Database.</li>
<li><a href="https://rethinkdb.com/">RethinkDB</a> - document database
that supports queries like table joins and group by.</li>
</ul>
<h2 id="key-map-data-model">Key Map Data Model</h2>
<p><strong>Note</strong>: There is some terminology confusion in the industry,
and two different things are called “Columnar Databases”. Some, listed
here, are distributed, persistent databases built around the “key-map”
data model: all data has a (possibly composite) key, with which a map of
key-value pairs is associated. In some systems, multiple such value maps
can be associated with a key, and these maps are referred to as “column
families” (with value map keys being referred to as “columns”).</p>
<p>Another group of technologies that can also be called “columnar
databases” is distinguished by how it stores data, on disk or in memory
– rather than storing data the traditional way, where all column values
for a given key are stored next to each other, “row by row”, these
systems store all <em>column</em> values next to each other. So more
work is needed to get all columns for a given key, but less work is
needed to get all values for a given column.</p>
<p>The former group is referred to as “key map data model” here. The
line between these and the <a href="#key-value-data-model">Key-value
Data Model</a> stores is fairly blurry.</p>
<p>The latter, being more about the storage format than about the data
model, is listed under <a href="#columnar-databases">Columnar
Databases</a>.</p>
<p>You can read more about this distinction on Prof. Daniel Abadi’s
blog: <a
href="http://dbmsmusings.blogspot.com/2010/03/distinguishing-two-major-types-of_29.html">Distinguishing
two major types of Column Stores</a>.</p>
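<p>As a rough illustration of the two meanings described in the note, the
Python sketch below models a “key-map” store as nested dictionaries and,
separately, pivots a small row-oriented table into a column-oriented
layout. All names and sample values (<code>key_map_store</code>,
<code>get_column</code>, <code>to_columnar</code>, the user records) are
hypothetical and chosen only for this example; it does not reflect how any
of the systems listed here are implemented, and it assumes every record
has the same fields.</p>

```python
# Hypothetical illustration; names and data are invented for this sketch.

# 1) "Key-map" data model: each row key maps to one or more named value maps
#    ("column families"), whose own keys are the "columns".
key_map_store = {
    "user:1": {"profile": {"name": "Ada", "city": "London"},
               "stats": {"logins": 42}},
    "user:2": {"profile": {"name": "Linus"},  # columns may differ per key
               "stats": {"logins": 7}},
}


def get_column(store, row_key, family, column):
    """Look up one value by row key, column family and column (None if absent)."""
    return store.get(row_key, {}).get(family, {}).get(column)


# 2) Storage layout: the same logical table kept row by row vs column by column.
rows = [
    {"id": 1, "name": "Ada", "logins": 42},
    {"id": 2, "name": "Linus", "logins": 7},
]


def to_columnar(records):
    """Pivot row-oriented records into one list per column.

    Assumes every record has the same fields, so list positions line up.
    """
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def row_from_columns(columns, index):
    """Reassemble one row; this touches every column (more work per key)."""
    return {name: values[index] for name, values in columns.items()}


columnar = to_columnar(rows)
total_logins = sum(columnar["logins"])  # reads a single column only
```

<p>In this sketch, aggregating <code>logins</code> reads one list in the
columnar layout, while rebuilding a full row has to visit every column,
which mirrors the trade-off described in the note.</p>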
<ul>
<li><a href="http://accumulo.apache.org/">Apache Accumulo</a> -
distributed key/value store, built on Hadoop.</li>
<li><a href="http://cassandra.apache.org/">Apache Cassandra</a> -
column-oriented distributed datastore, inspired by BigTable.</li>
<li><a href="http://hbase.apache.org/">Apache HBase</a> -
column-oriented distributed datastore, inspired by BigTable.</li>
<li><a href="https://github.com/baidu/tera">Baidu Tera</a> - an
Internet-scale database, inspired by BigTable.</li>
<li><a
href="https://code.facebook.com/posts/321111638043166/hydrabase-the-evolution-of-hbase-facebook/">Facebook
HydraBase</a> - evolution of HBase made by Facebook.</li>
<li><a
href="http://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf">Google
BigTable</a> - column-oriented distributed datastore.</li>
<li><a
href="https://cloud.google.com/datastore/docs/concepts/overview">Google
Cloud Datastore</a> - a fully managed, schemaless database for
storing non-relational data over BigTable.</li>
<li><a href="http://www.hypertable.org/">Hypertable</a> -
column-oriented distributed datastore, inspired by BigTable.</li>
<li><a href="https://github.com/infinidb/infinidb/">InfiniDB</a> -
accessed through a MySQL interface; uses massively parallel processing
to parallelize queries.</li>
<li><a href="https://github.com/caskdata/tephra">Tephra</a> -
Transactions for HBase.</li>
<li><a
href="https://blog.twitter.com/engineering/en_us/a/2014/manhattan-our-real-time-multi-tenant-distributed-database-for-twitter-scale.html">Twitter
Manhattan</a> - real-time, multi-tenant distributed database for Twitter
scale.</li>
<li><a href="http://www.scylladb.com/">ScyllaDB</a> - column-oriented
distributed datastore written in C++, totally compatible with Apache
Cassandra.</li>
</ul>
<h2 id="key-value-data-model">Key-value Data Model</h2>
<ul>
<li><a href="http://www.aerospike.com/">Aerospike</a> - NoSQL
flash-optimized, in-memory. Open source and “Server code in ‘C’ (not
Java or Erlang) precisely tuned to avoid context switching and memory
copies.”</li>
<li><a href="https://aws.amazon.com/dynamodb/">Amazon DynamoDB</a> -
distributed key/value store, implementation of the Dynamo paper.</li>
<li><a href="https://open.dgraph.io/post/badger/">Badger</a> - a fast,
simple, efficient, and persistent key-value store written natively in
Go.</li>
<li><a href="https://github.com/boltdb/bolt">Bolt</a> - an embedded
key-value database for Go.</li>
<li><a href="https://github.com/Bobris/BTDB">BTDB</a> - Key Value
Database in .NET with Object DB Layer, RPC, dynamic IL and much
more.</li>
<li><a href="https://github.com/tidwall/buntdb">BuntDB</a> - a fast,
embeddable, in-memory key/value database for Go with custom indexing and
geospatial support.</li>
<li><a href="https://github.com/cbd/edis">Edis</a> - a
protocol-compatible server replacement for Redis.</li>
<li><a href="https://github.com/nathanmarz/elephantdb">ElephantDB</a> -
Distributed database specialized in exporting data from Hadoop.</li>
<li><a href="https://geteventstore.com/">EventStore</a> - distributed
time series database.</li>
<li><a href="https://github.com/jakekgrog/GhostDB">GhostDB</a> - a
distributed, in-memory, general purpose key-value data store that
delivers microsecond performance at any scale.</li>
<li><a href="https://github.com/deroproject/graviton">Graviton</a> - a
simple, fast, versioned, authenticated, embeddable key-value store
database in pure Go(lang).</li>
<li><a href="https://github.com/griddb/griddb_nosql">GridDB</a> -
suitable for sensor data stored in a timeseries.</li>
<li><a href="https://github.com/rescrv/HyperDex">HyperDex</a> - a
scalable, next generation key-value and document store with a wide array
of features, including consistency, fault tolerance and high
performance.</li>
<li><a href="https://ignite.apache.org/index.html">Ignite</a> - an
in-memory key-value data store providing full SQL-compliant data access
that can optionally be backed by disk storage.</li>
<li><a
href="https://github.com/linkedin-sna/sna-page/tree/master/krati">LinkedIn
Krati</a> - a simple persistent data store with very low latency and
high throughput.</li>
<li><a href="http://www.project-voldemort.com/voldemort/">LinkedIn
Voldemort</a> - distributed key/value storage system.</li>
<li><a
href="http://www.oracle.com/technetwork/database/database-technologies/nosqldb/overview/index.html">Oracle
NoSQL Database</a> - distributed key-value database by Oracle
Corporation.</li>
<li><a href="https://redis.io/">Redis</a> - in-memory key-value
datastore.</li>
<li><a href="https://github.com/basho/riak">Riak</a> - a decentralized
datastore.</li>
<li><a href="https://github.com/twitter/storehaus">Storehaus</a> -
library to work with asynchronous key value stores, by Twitter.</li>
<li><a href="https://github.com/tidwall/summitdb">SummitDB</a> - an
in-memory, NoSQL key/value database, with disk persistence and using the
Raft consensus algorithm.</li>
<li><a href="https://github.com/tarantool/tarantool">Tarantool</a> - an
efficient NoSQL database and a Lua application server.</li>
<li><a href="https://github.com/pingcap/tikv">TiKV</a> - a distributed
key-value database powered by Rust and inspired by Google Spanner and
HBase.</li>
<li><a href="https://github.com/tidwall/tile38">Tile38</a> - a
geolocation data store, spatial index, and realtime geofence, supporting
a variety of object types including latitude/longitude points, bounding
boxes, XYZ tiles, Geohashes, and GeoJSON.</li>
<li><a href="https://github.com/Treode/store">TreodeDB</a> - key-value
store that’s replicated and sharded and provides atomic multirow
writes.</li>
</ul>
<h2 id="graph-data-model">Graph Data Model</h2>
<ul>
<li><a href="http://www.agensgraph.com/">AgensGraph</a> - a new
generation multi-model graph database for the modern complex data
environment.</li>
<li><a href="http://giraph.apache.org/">Apache Giraph</a> -
implementation of Pregel, based on Hadoop.</li>
<li><a
href="http://spark.apache.org/docs/0.7.3/bagel-programming-guide.html">Apache
Spark Bagel</a> - implementation of Pregel, part of Spark.</li>
<li><a href="https://www.arangodb.com/">ArangoDB</a> - multi-model
distributed database.</li>
<li><a href="https://github.com/dgraph-io/dgraph">DGraph</a> - A
scalable, distributed, low latency, high throughput graph database aimed
at providing Google production level scale and throughput, with low
enough latency to be serving real time user queries, over terabytes of
structured data.</li>
<li><a href="https://github.com/krotik/eliasdb">EliasDB</a> - a
lightweight graph based database that does not require any third-party
libraries.</li>
<li><a
href="https://www.facebook.com/notes/facebook-engineering/tao-the-power-of-the-graph/10151525983993920">Facebook
TAO</a> - the distributed data store that is widely used at
Facebook to store and serve the social graph.</li>
<li><a href="https://github.com/gchq/Gaffer">GCHQ Gaffer</a> - Gaffer by
GCHQ is a framework that makes it easy to store large-scale graphs in
which the nodes and edges have statistics.</li>
<li><a href="https://github.com/cayleygraph/cayley">Google Cayley</a> -
open-source graph database.</li>
<li><a href="http://kowshik.github.io/JPregel/pregel_paper.pdf">Google
Pregel</a> - graph processing framework.</li>
<li><a href="https://turi.com/products/create/docs/">GraphLab
PowerGraph</a> - a core C++ GraphLab API and a collection of
high-performance machine learning and data mining toolkits built on top
of the GraphLab API.</li>
<li><a
href="https://amplab.cs.berkeley.edu/publication/graphx-grades/">GraphX</a>
- resilient Distributed Graph System on Spark.</li>
<li><a href="https://github.com/tinkerpop/gremlin">Gremlin</a> - graph
traversal language.</li>
<li><a href="https://github.com/paulhoule/infovore">Infovore</a> -
RDF-centric Map/Reduce framework.</li>
<li><a href="https://01.org/graphbuilder/">Intel GraphBuilder</a> -
tools to construct large-scale graphs on top of Hadoop.</li>
<li><a href="http://janusgraph.org">JanusGraph</a> - open-source,
distributed graph database with multiple options for storage backends
(Bigtable, HBase, Cassandra, etc.) and indexing backends (Elasticsearch,
Solr, Lucene).</li>
<li><a
href="https://www.blazegraph.com/mapgraph-technology/">MapGraph</a> -
Massively Parallel Graph processing on GPUs.</li>
<li><a href="https://github.com/Microsoft/GraphEngine">Microsoft Graph
Engine</a> - a distributed in-memory data processing engine, underpinned
by a strongly-typed in-memory key-value store and a general distributed
computation engine.</li>
<li><a href="https://neo4j.com/">Neo4j</a> - graph database written
entirely in Java.</li>
<li><a href="http://orientdb.com/">OrientDB</a> - document and graph
database.</li>
<li><a href="https://github.com/xslogic/phoebus">Phoebus</a> - framework
for large scale graph processing.</li>
<li><a href="http://thinkaurelius.github.io/titan/">Titan</a> -
distributed graph database, built over Cassandra.</li>
<li><a href="https://github.com/twitter-archive/flockdb">Twitter
FlockDB</a> - distributed graph database.</li>
<li><a href="https://nodexl.codeplex.com/">NodeXL</a> - A free,
open-source template for Microsoft® Excel® 2007, 2010, 2013 and 2016
that makes it easy to explore network graphs.</li>
</ul>
<h2 id="columnar-databases">Columnar Databases</h2>
<p><strong>Note</strong>: please read the note in the <a
href="#key-map-data-model">Key-Map Data Model</a> section.</p>
<ul>
<li><a href="http://the-paper-trail.org/blog/columnar-storage/">Columnar
Storage</a> - an explanation of what columnar storage is and when you
might want it.</li>
<li><a href="http://www.actian.com/">Actian Vector</a> - column-oriented
analytic database.</li>
<li><a href="https://clickhouse.yandex/">ClickHouse</a> - an open-source
column-oriented database management system that allows generating
analytical data reports in real time.</li>
<li><a href="http://eventql.io/">EventQL</a> - a distributed,
column-oriented database built for large-scale event collection and
analytics.</li>
<li><a href="https://www.monetdb.org/">MonetDB</a> - column store
database.</li>
<li><a href="http://parquet.apache.org/">Parquet</a> - columnar storage
format for Hadoop.</li>
<li><a href="https://pivotal.io/pivotal-greenplum">Pivotal Greenplum</a>
- purpose-built, dedicated analytic data warehouse that offers a
columnar engine as well as a traditional row-based one.</li>
<li><a href="https://www.vertica.com/">Vertica</a> - designed to
manage large, fast-growing volumes of data and provide very fast query
performance when used for data warehouses.</li>
<li><a href="http://sqream.com/">SQream DB</a> - A GPU powered big data
database, designed for analytics and data warehousing, with ANSI-92
compliant SQL, suitable for data sets from 10TB to 1PB.</li>
<li><a href="https://cloud.google.com/bigquery/what-is-bigquery">Google
BigQuery</a> - Google’s cloud offering backed by their pioneering work
on Dremel.</li>
<li><a href="https://aws.amazon.com/redshift/">Amazon Redshift</a> -
Amazon’s cloud offering, also based on a columnar datastore
backend.</li>
<li><a href="https://github.com/shunfei/indexr">IndexR</a> - an
open-source columnar storage format for fast &amp; realtime analytics
with big data.</li>
<li><a href="https://github.com/cswinter/LocustDB">LocustDB</a> - an
experimental analytics database aiming to set a new standard for query
performance on commodity hardware.</li>
</ul>
<h2 id="newsql-databases">NewSQL Databases</h2>
<ul>
<li><a
href="http://www.actian.com/products/operational-databases/">Actian
Ingres</a> - commercially supported, open-source SQL relational database
management system.</li>
<li><a href="https://github.com/biokoda/actordb">ActorDB</a> - a
distributed SQL database with the scalability of a KV store, while
keeping the query capabilities of a relational database.</li>
<li><a href="http://aws.amazon.com/redshift/">Amazon Redshift</a> - data
warehouse service, based on PostgreSQL.</li>
<li><a href="https://github.com/probcomp/BayesDB">BayesDB</a> -
statistics-oriented SQL database.</li>
<li><a href="http://bedrockdb.com/">Bedrock</a> - a simple, modular,
networked and distributed transaction layer built atop SQLite.</li>
<li><a href="https://www.citusdata.com/">CitusDB</a> - scales out
PostgreSQL through sharding and replication.</li>
<li><a href="https://github.com/cockroachdb/cockroach">Cockroach</a> -
scalable, geo-replicated, transactional datastore.</li>
<li><a href="https://github.com/bloomberg/comdb2">Comdb2</a> - a
clustered RDBMS built on optimistic concurrency control techniques.</li>
<li><a href="http://www.datomic.com/">Datomic</a> - distributed database
designed to enable scalable, flexible and intelligent applications.</li>
<li><a href="https://foundationdb.com/">FoundationDB</a> - distributed
database, inspired by F1.</li>
<li><a href="https://research.google.com/pubs/pub41344.html">Google
F1</a> - distributed SQL database built on Spanner.</li>
<li><a href="https://research.google.com/archive/spanner.html">Google
Spanner</a> - globally distributed semi-relational database.</li>
<li><a href="http://hstore.cs.brown.edu/">H-Store</a> - an
experimental main-memory, parallel database management system that is
optimized for on-line transaction processing (OLTP) applications.</li>
<li><a href="https://github.com/VCNC/haeinsa">Haeinsa</a> - linearly
scalable multi-row, multi-table transaction library for HBase based on
Percolator.</li>
<li><a
href="https://www.percona.com/doc/percona-server/5.5/performance/handlersocket.html">HandlerSocket</a>
- NoSQL plugin for MySQL/MariaDB.</li>
<li><a href="http://www.infinisql.org/">InfiniSQL</a> - infinitely
scalable RDBMS.</li>
<li><a href="https://github.com/rayokota/kareldb">KarelDB</a> - a
relational database backed by Apache Kafka.</li>
<li><a href="https://www.mapd.com/">Map-D</a> - GPU in-memory database,
big data analysis and visualization platform.</li>
<li><a href="http://www.memsql.com/">MemSQL</a> - in-memory SQL database
with optimized columnar storage on flash.</li>
<li><a href="http://www.nuodb.com/">NuoDB</a> - SQL/ACID compliant
distributed database.</li>
<li><a
href="http://www.oracle.com/technetwork/database/database-technologies/timesten/overview/index.html">Oracle
TimesTen In-Memory Database</a> - in-memory, relational database
management system with persistence and recoverability.</li>
<li><a href="http://gemfirexd.docs.pivotal.io/latest/">Pivotal GemFire
XD</a> - low-latency, in-memory, distributed SQL data store. Provides a
SQL interface to in-memory table data, persistable in HDFS.</li>
<li><a href="https://hana.sap.com/abouthana.html">SAP HANA</a> - an
in-memory, column-oriented, relational database management system.</li>
<li><a href="http://senseidb.github.io/sensei/">SenseiDB</a> -
distributed, realtime, semi-structured database.</li>
<li><a href="http://skydb.io/">Sky</a> - database used for flexible,
high performance analysis of behavioral data.</li>
<li><a href="http://www.symmetricds.org/">SymmetricDS</a> - open source
software for both file and database synchronization.</li>
<li><a href="https://github.com/pingcap/tidb">TiDB</a> - a
distributed SQL database inspired by the design of Google F1.</li>
<li><a href="https://www.voltdb.com/">VoltDB</a> - claims to be the fastest
in-memory database.</li>
<li><a href="https://github.com/YugaByte/yugabyte-db">YugabyteDB</a> -
open source, high-performance, distributed SQL database compatible with
PostgreSQL.</li>
</ul>
<h2 id="time-series-databases">Time-Series Databases</h2>
<ul>
<li><a
href="http://axibase.com/products/axibase-time-series-database/">Axibase
Time Series Database</a> - integrated time series database on top of
HBase with built-in visualization, rule-engine and SQL support.</li>
<li><a href="http://chronix.io/">Chronix</a> - time series storage
built for high compression and fast access times.</li>
<li><a href="http://square.github.io/cube/">Cube</a> - uses MongoDB to
store time series data.</li>
<li><a href="https://spotify.github.io/heroic/#!/index">Heroic</a> -
a scalable time series database based on Cassandra and
Elasticsearch.</li>
<li><a href="https://www.influxdata.com/">InfluxDB</a> - a time series
database with optimised IO and queries, supports pgsql and influx wire
protocols.</li>
<li><a href="https://questdb.io/">QuestDB</a> - high-performance,
open-source SQL database for applications in financial services, IoT,
machine learning, DevOps and observability.</li>
<li><a href="https://www.circonus.com/irondb/">IronDB</a> - scalable,
general-purpose time series database.</li>
<li><a href="https://github.com/kairosdb/kairosdb">KairosDB</a> -
similar to OpenTSDB but allows Cassandra as the backing store.</li>
<li><a href="http://m3db.github.io/m3/m3db/">M3DB</a> - a distributed
time series database that can be used for storing realtime metrics at
long retention.</li>
<li><a href="https://opennms.github.io/newts/">Newts</a> - a time series
database based on Apache Cassandra.</li>
<li><a href="https://github.com/taosdata/TDengine/">TDengine</a> - a
time series database written in C that uses the characteristics of IoT
data to improve read/write throughput and reduce the space needed to
store data.</li>
<li><a href="http://opentsdb.net">OpenTSDB</a> - distributed time series
database on top of HBase.</li>
<li><a href="https://prometheus.io/">Prometheus</a> - a time series
database and service monitoring system.</li>
<li><a href="https://github.com/facebookincubator/beringei">Beringei</a>
- Facebook’s in-memory time-series database.</li>
<li><a href="http://traildb.io/">TrailDB</a> - an efficient tool for
storing and querying series of events.</li>
<li><a href="https://github.com/druid-io/druid/">Druid</a> -
column-oriented distributed data store ideal for powering interactive
applications.</li>
<li><a href="http://basho.com/products/riak-ts/">Riak-TS</a> - an
enterprise-grade NoSQL time series database optimized
specifically for IoT and time series data.</li>
<li><a href="https://github.com/akumuli/Akumuli">Akumuli</a> -
a numeric time-series database. It can be used to capture, store and
process time-series data in real time. The word “akumuli” can be
translated from Esperanto as “accumulate”.</li>
<li><a href="https://github.com/Pardot/Rhombus">Rhombus</a> - a
time-series object store for Cassandra that handles all the complexity
of building wide row indexes.</li>
<li><a href="https://github.com/dalmatinerdb/dalmatinerdb">Dalmatiner
DB</a> - fast distributed metrics database.</li>
<li><a href="https://github.com/rackerlabs/blueflood">Blueflood</a> - a
distributed system designed to ingest and process time series data.</li>
<li><a
href="https://github.com/NationalSecurityAgency/timely">Timely</a>
- a time series database application that provides secure access
to time series data based on Accumulo and Grafana.</li>
<li><a
href="https://github.com/transceptor-technology/siridb-server">SiriDB</a>
- highly scalable, robust and fast open source time series database with
cluster functionality.</li>
<li><a href="https://github.com/improbable-eng/thanos">Thanos</a> -
a set of components to create a highly available metric system
with unlimited storage capacity using multiple (existing) Prometheus
deployments.</li>
<li><a
href="https://github.com/VictoriaMetrics/VictoriaMetrics">VictoriaMetrics</a>
- fast, scalable and resource-effective open-source TSDB compatible with
Prometheus. Single-node and cluster versions included.</li>
</ul>
<h2 id="sql-like-processing">SQL-like processing</h2>
<ul>
<li><a
href="http://www.actian.com/analytic-database/vectorh-sql-hadoop">Actian
SQL for Hadoop</a> - high performance interactive SQL access to all
Hadoop data.</li>
<li><a href="http://drill.apache.org/">Apache Drill</a> - framework for
interactive analysis, inspired by Dremel.</li>
<li><a
href="https://cwiki.apache.org/confluence/display/Hive/HCatalog">Apache
HCatalog</a> - table and storage management layer for Hadoop.</li>
<li><a href="http://hive.apache.org/">Apache Hive</a> - SQL-like data
warehouse system for Hadoop.</li>
<li><a href="http://calcite.apache.org/">Apache Calcite</a> - framework
that allows efficient translation of queries involving heterogeneous and
federated data.</li>
<li><a href="http://phoenix.apache.org/index.html">Apache Phoenix</a> -
SQL skin over HBase.</li>
<li><a
href="http://www.teradata.com/products-and-services/Teradata-Aster/teradata-aster-database">Aster
Database</a> - SQL-like analytic processing for MapReduce.</li>
<li><a
href="https://www.cloudera.com/products/apache-hadoop/impala.html">Cloudera
Impala</a> - framework for interactive analysis, inspired by
Dremel.</li>
<li><a href="http://www.cascading.org/projects/lingual/">Concurrent
Lingual</a> - SQL-like query language for Cascading.</li>
<li><a href="http://www.datasalt.com/products/splout-sql/">Datasalt
Splout SQL</a> - full SQL query engine for big datasets.</li>
<li><a href="https://www.dremio.com/">Dremio</a> - an open-source,
SQL-like Data-as-a-Service Platform based on Apache Arrow.</li>
<li><a href="https://prestodb.io/">Facebook PrestoDB</a> - distributed
SQL query engine.</li>
<li><a href="https://research.google.com/pubs/pub36632.html">Google
BigQuery</a> - framework for interactive analysis, implementation of
Dremel.</li>
<li><a href="https://iceberg.apache.org/">Iceberg</a> - an open table
format for huge analytic datasets. Iceberg adds tables to Trino and
Spark that use a high-performance format that works just like a SQL
table.</li>
<li><a
href="https://github.com/materializeinc/materialize">Materialize</a> -
a streaming database for real-time applications using SQL for queries
and supporting a large fraction of PostgreSQL.</li>
<li><a
href="https://documentation.invantive.com/2017R2/invantive-sql-grammar/invantive-sql-grammar-17.30.html">Invantive
SQL</a> - SQL engine for online and on-premise use with integrated local
data replication and 70+ connectors.</li>
<li><a href="https://www.pipelinedb.com/">PipelineDB</a> - an
open-source relational database that runs SQL queries continuously on
streams, incrementally storing results in tables.</li>
<li><a href="https://pivotal.io/pivotal-hdb">Pivotal HDB</a> - SQL-like
data warehouse system for Hadoop.</li>
<li><a
href="http://rainstor.com/products/rainstor-database/">RainstorDB</a> -
database for storing petabyte-scale volumes of structured and
semi-structured data.</li>
<li><a href="https://github.com/apache/spark/tree/master/sql">Spark
Catalyst</a> - a query optimization framework for Spark and
Shark.</li>
<li><a
href="https://databricks.com/blog/2014/03/26/spark-sql-manipulating-structured-data-using-spark-2.html">SparkSQL</a>
- manipulating structured data using Spark.</li>
<li><a href="https://www.splicemachine.com/">Splice Machine</a> - a
full-featured SQL-on-Hadoop RDBMS with ACID transactions.</li>
<li><a href="https://hortonworks.com/innovation/stinger/">Stinger</a> -
interactive query for Hive.</li>
<li><a href="http://tajo.apache.org/">Tajo</a> - distributed data
warehouse system on Hadoop.</li>
<li><a
href="https://wiki.trafodion.org/wiki/index.php/Main_Page">Trafodion</a>
- enterprise-class SQL-on-HBase solution targeting big data
transactional or operational workloads.</li>
</ul>
<h2 id="data-ingestion">Data Ingestion</h2>
<ul>
<li><a href="https://vectorized.io/redpanda">redpanda</a> - a Kafka®
replacement for mission-critical systems; 10x faster. Written in
C++.</li>
<li><a href="https://aws.amazon.com/kinesis/">Amazon Kinesis</a> -
real-time processing of streaming data at massive scale.</li>
<li><a href="https://aws.amazon.com/glue/">Amazon Web Services Glue</a>
- serverless fully managed extract, transform, and load (ETL)
service.</li>
<li><a href="https://getcensus.com/">Census</a> - a reverse ETL product
that lets you sync data from your data warehouse to SaaS applications. No
engineering favors required, just SQL.</li>
<li><a href="http://chukwa.apache.org/">Apache Chukwa</a> - data
collection system.</li>
<li><a href="http://flume.apache.org/">Apache Flume</a> - service to
manage large amounts of log data.</li>
<li><a href="http://kafka.apache.org/">Apache Kafka</a> - distributed
publish-subscribe messaging system.</li>
<li><a href="https://nifi.apache.org/">Apache NiFi</a> -
an integrated data logistics platform for automating the movement of
data between disparate systems.</li>
<li><a href="https://github.com/apache/pulsar">Apache Pulsar</a> - a
distributed pub-sub messaging platform with a very flexible messaging
model and an intuitive client API.</li>
<li><a href="http://sqoop.apache.org/">Apache Sqoop</a> - tool to
transfer data between Hadoop and a structured datastore.</li>
<li><a href="http://www.embulk.org">Embulk</a> - open-source bulk data
loader that helps data transfer between various databases, storages,
file formats, and cloud services.</li>
<li><a href="https://github.com/facebookarchive/scribe">Facebook
Scribe</a> - streamed log data aggregator.</li>
<li><a href="http://www.fluentd.org">Fluentd</a> - tool to collect
events and logs.</li>
<li><a href="https://github.com/gazette/core">Gazette</a> - distributed
streaming infrastructure built on cloud storage which makes it easy to
mix and match batch and streaming paradigms.</li>
<li><a href="https://research.google.com/pubs/pub41318.html">Google
Photon</a> - geographically distributed system for joining multiple
continuously flowing streams of data in real-time with high scalability
and low latency.</li>
<li><a href="https://github.com/mozilla-services/heka">Heka</a> - open
source stream processing software system.</li>
<li><a href="https://github.com/sonalgoyal/hiho">HIHO</a> - framework
for connecting disparate data sources with Hadoop.</li>
<li><a href="https://github.com/papertrail/kestrel">Kestrel</a> -
distributed message queue system.</li>
<li><a href="https://engineering.linkedin.com/data">LinkedIn Databus</a>
- stream of change capture events for a database.</li>
<li><a href="https://github.com/linkedin/kamikaze">LinkedIn Kamikaze</a>
- utility package for compressing sorted integer arrays.</li>
<li><a href="https://github.com/linkedin/white-elephant">LinkedIn White
Elephant</a> - log aggregator and dashboard.</li>
<li><a href="https://www.elastic.co/products/logstash">Logstash</a> - a
tool for managing events and logs.</li>
<li><a href="https://github.com/Netflix/suro">Netflix Suro</a> - log
aggregator like Storm and Samza based on Chukwa.</li>
<li><a href="https://github.com/pinterest/secor">Pinterest Secor</a> -
a service implementing Kafka log persistence.</li>
<li><a href="https://github.com/linkedin/gobblin">LinkedIn Gobblin</a> -
LinkedIn’s universal data ingestion framework.</li>
<li><a href="https://github.com/skizzehq/skizze">Skizze</a> - sketch
data store to deal with all problems around counting and sketching using
probabilistic data-structures.</li>
<li><a href="https://github.com/streamsets/datacollector">StreamSets
Data Collector</a> - continuous big data ingest infrastructure with a
simple-to-use IDE.</li>
<li><a href="https://www.alooma.com/integrations/mysql">Alooma</a> -
data pipeline as a service enabling moving data sources such as MySQL
into data warehouses.</li>
<li><a
href="https://github.com/rudderlabs/rudder-server">RudderStack</a> - an
open source customer data infrastructure (Segment, mParticle
alternative) written in Go.</li>
<li><a href="https://github.com/aklivity/zilla">Zilla</a> - an API
gateway built for event-driven architectures and streaming that supports
standard protocols such as HTTP, SSE, gRPC, MQTT and the native Kafka
protocol.</li>
</ul>
<h2 id="service-programming">Service Programming</h2>
<ul>
<li><a href="http://akka.io/">Akka Toolkit</a> - runtime for
distributed and fault-tolerant event-driven applications on the
JVM.</li>
<li><a href="http://avro.apache.org/">Apache Avro</a> - data
serialization system.</li>
<li><a href="http://curator.apache.org/">Apache Curator</a> - Java
libraries for Apache ZooKeeper.</li>
<li><a href="http://karaf.apache.org/">Apache Karaf</a> - OSGi runtime
that runs on top of any OSGi framework.</li>
<li><a href="http://thrift.apache.org//">Apache Thrift</a> - framework
to build binary protocols.</li>
<li><a href="http://zookeeper.apache.org/">Apache ZooKeeper</a> -
centralized service for process management.</li>
<li><a href="https://research.google.com/archive/chubby.html">Google
Chubby</a> - a lock service for loosely-coupled distributed
systems.</li>
<li><a href="https://github.com/Hydrospheredata/mist">Hydrosphere
Mist</a> - a service for exposing Apache Spark analytics jobs and
machine learning models as realtime, batch or reactive web
services.</li>
<li><a href="https://engineering.linkedin.com/data">LinkedIn Norbert</a>
- cluster manager.</li>
<li><a href="https://github.com/mara/data-integration">Mara</a> - a
lightweight opinionated ETL framework, halfway between plain scripts and
Apache Airflow.</li>
<li><a href="https://www.open-mpi.org/">OpenMPI</a> - message passing
framework.</li>
<li><a href="https://www.serf.io/">Serf</a> - decentralized solution for
service discovery and orchestration.</li>
<li><a href="https://github.com/spotify/luigi">Spotify Luigi</a> - a
Python package for building complex pipelines of batch jobs. It handles
dependency resolution, workflow management, visualization, failure
handling, command line integration, and much more.</li>
<li><a href="https://github.com/spring-projects/spring-xd">Spring XD</a>
- distributed and extensible system for data ingestion, real time
analytics, batch processing, and data export.</li>
<li><a href="https://github.com/twitter/elephant-bird">Twitter Elephant
Bird</a> - libraries for working with LZOP-compressed data.</li>
<li><a href="https://twitter.github.io/finagle/">Twitter Finagle</a> -
asynchronous network stack for the JVM.</li>
</ul>
<h2 id="scheduling">Scheduling</h2>
<ul>
<li><a href="https://github.com/apache/incubator-airflow">Apache
Airflow</a> - a platform to programmatically author, schedule and
monitor workflows.</li>
<li><a href="http://aurora.apache.org/">Apache Aurora</a> - a service
scheduler that runs on top of Apache Mesos.</li>
<li><a href="http://falcon.apache.org/">Apache Falcon</a> - data
management framework.</li>
<li><a href="http://oozie.apache.org/">Apache Oozie</a> - workflow job
scheduler.</li>
<li><a
href="https://docs.microsoft.com/en-us/azure/data-factory/data-factory-introduction">Azure
Data Factory</a> - cloud-based pipeline orchestration for on-prem, cloud
and HDInsight.</li>
<li><a href="http://mesos.github.io/chronos/">Chronos</a> - distributed
and fault-tolerant scheduler.</li>
<li><a href="https://github.com/jhuckaby/Cronicle">Cronicle</a> -
distributed, easy-to-install, Node.js-based task scheduler.</li>
<li><a href="https://github.com/dagster-io/dagster">Dagster</a> - a data
orchestrator for machine learning, analytics, and ETL.</li>
<li><a href="https://azkaban.github.io/">LinkedIn Azkaban</a> - batch
workflow job scheduler.</li>
<li><a href="https://github.com/ottogroup/schedoscope">Schedoscope</a> -
Scala DSL for agile scheduling of Hadoop jobs.</li>
<li><a href="https://github.com/radlab/sparrow">Sparrow</a> - scheduling
platform.</li>
</ul>
<h2 id="machine-learning">Machine Learning</h2>
<ul>
<li><a href="https://studio.azureml.net/">Azure ML Studio</a> -
cloud-based AzureML, R, Python machine learning platform.</li>
<li><a href="https://github.com/harthur/brain">brain</a> - neural
networks in JavaScript.</li>
<li><a href="https://github.com/OryxProject/oryx">Oryx</a> - Lambda
architecture on Apache Spark, Apache Kafka for real-time large scale
machine learning.</li>
<li><a href="http://www.cascading.org/projects/pattern/">Concurrent
Pattern</a> - machine learning library for Cascading.</li>
<li><a href="https://github.com/karpathy/convnetjs">convnetjs</a> - deep
learning in JavaScript. Train convolutional neural networks (or ordinary
ones) in your browser.</li>
<li><a href="https://github.com/deeplearning4j/DataVec">DataVec</a> - a
vectorization and data preprocessing library for deep learning in Java
and Scala. Part of the Deeplearning4j ecosystem.</li>
<li><a href="https://github.com/deeplearning4j">Deeplearning4j</a> -
fast, open deep learning for the JVM (Java, Scala, Clojure). A neural
network configuration layer powered by a C++ library. Uses Spark and
Hadoop to train nets on multiple GPUs and CPUs.</li>
<li><a href="https://github.com/danielsdeleo/Decider">Decider</a> -
flexible and extensible machine learning in Ruby.</li>
<li><a href="http://www.heatonresearch.com/encog/">ENCOG</a> - machine
learning framework that supports a variety of advanced algorithms, as
well as support classes to normalize and process data.</li>
<li><a href="http://www.etcml.com/">etcML</a> - text classification with
machine learning.</li>
<li><a href="https://github.com/etsy/Conjecture">Etsy Conjecture</a> -
scalable machine learning in Scalding.</li>
<li><a href="https://github.com/gojek/feast">Feast</a> - a feature store
for the management, discovery, and access of machine learning features.
Feast provides a consistent view of feature data for both model training
and model serving.</li>
<li><a href="https://dato.com/products/create/">GraphLab Create</a> - a
machine learning platform in Python with a broad collection of ML
toolkits, data engineering, and deployment tools.</li>
<li><a href="https://github.com/h2oai/h2o-3/">H2O</a> - statistical,
machine learning and math runtime with Hadoop. R and Python.</li>
<li><a href="https://github.com/benedekrozemberczki/karateclub">Karate
Club</a> - an unsupervised machine learning library for graph structured
data. Python.</li>
<li><a href="https://github.com/fchollet/keras">Keras</a> - an intuitive
neural net API inspired by Torch that runs atop Theano and
TensorFlow.</li>
<li><a href="https://github.com/johnsonc/lambdo">Lambdo</a> -
a workflow engine which significantly simplifies the analysis process by
unifying feature engineering and machine learning operations.</li>
<li><a
href="https://github.com/benedekrozemberczki/littleballoffur">Little
Ball of Fur</a> - a subsampling library for graph structured data.
Python.</li>
<li><a href="http://mahout.apache.org/">Mahout</a> - an Apache-backed
machine learning library for Hadoop.</li>
<li><a href="http://www.mlbase.org/">MLbase</a> - distributed machine
learning libraries for the BDAS stack.</li>
<li><a
href="https://github.com/nikolaypavlov/MLPNeuralNet">MLPNeuralNet</a> -
fast multilayer perceptron neural network library for iOS and Mac OS
X.</li>
<li><a href="https://github.com/ml-tooling/ml-workspace">ML
Workspace</a> - all-in-one web-based IDE specialized for machine
learning and data science.</li>
<li><a href="http://moa.cms.waikato.ac.nz">MOA</a> - performs big
data stream mining in real time, and large scale machine learning.</li>
<li><a href="https://monkeylearn.com/">MonkeyLearn</a> - text mining
made easy. Extract and classify data from text.</li>
<li><a href="https://github.com/deeplearning4j/nd4j">ND4J</a> - a matrix
library for the JVM. NumPy for Java.</li>
<li><a href="https://github.com/numenta/nupic">nupic</a> - Numenta
Platform for Intelligent Computing: a brain-inspired machine
intelligence platform, and biologically accurate neural network based on
cortical learning algorithms.</li>
<li><a
href="http://predictionio.incubator.apache.org/index.html">PredictionIO</a>
- machine learning server built on Hadoop, Mahout and Cascading.</li>
<li><a
href="https://github.com/benedekrozemberczki/pytorch_geometric_temporal">PyTorch
Geometric Temporal</a> - a temporal extension library for PyTorch
Geometric.</li>
<li><a href="https://github.com/deeplearning4j/rl4j">RL4J</a> -
reinforcement learning for Java and Scala. Includes Deep-Q learning and
A3C algorithms, and integrates with OpenAI’s Gym. Runs in the
Deeplearning4j ecosystem.</li>
<li><a href="http://samoa.incubator.apache.org/">SAMOA</a> - distributed
streaming machine learning framework.</li>
<li><a
href="https://github.com/scikit-learn/scikit-learn">scikit-learn</a> -
machine learning in Python.</li>
<li><a href="https://github.com/benedekrozemberczki/shapley">Shapley</a>
- a data-driven framework to quantify the value of classifiers in a
machine learning ensemble.</li>
<li><a href="http://spark.apache.org/docs/0.9.0/mllib-guide.html">Spark
MLlib</a> - a Spark implementation of some common machine learning (ML)
functionality.</li>
<li><a
href="https://users.soe.ucsc.edu/~niejiazhong/slides/chandra.pdf">Sibyl</a>
- system for large scale machine learning at Google.</li>
<li><a href="https://github.com/tensorflow/tensorflow">TensorFlow</a> -
library from Google for machine learning using data flow graphs.</li>
<li><a href="https://github.com/theano">Theano</a> - a Python-focused
machine learning library supported by the University of Montreal.</li>
<li><a href="https://github.com/torch">Torch</a> - a deep learning
library with a Lua API, supported by NYU and Facebook.</li>
<li><a href="https://github.com/amplab/velox-modelserver">Velox</a> -
system for serving machine learning predictions.</li>
<li><a href="https://github.com/JohnLangford/vowpal_wabbit/wiki">Vowpal
Wabbit</a> - learning system sponsored by Microsoft and Yahoo!.</li>
<li><a href="http://www.cs.waikato.ac.nz/ml/weka/">WEKA</a> - suite of
machine learning software.</li>
<li><a href="https://github.com/BIDData/BIDMach">BIDMach</a> - CPU and
GPU-accelerated machine learning library.</li>
</ul>
<h2 id="benchmarking">Benchmarking</h2>
<ul>
<li><a
href="https://issues.apache.org/jira/browse/MAPREDUCE-3561">Apache
Hadoop Benchmarking</a> - micro-benchmarks for testing Hadoop
performance.</li>
<li><a href="https://github.com/SWIMProjectUCB/SWIM/wiki">Berkeley SWIM
Benchmark</a> - real-world big data workload benchmark.</li>
<li><a href="https://github.com/intel-hadoop/HiBench">Intel HiBench</a>
- a Hadoop benchmark suite.</li>
<li><a href="https://issues.apache.org/jira/browse/MAPREDUCE-5116">PUMA
Benchmarking</a> - benchmark suite for MapReduce applications.</li>
<li><a
href="http://yahoohadoop.tumblr.com/post/98294079296/gridmix3-emulating-production-workload-for">Yahoo
Gridmix3</a> - Hadoop cluster benchmarking from the Yahoo engineering
team.</li>
<li><a
href="https://github.com/deeplearning4j/dl4j-benchmark">Deeplearning4j
Benchmarks</a></li>
<li><a href="https://github.com/unum-cloud/ucsb">UCSB</a> - extended
Yahoo Cloud Serving Benchmark for NoSQL databases.</li>
</ul>
<h2 id="security">Security</h2>
<ul>
<li><a href="http://ranger.apache.org/">Apache Ranger</a> - central
security admin & fine-grained authorization for Hadoop.</li>
<li><a href="http://eagle.apache.org/">Apache Eagle</a> - real time
monitoring solution.</li>
<li><a href="http://knox.apache.org/">Apache Knox Gateway</a> - single
point of secure access for Hadoop clusters.</li>
<li><a href="http://incubator.apache.org/projects/sentry.html">Apache
Sentry</a> - security module for data stored in Hadoop.</li>
<li><a href="https://github.com/kotobukki/BDA/">BDA</a> - a
vulnerability detector for Hadoop and Spark.</li>
</ul>
<h2 id="system-deployment">System Deployment</h2>
|
||
<ul>
|
||
<li><a href="http://ambari.apache.org/">Apache Ambari</a> - operational
|
||
framework for Hadoop management.</li>
|
||
<li><a href="http://bigtop.apache.org//">Apache Bigtop</a> - system
|
||
deployment framework for the Hadoop ecosystem.</li>
|
||
<li><a href="http://helix.apache.org/">Apache Helix</a> - cluster
|
||
management framework.</li>
|
||
<li><a href="http://mesos.apache.org/">Apache Mesos</a> - cluster
|
||
manager.</li>
|
||
<li><a href="https://github.com/apache/incubator-slider">Apache
|
||
Slider</a> - a YARN application to deploy existing distributed
|
||
applications on YARN.</li>
|
||
<li><a href="http://whirr.apache.org/">Apache Whirr</a> - set of
|
||
libraries for running cloud services.</li>
|
||
<li><a href="https://hortonworks.com/hadoop/yarn/">Apache YARN</a> -
|
||
Cluster manager.</li>
|
||
<li><a href="http://brooklyncentral.github.io/">Brooklyn</a> - library
|
||
that simplifies application deployment and management.</li>
|
||
<li><a href="http://buildoop.github.io/">Buildoop</a> - Similar to
|
||
Apache Bigtop, based on the Groovy language.</li>
|
||
<li><a href="http://gethue.com/">Cloudera HUE</a> - web application for
|
||
interacting with Hadoop.</li>
|
||
<li><a href="http://www.wired.com/2012/08/facebook-prism/">Facebook
|
||
Prism</a> - multi-datacenter replication system.</li>
|
||
<li><a
|
||
href="https://www.wired.com/2013/03/google-borg-twitter-mesos/all/">Google
|
||
Borg</a> - job scheduling and monitoring system.</li>
|
||
<li><a href="https://www.youtube.com/watch?v=0ZFMlO98Jkc">Google
|
||
Omega</a> - job scheduling and monitoring system.</li>
|
||
<li><a
|
||
href="https://hortonworks.com/blog/introducing-hoya-hbase-on-yarn/">Hortonworks
|
||
HOYA</a> - application that can deploy HBase cluster on YARN.</li>
|
||
<li><a href="https://kubernetes.io/">Kubernetes</a> - a system for
|
||
automating deployment, scaling, and management of containerized
|
||
applications.</li>
|
||
<li><a href="https://github.com/mesosphere/marathon">Marathon</a> -
|
||
Mesos framework for long-running services.</li>
|
||
<li><a href="https://github.com/WeBankFinTech/Linkis">Linkis</a> -
|
||
Linkis helps easily connect to various back-end computation/storage
|
||
engines.</li>
|
||
</ul>
|
||
<h2 id="applications">Applications</h2>
|
||
<ul>
|
||
<li><a href="https://github.com/etsy/411">411</a> - a web application
|
||
for alert management resulting from scheduled searches into
|
||
Elasticsearch.</li>
|
||
<li><a href="https://github.com/adobe-research/spindle">Adobe
|
||
spindle</a> - Next-generation web analytics processing with Scala,
|
||
Spark, and Parquet.</li>
|
||
<li><a href="http://metron.apache.org/">Apache Metron</a> - a platform
|
||
that integrates a variety of open source big data technologies in order
|
||
to offer a centralized tool for security monitoring and analysis.</li>
|
||
<li><a href="http://nutch.apache.org/">Apache Nutch</a> - open source
|
||
web crawler.</li>
|
||
<li><a href="http://oodt.apache.org/">Apache OODT</a> - capturing,
|
||
processing and sharing of data for NASA’s scientific archives.</li>
|
||
<li><a href="https://tika.apache.org/">Apache Tika</a> - content
|
||
analysis toolkit.</li>
|
||
<li><a href="https://github.com/salesforce/Argus">Argus</a> - Time
|
||
series monitoring and alerting platform.</li>
|
||
<li><a href="https://github.com/uber/AthenaX">AthenaX</a> - a streaming
|
||
analytics platform that enables users to run production-quality, large
|
||
scale streaming analytics using Structured Query Language (SQL).</li>
|
||
<li><a href="https://github.com/Netflix/atlas">Atlas</a> - a backend for
|
||
managing dimensional time series data.</li>
|
||
<li><a href="https://count.ly/">Countly</a> - open source mobile and web
|
||
analytics platform, based on Node.js & MongoDB.</li>
|
||
<li><a href="https://www.comet.com/site/">Comet</a> - Comet provides an
|
||
end-to-end model evaluation platform for AI developers, with best in
|
||
class LLM evaluations, experiment tracking, and production
|
||
monitoring.</li>
|
||
<li><a href="https://www.dominodatalab.com/">Domino</a> - Run, scale,
|
||
share, and deploy models — without any infrastructure.</li>
|
||
<li><a href="http://www.eclipse.org/birt/">Eclipse BIRT</a> -
|
||
Eclipse-based reporting system.</li>
|
||
<li><a href="https://github.com/Yelp/elastalert">ElastAlert</a> -
|
||
ElastAlert is a simple framework for alerting on anomalies, spikes, or
|
||
other patterns of interest from data in Elasticsearch.</li>
|
||
<li><a href="https://github.com/Codecademy/EventHub">Eventhub</a> - open
|
||
source event analytics platform.</li>
|
||
<li><a href="https://hash.ai">HASH</a> - open source simulation and
|
||
visualization platform.</li>
|
||
<li><a href="https://github.com/allegro/hermes">Hermes</a> -
|
||
asynchronous message broker built on top of Kafka.</li>
|
||
<li><a href="https://www.splunk.com/en_us/download/hunk.html">Hunk</a> -
|
||
Splunk analytics for Hadoop.</li>
|
||
<li><a href="http://opensource.indeedeng.io/imhotep/">Imhotep</a> -
|
||
Large-scale analytics platform by Indeed.</li>
|
||
<li><a href="https://www.indicative.com/">Indicative</a> - Web &
|
||
mobile analytics tool, with data warehouse (AWS, BigQuery)
|
||
integration.</li>
|
||
<li><a href="https://jupyter.org/">Jupyter</a> - Notebook and project
|
||
application for interactive data science and scientific computing across
|
||
all programming languages.</li>
|
||
<li><a href="http://madlib.incubator.apache.org/community/">MADlib</a> -
|
||
data-processing library that runs inside an RDBMS to analyze data.</li>
|
||
<li><a href="https://github.com/influxdata/kapacitor">Kapacitor</a> - an
|
||
open source framework for processing, monitoring, and alerting on time
|
||
series data.</li>
|
||
<li><a href="http://kylin.apache.org/">Kylin</a> - open source
|
||
Distributed Analytics Engine from eBay.</li>
|
||
<li><a href="https://github.com/pivotalsoftware/PivotalR">PivotalR</a> -
|
||
R on Pivotal HD / HAWQ and PostgreSQL.</li>
|
||
<li><a href="https://www.comet.com/site/products/opik/">Opik</a> -
|
||
Debug, evaluate, and monitor your LLM applications, RAG systems, and
|
||
agentic workflows with comprehensive tracing, automated evaluations, and
|
||
production-ready dashboards.</li>
|
||
<li><a href="https://github.com/rakam-io/rakam">Rakam</a> - open-source
|
||
real-time custom analytics platform powered by PostgreSQL, Kinesis and
|
||
PrestoDB.</li>
|
||
<li><a href="https://www.qubole.com/">Qubole</a> - auto-scaling Hadoop
|
||
cluster, built-in data connectors.</li>
|
||
<li><a href="https://github.com/SnappyDataInc/snappydata">SnappyData</a>
|
||
- a distributed in-memory data store for real-time operational
|
||
analytics, delivering stream analytics, OLTP (online transaction
|
||
processing) and OLAP (online analytical processing) built on Spark in a
|
||
single integrated cluster.</li>
|
||
<li><a href="https://github.com/snowplow/snowplow">Snowplow</a> -
|
||
enterprise-strength web and event analytics, powered by Hadoop, Kinesis,
|
||
Redshift and Postgres.</li>
|
||
<li><a href="http://amplab-extras.github.io/SparkR-pkg/">SparkR</a> - R
|
||
frontend for Spark.</li>
|
||
<li><a href="https://www.splunk.com/">Splunk</a> - analyzer for
|
||
machine-generated data.</li>
|
||
<li><a href="https://www.sumologic.com/">Sumo Logic</a> - cloud based
|
||
analyzer for machine-generated data.</li>
|
||
<li><a href="https://github.com/brexhq/substation">Substation</a> -
|
||
Substation is a cloud native data pipeline and transformation toolkit
|
||
written in Go.</li>
|
||
<li><a href="http://www.talend.com/products/big-data/">Talend</a> -
|
||
unified open source environment for YARN, Hadoop, HBase, Hive, HCatalog
|
||
& Pig.</li>
|
||
</ul>
|
||
<h2 id="search-engine-and-framework">Search engine and framework</h2>
|
||
<ul>
|
||
<li><a href="http://lucene.apache.org/">Apache Lucene</a> - Search
|
||
engine library.</li>
|
||
<li><a href="http://lucene.apache.org/solr/">Apache Solr</a> - Search
|
||
platform built on Apache Lucene.</li>
|
||
<li><a href="https://github.com/strapdata/elassandra">Elassandra</a> -
|
||
a fork of Elasticsearch modified to run on top of Apache Cassandra in
|
||
a scalable and resilient peer-to-peer architecture.</li>
|
||
<li><a href="https://www.elastic.co/">ElasticSearch</a> - Search and
|
||
analytics engine based on Apache Lucene.</li>
|
||
<li><a href="https://www.enigma.com/">Enigma.io</a> - Freemium robust
|
||
web application for exploring, filtering, analyzing, searching and
|
||
exporting massive datasets scraped from across the Web.</li>
|
||
<li><a
|
||
href="https://googleblog.blogspot.it/2010/06/our-new-search-index-caffeine.html">Google
|
||
Caffeine</a> - continuous indexing system.</li>
|
||
<li><a href="https://research.google.com/pubs/pub36726.html">Google
|
||
Percolator</a> - continuous indexing system.</li>
|
||
<li><a
|
||
href="https://blogs.apache.org/hbase/entry/coprocessor_introduction">HBase
|
||
Coprocessor</a> - implementation of Percolator, part of HBase.</li>
|
||
<li><a href="http://ngdata.github.io/hbase-indexer/">Lily HBase
|
||
Indexer</a> - quickly and easily search for any content stored in
|
||
HBase.</li>
|
||
<li><a href="http://senseidb.github.io/bobo/">LinkedIn Bobo</a> - a
|
||
Faceted Search implementation written purely in Java, an extension to
|
||
Apache Lucene.</li>
|
||
<li><a href="https://github.com/linkedin/cleo">LinkedIn Cleo</a> - a
|
||
flexible software library for enabling rapid development of partial,
|
||
out-of-order and real-time typeahead search.</li>
|
||
<li><a
|
||
href="https://engineering.linkedin.com/search/did-you-mean-galene">LinkedIn
|
||
Galene</a> - search architecture at LinkedIn.</li>
|
||
<li><a href="https://github.com/senseidb/zoie">LinkedIn Zoie</a> - a
|
||
realtime search/indexing system written in Java.</li>
|
||
<li><a href="http://mg4j.di.unimi.it/">MG4J</a> - MG4J (Managing
|
||
Gigabytes for Java) is a full-text search engine for large document
|
||
collections written in Java. It is highly customisable, high-performance
|
||
and provides state-of-the-art features and new research algorithms.</li>
|
||
<li><a href="http://sphinxsearch.com/">Sphinx Search Server</a> -
|
||
full-text search engine.</li>
|
||
<li><a href="http://vespa.ai/">Vespa</a> - an engine for low-latency
|
||
computation over large data sets. It stores and indexes your data such
|
||
that queries, selection and processing over the data can be performed at
|
||
serving time.</li>
|
||
<li><a href="https://github.com/facebookresearch/faiss">Facebook
|
||
Faiss</a> - a library for efficient similarity search and clustering
|
||
of dense vectors. It contains algorithms that search in sets of vectors
|
||
of any size, up to ones that possibly do not fit in RAM. It also
|
||
contains supporting code for evaluation and parameter tuning. Faiss is
|
||
written in C++ with complete wrappers for Python/numpy.</li>
|
||
<li><a href="https://github.com/spotify/annoy">Annoy</a> - a C++
|
||
library with Python bindings to search for points in space that are
|
||
close to a given query point. It also creates large read-only file-based
|
||
data structures that are mmapped into memory so that many processes may
|
||
share the same data.</li>
|
||
<li><a href="https://github.com/semi-technologies/weaviate">Weaviate</a>
|
||
- Weaviate is a GraphQL-based semantic search engine with built-in
|
||
(word) embeddings.</li>
|
||
</ul>
|
||
<h2 id="mysql-forks-and-evolutions">MySQL forks and evolutions</h2>
|
||
<ul>
|
||
<li><a href="https://aws.amazon.com/rds/">Amazon RDS</a> - MySQL
|
||
databases in Amazon’s cloud.</li>
|
||
<li><a href="http://www.drizzle.org/">Drizzle</a> - evolution of MySQL
|
||
6.0.</li>
|
||
<li><a href="https://cloud.google.com/sql/docs/">Google Cloud SQL</a> -
|
||
MySQL databases in Google’s cloud.</li>
|
||
<li><a href="https://mariadb.org/">MariaDB</a> - enhanced, drop-in
|
||
replacement for MySQL.</li>
|
||
<li><a href="https://www.mysql.com/products/cluster/">MySQL Cluster</a>
|
||
- MySQL implementation using NDB Cluster storage engine.</li>
|
||
<li><a
|
||
href="https://www.percona.com/software/mysql-database/percona-server">Percona
|
||
Server</a> - enhanced, drop-in replacement for MySQL.</li>
|
||
<li><a href="https://github.com/renecannao/proxysql">ProxySQL</a> - High
|
||
Performance Proxy for MySQL.</li>
|
||
<li><a href="https://www.percona.com/">TokuDB</a> - TokuDB is a storage
|
||
engine for MySQL and MariaDB.</li>
|
||
<li><a href="http://webscalesql.org/">WebScaleSQL</a> - a
|
||
collaboration among engineers from several companies that face similar
|
||
challenges in running MySQL at scale.</li>
|
||
</ul>
|
||
<h2 id="postgresql-forks-and-evolutions">PostgreSQL forks and
|
||
evolutions</h2>
|
||
<ul>
|
||
<li><a href="http://db.cs.yale.edu/hadoopdb/hadoopdb.html">HadoopDB</a>
|
||
- hybrid of MapReduce and DBMS.</li>
|
||
<li><a href="http://www-01.ibm.com/software/data/netezza/">IBM
|
||
Netezza</a> - high-performance data warehouse appliances.</li>
|
||
<li><a href="http://www.postgres-xl.org/">Postgres-XL</a> - Scalable
|
||
Open Source PostgreSQL-based Database Cluster.</li>
|
||
<li><a href="http://www-users.cs.umn.edu/~sarwat/RecDB/">RecDB</a> -
|
||
Open Source Recommendation Engine Built Entirely Inside PostgreSQL.</li>
|
||
<li><a href="http://www.stormdb.com/community/stado">Stado</a> - open
|
||
source MPP database system solely targeted at data warehousing and data
|
||
mart applications.</li>
|
||
<li><a
|
||
href="https://www.scribd.com/doc/3159239/70-Everest-PGCon-RT">Yahoo
|
||
Everest</a> - multi-petabyte MPP database derived from PostgreSQL.</li>
|
||
<li><a href="http://www.timescale.com/">TimescaleDB</a> - An open-source
|
||
time-series database optimized for fast ingest and complex queries</li>
|
||
<li><a href="https://www.pipelinedb.com/">PipelineDB</a> - The Streaming
|
||
SQL Database. An open-source relational database that runs SQL queries
|
||
continuously on streams, incrementally storing results in tables</li>
|
||
</ul>
|
||
<h2 id="memcached-forks-and-evolutions">Memcached forks and
|
||
evolutions</h2>
|
||
<ul>
|
||
<li><a
|
||
href="https://www.facebook.com/notes/facebook-engineering/mcdipper-a-key-value-cache-for-flash-storage/10151347090423920">Facebook
|
||
McDipper</a> - key/value cache for flash storage.</li>
|
||
<li><a
|
||
href="https://www.facebook.com/notes/facebook-engineering/scaling-memcache-at-facebook/10151411410803920">Facebook
|
||
Memcached</a> - fork of Memcached.</li>
|
||
<li><a href="https://github.com/twitter/twemproxy">Twemproxy</a> - A
|
||
fast, light-weight proxy for memcached and redis.</li>
|
||
<li><a href="https://github.com/twitter/fatcache">Twitter Fatcache</a> -
|
||
key/value cache for flash storage.</li>
|
||
<li><a href="https://github.com/twitter/twemcache">Twitter Twemcache</a>
|
||
- fork of Memcached.</li>
|
||
</ul>
|
||
<h2 id="embedded-databases">Embedded Databases</h2>
|
||
<ul>
|
||
<li><a
|
||
href="http://www.actian.com/products/operational-databases/">Actian
|
||
PSQL</a> - ACID-compliant DBMS developed by Pervasive Software,
|
||
optimized for embedding in applications.</li>
|
||
<li><a
|
||
href="https://www.oracle.com/database/berkeley-db/index.html">BerkeleyDB</a>
|
||
- a software library that provides a high-performance embedded database
|
||
for key/value data.</li>
|
||
<li><a href="https://github.com/krestenkrab/hanoidb">HanoiDB</a> -
|
||
Erlang LSM BTree Storage.</li>
|
||
<li><a href="https://github.com/google/leveldb">LevelDB</a> - a fast
|
||
key-value storage library written at Google that provides an ordered
|
||
mapping from string keys to string values.</li>
|
||
<li><a href="https://symas.com/mdb/">LMDB</a> - ultra-fast,
|
||
ultra-compact key-value embedded data store developed by Symas.</li>
|
||
<li><a href="http://rocksdb.org/">RocksDB</a> - embeddable persistent
|
||
key-value store for fast storage based on LevelDB.</li>
|
||
</ul>
|
||
<h2 id="business-intelligence">Business Intelligence</h2>
|
||
<ul>
|
||
<li><a href="https://www.bimeanalytics.com/?lang=en">BIME Analytics</a>
|
||
- business intelligence platform in the cloud.</li>
|
||
<li><a href="https://github.com/ankane/blazer">Blazer</a> - business
|
||
intelligence made simple.</li>
|
||
<li><a href="https://chartio.com">Chartio</a> - lean business
|
||
intelligence platform to visualize and explore your data.</li>
|
||
<li><a href="https://count.co">Count</a> - notebook-based analytics and
|
||
visualisation platform using SQL or drag-and-drop.</li>
|
||
<li><a href="https://www.datapine.com/">datapine</a> - self-service
|
||
business intelligence tool in the cloud.</li>
|
||
<li><a href="https://dekart.xyz/">Dekart</a> - Large scale geospatial
|
||
analytics for Google BigQuery based on Kepler.gl.</li>
|
||
<li><a href="https://www.gooddata.com/">GoodData</a> - platform for data
|
||
products and embedded analytics.</li>
|
||
<li><a href="https://www.jaspersoft.com/">Jaspersoft</a> - powerful
|
||
business intelligence suite.</li>
|
||
<li><a href="https://www.jedox.com/en/">Jedox Palo</a> - customisable
|
||
Business Intelligence platform.</li>
|
||
<li><a href="https://jethro.io/">Jethrodata</a> - Interactive Big Data
|
||
Analytics.</li>
|
||
<li><a href="https://intermix.io/">intermix.io</a> - Performance
|
||
Monitoring for Amazon Redshift</li>
|
||
<li><a href="https://github.com/lightdash/lightdash">Lightdash</a> - The
|
||
open source Looker alternative built on dbt</li>
|
||
<li><a href="https://github.com/metabase/metabase">Metabase</a> - The
|
||
simplest, fastest way to get business intelligence and analytics to
|
||
everyone in your company.</li>
|
||
<li><a
|
||
href="http://www.microsoft.com/en-us/server-cloud/solutions/business-intelligence/default.aspx">Microsoft</a>
|
||
- business intelligence software and platform.</li>
|
||
<li><a href="https://www.microstrategy.com/">Microstrategy</a> -
|
||
software platforms for business intelligence, mobile intelligence, and
|
||
network applications.</li>
|
||
<li><a href="https://numeracy.co/">Numeracy</a> - Fast, clean SQL client
|
||
and business intelligence.</li>
|
||
<li><a href="http://www.pentaho.com/">Pentaho</a> - business
|
||
intelligence platform.</li>
|
||
<li><a href="http://www.qlik.com/us/">Qlik</a> - business intelligence
|
||
and analytics platform.</li>
|
||
<li><a href="https://redash.io/">Redash</a> - Open source business
|
||
intelligence platform, supporting multiple data sources and scheduled
|
||
queries.</li>
|
||
<li><a href="https://www.meteorite.bi/">Saiku Analytics</a> - Open
|
||
source analytics platform.</li>
|
||
<li><a href="https://www.knowage-suite.com/">Knowage</a> - open source
|
||
business intelligence platform (formerly <a
|
||
href="http://www.spagobi.org/">SpagoBi</a>)</li>
|
||
<li><a href="http://sparklinedata.com/">SparklineData SNAP</a> - modern
|
||
BI platform powered by Apache Spark.</li>
|
||
<li><a href="https://www.tableau.com/">Tableau</a> - business
|
||
intelligence platform.</li>
|
||
<li><a href="https://www.zoomdata.com/">Zoomdata</a> - Big Data
|
||
Analytics.</li>
|
||
</ul>
|
||
<h2 id="data-visualization">Data Visualization</h2>
|
||
<ul>
|
||
<li><a href="https://github.com/airbnb/airpal">Airpal</a> - Web UI for
|
||
PrestoDB.</li>
|
||
<li><a href="http://www.anychart.com">AnyChart</a> - fast, simple and
|
||
flexible JavaScript (HTML5) charting library featuring pure JS API.</li>
|
||
<li><a href="https://github.com/samizdatco/arbor">Arbor</a> - graph
|
||
visualization library using web workers and jQuery.</li>
|
||
<li><a href="https://github.com/LucidWorks/banana">Banana</a> -
|
||
visualize logs and time-stamped data stored in Solr. Port of
|
||
Kibana.</li>
|
||
<li><a href="https://github.com/ufukomer/bloomery">Bloomery</a> - Web UI
|
||
for Impala.</li>
|
||
<li><a href="http://bokeh.pydata.org/en/latest/">Bokeh</a> - A powerful
|
||
Python interactive visualization library that targets modern web
|
||
browsers for presentation, with the goal of providing elegant, concise
|
||
construction of novel graphics in the style of D3.js, but also
|
||
delivering this capability with high-performance interactivity over very
|
||
large or streaming datasets.</li>
|
||
<li><a href="http://c3js.org/">C3</a> - D3-based reusable chart
|
||
library</li>
|
||
<li><a href="https://github.com/CartoDB/cartodb">CartoDB</a> -
|
||
open-source or freemium hosting for geospatial databases with powerful
|
||
front-end editing capabilities and a robust API.</li>
|
||
<li><a href="http://chartd.co/">chartd</a> - responsive,
|
||
retina-compatible charts with just an img tag.</li>
|
||
<li><a href="http://www.chartjs.org/">Chart.js</a> - open source HTML5
|
||
Charts visualizations.</li>
|
||
<li><a href="https://github.com/gionkunz/chartist-js">Chartist.js</a> -
|
||
another open source HTML5 Charts visualization.</li>
|
||
<li><a href="http://square.github.io/crossfilter/">Crossfilter</a> -
|
||
JavaScript library for exploring large multivariate datasets in the
|
||
browser. Works well with dc.js and d3.js.</li>
|
||
<li><a href="https://github.com/square/cubism">Cubism</a> - JavaScript
|
||
library for time series visualization.</li>
|
||
<li><a href="http://cytoscape.github.io/">Cytoscape</a> - JavaScript
|
||
library for visualizing complex networks.</li>
|
||
<li><a href="http://dc-js.github.io/dc.js/">DC.js</a> - Dimensional
|
||
charting built to work natively with crossfilter rendered using d3.js.
|
||
Excellent for connecting charts/additional metadata to hover events in
|
||
D3.</li>
|
||
<li><a href="https://d3js.org/">D3</a> - JavaScript library for
|
||
manipulating documents.</li>
|
||
<li><a href="https://github.com/CSNW/d3.compose">D3.compose</a> -
|
||
Compose complex, data-driven visualizations from reusable charts and
|
||
components.</li>
|
||
<li><a href="http://d3plus.org">D3Plus</a> - A fairly robust set of
|
||
reusable charts and styles for d3.js.</li>
|
||
<li><a href="https://github.com/plotly/dash">Dash</a> - Analytical Web
|
||
Apps for Python, R, Julia, and Jupyter. Built on top of plotly, no JS
|
||
required</li>
|
||
<li><a href="https://dekart.xyz/">Dekart</a> - Large scale geospatial
|
||
analytics for Google BigQuery based on Kepler.gl.</li>
|
||
<li><a
|
||
href="https://devexpress.github.io/devextreme-reactive/react/chart/">DevExtreme
|
||
React Chart</a> - High-performance plugin-based React chart for
|
||
Bootstrap and Material Design.</li>
|
||
<li><a href="https://github.com/ecomfe/echarts">Echarts</a> - Baidu’s
|
||
enterprise charts.</li>
|
||
<li><a
|
||
href="https://github.com/HumbleSoftware/envisionjs">Envisionjs</a> -
|
||
dynamic HTML5 visualization.</li>
|
||
<li><a href="https://metrictools.org/">FnordMetric</a> - write SQL
|
||
queries that return SVG charts rather than tables</li>
|
||
<li><a href="https://frappe.io/charts">Frappe Charts</a> -
|
||
GitHub-inspired simple and modern SVG charts for the web with zero
|
||
dependencies.</li>
|
||
<li><a href="https://github.com/Freeboard/freeboard">Freeboard</a> - open
|
||
source real-time dashboard builder for IOT and other web mashups.</li>
|
||
<li><a href="https://github.com/gephi/gephi">Gephi</a> - An
|
||
award-winning open-source platform for visualizing and manipulating
|
||
large graphs and network connections. It’s like Photoshop, but for
|
||
graphs. Available for Windows, macOS and Linux.</li>
|
||
<li><a href="https://developers.google.com/chart/">Google Charts</a> -
|
||
simple charting API.</li>
|
||
<li><a href="https://grafana.com/">Grafana</a> - Graphite dashboard
|
||
frontend, editor and graph composer.</li>
|
||
<li><a href="http://graphiteapp.org/">Graphite</a> - scalable Realtime
|
||
Graphing.</li>
|
||
<li><a href="https://www.highcharts.com/">Highcharts</a> - simple and
|
||
flexible charting API.</li>
|
||
<li><a href="http://ipython.org/">IPython</a> - provides a rich
|
||
architecture for interactive computing.</li>
|
||
<li><a href="https://www.elastic.co/products/kibana">Kibana</a> -
|
||
visualize logs and time-stamped data</li>
|
||
<li><a href="http://lumify.io/">Lumify</a> - open source big data
|
||
analysis and visualization platform</li>
|
||
<li><a href="https://github.com/matplotlib/matplotlib">Matplotlib</a> -
|
||
plotting with Python.</li>
|
||
<li><a href="https://metricsgraphicsjs.org/">MetricsGraphics.js</a> - a
|
||
library built on top of D3 that is optimized for time-series data</li>
|
||
<li><a href="http://nvd3.org/">NVD3</a> - chart components for
|
||
d3.js.</li>
|
||
<li><a href="https://github.com/benpickles/peity">Peity</a> -
|
||
Progressive SVG bar, line and pie charts.</li>
|
||
<li><a href="https://plot.ly/">Plot.ly</a> - Easy-to-use web service
|
||
that allows for rapid creation of complex charts, from heatmaps to
|
||
histograms. Upload data to create and style charts with Plotly’s online
|
||
spreadsheet. Fork others’ plots.</li>
|
||
<li><a href="https://github.com/plotly/plotly.js">Plotly.js</a> - The open
|
||
source JavaScript graphing library that powers Plotly.</li>
|
||
<li><a href="https://github.com/okfn/recline">Recline</a> - simple but
|
||
powerful library for building data applications in pure JavaScript and
|
||
HTML.</li>
|
||
<li><a href="https://github.com/getredash/redash">Redash</a> -
|
||
open-source platform to query and visualize data.</li>
|
||
<li><a href="http://recharts.org/">ReCharts</a> - A composable charting
|
||
library built on React components</li>
|
||
<li><a href="http://shiny.rstudio.com/">Shiny</a> - a web application
|
||
framework for R.</li>
|
||
<li><a href="https://github.com/jacomyal/sigma.js">Sigma.js</a> -
|
||
JavaScript library dedicated to graph drawing.</li>
|
||
<li><a href="https://github.com/apache/incubator-superset">Superset</a>
|
||
- a data exploration platform designed to be visual, intuitive and
|
||
interactive, making it easy to slice, dice and visualize data and
|
||
perform analytics at the speed of thought.</li>
|
||
<li><a href="https://github.com/vega/vega">Vega</a> - a visualization
|
||
grammar.</li>
|
||
<li><a href="https://github.com/ZEPL/zeppelin">Zeppelin</a> - a
|
||
notebook-style collaborative data analysis tool.</li>
|
||
<li><a href="https://www.zingchart.com/">Zing Charts</a> - JavaScript
|
||
charting library for big data.</li>
|
||
<li><a
|
||
href="https://github.com/WeBankFinTech/DataSphereStudio">DataSphere
|
||
Studio</a> - one-stop data application development management
|
||
portal.</li>
|
||
</ul>
|
||
<h2 id="internet-of-things-and-sensor-data">Internet of things and
|
||
sensor data</h2>
|
||
<ul>
|
||
<li><a href="http://edgent.apache.org/">Apache Edgent (Incubating)</a> -
|
||
a programming model and micro-kernel style runtime that can be embedded
|
||
in gateways and small footprint edge devices enabling local, real-time,
|
||
analytics on the edge devices.</li>
|
||
<li><a href="https://azure.microsoft.com/en-us/services/iot-hub/">Azure
|
||
IoT Hub</a> - Cloud-based bi-directional monitoring and messaging
|
||
hub</li>
|
||
<li><a href="https://www.tempoiq.com/">TempoIQ</a> - Cloud-based sensor
|
||
analytics.</li>
|
||
<li><a href="http://2lemetry.com/">2lemetry</a> - Platform for Internet
|
||
of things.</li>
|
||
<li><a href="https://www.pubnub.com/">Pubnub</a> - Data stream
|
||
network</li>
|
||
<li><a href="https://www.thingworx.com/">ThingWorx</a> - Rapid
|
||
development and connection of intelligent systems</li>
|
||
<li><a href="https://ifttt.com/">IFTTT</a> - If this then that</li>
|
||
<li><a href="https://evrythng.com/">Evrythng</a> - Making products
|
||
smart</li>
|
||
<li><a href="https://github.com/marty90/netlytics/">NetLytics</a> -
|
||
Analytics platform to process network data on Spark.</li>
|
||
<li><a href="https://ably.com/">Ably</a> - Pub/sub messaging platform
|
||
for IoT</li>
|
||
</ul>
|
||
<h2 id="interesting-readings">Interesting Readings</h2>
|
||
<ul>
|
||
<li><a href="https://amplab.cs.berkeley.edu/benchmark/">Big Data
|
||
Benchmark</a> - Benchmark of Redshift, Hive, Shark, Impala and
|
||
Stinger/Tez.</li>
|
||
<li><a
|
||
href="https://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis">NoSQL
|
||
Comparison</a> - Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs
|
||
HBase vs Couchbase vs Neo4j vs Hypertable vs ElasticSearch vs Accumulo
|
||
vs VoltDB vs Scalaris comparison.</li>
|
||
<li><a
|
||
href="https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics?ref=awesome">Monitoring
|
||
Kafka performance</a> - Guide to monitoring Apache Kafka, including
|
||
native methods for metrics collection.</li>
|
||
<li><a
|
||
href="https://www.datadoghq.com/blog/monitor-hadoop-metrics?ref=awesome">Monitoring
|
||
Hadoop performance</a> - Guide to monitoring Hadoop, with an overview of
|
||
Hadoop architecture, and native methods for metrics collection.</li>
|
||
<li><a
|
||
href="https://www.datadoghq.com/blog/how-to-monitor-cassandra-performance-metrics/?ref=awesome">Monitoring
|
||
Cassandra performance</a> - Guide to monitoring Cassandra, including
|
||
native methods for metrics collection.</li>
|
||
</ul>
|
||
<h2 id="interesting-papers">Interesting Papers</h2>
|
||
<h3 id="section">2015 - 2016</h3>
|
||
<ul>
|
||
<li><a href="http://www.vldb.org/pvldb/vol8/p1804-ching.pdf">2015</a> -
|
||
<strong>Facebook</strong> - One Trillion Edges: Graph Processing at
|
||
Facebook-Scale.</li>
|
||
</ul>
|
||
<h3 id="section-1">2013 - 2014</h3>
|
||
<ul>
|
||
<li><a href="http://infolab.stanford.edu/~ullman/mmds/book.pdf">2014</a>
|
||
- <strong>Stanford</strong> - Mining of Massive Datasets.</li>
|
||
<li><a
|
||
href="https://amplab.cs.berkeley.edu/wp-content/uploads/2013/03/eurosys13-paper83.pdf">2013</a>
|
||
- <strong>AMPLab</strong> - Presto: Distributed Machine Learning and
|
||
Graph Processing with Sparse Matrices.</li>
|
||
<li><a
|
||
href="https://amplab.cs.berkeley.edu/wp-content/uploads/2013/01/dmx1.pdf">2013</a>
|
||
- <strong>AMPLab</strong> - MLbase: A Distributed Machine-learning
|
||
System.</li>
|
||
<li><a
|
||
href="https://amplab.cs.berkeley.edu/wp-content/uploads/2013/02/shark_sigmod2013.pdf">2013</a>
|
||
- <strong>AMPLab</strong> - Shark: SQL and Rich Analytics at Scale.</li>
|
||
<li><a
|
||
href="https://amplab.cs.berkeley.edu/wp-content/uploads/2013/05/grades-graphx_with_fonts.pdf">2013</a>
|
||
- <strong>AMPLab</strong> - GraphX: A Resilient Distributed Graph System
|
||
on Spark.</li>
|
||
<li><a
|
||
href="http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/40671.pdf">2013</a>
|
||
- <strong>Google</strong> - HyperLogLog in Practice: Algorithmic
|
||
Engineering of a State of The Art Cardinality Estimation Algorithm.</li>
|
||
<li><a
|
||
href="http://research.microsoft.com/pubs/200169/now-vldb.pdf">2013</a> -
|
||
<strong>Microsoft</strong> - Scalable Progressive Analytics on Big Data
|
||
in the Cloud.</li>
|
||
<li><a href="http://static.druid.io/docs/druid.pdf">2013</a> -
|
||
<strong>Metamarkets</strong> - Druid: A Real-time Analytical Data
|
||
Store.</li>
|
||
<li><a
|
||
href="http://db.disi.unitn.eu/pages/VLDBProgram/pdf/industry/p764-rae.pdf">2013</a>
|
||
- <strong>Google</strong> - Online, Asynchronous Schema Change in
|
||
F1.</li>
|
||
<li><a
|
||
href="http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/41344.pdf">2013</a>
|
||
- <strong>Google</strong> - F1: A Distributed SQL Database That
|
||
Scales.</li>
|
||
<li><a
|
||
href="http://db.disi.unitn.eu/pages/VLDBProgram/pdf/industry/p734-akidau.pdf">2013</a>
|
||
- <strong>Google</strong> - MillWheel: Fault-Tolerant Stream Processing
|
||
at Internet Scale.</li>
|
||
<li><a
|
||
href="http://db.disi.unitn.eu/pages/VLDBProgram/pdf/industry/p767-wiener.pdf">2013</a>
|
||
- <strong>Facebook</strong> - Scuba: Diving into Data at Facebook.</li>
|
||
<li><a
|
||
href="http://db.disi.unitn.eu/pages/VLDBProgram/pdf/industry/p871-curtiss.pdf">2013</a>
|
||
- <strong>Facebook</strong> - Unicorn: A System for Searching the Social
|
||
Graph.</li>
|
||
<li><a
|
||
href="https://www.usenix.org/system/files/conference/nsdi13/nsdi13-final170_update.pdf">2013</a>
- <strong>Facebook</strong> - Scaling Memcache at Facebook.</li>
</ul>
<h3 id="section-2">2011 - 2012</h3>
<ul>
<li><a
href="http://vldb.org/pvldb/vol5/p1771_georgelee_vldb2012.pdf">2012</a>
- <strong>Twitter</strong> - The Unified Logging Infrastructure for Data
Analytics at Twitter.</li>
<li><a
href="https://amplab.cs.berkeley.edu/wp-content/uploads/2013/04/blinkdb_vldb12_demo.pdf">2012</a>
- <strong>AMPLab</strong> - Blink and It’s Done: Interactive Queries on
Very Large Data.</li>
<li><a
href="https://www.usenix.org/system/files/login/articles/zaharia.pdf">2012</a>
- <strong>AMPLab</strong> - Fast and Interactive Analytics over Hadoop
Data with Spark.</li>
<li><a
href="https://amplab.cs.berkeley.edu/wp-content/uploads/2012/03/mod482-xin1.pdf">2012</a>
- <strong>AMPLab</strong> - Shark: Fast Data Analysis Using
Coarse-grained Distributed Memory.</li>
<li><a
href="https://www.usenix.org/legacy/event/nsdi11/tech/full_papers/Bolosky.pdf">2011</a>
- <strong>Microsoft</strong> - Paxos Replicated State Machines as the
Basis of a High-Performance Data Store.</li>
<li><a
href="http://research.microsoft.com/pubs/178045/ppaoxs-paper29.pdf">2012</a>
- <strong>Microsoft</strong> - Paxos Made Parallel.</li>
<li><a href="https://arxiv.org/pdf/1203.5485.pdf">2012</a> -
<strong>AMPLab</strong> - BlinkDB: Queries with Bounded Errors and
Bounded Response Times on Very Large Data.</li>
<li><a
href="http://vldb.org/pvldb/vol5/p1436_alexanderhall_vldb2012.pdf">2012</a>
- <strong>Google</strong> - Processing a trillion cells per mouse
click.</li>
<li><a
href="http://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf">2012</a>
- <strong>Google</strong> - Spanner: Google’s Globally-Distributed
Database.</li>
<li><a
href="https://amplab.cs.berkeley.edu/wp-content/uploads/2011/06/euro118-ananthanarayanan.pdf">2011</a>
- <strong>AMPLab</strong> - Scarlett: Coping with Skewed Popularity
Content in MapReduce Clusters.</li>
<li><a
href="https://amplab.cs.berkeley.edu/wp-content/uploads/2011/06/Mesos-A-Platform-for-Fine-Grained-Resource-Sharing-in-the-Data-Center.pdf">2011</a>
- <strong>AMPLab</strong> - Mesos: A Platform for Fine-Grained Resource
Sharing in the Data Center.</li>
<li><a
href="http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36971.pdf">2011</a>
- <strong>Google</strong> - Megastore: Providing Scalable, Highly
Available Storage for Interactive Services.</li>
</ul>
<h3 id="section-3">2001 - 2010</h3>
<ul>
<li><a
href="https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf">2010</a>
- <strong>Facebook</strong> - Finding a needle in Haystack: Facebook’s
photo storage.</li>
<li><a
href="https://amplab.cs.berkeley.edu/wp-content/uploads/2011/06/Spark-Cluster-Computing-with-Working-Sets.pdf">2010</a>
- <strong>AMPLab</strong> - Spark: Cluster Computing with Working
Sets.</li>
<li><a href="http://kowshik.github.io/JPregel/pregel_paper.pdf">2010</a>
- <strong>Google</strong> - Pregel: A System for Large-Scale Graph
Processing.</li>
<li><a
href="http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36726.pdf">2010</a>
- <strong>Google</strong> - Large-scale Incremental Processing Using
Distributed Transactions and Notifications (the basis of Percolator and
Caffeine).</li>
<li><a
href="http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf">2010</a>
- <strong>Google</strong> - Dremel: Interactive Analysis of Web-Scale
Datasets.</li>
<li><a href="http://leoneu.github.io/">2010</a> - <strong>Yahoo</strong>
- S4: Distributed Stream Computing Platform.</li>
<li><a href="http://www.cs.umd.edu/~abadi/papers/hadoopdb.pdf">2009</a>
- HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies
for Analytical Workloads.</li>
<li><a
href="https://cwiki.apache.org/confluence/download/attachments/120729877/chukwa_cca08.pdf?version=1&modificationDate=1562667399000&api=v2">2008</a>
- <strong>AMPLab</strong> - Chukwa: A large-scale monitoring
system.</li>
<li><a
href="http://www.read.seas.harvard.edu/~kohler/class/cs239-w08/decandia07dynamo.pdf">2007</a>
- <strong>Amazon</strong> - Dynamo: Amazon’s Highly Available Key-value
Store.</li>
<li><a
href="http://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf">2006</a>
- <strong>Google</strong> - The Chubby lock service for loosely-coupled
distributed systems.</li>
<li><a
href="http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/bigtable-osdi06.pdf">2006</a>
- <strong>Google</strong> - Bigtable: A Distributed Storage System for
Structured Data.</li>
<li><a
href="http://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf">2004</a>
- <strong>Google</strong> - MapReduce: Simplified Data Processing on Large
Clusters.</li>
<li><a
href="http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf">2003</a>
- <strong>Google</strong> - The Google File System.</li>
</ul>
<h2 id="videos">Videos</h2>
<ul>
<li><a href="https://www.manning.com/livevideo/spark-in-motion">Spark in
Motion</a> - Spark in Motion teaches you how to use Spark for batch and
streaming data analytics.</li>
<li><a
href="https://www.manning.com/livevideo/machine-learning-data-science-and-deep-learning-with-python">Machine
Learning, Data Science and Deep Learning with Python</a> - LiveVideo
tutorial that covers machine learning, TensorFlow, artificial
intelligence, and neural networks.</li>
<li><a href="https://snir.dev/talks/data-warehouse-schema-design">Data
warehouse schema design - dimensional modeling and star schema</a> -
Introduction to schema design for data warehouses using the star schema
method.</li>
<li><a
href="https://www.manning.com/livevideo/elasticsearch-7-and-elastic-stack">Elasticsearch
7 and Elastic Stack</a> - LiveVideo tutorial that covers searching,
analyzing, and visualizing big data on a cluster with Elasticsearch,
Logstash, Beats, Kibana, and more.</li>
</ul>
<h2 id="books">Books</h2>
<h4 id="streaming">Streaming</h4>
<ul>
<li><a
href="https://www.manning.com/books/data-science-at-scale-with-python-and-dask">Data
Science at Scale with Python and Dask</a> - Data Science at Scale with
Python and Dask teaches you how to build distributed data projects that
can handle huge amounts of data.</li>
<li><a href="https://www.manning.com/books/streaming-data">Streaming
Data</a> - Streaming Data introduces the concepts and requirements of
streaming and real-time data systems.</li>
<li><a href="https://www.manning.com/books/storm-applied">Storm
Applied</a> - Storm Applied is a practical guide to using Apache Storm
for the real-world tasks associated with processing and analyzing
real-time data streams.</li>
<li><a
href="http://www.cambridge.org/us/academic/subjects/engineering/communications-and-signal-processing/fundamentals-stream-processing-application-design-systems-and-analytics">Fundamentals
of Stream Processing: Application Design, Systems, and Analytics</a> -
This comprehensive, hands-on guide combining the fundamental building
blocks and emerging research in stream processing is ideal for
application designers, system builders, analytic developers, as well as
students and researchers in the field.</li>
<li><a href="http://www.springer.com/us/book/9780387710020">Stream Data
Processing: A Quality of Service Perspective</a> - Presents a new
paradigm suitable for stream and complex event processing.</li>
<li><a
href="https://www.manning.com/books/event-streams-in-action">Unified Log
Processing</a> - Unified Log Processing is a practical guide to
implementing a unified log of event streams (Kafka or Kinesis) in your
business.</li>
<li><a
href="https://www.manning.com/books/kafka-streams-in-action">Kafka
Streams in Action</a> - Kafka Streams in Action teaches you everything
you need to know to implement stream processing on data flowing into
your Kafka platform, allowing you to focus on getting more from your
data without sacrificing time or effort.</li>
<li><a href="https://www.manning.com/books/big-data">Big Data</a> - Big
Data teaches you to build big data systems using an architecture that
takes advantage of clustered hardware along with new tools designed
specifically to capture and analyze web-scale data.</li>
<li><a href="https://www.manning.com/books/spark-in-action">Spark in
Action</a> & <a
href="https://www.manning.com/books/spark-in-action-second-edition">Spark
in Action 2nd Ed.</a> - Spark in Action teaches you the theory and
skills you need to effectively handle batch and streaming data using
Spark. Fully updated for Spark 2.0.</li>
<li><a href="https://www.manning.com/books/kafka-in-action">Kafka in
Action</a> - Kafka in Action is a fast-paced introduction to every
aspect of working with Kafka you need to really reap its benefits.</li>
<li><a href="https://www.manning.com/books/fusion-in-action">Fusion in
Action</a> - Fusion in Action teaches you to build a full-featured data
analytics pipeline, including document and data search and distributed
data clustering.</li>
<li><a
href="https://www.manning.com/books/reactive-data-handling">Reactive
Data Handling</a> - Reactive Data Handling is a collection of five
hand-picked chapters, selected by Manuel Bernhardt, that introduce you
to building reactive applications capable of handling real-time
processing with large data loads. Free eBook!</li>
<li><a href="https://www.manning.com/books/azure-data-engineering">Azure
Data Engineering</a> - A book about data engineering in general and the
Azure platform specifically.</li>
<li><a
href="https://www.manning.com/books/grokking-streaming-systems">Grokking
Streaming Systems</a> - Grokking Streaming Systems helps you unravel
what streaming systems are, how they work, and whether they’re right for
your business. Written to be tool-agnostic, you’ll be able to apply what
you learn no matter which framework you choose.</li>
</ul>
<h4 id="distributed-systems">Distributed systems</h4>
<ul>
<li><a href="http://book.mixu.net/distsys/">Distributed Systems for fun
and profit</a> - Theory of distributed systems. Includes parts about time
and ordering, replication and impossibility results.</li>
</ul>
<h4 id="graph-based-approach">Graph Based approach</h4>
<ul>
<li><a
href="https://www.manning.com/books/graph-powered-machine-learning">Graph-Powered
Machine Learning</a> - By Alessandro Negro. Combines graph theory and models
to improve machine learning projects.</li>
</ul>
<h3 id="data-visualization-1">Data Visualization</h3>
<ul>
<li><a href="https://www.youtube.com/watch?v=5Zg-C8AAIGg">The beauty of
data visualization</a></li>
<li><a href="https://www.youtube.com/watch?v=R-oiKt7bUU8">Designing Data
Visualizations with Noah Iliinsky</a></li>
<li><a href="https://www.youtube.com/watch?v=jbkSRLYSojo">Hans Rosling’s
200 Countries, 200 Years, 4 Minutes</a></li>
<li><a href="https://www.youtube.com/watch?v=qTEchen97rQ">Ice Bucket
Challenge Data Visualization</a></li>
</ul>
<h1 id="other-awesome-lists">Other Awesome Lists</h1>
<ul>
<li>Other awesome lists <a
href="https://github.com/bayandin/awesome-awesomeness">awesome-awesomeness</a>.</li>
<li>Even more lists <a
href="https://github.com/sindresorhus/awesome">awesome</a>.</li>
<li>Another list? <a href="https://github.com/jnv/lists">list</a>.</li>
<li>WTF! <a
href="https://github.com/t3chnoboy/awesome-awesome-awesome">awesome-awesome-awesome</a>.</li>
<li>Analytics <a
href="https://github.com/onurakpolat/awesome-analytics">awesome-analytics</a>.</li>
<li>Public Datasets <a
href="https://github.com/awesomedata/awesome-public-datasets">awesome-public-datasets</a>.</li>
<li>Graph Classification <a
href="https://github.com/benedekrozemberczki/awesome-graph-classification">awesome-graph-classification</a>.</li>
<li>Network Embedding <a
href="https://github.com/chihming/awesome-network-embedding">awesome-network-embedding</a>.</li>
<li>Community Detection <a
href="https://github.com/benedekrozemberczki/awesome-community-detection">awesome-community-detection</a>.</li>
<li>Decision Tree Papers <a
href="https://github.com/benedekrozemberczki/awesome-decision-tree-papers">awesome-decision-tree-papers</a>.</li>
<li>Fraud Detection Papers <a
href="https://github.com/benedekrozemberczki/awesome-fraud-detection-papers">awesome-fraud-detection-papers</a>.</li>
<li>Gradient Boosting Papers <a
href="https://github.com/benedekrozemberczki/awesome-gradient-boosting-papers">awesome-gradient-boosting-papers</a>.</li>
<li>Monte Carlo Tree Search Papers <a
href="https://github.com/benedekrozemberczki/awesome-monte-carlo-tree-search-papers">awesome-monte-carlo-tree-search-papers</a>.</li>
<li>Kafka <a
href="https://github.com/monksy/awesome-kafka">awesome-kafka</a>.</li>
<li><a href="https://github.com/zrosenbauer/awesome-bigtable">Google
Bigtable</a>.</li>
</ul>
<p><a href="https://github.com/onurakpolat/awesome-bigdata">bigdata.md
GitHub</a></p>