Update and add index

Jonas Zeunert
2024-04-23 15:17:38 +02:00
parent 4d0cd768f7
commit 8d4db5d359
726 changed files with 41721 additions and 53949 deletions

@@ -1,10 +1,9 @@
 Awesome Big Data
 Awesome Big Data
!Awesome (https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg) (https://github.com/sindresorhus/awesome)
A curated list of awesome big data frameworks, resources and other awesomeness. Inspired by awesome-php (https://github.com/ziadoz/awesome-php), awesome-python 
(https://github.com/vinta/awesome-python), awesome-ruby (https://github.com/Sdogruyol/awesome-ruby), hadoopecosystemtable (http://hadoopecosystemtable.github.io/) & big-data 
(http://usefulstuff.io/big-data/).
A curated list of awesome big data frameworks, resources and other awesomeness. Inspired by awesome-php (https://github.com/ziadoz/awesome-php), awesome-python (https://github.com/vinta/awesome-python), awesome-ruby 
(https://github.com/Sdogruyol/awesome-ruby), hadoopecosystemtable (http://hadoopecosystemtable.github.io/) & big-data (http://usefulstuff.io/big-data/).
Your contributions are always welcome!
@@ -60,10 +59,10 @@
Frameworks
⟡ Bistro (https://github.com/facebook/bistro) - general-purpose data processing engine for both batch and stream analytics. It is based on a novel data model, which represents data via 
functions and processes data via column operations as opposed to having only set operations in conventional approaches like MapReduce or SQL.
⟡ IBM Streams (https://www.ibm.com/analytics/us/en/technology/stream-computing/) - platform for distributed processing and real-time analytics. Integrates with many of the popular 
technologies in the Big Data ecosystem (Kafka, HDFS, Spark, etc.)
⟡ Bistro (https://github.com/facebook/bistro) - general-purpose data processing engine for both batch and stream analytics. It is based on a novel data model, which represents data via functions and processes data via column operations 
as opposed to having only set operations in conventional approaches like MapReduce or SQL.
⟡ IBM Streams
 (https://www.ibm.com/analytics/us/en/technology/stream-computing/) - platform for distributed processing and real-time analytics. Integrates with many of the popular technologies in the Big Data ecosystem (Kafka, HDFS, Spark, etc.)
⟡ Apache Hadoop (http://hadoop.apache.org/) - framework for distributed processing. Integrates MapReduce (parallel processing), YARN (job scheduling) and HDFS (distributed file system).
⟡ Tigon (https://github.com/caskdata/tigon) - High Throughput Real-time Stream Processing Framework.
⟡ Pachyderm (http://pachyderm.io/) - Pachyderm is a data storage platform built on Docker and Kubernetes to provide reproducible data processing and analysis.
@@ -92,24 +91,21 @@
⟡ Apache Samza (http://samza.apache.org/) - stream processing framework, based on Kafka and YARN.
⟡ Apache Tez (http://tez.apache.org/) - application framework for executing a complex DAG (directed acyclic graph) of tasks, built on YARN.
⟡ Apache Twill (https://incubator.apache.org/projects/twill.html) - abstraction over YARN that reduces the complexity of developing distributed applications.
⟡ Baidu Bigflow (http://bigflow.cloud/en/index.html) - an interface that allows for writing distributed computing programs providing lots of simple, flexible, powerful APIs to easily handle 
data of any scale.
⟡ Baidu Bigflow (http://bigflow.cloud/en/index.html) - an interface that allows for writing distributed computing programs providing lots of simple, flexible, powerful APIs to easily handle data of any scale.
⟡ Cascalog (http://cascalog.org/) - data processing and querying library.
⟡ Cheetah (http://vldbarc.org/pvldb/vldb2010/pvldb_vol3/I08.pdf) - High Performance, Custom Data Warehouse on Top of MapReduce.
⟡ Concurrent Cascading (http://www.cascading.org/) - framework for data management/analytics on Hadoop.
⟡ Damballa Parkour (https://github.com/damballa/parkour) - MapReduce library for Clojure.
⟡ Datasalt Pangool (https://github.com/datasalt/pangool) - alternative MapReduce paradigm.
⟡ DataTorrent StrAM (https://www.datatorrent.com/) - real-time engine is designed to enable distributed, asynchronous, real time in-memory big-data computations in as unblocked a way as 
possible, with minimal overhead and impact on performance.
⟡ Facebook Corona (https://www.facebook.com/notes/facebook-engineering/under-the-hood-scheduling-mapreduce-jobs-more-efficiently-with-corona/10151142560538920) - Hadoop enhancement which 
removes single point of failure.
⟡ DataTorrent StrAM
 (https://www.datatorrent.com/) - real-time engine is designed to enable distributed, asynchronous, real time in-memory big-data computations in as unblocked a way as possible, with minimal overhead and impact on performance.
⟡ Facebook Corona (https://www.facebook.com/notes/facebook-engineering/under-the-hood-scheduling-mapreduce-jobs-more-efficiently-with-corona/10151142560538920) - Hadoop enhancement which removes single point of failure.
⟡ Facebook Peregrine (http://peregrine_mapreduce.bitbucket.org/) - Map Reduce framework.
⟡ Facebook Scuba (https://www.facebook.com/notes/facebook-engineering/under-the-hood-data-diving-with-scuba/10150599692628920) - distributed in-memory datastore.
⟡ Google Dataflow (https://googledevelopers.blogspot.it/2014/06/cloud-platform-at-google-io-new-big.html) - create data pipelines to help them ingest, transform and analyze data.
⟡ Google MapReduce (https://research.google.com/archive/mapreduce.html) - map reduce framework.
⟡ Google MillWheel (https://research.google.com/pubs/pub41378.html) - fault tolerant stream processing framework.
⟡ IBM Streams (https://www.ibm.com/analytics/us/en/technology/stream-computing/) - platform for distributed processing and real-time analytics. Provides toolkits for advanced analytics like 
geospatial, time series, etc. out of the box.
⟡ IBM Streams (https://www.ibm.com/analytics/us/en/technology/stream-computing/) - platform for distributed processing and real-time analytics. Provides toolkits for advanced analytics like geospatial, time series, etc. out of the box.
⟡ JAQL (https://code.google.com/p/jaql/) - declarative programming language for working with structured, semi-structured and unstructured data.
⟡ Kite (http://kitesdk.org/docs/current/) - is a set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem.
⟡ Metamarkets Druid (http://druid.io/) - framework for real-time analysis of large datasets.
@@ -171,20 +167,18 @@
Key Map Data Model
Note: There is some term confusion in the industry, and two different things are called "Columnar Databases". Some, listed here, are distributed, persistent databases built around the 
"key-map" data model: all data has a (possibly composite) key, with which a map of key-value pairs is associated. In some systems, multiple such value maps can be associated with a key, and 
these maps are referred to as "column families" (with value map keys being referred to as "columns").
Note: There is some term confusion in the industry, and two different things are called "Columnar Databases". Some, listed here, are distributed, persistent databases built around the "key-map" data model: all data has a (possibly 
composite) key, with which a map of key-value pairs is associated. In some systems, multiple such value maps can be associated with a key, and these maps are referred to as "column families" (with value map keys being referred to as 
"columns").
Another group of technologies that can also be called "columnar databases" is distinguished by how it stores data, on disk or in memory -- rather than storing data the traditional way, where 
all column values for a given key are stored next to each other, "row by row", these systems store all column values next to each other. So more work is needed to get all columns for a given 
key, but less work is needed to get all values for a given column.
Another group of technologies that can also be called "columnar databases" is distinguished by how it stores data, on disk or in memory -- rather than storing data the traditional way, where all column values for a given key are stored 
next to each other, "row by row", these systems store all column values next to each other. So more work is needed to get all columns for a given key, but less work is needed to get all values for a given column.
The former group is referred to as "key map data model" here. The line between these and the Key-value Data Model (#key-value-data-model) stores is fairly blurry.
The latter, being more about the storage format than about the data model, is listed under Columnar Databases (#columnar-databases).
You can read more about this distinction on Prof. Daniel Abadi's blog: Distinguishing two major types of Column Stores 
(http://dbmsmusings.blogspot.com/2010/03/distinguishing-two-major-types-of_29.html).
You can read more about this distinction on Prof. Daniel Abadi's blog: Distinguishing two major types of Column Stores (http://dbmsmusings.blogspot.com/2010/03/distinguishing-two-major-types-of_29.html).
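The distinction drawn in the note above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample records, the `key_map_store` layout, and the helper names `to_columnar` and `get_row` are hypothetical and do not correspond to the API of any database listed here.

```python
from collections import defaultdict

# Key-map data model (illustrative): a possibly composite key maps to one or
# more named value maps ("column families"); the keys inside each value map
# play the role of "columns". Different keys may carry different columns.
key_map_store = {
    ("user", 1): {"profile": {"name": "Ada", "city": "London"}, "stats": {"logins": 3}},
    ("user", 2): {"profile": {"name": "Linus"}, "stats": {"logins": 7}},
}

# Row-oriented storage (illustrative): all column values for one key sit together.
rows = [
    {"id": 1, "name": "Ada", "logins": 3},
    {"id": 2, "name": "Linus", "logins": 7},
]


def to_columnar(records):
    """Regroup row-oriented records so that each column's values sit together.

    Assumes every record has the same set of columns.
    """
    columns = defaultdict(list)
    for record in records:
        for name, value in record.items():
            columns[name].append(value)
    return dict(columns)


def get_row(columns, index):
    """Reassemble one row from columnar storage: one lookup per column."""
    return {name: values[index] for name, values in columns.items()}


columnar = to_columnar(rows)

# Reading every value of one column touches a single contiguous list ...
total_logins = sum(columnar["logins"])

# ... while rebuilding one full row has to visit every column.
second_row = get_row(columnar, 1)
```

In this sketch, aggregating `logins` reads one list in the columnar layout, whereas `get_row` must consult each column in turn, which mirrors the trade-off described in the note. The `key_map_store` dictionary shows the separate "key map" idea: it describes how data is modelled, not how it is laid out on disk.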
⟡ Apache Accumulo (http://accumulo.apache.org/) - distributed key/value store, built on Hadoop.
⟡ Apache Cassandra (http://cassandra.apache.org/) - column-oriented distributed datastore, inspired by BigTable.
@@ -196,15 +190,13 @@
⟡ Hypertable (http://www.hypertable.org/) - column-oriented distributed datastore, inspired by BigTable.
⟡ InfiniDB (https://github.com/infinidb/infinidb/) - is accessed through a MySQL interface and uses massively parallel processing to parallelize queries.
⟡ Tephra (https://github.com/caskdata/tephra) - Transactions for HBase.
⟡ Twitter Manhattan (https://blog.twitter.com/engineering/en_us/a/2014/manhattan-our-real-time-multi-tenant-distributed-database-for-twitter-scale.html) - real-time, multi-tenant distributed 
database for Twitter scale.
⟡ Twitter Manhattan (https://blog.twitter.com/engineering/en_us/a/2014/manhattan-our-real-time-multi-tenant-distributed-database-for-twitter-scale.html) - real-time, multi-tenant distributed database for Twitter scale.
⟡ ScyllaDB (http://www.scylladb.com/) - column-oriented distributed datastore written in C++, totally compatible with Apache Cassandra.
Key-value Data Model
⟡ Aerospike
 (http://www.aerospike.com/) - NoSQL flash-optimized, in-memory. Open source and "Server code in 'C' (not Java or Erlang) precisely tuned to avoid context switching and memory copies."
⟡ Aerospike (http://www.aerospike.com/) - NoSQL flash-optimized, in-memory. Open source and "Server code in 'C' (not Java or Erlang) precisely tuned to avoid context switching and memory copies."
⟡ Amazon DynamoDB (https://aws.amazon.com/dynamodb/) - distributed key/value store, implementation of Dynamo paper.
⟡ Badger (https://open.dgraph.io/post/badger/) - a fast, simple, efficient, and persistent key-value store written natively in Go.
⟡ Bolt (https://github.com/boltdb/bolt) - an embedded key-value database for Go.
@@ -216,8 +208,7 @@
⟡ GhostDB (https://github.com/jakekgrog/GhostDB) - a distributed, in-memory, general purpose key-value data store that delivers microsecond performance at any scale.
⟡ Graviton (https://github.com/deroproject/graviton) - a simple, fast, versioned, authenticated, embeddable key-value store database in pure Go(lang).
⟡ GridDB (https://github.com/griddb/griddb_nosql) - suitable for sensor data stored in a timeseries.
⟡ HyperDex
 (https://github.com/rescrv/HyperDex) - a scalable, next generation key-value and document store with a wide array of features, including consistency, fault tolerance and high performance.
⟡ HyperDex (https://github.com/rescrv/HyperDex) - a scalable, next generation key-value and document store with a wide array of features, including consistency, fault tolerance and high performance.
⟡ Ignite (https://ignite.apache.org/index.html) - is an in-memory key-value data store providing full SQL-compliant data access that can optionally be backed by disk storage.
⟡ LinkedIn Krati (https://github.com/linkedin-sna/sna-page/tree/master/krati) - is a simple persistent data store with very low latency and high throughput.
⟡ LinkedIn Voldemort (http://www.project-voldemort.com/voldemort/) - distributed key/value storage system.
@@ -228,8 +219,7 @@
⟡ SummitDB (https://github.com/tidwall/summitdb) - an in-memory, NoSQL key/value database, with disk persistence and using the Raft consensus algorithm.
⟡ Tarantool (https://github.com/tarantool/tarantool) - an efficient NoSQL database and a Lua application server.
⟡ TiKV (https://github.com/pingcap/tikv) - a distributed key-value database powered by Rust and inspired by Google Spanner and HBase.
⟡ Tile38 (https://github.com/tidwall/tile38) - a geolocation data store, spatial index, and realtime geofence, supporting a variety of object types including latitude/longitude points, 
bounding boxes, XYZ tiles, Geohashes, and GeoJSON
⟡ Tile38 (https://github.com/tidwall/tile38) - a geolocation data store, spatial index, and realtime geofence, supporting a variety of object types including latitude/longitude points, bounding boxes, XYZ tiles, Geohashes, and GeoJSON
⟡ TreodeDB (https://github.com/Treode/store) - key-value store that's replicated and sharded and provides atomic multirow writes.
@@ -239,16 +229,14 @@
⟡ Apache Giraph (http://giraph.apache.org/) - implementation of Pregel, based on Hadoop.
⟡ Apache Spark Bagel (http://spark.apache.org/docs/0.7.3/bagel-programming-guide.html) - implementation of Pregel, part of Spark.
⟡ ArangoDB (https://www.arangodb.com/) - multi model distributed database.
⟡ DGraph (https://github.com/dgraph-io/dgraph) - A scalable, distributed, low latency, high throughput graph database aimed at providing Google production level scale and throughput, with low
enough latency to be serving real time user queries, over terabytes of structured data.
⟡ DGraph (https://github.com/dgraph-io/dgraph) - A scalable, distributed, low latency, high throughput graph database aimed at providing Google production level scale and throughput, with low enough latency to be serving real time user 
queries, over terabytes of structured data.
⟡ EliasDB (https://github.com/krotik/eliasdb) - a lightweight graph based database that does not require any third-party libraries.
⟡ Facebook TAO (https://www.facebook.com/notes/facebook-engineering/tao-the-power-of-the-graph/10151525983993920) - TAO is the distributed data store that is widely used at facebook to store 
and serve the social graph.
⟡ Facebook TAO (https://www.facebook.com/notes/facebook-engineering/tao-the-power-of-the-graph/10151525983993920) - TAO is the distributed data store that is widely used at facebook to store and serve the social graph.
⟡ GCHQ Gaffer (https://github.com/gchq/Gaffer) - Gaffer by GCHQ is a framework that makes it easy to store large-scale graphs in which the nodes and edges have statistics.
⟡ Google Cayley (https://github.com/cayleygraph/cayley) - open-source graph database.
⟡ Google Pregel (http://kowshik.github.io/JPregel/pregel_paper.pdf) - graph processing framework.
⟡ GraphLab PowerGraph
 (https://turi.com/products/create/docs/) - a core C++ GraphLab API and a collection of high-performance machine learning and data mining toolkits built on top of the GraphLab API.
⟡ GraphLab PowerGraph (https://turi.com/products/create/docs/) - a core C++ GraphLab API and a collection of high-performance machine learning and data mining toolkits built on top of the GraphLab API.
⟡ GraphX (https://amplab.cs.berkeley.edu/publication/graphx-grades/) - resilient Distributed Graph System on Spark.
⟡ Gremlin (https://github.com/tinkerpop/gremlin) - graph traversal Language.
⟡ Infovore (https://github.com/paulhoule/infovore) - RDF-centric Map/Reduce framework.
@@ -257,8 +245,7 @@
 with multiple options for storage backends (Bigtable, HBase, Cassandra, etc.)
 and indexing backends (Elasticsearch, Solr, Lucene).
⟡ MapGraph (https://www.blazegraph.com/mapgraph-technology/) - Massively Parallel Graph processing on GPUs.
⟡ Microsoft Graph Engine (https://github.com/Microsoft/GraphEngine) - a distributed in-memory data processing engine, underpinned by a strongly-typed in-memory key-value store and a general 
distributed computation engine.
⟡ Microsoft Graph Engine (https://github.com/Microsoft/GraphEngine) - a distributed in-memory data processing engine, underpinned by a strongly-typed in-memory key-value store and a general distributed computation engine.
⟡ Neo4j (https://neo4j.com/) - graph database written entirely in Java.
⟡ OrientDB (http://orientdb.com/) - document and graph database.
⟡ Phoebus (https://github.com/xslogic/phoebus) - framework for large scale graph processing.
@@ -307,8 +294,7 @@
⟡ Map-D (https://www.mapd.com/) - GPU in-memory database, big data analysis and visualization platform.
⟡ MemSQL (http://www.memsql.com/) - in-memory SQL database with optimized columnar storage on flash.
⟡ NuoDB (http://www.nuodb.com/) - SQL/ACID compliant distributed database.
⟡ Oracle TimesTen in-Memory Database
 (http://www.oracle.com/technetwork/database/database-technologies/timesten/overview/index.html) - in-memory, relational database management system with persistence and recoverability.
⟡ Oracle TimesTen in-Memory Database (http://www.oracle.com/technetwork/database/database-technologies/timesten/overview/index.html) - in-memory, relational database management system with persistence and recoverability.
⟡ Pivotal GemFire XD (http://gemfirexd.docs.pivotal.io/latest/) - Low-latency, in-memory, distributed SQL data store. Provides SQL interface to in-memory table data, persistable in HDFS.
⟡ SAP HANA (https://hana.sap.com/abouthana.html) - is an in-memory, column-oriented, relational database management system.
⟡ SenseiDB (http://senseidb.github.io/sensei/) - distributed, realtime, semi-structured database.
@@ -320,8 +306,7 @@
Time-Series Databases
⟡ Axibase Time Series Database
 (http://axibase.com/products/axibase-time-series-database/) - Integrated time series database on top of HBase with built-in visualization, rule-engine and SQL support.
⟡ Axibase Time Series Database (http://axibase.com/products/axibase-time-series-database/) - Integrated time series database on top of HBase with built-in visualization, rule-engine and SQL support.
⟡ Chronix (http://chronix.io/) - a time series storage built to store time series highly compressed and for fast access times.
⟡ Cube (http://square.github.io/cube/) - uses MongoDB to store time series data.
⟡ Heroic (https://spotify.github.io/heroic/#!/index) - is a scalable time series database based on Cassandra and Elasticsearch.
@@ -338,17 +323,14 @@
⟡ TrailDB (http://traildb.io/) - an efficient tool for storing and querying series of events.
⟡ Druid (https://github.com/druid-io/druid/) - Column oriented distributed data store ideal for powering interactive applications
⟡ Riak-TS (http://basho.com/products/riak-ts/) - Riak TS is the only enterprise-grade NoSQL time series database optimized specifically for IoT and Time Series data.
⟡ Akumuli (https://github.com/akumuli/Akumuli) - Akumuli is a numeric time-series database. It can be used to capture, store and process time-series data in real-time. The word "akumuli" can be
translated from Esperanto as "accumulate".
⟡ Akumuli (https://github.com/akumuli/Akumuli) - Akumuli is a numeric time-series database. It can be used to capture, store and process time-series data in real-time. The word "akumuli" can be translated from Esperanto as "accumulate".
⟡ Rhombus (https://github.com/Pardot/Rhombus) - A time-series object store for Cassandra that handles all the complexity of building wide row indexes.
⟡ Dalmatiner DB (https://github.com/dalmatinerdb/dalmatinerdb) - Fast distributed metrics database
⟡ Blueflood (https://github.com/rackerlabs/blueflood) - A distributed system designed to ingest and process time series data
⟡ Timely (https://github.com/NationalSecurityAgency/timely) - Timely is a time series database application that provides secure access to time series data based on Accumulo and Grafana.
⟡ SiriDB (https://github.com/transceptor-technology/siridb-server) - Highly-scalable, robust and fast, open source time series database with cluster functionality.
⟡ Thanos (https://github.com/improbable-eng/thanos) - Thanos is a set of components to create a highly available metric system with unlimited storage capacity using multiple (existing) 
Prometheus deployments.
⟡ VictoriaMetrics
 (https://github.com/VictoriaMetrics/VictoriaMetrics) - fast, scalable and resource-effective open-source TSDB compatible with Prometheus. Single-node and cluster versions included
⟡ Thanos (https://github.com/improbable-eng/thanos) - Thanos is a set of components to create a highly available metric system with unlimited storage capacity using multiple (existing) Prometheus deployments.
⟡ VictoriaMetrics (https://github.com/VictoriaMetrics/VictoriaMetrics) - fast, scalable and resource-effective open-source TSDB compatible with Prometheus. Single-node and cluster versions included
SQL-like processing
@@ -366,8 +348,7 @@
⟡ Facebook PrestoDB (https://prestodb.io/) - distributed SQL query engine.
⟡ Google BigQuery (https://research.google.com/pubs/pub36632.html) - framework for interactive analysis, implementation of Dremel.
⟡ Materialize (https://github.com/materializeinc/materialize) - is a streaming database for real-time applications using SQL for queries and supporting a large fraction of PostgreSQL.
⟡ Invantive SQL (https://documentation.invantive.com/2017R2/invantive-sql-grammar/invantive-sql-grammar-17.30.html) - SQL engine for online and on-premise use with integrated local data 
replication and 70+ connectors.
⟡ Invantive SQL (https://documentation.invantive.com/2017R2/invantive-sql-grammar/invantive-sql-grammar-17.30.html) - SQL engine for online and on-premise use with integrated local data replication and 70+ connectors.
⟡ PipelineDB (https://www.pipelinedb.com/) - an open-source relational database that runs SQL queries continuously on streams, incrementally storing results in tables.
⟡ Pivotal HDB (https://pivotal.io/pivotal-hdb) - SQL-like data warehouse system for Hadoop.
⟡ RainstorDB (http://rainstor.com/products/rainstor-database/) - database for storing petabyte-scale volumes of structured and semi-structured data.
@@ -393,8 +374,7 @@
⟡ Facebook Scribe (https://github.com/facebookarchive/scribe) - streamed log data aggregator.
⟡ Fluentd (http://www.fluentd.org) - tool to collect events and logs.
⟡ Gazette (https://github.com/gazette/core) - Distributed streaming infrastructure built on cloud storage which makes it easy to mix and match batch and streaming paradigms.
⟡ Google Photon (https://research.google.com/pubs/pub41318.html) - geographically distributed system for joining multiple continuously flowing streams of data in real-time with high 
scalability and low latency.
⟡ Google Photon (https://research.google.com/pubs/pub41318.html) - geographically distributed system for joining multiple continuously flowing streams of data in real-time with high scalability and low latency.
⟡ Heka (https://github.com/mozilla-services/heka) - open source stream processing software system.
⟡ HIHO (https://github.com/sonalgoyal/hiho) - framework for connecting disparate data sources with Hadoop.
⟡ Kestrel (https://github.com/papertrail/kestrel) - distributed message queue system.
@@ -409,8 +389,7 @@
⟡ StreamSets Data Collector (https://github.com/streamsets/datacollector) - continuous big data ingest infrastructure with a simple to use IDE.
⟡ Alooma (https://www.alooma.com/integrations/mysql) - data pipeline as a service enabling moving data sources such as MySQL into data warehouses.
⟡ RudderStack (https://github.com/rudderlabs/rudder-server) - an open source customer data infrastructure (segment, mParticle alternative) written in go.
⟡ Zilla (https://github.com/aklivity/zilla) - An API gateway built for event-driven architectures and streaming that supports standard protocols such as HTTP, SSE, gRPC, MQTT and the native 
Kafka protocol.
⟡ Zilla (https://github.com/aklivity/zilla) - An API gateway built for event-driven architectures and streaming that supports standard protocols such as HTTP, SSE, gRPC, MQTT and the native Kafka protocol.
Service Programming
@@ -426,8 +405,8 @@
⟡ Mara (https://github.com/mara/data-integration) - A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow
⟡ OpenMPI (https://www.open-mpi.org/) - message passing framework.
⟡ Serf (https://www.serf.io/) - decentralized solution for service discovery and orchestration.
⟡ Spotify Luigi (https://github.com/spotify/luigi) - a Python package for building complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, 
handling failures, command line integration, and much more.
⟡ Spotify Luigi
 (https://github.com/spotify/luigi) - a Python package for building complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more.
⟡ Spring XD (https://github.com/spring-projects/spring-xd) - distributed and extensible system for data ingestion, real time analytics, batch processing, and data export.
⟡ Twitter Elephant Bird (https://github.com/twitter/elephant-bird) - libraries for working with LZOP-compressed data.
⟡ Twitter Finagle (https://twitter.github.io/finagle/) - asynchronous network stack for the JVM.
@@ -455,20 +434,18 @@
⟡ Concurrent Pattern (http://www.cascading.org/projects/pattern/) - machine learning library for Cascading.
⟡ convnetjs (https://github.com/karpathy/convnetjs) - Deep Learning in Javascript. Train Convolutional Neural Networks (or ordinary ones) in your browser.
⟡ DataVec (https://github.com/deeplearning4j/DataVec) - A vectorization and data preprocessing library for deep learning in Java and Scala. Part of the Deeplearning4j ecosystem. 
⟡ Deeplearning4j (https://github.com/deeplearning4j) - Fast, open deep learning for the JVM (Java, Scala, Clojure). A neural network configuration layer powered by a C++ library. Uses Spark 
and Hadoop to train nets on multiple GPUs and CPUs.
⟡ Deeplearning4j
 (https://github.com/deeplearning4j) - Fast, open deep learning for the JVM (Java, Scala, Clojure). A neural network configuration layer powered by a C++ library. Uses Spark and Hadoop to train nets on multiple GPUs and CPUs.
⟡ Decider (https://github.com/danielsdeleo/Decider) - Flexible and Extensible Machine Learning in Ruby.
⟡ ENCOG (http://www.heatonresearch.com/encog/) - machine learning framework that supports a variety of advanced algorithms, as well as support classes to normalize and process data.
⟡ etcML (http://www.etcml.com/) - text classification with machine learning.
⟡ Etsy Conjecture (https://github.com/etsy/Conjecture) - scalable Machine Learning in Scalding.
⟡ Feast (https://github.com/gojek/feast) - A feature store for the management, discovery, and access of machine learning features. Feast provides a consistent view of feature data for both 
model training and model serving.
⟡ Feast (https://github.com/gojek/feast) - A feature store for the management, discovery, and access of machine learning features. Feast provides a consistent view of feature data for both model training and model serving.
⟡ GraphLab Create (https://dato.com/products/create/) - A machine learning platform in Python with a broad collection of ML toolkits, data engineering, and deployment tools.
⟡ H2O (https://github.com/h2oai/h2o-3/) - statistical, machine learning and math runtime with Hadoop. R and Python.
⟡ Karate Club (https://github.com/benedekrozemberczki/karateclub) - An unsupervised machine learning library for graph structured data. Python
⟡ Keras (https://github.com/fchollet/keras) - An intuitive neural net API inspired by Torch that runs atop Theano and Tensorflow.
⟡ Lambdo
 (https://github.com/johnsonc/lambdo) - Lambdo is a workflow engine which significantly simplifies the analysis process by unifying feature engineering and machine learning operations.
⟡ Lambdo (https://github.com/johnsonc/lambdo) - Lambdo is a workflow engine which significantly simplifies the analysis process by unifying feature engineering and machine learning operations.
⟡ Little Ball of Fur (https://github.com/benedekrozemberczki/littleballoffur) - A subsampling library for graph structured data. Python
⟡ Mahout (http://mahout.apache.org/) - An Apache-backed machine learning library for Hadoop.
⟡ MLbase (http://www.mlbase.org/) - distributed machine learning libraries for the BDAS stack.
@@ -477,12 +454,10 @@
⟡ MOA (http://moa.cms.waikato.ac.nz) - MOA performs big data stream mining in real time, and large scale machine learning.
⟡ MonkeyLearn (https://monkeylearn.com/) - Text mining made easy. Extract and classify data from text.
⟡ ND4J (https://github.com/deeplearning4j/nd4j) - A matrix library for the JVM. Numpy for Java. 
⟡ nupic (https://github.com/numenta/nupic) - Numenta Platform for Intelligent Computing: a brain-inspired machine intelligence platform, and biologically accurate neural network based on 
cortical learning algorithms.
⟡ nupic (https://github.com/numenta/nupic) - Numenta Platform for Intelligent Computing: a brain-inspired machine intelligence platform, and biologically accurate neural network based on cortical learning algorithms.
⟡ PredictionIO (http://predictionio.incubator.apache.org/index.html) - machine learning server built on Hadoop, Mahout and Cascading.
⟡ PyTorch Geometric Temporal (https://github.com/benedekrozemberczki/pytorch_geometric_temporal) - a temporal extension library for PyTorch Geometric.
⟡ RL4J (https://github.com/deeplearning4j/rl4j) - Reinforcement learning for Java and Scala. Includes Deep-Q learning and A3C algorithms, and integrates with Open AI's Gym. Runs in the 
Deeplearning4j ecosystem. 
⟡ RL4J (https://github.com/deeplearning4j/rl4j) - Reinforcement learning for Java and Scala. Includes Deep-Q learning and A3C algorithms, and integrates with Open AI's Gym. Runs in the Deeplearning4j ecosystem. 
⟡ SAMOA (http://samoa.incubator.apache.org/) - distributed streaming machine learning framework.
⟡ scikit-learn (https://github.com/scikit-learn/scikit-learn) - scikit-learn: machine learning in Python.
⟡ Shapley (https://github.com/benedekrozemberczki/shapley) - A data-driven framework to quantify the value of classifiers in a machine learning ensemble. 
@@ -537,14 +512,12 @@
⟡ 411 (https://github.com/etsy/411) - a web application for alert management resulting from scheduled searches into Elasticsearch.
⟡ Adobe spindle (https://github.com/adobe-research/spindle) - Next-generation web analytics processing with Scala, Spark, and Parquet.
⟡ Apache Metron
 (http://metron.apache.org/) - a platform that integrates a variety of open source big data technologies in order to offer a centralized tool for security monitoring and analysis.
⟡ Apache Metron (http://metron.apache.org/) - a platform that integrates a variety of open source big data technologies in order to offer a centralized tool for security monitoring and analysis.
⟡ Apache Nutch (http://nutch.apache.org/) - open source web crawler.
⟡ Apache OODT (http://oodt.apache.org/) - capturing, processing and sharing of data for NASA's scientific archives.
⟡ Apache Tika (https://tika.apache.org/) - content analysis toolkit.
⟡ Argus (https://github.com/salesforce/Argus) - Time series monitoring and alerting platform.
⟡ AthenaX
 (https://github.com/uber/AthenaX) - a streaming analytics platform that enables users to run production-quality, large scale streaming analytics using Structured Query Language (SQL).
⟡ AthenaX (https://github.com/uber/AthenaX) - a streaming analytics platform that enables users to run production-quality, large scale streaming analytics using Structured Query Language (SQL).
⟡ Atlas (https://github.com/Netflix/atlas) - a backend for managing dimensional time series data.
⟡ Countly (https://count.ly/) - open source mobile and web analytics platform, based on Node.js & MongoDB.
⟡ Domino (https://www.dominodatalab.com/) - Run, scale, share, and deploy models — without any infrastructure.
@@ -563,8 +536,8 @@
⟡ PivotalR (https://github.com/pivotalsoftware/PivotalR) - R on Pivotal HD / HAWQ and PostgreSQL.
⟡ Rakam (https://github.com/rakam-io/rakam) - open-source real-time custom analytics platform powered by PostgreSQL, Kinesis and PrestoDB.
⟡ Qubole (https://www.qubole.com/) - auto-scaling Hadoop cluster, built-in data connectors.
⟡ SnappyData (https://github.com/SnappyDataInc/snappydata) - a distributed in-memory data store for real-time operational analytics, delivering stream analytics, OLTP (online transaction 
processing) and OLAP (online analytical processing) built on Spark in a single integrated cluster.
⟡ SnappyData (https://github.com/SnappyDataInc/snappydata) - a distributed in-memory data store for real-time operational analytics, delivering stream analytics, OLTP (online transaction processing) and OLAP (online analytical 
processing) built on Spark in a single integrated cluster.
⟡ Snowplow (https://github.com/snowplow/snowplow) - enterprise-strength web and event analytics, powered by Hadoop, Kinesis, Redshift and Postgres.
⟡ SparkR (http://amplab-extras.github.io/SparkR-pkg/) - R frontend for Spark.
⟡ Splunk (https://www.splunk.com/) - analyzer for machine-generated data.
@@ -587,16 +560,14 @@
⟡ LinkedIn Cleo (https://github.com/linkedin/cleo) - is a flexible software library for enabling rapid development of partial, out-of-order and real-time typeahead search.
⟡ LinkedIn Galene (https://engineering.linkedin.com/search/did-you-mean-galene) - search architecture at LinkedIn.
⟡ LinkedIn Zoie (https://github.com/senseidb/zoie) - is a realtime search/indexing system written in Java.
⟡ MG4J (http://mg4j.di.unimi.it/) - MG4J (Managing Gigabytes for Java) is a full-text search engine for large document collections written in Java. It is highly customisable, high-performance
and provides state-of-the-art features and new research algorithms.
⟡ MG4J (http://mg4j.di.unimi.it/) - MG4J (Managing Gigabytes for Java) is a full-text search engine for large document collections written in Java. It is highly customisable, high-performance and provides state-of-the-art features and 
new research algorithms.
⟡ Sphinx Search Server (http://sphinxsearch.com/) - fulltext search engine.
⟡ Vespa (http://vespa.ai/) - is an engine for low-latency computation over large data sets. It stores and indexes your data such that queries, selection and processing over the data can be 
performed at serving time.
⟡ Facebook Faiss (https://github.com/facebookresearch/faiss) - is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of 
vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning. Faiss is written in C++ with complete wrappers for 
Python/numpy.
⟡ Annoy (https://github.com/spotify/annoy) - is a C++ library with Python bindings to search for points in space that are close to a given query point. It also creates large read-only 
file-based data structures that are mmapped into memory so that many processes may share the same data.
⟡ Vespa (http://vespa.ai/) - is an engine for low-latency computation over large data sets. It stores and indexes your data such that queries, selection and processing over the data can be performed at serving time.
⟡ Facebook Faiss (https://github.com/facebookresearch/faiss) - is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do 
not fit in RAM. It also contains supporting code for evaluation and parameter tuning. Faiss is written in C++ with complete wrappers for Python/numpy.
⟡ Annoy (https://github.com/spotify/annoy) - is a C++ library with Python bindings to search for points in space that are close to a given query point. It also creates large read-only file-based data structures that are mmapped into 
memory so that many processes may share the same data.
⟡ Weaviate (https://github.com/semi-technologies/weaviate) - Weaviate is a GraphQL-based semantic search engine with built-in (word) embeddings.
MySQL forks and evolutions
@@ -620,8 +591,7 @@
⟡ Stado (http://www.stormdb.com/community/stado) - open source MPP database system solely targeted at data warehousing and data mart applications.
⟡ Yahoo Everest (https://www.scribd.com/doc/3159239/70-Everest-PGCon-RT) - multi-petabyte database / MPP derived from PostgreSQL.
⟡ TimescaleDB (http://www.timescale.com/) - An open-source time-series database optimized for fast ingest and complex queries
⟡ PipelineDB
 (https://www.pipelinedb.com/) - The Streaming SQL Database. An open-source relational database that runs SQL queries continuously on streams, incrementally storing results in tables
⟡ PipelineDB (https://www.pipelinedb.com/) - The Streaming SQL Database. An open-source relational database that runs SQL queries continuously on streams, incrementally storing results in tables
Memcached forks and evolutions
@@ -674,8 +644,8 @@
⟡ Arbor (https://github.com/samizdatco/arbor) - graph visualization library using web workers and jQuery.
⟡ Banana (https://github.com/LucidWorks/banana) - visualize logs and time-stamped data stored in Solr. Port of Kibana.
⟡ Bloomery (https://github.com/ufukomer/bloomery) - Web UI for Impala.
⟡ Bokeh (http://bokeh.pydata.org/en/latest/) - A powerful Python interactive visualization library that targets modern web browsers for presentation, with the goal of providing elegant, 
concise construction of novel graphics in the style of D3.js, but also delivering this capability with high-performance interactivity over very large or streaming datasets.
⟡ Bokeh (http://bokeh.pydata.org/en/latest/) - A powerful Python interactive visualization library that targets modern web browsers for presentation, with the goal of providing elegant, concise construction of novel graphics in the 
style of D3.js, but also delivering this capability with high-performance interactivity over very large or streaming datasets.
⟡ C3 (http://c3js.org/) - D3-based reusable chart library
⟡ CartoDB (https://github.com/CartoDB/cartodb) - open-source or freemium hosting for geospatial databases with powerful front-end editing capabilities and a robust API.
⟡ chartd (http://chartd.co/) - responsive, retina-compatible charts with just an img tag.
@@ -684,8 +654,7 @@
⟡ Crossfilter (http://square.github.io/crossfilter/) - JavaScript library for exploring large multivariate datasets in the browser. Works well with dc.js and d3.js.
⟡ Cubism (https://github.com/square/cubism) - JavaScript library for time series visualization.
⟡ Cytoscape (http://cytoscape.github.io/) - JavaScript library for visualizing complex networks.
⟡ DC.js (http://dc-js.github.io/dc.js/) - Dimensional charting built to work natively with crossfilter rendered using d3.js. Excellent for connecting charts/additional metadata to hover 
events in D3.
⟡ DC.js (http://dc-js.github.io/dc.js/) - Dimensional charting built to work natively with crossfilter rendered using d3.js. Excellent for connecting charts/additional metadata to hover events in D3.
⟡ D3 (https://d3js.org/) - JavaScript library for manipulating documents.
⟡ D3.compose (https://github.com/CSNW/d3.compose) - Compose complex, data-driven visualizations from reusable charts and components.
⟡ D3Plus (http://d3plus.org) - A fairly robust set of reusable charts and styles for d3.js.
@@ -697,8 +666,7 @@
⟡ FnordMetric (https://metrictools.org/) - write SQL queries that return SVG charts rather than tables
⟡ Frappe Charts (https://frappe.io/charts) - GitHub-inspired simple and modern SVG charts for the web with zero dependencies.
⟡ Freeboard (https://github.com/Freeboard/freeboard) - open source real-time dashboard builder for IoT and other web mashups.
⟡ Gephi (https://github.com/gephi/gephi) - An award-winning open-source platform for visualizing and manipulating large graphs and network connections. It's like Photoshop, but for graphs. 
Available for Windows and Mac OS X.
⟡ Gephi (https://github.com/gephi/gephi) - An award-winning open-source platform for visualizing and manipulating large graphs and network connections. It's like Photoshop, but for graphs. Available for Windows and Mac OS X.
⟡ Google Charts (https://developers.google.com/chart/) - simple charting API.
⟡ Grafana (https://grafana.com/) - Graphite dashboard frontend, editor and graph composer.
⟡ Graphite (http://graphiteapp.org/) - scalable Realtime Graphing.
@@ -710,24 +678,21 @@
⟡ MetricsGraphics.js (https://metricsgraphicsjs.org/) - a library built on top of D3 that is optimized for time-series data
⟡ NVD3 (http://nvd3.org/) - chart components for d3.js.
⟡ Peity (https://github.com/benpickles/peity) - Progressive SVG bar, line and pie charts.
⟡ Plot.ly (https://plot.ly/) - Easy-to-use web service that allows for rapid creation of complex charts, from heatmaps to histograms. Upload data to create and style charts with Plotly's 
online spreadsheet. Fork others' plots.
⟡ Plot.ly (https://plot.ly/) - Easy-to-use web service that allows for rapid creation of complex charts, from heatmaps to histograms. Upload data to create and style charts with Plotly's online spreadsheet. Fork others' plots.
⟡ Plotly.js (https://github.com/plotly/plotly.js) - The open source JavaScript graphing library that powers Plotly.
⟡ Recline (https://github.com/okfn/recline) - simple but powerful library for building data applications in pure Javascript and HTML.
⟡ Redash (https://github.com/getredash/redash) - open-source platform to query and visualize data.
⟡ ReCharts (http://recharts.org/) - A composable charting library built on React components
⟡ Shiny (http://shiny.rstudio.com/) - a web application framework for R.
⟡ Sigma.js (https://github.com/jacomyal/sigma.js) - JavaScript library dedicated to graph drawing.
⟡ Superset (https://github.com/apache/incubator-superset) - a data exploration platform designed to be visual, intuitive and interactive, making it easy to slice, dice and visualize data and 
perform analytics at the speed of thought.
⟡ Superset (https://github.com/apache/incubator-superset) - a data exploration platform designed to be visual, intuitive and interactive, making it easy to slice, dice and visualize data and perform analytics at the speed of thought.
⟡ Vega (https://github.com/vega/vega) - a visualization grammar.
⟡ Zeppelin (https://github.com/ZEPL/zeppelin) - a notebook-style tool for collaborative data analysis.
⟡ Zing Charts (https://www.zingchart.com/) - JavaScript charting library for big data.
⟡ DataSphere Studio (https://github.com/WeBankFinTech/DataSphereStudio) - one-stop data application development management portal.
Internet of things and sensor data
⟡ Apache Edgent (Incubating) (http://edgent.apache.org/) - a programming model and micro-kernel style runtime that can be embedded in gateways and small footprint edge devices enabling local,
real-time, analytics on the edge devices.
⟡ Apache Edgent (Incubating) (http://edgent.apache.org/) - a programming model and micro-kernel style runtime that can be embedded in gateways and small footprint edge devices enabling local, real-time, analytics on the edge devices.
⟡ Azure IoT Hub (https://azure.microsoft.com/en-us/services/iot-hub/) - Cloud-based bi-directional monitoring and messaging hub
⟡ TempoIQ (https://www.tempoiq.com/) - Cloud-based sensor analytics.
⟡ 2lemetry (http://2lemetry.com/) - Platform for Internet of things.
@@ -741,14 +706,11 @@
Interesting Readings
⟡ Big Data Benchmark (https://amplab.cs.berkeley.edu/benchmark/) - Benchmark of Redshift, Hive, Shark, Impala and Stinger/Tez.
⟡ NoSQL Comparison (https://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis) - Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase vs Couchbase vs Neo4j vs Hypertable vs 
ElasticSearch vs Accumulo vs VoltDB vs Scalaris comparison.
⟡ Monitoring Kafka performance
 (https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics?ref=awesome) - Guide to monitoring Apache Kafka, including native methods for metrics collection.
⟡ Monitoring Hadoop performance
 (https://www.datadoghq.com/blog/monitor-hadoop-metrics?ref=awesome) - Guide to monitoring Hadoop, with an overview of Hadoop architecture, and native methods for metrics collection.
⟡ Monitoring Cassandra performance
 (https://www.datadoghq.com/blog/how-to-monitor-cassandra-performance-metrics/?ref=awesome) - Guide to monitoring Cassandra, including native methods for metrics collection.
⟡ NoSQL Comparison
 (https://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis) - Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase vs Couchbase vs Neo4j vs Hypertable vs ElasticSearch vs Accumulo vs VoltDB vs Scalaris comparison.
⟡ Monitoring Kafka performance (https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics?ref=awesome) - Guide to monitoring Apache Kafka, including native methods for metrics collection.
⟡ Monitoring Hadoop performance (https://www.datadoghq.com/blog/monitor-hadoop-metrics?ref=awesome) - Guide to monitoring Hadoop, with an overview of Hadoop architecture, and native methods for metrics collection.
⟡ Monitoring Cassandra performance (https://www.datadoghq.com/blog/how-to-monitor-cassandra-performance-metrics/?ref=awesome) - Guide to monitoring Cassandra, including native methods for metrics collection.
Interesting Papers
@@ -761,8 +723,7 @@
⟡ 2013 (https://amplab.cs.berkeley.edu/wp-content/uploads/2013/01/dmx1.pdf) - AMPLab - MLbase: A Distributed Machine-learning System.
⟡ 2013 (https://amplab.cs.berkeley.edu/wp-content/uploads/2013/02/shark_sigmod2013.pdf) - AMPLab - Shark: SQL and Rich Analytics at Scale.
⟡ 2013 (https://amplab.cs.berkeley.edu/wp-content/uploads/2013/05/grades-graphx_with_fonts.pdf) - AMPLab - GraphX: A Resilient Distributed Graph System on Spark.
⟡ 2013 (http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/40671.pdf) - Google - HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality
Estimation Algorithm.
⟡ 2013 (http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/40671.pdf) - Google - HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm.
⟡ 2013 (http://research.microsoft.com/pubs/200169/now-vldb.pdf) - Microsoft - Scalable Progressive Analytics on Big Data in the Cloud.
⟡ 2013 (http://static.druid.io/docs/druid.pdf) - Metamarkets - Druid: A Real-time Analytical Data Store.
⟡ 2013 (http://db.disi.unitn.eu/pages/VLDBProgram/pdf/industry/p764-rae.pdf) - Google - Online, Asynchronous Schema Change in F1.
@@ -785,8 +746,7 @@
⟡ 2012 (http://vldb.org/pvldb/vol5/p1436_alexanderhall_vldb2012.pdf) - Google - Processing a trillion cells per mouse click.
⟡ 2012 (http://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf) - Google - Spanner: Google's Globally-Distributed Database.
⟡ 2011 (https://amplab.cs.berkeley.edu/wp-content/uploads/2011/06/euro118-ananthanarayanan.pdf) - AMPLab - Scarlett: Coping with Skewed Popularity Content in MapReduce Clusters.
⟡ 2011 (https://amplab.cs.berkeley.edu/wp-content/uploads/2011/06/Mesos-A-Platform-for-Fine-Grained-Resource-Sharing-in-the-Data-Center.pdf) - AMPLab - Mesos: A Platform for Fine-Grained 
Resource Sharing in the Data Center.
⟡ 2011 (https://amplab.cs.berkeley.edu/wp-content/uploads/2011/06/Mesos-A-Platform-for-Fine-Grained-Resource-Sharing-in-the-Data-Center.pdf) - AMPLab - Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center.
⟡ 2011 (http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36971.pdf) - Google - Megastore: Providing Scalable, Highly Available Storage for Interactive Services.
2001 - 2010
@@ -794,59 +754,51 @@
⟡ 2010 (https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf) - Facebook - Finding a needle in Haystack: Facebook's photo storage.
⟡ 2010 (https://amplab.cs.berkeley.edu/wp-content/uploads/2011/06/Spark-Cluster-Computing-with-Working-Sets.pdf) - AMPLab - Spark: Cluster Computing with Working Sets.
⟡ 2010 (http://kowshik.github.io/JPregel/pregel_paper.pdf) - Google - Pregel: A System for Large-Scale Graph Processing.
⟡ 2010 (http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36726.pdf) - Google - Large-scale Incremental Processing Using Distributed Transactions and Notifications
(the basis of Percolator and Caffeine).
⟡ 2010 (http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36726.pdf) - Google - Large-scale Incremental Processing Using Distributed Transactions and Notifications (the basis of Percolator and Caffeine).
⟡ 2010 (http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf) - Google - Dremel: Interactive Analysis of Web-Scale Datasets.
⟡ 2010 (http://leoneu.github.io/) - Yahoo - S4: Distributed Stream Computing Platform.
⟡ 2009 (http://www.cs.umd.edu/~abadi/papers/hadoopdb.pdf) - HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. 
⟡ 2008 (https://cwiki.apache.org/confluence/download/attachments/120729877/chukwa_cca08.pdf?version=1&modificationDate=1562667399000&api=v2) - AMPLab - Chukwa: A large-scale monitoring 
system.
⟡ 2008 (https://cwiki.apache.org/confluence/download/attachments/120729877/chukwa_cca08.pdf?version=1&modificationDate=1562667399000&api=v2) - AMPLab - Chukwa: A large-scale monitoring system.
⟡ 2007 (http://www.read.seas.harvard.edu/~kohler/class/cs239-w08/decandia07dynamo.pdf) - Amazon - Dynamo: Amazon's Highly Available Key-value Store.
⟡ 2006 (http://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf) - Google - The Chubby lock service for loosely-coupled distributed systems.
⟡ 2006 (http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/bigtable-osdi06.pdf) - Google - Bigtable: A Distributed Storage System for 
Structured Data.
⟡ 2006 (http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/bigtable-osdi06.pdf) - Google - Bigtable: A Distributed Storage System for Structured Data.
⟡ 2004 (http://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf) - Google - MapReduce: Simplified Data Processing on Large Clusters.
⟡ 2003 (http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf) - Google - The Google File System.
Videos
⟡ Spark in Motion (https://www.manning.com/livevideo/spark-in-motion) - Spark in Motion teaches you how to use Spark for batch and streaming data analytics.
⟡ Machine Learning, Data Science and Deep Learning with Python (https://www.manning.com/livevideo/machine-learning-data-science-and-deep-learning-with-python) - LiveVideo tutorial that 
covers machine learning, TensorFlow, artificial intelligence, and neural networks.
⟡ Data warehouse schema design - dimensional modeling and star schema
 (https://snir.dev/talks/data-warehouse-schema-design) - Introduction to schema design for data warehouse using the star schema method.
⟡ Elasticsearch 7 and Elastic Stack (https://www.manning.com/livevideo/elasticsearch-7-and-elastic-stack) - LiveVideo tutorial that covers searching, analyzing, and visualizing big data on a 
cluster with Elasticsearch, Logstash, Beats, Kibana, and more.
⟡ Machine Learning, Data Science and Deep Learning with Python 
 (https://www.manning.com/livevideo/machine-learning-data-science-and-deep-learning-with-python) - LiveVideo tutorial that covers machine learning, TensorFlow, artificial intelligence, and neural networks.
⟡ Data warehouse schema design - dimensional modeling and star schema (https://snir.dev/talks/data-warehouse-schema-design) - Introduction to schema design for data warehouse using the star schema method.
⟡ Elasticsearch 7 and Elastic Stack
 (https://www.manning.com/livevideo/elasticsearch-7-and-elastic-stack) - LiveVideo tutorial that covers searching, analyzing, and visualizing big data on a cluster with Elasticsearch, Logstash, Beats, Kibana, and more.
Books
Streaming
⟡ Data Science at Scale with Python and Dask (https://www.manning.com/books/data-science-at-scale-with-python-and-dask) - Data Science at Scale with Python and Dask teaches you how to build 
distributed data projects that can handle huge amounts of data.
⟡ Data Science at Scale with Python and Dask
 (https://www.manning.com/books/data-science-at-scale-with-python-and-dask) - Data Science at Scale with Python and Dask teaches you how to build distributed data projects that can handle huge amounts of data.
⟡ Streaming Data (https://www.manning.com/books/streaming-data) - Streaming Data introduces the concepts and requirements of streaming and real-time data systems.
⟡ Storm Applied (https://www.manning.com/books/storm-applied) - Storm Applied is a practical guide to using Apache Storm for the real-world tasks associated with processing and analyzing 
real-time data streams.
⟡ Storm Applied (https://www.manning.com/books/storm-applied) - Storm Applied is a practical guide to using Apache Storm for the real-world tasks associated with processing and analyzing real-time data streams.
⟡ Fundamentals of Stream Processing: Application Design, Systems, and Analytics 
(http://www.cambridge.org/us/academic/subjects/engineering/communications-and-signal-processing/fundamentals-stream-processing-application-design-systems-and-analytics) - This comprehensive, 
hands-on guide combining the fundamental building blocks and emerging research in stream processing is ideal for application designers, system builders, analytic developers, as well as 
students and researchers in the field.
(http://www.cambridge.org/us/academic/subjects/engineering/communications-and-signal-processing/fundamentals-stream-processing-application-design-systems-and-analytics) - This comprehensive, hands-on guide combining the fundamental 
building blocks and emerging research in stream processing is ideal for application designers, system builders, analytic developers, as well as students and researchers in the field.
⟡ Stream Data Processing: A Quality of Service Perspective (http://www.springer.com/us/book/9780387710020) - Presents a new paradigm suitable for stream and complex event processing.
⟡ Unified Log Processing
 (https://www.manning.com/books/event-streams-in-action) - Unified Log Processing is a practical guide to implementing a unified log of event streams (Kafka or Kinesis) in your business
⟡ Kafka Streams in Action (https://www.manning.com/books/kafka-streams-in-action) - Kafka Streams in Action teaches you everything you need to know to implement stream processing on data 
flowing into your Kafka platform, allowing you to focus on getting more from your data without sacrificing time or effort.
⟡ Big Data (https://www.manning.com/books/big-data) - Big Data teaches you to build big data systems using an architecture that takes advantage of clustered hardware along with new tools 
designed specifically to capture and analyze web-scale data.
⟡ Spark in Action (https://www.manning.com/books/spark-in-action) & Spark in Action 2nd Ed. (https://www.manning.com/books/spark-in-action-second-edition) - Spark in Action teaches you the 
theory and skills you need to effectively handle batch and streaming data using Spark. Fully updated for Spark 2.0.
⟡ Unified Log Processing (https://www.manning.com/books/event-streams-in-action) - Unified Log Processing is a practical guide to implementing a unified log of event streams (Kafka or Kinesis) in your business
⟡ Kafka Streams in Action (https://www.manning.com/books/kafka-streams-in-action) - Kafka Streams in Action teaches you everything you need to know to implement stream processing on data flowing into your Kafka platform, allowing you to
focus on getting more from your data without sacrificing time or effort.
⟡ Big Data (https://www.manning.com/books/big-data) - Big Data teaches you to build big data systems using an architecture that takes advantage of clustered hardware along with new tools designed specifically to capture and analyze 
web-scale data.
⟡ Spark in Action (https://www.manning.com/books/spark-in-action) & Spark in Action 2nd Ed. (https://www.manning.com/books/spark-in-action-second-edition) - Spark in Action teaches you the theory and skills you need to effectively 
handle batch and streaming data using Spark. Fully updated for Spark 2.0.
⟡ Kafka in Action (https://www.manning.com/books/kafka-in-action) - Kafka in Action is a fast-paced introduction to every aspect of working with Kafka you need to really reap its benefits.
⟡ Fusion in Action (https://www.manning.com/books/fusion-in-action) - Fusion in Action teaches you to build a full-featured data analytics pipeline, including document and data search and 
distributed data clustering.
⟡ Reactive Data Handling (https://www.manning.com/books/reactive-data-handling) - Reactive Data Handling is a collection of five hand-picked chapters, selected by Manuel Bernhardt, that 
introduce you to building reactive applications capable of handling real-time processing with large data loads--free eBook! 
⟡ Fusion in Action (https://www.manning.com/books/fusion-in-action) - Fusion in Action teaches you to build a full-featured data analytics pipeline, including document and data search and distributed data clustering.
⟡ Reactive Data Handling (https://www.manning.com/books/reactive-data-handling) - Reactive Data Handling is a collection of five hand-picked chapters, selected by Manuel Bernhardt, that introduce you to building reactive applications 
capable of handling real-time processing with large data loads--free eBook! 
⟡ Azure Data Engineering (https://www.manning.com/books/azure-data-engineering) - A book about data engineering in general and the Azure platform specifically 
⟡ Grokking Streaming Systems (https://www.manning.com/books/grokking-streaming-systems) - Grokking Streaming Systems helps you unravel what streaming systems are, how they work, and whether 
they're right for your business. Written to be tool-agnostic, you'll be able to apply what you learn no matter which framework you choose.
⟡ Grokking Streaming Systems (https://www.manning.com/books/grokking-streaming-systems) - Grokking Streaming Systems helps you unravel what streaming systems are, how they work, and whether they're right for your business. Written to be
tool-agnostic, you'll be able to apply what you learn no matter which framework you choose.
Distributed systems
⟡ Distributed Systems for fun and profit (http://book.mixu.net/distsys/) - Theory of distributed systems. Includes parts about time and ordering, replication and impossibility results.
@@ -861,7 +813,7 @@
 ⟡ Ice Bucket Challenge Data Visualization (https://www.youtube.com/watch?v=qTEchen97rQ)
 Other Awesome Lists
 Other Awesome Lists
- Other awesome lists awesome-awesomeness (https://github.com/bayandin/awesome-awesomeness).
- Even more lists awesome (https://github.com/sindresorhus/awesome).
- Another list? list (https://github.com/jnv/lists).