update lists

2025-07-18 22:22:32 +02:00
parent 55bed3b4a1
commit 5916c5c074
3078 changed files with 331679 additions and 357255 deletions

@@ -1,4 +1,4 @@
 Awesome Big Data
 Awesome Big Data
!Awesome (https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg) (https://github.com/sindresorhus/awesome)
@@ -59,10 +59,9 @@
Frameworks
⟡ Bistro (https://github.com/facebook/bistro) - general-purpose data processing engine for both batch and stream analytics. It is based on a novel data model, which represents data via functions and processes data via column operations 
as opposed to having only set operations in conventional approaches like MapReduce or SQL.
⟡ IBM Streams
 (https://www.ibm.com/analytics/us/en/technology/stream-computing/) - platform for distributed processing and real-time analytics. Integrates with many of the popular technologies in the Big Data ecosystem (Kafka, HDFS, Spark, etc.)
⟡ Bistro (https://github.com/facebook/bistro) - general-purpose data processing engine for both batch and stream analytics. It is based on a novel data model, which represents data via functions and processes data via column operations as opposed
to having only set operations in conventional approaches like MapReduce or SQL.
⟡ IBM Streams (https://www.ibm.com/analytics/us/en/technology/stream-computing/) - platform for distributed processing and real-time analytics. Integrates with many of the popular technologies in the Big Data ecosystem (Kafka, HDFS, Spark, etc.)
⟡ Apache Hadoop (http://hadoop.apache.org/) - framework for distributed processing. Integrates MapReduce (parallel processing), YARN (job scheduling) and HDFS (distributed file system).
⟡ Tigon (https://github.com/caskdata/tigon) - High Throughput Real-time Stream Processing Framework.
⟡ Pachyderm (http://pachyderm.io/) - Pachyderm is a data storage platform built on Docker and Kubernetes to provide reproducible data processing and analysis.
@@ -97,8 +96,7 @@
⟡ Concurrent Cascading (http://www.cascading.org/) - framework for data management/analytics on Hadoop.
⟡ Damballa Parkour (https://github.com/damballa/parkour) - MapReduce library for Clojure.
⟡ Datasalt Pangool (https://github.com/datasalt/pangool) - alternative MapReduce paradigm.
⟡ DataTorrent StrAM
 (https://www.datatorrent.com/) - real-time engine is designed to enable distributed, asynchronous, real time in-memory big-data computations in as unblocked a way as possible, with minimal overhead and impact on performance.
⟡ DataTorrent StrAM (https://www.datatorrent.com/) - real-time engine is designed to enable distributed, asynchronous, real time in-memory big-data computations in as unblocked a way as possible, with minimal overhead and impact on performance.
⟡ Facebook Corona (https://www.facebook.com/notes/facebook-engineering/under-the-hood-scheduling-mapreduce-jobs-more-efficiently-with-corona/10151142560538920) - Hadoop enhancement which removes single point of failure.
⟡ Facebook Peregrine (http://peregrine_mapreduce.bitbucket.org/) - Map Reduce framework.
⟡ Facebook Scuba (https://www.facebook.com/notes/facebook-engineering/under-the-hood-data-diving-with-scuba/10150599692628920) - distributed in-memory datastore.
@@ -167,12 +165,11 @@
Key Map Data Model
Note: There is some term confusion in the industry, and two different things are called "Columnar Databases". Some, listed here, are distributed, persistent databases built around the "key-map" data model: all data has a (possibly 
composite) key, with which a map of key-value pairs is associated. In some systems, multiple such value maps can be associated with a key, and these maps are referred to as "column families" (with value map keys being referred to as 
"columns").
Note: There is some term confusion in the industry, and two different things are called "Columnar Databases". Some, listed here, are distributed, persistent databases built around the "key-map" data model: all data has a (possibly composite) key,
with which a map of key-value pairs is associated. In some systems, multiple such value maps can be associated with a key, and these maps are referred to as "column families" (with value map keys being referred to as "columns").
Another group of technologies that can also be called "columnar databases" is distinguished by how it stores data, on disk or in memory -- rather than storing data the traditional way, where all column values for a given key are stored 
next to each other, "row by row", these systems store all column values next to each other. So more work is needed to get all columns for a given key, but less work is needed to get all values for a given column.
Another group of technologies that can also be called "columnar databases" is distinguished by how it stores data, on disk or in memory -- rather than storing data the traditional way, where all column values for a given key are stored next to 
each other, "row by row", these systems store all column values next to each other. So more work is needed to get all columns for a given key, but less work is needed to get all values for a given column.
The former group is referred to as "key map data model" here. The line between these and the Key-value Data Model (#key-value-data-model) stores is fairly blurry.
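The two meanings of "columnar" described in the note above can be illustrated with a minimal Python sketch. It is not taken from any of the listed systems: the key "user:42", the family names "profile" and "stats", and the helper functions are hypothetical, chosen only to show the shape of the data.

```python
from collections import defaultdict

# Key-map data model: row key -> column family -> {column: value}.
# The key "user:42" and the families "profile"/"stats" are hypothetical.
key_map_store = defaultdict(lambda: defaultdict(dict))
key_map_store["user:42"]["profile"]["name"] = "Ada"
key_map_store["user:42"]["profile"]["city"] = "London"
key_map_store["user:42"]["stats"]["logins"] = 7

# Row-oriented storage: all column values for one key sit next to each other.
rows = [
    {"id": 1, "name": "Ada", "logins": 7},
    {"id": 2, "name": "Lin", "logins": 3},
]

# Column-oriented storage: all values of one column sit next to each other.
columns = {
    "id": [1, 2],
    "name": ["Ada", "Lin"],
    "logins": [7, 3],
}


def rows_to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    out = defaultdict(list)
    for record in records:
        for name, value in record.items():
            out[name].append(value)
    return dict(out)


def get_row(cols, index):
    """Rebuild one row from column storage: touches every column (more work)."""
    return {name: values[index] for name, values in cols.items()}


# Reading a single column touches only one list (less work).
total_logins = sum(columns["logins"])
```

Here `get_row` has to visit every column to assemble one record, while `total_logins` reads a single list, which mirrors the trade-off the note describes.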
@@ -216,7 +213,7 @@
⟡ Redis (https://redis.io/) - in memory key value datastore.
⟡ Riak (https://github.com/basho/riak) - a decentralized datastore.
⟡ Storehaus (https://github.com/twitter/storehaus) - library to work with asynchronous key value stores, by Twitter.
⟡ SummitDB (https://github.com/tidwall/summitdb) - an in-memory, NoSQL key/value database, with disk persistance and using the Raft consensus algorithm.
⟡ SummitDB (https://github.com/tidwall/summitdb) - an in-memory, NoSQL key/value database, with disk persistence and using the Raft consensus algorithm.
⟡ Tarantool (https://github.com/tarantool/tarantool) - an efficient NoSQL database and a Lua application server.
⟡ TiKV (https://github.com/pingcap/tikv) - a distributed key-value database powered by Rust and inspired by Google Spanner and HBase.
⟡ Tile38 (https://github.com/tidwall/tile38) - a geolocation data store, spatial index, and realtime geofence, supporting a variety of object types including latitude/longitude points, bounding boxes, XYZ tiles, Geohashes, and GeoJSON
@@ -229,8 +226,8 @@
⟡ Apache Giraph (http://giraph.apache.org/) - implementation of Pregel, based on Hadoop.
⟡ Apache Spark Bagel (http://spark.apache.org/docs/0.7.3/bagel-programming-guide.html) - implementation of Pregel, part of Spark.
⟡ ArangoDB (https://www.arangodb.com/) - multi model distributed database.
⟡ DGraph (https://github.com/dgraph-io/dgraph) - A scalable, distributed, low latency, high throughput graph database aimed at providing Google production level scale and throughput, with low enough latency to be serving real time user 
queries, over terabytes of structured data.
⟡ DGraph (https://github.com/dgraph-io/dgraph) - A scalable, distributed, low latency, high throughput graph database aimed at providing Google production level scale and throughput, with low enough latency to be serving real time user queries, 
over terabytes of structured data.
⟡ EliasDB (https://github.com/krotik/eliasdb) - a lightweight graph based database that does not require any third-party libraries.
⟡ Facebook TAO (https://www.facebook.com/notes/facebook-engineering/tao-the-power-of-the-graph/10151525983993920) - TAO is the distributed data store that is widely used at facebook to store and serve the social graph.
⟡ GCHQ Gaffer (https://github.com/gchq/Gaffer) - Gaffer by GCHQ is a framework that makes it easy to store large-scale graphs in which the nodes and edges have statistics.
@@ -347,6 +344,7 @@
⟡ Dremio (https://www.dremio.com/) - an open-source, SQL-like Data-as-a-Service Platform based on Apache Arrow.
⟡ Facebook PrestoDB (https://prestodb.io/) - distributed SQL query engine.
⟡ Google BigQuery (https://research.google.com/pubs/pub36632.html) - framework for interactive analysis, implementation of Dremel.
⟡ Iceberg (https://iceberg.apache.org/) - an open table format for huge analytic datasets. Iceberg adds tables to Trino and Spark that use a high-performance format that works just like a SQL table.
⟡ Materialize (https://github.com/materializeinc/materialize) - is a streaming database for real-time applications using SQL for queries and supporting a large fraction of PostgreSQL.
⟡ Invantive SQL (https://documentation.invantive.com/2017R2/invantive-sql-grammar/invantive-sql-grammar-17.30.html) - SQL engine for online and on-premise use with integrated local data replication and 70+ connectors.
⟡ PipelineDB (https://www.pipelinedb.com/) - an open-source relational database that runs SQL queries continuously on streams, incrementally storing results in tables.
@@ -395,7 +393,7 @@
⟡ Akka Toolkit (http://akka.io/) - runtime for distributed, and fault tolerant event-driven applications on the JVM.
⟡ Apache Avro (http://avro.apache.org/) - data serialization system.
⟡ Apache Curator (http://curator.apache.org/) - Java libaries for Apache ZooKeeper.
⟡ Apache Curator (http://curator.apache.org/) - Java libraries for Apache ZooKeeper.
⟡ Apache Karaf (http://karaf.apache.org/) - OSGi runtime that runs on top of any OSGi framework.
⟡ Apache Thrift (http://thrift.apache.org//) - framework to build binary protocols.
⟡ Apache Zookeeper (http://zookeeper.apache.org/) - centralized service for process management.
@@ -405,8 +403,7 @@
⟡ Mara (https://github.com/mara/data-integration) - A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow
⟡ OpenMPI (https://www.open-mpi.org/) - message passing framework.
⟡ Serf (https://www.serf.io/) - decentralized solution for service discovery and orchestration.
⟡ Spotify Luigi
 (https://github.com/spotify/luigi) - a Python package for building complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more.
⟡ Spotify Luigi (https://github.com/spotify/luigi) - a Python package for building complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more.
⟡ Spring XD (https://github.com/spring-projects/spring-xd) - distributed and extensible system for data ingestion, real time analytics, batch processing, and data export.
⟡ Twitter Elephant Bird (https://github.com/twitter/elephant-bird) - libraries for working with LZOP-compressed data.
⟡ Twitter Finagle (https://twitter.github.io/finagle/) - asynchronous network stack for the JVM.
@@ -434,8 +431,7 @@
⟡ Concurrent Pattern (http://www.cascading.org/projects/pattern/) - machine learning library for Cascading.
⟡ convnetjs (https://github.com/karpathy/convnetjs) - Deep Learning in Javascript. Train Convolutional Neural Networks (or ordinary ones) in your browser.
⟡ DataVec (https://github.com/deeplearning4j/DataVec) - A vectorization and data preprocessing library for deep learning in Java and Scala. Part of the Deeplearning4j ecosystem. 
⟡ Deeplearning4j
 (https://github.com/deeplearning4j) - Fast, open deep learning for the JVM (Java, Scala, Clojure). A neural network configuration layer powered by a C++ library. Uses Spark and Hadoop to train nets on multiple GPUs and CPUs.
⟡ Deeplearning4j (https://github.com/deeplearning4j) - Fast, open deep learning for the JVM (Java, Scala, Clojure). A neural network configuration layer powered by a C++ library. Uses Spark and Hadoop to train nets on multiple GPUs and CPUs.
⟡ Decider (https://github.com/danielsdeleo/Decider) - Flexible and Extensible Machine Learning in Ruby.
⟡ ENCOG (http://www.heatonresearch.com/encog/) - machine learning framework that supports a variety of advanced algorithms, as well as support classes to normalize and process data.
⟡ etcML (http://www.etcml.com/) - text classification with machine learning.
@@ -455,7 +451,7 @@
⟡ MonkeyLearn (https://monkeylearn.com/) - Text mining made easy. Extract and classify data from text.
⟡ ND4J (https://github.com/deeplearning4j/nd4j) - A matrix library for the JVM. Numpy for Java. 
⟡ nupic (https://github.com/numenta/nupic) - Numenta Platform for Intelligent Computing: a brain-inspired machine intelligence platform, and biologically accurate neural network based on cortical learning algorithms.
⟡ PredictionIO (http://predictionio.incubator.apache.org/index.html) - machine learning server buit on Hadoop, Mahout and Cascading.
⟡ PredictionIO (http://predictionio.incubator.apache.org/index.html) - machine learning server built on Hadoop, Mahout and Cascading.
⟡ PyTorch Geometric Temporal (https://github.com/benedekrozemberczki/pytorch_geometric_temporal) - a temporal extension library for PyTorch Geometric.
⟡ RL4J (https://github.com/deeplearning4j/rl4j) - Reinforcement learning for Java and Scala. Includes Deep-Q learning and A3C algorithms, and integrates with Open AI's Gym. Runs in the Deeplearning4j ecosystem. 
⟡ SAMOA (http://samoa.incubator.apache.org/) - distributed streaming machine learning framework.
@@ -490,7 +486,7 @@
System Deployment
⟡ Apache Ambari (http://ambari.apache.org/) - operational framework for Hadoop mangement.
⟡ Apache Ambari (http://ambari.apache.org/) - operational framework for Hadoop management.
⟡ Apache Bigtop (http://bigtop.apache.org//) - system deployment framework for the Hadoop ecosystem.
⟡ Apache Helix (http://helix.apache.org/) - cluster management framework.
⟡ Apache Mesos (http://mesos.apache.org/) - cluster manager.
@@ -520,6 +516,7 @@
⟡ AthenaX (https://github.com/uber/AthenaX) - a streaming analytics platform that enables users to run production-quality, large scale streaming analytics using Structured Query Language (SQL).
⟡ Atlas (https://github.com/Netflix/atlas) - a backend for managing dimensional time series data.
⟡ Countly (https://count.ly/) - open source mobile and web analytics platform, based on Node.js & MongoDB.
⟡ Comet (https://www.comet.com/site/) - Comet provides an end-to-end model evaluation platform for AI developers, with best in class LLM evaluations, experiment tracking, and production monitoring.
⟡ Domino (https://www.dominodatalab.com/) - Run, scale, share, and deploy models — without any infrastructure.
⟡ Eclipse BIRT (http://www.eclipse.org/birt/) - Eclipse-based reporting system.
⟡ ElastAlert (https://github.com/Yelp/elastalert) - ElastAlert is a simple framework for alerting on anomalies, spikes, or other patterns of interest from data in ElasticSearch.
@@ -534,10 +531,11 @@
⟡ Kapacitor (https://github.com/influxdata/kapacitor) - an open source framework for processing, monitoring, and alerting on time series data.
⟡ Kylin (http://kylin.apache.org/) - open source Distributed Analytics Engine from eBay.
⟡ PivotalR (https://github.com/pivotalsoftware/PivotalR) - R on Pivotal HD / HAWQ and PostgreSQL.
⟡ Opik (https://www.comet.com/site/products/opik/) - Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
⟡ Rakam (https://github.com/rakam-io/rakam) - open-source real-time custom analytics platform powered by Postgresql, Kinesis and PrestoDB. 
⟡ Qubole (https://www.qubole.com/) - auto-scaling Hadoop cluster, built-in data connectors.
⟡ SnappyData (https://github.com/SnappyDataInc/snappydata) - a distributed in-memory data store for real-time operational analytics, delivering stream analytics, OLTP (online transaction processing) and OLAP (online analytical 
processing) built on Spark in a single integrated cluster.
⟡ SnappyData (https://github.com/SnappyDataInc/snappydata) - a distributed in-memory data store for real-time operational analytics, delivering stream analytics, OLTP (online transaction processing) and OLAP (online analytical processing) built 
on Spark in a single integrated cluster.
⟡ Snowplow (https://github.com/snowplow/snowplow) - enterprise-strength web and event analytics, powered by Hadoop, Kinesis, Redshift and Postgres.
⟡ SparkR (http://amplab-extras.github.io/SparkR-pkg/) - R frontend for Spark.
⟡ Splunk (https://www.splunk.com/) - analyzer for machine-generated data.
@@ -560,14 +558,14 @@
⟡ LinkedIn Cleo (https://github.com/linkedin/cleo) - is a flexible software library for enabling rapid development of partial, out-of-order and real-time typeahead search.
⟡ LinkedIn Galene (https://engineering.linkedin.com/search/did-you-mean-galene) - search architecture at LinkedIn.
⟡ LinkedIn Zoie (https://github.com/senseidb/zoie) - is a realtime search/indexing system written in Java.
⟡ MG4J (http://mg4j.di.unimi.it/) - MG4J (Managing Gigabytes for Java) is a full-text search engine for large document collections written in Java. It is highly customisable, high-performance and provides state-of-the-art features and 
new research algorithms.
⟡ MG4J (http://mg4j.di.unimi.it/) - MG4J (Managing Gigabytes for Java) is a full-text search engine for large document collections written in Java. It is highly customisable, high-performance and provides state-of-the-art features and new 
research algorithms.
⟡ Sphinx Search Server (http://sphinxsearch.com/) - fulltext search engine.
⟡ Vespa (http://vespa.ai/) - is an engine for low-latency computation over large data sets. It stores and indexes your data such that queries, selection and processing over the data can be performed at serving time.
⟡ Facebook Faiss (https://github.com/facebookresearch/faiss) - is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do 
not fit in RAM. It also contains supporting code for evaluation and parameter tuning. Faiss is written in C++ with complete wrappers for Python/numpy.
⟡ Annoy (https://github.com/spotify/annoy) - is a C++ library with Python bindings to search for points in space that are close to a given query point. It also creates large read-only file-based data structures that are mmapped into 
memory so that many processes may share the same data.
⟡ Facebook Faiss (https://github.com/facebookresearch/faiss) - is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in
RAM. It also contains supporting code for evaluation and parameter tuning. Faiss is written in C++ with complete wrappers for Python/numpy.
⟡ Annoy (https://github.com/spotify/annoy) - is a C++ library with Python bindings to search for points in space that are close to a given query point. It also creates large read-only file-based data structures that are mmapped into memory so 
that many processes may share the same data.
⟡ Weaviate (https://github.com/semi-technologies/weaviate) - Weaviate is a GraphQL-based semantic search engine with built-in (word) embeddings.
MySQL forks and evolutions
@@ -623,6 +621,7 @@
⟡ Jedox Palo (https://www.jedox.com/en/) - customisable Business Intelligence platform.
⟡ Jethrodata (https://jethro.io/) - Interactive Big Data Analytics.
⟡ intermix.io (https://intermix.io/) - Performance Monitoring for Amazon Redshift
⟡ Lightdash (https://github.com/lightdash/lightdash) - The open source Looker alternative built on dbt
⟡ Metabase (https://github.com/metabase/metabase) - The simplest, fastest way to get business intelligence and analytics to everyone in your company.
⟡ Microsoft (http://www.microsoft.com/en-us/server-cloud/solutions/business-intelligence/default.aspx) - business intelligence software and platform.
⟡ Microstrategy (https://www.microstrategy.com/) - software platforms for business intelligence, mobile intelligence, and network applications.
@@ -644,8 +643,8 @@
⟡ Arbor (https://github.com/samizdatco/arbor) - graph visualization library using web workers and jQuery.
⟡ Banana (https://github.com/LucidWorks/banana) - visualize logs and time-stamped data stored in Solr. Port of Kibana.
⟡ Bloomery (https://github.com/ufukomer/bloomery) - Web UI for Impala.
⟡ Bokeh (http://bokeh.pydata.org/en/latest/) - A powerful Python interactive visualization library that targets modern web browsers for presentation, with the goal of providing elegant, concise construction of novel graphics in the 
style of D3.js, but also delivering this capability with high-performance interactivity over very large or streaming datasets.
⟡ Bokeh (http://bokeh.pydata.org/en/latest/) - A powerful Python interactive visualization library that targets modern web browsers for presentation, with the goal of providing elegant, concise construction of novel graphics in the style of 
D3.js, but also delivering this capability with high-performance interactivity over very large or streaming datasets.
⟡ C3 (http://c3js.org/) - D3-based reusable chart library
⟡ CartoDB (https://github.com/CartoDB/cartodb) - open-source or freemium hosting for geospatial databases with powerful front-end editing capabilities and a robust API.
⟡ chartd (http://chartd.co/) - responsive, retina-compatible charts with just an img tag.
@@ -706,8 +705,7 @@
Interesting Readings
⟡ Big Data Benchmark (https://amplab.cs.berkeley.edu/benchmark/) - Benchmark of Redshift, Hive, Shark, Impala and Stinger/Tez.
⟡ NoSQL Comparison
 (https://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis) - Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase vs Couchbase vs Neo4j vs Hypertable vs ElasticSearch vs Accumulo vs VoltDB vs Scalaris comparison.
⟡ NoSQL Comparison (https://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis) - Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase vs Couchbase vs Neo4j vs Hypertable vs ElasticSearch vs Accumulo vs VoltDB vs Scalaris comparison.
⟡ Monitoring Kafka performance (https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics?ref=awesome) - Guide to monitoring Apache Kafka, including native methods for metrics collection.
⟡ Monitoring Hadoop performance (https://www.datadoghq.com/blog/monitor-hadoop-metrics?ref=awesome) - Guide to monitoring Hadoop, with an overview of Hadoop architecture, and native methods for metrics collection.
⟡ Monitoring Cassandra performance (https://www.datadoghq.com/blog/how-to-monitor-cassandra-performance-metrics/?ref=awesome) - Guide to monitoring Cassandra, including native methods for metrics collection.
@@ -754,7 +752,7 @@
⟡ 2010 (https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf) - Facebook - Finding a needle in Haystack: Facebook's photo storage.
⟡ 2010 (https://amplab.cs.berkeley.edu/wp-content/uploads/2011/06/Spark-Cluster-Computing-with-Working-Sets.pdf) - AMPLab - Spark: Cluster Computing with Working Sets.
⟡ 2010 (http://kowshik.github.io/JPregel/pregel_paper.pdf) - Google - Pregel: A System for Large-Scale Graph Processing.
⟡ 2010 (http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36726.pdf) - Google - Large-scale Incremental Processing Using Distributed Transactions and Notications base of Percolator and Caffeine.
⟡ 2010 (http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36726.pdf) - Google - Large-scale Incremental Processing Using Distributed Transactions and Notifications, the basis of Percolator and Caffeine.
⟡ 2010 (http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf) - Google - Dremel: Interactive Analysis of Web-Scale Datasets.
⟡ 2010 (http://leoneu.github.io/) - Yahoo - S4: Distributed Stream Computing Platform.
⟡ 2009 (http://www.cs.umd.edu/~abadi/papers/hadoopdb.pdf) - HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. 
@@ -782,22 +780,22 @@
⟡ Streaming Data (https://www.manning.com/books/streaming-data) - Streaming Data introduces the concepts and requirements of streaming and real-time data systems.
⟡ Storm Applied (https://www.manning.com/books/storm-applied) - Storm Applied is a practical guide to using Apache Storm for the real-world tasks associated with processing and analyzing real-time data streams.
⟡ Fundamentals of Stream Processing: Application Design, Systems, and Analytics 
(http://www.cambridge.org/us/academic/subjects/engineering/communications-and-signal-processing/fundamentals-stream-processing-application-design-systems-and-analytics) - This comprehensive, hands-on guide combining the fundamental 
building blocks and emerging research in stream processing is ideal for application designers, system builders, analytic developers, as well as students and researchers in the field.
(http://www.cambridge.org/us/academic/subjects/engineering/communications-and-signal-processing/fundamentals-stream-processing-application-design-systems-and-analytics) - This comprehensive, hands-on guide combining the fundamental building 
blocks and emerging research in stream processing is ideal for application designers, system builders, analytic developers, as well as students and researchers in the field.
⟡ Stream Data Processing: A Quality of Service Perspective (http://www.springer.com/us/book/9780387710020) - Presents a new paradigm suitable for stream and complex event processing.
⟡ Unified Log Processing (https://www.manning.com/books/event-streams-in-action) - Unified Log Processing is a practical guide to implementing a unified log of event streams (Kafka or Kinesis) in your business
⟡ Kafka Streams in Action (https://www.manning.com/books/kafka-streams-in-action) - Kafka Streams in Action teaches you everything you need to know to implement stream processing on data flowing into your Kafka platform, allowing you to
focus on getting more from your data without sacrificing time or effort.
⟡ Big Data (https://www.manning.com/books/big-data) - Big Data teaches you to build big data systems using an architecture that takes advantage of clustered hardware along with new tools designed specifically to capture and analyze 
web-scale data.
⟡ Spark in Action (https://www.manning.com/books/spark-in-action) & Spark in Action 2nd Ed. (https://www.manning.com/books/spark-in-action-second-edition) - Spark in Action teaches you the theory and skills you need to effectively 
handle batch and streaming data using Spark. Fully updated for Spark 2.0.
⟡ Kafka Streams in Action (https://www.manning.com/books/kafka-streams-in-action) - Kafka Streams in Action teaches you everything you need to know to implement stream processing on data flowing into your Kafka platform, allowing you to focus on 
getting more from your data without sacrificing time or effort.
⟡ Big Data
 (https://www.manning.com/books/big-data) - Big Data teaches you to build big data systems using an architecture that takes advantage of clustered hardware along with new tools designed specifically to capture and analyze web-scale data.
⟡ Spark in Action (https://www.manning.com/books/spark-in-action) & Spark in Action 2nd Ed. (https://www.manning.com/books/spark-in-action-second-edition) - Spark in Action teaches you the theory and skills you need to effectively handle batch 
and streaming data using Spark. Fully updated for Spark 2.0.
⟡ Kafka in Action (https://www.manning.com/books/kafka-in-action) - Kafka in Action is a fast-paced introduction to every aspect of working with Kafka you need to really reap its benefits.
⟡ Fusion in Action (https://www.manning.com/books/fusion-in-action) - Fusion in Action teaches you to build a full-featured data analytics pipeline, including document and data search and distributed data clustering.
⟡ Reactive Data Handling (https://www.manning.com/books/reactive-data-handling) - Reactive Data Handling is a collection of five hand-picked chapters, selected by Manuel Bernhardt, that introduce you to building reactive applications 
capable of handling real-time processing with large data loads--free eBook! 
⟡ Reactive Data Handling (https://www.manning.com/books/reactive-data-handling) - Reactive Data Handling is a collection of five hand-picked chapters, selected by Manuel Bernhardt, that introduce you to building reactive applications capable of 
handling real-time processing with large data loads--free eBook! 
⟡ Azure Data Engineering (https://www.manning.com/books/azure-data-engineering) - A book about data engineering in general and the Azure platform specifically 
⟡ Grokking Streaming Systems (https://www.manning.com/books/grokking-streaming-systems) - Grokking Streaming Systems helps you unravel what streaming systems are, how they work, and whether they're right for your business. Written to be
⟡ Grokking Streaming Systems (https://www.manning.com/books/grokking-streaming-systems) - Grokking Streaming Systems helps you unravel what streaming systems are, how they work, and whether they're right for your business. Written to be 
tool-agnostic, you'll be able to apply what you learn no matter which framework you choose.
Distributed systems
@@ -813,7 +811,7 @@
 ⟡ Ice Bucket Challenge Data Visualization (https://www.youtube.com/watch?v=qTEchen97rQ)
 Other Awesome Lists
 Other Awesome Lists
- Other awesome lists awesome-awesomeness (https://github.com/bayandin/awesome-awesomeness).
- Even more lists awesome (https://github.com/sindresorhus/awesome).
- Another list? list (https://github.com/jnv/lists).
@@ -829,3 +827,5 @@
- Monte Carlo Tree Search Papers awesome-monte-carlo-tree-search-papers (https://github.com/benedekrozemberczki/awesome-monte-carlo-tree-search-papers).
- Kafka awesome-kafka (https://github.com/monksy/awesome-kafka).
- Google Bigtable (https://github.com/zrosenbauer/awesome-bigtable).
bigdata Github: https://github.com/0xnr/awesome-bigdata