Updating conversion, creating readmes

This commit is contained in:
Jonas Zeunert
2024-04-19 23:37:46 +02:00
parent 3619ac710a
commit 08e75b0f0a
635 changed files with 30878 additions and 37344 deletions

@@ -1,9 +1,9 @@
 Awesome Big Data
 Awesome Big Data
!Awesome (https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg) (https://github.com/sindresorhus/awesome)
A curated list of awesome big data frameworks, resources and other awesomeness. Inspired by awesome-php (https://github.com/ziadoz/awesome-php), awesome-python (https://github.com/vinta/awesome-python), 
awesome-ruby (https://github.com/Sdogruyol/awesome-ruby), hadoopecosystemtable (http://hadoopecosystemtable.github.io/) & big-data (http://usefulstuff.io/big-data/).
A curated list of awesome big data frameworks, resources and other awesomeness. Inspired by awesome-php (https://github.com/ziadoz/awesome-php), awesome-python (https://github.com/vinta/awesome-python), awesome-ruby 
(https://github.com/Sdogruyol/awesome-ruby), hadoopecosystemtable (http://hadoopecosystemtable.github.io/) & big-data (http://usefulstuff.io/big-data/).
Your contributions are always welcome!
@@ -59,10 +59,10 @@
Frameworks
⟡ Bistro (https://github.com/facebook/bistro) - general-purpose data processing engine for both batch and stream analytics. It is based on a novel data model, which represents data via functions and processes 
data via column operations as opposed to having only set operations in conventional approaches like MapReduce or SQL.
⟡ IBM Streams (https://www.ibm.com/analytics/us/en/technology/stream-computing/) - platform for distributed processing and real-time analytics. Integrates with many of the popular technologies in the Big Data 
ecosystem (Kafka, HDFS, Spark, etc.)
⟡ Bistro (https://github.com/facebook/bistro) - general-purpose data processing engine for both batch and stream analytics. It is based on a novel data model, which represents data via functions and processes data via column operations 
as opposed to having only set operations in conventional approaches like MapReduce or SQL.
⟡ IBM Streams
 (https://www.ibm.com/analytics/us/en/technology/stream-computing/) - platform for distributed processing and real-time analytics. Integrates with many of the popular technologies in the Big Data ecosystem (Kafka, HDFS, Spark, etc.)
⟡ Apache Hadoop (http://hadoop.apache.org/) - framework for distributed processing. Integrates MapReduce (parallel processing), YARN (job scheduling) and HDFS (distributed file system).
⟡ Tigon (https://github.com/caskdata/tigon) - High Throughput Real-time Stream Processing Framework.
⟡ Pachyderm (http://pachyderm.io/) - Pachyderm is a data storage platform built on Docker and Kubernetes to provide reproducible data processing and analysis.
@@ -97,17 +97,15 @@
⟡ Concurrent Cascading (http://www.cascading.org/) - framework for data management/analytics on Hadoop.
⟡ Damballa Parkour (https://github.com/damballa/parkour) - MapReduce library for Clojure.
⟡ Datasalt Pangool (https://github.com/datasalt/pangool) - alternative MapReduce paradigm.
⟡ DataTorrent StrAM (https://www.datatorrent.com/) - real-time engine is designed to enable distributed, asynchronous, real time in-memory big-data computations in as unblocked a way as possible, with minimal 
overhead and impact on performance.
⟡ Facebook Corona
 (https://www.facebook.com/notes/facebook-engineering/under-the-hood-scheduling-mapreduce-jobs-more-efficiently-with-corona/10151142560538920) - Hadoop enhancement which removes single point of failure.
⟡ DataTorrent StrAM
 (https://www.datatorrent.com/) - real-time engine is designed to enable distributed, asynchronous, real time in-memory big-data computations in as unblocked a way as possible, with minimal overhead and impact on performance.
⟡ Facebook Corona (https://www.facebook.com/notes/facebook-engineering/under-the-hood-scheduling-mapreduce-jobs-more-efficiently-with-corona/10151142560538920) - Hadoop enhancement which removes single point of failure.
⟡ Facebook Peregrine (http://peregrine_mapreduce.bitbucket.org/) - Map Reduce framework.
⟡ Facebook Scuba (https://www.facebook.com/notes/facebook-engineering/under-the-hood-data-diving-with-scuba/10150599692628920) - distributed in-memory datastore.
⟡ Google Dataflow (https://googledevelopers.blogspot.it/2014/06/cloud-platform-at-google-io-new-big.html) - create data pipelines to help them ingest, transform and analyze data.
⟡ Google MapReduce (https://research.google.com/archive/mapreduce.html) - map reduce framework.
⟡ Google MillWheel (https://research.google.com/pubs/pub41378.html) - fault tolerant stream processing framework.
⟡ IBM Streams (https://www.ibm.com/analytics/us/en/technology/stream-computing/) - platform for distributed processing and real-time analytics. Provides toolkits for advanced analytics like geospatial, time 
series, etc. out of the box.
⟡ IBM Streams (https://www.ibm.com/analytics/us/en/technology/stream-computing/) - platform for distributed processing and real-time analytics. Provides toolkits for advanced analytics like geospatial, time series, etc. out of the box.
⟡ JAQL (https://code.google.com/p/jaql/) - declarative programming language for working with structured, semi-structured and unstructured data.
⟡ Kite (http://kitesdk.org/docs/current/) - is a set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem.
⟡ Metamarkets Druid (http://druid.io/) - framework for real-time analysis of large datasets.
@@ -169,13 +167,12 @@
Key Map Data Model
Note: There is some term confusion in the industry, and two different things are called "Columnar Databases". Some, listed here, are distributed, persistent databases built around the "key-map" data model: all 
data has a (possibly composite) key, with which a map of key-value pairs is associated. In some systems, multiple such value maps can be associated with a key, and these maps are referred to as "column families"
(with value map keys being referred to as "columns").
Note: There is some term confusion in the industry, and two different things are called "Columnar Databases". Some, listed here, are distributed, persistent databases built around the "key-map" data model: all data has a (possibly 
composite) key, with which a map of key-value pairs is associated. In some systems, multiple such value maps can be associated with a key, and these maps are referred to as "column families" (with value map keys being referred to as 
"columns").
Another group of technologies that can also be called "columnar databases" is distinguished by how it stores data, on disk or in memory -- rather than storing data the traditional way, where all column values 
for a given key are stored next to each other, "row by row", these systems store all column values next to each other. So more work is needed to get all columns for a given key, but less work is needed to get 
all values for a given column.
Another group of technologies that can also be called "columnar databases" is distinguished by how it stores data, on disk or in memory -- rather than storing data the traditional way, where all column values for a given key are stored 
next to each other, "row by row", these systems store all column values next to each other. So more work is needed to get all columns for a given key, but less work is needed to get all values for a given column.
The former group is referred to as "key map data model" here. The line between these and the Key-value Data Model (#key-value-data-model) stores is fairly blurry.
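The two meanings of "columnar" described in the note can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample records, the "profile" column family name and the helper functions are hypothetical, and no listed database exposes this exact API.

```python
# Hypothetical sample data: three records identified by a key.
rows = [
    {"id": "u1", "name": "Ada", "age": 36},
    {"id": "u2", "name": "Bob", "age": 41},
    {"id": "u3", "name": "Cy", "age": 29},
]

# Key-map data model: key -> column family -> map of column -> value.
# "profile" is an assumed column family name used only for this sketch.
key_map = {
    r["id"]: {"profile": {"name": r["name"], "age": r["age"]}} for r in rows
}

# Row-oriented layout: all column values for one key sit next to each other.
row_store = [(r["id"], r["name"], r["age"]) for r in rows]

# Column-oriented layout: all values of one column sit next to each other.
column_store = {
    "id": [r["id"] for r in rows],
    "name": [r["name"] for r in rows],
    "age": [r["age"] for r in rows],
}


def record_from_columns(store, index):
    """Rebuild one record from the columnar layout: touches every column."""
    return {column: values[index] for column, values in store.items()}


def column_from_rows(store, position):
    """Collect one column from the row layout: touches every row."""
    return [row[position] for row in store]


# Cheap in the columnar layout: one lookup returns the whole column.
ages = column_store["age"]

# More work in the columnar layout: one lookup per column for a single key.
bob = record_from_columns(column_store, 1)
```

The trade-off from the note shows up in the helpers: reading a single column is a direct lookup in `column_store` but a scan over every row in `row_store`, while rebuilding a full record is the opposite.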
@@ -193,8 +190,7 @@
⟡ Hypertable (http://www.hypertable.org/) - column-oriented distributed datastore, inspired by BigTable.
⟡ InfiniDB (https://github.com/infinidb/infinidb/) - is accessed through a MySQL interface and uses massively parallel processing to parallelize queries.
⟡ Tephra (https://github.com/caskdata/tephra) - Transactions for HBase.
⟡ Twitter Manhattan
 (https://blog.twitter.com/engineering/en_us/a/2014/manhattan-our-real-time-multi-tenant-distributed-database-for-twitter-scale.html) - real-time, multi-tenant distributed database for Twitter scale.
⟡ Twitter Manhattan (https://blog.twitter.com/engineering/en_us/a/2014/manhattan-our-real-time-multi-tenant-distributed-database-for-twitter-scale.html) - real-time, multi-tenant distributed database for Twitter scale.
⟡ ScyllaDB (http://www.scylladb.com/) - column-oriented distributed datastore written in C++, totally compatible with Apache Cassandra.
@@ -223,8 +219,7 @@
⟡ SummitDB (https://github.com/tidwall/summitdb) - an in-memory, NoSQL key/value database, with disk persistence and using the Raft consensus algorithm.
⟡ Tarantool (https://github.com/tarantool/tarantool) - an efficient NoSQL database and a Lua application server.
⟡ TiKV (https://github.com/pingcap/tikv) - a distributed key-value database powered by Rust and inspired by Google Spanner and HBase.
⟡ Tile38 (https://github.com/tidwall/tile38) - a geolocation data store, spatial index, and realtime geofence, supporting a variety of object types including latitude/longitude points, bounding boxes, XYZ tiles,
Geohashes, and GeoJSON
⟡ Tile38 (https://github.com/tidwall/tile38) - a geolocation data store, spatial index, and realtime geofence, supporting a variety of object types including latitude/longitude points, bounding boxes, XYZ tiles, Geohashes, and GeoJSON
⟡ TreodeDB (https://github.com/Treode/store) - key-value store that's replicated and sharded and provides atomic multirow writes.
@@ -234,11 +229,10 @@
⟡ Apache Giraph (http://giraph.apache.org/) - implementation of Pregel, based on Hadoop.
⟡ Apache Spark Bagel (http://spark.apache.org/docs/0.7.3/bagel-programming-guide.html) - implementation of Pregel, part of Spark.
⟡ ArangoDB (https://www.arangodb.com/) - multi model distributed database.
⟡ DGraph (https://github.com/dgraph-io/dgraph) - A scalable, distributed, low latency, high throughput graph database aimed at providing Google production level scale and throughput, with low enough latency to 
be serving real time user queries, over terabytes of structured data.
⟡ DGraph (https://github.com/dgraph-io/dgraph) - A scalable, distributed, low latency, high throughput graph database aimed at providing Google production level scale and throughput, with low enough latency to be serving real time user 
queries, over terabytes of structured data.
⟡ EliasDB (https://github.com/krotik/eliasdb) - a lightweight graph based database that does not require any third-party libraries.
⟡ Facebook TAO
 (https://www.facebook.com/notes/facebook-engineering/tao-the-power-of-the-graph/10151525983993920) - TAO is the distributed data store that is widely used at facebook to store and serve the social graph.
⟡ Facebook TAO (https://www.facebook.com/notes/facebook-engineering/tao-the-power-of-the-graph/10151525983993920) - TAO is the distributed data store that is widely used at facebook to store and serve the social graph.
⟡ GCHQ Gaffer (https://github.com/gchq/Gaffer) - Gaffer by GCHQ is a framework that makes it easy to store large-scale graphs in which the nodes and edges have statistics.
⟡ Google Cayley (https://github.com/cayleygraph/cayley) - open-source graph database.
⟡ Google Pregel (http://kowshik.github.io/JPregel/pregel_paper.pdf) - graph processing framework.
@@ -251,8 +245,7 @@
 with multiple options for storage backends (Bigtable, HBase, Cassandra, etc.)
 and indexing backends (Elasticsearch, Solr, Lucene).
⟡ MapGraph (https://www.blazegraph.com/mapgraph-technology/) - Massively Parallel Graph processing on GPUs.
⟡ Microsoft Graph Engine
 (https://github.com/Microsoft/GraphEngine) - a distributed in-memory data processing engine, underpinned by a strongly-typed in-memory key-value store and a general distributed computation engine.
⟡ Microsoft Graph Engine (https://github.com/Microsoft/GraphEngine) - a distributed in-memory data processing engine, underpinned by a strongly-typed in-memory key-value store and a general distributed computation engine.
⟡ Neo4j (https://neo4j.com/) - graph database written entirely in Java.
⟡ OrientDB (http://orientdb.com/) - document and graph database.
⟡ Phoebus (https://github.com/xslogic/phoebus) - framework for large scale graph processing.
@@ -301,8 +294,7 @@
⟡ Map-D (https://www.mapd.com/) - GPU in-memory database, big data analysis and visualization platform.
⟡ MemSQL (http://www.memsql.com/) - in-memory SQL database with optimized columnar storage on flash.
⟡ NuoDB (http://www.nuodb.com/) - SQL/ACID compliant distributed database.
⟡ Oracle TimesTen in-Memory Database
 (http://www.oracle.com/technetwork/database/database-technologies/timesten/overview/index.html) - in-memory, relational database management system with persistence and recoverability.
⟡ Oracle TimesTen in-Memory Database (http://www.oracle.com/technetwork/database/database-technologies/timesten/overview/index.html) - in-memory, relational database management system with persistence and recoverability.
⟡ Pivotal GemFire XD (http://gemfirexd.docs.pivotal.io/latest/) - Low-latency, in-memory, distributed SQL data store. Provides SQL interface to in-memory table data, persistable in HDFS.
⟡ SAP HANA (https://hana.sap.com/abouthana.html) - is an in-memory, column-oriented, relational database management system.
⟡ SenseiDB (http://senseidb.github.io/sensei/) - distributed, realtime, semi-structured database.
@@ -331,8 +323,7 @@
⟡ TrailDB (http://traildb.io/) - an efficient tool for storing and querying series of events.
⟡ Druid (https://github.com/druid-io/druid/) Column oriented distributed data store ideal for powering interactive applications
⟡ Riak-TS (http://basho.com/products/riak-ts/) Riak TS is the only enterprise-grade NoSQL time series database optimized specifically for IoT and Time Series data.
⟡ Akumuli (https://github.com/akumuli/Akumuli) Akumuli is a numeric time-series database. It can be used to capture, store and process time-series data in real-time. The word "akumuli" can be translated from 
esperanto as "accumulate".
⟡ Akumuli (https://github.com/akumuli/Akumuli) Akumuli is a numeric time-series database. It can be used to capture, store and process time-series data in real-time. The word "akumuli" can be translated from esperanto as "accumulate".
⟡ Rhombus (https://github.com/Pardot/Rhombus) A time-series object store for Cassandra that handles all the complexity of building wide row indexes.
⟡ Dalmatiner DB (https://github.com/dalmatinerdb/dalmatinerdb) Fast distributed metrics database
⟡ Blueflood (https://github.com/rackerlabs/blueflood) A distributed system designed to ingest and process time series data
@@ -357,8 +348,7 @@
⟡ Facebook PrestoDB (https://prestodb.io/) - distributed SQL query engine.
⟡ Google BigQuery (https://research.google.com/pubs/pub36632.html) - framework for interactive analysis, implementation of Dremel.
⟡ Materialize (https://github.com/materializeinc/materialize) - is a streaming database for real-time applications using SQL for queries and supporting a large fraction of PostgreSQL.
⟡ Invantive SQL
 (https://documentation.invantive.com/2017R2/invantive-sql-grammar/invantive-sql-grammar-17.30.html) - SQL engine for online and on-premise use with integrated local data replication and 70+ connectors.
⟡ Invantive SQL (https://documentation.invantive.com/2017R2/invantive-sql-grammar/invantive-sql-grammar-17.30.html) - SQL engine for online and on-premise use with integrated local data replication and 70+ connectors.
⟡ PipelineDB (https://www.pipelinedb.com/) - an open-source relational database that runs SQL queries continuously on streams, incrementally storing results in tables.
⟡ Pivotal HDB (https://pivotal.io/pivotal-hdb) - SQL-like data warehouse system for Hadoop.
⟡ RainstorDB (http://rainstor.com/products/rainstor-database/) - database for storing petabyte-scale volumes of structured and semi-structured data.
@@ -415,8 +405,8 @@
⟡ Mara (https://github.com/mara/data-integration) - A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow
⟡ OpenMPI (https://www.open-mpi.org/) - message passing framework.
⟡ Serf (https://www.serf.io/) - decentralized solution for service discovery and orchestration.
⟡ Spotify Luigi (https://github.com/spotify/luigi) - a Python package for building complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, 
command line integration, and much more.
⟡ Spotify Luigi
 (https://github.com/spotify/luigi) - a Python package for building complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more.
⟡ Spring XD (https://github.com/spring-projects/spring-xd) - distributed and extensible system for data ingestion, real time analytics, batch processing, and data export.
⟡ Twitter Elephant Bird (https://github.com/twitter/elephant-bird) - libraries for working with LZOP-compressed data.
⟡ Twitter Finagle (https://twitter.github.io/finagle/) - asynchronous network stack for the JVM.
@@ -444,14 +434,13 @@
⟡ Concurrent Pattern (http://www.cascading.org/projects/pattern/) - machine learning library for Cascading.
⟡ convnetjs (https://github.com/karpathy/convnetjs) - Deep Learning in Javascript. Train Convolutional Neural Networks (or ordinary ones) in your browser.
⟡ DataVec (https://github.com/deeplearning4j/DataVec) - A vectorization and data preprocessing library for deep learning in Java and Scala. Part of the Deeplearning4j ecosystem. 
⟡ Deeplearning4j (https://github.com/deeplearning4j) - Fast, open deep learning for the JVM (Java, Scala, Clojure). A neural network configuration layer powered by a C++ library. Uses Spark and Hadoop to train 
nets on multiple GPUs and CPUs.
⟡ Deeplearning4j
 (https://github.com/deeplearning4j) - Fast, open deep learning for the JVM (Java, Scala, Clojure). A neural network configuration layer powered by a C++ library. Uses Spark and Hadoop to train nets on multiple GPUs and CPUs.
⟡ Decider (https://github.com/danielsdeleo/Decider) - Flexible and Extensible Machine Learning in Ruby.
⟡ ENCOG (http://www.heatonresearch.com/encog/) - machine learning framework that supports a variety of advanced algorithms, as well as support classes to normalize and process data.
⟡ etcML (http://www.etcml.com/) - text classification with machine learning.
⟡ Etsy Conjecture (https://github.com/etsy/Conjecture) - scalable Machine Learning in Scalding.
⟡ Feast (https://github.com/gojek/feast) - A feature store for the management, discovery, and access of machine learning features. Feast provides a consistent view of feature data for both model training and 
model serving.
⟡ Feast (https://github.com/gojek/feast) - A feature store for the management, discovery, and access of machine learning features. Feast provides a consistent view of feature data for both model training and model serving.
⟡ GraphLab Create (https://dato.com/products/create/) - A machine learning platform in Python with a broad collection of ML toolkits, data engineering, and deployment tools.
⟡ H2O (https://github.com/h2oai/h2o-3/) - statistical, machine learning and math runtime with Hadoop. R and Python.
⟡ Karate Club (https://github.com/benedekrozemberczki/karateclub) - An unsupervised machine learning library for graph structured data. Python
@@ -465,8 +454,7 @@
⟡ MOA (http://moa.cms.waikato.ac.nz) - MOA performs big data stream mining in real time, and large scale machine learning.
⟡ MonkeyLearn (https://monkeylearn.com/) - Text mining made easy. Extract and classify data from text.
⟡ ND4J (https://github.com/deeplearning4j/nd4j) - A matrix library for the JVM. Numpy for Java. 
⟡ nupic
 (https://github.com/numenta/nupic) - Numenta Platform for Intelligent Computing: a brain-inspired machine intelligence platform, and biologically accurate neural network based on cortical learning algorithms.
⟡ nupic (https://github.com/numenta/nupic) - Numenta Platform for Intelligent Computing: a brain-inspired machine intelligence platform, and biologically accurate neural network based on cortical learning algorithms.
⟡ PredictionIO (http://predictionio.incubator.apache.org/index.html) - machine learning server built on Hadoop, Mahout and Cascading.
⟡ PyTorch Geometric Temporal (https://github.com/benedekrozemberczki/pytorch_geometric_temporal) - a temporal extension library for PyTorch Geometric .
⟡ RL4J (https://github.com/deeplearning4j/rl4j) - Reinforcement learning for Java and Scala. Includes Deep-Q learning and A3C algorithms, and integrates with Open AI's Gym. Runs in the Deeplearning4j ecosystem. 
@@ -548,8 +536,8 @@
⟡ PivotalR (https://github.com/pivotalsoftware/PivotalR) - R on Pivotal HD / HAWQ and PostgreSQL.
⟡ Rakam (https://github.com/rakam-io/rakam) - open-source real-time custom analytics platform powered by Postgresql, Kinesis and PrestoDB. 
⟡ Qubole (https://www.qubole.com/) - auto-scaling Hadoop cluster, built-in data connectors.
⟡ SnappyData (https://github.com/SnappyDataInc/snappydata) - a distributed in-memory data store for real-time operational analytics, delivering stream analytics, OLTP (online transaction processing) and OLAP 
(online analytical processing) built on Spark in a single integrated cluster.
⟡ SnappyData (https://github.com/SnappyDataInc/snappydata) - a distributed in-memory data store for real-time operational analytics, delivering stream analytics, OLTP (online transaction processing) and OLAP (online analytical 
processing) built on Spark in a single integrated cluster.
⟡ Snowplow (https://github.com/snowplow/snowplow) - enterprise-strength web and event analytics, powered by Hadoop, Kinesis, Redshift and Postgres.
⟡ SparkR (http://amplab-extras.github.io/SparkR-pkg/) - R frontend for Spark.
⟡ Splunk (https://www.splunk.com/) - analyzer for machine-generated data.
@@ -572,15 +560,14 @@
⟡ LinkedIn Cleo (https://github.com/linkedin/cleo) - is a flexible software library for enabling rapid development of partial, out-of-order and real-time typeahead search.
⟡ LinkedIn Galene (https://engineering.linkedin.com/search/did-you-mean-galene) - search architecture at LinkedIn.
⟡ LinkedIn Zoie (https://github.com/senseidb/zoie) - is a realtime search/indexing system written in Java.
⟡ MG4J (http://mg4j.di.unimi.it/) - MG4J (Managing Gigabytes for Java) is a full-text search engine for large document collections written in Java. It is highly customisable, high-performance and provides 
state-of-the-art features and new research algorithms.
⟡ MG4J (http://mg4j.di.unimi.it/) - MG4J (Managing Gigabytes for Java) is a full-text search engine for large document collections written in Java. It is highly customisable, high-performance and provides state-of-the-art features and 
new research algorithms.
⟡ Sphinx Search Server (http://sphinxsearch.com/) - fulltext search engine.
⟡ Vespa
 (http://vespa.ai/) - is an engine for low-latency computation over large data sets. It stores and indexes your data such that queries, selection and processing over the data can be performed at serving time.
⟡ Facebook Faiss (https://github.com/facebookresearch/faiss) - is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up 
to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning. Faiss is written in C++ with complete wrappers for Python/numpy.
⟡ Annoy (https://github.com/spotify/annoy) - is a C++ library with Python bindings to search for points in space that are close to a given query point. It also creates large read-only file-based data structures 
that are mmapped into memory so that many processes may share the same data.
⟡ Vespa (http://vespa.ai/) - is an engine for low-latency computation over large data sets. It stores and indexes your data such that queries, selection and processing over the data can be performed at serving time.
⟡ Facebook Faiss (https://github.com/facebookresearch/faiss) - is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do 
not fit in RAM. It also contains supporting code for evaluation and parameter tuning. Faiss is written in C++ with complete wrappers for Python/numpy.
⟡ Annoy (https://github.com/spotify/annoy) - is a C++ library with Python bindings to search for points in space that are close to a given query point. It also creates large read-only file-based data structures that are mmapped into 
memory so that many processes may share the same data.
⟡ Weaviate (https://github.com/semi-technologies/weaviate) - Weaviate is a GraphQL-based semantic search engine with built-in (word) embeddings.
MySQL forks and evolutions
@@ -657,8 +644,8 @@
⟡ Arbor (https://github.com/samizdatco/arbor) - graph visualization library using web workers and jQuery.
⟡ Banana (https://github.com/LucidWorks/banana) - visualize logs and time-stamped data stored in Solr. Port of Kibana.
⟡ Bloomery (https://github.com/ufukomer/bloomery) - Web UI for Impala.
⟡ Bokeh (http://bokeh.pydata.org/en/latest/) - A powerful Python interactive visualization library that targets modern web browsers for presentation, with the goal of providing elegant, concise construction of 
novel graphics in the style of D3.js, but also delivering this capability with high-performance interactivity over very large or streaming datasets.
⟡ Bokeh (http://bokeh.pydata.org/en/latest/) - A powerful Python interactive visualization library that targets modern web browsers for presentation, with the goal of providing elegant, concise construction of novel graphics in the 
style of D3.js, but also delivering this capability with high-performance interactivity over very large or streaming datasets.
⟡ C3 (http://c3js.org/) - D3-based reusable chart library
⟡ CartoDB (https://github.com/CartoDB/cartodb) - open-source or freemium hosting for geospatial databases with powerful front-end editing capabilities and a robust API.
⟡ chartd (http://chartd.co/) - responsive, retina-compatible charts with just an img tag.
@@ -679,8 +666,7 @@
⟡ FnordMetric (https://metrictools.org/) - write SQL queries that return SVG charts rather than tables
⟡ Frappe Charts (https://frappe.io/charts) - GitHub-inspired simple and modern SVG charts for the web with zero dependencies.
⟡ Freeboard (https://github.com/Freeboard/freeboard) - open source real-time dashboard builder for IoT and other web mashups.
⟡ Gephi (https://github.com/gephi/gephi) - An award-winning open-source platform for visualizing and manipulating large graphs and network connections. It's like Photoshop, but for graphs. Available for Windows 
and Mac OS X.
⟡ Gephi (https://github.com/gephi/gephi) - An award-winning open-source platform for visualizing and manipulating large graphs and network connections. It's like Photoshop, but for graphs. Available for Windows and Mac OS X.
⟡ Google Charts (https://developers.google.com/chart/) - simple charting API.
⟡ Grafana (https://grafana.com/) - graphite dashboard frontend, editor and graph composer.
⟡ Graphite (http://graphiteapp.org/) - scalable Realtime Graphing.
@@ -692,24 +678,21 @@
⟡ Metricsgraphic.js (https://metricsgraphicsjs.org/) - a library built on top of D3 that is optimized for time-series data
⟡ NVD3 (http://nvd3.org/) - chart components for d3.js.
⟡ Peity (https://github.com/benpickles/peity) - Progressive SVG bar, line and pie charts.
⟡ Plot.ly (https://plot.ly/) - Easy-to-use web service that allows for rapid creation of complex charts, from heatmaps to histograms. Upload data to create and style charts with Plotly's online spreadsheet. Fork
others' plots.
⟡ Plot.ly (https://plot.ly/) - Easy-to-use web service that allows for rapid creation of complex charts, from heatmaps to histograms. Upload data to create and style charts with Plotly's online spreadsheet. Fork others' plots.
⟡ Plotly.js (https://github.com/plotly/plotly.js) The open source javascript graphing library that powers plotly.
⟡ Recline (https://github.com/okfn/recline) - simple but powerful library for building data applications in pure Javascript and HTML.
⟡ Redash (https://github.com/getredash/redash) - open-source platform to query and visualize data.
⟡ ReCharts (http://recharts.org/) - A composable charting library built on React components
⟡ Shiny (http://shiny.rstudio.com/) - a web application framework for R.
⟡ Sigma.js (https://github.com/jacomyal/sigma.js) - JavaScript library dedicated to graph drawing.
⟡ Superset (https://github.com/apache/incubator-superset) - a data exploration platform designed to be visual, intuitive and interactive, making it easy to slice, dice and visualize data and perform analytics at
the speed of thought.
⟡ Superset (https://github.com/apache/incubator-superset) - a data exploration platform designed to be visual, intuitive and interactive, making it easy to slice, dice and visualize data and perform analytics at the speed of thought.
⟡ Vega (https://github.com/vega/vega) - a visualization grammar.
⟡ Zeppelin (https://github.com/ZEPL/zeppelin) - a notebook-style collaborative data analysis.
⟡ Zing Charts (https://www.zingchart.com/) - JavaScript charting library for big data.
⟡ DataSphere Studio (https://github.com/WeBankFinTech/DataSphereStudio) - one-stop data application development management portal.
Internet of things and sensor data
⟡ Apache Edgent (Incubating)
 (http://edgent.apache.org/) - a programming model and micro-kernel style runtime that can be embedded in gateways and small footprint edge devices enabling local, real-time, analytics on the edge devices.
⟡ Apache Edgent (Incubating) (http://edgent.apache.org/) - a programming model and micro-kernel style runtime that can be embedded in gateways and small footprint edge devices enabling local, real-time, analytics on the edge devices.
⟡ Azure IoT Hub (https://azure.microsoft.com/en-us/services/iot-hub/) - Cloud-based bi-directional monitoring and messaging hub
⟡ TempoIQ (https://www.tempoiq.com/) - Cloud-based sensor analytics.
⟡ 2lemetry (http://2lemetry.com/) - Platform for Internet of things.
@@ -723,11 +706,10 @@
Interesting Readings
⟡ Big Data Benchmark (https://amplab.cs.berkeley.edu/benchmark/) - Benchmark of Redshift, Hive, Shark, Impala and Stinger/Tez.
⟡ NoSQL Comparison (https://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis) - Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase vs Couchbase vs Neo4j vs Hypertable vs ElasticSearch vs Accumulo vs 
VoltDB vs Scalaris comparison.
⟡ NoSQL Comparison
 (https://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis) - Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase vs Couchbase vs Neo4j vs Hypertable vs ElasticSearch vs Accumulo vs VoltDB vs Scalaris comparison.
⟡ Monitoring Kafka performance (https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics?ref=awesome) - Guide to monitoring Apache Kafka, including native methods for metrics collection.
⟡ Monitoring Hadoop performance
 (https://www.datadoghq.com/blog/monitor-hadoop-metrics?ref=awesome) - Guide to monitoring Hadoop, with an overview of Hadoop architecture, and native methods for metrics collection.
⟡ Monitoring Hadoop performance (https://www.datadoghq.com/blog/monitor-hadoop-metrics?ref=awesome) - Guide to monitoring Hadoop, with an overview of Hadoop architecture, and native methods for metrics collection.
⟡ Monitoring Cassandra performance (https://www.datadoghq.com/blog/how-to-monitor-cassandra-performance-metrics/?ref=awesome) - Guide to monitoring Cassandra, including native methods for metrics collection.
Interesting Papers
@@ -741,8 +723,7 @@
⟡ 2013 (https://amplab.cs.berkeley.edu/wp-content/uploads/2013/01/dmx1.pdf) - AMPLab - MLbase: A Distributed Machine-learning System.
⟡ 2013 (https://amplab.cs.berkeley.edu/wp-content/uploads/2013/02/shark_sigmod2013.pdf) - AMPLab - Shark: SQL and Rich Analytics at Scale.
⟡ 2013 (https://amplab.cs.berkeley.edu/wp-content/uploads/2013/05/grades-graphx_with_fonts.pdf) - AMPLab - GraphX: A Resilient Distributed Graph System on Spark.
⟡ 2013 (http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/40671.pdf) - Google - HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation 
Algorithm.
⟡ 2013 (http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/40671.pdf) - Google - HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm.
⟡ 2013 (http://research.microsoft.com/pubs/200169/now-vldb.pdf) - Microsoft - Scalable Progressive Analytics on Big Data in the Cloud.
⟡ 2013 (http://static.druid.io/docs/druid.pdf) - Metamarkets - Druid: A Real-time Analytical Data Store.
⟡ 2013 (http://db.disi.unitn.eu/pages/VLDBProgram/pdf/industry/p764-rae.pdf) - Google - Online, Asynchronous Schema Change in F1.
@@ -765,8 +746,7 @@
⟡ 2012 (http://vldb.org/pvldb/vol5/p1436_alexanderhall_vldb2012.pdf) - Google - Processing a trillion cells per mouse click.
⟡ 2012 (http://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf) - Google - Spanner: Google's Globally-Distributed Database.
⟡ 2011 (https://amplab.cs.berkeley.edu/wp-content/uploads/2011/06/euro118-ananthanarayanan.pdf) - AMPLab - Scarlett: Coping with Skewed Popularity Content in MapReduce Clusters.
⟡ 2011 (https://amplab.cs.berkeley.edu/wp-content/uploads/2011/06/Mesos-A-Platform-for-Fine-Grained-Resource-Sharing-in-the-Data-Center.pdf) - AMPLab - Mesos: A Platform for Fine-Grained Resource Sharing in the 
Data Center.
⟡ 2011 (https://amplab.cs.berkeley.edu/wp-content/uploads/2011/06/Mesos-A-Platform-for-Fine-Grained-Resource-Sharing-in-the-Data-Center.pdf) - AMPLab - Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center.
⟡ 2011 (http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36971.pdf) - Google - Megastore: Providing Scalable, Highly Available Storage for Interactive Services.
2001 - 2010
@@ -774,8 +754,7 @@
⟡ 2010 (https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf) - Facebook - Finding a needle in Haystack: Facebook's photo storage.
⟡ 2010 (https://amplab.cs.berkeley.edu/wp-content/uploads/2011/06/Spark-Cluster-Computing-with-Working-Sets.pdf) - AMPLab - Spark: Cluster Computing with Working Sets.
⟡ 2010 (http://kowshik.github.io/JPregel/pregel_paper.pdf) - Google - Pregel: A System for Large-Scale Graph Processing.
⟡ 2010 (http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36726.pdf) - Google - Large-scale Incremental Processing Using Distributed Transactions and Notifications, the basis of Percolator 
and Caffeine.
⟡ 2010 (http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36726.pdf) - Google - Large-scale Incremental Processing Using Distributed Transactions and Notifications, the basis of Percolator and Caffeine.
⟡ 2010 (http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf) - Google - Dremel: Interactive Analysis of Web-Scale Datasets.
⟡ 2010 (http://leoneu.github.io/) - Yahoo - S4: Distributed Stream Computing Platform.
⟡ 2009 (http://www.cs.umd.edu/~abadi/papers/hadoopdb.pdf) - HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. 
@@ -792,8 +771,8 @@
⟡ Machine Learning, Data Science and Deep Learning with Python 
 (https://www.manning.com/livevideo/machine-learning-data-science-and-deep-learning-with-python) - LiveVideo tutorial that covers machine learning, Tensorflow, artificial intelligence, and neural networks.
⟡ Data warehouse schema design - dimensional modeling and star schema (https://snir.dev/talks/data-warehouse-schema-design) - Introduction to schema design for data warehouse using the star schema method.
⟡ Elasticsearch 7 and Elastic Stack (https://www.manning.com/livevideo/elasticsearch-7-and-elastic-stack) - LiveVideo tutorial that covers searching, analyzing, and visualizing big data on a cluster with 
Elasticsearch, Logstash, Beats, Kibana, and more.
⟡ Elasticsearch 7 and Elastic Stack
 (https://www.manning.com/livevideo/elasticsearch-7-and-elastic-stack) - LiveVideo tutorial that covers searching, analyzing, and visualizing big data on a cluster with Elasticsearch, Logstash, Beats, Kibana, and more.
Books
@@ -803,24 +782,23 @@
⟡ Streaming Data (https://www.manning.com/books/streaming-data) - Streaming Data introduces the concepts and requirements of streaming and real-time data systems.
⟡ Storm Applied (https://www.manning.com/books/storm-applied) - Storm Applied is a practical guide to using Apache Storm for the real-world tasks associated with processing and analyzing real-time data streams.
⟡ Fundamentals of Stream Processing: Application Design, Systems, and Analytics 
(http://www.cambridge.org/us/academic/subjects/engineering/communications-and-signal-processing/fundamentals-stream-processing-application-design-systems-and-analytics) - This comprehensive, hands-on guide 
combining the fundamental building blocks and emerging research in stream processing is ideal for application designers, system builders, analytic developers, as well as students and researchers in the field.
(http://www.cambridge.org/us/academic/subjects/engineering/communications-and-signal-processing/fundamentals-stream-processing-application-design-systems-and-analytics) - This comprehensive, hands-on guide combining the fundamental 
building blocks and emerging research in stream processing is ideal for application designers, system builders, analytic developers, as well as students and researchers in the field.
⟡ Stream Data Processing: A Quality of Service Perspective (http://www.springer.com/us/book/9780387710020) - Presents a new paradigm suitable for stream and complex event processing.
⟡ Unified Log Processing (https://www.manning.com/books/event-streams-in-action) - Unified Log Processing is a practical guide to implementing a unified log of event streams (Kafka or Kinesis) in your business.
⟡ Kafka Streams in Action (https://www.manning.com/books/kafka-streams-in-action) - Kafka Streams in Action teaches you everything you need to know to implement stream processing on data flowing into your Kafka 
platform, allowing you to focus on getting more from your data without sacrificing time or effort.
⟡ Big Data (https://www.manning.com/books/big-data) - Big Data teaches you to build big data systems using an architecture that takes advantage of clustered hardware along with new tools designed specifically to
capture and analyze web-scale data.
⟡ Spark in Action (https://www.manning.com/books/spark-in-action) & Spark in Action 2nd Ed. (https://www.manning.com/books/spark-in-action-second-edition) - Spark in Action teaches you the theory and skills you 
need to effectively handle batch and streaming data using Spark. Fully updated for Spark 2.0.
⟡ Kafka Streams in Action (https://www.manning.com/books/kafka-streams-in-action) - Kafka Streams in Action teaches you everything you need to know to implement stream processing on data flowing into your Kafka platform, allowing you to
focus on getting more from your data without sacrificing time or effort.
⟡ Big Data (https://www.manning.com/books/big-data) - Big Data teaches you to build big data systems using an architecture that takes advantage of clustered hardware along with new tools designed specifically to capture and analyze 
web-scale data.
⟡ Spark in Action (https://www.manning.com/books/spark-in-action) & Spark in Action 2nd Ed. (https://www.manning.com/books/spark-in-action-second-edition) - Spark in Action teaches you the theory and skills you need to effectively 
handle batch and streaming data using Spark. Fully updated for Spark 2.0.
⟡ Kafka in Action (https://www.manning.com/books/kafka-in-action) - Kafka in Action is a fast-paced introduction to every aspect of working with Kafka you need to really reap its benefits.
⟡ Fusion in Action
 (https://www.manning.com/books/fusion-in-action) - Fusion in Action teaches you to build a full-featured data analytics pipeline, including document and data search and distributed data clustering.
⟡ Reactive Data Handling (https://www.manning.com/books/reactive-data-handling) - Reactive Data Handling is a collection of five hand-picked chapters, selected by Manuel Bernhardt, that introduce you to building
reactive applications capable of handling real-time processing with large data loads--free eBook! 
⟡ Fusion in Action (https://www.manning.com/books/fusion-in-action) - Fusion in Action teaches you to build a full-featured data analytics pipeline, including document and data search and distributed data clustering.
⟡ Reactive Data Handling (https://www.manning.com/books/reactive-data-handling) - Reactive Data Handling is a collection of five hand-picked chapters, selected by Manuel Bernhardt, that introduce you to building reactive applications 
capable of handling real-time processing with large data loads--free eBook! 
⟡ Azure Data Engineering (https://www.manning.com/books/azure-data-engineering) - A book about data engineering in general and the Azure platform specifically 
⟡ Grokking Streaming Systems (https://www.manning.com/books/grokking-streaming-systems) - Grokking Streaming Systems helps you unravel what streaming systems are, how they work, and whether they're right for 
your business. Written to be tool-agnostic, you'll be able to apply what you learn no matter which framework you choose.
⟡ Grokking Streaming Systems (https://www.manning.com/books/grokking-streaming-systems) - Grokking Streaming Systems helps you unravel what streaming systems are, how they work, and whether they're right for your business. Written to be
tool-agnostic, you'll be able to apply what you learn no matter which framework you choose.
Distributed systems
⟡ Distributed Systems for fun and profit (http://book.mixu.net/distsys/) - Theory of distributed systems. Includes sections on time and ordering, replication, and impossibility results.
@@ -835,7 +813,7 @@
 ⟡ Ice Bucket Challenge Data Visualization (https://www.youtube.com/watch?v=qTEchen97rQ)
 Other Awesome Lists
 Other Awesome Lists
- Other awesome lists awesome-awesomeness (https://github.com/bayandin/awesome-awesomeness).
- Even more lists awesome (https://github.com/sindresorhus/awesome).
- Another list? list (https://github.com/jnv/lists).