Hadoop
MapReduce - Hadoop MapReduce is a software framework for easily
writing applications that process vast amounts of data (multi-terabyte
data sets) in parallel on large clusters (thousands of nodes) of
commodity hardware in a reliable, fault-tolerant manner.
Spark - A multi-language
engine for executing data engineering, data science, and machine
learning on single-node machines or clusters.
- Spark Packages - A
community index of packages for Apache Spark.
- Deep Spark -
Connecting Apache Spark with different data stores. Deprecated.
- Spark
RDD API Examples - Examples by Zhen He.
- Livy - A REST
service for interacting with Spark.
- Delight - A
free, cross-platform monitoring tool (an alternative to the Spark UI
and Spark History Server).
AWS EMR - A web service
that makes it easy to quickly and cost-effectively process vast amounts
of data.
Data Mechanics - A
cloud-based platform deployed on Kubernetes that makes Apache Spark more
developer-friendly and cost-effective.
Tez - An application
framework that allows a complex directed acyclic graph (DAG) of tasks for
processing data.
Bistro - A
lightweight engine for general-purpose data processing, including both
batch and stream analytics. It is based on a novel data model
that represents data via functions and processes data via
column operations, as opposed to using only set operations as in
conventional approaches like MapReduce or SQL.
Batch ML
- H2O - Fast scalable machine
learning API for smarter applications.
- Mahout - An environment for
quickly creating scalable, performant machine learning applications.
- Spark
MLlib - Spark’s scalable machine learning library consisting of
common learning algorithms and utilities, including classification,
regression, clustering, collaborative filtering, dimensionality
reduction, as well as underlying optimization primitives.
Batch Graph
- GraphLab Create
- A machine learning platform that enables data scientists and app
developers to easily create intelligent apps at scale.
- Giraph - An iterative graph
processing system built for high scalability.
- Spark GraphX - Apache
Spark’s API for graphs and graph-parallel computation.
Batch SQL
- Presto - A
distributed SQL query engine designed to query large data sets
distributed over one or more heterogeneous data sources.
- Hive - Data warehouse software that
facilitates querying and managing large datasets residing in distributed
storage.
- Hivemall
- Scalable machine learning library for Hive/Hadoop.
- PyHive - Python
interface to Hive and Presto.
- Drill - Schema-free SQL
Query Engine for Hadoop, NoSQL and Cloud Storage.