
Awesome Spark
A curated list of awesome Apache Spark packages and resources.
Apache Spark is an open-source cluster-computing framework. Originally
developed at the University of California, Berkeley’s AMPLab, the Spark
codebase was later donated to the Apache Software Foundation, which has
maintained it since. Spark provides an interface for programming entire
clusters with implicit data parallelism and fault tolerance (Wikipedia 2017).
Users of Apache Spark may choose between the Python, R, Scala, and Java
programming languages to interface with the Apache Spark APIs.
Packages
Language Bindings
Notebooks and IDEs
- almond - A Scala kernel for Jupyter.
- Apache Zeppelin - Web-based notebook that enables interactive data
  analytics with pluggable backends, integrated plotting, and extensive
  Spark support out-of-the-box.
- Polynote - An IDE-inspired polyglot notebook that supports mixing
  multiple languages in one notebook and sharing data between them
  seamlessly. It encourages reproducible notebooks with its immutable data
  model. Originated at Netflix.
- sparkmagic - Jupyter magics and kernels for interactively working with
  remote Spark clusters through Livy.
General Purpose Libraries
- itachi - A library that brings useful functions from modern database
  management systems to Apache Spark.
- spark-daria - A Scala library with essential Spark functions and
  extensions to make you more productive.
- quinn - A native PySpark implementation of spark-daria.
- Apache DataFu - A library of general-purpose functions and UDFs.
- Joblib Apache Spark Backend - joblib backend for running tasks on Spark
  clusters.
SQL Data Sources
Spark SQL has several built-in data sources for files, including csv,
json, parquet, orc, and avro. It also supports JDBC databases as well as
Apache Hive. Additional data sources can be added by including the
packages listed below, or by writing your own.
Storage
- Delta Lake - Storage layer with ACID transactions.
- Apache Hudi - Upserts, deletes, and incremental processing on big data.
- Apache Iceberg - Open table format for huge analytic datasets.
- lakeFS - Integration with the lakeFS atomic versioned storage layer.
Bioinformatics
- ADAM - Set of tools designed to analyse genomics data.
- Hail - Genetic analysis framework.
GIS
- Apache Sedona - Cluster computing system for processing large-scale
  spatial data.
Graph Processing
Machine Learning Extension
- Apache SystemML - Declarative machine learning framework on top of Spark.
- Mahout Spark Bindings [status unknown] - Linear algebra DSL and
  optimizer with R-like syntax.
- KeystoneML - Type-safe machine learning pipelines with RDDs.
- JPMML-Spark - PMML transformer library for Spark ML.
- ModelDB - A system to manage machine learning models for spark.ml and
  scikit-learn.
- Sparkling Water - H2O interoperability layer.
- BigDL - Distributed deep learning library.
- MLeap - Execution engine and serialization format which supports
  deployment of o.a.s.ml models without a dependency on SparkSession.
- Microsoft ML for Apache Spark - A distributed machine learning library
  with support for LightGBM, Vowpal Wabbit, OpenCV, deep learning,
  Cognitive Services, and model deployment.
- MLflow - Machine learning orchestration platform.
Middleware
- Livy - REST server with extensive language support (Python, R, Scala),
  the ability to maintain interactive sessions, and object sharing.
- spark-jobserver - Simple Spark-as-a-Service which supports sharing
  objects using so-called named objects. JVM only.
- Apache Toree - IPython-protocol-based middleware for interactive
  applications.
- Apache Kyuubi - A distributed multi-tenant JDBC server for large-scale
  data processing and analytics, built on top of Apache Spark.
Monitoring
Utilities
- sparkly - Helpers & syntactic sugar for PySpark.
- Flintrock - A command-line tool for launching Spark clusters on EC2.
- Optimus - Data cleansing and exploration utilities aimed at simplifying
  data cleaning.
Natural Language Processing
- spark-nlp - Natural language processing library built on top of Apache
  Spark ML.
Streaming
- Apache Bahir - Collection of streaming connectors excluded from Spark
  2.0 (Akka, MQTT, Twitter, ZeroMQ).
Interfaces
- Apache Beam - Unified data processing engine supporting both batch and
  streaming applications. Apache Spark is one of the supported execution
  environments.
- Koalas - Pandas DataFrame API on top of Apache Spark.
Data quality
- deequ - A library built on top of Apache Spark for defining “unit tests
  for data”, which measure data quality in large datasets.
- python-deequ - Python API for Deequ.
Testing
Web Archives
Workflow Management
Resources
Books
Papers
MOOCs
Workshops
Projects Using Spark
- Oryx 2 - Lambda architecture platform built on Apache Spark and Apache
  Kafka with specialization for real-time large-scale machine learning.
- Photon ML - A machine learning library supporting classical Generalized
  Mixed Model and Generalized Additive Mixed Effect Model.
- PredictionIO - Machine learning server for developers and data
  scientists to build and deploy predictive applications in a fraction of
  the time.
- Crossdata - Data integration platform with extended DataSource API and
  multi-user environment.
Docker Images
Miscellaneous
References
Wikipedia. 2017. “Apache Spark — Wikipedia, the Free Encyclopedia.”
https://en.wikipedia.org/w/index.php?title=Apache_Spark&oldid=781182753.
License
This work (Awesome Spark, by
https://github.com/awesome-spark/awesome-spark), identified by Maciej
Szymkiewicz, is free of known copyright restrictions.
Apache Spark, Spark, Apache, and the Spark logo are
trademarks of
The Apache Software Foundation. This
compilation is not endorsed by The Apache Software Foundation.
Inspired by sindresorhus/awesome.