
Awesome Spark 
A curated list of awesome Apache
Spark packages and resources.
Apache Spark is an open-source cluster-computing framework.
Originally developed at the University of
California, Berkeley’s
AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has
maintained it since. Spark provides an interface for programming entire
clusters with implicit data parallelism and fault-tolerance (Wikipedia 2017).
Users of Apache Spark may choose between the Python, R, Scala, and Java
programming languages to interface with the Apache Spark APIs.
Contents
Packages
Language Bindings
Notebooks and IDEs
- almond - A Scala kernel for Jupyter.
- Apache Zeppelin - Web-based notebook that enables interactive data analytics with pluggable backends, integrated plotting, and extensive Spark support out of the box.
- Polynote - An IDE-inspired polyglot notebook that supports mixing multiple languages in one notebook and sharing data between them seamlessly. It encourages reproducible notebooks with its immutable data model. Originated at Netflix.
- Spark Notebook - Scalable and stable Scala- and Spark-focused notebook bridging the gap between the JVM and data scientists (incl. extendable, typesafe and reactive charts).
- sparkmagic - Jupyter magics and kernels for interactively working with remote Spark clusters through Livy.
General Purpose Libraries
- Succinct - Support for efficient queries on compressed data.
- itachi - A library that brings useful functions from modern database management systems to Apache Spark.
- spark-daria - A Scala library with essential Spark functions and extensions to make you more productive.
- quinn - A native PySpark implementation of spark-daria.
- Apache DataFu - A library of general purpose functions and UDFs.
- Joblib Apache Spark Backend - joblib backend for running tasks on Spark clusters.
SQL Data Sources
Spark SQL has several built-in data sources for files. These include csv,
json, parquet, orc, and avro. It also supports JDBC databases as well as Apache
Hive. Additional data sources can be added by including the packages
listed below, or by writing your own.
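To make the convention concrete, here is a minimal Scala sketch of reading from a built-in file source and from JDBC. The paths, connection settings, and credentials are placeholders, and any non-built-in format additionally needs its package on the classpath (for example via spark-submit --packages).

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("data-sources-sketch")
  .getOrCreate()

// Built-in file-based source; the path and options are placeholders.
val events = spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/path/to/events.csv")

// Built-in JDBC source; connection details are placeholders.
val orders = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/shop")
  .option("dbtable", "public.orders")
  .option("user", "reader")
  .option("password", "secret")
  .load()

// A third-party data source registers its own short name for format(...)
// once its package is on the classpath.
```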
Storage
- Delta Lake - Storage layer with ACID transactions.
- lakeFS - Integration with the lakeFS atomic versioned storage layer.
Bioinformatics
- ADAM - Set of tools designed to analyse genomics data.
- Hail - Genetic analysis framework.
GIS
- Magellan - Geospatial analytics using Spark.
- Apache Sedona - Cluster computing system for processing large-scale spatial data.
Time Series Analytics
- Spark-Timeseries - Scala / Java / Python library for interacting with time series data on Apache Spark.
- flint - A time series library for Apache Spark.
Graph Processing
- Mazerunner - Graph analytics platform on top of Neo4j and GraphX.
- GraphFrames - DataFrame-based graph API (see the sketch after this list).
- neo4j-spark-connector - Bolt protocol based Neo4j connector with RDD, DataFrame and GraphX / GraphFrames support.
- SparklingGraph - Library extending GraphX features with multiple functionalities useful in graph analytics (measures, generators, link prediction etc.).
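To give a flavour of the DataFrame-based API mentioned above, here is a minimal GraphFrames sketch in Scala. The sample vertices and edges are made up, the column names id, src, and dst follow GraphFrames' documented conventions, and the graphframes package is assumed to be on the classpath.

```scala
import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame

val spark = SparkSession.builder().appName("graphframes-sketch").getOrCreate()
import spark.implicits._

// Vertices need an "id" column; edges need "src" and "dst" columns.
val vertices = Seq(("a", "Alice"), ("b", "Bob"), ("c", "Carol")).toDF("id", "name")
val edges = Seq(("a", "b"), ("b", "c"), ("c", "a")).toDF("src", "dst")

val g = GraphFrame(vertices, edges)

g.inDegrees.show()                        // per-vertex in-degree as a DataFrame
g.find("(x)-[]->(y); (y)-[]->(z)").show() // motif search over two hops
```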
Machine Learning Extensions
Middleware
- Livy - REST server with extensive language support (Python, R, Scala), the ability to maintain interactive sessions, and object sharing (see the sketch after this list).
- spark-jobserver - Simple Spark-as-a-Service which supports object sharing using so-called named objects. JVM only.
- Mist - Service for exposing Spark analytical jobs and machine learning models as realtime, batch or reactive web services.
- Apache Toree - IPython protocol based middleware for interactive applications.
- Apache Kyuubi - A distributed multi-tenant JDBC server for large-scale data processing and analytics, built on top of Apache Spark.
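As a rough illustration of driving Livy over its REST API, here is a small Scala sketch using the JDK's built-in HTTP client. The host and port (localhost:8998) and the hard-coded session id are assumptions about the deployment, and the response JSON is simply printed rather than parsed.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object LivySketch {
  // Assumed Livy endpoint; adjust host and port for your deployment.
  val livy = "http://localhost:8998"
  val client = HttpClient.newHttpClient()

  def post(path: String, json: String): String = {
    val request = HttpRequest.newBuilder()
      .uri(URI.create(livy + path))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(json))
      .build()
    client.send(request, HttpResponse.BodyHandlers.ofString()).body()
  }

  def main(args: Array[String]): Unit = {
    // Create an interactive Scala session.
    println(post("/sessions", """{"kind": "spark"}"""))
    // Submit a statement; the real session id comes back in the JSON above,
    // 0 is used here only as a placeholder.
    println(post("/sessions/0/statements", """{"code": "spark.range(100).count()"}"""))
  }
}
```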
Monitoring
Utilities
- silex - Collection of tools varying from ML extensions to additional RDD methods.
- sparkly - Helpers & syntactic sugar for PySpark.
- pyspark-stubs - Static type annotations for PySpark (obsolete since Spark 3.1; see SPARK-32681).
- Flintrock - A command-line tool for launching Spark clusters on EC2.
- Optimus - Data cleansing and exploration utilities aimed at simplifying data cleaning.
Natural Language Processing
Streaming
- Apache Bahir - Collection of the streaming connectors excluded from Spark 2.0 (Akka, MQTT, Twitter, ZeroMQ).
Interfaces
- Apache Beam - Unified data processing engine supporting both batch and streaming applications. Apache Spark is one of the supported execution environments.
- Blaze - Interface for querying larger-than-memory datasets using Pandas-like syntax. It supports both Spark DataFrames and RDDs.
- Koalas - Pandas DataFrame API on top of Apache Spark.
Testing
- deequ - A library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets (see the sketch after this list).
- spark-testing-base - Collection of base test classes.
- spark-fast-tests - A lightweight and fast testing framework.
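As a hedged illustration of the "unit tests for data" idea, here is a minimal Scala sketch against deequ's VerificationSuite. The sample DataFrame, column names, and check thresholds are made-up placeholders.

```scala
import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("deequ-sketch").getOrCreate()
import spark.implicits._

// Made-up sample data standing in for a real dataset.
val reviews = Seq((1, "ok"), (2, "good"), (3, "bad")).toDF("id", "rating")

val result = VerificationSuite()
  .onData(reviews)
  .addCheck(
    Check(CheckLevel.Error, "basic data quality")
      .hasSize(_ >= 3)   // at least three rows
      .isComplete("id")  // no nulls in "id"
      .isUnique("id"))   // no duplicate ids
  .run()

if (result.status != CheckStatus.Success) {
  println("Data quality checks failed")
}
```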
Web Archives
Workflow Management
Resources
Books
Papers
MOOCS
Workshops
Projects Using Spark
- Oryx 2 - Lambda architecture platform built on Apache Spark and Apache Kafka with specialization for real-time large-scale machine learning.
- Photon ML - A machine learning library supporting classical Generalized Mixed Model and Generalized Additive Mixed Effect Model.
- PredictionIO - Machine Learning server for developers and data scientists to build and deploy predictive applications in a fraction of the time.
- Crossdata - Data integration platform with extended DataSource API and multi-user environment.
Docker Images
Miscellaneous
References
Wikipedia. 2017. “Apache Spark — Wikipedia, the Free Encyclopedia.”
https://en.wikipedia.org/w/index.php?title=Apache_Spark&oldid=781182753.
License
This work (Awesome Spark, by
https://github.com/awesome-spark/awesome-spark),
identified by
Maciej Szymkiewicz, is free of known
copyright restrictions.
Apache Spark, Spark, Apache, and the Spark logo are
trademarks of
The Apache Software Foundation. This
compilation is not endorsed by The Apache Software Foundation.
Inspired by sindresorhus/awesome.