<p><a href="https://spark.apache.org/"><img src="https://cdn.rawgit.com/awesome-spark/awesome-spark/f78a16db/spark-logo-trademark.svg" align="right"></a></p>
<h1 id="awesome-spark-awesome">Awesome Spark <a href="https://github.com/sindresorhus/awesome"><img src="https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg" alt="Awesome" /></a></h1>
<p>A curated list of awesome <a href="https://spark.apache.org/">Apache Spark</a> packages and resources.</p>
<p><em>Apache Spark is an open-source cluster-computing framework. Originally developed at the <a href="https://www.universityofcalifornia.edu/">University of California</a>, <a href="https://amplab.cs.berkeley.edu/">Berkeley’s AMPLab</a>, the Spark codebase was later donated to the <a href="https://www.apache.org/">Apache Software Foundation</a>, which has maintained it since. Spark provides an interface for programming entire clusters with implicit data parallelism and fault-tolerance</em> (<a href="#wikipedia-2017">Wikipedia 2017</a>).</p>
<p>Users of Apache Spark may choose between the Python, R, Scala, and Java programming languages to interface with the Apache Spark APIs.</p>
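<p>As a minimal illustration of what such an interface looks like, the sketch below uses the Python binding (PySpark); the input file <code>people.csv</code> and the <code>country</code> column are placeholders, and equivalent DataFrame operations exist in the Scala, Java, and R APIs.</p>
<pre><code># Minimal PySpark sketch; "people.csv" and the "country" column are
# hypothetical and used only for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("awesome-spark-demo").getOrCreate()

# Read a CSV file into a DataFrame and run a simple distributed aggregation.
df = spark.read.csv("people.csv", header=True, inferSchema=True)
df.groupBy("country").count().show()

spark.stop()
</code></pre>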
<h2 id="packages">Packages</h2>
<h3 id="language-bindings">Language Bindings</h3>
<ul>
<li><a href="https://github.com/Kotlin/kotlin-spark-api">Kotlin for Apache Spark</a> <img src="https://img.shields.io/github/last-commit/Kotlin/kotlin-spark-api.svg"> - Kotlin API bindings and extensions.</li>
<li><a href="https://github.com/dotnet/spark">.NET for Apache Spark</a> <img src="https://img.shields.io/github/last-commit/dotnet/spark.svg"> - .NET bindings.</li>
<li><a href="https://github.com/rstudio/sparklyr">sparklyr</a> <img src="https://img.shields.io/github/last-commit/rstudio/sparklyr.svg"> - An alternative R backend, using <a href="https://github.com/hadley/dplyr"><code>dplyr</code></a>.</li>
<li><a href="https://github.com/tweag/sparkle">sparkle</a> <img src="https://img.shields.io/github/last-commit/tweag/sparkle.svg"> - Haskell on Apache Spark.</li>
<li><a href="https://github.com/sjrusso8/spark-connect-rs">spark-connect-rs</a> <img src="https://img.shields.io/github/last-commit/sjrusso8/spark-connect-rs.svg"> - Rust bindings.</li>
<li><a href="https://github.com/apache/spark-connect-go">spark-connect-go</a> <img src="https://img.shields.io/github/last-commit/apache/spark-connect-go.svg"> - Golang bindings.</li>
<li><a href="https://github.com/mdrakiburrahman/spark-connect-csharp">spark-connect-csharp</a> <img src="https://img.shields.io/github/last-commit/mdrakiburrahman/spark-connect-csharp.svg"> - C# bindings.</li>
</ul>
<h3 id="notebooks-and-ides">Notebooks and IDEs</h3>
<ul>
<li><a href="https://almond.sh/">almond</a> <img src="https://img.shields.io/github/last-commit/almond-sh/almond.svg"> - A Scala kernel for <a href="https://jupyter.org/">Jupyter</a>.</li>
<li><a href="https://zeppelin.incubator.apache.org/">Apache Zeppelin</a> <img src="https://img.shields.io/github/last-commit/apache/zeppelin.svg"> - Web-based notebook that enables interactive data analytics with pluggable backends, integrated plotting, and extensive Spark support out of the box.</li>
<li><a href="https://polynote.org/">Polynote</a> <img src="https://img.shields.io/github/last-commit/polynote/polynote.svg"> - An IDE-inspired polyglot notebook that supports mixing multiple languages in one notebook and sharing data between them seamlessly. It encourages reproducible notebooks with its immutable data model. Originated at <a href="https://medium.com/netflix-techblog/open-sourcing-polynote-an-ide-inspired-polyglot-notebook-7f929d3f447">Netflix</a>.</li>
<li><a href="https://github.com/jupyter-incubator/sparkmagic">sparkmagic</a> <img src="https://img.shields.io/github/last-commit/jupyter-incubator/sparkmagic.svg"> - <a href="https://jupyter.org/">Jupyter</a> magics and kernels for working interactively with remote Spark clusters through <a href="https://github.com/cloudera/livy">Livy</a>.</li>
</ul>
<h3 id="general-purpose-libraries">General Purpose Libraries</h3>
<ul>
<li><a href="https://github.com/yaooqinn/itachi">itachi</a> <img src="https://img.shields.io/github/last-commit/yaooqinn/itachi.svg"> - A library that brings useful functions from modern database management systems to Apache Spark.</li>
<li><a href="https://github.com/mrpowers-io/spark-daria">spark-daria</a> <img src="https://img.shields.io/github/last-commit/mrpowers-io/spark-daria.svg"> - A Scala library with essential Spark functions and extensions to make you more productive.</li>
<li><a href="https://github.com/mrpowers-io/quinn">quinn</a> <img src="https://img.shields.io/github/last-commit/mrpowers-io/quinn.svg"> - A native PySpark implementation of spark-daria.</li>
<li><a href="https://github.com/apache/datafu/tree/master/datafu-spark">Apache DataFu</a> <img src="https://img.shields.io/github/last-commit/apache/datafu.svg"> - A library of general-purpose functions and UDFs.</li>
<li><a href="https://github.com/joblib/joblib-spark">Joblib Apache Spark Backend</a> <img src="https://img.shields.io/github/last-commit/joblib/joblib-spark.svg"> - <a href="https://github.com/joblib/joblib"><code>joblib</code></a> backend for running tasks on Spark clusters.</li>
</ul>
<h3 id="sql-data-sources">SQL Data Sources</h3>
<p>Spark SQL has <a href="https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html#manually-specifying-options">several built-in data sources</a> for files, including <code>csv</code>, <code>json</code>, <code>parquet</code>, <code>orc</code>, and <code>avro</code>. It also supports JDBC databases as well as Apache Hive. Additional data sources can be added by including the packages listed below, or by writing your own.</p>
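<p>As a rough sketch of how these sources are addressed, built-in formats have dedicated reader methods as well as the generic <code>format(...).load(...)</code> API, while formats contributed by external packages are referenced by their format name once the package is on the classpath; the file paths below are placeholders.</p>
<pre><code>from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Built-in sources: dedicated readers and the generic format/load API.
users = spark.read.parquet("users.parquet")
events = spark.read.format("json").load("events.json")

# A source added by an external package (e.g. the spark-xml package listed
# below) is selected by its format name; this line assumes that package is
# installed and is shown for illustration only.
# books = spark.read.format("xml").option("rowTag", "book").load("books.xml")
</code></pre>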
<ul>
<li><a href="https://github.com/databricks/spark-xml">Spark XML</a> <img src="https://img.shields.io/github/last-commit/databricks/spark-xml.svg"> - XML parser and writer.</li>
<li><a href="https://github.com/datastax/spark-cassandra-connector">Spark Cassandra Connector</a> <img src="https://img.shields.io/github/last-commit/datastax/spark-cassandra-connector.svg"> - Cassandra support, including a data source API and support for arbitrary queries.</li>
<li><a href="https://github.com/mongodb/mongo-spark">Mongo-Spark</a> <img src="https://img.shields.io/github/last-commit/mongodb/mongo-spark.svg"> - Official MongoDB connector.</li>
</ul>
<h3 id="storage">Storage</h3>
<ul>
<li><a href="https://github.com/delta-io/delta">Delta Lake</a> <img src="https://img.shields.io/github/last-commit/delta-io/delta.svg"> - Storage layer with ACID transactions (see the sketch after this list).</li>
<li><a href="https://github.com/apache/hudi">Apache Hudi</a> <img src="https://img.shields.io/github/last-commit/apache/hudi.svg"> - Upserts, deletes, and incremental processing on big data.</li>
<li><a href="https://github.com/apache/iceberg">Apache Iceberg</a> <img src="https://img.shields.io/github/last-commit/apache/iceberg.svg"> - An open table format for huge analytic datasets.</li>
<li><a href="https://docs.lakefs.io/integrations/spark.html">lakeFS</a> <img src="https://img.shields.io/github/last-commit/treeverse/lakefs.svg"> - Integration with the lakeFS atomic versioned storage layer.</li>
</ul>
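<p>As a sketch of how one of these table formats plugs into the standard DataFrame reader/writer API, the snippet below uses Delta Lake; it assumes the <code>delta-spark</code> package is installed and configured, and the path <code>/tmp/events</code> is only a placeholder.</p>
<pre><code>from pyspark.sql import SparkSession

# Assumes the delta-spark package is available; these two settings enable the
# Delta extensions as described in the Delta Lake documentation.
spark = (SparkSession.builder
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaSparkSessionCatalog")
         .getOrCreate())

# "/tmp/events" is a placeholder path used only for illustration.
df = spark.range(0, 5).withColumnRenamed("id", "event_id")
df.write.format("delta").mode("overwrite").save("/tmp/events")  # transactional write
spark.read.format("delta").load("/tmp/events").show()           # read back a snapshot
</code></pre>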
<h3 id="bioinformatics">Bioinformatics</h3>
<ul>
<li><a href="https://github.com/bigdatagenomics/adam">ADAM</a> <img src="https://img.shields.io/github/last-commit/bigdatagenomics/adam.svg"> - Set of tools designed to analyse genomics data.</li>
<li><a href="https://github.com/hail-is/hail">Hail</a> <img src="https://img.shields.io/github/last-commit/hail-is/hail.svg"> - Genetic analysis framework.</li>
</ul>
<h3 id="gis">GIS</h3>
<ul>
<li><a href="https://github.com/apache/incubator-sedona">Apache Sedona</a> <img src="https://img.shields.io/github/last-commit/apache/incubator-sedona.svg"> - Cluster computing system for processing large-scale spatial data.</li>
</ul>
<h3 id="graph-processing">Graph Processing</h3>
<ul>
<li><a href="https://github.com/graphframes/graphframes">GraphFrames</a> <img src="https://img.shields.io/github/last-commit/graphframes/graphframes.svg"> - DataFrame-based graph API.</li>
<li><a href="https://github.com/neo4j-contrib/neo4j-spark-connector">neo4j-spark-connector</a> <img src="https://img.shields.io/github/last-commit/neo4j-contrib/neo4j-spark-connector.svg"> - Bolt protocol based Neo4j connector with RDD, DataFrame, and GraphX / GraphFrames support.</li>
</ul>
<h3 id="machine-learning-extension">Machine Learning Extension</h3>
<ul>
<li><a href="https://systemml.apache.org/">Apache SystemML</a> <img src="https://img.shields.io/github/last-commit/apache/systemml.svg"> - Declarative machine learning framework on top of Spark.</li>
<li><a href="https://mahout.apache.org/users/sparkbindings/home.html">Mahout Spark Bindings</a> [status unknown] - Linear algebra DSL and optimizer with R-like syntax.</li>
<li><a href="http://keystone-ml.org/">KeystoneML</a> - Type-safe machine learning pipelines with RDDs.</li>
<li><a href="https://github.com/jpmml/jpmml-spark">JPMML-Spark</a> <img src="https://img.shields.io/github/last-commit/jpmml/jpmml-spark.svg"> - PMML transformer library for Spark ML.</li>
<li><a href="https://mitdbg.github.io/modeldb">ModelDB</a> <img src="https://img.shields.io/github/last-commit/mitdbg/modeldb.svg"> - A system to manage machine learning models for <code>spark.ml</code> and <a href="https://github.com/scikit-learn/scikit-learn"><code>scikit-learn</code></a> <img src="https://img.shields.io/github/last-commit/scikit-learn/scikit-learn.svg">.</li>
<li><a href="https://github.com/h2oai/sparkling-water">Sparkling Water</a> <img src="https://img.shields.io/github/last-commit/h2oai/sparkling-water.svg"> - <a href="http://www.h2o.ai/">H2O</a> interoperability layer.</li>
<li><a href="https://github.com/intel-analytics/BigDL">BigDL</a> <img src="https://img.shields.io/github/last-commit/intel-analytics/BigDL.svg"> - Distributed deep learning library.</li>
<li><a href="https://github.com/combust/mleap">MLeap</a> <img src="https://img.shields.io/github/last-commit/combust/mleap.svg"> - Execution engine and serialization format which supports deployment of <code>o.a.s.ml</code> models without a dependency on <code>SparkSession</code>.</li>
<li><a href="https://github.com/Azure/mmlspark">Microsoft ML for Apache Spark</a> <img src="https://img.shields.io/github/last-commit/Azure/mmlspark.svg"> - A distributed ML library with support for LightGBM, Vowpal Wabbit, OpenCV, deep learning, Cognitive Services, and model deployment.</li>
<li><a href="https://mlflow.org/docs/latest/python_api/mlflow.spark.html#module-mlflow.spark">MLflow</a> <img src="https://img.shields.io/github/last-commit/mlflow/mlflow.svg"> - Machine learning orchestration platform.</li>
</ul>
<h3 id="middleware">Middleware</h3>
<ul>
<li><a href="https://github.com/apache/incubator-livy">Livy</a> <img src="https://img.shields.io/github/last-commit/apache/incubator-livy.svg"> - REST server with extensive language support (Python, R, Scala), the ability to maintain interactive sessions, and object sharing.</li>
<li><a href="https://github.com/spark-jobserver/spark-jobserver">spark-jobserver</a> <img src="https://img.shields.io/github/last-commit/spark-jobserver/spark-jobserver.svg"> - Simple Spark-as-a-Service which supports object sharing using so-called named objects. JVM only.</li>
<li><a href="https://github.com/apache/incubator-toree">Apache Toree</a> <img src="https://img.shields.io/github/last-commit/apache/incubator-toree.svg"> - IPython protocol based middleware for interactive applications.</li>
<li><a href="https://github.com/apache/kyuubi">Apache Kyuubi</a> <img src="https://img.shields.io/github/last-commit/apache/kyuubi.svg"> - A distributed multi-tenant JDBC server for large-scale data processing and analytics, built on top of Apache Spark.</li>
</ul>
<h3 id="monitoring">Monitoring</h3>
<ul>
<li><a href="https://github.com/datamechanics/delight">Data Mechanics Delight</a> <img src="https://img.shields.io/github/last-commit/datamechanics/delight.svg"> - Cross-platform monitoring tool (Spark UI / Spark History Server replacement).</li>
</ul>
<h3 id="utilities">Utilities</h3>
<ul>
<li><a href="https://github.com/Tubular/sparkly">sparkly</a> <img src="https://img.shields.io/github/last-commit/Tubular/sparkly.svg"> - Helpers & syntactic sugar for PySpark.</li>
<li><a href="https://github.com/nchammas/flintrock">Flintrock</a> <img src="https://img.shields.io/github/last-commit/nchammas/flintrock.svg"> - A command-line tool for launching Spark clusters on EC2.</li>
<li><a href="https://github.com/ironmussa/Optimus/">Optimus</a> <img src="https://img.shields.io/github/last-commit/ironmussa/Optimus.svg"> - Data cleansing and exploration utilities aimed at simplifying data cleaning.</li>
</ul>
<h3 id="natural-language-processing">Natural Language Processing</h3>
<ul>
<li><a href="https://github.com/JohnSnowLabs/spark-nlp">spark-nlp</a> <img src="https://img.shields.io/github/last-commit/JohnSnowLabs/spark-nlp.svg"> - Natural language processing library built on top of Apache Spark ML.</li>
</ul>
<h3 id="streaming">Streaming</h3>
<ul>
<li><a href="https://bahir.apache.org/">Apache Bahir</a> <img src="https://img.shields.io/github/last-commit/apache/bahir.svg"> - Collection of the streaming connectors excluded from Spark 2.0 (Akka, MQTT, Twitter, ZeroMQ).</li>
</ul>
<h3 id="interfaces">Interfaces</h3>
<ul>
<li><a href="https://beam.apache.org/">Apache Beam</a> <img src="https://img.shields.io/github/last-commit/apache/beam.svg"> - Unified data processing engine supporting both batch and streaming applications. Apache Spark is one of the supported execution environments.</li>
<li><a href="https://github.com/databricks/koalas">Koalas</a> <img src="https://img.shields.io/github/last-commit/databricks/koalas.svg"> - Pandas DataFrame API on top of Apache Spark.</li>
</ul>
<h3 id="data-quality">Data Quality</h3>
<ul>
<li><a href="https://github.com/awslabs/deequ">deequ</a> <img src="https://img.shields.io/github/last-commit/awslabs/deequ.svg"> - A library built on top of Apache Spark for defining “unit tests for data”, which measure data quality in large datasets.</li>
<li><a href="https://github.com/awslabs/python-deequ">python-deequ</a> <img src="https://img.shields.io/github/last-commit/awslabs/python-deequ.svg"> - Python API for Deequ.</li>
</ul>
<h3 id="testing">Testing</h3>
<ul>
<li><a href="https://github.com/holdenk/spark-testing-base">spark-testing-base</a> <img src="https://img.shields.io/github/last-commit/holdenk/spark-testing-base.svg"> - Collection of base test classes.</li>
<li><a href="https://github.com/mrpowers-io/spark-fast-tests">spark-fast-tests</a> <img src="https://img.shields.io/github/last-commit/mrpowers-io/spark-fast-tests.svg"> - A lightweight and fast testing framework.</li>
<li><a href="https://github.com/MrPowers/chispa">chispa</a> <img src="https://img.shields.io/github/last-commit/MrPowers/chispa.svg"> - PySpark test helpers with beautiful error messages.</li>
</ul>
<h3 id="web-archives">Web Archives</h3>
<ul>
<li><a href="https://github.com/archivesunleashed/aut">Archives Unleashed Toolkit</a> <img src="https://img.shields.io/github/last-commit/archivesunleashed/aut.svg"> - Open-source toolkit for analyzing web archives.</li>
</ul>
<h3 id="workflow-management">Workflow Management</h3>
<ul>
<li><a href="https://github.com/broadinstitute/cromwell#spark-backend">Cromwell</a> <img src="https://img.shields.io/github/last-commit/broadinstitute/cromwell.svg"> - Workflow management system with a <a href="https://github.com/broadinstitute/cromwell#spark-backend">Spark backend</a>.</li>
</ul>
<h2 id="resources">Resources</h2>
<h3 id="books">Books</h3>
<ul>
<li><a href="https://www.oreilly.com/library/view/learning-spark-2nd/9781492050032/">Learning Spark, 2nd Edition</a> - Introduction to the Spark API, with Spark 3.0 covered. A good source of knowledge about basic concepts.</li>
<li><a href="http://shop.oreilly.com/product/0636920035091.do">Advanced Analytics with Spark</a> - Useful collection of Spark processing patterns. Accompanying GitHub repository: <a href="https://github.com/sryza/aas">sryza/aas</a>.</li>
<li><a href="https://jaceklaskowski.gitbooks.io/mastering-apache-spark/">Mastering Apache Spark</a> - Interesting compilation of notes by <a href="https://github.com/jaceklaskowski">Jacek Laskowski</a>, focused on different aspects of Spark internals.</li>
<li><a href="https://www.manning.com/books/spark-in-action">Spark in Action</a> - A book in Manning’s “in action” family with over 400 pages. Starts gently, proceeds step by step, and covers a large number of topics. A free excerpt covers how to <a href="http://freecontent.manning.com/how-to-start-developing-spark-applications-in-eclipse/">set up Eclipse for Spark application development</a> and how to bootstrap a new application using the provided Maven archetype. The accompanying GitHub repo is <a href="https://github.com/spark-in-action/first-edition">here</a>.</li>
</ul>
<h3 id="papers">Papers</h3>
<ul>
<li><a href="https://arxiv.org/pdf/2009.08044.pdf">Large-Scale Intelligent Microservices</a> - Microsoft paper that presents an Apache Spark-based microservice orchestration framework that extends database operations to include web service primitives.</li>
<li><a href="https://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf">Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing</a> - Paper introducing the core distributed memory abstraction.</li>
<li><a href="https://amplab.cs.berkeley.edu/wp-content/uploads/2015/03/SparkSQLSigmod2015.pdf">Spark SQL: Relational Data Processing in Spark</a> - Paper introducing relational underpinnings, code generation, and the Catalyst optimizer.</li>
<li><a href="https://cs.stanford.edu/~matei/papers/2018/sigmod_structured_streaming.pdf">Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark</a> - Paper introducing Structured Streaming, a high-level declarative streaming API based on automatically incrementalizing a static relational query.</li>
</ul>
<h3 id="moocs">MOOCs</h3>
<ul>
<li><a href="https://www.edx.org/xseries/data-science-engineering-apache-spark">Data Science and Engineering with Apache Spark (edX XSeries)</a> - Series of five courses (<a href="https://www.edx.org/course/introduction-apache-spark-uc-berkeleyx-cs105x">Introduction to Apache Spark</a>, <a href="https://www.edx.org/course/distributed-machine-learning-apache-uc-berkeleyx-cs120x">Distributed Machine Learning with Apache Spark</a>, <a href="https://www.edx.org/course/big-data-analysis-apache-spark-uc-berkeleyx-cs110x">Big Data Analysis with Apache Spark</a>, <a href="https://www.edx.org/course/advanced-apache-spark-data-science-data-uc-berkeleyx-cs115x">Advanced Apache Spark for Data Science and Data Engineering</a>, <a href="https://www.edx.org/course/advanced-distributed-machine-learning-uc-berkeleyx-cs125x">Advanced Distributed Machine Learning with Apache Spark</a>) covering different aspects of software engineering and data science. Python oriented.</li>
<li><a href="https://www.coursera.org/learn/big-data-analysys">Big Data Analysis with Scala and Spark (Coursera)</a> - Scala oriented introductory course. Part of the <a href="https://www.coursera.org/specializations/scala">Functional Programming in Scala Specialization</a>.</li>
</ul>
<h3 id="workshops">Workshops</h3>
<ul>
<li><a href="http://ampcamp.berkeley.edu">AMP Camp</a> - Periodic training event organized by the <a href="https://amplab.cs.berkeley.edu/">UC Berkeley AMPLab</a>. A source of useful exercises and recorded workshops covering different tools from the <a href="https://amplab.cs.berkeley.edu/software/">Berkeley Data Analytics Stack</a>.</li>
</ul>
<h3 id="projects-using-spark">Projects Using Spark</h3>
<ul>
<li><a href="https://github.com/OryxProject/oryx">Oryx 2</a> - <a href="http://lambda-architecture.net/">Lambda architecture</a> platform built on Apache Spark and <a href="http://kafka.apache.org/">Apache Kafka</a>, specialized for real-time large-scale machine learning.</li>
<li><a href="https://github.com/linkedin/photon-ml">Photon ML</a> - A machine learning library supporting classical Generalized Mixed Models and Generalized Additive Mixed Effect Models.</li>
<li><a href="https://prediction.io/">PredictionIO</a> - Machine learning server for developers and data scientists to build and deploy predictive applications in a fraction of the time.</li>
<li><a href="https://github.com/Stratio/Crossdata">Crossdata</a> - Data integration platform with an extended DataSource API and multi-user environment.</li>
</ul>
<h3 id="docker-images">Docker Images</h3>
<ul>
<li><a href="https://hub.docker.com/r/apache/spark">apache/spark</a> - Official Apache Spark Docker images.</li>
<li><a href="https://github.com/jupyter/docker-stacks/tree/master/pyspark-notebook">jupyter/docker-stacks/pyspark-notebook</a> - PySpark with Jupyter Notebook and Mesos client.</li>
<li><a href="https://github.com/sequenceiq/docker-spark">sequenceiq/docker-spark</a> - YARN images from <a href="http://www.sequenceiq.com/">SequenceIQ</a>.</li>
<li><a href="https://hub.docker.com/r/datamechanics/spark">datamechanics/spark</a> - An easy-to-set-up Docker image for Apache Spark from <a href="https://www.datamechanics.co/">Data Mechanics</a>.</li>
</ul>
<h3 id="miscellaneous">Miscellaneous</h3>
<ul>
<li><a href="https://gitter.im/spark-scala/Lobby">Spark with Scala Gitter channel</a> - “<em>A place to discuss and ask questions about using Scala for Spark programming</em>” started by <a href="https://github.com/deanwampler"><span class="citation" data-cites="deanwampler">@deanwampler</span></a>.</li>
<li><a href="http://apache-spark-user-list.1001560.n3.nabble.com/">Apache Spark User List</a> and <a href="http://apache-spark-developers-list.1001551.n3.nabble.com/">Apache Spark Developers List</a> - Mailing lists dedicated to usage questions and development topics respectively.</li>
</ul>
<h2 id="references">References</h2>
<p id="wikipedia-2017">
Wikipedia. 2017. “Apache Spark — Wikipedia, the Free Encyclopedia.”
<a href="https://en.wikipedia.org/w/index.php?title=Apache_Spark&oldid=781182753" class="uri">https://en.wikipedia.org/w/index.php?title=Apache_Spark&oldid=781182753</a>.
</p>
<h2 id="license">License</h2>
<p xmlns:dct="http://purl.org/dc/terms/">
<a rel="license" href="http://creativecommons.org/publicdomain/mark/1.0/"><img src="https://mirrors.creativecommons.org/presskit/buttons/88x31/svg/publicdomain.svg" style="border-style: none;" alt="Public Domain Mark" /></a> <br />
This work (<span property="dct:title">Awesome Spark</span>, by <a href="https://github.com/awesome-spark/awesome-spark" rel="dct:creator">https://github.com/awesome-spark/awesome-spark</a>), identified by <a href="https://github.com/zero323" rel="dct:publisher"><span property="dct:title">Maciej Szymkiewicz</span></a>, is free of known copyright restrictions.
</p>
<p>Apache Spark, Spark, Apache, and the Spark logo are <a href="https://www.apache.org/foundation/marks/">trademarks</a> of <a href="http://www.apache.org">The Apache Software Foundation</a>. This compilation is not endorsed by The Apache Software Foundation.</p>
<p>Inspired by <a href="https://github.com/sindresorhus/awesome">sindresorhus/awesome</a>.</p>