update lists

2025-07-18 22:22:32 +02:00
parent 55bed3b4a1
commit 5916c5c074
3078 changed files with 331679 additions and 357255 deletions

@@ -18,43 +18,6 @@ href="#wikipedia-2017">Wikipedia 2017</a>).</p>
<p>Users of Apache Spark may choose between the Python, R,
Scala and Java programming languages to interface with the Apache Spark
APIs.</p>
<h2 id="contents">Contents</h2>
<ul>
<li><a href="#packages">Packages</a>
<ul>
<li><a href="#language-bindings">Language Bindings</a></li>
<li><a href="#notebooks-and-ides">Notebooks and IDEs</a></li>
<li><a href="#general-purpose-libraries">General Purpose
Libraries</a></li>
<li><a href="#sql-data-sources">SQL Data Sources</a></li>
<li><a href="#storage">Storage</a></li>
<li><a href="#bioinformatics">Bioinformatics</a></li>
<li><a href="#gis">GIS</a></li>
<li><a href="#time-series-analytics">Time Series Analytics</a></li>
<li><a href="#graph-processing">Graph Processing</a></li>
<li><a href="#machine-learning-extension">Machine Learning
Extension</a></li>
<li><a href="#middleware">Middleware</a></li>
<li><a href="#utilities">Utilities</a></li>
<li><a href="#natural-language-processing">Natural Language
Processing</a></li>
<li><a href="#streaming">Streaming</a></li>
<li><a href="#interfaces">Interfaces</a></li>
<li><a href="#testing">Testing</a></li>
<li><a href="#web-archives">Web Archives</a></li>
<li><a href="#workflow-management">Workflow Management</a></li>
</ul></li>
<li><a href="#resources">Resources</a>
<ul>
<li><a href="#books">Books</a></li>
<li><a href="#papers">Papers</a></li>
<li><a href="#moocs">MOOCS</a></li>
<li><a href="#workshops">Workshops</a></li>
<li><a href="#projects-using-spark">Projects Using Spark</a></li>
<li><a href="#docker-images">Docker Images</a></li>
<li><a href="#miscellaneous">Miscellaneous</a></li>
</ul></li>
</ul>
<h2 id="packages">Packages</h2>
<h3 id="language-bindings">Language Bindings</h3>
<ul>
@@ -62,12 +25,6 @@ Processing</a></li>
Apache Spark</a>
<img src="https://img.shields.io/github/last-commit/Kotlin/kotlin-spark-api.svg">
- Kotlin API bindings and extensions.</li>
<li><a href="https://github.com/yieldbot/flambo">Flambo</a>
<img src="https://img.shields.io/github/last-commit/yieldbot/flambo.svg">
- Clojure DSL.</li>
<li><a href="https://github.com/Microsoft/Mobius">Mobius</a>
<img src="https://img.shields.io/github/last-commit/Microsoft/Mobius.svg">
- C# bindings (Deprecated in favor of .NET for Apache Spark).</li>
<li><a href="https://github.com/dotnet/spark">.NET for Apache Spark</a>
<img src="https://img.shields.io/github/last-commit/dotnet/spark.svg"> -
.NET bindings.</li>
@@ -78,6 +35,18 @@ href="https://github.com/hadley/dplyr"><code>dplyr</code></a>.</li>
<li><a href="https://github.com/tweag/sparkle">sparkle</a>
<img src="https://img.shields.io/github/last-commit/tweag/sparkle.svg">
- Haskell on Apache Spark.</li>
<li><a
href="https://github.com/sjrusso8/spark-connect-rs">spark-connect-rs</a>
<img src="https://img.shields.io/github/last-commit/sjrusso8/spark-connect-rs.svg">
- Rust bindings.</li>
<li><a
href="https://github.com/apache/spark-connect-go">spark-connect-go</a>
<img src="https://img.shields.io/github/last-commit/apache/spark-connect-go.svg">
- Golang bindings.</li>
<li><a
href="https://github.com/mdrakiburrahman/spark-connect-csharp">spark-connect-csharp</a>
<img src="https://img.shields.io/github/last-commit/mdrakiburrahman/spark-connect-csharp.svg">
- C# bindings.</li>
</ul>
<h3 id="notebooks-and-ides">Notebooks and IDEs</h3>
<ul>
@@ -96,12 +65,6 @@ multiple languages in one notebook, and sharing data between them
seamlessly. It encourages reproducible notebooks with its immutable data
model. Originating from <a
href="https://medium.com/netflix-techblog/open-sourcing-polynote-an-ide-inspired-polyglot-notebook-7f929d3f447">Netflix</a>.</li>
<li><a href="https://github.com/andypetrella/spark-notebook">Spark
Notebook</a>
<img src="https://img.shields.io/github/last-commit/spark-notebook/spark-notebook.svg">
- Scalable and stable Scala and Spark focused notebook bridging the gap
between JVM and Data Scientists (incl. extendable, typesafe and reactive
charts).</li>
<li><a
href="https://github.com/jupyter-incubator/sparkmagic">sparkmagic</a>
<img src="https://img.shields.io/github/last-commit/jupyter-incubator/sparkmagic.svg">
@@ -113,19 +76,16 @@ notebooks.</li>
</ul>
<h3 id="general-purpose-libraries">General Purpose Libraries</h3>
<ul>
<li><a href="http://succinct.cs.berkeley.edu/">Succinct</a>
<img src="https://img.shields.io/github/last-commit/amplab/succinct.svg">-
Support for efficient queries on compressed data.</li>
<li><a href="https://github.com/yaooqinn/itachi">itachi</a>
<img src="https://img.shields.io/github/last-commit/yaooqinn/itachi.svg">
- A library that brings useful functions from modern database management
systems to Apache Spark.</li>
<li><a href="https://github.com/mrpowers/spark-daria">spark-daria</a>
<img src="https://img.shields.io/github/last-commit/mrpowers/spark-daria.svg">
<li><a href="https://github.com/mrpowers-io/spark-daria">spark-daria</a>
<img src="https://img.shields.io/github/last-commit/mrpowers-io/spark-daria.svg">
- A Scala library with essential Spark functions and extensions to make
you more productive.</li>
<li><a href="https://github.com/mrpowers/quinn">quinn</a>
<img src="https://img.shields.io/github/last-commit/mrpowers/quinn.svg">
<li><a href="https://github.com/mrpowers-io/quinn">quinn</a>
<img src="https://img.shields.io/github/last-commit/mrpowers-io/quinn.svg">
- A native PySpark implementation of spark-daria.</li>
<li><a
href="https://github.com/apache/datafu/tree/master/datafu-spark">Apache
@@ -147,15 +107,6 @@ built-in Data Sources</a> for files. These include <code>csv</code>,
Hive. Additional data sources can be added by including the packages
listed below, or writing your own.</p>
<ul>
<li><a href="https://github.com/databricks/spark-csv">Spark CSV</a>
<img src="https://img.shields.io/github/last-commit/databricks/spark-csv.svg">
- CSV reader and writer (obsolete since Spark 2.0 <a
href="https://issues.apache.org/jira/browse/SPARK-12833">[SPARK-12833]</a>).</li>
<li><a href="https://github.com/databricks/spark-avro">Spark Avro</a>
<img src="https://img.shields.io/github/last-commit/databricks/spark-avro.svg">
- <a href="https://avro.apache.org/">Apache Avro</a> reader and writer
(obsolete since Spark 2.4 <a
href="https://issues.apache.org/jira/browse/SPARK-24768">[SPARK-24768]</a>).</li>
<li><a href="https://github.com/databricks/spark-xml">Spark XML</a>
<img src="https://img.shields.io/github/last-commit/databricks/spark-xml.svg">
- XML parser and writer.</li>
@@ -165,62 +116,43 @@ Cassandra Connector</a>
<img src="https://img.shields.io/github/last-commit/datastax/spark-cassandra-connector.svg">
- Cassandra support including data source and API and support for
arbitrary queries.</li>
<li><a href="https://github.com/basho/spark-riak-connector">Spark Riak
Connector</a>
<img src="https://img.shields.io/github/last-commit/basho/spark-riak-connector.svg">
- Riak TS &amp; Riak KV connector.</li>
<li><a href="https://github.com/mongodb/mongo-spark">Mongo-Spark</a>
<img src="https://img.shields.io/github/last-commit/mongodb/mongo-spark.svg">
- Official MongoDB connector.</li>
<li><a
href="https://github.com/orientechnologies/spark-orientdb">OrientDB-Spark</a>
<img src="https://img.shields.io/github/last-commit/orientechnologies/spark-orientdb.svg">
- Official OrientDB connector.</li>
</ul>
<h3 id="storage">Storage</h3>
<ul>
<li><p><a href="https://github.com/delta-io/delta">Delta Lake</a>
<li><a href="https://github.com/delta-io/delta">Delta Lake</a>
<img src="https://img.shields.io/github/last-commit/delta-io/delta.svg">
- Storage layer with ACID transactions.</p></li>
<li><p><a
href="https://docs.lakefs.io/integrations/spark.html">lakeFS</a>
- Storage layer with ACID transactions.</li>
<li><a href="https://github.com/apache/hudi">Apache Hudi</a>
<img src="https://img.shields.io/github/last-commit/apache/hudi.svg"> -
Upserts, Deletes And Incremental Processing on Big Data.</li>
<li><a href="https://github.com/apache/iceberg">Apache Iceberg</a>
<img src="https://img.shields.io/github/last-commit/apache/iceberg.svg">
- Open table format for huge analytic datasets.</li>
<li><a href="https://docs.lakefs.io/integrations/spark.html">lakeFS</a>
<img src="https://img.shields.io/github/last-commit/treeverse/lakefs.svg">
- Integration with the lakeFS atomic versioned storage layer. ###
Bioinformatics</p></li>
<li><p><a href="https://github.com/bigdatagenomics/adam">ADAM</a>
- Integration with the lakeFS atomic versioned storage layer.</li>
</ul>
<h3 id="bioinformatics">Bioinformatics</h3>
<ul>
<li><a href="https://github.com/bigdatagenomics/adam">ADAM</a>
<img src="https://img.shields.io/github/last-commit/bigdatagenomics/adam.svg">
- Set of tools designed to analyse genomics data.</p></li>
<li><p><a href="https://github.com/hail-is/hail">Hail</a>
- Set of tools designed to analyse genomics data.</li>
<li><a href="https://github.com/hail-is/hail">Hail</a>
<img src="https://img.shields.io/github/last-commit/hail-is/hail.svg"> -
Genetic analysis framework.</p></li>
Genetic analysis framework.</li>
</ul>
<h3 id="gis">GIS</h3>
<ul>
<li><a href="https://github.com/harsha2010/magellan">Magellan</a>
<img src="https://img.shields.io/github/last-commit/harsha2010/magellan.svg">
- Geospatial analytics using Spark.</li>
<li><a href="https://github.com/apache/incubator-sedona">Apache
Sedona</a>
<img src="https://img.shields.io/github/last-commit/apache/incubator-sedona.svg">
- Cluster computing system for processing large-scale spatial data.</li>
</ul>
<h3 id="time-series-analytics">Time Series Analytics</h3>
<ul>
<li><a
href="https://github.com/cloudera/spark-timeseries">Spark-Timeseries</a>
<img src="https://img.shields.io/github/last-commit/cloudera/spark-timeseries.svg">
- Scala / Java / Python library for interacting with time series data on
Apache Spark.</li>
<li><a href="https://github.com/twosigma/flint">flint</a>
<img src="https://img.shields.io/github/last-commit/twosigma/flint.svg">
- A time series library for Apache Spark.</li>
</ul>
<h3 id="graph-processing">Graph Processing</h3>
<ul>
<li><a
href="https://github.com/neo4j-contrib/neo4j-mazerunner">Mazerunner</a>
<img src="https://img.shields.io/github/last-commit/neo4j-contrib/neo4j-mazerunner.svg">
- Graph analytics platform on top of Neo4j and GraphX.</li>
<li><a href="https://github.com/graphframes/graphframes">GraphFrames</a>
<img src="https://img.shields.io/github/last-commit/graphframes/graphframes.svg">
- Data frame based graph API.</li>
@@ -229,27 +161,9 @@ href="https://github.com/neo4j-contrib/neo4j-spark-connector">neo4j-spark-connec
<img src="https://img.shields.io/github/last-commit/neo4j-contrib/neo4j-spark-connector.svg">
- Bolt protocol based, Neo4j Connector with RDD, DataFrame and GraphX /
GraphFrames support.</li>
<li><a href="http://sparkling.ml">SparklingGraph</a>
<img src="https://img.shields.io/github/last-commit/sparkling-graph/sparkling-graph.svg">
- Library extending GraphX features with multiple functionalities useful
in graph analytics (measures, generators, link prediction etc.).</li>
</ul>
<h3 id="machine-learning-extension">Machine Learning Extension</h3>
<ul>
<li><a
href="https://github.com/Clustering4Ever/Clustering4Ever">Clustering4Ever</a>
<img src="https://img.shields.io/github/last-commit/Clustering4Ever/Clustering4Ever.svg">
- Scala and Spark API to benchmark and analyse clustering algorithms on
any vectorization you can generate.</li>
<li><a
href="https://github.com/irvingc/dbscan-on-spark">dbscan-on-spark</a>
<img src="https://img.shields.io/github/last-commit/irvingc/dbscan-on-spark.svg">
- An Implementation of the DBSCAN clustering algorithm on top of Apache
Spark by <a href="https://github.com/irvingc">irvingc</a> and based on
the paper from He, Yaobin, et al. <a
href="https://www.researchgate.net/profile/Yaobin_He/publication/260523383_MR-DBSCAN_a_scalable_MapReduce-based_DBSCAN_algorithm_for_heavily_skewed_data/links/0046353a1763ee2bdf000000.pdf">MR-DBSCAN:
a scalable MapReduce-based DBSCAN algorithm for heavily skewed
data</a>.</li>
<li><a href="https://systemml.apache.org/">Apache SystemML</a>
<img src="https://img.shields.io/github/last-commit/apache/systemml.svg">
- Declarative machine learning framework on top of Spark.</li>
@@ -257,18 +171,11 @@ data</a>.</li>
href="https://mahout.apache.org/users/sparkbindings/home.html">Mahout
Spark Bindings</a> [status unknown] - linear algebra DSL and optimizer
with R-like syntax.</li>
<li><a
href="https://github.com/databricks/spark-sklearn">spark-sklearn</a>
<img src="https://img.shields.io/github/last-commit/databricks/spark-sklearn.svg">
- Scikit-learn integration with distributed model training.</li>
<li><a href="http://keystone-ml.org/">KeystoneML</a> - Type safe machine
learning pipelines with RDDs.</li>
<li><a href="https://github.com/jpmml/jpmml-spark">JPMML-Spark</a>
<img src="https://img.shields.io/github/last-commit/jpmml/jpmml-spark.svg">
- PMML transformer library for Spark ML.</li>
<li><a href="https://github.com/cerndb/dist-keras">Distributed Keras</a>
<img src="https://img.shields.io/github/last-commit/cerndb/dist-keras.svg">
- Distributed deep learning framework with PySpark and Keras.</li>
<li><a href="https://mitdbg.github.io/modeldb">ModelDB</a>
<img src="https://img.shields.io/github/last-commit/mitdbg/modeldb.svg">
- A system to manage machine learning models for <code>spark.ml</code>
@@ -308,10 +215,6 @@ href="https://github.com/spark-jobserver/spark-jobserver">spark-jobserver</a>
<img src="https://img.shields.io/github/last-commit/spark-jobserver/spark-jobserver.svg">
- Simple Spark as a Service which supports object sharing using
so-called named objects. JVM only.</li>
<li><a href="https://github.com/Hydrospheredata/mist">Mist</a>
<img src="https://img.shields.io/github/last-commit/Hydrospheredata/mist.svg">
- Service for exposing Spark analytical jobs and machine learning models
as realtime, batch or reactive web services.</li>
<li><a href="https://github.com/apache/incubator-toree">Apache Toree</a>
<img src="https://img.shields.io/github/last-commit/apache/incubator-toree.svg">
- IPython protocol based middleware for interactive applications.</li>
@@ -330,17 +233,9 @@ replacement).</li>
</ul>
<h3 id="utilities">Utilities</h3>
<ul>
<li><a href="https://github.com/willb/silex">silex</a>
<img src="https://img.shields.io/github/last-commit/willb/silex.svg"> -
Collection of tools varying from ML extensions to additional RDD
methods.</li>
<li><a href="https://github.com/Tubular/sparkly">sparkly</a>
<img src="https://img.shields.io/github/last-commit/Tubular/sparkly.svg">
- Helpers &amp; syntactic sugar for PySpark.</li>
<li><a href="https://github.com/zero323/pyspark-stubs">pyspark-stubs</a>
<img src="https://img.shields.io/github/last-commit/zero323/pyspark-stubs.svg">
- Static type annotations for PySpark (obsolete since Spark 3.1. See <a
href="https://issues.apache.org/jira/browse/SPARK-32681">SPARK-32681</a>).</li>
<li><a href="https://github.com/nchammas/flintrock">Flintrock</a>
<img src="https://img.shields.io/github/last-commit/nchammas/flintrock.svg">
- A command-line tool for launching Spark clusters on EC2.</li>
@@ -351,11 +246,6 @@ data cleaning.</li>
</ul>
<h3 id="natural-language-processing">Natural Language Processing</h3>
<ul>
<li><a
href="https://github.com/databricks/spark-corenlp">spark-corenlp</a>
<img src="https://img.shields.io/github/last-commit/databricks/spark-corenlp.svg">
- DataFrame wrapper for <a
href="https://stanfordnlp.github.io/CoreNLP/">Stanford CoreNLP</a>.</li>
<li><a href="https://github.com/JohnSnowLabs/spark-nlp">spark-nlp</a>
<img src="https://img.shields.io/github/last-commit/JohnSnowLabs/spark-nlp.svg">
- Natural language processing library built on top of Apache Spark
@@ -375,29 +265,33 @@ MQTT, Twitter. ZeroMQ).</li>
Unified data processing engine supporting both batch and streaming
applications. Apache Spark is one of the supported execution
environments.</li>
<li><a href="https://github.com/blaze/blaze">Blaze</a>
<img src="https://img.shields.io/github/last-commit/blaze/blaze.svg"> -
Interface for querying larger than memory datasets using Pandas-like
syntax. It supports both Spark <code>DataFrames</code> and
<code>RDDs</code>.</li>
<li><a href="https://github.com/databricks/koalas">Koalas</a>
<img src="https://img.shields.io/github/last-commit/databricks/koalas.svg">
- Pandas DataFrame API on top of Apache Spark.</li>
</ul>
<h3 id="testing">Testing</h3>
<h3 id="data-quality">Data Quality</h3>
<ul>
<li><a href="https://github.com/awslabs/deequ">deequ</a>
<img src="https://img.shields.io/github/last-commit/awslabs/deequ.svg">
- Deequ is a library built on top of Apache Spark for defining “unit
tests for data”, which measure data quality in large datasets.</li>
<li><a href="https://github.com/awslabs/python-deequ">python-deequ</a>
<img src="https://img.shields.io/github/last-commit/awslabs/python-deequ.svg">
- Python API for Deequ.</li>
</ul>
<h3 id="testing">Testing</h3>
<ul>
<li><a
href="https://github.com/holdenk/spark-testing-base">spark-testing-base</a>
<img src="https://img.shields.io/github/last-commit/holdenk/spark-testing-base.svg">
- Collection of base test classes.</li>
<li><a
href="https://github.com/MrPowers/spark-fast-tests">spark-fast-tests</a>
<img src="https://img.shields.io/github/last-commit/MrPowers/spark-fast-tests.svg">
href="https://github.com/mrpowers-io/spark-fast-tests">spark-fast-tests</a>
<img src="https://img.shields.io/github/last-commit/mrpowers-io/spark-fast-tests.svg">
- A lightweight and fast testing framework.</li>
<li><a href="https://github.com/MrPowers/chispa">chispa</a>
<img src="https://img.shields.io/github/last-commit/MrPowers/chispa.svg">
- PySpark test helpers with beautiful error messages.</li>
</ul>
<h3 id="web-archives">Web Archives</h3>
<ul>
@@ -431,9 +325,6 @@ href="https://jaceklaskowski.gitbooks.io/mastering-apache-spark/">Mastering
Apache Spark</a> - Interesting compilation of notes by <a
href="https://github.com/jaceklaskowski">Jacek Laskowski</a>. Focused on
different aspects of Spark internals.</li>
<li><a href="https://github.com/awesome-spark/spark-gotchas">Spark
Gotchas</a> - Subjective compilation of tips, tricks and common
programming mistakes.</li>
<li><a href="https://www.manning.com/books/spark-in-action">Spark in
Action</a> - New book in Manning's “in Action” family with 400+
pages. Starts gently, step by step, and covers a large number of topics.
@@ -568,3 +459,5 @@ copyright restrictions.
compilation is not endorsed by The Apache Software Foundation.</p>
<p>Inspired by <a
href="https://github.com/sindresorhus/awesome">sindresorhus/awesome</a>.</p>
<p><a href="https://github.com/awesome-spark/awesome-spark">spark.md
GitHub</a></p>