Updating conversion, creating readmes

Jonas Zeunert
2024-04-19 23:37:46 +02:00
parent 3619ac710a
commit 08e75b0f0a
635 changed files with 30878 additions and 37344 deletions


@@ -1,12 +1,12 @@
 (https://spark.apache.org/)
 Awesome Spark !Awesome (https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg) (https://github.com/sindresorhus/awesome)
 Awesome Spark !Awesome (https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg) (https://github.com/sindresorhus/awesome)
A curated list of awesome Apache Spark (https://spark.apache.org/) packages and resources.
_Apache Spark is an open-source cluster-computing framework. Originally developed at the University of California (https://www.universityofcalifornia.edu/), Berkeley's AMPLab (https://amplab.cs.berkeley.edu/), 
the Spark codebase was later donated to the Apache Software Foundation (https://www.apache.org/), which has maintained it since. Spark provides an interface for programming entire clusters with implicit data 
parallelism and fault-tolerance_ (Wikipedia 2017 (#wikipedia-2017)).
_Apache Spark is an open-source cluster-computing framework. Originally developed at the University of California (https://www.universityofcalifornia.edu/), Berkeley's AMPLab (https://amplab.cs.berkeley.edu/), the Spark codebase was 
later donated to the Apache Software Foundation (https://www.apache.org/), which has maintained it since. Spark provides an interface for programming entire clusters with implicit data parallelism and fault-tolerance_ (Wikipedia 2017 
(#wikipedia-2017)).
Users of Apache Spark may choose between the Python, R, Scala and Java programming languages to interface with the Apache Spark APIs.
@@ -56,12 +56,11 @@
Notebooks and IDEs
⟡ almond (https://almond.sh/) - A scala kernel for Jupyter (https://jupyter.org/).
⟡ Apache Zeppelin (https://zeppelin.incubator.apache.org/) - Web-based notebook that enables interactive data analytics with pluggable backends, integrated plotting, and extensive Spark support out-of-the-box.
⟡ Polynote (https://polynote.org/) - Polynote: an IDE-inspired polyglot notebook. It supports mixing multiple languages in one notebook, and sharing data between them seamlessly. It encourages reproducible 
notebooks with its immutable data model. Originating from Netflix (https://medium.com/netflix-techblog/open-sourcing-polynote-an-ide-inspired-polyglot-notebook-7f929d3f447).
⟡ Spark Notebook
 (https://github.com/andypetrella/spark-notebook) - Scalable and stable Scala and Spark focused notebook bridging the gap between JVM and Data Scientists (incl. extendable, typesafe and reactive charts).
⟡ sparkmagic (https://github.com/jupyter-incubator/sparkmagic) - Jupyter (https://jupyter.org/) magics and kernels for interactively working with remote Spark clusters 
through Livy (https://github.com/cloudera/livy), in Jupyter notebooks.
⟡ Polynote (https://polynote.org/) - Polynote: an IDE-inspired polyglot notebook. It supports mixing multiple languages in one notebook, and sharing data between them seamlessly. It encourages reproducible notebooks with its immutable
data model. Originating from Netflix (https://medium.com/netflix-techblog/open-sourcing-polynote-an-ide-inspired-polyglot-notebook-7f929d3f447).
⟡ Spark Notebook (https://github.com/andypetrella/spark-notebook) - Scalable and stable Scala and Spark focused notebook bridging the gap between JVM and Data Scientists (incl. extendable, typesafe and reactive charts).
⟡ sparkmagic (https://github.com/jupyter-incubator/sparkmagic) - Jupyter (https://jupyter.org/) magics and kernels for interactively working with remote Spark clusters through Livy 
(https://github.com/cloudera/livy), in Jupyter notebooks.
General Purpose Libraries
@@ -74,8 +73,8 @@
SQL Data Sources
SparkSQL has several built-in Data Sources (https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html#manually-specifying-options) for files. These include csv, json, parquet, orc, and avro.
It also supports JDBC databases as well as Apache Hive. Additional data sources can be added by including the packages listed below, or writing your own.
SparkSQL has several built-in Data Sources (https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html#manually-specifying-options) for files. These include csv, json, parquet, orc, and avro. It also supports JDBC 
databases as well as Apache Hive. Additional data sources can be added by including the packages listed below, or writing your own.
⟡ Spark CSV (https://github.com/databricks/spark-csv) - CSV reader and writer (obsolete since Spark 2.0 SPARK-12833  (https://issues.apache.org/jira/browse/SPARK-12833)).
⟡ Spark Avro (https://github.com/databricks/spark-avro) - Apache Avro (https://avro.apache.org/) reader and writer (obsolete since Spark 2.4 SPARK-24768 (https://issues.apache.org/jira/browse/SPARK-24768)).
@@ -113,8 +112,8 @@
Machine Learning Extension
⟡ Clustering4Ever (https://github.com/Clustering4Ever/Clustering4Ever) - Scala and Spark API to benchmark and analyse clustering algorithms on any vectorization you can generate.
⟡ dbscan-on-spark (https://github.com/irvingc/dbscan-on-spark) - An Implementation of the DBSCAN clustering algorithm on top of Apache Spark by irvingc (https://github.com/irvingc) and based on the paper from 
He, Yaobin, et al. MR-DBSCAN: a scalable MapReduce-based DBSCAN algorithm for heavily skewed data 
⟡ dbscan-on-spark (https://github.com/irvingc/dbscan-on-spark) - An Implementation of the DBSCAN clustering algorithm on top of Apache Spark by irvingc (https://github.com/irvingc) and based on the paper from He, Yaobin, et al. 
MR-DBSCAN: a scalable MapReduce-based DBSCAN algorithm for heavily skewed data 
(https://www.researchgate.net/profile/Yaobin_He/publication/260523383_MR-DBSCAN_a_scalable_MapReduce-based_DBSCAN_algorithm_for_heavily_skewed_data/links/0046353a1763ee2bdf000000.pdf).
⟡ Apache SystemML (https://systemml.apache.org/) - Declarative machine learning framework on top of Spark.
⟡ Mahout Spark Bindings (https://mahout.apache.org/users/sparkbindings/home.html) *status unknown* - linear algebra DSL and optimizer with R-like syntax.
@@ -183,43 +182,40 @@
⟡ Learning Spark, 2nd Edition (https://www.oreilly.com/library/view/learning-spark-2nd/9781492050032/) - Introduction to Spark API with Spark 3.0 covered. Good source of knowledge about basic concepts.
⟡ Advanced Analytics with Spark (http://shop.oreilly.com/product/0636920035091.do) - Useful collection of Spark processing patterns. Accompanying GitHub repository: sryza/aas (https://github.com/sryza/aas).
⟡ Mastering Apache Spark (https://jaceklaskowski.gitbooks.io/mastering-apache-spark/) - Interesting compilation of notes by Jacek Laskowski (https://github.com/jaceklaskowski). Focused on different aspects of 
Spark internals.
⟡ Mastering Apache Spark (https://jaceklaskowski.gitbooks.io/mastering-apache-spark/) - Interesting compilation of notes by Jacek Laskowski (https://github.com/jaceklaskowski). Focused on different aspects of Spark internals.
⟡ Spark Gotchas (https://github.com/awesome-spark/spark-gotchas) - Subjective compilation of tips, tricks and common programming mistakes.
⟡ Spark in Action (https://www.manning.com/books/spark-in-action) - New book in Manning's "in action" family with 400+ pages. Starts gently, step-by-step, and covers a large number of topics. Free excerpt on 
how to set up Eclipse for Spark application development (http://freecontent.manning.com/how-to-start-developing-spark-applications-in-eclipse/) and how to bootstrap a new application using the provided Maven 
Archetype. You can find the accompanying GitHub repo here (https://github.com/spark-in-action/first-edition).
⟡ Spark in Action (https://www.manning.com/books/spark-in-action) - New book in Manning's "in action" family with 400+ pages. Starts gently, step-by-step, and covers a large number of topics. Free excerpt on how to set up Eclipse for 
Spark application development (http://freecontent.manning.com/how-to-start-developing-spark-applications-in-eclipse/) and how to bootstrap a new application using the provided Maven Archetype. You can find the accompanying GitHub repo 
here (https://github.com/spark-in-action/first-edition).
Papers
⟡ Large-Scale Intelligent Microservices
 (https://arxiv.org/pdf/2009.08044.pdf) - Microsoft paper that presents an Apache Spark-based micro-service orchestration framework that extends database operations to include web service primitives.
⟡ Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
 (https://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf) - Paper introducing a core distributed memory abstraction.
⟡ Spark SQL: Relational Data Processing in Spark
 (https://amplab.cs.berkeley.edu/wp-content/uploads/2015/03/SparkSQLSigmod2015.pdf) - Paper introducing relational underpinnings, code generation and Catalyst optimizer.
⟡ Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark (https://cs.stanford.edu/~matei/papers/2018/sigmod_structured_streaming.pdf) - Paper introducing Structured Streaming, a high-level 
declarative streaming API based on automatically incrementalizing a static relational query.
⟡ Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing (https://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf) - Paper introducing a core distributed memory abstraction.
⟡ Spark SQL: Relational Data Processing in Spark (https://amplab.cs.berkeley.edu/wp-content/uploads/2015/03/SparkSQLSigmod2015.pdf) - Paper introducing relational underpinnings, code generation and Catalyst optimizer.
⟡ Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark
 (https://cs.stanford.edu/~matei/papers/2018/sigmod_structured_streaming.pdf) - Paper introducing Structured Streaming, a high-level declarative streaming API based on automatically incrementalizing a static relational query.
MOOCS
⟡ Data Science and Engineering with Apache Spark (edX XSeries) (https://www.edx.org/xseries/data-science-engineering-apache-spark) - Series of five courses (Introduction to Apache Spark 
(https://www.edx.org/course/introduction-apache-spark-uc-berkeleyx-cs105x), Distributed Machine Learning with Apache Spark (https://www.edx.org/course/distributed-machine-learning-apache-uc-berkeleyx-cs120x), 
Big Data Analysis with Apache Spark (https://www.edx.org/course/big-data-analysis-apache-spark-uc-berkeleyx-cs110x), Advanced Apache Spark for Data Science and Data Engineering 
(https://www.edx.org/course/advanced-apache-spark-data-science-data-uc-berkeleyx-cs115x), Advanced Distributed Machine Learning with Apache Spark 
(https://www.edx.org/course/advanced-distributed-machine-learning-uc-berkeleyx-cs125x)) covering different aspects of software engineering and data science. Python oriented.
(https://www.edx.org/course/introduction-apache-spark-uc-berkeleyx-cs105x), Distributed Machine Learning with Apache Spark (https://www.edx.org/course/distributed-machine-learning-apache-uc-berkeleyx-cs120x), Big Data Analysis with 
Apache Spark (https://www.edx.org/course/big-data-analysis-apache-spark-uc-berkeleyx-cs110x), Advanced Apache Spark for Data Science and Data Engineering 
(https://www.edx.org/course/advanced-apache-spark-data-science-data-uc-berkeleyx-cs115x), Advanced Distributed Machine Learning with Apache Spark (https://www.edx.org/course/advanced-distributed-machine-learning-uc-berkeleyx-cs125x)) 
covering different aspects of software engineering and data science. Python oriented.
⟡ Big Data Analysis with Scala and Spark (Coursera) (https://www.coursera.org/learn/big-data-analysys) - Scala oriented introductory course. Part of Functional Programming in Scala Specialization 
(https://www.coursera.org/specializations/scala).
Workshops
⟡ AMP Camp (http://ampcamp.berkeley.edu) - Periodic training event organized by the UC Berkeley AMPLab (https://amplab.cs.berkeley.edu/). A source of useful exercises and recorded workshops covering different 
tools from the Berkeley Data Analytics Stack (https://amplab.cs.berkeley.edu/software/).
⟡ AMP Camp (http://ampcamp.berkeley.edu) - Periodic training event organized by the UC Berkeley AMPLab (https://amplab.cs.berkeley.edu/). A source of useful exercises and recorded workshops covering different tools from the Berkeley 
Data Analytics Stack (https://amplab.cs.berkeley.edu/software/).
Projects Using Spark
⟡ Oryx 2 (https://github.com/OryxProject/oryx) - Lambda architecture (http://lambda-architecture.net/) platform built on Apache Spark and Apache Kafka (http://kafka.apache.org/) with specialization for real-time
large scale machine learning.
⟡ Oryx 2 (https://github.com/OryxProject/oryx) - Lambda architecture (http://lambda-architecture.net/) platform built on Apache Spark and Apache Kafka (http://kafka.apache.org/) with specialization for real-time large scale machine 
learning.
⟡ Photon ML (https://github.com/linkedin/photon-ml) - A machine learning library supporting classical Generalized Mixed Model and Generalized Additive Mixed Effect Model.
⟡ PredictionIO (https://prediction.io/) - Machine Learning server for developers and data scientists to build and deploy predictive applications in a fraction of the time.
⟡ Crossdata (https://github.com/Stratio/Crossdata) - Data integration platform with extended DataSource API and multi-user environment.
@@ -234,10 +230,9 @@
Miscellaneous
- Spark with Scala Gitter channel (https://gitter.im/spark-scala/Lobby) - "_A place to discuss and ask questions about using Scala for Spark programming_" started by @deanwampler 
(https://github.com/deanwampler).
- Apache Spark User List (http://apache-spark-user-list.1001560.n3.nabble.com/) and Apache Spark Developers List (http://apache-spark-developers-list.1001551.n3.nabble.com/) - Mailing lists dedicated to usage 
questions and development topics respectively.
- Spark with Scala Gitter channel (https://gitter.im/spark-scala/Lobby) - "_A place to discuss and ask questions about using Scala for Spark programming_" started by @deanwampler (https://github.com/deanwampler).
- Apache Spark User List (http://apache-spark-user-list.1001560.n3.nabble.com/) and Apache Spark Developers List (http://apache-spark-developers-list.1001551.n3.nabble.com/) - Mailing lists dedicated to usage questions and development 
topics respectively.
References