Updating conversion, creating readmes

This commit is contained in:
Jonas Zeunert
2024-04-19 23:37:46 +02:00
parent 3619ac710a
commit 08e75b0f0a
635 changed files with 30878 additions and 37344 deletions


@@ -1,4 +1,4 @@
 Awesome Data Engineering !Awesome (https://awesome.re/badge-flat2.svg) (https://github.com/sindresorhus/awesome)
A curated list of awesome things related to Data Engineering.
@@ -35,8 +35,7 @@
 - RQLite (https://github.com/rqlite/rqlite) - Replicated SQLite using the Raft consensus protocol.
 - MySQL (https://www.mysql.com/) - The world's most popular open source database.
- **TiDB** (https://github.com/pingcap/tidb) - TiDB is a distributed NewSQL database compatible with MySQL protocol. 
- **Percona XtraBackup** (https://www.percona.com/software/mysql-database/percona-xtrabackup) - Percona XtraBackup is a free, open source, complete online backup solution for all versions of Percona Server, MySQL® and MariaDB®.
- **mysql_utils** (https://github.com/pinterest/mysql_utils) - Pinterest MySQL Management Tools. 
 - MariaDB (https://mariadb.org/) - An enhanced, drop-in replacement for MySQL.
 - PostgreSQL (https://www.postgresql.org/) - The world's most advanced open source database.
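Tools like Percona XtraBackup above take "hot" backups: they copy a running database without blocking writers. As a minimal sketch of that idea (not XtraBackup itself, which works at MySQL's physical-file level), Python's stdlib exposes SQLite's online backup API:

```python
import sqlite3

# Online ("hot") backup: snapshot a live database while it stays open for use.
# Percona XtraBackup does this for MySQL; SQLite's stdlib backup API shows the
# same idea in miniature.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v TEXT)")
src.execute("INSERT INTO t (v) VALUES ('hello')")
src.commit()

dest = sqlite3.connect(":memory:")
src.backup(dest)  # pages are copied incrementally; src remains writable

src.execute("INSERT INTO t (v) VALUES ('after-backup')")  # not in the copy
backup_rows = dest.execute("SELECT v FROM t ORDER BY id").fetchall()
# backup_rows == [('hello',)] -- the snapshot predates the later insert
```

The key property, shared with XtraBackup, is that the source stays fully usable while the copy is taken.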
@@ -52,20 +51,18 @@
 - IonDB (https://github.com/iondbproject/iondb) - A key-value store for microcontroller and IoT applications.
- Column
 - Cassandra (https://cassandra.apache.org/) - The right choice when you need scalability and high availability without compromising performance.
- **Cassandra Calculator** (https://www.ecyrd.com/cassandracalculator/) - This simple form allows you to try out different values for your Apache Cassandra cluster and see what the impact is for your application.
- **CCM** (https://github.com/pcmanus/ccm) - A script to easily create and destroy an Apache Cassandra cluster on localhost. 
- **ScyllaDB** (https://github.com/scylladb/scylla) - NoSQL data store using the seastar framework, compatible with Apache Cassandra. 
 - HBase (https://hbase.apache.org/) - The Hadoop database, a distributed, scalable, big data store.
 - AWS Redshift (https://aws.amazon.com/redshift/) - A fast, fully managed, petabyte-scale data warehouse that makes it simple and cost-effective to analyze all your data using your existing business intelligence tools.
 - FiloDB (https://github.com/filodb/FiloDB) - Distributed. Columnar. Versioned. Streaming. SQL.
 - Vertica (https://www.vertica.com) - Distributed, MPP columnar database with extensive analytics SQL.
 - ClickHouse (https://clickhouse.tech) - Distributed columnar DBMS for OLAP. SQL.
- Document
 - MongoDB (https://www.mongodb.com) - An open-source, document database designed for ease of development and scaling.
- **Percona Server for MongoDB** (https://www.percona.com/software/mongo-database/percona-server-for-mongodb) - Percona Server for MongoDB® is a free, enhanced, fully compatible, open source, drop-in replacement for the MongoDB® Community Edition that includes enterprise-grade features and functionality.
- **MemDB** (https://github.com/rain1017/memdb) - Distributed Transactional In-Memory Database (based on MongoDB). 
 - Elasticsearch (https://www.elastic.co/) - Search & Analyze Data in Real Time.
 - Couchbase (https://www.couchbase.com/) - The highest performing NoSQL distributed database.
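The arithmetic behind tools like the Cassandra Calculator is small: for a replication factor RF, a QUORUM read or write must reach a majority of replicas, which also bounds how many replicas can be down. A sketch of that calculation:

```python
# For a given replication factor (RF), Cassandra's QUORUM consistency level
# requires a majority of replicas: floor(RF / 2) + 1.
def quorum(replication_factor: int) -> int:
    return replication_factor // 2 + 1

def tolerated_failures(replication_factor: int) -> int:
    # Replicas that can be down while QUORUM operations still succeed.
    return replication_factor - quorum(replication_factor)

for rf in (1, 3, 5):
    print(rf, quorum(rf), tolerated_failures(rf))
# RF=3 -> quorum 2, tolerates 1 replica down; RF=5 -> quorum 3, tolerates 2
```

This is why RF=3 is the common default: it is the smallest setting where QUORUM survives a single replica failure.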
@@ -89,25 +86,23 @@
 - Heroic (https://github.com/spotify/heroic) - A scalable time series database based on Cassandra and Elasticsearch, by Spotify.
 - Druid (https://github.com/apache/incubator-druid) - Column oriented distributed data store ideal for powering interactive applications.
 - Riak-TS (https://basho.com/products/riak-ts/) - Riak TS is the only enterprise-grade NoSQL time series database optimized specifically for IoT and Time Series data.
 - Akumuli (https://github.com/akumuli/Akumuli) - Akumuli is a numeric time-series database. It can be used to capture, store and process time-series data in real-time. The word "akumuli" can be translated from Esperanto as "accumulate".
 - Rhombus (https://github.com/Pardot/Rhombus) - A time-series object store for Cassandra that handles all the complexity of building wide row indexes.
 - Dalmatiner DB (https://github.com/dalmatinerdb/dalmatinerdb) - Fast distributed metrics database.
 - Blueflood (https://github.com/rackerlabs/blueflood) - A distributed system designed to ingest and process time series data.
 - Timely (https://github.com/NationalSecurityAgency/timely) - Timely is a time series database application that provides secure access to time series data based on Accumulo and Grafana.
- Other
 - Tarantool (https://github.com/tarantool/tarantool/) - Tarantool is an in-memory database and application server.
 - GreenPlum (https://github.com/greenplum-db/gpdb) - The Greenplum Database (GPDB) - An advanced, fully featured, open source data warehouse. It provides powerful and rapid analytics on petabyte scale data volumes.
 - cayley (https://github.com/cayleygraph/cayley) - An open-source graph database, by Google.
 - Snappydata (https://github.com/SnappyDataInc/snappydata) - SnappyData: OLTP + OLAP Database built on Apache Spark.
 - TimescaleDB (https://www.timescale.com/) - Built as an extension on top of PostgreSQL, TimescaleDB is a time-series SQL database providing fast analytics, scalability, with automated data management on a proven storage engine.
Data Comparison
- datacompy (https://github.com/capitalone/datacompy) - DataComPy is a Python library that facilitates the comparison of two DataFrames in pandas, Polars, Spark and more. The library goes beyond basic equality checks by providing detailed insights into discrepancies at both row and column levels.
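To make the datacompy entry concrete, here is a toy version of the kind of report it produces: join two tables on a key and classify rows as missing on one side or mismatched. This is plain Python over lists of dicts (not datacompy's actual API, which operates on DataFrames):

```python
# Toy data comparison in the spirit of DataComPy: join on a key column and
# report rows only on one side, plus keys whose values differ.
def compare(base, compare_to, key):
    base_idx = {row[key]: row for row in base}
    comp_idx = {row[key]: row for row in compare_to}
    return {
        "only_base": sorted(set(base_idx) - set(comp_idx)),
        "only_compare": sorted(set(comp_idx) - set(base_idx)),
        "mismatched": sorted(
            k for k in set(base_idx) & set(comp_idx) if base_idx[k] != comp_idx[k]
        ),
    }

a = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]
b = [{"id": 2, "amount": 25}, {"id": 3, "amount": 30}]
report = compare(a, b, key="id")
# report == {"only_base": [1], "only_compare": [3], "mismatched": [2]}
```

Real libraries add column-level detail and tolerances on top of exactly this row-level skeleton.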
Data Ingestion
@@ -149,16 +144,15 @@
- SnackFS (https://github.com/tuplejump/snackfs-release) - SnackFS is our bite-sized, lightweight HDFS compatible FileSystem built over Cassandra.
- GlusterFS (https://www.gluster.org/) - Gluster Filesystem.
- XtreemFS (https://www.xtreemfs.org/) - Fault-tolerant distributed file system for all storage needs.
- SeaweedFS (https://github.com/chrislusf/seaweedfs) - Seaweed-FS is a simple and highly scalable distributed file system. There are two objectives: to store billions of files and to serve them fast. Instead of supporting full POSIX file system semantics, Seaweed-FS chooses to implement only a key~file mapping. Similar to the word "NoSQL", you can call it "NoFS".
- S3QL (https://github.com/s3ql/s3ql/) - S3QL is a file system that stores all its data online using storage services like Google Storage, Amazon S3, or OpenStack.
- LizardFS (https://lizardfs.com/) - LizardFS Software Defined Storage is a distributed, parallel, scalable, fault-tolerant, Geo-Redundant and highly available file system.
Serialization format
- Apache Avro (https://avro.apache.org) - Apache Avro™ is a data serialization system.
- Apache Parquet (https://parquet.apache.org) - Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.
 - Snappy (https://github.com/google/snappy) - A fast compressor/decompressor. Used with Parquet.
 - PigZ (https://zlib.net/pigz/) - A parallel implementation of gzip for modern multi-processor, multi-core machines.
- Apache ORC (https://orc.apache.org/) - The smallest, fastest columnar storage for Hadoop workloads.
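Formats like Parquet and ORC lean on block compressors such as Snappy, which trade compression ratio for speed on repetitive columnar data. As a stand-in sketch (zlib from the stdlib rather than Snappy, which needs a third-party binding), the compress/decompress round trip looks like:

```python
import zlib

# Column-oriented data is highly repetitive, which is why lightweight block
# compression (Snappy in practice; zlib at a low level here as a stand-in)
# pays off so well in formats like Parquet and ORC.
payload = b"value,value,value," * 1000  # repetitive, column-like bytes

compressed = zlib.compress(payload, level=1)  # low level ~ favor speed
restored = zlib.decompress(compressed)

assert restored == payload
ratio = len(compressed) / len(payload)
# ratio is a small fraction of 1.0 for data this repetitive
```

The same round-trip shape applies whichever codec a file format plugs in; only the speed/ratio trade-off changes.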
@@ -187,8 +181,8 @@
Batch Processing
- Hadoop MapReduce (https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html) - Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
- Spark (https://spark.apache.org/) - A multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.
 - Spark Packages (https://spark-packages.org) - A community index of packages for Apache Spark.
 - Deep Spark (https://github.com/Stratio/deep-spark) - Connecting Apache Spark with different data stores. Deprecated.
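The MapReduce model the entry above describes fits in a few lines when run over an in-memory list instead of a cluster: a map step emits (key, 1) pairs, a shuffle groups by key, and a reduce step sums each group. A word-count sketch (the canonical example, not Hadoop's Java API):

```python
from collections import Counter
from itertools import chain

# MapReduce in miniature: map emits (word, 1) per record, the shuffle groups
# by key, and reduce sums each group. Hadoop distributes this same pattern
# across thousands of nodes.
records = ["big data", "fast data", "big fast big"]

def map_phase(record):
    return [(word, 1) for word in record.split()]

mapped = chain.from_iterable(map_phase(r) for r in records)

counts = Counter()          # shuffle + reduce: group by key, sum counts
for word, n in mapped:
    counts[word] += n
# counts == {'big': 3, 'fast': 2, 'data': 2}
```

What the framework adds is partitioning the shuffle across machines and re-running failed tasks; the per-key logic is exactly this.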
@@ -198,14 +192,14 @@
- AWS EMR (https://aws.amazon.com/emr/) - A web service that makes it easy to quickly and cost-effectively process vast amounts of data.
- Data Mechanics (https://www.datamechanics.co) - A cloud-based platform deployed on Kubernetes making Apache Spark more developer-friendly and cost-effective.
- Tez (https://tez.apache.org/) - An application framework which allows for a complex directed-acyclic-graph of tasks for processing data.
- Bistro (https://github.com/asavinov/bistro) - A light-weight engine for general-purpose data processing including both batch and stream analytics. It is based on a novel unique data model, which represents data via _functions_ and processes data via _column operations_ as opposed to having only set operations in conventional approaches like MapReduce or SQL.
- Batch ML
 - H2O (https://www.h2o.ai/) - Fast scalable machine learning API for smarter applications.
 - Mahout (https://mahout.apache.org/) - An environment for quickly creating scalable performant machine learning applications.
 - Spark MLlib (https://spark.apache.org/docs/latest/ml-guide.html) - Spark's scalable machine learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as underlying optimization primitives.
- Batch Graph
 - GraphLab Create (https://turi.com/products/create/docs/) - A machine learning platform that enables data scientists and app developers to easily create intelligent apps at scale.
 - Giraph (https://giraph.apache.org/) - An iterative graph processing system built for high scalability.
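Batch graph systems like Giraph run vertex-centric iterations (PageRank being the classic workload) over graphs too large for one machine. The same computation on a three-node toy graph, as an illustration of the model rather than Giraph's actual API:

```python
# Power-iteration PageRank on a toy graph: each vertex repeatedly splits its
# rank across its out-links and collects incoming shares, with damping.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}  # vertex -> out-links
damping = 0.85
ranks = {v: 1.0 / len(graph) for v in graph}

for _ in range(50):  # iterate to (near) convergence
    incoming = {v: 0.0 for v in graph}
    for v, outs in graph.items():
        share = ranks[v] / len(outs)
        for w in outs:
            incoming[w] += share
    ranks = {v: (1 - damping) / len(graph) + damping * incoming[v]
             for v in graph}

total = sum(ranks.values())
# total stays ~1.0 (ranks form a distribution); "c" ranks highest because it
# receives links from both "a" and "b"
```

Giraph's contribution is running each such superstep in parallel across partitions of a billion-edge graph, with messages standing in for the `incoming` dict.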
@@ -238,26 +232,23 @@
 - CronQ (https://github.com/seatgeek/cronq) - An application cron-like system. Used (https://chairnerd.seatgeek.com/building-out-the-seatgeek-data-pipeline/) w/Luigi. Deprecated.
- Cascading (https://www.cascading.org/) - Java based application development platform.
- Airflow (https://github.com/apache/airflow) - Airflow is a system to programmatically author, schedule and monitor data pipelines.
- Azkaban (https://azkaban.github.io/) - Azkaban is a batch workflow job scheduler created at LinkedIn to run Hadoop jobs. Azkaban resolves the ordering through job dependencies and provides an easy to use web user interface to maintain and track your workflows.
- Oozie (https://oozie.apache.org/) - Oozie is a workflow scheduler system to manage Apache Hadoop jobs.
- Pinball (https://github.com/pinterest/pinball) - DAG based workflow manager. Job flows are defined programmatically in Python. Supports output passing between jobs.
- Dagster (https://github.com/dagster-io/dagster) - Dagster is an open-source Python library for building data applications.
- Kedro (https://kedro.readthedocs.io/en/latest/) - Kedro is a framework that makes it easy to build robust and scalable data pipelines by providing uniform project templates, data abstraction, configuration and pipeline assembly.
- Dataform (https://dataform.co/) - An open-source framework and web based IDE to manage datasets and their dependencies. SQLX extends your existing SQL warehouse dialect to add features that support dependency management, testing, documentation and more.
- Census (https://getcensus.com/) - A reverse-ETL tool that lets you sync data from your cloud data warehouse to SaaS applications like Salesforce, Marketo, HubSpot, Zendesk, etc. No engineering favors required—just SQL.
- dbt (https://getdbt.com/) - A command line tool that enables data analysts and engineers to transform data in their warehouses more effectively.
- RudderStack (https://github.com/rudderlabs/rudder-server) - A warehouse-first Customer Data Platform that enables you to collect data from every application, website and SaaS platform, and then activate it in your warehouse and business tools.
- PACE (https://github.com/getstrm/pace) - An open source framework that allows you to enforce agreements on how data should be accessed, used, and transformed, regardless of the data platform (Snowflake, BigQuery, DataBricks, etc.)
- Prefect (https://prefect.io/) - Prefect is an orchestration and observability platform. With it, developers can rapidly build and scale resilient code, and triage disruptions effortlessly.
- Multiwoven (https://github.com/Multiwoven/multiwoven) - The open-source reverse ETL, data activation platform for modern data teams.
- SuprSend (https://www.suprsend.com/products/workflows) - Create automated workflows and logic using APIs for your notification service. Add templates, batching, preferences, in-app inbox with workflows to trigger notifications directly from your data warehouse.
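What Airflow, Azkaban, and Oozie share at their core is resolving job order from declared dependencies and running each task once its upstreams finish. A minimal sketch with the stdlib's topological sorter (the task names are made up for illustration):

```python
from graphlib import TopologicalSorter

# A DAG as a mapping of task -> set of upstream tasks it depends on, the same
# shape a workflow scheduler builds from its pipeline definition.
dag = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load", "transform"},
}

order = list(TopologicalSorter(dag).static_order())
# order == ['extract', 'transform', 'load', 'report'] -- the only valid
# ordering for this DAG; schedulers run independent tasks in parallel
```

Schedulers layer retries, backfills, and parallel execution of independent branches on top of exactly this dependency resolution.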
Data Lake Management
@@ -296,8 +287,7 @@
- GitHub Archive (https://www.gharchive.org/) - GitHub's public timeline since 2011, updated every hour.
- Common Crawl (https://commoncrawl.org/) - Open source repository of web crawl data.
- Wikipedia (https://dumps.wikimedia.org/enwiki/latest/) - Wikipedia's complete copy of all wikis, in the form of wikitext source and metadata embedded in XML. A number of raw database tables in SQL form are also available.
Monitoring
@@ -314,8 +304,8 @@
Testing
- Grai (https://github.com/grai-io/grai-core/) - A data catalog tool that integrates into your CI system exposing downstream impact testing of data changes. These tests prevent data changes which might break data pipelines or BI dashboards from making it to production.
- DQOps (https://github.com/dqops/dqo) - An open-source data quality platform for the whole data platform lifecycle from profiling new data sources to applying full automation of data quality monitoring.
Community
@@ -332,5 +322,5 @@
Podcasts
- Data Engineering Podcast (https://www.dataengineeringpodcast.com/) - The show about modern data infrastructure.
- The Data Stack Show (https://datastackshow.com/) - A show where they talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.