update lists

This commit is contained in:
2025-07-18 22:22:32 +02:00
parent 55bed3b4a1
commit 5916c5c074
3078 changed files with 331679 additions and 357255 deletions

@@ -38,6 +38,7 @@ Kibana</a></li>
<li><a href="#forums">Forums</a></li>
<li><a href="#conferences">Conferences</a></li>
<li><a href="#podcasts">Podcasts</a></li>
<li><a href="#books">Books</a></li>
</ul></li>
</ul>
<h2 id="databases">Databases</h2>
@@ -219,6 +220,10 @@ open-source graph database. Google.</li>
extension on top of PostgreSQL, TimescaleDB is a time-series SQL
database providing fast analytics, scalability, with automated data
management on a proven storage engine.</li>
<li><a href="https://duckdb.org/">DuckDB</a> - DuckDB is a fast
in-process analytical database that has zero external dependencies, runs
on Linux/macOS/Windows, offers a rich SQL dialect, and is free and
extensible.</li>
</ul></li>
</ul>
<h2 id="data-comparison">Data Comparison</h2>
@@ -255,7 +260,7 @@ Node.js client for Apache Kafka 0.8.</li>
<li><a href="https://github.com/pinterest/secor">Secor</a> - Pinterest's
Kafka to S3 distributed consumer.</li>
<li><a href="https://github.com/uber/kafka-logger">Kafka-logger</a> -
Kafka-winston logger for Node.js from uber.</li>
Kafka-winston logger for Node.js from Uber.</li>
</ul></li>
<li><a href="https://aws.amazon.com/kinesis/">AWS Kinesis</a> - A fully
managed, cloud-based service for real-time data processing over large,
@@ -276,7 +281,7 @@ and structured datastores such as relational databases.</li>
<li><a href="https://github.com/mozilla-services/heka">Heka</a> - Data
Acquisition and Processing Made Easy. Deprecated.</li>
<li><a href="https://github.com/apache/incubator-gobblin">Gobblin</a> -
Universal data ingestion framework for Hadoop from Linkedin.</li>
Universal data ingestion framework for Hadoop from LinkedIn.</li>
<li><a href="https://nakadi.io">Nakadi</a> - Nakadi is an open source
event messaging platform that provides a REST API on top of Kafka-like
queues.</li>
@@ -286,12 +291,30 @@ data.</li>
<li><a href="https://pulsar.apache.org/">Apache Pulsar</a> - Apache
Pulsar is an open-source distributed pub-sub messaging system.</li>
<li><a href="https://github.com/awslabs/aws-data-wrangler">AWS Data
Wranlger</a> - Utility belt to handle data on AWS.</li>
Wrangler</a> - Utility belt to handle data on AWS.</li>
<li><a href="https://airbyte.io/">Airbyte</a> - Open-source data
integration for modern data teams.</li>
<li><a href="https://www.artie.com/">Artie</a> - Real-time data
ingestion tool leveraging change data capture.</li>
<li><a href="https://slingdata.io/">Sling</a> - Sling is a CLI data
integration tool specialized in moving data between databases, as well
as storage systems.</li>
<li><a href="https://meltano.com/">Meltano</a> - CLI &amp; code-first
ELT.
<ul>
<li><a href="https://sdk.meltano.com">Singer SDK</a> - The fastest way
to build custom data extractors and loaders compliant with the Singer
Spec.</li>
</ul></li>
<li><a href="https://github.com/fulldecent/google-sheets-etl">Google
Sheets ETL</a> - Live import all your Google Sheets to your data
warehouse.</li>
<li><a href="https://www.csvpath.org/">CsvPath Framework</a> - A
delimited data preboarding framework that fills the gap between MFT and
the data lake.</li>
<li><a href="https://estuary.dev">Estuary Flow</a> - No/low-code data
pipeline platform that handles both batch and real-time data
ingestion.</li>
</ul>
<h2 id="file-system">File System</h2>
<ul>
@@ -315,11 +338,14 @@ at memory-speed across cluster frameworks, such as Spark and
MapReduce.</li>
<li><a href="https://ceph.com/">CEPH</a> - Ceph is a unified,
distributed storage system designed for excellent performance,
reliability and scalability.</li>
reliability, and scalability.</li>
<li><a href="https://github.com/juicedata/juicefs">JuiceFS</a> - JuiceFS
is a high-performance Cloud-Native file system driven by object storage
for large-scale data storage.</li>
<li><a href="https://www.orangefs.org/">OrangeFS</a> - Orange File
System is a branch of the Parallel Virtual File System.</li>
<li><a href="https://github.com/tuplejump/snackfs-release">SnackFS</a> -
SnackFS is our bite-sized, lightweight HDFS compatible FileSystem built
SnackFS is our bite-sized, lightweight HDFS compatible file system built
over Cassandra.</li>
<li><a href="https://www.gluster.org/">GlusterFS</a> - Gluster
Filesystem.</li>
@@ -389,6 +415,8 @@ powerful, and reliable system to process and distribute data.</li>
<li><a href="https://hudi.apache.org/">Apache Hudi</a> - An open source
framework for managing storage for real-time processing; one of its most
interesting features is upsert support.</li>
<li><a href="https://github.com/cocoindex-io/cocoindex">CocoIndex</a> -
An open source ETL framework to build fresh index for AI.</li>
<li><a href="https://voltdb.com/">VoltDB</a> - VoltDB is an
ACID-compliant RDBMS which uses a <a
href="https://en.wikipedia.org/wiki/Shared-nothing_architecture">shared
@@ -412,20 +440,23 @@ and it can be run at all kinds of resource-constrained edge
devices.</li>
<li><a href="https://github.com/aklivity/zilla">Zilla</a> - An API
gateway built for event-driven architectures and streaming that supports
standard protocols such as HTTP, SSE, gRPC, MQTT and the native Kafka
standard protocols such as HTTP, SSE, gRPC, MQTT, and the native Kafka
protocol.</li>
<li><a href="https://github.com/swimos/swim-rust">SwimOS</a> - A
framework for building real-time streaming data processing applications
that supports a wide range of ingestion sources.</li>
</ul>
<h2 id="batch-processing">Batch Processing</h2>
<ul>
<li><p><a
<li><a
href="https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html">Hadoop
MapReduce</a> - Hadoop MapReduce is a software framework for easily
writing applications which process vast amounts of data (multi-terabyte
data-sets) - in-parallel on large clusters (thousands of nodes) - of
commodity hardware in a reliable, fault-tolerant manner.</p></li>
<li><p><a href="https://spark.apache.org/">Spark</a> - A multi-language
commodity hardware in a reliable, fault-tolerant manner.</li>
<li><a href="https://spark.apache.org/">Spark</a> - A multi-language
engine for executing data engineering, data science, and machine
learning on single-node machines or clusters.</p>
learning on single-node machines or clusters.
<ul>
<li><a href="https://spark-packages.org">Spark Packages</a> - A
community index of packages for Apache Spark.</li>
@@ -440,22 +471,25 @@ Spark Server.</li>
free &amp; cross platform monitoring tool (Spark UI / Spark History
Server alternative).</li>
</ul></li>
<li><p><a href="https://aws.amazon.com/emr/">AWS EMR</a> - A web service
<li><a href="https://aws.amazon.com/emr/">AWS EMR</a> - A web service
that makes it easy to quickly and cost-effectively process vast amounts
of data.</p></li>
<li><p><a href="https://www.datamechanics.co">Data Mechanics</a> - A
of data.</li>
<li><a href="https://www.datamechanics.co">Data Mechanics</a> - A
cloud-based platform deployed on Kubernetes making Apache Spark more
developer-friendly and cost-effective.</p></li>
<li><p><a href="https://tez.apache.org/">Tez</a> - An application
framework which allows for a complex directed-acyclic-graph of tasks for
processing data.</p></li>
<li><p><a href="https://github.com/asavinov/bistro">Bistro</a> - A
developer-friendly and cost-effective.</li>
<li><a href="https://tez.apache.org/">Tez</a> - An application framework
which allows for a complex directed-acyclic-graph of tasks for
processing data.</li>
<li><a href="https://github.com/asavinov/bistro">Bistro</a> - A
light-weight engine for general-purpose data processing including both
batch and stream analytics. It is based on a novel unique data model,
which represents data via <em>functions</em> and processes data via
<em>columns operations</em> as opposed to having only set operations in
conventional approaches like MapReduce or SQL.</p></li>
<li><p>Batch ML</p>
conventional approaches like MapReduce or SQL.</li>
<li><a href="https://github.com/brexhq/substation">Substation</a> -
Substation is a cloud native data pipeline and transformation toolkit
written in Go.</li>
<li>Batch ML
<ul>
<li><a href="https://www.h2o.ai/">H2O</a> - Fast scalable machine
learning API for smarter applications.</li>
@@ -467,7 +501,7 @@ common learning algorithms and utilities, including classification,
regression, clustering, collaborative filtering, dimensionality
reduction, as well as underlying optimization primitives.</li>
</ul></li>
<li><p>Batch Graph</p>
<li>Batch Graph
<ul>
<li><a href="https://turi.com/products/create/docs/">GraphLab Create</a>
- A machine learning platform that enables data scientists and app
@@ -477,7 +511,7 @@ processing system built for high scalability.</li>
<li><a href="https://spark.apache.org/graphx/">Spark GraphX</a> - Apache
Spark's API for graphs and graph-parallel computation.</li>
</ul></li>
<li><p>Batch SQL</p>
<li>Batch SQL
<ul>
<li><a
href="https://prestodb.github.io/docs/current/index.html">Presto</a> - A
@@ -508,7 +542,7 @@ library.</li>
<li><a href="https://d3js.org/">D3.js</a> - A JavaScript library for
manipulating documents based on data.
<ul>
<li><a href="https://d3plus.org">D3Plus</a> - D3's simplier, easier to
<li><a href="https://d3plus.org">D3Plus</a> - D3's simpler, easier to
use cousin. Mostly predefined templates that you can just plug data
in.</li>
</ul></li>
@@ -532,35 +566,40 @@ ask questions and learn from data.</li>
pure-python graphics and GUI library built on PyQt4 / PySide and numpy.
It is intended for use in mathematics / scientific / engineering
applications.</li>
<li><a href="https://seaborn.pydata.org">Seaborn</a> - A Python
visualization library based on matplotlib. It provides a high-level
interface for drawing attractive statistical graphics.</li>
</ul>
<h2 id="workflow">Workflow</h2>
<ul>
<li><a href="https://github.com/spotify/luigi">Luigi</a> - Luigi is a
Python module that helps you build complex pipelines of batch jobs.
<ul>
Python module that helps you build complex pipelines of batch jobs.</li>
<li><a href="https://github.com/seatgeek/cronq">CronQ</a> - An
application cron-like system. <a
href="https://chairnerd.seatgeek.com/building-out-the-seatgeek-data-pipeline/">Used</a>
w/Luigi. Deprecated.</li>
</ul></li>
<li><a href="https://www.cascading.org/">Cascading</a> - Java based
application development platform.</li>
<li><a href="https://github.com/apache/airflow">Airflow</a> - Airflow is
a system to programmaticaly author, schedule and monitor data
a system to programmatically author, schedule, and monitor data
pipelines.</li>
<li><a href="https://azkaban.github.io/">Azkaban</a> - Azkaban is a
batch workflow job scheduler created at LinkedIn to run Hadoop jobs.
Azkaban resolves the ordering through job dependencies and provides an
easy to use web user interface to maintain and track your
easy-to-use web user interface to maintain and track your
workflows.</li>
<li><a href="https://oozie.apache.org/">Oozie</a> - Oozie is a workflow
scheduler system to manage Apache Hadoop jobs.</li>
<li><a href="https://github.com/pinterest/pinball">Pinball</a> - DAG
based workflow manager. Job flows are defined programmaticaly in Python.
Support output passing between jobs.</li>
based workflow manager. Job flows are defined programmatically in
Python. Support output passing between jobs.</li>
<li><a href="https://github.com/dagster-io/dagster">Dagster</a> -
Dagster is an open-source Python library for building data
applications.</li>
<li><a href="https://github.com/dagworks-inc/hamilton">Hamilton</a> -
Hamilton is a lightweight library to define data transformations as a
directed-acyclic graph (DAG). If you like dbt for SQL transforms, you
will like Hamilton for Python processing.</li>
<li><a href="https://kedro.readthedocs.io/en/latest/">Kedro</a> - Kedro
is a framework that makes it easy to build robust and scalable data
pipelines by providing uniform project templates, data abstraction,
@@ -576,6 +615,9 @@ engineering favors required—just SQL.</li>
<li><a href="https://getdbt.com/">dbt</a> - A command line tool that
enables data analysts and engineers to transform data in their
warehouses more effectively.</li>
<li><a href="https://kestra.io/">Kestra</a> - Scalable, event-driven,
language-agnostic orchestration and scheduling platform to manage
millions of workflows declaratively in code.</li>
<li><a
href="https://github.com/rudderlabs/rudder-server">RudderStack</a> - A
warehouse-first Customer Data Platform that enables you to collect data
@@ -597,6 +639,12 @@ Create automated workflows and logic using APIs for your notification
service. Add templates, batching, preferences, inapp inbox with
workflows to trigger notifications directly from your data
warehouse.</li>
<li><a href="https://github.com/kestra-io/kestra">Kestra</a> - A
versatile open source orchestrator and scheduler built on Java, designed
to handle a broad range of workflows with a language-agnostic, API-first
architecture.</li>
<li><a href="https://www.mage.ai">Mage</a> - Open-source data pipeline
tool for transforming and integrating data.</li>
</ul>
<h2 id="data-lake-management">Data Lake Management</h2>
<ul>
@@ -606,12 +654,18 @@ object-storage based data lakes.</li>
<li><a href="https://github.com/projectnessie/nessie">Project Nessie</a>
- Project Nessie is a Transactional Catalog for Data Lakes with Git-like
semantics. Works with Apache Iceberg tables.</li>
<li><a href="https://ilum.cloud/">Ilum</a> - Ilum is a modular Data
Lakehouse platform that simplifies the management and monitoring of
Apache Spark clusters across Kubernetes and Hadoop environments.</li>
<li><a href="https://github.com/apache/gravitino">Gravitino</a> -
Gravitino is an open-source, unified metadata management for data lakes,
data warehouses, and external catalogs.</li>
</ul>
<h2 id="elk-elastic-logstash-kibana">ELK Elastic Logstash Kibana</h2>
<ul>
<li><a
href="https://github.com/pblittle/docker-logstash">docker-logstash</a> -
A highly configurable logstash (1.4.4) - docker image running
A highly configurable Logstash (1.4.4) - Docker image running
Elasticsearch (1.7.0) - and Kibana (3.1.2).</li>
<li><a
href="https://github.com/jprante/elasticsearch-jdbc">elasticsearch-jdbc</a>
@@ -622,7 +676,7 @@ Extension that allows creating an index backed by Elasticsearch.</li>
<h2 id="docker">Docker</h2>
<ul>
<li><a href="https://github.com/redbooth/gockerize">Gockerize</a> -
Package golang service into minimal docker containers.</li>
Package golang service into minimal Docker containers.</li>
<li><a href="https://github.com/ClusterHQ/flocker">Flocker</a> - Easily
manage Docker containers &amp; their data.</li>
<li><a href="https://rancher.com/rancher-os/">Rancher</a> - RancherOS is
@@ -645,9 +699,9 @@ href="https://github.com/grammarly/rocker-compose">Rocker-compose</a> -
Docker composition tool with idempotency features for deploying apps
composed of multiple containers. Deprecated.</li>
<li><a href="https://github.com/hashicorp/nomad">Nomad</a> - Nomad is a
cluster manager, designed for both long lived services and short lived
cluster manager, designed for both long-lived services and short-lived
batch processing workloads.</li>
<li><a href="https://imagelayers.io/">ImageLayers</a> - Vizualize docker
<li><a href="https://imagelayers.io/">ImageLayers</a> - Visualize Docker
images and the layers that compose them.</li>
</ul>
<h2 id="datasets">Datasets</h2>
@@ -672,7 +726,7 @@ public timeline since 2011, updated every hour.</li>
<li><a href="https://commoncrawl.org/">Common Crawl</a> - Open source
repository of web crawl data.</li>
<li><a href="https://dumps.wikimedia.org/enwiki/latest/">Wikipedia</a> -
Wikipedia's complete copy of all wikis, in the form of wikitext source
Wikipedia's complete copy of all wikis, in the form of Wikitext source
and metadata embedded in XML. A number of raw database tables in SQL
form are also available.</li>
</ul>
@@ -704,13 +758,19 @@ production.</li>
data quality platform for the whole data platform lifecycle from
profiling new data sources to applying full automation of data quality
monitoring.</li>
<li><a href="https://datakitchen.io/">DataKitchen</a> - Open Source Data
Observability for end-to-end Data Journey Observability, data profiling,
anomaly detection, and auto-created data quality validation tests.</li>
<li><a href="https://runsql.com/">RunSQL</a> - Free online SQL
playground for MySQL, PostgreSQL, and SQL Server. Create database
structures, run queries, and share results instantly.</li>
</ul>
<h2 id="community">Community</h2>
<h3 id="forums">Forums</h3>
<ul>
<li><a
href="https://www.reddit.com/r/dataengineering/">/r/dataengineering</a>
- News, tips and background on Data Engineering.</li>
- News, tips, and background on Data Engineering.</li>
<li><a href="https://www.reddit.com/r/ETL/">/r/etl</a> - Subreddit
focused on ETL.</li>
</ul>
@@ -730,3 +790,19 @@ about their experience around building and maintaining data
infrastructure, delivering data and data products, and driving better
outcomes across their businesses with data.</li>
</ul>
<h3 id="books">Books</h3>
<ul>
<li><a
href="https://www.manning.com/books/snowflake-data-engineering">Snowflake
Data Engineering</a> - A practical introduction to data engineering on
the Snowflake cloud data platform.</li>
<li><a
href="https://www.appliedaicourse.com/blog/data-science-books/">Best
Data Science Books</a> - This blog offers a curated list of top data
science books, categorized by topics and learning stages, to aid readers
in building foundational knowledge and staying updated with industry
trends.</li>
</ul>
<p><a
href="https://github.com/igorbarinov/awesome-data-engineering">dataengineering.md
GitHub</a></p>