update lists

2025-07-18 22:22:32 +02:00
parent 55bed3b4a1
commit 5916c5c074
3078 changed files with 331679 additions and 357255 deletions
--- a/html/dataengineering.html
+++ b/html/dataengineering.html
@@ -38,6 +38,7 @@ Kibana</a></li>
 <li><a href="#forums">Forums</a></li>
 <li><a href="#conferences">Conferences</a></li>
 <li><a href="#podcasts">Podcasts</a></li>
+<li><a href="#books">Books</a></li>
 </ul></li>
 </ul>
 <h2 id="databases">Databases</h2>
@@ -219,6 +220,10 @@ open-source graph database. Google.</li>
 extension on top of PostgreSQL, TimescaleDB is a time-series SQL
 database providing fast analytics, scalability, with automated data
 management on a proven storage engine.</li>
+<li><a href="https://duckdb.org/">DuckDB</a> - DuckDB is a fast
+in-process analytical database that has zero external dependencies, runs
+on Linux/macOS/Windows, offers a rich SQL dialect, and is free and
+extensible.</li>
 </ul></li>
 </ul>
 <h2 id="data-comparison">Data Comparison</h2>
@@ -255,7 +260,7 @@ Node.js client for Apache Kafka 0.8.</li>
 <li><a href="https://github.com/pinterest/secor">Secor</a> - Pinterest’s
 Kafka to S3 distributed consumer.</li>
 <li><a href="https://github.com/uber/kafka-logger">Kafka-logger</a> -
-Kafka-winston logger for Node.js from uber.</li>
+Kafka-winston logger for Node.js from Uber.</li>
 </ul></li>
 <li><a href="https://aws.amazon.com/kinesis/">AWS Kinesis</a> - A fully
 managed, cloud-based service for real-time data processing over large,
@@ -276,7 +281,7 @@ and structured datastores such as relational databases.</li>
 <li><a href="https://github.com/mozilla-services/heka">Heka</a> - Data
 Acquisition and Processing Made Easy. Deprecated.</li>
 <li><a href="https://github.com/apache/incubator-gobblin">Gobblin</a> -
-Universal data ingestion framework for Hadoop from Linkedin.</li>
+Universal data ingestion framework for Hadoop from LinkedIn.</li>
 <li><a href="https://nakadi.io">Nakadi</a> - Nakadi is an open source
 event messaging platform that provides a REST API on top of Kafka-like
 queues.</li>
@@ -286,12 +291,30 @@ data.</li>
 <li><a href="https://pulsar.apache.org/">Apache Pulsar</a> - Apache
 Pulsar is an open-source distributed pub-sub messaging system.</li>
 <li><a href="https://github.com/awslabs/aws-data-wrangler">AWS Data
-Wranlger</a> - Utility belt to handle data on AWS.</li>
+Wrangler</a> - Utility belt to handle data on AWS.</li>
 <li><a href="https://airbyte.io/">Airbyte</a> - Open-source data
 integration for modern data teams.</li>
+<li><a href="https://www.artie.com/">Artie</a> - Real-time data
+ingestion tool leveraging change data capture.</li>
 <li><a href="https://slingdata.io/">Sling</a> - Sling is a CLI data
 integration tool specialized in moving data between databases, as well
 as storage systems.</li>
+<li><a href="https://meltano.com/">Meltano</a> - CLI &amp; code-first
+ELT.
+<ul>
+<li><a href="https://sdk.meltano.com">Singer SDK</a> - The fastest way
+to build custom data extractors and loaders compliant with the Singer
+Spec.</li>
+</ul></li>
+<li><a href="https://github.com/fulldecent/google-sheets-etl">Google
+Sheets ETL</a> - Live import all your Google Sheets to your data
+warehouse.</li>
+<li><a href="https://www.csvpath.org/">CsvPath Framework</a> - A
+delimited data preboarding framework that fills the gap between MFT and
+the data lake.</li>
+<li><a href="https://estuary.dev">Estuary Flow</a> - No/low-code data
+pipeline platform that handles both batch and real-time data
+ingestion.</li>
 </ul>
 <h2 id="file-system">File System</h2>
 <ul>
@@ -315,11 +338,14 @@ at memory-speed across cluster frameworks, such as Spark and
 MapReduce.</li>
 <li><a href="https://ceph.com/">CEPH</a> - Ceph is a unified,
 distributed storage system designed for excellent performance,
-reliability and scalability.</li>
+reliability, and scalability.</li>
+<li><a href="https://github.com/juicedata/juicefs">JuiceFS</a> - JuiceFS
+is a high-performance Cloud-Native file system driven by object storage
+for large-scale data storage.</li>
 <li><a href="https://www.orangefs.org/">OrangeFS</a> - Orange File
 System is a branch of the Parallel Virtual File System.</li>
 <li><a href="https://github.com/tuplejump/snackfs-release">SnackFS</a> -
-SnackFS is our bite-sized, lightweight HDFS compatible FileSystem built
+SnackFS is our bite-sized, lightweight HDFS-compatible file system built
 over Cassandra.</li>
 <li><a href="https://www.gluster.org/">GlusterFS</a> - Gluster
 Filesystem.</li>
@@ -389,6 +415,8 @@ powerful, and reliable system to process and distribute data.</li>
 <li><a href="https://hudi.apache.org/">Apache Hudi</a> - An open source
 framework for managing storage for real time processing, one of the most
 interesting features is the Upsert.</li>
+<li><a href="https://github.com/cocoindex-io/cocoindex">CocoIndex</a> -
+An open source ETL framework for building fresh indexes for AI.</li>
 <li><a href="https://voltdb.com/">VoltDB</a> - VoltDB is an
 ACID-compliant RDBMS which uses a <a
 href="https://en.wikipedia.org/wiki/Shared-nothing_architecture">shared
@@ -412,20 +440,23 @@ and it can be run at all kinds of resource-constrained edge
 devices.</li>
 <li><a href="https://github.com/aklivity/zilla">Zilla</a> - An API
 gateway built for event-driven architectures and streaming that supports
-standard protocols such as HTTP, SSE, gRPC, MQTT and the native Kafka
+standard protocols such as HTTP, SSE, gRPC, MQTT, and the native Kafka
 protocol.</li>
+<li><a href="https://github.com/swimos/swim-rust">SwimOS</a> - A
+framework for building real-time streaming data processing applications
+that supports a wide range of ingestion sources.</li>
 </ul>
 <h2 id="batch-processing">Batch Processing</h2>
 <ul>
-<li><p><a
+<li><a
 href="https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html">Hadoop
 MapReduce</a> - Hadoop MapReduce is a software framework for easily
 writing applications which process vast amounts of data (multi-terabyte
 data-sets) - in-parallel on large clusters (thousands of nodes) - of
-commodity hardware in a reliable, fault-tolerant manner.</p></li>
-<li><p><a href="https://spark.apache.org/">Spark</a> - A multi-language
+commodity hardware in a reliable, fault-tolerant manner.</li>
+<li><a href="https://spark.apache.org/">Spark</a> - A multi-language
 engine for executing data engineering, data science, and machine
-learning on single-node machines or clusters.</p>
+learning on single-node machines or clusters.
 <ul>
 <li><a href="https://spark-packages.org">Spark Packages</a> - A
 community index of packages for Apache Spark.</li>
@@ -440,22 +471,25 @@ Spark Server.</li>
 free &amp; cross platform monitoring tool (Spark UI / Spark History
 Server alternative).</li>
 </ul></li>
-<li><p><a href="https://aws.amazon.com/emr/">AWS EMR</a> - A web service
+<li><a href="https://aws.amazon.com/emr/">AWS EMR</a> - A web service
 that makes it easy to quickly and cost-effectively process vast amounts
-of data.</p></li>
-<li><p><a href="https://www.datamechanics.co">Data Mechanics</a> - A
+of data.</li>
+<li><a href="https://www.datamechanics.co">Data Mechanics</a> - A
 cloud-based platform deployed on Kubernetes making Apache Spark more
-developer-friendly and cost-effective.</p></li>
-<li><p><a href="https://tez.apache.org/">Tez</a> - An application
-framework which allows for a complex directed-acyclic-graph of tasks for
-processing data.</p></li>
-<li><p><a href="https://github.com/asavinov/bistro">Bistro</a> - A
+developer-friendly and cost-effective.</li>
+<li><a href="https://tez.apache.org/">Tez</a> - An application framework
+which allows for a complex directed-acyclic-graph of tasks for
+processing data.</li>
+<li><a href="https://github.com/asavinov/bistro">Bistro</a> - A
 light-weight engine for general-purpose data processing including both
 batch and stream analytics. It is based on a novel unique data model,
 which represents data via <em>functions</em> and processes data via
 <em>columns operations</em> as opposed to having only set operations in
-conventional approaches like MapReduce or SQL.</p></li>
-<li><p>Batch ML</p>
+conventional approaches like MapReduce or SQL.</li>
+<li><a href="https://github.com/brexhq/substation">Substation</a> -
+Substation is a cloud native data pipeline and transformation toolkit
+written in Go.</li>
+<li>Batch ML
 <ul>
 <li><a href="https://www.h2o.ai/">H2O</a> - Fast scalable machine
 learning API for smarter applications.</li>
@@ -467,7 +501,7 @@ common learning algorithms and utilities, including classification,
 regression, clustering, collaborative filtering, dimensionality
 reduction, as well as underlying optimization primitives.</li>
 </ul></li>
-<li><p>Batch Graph</p>
+<li>Batch Graph
 <ul>
 <li><a href="https://turi.com/products/create/docs/">GraphLab Create</a>
 - A machine learning platform that enables data scientists and app
@@ -477,7 +511,7 @@ processing system built for high scalability.</li>
 <li><a href="https://spark.apache.org/graphx/">Spark GraphX</a> - Apache
 Spark’s API for graphs and graph-parallel computation.</li>
 </ul></li>
-<li><p>Batch SQL</p>
+<li>Batch SQL
 <ul>
 <li><a
 href="https://prestodb.github.io/docs/current/index.html">Presto</a> - A
@@ -508,7 +542,7 @@ library.</li>
 <li><a href="https://d3js.org/">D3.js</a> - A JavaScript library for
 manipulating documents based on data.
 <ul>
-<li><a href="https://d3plus.org">D3Plus</a> - D3’s simplier, easier to
+<li><a href="https://d3plus.org">D3Plus</a> - D3’s simpler, easier to
 use cousin. Mostly predefined templates that you can just plug data
 in.</li>
 </ul></li>
@@ -532,35 +566,40 @@ ask questions and learn from data.</li>
 pure-python graphics and GUI library built on PyQt4 / PySide and numpy.
 It is intended for use in mathematics / scientific / engineering
 applications.</li>
+<li><a href="https://seaborn.pydata.org">Seaborn</a> - A Python
+visualization library based on matplotlib. It provides a high-level
+interface for drawing attractive statistical graphics.</li>
 </ul>
 <h2 id="workflow">Workflow</h2>
 <ul>
 <li><a href="https://github.com/spotify/luigi">Luigi</a> - Luigi is a
-Python module that helps you build complex pipelines of batch jobs.
-<ul>
+Python module that helps you build complex pipelines of batch jobs.</li>
 <li><a href="https://github.com/seatgeek/cronq">CronQ</a> - An
 application cron-like system. <a
 href="https://chairnerd.seatgeek.com/building-out-the-seatgeek-data-pipeline/">Used</a>
 w/Luigi. Deprecated.</li>
-</ul></li>
 <li><a href="https://www.cascading.org/">Cascading</a> - Java-based
 application development platform.</li>
 <li><a href="https://github.com/apache/airflow">Airflow</a> - Airflow is
-a system to programmaticaly author, schedule and monitor data
+a system to programmatically author, schedule, and monitor data
 pipelines.</li>
 <li><a href="https://azkaban.github.io/">Azkaban</a> - Azkaban is a
 batch workflow job scheduler created at LinkedIn to run Hadoop jobs.
 Azkaban resolves the ordering through job dependencies and provides an
-easy to use web user interface to maintain and track your
+easy-to-use web user interface to maintain and track your
 workflows.</li>
 <li><a href="https://oozie.apache.org/">Oozie</a> - Oozie is a workflow
 scheduler system to manage Apache Hadoop jobs.</li>
 <li><a href="https://github.com/pinterest/pinball">Pinball</a> - DAG
-based workflow manager. Job flows are defined programmaticaly in Python.
-Support output passing between jobs.</li>
+based workflow manager. Job flows are defined programmatically in
+Python. Supports output passing between jobs.</li>
 <li><a href="https://github.com/dagster-io/dagster">Dagster</a> -
 Dagster is an open-source Python library for building data
 applications.</li>
+<li><a href="https://github.com/dagworks-inc/hamilton">Hamilton</a> -
+Hamilton is a lightweight library to define data transformations as a
+directed-acyclic graph (DAG). If you like dbt for SQL transforms, you
+will like Hamilton for Python processing.</li>
 <li><a href="https://kedro.readthedocs.io/en/latest/">Kedro</a> - Kedro
 is a framework that makes it easy to build robust and scalable data
 pipelines by providing uniform project templates, data abstraction,
@@ -576,6 +615,9 @@ engineering favors required—just SQL.</li>
 <li><a href="https://getdbt.com/">dbt</a> - A command line tool that
 enables data analysts and engineers to transform data in their
 warehouses more effectively.</li>
+<li><a href="https://kestra.io/">Kestra</a> - Scalable, event-driven,
+language-agnostic orchestration and scheduling platform to manage
+millions of workflows declaratively in code.</li>
 <li><a
 href="https://github.com/rudderlabs/rudder-server">RudderStack</a> - A
 warehouse-first Customer Data Platform that enables you to collect data
@@ -597,6 +639,12 @@ Create automated workflows and logic using API’s for your notification
 service. Add templates, batching, preferences, in-app inbox with
 workflows to trigger notifications directly from your data
 warehouse.</li>
+<li><a href="https://github.com/kestra-io/kestra">Kestra</a> - A
+versatile open source orchestrator and scheduler built on Java, designed
+to handle a broad range of workflows with a language-agnostic, API-first
+architecture.</li>
+<li><a href="https://www.mage.ai">Mage</a> - Open-source data pipeline
+tool for transforming and integrating data.</li>
 </ul>
 <h2 id="data-lake-management">Data Lake Management</h2>
 <ul>
@@ -606,12 +654,18 @@ object-storage based data lakes.</li>
 <li><a href="https://github.com/projectnessie/nessie">Project Nessie</a>
 - Project Nessie is a Transactional Catalog for Data Lakes with Git-like
 semantics. Works with Apache Iceberg tables.</li>
+<li><a href="https://ilum.cloud/">Ilum</a> - Ilum is a modular Data
+Lakehouse platform that simplifies the management and monitoring of
+Apache Spark clusters across Kubernetes and Hadoop environments.</li>
+<li><a href="https://github.com/apache/gravitino">Gravitino</a> -
+Gravitino is an open-source, unified metadata management system for data lakes,
+data warehouses, and external catalogs.</li>
 </ul>
 <h2 id="elk-elastic-logstash-kibana">ELK Elastic Logstash Kibana</h2>
 <ul>
 <li><a
 href="https://github.com/pblittle/docker-logstash">docker-logstash</a> -
-A highly configurable logstash (1.4.4) - docker image running
+A highly configurable Logstash (1.4.4) - Docker image running
 Elasticsearch (1.7.0) - and Kibana (3.1.2).</li>
 <li><a
 href="https://github.com/jprante/elasticsearch-jdbc">elasticsearch-jdbc</a>
@@ -622,7 +676,7 @@ Extension that allows creating an index backed by Elasticsearch.</li>
 <h2 id="docker">Docker</h2>
 <ul>
 <li><a href="https://github.com/redbooth/gockerize">Gockerize</a> -
-Package golang service into minimal docker containers.</li>
+Package Go services into minimal Docker containers.</li>
 <li><a href="https://github.com/ClusterHQ/flocker">Flocker</a> - Easily
 manage Docker containers &amp; their data.</li>
 <li><a href="https://rancher.com/rancher-os/">Rancher</a> - RancherOS is
@@ -645,9 +699,9 @@ href="https://github.com/grammarly/rocker-compose">Rocker-compose</a> -
 Docker composition tool with idempotency features for deploying apps
 composed of multiple containers. Deprecated.</li>
 <li><a href="https://github.com/hashicorp/nomad">Nomad</a> - Nomad is a
-cluster manager, designed for both long lived services and short lived
+cluster manager, designed for both long-lived services and short-lived
 batch processing workloads.</li>
-<li><a href="https://imagelayers.io/">ImageLayers</a> - Vizualize docker
+<li><a href="https://imagelayers.io/">ImageLayers</a> - Visualize Docker
 images and the layers that compose them.</li>
 </ul>
 <h2 id="datasets">Datasets</h2>
@@ -672,7 +726,7 @@ public timeline since 2011, updated every hour.</li>
 <li><a href="https://commoncrawl.org/">Common Crawl</a> - Open source
 repository of web crawl data.</li>
 <li><a href="https://dumps.wikimedia.org/enwiki/latest/">Wikipedia</a> -
-Wikipedia’s complete copy of all wikis, in the form of wikitext source
+Wikipedia’s complete copy of all wikis, in the form of Wikitext source
 and metadata embedded in XML. A number of raw database tables in SQL
 form are also available.</li>
 </ul>
@@ -704,13 +758,19 @@ production.</li>
 data quality platform for the whole data platform lifecycle from
 profiling new data sources to applying full automation of data quality
 monitoring.</li>
+<li><a href="https://datakitchen.io/">DataKitchen</a> - Open Source Data
+Observability for end-to-end Data Journey Observability, data profiling,
+anomaly detection, and auto-created data quality validation tests.</li>
+<li><a href="https://runsql.com/">RunSQL</a> - Free online SQL
+playground for MySQL, PostgreSQL, and SQL Server. Create database
+structures, run queries, and share results instantly.</li>
 </ul>
 <h2 id="community">Community</h2>
 <h3 id="forums">Forums</h3>
 <ul>
 <li><a
 href="https://www.reddit.com/r/dataengineering/">/r/dataengineering</a>
- News, tips and background on Data Engineering.</li>
+- News, tips, and background on Data Engineering.</li>
 <li><a href="https://www.reddit.com/r/ETL/">/r/etl</a> - Subreddit
 focused on ETL.</li>
 </ul>
@@ -730,3 +790,19 @@ about their experience around building and maintaining data
 infrastructure, delivering data and data products, and driving better
 outcomes across their businesses with data.</li>
 </ul>
+<h3 id="books">Books</h3>
+<ul>
+<li><a
+href="https://www.manning.com/books/snowflake-data-engineering">Snowflake
+Data Engineering</a> - A practical introduction to data engineering on
+the Snowflake cloud data platform.</li>
+<li><a
+href="https://www.appliedaicourse.com/blog/data-science-books/">Best
+Data Science Books</a> - This blog offers a curated list of top data
+science books, categorized by topics and learning stages, to aid readers
+in building foundational knowledge and staying updated with industry
+trends.</li>
+</ul>
+<p><a
+href="https://github.com/igorbarinov/awesome-data-engineering">dataengineering.md
+GitHub</a></p>