update lists
@@ -38,6 +38,7 @@ Kibana</a></li>
<li><a href="#forums">Forums</a></li>
<li><a href="#conferences">Conferences</a></li>
<li><a href="#podcasts">Podcasts</a></li>
<li><a href="#books">Books</a></li>
</ul></li>
</ul>
<h2 id="databases">Databases</h2>
@@ -219,6 +220,10 @@ open-source graph database. Google.</li>
extension on top of PostgreSQL, TimescaleDB is a time-series SQL
database providing fast analytics, scalability, with automated data
management on a proven storage engine.</li>
<li><a href="https://duckdb.org/">DuckDB</a> - DuckDB is a fast
in-process analytical database that has zero external dependencies, runs
on Linux/macOS/Windows, offers a rich SQL dialect, and is free and
extensible.</li>
</ul></li>
</ul>
<h2 id="data-comparison">Data Comparison</h2>
@@ -255,7 +260,7 @@ Node.js client for Apache Kafka 0.8.</li>
<li><a href="https://github.com/pinterest/secor">Secor</a> - Pinterest’s
Kafka to S3 distributed consumer.</li>
<li><a href="https://github.com/uber/kafka-logger">Kafka-logger</a> -
Kafka-winston logger for Node.js from uber.</li>
Kafka-winston logger for Node.js from Uber.</li>
</ul></li>
<li><a href="https://aws.amazon.com/kinesis/">AWS Kinesis</a> - A fully
managed, cloud-based service for real-time data processing over large,
@@ -276,7 +281,7 @@ and structured datastores such as relational databases.</li>
<li><a href="https://github.com/mozilla-services/heka">Heka</a> - Data
Acquisition and Processing Made Easy. Deprecated.</li>
<li><a href="https://github.com/apache/incubator-gobblin">Gobblin</a> -
Universal data ingestion framework for Hadoop from Linkedin.</li>
Universal data ingestion framework for Hadoop from LinkedIn.</li>
<li><a href="https://nakadi.io">Nakadi</a> - Nakadi is an open source
event messaging platform that provides a REST API on top of Kafka-like
queues.</li>
@@ -286,12 +291,30 @@ data.</li>
<li><a href="https://pulsar.apache.org/">Apache Pulsar</a> - Apache
Pulsar is an open-source distributed pub-sub messaging system.</li>
<li><a href="https://github.com/awslabs/aws-data-wrangler">AWS Data
Wranlger</a> - Utility belt to handle data on AWS.</li>
Wrangler</a> - Utility belt to handle data on AWS.</li>
<li><a href="https://airbyte.io/">Airbyte</a> - Open-source data
integration for modern data teams.</li>
<li><a href="https://www.artie.com/">Artie</a> - Real-time data
ingestion tool leveraging change data capture.</li>
<li><a href="https://slingdata.io/">Sling</a> - Sling is a CLI data
integration tool specialized in moving data between databases, as well
as storage systems.</li>
<li><a href="https://meltano.com/">Meltano</a> - CLI & code-first
ELT.
<ul>
<li><a href="https://sdk.meltano.com">Singer SDK</a> - The fastest way
to build custom data extractors and loaders compliant with the Singer
Spec.</li>
</ul></li>
<li><a href="https://github.com/fulldecent/google-sheets-etl">Google
Sheets ETL</a> - Live import all your Google Sheets to your data
warehouse.</li>
<li><a href="https://www.csvpath.org/">CsvPath Framework</a> - A
delimited data preboarding framework that fills the gap between MFT and
the data lake.</li>
<li><a href="https://estuary.dev">Estuary Flow</a> - No/low-code data
pipeline platform that handles both batch and real-time data
ingestion.</li>
</ul>
<h2 id="file-system">File System</h2>
<ul>
@@ -315,11 +338,14 @@ at memory-speed across cluster frameworks, such as Spark and
MapReduce.</li>
<li><a href="https://ceph.com/">CEPH</a> - Ceph is a unified,
distributed storage system designed for excellent performance,
reliability and scalability.</li>
reliability, and scalability.</li>
<li><a href="https://github.com/juicedata/juicefs">JuiceFS</a> - JuiceFS
is a high-performance Cloud-Native file system driven by object storage
for large-scale data storage.</li>
<li><a href="https://www.orangefs.org/">OrangeFS</a> - Orange File
System is a branch of the Parallel Virtual File System.</li>
<li><a href="https://github.com/tuplejump/snackfs-release">SnackFS</a> -
SnackFS is our bite-sized, lightweight HDFS compatible FileSystem built
SnackFS is our bite-sized, lightweight HDFS-compatible file system built
over Cassandra.</li>
<li><a href="https://www.gluster.org/">GlusterFS</a> - Gluster
Filesystem.</li>
@@ -389,6 +415,8 @@ powerful, and reliable system to process and distribute data.</li>
<li><a href="https://hudi.apache.org/">Apache Hudi</a> - An open source
framework for managing storage for real time processing, one of the most
interesting features is the Upsert.</li>
<li><a href="https://github.com/cocoindex-io/cocoindex">CocoIndex</a> -
An open source ETL framework to build fresh indexes for AI.</li>
<li><a href="https://voltdb.com/">VoltDB</a> - VoltDB is an
ACID-compliant RDBMS which uses a <a
href="https://en.wikipedia.org/wiki/Shared-nothing_architecture">shared
@@ -412,20 +440,23 @@ and it can be run at all kinds of resource-constrained edge
devices.</li>
<li><a href="https://github.com/aklivity/zilla">Zilla</a> - An API
gateway built for event-driven architectures and streaming that supports
standard protocols such as HTTP, SSE, gRPC, MQTT and the native Kafka
standard protocols such as HTTP, SSE, gRPC, MQTT, and the native Kafka
protocol.</li>
<li><a href="https://github.com/swimos/swim-rust">SwimOS</a> - A
framework for building real-time streaming data processing applications
that supports a wide range of ingestion sources.</li>
</ul>
<h2 id="batch-processing">Batch Processing</h2>
<ul>
<li><p><a
<li><a
href="https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html">Hadoop
MapReduce</a> - Hadoop MapReduce is a software framework for easily
writing applications which process vast amounts of data (multi-terabyte
data-sets) - in-parallel on large clusters (thousands of nodes) - of
commodity hardware in a reliable, fault-tolerant manner.</p></li>
<li><p><a href="https://spark.apache.org/">Spark</a> - A multi-language
commodity hardware in a reliable, fault-tolerant manner.</li>
<li><a href="https://spark.apache.org/">Spark</a> - A multi-language
engine for executing data engineering, data science, and machine
learning on single-node machines or clusters.</p>
learning on single-node machines or clusters.
<ul>
<li><a href="https://spark-packages.org">Spark Packages</a> - A
community index of packages for Apache Spark.</li>
@@ -440,22 +471,25 @@ Spark Server.</li>
free & cross platform monitoring tool (Spark UI / Spark History
Server alternative).</li>
</ul></li>
<li><p><a href="https://aws.amazon.com/emr/">AWS EMR</a> - A web service
<li><a href="https://aws.amazon.com/emr/">AWS EMR</a> - A web service
that makes it easy to quickly and cost-effectively process vast amounts
of data.</p></li>
<li><p><a href="https://www.datamechanics.co">Data Mechanics</a> - A
of data.</li>
<li><a href="https://www.datamechanics.co">Data Mechanics</a> - A
cloud-based platform deployed on Kubernetes making Apache Spark more
developer-friendly and cost-effective.</p></li>
<li><p><a href="https://tez.apache.org/">Tez</a> - An application
framework which allows for a complex directed-acyclic-graph of tasks for
processing data.</p></li>
<li><p><a href="https://github.com/asavinov/bistro">Bistro</a> - A
developer-friendly and cost-effective.</li>
<li><a href="https://tez.apache.org/">Tez</a> - An application framework
which allows for a complex directed-acyclic-graph of tasks for
processing data.</li>
<li><a href="https://github.com/asavinov/bistro">Bistro</a> - A
light-weight engine for general-purpose data processing including both
batch and stream analytics. It is based on a novel unique data model,
which represents data via <em>functions</em> and processes data via
<em>columns operations</em> as opposed to having only set operations in
conventional approaches like MapReduce or SQL.</p></li>
<li><p>Batch ML</p>
conventional approaches like MapReduce or SQL.</li>
<li><a href="https://github.com/brexhq/substation">Substation</a> -
Substation is a cloud native data pipeline and transformation toolkit
written in Go.</li>
<li>Batch ML
<ul>
<li><a href="https://www.h2o.ai/">H2O</a> - Fast scalable machine
learning API for smarter applications.</li>
@@ -467,7 +501,7 @@ common learning algorithms and utilities, including classification,
regression, clustering, collaborative filtering, dimensionality
reduction, as well as underlying optimization primitives.</li>
</ul></li>
<li><p>Batch Graph</p>
<li>Batch Graph
<ul>
<li><a href="https://turi.com/products/create/docs/">GraphLab Create</a>
- A machine learning platform that enables data scientists and app
@@ -477,7 +511,7 @@ processing system built for high scalability.</li>
<li><a href="https://spark.apache.org/graphx/">Spark GraphX</a> - Apache
Spark’s API for graphs and graph-parallel computation.</li>
</ul></li>
<li><p>Batch SQL</p>
<li>Batch SQL
<ul>
<li><a
href="https://prestodb.github.io/docs/current/index.html">Presto</a> - A
@@ -508,7 +542,7 @@ library.</li>
<li><a href="https://d3js.org/">D3.js</a> - A JavaScript library for
manipulating documents based on data.
<ul>
<li><a href="https://d3plus.org">D3Plus</a> - D3’s simplier, easier to
<li><a href="https://d3plus.org">D3Plus</a> - D3’s simpler, easier to
use cousin. Mostly predefined templates that you can just plug data
in.</li>
</ul></li>
@@ -532,35 +566,40 @@ ask questions and learn from data.</li>
pure-python graphics and GUI library built on PyQt4 / PySide and numpy.
It is intended for use in mathematics / scientific / engineering
applications.</li>
<li><a href="https://seaborn.pydata.org">Seaborn</a> - A Python
visualization library based on matplotlib. It provides a high-level
interface for drawing attractive statistical graphics.</li>
</ul>
<h2 id="workflow">Workflow</h2>
<ul>
<li><a href="https://github.com/spotify/luigi">Luigi</a> - Luigi is a
Python module that helps you build complex pipelines of batch jobs.
<ul>
Python module that helps you build complex pipelines of batch jobs.</li>
<li><a href="https://github.com/seatgeek/cronq">CronQ</a> - An
application cron-like system. <a
href="https://chairnerd.seatgeek.com/building-out-the-seatgeek-data-pipeline/">Used</a>
w/Luigi. Deprecated.</li>
</ul></li>
<li><a href="https://www.cascading.org/">Cascading</a> - Java-based
application development platform.</li>
<li><a href="https://github.com/apache/airflow">Airflow</a> - Airflow is
a system to programmaticaly author, schedule and monitor data
a system to programmatically author, schedule, and monitor data
pipelines.</li>
<li><a href="https://azkaban.github.io/">Azkaban</a> - Azkaban is a
batch workflow job scheduler created at LinkedIn to run Hadoop jobs.
Azkaban resolves the ordering through job dependencies and provides an
easy to use web user interface to maintain and track your
easy-to-use web user interface to maintain and track your
workflows.</li>
<li><a href="https://oozie.apache.org/">Oozie</a> - Oozie is a workflow
scheduler system to manage Apache Hadoop jobs.</li>
<li><a href="https://github.com/pinterest/pinball">Pinball</a> - DAG
based workflow manager. Job flows are defined programmaticaly in Python.
Support output passing between jobs.</li>
based workflow manager. Job flows are defined programmatically in
Python. Support output passing between jobs.</li>
<li><a href="https://github.com/dagster-io/dagster">Dagster</a> -
Dagster is an open-source Python library for building data
applications.</li>
<li><a href="https://github.com/dagworks-inc/hamilton">Hamilton</a> -
Hamilton is a lightweight library to define data transformations as a
directed-acyclic graph (DAG). If you like dbt for SQL transforms, you
will like Hamilton for Python processing.</li>
<li><a href="https://kedro.readthedocs.io/en/latest/">Kedro</a> - Kedro
is a framework that makes it easy to build robust and scalable data
pipelines by providing uniform project templates, data abstraction,
@@ -576,6 +615,9 @@ engineering favors required—just SQL.</li>
<li><a href="https://getdbt.com/">dbt</a> - A command line tool that
enables data analysts and engineers to transform data in their
warehouses more effectively.</li>
<li><a href="https://kestra.io/">Kestra</a> - Scalable, event-driven,
language-agnostic orchestration and scheduling platform to manage
millions of workflows declaratively in code.</li>
<li><a
href="https://github.com/rudderlabs/rudder-server">RudderStack</a> - A
warehouse-first Customer Data Platform that enables you to collect data
@@ -597,6 +639,12 @@ Create automated workflows and logic using API’s for your notification
service. Add templates, batching, preferences, inapp inbox with
workflows to trigger notifications directly from your data
warehouse.</li>
<li><a href="https://github.com/kestra-io/kestra">Kestra</a> - A
versatile open source orchestrator and scheduler built on Java, designed
to handle a broad range of workflows with a language-agnostic, API-first
architecture.</li>
<li><a href="https://www.mage.ai">Mage</a> - Open-source data pipeline
tool for transforming and integrating data.</li>
</ul>
<h2 id="data-lake-management">Data Lake Management</h2>
<ul>
@@ -606,12 +654,18 @@ object-storage based data lakes.</li>
<li><a href="https://github.com/projectnessie/nessie">Project Nessie</a>
- Project Nessie is a Transactional Catalog for Data Lakes with Git-like
semantics. Works with Apache Iceberg tables.</li>
<li><a href="https://ilum.cloud/">Ilum</a> - Ilum is a modular Data
Lakehouse platform that simplifies the management and monitoring of
Apache Spark clusters across Kubernetes and Hadoop environments.</li>
<li><a href="https://github.com/apache/gravitino">Gravitino</a> -
Gravitino is an open-source, unified metadata management for data lakes,
data warehouses, and external catalogs.</li>
</ul>
<h2 id="elk-elastic-logstash-kibana">ELK Elastic Logstash Kibana</h2>
<ul>
<li><a
href="https://github.com/pblittle/docker-logstash">docker-logstash</a> -
A highly configurable logstash (1.4.4) - docker image running
A highly configurable Logstash (1.4.4) - Docker image running
Elasticsearch (1.7.0) - and Kibana (3.1.2).</li>
<li><a
href="https://github.com/jprante/elasticsearch-jdbc">elasticsearch-jdbc</a>
@@ -622,7 +676,7 @@ Extension that allows creating an index backed by Elasticsearch.</li>
<h2 id="docker">Docker</h2>
<ul>
<li><a href="https://github.com/redbooth/gockerize">Gockerize</a> -
Package golang service into minimal docker containers.</li>
Package golang service into minimal Docker containers.</li>
<li><a href="https://github.com/ClusterHQ/flocker">Flocker</a> - Easily
manage Docker containers & their data.</li>
<li><a href="https://rancher.com/rancher-os/">Rancher</a> - RancherOS is
@@ -645,9 +699,9 @@ href="https://github.com/grammarly/rocker-compose">Rocker-compose</a> -
Docker composition tool with idempotency features for deploying apps
composed of multiple containers. Deprecated.</li>
<li><a href="https://github.com/hashicorp/nomad">Nomad</a> - Nomad is a
cluster manager, designed for both long lived services and short lived
cluster manager, designed for both long-lived services and short-lived
batch processing workloads.</li>
<li><a href="https://imagelayers.io/">ImageLayers</a> - Vizualize docker
<li><a href="https://imagelayers.io/">ImageLayers</a> - Visualize Docker
images and the layers that compose them.</li>
</ul>
<h2 id="datasets">Datasets</h2>
@@ -672,7 +726,7 @@ public timeline since 2011, updated every hour.</li>
<li><a href="https://commoncrawl.org/">Common Crawl</a> - Open source
repository of web crawl data.</li>
<li><a href="https://dumps.wikimedia.org/enwiki/latest/">Wikipedia</a> -
Wikipedia’s complete copy of all wikis, in the form of wikitext source
Wikipedia’s complete copy of all wikis, in the form of Wikitext source
and metadata embedded in XML. A number of raw database tables in SQL
form are also available.</li>
</ul>
@@ -704,13 +758,19 @@ production.</li>
data quality platform for the whole data platform lifecycle from
profiling new data sources to applying full automation of data quality
monitoring.</li>
<li><a href="https://datakitchen.io/">DataKitchen</a> - Open Source Data
Observability for end-to-end Data Journey Observability, data profiling,
anomaly detection, and auto-created data quality validation tests.</li>
<li><a href="https://runsql.com/">RunSQL</a> - Free online SQL
playground for MySQL, PostgreSQL, and SQL Server. Create database
structures, run queries, and share results instantly.</li>
</ul>
<h2 id="community">Community</h2>
<h3 id="forums">Forums</h3>
<ul>
<li><a
href="https://www.reddit.com/r/dataengineering/">/r/dataengineering</a>
- News, tips and background on Data Engineering.</li>
- News, tips, and background on Data Engineering.</li>
<li><a href="https://www.reddit.com/r/ETL/">/r/etl</a> - Subreddit
focused on ETL.</li>
</ul>
@@ -730,3 +790,19 @@ about their experience around building and maintaining data
infrastructure, delivering data and data products, and driving better
outcomes across their businesses with data.</li>
</ul>
<h3 id="books">Books</h3>
<ul>
<li><a
href="https://www.manning.com/books/snowflake-data-engineering">Snowflake
Data Engineering</a> - A practical introduction to data engineering on
the Snowflake cloud data platform.</li>
<li><a
href="https://www.appliedaicourse.com/blog/data-science-books/">Best
Data Science Books</a> - This blog offers a curated list of top data
science books, categorized by topics and learning stages, to aid readers
in building foundational knowledge and staying updated with industry
trends.</li>
</ul>
<p><a
href="https://github.com/igorbarinov/awesome-data-engineering">dataengineering.md
GitHub</a></p>