# Awesome Data Engineering [![Awesome](https://awesome.re/badge-flat2.svg)](https://github.com/sindresorhus/awesome)

> A curated list of awesome things related to Data Engineering.

## Contents

- [Databases](#databases)
- [Data Comparison](#data-comparison)
- [Data Ingestion](#data-ingestion)
- [File System](#file-system)
- [Serialization format](#serialization-format)
- [Stream Processing](#stream-processing)
- [Batch Processing](#batch-processing)
- [Charts and Dashboards](#charts-and-dashboards)
- [Workflow](#workflow)
- [Data Lake Management](#data-lake-management)
- [ELK Elastic Logstash Kibana](#elk-elastic-logstash-kibana)
- [Docker](#docker)
- [Datasets](#datasets)
  - [Realtime](#realtime)
  - [Data Dumps](#data-dumps)
- [Monitoring](#monitoring)
  - [Prometheus](#prometheus)
- [Profiling](#profiling)
  - [Data Profiler](#data-profiler)
- [Testing](#testing)
- [Community](#community)
  - [Forums](#forums)
  - [Conferences](#conferences)
  - [Podcasts](#podcasts)
  - [Books](#books)

## Databases

- Relational
  - [rqlite](https://github.com/rqlite/rqlite) - Replicated SQLite using the Raft consensus protocol.
  - [MySQL](https://www.mysql.com/) - The world's most popular open source database.
    - [TiDB](https://github.com/pingcap/tidb) - TiDB is a distributed NewSQL database compatible with the MySQL protocol.
    - [Percona XtraBackup](https://www.percona.com/software/mysql-database/percona-xtrabackup) - Percona XtraBackup is a free, open source, complete online backup solution for all versions of Percona Server, MySQL® and MariaDB®.
    - [mysql_utils](https://github.com/pinterest/mysql_utils) - Pinterest MySQL Management Tools.
  - [MariaDB](https://mariadb.org/) - An enhanced, drop-in replacement for MySQL.
  - [PostgreSQL](https://www.postgresql.org/) - The world's most advanced open source database.
  - [Amazon RDS](https://aws.amazon.com/rds/) - Amazon RDS makes it easy to set up, operate, and scale a relational database in the cloud.
  - [Crate.IO](https://crate.io/) - Scalable SQL database with NoSQL goodies.
- Key-Value
  - [Redis](https://redis.io/) - An open source, BSD licensed, advanced key-value cache and store.
  - [Riak](https://docs.basho.com/riak/kv/) - A distributed database designed to deliver maximum data availability by distributing data across multiple servers.
  - [AWS DynamoDB](https://aws.amazon.com/dynamodb/) - A fast and flexible NoSQL database service for all applications that need consistent, single-digit millisecond latency at any scale.
  - [HyperDex](https://github.com/rescrv/HyperDex) - HyperDex is a scalable, searchable key-value store. Deprecated.
  - [SSDB](https://ssdb.io) - A high performance NoSQL database supporting many data structures, an alternative to Redis.
  - [Kyoto Tycoon](https://github.com/alticelabs/kyoto) - Kyoto Tycoon is a lightweight network server on top of the Kyoto Cabinet key-value database, built for high performance and concurrency.
  - [IonDB](https://github.com/iondbproject/iondb) - A key-value store for microcontroller and IoT applications.
- Column
  - [Cassandra](https://cassandra.apache.org/) - The right choice when you need scalability and high availability without compromising performance.
    - [Cassandra Calculator](https://www.ecyrd.com/cassandracalculator/) - This simple form allows you to try out different values for your Apache Cassandra cluster and see what the impact is for your application.
    - [CCM](https://github.com/pcmanus/ccm) - A script to easily create and destroy an Apache Cassandra cluster on localhost.
    - [ScyllaDB](https://github.com/scylladb/scylla) - NoSQL data store using the Seastar framework, compatible with Apache Cassandra.
  - [HBase](https://hbase.apache.org/) - The Hadoop database, a distributed, scalable, big data store.
  - [AWS Redshift](https://aws.amazon.com/redshift/) - A fast, fully managed, petabyte-scale data warehouse that makes it simple and cost-effective to analyze all your data using your existing business intelligence tools.
  - [FiloDB](https://github.com/filodb/FiloDB) - Distributed. Columnar. Versioned. Streaming. SQL.
  - [Vertica](https://www.vertica.com) - Distributed, MPP columnar database with extensive analytics SQL.
  - [ClickHouse](https://clickhouse.tech) - Distributed columnar DBMS for OLAP. SQL.
- Document
  - [MongoDB](https://www.mongodb.com) - An open-source, document database designed for ease of development and scaling.
    - [Percona Server for MongoDB](https://www.percona.com/software/mongo-database/percona-server-for-mongodb) - Percona Server for MongoDB® is a free, enhanced, fully compatible, open source, drop-in replacement for the MongoDB® Community Edition that includes enterprise-grade features and functionality.
    - [MemDB](https://github.com/rain1017/memdb) - Distributed Transactional In-Memory Database (based on MongoDB).
  - [Elasticsearch](https://www.elastic.co/) - Search & Analyze Data in Real Time.
  - [Couchbase](https://www.couchbase.com/) - The highest performing NoSQL distributed database.
  - [RethinkDB](https://rethinkdb.com/) - The open-source database for the realtime web.
  - [RavenDB](https://ravendb.net/) - Fully Transactional NoSQL Document Database.
- Graph
  - [Neo4j](https://neo4j.com/) - The world's leading graph database.
  - [OrientDB](https://orientdb.com) - 2nd Generation Distributed Graph Database with the flexibility of Documents in one product with an Open Source commercial friendly license.
  - [ArangoDB](https://www.arangodb.com/) - A distributed free and open-source database with a flexible data model for documents, graphs, and key-values.
  - [Titan](https://titan.thinkaurelius.com) - A scalable graph database optimized for storing and querying graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster.
  - [FlockDB](https://github.com/twitter-archive/flockdb) - A distributed, fault-tolerant graph database by Twitter. Deprecated.
- Distributed
  - [Datomic](https://www.datomic.com) - The fully transactional, cloud-ready, distributed database.
  - [Apache Geode](https://geode.apache.org/) - An open source, distributed, in-memory database for scale-out applications.
  - [Gaffer](https://github.com/gchq/Gaffer) - A large-scale graph database.
- Timeseries
  - [InfluxDB](https://github.com/influxdata/influxdb) - Scalable datastore for metrics, events, and real-time analytics.
  - [OpenTSDB](https://github.com/OpenTSDB/opentsdb) - A scalable, distributed Time Series Database.
  - [QuestDB](https://questdb.io/) - A relational column-oriented database designed for real-time analytics on time series and event data.
  - [kairosdb](https://github.com/kairosdb/kairosdb) - Fast scalable time series database.
  - [Heroic](https://github.com/spotify/heroic) - A scalable time series database based on Cassandra and Elasticsearch, by Spotify.
  - [Druid](https://github.com/apache/incubator-druid) - Column oriented distributed data store ideal for powering interactive applications.
  - [Riak-TS](https://basho.com/products/riak-ts/) - Riak TS is the only enterprise-grade NoSQL time series database optimized specifically for IoT and Time Series data.
  - [Akumuli](https://github.com/akumuli/Akumuli) - Akumuli is a numeric time-series database. It can be used to capture, store and process time-series data in real time. The word "akumuli" can be translated from Esperanto as "accumulate".
  - [Rhombus](https://github.com/Pardot/Rhombus) - A time-series object store for Cassandra that handles all the complexity of building wide row indexes.
  - [Dalmatiner DB](https://github.com/dalmatinerdb/dalmatinerdb) - Fast distributed metrics database.
  - [Blueflood](https://github.com/rackerlabs/blueflood) - A distributed system designed to ingest and process time series data.
  - [Timely](https://github.com/NationalSecurityAgency/timely) - Timely is a time series database application that provides secure access to time series data based on Accumulo and Grafana.
- Other
  - [Tarantool](https://github.com/tarantool/tarantool/) - Tarantool is an in-memory database and application server.
  - [GreenPlum](https://github.com/greenplum-db/gpdb) - The Greenplum Database (GPDB) - An advanced, fully featured, open source data warehouse. It provides powerful and rapid analytics on petabyte scale data volumes.
  - [cayley](https://github.com/cayleygraph/cayley) - An open-source graph database, from Google.
  - [Snappydata](https://github.com/SnappyDataInc/snappydata) - SnappyData: OLTP + OLAP Database built on Apache Spark.
  - [TimescaleDB](https://www.timescale.com/) - Built as an extension on top of PostgreSQL, TimescaleDB is a time-series SQL database providing fast analytics, scalability, with automated data management on a proven storage engine.
  - [DuckDB](https://duckdb.org/) - DuckDB is a fast in-process analytical database that has zero external dependencies, runs on Linux/macOS/Windows, offers a rich SQL dialect, and is free and extensible. A short usage sketch follows this list.

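As a taste of how lightweight some of these engines are to try out, here is a minimal sketch using the `duckdb` Python package (table and column names are purely illustrative):

```python
# Minimal DuckDB sketch: an in-process analytical database, no server required.
# Assumes `pip install duckdb`; the table and column names are illustrative only.
import duckdb

con = duckdb.connect()  # in-memory database; pass a file path to persist it
con.execute("CREATE TABLE events (user_id INTEGER, amount DOUBLE)")
con.execute("INSERT INTO events VALUES (1, 9.5), (1, 3.0), (2, 7.25)")

# Aggregate with plain SQL and fetch the result into Python.
rows = con.execute(
    "SELECT user_id, SUM(amount) AS total FROM events GROUP BY user_id ORDER BY user_id"
).fetchall()
print(rows)  # [(1, 12.5), (2, 7.25)]
```
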
## Data Comparison

- [datacompy](https://github.com/capitalone/datacompy) - DataComPy is a Python library that facilitates the comparison of two DataFrames in pandas, Polars, Spark and more. The library goes beyond basic equality checks by providing detailed insights into discrepancies at both row and column levels. A small comparison sketch follows.

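A minimal datacompy sketch, assuming `pip install datacompy pandas`; the join column and data below are made up for illustration:

```python
# Compare two pandas DataFrames with datacompy (`pip install datacompy pandas`).
# The join column and values are illustrative.
import datacompy
import pandas as pd

base = pd.DataFrame({"id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})
new = pd.DataFrame({"id": [1, 2, 4], "amount": [10.0, 21.0, 40.0]})

comparison = datacompy.Compare(base, new, join_columns="id", df1_name="base", df2_name="new")

print(comparison.matches())  # False: one id is missing and one value differs
print(comparison.report())   # human-readable summary of row- and column-level differences
```
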
## Data Ingestion

- [Kafka](https://kafka.apache.org/) - Publish-subscribe messaging rethought as a distributed commit log. A minimal producer/consumer sketch in Python follows this list.
  - [BottledWater](https://github.com/confluentinc/bottledwater-pg) - Change data capture from PostgreSQL into Kafka. Deprecated.
  - [kafkat](https://github.com/airbnb/kafkat) - Simplified command-line administration for Kafka brokers.
  - [kafkacat](https://github.com/edenhill/kafkacat) - Generic command line non-JVM Apache Kafka producer and consumer.
  - [pg-kafka](https://github.com/xstevens/pg_kafka) - A PostgreSQL extension to produce messages to Apache Kafka.
  - [librdkafka](https://github.com/edenhill/librdkafka) - The Apache Kafka C/C++ library.
  - [kafka-docker](https://github.com/wurstmeister/kafka-docker) - Kafka in Docker.
  - [kafka-manager](https://github.com/yahoo/kafka-manager) - A tool for managing Apache Kafka.
  - [kafka-node](https://github.com/SOHU-Co/kafka-node) - Node.js client for Apache Kafka 0.8.
  - [Secor](https://github.com/pinterest/secor) - Pinterest's Kafka to S3 distributed consumer.
  - [Kafka-logger](https://github.com/uber/kafka-logger) - Kafka-winston logger for Node.js from Uber.
- [AWS Kinesis](https://aws.amazon.com/kinesis/) - A fully managed, cloud-based service for real-time data processing over large, distributed data streams.
- [RabbitMQ](https://www.rabbitmq.com/) - Robust messaging for applications.
- [dlt](https://www.dlthub.com) - A fast and simple pipeline-building library for Python data devs; runs in notebooks, cloud functions, Airflow, etc.
- [FluentD](https://www.fluentd.org) - An open source data collector for unified logging layer.
- [Embulk](https://www.embulk.org) - An open source bulk data loader that helps data transfer between various databases, storages, file formats, and cloud services.
- [Apache Sqoop](https://sqoop.apache.org) - A tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
- [Heka](https://github.com/mozilla-services/heka) - Data Acquisition and Processing Made Easy. Deprecated.
- [Gobblin](https://github.com/apache/incubator-gobblin) - Universal data ingestion framework for Hadoop from LinkedIn.
- [Nakadi](https://nakadi.io) - Nakadi is an open source event messaging platform that provides a REST API on top of Kafka-like queues.
- [Pravega](https://www.pravega.io) - Pravega provides a new storage abstraction - a stream - for continuous and unbounded data.
- [Apache Pulsar](https://pulsar.apache.org/) - Apache Pulsar is an open-source distributed pub-sub messaging system.
- [AWS Data Wrangler](https://github.com/awslabs/aws-data-wrangler) - Utility belt to handle data on AWS.
- [Airbyte](https://airbyte.io/) - Open-source data integration for modern data teams.
- [Artie](https://www.artie.com/) - Real-time data ingestion tool leveraging change data capture.
- [Sling](https://slingdata.io/) - Sling is a CLI data integration tool specialized in moving data between databases, as well as storage systems.
- [Meltano](https://meltano.com/) - CLI & code-first ELT.
  - [Singer SDK](https://sdk.meltano.com) - The fastest way to build custom data extractors and loaders compliant with the Singer Spec.
- [Google Sheets ETL](https://github.com/fulldecent/google-sheets-etl) - Live import all your Google Sheets to your data warehouse.
- [CsvPath Framework](https://www.csvpath.org/) - A delimited data preboarding framework that fills the gap between MFT and the data lake.
- [Estuary Flow](https://estuary.dev) - No/low-code data pipeline platform that handles both batch and real-time data ingestion.

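Since so many of these ingestion tools revolve around Kafka, here is a minimal sketch of producing and consuming messages with the `kafka-python` client; the broker address, topic name, and group id are placeholders:

```python
# Minimal Kafka produce/consume sketch using kafka-python (`pip install kafka-python`).
# The broker address, topic name, and group id below are illustrative placeholders.
from kafka import KafkaConsumer, KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", key=b"user-1", value=b'{"action": "click"}')
producer.flush()  # make sure the message actually left the client buffer

consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    group_id="example-group",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.key, message.value)
    break  # read a single message and stop, just for the demo
```
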
## File System

- [HDFS](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html) - A distributed file system designed to run on commodity hardware.
  - [Snakebite](https://github.com/spotify/snakebite) - A pure Python HDFS client.
- [AWS S3](https://aws.amazon.com/s3/) - Object storage built to retrieve any amount of data from anywhere.
  - [smart_open](https://github.com/RaRe-Technologies/smart_open) - Utils for streaming large files (S3, HDFS, gzip, bz2). A short streaming sketch follows this list.
- [Alluxio](https://www.alluxio.org/) - Alluxio is a memory-centric distributed storage system enabling reliable data sharing at memory-speed across cluster frameworks, such as Spark and MapReduce.
- [CEPH](https://ceph.com/) - Ceph is a unified, distributed storage system designed for excellent performance, reliability, and scalability.
- [JuiceFS](https://github.com/juicedata/juicefs) - JuiceFS is a high-performance Cloud-Native file system driven by object storage for large-scale data storage.
- [OrangeFS](https://www.orangefs.org/) - Orange File System is a branch of the Parallel Virtual File System.
- [SnackFS](https://github.com/tuplejump/snackfs-release) - SnackFS is a bite-sized, lightweight HDFS-compatible file system built over Cassandra.
- [GlusterFS](https://www.gluster.org/) - Gluster Filesystem.
- [XtreemFS](https://www.xtreemfs.org/) - Fault-tolerant distributed file system for all storage needs.
- [SeaweedFS](https://github.com/chrislusf/seaweedfs) - SeaweedFS is a simple and highly scalable distributed file system with two objectives: to store billions of files and to serve them fast. Instead of supporting full POSIX file system semantics, SeaweedFS implements only a key-to-file mapping. Similar to the word "NoSQL", you can call it "NoFS".
- [S3QL](https://github.com/s3ql/s3ql/) - S3QL is a file system that stores all its data online using storage services like Google Storage, Amazon S3, or OpenStack.
- [LizardFS](https://lizardfs.com/) - LizardFS Software Defined Storage is a distributed, parallel, scalable, fault-tolerant, Geo-Redundant and highly available file system.

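For ad-hoc access to objects on these stores, `smart_open` (listed above) exposes a file-like interface. A minimal sketch, assuming AWS credentials are already configured and using a placeholder bucket and key:

```python
# Stream an object from S3 line by line with smart_open (`pip install "smart_open[s3]"`).
# The bucket and key are placeholders; credentials come from the usual boto3 mechanisms.
from smart_open import open as smart_open

line_count = 0
with smart_open("s3://example-bucket/logs/2024-01-01.log.gz") as fin:
    for line in fin:       # gzip is decompressed transparently based on the extension
        line_count += 1
print(f"read {line_count} lines")

# Writing works the same way: open the URI in "w" mode and write as to a local file.
with smart_open("s3://example-bucket/exports/summary.txt", "w") as fout:
    fout.write(f"rows={line_count}\n")
```
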
## Serialization format

- [Apache Avro](https://avro.apache.org) - Apache Avro™ is a data serialization system.
- [Apache Parquet](https://parquet.apache.org) - Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. A small write/read sketch follows this list.
  - [Snappy](https://github.com/google/snappy) - A fast compressor/decompressor. Used with Parquet.
  - [PigZ](https://zlib.net/pigz/) - A parallel implementation of gzip for modern multi-processor, multi-core machines.
- [Apache ORC](https://orc.apache.org/) - The smallest, fastest columnar storage for Hadoop workloads.
- [Apache Thrift](https://thrift.apache.org) - The Apache Thrift software framework, for scalable cross-language services development.
- [ProtoBuf](https://github.com/protocolbuffers/protobuf) - Protocol Buffers - Google's data interchange format.
- [SequenceFile](https://wiki.apache.org/hadoop/SequenceFile) - SequenceFile is a flat file consisting of binary key/value pairs. It is extensively used in MapReduce as input/output formats.
- [Kryo](https://github.com/EsotericSoftware/kryo) - Kryo is a fast and efficient object graph serialization framework for Java.

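To get a feel for the columnar formats above, here is a minimal sketch using the `pyarrow` package that writes a small table to Parquet with Snappy compression and reads back a single column; file and column names are placeholders:

```python
# Write and read a Parquet file with pyarrow (`pip install pyarrow`).
# File name, column names, and compression choice are illustrative.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "user_id": [1, 2, 3],
    "amount": [9.5, 3.0, 7.25],
})

pq.write_table(table, "events.parquet", compression="snappy")

# Read back only the columns you need; Parquet is columnar, so this avoids
# touching the rest of the file.
roundtrip = pq.read_table("events.parquet", columns=["user_id"])
print(roundtrip.to_pydict())  # {'user_id': [1, 2, 3]}
```
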
## Stream Processing

- [Apache Beam](https://beam.apache.org/) - Apache Beam is a unified programming model that implements both batch and streaming data processing jobs that run on many execution engines. A tiny word-count sketch follows this list.
- [Spark Streaming](https://spark.apache.org/streaming/) - Spark Streaming makes it easy to build scalable fault-tolerant streaming applications.
- [Apache Flink](https://flink.apache.org/) - Apache Flink is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams.
- [Apache Storm](https://storm.apache.org) - Apache Storm is a free and open source distributed realtime computation system.
- [Apache Samza](https://samza.apache.org) - Apache Samza is a distributed stream processing framework.
- [Apache NiFi](https://nifi.apache.org/) - An easy to use, powerful, and reliable system to process and distribute data.
- [Apache Hudi](https://hudi.apache.org/) - An open source framework for managing storage for real time processing; one of its most interesting features is the upsert.
- [CocoIndex](https://github.com/cocoindex-io/cocoindex) - An open source ETL framework to build fresh indexes for AI.
- [VoltDB](https://voltdb.com/) - VoltDB is an ACID-compliant RDBMS which uses a [shared nothing architecture](https://en.wikipedia.org/wiki/Shared-nothing_architecture).
- [PipelineDB](https://github.com/pipelinedb/pipelinedb) - The Streaming SQL Database.
- [Spring Cloud Dataflow](https://cloud.spring.io/spring-cloud-dataflow/) - Streaming and tasks execution between Spring Boot apps.
- [Bonobo](https://www.bonobo-project.org/) - Bonobo is a data-processing toolkit for Python 3.5+.
- [Robinhood's Faust](https://github.com/faust-streaming/faust) - Forever scalable event processing & in-memory durable K/V store as a library with asyncio & static typing.
- [HStreamDB](https://github.com/hstreamdb/hstream) - The streaming database built for IoT data storage and real-time processing.
- [Kuiper](https://github.com/emqx/kuiper) - An edge lightweight IoT data analytics/streaming software implemented in Golang that can run on all kinds of resource-constrained edge devices.
- [Zilla](https://github.com/aklivity/zilla) - An API gateway built for event-driven architectures and streaming that supports standard protocols such as HTTP, SSE, gRPC, MQTT, and the native Kafka protocol.
- [SwimOS](https://github.com/swimos/swim-rust) - A framework for building real-time streaming data processing applications that supports a wide range of ingestion sources.

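As a flavor of the unified programming model, a minimal Apache Beam sketch using the Python SDK and the local DirectRunner; the input lines and output path are placeholders:

```python
# A tiny Apache Beam word-count pipeline (`pip install apache-beam`), run locally
# with the default DirectRunner. The in-memory input and output path are placeholders.
import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Create" >> beam.Create(["kafka spark flink", "spark beam", "flink"])
        | "SplitWords" >> beam.FlatMap(str.split)
        | "PairWithOne" >> beam.Map(lambda word: (word, 1))
        | "CountPerWord" >> beam.CombinePerKey(sum)
        | "Format" >> beam.MapTuple(lambda word, count: f"{word}: {count}")
        | "Write" >> beam.io.WriteToText("word_counts")
    )
```

The same pipeline code can later be handed to a distributed runner (for example Flink or Dataflow) by changing pipeline options rather than the transforms themselves.
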
## Batch Processing

- [Hadoop MapReduce](https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html) - Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
- [Spark](https://spark.apache.org/) - A multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters. A minimal PySpark job follows this list.
  - [Spark Packages](https://spark-packages.org) - A community index of packages for Apache Spark.
  - [Deep Spark](https://github.com/Stratio/deep-spark) - Connecting Apache Spark with different data stores. Deprecated.
  - [Spark RDD API Examples](https://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html) - Examples by Zhen He.
  - [Livy](https://livy.incubator.apache.org) - The REST Spark Server.
  - [Delight](https://github.com/datamechanics/delight) - A free & cross platform monitoring tool (Spark UI / Spark History Server alternative).
- [AWS EMR](https://aws.amazon.com/emr/) - A web service that makes it easy to quickly and cost-effectively process vast amounts of data.
- [Data Mechanics](https://www.datamechanics.co) - A cloud-based platform deployed on Kubernetes making Apache Spark more developer-friendly and cost-effective.
- [Tez](https://tez.apache.org/) - An application framework which allows for a complex directed-acyclic-graph of tasks for processing data.
- [Bistro](https://github.com/asavinov/bistro) - A light-weight engine for general-purpose data processing including both batch and stream analytics. It is based on a novel unique data model, which represents data via _functions_ and processes data via _column operations_ as opposed to having only set operations in conventional approaches like MapReduce or SQL.
- [Substation](https://github.com/brexhq/substation) - Substation is a cloud native data pipeline and transformation toolkit written in Go.
- Batch ML
  - [H2O](https://www.h2o.ai/) - Fast scalable machine learning API for smarter applications.
  - [Mahout](https://mahout.apache.org/) - An environment for quickly creating scalable performant machine learning applications.
  - [Spark MLlib](https://spark.apache.org/docs/latest/ml-guide.html) - Spark's scalable machine learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as underlying optimization primitives.
- Batch Graph
  - [GraphLab Create](https://turi.com/products/create/docs/) - A machine learning platform that enables data scientists and app developers to easily create intelligent apps at scale.
  - [Giraph](https://giraph.apache.org/) - An iterative graph processing system built for high scalability.
  - [Spark GraphX](https://spark.apache.org/graphx/) - Apache Spark's API for graphs and graph-parallel computation.
- Batch SQL
  - [Presto](https://prestodb.github.io/docs/current/index.html) - A distributed SQL query engine designed to query large data sets distributed over one or more heterogeneous data sources.
  - [Hive](https://hive.apache.org) - Data warehouse software that facilitates querying and managing large datasets residing in distributed storage.
    - [Hivemall](https://github.com/apache/incubator-hivemall) - Scalable machine learning library for Hive/Hadoop.
    - [PyHive](https://github.com/dropbox/PyHive) - Python interface to Hive and Presto.
  - [Drill](https://drill.apache.org/) - Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage.

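A minimal PySpark sketch of the batch pattern most of these engines share (read, transform, aggregate, write), assuming `pyspark` is installed and running in local mode; the file paths and column names are placeholders:

```python
# Minimal PySpark batch job (`pip install pyspark`); runs locally without a cluster.
# Input/output paths and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("example-batch-job").master("local[*]").getOrCreate()

df = spark.read.csv("events.csv", header=True, inferSchema=True)

daily_totals = (
    df.groupBy("event_date")
      .agg(F.count("*").alias("events"), F.sum("amount").alias("total_amount"))
)

daily_totals.write.mode("overwrite").parquet("daily_totals.parquet")
spark.stop()
```
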
## Charts and Dashboards

- [Highcharts](https://www.highcharts.com/) - A charting library written in pure JavaScript, offering an easy way of adding interactive charts to your web site or web application.
- [ZingChart](https://www.zingchart.com/) - Fast JavaScript charts for any data set.
- [C3.js](https://c3js.org) - D3-based reusable chart library.
- [D3.js](https://d3js.org/) - A JavaScript library for manipulating documents based on data.
  - [D3Plus](https://d3plus.org) - D3's simpler, easier-to-use cousin. Mostly predefined templates that you can just plug data into.
- [SmoothieCharts](https://smoothiecharts.org) - A JavaScript Charting Library for Streaming Data.
- [PyXley](https://github.com/stitchfix/pyxley) - Python helpers for building dashboards using Flask and React.
- [Plotly Dash](https://github.com/plotly/dash) - Flask, JS, and CSS boilerplate for interactive, web-based visualization apps in Python.
- [Apache Superset](https://github.com/apache/incubator-superset) - Apache Superset (incubating) - A modern, enterprise-ready business intelligence web application.
- [Redash](https://redash.io/) - Make Your Company Data Driven. Connect to any data source, easily visualize and share your data.
- [Metabase](https://github.com/metabase/metabase) - Metabase is the easy, open source way for everyone in your company to ask questions and learn from data.
- [PyQtGraph](https://www.pyqtgraph.org/) - PyQtGraph is a pure-Python graphics and GUI library built on PyQt4 / PySide and numpy. It is intended for use in mathematics / scientific / engineering applications.
- [Seaborn](https://seaborn.pydata.org) - A Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics. A small charting sketch follows this list.

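For quick exploratory charts in Python, a minimal Seaborn sketch, assuming `seaborn` and `matplotlib` are installed and using the bundled `tips` example dataset:

```python
# Quick statistical chart with Seaborn (`pip install seaborn matplotlib`).
# Uses the bundled "tips" example dataset, so no local data files are needed.
import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")     # small pandas DataFrame shipped with seaborn
sns.barplot(data=tips, x="day", y="total_bill")  # mean bill per day with error bars
plt.title("Average bill by day")
plt.tight_layout()
plt.savefig("avg_bill_by_day.png")  # or plt.show() in an interactive session
```
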
## Workflow

- [Luigi](https://github.com/spotify/luigi) - Luigi is a Python module that helps you build complex pipelines of batch jobs.
- [CronQ](https://github.com/seatgeek/cronq) - An application cron-like system. [Used](https://chairnerd.seatgeek.com/building-out-the-seatgeek-data-pipeline/) with Luigi. Deprecated.
- [Cascading](https://www.cascading.org/) - Java based application development platform.
- [Airflow](https://github.com/apache/airflow) - Airflow is a system to programmatically author, schedule, and monitor data pipelines. A minimal DAG sketch follows this list.
- [Azkaban](https://azkaban.github.io/) - Azkaban is a batch workflow job scheduler created at LinkedIn to run Hadoop jobs. Azkaban resolves the ordering through job dependencies and provides an easy-to-use web user interface to maintain and track your workflows.
- [Oozie](https://oozie.apache.org/) - Oozie is a workflow scheduler system to manage Apache Hadoop jobs.
- [Pinball](https://github.com/pinterest/pinball) - DAG based workflow manager. Job flows are defined programmatically in Python. Supports output passing between jobs.
- [Dagster](https://github.com/dagster-io/dagster) - Dagster is an open-source Python library for building data applications.
- [Hamilton](https://github.com/dagworks-inc/hamilton) - Hamilton is a lightweight library to define data transformations as a directed-acyclic graph (DAG). If you like dbt for SQL transforms, you will like Hamilton for Python processing.
- [Kedro](https://kedro.readthedocs.io/en/latest/) - Kedro is a framework that makes it easy to build robust and scalable data pipelines by providing uniform project templates, data abstraction, configuration and pipeline assembly.
- [Dataform](https://dataform.co/) - An open-source framework and web based IDE to manage datasets and their dependencies. SQLX extends your existing SQL warehouse dialect to add features that support dependency management, testing, documentation and more.
- [Census](https://getcensus.com/) - A reverse-ETL tool that lets you sync data from your cloud data warehouse to SaaS applications like Salesforce, Marketo, HubSpot, Zendesk, etc. No engineering favors required, just SQL.
- [dbt](https://getdbt.com/) - A command line tool that enables data analysts and engineers to transform data in their warehouses more effectively.
- [Kestra](https://kestra.io/) - A scalable, event-driven, language-agnostic orchestration and scheduling platform built on Java, designed to manage millions of workflows declaratively in code with an API-first architecture.
- [RudderStack](https://github.com/rudderlabs/rudder-server) - A warehouse-first Customer Data Platform that enables you to collect data from every application, website and SaaS platform, and then activate it in your warehouse and business tools.
- [PACE](https://github.com/getstrm/pace) - An open source framework that allows you to enforce agreements on how data should be accessed, used, and transformed, regardless of the data platform (Snowflake, BigQuery, DataBricks, etc.).
- [Prefect](https://prefect.io/) - Prefect is an orchestration and observability platform. With it, developers can rapidly build and scale resilient code, and triage disruptions effortlessly.
- [Multiwoven](https://github.com/Multiwoven/multiwoven) - The open-source reverse ETL, data activation platform for modern data teams.
- [SuprSend](https://www.suprsend.com/products/workflows) - Create automated workflows and logic using APIs for your notification service. Add templates, batching, preferences, and an in-app inbox with workflows to trigger notifications directly from your data warehouse.
- [Mage](https://www.mage.ai) - Open-source data pipeline tool for transforming and integrating data.

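A minimal Airflow DAG sketch, assuming Airflow 2.4+ (which accepts the `schedule` argument); the task logic is a placeholder, and in a real deployment this file would live in the `dags/` folder:

```python
# Minimal Airflow DAG sketch (Airflow 2.4+ style `schedule` argument).
# Task bodies are placeholders for real extract/load logic.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("pull data from the source system")


def load():
    print("write data into the warehouse")


with DAG(
    dag_id="example_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> load_task  # run extract before load
```
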
## Data Lake Management

- [lakeFS](https://github.com/treeverse/lakeFS) - lakeFS is an open source platform that delivers resilience and manageability to object-storage based data lakes.
- [Project Nessie](https://github.com/projectnessie/nessie) - Project Nessie is a Transactional Catalog for Data Lakes with Git-like semantics. Works with Apache Iceberg tables.
- [Ilum](https://ilum.cloud/) - Ilum is a modular Data Lakehouse platform that simplifies the management and monitoring of Apache Spark clusters across Kubernetes and Hadoop environments.
- [Gravitino](https://github.com/apache/gravitino) - Gravitino is an open-source, unified metadata management system for data lakes, data warehouses, and external catalogs.

## ELK Elastic Logstash Kibana

- [docker-logstash](https://github.com/pblittle/docker-logstash) - A highly configurable Logstash (1.4.4) Docker image running Elasticsearch (1.7.0) and Kibana (3.1.2).
- [elasticsearch-jdbc](https://github.com/jprante/elasticsearch-jdbc) - JDBC importer for Elasticsearch.
- [ZomboDB](https://github.com/zombodb/zombodb) - Postgres Extension that allows creating an index backed by Elasticsearch.

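Since this stack centers on Elasticsearch, here is a minimal indexing and search sketch with the official Python client (8.x-style keyword arguments); the host, index name, and document are placeholders:

```python
# Index and search a document with the official Elasticsearch Python client
# (`pip install elasticsearch`, 8.x API). Host, index name, and document are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.index(index="app-logs", document={"level": "ERROR", "message": "payment failed"})
es.indices.refresh(index="app-logs")  # make the document searchable immediately

result = es.search(index="app-logs", query={"match": {"message": "payment"}})
for hit in result["hits"]["hits"]:
    print(hit["_source"])
```
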
## Docker

- [Gockerize](https://github.com/redbooth/gockerize) - Package golang services into minimal Docker containers.
- [Flocker](https://github.com/ClusterHQ/flocker) - Easily manage Docker containers & their data.
- [Rancher](https://rancher.com/rancher-os/) - RancherOS is a 20 MB Linux distro that runs the entire OS as Docker containers.
- [Kontena](https://www.kontena.io/) - Application Containers for Masses.
- [Weave](https://github.com/weaveworks/weave) - Weaving Docker containers into applications.
- [Zodiac](https://github.com/CenturyLinkLabs/zodiac) - A lightweight tool for easy deployment and rollback of dockerized applications.
- [cAdvisor](https://github.com/google/cadvisor) - Analyzes resource usage and performance characteristics of running containers.
- [Micro S3 persistence](https://github.com/figadore/micro-s3-persistence) - Docker microservice for saving/restoring volume data to S3.
- [Rocker-compose](https://github.com/grammarly/rocker-compose) - Docker composition tool with idempotency features for deploying apps composed of multiple containers. Deprecated.
- [Nomad](https://github.com/hashicorp/nomad) - Nomad is a cluster manager, designed for both long-lived services and short-lived batch processing workloads.
- [ImageLayers](https://imagelayers.io/) - Visualize Docker images and the layers that compose them.

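For scripting containers from data pipelines, a minimal sketch with the Docker SDK for Python; it assumes a local Docker daemon is running, and the image and command are purely illustrative:

```python
# Run a throwaway container from Python with the Docker SDK (`pip install docker`).
# Assumes a local Docker daemon; the image and command are illustrative.
import docker

client = docker.from_env()

# Run a short-lived container and capture its stdout.
output = client.containers.run("alpine:3.19", ["echo", "hello from a container"], remove=True)
print(output.decode().strip())

# List containers currently running on the host.
for container in client.containers.list():
    print(container.name, container.image.tags)
```
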
## Datasets

### Realtime

- [Twitter Realtime](https://developer.twitter.com/en/docs/tweets/filter-realtime/overview) - The Streaming APIs give developers low latency access to Twitter's global stream of Tweet data.
- [Eventsim](https://github.com/Interana/eventsim) - Event data simulator. Generates a stream of pseudo-random events from a set of users, designed to simulate web traffic.
- [Reddit](https://www.reddit.com/r/datasets/comments/3mk1vg/realtime_data_is_available_including_comments/) - Real-time data is available including comments, submissions and links posted to reddit.

### Data Dumps

- [GitHub Archive](https://www.gharchive.org/) - GitHub's public timeline since 2011, updated every hour.
- [Common Crawl](https://commoncrawl.org/) - Open source repository of web crawl data.
- [Wikipedia](https://dumps.wikimedia.org/enwiki/latest/) - Wikipedia's complete copy of all wikis, in the form of Wikitext source and metadata embedded in XML. A number of raw database tables in SQL form are also available.

## Monitoring

### Prometheus

- [Prometheus.io](https://github.com/prometheus/prometheus) - An open-source service monitoring system and time series database. A minimal instrumentation sketch follows this list.
- [HAProxy Exporter](https://github.com/prometheus/haproxy_exporter) - Simple server that scrapes HAProxy stats and exports them via HTTP for Prometheus consumption.

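A minimal sketch of exposing application metrics for Prometheus to scrape, using the official Python client; the port and metric names are placeholders:

```python
# Expose application metrics to Prometheus with the official Python client
# (`pip install prometheus-client`). Port and metric names are placeholders.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Total requests handled")
LATENCY = Histogram("app_request_latency_seconds", "Request latency in seconds")

if __name__ == "__main__":
    start_http_server(8000)  # metrics served at http://localhost:8000/metrics
    while True:
        with LATENCY.time():           # observe how long the simulated "work" takes
            time.sleep(random.random() / 10)
        REQUESTS.inc()                 # count one handled request
```
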
## Profiling

### Data Profiler

- [Data Profiler](https://github.com/capitalone/dataprofiler) - The DataProfiler is a Python library designed to make data analysis, monitoring, and sensitive data detection easy. A short profiling sketch follows.

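A minimal DataProfiler sketch, assuming `pip install DataProfiler`; the input file path is a placeholder:

```python
# Profile a CSV with Capital One's DataProfiler (`pip install DataProfiler`).
# The file path is a placeholder; Data() auto-detects CSV/JSON/Parquet/text.
import json

from dataprofiler import Data, Profiler

data = Data("events.csv")      # load the file with type auto-detection
profile = Profiler(data)       # compute per-column statistics and entity labels

report = profile.report(report_options={"output_format": "compact"})
print(json.dumps(report, indent=2, default=str))
```
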
## Testing

- [Grai](https://github.com/grai-io/grai-core/) - A data catalog tool that integrates into your CI system, exposing downstream impact testing of data changes. These tests prevent data changes which might break data pipelines or BI dashboards from making it to production.
- [DQOps](https://github.com/dqops/dqo) - An open-source data quality platform for the whole data platform lifecycle, from profiling new data sources to applying full automation of data quality monitoring.
- [DataKitchen](https://datakitchen.io/) - Open Source Data Observability for end-to-end Data Journey observability, data profiling, anomaly detection, and auto-created data quality validation tests.
- [RunSQL](https://runsql.com/) - Free online SQL playground for MySQL, PostgreSQL, and SQL Server. Create database structures, run queries, and share results instantly.

## Community

### Forums

- [/r/dataengineering](https://www.reddit.com/r/dataengineering/) - News, tips, and background on Data Engineering.
- [/r/etl](https://www.reddit.com/r/ETL/) - Subreddit focused on ETL.

### Conferences

- [Data Council](https://www.datacouncil.ai/about) - Data Council is the first technical conference that bridges the gap between data scientists, data engineers and data analysts.

### Podcasts

- [Data Engineering Podcast](https://www.dataengineeringpodcast.com/) - The show about modern data infrastructure.
- [The Data Stack Show](https://datastackshow.com/) - A show where the hosts talk to data engineers, analysts, and data scientists about their experience building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

### Books

- [Snowflake Data Engineering](https://www.manning.com/books/snowflake-data-engineering) - A practical introduction to data engineering on the Snowflake cloud data platform.
- [Best Data Science Books](https://www.appliedaicourse.com/blog/data-science-books/) - A curated list of top data science books, categorized by topic and learning stage, to help readers build foundational knowledge and stay up to date with industry trends.

Source: [awesome-data-engineering on GitHub](https://github.com/igorbarinov/awesome-data-engineering)