[38;5;12m [39m[38;2;255;187;0m[1m[4mAwesome Data Engineering [0m[38;5;14m[1m[4m![0m[38;2;255;187;0m[1m[4mAwesome[0m[38;5;14m[1m[4m (https://awesome.re/badge-flat2.svg)[0m[38;2;255;187;0m[1m[4m (https://github.com/sindresorhus/awesome)[0m
[38;5;11m[1m▐[0m[38;5;12m [39m[38;5;12mA curated list of awesome things related to Data Engineering.[39m
[38;2;255;187;0m[4mContents[0m
[38;5;12m- [39m[38;5;14m[1mDatabases[0m[38;5;12m (#databases)[39m
[38;5;12m- [39m[38;5;14m[1mData Comparison[0m[38;5;12m (#data-comparison)[39m
[38;5;12m- [39m[38;5;14m[1mData Ingestion[0m[38;5;12m (#data-ingestion)[39m
[38;5;12m- [39m[38;5;14m[1mFile System[0m[38;5;12m (#file-system)[39m
[38;5;12m- [39m[38;5;14m[1mSerialization format[0m[38;5;12m (#serialization-format)[39m
[38;5;12m- [39m[38;5;14m[1mStream Processing[0m[38;5;12m (#stream-processing)[39m
[38;5;12m- [39m[38;5;14m[1mBatch Processing[0m[38;5;12m (#batch-processing)[39m
[38;5;12m- [39m[38;5;14m[1mCharts and Dashboards[0m[38;5;12m (#charts-and-dashboards)[39m
[38;5;12m- [39m[38;5;14m[1mWorkflow[0m[38;5;12m (#workflow)[39m
[38;5;12m- [39m[38;5;14m[1mData Lake Management[0m[38;5;12m (#data-lake-management)[39m
[38;5;12m- [39m[38;5;14m[1mELK Elastic Logstash Kibana[0m[38;5;12m (#elk-elastic-logstash-kibana)[39m
[38;5;12m- [39m[38;5;14m[1mDocker[0m[38;5;12m (#docker)[39m
[38;5;12m- [39m[38;5;14m[1mDatasets[0m[38;5;12m (#datasets)[39m
[38;5;12m - [39m[38;5;14m[1mRealtime[0m[38;5;12m (#realtime)[39m
[38;5;12m - [39m[38;5;14m[1mData Dumps[0m[38;5;12m (#data-dumps)[39m
[38;5;12m- [39m[38;5;14m[1mMonitoring[0m[38;5;12m (#monitoring)[39m
[38;5;12m - [39m[38;5;14m[1mPrometheus[0m[38;5;12m (#prometheus)[39m
[38;5;12m- [39m[38;5;14m[1mProfiling[0m[38;5;12m (#profiling)[39m
[38;5;12m - [39m[38;5;14m[1mData Profiler[0m[38;5;12m (#data-profiler)[39m
[38;5;12m- [39m[38;5;14m[1mTesting[0m[38;5;12m (#testing)[39m
[38;5;12m- [39m[38;5;14m[1mCommunity[0m[38;5;12m (#community)[39m
[38;5;12m - [39m[38;5;14m[1mForums[0m[38;5;12m (#forums)[39m
[38;5;12m - [39m[38;5;14m[1mConferences[0m[38;5;12m (#conferences)[39m
[38;5;12m - [39m[38;5;14m[1mPodcasts[0m[38;5;12m (#podcasts)[39m
[38;2;255;187;0m[4mDatabases[0m
[38;5;12m- Relational[39m
[38;5;12m - [39m[38;5;14m[1mRQLite[0m[38;5;12m (https://github.com/rqlite/rqlite) - Replicated SQLite using the Raft consensus protocol.[39m
[38;5;12m - [39m[38;5;14m[1mMySQL[0m[38;5;12m (https://www.mysql.com/) - The world's most popular open source database.[39m
[38;5;12m  - [39m[38;5;14m[1mTiDB[0m[38;5;12m (https://github.com/pingcap/tidb) - TiDB is a distributed NewSQL database compatible with the MySQL protocol.[39m
[38;5;12m  - [39m[38;5;14m[1mPercona XtraBackup[0m[38;5;12m (https://www.percona.com/software/mysql-database/percona-xtrabackup) - Percona XtraBackup is a free, open source, complete online backup solution for all versions of Percona Server, MySQL® and MariaDB®.[39m
[38;5;12m  - [39m[38;5;14m[1mmysql_utils[0m[38;5;12m (https://github.com/pinterest/mysql_utils) - Pinterest MySQL Management Tools.[39m
[38;5;12m - [39m[38;5;14m[1mMariaDB[0m[38;5;12m (https://mariadb.org/) - An enhanced, drop-in replacement for MySQL.[39m
[38;5;12m - [39m[38;5;14m[1mPostgreSQL[0m[38;5;12m (https://www.postgresql.org/) - The world's most advanced open source database.[39m
[38;5;12m - [39m[38;5;14m[1mAmazon RDS[0m[38;5;12m (https://aws.amazon.com/rds/) - Amazon RDS makes it easy to set up, operate, and scale a relational database in the cloud.[39m
[38;5;12m - [39m[38;5;14m[1mCrate.IO[0m[38;5;12m (https://crate.io/) - Scalable SQL database with NoSQL goodies.[39m
[38;5;12m- Key-Value[39m
[38;5;12m - [39m[38;5;14m[1mRedis[0m[38;5;12m (https://redis.io/) - An open source, BSD licensed, advanced key-value cache and store.[39m
[38;5;12m - [39m[38;5;14m[1mRiak[0m[38;5;12m (https://docs.basho.com/riak/kv/) - A distributed database designed to deliver maximum data availability by distributing data across multiple servers.[39m
[38;5;12m - [39m[38;5;14m[1mAWS DynamoDB[0m[38;5;12m (https://aws.amazon.com/dynamodb/) - A fast and flexible NoSQL database service for all applications that need consistent, single-digit millisecond latency at any scale.[39m
[38;5;12m - [39m[38;5;14m[1mHyperDex[0m[38;5;12m (https://github.com/rescrv/HyperDex) - HyperDex is a scalable, searchable key-value store. Deprecated.[39m
[38;5;12m - [39m[38;5;14m[1mSSDB[0m[38;5;12m (https://ssdb.io) - A high performance NoSQL database supporting many data structures, an alternative to Redis.[39m
[38;5;12m - [39m[38;5;14m[1mKyoto Tycoon[0m[38;5;12m (https://github.com/alticelabs/kyoto) - Kyoto Tycoon is a lightweight network server on top of the Kyoto Cabinet key-value database, built for high performance and concurrency.[39m
[38;5;12m - [39m[38;5;14m[1mIonDB[0m[38;5;12m (https://github.com/iondbproject/iondb) - A key-value store for microcontroller and IoT applications.[39m
[38;5;12m- Column[39m
[38;5;12m - [39m[38;5;14m[1mCassandra[0m[38;5;12m (https://cassandra.apache.org/) - The right choice when you need scalability and high availability without compromising performance.[39m
[38;5;12m  - [39m[38;5;14m[1mCassandra Calculator[0m[38;5;12m (https://www.ecyrd.com/cassandracalculator/) - This simple form allows you to try out different values for your Apache Cassandra cluster and see what the impact is for your application.[39m
[38;5;12m  - [39m[38;5;14m[1mCCM[0m[38;5;12m (https://github.com/pcmanus/ccm) - A script to easily create and destroy an Apache Cassandra cluster on localhost.[39m
[38;5;12m  - [39m[38;5;14m[1mScyllaDB[0m[38;5;12m (https://github.com/scylladb/scylla) - NoSQL data store using the Seastar framework, compatible with Apache Cassandra.[39m
[38;5;12m - [39m[38;5;14m[1mHBase[0m[38;5;12m (https://hbase.apache.org/) - The Hadoop database, a distributed, scalable, big data store.[39m
[38;5;12m - [39m[38;5;14m[1mAWS Redshift[0m[38;5;12m (https://aws.amazon.com/redshift/) - A fast, fully managed, petabyte-scale data warehouse that makes it simple and cost-effective to analyze all your data using your existing business intelligence tools.[39m
[38;5;12m - [39m[38;5;14m[1mFiloDB[0m[38;5;12m (https://github.com/filodb/FiloDB) - Distributed. Columnar. Versioned. Streaming. SQL.[39m
[38;5;12m - [39m[38;5;14m[1mVertica[0m[38;5;12m (https://www.vertica.com) - Distributed, MPP columnar database with extensive analytics SQL.[39m
[38;5;12m - [39m[38;5;14m[1mClickHouse[0m[38;5;12m (https://clickhouse.tech) - Distributed columnar DBMS for OLAP. SQL.[39m
[38;5;12m- Document[39m
[38;5;12m - [39m[38;5;14m[1mMongoDB[0m[38;5;12m (https://www.mongodb.com) - An open-source, document database designed for ease of development and scaling.[39m
[38;5;12m  - [39m[38;5;14m[1mPercona Server for MongoDB[0m[38;5;12m (https://www.percona.com/software/mongo-database/percona-server-for-mongodb) - Percona Server for MongoDB® is a free, enhanced, fully compatible, open source, drop-in replacement for the MongoDB® Community Edition that includes enterprise-grade features and functionality.[39m
[38;5;12m  - [39m[38;5;14m[1mMemDB[0m[38;5;12m (https://github.com/rain1017/memdb) - Distributed Transactional In-Memory Database (based on MongoDB).[39m
[38;5;12m - [39m[38;5;14m[1mElasticsearch[0m[38;5;12m (https://www.elastic.co/) - Search & Analyze Data in Real Time.[39m
[38;5;12m - [39m[38;5;14m[1mCouchbase[0m[38;5;12m (https://www.couchbase.com/) - The highest performing NoSQL distributed database.[39m
[38;5;12m - [39m[38;5;14m[1mRethinkDB[0m[38;5;12m (https://rethinkdb.com/) - The open-source database for the realtime web.[39m
[38;5;12m - [39m[38;5;14m[1mRavenDB[0m[38;5;12m (https://ravendb.net/) - Fully Transactional NoSQL Document Database.[39m
[38;5;12m- Graph[39m
[38;5;12m - [39m[38;5;14m[1mNeo4j[0m[38;5;12m (https://neo4j.com/) - The world's leading graph database.[39m
[38;5;12m - [39m[38;5;14m[1mOrientDB[0m[38;5;12m (https://orientdb.com) - 2nd Generation Distributed Graph Database with the flexibility of Documents in one product with an Open Source commercial friendly license.[39m
[38;5;12m - [39m[38;5;14m[1mArangoDB[0m[38;5;12m (https://www.arangodb.com/) - A distributed free and open-source database with a flexible data model for documents, graphs, and key-values.[39m
[38;5;12m - [39m[38;5;14m[1mTitan[0m[38;5;12m (https://titan.thinkaurelius.com) - A scalable graph database optimized for storing and querying graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster.[39m
[38;5;12m - [39m[38;5;14m[1mFlockDB[0m[38;5;12m (https://github.com/twitter-archive/flockdb) - A distributed, fault-tolerant graph database by Twitter. Deprecated.[39m
[38;5;12m- Distributed[39m
[38;5;12m - [39m[38;5;14m[1mDatomic[0m[38;5;12m (https://www.datomic.com) - The fully transactional, cloud-ready, distributed database.[39m
[38;5;12m - [39m[38;5;14m[1mApache Geode[0m[38;5;12m (https://geode.apache.org/) - An open source, distributed, in-memory database for scale-out applications.[39m
[38;5;12m - [39m[38;5;14m[1mGaffer[0m[38;5;12m (https://github.com/gchq/Gaffer) - A large-scale graph database.[39m
[38;5;12m- Timeseries[39m
[38;5;12m - [39m[38;5;14m[1mInfluxDB[0m[38;5;12m (https://github.com/influxdata/influxdb) - Scalable datastore for metrics, events, and real-time analytics.[39m
[38;5;12m - [39m[38;5;14m[1mOpenTSDB[0m[38;5;12m (https://github.com/OpenTSDB/opentsdb) - A scalable, distributed Time Series Database.[39m
[38;5;12m - [39m[38;5;14m[1mQuestDB[0m[38;5;12m (https://questdb.io/) - A relational column-oriented database designed for real-time analytics on time series and event data.[39m
[38;5;12m - [39m[38;5;14m[1mkairosdb[0m[38;5;12m (https://github.com/kairosdb/kairosdb) - Fast scalable time series database.[39m
[38;5;12m - [39m[38;5;14m[1mHeroic[0m[38;5;12m (https://github.com/spotify/heroic) - A scalable time series database based on Cassandra and Elasticsearch, by Spotify.[39m
[38;5;12m - [39m[38;5;14m[1mDruid[0m[38;5;12m (https://github.com/apache/incubator-druid) - Column oriented distributed data store ideal for powering interactive applications.[39m
[38;5;12m - [39m[38;5;14m[1mRiak-TS[0m[38;5;12m (https://basho.com/products/riak-ts/) - Riak TS is the only enterprise-grade NoSQL time series database optimized specifically for IoT and Time Series data.[39m
[38;5;12m - [39m[38;5;14m[1mAkumuli[0m[38;5;12m (https://github.com/akumuli/Akumuli) - Akumuli is a numeric time-series database. It can be used to capture, store and process time-series data in real time. The word "akumuli" can be translated from Esperanto as "accumulate".[39m
[38;5;12m - [39m[38;5;14m[1mRhombus[0m[38;5;12m (https://github.com/Pardot/Rhombus) - A time-series object store for Cassandra that handles all the complexity of building wide row indexes.[39m
[38;5;12m - [39m[38;5;14m[1mDalmatiner DB[0m[38;5;12m (https://github.com/dalmatinerdb/dalmatinerdb) - Fast distributed metrics database.[39m
[38;5;12m - [39m[38;5;14m[1mBlueflood[0m[38;5;12m (https://github.com/rackerlabs/blueflood) - A distributed system designed to ingest and process time series data.[39m
[38;5;12m - [39m[38;5;14m[1mTimely[0m[38;5;12m (https://github.com/NationalSecurityAgency/timely) - Timely is a time series database application that provides secure access to time series data based on Accumulo and Grafana.[39m
[38;5;12m- Other[39m
[38;5;12m - [39m[38;5;14m[1mTarantool[0m[38;5;12m (https://github.com/tarantool/tarantool/) - Tarantool is an in-memory database and application server.[39m
[38;5;12m - [39m[38;5;14m[1mGreenplum[0m[38;5;12m (https://github.com/greenplum-db/gpdb) - The Greenplum Database (GPDB) is an advanced, fully featured, open source data warehouse. It provides powerful and rapid analytics on petabyte-scale data volumes.[39m
[38;5;12m - [39m[38;5;14m[1mcayley[0m[38;5;12m (https://github.com/cayleygraph/cayley) - An open-source graph database inspired by the graph database behind Google's Knowledge Graph.[39m
[38;5;12m - [39m[38;5;14m[1mSnappydata[0m[38;5;12m (https://github.com/SnappyDataInc/snappydata) - SnappyData: OLTP + OLAP Database built on Apache Spark.[39m
[38;5;12m - [39m[38;5;14m[1mTimescaleDB[0m[38;5;12m (https://www.timescale.com/) - Built as an extension on top of PostgreSQL, TimescaleDB is a time-series SQL database providing fast analytics and scalability, with automated data management on a proven storage engine.[39m
[38;2;255;187;0m[4mData Comparison[0m
[38;5;12m- [39m[38;5;14m[1mdatacompy[0m[38;5;12m (https://github.com/capitalone/datacompy) - DataComPy is a Python library that facilitates the comparison of two DataFrames in pandas, Polars, Spark and more. The library goes beyond basic equality checks by providing detailed insights into discrepancies at both row and column levels.[39m
[38;2;255;187;0m[4mData Ingestion[0m
[38;5;12m- [39m[38;5;14m[1mKafka[0m[38;5;12m (https://kafka.apache.org/) - Publish-subscribe messaging rethought as a distributed commit log.[39m
[38;5;12m - [39m[38;5;14m[1mBottledWater[0m[38;5;12m (https://github.com/confluentinc/bottledwater-pg) - Change data capture from PostgreSQL into Kafka. Deprecated.[39m
[38;5;12m - [39m[38;5;14m[1mkafkat[0m[38;5;12m (https://github.com/airbnb/kafkat) - Simplified command-line administration for Kafka brokers.[39m
[38;5;12m - [39m[38;5;14m[1mkafkacat[0m[38;5;12m (https://github.com/edenhill/kafkacat) - Generic command line non-JVM Apache Kafka producer and consumer.[39m
[38;5;12m - [39m[38;5;14m[1mpg-kafka[0m[38;5;12m (https://github.com/xstevens/pg_kafka) - A PostgreSQL extension to produce messages to Apache Kafka.[39m
[38;5;12m - [39m[38;5;14m[1mlibrdkafka[0m[38;5;12m (https://github.com/edenhill/librdkafka) - The Apache Kafka C/C++ library.[39m
[38;5;12m - [39m[38;5;14m[1mkafka-docker[0m[38;5;12m (https://github.com/wurstmeister/kafka-docker) - Kafka in Docker.[39m
[38;5;12m - [39m[38;5;14m[1mkafka-manager[0m[38;5;12m (https://github.com/yahoo/kafka-manager) - A tool for managing Apache Kafka.[39m
[38;5;12m - [39m[38;5;14m[1mkafka-node[0m[38;5;12m (https://github.com/SOHU-Co/kafka-node) - Node.js client for Apache Kafka 0.8.[39m
[38;5;12m - [39m[38;5;14m[1mSecor[0m[38;5;12m (https://github.com/pinterest/secor) - Pinterest's Kafka to S3 distributed consumer.[39m
[38;5;12m - [39m[38;5;14m[1mKafka-logger[0m[38;5;12m (https://github.com/uber/kafka-logger) - Kafka-winston logger for Node.js from Uber.[39m
[38;5;12m- [39m[38;5;14m[1mAWS Kinesis[0m[38;5;12m (https://aws.amazon.com/kinesis/) - A fully managed, cloud-based service for real-time data processing over large, distributed data streams.[39m
[38;5;12m- [39m[38;5;14m[1mRabbitMQ[0m[38;5;12m (https://www.rabbitmq.com/) - Robust messaging for applications.[39m
[38;5;12m- [39m[38;5;14m[1mdlt[0m[38;5;12m (https://www.dlthub.com) - A fast and simple pipeline-building library for Python data developers that runs in notebooks, cloud functions, Airflow, etc.[39m
[38;5;12m- [39m[38;5;14m[1mFluentD[0m[38;5;12m (https://www.fluentd.org) - An open source data collector for a unified logging layer.[39m
[38;5;12m- [39m[38;5;14m[1mEmbulk[0m[38;5;12m (https://www.embulk.org) - An open source bulk data loader that helps transfer data between various databases, storage systems, file formats, and cloud services.[39m
[38;5;12m- [39m[38;5;14m[1mApache Sqoop[0m[38;5;12m (https://sqoop.apache.org) - A tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.[39m
[38;5;12m- [39m[38;5;14m[1mHeka[0m[38;5;12m (https://github.com/mozilla-services/heka) - Data Acquisition and Processing Made Easy. Deprecated.[39m
[38;5;12m- [39m[38;5;14m[1mGobblin[0m[38;5;12m (https://github.com/apache/incubator-gobblin) - Universal data ingestion framework for Hadoop from LinkedIn.[39m
[38;5;12m- [39m[38;5;14m[1mNakadi[0m[38;5;12m (https://nakadi.io) - Nakadi is an open source event messaging platform that provides a REST API on top of Kafka-like queues.[39m
[38;5;12m- [39m[38;5;14m[1mPravega[0m[38;5;12m (https://www.pravega.io) - Pravega provides a new storage abstraction - a stream - for continuous and unbounded data.[39m
[38;5;12m- [39m[38;5;14m[1mApache Pulsar[0m[38;5;12m (https://pulsar.apache.org/) - Apache Pulsar is an open-source distributed pub-sub messaging system.[39m
[38;5;12m- [39m[38;5;14m[1mAWS Data Wrangler[0m[38;5;12m (https://github.com/awslabs/aws-data-wrangler) - Utility belt to handle data on AWS.[39m
[38;5;12m- [39m[38;5;14m[1mAirbyte[0m[38;5;12m (https://airbyte.io/) - Open-source data integration for modern data teams.[39m
[38;5;12m- [39m[38;5;14m[1mSling[0m[38;5;12m (https://slingdata.io/) - Sling is a CLI data integration tool specialized in moving data between databases, as well as storage systems.[39m
[38;2;255;187;0m[4mFile System[0m
[38;5;12m- [39m[38;5;14m[1mHDFS[0m[38;5;12m (https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html) - A distributed file system designed to run on commodity hardware.[39m
[38;5;12m - [39m[38;5;14m[1mSnakebite[0m[38;5;12m (https://github.com/spotify/snakebite) - A pure Python HDFS client.[39m
[38;5;12m- [39m[38;5;14m[1mAWS S3[0m[38;5;12m (https://aws.amazon.com/s3/) - Object storage built to retrieve any amount of data from anywhere.[39m
[38;5;12m - [39m[38;5;14m[1msmart_open[0m[38;5;12m (https://github.com/RaRe-Technologies/smart_open) - Utils for streaming large files (S3, HDFS, gzip, bz2).[39m
[38;5;12m- [39m[38;5;14m[1mAlluxio[0m[38;5;12m (https://www.alluxio.org/) - Alluxio is a memory-centric distributed storage system enabling reliable data sharing at memory-speed across cluster frameworks, such as Spark and MapReduce.[39m
[38;5;12m- [39m[38;5;14m[1mCEPH[0m[38;5;12m (https://ceph.com/) - Ceph is a unified, distributed storage system designed for excellent performance, reliability and scalability.[39m
[38;5;12m- [39m[38;5;14m[1mOrangeFS[0m[38;5;12m (https://www.orangefs.org/) - Orange File System is a branch of the Parallel Virtual File System.[39m
[38;5;12m- [39m[38;5;14m[1mSnackFS[0m[38;5;12m (https://github.com/tuplejump/snackfs-release) - SnackFS is our bite-sized, lightweight HDFS compatible FileSystem built over Cassandra.[39m
[38;5;12m- [39m[38;5;14m[1mGlusterFS[0m[38;5;12m (https://www.gluster.org/) - Gluster Filesystem.[39m
[38;5;12m- [39m[38;5;14m[1mXtreemFS[0m[38;5;12m (https://www.xtreemfs.org/) - Fault-tolerant distributed file system for all storage needs.[39m
[38;5;12m- [39m[38;5;14m[1mSeaweedFS[0m[38;5;12m (https://github.com/chrislusf/seaweedfs) - Seaweed-FS is a simple and highly scalable distributed file system. It has two objectives: to store billions of files and to serve them fast. Instead of supporting full POSIX file system semantics, Seaweed-FS chooses to implement only a key-to-file mapping. By analogy with "NoSQL", you can call it "NoFS".[39m
[38;5;12m- [39m[38;5;14m[1mS3QL[0m[38;5;12m (https://github.com/s3ql/s3ql/) - S3QL is a file system that stores all its data online using storage services like Google Storage, Amazon S3, or OpenStack.[39m
[38;5;12m- [39m[38;5;14m[1mLizardFS[0m[38;5;12m (https://lizardfs.com/) - LizardFS Software Defined Storage is a distributed, parallel, scalable, fault-tolerant, Geo-Redundant and highly available file system.[39m
[38;2;255;187;0m[4mSerialization format[0m
[38;5;12m- [39m[38;5;14m[1mApache Avro[0m[38;5;12m (https://avro.apache.org) - Apache Avro™ is a data serialization system.[39m
[38;5;12m- [39m[38;5;14m[1mApache Parquet[0m[38;5;12m (https://parquet.apache.org) - Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.[39m
[38;5;12m - [39m[38;5;14m[1mSnappy[0m[38;5;12m (https://github.com/google/snappy) - A fast compressor/decompressor. Used with Parquet.[39m
[38;5;12m - [39m[38;5;14m[1mPigZ[0m[38;5;12m (https://zlib.net/pigz/) - A parallel implementation of gzip for modern multi-processor, multi-core machines.[39m
[38;5;12m- [39m[38;5;14m[1mApache ORC[0m[38;5;12m (https://orc.apache.org/) - The smallest, fastest columnar storage for Hadoop workloads.[39m
[38;5;12m- [39m[38;5;14m[1mApache Thrift[0m[38;5;12m (https://thrift.apache.org) - The Apache Thrift software framework, for scalable cross-language services development.[39m
[38;5;12m- [39m[38;5;14m[1mProtoBuf[0m[38;5;12m (https://github.com/protocolbuffers/protobuf) - Protocol Buffers - Google's data interchange format.[39m
[38;5;12m- [39m[38;5;14m[1mSequenceFile[0m[38;5;12m (https://wiki.apache.org/hadoop/SequenceFile) - SequenceFile is a flat file consisting of binary key/value pairs. It is extensively used in MapReduce as input/output formats.[39m
[38;5;12m- [39m[38;5;14m[1mKryo[0m[38;5;12m (https://github.com/EsotericSoftware/kryo) - Kryo is a fast and efficient object graph serialization framework for Java.[39m
[38;2;255;187;0m[4mStream Processing[0m
[38;5;12m- [39m[38;5;14m[1mApache Beam[0m[38;5;12m (https://beam.apache.org/) - Apache Beam is a unified programming model that implements both batch and streaming data processing jobs that run on many execution engines.[39m
[38;5;12m- [39m[38;5;14m[1mSpark Streaming[0m[38;5;12m (https://spark.apache.org/streaming/) - Spark Streaming makes it easy to build scalable fault-tolerant streaming applications.[39m
[38;5;12m- [39m[38;5;14m[1mApache Flink[0m[38;5;12m (https://flink.apache.org/) - Apache Flink is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams.[39m
[38;5;12m- [39m[38;5;14m[1mApache Storm[0m[38;5;12m (https://storm.apache.org) - Apache Storm is a free and open source distributed realtime computation system.[39m
[38;5;12m- [39m[38;5;14m[1mApache Samza[0m[38;5;12m (https://samza.apache.org) - Apache Samza is a distributed stream processing framework.[39m
[38;5;12m- [39m[38;5;14m[1mApache NiFi[0m[38;5;12m (https://nifi.apache.org/) - An easy to use, powerful, and reliable system to process and distribute data.[39m
[38;5;12m- [39m[38;5;14m[1mApache Hudi[0m[38;5;12m (https://hudi.apache.org/) - An open source framework for managing storage for real-time processing; one of its most interesting features is upsert support.[39m
|
||
[38;5;12m- [39m[38;5;14m[1mVoltDB[0m[38;5;12m (https://voltdb.com/) - VoltDB is an ACID-compliant RDBMS which uses a [39m[38;5;14m[1mshared nothing architecture[0m[38;5;12m (https://en.wikipedia.org/wiki/Shared-nothing_architecture).[39m
|
||
[38;5;12m- [39m[38;5;14m[1mPipelineDB[0m[38;5;12m (https://github.com/pipelinedb/pipelinedb) - The Streaming SQL Database.[39m
|
||
[38;5;12m- [39m[38;5;14m[1mSpring Cloud Dataflow[0m[38;5;12m (https://cloud.spring.io/spring-cloud-dataflow/) - Streaming and tasks execution between Spring Boot apps.[39m
|
||
[38;5;12m- [39m[38;5;14m[1mBonobo[0m[38;5;12m (https://www.bonobo-project.org/) - Bonobo is a data-processing toolkit for Python 3.5+.[39m
|
||
[38;5;12m- [39m[38;5;14m[1mRobinhood's Faust[0m[38;5;12m (https://github.com/faust-streaming/faust) - Forever scalable event processing & in-memory durable K/V store as a library with asyncio & static typing.[39m
|
||
[38;5;12m- [39m[38;5;14m[1mHStreamDB[0m[38;5;12m (https://github.com/hstreamdb/hstream) - The streaming database built for IoT data storage and real-time processing.[39m
|
||
[38;5;12m- [39m[38;5;14m[1mKuiper[0m[38;5;12m (https://github.com/emqx/kuiper) - Lightweight IoT data analytics and streaming software for the edge, implemented in Go, that can run on all kinds of resource-constrained edge devices.[39m
|
||
[38;5;12m-[39m[38;5;12m [39m[38;5;14m[1mZilla[0m[38;5;12m [39m[38;5;12m(https://github.com/aklivity/zilla)[39m[38;5;12m [39m[38;5;12m-[39m[38;5;12m [39m[38;5;12mAn[39m[38;5;12m [39m[38;5;12mAPI[39m[38;5;12m [39m[38;5;12mgateway[39m[38;5;12m [39m[38;5;12mbuilt[39m[38;5;12m [39m[38;5;12mfor[39m[38;5;12m [39m[38;5;12mevent-driven[39m[38;5;12m [39m[38;5;12marchitectures[39m[38;5;12m [39m[38;5;12mand[39m[38;5;12m [39m[38;5;12mstreaming[39m[38;5;12m [39m[38;5;12mthat[39m[38;5;12m [39m[38;5;12msupports[39m[38;5;12m [39m[38;5;12mstandard[39m[38;5;12m [39m[38;5;12mprotocols[39m[38;5;12m [39m[38;5;12msuch[39m[38;5;12m [39m[38;5;12mas[39m[38;5;12m [39m[38;5;12mHTTP,[39m[38;5;12m [39m[38;5;12mSSE,[39m[38;5;12m [39m[38;5;12mgRPC,[39m[38;5;12m [39m[38;5;12mMQTT[39m[38;5;12m [39m[38;5;12mand[39m[38;5;12m [39m[38;5;12mthe[39m[38;5;12m [39m[38;5;12mnative[39m
|
||
[38;5;12mKafka[39m[38;5;12m [39m[38;5;12mprotocol.[39m
|
||
|
||
[38;2;255;187;0m[4mBatch Processing[0m
|
||
|
||
[38;5;12m-[39m[38;5;12m [39m[38;5;14m[1mHadoop[0m[38;5;14m[1m [0m[38;5;14m[1mMapReduce[0m[38;5;12m [39m[38;5;12m(https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html)[39m[38;5;12m [39m[38;5;12m-[39m[38;5;12m [39m[38;5;12mHadoop[39m[38;5;12m [39m[38;5;12mMapReduce[39m[38;5;12m [39m[38;5;12mis[39m[38;5;12m [39m[38;5;12ma[39m[38;5;12m [39m[38;5;12msoftware[39m[38;5;12m [39m[38;5;12mframework[39m[38;5;12m [39m[38;5;12mfor[39m[38;5;12m [39m[38;5;12measily[39m[38;5;12m [39m
|
||
[38;5;12mwriting[39m[38;5;12m [39m[38;5;12mapplications[39m[38;5;12m [39m[38;5;12mwhich[39m[38;5;12m [39m[38;5;12mprocess[39m[38;5;12m [39m[38;5;12mvast[39m[38;5;12m [39m[38;5;12mamounts[39m[38;5;12m [39m[38;5;12mof[39m[38;5;12m [39m[38;5;12mdata[39m[38;5;12m [39m[38;5;12m(multi-terabyte[39m[38;5;12m [39m[38;5;12mdata-sets)[39m[38;5;12m [39m[38;5;12min-parallel[39m[38;5;12m [39m[38;5;12mon[39m[38;5;12m [39m[38;5;12mlarge[39m[38;5;12m [39m[38;5;12mclusters[39m[38;5;12m [39m[38;5;12m(thousands[39m[38;5;12m [39m[38;5;12mof[39m[38;5;12m [39m[38;5;12mnodes)[39m[38;5;12m [39m[38;5;12mof[39m[38;5;12m [39m[38;5;12mcommodity[39m[38;5;12m [39m[38;5;12mhardware[39m[38;5;12m [39m[38;5;12min[39m[38;5;12m [39m[38;5;12ma[39m[38;5;12m [39m[38;5;12mreliable,[39m[38;5;12m [39m[38;5;12mfault-tolerant[39m[38;5;12m [39m
|
||
[38;5;12mmanner.[39m
|
||
[38;5;12m- [39m[38;5;14m[1mSpark[0m[38;5;12m (https://spark.apache.org/) - A multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.[39m
|
||
[38;5;12m - [39m[38;5;14m[1mSpark Packages[0m[38;5;12m (https://spark-packages.org) - A community index of packages for Apache Spark.[39m
|
||
[38;5;12m - [39m[38;5;14m[1mDeep Spark[0m[38;5;12m (https://github.com/Stratio/deep-spark) - Connecting Apache Spark with different data stores. Deprecated.[39m
|
||
[38;5;12m - [39m[38;5;14m[1mSpark RDD API Examples[0m[38;5;12m (https://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html) - Examples by Zhen He.[39m
|
||
[38;5;12m - [39m[38;5;14m[1mLivy[0m[38;5;12m (https://livy.incubator.apache.org) - The REST Spark Server.[39m
|
||
[38;5;12m - [39m[38;5;14m[1mDelight[0m[38;5;12m (https://github.com/datamechanics/delight) - A free & cross platform monitoring tool (Spark UI / Spark History Server alternative).[39m
|
||
[38;5;12m- [39m[38;5;14m[1mAWS EMR[0m[38;5;12m (https://aws.amazon.com/emr/) - A web service that makes it easy to quickly and cost-effectively process vast amounts of data.[39m
|
||
[38;5;12m- [39m[38;5;14m[1mData Mechanics[0m[38;5;12m (https://www.datamechanics.co) - A cloud-based platform deployed on Kubernetes making Apache Spark more developer-friendly and cost-effective.[39m
|
||
[38;5;12m- [39m[38;5;14m[1mTez[0m[38;5;12m (https://tez.apache.org/) - An application framework which allows for a complex directed-acyclic-graph of tasks for processing data.[39m
|
||
[38;5;12m-[39m[38;5;12m [39m[38;5;14m[1mBistro[0m[38;5;12m [39m[38;5;12m(https://github.com/asavinov/bistro)[39m[38;5;12m [39m[38;5;12m-[39m[38;5;12m [39m[38;5;12mA[39m[38;5;12m [39m[38;5;12mlight-weight[39m[38;5;12m [39m[38;5;12mengine[39m[38;5;12m [39m[38;5;12mfor[39m[38;5;12m [39m[38;5;12mgeneral-purpose[39m[38;5;12m [39m[38;5;12mdata[39m[38;5;12m [39m[38;5;12mprocessing[39m[38;5;12m [39m[38;5;12mincluding[39m[38;5;12m [39m[38;5;12mboth[39m[38;5;12m [39m[38;5;12mbatch[39m[38;5;12m [39m[38;5;12mand[39m[38;5;12m [39m[38;5;12mstream[39m[38;5;12m [39m[38;5;12manalytics.[39m[38;5;12m [39m[38;5;12mIt[39m[38;5;12m [39m[38;5;12mis[39m[38;5;12m [39m[38;5;12mbased[39m[38;5;12m [39m[38;5;12mon[39m[38;5;12m [39m[38;5;12ma[39m[38;5;12m [39m[38;5;12mnovel[39m[38;5;12m [39m[38;5;12munique[39m[38;5;12m [39m[38;5;12mdata[39m[38;5;12m [39m[38;5;12mmodel,[39m[38;5;12m [39m
|
||
[38;5;12mwhich[39m[38;5;12m [39m[38;5;12mrepresents[39m[38;5;12m [39m[38;5;12mdata[39m[38;5;12m [39m[38;5;12mvia[39m[38;5;12m [39m[38;5;12m_functions_[39m[38;5;12m [39m[38;5;12mand[39m[38;5;12m [39m[38;5;12mprocesses[39m[38;5;12m [39m[38;5;12mdata[39m[38;5;12m [39m[38;5;12mvia[39m[38;5;12m [39m[38;5;12m_column[39m[38;5;12m [39m[38;5;12moperations_[39m[38;5;12m [39m[38;5;12mas[39m[38;5;12m [39m[38;5;12mopposed[39m[38;5;12m [39m[38;5;12mto[39m[38;5;12m [39m[38;5;12mhaving[39m[38;5;12m [39m[38;5;12monly[39m[38;5;12m [39m[38;5;12mset[39m[38;5;12m [39m[38;5;12moperations[39m[38;5;12m [39m[38;5;12min[39m[38;5;12m [39m[38;5;12mconventional[39m[38;5;12m [39m[38;5;12mapproaches[39m[38;5;12m [39m[38;5;12mlike[39m[38;5;12m [39m[38;5;12mMapReduce[39m[38;5;12m [39m[38;5;12mor[39m[38;5;12m [39m[38;5;12mSQL.[39m
|
||
|
||
[38;5;12m- Batch ML[39m
|
||
[38;5;12m - [39m[38;5;14m[1mH2O[0m[38;5;12m (https://www.h2o.ai/) - Fast scalable machine learning API for smarter applications.[39m
|
||
[38;5;12m - [39m[38;5;14m[1mMahout[0m[38;5;12m (https://mahout.apache.org/) - An environment for quickly creating scalable performant machine learning applications.[39m
|
||
[38;5;12m [39m[38;5;12m-[39m[38;5;12m [39m[38;5;14m[1mSpark[0m[38;5;14m[1m [0m[38;5;14m[1mMLlib[0m[38;5;12m [39m[38;5;12m(https://spark.apache.org/docs/latest/ml-guide.html)[39m[38;5;12m [39m[38;5;12m-[39m[38;5;12m [39m[38;5;12mSpark's[39m[38;5;12m [39m[38;5;12mscalable[39m[38;5;12m [39m[38;5;12mmachine[39m[38;5;12m [39m[38;5;12mlearning[39m[38;5;12m [39m[38;5;12mlibrary[39m[38;5;12m [39m[38;5;12mconsisting[39m[38;5;12m [39m[38;5;12mof[39m[38;5;12m [39m[38;5;12mcommon[39m[38;5;12m [39m[38;5;12mlearning[39m[38;5;12m [39m[38;5;12malgorithms[39m[38;5;12m [39m[38;5;12mand[39m[38;5;12m [39m[38;5;12mutilities,[39m[38;5;12m [39m[38;5;12mincluding[39m[38;5;12m [39m
|
||
[38;5;12mclassification,[39m[38;5;12m [39m[38;5;12mregression,[39m[38;5;12m [39m[38;5;12mclustering,[39m[38;5;12m [39m[38;5;12mcollaborative[39m[38;5;12m [39m[38;5;12mfiltering,[39m[38;5;12m [39m[38;5;12mdimensionality[39m[38;5;12m [39m[38;5;12mreduction,[39m[38;5;12m [39m[38;5;12mas[39m[38;5;12m [39m[38;5;12mwell[39m[38;5;12m [39m[38;5;12mas[39m[38;5;12m [39m[38;5;12munderlying[39m[38;5;12m [39m[38;5;12moptimization[39m[38;5;12m [39m[38;5;12mprimitives.[39m
|
||
[38;5;12m- Batch Graph[39m
|
||
[38;5;12m - [39m[38;5;14m[1mGraphLab Create[0m[38;5;12m (https://turi.com/products/create/docs/) - A machine learning platform that enables data scientists and app developers to easily create intelligent apps at scale.[39m
|
||
[38;5;12m - [39m[38;5;14m[1mGiraph[0m[38;5;12m (https://giraph.apache.org/) - An iterative graph processing system built for high scalability.[39m
|
||
[38;5;12m - [39m[38;5;14m[1mSpark GraphX[0m[38;5;12m (https://spark.apache.org/graphx/) - Apache Spark's API for graphs and graph-parallel computation.[39m
|
||
[38;5;12m- Batch SQL[39m
|
||
[38;5;12m - [39m[38;5;14m[1mPresto[0m[38;5;12m (https://prestodb.github.io/docs/current/index.html) - A distributed SQL query engine designed to query large data sets distributed over one or more heterogeneous data sources.[39m
|
||
[38;5;12m - [39m[38;5;14m[1mHive[0m[38;5;12m (https://hive.apache.org) - Data warehouse software that facilitates querying and managing large datasets residing in distributed storage.[39m
|
||
[38;5;12m - [39m[38;5;14m[1mHivemall[0m[38;5;12m (https://github.com/apache/incubator-hivemall) - Scalable machine learning library for Hive/Hadoop.[39m
|
||
[38;5;12m - [39m[38;5;14m[1mPyHive[0m[38;5;12m (https://github.com/dropbox/PyHive) - Python interface to Hive and Presto.[39m
|
||
[38;5;12m - [39m[38;5;14m[1mDrill[0m[38;5;12m (https://drill.apache.org/) - Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage.[39m
|
||
|
||
[38;2;255;187;0m[4mCharts and Dashboards[0m
|
||
|
||
[38;5;12m- [39m[38;5;14m[1mHighcharts[0m[38;5;12m (https://www.highcharts.com/) - A charting library written in pure JavaScript, offering an easy way of adding interactive charts to your web site or web application.[39m
|
||
[38;5;12m- [39m[38;5;14m[1mZingChart[0m[38;5;12m (https://www.zingchart.com/) - Fast JavaScript charts for any data set.[39m
|
||
[38;5;12m- [39m[38;5;14m[1mC3.js[0m[38;5;12m (https://c3js.org) - D3-based reusable chart library.[39m
|
||
[38;5;12m- [39m[38;5;14m[1mD3.js[0m[38;5;12m (https://d3js.org/) - A JavaScript library for manipulating documents based on data.[39m
|
||
[38;5;12m - [39m[38;5;14m[1mD3Plus[0m[38;5;12m (https://d3plus.org) - D3's simpler, easier to use cousin. Mostly predefined templates that you can just plug data into.[39m
|
||
[38;5;12m- [39m[38;5;14m[1mSmoothieCharts[0m[38;5;12m (https://smoothiecharts.org) - A JavaScript Charting Library for Streaming Data.[39m
|
||
[38;5;12m- [39m[38;5;14m[1mPyXley[0m[38;5;12m (https://github.com/stitchfix/pyxley) - Python helpers for building dashboards using Flask and React.[39m
|
||
[38;5;12m- [39m[38;5;14m[1mPlotly[0m[38;5;12m (https://github.com/plotly/dash) - Flask, JS, and CSS boilerplate for interactive, web-based visualization apps in Python.[39m
|
||
[38;5;12m- [39m[38;5;14m[1mApache Superset[0m[38;5;12m (https://github.com/apache/incubator-superset) - Apache Superset (incubating) - A modern, enterprise-ready business intelligence web application.[39m
|
||
[38;5;12m- [39m[38;5;14m[1mRedash[0m[38;5;12m (https://redash.io/) - Make Your Company Data Driven. Connect to any data source, easily visualize and share your data.[39m
|
||
[38;5;12m- [39m[38;5;14m[1mMetabase[0m[38;5;12m (https://github.com/metabase/metabase) - Metabase is the easy, open source way for everyone in your company to ask questions and learn from data.[39m
|
||
[38;5;12m-[39m[38;5;12m [39m[38;5;14m[1mPyQtGraph[0m[38;5;12m [39m[38;5;12m(https://www.pyqtgraph.org/)[39m[38;5;12m [39m[38;5;12m-[39m[38;5;12m [39m[38;5;12mPyQtGraph[39m[38;5;12m [39m[38;5;12mis[39m[38;5;12m [39m[38;5;12ma[39m[38;5;12m [39m[38;5;12mpure-python[39m[38;5;12m [39m[38;5;12mgraphics[39m[38;5;12m [39m[38;5;12mand[39m[38;5;12m [39m[38;5;12mGUI[39m[38;5;12m [39m[38;5;12mlibrary[39m[38;5;12m [39m[38;5;12mbuilt[39m[38;5;12m [39m[38;5;12mon[39m[38;5;12m [39m[38;5;12mPyQt4[39m[38;5;12m [39m[38;5;12m/[39m[38;5;12m [39m[38;5;12mPySide[39m[38;5;12m [39m[38;5;12mand[39m[38;5;12m [39m[38;5;12mnumpy.[39m[38;5;12m [39m[38;5;12mIt[39m[38;5;12m [39m[38;5;12mis[39m[38;5;12m [39m[38;5;12mintended[39m[38;5;12m [39m[38;5;12mfor[39m[38;5;12m [39m[38;5;12muse[39m[38;5;12m [39m[38;5;12min[39m[38;5;12m [39m[38;5;12mmathematics[39m[38;5;12m [39m[38;5;12m/[39m[38;5;12m [39m[38;5;12mscientific[39m[38;5;12m [39m[38;5;12m/[39m[38;5;12m [39m
|
||
[38;5;12mengineering[39m[38;5;12m [39m[38;5;12mapplications.[39m
|
||
|
||
[38;2;255;187;0m[4mWorkflow[0m
|
||
|
||
[38;5;12m- [39m[38;5;14m[1mLuigi[0m[38;5;12m (https://github.com/spotify/luigi) - Luigi is a Python module that helps you build complex pipelines of batch jobs.[39m
|
||
[38;5;12m - [39m[38;5;14m[1mCronQ[0m[38;5;12m (https://github.com/seatgeek/cronq) - An application cron-like system. [39m[38;5;14m[1mUsed[0m[38;5;12m (https://chairnerd.seatgeek.com/building-out-the-seatgeek-data-pipeline/) w/Luigi. Deprecated.[39m
|
||
[38;5;12m- [39m[38;5;14m[1mCascading[0m[38;5;12m (https://www.cascading.org/) - Java based application development platform.[39m
|
||
[38;5;12m- [39m[38;5;14m[1mAirflow[0m[38;5;12m (https://github.com/apache/airflow) - Airflow is a system to programmatically author, schedule and monitor data pipelines.[39m
|
||
[38;5;12m-[39m[38;5;12m [39m[38;5;14m[1mAzkaban[0m[38;5;12m [39m[38;5;12m(https://azkaban.github.io/)[39m[38;5;12m [39m[38;5;12m-[39m[38;5;12m [39m[38;5;12mAzkaban[39m[38;5;12m [39m[38;5;12mis[39m[38;5;12m [39m[38;5;12ma[39m[38;5;12m [39m[38;5;12mbatch[39m[38;5;12m [39m[38;5;12mworkflow[39m[38;5;12m [39m[38;5;12mjob[39m[38;5;12m [39m[38;5;12mscheduler[39m[38;5;12m [39m[38;5;12mcreated[39m[38;5;12m [39m[38;5;12mat[39m[38;5;12m [39m[38;5;12mLinkedIn[39m[38;5;12m [39m[38;5;12mto[39m[38;5;12m [39m[38;5;12mrun[39m[38;5;12m [39m[38;5;12mHadoop[39m[38;5;12m [39m[38;5;12mjobs.[39m[38;5;12m [39m[38;5;12mAzkaban[39m[38;5;12m [39m[38;5;12mresolves[39m[38;5;12m [39m[38;5;12mthe[39m[38;5;12m [39m[38;5;12mordering[39m[38;5;12m [39m[38;5;12mthrough[39m[38;5;12m [39m[38;5;12mjob[39m[38;5;12m [39m[38;5;12mdependencies[39m[38;5;12m [39m[38;5;12mand[39m[38;5;12m [39m[38;5;12mprovides[39m[38;5;12m [39m
|
||
[38;5;12man[39m[38;5;12m [39m[38;5;12measy[39m[38;5;12m [39m[38;5;12mto[39m[38;5;12m [39m[38;5;12muse[39m[38;5;12m [39m[38;5;12mweb[39m[38;5;12m [39m[38;5;12muser[39m[38;5;12m [39m[38;5;12minterface[39m[38;5;12m [39m[38;5;12mto[39m[38;5;12m [39m[38;5;12mmaintain[39m[38;5;12m [39m[38;5;12mand[39m[38;5;12m [39m[38;5;12mtrack[39m[38;5;12m [39m[38;5;12myour[39m[38;5;12m [39m[38;5;12mworkflows.[39m
|
||
[38;5;12m- [39m[38;5;14m[1mOozie[0m[38;5;12m (https://oozie.apache.org/) - Oozie is a workflow scheduler system to manage Apache Hadoop jobs.[39m
|
||
[38;5;12m- [39m[38;5;14m[1mPinball[0m[38;5;12m (https://github.com/pinterest/pinball) - DAG based workflow manager. Job flows are defined programmatically in Python. Supports output passing between jobs.[39m
|
||
[38;5;12m- [39m[38;5;14m[1mDagster[0m[38;5;12m (https://github.com/dagster-io/dagster) - Dagster is an open-source Python library for building data applications.[39m
|
||
[38;5;12m-[39m[38;5;12m [39m[38;5;14m[1mKedro[0m[38;5;12m [39m[38;5;12m(https://kedro.readthedocs.io/en/latest/)[39m[38;5;12m [39m[38;5;12m-[39m[38;5;12m [39m[38;5;12mKedro[39m[38;5;12m [39m[38;5;12mis[39m[38;5;12m [39m[38;5;12ma[39m[38;5;12m [39m[38;5;12mframework[39m[38;5;12m [39m[38;5;12mthat[39m[38;5;12m [39m[38;5;12mmakes[39m[38;5;12m [39m[38;5;12mit[39m[38;5;12m [39m[38;5;12measy[39m[38;5;12m [39m[38;5;12mto[39m[38;5;12m [39m[38;5;12mbuild[39m[38;5;12m [39m[38;5;12mrobust[39m[38;5;12m [39m[38;5;12mand[39m[38;5;12m [39m[38;5;12mscalable[39m[38;5;12m [39m[38;5;12mdata[39m[38;5;12m [39m[38;5;12mpipelines[39m[38;5;12m [39m[38;5;12mby[39m[38;5;12m [39m[38;5;12mproviding[39m[38;5;12m [39m[38;5;12muniform[39m[38;5;12m [39m[38;5;12mproject[39m[38;5;12m [39m[38;5;12mtemplates,[39m[38;5;12m [39m[38;5;12mdata[39m[38;5;12m [39m
|
||
[38;5;12mabstraction,[39m[38;5;12m [39m[38;5;12mconfiguration[39m[38;5;12m [39m[38;5;12mand[39m[38;5;12m [39m[38;5;12mpipeline[39m[38;5;12m [39m[38;5;12massembly.[39m
|
||
[38;5;12m-[39m[38;5;12m [39m[38;5;14m[1mDataform[0m[38;5;12m [39m[38;5;12m(https://dataform.co/)[39m[38;5;12m [39m[38;5;12m-[39m[38;5;12m [39m[38;5;12mAn[39m[38;5;12m [39m[38;5;12mopen-source[39m[38;5;12m [39m[38;5;12mframework[39m[38;5;12m [39m[38;5;12mand[39m[38;5;12m [39m[38;5;12mweb[39m[38;5;12m [39m[38;5;12mbased[39m[38;5;12m [39m[38;5;12mIDE[39m[38;5;12m [39m[38;5;12mto[39m[38;5;12m [39m[38;5;12mmanage[39m[38;5;12m [39m[38;5;12mdatasets[39m[38;5;12m [39m[38;5;12mand[39m[38;5;12m [39m[38;5;12mtheir[39m[38;5;12m [39m[38;5;12mdependencies.[39m[38;5;12m [39m[38;5;12mSQLX[39m[38;5;12m [39m[38;5;12mextends[39m[38;5;12m [39m[38;5;12myour[39m[38;5;12m [39m[38;5;12mexisting[39m[38;5;12m [39m[38;5;12mSQL[39m[38;5;12m [39m[38;5;12mwarehouse[39m[38;5;12m [39m[38;5;12mdialect[39m[38;5;12m [39m[38;5;12mto[39m[38;5;12m [39m[38;5;12madd[39m[38;5;12m [39m[38;5;12mfeatures[39m[38;5;12m [39m[38;5;12mthat[39m
|
||
[38;5;12msupport[39m[38;5;12m [39m[38;5;12mdependency[39m[38;5;12m [39m[38;5;12mmanagement,[39m[38;5;12m [39m[38;5;12mtesting,[39m[38;5;12m [39m[38;5;12mdocumentation[39m[38;5;12m [39m[38;5;12mand[39m[38;5;12m [39m[38;5;12mmore.[39m
|
||
[38;5;12m-[39m[38;5;12m [39m[38;5;14m[1mCensus[0m[38;5;12m [39m[38;5;12m(https://getcensus.com/)[39m[38;5;12m [39m[38;5;12m-[39m[38;5;12m [39m[38;5;12mA[39m[38;5;12m [39m[38;5;12mreverse-ETL[39m[38;5;12m [39m[38;5;12mtool[39m[38;5;12m [39m[38;5;12mthat[39m[38;5;12m [39m[38;5;12mlets[39m[38;5;12m [39m[38;5;12myou[39m[38;5;12m [39m[38;5;12msync[39m[38;5;12m [39m[38;5;12mdata[39m[38;5;12m [39m[38;5;12mfrom[39m[38;5;12m [39m[38;5;12myour[39m[38;5;12m [39m[38;5;12mcloud[39m[38;5;12m [39m[38;5;12mdata[39m[38;5;12m [39m[38;5;12mwarehouse[39m[38;5;12m [39m[38;5;12mto[39m[38;5;12m [39m[38;5;12mSaaS[39m[38;5;12m [39m[38;5;12mapplications[39m[38;5;12m [39m[38;5;12mlike[39m[38;5;12m [39m[38;5;12mSalesforce,[39m[38;5;12m [39m[38;5;12mMarketo,[39m[38;5;12m [39m[38;5;12mHubSpot,[39m[38;5;12m [39m[38;5;12mZendesk,[39m[38;5;12m [39m[38;5;12metc.[39m[38;5;12m [39m[38;5;12mNo[39m[38;5;12m [39m
|
||
[38;5;12mengineering[39m[38;5;12m [39m[38;5;12mfavors[39m[38;5;12m [39m[38;5;12mrequired—just[39m[38;5;12m [39m[38;5;12mSQL.[39m
|
||
[38;5;12m- [39m[38;5;14m[1mdbt[0m[38;5;12m (https://getdbt.com/) - A command line tool that enables data analysts and engineers to transform data in their warehouses more effectively.[39m
|
||
[38;5;12m-[39m[38;5;12m [39m[38;5;14m[1mRudderStack[0m[38;5;12m [39m[38;5;12m(https://github.com/rudderlabs/rudder-server)[39m[38;5;12m [39m[38;5;12m-[39m[38;5;12m [39m[38;5;12mA[39m[38;5;12m [39m[38;5;12mwarehouse-first[39m[38;5;12m [39m[38;5;12mCustomer[39m[38;5;12m [39m[38;5;12mData[39m[38;5;12m [39m[38;5;12mPlatform[39m[38;5;12m [39m[38;5;12mthat[39m[38;5;12m [39m[38;5;12menables[39m[38;5;12m [39m[38;5;12myou[39m[38;5;12m [39m[38;5;12mto[39m[38;5;12m [39m[38;5;12mcollect[39m[38;5;12m [39m[38;5;12mdata[39m[38;5;12m [39m[38;5;12mfrom[39m[38;5;12m [39m[38;5;12mevery[39m[38;5;12m [39m[38;5;12mapplication,[39m[38;5;12m [39m[38;5;12mwebsite[39m[38;5;12m [39m[38;5;12mand[39m[38;5;12m [39m[38;5;12mSaaS[39m[38;5;12m [39m[38;5;12mplatform,[39m[38;5;12m [39m[38;5;12mand[39m[38;5;12m [39m
|
||
[38;5;12mthen[39m[38;5;12m [39m[38;5;12mactivate[39m[38;5;12m [39m[38;5;12mit[39m[38;5;12m [39m[38;5;12min[39m[38;5;12m [39m[38;5;12myour[39m[38;5;12m [39m[38;5;12mwarehouse[39m[38;5;12m [39m[38;5;12mand[39m[38;5;12m [39m[38;5;12mbusiness[39m[38;5;12m [39m[38;5;12mtools.[39m
|
||
[38;5;12m-[39m[38;5;12m [39m[38;5;14m[1mPACE[0m[38;5;12m [39m[38;5;12m(https://github.com/getstrm/pace)[39m[38;5;12m [39m[38;5;12m-[39m[38;5;12m [39m[38;5;12mAn[39m[38;5;12m [39m[38;5;12mopen[39m[38;5;12m [39m[38;5;12msource[39m[38;5;12m [39m[38;5;12mframework[39m[38;5;12m [39m[38;5;12mthat[39m[38;5;12m [39m[38;5;12mallows[39m[38;5;12m [39m[38;5;12myou[39m[38;5;12m [39m[38;5;12mto[39m[38;5;12m [39m[38;5;12menforce[39m[38;5;12m [39m[38;5;12magreements[39m[38;5;12m [39m[38;5;12mon[39m[38;5;12m [39m[38;5;12mhow[39m[38;5;12m [39m[38;5;12mdata[39m[38;5;12m [39m[38;5;12mshould[39m[38;5;12m [39m[38;5;12mbe[39m[38;5;12m [39m[38;5;12maccessed,[39m[38;5;12m [39m[38;5;12mused,[39m[38;5;12m [39m[38;5;12mand[39m[38;5;12m [39m[38;5;12mtransformed,[39m[38;5;12m [39m[38;5;12mregardless[39m[38;5;12m [39m[38;5;12mof[39m[38;5;12m [39m[38;5;12mthe[39m[38;5;12m [39m[38;5;12mdata[39m[38;5;12m [39m
|
||
[38;5;12mplatform[39m[38;5;12m [39m[38;5;12m(Snowflake,[39m[38;5;12m [39m[38;5;12mBigQuery,[39m[38;5;12m [39m[38;5;12mDatabricks,[39m[38;5;12m [39m[38;5;12metc.)[39m
|
||
[38;5;12m- [39m[38;5;14m[1mPrefect[0m[38;5;12m (https://prefect.io/) - Prefect is an orchestration and observability platform. With it, developers can rapidly build and scale resilient code, and triage disruptions effortlessly.[39m
|
||
[38;5;12m- [39m[38;5;14m[1mMultiwoven[0m[38;5;12m (https://github.com/Multiwoven/multiwoven) - The open-source reverse ETL, data activation platform for modern data teams.[39m
|
||
[38;5;12m-[39m[38;5;12m [39m[38;5;14m[1mSuprSend[0m[38;5;12m [39m[38;5;12m(https://www.suprsend.com/products/workflows)[39m[38;5;12m [39m[38;5;12m-[39m[38;5;12m [39m[38;5;12mCreate[39m[38;5;12m [39m[38;5;12mautomated[39m[38;5;12m [39m[38;5;12mworkflows[39m[38;5;12m [39m[38;5;12mand[39m[38;5;12m [39m[38;5;12mlogic[39m[38;5;12m [39m[38;5;12musing[39m[38;5;12m [39m[38;5;12mAPIs[39m[38;5;12m [39m[38;5;12mfor[39m[38;5;12m [39m[38;5;12myour[39m[38;5;12m [39m[38;5;12mnotification[39m[38;5;12m [39m[38;5;12mservice.[39m[38;5;12m [39m[38;5;12mAdd[39m[38;5;12m [39m[38;5;12mtemplates,[39m[38;5;12m [39m[38;5;12mbatching,[39m[38;5;12m [39m[38;5;12mpreferences,[39m[38;5;12m [39m[38;5;12min-app[39m[38;5;12m [39m[38;5;12minbox[39m[38;5;12m [39m
|
||
[38;5;12mwith[39m[38;5;12m [39m[38;5;12mworkflows[39m[38;5;12m [39m[38;5;12mto[39m[38;5;12m [39m[38;5;12mtrigger[39m[38;5;12m [39m[38;5;12mnotifications[39m[38;5;12m [39m[38;5;12mdirectly[39m[38;5;12m [39m[38;5;12mfrom[39m[38;5;12m [39m[38;5;12myour[39m[38;5;12m [39m[38;5;12mdata[39m[38;5;12m [39m[38;5;12mwarehouse.[39m
|
||
|
||
[38;2;255;187;0m[4mData Lake Management[0m
|
||
|
||
[38;5;12m- [39m[38;5;14m[1mlakeFS[0m[38;5;12m (https://github.com/treeverse/lakeFS) - lakeFS is an open source platform that delivers resilience and manageability to object-storage based data lakes.[39m
|
||
[38;5;12m- [39m[38;5;14m[1mProject Nessie[0m[38;5;12m (https://github.com/projectnessie/nessie) - Project Nessie is a Transactional Catalog for Data Lakes with Git-like semantics. Works with Apache Iceberg tables.[39m
|
||
|
||
[38;2;255;187;0m[4mELK Elastic Logstash Kibana[0m
|
||
|
||
[38;5;12m- [39m[38;5;14m[1mdocker-logstash[0m[38;5;12m (https://github.com/pblittle/docker-logstash) - A highly configurable Logstash (1.4.4) Docker image running Elasticsearch (1.7.0) and Kibana (3.1.2).[39m
|
||
[38;5;12m- [39m[38;5;14m[1melasticsearch-jdbc[0m[38;5;12m (https://github.com/jprante/elasticsearch-jdbc) - JDBC importer for Elasticsearch.[39m
|
||
[38;5;12m- [39m[38;5;14m[1mZomboDB[0m[38;5;12m (https://github.com/zombodb/zombodb) - Postgres Extension that allows creating an index backed by Elasticsearch.[39m
|
||
|
||
[38;2;255;187;0m[4mDocker[0m
|
||
|
||
[38;5;12m- [39m[38;5;14m[1mGockerize[0m[38;5;12m (https://github.com/redbooth/gockerize) - Package golang service into minimal docker containers.[39m
|
||
[38;5;12m- [39m[38;5;14m[1mFlocker[0m[38;5;12m (https://github.com/ClusterHQ/flocker) - Easily manage Docker containers & their data.[39m
|
||
[38;5;12m- [39m[38;5;14m[1mRancher[0m[38;5;12m (https://rancher.com/rancher-os/) - RancherOS is a 20 MB Linux distro that runs the entire OS as Docker containers.[39m
|
||
[38;5;12m- [39m[38;5;14m[1mKontena[0m[38;5;12m (https://www.kontena.io/) - Application Containers for Masses.[39m
|
||
[38;5;12m- [39m[38;5;14m[1mWeave[0m[38;5;12m (https://github.com/weaveworks/weave) - Weaving Docker containers into applications.[39m
|
||
[38;5;12m- [39m[38;5;14m[1mZodiac[0m[38;5;12m (https://github.com/CenturyLinkLabs/zodiac) - A lightweight tool for easy deployment and rollback of dockerized applications.[39m
|
||
[38;5;12m- [39m[38;5;14m[1mcAdvisor[0m[38;5;12m (https://github.com/google/cadvisor) - Analyzes resource usage and performance characteristics of running containers.[39m
|
||
[38;5;12m- [39m[38;5;14m[1mMicro S3 persistence[0m[38;5;12m (https://github.com/figadore/micro-s3-persistence) - Docker microservice for saving/restoring volume data to S3.[39m
|
||
[38;5;12m- [39m[38;5;14m[1mRocker-compose[0m[38;5;12m (https://github.com/grammarly/rocker-compose) - Docker composition tool with idempotency features for deploying apps composed of multiple containers. Deprecated.[39m
|
||
[38;5;12m- [39m[38;5;14m[1mNomad[0m[38;5;12m (https://github.com/hashicorp/nomad) - Nomad is a cluster manager, designed for both long lived services and short lived batch processing workloads.[39m
|
||
[38;5;12m- [39m[38;5;14m[1mImageLayers[0m[38;5;12m (https://imagelayers.io/) - Visualize Docker images and the layers that compose them.[39m
|
||
|
||
[38;2;255;187;0m[4mDatasets[0m
|
||
|
||
[38;2;255;187;0m[4mRealtime[0m
|
||
|
||
[38;5;12m- [39m[38;5;14m[1mTwitter Realtime[0m[38;5;12m (https://developer.twitter.com/en/docs/tweets/filter-realtime/overview) - The Streaming APIs give developers low latency access to Twitter's global stream of Tweet data.[39m
|
||
[38;5;12m- [39m[38;5;14m[1mEventsim[0m[38;5;12m (https://github.com/Interana/eventsim) - Event data simulator. Generates a stream of pseudo-random events from a set of users, designed to simulate web traffic.[39m
|
||
[38;5;12m-[39m[38;5;12m [39m[38;5;14m[1mReddit[0m[38;5;12m [39m[38;5;12m(https://www.reddit.com/r/datasets/comments/3mk1vg/realtime_data_is_available_including_comments/)[39m[38;5;12m [39m[38;5;12m-[39m[38;5;12m [39m[38;5;12mReal-time[39m[38;5;12m [39m[38;5;12mdata[39m[38;5;12m [39m[38;5;12mis[39m[38;5;12m [39m[38;5;12mavailable[39m[38;5;12m [39m[38;5;12mincluding[39m[38;5;12m [39m[38;5;12mcomments,[39m[38;5;12m [39m[38;5;12msubmissions[39m[38;5;12m [39m[38;5;12mand[39m[38;5;12m [39m[38;5;12mlinks[39m[38;5;12m [39m[38;5;12mposted[39m[38;5;12m [39m[38;5;12mto[39m[38;5;12m [39m
|
||
[38;5;12mreddit.[39m
|
||
|
||
[38;2;255;187;0m[4mData Dumps[0m
|
||
|
||
[38;5;12m- [39m[38;5;14m[1mGitHub Archive[0m[38;5;12m (https://www.gharchive.org/) - GitHub's public timeline since 2011, updated every hour.[39m
|
||
[38;5;12m- [39m[38;5;14m[1mCommon Crawl[0m[38;5;12m (https://commoncrawl.org/) - Open source repository of web crawl data.[39m
|
||
[38;5;12m-[39m[38;5;12m [39m[38;5;14m[1mWikipedia[0m[38;5;12m [39m[38;5;12m(https://dumps.wikimedia.org/enwiki/latest/)[39m[38;5;12m [39m[38;5;12m-[39m[38;5;12m [39m[38;5;12mWikipedia's[39m[38;5;12m [39m[38;5;12mcomplete[39m[38;5;12m [39m[38;5;12mcopy[39m[38;5;12m [39m[38;5;12mof[39m[38;5;12m [39m[38;5;12mall[39m[38;5;12m [39m[38;5;12mwikis,[39m[38;5;12m [39m[38;5;12min[39m[38;5;12m [39m[38;5;12mthe[39m[38;5;12m [39m[38;5;12mform[39m[38;5;12m [39m[38;5;12mof[39m[38;5;12m [39m[38;5;12mwikitext[39m[38;5;12m [39m[38;5;12msource[39m[38;5;12m [39m[38;5;12mand[39m[38;5;12m [39m[38;5;12mmetadata[39m[38;5;12m [39m[38;5;12membedded[39m[38;5;12m [39m[38;5;12min[39m[38;5;12m [39m[38;5;12mXML.[39m[38;5;12m [39m[38;5;12mA[39m[38;5;12m [39m[38;5;12mnumber[39m[38;5;12m [39m[38;5;12mof[39m[38;5;12m [39m[38;5;12mraw[39m[38;5;12m [39m[38;5;12mdatabase[39m[38;5;12m [39m[38;5;12mtables[39m
|
||
[38;5;12min[39m[38;5;12m [39m[38;5;12mSQL[39m[38;5;12m [39m[38;5;12mform[39m[38;5;12m [39m[38;5;12mare[39m[38;5;12m [39m[38;5;12malso[39m[38;5;12m [39m[38;5;12mavailable.[39m
|
||
|
||
[38;2;255;187;0m[4mMonitoring[0m
|
||
|
||
[38;2;255;187;0m[4mPrometheus[0m
|
||
|
||
[38;5;12m- [39m[38;5;14m[1mPrometheus.io[0m[38;5;12m (https://github.com/prometheus/prometheus) - An open-source service monitoring system and time series database.[39m
|
||
[38;5;12m- [39m[38;5;14m[1mHAProxy Exporter[0m[38;5;12m (https://github.com/prometheus/haproxy_exporter) - Simple server that scrapes HAProxy stats and exports them via HTTP for Prometheus consumption.[39m
|
||
|
||
[38;2;255;187;0m[4mProfiling[0m
|
||
|
||
[38;2;255;187;0m[4mData Profiler[0m
|
||
[38;5;12m- [39m[38;5;14m[1mData Profiler[0m[38;5;12m (https://github.com/capitalone/dataprofiler) - The DataProfiler is a Python library designed to make data analysis, monitoring, and sensitive data detection easy.[39m
|
||
|
||
|
||
[38;2;255;187;0m[4mTesting[0m
|
||
|
||
[38;5;12m-[39m[38;5;12m [39m[38;5;14m[1mGrai[0m[38;5;12m [39m[38;5;12m(https://github.com/grai-io/grai-core/)[39m[38;5;12m [39m[38;5;12m-[39m[38;5;12m [39m[38;5;12mA[39m[38;5;12m [39m[38;5;12mdata[39m[38;5;12m [39m[38;5;12mcatalog[39m[38;5;12m [39m[38;5;12mtool[39m[38;5;12m [39m[38;5;12mthat[39m[38;5;12m [39m[38;5;12mintegrates[39m[38;5;12m [39m[38;5;12minto[39m[38;5;12m [39m[38;5;12myour[39m[38;5;12m [39m[38;5;12mCI[39m[38;5;12m [39m[38;5;12msystem[39m[38;5;12m [39m[38;5;12mexposing[39m[38;5;12m [39m[38;5;12mdownstream[39m[38;5;12m [39m[38;5;12mimpact[39m[38;5;12m [39m[38;5;12mtesting[39m[38;5;12m [39m[38;5;12mof[39m[38;5;12m [39m[38;5;12mdata[39m[38;5;12m [39m[38;5;12mchanges.[39m[38;5;12m [39m[38;5;12mThese[39m[38;5;12m [39m[38;5;12mtests[39m[38;5;12m [39m[38;5;12mprevent[39m[38;5;12m [39m[38;5;12mdata[39m[38;5;12m [39m[38;5;12mchanges[39m[38;5;12m [39m
|
||
[38;5;12mwhich[39m[38;5;12m [39m[38;5;12mmight[39m[38;5;12m [39m[38;5;12mbreak[39m[38;5;12m [39m[38;5;12mdata[39m[38;5;12m [39m[38;5;12mpipelines[39m[38;5;12m [39m[38;5;12mor[39m[38;5;12m [39m[38;5;12mBI[39m[38;5;12m [39m[38;5;12mdashboards[39m[38;5;12m [39m[38;5;12mfrom[39m[38;5;12m [39m[38;5;12mmaking[39m[38;5;12m [39m[38;5;12mit[39m[38;5;12m [39m[38;5;12mto[39m[38;5;12m [39m[38;5;12mproduction.[39m
|
||
[38;5;12m-[39m[38;5;12m [39m[38;5;14m[1mDQOps[0m[38;5;12m [39m[38;5;12m(https://github.com/dqops/dqo)[39m[38;5;12m [39m[38;5;12m-[39m[38;5;12m [39m[38;5;12mAn[39m[38;5;12m [39m[38;5;12mopen-source[39m[38;5;12m [39m[38;5;12mdata[39m[38;5;12m [39m[38;5;12mquality[39m[38;5;12m [39m[38;5;12mplatform[39m[38;5;12m [39m[38;5;12mfor[39m[38;5;12m [39m[38;5;12mthe[39m[38;5;12m [39m[38;5;12mwhole[39m[38;5;12m [39m[38;5;12mdata[39m[38;5;12m [39m[38;5;12mplatform[39m[38;5;12m [39m[38;5;12mlifecycle[39m[38;5;12m [39m[38;5;12mfrom[39m[38;5;12m [39m[38;5;12mprofiling[39m[38;5;12m [39m[38;5;12mnew[39m[38;5;12m [39m[38;5;12mdata[39m[38;5;12m [39m[38;5;12msources[39m[38;5;12m [39m[38;5;12mto[39m[38;5;12m [39m[38;5;12mapplying[39m[38;5;12m [39m[38;5;12mfull[39m[38;5;12m [39m[38;5;12mautomation[39m[38;5;12m [39m[38;5;12mof[39m[38;5;12m [39m[38;5;12mdata[39m[38;5;12m [39m[38;5;12mquality[39m
|
||
[38;5;12mmonitoring.[39m
|
||
|
||
[38;2;255;187;0m[4mCommunity[0m
|
||
|
||
[38;2;255;187;0m[4mForums[0m
|
||
|
||
[38;5;12m- [39m[38;5;14m[1m/r/dataengineering[0m[38;5;12m (https://www.reddit.com/r/dataengineering/) - News, tips and background on Data Engineering.[39m
|
||
[38;5;12m- [39m[38;5;14m[1m/r/etl[0m[38;5;12m (https://www.reddit.com/r/ETL/) - Subreddit focused on ETL.[39m
|
||
|
||
[38;2;255;187;0m[4mConferences[0m
|
||
|
||
[38;5;12m- [39m[38;5;14m[1mData Council[0m[38;5;12m (https://www.datacouncil.ai/about) - Data Council is the first technical conference that bridges the gap between data scientists, data engineers and data analysts.[39m
|
||
|
||
[38;2;255;187;0m[4mPodcasts[0m
|
||
|
||
[38;5;12m- [39m[38;5;14m[1mData Engineering Podcast[0m[38;5;12m (https://www.dataengineeringpodcast.com/) - The show about modern data infrastructure.[39m
|
||
[38;5;12m-[39m[38;5;12m [39m[38;5;14m[1mThe[0m[38;5;14m[1m [0m[38;5;14m[1mData[0m[38;5;14m[1m [0m[38;5;14m[1mStack[0m[38;5;14m[1m [0m[38;5;14m[1mShow[0m[38;5;12m [39m[38;5;12m(https://datastackshow.com/)[39m[38;5;12m [39m[38;5;12m-[39m[38;5;12m [39m[38;5;12mA[39m[38;5;12m [39m[38;5;12mshow[39m[38;5;12m [39m[38;5;12mwhere[39m[38;5;12m [39m[38;5;12mthey[39m[38;5;12m [39m[38;5;12mtalk[39m[38;5;12m [39m[38;5;12mto[39m[38;5;12m [39m[38;5;12mdata[39m[38;5;12m [39m[38;5;12mengineers,[39m[38;5;12m [39m[38;5;12manalysts,[39m[38;5;12m [39m[38;5;12mand[39m[38;5;12m [39m[38;5;12mdata[39m[38;5;12m [39m[38;5;12mscientists[39m[38;5;12m [39m[38;5;12mabout[39m[38;5;12m [39m[38;5;12mtheir[39m[38;5;12m [39m[38;5;12mexperience[39m[38;5;12m [39m[38;5;12maround[39m[38;5;12m [39m[38;5;12mbuilding[39m[38;5;12m [39m[38;5;12mand[39m[38;5;12m [39m[38;5;12mmaintaining[39m[38;5;12m [39m[38;5;12mdata[39m[38;5;12m [39m
|
||
[38;5;12minfrastructure,[39m[38;5;12m [39m[38;5;12mdelivering[39m[38;5;12m [39m[38;5;12mdata[39m[38;5;12m [39m[38;5;12mand[39m[38;5;12m [39m[38;5;12mdata[39m[38;5;12m [39m[38;5;12mproducts,[39m[38;5;12m [39m[38;5;12mand[39m[38;5;12m [39m[38;5;12mdriving[39m[38;5;12m [39m[38;5;12mbetter[39m[38;5;12m [39m[38;5;12moutcomes[39m[38;5;12m [39m[38;5;12macross[39m[38;5;12m [39m[38;5;12mtheir[39m[38;5;12m [39m[38;5;12mbusinesses[39m[38;5;12m [39m[38;5;12mwith[39m[38;5;12m [39m[38;5;12mdata.[39m
|