Data Science Tutorials & Resources for Beginners !Awesome (https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)
(https://github.com/sindresorhus/awesome)
If you want to know more about Data Science but don't know where to start, this list is for you! :chart_with_upwards_trend:
No previous knowledge is required, but Python and statistics basics will definitely come in handy. These resources have been used successfully by many beginners at my local Data Science student
group ML-KA (http://ml-ka.de/).
What is Data Science?
- 'What is Data Science?' on Quora (https://www.quora.com/What-is-data-science)
- Explanation of important vocabulary (https://www.quora.com/What-is-the-difference-between-Data-Analytics-Data-Analysis-Data-Mining-Data-Science-Machine-Learning-and-Big-Data-1?share=1) -
Differentiation of Big Data, Machine Learning, Data Science.
- Data Science for Business (Book) (https://amzn.to/2voPJUi) - An introduction to Data Science and its use as a business asset.
Common Algorithms and Procedures
- Supervised vs unsupervised learning (https://stackoverflow.com/questions/1832076/what-is-the-difference-between-supervised-learning-and-unsupervised-learning) - The two most common types of
Machine Learning algorithms.
- 9 important Data Science algorithms and their implementation (https://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/05.05-Naive-Bayes.ipynb)
- Cross validation (https://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/05.03-Hyperparameters-and-Model-Validation.ipynb) - Evaluate the performance of
your algorithm / model.
- Feature engineering (https://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/05.04-Feature-Engineering.ipynb) - Transform the data so that models can make better
predictions.
- Scientific introduction to 10 important Data Science algorithms (http://www.cs.umd.edu/%7Esamir/498/10Algorithms-08.pdf)
- Model ensemble: Explanation (https://www.analyticsvidhya.com/blog/2017/02/introduction-to-ensembling-along-with-implementation-in-r/) - Combine multiple models into one for better
performance.
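To make the cross validation entry above more concrete, here is a minimal sketch of k-fold cross validation written with the Python standard library only. The helper names (`k_fold_indices`, `cross_validate_mean_model`) and the toy "always predict the training mean" model are assumptions made for illustration; they do not come from any of the linked resources.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and deal them into k folds (hypothetical helper)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate_mean_model(y, k=5):
    """Return one mean squared error per fold for a toy model
    that always predicts the mean of its training targets."""
    folds = k_fold_indices(len(y), k)
    scores = []
    for i, test_idx in enumerate(folds):
        # Every fold except the i-th one is used for training.
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        prediction = mean(y[idx] for idx in train_idx)
        # The held-out fold is used only for evaluation.
        scores.append(mean((y[idx] - prediction) ** 2 for idx in test_idx))
    return scores


y = [float(v) for v in range(20)]
scores = cross_validate_mean_model(y, k=5)
print(scores)
```

Each sample lands in the held-out fold exactly once, so the average of the per-fold scores estimates how the model behaves on data it has not seen.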
Data Science using Python
This list covers only Python, as many are already familiar with this language. For R, see Data Science tutorials using R (https://github.com/ujjwalkarn/DataScienceR).
General
- O'Reilly Data Science from Scratch (Book) (https://amzn.to/2GSjjrK) - Data processing, implementation, and visualization with example code.
- Coursera Applied Data Science (https://www.coursera.org/specializations/data-science-python) - Online Course using Python that covers most of the relevant toolkits.
Learning Python
- YouTube tutorial series by sentdex (https://www.youtube.com/watch?v=oVp1vrfL_w4&list=PLQVvvaa0QuDe8XSftW-RAxdo6OmaeL85M)
- Interactive Python tutorial website (http://www.learnpython.org/)
numpy
numpy (http://www.numpy.org/) is a Python library which provides large multidimensional arrays and fast mathematical operations on them.
- Numpy tutorial on DataCamp (https://www.datacamp.com/community/tutorials/python-numpy-tutorial#gs.h3DvLnk)
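As a quick taste of what "multidimensional arrays and fast mathematical operations" looks like in practice, the following sketch builds a small array and applies a few vectorized operations. The variable names and values are made up for illustration.

```python
import numpy as np

# A 2x3 array holding the integers 0..5.
matrix = np.arange(6).reshape(2, 3)

# Arithmetic is applied element-wise, without an explicit Python loop.
doubled = matrix * 2

# Reductions can run along an axis: axis=0 sums down each column.
column_sums = matrix.sum(axis=0)

# Broadcasting: the per-column mean (shape (3,)) is subtracted from every row.
centered = matrix - matrix.mean(axis=0)

print(doubled)
print(column_sums)
print(centered)
```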
pandas
pandas (http://pandas.pydata.org/index.html) provides efficient data structures and analysis tools for Python. It is built on top of numpy.
- Introduction to pandas (http://www.synesthesiam.com/posts/an-introduction-to-pandas.html)
- DataCamp pandas foundations (https://www.datacamp.com/courses/pandas-foundations) - Paid course, but 30 free days upon account creation (enough to complete the course).
- Pandas cheatsheet (https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf) - Quick overview of the most important functions.
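The central pandas data structure is the DataFrame, a table with labeled columns. The sketch below shows filtering and grouping on a tiny hand-made table; the column names and numbers are invented for illustration only.

```python
import pandas as pd

# A small table with one categorical and one numeric column (made-up data).
df = pd.DataFrame({
    "city": ["Karlsruhe", "Berlin", "Karlsruhe", "Berlin"],
    "price": [10.0, 12.0, 14.0, 8.0],
})

# Boolean indexing keeps only the rows where the condition holds.
expensive = df[df["price"] > 9]

# groupby splits the rows by city and aggregates each group.
mean_price = df.groupby("city")["price"].mean()

print(expensive)
print(mean_price)
```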
scikit-learn
scikit-learn (http://scikit-learn.org/stable/) is the most common library for Machine Learning and Data Science in Python.
- Introduction and first model application (https://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/05.02-Introducing-Scikit-Learn.ipynb)
- Rough guide for choosing estimators (http://scikit-learn.org/stable/tutorial/machine_learning_map/)
- Scikit-learn complete user guide (http://scikit-learn.org/stable/user_guide.html)
- Model ensemble: Implementation in Python (http://machinelearningmastery.com/ensemble-machine-learning-algorithms-python-scikit-learn/)
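Most scikit-learn estimators share the same interface, which makes it easy to combine them. The following sketch, assuming a recent scikit-learn version, builds a small voting ensemble on the bundled iris dataset and scores it with 5-fold cross validation. The choice of models and parameters is arbitrary and only meant as a starting point, not as a recommendation from the linked tutorials.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Small example dataset that ships with scikit-learn.
X, y = load_iris(return_X_y=True)

# Soft voting averages the predicted class probabilities of both models.
ensemble = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    voting="soft",
)

# One accuracy value per fold; the mean is a rough performance estimate.
scores = cross_val_score(ensemble, X, y, cv=5)
print(scores.mean())
```

Swapping in other estimators only requires changing the `estimators` list, since they all expose the same `fit` and `predict` methods.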
Jupyter Notebook
Jupyter Notebook (https://jupyter.org/) is a web application for easy data visualisation and code presentation.
- Downloading and running first Jupyter notebook (https://jupyter.org/install.html)
- Example notebook for data exploration (https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-instacart)
- Seaborn data visualization tutorial (https://elitedatascience.com/python-seaborn-tutorial) - Plot library that works great with Jupyter.
Various other helpful tools and resources
- Template folder structure for organizing Data Science projects (https://github.com/drivendata/cookiecutter-data-science)
- Anaconda Python distribution (https://www.continuum.io/downloads) - Contains most of the important Python packages for Data Science.
- spaCy (https://spacy.io/) - Open-source toolkit for working with text-based data.
- LightGBM gradient boosting framework (https://github.com/Microsoft/LightGBM) - Successfully used in many Kaggle challenges.
- Amazon AWS (https://aws.amazon.com/) - Rent cloud servers for more time-consuming calculations (r4.xlarge server is a good place to start).
Data Science Challenges for Beginners
Sorted by increasing complexity.
- Walkthrough: House prices challenge (https://www.dataquest.io/blog/kaggle-getting-started/) - Walkthrough of a simple challenge on house prices.
- Blood Donation Challenge (https://www.drivendata.org/competitions/2/warm-up-predict-blood-donations/) - Predict if a donor will donate again.
- Titanic Challenge (https://www.kaggle.com/c/titanic) - Predict survival on the Titanic.
- Water Pump Challenge (https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/) - Predict the operating condition of water pumps in Africa.
More advanced resources and lists
- Awesome Data Science (https://github.com/bulutyazilim/awesome-datascience)
- Data Science Python (https://github.com/ujjwalkarn/DataSciencePython)
- Machine Learning Tutorials (https://github.com/ujjwalkarn/Machine-Learning-Tutorials)
Contribute
Contributions welcome! Read the contribution guidelines (contributing.md) first.
License
!CC0 (http://mirrors.creativecommons.org/presskit/buttons/88x31/svg/cc-zero.svg) (http://creativecommons.org/publicdomain/zero/1.0)
To the extent possible under law, Simon Böhm has waived all copyright and
related or neighboring rights to this work. Disclaimer: Some of the links are affiliate links.