# Data Science Tutorials & Resources for Beginners [![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome)
*If you want to know more about Data Science but don't know where to start, this list is for you!* :chart_with_upwards_trend:
No previous knowledge is required, but Python and statistics basics will definitely come in handy. These resources have been used successfully by many beginners at my local Data Science student group [ML-KA](http://ml-ka.de/).
## What is Data Science?
- ['What is Data Science?' on Quora](https://www.quora.com/What-is-data-science)
- [Explanation of important vocabulary](https://www.quora.com/What-is-the-difference-between-Data-Analytics-Data-Analysis-Data-Mining-Data-Science-Machine-Learning-and-Big-Data-1?share=1) - Differentiation of Big Data, Machine Learning, Data Science.
- [Data Science for Business (Book)](https://amzn.to/2voPJUi) - An introduction to Data Science and its use as a business asset.
- [Data Science Process: A Beginners Comprehensive Guide](https://www.scaler.com/blog/data-science-process/) - Emphasizes the practical technical skills needed throughout the data science process.
## Common Algorithms and Procedures
- [Supervised vs unsupervised learning](https://stackoverflow.com/questions/1832076/what-is-the-difference-between-supervised-learning-and-unsupervised-learning) - The two most common types of Machine Learning algorithms.
- [9 important Data Science algorithms and their implementation](https://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/05.05-Naive-Bayes.ipynb)
- [Cross validation](https://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/05.03-Hyperparameters-and-Model-Validation.ipynb) - Evaluate the performance of your algorithm/model.
- [Feature engineering](https://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/05.04-Feature-Engineering.ipynb) - Modify the data to improve model predictions.
- [Scientific introduction to 10 important Data Science algorithms](http://www.cs.umd.edu/%7Esamir/498/10Algorithms-08.pdf)
- [Model ensemble: Explanation](https://www.analyticsvidhya.com/blog/2017/02/introduction-to-ensembling-along-with-implementation-in-r/) - Combine multiple models into one for better performance.
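
To make the cross validation idea above concrete, here is a minimal sketch in plain Python. It assumes the simplest variant (k contiguous folds, no shuffling); the helper name `k_fold_indices` is hypothetical and not taken from any of the linked resources. Each sample lands in the test set exactly once, and the model would be trained on the remaining indices.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, roughly equal folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        # Everything outside the current fold is used for training.
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

In practice you would usually shuffle the data first and let a library handle the splitting; the linked notebook covers that.
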
## Data Science using Python
This list covers only Python, as many are already familiar with this language. For R, see [Data Science tutorials using R](https://github.com/ujjwalkarn/DataScienceR).
### General
- [O'Reilly Data Science from Scratch (Book)](https://amzn.to/2GSjjrK) - Data processing, implementation, and visualization with example code.
- [Coursera Applied Data Science](https://www.coursera.org/specializations/data-science-python) - Online Course using Python that covers most of the relevant toolkits.
### Learning Python
- [YouTube tutorial series by sentdex](https://www.youtube.com/watch?v=oVp1vrfL_w4&list=PLQVvvaa0QuDe8XSftW-RAxdo6OmaeL85M)
- [Interactive Python tutorial website](http://www.learnpython.org/)
### numpy
[numpy](http://www.numpy.org/) is a Python library which provides large multidimensional arrays and fast mathematical operations on them.
- [Numpy tutorial on DataCamp](https://www.datacamp.com/community/tutorials/python-numpy-tutorial#gs.h3DvLnk)
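
As a small taste of what "fast mathematical operations" on arrays looks like, here is a hedged sketch. The height values are made up for illustration. It shows element-wise arithmetic and a column-wise mean without any explicit Python loops.

```python
import numpy as np

# Made-up example data: 2 rows (groups) x 3 columns (people), heights in cm.
heights_cm = np.array([[170.0, 182.0, 165.0],
                       [158.0, 176.0, 191.0]])

# Element-wise arithmetic applies to the whole array at once.
heights_m = heights_cm / 100.0

# Reduce along axis 0 to get one mean per column.
col_means = heights_cm.mean(axis=0)

# Broadcasting: the 1-D column means are subtracted from every row.
centered = heights_cm - col_means

print(heights_m.shape)
print(col_means)
```
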
### pandas
[pandas](http://pandas.pydata.org/index.html) provides efficient data structures and analysis tools for Python. It is built on top of numpy.
- [Introduction to pandas](http://www.synesthesiam.com/posts/an-introduction-to-pandas.html)
- [DataCamp pandas foundations](https://www.datacamp.com/courses/pandas-foundations) - Paid course, but account creation includes 30 free days (enough to complete the course).
- [Pandas cheatsheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf) - Quick overview of the most important functions.
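
For a first impression of the pandas data structures, the sketch below builds a tiny `DataFrame` and runs two typical operations: grouping with an aggregate and boolean filtering. The cities and prices are invented for illustration only.

```python
import pandas as pd

# Invented example data: one row per listing.
df = pd.DataFrame({
    "city": ["Karlsruhe", "Berlin", "Karlsruhe", "Berlin"],
    "price": [250.0, 400.0, 310.0, 380.0],
})

# Group rows by city and compute the mean price per group.
mean_price = df.groupby("city")["price"].mean()

# Boolean indexing keeps only the rows where the condition holds.
expensive = df[df["price"] > 300]

print(mean_price)
print(expensive)
```
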
### scikit-learn
[scikit-learn](http://scikit-learn.org/stable/) is the most common library for Machine Learning and Data Science in Python.
- [Introduction and first model application](https://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/05.02-Introducing-Scikit-Learn.ipynb)
- [Rough guide for choosing estimators](http://scikit-learn.org/stable/tutorial/machine_learning_map/)
- [Scikit-learn complete user guide](http://scikit-learn.org/stable/user_guide.html)
- [Model ensemble: Implementation in Python](http://machinelearningmastery.com/ensemble-machine-learning-algorithms-python-scikit-learn/)
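
A first scikit-learn model usually follows the same pattern: split the data, fit an estimator, score it on held-out samples. The sketch below is one possible example, assuming the bundled iris dataset and a logistic regression classifier. Both are arbitrary choices made here, not recommendations from the linked guides.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a small built-in dataset: 150 flowers, 4 numeric features, 3 classes.
X, y = load_iris(return_X_y=True)

# Hold out 25% of the samples for evaluation; fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Every scikit-learn estimator exposes the same fit / predict / score methods.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
test_accuracy = model.score(X_test, y_test)
print(f"accuracy on held-out data: {test_accuracy:.2f}")
```

Swapping in a different estimator typically only changes the line that constructs `model`, which is why the "choosing estimators" map above is a useful next step.
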
### Jupyter Notebook
[Jupyter Notebook](https://jupyter.org/) is a web application for easy data visualisation and code presentation.
- [Downloading and running first Jupyter notebook](https://jupyter.org/install.html)
- [Example notebook for data exploration](https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-instacart)
- [Seaborn data visualization tutorial](https://elitedatascience.com/python-seaborn-tutorial) - Plot library that works great with Jupyter.
### Various other helpful tools and resources
- [Template folder structure for organizing Data Science projects](https://github.com/drivendata/cookiecutter-data-science)
- [Anaconda Python distribution](https://www.continuum.io/downloads) - Contains most of the important Python packages for Data Science.
- [Spacy](https://spacy.io/) - Open source toolkit for working with text-based data.
- [LightGBM gradient boosting framework](https://github.com/Microsoft/LightGBM) - Successfully used in many Kaggle challenges.
- [Amazon AWS](https://aws.amazon.com/) - Rent cloud servers for more time-consuming calculations (r4.xlarge server is a good place to start).
## Data Science Challenges for Beginners
Sorted by increasing complexity.
- [Walkthrough: House prices challenge](https://www.dataquest.io/blog/kaggle-getting-started/) - Walkthrough of a simple challenge on house prices.
- [Blood Donation Challenge](https://www.drivendata.org/competitions/2/warm-up-predict-blood-donations/) - Predict if a donor will donate again.
- [Titanic Challenge](https://www.kaggle.com/c/titanic) - Predict survival on the Titanic.
- [Water Pump Challenge](https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/) - Predict the operating condition of water pumps in Africa.
## More advanced resources and lists
- [Awesome Data Science](https://github.com/bulutyazilim/awesome-datascience)
- [Data Science Python](https://github.com/ujjwalkarn/DataSciencePython)
- [Machine Learning Tutorials](https://github.com/ujjwalkarn/Machine-Learning-Tutorials)
## Contribute
Contributions welcome! Read the [contribution guidelines](contributing.md) first.
## License
[![CC0](http://mirrors.creativecommons.org/presskit/buttons/88x31/svg/cc-zero.svg)](http://creativecommons.org/publicdomain/zero/1.0)
To the extent possible under law, Simon Böhm has waived all copyright and
related or neighboring rights to this work. Disclaimer: Some of the links are affiliate links.
[learndatascience.md GitHub](https://github.com/siboehm/awesome-learn-datascience)