Awesome Data Analysis #

500+ curated tools, libraries, cheatsheets, roadmaps, and tutorials to master data analysis. Perfect for beginners and experienced data analysts and scientists.

📑 Contents#

🏆 Awesome Data Science Repositories
🗺️ Roadmaps
🐍 Python
- Resources
- Data Manipulation with Pandas and Numpy
- Useful Python Tools for Data Analysis
🗃️ SQL & Databases
- Resources
- Tools
📊 Data Visualization
- Resources
- Tools
📈 Dashboards & BI
- Resources
- Tools
- Software
🕸️ Web Scraping & Crawling
- Resources
- Tools
🔢 Mathematics
🎲 Statistics & Probability
- Resources
- Tools
🧪 A/B Testing
⏳ Time Series Analysis
- Resources
- Tools
⚙️ Data Engineering
- Resources
- Tools
📖 Natural Language Processing (NLP)
- Resources
- Tools
🤖 Machine Learning & AI
- Resources
- Tools
🧠 Productivity & Development Tools
- Resources
- Useful Linux Tools
- Useful VS Code Extensions
📚 Skill Development & Career
- Practice Resources
- Curated Jupyter Notebooks
- Data Sources & Datasets
- Resume and Interview Tips
📋 Cheatsheets
- GoalKicker Programming Notes
- Python
- Data Science & Machine Learning
- Linux & Git
- Probability & Statistics
- SQL & Databases
- Miscellaneous
📦 Additional Python Libraries
📝 More Awesome Lists
🌐 Additional Resources
🤝 Contributing
📜 License

🏆 Awesome Data Science Repositories#

Curated collections of high-quality GitHub repos for inspiration and learning.

Awesome Data Science - A curated list of courses, books, tools, and resources for data science.
Data Science for Beginners - Microsoft’s data science curriculum.
OSSU Data Science - Open Source Society University’s self-study path.
Data Science Best Resources - Carefully curated links for data science resources in one place.
Data Science Articles from CodeCut - A collection of articles, videos, and code related to data science.
Data Science Using Python - Resources for data analysis using Python.

⬆ back to top

🗺️ Roadmaps#

Step-by-step guides and skill trees to master data science and analytics.

Data Analyst Roadmap - Structured learning path for analysts.
Data Science Roadmap from A to Z - Comprehensive roadmap for data science.
Roadmap To Learn Data Science - A comprehensive and updated roadmap for learning data science with modern tools and technologies.
66DaysOfData - 66-day data analytics learning challenge.
Data Analyst Roadmap for Professionals - 8-week program for analysts at all levels.
Data Science Roadmap Tutorials - Tutorials for the data science roadmap.
Data Analyst Roadmap from Zero - Guide to becoming a data analyst from scratch.

⬆ back to top

🐍 Python#

Resources#

A collection of resources for learning and mastering Python programming.

Awesome Python - An opinionated list of awesome Python frameworks, libraries, software, and resources.
30 Days Of Python - A 30-day programming challenge to learn the Python programming language.
Real Python Tutorials - Tutorials on Python from Real Python.
Awesome Python Data Science - A curated list of Python resources for data science.
Python Data Science Handbook - Full text of the “Python Data Science Handbook” in Jupyter Notebooks.
Interactive Coding Challenges - 120+ interactive Python coding interview challenges.
Clean Code Python - Clean Code concepts adapted for Python.
Best of Python - A ranked list of awesome Python open-source libraries and tools.
GeeksforGeeks Python - Python tutorial from GeeksforGeeks.
W3Schools Python - A beginner-friendly tutorial and reference for the Python programming language.
Tanu N Prabhu Python - This repository helps you understand Python from scratch.

⬆ back to top

Data Manipulation with Pandas and Numpy#

Tutorials and best practices for working with Pandas and NumPy.

Awesome Pandas - A curated list of resources for using the Pandas library.
100 data puzzles for pandas - A collection of data puzzles to practice your Pandas skills.
Pandas Tutor - Visualize Pandas operations step-by-step (perfect for beginners).
Pandas Exercises - Exercises designed to help you improve your Pandas skills.
Pandas Cookbook - A cookbook with various recipes for using Pandas effectively.
Hands-On Data Analysis with Pandas - Materials for following along with Hands-On Data Analysis with Pandas.
Effective Pandas - A series focused on writing effective and idiomatic Pandas code.
From Python to Numpy - An open-access book on vectorization and efficient numerical computing with NumPy.
NumPy 100 Exercises - A collection of 100 exercises to master the NumPy library for scientific computing.

⬆ back to top

Useful Python Tools for Data Analysis#

A collection of Python libraries for efficient data manipulation, cleaning, visualization, validation, and analysis.

Data Processing & Transformation#

Pandas DQ - Data type correction and automatic DataFrame cleaning.
Vaex - High-performance Python library for lazy, out-of-core DataFrames.
Polars - Multithreaded, vectorized query engine for DataFrames.
Fugue - Unified interface for Pandas, Spark, and Dask.
TheFuzz - Fuzzy string matching (Levenshtein distance).
DateUtil - Extensions for standard Python datetime features.
Arrow - Friendlier creation, manipulation, and formatting of dates and times.
Pendulum - Alternative to datetime with timezone support.
Dask - Parallel computing for arrays and DataFrames.
Modin - Speeds up Pandas by distributing computations.
Pandarallel - Parallel operations for pandas DataFrames.
DataCleaner - Python tool for automatically cleaning and preparing datasets.
Pandas Flavor - Add custom methods to Pandas.
Pandas DataReader - Reads data from various online sources into pandas DataFrames.
Sklearn Pandas - Bridge between Pandas and Scikit-learn.
CuPy - A NumPy-compatible array library accelerated by NVIDIA CUDA for high-performance computing.
Numba - A JIT compiler that translates a subset of Python and NumPy code into fast machine code.
Pandas Stubs - Type stubs for pandas, improves IDE autocompletion.
Petl - ETL tool for data cleaning and transformation.

⬆ back to top

Automated EDA and Visualization Tools#

AutoViz - Automatic data visualization in 1 line of code.
Sweetviz - Automatic EDA with dataset comparison.
Lux - Automatic DataFrame visualization in Jupyter.
YData Profiling - Data quality profiling & exploratory data analysis.
Missingno - Visualize missing data patterns.
Vizro - Low-code toolkit for building data visualization apps.
Yellowbrick - Visual diagnostic tools for machine learning.
Great Tables - Create awesome display tables using Python.
DataMapPlot - Create beautiful plots of data maps.
Datashader - Quickly and accurately render even the largest data.
PandasAI - Conversational data analysis using LLMs and RAG.
Mito - Jupyter extensions for faster code writing.
D-Tale - Interactive GUI for data analysis in a browser.
Pandasgui - GUI for viewing and filtering DataFrames.
PyGWalker - Interactive UIs for visual analysis of DataFrames.
QGrid - Interactive grid for DataFrames in Jupyter.
Pivottablejs - Interactive PivotTable.js tables in Jupyter.

⬆ back to top

Data Quality & Validation#

PyOD - Outlier and anomaly detection.
Alibi Detect - Outlier, adversarial and drift detection.
Pandera - Data validation through declarative schemas.
Cerberus - Data validation through schemas.
Pydantic - Data validation using Python type annotations.
Dora - Automate EDA: preprocessing, feature engineering, visualization.
Great Expectations - Data validation and testing.

⬆ back to top

Feature Engineering & Selection#

FeatureTools - Automated feature engineering.
Feature Engine - Feature engineering with Scikit-Learn compatibility.
Prince - Multivariate exploratory data analysis (PCA, CA, MCA).
Fitter - Figures out the distribution your data comes from.
Feature Selector - Tool for dimensionality reduction of machine learning datasets.
Category Encoders - Extensive collection of categorical variable encoders.
Imbalanced Learn - Handling imbalanced datasets.

⬆ back to top

Specialized Data Tools#

Faker - Generates fake data for testing.
Mimesis - Generates realistic test data.
Geopy - Geocoding addresses and calculating distances.
PySAL - Spatial analysis functions.
Factor Analyzer - A Python package for factor analysis, including exploratory and confirmatory methods.
Scattertext - Beautiful visualizations of language differences among document types.
IGraph - A library for creating and manipulating graphs and networks, with bindings for multiple languages.
Joblib - A lightweight pipelining library for Python, particularly useful for saving and loading large NumPy arrays.
ImageIO - A library that provides an easy interface to read and write a wide range of image data.
Texthero - Text preprocessing, representation and visualization.
Geopandas - Geographic data operations with pandas.
NetworkX - Network analysis and graph theory.

⬆ back to top

🗃️ SQL & Databases#

Resources#

SQL tutorials and database design principles.

SQLZoo - SQL Tutorial - Interactive SQL tutorial.
SQL Bolt - Learn SQL - Learn SQL through interactive lessons.
SQL Tutorial - Comprehensive SQL tutorial resource.
SQL Tutorial by W3Schools - Comprehensive SQL tutorial.
PostgreSQL Tutorial by W3Resource - Tutorial for PostgreSQL.
MySQL Tutorial by W3Resource - Tutorial for MySQL.
MongoDB Tutorial by W3Resource - Tutorial for MongoDB.
EverSQL - AI-powered SQL query optimization and database observability tool.
Awesome Postgres - A curated list of awesome PostgreSQL software, libraries, tools and resources.
Awesome MySql - A curated list of awesome MySQL software, libraries, tools and resources.
Awesome Clickhouse - A curated list of awesome ClickHouse software.
Awesome SQLAlchemy - A curated list of awesome tools for SQLAlchemy.
Awesome Sql - List of tools and techniques for working with relational databases.

⬆ back to top

Tools#

A collection of Python libraries and drivers for seamless database access and interaction.

PyODBC - Python library for ODBC database access.
SQLAlchemy - SQL toolkit and ORM for Python.
Psycopg2 - PostgreSQL database adapter.
MySQL Connector/Python - MySQL driver for Python.
PonyORM - ORM for Python with dynamic query generation.
PyMongo - Official MongoDB driver for Python.
SQLiteviz - A tool for exploring SQLite databases and visualizing the results of your queries.
SQLite - A C-language library that implements a small, fast, self-contained, high-reliability, full-featured SQL database engine.
DB Browser for SQLite - A high-quality, visual, open-source tool to create, design, and edit database files compatible with SQLite.
DBeaver - A free universal database tool and SQL client for developers, SQL programmers, and administrators.
Beekeeper Studio - A modern, easy-to-use SQL client and database manager with a clean, cross-platform interface.
SQLFluff - A modular SQL linter and auto-formatter designed to enforce consistent style and catch errors in SQL code.
PyMySQL - A pure-Python MySQL client library for interacting with MySQL databases from Python applications.
Vanna.AI - An AI-powered tool for generating SQL queries from natural language questions.
SQLChat - A chat-based SQL client that allows you to query databases using natural language conversations.
Records - SQL queries to databases via Python syntax.
Dataset - JSON-like interface for working with SQL databases.
SQLGlot - A no-dependency SQL parser, transpiler, and optimizer for Python.
TDengine - An open-source big data platform designed for time-series data, IoT, and industrial monitoring.
TimescaleDB - An open-source time-series SQL database optimized for fast ingest and complex queries.
DuckDB - In-process analytical database for fast SQL queries.

⬆ back to top

📊 Data Visualization#

Resources#

Color theory, chart selection guides, and storytelling tips.

From Data to Viz - A guide to choosing the right visualization based on your data.
Awesome DataViz - A curated list of awesome data visualization libraries, tools, and resources.
Visualization Curriculum - Interactive notebooks designed to teach data visualization concepts.
The Python Graph Gallery - A collection of Python graph examples for data visualization.
FlowingData - Insights on data analysis and visualization.
Data Visualization Catalogue - A comprehensive catalog of data visualization types.
Data Viz Project - A resource for selecting suitable visualizations.
Chartopedia - A guide to help you select the appropriate chart types.
DataForVisualization - Tutorials and insights on data visualization techniques.
Truth & Beauty - Exploration of the aesthetics of data visualization.
Cedric Scherer’s DataViz Resources - A collection of top data visualization resources and inspiration.
Information is Beautiful - A site dedicated to visualizations that make complex ideas clear and engaging.
Plottie - A vast library of scientific plots for visualization inspiration and ideas.

⬆ back to top

Tools#

Libraries for static, interactive, and 3D visualizations.

Matplotlib - A comprehensive library for creating static, animated, and interactive visualizations in Python.
Seaborn - A statistical data visualization library based on Matplotlib.
Plotly - A library for creating interactive plots and dashboards.
Altair - A declarative statistical visualization library for Python.
Bokeh - A library for creating interactive visualizations for modern web browsers.
HoloViews - A tool for building complex visualizations easily.
Geopandas - An extension of Pandas for geospatial data.
Folium - A library for visualizing data on interactive maps.
Pygal - A Python SVG charting library.
Plotnine - A grammar of graphics for Python.
Bqplot - A plotting library for IPython/Jupyter notebooks.
PyPalettes - A large collection (2,500+) of color maps for Python.
Deck.gl - A WebGL-powered framework for visual exploratory data analysis of large datasets.
Python for Geo - Contextily: add background basemaps to your plots in GeoPandas.
OSMnx - A package to easily download, model, analyze, and visualize street networks from OpenStreetMap.
Apache ECharts - A powerful, interactive charting and visualization library for browser-based applications.
VisPy - A high-performance interactive 2D/3D data visualization library leveraging the power of OpenGL.
Glumpy - A Python library for scientific visualization that is fast, scalable and beautiful, based on OpenGL.
Pandas-bokeh - Bokeh plotting backend for Pandas.

⬆ back to top

📈 Dashboards & BI#

Resources#

Tutorials for building and enhancing dashboards and visualizations using various tools and frameworks.

Awesome Dashboards - A collection of outstanding dashboard and visualization resources.
Best of Streamlit - Showcase of community-built Streamlit applications.
Awesome Dash - Comprehensive resources for Dash users.
Awesome Panel - Resources and support for Panel users.
Awesome Streamlit - Curated list of Streamlit resources and components.
Dash Enterprise Samples - Production-ready Dash apps.
GeeksforGeeks - Tableau Tutorial - Comprehensive tutorial on Tableau.
GeeksforGeeks - Power BI Tutorial - Detailed tutorial on Power BI.

⬆ back to top

Tools#

Frameworks for building custom dashboard solutions.

Dash - Framework for creating interactive web applications.
Streamlit - Simplified framework for building data applications.
Panel - Framework for creating interactive web applications.
Gradio - Tool for creating and sharing machine learning applications.
OpenSearch Dashboards - A powerful data visualization and dashboarding tool for OpenSearch data, forked from Kibana.
GridStack.js - A library for building draggable, resizable responsive dashboard layouts.
Tremor - A React library to build dashboards fast with pre-built components for charts, KPIs, and more.
Appsmith - An open-source platform to build and deploy internal tools, admin panels, and CRUD apps quickly.
Grafanalib - A Python library for generating Grafana dashboards configuration as code.
H2O Wave - A Python framework for rapidly building and deploying realtime web apps and dashboards for AI and analytics.
Shiny for Python - Python version of the popular R Shiny framework.
Voilà - Turn Jupyter notebooks into standalone web applications.
Reflex - Full-stack Python framework for building web apps.

⬆ back to top

Software#

A list of leading tools and platforms for data visualization and dashboard creation.

Tableau - Leading data visualization software.
Microsoft Power BI - Business analytics tool for visualizing data.
QlikView - Tool for data visualization and business intelligence.
Metabase - User-friendly open-source BI tool.
Apache Superset - Open-source data exploration and visualization platform.
Preset - A platform for modern business intelligence, providing a hosted version of Apache Superset.
Redash - Tool for visualizing and sharing data insights.
Grafana - Dashboarding and monitoring tool.
Datawrapper - User-friendly chart and map creation tool.
ChartBlocks - Online chart creation platform.
Infogram - Tool for creating infographics and visual content.
Google Data Studio - Free tool for creating interactive dashboards and reports (now renamed Looker Studio).
Rath - Next-generation automated data exploratory analysis and visualization platform.
Kibana - The official visualization and dashboarding tool for the Elastic Stack (Elasticsearch, Logstash, Beats).

⬆ back to top

🕸️ Web Scraping & Crawling#

Resources#

A collection of valuable resources, tutorials, and libraries for web scraping with Python.

Awesome Web Scraping - List of libraries, tools, and APIs for web scraping and data processing.
Python Scraping - Code samples from the book “Web Scraping with Python”.
Scraping Tutorial - Tutorial for scraping streaming sites.
Webscraping from 0 to Hero - An open project repository sharing knowledge and experiences about web scraping with Python.

⬆ back to top

Tools#

A list of Python libraries and tools for web scraping.

BeautifulSoup - A library for parsing HTML and XML documents.
Selenium - A tool for automating web applications for testing purposes.
Scrapy - An open-source and collaborative web crawling framework for Python.
Gerapy - Distributed Crawler Management Framework based on Scrapy, Scrapyd, Django, and Vue.js.
AutoScraper - A smart, automatic, fast, and lightweight web scraper for Python.
Feedparser - A library to parse feeds in Python.
Trafilatura - A Python & command-line tool to gather text and metadata on the web.
You-Get - A tiny command-line utility to download media contents (videos, audios, images) from the web.
MechanicalSoup - A Python library for automating interaction with websites.
ScrapeGraph AI - A Python scraper based on AI.
Snscrape - A social networking service scraper in Python.
Ferret - A web scraping system that lets you declaratively describe what data to extract using a simple query language.
Grab - A Python framework for building web scraping apps, providing a high-level API for asynchronous requests.
Playwright - Python version of the Playwright browser automation library.
PyQuery - A jQuery-like library for parsing HTML documents in Python.
Helium - High-level Selenium wrapper for easier web automation.

⬆ back to top

🔢 Mathematics#

A collection of resources for learning mathematics, particularly in the context of data science and machine learning.

Awesome Math - A curated list of mathematics resources, books, and online courses.
MML Book - Comprehensive resource for mathematics in machine learning.
3Blue1Brown - Visual explanations of mathematical concepts through animated videos.
Immersive Linear Algebra - Interactive resource for understanding linear algebra.
Hackermath - Resource for learning statistics and mathematics for data science.
Stats Maths with Python - Collection of Python scripts and notebooks for statistics and mathematics.
Fast.ai - Computational Linear Algebra - Resource for learning linear algebra computationally.

⬆ back to top

🎲 Statistics & Probability#

Resources#

A selection of resources focused on statistics and probability, including tutorials and comprehensive guides.

Awesome Statistics - A curated list of statistics resources, software, and learning materials.
The Elements of Statistical Learning - Notebooks for understanding statistical learning concepts.
Seeing Theory - Interactive visual resource for learning probability and statistics.
Code repository for O’Reilly book - Companion code for a practical statistics book.
Statistical Learning Theory - Stanford University - Lecture notes on statistical learning theory.
StatLect - Comprehensive online textbook covering probability and statistics concepts.
stanford.edu - Probabilities and Statistics - Refresher course on probabilities and statistics from Stanford University.
Bayesian Methods for Hackers - Resource for learning Bayesian methods in Python.
Bayesian Modeling and Computation in Python - Code for the book “Bayesian Modeling and Computation in Python”.
Stat Trek - A resource for learning statistics and probability, with tutorials and tools.
Online Statistics Book - An interactive online statistics book with simulations and demonstrations.
All of Statistics - Resource for studying statistics based on Wasserman’s book.

⬆ back to top

Tools#

A collection of tools focused on statistics and probability.

SciPy - Fundamental library for scientific computing and statistics.
Statsmodels - Statistical modeling, testing, and data exploration.
PyMC - A probabilistic programming library for Python that allows for flexible Bayesian modeling.
Pingouin - Statistical package with improved usability over SciPy.
scikit-posthocs - Post-hoc tests for statistical analysis of data.
Lifelines - Survival analysis and event history analysis in Python.
scikit-survival - Survival analysis built on scikit-learn for time-to-event prediction.
Bootstrap - Bootstrap confidence interval estimation methods.
PyStan - Python interface to Stan for Bayesian statistical modeling.
ArviZ - Exploratory analysis of Bayesian models with visual diagnostics.
PyGAM - A Python library for generalized additive models with built-in smoothing and regularization.
NumPyro - A probabilistic programming library built on JAX for high-performance Bayesian modeling.
Causal Impact - A Python implementation of the R package for causal inference using Bayesian structural time-series models.
DoWhy - A Python library for causal inference that supports explicit modeling and testing of causal assumptions.
Patsy - A Python library for describing statistical models and building design matrices.
Pomegranate - Fast and flexible probabilistic modeling library for Python with GPU support.

⬆ back to top

🧪 A/B Testing#

A collection of resources focused on A/B testing.

DynamicYield A/B Testing - An online course covering advanced testing and optimization techniques.
Evan’s Awesome A/B Tools - A/B test calculators.
Experimentguide - A practical guide to A/B testing and experimentation from industry leaders.
Google’s A/B Testing Course - A free Udacity course covering the fundamentals of A/B testing.

⬆ back to top

⏳ Time Series Analysis#

Resources#

A collection of resources for understanding time series fundamentals and analytical techniques.

Awesome Time Series - A curated list of resources dedicated to time series analysis and forecasting.
Forecasting: Principles and Practice - Comprehensive textbook on forecasting methods with practical examples.
NIST/SEMATECH e-Handbook - Official time series analysis guide from NIST.
Awesome Time Series Anomaly Detection - A curated list of tools, datasets, and papers dedicated to time series anomaly detection.
Awesome Time Series in Python - A comprehensive list of Python tools and libraries for time series analysis.

⬆ back to top

Tools#

A collection of tools for working with temporal data.

Facebook Prophet - A procedure for forecasting time series data based on an additive model.
Uber Orbit - A Python package for Bayesian time series forecasting and inference.
sktime - A unified Python framework for machine learning with time series, compatible with scikit-learn.
GluonTS - A Python toolkit for probabilistic time series modeling, originally built on MXNet and now also supporting PyTorch.
Time-Series-Library - A library for deep learning-based time series analysis and forecasting.
TimesFM - A pretrained time series foundation model from Google Research for zero-shot forecasting.
PyTorch Forecasting - A PyTorch-based library for time series forecasting with neural networks.
Time-series-prediction - A collection of time series prediction methods and implementations.
PlotJuggler - A tool to visualize and analyze time series data logs in real-time.
TSFresh - Automatically extracting features from time series data.
pmdarima - Python library for ARIMA modeling and time series analysis.
Kats - Toolkit for analyzing time series data from Facebook Research.

⬆ back to top

⚙️ Data Engineering#

Resources#

A collection of resources to help you build and manage robust data pipelines and infrastructure.

Data Engineer Handbook - A comprehensive guide covering fundamental and advanced data engineering concepts.
Data Engineering Zoomcamp - Free course on data engineering fundamentals.
Awesome Data Engineering - A curated list of data engineering tools, software, and resources.
Data Engineering Cookbook - Techniques and strategies for building reliable data platforms.
Awesome Cloud Native - A curated list of resources for cloud native technologies.
Awesome Pipeline - A curated list of pipeline toolkits for data processing and workflow management.

⬆ back to top

Tools#

A collection of tools for building, deploying, and managing data pipelines and infrastructure.

dbt-core - A framework for transforming data in your warehouse using SQL and Jinja.
Apache Spark - A unified engine for large-scale data processing and analytics.
Apache Kafka - A distributed event streaming platform for building real-time data pipelines.
Dagster - A data orchestrator for machine learning, analytics, and ETL.
Apache Airflow - A platform to programmatically author, schedule, and monitor workflows.
Apache Hive - A data warehouse software for reading, writing, and managing large datasets in distributed storage using SQL.
Apache Hadoop - A framework that allows for the distributed processing of large data sets across clusters of computers.
Luigi - A Python module for building complex and batch-oriented data pipelines.
Apache Iceberg - A high-performance table format for huge analytic datasets.
Apache Cassandra - A highly scalable distributed NoSQL database designed for handling large amounts of data across many commodity servers.
Apache Flink - A framework for stateful computations over unbounded and bounded data streams (real-time stream processing).
Apache Beam - A unified model for defining both batch and streaming data-parallel processing pipelines.
Apache Pulsar - A cloud-native, distributed messaging and streaming platform.
Delta Lake - A storage layer that brings ACID transactions to Apache Spark and big data workloads.
Apache Hudi - An open data lakehouse platform, built on a high-performance open table format.
Trino - A distributed SQL query engine designed for fast analytic queries against large datasets.
DataHub - A metadata platform for the modern data stack.
OpenLineage - An open framework for collection and analysis of data lineage.
Netflix Metaflow - A human-friendly Python library for helping scientists and engineers build and manage real-life data science projects.
Feast - A feature store for machine learning that manages and serves ML features to models.
Kedro - A framework for creating reproducible, maintainable and modular data science code.
Apache Calcite - A dynamic data management framework that allows for SQL parsing, optimization, and federation.
Prefect - Workflow orchestration for building resilient data pipelines.
Apache Arrow - Universal columnar format and multi-language toolbox for fast data interchange.

⬆ back to top

📖 Natural Language Processing (NLP)#

Resources#

A selection of resources for learning and applying natural language processing in Python.

Awesome Nlp - A ranked list of awesome Python libraries for natural language processing (NLP).
Hugging Face NLP Course - Official course on transformers and NLP from Hugging Face.
Practical NLP Code - Code examples and notebooks for practical natural language processing.
Oxford Deep NLP Lectures - Lecture materials from Oxford’s Deep Natural Language Processing course.
NLTK Book - Natural Language Processing with Python.
NLP with Python by Susan Li - Jupyter notebooks demonstrating various NLP techniques and applications.
Hands on NLTK Tutorial - The hands-on NLTK tutorial for NLP in Python.

⬆ back to top

Tools#

A collection of powerful libraries and frameworks for natural language processing in Python.

Natural Language Toolkit (NLTK) - A leading platform for building Python programs to work with human language data.
TextBlob - A simple library for processing textual data.
SpaCy - An open-source software library for advanced NLP in Python.
BERT - A transformer-based model for NLP tasks.
Flair - A simple framework for state-of-the-art NLP.
OpenHands - A library and framework for building applications with large language models.
Stanford CoreNLP - A Java suite of core NLP tools providing fundamental linguistic analysis capabilities.
John Snow Labs Spark-NLP - A state-of-the-art Natural Language Processing library built on Apache Spark.
TextAttack - A Python framework for adversarial attacks, data augmentation, and model training in NLP.
Gensim - Topic modeling and natural language processing library for Python.
Stanza - Python NLP library for many human languages, from the Stanford NLP Group.
SentenceTransformers - Framework for state-of-the-art sentence and text embeddings.

⬆ back to top

🤖 Machine Learning & AI#

Resources#

A collection of resources to help you learn and apply machine learning concepts and techniques.

Awesome Machine Learning - A curated list of awesome Machine Learning frameworks, libraries and software.
Awesome Deep Learning - A curated list of awesome Deep Learning tutorials, projects and communities.
Awesome LLM - A curated list of papers, projects, and resources related to Large Language Models.
Best of ML Python - A ranked list of awesome machine learning Python libraries and tools.
Awesome Generative AI Guide - A comprehensive guide to generative AI models, tools, and applications.
Microsoft ML for Beginners - A beginner-friendly introduction to machine learning concepts and practices.
MLOps Zoomcamp - A free course focused on the practical aspects of deploying and maintaining ML systems.
Awesome MLOps - A curated list of references for MLOps (Machine Learning Operations).
mlcourse.ai - Open Machine Learning Course with practical assignments and real-world applications.
Machine Learning Zoomcamp - A free practical machine learning course focused on building and deploying models.
LLM Zoomcamp - A course dedicated to Large Language Models, their architecture and applications.
ML Engineering Guide - A practical guide to machine learning engineering and MLOps best practices.
Awesome Production Machine Learning - A curated list of tools for deploying, monitoring, and maintaining ML systems in production.
Awesome Artificial Intelligence - A curated list of artificial intelligence resources.
100 Days of ML Coding - A comprehensive coding challenge to learn machine learning over 100 days.
Made With ML - Resource for building and deploying machine learning applications.
Handson-ml3 - Hands-on guide to machine learning and deep learning using Python.
Machine Learning with Python by Susan Li - Jupyter notebooks covering various machine learning algorithms and applications.

⬆ back to top

Tools#

A collection of tools for developing and deploying machine learning models.

TensorFlow - An end-to-end open source platform for machine learning and deep learning applications.
PyTorch - Deep learning framework with strong support for research and production.
Scikit-learn - Machine learning library for classical algorithms and model building.
HuggingFace Transformers - The model-definition framework for state-of-the-art machine learning models.
XGBoost - Optimized distributed gradient boosting library for tree-based models.
LightGBM - Fast, distributed, high-performance gradient boosting framework.
CatBoost - High-performance gradient boosting on decision trees with categorical features support.
LangChain - Framework for developing applications powered by language models.
LlamaIndex - Data framework for LLM-based applications with RAG capabilities.
vLLM - High-throughput and memory-efficient inference library for LLMs.
Ollama - Tool for running large language models locally on your machine.
MLflow - An open-source platform for the complete machine learning lifecycle.
Fast.ai - Deep learning library simplifying training fast and accurate neural nets.
HuggingFace Diffusers - A library for state-of-the-art pretrained diffusion models for image and audio generation.
PEFT - A library for efficiently adapting large pretrained models to various tasks.
Evidently - A tool for analyzing and monitoring data and model drift in production.
Wandb - A tool for experiment tracking, dataset versioning, and model management.
SHAP - A game theoretic approach to explain the output of any machine learning model.
BentoML - A framework for building, shipping, and scaling machine learning applications.
Optuna - Hyperparameter optimization framework.
DVC - A version control system for machine learning projects to track data, models, and experiments.
OpenLLM - Open platform for operating large language models in production.
Deepchecks - Validation for ML models and data.
Kubeflow - A machine learning toolkit for Kubernetes, focused on simplifying deployments.
Sematic - An open-source tool to build, debug, and execute ML pipelines with native Python.

⬆ back to top

🧠 Productivity & Development Tools#

Resources#

A collection of resources and tools to enhance productivity and streamline development processes.

Best of Jupyter - Ranked list of notable Jupyter Notebook, Hub, and Lab projects.
Notion - An all-in-one workspace for note-taking and task management.
Trello - A visual project management tool.
ChatGPT Data Science Prompts - A collection of useful prompts for data scientists using ChatGPT.
Cookiecutter Data Science - A standardized project structure for data science projects.
The Markdown Guide - Comprehensive guide to learning Markdown.
Readme-AI - A tool to automatically generate README.md files for your projects.
Markdown Here - Extension for writing emails in Markdown and rendering them before sending.
Habitica - A habit-building and productivity app that treats your life like a role-playing game.
Microsoft To Do - A simple to-do list app from Microsoft.
Google Keep - A note-taking and list-making app.
Bujo - Tools to help transform the way you work and live.
Parabola - An AI-powered workflow builder for organizing data.
Asana - A project management platform for tracking work and projects.
Puter - An open-source, browser-based computing environment and cloud OS.

⬆ back to top

Useful Linux Tools#

A selection of tools to enhance productivity and functionality in Linux environments.

tldr-pages - Simplified and community-driven man pages with practical examples.
Bat - Cat clone with syntax highlighting.
Exa - Modern replacement for ls.
Ripgrep - Faster grep alternative.
Zoxide - Smarter cd command.
Peek - Simple animated GIF screen recorder with an easy to use interface.
CopyQ - Clipboard manager with advanced features.
Translate Shell - Command-line translator using Google Translate, Bing Translator, Yandex.Translate, etc.
Espanso - Cross-platform Text Expander written in Rust.
Flameshot - Powerful yet simple to use screenshot software.
DrawIO Desktop - An open-source diagramming software for making flowcharts, process diagrams, and more.
Inkscape - A powerful, free, and open-source vector graphics editor for creating and editing visualizations.
Rclone - A command-line program to manage files on cloud storage.
Rsync - A fast and versatile file copying tool that can synchronize files and directories between two locations over a network or locally.
Timeshift - System restore tool for Linux that creates filesystem snapshots using rsync+hardlinks or BTRFS snapshots.
Backintime - A comfortable and well-configurable graphical frontend for incremental backups.
Fzf - A command-line fuzzy finder.
Osquery - SQL powered operating system instrumentation, monitoring, and analytics.
GNU Parallel - A tool to run jobs in parallel.
HTop - An interactive process viewer.
Ncdu - A disk usage analyzer with an ncurses interface.
Thefuck - A command line tool to correct your previous console command.
Miller - A tool for querying, processing, and formatting data in various file formats (CSV, JSON, etc.), like awk/sed/cut for data.
jq - Command-line JSON processor for parsing and manipulating JSON data.
yq - Portable command-line YAML processor (like jq for YAML and XML).
q - Run SQL directly on CSV or TSV files from the command line.
VisiData - Interactive multitool for tabular data exploration in the terminal.
csvkit - Suite of command-line tools for working with CSV data.
httpie - Modern command-line HTTP client for API testing and debugging.
glances - Cross-platform system monitoring tool for resource usage analysis.
hyperfine - Command-line benchmarking tool for performance testing.
termgraph - Draw basic graphs in the terminal for quick data visualization.
fd - Simple, fast and user-friendly alternative to ‘find’.
dust - More intuitive version of du written in Rust.
bottom - Cross-platform graphical process/system monitor.

⬆ back to top

Useful VS Code Extensions#

A collection of extensions to enhance functionality and productivity in Visual Studio Code.

JDBC Adapter - Connect to various databases using JDBC.
DBCode - Connect - Database client for managing and querying databases.
Markdown All in One - Essential tools for Markdown editing.
Markdown Preview GitHub Styles - Changes VS Code’s markdown preview to match GitHub’s styling.
Snippington Python Pandas Basic - Basic tools for working with Pandas in Python.
PDF Viewer for Visual Studio Code - View PDF files directly in VS Code.
Quick Python Print - Quickly handle print operations in Python.
Rainbow CSV - Highlight CSV and TSV files and run SQL-like queries.
Remove Blank Lines - Extension to remove empty lines in documents.
PDF Preview in VSCode - Show PDF previews in VS Code.
CSV to Table - Convert CSV/TSV/PSV files to ASCII formatted tables.
Data Preview - Import, view, slice, and export data.
Data Wrangler - Tool for cleaning and preparing tabular datasets.
Error Lens - Enhances the display of errors and warnings in code.
Indent Rainbow - Makes indentation easier to read.
Markdown Table Editor - Add features to edit Markdown tables.
WYSIWYG Editor for Markdown - View Word and Excel files and edit Markdown.
Prettier - Code formatting extension for VS Code.
Project Manager - Easily switch between projects.
Python Indent - Automatically indent Python code.
SandDance - Visually explore and present your data.
SQL Notebooks - Open SQL files as VS Code notebooks.
SQL Tools - Database management tools for VS Code.
Kanban Board - A Kanban board extension for organizing tasks within VS Code.
Path Autocomplete - Provides path completion for files and directories in VS Code.
Path Intellisense - Autocompletes filenames in your code.
Python Imports Utils - Utilities for managing Python imports.
Workspace Dashboard - Organize your workspaces in a speed-dial manner.
Remote Development - Open any folder in a container, on a remote machine, or in WSL.
Text Power Tools - An all-in-one solution with 240+ commands for text manipulation.
Toggle Quotes - Toggle between single, double, and backticks for strings.
Comment Translate - Helps translate comments, strings, and variable names in your code.
Text Marker - Select text in your code and mark all matches with configurable highlight color.
Bookmarks - Mark lines in your code and jump to them easily.
Dendron - A hierarchical note-taking tool that grows as you do.
Gitignore Generator - Simplifies the process of generating .gitignore files.
Test Explorer UI - Run your tests in the sidebar of Visual Studio Code.
Python Test Explorer - Run your Python tests in the sidebar of Visual Studio Code.
VSCode Markdownlint - A VS Code extension to lint and style check markdown files.

⬆ back to top

📚 Skill Development & Career#

Practice Resources#

A collection of resources to enhance skills and advance your career in data analysis and related fields.

Kaggle Competitions - Platform for participating in data analysis and machine learning competitions.
Makeovermonday - A platform focused on enhancing data visualization practices.
Workout Wednesday - Engage in weekly challenges to improve your visualization skills.
Official TidyTuesday Repository - Repository for the TidyTuesday project, promoting data analysis.
DrivenData Competitions - Data analysis competitions with a social impact focus.
Codecademy Data Science Path - Interactive courses for learning data analysis.
SQL Masterclass - A course to master SQL for data analysis, complete with real-world projects.
Hugging Face Tasks - Hands-on practice with specific NLP and machine learning tasks using real models.

⬆ back to top

Curated Jupyter Notebooks#

A selection of curated Jupyter notebooks to support learning and exploration in data science and analysis.

Awesome Notebooks - Data & AI notebook templates catalog organized by tools.
Data Science Ipython Notebooks - Data science Python notebooks covering various topics.
Pydata Book - Materials and IPython notebooks for “Python for Data Analysis” by Wes McKinney.
Spark py Notebooks - Apache Spark & Python tutorials for big data analysis and machine learning.
DataMiningNotebooks - Example notebooks for data mining accompanying the course at Southern Methodist University.
Pythondataanalysis - Python data repository with Jupyter notebooks and scripts.
Python For Data Analysis - An introduction to data science using Python and Pandas with Jupyter notebooks.
Jdwittenauer Ipython Notebooks - A collection of IPython notebooks covering various topics.
DataScienceInteractivePython - A collection of interactive Python notebooks for learning data science concepts.

⬆ back to top

Data Sources & Datasets#

A collection of resources for accessing datasets and data sources for analysis and projects.

Kaggle Datasets - Extensive collection of datasets for practice in data analysis.
Opendatasets - A Python library for downloading datasets from Kaggle, Google Drive, and other online sources.
Datasette - An open source multi-tool for exploring and publishing data.
Awesome Public Datasets - Curated list of high-quality open datasets.
Open Data Sources - Collection of various open data sources.
Free Datasets for Projects - Dataquest’s compilation of free datasets.
Data World - An enterprise data catalog platform for data analysts, engineers, and governance teams.
Awesome Public Real Time Datasets - A list of publicly available datasets with real-time data.
Google Dataset Search - A search engine for datasets from across the web.
NASA Open Data Portal - A site for NASA’s open data initiative, providing access to NASA’s data resources.
The World Bank Data - Free and open access to global development data by The World Bank.
Voice Datasets - A collection of audio and speech datasets for voice AI and machine learning.
HuggingFace Datasets - A lightweight library to easily share and access datasets for audio, computer vision, and NLP.
TensorFlow Datasets - A collection of ready-to-use datasets for use with TensorFlow and other Python ML frameworks.
NLP Datasets - A curated list of datasets for natural language processing (NLP) tasks.
TorchVision Datasets - The torchvision.datasets module provides many built-in computer vision datasets.
LLM Datasets - A collection of datasets and resources for training and fine-tuning Large Language Models (LLMs).
Unsplash Datasets - A collection of datasets from Unsplash, useful for computer vision and research.
Awesome JSON Datasets - A curated list of awesome JSON datasets that are publicly available without authentication.

⬆ back to top

Resume and Interview Tips#

A variety of resources to help you prepare for interviews and enhance your resume.

Data Science Interview Questions Answers - Curated list of data science interview questions and answers.
Data Science Interview Preparation Resources - Resources to help you prepare for upcoming data science interviews.
Data Science Interviews - A comprehensive collection of data science interview questions and resources.
The Data Science Interview Book - A comprehensive resource to prepare for data science and machine learning interviews.
Machine Learning Interviews Book - A comprehensive guide to preparing for machine learning engineering interviews.
Devinterview - Ace your next tech interview with confidence.
Interviewqs - Ace your next data science interview.
Cracking Data Science Interview - A collection of cheatsheets, books, questions, and portfolio material for DS/ML interview prep.
Interview Query - Another platform to prepare for data science interviews.
Enhancv Data Scientist Resumes - A collection of resume examples and tips tailored for data scientists.
Data Science Portfolio - A platform to create and showcase your data science portfolio.
InterviewBit - SQL Interview Questions - Collection of SQL interview questions.

⬆ back to top

📋 Cheatsheets#

A collection of cheatsheets across various domains to aid in quick reference and learning.

GoalKicker Programming Notes#

Bash Notes for Professionals - A comprehensive guide to shell scripting and command-line mastery.
Git Notes for Professionals - Everything you need to know about version control with Git, from basics to advanced workflows.
Linux Notes for Professionals - A deep dive into Linux system administration, commands, and environment management.
Microsoft SQL Server Notes for Professionals - A detailed reference for developing and administering MS SQL Server databases.
PowerShell Notes for Professionals - A guide to task automation and configuration management using PowerShell.
Python Notes for Professionals - A massive collection of Python concepts, idioms, and best practices for all levels.
SQL Notes for Professionals - A definitive guide to SQL syntax, queries, and database interaction concepts.
PostgreSQL Notes for Professionals - A professional compendium of knowledge for PostgreSQL administration and development.
MySQL Notes for Professionals - Essential reference material for working with the MySQL database management system.
Oracle Database Notes for Professionals - A guide to Oracle Database concepts, PL/SQL, and administration tasks.
MongoDB Notes for Professionals - A practical guide to working with NoSQL and MongoDB for modern application development.

⬆ back to top

Python#

Python Cheat Sheet - Comprehensive Python syntax and examples.
Learn Python - Interactive Python learning.
Pythoncheatsheet - Quick reference for Python basics and advanced topics.
Comprehensive Python Cheatsheet - Detailed Python functions and libraries.
Python Cheatsheet - A comprehensive cheatsheet for the Python programming language.

⬆ back to top

Data Science & Machine Learning#

DS Cheatsheets - List of Data Science Cheatsheets.
DS Notes & Cheatsheets - Cheatsheets for data science, ML, computer science and more.
Data Science Cheat Sheets (Math) - Cheat sheets for quick reference in data science mathematics.
Pandas Cheat Sheet - Data manipulation with Pandas.
PySpark Cheatsheet - Common PySpark patterns.

⬆ back to top

Linux & Git#

Linux Cheatsheet - Linux commands and shortcuts.
Bash Awesome Cheatsheets - Bash scripting essentials.
Unix Commands Reference - Unix terminal basics.
GitHub Cheat Sheet - Git/GitHub workflows and tips.
Git Awesome Cheatsheets - Git commands and best practices.
Git and Git Flow Cheat Sheet - Branching strategies.

⬆ back to top

Probability & Statistics#

Stanford CME 106 Cheatsheets - Probability and statistics for engineers.
10-Page Probability Cheatsheet - In-depth probability concepts.
Statistics Cheatsheet - Key statistical methods.

⬆ back to top

SQL & Databases#

Quick SQL Cheatsheet - Handy SQL reference guide.
PostgreSQL Cheatsheet - A handy reference for the most common PostgreSQL psql commands and queries.

⬆ back to top

Miscellaneous#

CheatSheet for CheatSheets - Mega-repository of cheat sheets.
Dataquest - Power BI Cheat Sheet - A helpful resource for Power BI users.
Data Structures Cheat Sheet - A concise reference for common data structures and their properties.
Matplotlib Cheatsheets - Official cheatsheets for the Matplotlib plotting library in Python.
VSCode Awesome Cheatsheets - VS Code shortcuts.
Markdown Cheatsheet - Formatting for GitHub READMEs.
Emoji Cheat Sheet - Emojis in Markdown.
Docker Cheat Sheet - Docker commands and workflows.
Docker Awesome Cheatsheets - Containerization basics.

⬆ back to top

📦 Additional Python Libraries#

A collection of supplementary Python libraries that enhance development workflow, automate processes, and maintain project quality beyond core data analysis tools.

Code Quality & Development#

Black - Uncompromising Python code formatter.
Pre-commit - Framework for managing pre-commit hooks.
Pylint - Python code static analysis.
Mypy - Optional static typing for Python.
Rich - Rich text and beautiful formatting in the terminal.
Icecream - Debugging without using print.
Pandas-log - Logs pandas operations for data transformation tracking.
PandasVet - Code style validator for Pandas.
Pydeps - Python module dependency graphs.
PyForest - Automated Python imports for data science.

⬆ back to top

Documentation & File Processing#

Sphinx - Documentation generator.
Pdoc - API documentation for Python projects.
Mkdocs - Project documentation with Markdown.
OpenPyXL - Read/write Excel files.
Tablib - Exports data to XLSX, JSON, CSV.
PyPDF2 - Reads and writes PDF files.
Python-docx - Reads and writes Word documents.
CleverCSV - Smart CSV reader for messy data.
Python-markdownify - Convert HTML to Markdown.
Xlwings - Integration of Python with Excel.
Xmltodict - Converts XML to Python dictionaries.
MarkItDown - Python tool for converting files and office documents to Markdown.
Jupyter-book - Build publication-quality books from Jupyter notebooks.
WeasyPrint - Convert HTML to PDF.
PyMuPDF - Advanced PDF manipulation library.
Camelot - PDF table extraction library.

⬆ back to top

Web & APIs#

HTTPX - Next-generation HTTP client for Python.
FastAPI - Modern web framework for building APIs.
Typer - Library for building CLI applications.
Requests-cache - Persistent caching for requests library.

⬆ back to top

Miscellaneous#

Funcy - Fancy functional tools for Python.
Pillow - Image processing library.
Ftfy - Fixes broken Unicode strings.
JmesPath - Queries JSON data (SQL-like for JSON).
Glom - Transforms nested data structures.
Diagrams - Diagrams as code for cloud architecture.
Pytest - Framework for writing small tests.
Pampy - Pattern matching for Python dictionaries.
Pygorithm - A Python module for learning all major algorithms.
GitPython - A Python library used to interact with Git repositories.
TQDM - Progress bars for loops and operations.
Loguru - Python logging made simple.
Click - Beautiful command line interfaces.
Poetry - Python dependency management and packaging.
Hydra - Elegant configuration management.

⬆ back to top

📝 More Awesome Lists#

A curated list of other awesome lists on various topics and technologies.

Awesome AI Agents - A collection of resources, frameworks, and tools for building AI agents.
Awesome Chatgpt Prompts - A repository for ChatGPT prompt curation.
Awesome Jupyter - Curated list of Jupyter projects, libraries, and resources.
Awesome Business Intelligence - Actively curated list of awesome BI tools.
Awesome LLM Apps - A collection of applications built with large language models.
Awesome Prompt Engineering - A curated list of resources for prompt engineering with LLMs like ChatGPT.
Awesome Linux Software - A list of awesome applications and tools for Linux.
Awesome Product Management - A curated list of resources for product managers and aspiring PMs.
Awesome Python Applications - A list of free software and applications written in Python.
Awesome FastAPI - A curated list of FastAPI frameworks, libraries, and resources.
Awesome AutoHotkey - A curated list of awesome AutoHotkey libraries, scripts, and resources.
Awesome Productivity - A curated list of delightful productivity resources.
Awesome Scientific Writing - A curated list of resources for scientific writing, publishing, and research.
Awesome LaTeX - A curated list of LaTeX resources, libraries, and tools.
Awesome Actions - A curated list of awesome GitHub Actions for automation.
Awesome Quarto - A curated list of Quarto resources, including talks, tools, examples, and articles.
Awesome Vscode - A comprehensive list of useful VS Code extensions and resources.
Awesome Readme - Collection of well-crafted README files for inspiration.
Awesome GitHub Profile Readme - A collection of awesome GitHub profile READMEs and resources.
Awesome Code Review - A collection of resources for code review practices.
Awesome Certificates - A curated list of IT and developer certifications and learning resources.
Awesome Tunneling - A list of ngrok alternatives and tunneling software.
Anomaly Detection Resources - Books, papers, videos, and toolboxes related to anomaly detection.

⬆ back to top

🌐 Additional Resources#

A wide range of resources designed to facilitate learning, development, and exploration across different domains.

UC Berkeley - Data 8 - Course materials for the Data Science Foundations course.
A collective list of free APIs - A comprehensive list of free APIs for various purposes.
arXiv.org - A free distribution service and open-access archive for scholarly articles.
Elicit - An AI research assistant that helps automate parts of literature review.
500+ AI/ML/DL/NLP Projects - A massive collection of AI and machine learning projects with code for learning and portfolios.
Kittl - Platform for creating and editing charts and data visualizations.
Zasper - High-performance IDE for Jupyter Notebooks.
Sketch - A design toolkit built around designers' workflows.
Growth.Design - A collection of product case studies and behavioral psychology insights for data-driven decision-making.

⬆ back to top

🤝 Contributing#

We welcome your contributions!

See CONTRIBUTING.md for how to add resources.

⬆ back to top

📜 License#

This work is dedicated to the public domain under the CC0 1.0 Universal license.

⬆ back to top