Awesome Data Analysis#

400+ curated tools, libraries, cheatsheets, roadmaps, and tutorials for mastering data analysis, suited to beginners and experienced data analysts and scientists alike.


🏆 Awesome Data Science Repositories#

Curated collections of high-quality GitHub repos for inspiration and learning.

⬆ back to top


🗺️ Roadmaps#

Step-by-step guides and skill trees to master data science and analytics.

⬆ back to top


🐍 Python#

Resources#

A collection of resources for learning and mastering Python programming.

⬆ back to top


Data Manipulation with Pandas#

Tutorials and best practices for working with pandas DataFrames.

⬆ back to top


Useful Python Tools for Data Analysis#

A collection of Python libraries for efficient data manipulation, cleaning, visualization, validation, and analysis.

Data Manipulation & Cleaning#

  • Pandas-dq - Data type correction and automatic DataFrame cleaning.

  • Vaex - High-performance Python library for lazy Out-of-Core DataFrames.

  • DataCleaner - Python tool for automatically cleaning and preparing datasets.

  • Polars - Multithreaded, vectorized query engine for DataFrames (Rust-powered).

  • Pandas-flavor - Add custom methods to Pandas.

  • TheFuzz - Fuzzy string matching (Levenshtein distance).

  • PandasAI - Conversational data analysis using LLMs and RAG.

  • DateUtil - Extensions to the standard Python datetime module.

  • Fugue - Unified interface for Pandas, Spark, and Dask.

  • Pandas-DataReader - Reads data from various online sources into pandas DataFrames.

  • sklearn-pandas - Bridge between Pandas and Scikit-learn.

  • fitter - Figures out the distribution your data comes from.

  • Arrow - A friendlier way to create, manipulate, and format dates and times.

  • Pendulum - Alternative to datetime with timezone support.
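
To make the "Levenshtein distance" behind fuzzy string matching concrete, here is a minimal standard-library sketch. It is not TheFuzz's API or its scoring formula; the `levenshtein` and `similarity` helpers are hypothetical names introduced only for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            substitute_cost = previous[j - 1] + (ch_a != ch_b)
            current.append(min(insert_cost, delete_cost, substitute_cost))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical 0-100 score (an assumption, not TheFuzz's ratio)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)
```

A score like this can be used to flag near-duplicate values such as misspelled city names before merging datasets; the dedicated libraries above add faster implementations and token-based variants.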

⬆ back to top


Automated Data Visualization Tools#

  • AutoViz - Automatic data visualization with one line of code.

  • Vizro - Low-code toolkit for building high-quality data visualization apps.

  • Great Tables - Create awesome display tables using Python.

  • DataMapPlot - Create beautiful plots of data maps.

  • Datashader - Quickly and accurately render even the largest datasets.

  • Sweetviz - Automatic EDA with dataset comparison.

  • Lux - Automatic DataFrame visualization in Jupyter with a click.

  • Yellowbrick - A suite of visual diagnostic tools for machine learning, extending the Scikit-Learn API.

⬆ back to top


Data Quality & Profiling#

  • Pandas-profiling - Automatic DataFrame visualization and profiling (the project has since been renamed YData Profiling).

  • PyOD - Python library for outlier and anomaly detection.

  • YData Profiling - One-line data quality profiling and exploratory data analysis.

  • Missingno - Visualize missing data patterns in matrix format.

  • Dora - Automate EDA: preprocessing, feature engineering, visualization.

  • Alibi-detect - Algorithms for outlier, adversarial and drift detection.

⬆ back to top


Feature Engineering & Selection#

  • FeatureTools - Open-source automated feature engineering.

  • Feature Selector - Tool for dimensionality reduction of machine learning datasets.

  • TSFresh - A Python library for automatically extracting features from time series data.

  • Feature Engine - A feature engineering library with Scikit-Learn compatibility.

  • Prince - A Python library for multivariate exploratory data analysis, including PCA, CA, MCA, and more.

  • Factor Analyzer - A Python package for factor analysis, including exploratory and confirmatory methods.

⬆ back to top


Testing & Validation#

  • Pytest - Framework for writing small tests.

  • Cerberus - Data validation through schemas.

  • Pandera - Data validation through declarative schemas.

  • PandasVet - Flake8 plugin for opinionated linting of Pandas code (similar in spirit to ESLint).
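
The idea of validating data against a declarative schema can be sketched in plain Python. This is not the Cerberus or Pandera API; the `SCHEMA` layout, the column names, and the `validate` helper are assumptions made up for this example.

```python
from typing import Any, Callable

# Hypothetical schema: column name -> (expected type, check function).
SCHEMA: dict[str, tuple[type, Callable[[Any], bool]]] = {
    "user_id": (int, lambda v: v > 0),
    "country": (str, lambda v: len(v) == 2),
    "revenue": (float, lambda v: v >= 0.0),
}


def validate(rows: list[dict[str, Any]]) -> list[str]:
    """Return human-readable errors; an empty list means every row passed."""
    errors = []
    for index, row in enumerate(rows):
        for column, (expected_type, check) in SCHEMA.items():
            if column not in row:
                errors.append(f"row {index}: missing column '{column}'")
                continue
            value = row[column]
            if not isinstance(value, expected_type):
                errors.append(
                    f"row {index}: '{column}' should be {expected_type.__name__}"
                )
            elif not check(value):
                errors.append(f"row {index}: '{column}' failed its check")
    return errors
```

Keeping the rules in one data structure, separate from the loop that applies them, is what makes a schema "declarative": the rules can be reviewed, versioned, and reused without touching the validation logic.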

⬆ back to top


ETL & Data Pipelines#

  • Prefect - Workflow orchestration for building resilient data pipelines.

  • Airflow - Platform for automating data workflows.

  • Apache Arrow - Universal columnar format and multi-language toolbox for fast data interchange.

  • Petl - ETL tool for data cleaning and transformation.

  • DuckDB - In-process analytical database for fast SQL queries.
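
A small load-then-query step, the core of many ETL jobs, can be sketched with an embedded SQL engine. The example below uses the standard-library `sqlite3` module purely as a stand-in; it is not DuckDB's API, and the `sales` table and `total_by_region` helper are hypothetical.

```python
import sqlite3


def total_by_region(records):
    """Load (region, amount) tuples into an in-memory table and aggregate them."""
    conn = sqlite3.connect(":memory:")  # nothing is written to disk
    try:
        conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
        # Parameterized inserts avoid building SQL strings by hand.
        conn.executemany("INSERT INTO sales VALUES (?, ?)", records)
        cursor = conn.execute(
            "SELECT region, SUM(amount) FROM sales "
            "GROUP BY region ORDER BY region"
        )
        return cursor.fetchall()
    finally:
        conn.close()


rows = [("north", 120.0), ("south", 80.0), ("north", 30.0), ("west", 55.5)]
totals = total_by_region(rows)
```

Pushing the aggregation into SQL keeps the transformation declarative; columnar engines such as DuckDB apply the same pattern to much larger datasets.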

⬆ back to top


Interactive Tools & GUIs#

  • D-Tale - Interactive GUI for data analysis in a browser.

  • Pandasgui - GUI for viewing and filtering DataFrames.

  • QGrid - Interactive grid for sorting, filtering, and editing DataFrames in Jupyter.

  • PyGWalker - Interactive UIs for visual analysis of pandas DataFrames.

  • Mito - Jupyter extensions that help you write code faster.

  • Pivottablejs - Interactive PivotTable.js tables in Jupyter.

⬆ back to top


Data Generation & Simulation#

  • Faker - Generates fake data for testing.

  • Mimesis - Generates realistic test data.

⬆ back to top


Formatting & Logging#

  • Rich - Rich text and beautiful formatting in the terminal.

  • Pandas-log - Logs pandas operations for data transformation tracking.

  • Icecream - Debugging without using print.

⬆ back to top


Module Dependency & Code Management#

  • Pydeps - Python module dependency graphs.

  • PyForest - Automated Python imports for data science.

⬆ back to top


Parallel Computing for DataFrames#

  • Pandarallel - Parallel operations for pandas DataFrames.

  • Dask - Parallel computing for arrays and DataFrames.

  • Modin - Speeds up Pandas by distributing computations.

⬆ back to top


Documentation#

  • Sphinx - The Sphinx documentation generator.

  • Pdoc - API documentation for Python projects.

  • Mkdocs - Project documentation with Markdown.

⬆ back to top


File Formats & Documents#

  • OpenPyXL - Read/write Excel files with support for advanced features.

  • Tablib - Exports data to XLSX, JSON, CSV via a single API.

  • PyPDF2 - Reads and writes PDF files.

  • Python-docx - Reads and writes Word documents.

  • CleverCSV - Smart CSV reader for messy data.

  • Xlwings - Integration of Python with Excel.

  • Xmltodict - Converts XML to Python dictionaries.

  • Python-markdownify - Convert HTML to Markdown.

  • MarkItDown - Python tool for converting files and office documents to Markdown.
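
Messy CSV files often differ in their delimiter. As a rough illustration of dialect detection, the sketch below relies on the standard-library `csv.Sniffer`; it is not CleverCSV's API, and `read_messy_csv` is a hypothetical helper.

```python
import csv
import io


def read_messy_csv(text: str) -> list[dict[str, str]]:
    """Guess the delimiter from a sample, then parse rows into dictionaries."""
    sample = text[:1024]
    # Restrict the candidates so ordinary letters are never chosen as delimiter.
    dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")
    reader = csv.DictReader(io.StringIO(text), dialect=dialect)
    return [dict(row) for row in reader]


records = read_messy_csv("name;city\nAda;London\nLinus;Helsinki\n")
```

`csv.Sniffer` raises `csv.Error` when it cannot settle on a delimiter, so callers working with unknown files may want to catch that and fall back to a default dialect.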

⬆ back to top


Additional#

  • Pillow - Image processing library.

  • Ftfy - Fixes broken Unicode strings.

  • Records - SQL queries to databases via Python syntax.

  • Dataset - JSON-like interface for working with SQL databases.

  • JmesPath - Queries JSON data (SQL-like for JSON).

  • Glom - Transforms nested data structures.

  • Pampy - Pattern matching for Python dictionaries.

  • Geopy - Geocoding addresses and calculating distances.

  • Diagrams - Diagrams as code for cloud system architecture prototyping.

  • Scattertext - Beautiful visualizations of language differences among document types.

  • Pygorithm - A Python module for learning all major algorithms.

  • IGraph - A library for creating and manipulating graphs and networks, with bindings for multiple languages.

  • Joblib - A lightweight pipelining library for Python, particularly useful for saving and loading large NumPy arrays.

⬆ back to top


🗃️ SQL & Databases#

Resources#

SQL tutorials and database design principles.

⬆ back to top


Tools#

A collection of Python libraries and drivers for seamless database access and interaction.

⬆ back to top


📊 Data Visualization#

Resources#

Color theory, chart selection guides, and storytelling tips.

⬆ back to top


Tools#

Libraries for static, interactive, and 3D visualizations.

  • Matplotlib - A comprehensive library for creating static, animated, and interactive visualizations in Python.

  • Seaborn - A statistical data visualization library based on Matplotlib.

  • Plotly - A library for creating interactive plots and dashboards.

  • Altair - A declarative statistical visualization library for Python.

  • Bokeh - A library for creating interactive visualizations for modern web browsers.

  • HoloViews - A tool for building complex visualizations easily.

  • Geopandas - An extension of Pandas for geospatial data.

  • Folium - A library for visualizing data on interactive maps.

  • Pygal - A Python SVG charting library.

  • Plotnine - A grammar of graphics for Python.

  • Bqplot - A plotting library for IPython/Jupyter notebooks.

  • PyPalettes - A large collection (2,500+) of color palettes for Python.

⬆ back to top


📈 Dashboards#

Resources#

Tutorials for building and enhancing dashboards and visualizations using various tools and frameworks.

⬆ back to top


Tools#

Frameworks for building custom dashboard solutions.

  • Dash - Framework from Plotly for creating interactive analytical web applications.

  • Streamlit - Simplified framework for building data applications.

  • Panel - HoloViz framework for creating interactive web applications and dashboards.

  • Gradio - Tool for creating and sharing machine learning applications.

⬆ back to top


Software#

A list of leading tools and platforms for data visualization and dashboard creation.

  • Tableau - Leading data visualization software.

  • Microsoft Power BI - Business analytics tool for visualizing data.

  • QlikView - Tool for data visualization and business intelligence.

  • Metabase - User-friendly open-source BI tool.

  • Apache Superset - Open-source data exploration and visualization platform.

  • Redash - Tool for visualizing and sharing data insights.

  • Grafana - Dashboarding and monitoring tool.

  • Datawrapper - User-friendly chart and map creation tool.

  • ChartBlocks - Online chart creation platform.

  • Infogram - Tool for creating infographics and visual content.

  • Google Data Studio - Free tool for creating interactive dashboards and reports (now called Looker Studio).

  • Rath - Next-generation automated data exploratory analysis and visualization platform.

⬆ back to top


🕸️ Web Scraping & Crawling#

Resources#

A collection of valuable resources, tutorials, and libraries for web scraping with Python.

⬆ back to top


Tools#

A list of Python libraries and tools for web scraping.

  • BeautifulSoup - A library for parsing HTML and XML documents.

  • Selenium - A tool for automating web applications for testing purposes.

  • Scrapy - An open-source and collaborative web crawling framework for Python.

  • Gerapy - Distributed Crawler Management Framework based on Scrapy, Scrapyd, Django, and Vue.js.

  • TextAttack - A Python framework for adversarial attacks, data augmentation, and model training in NLP.

  • AutoScraper - A smart, automatic, fast, and lightweight web scraper for Python.

  • Feedparser - A library to parse feeds in Python.

  • Trafilatura - A Python & command-line tool to gather text and metadata on the web.

  • You-Get - A tiny command-line utility to download media content (video, audio, images) from the web.

  • Dirsearch - A web path scanner.

  • MechanicalSoup - A Python library for automating interaction with websites.

  • ScrapeGraph AI - A Python scraper based on AI.

  • Snscrape - A social networking service scraper in Python.

⬆ back to top


📖 Natural Language Processing (NLP)#

Resources#

A selection of resources for learning and applying natural language processing in Python.

⬆ back to top


Tools#

A collection of powerful libraries and frameworks for natural language processing in Python.

  • Natural Language Toolkit (NLTK) - A leading platform for building Python programs to work with human language data.

  • TextBlob - A simple library for processing textual data.

  • spaCy - An open-source library for advanced NLP in Python.

  • TextRank - A library for TextRank algorithm implementation.

  • Flair - A simple framework for state-of-the-art NLP.

  • BERT - A transformer-based model for NLP tasks.

  • Transformers - A library for state-of-the-art NLP models.

⬆ back to top


🔢 Mathematics, Statistics & Probability#

Mathematics#

A collection of resources for learning and applying mathematics and statistics, particularly in the context of data science and machine learning.

⬆ back to top


Statistics & Probability#

A selection of resources focused on statistics and probability, including tutorials, interactive tools, and comprehensive guides.

⬆ back to top


🧪 A/B Testing#

A collection of resources focused on A/B testing.
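
As a taste of the statistics these resources cover, the following is a minimal sketch of a two-proportion z-test, one common way to compare conversion rates between two variants. The function name and its parameters are hypothetical, and the test assumes samples large enough for the normal approximation to hold.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    # Assumes 0 < pooled < 1, otherwise the standard error below is zero.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical experiment: 100/1000 conversions for A, 150/1000 for B.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
```

A small p-value suggests the observed difference is unlikely under equal conversion rates; deciding the sample size and significance threshold before running the experiment matters as much as the formula itself.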

⬆ back to top


🤖 Machine Learning#

A collection of resources to help you learn and apply machine learning concepts and techniques.

⬆ back to top


🧠 Productivity & Development Tools#

Resources#

A collection of resources and tools to enhance productivity and streamline development processes.

  • Awesome Jupyter - Curated list of Jupyter projects, libraries, and resources.

  • Best of Jupyter - Ranked list of notable Jupyter Notebook, Hub, and Lab projects.

  • Awesome AutoHotkey - A curated list of awesome AutoHotkey libraries, scripts, and resources.

  • Awesome Productivity - A curated list of delightful productivity resources.

  • Microsoft To Do - A simple to-do list app from Microsoft.

  • Google Keep - A note-taking and list-making app.

  • Bujo - Tools to help transform the way you work and live.

  • Parabola - An AI-powered workflow builder for organizing data.

  • Notion - An all-in-one workspace for note-taking and task management.

  • Trello - A visual project management tool.

  • Asana - A project management platform for tracking work and projects.

  • Awesome Chatgpt Prompts - A repository for ChatGPT prompt curation.

  • Markdown Here - Extension for writing emails in Markdown and rendering them before sending.

  • Cookiecutter Data Science - A standardized project structure for data science projects.

  • Sketch - Design toolkit built around designers' workflows.

  • The Markdown Guide - Comprehensive guide to learning Markdown.

  • Kittl - Platform for creating and editing charts and data visualizations.

⬆ back to top


Useful Linux Tools#

A selection of tools to enhance productivity and functionality in Linux environments.

  • Peek - Simple animated GIF screen recorder with an easy-to-use interface.

  • CopyQ - Clipboard manager with advanced features.

  • Translate Shell - Command-line translator using Google Translate, Bing Translator, Yandex.Translate, etc.

  • Espanso - Cross-platform Text Expander written in Rust.

  • Flameshot - Powerful yet simple-to-use screenshot software.

  • Inkscape - A powerful, free, and open-source vector graphics editor for creating and editing visualizations.

  • Rclone - A command-line program to manage files on cloud storage.

  • Rsync - A fast and versatile file copying tool that can synchronize files and directories between two locations over a network or locally.

  • Timeshift - System restore tool for Linux that creates filesystem snapshots using rsync+hardlinks or BTRFS snapshots.

  • Backintime - A comfortable and well-configurable graphical frontend for incremental backups.

  • Fzf - A command-line fuzzy finder.

  • Osquery - SQL powered operating system instrumentation, monitoring, and analytics.

  • GNU Parallel - A tool to run jobs in parallel.

  • HTop - An interactive process viewer.

  • Ncdu - A disk usage analyzer with an ncurses interface.

  • Thefuck - A command line tool to correct your previous console command.

⬆ back to top


Useful VS Code Extensions#

A collection of extensions to enhance functionality and productivity in Visual Studio Code.

⬆ back to top


📚 Skill Development & Career Resources#

Practice Resources#

A collection of resources to enhance skills and advance your career in data analysis and related fields.

⬆ back to top


Curated Jupyter Notebooks#

A selection of curated Jupyter notebooks to support learning and exploration in data science and analysis.

⬆ back to top


Data Sources & Datasets#

A collection of resources for accessing datasets and data sources for analysis and projects.

⬆ back to top


Resume and Interview Tips#

A variety of resources to help you prepare for interviews and enhance your resume.

⬆ back to top


📋 Cheatsheets#

A collection of cheatsheets across various domains to aid in quick reference and learning.

Python#

⬆ back to top


Data Science & Machine Learning#

⬆ back to top


Linux & Command Line#

⬆ back to top


Git & GitHub#

⬆ back to top


Probability & Statistics#

⬆ back to top


Docker#

⬆ back to top


Tools & Workflow#

⬆ back to top


SQL & Databases#

⬆ back to top


Interview Preparation#

⬆ back to top


Miscellaneous#

⬆ back to top


🌐 Additional Resources#

A wide range of resources designed to facilitate learning, development, and exploration across different domains.

  • Growth.Design - A collection of product case studies and behavioral psychology insights for data-driven decision-making.

  • Jupyter Book - Create beautiful, publication-quality books and documents from computational content.

  • Awesome Quarto - A curated list of Quarto resources, including talks, tools, examples, and articles.

  • Awesome Vscode - A comprehensive list of useful VS Code extensions and resources.

  • UC Berkeley - Data 8 - Course materials for the Data Science Foundations course.

  • A collective list of free APIs - A comprehensive list of free APIs for various purposes.

  • Introduction to Big Data - Resources and materials for understanding Big Data concepts.

  • Awesome Readme - Collection of well-crafted README files for inspiration.

  • Anomaly Detection Resources - Books, papers, videos, and toolboxes related to anomaly detection.

  • Awesome Code Review - A collection of resources for code review practices.

  • W3Resource - Online platform offering tutorials, code examples, and exercises for various programming languages.

⬆ back to top


🤝 Contributing#

We welcome your contributions!

See CONTRIBUTING.md for how to add resources.

⬆ back to top


📜 License#

This work is dedicated to the public domain under the CC0 1.0 Universal license.

⬆ back to top