Awesome Data Analysis#

500+ curated tools, libraries, cheatsheets, roadmaps, and tutorials to master data analysis. Perfect for beginners and experienced data analysts and scientists.

📑 Contents#


🏆 Awesome Data Science Repositories#

Curated collections of high-quality GitHub repos for inspiration and learning.

⬆ back to top


🗺️ Roadmaps#

Step-by-step guides and skill trees to master data science and analytics.

⬆ back to top


🐍 Python#

Resources#

A collection of resources for learning and mastering Python programming.

⬆ back to top


Data Manipulation with Pandas and Numpy#

Tutorials and best practices for working with Pandas and Numpy.

⬆ back to top


Useful Python Tools for Data Analysis#

A collection of Python libraries for efficient data manipulation, cleaning, visualization, validation, and analysis.

Data Processing & Transformation#

  • Pandas DQ - Data type correction and automatic DataFrame cleaning.

  • Vaex - High-performance Python library for lazy Out-of-Core DataFrames.

  • Polars - Multithreaded, vectorized query engine for DataFrames.

  • Fugue - Unified interface for Pandas, Spark, and Dask.

  • TheFuzz - Fuzzy string matching (Levenshtein distance).

  • DateUtil - Extensions for standard Python datetime features.

  • Arrow - Human-friendly creation, manipulation, and formatting of dates and times.

  • Pendulum - Alternative to datetime with timezone support.

  • Dask - Parallel computing for arrays and DataFrames.

  • Modin - Speeds up Pandas by distributing computations.

  • Pandarallel - Parallel operations for pandas DataFrames.

  • DataCleaner - Python tool for automatically cleaning and preparing datasets.

  • Pandas Flavor - Add custom methods to Pandas.

  • Pandas DataReader - Reads data from various online sources into pandas DataFrames.

  • Sklearn Pandas - Bridge between Pandas and Scikit-learn.

  • CuPy - A NumPy-compatible array library accelerated by NVIDIA CUDA for high-performance computing.

  • Numba - A JIT compiler that translates a subset of Python and NumPy code into fast machine code.

  • Pandas Stubs - Type stubs for pandas, improves IDE autocompletion.

  • Petl - ETL tool for data cleaning and transformation.

⬆ back to top


Automated EDA and Visualization Tools#

  • AutoViz - Automatic data visualization in 1 line of code.

  • Sweetviz - Automatic EDA with dataset comparison.

  • Lux - Automatic DataFrame visualization in Jupyter.

  • YData Profiling - Data quality profiling & exploratory data analysis.

  • Missingno - Visualize missing data patterns.

  • Vizro - Low-code toolkit for building data visualization apps.

  • Yellowbrick - Visual diagnostic tools for machine learning.

  • Great Tables - Create awesome display tables using Python.

  • DataMapPlot - Create beautiful plots of data maps.

  • Datashader - Quickly and accurately render even the largest data.

  • PandasAI - Conversational data analysis using LLMs and RAG.

  • Mito - Jupyter extensions for faster code writing.

  • D-Tale - Interactive GUI for data analysis in a browser.

  • Pandasgui - GUI for viewing and filtering DataFrames.

  • PyGWalker - Interactive UIs for visual analysis of DataFrames.

  • QGrid - Interactive grid for DataFrames in Jupyter.

  • Pivottablejs - Interactive PivotTable.js tables in Jupyter.

⬆ back to top


Data Quality & Validation#

  • PyOD - Outlier and anomaly detection.

  • Alibi Detect - Outlier, adversarial and drift detection.

  • Pandera - Data validation through declarative schemas.

  • Cerberus - Data validation through schemas.

  • Pydantic - Data validation using Python type annotations.

  • Dora - Automate EDA: preprocessing, feature engineering, visualization.

  • Great Expectations - Data validation and testing.

⬆ back to top


Feature Engineering & Selection#

  • FeatureTools - Automated feature engineering.

  • Feature Engine - Feature engineering with Scikit-Learn compatibility.

  • Prince - Multivariate exploratory data analysis (PCA, CA, MCA).

  • Fitter - Figures out the distribution your data comes from.

  • Feature Selector - Tool for dimensionality reduction of machine learning datasets.

  • Category Encoders - Extensive collection of categorical variable encoders.

  • Imbalanced Learn - Handling imbalanced datasets.

⬆ back to top


Specialized Data Tools#

  • Faker - Generates fake data for testing.

  • Mimesis - Generates realistic test data.

  • Geopy - Geocoding addresses and calculating distances.

  • PySAL - Spatial analysis functions.

  • Factor Analyzer - A Python package for factor analysis, including exploratory and confirmatory methods.

  • Scattertext - Beautiful visualizations of language differences among document types.

  • IGraph - A library for creating and manipulating graphs and networks, with bindings for multiple languages.

  • Joblib - A lightweight pipelining library for Python, particularly useful for saving and loading large NumPy arrays.

  • ImageIO - A library that provides an easy interface to read and write a wide range of image data.

  • Texthero - Text preprocessing, representation and visualization.

  • Geopandas - Geographic data operations with pandas.

  • NetworkX - Network analysis and graph theory.

⬆ back to top


🗃️ SQL & Databases#

Resources#

SQL tutorials and database design principles.

⬆ back to top


Tools#

A collection of Python libraries and drivers for seamless database access and interaction.

  • PyODBC - Python library for ODBC database access.

  • SQLAlchemy - SQL toolkit and ORM for Python.

  • Psycopg2 - PostgreSQL database adapter.

  • MySQL Connector/Python - MySQL driver for Python.

  • PonyORM - ORM for Python with dynamic query generation.

  • PyMongo - Official MongoDB driver for Python.

  • SQLiteviz - A tool for exploring SQLite databases and visualizing the results of your queries.

  • SQLite - A C-language library that implements a small, fast, self-contained, high-reliability, full-featured SQL database engine.

  • DB Browser for SQLite - A high quality, visual, open source tool to create, design, and edit database files compatible with SQLite.

  • DBeaver - A free universal database tool and SQL client for developers, SQL programmers, and administrators.

  • Beekeeper Studio - A modern, easy-to-use SQL client and database manager with a clean, cross-platform interface.

  • SQLFluff - A modular SQL linter and auto-formatter designed to enforce consistent style and catch errors in SQL code.

  • PyMySQL - A pure-Python MySQL client library for interacting with MySQL databases from Python applications.

  • Vanna.AI - An AI-powered tool for generating SQL queries from natural language questions.

  • SQLChat - A chat-based SQL client that allows you to query databases using natural language conversations.

  • Records - SQL queries to databases via Python syntax.

  • Dataset - JSON-like interface for working with SQL databases.

  • SQLGlot - A no-dependency SQL parser, transpiler, and optimizer for Python.

  • TDengine - An open-source big data platform designed for time-series data, IoT, and industrial monitoring.

  • TimescaleDB - An open-source time-series SQL database optimized for fast ingest and complex queries.

  • DuckDB - In-process analytical database for fast SQL queries.

⬆ back to top


📊 Data Visualization#

Resources#

Color theory, chart selection guides, and storytelling tips.

⬆ back to top


Tools#

Libraries for static, interactive, and 3D visualizations.

  • Matplotlib - A comprehensive library for creating static, animated, and interactive visualizations in Python.

  • Seaborn - A statistical data visualization library based on Matplotlib.

  • Plotly - A library for creating interactive plots and dashboards.

  • Altair - A declarative statistical visualization library for Python.

  • Bokeh - A library for creating interactive visualizations for modern web browsers.

  • HoloViews - A tool for building complex visualizations easily.

  • Geopandas - An extension of Pandas for geospatial data.

  • Folium - A library for visualizing data on interactive maps.

  • Pygal - A Python SVG charting library.

  • Plotnine - A grammar of graphics for Python.

  • Bqplot - A plotting library for IPython/Jupyter notebooks.

  • PyPalettes - A large collection (2,500+) of color maps for Python.

  • Deck.gl - A WebGL-powered framework for visual exploratory data analysis of large datasets.

  • Python for Geo - Contextily: add background basemaps to your plots in GeoPandas.

  • OSMnx - A package to easily download, model, analyze, and visualize street networks from OpenStreetMap.

  • Apache ECharts - A powerful, interactive charting and visualization library for browser-based applications.

  • VisPy - A high-performance interactive 2D/3D data visualization library leveraging the power of OpenGL.

  • Glumpy - A Python library for scientific visualization that is fast, scalable and beautiful, based on OpenGL.

  • Pandas-bokeh - Bokeh plotting backend for Pandas.

⬆ back to top


📈 Dashboards & BI#

Resources#

Tutorials for building and enhancing dashboards and visualizations using various tools and frameworks.

⬆ back to top


Tools#

Frameworks for building custom dashboard solutions.

  • Dash - Framework for creating interactive web applications.

  • Streamlit - Simplified framework for building data applications.

  • Panel - Framework for creating interactive web applications.

  • Gradio - Tool for creating and sharing machine learning applications.

  • OpenSearch Dashboards - A powerful data visualization and dashboarding tool for OpenSearch data, forked from Kibana.

  • GridStack.js - A library for building draggable, resizable responsive dashboard layouts.

  • Tremor - A React library to build dashboards fast with pre-built components for charts, KPIs, and more.

  • Appsmith - An open-source platform to build and deploy internal tools, admin panels, and CRUD apps quickly.

  • Grafanalib - A Python library for generating Grafana dashboards configuration as code.

  • H2O Wave - A Python framework for rapidly building and deploying realtime web apps and dashboards for AI and analytics.

  • Shiny for Python - Python version of the popular R Shiny framework.

  • Voilà - Turn Jupyter notebooks into standalone web applications.

  • Reflex - Full-stack Python framework for building web apps.

⬆ back to top


Software#

A list of leading tools and platforms for data visualization and dashboard creation.

  • Tableau - Leading data visualization software.

  • Microsoft Power BI - Business analytics tool for visualizing data.

  • QlikView - Tool for data visualization and business intelligence.

  • Metabase - User-friendly open-source BI tool.

  • Apache Superset - Open-source data exploration and visualization platform.

  • Preset - A platform for modern business intelligence, providing a hosted version of Apache Superset.

  • Redash - Tool for visualizing and sharing data insights.

  • Grafana - Dashboarding and monitoring tool.

  • Datawrapper - User-friendly chart and map creation tool.

  • ChartBlocks - Online chart creation platform.

  • Infogram - Tool for creating infographics and visual content.

  • Google Data Studio (now Looker Studio) - Free tool for creating interactive dashboards and reports.

  • Rath - Next-generation automated data exploratory analysis and visualization platform.

  • Kibana - The official visualization and dashboarding tool for the Elastic Stack (Elasticsearch, Logstash, Beats).

⬆ back to top


🕸️ Web Scraping & Crawling#

Resources#

A collection of valuable resources, tutorials, and libraries for web scraping with Python.

⬆ back to top


Tools#

A list of Python libraries and tools for web scraping.

  • BeautifulSoup - A library for parsing HTML and XML documents.

  • Selenium - A tool for automating web applications for testing purposes.

  • Scrapy - An open-source and collaborative web crawling framework for Python.

  • Gerapy - Distributed Crawler Management Framework based on Scrapy, Scrapyd, Django, and Vue.js.

  • AutoScraper - A smart, automatic, fast, and lightweight web scraper for Python.

  • Feedparser - A library to parse feeds in Python.

  • Trafilatura - A Python & command-line tool to gather text and metadata on the web.

  • You-Get - A tiny command-line utility to download media contents (videos, audios, images) from the web.

  • MechanicalSoup - A Python library for automating interaction with websites.

  • ScrapeGraph AI - A Python scraper based on AI.

  • Snscrape - A social networking service scraper in Python.

  • Ferret - A web scraping system that lets you declaratively describe what data to extract using a simple query language.

  • Grab - A Python framework for building web scraping apps, providing a high-level API for asynchronous requests.

  • Playwright - Python version of the Playwright browser automation library.

  • PyQuery - A jQuery-like library for parsing HTML documents in Python.

  • Helium - High-level Selenium wrapper for easier web automation.

⬆ back to top


🔢 Mathematics#

A collection of resources for learning mathematics, particularly in the context of data science and machine learning.

⬆ back to top


🎲 Statistics & Probability#

Resources#

A selection of resources focused on statistics and probability, including tutorials and comprehensive guides.

⬆ back to top


Tools#

A collection of tools focused on statistics and probability.

  • SciPy - Fundamental library for scientific computing and statistics.

  • Statsmodels - Statistical modeling, testing, and data exploration.

  • PyMC - A probabilistic programming library for Python that allows for flexible Bayesian modeling.

  • Pingouin - Statistical package with improved usability over SciPy.

  • scikit-posthocs - Post-hoc tests for statistical analysis of data.

  • Lifelines - Survival analysis and event history analysis in Python.

  • scikit-survival - Survival analysis built on scikit-learn for time-to-event prediction.

  • Bootstrap - Bootstrap confidence interval estimation methods.

  • PyStan - Python interface to Stan for Bayesian statistical modeling.

  • ArviZ - Exploratory analysis of Bayesian models with visual diagnostics.

  • PyGAM - A Python library for generalized additive models with built-in smoothing and regularization.

  • NumPyro - A probabilistic programming library built on JAX for high-performance Bayesian modeling.

  • Causal Impact - A Python implementation of the R package for causal inference using Bayesian structural time-series models.

  • DoWhy - A Python library for causal inference that supports explicit modeling and testing of causal assumptions.

  • Patsy - A Python library for describing statistical models and building design matrices.

  • Pomegranate - Fast and flexible probabilistic modeling library for Python with GPU support.

⬆ back to top


🧪 A/B Testing#

A collection of resources focused on A/B testing.

⬆ back to top


⏳ Time Series Analysis#

Resources#

A collection of resources for understanding time series fundamentals and analytical techniques.

⬆ back to top


Tools#

A collection of tools for working with temporal data.

  • Facebook Prophet - A procedure for forecasting time series data based on an additive model.

  • Uber Orbit - A Python package for Bayesian time series forecasting and inference.

  • sktime - A unified Python framework for machine learning with time series, compatible with scikit-learn.

  • GluonTS - A Python toolkit for probabilistic time series modeling with deep learning, with PyTorch and MXNet backends.

  • Time-Series-Library - A library for deep learning-based time series analysis and forecasting.

  • TimesFM - A pretrained time series foundation model from Google Research for zero-shot forecasting.

  • PyTorch Forecasting - A PyTorch-based library for time series forecasting with neural networks.

  • Time-series-prediction - A collection of time series prediction methods and implementations.

  • PlotJuggler - A tool to visualize and analyze time series data logs in real-time.

  • TSFresh - Automatically extracting features from time series data.

  • pmdarima - Python library for ARIMA modeling and time series analysis.

  • Kats - Toolkit for analyzing time series data from Facebook Research.

⬆ back to top


⚙️ Data Engineering#

Resources#

A collection of resources to help you build and manage robust data pipelines and infrastructure.

⬆ back to top


Tools#

A collection of tools for building, deploying, and managing data pipelines and infrastructure.

  • dbt-core - A framework for transforming data in your warehouse using SQL and Jinja.

  • Apache Spark - A unified engine for large-scale data processing and analytics.

  • Apache Kafka - A distributed event streaming platform for building real-time data pipelines.

  • Dagster - A data orchestrator for machine learning, analytics, and ETL.

  • Apache Airflow - A platform to programmatically author, schedule, and monitor workflows.

  • Apache Hive - A data warehouse software for reading, writing, and managing large datasets in distributed storage using SQL.

  • Apache Hadoop - A framework that allows for the distributed processing of large data sets across clusters of computers.

  • Luigi - A Python module for building complex and batch-oriented data pipelines.

  • Apache Iceberg - A high-performance table format for huge analytic datasets.

  • Apache Cassandra - A highly scalable distributed NoSQL database designed for handling large amounts of data across many commodity servers.

  • Apache Flink - A framework for stateful computations over unbounded and bounded data streams (real-time stream processing).

  • Apache Beam - A unified model for defining both batch and streaming data-parallel processing pipelines.

  • Apache Pulsar - A cloud-native, distributed messaging and streaming platform.

  • Delta Lake - A storage layer that brings ACID transactions to Apache Spark and big data workloads.

  • Apache Hudi - An open data lakehouse platform, built on a high-performance open table format.

  • Trino - A distributed SQL query engine designed for fast analytic queries against large datasets.

  • DataHub - A metadata platform for the modern data stack.

  • OpenLineage - An open framework for collection and analysis of data lineage.

  • Netflix Metaflow - A human-friendly Python library for helping scientists and engineers build and manage real-life data science projects.

  • Feast - A feature store for machine learning that manages and serves ML features to models.

  • Kedro - A framework for creating reproducible, maintainable and modular data science code.

  • Apache Calcite - A dynamic data management framework that allows for SQL parsing, optimization, and federation.

  • Prefect - Workflow orchestration for building resilient data pipelines.

  • Apache Arrow - Universal columnar format and multi-language toolbox for fast data interchange.

⬆ back to top


📖 Natural Language Processing (NLP)#

Resources#

A selection of resources for learning and applying natural language processing in Python.

⬆ back to top


Tools#

A collection of powerful libraries and frameworks for natural language processing in Python.

  • Natural Language Toolkit (NLTK) - A leading platform for building Python programs to work with human language data.

  • TextBlob - A simple library for processing textual data.

  • SpaCy - An open-source software library for advanced NLP in Python.

  • BERT - A transformer-based model for NLP tasks.

  • Flair - A simple framework for state-of-the-art NLP.

  • OpenHands - An open-source platform for AI agents that carry out software development tasks using large language models.

  • Stanford CoreNLP - A Java suite of core NLP tools providing fundamental linguistic analysis capabilities.

  • John Snow Labs Spark-NLP - A state-of-the-art Natural Language Processing library built on Apache Spark.

  • TextAttack - A Python framework for adversarial attacks, data augmentation, and model training in NLP.

  • Gensim - Topic modeling and natural language processing library for Python.

  • Stanza - Python NLP library for many human languages, from the Stanford NLP Group.

  • SentenceTransformers - Framework for state-of-the-art sentence and text embeddings.

⬆ back to top


🤖 Machine Learning & AI#

Resources#

A collection of resources to help you learn and apply machine learning concepts and techniques.

⬆ back to top


Tools#

A collection of tools for developing and deploying machine learning models.

  • TensorFlow - An end-to-end open source platform for machine learning and deep learning applications.

  • PyTorch - Deep learning framework with strong support for research and production.

  • Scikit-learn - Machine learning library for classical algorithms and model building.

  • HuggingFace Transformers - The model-definition framework for state-of-the-art machine learning models.

  • XGBoost - Optimized distributed gradient boosting library for tree-based models.

  • LightGBM - Fast, distributed, high-performance gradient boosting framework.

  • CatBoost - High-performance gradient boosting on decision trees with categorical features support.

  • LangChain - Framework for developing applications powered by language models.

  • LlamaIndex - Data framework for LLM-based applications with RAG capabilities.

  • vLLM - High-throughput and memory-efficient inference library for LLMs.

  • Ollama - Tool for running large language models locally on your machine.

  • MLflow - An open-source platform for the complete machine learning lifecycle.

  • Fast.ai - Deep learning library simplifying training fast and accurate neural nets.

  • HuggingFace Diffusers - A library for state-of-the-art pretrained diffusion models for image and audio generation.

  • PEFT - A library for efficiently adapting large pretrained models to various tasks.

  • Evidently - A tool for analyzing and monitoring data and model drift in production.

  • Wandb - A tool for experiment tracking, dataset versioning, and model management.

  • SHAP - A game theoretic approach to explain the output of any machine learning model.

  • BentoML - A framework for building, shipping, and scaling machine learning applications.

  • Optuna - Hyperparameter optimization framework.

  • DVC - A version control system for machine learning projects to track data, models, and experiments.

  • OpenLLM - Open platform for operating large language models in production.

  • Deepchecks - Validation for ML models and data.

  • Kubeflow - A machine learning toolkit for Kubernetes, focused on simplifying deployments.

  • Sematic - An open-source tool to build, debug, and execute ML pipelines with native Python.

⬆ back to top


🧠 Productivity#

Resources#

A collection of resources and tools to enhance productivity and streamline development processes.

  • Best of Jupyter - Ranked list of notable Jupyter Notebook, Hub, and Lab projects.

  • Notion - An all-in-one workspace for note-taking and task management.

  • Trello - A visual project management tool.

  • ChatGPT Data Science Prompts - A collection of useful prompts for data scientists using ChatGPT.

  • Cookiecutter Data Science - A standardized project structure for data science projects.

  • The Markdown Guide - Comprehensive guide to learning Markdown.

  • Readme-AI - A tool to automatically generate README.md files for your projects.

  • Markdown Here - Extension for writing emails in Markdown and rendering them before sending.

  • Habitica - A habit-building and productivity app that treats your life like a role-playing game.

  • Microsoft To Do - A simple to-do list app from Microsoft.

  • Google Keep - A note-taking and list-making app.

  • Bujo - Tools to help transform the way you work and live.

  • Parabola - An AI-powered workflow builder for organizing data.

  • Asana - A project management platform for tracking work and projects.

  • Puter - An open-source, browser-based computing environment and cloud OS.

⬆ back to top


Useful Linux Tools#

A selection of tools to enhance productivity and functionality in Linux environments.

  • tldr-pages - Simplified and community-driven man pages with practical examples.

  • Bat - Cat clone with syntax highlighting.

  • Exa - Modern replacement for ls.

  • Ripgrep - Faster grep alternative.

  • Zoxide - Smarter cd command.

  • Peek - Simple animated GIF screen recorder with an easy to use interface.

  • CopyQ - Clipboard manager with advanced features.

  • Translate Shell - Command-line translator using Google Translate, Bing Translator, Yandex.Translate, etc.

  • Espanso - Cross-platform Text Expander written in Rust.

  • Flameshot - Powerful yet simple to use screenshot software.

  • DrawIO Desktop - An open-source diagramming software for making flowcharts, process diagrams, and more.

  • Inkscape - A powerful, free, and open-source vector graphics editor for creating and editing visualizations.

  • Rclone - A command-line program to manage files on cloud storage.

  • Rsync - A fast and versatile file copying tool that can synchronize files and directories between two locations over a network or locally.

  • Timeshift - System restore tool for Linux that creates filesystem snapshots using rsync+hardlinks or BTRFS snapshots.

  • Backintime - A comfortable and well-configurable graphical frontend for incremental backups.

  • Fzf - A command-line fuzzy finder.

  • Osquery - SQL powered operating system instrumentation, monitoring, and analytics.

  • GNU Parallel - A tool to run jobs in parallel.

  • HTop - An interactive process viewer.

  • Ncdu - A disk usage analyzer with an ncurses interface.

  • Thefuck - A command line tool to correct your previous console command.

  • Miller - A tool for querying, processing, and formatting data in various file formats (CSV, JSON, etc.), like awk/sed/cut for data.

  • jq - Command-line JSON processor for parsing and manipulating JSON data.

  • yq - Portable command-line YAML processor (like jq for YAML and XML).

  • q - Run SQL directly on CSV or TSV files from the command line.

  • VisiData - Interactive multitool for tabular data exploration in the terminal.

  • csvkit - Suite of command-line tools for working with CSV data.

  • httpie - Modern command-line HTTP client for API testing and debugging.

  • glances - Cross-platform system monitoring tool for resource usage analysis.

  • hyperfine - Command-line benchmarking tool for performance testing.

  • termgraph - Draw basic graphs in the terminal for quick data visualization.

  • fd - Simple, fast and user-friendly alternative to ‘find’.

  • dust - More intuitive version of du written in Rust.

  • bottom - Cross-platform graphical process/system monitor.

⬆ back to top


Useful VS Code Extensions#

A collection of extensions to enhance functionality and productivity in Visual Studio Code.

⬆ back to top


📚 Skill Development & Career#

Practice Resources#

A collection of resources to enhance skills and advance your career in data analysis and related fields.

⬆ back to top


Curated Jupyter Notebooks#

A selection of curated Jupyter notebooks to support learning and exploration in data science and analysis.

⬆ back to top


Data Sources & Datasets#

A collection of resources for accessing datasets and data sources for analysis and projects.

  • Kaggle Datasets - Extensive collection of datasets for practice in data analysis.

  • Opendatasets - A Python library for downloading datasets from Kaggle, Google Drive, and other online sources.

  • Datasette - An open source multi-tool for exploring and publishing data.

  • Awesome Public Datasets - Curated list of high-quality open datasets.

  • Open Data Sources - Collection of various open data sources.

  • Free Datasets for Projects - Dataquest’s compilation of free datasets.

  • Data World - The enterprise data catalog that CIOs, governance professionals, data analysts, and engineers trust in the AI era.

  • Awesome Public Real Time Datasets - A list of publicly available datasets with real-time data.

  • Google Dataset Search - A search engine for datasets from across the web.

  • NASA Open Data Portal - A site for NASA’s open data initiative, providing access to NASA’s data resources.

  • The World Bank Data - Free and open access to global development data by The World Bank.

  • Voice Datasets - A collection of audio and speech datasets for voice AI and machine learning.

  • HuggingFace Datasets - A lightweight library to easily share and access datasets for audio, computer vision, and NLP.

  • TensorFlow Datasets - A collection of ready-to-use datasets for use with TensorFlow and other Python ML frameworks.

  • NLP Datasets - A curated list of datasets for natural language processing (NLP) tasks.

  • TorchVision Datasets - The torchvision.datasets module provides many built-in computer vision datasets.

  • LLM Datasets - A collection of datasets and resources for training and fine-tuning Large Language Models (LLMs).

  • Unsplash Datasets - A collection of datasets from Unsplash, useful for computer vision and research.

  • Awesome JSON Datasets - A curated list of awesome JSON datasets that are publicly available without authentication.

⬆ back to top


Resume and Interview Tips#

A variety of resources to help you prepare for interviews and enhance your resume.

⬆ back to top


📋 Cheatsheets#

A collection of cheatsheets across various domains to aid in quick reference and learning.

GoalKicker Programming Notes#

⬆ back to top


Python#

⬆ back to top


Data Science & Machine Learning#

⬆ back to top


Linux & Git#

⬆ back to top


Probability & Statistics#

⬆ back to top


SQL & Databases#

⬆ back to top


Miscellaneous#

⬆ back to top


📦 Additional Python Libraries#

A collection of supplementary Python libraries that enhance development workflow, automate processes, and maintain project quality beyond core data analysis tools.

Code Quality & Development#

  • Black - Uncompromising Python code formatter.

  • Pre-commit - Framework for managing pre-commit hooks.

  • Pylint - Python code static analysis.

  • Mypy - Optional static typing for Python.

  • Rich - Rich text and beautiful formatting in the terminal.

  • Icecream - Debugging without using print.

  • Pandas-log - Logs pandas operations for data transformation tracking.

  • PandasVet - Code style validator for Pandas.

  • Pydeps - Python module dependency graphs.

  • PyForest - Automated Python imports for data science.

⬆ back to top


Documentation & File Processing#

  • Sphinx - Documentation generator.

  • Pdoc - API documentation for Python projects.

  • Mkdocs - Project documentation with Markdown.

  • OpenPyXL - Read/write Excel files.

  • Tablib - Exports data to XLSX, JSON, CSV.

  • PyPDF2 - Reads and writes PDF files.

  • Python-docx - Reads and writes Word documents.

  • CleverCSV - Smart CSV reader for messy data.

  • Python-markdownify - Convert HTML to Markdown.

  • Xlwings - Integration of Python with Excel.

  • Xmltodict - Converts XML to Python dictionaries.

  • MarkItDown - Python tool for converting files and office documents to Markdown.

  • Jupyter-book - Build publication-quality books from Jupyter notebooks.

  • WeasyPrint - Convert HTML to PDF.

  • PyMuPDF - Advanced PDF manipulation library.

  • Camelot - PDF table extraction library.

⬆ back to top


Web & APIs#

  • HTTPX - Next-generation HTTP client for Python.

  • FastAPI - Modern web framework for building APIs.

  • Typer - Library for building CLI applications.

  • Requests-cache - Persistent caching for requests library.

⬆ back to top


Miscellaneous#

  • Funcy - Fancy functional tools for Python.

  • Pillow - Image processing library.

  • Ftfy - Fixes broken Unicode strings.

  • JmesPath - Queries JSON data (SQL-like for JSON).

  • Glom - Transforms nested data structures.

  • Diagrams - Diagrams as code for cloud architecture.

  • Pytest - Framework for writing small tests.

  • Pampy - Pattern matching for Python dictionaries.

  • Pygorithm - A Python module for learning all major algorithms.

  • GitPython - A Python library used to interact with Git repositories.

  • TQDM - Progress bars for loops and operations.

  • Loguru - Python logging made simple.

  • Click - Beautiful command line interfaces.

  • Poetry - Python dependency management and packaging.

  • Hydra - Elegant configuration management.

⬆ back to top


📝 More Awesome Lists#

A curated list of other awesome lists on various topics and technologies.

⬆ back to top


🌐 Additional Resources#

A wide range of resources designed to facilitate learning, development, and exploration across different domains.

  • UC Berkeley - Data 8 - Course materials for the Data Science Foundations course.

  • A collective list of free APIs - A comprehensive list of free APIs for various purposes.

  • arXiv.org - A free distribution service and open-access archive for scholarly articles.

  • Elicit - An AI research assistant that helps automate parts of literature review.

  • 500+ AI/ML/DL/NLP Projects - A massive collection of AI and machine learning projects with code for learning and portfolios.

  • Kittl - Platform for creating and editing charts and data visualizations.

  • Zasper - High-performance IDE for Jupyter Notebooks.

  • Sketch - Toolkit designed for designers, focusing on their workflow.

  • Growth.Design - A collection of product case studies and behavioral psychology insights for data-driven decision-making.

⬆ back to top


🤝 Contributing#

We welcome your contributions!

See CONTRIBUTING.md for how to add resources.

⬆ back to top


📜 License#

This work is dedicated to the public domain under the CC0 1.0 Universal license.

⬆ back to top