Awesome Data Analysis#

500+ curated resources for data analysis and data science: tools, libraries, roadmaps, cheatsheets, interview guides and more.

📑 Contents#


🏆 Awesome Data Science Repositories#

Curated collections of high-quality GitHub repos for inspiration and learning.

⬆ back to contents


🗺️ Roadmaps#

Step-by-step guides and skill trees to master data science and analytics.

⬆ back to contents


🐍 Python#

Resources#

A collection of resources for learning and mastering Python programming.

⬆ back to contents


Data Manipulation with Pandas and Numpy#

Tutorials and best practices for working with Pandas and Numpy.

⬆ back to contents


Useful Python Tools for Data Analysis#

A collection of Python libraries for efficient data manipulation, cleaning, visualization, validation, and analysis.

Data Processing & Transformation#

  • Pandas DQ - Data type correction and automatic DataFrame cleaning.

  • Vaex - High-performance Python library for lazy Out-of-Core DataFrames.

  • Polars - Multithreaded, vectorized query engine for DataFrames.

  • Fugue - Unified interface for Pandas, Spark, and Dask.

  • TheFuzz - Fuzzy string matching (Levenshtein distance).

  • DateUtil - Extensions for standard Python datetime features.

  • Arrow - Friendlier creation, manipulation, and formatting of dates and times.

  • Pendulum - Alternative to datetime with timezone support.

  • Dask - Parallel computing for arrays and DataFrames.

  • Modin - Speeds up Pandas by distributing computations.

  • Pandarallel - Parallel operations for pandas DataFrames.

  • DataCleaner - Python tool for automatically cleaning and preparing datasets.

  • Pandas Flavor - Add custom methods to Pandas.

  • Pandas DataReader - Reads data from various online sources into pandas DataFrames.

  • Sklearn Pandas - Bridge between Pandas and Scikit-learn.

  • CuPy - A NumPy-compatible array library accelerated by NVIDIA CUDA for high-performance computing.

  • Numba - A JIT compiler that translates a subset of Python and NumPy code into fast machine code.

  • Pandas Stubs - Type stubs for pandas, improves IDE autocompletion.

  • Petl - ETL tool for data cleaning and transformation.
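
TheFuzz, listed above, scores string similarity using Levenshtein (edit) distance. As a rough illustration of that metric only, here is a minimal pure-Python sketch. The function name `levenshtein` is hypothetical, and this is not TheFuzz's API or implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    # Keep only the previous row of the dynamic-programming table.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

A smaller distance means the strings are more similar. Fuzzy-matching libraries typically normalize such a distance into a similarity score.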

⬆ back to contents


Automated EDA and Visualization Tools#

  • AutoViz - Automatic data visualization in 1 line of code.

  • Sweetviz - Automatic EDA with dataset comparison.

  • Lux - Automatic DataFrame visualization in Jupyter.

  • YData Profiling - Data quality profiling & exploratory data analysis.

  • Missingno - Visualize missing data patterns.

  • Vizro - Low-code toolkit for building data visualization apps.

  • Yellowbrick - Visual diagnostic tools for machine learning.

  • Great Tables - Create awesome display tables using Python.

  • DataMapPlot - Create beautiful plots of data maps.

  • Datashader - Quickly and accurately render even the largest data.

  • PandasAI - Conversational data analysis using LLMs and RAG.

  • Mito - Jupyter extensions for faster code writing.

  • D-Tale - Interactive GUI for data analysis in a browser.

  • Pandasgui - GUI for viewing and filtering DataFrames.

  • PyGWalker - Interactive UIs for visual analysis of DataFrames.

  • QGrid - Interactive grid for DataFrames in Jupyter.

  • Pivottablejs - Interactive PivotTable.js tables in Jupyter.

⬆ back to contents


Data Quality & Validation#

  • PyOD - Outlier and anomaly detection.

  • Alibi Detect - Outlier, adversarial and drift detection.

  • Pandera - Data validation through declarative schemas.

  • Cerberus - Data validation through schemas.

  • Pydantic - Data validation using Python type annotations.

  • Dora - Automate EDA: preprocessing, feature engineering, visualization.

  • Great Expectations - Data validation and testing.

⬆ back to contents


Feature Engineering & Selection#

  • FeatureTools - Automated feature engineering.

  • Feature Engine - Feature engineering with Scikit-Learn compatibility.

  • Prince - Multivariate exploratory data analysis (PCA, CA, MCA).

  • Fitter - Figures out the distribution your data comes from.

  • Feature Selector - Tool for dimensionality reduction of machine learning datasets.

  • Category Encoders - Extensive collection of categorical variable encoders.

  • Imbalanced Learn - Handling imbalanced datasets.

⬆ back to contents


Specialized Data Tools#

  • cuDF - A GPU DataFrame library for loading, joining, and aggregating data.

  • Faker - Generates fake data for testing.

  • Mimesis - Generates realistic test data.

  • Geopy - Geocoding addresses and calculating distances.

  • PySAL - Spatial analysis functions.

  • Factor Analyzer - A Python package for factor analysis, including exploratory and confirmatory methods.

  • Scattertext - Beautiful visualizations of language differences among document types.

  • IGraph - A library for creating and manipulating graphs and networks, with bindings for multiple languages.

  • Joblib - A lightweight pipelining library for Python, particularly useful for saving and loading large NumPy arrays.

  • ImageIO - A library that provides an easy interface to read and write a wide range of image data.

  • Texthero - Text preprocessing, representation and visualization.

  • Geopandas - Geographic data operations with pandas.

  • NetworkX - Network analysis and graph theory.

⬆ back to contents


🗃️ SQL & Databases#

Resources#

SQL tutorials and database design principles.

⬆ back to contents


Tools#

A collection of libraries and drivers for seamless database access and interaction.

  • PyODBC - Python library for ODBC database access.

  • SQLAlchemy - SQL toolkit and ORM for Python.

  • Psycopg2 - PostgreSQL database adapter.

  • MySQL Connector/Python - MySQL driver for Python.

  • PonyORM - ORM for Python with dynamic query generation.

  • PyMongo - Official MongoDB driver for Python.

  • SQLiteviz - A tool for exploring SQLite databases and visualizing the results of your queries.

  • SQLite - A C-language library that implements a small, fast, self-contained, high-reliability, full-featured SQL database engine.

  • DB Browser for SQLite - A high quality, visual, open source tool to create, design, and edit database files compatible with SQLite.

  • DBeaver - A free universal database tool and SQL client for developers, SQL programmers, and administrators.

  • Beekeeper Studio - A modern, easy-to-use SQL client and database manager with a clean, cross-platform interface.

  • SQLFluff - A modular SQL linter and auto-formatter designed to enforce consistent style and catch errors in SQL code.

  • PyMySQL - A pure-Python MySQL client library for interacting with MySQL databases from Python applications.

  • Vanna.AI - An AI-powered tool for generating SQL queries from natural language questions.

  • SQLChat - A chat-based SQL client that allows you to query databases using natural language conversations.

  • Records - SQL queries to databases via Python syntax.

  • Dataset - JSON-like interface for working with SQL databases.

  • SQLGlot - A no-dependency SQL parser, transpiler, and optimizer for Python.

  • TDengine - An open-source big data platform designed for time-series data, IoT, and industrial monitoring.

  • TimescaleDB - An open-source time-series SQL database optimized for fast ingest and complex queries.

  • DuckDB - In-process analytical database for fast SQL queries.

⬆ back to contents


📊 Data Visualization#

Resources#

Color theory, chart selection guides, and storytelling tips.

⬆ back to contents


Tools#

Libraries for static, interactive, and 3D visualizations.

  • Matplotlib - A comprehensive library for creating static, animated, and interactive visualizations in Python.

  • Seaborn - A statistical data visualization library based on Matplotlib.

  • Plotly - A library for creating interactive plots and dashboards.

  • Altair - A declarative statistical visualization library for Python.

  • Bokeh - A library for creating interactive visualizations for modern web browsers.

  • HoloViews - A tool for building complex visualizations easily.

  • Geopandas - An extension of Pandas for geospatial data.

  • Folium - A library for visualizing data on interactive maps.

  • Pygal - A Python SVG charting library.

  • Plotnine - A grammar of graphics for Python.

  • Bqplot - A plotting library for IPython/Jupyter notebooks.

  • PyPalettes - A large collection (2,500+) of color maps for Python.

  • Deck.gl - A WebGL-powered framework for visual exploratory data analysis of large datasets.

  • Python for Geo - Contextily: add background basemaps to your plots in GeoPandas.

  • OSMnx - A package to easily download, model, analyze, and visualize street networks from OpenStreetMap.

  • Apache ECharts - A powerful, interactive charting and visualization library for browser-based applications.

  • VisPy - A high-performance interactive 2D/3D data visualization library leveraging the power of OpenGL.

  • Glumpy - A Python library for scientific visualization that is fast, scalable and beautiful, based on OpenGL.

  • Pandas-bokeh - Bokeh plotting backend for Pandas.

⬆ back to contents


📈 Dashboards & BI#

Resources#

Tutorials for building and enhancing dashboards and visualizations using various tools and frameworks.

⬆ back to contents


Tools#

Frameworks for building custom dashboard solutions.

  • Dash - Framework for creating interactive web applications.

  • Streamlit - Simplified framework for building data applications.

  • Panel - Python library for creating custom interactive web apps and dashboards.

  • Gradio - Tool for creating and sharing machine learning applications.

  • OpenSearch Dashboards - A powerful data visualization and dashboarding tool for OpenSearch data, forked from Kibana.

  • GridStack.js - A library for building draggable, resizable responsive dashboard layouts.

  • Tremor - A React library to build dashboards fast with pre-built components for charts, KPIs, and more.

  • Appsmith - An open-source platform to build and deploy internal tools, admin panels, and CRUD apps quickly.

  • Grafanalib - A Python library for generating Grafana dashboards configuration as code.

  • H2O Wave - A Python framework for rapidly building and deploying realtime web apps and dashboards for AI and analytics.

  • Shiny for Python - Python version of the popular R Shiny framework.

  • Voilà - Turn Jupyter notebooks into standalone web applications.

  • Reflex - Full-stack Python framework for building web apps.

  • Taipy - Python library for building web applications and interactive dashboards.

  • Evidence - Business intelligence platform that uses SQL and Markdown for reports.

⬆ back to contents


Software#

A list of leading tools and platforms for data visualization and dashboard creation.

  • Tableau - Leading data visualization software.

  • Microsoft Power BI - Business analytics tool for visualizing data.

  • QlikView - Tool for data visualization and business intelligence.

  • Metabase - User-friendly open-source BI tool.

  • Apache Superset - Open-source data exploration and visualization platform.

  • Preset - A platform for modern business intelligence, providing a hosted version of Apache Superset.

  • Redash - Tool for visualizing and sharing data insights.

  • Grafana - Dashboarding and monitoring tool.

  • Datawrapper - User-friendly chart and map creation tool.

  • ChartBlocks - Online chart creation platform.

  • Infogram - Tool for creating infographics and visual content.

  • Looker Studio (formerly Google Data Studio) - Free tool for creating interactive dashboards and reports.

  • Rath - Next-generation automated data exploratory analysis and visualization platform.

  • Kibana - The official visualization and dashboarding tool for the Elastic Stack (Elasticsearch, Logstash, Beats).

⬆ back to contents


🕸️ Web Scraping & Crawling#

Resources#

A collection of valuable resources, tutorials, and libraries for web scraping with Python.

⬆ back to contents


Tools#

A list of libraries and tools for web scraping.

  • Requests - A simple, yet elegant, HTTP library for Python.

  • BeautifulSoup - A library for parsing HTML and XML documents.

  • Selenium - A tool for automating web applications for testing purposes.

  • Scrapy - An open-source and collaborative web crawling framework for Python.

  • Browser Use - A library for browser automation and web scraping.

  • Gerapy - Distributed Crawler Management Framework based on Scrapy, Scrapyd, Django, and Vue.js.

  • AutoScraper - A smart, automatic, fast, and lightweight web scraper for Python.

  • Feedparser - A library to parse feeds in Python.

  • Trafilatura - A Python & command-line tool to gather text and metadata on the web.

  • You-Get - A tiny command-line utility to download media contents (videos, audios, images) from the web.

  • MechanicalSoup - A Python library for automating interaction with websites.

  • ScrapeGraph AI - A Python scraper based on AI.

  • Snscrape - A social networking service scraper in Python.

  • Ferret - A web scraping system that lets you declaratively describe what data to extract using a simple query language.

  • Grab - A Python framework for building web scraping apps, providing a high-level API for asynchronous requests.

  • Playwright - Python version of the Playwright browser automation library.

  • PyQuery - A jQuery-like library for parsing HTML documents in Python.

  • Helium - High-level Selenium wrapper for easier web automation.

  • Scrapling - A framework for building web scrapers and crawlers.

  • Crawl4AI - Advanced web crawling framework designed for AI and data extraction tasks.

⬆ back to contents


🔢 Mathematics#

A collection of resources for learning mathematics, particularly in the context of data science and machine learning.

⬆ back to contents


🎲 Statistics & Probability#

Resources#

A selection of resources focused on statistics and probability, including tutorials and comprehensive guides.

⬆ back to contents


Tools#

A collection of tools focused on statistics and probability.

  • SciPy - Fundamental library for scientific computing and statistics.

  • Statsmodels - Statistical modeling, testing, and data exploration.

  • PyMC - A probabilistic programming library for Python that allows for flexible Bayesian modeling.

  • Pingouin - Statistical package with improved usability over SciPy.

  • scikit-posthocs - Post-hoc tests for statistical analysis of data.

  • Lifelines - Survival analysis and event history analysis in Python.

  • scikit-survival - Survival analysis built on scikit-learn for time-to-event prediction.

  • Bootstrap - Bootstrap confidence interval estimation methods.

  • PyStan - Python interface to Stan for Bayesian statistical modeling.

  • ArviZ - Exploratory analysis of Bayesian models with visual diagnostics.

  • PyGAM - A Python library for generalized additive models with built-in smoothing and regularization.

  • NumPyro - A probabilistic programming library built on JAX for high-performance Bayesian modeling.

  • Causal Impact - A Python implementation of the R package for causal inference using Bayesian structural time-series models.

  • DoWhy - A Python library for causal inference that supports explicit modeling and testing of causal assumptions.

  • Patsy - A Python library for describing statistical models and building design matrices.

  • Pomegranate - Fast and flexible probabilistic modeling library for Python with GPU support.

  • Pgmpy - Python library for probabilistic and causal inference using graphical models.

⬆ back to contents


🧪 A/B Testing#

A collection of resources focused on A/B testing.

⬆ back to contents


⏳ Time Series Analysis#

Resources#

A collection of resources for understanding time series fundamentals and analytical techniques.

⬆ back to contents


Tools#

A collection of tools for working with temporal data.

  • Facebook Prophet - A procedure for forecasting time series data based on an additive model.

  • Uber Orbit - A Python package for Bayesian time series forecasting and inference.

  • sktime - A unified Python framework for machine learning with time series, compatible with scikit-learn.

  • GluonTS - A Python toolkit for probabilistic time series modeling, with deep learning models based on PyTorch and MXNet.

  • Time-Series-Library - A library for deep learning-based time series analysis and forecasting.

  • TimesFM - A pretrained time series foundation model from Google Research for zero-shot forecasting.

  • PyTorch Forecasting - A PyTorch-based library for time series forecasting with neural networks.

  • Time-series-prediction - A collection of time series prediction methods and implementations.

  • PlotJuggler - A tool to visualize and analyze time series data logs in real-time.

  • TSFresh - Automatically extracting features from time series data.

  • pmdarima - Python library for ARIMA modeling and time series analysis.

  • Kats - Toolkit for analyzing time series data from Facebook Research.

⬆ back to contents


⚙️ Data Engineering#

Resources#

A collection of resources to help you build and manage robust data pipelines and infrastructure.

⬆ back to contents


Tools#

A collection of tools for building, deploying, and managing data pipelines and infrastructure.

  • dbt-core - A framework for transforming data in your warehouse using SQL and Jinja.

  • Apache Spark - A unified engine for large-scale data processing and analytics.

  • Apache Kafka - A distributed event streaming platform for building real-time data pipelines.

  • Dagster - A data orchestrator for machine learning, analytics, and ETL.

  • Apache Airflow - A platform to programmatically author, schedule, and monitor workflows.

  • Apache Hive - A data warehouse software for reading, writing, and managing large datasets in distributed storage using SQL.

  • Apache Hadoop - A framework that allows for the distributed processing of large data sets across clusters of computers.

  • Luigi - A Python module for building complex and batch-oriented data pipelines.

  • Apache Iceberg - A high-performance table format for huge analytic datasets.

  • Apache Cassandra - A highly scalable distributed NoSQL database designed for handling large amounts of data across many commodity servers.

  • Apache Flink - A framework for stateful computations over unbounded and bounded data streams (real-time stream processing).

  • Apache Beam - A unified model for defining both batch and streaming data-parallel processing pipelines.

  • Apache Pulsar - A cloud-native, distributed messaging and streaming platform.

  • Delta Lake - A storage layer that brings ACID transactions to Apache Spark and big data workloads.

  • Apache Hudi - An open data lakehouse platform, built on a high-performance open table format.

  • Trino - A distributed SQL query engine designed for fast analytic queries against large datasets.

  • DataHub - A metadata platform for the modern data stack.

  • OpenLineage - An open framework for collection and analysis of data lineage.

  • Kedro - A framework for creating reproducible, maintainable and modular data science code.

  • Apache Calcite - A dynamic data management framework that allows for SQL parsing, optimization, and federation.

  • Prefect - Workflow orchestration for building resilient data pipelines.

  • Apache Arrow - Universal columnar format and multi-language toolbox for fast data interchange.

  • Kestra - An open-source, event-driven orchestrator that simplifies data workflow management.

⬆ back to contents


📖 Natural Language Processing (NLP)#

Resources#

A selection of resources for learning and applying natural language processing in Python.

⬆ back to contents


Tools#

A collection of powerful libraries and frameworks for natural language processing.

  • Natural Language Toolkit (NLTK) - A leading platform for building Python programs to work with human language data.

  • TextBlob - A simple library for processing textual data.

  • SpaCy - An open-source software library for advanced NLP in Python.

  • BERT - A transformer-based model for NLP tasks.

  • Flair - A simple framework for state-of-the-art NLP.

  • OpenHands - A library and framework for building applications with large language models.

  • Stanford CoreNLP - A Java suite of core NLP tools providing fundamental linguistic analysis capabilities.

  • John Snow Labs Spark-NLP - A state-of-the-art Natural Language Processing library built on Apache Spark.

  • TextAttack - A Python framework for adversarial attacks, data augmentation, and model training in NLP.

  • Gensim - Topic modeling and natural language processing library for Python.

  • Stanza - Python NLP library for many human languages, from the Stanford NLP Group.

  • SentenceTransformers - Framework for state-of-the-art sentence and text embeddings.

  • LangExtract - Google's library for structured information extraction from text using language models.

  • Rasa - Open-source framework for building contextual AI assistants and chatbots.

⬆ back to contents


🤖 Machine Learning & AI#

Resources#

A collection of resources to help you learn and apply machine learning concepts and techniques.

⬆ back to contents


Tools#

A collection of tools for developing and deploying machine learning models.

Machine Learning#

  • Scikit-learn - Machine learning library for classical algorithms and model building.

  • XGBoost - Optimized distributed gradient boosting library for tree-based models.

  • LightGBM - Fast, distributed, high-performance gradient boosting framework.

  • CatBoost - High-performance gradient boosting on decision trees with categorical features support.

  • H2O-3 - Open-source distributed machine learning platform.

  • cuML - GPU-accelerated machine learning algorithms from RAPIDS.

  • dlib - Modern C++ toolkit containing machine learning algorithms and tools.

  • SHAP - Game theoretic approach to explain the output of any machine learning model.

  • InterpretML - Fit interpretable models and explain blackbox machine learning.

  • Optuna - Hyperparameter optimization framework.

Deep Learning#

  • TensorFlow - End-to-end open source platform for machine learning and deep learning.

  • PyTorch - Deep learning framework with strong support for research and production.

  • PyTorch Lightning - PyTorch wrapper for high-performance AI research.

  • PyTorch Ignite - High-level library to help with training and evaluating neural networks.

  • Keras - High-level neural networks API; Keras 3 runs on top of JAX, TensorFlow, or PyTorch.

  • Fast.ai - Deep learning library simplifying training fast and accurate neural nets.

  • HuggingFace Transformers - Model-definition framework for state-of-the-art machine learning models.

  • HuggingFace Diffusers - Library for state-of-the-art pretrained diffusion models.

  • PEFT - Library for efficiently adapting large pretrained models.

  • YOLOv5 - Real-time object detection system.

  • Ultralytics - YOLOv8 and other computer vision models.

  • ONNX - Open standard for machine learning interoperability.

  • PyTorch Geometric - Geometric deep learning extension library for PyTorch.

  • Pyro - Deep universal probabilistic programming with Python and PyTorch.

  • Skorch - Scikit-learn compatible neural network library.

  • Sonnet - DeepMind's library for building complex neural networks.

  • JAX - Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more.

  • TensorFlow Models - Official TensorFlow repository with models and examples.

⬆ back to contents


🚀 MLOps#

Resources#

Materials and curated lists for machine learning operations.

⬆ back to contents


Tools#

Platforms and utilities for deploying, monitoring, and maintaining ML systems.

  • ColossalAI - High-performance distributed training framework.

  • DVC - Version control system for machine learning projects.

  • Evidently - Tool for analyzing and monitoring data and model drift.

  • Deepchecks - Validation for ML models and data.

  • Sematic - Tool to build, debug, and execute ML pipelines with native Python.

  • netdata - Real-time performance monitoring.

  • meilisearch - Fast, open-source search engine.

  • vLLM - High-throughput and memory-efficient inference library for LLMs.

  • haystack - LLM framework for building search and question answering systems.

  • Kubeflow - Machine learning toolkit for Kubernetes.

  • Seldon Core - Open source platform for deploying and monitoring machine learning models in production.

  • Feast - A feature store for machine learning that manages and serves ML features to models.

  • BentoML - Framework for building, shipping, and scaling ML applications.

  • MLflow - Open-source platform for the complete machine learning lifecycle.

  • Wandb - Tool for experiment tracking, dataset versioning, and model management.

  • Comet ML - ML platform for tracking, comparing and optimizing experiments.

  • Netflix Metaflow - A human-friendly Python library for helping scientists and engineers build and manage real-life data science projects.

  • mindsdb - Platform for integrating AI into databases and applications.

  • KServe - Standardized serverless inference platform for deploying and serving machine learning models on Kubernetes.

  • SQLFlow - Brings machine learning capabilities to SQL, enabling model training and prediction using SQL syntax.

  • Jina AI Serve - Framework for building and deploying AI services that communicate via gRPC, HTTP and WebSockets.

  • LiteLLM - Unified interface to call all LLM APIs (OpenAI, Anthropic, Cohere, etc.) with consistent output formatting.

⬆ back to contents


🧠 AI Applications & Platforms#

Resources#

A collection of resources focused on AI applications and platforms.

  • Awesome LLM Apps - Collection of awesome LLM apps with AI agents and RAG using OpenAI, Anthropic, Gemini, and open-source models.

  • Awesome Generative AI - A curated list of modern Generative Artificial Intelligence projects and services.

  • AI Agents for Beginners - Microsoft's course on designing and building AI agents.

  • Generative AI for Beginners - Course on generative AI for beginners from Microsoft.

  • LLM Course - Practical course to master large language models from start to finish.

  • Awesome AI Agents - A curated list of AI autonomous agents, environments, and frameworks.

  • AI Collection - The Generative AI Landscape - A Collection of Awesome Generative AI Applications.

  • Awesome AI Apps - A collection of projects showcasing RAG, agents, workflows, and other AI use cases.

  • System Prompts and Models - System Prompts, Internal Tools & AI Models from various AI applications and coding tools.

  • RAG Techniques - Collection of advanced techniques for Retrieval-Augmented Generation.

  • Awesome LangChain - Awesome list of tools and projects with the awesome LangChain framework.

  • Awesome AI Tools - A curated list of Artificial Intelligence Top Tools.

  • Awesome LLM Security - A curation of awesome tools, documents and projects about LLM Security.

  • Claude Cookbooks - Official Anthropic examples and recipes for working with Claude AI.

  • Hands On Large Language Models - Covers LLM fundamentals, prompt engineering, and fine-tuning.

  • AI Engineering Hub - Resources for building, deploying, and maintaining AI systems.

  • Agents Towards Production - Code-first tutorials for building production-grade GenAI agents.

  • LLM Engineer Toolkit - Curated list of 120+ LLM libraries across various categories.

  • GenAI Agents - Repository of AI agent implementations and tutorials.

  • AI Notes - Personal notes and essays on AI and software development.

  • Open LLMs - Comprehensive list of open-source large language models and their capabilities.

  • Prompt Engineering Guide - Guides, papers, and resources for prompt engineering with LLMs.

  • Prompt Engineering - Collection of prompt engineering techniques and strategies.

⬆ back to contents


Tools#

A collection of frameworks, platforms, and end-user applications for building and deploying AI-powered solutions.

AI Agents & Automation#

  • n8n - Workflow automation platform for connecting APIs and services.

  • crewAI - Framework for orchestrating role-playing AI agents.

  • autogen - Framework for building multi-agent conversational systems.

  • AutoGPT - Autonomous AI agent that can complete complex tasks.

  • LangGraph - Framework for building stateful, multi-actor applications with LLMs, with cycles and control flow.

  • Agents.md - Open source framework for building agentic AI systems.

Development Frameworks & Tools#

  • LangChain - Framework for developing applications powered by language models.

  • LlamaIndex - Data framework for LLM-based applications with RAG capabilities.

  • openai-python - Official Python library for OpenAI API.

  • openai-agents-python - Official OpenAI framework for building AI agents.

  • ragflow - Open-source RAG (Retrieval-Augmented Generation) workflow platform.

  • firecrawl - Web crawling and data extraction service for AI applications.

  • Fabric - Framework for augmenting humans using AI.

  • Dyad - Open-source platform for building AI applications with custom API keys.

Code Generation & Assistance#

  • gpt-engineer - AI-powered code generation tool.

  • gpt-pilot - AI pair programmer that writes entire applications.

  • tabby - Self-hosted AI coding assistant.

Model Deployment & Platforms#

  • Ollama - Tool for running large language models locally.

  • OpenLLM - Open platform for operating large language models in production.

  • LocalAI - Self-hosted, local-first AI model deployment platform.

  • dify - Visual LLM application development platform.

  • LLaMA-Factory - Easy-to-use LLM fine-tuning framework.

End-User Applications#

  • open-webui - Web interface for interacting with various LLMs.

  • ComfyUI - Visual node-based interface for Stable Diffusion.

  • lobe-chat - Modern AI conversation interface.

  • LibreChat - Open-source ChatGPT alternative.

  • quivr - Personal second brain and AI assistant.

  • upscayl - AI-powered image upscaling tool.

  • facefusion - AI face swapping and enhancement tool.

  • DocsGPT - Documentation-based question answering system.

  • Whisper - Robust speech recognition model for transcription and translation.

  • Deep Research - AI-powered research assistant for iterative, deep research on any topic.

⬆ back to contents


☁️ Cloud Platforms & Infrastructure#

Resources#

A collection of resources for mastering cloud-native technologies, containerization, and infrastructure management.

⬆ back to contents


Tools#

Tools for containerization, orchestration, infrastructure as code, and cloud-native development.

Containerization & Orchestration#

  • Docker - Open platform for developing, shipping, and running applications in containers.

  • Docker Compose - A tool for defining and running multi-container Docker applications.

  • Kubernetes - Production-grade container orchestration system.

  • Kompose - Conversion tool from Docker Compose to Kubernetes.

Infrastructure as Code#

  • Terraform - Infrastructure as Code tool.

  • OpenTofu - Open source fork of Terraform.

  • Pulumi - Modern IaC platform using familiar programming languages.

  • CDK8s - Define Kubernetes apps using familiar languages.

CI/CD & GitOps#

  • Jenkins - Open source automation server.

  • Argo CD - Declarative GitOps continuous delivery.

  • Argo Workflows - Container-native workflow engine.

  • Tekton - Kubernetes-native CI/CD framework.

  • Spinnaker - Multi-cloud continuous delivery.

  • Dagger - Portable devkit for CI/CD pipelines.

Service Mesh & API Gateways#

  • Traefik - Modern HTTP reverse proxy and load balancer.

  • Kong - Cloud-native API Gateway.

  • Apache APISIX - Dynamic API gateway.

  • Envoy Gateway - Manages Envoy Proxy as gateway.

  • Higress - Cloud-native API gateway based on Istio.

  • Meshery - Service mesh management.

Kubernetes Ecosystem#

  • Helm - Package manager for Kubernetes.

  • Kustomize - Configuration customization for Kubernetes.

  • Kubernetes Dashboard - Web-based UI for Kubernetes.

  • Skaffold - Continuous development for Kubernetes.

  • Tilt - Local development for Kubernetes.

  • Flagger - Progressive delivery operator.

  • KubeVela - Application delivery platform.

  • KubeSphere - Kubernetes multi-cloud management.

Developer Platforms & Control Planes#

⬆ back to contents


⚡ Productivity#

Resources#

A collection of resources to enhance productivity.

  • Positron - A next-generation data science IDE.

  • Nanobrowser - An open-source AI web automation tool with a multi-agent system that runs directly in your browser.

  • Best of Jupyter - Ranked list of notable Jupyter Notebook, Hub, and Lab projects.

  • Deepnote - AI-native data science notebook platform compatible with Jupyter, featuring real-time collaboration, environment management, and integrations.

  • AFFiNE - All-in-one workspace for notes, docs, and data visualization.

  • Marimo - Reactive Python notebook for reproducible and interactive data science.

  • ChatGPT Data Science Prompts - A collection of useful prompts for data scientists using ChatGPT.

  • Gamma.app - AI-powered platform for creating and sharing presentations and documents.

  • Cookiecutter Data Science - A standardized project structure for data science projects.

  • Learn Regex - Comprehensive guide to learning regular expressions with examples and exercises.

  • Awesome Regex - Curated collection of regex tools, libraries, and learning resources.

  • The Markdown Guide - Comprehensive guide to learning Markdown.

  • Readme-AI - A tool to automatically generate README.md files for your projects.

  • Markdown Here - Extension for writing emails in Markdown and rendering them before sending.

  • MarkText - Simple and elegant markdown editor for documentation.

  • QuarkDown - Lightweight markdown processor for fast document rendering.

  • screenshot-to-code - AI tool that converts screenshots into code for various frontend stacks.

  • Codebeautify - All-in-one online code formatter and beautifier for Python, SQL, JSON, and more.

  • Notion - An all-in-one workspace for note-taking and task management.

  • Trello - A visual project management tool.

  • Habitica - A habit-building and productivity app that treats your life like a role-playing game.

  • Bujo - Tools to help transform the way you work and live.

  • Parabola - An AI-powered workflow builder for organizing data.

  • Asana - A project management platform for tracking work and projects.

  • Puter - An open-source, browser-based computing environment and cloud OS.

⬆ back to contents


Useful Linux Tools#

A selection of tools to enhance productivity and functionality in Linux environments.

  • tldr-pages - Simplified and community-driven man pages with practical examples.

  • Bat - Cat clone with syntax highlighting.

  • Exa - Modern replacement for ls.

  • Ripgrep - Faster grep alternative.

  • Zoxide - Smarter cd command.

  • Peek - Simple animated GIF screen recorder with an easy-to-use interface.

  • CopyQ - Clipboard manager with advanced features.

  • Translate Shell - Command-line translator using Google Translate, Bing Translator, Yandex.Translate, etc.

  • Espanso - Cross-platform Text Expander written in Rust.

  • Flameshot - Powerful yet simple to use screenshot software.

  • DrawIO Desktop - An open-source diagramming software for making flowcharts, process diagrams, and more.

  • Inkscape - A powerful, free, and open-source vector graphics editor for creating and editing visualizations.

  • Rclone - A command-line program to manage files on cloud storage.

  • Rsync - A fast and versatile file copying tool that can synchronize files and directories between two locations over a network or locally.

  • Timeshift - System restore tool for Linux that creates filesystem snapshots using rsync+hardlinks or BTRFS snapshots.

  • Backintime - A comfortable and well-configurable graphical frontend for incremental backups.

  • Fzf - A command-line fuzzy finder.

  • Osquery - SQL-powered operating system instrumentation, monitoring, and analytics.

  • GNU Parallel - A tool to run jobs in parallel.

  • HTop - An interactive process viewer.

  • Ncdu - A disk usage analyzer with an ncurses interface.

  • Thefuck - A command-line tool to correct your previous console command.

  • Miller - A tool for querying, processing, and formatting data in various file formats (CSV, JSON, etc.), like awk/sed/cut for data.

  • jq - Command-line JSON processor for parsing and manipulating JSON data.

  • yq - Portable command-line YAML processor (like jq for YAML and XML).

  • q - Run SQL directly on CSV or TSV files from the command line.

  • VisiData - Interactive multitool for tabular data exploration in the terminal.

  • csvkit - Suite of command-line tools for working with CSV data.

  • httpie - Modern command-line HTTP client for API testing and debugging.

  • glances - Cross-platform system monitoring tool for resource usage analysis.

  • hyperfine - Command-line benchmarking tool for performance testing.

  • termgraph - Draw basic graphs in the terminal for quick data visualization.

  • fd - Simple, fast and user-friendly alternative to ‘find’.

  • dust - More intuitive version of du written in Rust.

  • bottom - Cross-platform graphical process/system monitor.

  • Keychain - Tool for managing and securely storing passwords and secrets.

⬆ back to contents


Useful VS Code Extensions#

A collection of extensions to enhance functionality and productivity in Visual Studio Code.

⬆ back to contents


📚 Skill Development & Career#

Practice Resources#

A collection of resources to enhance skills and advance your career in data analysis and related fields.

⬆ back to contents


Curated Jupyter Notebooks#

A selection of curated Jupyter notebooks to support learning and exploration in data science and analysis.

⬆ back to contents


Data Sources & Datasets#

A collection of resources for accessing datasets and data sources for analysis and projects.

  • Kaggle Datasets - Extensive collection of datasets for practice in data analysis.

  • Opendatasets - A Python library for downloading datasets from Kaggle, Google Drive, and other online sources.

  • Datasette - An open source multi-tool for exploring and publishing data.

  • Awesome Public Datasets - Curated list of high-quality open datasets.

  • Open Data Sources - Collection of various open data sources.

  • Free Datasets for Projects - Dataquest’s compilation of free datasets.

  • Data World - Enterprise data catalog aimed at CIOs, governance professionals, data analysts, and engineers.

  • Awesome Public Real Time Datasets - A list of publicly available datasets with real-time data.

  • Google Dataset Search - A search engine for datasets from across the web.

  • NASA Open Data Portal - A site for NASA’s open data initiative, providing access to NASA’s data resources.

  • The World Bank Data - Free and open access to global development data by The World Bank.

  • Voice Datasets - A collection of audio and speech datasets for voice AI and machine learning.

  • HuggingFace Datasets - A lightweight library to easily share and access datasets for audio, computer vision, and NLP.

  • TensorFlow Datasets - A collection of ready-to-use datasets for use with TensorFlow and other Python ML frameworks.

  • NLP Datasets - A curated list of datasets for natural language processing (NLP) tasks.

  • TorchVision Datasets - The torchvision.datasets module provides many built-in computer vision datasets.

  • LLM Datasets - A collection of datasets and resources for training and fine-tuning Large Language Models (LLMs).

  • Unsplash Datasets - A collection of datasets from Unsplash, useful for computer vision and research.

  • Awesome JSON Datasets - A curated list of awesome JSON datasets that are publicly available without authentication.

⬆ back to contents


Resume and Interview Tips#

A variety of resources to help you prepare for interviews and enhance your resume.

⬆ back to contents


📋 Cheatsheets#

A collection of cheatsheets across various domains to aid in quick reference and learning.

GoalKicker Programming Notes#

⬆ back to contents


Python#

⬆ back to contents


Data Science & Machine Learning#

⬆ back to contents


Linux & Git#

⬆ back to contents


Probability & Statistics#

⬆ back to contents


SQL & Databases#

⬆ back to contents


Miscellaneous#

⬆ back to contents


📦 Additional Python Libraries#

A collection of supplementary Python libraries that enhance development workflow, automate processes, and maintain project quality beyond core data analysis tools.

Code Quality & Development#

  • Black - Uncompromising Python code formatter.

  • Pre-commit - Framework for managing pre-commit hooks.

  • Pylint - Python code static analysis.

  • Mypy - Optional static typing for Python.

  • Rich - Rich text and beautiful formatting in the terminal.

  • Icecream - Debugging without using print.

  • Pandas-log - Logs pandas operations for data transformation tracking.

  • PandasVet - Code style validator for Pandas.

  • Pydeps - Python module dependency graphs.

  • PyForest - Automated Python imports for data science.

⬆ back to contents


Documentation & File Processing#

  • Sphinx - Documentation generator.

  • Pdoc - API documentation for Python projects.

  • Mkdocs - Project documentation with Markdown.

  • OpenPyXL - Read/write Excel files.

  • Tablib - Exports data to XLSX, JSON, CSV.

  • PyPDF2 - Reads and writes PDF files.

  • Python-docx - Reads and writes Word documents.

  • CleverCSV - Smart CSV reader for messy data.

  • Python-markdownify - Convert HTML to Markdown.

  • Xlwings - Integration of Python with Excel.

  • Xmltodict - Converts XML to Python dictionaries.

  • MarkItDown - Python tool for converting files and office documents to Markdown.

  • Jupyter-book - Build publication-quality books from Jupyter notebooks.

  • WeasyPrint - Convert HTML to PDF.

  • PyMuPDF - Advanced PDF manipulation library.

  • Camelot - PDF table extraction library.

⬆ back to contents


Web & APIs#

  • HTTPX - Next-generation HTTP client for Python.

  • FastAPI - Modern web framework for building APIs.

  • Flask - Lightweight Python web framework for building applications and APIs.

  • Typer - Library for building CLI applications.

  • Requests-cache - Persistent caching for requests library.

⬆ back to contents


Miscellaneous#

  • UV - An extremely fast Python package installer and resolver.

  • Funcy - Fancy functional tools for Python.

  • Pillow - Image processing library.

  • Ftfy - Fixes broken Unicode strings.

  • JmesPath - Queries JSON data (SQL-like for JSON).

  • Glom - Transforms nested data structures.

  • Diagrams - Diagrams as code for cloud architecture.

  • Pytest - Framework for writing small tests.

  • Pampy - Pattern matching for Python dictionaries.

  • Pygorithm - A Python module for learning all major algorithms.

  • GitPython - A Python library used to interact with Git repositories.

  • TQDM - Progress bars for loops and operations.

  • Loguru - Python logging made simple.

  • Click - Beautiful command line interfaces.

  • Poetry - Python dependency management and packaging.

  • Hydra - Elegant configuration management.

⬆ back to contents


More Awesome Lists#

A curated list of other awesome lists on various topics and technologies.

⬆ back to contents


Additional Resources and Tools#

A wide range of resources and tools designed to facilitate learning, development, and exploration across different domains.

  • OSSU Computer Science - Path to a free, self-taught education in computer science.

  • UC Berkeley - Data 8 - Course materials for the Data Science Foundations course.

  • PaddleOCR - Production-ready OCR toolkit with multilingual and document AI support.

  • A collective list of free APIs - A comprehensive list of free APIs for various purposes.

  • arXiv.org - A free distribution service and open-access archive for scholarly articles.

  • Elicit - An AI research assistant that helps automate parts of literature review.

  • 500+ AI/ML/DL/NLP Projects - A massive collection of AI and machine learning projects with code for learning and portfolios.

  • Full Stack Fastapi Template - Full-stack template with FastAPI, React, and PostgreSQL.

  • Kittl - Platform for creating and editing charts and data visualizations.

  • Zasper - High-performance IDE for Jupyter Notebooks.

  • Sketch - Toolkit designed for designers, focusing on their workflow.

  • Growth.Design - A collection of product case studies and behavioral psychology insights for data-driven decision-making.

  • Markdown Badges - Collection of badges for GitHub profiles and Markdown files.

⬆ back to contents


🤝 Contributing#

We welcome your contributions!

See CONTRIBUTING.md for how to add resources.

⬆ back to contents


📜 License#

CC0

This work is dedicated to the public domain under the CC0 1.0 Universal license.

⬆ back to contents