Data processing for and with foundation models
SDG is a specialized framework
Git-based data version control for machine learning workflows
An end-to-end Data Scientist
Collection of useful data science topics along with articles
Data science interview questions and answers
Self-learning data agent that grounds its answers in layers of content
Synthetic Data Generation for tabular, relational and time series data
A Collection of Cheatsheets, Books, Questions, and Portfolio
Cloud-native open source data warehouse for analytics and AI queries
Conditional GAN for generating synthetic tabular data
Label Studio is a multi-type data labeling and annotation tool
Training data (data labeling, annotation, workflow) for all data types
A Simple and Universal Swarm Intelligence Engine
Deep Research framework, combining language models with tools
Machine learning in Python
LLM-based data scientist, AI-native data application
AI coding assistant skill (Claude Code, Codex, OpenCode, OpenClaw)
OCRmyPDF adds an OCR text layer to scanned PDF files
Benchmarking synthetic data generation methods
Video-based AI memory library. Store millions of text chunks in MP4
The standard data-centric AI package for data quality and ML
ExtractThinker is a Document Intelligence library for LLMs
Data science on data without acquiring a copy
The open-source tool for building high-quality datasets