unstructured data free download

Showing 54 open source projects for "unstructured data"

1

Unstructured.IO

Open source libraries and APIs to build custom preprocessing pipelines

The unstructured library provides open-source components for ingesting and pre-processing images and text documents, such as PDFs, HTML, Word docs, and many more. Its use cases revolve around streamlining and optimizing the data processing workflow for LLMs. The library's modular bricks and connectors form a cohesive system that simplifies data ingestion and pre-processing, making it adaptable to different platforms and efficient at transforming unstructured data into structured outputs.

Downloads: 1 This Week

Last Update: 7 days ago
See Project
2

Instill Core

Instill Core is a full-stack AI infrastructure tool for data

Instill Core is an open-source, full-stack AI infrastructure platform designed to orchestrate data pipelines, machine learning models, and unstructured data processing into a unified, production-ready system. It provides an end-to-end solution that enables developers to build, deploy, and manage AI-powered applications without needing to manually stitch together multiple tools across the data and model lifecycle. The platform focuses heavily on handling unstructured data such as documents, images, audio, and video, transforming them into AI-ready formats through integrated ETL pipelines and processing workflows. ...

Downloads: 3 This Week

Last Update: 2026-03-19
See Project
3

LOTUS

AI-Powered Data Processing: Use LOTUS to process all of your datasets

LOTUS is an open-source framework and query engine designed to enable efficient processing of structured and unstructured datasets using large language models. The system provides a declarative programming model that allows developers to express complex AI data operations using high-level commands rather than manually orchestrating model calls. It offers a Python interface with a Pandas-like API, making it familiar for data scientists and engineers already working with data analysis libraries. ...

Downloads: 2 This Week

Last Update: 2026-03-06
See Project
4

Superlinked

Superlinked is a Python framework for AI Engineers

Superlinked is a Python framework designed for AI engineers to build high-performance search and recommendation applications that combine structured and unstructured data.

Downloads: 0 This Week

Last Update: 2025-10-22
See Project
5

DataProfiler

Extract schema, statistics and entities from datasets

DataProfiler is an AI-powered tool for automatic data analysis and profiling, designed to detect patterns, anomalies, and schema inconsistencies in structured and unstructured datasets. It is a Python library that aims to make data analysis, monitoring, and sensitive data detection easy. With a single command, the library automatically formats and loads files into a DataFrame.

Downloads: 0 This Week

Last Update: 2025-07-30
See Project
6

LlamaParse

Parse files for optimal RAG

LlamaParse is a GenAI-native document parser that can parse complex document data for any downstream LLM use case (RAG, agents). Load in 160+ data sources and data formats, from unstructured and semi-structured to structured data (APIs, PDFs, documents, SQL, etc.). Store and index your data for different use cases. Integrate with 40+ vector stores, document stores, graph stores, and SQL database providers.

Downloads: 1 This Week

Last Update: 2026-02-13
See Project
7

fireworks-tech-graph

Claude Code skill for generating production-quality SVG+PNG technical

fireworks-tech-graph is an AI-driven project focused on building structured knowledge graphs that map relationships between technologies, concepts, and entities within technical domains. It aims to transform unstructured information into interconnected graphs that can be queried and analyzed for insights, making it easier to understand complex ecosystems such as software stacks or research fields. The system likely leverages AI techniques for entity extraction, relationship mapping, and...

Downloads: 29 This Week

Last Update: 21 hours ago
See Project
8

LiteParse

A fast, helpful, and open-source document parser

LiteParse is an open-source lightweight parsing library designed to extract structured data from unstructured text using large language models in an efficient and cost-effective manner. It focuses on simplifying the process of turning raw text into structured outputs such as JSON by providing a streamlined interface for prompt-based parsing. The system is designed to minimize overhead, making it suitable for applications where performance and cost are critical considerations. ...

Downloads: 7 This Week

Last Update: 6 hours ago
See Project
9

DeepAnalyze

Autonomous LLM agent for end-to-end data science workflows

DeepAnalyze is an open source project that introduces an agentic large language model designed to perform autonomous data science tasks from start to finish. It is built to handle the entire data science pipeline, including data preparation, analysis, modeling, visualization, and report generation without requiring continuous human guidance. DeepAnalyze is capable of conducting open-ended data research across multiple data formats such as structured tables, semi-structured files, and unstructured text, enabling flexible and comprehensive analysis workflows. ...

Downloads: 2 This Week

Last Update: 5 days ago
See Project
10

Milvus

Vector database for scalable similarity search and AI applications

Milvus is an open-source vector database built to power embedding similarity search and AI applications. Milvus makes unstructured data search more accessible and provides a consistent user experience regardless of the deployment environment. Milvus 2.0 is a cloud-native vector database with storage and computation separated by design. All components in this refactored version of Milvus are stateless to enhance elasticity and flexibility. Average latency is measured in milliseconds on trillion-vector datasets. ...

Downloads: 2 This Week

Last Update: 4 days ago
See Project
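
As a rough illustration of what "embedding similarity search" means in the Milvus entry above, here is a minimal sketch in plain Python. It does not use Milvus or its API; it is a brute-force cosine-similarity ranking over a handful of made-up vectors, and the names cosine_similarity, top_k, and toy_vectors are hypothetical. A vector database exists precisely to avoid this kind of linear scan on large collections.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=2):
    """Return the keys of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query, vec), key) for key, vec in vectors.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [key for _, key in scored[:k]]


# Made-up 3-dimensional "embeddings", for illustration only.
toy_vectors = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}

nearest = top_k([1.0, 0.0, 0.0], toy_vectors, k=2)
```

Here nearest holds the two keys whose vectors point most nearly in the same direction as the query.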
11

LlamaIndex

Central interface to connect your LLMs with external data

LlamaIndex (GPT Index) is a project that provides a central interface to connect your LLMs with external data. LlamaIndex is a simple, flexible interface between your external data and LLMs. It provides the following tools in an easy-to-use fashion: indices over your unstructured and structured data for use with LLMs, which help to abstract away common boilerplate and pain points for in-context learning; and ways of dealing with prompt limitations (e.g. 4096 tokens for Davinci) when the context is too big. ...

Downloads: 0 This Week

Last Update: 2026-04-03
See Project
12

GraphRAG

A modular graph-based Retrieval-Augmented Generation (RAG) system

The GraphRAG project is a data pipeline and transformation suite that is designed to extract meaningful, structured data from unstructured text using the power of LLMs.

Downloads: 4 This Week

Last Update: 2026-04-13
See Project
13

DocETL

A system for agentic LLM-powered data processing and ETL

DocETL is an open-source system designed to build and execute data processing pipelines powered by large language models, particularly for analyzing complex collections of documents and unstructured datasets. The platform allows developers and researchers to construct structured workflows that extract, transform, and organize information from sources such as reports, transcripts, legal documents, and other text-heavy data.

Downloads: 2 This Week

Last Update: 2026-03-05
See Project
14

OpenViking

Context database designed specifically for AI Agents

OpenViking is an open-source context database engineered for efficient indexing and retrieval of large amounts of unstructured or semi-structured context data used by AI applications. It’s primarily designed to serve as a high-performance, scalable backend for storing app context, embeddings, conversational histories, and other textual artifacts that need rapid lookup and semantic search, which makes it especially useful for systems like chatbots or memory-augmented agents. ...

Downloads: 1 This Week

Last Update: 3 days ago
See Project
15

DataChain

AI-data warehouse to enrich, transform and analyze unstructured data

DataChain enables multimodal API calls and local AI inferences to run in parallel over many samples as chained operations. The resulting datasets can be saved, versioned, and sent directly to PyTorch and TensorFlow for training. DataChain can persist features of Python objects returned by AI models, and enables vectorized analytical operations over them. The typical use cases are data curation, LLM analytics and validation, image segmentation, pose detection, and GenAI alignment. DataChain...

Downloads: 4 This Week

Last Update: 8 hours ago
See Project
16

Quivr

Your Second Brain supercharged by Generative AI

Quivr, your second brain, uses the power of generative AI to store and retrieve unstructured information. Think of it as Obsidian, but turbocharged with AI capabilities.

Downloads: 3 This Week

Last Update: 2025-02-04
See Project
17

Extractous

Fast and efficient unstructured data extraction

Extractous is a Rust-based unstructured data extraction library focused on fast local parsing of documents and other content-heavy files. Its purpose is to extract text and metadata efficiently from formats such as PDF, Word, HTML, email archives, images, and more, without depending on external APIs or separate parsing servers. The project emphasizes performance and low memory usage, and its maintainers describe it as a local-first alternative to heavier extraction stacks. ...

Downloads: 0 This Week

Last Update: 2026-03-06
See Project
18

RAGFlow

RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine

RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding. It offers a streamlined RAG workflow for businesses of any scale, combining it with LLMs (large language models) to provide truthful question-answering capabilities, backed by well-founded citations from data in various complex formats.

Downloads: 3 This Week

Last Update: 2026-02-10
See Project
19

Search-Index

A persistent, network resilient, full text search library

Search-Index is a lightweight and fast JavaScript-based search engine that enables full-text search indexing and retrieval for web applications.

Downloads: 5 This Week

Last Update: 2025-03-12
See Project
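
To make "full-text search indexing and retrieval" in the Search-Index entry above concrete, here is a minimal sketch of an inverted index in plain Python. It is not Search-Index's API (that library is JavaScript) and says nothing about how it is implemented; tokenize, build_index, search, and the sample documents are all hypothetical names and data chosen for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())  # keep only docs matching all tokens
    return result


# Made-up documents, for illustration only.
docs = {
    "a": "Full text search for web applications",
    "b": "Vector search over embeddings",
    "c": "Document parsing library",
}
index = build_index(docs)
hits = search(index, "vector search")
```

The index is built once and each query then costs a few set intersections instead of a scan over every document. A real engine adds ranking, persistence, and incremental updates on top of this idea.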
20

MyScaleDB

A ClickHouse fork that supports high-performance vector search

...The system is built on top of the ClickHouse database engine and extends it with specialized indexing and search capabilities optimized for vector embeddings. This design allows developers to store structured data, unstructured text, and high-dimensional vector embeddings within a single database platform. MyScaleDB enables developers to perform vector similarity searches using standard SQL syntax, eliminating the need to learn specialized vector database query languages. The database is optimized for high performance and scalability, allowing it to handle extremely large datasets and high query loads typical of production AI applications.

Downloads: 0 This Week

Last Update: 2026-03-10
See Project
21

Eidos

An extensible framework for Personal Data Management

Eidos is an extensible personal data management platform designed to help users organize and interact with their information using a local-first architecture. The system transforms SQLite into a flexible personal database that can store structured and unstructured information such as notes, documents, datasets, and knowledge resources. Its interface is inspired by tools like Notion, allowing users to create documents, databases, and custom views to organize personal information. ...

Downloads: 8 This Week

Last Update: 2026-04-02
See Project
22

Diffgram

Training data (data labeling, annotation, workflow) for all data types

...Training Data is the art of supervising machines through data. This includes the activities of annotation, which produces structured data, ready to be consumed by a machine learning model. Annotation is required because raw media is considered to be unstructured and not usable without it. That’s why training data is required for many modern machine learning use cases, including computer vision, natural language processing, and speech recognition.

Downloads: 2 This Week

Last Update: 2024-10-14
See Project
23

Airweave

Airweave lets agents search any app

Airweave is an open-source platform that enables agents to semantically search across various applications, databases, and APIs. By transforming disparate data sources into a unified, searchable knowledge base, Airweave facilitates intelligent information retrieval through REST APIs or the MCP protocol. It's particularly useful for building AI agents that require access to structured and unstructured data across multiple platforms.

Downloads: 1 This Week

Last Update: 12 hours ago
See Project
24

cognee

Deterministic LLM Outputs for AI Applications and AI Agents

...Any kind of data works: unstructured text or raw media files, PDFs, tables, presentations, JSON files, and many more. Add small or large files, or many files at once. We map out a knowledge graph from all the facts and relationships we extract from your data. Then, we establish graph topology and connect related knowledge clusters, enabling the LLM to "understand" the data.

Downloads: 1 This Week

Last Update: 2 days ago
See Project
25

refinery

Open-source choice to scale, assess and maintain natural language data

The data scientist's open-source choice to scale, assess and maintain natural language data. Treat training data like a software artifact. You are one of the people we've built refinery for. refinery helps you to build better NLP models in a data-centric approach. Semi-automate your labeling, find low-quality subsets in your training data, and monitor your data in one place. refinery doesn't get rid of manual labeling, but it makes sure that your valuable time is spent well. Also, the makers...

Downloads: 0 This Week

Last Update: 2024-06-13
See Project