unstructured data free download

Showing 91 open source projects for "unstructured data"

1

Unstructured.IO

Open source libraries and APIs to build custom preprocessing pipelines

The unstructured library provides open-source components for ingesting and pre-processing images and text documents, such as PDFs, HTML, Word docs, and many more. The use cases of unstructured revolve around streamlining and optimizing the data processing workflow for LLMs. Its modular bricks and connectors form a cohesive system that simplifies data ingestion and pre-processing, adapts to different platforms, and efficiently transforms unstructured data into structured outputs.

Downloads: 1 This Week

Last Update: 1 day ago
See Project
2

LOTUS

AI-Powered Data Processing: Use LOTUS to process all of your datasets

LOTUS is an open-source framework and query engine designed to enable efficient processing of structured and unstructured datasets using large language models. The system provides a declarative programming model that allows developers to express complex AI data operations using high-level commands rather than manually orchestrating model calls. It offers a Python interface with a Pandas-like API, making it familiar for data scientists and engineers already working with data analysis libraries. ...

Downloads: 2 This Week

Last Update: 2026-03-06
See Project
3

Instill Core

Instill Core is a full-stack AI infrastructure tool for data

Instill Core is an open-source, full-stack AI infrastructure platform designed to orchestrate data pipelines, machine learning models, and unstructured data processing into a unified, production-ready system. It provides an end-to-end solution that enables developers to build, deploy, and manage AI-powered applications without needing to manually stitch together multiple tools across the data and model lifecycle. The platform focuses heavily on handling unstructured data such as documents, images, audio, and video, transforming them into AI-ready formats through integrated ETL pipelines and processing workflows. ...

Downloads: 0 This Week

Last Update: 2026-03-19
See Project
4

Superlinked

Superlinked is a Python framework for AI Engineers

Superlinked is a Python framework designed for AI engineers to build high-performance search and recommendation applications that combine structured and unstructured data.

Downloads: 0 This Week

Last Update: 2025-10-22
See Project
5

LlamaParse

Parse files for optimal RAG

LlamaParse is a GenAI-native document parser that can parse complex document data for any downstream LLM use case (RAG, agents). Load in 160+ data sources and data formats, from unstructured and semi-structured to structured data (APIs, PDFs, documents, SQL, etc.). Store and index your data for different use cases. Integrate with 40+ vector stores, document stores, graph stores, and SQL database providers.

Downloads: 2 This Week

Last Update: 2026-02-13
See Project
6

fireworks-tech-graph

Claude Code skill for generating production-quality SVG+PNG technical

fireworks-tech-graph is an AI-driven project focused on building structured knowledge graphs that map relationships between technologies, concepts, and entities within technical domains. It aims to transform unstructured information into interconnected graphs that can be queried and analyzed for insights, making it easier to understand complex ecosystems such as software stacks or research fields. The system likely leverages AI techniques for entity extraction, relationship mapping, and...

Downloads: 33 This Week

Last Update: 1 day ago
See Project
7

DataProfiler

Extract schema, statistics and entities from datasets

DataProfiler is an AI-powered tool for automatic data analysis and profiling, designed to detect patterns, anomalies, and schema inconsistencies in structured and unstructured datasets. The DataProfiler is a Python library designed to make data analysis, monitoring, and sensitive data detection easy. With a single command, the library automatically formats and loads files into a DataFrame.

Downloads: 0 This Week

Last Update: 2025-07-30
See Project
8

LiteParse

A fast, helpful, and open-source document parser

LiteParse is an open-source lightweight parsing library designed to extract structured data from unstructured text using large language models in an efficient and cost-effective manner. It focuses on simplifying the process of turning raw text into structured outputs such as JSON by providing a streamlined interface for prompt-based parsing. The system is designed to minimize overhead, making it suitable for applications where performance and cost are critical considerations. ...

Downloads: 5 This Week

Last Update: 21 hours ago
See Project
9

DeepAnalyze

Autonomous LLM agent for end-to-end data science workflows

DeepAnalyze is an open source project that introduces an agentic large language model designed to perform autonomous data science tasks from start to finish. It is built to handle the entire data science pipeline, including data preparation, analysis, modeling, visualization, and report generation without requiring continuous human guidance. DeepAnalyze is capable of conducting open-ended data research across multiple data formats such as structured tables, semi-structured files, and unstructured text, enabling flexible and comprehensive analysis workflows. ...

Downloads: 2 This Week

Last Update: 5 days ago
See Project
10

Milvus

Vector database for scalable similarity search and AI applications

Milvus is an open-source vector database built to power embedding similarity search and AI applications. Milvus makes unstructured data search more accessible and provides a consistent user experience regardless of the deployment environment. Milvus 2.0 is a cloud-native vector database with storage and computation separated by design. All components in this refactored version of Milvus are stateless to enhance elasticity and flexibility. Average latency is measured in milliseconds on trillion-vector datasets. ...

Downloads: 1 This Week

Last Update: 5 days ago
See Project
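As an illustration of the "embedding similarity search" that the Milvus entry mentions, here is a minimal brute-force sketch in plain Python. It does not use the Milvus API: the `cosine_similarity` and `top_k` helpers and the `EMBEDDINGS` data are hypothetical, and a vector database would typically rely on approximate indexes rather than scanning every vector.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, vectors, k=2):
    """Return the keys of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query, vec), key) for key, vec in vectors.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [key for _, key in scored[:k]]

# Hypothetical embeddings; real ones would come from an embedding model.
EMBEDDINGS = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}

# The query points in the same direction as doc_a; its length does not matter.
result = top_k([2.0, 0.0, 0.0], EMBEDDINGS, k=2)
print(result)
```

Cosine similarity ignores vector length, which is why the scaled query still ranks `doc_a` first.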
11

LlamaIndex

Central interface to connect your LLMs with external data

LlamaIndex (GPT Index) is a project that provides a central interface to connect your LLMs with external data. It is a simple, flexible interface between your external data and LLMs, and it provides its tools in an easy-to-use fashion. It offers indices over your unstructured and structured data for use with LLMs. These indices help to abstract away common boilerplate and pain points for in-context learning, such as dealing with prompt limitations (e.g. 4096 tokens for Davinci) when the context is too big. ...

Downloads: 0 This Week

Last Update: 18 hours ago
See Project
12

MeshLab

The open source mesh processing system

MeshLab is an open-source, portable, and extensible system for processing and editing large unstructured 3D triangular meshes. It aims to help process the typical not-so-small unstructured models arising in 3D scanning, providing a set of tools for editing, cleaning, healing, inspecting, rendering, and converting these kinds of meshes. MeshLab is mostly based on the open source C++ mesh processing library VCGlib, developed at the Visual Computing Lab of ISTI - CNR. ...

Downloads: 42 This Week

Last Update: 2025-07-22
See Project
13

GraphRAG

A modular graph-based Retrieval-Augmented Generation (RAG) system

The GraphRAG project is a data pipeline and transformation suite that is designed to extract meaningful, structured data from unstructured text using the power of LLMs.

Downloads: 3 This Week

Last Update: 2026-04-13
See Project
14

OpenViking

Context database designed specifically for AI Agents

OpenViking is an open-source context database engineered for efficient indexing and retrieval of large amounts of unstructured or semi-structured context data used by AI applications. It’s primarily designed to serve as a high-performance, scalable backend for storing app context, embeddings, conversational histories, and other textual artifacts that need rapid lookup and semantic search, which makes it especially useful for systems like chatbots or memory-augmented agents. ...

Downloads: 1 This Week

Last Update: 3 days ago
See Project
15

CrateDB

CrateDB is a distributed and scalable SQL database

CrateDB is a distributed SQL database designed for massive machine data and real-time analytics. It combines the scalability and performance of NoSQL with the power and simplicity of SQL, allowing for horizontal scaling, full-text search, and complex queries over large datasets. Built in Java and powered by Elasticsearch and Lucene, CrateDB is optimized for high-velocity data ingestion and dynamic queries.

Downloads: 2 This Week

Last Update: 4 days ago
See Project
16

TextFSM

Python module for parsing semi-structured text into Python tables

TextFSM is a Python library created by Google that provides a template-based state machine engine for parsing semi-structured text. It is particularly useful for extracting structured data from command-line interface (CLI) outputs, such as those from network devices, routers, and switches. By defining parsing logic through reusable template files, TextFSM transforms unstructured text into structured data like lists or tables without requiring complex regular expression code. Each template defines states, transitions, and regex patterns that determine how to interpret text line by line, enabling precise extraction of key information from varied sources. ...

Downloads: 0 This Week

Last Update: 2025-10-11
See Project
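The TextFSM entry above describes states, transitions, and regex patterns applied line by line. The sketch below illustrates that general idea using only the standard library. It is not the TextFSM API or its template syntax: the `RULES` table, the `parse` function, and the sample CLI output are all hypothetical.

```python
import re

# Hypothetical rule table, loosely modeled on the idea of a template:
# each state maps to a list of (regex, emit_record, next_state) rules.
RULES = {
    "Start": [
        (re.compile(r"^Interface\s+Status"), False, "Table"),
    ],
    "Table": [
        (re.compile(r"^(?P<name>\S+)\s+(?P<status>up|down)\s*$"), True, "Table"),
    ],
}

def parse(text, rules=RULES, start="Start"):
    """Walk the text line by line; the first matching rule in the current state wins."""
    state = start
    rows = []
    for line in text.splitlines():
        for pattern, emit, next_state in rules[state]:
            match = pattern.match(line)
            if match is None:
                continue
            if emit:
                rows.append(match.groupdict())  # named groups become columns
            state = next_state
            break  # lines that match no rule are skipped
    return rows

# Made-up CLI output of the kind a network device might print.
SAMPLE = """\
Interface  Status
eth0       up
eth1       down
"""

rows = parse(SAMPLE)
print(rows)
```

Keeping the rules in a data table rather than in code is what lets the same parsing loop be reused across different output formats.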
17

DataChain

AI-data warehouse to enrich, transform and analyze unstructured data

Datachain enables multimodal API calls and local AI inferences to run in parallel over many samples as chained operations. The resulting datasets can be saved, versioned, and sent directly to PyTorch and TensorFlow for training. Datachain can persist features of Python objects returned by AI models, and enables vectorized analytical operations over them. The typical use cases are data curation, LLM analytics and validation, image segmentation, pose detection, and GenAI alignment. Datachain...

Downloads: 5 This Week

Last Update: 1 day ago
See Project
18

DocETL

A system for agentic LLM-powered data processing and ETL

DocETL is an open-source system designed to build and execute data processing pipelines powered by large language models, particularly for analyzing complex collections of documents and unstructured datasets. The platform allows developers and researchers to construct structured workflows that extract, transform, and organize information from sources such as reports, transcripts, legal documents, and other text-heavy data.

Downloads: 0 This Week

Last Update: 2026-03-05
See Project
19

Parsera

Lightweight library for scraping websites with LLMs

Scrape data from any website with only a link and column descriptions. Parsera is a tool designed to scrape web content, specifically handling poorly structured or messy websites.

Downloads: 2 This Week

Last Update: 2025-10-08
See Project
20

Fluentd

Fluentd: Unified Logging Layer (project under CNCF)

Fluentd is a CNCF‑graduated open-source data collector that unifies log data collection and consumption across diverse systems. It supports robust reliability, buffering, extensible plugin architecture, and real-time log routing. Fluentd serves as a unified logging layer for structured/unstructured data processing.

Downloads: 1 This Week

Last Update: 2026-02-13
See Project
21

Quivr

Your Second Brain supercharged by Generative AI

Quivr, your second brain, utilizes the power of generative AI to store and retrieve unstructured information. Think of it as Obsidian, but turbocharged with AI capabilities.

Downloads: 4 This Week

Last Update: 2025-02-04
See Project
22

RAGFlow

RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine

RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding. It offers a streamlined RAG workflow for businesses of any scale, combining it with LLMs (large language models) to provide truthful question-answering capabilities, backed by well-founded citations from data in various complex formats.

Downloads: 3 This Week

Last Update: 9 hours ago
See Project
23

Extractous

Fast and efficient unstructured data extraction

Extractous is a Rust-based unstructured data extraction library focused on fast local parsing of documents and other content-heavy files. Its purpose is to extract text and metadata efficiently from formats such as PDF, Word, HTML, email archives, images, and more, without depending on external APIs or separate parsing servers. The project emphasizes performance and low memory usage, and its maintainers describe it as a local-first alternative to heavier extraction stacks. ...

Downloads: 0 This Week

Last Update: 2026-03-06
See Project
24

Thulite

Web framework designed for speed, security, and SEO

Thulite is an AI-powered search and recommendation engine that enhances search functionality in applications. It provides intelligent query processing, result ranking, and personalized recommendations.

Downloads: 0 This Week

Last Update: 2026-02-24
See Project
25

Pimcore

Open Source Data & Experience Management Platform

Whether you're dealing with unstructured web documents or structured data for MDM/PIM, you define the UI design (web documents with a template, structured data with an intuitive graphical editor), and Pimcore knows how to persist the data efficiently, optimized for fast access. Thanks to its framework approach, Pimcore is very flexible and adapts to your needs.

Downloads: 2 This Week

Last Update: 2026-04-08
See Project