Showing 192 open source projects for "data processing"

View related business solutions
  • Premier Construction Software Icon
    Premier Construction Software

    Premier is a global leader in financial construction ERP software.

    Rated #1 Construction Accounting Software by Forbes Advisor in 2022 & 2023. Our modern SAAS solution is designed to meet the needs of General Contractors, Developers/Owners, Homebuilders & Specialty Contractors.
    Learn More
  • Collect! is a highly configurable debt collection software Icon
    Collect! is a highly configurable debt collection software

    Everything that matters to debt collection, all in one solution.

    The flexible & scalable debt collection software built to automate your workflow. From startup to enterprise, we have the solution for you.
    Learn More
  • 1
    Data-Juicer

    Data-Juicer

    Data processing for and with foundation models

    Data-Juicer is an open-source data processing and augmentation framework designed to enhance the quality and diversity of datasets for machine learning tasks. It includes a modular pipeline for scalable data transformation.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 2
    Synthetic Data Generator

    Synthetic Data Generator

    SDG is a specialized framework

    ...It also includes a data processing module capable of handling different data types, preprocessing columns, managing missing values, and converting formats automatically before model training.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 3
    Agentic Data Scientist

    Agentic Data Scientist

    An end-to-end Data Scientist

    ...Each agent is designed to independently call functions, interact with data sources, and adapt to uncertainties during processing, enabling iterative refinement of models without manual coordination. The framework supports interoperability with existing data tools and libraries, letting the agents leverage libraries like pandas, scikit-learn, and visualization frameworks to perform real computations rather than mock demonstrations.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 4
    ExtractThinker

    ExtractThinker

    ExtractThinker is a Document Intelligence library for LLMs

    ExtractThinker is a tool designed to facilitate the extraction and analysis of information from various data sources, aiding in data processing and knowledge discovery.
    Downloads: 3 This Week
    Last Update:
    See Project
  • Field Service+ for MS Dynamics 365 & Salesforce Icon
    Field Service+ for MS Dynamics 365 & Salesforce

    Empower your field service with mobility and reliability

    Resco’s mobile solution streamlines your field service operations with offline work, fast data sync, and powerful tools for frontline workers, all natively integrated into Dynamics 365 and Salesforce.
    Learn More
  • 5
    DeerFlow

    DeerFlow

    Deep Research framework, combining language models with tools

    DeerFlow is an open-source, community-driven “deep research” framework / multi-agent orchestration platform developed by ByteDance. It aims to combine the reasoning power of large language models (LLMs) with automated tool-use — such as web search, web crawling, Python execution, and data processing — to enable complex, end-to-end research workflows. Instead of a monolithic AI assistant, DeerFlow defines multiple specialized agents (e.g. “planner,” “searcher,” “coder,” “report generator”) that collaborate in a structured workflow, allowing tasks like literature reviews, data gathering, data analysis, code execution, and final report generation to be largely automated. ...
    Downloads: 572 This Week
    Last Update:
    See Project
  • 6
    Bytewax

    Bytewax

    Python Stream Processing

    ...Bytewax is a Python framework and Rust distributed processing engine that uses a dataflow computational model to provide parallelizable stream processing and event processing capabilities similar to Flink, Spark, and Kafka Streams. You can use Bytewax for a variety of workloads from moving data à la Kafka Connect style all the way to advanced online machine learning workloads. Bytewax is not limited to streaming applications but excels anywhere that data can be distributed at the input and output.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 7
    Awesome Fraud Detection Research Papers

    Awesome Fraud Detection Research Papers

    A curated list of data mining papers about fraud detection

    A curated list of data mining papers about fraud detection from several conferences.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 8
    LOTUS

    LOTUS

    AI-Powered Data Processing: Use LOTUS to process all of your datasets

    LOTUS is an open-source framework and query engine designed to enable efficient processing of structured and unstructured datasets using large language models. The system provides a declarative programming model that allows developers to express complex AI data operations using high-level commands rather than manually orchestrating model calls. It offers a Python interface with a Pandas-like API, making it familiar for data scientists and engineers already working with data analysis libraries. ...
    Downloads: 1 This Week
    Last Update:
    See Project
  • 9
    Diffgram

    Diffgram

    Training data (data labeling, annotation, workflow) for all data types

    ...Training Data is the art of supervising machines through data. This includes the activities of annotation, which produces structured data; ready to be consumed by a machine learning model. Annotation is required because raw media is considered to be unstructured and not usable without it. That’s why training data is required for many modern machine learning use cases including computer vision, natural language processing and speech recognition.
    Downloads: 2 This Week
    Last Update:
    See Project
  • Simplify Purchasing For Your Business Icon
    Simplify Purchasing For Your Business

    Manage what you buy and how you buy it with Order.co, so you have control over your time and money spent.

    Simplify every aspect of buying for your business in Order.co. From sourcing products to scaling purchasing across locations to automating your AP and approvals workstreams, Order.co is the platform of choice for growing businesses.
    Learn More
  • 10
    Unstructured.IO

    Unstructured.IO

    Open source libraries and APIs to build custom preprocessing pipelines

    The unstructured library provides open-source components for ingesting and pre-processing images and text documents, such as PDFs, HTML, Word docs, and many more. The use cases of unstructured revolve around streamlining and optimizing the data processing workflow for LLMs. unstructured modular bricks and connectors form a cohesive system that simplifies data ingestion and pre-processing, making it adaptable to different platforms and is efficient in transforming unstructured data into structured outputs.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 11
    DataDreamer

    DataDreamer

    DataDreamer: Prompt. Generate Synthetic Data. Train & Align Models

    DataDreamer is a tool designed to assist in the generation and manipulation of synthetic data for various applications, including testing and machine learning.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 12
    DOLMA

    DOLMA

    Data and tools for generating and inspecting OLMo pre-training data

    DOLMA (Data Optimization and Learning for Model Alignment) is a framework designed to manage large-scale datasets for training and fine-tuning language models efficiently.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 13
    Instill Core

    Instill Core

    Instill Core is a full-stack AI infrastructure tool for data

    Instill Core is an open-source, full-stack AI infrastructure platform designed to orchestrate data pipelines, machine learning models, and unstructured data processing into a unified, production-ready system. It provides an end-to-end solution that enables developers to build, deploy, and manage AI-powered applications without needing to manually stitch together multiple tools across the data and model lifecycle. The platform focuses heavily on handling unstructured data such as documents, images, audio, and video, transforming them into AI-ready formats through integrated ETL pipelines and processing workflows. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 14
    MineContext

    MineContext

    MineContext is your proactive context-aware AI partner

    ...It is built around a context engineering framework that manages the full lifecycle of data, including capture, processing, storage, retrieval, and consumption. The platform emphasizes privacy through a local-first architecture, allowing users to keep their data stored and processed on their own device rather than relying on external cloud services.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 15
    DataProfiler

    DataProfiler

    Extract schema, statistics and entities from datasets

    DataProfiler is an AI-powered tool for automatic data analysis and profiling, designed to detect patterns, anomalies, and schema inconsistencies in structured and unstructured datasets. The DataProfiler is a Python library designed to make data analysis, monitoring, and sensitive data detection easy. Loading Data with a single command, the library automatically formats & loads files into a DataFrame. Profiling the Data, the library identifies the schema, statistics, entities (PII / NPI), and...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 16
    DocETL

    DocETL

    A system for agentic LLM-powered data processing and ETL

    DocETL is an open-source system designed to build and execute data processing pipelines powered by large language models, particularly for analyzing complex collections of documents and unstructured datasets. The platform allows developers and researchers to construct structured workflows that extract, transform, and organize information from sources such as reports, transcripts, legal documents, and other text-heavy data.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 17
    Superlinked

    Superlinked

    Superlinked is a Python framework for AI Engineers

    Superlinked is a Python framework designed for AI engineers to build high-performance search and recommendation applications that combine structured and unstructured data.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 18
    LightAutoML

    LightAutoML

    Fast and customizable framework for automatic ML model creation

    LightAutoML is an automated machine learning (AutoML) framework optimized for efficient model training and hyperparameter tuning, focusing on both tabular and text data.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 19
    deepdoctection

    deepdoctection

    A Repo For Document AI

    ...For more specific text processing tasks use one of the many other great NLP libraries.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 20
    python-small-examples

    python-small-examples

    Focus on creating classic Python small examples and cases

    python-small-examples is an open-source educational repository that contains hundreds of concise Python programming examples designed to illustrate practical coding techniques. The project focuses on teaching programming concepts through small, focused scripts that demonstrate common tasks in data processing, visualization, and general programming. Each example highlights a specific function or programming pattern so that learners can quickly understand how to apply Python features in real-world scenarios. The repository includes examples covering topics such as file processing, JSON manipulation, data visualization, and library usage. ...
    Downloads: 2 This Week
    Last Update:
    See Project
  • 21
    OCRmyPDF

    OCRmyPDF

    OCRmyPDF adds an OCR text layer to scanned PDF files

    OCRmyPDF adds an optical character recognition (OCR) text layer to scanned PDF files, allowing them to be searched. PDF is the best format for storing and exchanging scanned documents. Unfortunately, PDFs can be difficult to modify. OCRmyPDF makes it easy to apply image processing and OCR (recognized, searchable text) to existing PDFs.
    Downloads: 91 This Week
    Last Update:
    See Project
  • 22
    MindNLP

    MindNLP

    Easy-to-use and high-performance NLP and LLM framework

    MindNLP is a natural language processing library built on the MindSpore framework, providing tools and models for various NLP tasks.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 23
    E2M

    E2M

    E2M converts various file types (doc, docx, epub, html, htm, url

    E2M is a SourceForge mirror of the e2m open-source project, which focuses on providing tools or services designed to convert or process content between different formats or systems. Projects with similar naming conventions typically emphasize automation workflows where input data from one environment is transformed into another representation or output structure. The mirrored repository allows users to access the project’s codebase independently from its original hosting platform while preserving the development history and release artifacts. Systems like e2m often serve as middleware components that connect different software systems or facilitate data processing pipelines. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 24
    Sparrow

    Sparrow

    Structured data extraction and instruction calling with ML, LLM

    ...The architecture is modular, allowing developers to build customizable processing pipelines that integrate with external tools and data extraction frameworks. Sparrow also includes workflow orchestration tools that allow multiple extraction tasks to be combined into automated pipelines for large-scale document processing.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 25
    Datasets

    Datasets

    Hub of ready-to-use datasets for ML models

    Datasets is a library for easily accessing and sharing datasets, and evaluation metrics for Natural Language Processing (NLP), computer vision, and audio tasks. Load a dataset in a single line of code, and use our powerful data processing methods to quickly get your dataset ready for training in a deep learning model. Backed by the Apache Arrow format, process large datasets with zero-copy reads without any memory constraints for optimal speed and efficiency. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • Previous
  • You're on page 1
  • 2
  • 3
  • 4
  • 5
  • Next
MongoDB Logo MongoDB