Search Results for "data processing" - Page 5

Showing 2018 open source projects for "data processing"

  • 1
    LiteParse

    A fast, helpful, and open-source document parser

    ...It also includes mechanisms for validation and error handling, ensuring that outputs conform to expected schemas and reducing the need for manual postprocessing. The library is particularly useful for tasks such as data extraction, document processing, and building pipelines that require structured outputs from natural language input.
    Downloads: 5 This Week
  • 2
    DSP.jl

    Filter design, periodograms, window functions

    DSP.jl provides a number of common digital signal processing routines in Julia.
    Downloads: 0 This Week
  • 3
    TDengine

    Open-source time-series database with high performance and scalability

    Enables efficient, real-time data ingestion, processing and monitoring of TB and even PB scale data per day, generated by billions of sensors and data collectors. TDengine can be widely applied to IoT, Industrial Internet, Connected Vehicles, DevOps, Energy, Finance and many other use cases. TDengine’s innovative design and purpose-built storage engine outperform other time-series databases for data ingestion, querying and data compression while significantly reducing storage and computing costs. ...
    Downloads: 0 This Week
  • 4
    spaCy

    Industrial-strength Natural Language Processing (NLP)

    spaCy is a library built on the very latest research for advanced Natural Language Processing (NLP) in Python and Cython. Since its inception it has been designed for real-world applications: building real products and gathering real insights. It comes with pretrained statistical models and word vectors, convolutional neural network models, easy deep learning integration and so much more. spaCy is the fastest syntactic parser in the world according to independent benchmarks, with...
    Downloads: 11 This Week
  • 5
    sharp

    High performance Node.js image processing module

    The typical use case for this high speed Node.js module is to convert large images in common formats to smaller, web-friendly JPEG, PNG, AVIF and WebP images of varying dimensions. Resizing an image is typically 4x-5x faster than using the quickest ImageMagick and GraphicsMagick settings due to its use of libvips. Colour spaces, embedded ICC profiles and alpha transparency channels are all handled correctly. Lanczos resampling ensures quality is not sacrificed for speed. As well as image...
    Downloads: 4 This Week
  • 6
    Meetily

    Privacy-first AI meeting assistant with 4x faster Parakeet/Whisper

    This project is a privacy-first AI meeting assistant that captures meeting audio, produces real-time transcripts, and generates summaries while keeping processing entirely on your own machine or infrastructure. It’s built for organizations that want meeting intelligence without sending recordings or transcripts to third-party cloud services, which helps address compliance and data sovereignty requirements. The app supports live transcription with local model options (including Whisper- and Parakeet-based workflows) and presents the transcript as the meeting happens, making it useful both for note-taking and accessibility. ...
    Downloads: 20 This Week
  • 7
    Kingfisher

    Lightweight, pure-Swift library for downloading images from the web

    Kingfisher is a powerful, pure-Swift library for downloading and caching images from the web. It lets you work with remote images in your next app in pure Swift. Asynchronous image downloading and caching. Loading images from either URLSession-based networking or locally provided data. Useful image processors and filters provided. Multiple-layer hybrid cache for both memory and disk. Fine control over cache behavior. Customizable expiration date and size limit....
    Downloads: 0 This Week
  • 8
    Sparrow

    Structured data extraction and instruction calling with ML, LLM

    ...The architecture is modular, allowing developers to build customizable processing pipelines that integrate with external tools and data extraction frameworks. Sparrow also includes workflow orchestration tools that allow multiple extraction tasks to be combined into automated pipelines for large-scale document processing.
    Downloads: 0 This Week
  • 9
    pg_analytics

    DuckDB-powered analytics for Postgres

    pg_analytics (formerly named pg_lakehouse) puts DuckDB inside Postgres. With pg_analytics installed, Postgres can query foreign object stores like AWS S3 and table formats like Iceberg or Delta Lake. Queries are pushed down to DuckDB, a high-performance analytical query engine. By transforming Postgres into a performant search and analytics engine, ParadeDB frees your team from the pain of scaling and syncing Elasticsearch.
    Downloads: 50 This Week
  • 10
    AionUi

    Free, local, open-source Cowork for Gemini CLI, Claude Code, Codex

    ...Instead of forcing users to work in separate terminals for each tool, AionUi automatically detects installed CLI tools and provides a central visual workspace where sessions can run in parallel, contexts are preserved, and conversations are saved locally without sending data to external servers. It enhances productivity by offering smart file management features like batch renaming, automatic organization, and intelligent file classification, thereby reducing manual overhead when working with large datasets or complex document structures. AionUi also supports a remote WebUI mode, allowing users to access their local AI tools securely over a network from other devices while keeping all processing and data on their own hardware.
    Downloads: 46 This Week
  • 11
    Datasets

    Hub of ready-to-use datasets for ML models

    Datasets is a library for easily accessing and sharing datasets, and evaluation metrics for Natural Language Processing (NLP), computer vision, and audio tasks. Load a dataset in a single line of code, and use our powerful data processing methods to quickly get your dataset ready for training in a deep learning model. Backed by the Apache Arrow format, process large datasets with zero-copy reads without any memory constraints for optimal speed and efficiency. ...
    Downloads: 0 This Week
  • 12
    WebP Codec

    Library to encode and decode images in WebP format

    libwebp is the reference codec library for Google’s WebP image format, providing both encoding and decoding along with command-line tools. It supplies cwebp to compress images into WebP and dwebp to decompress them back, making it easy to test quality/size trade-offs across presets and tuning parameters. The GitHub repository is a mirror; the canonical source of truth lives on Chromium’s git, and developer docs are hosted on WebP’s portal. The project underpins WebP support across browsers,...
    Downloads: 29 This Week
  • 13
    CocoIndex

    ETL framework to index data for AI, such as RAG

    CocoIndex is an open-source framework designed for building powerful, local-first semantic search systems. It lets users index and retrieve content based on meaning rather than keywords, making it ideal for modern AI-based search applications. CocoIndex leverages vector embeddings and integrates with various models and frameworks, including OpenAI and Hugging Face, to provide high-quality semantic understanding. It’s built for transparency, ease of use, and local control over your search...
    Downloads: 1 This Week
  • 14
    PHP Code Coverage

    Collection, processing, and rendering functionality for PHP code

    The php-code-coverage library, authored by Sebastian Bergmann, enables collection, processing, and rendering of PHP code coverage data. It integrates with PHPUnit or other testing frameworks to track which lines, methods, or classes are executed during tests. The library supports generating detailed reports in formats like HTML, Clover, or XML, helping teams understand test completeness and identify untested code paths.
    Downloads: 1 This Week
  • 15
    Apache Flink

    Stream processing framework with powerful stream

    Apache Flink is a distributed engine for stateful computations over data streams and batches, designed for low-latency processing at scale. Its core runtime executes dataflow graphs with fine-grained backpressure and checkpointing, allowing applications to recover consistently from failures. Flink’s event-time model and watermarks enable accurate out-of-order processing, windowing, and complex time semantics that typical real-time systems struggle with.
    Downloads: 1 This Week
  • 16
    mediasoup

    Cutting Edge WebRTC Video Conferencing

    mediasoup is a Node.js library that provides a cutting-edge WebRTC server capable of handling real-time communications with efficient media routing and processing.
    Downloads: 2 This Week
  • 17
    Jimp

    An image processing library written entirely in JavaScript for Node

    An image processing library for Node written entirely in JavaScript, with zero native dependencies. If you're using this library with TypeScript, the import method differs slightly from JavaScript: instead of using require, you must use an ES6 default import. If you're using a web bundler (webpack, rollup, parcel), you can benefit from using the module build of jimp. Using the module build will allow your bundler to understand your code better and exclude things you aren't...
    Downloads: 2 This Week
  • 18
    Databend

    Cloud-native open source data warehouse for analytics and AI queries

    Databend is an open source cloud-native data warehouse designed for large-scale analytics and modern data workloads. Built in Rust, the system focuses on high performance, scalability, and efficient data processing for analytical queries. It is designed with a separation of compute and storage, allowing compute nodes to scale independently while storing data in object storage systems.
    Downloads: 0 This Week
  • 19
    spider_collection

    Collection of Python web scraping scripts for data extraction tasks

    ...In addition to raw data collection, some spiders include basic data processing and analysis using tools such as pandas and simple visualization with matplotlib. It also contains examples of proxy pool integration and encapsulation to support more reliable crawling when working with sites that enforce request limits.
    Downloads: 3 This Week
  • 20
    Apache Sedona

    Cluster computing framework for processing large-scale geospatial data

    Apache Sedona™ is a cluster computing system for processing large-scale spatial data. Sedona extends existing cluster computing systems, such as Apache Spark and Apache Flink, with a set of out-of-the-box distributed Spatial Datasets and Spatial SQL that efficiently load, process, and analyze large-scale spatial data across machines. According to our benchmark and third-party research papers, Sedona runs 2X - 10X faster than other Spark-based geospatial data systems on computation-intensive query workloads. ...
    Downloads: 0 This Week
  • 21
    ESPnet

    End-to-end speech processing toolkit

    ESPnet is a comprehensive end-to-end speech processing toolkit covering a wide spectrum of tasks, including automatic speech recognition (ASR), text-to-speech (TTS), speech translation (ST), speech enhancement, speaker diarization, and spoken language understanding. It uses PyTorch as its deep learning engine and adopts a Kaldi-style data processing pipeline for features, data formats, and experimental recipes.
    Downloads: 1 This Week
  • 22
    Fast CSV

    CSV parser and formatter for node

    A high-performance Node.js library for parsing and formatting CSV data efficiently.
    Downloads: 1 This Week
  • 23
    Kestra

    Kestra is an infinitely scalable orchestration and scheduling platform

    Build reliable workflows, blazingly fast, deploy in just a few clicks. Kestra is an open-source, event-driven orchestrator that simplifies data operations and improves collaboration between engineers and business users. By bringing Infrastructure as Code best practices to data pipelines, Kestra allows you to build reliable workflows and manage them with confidence. Thanks to the declarative YAML interface for defining orchestration logic, everyone who benefits from analytics can participate...
    Downloads: 6 This Week
  • 24
    Documind

    Open-source platform for extracting structured data from documents

    Documind is an advanced document processing tool that leverages AI to extract structured data from PDFs. It is built to handle PDF conversions, extract relevant information, and format results as specified by customizable schemas.
    Downloads: 0 This Week
  • 25
    Logstash

    Centralize, transform and stash your data

    Logstash is a server-side data processing pipeline that dynamically ingests data from numerous sources, transforms it, and ships it to your favorite “stash” regardless of format or complexity. It supports and ingests data of all shapes, sizes and sources, dynamically transforms and prepares this data, and transports it to the output of your choice. Logstash is extensible, with over 200 plugins available to let you create and configure your pipeline how you choose.
    Downloads: 3 This Week