data processing free download

Showing 1666 open source projects for "data processing"

1

Data-Juicer

Data processing for and with foundation models

Data-Juicer is an open-source data processing and augmentation framework designed to enhance the quality and diversity of datasets for machine learning tasks. It includes a modular pipeline for scalable data transformation.

Downloads: 4 This Week

Last Update: 2026-03-17
2

Data Formulator

Create rich visualizations with AI

To create rich visualizations, data analysts often need to iterate back and forth between data processing and chart specification. Doing so requires not only proficiency in data transformation and visualization tools but also effort to manage a branching history made up of many different versions of data and charts. Recent LLM-powered AI systems have greatly improved the visualization authoring experience, for example by lowering manual data transformation barriers through LLMs' code generation ability. ...

Downloads: 13 This Week

Last Update: 2026-03-03
3

Synthetic Data Generator

SDG is a specialized framework

...It also includes a data processing module capable of handling different data types, preprocessing columns, managing missing values, and converting formats automatically before model training.

Downloads: 16 This Week

Last Update: 2026-03-06
4

Agentic Data Scientist

An end-to-end Data Scientist

...Each agent is designed to independently call functions, interact with data sources, and adapt to uncertainties during processing, enabling iterative refinement of models without manual coordination. The framework supports interoperability with existing data tools and libraries, letting the agents leverage libraries like pandas, scikit-learn, and visualization frameworks to perform real computations rather than mock demonstrations.

Downloads: 1 This Week

Last Update: 2026-02-05
5

NYC Taxi Data

Import public NYC taxi and for-hire vehicle (Uber, Lyft)

The nyc-taxi-data repository is a rich dataset and exploratory project around New York City taxi trip records. It collects and preprocesses large-scale trip datasets (fares, pickup/dropoff, timestamps, locations, passenger counts) to enable data analysis, modeling, and visualization efforts. The project includes scripts and notebooks for cleaning and filtering the raw data, memory-efficient processing for large CSV/Parquet files, and aggregation workflows (e.g. trips per hour, heatmaps of pickups/dropoffs). ...

Downloads: 1 This Week

Last Update: 2025-10-01
6

Kapacitor

Open source framework for processing, monitoring, and alerting

Kapacitor is an open source framework for processing, monitoring, and alerting on time-series data. It is a real-time data processing engine specifically designed to work with time-series data from InfluxDB.

Downloads: 8 This Week

Last Update: 2026-03-03
7

go-streams

A lightweight stream processing library for Go

go-streams is a lightweight stream processing library for Go that provides a simple and concise DSL to build data pipelines. In computing, a pipeline, also known as a data pipeline, is a set of data processing elements connected in series, where the output of one element is the input of the next one. The elements of a pipeline are often executed in parallel or in a time-sliced fashion.
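
The pipeline idea described above can be sketched in a few lines. This is a minimal illustration in plain Python generators, not the go-streams API; the stage names (`parse`, `keep_even`, `square`) and the `run_pipeline` helper are hypothetical and chosen only for this example. Each stage consumes the output of the previous one, and because generators are lazy, items flow through the stages one at a time rather than in whole batches.

```python
from typing import Callable, Iterable, Iterator


def parse(lines: Iterable[str]) -> Iterator[int]:
    """Stage 1: convert raw text records to integers."""
    for line in lines:
        yield int(line)


def keep_even(numbers: Iterable[int]) -> Iterator[int]:
    """Stage 2: drop odd values."""
    for n in numbers:
        if n % 2 == 0:
            yield n


def square(numbers: Iterable[int]) -> Iterator[int]:
    """Stage 3: transform each remaining value."""
    for n in numbers:
        yield n * n


def run_pipeline(source: Iterable, *stages: Callable[[Iterable], Iterator]) -> Iterable:
    """Connect stages in series: the output of one is the input of the next."""
    stream = source
    for stage in stages:
        stream = stage(stream)
    return stream


result = list(run_pipeline(["1", "2", "3", "4"], parse, keep_even, square))
print(result)  # [4, 16]
```

A real stream processing library would add concurrency, back-pressure, and connectors for external sources and sinks; the sketch only shows the series connection of processing elements.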

Downloads: 7 This Week

Last Update: 2025-05-10
8

Arroyo

Distributed stream processing engine in Rust

Arroyo is a distributed stream processing engine written in Rust, designed to efficiently perform stateful computations on streams of data. Unlike traditional batch processing, streaming engines can operate on both bounded and unbounded sources, emitting results as soon as they are available.

Downloads: 7 This Week

Last Update: 2025-12-01
9

Pathway

Python ETL framework for stream processing, real-time analytics, LLM

...Unlike traditional batch processing frameworks, Pathway continuously updates the results of your data logic as new events arrive, functioning more like a database that reacts in real-time. It supports Python, integrates with modern data tools, and offers a deterministic dataflow model to ensure reproducibility and correctness.

Downloads: 21 This Week

Last Update: 2026-03-24
10

Bytewax

Python Stream Processing

...Bytewax is a Python framework and Rust distributed processing engine that uses a dataflow computational model to provide parallelizable stream processing and event processing capabilities similar to Flink, Spark, and Kafka Streams. You can use Bytewax for a variety of workloads, from moving data à la Kafka Connect all the way to advanced online machine learning workloads. Bytewax is not limited to streaming applications but excels anywhere that data can be distributed at the input and output.

Downloads: 8 This Week

Last Update: 2024-11-25
11

Numaflow

Kubernetes-native platform to run massively parallel data/streaming

Numaflow is a Kubernetes-native tool for running massively parallel stream processing. A Numaflow Pipeline is implemented as a Kubernetes custom resource and consists of one or more source, data processing, and sink vertices. Numaflow installs in a few minutes and is easier and cheaper to use for simple data processing applications than a full-featured stream processing platform.

Downloads: 8 This Week

Last Update: 2026-03-14
12

CyberChef

A web app for encryption, encoding, compression and data analysis

CyberChef, developed by GCHQ, is a versatile web application dubbed the "Cyber Swiss Army Knife." It enables users to perform a wide array of operations on data, including encryption, encoding, compression, and analysis, all within a browser interface.

Downloads: 71 This Week

Last Update: 2026-04-07
13

LAStools

Efficient tools for LiDAR processing

LAStools is a collection of efficient, multi-core, scriptable tools for processing LiDAR data. It supports various formats, including LAS, LAZ, Terrasolid BIN, and ESRI Shapefiles, providing a comprehensive suite for LiDAR data management and analysis.

Downloads: 32 This Week

Last Update: 2025-10-23
14

pdfcpu

A PDF processor written in Go

pdfcpu is a PDF processing library written in Go that supports encryption and provides both an API and a CLI. All versions up to PDF 1.7 (ISO-32000) are supported. The project is an effort to build a comprehensive PDF processing library from the ground up in Go. Over time, pdfcpu aims to support the standard range of PDF processing features as well as any interesting use cases that present themselves along the way. The main focus lies on strong support for batch processing and scripting via a...

Downloads: 22 This Week

Last Update: 2025-10-21
15

ExtractThinker

ExtractThinker is a Document Intelligence library for LLMs

ExtractThinker is a tool designed to facilitate the extraction and analysis of information from various data sources, aiding in data processing and knowledge discovery.

Downloads: 8 This Week

Last Update: 2025-06-09
16

MeshLab

The open source mesh processing system

...VCG can be used as a stand-alone, large-scale automated mesh processing pipeline, while MeshLab makes it easy to experiment with its algorithms interactively. MeshLab is an open source system for processing and editing 3D triangular meshes. It provides a set of tools for editing, cleaning, healing, inspecting, rendering, texturing, and converting meshes. It offers features for processing raw data produced by 3D digitization tools and devices and for preparing models for 3D printing.

Downloads: 36 This Week

Last Update: 2025-07-22
17

ThingsBoard

Device management, data collection, processing and visualization

...Define relations between your devices, assets, customers or any other entities. Collect and store telemetry data in a scalable and fault-tolerant way. Visualize your data with built-in or custom widgets and flexible dashboards. Share dashboards with your customers. Define data processing rule chains. Transform and normalize your device data. Raise alarms on incoming telemetry events, attribute updates, device inactivity, and user actions.

Downloads: 16 This Week

Last Update: 2026-03-30
18

Siddhi Core Libraries

Stream Processing and Complex Event Processing Engine

Fully open source, cloud-native, scalable, micro streaming, and complex event processing system capable of building event-driven applications for use cases such as real-time analytics, data integration, notification management, and adaptive decision-making. Event processing logic can be written using Streaming SQL queries via graphical and source editors, to capture events from diverse data sources, process and analyze them, integrate with multiple services and data stores, and publish output to various endpoints in real time. ...

Downloads: 3 This Week

Last Update: 2025-03-05
19

LOTUS

AI-Powered Data Processing: Use LOTUS to process all of your datasets

LOTUS is an open-source framework and query engine designed to enable efficient processing of structured and unstructured datasets using large language models. The system provides a declarative programming model that allows developers to express complex AI data operations using high-level commands rather than manually orchestrating model calls. It offers a Python interface with a Pandas-like API, making it familiar for data scientists and engineers already working with data analysis libraries. ...

Downloads: 5 This Week

Last Update: 2026-03-06
20

SageMaker Spark Container

Docker image used to run data processing workloads

Apache Spark™ is a unified analytics engine for large-scale data processing. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for stream processing.

Downloads: 3 This Week

Last Update: 2025-12-04
21

DOLMA

Data and tools for generating and inspecting OLMo pre-training data

DOLMA is a framework designed to efficiently manage large-scale datasets for training and fine-tuning language models.

Downloads: 10 This Week

Last Update: 2025-06-25
22

Reactor Core

Non-Blocking Reactive Foundation for the JVM

Reactor Core is a foundational library for building reactive applications in Java, providing a powerful API for asynchronous, non-blocking programming.

Downloads: 7 This Week

Last Update: 8 hours ago
23

Diffgram

Training data (data labeling, annotation, workflow) for all data types

...Training data is the art of supervising machines through data. This includes annotation, which produces structured data ready to be consumed by a machine learning model. Annotation is required because raw media is considered unstructured and is not usable without it. That’s why training data is required for many modern machine learning use cases, including computer vision, natural language processing, and speech recognition.

Downloads: 9 This Week

Last Update: 2024-10-14
24

Pachyderm

Data-Centric Pipelines and Data Versioning

...Pachyderm provides a powerful solution to optimize data processing, MLOps, and ML Lifecycles.

Downloads: 1 This Week

Last Update: 2025-01-15
25

Miller

Miller is like awk, sed, cut, join, and sort for name-indexed data

Miller is like awk, sed, cut, join, and sort for data formats such as CSV, TSV, JSON, JSON Lines, and positionally indexed data. With Miller, you can use named fields without needing to count positional indices. Then, on the fly, you can add new fields which are functions of existing fields, drop fields, sort, aggregate statistically, pretty-print, and more. Miller operates on key-value-pair data while the...

Downloads: 41 This Week

Last Update: 2026-02-21
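
The named-field approach in the Miller entry above can be illustrated with a small sketch. This is not Miller itself; it uses only Python's standard `csv` module, and the sample data and field names (`item`, `price`, `quantity`, `total`) are made up for the example. It shows the same kinds of operations the description lists: referring to fields by name, adding a field computed from existing ones, dropping a field, and sorting.

```python
import csv
import io

# Hypothetical sample data; a real run would read from a file instead.
raw = "item,price,quantity\nwidget,2.50,4\ngadget,10.00,3\n"

# DictReader keys each record by header name, so no positional indices are needed.
rows = list(csv.DictReader(io.StringIO(raw)))

for row in rows:
    # Add a new field that is a function of existing fields.
    row["total"] = float(row["price"]) * int(row["quantity"])
    # Drop a field that is no longer needed.
    del row["quantity"]

# Sort records by the new field, largest first.
rows.sort(key=lambda r: r["total"], reverse=True)

for row in rows:
    print(row)
```

Each record is handled as a set of key-value pairs, which is the data model the Miller description refers to.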