Page 2 | Best Open Source Linux Big Data Tools 2026

Big Data Tools for Linux

1

NCHC-Storm

NCHC's Storm Team

Shares Storm applications developed by NCHC's Storm Team.

Downloads: 3 This Week

Last Update: 2022-12-21
See Project
2

json-scada

A portable SCADA/IoT platform centered on the MongoDB database server.

Standard IT tools applied to SCADA/IoT (MongoDB, PostgreSQL/TimescaleDB, Node.js, C#, Golang, Grafana, etc.). MongoDB serves as the real-time core database, persistence layer, config store, and SOE historian. Portability and interoperability across Linux, Windows, x86/64, and ARM. Horizontal scalability, from a single computer to big clusters (MongoDB sharding), on bare metal, Docker containers, VMs, cloud, or hybrid deployments. Unlimited tags, servers, and users. HTML5 web interface. UTF-8/I18N. Protocols: IEC61850 Client, IEC60870-5-101/104 Client and Server, DNP3 Client, OPC-UA Client/Server, MQTT/Sparkplug-B, Telegraf (various data sources for monitoring, such as Modbus and SNMP). Requirements for the Windows installer: Windows 10/11 64-bit or Server 2016, and Windows PowerShell. GitHub project: https://github.com/riclolsen/json-scada

Downloads: 2 This Week

Last Update: 2026-03-22
See Project
3

Augustus

PMML-compliant scoring engine and analytic toolkit

Augustus development has moved to Google Code; the new project page is augustus.googlecode.com. New releases are not currently published on SourceForge. Augustus is designed for statistical and data mining models, and it produces and consumes models with tens of thousands of segments. Versions of Augustus support PMML 3, 4.0.1, and 4.1.

1 Review

Downloads: 1 This Week

Last Update: 2013-04-16
See Project
4

Cube Platform

Cube Platform is a decentralized grid computing system that uses the Pastry P2P protocol for communication between nodes. It is a big data storage system written in Java.

Downloads: 1 This Week

Last Update: 2013-04-23
See Project
5

.NET for Apache Spark

A free, open-source, and cross-platform big data analytics framework

.NET for Apache Spark provides high-performance APIs for using Apache Spark from C# and F#. With these .NET APIs, you can access the most popular DataFrame and Spark SQL aspects of Apache Spark for working with structured data, and Spark Structured Streaming for working with streaming data. .NET for Apache Spark is compliant with .NET Standard, a formal specification of .NET APIs that are common across .NET implementations. This means you can use .NET for Apache Spark anywhere you write .NET code, allowing you to reuse all the knowledge, skills, code, and libraries you already have as a .NET developer. .NET for Apache Spark runs on Windows, Linux, and macOS using .NET Core, or on Windows using .NET Framework. It also runs on all major cloud providers, including Azure HDInsight Spark, Amazon EMR Spark, and Databricks on AWS and Azure.

Downloads: 0 This Week

Last Update: 2026-02-13
See Project
6

An introduction to Data Analysis in R

A guide for learning the basic tools of data analysis with R

An Introduction to Data Analysis in R [Book]. A guide for learning the basic tools of data analysis: process, visualize, and learn from your data using R programming. This repository holds the data sets needed for the book "An Introduction to Data Analysis in R", to be published in the Springer series Use R!. The book can be purchased in XXX. The book is meant as an introductory guide to manipulating data sets in the Big Data paradigm. One of its main goals is to take the analyst from the first contact with the data to the final conclusions and presentation of results. We take into account the variety of fields where data analysis occurs nowadays. We pay special attention to the different ways of obtaining data and making it manageable before starting the analysis. The data analysis covers most of the basic visualization options and some advanced extras. Finally, basic statistics are used to learn from the processed data.

Downloads: 0 This Week

Last Update: 2020-02-08
See Project
7

Apache Hudi

Upserts, Deletes And Incremental Processing on Big Data

Apache Hudi (pronounced Hoodie) stands for Hadoop Upserts Deletes and Incrementals. Hudi manages the storage of large analytical datasets on DFS (cloud stores, HDFS, or any Hadoop FileSystem-compatible storage). Apache Hudi is a transactional data lake platform that brings database and data warehouse capabilities to the data lake. Hudi reimagines slow, old-school batch data processing with a powerful new incremental processing framework for low-latency, minute-level analytics. Hudi provides efficient upserts by mapping a given hoodie key (record key + partition path) consistently to a file id via an indexing mechanism. This mapping between record key and file group/file id never changes once the first version of a record has been written to a file. In short, the mapped file group contains all versions of a group of records.
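The key-to-file mapping described above can be illustrated with a small in-memory sketch. This is not Hudi's API or its actual index implementation: the class `KeyToFileIndex`, the `keys_per_file` parameter, and the file id format are all hypothetical names chosen for illustration. The sketch only shows the stated property that a hoodie key, once assigned to a file id, keeps that assignment for every later version.

```python
class KeyToFileIndex:
    """Toy index (hypothetical, not Hudi code): maps a hoodie key
    (record key + partition path) to a file id that never changes."""

    def __init__(self, keys_per_file=2):
        self.keys_per_file = keys_per_file  # assumed capacity per file group
        self._index = {}       # (record_key, partition_path) -> file_id
        self._versions = {}    # file_id -> list of (record_key, payload)
        self._key_counts = {}  # file_id -> number of distinct keys mapped to it
        self._open_file = {}   # partition_path -> file id still accepting new keys
        self._next = 0

    def _new_file(self, partition_path):
        file_id = f"{partition_path}/file-{self._next}"
        self._next += 1
        self._versions[file_id] = []
        self._key_counts[file_id] = 0
        self._open_file[partition_path] = file_id
        return file_id

    def upsert(self, record_key, partition_path, payload):
        hoodie_key = (record_key, partition_path)
        file_id = self._index.get(hoodie_key)
        if file_id is None:
            # First version of this record: pick a file id once and remember it.
            file_id = self._open_file.get(partition_path)
            if file_id is None or self._key_counts[file_id] >= self.keys_per_file:
                file_id = self._new_file(partition_path)
            self._index[hoodie_key] = file_id
            self._key_counts[file_id] += 1
        # Every later version of the record lands in the same file group.
        self._versions[file_id].append((record_key, payload))
        return file_id

    def versions(self, file_id):
        return list(self._versions[file_id])
```

Calling `upsert` twice with the same record key and partition path returns the same file id both times, so `versions(file_id)` accumulates all versions of the records mapped to that file group.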

Downloads: 0 This Week

Last Update: 2025-12-18
See Project
8

Apache InLong

Apache InLong - a one-stop integration framework for massive data

Apache InLong is a one-stop integration framework for massive data that provides automatic, secure, and reliable data transmission capabilities. InLong supports both batch and stream data processing, which makes it well suited to building data analysis, modeling, and other real-time applications based on streaming data. InLong (应龙) is a divine beast in Chinese mythology who guides the river into the sea, which serves as a metaphor for how the InLong system reports data streams. InLong was originally built at Tencent, where it has served online businesses for more than 8 years, supporting massive data reporting services (more than 80 trillion pieces of data per day) in big data scenarios. The platform integrates 5 modules: Ingestion, Convergence, Caching, Sorting, and Management, so that the business only needs to provide data sources, data service quality, data landing clusters, and data landing formats.

Downloads: 0 This Week

Last Update: 2025-11-13
See Project
9

BEAR

CBR Meets Big Data

Case-based regression learner for big data. The package contains source and binary files for running BEAR's method. BEAR utilizes EAR4 and locality sensitive hashing in its implementation.

Downloads: 0 This Week

Last Update: 2015-08-11
See Project
10

Big Sack

Big Sack: A lightweight Java Key/Value store with undo and disk cache.

Big Sack is a Java persistence mechanism that stores key/value pairs following the popular Big Data paradigms. It's a very simple and straightforward way to bridge the gap between in-memory data structures and long-term storage. It has the convenience of the Java SDK TreeMap and TreeSet classes and is used in the same easy way, but it adds rollback through undo logging to checkpoint data, so the store does not end up in an unknown state after a failure. Data storage in the exabyte range is possible using filesystem and/or memory-mapped IO. Three levels of configurable write-through caching at different granularities ensure performance.
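As a rough illustration of rollback through undo logging, here is a minimal in-memory sketch in Python. It is not Big Sack's API (Big Sack is a Java library, and its classes are not shown in this listing); the class `UndoMap` and its methods are hypothetical, and persistence and caching are left out entirely.

```python
_MISSING = object()  # sentinel: the key did not exist before the change


class UndoMap:
    """Hypothetical key/value map with checkpoint and rollback via an undo log."""

    def __init__(self):
        self._data = {}
        self._undo = []  # entries of (key, previous value or _MISSING)

    def put(self, key, value):
        # Record the old state before overwriting it.
        self._undo.append((key, self._data.get(key, _MISSING)))
        self._data[key] = value

    def remove(self, key):
        if key in self._data:
            self._undo.append((key, self._data.pop(key)))

    def checkpoint(self):
        # Accept all changes so far; they can no longer be rolled back.
        self._undo.clear()

    def rollback(self):
        # Replay the undo log in reverse to restore the last checkpoint.
        while self._undo:
            key, previous = self._undo.pop()
            if previous is _MISSING:
                self._data.pop(key, None)
            else:
                self._data[key] = previous

    def items(self):
        # Sorted iteration, loosely mirroring TreeMap ordering.
        return sorted(self._data.items())
```

After a `checkpoint()`, any sequence of `put` and `remove` calls followed by `rollback()` leaves the map exactly as it was at the checkpoint.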

Downloads: 0 This Week

Last Update: 2013-12-21
See Project
11

Chordalysis

Log-linear analysis (data modelling) for high-dimensional data

(Project moved to https://github.com/fpetitjean/Chordalysis) Log-linear analysis is the statistical method used to capture multi-way relationships between variables. However, due to its exponential nature, previous approaches did not scale beyond a dozen variables. We present here Chordalysis, a log-linear analysis method for big data. Chordalysis exploits recent discoveries in graph theory by representing complex models as compositions of triangular structures, also known as chordal graphs. Chordalysis makes it possible to discover the structure of datasets with thousands of variables on a standard desktop computer. Associated papers from ICDM 2013, ICDM 2014, and SDM 2015 can be found at http://www.francois-petitjean.com/Research/. YourKit supports the Chordalysis open source project with its full-featured Java Profiler. YourKit is the creator of innovative and intelligent tools for profiling Java and .NET applications. http://www.yourkit.com

Downloads: 0 This Week

Last Update: 2015-01-29
See Project
12

Custom Apache Big data Distribution

A Custom Apache Distribution including Spark and Hadoop, for Windows.

This distribution has been customized to work out of the box. Just download it and unzip it, then add the bin folders to the Path variable and set HADOOP_HOME, SPARK_HOME, and JAVA_HOME. That's it! You can now use Hadoop and Spark natively on Windows.

Downloads: 0 This Week

Last Update: 2020-03-11
See Project
13

ElasticJob

Distributed scheduled job framework

ElasticJob is a distributed scheduling solution consisting of two separate projects, ElasticJob-Lite and ElasticJob-Cloud. ElasticJob-Lite is a lightweight, decentralized solution that provides distributed task sharding services. ElasticJob-Cloud uses Mesos to manage and isolate resources. Both projects share a unified job API, so developers write a job once and deploy it at will. Features include job sharding and high availability in distributed systems, scale-out for improved throughput and efficiency, job processing capacity that scales with the allocated resources, job execution at a suitable time on assigned resources, aggregation of the same job onto the same job executor, and dynamic addition of resources to newly assigned jobs. With ElasticJob, developers no longer need to worry about non-functional requirements such as job scale-out and can focus on business code.
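To make the idea of job sharding concrete, the following sketch splits the shard items of one job evenly across the running instances and recomputes the split when an instance leaves. It is an assumption-laden illustration, not ElasticJob code: the function `assign_shards` and the even-split policy are hypothetical, and coordination between nodes is omitted.

```python
def assign_shards(instances, shard_count):
    """Split shard items 0..shard_count-1 across instances as evenly as possible.

    Hypothetical policy for illustration: instances are sorted by name, and
    any leftover shards go to the first instances in that order.
    """
    if not instances:
        return {}
    ordered = sorted(instances)
    base, extra = divmod(shard_count, len(ordered))
    assignment = {}
    next_shard = 0
    for position, instance in enumerate(ordered):
        size = base + (1 if position < extra else 0)
        assignment[instance] = list(range(next_shard, next_shard + size))
        next_shard += size
    return assignment
```

With three instances and eight shards, two instances take three shards each and one takes two. If an instance goes offline, calling the function again with the remaining instances redistributes all eight shards, which is the high-availability behavior the description refers to.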

Downloads: 0 This Week

Last Update: 2026-01-31
See Project
14

Faum

Fast Autonomous Unsupervised Multidimensional Classification

This is the proof-of-concept implementation of the FAUM clustering method. This implementation was used to produce the published results and is now released in the hope that it will be useful.

Downloads: 0 This Week

Last Update: 2024-02-02
See Project
15

Flamingo Project

Workflow Designer, Hive Editor, Pig Editor, File System Browser

Flamingo is an open-source Big Data platform that combines an Ajax rich web interface, workflow engine, workflow designer, MapReduce, Hive editor, and Pig editor. 1. An easy tool for big data. 2. Comfortable to use with Hadoop ecosystem projects. 3. Licensed under GPL v3. It supports a Pig IDE, Hive IDE, HDFS browser, scheduler, Hadoop job monitoring, workflow engine, workflow designer, and MapReduce.

3 Reviews

Downloads: 0 This Week

Last Update: 2016-11-29
See Project
16

GOBIG

GOBIG is a toolbox for detecting genetic variations. The project is intended to handle big data and, more importantly, to detect clusters of SNP variants. The toolbox is intended for use with both common and rare variants, for example to find the genetic map of genes causing complex diseases.

Downloads: 0 This Week

Last Update: 2015-09-10
See Project
17

GnuCopy

GnuCopy is an open-source tool to copy and archive all your important data. It supports all important archive types, such as Zip and Tar, to guarantee an easy and secure exchange between all types of operating systems. Additionally, you can create profiles that blacklist or whitelist specific file types or folders to separate your big data stores for backups.

Downloads: 0 This Week

Last Update: 2023-07-28
See Project
18

GridDB

GridDB is a next-generation open source database

A cyber-physical system is a system that collects a variety of data in physical space (the real world), analyzes and converts it into knowledge in cyberspace, and feeds the knowledge back to the real world to revitalize industry and solve social problems. GridDB is an open database that enables real-time processing of vast amounts of time-series data in physical space, which is necessary to realize a cyber-physical system. It has a multi-model architecture that supports various data stores, with time-series-oriented and pluggable data stores for efficient real-time processing and management of huge amounts of high-frequency time-series data. Various architectural innovations, such as in-memory orientation with "memory as the main unit and disk as the secondary unit" and event-driven design with minimal overhead, have been incorporated to achieve processing capabilities that can handle petabyte-scale applications.

Downloads: 0 This Week

Last Update: 2026-02-18
See Project
19

HSRA

Hadoop spliced read aligner for RNA-seq data

HSRA is a MapReduce-based parallel tool for mapping reads from RNA sequencing (RNA-seq) experiments. RNA-seq analyses typically begin by mapping reads to a reference genome in order to determine the location from which the reads originated, which is a very time-consuming step. This tool allows bioinformatics researchers to efficiently distribute their mapping tasks over the nodes of a cluster by combining a fast multithreaded spliced aligner (HISAT2) with Apache Hadoop, a distributed computing framework for scalable Big Data processing. HSRA currently supports single-end and paired-end read alignments from FASTQ/FASTA datasets. Moreover, the tool uses the Hadoop Sequence Parser (HSP) library (link above) to efficiently read input datasets stored on the Hadoop Distributed File System (HDFS), and it can process datasets compressed with the Gzip and BZip2 codecs.

Downloads: 0 This Week

Last Update: 2019-01-23
See Project
20

LEACrypt

TTAK.KO-12.0223 Lightweight Encryption Algorithm Tool

The Lightweight Encryption Algorithm (also known as LEA) is a 128-bit block cipher developed by South Korea in 2013 to provide confidentiality in high-speed environments such as big data and cloud computing, as well as lightweight environments such as IoT devices and mobile devices. LEA is one of the cryptographic algorithms approved by the Korean Cryptographic Module Validation Program (KCMVP) and is the national standard of the Republic of Korea (KS X 3246). LEA is included in the ISO/IEC 29192-2:2019 standard (Information security - Lightweight cryptography - Part 2: Block ciphers). This project is licensed under the ISC License. Copyright © 2020-2021 ALBANESE Research Lab. Source code: https://github.com/pedroalbanese/leacrypt Visit: http://albanese.atwebpages.com

Downloads: 0 This Week

Last Update: 2022-12-16
See Project
21

LogicalSets

Integrated Comprehensive Data Architecture & Methodology

This is an advanced data architecture and methodology: a comprehensive Enterprise Resource Management System built on a reusable database with rules for customization. While it is a data-driven transaction processing engine, the system also has very advanced reporting capabilities. The design eliminates up to 90% of business logic through the way the data is structured. It uses a concept called Table Sets, with a compound key that tells the programmer which table set, which record, and which applet will view or edit the data. Developed in SAP PowerDesigner for (Sybase) SQL Anywhere. Don't let the date fool you; this system is ahead of its time.

Downloads: 0 This Week

Last Update: 2021-12-06
See Project
22

MapReduce Brazil

Aggregates MapReduce projects

Nowadays the production and storage of Big Data is common in both academia and industry. Processing this huge amount of data requires high-performance platforms and programming models such as MapReduce.

Downloads: 0 This Week

Last Update: 2015-08-26
See Project
23

MarDRe

MapReduce-based tool to remove duplicate DNA reads

MarDRe is a de novo MapReduce-based parallel tool to remove duplicate and near-duplicate DNA reads through the clustering of single-end and paired-end sequences from FASTQ/FASTA datasets. This tool allows bioinformaticians to avoid analyzing unnecessary reads, reducing the time of subsequent procedures with the dataset. MarDRe is the Big Data counterpart of ParDRe (link above), which employs HPC technologies (i.e., hybrid MPI/multithreading) to reduce runtime on multicore systems. MarDRe instead takes advantage of the MapReduce programming model to significantly improve on ParDRe's performance on distributed systems, especially on cloud-based infrastructures. Written in pure Java to maximize cross-platform compatibility, MarDRe is built upon the open-source Apache Hadoop project, the most popular distributed computing framework for Big Data processing.

Downloads: 0 This Week

Last Update: 2019-01-23
See Project
24

Modin

Scale your Pandas workflows by changing a single line of code

Scale your pandas workflow by changing a single line of code. Modin uses Ray, Dask or Unidist to provide an effortless way to speed up your pandas notebooks, scripts, and libraries. Unlike other distributed DataFrame libraries, Modin provides seamless integration and compatibility with existing pandas code. Even using the DataFrame constructor is identical. It is not necessary to know in advance the available hardware resources in order to use Modin. Additionally, it is not necessary to specify how to distribute or place data. Modin acts as a drop-in replacement for pandas, which means that you can continue using your previous pandas notebooks, unchanged, while experiencing a considerable speedup thanks to Modin, even on a single machine. Once you’ve changed your import statement, you’re ready to use Modin just like you would pandas.

Downloads: 0 This Week

Last Update: 2025-10-02
See Project
25

Nebula Graph

A distributed, fast open-source graph database

The graph database built for super large-scale graphs with milliseconds of latency. Optimized SUBGRAPH and FIND PATH for better performance. Optimized query paths to reduce redundant paths and time complexity. Optimized the method to get properties for better performance of MATCH statements. Nebula Graph adopts the Apache 2.0 license, one of the most permissive free software licenses in the world. Free as in freedom, because under the Apache 2.0 license you can use, copy, modify, and redistribute Nebula Graph, even for commercial purposes, all without asking for permission. We believe that great open source projects are not built in isolation, but rather by a community of contributors. We welcome contributions to Nebula Graph from anyone, regardless of skill level or background in software development. If you have an idea for a feature you would like to see added, or you have identified a bug that needs fixing, please don't hesitate to submit an issue to our GitHub repository.

Downloads: 0 This Week

Last Update: 2024-05-17
See Project