pentaho data integration free download

Showing 20 open source projects for "pentaho data integration"


Filters: Internet, Python

1

AWS Data Wrangler

Pandas on AWS, easy integration with Athena, Glue, Redshift, etc.

An open-source Python initiative from AWS Professional Services that extends the pandas library to AWS, connecting DataFrames with AWS data-related services. Easy integration with Athena, Glue, Redshift, Timestream, OpenSearch, Neptune, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON, and Excel). Built on top of other open-source projects such as pandas, Apache Arrow and Boto3, it offers abstracted functions for common ETL tasks such as loading and unloading data from data lakes, data warehouses, and databases. ...

Downloads: 20 This Week

Last Update: 6 days ago
2

Anna’s Archive

Comprehensive search engine for books, papers, comics, magazines

Anna’s Archive is a large-scale open-source search engine and data aggregation platform designed to index and provide access to a vast collection of books, academic papers, comics, magazines, and other digital texts through a unified interface. The project includes all the infrastructure required to run a full instance locally or in production, combining web servers, databases, and search indexing systems into a scalable architecture. It relies heavily on technologies such as Elasticsearch...

Downloads: 139 This Week

Last Update: 2026-03-23
3

spider_collection

Collection of Python web scraping scripts for data extraction tasks

...In addition to raw data collection, some spiders include basic data processing and analysis using tools such as pandas and simple visualization with matplotlib. It also contains examples of proxy pool integration and encapsulation to support more reliable crawling when working with sites that enforce request limits.

Downloads: 1 This Week

Last Update: 5 days ago
4

SearXNG

Free internet metasearch engine which aggregates

SearXNG is a free and open-source metasearch engine designed to aggregate results from multiple search engines while prioritizing user privacy and anonymity. Instead of maintaining its own index, it queries numerous external search providers and merges the results into a single interface, increasing coverage and diversity of information. One of its core principles is privacy, as it does not track users, store personal data, or create search profiles, making it a strong alternative to...

Downloads: 21 This Week

Last Update: 2026-04-07
5

SEO Machine

A specialized Claude Code workspace for creating long-form

SEO Machine is an AI-powered content production system built as a structured workspace for generating long-form, SEO-optimized blog content through automated workflows. It integrates research, writing, analysis, and optimization into a single pipeline, allowing users to produce high-quality articles tailored to search engine performance. The system uses specialized commands and agents to perform tasks such as keyword research, competitor analysis, content drafting, and optimization. It...

Downloads: 5 This Week

Last Update: 4 days ago
6

Scrapling

An adaptive Web Scraping framework

...The framework includes advanced fetchers capable of bypassing anti-bot protections such as Cloudflare Turnstile using stealth and browser automation techniques. Its powerful spider system supports multi-session crawling, pause and resume functionality, and real-time streaming of scraped data. Scrapling combines high performance, memory efficiency, and extensive async support to deliver blazing-fast scraping workflows. With a developer-friendly API, CLI tools, MCP server integration for AI-assisted extraction, and Docker support, it offers a complete solution for modern web scrapers.

Downloads: 4 This Week

Last Update: 1 day ago
7

FEAPDER

Powerful Python crawler framework for scalable web scraping tasks

feapder is a Python-based web crawling framework designed to simplify the process of building scalable and efficient web scrapers. It focuses on providing a developer-friendly environment that makes it easier to create, run, and manage crawlers for a variety of data collection tasks. It includes several built-in spider types, such as AirSpider, Spider, TaskSpider, and BatchSpider, which address different crawling scenarios ranging from lightweight scraping to distributed and batch-based...

Downloads: 0 This Week

Last Update: 2026-03-10
8

Shynet

Modern, privacy-friendly, and detailed web analytics

Modern, privacy-friendly, and detailed web analytics that works without cookies or JS. There are a lot of web analytics tools. Unfortunately, most of them come with the following caveats. They require handing all of your visitors' info to a third-party company. They use cookies to track visitors across sessions, so you need to have those annoying cookie notices. They collect so much personal data that even the NSA is jealous. They are closed source and/or expensive, often with limited data...

Downloads: 0 This Week

Last Update: 2026-03-15
9

news-please

Python tool for crawling and extracting structured data from news site

news-please is an open source news crawler and information extraction tool designed to collect and structure articles from online news websites. It provides an integrated pipeline that crawls news sites, retrieves article pages, and extracts structured information such as headlines, authors, publication dates, and article text. news-please can recursively follow internal links and read RSS feeds to gather both recent and archived articles from a news outlet when given only the root URL of a...

Downloads: 1 This Week

Last Update: 6 days ago
10

Pelican

Static site generator that supports Markdown and reST syntax

Pelican is a static site generator that requires no database or server-side logic. Features: chronological content (e.g., articles, blog posts) as well as static pages; integration with external services; site themes (created using Jinja2 templates); publication of articles in multiple languages; generation of Atom and RSS feeds; code syntax highlighting via Pygments; import of existing content from WordPress, Dotclear, or RSS feeds; fast rebuild times due to content caching and selective output writing. ...

Downloads: 0 This Week

Last Update: 2025-01-15
11

Mezzanine

CMS framework for Django

Mezzanine is a powerful open source content management platform built using the Django framework. In many ways it is like many other content management tools, offering an intuitive interface for managing all of your content. But Mezzanine is different in that it provides most of its functionality by default. While other platforms rely heavily on modules or reusable applications, Mezzanine comes ready with all the functionality you need, making it the more efficient choice. Mezzanine has a...

Downloads: 5 This Week

Last Update: 2025-06-05
12

TOMUSS

TOMUSS: The Online Multi User Simple Spreadsheet

TOMUSS is an interactive web application (groupware) that lets multiple concurrent users edit data tables. Its primary goal is the management of student grades.

Downloads: 2 This Week

Last Update: 2026-04-03
13

apache-logs-to-mysql

Apache Log Parser and Data Normalization Application

Apache log parser and data normalization application. Python handles file processing and MySQL handles data processing. ApacheLogs2MySQL consists of two Python modules and one MySQL schema that automate importing access and error log files and normalizing the data into a database designed for reports and data analysis. It runs on Windows, Linux and macOS and has been tested with MySQL versions 8.0.39, 8.4.3, 9.0.0 and 9.1.0. Four LogFormats and two ErrorLogFormats can be loaded, and five MySQL stored procedures can be processed, in a single execution of the Python `ProcessLogs` function. ...

Downloads: 0 This Week

Last Update: 2026-01-22
14

Crawlab

Distributed web crawler admin platform for spiders management

A Golang-based distributed web crawler management platform that supports various languages, including Python, NodeJS, Go, Java and PHP, and various web crawler frameworks, including Scrapy, Puppeteer and Selenium. Use docker-compose for a one-click startup; that way you don't even have to configure a MongoDB database. The frontend app interacts with the master node, which communicates with other components such as MongoDB, SeaweedFS and worker nodes. Master node and worker nodes communicate...

Downloads: 10 This Week

Last Update: 2023-07-26
15

mlscraper

ML-based HTML scraper that learns extraction rules from examples

mlscraper is a Python library designed to automatically extract structured data from HTML pages without requiring developers to manually write CSS selectors or XPath rules. Instead of defining extraction logic by hand, users provide a few examples of the data they want to retrieve from a webpage. It analyzes those examples within the HTML document and determines patterns or rules that can be used to extract the same type of information from similar pages. Once trained, the generated scraper...

Downloads: 4 This Week

Last Update: 4 days ago
16

Ungoogled Chromium Android

Android build for ungoogled-chromium

Ungoogled Chromium Android is the Android platform build configuration and support tooling for Ungoogled Chromium on mobile devices, enabling privacy-minded users and developers to compile a version of the Chromium browser for Android that excludes Google-dependent services, telemetry, and tracking. This repository contains platform-specific patches, build targets, and integration scripts to adapt the upstream Chromium source for Android while eliminating components like Google Play Services hooks, automatic updater mechanisms, and preconfigured search engines that compromise privacy. The goal is to offer an Android browser that feels familiar in capability and rendering fidelity but does not phone home, engage with proprietary APIs, or leak usage data to third-party providers. ...

Downloads: 131 This Week

Last Update: 2026-01-13
17

TRACARDI - Customer Data Platform

TRACARDI: a free, open-source customer data platform

TRACARDI is an easy-to-use, free GUI for Apache Unomi. Unomi is an open source Customer Data Platform that allows anyone to collect user profiles and manage them in a very robust way. TRACARDI, with its API-first approach, enables you to collect data from multiple channels. Whether the source is a website, a mobile app or a CRM system, the open API lets you send data for further processing and integrate it into one consistent user profile. TRACARDI is a free, open-source platform which you can extend...

1 Review

Downloads: 0 This Week

Last Update: 2021-05-04
18

openISI: topical data integration

A tool for autonomous and virtual topical data integration using the focused web-harvesting method.

Downloads: 0 This Week

Last Update: 2013-04-09
19

RAD for zope and relational databases

zetadb is a Python/Zope tool for rapid application development of web applications backed by relational databases. It generates translatable, user-friendly applications for maintaining data over the web. It also implements OpenOffice integration.

Downloads: 0 This Week

Last Update: 2013-04-17
20

Crow

Crow: Computational Representation Of Whatever. A platform for the integration and mining of complex and distributed data. It represents cross-linked semantic web documents as a network of software objects and offers easy ways to filter and sort them.

1 Review

Downloads: 0 This Week

Last Update: 2013-04-17