html source extractor free download

Showing 1819 open source projects for "html source extractor"

1

html-metadata

MetaData html scraper and parser for Node.js (supports Promises

The aim of this library is to be a comprehensive source for extracting all HTML-embedded metadata. Currently, it supports Schema.org microdata using a third-party library, a native BEPress, Dublin Core, Highwire Press, JSON-LD, Open Graph, Twitter, EPrints, PRISM, and COinS implementation, and some general metadata that doesn't belong to a particular standard (for instance, the content of the title tag, or meta description tags).

Downloads: 0 This Week

Last Update: 2025-04-30
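
To illustrate what extracting HTML-embedded metadata involves, here is a minimal Python sketch that collects the title tag, named meta tags, and Open Graph properties using only the standard library. It is not html-metadata's implementation or API (that library targets Node.js); the names `MetaExtractor` and `extract_metadata` and the shape of the returned dictionary are assumptions made for this example.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collects <title>, <meta name=...> and Open Graph <meta property=...> values.

    Hypothetical helper for illustration only; not part of html-metadata.
    """

    def __init__(self):
        super().__init__()
        self.general = {}      # title plus named meta tags such as "description"
        self.open_graph = {}   # Open Graph properties with the "og:" prefix removed
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            content = attrs.get("content")
            if content is None:
                return
            prop = attrs.get("property") or ""
            name = attrs.get("name") or ""
            if prop.startswith("og:"):
                self.open_graph[prop[3:]] = content
            elif name:
                self.general[name.lower()] = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text may arrive in several chunks, so append rather than overwrite.
        if self._in_title:
            self.general["title"] = self.general.get("title", "") + data


def extract_metadata(html):
    """Return general and Open Graph metadata found in an HTML string."""
    parser = MetaExtractor()
    parser.feed(html)
    parser.close()
    return {"general": parser.general, "openGraph": parser.open_graph}
```

Calling `extract_metadata(page_source)` on a fetched page returns two dictionaries, one per metadata family. A comprehensive extractor would add further families (Dublin Core, JSON-LD, and so on) in the same way.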
2

Trafilatura

Python & command-line tool to gather text on the Web

...Its main applications are web crawling, downloads, scraping, and extraction of main texts, metadata and comments. It aims at staying handy and modular: no database is required, and the output can be converted to various commonly used formats. Going from raw HTML to essential parts can alleviate many problems related to text quality, first by avoiding the noise caused by recurring elements (headers, footers, links/blogroll etc.) and second by including information such as author and date in order to make sense of the data. The extractor tries to strike a balance between limiting noise (precision) and including all valid parts (recall). ...

Downloads: 0 This Week

Last Update: 2024-12-03
3

jsoup

Java library for working with real-world HTML

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for fetching URLs and extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors. jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers do. jsoup is designed to deal with all varieties of HTML found in the wild; from pristine and validating, to invalid tag-soup; jsoup will create a sensible parse tree. The parser will make...

Downloads: 0 This Week

Last Update: 1 day ago
4

crawley

The unix-way web crawler

Crawls web pages and prints any link it can find. Fast HTML SAX-parser (powered by golang.org/x/net/html). Small (below 1500 SLOC), idiomatic, 100% test-covered codebase. Grabs most useful resource URLs (pics, videos, audios, forms, etc.). Found URLs are streamed to stdout and guaranteed to be unique (with fragments omitted). Scan depth (limited by starting host and path, by default - 0) can be configured. Can crawl rules and sitemaps from robots.txt. Brute mode - scan HTML comments for...

Downloads: 8 This Week

Last Update: 2 days ago
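
The "unique, with fragments omitted" behaviour described above can be sketched in a few lines of Python. This is not crawley's code (crawley is written in Go); `LinkCollector`, `extract_links`, and the `URL_ATTRS` set are hypothetical names chosen for this example, and the attribute list is only an illustrative subset.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin

# Attributes that commonly carry URLs; an illustrative subset, not crawley's list.
URL_ATTRS = {"href", "src", "action", "poster"}


class LinkCollector(HTMLParser):
    """Collects absolute, fragment-free URLs in first-seen order."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.seen = set()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in URL_ATTRS and value:
                # Resolve relative references against the page URL.
                absolute = urljoin(self.base_url, value)
                # Drop the fragment so /a#top and /a#bottom count as one URL.
                url, _fragment = urldefrag(absolute)
                if url and url not in self.seen:
                    self.seen.add(url)
                    self.links.append(url)


def extract_links(html, base_url):
    """Return the unique URLs referenced by an HTML string."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links
```

Printing each item of `extract_links(page_source, page_url)` on its own line gives the kind of stdout stream that other command-line tools can consume.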
5

geckodriver

WebDriver for Firefox

geckodriver is an implementation of WebDriver, and WebDriver can be used for widely different purposes. How you invoke geckodriver largely depends on your use case. If you are using geckodriver through Selenium, you must ensure that you have version 3.11 or greater. Because geckodriver implements the W3C WebDriver standard and not the same Selenium wire protocol older drivers are using, you may experience incompatibilities and migration problems when making the switch from FirefoxDriver to...

Downloads: 72 This Week

Last Update: 2025-02-25
6

openvpn-monitor

openvpn-monitor is a web based OpenVPN monitor

openvpn-monitor is a simple Python program that generates HTML displaying the status of an OpenVPN server, including all current connections. It uses the OpenVPN management console. It typically runs on the same host as the OpenVPN server, though it does not need to. openvpn-monitor is a web-based OpenVPN monitor that shows current connection information, such as users, location, and data transferred.

Downloads: 3 This Week

Last Update: 2025-01-02
7

Happy DOM

Happy DOM is a JavaScript implementation of a web browser

Happy DOM is a JavaScript implementation of a web browser without its graphical user interface. It includes many web standards from WHATWG DOM and HTML. The goal of Happy DOM is to emulate enough of a web browser to be useful for testing, scraping web sites, and server-side rendering. Happy DOM focuses heavily on performance and can be used as an alternative to JSDOM. Happy DOM now supports Declarative Shadow DOM which can be used for server-side rendering of web components. This package...

Downloads: 3 This Week

Last Update: 2026-04-13
8

Eruda

Console for mobile browsers

With Eruda you can display JavaScript logs, check DOM state, show request status, show localStorage and cookie information, show URL and user agent info, include frequently used snippets, and view HTML, JS, and CSS source. The JavaScript file is quite large (about 100 KB gzipped) and therefore not suitable to include in mobile pages. It's recommended to make sure eruda is loaded only when eruda is set to true. During initialization, a configuration object can be passed in. If the container element is not set, Eruda appends an element directly under the html root element. ...

Downloads: 27 This Week

Last Update: 2025-06-15
9

miniblink49

Lighter, faster browser kernel of blink to integrate HTML UI in apps

miniblink is an open source, single-file browser control based on Chromium, and currently the smallest known one. Through its exported pure C interface, a browser control can be created in a few lines of code. It can be called from C++, C#, Delphi, and other languages. It embeds Node.js and supports Electron (with Nodejs, can run...

Downloads: 8 This Week

Last Update: 2025-12-13
10

single-file-cli

CLI tool to save complete web pages as single self-contained HTML file

SingleFile CLI is an open source command-line tool designed to save complete web pages as a single self-contained HTML file. It captures the rendered page in a headless browser and embeds all required resources directly into the output document, including stylesheets, scripts, images, and fonts. By consolidating every dependency into one file, it allows users to preserve a faithful copy of a web page that can be viewed offline without requiring external assets.

Downloads: 4 This Week

Last Update: 2026-03-11
11

Lighthouse

Automated auditing, performance metrics, & best practices for the web

Lighthouse is an open-source, automated tool that analyzes and audits web apps and web pages in order to improve their quality. Lighthouse collects modern performance metrics and insights on developer best practices; auditing for performance, accessibility, SEO and more. After auditing it produces a report either in JSON or HTML. Included in the report is a reference doc that explains the importance of the audit and how to fix the problem areas, which you can use to improve the web app or web page. ...

Downloads: 7 This Week

Last Update: 2026-04-07
12

GoAccess

GoAccess is a real-time web log analyzer and interactive viewer

GoAccess is an open-source, real-time log analyzer and interactive viewer for web server logs. It runs in terminals on UNIX-like systems and can generate standalone HTML, JSON, or CSV reports for browser-based analysis. GoAccess offers enhanced WebSocket authentication, supporting local and external JWT verification, with secure token refresh capabilities and seamless integration with external authentication systems.

5 Reviews

Downloads: 4 This Week

Last Update: 2026-04-01
13

camofox-browser

Headless browser automation server for AI agents to visit sites

camofox-browser is a headless browser automation server built specifically for AI agents that need to interact with websites that often block standard automation stacks. It wraps Camoufox, a Firefox fork that performs fingerprint spoofing at the C++ level, which means many browser characteristics are altered before page scripts can inspect them, rather than relying on JavaScript-layer stealth patches. The project is designed around a REST API, making it easier for agents and external tools...

Downloads: 10 This Week

Last Update: 2 days ago
14

Dillo

Dillo, a multi-platform graphical web browser

...Its goals include enabling web access on old or constrained hardware, using slow or unreliable network connections, minimizing dependencies, and avoiding many of the complexities and overheads of modern full-featured browsers. It omits many modern features (notably JavaScript), instead focusing on rendering HTML (mostly older/standardized subsets), images, and some CSS, while keeping the codebase small. It is free/open source under GPL-3.0.

Downloads: 21 This Week

Last Update: 2025-09-11
15

Beacon

Open-source Content Management System (CMS)

Beacon is a modern open-source CMS built with Phoenix LiveView, offering fast server-rendered HTML for content-heavy pages with LiveView interactivity layered on top. It includes runtime content reloading, SEO-optimized rendering, and an admin interface (Beacon LiveAdmin) for managing pages, layouts, and components in a cluster-friendly setup. Developed by DockYard, Beacon aims to deliver high performance content sites fully within the Elixir ecosystem.

Downloads: 1 This Week

Last Update: 2025-07-10
16

Winter

Free, open-source, self-hosted CMS platform based on the Laravel PHP

...Build intricate websites with little more than HTML, CSS and JavaScript through a beautiful, user-friendly and easy backend.

Downloads: 5 This Week

Last Update: 2026-02-20
17

Spider

High-performance Rust web crawler and scraper for large-scale data

Spider is a high-performance web crawler and web scraping library written in Rust that enables developers to crawl and index websites efficiently. It focuses on speed, concurrency, and reliability by using asynchronous and multi-threaded processing to handle large volumes of web pages. It can rapidly crawl websites to collect links, retrieve page content, and extract structured information from HTML documents. Spider can operate concurrently across many pages, allowing it to gather large...

Downloads: 1 This Week

Last Update: 2026-03-31
18

Browserless

The headless Chrome/Chromium driver on top of Puppeteer

Browserless is an open-source headless browser automation library and service built on top of Puppeteer that simplifies the process of running and scaling Chromium-based browser tasks in production environments. It provides a high-level API for interacting with headless Chrome, allowing developers to perform operations such as generating PDFs, capturing screenshots, extracting text or HTML, and automating web navigation.

Downloads: 4 This Week

Last Update: 7 days ago
19

Jackett

API Support for your favorite torrent trackers

Jackett works as a proxy server: it translates queries from apps (Sonarr, Radarr, SickRage, CouchPotato, Mylar3, Lidarr, DuckieTV, qBittorrent, Nefarious, etc.) into tracker-site-specific HTTP queries, parses the HTML or JSON response, and then sends results back to the requesting software. This allows for getting recent uploads (like RSS) and performing searches. Jackett is a single repository of maintained indexer scraping & translation logic, removing the burden from other apps. Trackers...

Downloads: 173 This Week

Last Update: 11 hours ago
20

goclone

Fast CLI tool for cloning entire websites for local browsing offline

goclone is a command-line utility designed to download and mirror complete websites to a local directory for offline access. It retrieves HTML pages, stylesheets, JavaScript files, images, and other assets from a target site and stores them on the user’s computer. It preserves the original site’s structure by maintaining relative links between pages, allowing the mirrored copy to function similarly to the live version when opened locally. Once a site has been cloned, users can browse the...

Downloads: 7 This Week

Last Update: 2026-03-11
21

Maxun

Small event-delegation library for decoupling event binding and handli

Maxun named JsAction by Google serves as a lightweight event delegation library built in JavaScript. It allows developers to separate the logic of binding events from the code that handles those events, helping to keep DOM event wiring cleaner and more maintainable. It is archived and marked as read-only, indicating that the project is no longer actively maintained or intended for production use. The README states that ongoing development has migrated into a larger framework under the...

Downloads: 4 This Week

Last Update: 2026-03-10
22

reveal.js

The HTML Presentation Framework

reveal.js is a framework for creating beautiful interactive presentations using HTML. It comes with a wide range of features, including nested slides, auto-sliding, touch navigation, Markdown support, PDF export, speaker notes, theming and more. It also comes with a JavaScript API that allows you to control various other options, and a list of plugins that can be used to extend reveal.js further. reveal.js currently offers full support for any recently released version of the following...

Downloads: 5 This Week

Last Update: 2026-04-11
23

TiddlyWiki

A self-contained JavaScript wiki for the browser, Node.js, AWS Lambda

TiddlyWiki5 is a mature, self-contained open-source personal wiki application and non-linear notebook implemented entirely in JavaScript that runs in the browser or a Node.js environment, letting users create, organize, and interlink small pieces of content called tiddlers without the need for a server backend or traditional hierarchical pages. Its entire application — including content, interface, and logic — can live in a single HTML file that users open and edit directly in a web browser, making it portable, offline-capable, and easy to share or archive without dependencies. ...

Downloads: 1 This Week

Last Update: 20 hours ago
24

newpipeextractor

Library for extracting streaming site data without official APIs

...It handles many low-level tasks involved in web data extraction, including parsing responses, managing platform-specific logic, and handling errors, allowing developers to focus on implementing application features rather than scraping mechanics. Each supported service is implemented through its own extractor components that conform to a common interface, enabling consistent access to data across different platforms.

Downloads: 1 This Week

Last Update: 2026-04-10
25

FlareSolverr

Proxy server to bypass Cloudflare protection

FlareSolverr is a proxy server to bypass Cloudflare and DDoS-GUARD protection. FlareSolverr starts a proxy server, and it waits for user requests in an idle state using few resources. When some request arrives, it uses puppeteer with the stealth plugin to create a headless browser (Firefox). It opens the URL with user parameters and waits until the Cloudflare challenge is solved (or timeout). The HTML code and the cookies are sent back to the user, and those cookies can be used to bypass...

1 Review

Downloads: 43 This Week

Last Update: 2025-11-29