html source extractor free download

Showing 1578 open source projects for "html source extractor"

Filters: Internet, Linux

1

html-metadata

MetaData html scraper and parser for Node.js (supports Promises

The aim of this library is to be a comprehensive source for extracting all HTML-embedded metadata. Currently, it supports Schema.org microdata through a third-party library; native implementations of BEPress, Dublin Core, Highwire Press, JSON-LD, Open Graph, Twitter, EPrints, PRISM, and COinS; and some general metadata that doesn't belong to a particular standard (for instance, the content of the title tag or of meta description tags).
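To make the idea of HTML-embedded metadata concrete, here is a minimal sketch in Python using only the standard library. It is not html-metadata's API (that library targets Node.js); the names `MetadataParser` and `extract_basic_metadata` and the shape of the returned dictionary are assumptions made for illustration. It covers only the title tag, general meta tags, and Open Graph properties.

```python
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Collects the <title> text, <meta name=...> tags, and Open Graph tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.general = {}
        self.open_graph = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            content = attrs.get("content")
            if content is None:
                return
            # Attribute values can be None for valueless attributes.
            prop = attrs.get("property") or ""
            name = attrs.get("name") or ""
            if prop.startswith("og:"):
                # Open Graph: <meta property="og:title" content="...">
                self.open_graph[prop[3:]] = content
            elif name:
                # General metadata: <meta name="description" content="...">
                self.general[name] = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_basic_metadata(html):
    """Hypothetical helper: returns a dict of the metadata found in `html`."""
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "general": parser.general,
        "openGraph": parser.open_graph,
    }
```

Standards such as JSON-LD or Schema.org microdata need more than attribute matching (script bodies, nested item scopes), which is presumably why a dedicated library is worthwhile.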

Downloads: 0 This Week

Last Update: 2025-04-30
See Project
2

Trafilatura

Python & command-line tool to gather text on the Web

...Its main applications are web crawling, downloads, scraping, and extraction of main texts, metadata, and comments. It aims to stay handy and modular: no database is required, and the output can be converted to various commonly used formats. Going from raw HTML to the essential parts can alleviate many problems related to text quality, first by avoiding the noise caused by recurring elements (headers, footers, links/blogroll, etc.) and second by including information such as author and date in order to make sense of the data. The extractor tries to strike a balance between limiting noise (precision) and including all valid parts (recall). ...
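The noise-versus-content trade-off described above can be illustrated with a deliberately crude sketch in standard-library Python. This is not Trafilatura's algorithm or API; the tag list `NOISE_TAGS`, the class `MainTextParser`, and the function `extract_main_text` are assumptions for illustration. It keeps paragraph text and drops anything nested inside elements that usually hold recurring page furniture.

```python
from html.parser import HTMLParser

# Assumption: elements that typically contain recurring, non-content text.
NOISE_TAGS = {"header", "footer", "nav", "aside", "script", "style"}


class MainTextParser(HTMLParser):
    """Keeps <p> text that is not nested inside a noise element."""

    def __init__(self):
        super().__init__()
        self._noise_depth = 0
        self._in_paragraph = False
        self._current = []
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._noise_depth += 1
        elif tag == "p" and self._noise_depth == 0:
            self._in_paragraph = True
            self._current = []

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self._noise_depth > 0:
            self._noise_depth -= 1
        elif tag == "p" and self._in_paragraph:
            # Collapse runs of whitespace before storing the paragraph.
            text = " ".join("".join(self._current).split())
            if text:
                self.paragraphs.append(text)
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph and self._noise_depth == 0:
            self._current.append(data)


def extract_main_text(html):
    """Hypothetical helper: returns kept paragraphs joined by newlines."""
    parser = MainTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.paragraphs)
```

A fixed tag list like this favors precision on well-structured pages but loses recall when the main text lives outside `<p>` elements, which is the kind of balance a real extractor has to tune.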

Downloads: 0 This Week

Last Update: 2024-12-03
See Project
3

jsoup

Java library for working with real-world HTML

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for fetching URLs and extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors. jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers do. jsoup is designed to deal with all varieties of HTML found in the wild: from pristine and validating to invalid tag soup, jsoup will create a sensible parse tree. The parser will make...

Downloads: 0 This Week

Last Update: 2026-01-01
See Project
4

Happy DOM

Happy DOM is a JavaScript implementation of a web browser

Happy DOM is a JavaScript implementation of a web browser without its graphical user interface. It includes many web standards from WHATWG DOM and HTML. The goal of Happy DOM is to emulate enough of a web browser to be useful for testing, scraping web sites, and server-side rendering. Happy DOM focuses heavily on performance and can be used as an alternative to JSDOM. Happy DOM now supports Declarative Shadow DOM which can be used for server-side rendering of web components. This package...

Downloads: 5 This Week

Last Update: 6 days ago
See Project
5

geckodriver

WebDriver for Firefox

geckodriver is an implementation of WebDriver, and WebDriver can be used for widely different purposes. How you invoke geckodriver largely depends on your use case. If you are using geckodriver through Selenium, you must ensure that you have version 3.11 or greater. Because geckodriver implements the W3C WebDriver standard and not the same Selenium wire protocol older drivers are using, you may experience incompatibilities and migration problems when making the switch from FirefoxDriver to...

Downloads: 72 This Week

Last Update: 2025-02-25
See Project
6

Eruda

Console for mobile browsers

With Eruda you can display JavaScript logs, check DOM state, show request status, show localStorage and cookie information, show the URL and user agent info, include the snippets you use most often, and view HTML, JS, and CSS source. The JavaScript file is quite large (about 100 KB gzipped) and therefore not suitable to include in mobile pages. It's recommended to make sure Eruda is loaded only when eruda is set to true. At initialization, a configuration object can be passed in. If the container element is not set, Eruda appends an element directly under the html root element. ...

Downloads: 26 This Week

Last Update: 2025-06-15
See Project
7

Maxun

Small event-delegation library for decoupling event binding and handli

Maxun named JsAction by Google serves as a lightweight event delegation library built in JavaScript. It allows developers to separate the logic of binding events from the code that handles those events, helping to keep DOM event wiring cleaner and more maintainable. It is archived and marked as read-only, indicating that the project is no longer actively maintained or intended for production use. The README states that ongoing development has migrated into a larger framework under the...

Downloads: 28 This Week

Last Update: 2026-03-10
See Project
8

Spider

High-performance Rust web crawler and scraper for large-scale data

Spider is a high-performance web crawler and web scraping library written in Rust that enables developers to crawl and index websites efficiently. It focuses on speed, concurrency, and reliability by using asynchronous and multi-threaded processing to handle large volumes of web pages. It can rapidly crawl websites to collect links, retrieve page content, and extract structured information from HTML documents. Spider can operate concurrently across many pages, allowing it to gather large...

Downloads: 6 This Week

Last Update: 2026-03-31
See Project
9

Beacon

Open-source Content Management System (CMS)

Beacon is a modern open-source CMS built with Phoenix LiveView, offering fast server-rendered HTML for content-heavy pages with LiveView interactivity layered on top. It includes runtime content reloading, SEO-optimized rendering, and an admin interface (Beacon LiveAdmin) for managing pages, layouts, and components in a cluster-friendly setup. Developed by DockYard, Beacon aims to deliver high-performance content sites fully within the Elixir ecosystem.

Downloads: 5 This Week

Last Update: 2025-07-10
See Project
10

miniblink49

Lighter, faster browser kernel of blink to integrate HTML UI in apps

miniblink is an open source, single-file browser control based on Chromium, and currently the smallest known one. Through its exported pure C interface, a browser control can be created in a few lines of code. It can be called from C++, C#, Delphi, and other languages. It embeds Node.js and supports Electron (with Nodejs, can run...

Downloads: 10 This Week

Last Update: 2025-12-13
See Project
11

single-file-cli

CLI tool to save complete web pages as single self-contained HTML file

SingleFile CLI is an open source command-line tool designed to save complete web pages as a single self-contained HTML file. It captures the rendered page in a headless browser and embeds all required resources directly into the output document, including stylesheets, scripts, images, and fonts. By consolidating every dependency into one file, it allows users to preserve a faithful copy of a web page that can be viewed offline without requiring external assets.
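The core idea of embedding resources directly into the output document can be sketched with data URIs. The following standard-library Python example is only an illustration, not how SingleFile CLI works internally (the tool renders the page in a headless browser). The helpers `to_data_uri` and `inline_images` are hypothetical, and resources are assumed to be already downloaded and passed in as a dictionary.

```python
import base64
import re


def to_data_uri(mime_type, payload):
    """Encode raw bytes as a base64 data: URI."""
    encoded = base64.b64encode(payload).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Matches the double-quoted src attribute of an <img> tag.
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]+)(")', re.IGNORECASE)


def inline_images(html, resources):
    """Replace <img src="..."> URLs with data URIs.

    `resources` maps the original src value to a (mime_type, bytes) pair.
    Images with no matching entry are left untouched.
    """

    def replace(match):
        src = match.group(2)
        if src not in resources:
            return match.group(0)
        mime_type, payload = resources[src]
        return match.group(1) + to_data_uri(mime_type, payload) + match.group(3)

    return IMG_SRC.sub(replace, html)
```

A regular expression is enough for this sketch, but a complete tool also has to handle stylesheets, fonts, scripts, and content generated at render time, which is why capturing the rendered page matters.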

Downloads: 6 This Week

Last Update: 2026-03-11
See Project
12

openvpn-monitor

openvpn-monitor is a web based OpenVPN monitor

openvpn-monitor is a simple Python program that generates HTML displaying the status of an OpenVPN server, including all current connections. It uses the OpenVPN management console. It typically runs on the same host as the OpenVPN server, but it does not need to. It shows current connection information such as users, location, and data transferred.

Downloads: 1 This Week

Last Update: 2025-01-02
See Project
13

goclone

Fast CLI tool for cloning entire websites for local browsing offline

goclone is a command-line utility designed to download and mirror complete websites to a local directory for offline access. It retrieves HTML pages, stylesheets, JavaScript files, images, and other assets from a target site and stores them on the user’s computer. It preserves the original site’s structure by maintaining relative links between pages, allowing the mirrored copy to function similarly to the live version when opened locally. Once a site has been cloned, users can browse the...

Downloads: 18 This Week

Last Update: 2026-03-11
See Project
14

Lighthouse

Automated auditing, performance metrics, & best practices for the web

Lighthouse is an open-source, automated tool that analyzes and audits web apps and web pages in order to improve their quality. Lighthouse collects modern performance metrics and insights on developer best practices; auditing for performance, accessibility, SEO and more. After auditing it produces a report either in JSON or HTML. Included in the report is a reference doc that explains the importance of the audit and how to fix the problem areas, which you can use to improve the web app or web page. ...

Downloads: 10 This Week

Last Update: 2026-04-07
See Project
15

camofox-browser

Headless browser automation server for AI agents to visit sites

camofox-browser is a headless browser automation server built specifically for AI agents that need to interact with websites that often block standard automation stacks. It wraps Camoufox, a Firefox fork that performs fingerprint spoofing at the C++ level, which means many browser characteristics are altered before page scripts can inspect them, rather than relying on JavaScript-layer stealth patches. The project is designed around a REST API, making it easier for agents and external tools...

Downloads: 9 This Week

Last Update: 22 hours ago
See Project
16

WebMagic

A scalable web crawler framework for Java

WebMagic is a simple but scalable crawler framework. It covers the whole crawler lifecycle: downloading, URL management, content extraction, and persistence. It can simplify the development of a specific crawler. WebMagic has a simple core with high flexibility and a simple API for HTML extraction. It also provides POJO annotations to customize a crawler, with no configuration needed. Some other...

Downloads: 4 This Week

Last Update: 2025-02-10
See Project
17

Jackett

API Support for your favorite torrent trackers

Jackett works as a proxy server: it translates queries from apps (Sonarr, Radarr, SickRage, CouchPotato, Mylar3, Lidarr, DuckieTV, qBittorrent, Nefarious, etc.) into tracker-site-specific HTTP queries, parses the HTML or JSON response, and then sends results back to the requesting software. This allows for getting recent uploads (like RSS) and performing searches. Jackett is a single repository of maintained indexer scraping and translation logic, removing the burden from other apps. Trackers...

Downloads: 186 This Week

Last Update: 1 day ago
See Project
18

Winter

Free, open-source, self-hosted CMS platform based on the Laravel PHP

...Build intricate websites with little more than HTML, CSS and JavaScript through a beautiful, user-friendly and easy backend.

Downloads: 5 This Week

Last Update: 2026-02-20
See Project
19

Browserless

The headless Chrome/Chromium driver on top of Puppeteer

Browserless is an open-source headless browser automation library and service built on top of Puppeteer that simplifies the process of running and scaling Chromium-based browser tasks in production environments. It provides a high-level API for interacting with headless Chrome, allowing developers to perform operations such as generating PDFs, capturing screenshots, extracting text or HTML, and automating web navigation.

Downloads: 4 This Week

Last Update: 5 days ago
See Project
20

Dillo

Dillo, a multi-platform graphical web browser

...Its goals include enabling web access on old or constrained hardware, using slow or unreliable network connections, minimizing dependencies, and avoiding many of the complexities and overheads of modern full-featured browsers. It omits many modern features (notably JavaScript), instead focusing on rendering HTML (mostly older/standardized subsets), images, and some CSS, while keeping the codebase small. It is free/open source under GPL-3.0.

Downloads: 14 This Week

Last Update: 2025-09-11
See Project
21

ScrapeGraphAI

Python scraper based on AI

ScrapeGraphAI is a Python web scraping library that uses LLMs and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, Markdown, etc.). Just say which information you want to extract and the library will do it for you.

Downloads: 14 This Week

Last Update: 19 hours ago
See Project
22

TiddlyWiki

A self-contained JavaScript wiki for the browser, Node.js, AWS Lambda

TiddlyWiki5 is a mature, self-contained open-source personal wiki application and non-linear notebook implemented entirely in JavaScript that runs in the browser or a Node.js environment, letting users create, organize, and interlink small pieces of content called tiddlers without the need for a server backend or traditional hierarchical pages. Its entire application — including content, interface, and logic — can live in a single HTML file that users open and edit directly in a web browser, making it portable, offline-capable, and easy to share or archive without dependencies. ...

Downloads: 2 This Week

Last Update: 2026-01-25
See Project
23

reveal.js

The HTML Presentation Framework

reveal.js is a framework for creating beautiful interactive presentations using HTML. It comes with a wide range of features, including nested slides, auto-sliding, touch navigation, Markdown support, PDF export, speaker notes, theming and more. It also comes with a JavaScript API that allows you to control various other options, and a list of plugins that can be used to extend reveal.js further. reveal.js currently offers full support for any recently released version of the following...

Downloads: 5 This Week

Last Update: 2026-04-11
See Project
24

FlareSolverr

Proxy server to bypass Cloudflare protection

FlareSolverr is a proxy server to bypass Cloudflare and DDoS-GUARD protection. FlareSolverr starts a proxy server that waits for user requests in an idle state, using few resources. When a request arrives, it uses puppeteer with the stealth plugin to create a headless browser (Firefox). It opens the URL with the user's parameters and waits until the Cloudflare challenge is solved (or times out). The HTML code and the cookies are sent back to the user, and those cookies can be used to bypass...

1 Review

Downloads: 49 This Week

Last Update: 2025-11-29
See Project
25

GoAccess

GoAccess is a real-time web log analyzer and interactive viewer

GoAccess is an open-source, real-time log analyzer and interactive viewer for web server logs. It runs in terminals on UNIX-like systems and can generate standalone HTML, JSON, or CSV reports for browser-based analysis. GoAccess offers enhanced WebSocket authentication, supporting local and external JWT verification, with secure token refresh capabilities and seamless integration with external authentication systems.

5 Reviews

Downloads: 1 This Week

Last Update: 2026-04-01
See Project