Showing 1578 open source projects for "html source extractor"

  • 1
    html-metadata

    MetaData html scraper and parser for Node.js (supports Promises

    The aim of this library is to be a comprehensive source for extracting all HTML-embedded metadata. Currently, it supports Schema.org microdata using a third-party library, a native BEPress, Dublin Core, Highwire Press, JSON-LD, Open Graph, Twitter, EPrints, PRISM, and COinS implementation, and some general metadata that doesn't belong to a particular standard (for instance, the content of the title tag, or meta description tags).
    Downloads: 0 This Week
  • 2
    Trafilatura

    Python & command-line tool to gather text on the Web

    ...Its main applications are web crawling, downloads, scraping, and extraction of main texts, metadata and comments. It aims at staying handy and modular: no database is required, the output can be converted to various commonly used formats. Going from raw HTML to essential parts can alleviate many problems related to text quality, first by avoiding the noise caused by recurring elements (headers, footers, links/blogroll etc.) and second by including information such as author and date in order to make sense of the data. The extractor tries to strike a balance between limiting noise (precision) and including all valid parts (recall). ...
    Downloads: 0 This Week
  • 3
    jsoup

    Java library for working with real-world HTML

    jsoup is a Java library for working with real-world HTML. It provides a very convenient API for fetching URLs and extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors. jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers do. jsoup is designed to deal with all varieties of HTML found in the wild; from pristine and validating, to invalid tag-soup; jsoup will create a sensible parse tree. The parser will make...
    Downloads: 0 This Week
  • 4
    Happy DOM

    Happy DOM is a JavaScript implementation of a web browser

    Happy DOM is a JavaScript implementation of a web browser without its graphical user interface. It includes many web standards from WHATWG DOM and HTML. The goal of Happy DOM is to emulate enough of a web browser to be useful for testing, scraping web sites, and server-side rendering. Happy DOM focuses heavily on performance and can be used as an alternative to JSDOM. Happy DOM now supports Declarative Shadow DOM which can be used for server-side rendering of web components. This package...
    Downloads: 5 This Week
  • 5
    geckodriver

    WebDriver for Firefox

    geckodriver is an implementation of WebDriver, and WebDriver can be used for widely different purposes. How you invoke geckodriver largely depends on your use case. If you are using geckodriver through Selenium, you must ensure that you have version 3.11 or greater. Because geckodriver implements the W3C WebDriver standard and not the same Selenium wire protocol older drivers are using, you may experience incompatibilities and migration problems when making the switch from FirefoxDriver to...
    Downloads: 72 This Week
  • 6
    Eruda

    Console for mobile browsers

    With Eruda you can display JavaScript logs, check DOM state, show request status, show localStorage and cookie information, show URL and user agent info, include frequently used snippets, and view HTML, JS, and CSS source. The JavaScript file is quite large (about 100 KB gzipped) and therefore not suitable for inclusion in mobile pages. It's recommended to make sure Eruda is loaded only when eruda is set to true. During initialization, a configuration object can be passed in. If the container element is not set, Eruda appends an element directly under the html root element. ...
    Downloads: 26 This Week
  • 7
    Maxun

    Small event-delegation library for decoupling event binding and handli

    Maxun named JsAction by Google serves as a lightweight event delegation library built in JavaScript. It allows developers to separate the logic of binding events from the code that handles those events, helping to keep DOM event wiring cleaner and more maintainable. It is archived and marked as read-only, indicating that the project is no longer actively maintained or intended for production use. The README states that ongoing development has migrated into a larger framework under the...
    Downloads: 28 This Week
  • 8
    Spider

    High-performance Rust web crawler and scraper for large-scale data

    Spider is a high-performance web crawler and web scraping library written in Rust that enables developers to crawl and index websites efficiently. It focuses on speed, concurrency, and reliability by using asynchronous and multi-threaded processing to handle large volumes of web pages. It can rapidly crawl websites to collect links, retrieve page content, and extract structured information from HTML documents. Spider can operate concurrently across many pages, allowing it to gather large...
    Downloads: 6 This Week
  • 9
    Beacon

    Open-source Content Management System (CMS)

    Beacon is a modern open-source CMS built with Phoenix LiveView, offering fast server-rendered HTML for content-heavy pages with LiveView interactivity layered on top. It includes runtime content reloading, SEO-optimized rendering, and an admin interface (Beacon LiveAdmin) for managing pages, layouts, and components in a cluster-friendly setup. Developed by DockYard, Beacon aims to deliver high performance content sites fully within the Elixir ecosystem.
    Downloads: 5 This Week
  • 10
    miniblink49

    Lighter, faster browser kernel of blink to integrate HTML UI in apps

    miniblink is an open source, single-file browser control based on Chromium, and currently the smallest known Chromium-based browser control. Through its exported pure C interface, a browser control can be created in a few lines of code. It can be called from C++, C#, Delphi, and other languages. Embedded Nodejs, support electron (with Nodejs, can run...
    Downloads: 10 This Week
  • 11
    single-file-cli

    CLI tool to save complete web pages as single self-contained HTML file

    SingleFile CLI is an open source command-line tool designed to save complete web pages as a single self-contained HTML file. It captures the rendered page in a headless browser and embeds all required resources directly into the output document, including stylesheets, scripts, images, and fonts. By consolidating every dependency into one file, it allows users to preserve a faithful copy of a web page that can be viewed offline without requiring external assets.
    Downloads: 6 This Week
  • 12
    openvpn-monitor

    openvpn-monitor is a web based OpenVPN monitor

    openvpn-monitor is a simple Python program that generates HTML displaying the status of an OpenVPN server, including all current connections. It uses the OpenVPN management console. It typically runs on the same host as the OpenVPN server, although it does not need to. It shows current connection information such as users, location, and data transferred.
    Downloads: 1 This Week
  • 13
    goclone

    Fast CLI tool for cloning entire websites for local browsing offline

    goclone is a command-line utility designed to download and mirror complete websites to a local directory for offline access. It retrieves HTML pages, stylesheets, JavaScript files, images, and other assets from a target site and stores them on the user’s computer. It preserves the original site’s structure by maintaining relative links between pages, allowing the mirrored copy to function similarly to the live version when opened locally. Once a site has been cloned, users can browse the...
    Downloads: 18 This Week
  • 14
    Lighthouse

    Automated auditing, performance metrics, & best practices for the web

    Lighthouse is an open-source, automated tool that analyzes and audits web apps and web pages in order to improve their quality. Lighthouse collects modern performance metrics and insights on developer best practices; auditing for performance, accessibility, SEO and more. After auditing it produces a report either in JSON or HTML. Included in the report is a reference doc that explains the importance of the audit and how to fix the problem areas, which you can use to improve the web app or web page. ...
    Downloads: 10 This Week
  • 15
    camofox-browser

    Headless browser automation server for AI agents to visit sites

    camofox-browser is a headless browser automation server built specifically for AI agents that need to interact with websites that often block standard automation stacks. It wraps Camoufox, a Firefox fork that performs fingerprint spoofing at the C++ level, which means many browser characteristics are altered before page scripts can inspect them, rather than relying on JavaScript-layer stealth patches. The project is designed around a REST API, making it easier for agents and external tools...
    Downloads: 9 This Week
  • 16
    WebMagic

    A scalable web crawler framework for Java

    WebMagic is a simple but scalable crawler framework. It covers the whole crawler lifecycle: downloading, URL management, content extraction, and persistence. It simplifies the development of a specific crawler. WebMagic has a simple core with high flexibility and a simple API for HTML extraction. It also provides POJO annotations to customize a crawler, with no configuration needed. Some other...
    Downloads: 4 This Week
  • 17
    Jackett

    API Support for your favorite torrent trackers

    Jackett works as a proxy server: it translates queries from apps (Sonarr, Radarr, SickRage, CouchPotato, Mylar3, Lidarr, DuckieTV, qBittorrent, Nefarious, etc.) into tracker-site-specific HTTP queries, parses the HTML or JSON response, and then sends results back to the requesting software. This allows for getting recent uploads (like RSS) and performing searches. Jackett is a single repository of maintained indexer scraping & translation logic, removing the burden from other apps. Trackers...
    Downloads: 186 This Week
  • 18
    Winter

    Free, open-source, self-hosted CMS platform based on the Laravel PHP

    ...Build intricate websites with little more than HTML, CSS and JavaScript through a beautiful, user-friendly and easy backend.
    Downloads: 5 This Week
  • 19
    Browserless

    The headless Chrome/Chromium driver on top of Puppeteer

    Browserless is an open-source headless browser automation library and service built on top of Puppeteer that simplifies the process of running and scaling Chromium-based browser tasks in production environments. It provides a high-level API for interacting with headless Chrome, allowing developers to perform operations such as generating PDFs, capturing screenshots, extracting text or HTML, and automating web navigation.
    Downloads: 4 This Week
  • 20
    Dillo

    Dillo, a multi-platform graphical web browser

    ...Its goals include enabling web access on old or constrained hardware, using slow or unreliable network connections, minimizing dependencies, and avoiding many of the complexities and overheads of modern full-featured browsers. It omits many modern features (notably JavaScript), instead focusing on rendering HTML (mostly older/standardized subsets), images, and some CSS, while keeping the codebase small. It is free/open source under GPL-3.0.
    Downloads: 14 This Week
  • 21
    ScrapeGraphAI

    Python scraper based on AI

    ScrapeGraphAI is a web scraping Python library that uses LLMs and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, Markdown, etc.). Just say which information you want to extract and the library will do it for you.
    Downloads: 14 This Week
  • 22
    TiddlyWiki

    A self-contained JavaScript wiki for the browser, Node.js, AWS Lambda

    TiddlyWiki5 is a mature, self-contained open-source personal wiki application and non-linear notebook implemented entirely in JavaScript that runs in the browser or a Node.js environment, letting users create, organize, and interlink small pieces of content called tiddlers without the need for a server backend or traditional hierarchical pages. Its entire application — including content, interface, and logic — can live in a single HTML file that users open and edit directly in a web browser, making it portable, offline-capable, and easy to share or archive without dependencies. ...
    Downloads: 2 This Week
  • 23
    reveal.js

    The HTML Presentation Framework

    reveal.js is a framework for creating beautiful interactive presentations using HTML. It comes with a wide range of features, including nested slides, auto-sliding, touch navigation, Markdown support, PDF export, speaker notes, theming and more. It also comes with a JavaScript API that allows you to control various other options, and a list of plugins that can be used to extend reveal.js further. reveal.js currently offers full support for any recently released version of the following...
    Downloads: 5 This Week
  • 24
    FlareSolverr

    Proxy server to bypass Cloudflare protection

    FlareSolverr is a proxy server to bypass Cloudflare and DDoS-GUARD protection. FlareSolverr starts a proxy server that waits for user requests in an idle state, using few resources. When a request arrives, it uses puppeteer with the stealth plugin to create a headless browser (Firefox). It opens the URL with user parameters and waits until the Cloudflare challenge is solved (or a timeout occurs). The HTML code and the cookies are sent back to the user, and those cookies can be used to bypass...
    Downloads: 49 This Week
  • 25
    GoAccess

    GoAccess is a real-time web log analyzer and interactive viewer

    GoAccess is an open-source, real-time log analyzer and interactive viewer for web server logs. It runs in terminals on UNIX-like systems and can generate standalone HTML, JSON, or CSV reports for browser-based analysis. GoAccess offers enhanced WebSocket authentication, supporting local and external JWT verification, with secure token refresh capabilities and seamless integration with external authentication systems.
    Downloads: 1 This Week
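
Several of the projects listed above extract metadata embedded in HTML; entry 1, for instance, mentions the content of the title tag, meta description tags, and Open Graph. As a rough illustration of that idea only, and not the API or implementation of html-metadata or any other listed project, here is a minimal Python sketch built on the standard library's html.parser module. The names MetadataParser and extract_metadata are hypothetical, and the grouping into "general" and "open_graph" is an assumption made for this example.

```python
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Hypothetical helper: collects <title>, <meta name=...> and og:* values."""

    def __init__(self):
        super().__init__()
        self.metadata = {"general": {}, "open_graph": {}}
        self._in_title = False
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # attribute names arrive lowercased
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            content = attrs.get("content")
            if content is None:
                return  # nothing to record without a content attribute
            prop = attrs.get("property") or ""
            name = attrs.get("name") or ""
            if prop.startswith("og:"):
                # Open Graph: store "og:title" under the key "title"
                self.metadata["open_graph"][prop[3:]] = content
            elif name:
                self.metadata["general"][name.lower()] = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
            title = "".join(self._title_parts).strip()
            self.metadata["general"]["title"] = title

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)


def extract_metadata(html):
    """Parse an HTML string and return the collected metadata dict."""
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    return parser.metadata
```

Calling extract_metadata on a page's HTML string returns a dictionary with the two groups; real-world extractors handle many more vocabularies (JSON-LD, Dublin Core, microdata) and malformed markup than this sketch does.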