Showing 176 open source projects for "web scraping"

  • 1
    Web Scraping for Laravel

    Laravel adapter for Roach, the complete web scraping toolkit for PHP

    This is the Laravel adapter for Roach, the complete web scraping toolkit for PHP. Easily integrate Roach into any Laravel application. The Laravel adapter mostly provides the necessary container bindings for the various services Roach uses, as well as making certain configuration options available via a config file. The Laravel adapter of Roach registers a few Artisan commands to make our development experience as pleasant as possible.
    Downloads: 4 This Week
  • 2
    The Web MCP

    A powerful Model Context Protocol (MCP) server

    Bright Data’s Web MCP server gives AI assistants robust, real-time web capabilities through an MCP interface designed to avoid blocks, rate limits, and CAPTCHAs. It presents search, crawl, navigate, and extraction tools that agents can call directly, replacing brittle scraping prompts with typed operations. The README markets it as a “gateway” to the live web so assistants don’t fall back to stale training data.
    Downloads: 6 This Week
  • 3
    spider_collection

    Collection of Python web scraping scripts for data extraction tasks

    spider_collection is a collection of Python web crawler scripts created primarily for experimentation, learning, and practical scraping tasks. spider_collection gathers multiple independent spiders designed to collect data from different platforms and services, demonstrating a variety of scraping techniques and workflows. These crawlers make use of common Python scraping tools such as requests, parsel, BeautifulSoup, and the Scrapy framework to extract structured information from web pages. ...
    Downloads: 1 This Week
  • 4
    wombat

    Lightweight Ruby DSL for scraping structured data from web pages

    Wombat is a lightweight web crawling and scraping library written in Ruby that focuses on extracting structured data from web pages using a concise domain-specific language (DSL). It is designed to simplify the process of defining how information should be collected from HTML documents without requiring large amounts of scraping boilerplate code. Developers can declare the data fields they want and specify selectors or rules for retrieving them, allowing Wombat to parse and return structured results. ...
    Downloads: 5 This Week
  • 5
    kimuraframework

    AI-first Ruby framework for building fast, flexible web scraping spide

    Kimurai is an open source web scraping framework written in Ruby that simplifies the process of building automated data extraction tools. It provides a clean domain-specific language that allows developers to define scraping logic and data schemas with minimal boilerplate code. Kimurai can use AI-assisted extraction to identify where data resides in HTML pages, automatically generating selectors that are cached for future use so subsequent scraping runs operate with pure Ruby performance. ...
    Downloads: 1 This Week
  • 6
    skycaiji

    Open source web scraping system for automated data collection tasks

    ...SkyCaiji also supports automated workflows that continuously gather data and process it based on defined collection rules. Its architecture enables users to build scalable web scraping pipelines that can run unattended once configured.
    Downloads: 2 This Week
  • 7
    crawler

    Collection of JS reverse engineering examples for web scraping study

    crawler is a collection of web scraping and JavaScript reverse engineering examples designed for learning how modern websites protect their data and how those protections can be analyzed. It contains many case studies that demonstrate how to analyze and replicate request parameters, cookies, and encryption logic used by real websites. Each directory in the project focuses on a specific target service or scenario, showing how browser network requests and JavaScript code can be studied to reproduce API calls programmatically. ...
    Downloads: 4 This Week
  • 8
    Roach

    The complete web scraping toolkit for PHP

    Roach is a complete web scraping toolkit for PHP. It is a shameless clone heavily inspired by the popular Scrapy package for Python. Roach allows us to define spiders that crawl and scrape web documents. But wait, there’s more. Roach isn’t just a simple crawler, but includes an entire pipeline to clean, persist and otherwise process extracted data as well.
    Downloads: 6 This Week
  • 9
    Deep Research Web UI

    AI-powered research assistant that performs iterative, deep research

    Deep Research Web UI is an AI-powered research assistant interface designed to automate complex, multi-step information gathering workflows through a combination of search engines, web scraping, and large language models. It operates as a front-end system for deep research agents that iteratively refine queries, retrieve information from multiple sources, and synthesize structured outputs into coherent reports.
    Downloads: 1 This Week
  • 10
    rvest

    Simple web scraping for R

    rvest helps you scrape (or harvest) data from web pages. It is designed to work with magrittr to make it easy to express common web scraping tasks, inspired by libraries like Beautiful Soup and RoboBrowser. If you’re scraping multiple pages, I highly recommend using rvest in concert with polite. The polite package ensures that you’re respecting robots.txt and not hammering the site with too many requests.
    Downloads: 0 This Week
  • 11
    X-Crawl

    Flexible Node.js AI-assisted crawler library

    A high-performance web crawling and scraping framework for Node.js, designed for large-scale data extraction.
    Downloads: 7 This Week
  • 12
    ScrapeGraphAI

    Python scraper based on AI

    Extracts content from websites and local documents using LLMs. ScrapeGraphAI is a Python web scraping library that uses LLMs and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, Markdown, etc.). Just say which information you want to extract and the library will do it for you.
    Downloads: 12 This Week
  • 13
    Robin

    AI-powered tool for dark web OSINT search and investigation

    ...Robin also performs scraping of discovered pages through Tor sessions, allowing users to gather additional context from dark web sites while maintaining the required network routing. By integrating AI models, the platform can interpret results, highlight key information, and produce summaries that help analysts understand findings faster. The project provides a modular architecture separating search, scraping, and AI processing components so it can be extended with new data sources.
    Downloads: 18 This Week
  • 14
    Geziyor

    Blazing fast Go framework for web crawling and data scraping tasks

    Geziyor is a high-performance web crawling and web scraping framework built for the Go programming language. It is designed to help developers crawl websites and extract structured information from web pages efficiently. It focuses on speed and scalability, allowing large numbers of requests to be processed concurrently. Geziyor supports use cases such as data mining, monitoring web content, and automated testing workflows.
    Downloads: 2 This Week
  • 15
    Ulixee Hero

    The web browser built for scraping

    It's the first modern headless browser designed specifically for scraping instead of just automated testing. Hero provides access to the W3C DOM specification without the need for Puppeteer's complicated evaluate callbacks and multi-context switching. We've recreated a fully compliant DOM directly in Node.js, allowing you to bypass the headaches of previous scraper tools. The powerful Chrome engine sits under the hood, allowing for lightning fast rendering. Emulators make it easy to disguise...
    Downloads: 5 This Week
  • 16
    Scrapling

    An adaptive Web Scraping framework

    Scrapling is an adaptive web scraping framework designed to handle everything from a single HTTP request to large-scale, concurrent crawls. Built for modern websites, it intelligently adapts to structural changes by automatically relocating elements when page layouts update. The framework includes advanced fetchers capable of bypassing anti-bot protections such as Cloudflare Turnstile using stealth and browser automation techniques.
    Downloads: 4 This Week
  • 17
    Python-Spider

    Python3 web crawler practice

    Python-Spider is a repository intended to teach or provide examples for writing web spiders / crawlers in Python — part of a broader learning and resource collection by its author. The code and documentation are oriented toward beginners or intermediate learners who want to learn how to fetch, parse, and extract data from websites programmatically. As part of the author’s public learning-path repositories, python-spider likely includes examples of HTTP requests, HTML parsing, maybe concurrency or scheduling to crawl multiple pages, and techniques to handle common web-scraping issues. ...
    Downloads: 1 This Week
  • 18
    crawlee

    A web scraping and browser automation library for Node.js

    Crawlee is a web scraping and browser automation library. It helps you build reliable crawlers. Fast. Crawlee won't fix broken selectors for you (yet), but it helps you build and maintain your crawlers faster. When a website adds JavaScript rendering, you don't have to rewrite everything, only switch to one of the browser crawlers. When you later find a great API to speed up your crawls, flip the switch back.
    Downloads: 7 This Week
  • 19
    Parsera

    Lightweight library for scraping websites with LLMs

    Scrape data from any website with only a link and column descriptions. Parsera is a tool designed to scrape web content, specifically handling poorly structured or messy websites.
    Downloads: 5 This Week
  • 20
    MDCx

    Movie metadata scraper and organizer for media libraries and NFO

    ...It also supports image processing tasks such as downloading and cropping artwork used by media centers. It includes several interfaces, allowing users to operate it through a graphical desktop application, a browser-based web interface, or command-line utilities depending on their workflow. Its architecture separates core scraping logic from the user interfaces, allowing the same metadata processing system to be reused across different modes.
    Downloads: 11 This Week
  • 21
    DotnetSpider

    Lightweight .NET framework for fast web crawling and data scraping

    DotnetSpider is a web crawling and data extraction framework built on the .NET Standard platform. It is designed to help developers create efficient and scalable crawlers for collecting structured data from websites. It provides a high-level API that simplifies the process of defining spiders, managing requests, and extracting content from web pages. Developers can create custom spiders by extending base classes and configuring pipelines that handle downloading, parsing, and storing collected data. ...
    Downloads: 4 This Week
  • 22
    Actors MCP Server

    Model Context Protocol (MCP) Server for Apify's Actors

    The Apify Actors MCP Server is a Model Context Protocol (MCP) server that enables AI assistants to interact with Apify Actors. This integration allows AI models to utilize various web scraping and automation tools provided by Apify, facilitating tasks such as data extraction and web automation.
    Downloads: 9 This Week
  • 23
    Python Code Tutorials

    The Python Code Tutorials

    Python Code Tutorials is a large educational repository that aggregates programming tutorials from the “The Python Code” website into a structured collection of Python projects and learning materials. The repository covers a wide range of programming topics including cybersecurity, networking, web scraping, machine learning, GUI development, and automation scripts. Each tutorial typically includes complete Python code examples and explanations that demonstrate how to build real tools and applications step by step. Many tutorials focus on practical implementations such as building network scanners, web scraping tools, object detection systems, and automation utilities using Python libraries. ...
    Downloads: 0 This Week
  • 24
    Firecrawl MCP Server

    Adds powerful web scraping and search to Cursor and Claude

    firecrawl-mcp-server is the official MCP integration for Firecrawl that brings high-recall web scraping, crawling, and search into IDEs and agent runtimes. It exposes tools for single-page scrape, multi-URL batch jobs, site discovery, and search enrichment, returning cleaned, structured content suitable for downstream LLM reasoning. The server is designed to run with Firecrawl’s hosted API or self-hosted deployments, making it flexible for enterprise data-governance requirements. ...
    Downloads: 5 This Week
  • 25
    HeadlessX

    The undetected self-hosted browser automation platform

    HeadlessX is an open-source, self-hosted browser automation platform designed to run headless browsers for tasks such as web scraping, automation, and testing. The system provides a centralized service that allows developers to programmatically control browser sessions and extract data from websites through a structured API. It is built using modern technologies including Node.js, Next.js, TypeScript, and Playwright, and uses a specialized browser engine called Camoufox based on Firefox. ...
    Downloads: 6 This Week
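
The rvest entry above mentions respecting robots.txt and not hammering a site with too many requests. As a rough illustration of that idea (not how rvest or polite implement it, since both are R packages), here is a minimal Python sketch using the standard library's urllib.robotparser. The robots.txt content, the user agent string, the example.com URLs, and the helper names build_parser and fetch_plan are all hypothetical and chosen for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download this
# from the target site instead of hard-coding it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def fetch_plan(parser: RobotFileParser, user_agent: str, urls: list[str]):
    """Return the URLs the user agent may fetch and the delay to wait between requests.

    Falls back to a 1-second delay (an arbitrary assumption) when robots.txt
    does not declare a Crawl-delay.
    """
    allowed = [url for url in urls if parser.can_fetch(user_agent, url)]
    delay = parser.crawl_delay(user_agent) or 1
    return allowed, delay


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    urls = [
        "https://example.com/index.html",
        "https://example.com/private/data.html",
    ]
    allowed, delay = fetch_plan(parser, "example-bot", urls)
    print("allowed:", allowed)
    print("seconds between requests:", delay)
```

A crawler built on this sketch would sleep for the returned delay between requests and skip any URL missing from the allowed list.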