text-dedup is a Python library for deduplicating large text corpora. It uses MinHash and other probabilistic techniques to detect near-duplicate content, which matters for NLP tasks where duplicated training data can skew model performance. text-dedup scales to billions of documents and provides tools for chunking, hashing, and comparing text with low memory usage. It supports Jaccard similarity thresholding, parallel execution, and flexible deduplication strategies, which makes it well suited to cleaning web-scraped data, language model training datasets, and document archives.
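
To make the idea of Jaccard similarity thresholding concrete, here is a minimal sketch in plain Python. It does not use text-dedup's own API; the function names (shingles, jaccard, is_near_duplicate), the word-level 3-gram shingling, and the 0.7 threshold are all assumptions chosen for illustration.

```python
def shingles(text, n=3):
    """Return the set of word n-grams in `text` (hypothetical helper)."""
    tokens = text.lower().split()
    if len(tokens) < n:
        # Short texts yield a single shingle, or none for empty input.
        return {" ".join(tokens)} if tokens else set()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two sets."""
    if not a and not b:
        return 1.0  # convention: two empty sets are treated as identical
    return len(a & b) / len(a | b)


def is_near_duplicate(doc_a, doc_b, threshold=0.7, n=3):
    """Flag a pair as near-duplicate when similarity reaches the threshold."""
    return jaccard(shingles(doc_a, n), shingles(doc_b, n)) >= threshold
```

Comparing every pair of documents this way costs quadratic time, which is why probabilistic sketches such as MinHash are used at corpus scale instead of exact set comparison.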

Features

  • Fast and scalable near-duplicate detection
  • Uses MinHash and Jaccard similarity for fuzzy matching
  • Designed for web-scale datasets with billions of documents
  • Supports customizable deduplication thresholds
  • Multi-threaded and memory-efficient processing
  • Hashing-based representation of text chunks
  • Optional GPU acceleration for faster computation
  • Suitable for cleaning NLP and LLM training data
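
The MinHash technique named in the feature list can be sketched with the standard library alone. This is an illustrative implementation under stated assumptions, not text-dedup's code: the names minhash_signature and estimate_jaccard, the use of seeded BLAKE2b hashes in place of random permutations, and the default of 128 hash functions are all hypothetical choices.

```python
import hashlib


def _seeded_hash(token, seed):
    """64-bit hash of `token` under a given seed (stands in for one permutation)."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_perm=128):
    """Return a MinHash signature: the minimum hash per seed over the token set."""
    token_set = set(tokens)
    if not token_set:
        raise ValueError("cannot build a signature from an empty token set")
    return [min(_seeded_hash(t, seed) for t in token_set) for seed in range(num_perm)]


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)
```

Each signature has a fixed length regardless of document size, so two documents can be compared in time proportional to num_perm rather than to their lengths. The estimate is approximate; more hash functions reduce its variance at the cost of more computation.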

License

Apache License V2.0

Additional Project Details

Programming Language

Python

Related Categories

Python Stream Processing Tool

Registered

2025-04-08