text-dedup

text-dedup is a Python library for deduplicating large text corpora. It uses MinHash and other probabilistic techniques to detect near-duplicate content, which matters for NLP work because duplicated training data can skew model performance. The project describes itself as scaling to billions of documents, with memory-efficient tools for chunking, hashing, and comparing text. It supports Jaccard similarity thresholding, parallel execution, and configurable deduplication strategies, which suits it to cleaning web-scraped data, language model training datasets, and document archives.
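To make "Jaccard similarity thresholding" concrete, here is a minimal standard-library sketch of the idea. It is not text-dedup's API. The function names (`shingles`, `jaccard`, `is_near_duplicate`), the word 3-gram chunking, and the 0.7 threshold are illustrative assumptions.

```python
def shingles(text, n=3):
    """Return the set of word n-grams (shingles) in `text`.

    Hypothetical helper: word-level 3-grams are an assumption,
    not necessarily how text-dedup chunks documents.
    """
    tokens = text.lower().split()
    if len(tokens) < n:
        # Short texts become a single shingle (or none if empty).
        return {" ".join(tokens)} if tokens else set()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity |A & B| / |A | B| of two sets."""
    if not a and not b:
        return 1.0  # Treat two empty sets as identical.
    return len(a & b) / len(a | b)


def is_near_duplicate(doc_a, doc_b, threshold=0.7, n=3):
    """Flag a pair whose shingle sets meet the similarity threshold."""
    return jaccard(shingles(doc_a, n), shingles(doc_b, n)) >= threshold


if __name__ == "__main__":
    a = "the quick brown fox jumps over the lazy dog"
    b = "the quick brown fox jumps over the lazy cat"
    print(jaccard(shingles(a), shingles(b)))
    print(is_near_duplicate(a, b))
```

Comparing every pair exactly like this costs quadratic time in the number of documents, which is why large-scale tools estimate the similarity from compact signatures instead.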

Features

Fast and scalable near-duplicate detection
Uses MinHash and Jaccard similarity for fuzzy matching
Designed for web-scale datasets with billions of documents
Supports customizable deduplication thresholds
Multi-threaded and memory-efficient processing
Hashing-based representation of text chunks
Optional GPU acceleration for faster computation
Suitable for cleaning NLP and LLM training data
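
The MinHash feature listed above can be sketched in a few lines of standard-library Python. This is an illustration of the general technique under stated assumptions, not text-dedup's implementation. The names `minhash_signature` and `estimate_jaccard`, the use of salted BLAKE2b as the hash family, and the default of 128 hash functions are all choices made for this example.

```python
import hashlib


def _salted_hash(token, seed):
    """64-bit hash of `token` under hash function number `seed`.

    Assumption: salted BLAKE2b stands in for a family of
    independent hash functions.
    """
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(tokens, num_perm=128):
    """Return the minimum hash value per hash function over a token set."""
    unique = set(tokens)
    if not unique:
        raise ValueError("cannot build a signature for an empty token set")
    return [min(_salted_hash(t, seed) for t in unique) for seed in range(num_perm)]


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    # Two synthetic token sets whose true Jaccard similarity is 0.5.
    set_a = {f"w{i}" for i in range(0, 150)}
    set_b = {f"w{i}" for i in range(50, 200)}
    sig_a = minhash_signature(set_a)
    sig_b = minhash_signature(set_b)
    print(estimate_jaccard(sig_a, sig_b))
```

Each signature has a fixed length regardless of document size, so comparing two documents costs the same however long they are. The estimate tightens as `num_perm` grows, at the price of more hashing per document.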

License

Apache License V2.0

Additional Project Details

Programming Language

Python

Related Categories

Python Stream Processing Tool

Registered

2025-04-08
