Fast, state-of-the-art tokenizers, optimized for both research and production. Tokenizers provides implementations of today’s most widely used tokenizers, with a focus on performance and versatility. The same tokenizers are used in Transformers. You can train new vocabularies and tokenize text with them. Thanks to the Rust implementation, both training and tokenization are extremely fast: tokenizing a GB of text takes less than 20 seconds on a server’s CPU. The library is easy to use, yet extremely versatile. It offers full alignment tracking: even with destructive normalization, it is always possible to recover the part of the original sentence that corresponds to any token. It also handles all the pre-processing: truncation, padding, and adding the special tokens your model needs.
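To make the alignment-tracking idea concrete, here is a minimal sketch in plain Python. It is not the Tokenizers implementation or its API; the function names `normalize_with_alignment` and `tokenize_with_offsets` are hypothetical, and the normalizer (lowercasing plus accent stripping) and whitespace splitting are assumptions chosen only to show how a destructive normalization can still map each token back to a span of the original text.

```python
import unicodedata


def normalize_with_alignment(text):
    """Lowercase and strip accents, remembering where each character came from.

    Returns the normalized string and a list `origin` such that
    origin[i] is the index in `text` that produced normalized[i].
    """
    out_chars = []
    origin = []
    for i, ch in enumerate(text):
        # NFD splits an accented letter into base letter + combining marks.
        for piece in unicodedata.normalize("NFD", ch.lower()):
            if unicodedata.combining(piece):
                continue  # destructive step: the accent mark is dropped
            out_chars.append(piece)
            origin.append(i)
    return "".join(out_chars), origin


def tokenize_with_offsets(text):
    """Split the normalized text on whitespace and report, for each token,
    the (start, end) span it covers in the ORIGINAL text."""
    normalized, origin = normalize_with_alignment(text)
    tokens = []
    start = None
    # Walk one position past the end so the last token is flushed.
    for i in range(len(normalized) + 1):
        at_boundary = i == len(normalized) or normalized[i].isspace()
        if at_boundary:
            if start is not None:
                # The span ends where the next surviving character begins,
                # so dropped marks attached to the token stay inside it.
                end = origin[i] if i < len(normalized) else len(text)
                tokens.append((normalized[start:i], (origin[start], end)))
                start = None
        elif start is None:
            start = i
    return tokens


if __name__ == "__main__":
    # Accents are written as combining marks, so normalization shortens the text.
    text = "He\u0301llo  Wo\u0308rld"
    for token, (a, b) in tokenize_with_offsets(text):
        print(token, (a, b), repr(text[a:b]))
```

Slicing the original string with each reported span returns the accented, original-case text, even though the tokens themselves are lowercase and unaccented.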

Features

  • Train new vocabularies and tokenize, using today’s most used tokenizers
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server’s CPU
  • Easy to use, but also extremely versatile
  • Designed for both research and production
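  • Full alignment tracking: even with destructive normalization, the part of the original sentence that corresponds to any token can always be recovered
  • Handles all the pre-processing: truncation, padding, and adding the special tokens your model needs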
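
The pre-processing steps named in the last feature (truncation, padding, special tokens) can be sketched in a few lines of plain Python. This is an illustration of the concept only, not the Tokenizers API: the function name `prepare_batch` and the token IDs used for the start, end, and padding tokens are hypothetical, and a real model defines its own special tokens.

```python
def prepare_batch(sequences, max_length, cls_id=1, sep_id=2, pad_id=0):
    """Truncate, add special tokens, and pad a batch of token-ID lists.

    cls_id, sep_id, and pad_id are placeholder IDs chosen for this sketch.
    Returns (input_ids, attention_mask), both lists of equal-length lists.
    """
    if max_length < 2:
        raise ValueError("max_length must leave room for two special tokens")

    encoded = []
    for ids in sequences:
        # Truncation: keep room for the start and end special tokens.
        body = list(ids[: max_length - 2])
        # Special tokens: wrap the sequence the way many models expect.
        encoded.append([cls_id] + body + [sep_id])

    # Padding: bring every sequence up to the longest one in the batch.
    width = max((len(e) for e in encoded), default=0)
    input_ids = []
    attention_mask = []
    for e in encoded:
        pad = width - len(e)
        input_ids.append(e + [pad_id] * pad)
        # 1 marks real tokens, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(e) + [0] * pad)
    return input_ids, attention_mask


if __name__ == "__main__":
    ids, mask = prepare_batch([[5, 6, 7, 8, 9], [10, 11]], max_length=6)
    print(ids)
    print(mask)
```

The attention mask is returned alongside the IDs because padding is only useful if the model can tell padded positions from real ones.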

License

Apache License V2.0

Additional Project Details

Programming Language

Rust

Related Categories

Rust Artificial Intelligence Software, Rust Machine Learning Software

Registered

2023-03-23