SentencePiece is an unsupervised text tokenizer and detokenizer, mainly for neural network-based text generation systems where the vocabulary size is fixed before the neural model is trained. It implements subword units (e.g., byte-pair encoding (BPE) [Sennrich et al.] and the unigram language model [Kudo]) with the extension of direct training from raw sentences. This makes it possible to build a purely end-to-end system that does not depend on language-specific pre- or postprocessing. SentencePiece is purely data-driven: it trains tokenization and detokenization models from sentences, so pre-tokenization (Moses tokenizer, MeCab, KyTea) is not always required. It treats sentences simply as sequences of Unicode characters, with no language-dependent logic.
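To make "direct training from raw sentences" concrete, here is a minimal toy sketch of byte-pair-encoding merge learning in plain Python. It is not SentencePiece's implementation or API: the function names (`learn_bpe_merges`, `merge_pair`, `encode`) and the tie-breaking rule are assumptions made for illustration. The point it shows is that each sentence is handled as a plain sequence of Unicode characters, spaces included, with no pre-tokenization step.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_bpe_merges(sentences, num_merges):
    """Learn up to `num_merges` merge rules from raw sentences (hypothetical helper)."""
    # Every sentence starts as a list of single Unicode characters.
    # Spaces are ordinary symbols, so no word splitting happens first.
    corpus = [list(sentence) for sentence in sentences]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols in corpus:
            pair_counts.update(zip(symbols, symbols[1:]))
        if not pair_counts:
            break
        # Pick the most frequent pair; ties are broken lexicographically
        # (an assumption of this sketch, chosen only for determinism).
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        corpus = [merge_pair(symbols, best) for symbols in corpus]
    return merges


def encode(text, merges):
    """Apply the learned merges, in order, to a new piece of text."""
    symbols = list(text)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


merges = learn_bpe_merges(["low lower lowest"], num_merges=2)
pieces = encode("lowly", merges)
print(merges)
print(pieces)
```

Because the pieces are just substrings of the original character sequence, joining them with `"".join(pieces)` restores the input exactly, which is the property that lets detokenization work without language-specific rules.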

Features

  • Multiple subword algorithms
  • Subword regularization
  • Fast and lightweight
  • Self-contained
  • Direct vocabulary id generation
  • NFKC-based normalization
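As a rough illustration of what NFKC-based normalization does to input text, the sketch below uses Python's standard `unicodedata` module. This is standard Unicode NFKC only; SentencePiece's built-in normalization rules may differ in detail, and the helper name `nfkc_normalize` is an assumption made for this example.

```python
import unicodedata


def nfkc_normalize(text):
    """Apply standard Unicode NFKC normalization (hypothetical helper)."""
    return unicodedata.normalize("NFKC", text)


# Full-width letters and digits, a ligature, and half-width katakana
# are folded into their canonical compatibility forms.
samples = ["ＡＢＣ１２３", "ﬁne", "ｶﾞ"]
normalized = [nfkc_normalize(s) for s in samples]
for before, after in zip(samples, normalized):
    print(repr(before), "->", repr(after))
```

Normalizing before tokenization means visually equivalent inputs map to the same character sequence, so they share vocabulary entries instead of producing separate pieces.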

Categories

Machine Learning

License

Apache License V2.0

Additional Project Details

Operating Systems

Mac

Programming Language

C++

Related Categories

C++ Machine Learning Software

Registered

2021-10-06