Kitoken. Tokenize Everything!
Fast tokenizer for language models compatible with SentencePiece, Tokenizers, Tiktoken and more.
What is a Tokenizer?
A tokenizer converts text into a sequence of numbers, called tokens, which a language model is
trained to understand.
There are many algorithms for this process: BPE, Unigram, and WordPiece
are the most popular and widespread.
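As an illustration of the idea, here is a minimal, self-contained sketch of BPE encoding. The merge rules and vocabulary below are made-up toy values, and this is not Kitoken's actual implementation; it only shows the general mechanism of repeatedly merging character pairs and looking up token ids.

```python
# Toy BPE sketch (illustrative only; not Kitoken's implementation).
# A learned vocabulary maps text pieces to integer token ids; encoding
# repeatedly applies the highest-priority merge rule to adjacent pieces,
# then looks up the id of each remaining piece.

MERGES = {("l", "o"): 0, ("lo", "w"): 1}  # merge rules, lower rank = higher priority
VOCAB = {"lo": 256, "low": 257, "l": 108, "o": 111, "w": 119, "e": 101, "r": 114}

def bpe_encode(word: str) -> list[int]:
    pieces = list(word)
    while True:
        # Find the adjacent pair with the best (lowest) merge rank.
        best = None
        for i in range(len(pieces) - 1):
            rank = MERGES.get((pieces[i], pieces[i + 1]))
            if rank is not None and (best is None or rank < best[0]):
                best = (rank, i)
        if best is None:
            break  # no learned merge applies anymore
        _, i = best
        pieces = pieces[:i] + [pieces[i] + pieces[i + 1]] + pieces[i + 2:]
    return [VOCAB[p] for p in pieces]

print(bpe_encode("lower"))  # "l"+"o" -> "lo", "lo"+"w" -> "low" => [257, 101, 114]
```

Real tokenizers work the same way in spirit, but over bytes or Unicode segments, with vocabularies of tens of thousands of entries learned from large corpora.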
All language models use a tokenizer for their text inputs, each with a different set of available tokens.
Kitoken is a tokenizer for any model.