Byte Fallback BPE Tokenizer

  • Trained using huggingface/tokenizers
  • Vocab size: 72,000

Training Args


from datasets import load_dataset
from tokenizers import Tokenizer, Regex, models, normalizers, pre_tokenizers, trainers

VOCAB_SIZE = 72000
SPECIAL_TOKENS = ["<|begin_of_text|>", "<|end_of_text|>", "<|pad|>"]
# DATASET_NAME and TEXT_COLUMN are set per data source (see Training Composition below)

### GPT-4-turbo-style split regex (you can substitute your own, but this one works well)
pat_str = "|".join(
    [
        r"""[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]*[\p{Ll}\p{Lm}\p{Lo}\p{M}]+(?i:'s|'t|'re|'ve|'m|'ll|'d)?""",
        r"""[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]+[\p{Ll}\p{Lm}\p{Lo}\p{M}]*(?i:'s|'t|'re|'ve|'m|'ll|'d)?""",
        r"""\p{N}{1,3}""",
        r""" ?[^\s\p{L}\p{N}]+[\r\n/]*""",
        r"""\s*[\r\n]+""",
        r"""\s+(?!\S)""",
        r"""\s+""",
    ]
)


# Initialize tokenizer with script-aware settings
tokenizer = Tokenizer(models.BPE(
    byte_fallback=True,
    unk_token=None,
    fuse_unk=False
))

# Pre-tokenizer: regex split followed by byte-level encoding, for multilingual support
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.Split(
        pattern=Regex(pat_str),
        behavior="isolated",
        invert=False
    ),
    pre_tokenizers.ByteLevel(
        add_prefix_space=False,
        trim_offsets=True,
        use_regex=False
    )
])

# Normalizer (modified for Indic languages)
tokenizer.normalizer = normalizers.Sequence([
    normalizers.NFC(),  # Safer than NFKC for Indic scripts
])

from tokenizers import decoders
tokenizer.decoder = decoders.ByteLevel(
    add_prefix_space=False  # Must match pre-tokenizer settings
)



# Optimized trainer configuration
trainer = trainers.BpeTrainer(
    vocab_size=VOCAB_SIZE,
    special_tokens=SPECIAL_TOKENS,
    min_frequency=1,  # Lower frequency for low-resource languages
    show_progress=True,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
    max_token_length=24,
    continuing_subword_prefix=""
)

def get_corpus():
    # Load the full training split, shuffle it, and drop empty rows.
    # The corpus is repeated 3x, matching the "× 3" factors in the composition below.
    dataset = load_dataset(DATASET_NAME, split="train")
    shuffled = dataset.shuffle(seed=42)
    return [text for text in shuffled[TEXT_COLUMN] if text.strip()] * 3
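
Putting the pieces above together, training and saving looks roughly like this (a minimal sketch; the output file name is just an example):

# Train the BPE model on the corpus and write the result to disk
corpus = get_corpus()
tokenizer.train_from_iterator(corpus, trainer=trainer, length=len(corpus))
tokenizer.save("tokenizer.json")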

Special Tokens

{'bos_token': '<|begin_of_text|>', 'eos_token': '<|end_of_text|>', 'pad_token': '<|pad|>'}
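
For use with the Transformers library, the trained tokenizer can be wrapped in a PreTrainedTokenizerFast with these special tokens (a minimal sketch; the save directory is just an example):

from transformers import PreTrainedTokenizerFast

# Wrap the raw tokenizers.Tokenizer so it behaves like any Hugging Face tokenizer
hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    bos_token="<|begin_of_text|>",
    eos_token="<|end_of_text|>",
    pad_token="<|pad|>",
)
hf_tokenizer.save_pretrained("72k-TK-BBPE-HF")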

Training Composition:

  • Maths: 550 M × 3 (aluncstokes/mathpile_arxiv_subset)
  • Code: 800 M × 3 (codeparrot/github-code)
  • Hinglish: 250 M × 3 (Abhishekcr448/Hinglish-Everyday-Conversations-1M, Maihar/hinglish-80k)
  • English: 2,000 M × 3 (allenai/c4, config "en")
  • Hindi: 2,200 M × 3 (aloobun/dhpileIN, data_dir='hi')
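
One way to stream these sources for training is sketched below; the column names and the use of streaming are assumptions, not the exact recipe used (the Hinglish sources follow the same pattern and are omitted for brevity):

from datasets import load_dataset

def stream_corpus():
    # Sketch: stream each source and yield its raw text column.
    # Column names here are assumptions; check each dataset's schema before use.
    maths = load_dataset("aluncstokes/mathpile_arxiv_subset", split="train", streaming=True)
    code = load_dataset("codeparrot/github-code", split="train", streaming=True)
    english = load_dataset("allenai/c4", "en", split="train", streaming=True)
    hindi = load_dataset("aloobun/dhpileIN", data_dir="hi", split="train", streaming=True)
    for ds, column in [(maths, "text"), (code, "code"), (english, "text"), (hindi, "text")]:
        for row in ds:
            if row[column] and row[column].strip():
                yield row[column]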

Evals

Tokenization Efficiency (token counts; lower is better)

| Tokenizer | English | Hindi | Tamil | Bengali | Malayalam | Telugu | Gujarati | Punjabi | Code (Python) | Code (Java) | Code (C++) | Math |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| deepseek-ai/DeepSeek-R1 (128k) | 338874 | 22855 | 48957 | 39617 | 73928 | 40345 | 101020 | 79172 | 5231 | 2224 | 7055 | 5376 |
| unsloth/phi-4 (100k) | 308645 | 40456 | 59750 | 116122 | 149889 | 48689 | 118335 | 87413 | 4809 | 2110 | 6529 | 5573 |
| deepseek-ai/DeepSeek-R1-Distill-Llama-8B (128k) | 308512 | 21110 | 59625 | 115138 | 149883 | 48661 | 118061 | 86765 | 4809 | 2111 | 6530 | 5574 |
| unsloth/gemma-2-9b-it (256k) | 323335 | 15916 | 53913 | 53402 | 57219 | 47610 | 107925 | 87222 | 5948 | 2569 | 8639 | 5871 |
| Ornaments/72k-Bilingual-BBPE-TK-SPM (72k, old) | 366710 | 11447 | 61408 | 94191 | 97207 | 50229 | 117874 | 90045 | 8201 | 4000 | 13706 | 5585 |
| Ornaments/72k-Bilingual-BBPE-TK-SPM-Identity (72k) | 330830 | 10318 | 59089 | 93740 | 92655 | 44975 | 109411 | 87922 | 7819 | 3743 | 12953 | 5253 |
| Ornaments/72k-TK-BBPE-HF (72k, this tokenizer) | 321274 | 10813 | 67585 | 159985 | 193813 | 55654 | 134397 | 97063 | 5225 | 2263 | 7090 | 5150 |
| nvidia/Nemotron-4-Mini-Hindi-4B-Instruct (256k) | 332271 | 14327 | 55473 | 36615 | 45783 | 48270 | 160115 | 117174 | 6186 | 2732 | 8861 | 6136 |
| sarvamai/OpenHathi-7B-Hi-v0.1-Base (48k) | 370133 | 15633 | 67845 | 120340 | 105953 | 68315 | 159122 | 113817 | 6595 | 2792 | 9233 | 6223 |
| sarvamai/sarvam-1 (68k) | 385386 | 11257 | 61396 | 27348 | 31822 | 51463 | 119666 | 103344 | 7331 | 3068 | 9724 | 6864 |
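
The counts above come from encoding the same fixed text sample per language/domain with each tokenizer. A minimal sketch of that measurement, assuming the comparison tokenizers load with AutoTokenizer and samples is a dict of raw texts:

from transformers import AutoTokenizer

def token_counts(model_ids, samples):
    # samples: {"English": "...", "Hindi": "...", ...} -- placeholder raw texts
    results = {}
    for model_id in model_ids:
        tok = AutoTokenizer.from_pretrained(model_id)
        results[model_id] = {name: len(tok.encode(text)) for name, text in samples.items()}
    return results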

Encode-Decode

  • Hindi
Input  : ऋतुराज गायकवाड़ (कप्तान), डेवोन कॉनवे, रचिन रविंद्र, राहुल त्रिपाठी, शिवम दुबे, रविंद्र जडेजा, एमएस धोनी (विकेटकीपर), आर अश्विन, मीथाशा पथिराना, खलील अहमद, नूर अहमद।
Tokens: ['à¤ĭ', 'त', 'à¥ģà¤°à¤¾à¤ľ', 'Ġà¤Ĺायà¤ķ', 'वाड़', 'Ġ(', 'à¤ķपà¥įतान', '),', 'Ġडà¥ĩ', 'व', 'à¥ĭन', 'Ġà¤ķà¥īन', 'वà¥ĩ', ',', 'Ġरà¤ļ', 'िन', 'Ġरविà¤Ĥदà¥įर', ',', 'Ġराहà¥ģल', 'Ġतà¥įरिप', 'à¤¾à¤łà¥Ģ', ',', 'Ġशिवम', 'Ġदà¥ģबà¥ĩ', ',', 'Ġरविà¤Ĥदà¥įर', 'Ġà¤ľà¤¡à¥ĩà¤ľà¤¾', ',', 'Ġà¤ıमà¤ıस', 'Ġधà¥ĭनà¥Ģ', 'Ġ(', 'व', 'िà¤ķà¥ĩà¤Ł', 'à¤ķà¥Ģपर', '),', 'Ġà¤Ĩर', 'Ġà¤ħशà¥įविन', ',', 'Ġमà¥Ģ', 'थ', 'ाशा', 'Ġपथ', 'िर', 'ाना', ',', 'Ġà¤ĸलà¥Ģल', 'Ġà¤ħहमद', ',', 'Ġनà¥Ĥर', 'Ġà¤ħहमद', '।']
Encoded: [38659, 299, 21358, 15506, 7249, 509, 28249, 1222, 2308, 357, 1731, 8940, 2506, 14, 17890, 504, 19058, 14, 4384, 9183, 7568, 14, 18827, 13293, 14, 19058, 13516, 14, 17978, 12756, 509, 357, 3072, 14080, 1222, 2215, 17009, 14, 7584, 942, 22395, 11558, 647, 901, 14, 39383, 6593, 14, 25750, 6593, 337]
Len Tokens 51
Decoded: ऋतुराज गायकवाड़ (कप्तान), डेवोन कॉनवे, रचिन रविंद्र, राहुल त्रिपाठी, शिवम दुबे, रविंद्र जडेजा, एमएस धोनी (विकेटकीपर), आर अश्विन, मीथाशा पथिराना, खलील अहमद, नूर अहमद।
  • English
Input  : Bangalore and Chennai have faced each other in 33 matches in IPL. Out of these 33 games, Bangalore have won 11 whereas Chennai have come out victorious on 21 occasion. 1 match ended without a result.
Tokens: ['B', 'ang', 'alore', 'Ġand', 'ĠChennai', 'Ġhave', 'Ġfaced', 'Ġeach', 'Ġother', 'Ġin', 'Ġ', '33', 'Ġmatches', 'Ġin', 'ĠIPL', '.', 'ĠOut', 'Ġof', 'Ġthese', 'Ġ', '33', 'Ġgames', ',', 'ĠBangalore', 'Ġhave', 'Ġwon', 'Ġ', '11', 'Ġwhereas', 'ĠChennai', 'Ġhave', 'Ġcome', 'Ġout', 'Ġvict', 'orious', 'Ġon', 'Ġ', '21', 'Ġoccasion', '.', 'Ġ', '1', 'Ġmatch', 'Ġended', 'Ġwithout', 'Ġa', 'Ġresult', '.']
Encoded: [36, 951, 30658, 364, 45274, 688, 20861, 1993, 1101, 360, 223, 3276, 15006, 360, 11519, 16, 7921, 368, 1576, 223, 3276, 5013, 14, 45076, 688, 4896, 223, 1281, 21170, 45274, 688, 3051, 892, 9592, 29166, 462, 223, 2428, 13344, 16, 223, 19, 5359, 12784, 2752, 284, 1899, 16]
Len Tokens 48
Decoded: Bangalore and Chennai have faced each other in 33 matches in IPL. Out of these 33 games, Bangalore have won 11 whereas Chennai have come out victorious on 21 occasion. 1 match ended without a result.
  • Math
Input  : % Change the font if you want to, depending on whether
% you're using pdflatex or xelatex/lualatex
% WHEN COMPILING WITH XELATEX PLEASE USE
% xelatex -shell-escape -output-driver="xdvipdfmx -z 0" sample.tex
\iftutex
  % If using xelatex or lualatex:
  \setmainfont{Roboto Slab}
  \setsansfont{Lato}
  \renewcommand{\familydefault}{\sfdefault}
\else
  % If using pdflatex:
  \usepackage[rm]{roboto}
  \usepackage[defaultsans]{lato}
  % \usepackage{sourcesanspro}
  \renewcommand{\familydefault}{\sfdefault}
\fi

Tokens: ['%', 'ĠChange', 'Ġthe', 'Ġfont', 'Ġif', 'Ġyou', 'Ġwant', 'Ġto', ',', 'Ġdepending', 'Ġon', 'Ġwhether', 'Ċ', '%', "Ġyou're", 'Ġusing', 'Ġpd', 'fl', 'ate', 'x', 'Ġor', 'Ġx', 'el', 'ate', 'x', '/l', 'ual', 'ate', 'x', 'Ċ', '%', 'ĠWH', 'EN', 'ĠCOMP', 'IL', 'ING', 'ĠWITH', 'ĠX', 'EL', 'ATE', 'X', 'ĠPLEASE', 'ĠUSE', 'Ċ', '%', 'Ġx', 'el', 'ate', 'x', 'Ġ-', 'shell', '-', 'escape', 'Ġ-', 'output', '-d', 'river', '="', 'xd', 'v', 'ip', 'df', 'mx', 'Ġ-', 'z', 'Ġ', '0', '"', 'Ġsample', '.', 'tex', 'Ċ', '\\', 'ift', 'utex', 'Ċ', 'Ġ', 'Ġ%', 'ĠIf', 'Ġusing', 'Ġx', 'el', 'ate', 'x', 'Ġor', 'Ġl', 'ual', 'ate', 'x', ':Ċ', 'Ġ', 'Ġ\\', 'set', 'main', 'font', '{R', 'ob', 'oto', 'ĠSl', 'ab', '}Ċ', 'Ġ', 'Ġ\\', 'sets', 'ans', 'font', '{L', 'ato', '}Ċ', 'Ġ', 'Ġ\\', 'renewcommand', '{\\', 'family', 'default', '}{\\', 'sf', 'default', '}Ċ', '\\', 'else', 'Ċ', 'Ġ', 'Ġ%', 'ĠIf', 'Ġusing', 'Ġpd', 'fl', 'ate', 'x', ':Ċ', 'Ġ', 'Ġ\\', 'us', 'ep', 'ackage', '[', 'rm', ']{', 'rob', 'oto', '}Ċ', 'Ġ', 'Ġ\\', 'us', 'ep', 'ackage', '[', 'defaults', 'ans', ']{', 'l', 'ato', '}Ċ', 'Ġ', 'Ġ%', 'Ġ\\', 'us', 'ep', 'ackage', '{s', 'ources', 'ans', 'pro', '}Ċ', 'Ġ', 'Ġ\\', 'renewcommand', '{\\', 'family', 'default', '}{\\', 'sf', 'default', '}Ċ', '\\fi', 'Ċ']
Encoded: [7, 20642, 307, 10013, 803, 449, 1654, 349, 14, 11248, 462, 4806, 201, 7, 7412, 2233, 34245, 6404, 520, 90, 578, 2163, 395, 520, 90, 21145, 1316, 520, 90, 201, 7, 20360, 2167, 49037, 5195, 4249, 25624, 2712, 7413, 7119, 58, 65107, 22822, 201, 7, 2163, 395, 520, 90, 904, 47931, 15, 38885, 904, 9854, 3209, 11707, 772, 27503, 88, 1056, 8772, 44531, 904, 92, 223, 18, 4, 10164, 16, 8774, 201, 62, 3113, 17783, 201, 223, 3259, 1783, 2233, 2163, 395, 520, 90, 578, 390, 1316, 520, 90, 1215, 223, 514, 1292, 7517, 5685, 4020, 1216, 6289, 11833, 483, 612, 223, 514, 8645, 820, 5685, 6459, 10542, 612, 223, 514, 67762, 676, 34277, 7107, 4403, 5765, 7107, 612, 62, 7583, 201, 223, 3259, 1783, 2233, 34245, 6404, 520, 90, 1215, 223, 514, 447, 1057, 14270, 61, 1876, 6592, 20636, 6289, 612, 223, 514, 447, 1057, 14270, 61, 71659, 820, 6592, 78, 10542, 612, 223, 3259, 514, 447, 1057, 14270, 6170, 4113, 820, 1387, 612, 223, 514, 67762, 676, 34277, 7107, 4403, 5765, 7107, 612, 68146, 201]
Len Tokens 177
Decoded: % Change the font if you want to, depending on whether
% you're using pdflatex or xelatex/lualatex
% WHEN COMPILING WITH XELATEX PLEASE USE
% xelatex -shell-escape -output-driver="xdvipdfmx -z 0" sample.tex
\iftutex
  % If using xelatex or lualatex:
  \setmainfont{Roboto Slab}
  \setsansfont{Lato}
  \renewcommand{\familydefault}{\sfdefault}
\else
  % If using pdflatex:
  \usepackage[rm]{roboto}
  \usepackage[defaultsans]{lato}
  % \usepackage{sourcesanspro}
  \renewcommand{\familydefault}{\sfdefault}
\fi
  • Code
Input  : class SentencePieceUnigramTokenizer(BaseTokenizer):
    """SentencePiece Unigram Tokenizer

    Represents the Unigram algorithm, with the pretokenization used by SentencePiece
    """

    def __init__(
        self,
        vocab: Optional[List[Tuple[str, float]]] = None,
        replacement: str = "▁",
        add_prefix_space: bool = True,
    ):
        if vocab is not None:
            # Let Unigram(..) fail if only one of them is None
            tokenizer = Tokenizer(Unigram(vocab))
        else:
            tokenizer = Tokenizer(Unigram())
Tokens: ['class', 'ĠSentence', 'P', 'iece', 'Un', 'ig', 'ram', 'Token', 'izer', '(Base', 'Token', 'izer', '):Ċ', 'ĠĠĠ', 'Ġ"""', 'Sentence', 'P', 'iece', 'ĠUn', 'ig', 'ram', 'ĠToken', 'izer', 'ĊĊ', 'ĠĠĠ', 'ĠRep', 'resents', 'Ġthe', 'ĠUn', 'ig', 'ram', 'Ġalgorithm', ',', 'Ġwith', 'Ġthe', 'Ġpret', 'oken', 'ization', 'Ġused', 'Ġby', 'ĠSentence', 'P', 'iece', 'Ċ', 'ĠĠĠ', 'Ġ"""ĊĊ', 'ĠĠĠ', 'Ġdef', 'Ġ__', 'init', '__', '(Ċ', 'ĠĠĠĠĠĠĠ', 'Ġself', ',Ċ', 'ĠĠĠĠĠĠĠ', 'Ġvoc', 'ab', ':', 'ĠOptional', '[', 'List', '[T', 'uple', '[str', ',', 'Ġfloat', ']]', ']', 'Ġ=', 'ĠNone', ',Ċ', 'ĠĠĠĠĠĠĠ', 'Ġreplacement', ':', 'Ġstr', 'Ġ=', 'Ġ"', 'âĸ', 'ģ', '",Ċ', 'ĠĠĠĠĠĠĠ', 'Ġadd', '_prefix', '_space', ':', 'Ġbool', 'Ġ=', 'ĠTrue', ',Ċ', 'ĠĠĠ', 'Ġ):Ċ', 'ĠĠĠĠĠĠĠ', 'Ġif', 'Ġvoc', 'ab', 'Ġis', 'Ġnot', 'ĠNone', ':Ċ', 'ĠĠĠĠĠĠĠĠĠĠĠ', 'Ġ#', 'ĠLet', 'ĠUn', 'ig', 'ram', '(', '..', ')', 'Ġfail', 'Ġif', 'Ġonly', 'Ġone', 'Ġof', 'Ġthem', 'Ġis', 'ĠNone', 'Ċ', 'ĠĠĠĠĠĠĠĠĠĠĠ', 'Ġtoken', 'izer', 'Ġ=', 'ĠToken', 'izer', '(', 'Un', 'ig', 'ram', '(v', 'oc', 'ab', '))Ċ', 'ĠĠĠĠĠĠĠ', 'Ġelse', ':Ċ', 'ĠĠĠĠĠĠĠĠĠĠĠ', 'Ġtoken', 'izer', 'Ġ=', 'ĠToken', 'izer', '(', 'Un', 'ig', 'ram', '())']
Encoded: [2805, 45192, 50, 15717, 5091, 436, 1293, 13735, 7625, 62228, 13735, 7625, 2818, 413, 8533, 63588, 50, 15717, 2644, 436, 1293, 37433, 7625, 1025, 413, 4954, 13180, 307, 2644, 436, 1293, 11436, 14, 505, 307, 5992, 6907, 2920, 1909, 679, 45192, 50, 15717, 201, 413, 25641, 413, 1333, 4304, 3747, 1614, 3873, 545, 1572, 740, 545, 25497, 483, 28, 22800, 61, 3754, 42378, 15732, 27446, 14, 10809, 17233, 63, 532, 5200, 740, 545, 13804, 28, 2030, 532, 698, 27234, 226, 3288, 545, 1290, 31498, 34542, 28, 7817, 532, 9402, 740, 413, 42359, 545, 803, 25497, 483, 429, 696, 5200, 1215, 829, 1769, 4983, 2644, 436, 1293, 10, 879, 11, 6312, 803, 1407, 963, 368, 1212, 429, 5200, 201, 829, 15025, 7625, 532, 37433, 7625, 10, 5091, 436, 1293, 8425, 1287, 483, 4095, 545, 2589, 1215, 829, 15025, 7625, 532, 37433, 7625, 10, 5091, 436, 1293, 9066]
Len Tokens 146
Decoded: class SentencePieceUnigramTokenizer(BaseTokenizer):
    """SentencePiece Unigram Tokenizer

    Represents the Unigram algorithm, with the pretokenization used by SentencePiece
    """

    def __init__(
        self,
        vocab: Optional[List[Tuple[str, float]]] = None,
        replacement: str = "▁",
        add_prefix_space: bool = True,
    ):
        if vocab is not None:
            # Let Unigram(..) fail if only one of them is None
            tokenizer = Tokenizer(Unigram(vocab))
        else:
            tokenizer = Tokenizer(Unigram())
  • Emoji
Input  : 😜🫤☹️😖🤢🤮😇🐻‍❄️🦄🐾🐽🐍🦞🦐🦿🤴🧑‍🦲👨‍🚒👨‍🚀
Tokens: ['ðŁ', 'ĺ', 'ľ', 'ðŁ', '«', '¤', 'â', 'ĺ', '¹', 'ï¸ı', 'ðŁ', 'ĺ', 'ĸ', 'ðŁ', '¤¢', 'ðŁ', '¤®', 'ðŁ', 'ĺ', 'ĩ', 'ðŁ', 'IJ', '»', 'âĢ', 'į', 'â', 'Ŀ', 'Ħ', 'ï¸ı', 'ðŁ', '¦', 'Ħ', 'ðŁ', 'IJ', '¾', 'ðŁ', 'IJ', '½', 'ðŁ', 'IJ', 'į', 'ðŁ', '¦', 'ŀ', 'ðŁ', '¦', 'IJ', 'ðŁ', '¦', '¿', 'ðŁ', '¤', '´', 'ðŁ', '§', 'ij', 'âĢ', 'į', 'ðŁ', '¦', '²', 'ðŁ', 'ij', '¨', 'âĢ', 'į', 'ðŁ', 'ļ', 'Ĵ', 'ðŁ', 'ij', '¨', 'âĢ', 'į', 'ðŁ', 'ļ', 'Ģ']
Encoded: [17635, 249, 253, 17635, 107, 100, 161, 249, 120, 67378, 17635, 249, 247, 17635, 6387, 17635, 326, 17635, 249, 232, 17635, 241, 122, 461, 238, 161, 254, 229, 67378, 17635, 102, 229, 17635, 241, 125, 17635, 241, 124, 17635, 241, 238, 17635, 102, 255, 17635, 102, 241, 17635, 102, 126, 17635, 100, 115, 17635, 103, 242, 461, 238, 17635, 102, 113, 17635, 242, 104, 461, 238, 17635, 251, 243, 17635, 242, 104, 461, 238, 17635, 251, 225]
Len Tokens 77
Decoded: 😜🫤☹️😖🤢🤮😇🐻‍❄️🦄🐾🐽🐍🦞🦐🦿🤴🧑‍🦲👨‍🚒👨‍🚀
  • Sanskrit
Input  : ॐ त्र्यम्बकं यजामहे सुगन्धिं पुष्टिवर्धनम् उर्वारुकमिव बन्धनान्मृत्योर्मुक्षीय मामृतात् ॐ. 
Tokens: ['à¥', 'IJ', 'Ġतà¥įर', 'à¥įयम', 'à¥įब', 'à¤ķà¤Ĥ', 'Ġय', 'à¤ľà¤¾à¤®', 'हà¥ĩ', 'Ġसà¥ģ', 'à¤Ĺन', 'à¥įध', 'िà¤Ĥ', 'Ġपà¥ģषà¥įà¤Ł', 'िव', 'रà¥įध', 'नम', 'à¥į', 'Ġà¤īरà¥įव', 'ार', 'à¥ģà¤ķ', 'म', 'िव', 'Ġबन', 'à¥įध', 'नान', 'à¥įम', 'à¥ĥतà¥įय', 'à¥ĭ', 'रà¥įम', 'à¥ģ', 'à¤ķà¥įष', 'à¥Ģय', 'Ġमाम', 'à¥ĥत', 'ात', 'à¥į', 'Ġà¥IJ', '.', 'Ġ']
Encoded: [261, 241, 5148, 1385, 2474, 69046, 452, 13431, 24956, 1196, 9464, 1074, 571, 56898, 616, 3985, 12134, 270, 19111, 315, 704, 327, 616, 854, 1074, 54741, 632, 15421, 282, 760, 304, 625, 2095, 1061, 1583, 471, 270, 21199, 16, 223]
Len Tokens 40
Decoded: ॐ त्र्यम्बकं यजामहे सुगन्धिं पुष्टिवर्धनम् उर्वारुकमिव बन्धनान्मृत्योर्मुक्षीय मामृतात् ॐ. 
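
Each of the samples above is a simple round-trip check with the saved tokenizer (a minimal sketch; tokenizer.json is whatever file the trained tokenizer was saved to):

from tokenizers import Tokenizer

# Encode, inspect the tokens and ids, then decode back to the original string
tok = Tokenizer.from_file("tokenizer.json")
enc = tok.encode("Bangalore and Chennai have faced each other in 33 matches in IPL.")
print("Tokens:", enc.tokens)
print("Encoded:", enc.ids)
print("Len Tokens", len(enc.ids))
print("Decoded:", tok.decode(enc.ids))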