Byte Fallback BPE Tokenizer
- Trained using huggingface/tokenizers
- Vocab size: 72000

Training Args
```python
from datasets import load_dataset
from tokenizers import Regex, Tokenizer, decoders, models, normalizers, pre_tokenizers, trainers

### gpt-4-turbo regex (you can use your own, but this one works fine)
pat_str = "|".join(
    [
        r"""[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]*[\p{Ll}\p{Lm}\p{Lo}\p{M}]+(?i:'s|'t|'re|'ve|'m|'ll|'d)?""",
        r"""[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]+[\p{Ll}\p{Lm}\p{Lo}\p{M}]*(?i:'s|'t|'re|'ve|'m|'ll|'d)?""",
        r"""\p{N}{1,3}""",
        r""" ?[^\s\p{L}\p{N}]+[\r\n/]*""",
        r"""\s*[\r\n]+""",
        r"""\s+(?!\S)""",
        r"""\s+""",
    ]
)
```
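To get a feel for what this split pattern does, it can be run directly with the third-party `regex` package (Python's built-in `re` does not support `\p{...}` classes). The sample sentence below is only an illustration; note how numbers are chunked into runs of at most three digits and how a leading space attaches to the following word:

```python
import regex  # third-party; pip install regex

# Same pattern as pat_str above
pat_str = "|".join(
    [
        r"""[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]*[\p{Ll}\p{Lm}\p{Lo}\p{M}]+(?i:'s|'t|'re|'ve|'m|'ll|'d)?""",
        r"""[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]+[\p{Ll}\p{Lm}\p{Lo}\p{M}]*(?i:'s|'t|'re|'ve|'m|'ll|'d)?""",
        r"""\p{N}{1,3}""",
        r""" ?[^\s\p{L}\p{N}]+[\r\n/]*""",
        r"""\s*[\r\n]+""",
        r"""\s+(?!\S)""",
        r"""\s+""",
    ]
)

# Digits split into groups of <= 3; spaces glue to the word they precede
pieces = regex.findall(pat_str, "IPL 2024 में 33 matches")
print(pieces)  # ['IPL', ' ', '202', '4', ' में', ' ', '33', ' matches']
```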
```python
# Initialize the tokenizer with byte fallback enabled
tokenizer = Tokenizer(models.BPE(
    byte_fallback=True,
    unk_token=None,
    fuse_unk=False,
))

# Pre-tokenizer for multilingual support
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.Split(
        pattern=Regex(pat_str),
        behavior="isolated",
        invert=False,
    ),
    pre_tokenizers.ByteLevel(
        add_prefix_space=False,
        trim_offsets=True,
        use_regex=False,
    ),
])

# Normalizer (modified for Indic languages)
tokenizer.normalizer = normalizers.Sequence([
    normalizers.NFC(),  # Safer than NFKC for Indic scripts
])

from tokenizers import decoders
tokenizer.decoder = decoders.ByteLevel(
    add_prefix_space=False  # Must match the pre-tokenizer settings
)

# Trainer configuration
trainer = trainers.BpeTrainer(
    vocab_size=VOCAB_SIZE,
    special_tokens=SPECIAL_TOKENS,
    min_frequency=1,  # Low threshold helps low-resource languages
    show_progress=True,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
    max_token_length=24,
    continuing_subword_prefix="",
)

def get_corpus():
    # Load the full dataset, shuffle, and repeat it three times
    # (matching the "* 3" factors in the training composition below).
    # DATASET_NAME and TEXT_COLUMN are configured per source corpus.
    dataset = load_dataset(DATASET_NAME, split="train")
    shuffled = dataset.shuffle(seed=42)
    return [text for text in shuffled[TEXT_COLUMN] if text.strip()] * 3
```
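The card stops short of the actual training call. A minimal, self-contained sketch of the same recipe on a toy corpus (the toy vocab size, corpus strings, and output filename are made up for illustration; the real run uses the regex pre-tokenizer, a 72000 vocab, and `get_corpus()`):

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Same byte-fallback BPE setup, simplified to the ByteLevel pre-tokenizer only
tokenizer = Tokenizer(models.BPE(byte_fallback=True, unk_token=None, fuse_unk=False))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=512,  # toy size; the real run uses 72000
    special_tokens=["<|begin_of_text|>", "<|end_of_text|>", "<|pad|>"],
    min_frequency=1,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)

corpus = ["नमस्ते दुनिया", "hello world"] * 200  # stand-in for get_corpus()
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Byte fallback + ByteLevel gives exact round-trips, even on unseen text
enc = tokenizer.encode("नमस्ते world")
assert tokenizer.decode(enc.ids) == "नमस्ते world"
```

The trained tokenizer can then be persisted with `tokenizer.save("tokenizer.json")`.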
Special Tokens

```python
{'bos_token': '<|begin_of_text|>', 'eos_token': '<|end_of_text|>', 'pad_token': '<|pad|>'}
```
Training Composition:

| Domain | Size | Source |
|---|---|---|
| Maths | 550 M × 3 | aluncstokes/mathpile_arxiv_subset |
| Code | 800 M × 3 | codeparrot/github-code |
| Hinglish | 250 M × 3 | Abhishekcr448/Hinglish-Everyday-Conversations-1M, Maihar/hinglish-80k |
| English | 2,000 M × 3 | allenai/c4 (config "en") |
| Hindi | 2,200 M × 3 | aloobun/dhpileIN (data_dir='hi') |
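The per-domain shares implied by these sizes (in the "M" units listed above, each source repeated 3×) can be computed directly; the snippet below just restates the composition as percentages:

```python
# Sizes from the composition above, each seen 3 times during training
sizes_m = {"Maths": 550, "Code": 800, "Hinglish": 250, "English": 2000, "Hindi": 2200}
epochs = 3

total = sum(sizes_m.values()) * epochs
share = {k: round(v * epochs / total * 100, 1) for k, v in sizes_m.items()}

print(total)  # 17400
print(share)  # {'Maths': 9.5, 'Code': 13.8, 'Hinglish': 4.3, 'English': 34.5, 'Hindi': 37.9}
```

Hindi and English dominate the mix at roughly 38% and 34% respectively.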
Evals

Tokenization Efficiency (token counts on identical samples; lower is better)

| | Tokenizer | English | Hindi | Tamil | Bengali | Malayalam | Telugu | Gujarati | Punjabi | Python | Java | C++ | Math |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | deepseek-ai/DeepSeek-R1 (128k) | 338874 | 22855 | 48957 | 39617 | 73928 | 40345 | 101020 | 79172 | 5231 | 2224 | 7055 | 5376 |
| 1 | unsloth/phi-4 (100k) | 308645 | 40456 | 59750 | 116122 | 149889 | 48689 | 118335 | 87413 | 4809 | 2110 | 6529 | 5573 |
| 2 | deepseek-ai/DeepSeek-R1-Distill-Llama-8B (128k) | 308512 | 21110 | 59625 | 115138 | 149883 | 48661 | 118061 | 86765 | 4809 | 2111 | 6530 | 5574 |
| 3 | unsloth/gemma-2-9b-it (256k) | 323335 | 15916 | 53913 | 53402 | 57219 | 47610 | 107925 | 87222 | 5948 | 2569 | 8639 | 5871 |
| 4 | Ornaments/72k-Bilingual-BBPE-TK-SPM (72k) (old) | 366710 | 11447 | 61408 | 94191 | 97207 | 50229 | 117874 | 90045 | 8201 | 4000 | 13706 | 5585 |
| 5 | Ornaments/72k-Bilingual-BBPE-TK-SPM-Identity (72k) | 330830 | 10318 | 59089 | 93740 | 92655 | 44975 | 109411 | 87922 | 7819 | 3743 | 12953 | 5253 |
| 6 | **Ornaments/72k-TK-BBPE-HF (72k, this tokenizer)** | 321274 | 10813 | 67585 | 159985 | 193813 | 55654 | 134397 | 97063 | 5225 | 2263 | 7090 | 5150 |
| 7 | nvidia/Nemotron-4-Mini-Hindi-4B-Instruct (256k) | 332271 | 14327 | 55473 | 36615 | 45783 | 48270 | 160115 | 117174 | 6186 | 2732 | 8861 | 6136 |
| 8 | sarvamai/OpenHathi-7B-Hi-v0.1-Base (48k) | 370133 | 15633 | 67845 | 120340 | 105953 | 68315 | 159122 | 113817 | 6595 | 2792 | 9233 | 6223 |
| 9 | sarvamai/sarvam-1 (68k) | 385386 | 11257 | 61396 | 27348 | 31822 | 51463 | 119666 | 103344 | 7331 | 3068 | 9724 | 6864 |
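Reading the Hindi column as an example: relative to this tokenizer's count, other tokenizers need roughly the following multiples of tokens on the same sample (all numbers copied from the table above):

```python
# Hindi-column token counts from the efficiency table above
hindi = {
    "Ornaments/72k-TK-BBPE-HF": 10813,
    "sarvamai/sarvam-1": 11257,
    "unsloth/gemma-2-9b-it": 15916,
    "deepseek-ai/DeepSeek-R1": 22855,
}

base = hindi["Ornaments/72k-TK-BBPE-HF"]
ratios = {k: round(v / base, 2) for k, v in hindi.items()}
print(ratios["deepseek-ai/DeepSeek-R1"])  # 2.11 -> ~2.1x more tokens for the same text
```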
Encode-Decode
- Hindi
Input : ऋतुराज गायकवाड़ (कप्तान), डेवोन कॉनवे, रचिन रविंद्र, राहुल त्रिपाठी, शिवम दुबे, रविंद्र जडेजा, एमएस धोनी (विकेटकीपर), आर अश्विन, मीथाशा पथिराना, खलील अहमद, नूर अहमद।
Tokens: ['à¤ĭ', 'त', 'à¥ģà¤°à¤¾à¤ľ', 'Ġà¤Ĺायà¤ķ', 'वाड़', 'Ġ(', 'à¤ķपà¥įतान', '),', 'Ġडà¥ĩ', 'व', 'à¥ĭन', 'Ġà¤ķà¥īन', 'वà¥ĩ', ',', 'Ġरà¤ļ', 'िन', 'Ġरविà¤Ĥदà¥įर', ',', 'Ġराहà¥ģल', 'Ġतà¥įरिप', 'à¤¾à¤łà¥Ģ', ',', 'Ġशिवम', 'Ġदà¥ģबà¥ĩ', ',', 'Ġरविà¤Ĥदà¥įर', 'Ġà¤ľà¤¡à¥ĩà¤ľà¤¾', ',', 'Ġà¤ıमà¤ıस', 'Ġधà¥ĭनà¥Ģ', 'Ġ(', 'व', 'िà¤ķà¥ĩà¤Ł', 'à¤ķà¥Ģपर', '),', 'Ġà¤Ĩर', 'Ġà¤ħशà¥įविन', ',', 'Ġमà¥Ģ', 'थ', 'ाशा', 'Ġपथ', 'िर', 'ाना', ',', 'Ġà¤ĸलà¥Ģल', 'Ġà¤ħहमद', ',', 'Ġनà¥Ĥर', 'Ġà¤ħहमद', '।']
Encoded: [38659, 299, 21358, 15506, 7249, 509, 28249, 1222, 2308, 357, 1731, 8940, 2506, 14, 17890, 504, 19058, 14, 4384, 9183, 7568, 14, 18827, 13293, 14, 19058, 13516, 14, 17978, 12756, 509, 357, 3072, 14080, 1222, 2215, 17009, 14, 7584, 942, 22395, 11558, 647, 901, 14, 39383, 6593, 14, 25750, 6593, 337]
Len Tokens 51
Decoded: ऋतुराज गायकवाड़ (कप्तान), डेवोन कॉनवे, रचिन रविंद्र, राहुल त्रिपाठी, शिवम दुबे, रविंद्र जडेजा, एमएस धोनी (विकेटकीपर), आर अश्विन, मीथाशा पथिराना, खलील अहमद, नूर अहमद।
- English
Input : Bangalore and Chennai have faced each other in 33 matches in IPL. Out of these 33 games, Bangalore have won 11 whereas Chennai have come out victorious on 21 occasion. 1 match ended without a result.
Tokens: ['B', 'ang', 'alore', 'Ġand', 'ĠChennai', 'Ġhave', 'Ġfaced', 'Ġeach', 'Ġother', 'Ġin', 'Ġ', '33', 'Ġmatches', 'Ġin', 'ĠIPL', '.', 'ĠOut', 'Ġof', 'Ġthese', 'Ġ', '33', 'Ġgames', ',', 'ĠBangalore', 'Ġhave', 'Ġwon', 'Ġ', '11', 'Ġwhereas', 'ĠChennai', 'Ġhave', 'Ġcome', 'Ġout', 'Ġvict', 'orious', 'Ġon', 'Ġ', '21', 'Ġoccasion', '.', 'Ġ', '1', 'Ġmatch', 'Ġended', 'Ġwithout', 'Ġa', 'Ġresult', '.']
Encoded: [36, 951, 30658, 364, 45274, 688, 20861, 1993, 1101, 360, 223, 3276, 15006, 360, 11519, 16, 7921, 368, 1576, 223, 3276, 5013, 14, 45076, 688, 4896, 223, 1281, 21170, 45274, 688, 3051, 892, 9592, 29166, 462, 223, 2428, 13344, 16, 223, 19, 5359, 12784, 2752, 284, 1899, 16]
Len Tokens 48
Decoded: Bangalore and Chennai have faced each other in 33 matches in IPL. Out of these 33 games, Bangalore have won 11 whereas Chennai have come out victorious on 21 occasion. 1 match ended without a result.
- Math
Input : % Change the font if you want to, depending on whether
% you're using pdflatex or xelatex/lualatex
% WHEN COMPILING WITH XELATEX PLEASE USE
% xelatex -shell-escape -output-driver="xdvipdfmx -z 0" sample.tex
\iftutex
% If using xelatex or lualatex:
\setmainfont{Roboto Slab}
\setsansfont{Lato}
\renewcommand{\familydefault}{\sfdefault}
\else
% If using pdflatex:
\usepackage[rm]{roboto}
\usepackage[defaultsans]{lato}
% \usepackage{sourcesanspro}
\renewcommand{\familydefault}{\sfdefault}
\fi
Tokens: ['%', 'ĠChange', 'Ġthe', 'Ġfont', 'Ġif', 'Ġyou', 'Ġwant', 'Ġto', ',', 'Ġdepending', 'Ġon', 'Ġwhether', 'Ċ', '%', "Ġyou're", 'Ġusing', 'Ġpd', 'fl', 'ate', 'x', 'Ġor', 'Ġx', 'el', 'ate', 'x', '/l', 'ual', 'ate', 'x', 'Ċ', '%', 'ĠWH', 'EN', 'ĠCOMP', 'IL', 'ING', 'ĠWITH', 'ĠX', 'EL', 'ATE', 'X', 'ĠPLEASE', 'ĠUSE', 'Ċ', '%', 'Ġx', 'el', 'ate', 'x', 'Ġ-', 'shell', '-', 'escape', 'Ġ-', 'output', '-d', 'river', '="', 'xd', 'v', 'ip', 'df', 'mx', 'Ġ-', 'z', 'Ġ', '0', '"', 'Ġsample', '.', 'tex', 'Ċ', '\\', 'ift', 'utex', 'Ċ', 'Ġ', 'Ġ%', 'ĠIf', 'Ġusing', 'Ġx', 'el', 'ate', 'x', 'Ġor', 'Ġl', 'ual', 'ate', 'x', ':Ċ', 'Ġ', 'Ġ\\', 'set', 'main', 'font', '{R', 'ob', 'oto', 'ĠSl', 'ab', '}Ċ', 'Ġ', 'Ġ\\', 'sets', 'ans', 'font', '{L', 'ato', '}Ċ', 'Ġ', 'Ġ\\', 'renewcommand', '{\\', 'family', 'default', '}{\\', 'sf', 'default', '}Ċ', '\\', 'else', 'Ċ', 'Ġ', 'Ġ%', 'ĠIf', 'Ġusing', 'Ġpd', 'fl', 'ate', 'x', ':Ċ', 'Ġ', 'Ġ\\', 'us', 'ep', 'ackage', '[', 'rm', ']{', 'rob', 'oto', '}Ċ', 'Ġ', 'Ġ\\', 'us', 'ep', 'ackage', '[', 'defaults', 'ans', ']{', 'l', 'ato', '}Ċ', 'Ġ', 'Ġ%', 'Ġ\\', 'us', 'ep', 'ackage', '{s', 'ources', 'ans', 'pro', '}Ċ', 'Ġ', 'Ġ\\', 'renewcommand', '{\\', 'family', 'default', '}{\\', 'sf', 'default', '}Ċ', '\\fi', 'Ċ']
Encoded: [7, 20642, 307, 10013, 803, 449, 1654, 349, 14, 11248, 462, 4806, 201, 7, 7412, 2233, 34245, 6404, 520, 90, 578, 2163, 395, 520, 90, 21145, 1316, 520, 90, 201, 7, 20360, 2167, 49037, 5195, 4249, 25624, 2712, 7413, 7119, 58, 65107, 22822, 201, 7, 2163, 395, 520, 90, 904, 47931, 15, 38885, 904, 9854, 3209, 11707, 772, 27503, 88, 1056, 8772, 44531, 904, 92, 223, 18, 4, 10164, 16, 8774, 201, 62, 3113, 17783, 201, 223, 3259, 1783, 2233, 2163, 395, 520, 90, 578, 390, 1316, 520, 90, 1215, 223, 514, 1292, 7517, 5685, 4020, 1216, 6289, 11833, 483, 612, 223, 514, 8645, 820, 5685, 6459, 10542, 612, 223, 514, 67762, 676, 34277, 7107, 4403, 5765, 7107, 612, 62, 7583, 201, 223, 3259, 1783, 2233, 34245, 6404, 520, 90, 1215, 223, 514, 447, 1057, 14270, 61, 1876, 6592, 20636, 6289, 612, 223, 514, 447, 1057, 14270, 61, 71659, 820, 6592, 78, 10542, 612, 223, 3259, 514, 447, 1057, 14270, 6170, 4113, 820, 1387, 612, 223, 514, 67762, 676, 34277, 7107, 4403, 5765, 7107, 612, 68146, 201]
Len Tokens 177
Decoded: % Change the font if you want to, depending on whether
% you're using pdflatex or xelatex/lualatex
% WHEN COMPILING WITH XELATEX PLEASE USE
% xelatex -shell-escape -output-driver="xdvipdfmx -z 0" sample.tex
\iftutex
% If using xelatex or lualatex:
\setmainfont{Roboto Slab}
\setsansfont{Lato}
\renewcommand{\familydefault}{\sfdefault}
\else
% If using pdflatex:
\usepackage[rm]{roboto}
\usepackage[defaultsans]{lato}
% \usepackage{sourcesanspro}
\renewcommand{\familydefault}{\sfdefault}
\fi
- Code
Input : class SentencePieceUnigramTokenizer(BaseTokenizer):
"""SentencePiece Unigram Tokenizer
Represents the Unigram algorithm, with the pretokenization used by SentencePiece
"""
def __init__(
self,
vocab: Optional[List[Tuple[str, float]]] = None,
replacement: str = "▁",
add_prefix_space: bool = True,
):
if vocab is not None:
# Let Unigram(..) fail if only one of them is None
tokenizer = Tokenizer(Unigram(vocab))
else:
tokenizer = Tokenizer(Unigram())
Tokens: ['class', 'ĠSentence', 'P', 'iece', 'Un', 'ig', 'ram', 'Token', 'izer', '(Base', 'Token', 'izer', '):Ċ', 'ĠĠĠ', 'Ġ"""', 'Sentence', 'P', 'iece', 'ĠUn', 'ig', 'ram', 'ĠToken', 'izer', 'ĊĊ', 'ĠĠĠ', 'ĠRep', 'resents', 'Ġthe', 'ĠUn', 'ig', 'ram', 'Ġalgorithm', ',', 'Ġwith', 'Ġthe', 'Ġpret', 'oken', 'ization', 'Ġused', 'Ġby', 'ĠSentence', 'P', 'iece', 'Ċ', 'ĠĠĠ', 'Ġ"""ĊĊ', 'ĠĠĠ', 'Ġdef', 'Ġ__', 'init', '__', '(Ċ', 'ĠĠĠĠĠĠĠ', 'Ġself', ',Ċ', 'ĠĠĠĠĠĠĠ', 'Ġvoc', 'ab', ':', 'ĠOptional', '[', 'List', '[T', 'uple', '[str', ',', 'Ġfloat', ']]', ']', 'Ġ=', 'ĠNone', ',Ċ', 'ĠĠĠĠĠĠĠ', 'Ġreplacement', ':', 'Ġstr', 'Ġ=', 'Ġ"', 'âĸ', 'ģ', '",Ċ', 'ĠĠĠĠĠĠĠ', 'Ġadd', '_prefix', '_space', ':', 'Ġbool', 'Ġ=', 'ĠTrue', ',Ċ', 'ĠĠĠ', 'Ġ):Ċ', 'ĠĠĠĠĠĠĠ', 'Ġif', 'Ġvoc', 'ab', 'Ġis', 'Ġnot', 'ĠNone', ':Ċ', 'ĠĠĠĠĠĠĠĠĠĠĠ', 'Ġ#', 'ĠLet', 'ĠUn', 'ig', 'ram', '(', '..', ')', 'Ġfail', 'Ġif', 'Ġonly', 'Ġone', 'Ġof', 'Ġthem', 'Ġis', 'ĠNone', 'Ċ', 'ĠĠĠĠĠĠĠĠĠĠĠ', 'Ġtoken', 'izer', 'Ġ=', 'ĠToken', 'izer', '(', 'Un', 'ig', 'ram', '(v', 'oc', 'ab', '))Ċ', 'ĠĠĠĠĠĠĠ', 'Ġelse', ':Ċ', 'ĠĠĠĠĠĠĠĠĠĠĠ', 'Ġtoken', 'izer', 'Ġ=', 'ĠToken', 'izer', '(', 'Un', 'ig', 'ram', '())']
Encoded: [2805, 45192, 50, 15717, 5091, 436, 1293, 13735, 7625, 62228, 13735, 7625, 2818, 413, 8533, 63588, 50, 15717, 2644, 436, 1293, 37433, 7625, 1025, 413, 4954, 13180, 307, 2644, 436, 1293, 11436, 14, 505, 307, 5992, 6907, 2920, 1909, 679, 45192, 50, 15717, 201, 413, 25641, 413, 1333, 4304, 3747, 1614, 3873, 545, 1572, 740, 545, 25497, 483, 28, 22800, 61, 3754, 42378, 15732, 27446, 14, 10809, 17233, 63, 532, 5200, 740, 545, 13804, 28, 2030, 532, 698, 27234, 226, 3288, 545, 1290, 31498, 34542, 28, 7817, 532, 9402, 740, 413, 42359, 545, 803, 25497, 483, 429, 696, 5200, 1215, 829, 1769, 4983, 2644, 436, 1293, 10, 879, 11, 6312, 803, 1407, 963, 368, 1212, 429, 5200, 201, 829, 15025, 7625, 532, 37433, 7625, 10, 5091, 436, 1293, 8425, 1287, 483, 4095, 545, 2589, 1215, 829, 15025, 7625, 532, 37433, 7625, 10, 5091, 436, 1293, 9066]
Len Tokens 146
Decoded: class SentencePieceUnigramTokenizer(BaseTokenizer):
"""SentencePiece Unigram Tokenizer
Represents the Unigram algorithm, with the pretokenization used by SentencePiece
"""
def __init__(
self,
vocab: Optional[List[Tuple[str, float]]] = None,
replacement: str = "▁",
add_prefix_space: bool = True,
):
if vocab is not None:
# Let Unigram(..) fail if only one of them is None
tokenizer = Tokenizer(Unigram(vocab))
else:
tokenizer = Tokenizer(Unigram())
- Emoji
Input : 😜🫤☹️😖🤢🤮😇🐻❄️🦄🐾🐽🐍🦞🦐🦿🤴🧑🦲👨🚒👨🚀
Tokens: ['ðŁ', 'ĺ', 'ľ', 'ðŁ', '«', '¤', 'â', 'ĺ', '¹', 'ï¸ı', 'ðŁ', 'ĺ', 'ĸ', 'ðŁ', '¤¢', 'ðŁ', '¤®', 'ðŁ', 'ĺ', 'ĩ', 'ðŁ', 'IJ', '»', 'âĢ', 'į', 'â', 'Ŀ', 'Ħ', 'ï¸ı', 'ðŁ', '¦', 'Ħ', 'ðŁ', 'IJ', '¾', 'ðŁ', 'IJ', '½', 'ðŁ', 'IJ', 'į', 'ðŁ', '¦', 'ŀ', 'ðŁ', '¦', 'IJ', 'ðŁ', '¦', '¿', 'ðŁ', '¤', '´', 'ðŁ', '§', 'ij', 'âĢ', 'į', 'ðŁ', '¦', '²', 'ðŁ', 'ij', '¨', 'âĢ', 'į', 'ðŁ', 'ļ', 'Ĵ', 'ðŁ', 'ij', '¨', 'âĢ', 'į', 'ðŁ', 'ļ', 'Ģ']
Encoded: [17635, 249, 253, 17635, 107, 100, 161, 249, 120, 67378, 17635, 249, 247, 17635, 6387, 17635, 326, 17635, 249, 232, 17635, 241, 122, 461, 238, 161, 254, 229, 67378, 17635, 102, 229, 17635, 241, 125, 17635, 241, 124, 17635, 241, 238, 17635, 102, 255, 17635, 102, 241, 17635, 102, 126, 17635, 100, 115, 17635, 103, 242, 461, 238, 17635, 102, 113, 17635, 242, 104, 461, 238, 17635, 251, 243, 17635, 242, 104, 461, 238, 17635, 251, 225]
Len Tokens 77
Decoded: 😜🫤☹️😖🤢🤮😇🐻❄️🦄🐾🐽🐍🦞🦐🦿🤴🧑🦲👨🚒👨🚀
- Sanskrit
Input : ॐ त्र्यम्बकं यजामहे सुगन्धिं पुष्टिवर्धनम् उर्वारुकमिव बन्धनान्मृत्योर्मुक्षीय मामृतात् ॐ.
Tokens: ['à¥', 'IJ', 'Ġतà¥įर', 'à¥įयम', 'à¥įब', 'à¤ķà¤Ĥ', 'Ġय', 'à¤ľà¤¾à¤®', 'हà¥ĩ', 'Ġसà¥ģ', 'à¤Ĺन', 'à¥įध', 'िà¤Ĥ', 'Ġपà¥ģषà¥įà¤Ł', 'िव', 'रà¥įध', 'नम', 'à¥į', 'Ġà¤īरà¥įव', 'ार', 'à¥ģà¤ķ', 'म', 'िव', 'Ġबन', 'à¥įध', 'नान', 'à¥įम', 'à¥ĥतà¥įय', 'à¥ĭ', 'रà¥įम', 'à¥ģ', 'à¤ķà¥įष', 'à¥Ģय', 'Ġमाम', 'à¥ĥत', 'ात', 'à¥į', 'Ġà¥IJ', '.', 'Ġ']
Encoded: [261, 241, 5148, 1385, 2474, 69046, 452, 13431, 24956, 1196, 9464, 1074, 571, 56898, 616, 3985, 12134, 270, 19111, 315, 704, 327, 616, 854, 1074, 54741, 632, 15421, 282, 760, 304, 625, 2095, 1061, 1583, 471, 270, 21199, 16, 223]
Len Tokens 40
Decoded: ॐ त्र्यम्बकं यजामहे सुगन्धिं पुष्टिवर्धनम् उर्वारुकमिव बन्धनान्मृत्योर्मुक्षीय मामृतात् ॐ.
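The token lists above print Devanagari as Latin-looking strings like `'à¤ĭ'` because the ByteLevel components map every raw byte to a printable stand-in character (the GPT-2 `bytes_to_unicode` scheme: `Ġ` is a space, `Ċ` is a newline). A minimal sketch of that mapping:

```python
def bytes_to_unicode():
    # Bytes that are already printable keep their own codepoint;
    # the rest are shifted into the 256+ range so every byte is visible.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

m = bytes_to_unicode()
# 'ऋ' is the three UTF-8 bytes E0 A4 8B, displayed as the token text 'à¤ĭ'
assert "".join(m[b] for b in "ऋ".encode("utf-8")) == "à¤ĭ"
assert m[ord(" ")] == "Ġ" and m[ord("\n")] == "Ċ"
```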