Clarification on the regex of the tokenizer configuration
Your JSON tokenization config uses the following regex to split the input.
" ?[^(\\s|[.,!?\u2026\u3002\uff0c\u3001\u0964\u06d4\u060c])]+"
or, with the escapes decoded: " ?[^(\s|[.,!?…。,、।۔،])]+"
I don't understand why you have nested brackets for [.,!?…。,、।۔،].
Why do you separate \s from the rest of the characters?
Also, why do you try to capture (with parentheses)?
Wouldn't it be the same as " ?[^\s.,!?…。,、।۔،]+"?
Maybe there is a fancy regex pattern I don't know.
For context, I am trying to load this configuration in C# on .NET, and the standard regex engine does not understand this regex.
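To illustrate what an engine without nested character classes makes of this pattern, here is a small sketch using Python's re module as a stand-in. This is an assumption on my side: I have not run it on .NET, I only expect .NET to read the class the same way. Such an engine closes the class at the first "]", so the ")" that follows is left unmatched and the pattern is rejected. The helper name try_compile is hypothetical.

```python
import re
import warnings

# Pattern from the config after JSON unescaping. The raw string keeps \s and
# the \uXXXX escapes intact so that the regex engine interprets them itself.
ORIGINAL = r" ?[^(\s|[.,!?\u2026\u3002\uff0c\u3001\u0964\u06d4\u060c])]+"


def try_compile(pattern):
    """Return None if the pattern compiles, otherwise the engine's error text."""
    with warnings.catch_warnings():
        # Silence any warning about ambiguous set syntax so that only the
        # hard error (if there is one) is reported.
        warnings.simplefilter("ignore")
        try:
            re.compile(pattern)
        except re.error as exc:
            return str(exc)
    return None


error = try_compile(ORIGINAL)
print(error)
```

In Python the error text complains about an unbalanced parenthesis, which is consistent with the class ending at the first "]" and the ")" being parsed outside of it.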
I'm sorry, but I don't know why these decisions were made. We inherited the splitting in this tokenizer, maybe from BLOOM? I forget exactly.
I think you might be right, though, that the nested brackets are not required. It honestly looks like the person who wrote this was a little bit confused, but I don't really know the larger context or what the intentions were.
It might be worth looking at Llama.cpp's support for this tokenizer; it might be more portable.
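For anyone else hitting this, a minimal sketch of the flattened pattern proposed in the question, again in Python rather than C#. It is not taken from the tokenizer's own code, and the helper name split_words is hypothetical. It only shows that a single negated class, without nested brackets or parentheses, compiles in a standard engine and splits on whitespace and the listed punctuation.

```python
import re

# Flattened pattern from the question: an optional leading space followed by
# one or more characters that are neither whitespace nor listed punctuation.
SIMPLIFIED = re.compile(
    r" ?[^\s.,!?\u2026\u3002\uff0c\u3001\u0964\u06d4\u060c]+"
)


def split_words(text):
    """Return every non-overlapping match of the flattened pattern."""
    return SIMPLIFIED.findall(text)


print(split_words("Hello, world! How are you?"))
# ['Hello', ' world', ' How', ' are', ' you']
```

One caveat, stated as an assumption: an engine that does support nested classes would probably read the original pattern as also excluding the literal characters "(", "|" and ")". If so, the two patterns differ on text that contains those characters, and they would need to be added to the flattened class to keep the behaviour identical.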