Clarification on the regex of the tokenizer configuration

#9
by antoine-agthe-unity - opened

Your JSON tokenization config uses the following regex to split the input.

" ?[^(\\s|[.,!?\u2026\u3002\uff0c\u3001\u0964\u06d4\u060c])]+"
or \s?[^(\s|[.,!?…。,、।۔،])]+

I don't understand why you have nested brackets for [.,!?…。,、।۔،]
Why do you separate \s from the rest of the characters?
Also, why do you try to capture (with parentheses)?

Wouldn't it be the same as ?[^\s.,!?…。,、।۔،]+ ?
Maybe there is a fancy regex pattern I don't know.

For context, I am trying to load this configuration in .NET (C#), and the standard regex engine does not understand this regex.
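
To illustrate the portability problem, here is a minimal sketch using Python's `re` module, which (as far as I know) treats a `[` inside a character class as a literal, so the class ends at the first `]` and the following `)` is left unbalanced. The names `ORIGINAL`, `FLAT`, `compiles`, and `split_pieces` are made up for this example, and this only shows how Python's `re` behaves, not how the engine used by the tokenizer library interprets the pattern.

```python
import re

# Pattern exactly as written in the tokenizer config (JSON-unescaped).
ORIGINAL = r" ?[^(\s|[.,!?\u2026\u3002\uff0c\u3001\u0964\u06d4\u060c])]+"

# Flattened variant proposed above: one negated class, no parentheses,
# no alternation bar, no nested brackets.
FLAT = r" ?[^\s.,!?\u2026\u3002\uff0c\u3001\u0964\u06d4\u060c]+"


def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


def split_pieces(text: str, pattern: str = FLAT) -> list[str]:
    """Return the runs matched by the pattern (optional leading space + non-punctuation)."""
    return re.findall(pattern, text)


print(compiles(ORIGINAL))  # False: the class closes at the first ']'
print(compiles(FLAT))      # True
print(split_pieces("Hello world, how are you?"))
# ['Hello', ' world', ' how', ' are', ' you']
```

One caveat, stated as an assumption: in engines that do support nested character classes, the original pattern would presumably also exclude the literal characters `(`, `|`, and `)`, so the flattened variant may not be strictly identical for inputs containing those characters.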

LumiOpen org

I'm sorry, but I don't know why these decisions were made. We inherited the splitting in this tokenizer, maybe from BLOOM? I forget exactly.

I think you might be right, though, that the nested brackets are not required. It honestly looks like the person who wrote this was a little bit confused, but I don't really know the larger context or what the intentions were.

It might be worth looking at Llama.cpp's support for this tokenizer; it might be more portable.

jonabur changed discussion status to closed
