Clarification on the regex of the tokenizer configuration

by antoine-agthe-unity - opened 17 days ago

17 days ago

Your JSON tokenization config uses the following regex to split the input.

" ?[^(\\s|[.,!?\u2026\u3002\uff0c\u3001\u0964\u06d4\u060c])]+"
or \s?[^(\s|[.,!?…。，、।۔،])]+

I don't understand why you have nested brackets for [.,!?…。，、।۔،].
Why do you separate \s from the rest of the characters?
Also, why do you try to capture (with parentheses)?

Wouldn't it be the same as " ?[^\s.,!?…。，、।۔،]+"?
Maybe there is a fancy regex pattern I don't know about.

For context, I am trying to load this configuration in .NET (C#), and the standard regex engine does not understand this regex.
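
To make the question concrete, here is a minimal Python sketch comparing the pattern from the config with the flattened form. Python's `re` module stands in for a "standard" engine here; I believe it behaves like .NET on this point (a `[` inside a character class is a literal, so the class ends at the first `]` and the following `)` is unbalanced), but that equivalence is an assumption. The names `ORIGINAL`, `FLAT`, `FLAT_WITH_PARENS`, `compiles` and `pieces` are made up for illustration. `FLAT_WITH_PARENS` covers one possible difference: if the engine the config was written for supports nested sets as a union (Oniguruma does), the original class would also exclude the literal characters `(`, `|` and `)`.

```python
import re

# Pattern as written in the tokenizer config (JSON escapes kept as-is).
ORIGINAL = " ?[^(\\s|[.,!?\u2026\u3002\uff0c\u3001\u0964\u06d4\u060c])]+"

# Flattened form proposed in the question: one negated class, no nesting.
FLAT = " ?[^\\s.,!?\u2026\u3002\uff0c\u3001\u0964\u06d4\u060c]+"

# Assumption: under nested-set (union) semantics the original class also
# excludes "(", "|" and ")", so this variant adds them explicitly.
FLAT_WITH_PARENS = " ?[^\\s.,!?\u2026\u3002\uff0c\u3001\u0964\u06d4\u060c()|]+"


def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


def pieces(pattern: str, text: str) -> list[str]:
    """Return every non-overlapping match of the pattern in the text."""
    return re.findall(pattern, text)


if __name__ == "__main__":
    sample = "Hello, world! (ok)"
    print(compiles(ORIGINAL))                # False: unbalanced parenthesis
    print(pieces(FLAT, sample))              # ['Hello', ' world', ' (ok)']
    print(pieces(FLAT_WITH_PARENS, sample))  # ['Hello', ' world', 'ok']
```

The two flattened variants only differ on input that contains `(`, `|` or `)`, so which one matches the tokenizer's actual behavior depends on how the original engine reads the nested brackets.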

jonabur

LumiOpen org 16 days ago

I'm sorry, but I don't know why these decisions were made. We inherited the splitting in this tokenizer, maybe from BLOOM? I forget exactly.

I think you might be right, though, that the nested brackets are not required. It honestly looks like the person who wrote this was a little bit confused, but I don't really know the larger context or what the intentions were.

It might be worth looking at Llama.cpp's support for this tokenizer; it might be more portable.

jonabur changed discussion status to closed 16 days ago
