Token issue? <start_of_turn> split into multiple tokens

#3
opened by mw55a

As the title says: <start_of_turn> seems to be split into multiple tokens when running under llama.cpp, whereas every other Gemma 3 GGUF has it as a single token.

It only seems to be <start_of_turn> that has this problem; <end_of_turn> is still a single token.

The tokens that <start_of_turn> is split into are:

'<':655, 'start':3041, '_':236779, 'of':1340, '_':236779, 'turn':887, '>':236813

Other Gemma 3 GGUFs have this as a single token: 105.

Edit: I had to edit this post because it seems you can't write start_of_turn with the angle brackets around it on here; it just disappears from the text.
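A minimal way to reproduce this, assuming a local llama-server started with the QAT GGUF on the default port 8080 (whether the /tokenize endpoint parses special tokens may vary by llama.cpp build):

// Assumes: llama-server -m gemma-3-12b-it-q4_0.gguf running locally on port 8080
const res = await fetch("http://127.0.0.1:8080/tokenize", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ content: "<start_of_turn>" }),
});
const { tokens } = await res.json();
console.log(tokens); // a correct file should return the single ID 105; the QAT GGUF returns the seven IDs listed above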

Can confirm, I'm seeing the same issue with the latest llama.cpp (b5050).

So I used HF's GGUF JavaScript package to inspect the GGUF:

import { gguf } from '@huggingface/gguf';

// Read the QAT GGUF's metadata straight from the Hub (no full download needed)
const { metadata, tensorInfos } = await gguf("https://huggingface.co/google/gemma-3-12b-it-qat-q4_0-gguf/resolve/main/gemma-3-12b-it-q4_0.gguf", {
  additionalFetchHeaders: {
    "Authorization": "Bearer " + process.env.HF_TOKEN,
  }
});

console.log(metadata['tokenizer.ggml.tokens']);

// Look up <start_of_turn> in the vocabulary and print its ID and token type
const startId = metadata['tokenizer.ggml.tokens'].indexOf("<start_of_turn>");
console.log(startId);
console.log(metadata['tokenizer.ggml.token_type'][startId]);

This correctly prints the token ID 105 for <start_of_turn>. However, the token_type is set to 1 (normal token) while it should be 3 (control token).
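For reference, the values in tokenizer.ggml.token_type follow llama.cpp's llama_token_type enum; a small lookup table (continuing from the snippet above) makes the output readable:

// 0 = UNDEFINED, 1 = NORMAL, 2 = UNKNOWN, 3 = CONTROL, 4 = USER_DEFINED, 5 = UNUSED, 6 = BYTE
const TOKEN_TYPES = ["UNDEFINED", "NORMAL", "UNKNOWN", "CONTROL", "USER_DEFINED", "UNUSED", "BYTE"];
console.log(TOKEN_TYPES[metadata['tokenizer.ggml.token_type'][startId]]); // prints "NORMAL" here, should be "CONTROL"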

Checking again with the non-QAT GGUF from unsloth, metadata['tokenizer.ggml.token_type'][startId] prints 3:

const { metadata, tensorInfos } = await gguf("https://huggingface.co/unsloth/gemma-3-4b-it-GGUF/resolve/main/gemma-3-4b-it-BF16.gguf", {

(the rest of the code is the same)
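The full comparison can be scripted the same way; here is a sketch (assuming both files share the same vocabulary order, which holds across the Gemma 3 family) that lists every token whose type differs between the two GGUFs:

import { gguf } from '@huggingface/gguf';

const load = (url) => gguf(url, {
  additionalFetchHeaders: { "Authorization": "Bearer " + process.env.HF_TOKEN },
});

// QAT GGUF vs. a reference GGUF quantized with llama.cpp
const { metadata: qat } = await load("https://huggingface.co/google/gemma-3-12b-it-qat-q4_0-gguf/resolve/main/gemma-3-12b-it-q4_0.gguf");
const { metadata: ref } = await load("https://huggingface.co/unsloth/gemma-3-4b-it-GGUF/resolve/main/gemma-3-4b-it-BF16.gguf");

const tokens = qat['tokenizer.ggml.tokens'];
for (let i = 0; i < tokens.length; i++) {
  const a = qat['tokenizer.ggml.token_type'][i];
  const b = ref['tokenizer.ggml.token_type'][i];
  if (a !== b) console.log(i, JSON.stringify(tokens[i]), a, '->', b);
}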

@google Can you regenerate the GGUF while correcting the token type? I think it's better to copy-paste this metadata key from an existing GGUF.
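A hedged sketch of that copy-from-a-good-file idea: dump the reference token_type array to JSON so it can be grafted into the regenerated metadata (the output filename is arbitrary).

import { writeFileSync } from 'node:fs';
import { gguf } from '@huggingface/gguf';

// Pull the known-good token types from a reference GGUF and save them for re-use
const { metadata } = await gguf("https://huggingface.co/unsloth/gemma-3-4b-it-GGUF/resolve/main/gemma-3-4b-it-BF16.gguf");
writeFileSync("token_type.json", JSON.stringify(Array.from(metadata['tokenizer.ggml.token_type'])));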

Apparently this is not the only token issue these QAT versions have. From here: https://www.reddit.com/r/LocalLLaMA/comments/1jvi860/comment/mmdgpim/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button

"I just checked, there is indeed a whole lot of tokens (6411 to be precise) that are configured differently between the qat models and the models quantized with llama.cpp "

@ngxson
Here are the tokens that differ between a GGUF generated from llama.cpp and the QAT one (ignoring all the <unused[N]> tokens, which are NORMAL instead of CONTROL).

NORMAL->CONTROL

['<mask>', '<start_of_turn>', '<end_of_turn>', '<start_of_image>', '<end_of_image>']

--------------------------
NORMAL->USER_DEFINED

['[multimodal]', '\n', '\n\n', '\n\n\n', '\n\n\n\n', '\n\n\n\n\n', '\n\n\n\n\n\n', '\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '▁▁', '▁▁▁', '▁▁▁▁', '▁▁▁▁▁', '▁▁▁▁▁▁', '▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '<table>', '<caption>', '<thead>', '<tbody>', '<tfoot>', '<tr>', '<th>', '<td>', '</table>', '</caption>', '</thead>', '</tbody>', '</tfoot>', '</tr>', '</th>', '</td>', '<h1>', '<h2>', '<h3>', '<h4>', '<h5>', '<h6>', '<blockquote>', '</h1>', '</h2>', '</h3>', '</h4>', '</h5>', '</h6>', '</blockquote>', '<strong>', '<em>', '<b>', '<i>', '<u>', '<s>', '<sub>', '<sup>', '<code>', '</strong>', '</em>', '</b>', '</i>', '</u>', '</s>', '</sub>', '</sup>', '</code>', '<a>', '<html>', '<body>', '<img>', '<span>', '<bbox>', '<ul>', '<li>', '<div>', 
'<iframe>', '<footer>', '</a>', '</html>', '</body>', '</img>', '</span>', '</bbox>', '</ul>', '</li>', '</div>', '</iframe>', '</footer>', '\t', '\t\t', '\t\t\t', '\t\t\t\t', '\t\t\t\t\t', '\t\t\t\t\t\t', '\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t']

--------------------------
UNKNOWN->CONTROL

['<unk>']

Is it expected for the <unk> token to be CONTROL? UNKNOWN seems to make more sense in this case...

Google org

Hi, sorry for the late response.

You're absolutely right: the <start_of_turn> token (ID 105) being split into multiple tokens is caused by an incorrect token_type in the GGUF metadata. In the QAT GGUF files, it’s currently marked as a normal token (token_type = 1) instead of a control token (token_type = 3), which causes llama.cpp to treat it incorrectly during tokenization.

Kindly try it and let us know; if you have any concerns, we will assist you. Thank you.
