Add extra tokens to the tokenizer

#33
by debasisdwivedy - opened

Hi,

I want to add some extra special tokens to my vocabulary. I am trying to do that as below:

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    attn_implementation='eager',
    token=os.getenv("HUGGINGFACE_TOKEN"),
    torch_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    pad_token=ChatmlSpecialTokens.pad_token.value,
    eos_token=ChatmlSpecialTokens.eos_token.value,
    additional_special_tokens=ChatmlSpecialTokens.list(),
)

I have also tried processor = AutoProcessor.from_pretrained and processor.tokenizer.

I am trying to resize the model with model.resize_token_embeddings(len(tokenizer)).
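
Spelled out, the flow I intend is roughly the following (just a sketch of what I think should happen; ChatmlSpecialTokens is my own enum, as above):

# Register the extra tokens on the tokenizer (an alternative to passing them
# to from_pretrained as above), then grow the model's input embeddings so the
# new token ids have rows to index into.
tokenizer.add_special_tokens(
    {"additional_special_tokens": ChatmlSpecialTokens.list()}
)
model.resize_token_embeddings(len(tokenizer))
model.config.pad_token_id = tokenizer.pad_token_id  # keep the config's pad id in sync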

I have also tried to resize model.language_model.embed_tokens manually, since it's a multimodal LLM, as below:

def extend_gemma_embeddings(model, tokenizer):
    """Grow the input embeddings and lm_head of the multimodal Gemma model to match the tokenizer."""
    from torch.nn import Embedding, Linear

    new_vocab_size = len(tokenizer)

    # --- Resize embed_tokens ---
    old_embed = model.language_model.embed_tokens
    old_vocab_size, embed_dim = old_embed.weight.shape

    if new_vocab_size > old_vocab_size:
        print(f"Resizing embed_tokens from {old_vocab_size} → {new_vocab_size}")
        # Create the larger embedding on the same device/dtype as the original,
        # copy over the existing rows, and leave the new rows with their default init.
        new_embed = Embedding(
            new_vocab_size, embed_dim,
            device=old_embed.weight.device, dtype=old_embed.weight.dtype,
        )
        new_embed.weight.data[:old_vocab_size] = old_embed.weight.data
        model.language_model.embed_tokens = new_embed
    else:
        print("No need to resize embed_tokens")

    # --- Resize lm_head ---
    old_lm_head = model.lm_head
    out_dim, _ = old_lm_head.weight.shape  # shape: (vocab_size, embed_dim)

    if new_vocab_size > out_dim:
        print(f"Resizing lm_head from {out_dim} → {new_vocab_size}")
        new_lm_head = Linear(
            embed_dim, new_vocab_size, bias=False,
            device=old_lm_head.weight.device, dtype=old_lm_head.weight.dtype,
        )
        new_lm_head.weight.data[:out_dim] = old_lm_head.weight.data
        model.lm_head = new_lm_head
    else:
        print("No need to resize lm_head")

    # --- Update config ---
    model.vocab_size = new_vocab_size
    model.config.vocab_size = new_vocab_size
    model.config.pad_token_id = tokenizer.pad_token_id

    return model
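
I call it right after loading the base model and sanity-check the sizes afterwards (just a quick check on my side; the attribute paths are the ones used in the function above):

model = extend_gemma_embeddings(model, tokenizer)

# Every id the tokenizer can produce must have an embedding row and an lm_head output
assert model.language_model.embed_tokens.num_embeddings >= len(tokenizer)
assert model.lm_head.out_features >= len(tokenizer)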

My LoRA config is as below:

peft_config = LoraConfig(
    r=rank_dimension,
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    target_modules=["gate_proj", "q_proj", "lm_head", "o_proj", "k_proj",
                    "embed_tokens", "down_proj", "up_proj", "v_proj"],
    task_type=TaskType.CAUSAL_LM,
)

trainer = SFTTrainer(
    model=model,
    args=training_arguments,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    processing_class=tokenizer,
    peft_config=peft_config,
)

trainer.train()

But I get the following errors:

/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1553: indexSelectLargeIndex: block: [54,0,0], thread: [34,0,0] Assertion `srcIndex < srcSelectDimSize` failed

RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, CUDA_R_16BF, ldc, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`

Could someone from the team please let us know the right way to add special tokens to the vocabulary/tokenizer?

Regards


Hi @debasisdwivedy,

Welcome to the Gemma family of open-source models. Adding new special tokens and resizing the model's embeddings is the right track, especially the manual resizing function extend_gemma_embeddings. I have examined your code above and made a few modifications, after which it works without any issues. Please find the attached gist file for your reference; the configuration for adding the new tokens is done in that gist.
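
For reference, the general pattern for adding tokens is roughly the following (a sketch of the usual recipe, not the exact contents of the gist; model_name and ChatmlSpecialTokens are taken from your post):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Register the extra special tokens on the tokenizer first
tokenizer = AutoTokenizer.from_pretrained(model_name)
num_added = tokenizer.add_special_tokens(
    {"additional_special_tokens": ChatmlSpecialTokens.list()}
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    attn_implementation="eager",
    torch_dtype=torch.bfloat16,
)

# Resize before wrapping the model with PEFT/LoRA, so the adapters are created
# against the final embedding and lm_head shapes
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))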

Thanks.

Hi @BalakrishnaCh,

Thanks for responding.

One quick note: your gist has an HF token in it. I wanted to bring that to your attention in case it's still valid.

The approach did not work. I tried it on AWS and GCP machines, and my packages are listed below:

+ accelerate==1.9.0
 + aiohappyeyeballs==2.6.1
 + aiohttp==3.12.15
 + aiosignal==1.4.0
 + attrs==25.3.0
 + certifi==2025.7.14
 + charset-normalizer==3.4.2
 + datasets==4.0.0
 + dill==0.3.8
 + filelock==3.18.0
 + frozenlist==1.7.0
 + fsspec==2025.3.0
 + hf-xet==1.1.5
 + huggingface-hub==0.34.3
 + idna==3.10
 + jinja2==3.1.6
 + markupsafe==3.0.2
 + mpmath==1.3.0
 + multidict==6.6.3
 + multiprocess==0.70.16
 + networkx==3.5
 + numpy==2.3.2
 + nvidia-cublas-cu12==12.6.4.1
 + nvidia-cuda-cupti-cu12==12.6.80
 + nvidia-cuda-nvrtc-cu12==12.6.77
 + nvidia-cuda-runtime-cu12==12.6.77
 + nvidia-cudnn-cu12==9.5.1.17
 + nvidia-cufft-cu12==11.3.0.4
 + nvidia-cufile-cu12==1.11.1.6
 + nvidia-curand-cu12==10.3.7.77
 + nvidia-cusolver-cu12==11.7.1.2
 + nvidia-cusparse-cu12==12.5.4.2
 + nvidia-cusparselt-cu12==0.6.3
 + nvidia-nccl-cu12==2.26.2
 + nvidia-nvjitlink-cu12==12.6.85
 + nvidia-nvtx-cu12==12.6.77
 + packaging==25.0
 + pandas==2.3.1
 + peft==0.17.0
 + propcache==0.3.2
 + psutil==7.0.0
 + pyarrow==21.0.0
 + python-dateutil==2.9.0.post0
 + pytz==2025.2
 + pyyaml==6.0.2
 + regex==2025.7.34
 + requests==2.32.4
 + safetensors==0.5.3
 + setuptools==80.9.0
 + six==1.17.0
 + sympy==1.14.0
 + tokenizers==0.21.4
 + torch==2.7.1
 + tqdm==4.67.1
 + transformers==4.54.1
 + triton==3.3.1
 + trl==0.20.0
 + typing-extensions==4.14.1
 + tzdata==2025.2
 + urllib3==2.5.0
 + xxhash==3.5.0
 + yarl==1.20.1
 + fsspec==2025.7.0
 + pillow==11.3.0
 + timm==1.0.19
 + torchvision==0.22.1

Would you mind checking whether we are running the same packages and versions? Mine are updated to the latest versions.

Going with your code example, if I comment out the part of the code below, it works:

num_added_tokens = tokenizer.add_special_tokens(
    {'additional_special_tokens': ChatmlSpecialTokens.additional_special_tokens}
)
print(f"Added {num_added_tokens} new tokens to the tokenizer.")

But then it's the same issue: I am not able to add new tokens, so the special tokens get split by the tokenizer. :(
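
For what it's worth, this is how I check whether a special token survives tokenization in one piece (the token string below is just a hypothetical example; any of the added ChatML markers works):

special = "<my_special_token>"  # hypothetical marker for illustration
print(tokenizer.tokenize(special))               # one piece if registered, several pieces if it gets split
print(tokenizer.convert_tokens_to_ids(special))  # a valid single id once the token is registered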

Regards,

@debasisdwivedy @BalakrishnaCh,

Hello,

I'm also trying to add new tokens to the tokenizer and encountering the same error. I've tried both of your approaches, but neither worked. Additionally, I attempted to train only the new tokens by registering a hook wrapper for the embeddings, but that didn't resolve the issue either.

As a last resort, I tried using only LoRA adapters, but the same error persisted.

Now, when I attempt to run basic inference on Gemma 3n E2B, it fails when loading the model with BitsAndBytesConfig. I suspect the issue might originate from there (because it runs normally when I load the model without BitsAndBytesConfig).
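
For context, the quantized load that fails for me looks roughly like this (the 4-bit settings and checkpoint id are just my own test setup):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3n-E2B-it",  # checkpoint id shown only for illustration
    quantization_config=bnb_config,
)

# Loading the same checkpoint without quantization_config works fine for me.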

If anyone finds a solution, I’d greatly appreciate it if you could share it with me. I’m happy to discuss my attempts in more detail.

Also, if there are any errors in this reply, please let me know so I can correct them.

Thanks
