Add extra tokens to the tokenizer
Hi,
I want to add some extra special tokens to my vocabulary. This is what I am trying:
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    attn_implementation='eager',
    token=os.getenv("HUGGINGFACE_TOKEN"),
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    pad_token=ChatmlSpecialTokens.pad_token.value,
    eos_token=ChatmlSpecialTokens.eos_token.value,
    additional_special_tokens=ChatmlSpecialTokens.list(),
)
I have also tried processor = AutoProcessor.from_pretrained and processor.tokenizer, roughly as sketched below.
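A minimal sketch of that variant (assuming the same ChatmlSpecialTokens enum and model_name as above):

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(model_name, token=os.getenv("HUGGINGFACE_TOKEN"))
tokenizer = processor.tokenizer  # the text tokenizer inside the multimodal processor

# Register the extra markers as special tokens so they are never split.
num_added = tokenizer.add_special_tokens(
    {"additional_special_tokens": ChatmlSpecialTokens.list()}
)
print(f"Added {num_added} tokens, new vocab size: {len(tokenizer)}")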
I am resizing the model with model.resize_token_embeddings(len(tokenizer)). I have also tried resizing model.language_model.embed_tokens manually, since it is a multimodal LLM, as below:
def extend_gemma_embeddings(model, tokenizer):
    import torch
    from torch.nn import Embedding, Linear

    new_vocab_size = len(tokenizer)

    # --- Resize embed_tokens ---
    old_embed = model.language_model.embed_tokens
    old_vocab_size, embed_dim = old_embed.weight.shape
    if new_vocab_size > old_vocab_size:
        print(f"Resizing embed_tokens from {old_vocab_size} → {new_vocab_size}")
        # Keep the dtype/device of the original weights (the model is loaded in bfloat16).
        new_embed = Embedding(new_vocab_size, embed_dim,
                              dtype=old_embed.weight.dtype,
                              device=old_embed.weight.device)
        new_embed.weight.data[:old_vocab_size] = old_embed.weight.data
        model.language_model.embed_tokens = new_embed
    else:
        print("No need to resize embed_tokens")

    # --- Resize lm_head ---
    old_lm_head = model.lm_head
    out_dim, _ = old_lm_head.weight.shape  # shape: (vocab_size, embed_dim)
    if new_vocab_size > out_dim:
        print(f"Resizing lm_head from {out_dim} → {new_vocab_size}")
        new_lm_head = Linear(embed_dim, new_vocab_size, bias=False,
                             dtype=old_lm_head.weight.dtype,
                             device=old_lm_head.weight.device)
        new_lm_head.weight.data[:out_dim] = old_lm_head.weight.data
        model.lm_head = new_lm_head
    else:
        print("No need to resize lm_head")

    # --- Update config ---
    model.vocab_size = new_vocab_size
    model.config.vocab_size = new_vocab_size
    model.config.pad_token_id = tokenizer.pad_token_id
    return model
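For comparison, the built-in helper covers most of this in one call; a sketch of how I would expect it to be used here (resize_token_embeddings and tie_weights are standard PreTrainedModel methods):

# Add the special tokens first, then grow the embedding matrix to match.
tokenizer.add_special_tokens(
    {"additional_special_tokens": ChatmlSpecialTokens.list()}
)
model.resize_token_embeddings(len(tokenizer))
model.tie_weights()  # re-tie input/output embeddings if the model ties them
model.config.pad_token_id = tokenizer.pad_token_id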
My LoRA config is as below (an alternative using modules_to_save is sketched right after it):
peft_config = LoraConfig(
    r=rank_dimension,
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    target_modules=["gate_proj", "q_proj", "lm_head", "o_proj", "k_proj",
                    "embed_tokens", "down_proj", "up_proj", "v_proj"],
    task_type=TaskType.CAUSAL_LM,
)
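Since the newly added rows of embed_tokens and lm_head start from random values, another option would be to train those two modules fully via modules_to_save instead of putting LoRA adapters on them. A sketch only (I have not confirmed it avoids the CUDA assert):

from peft import LoraConfig, TaskType

peft_config = LoraConfig(
    r=rank_dimension,
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    # LoRA adapters only on the attention/MLP projections...
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    # ...and full, trainable copies of the resized embedding and output head.
    modules_to_save=["embed_tokens", "lm_head"],
    task_type=TaskType.CAUSAL_LM,
)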
trainer = SFTTrainer(
    model=model,
    args=training_arguments,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    processing_class=tokenizer,
    peft_config=peft_config,
)
trainer.train()
But I get the following error:
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1553: indexSelectLargeIndex: block: [54,0,0], thread: [34,0,0] Assertion `srcIndex < srcSelectDimSize` failed
RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, CUDA_R_16BF, ldc, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
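For what it's worth, the indexSelectLargeIndex assertion usually means some token id is >= the number of rows in the embedding table, so a consistency check along these lines should show whether that is the case (a sketch; ChatmlSpecialTokens is my enum from above):

# Compare the tokenizer's id range with the actual embedding matrix.
embed_rows = model.get_input_embeddings().weight.shape[0]
print("len(tokenizer) =", len(tokenizer), "| embedding rows =", embed_rows)

# Every new special token must map to an id below the embedding size.
for tok in ChatmlSpecialTokens.list():
    tid = tokenizer.convert_tokens_to_ids(tok)
    assert tid is not None and tid < embed_rows, (tok, tid, embed_rows)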
Could someone from the team please let us know the recommended way to add special tokens to the vocabulary/tokenizer?
Regards
Hi @debasisdwivedy,
Welcome to the Gemma family of open-source models. Adding new special tokens and resizing the model's embeddings seems to be on the right track, especially the manual resizing function extend_gemma_embeddings. I have examined your code above and made a few modifications, and after that it works fine without any issues. Please find the attached gist file for your reference; the configuration for adding the new tokens is done in that gist.
Thanks.
Hi @BalakrishnaCh,
Thanks for responding.
One quick note: your gist has an HF token in it. Wanted to bring that to your attention, in case it's still valid.
The approach did not work. I tried it on AWS and GCP machines, and my packages are as below:
+ accelerate==1.9.0
+ aiohappyeyeballs==2.6.1
+ aiohttp==3.12.15
+ aiosignal==1.4.0
+ attrs==25.3.0
+ certifi==2025.7.14
+ charset-normalizer==3.4.2
+ datasets==4.0.0
+ dill==0.3.8
+ filelock==3.18.0
+ frozenlist==1.7.0
+ fsspec==2025.3.0
+ hf-xet==1.1.5
+ huggingface-hub==0.34.3
+ idna==3.10
+ jinja2==3.1.6
+ markupsafe==3.0.2
+ mpmath==1.3.0
+ multidict==6.6.3
+ multiprocess==0.70.16
+ networkx==3.5
+ numpy==2.3.2
+ nvidia-cublas-cu12==12.6.4.1
+ nvidia-cuda-cupti-cu12==12.6.80
+ nvidia-cuda-nvrtc-cu12==12.6.77
+ nvidia-cuda-runtime-cu12==12.6.77
+ nvidia-cudnn-cu12==9.5.1.17
+ nvidia-cufft-cu12==11.3.0.4
+ nvidia-cufile-cu12==1.11.1.6
+ nvidia-curand-cu12==10.3.7.77
+ nvidia-cusolver-cu12==11.7.1.2
+ nvidia-cusparse-cu12==12.5.4.2
+ nvidia-cusparselt-cu12==0.6.3
+ nvidia-nccl-cu12==2.26.2
+ nvidia-nvjitlink-cu12==12.6.85
+ nvidia-nvtx-cu12==12.6.77
+ packaging==25.0
+ pandas==2.3.1
+ peft==0.17.0
+ propcache==0.3.2
+ psutil==7.0.0
+ pyarrow==21.0.0
+ python-dateutil==2.9.0.post0
+ pytz==2025.2
+ pyyaml==6.0.2
+ regex==2025.7.34
+ requests==2.32.4
+ safetensors==0.5.3
+ setuptools==80.9.0
+ six==1.17.0
+ sympy==1.14.0
+ tokenizers==0.21.4
+ torch==2.7.1
+ tqdm==4.67.1
+ transformers==4.54.1
+ triton==3.3.1
+ trl==0.20.0
+ typing-extensions==4.14.1
+ tzdata==2025.2
+ urllib3==2.5.0
+ xxhash==3.5.0
+ yarl==1.20.1
+ fsspec==2025.7.0
+ pillow==11.3.0
+ timm==1.0.19
+ torchvision==0.22.1
Would you mind checking whether we are running the same packages and versions? Mine are updated to the latest versions.
Going with your code example, if I comment out the part of the code below, it works:
num_added_tokens = tokenizer.add_special_tokens(
    {'additional_special_tokens': ChatmlSpecialTokens.additional_special_tokens}
)
print(f"Added {num_added_tokens} new tokens to the tokenizer.")
But then it's the same issue: I am not able to add new tokens, so the special tokens get split during tokenization. :(
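To illustrate the splitting (a sketch; the printed pieces are just an example of the kind of output I mean):

marker = ChatmlSpecialTokens.additional_special_tokens[0]  # one of my ChatML markers

# Without add_special_tokens the marker is split into several sub-word pieces...
print(tokenizer.tokenize(marker))   # e.g. ['<', 'tool', '_', 'call', '>']

# ...whereas after registering it, it should come back as a single token.
tokenizer.add_special_tokens({"additional_special_tokens": [marker]})
print(tokenizer.tokenize(marker))   # e.g. ['<tool_call>']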
Regards,
@debasisdwivedy @BalakrishnaCh,
Hello,
I'm also trying to add new tokens to the tokenizer and encountering the same error. I've tried both of your approaches, but neither worked. Additionally, I attempted to train only the new tokens by registering a gradient-hook wrapper on the embeddings (roughly as sketched below), but that didn't resolve the issue either.
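The hook wrapper idea was roughly this (a sketch; num_new_tokens and going through get_input_embeddings() are my assumptions):

import torch

embed = model.get_input_embeddings()
embed.weight.requires_grad_(True)

old_vocab_size = len(tokenizer) - num_new_tokens  # rows that existed before resizing

def zero_old_rows(grad):
    # Keep gradients only for the newly added token rows.
    mask = torch.zeros_like(grad)
    mask[old_vocab_size:] = 1.0
    return grad * mask

embed.weight.register_hook(zero_old_rows)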
As a last resort, I tried using only LoRA adapters, but the same error persisted.
Now, when I attempt to run basic inference on Gemma 3n E2B, it fails when loading the model with BitsAndBytesConfig. I suspect the issue might originate from there, because it runs normally when I load the model without BitsAndBytesConfig.
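For context, the quantized load that fails is along these lines (a sketch; the model id and the 4-bit settings here are my assumptions, not an exact reproduction):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3n-E2B-it",   # assumed model id
    quantization_config=bnb_config,
    device_map="auto",
)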
If anyone finds a solution, I’d greatly appreciate it if you could share it with me. I’m happy to discuss my attempts in more detail.
Also, if there are any errors in this reply, please let me know so I can correct them.
Thanks