Repeating tokens - using Assamese language

#13

by ColdMeat2003 - opened Jul 15

ColdMeat2003

Jul 15

I am facing an issue: the model seems to repeat its tokens while translating. I am translating English->Assamese, and this happens only for certain texts. I know that long contexts can cause this, so I made sure each of my English text sequences is within 300 tokens. I have tried (and wasted compute on :)) multiple model parameter settings, but to no avail. My code is below. Rest assured that the text input is below 300 tokens (according to spaCy).

def translate_text_sarvam(text, target_language="Assamese"):
    """Translate text using Sarvam-Translate model on GPU"""
    global SARVAM_MODEL, SARVAM_TOKENIZER

    # Models should already be loaded by the calling function
    if SARVAM_MODEL is None or SARVAM_TOKENIZER is None:
        SARVAM_MODEL, SARVAM_TOKENIZER = load_models()

    try:
        messages = [
            {
                "role": "system",
                "content": (
                    f"You are a professional translator. Translate the following text to {target_language}. "
                    "Please do not repeat the same word or phrase multiple times."
                )
            },
            {"role": "user", "content": text}
        ]

        text = SARVAM_TOKENIZER.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )

        model_inputs = SARVAM_TOKENIZER([text], return_tensors="pt").to(SARVAM_MODEL.device)
        input_tokens_size = model_inputs.input_ids.shape[1]
        print(f"Input tokens size: {input_tokens_size}")
        MAX_CONTEXT_SIZE = 8000
        max_new_tokens = MAX_CONTEXT_SIZE - input_tokens_size

        if max_new_tokens < 0:
            raise ValueError("Inputs are too long for the model. Please shorten the input text.")

        max_factor = 3

        with torch.no_grad():
            generated_ids = SARVAM_MODEL.generate(
                **model_inputs,
                do_sample=True,
                temperature=0.01,
                num_return_sequences=1,
                max_new_tokens=input_tokens_size*max_factor
            )

        output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
        output_tokens_size = len(output_ids)
        print(f"Output tokens size: {output_tokens_size}")
        translated_text = SARVAM_TOKENIZER.decode(output_ids, skip_special_tokens=True)

        if output_tokens_size == input_tokens_size*max_factor:
            # Repeating token case
            with open("translated.txt", "w", encoding="utf-8") as f:
                f.write(translated_text)

        return translated_text

    except Exception as e:
        print(f"Error in Sarvam translation: {e}")
        return ""
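
The length check in the code above only flags outputs that run all the way to the token cap. A minimal sketch of a more direct check is to count repeated n-grams in the generated token IDs. The helper name `has_repeated_ngram` and the thresholds `n=4` and `max_repeats=5` are assumptions chosen for illustration, not values recommended by the model authors, and a legitimate translation with a recurring phrase could also trip it.

```python
from collections import Counter


def has_repeated_ngram(token_ids, n=4, max_repeats=5):
    """Return True if any n-gram of token IDs occurs more than max_repeats times.

    Hypothetical helper: n and max_repeats are illustrative defaults only.
    """
    if len(token_ids) < n:
        return False
    # Count every sliding window of n consecutive token IDs.
    counts = Counter(
        tuple(token_ids[i:i + n]) for i in range(len(token_ids) - n + 1)
    )
    return max(counts.values()) > max_repeats
```

It could be called on `output_ids` right after generation, for example `if has_repeated_ngram(output_ids): ...`, to log the offending input alongside the output.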

iamgrootns

Jul 15

Does this model translate from regional languages to other languages? I was trying Malayalam to Hindi, but it gives me English output with the code provided on the page itself. Any idea how to translate regional languages to Hindi?

ColdMeat2003

Jul 15

I think it does, because there is a "source_lang_code" parameter in the Sarvam API template. Check the model card once.
Or, if it is not possible with one prompt, maybe translate to English first and then convert the English text to the other regional language as the next step. That will double the compute usage though :(

iamgrootns

Jul 16

I have a 16 GB 5060 Ti and am not sure if it's enough. I have 3 other models already running on that GPU, so I can't double the compute usage.

GokulNC

Sarvam AI org Jul 16

Hi @ColdMeat2003 . Your code looks good. Can you please share the input text for which the repetition issue is happening?
We can debug and get back.

Also BTW, having just this instruction in the system prompt should be sufficient: Translate the following text to {target_language}.
These additional instructions may not add much value: You are a professional translator. Please do not repeat the same word or phrase multiple times. (because the model was not explicitly trained on such instructions)
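
As a minimal sketch of the suggestion above, the message list can be built with only the single-sentence system prompt. The helper name `build_messages` is an assumption for illustration; the message structure mirrors the one used in the original post.

```python
def build_messages(text, target_language="Assamese"):
    """Build chat messages with only the minimal translation instruction.

    Hypothetical helper; the prompt wording follows the suggestion in this thread.
    """
    return [
        {
            "role": "system",
            "content": f"Translate the following text to {target_language}.",
        },
        {"role": "user", "content": text},
    ]
```

The returned list can be passed to `apply_chat_template` in place of the `messages` variable in the original function.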

GokulNC

Sarvam AI org Jul 16

Hi @iamgrootns , the model is currently not trained for Indic to Indic translation.
It is trained only English->Indic and Indic->English.

Until we release a new version supporting any language to any language translation, please do Malayalam->English and then English->Hindi as suggested by @ColdMeat2003 .
This will not double the memory (since you'll just be using the same model), but yes, it will increase the compute time for 2 model calls.
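
The two-step approach can be sketched as below. This assumes a `translate` callable with the same `(text, target_language)` signature as `translate_text_sarvam` from the first post, which returns an empty string on failure; `pivot_translate` itself is a hypothetical helper name. Because the same loaded model serves both calls, only the compute time grows, not the memory.

```python
from typing import Callable


def pivot_translate(
    text: str,
    translate: Callable[[str, str], str],
    target_language: str = "Hindi",
    pivot_language: str = "English",
) -> str:
    """Translate Indic -> English -> Indic using one model for both calls.

    `translate` is assumed to take (text, target_language) and return the
    translation, or "" on failure.
    """
    # First call: source language -> pivot language (e.g. Malayalam -> English).
    pivot_text = translate(text, pivot_language)
    if not pivot_text:
        # Skip the second call if the first translation failed.
        return ""
    # Second call: pivot language -> target language (e.g. English -> Hindi).
    return translate(pivot_text, target_language)
```

For example, `pivot_translate(malayalam_text, translate_text_sarvam)` would run Malayalam->English and then English->Hindi, where `malayalam_text` is a placeholder for the input.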


ryg81

Jul 16

The same happens on LM Studio.

Hey @GokulNC, what about IndicTrans3-beta? Is it Indic to Indic? There was not much to go on on its page, but I tested it from Malayalam to Hindi using Gradio, and it gives me this as output:

<|assistant|>
<|assistant|>
<|assistant|>
<|assistant|>
<|assistant|>
<|assistant|>
<|assistant|>
<|assistant|>
<|assistant|>
<|assistant|>
<|assistant|>
<|assistant|>
<|assistant|>
<|assistant|>
<|assistant|>

import torch
import gradio as gr
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hugging Face token
HF_TOKEN = ""

# Load model and tokenizer
model_id = "ai4bharat/IndicTrans3-beta"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float32,  # avoid float16 to disable Triton
    device_map="cpu",  # force CPU if CUDA/Triton is unstable
    token=HF_TOKEN
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Translation helper
def translate_to_hindi(text: str) -> str:
    # Format the prompt
    prompt = f"Translate the following text to Hindi: {text}"
    conversation = [{"role": "user", "content": prompt}]
    # Tokenize using chat template
    input_ids = tokenizer.apply_chat_template(
        conversation, return_tensors="pt", add_generation_prompt=True
    ).to(model.device)

    # Trim if too long
    if input_ids.shape[1] > 4096:
        input_ids = input_ids[:, -4096:]

    # Generate translation
    output_ids = model.generate(
        input_ids=input_ids,
        max_new_tokens=512,
        do_sample=False,
        num_beams=1,
        repetition_penalty=1.1,
    )

    # Decode output
    translated = tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True)
    return translated.strip()

# Gradio interface
demo = gr.Interface(
    fn=translate_to_hindi,
    inputs=gr.Textbox(label="Enter Regional Language Text"),
    outputs=gr.Textbox(label="Translation in Hindi"),
    title="Regional to Hindi Translator (IndicTrans3-beta)"
)

if __name__ == "__main__":
    demo.launch(debug=True)

I used a script like this.

ColdMeat2003 changed discussion status to closed 12 days ago
