Repeating Tokens

#10
by Koosh0610 - opened

I am trying to translate a large document (over 6000 tokens). Sending the payload just keeps repeating tokens after a certain point. The same conditions do not affect generation for shorter inputs. What could be the reason? Also, I am using vLLM to serve the model. I've tried changing the sampling params, but they cut the output short, and even shorter for small documents.

```python
system_prompt = f"Translate the text below to {target_language}."

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": text},
]

# Prepare the request payload for the vLLM server (chat/completions format)
payload = {
    "model": "sarvamai/sarvam-translate",
    "messages": messages,
    "temperature": 0.01,
    "max_completion_tokens": 65536,
    "stream": False,
}
```
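
For reference, here is a minimal sketch of sending such a payload to vLLM's OpenAI-compatible chat endpoint; the localhost URL and the sample `target_language` / `text` values are illustrative assumptions, not part of the original report:

```python
import requests

# Sample values for illustration only.
target_language = "Hindi"
text = "The quick brown fox jumps over the lazy dog."

payload = {
    "model": "sarvamai/sarvam-translate",
    "messages": [
        {"role": "system", "content": f"Translate the text below to {target_language}."},
        {"role": "user", "content": text},
    ],
    "temperature": 0.01,
    "max_completion_tokens": 4096,  # keep prompt + output within the model's context window
    "stream": False,
}

# vLLM serves an OpenAI-compatible API; the port matches `vllm serve ... --port 8000`.
response = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```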
Sarvam AI org

Hi @Koosh0610 . Thanks for reporting the issue.

It is recommended not to use the model with a context length of more than 8k tokens.
The recommended range is 4k-8k total tokens per sequence (prompt + generation tokens combined).
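
To check where a given request falls within this budget, you can count prompt tokens before sending it; a quick sketch, assuming the model's tokenizer is available via Hugging Face transformers (the sample strings are placeholders):

```python
from transformers import AutoTokenizer

# Assumes the model repo ships a standard HF tokenizer.
tokenizer = AutoTokenizer.from_pretrained("sarvamai/sarvam-translate")

system_prompt = "Translate the text below to Hindi."  # placeholder values
text = "A long document to translate..."

prompt_tokens = len(tokenizer.encode(system_prompt)) + len(tokenizer.encode(text))
print(f"Prompt tokens: {prompt_tokens}")
# prompt_tokens plus the expected translation length should stay
# within the recommended 4k-8k total tokens per sequence.
```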

We have updated the README to set this limit in the vLLM server command:

vllm serve sarvamai/sarvam-translate --port 8000 --dtype bfloat16 --max-model-len 8192

In practice, this means that if you are translating text from English to an Indic language, please ensure the English text is no longer than 2k-3k tokens, because the number of output tokens varies with the target language.

For example, for 2k English tokens, the Hindi translation may be around 2.5k output tokens, while a Malayalam translation could be around ~4k tokens and an Odia translation around ~5k tokens.

So whenever you have very large English texts to translate, please consider chunking them so that each request carries no more than 2k-3k input tokens; see the sketch below.
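
A minimal chunking sketch, assuming the model's tokenizer can be loaded with Hugging Face transformers (the `chunk_text` helper is illustrative, not part of any Sarvam API); it splits on paragraph boundaries and caps each chunk at roughly 2k input tokens:

```python
from transformers import AutoTokenizer

# Assumes the model repo ships a standard HF tokenizer.
tokenizer = AutoTokenizer.from_pretrained("sarvamai/sarvam-translate")

def chunk_text(text: str, max_tokens: int = 2000) -> list[str]:
    """Split text on paragraph boundaries so each chunk stays under max_tokens.

    A single paragraph longer than max_tokens still becomes its own
    (oversized) chunk; split such paragraphs by sentence if needed.
    """
    chunks, current, current_len = [], [], 0
    for para in text.split("\n\n"):
        n = len(tokenizer.encode(para, add_special_tokens=False))
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

# Translate each chunk with a separate request, then join the translations.
```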

Thank you, Gokul, for the quick response. This helps a lot!
