Not able to use the model

#12
by Rajdgt - opened

I am trying to use the model ai4bharat/indictrans2-en-indic-dist-200M, but it is not recognising the language.
Can someone help me use it for simple English text-to-text translation?

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_NAME = "ai4bharat/indictrans2-en-indic-dist-200M"
CACHE_DIR = "./model_cache"

# Load the tokenizer and model (IndicTrans2 ships custom modeling code, so trust_remote_code is needed)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True, cache_dir=CACHE_DIR)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME, trust_remote_code=True, cache_dir=CACHE_DIR)

# --- The Test Case ---

source_text = "This is a test from the command line"
source_lang = "2en"
target_lang = "2hi"

# --- THE CORRECT AND FINAL INPUT FORMAT ---
# The model requires both source and target tags prefixed to the text.
input_text = f"<{source_lang}> <{target_lang}> {source_text}"

inputs = tokenizer(input_text, return_tensors="pt")
print("--- Tokenization successful. ---")

# NOTE: No forced_bos_token_id is needed because the target is now specified in the input.
print("\n--- Step 4: Generating translation from model ---")
generated_tokens = model.generate(
    **inputs,
    num_return_sequences=1,
    num_beams=5,
    max_length=256
)

It is failing without even telling me that the language is invalid. I am probably missing some small part, since the model may have been updated recently.

I used the NLLB model and it works, but I am not sure why I am not able to use ai4bharat/indictrans2-en-indic-dist-200M in the same way. Please check.
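For context, this is roughly the pattern I follow with NLLB (the checkpoint facebook/nllb-200-distilled-600M is just an example), where the target language is chosen via forced_bos_token_id rather than tags written into the input text:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Example NLLB checkpoint; any NLLB-200 checkpoint follows the same pattern
nllb_name = "facebook/nllb-200-distilled-600M"
nllb_tokenizer = AutoTokenizer.from_pretrained(nllb_name, src_lang="eng_Latn")
nllb_model = AutoModelForSeq2SeqLM.from_pretrained(nllb_name)

inputs = nllb_tokenizer("This is a test from the command line", return_tensors="pt")
generated = nllb_model.generate(
    **inputs,
    # NLLB selects the target language by forcing its language token as the first decoder token
    forced_bos_token_id=nllb_tokenizer.convert_tokens_to_ids("hin_Deva"),
    num_beams=5,
    max_length=256,
)
print(nllb_tokenizer.batch_decode(generated, skip_special_tokens=True)[0])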

AI4Bharat org

The issue is occurring because the correct language tags are not being passed, and the necessary pre-processing steps expected by the IndicTrans2 models are missing.

Please refer to the HF README for a complete example script that demonstrates how to properly set the language tags and apply the required pre-processing steps for inference.
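For reference, below is a minimal sketch along the lines of the README example. It assumes the IndicTransToolkit package is installed to provide IndicProcessor for pre/post-processing, and it uses FLORES-style tags (eng_Latn, hin_Deva); the exact import path and decode call can differ between toolkit versions, so treat the README as authoritative.

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from IndicTransToolkit import IndicProcessor  # pip install IndicTransToolkit (assumed available)

model_name = "ai4bharat/indictrans2-en-indic-dist-200M"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, trust_remote_code=True)
ip = IndicProcessor(inference=True)

# IndicTrans2 expects FLORES-style language tags, not "2en"/"2hi"
src_lang, tgt_lang = "eng_Latn", "hin_Deva"
sentences = ["This is a test from the command line"]

# The processor adds the language tags and applies the normalisation the model was trained with
batch = ip.preprocess_batch(sentences, src_lang=src_lang, tgt_lang=tgt_lang)
inputs = tokenizer(batch, padding="longest", truncation=True, return_tensors="pt")

with torch.no_grad():
    generated = model.generate(**inputs, num_beams=5, num_return_sequences=1, max_length=256)

# Decode with the target-side vocabulary, then post-process back into the target script
with tokenizer.as_target_tokenizer():
    decoded = tokenizer.batch_decode(
        generated.detach().cpu().tolist(),
        skip_special_tokens=True,
        clean_up_tokenization_spaces=True,
    )
translations = ip.postprocess_batch(decoded, lang=tgt_lang)
print(translations[0])

The key difference from the snippet in the question is that no tags are written into the raw text by hand; IndicProcessor handles the tagging and normalisation before tokenization and the post-processing afterwards.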
