Model Card for HindiNER-4B-v0.1
HindiNER-4B-v0.1 - a general, unconstrained Hindi NER model
Model Details
Model Description
HindiNER-4B-v0.1 is a 4B-parameter general Hindi NER model, built on top of Nemotron-4-Mini-Hindi-4B-Instruct using a two-phase LoRA training strategy.
- Developed by: nis12ram
- Model type: Autoregressive model
- Language(s) (NLP): Hindi, English
- License: Apache License 2.0
- Finetuned from model: Nemotron-4-Mini-Hindi-4B-Instruct
Source Text Language Support:
- Hindi (written in Devanagari script) [primary support]
- English
- Hinglish or Romanized Hindi
- A mix of all of the above
NOTE
The model was not explicitly trained for NER on Hinglish data. However, since Nemotron-4-Mini-Hindi-4B-Instruct was also pretrained on Romanized Hindi, the fine-tuned model generalizes well to Hinglish.
From the Nemotron-4-Mini-Hindi-4B-Instruct paper:
The translated Hindi data comprises approximately 60 billion tokens. We then combine this synthetic data with around 40 billion real tokens (web-scraped data) to create a dataset totaling 100 billion Hindi tokens. Additionally, this entire Hindi text is transliterated into Roman script, expanding the total dataset to 220 billion tokens. The transliterated tokens are included to enable the model to support Hinglish queries.
Model's Prompt & Desired Output
- prompt:
prompt = '''<extra_id_0>System
<extra_id_1>User
You are a Hindi language expert who specializes in extracting entities from text. Given a piece of text, extract all crucial entities along with their respective context-aware entity types. Ensure that entity type is in Hindi. The output should be in JSON format.
## Output format:
```json
{{
"entities": [
{{
"type": "_",
"value": ["_", "_"]
}},
{{
"type": "_",
"value": ["_"]
}}
]
}}
```
## Text:
""" {text} """
<extra_id_1>Assistant
'''
- desired output structure:
```json
{
  "entities": [
    {
      "type": "_",
      "value": ["_", "_"]
    },
    {
      "type": "_",
      "value": ["_"]
    }
  ]
}
```
NOTE The model outputs entity types in Hindi only, regardless of the language of the source text.
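For illustration, a hypothetical (hand-written, not model-generated) output for the sentence "विराट कोहली ने 2023 विश्व कप में भारत के लिए शतक लगाया।" ("Virat Kohli scored a century for India in the 2023 World Cup.") could look like:
```json
{
  "entities": [
    {
      "type": "व्यक्ति",
      "value": ["विराट कोहली"]
    },
    {
      "type": "आयोजन",
      "value": ["2023 विश्व कप"]
    },
    {
      "type": "देश",
      "value": ["भारत"]
    }
  ]
}
```
Note that the entity types are in Hindi (व्यक्ति = person, आयोजन = event, देश = country) even though parts of the text, like the year, are not.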
Uses
- For general, unconstrained Hindi NER inference (no fixed tag set).
- For NER predictions on source text in Hindi (written in Devanagari script), English, or a mix of the two.
- For collecting large-scale synthetic NER data.
- For NER predictions on long texts (supports a context of 4096 tokens).
- For NER predictions on text covering diverse domains.
- For domain-specific NER fine-tuning.
Bias, Risks, and Limitations
- The model can output biased or unfaithful entity type & entity value pairs.
- The model can miss or ignore important entity type & entity value pairs, since unconstrained NER inference provides no mechanism to enforce completeness.
How to Get Started with the Model
Check out the Colab Notebook
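If the notebook is unavailable, the snippet below is a minimal inference sketch; the Hub checkpoint id and the generation settings are assumptions, not the notebook's exact code.

````python
# Minimal inference sketch; the checkpoint id "nis12ram/HindiNER-4B-v0.1"
# and the generation settings are assumptions.
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nis12ram/HindiNER-4B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# The prompt template from "Model's Prompt & Desired Output" above.
# Doubled braces survive str.format, which only substitutes {text}.
prompt = '''<extra_id_0>System
<extra_id_1>User
You are a Hindi language expert who specializes in extracting entities from text. Given a piece of text, extract all crucial entities along with their respective context-aware entity types. Ensure that entity type is in Hindi. The output should be in JSON format.
## Output format:
```json
{{
"entities": [
{{
"type": "_",
"value": ["_", "_"]
}},
{{
"type": "_",
"value": ["_"]
}}
]
}}
```
## Text:
""" {text} """
<extra_id_1>Assistant
'''

text = "विराट कोहली ने 2023 विश्व कप में भारत के लिए शतक लगाया।"
inputs = tokenizer(prompt.format(text=text), return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
completion = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)

# Strip a ```json fence if the model emits one, then parse.
payload = completion.split("```json")[-1].split("```")[0].strip()
print(json.loads(payload))
````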
Training Details
Training Data
- entity_type_hi_pilener
- HindiNER-golden-dataset
Training Procedure
The model is trained using a two-phase LoRA strategy:
- Phase 1: joint LoRA fine-tuning on both languages
Nemotron-4-Mini-Hindi-4B-Instruct is fine-tuned on the combined dataset (entity_type_hi_pilener + HindiNER-golden-dataset). HindiNER-golden-dataset is duplicated 3 times to address the under-representation of NER on Hindi (written in Devanagari script); this duplication did not lead to overfitting, owing to the high diversity of HindiNER-golden-dataset and the chosen training hyperparameters. A sketch of this data mix follows the datapoint proportions below.
Datapoint proportions:
- 37855 from entity_type_hi_pilener
- 952 × 3 from HindiNER-golden-dataset
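A minimal sketch of this mix with the datasets library (the Hub dataset ids are assumptions):

```python
# Sketch of the phase-1 training mix; the Hub dataset ids are assumptions.
from datasets import load_dataset, concatenate_datasets

pilener = load_dataset("nis12ram/entity_type_hi_pilener", split="train")  # 37855 rows
golden = load_dataset("nis12ram/HindiNER-golden-dataset", split="train")  # 952 rows

# Duplicate the golden set 3x to offset the under-representation of
# Devanagari Hindi, then shuffle the combined dataset.
train = concatenate_datasets([pilener, golden, golden, golden]).shuffle(seed=42)
```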
Training Hyperparameters:
- LoRA rank = 512
- LoRA alpha = 512
- LoRA target modules = ["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "down_proj"]
- Batch size = 16
- Gradient accumulation = 1
- Warmup ratio = 0.03
- Epochs = 1
- Learning rate = 5e-5
- Optimizer = adamw_8bit
- Learning rate scheduler = linear
- Weight decay = 0.01
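These settings map onto a peft + transformers setup roughly as follows (a sketch: the base model's Hub id, a per-device batch size, and "adamw_bnb_8bit" as the transformers name for adamw_8bit are assumptions; the author's notebook may use a different stack):

```python
# Phase-1 LoRA configuration sketch; trainer wiring and data collation omitted.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

base = AutoModelForCausalLM.from_pretrained("nvidia/Nemotron-4-Mini-Hindi-4B-Instruct")

lora_cfg = LoraConfig(
    r=512,
    lora_alpha=512,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)

args = TrainingArguments(
    output_dir="hindiner-phase1",
    per_device_train_batch_size=16,  # "Batch size = 16" above; per-device is an assumption
    gradient_accumulation_steps=1,
    warmup_ratio=0.03,
    num_train_epochs=1,
    learning_rate=5e-5,
    optim="adamw_bnb_8bit",          # bitsandbytes 8-bit AdamW, i.e. "adamw_8bit"
    lr_scheduler_type="linear",
    weight_decay=0.01,
)
```

`model`, `args`, and the mixed dataset would then be passed to a standard Trainer (or TRL's SFTTrainer).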
Desired outcome: A model that understands diverse text and produces high-quality NER predictions for English source text, while generating reasonable predictions for Hindi source text, following the specified JSON output structure.
Check out the Colab Notebook for the phase1 training code.
Check out the phase1 model
- Phase 2: Hindi polishing
The model from phase 1 is fine-tuned on HindiNER-golden-dataset alone.
Datapoint proportions:
- 952 from HindiNER-golden-dataset
Training Hyperparameters:
- LoRA rank = 128
- LoRA alpha = 128
- LoRA target modules = ["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "down_proj"]
- Batch size = 4
- Gradient accumulation = 2
- Warmup ratio = 0.00
- Epochs = 1
- Learning rate = 2e-4
- Optimizer = adamw_8bit
- Learning rate scheduler = linear
- Weight decay = 0.01
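A matching sketch for phase 2 (the phase-1 checkpoint path is hypothetical, and merging the phase-1 adapter before applying the new, smaller adapter is an assumption):

```python
# Phase-2 LoRA configuration sketch; the phase-1 checkpoint path is hypothetical.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

phase1 = AutoModelForCausalLM.from_pretrained("path/to/merged-phase1-model")

lora_cfg = LoraConfig(
    r=128,
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(phase1, lora_cfg)

args = TrainingArguments(
    output_dir="hindiner-phase2",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,   # effective batch size 8
    warmup_ratio=0.0,
    num_train_epochs=1,
    learning_rate=2e-4,
    optim="adamw_bnb_8bit",
    lr_scheduler_type="linear",
    weight_decay=0.01,
)
```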
Desired outcome: A model that produces high-quality NER predictions for Hindi source text while retaining most of what was learned in phase 1.
Check out the Colab Notebook for the phase2 training code.
Evaluation
A general, unconstrained Hindi NER model is highly non-deterministic and hard to evaluate:
- Because of the model's general behavior, it can generate any entity type for a given entity value, which makes it difficult to score the exact entity type for less common entity values.
- Unconstrained behavior means there is no mechanism to regulate or limit which entity type and entity value pairs may be produced.
Given this behavior, human evaluation on a set of pre-selected, diverse data points proved to be the most effective approach.
NOTE
An LLM-as-a-judge approach was also designed, but it was abandoned due to budget constraints and other challenges.
Github Repo
- Repository: HindiNER-v0
Model Card Authors
- nis12ram