Model Card for HindiNER-4B-v1.0
HindiNER-4B-v1.0 - a general and constrained Hindi NER model
Model Details
Model Description
HindiNER-4B-v1.0 is a 4B-parameter general Hindi NER model, built on top of Nemotron-4-Mini-Hindi-4B-Instruct using a data-efficient LoRA training strategy, and supports a context window of 4096 tokens.
- Developed by: nis12ram
- Model type: Autoregressive model
- Language(s) (NLP): [Hindi, English]
- License: Apache License 2.0
- Finetuned from model: Nemotron-4-Mini-Hindi-4B-Instruct
Source Text Language Support:
- Hindi (written in Devanagari script) [primary support]
- English
- Hinglish or Romanized Hindi
- A mix of all of the above
NOTE
The model was not explicitly trained for NER on Hinglish data. However, since Nemotron-4-Mini-Hindi-4B-Instruct was also pretrained on Romanized Hindi, the fine-tuned model generalizes well to Hinglish.
From the Nemotron-4-Mini-Hindi-4B-Instruct paper:
The translated Hindi data comprises approximately 60 billion tokens. We then combine this synthetic data with around 40 billion real tokens (web-scraped data) to create a dataset totaling 100 billion Hindi tokens. Additionally, this entire Hindi text is transliterated into Roman script, expanding the total dataset to 220 billion tokens. The transliterated tokens are included to enable the model to support Hinglish queries.
Entity Type Language Support:
- Only Hindi (written in Devanagari script)
Model's Prompt & Desired Output
- prompt:
prompt = '''<extra_id_0>System
You are a text‐reader and entity extractor. When given a text, read it and reply “I have read the text.” Then, when the user provides an entity type in Hindi, extract and return a list of all matching entities from the text.
<extra_id_1>User
{input}
<extra_id_1>Assistant
I have read the text.
<extra_id_1>User
{entity_type}
<extra_id_1>Assistant
'''
- desired output structure:
The model's output is a string representation of a JSON array (list), which can be directly parsed using tools such as json.loads() in Python or equivalent functions in other languages.
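For example (the raw output string below is illustrative):

```python
import json

# Illustrative raw model response for the entity type "संगठन" (organization).
raw_output = '["गूगल इंडिया", "IIT दिल्ली", "नीति आयोग"]'

entities = json.loads(raw_output)  # parses into a Python list of entity strings
assert entities == ["गूगल इंडिया", "IIT दिल्ली", "नीति आयोग"]
```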
How to Get Started with the Model
Check out the Colab Notebook
Training Details
Training Data
To build a high-quality training dataset, a two-step approach is implemented:
1. Data Curation
2. Data Augmentation
Data Curation
1st Dataset
There are a few good traditional Hindi NER datasets available, but as far as I know, there’s no publicly available general-purpose Hindi NER dataset like Pile-NER-type for English.
So I decided to manually collect and annotate a general Hindi NER dataset: small but rich, diverse, and aligned with the Indian context.
To know more, please refer to HindiNER-golden-dataset (952 datapoints).
2nd Dataset
To teach the model how to properly perform NER, the HindiNER-golden-dataset (952 datapoints) alone won't be enough.
Taking inspiration from the well-known fact that intermediate tuning on similar tasks can enhance performance on low-resource downstream tasks, I looked for a suitable intermediate dataset.
After some experimentation, the most similar dataset turned out to be Pile-NER-type.
To further align Pile-NER-type with the final model's objective of using only Hindi entity types, a translation phase was conducted in which all English entity types were translated into Hindi.
To know more, please refer to entity_type_hi_pilener (37,859 datapoints).
Data Augmentation
Pre-defined augmentation variables:
max_entity_type_value_pairs = 32, negative_entities_percent = 50
Augmenting HindiNER-golden-dataset
Due to manual collection and annotation, the HindiNER-golden-dataset is limited to just 952 datapoints.
To facilitate effective learning from this limited-size dataset, an oversampling strategy is used.
Oversampling strategy (illustrated in the code sketch below):
- Step 1: Create 5 copies of HindiNER-golden-dataset.
- Step 2: Randomly shuffle each copy.
- Step 3: Randomly drop entity type-value pairs from each copy if the total number of entity type-value pairs in a datapoint exceeds max_entity_type_value_pairs/2.
- Step 4: Create a list of all rare entity types, rare_entity_type_lst.
- Step 5: Add negative_entities_percent% negative entity type-value pairs to each datapoint in all copies by randomly selecting from rare_entity_type_lst.
- Step 6: Randomly shuffle each copy.
To know more, please refer to HindiNER-golden-dataset-constraint1, HindiNER-golden-dataset-constraint2, HindiNER-golden-dataset-constraint3, HindiNER-golden-dataset-constraint4, HindiNER-golden-dataset-constraint5, and HindiNER-golden-dataset-constraint-neg-corr (not documented).
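A minimal sketch of this oversampling strategy, assuming a simple datapoint layout of {"text", "pairs"} (the field names, helper signature, and negative-sampling details are illustrative, not the actual preprocessing code):

```python
import random

MAX_ENTITY_TYPE_VALUE_PAIRS = 32
NEGATIVE_ENTITIES_PERCENT = 50

def oversample(dataset, rare_entity_type_lst, n_copies=5):
    """dataset: list of dicts like {"text": str, "pairs": [(entity_type, [values]), ...]}."""
    copies = []
    for _ in range(n_copies):                                   # Step 1: 5 copies
        copy = [dict(dp, pairs=list(dp["pairs"])) for dp in dataset]
        random.shuffle(copy)                                    # Step 2: shuffle the copy
        for dp in copy:
            limit = MAX_ENTITY_TYPE_VALUE_PAIRS // 2
            if len(dp["pairs"]) > limit:                        # Step 3: drop excess pairs
                dp["pairs"] = random.sample(dp["pairs"], limit)
            # Steps 4-5: add negatives (empty value lists) drawn from the rare types;
            # a real implementation would also skip types actually present in the text.
            n_neg = int(len(dp["pairs"]) * NEGATIVE_ENTITIES_PERCENT / 100)
            for etype in random.sample(rare_entity_type_lst, n_neg):
                dp["pairs"].append((etype, []))
        random.shuffle(copy)                                    # Step 6: shuffle again
        copies.append(copy)
    return copies
```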
Augmenting entity_type_hi_pilener
A similar augmentation strategy is used here as well.
Augmentation strategy
- Step 1: Randomly shuffle entity_type_hi_pilener.
- Step 2: Randomly drop entity type-value pairs if the total number of entity type-value pairs in a datapoint exceeds max_entity_type_value_pairs/2.
- Step 3: Create a list of all rare entity types, rare_entity_type_lst.
- Step 4: Add negative_entities_percent% negative entity type-value pairs to each datapoint by randomly selecting from rare_entity_type_lst.
- Step 5: Randomly shuffle entity_type_hi_pilener.
To know more, please refer to entity_type_hi_pilener_constraint and entity_type_hi_pilener_constraint-neg-corr (not documented).
Dataset Size
- 4760 (952 × 5) -> HindiNER-golden-dataset-constraint1, ..., HindiNER-golden-dataset-constraint5
- 37855 -> entity_type_hi_pilener_constraint
- 42615 -> total
Training Procedure
To build a high-quality model, a two-step approach is implemented:
1. Core
2. Polish
Core
The model is trained in a multi-turn conversation fashion, with a single entity type per turn.
- Example:
sample_conversation = '''<extra_id_0>System
You are a text‐reader and entity extractor. When given a text, read it and reply “I have read the text.” Then, when the user provides an entity type in Hindi, extract and return a list of all matching entities from the text.
<extra_id_1>User
2024 में, प्रधानमंत्री नरेंद्र मोदी ने वाराणसी में एक नई AI रिसर्च लैब का उद्घाटन किया। इस कार्यक्रम में गूगल इंडिया, IIT दिल्ली और नीति आयोग के प्रतिनिधि मौजूद थे। उद्घाटन समारोह 15 अगस्त 2024 को हुआ, जिसमें रतन टाटा और सचिन तेंदुलकर भी विशेष अतिथि के रूप में शामिल हुए।
<extra_id_1>Assistant
I have read the text.
<extra_id_1>User
प्रधानमंत्री
<extra_id_1>Assistant
["नरेंद्र मोदी"]
<extra_id_1>User
वर्ष
<extra_id_1>Assistant
["2024"]
<extra_id_1>User
आपातकालीन घटना
<extra_id_1>Assistant
[]
<extra_id_1>User
संगठन
<extra_id_1>Assistant
["गूगल इंडिया", "IIT दिल्ली", "नीति आयोग"]
<extra_id_1>
'''
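For reference, a datapoint could be serialized into this multi-turn format along the following lines (a sketch; the build_conversation helper and the (text, pairs) layout are assumptions, not the actual training code):

```python
import json

SYSTEM_PROMPT = (
    "<extra_id_0>System\n"
    "You are a text‐reader and entity extractor. When given a text, read it and reply "
    "“I have read the text.” Then, when the user provides an entity type in Hindi, "
    "extract and return a list of all matching entities from the text.\n"
)

def build_conversation(text, pairs):
    """pairs: list of (entity_type, [entity values]); one conversation turn per entity type."""
    parts = [
        SYSTEM_PROMPT,
        f"<extra_id_1>User\n{text}\n",
        "<extra_id_1>Assistant\nI have read the text.\n",
    ]
    for entity_type, values in pairs:
        parts.append(f"<extra_id_1>User\n{entity_type}\n")
        # The assistant reply is a JSON array; [] for negative entity types.
        parts.append(f"<extra_id_1>Assistant\n{json.dumps(values, ensure_ascii=False)}\n")
    parts.append("<extra_id_1>\n")  # the conversation ends with the stop token
    return "".join(parts)
```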
Training Details:
- Training technique = LoRA
- Dataset = [
nis12ram/HindiNER-golden-dataset-constraint1,
nis12ram/HindiNER-golden-dataset-constraint2,
nis12ram/HindiNER-golden-dataset-constraint3,
nis12ram/HindiNER-golden-dataset-constraint4,
nis12ram/HindiNER-golden-dataset-constraint5,
nis12ram/entity_type_hi_pilener_constraint
]
- Dataset Format = Multi-Turn conversation
- LoRA rank = 512
- LoRA alpha = 512
- LoRA target modules = ["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "down_proj"]
- Batch size = 16
- Gradient accumulation = 1
- Warmup ratio = 0.03
- Epochs = 1
- Learning rate = 5e-5
- Optimizer = adamw_8bit
- Learning rate scheduler = linear
- Weight decay = 0.01
- max seq length = 4000
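For illustration, these hyperparameters map onto a peft/transformers configuration roughly as follows (a sketch, not the actual notebook code; output_dir and task_type are filled-in assumptions):

```python
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=512,
    lora_alpha=512,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="outputs",              # assumption; not specified in the card
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    warmup_ratio=0.03,
    num_train_epochs=1,
    learning_rate=5e-5,
    optim="adamw_8bit",
    lr_scheduler_type="linear",
    weight_decay=0.01,
)
```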
Check out the Colab Notebook for the training code.
Check out the Model obtained from Core training.
Q. What did Core training achieve?
Ans. The Core training produces a model highly effective in Named Entity Recognition (NER) for Hindi, English, and Hinglish, while preserving the desired output structure.
Q. What did Core training lack?
Ans. The Core training produces a model that works superbly at extracting entity values for positive entity types, but lacks the capability to understand negative entity types (entity types with no matching entities in the text, which should return an empty list) and often hallucinates when handling them.
Polish
The Polish training was not predetermined in terms of dataset, dataset format, and training hyperparameters. All these training components were decided based on the limitations of the model produced by the Core training.
The model is trained in a single-turn conversation fashion.
- Example:
sample_conversation = '''<extra_id_0>System
You are a text‐reader and entity extractor. When given a text, read it and reply “I have read the text.” Then, when the user provides an entity type in Hindi, extract and return a list of all matching entities from the text.
<extra_id_1>User
2024 में, प्रधानमंत्री नरेंद्र मोदी ने वाराणसी में एक नई AI रिसर्च लैब का उद्घाटन किया। इस कार्यक्रम में गूगल इंडिया, IIT दिल्ली और नीति आयोग के प्रतिनिधि मौजूद थे। उद्घाटन समारोह 15 अगस्त 2024 को हुआ, जिसमें रतन टाटा और सचिन तेंदुलकर भी विशेष अतिथि के रूप में शामिल हुए।
<extra_id_1>Assistant
I have read the text.
<extra_id_1>User
प्रधानमंत्री
<extra_id_1>Assistant
["नरेंद्र मोदी"]
<extra_id_1>
'''
A manual dataset was collected with the objective of mitigating the limitations of Core training. To know more about the dataset, please check out HindiNER-golden-dataset2.
To avoid catastrophic forgetting, two additional datasets were mixed in.
Training Details:
- Training technique = LoRA
- Dataset = [
nis12ram/HindiNER-golden-dataset2,
nis12ram/entity_type_hi_pilener_constraint-neg-corr,
nis12ram/HindiNER-golden-dataset-constraint-neg-corr
]
- Dataset Format = Single-Turn conversation
- LoRA rank = 8
- LoRA alpha = 8
- LoRA target modules = ["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "down_proj"]
- Batch size = 4
- Gradient accumulation = 2
- Warmup ratio = 0.00
- Epochs = 1
- Learning rate = 2e-4
- Optimizer = adamw_8bit
- Learning rate scheduler = linear
- Weight decay = 0.01
- max seq length = 2048
Check out the Colab Notebook for the training code.
Q. What did Polish training achieve?
Ans. The Polish training produces a model highly effective in handling positive and negative entity types.
Q. What did Polish training lack?
Ans. The final model is capable of handling source texts of different formats, styles, and languages, but there can still be cases where the model hallucinates.
Experiments Tried but Didn't Work
- Training Nemotron-4-Mini-Hindi-4B-Instruct by passing all or multiple entity types in a single turn results in a model with lower accuracy and poorer structure-following capabilities than HindiNER-4B-v0.0, which was trained by passing a single entity type per turn. A model trained by passing multiple entity types in a single turn does not produce good results even when only a single entity type is passed per turn.
Possible Reason
From the UniversalNER: Targeted Distillation from Large Language Models for Open Named Entity Recognition paper:
When the model is required to handle multiple entity types within a single query, it might disperse its attention across these varied types, possibly resulting in less accurate identification for each individual type. Conversely, by decomposing the task into several simpler ones, each focusing on one entity type at a time, the model might be better equipped to handle the complexity, thus yielding more accurate results
- Simple duplication of the HindiNER-golden-dataset as an oversampling strategy improves model performance, but is inferior to the oversampling strategy described above.
- A multi-turn conversation dataset format in Polish training produces a model that is not robust enough to handle most negative entity type cases.
Possible Reason
The Polish training phase is mainly about very fine-grained updates to the model's behaviour, and such fine-grained behaviour is learned better when only a single entity type is passed per source text.
- Polish training using only HindiNER-golden-dataset2 learns to handle negative entity types, but forgets much of what was learned during Core training.
Evaluation
Evaluation is done using a twofold process:
1. Automatic Evaluation
2. Human Evaluation
Automatic Evaluation
Please refer to this LinkedIn article to learn how automated evaluation is performed.
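The article describes the exact protocol; conceptually, an entity-level F1 for one example can be computed along these lines (a simplified sketch, not the actual evaluation code):

```python
from collections import Counter

def entity_f1(predicted, gold):
    """Multiset precision/recall/F1 over extracted entity strings for one example."""
    pred_counts, gold_counts = Counter(predicted), Counter(gold)
    tp = sum((pred_counts & gold_counts).values())  # multiset intersection size
    if tp == 0:
        # Note: a full protocol also needs a convention for the case where
        # both the predicted and the gold entity lists are empty.
        return 0.0
    precision = tp / sum(pred_counts.values())
    recall = tp / sum(gold_counts.values())
    return 2 * precision * recall / (precision + recall)

print(entity_f1(["गूगल इंडिया", "IIT दिल्ली"], ["गूगल इंडिया", "IIT दिल्ली", "नीति आयोग"]))  # 0.8
```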
Result
🌐 Language: hi
Category | F1 Score |
---|---|
📰 News | 0.9295 |
💻 Coding | 0.8712 |
📄 Long Article | 0.9176 |
🏥 Medical | 0.9167 |
➕ Math | 0.9143 |
📦 Other | 0.9613 |
💬 Conversation | 0.9497 |
⚗️ Chemistry | 0.7500 |
🔹 Language-level F1: 0.9013
🌐 Language: en
Category | F1 Score |
---|---|
💻 Coding | 0.8589 |
🏥 Medical | 0.7667 |
➕ Math | 0.0000 |
📦 Other | 0.9658 |
💬 Conversation | 0.9490 |
🔹 Language-level F1: 0.7081
🌐 Language: hing
Category | F1 Score |
---|---|
💬 Chat | 0.7201 |
📦 Other | 0.9903 |
🔹 Language-level F1: 0.8552
🏁 Final Overall F1 Score: 0.8215
Check out the Colab Notebook for evaluation code.
Check out the Evaluation dataset
Human Evaluation
Human evaluation is done by probing the model with diverse source texts and checking its predictions across different scenarios.
Usage Details
- The stop token should be set to <extra_id_1>.
- Greedy sampling should be preferred.
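Putting these together, a minimal transformers inference sketch might look like the following (loading options, max_new_tokens, and the stop-token handling are assumptions; the Colab Notebook above is the reference implementation):

```python
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nis12ram/HindiNER-4B-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# The prompt template from the "Model's Prompt & Desired Output" section above.
prompt = '''<extra_id_0>System
You are a text‐reader and entity extractor. When given a text, read it and reply “I have read the text.” Then, when the user provides an entity type in Hindi, extract and return a list of all matching entities from the text.
<extra_id_1>User
{input}
<extra_id_1>Assistant
I have read the text.
<extra_id_1>User
{entity_type}
<extra_id_1>Assistant
'''.format(
    input="2024 में, प्रधानमंत्री नरेंद्र मोदी ने वाराणसी में एक नई AI रिसर्च लैब का उद्घाटन किया।",
    entity_type="प्रधानमंत्री",
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=False,  # greedy decoding, as recommended above
    eos_token_id=tokenizer.convert_tokens_to_ids("<extra_id_1>"),  # stop token
)
response = tokenizer.decode(output_ids[0, inputs["input_ids"].shape[1]:])
response = response.split("<extra_id_1>")[0].strip()  # trim at the stop token
entities = json.loads(response)  # e.g. ["नरेंद्र मोदी"]
print(entities)
```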