Model Card for HindiNER-4B-v1.0
HindiNER-4B-v1.0 - a general and constrained Hindi NER model
Model Details
Model Description
HindiNER-4B-v1.0 is a 4B-parameter general Hindi NER model, built on top of Nemotron-4-Mini-Hindi-4B-Instruct using a data-efficient LoRA training strategy, and supports a context window of 4096 tokens.
- Developed by: nis12ram
- Model type: Autoregressive model
- Language(s) (NLP): [Hindi, English]
- License: Apache License 2.0
- Finetuned from model: Nemotron-4-Mini-Hindi-4B-Instruct
Source Text Language Support:
- Hindi (written in Devanagari script) [primary support]
- English
- Hinglish or Romanized Hindi
- A mix of all of the above
NOTE
The model was not explicitly trained for NER on Hinglish data. However, since Nemotron-4-Mini-Hindi-4B-Instruct was also pretrained on Romanized Hindi, the fine-tuned model generalizes well to Hinglish.
From the Nemotron-4-Mini-Hindi-4B-Instruct paper:
The translated Hindi data comprises approximately 60 billion tokens. We then combine this synthetic data with around 40 billion real tokens (web-scraped data) to create a dataset totaling 100 billion Hindi tokens. Additionally, this entire Hindi text is transliterated into Roman script, expanding the total dataset to 220 billion tokens. The transliterated tokens are included to enable the model to support Hinglish queries.
Entity Type Language Support:
- Only Hindi (written in Devanagari script)
Model's Prompt & Desired Output
- prompt:
prompt = '''<extra_id_0>System
You are a text‐reader and entity extractor. When given a text, read it and reply “I have read the text.” Then, when the user provides an entity type in Hindi, extract and return a list of all matching entities from the text.
<extra_id_1>User
{input}
<extra_id_1>Assistant
I have read the text.
<extra_id_1>User
{entity_type}
<extra_id_1>Assistant
'''
- desired output structure:
The model's output is a string representation of a JSON array (list), which can be directly parsed using tools such as json.loads() in Python or equivalent functions in other languages.
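For example (the raw output string below is illustrative):

```python
import json

# Illustrative raw model response for the entity type "संगठन" (organization).
raw_output = '["गूगल इंडिया", "IIT दिल्ली", "नीति आयोग"]'

entities = json.loads(raw_output)  # parses into a Python list of entity strings
assert entities == ["गूगल इंडिया", "IIT दिल्ली", "नीति आयोग"]
```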
How to Get Started with the Model
Check out the Colab Notebook
Training Details
Training Data
To build a high-quality training dataset, a two-step approach is implemented:
1. Data Curation
2. Data Augmentation
Data Curation
1st Dataset
There are a few good traditional Hindi NER datasets available, but as far as I know, there’s no publicly available general-purpose Hindi NER dataset like Pile-NER-type for English.
So I decided to manually collect and annotate a general Hindi NER dataset: small but rich, diverse, and aligned with the Indian context.
To know more, please refer to HindiNER-golden-dataset (952 datapoints).
2nd Dataset
To teach the model how to properly perform NER, the HindiNER-golden-dataset (952 datapoints) alone won't be enough.
Taking inspiration from the well-known fact that intermediate tuning on similar tasks can enhance performance on low-resource downstream tasks, I looked for a suitable intermediate dataset.
After some experimentation, the most similar dataset turned out to be Pile-NER-type.
To further align Pile-NER-type with the final model's objective of using only Hindi entity types, a translation phase was conducted in which all English entity types were translated into Hindi.
To know more, please refer to entity_type_hi_pilener (37,859 datapoints).
Data Augmentation
Pre-defined augmentation variables:
max_entity_type_value_pairs = 32, negative_entities_percent = 50
Augmenting HindiNER-golden-dataset
Due to manual collection and annotation, the HindiNER-golden-dataset is limited to just 952 datapoints.
To facilitate effective learning from this limited-size dataset, an oversampling strategy is used.
Oversampling strategy (illustrated in the code sketch below):
- Step 1: Create 5 copies of HindiNER-golden-dataset.
- Step 2: Randomly shuffle each copy.
- Step 3: Randomly drop entity type-value pairs from each copy if the total number of entity type-value pairs in a datapoint exceeds max_entity_type_value_pairs/2.
- Step 4: Create a list of all rare entity types, rare_entity_type_lst.
- Step 5: Add negative_entities_percent% negative entity type-value pairs to each datapoint in all copies by randomly selecting from rare_entity_type_lst.
- Step 6: Randomly shuffle each copy.
To know more, please refer to HindiNER-golden-dataset-constraint1, HindiNER-golden-dataset-constraint2, HindiNER-golden-dataset-constraint3, HindiNER-golden-dataset-constraint4, HindiNER-golden-dataset-constraint5, and HindiNER-golden-dataset-constraint-neg-corr (not documented).
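A minimal sketch of this oversampling strategy, assuming a simple datapoint layout of {"text", "pairs"} (the field names, helper signature, and negative-sampling details are illustrative, not the actual preprocessing code):

```python
import random

MAX_ENTITY_TYPE_VALUE_PAIRS = 32
NEGATIVE_ENTITIES_PERCENT = 50

def oversample(dataset, rare_entity_type_lst, n_copies=5):
    """dataset: list of dicts like {"text": str, "pairs": [(entity_type, [values]), ...]}."""
    copies = []
    for _ in range(n_copies):                                   # Step 1: 5 copies
        copy = [dict(dp, pairs=list(dp["pairs"])) for dp in dataset]
        random.shuffle(copy)                                    # Step 2: shuffle the copy
        for dp in copy:
            limit = MAX_ENTITY_TYPE_VALUE_PAIRS // 2
            if len(dp["pairs"]) > limit:                        # Step 3: drop excess pairs
                dp["pairs"] = random.sample(dp["pairs"], limit)
            # Steps 4-5: add negatives (empty value lists) drawn from the rare types;
            # a real implementation would also skip types actually present in the text.
            n_neg = int(len(dp["pairs"]) * NEGATIVE_ENTITIES_PERCENT / 100)
            for etype in random.sample(rare_entity_type_lst, n_neg):
                dp["pairs"].append((etype, []))
        random.shuffle(copy)                                    # Step 6: shuffle again
        copies.append(copy)
    return copies
```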
Augmenting entity_type_hi_pilener
A similar augmentation strategy is used here as well.
Augmentation strategy
- Step 1: Randomly shuffle entity_type_hi_pilener.
- Step 2: Randomly drop entity type-value pairs if the total number of entity type-value pairs in a datapoint exceeds max_entity_type_value_pairs/2.
- Step 3: Create a list of all rare entity types, rare_entity_type_lst.
- Step 4: Add negative_entities_percent% negative entity type-value pairs to each datapoint by randomly selecting from rare_entity_type_lst.
- Step 5: Randomly shuffle entity_type_hi_pilener.
To know more, please refer to entity_type_hi_pilener_constraint and entity_type_hi_pilener_constraint-neg-corr (not documented).
Dataset Size
- 4760 (952 × 5) -> HindiNER-golden-dataset-constraint1, ..., HindiNER-golden-dataset-constraint5
- 37855 -> entity_type_hi_pilener_constraint
- 42615 -> total
Training Procedure
To build a high-quality model, a two-step approach is implemented:
1. Core
2. Polish
Core
The model is trained in a multi-turn conversation fashion, with a single entity type per turn.
- Example:
sample_conversation = '''<extra_id_0>System
You are a text‐reader and entity extractor. When given a text, read it and reply “I have read the text.” Then, when the user provides an entity type in Hindi, extract and return a list of all matching entities from the text.
<extra_id_1>User
2024 में, प्रधानमंत्री नरेंद्र मोदी ने वाराणसी में एक नई AI रिसर्च लैब का उद्घाटन किया। इस कार्यक्रम में गूगल इंडिया, IIT दिल्ली और नीति आयोग के प्रतिनिधि मौजूद थे। उद्घाटन समारोह 15 अगस्त 2024 को हुआ, जिसमें रतन टाटा और सचिन तेंदुलकर भी विशेष अतिथि के रूप में शामिल हुए।
<extra_id_1>Assistant
I have read the text.
<extra_id_1>User
प्रधानमंत्री
<extra_id_1>Assistant
["नरेंद्र मोदी"]
<extra_id_1>User
वर्ष
<extra_id_1>Assistant
["2024"]
<extra_id_1>User
आपातकालीन घटना
<extra_id_1>Assistant
[]
<extra_id_1>User
संगठन
<extra_id_1>Assistant
["गूगल इंडिया", "IIT दिल्ली", "नीति आयोग"]
<extra_id_1>
'''
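For reference, a datapoint could be serialized into this multi-turn format along the following lines (a sketch; the build_conversation helper and the (text, pairs) layout are assumptions, not the actual training code):

```python
import json

SYSTEM_PROMPT = (
    "<extra_id_0>System\n"
    "You are a text‐reader and entity extractor. When given a text, read it and reply "
    "“I have read the text.” Then, when the user provides an entity type in Hindi, "
    "extract and return a list of all matching entities from the text.\n"
)

def build_conversation(text, pairs):
    """pairs: list of (entity_type, [entity values]); one conversation turn per entity type."""
    parts = [
        SYSTEM_PROMPT,
        f"<extra_id_1>User\n{text}\n",
        "<extra_id_1>Assistant\nI have read the text.\n",
    ]
    for entity_type, values in pairs:
        parts.append(f"<extra_id_1>User\n{entity_type}\n")
        # The assistant reply is a JSON array; [] for negative entity types.
        parts.append(f"<extra_id_1>Assistant\n{json.dumps(values, ensure_ascii=False)}\n")
    parts.append("<extra_id_1>\n")  # the conversation ends with the stop token
    return "".join(parts)
```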
Training Details:
- Training technique = LoRA
- Dataset = [
nis12ram/HindiNER-golden-dataset-constraint1,
nis12ram/HindiNER-golden-dataset-constraint2,
nis12ram/HindiNER-golden-dataset-constraint3,
nis12ram/HindiNER-golden-dataset-constraint4,
nis12ram/HindiNER-golden-dataset-constraint5,
nis12ram/entity_type_hi_pilener_constraint
]
- Dataset Format = Multi-Turn conversation
- LoRA rank = 512
- LoRA alpha = 512
- LoRA target modules = ["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "down_proj"]
- Batch size = 16
- Gradient accumulation = 1
- Warmup ratio = 0.03
- Epochs = 1
- Learning rate = 5e-5
- Optimizer = adamw_8bit
- Learning rate scheduler = linear
- Weight decay = 0.01
- max seq length = 4000
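For illustration, these hyperparameters map onto a peft/transformers configuration roughly as follows (a sketch, not the actual notebook code; output_dir and task_type are filled-in assumptions):

```python
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=512,
    lora_alpha=512,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="outputs",              # assumption; not specified in the card
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    warmup_ratio=0.03,
    num_train_epochs=1,
    learning_rate=5e-5,
    optim="adamw_8bit",
    lr_scheduler_type="linear",
    weight_decay=0.01,
)
```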
Check out the Colab Notebook for the training code.
Check out the Model obtained from Core training.
Q. What did Core training achieve?
Ans. The Core training produces a model highly effective in Named Entity Recognition (NER) for Hindi, English, and Hinglish, while preserving the desired output structure.
Q. What did Core training lack?
Ans. The Core training produces a model that works superbly at extracting entity values for positive entity types, but lacks the capability to understand negative entity types (entity types with no matching entities in the text, which should return an empty list) and often hallucinates when handling them.
Polish
The Polish training was not predetermined in terms of dataset, dataset format, and training hyperparameters. All these training components were decided based on the limitations of the model produced by the Core training.
The model is trained in a single-turn conversation fashion.
- Example:
sample_conversation = '''<extra_id_0>System
You are a text‐reader and entity extractor. When given a text, read it and reply “I have read the text.” Then, when the user provides an entity type in Hindi, extract and return a list of all matching entities from the text.
<extra_id_1>User
2024 में, प्रधानमंत्री नरेंद्र मोदी ने वाराणसी में एक नई AI रिसर्च लैब का उद्घाटन किया। इस कार्यक्रम में गूगल इंडिया, IIT दिल्ली और नीति आयोग के प्रतिनिधि मौजूद थे। उद्घाटन समारोह 15 अगस्त 2024 को हुआ, जिसमें रतन टाटा और सचिन तेंदुलकर भी विशेष अतिथि के रूप में शामिल हुए।
<extra_id_1>Assistant
I have read the text.
<extra_id_1>User
प्रधानमंत्री
<extra_id_1>Assistant
["नरेंद्र मोदी"]
<extra_id_1>
'''
A manual dataset was collected with the objective of mitigating the limitations of Core training. To know more about the dataset, please check out HindiNER-golden-dataset2.
To avoid catastrophic forgetting, two additional datasets were mixed in.
Training Details:
- Training technique = LoRA
- Dataset = [
nis12ram/HindiNER-golden-dataset2,
nis12ram/entity_type_hi_pilener_constraint-neg-corr,
nis12ram/HindiNER-golden-dataset-constraint-neg-corr
]
- Dataset Format = Single-Turn conversation
- LoRA rank = 8
- LoRA alpha = 8
- LoRA target modules = ["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "down_proj"]
- Batch size = 4
- Gradient accumulation = 2
- Warmup ratio = 0.00
- Epochs = 1
- Learning rate = 2e-4
- Optimizer = adamw_8bit
- Learning rate scheduler = linear
- Weight decay = 0.01
- max seq length = 2048
Check out the Colab Notebook for the training code.
Q. What did Polish training achieve?
Ans. The Polish training produces a model highly effective in handling positive and negative entity types.
Q. What did Polish training lack?
Ans. The final model is capable of handling source texts of different formats, styles, and languages, but there can still be cases where the model hallucinates.
Experiments Tried but Didn't Work
- Training Nemotron-4-Mini-Hindi-4B-Instruct by passing all or multiple entity types in a single turn results in a model with lower accuracy and poorer structure-following capabilities than HindiNER-4B-v0.0, which was trained by passing a single entity type per turn. A model trained by passing multiple entity types in a single turn does not produce good results even when only a single entity type is passed per turn.
Possible Reason
From the UniversalNER: Targeted Distillation from Large Language Models for Open Named Entity Recognition paper:
When the model is required to handle multiple entity types within a single query, it might disperse its attention across these varied types, possibly resulting in less accurate identification for each individual type. Conversely, by decomposing the task into several simpler ones, each focusing on one entity type at a time, the model might be better equipped to handle the complexity, thus yielding more accurate results
- Simple duplication of the HindiNER-golden-dataset as an oversampling strategy improves model performance, but is inferior to the oversampling strategy described above.
- A multi-turn conversation dataset format in Polish training produces a model that is not robust enough to handle most negative entity type cases.
Possible Reason
The Polish training phase is mainly about very fine-grained updates to the model's behaviour, and such fine-grained behaviour is learned better when only a single entity type is passed per source text.
- Polish training using only HindiNER-golden-dataset2 learns to handle negative entity types, but forgets much of what was learned during Core training.
Evaluation
Evaluation is done using a twofold process:
1. Automatic Evaluation
2. Human Evaluation
Automatic Evaluation
Please refer to this LinkedIn article to learn how automated evaluation is performed.
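The article describes the exact protocol; conceptually, an entity-level F1 for one example can be computed along these lines (a simplified sketch, not the actual evaluation code):

```python
from collections import Counter

def entity_f1(predicted, gold):
    """Multiset precision/recall/F1 over extracted entity strings for one example."""
    pred_counts, gold_counts = Counter(predicted), Counter(gold)
    tp = sum((pred_counts & gold_counts).values())  # multiset intersection size
    if tp == 0:
        # Note: a full protocol also needs a convention for the case where
        # both the predicted and the gold entity lists are empty.
        return 0.0
    precision = tp / sum(pred_counts.values())
    recall = tp / sum(gold_counts.values())
    return 2 * precision * recall / (precision + recall)

print(entity_f1(["गूगल इंडिया", "IIT दिल्ली"], ["गूगल इंडिया", "IIT दिल्ली", "नीति आयोग"]))  # 0.8
```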
Result
🌐 Language: hi
Category | F1 Score |
---|---|
📰 News | 0.9295 |
💻 Coding | 0.8712 |
📄 Long Article | 0.9176 |
🏥 Medical | 0.9167 |
➕ Math | 0.9143 |
📦 Other | 0.9613 |
💬 Conversation | 0.9497 |
⚗️ Chemistry | 0.7500 |
🔹 Language-level F1: 0.9013
🌐 Language: en
Category | F1 Score |
---|---|
💻 Coding | 0.8589 |
🏥 Medical | 0.7667 |
➕ Math | 0.0000 |
📦 Other | 0.9658 |
💬 Conversation | 0.9490 |
🔹 Language-level F1: 0.7081
🌐 Language: hing
Category | F1 Score |
---|---|
💬 Chat | 0.7201 |
📦 Other | 0.9903 |
🔹 Language-level F1: 0.8552
🏁 Final Overall F1 Score: 0.8215
Check out the Colab Notebook for evaluation code.
Check out the Evaluation dataset
Human Evaluation
Human evaluation is done by probing the model with diverse source texts and checking its predictions across different scenarios.
Usage Details
- The stop token should be set to <extra_id_1>.
- Greedy sampling should be preferred.
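Putting these together, a minimal transformers inference sketch might look like the following (loading options, max_new_tokens, and the stop-token handling are assumptions; the Colab Notebook above is the reference implementation):

```python
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nis12ram/HindiNER-4B-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# The prompt template from the "Model's Prompt & Desired Output" section above.
prompt = '''<extra_id_0>System
You are a text‐reader and entity extractor. When given a text, read it and reply “I have read the text.” Then, when the user provides an entity type in Hindi, extract and return a list of all matching entities from the text.
<extra_id_1>User
{input}
<extra_id_1>Assistant
I have read the text.
<extra_id_1>User
{entity_type}
<extra_id_1>Assistant
'''.format(
    input="2024 में, प्रधानमंत्री नरेंद्र मोदी ने वाराणसी में एक नई AI रिसर्च लैब का उद्घाटन किया।",
    entity_type="प्रधानमंत्री",
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=False,  # greedy decoding, as recommended above
    eos_token_id=tokenizer.convert_tokens_to_ids("<extra_id_1>"),  # stop token
)
response = tokenizer.decode(output_ids[0, inputs["input_ids"].shape[1]:])
response = response.split("<extra_id_1>")[0].strip()  # trim at the stop token
entities = json.loads(response)  # e.g. ["नरेंद्र मोदी"]
print(entities)
```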