Model Card for HindiNER-4B-v0.1
HindiNER-4B-v0.1 - a general, unconstrained Hindi NER model
Model Details
Model Description
HindiNER-4B-v0.1 is a 4B-parameter general Hindi NER model, built on top of Nemotron-4-Mini-Hindi-4B-Instruct using a two-phase LoRA training strategy.
- Developed by: nis12ram
- Model type: Autoregressive model
- Language(s) (NLP): Hindi, English
- License: Apache License 2.0
- Finetuned from model: Nemotron-4-Mini-Hindi-4B-Instruct
Source Text Language Support:
- Hindi (written in Devanagari script) [primary support]
- English
- Hinglish or Romanized Hindi
- A mix of all of the above
NOTE
The model was not explicitly trained for NER on Hinglish data. However, since Nemotron-4-Mini-Hindi-4B-Instruct was also pretrained on Romanized Hindi, the fine-tuned model generalizes well to Hinglish.
From the Nemotron-4-Mini-Hindi-4B-Instruct paper:
The translated Hindi data comprises approximately 60 billion tokens. We then combine this synthetic data with around 40 billion real tokens (web-scraped data) to create a dataset totaling 100 billion Hindi tokens. Additionally, this entire Hindi text is transliterated into Roman script, expanding the total dataset to 220 billion tokens. The transliterated tokens are included to enable the model to support Hinglish queries.
Model's Prompt & Desired Output
- prompt:
prompt = '''<extra_id_0>System
<extra_id_1>User
You are a Hindi language expert who specializes in extracting entities from text. Given a piece of text, extract all crucial entities along with their respective context-aware entity types. Ensure that entity type is in Hindi. The output should be in JSON format.
## Output format:
```json
{{
"entities": [
{{
"type": "_",
"value": ["_", "_"]
}},
{{
"type": "_",
"value": ["_"]
}}
]
}}
```
## Text:
""" {text} """
<extra_id_1>Assistant
'''
- desired output structure:
```json
{
  "entities": [
    {
      "type": "_",
      "value": ["_", "_"]
    },
    {
      "type": "_",
      "value": ["_"]
    }
  ]
}
```
NOTE The model outputs entity types in Hindi only, regardless of the language of the source text.
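For illustration, a hypothetical (hand-written, not model-generated) output for the sentence "विराट कोहली ने 2023 विश्व कप में भारत के लिए शतक लगाया।" ("Virat Kohli scored a century for India in the 2023 World Cup.") could look like:
```json
{
  "entities": [
    {
      "type": "व्यक्ति",
      "value": ["विराट कोहली"]
    },
    {
      "type": "आयोजन",
      "value": ["2023 विश्व कप"]
    },
    {
      "type": "देश",
      "value": ["भारत"]
    }
  ]
}
```
Note that the entity types are in Hindi (व्यक्ति = person, आयोजन = event, देश = country) even though parts of the text, like the year, are not.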
Uses
- For general, unconstrained Hindi NER inference (no fixed tag set).
- For NER predictions on source text in Hindi (written in Devanagari script), English, or a mix of the two.
- For collecting large-scale synthetic NER data.
- For NER predictions on long texts (supports a context of 4096 tokens).
- For NER predictions on text covering diverse domains.
- For domain-specific NER fine-tuning.
Bias, Risks, and Limitations
- The model can output biased or unfaithful entity type & entity value pairs.
- The model can miss or ignore important entity type & entity value pairs, since unconstrained NER inference provides no mechanism to enforce completeness.
How to Get Started with the Model
Check out the Colab Notebook
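If the notebook is unavailable, the snippet below is a minimal inference sketch; the Hub checkpoint id and the generation settings are assumptions, not the notebook's exact code.

````python
# Minimal inference sketch; the checkpoint id "nis12ram/HindiNER-4B-v0.1"
# and the generation settings are assumptions.
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nis12ram/HindiNER-4B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# The prompt template from "Model's Prompt & Desired Output" above.
# Doubled braces survive str.format, which only substitutes {text}.
prompt = '''<extra_id_0>System
<extra_id_1>User
You are a Hindi language expert who specializes in extracting entities from text. Given a piece of text, extract all crucial entities along with their respective context-aware entity types. Ensure that entity type is in Hindi. The output should be in JSON format.
## Output format:
```json
{{
"entities": [
{{
"type": "_",
"value": ["_", "_"]
}},
{{
"type": "_",
"value": ["_"]
}}
]
}}
```
## Text:
""" {text} """
<extra_id_1>Assistant
'''

text = "विराट कोहली ने 2023 विश्व कप में भारत के लिए शतक लगाया।"
inputs = tokenizer(prompt.format(text=text), return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
completion = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)

# Strip a ```json fence if the model emits one, then parse.
payload = completion.split("```json")[-1].split("```")[0].strip()
print(json.loads(payload))
````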
Training Details
Training Data
- entity_type_hi_pilener
- HindiNER-golden-dataset
Training Procedure
The model is trained using a two-phase LoRA strategy:
- Phase 1: joint LoRA fine-tuning on both languages
Nemotron-4-Mini-Hindi-4B-Instruct is fine-tuned on the combined dataset (entity_type_hi_pilener + HindiNER-golden-dataset). HindiNER-golden-dataset is duplicated 3 times to address the under-representation of NER on Hindi (written in Devanagari script); this duplication did not lead to overfitting, owing to the high diversity of HindiNER-golden-dataset and the chosen training hyperparameters. A sketch of this data mix follows the datapoint proportions below.
Datapoint proportions:
- 37855 from entity_type_hi_pilener
- 952 × 3 from HindiNER-golden-dataset
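A minimal sketch of this mix with the datasets library (the Hub dataset ids are assumptions):

```python
# Sketch of the phase-1 training mix; the Hub dataset ids are assumptions.
from datasets import load_dataset, concatenate_datasets

pilener = load_dataset("nis12ram/entity_type_hi_pilener", split="train")  # 37855 rows
golden = load_dataset("nis12ram/HindiNER-golden-dataset", split="train")  # 952 rows

# Duplicate the golden set 3x to offset the under-representation of
# Devanagari Hindi, then shuffle the combined dataset.
train = concatenate_datasets([pilener, golden, golden, golden]).shuffle(seed=42)
```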
Training Hyperparameters:
- LoRA rank = 512
- LoRA alpha = 512
- LoRA target modules = ["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "down_proj"]
- Batch size = 16
- Gradient accumulation = 1
- Warmup ratio = 0.03
- Epochs = 1
- Learning rate = 5e-5
- Optimizer = adamw_8bit
- Learning rate scheduler = linear
- Weight decay = 0.01
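These settings map onto a peft + transformers setup roughly as follows (a sketch: the base model's Hub id, a per-device batch size, and "adamw_bnb_8bit" as the transformers name for adamw_8bit are assumptions; the author's notebook may use a different stack):

```python
# Phase-1 LoRA configuration sketch; trainer wiring and data collation omitted.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

base = AutoModelForCausalLM.from_pretrained("nvidia/Nemotron-4-Mini-Hindi-4B-Instruct")

lora_cfg = LoraConfig(
    r=512,
    lora_alpha=512,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)

args = TrainingArguments(
    output_dir="hindiner-phase1",
    per_device_train_batch_size=16,  # "Batch size = 16" above; per-device is an assumption
    gradient_accumulation_steps=1,
    warmup_ratio=0.03,
    num_train_epochs=1,
    learning_rate=5e-5,
    optim="adamw_bnb_8bit",          # bitsandbytes 8-bit AdamW, i.e. "adamw_8bit"
    lr_scheduler_type="linear",
    weight_decay=0.01,
)
```

`model`, `args`, and the mixed dataset would then be passed to a standard Trainer (or TRL's SFTTrainer).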
Desired outcome: A model that understands diverse text and produces high-quality NER predictions for English source text, while generating reasonable predictions for Hindi source text, following the specified JSON output structure.
Check out the Colab Notebook for the phase1 training code.
Check out the phase1 model
- Phase 2: Hindi polishing
The model from phase 1 is fine-tuned on HindiNER-golden-dataset alone.
Datapoint proportions:
- 952 from HindiNER-golden-dataset
Training Hyperparameters:
- LoRA rank = 128
- LoRA alpha = 128
- LoRA target modules = ["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "down_proj"]
- Batch size = 4
- Gradient accumulation = 2
- Warmup ratio = 0.00
- Epochs = 1
- Learning rate = 2e-4
- Optimizer = adamw_8bit
- Learning rate scheduler = linear
- Weight decay = 0.01
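A matching sketch for phase 2 (the phase-1 checkpoint path is hypothetical, and merging the phase-1 adapter before applying the new, smaller adapter is an assumption):

```python
# Phase-2 LoRA configuration sketch; the phase-1 checkpoint path is hypothetical.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

phase1 = AutoModelForCausalLM.from_pretrained("path/to/merged-phase1-model")

lora_cfg = LoraConfig(
    r=128,
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(phase1, lora_cfg)

args = TrainingArguments(
    output_dir="hindiner-phase2",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,   # effective batch size 8
    warmup_ratio=0.0,
    num_train_epochs=1,
    learning_rate=2e-4,
    optim="adamw_bnb_8bit",
    lr_scheduler_type="linear",
    weight_decay=0.01,
)
```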
Desired outcome: A model that produces high-quality NER predictions for Hindi source text while retaining most of what was learned in phase 1.
Check out the Colab Notebook for the phase2 training code.
Evaluation
A general, unconstrained Hindi NER model is highly non-deterministic and hard to evaluate:
- Because of the model's general behavior, it can generate any entity type for a given entity value, which makes it difficult to score the exact entity type for less common entity values.
- Unconstrained behavior means there is no mechanism to regulate or limit which entity type and entity value pairs may be produced.
Given this behavior, human evaluation on a set of pre-selected, diverse data points proved to be the most effective approach.
NOTE
An LLM-as-a-judge approach was also designed, but it was abandoned due to budget constraints and other challenges.
Github Repo
- Repository: HindiNER-v0
Model Card Authors
- nis12ram