---
license: mit
language: hi
tags:
- hindi
- text-generation
- causal-lm
- custom-model
pipeline_tag: text-generation
---
# Hindi Causal Language Model (convaiinnovations/hindi-foundational-model-base)
This repository contains a custom-trained Hindi Causal Language Model designed for Hindi text generation.
## Model Description
- **Model Size:** 113M parameters (yes, it's intentionally small)
- **Architecture:** Custom Transformer (12 layers, hidden=768, 16 heads, ffn=3072, act=swiglu, norm=rmsnorm) based on the `HindiCausalLM` class with Hindi-specific optimizations:
- Multi-resolution attention to capture both character-level and word-level patterns
- Morphology-aware feed-forward layers
- Script-mix processing for Hindi-English code-mixing
- **Language:** Hindi (hi)
- **Training Data:** 2.7 million high-quality Hindi text samples from:
- IITB Parallel Corpus (1.2M sentences)
- Samanantar (750K samples)
- Oscar Hindi (450K sentences)
- CC-100 Hindi (300K sentences)
- Hindi Wikipedia (150K articles)
- Hindi news articles (100K pieces)
- XNLI Hindi (50K premise-hypothesis pairs)
- IndicGLUE (30K samples)
- Hindi literature (5K passages)
- **Tokenizer:** SentencePiece model trained on Hindi text with a vocabulary size of 16,000 (see the loading sketch after this list)
- **Training Details:** Trained for 8 hours on 4x NVIDIA L4 GPUs (24 GB VRAM each) for 2 epochs, with hidden_size=768, num_layers=12, block_size=512, batch_size=64, learning_rate=5e-5, SwiGLU activation, RoPE positional encoding, and RMS normalization
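As a quick sanity check, the tokenizer file can also be loaded directly with the `sentencepiece` library. This is a minimal sketch for verification only; the repository's own `SentencePieceTokenizerWrapper` (shown below) is the intended interface:

```python
import sentencepiece as spm

# Load the raw SentencePiece model that ships with this repository
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

print(sp.get_piece_size())                 # Expected: 16000
ids = sp.encode("गंगा नदी", out_type=int)   # "The Ganges river"
print(ids)                                 # Token IDs for the prompt
print(sp.decode(ids))                      # Should round-trip back to the input text
```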
## How to Use
**⚠️ Important:** This model uses custom Python classes (`HindiCausalLM`, `HindiCausalLMConfig`, `SentencePieceTokenizerWrapper`) which are **not** part of the standard Hugging Face `transformers` library. The custom Python files are included in this repository.
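Because the model and tokenizer classes live in plain `.py` files rather than an installed package, Python must be able to import them. If you download the files somewhere other than your working directory, add that directory to the import path first (an illustrative snippet; `model_dir` is whatever directory you download into):

```python
import os
import sys

model_dir = "."  # Wherever hindi_language_model.py / hindi_embeddings.py were saved
sys.path.insert(0, os.path.abspath(model_dir))
```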
### Download Required Files
```python
import os
from huggingface_hub import hf_hub_download

# Configuration
repo_id = "convaiinnovations/hindi-foundational-model-base"
model_dir = "."  # Use the current directory for downloaded files

# Download model files
print(f"Downloading files for {repo_id}...")
config_path = hf_hub_download(repo_id=repo_id, filename="config.json", local_dir=model_dir)
tokenizer_path = hf_hub_download(repo_id=repo_id, filename="tokenizer.model", local_dir=model_dir)

# Download the custom module files (these are crucial!)
hindi_model_path = hf_hub_download(repo_id=repo_id, filename="hindi_language_model.py", local_dir=model_dir)
hindi_embeddings_path = hf_hub_download(repo_id=repo_id, filename="hindi_embeddings.py", local_dir=model_dir)

# Prefer safetensors weights; fall back to the PyTorch .bin file
try:
    weights_path = hf_hub_download(repo_id=repo_id, filename="model.safetensors", local_dir=model_dir)
    using_safetensors = True
except Exception:
    weights_path = hf_hub_download(repo_id=repo_id, filename="pytorch_model.bin", local_dir=model_dir)
    using_safetensors = False

print("All necessary files downloaded.")
```
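Alternatively, `huggingface_hub.snapshot_download` fetches every file in the repository in one call, which avoids listing filenames individually. A sketch of the equivalent download:

```python
from huggingface_hub import snapshot_download

# Download the entire repository (config, tokenizer, custom .py modules, weights)
local_path = snapshot_download(
    repo_id="convaiinnovations/hindi-foundational-model-base",
    local_dir=".",
)
print(f"Repository downloaded to: {local_path}")
```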
### Debug and Inference Script
```python
import os
import json
import time
import argparse  # Kept for potential future CLI use
import traceback  # For detailed exception info

import numpy as np
import torch

# Try importing safetensors (optional, preferred weight format)
try:
    import safetensors.torch
    SAFE_TENSORS_AVAILABLE = True
except ImportError:
    SAFE_TENSORS_AVAILABLE = False

print("[INFO] --- Debug Inference Script Started ---")
if SAFE_TENSORS_AVAILABLE:
    print("[INFO] safetensors library found.")
else:
    print("[WARNING] safetensors library not found.")

# --- Attempt to import the custom modules shipped with this repository ---
print("[DEBUG] Attempting to import custom modules...")
try:
    from hindi_language_model import HindiCausalLM, HindiCausalLMConfig
    from hindi_embeddings import SentencePieceTokenizerWrapper
    print("[INFO] Successfully imported custom modules.")
except ImportError as e:
    print(f"[ERROR] Failed to import custom modules: {e}")
    traceback.print_exc()
    raise SystemExit(1)  # The script cannot continue without them
# --- End Custom Module Import ---
# --- Main Generation Function Definition ---
def run_generation(
    model_path: str,
    prompt: str,
    max_len: int,
    temp: float,
    top_k: int,
    seed: int,
    device_str: str,
):
    """Loads the model and generates text, printing debug info along the way."""
    print("\n[INFO] --- Starting Generation ---")
    print(f"[DEBUG] Args: path='{model_path}', max_len={max_len}, temp={temp}, "
          f"top_k={top_k}, seed={seed}, device='{device_str}'")
    # --- Setup ---
    t_start_setup = time.time()
    try:
        torch.manual_seed(seed)
        np.random.seed(seed)
        device = torch.device(device_str)
        if device.type == 'cuda':
            torch.cuda.manual_seed_all(seed)
        print(f"[INFO] Using device: {device}")
        print(f"[DEBUG] Setup took {time.time() - t_start_setup:.4f}s")
    except Exception as e:
        print(f"[ERROR] Device/Seed setup failed: {e}")
        traceback.print_exc()
        return None
    # --- Load Tokenizer ---
    print("\n[INFO] --- Loading Tokenizer ---")
    t_start_load = time.time()
    tokenizer = None
    try:
        tokenizer_model_file = os.path.join(model_path, "tokenizer.model")
        print(f"[DEBUG] Looking for tokenizer at: {tokenizer_model_file}")
        assert os.path.exists(tokenizer_model_file), "tokenizer.model not found!"
        tokenizer = SentencePieceTokenizerWrapper(tokenizer_model_file)  # Custom wrapper class
        print(f"[INFO] Tokenizer loaded. Vocab: {getattr(tokenizer, 'vocab_size', 'N/A')}")
        # Get BOS/EOS token IDs (fall back to the SentencePiece defaults if missing)
        bos_id = getattr(tokenizer, 'bos_token_id', 1)  # Default: 1
        eos_id = getattr(tokenizer, 'eos_token_id', 2)  # Default: 2
        print(f"[INFO] BOS ID: {bos_id}, EOS ID: {eos_id}")
    except Exception as e:
        print(f"[ERROR] Tokenizer loading failed: {e}")
        traceback.print_exc()
        return None
    # --- Load Config ---
    print("\n[INFO] --- Loading Config ---")
    lm_config = None
    try:
        config_file = os.path.join(model_path, "config.json")
        print(f"[DEBUG] Looking for config at: {config_file}")
        assert os.path.exists(config_file), "config.json not found!"
        with open(config_file, 'r', encoding='utf-8') as f:
            config_dict = json.load(f)
        print("[DEBUG] Config JSON loaded.")
        # If the config and tokenizer disagree on vocab size, trust the tokenizer
        tok_vocab = getattr(tokenizer, 'vocab_size', None)
        if tok_vocab and 'vocab_size' in config_dict and config_dict['vocab_size'] != tok_vocab:
            print(f"[WARN] Config/Tokenizer vocab mismatch. Using tokenizer size: {tok_vocab}")
            config_dict['vocab_size'] = tok_vocab
        # Instantiate the config object
        if hasattr(HindiCausalLMConfig, 'from_dict'):
            lm_config = HindiCausalLMConfig.from_dict(config_dict)
        else:
            lm_config = HindiCausalLMConfig(**config_dict)
        print("[INFO] Model config loaded.")
    except Exception as e:
        print(f"[ERROR] Config loading failed: {e}")
        traceback.print_exc()
        return None
    # --- Load Model ---
    print("\n[INFO] --- Loading Model ---")
    model = None
    try:
        print(f"[DEBUG] Instantiating {HindiCausalLM.__name__}...")
        model = HindiCausalLM(lm_config)
        print("[INFO] Model structure created.")
        s_path = os.path.join(model_path, "model.safetensors")
        b_path = os.path.join(model_path, "pytorch_model.bin")
        print(f"[DEBUG] Checking weights: {s_path} (exists: {os.path.exists(s_path)}), "
              f"{b_path} (exists: {os.path.exists(b_path)})")
        if SAFE_TENSORS_AVAILABLE and os.path.exists(s_path):
            weights_file = s_path
        elif os.path.exists(b_path):
            weights_file = b_path
        else:
            raise FileNotFoundError("Model weights (.safetensors or .bin) not found!")
        print(f"[INFO] Loading weights from: {weights_file}")
        if weights_file.endswith(".safetensors"):
            state_dict = safetensors.torch.load_file(weights_file, device="cpu")
        else:
            state_dict = torch.load(weights_file, map_location="cpu")
        print(f"[DEBUG] State dict loaded to CPU. Keys: {len(state_dict)}")
        try:
            load_res = model.load_state_dict(state_dict, strict=True)
        except RuntimeError as e_load:
            print(f"[WARN] Strict load failed: {e_load}. Trying non-strict.")
            load_res = model.load_state_dict(state_dict, strict=False)
        missing = getattr(load_res, "missing_keys", [])
        unexpected = getattr(load_res, "unexpected_keys", [])
        print(f"[INFO] State dict loaded. Missing: {len(missing)}. Unexpected: {len(unexpected)}")
        if missing:
            print(f"[WARN] Missing keys: {missing[:5]}...")
        if unexpected:
            print(f"[WARN] Unexpected keys: {unexpected[:5]}...")
        del state_dict
        model.to(device)
        model.eval()
        print("[INFO] Model loaded to device and set to eval mode.")
        print(f"[DEBUG] Tokenizer+Config+Model loading took {time.time() - t_start_load:.2f}s")
    except Exception as e:
        print(f"[ERROR] Model loading failed: {e}")
        traceback.print_exc()
        return None
    # --- Generation ---
    print("\n[INFO] --- Starting Text Generation ---")
    t_start_gen = time.time()
    print(f"[INFO] Prompt: \"{prompt}\"")
    try:
        print("[DEBUG] Encoding prompt...")
        # Prefer the wrapper's __call__; fall back to the raw SentencePiece model
        if callable(tokenizer):
            print("[DEBUG] Trying tokenizer(prompt)...")
            encoded_result = tokenizer(prompt, return_tensors=None)
            if isinstance(encoded_result, dict) and 'input_ids' in encoded_result:
                input_ids = encoded_result['input_ids']
            elif hasattr(tokenizer, 'sp_model') and hasattr(tokenizer.sp_model, 'EncodeAsIds'):
                print(f"[DEBUG] __call__ result type {type(encoded_result)} unexpected. "
                      f"Trying sp_model.EncodeAsIds...")
                input_ids = tokenizer.sp_model.EncodeAsIds(prompt)
            else:
                raise AttributeError("Cannot find a suitable encoding method "
                                     "(__call__ or sp_model.EncodeAsIds)")
        elif hasattr(tokenizer, 'sp_model') and hasattr(tokenizer.sp_model, 'EncodeAsIds'):
            print("[DEBUG] Trying tokenizer.sp_model.EncodeAsIds...")
            input_ids = tokenizer.sp_model.EncodeAsIds(prompt)
        else:
            raise AttributeError("Cannot find a suitable encoding method")
        print(f"[DEBUG] Prompt token IDs: {input_ids}")
        if bos_id is not None:
            print(f"[DEBUG] Prepending BOS {bos_id}")
            input_ids = [bos_id] + input_ids
        input_tensor = torch.tensor([input_ids], dtype=torch.long, device=device)
        print(f"[DEBUG] Initial input tensor shape: {input_tensor.shape}")
        generated_ids = input_tensor

        print("[DEBUG] Starting generation loop...")
        with torch.no_grad():
            for i in range(max_len - len(input_ids)):
                step = i + 1
                print(f"\n[DEBUG] --- Step {step}/{max_len - len(input_ids)} | "
                      f"Current len: {generated_ids.shape[1]} ---")
                t_fwd = time.time()
                # --- Forward call and logit extraction ---
                outputs = model(input_ids=generated_ids)
                # The model may return either a dict or an object with a .logits attribute
                if isinstance(outputs, dict) and 'logits' in outputs:
                    logits = outputs['logits']
                    print(f"[DEBUG] Fwd pass {time.time() - t_fwd:.4f}s. Accessed dict['logits'].")
                elif hasattr(outputs, 'logits'):
                    logits = outputs.logits
                    print(f"[DEBUG] Fwd pass {time.time() - t_fwd:.4f}s. Accessed outputs.logits.")
                else:
                    print(f"[ERROR] Model output type is {type(outputs)} and does not contain 'logits'.")
                    raise TypeError("Model output format error.")
                next_token_logits = logits[:, -1, :]
                print(f"[DEBUG] Next logits shape: {next_token_logits.shape}")
                # --- Sampling: temperature scaling, top-k filtering, multinomial draw ---
                if temp > 0:
                    scaled_logits = next_token_logits / temp
                    if top_k > 0:
                        kth_vals, _ = torch.topk(scaled_logits, k=top_k, dim=-1)
                        scaled_logits[scaled_logits < kth_vals[:, -1].unsqueeze(-1)] = -float("Inf")
                    probs = torch.softmax(scaled_logits, dim=-1)
                    next_token_id = torch.multinomial(probs, num_samples=1)
                else:
                    # temp <= 0: true greedy decoding
                    next_token_id = torch.argmax(next_token_logits, dim=-1, keepdim=True)
                print(f"[DEBUG] Sampled ID: {next_token_id.item()}")
                generated_ids = torch.cat([generated_ids, next_token_id], dim=1)
                if next_token_id.item() == eos_id:
                    print(f"[INFO] EOS token {eos_id} generated.")
                    break
            else:
                print(f"[INFO] Reached max length {max_len}.")

        # --- Decode ---
        print("\n[DEBUG] --- Post-processing ---")
        output_ids = generated_ids[0].cpu().tolist()
        print(f"[DEBUG] Raw output IDs: {output_ids}")
        processed_ids = output_ids
        if bos_id and processed_ids and processed_ids[0] == bos_id:
            print("[DEBUG] Removing BOS")
            processed_ids = processed_ids[1:]
        if eos_id and processed_ids and processed_ids[-1] == eos_id:
            print("[DEBUG] Removing EOS")
            processed_ids = processed_ids[:-1]
        print(f"[DEBUG] Processed IDs: {processed_ids}")
        print("[INFO] Decoding...")
        # Use the raw SentencePiece DecodeIds if available, else the wrapper's decode
        if hasattr(tokenizer, 'sp_model') and hasattr(tokenizer.sp_model, 'DecodeIds'):
            print("[DEBUG] Decoding using tokenizer.sp_model.DecodeIds...")
            generated_text = tokenizer.sp_model.DecodeIds(processed_ids)
        elif hasattr(tokenizer, 'decode'):
            print("[DEBUG] Decoding using tokenizer.decode...")
            generated_text = tokenizer.decode(processed_ids)
        else:
            raise AttributeError("Cannot find a suitable decoding method")
        print(f"[DEBUG] Decoded text: '{generated_text}'")
        print(f"[INFO] Generation successful ({time.time() - t_start_gen:.2f}s).")
        return generated_text
    except Exception as e:
        print(f"[ERROR] Generation loop error: {e}")
        traceback.print_exc()
        return None
# --- End Generation Function Definition ---
# --- Main Execution Block ---
if __name__ == "__main__":
    # --- Parameters ---
    model_dir = "."  # Use the current directory if files were downloaded here
    prompt = "गंगा नदी"  # "The Ganges river"
    max_len = 80
    temp = 2  # Fairly high; lower values (e.g., 0.7) give more conservative text
    top_k = 45
    seed = 42
    device = "cuda" if torch.cuda.is_available() else "cpu"

    print("\n[INFO] --- Simple Hindi Text Generation Script ---")
    print(f"[INFO] Model Dir: {model_dir}")
    print(f"[INFO] Prompt: \"{prompt}\"")
    print(f"[INFO] Max Length: {max_len}")
    print(f"[INFO] Temperature: {temp}")
    print(f"[INFO] Top-K: {top_k}")
    print(f"[INFO] Seed: {seed}")
    print(f"[INFO] Device: {device}")
    print("-" * 30)

    # --- Validate Path ---
    if not os.path.isdir(model_dir):
        print(f"[ERROR] Model directory not found: {model_dir}")
        raise SystemExit(1)

    # --- Run Generation ---
    generated_output = run_generation(
        model_path=model_dir, prompt=prompt, max_len=max_len,
        temp=temp, top_k=top_k, seed=seed, device_str=device,
    )

    # --- Print Result ---
    print("\n" + "=" * 20 + " Final Generation Result " + "=" * 20)
    if generated_output is not None:
        print(f"Prompt: {prompt}")
        print("-" * (40 + len(" Final Generation Result ")))
        print("Generated Text:")
        print(generated_output)
    else:
        print("\n[FAILURE] Text generation failed. Check the log output above.")
    print("=" * (40 + len(" Final Generation Result ")))
```
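The sampling step buried in the loop above is ordinary temperature plus top-k sampling. Distilled into a standalone helper for clarity (a minimal sketch, not part of the repository's code):

```python
import torch

def sample_next_token(next_token_logits: torch.Tensor, temp: float, top_k: int) -> torch.Tensor:
    """Temperature + top-k sampling over the last position's logits ([batch, vocab])."""
    if temp <= 0:
        # Greedy decoding: just take the most likely token
        return torch.argmax(next_token_logits, dim=-1, keepdim=True)
    scaled = next_token_logits / temp  # Higher temp flattens the distribution
    if top_k > 0:
        # Mask out everything below the k-th largest logit
        kth_vals, _ = torch.topk(scaled, k=top_k, dim=-1)
        scaled[scaled < kth_vals[:, -1].unsqueeze(-1)] = -float("inf")
    probs = torch.softmax(scaled, dim=-1)
    return torch.multinomial(probs, num_samples=1)  # Shape: [batch, 1]
```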
## Example Outputs
### Basic Example
```python
prompt = "हिंदी भाषा"
# Output: "हिंदी भाषा भारत की सबसे महत्वपूर्ण भाषाओं में से एक है। यह भारत के उत्तर भारत के राज्यों में मुख्य भाषा के रूप में बोली जाती है..."
```
### Creative Writing Example
```python
prompt = "एक बार की बात है"
# Output: "एक बार की बात है, जब मैं छोटा था, तब मेरे दादाजी मुझे एक कहानी सुनाया करते थे। वह कहानी एक ऐसे राजा की थी जो अपने राज्य में..."
```
## Limitations and Biases
- The model may reflect biases present in its training data, including potential cultural, gender, or regional biases found in source materials.
- Performance is limited by its architecture size (12 layers, hidden=768) and training dataset size.
- May generate repetitive, nonsensical, or factually incorrect text.
- Uses weighted pooling tuned for Hindi's SOV (subject-object-verb) structure, but may still struggle with complex semantic relationships in longer texts.
- May have particular difficulties with:
- Cultural concepts lacking direct English translations
- Idiomatic expressions specific to Hindi
- Formal/informal speech distinctions
- Handling Hindi-specific morphological complexities
## License
This model is licensed under the MIT License.
Please use this model responsibly.