---
license: mit
language: hi
tags:
- hindi
- text-generation
- causal-lm
- custom-model
pipeline_tag: text-generation
---
# Hindi Causal Language Model (convaiinnovations/hindi-foundational-model-base)
This repository contains a custom-trained Hindi Causal Language Model designed for Hindi text generation.
## Model Description
- **Model Size:** 113M parameters (yes, it's intentionally small)
- **Architecture:** Custom Transformer (12 layers, hidden=768, 16 heads, ffn=3072, act=swiglu, norm=rmsnorm) based on the `HindiCausalLM` class with Hindi-specific optimizations:
- Multi-resolution attention to capture both character-level and word-level patterns
- Morphology-aware feed-forward layers
- Script-mix processing for Hindi-English code-mixing
- **Language:** Hindi (hi)
- **Training Data:** 2.7 million high-quality Hindi text samples from:
- IITB Parallel Corpus (1.2M sentences)
- Samanantar (750K samples)
- Oscar Hindi (450K sentences)
- CC-100 Hindi (300K sentences)
- Hindi Wikipedia (150K articles)
- Hindi news articles (100K pieces)
- XNLI Hindi (50K premise-hypothesis pairs)
- IndicGLUE (30K samples)
- Hindi literature (5K passages)
- **Tokenizer:** SentencePiece model trained on Hindi text with a vocabulary size of 16,000 (see the loading sketch after this list)
- **Training Details:** Trained for 8 hours on 4x NVIDIA L4 GPUs (24 GB VRAM each) for 2 epochs, with hidden_size=768, num_layers=12, block_size=512, batch_size=64, learning_rate=5e-5, SwiGLU activation, RoPE positional encoding, and RMS normalization
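As a quick sanity check, the tokenizer file can also be loaded directly with the `sentencepiece` library. This is a minimal sketch for verification only; the repository's own `SentencePieceTokenizerWrapper` (shown below) is the intended interface:

```python
import sentencepiece as spm

# Load the raw SentencePiece model that ships with this repository
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

print(sp.get_piece_size())                 # Expected: 16000
ids = sp.encode("गंगा नदी", out_type=int)   # "The Ganges river"
print(ids)                                 # Token IDs for the prompt
print(sp.decode(ids))                      # Should round-trip back to the input text
```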
## How to Use
**⚠️ Important:** This model uses custom Python classes (`HindiCausalLM`, `HindiCausalLMConfig`, `SentencePieceTokenizerWrapper`) which are **not** part of the standard Hugging Face `transformers` library. The custom Python files are included in this repository.
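Because the model and tokenizer classes live in plain `.py` files rather than an installed package, Python must be able to import them. If you download the files somewhere other than your working directory, add that directory to the import path first (an illustrative snippet; `model_dir` is whatever directory you download into):

```python
import os
import sys

model_dir = "."  # Wherever hindi_language_model.py / hindi_embeddings.py were saved
sys.path.insert(0, os.path.abspath(model_dir))
```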
### Download Required Files
```python
import os
from huggingface_hub import hf_hub_download

# Configuration
repo_id = "convaiinnovations/hindi-foundational-model-base"
model_dir = "."  # Use the current directory for downloaded files

# Download model files
print(f"Downloading files for {repo_id}...")
config_path = hf_hub_download(repo_id=repo_id, filename="config.json", local_dir=model_dir)
tokenizer_path = hf_hub_download(repo_id=repo_id, filename="tokenizer.model", local_dir=model_dir)

# Download the custom module files (these are crucial!)
hindi_model_path = hf_hub_download(repo_id=repo_id, filename="hindi_language_model.py", local_dir=model_dir)
hindi_embeddings_path = hf_hub_download(repo_id=repo_id, filename="hindi_embeddings.py", local_dir=model_dir)

# Prefer safetensors weights; fall back to the PyTorch .bin file
try:
    weights_path = hf_hub_download(repo_id=repo_id, filename="model.safetensors", local_dir=model_dir)
    using_safetensors = True
except Exception:
    weights_path = hf_hub_download(repo_id=repo_id, filename="pytorch_model.bin", local_dir=model_dir)
    using_safetensors = False

print("All necessary files downloaded.")
```
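Alternatively, `huggingface_hub.snapshot_download` fetches every file in the repository in one call, which avoids listing filenames individually. A sketch of the equivalent download:

```python
from huggingface_hub import snapshot_download

# Download the entire repository (config, tokenizer, custom .py modules, weights)
local_path = snapshot_download(
    repo_id="convaiinnovations/hindi-foundational-model-base",
    local_dir=".",
)
print(f"Repository downloaded to: {local_path}")
```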
### Debug and Inference Script
```python
import os
import json
import time
import argparse  # Kept for potential future CLI use
import traceback  # For detailed exception info

import numpy as np
import torch

# Try importing safetensors (optional, preferred weight format)
try:
    import safetensors.torch
    SAFE_TENSORS_AVAILABLE = True
except ImportError:
    SAFE_TENSORS_AVAILABLE = False

print("[INFO] --- Debug Inference Script Started ---")
if SAFE_TENSORS_AVAILABLE:
    print("[INFO] safetensors library found.")
else:
    print("[WARNING] safetensors library not found.")

# --- Attempt to import the custom modules shipped with this repository ---
print("[DEBUG] Attempting to import custom modules...")
try:
    from hindi_language_model import HindiCausalLM, HindiCausalLMConfig
    from hindi_embeddings import SentencePieceTokenizerWrapper
    print("[INFO] Successfully imported custom modules.")
except ImportError as e:
    print(f"[ERROR] Failed to import custom modules: {e}")
    traceback.print_exc()
    raise SystemExit(1)  # The script cannot continue without them
# --- End Custom Module Import ---
# --- Main Generation Function Definition ---
def run_generation(
    model_path: str,
    prompt: str,
    max_len: int,
    temp: float,
    top_k: int,
    seed: int,
    device_str: str,
):
    """Loads the model and generates text, printing debug info along the way."""
    print("\n[INFO] --- Starting Generation ---")
    print(f"[DEBUG] Args: path='{model_path}', max_len={max_len}, temp={temp}, "
          f"top_k={top_k}, seed={seed}, device='{device_str}'")
    # --- Setup ---
    t_start_setup = time.time()
    try:
        torch.manual_seed(seed)
        np.random.seed(seed)
        device = torch.device(device_str)
        if device.type == 'cuda':
            torch.cuda.manual_seed_all(seed)
        print(f"[INFO] Using device: {device}")
        print(f"[DEBUG] Setup took {time.time() - t_start_setup:.4f}s")
    except Exception as e:
        print(f"[ERROR] Device/Seed setup failed: {e}")
        traceback.print_exc()
        return None
    # --- Load Tokenizer ---
    print("\n[INFO] --- Loading Tokenizer ---")
    t_start_load = time.time()
    tokenizer = None
    try:
        tokenizer_model_file = os.path.join(model_path, "tokenizer.model")
        print(f"[DEBUG] Looking for tokenizer at: {tokenizer_model_file}")
        assert os.path.exists(tokenizer_model_file), "tokenizer.model not found!"
        tokenizer = SentencePieceTokenizerWrapper(tokenizer_model_file)  # Custom wrapper class
        print(f"[INFO] Tokenizer loaded. Vocab: {getattr(tokenizer, 'vocab_size', 'N/A')}")
        # Get BOS/EOS token IDs (fall back to the SentencePiece defaults if missing)
        bos_id = getattr(tokenizer, 'bos_token_id', 1)  # Default: 1
        eos_id = getattr(tokenizer, 'eos_token_id', 2)  # Default: 2
        print(f"[INFO] BOS ID: {bos_id}, EOS ID: {eos_id}")
    except Exception as e:
        print(f"[ERROR] Tokenizer loading failed: {e}")
        traceback.print_exc()
        return None
    # --- Load Config ---
    print("\n[INFO] --- Loading Config ---")
    lm_config = None
    try:
        config_file = os.path.join(model_path, "config.json")
        print(f"[DEBUG] Looking for config at: {config_file}")
        assert os.path.exists(config_file), "config.json not found!"
        with open(config_file, 'r', encoding='utf-8') as f:
            config_dict = json.load(f)
        print("[DEBUG] Config JSON loaded.")
        # If the config and tokenizer disagree on vocab size, trust the tokenizer
        tok_vocab = getattr(tokenizer, 'vocab_size', None)
        if tok_vocab and 'vocab_size' in config_dict and config_dict['vocab_size'] != tok_vocab:
            print(f"[WARN] Config/Tokenizer vocab mismatch. Using tokenizer size: {tok_vocab}")
            config_dict['vocab_size'] = tok_vocab
        # Instantiate the config object
        if hasattr(HindiCausalLMConfig, 'from_dict'):
            lm_config = HindiCausalLMConfig.from_dict(config_dict)
        else:
            lm_config = HindiCausalLMConfig(**config_dict)
        print("[INFO] Model config loaded.")
    except Exception as e:
        print(f"[ERROR] Config loading failed: {e}")
        traceback.print_exc()
        return None
    # --- Load Model ---
    print("\n[INFO] --- Loading Model ---")
    model = None
    try:
        print(f"[DEBUG] Instantiating {HindiCausalLM.__name__}...")
        model = HindiCausalLM(lm_config)
        print("[INFO] Model structure created.")
        s_path = os.path.join(model_path, "model.safetensors")
        b_path = os.path.join(model_path, "pytorch_model.bin")
        print(f"[DEBUG] Checking weights: {s_path} (exists: {os.path.exists(s_path)}), "
              f"{b_path} (exists: {os.path.exists(b_path)})")
        if SAFE_TENSORS_AVAILABLE and os.path.exists(s_path):
            weights_file = s_path
        elif os.path.exists(b_path):
            weights_file = b_path
        else:
            raise FileNotFoundError("Model weights (.safetensors or .bin) not found!")
        print(f"[INFO] Loading weights from: {weights_file}")
        if weights_file.endswith(".safetensors"):
            state_dict = safetensors.torch.load_file(weights_file, device="cpu")
        else:
            state_dict = torch.load(weights_file, map_location="cpu")
        print(f"[DEBUG] State dict loaded to CPU. Keys: {len(state_dict)}")
        try:
            load_res = model.load_state_dict(state_dict, strict=True)
        except RuntimeError as e_load:
            print(f"[WARN] Strict load failed: {e_load}. Trying non-strict.")
            load_res = model.load_state_dict(state_dict, strict=False)
        missing = getattr(load_res, "missing_keys", [])
        unexpected = getattr(load_res, "unexpected_keys", [])
        print(f"[INFO] State dict loaded. Missing: {len(missing)}. Unexpected: {len(unexpected)}")
        if missing:
            print(f"[WARN] Missing keys: {missing[:5]}...")
        if unexpected:
            print(f"[WARN] Unexpected keys: {unexpected[:5]}...")
        del state_dict
        model.to(device)
        model.eval()
        print("[INFO] Model loaded to device and set to eval mode.")
        print(f"[DEBUG] Tokenizer+Config+Model loading took {time.time() - t_start_load:.2f}s")
    except Exception as e:
        print(f"[ERROR] Model loading failed: {e}")
        traceback.print_exc()
        return None
    # --- Generation ---
    print("\n[INFO] --- Starting Text Generation ---")
    t_start_gen = time.time()
    print(f"[INFO] Prompt: \"{prompt}\"")
    try:
        print("[DEBUG] Encoding prompt...")
        # Prefer the wrapper's __call__; fall back to the raw SentencePiece model
        if callable(tokenizer):
            print("[DEBUG] Trying tokenizer(prompt)...")
            encoded_result = tokenizer(prompt, return_tensors=None)
            if isinstance(encoded_result, dict) and 'input_ids' in encoded_result:
                input_ids = encoded_result['input_ids']
            elif hasattr(tokenizer, 'sp_model') and hasattr(tokenizer.sp_model, 'EncodeAsIds'):
                print(f"[DEBUG] __call__ result type {type(encoded_result)} unexpected. "
                      f"Trying sp_model.EncodeAsIds...")
                input_ids = tokenizer.sp_model.EncodeAsIds(prompt)
            else:
                raise AttributeError("Cannot find a suitable encoding method "
                                     "(__call__ or sp_model.EncodeAsIds)")
        elif hasattr(tokenizer, 'sp_model') and hasattr(tokenizer.sp_model, 'EncodeAsIds'):
            print("[DEBUG] Trying tokenizer.sp_model.EncodeAsIds...")
            input_ids = tokenizer.sp_model.EncodeAsIds(prompt)
        else:
            raise AttributeError("Cannot find a suitable encoding method")
        print(f"[DEBUG] Prompt token IDs: {input_ids}")
        if bos_id is not None:
            print(f"[DEBUG] Prepending BOS {bos_id}")
            input_ids = [bos_id] + input_ids
        input_tensor = torch.tensor([input_ids], dtype=torch.long, device=device)
        print(f"[DEBUG] Initial input tensor shape: {input_tensor.shape}")
        generated_ids = input_tensor

        print("[DEBUG] Starting generation loop...")
        with torch.no_grad():
            for i in range(max_len - len(input_ids)):
                step = i + 1
                print(f"\n[DEBUG] --- Step {step}/{max_len - len(input_ids)} | "
                      f"Current len: {generated_ids.shape[1]} ---")
                t_fwd = time.time()
                # --- Forward call and logit extraction ---
                outputs = model(input_ids=generated_ids)
                # The model may return either a dict or an object with a .logits attribute
                if isinstance(outputs, dict) and 'logits' in outputs:
                    logits = outputs['logits']
                    print(f"[DEBUG] Fwd pass {time.time() - t_fwd:.4f}s. Accessed dict['logits'].")
                elif hasattr(outputs, 'logits'):
                    logits = outputs.logits
                    print(f"[DEBUG] Fwd pass {time.time() - t_fwd:.4f}s. Accessed outputs.logits.")
                else:
                    print(f"[ERROR] Model output type is {type(outputs)} and does not contain 'logits'.")
                    raise TypeError("Model output format error.")
                next_token_logits = logits[:, -1, :]
                print(f"[DEBUG] Next logits shape: {next_token_logits.shape}")
                # --- Sampling: temperature scaling, top-k filtering, multinomial draw ---
                if temp > 0:
                    scaled_logits = next_token_logits / temp
                    if top_k > 0:
                        kth_vals, _ = torch.topk(scaled_logits, k=top_k, dim=-1)
                        scaled_logits[scaled_logits < kth_vals[:, -1].unsqueeze(-1)] = -float("Inf")
                    probs = torch.softmax(scaled_logits, dim=-1)
                    next_token_id = torch.multinomial(probs, num_samples=1)
                else:
                    # temp <= 0: true greedy decoding
                    next_token_id = torch.argmax(next_token_logits, dim=-1, keepdim=True)
                print(f"[DEBUG] Sampled ID: {next_token_id.item()}")
                generated_ids = torch.cat([generated_ids, next_token_id], dim=1)
                if next_token_id.item() == eos_id:
                    print(f"[INFO] EOS token {eos_id} generated.")
                    break
            else:
                print(f"[INFO] Reached max length {max_len}.")

        # --- Decode ---
        print("\n[DEBUG] --- Post-processing ---")
        output_ids = generated_ids[0].cpu().tolist()
        print(f"[DEBUG] Raw output IDs: {output_ids}")
        processed_ids = output_ids
        if bos_id and processed_ids and processed_ids[0] == bos_id:
            print("[DEBUG] Removing BOS")
            processed_ids = processed_ids[1:]
        if eos_id and processed_ids and processed_ids[-1] == eos_id:
            print("[DEBUG] Removing EOS")
            processed_ids = processed_ids[:-1]
        print(f"[DEBUG] Processed IDs: {processed_ids}")
        print("[INFO] Decoding...")
        # Use the raw SentencePiece DecodeIds if available, else the wrapper's decode
        if hasattr(tokenizer, 'sp_model') and hasattr(tokenizer.sp_model, 'DecodeIds'):
            print("[DEBUG] Decoding using tokenizer.sp_model.DecodeIds...")
            generated_text = tokenizer.sp_model.DecodeIds(processed_ids)
        elif hasattr(tokenizer, 'decode'):
            print("[DEBUG] Decoding using tokenizer.decode...")
            generated_text = tokenizer.decode(processed_ids)
        else:
            raise AttributeError("Cannot find a suitable decoding method")
        print(f"[DEBUG] Decoded text: '{generated_text}'")
        print(f"[INFO] Generation successful ({time.time() - t_start_gen:.2f}s).")
        return generated_text
    except Exception as e:
        print(f"[ERROR] Generation loop error: {e}")
        traceback.print_exc()
        return None
# --- End Generation Function Definition ---
# --- Main Execution Block ---
if __name__ == "__main__":
    # --- Parameters ---
    model_dir = "."  # Use the current directory if files were downloaded here
    prompt = "गंगा नदी"  # "The Ganges river"
    max_len = 80
    temp = 2  # Fairly high; lower values (e.g., 0.7) give more conservative text
    top_k = 45
    seed = 42
    device = "cuda" if torch.cuda.is_available() else "cpu"

    print("\n[INFO] --- Simple Hindi Text Generation Script ---")
    print(f"[INFO] Model Dir: {model_dir}")
    print(f"[INFO] Prompt: \"{prompt}\"")
    print(f"[INFO] Max Length: {max_len}")
    print(f"[INFO] Temperature: {temp}")
    print(f"[INFO] Top-K: {top_k}")
    print(f"[INFO] Seed: {seed}")
    print(f"[INFO] Device: {device}")
    print("-" * 30)

    # --- Validate Path ---
    if not os.path.isdir(model_dir):
        print(f"[ERROR] Model directory not found: {model_dir}")
        raise SystemExit(1)

    # --- Run Generation ---
    generated_output = run_generation(
        model_path=model_dir, prompt=prompt, max_len=max_len,
        temp=temp, top_k=top_k, seed=seed, device_str=device,
    )

    # --- Print Result ---
    print("\n" + "=" * 20 + " Final Generation Result " + "=" * 20)
    if generated_output is not None:
        print(f"Prompt: {prompt}")
        print("-" * (40 + len(" Final Generation Result ")))
        print("Generated Text:")
        print(generated_output)
    else:
        print("\n[FAILURE] Text generation failed. Check the log output above.")
    print("=" * (40 + len(" Final Generation Result ")))
```
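The sampling step buried in the loop above is ordinary temperature plus top-k sampling. Distilled into a standalone helper for clarity (a minimal sketch, not part of the repository's code):

```python
import torch

def sample_next_token(next_token_logits: torch.Tensor, temp: float, top_k: int) -> torch.Tensor:
    """Temperature + top-k sampling over the last position's logits ([batch, vocab])."""
    if temp <= 0:
        # Greedy decoding: just take the most likely token
        return torch.argmax(next_token_logits, dim=-1, keepdim=True)
    scaled = next_token_logits / temp  # Higher temp flattens the distribution
    if top_k > 0:
        # Mask out everything below the k-th largest logit
        kth_vals, _ = torch.topk(scaled, k=top_k, dim=-1)
        scaled[scaled < kth_vals[:, -1].unsqueeze(-1)] = -float("inf")
    probs = torch.softmax(scaled, dim=-1)
    return torch.multinomial(probs, num_samples=1)  # Shape: [batch, 1]
```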
## Example Outputs
### Basic Example
```python
prompt = "हिंदी भाषा"
# Output: "हिंदी भाषा भारत की सबसे महत्वपूर्ण भाषाओं में से एक है। यह भारत के उत्तर भारत के राज्यों में मुख्य भाषा के रूप में बोली जाती है..."
```
### Creative Writing Example
```python
prompt = "एक बार की बात है"
# Output: "एक बार की बात है, जब मैं छोटा था, तब मेरे दादाजी मुझे एक कहानी सुनाया करते थे। वह कहानी एक ऐसे राजा की थी जो अपने राज्य में..."
```
## Limitations and Biases
- The model may reflect biases present in its training data, including potential cultural, gender, or regional biases found in source materials.
- Performance is limited by its architecture size (12 layers, hidden=768) and training dataset size.
- May generate repetitive, nonsensical, or factually incorrect text.
- Uses weighted pooling tuned for Hindi's SOV (subject-object-verb) structure, but may still struggle with complex semantic relationships in longer texts.
- May have particular difficulties with:
- Cultural concepts lacking direct English translations
- Idiomatic expressions specific to Hindi
- Formal/informal speech distinctions
- Handling Hindi-specific morphological complexities
## License
This model is licensed under the MIT License.
Please use this model responsibly.