---
license: mit
language: hi
tags:
- hindi
- text-generation
- causal-lm
- custom-model
pipeline_tag: text-generation
---

# Hindi Causal Language Model (convaiinnovations/hindi-foundational-model-base)

This repository contains a custom-trained Hindi causal language model designed for Hindi text generation.

## Model Description

- **Model Size:** 113M parameters (YAH! It's very small)
- **Architecture:** Custom Transformer (12 layers, hidden=768, 16 heads, ffn=3072, act=swiglu, norm=rmsnorm) based on the `HindiCausalLM` class, with Hindi-specific optimizations:
  - Multi-resolution attention to capture both character-level and word-level patterns
  - Morphology-aware feed-forward layers
  - Script-mix processing for Hindi-English code-mixing
- **Language:** Hindi (hi)
- **Training Data:** 2.7 million high-quality Hindi text samples from:
  - IITB Parallel Corpus (1.2M sentences)
  - Samanantar (750K samples)
  - OSCAR Hindi (450K sentences)
  - CC-100 Hindi (300K sentences)
  - Hindi Wikipedia (150K articles)
  - Hindi news articles (100K pieces)
  - XNLI Hindi (50K premise-hypothesis pairs)
  - IndicGLUE (30K samples)
  - Hindi literature (5K passages)
- **Tokenizer:** SentencePiece model trained on Hindi text with a vocabulary size of 16,000 (see the snippet after this list)
- **Training Details:** Trained on 4x L4 GPUs (24GB VRAM each) for 8 hours: 2 epochs, hidden_size=768, num_layers=12, block_size=512, batch_size=64, learning_rate=5e-5, SwiGLU activation, RoPE positional encoding, and RMS normalization
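
If you want to inspect the tokenizer on its own, the `tokenizer.model` file from this repository (downloaded in the next section) can be loaded directly with the standalone `sentencepiece` library (`pip install sentencepiece`). This is only a quick sketch for examining the vocabulary, not part of the model's own loading path:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("tokenizer.model")  # file downloaded in the next section

print(sp.GetPieceSize())  # vocabulary size; expected to be 16000

# Encode a sample sentence ("The Hindi language is very beautiful.") and decode it back.
sample = "हिंदी भाषा बहुत सुंदर है।"
print(sp.EncodeAsPieces(sample))  # subword pieces
ids = sp.EncodeAsIds(sample)
print(ids)                        # token IDs
print(sp.DecodeIds(ids))          # round-trips to the original sentence
```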

## How to Use

**⚠️ Important:** This model uses custom Python classes (`HindiCausalLM`, `HindiCausalLMConfig`, `SentencePieceTokenizerWrapper`) which are **not** part of the standard Hugging Face `transformers` library. The custom Python files are included in this repository.

### Download Required Files

```python
import os
from huggingface_hub import hf_hub_download

# Configuration
repo_id = "convaiinnovations/hindi-foundational-model-base"
model_dir = "."  # Use current directory for downloaded files

# Download model files
print(f"Downloading files for {repo_id}...")
config_path = hf_hub_download(repo_id=repo_id, filename="config.json", local_dir=model_dir)
tokenizer_path = hf_hub_download(repo_id=repo_id, filename="tokenizer.model", local_dir=model_dir)

# Download custom module files (these are crucial!)
hindi_model_path = hf_hub_download(repo_id=repo_id, filename="hindi_language_model.py", local_dir=model_dir)
hindi_embeddings_path = hf_hub_download(repo_id=repo_id, filename="hindi_embeddings.py", local_dir=model_dir)

# Try safetensors first, then fall back to the .bin weights
try:
    weights_path = hf_hub_download(repo_id=repo_id, filename="model.safetensors", local_dir=model_dir)
    using_safetensors = True
except Exception:
    weights_path = hf_hub_download(repo_id=repo_id, filename="pytorch_model.bin", local_dir=model_dir)
    using_safetensors = False

print("All necessary files downloaded.")
```
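
Before running the full script below, it can help to sanity-check the downloaded `config.json`. This is just a quick inspection sketch; the exact key names depend on how `HindiCausalLMConfig` is defined in this repository:

```python
import json

# Print whatever hyperparameters config.json defines (e.g. vocab_size, hidden size,
# number of layers); the key names are whatever HindiCausalLMConfig expects.
with open("config.json", "r", encoding="utf-8") as f:
    cfg = json.load(f)

for key, value in sorted(cfg.items()):
    print(f"{key}: {value}")
```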

### Debug and Inference Script

```python
import os
import json
import torch
import argparse  # Keep argparse for potential future use
import numpy as np
import time
import traceback  # For detailed exception info

# Try importing safetensors
try:
    import safetensors.torch
    SAFE_TENSORS_AVAILABLE = True
except ImportError:
    SAFE_TENSORS_AVAILABLE = False

print("[INFO] --- Debug Inference Script Started ---")
if SAFE_TENSORS_AVAILABLE:
    print("[INFO] safetensors library found.")
else:
    print("[WARNING] safetensors library not found.")

# --- Attempt to import custom modules ---
print("[DEBUG] Attempting to import custom modules...")
try:
    from hindi_language_model import HindiCausalLM, HindiCausalLMConfig
    from hindi_embeddings import SentencePieceTokenizerWrapper
    print("[INFO] Successfully imported custom modules.")
except ImportError as e:
    print(f"[ERROR] Failed to import custom modules: {e}"); traceback.print_exc()
# --- End Custom Module Import ---


# --- Main Generation Function Definition ---
def run_generation(
    model_path: str, prompt: str, max_len: int, temp: float, top_k: int, seed: int, device_str: str
):
    """Loads model and generates text, printing debug info."""
    print("\nINFO: --- Starting Generation ---")
    print(f"[DEBUG] Args: path='{model_path}', max_len={max_len}, temp={temp}, top_k={top_k}, seed={seed}, device='{device_str}'")

    # --- Setup ---
    t_start_setup = time.time()
    try:
        torch.manual_seed(seed); np.random.seed(seed); device = torch.device(device_str)
        if device.type == 'cuda': torch.cuda.manual_seed_all(seed)
        print(f"[INFO] Using device: {device}")
        print(f"[DEBUG] Setup took {time.time()-t_start_setup:.4f}s")
    except Exception as e:
        print(f"[ERROR] Device/Seed setup failed: {e}"); traceback.print_exc(); return None

    # --- Load Tokenizer ---
    print("\n[INFO] --- Loading Tokenizer ---")
    t_start_load = time.time(); tokenizer = None
    try:
        tokenizer_model_file = os.path.join(model_path, "tokenizer.model")
        print(f"[DEBUG] Looking for tokenizer at: {tokenizer_model_file}")
        assert os.path.exists(tokenizer_model_file), "tokenizer.model not found!"
        tokenizer = SentencePieceTokenizerWrapper(tokenizer_model_file)  # Use imported class
        print(f"[INFO] Tokenizer loaded. Vocab: {getattr(tokenizer, 'vocab_size', 'N/A')}")
        # Get BOS/EOS IDs (fall back to defaults if missing)
        bos_id = getattr(tokenizer, 'bos_token_id', 1)  # Default 1
        eos_id = getattr(tokenizer, 'eos_token_id', 2)  # Default 2
        print(f"[INFO] BOS ID: {bos_id}, EOS ID: {eos_id}")
    except Exception as e:
        print(f"[ERROR] Tokenizer loading failed: {e}"); traceback.print_exc(); return None

    # --- Load Config ---
    print("\n[INFO] --- Loading Config ---")
    lm_config = None
    try:
        config_file = os.path.join(model_path, "config.json")
        print(f"[DEBUG] Looking for config at: {config_file}")
        assert os.path.exists(config_file), "config.json not found!"
        with open(config_file, 'r', encoding='utf-8') as f:
            config_dict = json.load(f)
        print("[DEBUG] Config JSON loaded.")
        # Check/fix vocab size
        tok_vocab = getattr(tokenizer, 'vocab_size', None)
        if tok_vocab and 'vocab_size' in config_dict and config_dict['vocab_size'] != tok_vocab:
            print(f"[WARN] Config/Tokenizer vocab mismatch. Using tokenizer size: {tok_vocab}"); config_dict['vocab_size'] = tok_vocab
        # Instantiate config
        if hasattr(HindiCausalLMConfig, 'from_dict'):
            lm_config = HindiCausalLMConfig.from_dict(config_dict)
        else:
            lm_config = HindiCausalLMConfig(**config_dict)
        print("[INFO] Model config loaded.")
    except Exception as e:
        print(f"[ERROR] Config loading failed: {e}"); traceback.print_exc(); return None

    # --- Load Model ---
    print("\n[INFO] --- Loading Model ---")
    model = None
    try:
        print(f"[DEBUG] Instantiating {HindiCausalLM.__name__}...")
        model = HindiCausalLM(lm_config); print("[INFO] Model structure created.")
        weights_file = None; s_path = os.path.join(model_path, "model.safetensors"); b_path = os.path.join(model_path, "pytorch_model.bin")
        print(f"[DEBUG] Checking weights: {s_path} (exists: {os.path.exists(s_path)}), {b_path} (exists: {os.path.exists(b_path)})")
        if SAFE_TENSORS_AVAILABLE and os.path.exists(s_path):
            weights_file = s_path
        elif os.path.exists(b_path):
            weights_file = b_path
        else:
            raise FileNotFoundError("Model weights (.safetensors or .bin) not found!")
        print(f"[INFO] Loading weights from: {weights_file}")
        if weights_file.endswith(".safetensors"):
            state_dict = safetensors.torch.load_file(weights_file, device="cpu")
        else:
            state_dict = torch.load(weights_file, map_location="cpu")
        print(f"[DEBUG] State dict loaded to CPU. Keys: {len(state_dict)}")
        try:
            load_res = model.load_state_dict(state_dict, strict=True)
        except RuntimeError as e_load:
            print(f"[WARN] Strict load failed: {e_load}. Trying non-strict."); load_res = model.load_state_dict(state_dict, strict=False)
        missing = getattr(load_res, "missing_keys", []); unexpected = getattr(load_res, "unexpected_keys", [])
        print(f"[INFO] State dict loaded. Missing: {len(missing)}. Unexpected: {len(unexpected)}")
        if missing: print(f"[WARN] Missing keys: {missing[:5]}...")
        if unexpected: print(f"[WARN] Unexpected keys: {unexpected[:5]}...")
        del state_dict; model.to(device); model.eval()
        print("[INFO] Model loaded to device and set to eval mode.")
        print(f"[DEBUG] Tokenizer+Config+Model loading took {time.time()-t_start_load:.2f}s")
    except Exception as e:
        print(f"[ERROR] Model loading failed: {e}"); traceback.print_exc(); return None

    # --- Generation ---
    print("\n[INFO] --- Starting Text Generation ---")
    t_start_gen = time.time()
    print(f"[INFO] Prompt: \"{prompt}\"")
    try:
        print("[DEBUG] Encoding prompt...")
        # Use __call__ or sp_model.EncodeAsIds
        if hasattr(tokenizer, '__call__'):
            print("DEBUG: Trying tokenizer(prompt)...")
            encoded_result = tokenizer(prompt, return_tensors=None)
            if isinstance(encoded_result, dict) and 'input_ids' in encoded_result:
                input_ids = encoded_result['input_ids']
            else:
                print(f"DEBUG: __call__ result type {type(encoded_result)} unexpected. Trying sp_model.EncodeAsIds...")
                if hasattr(tokenizer, 'sp_model') and hasattr(tokenizer.sp_model, 'EncodeAsIds'):
                    input_ids = tokenizer.sp_model.EncodeAsIds(prompt)
                else:
                    raise AttributeError("Cannot find suitable encoding method (__call__ or sp_model.EncodeAsIds)")
        elif hasattr(tokenizer, 'sp_model') and hasattr(tokenizer.sp_model, 'EncodeAsIds'):
            print("DEBUG: Trying tokenizer.sp_model.EncodeAsIds...")
            input_ids = tokenizer.sp_model.EncodeAsIds(prompt)
        else:
            raise AttributeError("Cannot find suitable encoding method")
        print(f"[DEBUG] Prompt token IDs: {input_ids}")
        if bos_id is not None:
            print(f"[DEBUG] Prepending BOS {bos_id}"); input_ids = [bos_id] + input_ids
        input_tensor = torch.tensor([input_ids], dtype=torch.long, device=device); print(f"[DEBUG] Initial input tensor shape: {input_tensor.shape}")
        generated_ids = input_tensor

        print("[DEBUG] Starting generation loop...")
        with torch.no_grad():
            for i in range(max_len - len(input_ids)):
                step = i + 1; print(f"\nDEBUG: --- Step {step}/{max_len - len(input_ids)} | Current len: {generated_ids.shape[1]} ---")
                t_fwd = time.time()
                # --- FORWARD CALL AND LOGIT EXTRACTION ---
                outputs = model(input_ids=generated_ids)  # model call
                # *** CORRECTED LOGIT ACCESS ***
                if isinstance(outputs, dict) and 'logits' in outputs:
                    logits = outputs['logits']  # Access via key if output is a dict
                    print(f"DEBUG: Fwd pass {time.time()-t_fwd:.4f}s. Accessed dict['logits'].")
                elif hasattr(outputs, 'logits'):
                    logits = outputs.logits  # Access via attribute if output is an object
                    print(f"DEBUG: Fwd pass {time.time()-t_fwd:.4f}s. Accessed outputs.logits.")
                else:
                    print(f"[ERROR] Model output type is {type(outputs)} and does not contain 'logits'.")
                    raise TypeError("Model output format error.")
                # *** END CORRECTION ***
                next_token_logits = logits[:, -1, :]; print(f"DEBUG: Next logits shape: {next_token_logits.shape}")

                # --- Sampling ---
                if temp > 0:
                    scaled_logits = next_token_logits / temp
                else:
                    scaled_logits = next_token_logits  # Greedy
                if top_k > 0:
                    kth_vals, _ = torch.topk(scaled_logits, k=top_k, dim=-1); scaled_logits[scaled_logits < kth_vals[:, -1].unsqueeze(-1)] = -float("Inf")
                probs = torch.softmax(scaled_logits, dim=-1); next_token_id = torch.multinomial(probs, num_samples=1); print(f"DEBUG: Sampled ID: {next_token_id.item()}")
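
                # Note on sampling: dividing the logits by `temp` flattens (temp > 1) or
                # sharpens (temp < 1) the distribution before the softmax, and the top-k
                # filter masks every logit outside the k highest-scoring tokens, so the
                # multinomial draw above can only pick one of those k candidates.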

                generated_ids = torch.cat([generated_ids, next_token_id], dim=1)
                if next_token_id.item() == eos_id:
                    print(f"INFO: EOS token {eos_id} generated."); break
            else:
                print(f"INFO: Reached max length {max_len}.")

        # --- Decode ---
        print("\nDEBUG: --- Post-processing ---")
        output_ids = generated_ids[0].cpu().tolist(); print(f"[DEBUG] Raw output IDs: {output_ids}")
        processed_ids = output_ids
        if bos_id and processed_ids and processed_ids[0] == bos_id:
            print("[DEBUG] Removing BOS"); processed_ids = processed_ids[1:]
        if eos_id and processed_ids and processed_ids[-1] == eos_id:
            print("[DEBUG] Removing EOS"); processed_ids = processed_ids[:-1]
        print(f"[DEBUG] Processed IDs: {processed_ids}")
        print("[INFO] Decoding...")
        # Use sp_model.DecodeIds or decode
        if hasattr(tokenizer, 'sp_model') and hasattr(tokenizer.sp_model, 'DecodeIds'):
            print("DEBUG: Decoding using tokenizer.sp_model.DecodeIds..."); generated_text = tokenizer.sp_model.DecodeIds(processed_ids)
        elif hasattr(tokenizer, 'decode'):
            print("DEBUG: Decoding using tokenizer.decode..."); generated_text = tokenizer.decode(processed_ids)
        else:
            raise AttributeError("Cannot find suitable decoding method")
        print(f"[DEBUG] Decoded text: '{generated_text}'")
        print(f"[INFO] Generation successful ({time.time() - t_start_gen:.2f}s).")
        return generated_text
    except Exception as e:
        print(f"ERROR: Generation loop error: {e}"); traceback.print_exc(); return None
# --- End Generation Function Definition ---


# --- Main Execution Block ---
if __name__ == "__main__":
    # --- Parameters ---
    model_dir = "."  # Use current directory if files are downloaded here
    prompt = "गंगा नदी"  # "The Ganga (Ganges) river"
    max_len = 80
    temp = 2
    top_k = 45
    seed = 42
    device = "cuda" if torch.cuda.is_available() else "cpu"

    print("\n[INFO] --- Simple Hindi Text Generation Script ---")
    print(f"[INFO] Model Dir: {model_dir}")
    print(f"[INFO] Prompt: \"{prompt}\"")
    print(f"[INFO] Max Length: {max_len}")
    print(f"[INFO] Temperature: {temp}")
    print(f"[INFO] Top-K: {top_k}")
    print(f"[INFO] Seed: {seed}")
    print(f"[INFO] Device: {device}")
    print("-" * 30)

    # --- Validate Path ---
    if not os.path.isdir(model_dir):
        print(f"[ERROR] Model directory not found: {model_dir}"); exit(1)

    # --- Run Generation ---
    if 'run_generation' in locals():
        generated_output = run_generation(
            model_path=model_dir, prompt=prompt, max_len=max_len, temp=temp, top_k=top_k, seed=seed, device_str=device
        )
    else:
        print("[ERROR] run_generation function is not defined!"); generated_output = None

    # --- Print Result ---
    print("\n" + "=" * 20 + " Final Generation Result " + "=" * 20)
    if generated_output is not None:
        print(f"Prompt: {prompt}")
        print("-" * (40 + len(" Final Generation Result ")))
        print("Generated Text:")
        print(generated_output)
    else:
        print("\n[FAILURE] Text generation failed. Check print statements above.")
    print("=" * (40 + len(" Final Generation Result ")))
```
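
If the script above has already been run in the same session (so `run_generation`, `torch`, and the downloaded files are available), you can also call `run_generation` directly with other prompts and sampling settings. The prompt ("the capital of India") and the parameter values below are purely illustrative:

```python
# Assumes the debug script above has been executed (or imported) first.
output = run_generation(
    model_path=".",               # directory containing the downloaded files
    prompt="भारत की राजधानी",      # "the capital of India" (example prompt)
    max_len=60,
    temp=0.8,
    top_k=40,
    seed=123,
    device_str="cuda" if torch.cuda.is_available() else "cpu",
)
print(output)
```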

## Example Outputs

### Basic Example

```python
prompt = "हिंदी भाषा"  # "The Hindi language"
# Output: "हिंदी भाषा भारत की सबसे महत्वपूर्ण भाषाओं में से एक है। यह भारत के उत्तर भारत के राज्यों में मुख्य भाषा के रूप में बोली जाती है..."
# (Roughly: "Hindi is one of the most important languages of India. It is spoken as the main language in the states of northern India...")
```

### Creative Writing Example

```python
prompt = "एक बार की बात है"  # "Once upon a time"
# Output: "एक बार की बात है, जब मैं छोटा था, तब मेरे दादाजी मुझे एक कहानी सुनाया करते थे। वह कहानी एक ऐसे राजा की थी जो अपने राज्य में..."
# (Roughly: "Once upon a time, when I was little, my grandfather used to tell me a story. That story was about a king who, in his kingdom...")
```

## Limitations and Biases

- The model may reflect biases present in its training data, including potential cultural, gender, or regional biases found in the source materials.
- Performance is limited by the model's size (12 layers, hidden=768) and the size of its training dataset.
- It may generate repetitive, nonsensical, or factually incorrect text.
- It uses weighted pooling with sensitivity to Hindi's SOV structure, but may struggle with complex semantic relationships in longer texts.
- It may have particular difficulty with:
  - Cultural concepts lacking direct English translations
  - Idiomatic expressions specific to Hindi
  - Formal/informal speech distinctions
  - Hindi-specific morphological complexities

## License

This model is licensed under the MIT License. Please use this model responsibly.