|
--- |
|
language: |
|
- hi |
|
tags: |
|
- hindi |
|
- text-generation |
|
- causal-lm |
|
- lm |
|
- rope |
|
license: mit |
|
datasets: |
|
- custom_hindi_corpus |
|
--- |
|
|
|
# Hindi-CausalLM |
|
|
|
A Hindi language generation model with the following specifications: |
|
|
|
## Model Architecture |
|
- **Type**: Causal Language Model with Transformer architecture |
|
- **Hidden size**: 768 |
|
- **Layers**: 12 |
|
- **Attention heads**: 16 |
|
- **Key-value heads**: 4 (using grouped-query attention) |
|
- **Position encoding**: Rotary Position Embeddings (RoPE) |
|
- **Vocabulary size**: 16000 |
|
- **Parameters**: ~100M (see the estimate sketch after this list)
|
- **Context window**: 512 tokens |
|
- **Trained on**: Large corpus of Hindi text |
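
The ~100M parameter figure follows from the hyperparameters above. A quick back-of-the-envelope check (ignoring biases and LayerNorm weights, and assuming the untied LM head used in the reference code below):

```python
hidden, layers, vocab, inter = 768, 12, 16000, 3072
heads, kv_heads = 16, 4
head_dim = hidden // heads                  # 48

embed   = vocab * hidden                    # token embedding matrix
lm_head = vocab * hidden                    # output projection (not weight-tied)
attn    = 2 * hidden * hidden + 2 * hidden * (kv_heads * head_dim)  # q/o full width, k/v shrunk by GQA
mlp     = 2 * hidden * inter                # up- and down-projections
total   = embed + lm_head + layers * (attn + mlp)

print(f"~{total / 1e6:.0f}M parameters")    # ≈ 99M, i.e. roughly 100M
```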
|
|
|
## Training |
|
|
|
The model was trained on a large corpus of Hindi text using a cosine learning rate schedule with warmup. Training used mixed precision and distributed data parallelism across multiple GPUs.
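
The exact training hyperparameters are not published here; the snippet below is only a minimal sketch of a linear-warmup plus cosine-decay schedule using PyTorch's `LambdaLR`, with placeholder values for `warmup_steps`, `total_steps`, and the peak learning rate.

```python
import math
import torch

def cosine_with_warmup(step, warmup_steps=1_000, total_steps=100_000):
    """LR multiplier: linear warmup to 1.0, then cosine decay towards 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

model = torch.nn.Linear(768, 768)  # stand-in module for the actual model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # placeholder peak LR
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=cosine_with_warmup)
# each training step: loss.backward(); optimizer.step(); scheduler.step()
```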
|
## Usage |
|
|
|
You can use this model with the following code, which defines the custom architecture and a simple generation loop:
|
|
|
```python |
|
import torch |
|
import math |
|
import os |
|
from hindi_embeddings import SentencePieceTokenizerWrapper |
|
from safetensors.torch import load_file |
|
from torch import nn |
|
from transformers import PreTrainedModel, PretrainedConfig |
|
|
|
|
|
class ConvaiCausalLMConfig(PretrainedConfig): |
|
model_type = "convaicausallm" |
|
|
|
def __init__( |
|
self, |
|
vocab_size=16000, |
|
hidden_size=768, |
|
num_hidden_layers=12, |
|
num_attention_heads=16, |
|
num_key_value_heads=4, |
|
intermediate_size=3072, |
|
hidden_act="silu", |
|
max_position_embeddings=512, |
|
rope_theta=10000.0, # Base parameter for RoPE |
|
**kwargs |
|
): |
|
super().__init__(**kwargs) |
|
self.vocab_size = vocab_size |
|
self.hidden_size = hidden_size |
|
self.num_hidden_layers = num_hidden_layers |
|
self.num_attention_heads = num_attention_heads |
|
self.num_key_value_heads = num_key_value_heads |
|
self.intermediate_size = intermediate_size |
|
self.hidden_act = hidden_act |
|
self.max_position_embeddings = max_position_embeddings |
|
self.rope_theta = rope_theta |
|
|
|
|
|
def precompute_freqs_cis(dim, end, theta=10000.0): |
|
"""Precompute the frequency tensor for complex exponentials (cos, sin)""" |
|
# Ensure dim is even for complex numbers |
|
assert dim % 2 == 0, "Dimension must be even" |
|
|
|
    # Inverse frequencies for each dimension pair, and position indices up to `end`
|
freqs = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim)) |
|
t = torch.arange(end).float() |
|
freqs = torch.outer(t, freqs) # [end, dim/2] |
|
|
|
# Create complex exponentials (cos, sin pairs) |
|
cos, sin = torch.cos(freqs), torch.sin(freqs) |
|
return cos, sin |
|
|
|
|
|
def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None): |
|
"""Apply rotary position embeddings to q and k tensors""" |
|
# Extract shapes |
|
batch, seq_len, n_heads, head_dim = q.shape |
|
_, kv_seq_len, n_kv_heads, _ = k.shape |
|
|
|
# Handle position IDs or use sequential positions |
|
if position_ids is None: |
|
# Default: Just use sequential positions |
|
position_ids = torch.arange(seq_len, device=q.device) |
|
position_ids = position_ids.unsqueeze(0).expand(batch, -1) |
|
|
|
# Get the cosine and sine for the positions we're using |
|
cos = cos[position_ids].unsqueeze(-2) # [batch, seq, 1, dim/2] |
|
sin = sin[position_ids].unsqueeze(-2) # [batch, seq, 1, dim/2] |
|
|
|
# q and k must be arranged in pairs for rotation |
|
q_embed_dim = q.shape[-1] |
|
q_half_dim = q_embed_dim // 2 |
|
|
|
# Split the embedding dimensions into pairs |
|
q_half1, q_half2 = q[..., :q_half_dim], q[..., q_half_dim:] |
|
k_half1, k_half2 = k[..., :q_half_dim], k[..., q_half_dim:] |
|
|
|
# Apply rotary embeddings to each pair of dimensions |
|
# For each pair (a, b), we compute (a*cos - b*sin, a*sin + b*cos) |
|
q_out_half1 = q_half1 * cos - q_half2 * sin |
|
q_out_half2 = q_half1 * sin + q_half2 * cos |
|
k_out_half1 = k_half1 * cos - k_half2 * sin |
|
k_out_half2 = k_half1 * sin + k_half2 * cos |
|
|
|
# Concatenate back to original shape |
|
q_out = torch.cat([q_out_half1, q_out_half2], dim=-1) |
|
k_out = torch.cat([k_out_half1, k_out_half2], dim=-1) |
|
|
|
return q_out, k_out |
|
|
|
|
|
class GroupedQueryAttention(nn.Module): |
|
def __init__(self, config): |
|
super().__init__() |
|
self.hidden_size = config.hidden_size |
|
self.num_heads = config.num_attention_heads |
|
self.num_kv_heads = config.num_key_value_heads |
|
self.head_dim = config.hidden_size // config.num_attention_heads |
|
|
|
# For MQA/GQA support |
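        # e.g. 16 query heads with 4 key/value heads -> each k/v head is shared by 4 query heads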
|
self.num_key_value_groups = self.num_heads // self.num_kv_heads |
|
|
|
self.q_proj = nn.Linear(config.hidden_size, self.num_heads * self.head_dim) |
|
self.k_proj = nn.Linear(config.hidden_size, self.num_kv_heads * self.head_dim) |
|
self.v_proj = nn.Linear(config.hidden_size, self.num_kv_heads * self.head_dim) |
|
self.o_proj = nn.Linear(config.hidden_size, config.hidden_size) |
|
|
|
# Precompute rotary position encoding frequencies |
|
max_seq_len = config.max_position_embeddings |
|
self.max_seq_len = max_seq_len |
|
|
|
# Register frequencies as buffers |
|
cos, sin = precompute_freqs_cis(self.head_dim, max_seq_len, config.rope_theta) |
|
self.register_buffer("cos", cos) # [max_seq_len, dim/2] |
|
self.register_buffer("sin", sin) # [max_seq_len, dim/2] |
|
|
|
# Create causal mask for attention |
|
self.register_buffer( |
|
"causal_mask", |
|
torch.triu(torch.ones(max_seq_len, max_seq_len) * -1e9, diagonal=1) |
|
) |
|
|
|
def forward(self, hidden_states, attention_mask=None): |
|
batch_size, seq_len, _ = hidden_states.size() |
|
|
|
# Project queries, keys, values |
|
q = self.q_proj(hidden_states) |
|
k = self.k_proj(hidden_states) |
|
v = self.v_proj(hidden_states) |
|
|
|
# Reshape for attention computation |
|
q = q.view(batch_size, seq_len, self.num_heads, self.head_dim) |
|
k = k.view(batch_size, seq_len, self.num_kv_heads, self.head_dim) |
|
v = v.view(batch_size, seq_len, self.num_kv_heads, self.head_dim) |
|
|
|
# Apply rotary position embeddings |
|
q_rotary, k_rotary = apply_rotary_pos_emb(q, k, self.cos, self.sin) |
|
|
|
# Reshape for attention computation |
|
q_rotary = q_rotary.transpose(1, 2) # [batch, heads, seq, dim] |
|
k_rotary = k_rotary.transpose(1, 2) # [batch, kv_heads, seq, dim] |
|
v = v.transpose(1, 2) # [batch, kv_heads, seq, dim] |
|
|
|
# Handle Multi-Query Attention / Grouped-Query Attention |
|
if self.num_key_value_groups > 1: |
|
# Repeat k, v for each query in the group |
|
k_rotary = k_rotary.repeat_interleave(self.num_key_value_groups, dim=1) |
|
v = v.repeat_interleave(self.num_key_value_groups, dim=1) |
|
|
|
# Compute attention scores |
|
attn_scores = torch.matmul(q_rotary, k_rotary.transpose(-1, -2)) / (self.head_dim ** 0.5) |
|
|
|
# Apply causal mask - only attend to previous tokens |
|
causal_mask = self.causal_mask[:seq_len, :seq_len] |
|
attn_scores = attn_scores + causal_mask |
|
|
|
# Apply attention mask if provided |
|
if attention_mask is not None: |
|
attn_scores = attn_scores + attention_mask |
|
|
|
# Normalize the attention scores to probabilities |
|
attn_probs = torch.softmax(attn_scores, dim=-1) |
|
|
|
# Apply attention to values |
|
context = torch.matmul(attn_probs, v) # [b, n_heads, seq, head_dim] |
|
|
|
# Reshape back to [batch_size, seq_length, hidden_size] |
|
context = context.transpose(1, 2).contiguous() |
|
context = context.view(batch_size, seq_len, -1) |
|
|
|
# Final projection |
|
output = self.o_proj(context) |
|
|
|
return output |
|
|
|
|
|
class ConvaiCausalLM(PreTrainedModel): |
|
config_class = ConvaiCausalLMConfig |
|
|
|
def __init__(self, config): |
|
super().__init__(config) |
|
self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size) |
|
self.layers = nn.ModuleList([ |
|
nn.ModuleDict({ |
|
"self_attn": GroupedQueryAttention(config), |
|
"mlp": nn.Sequential( |
|
nn.Linear(config.hidden_size, config.intermediate_size), |
|
nn.SiLU(), |
|
nn.Linear(config.intermediate_size, config.hidden_size) |
|
), |
|
"input_layernorm": nn.LayerNorm(config.hidden_size), |
|
"post_attention_layernorm": nn.LayerNorm(config.hidden_size) |
|
}) for _ in range(config.num_hidden_layers) |
|
]) |
|
self.norm = nn.LayerNorm(config.hidden_size) |
|
self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False) |
|
|
|
# Initialize weights |
|
self.apply(self._init_weights) |
|
|
|
def _init_weights(self, module): |
|
if isinstance(module, nn.Linear): |
|
torch.nn.init.normal_(module.weight, mean=0.0, std=0.02) |
|
if module.bias is not None: |
|
torch.nn.init.zeros_(module.bias) |
|
elif isinstance(module, nn.Embedding): |
|
torch.nn.init.normal_(module.weight, mean=0.0, std=0.02) |
|
|
|
def _prepare_attention_mask(self, attention_mask, input_shape, device): |
|
# Prepare masks for attention |
|
if attention_mask is None: |
|
attention_mask = torch.ones(input_shape, device=device) |
|
|
|
# Make broadcastable shape: [batch, 1, 1, seq_len] |
|
extended_mask = attention_mask.unsqueeze(1).unsqueeze(2) |
|
|
|
# Convert to additive mask (0 for valid, -10000 for masked) |
|
extended_mask = (1.0 - extended_mask) * -10000.0 |
|
|
|
return extended_mask |
|
|
|
def forward(self, input_ids, attention_mask=None): |
|
batch_size, seq_len = input_ids.shape |
|
device = input_ids.device |
|
|
|
# Prepare attention mask |
|
if attention_mask is not None: |
|
attention_mask = self._prepare_attention_mask( |
|
attention_mask, (batch_size, seq_len), device |
|
) |
|
|
|
# Get embeddings |
|
hidden_states = self.embed_tokens(input_ids) |
|
|
|
# Apply each layer |
|
for layer in self.layers: |
|
residual = hidden_states |
|
|
|
# First norm and attention |
|
hidden_states = layer["input_layernorm"](hidden_states) |
|
hidden_states = layer["self_attn"](hidden_states, attention_mask) |
|
hidden_states = residual + hidden_states |
|
|
|
# Second norm and MLP |
|
residual = hidden_states |
|
hidden_states = layer["post_attention_layernorm"](hidden_states) |
|
hidden_states = layer["mlp"](hidden_states) |
|
hidden_states = residual + hidden_states |
|
|
|
# Final norm |
|
hidden_states = self.norm(hidden_states) |
|
|
|
# Compute logits |
|
logits = self.lm_head(hidden_states) |
|
|
|
return logits |
|
|
|
|
|
class HindiLLMGenerator: |
|
def __init__(self, model_path, device=None): |
|
# Set device |
|
if device is None: |
|
self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu") |
|
else: |
|
self.device = torch.device(device) |
|
|
|
print(f"Using device: {self.device}") |
|
|
|
# Load tokenizer |
|
tokenizer_path = os.path.join(model_path, "tokenizer.model") |
|
self.tokenizer = SentencePieceTokenizerWrapper(tokenizer_path) |
|
|
|
# Load model config |
|
config_path = os.path.join(model_path, "config.json") |
|
import json |
|
with open(config_path, 'r') as f: |
|
config_dict = json.load(f) |
|
|
|
self.config = ConvaiCausalLMConfig(**config_dict) |
|
|
|
# Load model - try safetensors first, fall back to PyTorch bin if needed |
|
safetensors_path = os.path.join(model_path, "model.safetensors") |
|
pytorch_path = os.path.join(model_path, "pytorch_model.bin") |
|
|
|
self.model = ConvaiCausalLM(self.config) |
|
|
|
        # Check which format is available and load accordingly

        if os.path.exists(safetensors_path):

            print("Loading model weights from SafeTensors")

            state_dict = load_file(safetensors_path, device="cpu")

            self.model.load_state_dict(state_dict)

        elif os.path.exists(pytorch_path):

            print("Loading model weights from PyTorch bin")

            self.model.load_state_dict(torch.load(pytorch_path, map_location="cpu"))

        else:

            raise FileNotFoundError(f"No model weights found in {model_path}")
|
|
|
# Move model to device and set to evaluation mode |
|
self.model.to(self.device) |
|
self.model.eval() |
|
|
|
def generate(self, prompt, max_length=100, temperature=0.8, top_k=50, top_p=0.9, |
|
repetition_penalty=1.1, do_sample=True): |
|
# Tokenize the prompt |
|
input_ids = self.tokenizer.sp_model.EncodeAsIds(prompt) |
|
input_tensor = torch.tensor([input_ids], dtype=torch.long).to(self.device) |
|
|
|
# Start with the input tensor |
|
output_sequence = input_tensor.clone() |
|
|
|
# Generate tokens one by one |
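        # Note: the sequence is never truncated to max_position_embeddings (512); keep
        # max_length (prompt + generated tokens) within that limit, since the RoPE and
        # causal-mask buffers only cover 512 positions.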
|
for _ in range(max_length - len(input_ids)): |
|
with torch.no_grad(): |
|
# Get the model's output for the current sequence |
|
outputs = self.model(output_sequence) |
|
next_token_logits = outputs[0, -1, :] |
|
|
|
# Apply temperature |
|
if temperature > 0: |
|
next_token_logits = next_token_logits / temperature |
|
|
|
                # Apply repetition penalty (dampen tokens already present in the output)

                if repetition_penalty > 1.0:

                    for token_id in set(output_sequence[0].tolist()):

                        # Divide positive logits and multiply negative ones, so repeated
                        # tokens always become less likely
                        if next_token_logits[token_id] > 0:
                            next_token_logits[token_id] /= repetition_penalty
                        else:
                            next_token_logits[token_id] *= repetition_penalty
|
|
|
# Filter with top-k sampling |
|
if top_k > 0: |
|
top_k_values, top_k_indices = torch.topk(next_token_logits, top_k) |
|
next_token_logits = torch.full_like(next_token_logits, float('-inf')) |
|
next_token_logits.scatter_(0, top_k_indices, top_k_values) |
|
|
|
# Filter with top-p/nucleus sampling |
|
if top_p < 1.0 and do_sample: |
|
sorted_logits, sorted_indices = torch.sort(next_token_logits, descending=True) |
|
cumulative_probs = torch.cumsum(torch.softmax(sorted_logits, dim=-1), dim=-1) |
|
|
|
# Remove tokens with cumulative probability above the threshold |
|
sorted_indices_to_remove = cumulative_probs > top_p |
|
# Shift the indices to the right to keep the first token above the threshold |
|
sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone() |
|
sorted_indices_to_remove[..., 0] = 0 |
|
|
|
indices_to_remove = sorted_indices[sorted_indices_to_remove] |
|
next_token_logits[indices_to_remove] = float('-inf') |
|
|
|
# Sample or choose the next token |
|
if do_sample: |
|
probs = torch.softmax(next_token_logits, dim=-1) |
|
next_token = torch.multinomial(probs, num_samples=1) |
|
else: |
|
next_token = torch.argmax(next_token_logits, dim=-1).unsqueeze(0) |
|
|
|
# Add the next token to the sequence |
|
output_sequence = torch.cat([output_sequence, next_token.unsqueeze(0)], dim=1) |
|
|
|
# Check if we've generated an end token |
|
if next_token.item() == self.tokenizer.eos_token_id: |
|
break |
|
|
|
# Decode the generated sequence |
|
generated_ids = output_sequence[0].tolist() |
|
generated_text = self.tokenizer.sp_model.DecodeIds(generated_ids) |
|
|
|
return generated_text |
|
|
|
# Example usage |
|
if __name__ == "__main__": |
|
generator = HindiLLMGenerator("path/to/model") |
|
result = generator.generate("भारत एक विशाल देश है") |
|
print(result) |
|
``` |
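
Generation can be tuned through the sampling parameters exposed by `generate` above, for example:

```python
# Assumes the HindiLLMGenerator class defined above; the model path is a placeholder.
generator = HindiLLMGenerator("path/to/model")
result = generator.generate(
    "हिंदी साहित्य की प्रमुख विशेषताएं",  # "Key features of Hindi literature"
    max_length=200,
    temperature=0.7,         # lower temperature -> more conservative continuations
    top_p=0.85,              # tighter nucleus sampling
    repetition_penalty=1.2,  # stronger penalty against repeated tokens
)
print(result)
```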
|
|
|
## Example Prompts |
|
|
|
Try the model with these example prompts (in English: "India is a vast country", "Tell me a story in Hindi", "Today's weather is very nice", "Key features of Hindi literature"):
|
|
|
``` |
|
भारत एक विशाल देश है |
|
मुझे हिंदी में एक कहानी सुनाओ |
|
आज का मौसम बहुत अच्छा है |
|
हिंदी साहित्य की प्रमुख विशेषताएं |
|
``` |
|
|
|
## Capabilities |
|
|
|
This model can: |
|
- Generate coherent Hindi text |
|
- Continue text from a given prompt |
|
- Create stories, explanations, and other content in Hindi |
|
|
|
## Limitations |
|
|
|
- Performance varies based on the similarity of the input to the training data |
|
- May occasionally generate repetitive content for longer texts |
|
- May produce grammatically incorrect Hindi in some contexts |
|
- Has no knowledge of events beyond its training corpus |
|
|
|
## Intended Use |
|
|
|
This model is intended for Hindi language generation tasks, creative writing assistance, and as a foundation for fine-tuning on specific tasks. |
|
|
|
## Ethical Considerations |
|
|
|
Users should be aware that like all language models, this model may reproduce biases or generate problematic content in certain contexts. |
|
|