metadata
license: mit
datasets:
  - Telugu-LLM-Labs/telugu_alpaca_yahma_cleaned_filtered_romanized
  - Telugu-LLM-Labs/telugu_teknium_GPTeacher_general_instruct_filtered_romanized
  - maya-research/IndicVault
language:
  - te
pipeline_tag: text-generation

Sakhi – Instruction-Tuned Telugu Language Model

Sakhi is an instruction-tuned transformer model fine-tuned from the pretrained checkpoint abhi11nav/sakhi-telugu-681M-pretrained-0625. It was trained on natural Telugu instructions and responses curated from three sources.
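
A minimal generation sketch is shown below. It assumes the checkpoint can be loaded through Hugging Face transformers with trust_remote_code; the exact layout of the <|instruction|> / <|response|> prompt (the special tokens are described under Model Parameters) is an assumption, not a documented format.

# Minimal generation sketch (assumes the repo loads with transformers;
# the custom architecture may instead require its own loading code, and
# the prompt layout around the special tokens is an assumption).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "abhi11nav/sakhi-telugu-681M-instruct-0625"
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)
model.eval()

# Telugu instruction: "Write two sentences about the Telugu language."
prompt = "<|instruction|>\nతెలుగు భాష గురించి రెండు వాక్యాలు రాయండి.\n<|response|>\n"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))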

License

MIT

Language

  • Telugu (te)

Pipeline Tag

  • text-generation

Datasets Used

  • Telugu-LLM-Labs/telugu_alpaca_yahma_cleaned_filtered_romanized
  • Telugu-LLM-Labs/telugu_teknium_GPTeacher_general_instruct_filtered_romanized
  • maya-research/IndicVault

Dataset

The instruction-tuning corpus was constructed by merging the three datasets listed above and removing duplicate entries. The final dataset consists of 130,479 unique prompt–response pairs, prepared to ensure data quality, linguistic relevance, and uniqueness.
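
A rough sketch of the merge-and-deduplicate step using the datasets library is shown below. It assumes the three sources have first been normalized to a shared schema; the column names "instruction" and "response" are illustrative assumptions, not the actual schema.

# Sketch: merge the three source datasets and drop duplicate prompt–response
# pairs. Column names ("instruction", "response") are assumed for illustration.
from datasets import Dataset, concatenate_datasets, load_dataset

sources = [
    "Telugu-LLM-Labs/telugu_alpaca_yahma_cleaned_filtered_romanized",
    "Telugu-LLM-Labs/telugu_teknium_GPTeacher_general_instruct_filtered_romanized",
    "maya-research/IndicVault",
]

# Assumes each split has already been mapped to the same column layout.
parts = [load_dataset(name, split="train") for name in sources]
merged = concatenate_datasets(parts)

df = merged.to_pandas()
df = df.drop_duplicates(subset=["instruction", "response"])
deduped = Dataset.from_pandas(df, preserve_index=False)

print(len(deduped))  # expected to be ~130k unique pairs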

Model Parameters

The sakhi-telugu-681M-instruct-0625 model was fine-tuned from the pretrained checkpoint with the following configuration:

model_parameters:
  embed_dim: 2048
  num_heads: 8
  ff_dim: 4096
  chunk_length: 1024
  num_layers: 10
  vocab_size: 64002

  • Embedding Dimension: 2048
  • Attention Heads: 8
  • Feedforward Layer Dimension: 4096 (with SwiGLU activation; see the sketch after this list)
  • Context Length: 1024 tokens
  • Layers: 10 transformer decoder blocks
  • Vocabulary Size: 64,002 (custom Byte-Level BPE; 64,000 base tokens plus the 2 special tokens described below)
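
The SwiGLU feed-forward block referenced above can be sketched as a generic PyTorch module matching the listed dimensions; this is illustrative, not the model's actual source code.

# Generic SwiGLU feed-forward block matching embed_dim=2048, ff_dim=4096.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    def __init__(self, embed_dim: int = 2048, ff_dim: int = 4096):
        super().__init__()
        self.w_gate = nn.Linear(embed_dim, ff_dim, bias=False)  # gating branch
        self.w_up = nn.Linear(embed_dim, ff_dim, bias=False)    # value branch
        self.w_down = nn.Linear(ff_dim, embed_dim, bias=False)  # project back

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: silu(x W_gate) * (x W_up), then project down to embed_dim
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

block = SwiGLUFeedForward()
out = block(torch.randn(2, 1024, 2048))  # (batch, chunk_length, embed_dim)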

Two additional special tokens — "<|instruction|>" and "<|response|>" — were added to the tokenizer vocabulary, increasing the vocab_size by 2 (from 64,000 to 64,002). Accordingly, the embedding matrix and output projection layer were expanded to accommodate these new tokens.

The model was initialized by transferring weights from the pretrained checkpoint wherever possible. For the two new tokens, both the embeddings and the output projections were initialized with the model's standard initialization scheme to ensure stable training.
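
If the tokenizer and model were wrapped in the Hugging Face interfaces, the token addition and embedding expansion described above would look roughly like this; it is a sketch under that assumption, not the project's actual code.

# Sketch: add the two special tokens and grow the embedding / output layers.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "abhi11nav/sakhi-telugu-681M-pretrained-0625"  # pretrained checkpoint
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

added = tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<|instruction|>", "<|response|>"]}
)
print(added)  # 2

# Expands both the input embedding matrix and the output head; the new rows
# are filled with the model's standard weight initialization by default.
model.resize_token_embeddings(len(tokenizer))
print(model.get_input_embeddings().num_embeddings)  # 64002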

Training Details

  • The model was fine-tuned for ~3 hours on 4× H100 GPUs provided by RunPod (http://runpod.io).
  • Mixed precision was not used.
  • DistributedDataParallel (DDP) was used for multi-GPU training.

train_parameters:
  batch_size: 48
  num_epochs: 1
  init_learning_rate: 1e-4
  min_learning_rate: 1e-6
  seed: 42
  master_addr: "localhost"
  master_port: "12355"
  num_gpus: -1
  log_every_n_steps: 100
  gradient_clipping_max_norm: 3.0
  call_torch_compile_on_model: False
  gradient_accumulation_steps: 2

  • Effective Batch Size: 96 (48 × 2 with gradient accumulation)
  • Epochs: 1 (per the num_epochs setting above)
  • Learning Rate Schedule: cosine decay from 1e-4 to 1e-6
  • Gradient Clipping: max norm 3.0
  • Logging: every 100 steps using Weights & Biases
  • Checkpointing: every epoch

💡 Full Weights & Biases logs (logged every 100 steps) will be attached.
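
The training setup above (DDP, gradient accumulation of 2, cosine decay from 1e-4 to 1e-6, clipping at max norm 3.0, FP32) can be sketched as a simplified skeleton. The choice of AdamW, the plain cross-entropy loss, and the torchrun-style launch are assumptions for illustration, not the actual training script.

# Simplified training skeleton reflecting the settings above.
# Launch with: torchrun --nproc_per_node=4 train.py
import os
import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def train(model, dataset, num_epochs=1, batch_size=48,
          init_lr=1e-4, min_lr=1e-6, accum_steps=2,
          max_norm=3.0, log_every=100):
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group("nccl")
    torch.cuda.set_device(local_rank)

    model = DDP(model.cuda(local_rank), device_ids=[local_rank])

    sampler = DistributedSampler(dataset, shuffle=True, seed=42)
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)

    optimizer = torch.optim.AdamW(model.parameters(), lr=init_lr)  # optimizer assumed
    total_steps = num_epochs * len(loader) // accum_steps
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=total_steps, eta_min=min_lr
    )

    step = 0
    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)
        optimizer.zero_grad()
        for i, batch in enumerate(loader):
            input_ids = batch["input_ids"].cuda(local_rank)
            labels = batch["labels"].cuda(local_rank)
            logits = model(input_ids)  # assumes the model returns raw logits
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
            (loss / accum_steps).backward()

            if (i + 1) % accum_steps == 0:
                torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
                optimizer.step()
                scheduler.step()
                optimizer.zero_grad()
                step += 1
                if step % log_every == 0 and dist.get_rank() == 0:
                    print(f"step {step} loss {loss.item():.4f}")  # or wandb.log(...)

        if dist.get_rank() == 0:  # checkpoint once per epoch, rank 0 only
            torch.save(model.module.state_dict(), f"checkpoint_epoch{epoch}.pt")

    dist.destroy_process_group()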

Hardware Setup

  • GPUs: 4 × H100
  • Runtime: ~3 hours
  • Precision: FP32 (no mixed precision)

Paths in configuration

paths:
  tokenizer_path: "/"
  dataset_path: "/"
  save_dir: "/"

⚠️ The paths above are placeholders and should be replaced with actual paths before training.
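
A small sketch of reading such a configuration with PyYAML; the file name config.yaml and the assumption that the paths, model_parameters, and train_parameters blocks live in one file are illustrative.

# Sketch: load the YAML configuration used in the snippets above.
# "config.yaml" is an assumed file name; replace the placeholder paths first.
import yaml

with open("config.yaml") as f:
    config = yaml.safe_load(f)

tokenizer_path = config["paths"]["tokenizer_path"]
dataset_path = config["paths"]["dataset_path"]
save_dir = config["paths"]["save_dir"]

print(config["model_parameters"]["embed_dim"])   # 2048
print(config["train_parameters"]["batch_size"])  # 48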