metadata
license: mit
datasets:
  - Telugu-LLM-Labs/telugu_alpaca_yahma_cleaned_filtered_romanized
  - Telugu-LLM-Labs/telugu_teknium_GPTeacher_general_instruct_filtered_romanized
  - maya-research/IndicVault
language:
  - te
pipeline_tag: text-generation

Sakhi – Instruction-Tuned Telugu Language Model

Sakhi is an instruction-tuned transformer model fine-tuned from the pretrained checkpoint abhi11nav/sakhi-telugu-681M-pretrained-0625. It was trained on natural Telugu instructions and responses curated from three sources.
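
A minimal generation sketch is shown below. It assumes the checkpoint can be loaded through Hugging Face transformers with trust_remote_code; the exact layout of the <|instruction|> / <|response|> prompt (the special tokens are described under Model Parameters) is an assumption, not a documented format.

# Minimal generation sketch (assumes the repo loads with transformers;
# the custom architecture may instead require its own loading code, and
# the prompt layout around the special tokens is an assumption).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "abhi11nav/sakhi-telugu-681M-instruct-0625"
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)
model.eval()

# Telugu instruction: "Write two sentences about the Telugu language."
prompt = "<|instruction|>\nతెలుగు భాష గురించి రెండు వాక్యాలు రాయండి.\n<|response|>\n"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))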

License

MIT

Language

  • Telugu (te)

Pipeline Tag

  • text-generation

Datasets Used

  • Telugu-LLM-Labs/telugu_alpaca_yahma_cleaned_filtered_romanized
  • Telugu-LLM-Labs/telugu_teknium_GPTeacher_general_instruct_filtered_romanized
  • maya-research/IndicVault

Dataset

The instruction-tuning corpus was constructed by merging the three datasets listed above and removing duplicate entries. The final dataset consists of 130,479 unique prompt–response pairs, prepared to ensure data quality, linguistic relevance, and uniqueness.
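
A rough sketch of the merge-and-deduplicate step using the datasets library is shown below. It assumes the three sources have first been normalized to a shared schema; the column names "instruction" and "response" are illustrative assumptions, not the actual schema.

# Sketch: merge the three source datasets and drop duplicate prompt–response
# pairs. Column names ("instruction", "response") are assumed for illustration.
from datasets import Dataset, concatenate_datasets, load_dataset

sources = [
    "Telugu-LLM-Labs/telugu_alpaca_yahma_cleaned_filtered_romanized",
    "Telugu-LLM-Labs/telugu_teknium_GPTeacher_general_instruct_filtered_romanized",
    "maya-research/IndicVault",
]

# Assumes each split has already been mapped to the same column layout.
parts = [load_dataset(name, split="train") for name in sources]
merged = concatenate_datasets(parts)

df = merged.to_pandas()
df = df.drop_duplicates(subset=["instruction", "response"])
deduped = Dataset.from_pandas(df, preserve_index=False)

print(len(deduped))  # expected to be ~130k unique pairs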

Model Parameters

The sakhi-telugu-681M-instruct-0625 model was fine-tuned from the pretrained checkpoint with the following configuration:

model_parameters:
  embed_dim: 2048
  num_heads: 8
  ff_dim: 4096
  chunk_length: 1024
  num_layers: 10
  vocab_size: 64002

  • Embedding Dimension: 2048
  • Attention Heads: 8
  • Feedforward Layer Dimension: 4096 (with SwiGLU activation; see the sketch after this list)
  • Context Length: 1024 tokens
  • Layers: 10 transformer decoder blocks
  • Vocabulary Size: 64,002 (custom Byte-Level BPE; 64,000 base tokens plus the 2 special tokens described below)
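
The SwiGLU feed-forward block referenced above can be sketched as a generic PyTorch module matching the listed dimensions; this is illustrative, not the model's actual source code.

# Generic SwiGLU feed-forward block matching embed_dim=2048, ff_dim=4096.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    def __init__(self, embed_dim: int = 2048, ff_dim: int = 4096):
        super().__init__()
        self.w_gate = nn.Linear(embed_dim, ff_dim, bias=False)  # gating branch
        self.w_up = nn.Linear(embed_dim, ff_dim, bias=False)    # value branch
        self.w_down = nn.Linear(ff_dim, embed_dim, bias=False)  # project back

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: silu(x W_gate) * (x W_up), then project down to embed_dim
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

block = SwiGLUFeedForward()
out = block(torch.randn(2, 1024, 2048))  # (batch, chunk_length, embed_dim)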

Two additional special tokens — "<|instruction|>" and "<|response|>" — were added to the tokenizer vocabulary, increasing the vocab_size by 2 (from 64,000 to 64,002). Accordingly, the embedding matrix and output projection layer were expanded to accommodate these new tokens.

The model was initialized by transferring weights from the pretrained checkpoint wherever possible. For the two new tokens, both the embeddings and the output projections were initialized with the model's standard initialization scheme to ensure stable training.
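
If the tokenizer and model were wrapped in the Hugging Face interfaces, the token addition and embedding expansion described above would look roughly like this; it is a sketch under that assumption, not the project's actual code.

# Sketch: add the two special tokens and grow the embedding / output layers.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "abhi11nav/sakhi-telugu-681M-pretrained-0625"  # pretrained checkpoint
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

added = tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<|instruction|>", "<|response|>"]}
)
print(added)  # 2

# Expands both the input embedding matrix and the output head; the new rows
# are filled with the model's standard weight initialization by default.
model.resize_token_embeddings(len(tokenizer))
print(model.get_input_embeddings().num_embeddings)  # 64002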

Training Details

  • The model was fine-tuned for ~3 hours on 4× H100 GPUs provided by RunPod (http://runpod.io).
  • Mixed precision was not used.
  • DistributedDataParallel (DDP) was used for multi-GPU training.

train_parameters:
  batch_size: 48
  num_epochs: 1
  init_learning_rate: 1e-4
  min_learning_rate: 1e-6
  seed: 42
  master_addr: "localhost"
  master_port: "12355"
  num_gpus: -1
  log_every_n_steps: 100
  gradient_clipping_max_norm: 3.0
  call_torch_compile_on_model: False
  gradient_accumulation_steps: 2

  • Effective Batch Size: 96 (48 × 2 with gradient accumulation)
  • Epochs: 1 (per the num_epochs setting above)
  • Learning Rate Schedule: cosine decay from 1e-4 to 1e-6
  • Gradient Clipping: max norm 3.0
  • Logging: every 100 steps using Weights & Biases
  • Checkpointing: every epoch

💡 Full Weights & Biases logs (logged every 100 steps) will be attached.
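
The training setup above (DDP, gradient accumulation of 2, cosine decay from 1e-4 to 1e-6, clipping at max norm 3.0, FP32) can be sketched as a simplified skeleton. The choice of AdamW, the plain cross-entropy loss, and the torchrun-style launch are assumptions for illustration, not the actual training script.

# Simplified training skeleton reflecting the settings above.
# Launch with: torchrun --nproc_per_node=4 train.py
import os
import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def train(model, dataset, num_epochs=1, batch_size=48,
          init_lr=1e-4, min_lr=1e-6, accum_steps=2,
          max_norm=3.0, log_every=100):
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group("nccl")
    torch.cuda.set_device(local_rank)

    model = DDP(model.cuda(local_rank), device_ids=[local_rank])

    sampler = DistributedSampler(dataset, shuffle=True, seed=42)
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)

    optimizer = torch.optim.AdamW(model.parameters(), lr=init_lr)  # optimizer assumed
    total_steps = num_epochs * len(loader) // accum_steps
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=total_steps, eta_min=min_lr
    )

    step = 0
    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)
        optimizer.zero_grad()
        for i, batch in enumerate(loader):
            input_ids = batch["input_ids"].cuda(local_rank)
            labels = batch["labels"].cuda(local_rank)
            logits = model(input_ids)  # assumes the model returns raw logits
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
            (loss / accum_steps).backward()

            if (i + 1) % accum_steps == 0:
                torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
                optimizer.step()
                scheduler.step()
                optimizer.zero_grad()
                step += 1
                if step % log_every == 0 and dist.get_rank() == 0:
                    print(f"step {step} loss {loss.item():.4f}")  # or wandb.log(...)

        if dist.get_rank() == 0:  # checkpoint once per epoch, rank 0 only
            torch.save(model.module.state_dict(), f"checkpoint_epoch{epoch}.pt")

    dist.destroy_process_group()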

Hardware Setup

  • GPUs: 4 × H100
  • Runtime: ~3 hours
  • Precision: FP32 (no mixed precision)

Paths in configuration

paths:
  tokenizer_path: "/"
  dataset_path: "/"
  save_dir: "/"

⚠️ The paths above are placeholders and should be replaced with actual paths before training.
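
A small sketch of reading such a configuration with PyYAML; the file name config.yaml and the assumption that the paths, model_parameters, and train_parameters blocks live in one file are illustrative.

# Sketch: load the YAML configuration used in the snippets above.
# "config.yaml" is an assumed file name; replace the placeholder paths first.
import yaml

with open("config.yaml") as f:
    config = yaml.safe_load(f)

tokenizer_path = config["paths"]["tokenizer_path"]
dataset_path = config["paths"]["dataset_path"]
save_dir = config["paths"]["save_dir"]

print(config["model_parameters"]["embed_dim"])   # 2048
print(config["train_parameters"]["batch_size"])  # 48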