Commit 682168c (parent e633707), committed by abhi11nav

readme update and tokenizer update
README.md CHANGED
@@ -1,3 +1,113 @@
  ---
  license: mit
+ datasets:
+ - Telugu-LLM-Labs/telugu_alpaca_yahma_cleaned_filtered_romanized
+ - Telugu-LLM-Labs/telugu_teknium_GPTeacher_general_instruct_filtered_romanized
+ - maya-research/IndicVault
+
+ language:
+ - te
+
+ pipeline_tag: text-generation
  ---
+
+ # Sakhi – Instruction-Tuned Telugu Language Model
+
+ An instruction-tuned transformer model fine-tuned from `abhi11nav/sakhi-telugu-681M-pretrained-0625` on natural Telugu instructions and responses curated from three sources.
+
+ ## License
+
+ MIT
+
+ ## Language
+
+ - Telugu (`te`)
+
+ ## Pipeline Tag
+
+ - `text-generation`
+
+ ## Datasets Used
+
+ - [`Telugu-LLM-Labs/telugu_alpaca_yahma_cleaned_filtered_romanized`](https://huggingface.co/datasets/Telugu-LLM-Labs/telugu_alpaca_yahma_cleaned_filtered_romanized)
+ - [`Telugu-LLM-Labs/telugu_teknium_GPTeacher_general_instruct_filtered_romanized`](https://huggingface.co/datasets/Telugu-LLM-Labs/telugu_teknium_GPTeacher_general_instruct_filtered_romanized)
+ - [`maya-research/IndicVault`](https://huggingface.co/datasets/maya-research/IndicVault)
+
+ ---
+
+ ## Dataset
+
+ The instruction-tuning corpus was built by merging the three datasets above and removing duplicate entries, yielding 130,479 unique prompt–response pairs of natural Telugu instructions and responses.
+
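The preprocessing script itself is not part of this commit; the following is a minimal, hedged sketch of the merge-and-deduplicate step using the Hugging Face `datasets` library. The `instruction`/`response` column names are illustrative and may differ per source.

```python
# Hedged sketch of the merge + deduplication step (not the author's actual script).
# Assumes each source can be mapped to plain "instruction" / "response" text columns.
from datasets import load_dataset, concatenate_datasets

SOURCES = [
    "Telugu-LLM-Labs/telugu_alpaca_yahma_cleaned_filtered_romanized",
    "Telugu-LLM-Labs/telugu_teknium_GPTeacher_general_instruct_filtered_romanized",
    "maya-research/IndicVault",
]

parts = []
for name in SOURCES:
    ds = load_dataset(name, split="train")
    # Illustrative normalization: keep only the two text columns under common names.
    ds = ds.map(
        lambda ex: {"instruction": str(ex["instruction"]), "response": str(ex["response"])},
        remove_columns=ds.column_names,
    )
    parts.append(ds)

merged = concatenate_datasets(parts)

# Drop exact duplicate prompt–response pairs.
unique = merged.to_pandas().drop_duplicates(subset=["instruction", "response"])
print(len(unique))  # the README reports 130,479 unique pairs after this step
```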
+ ## Model Parameters
+
+ The `sakhi-telugu-681M-instruct-0625` model keeps the architecture of the pretrained base and was fine-tuned with the following configuration:
+
+ ```yaml
+ model_parameters:
+   embed_dim: 2048
+   num_heads: 8
+   ff_dim: 4096
+   chunk_length: 1024
+   num_layers: 10
+   vocab_size: 64002
+ ```
+
+ - **Embedding Dimension**: 2048
+ - **Attention Heads**: 8
+ - **Feedforward Layer Dimension**: 4096 (with SwiGLU activation)
+ - **Context Length**: 1024 tokens
+ - **Layers**: 10 transformer decoder blocks
+ - **Vocabulary Size**: 64,000 base tokens (custom Byte-Level BPE), extended to 64,002 as described below
+
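As a rough cross-check (not stated in the model card), these numbers approximately account for the 681M parameters in the model name, assuming untied input/output embeddings, a SwiGLU feed-forward block with separate gate/up/down projections, and ignoring biases and normalization weights:

```python
# Back-of-the-envelope parameter count from the config above.
# Assumptions (not stated in the card): untied input/output embeddings, SwiGLU with
# separate gate/up/down projections, biases and normalization weights ignored.
vocab, d, ff, layers = 64_002, 2_048, 4_096, 10

embeddings = vocab * d                 # token embedding matrix
lm_head = vocab * d                    # output projection, assumed untied
attn_per_layer = 4 * d * d             # Q, K, V and output projections
ffn_per_layer = 3 * d * ff             # SwiGLU: gate + up (d -> ff) plus down (ff -> d)

total = embeddings + lm_head + layers * (attn_per_layer + ffn_per_layer)
print(f"{total / 1e6:.1f}M")           # ~681.6M, consistent with the 681M model name
```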
+ Two additional special tokens, `"<|instruction|>"` and `"<|response|>"`, were added to the tokenizer vocabulary, increasing `vocab_size` by 2 (from 64,000 to 64,002). The embedding matrix and output projection layer were expanded accordingly.
+
+ The model was initialized by transferring weights from the pretrained checkpoint wherever possible. The embeddings and output projections for the two new tokens were initialized with the standard initialization to keep training stable.
+
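The expansion code is not included in this commit; below is a minimal plain-PyTorch sketch of growing a vocabulary-sized matrix while preserving the pretrained rows, with shapes taken from the config above and everything else assumed:

```python
# Hypothetical sketch of expanding a vocab-sized weight matrix from 64,000 to 64,002 rows
# while keeping the pretrained rows (the repo's actual training code is not in this commit).
import torch

old_vocab, new_vocab, embed_dim = 64_000, 64_002, 2_048

pretrained_embed = torch.nn.Embedding(old_vocab, embed_dim)  # stands in for the checkpoint weights
expanded_embed = torch.nn.Embedding(new_vocab, embed_dim)    # new rows get standard (normal) init

with torch.no_grad():
    # Transfer the pretrained rows; rows 64000 and 64001 keep their fresh initialization.
    expanded_embed.weight[:old_vocab] = pretrained_embed.weight

# The vocab-sized output projection (LM head) is expanded in the same way.
print(expanded_embed.weight.shape)  # torch.Size([64002, 2048])
```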
+ ## Training Details
+
+ - The model was fine-tuned for **~3 hours** on **4× H100 GPUs** provided by **http://runpod.io**.
+ - Mixed precision was not used.
+ - DistributedDataParallel (DDP) was used for multi-GPU training.
+
+ ```yaml
+ train_parameters:
+   batch_size: 48
+   num_epochs: 1
+   init_learning_rate: 1e-4
+   min_learning_rate: 1e-6
+   seed: 42
+   master_addr: "localhost"
+   master_port: "12355"
+   num_gpus: -1
+   log_every_n_steps: 100
+   gradient_clipping_max_norm: 3.0
+   call_torch_compile_on_model: False
+   gradient_accumulation_steps: 2
+ ```
+
+ - **Effective Batch Size**: 48 × 2 = 96 (batch size × gradient accumulation steps; see the sketch below)
+ - **Epochs**: 1
+ - **Learning Rate Schedule**: 1e-4 with cosine decay to 1e-6
+ - **Gradient Clipping**: max norm 3.0
+ - **Logging**: every 100 steps using [Weights & Biases](https://wandb.ai/)
+ - **Checkpointing**: every epoch
+
+ > 💡 Full Weights & Biases logs are linked below **(logged every 100 steps)**: [![Weights & Biases](https://img.shields.io/badge/Weights_%26_Biases-Project-blue)](https://api.wandb.ai/links/abhi11nav/g9oatq0u)
+
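The training script is not part of this commit; the loop below is a hedged, single-process sketch of how the gradient-accumulation, clipping, and cosine-decay settings above combine in plain PyTorch. The optimizer choice and the stand-in model are assumptions; the actual run wrapped the model in DistributedDataParallel across 4 GPUs.

```python
# Illustrative single-process training step using the hyperparameters above
# (the repo's real DDP training script is not included in this commit).
import torch
import torch.nn.functional as F
from torch.optim.lr_scheduler import CosineAnnealingLR

accum_steps = 2            # gradient_accumulation_steps
max_norm = 3.0             # gradient_clipping_max_norm

model = torch.nn.Linear(2048, 64_002)                               # stand-in for the Sakhi decoder
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)          # init_learning_rate; optimizer assumed
scheduler = CosineAnnealingLR(optimizer, T_max=1_000, eta_min=1e-6) # decay toward min_learning_rate

def batches():
    for _ in range(1_000):
        yield torch.randn(48, 2048), torch.randint(0, 64_002, (48,))  # batch_size = 48

for step, (x, y) in enumerate(batches(), start=1):
    loss = F.cross_entropy(model(x), y)
    (loss / accum_steps).backward()              # accumulate gradients over 2 micro-batches
    if step % accum_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
    if step % 100 == 0:                          # log_every_n_steps
        print(step, loss.item(), scheduler.get_last_lr()[0])
```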
+ ### Hardware Setup
+
+ - **GPUs**: 4 × H100
+ - **Runtime**: ~3 hours
+ - **Precision**: FP32 (no mixed precision)
+
+ ## Paths in configuration
+
+ ```yaml
+ paths:
+   tokenizer_path: "/"
+   dataset_path: "/"
+   save_dir: "/"
+ ```
+
+ > ⚠️ Paths are placeholders and should be replaced with actual paths.
tokenizer/special_tokens_map.json CHANGED
@@ -1,4 +1,20 @@
  {
+ "additional_special_tokens": [
+   {
+     "content": "<|instruction|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   {
+     "content": "<|response|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ ],
  "eos_token": {
  "content": "<|eos|>",
  "lstrip": false,
tokenizer/tokenizer.json CHANGED
@@ -29,6 +29,24 @@
  "rstrip": false,
  "normalized": false,
  "special": true
+ },
+ {
+   "id": 64000,
+   "content": "<|instruction|>",
+   "single_word": false,
+   "lstrip": false,
+   "rstrip": false,
+   "normalized": false,
+   "special": true
+ },
+ {
+   "id": 64001,
+   "content": "<|response|>",
+   "single_word": false,
+   "lstrip": false,
+   "rstrip": false,
+   "normalized": false,
+   "special": true
  }
  ],
  "normalizer": null,
tokenizer/tokenizer_config.json CHANGED
@@ -23,8 +23,28 @@
  "rstrip": false,
  "single_word": false,
  "special": true
+ },
+ "64000": {
+   "content": "<|instruction|>",
+   "lstrip": false,
+   "normalized": false,
+   "rstrip": false,
+   "single_word": false,
+   "special": true
+ },
+ "64001": {
+   "content": "<|response|>",
+   "lstrip": false,
+   "normalized": false,
+   "rstrip": false,
+   "single_word": false,
+   "special": true
  }
  },
+ "additional_special_tokens": [
+   "<|instruction|>",
+   "<|response|>"
+ ],
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|eos|>",
  "extra_special_tokens": {},