readme update and tokenizer update

- README.md: +110 -0
- tokenizer/special_tokens_map.json: +16 -0
- tokenizer/tokenizer.json: +18 -0
- tokenizer/tokenizer_config.json: +20 -0

README.md
CHANGED
@@ -1,3 +1,113 @@

---
license: mit
datasets:
- Telugu-LLM-Labs/telugu_alpaca_yahma_cleaned_filtered_romanized
- Telugu-LLM-Labs/telugu_teknium_GPTeacher_general_instruct_filtered_romanized
- maya-research/IndicVault

language:
- te

pipeline_tag: text-generation
---

# Sakhi – Instruction-Tuned Telugu Language Model

An instruction-tuned transformer model fine-tuned from `abhi11nav/sakhi-telugu-681M-pretrained-0625`. It was trained on natural Telugu instructions and responses curated from three sources.

## License

MIT

## Language

- Telugu (`te`)

## Pipeline Tag

- `text-generation`

## Datasets Used

- [`Telugu-LLM-Labs/telugu_alpaca_yahma_cleaned_filtered_romanized`](https://huggingface.co/datasets/Telugu-LLM-Labs/telugu_alpaca_yahma_cleaned_filtered_romanized)
- [`Telugu-LLM-Labs/telugu_teknium_GPTeacher_general_instruct_filtered_romanized`](https://huggingface.co/datasets/Telugu-LLM-Labs/telugu_teknium_GPTeacher_general_instruct_filtered_romanized)
- [`maya-research/IndicVault`](https://huggingface.co/datasets/maya-research/IndicVault)

---

## Dataset

The instruction-tuning corpus was constructed by merging three datasets, followed by the removal of duplicate entries. The final dataset consists of 130,479 unique prompt–response pairs. This training corpus was prepared to ensure high data quality, linguistic relevance, and uniqueness.
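
The merging and de-duplication code is not included in this commit; the sketch below shows one way such a corpus could be assembled with the Hugging Face `datasets` library. The column names (`instruction`, `output`), the `train` split, and the normalization to a two-column schema are assumptions for illustration only.

```python
from datasets import load_dataset, concatenate_datasets

SOURCES = [
    "Telugu-LLM-Labs/telugu_alpaca_yahma_cleaned_filtered_romanized",
    "Telugu-LLM-Labs/telugu_teknium_GPTeacher_general_instruct_filtered_romanized",
    "maya-research/IndicVault",
]

def to_pairs(name: str):
    """Normalize one source to two string columns: prompt and response (illustrative schema)."""
    ds = load_dataset(name, split="train")
    cols = ds.column_names
    prompt_col = "instruction" if "instruction" in cols else cols[0]
    response_col = "output" if "output" in cols else cols[-1]
    return ds.map(
        lambda ex: {"prompt": str(ex[prompt_col]), "response": str(ex[response_col])},
        remove_columns=cols,
    )

merged = concatenate_datasets([to_pairs(name) for name in SOURCES])

# Keep only the first occurrence of each exact prompt–response pair.
seen = set()
def first_occurrence(ex):
    key = (ex["prompt"], ex["response"])
    if key in seen:
        return False
    seen.add(key)
    return True

unique = merged.filter(first_occurrence)
print(f"{len(unique):,} unique pairs")  # the model card reports 130,479
```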

## Model Parameters

The `sakhi-telugu-681M-instruct-0625` model was trained with the following configuration:

```yaml
model_parameters:
  embed_dim: 2048
  num_heads: 8
  ff_dim: 4096
  chunk_length: 1024
  num_layers: 10
  vocab_size: 64002
```

- **Embedding Dimension**: 2048
- **Attention Heads**: 8
- **Feedforward Layer Dimension**: 4096 (with SwiGLU activation)
- **Context Length**: 1024 tokens
- **Layers**: 10 transformer decoder blocks
- **Vocabulary Size**: 64,000 (custom Byte-Level BPE)
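
These dimensions are consistent with the "681M" in the model name. As a rough, unofficial sanity check against the `model_parameters` block above (assuming untied input and output embeddings, a three-projection SwiGLU feed-forward block, and ignoring biases and normalization weights):

```python
# Back-of-envelope parameter count from the model_parameters block above.
embed_dim, ff_dim, num_layers, vocab_size = 2048, 4096, 10, 64002

token_embeddings = vocab_size * embed_dim            # ~131.1M
output_projection = embed_dim * vocab_size           # ~131.1M (assuming untied weights)
attention = num_layers * 4 * embed_dim * embed_dim   # Q, K, V, O projections: ~167.8M
feed_forward = num_layers * 3 * embed_dim * ff_dim   # SwiGLU uses three weight matrices: ~251.7M

total = token_embeddings + output_projection + attention + feed_forward
print(f"{total / 1e6:.0f}M parameters")  # ~682M, in line with the advertised 681M
```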

Two additional special tokens — `"<|instruction|>"` and `"<|response|>"` — were added to the tokenizer vocabulary, increasing the vocab_size by 2 (from 64,000 to 64,002). Accordingly, the embedding matrix and output projection layer were expanded to accommodate these new tokens.

The model was initialized by transferring weights from the pretrained checkpoint wherever possible. For the two new tokens, both the embeddings and the output projections were initialized with a standard random initialization to ensure stable training.
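
The resize-and-initialize step itself is not part of this commit. The following is a minimal PyTorch sketch of the procedure described above; the `std=0.02` normal initialization is an assumed choice for the unspecified "standard" method, and the tensor layout is illustrative rather than the actual checkpoint format.

```python
import torch
import torch.nn as nn

# Illustrative dimensions matching the model card; the real checkpoint layout may differ.
old_vocab, new_vocab, embed_dim = 64000, 64002, 2048

# Stand-ins for weights loaded from the pretrained checkpoint.
old_embed = nn.Embedding(old_vocab, embed_dim)
old_head = nn.Linear(embed_dim, old_vocab, bias=False)

# Expanded embedding matrix and output projection covering the two new special tokens.
new_embed = nn.Embedding(new_vocab, embed_dim)
new_head = nn.Linear(embed_dim, new_vocab, bias=False)

with torch.no_grad():
    # Transfer the pretrained rows, then give the two new rows a small-variance normal init.
    new_embed.weight[:old_vocab] = old_embed.weight
    new_head.weight[:old_vocab] = old_head.weight
    nn.init.normal_(new_embed.weight[old_vocab:], mean=0.0, std=0.02)
    nn.init.normal_(new_head.weight[old_vocab:], mean=0.0, std=0.02)
```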

## Training Details

- The model was fine-tuned for **~3 hours** on **4× H100 GPUs** provided by **[runpod.io](http://runpod.io)**.
- Mixed precision was not used.
- DistributedDataParallel (DDP) was used for multi-GPU training.

```yaml
train_parameters:
  batch_size: 48
  num_epochs: 1
  init_learning_rate: 1e-4
  min_learning_rate: 1e-6
  seed: 42
  master_addr: "localhost"
  master_port: "12355"
  num_gpus: -1
  log_every_n_steps: 100
  gradient_clipping_max_norm: 3.0
  call_torch_compile_on_model: False
  gradient_accumulation_steps: 2
```

- **Effective Batch Size**: 48 × 2 (with gradient accumulation)
- **Epochs**: 3
- **Learning Rate Schedule**: 1e-4, cosine decay to 1e-6
- **Gradient Clipping**: 3.0
- **Logging**: Every 100 steps using [Weights & Biases](https://wandb.ai/)
- **Checkpointing**: Every epoch
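
The training script is not included in this commit. Below is a compressed, illustrative PyTorch sketch of how these settings (cosine decay from `init_learning_rate` to `min_learning_rate`, gradient accumulation of 2, gradient clipping at max-norm 3.0, logging every 100 optimizer steps) typically fit together; the optimizer choice, the model interface, and the data loader are placeholders, and the DDP wrapping is omitted.

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

def train_one_epoch(model, loader, total_optimizer_steps):
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)                  # init_learning_rate (optimizer is assumed)
    sched = CosineAnnealingLR(opt, T_max=total_optimizer_steps, eta_min=1e-6)  # min_learning_rate
    accum, max_norm = 2, 3.0                                              # gradient_accumulation_steps, clipping

    model.train()
    opt.zero_grad()
    for step, batch in enumerate(loader, start=1):
        loss = model(**batch).loss / accum        # placeholder: any model returning a .loss
        loss.backward()
        if step % accum == 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
            opt.step()
            sched.step()
            opt.zero_grad()
            if (step // accum) % 100 == 0:        # log_every_n_steps
                lr = sched.get_last_lr()[0]
                print(f"optimizer step {step // accum}: lr={lr:.2e}, loss={loss.item() * accum:.4f}")
```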

> 💡 Full Weights & Biases logs **(step × 100)** are available at [https://api.wandb.ai/links/abhi11nav/g9oatq0u](https://api.wandb.ai/links/abhi11nav/g9oatq0u)

### Hardware Setup

- **GPUs**: 4 × H100
- **Runtime**: ~3 hours
- **Precision**: FP32 (no mixed precision)

## Paths in configuration

```yaml
paths:
  tokenizer_path: "/"
  dataset_path: "/"
  save_dir: "/"
```

> ⚠️ Paths are placeholders — these should be replaced with actual paths.

tokenizer/special_tokens_map.json
CHANGED
@@ -1,4 +1,20 @@
 {
+  "additional_special_tokens": [
+    {
+      "content": "<|instruction|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "<|response|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false
+    }
+  ],
   "eos_token": {
     "content": "<|eos|>",
     "lstrip": false,

tokenizer/tokenizer.json
CHANGED
@@ -29,6 +29,24 @@
       "rstrip": false,
       "normalized": false,
       "special": true
+    },
+    {
+      "id": 64000,
+      "content": "<|instruction|>",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": true
+    },
+    {
+      "id": 64001,
+      "content": "<|response|>",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": true
     }
   ],
   "normalizer": null,

tokenizer/tokenizer_config.json
CHANGED
@@ -23,8 +23,28 @@
       "rstrip": false,
       "single_word": false,
       "special": true
+    },
+    "64000": {
+      "content": "<|instruction|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "64001": {
+      "content": "<|response|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
     }
   },
+  "additional_special_tokens": [
+    "<|instruction|>",
+    "<|response|>"
+  ],
   "clean_up_tokenization_spaces": false,
   "eos_token": "<|eos|>",
   "extra_special_tokens": {},
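
Taken together, these three files register `<|instruction|>` (id 64000) and `<|response|>` (id 64001) as additional special tokens. A minimal sketch of loading the updated tokenizer and formatting a prompt with them, assuming the `tokenizer/` subfolder layout shown above, the standard `transformers` `AutoTokenizer` API, and a hypothetical prompt template (the actual template used for fine-tuning is not documented in this commit):

```python
from transformers import AutoTokenizer

# Repo id assumed from the model card name; the tokenizer files above live under "tokenizer/".
tokenizer = AutoTokenizer.from_pretrained(
    "abhi11nav/sakhi-telugu-681M-instruct-0625", subfolder="tokenizer"
)

# The ids should match the entries added to tokenizer.json in this commit.
print(tokenizer.convert_tokens_to_ids(["<|instruction|>", "<|response|>"]))  # expected: [64000, 64001]

# Hypothetical instruction/response prompt format built from the new tokens.
prompt = "<|instruction|>తెలుగులో ఒక చిన్న కథ రాయండి.<|response|>"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
print(input_ids.shape)
```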