# LLAMA3.2 Nepali 318M Model

## Overview
This is a 318M-parameter LLAMA3.2 model fine-tuned on a Nepali text dataset. It is designed to generate coherent, contextually relevant Nepali text.

## Resources
- **Training Code:** [GitHub Repository](https://github.com/Aananda-giri/LLAMA3-Nepali)
- **Chat Interface:** [Hugging Face Space](https://huggingface.co/spaces/Aananda-giri/LLAMA3_Nepali_318M)
- **Dataset:** [IRIISNEPAL/Nepali-Text-Corpus](https://huggingface.co/datasets/IRIISNEPAL/Nepali-Text-Corpus) and [nepberta](https://nepberta.github.io/)
- **Reference Book:** *[Build a Large Language Model (From Scratch)](https://www.manning.com/books/build-a-large-language-model-from-scratch)* by Sebastian Raschka, PhD

## Installation
To install the required dependencies, run:
```sh
pip install datasets huggingface_hub matplotlib transformers torch --quiet
```

## Usage
### 1. Download Model Weights
```python
from huggingface_hub import hf_hub_download

# Fetch the best-performing checkpoint into ./parameters_300m/
hf_hub_download(
    repo_id="Aananda-giri/LLAMA3-Nepali",
    filename="parameters_300m/model_pg_398000_steps.pth",
    local_dir="./",
)
```
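
The file lands relative to `local_dir`, so the loading code in step 5 can reference it by the same relative path; a quick check:
```python
import os

# Confirm the checkpoint is where step 5 expects it
print(os.path.exists("parameters_300m/model_pg_398000_steps.pth"))
```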

### 2. Load the Tokenizer
```python
from transformers import PreTrainedTokenizerFast

# Download the tokenizer from the Hub and keep a local copy;
# the inference helpers below load it from the "NepaliBPE" directory
tokenizer = PreTrainedTokenizerFast.from_pretrained("Aananda-giri/LLAMA3-Nepali")
tokenizer.save_pretrained("NepaliBPE")
```
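
To confirm the tokenizer loaded correctly, a minimal round-trip sketch (the sample sentence is arbitrary):
```python
# Encode a Nepali sentence and decode it back; the text should survive the round trip
ids = tokenizer.encode("रामले भात खायो")
print(ids)
print(tokenizer.decode(ids, skip_special_tokens=True))
```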

### 3. Download Additional Scripts
```python
import requests

# Fetch the helper module (model definition and generation utilities)
url = "https://raw.githubusercontent.com/Aananda-giri/LLAMA3-Nepali/main/4.%20inference/2_inference/previous_chapters.py"
res = requests.get(url)
res.raise_for_status()  # Fail early if the download did not succeed
with open("previous_chapters.py", "w", encoding="utf-8") as f:
    f.write(res.text)
```

### 4. Load the Model
```python
import torch
from previous_chapters import Llama3Model, ChatFormat, Tokenizer, generate_and_print_sample

# Initialize tokenizer
_tokenizer = Tokenizer("NepaliBPE/tokenizer.json")
chat_tokenizer = ChatFormat(_tokenizer)

# Define model configuration
LLAMA32_CONFIG = {
    "vocab_size": 50006,
    "context_length": 512,
    "emb_dim": 1320,
    "n_heads": 20,
    "n_layers": 10,
    "hidden_dim": 5280,
    "n_kv_groups": 5,
    "rope_base": 500_000.0,
    "dtype": torch.bfloat16,
    "rope_freq": {
        "factor": 32.0,
        "low_freq_factor": 1.0,
        "high_freq_factor": 4.0,
        "original_context_length": 8192,
    }
}

# Rescale rope_base from Llama 3.2's original context length (131,072)
# down to this model's 512-token context
old_context_length = 131_072
new_context_length = LLAMA32_CONFIG["context_length"]
LLAMA32_CONFIG["rope_base"] *= new_context_length / old_context_length

# Load Model
model = Llama3Model(LLAMA32_CONFIG)
model.eval()

# Compile the model for faster inference on PyTorch 2.0+
# (compare the major version; a plain string comparison on torch.__version__ is fragile)
if int(torch.__version__.split(".")[0]) >= 2:
    model = torch.compile(model)
```
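
As a sanity check, the parameter count of the instantiated model should land near the advertised 318M:
```python
# Count trainable parameters; expect roughly 318M for this configuration
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params:,}")
```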

### 5. Load Model Weights
```python
# Move model to device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
print(f'device: {device}')

# Load checkpoint (weights_only=False because the file stores a full
# training-state dict, not just tensors; only load checkpoints you trust)
latest_model_checkpoint = "parameters_300m/model_pg_398000_steps.pth"
checkpoint = torch.load(latest_model_checkpoint, map_location=device, weights_only=False)
model.load_state_dict(checkpoint["model_state_dict"])
```
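
The loaded object is a plain dictionary; besides `model_state_dict` it typically carries extra training state (the exact keys depend on how the checkpoint was saved), which you can inspect:
```python
# See what the checkpoint carries besides the model weights
print(list(checkpoint.keys()))
```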

### 6. Generate Text
```python
# Generate a short continuation of the Nepali prompt "रामले भात" ("Ram ... rice")
generate_and_print_sample(
    PROMPT="रामले भात",
    tokenizer=_tokenizer,
    chat_tokenizer=chat_tokenizer,
    model=model,
    device=device,
    context_length=LLAMA32_CONFIG["context_length"]
)
```

#### Advanced Text Generation
```python
from previous_chapters import generate_chat_optimized
import time

start_time = time.time()
output_text = generate_chat_optimized(
    prompt="रामले भात",
    tokenizer=_tokenizer,  # the Tokenizer wrapper from step 4, as in the basic example
    chat_tokenizer=chat_tokenizer,
    model=model,
    max_new_tokens=20,
    context_size=512,
    device=device,
    temperature=0.3,
    top_k=5,
    top_p=None,
    eos_id=None,
    repetition_penalty=1.2,
    penalize_len_below=10,
    batch_size=1
)

print(f"time:{time.time() - start_time}\n output_text: {output_text}")
```
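
The sampling knobs above interact: lower `temperature` and smaller `top_k` make output more deterministic, while `repetition_penalty` discourages loops. A quick sweep, using the same helper and arguments as the call above, makes the effect visible:
```python
# Sweep temperature with otherwise identical settings to compare outputs
for temp in (0.1, 0.7, 1.0):
    text = generate_chat_optimized(
        prompt="रामले भात",
        tokenizer=_tokenizer,
        chat_tokenizer=chat_tokenizer,
        model=model,
        max_new_tokens=20,
        context_size=512,
        device=device,
        temperature=temp,
        top_k=5,
        top_p=None,
        eos_id=None,
        repetition_penalty=1.2,
        penalize_len_below=10,
        batch_size=1
    )
    print(f"temperature={temp}: {text}")
```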


## Model Checkpoints
The best-performing checkpoint is **parameters_300m/model_pg_398000_steps.pth**. Other folders in the repository contain experimental checkpoints from various training runs.
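
To see every checkpoint available in the repository, a minimal sketch using `huggingface_hub`:
```python
from huggingface_hub import list_repo_files

# List all .pth checkpoint files in the model repo
for f in list_repo_files("Aananda-giri/LLAMA3-Nepali"):
    if f.endswith(".pth"):
        print(f)
```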