# LLAMA3.2 Nepali 318M Model

## Overview

This is a 318M-parameter LLAMA3.2 model fine-tuned on a Nepali text dataset. The model is designed to generate coherent and contextually relevant Nepali text.

## Resources

- **Training Code:** [GitHub Repository](https://github.com/Aananda-giri/LLAMA3-Nepali)
- **Chat Interface:** [Hugging Face Space](https://huggingface.co/spaces/Aananda-giri/LLAMA3_Nepali_318M)
- **Datasets:** [IRIISNEPAL/Nepali-Text-Corpus](https://huggingface.co/datasets/IRIISNEPAL/Nepali-Text-Corpus) and [NepBERTa](https://nepberta.github.io/)
- **Reference Book:** *[Build a Large Language Model (From Scratch)](https://www.manning.com/books/build-a-large-language-model-from-scratch)* by Sebastian Raschka, PhD

## Installation

To install the required dependencies, run:

```sh
pip install datasets huggingface_hub matplotlib transformers torch --quiet
```

## Usage

### 1. Download Model Weights

```python
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="Aananda-giri/LLAMA3-Nepali",
    filename="parameters_300m/model_pg_398000_steps.pth",
    local_dir="./",
)
```

### 2. Load the Tokenizer

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("Aananda-giri/LLAMA3-Nepali")
tokenizer.save_pretrained("NepaliBPE")  # save locally so the custom Tokenizer class can load it
```

### 3. Download Additional Scripts

```python
import requests

url = "https://raw.githubusercontent.com/Aananda-giri/LLAMA3-Nepali/main/4.%20inference/2_inference/previous_chapters.py"
res = requests.get(url)
res.raise_for_status()
with open("previous_chapters.py", "w", encoding="utf-8") as f:
    f.write(res.text)
```

### 4. Load the Model

```python
import torch
from previous_chapters import Llama3Model, ChatFormat, Tokenizer, generate_and_print_sample

# Initialize the tokenizer saved in step 2
_tokenizer = Tokenizer("NepaliBPE/tokenizer.json")
chat_tokenizer = ChatFormat(_tokenizer)

# Define the model configuration
LLAMA32_CONFIG = {
    "vocab_size": 50006,       # vocabulary size
    "context_length": 512,     # context length used during training
    "emb_dim": 1320,           # embedding dimension
    "n_heads": 20,             # number of attention heads
    "n_layers": 10,            # number of transformer layers
    "hidden_dim": 5280,        # feed-forward hidden dimension
    "n_kv_groups": 5,          # key-value groups for grouped-query attention
    "rope_base": 500_000.0,    # RoPE theta base
    "dtype": torch.bfloat16,   # lower-precision dtype to reduce memory usage
    "rope_freq": {             # RoPE frequency scaling
        "factor": 32.0,
        "low_freq_factor": 1.0,
        "high_freq_factor": 4.0,
        "original_context_length": 8192,
    },
}

# Rescale the RoPE base for the shorter context length
old_context_length = 131_072
new_context_length = LLAMA32_CONFIG["context_length"]
LLAMA32_CONFIG["rope_base"] *= new_context_length / old_context_length

# Instantiate the model
model = Llama3Model(LLAMA32_CONFIG)
model.eval()

# Compile the model if PyTorch 2.0 or newer is available
if int(torch.__version__.split(".")[0]) >= 2:
    model = torch.compile(model)
```

### 5. Load Model Weights

```python
# Move the model to the available device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
print(f"device: {device}")

# Load the checkpoint downloaded in step 1
latest_model_checkpoint = "parameters_300m/model_pg_398000_steps.pth"
checkpoint = torch.load(latest_model_checkpoint, map_location=device, weights_only=False)
model.load_state_dict(checkpoint["model_state_dict"])
```
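After the state dict is loaded, a quick sanity check helps confirm everything is wired up before generating text. The snippet below is a minimal sketch, not part of the repository's scripts: it counts the model's parameters (which should land near 318M) and runs a single forward pass on an encoded prompt, assuming the custom `Tokenizer` class exposes an `encode` method that returns a list of token IDs, as in the reference book's code.

```python
# Minimal sanity check (a sketch; assumes Tokenizer.encode() returns a list of token IDs)
total_params = sum(p.numel() for p in model.parameters())
print(f"total parameters: {total_params:,}")  # expected to be roughly 318M

token_ids = _tokenizer.encode("रामले भात")  # encode a short Nepali prompt
input_batch = torch.tensor([token_ids], device=device)
with torch.no_grad():
    logits = model(input_batch)  # single forward pass
print(f"logits shape: {tuple(logits.shape)}")  # (1, sequence_length, vocab_size)
```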
### 6. Generate Text

```python
# Generate a text sample
generate_and_print_sample(
    PROMPT="रामले भात",
    tokenizer=_tokenizer,
    chat_tokenizer=chat_tokenizer,
    model=model,
    device=device,
    context_length=LLAMA32_CONFIG["context_length"],
)
```

#### Advanced Text Generation

```python
from previous_chapters import generate_chat_optimized
import time

start_time = time.time()
output_text = generate_chat_optimized(
    prompt="रामले भात",
    tokenizer=_tokenizer,
    chat_tokenizer=chat_tokenizer,
    model=model,
    max_new_tokens=20,
    context_size=512,
    device=device,
    temperature=0.3,         # lower values make the output more deterministic
    top_k=5,                 # sample only from the 5 most likely tokens
    top_p=None,              # nucleus sampling disabled
    eos_id=None,
    repetition_penalty=1.2,  # discourage repeated tokens
    penalize_len_below=10,   # penalize generations shorter than 10 tokens
    batch_size=1,
)
print(f"time: {time.time() - start_time}\noutput_text: {output_text}")
```

## Model Checkpoints

The best-performing checkpoint is **parameters_300m/model_pg_398000_steps.pth**. The other folders in the repository contain experimental checkpoints from various training runs.
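To compare against one of the experimental checkpoints, the same download-and-load pattern from steps 1 and 5 applies. The sketch below uses a hypothetical placeholder filename; browse the repository's file listing for the checkpoint paths that actually exist.

```python
from huggingface_hub import hf_hub_download
import torch

# NOTE: this filename is a hypothetical placeholder; check the repository's
# files for the experimental checkpoints that actually exist.
checkpoint_path = hf_hub_download(
    repo_id="Aananda-giri/LLAMA3-Nepali",
    filename="parameters_300m/model_pg_100000_steps.pth",
    local_dir="./",
)
checkpoint = torch.load(checkpoint_path, map_location=device, weights_only=False)
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()
```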