# Introduction
## Gemma Fine-tuning for Interesting Telugu News Headline Generation

In this notebook, we'll fine-tune the Gemma 2B model on Telugu news articles to generate engaging, interesting, and factual headlines.

### Table of Contents:
 
1. Dataset Info<br>
2. Package Installation and Importing<br>
3. Data Loading <br>
4. Data Preprocessing for Training<br>
5. Loading the Gemma Model<br>
6. Article & Headline Results Before Finetuning<br>
7. Applying Gemma LoRA<br>
8. Training Gemma<br>
9. Article & Headline Results After Finetuning<br>
10. Conclusion<br>

### Dataset Used
- Telugu news articles (`saidines12/telugu_news_dataset` on the Hugging Face Hub): Each record holds a `story_id`, a `headline`, and the full `article` text in Telugu. The training split has roughly 83k examples and the validation split roughly 10k.

---

# 1. Telugu News Articles Dataset

**To be added**

### Inputs and Outputs

- **Input**: Gemma models take in text strings, which can range from questions and prompts to longer documents that require summarization.
- **Output**: In response, they generate text in English, offering answers, summaries, or other forms of text-based output, tailored to the input provided.


# 2. Package Installation and Importing

Before we start, it's essential to install all necessary packages, including Gemma itself. This part will cover the installation process step by step.

In [4]:
import os

os.environ['CUDA_VISIBLE_DEVICES'] = '1,2,3,4,5'
# Install or upgrade peft, evaluate, transformers, accelerate, and bitsandbytes quietly (no versions are pinned).
# %pip install -q -U peft evaluate transformers accelerate bitsandbytes
#!pip3 install torch==2.0.1
# Upgrade and quietly install the latest versions of the trl and datasets packages.
#%pip install -U -q trl datasets
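
Before running the install commands, it can help to see which of these packages are already present. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Packages named in the install commands of this notebook.
required = ["peft", "evaluate", "transformers", "accelerate", "bitsandbytes", "trl", "datasets"]
for name, ver in installed_versions(required).items():
    print(f"{name}: {ver or 'not installed'}")
```

Any package reported as not installed can then be added with the `%pip install` lines above.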

### Package Description

#### Basic Python and numerical packages
- `os`: Provides ways to interact with the operating system and its environment variables.
- `torch`: PyTorch library for deep learning applications.
- `numpy`: Essential library for linear algebra and mathematical operations.
- `pandas`: Powerful data processing tool, ideal for handling CSV files and other forms of structured data.

#### transformers module
- `AutoTokenizer`: Used to automatically load a pre-trained tokenizer.
- `AutoModelForCausalLM`: Used to automatically load pre-trained models for causal language modeling.
- `BitsAndBytesConfig`: Configuration class for quantized model loading (for example 4-bit or 8-bit) through the `bitsandbytes` backend.
- `AutoConfig`: Used to automatically load the model's configuration.
- `TrainingArguments`: Defines arguments for training setup.

#### datasets module
- `Dataset`: A class for handling datasets.

#### peft module
- `LoraConfig`: A configuration class for the LoRA adapter (rank, target modules, task type).
- `PeftModel`: A class that defines the PEFT model.
- `prepare_model_for_kbit_training`: A function that prepares a model for k-bit training.
- `get_peft_model`: Function to get the PEFT model.

#### trl module
- `SFTTrainer`: Trainer class for SFT (Supervised Fine-Tuning) training.

#### IPython.display module
- `Markdown`: Used to output text in Markdown format.
- `display`: Used to display objects in Jupyter notebooks.

In [2]:
# !python -m pip uninstall torch torchvision torchaudio
# !python -m pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121

In [3]:
import os

import torch

import numpy as np
import pandas as pd

from transformers import (AutoTokenizer, 
                          AutoModelForCausalLM, 
                          BitsAndBytesConfig, 
                          AutoConfig,
                          TrainingArguments)

from datasets import Dataset
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model
from trl import SFTTrainer
from IPython.display import Markdown, display


In [4]:
# Disable CA bundle check. Useful in certain environments where you may encounter SSL errors.
os.environ['CURL_CA_BUNDLE'] = ''

# Set the order of devices as seen by CUDA to PCI bus ID order. This is to ensure consistency in device selection.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"

# Report whether CUDA is available. The visible GPUs were already selected earlier through CUDA_VISIBLE_DEVICES.
if torch.cuda.is_available():
    print("CUDA is available")
else:
    print("CUDA is not available")

CUDA is available


Weights & Biases (wandb) is a tool for tracking and visualizing machine learning experiments. It helps you manage metrics, hyperparameters, experiment code, and model artifacts during model training.<br>
<a href="https://github.com/wandb/wandb">wandb github</a>

In [5]:
# Wandb for experiment tracking
import wandb

# Initialize Weights & Biases (wandb) for experiment tracking.
# If a wandb account exists, it can typically be used by specifying project and entity.
# However, for this example, we're disabling wandb to ignore it by setting mode to "disabled".
wandb.init(mode="disabled")

# 3. Data Loading

Loading your data is the first step in the machine learning pipeline. This section will guide you through loading your dataset into the Jupyter notebook environment.

## Why this Dataset?
We chose Telugu news for this challenge because, despite Telugu being one of the oldest languages, very little of it has been digitized. Telugu newspapers, which often lean slightly toward local politicians, offer a large amount of current-affairs text written in rich language.

Telugu newspapers stand out for catchy headlines that make readers rush to read the full story. While English headlines tend to be direct, Telugu news headlines (mostly in the Eenadu paper) add drama through clever wordplay. These headlines aren't just clickbait: they reflect deep cultural understanding and creative expression unique to Telugu media. Local reporters developed this art form over decades, turning daily news into memorable stories.

Here are some example headlines whose cleverness makes readers eager to read the story:


#### "గచ్చిబౌలి జంక్షన్ లో లారీ డ్రైవర్ గారి ఐటం షో"
#### "ట్రాఫిక్ ఎస్.ఐ గారు బిపి తో హాస్పిటల్ లో"
#### "బిర్యాని ప్యాకెట్ లో దొంగల గ్యాంగ్! బావార్చి గారి ఇన్వెస్టిగేషన్"
#### "కస్టమర్ కి పరిగెట్టు పరుగే"
#### "సిసిటివిలో కనపడ్డ మిస్టరీ గాంగ్ లీడర్"
#### "నిద్రపోయే ఎమ్.ఎల్.ఏ గారికి షాక్ ఇచిన కొత్త స్కీమ్"
#### "అసెంబ్లీ లో కుర్చి మీద స్నోరింగ్ సౌండ్స్"
#### "ఆపోజిషన్ లీడర్ గారి ఫోన్ తో వీడియో వైరల్"
#### "క్రికెట్ మ్యాచ్ లో జరిగిన రొమాన్స్"

Now, let's load the curated Telugu news articles dataset from Hugging Face.

In [6]:
# load dataset from huggingface called saidines12/telugu_news_dataset
from datasets import load_dataset

dataset = load_dataset('saidines12/telugu_news_dataset',
                        trust_remote_code=True
                    )
dataset['validation'][10]

{'story_id': 212635,
 'headline': 'ఆంజనేయస్వామి ఆలయాన్ని ఢీకొట్టిన లారీ',
 'article': 'ప్రకాశం జిల్లాలో ఆంజనేయస్వామి ఆలయాన్ని ఓ లారీ ఢీకొట్టింది . ఈ ఘటనలో ఇద్దరు దుర్మరణం చెందారు . ప్రకాశం : జిల్లాలో ఘోర రోడ్డు ప్రమాదం జరిగింది . ఆంజనేయస్వామి ఆలయాన్ని ఓ లారీ ఢీకొట్టింది . ఈ ఘటనలో ఇద్దరు దుర్మరణం చెందారు . పోలీసుల కథనం ప్రకారం… విజయవాడ నుంచి ఒంగోలుకు వెళ్తున్న లారీ మార్గంమధ్యలో మార్చి 9 శనివారం తెల్లవారుజామున అద్దంకి మండలం వెంకటాపురం గ్రామం వద్ద ఒంగోలు - విజయవాడ నేషనల్ హైవే పక్కన గల ఆంజనేయస్వామి ఆలయాన్ని ఢీకొట్టింది . దీంతో లారీ డ్రైవర్ , క్లీనర్ కు\xa0తీవ్ర గాయాలు కావడంతో అక్కడికక్కడే మృతి చెందారు . మృతదేహాలు లారీ క్యాబిన్లో ఇరుక్కుపోవడంతో స్థానికులు పోలీసుల సాయంతో బయటకు తీశారు . నిద్ర మత్తు కారణంగా ప్రమాదం జరిగి ఉండవచ్చని పోలీసులు భావిస్తున్నారు . లారీ బిహార్కు చెందినదిగా గుర్తించారు . పోస్టుమార్టం కోసం మృతదేహాలను అద్దంకి ప్రభుత్వ ఆస్పత్రికి తరలించారు . పోలీసులు కేసు నమోదు చేసుకుని దర్యాప్తు చేస్తున్నారు .'}

# 4. Data Preprocessing for Training

Before training Gemma, we need to prepare the dataset in the format the model will see. The two columns that matter are `article` and `headline`: each training example combines an article with its headline in a single prompt-style string.

Data length also matters. Gemma is a Large Language Model (LLM), and very long training sequences place heavy demands on GPU memory and slow training down. It is therefore worth measuring the length of the data and excluding or truncating unduly long examples. This section only measures the lengths; the trainer configured later caps inputs with `max_seq_length=512`.

In [7]:
question_column = "article"
answer_column = "headline"
data_column = "text"


original_data = dataset['train']

# Compute the average, shortest, and longest combined length (in characters) of article + headline in the training split
lengths = [len(x[question_column] + x[answer_column]) for x in original_data]
average_length = np.mean(lengths)
shortest_length = np.min(lengths)
longest_length = np.max(lengths)

# Print the statistics
print("Average length of 'Question and Answer' in original dataset:", average_length)
print("Shortest length of 'Question and Answer' in original dataset:", shortest_length)
print("Longest length of 'Question and Answer' in original dataset:", longest_length)

Average length of 'Question and Answer' in original dataset: 1253.3300860897145
Shortest length of 'Question and Answer' in original dataset: 46
Longest length of 'Question and Answer' in original dataset: 14611
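
One way to act on these statistics is to drop examples whose combined length exceeds a cutoff before training. The sketch below is illustrative only: the helper `filter_by_length` and the 4,000-character cutoff are assumptions, not values used in this notebook, and character counts are only a rough proxy for token counts.

```python
def filter_by_length(records, max_chars, article_key="article", headline_key="headline"):
    """Keep records whose article + headline length is at most max_chars characters."""
    return [
        rec for rec in records
        if len(rec[article_key]) + len(rec[headline_key]) <= max_chars
    ]


# Tiny stand-in for dataset['train']; real records carry Telugu text.
sample = [
    {"article": "a" * 100, "headline": "h" * 10},
    {"article": "a" * 5000, "headline": "h" * 20},
]
kept = filter_by_length(sample, max_chars=4000)
print(f"kept {len(kept)} of {len(sample)} examples")
```

With a Hugging Face `Dataset`, the same condition can be passed to its `filter` method instead of building a list.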


Inspect a sample article and its headline from the training split:

In [8]:
original_data[10]

{'story_id': 2480992,
 'headline': 'పంత్.. ఓ అద్భుతం: అక్రమ్',
 'article': 'కరాచీ: ఘోర ప్రమాదం నుంచి కోలుకుని తిరిగి అంతర్జాతీయ క్రికెట్ ఆడుతున్న భారత వికెట్ కీపర్ రిషభ్ పంత్ ఓ అద్భుతమని పాకిస్థాన్ మాజీ కెప్టెన్ వసీమ్ అక్రమ్ కొనియాడాడు. ‘రోడ్డు ప్రమాదం తర్వాత ఎవరికైనా కోలుకునేందుకు చాలా సమయం పడుతుంది. ఇక ఆటగాడికైతే మరింత కష్టంగా ఉంటుంది. కానీ పంత్ అలా కాదు. నిజంగా తను మిరాకిల్ కిడ్. అతడిని యువతరం ఆదర్శంగా తీసుకోవాల్సిందే. ఐపీఎల్, టీ20 ప్రపంచకప్లోనూ ప్రభావం చూపి ఇప్పుడు టెస్టుల్లోనూ ఆకట్టుకుంటున్నాడు. ఆసీస్తో టెస్టు సిరీస్లోనూ తను కీలకం కానున్నాడు’ అని అక్రమ్ ప్రశంసించాడు.'}

# 5. Loading the Gemma Model

Here, we'll cover how to load the Gemma model so it's ready for finetuning. This includes where to download the model from and how to load it into your notebook.

### Adding the Gemma Model
1. In the "Input" section of the right-side menu in your Kaggle notebook, click on the "+ Add Input" button.
2. Below the search bar that appears, click on the "Models" option.
3. In the search bar, type "Gemma" to find the model.
4. From the filtered results, select the Gemma model by clicking on the "+" button next to it. Make sure to choose the correct version by noting the framework as "Transformers", the variation as "2b-it", and the version as "v3".
5. After selecting the correct Gemma model, click on "Add Model" at the bottom.
6. The Gemma model, specifically "Gemma.v3", should now be listed under the "Models" subsection of the "Input" section in the right-side menu of your notebook, indicating successful addition.

**Note:** The code below loads `google/gemma-2b` from the Hugging Face Hub without quantization, as our fine-tuning cluster supports the full model.

### BitsAndBytesConfig Overview

`BitsAndBytesConfig` is a configuration class provided by the `transformers` library, which is designed for controlling the behavior of model quantization and optimization during both the training and inference phases of model deployment. Quantization is a technique used to reduce the memory footprint and computational requirements of deep learning models by representing model weights and activations in lower-precision data types, such as 8-bit integers (`int8`) or even 4-bit representations.

#### Benefits of Quantization

The primary benefits of quantization include:

- **Reduced Memory Usage**: Lower-precision representations require less memory, enabling the deployment of larger models on devices with limited memory capacity.
- **Increased Inference Speed**: Operations with lower-precision data types can be executed faster, thus speeding up the inference time.
- **Energy Efficiency**: Reduced computational requirements translate to lower energy consumption, which is crucial for mobile and embedded devices.
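
As a rough illustration of the memory savings, the weight storage of a model scales linearly with the number of bits per parameter. The sketch below is back-of-the-envelope arithmetic under stated assumptions: the parameter count of 2.5 billion is an approximate figure for Gemma 2B, and the estimate ignores activations, optimizer state, and the small overhead of quantization constants.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


# Assumed, approximate parameter count for a 2B-class model.
approx_params = 2_500_000_000
for label, bits in [("float32", 32), ("bfloat16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gib(approx_params, bits):.2f} GiB")
```

Halving the bit width halves the weight footprint, which is why 4-bit loading can make a model fit on a much smaller GPU.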

#### `BitsAndBytesConfig` Parameters

In the context of the `transformers` library, `BitsAndBytesConfig` allows users to configure the quantization behavior specifically for using the `bitsandbytes` backend. Below is an example configuration along with comments explaining each parameter:


In [9]:
# Checking for the available device (CPU or GPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Available devices print
print("device:",device)

# Defining the path to the pre-trained model
model_path = "google/gemma-2b"

# Loading the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Defining BitsAndBytesConfig
bnbConfig = BitsAndBytesConfig(
    load_in_4bit=True, # Enable loading of the model in 4-bit quantized format.
    bnb_4bit_quant_type="nf4", # Specify the quantization type. "nf4" refers to a specific 4-bit quantization scheme.
    bnb_4bit_compute_dtype=torch.bfloat16, # Define the data type for computations. bfloat16 offers a good balance between precision and speed.
)

# Set to False to load the model with the 4-bit quantization config defined above.
disable_quantization = True

if disable_quantization:
    # Load the unquantized model and let accelerate place it on the available devices.
    model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
else:
    # Load the model with 4-bit quantization (disabled on our cluster).
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        device_map="auto",
        quantization_config=bnbConfig,
    )

# Move the model to the specified computing device (CPU or GPU).
# model = model.to(device)

device: cuda



In [10]:
# Print a summary of the model to understand its architecture and the number of parameters.
model

GemmaForCausalLM(
  (model): GemmaModel(
    (embed_tokens): Embedding(256000, 2048, padding_idx=0)
    (layers): ModuleList(
      (0-17): 18 x GemmaDecoderLayer(
        (self_attn): GemmaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=256, bias=False)
          (v_proj): Linear(in_features=2048, out_features=256, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): GemmaRotaryEmbedding()
        )
        (mlp): GemmaMLP(
          (gate_proj): Linear(in_features=2048, out_features=16384, bias=False)
          (up_proj): Linear(in_features=2048, out_features=16384, bias=False)
          (down_proj): Linear(in_features=16384, out_features=2048, bias=False)
          (act_fn): PytorchGELUTanh()
        )
        (input_layernorm): GemmaRMSNorm((2048,), eps=1e-06)
        (post_attention_layernorm): GemmaRMSNorm((2048,), eps=1e-06)


### Generating Text with the Gemma Model

This code provides a simple function to generate text using the Gemma model. The Gemma model, a variant of large language models, excels in generating human-like text based on a given prompt. This function utilizes both a model and tokenizer from the Gemma architecture, formatting the output in a specific template for clarity and consistency.

In [11]:
# Prompt templates used to format each article/headline pair for the model.
# `template2` is a Telugu-only variant ("వార్తాంశం" = news item, "శీర్షిక" = headline); it is not used below.
# `template` is the English instruction prompt used for both generation and training.
template2 = "వార్తాంశం: {question}\nశీర్షిక: {answer}"
template = "Generate relative, interesting, factual short headline from this news article in telugu language\n{article}\n\nResponse:\n{response}"

#template = "Generate :\n{instruction}\n\nResponse:\n{response}"

In [12]:
def generate_response(model, tokenizer, prompt, device, max_new_tokens=128):
    """
    This function generates a response to a given prompt using a specified model and tokenizer.

    Parameters:
    - model (PreTrainedModel): The machine learning model pre-trained for text generation.
    - tokenizer (PreTrainedTokenizer): A tokenizer for converting text into a format the model understands.
    - prompt (str): The initial text prompt to generate a response for.
    - device (torch.device): The computing device (CPU or GPU) the model should use for calculations.
    - max_new_tokens (int, optional): The maximum number of new tokens to generate. Defaults to 128.

    Returns:
    - str: The text generated in response to the prompt.
    """
    # Convert the prompt into a format the model can understand using the tokenizer.
    # The result is also moved to the specified computing device.
    inputs = tokenizer(prompt, return_tensors='pt', padding=True, truncation=True).to(device)

    # Generate a response based on the tokenized prompt.
    outputs = model.generate(**inputs, num_return_sequences=1, max_new_tokens=max_new_tokens)

    # Convert the generated tokens back into readable text.
    text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    
    # Extract and return the response text. Here, it assumes the response is formatted as "Response: [generated text]".
    response_text = text.split("Response:")[1]
    
    return response_text
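
The prompt formatting and response extraction steps can be tried without loading a model. The sketch below reuses the template string from this notebook; `build_prompt` and `extract_response` are hypothetical helper names, and the decoded text is a made-up stand-in for real model output.

```python
TEMPLATE = (
    "Generate relative, interesting, factual short headline from this news article "
    "in telugu language\n{article}\n\nResponse:\n{response}"
)


def build_prompt(article):
    """Fill the template with an article and leave the response slot empty."""
    return TEMPLATE.format(article=article, response="")


def extract_response(decoded_text):
    """Return the text after the first 'Response:' marker, or '' if it is absent."""
    _, marker, tail = decoded_text.partition("Response:")
    return tail.strip() if marker else ""


prompt = build_prompt("లారీ ఆలయాన్ని ఢీకొట్టింది.")
# Stand-in for tokenizer.decode(...): the model echoes the prompt, then continues.
decoded = prompt + "ఆలయాన్ని ఢీకొట్టిన లారీ"
print(extract_response(decoded))
```

Using `partition` instead of indexing into `split` avoids an `IndexError` when the marker is missing from the decoded text.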

# 6. Article & Headline Results Before Finetuning

Before we start the finetuning process, let's see how the Gemma model performs out of the box on our dataset. This section runs a simple headline generation test on one article.

In [13]:
instruction = "ఘోర ప్రమాదం నుంచి కోలుకుని తిరిగి అంతర్జాతీయ క్రికెట్ ఆడుతున్న భారత వికెట్ కీపర్ రిషభ్ పంత్ ఓ అద్భుతమని పాకిస్థాన్ మాజీ కెప్టెన్ వసీమ్ అక్రమ్ కొనియాడాడు. ‘రోడ్డు ప్రమాదం తర్వాత ఎవరికైనా కోలుకునేందుకు చాలా సమయం పడుతుంది. ఇక ఆటగాడికైతే మరింత కష్టంగా ఉంటుంది. కానీ పంత్ అలా కాదు. నిజంగా తను మిరాకిల్ కిడ్. అతడిని యువతరం ఆదర్శంగా తీసుకోవాల్సిందే. ఐపీఎల్, టీ20 ప్రపంచకప్లోనూ ప్రభావం చూపి ఇప్పుడు టెస్టుల్లోనూ ఆకట్టుకుంటున్నాడు. ఆసీస్తో టెస్టు సిరీస్లోనూ తను కీలకం కానున్నాడు’ అని అక్రమ్ ప్రశంసించాడు. "


prompt = template.format(
    article=instruction,
    response="",
)

# Note: if the inputs and the model end up on different GPUs, this call fails with
# "RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!"
response_text = generate_response(model, tokenizer, prompt, device, 256)

Markdown(response_text)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.



రోడ్డు ప్రమాదం నుంచి కోలుకుని తిరిగి అంతర్జాతీయ క్రికెట్ ఆడుతున్న భారత వికెట్ కీపర్ రిషభ్ పంత్ ఓ అద్భుతమని పాకిస్థాన్ మాజీ కెప్టెన్ వసీమ్ అక్రమ్ కొనియాడాడు. ‘రోడ్డు ప్రమాదం తర్వాత ఎవరికైనా కోలుకునేందుకు చాలా సమయం పడుతుంది. ఇక ఆటగాడికైతే మరింత కష్టంగా ఉంటుంది. కానీ పంత్ అలా కాదు. నిజంగా తను మిరాకిల్ కిడ్. అతడిని యువతరం ఆదర్శంగా తీసుకోవాల్సిందే. ఐపీఎల్, టీ20 ప్రపంచకప్లోనూ ప్రభావం చూపి ఇప్పుడు టెస్టుల్లోనూ ప్రభావం చూ

The 2B model does not seem to understand the instruction and simply repeats the article text in Telugu. The 9B Gemma 2 model is better at understanding Telugu, but the 2B model is not. Let's fine-tune the smaller model on the Telugu news dataset and compare the results.


# 7. Applying Gemma LoRA

In this section, we'll apply the LoRA (**Low-Rank Adaptation**) technique to the **Gemma model**, a method designed to make fine-tuning large models like Gemma both **fast and efficient**. LoRA, a part of **PEFT** (**Parameter Efficient Fine-Tuning**), freezes the pre-trained weights and trains only small low-rank matrices added to selected layers. This drastically cuts down on the computational demands and GPU memory needs, and because the low-rank update can be merged into the original weights, it adds no extra time to the inference process. Here's what makes LoRA so powerful for our purposes:

<center><img src="https://cdn-lfs.huggingface.co/datasets/huggingface/documentation-images/4313422c5f2755897fb8ddfc5b99251358f679647ec0f2d120a3f1ff060defe7?response-content-disposition=inline%3B+filename*%3DUTF-8%27%27lora_diagram.png%3B+filename%3D%22lora_diagram.png%22%3B&response-content-type=image%2Fpng&Expires=1713275384&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcxMzI3NTM4NH19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy5odWdnaW5nZmFjZS5jby9kYXRhc2V0cy9odWdnaW5nZmFjZS9kb2N1bWVudGF0aW9uLWltYWdlcy80MzEzNDIyYzVmMjc1NTg5N2ZiOGRkZmM1Yjk5MjUxMzU4ZjY3OTY0N2VjMGYyZDEyMGEzZjFmZjA2MGRlZmU3P3Jlc3BvbnNlLWNvbnRlbnQtZGlzcG9zaXRpb249KiZyZXNwb25zZS1jb250ZW50LXR5cGU9KiJ9XX0_&Signature=NAlgCQRn6ktvkOq8WpJkP7DyBvC3ta3Z5gGREWKvLDGQLYpypCszzucGL7nFdzirC4Py9CkgAgkAwbtGAkBU0JvbDVqxIAK9SzpX34xyFmoERdHqH2sQUh17cZ42f60MU9E%7E209I%7Ec6HgUNponN8lhoQzn0jEKYvkzsVsVUPu4OuYONDx4C1tywJIDovcKZCqEQY7f9-OjEKjLPr-CkNymcE%7Eprd83SMPThprA3HVl4gmMbCslQgUM8mM5imHcFxozdbzgD1Mb0U%7El7THXSeBWXdpGdZIBjbJSwJBEEMBtlVbbKtncPTrZWUjrrq03EJJSB7Cc8IA%7EgtJ3cbUerDGw__&Key-Pair-Id=KVTP0A1DKRTAX" width="500"><br/>
Paper: <a href="https://arxiv.org/abs/2106.09685">LoRA: Low-Rank Adaptation of Large Language Models</a></center>

- **Dramatically reduces the number of trainable parameters**, by up to **10,000 times** (reported for GPT-3 175B in the LoRA paper).
- **Cuts down GPU memory usage** by up to **three times**.
- **Maintains quick inference times** with **no additional latency**.

The essence of PEFT, and by extension LoRA, is to enhance a model's performance using minimal resources, focusing on fine-tuning a handful of parameters for specific tasks. This technique is particularly advantageous as it:
  
- Optimizes rank decomposition matrices, maintaining the original model weights while adding optimized low-rank weights **A** and **B**.
- Allows for up to **threefold reductions** in both time and computational costs.
- Enables easy swapping of the LoRA module (weights **A** and **B**) according to the task at hand, lowering storage requirements and avoiding any increase in inference time.
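
To illustrate the low-rank update, the sketch below computes h = W x + (alpha / r) B (A x) on tiny hand-written matrices in plain Python. The matrix values, the helper names, and the scaling factor `alpha` are illustrative assumptions; real implementations such as `peft` operate on tensors and also handle initialization and dropout.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Frozen base projection plus the scaled low-rank correction B @ (A @ x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]


W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen 3x3 weight
A = [[1.0, 1.0, 1.0]]               # shape r x in_features, with r = 1
B_zero = [[0.0], [0.0], [0.0]]      # shape out_features x r, zero at initialization
B_trained = [[0.5], [0.0], [-0.5]]  # made-up values standing in for a trained B
x = [1.0, 2.0, 3.0]

print(lora_forward(W, A, B_zero, x, alpha=1, r=1))     # same as the base output W x
print(lora_forward(W, A, B_trained, x, alpha=1, r=1))  # base output plus the update
```

Because `W` is never modified, swapping tasks only requires swapping the small `A` and `B` matrices.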

When applied specifically to **Transformer architectures**, targeting **attention weights** and keeping MLP modules static, LoRA significantly enhances the model's efficiency. For instance, in GPT-3 175B models, it:
  
- **Reduces VRAM usage** from **1.2TB to 350GB**.
- **Lowers checkpoint size** from **350GB to 35MB**.
- **Boosts training speed** by approximately **25%**.

By integrating LoRA into Gemma, we aim to streamline the model's fine-tuning process in this section, making it quicker and more resource-efficient, without compromising on performance.

In [14]:
# LoRA configuration: Sets up the parameters for Low-Rank Adaptation, which is a method for efficient fine-tuning of transformers.
# USE LORA for saving memory and computation
lora_config = LoraConfig(
    r = 8,  # Rank of the adaptation matrices. A lower rank means fewer parameters to train.
    target_modules = ["q_proj", "o_proj", "k_proj", "v_proj",
                      "gate_proj", "up_proj", "down_proj"],  # Transformer modules to apply LoRA.
    task_type = "CAUSAL_LM",  # The type of task, here it is causal language modeling.
)
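
For a sense of scale, the number of trainable parameters this configuration would add can be estimated from the layer shapes in the model summary printed earlier. Each adapted linear layer contributes r × (in_features + out_features) parameters. The sketch is simple arithmetic; the helper names are hypothetical, and it assumes all 18 decoder layers are adapted.

```python
# (in_features, out_features) per module, taken from the printed Gemma 2B summary.
LAYER_SHAPES = {
    "q_proj": (2048, 2048),
    "k_proj": (2048, 256),
    "v_proj": (2048, 256),
    "o_proj": (2048, 2048),
    "gate_proj": (2048, 16384),
    "up_proj": (2048, 16384),
    "down_proj": (16384, 2048),
}


def lora_param_count(shapes, rank, num_layers):
    """Trainable LoRA parameters: rank * (in + out) per adapted linear layer."""
    per_layer = sum(rank * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
    return per_layer * num_layers


def frozen_param_count(shapes, num_layers):
    """Parameters in the original (frozen) weight matrices of the same layers."""
    return sum(fan_in * fan_out for fan_in, fan_out in shapes.values()) * num_layers


trainable = lora_param_count(LAYER_SHAPES, rank=8, num_layers=18)
frozen = frozen_param_count(LAYER_SHAPES, num_layers=18)
print(f"LoRA parameters: {trainable:,}")
print(f"Adapted frozen weights: {frozen:,}")
print(f"Ratio: {trainable / frozen:.2%}")
```

Raising `r` increases the adapter size linearly, so the rank is the main knob for trading capacity against memory.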

# 8. Evaluation Metrics

In [15]:
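# ROUGE metric for evaluating generated Telugu headlines
import evaluate

metric = evaluate.load("rouge")

def compute_metrics(eval_pred):
    # The trainer passes logits and label ids, so turn both into text before scoring.
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # teacher-forced next-token predictions
    # -100 marks positions ignored by the loss; replace it so the labels can be decoded.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # The default ROUGE tokenizer keeps only Latin letters and digits, so split Telugu text on whitespace.
    return metric.compute(predictions=decoded_preds, references=decoded_labels,
                          tokenizer=lambda text: text.split())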
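
To make concrete what ROUGE-1 measures, the sketch below computes a unigram-overlap F1 score with whitespace tokenization, which keeps Telugu words intact. It is a simplified illustration, not the implementation behind `evaluate.load("rouge")`; the function name is hypothetical.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between two strings, tokenized on whitespace."""
    pred_counts = Counter(prediction.split())
    ref_counts = Counter(reference.split())
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


reference = "ఆంజనేయస్వామి ఆలయాన్ని ఢీకొట్టిన లారీ"
prediction = "ఆలయాన్ని ఢీకొట్టిన లారీ"
print(round(rouge1_f1(prediction, reference), 3))
```

Here the prediction matches three of the four reference words, so precision is perfect while recall is three quarters.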

# 8. Training Gemma

Now that everything is set up, it's time to finetune the Gemma model on your data. This section will guide you through the training process, including setting up your training loop and selecting the right hyperparameters.

In [16]:
def formatting_func(examples):
    """
    Formats a batch of examples (a dictionary of column lists) using the predefined template.

    Parameters:
    - examples (dict): A batch with keys corresponding to the dataset columns, such as 'article' and 'headline',
      each mapping to a list of strings.

    Returns:
    - list: One formatted string per example, combining the instruction, the article, and its headline.
    """
    articles = examples[question_column]
    headlines = examples[answer_column]
    # Pair each article with its headline and fill in the prompt template.
    return [
        template.format(article=article, response=headline)
        for article, headline in zip(articles, headlines)
    ]
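
The formatting function receives a batch as a dictionary of lists, one list per column. The sketch below shows that layout on a two-example toy batch; the English placeholder strings and the shortened template are stand-ins chosen for readability, not the notebook's actual data or prompt.

```python
TOY_TEMPLATE = "Article:\n{article}\n\nResponse:\n{response}"


def format_batch(batch, template=TOY_TEMPLATE):
    """Turn a dict of column lists into one formatted training string per row."""
    return [
        template.format(article=article, response=headline)
        for article, headline in zip(batch["article"], batch["headline"])
    ]


toy_batch = {
    "article": ["First article body.", "Second article body."],
    "headline": ["First headline", "Second headline"],
}
for text in format_batch(toy_batch):
    print(repr(text))
```

Each output string is what the tokenizer sees as a single training example.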


In [17]:
!rm -rf outputs
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:128'
# Setup for the trainer object that will handle fine-tuning of the model.
trainer = SFTTrainer(
    model=model,  # The pre-trained model to fine-tune.
    train_dataset=dataset['train'],  # The dataset used for training(83k)
    eval_dataset=dataset['validation'],  # The dataset used for validation(10k)
    max_seq_length=512,  # The maximum sequence length for the model inputs.
    compute_metrics=compute_metrics,
    args=TrainingArguments(  # Arguments for training setup.
        per_device_train_batch_size= 4 ,  # Batch size per device (e.g., GPU).
        #gradient_accumulation_steps=4,  # Number of steps to accumulate gradients before updating model weights.
        warmup_steps=10,  # Number of steps to gradually increase the learning rate at the beginning of training.
        max_steps=10000,  # Total number of training steps to perform.
        learning_rate=2e-4,  # Learning rate for the optimizer.
        fp16=True,  # Whether to use 16-bit floating point precision for training. False means 32-bit is used.
        logging_steps=1,  # How often to log training information.
        output_dir="outputs",  # Directory where training outputs will be saved.
        eval_strategy="steps",
        per_device_eval_batch_size=4,
        gradient_checkpointing=True,  # Enable gradient checkpointing to save memory.
        #optim="paged_adamw_8bit",  # The optimizer to use, with 8-bit precision for efficiency.
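        eval_accumulation_steps=4,  # Move predictions to the CPU every 4 eval steps to avoid GPU out-of-memory errors; see https://discuss.huggingface.co/t/cuda-out-of-memory-when-using-trainer-with-compute-metrics/2941/3
        eval_steps=2000,  # Run evaluation every 2000 training steps.
    ),
    # peft_config=lora_config,  # Uncomment to apply the LoRA configuration for efficient fine-tuning.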
    formatting_func=formatting_func,  # The function to format the dataset examples.
    
)
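
A quick calculation shows how much of the dataset these settings cover. The sketch below assumes a single training process without gradient accumulation and uses the approximate split size of 83,000 examples noted in the code comments; adjust the arguments if your setup differs.

```python
def examples_seen(max_steps, per_device_batch_size, num_processes=1, grad_accum_steps=1):
    """Number of training examples consumed over the whole run."""
    return max_steps * per_device_batch_size * num_processes * grad_accum_steps


def epochs_covered(total_examples_seen, dataset_size):
    """Fraction (or multiple) of the dataset seen during training."""
    return total_examples_seen / dataset_size


seen = examples_seen(max_steps=10_000, per_device_batch_size=4)
print(f"examples seen: {seen:,}")
print(f"approximate epochs: {epochs_covered(seen, 83_000):.2f}")
```

Under these assumptions the run sees 40,000 examples, a bit under half of one pass over the training split.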


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




In [18]:
# train the model to the processed data.
trainer.train()

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


Step,Training Loss,Validation Loss


In [1]:
# Push the fine-tuned model and tokenizer to the Hugging Face Hub under the user saidines12
# as the repository telugu-news-headline-generation.
trainer.model.push_to_hub("saidines12/telugu-news-headline-generation")
tokenizer.push_to_hub("saidines12/telugu-news-headline-generation")


# 9. Article & Headline Results After Finetuning

After training, let's see how much our Gemma model has improved. We'll rerun the headline generation test and compare the result with the pre-finetuning output.

In [None]:
instruction = "అక్రమ్ కరాచీ ఘోర ప్రమాదం నుంచి కోలుకుని తిరిగి అంతర్జాతీయ క్రికెట్ ఆడుతున్న భారత వికెట్ కీపర్ రిషభ్ పంత్ ఓ అద్భుతమని పాకిస్థాన్ మాజీ కెప్టెన్ వసీమ్ అక్రమ్ కొనియాడాడు. ‘రోడ్డు ప్రమాదం తర్వాత ఎవరికైనా కోలుకునేందుకు చాలా సమయం పడుతుంది. ఇక ఆటగాడికైతే మరింత కష్టంగా ఉంటుంది. కానీ పంత్ అలా కాదు. నిజంగా తను మిరాకిల్ కిడ్. అతడిని యువతరం ఆదర్శంగా తీసుకోవాల్సిందే. ఐపీఎల్, టీ20 ప్రపంచకప్లోనూ ప్రభావం చూపి ఇప్పుడు టెస్టుల్లోనూ ఆకట్టుకుంటున్నాడు. ఆసీస్తో టెస్టు సిరీస్లోనూ తను కీలకం కానున్నాడు’ అని అక్రమ్ ప్రశంసించాడు. "


prompt = template.format(
    article=instruction,
    response="",
)

response_text = generate_response(trainer.model, tokenizer, prompt, device, 32)
# TODO: Fix repetition of the response

Markdown(response_text)


రోడ్డు ప్రమాదం తర్వాత కోలుకున్నాడు రిషభ్ పంత్
రోడ్డు ప్ర

**Although** the fine-tuned Gemma 2B model's output is only okay, it is a better headline than the previous result, which just repeated the article. There is considerable room for improvement, as this run used LoRA with quantization. A first step is to train without LoRA; if the performance still falls short of expectations, move up to the larger Gemma 9B model, which is much better at understanding Telugu and following instructions.
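
Until the repetition is fixed at its source, one simple workaround is to keep only the first non-empty line of the generated text. This is a post-processing sketch with a hypothetical helper name; it does not address why the model keeps generating after the headline.

```python
def first_headline(response_text):
    """Return the first non-empty line of a generated response."""
    for line in response_text.splitlines():
        cleaned = line.strip()
        if cleaned:
            return cleaned
    return ""


# The fine-tuned output shown above: a complete headline followed by a cut-off repeat.
raw = "\nరోడ్డు ప్రమాదం తర్వాత కోలుకున్నాడు రిషభ్ పంత్\nరోడ్డు ప్ర"
print(first_headline(raw))
```

The helper could be applied to the string returned by `generate_response` before it is displayed.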

# 10. Conclusion

In this beginner-friendly notebook, we've outlined the process of fine-tuning the Gemma model, a Large Language Model (LLM), for Telugu news headline generation. Starting from data loading and preprocessing, we've demonstrated how to train the Gemma model, even for those new to working with LLMs.

We used the `saidines12/telugu_news_dataset`, which pairs Telugu news articles with their headlines, to train the Gemma model to generate engaging and factual headlines. This journey, while introductory, underscores the potential and straightforward path to engaging with LLMs through the Gemma model.

Achieving the best performance with the Gemma model (or any LLM) generally requires training with more extensive datasets and over more epochs. Future enhancements could include integrating Retrieval-Augmented Generation (RAG) and Direct Preference Optimization (DPO) training techniques, offering a way to further improve the model by incorporating external knowledge bases for more precise and relevant responses.

Ultimately, this notebook is designed to make the Gemma model approachable for beginners, illustrating that straightforward steps can unlock the potential of LLMs for diverse domain-specific tasks. It encourages users to experiment with the Gemma model across various fields, broadening the scope of its application and enhancing its utility.

In [9]:
from huggingface_hub import HfApi
api = HfApi()

api.upload_file(
    path_or_fileobj="/data1/max/telugu_corpus/andhrajyothy_data/gemma-fine-tuning-on-telugu-news-dataset.ipynb",
    repo_id="saidines12/telugu-news-headline-generation",
    repo_type="model",
    path_in_repo="notebooks/gemma-fine-tuning-on-telugu-news-dataset.ipynb",
)
