---
base_model: meta-llama/Llama-3.2-11B-Vision-Instruct
library_name: transformers
license: llama3.2
tags:
- lora
- llama
- vision-language
- peft
- fine-tuned
inference: false
pipeline_tag: image-text-to-text
---

# **lavender-llama-3.2-11b-lora** 🚀

**LoRA fine-tuned model based on** `meta-llama/Llama-3.2-11B-Vision-Instruct`

## **Model Overview**

This is a **Lavender fine-tuned** version of **Llama-3.2-11B-Vision-Instruct**. Lavender is a simple supervised fine-tuning (SFT) method that boosts the performance of advanced vision-language models (VLMs) by leveraging state-of-the-art image generation models such as Stable Diffusion. This model retains the core capabilities of Llama-3.2 while incorporating Stable Diffusion's visual expertise.

- **Base Model**: [`meta-llama/Llama-3.2-11B-Vision-Instruct`](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct)
- **Fine-Tuned Model**: `lxasqjc/lavender-llama-3.2-11b-lora`
- **Lavender Paper**: [Diffusion Instruction Tuning (arXiv)](https://arxiv.org/abs/2502.06814)
- **Lavender Project Space**: [Diffusion Instruction Tuning](https://astrazeneca.github.io/vlm/)
- **GitHub Repository**: [Diffusion Instruction Tuning](https://github.com/AstraZeneca/vlm)
- **Parameter-Efficient Fine-Tuning (PEFT)**: Uses **LoRA** (Low-Rank Adaptation), so only a small set of adapter weights is trained and distributed.
- **License**: Llama 3.2 Community License (see [`LICENSE.txt`](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct/resolve/main/LICENSE.txt))

---

## **🛠️ How to Use This Model**

This model contains **only LoRA weights**. To use it, you must first **load the base model** and then apply the LoRA adapter.

### **Install Required Packages**

```bash
pip install torch transformers accelerate peft
```

### **Load and Use the LoRA Model**

```python
from transformers import MllamaForConditionalGeneration, MllamaProcessor
from peft import PeftModel
import requests
import torch
from PIL import Image

# Define model paths
base_model_name = "meta-llama/Llama-3.2-11B-Vision-Instruct"
lora_model_name = "lxasqjc/lavender-llama-3.2-11b-lora"

# Load the base model and its processor
base_model = MllamaForConditionalGeneration.from_pretrained(
    base_model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = MllamaProcessor.from_pretrained(base_model_name)

# Apply the LoRA adapter on top of the base model
lora_model = PeftModel.from_pretrained(base_model, lora_model_name)

# Download an example image
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Build a multimodal chat prompt (image + text)
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "If I had to write a haiku for this one, it would be: "}
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    image,
    input_text,
    add_special_tokens=False,
    return_tensors="pt"
).to(lora_model.device)

# Generate and decode the response
output = lora_model.generate(**inputs, max_new_tokens=30)
print(processor.decode(output[0]))
```

---

## **💡 Applications**

- **Vision-Language Tasks**: Describing images and answering visual questions.
- **Instruction Following**: Improved instruction compliance compared to the base Llama-3.2 model.
- **Multimodal Reasoning**: Processes and reasons jointly over images and text.
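
---

## **🔀 Optional: Merge the LoRA Adapter (Sketch)**

If you prefer to serve the model without a PEFT dependency at inference time, the LoRA adapter can be folded into the base weights. This is a minimal sketch, not an officially validated workflow: it assumes the `lora_model` and `processor` objects from the usage example above, and the output directory name is a hypothetical placeholder.

```python
# Continues from the "Load and Use the LoRA Model" snippet above.
# merge_and_unload() folds the LoRA deltas into the base weights and
# returns a plain transformers model (no PEFT wrapper).
merged_model = lora_model.merge_and_unload()

# Save the merged weights and processor locally (hypothetical placeholder path)
save_dir = "lavender-llama-3.2-11b-merged"
merged_model.save_pretrained(save_dir)
processor.save_pretrained(save_dir)
```

The merged checkpoint can then be reloaded with `MllamaForConditionalGeneration.from_pretrained(save_dir)` alone; the trade-off is a full-size checkpoint on disk instead of the small adapter files.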
---

## **📊 Training Details**

- **Fine-Tuning Method**: LoRA (Low-Rank Adaptation)
- **Base Model**: `meta-llama/Llama-3.2-11B-Vision-Instruct`
- **PEFT Framework**: [Hugging Face PEFT](https://huggingface.co/docs/peft)
- **Precision**: `bfloat16`
- **Hyperparameters**: Coming soon

---

## **📏 Limitations & Considerations**

- This is **not a standalone model**: the adapter must be loaded on top of the base Llama-3.2-11B-Vision-Instruct weights.
- **Biases & Ethical Use**: Like all large models, it may exhibit biases present in the pretraining data.
- **Hardware Requirements**: Minimum **24 GB of VRAM** for `bfloat16` inference (A100 40GB recommended); see the low-VRAM loading sketch at the end of this card.

---

## **🔗 References**

- **Base Model**: [Meta Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct)
- **LoRA Paper**: [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685)
- **PEFT Documentation**: [Hugging Face PEFT](https://huggingface.co/docs/peft)
- **Project Space**: [Diffusion Instruction Tuning](https://astrazeneca.github.io/vlm/)
- **Paper**: [Diffusion Instruction Tuning (arXiv)](https://arxiv.org/abs/2502.06814)

### **Citation**

If you use this model or work in your research, please cite:

```bibtex
@misc{jin2025diffusioninstructiontuning,
      title={Diffusion Instruction Tuning},
      author={Chen Jin and Ryutaro Tanno and Amrutha Saseendran and Tom Diethe and Philip Teare},
      year={2025},
      eprint={2502.06814},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2502.06814},
}
```
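
---

## **📎 Appendix: Low-VRAM Loading (Sketch)**

The 24 GB VRAM figure in the Limitations section assumes full `bfloat16` inference. As a hedged alternative for smaller GPUs, the base model can be loaded with 4-bit quantization via `bitsandbytes` before attaching the adapter. This is an untested sketch, not an officially supported configuration: quantization may affect output quality, and the extra `bitsandbytes` dependency is an assumption beyond the packages listed earlier.

```python
from transformers import MllamaForConditionalGeneration, MllamaProcessor, BitsAndBytesConfig
from peft import PeftModel
import torch

# Assumes `pip install bitsandbytes` in addition to the packages listed earlier.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model_name = "meta-llama/Llama-3.2-11B-Vision-Instruct"
lora_model_name = "lxasqjc/lavender-llama-3.2-11b-lora"

# Load the base model in 4-bit, then attach the LoRA adapter as before.
base_model = MllamaForConditionalGeneration.from_pretrained(
    base_model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
processor = MllamaProcessor.from_pretrained(base_model_name)
lora_model = PeftModel.from_pretrained(base_model, lora_model_name)
```

Generation then proceeds exactly as in the usage example above.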