llava-v1.6-mistral-7b-memes-chilenos-small
A LoRA‑fine‑tuned version of LLaVA‑Next for explaining Chilean memes in colloquial Spanish, built for the Somos NLP Hackathon 2025.

Project Overview
Introduction
Memes are more than jokes—they are cultural micro-units.
According to James Gleick (2011), memes are:
“nothing more than ideas, images, slogans, melodies, stories, recipes, skills or abilities, legends, and systems that populate our minds.”
Ryan Milner (2012) defines them as:
“amateur media artifacts, extensively remixed and recirculated by different participants on social media networks”
These definitions highlight the richness and complexity behind meme culture. Yet, vision-language models like LLaVA often struggle to interpret them—especially when memes rely on local context, irony, or informal language.
This project adapts LLaVA‑Next to the Chilean cultural landscape via targeted fine‑tuning, enabling the model to explain memes in Chilean Spanish, grounded in the humor, expressions, and imagery found in real social media content.
Motivation
While vision-language models like LLaVA‑Next have shown impressive results in traditional tasks (e.g., image captioning, VQA), they often struggle with highly contextual and culturally specific content such as memes—especially those rooted in local humor, slang, and social cues.
In the Chilean context, memes frequently rely on implicit cultural knowledge, regional dialects, and shared national experiences that general-purpose models are not trained to recognize. Moreover, no vision-language model had been fine-tuned specifically to interpret memes in colloquial Chilean Spanish.
Recent work such as MemeGuard (Liu et al., 2024) highlights this challenge, noting that memes are inherently ambiguous, multi-layered, and culturally grounded, making them difficult even for strong multimodal baselines.
This motivated us to fine-tune LLaVA‑Next using a custom dataset of Chilean Instagram memes, enriched with instruction-based annotations and cultural explanations.
Impact
This work contributes to research on:
- Cultural alignment of vision-language models.
- Educational, social media, and discourse analysis applications.
- Advanced fine-tuning strategies beyond standard captioning/VQA tasks.
It also offers a replicable methodology for developing culturally adapted models in other regions—especially across Latin America.
Model Details
Field | Value |
---|---|
Model ID | somosnlp-hackathon-2025/llava-v1.6-mistral-7b-memes-chilenos-small |
Base model | llava-hf/llava-v1.6-mistral-7b-hf |
Architecture | Vision‑language GPT‑style decoder with a Mistral‑7B text backbone and a CLIP ViT‑L/14 (336 px) visual encoder |
Fine‑tuning method | LoRA (PEFT) on all linear layers except lm_head; vision encoder included |
Languages | es_CL (Chilean Spanish) |
Modalities | Input: image + text prompt; output: text |
License | Apache 2.0 (inherited from the base model) |
Dataset | 1,194 Instagram memes with manual explanations + 3,582 synthetic explanations via instruction templates (4,776 total) |
Training epochs | 2 |
Hardware | 1 × NVIDIA L40S (48 GB) |
Evaluation | BERTScore ≈ 0.73 on the meme‑explanation test set |
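The BERTScore row above measures semantic similarity between generated and reference explanations. As a hedged illustration, a minimal computation with the bert-score package might look as follows; the exact model and settings behind the reported ≈ 0.73 are not specified in this card, and the example strings are invented placeholders.

# Minimal BERTScore sketch; settings behind the reported ~0.73 are unspecified
from bert_score import score

candidates = ["El meme ironiza sobre la rutina con humor chileno."]  # model outputs
references = ["Explicación humana: el meme se burla de la rutina."]  # human references

P, R, F1 = score(candidates, references, lang="es")
print(f"BERTScore F1: {F1.mean().item():.3f}")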
Model Description
llava-v1.6-mistral-7b-memes-chilenos-small adapts LLaVA‑Next to the highly contextual humour found in Chilean memes.
Because no public corpus of memes explained in Chilean Spanish existed, we scraped 1,194 image‑text posts from popular Chilean Instagram meme pages, wrote human explanations, and expanded the corpus with 3,582 synthetic explanations (4,776 examples in total) generated by GPT‑4o and command-r-plus (Cohere) using four instruction‑tuning templates (a sketch of this expansion step follows the list):
- “Explica qué significa este meme en Chile, usando lenguaje coloquial…”
- “Explica cómo entendería este meme alguien que vive en Chile…”
- “Describe por qué este meme sería gracioso o significativo para un chileno…”
- “Imagina que le explicas este meme a alguien extranjero…”
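As a hedged sketch of this expansion step: each human-written explanation is paired with the templates and sent to an LLM. The call_llm helper below is a hypothetical stand-in for the GPT‑4o / command-r-plus API calls; the released pipeline may differ, and the counts above imply roughly three synthetic examples per meme.

# Sketch of the synthetic-expansion step; `call_llm` is a hypothetical
# stand-in for a GPT-4o or command-r-plus API call.
TEMPLATES = [
    "Explica qué significa este meme en Chile, usando lenguaje coloquial...",
    "Explica cómo entendería este meme alguien que vive en Chile...",
    "Describe por qué este meme sería gracioso o significativo para un chileno...",
    "Imagina que le explicas este meme a alguien extranjero...",
]

def call_llm(prompt: str) -> str:
    raise NotImplementedError("replace with a real GPT-4o / command-r-plus call")

def expand(caption: str, human_explanation: str) -> list[dict]:
    examples = []
    for template in TEMPLATES:
        prompt = (f"{template}\n\nTexto del meme: {caption}\n"
                  f"Explicación humana: {human_explanation}")
        examples.append({"instruction": template, "output": call_llm(prompt)})
    return examples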
The result is a compact model that can describe why a meme is funny for a Chilean audience, though it still struggles with extremely time‑sensitive or highly niche references.
During both training and inference, we used the following system prompt to guide the model's behavior and cultural framing: “Eres experto en memes chilenos. Observa la imagen y, si hay texto, interprétalo sin repetirlo. Analiza su sentido usando contexto cultural chileno. Responde según la instrucción.”
Bias, Risks & Limitations
- Temporal drift – many memes reference current events; explanations may become dated quickly.
- Subjectivity of humour – what is “funny” varies; the model reflects the curators’ viewpoint.
- Dataset biases – Instagram accounts skew toward urban, younger demographics; regional slang may be under‑represented.
- Vision safety – the model was not filtered for possibly offensive or unsafe imagery.
Recommendations
Always present model outputs with a disclaimer that humour is subjective and culturally bound. Human review is recommended before publishing explanations.
How to Get Started
from transformers import LlavaNextForConditionalGeneration, AutoProcessor
from peft import PeftModel
import torch
from PIL import Image

BASE_MODEL = "llava-hf/llava-v1.6-mistral-7b-hf"
ADAPTER_ID = "AndresSebad/llava-v1.6-mistral-7b-memes-chilenos-small"

device = "cuda"  # or "cpu" if no GPU is available

# Load the frozen base checkpoint in FP16 and attach the LoRA adapters
base_model = LlavaNextForConditionalGeneration.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.float16,
    device_map="auto",  # already places the weights, so no extra .to(device) is needed
)
model = PeftModel.from_pretrained(base_model, ADAPTER_ID)
model.eval()

processor = AutoProcessor.from_pretrained(BASE_MODEL)
processor.tokenizer.padding_side = "right"

img = Image.open("meme.jpg").convert("RGB")
conversation = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Eres experto en memes chilenos. Observa la imagen y, si hay texto, interprétalo sin repetirlo. Analiza su sentido usando contexto cultural chileno. Responde según la instrucción."},
        {"type": "text", "text": "Explica qué significa este meme en Chile, usando lenguaje coloquial chileno."},
    ],
}]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(text=prompt, images=[img], return_tensors="pt").to(device)

out_ids = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt
print(processor.decode(out_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
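Optionally, the LoRA weights can be merged into the base model to remove the adapter indirection at generation time. merge_and_unload is a standard PEFT method; whether a merged checkpoint was published for this model is not stated here, and the output path below is hypothetical.

# Optional: fold the LoRA deltas into the base weights for faster inference
merged = model.merge_and_unload()  # returns a plain LlavaNextForConditionalGeneration
merged.save_pretrained("llava-memes-chilenos-merged")  # hypothetical local path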
Training Details
Training data
- 1,194 real Chilean memes (image + caption + human explanation).
- 3,582 synthetic examples generated with GPT‑4o and command-r-plus (Cohere) using the templates above.
Pre‑processing
- Images resized to 336 × 336.
- Mixed‑precision (fp16) training; a sketch of how a single example can be prepared follows this list.
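As a hedged illustration of these two steps, one training example might be prepared as below; the actual collator is not included in this card, and the explanation string is a placeholder. The processor handles the 336 × 336 resizing internally.

# Sketch: preparing a single (image, instruction, explanation) example in fp16.
from transformers import AutoProcessor
from PIL import Image
import torch

processor = AutoProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
img = Image.open("meme.jpg").convert("RGB")

conversation = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Explica qué significa este meme en Chile, usando lenguaje coloquial chileno."},
    ]},
    {"role": "assistant", "content": [
        {"type": "text", "text": "La talla es que..."},  # human explanation (placeholder)
    ]},
]

text = processor.apply_chat_template(conversation)
batch = processor(text=text, images=[img], return_tensors="pt")
batch["pixel_values"] = batch["pixel_values"].to(torch.float16)  # fp16 training
batch["labels"] = batch["input_ids"].clone()  # standard causal-LM target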
Hyper‑parameters
Param | Value |
---|---|
LoRA rank r | 8 |
LoRA α | 8 |
LoRA dropout | 0.1 |
Epochs | 2 |
Batch size | 1 (gradient accumulation 8) |
Learning rate | 1 × 10⁻⁴ → 2 × 10⁻⁵ (linear decay) |
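Put together, a LoraConfig matching this table might look as follows. The target-module discovery mirrors the "all linear layers except lm_head" description in the model details and is an assumption, not the released training script.

# Hedged sketch of the LoRA setup implied by the table above
import torch
import torch.nn as nn
from peft import LoraConfig, get_peft_model
from transformers import LlavaNextForConditionalGeneration

model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf", torch_dtype=torch.float16
)

def find_linear_names(model: nn.Module) -> list[str]:
    # Leaf names of every nn.Linear, skipping the LM head
    names = {name.split(".")[-1]
             for name, module in model.named_modules()
             if isinstance(module, nn.Linear) and "lm_head" not in name}
    return sorted(names)

lora_config = LoraConfig(r=8, lora_alpha=8, lora_dropout=0.1,
                         target_modules=find_linear_names(model))
model = get_peft_model(model, lora_config)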
Environmental Impact
Estimated total location-based greenhouse gas emissions were 190.96 grams CO₂eq for fine-tuning.
Training was performed on a single NVIDIA L40S GPU over approximately 1.45 hours, tracked using the CodeCarbon library.
No renewable energy matching or carbon offsets were applied. The training took place in a datacenter estimated to be located in Virginia, USA based on IP geolocation.
Note: Results are approximate and based on available metadata.
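For context, the CodeCarbon tracking pattern is roughly as below; this is a minimal sketch, and the exact integration used during training is not shown in this card.

# Minimal CodeCarbon sketch; the actual training integration may differ
from codecarbon import EmissionsTracker

tracker = EmissionsTracker()
tracker.start()
# ... run fine-tuning here ...
emissions_kg = tracker.stop()  # emissions in kg CO2eq
print(f"{emissions_kg * 1000:.2f} g CO2eq")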
Example 1 – “18 de Septiembre”

Model | Output |
---|---|
Base LLaVA‑Next | La imagen muestra un perro con una sombrilla y una camisa, con un texto que dice "Me rehuso a ser controlado por un calendario. Feliz 18 a todos". Este meme parece estar utilizando la expresión "Me rehuso a ser controlado por un calendario" para expresar una actitud de resistencia a la rigidez o estructura de un calendario, lo que podría ser una crítica a la vida rutinaria o a la idea de que la vida debe ser organizada de manera estricta.\n\nEn Chile, la expresión "Me rehuso a ser controlado por un calendario" es una forma coloquial de expresar una actitud de no querer ser limitado o controlado por las normas o estructuras de una sociedad. Es una forma de expresar una actitud de libertad y no querer ser imponible o controlado por otros.\n\nEn el contexto cultural chileno, este meme podría estar criticando la idea de que la vida debe ser organizada de manera estricta... |
Fine‑tuned | Este meme muestra a un perrito vestido como huaso, con su chupalla y todo, y la frase "Me rehuso a ser controlado por un calendario. Feliz 18 a todos". La talla es que el perrito está jugando a ser un huaso, como si fuera parte de las celebraciones del 18 de septiembre, pero en realidad no está dispuesto a seguir las normas o rutinas, como si estuviera "controlado" por el calendario. Es una forma divertida de decir que no quiere estar "atado" a las cosas formales o predecibles, y prefiere seguir su propio ritmo. |
Example 2 – “No estoy entendiendo...”

Model | Output |
---|---|
Base LLaVA‑Next | La imagen muestra a un perro sentado en un escritorio con varios libros y un lápiz, y hay un texto que dice "no estoi entendiendo ni una wea". Este meme es una parodia de la expresión "no estoy entendiendo ni una wea", que en Chile se utiliza para indicar que alguien no está comprendiendo o no está interesado en una situación o conversación.\n\nEn este caso, el meme se utiliza para representar a un perro que no está comprendiendo o no está interesado en una situación o conversación, lo que es humorístico y se relaciona con la idea de que los perros no tienen la capacidad de comprender o de tener interés en las cosas que los humanos hacen.\n\nEn Chile, la expresión "no estoy entendiendo ni una wea" es muy común y se utiliza en muchas situaciones para indicar que alguien no está comprendiendo o no está interesado en una situación o conversación. La palabra "wea" es una exp... |
Fine‑tuned | La imagen muestra a un perro sentado en una escritura, como si fuera un estudiante en la U, con cara de estar chato. El texto dice "no estoi entendiendo ni una wea". Es como cuando estás en clases y no cachas ni una, pero igual te quedas ahí, como si estuvieras en modo "zombie". |
Technical Specifications
- Architecture – Mistral‑7B decoder with multimodal projector; unchanged aside from the LoRA adapters.
- Compute – fine‑tuned on a single GPU; no distributed or multi‑node training. A quick trainable‑parameter check is sketched below.
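With the PEFT-wrapped model from the sketches above, the trainable fraction can be checked in one line:

# Quick check of the trainable fraction once LoRA adapters are attached
model.print_trainable_parameters()
# prints e.g. "trainable params: ... || all params: ... || trainable%: ..."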
Citation
BibTeX
@software{llava_memes_chile_2025,
  author    = {Somos NLP Hackathon 2025},
  title     = {llava-v1.6-mistral-7b-memes-chilenos-small},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/somosnlp-hackathon-2025/llava-v1.6-mistral-7b-memes-chilenos-small}
}
Glossary
- LoRA – Low‑Rank Adaptation; inserts lightweight trainable matrices.
- Instruction‑tuning – fine‑tuning on instruction–response pairs; here, diverse prompt templates paired with explanations.
- Mistral‑7B – a 7‑billion‑parameter transformer decoder.
Hackathon: This model was developed for the Somos NLP Hackathon 2025; see the project page for details.
Resources
- 🔗 GitHub repository – Source code, training scripts, and the Gradio demo.