Tags: Image-to-Text · Transformers · Safetensors · Spanish · vision-language · llava · lora · memes · chile

llava-v1.6-mistral-7b-memes-chilenos-small

A LoRA‑fine‑tuned version of LLaVA‑Next for explaining Chilean memes in colloquial Spanish, built for the Somos NLP Hackathon 2025.


Project Overview

Introduction

Memes are more than jokes—they are cultural micro-units.

According to James Gleick (2011), memes are:

“nothing more than ideas, images, slogans, melodies, stories, recipes, skills or abilities, legends, and systems that populate our minds.”

Ryan Milner (2012) defines them as:

“amateur media artifacts, extensively remixed and recirculated by different participants on social media networks”.

These definitions highlight the richness and complexity behind meme culture. Yet, vision-language models like LLaVA often struggle to interpret them—especially when memes rely on local context, irony, or informal language.

This project adapts LLaVA‑Next to the Chilean cultural landscape via targeted fine‑tuning, enabling the model to explain memes in Chilean Spanish, grounded in the humor, expressions, and imagery found in real social media content.


Motivation

While vision-language models like LLaVA‑Next have shown impressive results in traditional tasks (e.g., image captioning, VQA), they often struggle with highly contextual and culturally specific content such as memes—especially those rooted in local humor, slang, and social cues.

In the Chilean context, memes frequently rely on implicit cultural knowledge, regional dialects, and shared national experiences that general-purpose models are not trained to recognize. Moreover, to our knowledge, no vision-language model had been fine-tuned specifically to interpret memes in colloquial Chilean Spanish.

Recent work such as MemeGuard (Liu et al., 2024) highlights this challenge, noting that memes are inherently ambiguous, multi-layered, and culturally grounded, making them difficult even for strong multimodal baselines.

This motivated us to fine-tune LLaVA‑Next using a custom dataset of Chilean Instagram memes, enriched with instruction-based annotations and cultural explanations.


Impact

This work contributes to research on:

  • Cultural alignment of vision-language models.
  • Educational, social media, and discourse analysis applications.
  • Advanced fine-tuning strategies beyond standard captioning/VQA tasks.

It also offers a replicable methodology for developing culturally adapted models in other regions—especially across Latin America.


Model Details

| Field | Value |
| --- | --- |
| Model ID | somosnlp-hackathon-2025/llava-v1.6-mistral-7b-memes-chilenos-small |
| Base model | llava-hf/llava-v1.6-mistral-7b-hf |
| Architecture | Decoder-only vision-language model: Mistral‑7B text backbone with a CLIP ViT‑L/14 (336 px) visual encoder |
| Fine‑tuning method | LoRA (PEFT) on all linear layers except lm_head; vision encoder included |
| Languages | es_CL (Chilean Spanish) |
| Modalities | Input: image + text prompt; output: text |
| License | Apache 2.0 (inherited from the base model) |
| Dataset | 1 194 Instagram memes manually explained + 3 582 synthetic explanations via instruction templates (4 776 total) |
| Training epochs | 2 |
| Hardware | 1 × NVIDIA L40S (48 GB) |
| Evaluation | BERTScore ≈ 0.73 on the meme-explanation test set |
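The BERTScore above can be recomputed with the Hugging Face evaluate library. The snippet below is only a minimal sketch of that metric; the authors' exact evaluation script and test split are not published, and the example strings are invented:

from evaluate import load

# Minimal BERTScore sketch (not the authors' evaluation script)
bertscore = load("bertscore")

predictions = ["El meme se ríe de la rutina usando la figura del huaso."]  # model outputs (invented)
references = ["Este meme usa al huaso para burlarse de la rutina diaria."]  # human explanations (invented)

scores = bertscore.compute(predictions=predictions, references=references, lang="es")
print(sum(scores["f1"]) / len(scores["f1"]))  # mean F1; the card reports ≈ 0.73 on the test set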

Model Description

llava-v1.6-mistral-7b-memes-chilenos-small adapts LLaVA‑Next to the highly contextual humour found in Chilean memes. Because no public corpus of memes explained in Chilean Spanish existed, we scraped 1 194 image‑text posts from popular Chilean Instagram meme pages, wrote human explanations, and expanded the corpus with 3 582 synthetic examples (4 776 in total) generated with GPT‑4o and command-r-plus (Cohere) using four instruction‑tuning templates:

  1. “Explica qué significa este meme en Chile, usando lenguaje coloquial…” (“Explain what this meme means in Chile, using colloquial language…”)
  2. “Explica cómo entendería este meme alguien que vive en Chile…” (“Explain how someone living in Chile would understand this meme…”)
  3. “Describe por qué este meme sería gracioso o significativo para un chileno…” (“Describe why this meme would be funny or meaningful to a Chilean…”)
  4. “Imagina que le explicas este meme a alguien extranjero…” (“Imagine you are explaining this meme to a foreigner…”)

The result is a compact model that can describe why a meme is funny for a Chilean audience, though it still struggles with extremely time‑sensitive or highly niche references.

During both training and inference, we used the following system prompt to guide the model’s behavior and cultural framing: “Eres experto en memes chilenos. Observa la imagen y, si hay texto, interprétalo sin repetirlo. Analiza su sentido usando contexto cultural chileno. Responde según la instrucción.” (“You are an expert in Chilean memes. Look at the image and, if there is text, interpret it without repeating it. Analyze its meaning using Chilean cultural context. Respond according to the instruction.”)
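To illustrate how the system prompt and templates combine into training examples, here is a minimal sketch; the record structure, field names, and helper function are assumptions for illustration, not the actual data pipeline:

SYSTEM_PROMPT = (
    "Eres experto en memes chilenos. Observa la imagen y, si hay texto, "
    "interprétalo sin repetirlo. Analiza su sentido usando contexto cultural "
    "chileno. Responde según la instrucción."
)

# The first template is shown in full (it also appears in the inference example
# below); the others are truncated exactly as in this card.
TEMPLATES = [
    "Explica qué significa este meme en Chile, usando lenguaje coloquial chileno.",
    "Explica cómo entendería este meme alguien que vive en Chile…",
    "Describe por qué este meme sería gracioso o significativo para un chileno…",
    "Imagina que le explicas este meme a alguien extranjero…",
]

def expand_meme(image_path, explanation):
    """Turn one annotated meme into one instruction example per template
    (hypothetical structure; the real pipeline is not published here)."""
    return [
        {"image": image_path, "system": SYSTEM_PROMPT,
         "instruction": template, "response": explanation}
        for template in TEMPLATES
    ]

# Each of the 1 194 annotated memes expands into several instruction variants,
# yielding the 4 776 examples reported above.
examples = expand_meme("meme.jpg", "Este meme hace humor con ...")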


Bias, Risks & Limitations

  • Temporal drift – many memes reference current events; explanations may become dated quickly.
  • Subjectivity of humour – what is “funny” varies; the model reflects the curators’ viewpoint.
  • Dataset biases – Instagram accounts skew toward urban, younger demographics; regional slang may be under‑represented.
  • Vision safety – the training data was not filtered for potentially offensive or unsafe imagery.

Recommendations

Always present model outputs with a disclaimer that humour is subjective and culturally bound. Human review is recommended before publishing explanations.


How to Get Started

from transformers import LlavaNextForConditionalGeneration, AutoProcessor
from peft import PeftModel
import torch
from PIL import Image

BASE_MODEL = "llava-hf/llava-v1.6-mistral-7b-hf"
ADAPTER_ID = "AndresSebad/llava-v1.6-mistral-7b-memes-chilenos-small"

device = "cuda"  # or "cpu" if no GPU is available

# Load the frozen base checkpoints in FP16 and attach the LoRA adapters
base_model = LlavaNextForConditionalGeneration.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.float16,
    device_map="auto"
)
model = PeftModel.from_pretrained(base_model, ADAPTER_ID)

# device_map="auto" has already placed the weights, so avoid an extra
# model.to(device), which can conflict with accelerate-dispatched modules
model.eval()

processor = AutoProcessor.from_pretrained(BASE_MODEL)
processor.tokenizer.padding_side = "right"

img = Image.open("meme.jpg")
conversation = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Eres experto en memes chilenos. Observa la imagen y, si hay texto, interprétalo sin repetirlo. Analiza su sentido usando contexto cultural chileno. Responde según la instrucción."},
        {"type": "text", "text": "Explica qué significa este meme en Chile, usando lenguaje coloquial chileno."}
    ]
}]

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(text=prompt, images=[img], return_tensors="pt").to(device)

out_ids = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens (skip the prompt)
print(processor.decode(out_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
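For deployment, the adapter can optionally be merged into the base weights using PEFT's standard merge_and_unload API, which removes the adapter indirection at inference time. A merged checkpoint is not published for this model, and the output path below is illustrative:

# Optional: fold the LoRA weights into the base model for adapter-free inference
merged = model.merge_and_unload()
merged.save_pretrained("llava-memes-chilenos-merged")    # illustrative path
processor.save_pretrained("llava-memes-chilenos-merged")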

Training Details

Training data

  • 1 194 real Chilean memes (image + caption + human explanation).
  • 3 582 synthetic examples generated with GPT‑4o and command-r-plus (Cohere) using the templates above.

Pre‑processing

  • Images resized to 336 × 336.
  • Mixed‑precision (fp16) training.

Hyper‑parameters

| Param | Value |
| --- | --- |
| LoRA rank r | 8 |
| LoRA α | 8 |
| LoRA dropout | 0.1 |
| Epochs | 2 |
| Batch size | 1 (gradient accumulation 8) |
| Learning rate | 1 × 10⁻⁴ → 2 × 10⁻⁵ (linear decay) |
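A PEFT LoraConfig matching these values could look as follows; this is a hedged reconstruction rather than the authors' actual training script. In PEFT, target_modules="all-linear" selects every linear layer except the output head, matching the description in Model Details:

from peft import LoraConfig

# Reconstruction of the reported LoRA setup (assumed, not the exact config used)
lora_config = LoraConfig(
    r=8,                          # LoRA rank
    lora_alpha=8,                 # LoRA α
    lora_dropout=0.1,
    target_modules="all-linear",  # all linear layers except lm_head
    task_type="CAUSAL_LM",
)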

Environmental Impact

Estimated total location-based greenhouse gas emissions were 190.96 grams CO₂eq for fine-tuning.

Training was performed on a single NVIDIA L40S GPU over approximately 1.45 hours, tracked using the CodeCarbon library.

No renewable-energy matching or carbon offsets were applied. Based on IP geolocation, training is estimated to have taken place in a datacenter in Virginia, USA.

Note: Results are approximate and based on available metadata.
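This kind of tracking follows CodeCarbon's standard start/stop pattern; the snippet below is a minimal sketch, not the project's actual training script (the project name is illustrative):

from codecarbon import EmissionsTracker

tracker = EmissionsTracker(project_name="llava-memes-chilenos")  # name is illustrative
tracker.start()
# ... run fine-tuning here ...
emissions_kg = tracker.stop()  # estimated emissions in kg CO2eq
print(f"{emissions_kg * 1000:.2f} g CO2eq")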


Example 1 – “18 de Septiembre”

[Example 1 meme image]

Base LLaVA‑Next:

La imagen muestra un perro con una sombrilla y una camisa, con un texto que dice "Me rehuso a ser controlado por un calendario. Feliz 18 a todos". Este meme parece estar utilizando la expresión "Me rehuso a ser controlado por un calendario" para expresar una actitud de resistencia a la rigidez o estructura de un calendario, lo que podría ser una crítica a la vida rutinaria o a la idea de que la vida debe ser organizada de manera estricta.

En Chile, la expresión "Me rehuso a ser controlado por un calendario" es una forma coloquial de expresar una actitud de no querer ser limitado o controlado por las normas o estructuras de una sociedad. Es una forma de expresar una actitud de libertad y no querer ser imponible o controlado por otros.

En el contexto cultural chileno, este meme podría estar criticando la idea de que la vida debe ser organizada de manera estricta...

Fine‑tuned:

Este meme muestra a un perrito vestido como huaso, con su chupalla y todo, y la frase "Me rehuso a ser controlado por un calendario. Feliz 18 a todos". La talla es que el perrito está jugando a ser un huaso, como si fuera parte de las celebraciones del 18 de septiembre, pero en realidad no está dispuesto a seguir las normas o rutinas, como si estuviera "controlado" por el calendario. Es una forma divertida de decir que no quiere estar "atado" a las cosas formales o predecibles, y prefiere seguir su propio ritmo.

Example 2 – “No estoy entendiendo...”

[Example 2 meme image]

Base LLaVA‑Next:

La imagen muestra a un perro sentado en un escritorio con varios libros y un lápiz, y hay un texto que dice "no estoi entendiendo ni una wea". Este meme es una parodia de la expresión "no estoy entendiendo ni una wea", que en Chile se utiliza para indicar que alguien no está comprendiendo o no está interesado en una situación o conversación.

En este caso, el meme se utiliza para representar a un perro que no está comprendiendo o no está interesado en una situación o conversación, lo que es humorístico y se relaciona con la idea de que los perros no tienen la capacidad de comprender o de tener interés en las cosas que los humanos hacen.

En Chile, la expresión "no estoy entendiendo ni una wea" es muy común y se utiliza en muchas situaciones para indicar que alguien no está comprendiendo o no está interesado en una situación o conversación. La palabra "wea" es una exp...

Fine‑tuned:

La imagen muestra a un perro sentado en una escritura, como si fuera un estudiante en la U, con cara de estar chato. El texto dice "no estoi entendiendo ni una wea". Es como cuando estás en clases y no cachas ni una, pero igual te quedas ahí, como si estuvieras en modo "zombie".

Technical Specifications

  • Architecture – Mistral‑7B decoder with multimodal projector; unchanged aside from LoRA adapters.
  • Compute – fine‑tuned on a single GPU; no distributed or multi‑node training.

Citation

BibTeX

@software{llava_memes_chile_2025,
  author       = {Somos NLP Hackathon 2025},
  title        = {llava-v1.6-mistral-7b-memes-chilenos-small},
  year         = 2025,
  publisher    = {Hugging Face},
  url          = {https://huggingface.co/somosnlp-hackathon-2025/llava-v1.6-mistral-7b-memes-chilenos-small}
}

Glossary

  • LoRA – Low‑Rank Adaptation; inserts lightweight trainable matrices.
  • Instruction‑tuning – fine‑tuning on instruction–response pairs, here generated from diverse prompt templates, to improve alignment.
  • Mistral‑7B – a 7‑billion‑parameter transformer decoder.
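For reference, the standard LoRA formulation (Hu et al., 2021) keeps each pretrained weight matrix W frozen and trains only a low-rank update:

W' = W + (α/r) · B·A, with B ∈ ℝ^{d×r}, A ∈ ℝ^{r×k}, and rank r ≪ min(d, k)

With r = 8 and α = 8 here, the scaling factor α/r equals 1, and only the small A and B matrices are trained.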

Hackathon: This model was developed for the Somos NLP Hackathon 2025.
