Llama2-TIFA-AWQ
Model Description
Llama2-TIFA-AWQ is a fine-tuned, AWQ-quantized version of tifa-benchmark/llama2_tifa_question_generation. Fine-tuning corrects structural limitations in the output of the original TIFA question generation model, and AWQ quantization cuts inference time from about 20 seconds to about 8 seconds while keeping the original model's TIFA question generation ability.
Key Innovation: Fixing Structural Issues
This model represents a refinement approach rather than training from scratch:
- Original model strength: Deep TIFA domain knowledge and question generation capabilities
- Original model issues:
  - Generated multiple questions for the same attribute
  - Lacked negative verification questions
- This solution: Structural fine-tuning to enforce a 4-question format while preserving domain expertise
AWQ Quantization: Performance Optimization
This model is quantized with AWQ (Activation-aware Weight Quantization) to reduce inference time:
Performance Comparison
- SmolLM2 models: ~3 seconds (baseline small models)
- Original LLaMA 2: ~20 seconds (full precision)
- Llama2-TIFA-AWQ: ~8 seconds (2.5x faster than full precision)
Intended Use
This model generates exactly 4 structured visual verification questions for text-to-image evaluation:
- Mixed question types: Both yes/no and multiple choice questions
- Comprehensive coverage: Colors, shapes, objects, materials, spatial relationships
- Balanced verification: Both positive presence and negative absence testing
- Controlled structure: Exactly 4 questions without redundancy
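The constraints listed above can be checked mechanically once a generated response has been parsed. The following is a minimal sketch, not part of the released model: the function name `check_question_set` and the dictionary layout (`question`, `choices`, `answer` keys) are hypothetical, and the positive/negative rule reflects one reading of "balanced verification" (at least one yes/no question answered "yes" and one answered "no").

```python
def check_question_set(questions):
    """Return a list of problems found in a parsed question set (empty if none).

    `questions` is a list of dicts with "question", "choices", and "answer"
    keys. This layout is an assumption made for illustration.
    """
    problems = []

    # Controlled structure: exactly 4 questions.
    if len(questions) != 4:
        problems.append(f"expected 4 questions, got {len(questions)}")

    # Mixed question types: yes/no (2 choices) and multiple choice (4 choices).
    yes_no = [q for q in questions if sorted(q["choices"]) == ["no", "yes"]]
    multi = [q for q in questions if len(q["choices"]) == 4]
    if not yes_no or not multi:
        problems.append("expected a mix of yes/no and 4-choice questions")

    # Balanced verification: both a positive ("yes") and a negative ("no") check.
    if yes_no and {q["answer"] for q in yes_no} != {"yes", "no"}:
        problems.append("expected both a 'yes' and a 'no' verification question")

    # Every answer must be one of its own choices.
    for i, q in enumerate(questions, start=1):
        if q["answer"] not in q["choices"]:
            problems.append(f"Q{i}: answer {q['answer']!r} is not among the choices")

    # No redundancy: identical question text must not repeat.
    texts = [q["question"].strip().lower() for q in questions]
    if len(set(texts)) != len(texts):
        problems.append("duplicate question text")

    return problems


sample = [
    {"question": "What type of structure is prominently featured in the image?",
     "choices": ["windmill", "lighthouse", "castle", "tower"], "answer": "lighthouse"},
    {"question": "What body of water is the lighthouse overlooking?",
     "choices": ["lake", "river", "ocean", "pond"], "answer": "ocean"},
    {"question": "Are there any mountains visible in the scene?",
     "choices": ["no", "yes"], "answer": "no"},
    {"question": "Is the lighthouse positioned to overlook a body of water?",
     "choices": ["no", "yes"], "answer": "yes"},
]
print(check_question_set(sample))
```

A check like this only verifies structure; it does not judge whether a question is semantically redundant or visually answerable.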
Model Details
- Base Model: tifa-benchmark/llama2_tifa_question_generation (LLaMA 2 architecture)
- Model Size: ~7B parameters (AWQ quantized)
- Fine-tuning Method: LoRA (Low-Rank Adaptation) for structural refinement
- Quantization: AWQ (Activation-aware Weight Quantization)
- Training Framework: Transformers + TRL + PEFT
- License: apache-2.0
Training Details
Structural Refinement Configuration
Training Method: Supervised fine-tuning with LoRA on the pre-trained TIFA model
LoRA Configuration:
- r: 32
- lora_alpha: 64
- lora_dropout: 0.05
- Target modules: ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
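To give a sense of scale for this configuration, the sketch below estimates the LoRA scaling factor and the number of trainable adapter parameters. The layer dimensions (hidden size 4096, MLP intermediate size 11008, 32 decoder layers) are an assumption based on the standard LLaMA 2 7B architecture and are not stated in this card. The count covers adapter weights only.

```python
# LoRA hyperparameters from the configuration above.
LORA_R = 32
LORA_ALPHA = 64
TARGET_MODULES = ["q_proj", "k_proj", "v_proj", "o_proj",
                  "gate_proj", "up_proj", "down_proj"]

# ASSUMPTION: standard LLaMA 2 7B dimensions (not stated in this card).
HIDDEN = 4096
INTERMEDIATE = 11008
NUM_LAYERS = 32

# (in_features, out_features) of each targeted linear layer.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(in_features, out_features, r):
    """LoRA adds A (r x in_features) and B (out_features x r) per adapted layer."""
    return r * (in_features + out_features)


# The low-rank update is multiplied by alpha / r before being added to the weights.
scaling = LORA_ALPHA / LORA_R

per_layer = sum(lora_param_count(*MODULE_SHAPES[name], LORA_R)
                for name in TARGET_MODULES)
total_trainable = per_layer * NUM_LAYERS

print(f"LoRA scaling (alpha / r): {scaling}")
print(f"Adapter parameters per decoder layer: {per_layer:,}")
print(f"Total adapter parameters: {total_trainable:,}")
```

Under these assumptions the adapters add roughly 80M trainable parameters, a small fraction of the ~7B base weights.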
Training Parameters:
- Epochs: 2
- Learning Rate: 3e-4
- Batch Size: 6
- Gradient Accumulation: 5 steps
- Max Sequence Length: 768
- LR Scheduler: Cosine with 5% warmup
- Precision: FP16
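The parameters above imply an effective batch size and a learning-rate curve, sketched below in plain Python. Two things are assumptions: training on a single device, and using all 18,000 examples (the dataset size reported in this card) for training, although the card also mentions a validation split of unstated size. The schedule function follows the common linear-warmup-then-cosine-decay shape and may differ in detail from the scheduler used in the actual training run.

```python
import math

# Values from the training parameters above.
LEARNING_RATE = 3e-4
BATCH_SIZE = 6
GRAD_ACCUM_STEPS = 5
EPOCHS = 2
WARMUP_RATIO = 0.05

# ASSUMPTION: all examples are used for training on a single device.
NUM_TRAIN_EXAMPLES = 18_000

effective_batch = BATCH_SIZE * GRAD_ACCUM_STEPS
steps_per_epoch = math.ceil(NUM_TRAIN_EXAMPLES / effective_batch)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = round(total_steps * WARMUP_RATIO)


def lr_at(step):
    """Learning rate at an optimizer step: linear warmup, then cosine decay to 0."""
    if step < warmup_steps:
        return LEARNING_RATE * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return LEARNING_RATE * 0.5 * (1.0 + math.cos(math.pi * progress))


print(f"effective batch size: {effective_batch}")
print(f"optimizer steps: {total_steps} (warmup: {warmup_steps})")
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(f"step {step:>4}: lr = {lr_at(step):.2e}")
```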
Enhanced Dataset
- Size: 18,000 examples with structured 4-question format
- Focus: Teaching proper question structure and negative verification
- Validation: Category-balanced split ensuring robust evaluation
- Format: LLaMA 2 chat template with preprocessed text format
Usage
Installation
pip install transformers torch autoawq
Basic Usage
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch
model_path = "kawchar85/Llama2-TIFA-AWQ"
# Load AWQ quantized model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto",
)
# Create a text-generation pipeline with sampling enabled
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
# System prompt for TIFA question generation
system_msg = """\
You are a TIFA (Text-to-Image Faithfulness evaluation with question Answering) question generator. Given an image description, create exactly 4 visual verification questions with multiple choice answers. Each question should test different visual aspects that can be verified by looking at the image.
Guidelines:
- Focus on colors, shapes, objects, materials, spatial relationships, and other visually verifiable elements
- Mix yes/no questions (2 choices: "no", "yes") and multiple choice questions (4 choices)
- Each question should test a DIFFERENT aspect of the description
- Ensure questions can be answered by visual inspection of the image
- Use elements explicitly mentioned in the description
- Include both positive verification (testing presence, answer: "yes") and negative verification (testing absence, answer: "no")
- Make distractors realistic and relevant to the domain
Format each question as:
Q[number]: [question text]
C: [comma-separated choices]
A: [correct answer]
Generate questions that test visual faithfulness between the description and image."""
# Generate evaluation questions
description = "a lighthouse overlooking the ocean"
prompt = (
    "<s>[INST] <<SYS>>\n"
    f"{system_msg}\n"
    "<</SYS>>\n\n"
    f'Create 4 visual verification questions for this description: "{description}" [/INST]'
)
output = pipe(prompt)[0]['generated_text']
# The pipeline returns the prompt followed by the completion; keep only the completion
response = output[len(prompt):].strip()
print(response)
Example Output
For "a lighthouse overlooking the ocean":
Q1: What type of structure is prominently featured in the image?
C: windmill, lighthouse, castle, tower
A: lighthouse
Q2: What body of water is the lighthouse overlooking?
C: lake, river, ocean, pond
A: ocean
Q3: Are there any mountains visible in the scene?
C: no, yes
A: no
Q4: Is the lighthouse positioned to overlook a body of water?
C: no, yes
A: yes
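Because the output follows the fixed Q/C/A layout shown above, it can be turned into structured data with a few lines of Python. This is a minimal sketch rather than an official parser: the function name `parse_questions` and the returned dictionary keys are hypothetical, and it assumes each of the `Q`, `C`, and `A` fields sits on its own line and that no single choice contains a comma.

```python
import re


def parse_questions(text):
    """Parse 'Q<n>: ... / C: ... / A: ...' blocks into a list of dicts."""
    questions = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        header = re.match(r"Q(\d+):\s*(.*)", line)
        if header:
            # A new question starts; later C:/A: lines attach to it.
            current = {
                "number": int(header.group(1)),
                "question": header.group(2),
                "choices": [],
                "answer": None,
            }
            questions.append(current)
        elif current is not None and line.startswith("C:"):
            # Choices are comma-separated; a choice containing a comma would be split.
            current["choices"] = [c.strip() for c in line[2:].split(",") if c.strip()]
        elif current is not None and line.startswith("A:"):
            current["answer"] = line[2:].strip()
    return questions


example_output = """\
Q1: What type of structure is prominently featured in the image?
C: windmill, lighthouse, castle, tower
A: lighthouse
Q2: What body of water is the lighthouse overlooking?
C: lake, river, ocean, pond
A: ocean
Q3: Are there any mountains visible in the scene?
C: no, yes
A: no
Q4: Is the lighthouse positioned to overlook a body of water?
C: no, yes
A: yes
"""

for q in parse_questions(example_output):
    print(q["number"], q["question"], q["choices"], q["answer"])
```

Since sampling is enabled in the usage example, a response may occasionally deviate from the format, so downstream code should handle questions whose `choices` list is empty or whose `answer` is `None`.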
Advantages Over Original Model
Structural Improvements
- Controlled output: Exactly 4 questions instead of variable numbers
- No redundancy: Eliminates multiple questions per attribute
- Negative verification: Includes proper negative questions (absent elements)
Preserved Strengths
- Domain expertise: Retains deep TIFA knowledge from original training
- Question quality: Maintains high-quality question formulation
- Visual focus: Strong emphasis on verifiable visual elements
- Natural language: Fluent question phrasing carried over from the original model
Comparison with SmolLM2 Series
Aspect | Llama2-TIFA-AWQ | SmolLM2 Series |
---|---|---|
Starting point | TIFA-specialized model | General instruction models |
Domain knowledge | ✅ Pre-existing TIFA expertise | ⭐ Learned during fine-tuning |
Model size | ~7B parameters (AWQ quantized) | 135M - 1.7B parameters |
Inference speed | ~8 seconds (AWQ optimized) | ~3 seconds (small models) |
Training approach | Structural refinement + quantization | Full task learning |
Memory efficiency | ✅ AWQ quantized | ⭐ Naturally smaller |
Question quality | ✅ Deep domain knowledge | ⭐ Systematic structure |
Citation
@misc{llama2-tifa-refined-2025,
title={Llama2-TIFA: Structural Refinement of LLaMA 2 for Text-to-Image Faithfulness Assessment},
author={kawchar85},
year={2025},
url={https://huggingface.co/kawchar85/Llama2-TIFA-AWQ},
note={Fine-tuned from tifa-benchmark/llama2_tifa_question_generation}
}
Model Ecosystem
This model complements the broader TIFA question generation ecosystem:
Specialized TIFA models:
- Llama2-TIFA ← You are here (Domain expert, refined structure)
General→TIFA models:
- SmolLM2-135M-Instruct-TIFA: Compact version
- SmolLM2-360M-Instruct-TIFA: Balanced version
- SmolLM2-1.7B-Instruct-TIFA: Structured version
- SmolLM2-1.7B-Instruct-TIFA-Random: Flexible version