Llama2-TIFA-AWQ

Model Description

Llama2-TIFA-AWQ is a fine-tuned, AWQ-quantized version of tifa-benchmark/llama2_tifa_question_generation. It addresses structural limitations in the original TIFA question generation model and runs roughly 2.5x faster than the full-precision LLaMA 2 baseline (about 8 seconds versus about 20 seconds in the comparison below). The model combines structural refinement with AWQ quantization to trade a small amount of precision for shorter inference time in TIFA question generation.

Key Innovation: Fixing Structural Issues

This model represents a refinement approach rather than training from scratch:

Original model strength: Deep TIFA domain knowledge and question generation capabilities
Original model issues:
- Generated multiple questions for the same attribute
- Lacked negative verification questions
This solution: Structural fine-tuning to enforce a fixed 4-question format while preserving domain expertise

AWQ Quantization: Performance Optimization

The model weights are quantized with AWQ (Activation-aware Weight Quantization) to reduce inference time:

Performance Comparison

SmolLM2 models: ~3 seconds (baseline small models)
Original LLaMA 2: ~20 seconds (full precision)
Llama2-TIFA-AWQ: ~8 seconds (2.5x faster than full precision)
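
Latency figures like these depend on hardware, prompt length, and generation settings, none of which are stated here. The following is a minimal sketch of how such a comparison could be reproduced; time_generation and speedup are hypothetical helper names, and the callable passed in stands for whatever generation call is being measured.

```python
import statistics
import time
from typing import Callable, List


def time_generation(generate: Callable[[], object], runs: int = 3) -> float:
    """Return the median wall-clock seconds over `runs` calls to `generate`."""
    durations: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        generate()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


# Figures quoted above: ~20 s at full precision vs ~8 s with AWQ.
print(speedup(20.0, 8.0))  # 2.5
```

In practice, generate would wrap a single call to the text-generation pipeline on a fixed prompt, so that both models are timed on identical input.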

Intended Use

This model generates exactly 4 structured visual verification questions for text-to-image evaluation:

Mixed question types: Both yes/no and multiple choice questions
Comprehensive coverage: Colors, shapes, objects, materials, spatial relationships
Balanced verification: Both positive presence and negative absence testing
Controlled structure: Exactly 4 questions without redundancy
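
The structural guarantees listed above can be checked mechanically on generated output. The sketch below assumes each question has already been parsed into a dict with "question", "choices", and "answer" keys; that schema and the check_structure name are illustrative assumptions, not part of the model's API.

```python
from typing import Dict, List

YES_NO = {"yes", "no"}


def check_structure(questions: List[Dict]) -> List[str]:
    """Return a list of problems; an empty list means the set looks well-formed."""
    problems = []
    if len(questions) != 4:
        problems.append(f"expected 4 questions, got {len(questions)}")

    yes_no_answers = set()
    has_multiple_choice = False
    for i, q in enumerate(questions, start=1):
        choices, answer = q["choices"], q["answer"]
        if answer not in choices:
            problems.append(f"Q{i}: answer {answer!r} is not among the choices")
        if set(choices) == YES_NO:
            yes_no_answers.add(answer)
        elif len(choices) == 4:
            has_multiple_choice = True
        else:
            problems.append(f"Q{i}: expected 2 (yes/no) or 4 choices, got {len(choices)}")

    if not has_multiple_choice:
        problems.append("no multiple choice question")
    if yes_no_answers != YES_NO:
        problems.append("missing a positive ('yes') or negative ('no') verification question")
    return problems


example = [
    {"question": "What type of structure is prominently featured in the image?",
     "choices": ["windmill", "lighthouse", "castle", "tower"], "answer": "lighthouse"},
    {"question": "What body of water is the lighthouse overlooking?",
     "choices": ["lake", "river", "ocean", "pond"], "answer": "ocean"},
    {"question": "Are there any mountains visible in the scene?",
     "choices": ["no", "yes"], "answer": "no"},
    {"question": "Is the lighthouse positioned to overlook a body of water?",
     "choices": ["no", "yes"], "answer": "yes"},
]
print(check_structure(example))  # []
```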

Model Details

Base Model: tifa-benchmark/llama2_tifa_question_generation (LLaMA 2 architecture)
Model Size: ~7B parameters (AWQ quantized)
Fine-tuning Method: LoRA (Low-Rank Adaptation) for structural refinement
Quantization: AWQ (Activation-aware Weight Quantization)
Training Framework: Transformers + TRL + PEFT
License: apache-2.0

Training Details

Structural Refinement Configuration

Training Method: Supervised Fine-Tuning with LoRA on pre-trained TIFA model
LoRA Configuration:
- r: 32
- lora_alpha: 64
- lora_dropout: 0.05
- Target modules: ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
Training Parameters:
- Epochs: 2
- Learning Rate: 3e-4
- Batch Size: 6
- Gradient Accumulation: 5 steps
- Max Sequence Length: 768
- LR Scheduler: Cosine with 5% warmup
- Precision: FP16
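
To give a sense of scale for the LoRA configuration above, the sketch below estimates the number of trainable adapter parameters. It assumes the standard Llama 2 7B dimensions (hidden size 4096, MLP size 11008, 32 decoder layers), which are not stated in this card, and it counts only the low-rank A and B matrices.

```python
# Assumed Llama 2 7B dimensions (not stated in this card).
HIDDEN = 4096
INTERMEDIATE = 11008
NUM_LAYERS = 32

# Values from the LoRA configuration above.
R = 32
LORA_ALPHA = 64

# (in_features, out_features) of each targeted projection in one decoder layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, r: int) -> int:
    """LoRA adds A (r x in_features) and B (out_features x r) per adapted layer."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, R) for i, o in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS
scaling = LORA_ALPHA / R  # factor applied to the low-rank update

print(per_layer)  # 2498560
print(total)      # 79953920
print(scaling)    # 2.0
```

Under these assumptions the adapters add about 80M trainable parameters, a little over 1% of the ~7B base model.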

Enhanced Dataset

Size: 18,000 examples with structured 4-question format
Focus: Teaching proper question structure and negative verification
Validation: Category-balanced split ensuring robust evaluation
Format: LLaMA 2 chat template applied to preprocessed text
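
The training parameters and dataset size together determine the optimizer schedule. The sketch below works through that arithmetic under two assumptions not confirmed by this card: training ran on a single device, and all 18,000 examples were used for training (the size of the validation split is not given). The lr_at function is a hypothetical illustration of linear warmup followed by cosine decay.

```python
import math

NUM_EXAMPLES = 18_000   # assumption: every example is used for training
BATCH_SIZE = 6
GRAD_ACCUM = 5
EPOCHS = 2
PEAK_LR = 3e-4
WARMUP_FRACTION = 0.05

effective_batch = BATCH_SIZE * GRAD_ACCUM                    # examples per optimizer step
steps_per_epoch = math.ceil(NUM_EXAMPLES / effective_batch)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = round(total_steps * WARMUP_FRACTION)


def lr_at(step: int) -> float:
    """Learning rate at an optimizer step: linear warmup, then cosine decay to 0."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(effective_batch)  # 30
print(steps_per_epoch)  # 600
print(total_steps)      # 1200
print(warmup_steps)     # 60
```

With these numbers the learning rate climbs to 3e-4 over the first 60 optimizer steps and then decays toward zero by step 1200.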

Usage

Installation

pip install transformers torch autoawq accelerate

Basic Usage

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

model_path = "kawchar85/Llama2-TIFA-AWQ"

# Load AWQ quantized model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, 
    torch_dtype=torch.float16,
    device_map="auto"
)

# Create a text-generation pipeline with sampling enabled
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

# System prompt for TIFA question generation
system_msg = """\
You are a TIFA (Text-to-Image Faithfulness evaluation with question Answering) question generator. Given an image description, create exactly 4 visual verification questions with multiple choice answers. Each question should test different visual aspects that can be verified by looking at the image.

Guidelines:
- Focus on colors, shapes, objects, materials, spatial relationships, and other visually verifiable elements
- Mix yes/no questions (2 choices: "no", "yes") and multiple choice questions (4 choices)
- Each question should test a DIFFERENT aspect of the description
- Ensure questions can be answered by visual inspection of the image
- Use elements explicitly mentioned in the description
- Include both positive verification (testing presence, answer: "yes") and negative verification (testing absence, answer: "no")
- Make distractors realistic and relevant to the domain

Format each question as:
Q[number]: [question text]
C: [comma-separated choices]
A: [correct answer]

Generate questions that test visual faithfulness between the description and image."""

# Generate evaluation questions
description = "a lighthouse overlooking the ocean"
prompt = (
    "<s>[INST] <<SYS>>\n"
    f"{system_msg}\n"
    "<</SYS>>\n\n"
    f'Create 4 visual verification questions for this description: "{description}" [/INST]'
)

output = pipe(prompt)[0]['generated_text']
response = output[len(prompt):]
print(response)

Example Output

For "a lighthouse overlooking the ocean":

Q1: What type of structure is prominently featured in the image?
C: windmill, lighthouse, castle, tower
A: lighthouse

Q2: What body of water is the lighthouse overlooking?
C: lake, river, ocean, pond
A: ocean

Q3: Are there any mountains visible in the scene?
C: no, yes
A: no

Q4: Is the lighthouse positioned to overlook a body of water?
C: no, yes
A: yes
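
Because the output follows a fixed Q/C/A layout, it is straightforward to turn into structured data for a downstream VQA step. The following is a minimal sketch; parse_questions and the dict keys are hypothetical names, and splitting choices on commas assumes that no individual choice contains a comma.

```python
import re
from typing import Dict, List


def parse_questions(text: str) -> List[Dict]:
    """Parse 'Q<n>: ... / C: ... / A: ...' blocks into a list of dicts."""
    questions: List[Dict] = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        match = re.match(r"Q(\d+):\s*(.*)", line)
        if match:
            # A new question starts; later C:/A: lines attach to it.
            current = {
                "number": int(match.group(1)),
                "question": match.group(2),
                "choices": [],
                "answer": None,
            }
            questions.append(current)
        elif current is not None and line.startswith("C:"):
            current["choices"] = [c.strip() for c in line[2:].split(",") if c.strip()]
        elif current is not None and line.startswith("A:"):
            current["answer"] = line[2:].strip()
    return questions


sample = """\
Q1: What type of structure is prominently featured in the image?
C: windmill, lighthouse, castle, tower
A: lighthouse

Q2: What body of water is the lighthouse overlooking?
C: lake, river, ocean, pond
A: ocean

Q3: Are there any mountains visible in the scene?
C: no, yes
A: no

Q4: Is the lighthouse positioned to overlook a body of water?
C: no, yes
A: yes
"""

parsed = parse_questions(sample)
print(len(parsed))                    # 4
print([q["answer"] for q in parsed])  # ['lighthouse', 'ocean', 'no', 'yes']
```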

Advantages Over Original Model

Structural Improvements

Controlled output: Exactly 4 questions instead of variable numbers
No redundancy: Eliminates multiple questions per attribute
Negative verification: Includes proper negative questions (absent elements)

Preserved Strengths

Domain expertise: Retains deep TIFA knowledge from original training
Question quality: Maintains high-quality question formulation
Visual focus: Strong emphasis on verifiable visual elements
Natural language: Fluent, natural-sounding question phrasing

Comparison with SmolLM2 Series

Aspect	Llama2-TIFA-AWQ	SmolLM2 Series
Starting point	TIFA-specialized model	General instruction models
Domain knowledge	✅ Pre-existing TIFA expertise	⭐ Learned during fine-tuning
Model size	~7B parameters (AWQ quantized)	135M - 1.7B parameters
Inference time	~8 seconds (AWQ optimized)	~3 seconds (small models)
Training approach	Structural refinement + quantization	Full task learning
Memory efficiency	✅ AWQ quantized	⭐ Naturally smaller
Question quality	✅ Deep domain knowledge	⭐ Systematic structure

Citation

@misc{llama2-tifa-refined-2025,
  title={Llama2-TIFA: Structural Refinement of LLaMA 2 for Text-to-Image Faithfulness Assessment},
  author={kawchar85},
  year={2025},
  url={https://huggingface.co/kawchar85/Llama2-TIFA-AWQ},
  note={Fine-tuned from tifa-benchmark/llama2_tifa_question_generation}
}

Model Ecosystem

This model complements the broader TIFA question generation ecosystem:

Specialized TIFA models:

Llama2-TIFA ← You are here (Domain expert, refined structure)

General→TIFA models:

SmolLM2-135M-Instruct-TIFA: Compact version
SmolLM2-360M-Instruct-TIFA: Balanced version
SmolLM2-1.7B-Instruct-TIFA: Structured version
SmolLM2-1.7B-Instruct-TIFA-Random: Flexible version
