Llama2-TIFA-AWQ

Model Description

Llama2-TIFA-AWQ is a fine-tuned, AWQ-quantized version of tifa-benchmark/llama2_tifa_question_generation. It addresses structural limitations in the original TIFA question generation model and reduces inference time from roughly 20 seconds (full precision) to roughly 8 seconds. Structural refinement fixes the output format, and AWQ quantization supplies the speedup.

Key Innovation: Fixing Structural Issues

This model refines an existing TIFA-specialized model rather than training one from scratch:

  • Original model strength: Deep TIFA domain knowledge and question generation capabilities
  • Original model issues:
    • Generated multiple questions for the same attribute
    • Lacked negative verification questions
  • This solution: Structural fine-tuning that enforces a 4-question format while preserving domain expertise

AWQ Quantization: Performance Optimization

The model weights are quantized with AWQ (Activation-aware Weight Quantization) to reduce inference time.

Performance Comparison

  • SmolLM2 models: ~3 seconds (baseline small models)
  • Original LLaMA 2: ~20 seconds (full precision)
  • Llama2-TIFA-AWQ: ~8 seconds (2.5x faster than full precision)
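The speedup figure follows directly from the two reported latencies, and latency depends heavily on hardware, so it is worth re-measuring locally. The sketch below is a minimal illustration, not the benchmark used for the numbers above: `time_call` and `speedup` are hypothetical helpers, and the timed workload is a stand-in to be replaced with a real generation call.

```python
import statistics
import time
from typing import Callable, List


def time_call(fn: Callable[[], object], repeats: int = 3) -> List[float]:
    """Run `fn` `repeats` times and return the wall-clock seconds of each run."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return timings


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


# Figures reported above: ~20 s (full precision) vs ~8 s (AWQ).
print(speedup(20.0, 8.0))  # 2.5

# Stand-in workload (assumption); swap in a real generation call to measure the model.
timings = time_call(lambda: sum(range(100_000)))
print(f"median latency: {statistics.median(timings):.4f} s")
```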

Intended Use

This model generates exactly 4 structured visual verification questions for text-to-image evaluation:

  • Mixed question types: Both yes/no and multiple choice questions
  • Comprehensive coverage: Colors, shapes, objects, materials, spatial relationships
  • Balanced verification: Both positive presence and negative absence testing
  • Controlled structure: Exactly 4 questions without redundancy
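In a TIFA-style evaluation, the generated questions are answered by a VQA model looking at the generated image, and faithfulness is the fraction answered correctly. The sketch below shows only that scoring step under stated assumptions: `faithfulness_score` is a hypothetical helper, the dictionary layout is illustrative, and the VQA answers are made up for the example.

```python
from typing import Dict, List


def faithfulness_score(questions: List[Dict[str, str]], vqa_answers: List[str]) -> float:
    """Fraction of questions whose VQA answer matches the expected answer."""
    if not questions:
        raise ValueError("at least one question is required")
    if len(questions) != len(vqa_answers):
        raise ValueError("one VQA answer is required per question")
    correct = sum(
        q["answer"].strip().lower() == a.strip().lower()
        for q, a in zip(questions, vqa_answers)
    )
    return correct / len(questions)


# Expected answers for "a lighthouse overlooking the ocean".
questions = [
    {"question": "What type of structure is prominently featured in the image?", "answer": "lighthouse"},
    {"question": "What body of water is the lighthouse overlooking?", "answer": "ocean"},
    {"question": "Are there any mountains visible in the scene?", "answer": "no"},
    {"question": "Is the lighthouse positioned to overlook a body of water?", "answer": "yes"},
]

# Hypothetical VQA outputs: the image wrongly contains mountains.
vqa_answers = ["lighthouse", "ocean", "yes", "yes"]
print(faithfulness_score(questions, vqa_answers))  # 0.75
```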

Model Details

  • Base Model: tifa-benchmark/llama2_tifa_question_generation (LLaMA 2 architecture)
  • Model Size: ~7B parameters (AWQ quantized)
  • Fine-tuning Method: LoRA (Low-Rank Adaptation) for structural refinement
  • Quantization: AWQ (Activation-aware Weight Quantization)
  • Training Framework: Transformers + TRL + PEFT
  • License: apache-2.0

Training Details

Structural Refinement Configuration

  • Training Method: Supervised Fine-Tuning with LoRA on pre-trained TIFA model

  • LoRA Configuration:

    • r: 32
    • lora_alpha: 64
    • lora_dropout: 0.05
    • Target modules: ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
  • Training Parameters:

    • Epochs: 2
    • Learning Rate: 3e-4
    • Batch Size: 6
    • Gradient Accumulation: 5 steps
    • Max Sequence Length: 768
    • LR Scheduler: Cosine with 5% warmup
    • Precision: FP16
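The LoRA settings above can be collected into keyword arguments for PEFT's `LoraConfig`. This is a sketch of how the listed values fit together, not the training script itself; `task_type` is an assumption, since the card does not state it.

```python
# Values copied from the LoRA configuration listed above.
lora_kwargs = {
    "r": 32,
    "lora_alpha": 64,
    "lora_dropout": 0.05,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    "task_type": "CAUSAL_LM",  # assumption: not stated in the card
}

# By default, LoRA scales the low-rank update by lora_alpha / r.
scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]
print(scaling)  # 2.0

# With PEFT installed, the dict could be passed on like this:
# from peft import LoraConfig
# config = LoraConfig(**lora_kwargs)
```

Targeting all attention and MLP projections, as listed, adapts every linear layer in each transformer block rather than only the attention weights.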

Enhanced Dataset

  • Size: 18,000 examples with structured 4-question format
  • Focus: Teaching proper question structure and negative verification
  • Validation: Category-balanced split ensuring robust evaluation
  • Format: Examples preprocessed into text using the LLaMA 2 chat template
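The batch size, gradient accumulation, and dataset size imply a rough optimizer-step budget. The arithmetic below is an upper-bound estimate under two assumptions that the card does not confirm: training ran on a single device, and all 18,000 examples were used for training (the validation split would make the real numbers smaller).

```python
import math

examples = 18_000        # dataset size from the card; assumes none held out
per_device_batch = 6
grad_accum_steps = 5
epochs = 2
warmup_percent = 5       # "Cosine with 5% warmup"

# Examples consumed per optimizer step (single device assumed).
effective_batch = per_device_batch * grad_accum_steps
steps_per_epoch = math.ceil(examples / effective_batch)
total_steps = steps_per_epoch * epochs
warmup_steps = total_steps * warmup_percent // 100

print(effective_batch)   # 30
print(steps_per_epoch)   # 600
print(total_steps)       # 1200
print(warmup_steps)      # 60
```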

Usage

Installation

pip install transformers torch autoawq

Basic Usage

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

model_path = "kawchar85/Llama2-TIFA-AWQ"

# Load AWQ quantized model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, 
    torch_dtype=torch.float16,
    device_map="auto"
)

# Create the text-generation pipeline (inference takes ~8 seconds, as reported above)
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

# System prompt for TIFA question generation
system_msg = """\
You are a TIFA (Text-to-Image Faithfulness evaluation with question Answering) question generator. Given an image description, create exactly 4 visual verification questions with multiple choice answers. Each question should test different visual aspects that can be verified by looking at the image.

Guidelines:
- Focus on colors, shapes, objects, materials, spatial relationships, and other visually verifiable elements
- Mix yes/no questions (2 choices: "no", "yes") and multiple choice questions (4 choices)
- Each question should test a DIFFERENT aspect of the description
- Ensure questions can be answered by visual inspection of the image
- Use elements explicitly mentioned in the description
- Include both positive verification (testing presence, answer: "yes") and negative verification (testing absence, answer: "no")
- Make distractors realistic and relevant to the domain

Format each question as:
Q[number]: [question text]
C: [comma-separated choices]
A: [correct answer]

Generate questions that test visual faithfulness between the description and image."""

# Generate evaluation questions
description = "a lighthouse overlooking the ocean"
prompt = (
    "<s>[INST] <<SYS>>\n"
    f"{system_msg}\n"
    "<</SYS>>\n\n"
    f'Create 4 visual verification questions for this description: "{description}" [/INST]'
)

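# The pipeline returns the prompt followed by the completion,
# so slice off the prompt to keep only the generated questions.
output = pipe(prompt)[0]['generated_text']
response = output[len(prompt):]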
print(response)

Example Output

For "a lighthouse overlooking the ocean":

Q1: What type of structure is prominently featured in the image?
C: windmill, lighthouse, castle, tower
A: lighthouse

Q2: What body of water is the lighthouse overlooking?
C: lake, river, ocean, pond
A: ocean

Q3: Are there any mountains visible in the scene?
C: no, yes
A: no

Q4: Is the lighthouse positioned to overlook a body of water?
C: no, yes
A: yes
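Because the output follows a fixed Q/C/A layout, it can be turned into structured records with a small parser. The sketch below is one possible approach rather than part of the model's tooling: `parse_questions` is a hypothetical helper, and it assumes that no individual choice contains a comma.

```python
import re
from typing import Any, Dict, List

# One question block: a "Q<n>:" line, a "C:" line, and an "A:" line.
QUESTION_RE = re.compile(
    r"Q(?P<number>\d+):\s*(?P<question>.+?)\s*\n"
    r"C:\s*(?P<choices>.+?)\s*\n"
    r"A:\s*(?P<answer>.+?)\s*(?:\n|$)"
)


def parse_questions(text: str) -> List[Dict[str, Any]]:
    """Extract every Q/C/A block from the model output as a dictionary."""
    parsed = []
    for match in QUESTION_RE.finditer(text):
        parsed.append({
            "number": int(match.group("number")),
            "question": match.group("question"),
            # Assumes choices are comma-separated and contain no commas themselves.
            "choices": [c.strip() for c in match.group("choices").split(",")],
            "answer": match.group("answer"),
        })
    return parsed


response = """Q1: What type of structure is prominently featured in the image?
C: windmill, lighthouse, castle, tower
A: lighthouse

Q2: What body of water is the lighthouse overlooking?
C: lake, river, ocean, pond
A: ocean

Q3: Are there any mountains visible in the scene?
C: no, yes
A: no

Q4: Is the lighthouse positioned to overlook a body of water?
C: no, yes
A: yes"""

questions = parse_questions(response)
print(len(questions))           # 4
print(questions[2]["choices"])  # ['no', 'yes']
```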

Advantages Over Original Model

Structural Improvements

  • Controlled output: Exactly 4 questions instead of variable numbers
  • No redundancy: Eliminates multiple questions per attribute
  • Negative verification: Includes proper negative questions (absent elements)
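These structural properties can be checked automatically on generated output. The following is a minimal sketch, assuming the questions have already been parsed into dictionaries with `question`, `choices`, and `answer` keys; `check_structure` is a hypothetical helper, and exact-text matching is only a crude proxy for redundancy.

```python
from typing import Any, Dict, List


def check_structure(questions: List[Dict[str, Any]]) -> List[str]:
    """Return a list of problems; an empty list means the set passes."""
    problems = []
    if len(questions) != 4:
        problems.append(f"expected 4 questions, got {len(questions)}")

    # Crude redundancy check: identical question text after normalization.
    texts = [q["question"].strip().lower() for q in questions]
    if len(set(texts)) != len(texts):
        problems.append("duplicate question text")

    for i, q in enumerate(questions, start=1):
        if len(q["choices"]) not in (2, 4):
            problems.append(f"Q{i}: expected 2 or 4 choices, got {len(q['choices'])}")
        if q["answer"] not in q["choices"]:
            problems.append(f"Q{i}: answer is not among the choices")

    # Negative verification: at least one yes/no question answered "no".
    yes_no = [q for q in questions if sorted(q["choices"]) == ["no", "yes"]]
    if not any(q["answer"] == "no" for q in yes_no):
        problems.append("no negative verification question")
    return problems


good = [
    {"question": "What type of structure is featured?",
     "choices": ["windmill", "lighthouse", "castle", "tower"], "answer": "lighthouse"},
    {"question": "What body of water is the lighthouse overlooking?",
     "choices": ["lake", "river", "ocean", "pond"], "answer": "ocean"},
    {"question": "Are there any mountains visible in the scene?",
     "choices": ["no", "yes"], "answer": "no"},
    {"question": "Is the lighthouse positioned to overlook a body of water?",
     "choices": ["no", "yes"], "answer": "yes"},
]
print(check_structure(good))      # []
print(check_structure(good[:2]))  # two problems: wrong count, no negative question
```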

Preserved Strengths

  • Domain expertise: Retains deep TIFA knowledge from original training
  • Question quality: Maintains high-quality question formulation
  • Visual focus: Strong emphasis on verifiable visual elements
  • Natural language: Fluent, well-formed question phrasing

Comparison with SmolLM2 Series

| Aspect            | Llama2-TIFA-AWQ                      | SmolLM2 Series                |
|-------------------|--------------------------------------|-------------------------------|
| Starting point    | TIFA-specialized model               | General instruction models    |
| Domain knowledge  | ✅ Pre-existing TIFA expertise       | ⭐ Learned during fine-tuning |
| Model size        | ~7B parameters (AWQ quantized)       | 135M - 1.7B parameters        |
| Inference speed   | ~8 seconds (AWQ optimized)           | ~3 seconds (small models)     |
| Training approach | Structural refinement + quantization | Full task learning            |
| Memory efficiency | ✅ AWQ quantized                     | ⭐ Naturally smaller          |
| Question quality  | ✅ Deep domain knowledge             | ⭐ Systematic structure       |

Citation

@misc{llama2-tifa-refined-2025,
  title={Llama2-TIFA: Structural Refinement of LLaMA 2 for Text-to-Image Faithfulness Assessment},
  author={kawchar85},
  year={2025},
  url={https://huggingface.co/kawchar85/Llama2-TIFA-AWQ},
  note={Fine-tuned from tifa-benchmark/llama2_tifa_question_generation}
}

Model Ecosystem

This model complements the broader TIFA question generation ecosystem:

Specialized TIFA models:

  • Llama2-TIFA (you are here): domain expert, refined structure

General→TIFA models:

  • SmolLM2 series (general instruction models fine-tuned for TIFA question generation)