QwenStoryteller2

QwenStoryteller2 is an improved version of QwenStoryteller, fine-tuned using contrastive reinforcement learning with Direct Preference Optimization (DPO) to achieve superior entity re-identification and visual grounding in cross-frame storytelling scenarios.

Model Description

Base Model: QwenStoryteller (Qwen2.5-VL 7B)
Training Method: Contrastive Reinforcement Learning with Direct Preference Optimization (LoRA rank 2048, alpha 4096)
Training Dataset: StoryReasoningAdversarialDPO

QwenStoryteller2 builds upon the original QwenStoryteller by addressing critical limitations in cross-frame entity consistency through:

  • Contrastive Learning: Training on both real and synthetic negative story examples
  • Enhanced Entity Re-identification: Improved tracking of characters and objects across frames
  • Better Grounding: Superior alignment between narrative elements and visual entities
  • Reduced Hallucinations: More reliable entity connections and fewer spurious references

The model employs a dual-component reward function that promotes appropriate entity connections in coherent sequences while discouraging incorrect connections in synthetic arrangements.
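
For intuition, here is a minimal Python sketch of that dual-component reward, assuming equal weighting of the two components (the exact combination used in training is detailed under Training Methodology below; the weighting here is an assumption):

def story_reward(r_reid: float, r_ground: float, is_real_sequence: bool) -> float:
    # Entity re-identification is rewarded in coherent (real) sequences
    # and penalized in synthetic, deliberately incoherent ones.
    entity_term = r_reid if is_real_sequence else -r_reid
    # Grounding quality is rewarded in both cases.
    return entity_term + r_ground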

Key Improvements Over QwenStoryteller

  • Grounding Performance: mAP improved from 0.27 to 0.31 (+14.8%), F1 score from 0.35 to 0.41 (+17.1%)
  • Cross-frame Consistency: Character persistence across ≥5 frames increased from 37.7% to 49.3% (+30.8%)
  • Pronoun Grounding: Significant improvements across all pronoun types (he: 90.1%→99.1%, she: 91.1%→98.6%, they: 47.6%→68.8%)
  • Structural Quality: Well-structured stories increased from 79.1% to 97.5% (+23.3%)
  • Entity Tracking: Object persistence across ≥5 frames improved modestly from 20.9% to 21.3% (+1.9%)

System Prompt

The model was trained with the following system prompt, and we recommend using it for optimal performance:

You are an AI storyteller that can analyze sequences of images and create creative narratives. 
First think step-by-step to analyze characters, objects, settings, and narrative structure. 
Then create a grounded story that maintains consistent character identity and object references across frames. 
Use <think></think> tags to show your reasoning process before writing the final story.

Usage

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
from PIL import Image

# Load the model
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "daniel3303/QwenStoryteller2", torch_dtype="auto", device_map="auto"
)

# Load processor
processor = AutoProcessor.from_pretrained("daniel3303/QwenStoryteller2")

# Load images
images = [
    Image.open("image1.jpg"),
    Image.open("image2.jpg"),
    Image.open("image3.jpg"),
    Image.open("image4.jpg"),
    Image.open("image5.jpg")
]

# Create image content list
image_content = []
for img in images:
    image_content.append({
        "type": "image",
        "image": img,
    })

# Add text prompt at the end
image_content.append({"type": "text", "text": "Generate a story based on these images."})

# Create messages with system prompt
messages = [
    {
        "role": "system", 
        "content": "You are an AI storyteller that can analyze sequences of images and create creative narratives. First think step-by-step to analyze characters, objects, settings, and narrative structure. Then create a grounded story that maintains consistent character identity and object references across frames. Use <think></think> tags to show your reasoning process before writing the final story."
    },
    {
        "role": "user",
        "content": image_content,
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(
    **inputs, 
    max_new_tokens=4096,
    do_sample=True,
    temperature=0.7,
    top_p=0.9
)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
story = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]

print(story)

Using vLLM for Faster Inference

For significantly faster inference, you can use vLLM to serve the model:

# Install vLLM
pip install vllm

# Serve the model with vLLM
vllm serve daniel3303/QwenStoryteller2
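
Once the server is running, it exposes an OpenAI-compatible API (by default at http://localhost:8000/v1). A minimal client sketch, assuming the openai Python package and images reachable by URL (the image URL below is a placeholder):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="daniel3303/QwenStoryteller2",
    messages=[
        {
            "role": "system",
            "content": "You are an AI storyteller that can analyze sequences of images and create creative narratives. First think step-by-step to analyze characters, objects, settings, and narrative structure. Then create a grounded story that maintains consistent character identity and object references across frames. Use <think></think> tags to show your reasoning process before writing the final story.",
        },
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/image1.jpg"}},
                {"type": "text", "text": "Generate a story based on these images."},
            ],
        },
    ],
    max_tokens=4096,
    temperature=0.7,
)
print(response.choices[0].message.content)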

Training Methodology

Contrastive Learning Framework

QwenStoryteller2 was trained using a novel contrastive reinforcement learning approach:

  1. Synthetic Story Generation: Extended the StoryReasoning dataset with 4,178 synthetic stories, built by sampling images from different movies to produce deliberately incoherent sequences
  2. Dual-Component Reward Function: Combined entity re-identification (R_reid) and grounding (R_ground) rewards with structural validation
  3. Direct Preference Optimization: Trained the model on offline preference pairs generated from the reward function (see the sketch below)
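
As a rough sketch, preference pairs for DPO can be assembled by scoring multiple candidate stories per prompt with the reward function and pairing the best against the worst (names and data layout here are hypothetical, not taken from the training code):

def build_dpo_pairs(samples):
    # samples: iterable of (prompt, [(completion_text, reward), ...])
    pairs = []
    for prompt, completions in samples:
        ranked = sorted(completions, key=lambda c: c[1], reverse=True)
        chosen, rejected = ranked[0][0], ranked[-1][0]
        if chosen != rejected:
            # DPO consumes (prompt, chosen, rejected) triples.
            pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs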

Reward Function Components

  • Entity Re-identification Reward: Tracks character and object persistence across frames, promoting connections in real stories while penalizing them in synthetic ones
  • Grounding Reward: Evaluates pronoun and proper noun grounding to visual entities
  • Structure Validation: Ensures generated outputs maintain required format and consistency
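
For intuition, the grounding component can be pictured as the fraction of tagged references that resolve to an entity actually detected in the frames. The sketch below is illustrative only; the tag attribute syntax and scoring details are assumptions:

import re

def grounding_reward(story: str, valid_entity_ids: set) -> float:
    # Collect entity ids referenced via <gdo ...> tags (assumed id="..." syntax).
    refs = re.findall(r'<gdo[^>]*id="([^"]+)"', story)
    if not refs:
        return 0.0
    # Reward the fraction of references that match a detected entity.
    return sum(r in valid_entity_ids for r in refs) / len(refs)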

Training Configuration

  • Method: Direct Preference Optimization (DPO) with LoRA fine-tuning
  • LoRA Parameters: Rank 2048, alpha 4096
  • Optimizer: AdamW with learning rate 5×10⁻⁶
  • Batch Size: 8
  • Epochs: 3
  • Temperature Parameter (β): 0.1
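
A sketch of how this configuration maps onto TRL's DPOTrainer with a PEFT LoRA adapter (assuming a recent TRL version; model loading and the preference dataset are abbreviated, and this is not the authors' training script):

from peft import LoraConfig
from trl import DPOConfig, DPOTrainer

lora_config = LoraConfig(r=2048, lora_alpha=4096, task_type="CAUSAL_LM")

training_args = DPOConfig(
    beta=0.1,                        # DPO temperature parameter
    learning_rate=5e-6,              # AdamW is the default optimizer
    per_device_train_batch_size=8,
    num_train_epochs=3,
    output_dir="qwenstoryteller2-dpo",
)

trainer = DPOTrainer(
    model=model,                       # base QwenStoryteller model
    args=training_args,
    train_dataset=preference_dataset,  # prompt/chosen/rejected pairs
    processing_class=processor,
    peft_config=lora_config,
)
trainer.train()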

Performance Metrics

Metric               QwenStoryteller   QwenStoryteller2   Change
Character Precision  0.83              0.78               -6.0%
Object Precision     0.46              0.29               -37.0%
Total Precision      0.57              0.45               -21.1%
mAP                  0.27              0.31               +14.8%
Character Recall     0.62              0.77               +24.2%
Object Recall        0.25              0.28               +12.0%
Total Recall         0.40              0.48               +20.0%
F1 Score             0.35              0.41               +17.1%
METEOR               0.14              0.17               +21.4%
ROUGE-L              0.16              0.18               +12.5%
BLEU-4               0.054             0.057              +5.6%

Output Format

QwenStoryteller2 produces enhanced outputs with improved consistency:

  1. Chain-of-Thought Analysis (<think></think>): More accurate structured analysis with:

    • Improved character tables with consistent identity references
    • Better object tracking with more accurate spatial coordinates
    • More reliable setting categorization
    • Stronger narrative structure modeling
  2. Grounded Story: Enhanced narrative with specialized XML tags:

    • <gdi>: Image tags for specific frames
    • <gdo>: Entity reference tags with improved accuracy
    • <gda>: Action tags with better character-action alignment
    • <gdl>: Location/landmark tags with enhanced spatial grounding
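
A small post-processing sketch for consuming this format, separating the reasoning block from the story and stripping the grounding tags for plain-text display (the regexes assume the tag shapes listed above and may need adjusting):

import re

def split_output(raw: str):
    # Separate the <think>...</think> reasoning block from the story that follows.
    match = re.search(r"<think>(.*?)</think>", raw, flags=re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    story = raw[match.end():].strip() if match else raw.strip()
    return reasoning, story

def strip_grounding_tags(story: str) -> str:
    # Remove <gdi>/<gdo>/<gda>/<gdl> tags (and attributes) but keep inner text.
    return re.sub(r"</?gd[ioal][^>]*>", "", story)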

Key Features

  • Enhanced Cross-Frame Consistency: Superior character and object identity maintenance through contrastive learning
  • Improved Pronoun Grounding: Better alignment of pronouns with visual entities (up to 99.1% for "he", 98.6% for "she")
  • Reduced Hallucinations: Fewer incorrect entity connections and spurious references
  • Robust Entity Discrimination: Learned ability to distinguish when cross-frame connections are appropriate
  • Better Structural Quality: Near-perfect adherence to expected output format (97.5%)

Limitations

  • Precision is lower than the original model's (e.g., object precision 0.46→0.29), the flip side of the substantially higher recall
  • Training data derived from movies may introduce cinematic biases
  • Entity re-identification still relies primarily on visual similarity within bounding boxes
  • Performance validated only on 7B parameter scale
  • Optimal real-to-synthetic story ratio (2:1) may not generalize to all scenarios

Citation

@misc{oliveira2025storyreasoningdatasetusingchainofthought,
      title={StoryReasoning Dataset: Using Chain-of-Thought for Scene Understanding and Grounded Story Generation}, 
      author={Daniel A. P. Oliveira and David Martins de Matos},
      year={2025},
      eprint={2505.10292},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.10292}
}

Contact

For questions or feedback regarding this model, please contact:
