# QwenStoryteller2
QwenStoryteller2 is an improved version of QwenStoryteller, fine-tuned using contrastive reinforcement learning with Direct Preference Optimization (DPO) to achieve superior entity re-identification and visual grounding in cross-frame storytelling scenarios.
## Model Description

- **Base Model:** QwenStoryteller (Qwen2.5-VL 7B)
- **Training Method:** Contrastive reinforcement learning with Direct Preference Optimization (LoRA rank 2048, alpha 4096)
- **Training Dataset:** StoryReasoningAdversarialDPO
QwenStoryteller2 builds upon the original QwenStoryteller by addressing critical limitations in cross-frame entity consistency through:
- Contrastive Learning: Training on real stories alongside synthetic negative story examples
- Enhanced Entity Re-identification: Improved tracking of characters and objects across frames
- Better Grounding: Superior alignment between narrative elements and visual entities
- Reduced Hallucinations: More reliable entity connections and fewer spurious references
The model employs a dual-component reward function that promotes appropriate entity connections in coherent sequences while discouraging incorrect connections in synthetic arrangements.
## Key Improvements Over QwenStoryteller
- Grounding Performance: mAP improved from 0.27 to 0.31 (+14.8%), F1 score from 0.35 to 0.41 (+17.1%)
- Cross-frame Consistency: Characters persisting across ≥5 frames increased from 37.7% to 49.3% (+30.8%)
- Pronoun Grounding: Significant improvements across all pronoun types (he: 90.1%→99.1%, she: 91.1%→98.6%, they: 47.6%→68.8%)
- Structural Quality: Well-structured stories increased from 79.1% to 97.5% (+23.3%)
- Entity Tracking: Objects persisting across ≥5 frames improved from 20.9% to 21.3% (+1.9%)
## System Prompt

The model was trained with the following system prompt, and we recommend using it for optimal performance:

```
You are an AI storyteller that can analyze sequences of images and create creative narratives.
First think step-by-step to analyze characters, objects, settings, and narrative structure.
Then create a grounded story that maintains consistent character identity and object references across frames.
Use <think></think> tags to show your reasoning process before writing the final story.
```
## Usage

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch
from PIL import Image

# Load the model
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "daniel3303/QwenStoryteller2", torch_dtype="auto", device_map="auto"
)

# Load the processor
processor = AutoProcessor.from_pretrained("daniel3303/QwenStoryteller2")

# Load the image sequence
images = [
    Image.open("image1.jpg"),
    Image.open("image2.jpg"),
    Image.open("image3.jpg"),
    Image.open("image4.jpg"),
    Image.open("image5.jpg"),
]

# Create the image content list
image_content = []
for img in images:
    image_content.append({
        "type": "image",
        "image": img,
    })

# Add the text prompt at the end
image_content.append({"type": "text", "text": "Generate a story based on these images."})

# Create the messages with the system prompt
messages = [
    {
        "role": "system",
        "content": "You are an AI storyteller that can analyze sequences of images and create creative narratives. First think step-by-step to analyze characters, objects, settings, and narrative structure. Then create a grounded story that maintains consistent character identity and object references across frames. Use <think></think> tags to show your reasoning process before writing the final story."
    },
    {
        "role": "user",
        "content": image_content,
    },
]

# Prepare inputs for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference: generate the story
generated_ids = model.generate(
    **inputs,
    max_new_tokens=4096,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
story = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(story)
```
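The generated text contains the chain-of-thought followed by the grounded story. A simple post-processing sketch, assuming the model emits a single `<think>...</think>` block before the narrative (as the system prompt requests):

```python
# Separate the chain-of-thought from the final story (assumes one
# <think>...</think> block, as the system prompt requests).
if "</think>" in story:
    reasoning, narrative = story.split("</think>", 1)
    reasoning = reasoning.replace("<think>", "").strip()
    print(narrative.strip())
```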
### Using vLLM for faster inference

For significantly faster inference, you can use vLLM to serve the model:

```bash
# Install vLLM
pip install vllm

# Serve the model with vLLM
vllm serve daniel3303/QwenStoryteller2
```
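Once the server is running, you can query it through vLLM's OpenAI-compatible endpoint. A minimal sketch, assuming the default port and placeholder image URLs (repeat one `image_url` entry per frame):

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API (default: http://localhost:8000/v1).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

system_prompt = (
    "You are an AI storyteller that can analyze sequences of images and create "
    "creative narratives. First think step-by-step to analyze characters, objects, "
    "settings, and narrative structure. Then create a grounded story that maintains "
    "consistent character identity and object references across frames. Use "
    "<think></think> tags to show your reasoning process before writing the final story."
)

response = client.chat.completions.create(
    model="daniel3303/QwenStoryteller2",
    messages=[
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": [
                # Placeholder URL; add one entry per image in the sequence.
                {"type": "image_url", "image_url": {"url": "https://example.com/image1.jpg"}},
                {"type": "text", "text": "Generate a story based on these images."},
            ],
        },
    ],
    max_tokens=4096,
    temperature=0.7,
)
print(response.choices[0].message.content)
```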
## Training Methodology

### Contrastive Learning Framework
QwenStoryteller2 was trained using a novel contrastive reinforcement learning approach:
- Synthetic Story Generation: Extended the StoryReasoning dataset with 4,178 synthetic stories built by sampling images from different movies, producing deliberately incoherent sequences
- Dual-Component Reward Function: Combined entity re-identification (R_reid) and grounding (R_ground) rewards with structural validation
- Direct Preference Optimization: Used offline preference pairs generated from the reward function to train the model
### Reward Function Components
- Entity Re-identification Reward: Tracks character and object persistence across frames, promoting connections in real stories while penalizing them in synthetic ones
- Grounding Reward: Evaluates pronoun and proper noun grounding to visual entities
- Structure Validation: Ensures generated outputs maintain required format and consistency
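Below is a minimal, hypothetical sketch of how such a dual-component reward could be combined. The summary statistics, the structural gate as a hard zero, and the equal weights are illustrative assumptions, not the published implementation:

```python
from dataclasses import dataclass

@dataclass
class StoryStats:
    cross_frame_links: int    # entity links the story asserts across frames
    correct_links: int        # links consistent with ground-truth identities
    grounded_mentions: int    # pronouns/proper nouns tied to a visual entity
    total_mentions: int
    well_structured: bool     # passes <think>/XML-tag format validation

def story_reward(s: StoryStats, is_real: bool,
                 w_reid: float = 0.5, w_ground: float = 0.5) -> float:
    if not s.well_structured:          # structural validation gates the reward
        return 0.0
    r_reid = s.correct_links / max(s.cross_frame_links, 1)
    r_ground = s.grounded_mentions / max(s.total_mentions, 1)
    # Cross-frame connections are rewarded in coherent (real) sequences and
    # penalized in synthetic sequences sampled from different movies.
    reid_term = r_reid if is_real else -r_reid
    return w_reid * reid_term + w_ground * r_ground
```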
### Training Configuration
- Method: Direct Preference Optimization (DPO) with LoRA fine-tuning
- LoRA Parameters: Rank 2048, alpha 4096
- Optimizer: AdamW with learning rate 5×10⁻⁶
- Batch Size: 8
- Epochs: 3
- Temperature Parameter (β): 0.1
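For intuition, here is a hypothetical sketch of turning reward-scored story candidates into the offline preference pairs that DPO consumes. The `prompt`/`chosen`/`rejected` layout follows the convention of common DPO trainers (e.g., TRL); the sampling details are assumptions:

```python
# Hypothetical construction of offline DPO preference pairs from
# reward-scored story candidates.
def build_preference_pairs(scored):
    """scored: list of (prompt, [(story, reward), ...]) tuples."""
    pairs = []
    for prompt, candidates in scored:
        ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
        best, worst = ranked[0], ranked[-1]
        if best[1] > worst[1]:  # skip prompts with no preference signal
            pairs.append({"prompt": prompt, "chosen": best[0], "rejected": worst[0]})
    return pairs
```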
## Performance Metrics

| Metric | QwenStoryteller | QwenStoryteller2 | Improvement |
|---|---|---|---|
| Character Precision | 0.83 | 0.78 | -6.0% |
| Object Precision | 0.46 | 0.29 | -37.0% |
| Total Precision | 0.57 | 0.45 | -21.1% |
| mAP | 0.27 | 0.31 | +14.8% |
| Character Recall | 0.62 | 0.77 | +24.2% |
| Object Recall | 0.25 | 0.28 | +12.0% |
| Total Recall | 0.40 | 0.48 | +20.0% |
| F1 Score | 0.35 | 0.41 | +17.1% |
| METEOR | 0.14 | 0.17 | +21.4% |
| ROUGE-L | 0.16 | 0.18 | +12.5% |
| BLEU-4 | 0.054 | 0.057 | +5.6% |
## Output Format

QwenStoryteller2 produces enhanced outputs with improved consistency:

**Chain-of-Thought Analysis (`<think></think>`):** More accurate structured analysis with:
- Improved character tables with consistent identity references
- Better object tracking with enhanced spatial coordination
- More reliable setting categorization
- Stronger narrative structure modeling

**Grounded Story:** Enhanced narrative with specialized XML tags:
- `<gdi>`: Image tags for specific frames
- `<gdo>`: Entity reference tags with improved accuracy
- `<gda>`: Action tags with better character-action alignment
- `<gdl>`: Location/landmark tags with enhanced spatial grounding
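A purely illustrative (not verbatim) example of how these tags compose in a story sentence; the ID conventions shown (`image1`, `char1`, `loc1`) are assumptions based on the StoryReasoning format, not guaranteed model output:

```
<gdi image1>
<gdo char1>The young woman</gdo> <gda char1>walks</gda> along the platform of <gdl loc1>the train station</gdl>, clutching her suitcase.
</gdi>
```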
## Key Features
- Enhanced Cross-Frame Consistency: Superior character and object identity maintenance through contrastive learning
- Improved Pronoun Grounding: Better alignment of pronouns with visual entities (up to 99.1% for "he", 98.6% for "she")
- Reduced Hallucinations: Fewer incorrect entity connections and spurious references
- Robust Entity Discrimination: Learned ability to distinguish when cross-frame connections are appropriate
- Better Structural Quality: Near-perfect adherence to expected output format (97.5%)
## Limitations

- Precision scores decrease relative to the original model, a trade-off for the substantially higher recall
- Training data derived from movies may introduce cinematic biases
- Entity re-identification still relies primarily on visual similarity within bounding boxes
- Performance validated only on 7B parameter scale
- Optimal real-to-synthetic story ratio (2:1) may not generalize to all scenarios
## Citation
```bibtex
@misc{oliveira2025storyreasoningdatasetusingchainofthought,
    title={StoryReasoning Dataset: Using Chain-of-Thought for Scene Understanding and Grounded Story Generation},
    author={Daniel A. P. Oliveira and David Martins de Matos},
    year={2025},
    eprint={2505.10292},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2505.10292}
}
```
## Contact
For questions or feedback regarding this model, please contact:
- Daniel A. P. Oliveira ([email protected])