QwenStoryteller2

QwenStoryteller2 is an improved version of QwenStoryteller, fine-tuned using contrastive reinforcement learning with Direct Preference Optimization (DPO) to achieve superior entity re-identification and visual grounding in cross-frame storytelling scenarios.

Model Description

Base Model: QwenStoryteller (Qwen2.5-VL 7B)
Training Method: Contrastive Reinforcement Learning with Direct Preference Optimization (LoRA rank 2048, alpha 4096)
Training Dataset: StoryReasoningAdversarialDPO

QwenStoryteller2 builds upon the original QwenStoryteller by addressing critical limitations in cross-frame entity consistency through:

  • Contrastive Learning: Training on both real and synthetic negative story examples
  • Enhanced Entity Re-identification: Improved tracking of characters and objects across frames
  • Better Grounding: Superior alignment between narrative elements and visual entities
  • Reduced Hallucinations: More reliable entity connections and fewer spurious references

The model employs a dual-component reward function that promotes appropriate entity connections in coherent sequences while discouraging incorrect connections in synthetic arrangements.
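
For intuition, here is a minimal Python sketch of that dual-component reward, assuming equal weighting of the two components (the exact combination used in training is detailed under Training Methodology below; the weighting here is an assumption):

def story_reward(r_reid: float, r_ground: float, is_real_sequence: bool) -> float:
    # Entity re-identification is rewarded in coherent (real) sequences
    # and penalized in synthetic, deliberately incoherent ones.
    entity_term = r_reid if is_real_sequence else -r_reid
    # Grounding quality is rewarded in both cases.
    return entity_term + r_ground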

Key Improvements Over QwenStoryteller

  • Grounding Performance: mAP improved from 0.27 to 0.31 (+14.8%), F1 score from 0.35 to 0.41 (+17.1%)
  • Cross-frame Consistency: Character persistence across ≥5 frames increased from 37.7% to 49.3% (+30.8%)
  • Pronoun Grounding: Significant improvements across all pronoun types (he: 90.1%→99.1%, she: 91.1%→98.6%, they: 47.6%→68.8%)
  • Structural Quality: Well-structured stories increased from 79.1% to 97.5% (+23.3%)
  • Entity Tracking: Object persistence across ≥5 frames improved modestly from 20.9% to 21.3% (+1.9%)

System Prompt

The model was trained with the following system prompt, and we recommend using it for optimal performance:

You are an AI storyteller that can analyze sequences of images and create creative narratives. 
First think step-by-step to analyze characters, objects, settings, and narrative structure. 
Then create a grounded story that maintains consistent character identity and object references across frames. 
Use <think></think> tags to show your reasoning process before writing the final story.

Usage

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
from PIL import Image

# Load the model
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "daniel3303/QwenStoryteller2", torch_dtype="auto", device_map="auto"
)

# Load processor
processor = AutoProcessor.from_pretrained("daniel3303/QwenStoryteller2")

# Load images
images = [
    Image.open("image1.jpg"),
    Image.open("image2.jpg"),
    Image.open("image3.jpg"),
    Image.open("image4.jpg"),
    Image.open("image5.jpg")
]

# Create image content list
image_content = []
for img in images:
    image_content.append({
        "type": "image",
        "image": img,
    })

# Add text prompt at the end
image_content.append({"type": "text", "text": "Generate a story based on these images."})

# Create messages with system prompt
messages = [
    {
        "role": "system", 
        "content": "You are an AI storyteller that can analyze sequences of images and create creative narratives. First think step-by-step to analyze characters, objects, settings, and narrative structure. Then create a grounded story that maintains consistent character identity and object references across frames. Use <think></think> tags to show your reasoning process before writing the final story."
    },
    {
        "role": "user",
        "content": image_content,
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(
    **inputs, 
    max_new_tokens=4096,
    do_sample=True,
    temperature=0.7,
    top_p=0.9
)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
story = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]

print(story)

Using vLLM for Faster Inference

For significantly faster inference, you can use vLLM to serve the model:

# Install vLLM
pip install vllm

# Serve the model with vLLM
vllm serve daniel3303/QwenStoryteller2
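
Once the server is running, it exposes an OpenAI-compatible API (by default at http://localhost:8000/v1). A minimal client sketch, assuming the openai Python package and images reachable by URL (the image URL below is a placeholder):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="daniel3303/QwenStoryteller2",
    messages=[
        {
            "role": "system",
            "content": "You are an AI storyteller that can analyze sequences of images and create creative narratives. First think step-by-step to analyze characters, objects, settings, and narrative structure. Then create a grounded story that maintains consistent character identity and object references across frames. Use <think></think> tags to show your reasoning process before writing the final story.",
        },
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/image1.jpg"}},
                {"type": "text", "text": "Generate a story based on these images."},
            ],
        },
    ],
    max_tokens=4096,
    temperature=0.7,
)
print(response.choices[0].message.content)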

Training Methodology

Contrastive Learning Framework

QwenStoryteller2 was trained using a novel contrastive reinforcement learning approach:

  1. Synthetic Story Generation: Extended the StoryReasoning dataset with 4,178 synthetic stories, built by sampling images from different movies to produce deliberately incoherent sequences
  2. Dual-Component Reward Function: Combined entity re-identification (R_reid) and grounding (R_ground) rewards with structural validation
  3. Direct Preference Optimization: Trained the model on offline preference pairs generated from the reward function (see the sketch below)
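
As a rough sketch, preference pairs for DPO can be assembled by scoring multiple candidate stories per prompt with the reward function and pairing the best against the worst (names and data layout here are hypothetical, not taken from the training code):

def build_dpo_pairs(samples):
    # samples: iterable of (prompt, [(completion_text, reward), ...])
    pairs = []
    for prompt, completions in samples:
        ranked = sorted(completions, key=lambda c: c[1], reverse=True)
        chosen, rejected = ranked[0][0], ranked[-1][0]
        if chosen != rejected:
            # DPO consumes (prompt, chosen, rejected) triples.
            pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs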

Reward Function Components

  • Entity Re-identification Reward: Tracks character and object persistence across frames, promoting connections in real stories while penalizing them in synthetic ones
  • Grounding Reward: Evaluates pronoun and proper noun grounding to visual entities
  • Structure Validation: Ensures generated outputs maintain required format and consistency
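
For intuition, the grounding component can be pictured as the fraction of tagged references that resolve to an entity actually detected in the frames. The sketch below is illustrative only; the tag attribute syntax and scoring details are assumptions:

import re

def grounding_reward(story: str, valid_entity_ids: set) -> float:
    # Collect entity ids referenced via <gdo ...> tags (assumed id="..." syntax).
    refs = re.findall(r'<gdo[^>]*id="([^"]+)"', story)
    if not refs:
        return 0.0
    # Reward the fraction of references that match a detected entity.
    return sum(r in valid_entity_ids for r in refs) / len(refs)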

Training Configuration

  • Method: Direct Preference Optimization (DPO) with LoRA fine-tuning
  • LoRA Parameters: Rank 2048, alpha 4096
  • Optimizer: AdamW with learning rate 5×10⁻⁶
  • Batch Size: 8
  • Epochs: 3
  • Temperature Parameter (β): 0.1
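
A sketch of how this configuration maps onto TRL's DPOTrainer with a PEFT LoRA adapter (assuming a recent TRL version; model loading and the preference dataset are abbreviated, and this is not the authors' training script):

from peft import LoraConfig
from trl import DPOConfig, DPOTrainer

lora_config = LoraConfig(r=2048, lora_alpha=4096, task_type="CAUSAL_LM")

training_args = DPOConfig(
    beta=0.1,                        # DPO temperature parameter
    learning_rate=5e-6,              # AdamW is the default optimizer
    per_device_train_batch_size=8,
    num_train_epochs=3,
    output_dir="qwenstoryteller2-dpo",
)

trainer = DPOTrainer(
    model=model,                       # base QwenStoryteller model
    args=training_args,
    train_dataset=preference_dataset,  # prompt/chosen/rejected pairs
    processing_class=processor,
    peft_config=lora_config,
)
trainer.train()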

Performance Metrics

Metric               QwenStoryteller   QwenStoryteller2   Change
Character Precision  0.83              0.78               -6.0%
Object Precision     0.46              0.29               -37.0%
Total Precision      0.57              0.45               -21.1%
mAP                  0.27              0.31               +14.8%
Character Recall     0.62              0.77               +24.2%
Object Recall        0.25              0.28               +12.0%
Total Recall         0.40              0.48               +20.0%
F1 Score             0.35              0.41               +17.1%
METEOR               0.14              0.17               +21.4%
ROUGE-L              0.16              0.18               +12.5%
BLEU-4               0.054             0.057              +5.6%

Output Format

QwenStoryteller2 produces enhanced outputs with improved consistency:

  1. Chain-of-Thought Analysis (<think></think>): More accurate structured analysis with:

    • Improved character tables with consistent identity references
    • Better object tracking with more accurate spatial coordinates
    • More reliable setting categorization
    • Stronger narrative structure modeling
  2. Grounded Story: Enhanced narrative with specialized XML tags:

    • <gdi>: Image tags for specific frames
    • <gdo>: Entity reference tags with improved accuracy
    • <gda>: Action tags with better character-action alignment
    • <gdl>: Location/landmark tags with enhanced spatial grounding
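
A small post-processing sketch for consuming this format, separating the reasoning block from the story and stripping the grounding tags for plain-text display (the regexes assume the tag shapes listed above and may need adjusting):

import re

def split_output(raw: str):
    # Separate the <think>...</think> reasoning block from the story that follows.
    match = re.search(r"<think>(.*?)</think>", raw, flags=re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    story = raw[match.end():].strip() if match else raw.strip()
    return reasoning, story

def strip_grounding_tags(story: str) -> str:
    # Remove <gdi>/<gdo>/<gda>/<gdl> tags (and attributes) but keep inner text.
    return re.sub(r"</?gd[ioal][^>]*>", "", story)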

Key Features

  • Enhanced Cross-Frame Consistency: Superior character and object identity maintenance through contrastive learning
  • Improved Pronoun Grounding: Better alignment of pronouns with visual entities (up to 99.1% for "he", 98.6% for "she")
  • Reduced Hallucinations: Fewer incorrect entity connections and spurious references
  • Robust Entity Discrimination: Learned ability to distinguish when cross-frame connections are appropriate
  • Better Structural Quality: Near-perfect adherence to expected output format (97.5%)

Limitations

  • Precision is lower than the original model's (e.g., object precision 0.46→0.29), the flip side of the substantially higher recall
  • Training data derived from movies may introduce cinematic biases
  • Entity re-identification still relies primarily on visual similarity within bounding boxes
  • Performance validated only on 7B parameter scale
  • Optimal real-to-synthetic story ratio (2:1) may not generalize to all scenarios

Citation

@misc{oliveira2025storyreasoningdatasetusingchainofthought,
      title={StoryReasoning Dataset: Using Chain-of-Thought for Scene Understanding and Grounded Story Generation}, 
      author={Daniel A. P. Oliveira and David Martins de Matos},
      year={2025},
      eprint={2505.10292},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.10292}
}

Contact

For questions or feedback regarding this model, please contact:
