nielsr (HF Staff) committed
Commit a2b0a44 · verified · 1 Parent(s): 872291b

Add comprehensive model card for WebDancer

This PR adds a comprehensive model card for the WebDancer model, described in the paper [WebDancer: Towards Autonomous Information Seeking Agency](https://huggingface.co/papers/2505.22648).

Key improvements include:
- Enriching the metadata with `pipeline_tag: image-text-to-text`, `library_name: transformers`, and relevant `tags` such as `web-agent`, `gui-agent`, and `information-seeking`.
- Including the paper abstract for better context.
- Highlighting key features and performance metrics.
- Providing a detailed `Quick Start` section with a Python code example for easy inference using the `transformers` library, including multimodal input handling.
- Showcasing the model's capabilities with embedded demo videos.
- Ensuring proper citation information is available.

This update aims to make the model more discoverable and user-friendly for the community.

Files changed (1)
  1. README.md +172 -3
README.md CHANGED
@@ -1,3 +1,172 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ pipeline_tag: image-text-to-text
+ library_name: transformers
+ tags:
+ - web-agent
+ - gui-agent
+ - information-seeking
+ ---
+
+ # WebDancer: Towards Autonomous Information Seeking Agency
+
+ This repository contains the `WebDancer` model, a native agentic search reasoning model designed for autonomous information seeking, as introduced in the paper [WebDancer: Towards Autonomous Information Seeking Agency](https://huggingface.co/papers/2505.22648).
+
+ **WebDancer** targets complex real-world problems that require in-depth information seeking and multi-step reasoning. It does so by understanding GUI screenshots and generating actions autonomously.
+
+ * 🏠 **Project Homepage:** [https://osatlas.github.io/](https://osatlas.github.io/)
+ * 💻 **Code:** [https://github.com/OS-Copilot/OS-Atlas](https://github.com/OS-Copilot/OS-Atlas)
+
+ <div align="center">
+ <p align="center">
+ <img src="https://github.com/OS-Copilot/OS-Atlas/assets/35510619/cf2ee020-5e15-4087-9a7e-75cc43662494" width="800%" height="400%" />
+ </p>
+ </div>
+
+ ## Abstract
+
+ Addressing intricate real-world problems necessitates in-depth information seeking and multi-step reasoning. Recent progress in agentic systems, exemplified by Deep Research, underscores the potential for autonomous multi-step research. In this work, we present a cohesive paradigm for building end-to-end agentic information seeking agents from a data-centric and training-stage perspective. Our approach consists of four key stages: (1) browsing data construction, (2) trajectories sampling, (3) supervised fine-tuning for effective cold start, and (4) reinforcement learning for enhanced generalisation. We instantiate this framework in a web agent based on the ReAct, WebDancer. Empirical evaluations on the challenging information seeking benchmarks, GAIA and WebWalkerQA, demonstrate the strong performance of WebDancer, achieving considerable results and highlighting the efficacy of our training paradigm. Further analysis of agent training provides valuable insights and actionable, systematic pathways for developing more capable agentic models. The codes and demo will be released in this https URL .
+
+ ## 🌐 Features for WebDancer
+
+ * A native agentic search reasoning model built on the ReAct framework, aimed at autonomous information seeking and _Deep Research_-like behavior.
+ * A four-stage training paradigm comprising **browsing data construction, trajectory sampling, supervised fine-tuning for effective cold start, and reinforcement learning for improved generalization**, which enables the agent to acquire search and reasoning skills autonomously.
+ * A data-centric approach that integrates trajectory-level supervised fine-tuning (SFT) and reinforcement learning (DAPO) into a scalable pipeline for **training agentic systems** via SFT or RL.
+ * WebDancer achieves a Pass@3 score of 64.1% on GAIA and 62.0% on WebWalkerQA.
+
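To make the Pass@3 figures in the list above concrete, here is a minimal sketch. It assumes Pass@k counts a question as solved when at least one of its first k sampled attempts is judged correct. The helper name `pass_at_k` and the toy data are hypothetical and are not taken from the WebDancer evaluation code.

```python
from typing import Sequence


def pass_at_k(attempts_per_question: Sequence[Sequence[bool]], k: int = 3) -> float:
    """Fraction of questions with at least one correct attempt among the first k.

    Assumption: each inner sequence holds the correctness of independent
    attempts at one question, in sampling order.
    """
    if not attempts_per_question:
        raise ValueError("need at least one question")
    solved = sum(any(attempts[:k]) for attempts in attempts_per_question)
    return solved / len(attempts_per_question)


# Toy data (hypothetical): 4 questions, 3 attempts each.
toy_results = [
    [False, True, False],   # solved on the second attempt
    [False, False, False],  # never solved
    [True, True, True],     # solved on every attempt
    [False, False, True],   # solved on the third attempt
]

score = pass_at_k(toy_results, k=3)
print(f"Pass@3 = {score:.1%}")
```

On this toy data three of the four questions are solved within three attempts, so the sketch reports 75.0%. With `k=1` only the third question counts, which illustrates why Pass@3 is never lower than Pass@1 on the same attempts.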
+ ## 💎 Results Showcase
+
+ <div align="center">
+ <p align="center">
+ <img src="https://github.com/Alibaba-NLP/WebAgent/assets/35510619/webagent-gaia.png" width="800%" height="400%" />
+ </p>
+ </div>
+
+ <div align="center">
+ <p align="center">
+ <img src="https://github.com/Alibaba-NLP/WebAgent/assets/35510619/webagent-bc.png" width="800%" height="400%" />
+ </p>
+ </div>
+
+ ## 🚀 Quick Start
+
+ This section shows how to run inference with the `WebDancer` model using the Hugging Face `transformers` library.
+
+ **Notes:** The model accepts images of various sizes as input. Its outputs are typically normalized to relative coordinates in a 0-1000 range (either a center point or a bounding box). For visualization or interaction, convert these relative coordinates back to the original image dimensions.
+
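The conversion mentioned in the note above can be sketched as follows. This assumes the relative coordinates scale linearly, with 0 at the left/top edge and 1000 at the full image width/height. The helper names `to_pixels` and `box_to_pixels` are hypothetical and not part of any library.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]


def to_pixels(x_rel: float, y_rel: float, width: int, height: int,
              scale: float = 1000.0) -> Tuple[float, float]:
    """Map a point from the assumed 0-1000 relative range to pixel coordinates."""
    return x_rel / scale * width, y_rel / scale * height


def box_to_pixels(box: Box, width: int, height: int, scale: float = 1000.0) -> Box:
    """Map a relative (x1, y1, x2, y2) bounding box to pixel coordinates."""
    x1, y1 = to_pixels(box[0], box[1], width, height, scale)
    x2, y2 = to_pixels(box[2], box[3], width, height, scale)
    return x1, y1, x2, y2


# Hypothetical example: a relative box on an 800x600 screenshot.
pixel_box = box_to_pixels((576, 12, 592, 42), width=800, height=600)
print(pixel_box)
```

Round the result to integers only at the point of use (for example when drawing or clicking), so that small rounding errors do not accumulate when the image is resized again.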
+ First, ensure that the necessary dependencies are installed (`accelerate` is required for `device_map="auto"`):
+ ```bash
+ pip install transformers accelerate torch qwen-vl-utils
+ ```
+
+ Inference code example:
+ ```python
+ from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
+ from qwen_vl_utils import process_vision_info
+ from PIL import Image
+
+ # Load the model on the available device(s).
+ # Replace "Alibaba-NLP/WebDancer-32B" with the actual model ID if different.
+ model_id = "Alibaba-NLP/WebDancer-32B"
+ model = Qwen2VLForConditionalGeneration.from_pretrained(
+     model_id, torch_dtype="auto", device_map="auto"
+ )
+ processor = AutoProcessor.from_pretrained(model_id)
+
+ # Path to your UI screenshot. If the file is missing, fall back to a
+ # dummy image so that the example still runs end to end.
+ image_path = "./examples/images/web_6f93090a-81f6-489e-bb35-1a2838b18c01.png"
+ try:
+     with Image.open(image_path):
+         pass  # the file exists and can be opened as an image
+ except FileNotFoundError:
+     print(f"Image not found at {image_path}. Creating a dummy image for demonstration.")
+     image_path = "dummy_web_screenshot.png"
+     Image.new("RGB", (800, 600), color="red").save(image_path)
+
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {
+                 "type": "image",
+                 "image": image_path,  # Replace with your actual image path
+             },
+             {"type": "text", "text": "In this UI screenshot, what is the position of the element corresponding to the command \"switch language of current page\" (with bbox)?"},
+         ],
+     }
+ ]
+
+ # Preparation for inference
+ text = processor.apply_chat_template(
+     messages, tokenize=False, add_generation_prompt=True
+ )
+ image_inputs, video_inputs = process_vision_info(messages)  # Handles image loading
+ inputs = processor(
+     text=[text],
+     images=image_inputs,
+     videos=video_inputs,
+     padding=True,
+     return_tensors="pt",
+ )
+ inputs = inputs.to(model.device)  # Move inputs to the same device as the model
+
+ # Inference: generate the output
+ generated_ids = model.generate(**inputs, max_new_tokens=128)
+
+ # Drop the prompt tokens so that only the newly generated tokens are decoded
+ generated_ids_trimmed = [
+     out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
+ ]
+
+ # Special tokens are kept because they delimit the predicted box
+ output_text = processor.batch_decode(
+     generated_ids_trimmed, skip_special_tokens=False, clean_up_tokenization_spaces=False
+ )
+ print(output_text)
+ # Example output might look like: <|object_ref_start|>language switch<|object_ref_end|><|box_start|>(576,12),(592,42)<|box_end|><|im_end|>
+ ```
+
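The example output in the last comment of the snippet above wraps the predicted box in `<|box_start|>` and `<|box_end|>` markers. A minimal sketch for pulling the numbers out of such a string is shown here. It assumes the decoded text follows exactly that `(x1,y1),(x2,y2)` layout. The names `BOX_PATTERN` and `extract_boxes` are hypothetical.

```python
import re
from typing import List, Tuple

# Matches "<|box_start|>(x1,y1),(x2,y2)<|box_end|>" with integer coordinates.
# Assumption: this mirrors the example output format, not a documented spec.
BOX_PATTERN = re.compile(
    r"<\|box_start\|>\((\d+),\s*(\d+)\),\s*\((\d+),\s*(\d+)\)<\|box_end\|>"
)


def extract_boxes(decoded: str) -> List[Tuple[int, ...]]:
    """Return every (x1, y1, x2, y2) box found in the decoded model output."""
    return [tuple(int(v) for v in match.groups())
            for match in BOX_PATTERN.finditer(decoded)]


sample = (
    "<|object_ref_start|>language switch<|object_ref_end|>"
    "<|box_start|>(576,12),(592,42)<|box_end|><|im_end|>"
)
boxes = extract_boxes(sample)
print(boxes)  # [(576, 12, 592, 42)]
```

The function returns an empty list when no box marker is present, so callers can treat a missing prediction as a normal case instead of catching an exception.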
+ ## 🎥 WebDancer Demos
+
+ The model can execute long-horizon tasks with **multiple steps** and **complex reasoning**, such as web traversal, information seeking, and question answering.
+
+ <div align="center">
+ <h3>WebWalkerQA</h3>
+ <video src="https://github.com/user-attachments/assets/0bbaf55b-897e-4c57-967d-a6e8bbd2167e" />
+ </div>
+
+ <div align="center">
+ <h3>GAIA</h3>
+ <video src="https://github.com/user-attachments/assets/935c668e-6169-4712-9c04-ac80f0531872" />
+ </div>
+
+ <div align="center">
+ <h3>Daily Use</h3>
+ <video src="https://github.com/user-attachments/assets/d1d5b533-4009-478b-bd87-96b86389327d" />
+ </div>
+
+ ## 📃 License
+
+ This project is licensed under the Apache-2.0 License.
+
+ ## 🚩 Citation
+
+ If this work is helpful, please kindly cite as:
+
+ ```bibtex
+ @misc{wu2025webdancer,
+   title={WebDancer: Towards Autonomous Information Seeking Agency},
+   author={Jialong Wu and Baixuan Li and Runnan Fang and Wenbiao Yin and Liwen Zhang and Zhengwei Tao and Dingchu Zhang and Zekun Xi and Yong Jiang and Pengjun Xie and Fei Huang and Jingren Zhou},
+   year={2025},
+   eprint={2505.22648},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL},
+   url={https://arxiv.org/abs/2505.22648},
+ }
+ ```