---
license: apache-2.0
base_model:
- Qwen/Qwen3-32B
library_name: transformers
---

<style>
  .no-border-table table, .no-border-table th, .no-border-table td {
    border: none !important;
  }
</style>

<div class="no-border-table">

| | |
|-|-|
| [![GitHub](https://img.shields.io/badge/GitHub-Repository-black?logo=github)](https://github.com/inclusionAI/AWorld/tree/main/train) | [![arXiv](http://img.shields.io/badge/cs.AI-arXiv%3A2508.20404-B31B1B.svg?logo=arxiv&logoColor=red)](https://arxiv.org/abs/2508.20404) |

</div>

# Qwen3-32B-AWorld

## Model Description

**Qwen3-32B-AWorld** is a large language model fine-tuned from `Qwen3-32B`, specializing in agent capabilities and proficient tool usage. The model excels at complex agent-based tasks through precise integration with external tools, achieving a pass@1 score on the GAIA benchmark that surpasses GPT-4o and is comparable to DeepSeek-V3.

<img src="qwen3-32b-aworld-performance.jpg" style="width:100%;">

## Quick Start

This guide provides instructions for quickly deploying and running inference with `Qwen3-32B-AWorld` using vLLM.

### Deployment with vLLM

To deploy the model, use the following `vllm serve` command:

```bash
vllm serve inclusionAI/Qwen3-32B-AWorld \
--rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' \
--max-model-len 131072 \
--gpu-memory-utilization 0.85 \
--dtype bfloat16 \
--tensor-parallel-size 8 \
--enable-auto-tool-choice \
--tool-call-parser hermes
```

**Key Configuration:**

*   **Deployment Recommendation:** We recommend deploying the model on **8 GPUs** to enhance concurrency. The `tensor-parallel-size` argument should be set to the number of GPUs you are using (e.g., `8` in the command above).
*   **Tool Usage Flags:** To enable the model's tool-calling capabilities, it is crucial to include the `--enable-auto-tool-choice` and `--tool-call-parser hermes` flags. These ensure that the model can correctly process tool calls and parse the results.
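
Once the server is running, you can optionally confirm that the model is being served by querying the OpenAI-compatible `/v1/models` endpoint. The snippet below is a minimal sketch; the `localhost` host and default port `8000` are assumptions, so adjust them to match your deployment.

```python
import requests

# Assumes the vLLM server is reachable at localhost on the default port 8000;
# replace the host/port with the values used by your deployment.
resp = requests.get("http://localhost:8000/v1/models")
resp.raise_for_status()

# The endpoint lists the IDs of the served models,
# e.g. "inclusionAI/Qwen3-32B-AWorld".
for model in resp.json().get("data", []):
    print(model["id"])
```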

### Making Inference Calls

When making an inference request, you must include the `tools` you want the model to use. The format should follow the official OpenAI API specification.

Here is a complete Python example that makes an API call to the deployed model using the `requests` library, demonstrating how to query the model with a specific tool.

```python
import requests
import json

# Define the tools available for the model to use
tools = [
    {
      "type": "function",
      "function": {
        "name": "mcp__google-search__search",
        "description": "Perform a web search query",
        "parameters": {
          "type": "object",
          "properties": {
            "query": {
              "description": "Search query",
              "type": "string"
            },
            "num": {
              "description": "Number of results (1-10)",
              "type": "number"
            }
          },
          "required": [
            "query"
          ]
        }
      }
    }
]

# Define the user's prompt
messages = [
    {
        "role": "user", 
        "content": "Search for hangzhou's weather today."
    }
]

# Set generation parameters
temperature = 0.6
top_p = 0.95
top_k = 20
min_p = 0

# Prepare the request payload
data = {
    "model": "inclusionAI/Qwen3-32B-AWorld",  # must match the model name passed to `vllm serve`
    "messages": messages,
    "tools": tools,
    "temperature": temperature,
    "top_p": top_p,
    "top_k": top_k,
    "min_p": min_p,
}

# The endpoint for the vLLM OpenAI-compatible server
# Replace {your_ip} and {your_port} with the actual IP address and port of your server.
url = "http://{your_ip}:{your_port}/v1/chat/completions"

# Send the POST request
response = requests.post(
    url,
    headers={"Content-Type": "application/json"},
    data=json.dumps(data)
)

# Print the response from the server
print("Status Code:", response.status_code)
print("Response Body:", response.text)

```

**Note:**

*   Remember to replace `{your_ip}` and `{your_port}` in the `url` variable with the actual IP address and port where your vLLM server is running. The default port is typically `8000`.
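
Because the server is started with `--enable-auto-tool-choice` and `--tool-call-parser hermes`, tool invocations come back in the standard OpenAI `tool_calls` format. The sketch below continues from the `response` object in the example above and shows one way to read them; it is an illustration rather than part of the original example, and assumes the request succeeded.

```python
# Continues from the `response` object returned by requests.post(...) above.
body = response.json()
message = body["choices"][0]["message"]

if message.get("tool_calls"):
    # The model asked to call a tool; `arguments` is a JSON-encoded string.
    for call in message["tool_calls"]:
        name = call["function"]["name"]
        args = json.loads(call["function"]["arguments"])
        print(f"Tool requested: {name}, arguments: {args}")
        # Execute the tool yourself, append its result to `messages` as a
        # {"role": "tool", ...} message, and send a follow-up request.
else:
    # No tool call was needed; the model answered directly.
    print(message.get("content"))
```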