Update README.md
README.md
    line-height:1.25;
    text-align:center;
    margin:0 0 24px;">
  OpenCUA-7B
</h1>

<div style="
    gap:12px;
    flex-wrap:wrap;
    margin-bottom:28px;">

  <a href="https://opencua.xlang.ai/" style="
    display:inline-block;
    padding:8px 24px;

<div style="max-width:900px;margin:0 auto;">

# 🚀 vLLM Serve (Recommended)

We recommend using vLLM for production deployment. Requires **vllm>=0.12.0** with `--trust-remote-code`.

```bash
vllm serve xlangai/OpenCUA-7B \
  --trust-remote-code \
  --served-model-name opencua-7b \
  --host 0.0.0.0 \
  --port 8000
```

Adjust `--gpu-memory-utilization` based on your hardware configuration.
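
Once the server is running, you can confirm the OpenAI-compatible endpoint is reachable; the snippet below is a minimal sketch assuming the default host, port, and served model name used above.

```python
# Minimal sanity check (assumes the host/port and --served-model-name above).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
print([m.id for m in client.models.list()])  # should list "opencua-7b"
```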

---

# Introduction
<div style="
  max-width: 880px; /* adjust the overall width as needed */
  text-align: justify; /* key: justify text on both sides */
  text-justify: inter-word; /* improves justification for English text */
  line-height: 1.6;">

OpenCUA models (OpenCUA-7B, OpenCUA-32B, and OpenCUA-72B) are end-to-end computer-use foundation models that can produce executable actions in computer environments, with strong planning and grounding capabilities. They are based on the Qwen2.5-VL model family.

With the help of the OpenCUA framework, our end-to-end agent models demonstrate strong performance across CUA benchmarks. In particular, <b>OpenCUA-72B</b> achieves an average success rate of **45.0%** on [OSWorld-Verified](https://os-world.github.io/), establishing a new state-of-the-art (SOTA) among open-source models. OpenCUA-72B also has strong grounding ability, achieving 37.3% (SOTA) on [UI-Vision](https://arxiv.org/abs/2504.07981) and 60.8% on [ScreenSpot-Pro](https://arxiv.org/abs/2504.07981).
</div>

## 📢 Updates
- 2026-01-17: 🎉 **vLLM now fully supports OpenCUA-7B, OpenCUA-32B, and OpenCUA-72B!** Thanks to the [Meituan EvoCUA Team](https://github.com/meituan) for their contributions to vLLM integration.

- 2025-10-12: <span style="font-weight:bold">[OpenCUA-7B-exl2](https://huggingface.co/sujitvasanth/OpenCUA-7B-exl2) is now live!</span> ⚡️
  Thanks to [Sujit Vasanth](https://huggingface.co/sujitvasanth) for producing a quantized **exllamav2** version of OpenCUA-7B — enabling much faster inference with lower VRAM usage.

### Key Features

- **Superior Computer-Use Capability**: Able to execute multi-step computer-use actions with effective planning and reasoning
# Performance

### Online Agent Evaluation
OpenCUA models achieve strong performance on **[OSWorld-Verified](https://os-world.github.io/)**.
OpenCUA-72B achieves the best performance among all open-source models with an average success rate of 45.0%, establishing a new state-of-the-art (SOTA).
<div align="center">

| **Model** | **15 Steps** | **50 Steps** | **100 Steps** |
|-------|--------------|--------------|---------------|
| Claude 3.7 Sonnet | 27.1 | 35.8 | 35.9 |
| Claude 4 Sonnet | 31.2 | 43.9 | 41.5 |
| **Open-Source** | | | |
| Qwen 2.5-VL-32B-Instruct | 3.0 | — | 3.9 |
| Qwen 2.5-VL-72B-Instruct | 4.4 | — | 5.0 |
| Kimi-VL-A3B | 9.7 | — | 10.3 |
| UI-TARS-72B-DPO | 24.0 | 25.8 | 27.1 |
| UI-TARS-1.5-7B | 24.5 | 27.3 | 27.4 |
| OpenCUA-7B *(Ours)* | 24.3 | 27.9 | 26.6 |
| OpenCUA-32B *(Ours)* | 29.7 | 34.1 | 34.8 |
| **OpenCUA-72B *(Ours)*** | **39.0** | **44.9** | **45.0** |
</div>

*OpenCUA scores are the mean of 3 independent runs.*
### GUI Grounding Performance
<div align="center">

| **Model** | **OSWorld-G** | **ScreenSpot-V2** | **ScreenSpot-Pro** | **UI-Vision** |
|-------|-----------|---------------|----------------|----------|
| Qwen2.5-VL-7B | 31.4 | 88.8 | 27.6 | 0.85 |
| Qwen2.5-VL-32B | 46.5 | 87.0 | 39.4 | - |
| UI-TARS-72B | 57.1 | 90.3 | 38.1 | 25.5 |
| **OpenCUA-7B** | 55.3 | 92.3 | 50.0 | 29.7 |
| **OpenCUA-32B** | 59.6 | 93.4 | 55.3 | 33.3 |
| **OpenCUA-72B** | **59.2** | **92.9** | **60.8** | **37.3** |
</div>

# 🚀 Quick Start
<div style="border-left: 6px solid #f28c28; background: #fff8e6; padding: 12px 16px; margin: 16px 0;">
<strong>⚠️ Important for Qwen-based Models (OpenCUA-7B, OpenCUA-32B, OpenCUA-72B):</strong>

To align with our training infrastructure, we have modified the model in two places:
<ul style="margin-top: 8px;">
  <li>1. Multimodal Rotary Position Embedding (M-RoPE) has been replaced with 1D RoPE.</li>
  <li>2. The model uses the same Tokenizer and ChatTemplate as Kimi-VL.</li>
  <li>vLLM is supported via the <code>--trust-remote-code</code> flag; the Tokenizer and ChatTemplate should be aligned accordingly if you train the models.</li>
</ul>
</div>

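If you fine-tune the models or build your own serving stack, the bundled Kimi-VL-style tokenizer and chat template can be inspected directly; the snippet below is a minimal sketch assuming a standard `transformers` installation.

```python
# Minimal sketch: load the tokenizer shipped with the checkpoint and render a
# message through its Kimi-VL-style chat template (requires trust_remote_code).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlangai/OpenCUA-7B", trust_remote_code=True)
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hello"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)
```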

## Installation & Download

First, install the required dependencies:

```bash
conda create -n opencua python=3.12
conda activate opencua
pip install "openai>=1.0.0"
```

Download the model weights from Hugging Face (optional; vLLM can download them automatically):
```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="xlangai/OpenCUA-7B",
    local_dir="OpenCUA-7B",
    local_dir_use_symlinks=False
)
```

## 🎯 GUI Grounding

First, start the vLLM server:

```bash
vllm serve xlangai/OpenCUA-7B \
  --trust-remote-code \
  --served-model-name opencua-7b \
  --host 0.0.0.0 \
  --port 8000
```

Then run the following code to test GUI grounding:

```python
import base64
from openai import OpenAI

# vLLM server configuration
VLLM_BASE_URL = "http://localhost:8000/v1"
MODEL_NAME = "opencua-7b"  # Should match --served-model-name in vllm serve

def encode_image(image_path: str) -> str:
    """Encode image to base64 string."""
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def run_grounding(image_path: str, instruction: str) -> str:
    """Run GUI grounding inference via vLLM."""
    client = OpenAI(base_url=VLLM_BASE_URL, api_key="EMPTY")

    system_prompt = (
        "You are a GUI agent. You are given a task and a screenshot of the screen. "
        "You need to perform a series of pyautogui actions to complete the task."
    )

    messages = [
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{encode_image(image_path)}"}
                },
                {"type": "text", "text": instruction},
            ],
        },
    ]

    response = client.chat.completions.create(
        model=MODEL_NAME,
        messages=messages,
        max_tokens=512,
        temperature=0,
    )

    return response.choices[0].message.content

# Example usage
image_path = "screenshot.png"
instruction = "Click on the submit button"

result = run_grounding(image_path, instruction)
print("Model output:", result)
```

<div style="border-left: 6px solid #9ca3af; background: #f5f5f5; padding: 12px 16px; margin: 16px 0;">
<em>Expected result:</em> <code>pyautogui.click(x=1443, y=343)</code>
</div>

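The grounding result is returned as plain text. The snippet below is a small, hypothetical helper (assuming the single `pyautogui.click(x=..., y=...)` format shown above) for pulling the predicted point out of that string; note that the coordinates refer to the smart-resized image, as explained in the coordinate-system notes below.

```python
import re

def parse_click(action: str):
    """Extract (x, y) from a pyautogui.click(...) string, or None if absent."""
    m = re.search(r"pyautogui\.click\(x=(\d+),\s*y=(\d+)\)", action)
    return (int(m.group(1)), int(m.group(2))) if m else None

print(parse_click("pyautogui.click(x=1443, y=343)"))  # -> (1443, 343)
```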

You can also run the grounding examples in [OpenCUA/model/inference/](https://github.com/xlang-ai/OpenCUA/blob/main/model/inference/):
```bash
cd ./model/inference/

# vLLM (requires a running vLLM server first)
python vllm_inference.py

# HuggingFace Transformers
python huggingface_inference.py
```

## 🖥️ Computer Use Agent
**[OpenCUAAgent](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/opencua_agent.py)** is developed in the [OSWorld](https://github.com/xlang-ai/OSWorld) environment based on OpenCUA models. It iteratively perceives the environment via screenshots, produces reflective long CoT as inner monologue, and predicts the next action to be executed. OpenCUAAgent uses 3 images in total and the L2 CoT format by default.

Command for running OpenCUA-7B in OSWorld:
```bash
python run_multienv_opencua.py \
    --headless \
    --observation_type screenshot \
    --model OpenCUA-7B \
    --result_dir ./results --test_all_meta_path evaluation_examples/test_all_no_gdrive.json \
    --max_steps 100 \
    --num_envs 30 \
    --coordinate_type qwen25
```

---

<img src="https://cdn-uploads.huggingface.co/production/uploads/67b327cdd4665a0448eef7d5/dw5k183ucDSB2SZuS5f2V.png" width="400" alt="AgentNet Dataset Domain Distribution">
</div>

AgentNet is the first large-scale desktop computer-use agent trajectory dataset, containing 22.6K human-annotated computer-use tasks across Windows, macOS, and Ubuntu systems.

👉 **[AgentNet Huggingface Dataset](https://huggingface.co/datasets/xlangai/AgentNet)**

</div>


Our **AgentNetTool** is a cross-platform GUI recorder that runs unobtrusively on annotators' machines. It captures synchronized **screen video**, **mouse/keyboard events**, and **accessibility trees**, then provides an in-browser UI for reviewing, trimming, and submitting demonstrations. AgentNet Tool is available on Windows, macOS and Ubuntu.

👉 **[AgentNetTool Document](https://agentnet-tool.xlang.ai/)**


## 2 DataProcessor – Action Reduction & State–Action Matching
Raw demonstrations can contain thousands of low-level events that are too dense for model training.
The **DataProcessor** module (`./data/data-process/`) performs two key steps:

1. **Action Reduction** — merges granular signals into concise, semantically meaningful PyAutoGUI actions (e.g., collapsing mouse moves → click, coalescing scrolls, grouping key-press sequences into text or hotkeys).
2. **State–Action Matching** — aligns every reduced action with the *last visually distinct frame* **before** the action begins, avoiding future-information leakage and yielding compact state–action pairs.

These processed trajectories underlie all downstream training and evaluation.
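
The reduction step can be pictured with a small, self-contained sketch; the event format and the `reduce_actions` helper below are hypothetical illustrations, not the actual DataProcessor API.

```python
# Hypothetical sketch of action reduction: collapse raw input events into
# higher-level PyAutoGUI actions (not the actual DataProcessor implementation).
RAW_EVENTS = [
    {"type": "mouse_move", "x": 410, "y": 220},
    {"type": "mouse_down", "x": 412, "y": 221, "button": "left"},
    {"type": "mouse_up",   "x": 412, "y": 221, "button": "left"},
    {"type": "key_press",  "key": "h"},
    {"type": "key_press",  "key": "i"},
]

def reduce_actions(events):
    actions, buffer = [], []
    for ev in events:
        if ev["type"] == "key_press" and len(ev["key"]) == 1:
            buffer.append(ev["key"])          # accumulate plain keystrokes
            continue
        if buffer:                            # flush typed text as one action
            actions.append(f'pyautogui.write("{"".join(buffer)}")')
            buffer = []
        if ev["type"] == "mouse_up":          # mouse_down + mouse_up -> one click
            actions.append(f'pyautogui.click(x={ev["x"]}, y={ev["y"]})')
        # mouse_move / mouse_down carry no separate action; the click covers them
    if buffer:
        actions.append(f'pyautogui.write("{"".join(buffer)}")')
    return actions

print(reduce_actions(RAW_EVENTS))
# ['pyautogui.click(x=412, y=221)', 'pyautogui.write("hi")']
```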

---

## 3 CoTGenerator – Synthesizing Reflective Long Chain-of-Thought Inner Monologue
To boost robustness and interpretability, we augment each trajectory with **reflective long Chain-of-Thought (CoT) reasoning**.
The **CoTGenerator** pipeline (`./data/cot-generator/`) synthesizes step-level reflections that:

* reflect on the previous action,
* explain *why* an action is chosen given the current observation and history,
* note potential alternative actions, and
* forecast the expected next state.

Empirically, models trained with these rich CoTs scale better with data and generalize across unseen applications.
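
For illustration, a single step-level reflection could be stored as a record with one field per component above; the `StepReflection` structure below is a hypothetical sketch, not the actual CoTGenerator output schema.

```python
from dataclasses import dataclass

@dataclass
class StepReflection:
    """Hypothetical container mirroring the reflection components listed above."""
    previous_action_review: str   # reflect on the previous action
    rationale: str                # why the next action is chosen
    alternatives: list[str]       # other actions that were considered
    expected_next_state: str      # forecast of the resulting screen state
    next_action: str              # the PyAutoGUI action finally emitted

step = StepReflection(
    previous_action_review="The settings menu opened as intended.",
    rationale="The task asks to enable dark mode, so the 'Appearance' tab is the next target.",
    alternatives=["Search for 'dark mode' in the settings search box."],
    expected_next_state="The Appearance pane with a theme selector is visible.",
    next_action="pyautogui.click(x=312, y=148)",
)
```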

👉 See **[AgentNetBench/README.md](./evaluation/agentnetbench/README.md)** for usage instructions.

# Acknowledgements
<p>
We thank Yu Su, Caiming Xiong, and the anonymous reviewers for their insightful discussions and valuable feedback.
We are grateful to Moonshot AI for providing training infrastructure and annotated data.
We also sincerely appreciate Hao Yang, Zhengtao Wang, and Yanxu Chen from the Kimi Team for their strong infrastructure support and helpful guidance.
We thank Chong Peng, Taofeng Xue, and Qiumian Huang from the <a href="https://github.com/meituan/EvoCUA" target="_blank">Meituan EvoCUA Team</a> for their contributions to vLLM integration.
The development of our tool is based on the open-source projects <a href="https://github.com/TheDuckAI/DuckTrack" target="_blank">DuckTrack</a> and <a href="https://github.com/OpenAdaptAI/OpenAdapt" target="_blank">OpenAdapt</a>.
We are very grateful for their commitment to the open-source community. Finally, we extend our deepest thanks to all annotators for their tremendous effort and contributions to this project.
</p>


## Research Use and Disclaimer

OpenCUA models are intended for **research and educational purposes only**.

### Prohibited Uses
- The model may **not** be used for any purpose or activity that violates applicable laws or regulations in any jurisdiction

## Important Notes on Coordinate Systems
<div style="border-left: 6px solid #9ca3af; background: #f5f5f5; padding: 12px 16px; margin: 16px 0;">
<ul style="margin: 0;">
  <li><strong><code>OpenCUA/OpenCUA-7B</code></strong> – Absolute coordinates</li>
  <li><strong><code>OpenCUA/OpenCUA-32B</code></strong> – Absolute coordinates</li>
  <li><strong><code>OpenCUA/OpenCUA-72B</code></strong> – Absolute coordinates</li>
</ul>
</div>

**OpenCUA models output absolute coordinates after smart resize:**

```python
# Example output: pyautogui.click(x=960, y=324)
# These are coordinates on the smart-resized image, not the original image

# Convert to original image coordinates:
# Please refer to the smart_resize function in: https://github.com/huggingface/transformers/blob/67ddc82fbc7e52c6f42a395b4a6d278c55b77a39/src/transformers/models/qwen2_vl/image_processing_qwen2_vl.py#L55
def qwen25_smart_resize_to_absolute(model_x, model_y, original_width, original_height):
    # First, calculate the smart-resized dimensions
    resized_height, resized_width = smart_resize(
        original_height, original_width, factor=28, min_pixels=3136, max_pixels=12845056
    )

    # Convert the model output to relative coordinates on the resized image
    rel_x = model_x / resized_width
    rel_y = model_y / resized_height

    # Then convert to absolute coordinates on the original image
    abs_x = int(rel_x * original_width)
    abs_y = int(rel_y * original_height)
    return abs_x, abs_y
```
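
A usage sketch follows, assuming `smart_resize` has been imported from the linked implementation and that the original screenshot is 1920×1080; the values are illustrative only.

```python
# Illustrative only: map the example model output back onto a 1920x1080 screenshot.
# Assumes `smart_resize` is imported from the linked Qwen2.5-VL implementation.
original_width, original_height = 1920, 1080
model_x, model_y = 960, 324   # from pyautogui.click(x=960, y=324)

abs_x, abs_y = qwen25_smart_resize_to_absolute(model_x, model_y, original_width, original_height)
# pyautogui.click(abs_x, abs_y) would now click the intended point on the real screen.
```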

<div style="border-left: 6px solid #9ca3af; background: #f5f5f5; padding: 12px 16px; margin: 16px 0;">
<strong>Understanding Smart Resize for Qwen2.5-based Models:</strong>
<p style="margin: 8px 0 0;">
The Qwen2.5-VL models use a "smart resize" preprocessing that maintains aspect ratio while fitting within pixel constraints.
For coordinate conversion, you need the smart resize function from the
<a href="https://github.com/QwenLM/Qwen2.5-VL/blob/d2240f11656bfe404b9ba56db4e51cd09f522ff1/qwen-vl-utils/src/qwen_vl_utils/vision_process.py#L60">
official Qwen2.5-VL implementation</a>.

```bibtex
@misc{wang2025opencuaopenfoundationscomputeruse,
      title={OpenCUA: Open Foundations for Computer-Use Agents},
      author={Xinyuan Wang and Bowen Wang and Dunjie Lu and Junlin Yang and Tianbao Xie and Junli Wang and Jiaqi Deng and Xiaole Guo and Yiheng Xu and Chen Henry Wu and Zhennan Shen and Zhuokai Li and Ryan Li and Xiaochuan Li and Junda Chen and Boyuan Zheng and Peihang Li and Fangyu Lei and Ruisheng Cao and Yeqiao Fu and Dongchan Shin and Martin Shin and Jiarui Hu and Yuyan Wang and Jixuan Chen and Yuxiao Ye and Danyang Zhang and Dikang Du and Hao Hu and Huarong Chen and Zaida Zhou and Haotian Yao and Ziwei Chen and Qizheng Gu and Yipu Wang and Heng Wang and Diyi Yang and Victor Zhong and Flood Sung and Y. Charles and Zhilin Yang and Tao Yu},
      year={2025},
      eprint={2508.09123},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2508.09123},
}
```

</div>