xywang626 committed on
Commit 6d6f65b · verified · 1 Parent(s): 49c54f8

Update README.md

Files changed (1)
  1. README.md +133 -169
README.md CHANGED
@@ -29,7 +29,7 @@ tags:
29
  line-height:1.25;
30
  text-align:center;
31
  margin:0 0 24px;">
32
- OpenCUA: Open Foundations for Computer-Use Agents
33
  </h1>
34
 
35
  <div style="
@@ -38,7 +38,7 @@ tags:
38
  gap:12px;
39
  flex-wrap:wrap;
40
  margin-bottom:28px;">
41
-
42
  <a href="https://opencua.xlang.ai/" style="
43
  display:inline-block;
44
  padding:8px 24px;
@@ -78,6 +78,22 @@ tags:
78
 
79
  <div style="max-width:900px;margin:0 auto;">
80
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
81
  # Introduction
82
  <div style="
83
 max-width: 880px; /* adjust the overall width as needed */
@@ -85,16 +101,18 @@ tags:
85
 text-align: justify; /* key: justify text on both edges */
86
 text-justify: inter-word; /* improves justification for English text */
87
  line-height: 1.6;">
88
-
89
- OpenCUA models (OpenCUA-7B and OpenCUA-32B) are end-to-end computer-use foundation models than can produce executable actions in the computer environments. They are based on the weights of Qwen2.5-VL-7B-Instruction and Qwen2.5-VL-32B-Instruction.
90
- They demonstrate superior performance across CUA benchmarks. In particular, <b>OpenCUA-32B</b> achieves an average success rate of **34.8%** on [OSWorld-Verified](https://os-world.github.io/),
91
- establishing a new state-of-the-art (SOTA) among open-source models and surpassing OpenAI CUA (GPT-4o). Both models also have strong grounding performance, OpenCUA-32B achieves 59.6% on [OSWorld-G](https://osworld-grounding.github.io/) and 55.3% on [Screenspot-Pro](https://arxiv.org/abs/2504.07981).
92
  </div>
93
 
94
  ## 📢 Updates
95
- - 2025-10-12: <span style="font-weight:bold">[OpenCUA-7B-exl2](https://huggingface.co/sujitvasanth/OpenCUA-7B-exl2) is now live!</span> ⚡️
 
 
96
  Thanks to [Sujit Vasanth](https://huggingface.co/sujitvasanth) for producing a quantized **exllamav2** version of OpenCUA-7B — enabling much faster inference with lower VRAM usage.
97
-
98
  ### Key Features
99
 
100
 - **Superior Computer-Use Capability**: Able to execute multi-step computer-use actions with effective planning and reasoning
@@ -107,9 +125,8 @@ establishing a new state-of-the-art (SOTA) among open-source models and surpassi
107
  # Performance
108
 
109
  ### Online Agent Evaluation
110
- OpenCUA models achieves strong performance on **[OSWorld-Verified](https://os-world.github.io/)**.
111
- OPENCUA-32B achieves the best performance among all open-source models with an average success rate of 34.8%, outperforming prior baselines by large margins.
112
- It also closes the gap to proprietary Claude models.
113
  <div align="center">
114
 
115
  | **Model** | **15 Steps** | **50 Steps** | **100 Steps** |
@@ -120,13 +137,14 @@ It also closes the gap to proprietary Claude models.
120
  | Claude 3.7 Sonnet | 27.1 | 35.8 | 35.9 |
121
  | Claude 4 Sonnet | 31.2 | 43.9 | 41.5 |
122
  | **Open-Source** | | | |
123
- | Qwen 2.5-VL-32B-Instruct | 3.0 | — | 3.9 |
124
- | Qwen 2.5-VL-72B-Instruct | 4.4 | — | 5.0 |
125
  | Kimi-VL-A3B | 9.7 | — | 10.3 |
126
  | UI-TARS-72B-DPO | 24.0 | 25.8 | 27.1 |
127
  | UI-TARS-1.5-7B | 24.5 | 27.3 | 27.4 |
128
  | OpenCUA-7B *(Ours)* | 24.3 | 27.9 | 26.6 |
129
- | **OpenCUA-32B *(Ours)*** | **29.7** | **34.1** | **34.8** |
 
130
  </div>
131
 
132
  *OpenCUA scores are the mean of 3 independent runs.*
@@ -134,15 +152,14 @@ It also closes the gap to proprietary Claude models.
134
  ### GUI Grounding Performance
135
  <div align="center">
136
 
137
- | **Model** | **OSWorld-G** | **ScreenSpot-V2** | **ScreenSpot-Pro** |
138
- |-------|-----------|---------------|----------------|
139
- | Qwen2.5-VL-7B | 31.4 | 88.8 | 27.6 |
140
- | Qwen2.5-VL-32B | 46.5 | 87.0 | 39.4 |
141
- | UI-TARS-72B | 57.1 | 90.3 | 38.1 |
142
- | **OpenCUA-A3B** | 48.6 | 91.4 | 28.5 |
143
- | **OpenCUA-Qwen2-7B** | 45.7 | 88.5 | 23.7 |
144
- | **OpenCUA-7B** | 55.3 | 92.3 | 50.0 |
145
- | **OpenCUA-32B** | **59.6** | **93.4** | **55.3** |
146
  </div>
147
 
148
 
@@ -161,164 +178,133 @@ It also closes the gap to proprietary Claude models.
161
 
162
  # 🚀 Quick Start
163
  <div style="border-left: 6px solid #f28c28; background: #fff8e6; padding: 12px 16px; margin: 16px 0;">
164
- <strong>⚠️ Important for Qwen-based Models (OpenCUA-7B, OpenCUA-32B):</strong>
165
-
166
  To align with our training infrastructure, we have modified the model in two places:
167
  <ul style="margin-top: 8px;">
168
 <li>1. Multimodal Rotary Position Embedding (M-RoPE) has been replaced with 1D RoPE.</li>
169
 <li>2. The tokenizer and chat template are the same as in Kimi-VL.</li>
170
- <li>Do not use the default transformers and vllm classes to load the model. Tokenizer and Chat Template should be aligned if training the models.</li>
171
  </ul>
172
  </div>
173
 
174
 
175
  ## Installation & Download
176
 
177
- First, install the required transformers dependencies:
178
 
179
  ```bash
180
- conda create -n opencua python=3.10
181
  conda activate opencua
182
- pip install -r requirement.txt
183
  ```
184
 
185
- Download the model weight from huggingface:
186
- ```bash
187
  from huggingface_hub import snapshot_download
188
  snapshot_download(
189
  repo_id="xlangai/OpenCUA-7B",
190
- local_dir="OpenCUA-7B",
191
- local_dir_use_symlinks=False
192
  )
193
  ```
194
 
195
- ## 🎯 GUI Grounding
 
 
 
 
 
 
 
 
 
 
196
 
197
- The following code demonstrates how to use OpenCUA models for GUI grounding tasks:
198
 
199
  ```python
200
  import base64
201
- import torch
202
- from transformers import AutoTokenizer, AutoModel, AutoImageProcessor
203
- from PIL import Image
204
- import json
 
205
 
206
  def encode_image(image_path: str) -> str:
207
- """Encode image to base64 string for model input."""
208
  with open(image_path, "rb") as f:
209
  return base64.b64encode(f.read()).decode()
210
 
211
- def load_opencua_model(model_path: str):
212
- """Load OpenCUA model, tokenizer, and image processor."""
213
- tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
214
- model = AutoModel.from_pretrained(
215
- model_path,
216
- torch_dtype="auto",
217
- device_map="auto",
218
- trust_remote_code=True
219
- )
220
- image_processor = AutoImageProcessor.from_pretrained(model_path, trust_remote_code=True)
221
-
222
- return model, tokenizer, image_processor
223
 
224
- def create_grounding_messages(image_path: str, instruction: str):
225
- """Create chat messages for GUI grounding task."""
226
  system_prompt = (
227
  "You are a GUI agent. You are given a task and a screenshot of the screen. "
228
  "You need to perform a series of pyautogui actions to complete the task."
229
  )
230
-
231
  messages = [
232
  {"role": "system", "content": system_prompt},
233
  {
234
  "role": "user",
235
  "content": [
236
- {"type": "image", "image": f"data:image/png;base64,{encode_image(image_path)}"},
 
 
 
237
  {"type": "text", "text": instruction},
238
  ],
239
  },
240
  ]
241
- return messages
242
 
243
- def run_inference(model, tokenizer, image_processor, messages, image_path):
244
- """Run inference on the model."""
245
- # Prepare text input
246
- input_ids = tokenizer.apply_chat_template(
247
- messages, tokenize=True, add_generation_prompt=True
248
  )
249
- input_ids = torch.tensor([input_ids]).to(model.device)
250
-
251
- # Prepare image input
252
- image = Image.open(image_path).convert('RGB')
253
- image_info = image_processor.preprocess(images=[image])
254
- pixel_values = torch.tensor(image_info['pixel_values']).to(
255
- dtype=torch.bfloat16, device=model.device
256
- )
257
- grid_thws = torch.tensor(image_info['image_grid_thw'])
258
-
259
- # Generate response
260
- with torch.no_grad():
261
- generated_ids = model.generate(
262
- input_ids,
263
- pixel_values=pixel_values,
264
- grid_thws=grid_thws,
265
- max_new_tokens=512,
266
- temperature=0
267
- )
268
-
269
- # Decode output
270
- prompt_len = input_ids.shape[1]
271
- generated_ids = generated_ids[:, prompt_len:]
272
- output_text = tokenizer.batch_decode(
273
- generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
274
- )[0]
275
-
276
- return output_text
277
 
278
  # Example usage
279
- model_path = "xlangai/OpenCUA-7B" # or other model variants
280
  image_path = "screenshot.png"
281
  instruction = "Click on the submit button"
282
 
283
- # Load model
284
- model, tokenizer, image_processor = load_opencua_model(model_path)
285
-
286
- # Create messages and run inference
287
- messages = create_grounding_messages(image_path, instruction)
288
- result = run_inference(model, tokenizer, image_processor, messages, image_path)
289
-
290
  print("Model output:", result)
291
  ```
292
 
293
  <div style="border-left: 6px solid #9ca3af; background: #f5f5f5; padding: 12px 16px; margin: 16px 0;">
294
- <em>Expected result:</em> ```python
295
- pyautogui.click(x=1443, y=343)
296
- ```
297
  </div>
298
 
299
- You can also run the five grounding examples in [OpenCUA/model/inference/huggingface_inference.py](https://github.com/xlang-ai/OpenCUA/blob/main/model/inference/huggingface_inference.py):
300
- ```
301
  cd ./model/inference/
 
 
 
 
 
302
  python huggingface_inference.py
303
  ```
304
 
305
  ## 🖥️ Computer Use Agent
306
 **[OpenCUAAgent](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/opencua_agent.py)** is developed in the [OSWorld](https://github.com/xlang-ai/OSWorld) environment based on OpenCUA models. It iteratively perceives the environment via screenshots, produces a reflective long CoT as its inner monologue, and predicts the next action to execute. OpenCUAAgent uses 3 images in total and the L2 CoT format by default.
307
 
308
- Command for running OpenCUA-7B and OpenCUA-32B in OSWorld:
309
  ```
310
  python run_multienv_opencua.py \
311
  --headless \
312
  --observation_type screenshot \
313
- --model OpenCUA-32B \
314
  --result_dir ./results --test_all_meta_path evaluation_examples/test_all_no_gdrive.json \
315
  --max_steps 100 \
316
  --num_envs 30 \
317
  --coordinate_type qwen25
318
  ```
319
- <div style="border-left: 6px solid #9ca3af; background: #f5f5f5; padding: 12px 16px; margin: 16px 0;">
320
- <em>Currently we only supports huggingface inference. We are implementing the vLLM supports of OpenCUA models. Please stay tuned.</em>
321
- </div>
322
 
323
  ---
324
 
@@ -328,7 +314,7 @@ Command for running OpenCUA-7B and OpenCUA-32B in OSWorld:
328
  <img src="https://cdn-uploads.huggingface.co/production/uploads/67b327cdd4665a0448eef7d5/dw5k183ucDSB2SZuS5f2V.png" width="400" alt="AgentNet Dataset Domain Distribution">
329
  </div>
330
 
331
- AgentNet is the first large-scale desktop computer-use agent trajectory dataset, containing 22.6K human-annotated computer-use tasks across Windows, macOS, and Ubuntu systems.
332
 
333
  👉 **[AgentNet Huggingface Dataset](https://huggingface.co/datasets/xlangai/AgentNet)**
334
 
@@ -350,17 +336,17 @@ Collecting computer-use agent training data requires 3 steps:
350
  </div>
351
 
352
 
353
- Our **AgentNetTool** is a cross-platform GUI recorder that runs unobtrusively on annotators machines. It captures synchronized **screen video**, **mouse/keyboard events**, and **accessibility trees**, then provides an in-browser UI for reviewing, trimming, and submitting demonstrations. AgentNet Tool is available on Windows, macOS and Ubuntu.
354
 
355
  👉 **[AgentNetTool Document](https://agentnet-tool.xlang.ai/)**
356
 
357
 
358
 
359
  ## 2 DataProcessor – Action Reduction & State–Action Matching
360
- Raw demonstrations can contain thousands of low-level events that are too dense for model training.
361
  The **DataProcessor** module (`./data/data-process/`) performs two key steps:
362
 
363
- 1. **Action Reduction** — merges granular signals into concise, semantically meaningful PyAutoGUI actions (e.g., collapsing mouse moves → click, coalescing scrolls, grouping key-press sequences into text or hotkeys).
364
  2. **State–Action Matching** — aligns every reduced action with the *last visually distinct frame* **before** the action begins, avoiding future-information leakage and yielding compact state–action pairs.
365
 
366
  These processed trajectories underlie all downstream training and evaluation.
@@ -368,12 +354,12 @@ These processed trajectories underlie all downstream training and evaluation.
368
  ---
369
 
370
  ## 3 CoTGenerator – Synthesizing Reflective Long Chain-of-Thought Inner Monologue
371
- To boost robustness and interpretability, we augment each trajectory with **reflective long Chain-of-Thought (CoT) reasoning**.
372
  The **CoTGenerator** pipeline (`./data/cot-generator/`) synthesizes step-level reflections that:
373
 
374
  * reflect on the previous action,
375
- * explain *why* an action is chosen given the current observation and history,
376
- * note potential alternative actions, and
377
  * forecast the expected next state.
378
 
379
  Empirically, models trained with these rich CoTs scale better with data and generalize across unseen applications.
@@ -390,21 +376,13 @@ Empirically, models trained with these rich CoTs scale better with data and gene
390
 
391
  👉 See **[AgentNetBench/README.md](./evaluation/agentnetbench/README.md)** for usage instructions.
392
 
393
- # TODO
394
- ## vLLM Support
395
- We are actively working with the vLLM team to add support for OpenCUA models.
396
-
397
- **Workaround:** For now, please use the standard transformers library as shown in the examples above. We will update this section once vLLM support becomes available.
398
-
399
- ## Training Code
400
- OpenCUA models are developed based on the training infrastructure of Kimi Team. We are developting the training pipeline based on the open-source infrastructure as well.
401
-
402
# Acknowledgments
403
  <p>
404
- We thank Su Yu, Caiming Xiong, Binyuan Hui, and the anonymous reviewers for their insightful discussions and valuable feedback.
405
- We are grateful to Moonshot AI for providing training infrastructure and annotated data.
406
- We also sincerely appreciate Calvin, Ziwei Chen, Jin Zhang, Ze Li, Zhengtao Wang, Yanxu Chen, and Qizheng Gu from the Kimi Team for their strong infrastructure support and helpful guidance.
407
- The development of our tool is based on the open-source projects-<a href="https://github.com/TheDuckAI/DuckTrack" target="_blank">DuckTrack</a> and <a href="https://github.com/OpenAdaptAI/OpenAdapt" target="_blank">OpenAdapt</a>.
 
408
 We are very grateful for their commitment to the open-source community. Finally, we extend our deepest thanks to all annotators for their tremendous effort and contributions to this project.
409
  </p>
410
 
@@ -414,7 +392,7 @@ This project is licensed under the MIT License - see the LICENSE file in the roo
414
 
415
  ## Research Use and Disclaimer
416
 
417
- OpenCUA models are intended for **research and educational purposes only**.
418
 
419
  ### Prohibited Uses
420
  - The model may **not** be used for any purpose or activity that violates applicable laws or regulations in any jurisdiction
@@ -428,52 +406,38 @@ OpenCUA models are intended for **research and educational purposes only**.
428
  ## Important Notes on Coordinate Systems
429
  <div style="border-left: 6px solid #9ca3af; background: #f5f5f5; padding: 12px 16px; margin: 16px 0;">
430
  <ul style="margin: 0;">
431
- <li><strong><code>OpenCUA/OpenCUA-A3B</code></strong> – Relative coordinates <em>(not supported in this code)</em></li>
432
- <li><strong><code>OpenCUA/OpenCUA-Qwen2-7B</code></strong> – Relative coordinates</li>
433
  <li><strong><code>OpenCUA/OpenCUA-7B</code></strong> – Absolute coordinates</li>
434
  <li><strong><code>OpenCUA/OpenCUA-32B</code></strong> – Absolute coordinates</li>
 
435
  </ul>
436
  </div>
437
 
438
- **OpenCUA models use different coordinate systems depending on the base model:**
439
-
440
- - **OpenCUA-Qwen2-7B**: Outputs **relative coordinates** (0.0 to 1.0 range)
441
- ```python
442
- # Example output: pyautogui.click(x=0.5, y=0.3)
443
- # x=0.5 means 50% from left edge, y=0.3 means 30% from top edge
444
-
445
- # Convert to absolute coordinates:
446
- def qwen2_relative_to_absolute(rel_x, rel_y, original_width, original_height):
447
- abs_x = int(rel_x * original_width)
448
- abs_y = int(rel_y * original_height)
449
- return abs_x, abs_y
450
- ```
451
-
452
- - **OpenCUA-7B and OpenCUA-32B** (Qwen2.5-based): Output **absolute coordinates** after smart resize
453
- ```python
454
- # Example output: pyautogui.click(x=960, y=324)
455
- # These are coordinates on the smart-resized image, not the original image
456
-
457
- # Convert to original image coordinates:
458
- # Please refer to the smart_resize function in: https://github.com/huggingface/transformers/blob/67ddc82fbc7e52c6f42a395b4a6d278c55b77a39/src/transformers/models/qwen2_vl/image_processing_qwen2_vl.py#L55
459
- def qwen25_smart_resize_to_absolute(model_x, model_y, original_width, original_height):
460
- # First, calculate the smart-resized dimensions
461
- resized_height, resized_width = smart_resize(original_height, original_width, factor = 28, min_pixels = 3136, max_pixels = 12845056)
462
-
463
- # Convert model output to relative coordinates on original image
464
- rel_x = model_x / resized_width
465
- rel_y = model_y / resized_height
466
-
467
- # Then convert to absolute coordinates on original image
468
- abs_x = int(rel_x * original_width)
469
- abs_y = int(rel_y * original_height)
470
- return abs_x, abs_y
471
- ```
472
 
473
  <div style="border-left: 6px solid #9ca3af; background: #f5f5f5; padding: 12px 16px; margin: 16px 0;">
474
  <strong>Understanding Smart Resize for Qwen2.5-based Models:</strong>
475
  <p style="margin: 8px 0 0;">
476
- The Qwen2.5-VL models use a smart resize preprocessing that maintains aspect ratio while fitting within pixel constraints.
477
  For coordinate conversion, you need the smart resize function from the
478
  <a href="https://github.com/QwenLM/Qwen2.5-VL/blob/d2240f11656bfe404b9ba56db4e51cd09f522ff1/qwen-vl-utils/src/qwen_vl_utils/vision_process.py#L60">
479
  official Qwen2.5-VL implementation</a>.
@@ -486,14 +450,14 @@ If you use OpenCUA models in your research, please cite our work:
486
 
487
  ```bibtex
488
  @misc{wang2025opencuaopenfoundationscomputeruse,
489
- title={OpenCUA: Open Foundations for Computer-Use Agents},
490
  author={Xinyuan Wang and Bowen Wang and Dunjie Lu and Junlin Yang and Tianbao Xie and Junli Wang and Jiaqi Deng and Xiaole Guo and Yiheng Xu and Chen Henry Wu and Zhennan Shen and Zhuokai Li and Ryan Li and Xiaochuan Li and Junda Chen and Boyuan Zheng and Peihang Li and Fangyu Lei and Ruisheng Cao and Yeqiao Fu and Dongchan Shin and Martin Shin and Jiarui Hu and Yuyan Wang and Jixuan Chen and Yuxiao Ye and Danyang Zhang and Dikang Du and Hao Hu and Huarong Chen and Zaida Zhou and Haotian Yao and Ziwei Chen and Qizheng Gu and Yipu Wang and Heng Wang and Diyi Yang and Victor Zhong and Flood Sung and Y. Charles and Zhilin Yang and Tao Yu},
491
  year={2025},
492
  eprint={2508.09123},
493
  archivePrefix={arXiv},
494
  primaryClass={cs.AI},
495
- url={https://arxiv.org/abs/2508.09123},
496
  }
497
  ```
498
 
499
- </div>
 
29
  line-height:1.25;
30
  text-align:center;
31
  margin:0 0 24px;">
32
+ OpenCUA-7B
33
  </h1>
34
 
35
  <div style="
 
38
  gap:12px;
39
  flex-wrap:wrap;
40
  margin-bottom:28px;">
41
+
42
  <a href="https://opencua.xlang.ai/" style="
43
  display:inline-block;
44
  padding:8px 24px;
 
78
 
79
  <div style="max-width:900px;margin:0 auto;">
80
 
81
+ # 🚀 vLLM Serve (Recommended)
82
+
83
+ We recommend using vLLM for production deployment. It requires **vllm>=0.12.0** and the `--trust-remote-code` flag.
84
+
85
+ ```bash
86
+ vllm serve xlangai/OpenCUA-7B \
87
+ --trust-remote-code \
88
+ --served-model-name opencua-7b \
89
+ --host 0.0.0.0 \
90
+ --port 8000
91
+ ```
92
+
93
+ Adjust `--gpu-memory-utilization` based on your hardware configuration.
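Once the server is running, a quick way to confirm it is reachable is a minimal check through the OpenAI-compatible endpoint that vLLM exposes; the base URL and served model name below are the ones from the command above and may need adjusting for your setup:

```python
# Quick sanity check against the vLLM server started above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM does not check the key
print([m.id for m in client.models.list().data])  # should include "opencua-7b"
```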
94
+
95
+ ---
96
+
97
  # Introduction
98
  <div style="
99
 max-width: 880px; /* adjust the overall width as needed */
 
101
 text-align: justify; /* key: justify text on both edges */
102
 text-justify: inter-word; /* improves justification for English text */
103
  line-height: 1.6;">
104
+
105
+ OpenCUA models (OpenCUA-7B, OpenCUA-32B, and OpenCUA-72B) are end-to-end computer-use foundation models that can produce executable actions in computer environments, with strong planning and grounding capabilities. They are based on the Qwen2.5-VL model family.
106
+
107
+ With the help of the OpenCUA framework, our end-to-end agent models demonstrate strong performance across CUA benchmarks. In particular, <b>OpenCUA-72B</b> achieves an average success rate of **45.0%** on [OSWorld-Verified](https://os-world.github.io/), establishing a new state-of-the-art (SOTA) among open-source models. OpenCUA-72B also has strong grounding ability, achieving 37.3% (SOTA) on [UI-Vision](https://arxiv.org/abs/2504.07981) and 60.8% on [ScreenSpot-Pro](https://arxiv.org/abs/2504.07981).
108
  </div>
109
 
110
  ## 📢 Updates
111
+ - 2026-01-17: 🎉 **vLLM now fully supports OpenCUA-7B, OpenCUA-32B, and OpenCUA-72B!** Thanks to the [Meituan EvoCUA Team](https://github.com/meituan) for their contributions to vLLM integration.
112
+
113
+ - 2025-10-12: <span style="font-weight:bold">[OpenCUA-7B-exl2](https://huggingface.co/sujitvasanth/OpenCUA-7B-exl2) is now live!</span> ⚡️
114
  Thanks to [Sujit Vasanth](https://huggingface.co/sujitvasanth) for producing a quantized **exllamav2** version of OpenCUA-7B — enabling much faster inference with lower VRAM usage.
115
+
116
  ### Key Features
117
 
118
 - **Superior Computer-Use Capability**: Able to execute multi-step computer-use actions with effective planning and reasoning
 
125
  # Performance
126
 
127
  ### Online Agent Evaluation
128
+ OpenCUA models achieve strong performance on **[OSWorld-Verified](https://os-world.github.io/)**.
129
+ OpenCUA-72B achieves the best performance among all open-source models with an average success rate of 45.0%, establishing a new state-of-the-art (SOTA).
 
130
  <div align="center">
131
 
132
  | **Model** | **15 Steps** | **50 Steps** | **100 Steps** |
 
137
  | Claude 3.7 Sonnet | 27.1 | 35.8 | 35.9 |
138
  | Claude 4 Sonnet | 31.2 | 43.9 | 41.5 |
139
  | **Open-Source** | | | |
140
+ | Qwen 2.5-VL-32B-Instruct | 3.0 | — | 3.9 |
141
+ | Qwen 2.5-VL-72B-Instruct | 4.4 | — | 5.0 |
142
  | Kimi-VL-A3B | 9.7 | — | 10.3 |
143
  | UI-TARS-72B-DPO | 24.0 | 25.8 | 27.1 |
144
  | UI-TARS-1.5-7B | 24.5 | 27.3 | 27.4 |
145
  | OpenCUA-7B *(Ours)* | 24.3 | 27.9 | 26.6 |
146
+ | OpenCUA-32B *(Ours)* | 29.7 | 34.1 | 34.8 |
147
+ | **OpenCUA-72B *(Ours)*** | **39.0** | **44.9** | **45.0** |
148
  </div>
149
 
150
  *OpenCUA scores are the mean of 3 independent runs.*
 
152
  ### GUI Grounding Performance
153
  <div align="center">
154
 
155
+ | **Model** | **OSWorld-G** | **ScreenSpot-V2** | **ScreenSpot-Pro** | **UI-Vision** |
156
+ |-------|-----------|---------------|----------------|----------|
157
+ | Qwen2.5-VL-7B | 31.4 | 88.8 | 27.6 | 0.85 |
158
+ | Qwen2.5-VL-32B | 46.5 | 87.0 | 39.4 | - |
159
+ | UI-TARS-72B | 57.1 | 90.3 | 38.1 | 25.5 |
160
+ | **OpenCUA-7B** | 55.3 | 92.3 | 50.0 | 29.7 |
161
+ | **OpenCUA-32B** | 59.6 | 93.4 | 55.3 | 33.3 |
162
+ | **OpenCUA-72B** | **59.2** | **92.9** | **60.8** | **37.3** |
 
163
  </div>
164
 
165
 
 
178
 
179
  # 🚀 Quick Start
180
  <div style="border-left: 6px solid #f28c28; background: #fff8e6; padding: 12px 16px; margin: 16px 0;">
181
+ <strong>⚠️ Important for Qwen-based Models (OpenCUA-7B, OpenCUA-32B, OpenCUA-72B):</strong>
182
+
183
  To align with our training infrastructure, we have modified the model in two places:
184
  <ul style="margin-top: 8px;">
185
 <li>1. Multimodal Rotary Position Embedding (M-RoPE) has been replaced with 1D RoPE.</li>
186
 <li>2. The tokenizer and chat template are the same as in Kimi-VL.</li>
187
+ <li>vLLM is supported via the <code>--trust-remote-code</code> flag. The tokenizer and chat template should be aligned if training the models.</li>
188
  </ul>
189
  </div>
190
 
191
 
192
  ## Installation & Download
193
 
194
+ First, install the required dependencies:
195
 
196
  ```bash
197
+ conda create -n opencua python=3.12
198
  conda activate opencua
199
+ pip install "openai>=1.0.0"
200
  ```
201
 
202
+ Download the model weights from Hugging Face (optional; vLLM can also download them automatically):
203
+ ```python
204
  from huggingface_hub import snapshot_download
205
  snapshot_download(
206
  repo_id="xlangai/OpenCUA-7B",
207
+ local_dir="OpenCUA-7B",
208
+ local_dir_use_symlinks=False
209
  )
210
  ```
211
 
212
+ ## 🎯 GUI Grounding
213
+
214
+ First, start the vLLM server:
215
+
216
+ ```bash
217
+ vllm serve xlangai/OpenCUA-7B \
218
+ --trust-remote-code \
219
+ --served-model-name opencua-7b \
220
+ --host 0.0.0.0 \
221
+ --port 8000
222
+ ```
223
 
224
+ Then run the following code to test GUI grounding:
225
 
226
  ```python
227
  import base64
228
+ from openai import OpenAI
229
+
230
+ # vLLM server configuration
231
+ VLLM_BASE_URL = "http://localhost:8000/v1"
232
+ MODEL_NAME = "opencua-7b" # Should match --served-model-name in vllm serve
233
 
234
  def encode_image(image_path: str) -> str:
235
+ """Encode image to base64 string."""
236
  with open(image_path, "rb") as f:
237
  return base64.b64encode(f.read()).decode()
238
 
239
+ def run_grounding(image_path: str, instruction: str) -> str:
240
+ """Run GUI grounding inference via vLLM."""
241
+ client = OpenAI(base_url=VLLM_BASE_URL, api_key="EMPTY")
 
 
 
 
 
 
 
 
 
242
 
 
 
243
  system_prompt = (
244
  "You are a GUI agent. You are given a task and a screenshot of the screen. "
245
  "You need to perform a series of pyautogui actions to complete the task."
246
  )
247
+
248
  messages = [
249
  {"role": "system", "content": system_prompt},
250
  {
251
  "role": "user",
252
  "content": [
253
+ {
254
+ "type": "image_url",
255
+ "image_url": {"url": f"data:image/png;base64,{encode_image(image_path)}"}
256
+ },
257
  {"type": "text", "text": instruction},
258
  ],
259
  },
260
  ]
 
261
 
262
+ response = client.chat.completions.create(
263
+ model=MODEL_NAME,
264
+ messages=messages,
265
+ max_tokens=512,
266
+ temperature=0,
267
  )
268
+
269
+ return response.choices[0].message.content
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
270
 
271
  # Example usage
 
272
  image_path = "screenshot.png"
273
  instruction = "Click on the submit button"
274
 
275
+ result = run_grounding(image_path, instruction)
 
 
 
 
 
 
276
  print("Model output:", result)
277
  ```
278
 
279
  <div style="border-left: 6px solid #9ca3af; background: #f5f5f5; padding: 12px 16px; margin: 16px 0;">
280
+ <em>Expected result:</em>
+
+ ```python
+ pyautogui.click(x=1443, y=343)
+ ```
 
 
281
  </div>
282
 
283
+ You can also run the grounding examples in [OpenCUA/model/inference/](https://github.com/xlang-ai/OpenCUA/blob/main/model/inference/):
284
+ ```bash
285
  cd ./model/inference/
286
+
287
+ # vLLM (requires running vLLM server first)
288
+ python vllm_inference.py
289
+
290
+ # HuggingFace Transformers
291
  python huggingface_inference.py
292
  ```
293
 
294
  ## 🖥️ Computer Use Agent
295
 **[OpenCUAAgent](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/opencua_agent.py)** is developed in the [OSWorld](https://github.com/xlang-ai/OSWorld) environment based on OpenCUA models. It iteratively perceives the environment via screenshots, produces a reflective long CoT as its inner monologue, and predicts the next action to execute. OpenCUAAgent uses 3 images in total and the L2 CoT format by default. A minimal sketch of this loop is shown below.
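The sketch below is our own simplification, not the OSWorld `opencua_agent.py` code: it captures a screenshot with `pyautogui`, queries the vLLM server from Quick Start, and keeps roughly the last three screenshots in context. It only prints the predicted actions instead of executing them, since running model output verbatim should happen only inside a sandboxed VM such as OSWorld. The `terminate` stop signal is an assumption, not the agent's actual action space.

```python
# Simplified perceive-reason-act loop (illustrative; not the OSWorld agent code).
import base64
import io

import pyautogui
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def screenshot_b64() -> str:
    """Capture the screen and return it as a base64-encoded PNG."""
    buf = io.BytesIO()
    pyautogui.screenshot().save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode()

def run_agent(task: str, max_steps: int = 15) -> None:
    history: list[dict] = []  # alternating user/assistant turns from earlier steps
    for _ in range(max_steps):
        user_turn = {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{screenshot_b64()}"}},
                {"type": "text", "text": task},
            ],
        }
        # Keeping the last two exchanges plus the new turn gives ~3 screenshots of context.
        messages = history[-4:] + [user_turn]
        reply = client.chat.completions.create(
            model="opencua-7b", messages=messages, max_tokens=1024, temperature=0
        ).choices[0].message.content
        print(reply)  # a real agent would execute this action inside a sandboxed VM
        history += [user_turn, {"role": "assistant", "content": reply}]
        if "terminate" in reply.lower():  # assumed stop signal; adapt to your action space
            break
```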
296
 
297
+ Command for running OpenCUA-7B in OSWorld:
298
  ```
299
  python run_multienv_opencua.py \
300
  --headless \
301
  --observation_type screenshot \
302
+ --model OpenCUA-7B \
303
  --result_dir ./results --test_all_meta_path evaluation_examples/test_all_no_gdrive.json \
304
  --max_steps 100 \
305
  --num_envs 30 \
306
  --coordinate_type qwen25
307
  ```
 
 
 
308
 
309
  ---
310
 
 
314
  <img src="https://cdn-uploads.huggingface.co/production/uploads/67b327cdd4665a0448eef7d5/dw5k183ucDSB2SZuS5f2V.png" width="400" alt="AgentNet Dataset Domain Distribution">
315
  </div>
316
 
317
+ AgentNet is the first large-scale desktop computer-use agent trajectory dataset, containing 22.6K human-annotated computer-use tasks across Windows, macOS, and Ubuntu systems.
318
 
319
  👉 **[AgentNet Huggingface Dataset](https://huggingface.co/datasets/xlangai/AgentNet)**
320
 
 
336
  </div>
337
 
338
 
339
+ Our **AgentNetTool** is a cross-platform GUI recorder that runs unobtrusively on annotators' machines. It captures synchronized **screen video**, **mouse/keyboard events**, and **accessibility trees**, then provides an in-browser UI for reviewing, trimming, and submitting demonstrations. AgentNet Tool is available on Windows, macOS and Ubuntu.
340
 
341
  👉 **[AgentNetTool Document](https://agentnet-tool.xlang.ai/)**
342
 
343
 
344
 
345
  ## 2 DataProcessor – Action Reduction & State–Action Matching
346
+ Raw demonstrations can contain thousands of low-level events that are too dense for model training.
347
  The **DataProcessor** module (`./data/data-process/`) performs two key steps:
348
 
349
+ 1. **Action Reduction** — merges granular signals into concise, semantically meaningful PyAutoGUI actions (e.g., collapsing mouse moves → click, coalescing scrolls, grouping key-press sequences into text or hotkeys).
350
  2. **State–Action Matching** — aligns every reduced action with the *last visually distinct frame* **before** the action begins, avoiding future-information leakage and yielding compact state–action pairs.
351
 
352
  These processed trajectories underlie all downstream training and evaluation.
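To make the matching step concrete, here is a toy sketch (our own illustration, not the released DataProcessor code): each reduced action is paired with the index of the last frame recorded before the action begins.

```python
# Toy illustration of state-action matching: pair each reduced action with the
# last frame captured strictly before the action starts, so no future frames leak in.
from bisect import bisect_left

def match_states(frame_times: list[float], action_times: list[float]) -> list[int]:
    """Return, for each action start time, the index of the last frame before it."""
    indices = []
    for t in action_times:
        i = bisect_left(frame_times, t) - 1  # index of the last frame earlier than t
        indices.append(max(i, 0))
    return indices

# match_states([0.0, 0.5, 1.0, 1.5], [0.7, 1.6])  ->  [1, 3]
```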
 
354
  ---
355
 
356
  ## 3 CoTGenerator – Synthesizing Reflective Long Chain-of-Thought Inner Monologue
357
+ To boost robustness and interpretability, we augment each trajectory with **reflective long Chain-of-Thought (CoT) reasoning**.
358
  The **CoTGenerator** pipeline (`./data/cot-generator/`) synthesizes step-level reflections that:
359
 
360
  * reflect on the previous action,
361
+ * explain *why* an action is chosen given the current observation and history,
362
+ * note potential alternative actions, and
363
  * forecast the expected next state.
364
 
365
  Empirically, models trained with these rich CoTs scale better with data and generalize across unseen applications.
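As a rough illustration of what one such step looks like, here is our own sketch of the four components listed above; the field names and serialization are illustrative, not the exact CoTGenerator output format.

```python
# Sketch of a reflective CoT step; the schema is illustrative, not the released format.
from dataclasses import dataclass

@dataclass
class ReflectiveStep:
    reflection: str    # what the previous action achieved or why it failed
    reasoning: str     # why the next action is chosen given observation and history
    alternatives: str  # other actions that were considered
    expectation: str   # forecast of the next state
    action: str        # executable call, e.g. "pyautogui.click(x=960, y=324)"

    def to_text(self) -> str:
        """Serialize the inner monologue followed by the action."""
        return (
            f"Reflection: {self.reflection}\n"
            f"Reasoning: {self.reasoning}\n"
            f"Alternatives: {self.alternatives}\n"
            f"Expectation: {self.expectation}\n"
            f"Action: {self.action}"
        )
```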
 
376
 
377
  👉 See **[AgentNetBench/README.md](./evaluation/agentnetbench/README.md)** for usage instructions.
378
 
 
 
 
 
 
 
 
 
 
379
# Acknowledgments
380
  <p>
381
+ We thank Yu Su, Caiming Xiong, and the anonymous reviewers for their insightful discussions and valuable feedback.
382
+ We are grateful to Moonshot AI for providing training infrastructure and annotated data.
383
+ We also sincerely appreciate Hao Yang, Zhengtao Wang, and Yanxu Chen from the Kimi Team for their strong infrastructure support and helpful guidance.
384
+ We thank Chong Peng, Taofeng Xue, and Qiumian Huang from the <a href="https://github.com/meituan/EvoCUA" target="_blank">Meituan EvoCUA Team</a> for their contributions to vLLM integration.
385
+ The development of our tool is based on the open-source projects <a href="https://github.com/TheDuckAI/DuckTrack" target="_blank">DuckTrack</a> and <a href="https://github.com/OpenAdaptAI/OpenAdapt" target="_blank">OpenAdapt</a>.
386
 We are very grateful for their commitment to the open-source community. Finally, we extend our deepest thanks to all annotators for their tremendous effort and contributions to this project.
387
  </p>
388
 
 
392
 
393
  ## Research Use and Disclaimer
394
 
395
+ OpenCUA models are intended for **research and educational purposes only**.
396
 
397
  ### Prohibited Uses
398
  - The model may **not** be used for any purpose or activity that violates applicable laws or regulations in any jurisdiction
 
406
  ## Important Notes on Coordinate Systems
407
  <div style="border-left: 6px solid #9ca3af; background: #f5f5f5; padding: 12px 16px; margin: 16px 0;">
408
  <ul style="margin: 0;">
 
 
409
  <li><strong><code>OpenCUA/OpenCUA-7B</code></strong> – Absolute coordinates</li>
410
  <li><strong><code>OpenCUA/OpenCUA-32B</code></strong> – Absolute coordinates</li>
411
+ <li><strong><code>OpenCUA/OpenCUA-72B</code></strong> – Absolute coordinates</li>
412
  </ul>
413
  </div>
414
 
415
+ **OpenCUA models output absolute coordinates after smart resize:**
416
+
417
+ ```python
418
+ # Example output: pyautogui.click(x=960, y=324)
419
+ # These are coordinates on the smart-resized image, not the original image
420
+
421
+ # Convert to original image coordinates:
422
+ # Please refer to the smart_resize function in: https://github.com/huggingface/transformers/blob/67ddc82fbc7e52c6f42a395b4a6d278c55b77a39/src/transformers/models/qwen2_vl/image_processing_qwen2_vl.py#L55
423
+ def qwen25_smart_resize_to_absolute(model_x, model_y, original_width, original_height):
424
+ # First, calculate the smart-resized dimensions
425
+ resized_height, resized_width = smart_resize(original_height, original_width, factor = 28, min_pixels = 3136, max_pixels = 12845056)
426
+
427
+ # Convert model output to relative coordinates (smart resize preserves the aspect ratio)
428
+ rel_x = model_x / resized_width
429
+ rel_y = model_y / resized_height
430
+
431
+ # Then convert to absolute coordinates on original image
432
+ abs_x = int(rel_x * original_width)
433
+ abs_y = int(rel_y * original_height)
434
+ return abs_x, abs_y
435
+ ```
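For illustration, here is a hedged end-to-end sketch that parses a predicted action string and maps its coordinates back onto the original screenshot. It reuses `qwen25_smart_resize_to_absolute` from the block above and assumes `smart_resize` has been copied from the linked implementation; the regex and the example file name are ours, not part of the official pipeline.

```python
# Illustrative helper (not the official pipeline): extract (x, y) from a predicted
# pyautogui call and convert them to coordinates on the original screenshot.
import re
from PIL import Image

def map_action_to_original(action: str, image_path: str) -> tuple[int, int]:
    match = re.search(r"x=(\d+),\s*y=(\d+)", action)
    if match is None:
        raise ValueError(f"No coordinates found in: {action}")
    model_x, model_y = int(match.group(1)), int(match.group(2))
    width, height = Image.open(image_path).size
    # qwen25_smart_resize_to_absolute is defined above and relies on smart_resize.
    return qwen25_smart_resize_to_absolute(model_x, model_y, width, height)

# e.g. map_action_to_original("pyautogui.click(x=960, y=324)", "screenshot.png")
```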
 
 
 
 
 
 
 
 
 
 
 
 
 
436
 
437
  <div style="border-left: 6px solid #9ca3af; background: #f5f5f5; padding: 12px 16px; margin: 16px 0;">
438
  <strong>Understanding Smart Resize for Qwen2.5-based Models:</strong>
439
  <p style="margin: 8px 0 0;">
440
+ The Qwen2.5-VL models use a "smart resize" preprocessing that maintains aspect ratio while fitting within pixel constraints.
441
  For coordinate conversion, you need the smart resize function from the
442
  <a href="https://github.com/QwenLM/Qwen2.5-VL/blob/d2240f11656bfe404b9ba56db4e51cd09f522ff1/qwen-vl-utils/src/qwen_vl_utils/vision_process.py#L60">
443
  official Qwen2.5-VL implementation</a>.
 
450
 
451
  ```bibtex
452
  @misc{wang2025opencuaopenfoundationscomputeruse,
453
+ title={OpenCUA: Open Foundations for Computer-Use Agents},
454
  author={Xinyuan Wang and Bowen Wang and Dunjie Lu and Junlin Yang and Tianbao Xie and Junli Wang and Jiaqi Deng and Xiaole Guo and Yiheng Xu and Chen Henry Wu and Zhennan Shen and Zhuokai Li and Ryan Li and Xiaochuan Li and Junda Chen and Boyuan Zheng and Peihang Li and Fangyu Lei and Ruisheng Cao and Yeqiao Fu and Dongchan Shin and Martin Shin and Jiarui Hu and Yuyan Wang and Jixuan Chen and Yuxiao Ye and Danyang Zhang and Dikang Du and Hao Hu and Huarong Chen and Zaida Zhou and Haotian Yao and Ziwei Chen and Qizheng Gu and Yipu Wang and Heng Wang and Diyi Yang and Victor Zhong and Flood Sung and Y. Charles and Zhilin Yang and Tao Yu},
455
  year={2025},
456
  eprint={2508.09123},
457
  archivePrefix={arXiv},
458
  primaryClass={cs.AI},
459
+ url={https://arxiv.org/abs/2508.09123},
460
  }
461
  ```
462
 
463
+ </div>