Model Card for VGPO-RL-7B

📖 Overview of VGPO

Standard RLVR methods treat every generated token equally, broadcasting a single reward signal indiscriminately. This leads to signal dilution: generic text tokens receive the same reinforcement as critical visually-grounded reasoning steps. Meanwhile, temporal visual forgetting causes attention to visual inputs to progressively decay as reasoning chains extend.

VGPO addresses these issues through three key mechanisms:

  • Visual Attention Compensation (VAC): Uses the inherent hidden-state similarity between generated tokens and image tokens as a Visual Focus Score to localize visual activations without external supervision. A progressive incentive schedule counteracts temporal visual forgetting in later reasoning steps.
  • Intra-Trajectory Re-weighting: At the token level, dynamically re-weights advantages by visual focus scores to amplify learning from visually-grounded tokens.
  • Inter-Trajectory Re-weighting: At the trajectory level, prioritizes rollouts with superior visual accumulation, favoring trajectories that sustain consistent visual grounding.
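The token-level mechanisms above can be sketched roughly as follows. This is a minimal illustration, not the released implementation: the cosine-similarity focus score, the linear progressive schedule, and the centered re-weighting form are all assumptions, and the function names and `alpha`/`gamma` parameters are hypothetical (inter-trajectory re-weighting is omitted for brevity).

```python
import numpy as np

def visual_focus_scores(token_hidden: np.ndarray, image_hidden: np.ndarray) -> np.ndarray:
    """Proxy Visual Focus Score: cosine similarity between each generated
    token's hidden state and the mean image-token hidden state."""
    img = image_hidden.mean(axis=0)
    img = img / np.linalg.norm(img)
    tok = token_hidden / np.linalg.norm(token_hidden, axis=1, keepdims=True)
    return tok @ img  # shape: (num_tokens,)

def progressive_incentive(num_tokens: int, gamma: float = 0.5) -> np.ndarray:
    """Hypothetical linear schedule that grows with token position, so later
    reasoning steps (prone to temporal visual forgetting) get a larger incentive."""
    t = np.arange(num_tokens) / max(num_tokens - 1, 1)
    return 1.0 + gamma * t

def reweight_advantages(advantages: np.ndarray, focus: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Intra-trajectory re-weighting (assumed form): amplify tokens whose focus
    score exceeds the trajectory mean, dampen the rest."""
    weights = 1.0 + alpha * (focus - focus.mean())
    return advantages * weights
```

Because the weights are centered around 1, the mean advantage of a trajectory is preserved; only its distribution shifts toward visually-grounded tokens.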

📕 Training Datasets

| Split | Dataset | Link |
|-------|---------|------|
| Train | ViRL39K | PAPOGalaxy/PAPO_ViRL39K_train |
| Val | MMK12 | PAPOGalaxy/PAPO_MMK12_test |

📊 Evaluation

We follow the evaluation script of Look-Back. All results are reported as average accuracy with inference temperature 0.0 (greedy decoding).

Supported Evaluation Benchmarks

| Benchmark | Focus Domain |
|-----------|--------------|
| MathVista | General Mathematical & Geometric Reasoning |
| MathVerse | General Mathematical & Geometric Reasoning |
| WeMath | General Mathematical & Geometric Reasoning |
| MMK12 | General Mathematical & Geometric Reasoning |
| GeoMath | General Mathematical & Geometric Reasoning |
| Geometry3K | General Mathematical & Geometric Reasoning |
| LogicVista | Vision-dependent Multimodal Reasoning |
| SuperClevr Counting | Vision-dependent Multimodal Reasoning |
| MMMU-Pro | Vision-dependent Multimodal Reasoning |
| MathVerse-V | Vision-dependent Multimodal Reasoning |

✍️ Citation

If you find this codebase useful in your research, please consider giving us a star ⭐ and citing our work 📝:

@article{wang2026vgpo,
  title={Visually-Guided Policy Optimization for Multimodal Reasoning}, 
  author={Zengbin Wang and Feng Xiong and Liang Lin and Xuecai Hu and Yong Wang and Yanlin Wang and Man Zhang and Xiangxiang Chu},
  journal={arXiv preprint arXiv:2604.09349},
  year={2026}
}

❤️ Acknowledgements

Our codebase is built upon EasyR1, VPPO-RL, PAPO, and Look-Back. We thank the authors for their excellent work.
