Abstract
Trust Region Q-Adjoint Matching (TRQAM) addresses instability in off-policy reinforcement learning by adaptively controlling path-space KL divergence through projected dual descent, enabling stable fine-tuning of pretrained flow policies.
Off-policy reinforcement learning of pretrained flow policies remains challenging because the multi-step sampling process makes optimization unstable. Recently, Q-learning with Adjoint Matching (QAM) addressed this issue by reformulating the problem as a memoryless stochastic optimal control (SOC) problem with a learned critic. However, QAM inherits a fundamental fragility of critic-guided improvement: small critic errors are amplified when critics are ill-conditioned, often leading to model collapse. This paper introduces Trust Region Q-Adjoint Matching (TRQAM), a stable off-policy fine-tuning algorithm that adaptively controls the path-space KL divergence from the pretrained flow policy through projected dual descent. Specifically, we optimize the trust-region parameter λ in the SOC dynamics and show theoretically that the path-space KL can be expressed as a closed-form function of λ. As a result, our method can precisely control the deviation from the pretrained flow policy, achieving stable off-policy RL. In experiments on 50 OGBench tasks, TRQAM consistently outperforms prior methods in both offline RL and offline-to-online RL. In particular, TRQAM achieves an overall success rate of 68% in offline RL, substantially improving on the strongest baseline at 46%.
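To make the projected dual descent idea in the abstract concrete, here is a minimal Python sketch of a generic dual update on a scalar λ under a KL budget. It is not the paper's implementation: the function names, the step size, the projection interval, and especially the toy map `toy_kl` (KL = 1/λ²) are assumptions made purely for illustration, since the abstract does not state the closed-form KL(λ). The sketch also assumes that a larger λ keeps the policy closer to the pretrained one.

```python
from typing import Callable


def toy_kl(lam: float) -> float:
    """Hypothetical stand-in for the closed-form KL(lambda); NOT the paper's formula.

    Assumption: KL decreases monotonically as lambda grows.
    """
    return 1.0 / (lam ** 2)


def projected_dual_descent(
    kl_of_lambda: Callable[[float], float],
    kl_budget: float,
    lam_init: float = 1.0,
    step_size: float = 0.5,
    lam_min: float = 1e-3,
    lam_max: float = 1e3,
    num_steps: int = 500,
) -> float:
    """Adjust lambda until KL(lambda) matches the budget (generic sketch)."""
    lam = lam_init
    for _ in range(num_steps):
        # Positive when the KL budget is exceeded, negative when there is slack.
        violation = kl_of_lambda(lam) - kl_budget
        # Gradient step on the dual variable: raise lambda when over budget.
        lam = lam + step_size * violation
        # Projection onto the feasible interval [lam_min, lam_max].
        lam = min(max(lam, lam_min), lam_max)
    return lam


if __name__ == "__main__":
    lam_star = projected_dual_descent(toy_kl, kl_budget=0.25)
    print(f"lambda = {lam_star:.4f}, KL = {toy_kl(lam_star):.4f}")
```

With the toy map, the update settles where KL(λ) equals the budget. In an actual training loop the same update would presumably be interleaved with critic and policy updates, which this sketch leaves out.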
Community
TRQAM internalizes the trust region as a scalar λ inside the flow-policy sampling SDE — an exact Girsanov path-space KL identity (Thm 1) makes the KL budget structurally enforceable via dual descent. 68% vs 46% on 50 OGBench tasks. 👇 blog & code
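For readers unfamiliar with the Girsanov identity mentioned above, the standard textbook form is sketched below. This is the generic result for two SDEs sharing the same diffusion coefficient and initial distribution, not the paper's Theorem 1; the symbols b, σ, u, P, and P^u are notation assumed here, and how λ enters the drift is specific to the paper and is not reproduced.

$$
\begin{aligned}
\mathbb{P}:&\quad dX_t = b(X_t, t)\,dt + \sigma(t)\,dW_t, \\
\mathbb{P}^{u}:&\quad dX_t = \bigl(b(X_t, t) + \sigma(t)\,u(X_t, t)\bigr)\,dt + \sigma(t)\,dW_t, \\
\mathrm{KL}\bigl(\mathbb{P}^{u} \,\|\, \mathbb{P}\bigr) &= \mathbb{E}_{\mathbb{P}^{u}}\!\left[\frac{1}{2}\int_0^1 \lVert u(X_t, t)\rVert^2\,dt\right].
\end{aligned}
$$

Under the usual integrability conditions, the path-space KL is therefore the expected control energy, which is why a scalar that rescales the control can give direct control over the KL.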
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Entropy-Regularized Adjoint Matching for Offline Reinforcement Learning (2026)
- Drifting Field Policy: A One-Step Generative Policy via Wasserstein Gradient Flow (2026)
- Discrete Flow Matching for Offline-to-Online Reinforcement Learning (2026)
- Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities (2026)
- Lagrangian Perturbation Diffusion Steering: Latent Reinforcement Learning for Generative Policies (2026)
- Aligning Flow Map Policies with Optimal Q-Guidance (2026)
- Q-Flow: Stable and Expressive Reinforcement Learning with Flow-Based Policy (2026)