RL4RLM-GRPO-v4: Corrected GRPO with Token-Level KL

LoRA adapter for Qwen3-1.7B trained as a Recursive Language Model (RLM) — a model that writes Python code to decompose and solve long-context tasks via a persistent REPL environment.
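The RLM loop described above (the model emits Python, the code runs in a persistent REPL, and the output becomes the next observation) can be sketched as follows. This is an illustrative sketch only; the actual execution harness, variable names, and context format used in training are assumptions, not taken from this card.

```python
import io
import contextlib

def run_in_repl(code: str, namespace: dict) -> str:
    """Execute a model-emitted code snippet in a persistent namespace and
    capture its stdout, which is fed back to the model as the next turn."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, namespace)  # namespace persists across calls, like a REPL
    return buf.getvalue()

# Hypothetical long-context task: the document lives in the REPL as a variable,
# and the model decomposes it with code instead of attending over it directly.
env = {"context": "needle-A ... filler ... needle-B"}
print(run_in_repl("chunks = context.split('...')\nprint(len(chunks))", env))
```

Because `env` persists between calls, later snippets can reuse `chunks` without re-reading the full context.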

Paper

Training Native Recursive Language Models — CS234 Final Project, Stanford University (Winter 2026)

Training Details

  • Method: GRPO with conditioned log-probs and token-level KL via frozen reference model
  • Data: On-policy rollouts, K=8 trajectories per prompt
  • Training: 30 steps, lr 5e-6, beta_KL=0.05, from STaR checkpoint
  • Key result: 85.1% Multi-NIAH, stable training dynamics, no degenerate outputs
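The two core pieces of the recipe above, group-relative advantages over the K=8 rollouts per prompt and a per-token KL penalty against the frozen reference model, can be sketched as below. The per-group normalization and the k3 KL estimator are common GRPO implementation choices and are assumptions here; the card does not specify the exact estimator.

```python
import math

def group_advantages(rewards):
    """Group-relative advantages: normalize the K rollout rewards for one
    prompt by their group mean and std (K=8 in this run)."""
    mu = sum(rewards) / len(rewards)
    var = sum((r - mu) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) + 1e-8  # guard against zero-variance groups
    return [(r - mu) / std for r in rewards]

def token_kl(logp_cur, logp_ref):
    """Per-token KL penalty via the non-negative k3 estimator:
    exp(ref - cur) - (ref - cur) - 1, computed from per-token log-probs
    under the policy and the frozen reference model."""
    return [math.exp(q - p) - (q - p) - 1.0 for p, q in zip(logp_cur, logp_ref)]

# Binary task rewards for one group of K=8 trajectories (illustrative values)
adv = group_advantages([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0])
# Per-token objective is then roughly adv - beta * kl, with beta_KL = 0.05
kl = token_kl([-1.2, -0.7], [-1.0, -0.9])
```

Keeping the KL at the token level (rather than a single sequence-level term) penalizes divergence exactly where it occurs, which is consistent with the "stable training dynamics, no degenerate outputs" result reported above.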

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model, then attach the LoRA adapter on top of it
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-1.7B")
model = PeftModel.from_pretrained(base, "omar81939/rl4rlm-grpo-v4")
tokenizer = AutoTokenizer.from_pretrained("omar81939/rl4rlm-grpo-v4")

Results

Model     NIAH (100)   Multi-NIAH (24)   DocClassify (20)   Avg
Base      72.0         38.3              80.3               63.5
SFT       90.0         57.9              82.4               76.8
STaR      87.0         58.4              83.4               76.3
DPO       83.0         87.9              82.6               84.5
GRPO-v4   82.0         85.1              83.2               83.4

LoRA Config

  • Rank: 16, Alpha: 32, Dropout: 0.05
  • Target modules: all attention and MLP projections
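The configuration above corresponds to a `peft` `LoraConfig` like the following. The specific `target_modules` names assume standard Qwen3 projection naming and should be verified against the adapter's `adapter_config.json`.

```python
from peft import LoraConfig

# Sketch of the adapter config described above (module names assumed)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```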