RL4RLM-GRPO-v4: Corrected GRPO with Token-Level KL

LoRA adapter for Qwen3-1.7B trained as a Recursive Language Model (RLM) — a model that writes Python code to decompose and solve long-context tasks via a persistent REPL environment.
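The RLM loop described above (the model emits Python, the code runs in a persistent REPL, and the output becomes the next observation) can be sketched as follows. This is an illustrative sketch only; the actual execution harness, variable names, and context format used in training are assumptions, not taken from this card.

```python
import io
import contextlib

def run_in_repl(code: str, namespace: dict) -> str:
    """Execute a model-emitted code snippet in a persistent namespace and
    capture its stdout, which is fed back to the model as the next turn."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, namespace)  # namespace persists across calls, like a REPL
    return buf.getvalue()

# Hypothetical long-context task: the document lives in the REPL as a variable,
# and the model decomposes it with code instead of attending over it directly.
env = {"context": "needle-A ... filler ... needle-B"}
print(run_in_repl("chunks = context.split('...')\nprint(len(chunks))", env))
```

Because `env` persists between calls, later snippets can reuse `chunks` without re-reading the full context.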

Paper

Training Native Recursive Language Models — CS234 Final Project, Stanford University (Winter 2026)

Training Details

  • Method: GRPO with conditioned log-probs and token-level KL via frozen reference model
  • Data: On-policy rollouts, K=8 trajectories per prompt
  • Training: 30 steps, lr 5e-6, beta_KL=0.05, from STaR checkpoint
  • Key result: 85.1% Multi-NIAH, stable training dynamics, no degenerate outputs
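The two core pieces of the recipe above, group-relative advantages over the K=8 rollouts per prompt and a per-token KL penalty against the frozen reference model, can be sketched as below. The per-group normalization and the k3 KL estimator are common GRPO implementation choices and are assumptions here; the card does not specify the exact estimator.

```python
import math

def group_advantages(rewards):
    """Group-relative advantages: normalize the K rollout rewards for one
    prompt by their group mean and std (K=8 in this run)."""
    mu = sum(rewards) / len(rewards)
    var = sum((r - mu) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) + 1e-8  # guard against zero-variance groups
    return [(r - mu) / std for r in rewards]

def token_kl(logp_cur, logp_ref):
    """Per-token KL penalty via the non-negative k3 estimator:
    exp(ref - cur) - (ref - cur) - 1, computed from per-token log-probs
    under the policy and the frozen reference model."""
    return [math.exp(q - p) - (q - p) - 1.0 for p, q in zip(logp_cur, logp_ref)]

# Binary task rewards for one group of K=8 trajectories (illustrative values)
adv = group_advantages([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0])
# Per-token objective is then roughly adv - beta * kl, with beta_KL = 0.05
kl = token_kl([-1.2, -0.7], [-1.0, -0.9])
```

Keeping the KL at the token level (rather than a single sequence-level term) penalizes divergence exactly where it occurs, which is consistent with the "stable training dynamics, no degenerate outputs" result reported above.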

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model, then attach the LoRA adapter on top of it
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-1.7B")
model = PeftModel.from_pretrained(base, "omar81939/rl4rlm-grpo-v4")
tokenizer = AutoTokenizer.from_pretrained("omar81939/rl4rlm-grpo-v4")

Results

Model     NIAH (100)   Multi-NIAH (24)   DocClassify (20)   Avg
Base      72.0         38.3              80.3               63.5
SFT       90.0         57.9              82.4               76.8
STaR      87.0         58.4              83.4               76.3
DPO       83.0         87.9              82.6               84.5
GRPO-v4   82.0         85.1              83.2               83.4

LoRA Config

  • Rank: 16, Alpha: 32, Dropout: 0.05
  • Target modules: all attention and MLP projections
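The configuration above corresponds to a `peft` `LoraConfig` like the following. The specific `target_modules` names assume standard Qwen3 projection naming and should be verified against the adapter's `adapter_config.json`.

```python
from peft import LoraConfig

# Sketch of the adapter config described above (module names assumed)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```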