Qwen3-4B-tau2-sft1

SFT checkpoint for tau2-bench tool-use tasks, trained from Qwen/Qwen3-4B-Instruct-2507 using the Slime tau2 training cookbook.

Training summary

Key hyperparameters (from examples/tau-bench/training_cookbook.md)

  • num_epoch=2
  • global_batch_size=16
  • rollout_batch_size=16
  • rollout_max_response_len=4096
  • max_tokens_per_gpu=12288
  • lr=1e-5 (cosine decay, warmup fraction 0.05)
  • weight_decay=0.01
  • loss_mask_type=qwen3
  • loss_type=sft_loss
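The learning-rate schedule above (peak 1e-5, cosine decay, 5% linear warmup) can be sketched as follows. This is an illustrative stand-alone function, not the framework's actual implementation; the function name and signature are hypothetical.

```python
import math

def lr_at(step, total_steps, peak_lr=1e-5, warmup_frac=0.05):
    """Linear warmup over the first warmup_frac of steps, then cosine decay to ~0.

    Illustrative sketch of the schedule described in the hyperparameter list;
    the real schedule is implemented inside the training framework.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Ramp linearly from peak_lr/warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr toward 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```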

Training command is in examples/tau-bench/tau2/run_sft.sh.

Evaluation (tau2-bench test split)

  • Metric: pass@1 (success rate over a single attempt per task)
  • Domains: airline, retail, telecom (test split, 100 tasks)
  • User simulator: gpt-4.1-mini
  • Settings: TAU2_USE_COMPRESSED_PROMPTS=0, TAU2_MAX_STEPS=100
  • Sampling: num_samples=1, temperature=0.0, top_p=1.0, top_k=20

Results

  • Overall pass@1: 0.40 (100 tasks)
  • By domain:
    • airline: 0.20 (20 tasks)
    • retail: 0.60 (40 tasks)
    • telecom: 0.30 (40 tasks)

Reproduce pass@1

# Start policy server
CUDA_VISIBLE_DEVICES=0,1 python3 -m sglang.launch_server \
  --model-path Jarrodbarnes/Qwen3-4B-tau2-sft1 \
  --host 0.0.0.0 --port 30000 --tp 2 --mem-fraction-static 0.70

# Eval
python3 examples/tau-bench/tau2/eval.py \
  --hf-checkpoint Jarrodbarnes/Qwen3-4B-tau2-sft1 \
  --sglang-url http://127.0.0.1:30000/generate \
  --domains airline,retail,telecom --task-split test \
  --num-samples 1 --temperature 0.0 --top-p 1.0 --top-k 20 \
  --output "${TAU_BENCH_OUT_DIR}/tau2/eval/sft_pass1.json"

Notes

  • eval.py reports pass@k (a task counts as solved if any of k attempts succeeds). The official tau2-bench leaderboard uses the stricter pass^k (a task counts only if all k attempts succeed).
  • Results are stochastic because the user simulator is itself an LLM; expect small run-to-run variance.
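The distinction between the two metrics in the note above can be made concrete. A minimal sketch (the function names are illustrative, not from eval.py), where each task's attempts are recorded as booleans:

```python
def pass_at_k(attempts_per_task, k):
    """pass@k: fraction of tasks with at least one success among k attempts."""
    return sum(any(a[:k]) for a in attempts_per_task) / len(attempts_per_task)

def pass_hat_k(attempts_per_task, k):
    """pass^k: fraction of tasks where all k attempts succeed (reliability)."""
    return sum(all(a[:k]) for a in attempts_per_task) / len(attempts_per_task)

# Hypothetical results for 3 tasks, 2 attempts each:
trials = [[True, True], [True, False], [False, False]]
# pass@2 = 2/3 (two tasks solved at least once), pass^2 = 1/3 (one task solved every time)
```

With k=1 the two metrics coincide, which is why the single-sample evaluation above reports only pass@1.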

Intended use

Research and reproduction of tau2-bench tool-use training. Not intended for deployment without additional safety evaluation.
