Qwen3-4B-tau2-sft1

SFT checkpoint for tau2-bench tool-use tasks, trained from Qwen/Qwen3-4B-Instruct-2507 using the Slime tau2 training cookbook.

Training summary

Key hyperparameters (from examples/tau-bench/training_cookbook.md)

  • num_epoch=2
  • global_batch_size=16
  • rollout_batch_size=16
  • rollout_max_response_len=4096
  • max_tokens_per_gpu=12288
  • lr=1e-5 (cosine decay, warmup fraction 0.05)
  • weight_decay=0.01
  • loss_mask_type=qwen3
  • loss_type=sft_loss
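The learning-rate schedule above (peak 1e-5, cosine decay, 5% linear warmup) can be sketched as follows. This is an illustrative stand-alone function, not the framework's actual implementation; the function name and signature are hypothetical.

```python
import math

def lr_at(step, total_steps, peak_lr=1e-5, warmup_frac=0.05):
    """Linear warmup over the first warmup_frac of steps, then cosine decay to ~0.

    Illustrative sketch of the schedule described in the hyperparameter list;
    the real schedule is implemented inside the training framework.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Ramp linearly from peak_lr/warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr toward 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```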

Training command is in examples/tau-bench/tau2/run_sft.sh.

Evaluation (tau2-bench test split)

  • Metric: pass@1 (success rate over a single attempt per task)
  • Domains: airline, retail, telecom (test split, 100 tasks)
  • User simulator: gpt-4.1-mini
  • Settings: TAU2_USE_COMPRESSED_PROMPTS=0, TAU2_MAX_STEPS=100
  • Sampling: num_samples=1, temperature=0.0, top_p=1.0, top_k=20

Results

  • Overall pass@1: 0.40 (100 tasks)
  • By domain:
    • airline: 0.20 (20 tasks)
    • retail: 0.60 (40 tasks)
    • telecom: 0.30 (40 tasks)

Reproduce pass@1

# Start policy server
CUDA_VISIBLE_DEVICES=0,1 python3 -m sglang.launch_server \
  --model-path Jarrodbarnes/Qwen3-4B-tau2-sft1 \
  --host 0.0.0.0 --port 30000 --tp 2 --mem-fraction-static 0.70

# Eval
python3 examples/tau-bench/tau2/eval.py \
  --hf-checkpoint Jarrodbarnes/Qwen3-4B-tau2-sft1 \
  --sglang-url http://127.0.0.1:30000/generate \
  --domains airline,retail,telecom --task-split test \
  --num-samples 1 --temperature 0.0 --top-p 1.0 --top-k 20 \
  --output "${TAU_BENCH_OUT_DIR}/tau2/eval/sft_pass1.json"

Notes

  • eval.py reports pass@k (a task counts as solved if any of k attempts succeeds). The official tau2-bench leaderboard uses the stricter pass^k (a task counts only if all k attempts succeed).
  • Results are stochastic because the user simulator is itself an LLM; expect small run-to-run variance.
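The distinction between the two metrics in the note above can be made concrete. A minimal sketch (the function names are illustrative, not from eval.py), where each task's attempts are recorded as booleans:

```python
def pass_at_k(attempts_per_task, k):
    """pass@k: fraction of tasks with at least one success among k attempts."""
    return sum(any(a[:k]) for a in attempts_per_task) / len(attempts_per_task)

def pass_hat_k(attempts_per_task, k):
    """pass^k: fraction of tasks where all k attempts succeed (reliability)."""
    return sum(all(a[:k]) for a in attempts_per_task) / len(attempts_per_task)

# Hypothetical results for 3 tasks, 2 attempts each:
trials = [[True, True], [True, False], [False, False]]
# pass@2 = 2/3 (two tasks solved at least once), pass^2 = 1/3 (one task solved every time)
```

With k=1 the two metrics coincide, which is why the single-sample evaluation above reports only pass@1.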

Intended use

Research and reproduction of tau2-bench tool-use training. Not intended for deployment without additional safety evaluation.
