# UPS Tool-Use Qwen2.5 7B
UPS Tool-Use Qwen2.5 7B is a QLoRA fine-tune of Qwen2.5-7B-Instruct for generating structured Hermes function-calling payloads for a UPS MCP server. It maps natural-language shipping and logistics requests to 18 UPS MCP tools covering tracking, rating, shipment creation, pickup workflows, address validation, paperless documents, landed cost, and location lookup.
This repository publishes the PEFT LoRA adapter at the repo root and a Q4_K_M GGUF export for local inference with llama.cpp-compatible runtimes such as Ollama.
This model is not affiliated with or endorsed by UPS.
## Model Details
- Developed by: Matthew Hans
- Model type: Qwen2.5-7B-Instruct causal language model with LoRA adapter fine-tuning
- Base model: Qwen/Qwen2.5-7B-Instruct
- Training base: unsloth/Qwen2.5-7B-Instruct-bnb-4bit
- Fine-tuning method: QLoRA with rank-stabilized LoRA
- Language: English
- Output format: Hermes FC v1 tool call blocks
- Local run: ups-tools-aug-2ep
- Evaluation period: April 25–28, 2026
## Intended Use

The model is intended to sit behind a UPS MCP server and produce one structured tool call per user request. A downstream orchestrator should parse the `<tool_call>` block, validate the arguments against the MCP tool schema, and then decide whether to call the UPS API.
```
<tool_call>
{"name": "track_package", "arguments": {"inquiryNumber": "1Z999AA10123456784"}}
</tool_call>
```
The system prompt should include the complete set of UPS MCP tool definitions in Hermes-compatible JSON, for example:
```
<tools>
[{"type": "function", "function": {"name": "track_package", "parameters": {...}}}]
</tools>
```
## Supported UPS MCP Tools

| Tool | Category |
|---|---|
| `track_package` | Tracking |
| `validate_address` | Address validation |
| `rate_shipment` | Rating |
| `create_shipment` | Shipping |
| `void_shipment` | Shipping |
| `recover_label` | Shipping |
| `get_time_in_transit` | Transit |
| `get_landed_cost_quote` | International landed cost |
| `upload_paperless_document` | Paperless documents |
| `push_document_to_shipment` | Paperless documents |
| `delete_paperless_document` | Paperless documents |
| `find_locations` | Locator |
| `rate_pickup` | Pickup |
| `schedule_pickup` | Pickup |
| `cancel_pickup` | Pickup |
| `get_pickup_status` | Pickup |
| `get_political_divisions` | Pickup metadata |
| `get_service_center_facilities` | Pickup metadata |
## Evaluation
The model was evaluated with a 9-axis stress-test suite (~2,150 inference calls) against four graders run in parallel on every case:
- Shape grader — required keys present in the tool call.
- Value-grounded grader — argument values match what the prompt actually said (tracking number, zip, address, country code, dates, file names, etc.).
- Nested-structural grader — deep paths in `request_body` are populated for `rate_shipment`, `create_shipment`, `recover_label`, and `get_time_in_transit`.
- Schema-typed grader — types and enums match the MCP schema; unknown parameters are flagged.

Cross-field constraints (UPS service ↔ packaging compatibility, international service ↔ different countries) are also checked. Confidence intervals are Wilson 95%; the conservative lower bound ("lo") is the figure to use for go/no-go decisions.
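The Wilson lower bounds in the tables below can be reproduced directly from the pass rate and sample size. A sketch (not the harness's own code):

```python
from math import sqrt

def wilson_lower(p_hat: float, n: int, z: float = 1.96) -> float:
    """Lower bound of the 95% Wilson score interval for a pass rate p_hat over n trials."""
    denom = 1 + z**2 / n
    center = p_hat + z**2 / (2 * n)
    margin = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return (center - margin) / denom

# 95.00% value-grounded over 900 strict-replay cases -> approximately 0.9338,
# matching the 93.38% reported in the headline table.
print(wilson_lower(0.95, 900))
```

The lower bound shrinks toward the point estimate as n grows, which is why the 360-prompt axes carry wider intervals than the 900-case replay.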
### Headline results
| Axis | n | Shape | Value-grounded | Wilson 95% lo (value) | Silent value-hallucination |
|---|---|---|---|---|---|
| Strict re-replay (templated, original test) | 900 | 97.78% | 95.00% | 93.38% | 2.78% |
| OOD-lexicon (cities/streets/companies disjoint from training) | 360 | 98.89% | 96.11% | 93.58% | 2.78% |
| Paraphrase stress (voice/email/sms/typo) | 360 | 91.67% | 65.28% | 60.22% | 26.39% |
| Sibling-tool ambiguity (6 confusable pair-families) | 120 | 90.83% | 78.33% | 70.15% | 13.33% |
| Negative / refusal | 80 | — | — | — | — (53.75% pass overall) |
| Multi-turn workflows (5–8 steps) | 30 | — | — | — | — (66.7% flow / 93.8% step) |
| Real-API acceptance (UPS CIE) | 48 | — | — | — | — (~95% on tools where CIE accepts the data) |
| Sampling variance (100 prompts × 5 samples T=0.7) | 500 | — | — | — | — (mode agreement 0.96, pass 0.924) |
| Natural-language eval (real customer phrasing) | 100 | 90.00% | 79.00% | 70.02% | 13.00% |
### How to read these numbers
The original "97.78% accurate" headline was a shape-only check on templated prompts. After applying value, structural, and schema graders to the same 900 prompts, 2.78% of supposedly-passing calls had wrong values that the original grader missed. On out-of-distribution lexical content the model generalizes well (96.1% value-grounded), but on paraphrased prompts in styles it never saw during training, value accuracy drops to 65.3% with a 26 pp silent-hallucination rate. On adversarial sibling-tool prompts, tool selection drops to 90.8%. The model is brittle to phrasing distribution shift, not to vocabulary or sampling temperature.
### Paraphrase stress by style
| Style | n | Shape | Value-grounded |
|---|---|---|---|
| voice (filler words, "uh", "you know") | 90 | 96.7% | 67.8% |
| email (greeting + signoff) | 90 | 95.6% | 68.9% |
| typo (2-3 misspellings) | 90 | 94.4% | 67.8% |
| sms (terse, lowercase, abbreviations) | 90 | 80.0% | 56.7% |
### Sibling-tool ambiguity by pair-family
| Family | Tools | n | Shape | Value-grounded |
|---|---|---|---|---|
| P | rate_pickup vs schedule_pickup | 15 | 100.0% | 80.0% |
| D | upload_paperless_document vs push_document_to_shipment | 15 | 93.3% | 80.0% |
| L | find_locations vs get_service_center_facilities | 30 | 93.3% | 73.3% |
| R | rate_shipment vs create_shipment vs get_time_in_transit | 30 | 90.0% | 76.7% |
| T | track_package vs recover_label vs void_shipment | 15 | 86.7% | 86.7% |
| G | get_political_divisions vs find_locations | 15 | 80.0% | 80.0% |
### Negative / refusal by category

Pass = no `<tool_call>` emitted (or, on a few conflicting cases, the model picks one of the explicitly allowed tools).
| Category | n | Pass rate |
|---|---|---|
| off_domain ("what's the weather") | 20 | 100.0% |
| refusal_bait ("ignore your tools") | 20 | 65.0% |
| conflicting ("track + void") | 20 | 45.0% |
| incomplete ("can you ship for me?") | 20 | 5.0% |
The 5% pass rate on incomplete prompts is the most important finding: when the user gives a partial request, the model fabricates the missing fields 19 times out of 20 instead of asking for clarification.
### Per-tool value-grounded accuracy on the original 900 cases
Five tools that scored 100% on the shape-only grader had hidden value-hallucination problems:
| Tool | Shape | Value-grounded | Hidden gap |
|---|---|---|---|
| upload_paperless_document | 100.0% | 78.0% | −22 pp |
| rate_pickup | 100.0% | 82.0% | −18 pp |
| schedule_pickup | 100.0% | 96.0% | −4 pp |
| get_landed_cost_quote | 100.0% | 96.0% | −4 pp |
| void_shipment | 100.0% | 98.0% | −2 pp |
| create_shipment | 80.0% | 80.0% | 0 (wrong tool, not value) |
| get_service_center_facilities | 80.0% | 80.0% | 0 (wrong tool) |
| (other 11 tools) | 100.0% | 100.0% | clean |
### Real-API acceptance on UPS Customer Integration Environment

The model was exercised end-to-end against the UPS CIE sandbox.
| Tool | api_ok / n | Notes |
|---|---|---|
| track_package | 6/6 | — |
| get_time_in_transit | 6/6 | — |
| get_political_divisions | 6/6 | — |
| get_service_center_facilities | 5/6 | sandbox refused 1 city |
| validate_address | 0/6 | sandbox state allowlist |
| find_locations | 0/6 | sandbox city allowlist |
| rate_pickup | 0/6 | sandbox postal/state allowlist |
| rate_shipment | 0/6 | model: service+packaging incompatible (e.g., NDA Air with Pallet) |
The cases where CIE refused for sandbox-allowlist reasons (validate_address, find_locations, rate_pickup) are not model failures — the same JSON would be accepted by production UPS. The genuine model issue is rate_shipment's cross-field service+packaging combinations.
### Multi-turn workflows
Flow pass = every step in a 5–8 step workflow correct. Step pass = individual steps. The 27 pp gap (66.7% vs 93.8%) reflects error cascading: one wrong step early in a workflow takes the whole chain down. Six of 30 workflows include a deliberately-impossible step (e.g., recover label after voiding); the model attempts a tool call instead of refusing on those, which matches the 5% incomplete-prompt refusal pattern above.
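The size of that gap is roughly what independent per-step errors would predict. A back-of-the-envelope check (the independence assumption is mine, not a claim from the harness):

```python
# If each step failed independently at the measured per-step rate,
# a workflow passes only when every one of its steps passes.
step_pass = 0.938
avg_steps = 161 / 30  # 161 graded steps across 30 workflows, ~5.4 steps each
expected_flow_pass = step_pass ** avg_steps
print(f"{expected_flow_pass:.3f}")  # ~0.709, in the same range as the observed 0.667
```

The observed flow rate sits slightly below the independence estimate, consistent with errors compounding once a workflow goes off the rails.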
### Sampling stability
Re-running 100 prompts (mix of OOD + paraphrase + sibling) five times each at temperature 0.7:
- Average per-prompt mode-tool agreement: 0.96
- Average per-prompt pass rate: 0.924
- Prompts where every sample emitted the same tool: ~78%
The model's behavior is stable to sampling noise. The capability is real; the fragility is to input distribution shift (paraphrasing, partial info), not to decoding stochasticity.
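The per-prompt mode-tool agreement metric can be computed from the five samples per prompt; a sketch with a hypothetical helper (not the harness's own implementation):

```python
from collections import Counter

def mode_agreement(tool_names: list[str]) -> float:
    """Fraction of samples that picked the most common tool for one prompt."""
    counts = Counter(tool_names)
    return counts.most_common(1)[0][1] / len(tool_names)

samples = ["track_package"] * 4 + ["recover_label"]
print(mode_agreement(samples))  # 4 of 5 samples agree -> 0.8
```

Averaging this value over all 100 prompts yields the 0.96 agreement figure reported above.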
## Training Data
The dataset is synthetic and generated from UPS MCP server tool schemas.
| Split | Samples |
|---|---|
| Train | 1,278 |
| Validation | 142 |
| Test | 122 |
| Total | 1,542 |
The examples use ShareGPT-style messages with Hermes FC v1 tool-call delimiters. Each example includes the complete UPS tool definition set in the system prompt. Five initially weak tools were oversampled 3x during the final run: rate_shipment, get_time_in_transit, upload_paperless_document, get_service_center_facilities, and get_landed_cost_quote.
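A hypothetical record in that shape (the values are invented for illustration, and the real system turn embeds all 18 tool definitions, not just one):

```json
{
  "conversations": [
    {"from": "system", "value": "You are a UPS shipping assistant.\n<tools>\n[{\"type\": \"function\", \"function\": {\"name\": \"track_package\", \"parameters\": {\"type\": \"object\", \"properties\": {\"inquiryNumber\": {\"type\": \"string\"}}}}}]\n</tools>"},
    {"from": "human", "value": "Where is my package 1Z999AA10123456784?"},
    {"from": "gpt", "value": "<tool_call>\n{\"name\": \"track_package\", \"arguments\": {\"inquiryNumber\": \"1Z999AA10123456784\"}}\n</tool_call>"}
  ]
}
```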
The training data contains zero refusal/clarification examples and zero paraphrased prompts — both of which are reflected in the eval results above (5% incomplete-refusal, 65% paraphrase value-grounded).
No UPS API credentials or private customer shipment data are included in the training data.
## Training Procedure
| Setting | Value |
|---|---|
| Method | QLoRA SFT |
| Base checkpoint loaded for training | unsloth/Qwen2.5-7B-Instruct-bnb-4bit |
| LoRA rank | 16 |
| LoRA alpha | 16 |
| rsLoRA | Enabled |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trainable parameters | 40,370,176 |
| Learning rate | 3e-4 |
| Epochs | 2 |
| Effective batch size | 4 |
| Max sequence length | 6,144 |
| Optimizer | AdamW 8-bit paged |
| Training precision | FP16 |
| Chat template | ChatML |
Training completed in 9,195.5 seconds (≈2.6 hours), with 0.278 samples/second throughput. Final training loss was 0.2343. Final eval loss was 0.2031.
## Hardware and Software
Training was run locally on an NVIDIA GeForce RTX 3090 with 24 GB VRAM. Peak VRAM usage was about 17.2 GB.
| Component | Version |
|---|---|
| Python | 3.12.3 |
| PyTorch | 2.10.0+cu128 |
| Unsloth | 2026.2.1 |
| PEFT | 0.18.1 |
| TRL | 0.24.0 |
## Artifacts
| Artifact | Path |
|---|---|
| LoRA adapter | repo root: adapter_config.json, adapter_model.safetensors, tokenizer files |
| GGUF Q4_K_M export | ups-tools-aug-2ep-Q4_K_M.gguf |
| Ollama Modelfile | Modelfile |
The full merged Hugging Face safetensors export is not published here. Use the adapter with the base model for Transformers/PEFT, or the GGUF for local inference.
## Usage

### Transformers and PEFT
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "Qwen/Qwen2.5-7B-Instruct"
adapter_id = "matt-hans93/ups-tools-qwen2.5-7b"

base = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
model = PeftModel.from_pretrained(base, adapter_id)
tokenizer = AutoTokenizer.from_pretrained(adapter_id)
```
### Ollama

Download `ups-tools-aug-2ep-Q4_K_M.gguf` and `Modelfile`, then create a local Ollama model from the directory containing both files:

```shell
ollama create ups-tools-qwen25 -f Modelfile
```
The published Modelfile uses a low temperature and enough output budget for nested JSON payloads:
```
FROM ups-tools-aug-2ep-Q4_K_M.gguf
PARAMETER temperature 0.1
PARAMETER top_p 0.9
PARAMETER num_predict 2048
PARAMETER stop "<|im_end|>"
```
## Limitations and Risks
Operational notes derived directly from the evaluation above:
- Refusal is broken. When a prompt is missing required information, the model fabricates the missing fields 95% of the time. Production orchestrators must validate arguments against the MCP schema and source-of-truth before executing — never treat a generated tool call as authoritative for incomplete user input.
- Phrasing-shift fragility. Paraphrased prompts (especially terse SMS-style) drop value-grounded accuracy by ~30 pp. If your interface allows free-form text, expect higher value-error rates than the headline 95%.
- `upload_paperless_document` value hallucinations. ~22% of upload calls have file names, file formats, or document type codes that don't match what the user asked for. Validate enum values and filename↔format consistency before submitting to UPS.
- `rate_shipment` cross-field combinations. The model can pair UPS service codes with incompatible packaging types (e.g., Air services with Pallet). Validate service↔packaging compatibility before submitting.
- Multi-step workflow cascading. Step accuracy is 93.8% but full-workflow accuracy is 66.7%; early errors cascade. For multi-turn flows, validate each turn before continuing.
- Impossible-state handling. The model attempts tool calls on operations that are physically/logically impossible given prior context (e.g., recovering a label after the shipment was voided). Track conversation state externally.
- Specialization to 18 UPS MCP tools. This model should not be expected to generalize to unrelated APIs or unseen tool schemas without additional evaluation.
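Several of these failure modes can be caught with a small cross-field check before a call reaches UPS. A sketch with a hypothetical, deliberately incomplete compatibility table (a real deployment should derive the full ruleset from UPS documentation):

```python
# Hypothetical subset of service/packaging incompatibilities, keyed by
# (service code, packaging code); not an exhaustive UPS ruleset.
INCOMPATIBLE = {
    ("01", "30"),  # Next Day Air with Pallet
    ("02", "30"),  # 2nd Day Air with Pallet
}

def check_service_packaging(service_code: str, packaging_code: str) -> bool:
    """Return True when the service/packaging pair is allowed."""
    return (service_code, packaging_code) not in INCOMPATIBLE

print(check_service_packaging("01", "30"))  # False: Air service with Pallet
```

The same pattern extends to the international constraint (landed-cost quotes require distinct origin and destination countries) checked by the eval harness.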
## Out-of-Scope Use
Do not use this model as an autonomous authority for shipping purchases, billing decisions, customs declarations, legal compliance, or cancellation actions. It should generate candidate MCP tool calls only; production systems should validate, log, and gate execution.
## Reproducing the evaluation
The complete stress-test harness, all prompt corpora, and all per-case JSON output are open source. The harness is config-free; each axis is one Python module that takes ~3–25 minutes on a single RTX 3090 with the GGUF served via Ollama.
```shell
# Run all 9 axes (~55 min total)
uv run --no-project python -m scripts.stress_test.modules.b_strict_replay
uv run --no-project python -m scripts.stress_test.modules.c_ood
uv run --no-project python -m scripts.stress_test.modules.d_paraphrase
uv run --no-project python -m scripts.stress_test.modules.e_sibling
uv run --no-project python -m scripts.stress_test.modules.f_negative
uv run --no-project python -m scripts.stress_test.modules.g_multi_turn
uv run --no-project python -m scripts.stress_test.modules.h_variance
uv run --no-project python -m scripts.stress_test.modules.i_real_api
uv run --no-project python -m scripts.stress_test.modules.j_natural

# Aggregate everything into a single suite JSON + markdown report
uv run --no-project python scripts/stress_test/run_all.py
```
## License
Apache 2.0. The base Qwen2.5 model is also released under Apache 2.0.
## Evaluation results

All figures are self-reported.

| Metric | Eval set | Value |
|---|---|---|
| Tool selection accuracy (shape) | 900-case templated stress (50/tool × 18 tools) | 0.978 |
| Value-grounded accuracy (strict) | 900-case templated stress | 0.950 |
| Nested-structural accuracy | 900-case templated stress | 0.978 |
| Schema-typed validity | 900-case templated stress | 0.978 |
| JSON validity rate | 900-case templated stress | 1.000 |
| Silent value-hallucination rate (lower is better) | 900-case templated stress | 0.028 |
| Tool selection accuracy (shape) | OOD-lexicon stress (360 prompts, vocab disjoint from training) | 0.989 |
| Value-grounded accuracy | OOD-lexicon stress | 0.961 |
| Silent value-hallucination rate (lower is better) | OOD-lexicon stress | 0.028 |
| Tool selection accuracy (shape) | Paraphrase stress (360 prompts, voice/email/sms/typo) | 0.917 |
| Value-grounded accuracy | Paraphrase stress | 0.653 |
| Silent value-hallucination rate (lower is better) | Paraphrase stress | 0.264 |
| Tool selection accuracy (shape) | Sibling stress (120 adversarial confusable-pair prompts) | 0.908 |
| Value-grounded accuracy | Sibling stress | 0.783 |
| Refusal pass rate (overall) | Negative-set stress (80 off-domain / incomplete / conflicting / refusal-bait) | 0.537 |
| Refusal pass rate on incomplete prompts | Negative-set stress | 0.050 |
| Tool selection accuracy (shape) | Natural-language eval (100 hand-curated real-customer prompts) | 0.900 |
| Value-grounded accuracy | Natural-language eval | 0.790 |
| Per-step pass rate | Multi-turn workflows (30 hand-curated 5–8 step workflows, 161 total steps) | 0.938 |
| Full-flow pass rate | Multi-turn workflows | 0.667 |