# UPS Tool-Use Qwen2.5 7B
UPS Tool-Use Qwen2.5 7B is a QLoRA fine-tune of Qwen2.5-7B-Instruct for generating structured Hermes function-calling payloads for a UPS MCP server. It maps natural-language shipping and logistics requests to 18 UPS MCP tools covering tracking, rating, shipment creation, pickup workflows, address validation, paperless documents, landed cost, and location lookup.
This repository publishes the PEFT LoRA adapter at the repo root and a Q4_K_M GGUF export for local inference with llama.cpp-compatible runtimes such as Ollama.
This model is not affiliated with or endorsed by UPS.
## Model Details
- Developed by: Matthew Hans
- Model type: Qwen2.5-7B-Instruct causal language model with LoRA adapter fine-tuning
- Base model: Qwen/Qwen2.5-7B-Instruct
- Training base: unsloth/Qwen2.5-7B-Instruct-bnb-4bit
- Fine-tuning method: QLoRA with rank-stabilized LoRA
- Language: English
- Output format: Hermes FC v1 tool call blocks
- Local run: ups-tools-aug-2ep
- Evaluation period: April 25–28, 2026
## Intended Use

The model is intended to sit behind a UPS MCP server and produce one structured tool call per user request. A downstream orchestrator should parse the `<tool_call>` block, validate the arguments against the MCP tool schema, and then decide whether to call the UPS API.
```
<tool_call>
{"name": "track_package", "arguments": {"inquiryNumber": "1Z999AA10123456784"}}
</tool_call>
```
The system prompt should include the complete set of UPS MCP tool definitions in Hermes-compatible JSON, for example:
```
<tools>
[{"type": "function", "function": {"name": "track_package", "parameters": {...}}}]
</tools>
```
## Supported UPS MCP Tools

| Tool | Category |
|---|---|
| `track_package` | Tracking |
| `validate_address` | Address validation |
| `rate_shipment` | Rating |
| `create_shipment` | Shipping |
| `void_shipment` | Shipping |
| `recover_label` | Shipping |
| `get_time_in_transit` | Transit |
| `get_landed_cost_quote` | International landed cost |
| `upload_paperless_document` | Paperless documents |
| `push_document_to_shipment` | Paperless documents |
| `delete_paperless_document` | Paperless documents |
| `find_locations` | Locator |
| `rate_pickup` | Pickup |
| `schedule_pickup` | Pickup |
| `cancel_pickup` | Pickup |
| `get_pickup_status` | Pickup |
| `get_political_divisions` | Pickup metadata |
| `get_service_center_facilities` | Pickup metadata |
## Evaluation
The model was evaluated with a 9-axis stress-test suite (~2,150 inference calls) against four graders run in parallel on every case:
- Shape grader — required keys present in the tool call.
- Value-grounded grader — argument values match what the prompt actually said (tracking number, zip, address, country code, dates, file names, etc.).
- Nested-structural grader — deep paths in `request_body` are populated for `rate_shipment`, `create_shipment`, `recover_label`, and `get_time_in_transit`.
- Schema-typed grader — types and enums match the MCP schema; unknown parameters are flagged.

Cross-field constraints (UPS service ↔ packaging compatibility, international service ↔ different countries) are also checked. Confidence intervals are Wilson 95%; the conservative lower bound ("lo") is the figure to use for go/no-go decisions.
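The Wilson lower bounds in the tables below can be reproduced directly from the pass rate and sample size. A sketch (not the harness's own code):

```python
from math import sqrt

def wilson_lower(p_hat: float, n: int, z: float = 1.96) -> float:
    """Lower bound of the 95% Wilson score interval for a pass rate p_hat over n trials."""
    denom = 1 + z**2 / n
    center = p_hat + z**2 / (2 * n)
    margin = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return (center - margin) / denom

# 95.00% value-grounded over 900 strict-replay cases -> approximately 0.9338,
# matching the 93.38% reported in the headline table.
print(wilson_lower(0.95, 900))
```

The lower bound shrinks toward the point estimate as n grows, which is why the 360-prompt axes carry wider intervals than the 900-case replay.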
### Headline results
| Axis | n | Shape | Value-grounded | Wilson 95% lo (value) | Silent value-hallucination |
|---|---|---|---|---|---|
| Strict re-replay (templated, original test) | 900 | 97.78% | 95.00% | 93.38% | 2.78% |
| OOD-lexicon (cities/streets/companies disjoint from training) | 360 | 98.89% | 96.11% | 93.58% | 2.78% |
| Paraphrase stress (voice/email/sms/typo) | 360 | 91.67% | 65.28% | 60.22% | 26.39% |
| Sibling-tool ambiguity (6 confusable pair-families) | 120 | 90.83% | 78.33% | 70.15% | 13.33% |
| Negative / refusal | 80 | — | — | — | — (53.75% pass overall) |
| Multi-turn workflows (5–8 steps) | 30 | — | — | — | — (66.7% flow / 93.8% step) |
| Real-API acceptance (UPS CIE) | 48 | — | — | — | — (~95% on tools where CIE accepts the data) |
| Sampling variance (100 prompts × 5 samples T=0.7) | 500 | — | — | — | — (mode agreement 0.96, pass 0.924) |
| Natural-language eval (real customer phrasing) | 100 | 90.00% | 79.00% | 70.02% | 13.00% |
### How to read these numbers
The original "97.78% accurate" headline was a shape-only check on templated prompts. After applying value, structural, and schema graders to the same 900 prompts, 2.78% of supposedly-passing calls had wrong values that the original grader missed. On out-of-distribution lexical content the model generalizes well (96.1% value-grounded), but on paraphrased prompts in styles it never saw during training, value accuracy drops to 65.3% with a 26 pp silent-hallucination rate. On adversarial sibling-tool prompts, tool selection drops to 90.8%. The model is brittle to phrasing distribution shift, not to vocabulary or sampling temperature.
### Paraphrase stress by style
| Style | n | Shape | Value-grounded |
|---|---|---|---|
| voice (filler words, "uh", "you know") | 90 | 96.7% | 67.8% |
| email (greeting + signoff) | 90 | 95.6% | 68.9% |
| typo (2-3 misspellings) | 90 | 94.4% | 67.8% |
| sms (terse, lowercase, abbreviations) | 90 | 80.0% | 56.7% |
### Sibling-tool ambiguity by pair-family
| Family | Tools | n | Shape | Value-grounded |
|---|---|---|---|---|
| P | rate_pickup vs schedule_pickup | 15 | 100.0% | 80.0% |
| D | upload_paperless_document vs push_document_to_shipment | 15 | 93.3% | 80.0% |
| L | find_locations vs get_service_center_facilities | 30 | 93.3% | 73.3% |
| R | rate_shipment vs create_shipment vs get_time_in_transit | 30 | 90.0% | 76.7% |
| T | track_package vs recover_label vs void_shipment | 15 | 86.7% | 86.7% |
| G | get_political_divisions vs find_locations | 15 | 80.0% | 80.0% |
### Negative / refusal by category

Pass = no `<tool_call>` emitted (or, on a few conflicting cases, the model picks one of the explicitly allowed tools).
| Category | n | Pass rate |
|---|---|---|
| off_domain ("what's the weather") | 20 | 100.0% |
| refusal_bait ("ignore your tools") | 20 | 65.0% |
| conflicting ("track + void") | 20 | 45.0% |
| incomplete ("can you ship for me?") | 20 | 5.0% |
The 5% pass rate on incomplete prompts is the most important finding: when the user gives a partial request, the model fabricates the missing fields 19 times out of 20 instead of asking for clarification.
### Per-tool value-grounded accuracy on the original 900 cases
Five tools that scored 100% on the shape-only grader had hidden value-hallucination problems:
| Tool | Shape | Value-grounded | Hidden gap |
|---|---|---|---|
| upload_paperless_document | 100.0% | 78.0% | −22 pp |
| rate_pickup | 100.0% | 82.0% | −18 pp |
| schedule_pickup | 100.0% | 96.0% | −4 pp |
| get_landed_cost_quote | 100.0% | 96.0% | −4 pp |
| void_shipment | 100.0% | 98.0% | −2 pp |
| create_shipment | 80.0% | 80.0% | 0 (wrong tool, not value) |
| get_service_center_facilities | 80.0% | 80.0% | 0 (wrong tool) |
| (other 11 tools) | 100.0% | 100.0% | clean |
### Real-API acceptance on UPS Customer Integration Environment

The model was exercised end-to-end against the UPS CIE sandbox.
| Tool | api_ok / n | Notes |
|---|---|---|
| track_package | 6/6 | — |
| get_time_in_transit | 6/6 | — |
| get_political_divisions | 6/6 | — |
| get_service_center_facilities | 5/6 | sandbox refused 1 city |
| validate_address | 0/6 | sandbox state allowlist |
| find_locations | 0/6 | sandbox city allowlist |
| rate_pickup | 0/6 | sandbox postal/state allowlist |
| rate_shipment | 0/6 | model: service+packaging incompatible (e.g., NDA Air with Pallet) |
The cases where CIE refused for sandbox-allowlist reasons (validate_address, find_locations, rate_pickup) are not model failures — the same JSON would be accepted by production UPS. The genuine model issue is rate_shipment's cross-field service+packaging combinations.
### Multi-turn workflows
Flow pass = every step in a 5–8 step workflow correct. Step pass = individual steps. The 27 pp gap (66.7% vs 93.8%) reflects error cascading: one wrong step early in a workflow takes the whole chain down. Six of 30 workflows include a deliberately-impossible step (e.g., recover label after voiding); the model attempts a tool call instead of refusing on those, which matches the 5% incomplete-prompt refusal pattern above.
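The size of that gap is roughly what independent per-step errors would predict. A back-of-the-envelope check (the independence assumption is mine, not a claim from the harness):

```python
# If each step failed independently at the measured per-step rate,
# a workflow passes only when every one of its steps passes.
step_pass = 0.938
avg_steps = 161 / 30  # 161 graded steps across 30 workflows, ~5.4 steps each
expected_flow_pass = step_pass ** avg_steps
print(f"{expected_flow_pass:.3f}")  # ~0.709, in the same range as the observed 0.667
```

The observed flow rate sits slightly below the independence estimate, consistent with errors compounding once a workflow goes off the rails.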
### Sampling stability
Re-running 100 prompts (mix of OOD + paraphrase + sibling) five times each at temperature 0.7:
- Average per-prompt mode-tool agreement: 0.96
- Average per-prompt pass rate: 0.924
- Prompts where every sample emitted the same tool: ~78%
The model's behavior is stable to sampling noise. The capability is real; the fragility is to input distribution shift (paraphrasing, partial info), not to decoding stochasticity.
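The per-prompt mode-tool agreement metric can be computed from the five samples per prompt; a sketch with a hypothetical helper (not the harness's own implementation):

```python
from collections import Counter

def mode_agreement(tool_names: list[str]) -> float:
    """Fraction of samples that picked the most common tool for one prompt."""
    counts = Counter(tool_names)
    return counts.most_common(1)[0][1] / len(tool_names)

samples = ["track_package"] * 4 + ["recover_label"]
print(mode_agreement(samples))  # 4 of 5 samples agree -> 0.8
```

Averaging this value over all 100 prompts yields the 0.96 agreement figure reported above.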
## Training Data
The dataset is synthetic and generated from UPS MCP server tool schemas.
| Split | Samples |
|---|---|
| Train | 1,278 |
| Validation | 142 |
| Test | 122 |
| Total | 1,542 |
The examples use ShareGPT-style messages with Hermes FC v1 tool-call delimiters. Each example includes the complete UPS tool definition set in the system prompt. Five initially weak tools were oversampled 3x during the final run: rate_shipment, get_time_in_transit, upload_paperless_document, get_service_center_facilities, and get_landed_cost_quote.
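A hypothetical record in that shape (the values are invented for illustration, and the real system turn embeds all 18 tool definitions, not just one):

```json
{
  "conversations": [
    {"from": "system", "value": "You are a UPS shipping assistant.\n<tools>\n[{\"type\": \"function\", \"function\": {\"name\": \"track_package\", \"parameters\": {\"type\": \"object\", \"properties\": {\"inquiryNumber\": {\"type\": \"string\"}}}}}]\n</tools>"},
    {"from": "human", "value": "Where is my package 1Z999AA10123456784?"},
    {"from": "gpt", "value": "<tool_call>\n{\"name\": \"track_package\", \"arguments\": {\"inquiryNumber\": \"1Z999AA10123456784\"}}\n</tool_call>"}
  ]
}
```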
The training data contains zero refusal/clarification examples and zero paraphrased prompts — both of which are reflected in the eval results above (5% incomplete-refusal, 65% paraphrase value-grounded).
No UPS API credentials or private customer shipment data are included in the training data.
## Training Procedure
| Setting | Value |
|---|---|
| Method | QLoRA SFT |
| Base checkpoint loaded for training | unsloth/Qwen2.5-7B-Instruct-bnb-4bit |
| LoRA rank | 16 |
| LoRA alpha | 16 |
| rsLoRA | Enabled |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trainable parameters | 40,370,176 |
| Learning rate | 3e-4 |
| Epochs | 2 |
| Effective batch size | 4 |
| Max sequence length | 6,144 |
| Optimizer | AdamW 8-bit paged |
| Training precision | FP16 |
| Chat template | ChatML |
Training completed in 9,195.5 seconds (≈2.6 hours), with 0.278 samples/second throughput. Final training loss was 0.2343. Final eval loss was 0.2031.
## Hardware and Software
Training was run locally on an NVIDIA GeForce RTX 3090 with 24 GB VRAM. Peak VRAM usage was about 17.2 GB.
| Component | Version |
|---|---|
| Python | 3.12.3 |
| PyTorch | 2.10.0+cu128 |
| Unsloth | 2026.2.1 |
| PEFT | 0.18.1 |
| TRL | 0.24.0 |
## Artifacts
| Artifact | Path |
|---|---|
| LoRA adapter | repo root: adapter_config.json, adapter_model.safetensors, tokenizer files |
| GGUF Q4_K_M export | ups-tools-aug-2ep-Q4_K_M.gguf |
| Ollama Modelfile | Modelfile |
The full merged Hugging Face safetensors export is not published here. Use the adapter with the base model for Transformers/PEFT, or the GGUF for local inference.
## Usage

### Transformers and PEFT
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "Qwen/Qwen2.5-7B-Instruct"
adapter_id = "matt-hans93/ups-tools-qwen2.5-7b"

base = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
model = PeftModel.from_pretrained(base, adapter_id)
tokenizer = AutoTokenizer.from_pretrained(adapter_id)
```
### Ollama

Download `ups-tools-aug-2ep-Q4_K_M.gguf` and `Modelfile`, then create a local Ollama model from the directory containing both files:

```shell
ollama create ups-tools-qwen25 -f Modelfile
```
The published Modelfile uses a low temperature and enough output budget for nested JSON payloads:
```
FROM ups-tools-aug-2ep-Q4_K_M.gguf
PARAMETER temperature 0.1
PARAMETER top_p 0.9
PARAMETER num_predict 2048
PARAMETER stop "<|im_end|>"
```
## Limitations and Risks
Operational notes derived directly from the evaluation above:
- Refusal is broken. When a prompt is missing required information, the model fabricates the missing fields 95% of the time. Production orchestrators must validate arguments against the MCP schema and source-of-truth before executing — never treat a generated tool call as authoritative for incomplete user input.
- Phrasing-shift fragility. Paraphrased prompts (especially terse SMS-style) drop value-grounded accuracy by ~30 pp. If your interface allows free-form text, expect higher value-error rates than the headline 95%.
- `upload_paperless_document` value hallucinations. ~22% of upload calls have file names, file formats, or document type codes that don't match what the user asked for. Validate enum values and filename↔format consistency before submitting to UPS.
- `rate_shipment` cross-field combinations. The model can pair UPS service codes with incompatible packaging types (e.g., Air services with Pallet). Validate service↔packaging compatibility before submitting.
- Multi-step workflow cascading. Step accuracy is 93.8% but full-workflow accuracy is 66.7%; early errors cascade. For multi-turn flows, validate each turn before continuing.
- Impossible-state handling. The model attempts tool calls on operations that are physically/logically impossible given prior context (e.g., recovering a label after the shipment was voided). Track conversation state externally.
- Specialization to 18 UPS MCP tools. This model should not be expected to generalize to unrelated APIs or unseen tool schemas without additional evaluation.
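Several of these failure modes can be caught with a small cross-field check before a call reaches UPS. A sketch with a hypothetical, deliberately incomplete compatibility table (a real deployment should derive the full ruleset from UPS documentation):

```python
# Hypothetical subset of service/packaging incompatibilities, keyed by
# (service code, packaging code); not an exhaustive UPS ruleset.
INCOMPATIBLE = {
    ("01", "30"),  # Next Day Air with Pallet
    ("02", "30"),  # 2nd Day Air with Pallet
}

def check_service_packaging(service_code: str, packaging_code: str) -> bool:
    """Return True when the service/packaging pair is allowed."""
    return (service_code, packaging_code) not in INCOMPATIBLE

print(check_service_packaging("01", "30"))  # False: Air service with Pallet
```

The same pattern extends to the international constraint (landed-cost quotes require distinct origin and destination countries) checked by the eval harness.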
## Out-of-Scope Use
Do not use this model as an autonomous authority for shipping purchases, billing decisions, customs declarations, legal compliance, or cancellation actions. It should generate candidate MCP tool calls only; production systems should validate, log, and gate execution.
## Reproducing the evaluation
The complete stress-test harness, all prompt corpora, and all per-case JSON output are open source. The harness is config-free; each axis is one Python module that takes ~3–25 minutes on a single RTX 3090 with the GGUF served via Ollama.
```shell
# Run all 9 axes (~55 min total)
uv run --no-project python -m scripts.stress_test.modules.b_strict_replay
uv run --no-project python -m scripts.stress_test.modules.c_ood
uv run --no-project python -m scripts.stress_test.modules.d_paraphrase
uv run --no-project python -m scripts.stress_test.modules.e_sibling
uv run --no-project python -m scripts.stress_test.modules.f_negative
uv run --no-project python -m scripts.stress_test.modules.g_multi_turn
uv run --no-project python -m scripts.stress_test.modules.h_variance
uv run --no-project python -m scripts.stress_test.modules.i_real_api
uv run --no-project python -m scripts.stress_test.modules.j_natural

# Aggregate everything into a single suite JSON + markdown report
uv run --no-project python scripts/stress_test/run_all.py
```
## License
Apache 2.0. The base Qwen2.5 model is also released under Apache 2.0.
## Evaluation results

All figures are self-reported.

| Metric | Eval set | Value |
|---|---|---|
| Tool selection accuracy (shape) | 900-case templated stress (50/tool × 18 tools) | 0.978 |
| Value-grounded accuracy (strict) | 900-case templated stress | 0.950 |
| Nested-structural accuracy | 900-case templated stress | 0.978 |
| Schema-typed validity | 900-case templated stress | 0.978 |
| JSON validity rate | 900-case templated stress | 1.000 |
| Silent value-hallucination rate (lower is better) | 900-case templated stress | 0.028 |
| Tool selection accuracy (shape) | OOD-lexicon stress (360 prompts, vocab disjoint from training) | 0.989 |
| Value-grounded accuracy | OOD-lexicon stress | 0.961 |
| Silent value-hallucination rate (lower is better) | OOD-lexicon stress | 0.028 |
| Tool selection accuracy (shape) | Paraphrase stress (360 prompts, voice/email/sms/typo) | 0.917 |
| Value-grounded accuracy | Paraphrase stress | 0.653 |
| Silent value-hallucination rate (lower is better) | Paraphrase stress | 0.264 |
| Tool selection accuracy (shape) | Sibling stress (120 adversarial confusable-pair prompts) | 0.908 |
| Value-grounded accuracy | Sibling stress | 0.783 |
| Refusal pass rate (overall) | Negative-set stress (80 off-domain / incomplete / conflicting / refusal-bait) | 0.537 |
| Refusal pass rate on incomplete prompts | Negative-set stress | 0.050 |
| Tool selection accuracy (shape) | Natural-language eval (100 hand-curated real-customer prompts) | 0.900 |
| Value-grounded accuracy | Natural-language eval | 0.790 |
| Per-step pass rate | Multi-turn workflows (30 hand-curated 5–8 step workflows, 161 total steps) | 0.938 |
| Full-flow pass rate | Multi-turn workflows | 0.667 |