UPS Tool-Use Qwen2.5 7B

UPS Tool-Use Qwen2.5 7B is a QLoRA fine-tune of Qwen2.5-7B-Instruct for generating structured Hermes function-calling payloads for a UPS MCP server. It maps natural-language shipping and logistics requests to 18 UPS MCP tools covering tracking, rating, shipment creation, pickup workflows, address validation, paperless documents, landed cost, and location lookup.

This repository publishes the PEFT LoRA adapter at the repo root and a Q4_K_M GGUF export for local inference with llama.cpp-compatible runtimes such as Ollama.

This model is not affiliated with or endorsed by UPS.

Model Details

  • Developed by: Matthew Hans
  • Model type: Qwen2.5-7B-Instruct causal language model with LoRA adapter fine-tuning
  • Base model: Qwen/Qwen2.5-7B-Instruct
  • Training base: unsloth/Qwen2.5-7B-Instruct-bnb-4bit
  • Fine-tuning method: QLoRA with rank-stabilized LoRA
  • Language: English
  • Output format: Hermes FC v1 tool call blocks
  • Local run: ups-tools-aug-2ep
  • Evaluation period: April 25–28, 2026

Intended Use

The model is intended to sit behind a UPS MCP server and produce one structured tool call per user request. A downstream orchestrator should parse the <tool_call> block, validate arguments against the MCP tool schema, and then decide whether to call the UPS API.

```
<tool_call>
{"name": "track_package", "arguments": {"inquiryNumber": "1Z999AA10123456784"}}
</tool_call>
```

The system prompt should include the complete set of UPS MCP tool definitions in Hermes-compatible JSON, for example:

```
<tools>
[{"type": "function", "function": {"name": "track_package", "parameters": {...}}}]
</tools>
```
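The extraction step an orchestrator would run on the completion can be sketched in a few lines of Python. This is illustrative code, not part of this repo; the `extract_tool_call` helper and its error handling are assumptions about how a downstream consumer might work.

```python
import json
import re

# Hermes-style tool-call delimiters around a single JSON object.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def extract_tool_call(completion: str) -> dict:
    """Pull the first tool call out of a model completion.

    Raises ValueError when no well-formed block is found, so the caller
    can fall back to treating the output as plain text (or a refusal).
    """
    match = TOOL_CALL_RE.search(completion)
    if match is None:
        raise ValueError("no <tool_call> block found")
    call = json.loads(match.group(1))
    # A valid payload names a tool and carries an arguments object.
    if not isinstance(call.get("name"), str) or not isinstance(call.get("arguments"), dict):
        raise ValueError("malformed tool call payload")
    return call

completion = (
    '<tool_call>\n'
    '{"name": "track_package", "arguments": {"inquiryNumber": "1Z999AA10123456784"}}\n'
    '</tool_call>'
)
print(extract_tool_call(completion)["name"])  # → track_package
```

Schema validation and API execution would happen after this step, against the real MCP tool definitions.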

Supported UPS MCP Tools

| Tool | Category |
|---|---|
| track_package | Tracking |
| validate_address | Address validation |
| rate_shipment | Rating |
| create_shipment | Shipping |
| void_shipment | Shipping |
| recover_label | Shipping |
| get_time_in_transit | Transit |
| get_landed_cost_quote | International landed cost |
| upload_paperless_document | Paperless documents |
| push_document_to_shipment | Paperless documents |
| delete_paperless_document | Paperless documents |
| find_locations | Locator |
| rate_pickup | Pickup |
| schedule_pickup | Pickup |
| cancel_pickup | Pickup |
| get_pickup_status | Pickup |
| get_political_divisions | Pickup metadata |
| get_service_center_facilities | Pickup metadata |

Evaluation

The model was evaluated with a 9-axis stress-test suite (~2,150 inference calls). Four graders ran in parallel on every case:

  • Shape grader — required keys present in the tool call.
  • Value-grounded grader — argument values match what the prompt actually said (tracking number, zip, address, country code, dates, file names, etc.).
  • Nested-structural grader — deep paths in request_body are populated for rate_shipment, create_shipment, recover_label, get_time_in_transit.
  • Schema-typed grader — types and enums match the MCP schema; unknown params flagged.

Cross-field constraints (UPS service ↔ packaging compatibility, international service ↔ different countries) are also checked. Confidence intervals are Wilson 95%; the conservative lower bound ("lo") is the figure to use for go/no-go decisions.
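The Wilson lower bound reported in the tables below is straightforward to reproduce. A minimal implementation (not the harness's own code):

```python
from math import sqrt

def wilson_lower(successes: int, n: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval (z = 1.96 gives ~95%)."""
    if n == 0:
        return 0.0
    p = successes / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / denom

# 95.00% value-grounded on the 900 templated cases -> 855 successes.
print(f"{wilson_lower(855, 900):.4f}")  # → 0.9338, the table's 93.38% lo
```

Note that for the same point estimate, the lower bound tightens as n grows, which is why the 360-case axes carry wider intervals than the 900-case re-replay.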

Headline results

| Axis | n | Shape | Value-grounded | Wilson 95% lo (value) | Silent value-hallucination |
|---|---|---|---|---|---|
| Strict re-replay (templated, original test) | 900 | 97.78% | 95.00% | 93.38% | 2.78% |
| OOD-lexicon (cities/streets/companies disjoint from training) | 360 | 98.89% | 96.11% | 93.58% | 2.78% |
| Paraphrase stress (voice/email/sms/typo) | 360 | 91.67% | 65.28% | 60.22% | 26.39% |
| Sibling-tool ambiguity (6 confusable pair-families) | 120 | 90.83% | 78.33% | 70.15% | 13.33% |
| Negative / refusal | 80 | — | 53.75% pass overall | — | — |
| Multi-turn workflows (5–8 steps) | 30 | — | 66.7% flow / 93.8% step | — | — |
| Real-API acceptance (UPS CIE) | 48 | — | ~95% on tools where CIE accepts the data | — | — |
| Sampling variance (100 prompts × 5 samples, T=0.7) | 500 | — | mode agreement 0.96, pass 0.924 | — | — |
| Natural-language eval (real customer phrasing) | 100 | 90.00% | 79.00% | 70.02% | 13.00% |

How to read these numbers

The original "97.78% accurate" headline was a shape-only check on templated prompts. After applying value, structural, and schema graders to the same 900 prompts, 2.78% of supposedly-passing calls had wrong values that the original grader missed. On out-of-distribution lexical content the model generalizes well (96.1% value-grounded), but on paraphrased prompts in styles it never saw during training, value accuracy drops to 65.3% with a 26 pp silent-hallucination rate. On adversarial sibling-tool prompts, tool selection drops to 90.8%. The model is brittle to phrasing distribution shift, not to vocabulary or sampling temperature.

Paraphrase stress by style

| Style | n | Shape | Value-grounded |
|---|---|---|---|
| voice (filler words, "uh", "you know") | 90 | 96.7% | 67.8% |
| email (greeting + signoff) | 90 | 95.6% | 68.9% |
| typo (2–3 misspellings) | 90 | 94.4% | 67.8% |
| sms (terse, lowercase, abbreviations) | 90 | 80.0% | 56.7% |

Sibling-tool ambiguity by pair-family

| Family | Tools | n | Shape | Value-grounded |
|---|---|---|---|---|
| P | rate_pickup vs schedule_pickup | 15 | 100.0% | 80.0% |
| D | upload_paperless_document vs push_document_to_shipment | 15 | 93.3% | 80.0% |
| L | find_locations vs get_service_center_facilities | 30 | 93.3% | 73.3% |
| R | rate_shipment vs create_shipment vs get_time_in_transit | 30 | 90.0% | 76.7% |
| T | track_package vs recover_label vs void_shipment | 15 | 86.7% | 86.7% |
| G | get_political_divisions vs find_locations | 15 | 80.0% | 80.0% |

Negative / refusal by category

Pass = no <tool_call> emitted (or, on a few conflicting cases, the model picks one of the explicitly-allowed tools).

| Category | n | Pass rate |
|---|---|---|
| off_domain ("what's the weather") | 20 | 100.0% |
| refusal_bait ("ignore your tools") | 20 | 65.0% |
| conflicting ("track + void") | 20 | 45.0% |
| incomplete ("can you ship for me?") | 20 | 5.0% |

The 5% pass rate on incomplete prompts is the most important finding: when the user gives a partial request, the model fabricates the missing fields 19 times out of 20 instead of asking for clarification.

Per-tool value-grounded accuracy on the original 900 cases

Five tools that scored 100% on the shape-only grader had hidden value-hallucination problems:

| Tool | Shape | Value-grounded | Hidden gap |
|---|---|---|---|
| upload_paperless_document | 100.0% | 78.0% | −22 pp |
| rate_pickup | 100.0% | 82.0% | −18 pp |
| schedule_pickup | 100.0% | 96.0% | −4 pp |
| get_landed_cost_quote | 100.0% | 96.0% | −4 pp |
| void_shipment | 100.0% | 98.0% | −2 pp |
| create_shipment | 80.0% | 80.0% | 0 (wrong tool, not value) |
| get_service_center_facilities | 80.0% | 80.0% | 0 (wrong tool) |
| (other 11 tools) | 100.0% | 100.0% | clean |

Real-API acceptance on UPS Customer Integration Environment

The model was end-to-end exercised against the UPS CIE sandbox.

| Tool | api_ok / n | Notes |
|---|---|---|
| track_package | 6/6 | |
| get_time_in_transit | 6/6 | |
| get_political_divisions | 6/6 | |
| get_service_center_facilities | 5/6 | sandbox refused 1 city |
| validate_address | 0/6 | sandbox state allowlist |
| find_locations | 0/6 | sandbox city allowlist |
| rate_pickup | 0/6 | sandbox postal/state allowlist |
| rate_shipment | 0/6 | model: service+packaging incompatible (e.g., NDA Air with Pallet) |

The cases where CIE refused for sandbox-allowlist reasons (validate_address, find_locations, rate_pickup) are not model failures — the same JSON would be accepted by production UPS. The genuine model issue is rate_shipment's cross-field service+packaging combinations.
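A pre-flight guard for the rate_shipment cross-field issue can be cheap. The sketch below is illustrative only: the payload paths follow common UPS Rating API conventions (RateRequest → Shipment → Service/Package), and the code tables are assumed examples, not the authoritative UPS enum sets.

```python
# Illustrative assumptions, not the full UPS code lists:
AIR_SERVICE_CODES = {"01", "13", "14"}   # Next Day Air family (assumed)
PALLET_PACKAGING = "30"                  # Pallet packaging type (assumed)

def check_service_packaging(request_body: dict) -> list[str]:
    """Return a list of cross-field violations; an empty list means OK."""
    errors = []
    shipment = request_body.get("RateRequest", {}).get("Shipment", {})
    service = shipment.get("Service", {}).get("Code")
    packages = shipment.get("Package", [])
    if isinstance(packages, dict):  # a single package may arrive as an object
        packages = [packages]
    for pkg in packages:
        packaging = pkg.get("PackagingType", {}).get("Code")
        if service in AIR_SERVICE_CODES and packaging == PALLET_PACKAGING:
            errors.append(f"service {service} incompatible with packaging {packaging}")
    return errors
```

An orchestrator would run this (with the real compatibility matrix) between parsing the tool call and hitting the UPS API, rejecting or repairing payloads that fail.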

Multi-turn workflows

Flow pass = every step in a 5–8 step workflow correct. Step pass = individual steps. The 27 pp gap (66.7% vs 93.8%) reflects error cascading: one wrong step early in a workflow takes the whole chain down. Six of 30 workflows include a deliberately-impossible step (e.g., recover label after voiding); the model attempts a tool call instead of refusing on those, which matches the 5% incomplete-prompt refusal pattern above.
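The size of the gap is roughly what simple error compounding predicts under an independence assumption, using the 161 total steps across 30 workflows reported in the evaluation results:

```python
step_pass = 0.938
avg_steps = 161 / 30                     # ≈ 5.4 steps per workflow
predicted_flow = step_pass ** avg_steps  # compound independent step errors
print(f"{predicted_flow:.2f}")           # → 0.71, vs. 0.667 observed
```

The observed flow pass sits slightly below the independent-errors prediction, consistent with the six flows that contain a deliberately-impossible step the model reliably fails.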

Sampling stability

Re-running 100 prompts (mix of OOD + paraphrase + sibling) five times each at temperature 0.7:

  • Average per-prompt mode-tool agreement: 0.96
  • Average per-prompt pass rate: 0.924
  • Prompts where every sample emitted the same tool: ~78%

The model's behavior is stable to sampling noise. The capability is real; the fragility is to input distribution shift (paraphrasing, partial info), not to decoding stochasticity.
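Per-prompt mode-tool agreement is simply the share of samples that picked the modal tool. A sketch of the metric (not the harness's own implementation):

```python
from collections import Counter

def mode_agreement(tool_choices: list[str]) -> float:
    """Fraction of samples for one prompt that picked the most common tool."""
    counts = Counter(tool_choices)
    return counts.most_common(1)[0][1] / len(tool_choices)

# 4 of 5 samples agree on the tool for this hypothetical prompt:
samples = ["track_package"] * 4 + ["recover_label"]
print(mode_agreement(samples))  # → 0.8
```

Averaging this quantity over the 100 re-run prompts yields the 0.96 figure above.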

Training Data

The dataset is synthetic and generated from UPS MCP server tool schemas.

| Split | Samples |
|---|---|
| Train | 1,278 |
| Validation | 142 |
| Test | 122 |
| Total | 1,542 |

The examples use ShareGPT-style messages with Hermes FC v1 tool-call delimiters. Each example includes the complete UPS tool definition set in the system prompt. Five initially weak tools were oversampled 3x during the final run: rate_shipment, get_time_in_transit, upload_paperless_document, get_service_center_facilities, and get_landed_cost_quote.
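For reference, a hypothetical example in this format is shown below as a Python dict. The structure mirrors ShareGPT-style messages with Hermes delimiters; the content is made up for illustration and is not an actual dataset row.

```python
# Shape of one training example (illustrative, not a real dataset row).
example = {
    "messages": [
        {"role": "system",
         "content": "You are a function-calling AI. Tools: <tools>[...]</tools>"},
        {"role": "user",
         "content": "Where is my package 1Z999AA10123456784 right now?"},
        {"role": "assistant",
         "content": '<tool_call>\n{"name": "track_package", '
                    '"arguments": {"inquiryNumber": "1Z999AA10123456784"}}\n</tool_call>'},
    ]
}
```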

The training data contains zero refusal/clarification examples and zero paraphrased prompts — both of which are reflected in the eval results above (5% incomplete-refusal, 65% paraphrase value-grounded).

No UPS API credentials or private customer shipment data are included in the training data.

Training Procedure

| Setting | Value |
|---|---|
| Method | QLoRA SFT |
| Base checkpoint loaded for training | unsloth/Qwen2.5-7B-Instruct-bnb-4bit |
| LoRA rank | 16 |
| LoRA alpha | 16 |
| rsLoRA | Enabled |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trainable parameters | 40,370,176 |
| Learning rate | 3e-4 |
| Epochs | 2 |
| Effective batch size | 4 |
| Max sequence length | 6,144 |
| Optimizer | AdamW 8-bit paged |
| Training precision | FP16 |
| Chat template | ChatML |

Training completed in 9,195.5 seconds (≈2.6 hours), with 0.278 samples/second throughput. Final training loss was 0.2343. Final eval loss was 0.2031.

Hardware and Software

Training was run locally on an NVIDIA GeForce RTX 3090 with 24 GB VRAM. Peak VRAM usage was about 17.2 GB.

| Component | Version |
|---|---|
| Python | 3.12.3 |
| PyTorch | 2.10.0+cu128 |
| Unsloth | 2026.2.1 |
| PEFT | 0.18.1 |
| TRL | 0.24.0 |

Artifacts

| Artifact | Path |
|---|---|
| LoRA adapter | repo root: adapter_config.json, adapter_model.safetensors, tokenizer files |
| GGUF Q4_K_M export | ups-tools-aug-2ep-Q4_K_M.gguf |
| Ollama Modelfile | Modelfile |

The full merged Hugging Face safetensors export is not published here. Use the adapter with the base model for Transformers/PEFT, or the GGUF for local inference.

Usage

Transformers and PEFT

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "Qwen/Qwen2.5-7B-Instruct"
adapter_id = "matt-hans93/ups-tools-qwen2.5-7b"

base = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
model = PeftModel.from_pretrained(base, adapter_id)
tokenizer = AutoTokenizer.from_pretrained(adapter_id)
```

Ollama

Download ups-tools-aug-2ep-Q4_K_M.gguf and Modelfile, then create a local Ollama model from the directory containing both files:

```shell
ollama create ups-tools-qwen25 -f Modelfile
```

The published Modelfile uses a low temperature and enough output budget for nested JSON payloads:

```
FROM ups-tools-aug-2ep-Q4_K_M.gguf

PARAMETER temperature 0.1
PARAMETER top_p 0.9
PARAMETER num_predict 2048
PARAMETER stop "<|im_end|>"
```

Limitations and Risks

Operational notes derived directly from the evaluation above:

  • Refusal is broken. When a prompt is missing required information, the model fabricates the missing fields 95% of the time. Production orchestrators must validate arguments against the MCP schema and source-of-truth before executing — never treat a generated tool call as authoritative for incomplete user input.
  • Phrasing-shift fragility. Paraphrased prompts (especially terse SMS-style) drop value-grounded accuracy by ~30 pp. If your interface allows free-form text, expect higher value-error rates than the headline 95%.
  • upload_paperless_document value hallucinations. ~22% of upload calls have file names, file formats, or document type codes that don't match what the user asked. Validate enum values and filename↔format consistency before submitting to UPS.
  • rate_shipment cross-field combinations. The model can pair UPS service codes with incompatible packaging types (e.g., Air services with Pallet). Validate service↔packaging compatibility before submitting.
  • Multi-step workflow cascading. Step accuracy is 93.8% but full-workflow accuracy is 66.7% — early errors cascade. For multi-turn flows, validate each turn before continuing.
  • Impossible-state handling. The model attempts tool calls on operations that are physically/logically impossible given prior context (e.g., recovering a label after the shipment was voided). Track conversation state externally.
  • Specialization to 18 UPS MCP tools. This model should not be expected to generalize to unrelated APIs or unseen tool schemas without additional evaluation.
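For the upload_paperless_document risk above, even a cheap consistency check catches most filename↔format mismatches before they reach UPS. In this sketch the field names (fileName, fileFormat) and the format list are assumptions for illustration, not the actual MCP schema:

```python
import os

ALLOWED_FORMATS = {"pdf", "docx", "png", "tif", "txt"}  # illustrative subset

def check_upload_args(arguments: dict) -> list[str]:
    """Cheap pre-flight checks for an upload_paperless_document call.

    Field names and the format enum here are assumptions, not the
    authoritative UPS/MCP definitions.
    """
    errors = []
    file_name = arguments.get("fileName", "")
    file_format = str(arguments.get("fileFormat", "")).lower()
    if file_format not in ALLOWED_FORMATS:
        errors.append(f"unknown fileFormat: {file_format!r}")
    ext = os.path.splitext(file_name)[1].lstrip(".").lower()
    if ext and file_format and ext != file_format:
        errors.append(
            f"fileName extension {ext!r} disagrees with fileFormat {file_format!r}"
        )
    return errors
```

The same pattern (enum check plus cross-field consistency) generalizes to the other per-tool risks listed above.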

Out-of-Scope Use

Do not use this model as an autonomous authority for shipping purchases, billing decisions, customs declarations, legal compliance, or cancellation actions. It should generate candidate MCP tool calls only; production systems should validate, log, and gate execution.

Reproducing the evaluation

The complete stress-test harness, all prompt corpora, and all per-case JSON output are open source. The harness is config-free; each axis is one Python module that takes ~3–25 minutes on a single RTX 3090 with the GGUF served via Ollama.

```shell
# Run all 9 axes (~55 min total)
uv run --no-project python -m scripts.stress_test.modules.b_strict_replay
uv run --no-project python -m scripts.stress_test.modules.c_ood
uv run --no-project python -m scripts.stress_test.modules.d_paraphrase
uv run --no-project python -m scripts.stress_test.modules.e_sibling
uv run --no-project python -m scripts.stress_test.modules.f_negative
uv run --no-project python -m scripts.stress_test.modules.g_multi_turn
uv run --no-project python -m scripts.stress_test.modules.h_variance
uv run --no-project python -m scripts.stress_test.modules.i_real_api
uv run --no-project python -m scripts.stress_test.modules.j_natural

# Aggregate everything into a single suite JSON + markdown report
uv run --no-project python scripts/stress_test/run_all.py
```

License

Apache 2.0. The base Qwen2.5 model is also released under Apache 2.0.


Evaluation results

All figures are self-reported.

| Metric | Eval set | Value |
|---|---|---|
| Tool selection accuracy (shape) | 900-case templated stress (50/tool × 18 tools) | 0.978 |
| Value-grounded accuracy (strict) | 900-case templated stress (50/tool × 18 tools) | 0.950 |
| Nested-structural accuracy | 900-case templated stress (50/tool × 18 tools) | 0.978 |
| Schema-typed validity | 900-case templated stress (50/tool × 18 tools) | 0.978 |
| JSON validity rate | 900-case templated stress (50/tool × 18 tools) | 1.000 |
| Silent value-hallucination rate (lower is better) | 900-case templated stress (50/tool × 18 tools) | 0.028 |
| Tool selection accuracy (shape) | OOD-lexicon stress (360 prompts, vocab disjoint from training) | 0.989 |
| Value-grounded accuracy | OOD-lexicon stress (360 prompts, vocab disjoint from training) | 0.961 |
| Silent value-hallucination rate (lower is better) | OOD-lexicon stress (360 prompts, vocab disjoint from training) | 0.028 |
| Tool selection accuracy (shape) | Paraphrase stress (360 prompts, voice/email/sms/typo) | 0.917 |
| Value-grounded accuracy | Paraphrase stress (360 prompts, voice/email/sms/typo) | 0.653 |
| Silent value-hallucination rate (lower is better) | Paraphrase stress (360 prompts, voice/email/sms/typo) | 0.264 |
| Tool selection accuracy (shape) | Sibling stress (120 adversarial confusable-pair prompts) | 0.908 |
| Value-grounded accuracy | Sibling stress (120 adversarial confusable-pair prompts) | 0.783 |
| Refusal pass rate (overall) | Negative-set stress (80 off-domain / incomplete / conflicting / refusal-bait) | 0.537 |
| Refusal pass rate on incomplete prompts | Negative-set stress (80 off-domain / incomplete / conflicting / refusal-bait) | 0.050 |
| Tool selection accuracy (shape) | Natural-language eval (100 hand-curated real-customer prompts) | 0.900 |
| Value-grounded accuracy | Natural-language eval (100 hand-curated real-customer prompts) | 0.790 |
| Per-step pass rate | Multi-turn workflows (30 hand-curated 5–8 step workflows, 161 total steps) | 0.938 |
| Full-flow pass rate | Multi-turn workflows (30 hand-curated 5–8 step workflows, 161 total steps) | 0.667 |