Use it from Swift

Add the package

Package.swift:

.package(url: "https://github.com/john-rocky/CoreML-LLM", branch: "main"),

// In your target:
.product(name: "CoreMLLLM", package: "CoreML-LLM"),

Platforms: iOS 18+ / macOS 15+.

Download + chat (one call)

import CoreMLLLM

let llm = try await CoreMLLLM.load(repo: "mlboydaisuke/qwen3-vl-2b-coreml")

let stream = try await llm.generate(
    [CoreMLLLM.Message(role: .user, content: "Hello!")],
    maxTokens: 256
)
for await chunk in stream { print(chunk, terminator: "") }

With an image

import CoreGraphics

let cgImage: CGImage = ...   // your CGImage

let stream = try await llm.generate(
    [CoreMLLLM.Message(role: .user,
                       content: "What's in this image?")],
    image: cgImage,
    maxTokens: 256
)
for await chunk in stream { print(chunk, terminator: "") }

Qwen3-VL 2B – Core ML (recurrent path, v1.4.0)

Core ML port of Qwen/Qwen3-VL-2B-Instruct – text + vision, INT8 chunked, runs on the iPhone's A18 ANE.

Heads up: for new work, prefer the stateful variant at mlboydaisuke/qwen3-vl-2b-stateful-coreml – same model, but the KV cache lives inside the ANE via MLState + slice_update, so memory is 6× lower (264 MB vs 1.7 GB) and decode is 2× faster (24 vs 10 tok/s) on iPhone 17 Pro. This recurrent repo is kept for backward compatibility with the v1.4.0 runtime.

Files

qwen3_vl_2b_decode_chunks/
├── chunk_0.mlpackage              # 353 MB – text path: embed + L0-6
├── chunk_1.mlpackage              # 353 MB – L7-13
├── chunk_2.mlpackage              # 353 MB – L14-20
├── chunk_3.mlpackage              # 353 MB – L21-27
├── chunk_head.mlpackage           # 311 MB – final_norm + lm_head + argmax
├── chunk_0_vision.mlpackage       # 353 MB – chunk_0 with DeepStack injection
├── prefill_chunk_{0..3}.mlpackage # 353 MB each – T=32 batched prefill bodies
├── prefill_chunk_0_vision.mlpackage  # vision-aware prefill chunk_0
└── embed_weight.bin               # 622 MB – raw fp16 embed (151936 × 2048)

qwen3_vl_2b_vision/
└── vision.mlpackage               # 406 MB – 448×448 → 196 tokens + 3 DeepStack taps

The vision encoder is loaded only when an image is in the prompt. DeepStack taps from vision layers 5/11/17 are injected into text layers 0/1/2 via chunk_0_vision.

What this repo does NOT ship

  • No model_config.json – Core ML serializes shapes into each .mlpackage; coremltools opens them without external config.
  • No tokenizer / processor – fetch them from the base model:
from transformers import AutoTokenizer, AutoProcessor
tok  = AutoTokenizer.from_pretrained("Qwen/Qwen3-VL-2B-Instruct")
proc = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-2B-Instruct")

Vision preprocessing note: Qwen3-VL uses mean=std=0.5 (not the CLIP defaults), and the vision encoder expects pixel_values (3, 2, 448, 448) already pre-patchified – see conversion/build_qwen3_vl_2b_vision.py for the exact transform.
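
For orientation, a rough sketch of that normalization, assuming the (3, 2, 448, 448) layout is (channels, temporal, height, width) with the single frame repeated along the temporal axis; treat conversion/build_qwen3_vl_2b_vision.py as the source of truth for the actual patchify transform:

import numpy as np
from PIL import Image

def preprocess(path):
    # Fixed 448x448 input, RGB, scaled to [0, 1].
    img = Image.open(path).convert("RGB").resize((448, 448))
    x = np.asarray(img, dtype=np.float32) / 255.0      # (448, 448, 3)
    # Qwen3-VL normalization: mean = std = 0.5 per channel (not CLIP's stats).
    x = (x - 0.5) / 0.5
    x = x.transpose(2, 0, 1)                            # (3, 448, 448)
    # Assumed layout: repeat the frame along a temporal axis of 2.
    return np.stack([x, x], axis=1)                     # (3, 2, 448, 448)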

Standalone usage (Python / Mac)

import coremltools as ct, numpy as np
from huggingface_hub import snapshot_download

local = snapshot_download("mlboydaisuke/qwen3-vl-2b-coreml")
root  = f"{local}/qwen3_vl_2b_decode_chunks"

decode_chunks = [ct.models.MLModel(f"{root}/chunk_{i}.mlpackage") for i in range(4)]
head         = ct.models.MLModel(f"{root}/chunk_head.mlpackage")
embed        = np.memmap(f"{root}/embed_weight.bin",
                         dtype=np.float16, mode="r",
                         shape=(151936, 2048))
vision = ct.models.MLModel(
    f"{local}/qwen3_vl_2b_vision/vision.mlpackage")

For text-only prompts, skip the vision encoder and chain chunk_0..3 → chunk_head per step. For image prompts, run vision.predict once, swap chunk_0 with chunk_0_vision and inject the 3 DeepStack tensors during the first token of each image span.
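
A minimal text-only sketch of that chaining follows. The feature names (hidden_states, position_ids, token_id) and the per-chunk KV-cache tensors are placeholders, not the real interface, and whether the host does the embedding lookup from embed_weight.bin is likewise an assumption; read each package's spec (chunk.get_spec().description) or follow conversion/qwen3_vl_2b_parity.py for the actual I/O:

def embed_lookup(token_ids):
    # Raw fp16 lookup from embed_weight.bin: (T,) -> (1, T, 2048).
    return np.asarray(embed[token_ids], dtype=np.float16)[None]

def decode_step(token_id, position):
    # NOTE: every feature name below is a placeholder for illustration only.
    hidden = embed_lookup(np.array([token_id]))            # (1, 1, 2048)
    pos = np.array([[position]], dtype=np.int32)
    for chunk in decode_chunks:                            # chunk_0 .. chunk_3
        out = chunk.predict({
            "hidden_states": hidden,
            "position_ids": pos,
            # ...plus this chunk's recurrent KV-cache inputs, fed back each step
        })
        hidden = out["hidden_states"]
    out = head.predict({"hidden_states": hidden})          # final_norm + lm_head + argmax
    return int(out["token_id"])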

Reference loop: conversion/qwen3_vl_2b_parity.py.

iOS / Mac app

Swift runtime: Qwen3VL2BGenerator.swift. Pick Qwen3-VL 2B (recurrent, v1.4.0) in the model picker.

Architecture

28-layer GQA text backbone + ViT vision tower.

  • Text: hidden=2048, num_heads=16, num_kv=8, head_dim=128, vocab=151936, tie_embeddings=True, rope_theta=5e6, mRoPE section=[24,20,20] interleaved (collapses to standard 1D RoPE for text-only; see the sketch after this list).
  • Vision: 448×448 fixed, 196 tokens after spatial_merge=2, DeepStack taps at vision layers 5/11/17 → text layers 0/1/2.
  • Chunks: 4 body chunks × 7 layers each.
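
To make the text-only collapse concrete, here is a small sketch of the rotary angles. It uses a contiguous [24, 20, 20] split of the 64 frequency pairs (head_dim/2) for simplicity, whereas the shipped layout is interleaved; either way, when the temporal/height/width position components are equal, as in text-only prompts, the angles match standard 1D RoPE with theta=5e6:

import numpy as np

def mrope_angles(pos_t, pos_h, pos_w, head_dim=128, theta=5e6, sections=(24, 20, 20)):
    # 64 inverse frequencies, one per rotary pair.
    inv_freq = 1.0 / theta ** (np.arange(0, head_dim, 2) / head_dim)
    # Drive each frequency by one of the t/h/w positions (24 + 20 + 20 = 64).
    pos = np.concatenate([np.full(n, p, dtype=np.float64)
                          for n, p in zip(sections, (pos_t, pos_h, pos_w))])
    return pos * inv_freq

# Text-only prompt: t == h == w, so mRoPE reduces to plain 1D RoPE.
assert np.allclose(mrope_angles(7, 7, 7),
                   7 * (1.0 / 5e6 ** (np.arange(0, 128, 2) / 128)))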

License

Apache 2.0 (inherits from the base model).
