Use it from Swift

Add the package

Package.swift:

.package(url: "https://github.com/john-rocky/CoreML-LLM", branch: "main"),

// In your target:
.product(name: "CoreMLLLM", package: "CoreML-LLM"),

Platforms: iOS 18+ / macOS 15+.

Download + chat (one call)

import CoreMLLLM

let llm = try await CoreMLLLM.load(repo: "mlboydaisuke/qwen3-vl-2b-coreml")

let stream = try await llm.generate(
    [CoreMLLLM.Message(role: .user, content: "Hello!")],
    maxTokens: 256
)
for await chunk in stream { print(chunk, terminator: "") }

With an image

import CoreGraphics

let cgImage: CGImage = ...   // your CGImage

let stream = try await llm.generate(
    [CoreMLLLM.Message(role: .user,
                       content: "What's in this image?")],
    image: cgImage,
    maxTokens: 256
)
for await chunk in stream { print(chunk, terminator: "") }

Qwen3-VL 2B – Core ML (recurrent path, v1.4.0)

Core ML port of Qwen/Qwen3-VL-2B-Instruct – text + vision, INT8 chunked, runs on the iPhone's A18 ANE.

Heads up: for new work, prefer the stateful variant at mlboydaisuke/qwen3-vl-2b-stateful-coreml – same model, but the KV cache lives inside the ANE via MLState + slice_update, so memory is 6× lower (264 MB vs 1.7 GB) and decode is 2× faster (24 vs 10 tok/s) on iPhone 17 Pro. This recurrent repo is kept for backward compatibility with the v1.4.0 runtime.

Files

qwen3_vl_2b_decode_chunks/
├── chunk_0.mlpackage              # 353 MB – text path: embed + L0-6
├── chunk_1.mlpackage              # 353 MB – L7-13
├── chunk_2.mlpackage              # 353 MB – L14-20
├── chunk_3.mlpackage              # 353 MB – L21-27
├── chunk_head.mlpackage           # 311 MB – final_norm + lm_head + argmax
├── chunk_0_vision.mlpackage       # 353 MB – chunk_0 with DeepStack injection
├── prefill_chunk_{0..3}.mlpackage # 353 MB each – T=32 batched prefill bodies
├── prefill_chunk_0_vision.mlpackage  # vision-aware prefill chunk_0
└── embed_weight.bin               # 622 MB – raw fp16 embed (151936 × 2048)

qwen3_vl_2b_vision/
└── vision.mlpackage               # 406 MB – 448×448 → 196 tokens + 3 DeepStack taps

The vision encoder is loaded only when an image is in the prompt. DeepStack taps from vision layers 5/11/17 are injected into text layers 0/1/2 via chunk_0_vision.

What this repo does NOT ship

  • No model_config.json – Core ML serializes shapes into each .mlpackage; coremltools opens them without external config.
  • No tokenizer / processor – fetch them from the base model:
from transformers import AutoTokenizer, AutoProcessor
tok  = AutoTokenizer.from_pretrained("Qwen/Qwen3-VL-2B-Instruct")
proc = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-2B-Instruct")

Vision preprocessing note: Qwen3-VL uses mean=std=0.5 (not the CLIP defaults), and the vision encoder expects pixel_values (3, 2, 448, 448) already pre-patchified – see conversion/build_qwen3_vl_2b_vision.py for the exact transform.
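
For orientation, a rough sketch of that normalization, assuming the (3, 2, 448, 448) layout is (channels, temporal, height, width) with the single frame repeated along the temporal axis; treat conversion/build_qwen3_vl_2b_vision.py as the source of truth for the actual patchify transform:

import numpy as np
from PIL import Image

def preprocess(path):
    # Fixed 448x448 input, RGB, scaled to [0, 1].
    img = Image.open(path).convert("RGB").resize((448, 448))
    x = np.asarray(img, dtype=np.float32) / 255.0      # (448, 448, 3)
    # Qwen3-VL normalization: mean = std = 0.5 per channel (not CLIP's stats).
    x = (x - 0.5) / 0.5
    x = x.transpose(2, 0, 1)                            # (3, 448, 448)
    # Assumed layout: repeat the frame along a temporal axis of 2.
    return np.stack([x, x], axis=1)                     # (3, 2, 448, 448)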

Standalone usage (Python / Mac)

import coremltools as ct, numpy as np
from huggingface_hub import snapshot_download

local = snapshot_download("mlboydaisuke/qwen3-vl-2b-coreml")
root  = f"{local}/qwen3_vl_2b_decode_chunks"

decode_chunks = [ct.models.MLModel(f"{root}/chunk_{i}.mlpackage") for i in range(4)]
head         = ct.models.MLModel(f"{root}/chunk_head.mlpackage")
embed        = np.memmap(f"{root}/embed_weight.bin",
                         dtype=np.float16, mode="r",
                         shape=(151936, 2048))
vision = ct.models.MLModel(
    f"{local}/qwen3_vl_2b_vision/vision.mlpackage")

For text-only prompts, skip the vision encoder and chain chunk_0..3 → chunk_head per step. For image prompts, run vision.predict once, swap chunk_0 with chunk_0_vision and inject the 3 DeepStack tensors during the first token of each image span.
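
A minimal text-only sketch of that chaining follows. The feature names (hidden_states, position_ids, token_id) and the per-chunk KV-cache tensors are placeholders, not the real interface, and whether the host does the embedding lookup from embed_weight.bin is likewise an assumption; read each package's spec (chunk.get_spec().description) or follow conversion/qwen3_vl_2b_parity.py for the actual I/O:

def embed_lookup(token_ids):
    # Raw fp16 lookup from embed_weight.bin: (T,) -> (1, T, 2048).
    return np.asarray(embed[token_ids], dtype=np.float16)[None]

def decode_step(token_id, position):
    # NOTE: every feature name below is a placeholder for illustration only.
    hidden = embed_lookup(np.array([token_id]))            # (1, 1, 2048)
    pos = np.array([[position]], dtype=np.int32)
    for chunk in decode_chunks:                            # chunk_0 .. chunk_3
        out = chunk.predict({
            "hidden_states": hidden,
            "position_ids": pos,
            # ...plus this chunk's recurrent KV-cache inputs, fed back each step
        })
        hidden = out["hidden_states"]
    out = head.predict({"hidden_states": hidden})          # final_norm + lm_head + argmax
    return int(out["token_id"])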

Reference loop: conversion/qwen3_vl_2b_parity.py.

iOS / Mac app

Swift runtime: Qwen3VL2BGenerator.swift. Pick Qwen3-VL 2B (recurrent, v1.4.0) in the model picker.

Architecture

28-layer GQA text backbone + ViT vision tower.

  • Text: hidden=2048, num_heads=16, num_kv=8, head_dim=128, vocab=151936, tie_embeddings=True, rope_theta=5e6, mRoPE section=[24,20,20] interleaved (collapses to standard 1D RoPE for text-only; see the sketch after this list).
  • Vision: 448×448 fixed, 196 tokens after spatial_merge=2, DeepStack taps at vision layers 5/11/17 → text layers 0/1/2.
  • Chunks: 4 body chunks × 7 layers each.
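
To make the text-only collapse concrete, here is a small sketch of the rotary angles. It uses a contiguous [24, 20, 20] split of the 64 frequency pairs (head_dim/2) for simplicity, whereas the shipped layout is interleaved; either way, when the temporal/height/width position components are equal, as in text-only prompts, the angles match standard 1D RoPE with theta=5e6:

import numpy as np

def mrope_angles(pos_t, pos_h, pos_w, head_dim=128, theta=5e6, sections=(24, 20, 20)):
    # 64 inverse frequencies, one per rotary pair.
    inv_freq = 1.0 / theta ** (np.arange(0, head_dim, 2) / head_dim)
    # Drive each frequency by one of the t/h/w positions (24 + 20 + 20 = 64).
    pos = np.concatenate([np.full(n, p, dtype=np.float64)
                          for n, p in zip(sections, (pos_t, pos_h, pos_w))])
    return pos * inv_freq

# Text-only prompt: t == h == w, so mRoPE reduces to plain 1D RoPE.
assert np.allclose(mrope_angles(7, 7, 7),
                   7 * (1.0 / 5e6 ** (np.arange(0, 128, 2) / 128)))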

License

Apache 2.0 (inherits from the base model).
