Unlimited-OCR → Core AI (on-device document OCR)

On-device document → structured-markdown OCR, end-to-end on Apple Core AI. A port of baidu/Unlimited-OCR (3B-A0.5B MoE, MIT): drop a document image, get back markdown — tables as HTML (<table><tr><td>…), formulas as LaTeX, reading order, and <|det|> layout boxes. Japanese + English + multilingual.

Runs on the stock coreai.runtime with no engine patch — the decoder is driven directly on inputs_embeds, so this is a pure-export port (not the static-input-buffer VLM path).

Use it

▶️ Run it (source) — the ReadDoc runner (GUI + CLI, one app for every document-OCR model in the catalog):

git clone https://github.com/john-rocky/coreai-kit
open coreai-kit/Examples/ReadDoc/ReadDoc.xcodeproj
# → Run, then pick "Unlimited-OCR" in the model picker

# agents / headless (macOS):
cd coreai-kit/Examples/ReadDoc
swift run readdoc-cli --model unlimited-ocr --image sample.png

💻 Build with it: a complete example. The glue is all kit API, so it runs as pasted:

import CoreAIKit

let reader = try await KitDocReader(catalog: "unlimited-ocr")
let markdown = try await reader.read(imageAt: imageURL)
// markdown: the document as structured text — tables as <table>/<tr>/<td>,
// <|det|> layout boxes, reading order — fully on-device

The take-home is Examples/ReadDoc/Sources/QuickStart.swift — this exact code as one typed function, no UI; the CLI is an argument shell over it, and the GUI drives the same KitDocReader(catalog:) on the image you pick. One read(imageAt:) call per page; chunk a PDF into page images first. The output keeps the model's structural markup (tables as HTML, formulas as LaTeX, <|det|> boxes) — strip or render it as your app prefers.
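
If your app wants plain text rather than the model's markup, a post-processing pass could look like the sketch below. It assumes layout boxes arrive as `<|det|>…<|/det|>` spans and that other special tokens follow the `<|name|>` pattern; neither is spelled out in this card, so check real output first. `stripLayoutMarkup` is a hypothetical helper, not part of CoreAIKit.

```swift
import Foundation

/// Hypothetical helper (not part of CoreAIKit): removes layout markup from OCR output.
/// Assumption: boxes are wrapped as <|det|>…<|/det|>, other special tokens look like <|name|>.
func stripLayoutMarkup(_ markdown: String) -> String {
    var text = markdown
    // 1) Drop whole <|det|>…<|/det|> spans, including the coordinates inside them.
    text = text.replacingOccurrences(
        of: "<\\|det\\|>.*?<\\|/det\\|>",
        with: "",
        options: .regularExpression)
    // 2) Drop any remaining <|…|> special tokens, keeping the text between them.
    text = text.replacingOccurrences(
        of: "<\\|[^|>]*\\|>",
        with: "",
        options: .regularExpression)
    return text.trimmingCharacters(in: .whitespacesAndNewlines)
}
```

HTML table tags and LaTeX are left untouched, so a renderer downstream can still handle them.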

Integration checklist

  • SPM: https://github.com/john-rocky/coreai-kit → product CoreAIKit
  • Info.plist: none needed
  • Entitlements: none needed
  • First run downloads the model — 4.5 GB (Mac) — then it loads from the local cache (Application Support; progress via the downloadProgress callback)
  • Measure in Release — Debug is ~3× slower on per-token host work
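
For a package-based project, the SPM item above might translate into a manifest like this sketch. The package URL and product name come from the checklist; the app name, the tools version, and tracking the `main` branch are assumptions, and the minimum platform is left out because it is not stated here.

```swift
// swift-tools-version:5.9
import PackageDescription

let package = Package(
    name: "MyOCRApp",   // hypothetical app name
    // platforms: add the minimum OS the kit requires (not stated in this card)
    dependencies: [
        // Assumption: tracking `main`; pin to a tagged release if the repo publishes one.
        .package(url: "https://github.com/john-rocky/coreai-kit", branch: "main"),
    ],
    targets: [
        .executableTarget(
            name: "MyOCRApp",
            dependencies: [.product(name: "CoreAIKit", package: "coreai-kit")]
        ),
    ]
)
```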

What's exciting (why you'd use it)

  • Private OCR: invoices, receipts, contracts, papers, forms never leave the device.
  • Structured, not just text: tables → HTML, equations → LaTeX, layout → boxes. RAG-ready ingestion.
  • Flat latency: a static-shape decode graph (data-driven KV write + fixed-buffer R-SWA mask) keeps every tensor shape constant, so the runtime compiles once and decode stays flat at 12.7 ms/token (79 tok/s on M4 Max) — no growing-cache recompilation stalls.
  • SOTA quality: the source model tops OmniDocBench v1.6 (93.92), and this port is byte-faithful to the fp32 reference (decoder: 0 flips at the sampled steps; vision encoder: cosine similarity 1.000000).
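
The flat-latency point rests on every tensor keeping a constant shape. The toy below illustrates that idea only; it is not the port's decoder graph. It assumes R-SWA behaves like a sliding window over a fixed buffer, which this card does not spell out, and `FixedKVBuffer` with its members is hypothetical.

```swift
/// Toy fixed-capacity KV buffer: every array keeps the same length at every step.
struct FixedKVBuffer {
    let capacity: Int
    private(set) var slots: [Float]   // one Float stands in for a whole K/V row
    private(set) var written = 0      // number of tokens seen so far

    init(capacity: Int) {
        self.capacity = capacity
        self.slots = [Float](repeating: 0, count: capacity)
    }

    /// Data-driven write: the slot index is computed from a runtime value,
    /// so the buffer shape never changes; the oldest entry is overwritten.
    mutating func write(_ value: Float) {
        slots[written % capacity] = value
        written += 1
    }

    /// Additive attention mask of constant length: 0 for filled slots,
    /// -infinity for slots that have not been written yet.
    var mask: [Float] {
        (0..<capacity).map { $0 < written ? 0 : -Float.infinity }
    }
}
```

Because `slots` and `mask` never change length, a runtime that specializes on shapes has nothing to recompile as the sequence grows; only the values change.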

Bundles

| path | what | dtype | size |
|---|---|---|---|
| vision/unlimited_ocr_vision.aimodel | DeepEncoder (SAM-ViT + CLIP-ViT cascade) → 100 visual tokens | fp16 | 762 MB |
| decoder/unlimited_ocr_decoder.aimodel | DeepseekV2 R-SWA MoE decoder, functions prefill + decode sharing one weight set + KV state | sym8 | 3.2 GB |
| assets/embed_tokens.f16 | token embedding table [129280,1280] (host row-gather) | fp16 | 316 MB |
| assets/{image_newline,view_seperator}.f16, assets/prompt_input_ids.i32, assets/recipe.json | arrangement constants + the assembly recipe | | tiny |
| tokenizer/ | fast tokenizer (tokenizer.json + configs) | | |

Pipeline (Base mode, 640px)

image → preprocess (pad to 640², normalize mean=std=0.5)
      → vision .aimodel                         → visual tokens [1,100,1280]
      → arrange (10×10 + image_newline per row + view_seperator) → [111,1280]
      → scatter into embed_tokens(prompt_ids)   → prefix [1,115,1280]
      → decoder: prefill(prefix) + greedy decode (no_repeat_ngram=35) → tokens
      → detokenize (keep special tokens)        → markdown
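
To see where the 111 in the arrange step comes from, here is a toy sketch of the layout using tags instead of 1280-wide vectors: 10 rows of 10 visual tokens, one image_newline after each row, and one view_seperator. `Slot` and `arrange` are hypothetical names, and placing view_seperator last is an assumption; assets/recipe.json remains the authoritative layout.

```swift
/// Tags standing in for 1280-wide embedding rows.
enum Slot: Equatable {
    case visual(Int)      // index into the 100 vision-encoder tokens
    case imageNewline
    case viewSeparator
}

/// Lays out a grid×grid block of visual tokens row by row, appending an
/// image_newline after each row and one view_seperator after the last row.
func arrange(grid: Int) -> [Slot] {
    var out: [Slot] = []
    for row in 0..<grid {
        for col in 0..<grid {
            out.append(.visual(row * grid + col))
        }
        out.append(.imageNewline)
    }
    out.append(.viewSeparator)   // assumption: placed last
    return out
}

let arranged = arrange(grid: 10)
print(arranged.count)   // 10*10 + 10 + 1 = 111
```

With 111 of the 115 prefix positions taken by the arranged image tokens, the remaining 4 would be ordinary prompt tokens, if the scatter step fills exactly those image positions.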

The exact, verified recipe is in assets/recipe.json. Reference implementations (Python end-to-end + a macOS app, CoreAIOCR, driving the stock runtime) are in the Core AI Model Zoo: conversion/unlimited_ocr/ and apps/CoreAIOCR/.
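
The decode step in the pipeline pairs greedy selection with a no-repeat n-gram constraint (size 35). The sketch below shows that general technique, written independently of the reference implementations: before taking the argmax, ban every token that would complete an n-gram already present in the sequence. The function names are hypothetical, and any variant the port applies (for example a limited look-back window) is not stated here.

```swift
/// Tokens that would complete an n-gram already present in `history`.
func bannedTokens(history: [Int], ngramSize n: Int) -> Set<Int> {
    guard n > 0, history.count >= n else { return [] }
    let tail = Array(history.suffix(n - 1))   // the last n-1 generated tokens
    var banned = Set<Int>()
    // Slide over every earlier n-gram; if its first n-1 tokens match the
    // current tail, emitting its final token would repeat that n-gram.
    for start in 0...(history.count - n) {
        if Array(history[start..<(start + n - 1)]) == tail {
            banned.insert(history[start + n - 1])
        }
    }
    return banned
}

/// Greedy pick: the highest-logit token that is not banned (nil if all are banned).
func greedyPick(logits: [Float], history: [Int], ngramSize: Int) -> Int? {
    let banned = bannedTokens(history: history, ngramSize: ngramSize)
    var best: Int? = nil
    for (token, logit) in logits.enumerated() where !banned.contains(token) {
        if best == nil || logit > logits[best!] { best = token }
    }
    return best
}
```

With a size as large as 35, the constraint rarely fires on normal text; its job is to break long degenerate loops, such as an endlessly repeated table row.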

Notes

  • Appropriate input: clean single-page documents (invoice / paper / report / table / formula), roughly square or portrait, whose text is still legible after fitting to 640². Very dense small-text scans (e.g. newspapers) need the tiled crop_mode vision export, which is not included here (this port is Base mode only).
  • Prompt is fixed to document parsing (layout + structured extraction).
  • License: MIT (inherited from baidu/Unlimited-OCR).
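
To judge whether a page will stay legible at 640², it can help to compute how much it shrinks. This sketch assumes an aspect-preserving fit with the remainder padded, as "pad to 640²" in the pipeline suggests. `fitScale` is a hypothetical helper, and the page size and glyph height are example values.

```swift
/// Scale factor applied when an image is fit inside a side×side square
/// with its aspect ratio preserved (the remainder is padding).
func fitScale(width: Double, height: Double, side: Double = 640) -> Double {
    min(side / width, side / height)
}

// Example: an A4 page scanned at 300 dpi is 2480×3508 px.
let scale = fitScale(width: 2480, height: 3508)
// A glyph that is 40 px tall in the scan ends up about 40 * scale px tall.
let glyphHeight = 40 * scale
print(scale, glyphHeight)
```

If the resulting glyph height is only a few pixels, the page is probably in the dense small-text category described in the first note.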

Community port — not affiliated with Apple or baidu.
