Feature Request: TFLite Q4/Q6/Q8 Quantizations for Nanbeige4.1-3B

#42
by Narutoouz - opened

Motivation

Enable local inference on Android phones (Snapdragon 865 or newer, 8 GB RAM) via Google AI Edge Gallery / the LiteRT runtime. This 3B SLM reportedly outperforms Qwen3-32B on several benchmarks but lacks mobile-optimized formats.

Current Formats Available

  • ✅ BF16 base model
  • ✅ GGUF quants (Q4_K_M, Q6_K, Q8_0 via the mradermacher repo)
  • ❌ No TFLite/LiteRT conversions

Requested Formats

  1. TFLite INT4 (Q4 equivalent) - ~1.5GB, target 4-6 tok/s on Adreno GPU
  2. TFLite INT6 (Q6 equivalent) - ~2.5GB, balanced quality/speed
  3. TFLite INT8 (Q8 equivalent) - ~3GB, highest fidelity
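The size targets above follow directly from bits-per-weight arithmetic (3B parameters, ignoring embedding/activation overhead). A quick sketch; the helper name is illustrative, not part of any API:

```python
# Rough on-disk size for an n-parameter model quantized to b bits per weight.
# Ignores per-tensor scale/zero-point metadata, so real files run slightly larger.
def quantized_size_gb(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / 1e9

for bits in (4, 6, 8):
    print(f"INT{bits}: ~{quantized_size_gb(3e9, bits):.2f} GB")
# INT4: ~1.50 GB, INT6: ~2.25 GB, INT8: ~3.00 GB
```

This matches the ~1.5 GB / ~2.5 GB / ~3 GB figures in the list once quantization metadata and non-quantized layers are added.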

Conversion Path (for maintainers)

A rough sketch (the exact `convert()` arguments should be checked against Google's current docs; `sample_inputs` is added here because conversion traces the model):

```python
# Using the LiteRT / AI Edge Torch Generative API (recommended)
import ai_edge_torch
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Nanbeige/Nanbeige4.1-3B").eval()

# Re-author + convert + quantize per Google's docs;
# convert() traces the model with sample inputs
sample_inputs = (torch.zeros((1, 128), dtype=torch.long),)
edge_model = ai_edge_torch.convert(model, sample_inputs)
edge_model.export("nanbeige4.1-3b-q4.tflite")
```

Target Deployments

  • Google AI Edge Gallery (Play Store/GitHub)
  • LiteRT-LM runtime w/ NNAPI/GPU delegates
  • MediaPipe LLM Inference API

Benefits

  • Democratizes access to a top 3B SLM on midrange phones (e.g., Mi 10T Pro with 8 GB RAM)
  • Complements GGUF ecosystem with mobile-native format
  • Precedents: Gemma 2B, Phi-2 have official TFLite quants

A related request for Nanbeige4.1-3B support in LiteRT (the successor to TFLite) has been filed here: https://github.com/google-ai-edge/LiteRT/issues/6419

Thank you!
