Feature Request: TFLite Q4/Q6/Q8 Quantizations for Nanbeige4.1-3B

#42
by Narutoouz - opened

Motivation

Enable local inference on Android phones (Snapdragon 865 or newer, 8 GB RAM) via Google AI Edge Gallery / the LiteRT runtime. This 3B SLM reportedly outperforms Qwen3-32B on several benchmarks but lacks mobile-optimized formats.

Current Formats Available

  • ✅ BF16 base model
  • ✅ GGUF quants (Q4_K_M, Q6_K, Q8_0 via the mradermacher repo)
  • ❌ No TFLite/LiteRT conversions

Requested Formats

  1. TFLite INT4 (Q4 equivalent) - ~1.5GB, target 4-6 tok/s on Adreno GPU
  2. TFLite INT6 (Q6 equivalent) - ~2.5GB, balanced quality/speed
  3. TFLite INT8 (Q8 equivalent) - ~3GB, highest fidelity
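The size targets above follow directly from bits-per-weight arithmetic (3B parameters, ignoring embedding/activation overhead). A quick sketch; the helper name is illustrative, not part of any API:

```python
# Rough on-disk size for an n-parameter model quantized to b bits per weight.
# Ignores per-tensor scale/zero-point metadata, so real files run slightly larger.
def quantized_size_gb(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / 1e9

for bits in (4, 6, 8):
    print(f"INT{bits}: ~{quantized_size_gb(3e9, bits):.2f} GB")
# INT4: ~1.50 GB, INT6: ~2.25 GB, INT8: ~3.00 GB
```

This matches the ~1.5 GB / ~2.5 GB / ~3 GB figures in the list once quantization metadata and non-quantized layers are added.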

Conversion Path (for maintainers)

A rough sketch (the exact `convert()` arguments should be checked against Google's current docs; `sample_inputs` is added here because conversion traces the model):

```python
# Using the LiteRT / AI Edge Torch Generative API (recommended)
import ai_edge_torch
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Nanbeige/Nanbeige4.1-3B").eval()

# Re-author + convert + quantize per Google's docs;
# convert() traces the model with sample inputs
sample_inputs = (torch.zeros((1, 128), dtype=torch.long),)
edge_model = ai_edge_torch.convert(model, sample_inputs)
edge_model.export("nanbeige4.1-3b-q4.tflite")
```

Target Deployments

  • Google AI Edge Gallery (Play Store/GitHub)
  • LiteRT-LM runtime w/ NNAPI/GPU delegates
  • MediaPipe LLM Inference API

Benefits

  • Democratizes access to a top 3B SLM on midrange phones (e.g., Mi 10T Pro with 8 GB RAM)
  • Complements GGUF ecosystem with mobile-native format
  • Precedents: Gemma 2B, Phi-2 have official TFLite quants

A related request for Nanbeige4.1-3B support in LiteRT (the successor to TFLite) has been filed here: https://github.com/google-ai-edge/LiteRT/issues/6419

Thank you!
