CQ Systems: Google Gemma 4 E2B IT (GGUF)

This repository contains quantized GGUF builds of the Google Gemma 4 E2B instruction-tuned model, published and optimized by CQ Systems.

⚡ Mac Neo & Apple Silicon Compatibility

These files are specifically optimized for local inference on consumer and professional hardware, with a major focus on Apple Silicon.

Mac Neo Architecture Support:

The bf16 (bfloat16)-derived quantizations in this repository are highly recommended for the latest Mac Neo hardware. By exploiting Mac Neo's unified memory architecture and native bfloat16 support, these models deliver fast token generation for on-device reasoning and multimodal tasks, with significantly reduced thermal load.

📦 Available Files

We provide five different versions of the model, allowing you to scale based on your available unified memory (RAM/VRAM) and speed requirements.

| Filename | Size | Quantization | Recommendation / Notes |
|---|---|---|---|
| `CQ-Gemma-4-E2B-bF16-Q4_K.gguf` | 3.2 GB | Q4_K_M | 🏆 **Recommended for Mac Neo / Apple Silicon.** Derived from a bfloat16 baseline. Best balance of speed, size, and reasoning quality; fits comfortably in 8 GB of unified memory. |
| `CQ-Gemma-4-E2B-Q8_0.gguf` | 4.6 GB | Q8_0 | High-quality 8-bit integer quantization, nearly indistinguishable from the baseline. Requires ~8 GB+ RAM/VRAM. |
| `CQ-Gemma-4-E2B-bF16-Q2_K.gguf` | 2.8 GB | Q2_K | Ultra-compressed (bfloat16 base). Use only when memory is strictly constrained; expect some perplexity degradation. |
| `CQ-Gemma-4-E2B-f16-Q2_K.gguf` | 2.8 GB | Q2_K | Ultra-compressed (float16 base). Alternative for older hardware that prefers standard f16 math over bf16. |
| `CQ-Gemma-4-E2B-bf16.gguf` | 8.7 GB | bf16 | Uncompressed 16-bit brain-float baseline. Highest fidelity, largest memory footprint; useful for research or custom requantization. |
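As a rough sanity check, a GGUF file's size is approximately parameter count × bits per weight. The sketch below (using the 5.1B total-parameter figure from this card) is illustrative only: K-quants mix block bit-widths, and every GGUF carries tokenizer/metadata overhead, so real files deviate from the estimate.

```python
def est_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Ballpark GGUF file size in GB: parameters x bits-per-weight.

    Real files deviate: K-quants mix 4/6-bit blocks, and GGUF stores
    tokenizer and metadata alongside the weights.
    """
    return n_params * bits_per_weight / 8 / 1e9

# ~5.1B total params at ~4.5 bits/weight lands near the Q4_K file size above.
print(f"{est_size_gb(5.1e9, 4.5):.1f} GB")  # -> 2.9 GB
```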

🚀 How to Use

Using llama.cpp

You can run this model directly from your terminal with the compiled llama.cpp CLI. On macOS, make sure llama.cpp was built with Metal support.

# Basic chat launch with the recommended Mac Neo Q4_K model
./llama-cli -m CQ-Gemma-4-E2B-bF16-Q4_K.gguf \
  -c 8192 \
  -n 1024 \
  -p "<|think|>\nAnalyze the system logs.\n<|channel>thought\n"
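For longer prompts, llama.cpp's `-f`/`--file` flag reads the prompt from a file instead of the command line, which avoids shell-escaping issues with the special tokens. A small sketch (same filenames as above):

```shell
# Store the thinking-mode prompt in a file.
cat > prompt.txt <<'EOF'
<|think|>
Analyze the system logs.
<|channel>thought
EOF

# Same launch as above, reading the prompt from the file
# (commented out here because it needs the model file on disk):
# ./llama-cli -m CQ-Gemma-4-E2B-bF16-Q4_K.gguf -c 8192 -n 1024 -f prompt.txt
```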

Using LM Studio / Ollama

  1. Download the .gguf file of your choice.
  2. LM Studio: Drag and drop the downloaded file into your LM Studio model folder, or use the local import feature.
  3. Ollama: Create a Modelfile with the line FROM ./CQ-Gemma-4-E2B-bF16-Q4_K.gguf, then run ollama create cq-gemma-e2b -f Modelfile.
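Step 3 can be scripted; the heredoc below writes the Modelfile described above (the `PARAMETER` line is an illustrative default, not a tuned recommendation):

```shell
# Write a minimal Ollama Modelfile pointing at the downloaded GGUF.
cat > Modelfile <<'EOF'
FROM ./CQ-Gemma-4-E2B-bF16-Q4_K.gguf
PARAMETER num_ctx 8192
EOF

# Then register and chat (requires the GGUF to be present):
# ollama create cq-gemma-e2b -f Modelfile
# ollama run cq-gemma-e2b
```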

(Note: while the base Gemma 4 E2B natively supports audio and image inputs, standard GGUF text generation supports only text out of the box. Multimodal vision/audio capabilities in GGUF require an accompanying mmproj encoder file.)

🧠 About the Base Model: Gemma 4 E2B

Gemma 4 is a family of open models built by Google DeepMind. The E2B variant is the smallest and fastest in the family, purpose-built for efficient local execution on laptops, mobile devices, and edge hardware.

  • Total Parameters: 2.3B effective (5.1B total with Per-Layer Embeddings). PLE gives each decoder layer its own small per-token embedding table, which reduces the parameters that must be resident in fast memory during inference.
  • Context Length: 128K tokens.
  • Supported Modalities (Base Model): Text, Image, Audio.
  • System Prompts: Gemma 4 uses standard system, assistant, and user roles natively.

Thinking Mode Configuration

Gemma 4 models are designed as highly capable reasoners. To enable the built-in reasoning capabilities, include the <|think|> token at the start of your system prompt.

When thinking is enabled, the model will output its internal reasoning followed by the final answer using this structure:

<|channel>thought\n[Internal reasoning]<channel|>[Final answer]
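When post-processing raw output, the two channels can be split on these marker strings. A sketch (the token strings are taken from the template above; verify them against the actual tokenizer config before relying on them):

```python
# Split raw model output into (thought, answer) per the channel template.
CH_OPEN = "<|channel>thought\n"
CH_CLOSE = "<channel|>"

def split_channels(raw: str) -> tuple[str, str]:
    """Return (thought, answer); thought is empty if no channel markers."""
    if CH_OPEN in raw and CH_CLOSE in raw:
        _, _, rest = raw.partition(CH_OPEN)
        thought, _, answer = rest.partition(CH_CLOSE)
        return thought.strip(), answer.strip()
    return "", raw.strip()

raw = "<|channel>thought\nCheck the logs first.<channel|>No errors found."
print(split_channels(raw))  # -> ('Check the logs first.', 'No errors found.')
```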

📜 License & Acknowledgements

  • Base Model: Google DeepMind
  • License: Gemma 4 License
  • Quantization & Optimization: CQ Systems Team

For full benchmark details, evaluations, ethical considerations, and safety metrics, please refer to the original Google Gemma 4 Documentation.
