CQ Systems: Google Gemma 4 E2B IT (GGUF)

This repository contains quantized GGUF builds of the Google Gemma 4 E2B instruction-tuned model, published and optimized by CQ Systems.

⚡ Mac Neo & Apple Silicon Compatibility

These files are specifically optimized for local inference on consumer and professional hardware, with a major focus on Apple Silicon.

Mac Neo Architecture Support:

The bf16 (bfloat16)-derived quantizations in this repository are highly recommended for the latest Mac Neo hardware. By exploiting Mac Neo's unified memory architecture and native bfloat16 support, these models deliver fast token generation for on-device reasoning and multimodal tasks, with significantly reduced thermal load.

📦 Available Files

We provide five different versions of the model, allowing you to scale based on your available unified memory (RAM/VRAM) and speed requirements.

| Filename | Size | Quantization | Recommendation / Notes |
|---|---|---|---|
| `CQ-Gemma-4-E2B-bF16-Q4_K.gguf` | 3.2 GB | Q4_K_M | 🏆 **Recommended for Mac Neo / Apple Silicon.** Derived from a bfloat16 baseline. Best balance of speed, size, and reasoning quality; fits comfortably in 8 GB of unified memory. |
| `CQ-Gemma-4-E2B-Q8_0.gguf` | 4.6 GB | Q8_0 | High-quality 8-bit integer quantization, nearly indistinguishable from the baseline. Requires ~8 GB+ RAM/VRAM. |
| `CQ-Gemma-4-E2B-bF16-Q2_K.gguf` | 2.8 GB | Q2_K | Ultra-compressed (bfloat16 base). Use only when memory is strictly constrained; expect some perplexity degradation. |
| `CQ-Gemma-4-E2B-f16-Q2_K.gguf` | 2.8 GB | Q2_K | Ultra-compressed (float16 base). Alternative for older hardware that prefers standard f16 math over bf16. |
| `CQ-Gemma-4-E2B-bf16.gguf` | 8.7 GB | bf16 | Uncompressed 16-bit brain-float baseline. Highest fidelity, largest memory footprint; useful for research or custom requantization. |
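As a rough sanity check, a GGUF file's size is approximately parameter count × bits per weight. The sketch below (using the 5.1B total-parameter figure from this card) is illustrative only: K-quants mix block bit-widths, and every GGUF carries tokenizer/metadata overhead, so real files deviate from the estimate.

```python
def est_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Ballpark GGUF file size in GB: parameters x bits-per-weight.

    Real files deviate: K-quants mix 4/6-bit blocks, and GGUF stores
    tokenizer and metadata alongside the weights.
    """
    return n_params * bits_per_weight / 8 / 1e9

# ~5.1B total params at ~4.5 bits/weight lands near the Q4_K file size above.
print(f"{est_size_gb(5.1e9, 4.5):.1f} GB")  # -> 2.9 GB
```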

🚀 How to Use

Using llama.cpp

You can run this model directly from your terminal with the compiled llama.cpp CLI. On macOS, make sure llama.cpp was built with Metal support.

# Basic chat launch with the recommended Mac Neo Q4_K model
./llama-cli -m CQ-Gemma-4-E2B-bF16-Q4_K.gguf \
  -c 8192 \
  -n 1024 \
  -p "<|think|>\nAnalyze the system logs.\n<|channel>thought\n"
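For longer prompts, llama.cpp's `-f`/`--file` flag reads the prompt from a file instead of the command line, which avoids shell-escaping issues with the special tokens. A small sketch (same filenames as above):

```shell
# Store the thinking-mode prompt in a file.
cat > prompt.txt <<'EOF'
<|think|>
Analyze the system logs.
<|channel>thought
EOF

# Same launch as above, reading the prompt from the file
# (commented out here because it needs the model file on disk):
# ./llama-cli -m CQ-Gemma-4-E2B-bF16-Q4_K.gguf -c 8192 -n 1024 -f prompt.txt
```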

Using LM Studio / Ollama

  1. Download the .gguf file of your choice.
  2. LM Studio: Drag and drop the downloaded file into your LM Studio model folder, or use the local import feature.
  3. Ollama: Create a Modelfile with the line FROM ./CQ-Gemma-4-E2B-bF16-Q4_K.gguf, then run ollama create cq-gemma-e2b -f Modelfile.
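Step 3 can be scripted; the heredoc below writes the Modelfile described above (the `PARAMETER` line is an illustrative default, not a tuned recommendation):

```shell
# Write a minimal Ollama Modelfile pointing at the downloaded GGUF.
cat > Modelfile <<'EOF'
FROM ./CQ-Gemma-4-E2B-bF16-Q4_K.gguf
PARAMETER num_ctx 8192
EOF

# Then register and chat (requires the GGUF to be present):
# ollama create cq-gemma-e2b -f Modelfile
# ollama run cq-gemma-e2b
```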

(Note: while the base Gemma 4 E2B natively supports audio and image inputs, standard GGUF text generation supports only text out of the box. Multimodal vision/audio capabilities in GGUF require an accompanying mmproj encoder file.)

🧠 About the Base Model: Gemma 4 E2B

Gemma 4 is a family of open models built by Google DeepMind. The E2B variant is the smallest and fastest in the family, purpose-built for efficient local execution on laptops, mobile devices, and edge hardware.

  • Total Parameters: 2.3B effective (5.1B total with Per-Layer Embeddings). PLE gives each decoder layer its own small per-token embedding table, which reduces the parameters that must be resident in fast memory during inference.
  • Context Length: 128K tokens.
  • Supported Modalities (Base Model): Text, Image, Audio.
  • System Prompts: Gemma 4 uses standard system, assistant, and user roles natively.

Thinking Mode Configuration

Gemma 4 models are designed as highly capable reasoners. To enable the built-in reasoning capabilities, include the <|think|> token at the start of your system prompt.

When thinking is enabled, the model will output its internal reasoning followed by the final answer using this structure:

<|channel>thought\n[Internal reasoning]<channel|>[Final answer]
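When post-processing raw output, the two channels can be split on these marker strings. A sketch (the token strings are taken from the template above; verify them against the actual tokenizer config before relying on them):

```python
# Split raw model output into (thought, answer) per the channel template.
CH_OPEN = "<|channel>thought\n"
CH_CLOSE = "<channel|>"

def split_channels(raw: str) -> tuple[str, str]:
    """Return (thought, answer); thought is empty if no channel markers."""
    if CH_OPEN in raw and CH_CLOSE in raw:
        _, _, rest = raw.partition(CH_OPEN)
        thought, _, answer = rest.partition(CH_CLOSE)
        return thought.strip(), answer.strip()
    return "", raw.strip()

raw = "<|channel>thought\nCheck the logs first.<channel|>No errors found."
print(split_channels(raw))  # -> ('Check the logs first.', 'No errors found.')
```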

📜 License & Acknowledgements

  • Base Model: Google DeepMind
  • License: Gemma 4 License
  • Quantization & Optimization: CQ Systems Team

For full benchmark details, evaluations, ethical considerations, and safety metrics, please refer to the original Google Gemma 4 Documentation.
