Why does this double my PP and improve TG?

by gtkunit - opened Jan 4

Discussion

gtkunit

Jan 4

When I use the default suggested offload tensor or --n-cpu-moe options I get about 90 PP and 5 TG (everything else identical).
However, when I set --tensor-split 1,0,0, making sure with --override-tensor I have enough space in CUDA0 I get: 218 PP and 6.2 TG.
This gives me better results than using the graph split features. I suppose it is because all regular (non-exp) layers are offloaded in a straight line? Wisdom dispersion would be appreciated!
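For a sense of scale, here is a small sketch of the relative change between the two runs. The throughput figures are the ones quoted above (the default run is "about 90", so the PP ratio is approximate); the helper name `speedup` is only for illustration.

```python
def speedup(before: float, after: float) -> float:
    """Return how many times faster `after` is compared to `before`."""
    return after / before

# Throughput in tokens/s as reported in the post above.
default_run = {"pp": 90.0, "tg": 5.0}      # default offload / --n-cpu-moe
single_gpu_run = {"pp": 218.0, "tg": 6.2}  # --tensor-split 1,0,0 plus --override-tensor

pp_gain = speedup(default_run["pp"], single_gpu_run["pp"])
tg_gain = speedup(default_run["tg"], single_gpu_run["tg"])

print(f"PP: {pp_gain:.2f}x, TG: {tg_gain:.2f}x")  # prints "PP: 2.42x, TG: 1.24x"
```

So prompt processing improves by well over 2x, while token generation gains roughly a quarter.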

ubergarm

Owner Jan 4

Heya happy new years!

Hrmm, glad you are experimenting and finding the best command for your specific hardware. It sounds like you have three GPUs then and are using hybrid CPU + 3x GPUs?

> This gives me better results than using the graph split features.

Do you mean the -sm graph feature here?

> I suppose it is because all regular (non-exp) layers are offloaded in a straight line?

I'd have to know your rig specs better to try to speculate here, e.g. how much RAM, NUMA config, GPUs, PCIe speeds, whether you are compiling with NCCL available and P2P enabled correctly, etc.

It is pretty complex, the best way is to try a lot of things and benchmark with llama-sweep-bench and see what works best for your specific workload.

In general though, yes, you want all of the attn/shexp/first N dense layers fully offloaded into VRAM. Then only the routed expert layers stay in RAM for the CPU.
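A rough way to picture that rule of thumb is the sketch below. It is only an illustration: the tensor names are hypothetical examples in the usual `blk.N.*` GGUF style, and the `_exps` suffix for routed experts is assumed from the `--override-tensor exps=CPU` rule used later in this thread. Real placement is done by the loader, not by code like this.

```python
import re

# Routed-expert tensors (assumed to carry an "_exps" suffix); everything else
# (attention, shared experts, dense FFN layers) is meant to live in VRAM.
ROUTED_EXPERTS = re.compile(r"ffn_(gate|up|down)_exps")

def place(tensor_name: str) -> str:
    """Return "CPU" for routed experts and "GPU" for every other tensor."""
    return "CPU" if ROUTED_EXPERTS.search(tensor_name) else "GPU"

# Hypothetical tensor names, for illustration only.
example_tensors = [
    "blk.0.attn_q.weight",         # attention
    "blk.0.ffn_up.weight",         # dense FFN in one of the first N layers
    "blk.5.ffn_up_shexp.weight",   # shared expert
    "blk.5.ffn_up_exps.weight",    # routed experts
]

for name in example_tensors:
    print(f"{name:28s} -> {place(name)}")
```

Only the last name ends up on the CPU; the other three stay on the GPU side.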

In my own testing, keeping them "in a straight line" or keeping them on the same GPU as the attn etc doesn't make a huge difference unless your PCIe speeds are slow perhaps.

Also in general, you can see how important it is to dial in your rig as best as possible, given it can make a big difference!

Cheers!

gtkunit

Jan 4

edited Jan 4

Happy new years 😀 and thanks for the reply!
I have a consumer board with 3x3090 and 128GB DDR4.

PCIe setup is horrid: 3@4x 3@1x and 4@16x
I tried REBAR a long time ago and IIRC all boards supported it and it worked, but it didn't improve performance for me. I haven't looked into flashing the BIOS and enabling P2P as I was planning on installing nvlink but I haven't gotten around to it. I usually stick to ExLlama and performance is sufficient there, but I do have a passion for trying out new models (managed to run Deepseek R1 (was it?) at 0.5 TG when it first came out haha).

Yes, I meant -sm graph. I just saw it in your README here 💝 .

I haven't tried nccl yet I think. I suppose my build options are a bit dated by now:
-DGGML_CUDA=ON -DGGML_SCHED_MAX_COPIES=1 -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA_IQK_FORCE_BF16=1 -DGGML_CUDA_F16=ON -DGGML_BLAS=OFF -DCMAKE_CUDA_ARCHITECTURES="native" -DGGML_NATIVE=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DLLAMA_SERVER_SQLITE3=ON

Oh and yes I usually run many dozens of sweep benches. It's very meditative. 😅

gghfez

Jan 5

The 3x1 card would have been your bottleneck for prompt processing. Is your ts 1,0,0 putting everything on the 3x4 or the 4x16? If the former, try adjusting it to 0,0,1 and you might see improved pp
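To get a feel for how far apart those links are, here is a back-of-the-envelope sketch. It assumes the shorthand means PCIe generation and lane count (so "3x1" is PCIe 3.0 x1 and "4x16" is PCIe 4.0 x16) and uses theoretical per-direction link rates; real transfer speeds will be lower.

```python
# Raw signalling rate per lane in GT/s for each PCIe generation.
GT_PER_S = {3: 8.0, 4: 16.0}
# PCIe 3.0 and 4.0 both use 128b/130b line encoding.
ENCODING_EFFICIENCY = 128 / 130

def link_gb_per_s(gen: int, lanes: int) -> float:
    """Theoretical one-direction bandwidth of a PCIe link in GB/s."""
    return GT_PER_S[gen] * ENCODING_EFFICIENCY / 8 * lanes

# The three slots from the thread, read as (generation, lanes): an assumption.
slots = {"3x4": (3, 4), "3x1": (3, 1), "4x16": (4, 16)}

for label, (gen, lanes) in slots.items():
    print(f"{label:>5}: {link_gb_per_s(gen, lanes):6.2f} GB/s")

ratio = link_gb_per_s(4, 16) / link_gb_per_s(3, 1)
print(f"4x16 vs 3x1: about {ratio:.0f}x")
```

Under these assumptions the x16 slot has roughly 32 times the theoretical bandwidth of the x1 slot, which is consistent with the x1 card being the weak point during prompt processing.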

gtkunit

Jan 5

edited Jan 5

It's putting it on the 4x16. I've tried it on the 3x1 and PP went down to the 80~90 range.
I've also run some tests with nccl but haven't seen improvements.

I'm happy with the performance as it is; just always tweaking and benchmarking. It'd make me happy if I could help others with the testing here though, I can't be the only one with such a jank setup.

If I run -sm graph with custom -ts and -ot I get: /mnt/xyz/src/ik_llama.cpp/ggml/src/ggml.c:6084: GGML_ASSERT(nhave > 1) failed, but it works fine with --n-cpu-moe. I doubt it's worth looking into, as performance gains with my hardware are likely small (even with ExLlama, TP gives me only a ±10% increase in TG IIRC).

One thing that did help (with highly deterministic tasks) with GLM 4.6 was loading a/the GLM 4.5 draft finetune, but with the latest main of ik_llama it seems like the draft model isn't even loading anymore.

Here's how I run it now:

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
CUDA_DEVICE_ORDER=PCI_BUS_ID \
CUDA_VISIBLE_DEVICES=2,0,1 \
./build_bf16/bin/llama-sweep-bench \
  --model /mnt/uvw/models/glm-4.7-ubergarm/GLM-4.7-IQ3_KS-00001-of-00005.gguf \
  -ctk q8_0 \
  -ctv q5_1 \
  -c 32768 \
  --batch-size 4096 \
  --ubatch-size 4096 \
  -ngl 999 \
  --tensor-split 1,0,0 \
  -ot "blk.($(seq -s '|' 0 14)).ffn.=CUDA1" \
  -ot "blk.($(seq -s '|' 15 26)).ffn.=CUDA2" \
  --override-tensor exps=CPU \
  --threads 6 \
  --threads-batch 12 \
  --temp 1.0 \
  --top-p 0.95 \
  --top-k 40 \
  --min-p 0 \
  --warmup-batch \
  --no-mmap
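
For anyone puzzling over the two `seq` substitutions, the sketch below rebuilds the same patterns in Python and shows which device each rule would pick. Two things are assumptions here: the tensor names are hypothetical examples, and the rules are evaluated in command-line order with the first match winning, which is how I understand the override list to behave. Tensors that match no rule fall through to the normal -ngl / --tensor-split placement.

```python
import re

def block_pattern(first: int, last: int) -> str:
    """Mimic "blk.($(seq -s '|' first last)).ffn." from the shell command."""
    indices = "|".join(str(i) for i in range(first, last + 1))
    return f"blk.({indices}).ffn."

# Same three rules, in the same order as on the command line.
OVERRIDES = [
    (re.compile(block_pattern(0, 14)), "CUDA1"),
    (re.compile(block_pattern(15, 26)), "CUDA2"),
    (re.compile("exps"), "CPU"),
]

def override_for(tensor_name: str):
    """Return the device of the first matching rule, or None if no rule matches."""
    for pattern, device in OVERRIDES:
        if pattern.search(tensor_name):
            return device
    return None

# Hypothetical tensor names, for illustration only.
for name in [
    "blk.3.ffn_up_exps.weight",
    "blk.20.ffn_down.weight",
    "blk.40.ffn_up_exps.weight",
    "blk.40.attn_q.weight",
]:
    print(f"{name:28s} -> {override_for(name)}")
```

Under these assumptions, the FFN tensors of blocks 0 to 14 (experts included) land on CUDA1, those of blocks 15 to 26 on CUDA2, the remaining routed experts on the CPU, and everything else goes to the first device because of --tensor-split 1,0,0.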
