Why does this double my PP and improve TG?
When I use the default suggested offload tensor or --n-cpu-moe options I get about 90 PP and 5 TG (everything else identical).
However, when I set --tensor-split 1,0,0 and use --override-tensor to make sure I have enough space on CUDA0, I get 218 PP and 6.2 TG.
This gives me better results than using the graph split features. I suppose it is because all regular (non-exp) layers are offloaded in a straight line? Wisdom dispersion would be appreciated!
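For readers unfamiliar with the flag: --tensor-split takes a comma-separated list of proportions, one per visible GPU, so 1,0,0 asks for everything that is not otherwise overridden to go to the first device. A minimal sketch of that normalization follows; `split_fractions` is a hypothetical helper written for illustration, not part of llama.cpp or ik_llama.cpp.

```bash
# Sketch: turn a --tensor-split style list into per-device fractions.
# split_fractions is a hypothetical helper, not a llama.cpp command.
split_fractions() {
    printf '%s\n' "$1" | awk -F',' '{
        total = 0
        for (i = 1; i <= NF; i++) total += $i      # sum of all proportions
        for (i = 1; i <= NF; i++)                   # each share of the total
            printf "CUDA%d=%.2f%s", i - 1, $i / total, (i < NF ? " " : "\n")
    }'
}

split_fractions 1,0,0   # CUDA0=1.00 CUDA1=0.00 CUDA2=0.00
split_fractions 1,1,2   # CUDA0=0.25 CUDA1=0.25 CUDA2=0.50
```

With 1,0,0 the other two GPUs only receive whatever tensors are explicitly sent to them with --override-tensor.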
Heya happy new years!
Hrmm, glad you are experimenting and finding the best command for your specific hardware. It sounds like you have three GPUs then and are using hybrid CPU + 3x GPUs?
> This gives me better results than using the graph split features.
Do you mean the -sm graph feature here?
> I suppose it is because all regular (non-exp) layers are offloaded in a straight line?
I'd have to know your rig specs better to speculate here, e.g. how much RAM, NUMA config, GPUs, PCIe speeds, whether you are compiling with NCCL available and P2P enabled correctly, etc.
It is pretty complex, the best way is to try a lot of things and benchmark with llama-sweep-bench and see what works best for your specific workload.
In general though, yes, you want all of the attn/shexp/first N dense layers fully offloaded into VRAM, leaving only the routed expert layers in RAM for the CPU.
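A small sketch of why an override such as exps=CPU achieves that split: the pattern is matched against tensor names, and only routed-expert tensors contain the substring "exps". The tensor names below are illustrative examples of the usual GGUF MoE naming, and `placement` is a hypothetical helper, so treat this as an approximation of the matching rather than the loader's actual code.

```bash
# Hypothetical helper: report where a tensor would land under "exps=CPU".
placement() {
    if printf '%s\n' "$1" | grep -Eq 'exps'; then
        echo "CPU"      # routed experts stay in system RAM
    else
        echo "GPU"      # attention, shared experts, dense FFN go to VRAM
    fi
}

# Illustrative tensor names (assumed, following common GGUF MoE naming).
for name in \
    blk.0.ffn_gate.weight \
    blk.5.attn_q.weight \
    blk.5.ffn_gate_shexp.weight \
    blk.5.ffn_gate_exps.weight
do
    printf '%-30s %s\n' "$name" "$(placement "$name")"
done
```

Note that "shexp" does not contain "exps", so shared-expert tensors are left on the GPU.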
In my own testing, keeping them "in a straight line" or keeping them on the same GPU as the attn etc doesn't make a huge difference unless your PCIe speeds are slow perhaps.
Also, in general, you can see how important it is to dial in your rig as best as possible, given it can make a big difference!
Cheers!
Happy new years and thanks for the reply!
I have a consumer board with 3x3090 and 128GB DDR4.
PCIe setup is horrid: 3@4x, 3@1x, and 4@16x (PCIe generation @ lane count).
I tried REBAR a long time ago and IIRC all boards supported it and it worked, but it didn't improve performance for me. I haven't looked into flashing the BIOS and enabling P2P as I was planning on installing nvlink but I haven't gotten around to it. I usually stick to ExLlama and performance is sufficient there, but I do have a passion for trying out new models (managed to run Deepseek R1 (was it?) at 0.5 TG when it first came out haha).
Yes, I meant -sm graph. I just saw it in your README here.
I haven't tried NCCL yet, I think. I suppose my build options are a bit dated by now: -DGGML_CUDA=ON -DGGML_SCHED_MAX_COPIES=1 -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA_IQK_FORCE_BF16=1 -DGGML_CUDA_F16=ON -DGGML_BLAS=OFF -DCMAKE_CUDA_ARCHITECTURES="native" -DGGML_NATIVE=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DLLAMA_SERVER_SQLITE3=ON
Oh, and yes, I usually run many dozens of sweep benches. It's very meditative.
The 3x1 card would have been your bottleneck for prompt processing. Is your -ts 1,0,0 putting everything on the 3x4 or the 4x16? If the former, try adjusting it to 0,0,1 and you might see improved PP.
It's putting it on the 4x16. I've tried it on the 3x1 and PP went down to the 80~90 range.
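Which physical card "CUDA0" refers to depends on device ordering. The command later in this post sets CUDA_DEVICE_ORDER=PCI_BUS_ID and CUDA_VISIBLE_DEVICES=2,0,1, which renumbers the GPUs inside the process. A rough sketch of that renumbering, where `show_remap` is a hypothetical helper for illustration only:

```bash
# Hypothetical helper: list how a CUDA_VISIBLE_DEVICES value renumbers GPUs.
# The first entry becomes CUDA0 inside the process, the second CUDA1, etc.
show_remap() {
    i=0
    for phys in $(printf '%s' "$1" | tr ',' ' '); do
        echo "CUDA$i -> physical GPU $phys"
        i=$((i + 1))
    done
}

show_remap 2,0,1
# CUDA0 -> physical GPU 2
# CUDA1 -> physical GPU 0
# CUDA2 -> physical GPU 1
```

So with this setting, --tensor-split 1,0,0 targets physical GPU 2 (in PCI bus order), which is presumably the card in the x16 slot on this rig.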
I've also run some tests with nccl but haven't seen improvements.
I'm happy with the performance as it is; just always tweaking and benchmarking. It'd make me happy if I could help others with the testing here though, I can't be the only one with such a jank setup.
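For anyone wanting to repeat this kind of comparison, here is a dry-run sketch that only prints one llama-sweep-bench command per --tensor-split candidate, so the list can be reviewed before anything is launched. The BENCH and MODEL paths are placeholders, `sweep_commands` is a hypothetical helper, and the flag set is a reduced version of the one used in this thread; a real run would likely need extra -ot lines so everything fits in VRAM.

```bash
# Dry run: print one llama-sweep-bench command per tensor-split candidate.
# BENCH and MODEL are placeholder paths (assumptions); adjust for your setup.
BENCH=./build/bin/llama-sweep-bench
MODEL=/path/to/model.gguf

sweep_commands() {
    for ts in 1,0,0 0,1,0 0,0,1; do
        # Same flags for every candidate; only --tensor-split changes.
        echo "$BENCH --model $MODEL -c 32768 -ngl 999 --tensor-split $ts --override-tensor exps=CPU"
    done
}

sweep_commands
```

Changing only one option per run makes it easier to attribute a PP or TG difference to that option.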
If I run -sm graph with custom -ts and -ot, I get "/mnt/xyz/src/ik_llama.cpp/ggml/src/ggml.c:6084: GGML_ASSERT(nhave > 1) failed", but it works fine with --n-cpu-moe. I doubt it's worth looking into, as performance gains with my hardware are likely small (even with ExLlama, TP gives me only a ±10% increase in TG IIRC).
One thing that did help (with highly deterministic tasks) with GLM 4.6 was loading a/the GLM 4.5 draft finetune, but with the latest main of ik_llama it seems like the draft model isn't even loading anymore.
Here's how I run it now:
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
CUDA_DEVICE_ORDER=PCI_BUS_ID \
CUDA_VISIBLE_DEVICES=2,0,1 \
./build_bf16/bin/llama-sweep-bench \
    --model /mnt/uvw/models/glm-4.7-ubergarm/GLM-4.7-IQ3_KS-00001-of-00005.gguf \
    -ctk q8_0 \
    -ctv q5_1 \
    -c 32768 \
    --batch-size 4096 \
    --ubatch-size 4096 \
    -ngl 999 \
    --tensor-split 1,0,0 \
    -ot "blk.($(seq -s '|' 0 14)).ffn.=CUDA1" \
    -ot "blk.($(seq -s '|' 15 26)).ffn.=CUDA2" \
    --override-tensor exps=CPU \
    --threads 6 \
    --threads-batch 12 \
    --temp 1.0 \
    --top-p 0.95 \
    --top-k 40 \
    --min-p 0 \
    --warmup-batch \
    --no-mmap
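The -ot patterns in the command above are built with shell command substitution, which can be hard to read at a glance. The sketch below shows what the substitution expands to and approximates how the three overrides would route a tensor. It assumes GNU seq (other seq implementations may add a trailing separator) and assumes the first matching override wins, which is my understanding of the loader but is not confirmed in this thread. `route` and the tensor names are hypothetical, for illustration only.

```bash
# What the command substitution produces (GNU seq assumed).
pattern="blk.($(seq -s '|' 0 3)).ffn."
echo "$pattern"    # blk.(0|1|2|3).ffn.

# Hypothetical helper: check the overrides in the order given on the
# command line and stop at the first match (assumed loader behaviour).
route() {
    if printf '%s\n' "$1" | grep -Eq "blk.($(seq -s '|' 0 14)).ffn."; then
        echo "CUDA1"
    elif printf '%s\n' "$1" | grep -Eq "blk.($(seq -s '|' 15 26)).ffn."; then
        echo "CUDA2"
    elif printf '%s\n' "$1" | grep -Eq 'exps'; then
        echo "CPU"
    else
        echo "default (per --tensor-split)"
    fi
}

route blk.3.ffn_gate_exps.weight    # CUDA1
route blk.20.ffn_up_exps.weight     # CUDA2
route blk.30.ffn_gate_exps.weight   # CPU
route blk.40.attn_q.weight          # default (per --tensor-split)
```

Under these assumptions, the FFN tensors of layers 0 to 26 (experts included) sit on the second and third GPUs, the remaining routed experts stay on the CPU, and everything else follows --tensor-split 1,0,0 onto CUDA0.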