Check in here for tok/s and benchmarks for local gguf models
Discussion #11, opened by ykarout
Performance Benchmark: Qwen3-Coder-Next (GGUF Q4_K_M)
Model: lmstudio-community/Qwen3-Coder-Next-GGUF (Q4_K_M)
Backend: LM Studio 0.4.1 (CUDA 12)
Hardware Specifications
| Component | Details |
|---|---|
| GPU | NVIDIA GeForce RTX 5080 (16GB GDDR7) |
| Driver | NVIDIA 590 Linux Driver (Latest Branch) |
| CPU | Intel Core Ultra 9 285K |
| RAM | 64GB DDR5 @ 6800 MT/s |
| OS | Fedora 43 Workstation (latest kernel and updates) |
Inference Settings
- Context Length: 60,000 Tokens
- Layer Offloading: 35 MoE Layers to CPU (Rest on GPU)
- KV Cache: Offloaded to GPU (Q8_0 Precision)
- CPU Threads: 8 Cores
- Features: Flash Attention ON
- Max Concurrency: 10
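
For anyone who wants to approximate these settings outside LM Studio, here is a minimal sketch using llama-cpp-python (a llama.cpp binding). The parameter names, the Q8_0 enum value, and the model filename are assumptions based on the list above, not something exported from LM Studio's config:

```python
# Minimal sketch, assuming llama-cpp-python is installed (`pip install llama-cpp-python`).
from llama_cpp import Llama

GGML_TYPE_Q8_0 = 8  # assumed GGML enum value for Q8_0 KV-cache quantization

llm = Llama(
    model_path="Qwen3-Coder-Next-Q4_K_M.gguf",  # hypothetical local filename
    n_ctx=60_000,           # 60,000-token context length
    n_gpu_layers=-1,        # keep as many layers as possible on the GPU
    n_threads=8,            # 8 CPU threads
    flash_attn=True,        # Flash Attention ON
    type_k=GGML_TYPE_Q8_0,  # quantize KV-cache keys to Q8_0
    type_v=GGML_TYPE_Q8_0,  # quantize KV-cache values to Q8_0
)

# Note: the "35 MoE layers to CPU" split is an LM Studio / llama.cpp
# tensor-offload setting with no direct equivalent in this constructor,
# so it is omitted here.

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a Python function that parses a CSV file."}],
    max_tokens=512,
)
print(out["choices"][0]["message"]["content"])
```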
Results
Testing performed with medium-sized coding prompts.
- Single Request: 40 - 45 tok/s
- Concurrent (10 Requests):
  - Per Request: 9 - 10 tok/s
  - Total Throughput: ~70 tok/s
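
The numbers above come from LM Studio's own stats. If you want to reproduce the single vs. concurrent comparison yourself, here is a rough sketch that hits the local OpenAI-compatible endpoint and computes tok/s from the usage field and wall-clock time; the base URL, port, and model name are assumptions, so adjust them to your setup:

```python
# Rough tok/s measurement sketch against a local OpenAI-compatible server.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

BASE_URL = "http://localhost:1234/v1/chat/completions"  # assumed LM Studio default port
MODEL = "qwen3-coder-next"                               # assumed model identifier
PROMPT = "Write a Python function that merges two sorted lists."

def one_request() -> tuple[int, float]:
    """Send one request and return (completion_tokens, elapsed_seconds)."""
    start = time.perf_counter()
    r = requests.post(BASE_URL, json={
        "model": MODEL,
        "messages": [{"role": "user", "content": PROMPT}],
        "max_tokens": 512,
    }, timeout=600)
    r.raise_for_status()
    elapsed = time.perf_counter() - start
    tokens = r.json()["usage"]["completion_tokens"]
    return tokens, elapsed

def benchmark(concurrency: int) -> None:
    """Fire `concurrency` requests at once and report per-request and total tok/s."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(lambda _: one_request(), range(concurrency)))
    wall = time.perf_counter() - start
    total_tokens = sum(t for t, _ in results)
    per_request = [t / e for t, e in results]
    print(f"concurrency={concurrency}: "
          f"per-request {min(per_request):.1f}-{max(per_request):.1f} tok/s, "
          f"total {total_tokens / wall:.1f} tok/s")

if __name__ == "__main__":
    benchmark(1)   # single request
    benchmark(10)  # 10 concurrent requests
```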