Check in here for tok/s and benchmarks for local gguf models

by ykarout

🚀 Performance Benchmark: Qwen3-Coder-Next (GGUF Q4_K_M)

Model: lmstudio-community/Qwen3-Coder-Next-GGUF (Q4_K_M)

Backend: LM Studio 0.4.1 (CUDA 12)

💻 Hardware Specifications

| Component | Details |
| --- | --- |
| GPU | NVIDIA GeForce RTX 5080 (16 GB GDDR7) |
| Driver | NVIDIA 590 Linux driver (latest branch) |
| CPU | Intel Core Ultra 9 285K |
| RAM | 64 GB DDR5 @ 6800 MT/s |
| OS | Fedora 43 Workstation, latest kernel and updates |

βš™οΈ Inference Settings

  • Context Length: 60,000 tokens
  • Layer Offloading: 35 MoE layers to CPU (rest on GPU)
  • KV Cache: offloaded to GPU (Q8_0 precision)
  • CPU Threads: 8
  • Features: Flash Attention ON
  • Max Concurrency: 10
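
For anyone trying to approximate this setup outside LM Studio, here's a rough llama-cpp-python sketch of the same knobs. The model path is a placeholder, and the per-MoE-layer CPU offload has no direct single-parameter equivalent in this API (recent llama.cpp server builds expose an `--n-cpu-moe` flag for exactly this), so the GPU/CPU split below is only approximated with `n_gpu_layers`:

```python
# Rough llama-cpp-python equivalent of the LM Studio settings above.
# The model path is a placeholder; adjust n_gpu_layers for your VRAM budget.
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen3-Coder-Next-Q4_K_M.gguf",  # hypothetical local path
    n_ctx=60_000,     # 60k context window
    n_gpu_layers=-1,  # offload all layers that fit; LM Studio's
                      # "35 MoE layers to CPU" split has no direct
                      # single-knob equivalent in this API
    n_threads=8,      # 8 CPU threads
    flash_attn=True,  # Flash Attention ON
    type_k=8,         # 8 == GGML_TYPE_Q8_0 -> Q8_0 KV cache (keys)
    type_v=8,         # Q8_0 KV cache (values)
)

out = llm("Write a Python function that merges two sorted lists.", max_tokens=256)
print(out["choices"][0]["text"])
```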

📊 Results

Testing performed with medium-sized coding prompts.

  • Single Request: 40–45 tok/s
  • Concurrent (10 Requests):
      • Per Request: 9–10 tok/s
      • Total Throughput: ~70 tok/s
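
If you want to sanity-check numbers like these on your own machine, here's a minimal sketch against LM Studio's OpenAI-compatible local server (default `http://localhost:1234/v1`). The model id and prompt are placeholders, and counting streamed chunks only approximates the token count (usually ~1 token per chunk):

```python
# Rough tok/s measurement against LM Studio's OpenAI-compatible server.
import time
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
MODEL = "qwen3-coder-next"  # placeholder id -- use whatever your server lists
PROMPT = "Write a Python function that merges two sorted lists."

def timed_request(_: int) -> tuple[int, float]:
    """Stream one completion; return (approx. token count, seconds elapsed)."""
    start = time.perf_counter()
    tokens = 0
    stream = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": PROMPT}],
        max_tokens=512,
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            tokens += 1  # each streamed chunk is roughly one token
    return tokens, time.perf_counter() - start

# Single request
tokens, secs = timed_request(0)
print(f"single: {tokens / secs:.1f} tok/s")

# 10 concurrent requests: per-request speed and aggregate throughput
wall_start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(timed_request, range(10)))
wall = time.perf_counter() - wall_start

per_req = [t / s for t, s in results]
total_tokens = sum(t for t, _ in results)
print(f"concurrent per-request: {min(per_req):.1f}-{max(per_req):.1f} tok/s")
print(f"aggregate: {total_tokens / wall:.1f} tok/s")
```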
