Check in here for tok/s and benchmarks for local gguf models
Discussion #11, opened by ykarout
Performance Benchmark: Qwen3-Coder-Next (GGUF Q4_K_M)
Model: lmstudio-community/Qwen3-Coder-Next-GGUF (Q4_K_M)
Backend: LM Studio 0.4.1 (CUDA 12)
Hardware Specifications
| Component | Details |
|---|---|
| GPU | NVIDIA GeForce RTX 5080 (16GB GDDR7) |
| Driver | NVIDIA 590 Linux Driver (Latest Branch) |
| CPU | Intel Core Ultra 9 285K |
| RAM | 64GB DDR5 @ 6800 MT/s |
| OS | Fedora 43 Workstation (latest kernel and updates) |
Inference Settings
- Context Length: 60,000 Tokens
- Layer Offloading: 35 MoE Layers to CPU (Rest on GPU)
- KV Cache: Offloaded to GPU (Q8_0 Precision)
- CPU Threads: 8 Cores
- Features: Flash Attention ON
- Max Concurrency: 10
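
For anyone who wants to approximate these settings outside LM Studio, here is a minimal sketch using llama-cpp-python (a llama.cpp binding). The parameter names, the Q8_0 enum value, and the model filename are assumptions based on the list above, not something exported from LM Studio's config:

```python
# Minimal sketch, assuming llama-cpp-python is installed (`pip install llama-cpp-python`).
from llama_cpp import Llama

GGML_TYPE_Q8_0 = 8  # assumed GGML enum value for Q8_0 KV-cache quantization

llm = Llama(
    model_path="Qwen3-Coder-Next-Q4_K_M.gguf",  # hypothetical local filename
    n_ctx=60_000,           # 60,000-token context length
    n_gpu_layers=-1,        # keep as many layers as possible on the GPU
    n_threads=8,            # 8 CPU threads
    flash_attn=True,        # Flash Attention ON
    type_k=GGML_TYPE_Q8_0,  # quantize KV-cache keys to Q8_0
    type_v=GGML_TYPE_Q8_0,  # quantize KV-cache values to Q8_0
)

# Note: the "35 MoE layers to CPU" split is an LM Studio / llama.cpp
# tensor-offload setting with no direct equivalent in this constructor,
# so it is omitted here.

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a Python function that parses a CSV file."}],
    max_tokens=512,
)
print(out["choices"][0]["message"]["content"])
```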
Results
Testing performed with medium-sized coding prompts.
- Single Request: 40 - 45 tok/s
- Concurrent (10 Requests):
  - Per Request: 9 - 10 tok/s
  - Total Throughput: ~70 tok/s
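
The numbers above come from LM Studio's own stats. If you want to reproduce the single vs. concurrent comparison yourself, here is a rough sketch that hits the local OpenAI-compatible endpoint and computes tok/s from the usage field and wall-clock time; the base URL, port, and model name are assumptions, so adjust them to your setup:

```python
# Rough tok/s measurement sketch against a local OpenAI-compatible server.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

BASE_URL = "http://localhost:1234/v1/chat/completions"  # assumed LM Studio default port
MODEL = "qwen3-coder-next"                               # assumed model identifier
PROMPT = "Write a Python function that merges two sorted lists."

def one_request() -> tuple[int, float]:
    """Send one request and return (completion_tokens, elapsed_seconds)."""
    start = time.perf_counter()
    r = requests.post(BASE_URL, json={
        "model": MODEL,
        "messages": [{"role": "user", "content": PROMPT}],
        "max_tokens": 512,
    }, timeout=600)
    r.raise_for_status()
    elapsed = time.perf_counter() - start
    tokens = r.json()["usage"]["completion_tokens"]
    return tokens, elapsed

def benchmark(concurrency: int) -> None:
    """Fire `concurrency` requests at once and report per-request and total tok/s."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(lambda _: one_request(), range(concurrency)))
    wall = time.perf_counter() - start
    total_tokens = sum(t for t, _ in results)
    per_request = [t / e for t, e in results]
    print(f"concurrency={concurrency}: "
          f"per-request {min(per_request):.1f}-{max(per_request):.1f} tok/s, "
          f"total {total_tokens / wall:.1f} tok/s")

if __name__ == "__main__":
    benchmark(1)   # single request
    benchmark(10)  # 10 concurrent requests
```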