Qwen3-Next-80B on NVIDIA DGX Spark (GB10 Grace Blackwell GPU): A Complete Guide

πŸ“‹ Overview

This document is a complete guide to running the Qwen3-Next-80B model on the NVIDIA GB10 GPU. It covers switching to Ollama to work around vLLM's compatibility problems, plus optimization settings that make the most of the 60GB of VRAM.

Written: 2025-12-20
Test environment: GB10 GPU (60GB VRAM), CUDA 12.1, PyTorch 2.9.x
Model: Qwen/Qwen3-Next-80B-A3B-Instruct-GGUF (Q5_K_M quantization)


🎯 Key Results

βœ… Worked around the vLLM compatibility problem: bypassed vLLM's EngineCore exiting error on GB10 by switching to Ollama
βœ… Ran Qwen3-Next-80B Q5_K_M successfully: using the GGUF quantized model
βœ… Performance tuning: 8 concurrent requests and a 32K-token context window
βœ… Unified startup script: one-click start from Ollama through the NIKA services


⚠️ Problems Encountered with vLLM

Symptoms

  • EngineCore exiting error: the engine terminated immediately after the model loaded
  • RuntimeError: Engine core initialization failed
  • Not resolved by any of the environment variables and flags we tried

μ‹œλ„ν•œ ν•΄κ²°μ±… (λͺ¨λ‘ μ‹€νŒ¨)

  1. ν™˜κ²½λ³€μˆ˜ μ‘°μ •

    • VLLM_DISABLE_CUSTOM_ALL_REDUCE=1
    • VLLM_WORKER_MULTIPROC_METHOD=spawn
    • VLLM_ENGINE_INIT_TIMEOUT=1200
    • VLLM_IPC_RETRY_COUNT=20
    • 기타 GB10 ν˜Έν™˜μ„± κ΄€λ ¨ ν™˜κ²½λ³€μˆ˜ λ‹€μˆ˜
  2. vLLM 버전 λ³€κ²½

    • vllm/vllm-openai:nightly (μ΅œμ‹  개발 버전)
    • vllm/vllm-openai:latest (μ•ˆμ • 버전)
    • μ†ŒμŠ€ λΉŒλ“œ μ‹œλ„
  3. κ²°λ‘ 

    • GB10 μ•„ν‚€ν…μ²˜μ™€ vLLM의 IPC 톡신 방식 κ°„ 근본적인 ν˜Έν™˜μ„± 문제둜 νŒλ‹¨
    • vLLM μ‹€ν–‰ μ‹œ μ΄ˆκΈ°ν™” λ‹¨κ³„μ—μ„œ 멈좀(Hang) ν˜„μƒ λ°œμƒ
    • λ‘œκ·Έμƒμ—μ„œ GPU κ°„ 톡신(NCCL)이 λ§Ίμ–΄μ§€μ§€ μ•Šκ³  νƒ€μž„μ•„μ›ƒ(Timeout) λ°œμƒ
    • Ray(λΆ„μ‚° 처리 ν”„λ ˆμž„μ›Œν¬)κ°€ μ›Œμ»€ ν”„λ‘œμ„ΈμŠ€μ™€ ν†΅μ‹ ν•˜μ§€ λͺ»ν•΄ μ£½λŠ” ν˜„μƒ
    • GB10의 특수 μ•„ν‚€ν…μ²˜μΈ **NVLink-C2C (Chip-to-Chip)**와 vLLM이 μ‚¬μš©ν•˜λŠ” 톡신 라이브러리(NCCL) κ°„μ˜ ν˜Έν™˜μ„± 문제
    • κΈ°λ³Έ P2P(Peer-to-Peer) 톡신이 ARM64 ν™˜κ²½ λ“œλΌμ΄λ²„ 좩돌둜 인해 μ œλŒ€λ‘œ μˆ˜ν–‰λ˜μ§€ μ•ŠμŒ
    • "κ°•μ œ TCP μ „ν™˜" (The TCP Workaround)
    • vLLM on GB10 (ARM64): 아직 μ†Œν”„νŠΈμ›¨μ–΄ μƒνƒœκ³„(PyTorch, Triton, NCCL)κ°€ μ™„λ²½ν•˜κ²Œ GB10의 ν•˜λ“œμ›¨μ–΄ νŠΉμ„±μ„ λ°›μ•„μ£Όμ§€ λͺ»ν•¨. (특히 직접 λΉŒλ“œν•΄μ•Ό ν•˜λŠ” κ²½μš°κ°€ λ§Žμ•„ μ˜μ‘΄μ„± μ§€μ˜₯ λ°œμƒ)
    • Ollama둜 μ „ν™˜ κ²°μ •
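
For reference, a forced-TCP fallback is normally attempted through NCCL's standard environment variables. The sketch below shows that class of workaround (the exact flags tried during debugging may have differed); on GB10 it did not resolve the hang:

# Disable NCCL's fast transports so it falls back to plain TCP sockets
export NCCL_P2P_DISABLE=1   # no peer-to-peer (NVLink/PCIe) transfers
export NCCL_SHM_DISABLE=1   # no shared-memory transport
export NCCL_IB_DISABLE=1    # no InfiniBand
export NCCL_DEBUG=INFO      # verbose logs showing which transport is chosen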

πŸš€ Switching to Ollama

1. Install Ollama and download the model

# Install Ollama (skip if already installed)
# Ubuntu/Debian
curl -fsSL https://ollama.com/install.sh | sh

# Download the Qwen3-Next-80B Q5_K_M model (from HuggingFace)
ollama pull hf.co/Qwen/Qwen3-Next-80B-A3B-Instruct-GGUF:Q5_K_M
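
Once the pull finishes, the model should appear in the local model list:

ollama list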

2. λͺ¨λΈ 이름 λ³€κ²½ (선택사항)

# λ‹€μš΄λ‘œλ“œν•œ λͺ¨λΈμ„ 더 짧은 μ΄λ¦„μœΌλ‘œ 등둝
ollama create qwen3-next-80b-q5km -f /path/to/Modelfile
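
For this rename step the Modelfile can be minimal; a sketch, assuming it only needs a FROM line pointing at the tag pulled above:

# Minimal alias Modelfile: just point FROM at the downloaded model
FROM hf.co/Qwen/Qwen3-Next-80B-A3B-Instruct-GGUF:Q5_K_M

Step 3 below then rebuilds this short name with a corrected chat template.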

3. Create a Modelfile (fixing the template problem)

The freshly downloaded model may return empty responses. This is a template problem; use the following Modelfile:

FROM qwen3-next-80b-q5km

PARAMETER num_thread 8
PARAMETER num_ctx 32768

PARAMETER stop "<|im_end|>"
PARAMETER stop "<|endoftext|>"
PARAMETER stop "<|eot_id|>"

TEMPLATE """{{- $lastUserIdx := -1 -}}
{{- range $idx, $msg := .Messages -}}
{{- if eq $msg.Role "user" }}{{ $lastUserIdx = $idx }}{{ end -}}
{{- end }}
{{- if or .System .Tools }}<|im_start|>system
{{ if .System }}{{ .System }}

{{ end }}
{{- if .Tools }}# Tools

You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:
<tools>
{{- range .Tools }}
{"type": "function", "function": {{ .Function }}}
{{- end }}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>
{{- end -}}
<|im_end|>
{{ end }}
{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1 -}}
{{- if eq .Role "user" }}<|im_start|>user
{{ .Content }}<|im_end|>
{{ else if eq .Role "assistant" }}<|im_start|>assistant
{{ if (and $.IsThinkSet (and .Thinking (or $last (gt $i $lastUserIdx)))) -}}
<think>{{ .Thinking }}</think>
{{ end -}}
{{ if .Content }}{{ .Content }}{{ end }}
{{- if .ToolCalls }}
{{- range .ToolCalls }}
<tool_call>
{"name": "{{ .Function.Name }}", "arguments": {{ .Function.Arguments }}}
</tool_call>
{{- end }}
{{- end }}{{ if not $last }}<|im_end|>
{{ end }}
{{- else if eq .Role "tool" }}<|im_start|>user
<tool_response>
{{ .Content }}
</tool_response><|im_end|>
{{ end }}
{{- if and (ne .Role "assistant") $last }}<|im_start|>assistant
<think>
{{ end }}
{{- end }}"""

SYSTEM """You are a helpful AI assistant."""

⚑ μ„±λŠ₯ μ΅œμ ν™” (60GB VRAM ν™œμš©)

핡심 μ΅œμ ν™” μ„€μ •

60GB VRAM이 μΆ©λΆ„νžˆ λ‚¨μ•„μžˆμ„ λ•Œ, λ‹€μŒ μ„€μ •μœΌλ‘œ μ„±λŠ₯을 κ·ΉλŒ€ν™”ν•  수 μžˆμŠ΅λ‹ˆλ‹€:

1. λ™μ‹œ μ²˜λ¦¬λŸ‰ 증가 (OLLAMA_NUM_PARALLEL)

κ°€μž₯ μ€‘μš”: RAG μ‹œμŠ€ν…œμ€ 보톡 [검색 β†’ μš”μ•½ β†’ λ‹΅λ³€] 과정을 거치며 μ—¬λŸ¬ 번 λͺ¨λΈμ„ ν˜ΈμΆœν•©λ‹ˆλ‹€. κΈ°λ³Έκ°’(1)이라면 μš”μ²­μ΄ 2개만 λ™μ‹œμ— 듀어와도 ν•œ λͺ…은 쀄 μ„œμ„œ κΈ°λ‹€λ €μ•Ό ν•©λ‹ˆλ‹€.

export OLLAMA_NUM_PARALLEL=8  # 60GBλ©΄ λŒ€λž΅ 8~10λͺ… λ™μ‹œ 처리 κ°€λŠ₯

효과: 질문 4κ°œκ°€ λ™μ‹œμ— 듀어와도 4개 λ‹€ μ¦‰μ‹œ λ‹΅λ³€ μ‹œμž‘ (체감 속도 4λ°° ν–₯상)
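
A quick way to verify this (a sketch; assumes the server is running locally with the model name from step 2): fire several requests in parallel and time them. With OLLAMA_NUM_PARALLEL in effect, total wall time should be close to that of a single request rather than the sum of four.

# Send 4 chat requests in parallel, then time how long all of them take
for i in 1 2 3 4; do
  curl -s http://localhost:11434/api/chat \
    -d '{"model": "qwen3-next-80b-q5km", "stream": false,
         "messages": [{"role": "user", "content": "Say hi in one word."}]}' \
    > /dev/null &
done
time wait  # bash builtin: waits for all background jobs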

2. Expand the context window (OLLAMA_CONTEXT_LENGTH)

The default (4,096 tokens) is roughly 5-6 A4 pages of text. Feed in a long RAG document and it gets truncated.

export OLLAMA_CONTEXT_LENGTH=32768  # 32K tokens (8x the default)

Effect: dozens of pages of a PDF manual fit into a single prompt and are answered in one pass without OOM (no need to split the question into several rounds β†’ shorter total time)

3. KV Cache μ–‘μžν™” (선택사항)

60GBλ‚˜ 남기 λ•Œλ¬Έμ— ꡳ이 μ•ˆ 해도 λ˜μ§€λ§Œ, λ§Œμ•½ num_ctxλ₯Ό 128k(μ±… ν•œ ꢌ)κΉŒμ§€ 늘리고 μ‹Άλ‹€λ©΄:

export OLLAMA_KV_CACHE_TYPE=q8_0  # KV Cache μ–‘μžν™”
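
The arithmetic behind this: q8_0 stores each cached key/value at 8 bits instead of the 16-bit f16 default, so KV cache memory is roughly halved, which at a fixed memory budget means roughly twice the affordable context length.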

전체 μ΅œμ ν™” 슀크립트

#!/bin/bash
# Ollama μ„±λŠ₯ μ΅œμ ν™” ν™˜κ²½λ³€μˆ˜ μ„€μ •
# 60GB VRAM을 ν™œμš©ν•œ λ™μ‹œ μ²˜λ¦¬λŸ‰ 및 λ¬Έλ§₯ 크기 μ΅œμ ν™”

export OLLAMA_NUM_PARALLEL=8  # λ™μ‹œ 처리 μš”μ²­ 수 (60GB VRAM κΈ°μ€€ 8~10λͺ… κ°€λŠ₯)
export OLLAMA_MAX_LOADED_MODELS=1
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_NUM_CTX=32768  # Context Window ν™•μž₯ (κΈ°λ³Έ 4096 β†’ 32768)
export OLLAMA_KV_CACHE_TYPE=q8_0  # KV Cache μ–‘μžν™” (선택사항, 128kκΉŒμ§€ ν™•μž₯ μ‹œ 유용)

Start the Ollama server (with the optimizations applied)

# Load the optimization environment variables
source setup_ollama_optimization.sh

# Start the Ollama server
nohup env OLLAMA_NUM_PARALLEL=8 \
         OLLAMA_CONTEXT_LENGTH=32768 \
         OLLAMA_KV_CACHE_TYPE=q8_0 \
         ollama serve > /tmp/ollama_server.log 2>&1 &
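
To confirm the server actually came up with these settings (a sketch; /api/tags is Ollama's model-listing endpoint and answers as soon as the server is ready):

# Poll until the server responds, then inspect the startup log
until curl -s http://localhost:11434/api/tags > /dev/null; do sleep 1; done
echo "Ollama is up"
tail -n 20 /tmp/ollama_server.log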

πŸ“Š Performance Comparison

Before vs. after optimization

| Item                 | Before optimization | After optimization  | Improvement              |
|----------------------|---------------------|---------------------|--------------------------|
| Concurrent requests  | 1                   | 8                   | 8x                       |
| Context window       | 4,096 tokens        | 32,768 tokens       | 8x                       |
| Long-document input  | had to be split up  | handled in one pass | shorter total time       |
| Simultaneous queries | queued              | served immediately  | better perceived latency |

πŸ› 문제 ν•΄κ²°

1. 빈 응닡 문제

증상: λͺ¨λΈμ΄ λ‘œλ”©λ˜μ§€λ§Œ 응닡이 λΉ„μ–΄μžˆμŒ

ν•΄κ²°μ±…: Modelfile의 TEMPLATE μ„Ήμ…˜μ„ Qwen3 ν‘œμ€€ ν˜•μ‹μœΌλ‘œ μˆ˜μ • (μœ„μ˜ Modelfile μ˜ˆμ‹œ μ°Έμ‘°)

2. OOM (Out of Memory) 였λ₯˜

증상: λͺ¨λΈ λ‘œλ”© μ‹œ λ©”λͺ¨λ¦¬ λΆ€μ‘±

ν•΄κ²°μ±…:

  • OLLAMA_NUM_PARALLEL 값을 쀄이기 (8 β†’ 4)
  • OLLAMA_NUM_CTX 값을 쀄이기 (32768 β†’ 16384)
  • λ‹€λ₯Έ ν”„λ‘œμ„ΈμŠ€μ˜ λ©”λͺ¨λ¦¬ μ‚¬μš©λŸ‰ 확인

3. API connection failures

Symptom: curl: connection refused, or timeouts

Fix:

# Check the Ollama process
ps aux | grep "ollama serve"

# Check the port
netstat -tlnp | grep 11434
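# (or, on systems without net-tools: ss -tlnp | grep 11434)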

# Restart Ollama
pkill -f "ollama serve"
ollama serve

πŸ“ API μ‚¬μš© μ˜ˆμ‹œ

Python

import requests

url = "http://localhost:11434/api/chat"
payload = {
    "model": "qwen3-next-80b-q5km",
    "messages": [
        {"role": "user", "content": "μ•ˆλ…•ν•˜μ„Έμš”!"}
    ],
    "options": {
        "num_ctx": 32768,  # Context Window ν™•μž₯
        "num_predict": 512,
        "temperature": 0.7,
        "top_p": 0.9
    }
}

response = requests.post(url, json=payload)
print(response.json())

cURL

curl http://localhost:11434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-next-80b-q5km",
    "stream": false,
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "options": {
      "num_ctx": 32768,
      "num_predict": 512
    }
  }'
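
Streaming is the API's default; each response line is then a separate JSON chunk. A sketch of consuming the stream from the shell (assumes jq is installed):

curl -sN http://localhost:11434/api/chat \
  -d '{
    "model": "qwen3-next-80b-q5km",
    "messages": [{"role": "user", "content": "Tell me a short joke."}]
  }' | jq -j '.message.content // empty'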


πŸ™ κ°μ‚¬μ˜ 말

이 κ°€μ΄λ“œλŠ” GB10 GPUμ—μ„œ Qwen3-Next-80Bλ₯Ό μ‹€ν–‰ν•˜κΈ° μœ„ν•΄ μˆ˜λ§Žμ€ μ‹œν–‰μ°©μ˜€λ₯Ό 거쳐 μ™„μ„±λ˜μ—ˆμŠ΅λ‹ˆλ‹€. vLLM의 ν˜Έν™˜μ„± 문제λ₯Ό Ollama둜 μš°νšŒν•˜κ³ , 60GB VRAM을 μ΅œλŒ€ν•œ ν™œμš©ν•˜λŠ” μ΅œμ ν™” 방법을 μ°Ύμ•„λƒˆμŠ΅λ‹ˆλ‹€.

λ‹€λ₯Έ μ‚¬λžŒλ“€μ΄ 같은 고생을 ν•˜μ§€ μ•ŠκΈ°λ₯Ό 바라며 이 λ¬Έμ„œλ₯Ό κ³΅μœ ν•©λ‹ˆλ‹€.


πŸ“„ λΌμ΄μ„ μŠ€

이 λ¬Έμ„œλŠ” MIT λΌμ΄μ„ μŠ€ ν•˜μ— λ°°ν¬λ©λ‹ˆλ‹€. 자유둭게 μ‚¬μš©, μˆ˜μ •, λ°°ν¬ν•˜μ‹€ 수 μžˆμŠ΅λ‹ˆλ‹€.


μž‘μ„±μž: NIKA κ°œλ°œνŒ€
μ΅œμ’… μ—…λ°μ΄νŠΈ: 2025-12-20
