Qwen3-VL-32B Distilled GPTQ

Qwen3-VL-32B์„ ๊ธฐ๋ฐ˜์œผ๋กœ Knowledge Distillation ๋ฐฉ์‹์œผ๋กœ ํ•™์Šตํ•œ ํ›„ GPTQ 4-bit ์–‘์žํ™”๋ฅผ ์ ์šฉํ•œ Vision-Language ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค.
DeepSeek-v3.2(Teacher)๋กœ ์ƒ์„ฑํ•œ ํ•œ๊ตญ์–ด CoT ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ ํ•™์Šต๋˜์—ˆ์œผ๋ฉฐ, Prompt Pre-filling ๊ธฐ๋ฒ•์œผ๋กœ Tool Call ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ–ˆ์Šต๋‹ˆ๋‹ค.


๋ชจ๋ธ ์ •๋ณด

| Item | Details |
|---|---|
| Base Model | Qwen/Qwen3-VL-32B-Instruct |
| Teacher Model | DeepSeek-v3.2 (OpenRouter API) |
| Quantization | GPTQ 4-bit |
| Parameters | 33.4B |
| Training Languages | Korean (primary), English |
| Training Data | 2,031 Korean academic passages (with CoT) |

Training Method

1. CoT Data Distillation

Using DeepSeek-v3.2 (teacher), we generated Chain-of-Thought data for 2,031 Korean academic passages.

์ฆ๋ฅ˜ ์ „๋žต (Zero-shot + Guided Fallback)

  • 1์ฐจ: Teacher ๋ชจ๋ธ์ด <think> ํƒœ๊ทธ ํฌํ•จ ์‚ฌ๊ณ  ๊ณผ์ • ์ƒ์„ฑ
  • 2์ฐจ: ์‹คํŒจ ์‹œ ์ •๋‹ต ํžŒํŠธ ์ œ๊ณต ํ›„ ์žฌ์ƒ์„ฑ
  • 3์ฐจ: ์™„์ „ ์‹คํŒจ ์‹œ Ground Truth ๊ธฐ๋ฐ˜ ๊ธฐ๋ณธ ๋‹ต๋ณ€ ์‚ฌ์šฉ
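
A minimal sketch of the three-pass strategy above (the `call_teacher` function and the prompt wording are illustrative stand-ins for the actual OpenRouter pipeline, not the project's code):

```python
def distill_sample(question, answer, call_teacher):
    """Three-pass CoT distillation: zero-shot -> guided -> ground-truth fallback.

    call_teacher is any prompt -> completion function (e.g. an OpenRouter
    API wrapper around DeepSeek-v3.2).
    """
    def usable(text):
        # A sample only counts if the CoT block is opened *and* closed.
        return "<think>" in text and "</think>" in text

    # Pass 1: zero-shot generation
    out = call_teacher(f"Question: {question}\nReason inside <think> tags, then answer.")
    if usable(out):
        return out, "zero-shot"

    # Pass 2: regenerate with the correct answer supplied as a hint
    out = call_teacher(
        f"Question: {question}\n(Hint: the correct answer is {answer})\n"
        "Reason inside <think> tags, then answer."
    )
    if usable(out):
        return out, "guided"

    # Pass 3: ground-truth fallback guarantees 100% coverage
    return f"<think>\nDerive the answer from the ground truth.\n</think>\n{answer}", "fallback"
```

The pass-3 fallback is what makes the 100% coverage figure below possible: every sample ends with a closed </think> block even when the teacher fails twice.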

Quality Control Results

  • CoT tag completeness: 100% (closing </think> guaranteed)
  • Answer-format compliance: 100%
  • Coverage: 2,031 / 2,031 samples (100%)
  • Average CoT length: 1,200 characters

2. Prompt Pre-filling (Key Technique)

Because of its built-in tool-calling capability, Qwen3-VL could not be forced into plain-text reasoning with a system prompt alone.
The Prompt Pre-filling technique resolved this problem completely.

# โŒ Before (์‹คํŒจ): tool_call ํƒœ๊ทธ ์ƒ์„ฑ
full_prompt = "<|im_start|>assistant\n"

# โœ… After (์„ฑ๊ณต): think ํƒœ๊ทธ๋กœ CoT ๊ฐ•์ œ
full_prompt = "<|im_start|>assistant\n<think>\n"
์ƒํƒœ ์ •๋‹ต ์ถ”์ถœ๋ฅ 
Pre-filling ์ ์šฉ ์ „ 0% (<tool_call> ์ƒ์„ฑ)
Pre-filling ์ ์šฉ ํ›„ 100% (<think> CoT ์ถ”๋ก )

3. GPTQ ์–‘์žํ™”

โš ๏ธ ์‹คํ—˜ ๊ฒฐ๊ณผ ๋ฐ ์ฃผ์˜์‚ฌํ•ญ
32B ์ด์ƒ ๋Œ€ํ˜• ๋ชจ๋ธ์—์„œ 3 epoch ํ•™์Šต ํ›„ GPTQ ์–‘์žํ™”๋ฅผ ์ ์šฉํ•˜๋ฉด -3% ์„ฑ๋Šฅ ์ €ํ•˜๊ฐ€ ๋ฐœ์ƒํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
์›์ธ: ๊ณ ๋ฐ€๋„ CoT ๋ฐ์ดํ„ฐ๋กœ ์ธํ•œ ๊ณผ์ ํ•ฉ ํŒจํ„ด์ด ์–‘์žํ™” ์‹œ ๋ถ•๊ดด๋ฉ๋‹ˆ๋‹ค.

Recommended training settings (32B and larger):

num_train_epochs = 1   # 3 epochs → 1 epoch
learning_rate = 1e-5   # half the default
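
For reference, post-training 4-bit GPTQ quantization can be driven through the Transformers GPTQConfig integration. This is a sketch, not the project's exact pipeline: the checkpoint path, calibration dataset, and group size are assumptions.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "path/to/distilled-checkpoint"  # hypothetical: the fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit GPTQ; `dataset` supplies calibration samples for the quantizer
gptq_config = GPTQConfig(bits=4, group_size=128, dataset="c4", tokenizer=tokenizer)

# Quantizes layer by layer at load time (requires optimum + a GPTQ backend)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    device_map="auto",
)
model.save_pretrained("qwen3-vl-32b-distilled-gptq")
```

Calibrating on the CoT training data itself (rather than a generic corpus like c4) is one way to reduce the quantization-time collapse described above, since the quantizer then sees the same activation distribution the fine-tune produced.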

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("xeker/qwen3-vl-32b-distilled-gptq")
model = AutoModelForCausalLM.from_pretrained(
    "xeker/qwen3-vl-32b-distilled-gptq",
    device_map="auto"
)

messages = [
    {"role": "system", "content": "๋‹น์‹ ์€ ํ•œ๊ตญ์–ด ํ•™์ˆ  ์ง€๋ฌธ์„ ๋ถ„์„ํ•˜๊ณ  ์ถ”๋ก ํ•˜๋Š” ์ „๋ฌธ๊ฐ€์ž…๋‹ˆ๋‹ค."},
    {"role": "user", "content": "๋‹ค์Œ ์ง€๋ฌธ์„ ์ฝ๊ณ  ์งˆ๋ฌธ์— ๋‹ตํ•˜์„ธ์š”.\n\n[์ง€๋ฌธ ๋‚ด์šฉ]"}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
text += "<think>\n"  # ⭐ Prompt Pre-filling: blocks tool calls, forces CoT reasoning

inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))

์ฃผ์˜: <think>\n pre-filling ์—†์ด ์‚ฌ์šฉํ•˜๋ฉด <tool_call> ํƒœ๊ทธ๊ฐ€ ์ƒ์„ฑ๋˜์–ด ๋‹ต๋ณ€ ์ถ”์ถœ์— ์‹คํŒจํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.


Tech Stack

Python PyTorch Transformers vLLM GPTQ Knowledge Distillation OpenRouter API


ํ”„๋กœ์ ํŠธ ๋ฐฐ๊ฒฝ

๋„ค์ด๋ฒ„ ๋ถ€์ŠคํŠธ์บ ํ”„ AI Tech 8๊ธฐ ํŒ€ ํ”„๋กœ์ ํŠธ์˜ ์ผํ™˜์œผ๋กœ ์ง„ํ–‰๋œ ์ˆ˜๋Šฅ ํ’€์ด ํŠนํ™” LLM ๊ฐœ๋ฐœ ํ”„๋กœ์ ํŠธ์—์„œ ํŒŒ์ƒ๋œ ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค.
