vision?

#2
by willfalco - opened

Errors out on pictures. On the same setup, the 27B one works fine with vision.

What's the error?

Sorry for the delay, couldn't find where I ran it.

2026-03-16 19:22:50.379 INFO: Using backend exllamav3
2026-03-16 19:22:50.382 INFO: exllamav3 version: 0.0.25
.....
2026-03-16 19:24:08.498 INFO: Received chat completion streaming request
1c52fcb857174f4197add970dc162cb7
2026-03-16 19:24:08.520 ERROR: FATAL ERROR with generation. Attempting to
recreate the generator. If this fails, please restart the server.

2026-03-16 19:24:08.521 INFO: Generation options: {'request_id':
'1c52fcb857174f4197add970dc162cb7', 'bos_token_id': None, 'eos_token_id':
[248044, 248046], 'prompt': '<|im_start|>system\nYou are AI.<|im_end|>\n<|im_start|>user\n<$EMB_1000000000$><|im_end|>\n<|im_sta
rt|>assistant\n\n\n\n\n', 'max_tokens': None, 'min_tokens': 0,
'stop': [248044, 248046], 'banned_strings': [], 'banned_tokens': [],
'allowed_tokens': [], 'token_healing': False, 'temperature': 1.0,
'temperature_last': False, 'smoothing_factor': 0.0, 'top_k': 20, 'top_p': 0.95,
'top_a': 0.0, 'min_p': 0.02, 'tfs': 1.0, 'typical': 1.0, 'skew': 0.0,
'xtc_probability': 0.0, 'xtc_threshold': 0.1, 'frequency_penalty': 0.0,
'presence_penalty': 0.0, 'repetition_penalty': 1.05, 'penalty_range': -1,
'repetition_decay': 0, 'dry_multiplier': 0.8, 'dry_base': 1.8,
'dry_allowed_length': 3, 'dry_range': 0, 'dry_sequence_breakers': [],
'mirostat_mode': 0, 'mirostat_tau': 1.5, 'mirostat_eta': 0.3, 'add_bos_token':
None, 'ban_eos_token': False, 'logit_bias': None, 'negative_prompt': None,
'json_schema': None, 'regex_pattern': None, 'grammar_string': None,
'speculative_ngram': None, 'cfg_scale': 1.0, 'max_temp': 1.3, 'min_temp': 0.8,
'temp_exponent': 1.0, 'logprobs': 0, 'adaptive_target': 1.0, 'adaptive_decay':
0.9, 'model': 'q35', 'stream': True, 'stream_options': None, 'response_format':
{'type': 'text'}, 'n': 1, 'best_of': None, 'echo': False, 'suffix': None,
'user': None, 'messages': [{'role': 'user', 'content': '<$EMB_1000000000$>',
'tool_calls': None, 'tool_call_id': None}], 'prompt_template': None,
'add_generation_prompt': True, 'template_vars': {'add_generation_prompt': True,
'tools': None, 'functions': None, 'messages': [{'role': 'user', 'content':
'<$EMB_1000000000$>'}], 'bos_token': None, 'eos_token': '<|endoftext|>',
'pad_token': '', 'unk_token': None}, 'response_prefix': None, 'tools': None,
'functions': None}

2026-03-16 19:24:08.526 WARNING: Immediately terminating all jobs. Clients will
have their requests cancelled.

2026-03-16 19:24:08.532 ERROR: Traceback (most recent call last):
2026-03-16 19:24:08.532 ERROR: File
"/app/endpoints/OAI/utils/chat_completion.py", line 373, in
stream_generate_chat_completion
2026-03-16 19:24:08.532 ERROR: raise generation
2026-03-16 19:24:08.532 ERROR: File
"/app/endpoints/OAI/utils/completion.py", line 118, in _stream_collector
2026-03-16 19:24:08.532 ERROR: async for generation in new_generation:
2026-03-16 19:24:08.532 ERROR: File "/app/backends/exllamav3/model.py",
line 779, in stream_generate
2026-03-16 19:24:08.532 ERROR: async for generation_chunk in
self.generate_gen(
2026-03-16 19:24:08.532 ERROR: File "/app/backends/exllamav3/model.py",
line 1064, in generate_gen
2026-03-16 19:24:08.532 ERROR: raise ex
2026-03-16 19:24:08.532 ERROR: File "/app/backends/exllamav3/model.py",
line 1006, in generate_gen
2026-03-16 19:24:08.532 ERROR: async for result in job:
2026-03-16 19:24:08.532 ERROR: File
"/opt/venv/lib/python3.12/site-packages/exllamav3/generator/async_generator.py",
line 87, in aiter
2026-03-16 19:24:08.532 ERROR: raise result
2026-03-16 19:24:08.532 ERROR: File
"/opt/venv/lib/python3.12/site-packages/exllamav3/generator/async_generator.py",
line 23, in _run_iteration
2026-03-16 19:24:08.532 ERROR: results = self.generator.iterate()
2026-03-16 19:24:08.532 ERROR: ^^^^^^^^^^^^^^^^^^^^^^^^
2026-03-16 19:24:08.532 ERROR: File
"/opt/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120,
in decorate_context
2026-03-16 19:24:08.532 ERROR: return func(*args, **kwargs)
2026-03-16 19:24:08.532 ERROR: ^^^^^^^^^^^^^^^^^^^^^
2026-03-16 19:24:08.532 ERROR: File
"/opt/venv/lib/python3.12/site-packages/exllamav3/generator/generator.py", line
298, in iterate
2026-03-16 19:24:08.532 ERROR: job.prefill(results)
2026-03-16 19:24:08.532 ERROR: File
"/opt/venv/lib/python3.12/site-packages/exllamav3/generator/job.py", line 1009,
in prefill
2026-03-16 19:24:08.532 ERROR: self.generator.model.prefill(
2026-03-16 19:24:08.532 ERROR: File
"/opt/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120,
in decorate_context
2026-03-16 19:24:08.532 ERROR: return func(*args, **kwargs)
2026-03-16 19:24:08.532 ERROR: ^^^^^^^^^^^^^^^^^^^^^
2026-03-16 19:24:08.532 ERROR: File
"/opt/venv/lib/python3.12/site-packages/exllamav3/model/model.py", line 103, in
prefill
2026-03-16 19:24:08.532 ERROR: return self.prefill_ls(x, params,
self.last_kv_module_idx, self.modules)
2026-03-16 19:24:08.532 ERROR:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-03-16 19:24:08.532 ERROR: File
"/opt/venv/lib/python3.12/site-packages/exllamav3/model/model_ls.py", line 191,
in prefill_ls
2026-03-16 19:24:08.532 ERROR: x = module.forward(x, params)
2026-03-16 19:24:08.532 ERROR: ^^^^^^^^^^^^^^^^^^^^^^^^^
2026-03-16 19:24:08.532 ERROR: File
"/opt/venv/lib/python3.12/site-packages/exllamav3/modules/transformer.py", line
78, in forward
2026-03-16 19:24:08.532 ERROR: y = self.mlp.forward(y, params)
2026-03-16 19:24:08.532 ERROR: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-03-16 19:24:08.532 ERROR: File
"/opt/venv/lib/python3.12/site-packages/exllamav3/modules/block_sparse_mlp.py",
line 627, in forward
2026-03-16 19:24:08.532 ERROR: expert_count =
torch.bincount(flat_expert_local, minlength = E + 1)
2026-03-16 19:24:08.532 ERROR:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-03-16 19:24:08.532 ERROR: RuntimeError: bincount only supports 1-d
non-negative integral inputs.
2026-03-16 19:24:08.539 ERROR: Sent to request: Chat completion aborted.
Please check the server console.
2026-03-16 19:24:08.550 INFO: Received chat completion request
3f92764dda1d410a816d7a799d853598
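For reference, the `bincount` failure at the bottom of that traceback is easy to reproduce in isolation. A minimal sketch using numpy's analogous function, which has the same 1-D non-negative contract as `torch.bincount` (the NaN-router-logit explanation in the comment is only a guess, not something the log confirms):

```python
import numpy as np

# torch.bincount and numpy.bincount share the same contract: the input
# must be 1-D and non-negative. Normal expert ids count fine:
ok = np.bincount(np.array([0, 1, 1, 2]), minlength=4)  # counts per expert id

# If the router ever emits a bogus expert id (e.g. a NaN logit cast to a
# negative integer index -- one possible source, not confirmed), the same
# RuntimeError/ValueError class appears:
try:
    np.bincount(np.array([-1, 0, 1]))
    raised = False
except ValueError:
    raised = True
```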

RTX 3080 10GB + RTX 3070 Ti 8GB
Qwen3.5-35B-A3B-exl3-3.54bpw-config.yml
network:
  host: 0.0.0.0
  port: 5000
  disable_auth: true
  disable_fetch_requests: false
  send_tracebacks: false
model:
  use_dummy_models: true
  dummy_model_names: ["q35"]
  model_name: Qwen3.5-35B-A3B-exl3-3.54bpw
  backend: exllamav3
  tensor_parallel_backend: native
  gpu_split_auto: false
  gpu_split: [8, 7.5]
  autosplit_reserve: [32, 32]
  cache_mode: Q8
  max_seq_len: 55040
  chunk_size: 512
  vision: true
sampling:
  override_preset: Qwen3.5-27B-heretic-v2-exl3
logging:
  log_prompt: false
  log_generation_params: true
  log_requests: false

Same error on 0.0.26.

Did anyone run this with vision?

I'm running it with vision with no issues. I've tried to replicate your model settings but still not seeing any errors.

Not really sure what could cause the error you're seeing. Sampling shouldn't matter since it's failing during prefill. It does look like this is a modified Jinja template, but again that shouldn't cause the error. Couple of questions:

  • Is the image being sent as a URL or base64?
  • You're sure the model downloaded correctly? (I see this a lot, mind you)
  • What's your Torch version?
  • Do you have flash-linear-attention installed?
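For the last two questions, a quick check inside the container can be sketched like this (hypothetical helper; `fla` is, to the best of my knowledge, flash-linear-attention's import name):

```python
import importlib.util

def env_report():
    """Report the installed torch version and whether flash-linear-attention
    (import name 'fla') is importable."""
    lines = []
    try:
        import torch
        lines.append(f"torch {torch.__version__} (CUDA {torch.version.cuda})")
    except ImportError:
        lines.append("torch: not installed")
    has_fla = importlib.util.find_spec("fla") is not None
    lines.append(f"flash-linear-attention installed: {has_fla}")
    return lines

for line in env_report():
    print(line)
```

Running it via `docker exec` in the TabbyAPI container would answer both questions at once.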

Ran through the TabbyAPI docker image built from the official repo:
https://github.com/theroyallab/tabbyAPI/blob/main/docker/Dockerfile
https://github.com/theroyallab/tabbyAPI/blob/main/pyproject.toml
with torch 2.9.0.
The image is sent through Open WebUI chat completions by pasting it into the chat (there is no setting for URL vs. base64 anywhere in Open WebUI or TabbyAPI).

Confirmed MetaphoricalCode/Qwen3.5-27B-heretic-v2-exl3-4bpw-hb6 working on the same docker image and Open WebUI before trying the 35B 3.54bpw.

Maybe there is a better way to run it to get chat completions for Open WebUI.
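One way to take Open WebUI out of the equation is to hit TabbyAPI's OpenAI-compatible endpoint directly with a hand-built base64 image message. A hypothetical sketch (model name `q35` from the config above; endpoint path and message shape follow the OpenAI chat completions convention):

```python
import base64
import json

def build_vision_payload(image_bytes: bytes, prompt: str, model: str = "q35") -> str:
    """Build an OpenAI-style chat completion body with an inline base64
    image -- the shape a client is expected to send for pasted images."""
    b64 = base64.b64encode(image_bytes).decode()
    return json.dumps({
        "model": model,
        "stream": False,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    })

# Dummy bytes stand in for a real image file here:
body = build_vision_payload(b"\x89PNG\r\n\x1a\n", "What is in this image?")
```

POST the body to `http://localhost:5000/v1/chat/completions` with a `Content-Type: application/json` header (e.g. via curl); if that fails the same way, Open WebUI is ruled out.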

Hm, turboderp/Qwen3.5-35B-A3B-exl3-4.09bpw worked fine with the same stack on a different box; that docker image was built and compiled for ARM.

And just tested: Qwen3.5-35B-A3B-exl3-3.54bpw works on a freshly compiled docker on ARM.
