I’ll post a draft of the issue for now (edited):
These are the issues actually worth raising, in this order.
Issue 1 — Primary: PEFT adapter loading fails only on an already CPU/GPU-dispatched bnb 4-bit Gemma 4 model
File first at: huggingface/transformers
Mention/cross-link: huggingface/peft, huggingface/accelerate, bitsandbytes-foundation/bitsandbytes
Suggested title
PeftModel.from_pretrained fails only on CPU/GPU-dispatched 4-bit Gemma4: Params4bit.__new__ got unexpected _is_hf_initialized
Why this is the strongest issue
This is the core issue, now updated with the latest Colab T4 evidence:
R05 all-GPU 4-bit:
PASS
R03 split CPU/GPU 4-bit:
FAIL_BNB_PARAM_CONSTRUCTOR
TypeError: Params4bit.__new__() got an unexpected keyword argument '_is_hf_initialized'
The important contrast is:
Works:
device_map = {"": 0}
Fails:
device_map = {
"model.vision_tower": "cpu",
"model.audio_tower": "cpu",
"": 0,
}
This means the current primary failure is not PEFT + bitsandbytes 4-bit in general. The all-GPU path works. The failure is specific to loading a PEFT adapter on top of an already CPU/GPU-dispatched bitsandbytes 4-bit Gemma 4 base model.
The previous Gemma4ClippableLinear blocker is still important, but it is no longer the primary issue because the latest repro bypasses it by targeting the inner .linear modules in Gemma 4 multimodal towers.
Current environment
Runtime: Colab Free / Tesla T4
Python: 3.12.13
torch: 2.10.0+cu128
transformers: 5.6.2
accelerate: 1.13.0
peft: 0.19.1
bitsandbytes: 0.49.2
huggingface_hub: 1.12.0
safetensors: 0.7.0
torchao_importable: false
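For reproducibility, a small sketch (assuming the packages are importable in the runtime) that prints the same version set:

```python
import importlib.metadata as md
import importlib.util
import torch

# Report library versions and whether torchao is importable in this runtime.
for pkg in ["transformers", "accelerate", "peft", "bitsandbytes", "huggingface_hub", "safetensors"]:
    print(pkg, md.version(pkg))
print("torch:", torch.__version__)
print("torchao_importable:", importlib.util.find_spec("torchao") is not None)
```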
Model and adapter
Base model:
unsloth/gemma-4-E2B-it
Adapter:
Ayodele01/gemma-4-E2B-Gemini-3.1-Pro-Reasoning-Distill
Adapter probe:
adapter_model.safetensors keys: 786
audio: 72
language_like: 490
vision_or_patch: 224
This adapter is genuinely multimodal, so simply excluding vision/audio LoRA would change the adapter semantics.
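For reference, a minimal probe sketch that would reproduce those counts from the downloaded checkpoint; the bucketing heuristic here (substring matches on tower names) is only illustrative, not the exact probe script used:

```python
from safetensors import safe_open

# Count LoRA keys in the adapter checkpoint by rough modality bucket.
with safe_open("adapter_model.safetensors", framework="pt") as f:
    keys = list(f.keys())

audio = [k for k in keys if ".audio_tower." in k]
vision_or_patch = [k for k in keys if ".vision_tower." in k or "patch" in k]
language_like = [k for k in keys if k not in audio and k not in vision_or_patch]

print(f"total: {len(keys)}, audio: {len(audio)}, "
      f"vision_or_patch: {len(vision_or_patch)}, language_like: {len(language_like)}")
```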
Why Transformers first
Transformers is still the best first repo because this issue crosses:
- Gemma 4 model integration;
- bitsandbytes quantization integration;
- PEFT adapter integration expectations;
- device-map loading behavior;
- Accelerate redispatch/hook behavior;
- current `_is_hf_initialized` parameter reconstruction behavior.
Accelerate owns dispatch_model() and hook attachment. bitsandbytes owns Params4bit. PEFT triggers the adapter-loading path. But the user-facing break is in the Transformers + PEFT + bnb integration path.
Issue 2 — Supporting / separate if requested: PEFT target-module compatibility for Gemma4ClippableLinear
File at: huggingface/peft
Suggested title
Gemma4 multimodal LoRA adapters need inner .linear targeting or Gemma4ClippableLinear support
Why it is supporting, not primary
Before patching the local adapter config, PEFT fails earlier:
ValueError: Target module Gemma4ClippableLinear(...) is not supported.
However, the latest repro bypasses that blocker by targeting the inner .linear modules for Gemma 4 vision/audio towers. With that patch:
R05 all-GPU 4-bit:
PASS
So this is real, but it is not the primary split/offload failure.
Issue 3 — Deferred / separate only if you have the exact trace: PEFT device_map breaks Gemma 4 shared-KV generation
File first at: huggingface/transformers
Mention/cross-link: huggingface/peft, huggingface/accelerate
Suggested title
Gemma4 + PeftModel.from_pretrained(device_map=...) breaks shared_kv_states during generate
Why this is deferred
The latest Colab E2B matrix did not reproduce this failure:
R05 all-GPU, no PEFT device_map:
PASS
R03 split CPU/GPU, no PEFT device_map:
FAIL_BNB_PARAM_CONSTRUCTOR before generation
Only file the shared-KV issue if you attach a separate trace where adapter loading succeeds and generation fails here:
Gemma4Attention.forward
→ shared_kv_states[self.kv_shared_layer_index]
→ KeyError: 22
Optional Issue 4 — Docs / UX: PEFT offload-dir / offload-folder handling is confusing
File at: huggingface/peft
Suggested title
Clarify offload_dir/offload_folder behavior for PeftModel.from_pretrained on already-dispatched models
Why it is lower priority
This is useful, but it is not the core bug. The current core bug has a concrete split/offload Params4bit constructor trace.
What I would not file
Not this
CPU offloading is broken.
Too broad. The base model can load in split CPU/GPU form. The failure occurs during PEFT adapter loading on top of the already-dispatched quantized base model.
Better:
PeftModel.from_pretrained fails on an already CPU/GPU-dispatched bitsandbytes 4-bit Gemma 4 model during Accelerate hook setup.
Not this
PEFT expects vision/audio towers to be on GPU.
The all-GPU path works, and the split path fails in a bnb/Accelerate parameter reconstruction path. The evidence points to split/offload redispatch, not a generic PEFT requirement that vision/audio must be GPU-resident.
Not this as the main issue
Tensor.item() cannot be called on meta tensors
That may be a related double-quant / QuantState.as_dict() variant, but the latest controlled Colab matrix with double_quant=False lands on:
Params4bit.__new__() got an unexpected keyword argument '_is_hf_initialized'
Mention the nested_offset.item() issue only if you attach that exact separate trace.
Recommended filing plan
Best plan
Open one primary Transformers issue:
PeftModel.from_pretrained fails only on CPU/GPU-dispatched 4-bit Gemma4: Params4bit.__new__ got unexpected _is_hf_initialized
Include the Gemma4ClippableLinear patch as a “repro setup note,” not as the headline.
Then say:
I can split the Gemma4ClippableLinear target-module compatibility issue into a PEFT issue if maintainers prefer.
If you want the cleanest tracking
Open two separate issues:
- Transformers Issue A: split/offload bnb 4-bit + PEFT + Accelerate `Params4bit.__new__(_is_hf_initialized)` failure.
- PEFT Issue B: Gemma 4 multimodal LoRA adapters and `Gemma4ClippableLinear` target-module compatibility.
Do not open the shared-KV issue from the latest Colab matrix alone, because it was not reproduced there.
Key evidence to include
Include this exact contrast:
Works:
device_map = {"": 0}
Fails:
device_map = {
"model.vision_tower": "cpu",
"model.audio_tower": "cpu",
"": 0,
}
Include this exact result pair:
R05_ALL_GPU_4BIT:
PASS
generate output shape: (1, 8)
CUDA after generate: allocated 6.507 GiB, reserved 6.693 GiB, free 7.742 GiB
R03_SPLIT_4BIT:
FAIL_BNB_PARAM_CONSTRUCTOR
TypeError: Params4bit.__new__() got an unexpected keyword argument '_is_hf_initialized'
Include the adapter probe:
Ayodele01/gemma-4-E2B-Gemini-3.1-Pro-Reasoning-Distill
audio: 72
language_like: 490
vision_or_patch: 224
Bottom line
The actual issues to raise now are:
- Primary bug: `PeftModel.from_pretrained()` fails on a CPU/GPU-dispatched bitsandbytes 4-bit Gemma 4 base model during Accelerate hook setup, while the same patched multimodal adapter works all-GPU.
- Supporting bug: vanilla PEFT currently trips over Gemma 4 multimodal tower `Gemma4ClippableLinear` wrappers unless the adapter targets inner `.linear` modules or PEFT supports the wrapper.
- Deferred bug: passing `device_map` into PEFT may break Gemma 4 shared-KV generation, but this needs its own exact trace and should not be merged into the latest Colab E2B primary issue.
Below are ready-to-paste GitHub sections. Wrap each ready-to-copy body in an outer 4-backtick fence when pasting, so the inner 3-backtick code fences remain intact.
Issue 1
Target repo
huggingface/transformers
Suggested title
PeftModel.from_pretrained fails only on CPU/GPU-dispatched 4-bit Gemma4: Params4bit.__new__ got unexpected _is_hf_initialized
Suggested labels
bug, Gemma4, PEFT, Accelerate, bitsandbytes, quantization, device_map, cpu-offload
Body
### System Info
- Runtime: Colab Free / Tesla T4
- Python: 3.12.13
- torch: 2.10.0+cu128
- transformers: 5.6.2
- accelerate: 1.13.0
- peft: 0.19.1
- bitsandbytes: 0.49.2
- huggingface_hub: 1.12.0
- safetensors: 0.7.0
- torchao: not importable / removed from environment
- model: `unsloth/gemma-4-E2B-it`
- adapter: `Ayodele01/gemma-4-E2B-Gemini-3.1-Pro-Reasoning-Distill`
- quantization: bitsandbytes 4-bit NF4
- attention implementation: `sdpa`
- trust_remote_code: `False`
### Summary
I can load `unsloth/gemma-4-E2B-it` in bitsandbytes 4-bit, load a patched multimodal LoRA adapter with PEFT, and run a tiny `generate()` smoke test when the whole model is placed on GPU:
```python
device_map = {"": 0}
```
However, the same model/adapter path fails when the base model is loaded with a split CPU/GPU `device_map`:
```python
device_map = {
    "model.vision_tower": "cpu",
    "model.audio_tower": "cpu",
    "": 0,
}
```
The failure happens during `PeftModel.from_pretrained()`, after PEFT starts adapter loading and calls into Accelerate `dispatch_model()` / hook attachment. The failing path reconstructs a bitsandbytes `Params4bit` object and passes `_is_hf_initialized` into its constructor:
```text
TypeError: Params4bit.__new__() got an unexpected keyword argument '_is_hf_initialized'
```
The contrast is important:
```text
R05 all-GPU 4-bit:
PASS
R03 split CPU/GPU 4-bit:
FAIL_BNB_PARAM_CONSTRUCTOR
```
This suggests the issue is not PEFT + bitsandbytes 4-bit in general. It appears specific to loading a PEFT adapter on top of an already CPU/GPU-dispatched bitsandbytes 4-bit Gemma 4 model, where PEFT triggers an additional Accelerate dispatch/hook path.
### Why the adapter config is patched in this repro
The adapter is a real multimodal LoRA adapter. Its safetensors keys include:
```text
audio: 72
language_like: 490
vision_or_patch: 224
```
Without patching, PEFT first fails earlier with:
```text
ValueError: Target module Gemma4ClippableLinear(...) is not supported.
```
That is because Gemma 4 vision/audio towers expose wrapper modules such as:
```text
model.vision_tower.encoder.layers.0.self_attn.q_proj         → Gemma4ClippableLinear
model.vision_tower.encoder.layers.0.self_attn.q_proj.linear  → Linear / Linear4bit
```
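A quick, illustrative check of this structure (assuming the base model is loaded as in Case A below):

```python
# Inspect one vision-tower projection to confirm the wrapper vs. inner-linear structure.
outer = base_model.get_submodule("model.vision_tower.encoder.layers.0.self_attn.q_proj")
inner = base_model.get_submodule("model.vision_tower.encoder.layers.0.self_attn.q_proj.linear")
print(type(outer).__name__)  # expected: Gemma4ClippableLinear
print(type(inner).__name__)  # expected: Linear or Linear4bit
```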
The adapter weights use inner `.linear` paths for multimodal towers, for example:
```text
base_model.model.model.audio_tower.layers.0.self_attn.k_proj.linear.lora_A.weight
base_model.model.model.vision_tower.encoder.layers.0.self_attn.q_proj.linear.lora_A.weight
```
So the repro patches only the local copy of `adapter_config.json` to target inner `.linear` modules for Gemma 4 multimodal towers while leaving language tower targets at the usual projection modules.
Patched target expression:
```text
.*(?:model\.language_model\.layers\.\d+\.(?:self_attn|mlp)\.(?:q_proj|k_proj|v_proj|o_proj|gate_proj|up_proj|down_proj)|model\.vision_tower\.encoder\.layers\.\d+\.(?:self_attn|mlp)\.(?:q_proj|k_proj|v_proj|o_proj|gate_proj|up_proj|down_proj)\.linear|model\.audio_tower\.layers\.\d+\.self_attn\.(?:q_proj|k_proj|v_proj)\.linear)$
```
After this patch, the all-GPU case works, which confirms the Gemma4ClippableLinear target-module issue is bypassed for this repro.
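A minimal sketch of that local patch (illustrative only: the source and destination paths are placeholders, and it relies on PEFT full-matching a string `target_modules` as a regex):

```python
import json

# Patch only the local copy of adapter_config.json; placeholder paths, adjust to the repro layout.
src = "/content/original_adapter/adapter_config.json"
dst = "/content/patched_gemma4_multimodal_inner_linear_adapter_v4/adapter_config.json"

with open(src) as f:
    cfg = json.load(f)

# PEFT treats a string target_modules as a regex that is full-matched against module names.
cfg["target_modules"] = (
    r".*(?:model\.language_model\.layers\.\d+\.(?:self_attn|mlp)\."
    r"(?:q_proj|k_proj|v_proj|o_proj|gate_proj|up_proj|down_proj)"
    r"|model\.vision_tower\.encoder\.layers\.\d+\.(?:self_attn|mlp)\."
    r"(?:q_proj|k_proj|v_proj|o_proj|gate_proj|up_proj|down_proj)\.linear"
    r"|model\.audio_tower\.layers\.\d+\.self_attn\.(?:q_proj|k_proj|v_proj)\.linear)$"
)

with open(dst, "w") as f:
    json.dump(cfg, f, indent=2)
```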
### Quantization config
```python
from transformers import BitsAndBytesConfig
import torch

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=False,
    bnb_4bit_compute_dtype=torch.bfloat16,
    llm_int8_enable_fp32_cpu_offload=True,
)
```
### Case A — all-GPU path works
```python
device_map = {"": 0}
max_memory = {0: "14GiB"}
base_model = Gemma4ForConditionalGeneration.from_pretrained(
"unsloth/gemma-4-E2B-it",
quantization_config=quant_config,
device_map=device_map,
max_memory=max_memory,
dtype=torch.bfloat16,
attn_implementation="sdpa",
trust_remote_code=False,
low_cpu_mem_usage=True,
)
model = PeftModel.from_pretrained(
base_model,
"/content/patched_gemma4_multimodal_inner_linear_adapter_v4",
is_trainable=False,
)
out = model.generate(**inputs, max_new_tokens=4, do_sample=False, use_cache=True)
```
Observed result:
```text
R05_ALL_GPU_4BIT: PASS
generate output shape: (1, 8)
CUDA after generate:
free_gib: 7.742
total_gib: 14.563
allocated_gib: 6.507
reserved_gib: 6.693
```
### Case B — split CPU/GPU path fails
```python
# Imports and quant_config as in Case A.
device_map = {
    "model.vision_tower": "cpu",
    "model.audio_tower": "cpu",
    "": 0,
}
max_memory = {0: "13GiB", "cpu": "10GiB"}

base_model = Gemma4ForConditionalGeneration.from_pretrained(
    "unsloth/gemma-4-E2B-it",
    quantization_config=quant_config,
    device_map=device_map,
    max_memory=max_memory,
    offload_folder="/content/gemma4_offload_v4",
    dtype=torch.bfloat16,
    attn_implementation="sdpa",
    trust_remote_code=False,
    low_cpu_mem_usage=True,
)

# Fails inside PeftModel.from_pretrained(), before any generate() call.
model = PeftModel.from_pretrained(
    base_model,
    "/content/patched_gemma4_multimodal_inner_linear_adapter_v4",
    is_trainable=False,
    offload_dir="/content/gemma4_offload_v4",
    offload_buffers=True,
    ephemeral_gpu_offload=True,
    torch_device="cuda:0",
)
```
Observed result:
```text
R03_SPLIT_4BIT: FAIL_BNB_PARAM_CONSTRUCTOR
TypeError: Params4bit.__new__() got an unexpected keyword argument '_is_hf_initialized'
```
Trace tail:
```text
File ".../peft/peft_model.py", line 582, in from_pretrained
load_result = model.load_adapter(
File ".../peft/peft_model.py", line 1475, in load_adapter
dispatch_model(
File ".../accelerate/big_modeling.py", line 432, in dispatch_model
attach_align_device_hook_on_blocks(
File ".../accelerate/hooks.py", line 540, in attach_align_device_hook
add_hook_to_module(module, hook, append=True)
File ".../accelerate/hooks.py", line 183, in add_hook_to_module
module = hook.init_hook(module)
File ".../accelerate/hooks.py", line 330, in init_hook
set_module_tensor_to_device(module, name, "meta")
File ".../accelerate/utils/modeling.py", line 363, in set_module_tensor_to_device
new_value = param_cls(new_value, requires_grad=old_value.requires_grad, **kwargs).to(
TypeError: Params4bit.__new__() got an unexpected keyword argument '_is_hf_initialized'
```
### Expected behavior
One of the following:
1. `PeftModel.from_pretrained()` should support adapter loading on an already CPU/GPU-dispatched bitsandbytes 4-bit model without reconstructing `Params4bit` with unsupported kwargs.
2. PEFT should avoid redispatching an already-dispatched quantized base model, or should do so without passing `_is_hf_initialized` to bitsandbytes constructors.
3. Accelerate `set_module_tensor_to_device()` should avoid forwarding HF-internal parameter attributes into `bitsandbytes.Params4bit.__new__()` if that constructor does not accept them.
4. bitsandbytes `Params4bit.__new__()` should accept or ignore `_is_hf_initialized` if this attribute is expected in the HF integration path (see the workaround sketch after this list).
5. If this configuration is unsupported, the error should be raised early with a clear message.
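For reference, a minimal client-side workaround sketch for option 4 (untested in this environment, and only a stopgap, not a proposed fix) that strips the unsupported kwarg before it reaches the constructor:

```python
import bitsandbytes as bnb

# Stopgap monkeypatch: drop the HF-internal flag that Params4bit.__new__() rejects.
_orig_new = bnb.nn.Params4bit.__new__

def _tolerant_new(cls, *args, **kwargs):
    kwargs.pop("_is_hf_initialized", None)  # silently ignore the unsupported kwarg
    return _orig_new(cls, *args, **kwargs)

bnb.nn.Params4bit.__new__ = staticmethod(_tolerant_new)
```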
### Actual behavior
- All-GPU 4-bit base + patched multimodal PEFT adapter: works.
- Split CPU/GPU 4-bit base + the same patched multimodal PEFT adapter: fails during PEFT adapter loading.
- The failure occurs before generation.
- The failure occurs in the Accelerate dispatch/hook path when reconstructing a bitsandbytes `Params4bit` parameter on the meta-device path.
### Why this seems split/offload-specific
The same environment, same model, same adapter, same adapter patch, same quantization config, and same PEFT call pattern work when the model is all-GPU:
```python
device_map = {"": 0}
```
The failure appears only with:
```python
device_map = {
    "model.vision_tower": "cpu",
    "model.audio_tower": "cpu",
    "": 0,
}
```
Therefore, the issue seems tied to PEFT adapter loading on an already split/offloaded bitsandbytes 4-bit model, not to PEFT + bnb 4-bit generally.
### Related observations
#### 1. Gemma4ClippableLinear target-module blocker
Before applying the local adapter-config patch, both tested Gemma 4 E2B adapters failed earlier with:
```text
ValueError: Target module Gemma4ClippableLinear(...) is not supported.
```
That appears to be the same class of issue as the public PEFT `Gemma4ClippableLinear` support issue. The current repro works around that by targeting the inner `.linear` modules for vision/audio towers.
#### 2. Related `_is_hf_initialized` issue family
There is an existing issue for a similar error family with `Int8Params`:
```text
TypeError: Int8Params.__new__() got an unexpected keyword argument '_is_hf_initialized'
```
This repro appears to be the `Params4bit` variant of that family, reached through PEFT adapter loading and Accelerate hook setup on a split-device model.
### Questions
1. Is `PeftModel.from_pretrained()` expected to support adapter loading on an already CPU/GPU-dispatched bitsandbytes 4-bit model?
2. Should PEFT skip redispatch when the base model already has an `hf_device_map`?
3. Should Accelerate filter `_is_hf_initialized` before reconstructing bitsandbytes parameter classes?
4. Should bitsandbytes `Params4bit` accept or ignore `_is_hf_initialized`?
5. Is the recommended workaround for Gemma 4 E2B on T4 to keep the whole 4-bit model on GPU rather than offloading vision/audio towers to CPU?
### Relevant links
```text
Accelerate big model dispatch docs:
https://huggingface.co/docs/accelerate/package_reference/big_modeling
PEFT PeftModel docs:
https://huggingface.co/docs/peft/package_reference/peft_model
Transformers PEFT integration docs:
https://huggingface.co/docs/transformers/en/peft
Transformers bitsandbytes docs:
https://huggingface.co/docs/transformers/quantization/bitsandbytes
PEFT Gemma4ClippableLinear issue:
https://github.com/huggingface/peft/issues/3129
Related _is_hf_initialized issue:
https://github.com/huggingface/transformers/issues/43872
```
Optional Issue 2
Target repo
huggingface/peft
Suggested title
Gemma4 multimodal LoRA adapters need inner .linear targeting or Gemma4ClippableLinear support
Body
There is also a Gemma 4 target-module compatibility issue before this failure. Without the local `target_modules` patch, PEFT tries to inject LoRA into outer `Gemma4ClippableLinear` wrappers and fails with:
```text
ValueError: Target module Gemma4ClippableLinear(...) is not supported.
```
The adapter is genuinely multimodal, so simply excluding vision/audio targets would change the adapter semantics. The repro instead patches the local adapter config to target the inner `.linear` modules in Gemma 4 vision/audio towers.
With that patch, the all-GPU case passes, but the CPU/GPU-dispatched case still fails with:
```text
Params4bit.__new__() got an unexpected keyword argument '_is_hf_initialized'
```
My recommendation
Open Issue 1 first. It is now much cleaner than the original draft because the latest evidence isolates the trigger:
- all-GPU 4-bit path works
- split/offloaded 4-bit path fails during PEFT adapter loading
Do not open the shared-KV issue from this latest Colab run alone, because this run did not reach that failure path.