CPU offloading error scenario

I’ll post a draft of the issue for now (edited):


The actual issues worth raising are these, in this order.

Issue 1 — Primary: PEFT adapter loading fails only on an already CPU/GPU-dispatched bnb 4-bit Gemma 4 model

File first at: huggingface/transformers
Mention/cross-link: huggingface/peft, huggingface/accelerate, bitsandbytes-foundation/bitsandbytes

Suggested title

PeftModel.from_pretrained fails only on CPU/GPU-dispatched 4-bit Gemma4: Params4bit.__new__ got unexpected _is_hf_initialized

Why this is the strongest issue

This is the core finding, now updated with the latest Colab T4 evidence:

R05 all-GPU 4-bit:
PASS

R03 split CPU/GPU 4-bit:
FAIL_BNB_PARAM_CONSTRUCTOR
TypeError: Params4bit.__new__() got an unexpected keyword argument '_is_hf_initialized'

The important contrast is:

Works:
device_map = {"": 0}

Fails:
device_map = {
    "model.vision_tower": "cpu",
    "model.audio_tower": "cpu",
    "": 0,
}

This means the current primary failure is not PEFT + bitsandbytes 4-bit in general. The all-GPU path works. The failure is specific to loading a PEFT adapter on top of an already CPU/GPU-dispatched bitsandbytes 4-bit Gemma 4 base model.

The previous Gemma4ClippableLinear blocker is still important, but it is no longer the primary issue because the latest repro bypasses it by targeting the inner .linear modules in Gemma 4 multimodal towers.

Current environment

Runtime: Colab Free / Tesla T4
Python: 3.12.13
torch: 2.10.0+cu128
transformers: 5.6.2
accelerate: 1.13.0
peft: 0.19.1
bitsandbytes: 0.49.2
huggingface_hub: 1.12.0
safetensors: 0.7.0
torchao_importable: false
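
For reproducibility, a small version probe can regenerate this table. This is a sketch, not the exact script from the run:

```python
import importlib.metadata
import importlib.util
import platform

print("Python:", platform.python_version())
for pkg in ["torch", "transformers", "accelerate", "peft",
            "bitsandbytes", "huggingface_hub", "safetensors"]:
    try:
        print(f"{pkg}: {importlib.metadata.version(pkg)}")
    except importlib.metadata.PackageNotFoundError:
        print(f"{pkg}: not installed")
print("torchao_importable:", importlib.util.find_spec("torchao") is not None)
```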

Model and adapter

Base model:
unsloth/gemma-4-E2B-it

Adapter:
Ayodele01/gemma-4-E2B-Gemini-3.1-Pro-Reasoning-Distill

Adapter probe:

adapter_model.safetensors keys: 786

audio: 72
language_like: 490
vision_or_patch: 224

This adapter is genuinely multimodal, so simply excluding vision/audio LoRA would change the adapter semantics.
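
For reference, a minimal sketch of how such a key probe could be done; the classification substrings are assumptions based on the module paths shown later:

```python
from safetensors import safe_open

counts = {"audio": 0, "vision_or_patch": 0, "language_like": 0}
with safe_open("adapter_model.safetensors", framework="pt") as f:
    for key in f.keys():
        if ".audio_tower." in key:
            counts["audio"] += 1
        elif ".vision_tower." in key or "patch" in key:
            counts["vision_or_patch"] += 1
        else:
            counts["language_like"] += 1
print(sum(counts.values()), counts)  # expect 786 total keys for this adapter
```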

Why Transformers first

Transformers is still the best first repo because this issue crosses:

  • Gemma 4 model integration;
  • bitsandbytes quantization integration;
  • PEFT adapter integration expectations;
  • device-map loading behavior;
  • Accelerate redispatch/hook behavior;
  • current _is_hf_initialized parameter reconstruction behavior.

Accelerate owns dispatch_model() and hook attachment. bitsandbytes owns Params4bit. PEFT triggers the adapter-loading path. But the user-facing break is in the Transformers + PEFT + bnb integration path.


Issue 2 — Supporting / separate if requested: PEFT target-module compatibility for Gemma4ClippableLinear

File at: huggingface/peft

Suggested title

Gemma4 multimodal LoRA adapters need inner .linear targeting or Gemma4ClippableLinear support

Why it is supporting, not primary

Before patching the local adapter config, PEFT fails earlier:

ValueError: Target module Gemma4ClippableLinear(...) is not supported.

However, the latest repro bypasses that blocker by targeting the inner .linear modules for Gemma 4 vision/audio towers. With that patch:

R05 all-GPU 4-bit:
PASS

So this is real, but it is not the primary split/offload failure.


Issue 3 — Deferred / separate only if you have the exact trace: PEFT device_map breaks Gemma 4 shared-KV generation

File first at: huggingface/transformers
Mention/cross-link: huggingface/peft, huggingface/accelerate

Suggested title

Gemma4 + PeftModel.from_pretrained(device_map=...) breaks shared_kv_states during generate

Why this is deferred

The latest Colab E2B matrix did not reproduce this failure:

R05 all-GPU, no PEFT device_map:
PASS

R03 split CPU/GPU, no PEFT device_map:
FAIL_BNB_PARAM_CONSTRUCTOR before generation

Only file the shared-KV issue if you attach a separate trace where adapter loading succeeds and generation fails here:

Gemma4Attention.forward
→ shared_kv_states[self.kv_shared_layer_index]
→ KeyError: 22

Optional Issue 4 — Docs / UX: PEFT offload-dir / offload-folder handling is confusing

File at: huggingface/peft

Suggested title

Clarify offload_dir/offload_folder behavior for PeftModel.from_pretrained on already-dispatched models

Why it is lower priority

This is useful, but it is not the core bug, which already has a concrete split/offload Params4bit constructor trace.


What I would not file

Not this

CPU offloading is broken.

Too broad. The base model can load in split CPU/GPU form. The failure occurs during PEFT adapter loading on top of the already-dispatched quantized base model.

Better:

PeftModel.from_pretrained fails on an already CPU/GPU-dispatched bitsandbytes 4-bit Gemma 4 model during Accelerate hook setup.

Not this

PEFT expects vision/audio towers to be on GPU.

The all-GPU path works, and the split path fails during bnb/Accelerate parameter reconstruction. The evidence points to split/offload redispatch, not a generic PEFT requirement that vision/audio must be GPU-resident.

Not this as the main issue

Tensor.item() cannot be called on meta tensors

That may be a related double-quant / QuantState.as_dict() variant, but the latest controlled Colab matrix with double_quant=False lands on:

Params4bit.__new__() got an unexpected keyword argument '_is_hf_initialized'

Mention the nested_offset.item() issue only if you attach that exact separate trace.


Recommended filing plan

Best plan

Open one primary Transformers issue:

PeftModel.from_pretrained fails only on CPU/GPU-dispatched 4-bit Gemma4: Params4bit.__new__ got unexpected _is_hf_initialized

Include the Gemma4ClippableLinear patch as a “repro setup note,” not as the headline.

Then say:

I can split the Gemma4ClippableLinear target-module compatibility issue into a PEFT issue if maintainers prefer.

If you want the cleanest tracking

Open two separate issues:

  1. Transformers Issue A: split/offload bnb 4-bit + PEFT + Accelerate Params4bit.__new__(_is_hf_initialized) failure.
  2. PEFT Issue B: Gemma 4 multimodal LoRA adapters and Gemma4ClippableLinear target-module compatibility.

Do not open the shared-KV issue from the latest Colab matrix alone, because it was not reproduced there.


Key evidence to include

Include this exact contrast:

Works:
device_map = {"": 0}

Fails:
device_map = {
    "model.vision_tower": "cpu",
    "model.audio_tower": "cpu",
    "": 0,
}

Include this exact result pair:

R05_ALL_GPU_4BIT:
PASS
generate output shape: (1, 8)
CUDA after generate: allocated 6.507 GiB, reserved 6.693 GiB, free 7.742 GiB

R03_SPLIT_4BIT:
FAIL_BNB_PARAM_CONSTRUCTOR
TypeError: Params4bit.__new__() got an unexpected keyword argument '_is_hf_initialized'

Include the adapter probe:

Ayodele01/gemma-4-E2B-Gemini-3.1-Pro-Reasoning-Distill

audio: 72
language_like: 490
vision_or_patch: 224

Bottom line

The actual issue to raise now is:

  1. Primary bug: PeftModel.from_pretrained() fails on a CPU/GPU-dispatched bitsandbytes 4-bit Gemma 4 base model during Accelerate hook setup, while the same patched multimodal adapter works all-GPU.

  2. Supporting bug: vanilla PEFT currently trips over Gemma 4 multimodal tower Gemma4ClippableLinear wrappers unless the adapter targets inner .linear modules or PEFT supports the wrapper.

  3. Deferred bug: passing device_map into PEFT may break Gemma 4 shared-KV generation, but this needs its own exact trace and should not be merged into the latest Colab E2B primary issue.


Below are ready-to-paste GitHub sections. If you quote a body inside another fenced block, wrap it in an outer 4-backtick fence so the inner 3-backtick code fences remain intact when copied.


Issue 1

Target repo

huggingface/transformers

Suggested title

PeftModel.from_pretrained fails only on CPU/GPU-dispatched 4-bit Gemma4: Params4bit.__new__ got unexpected _is_hf_initialized

Suggested labels

bug, Gemma4, PEFT, Accelerate, bitsandbytes, quantization, device_map, cpu-offload

Body

### System Info

- Runtime: Colab Free / Tesla T4
- Python: 3.12.13
- torch: 2.10.0+cu128
- transformers: 5.6.2
- accelerate: 1.13.0
- peft: 0.19.1
- bitsandbytes: 0.49.2
- huggingface_hub: 1.12.0
- safetensors: 0.7.0
- torchao: not importable / removed from environment
- model: `unsloth/gemma-4-E2B-it`
- adapter: `Ayodele01/gemma-4-E2B-Gemini-3.1-Pro-Reasoning-Distill`
- quantization: bitsandbytes 4-bit NF4
- attention implementation: `sdpa`
- trust_remote_code: `False`

### Summary

I can load `unsloth/gemma-4-E2B-it` in bitsandbytes 4-bit, load a patched multimodal LoRA adapter with PEFT, and run a tiny `generate()` smoke test when the whole model is placed on GPU:

```python
device_map = {"": 0}
```

However, the same model/adapter path fails when the base model is loaded with a split CPU/GPU `device_map`:

```python
device_map = {
    "model.vision_tower": "cpu",
    "model.audio_tower": "cpu",
    "": 0,
}
```

The failure happens during `PeftModel.from_pretrained()`, after PEFT starts adapter loading and calls into Accelerate `dispatch_model()` / hook attachment. The failing path reconstructs a bitsandbytes `Params4bit` object and passes `_is_hf_initialized` into its constructor:

```text
TypeError: Params4bit.__new__() got an unexpected keyword argument '_is_hf_initialized'
```

The contrast is important:

```text
R05 all-GPU 4-bit:
PASS

R03 split CPU/GPU 4-bit:
FAIL_BNB_PARAM_CONSTRUCTOR
```

This suggests the issue is not PEFT + bitsandbytes 4-bit in general. It appears specific to loading a PEFT adapter on top of an already CPU/GPU-dispatched bitsandbytes 4-bit Gemma 4 model, where PEFT triggers an additional Accelerate dispatch/hook path.
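
One quick way to confirm the base model is already dispatched before PEFT runs (a diagnostic sketch; `hf_device_map` is the attribute Transformers sets when loading with a `device_map`):

```python
# Inspect the device map recorded on the loaded base model (see Case B below).
print(getattr(base_model, "hf_device_map", None))
# For the split configuration this should include the CPU entries for the towers.
```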

### Why the adapter config is patched in this repro

The adapter is a real multimodal LoRA adapter. Its safetensors keys include:

```text
audio: 72
language_like: 490
vision_or_patch: 224
```

Without the patch, PEFT fails earlier with:

```text
ValueError: Target module Gemma4ClippableLinear(...) is not supported.
```

That is because Gemma 4 vision/audio towers expose wrapper modules such as:

```text
model.vision_tower.encoder.layers.0.self_attn.q_proj
→ Gemma4ClippableLinear

model.vision_tower.encoder.layers.0.self_attn.q_proj.linear
→ Linear / Linear4bit
```
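
One way to confirm this wrapper/inner structure on a loaded model (a diagnostic sketch against the `base_model` from the cases below; only the module paths above are from the actual run):

```python
# Print the wrapper class and its inner .linear class for one vision-tower q_proj.
for name, module in base_model.named_modules():
    if "vision_tower" in name and name.endswith("self_attn.q_proj"):
        inner = getattr(module, "linear", None)
        print(name, type(module).__name__,
              type(inner).__name__ if inner is not None else "no .linear")
        break  # one example is enough
```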

The adapter weights use inner `.linear` paths for multimodal towers, for example:

```text
base_model.model.model.audio_tower.layers.0.self_attn.k_proj.linear.lora_A.weight
base_model.model.model.vision_tower.encoder.layers.0.self_attn.q_proj.linear.lora_A.weight
```

So the repro patches only the local copy of `adapter_config.json` to target inner `.linear` modules for Gemma 4 multimodal towers while leaving language tower targets at the usual projection modules.

Patched target expression:

```text
.*(?:model\.language_model\.layers\.\d+\.(?:self_attn|mlp)\.(?:q_proj|k_proj|v_proj|o_proj|gate_proj|up_proj|down_proj)|model\.vision_tower\.encoder\.layers\.\d+\.(?:self_attn|mlp)\.(?:q_proj|k_proj|v_proj|o_proj|gate_proj|up_proj|down_proj)\.linear|model\.audio_tower\.layers\.\d+\.self_attn\.(?:q_proj|k_proj|v_proj)\.linear)$
```
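
A minimal sketch of applying this patch to the local adapter copy. The directory path matches this repro; the script itself is an assumption, relying on PEFT treating a string `target_modules` as a regex:

```python
import json

ADAPTER_DIR = "/content/patched_gemma4_multimodal_inner_linear_adapter_v4"
PATCHED_TARGETS = r".*(?:model\.language_model\.layers\.\d+\.(?:self_attn|mlp)\.(?:q_proj|k_proj|v_proj|o_proj|gate_proj|up_proj|down_proj)|model\.vision_tower\.encoder\.layers\.\d+\.(?:self_attn|mlp)\.(?:q_proj|k_proj|v_proj|o_proj|gate_proj|up_proj|down_proj)\.linear|model\.audio_tower\.layers\.\d+\.self_attn\.(?:q_proj|k_proj|v_proj)\.linear)$"

cfg_path = f"{ADAPTER_DIR}/adapter_config.json"
with open(cfg_path) as f:
    cfg = json.load(f)

cfg["target_modules"] = PATCHED_TARGETS  # the regex shown above
with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)
```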

After this patch, the all-GPU case works, which confirms the Gemma4ClippableLinear target-module issue is bypassed for this repro.

### Quantization config

```python
from transformers import BitsAndBytesConfig
import torch

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=False,
    bnb_4bit_compute_dtype=torch.bfloat16,
    llm_int8_enable_fp32_cpu_offload=True,
)
```

### Case A — all-GPU path works

```python
from transformers import AutoTokenizer, Gemma4ForConditionalGeneration
from peft import PeftModel

device_map = {"": 0}
max_memory = {0: "14GiB"}

base_model = Gemma4ForConditionalGeneration.from_pretrained(
    "unsloth/gemma-4-E2B-it",
    quantization_config=quant_config,
    device_map=device_map,
    max_memory=max_memory,
    dtype=torch.bfloat16,
    attn_implementation="sdpa",
    trust_remote_code=False,
    low_cpu_mem_usage=True,
)

model = PeftModel.from_pretrained(
    base_model,
    "/content/patched_gemma4_multimodal_inner_linear_adapter_v4",
    is_trainable=False,
)

# Tiny text-only smoke test; the exact prompt from the run is not shown,
# so this is a stand-in (output shape varies with prompt length).
tokenizer = AutoTokenizer.from_pretrained("unsloth/gemma-4-E2B-it")
inputs = tokenizer("Hello", return_tensors="pt").to("cuda:0")
out = model.generate(**inputs, max_new_tokens=4, do_sample=False, use_cache=True)
```

Observed result:

```text
R05_ALL_GPU_4BIT: PASS
generate output shape: (1, 8)

CUDA after generate:
free_gib: 7.742
total_gib: 14.563
allocated_gib: 6.507
reserved_gib: 6.693
```
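
The memory figures above come from standard torch CUDA queries; a sketch of how they can be reproduced (whether the run used exactly this snippet is an assumption):

```python
import torch

free_b, total_b = torch.cuda.mem_get_info(0)
gib = 2**30
print(f"free_gib: {free_b / gib:.3f}")
print(f"total_gib: {total_b / gib:.3f}")
print(f"allocated_gib: {torch.cuda.memory_allocated(0) / gib:.3f}")
print(f"reserved_gib: {torch.cuda.memory_reserved(0) / gib:.3f}")
```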

### Case B — split CPU/GPU path fails

```python
# Imports and quant_config are the same as in Case A.
device_map = {
    "model.vision_tower": "cpu",
    "model.audio_tower": "cpu",
    "": 0,
}
max_memory = {0: "13GiB", "cpu": "10GiB"}

base_model = Gemma4ForConditionalGeneration.from_pretrained(
    "unsloth/gemma-4-E2B-it",
    quantization_config=quant_config,
    device_map=device_map,
    max_memory=max_memory,
    offload_folder="/content/gemma4_offload_v4",
    dtype=torch.bfloat16,
    attn_implementation="sdpa",
    trust_remote_code=False,
    low_cpu_mem_usage=True,
)

model = PeftModel.from_pretrained(
    base_model,
    "/content/patched_gemma4_multimodal_inner_linear_adapter_v4",
    is_trainable=False,
    offload_dir="/content/gemma4_offload_v4",
    offload_buffers=True,
    ephemeral_gpu_offload=True,
    torch_device="cuda:0",
)
```

Observed result:

```text
R03_SPLIT_4BIT: FAIL_BNB_PARAM_CONSTRUCTOR

TypeError: Params4bit.__new__() got an unexpected keyword argument '_is_hf_initialized'
```

Trace tail:

```text
File ".../peft/peft_model.py", line 582, in from_pretrained
    load_result = model.load_adapter(

File ".../peft/peft_model.py", line 1475, in load_adapter
    dispatch_model(

File ".../accelerate/big_modeling.py", line 432, in dispatch_model
    attach_align_device_hook_on_blocks(

File ".../accelerate/hooks.py", line 540, in attach_align_device_hook
    add_hook_to_module(module, hook, append=True)

File ".../accelerate/hooks.py", line 183, in add_hook_to_module
    module = hook.init_hook(module)

File ".../accelerate/hooks.py", line 330, in init_hook
    set_module_tensor_to_device(module, name, "meta")

File ".../accelerate/utils/modeling.py", line 363, in set_module_tensor_to_device
    new_value = param_cls(new_value, requires_grad=old_value.requires_grad, **kwargs).to(

TypeError: Params4bit.__new__() got an unexpected keyword argument '_is_hf_initialized'
```

### Expected behavior

One of the following:

1. `PeftModel.from_pretrained()` should support adapter loading on an already CPU/GPU-dispatched bitsandbytes 4-bit model without reconstructing `Params4bit` with unsupported kwargs.
2. PEFT should avoid redispatching an already-dispatched quantized base model, or should do so without passing `_is_hf_initialized` to bitsandbytes constructors.
3. Accelerate `set_module_tensor_to_device()` should avoid forwarding HF-internal parameter attributes into `bitsandbytes.Params4bit.__new__()` if that constructor does not accept them (sketched below).
4. bitsandbytes `Params4bit.__new__()` should accept or ignore `_is_hf_initialized` if this attribute is expected in the HF integration path.
5. If this configuration is unsupported, the error should be raised early with a clear message.
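
As an illustration of option 3, a hedged sketch of filtering unknown kwargs before parameter reconstruction; this is not Accelerate's actual code, just one possible shape of the fix:

```python
import inspect

def filter_ctor_kwargs(param_cls, kwargs):
    """Keep only kwargs that param_cls.__new__ explicitly accepts (sketch only)."""
    accepted = inspect.signature(param_cls.__new__).parameters
    return {k: v for k, v in kwargs.items() if k in accepted}

# Hypothetical use at the failing site in set_module_tensor_to_device:
# new_value = param_cls(new_value, requires_grad=old_value.requires_grad,
#                       **filter_ctor_kwargs(param_cls, kwargs)).to(device)
```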

### Actual behavior

- All-GPU 4-bit base + patched multimodal PEFT adapter: works.
- Split CPU/GPU 4-bit base + the same patched multimodal PEFT adapter: fails during PEFT adapter loading.
- The failure occurs before generation.
- The failure occurs in the Accelerate dispatch/hook path when reconstructing a bitsandbytes `Params4bit` parameter on the meta-device path.

### Why this seems split/offload-specific

The same environment, same model, same adapter, same adapter patch, same quantization config, and same PEFT call pattern work when the model is all-GPU:

```python
device_map = {"": 0}
```

The failure appears only with:

```python
device_map = {
    "model.vision_tower": "cpu",
    "model.audio_tower": "cpu",
    "": 0,
}
```

Therefore, the issue seems tied to PEFT adapter loading on an already split/offloaded bitsandbytes 4-bit model, not to PEFT + bnb 4-bit generally.

### Related observations

#### 1. Gemma4ClippableLinear target-module blocker

Before applying the local adapter-config patch, both tested Gemma 4 E2B adapters failed earlier with:

```text
ValueError: Target module Gemma4ClippableLinear(...) is not supported.
```

That appears to be the same class of issue as the public PEFT `Gemma4ClippableLinear` support issue. The current repro works around that by targeting the inner `.linear` modules for vision/audio towers.

#### 2. Related `_is_hf_initialized` issue family

There is an existing issue for a similar error family with `Int8Params`:

```text
TypeError: Int8Params.__new__() got an unexpected keyword argument '_is_hf_initialized'
```

This repro appears to be the `Params4bit` variant of that family, reached through PEFT adapter loading and Accelerate hook setup on a split-device model.
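
A quick way to check which keyword arguments the installed bitsandbytes parameter classes actually accept (a diagnostic sketch):

```python
import inspect
from bitsandbytes.nn import Int8Params, Params4bit

for cls in (Params4bit, Int8Params):
    print(cls.__name__, "accepts:", list(inspect.signature(cls.__new__).parameters))
# Per the TypeErrors above, `_is_hf_initialized` is evidently not among them
# on the failing versions.
```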

### Questions

1. Is `PeftModel.from_pretrained()` expected to support adapter loading on an already CPU/GPU-dispatched bitsandbytes 4-bit model?
2. Should PEFT skip redispatch when the base model already has an `hf_device_map`?
3. Should Accelerate filter `_is_hf_initialized` before reconstructing bitsandbytes parameter classes?
4. Should bitsandbytes `Params4bit` accept or ignore `_is_hf_initialized`?
5. Is the recommended workaround for Gemma 4 E2B on T4 to keep the whole 4-bit model on GPU rather than offloading vision/audio towers to CPU?

### Relevant links

```text
Accelerate big model dispatch docs:
https://huggingface.co/docs/accelerate/package_reference/big_modeling

PEFT PeftModel docs:
https://huggingface.co/docs/peft/package_reference/peft_model

Transformers PEFT integration docs:
https://huggingface.co/docs/transformers/en/peft

Transformers bitsandbytes docs:
https://huggingface.co/docs/transformers/quantization/bitsandbytes

PEFT Gemma4ClippableLinear issue:
https://github.com/huggingface/peft/issues/3129

Related _is_hf_initialized issue:
https://github.com/huggingface/transformers/issues/43872
```


Optional Issue 2

Target repo

huggingface/peft

Suggested title

Gemma4 multimodal LoRA adapters need inner .linear targeting or Gemma4ClippableLinear support

Body

There is also a Gemma 4 target-module compatibility issue before this failure. Without the local `target_modules` patch, PEFT tries to inject LoRA into outer `Gemma4ClippableLinear` wrappers and fails with:

```text
ValueError: Target module Gemma4ClippableLinear(...) is not supported.
```

The adapter is genuinely multimodal, so simply excluding vision/audio targets would change the adapter semantics. The repro instead patches the local adapter config to target the inner `.linear` modules in Gemma 4 vision/audio towers.

With that patch, the all-GPU case passes, but the CPU/GPU-dispatched case still fails with:

```text
Params4bit.__new__() got an unexpected keyword argument '_is_hf_initialized'
```


My recommendation

Open Issue 1 first. It is now much cleaner than the original draft because the latest evidence isolates the trigger:

  • all-GPU 4-bit path works;
  • split/offloaded 4-bit path fails during PEFT adapter loading.

Do not open the shared-KV issue from this latest Colab run alone, because this run did not reach that failure path.