LoRA Training

Background:
I plan to train a character LoRA (Annie) for the wan2.2 video model, intended for animation production in a realistic 3D style.
I have never trained a LoRA before, but today I successfully deployed DiffSynth-Studio and completed training using one of the official example projects.
Now I would like to officially begin my character LoRA training workflow, and I still have many questions regarding dataset construction and captioning.
My intended usage of the Annie LoRA is something like:
“Annie is reading by a stream, while another character is fishing nearby.”
My goal is:
Annie’s appearance remains correct and consistent
Other characters do NOT inherit Annie’s appearance

  1. Training Dataset — Image-Related Questions
    1.1 Do training samples require close-up facial images?
    1.2 Do training samples require upper-body shots?
    1.3 Do training samples require full-body images?
    1.4 Should the dataset include various facial expressions (crying, smiling, angry, etc.)?
    1.5 Are back-view images required? (I can provide them)
    1.6 Are full 360-degree angle images (top, bottom, left, right) required? (I can provide them)
    1.7 Should the dataset include various poses (squatting, sitting, standing, running, jumping, etc.)? (I can provide them)
    1.8 Should the dataset include different outfits (styles, colors, etc.)? (I can provide them)
    1.9 Should the dataset include different hairstyles (long, short, various styles)? (I can provide them)
    1.10 Should the dataset include different solid-color backgrounds (pure white, gray, black, etc.)? (I can provide them)
    1.11 Should hats be avoided in the training dataset?
    1.12 Are there any other important image-related recommendations?
  2. Training Dataset — Caption / Description Questions
    2.1 Should the character name (Annie) be placed at the beginning of each caption?
    2.2 Should facial features (eyes, mouth, nose, face shape, etc.) be described?
    2.3 Should camera distance and angles (close-up, wide shot, left, right, top-down, etc.) be described?
    2.4 Should facial expressions be described?
    2.5 Should poses or actions be described?
    2.6 Should clothing details (style, color, etc.) be described?
    2.7 If the hairstyle is fixed, should it still be described?
    2.8 If the hairstyle is not fixed, should it be described?
    2.9 If hats appear, should their presence be explicitly described?
    2.10 Should the background (solid color or scene) be described?
    2.11 Should the character style (realistic 3D) be explicitly stated?
    2.12 Should gender be described?
    2.13 Should age be described?
    2.14 Should body type be described?
    2.15 Are there any additional important captioning recommendations?
    2.16 To prevent other characters from inheriting Annie’s appearance, should captions emphasize Annie’s unique features?
    (e.g., “Only Annie has red hair”)
  3. Training Dataset — Video Sample Questions
    3.1 Are video samples required (e.g., a 360-degree rotation video of Annie)?
    3.2 Are video samples required (e.g., a slow zoom-in / zoom-out shot of Annie)?
    I would greatly appreciate any clarification — even answering just one of these questions would be extremely helpful. 🙏

A LoRA fundamentally affects the model globally, so some of this bleed is unavoidable.

That said, ComfyUI reportedly has ways to restrict a LoRA’s effect somewhat locally, but I’m not a ComfyUI user, so I’m not familiar with the details.

The following is a general discussion.


You can reach your two goals, but only if you treat “Annie identity” as a separate concept from (a) style, (b) clothing, (c) background, and (d) other people.

Two facts set the constraints:

  • LoRA effects are global. In multi-person scenes, identity can “bleed” onto other faces. This is widely reported as “facial contamination” in group shots. (GitHub)
  • Wan2.2 is a two-expert MoE pipeline: a high-noise expert and a low-noise expert, split across denoising timesteps. You typically need the LoRA to be applied to the matching expert and used in the correct stage. (ComfyUI Wiki)

Below is a practical answer to every numbered question, plus a workflow that specifically targets “Annie stays consistent” and “others do not become Annie.”


0) The cleanest strategy for your case

Train “identity-first,” then add optional extras later

For a production character, the safest approach is:

  1. Identity LoRA: face, hair silhouette, body proportions.
  2. Optional later: Outfit LoRA, hairstyle variant LoRA, or style LoRA.

This reduces accidental entanglement like “Annie = red jacket + stream background + female face,” which is the exact kind of coupling that makes other characters inherit Annie’s appearance.

Use a unique trigger token

Do not rely on the common name “Annie” as your only activation word.

Instead, use a rare token like ann13_chr and always prompt:

  • ann13_chr woman ...

DreamBooth-style personalization literature explicitly ties a subject to a unique identifier token and warns about “language drift” without countermeasures. (arXiv)


1) Training Dataset — Image-related questions (1.1–1.12)

1.1 Close-up facial images?

Not strictly required, but strongly recommended.
If you want identity consistency across shots, you need enough high-resolution face information. Many practitioner guides for character LoRAs recommend including a meaningful chunk of close-ups for this reason. (Reddit)

Suggested share: 20–35% close-ups.


1.2 Upper-body shots?

Recommended.
Upper-body shots connect face identity to hair framing, shoulders, and torso proportions. They also help when your real prompts are not always tight portraits.

Suggested share: 30–45% upper-body.


1.3 Full-body images?

Recommended if you will ever show full-body Annie in video.
Without full-body training, the model often “borrows” a generic body from the base model, which can mismatch Annie’s identity.

Suggested share: 20–35% full-body.


1.4 Various facial expressions?

Recommended, but keep it controlled.
Expressions are where identity often collapses. Include a small set that matches your production needs:

  • neutral, smiling, talking/open-mouth, angry/frown, surprised

You do not need “every emotion.” You need “identity survives deformation.”


1.5 Back-view images required?

Not required. Useful if your animation includes back shots or turns.
Back/over-shoulder shots teach hair mass and silhouette from behind.

Suggested share: 5–10%.


1.6 Full 360-degree angles required?

Not required.
What matters is coverage of likely camera angles, not completeness:

  • front, 3/4 left, 3/4 right, profiles, back 3/4

Top-down and extreme low-angle views only matter if you will use them.

1.7 Various poses required?

Include the poses you will prompt.
If you want “reading by a stream,” include:

  • sitting, kneeling, standing, walking
  • “holding book,” “looking down,” “turning head”

Include running/jumping only if you plan to use them; otherwise you add variance that can weaken identity learning.


1.8 Different outfits required?

Depends on what you want the LoRA to represent.

  • If you want “Annie identity regardless of outfit,” include multiple outfits and caption them.
  • If you want “Annie in canonical outfit,” keep outfit mostly fixed.

A common production approach: train the identity LoRA on a mostly consistent outfit, then add separate outfit LoRAs later.


1.9 Different hairstyles required?

Same logic as outfits, but stricter.

Hair is one of the strongest identity anchors. If you vary hair too much in the identity LoRA, you can weaken the “core Annie” anchor.

  • Fixed hair in production: keep it fixed in training.
  • Variable hair needed: include variants and caption them clearly.

1.10 Different solid-color backgrounds required?

Not required. Use sparingly.
A few solid backgrounds help subject isolation, but too many can bias outputs into studio-like compositions.

Suggested: 0–20% simple backgrounds, 80–100% real scenes similar to your target.


1.11 Hats avoided?

Yes, unless hats are part of Annie’s identity.
Hats are high-salience and easily get fused into the concept. If you must include hats, keep them rare and caption “hat” explicitly.


1.12 Other key image recommendations

These matter more than “more angles.”

  1. Train Annie alone in most images
    Multi-person training images increase the risk of identity entanglement. The “group photo contamination” failure is real and common. (GitHub)

  2. Avoid near-duplicates (see the dedup sketch after this list)
    Ten frames from the same camera move teach less than ten distinct viewpoints.

  3. Match your aspect ratio plan
    Wan2.2 video workflows are sensitive to resolution and stage matching. If your training crops are inconsistent, you can get warped results or poor generalization.

  4. Keep style consistent for the identity LoRA
    If your goal is “realistic 3D,” don’t mix wildly different rendering styles inside the same identity dataset unless you want “style drift.”
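
On point 2 (near-duplicates), a quick perceptual-hash pass usually catches repeated or nearly identical frames. A minimal sketch, assuming the third-party imagehash and Pillow packages and a hypothetical annie_dataset/images folder; the distance threshold is illustrative:

```python
# Sketch: flag near-duplicate training images with a perceptual hash.
from pathlib import Path
from PIL import Image
import imagehash

DATASET_DIR = Path("annie_dataset/images")  # hypothetical layout
THRESHOLD = 6                               # Hamming distance treated as "near-duplicate"

seen = {}
for path in sorted(DATASET_DIR.glob("*.png")):
    h = imagehash.phash(Image.open(path))
    for other_name, other_hash in seen.items():
        if h - other_hash <= THRESHOLD:     # imagehash overloads "-" as Hamming distance
            print(f"possible duplicate: {path.name} ~ {other_name}")
    seen[path.name] = h
```

Inspect flagged pairs by eye before deleting anything; a lower threshold catches only very close matches.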


2) Training Dataset — Caption questions (2.1–2.16)

Background: what captions are for

Captions decide what the model treats as:

  • part of “Annie identity”
  • vs controllable attributes (pose, camera, outfit)
  • vs unrelated background noise

Over-captioning can make the LoRA “require” too many tokens. Under-captioning can entangle things unpredictably.

2.1 Put “Annie” at the beginning?

Use a unique trigger token. Put it early and keep it consistent.
DreamBooth-style practice is “identifier token + class word” to bind the subject while preserving the broader class concept. (arXiv)

Example:

  • ann13_chr woman, upper body, three-quarter view, smiling, reading, riverbank

2.2 Describe facial features?

Usually no.
Pixels teach geometry. If you spell out “small nose, big eyes,” you may:

  • accidentally make those tokens globally stronger
  • increase leakage onto other characters when those tokens appear

Describe only a small number of high-signal traits if you truly need them (often hair color and maybe eye color).


2.3 Describe camera distance and angles?

Yes. High value.
Use a small fixed vocabulary:

  • close-up, upper body, full body, wide shot
  • front view, three-quarter view, profile, back view

This helps control composition and reduces identity drift across distance.


2.4 Describe expressions?

Yes, if expressions vary in the dataset.
Keep vocabulary small: neutral, smiling, angry, surprised, crying.


2.5 Describe poses/actions?

Yes, if they vary and you want control.
Especially for your target prompt: include reading, sitting, holding a book, etc.


2.6 Describe clothing details?

  • If outfits vary: yes (short, consistent descriptors).
  • If the outfit is fixed and you want it baked in: you can omit it, but then the outfit becomes part of "Annie."

2.7 Hair fixed. Still describe it?

Optional. Two valid choices:

  • Omit hair: hair becomes part of Annie concept implicitly.
  • Always include hair tag: stronger anchor, but reduces ability to generate Annie without that hair.

2.8 Hair not fixed. Describe it?

Yes.
Otherwise the model “averages” hairstyles, creating instability.


2.9 Hats present. Describe them?

Yes.
Otherwise the hat can fuse into Annie's identity, and you will not be able to prompt it away.


2.10 Describe background?

Lightly, yes.
Use coarse scene tags: outdoors, forest, riverbank, indoors. Avoid long object lists.


2.11 State “realistic 3D” explicitly?

Usually no if constant in your dataset.
If every image is the same realistic 3D pipeline, the model learns it without needing a style token. If you later want style independence, keep style out of the identity LoRA and train style separately.


2.12 Gender described?

Use one class word consistently (for example woman).
This matches the “identifier + class” idea used to preserve class behavior. (arXiv)

Avoid multiple synonyms across captions (girl, female, lady) because inconsistent tokens broaden the concept.


2.13 Age described?

Only if it matters and you want it locked. Otherwise omit.


2.14 Body type described?

Only if it is essential to identity and you want it preserved. Otherwise omit.


2.15 Additional captioning recommendations

  1. Controlled vocabulary
    Pick one term per concept and reuse it (a caption linter sketch follows this list).

  2. Caption what varies
    If outfits vary, caption outfits. If angles vary, caption angles.

  3. Avoid describing other characters
    Because the best dataset for your goal contains mostly Annie alone.
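
To make point 1 (controlled vocabulary) enforceable, a small linter over the caption files helps. A sketch, assuming one .txt caption per image, the hypothetical trigger token ann13_chr, and example vocabulary lists taken from 2.1–2.3; the paths are placeholders:

```python
# Sketch of a caption linter: trigger token first, one class word, controlled shot/angle tags.
from pathlib import Path

TRIGGER = "ann13_chr"
CLASS_SYNONYMS = ("woman", "girl", "female", "lady")   # only the first one should ever appear
SHOT_TAGS = {"close-up", "upper body", "full body", "wide shot"}
ANGLE_TAGS = {"front view", "three-quarter view", "profile", "back view"}

for caption_file in sorted(Path("annie_dataset/captions").glob("*.txt")):
    text = caption_file.read_text(encoding="utf-8").strip().lower()
    tags = [t.strip() for t in text.split(",")]
    problems = []
    if not tags or not tags[0].startswith(TRIGGER):
        problems.append("does not start with the trigger token")
    if sum(word in text for word in CLASS_SYNONYMS) > 1:
        problems.append("mixes class-word synonyms")
    if not any(t in SHOT_TAGS for t in tags):
        problems.append("missing shot tag")
    if not any(t in ANGLE_TAGS for t in tags):
        problems.append("missing angle tag")
    if problems:
        print(f"{caption_file.name}: " + "; ".join(problems))
```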


2.16 Should captions emphasize “Only Annie has red hair” to prevent inheritance?

No. Do not do that.
It does not create a reliable “exclusivity rule.” It often increases coupling between “red hair” and “Annie,” which can worsen leakage when “red hair” appears elsewhere.

Use real anti-leakage levers instead:

Lever A: Prior preservation / regularization

DreamBooth’s prior preservation loss is explicitly meant to reduce “language drift” where the model forgets how to generate other members of the class. (arXiv)
In many LoRA trainers, this shows up as “regularization images” and a “prior loss weight.” Kohya’s parameter docs explain that prior loss weight controls how much regularization images matter. (GitHub)
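
For intuition, prior preservation just adds a second, weighted reconstruction term computed on the regularization (class) batch. A conceptual PyTorch sketch of that idea, not any specific trainer's API:

```python
# Conceptual sketch of DreamBooth-style prior preservation.
# "Instance" samples are Annie images captioned with the trigger token; "class" samples are
# generic-woman regularization images captioned without it.
import torch
import torch.nn.functional as F

def loss_with_prior_preservation(pred_instance: torch.Tensor, target_instance: torch.Tensor,
                                 pred_class: torch.Tensor, target_class: torch.Tensor,
                                 prior_loss_weight: float = 1.0) -> torch.Tensor:
    """pred_*/target_* are noise predictions and noise targets for the two sub-batches."""
    instance_loss = F.mse_loss(pred_instance, target_instance)   # learn Annie
    prior_loss = F.mse_loss(pred_class, target_class)            # keep the "woman" class generic
    # prior_loss_weight corresponds to the "prior loss weight" knob many trainers expose
    return instance_loss + prior_loss_weight * prior_loss
```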

Lever B: Early stopping with a leakage test

Stop when “generic woman” prompts begin to look like Annie. The group-shot contamination issue reports that typical tweaks alone may not eliminate bleed once drift has occurred. (GitHub)

Lever C: Keep training images single-subject

This reduces entanglement pressure from the start.


3) Video sample questions (3.1–3.2)

3.1 Are video samples required?

Not required.
You can build a strong identity LoRA with images alone.

But video clips can help:

  • temporal identity stability
  • hair and face consistency under motion

Wan training toolchains commonly support image and video datasets in one config, with captions from text files or JSONL. (Hugging Face)
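
As an illustration only, a mixed image + video metadata file can be produced with a few lines of Python. The field names ("file_name", "text") and file layout below are placeholders; match them to whatever schema your training toolchain documents.

```python
# Sketch: write a JSONL metadata file that mixes images and short clips.
import json

records = [
    {"file_name": "images/annie_0001.png",
     "text": "ann13_chr woman, close-up, front view, neutral, plain background"},
    {"file_name": "videos/annie_turn_0001.mp4",
     "text": "ann13_chr woman, upper body, slow turn, riverbank"},
]

with open("metadata.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```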


3.2 Turntable and zoom clips required?

Not required.
They are useful if your production includes these camera behaviors, but do not treat them as mandatory. If you add clips, keep them:

  • short
  • no cuts
  • slow, clean motion

Also pay attention to the frame rate assumptions in your dataset tooling. Some dataset configs recommend a specific FPS for Wan-family training (16 fps, for example, appears in some Wan dataset config docs). (Hugging Face)
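
A quick pre-flight check of clip frame rate and length is cheap to script. A sketch, assuming opencv-python is installed; TARGET_FPS and MAX_SECONDS are example values to adjust against your trainer's dataset documentation, and the folder path is hypothetical:

```python
# Sketch: check clip frame rate and duration before adding clips to the dataset.
from pathlib import Path
import cv2

TARGET_FPS = 16.0        # example value; see the FPS guidance mentioned above
MAX_SECONDS = 5.0        # keep clips short and cut-free

for clip in sorted(Path("annie_dataset/videos").glob("*.mp4")):
    cap = cv2.VideoCapture(str(clip))
    fps = cap.get(cv2.CAP_PROP_FPS)
    frames = cap.get(cv2.CAP_PROP_FRAME_COUNT)
    cap.release()
    seconds = frames / fps if fps else 0.0
    if not fps or abs(fps - TARGET_FPS) > 0.5 or seconds > MAX_SECONDS:
        print(f"{clip.name}: {fps:.1f} fps, {seconds:.1f}s -- consider re-encoding or trimming")
```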


4) Wan2.2-specific advice you should not ignore

4.1 High-noise vs low-noise and matching LoRAs

Wan2.2 uses high-noise and low-noise expert models split by timesteps. (ComfyUI Wiki)
ComfyUI documentation explicitly warns that high-noise and low-noise models and LoRAs must correspond. (ComfyUI Wiki)

Practical interpretation:

  • If you train only a low-noise LoRA, apply it only to the low-noise stage.
  • If you train both, keep them separate and load each into the correct stage.
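
Conceptually, the split is just a routing decision over denoising timesteps, which is why a LoRA trained for one expert should only be attached to that expert. A sketch of the idea with hypothetical names (pick_expert, load_lora, the expert objects), not a real Wan2.2 / ComfyUI / DiffSynth-Studio API:

```python
# Conceptual sketch only: the two experts are selected by denoising timestep,
# so a LoRA belongs to exactly one of them.
def pick_expert(timestep: int, boundary: int, high_noise_expert, low_noise_expert):
    # Early denoising steps (high noise, large timestep values) -> high-noise expert;
    # late steps (low noise) -> low-noise expert, where fine identity detail is resolved.
    return high_noise_expert if timestep >= boundary else low_noise_expert

# If you trained only a low-noise LoRA, attach it to the low-noise expert only, e.g.:
#   low_noise_expert = load_lora(low_noise_expert, "annie_low_noise_lora.safetensors")
# and leave the high-noise expert unmodified (or give it its own separately trained LoRA).
```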

4.2 Why I suggest “low-noise first” for character identity

The low-noise stage is where fine detail crystallizes. Identity (face specifics) tends to be most sensitive there. Many community Wan2.2 LoRA writeups explicitly train both high and low, but for a first character LoRA, "low-noise first" is the simplest way to debug identity without fighting composition effects. (GitHub)

4.3 Masking is not a perfect “no-bleed” tool

If you were considering masked training: OneTrainer explains why pixel masks are not exact after VAE latent transforms and why masked training does not work like a strict isolation boundary. (GitHub)
So treat masking as optional polish, not your main leak prevention method.


5) A concrete workflow that fits your prompt example

Step 1: Build a starter dataset (images)

  • 80–160 images of Annie alone
  • Balanced close-up / mid / full-body
  • Realistic 3D scenes similar to your target
  • A small amount of solid backgrounds if you want
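
To check the close-up / mid / full-body balance suggested in Step 1, you can count shot tags in the caption files. A sketch, assuming one .txt caption per image and the shot vocabulary from 2.3; the paths are hypothetical:

```python
# Sketch: audit the shot-type balance of the dataset from caption tags.
from collections import Counter
from pathlib import Path

SHOT_TAGS = ("close-up", "upper body", "full body", "wide shot")
counts = Counter()
total = 0

for caption_file in sorted(Path("annie_dataset/captions").glob("*.txt")):
    text = caption_file.read_text(encoding="utf-8").lower()
    total += 1
    for tag in SHOT_TAGS:
        if tag in text:
            counts[tag] += 1
            break
    else:
        counts["untagged"] += 1   # caption has no shot tag at all

if total:
    for tag, n in counts.most_common():
        print(f"{tag}: {n} ({100 * n / total:.0f}%)")
```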

Step 2: Caption template

Use one template and reuse the same vocabulary:

ann13_chr woman, [shot], [angle], [expression], [action/pose], [outfit if varies], [background coarse]
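
A tiny helper that fills in this template keeps the vocabulary consistent across the whole dataset. A sketch; the trigger token, class word, and tag choices are the examples used in this thread, not requirements:

```python
# Sketch: build captions from the fixed template and controlled vocabulary.
def build_caption(shot, angle, expression, action, outfit=None, background=None,
                  trigger="ann13_chr", class_word="woman"):
    parts = [f"{trigger} {class_word}", shot, angle, expression, action]
    if outfit:                      # only caption the outfit if outfits vary (see 2.6)
        parts.append(outfit)
    if background:                  # coarse scene tag only (see 2.10)
        parts.append(background)
    return ", ".join(parts)

print(build_caption("upper body", "three-quarter view", "smiling",
                    "sitting, reading a book", background="riverbank"))
# -> ann13_chr woman, upper body, three-quarter view, smiling, sitting, reading a book, riverbank
```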

Step 3: Train and checkpoint frequently

You are looking for the earliest checkpoint that passes the leakage test suite in Step 4.

Step 4: Leakage test suite (run it on every checkpoint)

  1. No-Annie prompt
    “A woman reading by a stream”
    Result must not resemble Annie.

  2. Annie prompt
    “ann13_chr woman reading by a stream”
    Result must resemble Annie.

  3. Two-person prompt
    “ann13_chr reading by a stream, another woman fishing nearby”
    The fishing character must not become Annie.

This exact multi-face contamination failure is widely reported. (GitHub)
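
If you script the suite, fixed prompts and fixed seeds make checkpoints directly comparable. A sketch in which generate_video is a placeholder for whatever inference call you actually use (a DiffSynth-Studio pipeline, a ComfyUI workflow, ...):

```python
# Sketch: run the three-prompt leakage suite against a checkpoint with fixed seeds.
from pathlib import Path

TEST_PROMPTS = {
    "no_annie":   "A woman reading by a stream",                                    # must NOT look like Annie
    "annie":      "ann13_chr woman reading by a stream",                            # must look like Annie
    "two_person": "ann13_chr reading by a stream, another woman fishing nearby",    # fisher must stay generic
}
SEEDS = (0, 1, 2)

def run_leakage_suite(checkpoint_path: str, generate_video, out_dir: str = "leak_tests"):
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for name, prompt in TEST_PROMPTS.items():
        for seed in SEEDS:
            video = generate_video(prompt=prompt, seed=seed, lora_path=checkpoint_path)
            out_path = Path(out_dir) / f"{Path(checkpoint_path).stem}_{name}_{seed}.mp4"
            video.save(str(out_path))  # placeholder: adapt saving to what your pipeline returns
```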

Step 5: If leakage appears

Apply the levers in this order:

  1. stop earlier (less training)
  2. strengthen dataset diversity in backgrounds/outfits/poses (reduce entanglement)
  3. add regularization images and tune prior loss weight if your trainer supports it (GitHub)


Summary

  • Use a unique trigger token and keep captions consistent. (arXiv)
  • Build an image dataset with balanced close-up, mid, and full-body coverage. (Reddit)
  • Train Annie mostly alone to reduce entanglement and bleed. (GitHub)
  • Consider prior preservation / regularization if generic prompts start drifting. (GitHub)
  • For Wan2.2, ensure high/low LoRAs match the correct expert stage. (ComfyUI Wiki)