LoRA fundamentally acts globally, so some of this is unavoidable.
That said, ComfyUI reportedly has ways to apply a LoRA somewhat locally, but I'm not a ComfyUI user, so I can't speak to the details.
The following is a general discussion.
You can reach your two goals, but only if you treat “Annie identity” as a separate concept from (a) style, (b) clothing, (c) background, and (d) other people.
Two facts set the constraints:
- LoRA effects are global. In multi-person scenes, identity can “bleed” onto other faces. This is widely reported as “facial contamination” in group shots. (GitHub)
- Wan2.2 is a two-expert MoE pipeline: a high-noise expert and a low-noise expert, split across denoising timesteps. You typically need the LoRA to be applied to the matching expert and used in the correct stage. (ComfyUI Wiki)
Below is a practical answer to every numbered question, plus a workflow that specifically targets “Annie stays consistent” and “others do not become Annie.”
0) The cleanest strategy for your case
Train “identity-first,” then add optional extras later
For a production character, the safest approach is:
- Identity LoRA: face, hair silhouette, body proportions.
- Optional later: Outfit LoRA, hairstyle variant LoRA, or style LoRA.
This reduces accidental entanglement like “Annie = red jacket + stream background + female face,” which is the exact kind of coupling that makes other characters inherit Annie.
Use a unique trigger token
Do not rely on the common name “Annie” as your only activation word.
Instead, use a rare token like ann13_chr, and always prompt it together with a class word, for example ann13_chr woman.
DreamBooth-style personalization literature explicitly ties a subject to a unique identifier token and warns about “language drift” without countermeasures. (arXiv)
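If your trainer reads per-image caption .txt files (kohya-style sidecars), prepending the trigger consistently is easy to script. A minimal sketch, assuming a hypothetical dataset/annie folder and the ann13_chr trigger:

```python
from pathlib import Path

TRIGGER = "ann13_chr woman"          # unique identifier + class word
DATASET_DIR = Path("dataset/annie")  # hypothetical folder of image/.txt caption pairs

for caption_file in DATASET_DIR.glob("*.txt"):
    text = caption_file.read_text(encoding="utf-8").strip()
    # Only prepend when the trigger is not already there, so reruns are safe.
    if not text.startswith(TRIGGER):
        caption_file.write_text(f"{TRIGGER}, {text}\n", encoding="utf-8")
```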
1) Training Dataset — Image-related questions (1.1–1.12)
1.1 Close-up facial images?
Not strictly required, but strongly recommended.
If you want identity consistency across shots, you need enough high-resolution face information. Many practitioner guides for character LoRAs recommend including a meaningful chunk of close-ups for this reason. (Reddit)
Suggested share: 20–35% close-ups.
1.2 Upper-body shots?
Recommended.
Upper-body shots connect face identity to hair framing, shoulders, and torso proportions. They also help when your real prompts are not always tight portraits.
Suggested share: 30–45% upper-body.
1.3 Full-body images?
Recommended if you will ever show full-body Annie in video.
Without full-body training, the model often “borrows” a generic body from the base model, which can mismatch Annie’s identity.
Suggested share: 20–35% full-body.
1.4 Various facial expressions?
Recommended, but keep it controlled.
Expressions are where identity often collapses. Include a small set that matches your production needs:
- neutral, smiling, talking/open-mouth, angry/frown, surprised
You do not need “every emotion.” You need “identity survives deformation.”
1.5 Back-view images required?
Not required. Useful if your animation includes back shots or turns.
Back/over-shoulder shots teach hair mass and silhouette from behind.
Suggested share: 5–10%.
1.6 Full 360-degree angles required?
Not required.
What matters is coverage of likely camera angles, not completeness:
- front, 3/4 left, 3/4 right, profiles, back 3/4
Top-down and extreme low-angle only matter if you will use them.
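If it helps to turn the shares suggested in 1.1–1.5 into a concrete shot list, here is a minimal sketch. The 120-image total and the exact percentages are illustrative picks within the suggested ranges, not requirements:

```python
# Rough shot-list planner using the shares suggested above.
TOTAL_IMAGES = 120  # arbitrary example size

shares = {
    "close-up":   0.30,  # 20-35% suggested
    "upper body": 0.40,  # 30-45% suggested
    "full body":  0.25,  # 20-35% suggested
    "back view":  0.05,  # 5-10% suggested
}

for shot, share in shares.items():
    print(f"{shot:>10}: ~{round(TOTAL_IMAGES * share)} images")
```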
1.7 Various poses required?
Include the poses you will prompt.
If you want “reading by a stream,” include:
- sitting, kneeling, standing, walking
- “holding book,” “looking down,” “turning head”
Include running/jumping only if you plan to use them; otherwise you add variance that can weaken identity learning.
1.8 Different outfits required?
Depends on what you want the LoRA to represent.
- If you want “Annie identity regardless of outfit,” include multiple outfits and caption them.
- If you want “Annie in canonical outfit,” keep outfit mostly fixed.
A common production approach: train the identity LoRA with a mostly consistent outfit, then add separate outfit LoRAs later.
1.9 Different hairstyles required?
Same logic as outfits, but stricter.
Hair is one of the strongest identity anchors. If you vary hair too much in the identity LoRA, you can weaken the “core Annie” anchor.
- Fixed hair in production: keep it fixed in training.
- Variable hair needed: include variants and caption them clearly.
1.10 Different solid-color backgrounds required?
Not required. Use sparingly.
A few solid backgrounds help subject isolation, but too many can bias outputs into studio-like compositions.
Suggested: 0–20% simple backgrounds, 80–100% real scenes similar to your target.
1.11 Hats avoided?
Yes, unless hats are part of Annie’s identity.
Hats are high-salience and easily get fused into the concept. If you must include hats, keep them rare and caption “hat” explicitly.
1.12 Other key image recommendations
These matter more than “more angles.”
- Train Annie alone in most images. Multi-person training images increase the risk of identity entanglement; the "group photo contamination" failure is real and common. (GitHub)
- Avoid near-duplicates. Ten frames from the same camera move teach less than ten distinct viewpoints.
- Match your aspect ratio plan. Wan2.2 video workflows are sensitive to resolution and stage matching; if your training crops are inconsistent, you can get warped results or poor generalization.
- Keep style consistent for the identity LoRA. If your goal is "realistic 3D," don't mix wildly different rendering styles inside the same identity dataset unless you want "style drift."
2) Training Dataset — Caption questions (2.1–2.16)
Background: what captions are for
Captions decide what the model treats as:
- part of “Annie identity”
- vs controllable attributes (pose, camera, outfit)
- vs unrelated background noise
Over-captioning can make the LoRA “require” too many tokens. Under-captioning can entangle things unpredictably.
2.1 Put “Annie” at the beginning?
Use a unique trigger token. Put it early and keep it consistent.
DreamBooth-style practice is “identifier token + class word” to bind the subject while preserving the broader class concept. (arXiv)
Example:
ann13_chr woman, upper body, three-quarter view, smiling, reading, riverbank
2.2 Describe facial features?
Usually no.
Pixels teach geometry. If you spell out “small nose, big eyes,” you may:
- accidentally make those tokens globally stronger
- increase leakage onto other characters when those tokens appear
Describe only a small number of high-signal traits if you truly need them (often hair color and maybe eye color).
2.3 Describe camera distance and angles?
Yes. High value.
Use a small fixed vocabulary:
close-up, upper body, full body, wide shot
front view, three-quarter view, profile, back view
This helps control composition and reduces identity drift across distance.
2.4 Describe expressions?
Yes, if expressions vary in the dataset.
Keep vocabulary small: neutral, smiling, angry, surprised, crying.
2.5 Describe poses/actions?
Yes, if they vary and you want control.
Especially for your target prompt: include reading, sitting, holding a book, etc.
2.6 Describe clothing details?
- If outfits vary: yes (short, consistent descriptors).
- If the outfit is fixed and you want it baked in: you can omit it, but then the outfit becomes part of "Annie."
2.7 Hair fixed. Still describe it?
Optional. Two valid choices:
- Omit hair: hair becomes part of Annie concept implicitly.
- Always include hair tag: stronger anchor, but reduces ability to generate Annie without that hair.
2.8 Hair not fixed. Describe it?
Yes.
Otherwise the model “averages” hairstyles, creating instability.
2.9 Hats present. Describe them?
Yes.
Otherwise hats can become an unpromptable identity artifact.
2.10 Describe background?
Lightly, yes.
Use coarse scene tags: outdoors, forest, riverbank, indoors. Avoid long object lists.
2.11 State “realistic 3D” explicitly?
Usually no if constant in your dataset.
If every image is the same realistic 3D pipeline, the model learns it without needing a style token. If you later want style independence, keep style out of the identity LoRA and train style separately.
2.12 Gender described?
Use one class word consistently (for example woman).
This matches the “identifier + class” idea used to preserve class behavior. (arXiv)
Avoid multiple synonyms across captions (girl, female, lady) because inconsistent tokens broaden the concept.
2.13 Age described?
Only if it matters and you want it locked. Otherwise omit.
2.14 Body type described?
Only if it is essential to identity and you want it preserved. Otherwise omit.
2.15 Additional captioning recommendations
- Controlled vocabulary: pick one term per concept and reuse it (a small caption-linting sketch follows this list).
- Caption what varies: if outfits vary, caption outfits; if angles vary, caption angles.
- Avoid describing other characters: the best dataset for your goal contains mostly Annie alone.
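Here is the caption-linting sketch mentioned above. It assumes kohya-style .txt sidecar captions in a hypothetical dataset/annie folder; the tag lists are only examples of a controlled vocabulary, not a canonical set:

```python
from pathlib import Path

# Example controlled vocabulary: one term per concept, reused everywhere.
ALLOWED_SHOTS   = {"close-up", "upper body", "full body", "wide shot"}
ALLOWED_ANGLES  = {"front view", "three-quarter view", "profile", "back view"}
SYNONYMS_TO_FIX = {"girl": "woman", "female": "woman", "lady": "woman"}

def lint_caption(path: Path) -> list[str]:
    """Return warnings for one caption file."""
    text = path.read_text(encoding="utf-8").lower()
    warnings = []
    for bad, good in SYNONYMS_TO_FIX.items():
        if bad in text:
            warnings.append(f"{path.name}: use '{good}' instead of '{bad}'")
    if not any(tag in text for tag in ALLOWED_SHOTS):
        warnings.append(f"{path.name}: missing a shot-distance tag")
    if not any(tag in text for tag in ALLOWED_ANGLES):
        warnings.append(f"{path.name}: missing a camera-angle tag")
    return warnings

for caption_file in Path("dataset/annie").glob("*.txt"):
    for warning in lint_caption(caption_file):
        print(warning)
```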
2.16 Should captions emphasize “Only Annie has red hair” to prevent inheritance?
No. Do not do that.
It does not create a reliable “exclusivity rule.” It often increases coupling between “red hair” and “Annie,” which can worsen leakage when “red hair” appears elsewhere.
Use real anti-leakage levers instead:
Lever A: Prior preservation / regularization
DreamBooth’s prior preservation loss is explicitly meant to reduce “language drift” where the model forgets how to generate other members of the class. (arXiv)
In many LoRA trainers, this shows up as “regularization images” and a “prior loss weight.” Kohya’s parameter docs explain that prior loss weight controls how much regularization images matter. (GitHub)
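Conceptually, prior preservation just adds a weighted reconstruction loss on the regularization (class) images. A sketch of the objective, not any specific trainer's code; prior_loss_weight corresponds to the Kohya parameter mentioned above:

```python
import torch.nn.functional as F

def training_loss(pred_instance, target_instance, pred_prior, target_prior,
                  prior_loss_weight: float = 1.0):
    """DreamBooth-style objective: instance loss (Annie images)
    plus weighted prior loss (generic 'woman' regularization images)."""
    loss_instance = F.mse_loss(pred_instance, target_instance)
    loss_prior = F.mse_loss(pred_prior, target_prior)
    return loss_instance + prior_loss_weight * loss_prior
```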
Lever B: Early stopping with a leakage test
Stop when “generic woman” prompts begin to look like Annie. The group-shot contamination issue reports that typical tweaks alone may not eliminate bleed once drift has occurred. (GitHub)
Lever C: Keep training images single-subject
This reduces entanglement pressure from the start.
3) Video sample questions (3.1–3.2)
3.1 Are video samples required?
Not required.
You can build a strong identity LoRA with images alone.
But video clips can help:
- temporal identity stability
- hair and face consistency under motion
Wan training toolchains commonly support image and video datasets in one config, with captions from text files or JSONL. (Hugging Face)
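Exact dataset layouts differ between toolchains, so check the docs for the schema yours expects. As an illustration only (the field names here are placeholders, not a fixed schema), a JSONL mixing images and short clips might be generated like this:

```python
import json
from pathlib import Path

# Illustrative entries mixing images and short clips in one dataset.
entries = [
    {"media": "images/annie_0001.png",
     "caption": "ann13_chr woman, close-up, front view, neutral"},
    {"media": "clips/annie_turn_01.mp4",
     "caption": "ann13_chr woman, upper body, slow head turn, indoors"},
]

with Path("annie_dataset.jsonl").open("w", encoding="utf-8") as f:
    for entry in entries:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```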
3.2 Turntable and zoom clips required?
Not required.
They are useful if your production includes these camera behaviors, but do not treat them as mandatory. If you add clips, keep them:
- short
- no cuts
- slow, clean motion
Also pay attention to the frame-rate assumptions in your dataset tooling. Some dataset-config guidance recommends a specific FPS for Wan-family training (16 fps, for example, appears in some Wan dataset config docs). (Hugging Face)
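If you need to normalize clips, a small ffmpeg-based sketch (the paths and the 16 fps target are assumptions; follow whatever your trainer's dataset docs specify):

```python
import subprocess
from pathlib import Path

TARGET_FPS = 16  # confirm against your toolchain's dataset docs
OUT_DIR = Path("clips")
OUT_DIR.mkdir(exist_ok=True)

for clip in Path("raw_clips").glob("*.mp4"):
    # Re-time to the target frame rate and drop audio; keep clips short and cut-free.
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(clip), "-vf", f"fps={TARGET_FPS}", "-an",
         str(OUT_DIR / clip.name)],
        check=True,
    )
```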
4) Wan2.2-specific advice you should not ignore
4.1 High-noise vs low-noise and matching LoRAs
Wan2.2 uses high-noise and low-noise expert models split by timesteps. (ComfyUI Wiki)
ComfyUI documentation explicitly warns that high-noise and low-noise models and LoRAs must correspond. (ComfyUI Wiki)
Practical interpretation:
- If you train only a low-noise LoRA, apply it only to the low-noise stage.
- If you train both, keep them separate and load each into the correct stage.
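To make the rule concrete, here is a purely conceptual sketch; the boundary value and the names are placeholders, not Wan2.2 or ComfyUI constants. In ComfyUI this corresponds to loading each LoRA only into its matching model branch:

```python
# Conceptual sketch of "match the LoRA to the expert."
BOUNDARY = 0.9  # illustrative timestep split, not Wan2.2's actual constant

def pick_expert(timestep_fraction: float) -> tuple[str, str]:
    """Return (model, lora) names for the current denoising step."""
    if timestep_fraction >= BOUNDARY:          # early, high-noise steps
        return "wan2.2_high_noise_expert", "annie_lora_high_noise"
    return "wan2.2_low_noise_expert", "annie_lora_low_noise"

# Never cross-apply: the low-noise LoRA should not be loaded into the high-noise expert.
```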
4.2 Why I suggest “low-noise first” for character identity
The low-noise stage is where fine detail crystallizes, and identity (face specifics) tends to be most sensitive there. Many community Wan2.2 LoRA writeups explicitly train both high and low, but for a first character LoRA, "low-noise first" is the simplest way to debug identity without fighting composition effects. (GitHub)
4.3 Masking is not a perfect “no-bleed” tool
If you were considering masked training: OneTrainer explains why pixel masks are not exact after VAE latent transforms and why masked training does not work like a strict isolation boundary. (GitHub)
So treat masking as optional polish, not your main leak prevention method.
5) A concrete workflow that fits your prompt example
Step 1: Build a starter dataset (images)
- 80–160 images of Annie alone
- Balanced close-up / mid / full-body
- Realistic 3D scenes similar to your target
- A small amount of solid backgrounds if you want
Step 2: Caption template
Use one template and reuse the same vocabulary:
ann13_chr woman, [shot], [angle], [expression], [action/pose], [outfit if varies], [background coarse]
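A minimal sketch of filling that template from hand-labelled metadata (the metadata values are examples; keep the vocabulary identical across files, and add an outfit slot only if outfits actually vary):

```python
from pathlib import Path

TEMPLATE = "ann13_chr woman, {shot}, {angle}, {expression}, {action}, {background}"

metadata = {
    "annie_0042.png": {
        "shot": "upper body", "angle": "three-quarter view",
        "expression": "smiling", "action": "sitting, holding a book",
        "background": "riverbank",
    },
}

for image_name, tags in metadata.items():
    caption = TEMPLATE.format(**tags)
    # Write a .txt sidecar next to the image.
    Path(image_name).with_suffix(".txt").write_text(caption + "\n", encoding="utf-8")
```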
Step 3: Train and checkpoint frequently
You are looking for the earliest checkpoint that passes all of the tests in Step 4.
Step 4: Leakage test suite (run every checkpoint)
- No-Annie prompt: "A woman reading by a stream." The result must not resemble Annie.
- Annie prompt: "ann13_chr woman reading by a stream." The result must resemble Annie.
- Two-person prompt: "ann13_chr woman reading by a stream, another woman fishing nearby." The fishing woman must not become Annie.
This exact multi-face contamination failure is widely reported. (GitHub)
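You can automate this suite with a face-embedding check. A hedged sketch using the face_recognition library as one convenient option (any face-embedding model works; the paths and the 0.6 threshold, the library's common default, are assumptions to tune):

```python
import numpy as np
import face_recognition  # assumption: any face-embedding library would do

def mean_encoding(paths):
    """Average face embedding over a few clean Annie reference renders."""
    encodings = []
    for p in paths:
        encodings.extend(face_recognition.face_encodings(face_recognition.load_image_file(p)))
    return np.mean(encodings, axis=0)

annie_ref = mean_encoding(["refs/annie_01.png", "refs/annie_02.png"])

# Lower distance = more Annie-like. Faces in the "no-Annie" and "other woman"
# outputs should stay clearly above the threshold.
for test_image in ["out/no_annie_woman.png", "out/two_person_other_face.png"]:
    faces = face_recognition.face_encodings(face_recognition.load_image_file(test_image))
    for i, face in enumerate(faces):
        dist = np.linalg.norm(face - annie_ref)
        flag = "LEAK?" if dist < 0.6 else "ok"
        print(f"{test_image} face {i}: distance {dist:.2f} {flag}")
```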
Step 5: If leakage appears
Apply the levers in this order:
- stop earlier (less training)
- strengthen dataset diversity in backgrounds/outfits/poses (reduce entanglement)
- add regularization images and tune prior loss weight if your trainer supports it (GitHub)
Summary
- Use a unique trigger token and keep captions consistent. (arXiv)
- Build an image dataset with balanced close-up, mid, and full-body coverage. (Reddit)
- Train Annie mostly alone to reduce entanglement and bleed. (GitHub)
- Consider prior preservation / regularization if generic prompts start drifting. (GitHub)
- For Wan2.2, ensure high/low LoRAs match the correct expert stage. (ComfyUI Wiki)