Spatially aware image generation (Dimensions)

I’m looking for some help and advice on a project I’m working on: it’s an AI image generator, but I need it to understand dimensions and spaces. It should also use reference images. Can someone contact me or reply here so we can discuss?


There doesn’t seem to be a single model well-suited to handle that task alone. It might be better to implement it as a pipeline combining multiple models and programs…


My main suggestion is to design this as a scene system first and an image generator second.

What you want is not ordinary text-to-image. You want a model that can respect space, dimensions, object placement, camera/viewpoint, occlusion, and reference-image appearance at the same time. Recent spatial benchmarks still show that even strong image generators struggle with exact object placement, relational correctness, perspective, and metric-style constraints, and very recent work shows that even the order of entities in the prompt can spuriously affect where objects end up. (arXiv)

What your project really is

The underlying problem is closer to layout-constrained scene composition with learned rendering than to “make an image from text.” The strongest work in this area keeps separating the problem into stages: use language to plan a scene, convert that plan into explicit spatial structure, and only then let a diffusion model render the final image. That pattern shows up in LayoutGPT for language-to-layout planning, GLIGEN for box-grounded generation, Build-A-Scene for interactive 3D layout control, SceneCraft for layout-guided 3D scene generation, and Holodeck for LLM-generated spatial constraints plus layout optimization. (arXiv)

The core design choice

The biggest decision is this:

  • If your product mostly makes single-view images and users do not need to edit the scene much, a 2.5D pipeline is enough.
  • If users care about real dimensions, room-scale correctness, moving objects later, or multiple viewpoints, use a true 3D intermediate representation. (abdo-eldesokey.github.io)

My view is that if “dimensions / spaces” are central to the value of the product, you should lean toward 3D or quasi-3D from the beginning. Build-A-Scene explicitly argues that 2D layout control is too limited for 3D object-wise control and iterative refinements, while SceneCraft and SpatialGen both use explicit 3D layouts to produce more coherent spatial results. (abdo-eldesokey.github.io)

The architecture I would recommend

I would structure the system like this:

user intent → scene planner → canonical scene spec → geometry compiler → image renderer → verifier loop

That architecture is not just a personal preference; it closely matches the direction of current research and tooling. LayoutGPT turns language into layouts, ControlNet and T2I-Adapter turn spatial signals into conditioning inputs, IP-Adapter handles image-reference guidance, and newer 3D work uses explicit layouts before rendering. (arXiv)

1. Keep a canonical scene spec

Do not let the prompt be the only place where the scene exists.

Store the actual scene in a structured format with room size, object size, transforms, constraints, camera, and reference bindings. LayoutGPT is a strong reference for this planning step because it explicitly improves numerical and spatial correctness by converting language into layouts instead of relying on free-form prompting alone. (arXiv)

A practical internal schema would include:

  • room width, depth, height
  • camera position, orientation, focal length or FOV
  • object IDs and classes
  • object boxes or meshes
  • relations such as left_of, behind, aligned_to_wall, clearance
  • reference images for object identity, style, and background
  • protected regions that must not change (arXiv)
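To make that concrete, here is what such a spec could look like as a plain Python dict. All field names here are my own invention for illustration, not an established schema; lengths are in meters, angles in degrees:

```python
# Illustrative canonical scene spec. Field names are made up for this
# sketch, not a standard. Lengths in meters, angles in degrees.
scene = {
    "room": {"width": 4.0, "depth": 5.0, "height": 2.7},
    "camera": {"position": [2.0, 1.5, 4.5], "yaw_deg": 180, "fov_deg": 60},
    "objects": [
        {"id": "sofa_1", "class": "sofa",
         "size": [2.2, 0.9, 0.8],            # width, depth, height
         "position": [1.0, 0.0, 0.1],        # x, y (on floor), z
         "rotation_deg": 0,
         "reference_image": "refs/sofa_gray.png"},
        {"id": "lamp_1", "class": "floor_lamp",
         "size": [0.4, 0.4, 1.6],
         "position": [3.2, 0.0, 0.2],
         "rotation_deg": 0,
         "reference_image": None},
    ],
    "relations": [
        {"type": "left_of", "a": "sofa_1", "b": "lamp_1"},
        {"type": "aligned_to_wall", "a": "sofa_1", "wall": "north"},
        {"type": "clearance", "a": "sofa_1", "min_m": 0.6},
    ],
    "style_references": ["refs/scandi_palette.png"],
    "protected_regions": [],   # region masks that must not change between edits
}
```

Because the spec is plain JSON-serializable data, the planner, geometry compiler, and verifier can all read and write the same object, and edits become data transformations rather than prompt rewrites.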

That one choice changes the project completely. Instead of asking the generator to “understand dimensions,” you give it an explicit spatial truth to follow. The benchmarks suggest that is much safer than relying on natural language alone. (arXiv)

2. Compile the scene into geometric controls

From the scene spec, generate control signals such as:

  • depth map
  • segmentation map
  • normals
  • floor/wall masks
  • per-object masks
  • 2D projected boxes
  • optional perspective guides or vanishing-line hints

This is exactly what ControlNet and T2I-Adapter are for: adding structural conditioning such as depth, edges, pose, or segmentation to a pretrained diffusion model. Diffusers also supports combining multiple ControlNets and advises masking conditionings so they do not overlap. (arXiv)
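As a sketch of the 2.5D variant, assuming object boxes have already been projected into image space (a true 3D pipeline would render real depth and normals from geometry instead), the compiler could rasterize per-object masks and an inverse-depth control image like this:

```python
import numpy as np

def compile_controls(boxes, image_size=(512, 512)):
    """Rasterize projected 2D boxes into per-object masks and a crude
    inverse-depth map. `boxes` is a list of dicts, each with an `id`,
    a pixel box (x0, y0, x1, y1), and a camera distance `z` in meters.
    This is only the box-based 2.5D approximation."""
    h, w = image_size
    masks = {}
    depth = np.full((h, w), np.inf, dtype=np.float32)
    for b in boxes:
        x0, y0, x1, y1 = b["box"]
        m = np.zeros((h, w), dtype=bool)
        m[y0:y1, x0:x1] = True
        masks[b["id"]] = m
        depth[m] = np.minimum(depth[m], b["z"])   # nearest surface wins
    # Push the empty background behind the farthest object.
    bg = depth[np.isfinite(depth)].max(initial=1.0) + 2.0
    depth[np.isinf(depth)] = bg
    # Normalize to an 8-bit inverse-depth image (near = bright), the
    # convention most depth-conditioned models expect.
    inv = 1.0 / depth
    inv = (inv - inv.min()) / (np.ptp(inv) + 1e-8)
    return masks, (inv * 255).astype(np.uint8)
```

The masks feed the reference-binding step below; the depth image goes straight to a depth ControlNet as its conditioning input.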

3. Separate structure from appearance

This is the most important practical rule for your case.

  • Structure = where things go
  • Appearance = what they look like

Use ControlNet / T2I-Adapter / boxes / masks / depth for structure. Use IP-Adapter for reference images and appearance cues. IP-Adapter is specifically designed as a lightweight image-prompt adapter with decoupled text/image attention, and Diffusers documents using binary masks so different reference images control different parts of the output. (arXiv)

This matters because reference images are extremely useful, but they can also destabilize layout if you let them act globally. The safest approach is to bind references regionally. The official docs explicitly support that with masking. (Hugging Face)

4. Treat reference images as different types, not one feature

For your case, I would split references into three categories:

  • object references: “this sofa,” “this lamp,” “this product,” “this character”
  • style/material references: palette, texture, finish, lighting mood
  • background/environment references: overall scene feel, not exact geometry

Then bind them differently. Object references should be tied to object masks or object regions. Style references should have weaker influence and cover broader regions. Background references should not be allowed to override object placement. This is an implementation recommendation, but it is strongly supported by how IP-Adapter masking is designed and by the general split between structural and image-based control in current tooling. (Hugging Face)
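One way to encode that policy in code is a small binding structure. The `scale` values here are tuning guesses, not established defaults, and the fields are illustrative rather than a library API:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ReferenceBinding:
    """How one reference image may influence the output."""
    image_path: str
    kind: str            # "object" | "style" | "background"
    region: np.ndarray   # boolean pixel mask this reference may affect
    scale: float         # adapter strength (e.g., an IP-Adapter scale)

def bind_references(object_refs, style_refs, h=512, w=512):
    """Object refs are tied tightly to their masks; style refs act weakly
    everywhere; whatever region no object claims is left free for a
    background reference, so it cannot override object placement.
    `object_refs` maps a reference image path to its object mask."""
    bindings, occupied = [], np.zeros((h, w), dtype=bool)
    for path, mask in object_refs.items():
        bindings.append(ReferenceBinding(path, "object", mask, scale=0.8))
        occupied |= mask
    for path in style_refs:
        bindings.append(ReferenceBinding(path, "style",
                                         np.ones((h, w), bool), scale=0.3))
    return bindings, ~occupied   # free region for a background reference
```

The binding list then translates directly into the per-image masks and scales that masked IP-Adapter setups expect.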

5. Let structure dominate, then let references refine

A useful trick here is to control when reference images matter during denoising. Diffusers has an IP-Adapter cutoff callback that can disable IP-Adapter after a chosen number of denoising steps. In practice, that gives you a way to let references influence the composition early or mid-way without letting them hijack the final layout. (Hugging Face)

My recommendation is:

  • early steps: strong geometry, moderate reference influence
  • middle steps: preserve geometry, refine appearance
  • later steps: weaker reference influence, protect structure

That timing strategy is an inference from the callback mechanism and from the common failure mode where references fight structure. (Hugging Face)
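A sketch of that schedule as a pure function, where the specific numbers (peak strength, cutoff fraction) are tuning knobs rather than established values; in a Diffusers pipeline something like this would run inside a `callback_on_step_end` hook, updating the adapter strength via `set_ip_adapter_scale` each step:

```python
def ip_adapter_scale_at(step, total_steps, peak=0.7, cutoff_frac=0.6):
    """Reference-influence schedule: moderate early, strongest mid-way,
    cut off late so geometry wins the final steps."""
    frac = step / total_steps
    if frac >= cutoff_frac:
        return 0.0            # late steps: structure only
    if frac < 0.2:
        return 0.5 * peak     # early steps: moderate reference influence
    return peak               # middle steps: refine appearance
```

The exact breakpoints are worth sweeping per model; the important property is only that reference influence goes to zero before the final denoising steps.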

Which technical path fits your project

Path A: 2.5D MVP

Choose this if your product is closer to mockups, ad composition, storyboard frames, or one-shot interior renders from one camera. Build a planner that outputs a structured scene, render depth and segmentation from that scene, use ControlNet for layout, and use masked IP-Adapter for references. This is the fastest route to a product-quality prototype. (arXiv)

Path B: true 3D scene system

Choose this if you expect users to move objects, resize them, change viewpoints, or care about real-world measurements. Build-A-Scene, SceneCraft, SpatialGen, and Holodeck all point in this direction: store a 3D layout, generate proxy signals from it, and render images on top of that structure. (abdo-eldesokey.github.io)

This is the route I would choose if your product promise includes things like:

  • room planning
  • furniture placement
  • floor-plan to render
  • configurable product placement in space
  • editing one object without breaking the rest of the scene (abdo-eldesokey.github.io)

What I would actually use

For a production-oriented open-source stack, I would center the system on Diffusers, because it gives you the most practical programmable combination of ControlNet, T2I-Adapter, IP-Adapter, inpainting, callbacks, and custom training flows. The official docs cover structural control, multiple ControlNets, IP-Adapter masking, and T2I-Adapter as a smaller control module. (Hugging Face)

My default stack recommendation would be:

  • planner: LLM or rule-based scene parser
  • scene representation: JSON or graph with explicit geometry
  • geometry compiler: depth/segmentation/masks/boxes renderer
  • renderer: diffusion model with ControlNet
  • reference module: IP-Adapter with masks
  • optional custom control: T2I-Adapter for your own domain-specific signal
  • verification: detector/segmenter/depth checker against the target layout (Hugging Face)

When to train something custom

I would not start by training a whole new end-to-end generator.

ControlNet was designed so you can add structural control while freezing the backbone, and the paper reports robust training across both smaller and larger paired datasets. T2I-Adapter is even lighter, and Diffusers describes it as much smaller than ControlNet for SDXL training. That makes adapters a better first place to put your effort than retraining a full image model. (arXiv)

A good custom control signal for your product might be:

  • floor-contact map
  • wall-contact map
  • top-down occupancy map
  • clearance heatmap
  • perspective grid
  • projected 3D boxes
  • anchor-point map for constrained regions

That suggestion is an inference from the adapter frameworks: both ControlNet and T2I-Adapter are built around additional control images, so a domain-specific spatial control image is often the cleanest place to innovate. (arXiv)
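For illustration, one of these signals, the top-down occupancy map, is simple to render from the scene spec. The 0.05 m cell size is an arbitrary choice, and the axis-aligned footprints are a simplification:

```python
import numpy as np

def occupancy_map(objects, room_w, room_d, cell=0.05):
    """Render a top-down occupancy image as a custom control signal.
    `objects` are axis-aligned floor footprints (x, z, width, depth)
    in meters from the room origin; each grid cell is `cell` meters."""
    gw, gd = round(room_w / cell), round(room_d / cell)
    grid = np.zeros((gd, gw), dtype=np.uint8)
    for x, z, w, d in objects:
        grid[round(z / cell):round((z + d) / cell),
             round(x / cell):round((x + w) / cell)] = 255
    return grid
```

Paired with rendered images of the same scenes, maps like this become the control-image side of an adapter training set.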

The verifier loop is not optional

Your system should not trust the first generation.

GenSpace explicitly argues that standard evaluation often misses detailed spatial errors, and it introduces a more specialized pipeline because visually plausible images still fail on 3D placement, relationships, and measurements. That is directly relevant to your product. (arXiv)

So I would add automatic checks such as:

  • object present in the intended region
  • scale roughly matches the intended box
  • left/right/front/behind relations are correct
  • protected regions stayed unchanged
  • occlusion order is correct when required
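A minimal verifier for the first three of those checks, assuming a detector or segmenter has already produced a pixel box per object, might look like:

```python
def verify_layout(detections, spec_boxes, relations, iou_min=0.3):
    """Post-generation spatial checks. `detections` maps object id ->
    detected pixel box (x0, y0, x1, y1); `spec_boxes` maps id -> the
    intended box from the scene spec. Returns a list of failed checks."""
    def iou(a, b):
        ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
        ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter + 1e-8)

    failures = []
    for oid, target in spec_boxes.items():
        if oid not in detections:
            failures.append(f"missing: {oid}")
        elif iou(detections[oid], target) < iou_min:
            failures.append(f"misplaced: {oid}")
    for rel, a, b in relations:
        if a in detections and b in detections:
            ca = (detections[a][0] + detections[a][2]) / 2
            cb = (detections[b][0] + detections[b][2]) / 2
            if rel == "left_of" and not ca < cb:
                failures.append(f"relation failed: {a} left_of {b}")
    return failures
```

Front/behind checks would compare predicted depth inside each mask rather than box centers, and protected-region checks are a pixel diff against the previous render.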

If occlusion is especially important, VODiff is worth studying because it introduces explicit visibility-order control, which is a common pain point in spatial scene generation. (CVF Open Access)

The biggest pitfalls for your case

1. Prompt-only layout control

This is the most common trap. Recent benchmarks still show that modern image generators are weak at exact spatial reasoning and metric adherence, and recent work shows that prompt entity order itself can bias layout. Text is useful for intent, but it is a poor place to store geometric truth. (arXiv)

2. One global reference image

A single global reference often drags pose, composition, and framing away from the intended layout. The official IP-Adapter docs support masking for exactly this reason: different references should often control different regions. (Hugging Face)

3. Fragile multi-reference setups

Community issues show that multi-image, multi-mask IP-Adapter setups are powerful but easy to misconfigure, and some users have reported resolution-related quirks or confusion about how to wire multiple images and masks. That does not invalidate the approach; it means your product should be code-first and carefully tested, not just assembled in a UI. (GitHub)

4. Assuming inpainting will perfectly preserve everything else

In real pipelines, masked editing can still be fragile, especially once you combine ControlNet, IP-Adapter, multiple references, and inpainting. Public issues in Diffusers and common WebUI stacks show exactly these edge cases. (GitHub)

My direct recommendation for your case

If I had to advise you in one sentence:

Use language to author a structured scene, geometry to lock the space, references to paint appearance, and diffusion only to render the final image. (arXiv)

A concrete plan would be:

  1. Build a planner that converts user requests into a scene JSON. (arXiv)
  2. Render depth, segmentation, and per-object masks from that scene. (Hugging Face)
  3. Use ControlNet for those structural signals. (Hugging Face)
  4. Use masked IP-Adapter for object and style references. (Hugging Face)
  5. Add a verifier that checks spatial constraints after generation. (arXiv)
  6. Move to a true 3D representation as soon as users need editable scenes or real dimensions. (abdo-eldesokey.github.io)

The most honest summary

My honest view is that “spatially aware image generation” is still not solved by prompting harder. The best current systems keep adding explicit structure: boxes, masks, depth, scene graphs, solvers, or full 3D layouts. The good news is that this is now a very buildable direction with existing tools and papers. (arXiv)
