Geometry Image Diffusion: Fast and Data-Efficient Text-to-3D with Image-Based Surface Representation

Slava Elizarov, Ciara Rowles, Simon Donné
Unity Technologies
Prompts: An avocado-shaped chair; A tropic plant; A fly agaric mushroom; A steampunk airplane; A leather jacket with spikes

Abstract

Generating high-quality 3D objects from textual descriptions remains a challenging problem due to computational cost, the scarcity of 3D data, and complex 3D representations. We introduce Geometry Image Diffusion (GIMDiffusion), a novel Text-to-3D model that utilizes geometry images to efficiently represent 3D shapes using 2D images, thereby avoiding the need for complex 3D-aware architectures.

By integrating a Collaborative Control mechanism, we exploit the rich 2D priors of existing Text-to-Image models such as Stable Diffusion. This enables strong generalization even with limited 3D training data, allowing us to restrict training to high-quality data, while retaining compatibility with guidance techniques such as IP-Adapter.

In short, GIMDiffusion enables the generation of 3D assets at speeds comparable to current Text-to-Image models. The generated objects consist of semantically meaningful, separate parts and include internal structures, enhancing both usability and versatility.

Method

Figure: GIMDiffusion architecture overview.

Geometry images

In our method, we use geometry images (GIM) as the surface representation. A geometry image encodes a 3D shape as an atlas of charts arranged on a regular 2D grid. The construction is analogous to a texture atlas in UV space; however, instead of storing RGB values, each pixel stores the 3D coordinates of a surface point.

Figure: geometry image, albedo texture, and reconstructed mesh.
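
To make the representation concrete, here is a minimal sketch of how a triangle mesh can be recovered from a geometry image: adjacent covered pixels are connected into two triangles per grid cell. This is an illustrative reading of the representation, not our released code; the array shapes, and the assumption that charts are separated by at least one background pixel, are made up for this example.

import numpy as np

def geometry_image_to_mesh(gim, mask):
    """Reconstruct a triangle mesh from a geometry image.

    gim:  (H, W, 3) array; each covered pixel stores an XYZ surface point.
    mask: (H, W) boolean occupancy mask marking pixels covered by a chart.
    """
    H, W, _ = gim.shape
    idx = -np.ones((H, W), dtype=np.int64)  # vertex index per covered pixel
    idx[mask] = np.arange(mask.sum())
    verts = gim[mask]

    faces = []
    for y in range(H - 1):
        for x in range(W - 1):
            a, b = idx[y, x], idx[y, x + 1]
            c, d = idx[y + 1, x], idx[y + 1, x + 1]
            # Triangulate only cells whose four corners are covered, so
            # charts separated by background pixels stay disconnected.
            if min(a, b, c, d) >= 0:
                faces.append((a, b, c))
                faces.append((b, d, c))
    return verts, np.asarray(faces, dtype=np.int64)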

Collaborative control

To leverage the prior knowledge encoded in existing 2D Text-to-Image models, we use Collaborative Control. This approach comprises two parallel networks: a frozen, pre-trained RGB model and a newly trained geometry model. The former generates UV-space albedo textures, while the latter generates the corresponding geometry images.

These two models are connected by a simple linear cross-network communication layer, which allows them to share information and collaborate in generating pixel-aligned outputs across these different modalities. Crucially, this also enables the geometry model to influence the frozen model, guiding it to generate UV-space textures that would otherwise lie at the fringes of its training distribution.
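
As an illustration of what such a layer can look like, the sketch below mixes feature maps from the two branches with a single 1x1 linear projection and residual connections. The channel split, placement, and residual form are assumptions made for this example, not the exact published architecture.

import torch
import torch.nn as nn

class CrossNetworkLink(nn.Module):
    """Hypothetical linear cross-network communication layer.

    Concatenates feature maps from the frozen RGB branch and the
    trainable geometry branch, mixes them with a 1x1 linear projection,
    and returns to each branch its slice of the mixed features.
    """
    def __init__(self, rgb_channels: int, geo_channels: int):
        super().__init__()
        self.rgb_channels = rgb_channels
        self.mix = nn.Conv2d(rgb_channels + geo_channels,
                             rgb_channels + geo_channels, kernel_size=1)

    def forward(self, rgb_feat: torch.Tensor, geo_feat: torch.Tensor):
        mixed = self.mix(torch.cat([rgb_feat, geo_feat], dim=1))
        rgb_mix, geo_mix = mixed.split(
            [self.rgb_channels, geo_feat.shape[1]], dim=1)
        # Residual connections let information flow both ways while
        # keeping the frozen branch close to its pre-trained behaviour.
        return rgb_feat + rgb_mix, geo_feat + geo_mix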

Freezing the base model also drastically reduces the amount of data required to train the joint model, while retaining its generalizability, diversity, and quality.

Results

Strong generalization

Figure: texture, rendered model, and untextured surface for each of the prompts:
A palm in a pot
A delicious burger with lettuce, tomato, and cheese
A sneaker
A fantasy style metal shield

Separable parts

A key advantage of our method is that it generates objects divided into distinct, semantically meaningful (or nearly so) parts.
Figure: exploded view of a generated mushroom, showing its separable parts.
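
Because each chart occupies its own island in UV space, the parts can be pulled apart with a simple connected-component pass over the occupancy mask. A minimal sketch, assuming the geometry_image_to_mesh helper above and that each part corresponds to a single chart (in practice a part may span several charts):

from scipy import ndimage

def split_parts(gim, mask):
    # Label spatially disconnected chart islands in the UV-space mask.
    labels, n_parts = ndimage.label(mask)
    # Reconstruct one mesh per island, e.g. for an exploded view.
    return [geometry_image_to_mesh(gim, labels == i)
            for i in range(1, n_parts + 1)]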

BibTeX

@misc{elizarov2024geometryimagediffusionfast,
      title={Geometry Image Diffusion: Fast and Data-Efficient Text-to-3D with Image-Based Surface Representation}, 
      author={Slava Elizarov and Ciara Rowles and Simon Donné},
      year={2024},
      eprint={2409.03718},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2409.03718}, 
}