Geometry Image Diffusion: Fast and Data-Efficient Text-to-3D with Image-Based Surface Representation

Slava Elizarov, Ciara Rowles, Simon Donné
Unity Technologies
Prompts: An avocado-shaped chair; A tropic plant; A fly agaric mushroom; A steampunk airplane; A leather jacket with spikes

Abstract

Generating high-quality 3D objects from textual descriptions remains a challenging problem due to computational cost, the scarcity of 3D data, and complex 3D representations. We introduce Geometry Image Diffusion (GIMDiffusion), a novel Text-to-3D model that utilizes geometry images to efficiently represent 3D shapes using 2D images, thereby avoiding the need for complex 3D-aware architectures.

By integrating a Collaborative Control mechanism, we exploit the rich 2D priors of existing Text-to-Image models such as Stable Diffusion. This enables strong generalization even with limited 3D training data, allowing us to restrict training to high-quality data, while retaining compatibility with guidance techniques such as IP-Adapter.

In short, GIMDiffusion enables the generation of 3D assets at speeds comparable to current Text-to-Image models. The generated objects consist of semantically meaningful, separate parts and include internal structures, enhancing both usability and versatility.

Method

Figure: GIMDiffusion architecture overview.

Geometry images

In our method, we use geometry images (GIM) as the surface representation. A geometry image encodes a 3D shape as an atlas of charts arranged on a regular 2D grid. The construction is analogous to a texture atlas in UV space; however, instead of storing RGB values, each pixel stores the 3D coordinates of a surface point.

Figure: geometry image, albedo texture, and reconstructed mesh.
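
To make the representation concrete, here is a minimal sketch of how a triangle mesh can be recovered from a geometry image: adjacent covered pixels are connected into two triangles per grid cell. This is an illustrative reading of the representation, not our released code; the array shapes, and the assumption that charts are separated by at least one background pixel, are made up for this example.

import numpy as np

def geometry_image_to_mesh(gim, mask):
    """Reconstruct a triangle mesh from a geometry image.

    gim:  (H, W, 3) array; each covered pixel stores an XYZ surface point.
    mask: (H, W) boolean occupancy mask marking pixels covered by a chart.
    """
    H, W, _ = gim.shape
    idx = -np.ones((H, W), dtype=np.int64)  # vertex index per covered pixel
    idx[mask] = np.arange(mask.sum())
    verts = gim[mask]

    faces = []
    for y in range(H - 1):
        for x in range(W - 1):
            a, b = idx[y, x], idx[y, x + 1]
            c, d = idx[y + 1, x], idx[y + 1, x + 1]
            # Triangulate only cells whose four corners are covered, so
            # charts separated by background pixels stay disconnected.
            if min(a, b, c, d) >= 0:
                faces.append((a, b, c))
                faces.append((b, d, c))
    return verts, np.asarray(faces, dtype=np.int64)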

Collaborative control

To leverage the prior knowledge encoded in existing 2D Text-to-Image models, we use Collaborative Control. This approach comprises two parallel networks: a frozen, pre-trained RGB model and a newly trained geometry model. The former generates UV-space albedo textures, while the latter generates the corresponding geometry images.

These two models are connected by a simple linear cross-network communication layer, which allows them to share information and collaborate in generating pixel-aligned outputs across these different modalities. Crucially, this also enables the geometry model to influence the frozen model, guiding it to generate UV-space textures that would otherwise lie at the fringes of its training distribution.
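
As an illustration of what such a layer can look like, the sketch below mixes feature maps from the two branches with a single 1x1 linear projection and residual connections. The channel split, placement, and residual form are assumptions made for this example, not the exact published architecture.

import torch
import torch.nn as nn

class CrossNetworkLink(nn.Module):
    """Hypothetical linear cross-network communication layer.

    Concatenates feature maps from the frozen RGB branch and the
    trainable geometry branch, mixes them with a 1x1 linear projection,
    and returns to each branch its slice of the mixed features.
    """
    def __init__(self, rgb_channels: int, geo_channels: int):
        super().__init__()
        self.rgb_channels = rgb_channels
        self.mix = nn.Conv2d(rgb_channels + geo_channels,
                             rgb_channels + geo_channels, kernel_size=1)

    def forward(self, rgb_feat: torch.Tensor, geo_feat: torch.Tensor):
        mixed = self.mix(torch.cat([rgb_feat, geo_feat], dim=1))
        rgb_mix, geo_mix = mixed.split(
            [self.rgb_channels, geo_feat.shape[1]], dim=1)
        # Residual connections let information flow both ways while
        # keeping the frozen branch close to its pre-trained behaviour.
        return rgb_feat + rgb_mix, geo_feat + geo_mix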

Freezing the base model also drastically reduces the amount of data required to train the joint model, while retaining its generalizability, diversity, and quality.

Results

Strong generalization

Figure: texture, rendered model, and untextured surface for each of the prompts:
A palm in a pot
A delicious burger with lettuce, tomato, and cheese
A sneaker
A fantasy style metal shield

Separable parts

A key advantage of our method is that it generates objects divided into distinct, semantically meaningful (or nearly so) parts.
Figure: exploded view of a generated mushroom, showing its separable parts.
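
Because each chart occupies its own island in UV space, the parts can be pulled apart with a simple connected-component pass over the occupancy mask. A minimal sketch, assuming the geometry_image_to_mesh helper above and that each part corresponds to a single chart (in practice a part may span several charts):

from scipy import ndimage

def split_parts(gim, mask):
    # Label spatially disconnected chart islands in the UV-space mask.
    labels, n_parts = ndimage.label(mask)
    # Reconstruct one mesh per island, e.g. for an exploded view.
    return [geometry_image_to_mesh(gim, labels == i)
            for i in range(1, n_parts + 1)]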

BibTeX

@misc{elizarov2024geometryimagediffusionfast,
      title={Geometry Image Diffusion: Fast and Data-Efficient Text-to-3D with Image-Based Surface Representation}, 
      author={Slava Elizarov and Ciara Rowles and Simon Donné},
      year={2024},
      eprint={2409.03718},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2409.03718}, 
}