Magic3D is a new text-to-3D content creation tool that produces 3D
mesh models with unprecedented quality. Together with image-conditioning
techniques and a prompt-based editing approach, we provide users with
new ways to control 3D synthesis, opening up new avenues for various
creative applications.
Abstract
DreamFusion has recently demonstrated the utility of a pre-trained
text-to-image diffusion model to optimize Neural Radiance Fields (NeRF),
achieving remarkable text-to-3D synthesis results. However, the method
has two inherent limitations: (a) extremely slow optimization of NeRF
and (b) low-resolution image-space supervision on NeRF, leading to
low-quality 3D models with a long processing time. In this paper, we
address these limitations by utilizing a two-stage optimization
framework. First, we obtain a coarse model using a low-resolution
diffusion prior, accelerated with a sparse 3D hash grid structure.
Using the coarse representation as the initialization, we further
optimize a textured 3D mesh model with an efficient differentiable
renderer interacting with a high-resolution latent diffusion model. Our
method, dubbed Magic3D, can create high-quality 3D mesh models in 40
minutes, which is 2× faster than DreamFusion (reportedly taking 1.5
hours on average), while also achieving higher resolution. User studies
show that 61.7% of raters prefer our approach over DreamFusion. Together
with the image-conditioned generation capabilities, we provide users with
new ways to control 3D synthesis, opening up new avenues for various
creative applications.
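To make the supervision concrete: below is a minimal PyTorch sketch of
score distillation sampling (SDS), the mechanism DreamFusion introduced
and that Magic3D builds on in both stages. Every name here (render_fn,
unet, scheduler) is a placeholder for a differentiable renderer and a
frozen pre-trained diffusion model, not the authors' released code.

import torch

def sds_step(render_fn, model, camera, text_emb, unet, scheduler, optimizer):
    # Render the current 3D model from a randomly sampled camera; the render
    # must be differentiable with respect to the model's parameters.
    image = render_fn(model, camera)                      # (1, 3, H, W)

    # Perturb the rendering to a random diffusion timestep.
    t = torch.randint(20, 980, (1,), device=image.device)
    noise = torch.randn_like(image)
    noisy = scheduler.add_noise(image, noise, t)

    # Ask the frozen diffusion model to predict the noise, conditioned on text.
    with torch.no_grad():
        noise_pred = unet(noisy, t, encoder_hidden_states=text_emb).sample

    # SDS treats (noise_pred - noise) as the gradient w.r.t. the rendered image,
    # skipping the U-Net Jacobian; a timestep weight w(t) is omitted for brevity.
    grad = (noise_pred - noise).detach()
    loss = (grad * image).sum()   # surrogate whose image-gradient equals `grad`

    optimizer.zero_grad()
    loss.backward()               # pushes `grad` through the renderer into the model
    optimizer.step()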
Video
High-Resolution 3D Meshes
Magic3D can create high-quality 3D textured mesh models from input
text prompts. It utilizes a coarse-to-fine strategy leveraging both
low- and high-resolution diffusion priors to learn the 3D
representation of the target content. Magic3D synthesizes 3D content
with 8× higher-resolution supervision than DreamFusion (512×512 versus
64×64 renders) while also being 2× faster.
[...] indicates helper captions added to improve quality, e.g.
"A DSLR photo of".
Given a coarse model generated with a base text prompt, we can modify
parts of the text in the prompt and then fine-tune the NeRF and 3D
mesh models to obtain an edited high-resolution 3D mesh (see the
sketch after the examples below).
A squirrel wearing a leather jacket riding a motorcycle.
A bunny riding a scooter.
A fairy riding a bike.
A steampunk squirrel riding a horse.
A baby bunny sitting on top of a stack of pancakes.
A lego bunny sitting on top of a stack of books.
A metal bunny sitting on top of a stack of broccoli.
A metal bunny sitting on top of a stack of chocolate cookies.
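As a concrete sketch of this editing workflow, reusing the hypothetical
sds_step from the earlier sketch (the checkpoint path, helper names, and
step count are all illustrative, not the authors' code):

import torch

# Resume from the coarse model trained on the base prompt and continue
# optimizing against the edited prompt.
model = load_checkpoint("coarse_bunny.pt")   # coarse-stage result for the base prompt
edited_emb = text_encoder("A metal bunny sitting on top of a stack of broccoli.")

optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(fine_tune_steps):
    camera = sample_random_camera()
    sds_step(render_fn, model, camera, edited_emb, unet, scheduler, optimizer)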
Other Editing Capabilities
Given input images of a subject instance, we can fine-tune the
diffusion models with
DreamBooth and optimize
the 3D models with the given prompts. The identity of the subject is
well preserved in the 3D models.
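A minimal sketch of the DreamBooth objective, assuming a pixel-space
diffusion model for simplicity (a latent diffusion model would first
encode images into latent space); the "[V]" identifier and prompts follow
the DreamBooth recipe, while all function names are placeholders:

import torch
import torch.nn.functional as F

def dreambooth_loss(unet, scheduler, text_encoder,
                    subject_images, class_images, prior_weight=1.0):
    # Standard denoising loss for a batch of images under a given caption.
    def denoising_loss(images, prompt):
        emb = text_encoder(prompt)
        t = torch.randint(0, 1000, (images.shape[0],), device=images.device)
        noise = torch.randn_like(images)
        noisy = scheduler.add_noise(images, noise, t)
        pred = unet(noisy, t, encoder_hidden_states=emb).sample
        return F.mse_loss(pred, noise)

    # Fit the few subject photos under a rare identifier token "[V]" ...
    subject_term = denoising_loss(subject_images, "a photo of a [V] dog")
    # ... while a prior-preservation term keeps the generic class intact.
    prior_term = denoising_loss(class_images, "a photo of a dog")
    return subject_term + prior_weight * prior_term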
We can also condition the diffusion model (eDiff-I) on an input image
to transfer its style to the output 3D model.
Approach
We utilize a two-stage coarse-to-fine optimization framework for fast
and high-quality text-to-3D content creation. In the first stage, we
obtain a coarse model using a low-resolution diffusion prior and
accelerate this with a hash grid and sparse acceleration structure. In
the second stage, we use a textured mesh model initialized from the
coarse neural representation, allowing optimization with an efficient
differentiable renderer interacting with a high-resolution latent
diffusion model.
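The two stages map onto a driver loop roughly like the following,
reusing the hypothetical sds_step from the earlier sketch; every class
and function name is a placeholder consistent with the description
above (a hash-grid neural field for stage one, a textured mesh with a
fast differentiable rasterizer for stage two), not the released
implementation:

import torch

# Stage 1: coarse neural field, supervised by a low-resolution diffusion prior.
coarse = HashGridNeRF()                      # sparse hash grid acceleration structure
coarse_opt = torch.optim.Adam(coarse.parameters(), lr=1e-2)
for _ in range(coarse_steps):
    camera = sample_random_camera()
    sds_step(volume_render, coarse, camera,
             text_emb, low_res_unet, scheduler, coarse_opt)

# Stage 2: extract a textured mesh from the coarse field and refine it at high
# resolution. With a latent diffusion prior, the rendering is first encoded
# into latent space before the SDS step; that encoder is omitted here.
mesh = extract_textured_mesh(coarse)         # e.g. via a deformable tetrahedral grid
fine_opt = torch.optim.Adam(mesh.parameters(), lr=1e-2)
for _ in range(fine_steps):
    camera = sample_random_camera()
    sds_step(rasterize_mesh, mesh, camera,
             text_emb, latent_unet, scheduler, fine_opt)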
Presentation
Poster
Citation
@inproceedings{lin2023magic3d,
  title={Magic3D: High-Resolution Text-to-3D Content Creation},
  author={Lin, Chen-Hsuan and Gao, Jun and Tang, Luming and Takikawa, Towaki and Zeng, Xiaohui and Huang, Xun and Kreis, Karsten and Fidler, Sanja and Liu, Ming-Yu and Lin, Tsung-Yi},
  booktitle={IEEE Conference on Computer Vision and Pattern Recognition ({CVPR})},
  year={2023}
}