We present a latent diffusion model over 3D scenes that can be trained using only 2D image data. To achieve this, we first design an autoencoder that maps multi-view images to 3D Gaussian splats and simultaneously builds a compressed latent representation of these splats. Then, we train a multi-view diffusion model over the latent space to learn an efficient generative model. This pipeline requires neither object masks nor depth maps, and is suitable for complex scenes with arbitrary camera positions. We conduct careful experiments on two large-scale datasets of complex real-world scenes, MVImgNet and RealEstate10K. We show that our approach enables generating 3D scenes in as little as 0.2 seconds, whether from scratch, from a single input view, or from sparse input views. It produces diverse, high-quality results while running an order of magnitude faster than non-latent diffusion models and earlier NeRF-based generative models.
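To make the two-stage structure concrete, the following is a minimal PyTorch sketch of the pipeline described above, not the paper's actual architecture: the module names (`SplatAutoencoder`, `LatentDenoiser`), layer choices, mean-pooled view fusion, the 14-parameter per-pixel Gaussian layout, and the standard epsilon-prediction diffusion objective are all illustrative assumptions, and the splat renderer that provides the 2D-only reconstruction supervision is omitted.

```python
# Minimal sketch of a two-stage latent-splat pipeline (all names and shapes
# are hypothetical assumptions, not the authors' implementation).
import torch
import torch.nn as nn

class SplatAutoencoder(nn.Module):
    """Encodes V views into a compressed latent; decodes it to per-pixel Gaussian splats."""
    def __init__(self, latent_dim=256, splat_params=14):
        super().__init__()
        # splat_params: 3 (mean) + 4 (rotation) + 3 (scale) + 1 (opacity) + 3 (color)
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, latent_dim, 4, stride=2, padding=1),
        )
        self.decoder = nn.Sequential(
            nn.Conv2d(latent_dim, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, splat_params, 3, padding=1),
        )

    def forward(self, views):                           # views: (B, V, 3, H, W)
        B, V, C, H, W = views.shape
        z = self.encoder(views.flatten(0, 1))           # per-view latent maps
        z = z.view(B, V, *z.shape[1:]).mean(dim=1)      # fuse views (placeholder)
        splats = self.decoder(z)                        # (B, 14, H/4, W/4)
        return z, splats                                 # splats would be rendered for the 2D loss

class LatentDenoiser(nn.Module):
    """Predicts the noise added to a latent at diffusion time t."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(latent_dim + 1, latent_dim, 3, padding=1), nn.SiLU(),
            nn.Conv2d(latent_dim, latent_dim, 3, padding=1),
        )

    def forward(self, z_noisy, t):                      # t: (B,) in [0, 1]
        t_map = t.view(-1, 1, 1, 1).expand(-1, 1, *z_noisy.shape[-2:])
        return self.net(torch.cat([z_noisy, t_map], dim=1))

def diffusion_step(denoiser, z, alpha_bar):
    """One training step of the stage-2 diffusion model on frozen latents z."""
    t = torch.rand(z.shape[0], device=z.device)
    a = alpha_bar(t).view(-1, 1, 1, 1)
    eps = torch.randn_like(z)
    z_noisy = a.sqrt() * z + (1 - a).sqrt() * eps       # forward noising
    return nn.functional.mse_loss(denoiser(z_noisy, t), eps)

# Hypothetical usage with a cosine noise schedule:
# views = torch.randn(2, 4, 3, 64, 64)
# ae = SplatAutoencoder(); z, splats = ae(views)
# loss = diffusion_step(LatentDenoiser(), z.detach(),
#                       alpha_bar=lambda t: torch.cos(0.5 * torch.pi * t) ** 2)
```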