VideoRFSplat: Direct Scene-Level Text-to-3D Gaussian Splatting Generation with Flexible Pose and Multi-View Joint Modeling

Anonymous ICCV submission
Paper ID 7919

Abstract

We propose VideoRFSplat, a direct text-to-3D model leveraging a video generation model to generate realistic 3D Gaussian Splatting (3DGS) for unbounded real-world scenes. To generate diverse camera poses and the unbounded spatial extent of real-world scenes, while ensuring generalization to arbitrary text prompts, previous methods fine-tune 2D generative models to jointly model camera poses and multi-view images. However, these methods suffer from instability when extending 2D generative models to joint modeling due to the modality gap, which necessitates additional models to stabilize training and inference. In this work, we propose an architecture and a sampling strategy to jointly model multi-view images and camera poses when fine-tuning a video generation model. Our core idea is a dual-stream architecture that attaches a dedicated pose generation model alongside a pre-trained video generation model via communication blocks, generating multi-view images and camera poses through separate streams. This design reduces interference between the pose and image modalities. Additionally, we propose an asynchronous sampling strategy that denoises camera poses faster than multi-view images, allowing rapidly denoised poses to condition multi-view generation, reducing mutual ambiguity and enhancing cross-modal consistency. Trained on multiple large-scale real-world datasets (RealEstate10K, MVImgNet, DL3DV-10K, ACID), VideoRFSplat outperforms existing text-to-3D direct generation methods that heavily depend on post-hoc refinement via score distillation sampling, achieving superior results without such refinement.

Introduction

However, these approaches suffer from instability in extending 2D generative models to joint modeling due to the modality gap, hindering high-quality generation and alignment between generated images and camera poses. To address this, several stabilization techniques leveraging external models have been proposed. For instance, NVComposer [32] leverages DUSt3R [70] distillation to improve consistency, while SplatFlow [19] and Director3D [33] rely on other 2D models and refinements during sampling. While these techniques help reduce instability, the dependency on external models hinders seamless integration into a single model.

In this paper, to eliminate this external dependency, we present VideoRFSplat, a direct 3DGS generation model that introduces an architecture and a sampling strategy for jointly generating camera poses and multi-view images when leveraging a video generation model. Our core idea is a dual-stream architecture that side-attaches a dedicated rectified flow-based pose generation model alongside a pre-trained video generation model, jointly trained to generate multi-view image latents and camera poses simultaneously. This side-attached pose generation model runs parallel to the video model's forward stream, interacting at specific layers while maintaining a separate forward path (see the architecture sketch at the end of this section). This design minimizes interference between the two modalities, allowing each to specialize independently while ensuring consistency between poses and multi-view images. Similar to previous approaches [19, 33, 74], a Gaussian splat decoder decodes 3DGS from the generated poses and image latents in a feed-forward fashion.

We then decouple the timesteps of the pose and multi-view generation models, allowing each to operate at a different noise level independently. Unlike standard approaches that synchronize timesteps and noise levels across modalities, our design permits flexible asynchronous sampling. This design is motivated by our observation that synchronized denoising of multi-view images and camera poses, particularly at early timesteps, leads to mutual ambiguity, increasing uncertainty and causing unstable generation. To mitigate this issue, we let the pose modality, which we find to be more robust to faster denoising, undergo a more rapid denoising process than the images. By doing so, the clearer pose information effectively reduces the mutual ambiguity and stabilizes sampling. Furthermore, we propose an asynchronous adaptation of Classifier-Free Guidance (CFG) that enables the clearer poses to better guide multi-view image generation (see the sampling sketch below). Moreover, the proposed asynchronous sampling strategy with decoupled timesteps naturally extends to the camera-conditional generation task.

We train VideoRFSplat on the RealEstate10K [89], MVImgNet [80], DL3DV-10K [39], and ACID [41] datasets. Notably, VideoRFSplat achieves superior performance without relying on SDS++ refinement [33], surpassing existing text-to-3D direct generation methods that depend on SDS++ [19, 33], demonstrating the effectiveness of our approach and eliminating dependencies on external models.
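To make the dual-stream design concrete, the following is a minimal PyTorch-style sketch, not our actual implementation: the video model's block and the side-attached pose model's block process separate token streams, and a hypothetical CommunicationBlock exchanges information between them via bidirectional cross-attention at selected layers. All module names, dimensions, and the choice of cross-attention are illustrative assumptions.

import torch
import torch.nn as nn

class CommunicationBlock(nn.Module):
    """Hypothetical block letting the image and pose streams exchange
    information via bidirectional cross-attention at selected layers."""
    def __init__(self, img_dim=1024, pose_dim=512, heads=8):
        super().__init__()
        self.pose_to_img = nn.MultiheadAttention(img_dim, heads, kdim=pose_dim,
                                                 vdim=pose_dim, batch_first=True)
        self.img_to_pose = nn.MultiheadAttention(pose_dim, heads, kdim=img_dim,
                                                 vdim=img_dim, batch_first=True)

    def forward(self, img_tok, pose_tok):
        # Each stream attends to the other while keeping its own residual path,
        # so the two modalities interact without sharing a forward stream.
        img_tok = img_tok + self.pose_to_img(img_tok, pose_tok, pose_tok)[0]
        pose_tok = pose_tok + self.img_to_pose(pose_tok, img_tok, img_tok)[0]
        return img_tok, pose_tok

class DualStreamLayer(nn.Module):
    """One layer of the dual-stream denoiser: per-modality blocks (stand-ins
    for the video model's layer and the pose model's layer) plus an optional
    communication block at layers where the streams interact."""
    def __init__(self, img_dim=1024, pose_dim=512, communicate=True):
        super().__init__()
        self.img_block = nn.TransformerEncoderLayer(img_dim, 8, batch_first=True)
        self.pose_block = nn.TransformerEncoderLayer(pose_dim, 8, batch_first=True)
        self.comm = CommunicationBlock(img_dim, pose_dim) if communicate else None

    def forward(self, img_tok, pose_tok):
        img_tok = self.img_block(img_tok)     # pre-trained video stream
        pose_tok = self.pose_block(pose_tok)  # dedicated pose stream
        if self.comm is not None:
            img_tok, pose_tok = self.comm(img_tok, pose_tok)
        return img_tok, pose_tok

# Example: 8 views, 256 image tokens per view, one pose token per view.
layer = DualStreamLayer()
img_tok = torch.randn(1, 8 * 256, 1024)
pose_tok = torch.randn(1, 8, 512)
img_tok, pose_tok = layer(img_tok, pose_tok)

Keeping the pose stream as a separate, smaller transformer with only sparse communication points is what allows the image stream to remain close to the pre-trained video prior while the pose stream specializes on its own modality.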
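To illustrate how decoupled timesteps could be wired into sampling, the following is a minimal rectified-flow Euler sampler sketch under our own assumptions about the interface: a single model(img, pose, t_img, t_pose, text_emb) call returning per-stream velocities, a simple linear speed-up of the pose schedule, and plain CFG applied to both streams. It is not the paper's exact schedule or its asynchronous CFG formulation.

import torch

@torch.no_grad()
def asynchronous_sampling(model, text_emb, num_steps=50, pose_speedup=2.0,
                          cfg_scale=5.0, img_shape=(8, 16, 32, 32), pose_dim=9):
    """Illustrative Euler sampler for rectified flow with decoupled timesteps.

    `model(img, pose, t_img, t_pose, text_emb)` is an assumed interface that
    returns predicted velocities (v_img, v_pose) for both streams; passing
    text_emb=None stands in for the unconditional branch used by CFG.
    The pose timestep decays `pose_speedup` times faster, so poses become
    clean early and act as near-clean conditioning for the image stream.
    """
    img = torch.randn(1, *img_shape)               # multi-view image latents
    pose = torch.randn(1, img_shape[0], pose_dim)  # one pose vector per view

    ts = torch.linspace(1.0, 0.0, num_steps + 1)   # t=1: pure noise, t=0: data
    for i in range(num_steps):
        t_img, t_img_next = ts[i], ts[i + 1]
        # Pose runs on a faster schedule, clamped at 0 once fully denoised.
        t_pose = (t_img * pose_speedup - (pose_speedup - 1.0)).clamp(min=0.0)
        t_pose_next = (t_img_next * pose_speedup - (pose_speedup - 1.0)).clamp(min=0.0)

        v_img_c, v_pose_c = model(img, pose, t_img, t_pose, text_emb)
        v_img_u, v_pose_u = model(img, pose, t_img, t_pose, None)

        # Plain classifier-free guidance on both streams; an asynchronous
        # variant could additionally exploit how clean the pose already is.
        v_img = v_img_u + cfg_scale * (v_img_c - v_img_u)
        v_pose = v_pose_u + cfg_scale * (v_pose_c - v_pose_u)

        img = img + (t_img_next - t_img) * v_img
        if t_pose > 0:
            pose = pose + (t_pose_next - t_pose) * v_pose
    return img, pose

With pose_speedup=2, the clamp makes the pose stream reach t=0 roughly halfway through the trajectory; afterwards the already-clean poses simply keep conditioning the image stream through the joint model call, which is one way to realize "faster pose denoising reduces mutual ambiguity."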

Related Works

