VD3D: Taming large video diffusion transformers for 3D camera control

S Bahmani, I Skorokhodov, A Siarohin… - arXiv preprint arXiv …, 2024 - arxiv.org
Modern text-to-video synthesis models demonstrate coherent, photorealistic generation of
complex videos from a text description. However, most existing models lack fine-grained …

Splatt3R: Zero-shot Gaussian splatting from uncalibrated image pairs

B Smart, C Zheng, I Laina, VA Prisacariu - arXiv preprint arXiv:2408.13912, 2024 - arxiv.org
In this paper, we introduce Splatt3R, a pose-free, feed-forward method for in-the-wild 3D
reconstruction and novel view synthesis from stereo pairs. Given uncalibrated natural …

ReconX: Reconstruct any scene from sparse views with video diffusion model

F Liu, W Sun, H Wang, Y Wang, H Sun, J Ye… - arXiv preprint arXiv …, 2024 - arxiv.org
Advancements in 3D scene reconstruction have transformed 2D images from the real world
into 3D models, producing realistic 3D results from hundreds of input photos. Despite great …

MVSplat360: Feed-forward 360° scene synthesis from sparse views

Y Chen, C Zheng, H Xu, B Zhuang, A Vedaldi… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce MVSplat360, a feed-forward approach for 360° novel view synthesis
(NVS) of diverse real-world scenes, using only sparse observations. This setting is …

MultiDepth: Multi-Sample Priors for Refining Monocular Metric Depth Estimations in Indoor Scenes

S Byun, J Song, WS Chung - arXiv preprint arXiv:2411.01048, 2024 - arxiv.org
Monocular metric depth estimation (MMDE) is a crucial task to solve for indoor scene
reconstruction on edge devices. Despite this importance, existing models are sensitive to …

GaussTR: Foundation Model-Aligned Gaussian Transformer for Self-Supervised 3D Spatial Understanding

H Jiang, L Liu, T Cheng, X Wang, T Lin, Z Su… - arXiv preprint arXiv …, 2024 - arxiv.org
3D Semantic Occupancy Prediction is fundamental for spatial understanding as it provides a
comprehensive semantic cognition of surrounding environments. However, prevalent …

Novel View Synthesis with Pixel-Space Diffusion Models

N Elata, B Kawar, Y Ostrovsky-Berman… - arXiv preprint arXiv …, 2024 - arxiv.org
Synthesizing a novel view from a single input image is a challenging task. Traditionally, this
task was approached by estimating scene depth, warping, and inpainting, with machine …

AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers

S Bahmani, I Skorokhodov, G Qian, A Siarohin… - arXiv preprint arXiv …, 2024 - arxiv.org
Numerous works have recently integrated 3D camera control into foundational text-to-video
models, but the resulting camera control is often imprecise, and video generation quality …

Generative Densification: Learning to Densify Gaussians for High-Fidelity Generalizable 3D Reconstruction

S Nam, X Sun, G Kang, Y Lee, S Oh, E Park - arXiv preprint arXiv …, 2024 - arxiv.org
Generalized feed-forward Gaussian models have achieved significant progress in sparse-
view 3D reconstruction by leveraging prior knowledge from large multi-view datasets …

A Lesson in Splats: Teacher-Guided Diffusion for 3D Gaussian Splats Generation with 2D Supervision

C Peng, I Sobol, M Tomizuka, K Keutzer, C Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce a diffusion model for Gaussian Splats, SplatDiffusion, to enable generation of
three-dimensional structures from single images, addressing the ill-posed nature of lifting 2D …