SceneVerse: Scaling 3D vision-language learning for grounded scene understanding

B Jia, Y Chen, H Yu, Y Wang, X Niu, T Liu, Q Li… - … on Computer Vision, 2025 - Springer
3D vision-language (3D-VL) grounding, which aims to align language with 3D
physical environments, stands as a cornerstone in developing embodied agents. In …

AnyHome: Open-vocabulary generation of structured and textured 3D homes

R Fu, Z Wen, Z Liu, S Sridhar - European Conference on Computer Vision, 2025 - Springer
Inspired by cognitive theories, we introduce AnyHome, a framework that translates any text
into well-structured and textured indoor scenes at a house-scale. By prompting Large …

PhyScene: Physically interactable 3D scene synthesis for embodied AI

Y Yang, B Jia, P Zhi, S Huang - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
With recent developments in Embodied Artificial Intelligence (EAI) research, there has been
a growing demand for high-quality large-scale interactive scene generation. While prior …

PoliFormer: Scaling On-Policy RL with Transformers Results in Masterful Navigators

KH Zeng, Z Zhang, K Ehsani, R Hendrix… - arXiv preprint arXiv …, 2024 - arxiv.org
We present PoliFormer (Policy Transformer), an RGB-only indoor navigation agent trained
end-to-end with reinforcement learning at scale that generalizes to the real-world without …

SceneMotifCoder: Example-driven visual program learning for generating 3D object arrangements

HII Tam, HID Pun, AT Wang, AX Chang… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite advances in text-to-3D generation methods, generation of multi-object
arrangements remains challenging. Current methods exhibit failures in generating …

Seeing the Unseen: Visual Common Sense for Semantic Placement

R Ramrakhya, A Kembhavi, D Batra… - Proceedings of the …, 2024 - openaccess.thecvf.com
Computer vision tasks typically involve describing what is visible in an image (e.g.,
classification, detection, segmentation, and captioning). We study a visual common sense task …

Infinigen Indoors: Photorealistic Indoor Scenes using Procedural Generation

A Raistrick, L Mei, K Kayan, D Yan… - Proceedings of the …, 2024 - openaccess.thecvf.com
We introduce Infinigen Indoors, a Blender-based procedural generator of
photorealistic indoor scenes. It builds upon the existing Infinigen system, which focuses on …

ReALFRED: An Embodied Instruction Following Benchmark in Photo-Realistic Environments

T Kim, C Min, B Kim, J Kim, W Jeung, J Choi - European Conference on …, 2025 - Springer
Simulated virtual environments have been widely used to learn robotic agents that perform
daily household tasks. These environments encourage research progress by far, but often …

Pre-trained text-to-image diffusion models are versatile representation learners for control

G Gupta, K Yadav, Y Gal, D Batra, Z Kira, C Lu… - arXiv preprint arXiv …, 2024 - arxiv.org
Embodied AI agents require a fine-grained understanding of the physical world mediated
through visual and language inputs. Such capabilities are difficult to learn solely from task …

S2O: Static to openable enhancement for articulated 3D objects

D Iliash, H Jiang, Y Zhang, M Savva… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite much progress in large 3D datasets, there are currently few interactive 3D object
datasets, and their scale is limited due to the manual effort required in their construction. We …