Habitat Synthetic Scenes Dataset (HSSD-200): An analysis of 3D scene scale and realism tradeoffs...

B Jia, Y Chen, H Yu, Y Wang, X Niu, T Liu, Q Li… - … on Computer Vision, 2025 - Springer

3D vision-language (3D-VL) grounding, which aims to align language with 3D
physical environments, stands as a cornerstone in developing embodied agents. In …

Cited by 23 Related articles All 2 versions

AnyHome: Open-vocabulary generation of structured and textured 3D homes

R Fu, Z Wen, Z Liu, S Sridhar - European Conference on Computer Vision, 2025 - Springer

Inspired by cognitive theories, we introduce AnyHome, a framework that translates any text
into well-structured and textured indoor scenes at a house-scale. By prompting Large …

Cited by 11 Related articles All 3 versions

PhyScene: Physically interactable 3D scene synthesis for embodied AI

Y Yang, B Jia, P Zhi, S Huang - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com

With recent developments in Embodied Artificial Intelligence (EAI) research, there has been
a growing demand for high-quality large-scale interactive scene generation. While prior …

Cited by 17 Related articles All 3 versions

PoliFormer: Scaling On-Policy RL with Transformers Results in Masterful Navigators

KH Zeng, Z Zhang, K Ehsani, R Hendrix… - arXiv preprint arXiv …, 2024 - arxiv.org

We present PoliFormer (Policy Transformer), an RGB-only indoor navigation agent trained
end-to-end with reinforcement learning at scale that generalizes to the real-world without …

Cited by 3 Related articles All 3 versions

SceneMotifCoder: Example-driven visual program learning for generating 3D object arrangements

HII Tam, HID Pun, AT Wang, AX Chang… - arXiv preprint arXiv …, 2024 - arxiv.org

Despite advances in text-to-3D generation methods, generation of multi-object
arrangements remains challenging. Current methods exhibit failures in generating …

Cited by 3 Related articles All 3 versions

Seeing the Unseen: Visual Common Sense for Semantic Placement

R Ramrakhya, A Kembhavi, D Batra… - Proceedings of the …, 2024 - openaccess.thecvf.com

Computer vision tasks typically involve describing what is visible in an image (e.g.,
classification, detection, segmentation, and captioning). We study a visual common sense task …

Cited by 3 Related articles All 4 versions

Infinigen Indoors: Photorealistic Indoor Scenes using Procedural Generation

A Raistrick, L Mei, K Kayan, D Yan… - Proceedings of the …, 2024 - openaccess.thecvf.com

We introduce Infinigen Indoors, a Blender-based procedural generator of
photorealistic indoor scenes. It builds upon the existing Infinigen system, which focuses on …

Cited by 7 Related articles All 2 versions

ReALFRED: An Embodied Instruction Following Benchmark in Photo-Realistic Environments

T Kim, C Min, B Kim, J Kim, W Jeung, J Choi - European Conference on …, 2025 - Springer

Simulated virtual environments have been widely used to learn robotic agents that perform
daily household tasks. These environments encourage research progress by far, but often …

Cited by 1 Related articles All 6 versions

Pre-trained text-to-image diffusion models are versatile representation learners for control

G Gupta, K Yadav, Y Gal, D Batra, Z Kira, C Lu… - arXiv preprint arXiv …, 2024 - arxiv.org

Embodied AI agents require a fine-grained understanding of the physical world mediated
through visual and language inputs. Such capabilities are difficult to learn solely from task …

Cited by 1 Related articles All 4 versions

S2O: Static to openable enhancement for articulated 3D objects

D Iliash, H Jiang, Y Zhang, M Savva… - arXiv preprint arXiv …, 2024 - arxiv.org

Despite much progress in large 3D datasets, there are currently few interactive 3D object
datasets, and their scale is limited due to the manual effort required in their construction. We …

Cited by 1 Related articles All 2 versions