Advancing 3D point cloud understanding through deep transfer learning: A comprehensive survey

SS Sohail, Y Himeur, H Kheddar, A Amira, F Fadli… - Information …, 2024 - Elsevier
The 3D point cloud (3DPC) has significantly evolved and benefited from the advance of
deep learning (DL). However, the latter faces various issues, including the lack of data or …

Video description: A comprehensive survey of deep learning approaches

G Rafiq, M Rafiq, GS Choi - Artificial Intelligence Review, 2023 - Springer
Video description refers to understanding visual content and transforming that acquired
understanding into automatic textual narration. It bridges the key AI fields of computer vision …

2DPASS: 2D priors assisted semantic segmentation on LiDAR point clouds

X Yan, J Gao, C Zheng, C Zheng, R Zhang… - … on Computer Vision, 2022 - Springer
As camera and LiDAR sensors capture complementary information in autonomous driving,
great efforts have been made to conduct semantic segmentation through multi-modality data …

SceneVerse: Scaling 3D vision-language learning for grounded scene understanding

B Jia, Y Chen, H Yu, Y Wang, X Niu, T Liu, Q Li… - … on Computer Vision, 2025 - Springer
3D vision-language (3D-VL) grounding, which aims to align language with 3D
physical environments, stands as a cornerstone in developing embodied agents. In …

Point-Bind & Point-LLM: Aligning point cloud with multi-modality for 3D understanding, generation, and instruction following

Z Guo, R Zhang, X Zhu, Y Tang, X Ma, J Han… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce Point-Bind, a 3D multi-modality model aligning point clouds with 2D image,
language, audio, and video. Guided by ImageBind, we construct a joint embedding space …

EDA: Explicit text-decoupling and dense alignment for 3D visual grounding

Y Wu, X Cheng, R Zhang, Z Cheng… - Proceedings of the …, 2023 - openaccess.thecvf.com
3D visual grounding aims to find the object within point clouds mentioned by free-form
natural language descriptions with rich semantic cues. However, existing methods …

Context-aware alignment and mutual masking for 3D-language pre-training

Z Jin, M Hayat, Y Yang, Y Guo… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
3D visual language reasoning plays an important role in effective human-computer
interaction. The current approaches for 3D visual reasoning are task-specific, and lack pre …

UniT3D: A unified transformer for 3D dense captioning and visual grounding

Z Chen, R Hu, X Chen, M Nießner… - Proceedings of the …, 2023 - openaccess.thecvf.com
Performing 3D dense captioning and visual grounding requires a common and shared
understanding of the underlying multimodal relationships. However, despite some previous …

VL-SAT: Visual-linguistic semantics assisted training for 3D semantic scene graph prediction in point cloud

Z Wang, B Cheng, L Zhao, D Xu… - Proceedings of the …, 2023 - openaccess.thecvf.com
The task of 3D semantic scene graph (3DSSG) prediction in the point cloud is challenging
since (1) the 3D point cloud only captures geometric structures with limited semantics …

End-to-end 3D dense captioning with Vote2Cap-DETR

S Chen, H Zhu, X Chen, Y Lei… - Proceedings of the …, 2023 - openaccess.thecvf.com
3D dense captioning aims to generate multiple captions localized with their
associated object regions. Existing methods follow a sophisticated "detect-then-describe" …