Aligning cyber space with physical world: A comprehensive survey on Embodied AI

Y Liu, W Chen, Y Bai, X Liang, G Li, W Gao… - arXiv preprint arXiv …, 2024 - arxiv.org
Embodied Artificial Intelligence (Embodied AI) is crucial for achieving Artificial General
Intelligence (AGI) and serves as a foundation for various applications that bridge cyberspace …

Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs

S Tong, E Brown, P Wu, S Woo, M Middepogu… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-
centric approach. While stronger language models can enhance multimodal capabilities, the …

A survey on benchmarks of multimodal large language models

J Li, W Lu, H Fei, M Luo, M Dai, M Xia, Y Jin… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal Large Language Models (MLLMs) are gaining increasing popularity in both
academia and industry due to their remarkable performance in various applications such as …

A survey on evaluation of multimodal large language models

J Huang, J Zhang - arXiv preprint arXiv:2408.15769, 2024 - arxiv.org
Multimodal Large Language Models (MLLMs) mimic the human perception and reasoning
system by integrating powerful Large Language Models (LLMs) with various modality …

Continual LLaVA: Continual instruction tuning in large vision-language models

M Cao, Y Liu, Y Liu, T Wang, J Dong, H Ding… - arXiv preprint arXiv …, 2024 - arxiv.org
Instruction tuning constitutes a prevalent technique for tailoring Large Vision Language
Models (LVLMs) to meet individual task requirements. To date, most of the existing …

Coarse correspondences elicit 3D spacetime understanding in multimodal language model

B Liu, Y Dong, Y Wang, Y Rao, Y Tang, WC Ma… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal language models (MLLMs) are increasingly being implemented in real-world
environments, necessitating their ability to interpret 3D spaces and comprehend temporal …

HourVideo: 1-hour video-language understanding

K Chandrasegaran, A Gupta, LM Hadzic, T Kota… - arXiv preprint arXiv …, 2024 - arxiv.org
We present HourVideo, a benchmark dataset for hour-long video-language understanding.
Our dataset consists of a novel task suite comprising summarization, perception (recall …

Affordances-Oriented Planning using Foundation Models for Continuous Vision-Language Navigation

J Chen, B Lin, X Liu, L Ma, X Liang… - arXiv preprint arXiv …, 2024 - arxiv.org
LLM-based agents have demonstrated impressive zero-shot performance in the vision-
language navigation (VLN) task. However, existing LLM-based methods often focus only on …

VLM-Grounder: A VLM agent for zero-shot 3D visual grounding

R Xu, Z Huang, T Wang, Y Chen, J Pang… - arXiv preprint arXiv …, 2024 - arxiv.org
3D visual grounding is crucial for robots, requiring integration of natural language and 3D
scene understanding. Traditional methods depending on supervised learning with 3D point …

A survey on multimodal benchmarks: In the era of large AI models

L Li, G Chen, H Shi, J Xiao, L Chen - arXiv preprint arXiv:2409.18142, 2024 - arxiv.org
The rapid evolution of Multimodal Large Language Models (MLLMs) has brought substantial
advancements in artificial intelligence, significantly enhancing the capability to understand …