Aligning cyber space with physical world: A comprehensive survey on Embodied AI

Y Liu, W Chen, Y Bai, X Liang, G Li, W Gao… - arXiv preprint arXiv …, 2024 - arxiv.org
Embodied Artificial Intelligence (Embodied AI) is crucial for achieving Artificial General
Intelligence (AGI) and serves as a foundation for various applications that bridge cyberspace …

Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs

S Tong, E Brown, P Wu, S Woo, M Middepogu… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-
centric approach. While stronger language models can enhance multimodal capabilities, the …

A survey on benchmarks of multimodal large language models

J Li, W Lu, H Fei, M Luo, M Dai, M Xia, Y Jin… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal Large Language Models (MLLMs) are gaining increasing popularity in both
academia and industry due to their remarkable performance in various applications such as …

A survey on evaluation of multimodal large language models

J Huang, J Zhang - arXiv preprint arXiv:2408.15769, 2024 - arxiv.org
Multimodal Large Language Models (MLLMs) mimic the human perception and reasoning
system by integrating powerful Large Language Models (LLMs) with various modality …

Continual LLaVA: Continual instruction tuning in large vision-language models

M Cao, Y Liu, Y Liu, T Wang, J Dong, H Ding… - arXiv preprint arXiv …, 2024 - arxiv.org
Instruction tuning constitutes a prevalent technique for tailoring Large Vision Language
Models (LVLMs) to meet individual task requirements. To date, most of the existing …

Coarse correspondences elicit 3D spacetime understanding in multimodal language model

B Liu, Y Dong, Y Wang, Y Rao, Y Tang, WC Ma… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal language models (MLLMs) are increasingly being implemented in real-world
environments, necessitating their ability to interpret 3D spaces and comprehend temporal …

HourVideo: 1-hour video-language understanding

K Chandrasegaran, A Gupta, LM Hadzic, T Kota… - arXiv preprint arXiv …, 2024 - arxiv.org
We present HourVideo, a benchmark dataset for hour-long video-language understanding.
Our dataset consists of a novel task suite comprising summarization, perception (recall …

Affordances-Oriented Planning using Foundation Models for Continuous Vision-Language Navigation

J Chen, B Lin, X Liu, L Ma, X Liang… - arXiv preprint arXiv …, 2024 - arxiv.org
LLM-based agents have demonstrated impressive zero-shot performance in the vision-
language navigation (VLN) task. However, existing LLM-based methods often focus only on …

VLM-Grounder: A VLM agent for zero-shot 3D visual grounding

R Xu, Z Huang, T Wang, Y Chen, J Pang… - arXiv preprint arXiv …, 2024 - arxiv.org
3D visual grounding is crucial for robots, requiring integration of natural language and 3D
scene understanding. Traditional methods depending on supervised learning with 3D point …

A survey on multimodal benchmarks: In the era of large AI models

L Li, G Chen, H Shi, J Xiao, L Chen - arXiv preprint arXiv:2409.18142, 2024 - arxiv.org
The rapid evolution of Multimodal Large Language Models (MLLMs) has brought substantial
advancements in artificial intelligence, significantly enhancing the capability to understand …