Vary: Scaling up the vision vocabulary for large vision-language models

H Wei, L Kong, J Chen, L Zhao, Z Ge, J Yang… - arXiv preprint arXiv …, 2023 - arxiv.org
Modern Large Vision-Language Models (LVLMs) share the same vision vocabulary--CLIP,
which can cover most common vision tasks. However, for some special vision tasks that …

Small language model meets with reinforced vision vocabulary

H Wei, L Kong, J Chen, L Zhao, Z Ge, E Yu… - arXiv preprint arXiv …, 2024 - arxiv.org
Playing with Large Vision Language Models (LVLMs) in 2023 is trendy in the AI community.
However, the relatively large number of parameters (more than 7B) of popular LVLMs makes …

OneChart: Purify the Chart Structural Extraction via One Auxiliary Token

J Chen, L Kong, H Wei, C Liu, Z Ge, L Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
Chart parsing poses a significant challenge due to the diversity of styles, values, texts, and
so forth. Even advanced large vision-language models (LVLMs) with billions of parameters …

MotionLLM: Understanding Human Behaviors from Human Motions and Videos

LH Chen, S Lu, A Zeng, H Zhang, B Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
This study delves into the realm of multi-modality (i.e., video and motion modalities) human
behavior understanding by leveraging the powerful capabilities of Large Language Models …

Uncertainty-boosted Robust Video Activity Anticipation

Z Qi, S Wang, W Zhang, Q Huang - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Video activity anticipation aims to predict what will happen in the future, embracing a broad
application prospect ranging from robot vision to autonomous driving. Despite the recent …

Self-Supervised Visual Preference Alignment

K Zhu, L Zhao, Z Ge, X Zhang - arXiv preprint arXiv:2404.10501, 2024 - arxiv.org
This paper makes the first attempt towards unsupervised preference alignment in Vision-
Language Models (VLMs). We generate chosen and rejected responses with regard to the …

Is a 3D-Tokenized LLM the Key to Reliable Autonomous Driving?

Y Bai, D Wu, Y Liu, F Jia, W Mao, Z Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Rapid advancements in Autonomous Driving (AD) tasks have driven a significant shift toward end-
to-end approaches, particularly in the utilization of vision-language models (VLMs) that integrate …

Emotion Recognition from Videos Using Multimodal Large Language Models

L Vaiani, L Cagliero, P Garza - Future Internet, 2024 - search.proquest.com
The diffusion of Multimodal Large Language Models (MLLMs) has opened new
research directions in the context of video content understanding and classification. Emotion …