Llama-adapter v2: Parameter-efficient visual instruction model

H Naveed, AU Khan, S Qiu, M Saqib, S Anwar… - arXiv preprint arXiv …, 2023 - arxiv.org

Large Language Models (LLMs) have recently demonstrated remarkable capabilities in
natural language processing tasks and beyond. This success of LLMs has led to a large …

被引用次数：365 相关文章所有 3 个版本

[PDF] acm.org

Foundations & trends in multimodal machine learning: Principles, challenges, and open questions

PP Liang, A Zadeh, LP Morency - ACM Computing Surveys, 2024 - dl.acm.org

Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …

被引用次数：25 相关文章

[PDF] neurips.cc

Are aligned neural networks adversarially aligned?

N Carlini, M Nasr… - Advances in …, 2024 - proceedings.neurips.cc

Large language models are now tuned to align with the goals of their creators, namely to be"
helpful and harmless." These models should respond helpfully to user questions, but refuse …

被引用次数：175 相关文章所有 6 个版本

[PDF] neurips.cc

Chameleon: Plug-and-play compositional reasoning with large language models

P Lu, B Peng, H Cheng, M Galley… - Advances in …, 2024 - proceedings.neurips.cc

Large language models (LLMs) have achieved remarkable progress in solving various
natural language processing tasks due to emergent reasoning abilities. However, LLMs …

被引用次数：293 相关文章所有 10 个版本

[PDF] thecvf.com

Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

X Yue, Y Ni, K Zhang, T Zheng, R Liu… - Proceedings of the …, 2024 - openaccess.thecvf.com

We introduce MMMU: a new benchmark designed to evaluate multimodal models on
massive multi-discipline tasks demanding college-level subject knowledge and deliberate …

被引用次数：228 相关文章所有 3 个版本

[PDF] arxiv.org

Qwen-vl: A frontier large vision-language model with versatile abilities

J Bai, S Bai, S Yang, S Wang, S Tan, P Wang… - arXiv preprint arXiv …, 2023 - arxiv.org

In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models
(LVLMs) designed to perceive and understand both texts and images. Starting from the …

被引用次数：585 相关文章所有 2 个版本

[PDF] arxiv.org

Video-llama: An instruction-tuned audio-visual language model for video understanding

H Zhang, X Li, L Bing - arXiv preprint arXiv:2306.02858, 2023 - arxiv.org

We present Video-LLaMA, a multi-modal framework that empowers Large Language Models
(LLMs) with the capability of understanding both visual and auditory content in the video …

被引用次数：461 相关文章所有 3 个版本

[PDF] arxiv.org

A survey on multimodal large language models

S Yin, C Fu, S Zhao, K Li, X Sun, T Xu… - arXiv preprint arXiv …, 2023 - arxiv.org

Multimodal Large Language Model (MLLM) recently has been a new rising research
hotspot, which uses powerful Large Language Models (LLMs) as a brain to perform …

被引用次数：708 相关文章所有 6 个版本

[PDF] arxiv.org

Evaluating object hallucination in large vision-language models

Y Li, Y Du, K Zhou, J Wang, WX Zhao… - arXiv preprint arXiv …, 2023 - arxiv.org

Inspired by the superior language abilities of large language models (LLM), large vision-
language models (LVLM) have been recently explored by integrating powerful LLMs for …

被引用次数：422 相关文章所有 6 个版本

[PDF] thecvf.com

mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration

Q Ye, H Xu, J Ye, M Yan, A Hu, H Liu… - Proceedings of the …, 2024 - openaccess.thecvf.com

Abstract Multi-modal Large Language Models (MLLMs) have demonstrated impressive
instruction abilities across various open-ended tasks. However previous methods have …

被引用次数：175 相关文章所有 4 个版本