A comprehensive overview of large language models

H Naveed, AU Khan, S Qiu, M Saqib, S Anwar… - arXiv preprint arXiv …, 2023 - arxiv.org
Large Language Models (LLMs) have recently demonstrated remarkable capabilities in
natural language processing tasks and beyond. This success of LLMs has led to a large …

A survey on multimodal large language models

S Yin, C Fu, S Zhao, K Li, X Sun, T Xu… - arXiv preprint arXiv …, 2023 - arxiv.org
Multimodal Large Language Models (MLLMs) have recently emerged as a rising research
hotspot; they use powerful Large Language Models (LLMs) as a brain to perform …

Generative AI and ChatGPT: Applications, challenges, and AI-human collaboration

F Fui-Hoon Nah, R Zheng, J Cai, K Siau… - Journal of Information …, 2023 - Taylor & Francis
Artificial intelligence (AI) has elicited much attention across disciplines and industries (Hyder
et al., 2019). AI has been defined as “a system's ability to correctly interpret external data, to …

VILA: On pre-training for visual language models

J Lin, H Yin, W Ping, P Molchanov… - Proceedings of the …, 2024 - openaccess.thecvf.com
Visual language models (VLMs) rapidly progressed with the recent success of large
language models. There have been growing efforts on visual instruction tuning to extend the …

Multimodal foundation models: From specialists to general-purpose assistants

C Li, Z Gan, Z Yang, J Yang, L Li… - … and Trends® in …, 2024 - nowpublishers.com
This monograph presents a comprehensive survey of the taxonomy and evolution of multimodal
foundation models with vision and vision-language capabilities, focusing on the transition from specialist models to general-purpose …

ShareGPT4V: Improving large multi-modal models with better captions

L Chen, J Li, X Dong, P Zhang, C He, J Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
In the realm of large multi-modal models (LMMs), efficient modality alignment is crucial yet
often constrained by the scarcity of high-quality image-text data. To address this bottleneck …

Chat-UniVi: Unified visual representation empowers large language models with image and video understanding

P Jin, R Takanobu, W Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Large language models have demonstrated impressive universal capabilities across a wide
range of open-ended tasks and have extended their utility to encompass multimodal …

InternLM-XComposer: A vision-language large model for advanced text-image comprehension and composition

P Zhang, X Dong, B Wang, Y Cao, C Xu, L Ouyang… - arXiv preprint arXiv …, 2023 - arxiv.org
We propose InternLM-XComposer, a vision-language large model that enables advanced
image-text comprehension and composition. The innovative nature of our model is …

How Robust is Google's Bard to Adversarial Image Attacks?

Y Dong, H Chen, J Chen, Z Fang, X Yang… - arXiv preprint arXiv …, 2023 - arxiv.org
Multimodal Large Language Models (MLLMs) that integrate text and other modalities
(especially vision) have achieved unprecedented performance in various multimodal tasks …

The (r)evolution of multimodal large language models: A survey

D Caffagni, F Cocchi, L Barsellotti, N Moratelli… - arXiv preprint arXiv …, 2024 - arxiv.org
Connecting text and visual modalities plays an essential role in generative intelligence. For
this reason, inspired by the success of large language models, significant research efforts …