Honeybee: Locality-enhanced projector for multimodal LLM

J Cha, W Kang, J Mun, B Roh - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
In Multimodal Large Language Models (MLLMs), a visual projector plays a crucial
role in bridging pre-trained vision encoders with LLMs, enabling profound visual …

OneLLM: One framework to align all modalities with language

J Han, K Gong, Y Zhang, J Wang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Multimodal large language models (MLLMs) have gained significant attention due to their
strong multimodal understanding capability. However, existing works rely heavily on modality …

Listen, think, and understand

Y Gong, H Luo, AH Liu, L Karlinsky, J Glass - arXiv preprint arXiv …, 2023 - arxiv.org
The ability of artificial intelligence (AI) systems to perceive and comprehend audio signals is
crucial for many applications. Although significant progress has been made in this area …

On decoder-only architecture for speech-to-text and large language model integration

J Wu, Y Gaur, Z Chen, L Zhou, Y Zhu… - 2023 IEEE Automatic …, 2023 - ieeexplore.ieee.org
Large language models (LLMs) have achieved remarkable success in the field of natural
language processing, enabling better human-computer interaction using natural language …

Sparks of large audio models: A survey and outlook

S Latif, M Shoukat, F Shamshad, M Usama… - arXiv preprint arXiv …, 2023 - arxiv.org
This survey paper provides a comprehensive overview of the recent advancements and
challenges in applying large language models to the field of audio signal processing. Audio …

SLM: Bridge the thin gap between speech and text foundation models

M Wang, W Han, I Shafran, Z Wu… - 2023 IEEE Automatic …, 2023 - ieeexplore.ieee.org
We present a joint Speech and Language Model (SLM), a multitask, multilingual, and
dual-modal model that takes advantage of pretrained foundational speech and language models …

X-InstructBLIP: A framework for aligning X-modal instruction-aware representations to LLMs and emergent cross-modal reasoning

A Panagopoulou, L Xue, N Yu, J Li, D Li, S Joty… - arXiv preprint arXiv …, 2023 - arxiv.org
Vision-language pre-training and instruction tuning have demonstrated general-purpose
capabilities in 2D visual reasoning tasks by aligning visual encoders with state-of-the-art …

Natural language supervision for general-purpose audio representations

B Elizalde, S Deshmukh, H Wang - ICASSP 2024-2024 IEEE …, 2024 - ieeexplore.ieee.org
Audio-Language models jointly learn multimodal text and audio representations that enable
Zero-Shot inference. Models rely on the encoders to create powerful representations of the …

LoFT: Local proxy fine-tuning for improving transferability of adversarial attacks against large language models

MA Shah, R Sharma, H Dhamyal, R Olivier… - arXiv preprint arXiv …, 2023 - arxiv.org
It has been shown that Large Language Model (LLM) alignments can be circumvented by
appending specially crafted attack suffixes to harmful queries to elicit harmful responses …

An image grid can be worth a video: Zero-shot video question answering using a VLM

W Kim, C Choi, W Lee, W Rhee - arXiv preprint arXiv:2403.18406, 2024 - arxiv.org
Stimulated by the sophisticated reasoning capabilities of recent Large Language Models
(LLMs), a variety of strategies for bridging the video modality have been devised. A prominent …