Honeybee: Locality-enhanced projector for multimodal LLM

J Cha, W Kang, J Mun, B Roh - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
In Multimodal Large Language Models (MLLMs), a visual projector plays a crucial
role in bridging pre-trained vision encoders with LLMs, enabling profound visual …

OneLLM: One framework to align all modalities with language

J Han, K Gong, Y Zhang, J Wang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Multimodal large language models (MLLMs) have gained significant attention due to their
strong multimodal understanding capability. However, existing works rely heavily on modality …

Listen, think, and understand

Y Gong, H Luo, AH Liu, L Karlinsky, J Glass - arXiv preprint arXiv …, 2023 - arxiv.org
The ability of artificial intelligence (AI) systems to perceive and comprehend audio signals is
crucial for many applications. Although significant progress has been made in this area …

On decoder-only architecture for speech-to-text and large language model integration

J Wu, Y Gaur, Z Chen, L Zhou, Y Zhu… - 2023 IEEE Automatic …, 2023 - ieeexplore.ieee.org
Large language models (LLMs) have achieved remarkable success in the field of natural
language processing, enabling better human-computer interaction using natural language …

Sparks of large audio models: A survey and outlook

S Latif, M Shoukat, F Shamshad, M Usama… - arXiv preprint arXiv …, 2023 - arxiv.org
This survey paper provides a comprehensive overview of the recent advancements and
challenges in applying large language models to the field of audio signal processing. Audio …

SLM: Bridge the thin gap between speech and text foundation models

M Wang, W Han, I Shafran, Z Wu… - 2023 IEEE Automatic …, 2023 - ieeexplore.ieee.org
We present a joint Speech and Language Model (SLM), a multitask, multilingual, and
dual-modal model that takes advantage of pretrained foundational speech and language models …

X-InstructBLIP: A framework for aligning X-modal instruction-aware representations to LLMs and emergent cross-modal reasoning

A Panagopoulou, L Xue, N Yu, J Li, D Li, S Joty… - arXiv preprint arXiv …, 2023 - arxiv.org
Vision-language pre-training and instruction tuning have demonstrated general-purpose
capabilities in 2D visual reasoning tasks by aligning visual encoders with state-of-the-art …

Natural language supervision for general-purpose audio representations

B Elizalde, S Deshmukh, H Wang - ICASSP 2024-2024 IEEE …, 2024 - ieeexplore.ieee.org
Audio-Language models jointly learn multimodal text and audio representations that enable
Zero-Shot inference. Models rely on the encoders to create powerful representations of the …

LoFT: Local proxy fine-tuning for improving transferability of adversarial attacks against large language models

MA Shah, R Sharma, H Dhamyal, R Olivier… - arXiv preprint arXiv …, 2023 - arxiv.org
It has been shown that Large Language Model (LLM) alignments can be circumvented by
appending specially crafted attack suffixes to harmful queries to elicit harmful responses …

An image grid can be worth a video: Zero-shot video question answering using a VLM

W Kim, C Choi, W Lee, W Rhee - arXiv preprint arXiv:2403.18406, 2024 - arxiv.org
Stimulated by the sophisticated reasoning capabilities of recent Large Language Models
(LLMs), a variety of strategies for bridging the video modality have been devised. A prominent …