Honeybee: Locality-enhanced projector for multimodal LLM
In Multimodal Large Language Models (MLLMs), a visual projector plays a crucial
role in bridging pre-trained vision encoders with LLMs, enabling profound visual …
OneLLM: One framework to align all modalities with language
Multimodal large language models (MLLMs) have gained significant attention due to their
strong multimodal understanding capability. However, existing works rely heavily on modality …
Listen, think, and understand
The ability of artificial intelligence (AI) systems to perceive and comprehend audio signals is
crucial for many applications. Although significant progress has been made in this area …
On decoder-only architecture for speech-to-text and large language model integration
Large language models (LLMs) have achieved remarkable success in the field of natural
language processing, enabling better human-computer interaction using natural language …
Sparks of large audio models: A survey and outlook
This survey paper provides a comprehensive overview of the recent advancements and
challenges in applying large language models to the field of audio signal processing. Audio …
SLM: Bridge the thin gap between speech and text foundation models
We present a joint Speech and Language Model (SLM), a multitask, multilingual, and dual-
modal model that takes advantage of pretrained foundational speech and language models …
X-InstructBLIP: A framework for aligning X-modal instruction-aware representations to LLMs and emergent cross-modal reasoning
Vision-language pre-training and instruction tuning have demonstrated general-purpose
capabilities in 2D visual reasoning tasks by aligning visual encoders with state-of-the-art …
Natural language supervision for general-purpose audio representations
Audio-Language models jointly learn multimodal text and audio representations that enable
Zero-Shot inference. Models rely on the encoders to create powerful representations of the …
LoFT: Local proxy fine-tuning for improving transferability of adversarial attacks against large language model
It has been shown that Large Language Model (LLM) alignments can be circumvented by
appending specially crafted attack suffixes with harmful queries to elicit harmful responses …
An image grid can be worth a video: Zero-shot video question answering using a VLM
Stimulated by the sophisticated reasoning capabilities of recent Large Language Models
(LLMs), a variety of strategies for bridging video modality have been devised. A prominent …