Lion: Empowering multimodal large language model with dual-level visual knowledge
Multimodal Large Language Models (MLLMs) have endowed LLMs with the ability
to perceive and understand multi-modal signals. However, most of the existing MLLMs …
A comprehensive evaluation of GPT-4V on knowledge-intensive visual question answering
The emergence of multimodal large models (MLMs) has significantly advanced the field of
visual understanding, offering remarkable capabilities in the realm of visual question …
Vision-language instruction tuning: A review and analysis
Instruction tuning is an essential supervised training phase for Large Language Models
(LLMs), with the goal of enhancing LLMs' capacity to generalize instruction execution and …
Training multimedia event extraction with generated images and captions
Contemporary news reporting increasingly features multimedia content, motivating research
on multimedia event extraction. However, the task lacks annotated multimodal training data …
SwitchGPT: Adapting large language models for non-text outputs
Large Language Models (LLMs), primarily trained on text-based datasets, exhibit
exceptional proficiencies in understanding and executing complex linguistic instructions via …
Visual instruction tuning towards general-purpose multimodal model: A survey
Traditional computer vision generally solves each single task independently by a dedicated
model with the task instruction implicitly designed in the model architecture, arising two …
Mitigating multilingual hallucination in large vision-language models
While Large Vision-Language Models (LVLMs) have exhibited remarkable capabilities
across a wide range of tasks, they suffer from hallucination problems, where models …
LLMs Meet Long Video: Advancing Long Video Question Answering with An Interactive Visual Adapter in LLMs
Long video understanding is a significant and ongoing challenge in the intersection of
multimedia and artificial intelligence. Employing large language models (LLMs) for …
Modeling Collaborator: Enabling Subjective Vision Classification With Minimal Human Effort via LLM Tool-Use
From content moderation to wildlife conservation, the number of applications that require
models to recognize nuanced or subjective visual concepts is growing. Traditionally …
Path to medical AGI: Unify domain-specific medical LLMs with the lowest cost
Medical artificial general intelligence (AGI) is an emerging field that aims to develop systems
specifically designed for medical applications that possess the ability to understand, learn …