Vision-language pre-training: Basics, recent advances, and future trends
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …
intelligence that have been developed in the last few years. We group these approaches …
Foundations & trends in multimodal machine learning: Principles, challenges, and open questions
Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …
computer agents with intelligent capabilities such as understanding, reasoning, and learning …
The llama 3 herd of models
Modern artificial intelligence (AI) systems are powered by foundation models. This paper
presents a new set of foundation models, called Llama 3. It is a herd of language models …
presents a new set of foundation models, called Llama 3. It is a herd of language models …
Llama-adapter: Efficient fine-tuning of language models with zero-init attention
We present LLaMA-Adapter, a lightweight adaption method to efficiently fine-tune LLaMA
into an instruction-following model. Using 52K self-instruct demonstrations, LLaMA-Adapter …
into an instruction-following model. Using 52K self-instruct demonstrations, LLaMA-Adapter …
Egoschema: A diagnostic benchmark for very long-form video language understanding
K Mangalam, R Akshulakov… - Advances in Neural …, 2023 - proceedings.neurips.cc
We introduce EgoSchema, a very long-form video question-answering dataset, and
benchmark to evaluate long video understanding capabilities of modern vision and …
benchmark to evaluate long video understanding capabilities of modern vision and …
Vid2seq: Large-scale pretraining of a visual language model for dense video captioning
In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning
model pretrained on narrated videos which are readily-available at scale. The Vid2Seq …
model pretrained on narrated videos which are readily-available at scale. The Vid2Seq …
Moviechat: From dense token to sparse memory for long video understanding
Recently integrating video foundation models and large language models to build a video
understanding system can overcome the limitations of specific pre-defined vision tasks. Yet …
understanding system can overcome the limitations of specific pre-defined vision tasks. Yet …
Mvbench: A comprehensive multi-modal video understanding benchmark
With the rapid development of Multi-modal Large Language Models (MLLMs) a number of
diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities …
diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities …
Zero-shot video question answering via frozen bidirectional language models
Video question answering (VideoQA) is a complex task that requires diverse multi-modal
data for training. Manual annotation of question and answers for videos, however, is tedious …
data for training. Manual annotation of question and answers for videos, however, is tedious …
Self-chained image-language model for video localization and question answering
Recent studies have shown promising results on utilizing large pre-trained image-language
models for video question answering. While these image-language models can efficiently …
models for video question answering. While these image-language models can efficiently …