Vision-language pre-training: Basics, recent advances, and future trends
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …
intelligence that have been developed in the last few years. We group these approaches …
Recent advances of continual learning in computer vision: An overview
In contrast to batch learning where all training data is available at once, continual learning
represents a family of methods that accumulate knowledge and learn continuously with data …
represents a family of methods that accumulate knowledge and learn continuously with data …
When and why vision-language models behave like bags-of-words, and what to do about it?
Despite the success of large vision and language models (VLMs) in many downstream
applications, it is unclear how well they encode compositional information. Here, we create …
applications, it is unclear how well they encode compositional information. Here, we create …
Winoground: Probing vision and language models for visio-linguistic compositionality
We present a novel task and dataset for evaluating the ability of vision and language models
to conduct visio-linguistic compositional reasoning, which we call Winoground. Given two …
to conduct visio-linguistic compositional reasoning, which we call Winoground. Given two …
VALSE: A task-independent benchmark for vision and language models centered on linguistic phenomena
L Parcalabescu, M Cafagna, L Muradjan… - arXiv preprint arXiv …, 2021 - arxiv.org
We propose VALSE (Vision And Language Structured Evaluation), a novel benchmark
designed for testing general-purpose pretrained vision and language (V&L) models for their …
designed for testing general-purpose pretrained vision and language (V&L) models for their …
Vision-and-language or vision-for-language? on cross-modal influence in multimodal transformers
Pretrained vision-and-language BERTs aim to learn representations that combine
information from both modalities. We propose a diagnostic method based on cross-modal …
information from both modalities. We propose a diagnostic method based on cross-modal …
Finematch: Aspect-based fine-grained image and text mismatch detection and correction
Recent progress in large-scale pre-training has led to the development of advanced vision-
language models (VLMs) with remarkable proficiency in comprehending and generating …
language models (VLMs) with remarkable proficiency in comprehending and generating …
What's" up" with vision-language models? Investigating their struggle with spatial reasoning
Recent vision-language (VL) models are powerful, but can they reliably distinguish" right"
from" left"? We curate three new corpora to quantify model comprehension of such basic …
from" left"? We curate three new corpora to quantify model comprehension of such basic …
Multiviz: Towards visualizing and understanding multimodal models
The promise of multimodal models for real-world applications has inspired research in
visualizing and understanding their internal mechanics with the end goal of empowering …
visualizing and understanding their internal mechanics with the end goal of empowering …
On explaining multimodal hateful meme detection models
Hateful meme detection is a new multimodal task that has gained significant traction in
academic and industry research communities. Recently, researchers have applied pre …
academic and industry research communities. Recently, researchers have applied pre …