Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Recent advances of continual learning in computer vision: An overview

H Qu, H Rahmani, L Xu, B Williams, J Liu - arXiv preprint arXiv …, 2021 - arxiv.org
In contrast to batch learning where all training data is available at once, continual learning
represents a family of methods that accumulate knowledge and learn continuously with data …

When and why vision-language models behave like bags-of-words, and what to do about it?

M Yuksekgonul, F Bianchi, P Kalluri, D Jurafsky… - arXiv preprint arXiv …, 2022 - arxiv.org
Despite the success of large vision and language models (VLMs) in many downstream
applications, it is unclear how well they encode compositional information. Here, we create …

Winoground: Probing vision and language models for visio-linguistic compositionality

T Thrush, R Jiang, M Bartolo, A Singh… - Proceedings of the …, 2022 - openaccess.thecvf.com
We present a novel task and dataset for evaluating the ability of vision and language models
to conduct visio-linguistic compositional reasoning, which we call Winoground. Given two …

VALSE: A task-independent benchmark for vision and language models centered on linguistic phenomena

L Parcalabescu, M Cafagna, L Muradjan… - arXiv preprint arXiv …, 2021 - arxiv.org
We propose VALSE (Vision And Language Structured Evaluation), a novel benchmark
designed for testing general-purpose pretrained vision and language (V&L) models for their …

Vision-and-language or vision-for-language? On cross-modal influence in multimodal transformers

S Frank, E Bugliarello, D Elliott - arXiv preprint arXiv:2109.04448, 2021 - arxiv.org
Pretrained vision-and-language BERTs aim to learn representations that combine
information from both modalities. We propose a diagnostic method based on cross-modal …

FineMatch: Aspect-based fine-grained image and text mismatch detection and correction

H Hua, J Shi, K Kafle, S Jenni, D Zhang… - … on Computer Vision, 2025 - Springer
Recent progress in large-scale pre-training has led to the development of advanced vision-
language models (VLMs) with remarkable proficiency in comprehending and generating …

What's "up" with vision-language models? Investigating their struggle with spatial reasoning

A Kamath, J Hessel, KW Chang - arXiv preprint arXiv:2310.19785, 2023 - arxiv.org
Recent vision-language (VL) models are powerful, but can they reliably distinguish "right"
from "left"? We curate three new corpora to quantify model comprehension of such basic …

MultiViz: Towards visualizing and understanding multimodal models

PP Liang, Y Lyu, G Chhablani, N Jain, Z Deng… - arXiv preprint arXiv …, 2022 - arxiv.org
The promise of multimodal models for real-world applications has inspired research in
visualizing and understanding their internal mechanics with the end goal of empowering …

On explaining multimodal hateful meme detection models

MS Hee, RKW Lee, WH Chong - … of the ACM Web Conference 2022, 2022 - dl.acm.org
Hateful meme detection is a new multimodal task that has gained significant traction in
academic and industry research communities. Recently, researchers have applied pre …