The multi-modal fusion in visual question answering: a review of attention mechanisms

S Lu, M Liu, L Yin, Z Yin, X Liu, W Zheng - PeerJ Computer Science, 2023 - peerj.com
Visual Question Answering (VQA) is a significant cross-disciplinary issue in the
fields of computer vision and natural language processing that requires a computer to output …

Multimodal research in vision and language: A review of current and emerging trends

S Uppal, S Bhagat, D Hazarika, N Majumder, S Poria… - Information …, 2022 - Elsevier
Deep Learning and its applications have cascaded impactful research and development
with a diverse range of modalities present in the real-world data. More recently, this has …

Zero-shot video question answering via frozen bidirectional language models

A Yang, A Miech, J Sivic, I Laptev… - Advances in Neural …, 2022 - proceedings.neurips.cc
Video question answering (VideoQA) is a complex task that requires diverse multi-modal
data for training. Manual annotation of questions and answers for videos, however, is tedious …

Less is more: ClipBERT for video-and-language learning via sparse sampling

J Lei, L Li, L Zhou, Z Gan, TL Berg… - Proceedings of the …, 2021 - openaccess.thecvf.com
The canonical approach to video-and-language learning (e.g., video question answering)
dictates a neural model to learn from offline-extracted dense video features from vision …

A general survey on attention mechanisms in deep learning

G Brauwers, F Frasincar - IEEE Transactions on Knowledge …, 2021 - ieeexplore.ieee.org
Attention is an important mechanism that can be employed for a variety of deep learning
models across many different domains and tasks. This survey provides an overview of the …

Invariant grounding for video question answering

Y Li, X Wang, J Xiao, W Ji… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
Video Question Answering (VideoQA) is the task of answering questions about a
video. At its core is understanding the alignments between visual scenes in video and …

NExT-QA: Next phase of question-answering to explaining temporal actions

J Xiao, X Shang, A Yao… - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
We introduce NExT-QA, a rigorously designed video question answering (VideoQA)
benchmark to advance video understanding from describing to explaining the temporal …

EgoVLPv2: Egocentric video-language pre-training with fusion in the backbone

S Pramanick, Y Song, S Nag, KQ Lin… - Proceedings of the …, 2023 - openaccess.thecvf.com
Video-language pre-training (VLP) has become increasingly important due to its ability to
generalize to various vision and language tasks. However, existing egocentric VLP …

Just ask: Learning to answer questions from millions of narrated videos

A Yang, A Miech, J Sivic, I Laptev… - Proceedings of the …, 2021 - openaccess.thecvf.com
Recent methods for visual question answering rely on large-scale annotated datasets.
Manual annotation of questions and answers for videos, however, is tedious, expensive and …

MIST: Multi-modal iterative spatial-temporal transformer for long-form video question answering

D Gao, L Zhou, L Ji, L Zhu, Y Yang… - Proceedings of the …, 2023 - openaccess.thecvf.com
To build Video Question Answering (VideoQA) systems capable of assisting
humans in daily activities, seeking answers from long-form videos with diverse and complex …