A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets

K Bayoudh, R Knani, F Hamdaoui, A Mtibaa - The Visual Computer, 2022 - Springer
The research progress in multimodal learning has grown rapidly over the last decade in
several areas, especially in computer vision. The growing potential of multimodal data …

NExT-QA: Next phase of question-answering to explaining temporal actions

J Xiao, X Shang, A Yao… - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
We introduce NExT-QA, a rigorously designed video question answering (VideoQA)
benchmark to advance video understanding from describing to explaining the temporal …

MIST: Multi-modal iterative spatial-temporal transformer for long-form video question answering

D Gao, L Zhou, L Ji, L Zhu, Y Yang… - Proceedings of the …, 2023 - openaccess.thecvf.com
To build Video Question Answering (VideoQA) systems capable of assisting
humans in daily activities, seeking answers from long-form videos with diverse and complex …

Cross-modal causal relational reasoning for event-level visual question answering

Y Liu, G Li, L Lin - IEEE Transactions on Pattern Analysis and …, 2023 - ieeexplore.ieee.org
Existing visual question answering methods often suffer from cross-modal spurious
correlations and oversimplified event-level reasoning processes that fail to capture event …

ScanQA: 3D question answering for spatial scene understanding

D Azuma, T Miyanishi, S Kurita… - Proceedings of the …, 2022 - openaccess.thecvf.com
We propose a new 3D spatial understanding task of 3D Question Answering (3D-QA). In the
3D-QA task, models receive visual information from the entire 3D scene of the rich RGB-D …

Video as conditional graph hierarchy for multi-granular question answering

J Xiao, A Yao, Z Liu, Y Li, W Ji, TS Chua - Proceedings of the AAAI …, 2022 - ojs.aaai.org
Video question answering requires the models to understand and reason about both the
complex video and language data to correctly derive the answers. Existing efforts have been …

Large models for time series and spatio-temporal data: A survey and outlook

M Jin, Q Wen, Y Liang, C Zhang, S Xue, X Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
Temporal data, notably time series and spatio-temporal data, are prevalent in real-world
applications. They capture dynamic system measurements and are produced in vast …

Reasoning with heterogeneous graph alignment for video question answering

P Jiang, Y Han - Proceedings of the AAAI Conference on Artificial …, 2020 - aaai.org
The dominant video question answering methods are based on fine-grained representation
or model-specific attention mechanism. They usually process video and question separately …

Perception test: A diagnostic benchmark for multimodal video models

V Patraucean, L Smaira, A Gupta… - Advances in …, 2024 - proceedings.neurips.cc
We propose a novel multimodal video benchmark, the Perception Test, to evaluate the
perception and reasoning skills of pre-trained multimodal models (e.g. Flamingo, BEiT-3, or …

Video question answering: Datasets, algorithms and challenges

Y Zhong, J Xiao, W Ji, Y Li, W Deng… - arXiv preprint arXiv …, 2022 - arxiv.org
Video Question Answering (VideoQA) aims to answer natural language questions according
to the given videos. It has earned increasing attention with recent research trends in joint …