Video question answering with spatio-temporal reasoning

A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets

K Bayoudh, R Knani, F Hamdaoui, A Mtibaa - The Visual Computer, 2022 - Springer

The research progress in multimodal learning has grown rapidly over the last decade in
several areas, especially in computer vision. The growing potential of multimodal data …

Cited by 293 - Related articles - All 7 versions

NExT-QA: Next phase of question-answering to explaining temporal actions

J Xiao, X Shang, A Yao… - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com

We introduce NExT-QA, a rigorously designed video question answering (VideoQA)
benchmark to advance video understanding from describing to explaining the temporal …

Cited by 246 - Related articles - All 6 versions

MIST: Multi-modal iterative spatial-temporal transformer for long-form video question answering

D Gao, L Zhou, L Ji, L Zhu, Y Yang… - Proceedings of the …, 2023 - openaccess.thecvf.com

To build Video Question Answering (VideoQA) systems capable of assisting
humans in daily activities, seeking answers from long-form videos with diverse and complex …

Cited by 65 - Related articles - All 8 versions

Cross-modal causal relational reasoning for event-level visual question answering

Y Liu, G Li, L Lin - IEEE Transactions on Pattern Analysis and …, 2023 - ieeexplore.ieee.org

Existing visual question answering methods often suffer from cross-modal spurious
correlations and oversimplified event-level reasoning processes that fail to capture event …

Cited by 92 - Related articles - All 7 versions

ScanQA: 3D question answering for spatial scene understanding

D Azuma, T Miyanishi, S Kurita… - Proceedings of the …, 2022 - openaccess.thecvf.com

We propose a new 3D spatial understanding task of 3D Question Answering (3D-QA). In the
3D-QA task, models receive visual information from the entire 3D scene of the rich RGB-D …

Cited by 115 - Related articles - All 6 versions

Video as conditional graph hierarchy for multi-granular question answering

J Xiao, A Yao, Z Liu, Y Li, W Ji, TS Chua - Proceedings of the AAAI …, 2022 - ojs.aaai.org

Video question answering requires the models to understand and reason about both the
complex video and language data to correctly derive the answers. Existing efforts have been …

Cited by 107 - Related articles - All 6 versions

Large models for time series and spatio-temporal data: A survey and outlook

M Jin, Q Wen, Y Liang, C Zhang, S Xue, X Wang… - arXiv preprint arXiv …, 2023 - arxiv.org

Temporal data, notably time series and spatio-temporal data, are prevalent in real-world
applications. They capture dynamic system measurements and are produced in vast …

Cited by 64 - Related articles - All 3 versions

Reasoning with heterogeneous graph alignment for video question answering

P Jiang, Y Han - Proceedings of the AAAI Conference on Artificial …, 2020 - aaai.org

The dominant video question answering methods are based on fine-grained representation
or model-specific attention mechanism. They usually process video and question separately …

Cited by 194 - Related articles - All 5 versions

Perception Test: A diagnostic benchmark for multimodal video models

V Patraucean, L Smaira, A Gupta… - Advances in …, 2024 - proceedings.neurips.cc

We propose a novel multimodal video benchmark, the Perception Test, to evaluate the
perception and reasoning skills of pre-trained multimodal models (e.g. Flamingo, BEiT-3, or …

Cited by 26 - Related articles - All 4 versions

Video question answering: Datasets, algorithms and challenges

Y Zhong, J Xiao, W Ji, Y Li, W Deng… - arXiv preprint arXiv …, 2022 - arxiv.org

Video Question Answering (VideoQA) aims to answer natural language questions according
to the given videos. It has earned increasing attention with recent research trends in joint …

Cited by 71 - Related articles - All 3 versions