Movieqa: Understanding stories in movies through question-answering

PP Liang, A Zadeh, LP Morency - arXiv preprint arXiv:2209.03430, 2022 - arxiv.org

Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …

被引用次数：129 相关文章所有 2 个版本

[PDF] arxiv.org

Qa dataset explosion: A taxonomy of nlp resources for question answering and reading comprehension

A Rogers, M Gardner, I Augenstein - ACM Computing Surveys, 2023 - dl.acm.org

Alongside huge volumes of research on deep learning models in NLP in the recent years,
there has been much work on benchmark datasets needed to track modeling progress …

被引用次数：179 相关文章所有 6 个版本

[PDF] arxiv.org

A-okvqa: A benchmark for visual question answering using world knowledge

D Schwenk, A Khandelwal, C Clark, K Marino… - European conference on …, 2022 - Springer

Abstract The Visual Question Answering (VQA) task aspires to provide a meaningful testbed
for the development of AI models that can jointly reason over visual and natural language …

被引用次数：254 相关文章所有 5 个版本

[PDF] neurips.cc

Egoschema: A diagnostic benchmark for very long-form video language understanding

K Mangalam, R Akshulakov… - Advances in Neural …, 2024 - proceedings.neurips.cc

We introduce EgoSchema, a very long-form video question-answering dataset, and
benchmark to evaluate long video understanding capabilities of modern vision and …

被引用次数：55 相关文章所有 5 个版本

[PDF] thecvf.com

Moviechat: From dense token to sparse memory for long video understanding

E Song, W Chai, G Wang, Y Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com

Recently integrating video foundation models and large language models to build a video
understanding system can overcome the limitations of specific pre-defined vision tasks. Yet …

被引用次数：71 相关文章所有 3 个版本

[PDF] neurips.cc

Zero-shot video question answering via frozen bidirectional language models

A Yang, A Miech, J Sivic, I Laptev… - Advances in Neural …, 2022 - proceedings.neurips.cc

Video question answering (VideoQA) is a complex task that requires diverse multi-modal
data for training. Manual annotation of question and answers for videos, however, is tedious …

被引用次数：172 相关文章所有 11 个版本

[PDF] thecvf.com

Next-qa: Next phase of question-answering to explaining temporal actions

J Xiao, X Shang, A Yao… - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com

We introduce NExT-QA, a rigorously designed video question answering (VideoQA)
benchmark to advance video understanding from describing to explaining the temporal …

被引用次数：231 相关文章所有 6 个版本

[PDF] thecvf.com

Just ask: Learning to answer questions from millions of narrated videos

A Yang, A Miech, J Sivic, I Laptev… - Proceedings of the …, 2021 - openaccess.thecvf.com

Recent methods for visual question answering rely on large-scale annotated datasets.
Manual annotation of questions and answers for videos, however, is tedious, expensive and …

被引用次数：267 相关文章所有 14 个版本

[PDF] thecvf.com

Mist: Multi-modal iterative spatial-temporal transformer for long-form video question answering

D Gao, L Zhou, L Ji, L Zhu, Y Yang… - Proceedings of the …, 2023 - openaccess.thecvf.com

Abstract To build Video Question Answering (VideoQA) systems capable of assisting
humans in daily activities, seeking answers from long-form videos with diverse and complex …

被引用次数：63 相关文章所有 8 个版本

[PDF] arxiv.org

Hero: Hierarchical encoder for video+ language omni-representation pre-training

L Li, YC Chen, Y Cheng, Z Gan, L Yu, J Liu - arXiv preprint arXiv …, 2020 - arxiv.org

We present HERO, a novel framework for large-scale video+ language omni-representation
learning. HERO encodes multimodal inputs in a hierarchical structure, where local context of …

被引用次数：509 相关文章所有 7 个版本