A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets
K Bayoudh, R Knani, F Hamdaoui, A Mtibaa - The Visual Computer, 2022 - Springer
The research progress in multimodal learning has grown rapidly over the last decade in
several areas, especially in computer vision. The growing potential of multimodal data …
several areas, especially in computer vision. The growing potential of multimodal data …
Next-qa: Next phase of question-answering to explaining temporal actions
We introduce NExT-QA, a rigorously designed video question answering (VideoQA)
benchmark to advance video understanding from describing to explaining the temporal …
benchmark to advance video understanding from describing to explaining the temporal …
Mist: Multi-modal iterative spatial-temporal transformer for long-form video question answering
Abstract To build Video Question Answering (VideoQA) systems capable of assisting
humans in daily activities, seeking answers from long-form videos with diverse and complex …
humans in daily activities, seeking answers from long-form videos with diverse and complex …
Cross-modal causal relational reasoning for event-level visual question answering
Existing visual question answering methods often suffer from cross-modal spurious
correlations and oversimplified event-level reasoning processes that fail to capture event …
correlations and oversimplified event-level reasoning processes that fail to capture event …
Scanqa: 3d question answering for spatial scene understanding
We propose a new 3D spatial understanding task of 3D Question Answering (3D-QA). In the
3D-QA task, models receive visual information from the entire 3D scene of the rich RGB-D …
3D-QA task, models receive visual information from the entire 3D scene of the rich RGB-D …
Video as conditional graph hierarchy for multi-granular question answering
Video question answering requires the models to understand and reason about both the
complex video and language data to correctly derive the answers. Existing efforts have been …
complex video and language data to correctly derive the answers. Existing efforts have been …
Large models for time series and spatio-temporal data: A survey and outlook
Temporal data, notably time series and spatio-temporal data, are prevalent in real-world
applications. They capture dynamic system measurements and are produced in vast …
applications. They capture dynamic system measurements and are produced in vast …
Reasoning with heterogeneous graph alignment for video question answering
The dominant video question answering methods are based on fine-grained representation
or model-specific attention mechanism. They usually process video and question separately …
or model-specific attention mechanism. They usually process video and question separately …
Perception test: A diagnostic benchmark for multimodal video models
We propose a novel multimodal video benchmark-the Perception Test-to evaluate the
perception and reasoning skills of pre-trained multimodal models (eg Flamingo, BEiT-3, or …
perception and reasoning skills of pre-trained multimodal models (eg Flamingo, BEiT-3, or …
Video question answering: Datasets, algorithms and challenges
Video Question Answering (VideoQA) aims to answer natural language questions according
to the given videos. It has earned increasing attention with recent research trends in joint …
to the given videos. It has earned increasing attention with recent research trends in joint …