AVQA: A dataset for audio-visual question answering on videos

P Yang, X Wang, X Duan, H Chen, R Hou… - Proceedings of the 30th …, 2022 - dl.acm.org
Audio-visual question answering aims to answer questions regarding both audio and visual
modalities in a given video, and has drawn increasing research interest in recent years …

Where did I leave my keys? Episodic-memory-based question answering on egocentric videos

L Bärmann, A Waibel - … of the IEEE/CVF Conference on …, 2022 - openaccess.thecvf.com
Humans have a remarkable ability to organize, compress and retrieve episodic memories
throughout their daily life. Current AI systems, however, lack comparable capabilities as they …

Encode-Store-Retrieve: Enhancing Memory Augmentation through Language-Encoded Egocentric Perception

J Shen, J Dudley, PO Kristensson - arXiv preprint arXiv:2308.05822, 2023 - arxiv.org
We depend on our own memory to encode, store, and retrieve our experiences. However,
memory lapses can occur. One promising avenue for achieving memory augmentation is …

Encode-Store-Retrieve: Augmenting Human Memory through Language-Encoded Egocentric Perception

J Shen, JJ Dudley… - 2024 IEEE International …, 2024 - ieeexplore.ieee.org
We depend on our own memory to encode, store, and retrieve our experiences. However,
memory lapses can occur. One promising avenue for achieving memory augmentation is …

Target-Aware Spatio-Temporal Reasoning via Answering Questions in Dynamic Audio-Visual Scenarios

Y Jiang, J Yin - arXiv preprint arXiv:2305.12397, 2023 - arxiv.org
Audio-visual question answering (AVQA) is a challenging task that requires multistep spatio-
temporal reasoning over multimodal contexts. Recent works rely on elaborate target …

CLIP-Powered TASS: Target-Aware Single-Stream Network for Audio-Visual Question Answering

Y Jiang, J Yin - arXiv preprint arXiv:2405.07451, 2024 - arxiv.org
While vision-language pretrained models (VLMs) excel in various multimodal understanding
tasks, their potential in fine-grained audio-visual reasoning, particularly for audio-visual …

[PDF][PDF] Exploring deep learning for multimodal understanding

M Lao - 2023 - scholarlypublications …
[14] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T.,
Louf, R., Funtowicz, M., et al.: Transformers: State-of-the-art natural language processing. In …