AVQA: A dataset for audio-visual question answering on videos
Audio-visual question answering aims to answer questions regarding both audio and visual
modalities in a given video, and has drawn increasing research interest in recent years …
Where did I leave my keys? Episodic-memory-based question answering on egocentric videos
Humans have a remarkable ability to organize, compress and retrieve episodic memories
throughout their daily life. Current AI systems, however, lack comparable capabilities as they …
Encode-Store-Retrieve: Enhancing Memory Augmentation through Language-Encoded Egocentric Perception
J Shen, J Dudley, PO Kristensson - arXiv preprint arXiv:2308.05822, 2023 - arxiv.org
We depend on our own memory to encode, store, and retrieve our experiences. However,
memory lapses can occur. One promising avenue for achieving memory augmentation is …
Encode-Store-Retrieve: Augmenting Human Memory through Language-Encoded Egocentric Perception
We depend on our own memory to encode, store, and retrieve our experiences. However,
memory lapses can occur. One promising avenue for achieving memory augmentation is …
Target-Aware Spatio-Temporal Reasoning via Answering Questions in Dynamic Audio-Visual Scenarios
Y Jiang, J Yin - arXiv preprint arXiv:2305.12397, 2023 - arxiv.org
Audio-visual question answering (AVQA) is a challenging task that requires multistep spatio-temporal reasoning over multimodal contexts. Recent works rely on elaborate target …
CLIP-Powered TASS: Target-Aware Single-Stream Network for Audio-Visual Question Answering
Y Jiang, J Yin - arXiv preprint arXiv:2405.07451, 2024 - arxiv.org
While vision-language pretrained models (VLMs) excel in various multimodal understanding
tasks, their potential in fine-grained audio-visual reasoning, particularly for audio-visual …
Exploring deep learning for multimodal understanding
M Lao - 2023 - scholarlypublications …
[14] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T.,
Louf, R., Funtowicz, M., et al.: Transformers: State-of-the-art natural language processing. In …