AVQA: A dataset for audio-visual question answering on videos
Audio-visual question answering aims to answer questions regarding both audio and visual
modalities in a given video, and has drawn increasing research interest in recent years …
Where did I leave my keys? Episodic-memory-based question answering on egocentric videos
Humans have a remarkable ability to organize, compress and retrieve episodic memories
throughout their daily life. Current AI systems, however, lack comparable capabilities as they …
Encode-Store-Retrieve: Enhancing Memory Augmentation through Language-Encoded Egocentric Perception
J Shen, J Dudley, PO Kristensson - arXiv preprint arXiv:2308.05822, 2023 - arxiv.org
We depend on our own memory to encode, store, and retrieve our experiences. However,
memory lapses can occur. One promising avenue for achieving memory augmentation is …
Encode-Store-Retrieve: Augmenting Human Memory through Language-Encoded Egocentric Perception
We depend on our own memory to encode, store, and retrieve our experiences. However,
memory lapses can occur. One promising avenue for achieving memory augmentation is …
Target-Aware Spatio-Temporal Reasoning via Answering Questions in Dynamic Audio-Visual Scenarios
Y Jiang, J Yin - arXiv preprint arXiv:2305.12397, 2023 - arxiv.org
Audio-visual question answering (AVQA) is a challenging task that requires multistep spatio-temporal reasoning over multimodal contexts. Recent works rely on elaborate target …
CLIP-Powered TASS: Target-Aware Single-Stream Network for Audio-Visual Question Answering
Y Jiang, J Yin - arXiv preprint arXiv:2405.07451, 2024 - arxiv.org
While vision-language pretrained models (VLMs) excel in various multimodal understanding
tasks, their potential in fine-grained audio-visual reasoning, particularly for audio-visual …
Exploring deep learning for multimodal understanding
M Lao - 2023 - scholarlypublications …
[14] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T.,
Louf, R., Funtowicz, M., et al.: Transformers: State-of-the-art natural language processing. In …