Learning to Answer Questions in Dynamic Audio-Visual Scenarios

G Li, Y Wei, Y Tian, C Xu, JR Wen… - Proceedings of the …, 2022 - openaccess.thecvf.com
In this paper, we focus on the Audio-Visual Question Answering (AVQA) task, which aims to
answer questions regarding different visual objects, sounds, and their associations in …

iQuery: Instruments as Queries for Audio-Visual Sound Separation

J Chen, R Zhang, D Lian, J Yang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Current audio-visual separation methods share a standard architecture design where an
audio encoder-decoder network is fused with visual encoding features at the encoder …

Progressive Spatio-Temporal Perception for Audio-Visual Question Answering

G Li, W Hou, D Hu - Proceedings of the 31st ACM International …, 2023 - dl.acm.org
The Audio-Visual Question Answering (AVQA) task aims to answer questions about different
visual objects, sounds, and their associations in videos. Such naturally multi-modal videos …

LAVSS: Location-Guided Audio-Visual Spatial Audio Separation

Y Ye, W Yang, Y Tian - Proceedings of the IEEE/CVF Winter …, 2024 - openaccess.thecvf.com
Existing machine learning research has achieved promising results in monaural audio-
visual separation (MAVS). However, most MAVS methods purely consider what the sound …

Subnetwork-To-Go: Elastic Neural Network with Dynamic Training and Customizable Inference

K Li, Y Luo - ICASSP 2024-2024 IEEE International Conference …, 2024 - ieeexplore.ieee.org
Deploying neural networks to different devices or platforms is generally challenging,
especially when the model size is large or the model complexity is high. Although there exist …

DC-NAS: Divide-and-Conquer Neural Architecture Search for Multi-Modal Classification

X Liang, P Fu, Q Guo, K Zheng, Y Qian - Proceedings of the AAAI …, 2024 - ojs.aaai.org
Neural architecture search-based multi-modal classification (NAS-MMC) methods can
individually obtain the optimal classifier for different multi-modal data sets in an automatic …

Separate in the Speech Chain: Cross-Modal Conditional Audio-Visual Target Speech Extraction

Z Mu, X Yang - arXiv preprint arXiv:2404.12725, 2024 - arxiv.org
The integration of visual cues has revitalized the performance of the target speech extraction
task, elevating it to the forefront of the field. Nevertheless, this multi-modal learning paradigm …

Independency Adversarial Learning for Cross-Modal Sound Separation

Z Lin, Y Ji, Y Yang - Proceedings of the AAAI Conference on Artificial …, 2024 - ojs.aaai.org
Sound mixture separation remains challenging due to heavy sound overlap and
disturbance from noise. Unsupervised separation would significantly increase the difficulty …

Boosting Audio Visual Question Answering via Key Semantic-Aware Cues

G Li, H Du, D Hu - arXiv preprint arXiv:2407.20693, 2024 - arxiv.org
The Audio Visual Question Answering (AVQA) task aims to answer questions related to
various visual objects, sounds, and their interactions in videos. Such naturally multimodal …

Perceptual Synchronization Scoring of Dubbed Content Using Phoneme-Viseme Agreement

H Gupta - Proceedings of the IEEE/CVF Winter Conference …, 2024 - openaccess.thecvf.com
Recent works have shown great success in synchronizing lip movements in a given video
with a dubbed audio stream. However, the comparison and efficacy of the synchronization …