Learning to answer questions in dynamic audio-visual scenarios
In this paper, we focus on the Audio-Visual Question Answering (AVQA) task, which aims to
answer questions regarding different visual objects, sounds, and their associations in …
iQuery: Instruments as queries for audio-visual sound separation
Current audio-visual separation methods share a standard architecture design where an
audio encoder-decoder network is fused with visual encoding features at the encoder …
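The fusion pattern this abstract refers to can be sketched minimally: an audio encoder maps the mixture spectrogram to a bottleneck, visual features are concatenated there, and a decoder predicts a separation mask. The sketch below is illustrative only; all dimensions, weights, and names are assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative, not from the paper).
T, F = 16, 64      # spectrogram frames x frequency bins
D_A, D_V = 32, 32  # audio bottleneck dim, visual feature dim

def relu(x):
    return np.maximum(x, 0.0)

# Random weights stand in for trained parameters.
W_enc = rng.standard_normal((F, D_A)) * 0.1        # audio encoder
W_dec = rng.standard_normal((D_A + D_V, F)) * 0.1  # decoder sees fused features

def separate(mix_spec, visual_feat):
    """Encode the mixture, fuse visual features at the bottleneck,
    decode a sigmoid mask, and apply it to the mixture spectrogram."""
    z = relu(mix_spec @ W_enc)                    # (T, D_A) audio bottleneck
    v = np.broadcast_to(visual_feat, (T, D_V))    # tile visual embedding over time
    fused = np.concatenate([z, v], axis=1)        # (T, D_A + D_V)
    mask = 1.0 / (1.0 + np.exp(-(fused @ W_dec))) # sigmoid mask in [0, 1]
    return mask * mix_spec                        # masked spectrogram estimate

mix = np.abs(rng.standard_normal((T, F)))  # non-negative magnitude spectrogram
vis = rng.standard_normal(D_V)             # one visual embedding for the clip
est = separate(mix, vis)
```

Because the mask lies in [0, 1], the estimate is bounded by the mixture magnitude, which is the usual property of mask-based separation.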
Progressive spatio-temporal perception for audio-visual question answering
The Audio-Visual Question Answering (AVQA) task aims to answer questions about different
visual objects, sounds, and their associations in videos. Such naturally multi-modal videos …
LAVSS: Location-guided audio-visual spatial audio separation
Existing machine learning research has achieved promising results in monaural audio-
visual separation (MAVS). However, most MAVS methods purely consider what the sound …
Subnetwork-To-Go: Elastic Neural Network with Dynamic Training and Customizable Inference
Deploying neural networks to different devices or platforms is in general challenging,
especially when the model size is large or model complexity is high. Although there exist …
DC-NAS: Divide-and-Conquer Neural Architecture Search for Multi-Modal Classification
Neural architecture search-based multi-modal classification (NAS-MMC) methods can
individually obtain the optimal classifier for different multi-modal data sets in an automatic …
Separate in the Speech Chain: Cross-Modal Conditional Audio-Visual Target Speech Extraction
Z Mu, X Yang - arXiv preprint arXiv:2404.12725, 2024 - arxiv.org
The integration of visual cues has revitalized the performance of the target speech extraction
task, elevating it to the forefront of the field. Nevertheless, this multi-modal learning paradigm …
Independency Adversarial Learning for Cross-Modal Sound Separation
Z Lin, Y Ji, Y Yang - Proceedings of the AAAI Conference on Artificial …, 2024 - ojs.aaai.org
Sound mixture separation is still challenging due to heavy sound overlapping and
disturbance from noise. Unsupervised separation would significantly increase the difficulty …
Boosting Audio Visual Question Answering via Key Semantic-Aware Cues
The Audio Visual Question Answering (AVQA) task aims to answer questions related to
various visual objects, sounds, and their interactions in videos. Such naturally multimodal …
Perceptual synchronization scoring of dubbed content using phoneme-viseme agreement
H Gupta - Proceedings of the IEEE/CVF Winter Conference …, 2024 - openaccess.thecvf.com
Recent works have shown great success in synchronizing lip-movements in a given video
with a dubbed audio stream. However, comparison and efficacy of the synchronization …