Learning in audio-visual context: A review, analysis, and new perspective

Y Wei, D Hu, Y Tian, X Li - arXiv preprint arXiv:2208.09579, 2022 - arxiv.org
Sight and hearing are two senses that play a vital role in human communication and scene
understanding. To mimic human perception ability, audio-visual learning, aimed at …

Multimodal learning with transformers: A survey

P Xu, X Zhu, DA Clifton - IEEE Transactions on Pattern Analysis …, 2023 - ieeexplore.ieee.org
Transformer is a promising neural network learner, and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …

BinauralGrad: A two-stage conditional diffusion probabilistic model for binaural audio synthesis

Y Leng, Z Chen, J Guo, H Liu, J Chen… - Advances in …, 2022 - proceedings.neurips.cc
Binaural audio plays a significant role in constructing immersive augmented and virtual
realities. As it is expensive to record binaural audio from the real world, synthesizing them …

[BOOK] Foundation models for natural language processing: Pre-trained language models integrating media

G Paaß, S Giesselbach - 2023 - library.oapen.org
This open access book provides a comprehensive overview of the state of the art in research
and applications of Foundation Models and is intended for readers familiar with basic …

LAVSS: Location-guided audio-visual spatial audio separation

Y Ye, W Yang, Y Tian - Proceedings of the IEEE/CVF Winter …, 2024 - openaccess.thecvf.com
Existing machine learning research has achieved promising results in monaural audio-
visual separation (MAVS). However, most MAVS methods purely consider what the sound …

Cyclic Learning for Binaural Audio Generation and Localization

Z Li, B Zhao, Y Yuan - … of the IEEE/CVF Conference on …, 2024 - openaccess.thecvf.com
Binaural audio is obtained by simulating the biological structure of human ears which plays
an important role in artificial immersive spaces. A promising approach is to utilize mono …

Ticino: A multi-modal remote sensing dataset for semantic segmentation

MP Barbato, F Piccoli, P Napoletano - Expert Systems with Applications, 2024 - Elsevier
Multi-modal remote sensing (RS) involves the fusion of data from multiple sensors, such as
RGB, Multispectral, Hyperspectral, Light Detection and Ranging, Synthetic Aperture Radar …

Multi-space channel representation learning for mono-to-binaural conversion based audio deepfake detection

R Liu, J Zhang, G Gao - Information Fusion, 2024 - Elsevier
Audio deepfake detection (ADD) aims to detect the fake audio generated by text-to-speech
(TTS), and voice conversion (VC), etc., which is an emerging topic. Traditionally we read the …

Modality-independent teachers meet weakly-supervised audio-visual event parser

YH Lai, YC Chen, F Wang - Advances in Neural Information …, 2023 - proceedings.neurips.cc
Audio-visual learning has been a major pillar of multi-modal machine learning, where the
community mostly focused on its modality-aligned setting, i.e., the audio …

Visual-guided scene-aware audio generation method based on hierarchical feature codec and rendering decision

R Wang, H Cheng, L Ye, Q Zhang - Displays, 2024 - Elsevier
Visually guided spatial sound generation (VGSSG) is a well-suited multimodal learning
method for dealing with recorded videos. However, existing methods are difficult to be …