Deep audio-visual learning: A survey

H Zhu, MD Luo, R Wang, AH Zheng, R He - International Journal of …, 2021 - Springer
Audio-visual learning, aimed at exploiting the relationship between audio and visual
modalities, has drawn considerable attention since deep learning started to be used …

Image synthesis: a review of methods, datasets, evaluation metrics, and future outlook

SS Baraheem, TN Le, TV Nguyen - Artificial Intelligence Review, 2023 - Springer
Image synthesis is a process of converting the input text, sketch, or other sources, i.e., another
image or mask, into an image. It is an important problem in the computer vision field, where it …

Auto-regressive image synthesis with integrated quantization

F Zhan, Y Yu, R Wu, J Zhang, K Cui, C Zhang… - European Conference on …, 2022 - Springer
Deep generative models have achieved conspicuous progress in realistic image synthesis
with multifarious conditional inputs, while generating diverse yet high-fidelity images …

Music gesture for visual sound separation

C Gan, D Huang, H Zhao… - Proceedings of the …, 2020 - openaccess.thecvf.com
Recent deep learning approaches have achieved impressive performance on visual sound
separation tasks. However, these approaches are mostly built on appearance and optical …

The sound of motions

H Zhao, C Gan, WC Ma… - Proceedings of the IEEE …, 2019 - openaccess.thecvf.com
Sounds originate from object motions and vibrations of surrounding air. Inspired by the fact
that humans are capable of interpreting sound sources from how objects move visually, we …

Foley music: Learning to generate music from videos

C Gan, D Huang, P Chen, JB Tenenbaum… - Computer Vision–ECCV …, 2020 - Springer
In this paper, we introduce Foley Music, a system that can synthesize plausible music for a
silent video clip about people playing musical instruments. We first identify two key …

MT3: Multi-task multitrack music transcription

J Gardner, I Simon, E Manilow, C Hawthorne… - arXiv preprint arXiv …, 2021 - arxiv.org
Automatic Music Transcription (AMT), inferring musical notes from raw audio, is a
challenging task at the core of music understanding. Unlike Automatic Speech Recognition …

Taming visually guided sound generation

V Iashin, E Rahtu - arXiv preprint arXiv:2110.08791, 2021 - arxiv.org
Recent advances in visually-induced audio generation are based on sampling short, low-
fidelity, and one-class sounds. Moreover, sampling 1 second of audio from the state-of-the …

Multi-instrument music synthesis with spectrogram diffusion

C Hawthorne, I Simon, A Roberts, N Zeghidour… - arXiv preprint arXiv …, 2022 - arxiv.org
An ideal music synthesizer should be both interactive and expressive, generating high-
fidelity audio in realtime for arbitrary combinations of instruments and notes. Recent neural …

GiantMIDI-Piano: A large-scale MIDI dataset for classical piano music

Q Kong, B Li, J Chen, Y Wang - arXiv preprint arXiv:2010.07061, 2020 - arxiv.org
Symbolic music datasets are important for music information retrieval and musical analysis.
However, there is a lack of large-scale symbolic datasets for classical piano music. In this …