Foundation models for music: A survey

Y Ma, A Øland, A Ragni, BMS Del Sette, C Saitis… - arXiv preprint arXiv …, 2024 - arxiv.org
In recent years, foundation models (FMs) such as large language models (LLMs) and latent
diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This …

SafeEar: Content privacy-preserving audio deepfake detection

X Li, K Li, Y Zheng, C Yan, X Ji, W Xu - Proceedings of the 2024 on ACM …, 2024 - dl.acm.org
Text-to-Speech (TTS) and Voice Conversion (VC) models have exhibited remarkable
performance in generating realistic and natural audio. However, their dark side, audio …

Singing voice data scaling-up: An introduction to ACE-Opencpop and KiSing-v2

J Shi, Y Lin, X Bai, K Zhang, Y Wu, Y Tang, Y Yu… - arXiv preprint arXiv …, 2024 - arxiv.org
In singing voice synthesis (SVS), generating singing voices from musical scores faces
challenges due to limited data availability, a constraint less common in text-to-speech (TTS) …

HiFi-WaveGAN: Generative adversarial network with auxiliary spectrogram-phase loss for high-fidelity singing voice generation

C Wang, C Zeng, J Chen, O Xue - International Symposium on Neural …, 2024 - Springer
Entertainment-oriented singing voice synthesis (SVS) requires a vocoder to generate
high-fidelity (e.g., 48 kHz) audio. However, most text-to-speech (TTS) vocoders cannot reconstruct …

Muskits-ESPnet: A Comprehensive Toolkit for Singing Voice Synthesis in New Paradigm

Y Wu, J Shi, Y Yu, Y Tang, T Qian, Y Lin, J Han… - Proceedings of the …, 2024 - dl.acm.org
This research presents Muskits-ESPnet, a versatile toolkit that introduces new paradigms to
Singing Voice Synthesis (SVS) through the application of pretrained audio models in both …

TokSing: Singing Voice Synthesis based on Discrete Tokens

Y Wu, J Shi, Y Tang, S Yang, Q Jin - arXiv preprint arXiv:2406.08416, 2024 - arxiv.org
Recent advancements in speech synthesis witness significant benefits by leveraging
discrete tokens extracted from self-supervised learning (SSL) models. Discrete tokens offer …

CrosSinger: A Cross-Lingual Multi-Singer High-Fidelity Singing Voice Synthesizer Trained on Monolingual Singers

X Wang, C Zeng, J Chen… - 2023 IEEE Automatic …, 2023 - ieeexplore.ieee.org
It is challenging to build a multi-singer high-fidelity singing voice synthesis system with cross-
lingual ability by only using monolingual singers in the training stage. In this paper, we …

Improving Chinese pop song and Hokkien Gezi opera singing voice synthesis by enhancing local modeling

P Bai, Y Zhou, M Zheng, W Sun… - Proceedings of the 2023 …, 2023 - aclanthology.org
Singing Voice Synthesis (SVS) strives to synthesize pleasing vocals based on
music scores and lyrics. The current acoustic models based on Transformer usually process …

Low-Resource Cross-Domain Singing Voice Synthesis via Reduced Self-Supervised Speech Representations

P Kakoulidis, N Ellinas, G Vamvoukakis… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we propose a singing voice synthesis model, Karaoker-SSL, that is trained
only on text and speech data as a typical multi-speaker acoustic model. It is a low-resource …

A High-Quality Melody-Aware Peking Opera Synthesizer Using Data Augmentation

X Zhou, W Sun, X Shi - 2023 IEEE International Conference on …, 2023 - ieeexplore.ieee.org
The performing art of Peking Opera places great demands on the singing skills of singers,
including pronunciation, melody, role, personal style and emotional expression, which …