WORLD: a vocoder-based high-quality speech synthesis system for real-time applications

C Zhang, C Zhang, S Zheng, Y Qiao, C Li… - arXiv preprint arXiv …, 2023 - arxiv.org

As ChatGPT goes viral, generative AI (AIGC, a.k.a. AI-generated content) has made headlines
everywhere because of its ability to analyze and create text, images, and beyond. With such …

Cited by 165; all 4 versions

An overview of deep-learning-based audio-visual speech enhancement and separation

D Michelsanti, ZH Tan, SX Zhang, Y Xu… - … on Audio, Speech …, 2021 - ieeexplore.ieee.org

Speech enhancement and speech separation are two related tasks, whose purpose is to
extract either one or more target speech signals, respectively, from a mixture of sounds …

Cited by 249; all 6 versions

A survey on neural speech synthesis

X Tan, T Qin, F Soong, TY Liu - arXiv preprint arXiv:2106.15561, 2021 - arxiv.org

Text to speech (TTS), or speech synthesis, which aims to synthesize intelligible and natural
speech given text, is a hot research topic in speech, language, and machine learning …

Cited by 379; all 2 versions

DiffSinger: Singing voice synthesis via shallow diffusion mechanism

J Liu, C Li, Y Ren, F Chen, Z Zhao - … of the AAAI conference on artificial …, 2022 - ojs.aaai.org

Singing voice synthesis (SVS) systems are built to synthesize high-quality and expressive
singing voice, in which the acoustic model generates the acoustic features (e.g., mel …

Cited by 224; all 7 versions

MelGAN: Generative adversarial networks for conditional waveform synthesis

K Kumar, R Kumar, T De Boissiere… - Advances in neural …, 2019 - proceedings.neurips.cc

Previous works (Donahue et al., 2018a; Engel et al., 2019a) have found that generating
coherent raw audio waveforms with GANs is challenging. In this paper, we show that it is …

Cited by 1031; all 10 versions

FastSpeech: Fast, robust and controllable text to speech

Y Ren, Y Ruan, X Tan, T Qin, S Zhao… - Advances in neural …, 2019 - proceedings.neurips.cc

Neural network based end-to-end text to speech (TTS) has significantly improved the quality
of synthesized speech. Prominent methods (e.g., Tacotron 2) usually first generate mel …

Cited by 1124; all 10 versions

Neural analysis and synthesis: Reconstructing speech from self-supervised representations

HS Choi, J Lee, W Kim, J Lee… - Advances in Neural …, 2021 - proceedings.neurips.cc

We present a neural analysis and synthesis (NANSY) framework that can manipulate the
voice, pitch, and speed of an arbitrary speech signal. Most of the previous works have …

Cited by 134; all 6 versions

Joint audio-visual deepfake detection

Y Zhou, SN Lim - Proceedings of the IEEE/CVF International …, 2021 - openaccess.thecvf.com

Deepfakes ("deep learning" + "fake") are synthetically generated videos from AI
algorithms. While they could be entertaining, they could also be misused for falsifying …

Cited by 137; all 5 versions

DDSP: Differentiable digital signal processing

J Engel, L Hantrakul, C Gu, A Roberts - arXiv preprint arXiv:2001.04643, 2020 - arxiv.org

Most generative models of audio directly generate samples in one of two domains: time or
frequency. While sufficient to express any signal, these representations are inefficient, as …

Cited by 457; all 6 versions

ASVspoof 2019: Future horizons in spoofed and fake audio detection

M Todisco, X Wang, V Vestman, M Sahidullah… - arXiv preprint arXiv …, 2019 - arxiv.org

ASVspoof, now in its third edition, is a series of community-led challenges which promote
the development of countermeasures to protect automatic speaker verification (ASV) from …

Cited by 643; all 24 versions