A complete survey on generative ai (aigc): Is chatgpt from gpt-4 to gpt-5 all you need?

C Zhang, C Zhang, S Zheng, Y Qiao, C Li… - arXiv preprint arXiv …, 2023 - arxiv.org
As ChatGPT goes viral, generative AI (AIGC, aka AI-generated content) has made headlines
everywhere because of its ability to analyze and create text, images, and beyond. With such …

An overview of deep-learning-based audio-visual speech enhancement and separation

D Michelsanti, ZH Tan, SX Zhang, Y Xu… - … on Audio, Speech …, 2021 - ieeexplore.ieee.org
Speech enhancement and speech separation are two related tasks, whose purpose is to
extract either one or more target speech signals, respectively, from a mixture of sounds …

A survey on neural speech synthesis

X Tan, T Qin, F Soong, TY Liu - arXiv preprint arXiv:2106.15561, 2021 - arxiv.org
Text to speech (TTS), or speech synthesis, which aims to synthesize intelligible and natural
speech given text, is a hot research topic in speech, language, and machine learning …

Diffsinger: Singing voice synthesis via shallow diffusion mechanism

J Liu, C Li, Y Ren, F Chen, Z Zhao - … of the AAAI conference on artificial …, 2022 - ojs.aaai.org
Singing voice synthesis (SVS) systems are built to synthesize high-quality and expressive
singing voice, in which the acoustic model generates the acoustic features (eg, mel …

Melgan: Generative adversarial networks for conditional waveform synthesis

K Kumar, R Kumar, T De Boissiere… - Advances in neural …, 2019 - proceedings.neurips.cc
Previous works (Donahue et al., 2018a; Engel et al., 2019a) have found that generating
coherent raw audio waveforms with GANs is challenging. In this paper, we show that it is …

Fastspeech: Fast, robust and controllable text to speech

Y Ren, Y Ruan, X Tan, T Qin, S Zhao… - Advances in neural …, 2019 - proceedings.neurips.cc
Neural network based end-to-end text to speech (TTS) has significantly improved the quality
of synthesized speech. Prominent methods (eg, Tacotron 2) usually first generate mel …

Neural analysis and synthesis: Reconstructing speech from self-supervised representations

HS Choi, J Lee, W Kim, J Lee… - Advances in Neural …, 2021 - proceedings.neurips.cc
We present a neural analysis and synthesis (NANSY) framework that can manipulate the
voice, pitch, and speed of an arbitrary speech signal. Most of the previous works have …

Joint audio-visual deepfake detection

Y Zhou, SN Lim - Proceedings of the IEEE/CVF International …, 2021 - openaccess.thecvf.com
Abstract Deepfakes (" deep learning"+" fake") are synthetically-generated videos from AI
algorithms. While they could be entertaining, they could also be misused for falsifying …

DDSP: Differentiable digital signal processing

J Engel, L Hantrakul, C Gu, A Roberts - arXiv preprint arXiv:2001.04643, 2020 - arxiv.org
Most generative models of audio directly generate samples in one of two domains: time or
frequency. While sufficient to express any signal, these representations are inefficient, as …

ASVspoof 2019: Future horizons in spoofed and fake audio detection

M Todisco, X Wang, V Vestman, M Sahidullah… - arXiv preprint arXiv …, 2019 - arxiv.org
ASVspoof, now in its third edition, is a series of community-led challenges which promote
the development of countermeasures to protect automatic speaker verification (ASV) from …