A review of deep learning techniques for speech processing

A Mehrish, N Majumder, R Bharadwaj, R Mihalcea… - Information …, 2023 - Elsevier
The field of speech processing has undergone a transformative shift with the advent of deep
learning. The use of multiple processing layers has enabled the creation of models capable …

BigVGAN: A universal neural vocoder with large-scale training

S Lee, W Ping, B Ginsburg, B Catanzaro… - arXiv preprint arXiv …, 2022 - arxiv.org
Despite recent progress in generative adversarial network (GAN)-based vocoders, where
the model generates raw waveform conditioned on acoustic features, it is challenging to …

UnivNet: A neural vocoder with multi-resolution spectrogram discriminators for high-fidelity waveform generation

W Jang, D Lim, J Yoon, B Kim, J Kim - arXiv preprint arXiv:2106.07889, 2021 - arxiv.org
Most neural vocoders employ band-limited mel-spectrograms to generate waveforms. If
full-band spectral features are used as the input, the vocoder can be provided with as much …

Video and audio deepfake datasets and open issues in deepfake technology: being ahead of the curve

Z Akhtar, TL Pendyala, VS Athmakuri - Forensic Sciences, 2024 - mdpi.com
The revolutionary breakthroughs in Machine Learning (ML) and Artificial Intelligence (AI) are
extensively being harnessed across a diverse range of domains, e.g., forensic science …

iSTFTNet: Fast and lightweight mel-spectrogram vocoder incorporating inverse short-time Fourier transform

T Kaneko, K Tanaka, H Kameoka… - ICASSP 2022-2022 …, 2022 - ieeexplore.ieee.org
In recent text-to-speech synthesis and voice conversion systems, a mel-spectrogram is
commonly applied as an intermediate representation, and the necessity for a mel …

CFAD: A Chinese dataset for fake audio detection

H Ma, J Yi, C Wang, X Yan, J Tao, T Wang… - Speech …, 2024 - Elsevier
Fake audio detection is a growing concern and some relevant datasets have been designed
for research. However, there is no standard public Chinese dataset under complex …

ESPnet2-TTS: Extending the edge of TTS research

T Hayashi, R Yamamoto, T Yoshimura, P Wu… - arXiv preprint arXiv …, 2021 - arxiv.org
This paper describes ESPnet2-TTS, an end-to-end text-to-speech (E2E-TTS) toolkit.
ESPnet2-TTS extends our earlier version, ESPnet-TTS, by adding many new features …

Avocodo: Generative adversarial network for artifact-free vocoder

T Bak, J Lee, H Bae, J Yang, JS Bae… - Proceedings of the AAAI …, 2023 - ojs.aaai.org
Neural vocoders based on the generative adversarial neural network (GAN) have been
widely used due to their fast inference speed and lightweight networks while generating …

SafeEar: Content privacy-preserving audio deepfake detection

X Li, K Li, Y Zheng, C Yan, X Ji, W Xu - Proceedings of the 2024 on ACM …, 2024 - dl.acm.org
Text-to-Speech (TTS) and Voice Conversion (VC) models have exhibited remarkable
performance in generating realistic and natural audio. However, their dark side, audio …

SVTS: Scalable video-to-speech synthesis

R Mira, A Haliassos, S Petridis… - arXiv preprint …, 2022 - opus.bibliothek.uni-augsburg.de
Video-to-speech synthesis (also known as lip-to-speech) refers to the translation of silent lip
movements into the corresponding audio. This task has received an increasing amount of …