An overview of voice conversion and its challenges: From statistical modeling to deep learning

B Sisman, J Yamagishi, S King… - IEEE/ACM Transactions …, 2020 - ieeexplore.ieee.org
Speaker identity is one of the important characteristics of human speech. In voice
conversion, we change the speaker identity from one to another, while keeping the linguistic …

MOSNet: Deep learning based objective assessment for voice conversion

CC Lo, SW Fu, WC Huang, X Wang… - arXiv preprint arXiv …, 2019 - arxiv.org
Existing objective evaluation metrics for voice conversion (VC) are not always correlated
with human perception. Therefore, training VC models with such criteria may not effectively …

A review on subjective and objective evaluation of synthetic speech

E Cooper, WC Huang, Y Tsao, HM Wang… - Acoustical Science …, 2024 - jstage.jst.go.jp
Evaluating synthetic speech generated by machines is a complicated process, as it involves
judging along multiple dimensions including naturalness, intelligibility, and whether the …

The limits of the Mean Opinion Score for speech synthesis evaluation

S Le Maguer, S King, N Harte - Computer Speech & Language, 2024 - Elsevier
The release of WaveNet and Tacotron has forever transformed the speech synthesis
landscape. Thanks to these game-changing innovations, the quality of synthetic speech has …

On the use of WaveNet as a statistical vocoder

N Adiga, V Tsiaras, Y Stylianou - 2018 IEEE International …, 2018 - ieeexplore.ieee.org
In this paper, we explore the possibility of using the WaveNet architecture as a statistical
vocoder. In that case, the generation of speech waveforms is locally conditioned only by …

Audio similarity is unreliable as a proxy for audio quality

P Manocha, Z Jin, A Finkelstein - arXiv preprint arXiv:2206.13411, 2022 - arxiv.org
Many audio processing tasks require perceptual assessment. However, the time and
expense of obtaining "gold standard" human judgments limit the availability of such data …

Speech Enhancement for Noise-Robust Speech Synthesis Using Wasserstein GAN

N Adiga, Y Pantazis, V Tsiaras, Y Stylianou - INTERSPEECH, 2019 - isca-archive.org
The quality of speech synthesis systems can be significantly deteriorated by the presence of
background noise in the recordings. Despite the existence of speech enhancement …

A hierarchical predictor of synthetic speech naturalness using neural networks

T Yoshimura, GE Henter, O Watts, M Wester… - Interspeech …, 2016 - research.ed.ac.uk
A problem when developing and tuning speech synthesis systems is that there is no well-established
method of automatically rating the quality of the synthetic speech. This research …

Listeners' weighting of acoustic cues to synthetic speech naturalness: A multidimensional scaling analysis

C Mayo, RAJ Clark, S King - Speech Communication, 2011 - Elsevier
The quality of current commercial speech synthesis systems is now so high that system
improvements are being made at subtle sub- and supra-segmental levels. Human perceptual …

Non-Intrusive Speech Quality Assessment with Transfer Learning and Subject-Specific Scaling

N Nessler, M Cernak, P Prandoni, P Mainar - Interspeech, 2021 - isca-archive.org
In communication systems, it is crucial to estimate the perceived quality of audio and
speech. The industrial standards for many years have been PESQ, 3QUEST, and POLQA …