WORLD: a vocoder-based high-quality speech synthesis system for real-time applications

M Morise, F Yokomori, K Ozawa - IEICE TRANSACTIONS on …, 2016 - search.ieice.org
A vocoder-based speech synthesis system, named WORLD, was developed in an effort to
improve the sound quality of real-time applications using speech. Speech analysis …

D4C, a band-aperiodicity estimator for high-quality speech synthesis

M Morise - Speech Communication, 2016 - Elsevier
An algorithm is proposed for estimating the band aperiodicity of speech signals, where
“aperiodicity” is defined as the power ratio between the speech signal and the aperiodic …

Harvest: A High-Performance Fundamental Frequency Estimator from Speech Signals.

M Morise - INTERSPEECH, 2017 - isca-archive.org
A fundamental frequency (F0) estimator named Harvest is described. The unique points of
Harvest are that it can obtain a reliable F0 contour and reduce the error that the voiced …

A deep auto-encoder based low-dimensional feature extraction from FFT spectral envelopes for statistical parametric speech synthesis

S Takaki, J Yamagishi - 2016 IEEE International Conference on …, 2016 - ieeexplore.ieee.org
In the state-of-the-art statistical parametric speech synthesis system, a speech analysis
module, e.g., STRAIGHT spectral analysis, is generally used for obtaining accurate and stable …

Sound quality comparison among high-quality vocoders by using re-synthesized speech

M Morise, Y Watanabe - Acoustical Science and Technology, 2018 - jstage.jst.go.jp
Since we have released WORLD on GitHub and have been continuously updating
WORLD to improve the sound quality of the synthesized speech, there is no information on …

Low-Dimensional Representation of Spectral Envelope Without Deterioration for Full-Band Speech Analysis/Synthesis System.

M Morise, G Miyashita, K Ozawa - INTERSPEECH, 2017 - researchgate.net
A speech coding for a full-band speech analysis/synthesis system is described. In this work,
full-band speech is defined as speech with a sampling frequency above 40 kHz, whose …

Speaker consistency loss and step-wise optimization for semi-supervised joint training of TTS and ASR using unpaired text data

N Makishima, S Suzuki, A Ando… - arXiv preprint arXiv …, 2022 - arxiv.org
In this paper, we investigate the semi-supervised joint training of text to speech (TTS) and
automatic speech recognition (ASR), where a small amount of paired data and a large …

Voice conversion with CycleRNN-based spectral mapping and finely tuned WaveNet vocoder

PL Tobing, YC Wu, T Hayashi, K Kobayashi… - IEEE Access, 2019 - ieeexplore.ieee.org
In this paper, we present a novel framework for a voice conversion (VC) system based on a
cyclic recurrent neural network (CycleRNN) and a finely tuned WaveNet vocoder. Even …

Efficient shallow WaveNet vocoder using multiple samples output based on Laplacian distribution and linear prediction

PL Tobing, YC Wu, T Hayashi… - ICASSP 2020-2020 …, 2020 - ieeexplore.ieee.org
This paper presents a novel way for an efficient implementation scheme of shallow WaveNet
vocoder with multiple samples (segment) output based on the use of Laplacian distribution …

Human-in-the-loop speech-design system and its evaluation

D Kondo, M Morise - 2019 Asia-Pacific Signal and Information …, 2019 - ieeexplore.ieee.org
We propose human-in-the-loop (HITL) speech-design system with an interface. General text-
to-speech (TTS) systems generate the speech waveform from the input text without the need …