An overview of voice conversion systems

SH Mohammadi, A Kain - Speech Communication, 2017 - Elsevier
Voice transformation (VT) aims to change one or more aspects of a speech signal while
preserving linguistic information. A subset of VT, Voice conversion (VC) specifically aims to …

A task-optimized neural network replicates human auditory behavior, predicts brain responses, and reveals a cortical processing hierarchy

AJE Kell, DLK Yamins, EN Shook… - Neuron, 2018 - cell.com
A core goal of auditory neuroscience is to build quantitative models that predict cortical
responses to natural sounds. Reasoning that a complete model of auditory cortex must solve …

Unsupervised speech decomposition via triple information bottleneck

K Qian, Y Zhang, S Chang… - International …, 2020 - proceedings.mlr.press
Speech information can be roughly decomposed into four components: language content,
timbre, pitch, and rhythm. Obtaining disentangled representations of these components is …

Speaker perception

SR Schweinberger, H Kawahara… - Wiley …, 2014 - Wiley Online Library
While humans use their voice mainly for communicating information about the world,
paralinguistic cues in the voice signal convey rich dynamic information about a speaker's …

WORLD: a vocoder-based high-quality speech synthesis system for real-time applications

M Morise, F Yokomori, K Ozawa - IEICE Transactions on …, 2016 - search.ieice.org
A vocoder-based speech synthesis system, named WORLD, was developed in an effort to
improve the sound quality of real-time applications using speech. Speech analysis …

Speaker-dependent WaveNet vocoder

A Tamamori, T Hayashi, K Kobayashi, K Takeda… - Interspeech, 2017 - isca-archive.org
In this study, we propose a speaker-dependent WaveNet vocoder, a method of synthesizing
speech waveforms with WaveNet, by utilizing acoustic features from existing vocoder as …

Indifference to dissonance in native Amazonians reveals cultural variation in music perception

JH McDermott, AF Schultz, EA Undurraga, RA Godoy - Nature, 2016 - nature.com
Music is present in every culture, but the degree to which it is shaped by biology remains
debated. One widely discussed phenomenon is that some combinations of notes are …

D4C, a band-aperiodicity estimator for high-quality speech synthesis

M Morise - Speech Communication, 2016 - Elsevier
An algorithm is proposed for estimating the band aperiodicity of speech signals, where
“aperiodicity” is defined as the power ratio between the speech signal and the aperiodic …

Perceptual fusion of musical notes by native Amazonians suggests universal representations of musical intervals

MJ McPherson, SE Dolan, A Durango… - Nature …, 2020 - nature.com
Music perception is plausibly constrained by universal perceptual mechanisms adapted to
natural sounds. Such constraints could arise from our dependence on harmonic frequency …

Pre-Trained Text Embeddings for Enhanced Text-to-Speech Synthesis

T Hayashi, S Watanabe, T Toda, K Takeda… - …, 2019 - isca-archive.org
We propose an end-to-end text-to-speech (TTS) synthesis model that explicitly uses
information from pre-trained embeddings of the text. Recent work in natural language …