Analysis of features and metrics for alignment in text-dependent voice conversion

NJ Shah, HA Patil - Pattern Recognition and Machine Intelligence: 7th …, 2017 - Springer
Abstract
Voice Conversion (VC) is a technique that converts the perceived speaker identity from a source speaker to a target speaker. Given a parallel training speech database of the source and target speakers in text-dependent VC, the first task is to align the source and target speakers’ spectral features at the frame level before learning the mapping function. The accuracy of this alignment affects the learning of the mapping function and hence the voice quality of the converted speech. The impact of alignment has not been much explored in the VC literature. Most alignment techniques align acoustic features (namely, spectral features such as Mel Cepstral Coefficients (MCC)). However, spectral features represent both speaker-specific and speech-specific information. In this paper, we analyze the use of different speaker-independent features for the alignment task, namely unsupervised posterior features: Gaussian Mixture Model (GMM)-based posterior features and GMM-UBM-based posterior features, i.e., those obtained from a GMM that is Maximum A Posteriori (MAP) adapted from a Universal Background Model (UBM). In addition, we propose to use different metrics for the alignment, such as the symmetric Kullback-Leibler (KL) and cosine distances, instead of the Euclidean distance. Our analysis, based on % Phone Accuracy (PA), correlates with the subjective scores of the developed VC systems, with a Pearson correlation coefficient of 0.98.
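To make the role of the distance metric concrete, the following is a minimal sketch of frame-level alignment with a pluggable metric. It assumes dynamic time warping (DTW) as the aligner, which the abstract does not name, and it uses tiny hand-made posterior vectors rather than real GMM or GMM-UBM posteriors. All function names are hypothetical, and the symmetric KL distance is written without the 1/2 factor that some definitions include.

```python
import numpy as np

EPS = 1e-10  # floor that keeps log() and division finite


def euclidean(p, q):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(p - q))


def cosine_distance(p, q):
    """One minus the cosine similarity of two feature vectors."""
    denom = np.linalg.norm(p) * np.linalg.norm(q) + EPS
    return 1.0 - float(np.dot(p, q) / denom)


def symmetric_kl(p, q):
    """KL(p||q) + KL(q||p) for two posterior (probability) vectors."""
    p = np.clip(p, EPS, None)
    q = np.clip(q, EPS, None)
    return float(np.sum((p - q) * np.log(p / q)))


def dtw_align(src, tgt, dist):
    """Align two frame sequences with DTW (assumed aligner).

    Returns a list of (source_index, target_index) pairs.
    """
    n, m = len(src), len(tgt)
    local = np.array([[dist(s, t) for t in tgt] for s in src])

    # Accumulated cost with a padded border of infinities.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
            acc[i, j] = local[i - 1, j - 1] + best_prev

    # Backtrack from the last frame pair to the first.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return path


# Toy posteriors over two hypothetical mixture components (illustrative only).
source = np.array([[0.9, 0.1], [0.9, 0.1], [0.1, 0.9]])
target = np.array([[0.8, 0.2], [0.2, 0.8], [0.2, 0.8]])

for metric in (euclidean, cosine_distance, symmetric_kl):
    print(metric.__name__, dtw_align(source, target, metric))
```

Passing the metric as an argument keeps the aligner unchanged while the features and distances are varied, which mirrors the comparison described in the abstract. Symmetric KL is only meaningful when the frames are probability vectors, as posterior features are.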