Unsupervised data selection and word-morph mixed language model for Tamil low-resource keyword search

C Ni, CC Leung, L Wang, NF Chen… - 2015 IEEE International …, 2015 - ieeexplore.ieee.org
2015 IEEE International Conference on Acoustics, Speech and Signal …, 2015 - ieeexplore.ieee.org
This paper considers an unsupervised data selection problem for the training data of an acoustic model and the vocabulary coverage of a keyword search system in low-resource settings. We propose to use Gaussian component index based n-grams as acoustic features in a submodular function for unsupervised data selection. The submodular function provides a near-optimal solution in terms of the objective being optimized. Moreover, to further reduce the high out-of-vocabulary (OOV) rate for morphologically rich languages like Tamil, word-morph mixed language modeling is also considered. Our experiments are conducted on the Tamil speech provided by the IARPA Babel program for the 2014 NIST Open Keyword Search Evaluation (OpenKWS14). We show that the selection of data plays an important role in the word error rate of the speech recognition system and the actual term weighted value (ATWV) of the keyword search system. The 10 hours of speech selected from the full language pack (FLP) using the proposed algorithm yield relative ATWV improvements of 23.2% and 20.7% over two other data subsets: the 10-hour data from the limited language pack (LLP) defined by IARPA and 10 hours of speech randomly selected from the FLP, respectively. The proposed algorithm also increases the vocabulary coverage, implicitly alleviating the OOV problem: the number of OOV search terms drops from 1,686 and 1,171 in the two baseline conditions to 972.
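The data selection idea in the abstract can be illustrated with a minimal Python sketch of greedy selection under a feature-based submodular objective. This is not the authors' implementation: the square-root objective, the bigram order, the gain-per-duration rule, and the names `ngrams` and `greedy_select` are all assumptions made for illustration. Each utterance is assumed to be already decoded into a sequence of Gaussian component indices.

```python
import math
from collections import Counter


def ngrams(seq, n=2):
    """Return the list of n-grams (as tuples) over a sequence of indices."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def greedy_select(utterances, budget, n=2):
    """Greedily pick utterances under a total-duration budget.

    utterances: dict mapping id -> (duration, list of component indices).
    Objective (an assumption, not taken from the paper):
        f(S) = sum over n-grams g of sqrt(count of g in S),
    which is monotone submodular because sqrt is concave.
    """
    feats = {u: Counter(ngrams(idx, n)) for u, (_, idx) in utterances.items()}
    totals = Counter()   # n-gram counts accumulated over the selected set
    selected, used = [], 0.0
    remaining = set(utterances)

    def gain(u):
        # Marginal increase of f when utterance u is added to the selection.
        return sum(math.sqrt(totals[g] + c) - math.sqrt(totals[g])
                   for g, c in feats[u].items())

    while remaining:
        # Only utterances that still fit in the budget are candidates.
        candidates = sorted(u for u in remaining
                            if used + utterances[u][0] <= budget)
        if not candidates:
            break
        # Pick the best marginal gain per unit of duration.
        best = max(candidates, key=lambda u: gain(u) / utterances[u][0])
        if gain(best) <= 0:
            break
        totals.update(feats[best])
        used += utterances[best][0]
        selected.append(best)
        remaining.remove(best)
    return selected
```

Because the marginal gain of an n-gram shrinks as its count in the selected set grows, the sketch favours utterances that add acoustic patterns not yet covered, which is the diminishing-returns behaviour that makes greedy selection near-optimal for monotone submodular objectives.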