Toward growing modular deep neural networks for continuous speech recognition

Z Ansari, SA Seyyedsalehi - Neural Computing and Applications, 2017 - Springer
Abstract
The performance drop of typical automatic speech recognition systems in real applications is related to their poorly designed structure and training procedure. This article introduces a growing modular deep neural network (MDNN) for speech recognition. The network is pre-trained in a special manner suited to its structure. Its ability to grow lets the MDNN capture spatiotemporal information from the frame sequences at the input and from their labels at the output layer at the same time. A network trained with this double spatiotemporal (DST) structure learns the subspace of valid phonetic sequences and can therefore filter out invalid output sequences within its own structure. To improve the network's performance under speaker variation, two speaker adaptation methods are also presented. In these methods, the network learns either to move distorted input representations nonlinearly to their optimal positions or to adapt itself based on the input information. The proposed MDNN structure and its modified versions are evaluated on two Persian speech datasets, FARSDAT and Large FARSDAT. Because large-vocabulary speech datasets have no frame-level transcription, a semi-supervised learning algorithm is explored to train the MDNN on Large FARSDAT. Experimental results on FARSDAT show that combining the DST structure with the speaker adaptation methods achieves up to 7.3% and 10.6% absolute phone accuracy rate improvement over the MDNN and a typical hidden Markov model, respectively. Likewise, semi-supervised training of the grown MDNN on Large FARSDAT improves its recognition performance by up to 5%.
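
The abstract reports its gains as absolute phone accuracy rate improvements, so a small sketch of how such a figure can be computed may help. It assumes the common definition of accuracy as (N - S - D - I) / N over a minimum-edit alignment, which the abstract does not state. The function name `phone_accuracy` and the phone sequences are hypothetical illustrations, not data or code from the paper.

```python
def phone_accuracy(reference, hypothesis):
    """Phone accuracy in percent, assuming Acc = (N - S - D - I) / N.

    S + D + I is taken as the minimum edit (Levenshtein) distance between
    the reference and hypothesis phone sequences.
    """
    n, m = len(reference), len(hypothesis)
    if n == 0:
        raise ValueError("reference must contain at least one phone")

    # dist[i][j] = minimum edits turning reference[:i] into hypothesis[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i  # i deletions
    for j in range(m + 1):
        dist[0][j] = j  # j insertions

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = reference[i - 1] != hypothesis[j - 1]
            dist[i][j] = min(
                dist[i - 1][j - 1] + mismatch,  # match or substitution
                dist[i - 1][j] + 1,             # deletion
                dist[i][j - 1] + 1,             # insertion
            )

    return 100.0 * (n - dist[n][m]) / n


# Made-up phone sequences, for illustration only.
reference = ["s", "a", "l", "a", "m"]
baseline_output = ["s", "a", "r", "a"]        # one substitution, one deletion
improved_output = ["s", "a", "l", "a", "n"]   # one substitution

baseline_acc = phone_accuracy(reference, baseline_output)
improved_acc = phone_accuracy(reference, improved_output)

# Absolute improvement is a difference in percentage points;
# relative improvement is that difference divided by the baseline.
absolute_gain = improved_acc - baseline_acc
relative_gain = 100.0 * absolute_gain / baseline_acc
```

In this toy case the baseline scores 60.0 and the improved output 80.0, an absolute gain of 20 percentage points. Read the same way, the reported 7.3% and 10.6% figures are differences in percentage points rather than ratios.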