A comparative study on Transformer vs RNN in speech applications

S Karita, N Chen, T Hayashi, T Hori… - 2019 IEEE automatic …, 2019 - ieeexplore.ieee.org
Sequence-to-sequence models have been widely used in end-to-end speech processing,
for example, automatic speech recognition (ASR), speech translation (ST), and text-to …

AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline

H Bu, J Du, X Na, B Wu, H Zheng - … of the oriental chapter of the …, 2017 - ieeexplore.ieee.org
An open-source Mandarin speech corpus called AISHELL-1 is released. It is by far the
largest corpus suitable for conducting speech recognition research and building …

Audio augmentation for speech recognition

T Ko, V Peddinti, D Povey, S Khudanpur - Interspeech, 2015 - isca-archive.org
Data augmentation is a common strategy adopted to increase the quantity of training data,
avoid overfitting and improve robustness of the models. In this paper, we investigate audio …

Pattern mining approaches used in sensor-based biometric recognition: a review

J Chaki, N Dey, F Shi, RS Sherratt - IEEE Sensors Journal, 2019 - ieeexplore.ieee.org
Sensing technologies have generated significant interest in the use of biometrics for the recognition
and assessment of individuals. Pattern mining techniques have become a critical step in …

Advances in joint CTC-attention based end-to-end speech recognition with a deep CNN encoder and RNN-LM

T Hori, S Watanabe, Y Zhang, W Chan - arXiv preprint arXiv:1706.02737, 2017 - arxiv.org
We present a state-of-the-art end-to-end Automatic Speech Recognition (ASR) model. We
learn to listen and write characters with a joint Connectionist Temporal Classification (CTC) …

VQTTS: High-fidelity text-to-speech synthesis with self-supervised VQ acoustic feature

C Du, Y Guo, X Chen, K Yu - arXiv preprint arXiv:2204.00768, 2022 - arxiv.org
The mainstream neural text-to-speech (TTS) pipeline is a cascade system, including an
acoustic model (AM) that predicts acoustic features from the input transcript and a vocoder …

Automatic speech recognition method based on deep learning approaches for Uzbek language

A Mukhamadiyev, I Khujayarov, O Djuraev, J Cho - Sensors, 2022 - mdpi.com
Communication has been an important aspect of human life, civilization, and globalization
for thousands of years. Biometric analysis, education, security, healthcare, and smart cities …

Language independent end-to-end architecture for joint language identification and speech recognition

S Watanabe, T Hori, JR Hershey - 2017 IEEE Automatic Speech …, 2017 - ieeexplore.ieee.org
End-to-end automatic speech recognition (ASR) can significantly reduce the burden of
developing ASR systems for new languages, by eliminating the need for linguistic …

UniCATS: A unified context-aware text-to-speech framework with contextual VQ-diffusion and vocoding

C Du, Y Guo, F Shen, Z Liu, Z Liang, X Chen… - Proceedings of the …, 2024 - ojs.aaai.org
The utilization of discrete speech tokens, divided into semantic tokens and acoustic tokens,
has been proven superior to traditional acoustic feature mel-spectrograms in terms of …

Emotion recognition by fusing time synchronous and time asynchronous representations

W Wu, C Zhang, PC Woodland - ICASSP 2021-2021 IEEE …, 2021 - ieeexplore.ieee.org
In this paper, a novel two-branch neural network model structure is proposed for multimodal
emotion recognition, which consists of a time synchronous branch (TSB) and a time …