Very deep self-attention networks for end-to-end speech recognition

NQ Pham, TS Nguyen, J Niehues, M Müller… - arXiv preprint arXiv …, 2019 - arxiv.org
Recently, end-to-end sequence-to-sequence models for speech recognition have gained
significant interest in the research community. While previous architecture choices revolve …

Minimum latency training strategies for streaming sequence-to-sequence ASR

H Inaguma, Y Gaur, L Lu, J Li… - ICASSP 2020-2020 IEEE …, 2020 - ieeexplore.ieee.org
Recently, a few novel streaming attention-based sequence-to-sequence (S2S) models have
been proposed to perform online speech recognition with linear-time decoding complexity …

An investigation of phone-based subword units for end-to-end speech recognition

W Wang, G Wang, A Bhatnagar, Y Zhou… - arXiv preprint arXiv …, 2020 - arxiv.org
Phones and their context-dependent variants have been the standard modeling units for
conventional speech recognition systems, while characters and subwords have …

Guiding CTC posterior spike timings for improved posterior fusion and knowledge distillation

G Kurata, K Audhkhasi - arXiv preprint arXiv:1904.08311, 2019 - arxiv.org
Conventional automatic speech recognition (ASR) systems trained from frame-level
alignments can easily leverage posterior fusion to improve ASR accuracy and build a better …

Minimum Bayes risk training of RNN-transducer for end-to-end speech recognition

C Weng, C Yu, J Cui, C Zhang, D Yu - arXiv preprint arXiv:1911.12487, 2019 - arxiv.org
In this work, we propose minimum Bayes risk (MBR) training of RNN-Transducer (RNN-T) for
end-to-end speech recognition. Specifically, initialized with an RNN-T trained model, MBR …

Acoustically grounded word embeddings for improved acoustics-to-word speech recognition

S Settle, K Audhkhasi, K Livescu… - ICASSP 2019-2019 …, 2019 - ieeexplore.ieee.org
Direct acoustics-to-word (A2W) systems for end-to-end automatic speech recognition are
simpler to train, and more efficient to decode with, than sub-word systems. However, A2W …

Forget a Bit to Learn Better: Soft Forgetting for CTC-Based Automatic Speech Recognition

K Audhkhasi, G Saon, Z Tüske, B Kingsbury… - Interspeech, 2019 - academia.edu
Prior work has shown that connectionist temporal classification (CTC)-based automatic
speech recognition systems perform well when using bidirectional long short-term memory …

Improved multi-stage training of online attention-based encoder-decoder models

A Garg, D Gowda, A Kumar, K Kim… - 2019 IEEE Automatic …, 2019 - ieeexplore.ieee.org
In this paper, we propose a refined multi-stage multi-task training strategy to improve the
performance of online attention-based encoder-decoder (AED) models. A three-stage …

Advancing multi-accented LSTM-CTC speech recognition using a domain specific student-teacher learning paradigm

S Ghorbani, AE Bulut… - 2018 IEEE Spoken …, 2018 - ieeexplore.ieee.org
Non-native speech causes automatic speech recognition systems to degrade in
performance. Past strategies to address this challenge have considered model adaptation …

Distilling attention weights for CTC-based ASR systems

T Moriya, H Sato, T Tanaka, T Ashihara… - ICASSP 2020-2020 …, 2020 - ieeexplore.ieee.org
We present a novel training approach for connectionist temporal classification (CTC)-based
automatic speech recognition (ASR) systems. CTC models are promising for building both a …