Very deep self-attention networks for end-to-end speech recognition
Recently, end-to-end sequence-to-sequence models for speech recognition have gained
significant interest in the research community. While previous architecture choices revolve …
Minimum latency training strategies for streaming sequence-to-sequence ASR
Recently, a few novel streaming attention-based sequence-to-sequence (S2S) models have
been proposed to perform online speech recognition with linear-time decoding complexity …
An investigation of phone-based subword units for end-to-end speech recognition
Phones and their context-dependent variants have been the standard modeling units for
conventional speech recognition systems, while characters and subwords have …
Guiding CTC posterior spike timings for improved posterior fusion and knowledge distillation
G Kurata, K Audhkhasi - arXiv preprint arXiv:1904.08311, 2019 - arxiv.org
Conventional automatic speech recognition (ASR) systems trained from frame-level
alignments can easily leverage posterior fusion to improve ASR accuracy and build a better …
Minimum Bayes risk training of RNN-Transducer for end-to-end speech recognition
In this work, we propose minimum Bayes risk (MBR) training of RNN-Transducer (RNN-T) for
end-to-end speech recognition. Specifically, initialized with an RNN-T trained model, MBR …
Acoustically grounded word embeddings for improved acoustics-to-word speech recognition
Direct acoustics-to-word (A2W) systems for end-to-end automatic speech recognition are
simpler to train, and more efficient to decode with, than sub-word systems. However, A2W …
Forget a Bit to Learn Better: Soft Forgetting for CTC-Based Automatic Speech Recognition
Prior work has shown that connectionist temporal classification (CTC)-based automatic
speech recognition systems perform well when using bidirectional long short-term memory …
Improved multi-stage training of online attention-based encoder-decoder models
In this paper, we propose a refined multi-stage multi-task training strategy to improve the
performance of online attention-based encoder-decoder (AED) models. A three-stage …
Advancing multi-accented lstm-ctc speech recognition using a domain specific student-teacher learning paradigm
S Ghorbani, AE Bulut… - 2018 IEEE Spoken …, 2018 - ieeexplore.ieee.org
Non-native speech causes automatic speech recognition systems to degrade in
performance. Past strategies to address this challenge have considered model adaptation …
Distilling attention weights for CTC-based ASR systems
We present a novel training approach for connectionist temporal classification (CTC)-based
automatic speech recognition (ASR) systems. CTC models are promising for building both a …