A Comparison of sequence-to-sequence models for speech recognition.

R Prabhavalkar, K Rao, TN Sainath, B Li, L Johnson… - Interspeech, 2017 - isca-archive.org
In this work, we conduct a detailed evaluation of various all-neural, end-to-end trained,
sequence-to-sequence models applied to the task of speech recognition. Notably, each of …

Recurrent neural aligner: An encoder-decoder neural network model for sequence to sequence mapping.

H Sak, M Shannon, K Rao, F Beaufays - Interspeech, 2017 - isca-archive.org
We introduce an encoder-decoder recurrent neural network model called Recurrent Neural
Aligner (RNA) that can be used for sequence to sequence mapping tasks. Like connectionist …

Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer

K Rao, H Sak, R Prabhavalkar - 2017 IEEE automatic speech …, 2017 - ieeexplore.ieee.org
We investigate training end-to-end speech recognition models with the recurrent neural
network transducer (RNN-T): a streaming, all-neural, sequence-to-sequence architecture …

Multi-accent speech recognition with hierarchical grapheme based models

K Rao, H Sak - … conference on acoustics, speech and signal …, 2017 - ieeexplore.ieee.org
We train grapheme-based acoustic models for speech recognition using a hierarchical
recurrent neural network architecture with connectionist temporal classification (CTC) loss …

Direct acoustics-to-word models for English conversational speech recognition

K Audhkhasi, B Ramabhadran, G Saon… - arXiv preprint arXiv …, 2017 - arxiv.org
Recent work on end-to-end automatic speech recognition (ASR) has shown that the
connectionist temporal classification (CTC) loss can be used to convert acoustics to phone …

Towards end-to-end speech recognition with recurrent neural networks

A Graves, N Jaitly - International conference on machine …, 2014 - proceedings.mlr.press
This paper presents a speech recognition system that directly transcribes audio data with
text, without requiring an intermediate phonetic representation. The system is based on a …

A comparison of modeling units in sequence-to-sequence speech recognition with the Transformer on Mandarin Chinese

S Zhou, L Dong, S Xu, B Xu - International Conference on Neural …, 2018 - Springer
The choice of modeling units is critical to automatic speech recognition (ASR) tasks.
Conventional ASR systems typically choose context-dependent states (CD-states) or context …

Lower Frame Rate Neural Network Acoustic Models.

G Pundak, TN Sainath - Interspeech, 2016 - isca-archive.org
Recently neural network acoustic models trained with Connectionist Temporal Classification
(CTC) were proposed as an alternative approach to conventional cross-entropy trained …

An empirical exploration of CTC acoustic models

Y Miao, M Gowayyed, X Na, T Ko… - … on acoustics, speech …, 2016 - ieeexplore.ieee.org
The connectionist temporal classification (CTC) loss function has several interesting
properties relevant for automatic speech recognition (ASR): applied on top of deep recurrent …

Recurrent neural network and LSTM models for lexical utterance classification.

SV Ravuri, A Stolcke - Interspeech, 2015 - isca-archive.org
Utterance classification is a critical pre-processing step for many speech understanding and
dialog systems. In multi-user settings, one needs to first identify if an utterance is even …