A streaming on-device end-to-end model surpassing server-side conventional model quality and latency
Thus far, end-to-end (E2E) models have not been shown to outperform state-of-the-art
conventional models with respect to both quality, i.e., word error rate (WER), and latency, i.e. …
Speech processing for digital home assistants: Combining signal processing with deep-learning techniques
R Haeb-Umbach, S Watanabe… - IEEE Signal …, 2019 - ieeexplore.ieee.org
Once a popular theme of futuristic science fiction or far-fetched technology forecasts, digital
home assistants with a spoken language interface have become a ubiquitous commodity …
Towards fast and accurate streaming end-to-end ASR
End-to-end (E2E) models fold the acoustic, pronunciation and language models of a
conventional speech recognition model into one neural network with a much smaller …
[BOOK][B] Prosodic patterns in English conversation
NG Ward - 2019 - books.google.com
Language is more than words: it includes the prosodic features and patterns that we use,
subconsciously, to frame meanings and achieve our goals in our interaction with others …
Alignment restricted streaming recurrent neural network transducer
There is a growing interest in the speech community in developing Recurrent Neural
Network Transducer (RNN-T) models for automatic speech recognition (ASR) applications …
Personal VAD: Speaker-conditioned voice activity detection
In this paper, we propose "personal VAD", a system to detect the voice activity of a target
speaker at the frame level. This system is useful for gating the inputs to a streaming on …
Keyword spotting for Google Assistant using contextual speech recognition
AH Michaely, X Zhang, G Simko… - 2017 IEEE Automatic …, 2017 - ieeexplore.ieee.org
We present a novel keyword spotting (KWS) system that uses contextual automatic speech
recognition (ASR). For voice-activated devices, it is common that a KWS system is run on the …
Temporal modeling using dilated convolution and gating for voice-activity-detection
Voice activity detection (VAD) is the task of predicting which parts of an utterance contain
speech versus background noise. It is an important first step to determine which samples to …
E2E segmenter: Joint segmenting and decoding for long-form ASR
Improving the performance of end-to-end ASR models on long utterances ranging from
minutes to hours in length is an ongoing challenge in speech recognition. A common …
Dissecting user-perceived latency of on-device E2E speech recognition
As speech-enabled devices such as smartphones and smart speakers become increasingly
ubiquitous, there is growing interest in building automatic speech recognition (ASR) systems …