BigSSL: Exploring the frontier of large-scale semi-supervised learning for automatic speech recognition

Y Zhang, DS Park, W Han, J Qin… - IEEE Journal of …, 2022 - ieeexplore.ieee.org
We summarize the results of a host of efforts using giant automatic speech recognition (ASR)
models pre-trained using large, diverse unlabeled datasets containing approximately a …

JOIST: A joint speech and text streaming model for ASR

TN Sainath, R Prabhavalkar, A Bapna… - 2022 IEEE Spoken …, 2023 - ieeexplore.ieee.org
We present JOIST, an algorithm to train a streaming, cascaded encoder end-to-end (E2E)
model with both speech-text paired inputs and text-only unpaired inputs. Unlike previous …

Diagonal state space augmented transformers for speech recognition

G Saon, A Gupta, X Cui - ICASSP 2023-2023 IEEE …, 2023 - ieeexplore.ieee.org
We improve on the popular conformer architecture by replacing the depthwise temporal
convolutions with diagonal state space (DSS) models. DSS is a recently introduced variant …

Modular hybrid autoregressive transducer

Z Meng, T Chen, R Prabhavalkar… - 2022 IEEE Spoken …, 2023 - ieeexplore.ieee.org
Text-only adaptation of a transducer model remains challenging for end-to-end speech
recognition since the transducer has no clearly separated acoustic model (AM), language …

NAM+: Towards scalable end-to-end contextual biasing for adaptive ASR

T Munkhdalai, Z Wu, G Pundak, KC Sim… - 2022 IEEE Spoken …, 2023 - ieeexplore.ieee.org
Attention-based biasing techniques for end-to-end ASR systems are able to achieve large
accuracy gains without requiring the inference algorithm adjustments and parameter tuning …

Modular domain adaptation for Conformer-based streaming ASR

Q Li, B Li, D Hwang, TN Sainath… - arXiv preprint arXiv …, 2023 - arxiv.org
Speech data from different domains has distinct acoustic and linguistic characteristics. It is
common to train a single multidomain model such as a Conformer transducer for speech …

A unified cascaded encoder ASR model for dynamic model sizes

S Ding, W Wang, D Zhao, TN Sainath, Y He… - arXiv preprint arXiv …, 2022 - arxiv.org
In this paper, we propose a dynamic cascaded encoder Automatic Speech Recognition
(ASR) model, which unifies models for different deployment scenarios. Moreover, the model …

Learning a dual-mode speech recognition model via self-pruning

C Liu, Y Shangguan, H Yang, Y Shi… - 2022 IEEE Spoken …, 2023 - ieeexplore.ieee.org
There is growing interest in unifying the streaming and full-context automatic speech
recognition (ASR) networks into a single end-to-end ASR model to simplify the model …

Improving deliberation by text-only and semi-supervised training

K Hu, TN Sainath, Y He, R Prabhavalkar… - arXiv preprint arXiv …, 2022 - arxiv.org
Text-only and semi-supervised training based on audio-only data has gained popularity
recently due to the wide availability of unlabeled text and speech data. In this work, we …

Sub-8-bit quantization for on-device speech recognition: A regularization-free approach

K Zhen, M Radfar, H Nguyen, GP Strimel… - 2022 IEEE Spoken …, 2023 - ieeexplore.ieee.org
For on-device automatic speech recognition (ASR), quantization-aware training (QAT) is
widely used to balance model predictive performance against efficiency …