BigSSL: Exploring the frontier of large-scale semi-supervised learning for automatic speech recognition

Y Zhang, DS Park, W Han, J Qin… - IEEE Journal of …, 2022 - ieeexplore.ieee.org
We summarize the results of a host of efforts using giant automatic speech recognition (ASR)
models pre-trained using large, diverse unlabeled datasets containing approximately a …

JOIST: A joint speech and text streaming model for ASR

TN Sainath, R Prabhavalkar, A Bapna… - 2022 IEEE Spoken …, 2023 - ieeexplore.ieee.org
We present JOIST, an algorithm to train a streaming, cascaded encoder end-to-end (E2E)
model with both speech-text paired inputs and text-only unpaired inputs. Unlike previous …

Diagonal state space augmented transformers for speech recognition

G Saon, A Gupta, X Cui - ICASSP 2023-2023 IEEE …, 2023 - ieeexplore.ieee.org
We improve on the popular conformer architecture by replacing the depthwise temporal
convolutions with diagonal state space (DSS) models. DSS is a recently introduced variant …

Modular hybrid autoregressive transducer

Z Meng, T Chen, R Prabhavalkar… - 2022 IEEE Spoken …, 2023 - ieeexplore.ieee.org
Text-only adaptation of a transducer model remains challenging for end-to-end speech
recognition since the transducer has no clearly separated acoustic model (AM), language …

NAM+: Towards scalable end-to-end contextual biasing for adaptive ASR

T Munkhdalai, Z Wu, G Pundak, KC Sim… - 2022 IEEE Spoken …, 2023 - ieeexplore.ieee.org
Attention-based biasing techniques for end-to-end ASR systems are able to achieve large
accuracy gains without requiring the inference algorithm adjustments and parameter tuning …

Modular domain adaptation for Conformer-based streaming ASR

Q Li, B Li, D Hwang, TN Sainath… - arXiv preprint arXiv …, 2023 - arxiv.org
Speech data from different domains has distinct acoustic and linguistic characteristics. It is
common to train a single multidomain model such as a Conformer transducer for speech …

A unified cascaded encoder ASR model for dynamic model sizes

S Ding, W Wang, D Zhao, TN Sainath, Y He… - arXiv preprint arXiv …, 2022 - arxiv.org
In this paper, we propose a dynamic cascaded encoder Automatic Speech Recognition
(ASR) model, which unifies models for different deployment scenarios. Moreover, the model …

Learning a dual-mode speech recognition model via self-pruning

C Liu, Y Shangguan, H Yang, Y Shi… - 2022 IEEE Spoken …, 2023 - ieeexplore.ieee.org
There is growing interest in unifying the streaming and full-context automatic speech
recognition (ASR) networks into a single end-to-end ASR model to simplify the model …

Improving deliberation by text-only and semi-supervised training

K Hu, TN Sainath, Y He, R Prabhavalkar… - arXiv preprint arXiv …, 2022 - arxiv.org
Text-only and semi-supervised training based on audio-only data has gained popularity
recently due to the wide availability of unlabeled text and speech data. In this work, we …

Sub-8-bit quantization for on-device speech recognition: A regularization-free approach

K Zhen, M Radfar, H Nguyen, GP Strimel… - 2022 IEEE Spoken …, 2023 - ieeexplore.ieee.org
For on-device automatic speech recognition (ASR), quantization-aware training (QAT) is
widely used to balance model predictive performance against efficiency …