Self-supervised speech representation learning: A review

A Mohamed, H Lee, L Borgholt… - IEEE Journal of …, 2022 - ieeexplore.ieee.org
Although supervised deep learning has revolutionized speech and audio processing, it has
necessitated the building of specialist models for individual tasks and application scenarios …

Google USM: Scaling automatic speech recognition beyond 100 languages

Y Zhang, W Han, J Qin, Y Wang, A Bapna… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce the Universal Speech Model (USM), a single large model that performs
automatic speech recognition (ASR) across 100+ languages. This is achieved by pre …

W2v-BERT: Combining contrastive learning and masked language modeling for self-supervised speech pre-training

YA Chung, Y Zhang, W Han, CC Chiu… - 2021 IEEE Automatic …, 2021 - ieeexplore.ieee.org
Motivated by the success of masked language modeling (MLM) in pre-training natural
language processing models, we propose w2v-BERT that explores MLM for self-supervised …

Rethinking pre-training and self-training

B Zoph, G Ghiasi, TY Lin, Y Cui, H Liu… - Advances in neural …, 2020 - proceedings.neurips.cc
Pre-training is a dominant paradigm in computer vision. For example, supervised ImageNet
pre-training is commonly used to initialize the backbones of object detection and …

Pushing the limits of semi-supervised learning for automatic speech recognition

Y Zhang, J Qin, DS Park, W Han, CC Chiu… - arXiv preprint arXiv …, 2020 - arxiv.org
We employ a combination of recent developments in semi-supervised learning for automatic
speech recognition to obtain state-of-the-art results on LibriSpeech utilizing the unlabeled …

Self-training with noisy student improves ImageNet classification

Q Xie, MT Luong, E Hovy… - Proceedings of the IEEE …, 2020 - openaccess.thecvf.com
We present a simple self-training method that achieves 88.4% top-1 accuracy on ImageNet,
which is 2.0% better than the state-of-the-art model that requires 3.5B weakly labeled …

WebFace260M: A benchmark unveiling the power of million-scale deep face recognition

Z Zhu, G Huang, J Deng, Y Ye… - Proceedings of the …, 2021 - openaccess.thecvf.com
In this paper, we contribute a new million-scale face benchmark containing noisy 4M
identities/260M faces (WebFace260M) and cleaned 2M identities/42M faces (WebFace42M) …

BigSSL: Exploring the frontier of large-scale semi-supervised learning for automatic speech recognition

Y Zhang, DS Park, W Han, J Qin… - IEEE Journal of …, 2022 - ieeexplore.ieee.org
We summarize the results of a host of efforts using giant automatic speech recognition (ASR)
models pre-trained using large, diverse unlabeled datasets containing approximately a …

Improved noisy student training for automatic speech recognition

DS Park, Y Zhang, Y Jia, W Han, CC Chiu, B Li… - arXiv preprint arXiv …, 2020 - arxiv.org
Recently, a semi-supervised learning method known as "noisy student training" has been
shown to improve image classification performance of deep networks significantly. Noisy …

Self-training and pre-training are complementary for speech recognition

Q Xu, A Baevski, T Likhomanenko… - ICASSP 2021-2021 …, 2021 - ieeexplore.ieee.org
Self-training and unsupervised pre-training have emerged as effective approaches to
improve speech recognition systems using unlabeled data. However, it is not clear whether …