Self-supervised speech representation learning: A review

A Mohamed, H Lee, L Borgholt… - IEEE Journal of …, 2022 - ieeexplore.ieee.org
Although supervised deep learning has revolutionized speech and audio processing, it has
necessitated the building of specialist models for individual tasks and application scenarios …

Unsupervised automatic speech recognition: A review

H Aldarmaki, A Ullah, S Ram, N Zaki - Speech Communication, 2022 - Elsevier
Automatic Speech Recognition (ASR) systems can be trained to achieve
remarkable performance given large amounts of manually transcribed speech, but large …

Unsupervised speech recognition

A Baevski, WN Hsu, A Conneau… - Advances in Neural …, 2021 - proceedings.neurips.cc
Despite rapid progress in the recent past, current speech recognition systems still require
labeled training data which limits this technology to a small fraction of the languages spoken …

wav2vec: Unsupervised pre-training for speech recognition

S Schneider, A Baevski, R Collobert, M Auli - arXiv preprint arXiv …, 2019 - arxiv.org
We explore unsupervised pre-training for speech recognition by learning representations of
raw audio. wav2vec is trained on large amounts of unlabeled audio data and the resulting …
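The contrastive pre-training idea behind wav2vec can be illustrated with a toy objective: score each context vector against the true future latent frame and a few random distractor frames. The following is a minimal numpy sketch under assumed shapes; the function name, hyperparameters, and loss form are illustrative simplifications, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def contrastive_loss(context, latents, k=1, n_negatives=5):
    """Toy wav2vec-style objective: for each step t, distinguish the true
    future latent z_{t+k} from random distractors via dot-product scores,
    using a binary cross-entropy on each score."""
    T, _ = latents.shape
    losses = []
    for t in range(T - k):
        c = context[t]                                  # context vector at step t
        pos = latents[t + k]                            # true future latent
        negs = latents[rng.integers(0, T, size=n_negatives)]  # distractors
        pos_term = -np.log(sigmoid(c @ pos))            # pull true pair together
        neg_term = -np.log(sigmoid(-(negs @ c))).sum()  # push distractors apart
        losses.append(pos_term + neg_term)
    return float(np.mean(losses))
```

In the real model the context and latent sequences come from stacked convolutional encoders over raw audio; here they would simply be two arrays of shape (T, d).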

Unsupervised speech representation learning using wavenet autoencoders

J Chorowski, RJ Weiss, S Bengio… - … /ACM transactions on …, 2019 - ieeexplore.ieee.org
We consider the task of unsupervised extraction of meaningful latent representations of
speech by applying autoencoding neural networks to speech waveforms. The goal is to …

Jointly discovering visual objects and spoken words from raw sensory input

D Harwath, A Recasens, D Surís… - Proceedings of the …, 2018 - openaccess.thecvf.com
In this paper, we explore neural network models that learn to associate segments of spoken
audio captions with the semantically relevant portions of natural images that they refer to …

Speech2vec: A sequence-to-sequence framework for learning word embeddings from speech

YA Chung, J Glass - arXiv preprint arXiv:1803.08976, 2018 - arxiv.org
In this paper, we propose a novel deep neural network architecture, Speech2Vec, for
learning fixed-length vector representations of audio segments excised from a speech …
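Speech2Vec borrows the skip-gram idea from word embeddings: a segment of speech is trained to predict its neighboring segments. The pair-construction step can be sketched as below; `skipgram_pairs` is a hypothetical helper for illustration, not the paper's code, which uses a sequence-to-sequence encoder-decoder over the segments.

```python
def skipgram_pairs(segments, window=2):
    """Build (center, context) training pairs from a sequence of audio
    segments, skip-gram style: each segment is paired with every neighbor
    within the window."""
    pairs = []
    for i in range(len(segments)):
        for j in range(max(0, i - window), min(len(segments), i + window + 1)):
            if i != j:
                pairs.append((segments[i], segments[j]))
    return pairs
```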

Pre-training on high-resource speech recognition improves low-resource speech-to-text translation

S Bansal, H Kamper, K Livescu, A Lopez… - arXiv preprint arXiv …, 2018 - arxiv.org
We present a simple approach to improve direct speech-to-text translation (ST) when the
source language is low-resource: we pre-train the model on a high-resource automatic …

Unsupervised pre-training of bidirectional speech encoders via masked reconstruction

W Wang, Q Tang, K Livescu - ICASSP 2020-2020 IEEE …, 2020 - ieeexplore.ieee.org
We propose an approach for pre-training speech representations via a masked
reconstruction loss. Our pre-trained encoder networks are bidirectional and can therefore be …
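The masked reconstruction loss named in this abstract can be sketched as follows: hide a random subset of frames and score the model's output only on the hidden positions. This is a minimal numpy sketch with an assumed mask probability and a plain mean-squared error; it is not the authors' implementation, which trains bidirectional encoder networks.

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_reconstruction_loss(features, reconstruction, mask_prob=0.15):
    """Toy masked-reconstruction objective: choose random frames to mask,
    then compute mean squared error only over the masked frames."""
    T, _ = features.shape
    mask = rng.random(T) < mask_prob      # True = frame is masked
    if not mask.any():                    # guarantee at least one masked frame
        mask[rng.integers(T)] = True
    diff = reconstruction[mask] - features[mask]
    return float(np.mean(diff ** 2)), mask
```

During pre-training the masked frames would be zeroed or replaced before being fed to the encoder, so the reconstruction must be inferred from the surrounding context.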

Learning hierarchical discrete linguistic units from visually-grounded speech

D Harwath, WN Hsu, J Glass - arXiv preprint arXiv:1911.09602, 2019 - arxiv.org
In this paper, we present a method for learning discrete linguistic units by incorporating
vector quantization layers into neural models of visually grounded speech. We show that our …