Self-supervised speech representation learning: A review
Although supervised deep learning has revolutionized speech and audio processing, it has
necessitated the building of specialist models for individual tasks and application scenarios …
Unsupervised automatic speech recognition: A review
Abstract Automatic Speech Recognition (ASR) systems can be trained to achieve
remarkable performance given large amounts of manually transcribed speech, but large …
Unsupervised speech recognition
Despite rapid progress in the recent past, current speech recognition systems still require
labeled training data which limits this technology to a small fraction of the languages spoken …
wav2vec: Unsupervised pre-training for speech recognition
We explore unsupervised pre-training for speech recognition by learning representations of
raw audio. wav2vec is trained on large amounts of unlabeled audio data and the resulting …
Unsupervised speech representation learning using wavenet autoencoders
We consider the task of unsupervised extraction of meaningful latent representations of
speech by applying autoencoding neural networks to speech waveforms. The goal is to …
Jointly discovering visual objects and spoken words from raw sensory input
In this paper, we explore neural network models that learn to associate segments of spoken
audio captions with the semantically relevant portions of natural images that they refer to …
Speech2vec: A sequence-to-sequence framework for learning word embeddings from speech
In this paper, we propose a novel deep neural network architecture, Speech2Vec, for
learning fixed-length vector representations of audio segments excised from a speech …
Pre-training on high-resource speech recognition improves low-resource speech-to-text translation
We present a simple approach to improve direct speech-to-text translation (ST) when the
source language is low-resource: we pre-train the model on a high-resource automatic …
Unsupervised pre-training of bidirectional speech encoders via masked reconstruction
We propose an approach for pre-training speech representations via a masked
reconstruction loss. Our pre-trained encoder networks are bidirectional and can therefore be …
Learning hierarchical discrete linguistic units from visually-grounded speech
In this paper, we present a method for learning discrete linguistic units by incorporating
vector quantization layers into neural models of visually grounded speech. We show that our …