Self-supervised speech representation learning: A review

A Mohamed, H Lee, L Borgholt… - IEEE Journal of …, 2022 - ieeexplore.ieee.org
Although supervised deep learning has revolutionized speech and audio processing, it has
necessitated the building of specialist models for individual tasks and application scenarios …

Unsupervised automatic speech recognition: A review

H Aldarmaki, A Ullah, S Ram, N Zaki - Speech Communication, 2022 - Elsevier
Automatic Speech Recognition (ASR) systems can be trained to achieve
remarkable performance given large amounts of manually transcribed speech, but large …

Unsupervised speech recognition

A Baevski, WN Hsu, A Conneau… - Advances in Neural …, 2021 - proceedings.neurips.cc
Despite rapid progress in the recent past, current speech recognition systems still require
labeled training data which limits this technology to a small fraction of the languages spoken …

On generative spoken language modeling from raw audio

K Lakhotia, E Kharitonov, WN Hsu, Y Adi… - Transactions of the …, 2021 - direct.mit.edu
We introduce Generative Spoken Language Modeling, the task of learning the
acoustic and linguistic characteristics of a language from raw audio (no text, no labels), and …

Speech resynthesis from discrete disentangled self-supervised representations

A Polyak, Y Adi, J Copet, E Kharitonov… - arXiv preprint arXiv …, 2021 - arxiv.org
We propose using self-supervised discrete representations for the task of speech
resynthesis. To generate disentangled representation, we separately extract low-bitrate …

VQMIVC: Vector quantization and mutual information-based unsupervised speech representation disentanglement for one-shot voice conversion

D Wang, L Deng, YT Yeung, X Chen, X Liu… - arXiv preprint arXiv …, 2021 - arxiv.org
One-shot voice conversion (VC), which performs conversion across arbitrary speakers with
only a single target-speaker utterance for reference, can be effectively achieved by speech …

A comparison of discrete and soft speech units for improved voice conversion

B Van Niekerk, MA Carbonneau, J Zaïdi… - ICASSP 2022-2022 …, 2022 - ieeexplore.ieee.org
The goal of voice conversion is to transform source speech into a target voice, keeping the
content unchanged. In this paper, we focus on self-supervised representation learning for …

From discrete tokens to high-fidelity audio using multi-band diffusion

R San Roman, Y Adi, A Deleforge… - Advances in neural …, 2023 - proceedings.neurips.cc
Deep generative models can generate high-fidelity audio conditioned on various types of
representations (e.g., mel-spectrograms, Mel-frequency Cepstral Coefficients (MFCC)) …

Many but not all deep neural network audio models capture brain responses and exhibit correspondence between model stages and brain regions

G Tuckute, J Feather, D Boebinger, JH McDermott - PLOS Biology, 2023 - journals.plos.org
Models that predict brain responses to stimuli provide one measure of understanding of a
sensory system and have many potential applications in science and engineering. Deep …

SQ-VAE: Variational Bayes on discrete representation with self-annealed stochastic quantization

Y Takida, T Shibuya, WH Liao, CH Lai… - arXiv preprint arXiv …, 2022 - arxiv.org
One noted issue of vector-quantized variational autoencoder (VQ-VAE) is that the learned
discrete representation uses only a fraction of the full capacity of the codebook, also known …