Self-supervised speech representation learning: A review
Although supervised deep learning has revolutionized speech and audio processing, it has
necessitated the building of specialist models for individual tasks and application scenarios …
Unsupervised automatic speech recognition: A review
Automatic Speech Recognition (ASR) systems can be trained to achieve
remarkable performance given large amounts of manually transcribed speech, but large …
Unsupervised speech recognition
Despite rapid progress in the recent past, current speech recognition systems still require
labeled training data which limits this technology to a small fraction of the languages spoken …
On generative spoken language modeling from raw audio
We introduce Generative Spoken Language Modeling, the task of learning the
acoustic and linguistic characteristics of a language from raw audio (no text, no labels), and …
Speech resynthesis from discrete disentangled self-supervised representations
We propose using self-supervised discrete representations for the task of speech
resynthesis. To generate disentangled representation, we separately extract low-bitrate …
VQMIVC: Vector quantization and mutual information-based unsupervised speech representation disentanglement for one-shot voice conversion
One-shot voice conversion (VC), which performs conversion across arbitrary speakers with
only a single target-speaker utterance for reference, can be effectively achieved by speech …
A comparison of discrete and soft speech units for improved voice conversion
The goal of voice conversion is to transform source speech into a target voice, keeping the
content unchanged. In this paper, we focus on self-supervised representation learning for …
From discrete tokens to high-fidelity audio using multi-band diffusion
Deep generative models can generate high-fidelity audio conditioned on various types of
representations (e.g., mel-spectrograms, Mel-frequency Cepstral Coefficients (MFCC)) …
Many but not all deep neural network audio models capture brain responses and exhibit correspondence between model stages and brain regions
Models that predict brain responses to stimuli provide one measure of understanding of a
sensory system and have many potential applications in science and engineering. Deep …
SQ-VAE: Variational Bayes on discrete representation with self-annealed stochastic quantization
One noted issue of the vector-quantized variational autoencoder (VQ-VAE) is that the learned
discrete representation uses only a fraction of the full capacity of the codebook, also known …