Self-supervised speech representation learning: A review
Although supervised deep learning has revolutionized speech and audio processing, it has
necessitated the building of specialist models for individual tasks and application scenarios …
Self-supervised representation learning: Introduction, advances, and challenges
Self-supervised representation learning (SSRL) methods aim to provide powerful, deep
feature learning without the requirement of large annotated data sets, thus alleviating the …
Neural codec language models are zero-shot text to speech synthesizers
We introduce a language modeling approach for text to speech synthesis (TTS). Specifically,
we train a neural codec language model (called VALL-E) using discrete codes derived from …
AudioLM: a language modeling approach to audio generation
Z Borsos, R Marinier, D Vincent… - … ACM transactions on …, 2023 - ieeexplore.ieee.org
We introduce AudioLM, a framework for high-quality audio generation with long-term
consistency. AudioLM maps the input audio to a sequence of discrete tokens and casts …
Data2vec: A general framework for self-supervised learning in speech, vision and language
While the general idea of self-supervised learning is identical across modalities, the actual
algorithms and objectives differ widely because they were developed with a single modality …
WavLM: Large-scale self-supervised pre-training for full stack speech processing
Self-supervised learning (SSL) achieves great success in speech recognition, while limited
exploration has been attempted for other speech processing tasks. As speech signal …
Voicebox: Text-guided multilingual universal speech generation at scale
Large-scale generative models such as GPT and DALL-E have revolutionized the research
community. These models not only generate high fidelity outputs, but are also generalists …
XLS-R: Self-supervised cross-lingual speech representation learning at scale
This paper presents XLS-R, a large-scale model for cross-lingual speech representation
learning based on wav2vec 2.0. We train models with up to 2B parameters on nearly half a …
Speak, read and prompt: High-fidelity text-to-speech with minimal supervision
E Kharitonov, D Vincent, Z Borsos… - Transactions of the …, 2023 - direct.mit.edu
We introduce SPEAR-TTS, a multi-speaker text-to-speech (TTS) system that can be trained
with minimal supervision. By combining two types of discrete speech representations, we …
HuBERT: Self-supervised speech representation learning by masked prediction of hidden units
Self-supervised approaches for speech representation learning are challenged by three
unique problems: (1) there are multiple sound units in each input utterance, (2) there is no …