Speak foreign languages with your own voice: Cross-lingual neural codec language modeling
We propose a cross-lingual neural codec language model, VALL-E X, for cross-lingual
speech synthesis. Specifically, we extend VALL-E and train a multi-lingual conditional codec …
speech synthesis. Specifically, we extend VALL-E and train a multi-lingual conditional codec …
Viola: Unified codec language models for speech recognition, synthesis, and translation
Recent research shows a big convergence in model architecture, training objectives, and
inference methods across various tasks for different modalities. In this paper, we propose …
inference methods across various tasks for different modalities. In this paper, we propose …
Sparks of large audio models: A survey and outlook
This survey paper provides a comprehensive overview of the recent advancements and
challenges in applying large language models to the field of audio signal processing. Audio …
challenges in applying large language models to the field of audio signal processing. Audio …
Mega-tts: Zero-shot text-to-speech at scale with intrinsic inductive bias
Scaling text-to-speech to a large and wild dataset has been proven to be highly effective in
achieving timbre and speech style generalization, particularly in zero-shot TTS. However …
achieving timbre and speech style generalization, particularly in zero-shot TTS. However …
Wenet 2.0: More productive end-to-end speech recognition toolkit
Recently, we made available WeNet, a production-oriented end-to-end speech recognition
toolkit, which introduces a unified two-pass (U2) framework and a built-in runtime to address …
toolkit, which introduces a unified two-pass (U2) framework and a built-in runtime to address …
Automatic speech recognition for Uyghur, Kazakh, and Kyrgyz: An overview
With the emergence of deep learning, the performance of automatic speech recognition
(ASR) systems has remarkably improved. Especially for resource-rich languages such as …
(ASR) systems has remarkably improved. Especially for resource-rich languages such as …
Reproducing whisper-style training using an open-source toolkit and publicly available data
Pre-training speech models on large volumes of data has achieved remarkable success.
OpenAI Whisper is a multilingual multitask model trained on 680k hours of supervised …
OpenAI Whisper is a multilingual multitask model trained on 680k hours of supervised …
Mega-tts 2: Zero-shot text-to-speech with arbitrary length speech prompts
Zero-shot text-to-speech aims at synthesizing voices with unseen speech prompts. Previous
large-scale multispeaker TTS models have successfully achieved this goal with an enrolled …
large-scale multispeaker TTS models have successfully achieved this goal with an enrolled …
Polyvoice: Language models for speech to speech translation
We propose PolyVoice, a language model-based framework for speech-to-speech
translation (S2ST) system. Our framework consists of two language models: a translation …
translation (S2ST) system. Our framework consists of two language models: a translation …
Lauragpt: Listen, attend, understand, and regenerate audio with gpt
Generative Pre-trained Transformer (GPT) models have achieved remarkable performance
on various natural language processing tasks. However, there has been limited research on …
on various natural language processing tasks. However, there has been limited research on …