Speak foreign languages with your own voice: Cross-lingual neural codec language modeling

Z Zhang, L Zhou, C Wang, S Chen, Y Wu, S Liu… - arXiv preprint arXiv …, 2023 - arxiv.org
We propose a cross-lingual neural codec language model, VALL-E X, for cross-lingual
speech synthesis. Specifically, we extend VALL-E and train a multi-lingual conditional codec …

VioLA: Unified codec language models for speech recognition, synthesis, and translation

T Wang, L Zhou, Z Zhang, Y Wu, S Liu, Y Gaur… - arXiv preprint arXiv …, 2023 - arxiv.org
Recent research shows a big convergence in model architecture, training objectives, and
inference methods across various tasks for different modalities. In this paper, we propose …

Sparks of large audio models: A survey and outlook

S Latif, M Shoukat, F Shamshad, M Usama… - arXiv preprint arXiv …, 2023 - arxiv.org
This survey paper provides a comprehensive overview of the recent advancements and
challenges in applying large language models to the field of audio signal processing. Audio …

Mega-TTS: Zero-shot text-to-speech at scale with intrinsic inductive bias

Z Jiang, Y Ren, Z Ye, J Liu, C Zhang, Q Yang… - arXiv preprint arXiv …, 2023 - arxiv.org
Scaling text-to-speech to a large and wild dataset has been proven to be highly effective in
achieving timbre and speech style generalization, particularly in zero-shot TTS. However …

WeNet 2.0: More productive end-to-end speech recognition toolkit

B Zhang, D Wu, Z Peng, X Song, Z Yao, H Lv… - arXiv preprint arXiv …, 2022 - arxiv.org
Recently, we made available WeNet, a production-oriented end-to-end speech recognition
toolkit, which introduces a unified two-pass (U2) framework and a built-in runtime to address …

Automatic speech recognition for Uyghur, Kazakh, and Kyrgyz: An overview

W Du, Y Maimaitiyiming, M Nijat, L Li, A Hamdulla… - Applied Sciences, 2022 - mdpi.com
With the emergence of deep learning, the performance of automatic speech recognition
(ASR) systems has remarkably improved. Especially for resource-rich languages such as …

Reproducing Whisper-style training using an open-source toolkit and publicly available data

Y Peng, J Tian, B Yan, D Berrebbi… - 2023 IEEE Automatic …, 2023 - ieeexplore.ieee.org
Pre-training speech models on large volumes of data has achieved remarkable success.
OpenAI Whisper is a multilingual multitask model trained on 680k hours of supervised …

Mega-TTS 2: Zero-shot text-to-speech with arbitrary length speech prompts

Z Jiang, J Liu, Y Ren, J He, C Zhang, Z Ye… - arXiv preprint arXiv …, 2023 - arxiv.org
Zero-shot text-to-speech aims at synthesizing voices with unseen speech prompts. Previous
large-scale multispeaker TTS models have successfully achieved this goal with an enrolled …

PolyVoice: Language models for speech to speech translation

Q Dong, Z Huang, Q Tian, C Xu, T Ko, Y Zhao… - arXiv preprint arXiv …, 2023 - arxiv.org
We propose PolyVoice, a language model-based framework for speech-to-speech
translation (S2ST). Our framework consists of two language models: a translation …

LauraGPT: Listen, attend, understand, and regenerate audio with GPT

Q Chen, Y Chu, Z Gao, Z Li, K Hu, X Zhou, J Xu… - arXiv preprint arXiv …, 2023 - arxiv.org
Generative Pre-trained Transformer (GPT) models have achieved remarkable performance
on various natural language processing tasks. However, there has been limited research on …