VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild

P Peng, PY Huang, D Li, A Mohamed… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce VoiceCraft, a token infilling neural codec language model, that achieves
state-of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on …

MobileSpeech: A Fast and High-Fidelity Framework for Mobile Zero-Shot Text-to-Speech

S Ji, Z Jiang, H Wang, J Zuo, Z Zhao - arXiv preprint arXiv:2402.09378, 2024 - arxiv.org
Zero-shot text-to-speech (TTS) has gained significant attention due to its powerful voice
cloning capabilities, requiring only a few seconds of unseen speaker voice prompts …

Language-codec: Reducing the gaps between discrete codec representation and speech language models

S Ji, M Fang, Z Jiang, R Huang, J Zuo, S Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
In recent years, large language models have achieved significant success in generative
tasks (e.g., speech cloning and audio generation) related to speech, audio, music, and other …

Not My Voice! A Taxonomy of Ethical and Safety Harms of Speech Generators

W Hutiri, O Papakyriakopoulos, A Xiang - The 2024 ACM Conference on …, 2024 - dl.acm.org
The rapid and wide-scale adoption of AI to generate human speech poses a range of
significant ethical and safety risks to society that need to be addressed. For example, a …

FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs

T SpeechTeam - arXiv preprint arXiv:2407.04051, 2024 - arxiv.org
This report introduces FunAudioLLM, a model family designed to enhance natural voice
interactions between humans and large language models (LLMs). At its core are two …

ControlSpeech: Towards Simultaneous Zero-shot Speaker Cloning and Zero-shot Language Style Control With Decoupled Codec

S Ji, J Zuo, M Fang, S Zheng, Q Chen, W Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we present ControlSpeech, a text-to-speech (TTS) system capable of fully
cloning the speaker's voice and enabling arbitrary control and adjustment of speaking style …

ACE: A Generative Cross-Modal Retrieval Framework with Coarse-To-Fine Semantic Modeling

M Fang, S Ji, J Zuo, H Huang, Y Xia, J Zhu… - arXiv preprint arXiv …, 2024 - arxiv.org
Generative retrieval, which has demonstrated effectiveness in text-to-text retrieval, utilizes a
sequence-to-sequence model to directly generate candidate identifiers based on natural …

LibriTTS-P: A Corpus with Speaking Style and Speaker Identity Prompts for Text-to-Speech and Style Captioning

M Kawamura, R Yamamoto, Y Shirahata… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce LibriTTS-P, a new corpus based on LibriTTS-R that includes utterance-level
descriptions (i.e., prompts) of speaking style and speaker-level prompts of speaker …

Enhancing Zero-shot Text-to-Speech Synthesis with Human Feedback

C Chen, Y Hu, W Wu, H Wang, ES Chng… - arXiv preprint arXiv …, 2024 - arxiv.org
In recent years, text-to-speech (TTS) technology has witnessed impressive advancements,
particularly with large-scale training datasets, showcasing human-level speech quality and …

CoLM-DSR: Leveraging Neural Codec Language Modeling for Multi-Modal Dysarthric Speech Reconstruction

X Chen, D Yang, D Wang, X Wu, Z Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
Dysarthric speech reconstruction (DSR) aims to transform dysarthric speech into normal
speech. It still suffers from low speaker similarity and poor prosody naturalness. In this …