Textrolspeech: A text style control speech corpus with codec language text-to-speech models

VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild

P Peng, PY Huang, D Li, A Mohamed… - arXiv preprint arXiv …, 2024 - arxiv.org

We introduce VoiceCraft, a token infilling neural codec language model, that achieves
state-of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on …

Cited by 6 - Related articles - All 2 versions

MobileSpeech: A Fast and High-Fidelity Framework for Mobile Zero-Shot Text-to-Speech

S Ji, Z Jiang, H Wang, J Zuo, Z Zhao - arXiv preprint arXiv:2402.09378, 2024 - arxiv.org

Zero-shot text-to-speech (TTS) has gained significant attention due to its powerful voice
cloning capabilities, requiring only a few seconds of unseen speaker voice prompts …

Cited by 4 - Related articles - All 2 versions

Language-codec: Reducing the gaps between discrete codec representation and speech language models

S Ji, M Fang, Z Jiang, R Huang, J Zuo, S Wang… - arXiv preprint arXiv …, 2024 - arxiv.org

In recent years, large language models have achieved significant success in generative
tasks (e.g., speech cloning and audio generation) related to speech, audio, music, and other …

Cited by 4 - Related articles - All 2 versions

Not My Voice! A Taxonomy of Ethical and Safety Harms of Speech Generators

W Hutiri, O Papakyriakopoulos, A Xiang - The 2024 ACM Conference on …, 2024 - dl.acm.org

The rapid and wide-scale adoption of AI to generate human speech poses a range of
significant ethical and safety risks to society that need to be addressed. For example, a …

Cited by 2 - Related articles - All 4 versions

FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs

T SpeechTeam - arXiv preprint arXiv:2407.04051, 2024 - arxiv.org

This report introduces FunAudioLLM, a model family designed to enhance natural voice
interactions between humans and large language models (LLMs). At its core are two …

Cited by 1 - Related articles - All 4 versions

ControlSpeech: Towards Simultaneous Zero-shot Speaker Cloning and Zero-shot Language Style Control With Decoupled Codec

S Ji, J Zuo, M Fang, S Zheng, Q Chen, W Wang… - arXiv preprint arXiv …, 2024 - arxiv.org

In this paper, we present ControlSpeech, a text-to-speech (TTS) system capable of fully
cloning the speaker's voice and enabling arbitrary control and adjustment of speaking style …

CoLM-DSR: Leveraging Neural Codec Language Modeling for Multi-Modal Dysarthric Speech Reconstruction

X Chen, D Yang, D Wang, X Wu, Z Wu… - arXiv preprint arXiv …, 2024 - arxiv.org

Dysarthric speech reconstruction (DSR) aims to transform dysarthric speech into normal
speech. It still suffers from low speaker similarity and poor prosody naturalness. In this …