Mega-TTS 2: Zero-shot text-to-speech with arbitrary length speech prompts

Z Ju, Y Wang, K Shen, X Tan, D Xin, D Yang… - arXiv preprint arXiv …, 2024 - arxiv.org

While recent large-scale text-to-speech (TTS) models have achieved significant progress,
they still fall short in speech quality, similarity, and prosody. Considering speech intricately …

Cited by 52 · Related articles · All 4 versions

AV-Deepfake1M: A large-scale LLM-driven audio-visual deepfake dataset

Z Cai, S Ghosh, AP Adatia, M Hayat, A Dhall… - arXiv preprint arXiv …, 2023 - arxiv.org

The detection and localization of highly realistic deepfake audio-visual content are
challenging even for the most advanced state-of-the-art methods. While most of the research …

Cited by 17 · Related articles · All 2 versions

HierSpeech++: Bridging the gap between semantic and acoustic representation of speech by hierarchical variational inference for zero-shot speech synthesis

SH Lee, HY Choi, SB Kim, SW Lee - arXiv preprint arXiv:2311.12454, 2023 - arxiv.org

Large language models (LLM)-based speech synthesis has been widely adopted in zero-
shot speech synthesis. However, they require a large-scale data and possess the same …

Cited by 15 · Related articles · All 2 versions

MobileSpeech: A Fast and High-Fidelity Framework for Mobile Zero-Shot Text-to-Speech

S Ji, Z Jiang, H Wang, J Zuo, Z Zhao - arXiv preprint arXiv:2402.09378, 2024 - arxiv.org

Zero-shot text-to-speech (TTS) has gained significant attention due to its powerful voice
cloning capabilities, requiring only a few seconds of unseen speaker voice prompts …

Cited by 5 · Related articles · All 2 versions

1M-Deepfakes Detection Challenge

Z Cai, A Dhall, S Ghosh, M Hayat, D Kollias… - arXiv preprint arXiv …, 2024 - arxiv.org

The detection and localization of deepfake content, particularly when small fake segments
are seamlessly mixed with real videos, remains a significant challenge in the field of digital …

Cited by 1 · Related articles · All 2 versions

Making Flow-Matching-Based Zero-Shot Text-to-Speech Laugh as You Like

N Kanda, X Wang, SE Eskimez, M Thakker… - arXiv preprint arXiv …, 2024 - arxiv.org

Laughter is one of the most expressive and natural aspects of human speech, conveying
emotions, social cues, and humor. However, most text-to-speech (TTS) systems lack the …

Cited by 4 · Related articles · All 2 versions

LLMs Meet Multimodal Generation and Editing: A Survey

Y He, Z Liu, J Chen, Z Tian, H Liu, X Chi, R Liu… - arXiv preprint arXiv …, 2024 - arxiv.org

With the recent advancement in large language models (LLMs), there is a growing interest in
combining LLMs with multimodal learning. Previous surveys of multimodal large language …

Cited by 5 · Related articles · All 2 versions

SoCodec: A Semantic-Ordered Multi-Stream Speech Codec for Efficient Language Model Based Text-to-Speech Synthesis

H Guo, F Xie, K Xie, D Yang, D Guo, X Wu… - arXiv preprint arXiv …, 2024 - arxiv.org

The long speech sequence has been troubling language models (LM) based TTS
approaches in terms of modeling complexity and efficiency. This work proposes SoCodec, a …

Cited by 2 · Related articles · All 3 versions

WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling

S Ji, Z Jiang, X Cheng, Y Chen, M Fang, J Zuo… - arXiv preprint arXiv …, 2024 - arxiv.org

Language models have been effectively applied to modeling natural signals, such as
images, video, speech, and audio. A crucial component of these models is the codec …

Cited by 1 · Related articles · All 2 versions

VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers

S Chen, S Liu, L Zhou, Y Liu, X Tan, J Li, S Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org

This paper introduces VALL-E 2, the latest advancement in neural codec language models
that marks a milestone in zero-shot text-to-speech synthesis (TTS), achieving human parity …

Cited by 13 · Related articles · All 2 versions