Diffvoice: Text-to-speech with latent diffusion

JP Cardenuto, J Yang, R Padilha… - … on Signal and …, 2023 - nowpublishers.com

Synthetic realities are digital creations or augmentations that are contextually generated
through the use of Artificial Intelligence (AI) methods, leveraging extensive amounts of data …

Cited by 27 · Related articles · All 7 versions

P-flow: a fast and data-efficient zero-shot TTS through speech prompting

S Kim, K Shih, JF Santos… - Advances in …, 2024 - proceedings.neurips.cc

While recent large-scale neural codec language models have shown significant
improvement in zero-shot TTS by training on thousands of hours of data, they suffer from …

Cited by 26 · Related articles · All 3 versions

UniCATS: A unified context-aware text-to-speech framework with contextual vq-diffusion and vocoding

C Du, Y Guo, F Shen, Z Liu, Z Liang, X Chen… - Proceedings of the …, 2024 - ojs.aaai.org

The utilization of discrete speech tokens, divided into semantic tokens and acoustic tokens,
has been proven superior to traditional acoustic feature mel-spectrograms in terms of …

Cited by 44 · Related articles · All 4 versions

Navigating the Soundscape of Deception: A Comprehensive Survey on Audio Deepfake Generation, Detection, and Future Horizons

TM Wani, SAA Qadri, FA Wani… - Foundations and Trends …, 2024 - nowpublishers.com

The rise of audio deepfakes presents a significant security threat that undermines trust in
digital communications and media. These synthetic audio technologies can convincingly …

E3 TTS: Easy end-to-end diffusion-based text to speech

Y Gao, N Morioka, Y Zhang… - 2023 IEEE Automatic …, 2023 - ieeexplore.ieee.org

We propose Easy End-to-End Diffusion-based Text to Speech, a simple and efficient
end-to-end text-to-speech model based on diffusion. E3 TTS directly takes plain text as input and …

Cited by 19 · Related articles · All 3 versions

Voiceflow: Efficient text-to-speech with rectified flow matching

Y Guo, C Du, Z Ma, X Chen, K Yu - ICASSP 2024-2024 IEEE …, 2024 - ieeexplore.ieee.org

Although diffusion models in text-to-speech have become a popular choice due to their
strong generative ability, the intrinsic complexity of sampling from diffusion models harms …

Cited by 14 · Related articles · All 3 versions

Simplespeech 2: Towards simple and efficient text-to-speech with flow-based scalar latent transformer diffusion models

D Yang, R Huang, Y Wang, H Guo, D Chong… - arXiv preprint arXiv …, 2024 - arxiv.org

Scaling Text-to-speech (TTS) to large-scale datasets has been demonstrated as an effective
method for improving the diversity and naturalness of synthesized speech. At the high level …

Cited by 3 · Related articles

Diffusion-based diverse audio captioning with retrieval-guided Langevin dynamics

Y Zhu, A Men, L Xiao - Information Fusion, 2025 - Elsevier

Audio captioning, a comprehensive task of audio understanding, aims to provide a
natural-language description of an audio clip. Beyond accuracy, diversity is also a critical …

Cited by 1 · Related articles

Dualspeech: Enhancing speaker-fidelity and text-intelligibility through dual classifier-free guidance

J Yang, J Lee, HS Choi, S Ji, H Kim, J Lee - arXiv preprint arXiv …, 2024 - arxiv.org

Text-to-Speech (TTS) models have advanced significantly, aiming to accurately replicate
human speech's diversity, including unique speaker identities and linguistic nuances …

Cited by 2 · Related articles · All 3 versions

Brain Netflix: Scaling Data to Reconstruct Videos from Brain Signals

C Fosco, B Lahner, B Pan, A Andonian… - … on Computer Vision, 2025 - Springer

The field of brain-to-stimuli reconstruction has seen significant progress in the last few years,
but techniques continue to be subject-specific and are usually tested on a single dataset. In …