The age of synthetic realities: Challenges and opportunities
Synthetic realities are digital creations or augmentations that are contextually generated
through the use of Artificial Intelligence (AI) methods, leveraging extensive amounts of data …
P-flow: a fast and data-efficient zero-shot TTS through speech prompting
While recent large-scale neural codec language models have shown significant
improvement in zero-shot TTS by training on thousands of hours of data, they suffer from …
UniCATS: A unified context-aware text-to-speech framework with contextual vq-diffusion and vocoding
The utilization of discrete speech tokens, divided into semantic tokens and acoustic tokens,
has been proven superior to traditional acoustic feature mel-spectrograms in terms of …
Navigating the Soundscape of Deception: A Comprehensive Survey on Audio Deepfake Generation, Detection, and Future Horizons
The rise of audio deepfakes presents a significant security threat that undermines trust in
digital communications and media. These synthetic audio technologies can convincingly …
E3 TTS: Easy end-to-end diffusion-based text to speech
We propose Easy End-to-End Diffusion-based Text to Speech, a simple and efficient
end-to-end text-to-speech model based on diffusion. E3 TTS directly takes plain text as input and …
Voiceflow: Efficient text-to-speech with rectified flow matching
Although diffusion models in text-to-speech have become a popular choice due to their
strong generative ability, the intrinsic complexity of sampling from diffusion models harms …
Simplespeech 2: Towards simple and efficient text-to-speech with flow-based scalar latent transformer diffusion models
Scaling text-to-speech (TTS) to large-scale datasets has been demonstrated as an effective
method for improving the diversity and naturalness of synthesized speech. At the high level …
Diffusion-based diverse audio captioning with retrieval-guided Langevin dynamics
Y Zhu, A Men, L Xiao - Information Fusion, 2025 - Elsevier
Audio captioning, a comprehensive task of audio understanding, aims to provide a
natural-language description of an audio clip. Beyond accuracy, diversity is also a critical …
Dualspeech: Enhancing speaker-fidelity and text-intelligibility through dual classifier-free guidance
Text-to-Speech (TTS) models have advanced significantly, aiming to accurately replicate
human speech's diversity, including unique speaker identities and linguistic nuances …
Brain Netflix: Scaling Data to Reconstruct Videos from Brain Signals
The field of brain-to-stimuli reconstruction has seen significant progress in the last few years,
but techniques continue to be subject-specific and are usually tested on a single dataset. In …