A review of deep learning techniques for speech processing
The field of speech processing has undergone a transformative shift with the advent of deep
learning. The use of multiple processing layers has enabled the creation of models capable …
learning. The use of multiple processing layers has enabled the creation of models capable …
[HTML][HTML] Transformers in medical image analysis
Transformers have dominated the field of natural language processing and have recently
made an impact in the area of computer vision. In the field of medical image analysis …
made an impact in the area of computer vision. In the field of medical image analysis …
Neural codec language models are zero-shot text to speech synthesizers
We introduce a language modeling approach for text to speech synthesis (TTS). Specifically,
we train a neural codec language model (called Vall-E) using discrete codes derived from …
we train a neural codec language model (called Vall-E) using discrete codes derived from …
Multimodal learning with transformers: A survey
Transformer is a promising neural network learner, and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …
[HTML][HTML] A survey of transformers
Transformers have achieved great success in many artificial intelligence fields, such as
natural language processing, computer vision, and audio processing. Therefore, it is natural …
natural language processing, computer vision, and audio processing. Therefore, it is natural …
Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech
Several recent end-to-end text-to-speech (TTS) models enabling single-stage training and
parallel sampling have been proposed, but their sample quality does not match that of two …
parallel sampling have been proposed, but their sample quality does not match that of two …
Motr: End-to-end multiple-object tracking with transformer
Temporal modeling of objects is a key challenge in multiple-object tracking (MOT). Existing
methods track by associating detections through motion-based and appearance-based …
methods track by associating detections through motion-based and appearance-based …
Grad-tts: A diffusion probabilistic model for text-to-speech
Recently, denoising diffusion probabilistic models and generative score matching have
shown high potential in modelling complex data distributions while stochastic calculus has …
shown high potential in modelling complex data distributions while stochastic calculus has …
Naturalspeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers
Scaling text-to-speech (TTS) to large-scale, multi-speaker, and in-the-wild datasets is
important to capture the diversity in human speech such as speaker identities, prosodies …
important to capture the diversity in human speech such as speaker identities, prosodies …
A survey on neural speech synthesis
Text to speech (TTS), or speech synthesis, which aims to synthesize intelligible and natural
speech given text, is a hot research topic in speech, language, and machine learning …
speech given text, is a hot research topic in speech, language, and machine learning …