Neural speech synthesis with transformer network

A Mehrish, N Majumder, R Bharadwaj, R Mihalcea… - Information …, 2023 - Elsevier

The field of speech processing has undergone a transformative shift with the advent of deep
learning. The use of multiple processing layers has enabled the creation of models capable …

Cited by 142 · Related articles · All 6 versions

[HTML] sciencedirect.com

[HTML] Transformers in medical image analysis

K He, C Gan, Z Li, I Rekik, Z Yin, W Ji, Y Gao, Q Wang… - Intelligent …, 2023 - Elsevier

Transformers have dominated the field of natural language processing and have recently
made an impact in the area of computer vision. In the field of medical image analysis …

Cited by 273 · Related articles · All 12 versions

[PDF] arxiv.org

Neural codec language models are zero-shot text to speech synthesizers

C Wang, S Chen, Y Wu, Z Zhang, L Zhou, S Liu… - arXiv preprint arXiv …, 2023 - arxiv.org

We introduce a language modeling approach for text to speech synthesis (TTS). Specifically,
we train a neural codec language model (called VALL-E) using discrete codes derived from …

Cited by 474 · Related articles · All 3 versions

[PDF] ieee.org

Multimodal learning with transformers: A survey

P Xu, X Zhu, DA Clifton - IEEE Transactions on Pattern Analysis …, 2023 - ieeexplore.ieee.org

Transformer is a promising neural network learner, and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …

Cited by 457 · Related articles · All 9 versions

[HTML] sciencedirect.com

[HTML] A survey of transformers

T Lin, Y Wang, X Liu, X Qiu - AI Open, 2022 - Elsevier

Transformers have achieved great success in many artificial intelligence fields, such as
natural language processing, computer vision, and audio processing. Therefore, it is natural …

Cited by 1168 · Related articles · All 4 versions

[PDF] mlr.press

Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech

J Kim, J Kong, J Son - International Conference on Machine …, 2021 - proceedings.mlr.press

Several recent end-to-end text-to-speech (TTS) models enabling single-stage training and
parallel sampling have been proposed, but their sample quality does not match that of two …

Cited by 775 · Related articles · All 6 versions

[PDF] arxiv.org

MOTR: End-to-end multiple-object tracking with transformer

F Zeng, B Dong, Y Zhang, T Wang, X Zhang… - European Conference on …, 2022 - Springer

Temporal modeling of objects is a key challenge in multiple-object tracking (MOT). Existing
methods track by associating detections through motion-based and appearance-based …

Cited by 507 · Related articles · All 7 versions

[PDF] mlr.press

Grad-TTS: A diffusion probabilistic model for text-to-speech

V Popov, I Vovk, V Gogoryan… - International …, 2021 - proceedings.mlr.press

Recently, denoising diffusion probabilistic models and generative score matching have
shown high potential in modelling complex data distributions while stochastic calculus has …

Cited by 459 · Related articles · All 5 versions

[PDF] arxiv.org

NaturalSpeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers

K Shen, Z Ju, X Tan, Y Liu, Y Leng, L He, T Qin… - arXiv preprint arXiv …, 2023 - arxiv.org

Scaling text-to-speech (TTS) to large-scale, multi-speaker, and in-the-wild datasets is
important to capture the diversity in human speech such as speaker identities, prosodies …

Cited by 147 · Related articles · All 3 versions

[PDF] arxiv.org

A survey on neural speech synthesis

X Tan, T Qin, F Soong, TY Liu - arXiv preprint arXiv:2106.15561, 2021 - arxiv.org

Text to speech (TTS), or speech synthesis, which aims to synthesize intelligible and natural
speech given text, is a hot research topic in speech, language, and machine learning …

Cited by 403 · Related articles · All 2 versions