Recent developments on ESPnet toolkit boosted by Conformer

J Li - APSIPA Transactions on Signal and Information …, 2022 - nowpublishers.com

Recently, the speech community is seeing a significant trend of moving from deep neural
network based hybrid modeling to end-to-end (E2E) modeling for automatic speech …

Cited by 326 - Related articles - All 7 versions

Branchformer: Parallel MLP-attention architectures to capture local and global context for speech recognition and understanding

Y Peng, S Dalmia, I Lane… - … Conference on Machine …, 2022 - proceedings.mlr.press

Conformer has proven to be effective in many speech processing tasks. It combines the
benefits of extracting local dependencies using convolutions and global dependencies …

Cited by 108 - Related articles - All 8 versions

End-to-end speech recognition: A survey

R Prabhavalkar, T Hori, TN Sainath… - … on Audio, Speech …, 2023 - ieeexplore.ieee.org

In the last decade of automatic speech recognition (ASR) research, the introduction of deep
learning has brought considerable reductions in word error rate of more than 50% relative …

Cited by 80 - Related articles - All 6 versions

GigaSpeech: An evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio

G Chen, S Chai, G Wang, J Du, WQ Zhang… - arXiv preprint arXiv …, 2021 - arxiv.org

This paper introduces GigaSpeech, an evolving, multi-domain English speech recognition
corpus with 10,000 hours of high quality labeled audio suitable for supervised training, and …

Cited by 186 - Related articles - All 8 versions

WenetSpeech: A 10000+ hours multi-domain Mandarin corpus for speech recognition

B Zhang, H Lv, P Guo, Q Shao, C Yang… - ICASSP 2022-2022 …, 2022 - ieeexplore.ieee.org

In this paper, we present WenetSpeech, a multi-domain Mandarin corpus consisting of
10000+ hours high-quality labeled speech, 2400+ hours weakly labeled speech, and about …

Cited by 152 - Related articles - All 4 versions

Squeezeformer: An efficient transformer for automatic speech recognition

S Kim, A Gholami, A Shaw, N Lee… - Advances in …, 2022 - proceedings.neurips.cc

The recently proposed Conformer model has become the de facto backbone model for
various downstream speech tasks based on its hybrid attention-convolution architecture that …

Cited by 75 - Related articles - All 7 versions

E-Branchformer: Branchformer with enhanced merging for speech recognition

K Kim, F Wu, Y Peng, J Pan, P Sridhar… - 2022 IEEE Spoken …, 2023 - ieeexplore.ieee.org

Conformer, combining convolution and self-attention sequentially to capture both local and
global information, has shown remarkable performance and is currently regarded as the …

Cited by 68 - Related articles - All 5 versions

Opencpop: A high-quality open source Chinese popular song corpus for singing voice synthesis

Y Wang, X Wang, P Zhu, J Wu, H Li, H Xue… - arXiv preprint arXiv …, 2022 - arxiv.org

This paper introduces Opencpop, a publicly available high-quality Mandarin singing corpus
designed for singing voice synthesis (SVS). The corpus consists of 100 popular Mandarin …

Cited by 76 - Related articles - All 5 versions

An exploration of self-supervised pretrained representations for end-to-end speech recognition

X Chang, T Maekaku, P Guo, J Shi… - 2021 IEEE Automatic …, 2021 - ieeexplore.ieee.org

Self-supervised pretraining on speech data has achieved a lot of progress. High-fidelity
representation of the speech signal is learned from a lot of untranscribed data and shows …

Cited by 79 - Related articles - All 9 versions

The singing voice conversion challenge 2023

WC Huang, LP Violeta, S Liu, J Shi… - 2023 IEEE Automatic …, 2023 - ieeexplore.ieee.org

We present the latest iteration of the voice conversion challenge (VCC) series, a bi-annual
scientific event aiming to compare and understand different voice conversion (VC) systems …

Cited by 31 - Related articles - All 4 versions