A review of deep learning techniques for speech processing

A Mehrish, N Majumder, R Bharadwaj, R Mihalcea… - Information …, 2023 - Elsevier
The field of speech processing has undergone a transformative shift with the advent of deep
learning. The use of multiple processing layers has enabled the creation of models capable …

Machine learning for synthetic data generation: a review

Y Lu, M Shen, H Wang, X Wang, C van Rechem… - arXiv preprint arXiv …, 2023 - arxiv.org
Machine learning heavily relies on data, but real-world applications often encounter various
data-related issues. These include data of poor quality, insufficient data points leading to …

Parallel learning: Overview and perspective for computational learning across Syn2Real and Sim2Real

Q Miao, Y Lv, M Huang, X Wang… - IEEE/CAA Journal of …, 2023 - ieeexplore.ieee.org
The virtual-to-real paradigm, i.e., training models on virtual data and then applying them to
solve real-world problems, has attracted more and more attention from various domains by …

On the importance and applicability of pre-training for federated learning

HY Chen, CH Tu, Z Li, HW Shen, WL Chao - arXiv preprint arXiv …, 2022 - arxiv.org
Pre-training is prevalent in nowadays deep learning to improve the learned model's
performance. However, in the literature on federated learning (FL), neural networks are …

SynthVSR: Scaling up visual speech recognition with synthetic supervision

X Liu, E Lakomkin, K Vougioukas… - Proceedings of the …, 2023 - openaccess.thecvf.com
Recently reported state-of-the-art results in visual speech recognition (VSR) often rely on
increasingly large amounts of video data, while the publicly available transcribed video …

Computer-assisted pronunciation training—Speech synthesis is almost all you need

D Korzekwa, J Lorenzo-Trueba, T Drugman… - Speech …, 2022 - Elsevier
The research community has long studied computer-assisted pronunciation training (CAPT)
methods in non-native speech. Researchers focused on studying various model …

Data augmentation for ASR using TTS via a discrete representation

S Ueno, M Mimura, S Sakai… - 2021 IEEE Automatic …, 2021 - ieeexplore.ieee.org
While end-to-end automatic speech recognition (ASR) has achieved high performance, it
requires a huge amount of paired speech and transcription data for training. Recently, data …

A semi-supervised complementary joint training approach for low-resource speech recognition

YQ Du, J Zhang, X Fang, MH Wu… - IEEE/ACM Transactions …, 2023 - ieeexplore.ieee.org
Both unpaired speech and text have shown to be beneficial for low-resource automatic
speech recognition (ASR), which, however, were either separately used for pre-training, self …

Fast and accurate factorized neural transducer for text adaption of end-to-end speech recognition models

R Zhao, J Xue, P Parthasarathy… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
Neural transducer is now the most popular end-to-end model for speech recognition, due to
its naturally streaming ability. However, it is challenging to adapt it with text-only data …

Generating data with text-to-speech and large-language models for conversational speech recognition

S Cornell, J Darefsky, Z Duan, S Watanabe - arXiv preprint arXiv …, 2024 - arxiv.org
Currently, a common approach in many speech processing tasks is to leverage large scale
pre-trained models by fine-tuning them on in-domain data for a particular application. Yet …