Recent advances in direct speech-to-text translation

C Xu, R Ye, Q Dong, C Zhao, T Ko, M Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
Recently, speech-to-text translation has attracted more and more attention and many studies
have emerged rapidly. In this paper, we present a comprehensive survey on direct speech …

M3ST: Mix at Three Levels for Speech Translation

X Cheng, Q Dong, F Yue, T Ko… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
How to solve the data scarcity problem for end-to-end speech-to-text translation (ST)? It's
well known that data augmentation is an efficient method to improve performance for many …

The SIGMORPHON 2020 shared task on multilingual grapheme-to-phoneme conversion

K Gorman, LFE Ashby, A Goyzueta… - Proceedings of the …, 2020 - aclanthology.org
We describe the design and findings of the SIGMORPHON 2020 shared task on multilingual
grapheme-to-phoneme conversion. Participants were asked to submit systems which take in …

Sample, translate, recombine: Leveraging audio alignments for data augmentation in end-to-end speech translation

TK Lam, S Schamoni, S Riezler - arXiv preprint arXiv:2203.08757, 2022 - arxiv.org
End-to-end speech translation relies on data that pair source-language speech inputs with
corresponding translations into a target language. Such data are notoriously scarce, making …

Large-scale self- and semi-supervised learning for speech translation

C Wang, A Wu, J Pino, A Baevski, M Auli… - arXiv preprint arXiv …, 2021 - arxiv.org
In this paper, we improve speech translation (ST) through effectively leveraging large
quantities of unlabeled speech and text data in different and complementary ways. We …

Leveraging pseudo-labeled data to improve direct speech-to-speech translation

Q Dong, F Yue, T Ko, M Wang, Q Bai… - arXiv preprint arXiv …, 2022 - arxiv.org
Direct Speech-to-speech translation (S2ST) has drawn more and more attention recently.
The task is very challenging due to data scarcity and complex speech-to-speech mapping. In …

Effectively pretraining a speech translation decoder with machine translation data

A Alinejad, A Sarkar - Proceedings of the 2020 Conference on …, 2020 - aclanthology.org
Directly translating from speech to text using an end-to-end approach is still challenging for
many language pairs due to insufficient data. Although pretraining the encoder parameters …

Translatotron 2: Robust direct speech-to-speech translation

Y Jia, MT Ramanovich, T Remez, R Pomerantz - 2021 - openreview.net
We present Translatotron 2, a neural direct speech-to-speech translation model that can be
trained end-to-end. Translatotron 2 consists of a speech encoder, a phoneme decoder, a …

Learning when to translate for streaming speech

Q Dong, Y Zhu, M Wang, L Li - arXiv preprint arXiv:2109.07368, 2021 - arxiv.org
How to find proper moments to generate partial sentence translation given a streaming
speech input? Existing approaches waiting-and-translating for a fixed duration often break …

Self-supervised representations improve end-to-end speech translation

A Wu, C Wang, J Pino, J Gu - arXiv preprint arXiv:2006.12124, 2020 - arxiv.org
End-to-end speech-to-text translation can provide a simpler and smaller system but is facing
the challenge of data scarcity. Pre-training methods can leverage unlabeled data and have …