STEMM: Self-learning with speech-text manifold mixup for speech translation

Q Fang, R Ye, L Li, Y Feng, M Wang - arXiv preprint arXiv:2203.10426, 2022 - arxiv.org
How can we learn a better speech representation for end-to-end speech-to-text translation (ST)
with limited labeled data? Existing techniques often attempt to transfer powerful machine …
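
The title points to mixing speech and text representations in a shared embedding space. As a purely illustrative sketch (the snippet does not describe the actual algorithm), the fragment below swaps aligned spans of acoustic embeddings for transcript-token embeddings before a shared encoder; the function name, the alignment format, and the mixing probability are assumptions, not the paper's method.

```python
import torch

def speech_text_mixup(speech_emb, text_emb, word_spans, p_mix=0.5):
    """Illustrative speech-text mixup at the embedding level.

    speech_emb: (T_speech, d) acoustic embeddings from a speech encoder.
    text_emb:   (T_text, d) embeddings of the transcript tokens.
    word_spans: list of (speech_slice, text_index) alignments, assumed given.
    Each aligned word is kept as speech frames or replaced by its text
    embedding with probability p_mix, yielding a mixed sequence that a
    shared encoder/decoder can consume alongside the speech-only input.
    """
    mixed = []
    for speech_slice, text_index in word_spans:
        if torch.rand(1).item() < p_mix:
            mixed.append(text_emb[text_index].unsqueeze(0))  # use the text token embedding
        else:
            mixed.append(speech_emb[speech_slice])           # keep the speech frames
    return torch.cat(mixed, dim=0)
```

In a self-learning setup of the kind the title suggests, predictions on the mixed sequence could additionally supervise the speech-only path, though that detail is not visible in the snippet.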

The Multilingual TEDx corpus for speech recognition and translation

E Salesky, M Wiesner, J Bremerman, R Cattoni, et al. - arXiv preprint arXiv …, 2021 - arxiv.org
We present the Multilingual TEDx corpus, built to support speech recognition (ASR) and
speech translation (ST) research across many non-English source languages. The corpus is …

Cascade versus direct speech translation: Do the differences still make a difference?

L Bentivogli, M Cettolo, M Gaido, A Karakanta, et al. - arXiv preprint arXiv …, 2021 - arxiv.org
Five years after the first published proofs of concept, direct approaches to speech translation
(ST) are now competing with traditional cascade solutions. In light of this steady progress …

Learning shared semantic space for speech-to-text translation

C Han, M Wang, H Ji, L Li - arXiv preprint arXiv:2105.03095, 2021 - arxiv.org
Having numerous potential applications and great impact, end-to-end speech translation
(ST) has long been treated as an independent task, failing to fully draw strength from the …

Speech translation and the end-to-end promise: Taking stock of where we are

M Sperber, M Paulik - arXiv preprint arXiv:2004.06358, 2020 - arxiv.org
Over its three-decade history, speech translation has experienced several shifts in its
primary research themes, moving from loosely coupled cascades of speech recognition and …

Stacked acoustic-and-textual encoding: Integrating the pre-trained models into speech translation encoders

C Xu, B Hu, Y Li, Y Zhang, Q Ju, T Xiao, J Zhu - arXiv preprint arXiv …, 2021 - arxiv.org
Encoder pre-training is promising in end-to-end speech translation (ST), given that
speech-to-translation data is scarce. But ST encoders are not simple instances of Automatic …
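
The title suggests stacking a pre-trained acoustic encoder with a pre-trained textual (MT) encoder. Below is a minimal sketch of such a stacked ST encoder, assuming a small adapter bridges the two representation spaces; the class and module names are hypothetical and only illustrate the general idea of reusing both pre-trained encoders, not the paper's exact design.

```python
import torch.nn as nn

class StackedSTEncoder(nn.Module):
    """Illustrative stacked encoder: an acoustic encoder (e.g. ASR-pretrained)
    followed by a textual encoder (e.g. MT-pretrained), bridged by an adapter
    that maps acoustic states into the textual encoder's input space."""

    def __init__(self, acoustic_encoder, textual_encoder, d_acoustic, d_text):
        super().__init__()
        self.acoustic_encoder = acoustic_encoder  # audio features -> (B, T, d_acoustic)
        self.adapter = nn.Sequential(             # bridge between representation spaces
            nn.Linear(d_acoustic, d_text),
            nn.ReLU(),
            nn.Linear(d_text, d_text),
        )
        self.textual_encoder = textual_encoder    # (B, T, d_text) -> (B, T, d_text)

    def forward(self, audio_features):
        acoustic_states = self.acoustic_encoder(audio_features)
        bridged = self.adapter(acoustic_states)
        return self.textual_encoder(bridged)
```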

Listen, understand and translate: Triple supervision decouples end-to-end speech-to-text translation

Q Dong, R Ye, M Wang, H Zhou, S Xu, B Xu, et al. - Proceedings of the AAAI …, 2021 - ojs.aaai.org
An end-to-end speech-to-text translation (ST) model takes audio in a source language and outputs
text in a target language. Existing methods are limited by the amount of parallel corpus …

Multimodal machine translation through visuals and speech

U Sulubacak, O Caglayan, SA Grönroos, A Rouhe, et al. - Machine …, 2020 - Springer
Multimodal machine translation involves drawing information from more than one modality,
based on the assumption that the additional modalities will contain useful alternative views …

CoVoST: A diverse multilingual speech-to-text translation corpus

C Wang, J Pino, A Wu, J Gu - arXiv preprint arXiv:2002.01320, 2020 - arxiv.org
Spoken language translation has recently witnessed a resurgence in popularity, thanks to
the development of end-to-end models and the creation of new corpora, such as Augmented …

Self-training for end-to-end speech translation

J Pino, Q Xu, X Ma, MJ Dousti, Y Tang - arXiv preprint arXiv:2006.02490, 2020 - arxiv.org
One of the main challenges for end-to-end speech translation is data scarcity. We leverage
pseudo-labels generated from unlabeled audio by a cascade and an end-to-end speech …
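
The snippet describes pseudo-labeling unlabeled audio with an existing system and training an end-to-end model on the result. A minimal sketch of that loop is below; the callables (teacher_translate, train_model) and the simple concatenation of gold and pseudo data are assumptions for illustration, not the paper's exact recipe.

```python
def self_train_st(teacher_translate, unlabeled_audio, gold_pairs, train_model):
    """Illustrative self-training loop for end-to-end speech translation.

    teacher_translate: callable mapping audio -> target-language text
        (e.g. an ASR+MT cascade or an existing end-to-end model), assumed given.
    unlabeled_audio:   iterable of audio utterances without references.
    gold_pairs:        list of (audio, reference_translation) pairs.
    train_model:       callable that trains an ST model on (audio, text) pairs.
    """
    # 1. Pseudo-label the unlabeled audio with the teacher system.
    pseudo_pairs = [(audio, teacher_translate(audio)) for audio in unlabeled_audio]

    # 2. Combine gold and pseudo-labeled data (filtering or upsampling gold
    #    data is a common refinement, omitted here).
    combined = list(gold_pairs) + pseudo_pairs

    # 3. Train (or fine-tune) the end-to-end student on the combined set.
    return train_model(combined)
```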