Fused acoustic and text encoding for multimodal bilingual pretraining and speech translation

R Zheng, J Chen, M Ma… - … Conference on Machine …, 2021 - proceedings.mlr.press
Recently, representation learning for text and speech has successfully improved many
language related tasks. However, all existing methods suffer from two limitations: (a) they …

Unified segment-to-segment framework for simultaneous sequence generation

S Zhang, Y Feng - Advances in Neural Information …, 2024 - proceedings.neurips.cc
Simultaneous sequence generation is a pivotal task for real-time scenarios, such as
streaming speech recognition, simultaneous machine translation and simultaneous speech …

End-to-End Speech-to-Text Translation: A Survey

N Sethiya, CK Maurya - arXiv preprint arXiv:2312.01053, 2023 - arxiv.org
Speech-to-text translation pertains to the task of converting speech signals in a language to
text in another language. It finds its application in various domains, such as hands-free …

ESPnet-ST-v2: Multipurpose spoken language translation toolkit

B Yan, J Shi, Y Tang, H Inaguma, Y Peng… - arXiv preprint arXiv …, 2023 - arxiv.org
ESPnet-ST-v2 is a revamp of the open-source ESPnet-ST toolkit necessitated by the
broadening interests of the spoken language translation community. ESPnet-ST-v2 supports …

Attention as a guide for simultaneous speech translation

S Papi, M Negri, M Turchi - arXiv preprint arXiv:2212.07850, 2022 - arxiv.org
The study of the attention mechanism has sparked interest in many fields, such as language
modeling and machine translation. Although its patterns have been exploited to perform …

Over-generation cannot be rewarded: Length-adaptive average lagging for simultaneous speech translation

S Papi, M Gaido, M Negri, M Turchi - arXiv preprint arXiv:2206.05807, 2022 - arxiv.org
Simultaneous speech translation (SimulST) systems aim at generating their output with the
lowest possible latency, which is normally computed in terms of Average Lagging (AL). In …

Learning when to translate for streaming speech

Q Dong, Y Zhu, M Wang, L Li - arXiv preprint arXiv:2109.07368, 2021 - arxiv.org
How to find proper moments to generate partial sentence translation given a streaming
speech input? Existing approaches waiting-and-translating for a fixed duration often break …

A roadmap for big model

S Yuan, H Zhao, S Zhao, J Leng, Y Liang… - arXiv preprint arXiv …, 2022 - arxiv.org
With the rapid development of deep learning, training Big Models (BMs) for multiple
downstream tasks becomes a popular paradigm. Researchers have achieved various …

Learning adaptive segmentation policy for end-to-end simultaneous translation

R Zhang, Z He, H Wu, H Wang - … of the 60th Annual Meeting of the …, 2022 - aclanthology.org
End-to-end simultaneous speech-to-text translation aims to directly perform translation from
streaming source speech to target text with high translation quality and low latency. A typical …

Token-level serialized output training for joint streaming ASR and ST leveraging textual alignments

S Papi, P Wang, J Chen, J Xue, J Li… - 2023 IEEE Automatic …, 2023 - ieeexplore.ieee.org
In real-world applications, users often require both translations and transcriptions of speech
to enhance their comprehension, particularly in streaming scenarios where incremental …