Fused acoustic and text encoding for multimodal bilingual pretraining and speech translation
Recently, representation learning for text and speech has successfully improved many
language related tasks. However, all existing methods suffer from two limitations: (a) they …
Unified segment-to-segment framework for simultaneous sequence generation
Simultaneous sequence generation is a pivotal task for real-time scenarios, such as
streaming speech recognition, simultaneous machine translation and simultaneous speech …
End-to-End Speech-to-Text Translation: A Survey
N Sethiya, CK Maurya - arXiv preprint arXiv:2312.01053, 2023 - arxiv.org
Speech-to-text translation pertains to the task of converting speech signals in a language to
text in another language. It finds its application in various domains, such as hands-free …
ESPnet-ST-v2: Multipurpose spoken language translation toolkit
ESPnet-ST-v2 is a revamp of the open-source ESPnet-ST toolkit necessitated by the
broadening interests of the spoken language translation community. ESPnet-ST-v2 supports …
Attention as a guide for simultaneous speech translation
The study of the attention mechanism has sparked interest in many fields, such as language
modeling and machine translation. Although its patterns have been exploited to perform …
Over-generation cannot be rewarded: Length-adaptive average lagging for simultaneous speech translation
Simultaneous speech translation (SimulST) systems aim at generating their output with the
lowest possible latency, which is normally computed in terms of Average Lagging (AL). In …
Learning when to translate for streaming speech
How to find proper moments to generate partial sentence translation given a streaming
speech input? Existing approaches waiting-and-translating for a fixed duration often break …
Learning adaptive segmentation policy for end-to-end simultaneous translation
End-to-end simultaneous speech-to-text translation aims to directly perform translation from
streaming source speech to target text with high translation quality and low latency. A typical …
Token-level serialized output training for joint streaming ASR and ST leveraging textual alignments
In real-world applications, users often require both translations and transcriptions of speech
to enhance their comprehension, particularly in streaming scenarios where incremental …