Efficient sequence transduction by jointly predicting tokens and durations

H Xu, F Jia, S Majumdar, H Huang… - International …, 2023 - proceedings.mlr.press
This paper introduces a novel Token-and-Duration Transducer (TDT) architecture for
sequence-to-sequence tasks. TDT extends conventional RNN-Transducer architectures by …

End-to-end speech recognition contextualization with large language models

E Lakomkin, C Wu, Y Fathullah, O Kalinli… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
In recent years, Large Language Models (LLMs) have garnered significant attention from the
research community due to their exceptional performance and generalization capabilities. In …

Token-level serialized output training for joint streaming ASR and ST leveraging textual alignments

S Papi, P Wang, J Chen, J Xue, J Li… - 2023 IEEE Automatic …, 2023 - ieeexplore.ieee.org
In real-world applications, users often require both translations and transcriptions of speech
to enhance their comprehension, particularly in streaming scenarios where incremental …

Contextual biasing of named-entities with large language models

C Sun, Z Ahmed, Y Ma, Z Liu, L Kabela… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
We explore contextual biasing with Large Language Models (LLMs) to enhance Automatic
Speech Recognition (ASR) in second-pass rescoring. Our approach introduces the …

Improving large-scale deep biasing with phoneme features and text-only data in streaming transducer

J Qiu, L Huang, B Li, J Zhang, L Lu… - 2023 IEEE Automatic …, 2023 - ieeexplore.ieee.org
Deep biasing for the Transducer can improve the recognition performance of rare words or
contextual entities, which is essential in practical applications, especially for streaming …

Spike-Triggered Contextual Biasing for End-to-End Mandarin Speech Recognition

K Huang, A Zhang, B Zhang, T Xu… - 2023 IEEE Automatic …, 2023 - ieeexplore.ieee.org
The attention-based deep contextual biasing method has been demonstrated to effectively
improve the recognition performance of end-to-end automatic speech recognition (ASR) …

Speed of Light Exact Greedy Decoding for RNN-T Speech Recognition Models on GPU

D Galvez, V Bataev, H Xu, T Kaldewey - arXiv preprint arXiv:2406.03791, 2024 - arxiv.org
The vast majority of inference time for RNN Transducer (RNN-T) models today is spent on
decoding. Current state-of-the-art RNN-T decoding implementations leave the GPU idle …

Key Frame Mechanism For Efficient Conformer Based End-to-end Speech Recognition

P Fan, C Shan, S Sun, Q Yang… - IEEE Signal Processing …, 2023 - ieeexplore.ieee.org
Recently, Conformer as a backbone network for end-to-end automatic speech recognition
achieved state-of-the-art performance. The Conformer block leverages a self-attention …

Contextualized End-to-end Automatic Speech Recognition with Intermediate Biasing Loss

M Shakeel, Y Sudo, Y Peng, S Watanabe - arXiv preprint arXiv …, 2024 - arxiv.org
Contextualized end-to-end automatic speech recognition has been an active research area,
with recent efforts focusing on the implicit learning of contextual phrases based on the final …

Improving ASR Contextual Biasing with Guided Attention

J Tang, K Kim, S Shon, F Wu… - ICASSP 2024-2024 IEEE …, 2024 - ieeexplore.ieee.org
In this paper, we propose a Guided Attention (GA) auxiliary training loss, which improves the
effectiveness and robustness of automatic speech recognition (ASR) contextual biasing …