Time-synchronous one-pass beam search for parallel online and offline transducers with dynamic block training

Y Sudo, M Shakeel, Y Peng… - Proc. INTERSPEECH …, 2023 - researchgate.net
End-to-end automatic speech recognition (ASR) has become an increasingly popular area
of research, with two main models being online and offline ASR. Online models aim to …

Variable attention masking for configurable transformer transducer speech recognition

P Swietojanski, S Braun, D Can… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
This work studies the use of attention masking in transformer transducer based speech
recognition for building a single configurable model for different deployment scenarios. We …

A CIF-based speech segmentation method for streaming E2E ASR

Y Shu, H Luo, S Zhang, L Wang… - IEEE Signal Processing …, 2023 - ieeexplore.ieee.org
Long utterances segmentation is crucial in end-to-end (E2E) streaming automatic speech
recognition (ASR). However, commonly used voice activity detection (VAD)-based and fixed …

E2E segmentation in a two-pass cascaded encoder ASR model

WR Huang, SY Chang, TN Sainath… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
We explore unifying a neural segmenter with two-pass cascaded encoder ASR into a single
model. A key challenge is allowing the segmenter (which runs in real-time, synchronously …

Improving fast-slow encoder based transducer with streaming deliberation

K Li, J Mahadeokar, J Guo, Y Shi… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
This paper introduces a fast-slow encoder based transducer with streaming deliberation for
end-to-end automatic speech recognition. We aim to improve the recognition accuracy of the …

Flickering reduction with partial hypothesis reranking for streaming ASR

A Bruguier, D Qiu, T Strohman… - 2022 IEEE Spoken …, 2023 - ieeexplore.ieee.org
Incremental speech recognizers start displaying results while the users are still speaking.
These partial results are beneficial to users who like the responsiveness of the system …

4D ASR: Joint Beam Search Integrating CTC, Attention, Transducer, and Mask Predict Decoders

Y Sudo, M Shakeel, Y Fukumoto, B Yan, J Shi… - arXiv preprint arXiv …, 2024 - arxiv.org
End-to-end automatic speech recognition (E2E-ASR) can be classified into several network
architectures, such as connectionist temporal classification (CTC), recurrent neural network …

Segment-Level Vectorized Beam Search Based on Partially Autoregressive Inference

M Someki, N Eng, Y Higuchi… - 2023 IEEE Automatic …, 2023 - ieeexplore.ieee.org
Attention-based encoder-decoder models with autoregressive (AR) decoding have proven
to be the dominant approach for automatic speech recognition (ASR) due to their superior …

Joint Optimization of Streaming and Non-Streaming Automatic Speech Recognition with Multi-Decoder and Knowledge Distillation

M Shakeel, Y Sudo, Y Peng, S Watanabe - arXiv preprint arXiv …, 2024 - arxiv.org
End-to-end (E2E) automatic speech recognition (ASR) can operate in two modes: streaming
and non-streaming, each with its pros and cons. Streaming ASR processes the speech …

Conversation-oriented ASR with multi-look-ahead CBS architecture

H Zhao, S Fujie, T Ogawa, J Sakuma… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
During conversations, humans are capable of inferring the intention of the speaker at any
point of the speech to prepare the following action promptly. Such ability is also the key for …