Streaming parallel transducer beam search with fast-slow cascaded encoders

Time-synchronous one-pass beam search for parallel online and offline transducers with dynamic block training

Y Sudo, M Shakeel, Y Peng… - Proc. INTERSPEECH …, 2023 - researchgate.net

End-to-end automatic speech recognition (ASR) has become an increasingly popular area
of research, with two main models being online and offline ASR. Online models aim to …

Cited by 6 · Related articles · All 6 versions

Variable attention masking for configurable transformer transducer speech recognition

P Swietojanski, S Braun, D Can… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org

This work studies the use of attention masking in transformer transducer based speech
recognition for building a single configurable model for different deployment scenarios. We …

Cited by 9 · Related articles · All 3 versions

A CIF-based speech segmentation method for streaming E2E ASR

Y Shu, H Luo, S Zhang, L Wang… - IEEE Signal Processing …, 2023 - ieeexplore.ieee.org

Long utterances segmentation is crucial in end-to-end (E2E) streaming automatic speech
recognition (ASR). However, commonly used voice activity detection (VAD)-based and fixed …

Cited by 6 · Related articles · All 2 versions

E2E segmentation in a two-pass cascaded encoder ASR model

WR Huang, SY Chang, TN Sainath… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org

We explore unifying a neural segmenter with two-pass cascaded encoder ASR into a single
model. A key challenge is allowing the segmenter (which runs in real-time, synchronously …

Cited by 6 · Related articles · All 2 versions

Improving fast-slow encoder based transducer with streaming deliberation

K Li, J Mahadeokar, J Guo, Y Shi… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org

This paper introduces a fast-slow encoder based transducer with streaming deliberation for
end-to-end automatic speech recognition. We aim to improve the recognition accuracy of the …

Cited by 5 · Related articles · All 4 versions

Flickering reduction with partial hypothesis reranking for streaming ASR

A Bruguier, D Qiu, T Strohman… - 2022 IEEE Spoken …, 2023 - ieeexplore.ieee.org

Incremental speech recognizers start displaying results while the users are still speaking.
These partial results are beneficial to users who like the responsiveness of the system …

Cited by 2 · Related articles · All 2 versions

4D ASR: Joint Beam Search Integrating CTC, Attention, Transducer, and Mask Predict Decoders

Y Sudo, M Shakeel, Y Fukumoto, B Yan, J Shi… - arXiv preprint arXiv …, 2024 - arxiv.org

End-to-end automatic speech recognition (E2E-ASR) can be classified into several network
architectures, such as connectionist temporal classification (CTC), recurrent neural network …

Segment-Level Vectorized Beam Search Based on Partially Autoregressive Inference

M Someki, N Eng, Y Higuchi… - 2023 IEEE Automatic …, 2023 - ieeexplore.ieee.org

Attention-based encoder-decoder models with autoregressive (AR) decoding have proven
to be the dominant approach for automatic speech recognition (ASR) due to their superior …

Joint Optimization of Streaming and Non-Streaming Automatic Speech Recognition with Multi-Decoder and Knowledge Distillation

M Shakeel, Y Sudo, Y Peng, S Watanabe - arXiv preprint arXiv …, 2024 - arxiv.org

End-to-end (E2E) automatic speech recognition (ASR) can operate in two modes: streaming
and non-streaming, each with its pros and cons. Streaming ASR processes the speech …

Conversation-oriented ASR with multi-look-ahead CBS architecture

H Zhao, S Fujie, T Ogawa, J Sakuma… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org

During conversations, humans are capable of inferring the intention of the speaker at any
point of the speech to prepare the following action promptly. Such ability is also the key for …

Cited by 2 · Related articles · All 5 versions