Squeezeformer: An efficient transformer for automatic speech recognition

S Kim, A Gholami, A Shaw, N Lee… - Advances in …, 2022 - proceedings.neurips.cc
The recently proposed Conformer model has become the de facto backbone model for
various downstream speech tasks based on its hybrid attention-convolution architecture that …

Understanding the role of self attention for efficient speech recognition

K Shim, J Choi, W Sung - International Conference on Learning …, 2022 - openreview.net
Self-attention (SA) is a critical component of Transformer neural networks that have
succeeded in automatic speech recognition (ASR). In this paper, we analyze the role of SA …

Attention as a guide for simultaneous speech translation

S Papi, M Negri, M Turchi - arXiv preprint arXiv:2212.07850, 2022 - arxiv.org
The study of the attention mechanism has sparked interest in many fields, such as language
modeling and machine translation. Although its patterns have been exploited to perform …

Direct speech translation for automatic subtitling

S Papi, M Gaido, A Karakanta, M Cettolo… - Transactions of the …, 2023 - direct.mit.edu
Automatic subtitling is the task of automatically translating the speech of audiovisual content
into short pieces of timed text, i.e., subtitles and their corresponding timestamps. The …

Lookahead when it matters: Adaptive non-causal transformers for streaming neural transducers

G Strimel, Y Xie, BJ King, M Radfar… - International …, 2023 - proceedings.mlr.press
Streaming speech recognition architectures are employed for low-latency, real-time
applications. Such architectures are often characterized by their causality. Causal …

Deep sparse conformer for speech recognition

X Wu - arXiv preprint arXiv:2209.00260, 2022 - arxiv.org
Conformer has achieved impressive results in Automatic Speech Recognition (ASR) by
leveraging transformer's capturing of content-based global interactions and convolutional …

Attention Enhanced Citrinet for Speech Recognition

X Wu - arXiv preprint arXiv:2209.00261, 2022 - arxiv.org
Citrinet is an end-to-end convolutional Connectionist Temporal Classification (CTC) based
automatic speech recognition (ASR) model. To capture local and global contextual …

Conversation-oriented ASR with multi-look-ahead CBS architecture

H Zhao, S Fujie, T Ogawa, J Sakuma… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
During conversations, humans are capable of inferring the intention of the speaker at any
point of the speech to prepare the following action promptly. Such ability is also the key for …

An investigation of enhancing CTC model for triggered attention-based streaming ASR

H Zhao, Y Higuchi, T Ogawa… - 2021 Asia-Pacific Signal …, 2021 - ieeexplore.ieee.org
In the present paper, an attempt is made to combine Mask-CTC and the triggered attention
mechanism to construct a streaming end-to-end automatic speech recognition (ASR) system …

The evaluation of a code-switched Sepedi-English automatic speech recognition system

A Phaladi, T Modipa - arXiv preprint arXiv:2403.07947, 2024 - arxiv.org
Speech technology is a field that encompasses various techniques and tools used to enable
machines to interact with speech, such as automatic speech recognition (ASR), spoken …