Lessons learned in transcribing 5000 h of air traffic control communications for robust automatic speech understanding

J Zuluaga-Gomez, I Nigmatulina, A Prasad, P Motlicek… - Aerospace, 2023 - mdpi.com
Voice communication between air traffic controllers (ATCos) and pilots is critical for ensuring
safe and efficient air traffic control (ATC). The handling of these voice communications …

Train Long and Test Long: Leveraging Full Document Contexts in Speech Processing

W Chen, T Kano, A Ogawa, M Delcroix… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
The quadratic memory complexity of self-attention has generally restricted Transformer-
based models to utterance-based speech processing, preventing models from leveraging …

HyperMixer: An MLP-based low cost alternative to transformers

F Mai, A Pannatier, F Fehr, H Chen, F Marelli… - arXiv preprint arXiv …, 2022 - arxiv.org
Transformer-based architectures are the model of choice for natural language
understanding, but they come at a significant cost, as they have quadratic complexity in the …

Open-Source Conversational AI with SpeechBrain 1.0

M Ravanelli, T Parcollet, A Moumen… - arXiv preprint arXiv …, 2024 - arxiv.org
SpeechBrain is an open-source Conversational AI toolkit based on PyTorch, focused
particularly on speech processing tasks such as speech recognition, speech enhancement …

EfficientASR: Speech Recognition Network Compression via Attention Redundancy and Chunk-Level FFN Optimization

J Wang, Z Liang, X Zhang, N Cheng, J Xiao - arXiv preprint arXiv …, 2024 - arxiv.org
In recent years, Transformer networks have shown remarkable performance in speech
recognition tasks. However, their deployment poses challenges due to high computational …

XLSR-Transducer: Streaming ASR for Self-Supervised Pretrained Models

S Kumar, S Madikeri, J Zuluaga-Gomez… - arXiv preprint arXiv …, 2024 - arxiv.org
Self-supervised pretrained models exhibit competitive performance in automatic speech
recognition on finetuning, even with limited in-domain supervised data for training. However …

Audio Mamba: Selective State Spaces for Self-Supervised Audio Representations

S Yadav, ZH Tan - arXiv preprint arXiv:2406.02178, 2024 - arxiv.org
Despite its widespread adoption as the prominent neural architecture, the Transformer has
spurred several independent lines of work to address its limitations. One such approach is …

End-to-End Single-Channel Speaker-Turn Aware Conversational Speech Translation

J Zuluaga-Gomez, Z Huang, X Niu, R Paturi… - arXiv preprint arXiv …, 2023 - arxiv.org
Conventional speech-to-text translation (ST) systems are trained on single-speaker
utterances, and they may not generalize to real-life scenarios where the audio contains …

End-to-end single-channel speaker-turn aware conversational speech translation

JPZ Gomez, Z Huang, X Niu, R Paturi, S Srinivasan… - 2023 - amazon.science
Conventional speech-to-text translation (ST) systems are trained on single-speaker
utterances, and they may not generalize to real-life scenarios where the audio contains …