GigaSpeech: An evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio

G Chen, S Chai, G Wang, J Du, WQ Zhang… - arXiv preprint arXiv …, 2021 - arxiv.org
This paper introduces GigaSpeech, an evolving, multi-domain English speech recognition
corpus with 10,000 hours of high-quality labeled audio suitable for supervised training, and …

Findings of the IWSLT 2022 Evaluation Campaign

A Anastasopoulos, L Barrault, L Bentivogli… - Proceedings of the 19th …, 2022 - cris.fbk.eu
The evaluation campaign of the 19th International Conference on Spoken Language
Translation featured eight shared tasks: (i) Simultaneous speech translation, (ii) Offline …

Prompting large language models for zero-shot domain adaptation in speech recognition

Y Li, Y Wu, J Li, S Liu - 2023 IEEE Automatic Speech …, 2023 - ieeexplore.ieee.org
The integration of Language Models (LMs) has proven to be an effective way to address
domain shifts in speech recognition. However, these approaches usually require a …

Augmented datasheets for speech datasets and ethical decision-making

O Papakyriakopoulos, ASG Choi, W Thong… - Proceedings of the …, 2023 - dl.acm.org
Speech datasets are crucial for training Speech Language Technologies (SLT); however,
the lack of diversity of the underlying training data can lead to serious limitations in building …

Reproducing Whisper-style training using an open-source toolkit and publicly available data

Y Peng, J Tian, B Yan, D Berrebbi… - 2023 IEEE Automatic …, 2023 - ieeexplore.ieee.org
Pre-training speech models on large volumes of data has achieved remarkable success.
OpenAI Whisper is a multilingual multitask model trained on 680k hours of supervised …

Exploring speech recognition, translation, and understanding with discrete speech units: A comparative study

X Chang, B Yan, K Choi, JW Jung, Y Lu… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
Speech signals, typically sampled at rates in the tens of thousands per second, contain
redundancies, evoking inefficiencies in sequence modeling. High-dimensional speech …

OWSM v3.1: Better and faster open Whisper-style speech models based on E-Branchformer

Y Peng, J Tian, W Chen, S Arora, B Yan, Y Sudo… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent studies have advocated for fully open foundation models to promote transparency
and open science. As an initial step, the Open Whisper-style Speech Model (OWSM) …

How might we create better benchmarks for speech recognition?

A Aksënova, D van Esch, J Flynn… - Proceedings of the 1st …, 2021 - aclanthology.org
The applications of automatic speech recognition (ASR) systems are proliferating, in part
due to recent significant quality improvements. However, as recent work indicates, even …

A study on the integration of pre-trained SSL, ASR, LM and SLU models for spoken language understanding

Y Peng, S Arora, Y Higuchi, Y Ueda… - 2022 IEEE Spoken …, 2023 - ieeexplore.ieee.org
Collecting sufficient labeled data for spoken language understanding (SLU) is expensive
and time-consuming. Recent studies achieved promising results by using pre-trained …

Adapting large language model with speech for fully formatted end-to-end speech recognition

S Ling, Y Hu, S Qian, G Ye, Y Qian… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
Most end-to-end (E2E) speech recognition models are composed of encoder and decoder
blocks that perform acoustic and language modeling functions. Pretrained large language …