Self-supervised speech representation learning: A review

A Mohamed, H Lee, L Borgholt… - IEEE Journal of …, 2022 - ieeexplore.ieee.org
Although supervised deep learning has revolutionized speech and audio processing, it has
necessitated the building of specialist models for individual tasks and application scenarios …

Turn-taking in conversational systems and human-robot interaction: a review

G Skantze - Computer Speech & Language, 2021 - Elsevier
The taking of turns is a fundamental aspect of dialogue. Since it is difficult to speak and listen
at the same time, the participants need to coordinate who is currently speaking and when …

A high-performance speech neuroprosthesis

FR Willett, EM Kunz, C Fan, DT Avansino, GH Wilson… - Nature, 2023 - nature.com
Speech brain–computer interfaces (BCIs) have the potential to restore rapid communication
to people with paralysis by decoding neural activity evoked by attempted speech into text, or …

Voicebox: Text-guided multilingual universal speech generation at scale

M Le, A Vyas, B Shi, B Karrer, L Sari… - Advances in neural …, 2024 - proceedings.neurips.cc
Large-scale generative models such as GPT and DALL-E have revolutionized the research
community. These models not only generate high fidelity outputs, but are also generalists …

Branchformer: Parallel MLP-attention architectures to capture local and global context for speech recognition and understanding

Y Peng, S Dalmia, I Lane… - … Conference on Machine …, 2022 - proceedings.mlr.press
Conformer has proven to be effective in many speech processing tasks. It combines the
benefits of extracting local dependencies using convolutions and global dependencies …

DataPerf: Benchmarks for data-centric AI development

M Mazumder, C Banbury, X Yao… - Advances in …, 2024 - proceedings.neurips.cc
Machine learning research has long focused on models rather than datasets, and
prominent datasets are used for common ML tasks without regard to the breadth, difficulty …

A comparative study on Transformer vs RNN in speech applications

S Karita, N Chen, T Hayashi, T Hori… - 2019 IEEE automatic …, 2019 - ieeexplore.ieee.org
Sequence-to-sequence models have been widely used in end-to-end speech processing,
for example, automatic speech recognition (ASR), speech translation (ST), and text-to …

VoxCeleb: Large-scale speaker verification in the wild

A Nagrani, JS Chung, W Xie, A Zisserman - Computer Speech & Language, 2020 - Elsevier
The objective of this work is speaker recognition under noisy and unconstrained conditions.
We make two key contributions. First, we introduce a very large-scale audio-visual dataset …

SpecAugment: A simple data augmentation method for automatic speech recognition

DS Park, W Chan, Y Zhang, CC Chiu, B Zoph… - arXiv preprint arXiv …, 2019 - arxiv.org
We present SpecAugment, a simple data augmentation method for speech recognition.
SpecAugment is applied directly to the feature inputs of a neural network (i.e., filter bank …

End-to-end speech recognition: A survey

R Prabhavalkar, T Hori, TN Sainath… - … on Audio, Speech …, 2023 - ieeexplore.ieee.org
In the last decade of automatic speech recognition (ASR) research, the introduction of deep
learning has brought considerable reductions in word error rate of more than 50% relative …