A review of deep learning techniques for speech processing

A Mehrish, N Majumder, R Bharadwaj, R Mihalcea… - Information …, 2023 - Elsevier
The field of speech processing has undergone a transformative shift with the advent of deep
learning. The use of multiple processing layers has enabled the creation of models capable …

Self-supervised speech representation learning: A review

A Mohamed, H Lee, L Borgholt… - IEEE Journal of …, 2022 - ieeexplore.ieee.org
Although supervised deep learning has revolutionized speech and audio processing, it has
necessitated the building of specialist models for individual tasks and application scenarios …

Google usm: Scaling automatic speech recognition beyond 100 languages

Y Zhang, W Han, J Qin, Y Wang, A Bapna… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce the Universal Speech Model (USM), a single large model that performs
automatic speech recognition (ASR) across 100+ languages. This is achieved by pre …

Beats: Audio pre-training with acoustic tokenizers

S Chen, Y Wu, C Wang, S Liu, D Tompkins… - arXiv preprint arXiv …, 2022 - arxiv.org
The massive growth of self-supervised learning (SSL) has been witnessed in language,
vision, speech, and audio domains over the past few years. While discrete label prediction is …

Audiopalm: A large language model that can speak and listen

PK Rubenstein, C Asawaroengchai… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce AudioPaLM, a large language model for speech understanding and
generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil …

Soundstorm: Efficient parallel audio generation

Z Borsos, M Sharifi, D Vincent, E Kharitonov… - arXiv preprint arXiv …, 2023 - arxiv.org
We present SoundStorm, a model for efficient, non-autoregressive audio generation.
SoundStorm receives as input the semantic tokens of AudioLM, and relies on bidirectional …

SeamlessM4T-Massively Multilingual & Multimodal Machine Translation

L Barrault, YA Chung, MC Meglioli, D Dale… - arXiv preprint arXiv …, 2023 - arxiv.org
What does it take to create the Babel Fish, a tool that can help individuals translate speech
between any two languages? While recent breakthroughs in text-based models have …

Prompting large language models with speech recognition abilities

Y Fathullah, C Wu, E Lakomkin, J Jia… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
Large language models (LLMs) have proven themselves highly flexible, able to solve a wide
range of generative tasks, such as abstractive summarization and open-ended question …

Seamless: Multilingual Expressive and Streaming Speech Translation

L Barrault, YA Chung, MC Meglioli, D Dale… - arXiv preprint arXiv …, 2023 - arxiv.org
Large-scale automatic speech translation systems today lack key features that help machine-
mediated communication feel seamless when compared to human-to-human dialogue. In …

Conditional adapters: Parameter-efficient transfer learning with fast inference

T Lei, J Bai, S Brahma, J Ainslie… - Advances in …, 2023 - proceedings.neurips.cc
Abstract We propose Conditional Adapter (CoDA), a parameter-efficient transfer learning
method that also improves inference efficiency. CoDA generalizes beyond standard adapter …