Self-supervised learning with random-projection quantizer for speech recognition

A Mehrish, N Majumder, R Bharadwaj, R Mihalcea… - Information …, 2023 - Elsevier

The field of speech processing has undergone a transformative shift with the advent of deep
learning. The use of multiple processing layers has enabled the creation of models capable …

Cited by 118 · Related articles · All 6 versions

[PDF] dtu.dk

Self-supervised speech representation learning: A review

A Mohamed, H Lee, L Borgholt… - IEEE Journal of …, 2022 - ieeexplore.ieee.org

Although supervised deep learning has revolutionized speech and audio processing, it has
necessitated the building of specialist models for individual tasks and application scenarios …

Cited by 302 · Related articles · All 10 versions

[PDF] arxiv.org

Google USM: Scaling automatic speech recognition beyond 100 languages

Y Zhang, W Han, J Qin, Y Wang, A Bapna… - arXiv preprint arXiv …, 2023 - arxiv.org

We introduce the Universal Speech Model (USM), a single large model that performs
automatic speech recognition (ASR) across 100+ languages. This is achieved by pre …

Cited by 190 · Related articles · All 3 versions

[PDF] arxiv.org

BEATs: Audio pre-training with acoustic tokenizers

S Chen, Y Wu, C Wang, S Liu, D Tompkins… - arXiv preprint arXiv …, 2022 - arxiv.org

The massive growth of self-supervised learning (SSL) has been witnessed in language,
vision, speech, and audio domains over the past few years. While discrete label prediction is …

Cited by 162 · Related articles · All 8 versions

[PDF] arxiv.org

AudioPaLM: A large language model that can speak and listen

PK Rubenstein, C Asawaroengchai… - arXiv preprint arXiv …, 2023 - arxiv.org

We introduce AudioPaLM, a large language model for speech understanding and
generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil …

Cited by 108 · Related articles · All 2 versions

[PDF] arxiv.org

SoundStorm: Efficient parallel audio generation

Z Borsos, M Sharifi, D Vincent, E Kharitonov… - arXiv preprint arXiv …, 2023 - arxiv.org

We present SoundStorm, a model for efficient, non-autoregressive audio generation.
SoundStorm receives as input the semantic tokens of AudioLM, and relies on bidirectional …

Cited by 63 · Related articles · All 4 versions

[PDF] arxiv.org

SeamlessM4T: Massively Multilingual & Multimodal Machine Translation

L Barrault, YA Chung, MC Meglioli, D Dale… - arXiv preprint arXiv …, 2023 - arxiv.org

What does it take to create the Babel Fish, a tool that can help individuals translate speech
between any two languages? While recent breakthroughs in text-based models have …

Cited by 56 · Related articles

[PDF] arxiv.org

Prompting large language models with speech recognition abilities

Y Fathullah, C Wu, E Lakomkin, J Jia… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org

Large language models (LLMs) have proven themselves highly flexible, able to solve a wide
range of generative tasks, such as abstractive summarization and open-ended question …

Cited by 59 · Related articles · All 4 versions

[PDF] arxiv.org

Seamless: Multilingual Expressive and Streaming Speech Translation

L Barrault, YA Chung, MC Meglioli, D Dale… - arXiv preprint arXiv …, 2023 - arxiv.org

Large-scale automatic speech translation systems today lack key features that help machine-
mediated communication feel seamless when compared to human-to-human dialogue. In …

Cited by 53 · Related articles

[PDF] neurips.cc

Conditional adapters: Parameter-efficient transfer learning with fast inference

T Lei, J Bai, S Brahma, J Ainslie… - Advances in …, 2023 - proceedings.neurips.cc

We propose Conditional Adapter (CoDA), a parameter-efficient transfer learning
method that also improves inference efficiency. CoDA generalizes beyond standard adapter …

Cited by 23 · Related articles · All 6 versions