Self-supervised speech representation learning: A review

A Mohamed, H Lee, L Borgholt… - IEEE Journal of …, 2022 - ieeexplore.ieee.org
Although supervised deep learning has revolutionized speech and audio processing, it has
necessitated the building of specialist models for individual tasks and application scenarios …

Unsupervised automatic speech recognition: A review

H Aldarmaki, A Ullah, S Ram, N Zaki - Speech Communication, 2022 - Elsevier
Automatic Speech Recognition (ASR) systems can be trained to achieve
remarkable performance given large amounts of manually transcribed speech, but large …

Unsupervised speech recognition

A Baevski, WN Hsu, A Conneau… - Advances in Neural …, 2021 - proceedings.neurips.cc
Despite rapid progress in the recent past, current speech recognition systems still require
labeled training data which limits this technology to a small fraction of the languages spoken …

wav2vec: Unsupervised pre-training for speech recognition

S Schneider, A Baevski, R Collobert, M Auli - arXiv preprint arXiv …, 2019 - arxiv.org
We explore unsupervised pre-training for speech recognition by learning representations of
raw audio. wav2vec is trained on large amounts of unlabeled audio data and the resulting …
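The contrastive pre-training idea behind wav2vec can be illustrated with a toy objective: score each context vector against the true future latent frame and a few random distractor frames. The following is a minimal numpy sketch under assumed shapes; the function name, hyperparameters, and loss form are illustrative simplifications, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def contrastive_loss(context, latents, k=1, n_negatives=5):
    """Toy wav2vec-style objective: for each step t, distinguish the true
    future latent z_{t+k} from random distractors via dot-product scores,
    using a binary cross-entropy on each score."""
    T, _ = latents.shape
    losses = []
    for t in range(T - k):
        c = context[t]                                  # context vector at step t
        pos = latents[t + k]                            # true future latent
        negs = latents[rng.integers(0, T, size=n_negatives)]  # distractors
        pos_term = -np.log(sigmoid(c @ pos))            # pull true pair together
        neg_term = -np.log(sigmoid(-(negs @ c))).sum()  # push distractors apart
        losses.append(pos_term + neg_term)
    return float(np.mean(losses))
```

In the real model the context and latent sequences come from stacked convolutional encoders over raw audio; here they would simply be two arrays of shape (T, d).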

Unsupervised speech representation learning using wavenet autoencoders

J Chorowski, RJ Weiss, S Bengio… - … /ACM transactions on …, 2019 - ieeexplore.ieee.org
We consider the task of unsupervised extraction of meaningful latent representations of
speech by applying autoencoding neural networks to speech waveforms. The goal is to …

Jointly discovering visual objects and spoken words from raw sensory input

D Harwath, A Recasens, D Surís… - Proceedings of the …, 2018 - openaccess.thecvf.com
In this paper, we explore neural network models that learn to associate segments of spoken
audio captions with the semantically relevant portions of natural images that they refer to …

Speech2vec: A sequence-to-sequence framework for learning word embeddings from speech

YA Chung, J Glass - arXiv preprint arXiv:1803.08976, 2018 - arxiv.org
In this paper, we propose a novel deep neural network architecture, Speech2Vec, for
learning fixed-length vector representations of audio segments excised from a speech …
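Speech2Vec borrows the skip-gram idea from word embeddings: a segment of speech is trained to predict its neighboring segments. The pair-construction step can be sketched as below; `skipgram_pairs` is a hypothetical helper for illustration, not the paper's code, which uses a sequence-to-sequence encoder-decoder over the segments.

```python
def skipgram_pairs(segments, window=2):
    """Build (center, context) training pairs from a sequence of audio
    segments, skip-gram style: each segment is paired with every neighbor
    within the window."""
    pairs = []
    for i in range(len(segments)):
        for j in range(max(0, i - window), min(len(segments), i + window + 1)):
            if i != j:
                pairs.append((segments[i], segments[j]))
    return pairs
```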

Pre-training on high-resource speech recognition improves low-resource speech-to-text translation

S Bansal, H Kamper, K Livescu, A Lopez… - arXiv preprint arXiv …, 2018 - arxiv.org
We present a simple approach to improve direct speech-to-text translation (ST) when the
source language is low-resource: we pre-train the model on a high-resource automatic …

Unsupervised pre-training of bidirectional speech encoders via masked reconstruction

W Wang, Q Tang, K Livescu - ICASSP 2020-2020 IEEE …, 2020 - ieeexplore.ieee.org
We propose an approach for pre-training speech representations via a masked
reconstruction loss. Our pre-trained encoder networks are bidirectional and can therefore be …
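The masked reconstruction loss named in this abstract can be sketched as follows: hide a random subset of frames and score the model's output only on the hidden positions. This is a minimal numpy sketch with an assumed mask probability and a plain mean-squared error; it is not the authors' implementation, which trains bidirectional encoder networks.

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_reconstruction_loss(features, reconstruction, mask_prob=0.15):
    """Toy masked-reconstruction objective: choose random frames to mask,
    then compute mean squared error only over the masked frames."""
    T, _ = features.shape
    mask = rng.random(T) < mask_prob      # True = frame is masked
    if not mask.any():                    # guarantee at least one masked frame
        mask[rng.integers(T)] = True
    diff = reconstruction[mask] - features[mask]
    return float(np.mean(diff ** 2)), mask
```

During pre-training the masked frames would be zeroed or replaced before being fed to the encoder, so the reconstruction must be inferred from the surrounding context.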

Learning hierarchical discrete linguistic units from visually-grounded speech

D Harwath, WN Hsu, J Glass - arXiv preprint arXiv:1911.09602, 2019 - arxiv.org
In this paper, we present a method for learning discrete linguistic units by incorporating
vector quantization layers into neural models of visually grounded speech. We show that our …