Self-supervised speech representation learning: A review

A Mohamed, H Lee, L Borgholt… - IEEE Journal of …, 2022 - ieeexplore.ieee.org
Although supervised deep learning has revolutionized speech and audio processing, it has
necessitated the building of specialist models for individual tasks and application scenarios …

[BOOK][B] Automatic speech recognition

D Yu, L Deng - 2016 - Springer
Automatic Speech Recognition (ASR), which aims to enable natural human–machine
interaction, has been an intensive research area for decades. Many core technologies, such …

Combining residual networks with LSTMs for lipreading

T Stafylakis, G Tzimiropoulos - arXiv preprint arXiv:1703.04105, 2017 - arxiv.org
We propose an end-to-end deep learning architecture for word-level visual speech
recognition. The system is a combination of spatiotemporal convolutional, residual and …

[BOOK][B] Learning-based methods for comparing sequences, with applications to audio-to-midi alignment and matching

C Raffel - 2016 - search.proquest.com
Sequences of feature vectors are a natural way of representing temporal data. Given a
database of sequences, a fundamental task is to find the database entry which is the most …

Effectiveness of self-supervised pre-training for speech recognition

A Baevski, M Auli, A Mohamed - arXiv preprint arXiv:1911.03912, 2019 - arxiv.org
We compare self-supervised representation learning algorithms which either explicitly
quantize the audio data or learn representations without quantization. We find the former to …

Speech2Vec: A sequence-to-sequence framework for learning word embeddings from speech

YA Chung, J Glass - arXiv preprint arXiv:1803.08976, 2018 - arxiv.org
In this paper, we propose a novel deep neural network architecture, Speech2Vec, for
learning fixed-length vector representations of audio segments excised from a speech …

Audio Word2Vec: Unsupervised learning of audio segment representations using sequence-to-sequence autoencoder

YA Chung, CC Wu, CH Shen, HY Lee… - arXiv preprint arXiv …, 2016 - arxiv.org
The vector representations of fixed dimensionality for words (in text) offered by Word2Vec
have been shown to be very useful in many application scenarios, in particular due to the …

Effectiveness of self-supervised pre-training for ASR

A Baevski, A Mohamed - ICASSP 2020-2020 IEEE International …, 2020 - ieeexplore.ieee.org
We compare self-supervised representation learning algorithms which either explicitly
quantize the audio data or learn representations without quantization. We find the former to …

Measuring depression symptom severity from spoken language and 3D facial expressions

A Haque, M Guo, AS Miner, L Fei-Fei - arXiv preprint arXiv:1811.08592, 2018 - arxiv.org
With more than 300 million people depressed worldwide, depression is a global problem.
Due to access barriers such as social stigma, cost, and treatment availability, 60% of …

Deep multimodal semantic embeddings for speech and images

D Harwath, J Glass - 2015 IEEE Workshop on Automatic …, 2015 - ieeexplore.ieee.org
In this paper, we present a model which takes as input a corpus of images with relevant
spoken captions and finds a correspondence between the two modalities. We employ a pair …