Self-supervised speech representation learning: A review
Although supervised deep learning has revolutionized speech and audio processing, it has
necessitated the building of specialist models for individual tasks and application scenarios …
necessitated the building of specialist models for individual tasks and application scenarios …
[图书][B] Automatic speech recognition
Automatic Speech Recognition (ASR), which is aimed to enable natural human–machine
interaction, has been an intensive research area for decades. Many core technologies, such …
interaction, has been an intensive research area for decades. Many core technologies, such …
Combining residual networks with LSTMs for lipreading
T Stafylakis, G Tzimiropoulos - arXiv preprint arXiv:1703.04105, 2017 - arxiv.org
We propose an end-to-end deep learning architecture for word-level visual speech
recognition. The system is a combination of spatiotemporal convolutional, residual and …
recognition. The system is a combination of spatiotemporal convolutional, residual and …
[图书][B] Learning-based methods for comparing sequences, with applications to audio-to-midi alignment and matching
C Raffel - 2016 - search.proquest.com
Sequences of feature vectors are a natural way of representing temporal data. Given a
database of sequences, a fundamental task is to find the database entry which is the most …
database of sequences, a fundamental task is to find the database entry which is the most …
Effectiveness of self-supervised pre-training for speech recognition
We compare self-supervised representation learning algorithms which either explicitly
quantize the audio data or learn representations without quantization. We find the former to …
quantize the audio data or learn representations without quantization. We find the former to …
Speech2vec: A sequence-to-sequence framework for learning word embeddings from speech
In this paper, we propose a novel deep neural network architecture, Speech2Vec, for
learning fixed-length vector representations of audio segments excised from a speech …
learning fixed-length vector representations of audio segments excised from a speech …
Audio word2vec: Unsupervised learning of audio segment representations using sequence-to-sequence autoencoder
The vector representations of fixed dimensionality for words (in text) offered by Word2Vec
have been shown to be very useful in many application scenarios, in particular due to the …
have been shown to be very useful in many application scenarios, in particular due to the …
Effectiveness of self-supervised pre-training for asr
We compare self-supervised representation learning algorithms which either explicitly
quantize the audio data or learn representations without quantization. We find the former to …
quantize the audio data or learn representations without quantization. We find the former to …
Measuring depression symptom severity from spoken language and 3D facial expressions
With more than 300 million people depressed worldwide, depression is a global problem.
Due to access barriers such as social stigma, cost, and treatment availability, 60% of …
Due to access barriers such as social stigma, cost, and treatment availability, 60% of …
Deep multimodal semantic embeddings for speech and images
In this paper, we present a model which takes as input a corpus of images with relevant
spoken captions and finds a correspondence between the two modalities. We employ a pair …
spoken captions and finds a correspondence between the two modalities. We employ a pair …