Jointly discovering visual objects and spoken words from raw sensory input
In this paper, we explore neural network models that learn to associate segments of spoken
audio captions with the semantically relevant portions of natural images that they refer to …
Advanced data exploitation in speech analysis: An overview
With recent advances in machine-learning techniques for automatic speech analysis (ASA),
the computerized extraction of information from speech signals, there is a greater need for …
Learning hierarchical discrete linguistic units from visually-grounded speech
In this paper, we present a method for learning discrete linguistic units by incorporating
vector quantization layers into neural models of visually grounded speech. We show that our …
Understanding automatic speech recognition
D O'Shaughnessy - Computer Speech & Language, 2023 - Elsevier
This paper discusses how automatic speech recognition systems are and could be
designed, in order to best exploit the discriminative information encoded in human speech …
Unsupervised cross-modal alignment of speech and text embedding spaces
Recent research has shown that word embedding spaces learned from text corpora of
different languages can be aligned without any parallel data supervision. Inspired by the …
A segmental framework for fully-unsupervised large-vocabulary speech recognition
Zero-resource speech technology is a growing research area that aims to develop methods
for speech processing in the absence of transcriptions, lexicons, or language modelling text …
Large-scale representation learning from visually grounded untranscribed speech
Systems that can associate images with their spoken audio captions are an important step
towards visually grounded language learning. We describe a scalable method to …
An embedded segmental k-means model for unsupervised segmentation and clustering of speech
Unsupervised segmentation and clustering of unlabelled speech are core problems in
zero-resource speech processing. Most approaches lie at methodological extremes: some use …
Word segmentation on discovered phone units with dynamic programming and self-supervised scoring
H Kamper - IEEE/ACM Transactions on Audio, Speech, and …, 2022 - ieeexplore.ieee.org
Recent work on unsupervised speech segmentation has used self-supervised models with
phone and word segmentation modules that are trained jointly. This paper instead revisits …
Query-by-example search with discriminative neural acoustic word embeddings
Query-by-example search often uses dynamic time warping (DTW) for comparing queries
and proposed matching segments. Recent work has shown that comparing speech …