Jointly discovering visual objects and spoken words from raw sensory input

D Harwath, A Recasens, D Surís… - Proceedings of the …, 2018 - openaccess.thecvf.com
In this paper, we explore neural network models that learn to associate segments of spoken
audio captions with the semantically relevant portions of natural images that they refer to …

Advanced data exploitation in speech analysis: An overview

Z Zhang, N Cummins, B Schuller - IEEE Signal Processing …, 2017 - ieeexplore.ieee.org
With recent advances in machine-learning techniques for automatic speech analysis (ASA),
the computerized extraction of information from speech signals, there is a greater need for …

Learning hierarchical discrete linguistic units from visually-grounded speech

D Harwath, WN Hsu, J Glass - arXiv preprint arXiv:1911.09602, 2019 - arxiv.org
In this paper, we present a method for learning discrete linguistic units by incorporating
vector quantization layers into neural models of visually grounded speech. We show that our …

Understanding automatic speech recognition

D O'Shaughnessy - Computer Speech & Language, 2023 - Elsevier
This paper discusses how automatic speech recognition systems are and could be
designed, in order to best exploit the discriminative information encoded in human speech …

Unsupervised cross-modal alignment of speech and text embedding spaces

YA Chung, WH Weng, S Tong… - Advances in neural …, 2018 - proceedings.neurips.cc
Recent research has shown that word embedding spaces learned from text corpora of
different languages can be aligned without any parallel data supervision. Inspired by the …

A segmental framework for fully-unsupervised large-vocabulary speech recognition

H Kamper, A Jansen, S Goldwater - Computer Speech & Language, 2017 - Elsevier
Zero-resource speech technology is a growing research area that aims to develop methods
for speech processing in the absence of transcriptions, lexicons, or language modelling text …

Large-scale representation learning from visually grounded untranscribed speech

G Ilharco, Y Zhang, J Baldridge - arXiv preprint arXiv:1909.08782, 2019 - arxiv.org
Systems that can associate images with their spoken audio captions are an important step
towards visually grounded language learning. We describe a scalable method to …

An embedded segmental k-means model for unsupervised segmentation and clustering of speech

H Kamper, K Livescu… - 2017 IEEE automatic …, 2017 - ieeexplore.ieee.org
Unsupervised segmentation and clustering of unlabelled speech are core problems in
zero-resource speech processing. Most approaches lie at methodological extremes: some use …

Word segmentation on discovered phone units with dynamic programming and self-supervised scoring

H Kamper - IEEE/ACM Transactions on Audio, Speech, and …, 2022 - ieeexplore.ieee.org
Recent work on unsupervised speech segmentation has used self-supervised models with
phone and word segmentation modules that are trained jointly. This paper instead revisits …

Query-by-example search with discriminative neural acoustic word embeddings

S Settle, K Levin, H Kamper, K Livescu - arXiv preprint arXiv:1706.03818, 2017 - arxiv.org
Query-by-example search often uses dynamic time warping (DTW) for comparing queries
and proposed matching segments. Recent work has shown that comparing speech …