Unsupervised word segmentation and lexicon discovery using acoustic word embeddings

D Harwath, A Recasens, D Surís… - Proceedings of the …, 2018 - openaccess.thecvf.com

In this paper, we explore neural network models that learn to associate segments of spoken
audio captions with the semantically relevant portions of natural images that they refer to …

Cited by 245 · Related articles · All 17 versions

Advanced data exploitation in speech analysis: An overview

Z Zhang, N Cummins, B Schuller - IEEE Signal Processing …, 2017 - ieeexplore.ieee.org

With recent advances in machine-learning techniques for automatic speech analysis (ASA),
the computerized extraction of information from speech signals, there is a greater need for …

Cited by 67 · Related articles · All 3 versions

Learning hierarchical discrete linguistic units from visually-grounded speech

D Harwath, WN Hsu, J Glass - arXiv preprint arXiv:1911.09602, 2019 - arxiv.org

In this paper, we present a method for learning discrete linguistic units by incorporating
vector quantization layers into neural models of visually grounded speech. We show that our …

Cited by 102 · Related articles · All 8 versions

Understanding automatic speech recognition

D O'Shaughnessy - Computer Speech & Language, 2023 - Elsevier

This paper discusses how automatic speech recognition systems are and could be
designed, in order to best exploit the discriminative information encoded in human speech …

Cited by 5 · Related articles

Unsupervised cross-modal alignment of speech and text embedding spaces

YA Chung, WH Weng, S Tong… - Advances in neural …, 2018 - proceedings.neurips.cc

Recent research has shown that word embedding spaces learned from text corpora of
different languages can be aligned without any parallel data supervision. Inspired by the …

Cited by 115 · Related articles · All 15 versions

A segmental framework for fully-unsupervised large-vocabulary speech recognition

H Kamper, A Jansen, S Goldwater - Computer Speech & Language, 2017 - Elsevier

Zero-resource speech technology is a growing research area that aims to develop methods
for speech processing in the absence of transcriptions, lexicons, or language modelling text …

Cited by 133 · Related articles · All 9 versions

Large-scale representation learning from visually grounded untranscribed speech

G Ilharco, Y Zhang, J Baldridge - arXiv preprint arXiv:1909.08782, 2019 - arxiv.org

Systems that can associate images with their spoken audio captions are an important step
towards visually grounded language learning. We describe a scalable method to …

Cited by 74 · Related articles · All 3 versions

An embedded segmental k-means model for unsupervised segmentation and clustering of speech

H Kamper, K Livescu… - 2017 IEEE automatic …, 2017 - ieeexplore.ieee.org

Unsupervised segmentation and clustering of unlabelled speech are core problems in
zero-resource speech processing. Most approaches lie at methodological extremes: some use …

Cited by 115 · Related articles · All 8 versions

Word segmentation on discovered phone units with dynamic programming and self-supervised scoring

H Kamper - IEEE/ACM Transactions on Audio, Speech, and …, 2022 - ieeexplore.ieee.org

Recent work on unsupervised speech segmentation has used self-supervised models with
phone and word segmentation modules that are trained jointly. This paper instead revisits …

Cited by 31 · Related articles · All 4 versions

Query-by-example search with discriminative neural acoustic word embeddings

S Settle, K Levin, H Kamper, K Livescu - arXiv preprint arXiv:1706.03818, 2017 - arxiv.org

Query-by-example search often uses dynamic time warping (DTW) for comparing queries
and proposed matching segments. Recent work has shown that comparing speech …

Cited by 99 · Related articles · All 8 versions