Phone-to-audio alignment without text: A semi-supervised approach

Y Zhang, S Yang, S Shan… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com

We propose a novel strategy ES3 for self-supervised learning of robust audio-visual speech
representations from unlabeled talking face videos. While many recent approaches for this …

[PDF] arxiv.org

PAAPLoss: A phonetic-aligned acoustic parameter loss for speech enhancement

M Yang, J Konan, D Bick, Y Zeng, S Han… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org

Despite rapid advancement in recent years, current speech enhancement models often
produce speech that differs in perceptual quality from real clean speech. We propose a …

Cited by 11 · Related articles · All 8 versions

[PDF] academia.edu

Evaluating speech–phoneme alignment and its impact on neural text-to-speech synthesis

F Zalkow, P Govalkar, M Müller… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org

In recent years, the quality of text-to-speech (TTS) synthesis vastly improved due to
deep-learning techniques, with parallel architectures, in particular, providing excellent synthesis …

Cited by 9 · Related articles · All 3 versions

[PDF] arxiv.org

Emova: Empowering language models to see, hear and speak with vivid emotions

K Chen, Y Gou, R Huang, Z Liu, D Tan, J Xu… - arXiv preprint arXiv …, 2024 - arxiv.org

GPT-4o, an omni-modal model that enables vocal conversations with diverse emotions and
tones, marks a milestone for omni-modal foundation models. However, empowering Large …

Cited by 1 · Related articles · All 2 versions

[PDF] arxiv.org

High-fidelity neural phonetic posteriorgrams

C Churchwell, M Morrison, B Pardo - arXiv preprint arXiv:2402.17735, 2024 - arxiv.org

A phonetic posteriorgram (PPG) is a time-varying categorical distribution over acoustic units
of speech (e.g., phonemes). PPGs are a popular representation in speech generation due to …

Cited by 5 · Related articles · All 3 versions

[HTML] nih.gov

Wav2DDK: Analytical and Clinical Validation of an Automated Diadochokinetic Rate Estimation Algorithm on Remotely Collected Speech

P Kadambi, GM Stegmann, J Liss, V Berisha… - Journal of Speech …, 2023 - ASHA

Purpose: Oral diadochokinesis is a useful task in assessment of speech motor function in the
context of neurological disease. Remote collection of speech tasks provides a convenient …

Cited by 5 · Related articles · All 8 versions

[PDF] acm.org

Towards Music-Aware Virtual Assistants

A Wang, D Lindlbauer, C Donahue - Proceedings of the 37th Annual …, 2024 - dl.acm.org

We propose a system for modifying spoken notifications in a manner that is sensitive to the
music a user is listening to. Spoken notifications provide convenient access to rich …

[PDF] acm.org

Watch Your Mouth: Silent Speech Recognition with Depth Sensing

X Wang, Z Su, J Rekimoto, Y Zhang - … of the CHI Conference on Human …, 2024 - dl.acm.org

Silent speech recognition is a promising technology that decodes human speech without
requiring audio signals, enabling private human-computer interactions. In this paper, we …

[PDF] eleanorchodroff.com

[PDF] Comparing language-specific and cross-language acoustic models for low-resource phonetic forced alignment

E Chodroff, E Ahn, H Dolatian - Language Documentation & …, 2024 - eleanorchodroff.com

Phonetic forced alignment can greatly expedite spoken language analysis by providing
automatic time alignments at the word- and phone-levels. In the case of low-resource languages, it …

Cited by 2 · Related articles · All 3 versions

[PDF] aclanthology.org

The taste of IPA: Towards open-vocabulary keyword spotting and forced alignment in any language

J Zhu, C Yang, F Samir, J Islam - … of the 2024 Conference of the …, 2024 - aclanthology.org

In this project, we demonstrate that phoneme-based models for speech processing can
achieve strong crosslinguistic generalizability to unseen languages. We curated the …

Cited by 3 · Related articles