ES3: Evolving Self-Supervised Learning of Robust Audio-Visual Speech Representations

Y Zhang, S Yang, S Shan… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
We propose a novel strategy ES3 for self-supervised learning of robust audio-visual speech
representations from unlabeled talking face videos. While many recent approaches for this …

PAAPLoss: A phonetic-aligned acoustic parameter loss for speech enhancement

M Yang, J Konan, D Bick, Y Zeng, S Han… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
Despite rapid advancement in recent years, current speech enhancement models often
produce speech that differs in perceptual quality from real clean speech. We propose a …

Evaluating speech–phoneme alignment and its impact on neural text-to-speech synthesis

F Zalkow, P Govalkar, M Müller… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
In recent years, the quality of text-to-speech (TTS) synthesis vastly improved due to deep-
learning techniques, with parallel architectures, in particular, providing excellent synthesis …

Emova: Empowering language models to see, hear and speak with vivid emotions

K Chen, Y Gou, R Huang, Z Liu, D Tan, J Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
GPT-4o, an omni-modal model that enables vocal conversations with diverse emotions and
tones, marks a milestone for omni-modal foundation models. However, empowering Large …

High-fidelity neural phonetic posteriorgrams

C Churchwell, M Morrison, B Pardo - arXiv preprint arXiv:2402.17735, 2024 - arxiv.org
A phonetic posteriorgram (PPG) is a time-varying categorical distribution over acoustic units
of speech (e.g., phonemes). PPGs are a popular representation in speech generation due to …

Wav2DDK: Analytical and Clinical Validation of an Automated Diadochokinetic Rate Estimation Algorithm on Remotely Collected Speech

P Kadambi, GM Stegmann, J Liss, V Berisha… - Journal of Speech …, 2023 - ASHA
Purpose: Oral diadochokinesis is a useful task in assessment of speech motor function in the
context of neurological disease. Remote collection of speech tasks provides a convenient …

Towards Music-Aware Virtual Assistants

A Wang, D Lindlbauer, C Donahue - Proceedings of the 37th Annual …, 2024 - dl.acm.org
We propose a system for modifying spoken notifications in a manner that is sensitive to the
music a user is listening to. Spoken notifications provide convenient access to rich …

Watch Your Mouth: Silent Speech Recognition with Depth Sensing

X Wang, Z Su, J Rekimoto, Y Zhang - … of the CHI Conference on Human …, 2024 - dl.acm.org
Silent speech recognition is a promising technology that decodes human speech without
requiring audio signals, enabling private human-computer interactions. In this paper, we …

Comparing language-specific and cross-language acoustic models for low-resource phonetic forced alignment

E Chodroff, E Ahn, H Dolatian - Language Documentation & …, 2024 - eleanorchodroff.com
Phonetic forced alignment can greatly expedite spoken language analysis by providing
automatic time alignments at the word- and phone-levels. In the case of low-resource languages, it …

The taste of IPA: Towards open-vocabulary keyword spotting and forced alignment in any language

J Zhu, C Yang, F Samir, J Islam - … of the 2024 Conference of the …, 2024 - aclanthology.org
In this project, we demonstrate that phoneme-based models for speech processing can
achieve strong crosslinguistic generalizability to unseen languages. We curated the …