ES3: Evolving Self-Supervised Learning of Robust Audio-Visual Speech Representations
We propose a novel strategy ES3 for self-supervised learning of robust audio-visual speech
representations from unlabeled talking face videos. While many recent approaches for this …
PAAPLoss: A phonetic-aligned acoustic parameter loss for speech enhancement
Despite rapid advancement in recent years, current speech enhancement models often
produce speech that differs in perceptual quality from real clean speech. We propose a …
Evaluating speech–phoneme alignment and its impact on neural text-to-speech synthesis
In recent years, the quality of text-to-speech (TTS) synthesis vastly improved due to deep-
learning techniques, with parallel architectures, in particular, providing excellent synthesis …
Emova: Empowering language models to see, hear and speak with vivid emotions
GPT-4o, an omni-modal model that enables vocal conversations with diverse emotions and
tones, marks a milestone for omni-modal foundation models. However, empowering Large …
High-fidelity neural phonetic posteriorgrams
A phonetic posteriorgram (PPG) is a time-varying categorical distribution over acoustic units
of speech (e.g., phonemes). PPGs are a popular representation in speech generation due to …
Wav2DDK: Analytical and Clinical Validation of an Automated Diadochokinetic Rate Estimation Algorithm on Remotely Collected Speech
P Kadambi, GM Stegmann, J Liss, V Berisha… - Journal of Speech …, 2023 - ASHA
Purpose: Oral diadochokinesis is a useful task in assessment of speech motor function in the
context of neurological disease. Remote collection of speech tasks provides a convenient …
Towards Music-Aware Virtual Assistants
We propose a system for modifying spoken notifications in a manner that is sensitive to the
music a user is listening to. Spoken notifications provide convenient access to rich …
Watch Your Mouth: Silent Speech Recognition with Depth Sensing
Silent speech recognition is a promising technology that decodes human speech without
requiring audio signals, enabling private human-computer interactions. In this paper, we …
Comparing language-specific and cross-language acoustic models for low-resource phonetic forced alignment
Phonetic forced alignment can greatly expedite spoken language analysis by providing
automatic time alignments at the word- and phone-levels. In the case of low-resource languages, it …
The taste of IPA: Towards open-vocabulary keyword spotting and forced alignment in any language
In this project, we demonstrate that phoneme-based models for speech processing can
achieve strong crosslinguistic generalizability to unseen languages. We curated the …