Odyssey 2024-Speech Emotion Recognition Challenge: Dataset, Baseline Framework, and Results
The Odyssey 2024 Speech Emotion Recognition (SER) Challenge aims to enhance
innovation in recognizing emotions from spontaneous speech, moving beyond traditional …
MER 2024: Semi-Supervised Learning, Noise Robustness, and Open-Vocabulary Multimodal Emotion Recognition
Multimodal emotion recognition is an important research topic in artificial intelligence. Over
the past few decades, researchers have made remarkable progress by increasing dataset …
FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs
T SpeechTeam - arXiv preprint arXiv:2407.04051, 2024 - arxiv.org
This report introduces FunAudioLLM, a model family designed to enhance natural voice
interactions between humans and large language models (LLMs). At its core are two …
ControlSpeech: Towards Simultaneous Zero-shot Speaker Cloning and Zero-shot Language Style Control With Decoupled Codec
In this paper, we present ControlSpeech, a text-to-speech (TTS) system capable of fully
cloning the speaker's voice and enabling arbitrary control and adjustment of speaking style …
Laugh Now Cry Later: Controlling Time-Varying Emotional States of Flow-Matching-Based Zero-Shot Text-to-Speech
People change their tones of voice, often accompanied by nonverbal vocalizations (NVs)
such as laughter and cries, to convey rich emotions. However, most text-to-speech (TTS) …
DEEPTalk: Dynamic Emotion Embedding for Probabilistic Speech-Driven 3D Face Animation
J Kim, J Cho, J Park, S Hwang, DE Kim, G Kim… - arXiv preprint arXiv …, 2024 - arxiv.org
Speech-driven 3D facial animation has garnered lots of attention thanks to its broad range of
applications. Despite recent advancements in achieving realistic lip motion, current methods …
Speech-Copilot: Leveraging Large Language Models for Speech Processing via Task Decomposition, Modularization, and Program Generation
In this work, we introduce Speech-Copilot, a modular framework for instruction-oriented
speech-processing tasks that minimizes human effort in toolset construction. Unlike end-to …
EmoBox: Multilingual Multi-corpus Speech Emotion Recognition Toolkit and Benchmark
Speech emotion recognition (SER) is an important part of human-computer interaction,
receiving extensive attention from both industry and academia. However, the current …
Towards Probing Speech-Specific Risks in Large Multimodal Models: A Taxonomy, Benchmark, and Insights
Large Multimodal Models (LMMs) have achieved great success recently, demonstrating a
strong capability to understand multimodal information and to interact with human users …
Adapting General Disentanglement-Based Speaker Anonymization for Enhanced Emotion Preservation
A general disentanglement-based speaker anonymization system typically separates
speech into content, speaker, and prosody features using individual encoders. This paper …