Odyssey 2024-Speech Emotion Recognition Challenge: Dataset, Baseline Framework, and Results

L Goncalves, AN Salman, AR Naini, LM Velazquez… - …, 2024 - ecs.utdallas.edu
Abstract The Odyssey 2024 Speech Emotion Recognition (SER) Challenge aims to enhance
innovation in recognizing emotions from spontaneous speech, moving beyond traditional …

MER 2024: Semi-Supervised Learning, Noise Robustness, and Open-Vocabulary Multimodal Emotion Recognition

Z Lian, H Sun, L Sun, Z Wen, S Zhang, S Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal emotion recognition is an important research topic in artificial intelligence. Over
the past few decades, researchers have made remarkable progress by increasing dataset …

FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs

T SpeechTeam - arXiv preprint arXiv:2407.04051, 2024 - arxiv.org
This report introduces FunAudioLLM, a model family designed to enhance natural voice
interactions between humans and large language models (LLMs). At its core are two …

ControlSpeech: Towards Simultaneous Zero-shot Speaker Cloning and Zero-shot Language Style Control With Decoupled Codec

S Ji, J Zuo, M Fang, S Zheng, Q Chen, W Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we present ControlSpeech, a text-to-speech (TTS) system capable of fully
cloning the speaker's voice and enabling arbitrary control and adjustment of speaking style …

Laugh Now Cry Later: Controlling Time-Varying Emotional States of Flow-Matching-Based Zero-Shot Text-to-Speech

H Wu, X Wang, SE Eskimez, M Thakker… - arXiv preprint arXiv …, 2024 - arxiv.org
People change their tones of voice, often accompanied by nonverbal vocalizations (NVs)
such as laughter and cries, to convey rich emotions. However, most text-to-speech (TTS) …

DEEPTalk: Dynamic Emotion Embedding for Probabilistic Speech-Driven 3D Face Animation

J Kim, J Cho, J Park, S Hwang, DE Kim, G Kim… - arXiv preprint arXiv …, 2024 - arxiv.org
Speech-driven 3D facial animation has garnered lots of attention thanks to its broad range of
applications. Despite recent advancements in achieving realistic lip motion, current methods …

Speech-Copilot: Leveraging Large Language Models for Speech Processing via Task Decomposition, Modularization, and Program Generation

CY Kuan, CK Yang, WP Huang, KH Lu… - arXiv preprint arXiv …, 2024 - arxiv.org
In this work, we introduce Speech-Copilot, a modular framework for instruction-oriented
speech-processing tasks that minimizes human effort in toolset construction. Unlike end-to …

EmoBox: Multilingual Multi-corpus Speech Emotion Recognition Toolkit and Benchmark

Z Ma, M Chen, H Zhang, Z Zheng, W Chen, X Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Speech emotion recognition (SER) is an important part of human-computer interaction,
receiving extensive attention from both industry and academia. However, the current …

Towards Probing Speech-Specific Risks in Large Multimodal Models: A Taxonomy, Benchmark, and Insights

H Yang, L Qu, E Shareghi, G Haffari - arXiv preprint arXiv:2406.17430, 2024 - arxiv.org
Large Multimodal Models (LMMs) have achieved great success recently, demonstrating a
strong capability to understand multimodal information and to interact with human users …

Adapting General Disentanglement-Based Speaker Anonymization for Enhanced Emotion Preservation

X Miao, Y Zhang, X Wang, N Tomashenko… - arXiv preprint arXiv …, 2024 - arxiv.org
A general disentanglement-based speaker anonymization system typically separates
speech into content, speaker, and prosody features using individual encoders. This paper …