Odyssey 2024-Speech Emotion Recognition Challenge: Dataset, Baseline Framework, and Results

L Goncalves, AN Salman, AR Naini, LM Velazquez… - …, 2024 - ecs.utdallas.edu
Abstract The Odyssey 2024 Speech Emotion Recognition (SER) Challenge aims to enhance
innovation in recognizing emotions from spontaneous speech, moving beyond traditional …

MER 2024: Semi-Supervised Learning, Noise Robustness, and Open-Vocabulary Multimodal Emotion Recognition

Z Lian, H Sun, L Sun, Z Wen, S Zhang, S Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal emotion recognition is an important research topic in artificial intelligence. Over
the past few decades, researchers have made remarkable progress by increasing dataset …

FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs

T SpeechTeam - arXiv preprint arXiv:2407.04051, 2024 - arxiv.org
This report introduces FunAudioLLM, a model family designed to enhance natural voice
interactions between humans and large language models (LLMs). At its core are two …

ControlSpeech: Towards Simultaneous Zero-shot Speaker Cloning and Zero-shot Language Style Control With Decoupled Codec

S Ji, J Zuo, M Fang, S Zheng, Q Chen, W Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we present ControlSpeech, a text-to-speech (TTS) system capable of fully
cloning the speaker's voice and enabling arbitrary control and adjustment of speaking style …

Laugh Now Cry Later: Controlling Time-Varying Emotional States of Flow-Matching-Based Zero-Shot Text-to-Speech

H Wu, X Wang, SE Eskimez, M Thakker… - arXiv preprint arXiv …, 2024 - arxiv.org
People change their tones of voice, often accompanied by nonverbal vocalizations (NVs)
such as laughter and cries, to convey rich emotions. However, most text-to-speech (TTS) …

DEEPTalk: Dynamic Emotion Embedding for Probabilistic Speech-Driven 3D Face Animation

J Kim, J Cho, J Park, S Hwang, DE Kim, G Kim… - arXiv preprint arXiv …, 2024 - arxiv.org
Speech-driven 3D facial animation has garnered lots of attention thanks to its broad range of
applications. Despite recent advancements in achieving realistic lip motion, current methods …

Speech-Copilot: Leveraging Large Language Models for Speech Processing via Task Decomposition, Modularization, and Program Generation

CY Kuan, CK Yang, WP Huang, KH Lu… - arXiv preprint arXiv …, 2024 - arxiv.org
In this work, we introduce Speech-Copilot, a modular framework for instruction-oriented
speech-processing tasks that minimizes human effort in toolset construction. Unlike end-to …

EmoBox: Multilingual Multi-corpus Speech Emotion Recognition Toolkit and Benchmark

Z Ma, M Chen, H Zhang, Z Zheng, W Chen, X Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Speech emotion recognition (SER) is an important part of human-computer interaction,
receiving extensive attention from both industry and academia. However, the current …

Towards Probing Speech-Specific Risks in Large Multimodal Models: A Taxonomy, Benchmark, and Insights

H Yang, L Qu, E Shareghi, G Haffari - arXiv preprint arXiv:2406.17430, 2024 - arxiv.org
Large Multimodal Models (LMMs) have achieved great success recently, demonstrating a
strong capability to understand multimodal information and to interact with human users …

Adapting General Disentanglement-Based Speaker Anonymization for Enhanced Emotion Preservation

X Miao, Y Zhang, X Wang, N Tomashenko… - arXiv preprint arXiv …, 2024 - arxiv.org
A general disentanglement-based speaker anonymization system typically separates
speech into content, speaker, and prosody features using individual encoders. This paper …