CREMA-D: Crowd-sourced emotional multimodal actors dataset

BJ Abbaschian, D Sierra-Sosa, A Elmaghraby - Sensors, 2021 - mdpi.com

The advancements in neural networks and the on-demand need for accurate and near real-time
Speech Emotion Recognition (SER) in human–computer interactions make it …

Cited by 285 · Related articles · All 9 versions

A systematic literature review of speech emotion recognition approaches

YB Singh, S Goel - Neurocomputing, 2022 - Elsevier

Nowadays emotion recognition from speech (SER) is a demanding research area for
researchers because of its wide real-life applications. There are many challenges for SER …

Cited by 103 · Related articles · All 2 versions

[PDF] arxiv.org

EMO: Emote Portrait Alive Generating Expressive Portrait Videos with Audio2Video Diffusion Model Under Weak Conditions

L Tian, Q Wang, B Zhang, L Bo - European Conference on Computer …, 2025 - Springer

In this work, we tackle the challenge of enhancing the realism and expressiveness in talking
head video generation by focusing on the dynamic and nuanced relationship between audio …

Cited by 87 · Related articles · All 2 versions

[PDF] thecvf.com

EMOCA: Emotion driven monocular face capture and animation

R Daněček, MJ Black, T Bolkart - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com

As 3D facial avatars become more widely used for communication, it is critical that they
faithfully convey emotion. Unfortunately, the best recent methods that regress parametric 3D …

Cited by 184 · Related articles · All 8 versions

[PDF] thecvf.com

Balanced multimodal learning via on-the-fly gradient modulation

X Peng, Y Wei, A Deng, D Wang… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com

Audio-visual learning helps to comprehensively understand the world, by integrating
different senses. Accordingly, multiple input modalities are expected to boost model …

Cited by 195 · Related articles · All 5 versions

[PDF] thecvf.com

Diffused heads: Diffusion models beat GANs on talking-face generation

M Stypułkowski, K Vougioukas, S He… - Proceedings of the …, 2024 - openaccess.thecvf.com

Talking face generation has historically struggled to produce head movements and natural
facial expressions without guidance from additional reference videos. Recent developments …

Cited by 124 · Related articles · All 6 versions

[PDF] acm.org

EAMM: One-shot emotional talking face via audio-based emotion-aware motion model

X Ji, H Zhou, K Wang, Q Wu, W Wu, F Xu… - ACM SIGGRAPH 2022 …, 2022 - dl.acm.org

Although significant progress has been made to audio-driven talking face generation,
existing methods either neglect facial emotion or cannot be applied to arbitrary subjects. In …

Cited by 153 · Related articles · All 4 versions

[PDF] arxiv.org

BigSSL: Exploring the frontier of large-scale semi-supervised learning for automatic speech recognition

Y Zhang, DS Park, W Han, J Qin… - IEEE Journal of …, 2022 - ieeexplore.ieee.org

We summarize the results of a host of efforts using giant automatic speech recognition (ASR)
models pre-trained using large, diverse unlabeled datasets containing approximately a …

Cited by 197 · Related articles · All 4 versions

MEAD: A large-scale audio-visual dataset for emotional talking-face generation

K Wang, Q Wu, L Song, Z Yang, W Wu, C Qian… - … on Computer Vision, 2020 - Springer

The synthesis of natural emotional reactions is an essential criterion in vivid talking-face
video generation. This criterion is nevertheless seldom taken into consideration in previous …

Cited by 313 · Related articles · All 2 versions

[PDF] plos.org

The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American …

SR Livingstone, FA Russo - PloS one, 2018 - journals.plos.org

The RAVDESS is a validated multimodal database of emotional speech and song. The
database is gender balanced consisting of 24 professional actors, vocalizing lexically …

Cited by 2107 · Related articles · All 20 versions