Deep learning techniques for speech emotion recognition, from databases to models

BJ Abbaschian, D Sierra-Sosa, A Elmaghraby - Sensors, 2021 - mdpi.com
The advancements in neural networks and the on-demand need for accurate and near real-
time Speech Emotion Recognition (SER) in human–computer interactions make it …

A systematic literature review of speech emotion recognition approaches

YB Singh, S Goel - Neurocomputing, 2022 - Elsevier
Nowadays, emotion recognition from speech (SER) is a demanding research area for
researchers because of its wide range of real-life applications. There are many challenges for SER …

EMO: Emote Portrait Alive: Generating Expressive Portrait Videos with Audio2Video Diffusion Model Under Weak Conditions

L Tian, Q Wang, B Zhang, L Bo - European Conference on Computer …, 2025 - Springer
In this work, we tackle the challenge of enhancing the realism and expressiveness in talking
head video generation by focusing on the dynamic and nuanced relationship between audio …

EMOCA: Emotion driven monocular face capture and animation

R Daněček, MJ Black, T Bolkart - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
As 3D facial avatars become more widely used for communication, it is critical that they
faithfully convey emotion. Unfortunately, the best recent methods that regress parametric 3D …

Balanced multimodal learning via on-the-fly gradient modulation

X Peng, Y Wei, A Deng, D Wang… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
Audio-visual learning helps to comprehensively understand the world, by integrating
different senses. Accordingly, multiple input modalities are expected to boost model …

Diffused Heads: Diffusion models beat GANs on talking-face generation

M Stypułkowski, K Vougioukas, S He… - Proceedings of the …, 2024 - openaccess.thecvf.com
Talking face generation has historically struggled to produce head movements and natural
facial expressions without guidance from additional reference videos. Recent developments …

EAMM: One-shot emotional talking face via audio-based emotion-aware motion model

X Ji, H Zhou, K Wang, Q Wu, W Wu, F Xu… - ACM SIGGRAPH 2022 …, 2022 - dl.acm.org
Although significant progress has been made to audio-driven talking face generation,
existing methods either neglect facial emotion or cannot be applied to arbitrary subjects. In …

BigSSL: Exploring the frontier of large-scale semi-supervised learning for automatic speech recognition

Y Zhang, DS Park, W Han, J Qin… - IEEE Journal of …, 2022 - ieeexplore.ieee.org
We summarize the results of a host of efforts using giant automatic speech recognition (ASR)
models pre-trained using large, diverse unlabeled datasets containing approximately a …

MEAD: A large-scale audio-visual dataset for emotional talking-face generation

K Wang, Q Wu, L Song, Z Yang, W Wu, C Qian… - … on Computer Vision, 2020 - Springer
The synthesis of natural emotional reactions is an essential criterion in vivid talking-face
video generation. This criterion is nevertheless seldom taken into consideration in previous …

The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American …

SR Livingstone, FA Russo - PLoS ONE, 2018 - journals.plos.org
The RAVDESS is a validated multimodal database of emotional speech and song. The
database is gender balanced consisting of 24 professional actors, vocalizing lexically …