HiCMAE: Hierarchical Contrastive Masked Autoencoder for Self-Supervised Audio-Visual Emotion Recognition
Abstract: Audio-Visual Emotion Recognition (AVER) has garnered increasing attention in
recent years for its critical role in creating emotion-aware intelligent machines. Previous …
Versatile audio-visual learning for handling single and multi modalities in emotion regression and classification tasks
Most current audio-visual emotion recognition models lack the flexibility needed for
deployment in practical applications. We envision a multimodal system that works even …
Selective acoustic feature enhancement for speech emotion recognition with noisy speech
A speech emotion recognition (SER) system deployed on a real-world application can
encounter speech contaminated with unconstrained background noise. To deal with this …
Versatile audio-visual learning for emotion recognition
Most current audio-visual emotion recognition models lack the flexibility needed for
deployment in practical applications. We envision a multimodal system that works even …
Deep temporal clustering features for speech emotion recognition
Deep clustering is a popular unsupervised technique for feature representation learning. We
recently proposed the chunk-based DeepEmoCluster framework for speech emotion …
Enhancing Resilience to Missing Data in Audio-Text Emotion Recognition with Multi-Scale Chunk Regularization
Most existing audio-text emotion recognition studies have focused on the computational
modeling aspects, including strategies for fusing the modalities. An area that has received …
Detail-Enhanced Intra- and Inter-modal Interaction for Audio-Visual Emotion Recognition
Capturing complex temporal relationships between video and audio modalities is vital for
Audio-Visual Emotion Recognition (AVER). However, existing methods lack attention to …