Spiking Tucker fusion transformer for audio-visual zero-shot learning
The spiking neural networks (SNNs) that efficiently encode temporal sequences have shown
great potential in extracting audio-visual joint feature representations. However, coupling …
Hyperbolic deep learning in computer vision: A survey
Deep representation learning is a ubiquitous part of modern computer vision. While
Euclidean space has been the de facto standard manifold for learning visual …
Audio-visual generalized zero-shot learning the easy way
S Mo, P Morgado - European Conference on Computer Vision, 2025 - Springer
Audio-visual generalized zero-shot learning is a rapidly advancing domain that seeks to
understand the intricate relations between audio and visual cues within videos. The …
Emergent visual-semantic hierarchies in image-text representations
M Alper, H Averbuch-Elor - European Conference on Computer Vision, 2025 - Springer
While recent vision-and-language models (VLMs) like CLIP are a powerful tool for analyzing
text and images in a shared semantic space, they do not explicitly model the hierarchical …
Audio-Visual Generalized Zero-Shot Learning using Pre-Trained Large Multi-Modal Models
D Kurzendörfer, OB Mercea… - Proceedings of the …, 2024 - openaccess.thecvf.com
Audio-visual zero-shot learning methods commonly build on features extracted from pre-
trained models, e.g., video or audio classification models. However, existing benchmarks …
Boosting Audio-visual Zero-shot Learning with Large Language Models
Audio-visual zero-shot learning aims to recognize unseen categories based on paired audio-
visual sequences. Recent methods mainly focus on learning aligned and discriminative …
FunnyNet-W: Multimodal Learning of Funny Moments in Videos in the Wild
Automatically understanding funny moments (i.e., the moments that make people laugh)
when watching comedy is challenging, as they relate to various features, such as body …
Object-Aware Image Augmentation for Audio-Visual Zero-Shot Learning
Y Dong, S Chen, B Duan, W Ding… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Audio-visual zero-shot learning (ZSL) leverages both video and audio information for model
training, aiming to classify new video categories that were not seen during the training …
Hyperbolic-constraint Point Cloud Reconstruction from Single RGB-D Images
Reconstructing desired objects and scenes has long been a primary goal in 3D computer
vision. Single-view point cloud reconstruction has become a popular technique due to its …
Discrepancy-Aware Attention Network for Enhanced Audio-Visual Zero-Shot Learning
RL Yu, Y Gong, W Li, A Sun, M Zheng - arXiv preprint arXiv:2412.11715, 2024 - arxiv.org
Audio-visual Zero-Shot Learning (ZSL) has attracted significant attention for its ability to
identify unseen classes and perform well in video classification tasks. However, modal …