Spiking Tucker fusion transformer for audio-visual zero-shot learning

W Li, P Wang, R Xiong, X Fan - IEEE Transactions on Image …, 2024 - ieeexplore.ieee.org
The spiking neural networks (SNNs) that efficiently encode temporal sequences have shown
great potential in extracting audio-visual joint feature representations. However, coupling …

Hyperbolic deep learning in computer vision: A survey

P Mettes, M Ghadimi Atigh, M Keller-Ressel… - International Journal of …, 2024 - Springer
Deep representation learning is a ubiquitous part of modern computer vision. While
Euclidean space has been the de facto standard manifold for learning visual …

Audio-visual generalized zero-shot learning the easy way

S Mo, P Morgado - European Conference on Computer Vision, 2025 - Springer
Audio-visual generalized zero-shot learning is a rapidly advancing domain that seeks to
understand the intricate relations between audio and visual cues within videos. The …

Emergent visual-semantic hierarchies in image-text representations

M Alper, H Averbuch-Elor - European Conference on Computer Vision, 2025 - Springer
While recent vision-and-language models (VLMs) like CLIP are a powerful tool for analyzing
text and images in a shared semantic space, they do not explicitly model the hierarchical …

Audio-Visual Generalized Zero-Shot Learning using Pre-Trained Large Multi-Modal Models

D Kurzendörfer, OB Mercea… - Proceedings of the …, 2024 - openaccess.thecvf.com
Audio-visual zero-shot learning methods commonly build on features extracted from
pre-trained models, e.g., video or audio classification models. However, existing benchmarks …

Boosting Audio-visual Zero-shot Learning with Large Language Models

H Chen, Y Li, Y Hong, Z Huang, Z Xu, Z Gu… - arXiv preprint arXiv …, 2023 - arxiv.org
Audio-visual zero-shot learning aims to recognize unseen categories based on paired
audio-visual sequences. Recent methods mainly focus on learning aligned and discriminative …

FunnyNet-W: Multimodal Learning of Funny Moments in Videos in the Wild

ZS Liu, R Courant, V Kalogeiton - International Journal of Computer Vision, 2024 - Springer
Automatically understanding funny moments (i.e., the moments that make people laugh)
when watching comedy is challenging, as they relate to various features, such as body …

Object-Aware Image Augmentation for Audio-Visual Zero-Shot Learning

Y Dong, S Chen, B Duan, W Ding… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Audio-visual zero-shot learning (ZSL) leverages both video and audio information for model
training, aiming to classify new video categories that were not seen during the training …

Hyperbolic-constraint Point Cloud Reconstruction from Single RGB-D Images

W Li, Z Yang, W Han, H Man, X Wang, X Fan - arXiv preprint arXiv …, 2024 - arxiv.org
Reconstructing desired objects and scenes has long been a primary goal in 3D computer
vision. Single-view point cloud reconstruction has become a popular technique due to its …

Discrepancy-Aware Attention Network for Enhanced Audio-Visual Zero-Shot Learning

RL Yu, Y Gong, W Li, A Sun, M Zheng - arXiv preprint arXiv:2412.11715, 2024 - arxiv.org
Audio-visual Zero-Shot Learning (ZSL) has attracted significant attention for its ability to
identify unseen classes and perform well in video classification tasks. However, modal …