Video-text as game players: Hierarchical Banzhaf interaction for cross-modal representation learning
Contrastive learning-based video-language representation learning approaches, e.g., CLIP,
have achieved outstanding performance, which pursue semantic interaction upon pre …
Diffusionret: Generative text-video retrieval with diffusion model
Existing text-video retrieval solutions are, in essence, discriminant models focused on
maximizing the conditional likelihood, i.e., p(candidates | query). While straightforward, this …
Time does tell: Self-supervised time-tuning of dense image representations
Spatially dense self-supervised learning is a rapidly growing problem domain with
promising applications for unsupervised segmentation and pretraining for dense …
Text-video retrieval with disentangled conceptualization and set-to-set alignment
Text-video retrieval is a challenging cross-modal task, which aims to align visual entities with
natural language descriptions. Current methods either fail to leverage the local details or are …
Patch-level contrastive learning via positional query for visual pre-training
Dense contrastive learning (DCL) has been recently explored for learning localized
information for dense prediction tasks (e.g., detection and segmentation). It still suffers the …
pnnclr: Stochastic pseudo neighborhoods for contrastive learning based unsupervised representation learning problems
Nearest neighbor (NN) sampling provides more semantic variations than predefined
transformations for self-supervised learning (SSL) based image recognition problems …
SaCo Loss: Sample-wise Affinity Consistency for Vision-Language Pre-training
Vision-language pre-training (VLP) aims to learn joint representations of vision and
language modalities. The contrastive paradigm is currently dominant in this field. However …
Representing Part-Whole Hierarchies in Foundation Models by Learning Localizability Composability and Decomposability from Anatomy via Self Supervision
MRH Taher, MB Gotway… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Humans effortlessly interpret images by parsing them into part-whole hierarchies; deep
learning excels in learning multi-level feature spaces but they often lack explicit coding of …
QDFormer: Towards Robust Audiovisual Segmentation in Complex Environments with Quantization-based Semantic Decomposition
Audiovisual segmentation (AVS) is a challenging task that aims to segment visual objects in
videos according to their associated acoustic cues. With multiple sound sources and …
Towards Effective Usage of Human-Centric Priors in Diffusion Models for Text-based Human Image Generation
Vanilla text-to-image diffusion models struggle with generating accurate human images,
commonly resulting in imperfect anatomies such as unnatural postures or disproportionate …