Video-text as game players: Hierarchical Banzhaf interaction for cross-modal representation learning

P Jin, J Huang, P Xiong, S Tian, C Liu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Contrastive learning-based video-language representation learning approaches, e.g., CLIP,
have achieved outstanding performance, which pursue semantic interaction upon pre …

DiffusionRet: Generative text-video retrieval with diffusion model

P Jin, H Li, Z Cheng, K Li, X Ji, C Liu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Existing text-video retrieval solutions are, in essence, discriminant models focused on
maximizing the conditional likelihood, i.e., p(candidates | query). While straightforward, this …

Time does tell: Self-supervised time-tuning of dense image representations

M Salehi, E Gavves, CGM Snoek… - Proceedings of the …, 2023 - openaccess.thecvf.com
Spatially dense self-supervised learning is a rapidly growing problem domain with
promising applications for unsupervised segmentation and pretraining for dense …

Text-video retrieval with disentangled conceptualization and set-to-set alignment

P Jin, H Li, Z Cheng, J Huang, Z Wang, L Yuan… - arXiv preprint arXiv …, 2023 - arxiv.org
Text-video retrieval is a challenging cross-modal task, which aims to align visual entities with
natural language descriptions. Current methods either fail to leverage the local details or are …

Patch-level contrastive learning via positional query for visual pre-training

S Zhang, Q Zhou, Z Wang, F Wang… - … on Machine Learning, 2023 - proceedings.mlr.press
Dense contrastive learning (DCL) has been recently explored for learning localized
information for dense prediction tasks (e.g., detection and segmentation). It still suffers the …

pNNCLR: Stochastic pseudo neighborhoods for contrastive learning based unsupervised representation learning problems

M Biswas, H Buckchash, DK Prasad - Neurocomputing, 2024 - Elsevier
Nearest neighbor (NN) sampling provides more semantic variations than predefined
transformations for self-supervised learning (SSL) based image recognition problems …

SaCo Loss: Sample-wise Affinity Consistency for Vision-Language Pre-training

S Wu, H Tan, Z Tian, Y Chen… - Proceedings of the …, 2024 - openaccess.thecvf.com
Vision-language pre-training (VLP) aims to learn joint representations of vision and
language modalities. The contrastive paradigm is currently dominant in this field. However …

Representing Part-Whole Hierarchies in Foundation Models by Learning Localizability, Composability, and Decomposability from Anatomy via Self Supervision

MRH Taher, MB Gotway… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Humans effortlessly interpret images by parsing them into part-whole hierarchies; deep
learning excels in learning multi-level feature spaces but they often lack explicit coding of …

QDFormer: Towards Robust Audiovisual Segmentation in Complex Environments with Quantization-based Semantic Decomposition

X Li, J Wang, X Xu, X Peng, R Singh… - Proceedings of the …, 2024 - openaccess.thecvf.com
Audiovisual segmentation (AVS) is a challenging task that aims to segment visual objects in
videos according to their associated acoustic cues. With multiple sound sources and …

Towards Effective Usage of Human-Centric Priors in Diffusion Models for Text-based Human Image Generation

J Wang, Z Sun, Z Tan, X Chen… - Proceedings of the …, 2024 - openaccess.thecvf.com
Vanilla text-to-image diffusion models struggle with generating accurate human images,
commonly resulting in imperfect anatomies such as unnatural postures or disproportionate …