Text-video retrieval with disentangled conceptualization and set-to-set alignment

Chat-UniVi: Unified visual representation empowers large language models with image and video understanding

P Jin, R Takanobu, W Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com

Large language models have demonstrated impressive universal capabilities across a wide
range of open-ended tasks and have extended their utility to encompass multimodal …

Cited by 147 Related articles All 4 versions

[PDF] neurips.cc

MomentDiff: Generative video moment retrieval from random to real

P Li, CW Xie, H Xie, L Zhao, L Zhang… - Advances in neural …, 2024 - proceedings.neurips.cc

Video moment retrieval pursues an efficient and generalized solution to identify the specific
temporal segments within an untrimmed video that correspond to a given language …

Cited by 60 Related articles All 6 versions

[PDF] thecvf.com

DiffusionRet: Generative text-video retrieval with diffusion model

P Jin, H Li, Z Cheng, K Li, X Ji, C Liu… - Proceedings of the …, 2023 - openaccess.thecvf.com

Existing text-video retrieval solutions are, in essence, discriminant models focused on
maximizing the conditional likelihood, i.e., p(candidates|query). While straightforward, this …

Cited by 53 Related articles All 5 versions

[PDF] neurips.cc

Act as you wish: Fine-grained control of motion diffusion model with hierarchical semantic graphs

P Jin, Y Wu, Y Fan, Z Sun, W Yang… - Advances in Neural …, 2024 - proceedings.neurips.cc

Most text-driven human motion generation methods employ sequential modeling
approaches, e.g., transformer, to extract sentence-level text representations automatically and …

Cited by 23 Related articles All 5 versions

[PDF] arxiv.org

Video understanding with large language models: A survey

Y Tang, J Bi, S Xu, L Song, S Liang, T Wang… - arXiv preprint arXiv …, 2023 - arxiv.org

With the burgeoning growth of online video platforms and the escalating volume of video
content, the demand for proficient video understanding tools has intensified markedly. Given …

Cited by 50 Related articles All 2 versions

[PDF] thecvf.com

Progressive spatio-temporal prototype matching for text-video retrieval

P Li, CW Xie, L Zhao, H Xie, J Ge… - Proceedings of the …, 2023 - openaccess.thecvf.com

The performance of text-video retrieval has been significantly improved by vision-language
cross-modal learning schemes. The typical solution is to directly align the global video-level …

Cited by 32 Related articles All 3 versions

[PDF] thecvf.com

UATVR: Uncertainty-adaptive text-video retrieval

B Fang, W Wu, C Liu, Y Zhou, Y Song… - Proceedings of the …, 2023 - openaccess.thecvf.com

With the explosive growth of web videos and emerging large-scale vision-language pre-training
models, e.g., CLIP, retrieving videos of interest with text instructions has attracted …

Cited by 46 Related articles All 7 versions

[PDF] neurips.cc

Discover and align taxonomic context priors for open-world semi-supervised learning

Y Wang, Z Zhong, P Qiao, X Cheng… - Advances in …, 2024 - proceedings.neurips.cc

Open-world Semi-Supervised Learning (OSSL) is a realistic and challenging task,
aiming to classify unlabeled samples from both seen and novel classes using partially …

Cited by 8 Related articles All 6 versions

[PDF] thecvf.com

Out-of-distributed semantic pruning for robust semi-supervised learning

Y Wang, P Qiao, C Liu, G Song… - Proceedings of the …, 2023 - openaccess.thecvf.com

Recent advances in robust semi-supervised learning (SSL) typically filter out-of-distribution
(OOD) information at the sample level. We argue that an overlooked problem of robust SSL …

Cited by 13 Related articles All 6 versions

[PDF] arxiv.org

FreestyleRet: Retrieving images from style-diversified queries

H Li, Y Jia, P Jin, Z Cheng, K Li, J Sui, C Liu… - European Conference on …, 2025 - Springer

Image Retrieval aims to retrieve corresponding images based on a given query. In
application scenarios, users intend to express their retrieval intent through various query …

Cited by 7 Related articles All 2 versions