Chat-UniVi: Unified visual representation empowers large language models with image and video understanding
P Jin, R Takanobu, W Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Large language models have demonstrated impressive universal capabilities across a wide
range of open-ended tasks and have extended their utility to encompass multimodal …
MomentDiff: Generative video moment retrieval from random to real
Video moment retrieval pursues an efficient and generalized solution to identify the specific
temporal segments within an untrimmed video that correspond to a given language …
DiffusionRet: Generative text-video retrieval with diffusion model
Existing text-video retrieval solutions are, in essence, discriminant models focused on
maximizing the conditional likelihood, i.e., p(candidates | query). While straightforward, this …
Act as you wish: Fine-grained control of motion diffusion model with hierarchical semantic graphs
Most text-driven human motion generation methods employ sequential modeling
approaches, e.g., transformer, to extract sentence-level text representations automatically and …
Video understanding with large language models: A survey
With the burgeoning growth of online video platforms and the escalating volume of video
content, the demand for proficient video understanding tools has intensified markedly. Given …
Progressive spatio-temporal prototype matching for text-video retrieval
The performance of text-video retrieval has been significantly improved by vision-language
cross-modal learning schemes. The typical solution is to directly align the global video-level …
UATVR: Uncertainty-adaptive text-video retrieval
With the explosive growth of web videos and emerging large-scale vision-language pre-training
models, e.g., CLIP, retrieving videos of interest with text instructions has attracted …
Discover and align taxonomic context priors for open-world semi-supervised learning
Open-world Semi-Supervised Learning (OSSL) is a realistic and challenging task,
aiming to classify unlabeled samples from both seen and novel classes using partially …
Out-of-distributed semantic pruning for robust semi-supervised learning
Recent advances in robust semi-supervised learning (SSL) typically filter out-of-distribution
(OOD) information at the sample level. We argue that an overlooked problem of robust SSL …
FreestyleRet: Retrieving images from style-diversified queries
Image Retrieval aims to retrieve corresponding images based on a given query. In
application scenarios, users intend to express their retrieval intent through various query …