A systematic survey of prompt engineering on vision-language foundation models
Prompt engineering is a technique that involves augmenting a large pre-trained model with
task-specific hints, known as prompts, to adapt the model to new tasks. Prompts can be …
SHiNe: Semantic Hierarchy Nexus for Open-vocabulary Object Detection
Open-vocabulary object detection (OvOD) has transformed detection into a language-
guided task empowering users to freely define their class vocabularies of interest during …
VidProM: A million-scale real prompt-gallery dataset for text-to-video diffusion models
W Wang, Y Yang - arXiv preprint arXiv:2403.06098, 2024 - arxiv.org
The arrival of Sora marks a new era for text-to-video diffusion models, bringing significant
advancements in video generation and potential applications. However, Sora, as well as …
EA-VTR: Event-Aware Video-Text Retrieval
Understanding the content of events occurring in the video and their inherent temporal logic
is crucial for video-text retrieval. However, web-crawled pre-training datasets often lack …
Localizing Events in Videos with Multimodal Queries
Video understanding is a pivotal task in the digital era, yet the dynamic and multievent
nature of videos makes them labor-intensive and computationally demanding to process …
Referring Atomic Video Action Recognition
We introduce a new task called Referring Atomic Video Action Recognition (RAVAR), aimed
at identifying atomic actions of a particular person based on a textual description and the …
Large-scale Vision-Language Models Learn Super Images for Efficient and High-Performance Partially Relevant Video Retrieval
T Nishimura, S Nakada, M Kondo - arXiv preprint arXiv:2312.00414, 2023 - arxiv.org
In this paper, we propose an efficient and high-performance method for partially relevant
video retrieval (PRVR), which aims to retrieve untrimmed long videos that contain at least …