A systematic survey of prompt engineering on vision-language foundation models

J Gu, Z Han, S Chen, A Beirami, B He, G Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org
Prompt engineering is a technique that involves augmenting a large pre-trained model with
task-specific hints, known as prompts, to adapt the model to new tasks. Prompts can be …

SHiNe: Semantic Hierarchy Nexus for Open-vocabulary Object Detection

M Liu, TL Hayes, E Ricci, G Csurka… - Proceedings of the …, 2024 - openaccess.thecvf.com
Open-vocabulary object detection (OvOD) has transformed detection into a language-guided
task empowering users to freely define their class vocabularies of interest during …

VidProM: A million-scale real prompt-gallery dataset for text-to-video diffusion models

W Wang, Y Yang - arXiv preprint arXiv:2403.06098, 2024 - arxiv.org
The arrival of Sora marks a new era for text-to-video diffusion models, bringing significant
advancements in video generation and potential applications. However, Sora, as well as …

EA-VTR: Event-Aware Video-Text Retrieval

Z Ma, Z Zhang, Y Chen, Z Qi, C Yuan, B Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Understanding the content of events occurring in the video and their inherent temporal logic
is crucial for video-text retrieval. However, web-crawled pre-training datasets often lack …

Localizing Events in Videos with Multimodal Queries

G Zhang, MLA Fok, Y Xia, Y Tang, D Cremers… - arXiv preprint arXiv …, 2024 - arxiv.org
Video understanding is a pivotal task in the digital era, yet the dynamic and multievent
nature of videos makes them labor-intensive and computationally demanding to process …

Referring Atomic Video Action Recognition

K Peng, J Fu, K Yang, D Wen, Y Chen, R Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce a new task called Referring Atomic Video Action Recognition (RAVAR), aimed
at identifying atomic actions of a particular person based on a textual description and the …

Large-scale Vision-Language Models Learn Super Images for Efficient and High-Performance Partially Relevant Video Retrieval

T Nishimura, S Nakada, M Kondo - arXiv preprint arXiv:2312.00414, 2023 - arxiv.org
In this paper, we propose an efficient and high-performance method for partially relevant
video retrieval (PRVR), which aims to retrieve untrimmed long videos that contain at least …