A survey of traditional and deep learning-based feature descriptors for high dimensional data in computer vision
Higher dimensional data such as video and 3D are the leading edge of multimedia retrieval
and computer vision research. In this survey, we give a comprehensive overview and key …
A survey on video moment localization
Video moment localization, also known as video moment retrieval, aims to search a target
segment within a video described by a given natural language query. Beyond the task of …
Momentdiff: Generative video moment retrieval from random to real
Video moment retrieval pursues an efficient and generalized solution to identify the specific
temporal segments within an untrimmed video that correspond to a given language …
Detecting moments and highlights in videos via natural language queries
Detecting customized moments and highlights from videos given natural language (NL) user
queries is an important but under-studied topic. One of the challenges in pursuing this …
Learning 2d temporal adjacent networks for moment localization with natural language
We address the problem of retrieving a specific moment from an untrimmed video by a query
sentence. This is a challenging problem because a target moment may take place in …
Image retrieval on real-life images with pre-trained vision-and-language models
Z Liu, C Rodriguez-Opazo… - Proceedings of the …, 2021 - openaccess.thecvf.com
We extend the task of composed image retrieval, where an input query consists of an image
and short textual description of how to modify the image. Existing methods have only been …
3djcg: A unified framework for joint dense captioning and visual grounding on 3d point clouds
Observing that the 3D captioning task and the 3D grounding task contain both shared and
complementary information in nature, in this work, we propose a unified framework to jointly …
Interventional video grounding with dual contrastive learning
Video grounding aims to localize a moment from an untrimmed video for a given textual
query. Existing approaches focus more on the alignment of visual and language stimuli with …
Dense regression network for video grounding
We address the problem of video grounding from natural language queries. The key
challenge in this task is that one training video might only contain a few annotated …
Span-based localizing network for natural language video localization
Given an untrimmed video and a text query, natural language video localization (NLVL) is to
locate a matching span from the video that semantically corresponds to the query. Existing …