A survey of traditional and deep learning-based feature descriptors for high dimensional data in computer vision
Higher-dimensional data such as video and 3D are at the leading edge of multimedia retrieval
and computer vision research. In this survey, we give a comprehensive overview and key …
A survey on video moment localization
Video moment localization, also known as video moment retrieval, aims to locate, within a video, a
target segment described by a given natural language query. Beyond the task of …
MVBench: A comprehensive multi-modal video understanding benchmark
With the rapid development of Multi-modal Large Language Models (MLLMs), a number of
diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities …
Egocentric video-language pretraining
Video-Language Pretraining (VLP), which aims to learn transferable representations
to advance a wide range of video-text downstream tasks, has recently received increasing …
Less is more: ClipBERT for video-and-language learning via sparse sampling
The canonical approach to video-and-language learning (e.g., video question answering)
dictates a neural model to learn from offline-extracted dense video features from vision …
UniVTG: Towards unified video-language temporal grounding
Video Temporal Grounding (VTG), which aims to ground target clips from videos
(such as consecutive intervals or disjoint shots) according to custom language queries (e.g., …
VIOLET: End-to-end video-language transformers with masked visual-token modeling
A great challenge in video-language (VidL) modeling lies in the disconnection between
fixed video representations extracted from image/video understanding models and …
TimeChat: A time-sensitive multimodal large language model for long video understanding
This work proposes TimeChat, a time-sensitive multimodal large language model specifically
designed for long video understanding. Our model incorporates two key architectural …
EgoVLPv2: Egocentric video-language pre-training with fusion in the backbone
Video-language pre-training (VLP) has become increasingly important due to its ability to
generalize to various vision and language tasks. However, existing egocentric VLP …
HERO: Hierarchical encoder for video+language omni-representation pre-training
We present HERO, a novel framework for large-scale video+language omni-representation
learning. HERO encodes multimodal inputs in a hierarchical structure, where local context of …