TimeChat: A time-sensitive multimodal large language model for long video understanding
This work proposes TimeChat, a time-sensitive multimodal large language model specifically
designed for long video understanding. Our model incorporates two key architectural …
Long-form video-language pre-training with multimodal temporal contrastive learning
Large-scale video-language pre-training has shown significant improvement in video-
language understanding tasks. Previous studies of video-language pretraining mainly focus …
AutoAD: Movie description in context
The objective of this paper is an automatic Audio Description (AD) model that ingests movies
and outputs AD in text form. Generating high-quality movie AD is challenging due to the …
TeachText: Crossmodal generalized distillation for text-video retrieval
In recent years, considerable progress on the task of text-video retrieval has been achieved
by leveraging large-scale pretraining on visual and audio datasets to construct powerful …
MAD: A scalable dataset for language grounding in videos from movie audio descriptions
The recent and increasing interest in video-language research has driven the development
of large-scale datasets that enable data-intensive machine learning techniques. In …
Cross modal retrieval with querybank normalisation
Profiting from large-scale training datasets, advances in neural architecture design and
efficient inference, joint embeddings have become the dominant approach for tackling cross …
Audio retrieval with natural language queries: A benchmark study
The objectives of this work are cross-modal text-audio and audio-text retrieval, in which the
goal is to retrieve the audio content from a pool of candidates that best matches a given …
AVLnet: Learning audio-visual language representations from instructional videos
Current methods for learning visually grounded language from videos often rely on text
annotation, such as human generated captions or machine generated automatic speech …
Audio retrieval with natural language queries
We consider the task of retrieving audio using free-form natural language queries. To study
this problem, which has received limited attention in the existing literature, we introduce …
On semantic similarity in video retrieval
Current video retrieval efforts all found their evaluation on an instance-based assumption,
that only a single caption is relevant to a query video and vice versa. We demonstrate that …