TimeChat: A time-sensitive multimodal large language model for long video understanding

S Ren, L Yao, S Li, X Sun… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
This work proposes TimeChat, a time-sensitive multimodal large language model specifically
designed for long video understanding. Our model incorporates two key architectural …

Long-form video-language pre-training with multimodal temporal contrastive learning

Y Sun, H Xue, R Song, B Liu… - Advances in neural …, 2022 - proceedings.neurips.cc
Large-scale video-language pre-training has shown significant improvement in
video-language understanding tasks. Previous studies of video-language pretraining mainly focus …

AutoAD: Movie description in context

T Han, M Bain, A Nagrani, G Varol… - Proceedings of the …, 2023 - openaccess.thecvf.com
The objective of this paper is an automatic Audio Description (AD) model that ingests movies
and outputs AD in text form. Generating high-quality movie AD is challenging due to the …

TeachText: Crossmodal generalized distillation for text-video retrieval

I Croitoru, SV Bogolin, M Leordeanu… - Proceedings of the …, 2021 - openaccess.thecvf.com
In recent years, considerable progress on the task of text-video retrieval has been achieved
by leveraging large-scale pretraining on visual and audio datasets to construct powerful …

MAD: A scalable dataset for language grounding in videos from movie audio descriptions

M Soldan, A Pardo, JL Alcázar… - Proceedings of the …, 2022 - openaccess.thecvf.com
The recent and increasing interest in video-language research has driven the development
of large-scale datasets that enable data-intensive machine learning techniques. In …

Cross modal retrieval with querybank normalisation

SV Bogolin, I Croitoru, H Jin, Y Liu… - Proceedings of the …, 2022 - openaccess.thecvf.com
Profiting from large-scale training datasets, advances in neural architecture design and
efficient inference, joint embeddings have become the dominant approach for tackling cross …

Audio retrieval with natural language queries: A benchmark study

AS Koepke, AM Oncescu, JF Henriques… - IEEE Transactions …, 2022 - ieeexplore.ieee.org
The objectives of this work are cross-modal text-audio and audio-text retrieval, in which the
goal is to retrieve the audio content from a pool of candidates that best matches a given …

AVLnet: Learning audio-visual language representations from instructional videos

A Rouditchenko, A Boggust, D Harwath, B Chen… - arXiv preprint arXiv …, 2020 - arxiv.org
Current methods for learning visually grounded language from videos often rely on text
annotation, such as human generated captions or machine generated automatic speech …

Audio retrieval with natural language queries

AM Oncescu, A Koepke, JF Henriques, Z Akata… - arXiv preprint arXiv …, 2021 - arxiv.org
We consider the task of retrieving audio using free-form natural language queries. To study
this problem, which has received limited attention in the existing literature, we introduce …

On semantic similarity in video retrieval

M Wray, H Doughty, D Damen - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
Current video retrieval efforts all found their evaluation on an instance-based assumption,
that only a single caption is relevant to a query video and vice versa. We demonstrate that …