QuerYD: A video dataset with high-quality text and audio narrations

S Ren, L Yao, S Li, X Sun… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com

This work proposes TimeChat, a time-sensitive multimodal large language model specifically
designed for long video understanding. Our model incorporates two key architectural …

Cited by 99 - Related articles - All 4 versions

[PDF] neurips.cc

Long-form video-language pre-training with multimodal temporal contrastive learning

Y Sun, H Xue, R Song, B Liu… - Advances in neural …, 2022 - proceedings.neurips.cc

Large-scale video-language pre-training has shown significant improvement in
video-language understanding tasks. Previous studies of video-language pretraining mainly focus …

Cited by 70 - Related articles - All 6 versions

[PDF] thecvf.com

AutoAD: Movie description in context

T Han, M Bain, A Nagrani, G Varol… - Proceedings of the …, 2023 - openaccess.thecvf.com

The objective of this paper is an automatic Audio Description (AD) model that ingests movies
and outputs AD in text form. Generating high-quality movie AD is challenging due to the …

Cited by 56 - Related articles - All 7 versions

[PDF] thecvf.com

TeachText: Crossmodal generalized distillation for text-video retrieval

I Croitoru, SV Bogolin, M Leordeanu… - Proceedings of the …, 2021 - openaccess.thecvf.com

In recent years, considerable progress on the task of text-video retrieval has been achieved
by leveraging large-scale pretraining on visual and audio datasets to construct powerful …

Cited by 153 - Related articles - All 11 versions

[PDF] thecvf.com

MAD: A scalable dataset for language grounding in videos from movie audio descriptions

M Soldan, A Pardo, JL Alcázar… - Proceedings of the …, 2022 - openaccess.thecvf.com

The recent and increasing interest in video-language research has driven the development
of large-scale datasets that enable data-intensive machine learning techniques. In …

Cited by 104 - Related articles - All 8 versions

[PDF] thecvf.com

Cross modal retrieval with querybank normalisation

SV Bogolin, I Croitoru, H Jin, Y Liu… - Proceedings of the …, 2022 - openaccess.thecvf.com

Profiting from large-scale training datasets, advances in neural architecture design and
efficient inference, joint embeddings have become the dominant approach for tackling cross …

Cited by 86 - Related articles - All 5 versions

[PDF] arxiv.org

Audio retrieval with natural language queries: A benchmark study

AS Koepke, AM Oncescu, JF Henriques… - IEEE Transactions …, 2022 - ieeexplore.ieee.org

The objectives of this work are cross-modal text-audio and audio-text retrieval, in which the
goal is to retrieve the audio content from a pool of candidates that best matches a given …

Cited by 112 - Related articles - All 10 versions

[PDF] arxiv.org

AVLnet: Learning audio-visual language representations from instructional videos

A Rouditchenko, A Boggust, D Harwath, B Chen… - arXiv preprint arXiv …, 2020 - arxiv.org

Current methods for learning visually grounded language from videos often rely on text
annotation, such as human generated captions or machine generated automatic speech …

Cited by 154 - Related articles - All 9 versions

[PDF] arxiv.org

Audio retrieval with natural language queries

AM Oncescu, A Koepke, JF Henriques, Z Akata… - arXiv preprint arXiv …, 2021 - arxiv.org

We consider the task of retrieving audio using free-form natural language queries. To study
this problem, which has received limited attention in the existing literature, we introduce …

Cited by 90 - Related articles - All 13 versions

[PDF] thecvf.com

On semantic similarity in video retrieval

M Wray, H Doughty, D Damen - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com

Current video retrieval efforts all found their evaluation on an instance-based assumption,
that only a single caption is relevant to a query video and vice versa. We demonstrate that …

Cited by 76 - Related articles - All 8 versions