Multimodal research in vision and language: A review of current and emerging trends
Deep learning and its applications have driven a cascade of impactful research and development
across the diverse range of modalities present in real-world data. More recently, this has …
Vid2Seq: Large-scale pretraining of a visual language model for dense video captioning
In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning
model pretrained on narrated videos, which are readily available at scale. The Vid2Seq …
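The single-stage formulation above can be pictured as reducing dense event captioning to ordinary sequence generation: timestamps are discretized into special time tokens and interleaved with caption text in one target sequence. The sketch below is a minimal illustration of that idea; the token names, bin count, and serialization order are assumptions, not the paper's exact scheme.

```python
# Illustrative sketch: dense event captioning as single-sequence generation by
# interleaving discretized time tokens with caption text. The token format and
# quantization here are assumptions, not the exact Vid2Seq implementation.

NUM_TIME_BINS = 100  # assumed number of discrete time tokens

def time_to_token(t_seconds: float, video_duration: float) -> str:
    """Quantize an absolute timestamp into one of NUM_TIME_BINS time tokens."""
    bin_idx = min(int(t_seconds / video_duration * NUM_TIME_BINS), NUM_TIME_BINS - 1)
    return f"<time_{bin_idx}>"

def events_to_sequence(events, video_duration):
    """Serialize (start, end, caption) events into a single target token sequence."""
    parts = []
    for start, end, caption in sorted(events):
        parts += [time_to_token(start, video_duration),
                  time_to_token(end, video_duration),
                  caption]
    return " ".join(parts)

# Example: two overlapping events in a 60-second video.
events = [(3.2, 12.8, "a person opens the fridge"),
          (10.0, 25.5, "they pour milk into a glass")]
print(events_to_sequence(events, video_duration=60.0))
# <time_5> <time_21> a person opens the fridge <time_16> <time_42> they pour milk into a glass
```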
End-to-end generative pretraining for multimodal video captioning
Recent video and language pretraining frameworks lack the ability to generate sentences.
We present Multimodal Video Generative Pretraining (MV-GPT), a new pretraining …
End-to-end dense video captioning with parallel decoding
Dense video captioning aims to generate multiple associated captions with their temporal
locations from the video. Previous methods follow a sophisticated "localize-then-describe" …
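For contrast with the localize-then-describe pipeline, parallel decoding can be pictured as DETR-style set prediction: a fixed set of learnable event queries attends to frame features, and every query is mapped to a temporal segment and a caption simultaneously. The module below is a hedged sketch of that shape only; the layer sizes, the single-linear caption head, and the (center, length) segment parametrization are illustrative assumptions rather than the paper's architecture.

```python
# Sketch of parallel decoding for dense captioning: learnable event queries
# attend to frame features and are decoded into segments and captions at once.
import torch
import torch.nn as nn

class ParallelEventDecoder(nn.Module):
    def __init__(self, d_model=256, num_queries=10, vocab_size=1000, max_caption_len=20):
        super().__init__()
        self.event_queries = nn.Parameter(torch.randn(num_queries, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.segment_head = nn.Linear(d_model, 2)  # predicts (center, length) in [0, 1]
        self.caption_head = nn.Linear(d_model, max_caption_len * vocab_size)  # simplified, non-autoregressive
        self.max_caption_len, self.vocab_size = max_caption_len, vocab_size

    def forward(self, frame_features):                                 # frame_features: (B, T, d_model)
        B = frame_features.size(0)
        queries = self.event_queries.unsqueeze(0).expand(B, -1, -1)    # (B, N, d_model)
        decoded = self.decoder(queries, frame_features)                 # each query attends to all frames
        segments = self.segment_head(decoded).sigmoid()                 # (B, N, 2)
        captions = self.caption_head(decoded).view(B, -1, self.max_caption_len, self.vocab_size)
        return segments, captions                                       # all N events predicted in parallel

segments, captions = ParallelEventDecoder()(torch.randn(2, 120, 256))
print(segments.shape, captions.shape)  # torch.Size([2, 10, 2]) torch.Size([2, 10, 20, 1000])
```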
AutoAD II: The sequel - who, when, and what in movie audio description
Audio Description (AD) is the task of generating descriptions of visual content, at suitable
time intervals, for the benefit of visually impaired audiences. For movies, this presents …
VTimeLLM: Empower LLM to grasp video moments
Large language models (LLMs) have shown remarkable text understanding capabilities,
which have been extended as Video LLMs to handle video data for comprehending visual …
Video transformers: A survey
Transformer models have shown great success handling long-range interactions, making
them a promising tool for modeling video. However, they lack inductive biases and scale …
AVoiD-DF: Audio-visual joint learning for detecting deepfake
Recently, deepfakes have raised severe concerns about the authenticity of online media.
Prior works for deepfake detection have made many efforts to capture the intra-modal …
CLIP-It! Language-guided video summarization
M Narasimhan, A Rohrbach… - Advances in neural …, 2021 - proceedings.neurips.cc
A generic video summary is an abridged version of a video that conveys the whole story and
features the most important scenes. Yet the importance of scenes in a video is often …
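One way to picture language-guided summarization is to score every frame by its similarity to a text query in a joint vision-language embedding space and keep the highest-scoring frames. The snippet below is a deliberately simplified baseline under that assumption (precomputed CLIP-style embeddings, plain cosine scoring); it is not the paper's language-guided attention model.

```python
# Minimal sketch of language-guided frame selection: score each frame by cosine
# similarity between its embedding and a text-query embedding, then keep the
# top-scoring frames as the summary. Assumes precomputed CLIP-style embeddings.
import numpy as np

def language_guided_summary(frame_embs: np.ndarray, query_emb: np.ndarray, budget: int):
    """frame_embs: (T, D) per-frame embeddings; query_emb: (D,); budget: frames to keep."""
    frame_embs = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    query_emb = query_emb / np.linalg.norm(query_emb)
    scores = frame_embs @ query_emb                  # cosine similarity per frame
    keep = np.sort(np.argsort(-scores)[:budget])     # top-k frames, restored to temporal order
    return keep, scores

rng = np.random.default_rng(0)
keep, scores = language_guided_summary(rng.normal(size=(300, 512)), rng.normal(size=512), budget=30)
print(keep[:5], scores[keep[0]])
```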
TSP: Temporally-sensitive pretraining of video encoders for localization tasks
H Alwassel, S Giancola… - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
Due to the large memory footprint of untrimmed videos, current state-of-the-art video
localization methods operate atop precomputed video clip features. These features are …
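The precomputed-feature workflow the abstract refers to can be sketched as follows: slide a fixed-length window over the untrimmed video, encode each clip with a frozen encoder, and cache the resulting features so the localization model never touches raw frames. The encoder, clip length, stride, and cache path below are illustrative assumptions.

```python
# Sketch of precomputing clip features for an untrimmed video with a frozen
# encoder, then caching them for downstream localization models.
import numpy as np
import torch

CLIP_LEN, STRIDE, FEAT_DIM = 16, 16, 512   # frames per clip, frames between clip starts

@torch.no_grad()
def extract_clip_features(frames: torch.Tensor, encoder) -> np.ndarray:
    """frames: (T, C, H, W) decoded video; returns (num_clips, FEAT_DIM) features."""
    feats = []
    for start in range(0, frames.shape[0] - CLIP_LEN + 1, STRIDE):
        clip = frames[start:start + CLIP_LEN]          # (CLIP_LEN, C, H, W)
        feats.append(encoder(clip.unsqueeze(0)))       # assumed encoder: (1, L, C, H, W) -> (1, FEAT_DIM)
    return torch.cat(feats).cpu().numpy()

# Toy stand-in encoder: mean-pool the clip and broadcast to FEAT_DIM dimensions.
toy_encoder = lambda clip: clip.mean(dim=(1, 2, 3, 4)).unsqueeze(1).expand(-1, FEAT_DIM)

features = extract_clip_features(torch.randn(128, 3, 112, 112), toy_encoder)
np.save("video_0001_features.npy", features)           # hypothetical cache path
print(features.shape)                                   # (8, 512)
```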