SViTT: Temporal learning of sparse video-text transformers

J Wang, G Sun, P Wang, D Liu… - Proceedings of the …, 2024 - openaccess.thecvf.com

The increasing prevalence of video clips has sparked growing interest in text-video retrieval.
Recent advances focus on establishing a joint embedding space for text and video relying …

Cited by 3 · Related articles · All 3 versions

Ranking Distillation for Open-Ended Video Question Answering with Insufficient Labels

T Liang, C Tan, B Xia, WS Zheng… - Proceedings of the …, 2024 - openaccess.thecvf.com

This paper focuses on open-ended video question answering which aims to find the correct
answers from a large answer set in response to a video-related question. This is essentially …

Cited by 1 · Related articles · All 3 versions

Text-Video Retrieval via Multi-Modal Hypergraph Networks

Q Li, L Su, J Zhao, L Xia, H Cai, S Cheng… - Proceedings of the 17th …, 2024 - dl.acm.org

Text-video retrieval is a challenging task that aims to identify relevant videos given textual
queries. Compared to conventional textual retrieval, the main obstacle for text-video retrieval …

Cited by 1 · Related articles

SIA-Net: Sparse Interactive Attention Network for Multimodal Emotion Recognition

S Li, T Zhang, CLP Chen - IEEE Transactions on Computational …, 2024 - ieeexplore.ieee.org

Multimodal emotion recognition (MER) integrates multiple modalities to identify the user's
emotional state, which is the core technology of natural and friendly human–computer …

EA-VTR: Event-Aware Video-Text Retrieval

Z Ma, Z Zhang, Y Chen, Z Qi, C Yuan, B Li… - arXiv preprint arXiv …, 2024 - arxiv.org

Understanding the content of events occurring in the video and their inherent temporal logic
is crucial for video-text retrieval. However, web-crawled pre-training datasets often lack …

A new multi-picture architecture for learned video deinterlacing and demosaicing with parallel deformable convolution and self-attention blocks

R Ji, AM Tekalp - Image and Vision Computing, 2024 - Elsevier

Despite the fact real-world video deinterlacing and demosaicing are well-suited to
supervised learning from synthetically degraded data because the degradation models are …

Text-Video Retrieval via Variational Multi-Modal Hypergraph Networks

Q Li, L Su, J Zhao, L Xia, H Cai, S Cheng… - arXiv preprint arXiv …, 2024 - arxiv.org

Text-video retrieval is a challenging task that aims to identify relevant videos given textual
queries. Compared to conventional textual retrieval, the main obstacle for text-video retrieval …

Multi-Modal Inductive Framework for Text-Video Retrieval

Q Li, Y Zhou, C Ji, F Lu, J Gong, S Wang… - ACM Multimedia …, 2024 - openreview.net

Text-video retrieval (TVR) identifies relevant videos based on textual queries. Existing
methods are limited by their ability to understand and connect different modalities, resulting …

SViTT-Ego: A Sparse Video-Text Transformer for Egocentric Video

HA Valdez, K Min, S Tripathi - arXiv preprint arXiv:2406.09462, 2024 - arxiv.org

Pretraining egocentric vision-language models has become essential to improving
downstream egocentric video-text tasks. These egocentric foundation models commonly use …

[PDF] Supplementary Material of “Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval”

J Wang, G Sun, P Wang, D Liu, S Dianat, M Rabbani… - openaccess.thecvf.com

We provide more results, visualizations, and in-depth discussions of the proposed T-MASS
as follows: • More quantitative performance of T-MASS (Section 1). • Discussions about …