Text with knowledge graph augmented transformer for video captioning

S Wu, H Fei, L Qu, W Ji, TS Chua - arXiv preprint arXiv:2309.05519, 2023 - arxiv.org

While recently Multimodal Large Language Models (MM-LLMs) have made exciting strides,
they mostly fall prey to the limitation of only input-side multimodal understanding, without the …

Cited by 270 Related articles All 4 versions

Deep learning and knowledge graph for image/video captioning: A review of datasets, evaluation metrics, and methods

MS Wajid, H Terashima‐Marin, P Najafirad… - Engineering …, 2024 - Wiley Online Library

Generating an image/video caption has always been a fundamental problem of Artificial
Intelligence, which is usually performed using the potential of Deep Learning Methods …

Cited by 11 Related articles All 2 versions

Deep Multimodal Data Fusion

F Zhao, C Zhang, B Geng - ACM Computing Surveys, 2024 - dl.acm.org

Multimodal Artificial Intelligence (Multimodal AI), in general, involves various types of data
(e.g., images, texts, or data collected from different sensors), feature engineering (e.g. …

Cited by 11 Related articles

Accurate and fast compressed video captioning

Y Shen, X Gu, K Xu, H Fan, L Wen… - Proceedings of the …, 2023 - openaccess.thecvf.com

Existing video captioning approaches typically require to first sample video frames from a
decoded video and then conduct a subsequent process (e.g., feature extraction and/or …

Cited by 11 Related articles All 6 versions

Alignment and generation adapter for efficient video-text understanding

H Fang, Z Yang, Y Wei, X Zang, C Ban… - Proceedings of the …, 2023 - openaccess.thecvf.com

Pre-trained models have demonstrated considerable performance, especially in enhancing
cross-modal understanding between videos and text. However, fine-tuning them at scale …

Cited by 4 Related articles All 3 versions

Native: Multi-modal knowledge graph completion in the wild

Y Zhang, Z Chen, L Guo, Y Xu, B Hu, Z Liu… - Proceedings of the 47th …, 2024 - dl.acm.org

Multi-modal knowledge graph completion (MMKGC) aims to automatically discover the
unobserved factual knowledge from a given multi-modal knowledge graph by collaboratively …

Cited by 2 Related articles All 6 versions

EvCap: Element-Aware Video Captioning

S Liu, A Li, Y Zhao, J Wang… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org

Video captioning is a multi-modal task across computer vision and natural language
processing. Previous methods generally follow two paradigms, i.e., template-based and …

Cited by 1 Related articles

Context-Guided Spatio-Temporal Video Grounding

X Gu, H Fan, Y Huang, T Luo… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com

Spatio-temporal video grounding (or STVG) task aims at locating a spatio-temporal tube for
a specific instance given a text query. Despite advancements, current methods easily suffer …

RTQ: Rethinking Video-language Understanding Based on Image-text Model

X Wang, Y Li, T Gan, Z Zhang, J Lv, L Nie - Proceedings of the 31st ACM …, 2023 - dl.acm.org

Recent advancements in video-language understanding have been established on the
foundation of image-text models, resulting in promising outcomes due to the shared …

Cited by 3 Related articles All 3 versions

vid-TLDR: Training Free Token merging for Light-weight Video Transformer

J Choi, S Lee, J Chu, M Choi… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com

Video Transformers have become the prevalent solution for various video downstream tasks
with superior expressive power and flexibility. However, these video transformers suffer from …

Cited by 2 Related articles All 3 versions