NExT-GPT: Any-to-any multimodal LLM

S Wu, H Fei, L Qu, W Ji, TS Chua - arXiv preprint arXiv:2309.05519, 2023 - arxiv.org
While recently Multimodal Large Language Models (MM-LLMs) have made exciting strides,
they mostly fall prey to the limitation of only input-side multimodal understanding, without the …

Deep learning and knowledge graph for image/video captioning: A review of datasets, evaluation metrics, and methods

MS Wajid, H Terashima‐Marin, P Najafirad… - Engineering …, 2024 - Wiley Online Library
Generating an image/video caption has always been a fundamental problem of Artificial
Intelligence, which is usually performed using the potential of Deep Learning Methods …

Deep Multimodal Data Fusion

F Zhao, C Zhang, B Geng - ACM Computing Surveys, 2024 - dl.acm.org
Multimodal Artificial Intelligence (Multimodal AI), in general, involves various types of data
(e.g., images, texts, or data collected from different sensors), feature engineering (e.g. …

Accurate and fast compressed video captioning

Y Shen, X Gu, K Xu, H Fan, L Wen… - Proceedings of the …, 2023 - openaccess.thecvf.com
Existing video captioning approaches typically require to first sample video frames from a
decoded video and then conduct a subsequent process (e.g., feature extraction and/or …

Alignment and generation adapter for efficient video-text understanding

H Fang, Z Yang, Y Wei, X Zang, C Ban… - Proceedings of the …, 2023 - openaccess.thecvf.com
Pre-trained models have demonstrated considerable performance, especially in enhancing
cross-modal understanding between videos and text. However, fine-tuning them at scale …

NativE: Multi-modal knowledge graph completion in the wild

Y Zhang, Z Chen, L Guo, Y Xu, B Hu, Z Liu… - Proceedings of the 47th …, 2024 - dl.acm.org
Multi-modal knowledge graph completion (MMKGC) aims to automatically discover the
unobserved factual knowledge from a given multi-modal knowledge graph by collaboratively …

EvCap: Element-Aware Video Captioning

S Liu, A Li, Y Zhao, J Wang… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Video captioning is a multi-modal task across computer vision and natural language
processing. Previous methods generally follow two paradigms, i.e., template-based and …

Context-Guided Spatio-Temporal Video Grounding

X Gu, H Fan, Y Huang, T Luo… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Spatio-temporal video grounding (or STVG) task aims at locating a spatio-temporal tube for
a specific instance given a text query. Despite advancements, current methods easily suffer …

RTQ: Rethinking Video-language Understanding Based on Image-text Model

X Wang, Y Li, T Gan, Z Zhang, J Lv, L Nie - Proceedings of the 31st ACM …, 2023 - dl.acm.org
Recent advancements in video-language understanding have been established on the
foundation of image-text models, resulting in promising outcomes due to the shared …

vid-TLDR: Training Free Token merging for Light-weight Video Transformer

J Choi, S Lee, J Chu, M Choi… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Video Transformers have become the prevalent solution for various video downstream tasks
with superior expressive power and flexibility. However, these video transformers suffer from …