Visual commonsense-aware representation network for video captioning

S Jing, H Zhang, P Zeng, L Gao… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org

Video captioning focuses on generating natural language descriptions according to the
video content. Existing works mainly explore this multimodal learning with the paired source …

Cited by 16 · Related articles · All 2 versions

Rethink video retrieval representation for video captioning

M Tian, G Li, Y Qi, S Wang, QZ Sheng, Q Huang - Pattern Recognition, 2024 - Elsevier

Video captioning, a challenging task targeting the automatic generation of accurate and
comprehensive descriptions based on video content, has witnessed substantial success …

Cited by 1 · Related articles

UMP: Unified Modality-aware Prompt Tuning for Text-Video Retrieval

H Zhang, P Zeng, L Gao, J Song… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org

Prompt tuning, an emerging parameter-efficient strategy, leverages the powerful knowledge
of large-scale pre-trained image-text models (e.g., CLIP) to swiftly adapt to downstream tasks …

Cited by 1 · Related articles

Contrastive topic-enhanced network for video captioning

Y Zeng, Y Wang, D Liao, G Li, J Xu, H Man… - Expert Systems with …, 2024 - Elsevier

In the field of video captioning, recent works usually focus on multi-modal video content
understanding, in which transcripts are extracted from speech and are often adopted as an …

Cited by 5 · Related articles · All 4 versions

Center-enhanced video captioning model with multimodal semantic alignment

B Zhang, J Gao, Y Yuan - Neural Networks, 2024 - Elsevier

Video captioning aims at automatically generating descriptive sentences based on the given
video, establishing an association between the visual contents and textual languages, has …

Learning Temporal Dynamics in Videos With Image Transformer

Y Shu, Z Qiu, F Long, T Yao, CW Ngo… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org

Temporal dynamics represent the evolving of video content over time, which are critical for
action recognition. In this paper, we ask the question: can the off-the-shelf image transformer …

EMS: A Large-Scale Eye Movement Dataset, Benchmark, and New Model for Schizophrenia Recognition

Y Song, Z Liu, G Li, J Xie, Q Wu, D Zeng… - … on Neural Networks …, 2024 - ieeexplore.ieee.org

Schizophrenia (SZ) is a common and disabling mental illness, and most patients encounter
cognitive deficits. The eye-tracking technology has been increasingly used to characterize …

CMGNet: Collaborative multi-modal graph network for video captioning

Q Rao, X Yu, G Li, L Zhu - Computer Vision and Image Understanding, 2024 - Elsevier

In video captioning, it is very challenging to comprehensively describe multi-modal content
information of a video, such as appearance, motion, and object. Prior arts often neglect …

Cited by 1 · Related articles · All 4 versions

Video captioning based on dual learning via multiple reconstruction blocks

BHH Putra, C Jeong - Image and Vision Computing, 2024 - Elsevier

In the context of video captioning, a conventional dual learning scheme involves two tasks: a
primal task, which translates frame features into natural language captions, and a dual task …

Triple-stream commonsense circulation transformer network for image captioning

J Li, W Zhou, W Kai, H Hu - Computer Vision and Image Understanding, 2024 - Elsevier

Traditional image captioning methods only have a local perspective at the dataset level,
allowing them to explore dispersed information within individual images. However, the lack …