A review of deep learning for video captioning

M Abdar, M Kollati, S Kuraparthi, F Pourpanah… - arXiv preprint arXiv …, 2023 - arxiv.org
Video captioning (VC) is a fast-moving, cross-disciplinary area of research that bridges work
in the fields of computer vision, natural language processing (NLP), linguistics, and human …

Video captioning: a comparative review of where we are and which could be the route

D Moctezuma, T Ramírez-delReal, G Ruiz… - Computer Vision and …, 2023 - Elsevier
Video captioning is the process of describing the content of a sequence of images capturing
its semantic relationships and meanings. Dealing with this task with a single image is …

Concept-aware video captioning: Describing videos with effective prior information

B Yang, M Cao, Y Zou - IEEE Transactions on Image …, 2023 - ieeexplore.ieee.org
Concepts, a collective term for meaningful words that correspond to objects, actions, and
attributes, can act as an intermediary for video captioning. While many efforts have been …

Explainability in graph neural networks: An experimental survey

P Li, Y Yang, M Pagnucco, Y Song - arXiv preprint arXiv:2203.09258, 2022 - arxiv.org
Graph neural networks (GNNs) have been extensively developed for graph representation
learning in various application domains. However, similar to all other neural networks …

Bridging video and text: A two-step polishing transformer for video captioning

W Xu, Z Miao, J Yu, Y Tian, L Wan… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Video captioning is a joint task of computer vision and natural language processing, which
aims to describe the video content using several natural language sentences. Nowadays …

Visual commonsense-aware representation network for video captioning

P Zeng, H Zhang, L Gao, X Li, J Qian… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Generating consecutive descriptions for videos, that is, video captioning, requires taking full
advantage of visual representation along with the generation process. Existing video …

Time–frequency recurrent transformer with diversity constraint for dense video captioning

P Li, P Zhang, T Wang, H Xiao - Information Processing & Management, 2023 - Elsevier
Describing a long video using multiple sentences, i.e., dense video captioning, is a very
challenging task. Existing methods neglect the important fact that the actions of several …

MIR-GAN: Refining frame-level modality-invariant representations with adversarial network for audio-visual speech recognition

Y Hu, C Chen, R Li, H Zou, ES Chng - arXiv preprint arXiv:2306.10567, 2023 - arxiv.org
Audio-visual speech recognition (AVSR) attracts a surge of research interest recently by
leveraging multimodal signals to understand human speech. Mainstream approaches …

Multi-sentence video captioning using spatial saliency of video frames and content-oriented beam search algorithm

M Nabati, A Behrad - Expert Systems with Applications, 2023 - Elsevier
Video captioning algorithms aim at expressing the information and activities contained in a
video clip in the form of lingual sentences. Most existing video captioning approaches have …

Action knowledge for video captioning with graph neural networks

WF Hendria, V Velda, BHH Putra, F Adzaka… - Journal of King Saud …, 2023 - Elsevier
Many existing video captioning methods capture action information in the video by exploiting
features extracted from an action recognition model. However, directly using the action …