Deep reinforcement polishing network for video captioning

S Jabeen, X Li, MS Amin, O Bourahla, S Li… - ACM Transactions on …, 2023 - dl.acm.org

Deep Learning has implemented a wide range of applications and has become increasingly
popular in recent years. The goal of multimodal deep learning (MMDL) is to create models …

Cited by 75 · Related articles · All 7 versions

Recent advances and trends in multimodal deep learning: A review

J Summaira, X Li, AM Shoib, S Li, J Abdul - arXiv preprint arXiv …, 2021 - arxiv.org

Deep Learning has implemented a wide range of applications and has become increasingly
popular in recent years. The goal of multimodal deep learning is to create models that can …

Cited by 79 · Related articles · All 2 versions

Video description: A comprehensive survey of deep learning approaches

G Rafiq, M Rafiq, GS Choi - Artificial Intelligence Review, 2023 - Springer

Video description refers to understanding visual content and transforming that acquired
understanding into automatic textual narration. It bridges the key AI fields of computer vision …

Cited by 20 · Related articles · All 5 versions

Visuals to text: A comprehensive review on automatic image captioning

Y Ming, N Hu, C Fan, F Feng… - IEEE/CAA Journal of …, 2022 - researchportal.port.ac.uk

Image captioning refers to automatic generation of descriptive texts according to the visual
content of images. It is a technique integrating multiple disciplines including the computer …

Cited by 39 · Related articles · All 6 versions

Exploring optical-flow-guided motion and detection-based appearance for temporal sentence grounding

D Liu, X Fang, W Hu, P Zhou - IEEE Transactions on Multimedia, 2023 - ieeexplore.ieee.org

Temporal sentence grounding aims to localize a target segment in an untrimmed video
semantically according to a given sentence query. Most previous works focus on learning …

Cited by 36 · Related articles · All 4 versions

Dual attention on pyramid feature maps for image captioning

L Yu, J Zhang, Q Wu - IEEE Transactions on Multimedia, 2021 - ieeexplore.ieee.org

Generating natural sentences from images is a fundamental learning task for visual-
semantic understanding in multimedia. In this paper, we propose to apply dual attention on …

Cited by 43 · Related articles · All 4 versions

A review of deep learning for video captioning

M Abdar, M Kollati, S Kuraparthi, F Pourpanah… - arXiv preprint arXiv …, 2023 - arxiv.org

Video captioning (VC) is a fast-moving, cross-disciplinary area of research that bridges work
in the fields of computer vision, natural language processing (NLP), linguistics, and human …

Cited by 18 · Related articles · All 3 versions

Temporal speciation network for few-shot object detection

X Zhao, X Liu, Y Ma, S Bai, Y Shen… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org

Recently, few-shot object detection (FSOD) has become an increasing research focus,
which can largely alleviate the heavy dependency on expensive annotations in the …

Cited by 12 · Related articles · All 2 versions

MEConformer: Highly representative embedding extractor for speaker verification via incorporating selective convolution into deep speaker encoder

Q Zheng, Z Chen, Z Wang, H Liu, M Lin - Expert Systems with Applications, 2024 - Elsevier

Transformer models have demonstrated superior performance across various domains,
including computer vision, natural language processing, and speech recognition. The …

Cited by 8 · Related articles

An efficient dimensionality reduction based on adaptive-GSM and transformer assisted classification for high dimensional data

N Rajender, MV Gopalachari - International Journal of Information …, 2024 - Springer

Over the last decade, a surge in multimedia data has significantly impacted research areas
like multimedia retrieval, database management, and medical imaging. Traditional machine …

Cited by 9 · Related articles