A review on methods and applications in multimodal deep learning

S Jabeen, X Li, MS Amin, O Bourahla, S Li… - ACM Transactions on …, 2023 - dl.acm.org
Deep Learning has implemented a wide range of applications and has become increasingly
popular in recent years. The goal of multimodal deep learning (MMDL) is to create models …

Recent advances and trends in multimodal deep learning: A review

J Summaira, X Li, AM Shoib, S Li, J Abdul - arXiv preprint arXiv …, 2021 - arxiv.org
Deep Learning has implemented a wide range of applications and has become increasingly
popular in recent years. The goal of multimodal deep learning is to create models that can …

Video description: A comprehensive survey of deep learning approaches

G Rafiq, M Rafiq, GS Choi - Artificial Intelligence Review, 2023 - Springer
Video description refers to understanding visual content and transforming that acquired
understanding into automatic textual narration. It bridges the key AI fields of computer vision …

Visuals to text: A comprehensive review on automatic image captioning

Y Ming, N Hu, C Fan, F Feng… - IEEE/CAA Journal of …, 2022 - researchportal.port.ac.uk
Image captioning refers to automatic generation of descriptive texts according to the visual
content of images. It is a technique integrating multiple disciplines including the computer …

Exploring optical-flow-guided motion and detection-based appearance for temporal sentence grounding

D Liu, X Fang, W Hu, P Zhou - IEEE Transactions on Multimedia, 2023 - ieeexplore.ieee.org
Temporal sentence grounding aims to localize a target segment in an untrimmed video
semantically according to a given sentence query. Most previous works focus on learning …

Dual attention on pyramid feature maps for image captioning

L Yu, J Zhang, Q Wu - IEEE Transactions on Multimedia, 2021 - ieeexplore.ieee.org
Generating natural sentences from images is a fundamental learning task for visual-semantic
understanding in multimedia. In this paper, we propose to apply dual attention on …

A review of deep learning for video captioning

M Abdar, M Kollati, S Kuraparthi, F Pourpanah… - arXiv preprint arXiv …, 2023 - arxiv.org
Video captioning (VC) is a fast-moving, cross-disciplinary area of research that bridges work
in the fields of computer vision, natural language processing (NLP), linguistics, and human …

Temporal speciation network for few-shot object detection

X Zhao, X Liu, Y Ma, S Bai, Y Shen… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Recently, few-shot object detection (FSOD) has become an increasing research focus,
which can largely alleviate the heavy dependency on expensive annotations in the …

MEConformer: Highly representative embedding extractor for speaker verification via incorporating selective convolution into deep speaker encoder

Q Zheng, Z Chen, Z Wang, H Liu, M Lin - Expert Systems with Applications, 2024 - Elsevier
Transformer models have demonstrated superior performance across various domains,
including computer vision, natural language processing, and speech recognition. The …

An efficient dimensionality reduction based on adaptive-GSM and transformer assisted classification for high dimensional data

N Rajender, MV Gopalachari - International Journal of Information …, 2024 - Springer
Over the last decade, a surge in multimedia data has significantly impacted research areas
like multimedia retrieval, database management, and medical imaging. Traditional machine …