A review on methods and applications in multimodal deep learning
Deep Learning has implemented a wide range of applications and has become increasingly
popular in recent years. The goal of multimodal deep learning (MMDL) is to create models …
Recent advances and trends in multimodal deep learning: A review
Deep Learning has implemented a wide range of applications and has become increasingly
popular in recent years. The goal of multimodal deep learning is to create models that can …
Video description: A comprehensive survey of deep learning approaches
Video description refers to understanding visual content and transforming that acquired
understanding into automatic textual narration. It bridges the key AI fields of computer vision …
Visuals to text: A comprehensive review on automatic image captioning
Y Ming, N Hu, C Fan, F Feng… - IEEE/CAA Journal of …, 2022 - researchportal.port.ac.uk
Image captioning refers to automatic generation of descriptive texts according to the visual
content of images. It is a technique integrating multiple disciplines including the computer …
Exploring optical-flow-guided motion and detection-based appearance for temporal sentence grounding
Temporal sentence grounding aims to localize a target segment in an untrimmed video
semantically according to a given sentence query. Most previous works focus on learning …
Dual attention on pyramid feature maps for image captioning
Generating natural sentences from images is a fundamental learning task for visual-
semantic understanding in multimedia. In this paper, we propose to apply dual attention on …
A review of deep learning for video captioning
Video captioning (VC) is a fast-moving, cross-disciplinary area of research that bridges work
in the fields of computer vision, natural language processing (NLP), linguistics, and human …
Temporal speciation network for few-shot object detection
Recently, few-shot object detection (FSOD) has become an increasing research focus,
which can largely alleviate the heavy dependency on expensive annotations in the …
MEConformer: Highly representative embedding extractor for speaker verification via incorporating selective convolution into deep speaker encoder
Transformer models have demonstrated superior performance across various domains,
including computer vision, natural language processing, and speech recognition. The …
An efficient dimensionality reduction based on adaptive-GSM and transformer assisted classification for high dimensional data
N Rajender, MV Gopalachari - International Journal of Information …, 2024 - Springer
Over the last decade, a surge in multimedia data has significantly impacted research areas
like multimedia retrieval, database management, and medical imaging. Traditional machine …