Multimodal research in vision and language: A review of current and emerging trends

S Uppal, S Bhagat, D Hazarika, N Majumder, S Poria… - Information …, 2022 - Elsevier
Deep Learning and its applications have cascaded impactful research and development
with a diverse range of modalities present in the real-world data. More recently, this has …

Features to text: a comprehensive survey of deep learning on semantic segmentation and image captioning

A Oluwasammi, MU Aftab, Z Qin, ST Ngo, TV Doan… - …, 2021 - Wiley Online Library
With the emergence of deep learning, computer vision has witnessed extensive
advancement and has seen immense applications in multiple domains. Specifically, image …

Grounding 'grounding' in NLP

KR Chandu, Y Bisk, AW Black - arXiv preprint arXiv:2106.02192, 2021 - arxiv.org
The NLP community has seen substantial recent interest in grounding to facilitate interaction
between language technologies and the world. However, as a community, we use the term …

Storytelling from an image stream using scene graphs

R Wang, Z Wei, P Li, Q Zhang, X Huang - Proceedings of the AAAI …, 2020 - aaai.org
Visual storytelling aims at generating a story from an image stream. Most existing methods
tend to represent images directly with the extracted high-level features, which is not intuitive …

Image captioning based on scene graphs: A survey

J Jia, X Ding, S Pang, X Gao, X Xin, R Hu… - Expert Systems with …, 2023 - Elsevier
Although recent developments in deep learning have brought several tasks closer to human
performance, there is still a significant gap between human and machine performance in …

TCIC: Theme concepts learning cross language and vision for image captioning

Z Fan, Z Wei, S Wang, R Wang, Z Li, H Shan… - arXiv preprint arXiv …, 2021 - arxiv.org
Existing research for image captioning usually represents an image using a scene graph
with low-level facts (objects and relations) and fails to capture the high-level semantics. In …

Semantic completion and filtration for image–text retrieval

S Yang, Q Li, W Li, XY Li, R Jin, B Lv, R Wang… - ACM Transactions on …, 2023 - dl.acm.org
Image–text retrieval is a vital task in computer vision and has received growing attention,
since it connects cross-modality data. It comes with the critical challenges of learning unified …

Object-centric diagnosis of visual reasoning

J Yang, J Mao, J Wu, D Parikh, DD Cox… - arXiv preprint arXiv …, 2020 - arxiv.org
When answering questions about an image, it not only needs knowing what--understanding
the fine-grained contents (e.g., objects, relationships) in the image, but also telling why …

Structural semantic adversarial active learning for image captioning

B Zhang, L Li, L Su, S Wang, J Deng, ZJ Zha… - Proceedings of the 28th …, 2020 - dl.acm.org
Most image captioning models achieve superior performances with the help of large-scale
supervised training data, but it is prohibitively costly to label the image captions. To solve this …

Review of recent deep learning based methods for image-text retrieval

J Chen, L Zhang, C Bai… - 2020 IEEE Conference on …, 2020 - ieeexplore.ieee.org
Cross-modal retrieval has drawn much attention in recent years due to the diversity and the
quantity of information data that exploded with the popularity of mobile devices and social …