Prototype-based embedding network for scene graph generation

C Zheng, X Lyu, L Gao, B Dai… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Current Scene Graph Generation (SGG) methods explore contextual information to
predict relationships among entity pairs. However, due to the diverse visual appearance of …

From global to local: Multi-scale out-of-distribution detection

J Zhang, L Gao, B Hao, H Huang… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Out-of-distribution (OOD) detection aims to detect “unknown” data whose labels have not
been seen during the in-distribution (ID) training process. Recent progress in representation …
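
For orientation only, a minimal sketch of a generic OOD score (the maximum-softmax-probability baseline, assuming a PyTorch classifier's logits and a hypothetical threshold); this is not the multi-scale method proposed in the paper:

import torch
import torch.nn.functional as F

def max_softmax_score(logits: torch.Tensor) -> torch.Tensor:
    # Maximum softmax probability (MSP): higher values suggest in-distribution data.
    return F.softmax(logits, dim=-1).max(dim=-1).values

def flag_ood(logits: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    # Samples whose MSP falls below the (illustrative) threshold are flagged as OOD.
    return max_softmax_score(logits) < threshold

# Example: 4 samples, 10 in-distribution classes (random logits as stand-ins).
print(flag_ood(torch.randn(4, 10)))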

Memory-based augmentation network for video captioning

S Jing, H Zhang, P Zeng, L Gao… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Video captioning focuses on generating natural language descriptions according to the
video content. Existing works mainly explore this multimodal learning with the paired source …

Learning visual question answering on controlled semantic noisy labels

H Zhang, P Zeng, Y Hu, J Qian, J Song, L Gao - Pattern Recognition, 2023 - Elsevier
Visual Question Answering (VQA) has made great progress recently due to the
increasing ability to understand and encode multi-modal inputs based on deep learning …

Complementarity-aware space learning for video-text retrieval

J Zhu, P Zeng, L Gao, G Li, D Liao… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
In general, videos are powerful at recording physical patterns (e.g., spatial layout) while texts
are great at describing abstract symbols (e.g., emotion). When video and text are used in …

Spatial-temporal knowledge-embedded transformer for video scene graph generation

T Pu, T Chen, H Wu, Y Lu, L Lin - IEEE Transactions on Image …, 2023 - ieeexplore.ieee.org
Video scene graph generation (VidSGG) aims to identify objects in visual scenes and infer
their relationships for a given video. It requires not only a comprehensive understanding of …

End-to-end pre-training with hierarchical matching and momentum contrast for text-video retrieval

W Shen, J Song, X Zhu, G Li… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Lately, video-language pre-training and text-video retrieval have attracted significant
attention with the explosion of multimedia data on the Internet. However, existing …

A differentiable semantic metric approximation in probabilistic embedding for cross-modal retrieval

H Li, J Song, L Gao, P Zeng… - Advances in Neural …, 2022 - proceedings.neurips.cc
Cross-modal retrieval aims to build correspondence between multiple modalities by learning
a common representation space. Typically, an image can match multiple texts semantically …
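
As a generic illustration of retrieval in a shared embedding space (cosine similarity over L2-normalized vectors, with made-up dimensions); this does not reproduce the paper's probabilistic embedding or its semantic metric approximation:

import torch
import torch.nn.functional as F

def retrieve_texts(image_emb: torch.Tensor, text_embs: torch.Tensor, k: int = 5):
    # Rank candidate texts for one image by cosine similarity in the common space.
    image_emb = F.normalize(image_emb, dim=-1)   # shape (d,)
    text_embs = F.normalize(text_embs, dim=-1)   # shape (n, d)
    sims = text_embs @ image_emb                 # shape (n,)
    return sims.topk(min(k, text_embs.size(0)))

# Example with random stand-in embeddings: one image query vs. 100 candidate texts.
scores, indices = retrieve_texts(torch.randn(512), torch.randn(100, 512), k=5)
print(indices.tolist())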

Reducing vision-answer biases for multiple-choice VQA

X Zhang, F Zhang, C Xu - IEEE Transactions on Image …, 2023 - ieeexplore.ieee.org
Multiple-choice visual question answering (VQA) is a challenging task due to the
requirement of thorough multimodal understanding and complicated inter-modality …

Utilizing greedy nature for multimodal conditional image synthesis in transformers

S Su, J Zhu, L Gao, J Song - IEEE Transactions on Multimedia, 2023 - ieeexplore.ieee.org
Multimodal Conditional Image Synthesis (MCIS) aims to generate images according to
inputs from different modalities and their combinations, which allows users to describe their …