Prototype-based embedding network for scene graph generation

C Zheng, X Lyu, L Gao, B Dai… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Current Scene Graph Generation (SGG) methods explore contextual information to
predict relationships among entity pairs. However, due to the diverse visual appearance of …

From global to local: Multi-scale out-of-distribution detection

J Zhang, L Gao, B Hao, H Huang… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Out-of-distribution (OOD) detection aims to detect “unknown” data whose labels have not
been seen during the in-distribution (ID) training process. Recent progress in representation …
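
For orientation only, a minimal sketch of a generic OOD score (the maximum-softmax-probability baseline, assuming a PyTorch classifier's logits and a hypothetical threshold); this is not the multi-scale method proposed in the paper:

import torch
import torch.nn.functional as F

def max_softmax_score(logits: torch.Tensor) -> torch.Tensor:
    # Maximum softmax probability (MSP): higher values suggest in-distribution data.
    return F.softmax(logits, dim=-1).max(dim=-1).values

def flag_ood(logits: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    # Samples whose MSP falls below the (illustrative) threshold are flagged as OOD.
    return max_softmax_score(logits) < threshold

# Example: 4 samples, 10 in-distribution classes (random logits as stand-ins).
print(flag_ood(torch.randn(4, 10)))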

Memory-based augmentation network for video captioning

S Jing, H Zhang, P Zeng, L Gao… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Video captioning focuses on generating natural language descriptions according to the
video content. Existing works mainly explore this multimodal learning with the paired source …

Learning visual question answering on controlled semantic noisy labels

H Zhang, P Zeng, Y Hu, J Qian, J Song, L Gao - Pattern Recognition, 2023 - Elsevier
Visual Question Answering (VQA) has made great progress recently due to the
increasing ability to understand and encode multi-modal inputs based on deep learning …

Complementarity-aware space learning for video-text retrieval

J Zhu, P Zeng, L Gao, G Li, D Liao… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
In general, videos are powerful at recording physical patterns (e.g., spatial layout) while texts
are great at describing abstract symbols (e.g., emotion). When video and text are used in …

Spatial-temporal knowledge-embedded transformer for video scene graph generation

T Pu, T Chen, H Wu, Y Lu, L Lin - IEEE Transactions on Image …, 2023 - ieeexplore.ieee.org
Video scene graph generation (VidSGG) aims to identify objects in visual scenes and infer
their relationships for a given video. It requires not only a comprehensive understanding of …

End-to-end pre-training with hierarchical matching and momentum contrast for text-video retrieval

W Shen, J Song, X Zhu, G Li… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Lately, video-language pre-training and text-video retrieval have attracted significant
attention with the explosion of multimedia data on the Internet. However, existing …

A differentiable semantic metric approximation in probabilistic embedding for cross-modal retrieval

H Li, J Song, L Gao, P Zeng… - Advances in Neural …, 2022 - proceedings.neurips.cc
Cross-modal retrieval aims to build correspondence between multiple modalities by learning
a common representation space. Typically, an image can match multiple texts semantically …
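
As a generic illustration of retrieval in a shared embedding space (cosine similarity over L2-normalized vectors, with made-up dimensions); this does not reproduce the paper's probabilistic embedding or its semantic metric approximation:

import torch
import torch.nn.functional as F

def retrieve_texts(image_emb: torch.Tensor, text_embs: torch.Tensor, k: int = 5):
    # Rank candidate texts for one image by cosine similarity in the common space.
    image_emb = F.normalize(image_emb, dim=-1)   # shape (d,)
    text_embs = F.normalize(text_embs, dim=-1)   # shape (n, d)
    sims = text_embs @ image_emb                 # shape (n,)
    return sims.topk(min(k, text_embs.size(0)))

# Example with random stand-in embeddings: one image query vs. 100 candidate texts.
scores, indices = retrieve_texts(torch.randn(512), torch.randn(100, 512), k=5)
print(indices.tolist())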

Reducing vision-answer biases for multiple-choice VQA

X Zhang, F Zhang, C Xu - IEEE Transactions on Image …, 2023 - ieeexplore.ieee.org
Multiple-choice visual question answering (VQA) is a challenging task due to the
requirement of thorough multimodal understanding and complicated inter-modality …

Utilizing greedy nature for multimodal conditional image synthesis in transformers

S Su, J Zhu, L Gao, J Song - IEEE Transactions on Multimedia, 2023 - ieeexplore.ieee.org
Multimodal Conditional Image Synthesis (MCIS) aims to generate images according to
inputs from different modalities and their combinations, which allows users to describe their …