Weakly-supervised video object grounding via learning uni-modal associations

W Wang, J Gao, C Xu - IEEE Transactions on Multimedia, 2022 - ieeexplore.ieee.org
Grounding objects described in natural language to visual regions in video is a crucial capability in vision-and-language research. In this paper, we address the weakly-supervised video object grounding (WSVOG) task, where only video-sentence pairs are available for learning. The essence of this task is to learn cross-modal associations between words in the textual modality and regions in the visual modality. Despite recent progress, we find that most existing methods focus on association learning across modalities, while the rich and complementary information within each uni-modal sample has not been fully exploited. To this end, we propose to explicitly learn uni-modal associations on both the textual and visual sides, so as to fully exploit the useful uni-modal information for accurate video object grounding. Specifically, (1) we learn textual prototypes by considering the rich contextual information of the same object across different sentences, and (2) we estimate visual prototypes in an adaptive manner to overcome the uncertainty in selecting object-relevant visual regions. In addition, a cross-modal correspondence is learned which not only bridges the visual and textual modalities for the WSVOG task, but also cooperates tightly with the uni-modal association learning process. We conduct extensive experiments on three popular datasets, and the favorable results demonstrate the effectiveness of our method.
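The idea sketched in the abstract (class-level textual prototypes, adaptive visual prototypes formed by attending over candidate regions, and a weak video-level cross-modal matching objective) can be illustrated with a minimal PyTorch sketch. This is not the authors' actual model: the module names, feature dimensions, attention form, and the cross-entropy losses below are all assumptions made purely for illustration, assuming pre-extracted region features and contextual word embeddings.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeGrounder(nn.Module):
    """Hypothetical sketch of prototype-based weakly-supervised grounding."""
    def __init__(self, text_dim=300, vis_dim=1024, joint_dim=256, num_objects=50):
        super().__init__()
        # One learnable textual prototype per object class, intended to be
        # refined by contextual word embeddings of that object during training.
        self.text_proto = nn.Parameter(torch.randn(num_objects, joint_dim))
        self.text_proj = nn.Linear(text_dim, joint_dim)
        self.vis_proj = nn.Linear(vis_dim, joint_dim)

    def forward(self, word_emb, obj_ids, region_feats):
        # word_emb:     (B, text_dim)    contextual embedding of the queried object word
        # obj_ids:      (B,)             object-class index for each query
        # region_feats: (B, R, vis_dim)  features of R candidate regions per video
        t = F.normalize(self.text_proj(word_emb), dim=-1)        # (B, D)
        v = F.normalize(self.vis_proj(region_feats), dim=-1)     # (B, R, D)

        # Adaptive visual prototype: soft attention over regions driven by the
        # textual query, so uncertain region selection stays soft rather than hard.
        attn = torch.softmax((v @ t.unsqueeze(-1)).squeeze(-1), dim=-1)  # (B, R)
        vis_proto = (attn.unsqueeze(-1) * v).sum(dim=1)                  # (B, D)

        # Uni-modal (textual) association: pull each contextual word embedding
        # toward the prototype of its object class.
        proto = F.normalize(self.text_proto, dim=-1)                     # (K, D)
        loss_text = F.cross_entropy(t @ proto.t(), obj_ids)

        # Cross-modal correspondence: weak, video-level matching between the
        # adaptive visual prototype and the textual prototype of its object.
        loss_cross = F.cross_entropy(vis_proto @ proto.t(), obj_ids)
        return attn, loss_text + loss_cross

In such a sketch, the region attention weights (attn) would serve as the grounding scores at inference time, since no region-level supervision is ever used during training.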