Multi-label image recognition with attentive transformer-localizer module
Abstract
Recently, remarkable progress on multi-label image classification has been achieved by locating semantic-agnostic image regions and extracting their features with deep convolutional neural networks. However, existing pipelines depend on a hypothesis region generation step, which typically incurs extra computational cost, e.g., generating hundreds of meaningless proposals and extracting their features. Moreover, the contextual dependencies among these localized regions are usually ignored or oversimplified during the learning and inference stages. To resolve these issues, we develop a novel attentive transformer-localizer (ATL) module that contains differential transformations (e.g., translation, scale) and can automatically discover discriminative semantic-aware regions in input images for multi-label recognition. This module can be flexibly combined with recurrent neural networks such as the long short-term memory (LSTM) network for memorizing and updating the contextual dependencies of the localized regions. We thus build a unified multi-label image recognition framework. Specifically, the ATL module is applied to progressively localize the attentive regions from the convolutional feature maps in a proposal-free manner, and the LSTM network sequentially predicts label scores for the localized regions and updates the parameters of the ATL module while capturing the global dependencies among these regions. To associate the localized regions with semantic labels over diverse locations and scales, we further design three constraints together with the ATL module. Extensive experiments and evaluations on two large-scale benchmarks (i.e., PASCAL VOC and Microsoft COCO) show that the proposed approach outperforms existing state-of-the-art methods in terms of both recognition performance and efficiency.
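To make the idea of localizing a region through translation and scale concrete, the following is a minimal sketch and not the authors' implementation. It assumes a single-channel feature map, one isotropic scale, normalized coordinates in [-1, 1], and bilinear sampling in the style of a spatial transformer. The names `bilinear` and `localize_region` are hypothetical, and in the framework described above the transformation parameters would be produced by the LSTM rather than passed in by hand.

```python
import math
from typing import List

Grid = List[List[float]]


def bilinear(fmap: Grid, x: float, y: float) -> float:
    """Bilinearly interpolate fmap at pixel coordinates (x, y).

    Positions outside the feature map contribute zero.
    """
    h, w = len(fmap), len(fmap[0])
    x0, y0 = math.floor(x), math.floor(y)
    value = 0.0
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            if 0 <= xi < w and 0 <= yi < h:
                weight = (1 - abs(x - xi)) * (1 - abs(y - yi))
                value += weight * fmap[yi][xi]
    return value


def localize_region(fmap: Grid, scale: float, tx: float, ty: float,
                    out_size: int) -> Grid:
    """Sample an out_size x out_size region from fmap.

    Assumed transform, in normalized [-1, 1] coordinates:
        x_src = scale * x_out + tx
        y_src = scale * y_out + ty
    """
    h, w = len(fmap), len(fmap[0])
    region = []
    for i in range(out_size):
        # Normalized output row coordinate in [-1, 1].
        v = -1 + 2 * i / (out_size - 1) if out_size > 1 else 0.0
        row = []
        for j in range(out_size):
            u = -1 + 2 * j / (out_size - 1) if out_size > 1 else 0.0
            xs, ys = scale * u + tx, scale * v + ty
            # Map normalized source coordinates to pixel coordinates.
            px = (xs + 1) * (w - 1) / 2
            py = (ys + 1) * (h - 1) / 2
            row.append(bilinear(fmap, px, py))
        region.append(row)
    return region


# Toy 5x5 feature map whose value encodes its position: x + 10 * y.
feature_map = [[x + 10.0 * y for x in range(5)] for y in range(5)]
# Half-size window shifted toward the top-right of the map.
top_right = localize_region(feature_map, scale=0.5, tx=0.5, ty=-0.5, out_size=2)
```

Because every step of the sampling is a weighted sum, the output varies smoothly with `scale`, `tx`, and `ty`, which is what allows such parameters to be learned by gradient descent without a separate proposal-generation step.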
Springer