Weakly Supervised Group Mask Network for Object Detection

International Journal of Computer Vision

Abstract

Learning object detectors from weak image annotations is an important yet challenging problem. Many weakly supervised approaches formulate the task as a multiple instance learning (MIL) problem, where each image is represented as a bag of instances. To predict the score of each object class that occurs in an image, existing MIL-based approaches tend to select the instance that responds most strongly to that class, which overlooks contextual information. Moreover, objects often exhibit dramatic variations such as scaling and transformations, which make them hard to detect. In this paper, we propose the weakly supervised group mask network (WSGMN), which has two distinctive properties: (i) it exploits the relations among regions to generate community instances, which contain context information and are robust to object variations; and (ii) it generates a mask for each label group and uses these masks to dynamically select the feature information of the most useful community instances for recognizing specific objects. Extensive experiments on several benchmark datasets demonstrate the effectiveness of WSGMN for weakly supervised object detection.
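As an illustration of the MIL baseline the abstract contrasts against, the following minimal sketch scores an image (a bag) by taking, for each class, the maximum response over its region instances. It is not the WSGMN method; the function names `bag_scores_max` and `selected_instances` and the toy score values are hypothetical and chosen only for this example.

```python
from typing import List


def bag_scores_max(instance_scores: List[List[float]]) -> List[float]:
    """Image-level score per class: the maximum over all region scores.

    instance_scores[i][c] is the (assumed) score of region i for class c.
    """
    if not instance_scores:
        raise ValueError("a bag must contain at least one instance")
    num_classes = len(instance_scores[0])
    return [max(row[c] for row in instance_scores) for c in range(num_classes)]


def selected_instances(instance_scores: List[List[float]]) -> List[int]:
    """Index of the single region that max-selection picks for each class."""
    if not instance_scores:
        raise ValueError("a bag must contain at least one instance")
    num_classes = len(instance_scores[0])
    num_regions = len(instance_scores)
    return [
        max(range(num_regions), key=lambda i: instance_scores[i][c])
        for c in range(num_classes)
    ]


# Hypothetical scores for 3 regions and 2 classes (illustrative values only).
scores = [
    [0.1, 0.7],
    [0.9, 0.2],
    [0.4, 0.6],
]
image_scores = bag_scores_max(scores)   # one score per class
winners = selected_instances(scores)    # one region index per class
```

Because only the single top-scoring region per class contributes to the image-level score, neighbouring regions that carry context are ignored, which is the limitation the abstract attributes to existing MIL-based approaches.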

Notes

  1. In our experiments, we set \(Z = 10\) for the PASCAL VOC datasets, \(Z = 24\) for MS-COCO, and \(Z = 26\) for the ImageNet detection dataset.

Author information

Corresponding author

Correspondence to Lingyun Song.

Additional information

Communicated by Antonio Torralba.

The research was supported in part by the National Key Research and Development Program of China under Grant No. 2018YFB1004500, the National Natural Science Foundation of China under Grant Nos. 61772426, 61672419, 61672418, 61532004, 61502377, 61532015, and 61721002, the Joint Funds of the National Natural Science Foundation of China under Grant No. U1811262, the Innovation Research Team of Ministry of Education under Grant No. IRT_17R86, the Fundamental Research Funds for the Central Universities under Grant No. D5000200146, and the China Postdoctoral Science Foundation under Grant No. 2020M673487.

About this article

Cite this article

Song, L., Liu, J., Sun, M. et al. Weakly Supervised Group Mask Network for Object Detection. Int J Comput Vis 129, 681–702 (2021). https://doi.org/10.1007/s11263-020-01397-w

  • DOI: https://doi.org/10.1007/s11263-020-01397-w
