Weakly supervised detection with decoupled attention-based deep representation

Multimedia Tools and Applications

Abstract

Training object detectors with only image-level annotations is an important problem with a variety of applications. However, because objects are deformable, a target object delineated by a bounding box typically includes irrelevant context and occlusions, which cause large intra-class variation and ambiguity between object and background. Identifying the object of interest against a large amount of cluttered background is therefore very challenging. In this paper, we propose a decoupled attention-based deep model to optimize region-based object representation. Unlike existing approaches that build the object representation in a single-tower model, our network decouples it into two separate modules: image representation and attention localization. The image representation module captures a content-based semantic representation, while the attention localization module regresses an attention map that simultaneously highlights the locations of discriminative object parts and down-weights the irrelevant background present in the image. The combined representation alleviates the impact of noisy context and occlusions inside an object bounding box. As a result, object-background ambiguity is largely reduced and background regions are suppressed effectively. In addition, the proposed object representation model can be seamlessly integrated into a state-of-the-art weakly supervised detection framework, and the entire model can be trained end-to-end. We extensively evaluate detection performance on the PASCAL VOC 2007, VOC 2010 and VOC 2012 datasets. Experimental results demonstrate that our approach effectively improves weakly supervised object detection.
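
The abstract does not specify how the two modules are combined, so the following is only a minimal sketch of one plausible reading: a spatial attention map re-weights the feature vectors inside a candidate box before they are pooled into a region descriptor. The softmax normalization, the weighted-average pooling, and the names `softmax_attention` and `attended_region_feature` are all assumptions made for illustration, not the authors' implementation.

```python
from math import exp


def softmax_attention(scores):
    """Turn an H x W grid of raw scores into weights that sum to 1.

    Assumption: the attention map is normalized with a spatial softmax.
    """
    peak = max(s for row in scores for s in row)
    # Subtract the peak before exponentiating for numerical stability.
    exps = [[exp(s - peak) for s in row] for row in scores]
    total = sum(e for row in exps for e in row)
    return [[e / total for e in row] for row in exps]


def attended_region_feature(features, attention, box):
    """Attention-weighted average of the feature vectors inside a box.

    features:  H x W grid of C-dimensional vectors (lists of floats),
               standing in for the image representation module's output.
    attention: H x W grid of non-negative weights, standing in for the
               attention localization module's output.
    box:       (top, left, bottom, right), with bottom/right exclusive.
    """
    top, left, bottom, right = box
    channels = len(features[0][0])
    pooled = [0.0] * channels
    mass = 0.0
    for y in range(top, bottom):
        for x in range(left, right):
            weight = attention[y][x]
            mass += weight
            for c in range(channels):
                pooled[c] += weight * features[y][x][c]
    if mass == 0.0:
        return pooled  # no attention inside the box: all-zero descriptor
    return [value / mass for value in pooled]


# Toy 2 x 2 feature map: one "object" cell and three "background" cells.
features = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [0.0, 1.0]],
]
box = (0, 0, 2, 2)  # the box covers the object and its surrounding context

uniform = [[0.25, 0.25], [0.25, 0.25]]
focused = softmax_attention([[4.0, 0.0], [0.0, 0.0]])

plain = attended_region_feature(features, uniform, box)
attended = attended_region_feature(features, focused, box)
```

With uniform weights the descriptor is dominated by the three background cells, whereas the focused map shifts it toward the object cell. This is the effect the abstract attributes to down-weighting irrelevant background inside a bounding box.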

Acknowledgements

This work is supported by the National Natural Science Foundation of China under Grants 61471049, 61372169 and 61532018.

Author information

Corresponding author

Correspondence to Wenhui Jiang.

Cite this article

Jiang, W., Zhao, Z. & Su, F. Weakly supervised detection with decoupled attention-based deep representation. Multimed Tools Appl 77, 3261–3277 (2018). https://doi.org/10.1007/s11042-017-5087-x

  • DOI: https://doi.org/10.1007/s11042-017-5087-x
