End-to-End Learning of Latent Deformable Part-Based Representations for Object Detection

Abstract

Object detection methods usually represent objects through rectangular bounding boxes from which they extract features, regardless of the objects' actual shapes. In this paper, we apply deformations to regions in order to learn representations better fitted to objects. We introduce DP-FCN, a deep model implementing this idea by learning to align parts to discriminative elements of objects in a latent way, i.e., without part annotation. This approach has two main assets: it builds invariance to local transformations, thus improving recognition, and it brings geometric information to describe objects more finely, leading to more accurate localization. We further develop both features in a new model named DP-FCN2.0 by explicitly learning interactions between parts. Alignment is performed by an in-network joint optimization of all parts, based on a CRF with custom potentials, and deformations influence localization through a bilinear product. We validate our models on the PASCAL VOC and MS COCO datasets and show significant gains. DP-FCN2.0 achieves state-of-the-art results of 83.3% and 81.2% on VOC 2007 and 2012, respectively, using VOC data only.
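As a rough illustration of what aligning a part in a latent way can mean, the sketch below moves a single part away from its default position to maximize a score penalized by a deformation cost. It is only a minimal sketch under stated assumptions: the quadratic penalty follows the classic deformable part model formulation rather than anything stated in the abstract, and the names `align_part`, `score_map`, `max_shift` and `lam` are hypothetical. It does not reproduce the DP-FCN pooling layer or the CRF-based joint optimization of DP-FCN2.0.

```python
from typing import List, Tuple


def align_part(
    score_map: List[List[float]],
    anchor: Tuple[int, int],
    max_shift: int = 2,
    lam: float = 0.1,
) -> Tuple[Tuple[int, int], float]:
    """Pick the displacement of one part that maximizes score minus deformation cost.

    Assumption: the cost is quadratic, lam * (dx^2 + dy^2), as in classic
    deformable part models. The anchor must lie inside the score map.
    """
    height, width = len(score_map), len(score_map[0])
    anchor_y, anchor_x = anchor
    best_value = float("-inf")
    best_shift = (0, 0)
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            y, x = anchor_y + dy, anchor_x + dx
            # Skip displacements that fall outside the score map.
            if not (0 <= y < height and 0 <= x < width):
                continue
            value = score_map[y][x] - lam * (dx * dx + dy * dy)
            if value > best_value:
                best_value = value
                best_shift = (dy, dx)
    return best_shift, best_value


# Toy score map: the strongest response sits one cell to the right of the anchor.
score_map = [
    [0.0, 0.1, 0.0, 0.0],
    [0.1, 0.2, 0.9, 0.0],
    [0.0, 0.1, 0.2, 0.0],
]
shift, score = align_part(score_map, anchor=(1, 1), max_shift=1, lam=0.1)
print(shift, round(score, 3))
```

No part annotation is needed here: the displacement is inferred from the scores alone, which is the sense in which the part position is latent. The selected displacement is also a geometric cue that a localization branch could consume.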




Author information


Correspondence to Taylor Mordan.

Additional information


Communicated by Tae-Kyun Kim, Stefanos Zafeiriou, Ben Glocker and Stefan Leutenegger.


Cite this article

Mordan, T., Thome, N., Henaff, G. et al. End-to-End Learning of Latent Deformable Part-Based Representations for Object Detection. Int J Comput Vis 127, 1659–1679 (2019). https://doi.org/10.1007/s11263-018-1109-z


Keywords

  • Object detection
  • Fully convolutional network
  • Deep learning
  • Part-based representation
  • End-to-end latent part learning