End-to-End Learning of Latent Deformable Part-Based Representations for Object Detection

  • Taylor Mordan
  • Nicolas Thome
  • Gilles Henaff
  • Matthieu Cord

Abstract

Object detection methods usually represent objects through rectangular bounding boxes from which they extract features, regardless of the objects' actual shapes. In this paper, we apply deformations to regions in order to learn representations that better fit objects. We introduce DP-FCN, a deep model implementing this idea by learning to align parts to discriminative elements of objects in a latent way, i.e., without part annotation. This approach has two main assets: it builds invariance to local transformations, thus improving recognition, and it brings geometric information to describe objects more finely, leading to more accurate localization. We further develop both features in a new model named DP-FCN2.0 by explicitly learning interactions between parts. Alignment is done with an in-network joint optimization of all parts based on a CRF with custom potentials, and deformations influence localization through a bilinear product. We validate our models on the PASCAL VOC and MS COCO datasets and show significant gains. DP-FCN2.0 achieves state-of-the-art results of 83.3% and 81.2% on VOC 2007 and 2012, respectively, with VOC data only.
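As a rough illustration of what aligning a part "in a latent way" can mean, the sketch below lets a single part move around its default cell and keeps the displacement that maximizes its response minus a quadratic deformation cost, in the spirit of classical deformable part models. The abstract does not specify the actual formulation, so the function name `align_part`, the parameters `max_shift` and `penalty`, and the quadratic cost are all assumptions made for this example; the joint CRF optimization of DP-FCN2.0 is not modeled here.

```python
def align_part(score_map, anchor, max_shift=2, penalty=0.1):
    """Pick the displacement of one part that maximizes a penalized score.

    score_map: 2D list of floats, the part's response at each grid cell.
    anchor:    (row, col) default position of the part.
    max_shift: largest displacement tried along each axis (assumed).
    penalty:   weight of the quadratic deformation cost (assumed).

    Returns (best_value, (dy, dx)). No part annotation is used: the
    displacement is chosen only from the response map itself.
    """
    rows, cols = len(score_map), len(score_map[0])
    ay, ax = anchor
    best_value, best_shift = float("-inf"), (0, 0)
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            y, x = ay + dy, ax + dx
            if not (0 <= y < rows and 0 <= x < cols):
                continue  # displaced position falls outside the map
            # Response at the displaced cell minus the cost of moving there.
            value = score_map[y][x] - penalty * (dy * dy + dx * dx)
            if value > best_value:
                best_value, best_shift = value, (dy, dx)
    return best_value, best_shift


# Toy response map: weak response at the anchor, strong one nearby.
toy_map = [
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.2, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]
value, shift = align_part(toy_map, anchor=(1, 1))
print(value, shift)
```

With a small penalty the part moves toward the stronger response; with a large penalty it stays at its anchor. The returned displacement is the kind of geometric information that, according to the abstract, can also inform localization.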

Keywords

  • Object detection
  • Fully convolutional network
  • Deep learning
  • Part-based representation
  • End-to-end latent part learning

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. CNRS, Laboratoire d’Informatique de Paris 6 (LIP6), Sorbonne Université, Paris, France
  2. Thales Land and Air Systems, Élancourt, France
  3. CEDRIC, Conservatoire National des Arts et Métiers, Paris, France