A Unified Multi-scale Deep Convolutional Neural Network for Fast Object Detection

  • Zhaowei Cai
  • Quanfu Fan
  • Rogerio S. Feris
  • Nuno Vasconcelos
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9908)


A unified deep neural network, denoted the multi-scale CNN (MS-CNN), is proposed for fast multi-scale object detection. The MS-CNN consists of a proposal sub-network and a detection sub-network. In the proposal sub-network, detection is performed at multiple output layers, so that receptive fields match objects of different scales. These complementary scale-specific detectors are combined to produce a strong multi-scale object detector. The unified network is learned end-to-end by optimizing a multi-task loss. Feature upsampling by deconvolution is also explored, as an alternative to input upsampling, to reduce memory and computation costs. State-of-the-art object detection performance, at up to 15 fps, is reported on datasets such as KITTI and Caltech, which contain a substantial number of small objects.
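The scale-matching idea in the abstract — that detectors placed at different output layers see receptive fields suited to objects of different sizes — can be illustrated with a small helper that tracks how the receptive field grows through a stack of convolution/pooling layers. The layer stacks below are hypothetical examples for illustration, not the exact MS-CNN backbone.

```python
def receptive_field(layers):
    """Compute the receptive field size and output stride ("jump")
    of a stack of conv/pool layers.

    layers: list of (kernel_size, stride) tuples, applied in order.
    Returns (receptive_field, jump) at the final output.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field by (k-1) * current jump
        jump *= stride             # strides compound multiplicatively
    return rf, jump

# A shallow stack (small receptive field, suited to small objects) vs.
# a deeper stack (larger receptive field, suited to large objects):
shallow = [(3, 1), (3, 1), (2, 2)]
deep = shallow + [(3, 1), (3, 1), (2, 2)]
print(receptive_field(shallow))  # (6, 2)
print(receptive_field(deep))     # (16, 4)
```

This is why attaching detectors only to the last layer handicaps small objects: their extent is far smaller than the final receptive field, whereas an earlier output layer offers a better match.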


Object detection · Multi-scale · Unified neural network



This work was partially funded by NSF grant IIS1208522 and a gift from KETI. We also thank NVIDIA for GPU donations through their academic program.



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Zhaowei Cai (1)
  • Quanfu Fan (2)
  • Rogerio S. Feris (2)
  • Nuno Vasconcelos (1)
  1. SVCL, UC San Diego, San Diego, USA
  2. IBM T. J. Watson Research, Yorktown Heights, USA
