Contextual Priming and Feedback for Faster R-CNN

  • Abhinav Shrivastava
  • Abhinav Gupta
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9905)


The field of object detection has seen dramatic performance improvements in the last few years. Most of these gains are attributed to bottom-up, feedforward ConvNet frameworks. However, in the case of humans, top-down information, context and feedback play an important role in object detection. This paper investigates how top-down information and feedback can be incorporated into the state-of-the-art Faster R-CNN framework. Specifically, we propose to: (a) augment Faster R-CNN with a semantic segmentation network; (b) use segmentation for top-down contextual priming; and (c) use segmentation to provide top-down iterative feedback using two-stage training. Our results indicate that all three contributions improve performance on object detection, semantic segmentation and region proposal generation.
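The abstract names the components but not their wiring. The following is a minimal, dependency-free sketch of one plausible reading of (b) and (c), not the authors' implementation. It assumes that "contextual priming" means stacking per-class segmentation score maps onto a feature map along the channel axis, and that "iterative feedback" means re-running the network with the previous pass's segmentation as extra input. The functions `prime_with_segmentation`, `iterative_feedback`, `toy_backbone` and `toy_segment` are hypothetical, feature maps are plain nested lists (channels × H × W), and the two-stage training procedure is not modeled.

```python
from typing import Callable, List, Optional

# A feature map is a list of channels; each channel is an H x W grid of floats.
FeatureMap = List[List[List[float]]]


def prime_with_segmentation(features: FeatureMap, seg_scores: FeatureMap) -> FeatureMap:
    """Hypothetical priming: append segmentation score maps as extra channels.

    Assumes the segmentation maps have the same spatial size as the features.
    """
    if features:
        h, w = len(features[0]), len(features[0][0])
        for ch in seg_scores:
            if len(ch) != h or any(len(row) != w for row in ch):
                raise ValueError("segmentation and features differ in spatial size")
    # Copy rows so the result does not alias the inputs.
    return [[row[:] for row in ch] for ch in features + seg_scores]


def iterative_feedback(
    image: FeatureMap,
    backbone: Callable[[FeatureMap], FeatureMap],
    segment: Callable[[FeatureMap], FeatureMap],
    num_passes: int = 2,
) -> FeatureMap:
    """Hypothetical feedback loop: the first pass sees only the input; every
    later pass sees the input primed with the previous pass's segmentation."""
    if num_passes < 1:
        raise ValueError("num_passes must be at least 1")
    seg: Optional[FeatureMap] = None
    feats: FeatureMap = []
    for _ in range(num_passes):
        inp = image if seg is None else prime_with_segmentation(image, seg)
        feats = backbone(inp)
        seg = segment(feats)
    return feats  # features from the final pass


# Toy stand-ins (hypothetical): the backbone averages channels, the segmenter thresholds.
def toy_backbone(x: FeatureMap) -> FeatureMap:
    h, w = len(x[0]), len(x[0][0])
    return [[[sum(ch[i][j] for ch in x) / len(x) for j in range(w)] for i in range(h)]]


def toy_segment(f: FeatureMap) -> FeatureMap:
    return [[[1.0 if v > 0.5 else 0.0 for v in row] for row in f[0]]]


image = [[[0.2, 0.8], [0.6, 0.4]]]
final = iterative_feedback(image, toy_backbone, toy_segment, num_passes=2)
print(final)  # second pass averages the image with the first-pass mask
```

In this toy run the first pass thresholds the image into the mask `[[0, 1], [1, 0]]`, and the second pass averages that mask with the image, pushing confident pixels further from the threshold. In a real detector the channel count of the primed input differs between passes, so the layer consuming it would need to accept the extra channels.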





We thank Ross Girshick, Ishan Misra and Sean Bell for helpful discussions. AS was supported by the Microsoft Research PhD Fellowship. This work was also partially supported by ONR MURI N000141612007. We thank NVIDIA for donating GPUs.



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  1. Carnegie Mellon University, Pittsburgh, USA
