
Recursive Context Routing for Object Detection


Recent studies have confirmed that modeling context is important for object detection. However, current context modeling approaches have limited expressive capacity and limited dynamics for encoding contextual relationships, which reduces their effectiveness. In this paper, we recast the current context modeling framework to perform more dynamic context modeling for object detection. In particular, we devise a novel Recursive Context Routing (ReCoR) mechanism that encodes contextual relationships and models contexts more effectively. ReCoR progressively models more contexts through a recursive structure, providing a more feasible and more comprehensive way to exploit complicated contexts and contextual relationships. At each recursive stage, we further decompose the modeling of contexts and contextual relationships into a spatial modeling process and a channel-wise modeling process, which avoids the need to exhaustively model all potential pair-wise contextual relationships in a single pass while introducing more dynamics. The spatial modeling process focuses on spatial contexts and gradually involves more of them as the recursion proceeds. In the channel-wise modeling process, we introduce a context routing algorithm that models channel-wise contextual relationships dynamically and more effectively. We comprehensively evaluate the proposed ReCoR on the popular MS COCO and PASCAL VOC datasets. Its effectiveness is validated on both datasets by the consistent performance gains obtained when applying our method to different baseline object detectors. For example, on the MS COCO dataset, our approach delivers around 10% relative improvement for a Mask RCNN detector on the bounding box task and 7% relative improvement on the instance segmentation task, surpassing existing context modeling approaches by a large margin. State-of-the-art detection performance can also be achieved by applying ReCoR to the Cascade Mask RCNN detector, illustrating the benefits of our method for improving context modeling and object detection.
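To make the notion of "relative improvement" used above concrete, the following minimal sketch shows the arithmetic. The function name `relative_improvement` and the AP values are hypothetical placeholders chosen for illustration; they are not results reported in the paper.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the gain of `improved` over `baseline` as a percentage of `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline * 100.0


# Hypothetical AP values (NOT taken from the paper), used only to show the arithmetic.
baseline_ap = 40.0
improved_ap = 44.0

# Absolute gain is measured in AP points; relative gain is a percentage of the baseline.
absolute_gain = improved_ap - baseline_ap
relative_gain = relative_improvement(baseline_ap, improved_ap)

print(f"absolute gain: {absolute_gain:.1f} AP points")
print(f"relative gain: {relative_gain:.1f}%")
```

Under this definition, a relative gain of about 10% corresponds to fewer than 10 absolute AP points whenever the baseline AP is below 100, so the two kinds of figure should not be compared directly.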






  3. FLOPs: floating point operations.

  4. GMAC: giga multiply-accumulate operations per second.
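As a rough illustration of how these two quantities relate, the sketch below counts multiply-accumulate operations for one forward pass of a single 2D convolution layer and converts the count to FLOPs. The helper names and the layer shape are hypothetical, bias terms are ignored, and the conversion assumes the common convention that one multiply-accumulate equals two floating point operations; the paper may count differently.

```python
def conv2d_macs(c_in: int, c_out: int, kernel_h: int, kernel_w: int,
                out_h: int, out_w: int, groups: int = 1) -> int:
    """Multiply-accumulates for one forward pass of a 2D convolution (bias ignored)."""
    # Each output element needs (c_in / groups) * kernel_h * kernel_w multiply-accumulates.
    macs_per_output = (c_in // groups) * kernel_h * kernel_w
    return out_h * out_w * c_out * macs_per_output


def macs_to_flops(macs: int) -> int:
    """Convert MACs to FLOPs, assuming 1 MAC = 1 multiply + 1 add = 2 FLOPs."""
    return 2 * macs


# Hypothetical layer: 3x3 convolution, 256 -> 256 channels, 100x100 output map.
macs = conv2d_macs(c_in=256, c_out=256, kernel_h=3, kernel_w=3, out_h=100, out_w=100)
print(f"{macs / 1e9:.3f} giga multiply-accumulates")
print(f"{macs_to_flops(macs) / 1e9:.3f} giga floating point operations")
```

The count above is a per-forward-pass total for one layer; dividing such a total by the measured runtime would give a rate.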



Author information


Corresponding authors

Correspondence to Jing Zhang or Dacheng Tao.

Additional information

Communicated by Vittorio Ferrari.


This work was supported by Australian Research Council Projects FL-170100117, DP-180103424, IH-180100002, IC-190100031, LE-200100049.


About this article


Cite this article

Chen, Z., Zhang, J. & Tao, D. Recursive Context Routing for Object Detection. Int J Comput Vis 129, 142–160 (2021).


Keywords

  • Object detection
  • Context modeling
  • Computer vision
  • Deep learning