Recursive Context Routing for Object Detection

Chen, Zhe; Zhang, Jing; Tao, Dacheng

doi:10.1007/s11263-020-01370-7

Published: 19 August 2020

Volume 129, pages 142–160, (2021)
International Journal of Computer Vision

Abstract

Recent studies have confirmed that modeling contexts is important for object detection. However, current context modeling approaches still have limited expressive capacity and dynamics for encoding contextual relationships and modeling contexts, which reduces their effectiveness. In this paper, we recast the current context modeling framework to perform more dynamic context modeling for object detection. In particular, we devise a novel Recursive Context Routing (ReCoR) mechanism to encode contextual relationships and model contexts more effectively. ReCoR progressively models more contexts through a recursive structure, providing a more feasible and more comprehensive way to exploit complicated contexts and contextual relationships. At each recursive stage, we further decompose the modeling of contexts and contextual relationships into a spatial modeling process and a channel-wise modeling process, avoiding the need to exhaustively model all potential pair-wise contextual relationships with more dynamics in a single pass. The spatial modeling process focuses on spatial contexts and gradually involves more of them as the recursion proceeds. In the channel-wise modeling process, we introduce a context routing algorithm that dynamically models channel-wise contextual relationships more effectively. We perform a comprehensive evaluation of the proposed ReCoR on the popular MS COCO and PASCAL VOC datasets. Its effectiveness is validated on both datasets by the consistent performance gains obtained when applying our method to different baseline object detectors. For example, on the MS COCO dataset, our approach delivers around 10% relative improvement for a Mask RCNN detector on the bounding box task and 7% relative improvement on the instance segmentation task, surpassing existing context modeling approaches by a large margin. State-of-the-art detection performance can also be achieved by applying ReCoR to the Cascade Mask RCNN detector, illustrating the benefits of our method for improving context modeling and object detection.
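
The gains quoted above are relative rather than absolute, so a brief sketch of the arithmetic may help when reading them. The function name `relative_improvement` and the AP values below are hypothetical, chosen only for illustration; they are not figures reported by the paper.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the gain of `improved` over `baseline` as a percentage of `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (improved - baseline) / baseline


# Hypothetical AP values for illustration only (not taken from the paper).
baseline_ap = 40.0
improved_ap = 44.0

absolute_gain = improved_ap - baseline_ap
relative_gain = relative_improvement(baseline_ap, improved_ap)

print(f"absolute gain: {absolute_gain:.1f} AP points")
print(f"relative gain: {relative_gain:.1f}%")
```

Under these assumed numbers, a 4-point absolute gain corresponds to a 10% relative gain; the same relative figure would mean a smaller absolute gain on a weaker baseline.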


Notes

http://cocodataset.org/.
https://github.com/open-mmlab/mmdetection.
FLOPs: floating point operations.
GMAC: giga multiply-accumulate operations per second.


Author information

Authors and Affiliations

School of Computer Science, Faculty of Engineering, The University of Sydney, 6 Cleveland St, Darlington, NSW, 2008, Australia
Zhe Chen, Jing Zhang & Dacheng Tao

Corresponding authors

Correspondence to Jing Zhang or Dacheng Tao.

Additional information

Communicated by Vittorio Ferrari.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work was supported by Australian Research Council Projects FL-170100117, DP-180103424, IH-180100002, IC-190100031, LE-200100049.

About this article

Cite this article

Chen, Z., Zhang, J. & Tao, D. Recursive Context Routing for Object Detection. Int J Comput Vis 129, 142–160 (2021). https://doi.org/10.1007/s11263-020-01370-7

Received: 12 January 2020
Accepted: 10 August 2020
Published: 19 August 2020
Issue Date: January 2021
DOI: https://doi.org/10.1007/s11263-020-01370-7
