Sketch-Guided Object Localization in Natural Images

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12351)


We introduce a novel problem of localizing all the instances of an object (seen or unseen during training) in a natural image via sketch query. We refer to this problem as sketch-guided object localization. This problem is distinctively different from the traditional sketch-based image retrieval task where the gallery set often contains images with only one object. The sketch-guided object localization proves to be more challenging when we consider the following: (i) the sketches used as queries are abstract representations with little information on the shape and salient attributes of the object, (ii) the sketches have significant variability as they are hand-drawn by a diverse set of untrained human subjects, and (iii) there exists a domain gap between sketch queries and target natural images as these are sampled from very different data distributions. To address the problem of sketch-guided object localization, we propose a novel cross-modal attention scheme that guides the region proposal network (RPN) to generate object proposals relevant to the sketch query. These object proposals are later scored against the query to obtain final localization. Our method is effective with as little as a single sketch query. Moreover, it also generalizes well to object categories not seen during training and is effective in localizing multiple object instances present in the image. Furthermore, we extend our framework to a multi-query setting using novel feature fusion and attention fusion strategies introduced in this paper. The localization performance is evaluated on publicly available object detection benchmarks, viz. MS-COCO and PASCAL-VOC, with sketch queries obtained from ‘Quick, Draw!’. The proposed method significantly outperforms related baselines on both single-query and multi-query localization tasks.


Sketch Cross-modal retrieval Object localization One-shot learning Attention 



The authors would like to thank the Advanced Data Management Research Group, Corporate Technologies, Siemens Technology and Services Pvt. Ltd., and Pratiksha Trust, Bengaluru, India for partly supporting this research.

Supplementary material

504443_1_En_32_MOESM1_ESM.pdf (752 kb)
Supplementary material 1 (pdf 751 KB)


  1. 1.
    Cai, Z., Vasconcelos, N.: Cascade R-CNN: delving into high quality object detection. In: CVPR (2018)Google Scholar
  2. 2.
    Caicedo, J.C., Lazebnik, S.: Active object localization with deep reinforcement learning. In: ICCV (2015)Google Scholar
  3. 3.
    Choe, J., Shim, H.: Attention-based dropout layer for weakly supervised object localization. In: CVPR (2019)Google Scholar
  4. 4.
    Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: a large-scale hierarchical image database. In: CVPR (2009)Google Scholar
  5. 5.
    Eitz, M., Hildebrand, K., Boubekeur, T., Alexa, M.: Sketch-based image retrieval: benchmark and bag-of-features descriptors. IEEE Trans. Vis. Comput. Graph. 17(11), 1624–1636 (2010)CrossRefGoogle Scholar
  6. 6.
    Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The pascal Visual Object Classes (VOC) challenge. Int. J. Comput. Vis. 88(2), 303–338 (2010). Scholar
  7. 7.
    Girshick, R.: Fast R-CNN. In: ICCV (2015)Google Scholar
  8. 8.
    Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: CVPR (2014)Google Scholar
  9. 9.
    Ha, D., Eck, D.: A neural representation of sketch drawings. arXiv preprint arXiv:1704.03477 (2017)
  10. 10.
    He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: ICCV (2017)Google Scholar
  11. 11.
    He, K., Zhang, X., Ren, S., Sun, J.: Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 37(9), 1904–1916 (2015)CrossRefGoogle Scholar
  12. 12.
    Hsieh, T.I., Lo, Y.C., Chen, H.T., Liu, T.L.: One-shot object detection with co-attention and co-excitation. In: NeurIPS (2019)Google Scholar
  13. 13.
    Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: CVPR (2018)Google Scholar
  14. 14.
    Hu, R., Collomosse, J.: A performance evaluation of gradient field hog descriptor for sketch based image retrieval. Comput. Vis. Image Underst. 117(7), 790–806 (2013)CrossRefGoogle Scholar
  15. 15.
    Jongejan, J., Rowley, H., Kawashima, T., Kim, J., Fox-Gieg, N.: The quick, draw!-ai experiment (2016).
  16. 16.
    Kong, T., Sun, F., Liu, H., Jiang, Y., Shi, J.: Foveabox: beyond anchor-based object detector (2019). arXiv preprint arXiv:1904.03797
  17. 17.
    Li, B., Yan, J., Wu, W., Zhu, Z., Hu, X.: High performance visual tracking with siamese region proposal network. In: CVPR (2018)Google Scholar
  18. 18.
    Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: ICCV (2017)Google Scholar
  19. 19.
    Lin, T.Y., et al.: Microsoft COCO: common objects in context. In: ECCV (2014)Google Scholar
  20. 20.
    Liu, L., Shen, F., Shen, Y., Liu, X., Shao, L.: Deep sketch hashing: fast free-hand sketch-based image retrieval. In: CVPR (2017)Google Scholar
  21. 21.
    Liu, W., et al.: SSD: single shot multibox detector. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 21–37. Springer, Cham (2016). Scholar
  22. 22.
    Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: CVPR (2016)Google Scholar
  23. 23.
    Redmon, J., Farhadi, A.: Yolov3: an incremental improvement (2018). arXiv preprint arXiv:1804.02767
  24. 24.
    Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: NeurIPS (2015)Google Scholar
  25. 25.
    Sangkloy, P., Burnell, N., Ham, C., Hays, J.: The sketchy database: learning to retrieve badly drawn bunnies. ACM Trans. Graph. (TOG) 35(4), 1–12 (2016)CrossRefGoogle Scholar
  26. 26.
    Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., LeCun, Y.: Overfeat: integrated recognition, localization and detection using convolutional networks (2013). arXiv preprint arXiv:1312.6229
  27. 27.
    Sharif Razavian, A., Azizpour, H., Sullivan, J., Carlsson, S.: CNN features off-the-shelf: an astounding baseline for recognition. In: CVPRW (2014)Google Scholar
  28. 28.
    Shen, Y., Liu, L., Shen, F., Shao, L.: Zero-shot sketch-image hashing. In: CVPR (2018)Google Scholar
  29. 29.
    Sivic, J., Zisserman, A.: Video Google: a text retrieval approach to object matching in videos. In: ICCV (2003)Google Scholar
  30. 30.
    Song, J., Yu, Q., Song, Y.Z., Xiang, T., Hospedales, T.M.: Deep spatial-semantic attention for fine-grained sketch-based image retrieval. In: ICCV (2017)Google Scholar
  31. 31.
    Teh, E.W., Rochan, M., Wang, Y.: Attention networks for weakly supervised object localization. In: BMVC (2016)Google Scholar
  32. 32.
    Uijlings, J.R., Van De Sande, K.E., Gevers, T., Smeulders, A.W.: Selective search for object recognition. Int. J. Comput. Vis. 104(2), 154–171 (2013). Scholar
  33. 33.
    Wan, J., et al.: Deep learning for content-based image retrieval: a comprehensive study. In: ACM MM (2014)Google Scholar
  34. 34.
    Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. In: CVPR (2018)Google Scholar
  35. 35.
    Xu, P., et al.: Sketchmate: deep hashing for million-scale human sketch retrieval. In: CVPR (2018)Google Scholar
  36. 36.
    Xu, P., Joshi, C.K., Bresson, X.: Multi-graph transformer for free-hand sketch recognition (2019). arXiv preprint arXiv:1912.11258
  37. 37.
    Yan, X., Chen, Z., Xu, A., Wang, X., Liang, X., Lin, L.: Meta R-CNN: towards general solver for instance-level low-shot learning. In: ICCV (2019)Google Scholar
  38. 38.
    Yu, Q., Liu, F., Song, Y.Z., Xiang, T., Hospedales, T.M., Loy, C.C.: Sketch me that shoe. In: CVPR (2016)Google Scholar
  39. 39.
    Yu, Q., Yang, Y., Liu, F., Song, Y.Z., Xiang, T., Hospedales, T.M.: Sketch-a-net: a deep neural network that beats humans. Int. J. Comput. Vis. 122(3), 411–425 (2017). Scholar
  40. 40.
    Zhang, J., et al.: Generative domain-migration hashing for sketch-to-image retrieval. In: ECCV (2018)Google Scholar

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. 1.Indian Institute of ScienceBengaluruIndia
  2. 2.Indian Institute of Technology, JodhpurJodhpurIndia

Personalised recommendations