We propose a deep learning approach to free-hand sketch recognition that achieves state-of-the-art performance, significantly surpassing that of humans. This performance results from modelling and exploiting the unique characteristics of free-hand sketches: they consist of an ordered set of strokes but lack visual cues such as colour and texture, are highly iconic and abstract, and exhibit extremely large appearance variations due to different levels of abstraction and deformation. Specifically, our deep neural network, termed Sketch-a-Net, has the following novel components: (i) We propose a network architecture designed for sketch rather than natural photo statistics. (ii) We develop two novel data augmentation strategies that exploit the unique sketch-domain properties to modify and synthesise sketch training data at multiple abstraction levels. With these strategies we are able both to significantly increase the volume and diversity of sketches for training and to address the challenge of varying levels of sketching detail commonplace in free-hand sketches. (iii) We explore different network ensemble fusion strategies, including a re-purposed joint Bayesian scheme, to further improve recognition performance. We show that state-of-the-art deep networks specifically engineered for photos of natural objects fail to perform well on sketch recognition, regardless of whether they are trained using photos or sketches. Furthermore, by visualising the learned filters, we offer useful insights into where the superior performance of our network comes from.
We set \(k=30\) in this work, and the regularisation parameter of joint Bayesian (JB) is set to 1. For robustness at test time, we also take 10 crops and reflections of each train and test image (Krizhevsky et al. 2012). This inflates the KNN train and test pool by a factor of 10, and the crop-level matches are combined into image-level predictions by majority voting.
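The crop-and-vote procedure in this note can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the authors' implementation: the 10 views are taken to be the four corner crops and the centre crop plus their horizontal reflections, Euclidean distance stands in for the joint Bayesian metric, the `embed` argument is a hypothetical placeholder for the network's feature extractor, and each crop is assumed to cast one KNN vote before the image-level majority vote. All function names are illustrative.

```python
from collections import Counter

import numpy as np


def ten_crops(image, crop):
    """Four corner crops and the centre crop, plus their horizontal reflections."""
    h, w = image.shape[:2]
    tops = [0, 0, h - crop, h - crop, (h - crop) // 2]
    lefts = [0, w - crop, 0, w - crop, (w - crop) // 2]
    crops = [image[t:t + crop, l:l + crop] for t, l in zip(tops, lefts)]
    crops += [c[:, ::-1] for c in crops]  # reflect along the width axis
    return np.stack(crops)


def flatten_embed(crop_img):
    """Placeholder feature extractor (assumption): raw pixels as a vector."""
    return crop_img.reshape(-1).astype(float)


def build_pool(images, labels, crop, embed=flatten_embed):
    """Expand every training image into 10 crop-level entries of the KNN pool."""
    feats, labs = [], []
    for img, lab in zip(images, labels):
        for c in ten_crops(img, crop):
            feats.append(embed(c))
            labs.append(lab)
    return np.array(feats), np.array(labs)


def knn_label(query, pool_feats, pool_labels, k=30):
    """Majority label among the k nearest pool entries (Euclidean stand-in)."""
    dists = np.linalg.norm(pool_feats - query, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(pool_labels[nearest].tolist()).most_common(1)[0][0]


def predict_image(image, pool_feats, pool_labels, crop, k=30, embed=flatten_embed):
    """Classify each of the 10 crops by KNN, then majority-vote per image."""
    votes = [knn_label(embed(c), pool_feats, pool_labels, k)
             for c in ten_crops(image, crop)]
    return Counter(votes).most_common(1)[0][0]
```

With real features, `embed` would be replaced by the network's output for each crop, and the distance in `knn_label` by the learned joint Bayesian similarity.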
Chatfield, K., Simonyan, K., Vedaldi, A., & Zisserman, A. (2014). Return of the devil in the details: Delving deep into convolutional nets. In BMVC.
Chen, D., Cao, X., Wang, L., Wen, F., & Sun, J. (2012). Bayesian face revisited: A joint formulation. In ECCV.
Deng, J., Dong, W., Socher, R., Li, L., Li, K., & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In CVPR.
Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., & Darrell, T. (2015). Decaf: A deep convolutional activation feature for generic visual recognition. In ICML.
Eitz, M., Hays, J., & Alexa, M. (2012). How do humans sketch objects? In SIGGRAPH.
Eitz, M., Hildebrand, K., Boubekeur, T., & Alexa, M. (2011). Sketch-based image retrieval: Benchmark and bag-of-features descriptors. TVCG, 17(11), 1624–1636.
Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4), 193–202.
Gabor, D. (1946). Theory of communication. Part 1: The analysis of information. Journal of the Institution of Electrical Engineers, Part III: Radio and Communication Engineering, 93, 429–441.
Hinton, G. E., Osindero, S., & Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18(7), 1527–1554.
Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.
Hu, R., & Collomosse, J. (2013). A performance evaluation of gradient field HOG descriptor for sketch based image retrieval. CVIU, 117(7), 790–806.
Hubel, D. H., & Wiesel, T. N. (1959). Receptive fields of single neurons in the cat’s striate cortex. Journal of Physiology, 148, 574–591.
Jabal, M. F. A., Rahim, M. S. M., Othman, N. Z. S., & Jupri, Z. (2009). A comparative study on extraction and recognition method of CAD data from CAD drawings. In International conference on information management and engineering.
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., & Darrell, T. (2014). Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093.
Johnson, G., Gross, M. D., Hong, J., & Do, E. Y.-L. (2009). Computational support for sketching in design: A review. Foundations and Trends in Human–Computer Interaction, 2, 1–93.
Klare, B. F., Li, Z., & Jain, A. K. (2011). Matching forensic sketches to mug shot photos. TPAMI, 33(3), 639–646.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In NIPS.
Le Cun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1990). Handwritten digit recognition with a back-propagation network. In NIPS.
LeCun, Y., Bottou, L., Orr, G. B., & Müller, K. (1998). Efficient backprop. In G. Orr & K. Müller (Eds.), Neural networks: Tricks of the trade. Springer.
Li, Y., Hospedales, T. M., Song, Y., & Gong, S. (2015). Free-hand sketch recognition by multi-kernel feature learning. CVIU, 137, 1–11.
Li, Y., Song, Y., & Gong, S. (2013). Sketch recognition by ensemble matching of structured features. In BMVC.
Lu, T., Tai, C., Su, F., & Cai, S. (2005). A new recognition model for electronic architectural drawings. Computer-Aided Design, 37(10), 1053–1069.
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609.
Ouyang, S., Hospedales, T., Song, Y., & Li, X. (2014). Cross-modal face matching: Beyond viewed sketches. In ACCV.
Schaefer, S., McPhail, T., & Warren, J. (2006). Image deformation using moving least squares. TOG, 25(3), 533–540.
Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61, 85–117.
Schneider, R. G., & Tuytelaars, T. (2014). Sketch classification and classification-driven analysis using Fisher vectors. In SIGGRAPH Asia.
Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In ICLR.
Sousa, P., & Fonseca, M. J. (2009). Geometric matching for clip-art drawing retrieval. Journal of Visual Communication and Image Representation, 20(12), 71–83.
Stollenga, M. F., Masci, J., Gomez, F., & Schmidhuber, J. (2014). Deep networks with internal selective attention through feedback connections. In NIPS.
Wang, F., Kang, L., & Li, Y. (2015). Sketch-based 3D shape retrieval using convolutional neural networks. In CVPR.
Yanık, E., & Sezgin, T. M. (2015). Active learning for sketch recognition. Computers and Graphics, 52, 93–105.
Yin, F., Wang, Q., Zhang, X., & Liu, C. (2013). ICDAR 2013 Chinese handwriting recognition competition. In International conference on document analysis and recognition.
Yu, Q., Yang, Y., Song, Y. Z., Xiang, T., & Hospedales, T. M. (2015). Sketch-a-net that beats humans. In BMVC.
Zeiler, M., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In ECCV.
Zitnick, C. L., & Dollár, P. (2014). Edge boxes: Locating object proposals from edges. In ECCV.
Zitnick, C. L., & Parikh, D. (2013). Bringing semantics into focus using visual abstraction. In CVPR.
Zou, C., Huang, Z., Lau, R. W., Liu, J., & Fu, H. (2015). Sketch-based shape retrieval using pyramid-of-parts. arXiv preprint arXiv:1502.04232.
This project received support from the European Union’s Horizon 2020 Research and Innovation Programme under Grant Agreement #640891, and from the Royal Society and Natural Science Foundation of China (NSFC) Joint Grant #IE141387 and #61511130081. We gratefully acknowledge the support of NVIDIA Corporation for the donation of the GPUs used for this research.
Communicated by Xianghua Xie, Mark Jones, Gary Tam.
About this article
Cite this article
Yu, Q., Yang, Y., Liu, F. et al. Sketch-a-Net: A Deep Neural Network that Beats Humans. Int J Comput Vis 122, 411–425 (2017). https://doi.org/10.1007/s11263-016-0932-3
- Sketch recognition
- Convolutional neural network
- Data augmentation
- Stroke ordering
- Sketch abstraction