DHFML: deep heterogeneous feature metric learning for matching photograph and cartoon pairs


We study the problem of retrieving cartoon faces of celebrities given a photograph of their real face as the query, which we refer to as Photo2Cartoon. The problem is challenging because (i) cartoons vary widely in style and (ii) the modality gap between real and cartoon faces is large. To address these challenges, we present a discriminative deep metric learning approach designed for matching cross-modal faces and demonstrate it on Photo2Cartoon. The proposed approach learns a nonlinear transformation that projects real and cartoon face pairs into a common subspace in which the distance between positive pairs is smaller than the distance between negative pairs. We evaluate our method on two public benchmarks, IIIT-CFW and Viewed Sketch, and report better retrieval results than related methods.
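The idea of projecting both modalities into a common subspace and comparing distances can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the two small tanh networks (`photo_net`, `cartoon_net`) use random weights in place of trained parameters, the contrastive-style `pair_loss` with a `margin` stands in for whatever objective the authors actually use, and the input vectors are random stand-ins for face features.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Random weights (n_out x n_in) and zero bias; a stand-in for trained parameters."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def embed(x, layers):
    """Nonlinear projection: each layer is an affine map followed by tanh."""
    h = list(x)
    for weights, bias in layers:
        h = [math.tanh(sum(w * v for w, v in zip(row, h)) + b)
             for row, b in zip(weights, bias)]
    return h


def distance(a, b):
    """Euclidean distance between two embeddings in the common subspace."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def pair_loss(d, is_positive, margin=1.0):
    """Hypothetical contrastive-style loss: pull positive pairs together,
    push negative pairs at least `margin` apart."""
    return d ** 2 if is_positive else max(0.0, margin - d) ** 2


def rank_gallery(query_emb, gallery_embs):
    """Gallery indices sorted by increasing distance to the query embedding."""
    return sorted(range(len(gallery_embs)),
                  key=lambda i: distance(query_emb, gallery_embs[i]))


rng = random.Random(0)
# One network per modality (an assumption); both map 8-d inputs to a shared 4-d space.
photo_net = [make_layer(8, 6, rng), make_layer(6, 4, rng)]
cartoon_net = [make_layer(8, 6, rng), make_layer(6, 4, rng)]

photo = [rng.gauss(0, 1) for _ in range(8)]                        # query photo features
cartoons = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(5)]  # cartoon gallery

query_emb = embed(photo, photo_net)
gallery_embs = [embed(c, cartoon_net) for c in cartoons]

# Retrieval: rank cartoon gallery items by distance to the photo query.
ranking = rank_gallery(query_emb, gallery_embs)

# Training signal: treat gallery item 0 as the positive match, the rest as negatives.
losses = [pair_loss(distance(query_emb, e), is_positive=(i == 0))
          for i, e in enumerate(gallery_embs)]
```

In a trained system the network weights would be optimized so that the summed loss over many photo and cartoon pairs decreases, which is what makes the nearest gallery items in `ranking` likely to depict the query identity.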


  1. A cartoon is typically a non-realistic or semi-realistic artistic style of drawing or painting, or an image or series of images intended for satire, caricature, or humor [1].


  1. Cartoon (from Wikipedia, the free encyclopedia). https://en.wikipedia.org/wiki/Cartoon. Accessed 2018-02-10

  2. Bellet A, Habrard A, Sebban M (2013) A survey on metric learning for feature vectors and structured data. CoRR. arXiv:1306.6709

  3. Cao Q, Shen L, Xie W, Parkhi OM, Zisserman A (2018) VGGFace2: a dataset for recognising faces across pose and age. In: FG

  4. Crowley EJ, Parkhi OM, Zisserman A (2015) Face painting: querying art with photos. In: BMVC

  5. Fan H, Cao Z, Jiang Y, Yin Q, Doudou C (2014) Learning deep face representation. CoRR. arXiv:1403.2802

  6. Goldberger J, Hinton GE, Roweis ST, Salakhutdinov RR (2005) Neighbourhood components analysis. In: NIPS

  7. Härdle WK, Simar L (2015) Canonical correlation analysis. In: Applied multivariate statistical analysis. Springer, pp 443–454

  8. Hu P, Ramanan D (2017) Finding tiny faces. In: CVPR

  9. Huang D, Wang YF (2013) Coupled dictionary and feature space learning with applications to cross-domain image synthesis and recognition. In: ICCV

  10. Huang GB, Ramesh M, Berg T, Learned-Miller E (2007) Labeled faces in the wild: a database for studying face recognition in unconstrained environments. Technical report 07-49, University of Massachusetts, Amherst

  11. Huang X, Peng Y (2017) Cross-modal deep metric learning with multi-task regularization. arXiv preprint arXiv:1703.07026

  12. Huo J, Gao Y, Shi Y, Yang W, Yin H (2016) Ensemble of sparse cross-modal metrics for heterogeneous face recognition. In: ACM-MM

  13. Kang C, Liao S, He Y, Wang J, Niu W, Xiang S, Pan C (2015) Cross-modal similarity learning: a low rank bilinear formulation. In: CIKM

  14. Kemelmacher-Shlizerman I, Seitz SM, Miller D, Brossard E (2016) The MegaFace benchmark: 1 million faces for recognition at scale. In: CVPR

  15. Klare B, Jain AK (2013) Heterogeneous face recognition using kernel prototype similarities. IEEE Trans Pattern Anal Mach Intell 35(6):1410–1422

  16. Koch G, Zemel R, Salakhutdinov R (2015) Siamese neural networks for one-shot image recognition. In: ICML deep learning workshop, vol 2

  17. Kumar N, Berg AC, Belhumeur PN, Nayar SK (2009) Attribute and simile classifiers for face verification. In: ICCV

  18. Liong VE, Lu J, Tan YP, Zhou J (2017) Deep coupled metric learning for cross-modal matching. IEEE Trans Multimed 19(6):1234–1244

  19. Martinez AM (1998) The AR face database. CVC technical report

  20. Mauro R, Kubovy M (1992) Caricature and face recognition. Mem Cogn 20(4):433–440

  21. Messer K, Matas J, Kittler J, Luettin J, Maitre G (1999) XM2VTSDB: the extended M2VTS database. In: Second international conference on audio and video-based biometric person authentication

  22. Mignon A, Jurie F (2012) CMML: a new metric learning approach for cross modal matching. In: ACCV

  23. Mishra A, Nandan Rai S, Mishra A, Jawahar CV (2016) IIIT-CFW: a benchmark database of cartoon faces in the wild. In: VASE ECCVW

  24. Ouyang S, Hospedales TM, Song Y, Li X (2014) Cross-modal face matching: beyond viewed sketches. In: ACCV

  25. Parkhi OM, Vedaldi A, Zisserman A (2015) Deep face recognition. In: BMVC

  26. Schroff F, Kalenichenko D, Philbin J (2015) FaceNet: a unified embedding for face recognition and clustering. In: CVPR

  27. Simonyan K, Parkhi OM, Vedaldi A, Zisserman A (2013) Fisher vector faces in the wild. In: BMVC

  28. Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. CoRR. arXiv:1409.1556

  29. Song HO, Xiang Y, Jegelka S, Savarese S (2016) Deep metric learning via lifted structured feature embedding. In: CVPR

  30. Sugiyama M (2006) Local Fisher discriminant analysis for supervised dimensionality reduction. In: ICML

  31. Wang X, Tang X (2009) Face photo-sketch synthesis and recognition. IEEE Trans Pattern Anal Mach Intell 31(11):1955–1967

  32. Weinberger KQ, Saul LK (2009) Distance metric learning for large margin nearest neighbor classification. J Mach Learn Res 10:207–244

  33. Zhang K, Zhang Z, Li Z, Qiao Y (2016) Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Process Lett 23(10):1499–1503

Author information



Corresponding author

Correspondence to Anand Mishra.

Cite this article

Mishra, A. DHFML: deep heterogeneous feature metric learning for matching photograph and cartoon pairs. Int J Multimed Info Retr 8, 135–142 (2019). https://doi.org/10.1007/s13735-018-0160-4

Keywords


  • Deep metric learning
  • Cross-modality
  • Heterogeneous face verification
  • Cross-modal retrieval