
Cross-modal photo-caricature face recognition based on dynamic multi-task learning


Face recognition on realistic visual images (e.g., photos) has been well studied and has made significant progress over the past decade. However, face recognition between photos and caricatures remains a challenging problem. Unlike photos, caricatures are drawn in diverse artistic styles that introduce extreme non-rigid distortions. The resulting representational gap between the two modalities is a major challenge for photo-caricature face recognition. In this paper, we propose to conduct cross-modal photo-caricature face recognition via multi-task learning, which learns the features of the different modalities with different tasks. Instead of setting the task weights manually as in conventional multi-task learning, this work proposes a dynamic weights learning module that automatically learns the task weights according to the training importance of the tasks. The learned task weights enable the network to focus on training the hard tasks instead of getting stuck overtraining the easy ones. The experimental results demonstrate the effectiveness of the proposed dynamic multi-task learning for cross-modal photo-caricature face recognition: performance on the CaVI and WebCaricature datasets shows its superiority over state-of-the-art methods. The implementation code is provided here.
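The idea of weighting tasks by their current difficulty can be illustrated with a small sketch. This is not the paper's dynamic weights learning module, which is learned jointly with the network; it is a simplified stand-in that assumes a softmax over the current per-task losses, so that a task with a higher loss receives a larger weight. The function names, the `temperature` parameter, and the three loss values are hypothetical and chosen only for illustration.

```python
import math


def dynamic_task_weights(losses, temperature=1.0):
    """Return one weight per task via a softmax over the task losses.

    Assumption for illustration: a higher loss marks a harder task,
    which should therefore receive a larger weight.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Subtract the maximum loss for numerical stability before exponentiating.
    peak = max(losses)
    exps = [math.exp((loss - peak) / temperature) for loss in losses]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_total_loss(losses, weights):
    """Combine the per-task losses into one scalar training objective."""
    return sum(w * l for w, l in zip(weights, losses))


# Hypothetical losses for three tasks at one training step.
losses = [0.4, 1.2, 2.5]
weights = dynamic_task_weights(losses)
total = weighted_total_loss(losses, weights)
```

Because the weights sum to one, the combined loss is a convex combination of the task losses, and raising `temperature` flattens the weights toward a uniform, manually fixed setting.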





Author information



Corresponding author

Correspondence to Zuheng Ming.


About this article


Cite this article

Ming, Z., Burie, JC. & Luqman, M.M. Cross-modal photo-caricature face recognition based on dynamic multi-task learning. IJDAR 24, 33–48 (2021).



  • Photo-caricature face recognition
  • Dynamic multi-task learning
  • Deep CNNs