Face-Cap: Image Captioning Using Facial Expression Analysis

  • Omid Mohamad Nezami
  • Mark Dras
  • Peter Anderson
  • Len Hamey
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11051)

Abstract

Image captioning is the process of generating a natural language description of an image. Most current image captioning models, however, do not take into account the emotional aspect of an image, which is highly relevant to the activities and interpersonal relationships represented therein. Towards developing a model that can produce human-like captions incorporating these aspects, we use facial expression features extracted from images containing human faces, with the aim of improving the descriptive ability of the model. In this work, we present two variants of our Face-Cap model, which embed facial expression features in different ways, to generate image captions. On all standard evaluation metrics, our Face-Cap models outperform a state-of-the-art baseline captioning model when applied to an image caption dataset extracted from the standard Flickr30K dataset, consisting of around 11K images containing faces. An analysis of the captions finds that, perhaps surprisingly, the improvement in caption quality appears to come not from the addition of adjectives linked to emotional aspects of the images, but from greater variety in the actions described in the captions. Code related to this paper is available at: https://github.com/omidmn/Face-Cap.
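
The abstract describes embedding facial expression features into a captioning model but does not spell out the architecture; the sketch below is therefore only one plausible illustration, not the authors' released code (see the GitHub link above). It conditions a show-and-tell style LSTM decoder on a concatenation of global image features and a facial-expression probability vector; all sizes (VOCAB_SIZE, NUM_EMOTIONS, IMG_FEAT_DIM, and so on) and the initial-state injection scheme are assumptions made purely for illustration.

    # Hypothetical sketch (not the authors' implementation) of injecting a
    # facial-expression vector into an LSTM caption decoder.
    import tensorflow as tf

    VOCAB_SIZE = 10000    # caption vocabulary size (assumed)
    EMBED_DIM = 256       # word-embedding size (assumed)
    HIDDEN_DIM = 512      # LSTM hidden size (assumed)
    IMG_FEAT_DIM = 4096   # e.g. a VGG fc7 feature vector (assumed)
    NUM_EMOTIONS = 7      # FER-2013 style basic-emotion classes (assumed)
    MAX_LEN = 20          # maximum caption length (assumed)

    # Inputs: global image features, a facial-expression probability vector,
    # and the (teacher-forced) caption token sequence.
    img_feats = tf.keras.Input(shape=(IMG_FEAT_DIM,), name="image_features")
    face_feats = tf.keras.Input(shape=(NUM_EMOTIONS,), name="facial_expression")
    captions = tf.keras.Input(shape=(MAX_LEN,), dtype="int32", name="caption_tokens")

    # One way to "embed facial expression features": concatenate them with the
    # image features and use the result to initialise the LSTM state.
    fused = tf.keras.layers.Concatenate()([img_feats, face_feats])
    init_h = tf.keras.layers.Dense(HIDDEN_DIM, activation="tanh")(fused)
    init_c = tf.keras.layers.Dense(HIDDEN_DIM, activation="tanh")(fused)

    # Standard word-by-word decoder over the caption tokens.
    embedded = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(captions)
    hidden = tf.keras.layers.LSTM(HIDDEN_DIM, return_sequences=True)(
        embedded, initial_state=[init_h, init_c])
    logits = tf.keras.layers.Dense(VOCAB_SIZE)(hidden)

    model = tf.keras.Model([img_feats, face_feats, captions], logits)
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )
    model.summary()

The paper's second variant presumably injects the expression features at a different point (for example, at every decoding step rather than only into the initial state), but that detail is not recoverable from the abstract alone.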

Keywords

Image captioning · Facial expression recognition · Sentiment analysis · Deep learning


Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Omid Mohamad Nezami (1)
  • Mark Dras (1)
  • Peter Anderson (1, 2)
  • Len Hamey (1)
  1. Department of Computing, Macquarie University, Sydney, Australia
  2. The Australian National University, Canberra, Australia
