
Fashion Captioning: Towards Generating Accurate Descriptions with Semantic Rewards

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12358)

Abstract

Generating accurate descriptions for online fashion items is important not only for enhancing customers’ shopping experiences, but also for increasing online sales. Beyond correctly presenting the attributes of an item, descriptions written in an enchanting style can better attract customer interest. The goal of this work is to develop a novel learning framework for accurate and expressive fashion captioning. Unlike generic image captioning, it is hard to identify and describe the rich attributes of fashion items. We seed the description of an item by first identifying its attributes, and introduce an attribute-level semantic (ALS) reward and a sentence-level semantic (SLS) reward as metrics to improve the quality of the generated descriptions. We further integrate the training of our model with maximum likelihood estimation (MLE), attribute embedding, and reinforcement learning (RL). To facilitate the learning, we build a new FAshion CAptioning Dataset (FACAD), which contains 993K images and 130K corresponding enchanting and diverse descriptions. Experiments on FACAD demonstrate the effectiveness of our model (code and data: https://github.com/xuewyang/Fashion_Captioning).
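The abstract outlines a training objective that combines an MLE caption loss with RL driven by semantic rewards. The paper gives the exact formulation; what follows is only a minimal, hypothetical PyTorch sketch of how such a combined objective might look, assuming a self-critical (greedy-decoding) baseline in the style of Rennie et al. (2017) and a toy attribute-overlap reward standing in for ALS. All names here (als_reward, combined_loss, rl_weight, padding index 0) are illustrative assumptions, not the authors' implementation.

# Hypothetical sketch (not the authors' code): an MLE caption loss
# combined with a REINFORCE term weighted by a semantic reward,
# assuming a self-critical baseline. Rewards are plain numbers
# computed from decoded captions, so they carry no gradient.
import torch
import torch.nn.functional as F

def als_reward(sampled_tokens, attribute_tokens):
    """Toy attribute-level semantic reward: the fraction of
    ground-truth attribute words that appear in the sampled caption."""
    hits = sum(1 for a in attribute_tokens if a in sampled_tokens)
    return hits / max(len(attribute_tokens), 1)

def combined_loss(logits, targets, sample_log_probs,
                  sample_reward, baseline_reward, rl_weight=0.5):
    # logits: (batch, seq_len, vocab) from teacher-forced decoding
    # targets: (batch, seq_len) ground-truth token ids, 0 = padding
    mle = F.cross_entropy(logits.transpose(1, 2), targets, ignore_index=0)
    # REINFORCE with a self-critical baseline: reward of a sampled
    # caption minus reward of the greedily decoded caption.
    advantage = sample_reward - baseline_reward
    rl = -(advantage * sample_log_probs.sum(dim=1)).mean()
    return (1.0 - rl_weight) * mle + rl_weight * rl

In the actual paper the reward additionally includes the sentence-level (SLS) term and training also uses an attribute-embedding objective; the sketch only illustrates how a semantic reward can be plugged into a standard policy-gradient caption loss.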

Keywords

Fashion Captioning · Reinforcement Learning · Semantics

Notes

Acknowledgements

This work is supported in part by the National Science Foundation under Grants NSF ECCS 1731238 and NSF CCF 2007313.

Supplementary material

504454_1_En_1_MOESM1_ESM.pdf (4.9 MB)
Supplementary material 1 (PDF 4984 KB)

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

1. Stony Brook University, Stony Brook, USA
2. USC, Los Angeles, USA
3. MIT, Cambridge, USA
4. Kwai Inc., Washington, USA
5. BUPT, Beijing, China
6. Megvii, Beijing, China
