iMakeup: Makeup Instructional Video Dataset for Fine-Grained Dense Video Captioning

  • Conference paper
  • First Online:

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11166)

Abstract

Generating natural language descriptions for short videos has made rapid progress thanks to advances in deep learning and the availability of various annotated datasets. However, automatic analysis, understanding, and learning from long videos remain very challenging and require further exploration. To support investigation of this challenge, we introduce iMakeup, a large-scale makeup instructional video dataset. The dataset contains 2,000 videos covering 50 popular makeup topics, amounting to 256 hours of video and 12,823 annotated clips in total. It provides both visual and auditory modalities with broad coverage and diversity within the specific domain of makeup. We further extend existing long-video understanding techniques to demonstrate the feasibility of the dataset, reporting results of baseline video segmentation and captioning models. We expect this dataset to support research on problems such as video segmentation, dense video captioning, object detection and tracking, action tracking, and learning for fashion.
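The abstract describes 12,823 clips, each annotated with temporal boundaries and a caption, but no file schema is given here. Purely as an illustration, the minimal Python sketch below shows one plausible way such clip-level annotations could be represented and loaded; the field names (video_id, start, end, caption) and the file name imakeup_annotations.json are hypothetical assumptions, not the dataset's actual format.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class ClipAnnotation:
    """One annotated clip: a time interval in a makeup video plus its caption."""
    video_id: str   # hypothetical identifier of the source instructional video
    start: float    # clip start time in seconds
    end: float      # clip end time in seconds
    caption: str    # natural-language description of the makeup step


def load_annotations(path: str) -> List[ClipAnnotation]:
    # Assumes a JSON list of records such as
    # {"video_id": "v0001", "start": 12.0, "end": 34.5, "caption": "apply primer"}.
    with open(path, "r", encoding="utf-8") as f:
        return [ClipAnnotation(**record) for record in json.load(f)]


if __name__ == "__main__":
    clips = load_annotations("imakeup_annotations.json")  # hypothetical file name
    print(f"{len(clips)} annotated clips loaded")
    for clip in clips[:3]:
        print(f"{clip.video_id} [{clip.start:.1f}s, {clip.end:.1f}s]: {clip.caption}")
```

A per-clip (video, interval, caption) record like this is the usual input both for training dense captioning models and for evaluating temporal segmentation against annotated boundaries.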

Acknowledgment

This work is supported by the National Natural Science Foundation of China under Grant No. 61772535 and the National Key Research and Development Plan under Grant No. 2016YFB1001202. We also appreciate the support from the National Demonstration Center for Experimental Education of Information Technology and Management (Renmin University of China).

Author information

Corresponding author

Correspondence to Qin Jin.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Lin, X., Jin, Q., Chen, S., Song, Y., Zhao, Y. (2018). iMakeup: Makeup Instructional Video Dataset for Fine-Grained Dense Video Captioning. In: Hong, R., Cheng, W.H., Yamasaki, T., Wang, M., Ngo, C.W. (eds.) Advances in Multimedia Information Processing – PCM 2018. Lecture Notes in Computer Science, vol. 11166. Springer, Cham. https://doi.org/10.1007/978-3-030-00764-5_8

  • DOI: https://doi.org/10.1007/978-3-030-00764-5_8

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-00763-8

  • Online ISBN: 978-3-030-00764-5

  • eBook Packages: Computer Science; Computer Science (R0)
