Abstract
Generating natural-language descriptions for short videos has progressed rapidly thanks to advances in deep learning and the availability of annotated datasets. However, automatically analyzing, understanding, and learning from long videos remains very challenging and calls for further exploration. To support research on this challenge, we introduce iMakeup, a large-scale makeup instructional video dataset. It contains 2,000 videos covering 50 popular makeup topics, amounting to 256 hours of footage with 12,823 annotated clips in total. The dataset includes both visual and auditory modalities, with broad coverage and diversity within the makeup domain. To demonstrate its feasibility, we extend existing long-video understanding techniques and report results of baseline video segmentation and captioning models. We expect this dataset to support research on problems such as video segmentation, dense video captioning, object detection and tracking, action tracking, and learning for fashion.
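The abstract describes per-video annotations as time-stamped clips paired with captions, which is the standard input format for dense video captioning. The paper does not specify the released file format, so the following is only a minimal sketch of how such annotations might be represented and loaded; the JSON layout, field names, and file path are illustrative assumptions, not the actual iMakeup release format.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Clip:
    """One annotated segment: a time span plus its caption."""
    start: float   # clip start time, in seconds
    end: float     # clip end time, in seconds
    caption: str   # natural-language description of the makeup step


def load_video_annotation(path: str) -> List[Clip]:
    """Parse one video's (hypothetical) JSON annotation file into Clip records."""
    with open(path, encoding="utf-8") as f:
        record = json.load(f)
    return [Clip(c["start"], c["end"], c["caption"]) for c in record["clips"]]


if __name__ == "__main__":
    # "annotations/video_0001.json" is a made-up path for illustration only.
    for clip in load_video_annotation("annotations/video_0001.json"):
        print(f"[{clip.start:7.1f}s - {clip.end:7.1f}s] {clip.caption}")
```

A segmentation baseline would predict the (start, end) spans from the raw video, while a captioning baseline would generate the caption for each span; records like the above serve as ground truth for both tasks.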
Acknowledgment
This work is supported by the National Natural Science Foundation of China under Grant No. 61772535 and the National Key Research and Development Plan under Grant No. 2016YFB1001202. We also appreciate the support of the National Demonstration Center for Experimental Education of Information Technology and Management (Renmin University of China).
Cite this paper
Lin, X., Jin, Q., Chen, S., Song, Y., Zhao, Y.: iMakeup: makeup instructional video dataset for fine-grained dense video captioning. In: Hong, R., Cheng, W.-H., Yamasaki, T., Wang, M., Ngo, C.-W. (eds.) Advances in Multimedia Information Processing – PCM 2018. Lecture Notes in Computer Science, vol. 11166. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00764-5_8