Abstract
Generating natural-language descriptions for short videos has progressed rapidly thanks to advances in deep learning and the availability of annotated datasets. However, automatically analyzing, understanding, and learning from long videos remains very challenging and calls for further exploration. To support research on this challenge, we introduce iMakeup, a large-scale makeup instructional video dataset. It contains 2,000 videos covering 50 popular makeup topics, amounting to 256 hours of footage with 12,823 annotated clips in total. The dataset includes both visual and auditory modalities, with broad coverage and diversity within the makeup domain. To demonstrate its feasibility, we extend existing long-video understanding techniques and report results of baseline video segmentation and captioning models. We expect this dataset to support research on problems such as video segmentation, dense video captioning, object detection and tracking, action tracking, and learning for fashion.
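The abstract describes per-video annotations as time-stamped clips paired with captions, which is the standard input format for dense video captioning. The paper does not specify the released file format, so the following is only a minimal sketch of how such annotations might be represented and loaded; the JSON layout, field names, and file path are illustrative assumptions, not the actual iMakeup release format.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Clip:
    """One annotated segment: a time span plus its caption."""
    start: float   # clip start time, in seconds
    end: float     # clip end time, in seconds
    caption: str   # natural-language description of the makeup step


def load_video_annotation(path: str) -> List[Clip]:
    """Parse one video's (hypothetical) JSON annotation file into Clip records."""
    with open(path, encoding="utf-8") as f:
        record = json.load(f)
    return [Clip(c["start"], c["end"], c["caption"]) for c in record["clips"]]


if __name__ == "__main__":
    # "annotations/video_0001.json" is a made-up path for illustration only.
    for clip in load_video_annotation("annotations/video_0001.json"):
        print(f"[{clip.start:7.1f}s - {clip.end:7.1f}s] {clip.caption}")
```

A segmentation baseline would predict the (start, end) spans from the raw video, while a captioning baseline would generate the caption for each span; records like the above serve as ground truth for both tasks.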
Acknowledgment
This work is supported by the National Natural Science Foundation of China under Grant No. 61772535 and the National Key Research and Development Plan under Grant No. 2016YFB1001202. We also appreciate the support of the National Demonstration Center for Experimental Education of Information Technology and Management (Renmin University of China).
Cite this paper
Lin, X., Jin, Q., Chen, S., Song, Y., Zhao, Y.: iMakeup: makeup instructional video dataset for fine-grained dense video captioning. In: Hong, R., Cheng, W.-H., Yamasaki, T., Wang, M., Ngo, C.-W. (eds.) Advances in Multimedia Information Processing – PCM 2018. Lecture Notes in Computer Science, vol. 11166. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00764-5_8