
Scaling Egocentric Vision: The EPIC-KITCHENS Dataset

  • Dima Damen
  • Hazel Doughty
  • Giovanni Maria Farinella
  • Sanja Fidler
  • Antonino Furnari
  • Evangelos Kazakos
  • Davide Moltisanti
  • Jonathan Munro
  • Toby Perrett
  • Will Price
  • Michael Wray
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11208)

Abstract

First-person vision is gaining interest as it offers a unique viewpoint on people’s interaction with objects, their attention, and even intention. However, progress in this challenging domain has been relatively slow due to the lack of sufficiently large datasets. In this paper, we introduce EPIC-KITCHENS, a large-scale egocentric video benchmark recorded by 32 participants in their native kitchen environments. Our videos depict non-scripted daily activities: we simply asked each participant to start recording every time they entered their kitchen. Recording took place in 4 cities (in North America and Europe) by participants belonging to 10 different nationalities, resulting in highly diverse cooking styles. Our dataset features 55 hours of video consisting of 11.5M frames, which we densely labelled for a total of 39.6K action segments and 454.3K object bounding boxes. Our annotation is unique in that we had the participants narrate their own videos (after recording), thus reflecting true intention, and we crowd-sourced ground-truths based on these. We describe our object, action and anticipation challenges, and evaluate several baselines over two test splits, seen and unseen kitchens.
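The annotations described above (narration-derived action segments plus object bounding boxes, evaluated over seen and unseen kitchen splits) lend themselves to simple tabular processing. As an illustration only, the minimal sketch below assumes a hypothetical CSV export of action segments with columns `participant_id`, `verb`, `noun`, `start_frame` and `stop_frame`; the actual release format and file names are not specified here and may differ.

```python
import csv
from collections import Counter


def summarise_action_segments(csv_path):
    """Summarise a hypothetical CSV of action-segment annotations.

    Assumed columns (not taken from the paper): participant_id, verb,
    noun, start_frame, stop_frame. Returns per-verb counts, per-participant
    counts, and the total number of labelled frames.
    """
    verb_counts = Counter()
    participant_counts = Counter()
    total_frames = 0
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            verb_counts[row["verb"]] += 1
            participant_counts[row["participant_id"]] += 1
            total_frames += int(row["stop_frame"]) - int(row["start_frame"])
    return verb_counts, participant_counts, total_frames


if __name__ == "__main__":
    # "action_segments.csv" is a placeholder path for this sketch.
    verbs, participants, frames = summarise_action_segments("action_segments.csv")
    print(f"{sum(verbs.values())} segments across {len(participants)} participants")
    print(f"{frames} labelled frames; most frequent verbs: {verbs.most_common(10)}")
```

Such a summary is the kind of sanity check one might run before training, e.g. to compare verb distributions between the seen- and unseen-kitchen splits.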

Keywords

Egocentric vision · Dataset · Benchmarks · First-person vision · Egocentric object detection · Action recognition and anticipation

Notes

Acknowledgement

Annotations sponsored by a charitable donation from Nokia Technologies and UoB’s Jean Golding Institute. Research supported by EPSRC DTP, EPSRC GLANCE (EP/N013964/1), EPSRC LOCATE (EP/N033779/1) and Piano della Ricerca 2016–2018 linea di Intervento 2 of DMI. The object detection baseline was helped by code from, and discussions with, Davide Acuña.

Supplementary material

Supplementary material 1 (mp4 88819 KB)


Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Dima Damen (1)
  • Hazel Doughty (1)
  • Giovanni Maria Farinella (2)
  • Sanja Fidler (3)
  • Antonino Furnari (2)
  • Evangelos Kazakos (1)
  • Davide Moltisanti (1)
  • Jonathan Munro (1)
  • Toby Perrett (1)
  • Will Price (1)
  • Michael Wray (1)
  1. University of Bristol, Bristol, UK
  2. University of Catania, Catania, Italy
  3. University of Toronto, Toronto, Canada
