Cooking Action Recognition with iVAT: An Interactive Video Annotation Tool

  • Simone Bianco
  • Gianluigi Ciocca
  • Paolo Napoletano
  • Raimondo Schettini
  • Roberto Margherita
  • Gianluca Marini
  • Giuseppe Pantaleo
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8157)


Within a video recipe, we are interested in locating and annotating the various ingredients, kitchenware items and relevant cooking actions. To this end we have developed iVAT, an interactive video annotation tool that supports manual, semi-automatic and automatic annotations, obtained through the user's interaction with various detection algorithms. The tool integrates versions of computer vision algorithms specifically adapted to work in an interactive and incremental learning framework. Although iVAT has been developed to annotate video recipes, it can easily be adapted to annotate videos from other domains as well. In this paper we present some results on the task of cooking action recognition.
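The semi-automatic workflow described above can be sketched as a loop in which a detector proposes labels, the annotator corrects some of them, and each correction is fed back into the detector (the incremental learning step). This is a minimal toy sketch only; the class and method names (`IncrementalDetector`, `predict`, `update`, `annotate`) are hypothetical and do not reflect iVAT's actual interfaces, which the abstract does not describe.

```python
# Toy sketch of an interactive, incremental annotation loop.
# All names here are hypothetical illustrations, not iVAT's API.

class IncrementalDetector:
    """Naive detector that remembers labels confirmed by the annotator."""

    def __init__(self):
        self.known = {}  # frame index -> label confirmed by the user

    def predict(self, frame_idx):
        # Propose the most recent confirmed label (naive label propagation).
        last = max((k for k in self.known if k <= frame_idx), default=None)
        return self.known.get(last)

    def update(self, frame_idx, label):
        # Incremental learning step: fold the user's correction back in.
        self.known[frame_idx] = label


def annotate(frames, detector, corrections):
    """Semi-automatic pass: accept proposals, apply and learn from user fixes."""
    annotations = {}
    for i, _ in enumerate(frames):
        proposal = detector.predict(i)
        if i in corrections:                 # the user intervenes here
            detector.update(i, corrections[i])
            annotations[i] = corrections[i]
        else:                                # proposal accepted as-is
            annotations[i] = proposal
    return annotations
```

With two user corrections (frames 0 and 3), the remaining frames inherit the latest confirmed label, illustrating how sparse user input can drive dense annotation.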


Keywords: interactive video annotation, object recognition, action recognition, tracking, incremental learning



Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Simone Bianco (1)
  • Gianluigi Ciocca (1)
  • Paolo Napoletano (1)
  • Raimondo Schettini (1)
  • Roberto Margherita (2)
  • Gianluca Marini (3)
  • Giuseppe Pantaleo (3)
  1. DISCo (Dipartimento di Informatica, Sistemistica e Comunicazione), Università degli Studi di Milano-Bicocca, Milano, Italy
  2. Almaviva S.p.A., Italy
  3. Almawave S.r.l., Italy
