International Journal of Computer Vision, Volume 114, Issue 1, pp 38–55

Predicting Important Objects for Egocentric Video Summarization

  • Yong Jae Lee
  • Kristen Grauman


We present a video summarization approach for egocentric or “wearable” camera data. Given hours of video, the proposed method produces a compact storyboard summary of the camera wearer’s day. In contrast to traditional keyframe selection techniques, the resulting summary focuses on the most important objects and people with which the camera wearer interacts. To accomplish this, we develop region cues indicative of high-level saliency in egocentric video—such as the nearness to hands, gaze, and frequency of occurrence—and learn a regressor to predict the relative importance of any new region based on these cues. Using these predictions and a simple form of temporal event detection, our method selects frames for the storyboard that reflect the key object-driven happenings. We adjust the compactness of the final summary given either an importance selection criterion or a length budget; for the latter, we design an efficient dynamic programming solution that accounts for importance, visual uniqueness, and temporal displacement. Critically, the approach is neither camera-wearer-specific nor object-specific; that means the learned importance metric need not be trained for a given user or context, and it can predict the importance of objects and people that have never been seen previously. Our results on two egocentric video datasets show the method’s promise relative to existing techniques for saliency and summarization.
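The length-budget selection described above can be illustrated with a small dynamic program. This is a minimal sketch under stated assumptions, not the authors' formulation: it assumes one candidate frame per detected event, a precomputed importance score per candidate, a feature vector per candidate for measuring visual uniqueness, and a timestamp per candidate. The function name `select_keyframes`, the weights `w_unique` and `w_time`, and the additive pairwise gain (Euclidean feature distance plus normalized time gap between consecutive selections) are all hypothetical choices made for illustration.

```python
import math


def select_keyframes(importance, features, times, budget,
                     w_unique=1.0, w_time=1.0):
    """Pick `budget` frames, in temporal order, maximizing an assumed score.

    Assumed objective (illustrative only): the sum of per-frame importance
    plus, for each pair of consecutive selected frames, a gain that rewards
    visual dissimilarity and temporal separation.
    """
    n = len(importance)
    if len(features) != n or len(times) != n:
        raise ValueError("importance, features, and times must align")
    if not 1 <= budget <= n:
        raise ValueError("budget must be between 1 and the number of frames")

    # Normalize time gaps by the total span; fall back to 1.0 if span is zero.
    span = (times[-1] - times[0]) or 1.0

    def pair_gain(i, j):
        uniqueness = math.dist(features[i], features[j])
        displacement = (times[j] - times[i]) / span
        return w_unique * uniqueness + w_time * displacement

    neg_inf = float("-inf")
    # best[k][j]: best score using k frames with the last one at index j.
    best = [[neg_inf] * n for _ in range(budget + 1)]
    back = [[-1] * n for _ in range(budget + 1)]
    for j in range(n):
        best[1][j] = importance[j]

    for k in range(2, budget + 1):
        for j in range(k - 1, n):
            for i in range(k - 2, j):
                score = best[k - 1][i] + pair_gain(i, j) + importance[j]
                if score > best[k][j]:
                    best[k][j] = score
                    back[k][j] = i

    # Recover the selected indices by following back-pointers.
    j = max(range(budget - 1, n), key=lambda idx: best[budget][idx])
    chosen = []
    for k in range(budget, 0, -1):
        chosen.append(j)
        j = back[k][j]
    return chosen[::-1]
```

The table has `budget × n` entries and each entry scans at most `n` predecessors, so the sketch runs in O(budget · n²) time. Setting both weights to zero reduces it to picking the highest-importance frames in temporal order.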


Egocentric vision · Video summarization · Category discovery · Saliency detection




Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  1. Department of Computer Science, University of California, Davis, USA
  2. Department of Computer Science, University of Texas at Austin, Austin, USA
