
Predicting Important Objects for Egocentric Video Summarization

International Journal of Computer Vision

Abstract

We present a video summarization approach for egocentric or “wearable” camera data. Given hours of video, the proposed method produces a compact storyboard summary of the camera wearer’s day. In contrast to traditional keyframe selection techniques, the resulting summary focuses on the most important objects and people with which the camera wearer interacts. To accomplish this, we develop region cues indicative of high-level saliency in egocentric video—such as the nearness to hands, gaze, and frequency of occurrence—and learn a regressor to predict the relative importance of any new region based on these cues. Using these predictions and a simple form of temporal event detection, our method selects frames for the storyboard that reflect the key object-driven happenings. We adjust the compactness of the final summary given either an importance selection criterion or a length budget; for the latter, we design an efficient dynamic programming solution that accounts for importance, visual uniqueness, and temporal displacement. Critically, the approach is neither camera-wearer-specific nor object-specific; that means the learned importance metric need not be trained for a given user or context, and it can predict the importance of objects and people that have never been seen previously. Our results on two egocentric video datasets show the method’s promise relative to existing techniques for saliency and summarization.
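
The length-budget variant described above boils down to choosing a fixed number of storyboard frames that jointly score well on predicted importance, visual uniqueness, and temporal spread. Below is a minimal dynamic-programming sketch of that kind of budgeted selection in Python; the function name, feature distance, weights, and exact objective are illustrative assumptions rather than the paper's actual formulation.

```python
import numpy as np

def select_storyboard_frames(importance, features, k,
                             w_unique=1.0, w_temporal=0.1):
    """Pick k keyframes from n candidates by dynamic programming.

    Illustrative sketch only: each candidate frame i has a predicted
    importance score importance[i] and a feature vector features[i];
    consecutive selected frames are rewarded for visual dissimilarity
    (uniqueness) and for being spread out in time (displacement).
    """
    n = len(importance)
    assert 1 <= k <= n

    # Pairwise visual dissimilarity between candidates (Euclidean here).
    diffs = features[:, None, :] - features[None, :, :]
    dissim = np.linalg.norm(diffs, axis=2)

    NEG = -np.inf
    # dp[i][j]: best score of a summary whose j-th (1-based) frame is i.
    dp = np.full((n, k + 1), NEG)
    back = np.full((n, k + 1), -1, dtype=int)
    dp[:, 1] = importance  # a one-frame summary scores just its importance

    for j in range(2, k + 1):
        for i in range(j - 1, n):          # frame i is the j-th pick
            for h in range(j - 2, i):      # frame h is the (j-1)-th pick
                if dp[h, j - 1] == NEG:
                    continue
                score = (dp[h, j - 1] + importance[i]
                         + w_unique * dissim[h, i]
                         + w_temporal * (i - h))
                if score > dp[i, j]:
                    dp[i, j] = score
                    back[i, j] = h

    # Recover the best chain of k frames by backtracking.
    last = int(np.argmax(dp[:, k]))
    chain = [last]
    for j in range(k, 1, -1):
        last = back[last, j]
        chain.append(last)
    return chain[::-1]

# Toy usage: 30 candidate frames with random scores/features, 5-frame budget.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    imp = rng.random(30)
    feats = rng.random((30, 8))
    print(select_storyboard_frames(imp, feats, k=5))
```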




Notes

  1. http://vision.cs.utexas.edu/projects/egocentric/ Due to privacy issues, we are only able to share 4 of the 10 videos (one from each subject), for a total of 17 h of video. They correspond to the test videos that we evaluate on in Sect. 4.

  2. See Footnote 1

  3. This method summarizes a collection of videos, so we treat each event in our data as a different video.



Author information

Corresponding author

Correspondence to Yong Jae Lee.

Additional information

Communicated by C. Schnörr.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (docx 20 KB)


About this article


Cite this article

Lee, Y.J., Grauman, K. Predicting Important Objects for Egocentric Video Summarization. Int J Comput Vis 114, 38–55 (2015). https://doi.org/10.1007/s11263-014-0794-5

