Predicting Important Objects for Egocentric Video Summarization
Abstract
We present a video summarization approach for egocentric or “wearable” camera data. Given hours of video, the proposed method produces a compact storyboard summary of the camera wearer’s day. In contrast to traditional keyframe selection techniques, the resulting summary focuses on the most important objects and people with which the camera wearer interacts. To accomplish this, we develop region cues indicative of high-level saliency in egocentric video—such as the nearness to hands, gaze, and frequency of occurrence—and learn a regressor to predict the relative importance of any new region based on these cues. Using these predictions and a simple form of temporal event detection, our method selects frames for the storyboard that reflect the key object-driven happenings. We adjust the compactness of the final summary given either an importance selection criterion or a length budget; for the latter, we design an efficient dynamic programming solution that accounts for importance, visual uniqueness, and temporal displacement. Critically, the approach is neither camera-wearer-specific nor object-specific; that means the learned importance metric need not be trained for a given user or context, and it can predict the importance of objects and people that have never been seen previously. Our results on two egocentric video datasets show the method’s promise relative to existing techniques for saliency and summarization.
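The length-budget variant described above can be illustrated with a small dynamic program: pick exactly k storyboard frames from n candidates so as to maximize a score combining predicted importance, visual uniqueness (dissimilarity to the previously chosen frame), and temporal displacement. This is a minimal sketch under assumed inputs and weights, not the paper's exact formulation; the function name, scoring terms, and weights `w_u` and `w_t` are illustrative.

```python
def select_frames(importance, dissimilarity, times, k, w_u=1.0, w_t=0.1):
    """Sketch of budget-constrained frame selection via dynamic programming.

    importance:    list of per-frame importance scores (e.g. regressor outputs)
    dissimilarity: n x n matrix of visual dissimilarity between frames
    times:         per-frame timestamps
    k:             storyboard length budget (number of frames to select)
    """
    n = len(importance)
    NEG = float("-inf")
    # dp[j][i]: best score using j+1 frames with the last one at index i
    dp = [[NEG] * n for _ in range(k)]
    back = [[-1] * n for _ in range(k)]
    for i in range(n):
        dp[0][i] = importance[i]
    for j in range(1, k):
        for i in range(j, n):
            for p in range(j - 1, i):
                # reward importance of frame i, uniqueness relative to the
                # previous pick p, and temporal spread between the two
                score = (dp[j - 1][p] + importance[i]
                         + w_u * dissimilarity[p][i]
                         + w_t * (times[i] - times[p]))
                if score > dp[j][i]:
                    dp[j][i] = score
                    back[j][i] = p
    # recover the chosen frame indices by backtracking from the best endpoint
    i = max(range(k - 1, n), key=lambda x: dp[k - 1][x])
    chosen = [i]
    for j in range(k - 1, 0, -1):
        i = back[j][i]
        chosen.append(i)
    return chosen[::-1]
```

With uniqueness and temporal weights zeroed out, the program reduces to picking the k highest-importance frames in temporal order, which makes the recurrence easy to sanity-check.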
Keywords
Egocentric vision · Video summarization · Category discovery · Saliency detection