DeepDiary: Automatically Captioning Lifelogging Image Streams

  • Chenyou Fan
  • David J. Crandall
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9913)


Lifelogging cameras capture everyday life from a first-person perspective, but generate so much data that it is hard for users to browse and organize their image collections effectively. In this paper, we propose to use automatic image captioning algorithms to generate textual representations of these collections. We develop and explore novel techniques based on deep learning to generate captions for both individual images and image streams, using temporal consistency constraints to create summaries that are both more compact and less noisy. We evaluate our techniques with quantitative and qualitative results, and apply captioning to an image retrieval application for finding potentially private images. Our results suggest that our automatic captioning algorithms, while imperfect, may work well enough to help users manage lifelogging photo collections.


Lifelogging, First-person, Image captioning, Computer vision



This work was supported in part by the National Science Foundation (IIS-1253549 and CNS-1408730) and Google, and used compute facilities provided by NVidia, the Lilly Endowment through support of the IU PTI, and the Indiana METACyt Initiative. We thank Zhenhua Chen, Sally Crandall, and Xuan Dong for helping to label our lifelogging photos.



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. School of Informatics and Computing, Indiana University, Bloomington, USA
