
In-Home Daily-Life Captioning Using Radio Signals

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12347)

Abstract

This paper aims to caption daily life, i.e., to create a textual description of people's activities and interactions with objects in their homes. Addressing this problem requires novel methods beyond traditional video captioning, as most people would have privacy concerns about deploying cameras throughout their homes. We introduce RF-Diary, a new model for captioning daily life by analyzing privacy-preserving radio signals in the home together with the home's floormap. RF-Diary can further observe and caption people's lives through walls and occlusions and in dark settings. In designing RF-Diary, we exploit the ability of radio signals to capture people's 3D dynamics, and use the floormap to help the model learn people's interactions with objects. We also use a multi-modal feature alignment training scheme that leverages existing video-based captioning datasets to improve the performance of our radio-based captioning model. Extensive experimental results demonstrate that RF-Diary generates accurate captions under visible conditions. It also sustains its good performance in dark or occluded settings, where video-based captioning approaches fail to generate meaningful captions. (For more information, please visit our project webpage: http://rf-diary.csail.mit.edu.)
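
The multi-modal feature alignment scheme mentioned above can be pictured as training an RF encoder so that its clip-level features line up with those of a video encoder pretrained on existing video-captioning data, while a caption decoder is trained with the usual cross-entropy loss. The PyTorch sketch below illustrates only this general idea; all module names, dimensions, and the loss weighting are illustrative assumptions, not the authors' implementation.

# Minimal sketch of a feature-alignment training step, assuming a pretrained
# video encoder supplies target features (stand-in tensors here). Hypothetical
# shapes and modules; not the paper's architecture.
import torch
import torch.nn as nn

class RFEncoder(nn.Module):
    """Encodes a sequence of RF frames into one feature vector per clip."""
    def __init__(self, in_dim=1024, feat_dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU(),
                                 nn.Linear(feat_dim, feat_dim))
    def forward(self, rf):                  # rf: (batch, time, in_dim)
        return self.net(rf).mean(dim=1)     # temporal average -> (batch, feat_dim)

class CaptionDecoder(nn.Module):
    """LSTM caption decoder conditioned on the clip feature (teacher forcing)."""
    def __init__(self, feat_dim=512, vocab=1000, emb=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb + feat_dim, 512, batch_first=True)
        self.out = nn.Linear(512, vocab)
    def forward(self, feat, tokens):        # tokens: (batch, length)
        emb = self.embed(tokens)
        feat = feat.unsqueeze(1).expand(-1, emb.size(1), -1)
        h, _ = self.lstm(torch.cat([emb, feat], dim=-1))
        return self.out(h)                  # (batch, length, vocab)

rf_enc, decoder = RFEncoder(), CaptionDecoder()
rf = torch.randn(4, 30, 1024)               # stand-in for 30 frames of RF input
video_feat = torch.randn(4, 512)             # stand-in for pretrained video-encoder features
tokens = torch.randint(0, 1000, (4, 12))     # ground-truth caption token ids

feat = rf_enc(rf)
align_loss = nn.functional.mse_loss(feat, video_feat)   # pull RF features toward video features
logits = decoder(feat, tokens[:, :-1])
caption_loss = nn.functional.cross_entropy(
    logits.reshape(-1, 1000), tokens[:, 1:].reshape(-1))
loss = caption_loss + 0.1 * align_loss       # weighting chosen arbitrarily for illustration
loss.backward()

In practice the alignment term lets a radio-based model benefit from the much larger supervision available in video-captioning datasets; the sketch simply combines that term with a standard captioning loss.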

Supplementary material

Supplementary material 1: 504434_1_En_7_MOESM1_ESM.pdf (PDF, 5.7 MB)

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

MIT CSAIL, Cambridge, USA