
Indoor human activity recognition using high-dimensional sensors and deep neural networks

  • Engineering Applications of Neural Networks 2018
  • Published in: Neural Computing and Applications

Abstract

Many smart home applications rely on indoor human activity recognition. This challenge is currently tackled primarily by employing video camera sensors. However, such sensors suffer from fundamental technical deficiencies in an indoor environment and often also result in a breach of privacy. In contrast, a radar sensor resolves most of these flaws and, in particular, preserves privacy. In this paper, we investigate a novel approach toward automatic indoor human activity recognition, feeding high-dimensional radar and video camera sensor data into several deep neural networks. Furthermore, we explore the efficacy of sensor fusion to provide a solution in less than ideal circumstances. We validate our approach on two newly constructed and published data sets that consist of 2347 and 1505 samples distributed over six different types of gestures and events, respectively. From our analysis, we conclude that, for the radar sensor, it is optimal to use a three-dimensional convolutional neural network that takes sequential range-Doppler maps as input. This model achieves error rates of 12.22% and 2.97% on the gestures and events data sets, respectively. A pretrained residual network is employed to deal with the video camera sensor data and obtains error rates of 1.67% and 3.00% on the same data sets. We show that there is a clear benefit in combining both sensors to enable activity recognition under less than ideal circumstances.
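To make the radar branch of the approach more concrete, the following is a minimal PyTorch sketch (PyTorch is the framework referred to in note 5) of a three-dimensional convolutional network operating on a sequence of range-Doppler maps. The layer configuration, input resolution, and clip length are illustrative assumptions, not the exact architecture evaluated in the paper.

```python
# Minimal sketch of a 3D CNN over sequential range-Doppler maps.
# All layer sizes and the 64x64 map resolution are assumptions for
# illustration; only the idea (3D convolutions over time, range, and
# Doppler) follows the paper.
import torch
import torch.nn as nn

class RadarConv3D(nn.Module):
    def __init__(self, num_classes: int = 6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1),  # (time, range, Doppler)
            nn.ELU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.ELU(),
            nn.MaxPool3d(2),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
            nn.Linear(32, num_classes),
        )

    def forward(self, x):
        # x: (batch, 1, frames, range_bins, doppler_bins)
        return self.classifier(self.features(x))

# A 2-second clip of 30 range-Doppler maps, assumed here to be 64x64 bins.
logits = RadarConv3D()(torch.randn(2, 1, 30, 64, 64))  # shape (2, 6)
```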


Notes

  1. http://crcv.ucf.edu/data/UCF101.php.

  2. https://20bn.com/datasets/jester.

  3. Strictly speaking, we are dealing with a cross-correlation, as the kernel is not flipped.

  4. The data sets are publicly available at: https://www.imec-int.com/en/harrad.

  5. https://pytorch.org.


Acknowledgements

The research activities described in this paper were funded by Ghent University-imec, the Fund for Scientific Research-Flanders (FWO-Flanders), and the European Union.

Author information

Corresponding author

Correspondence to Baptist Vandersmissen.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: Indoor human activity recognition on combined data set

In this study, we developed a deep learning approach toward automatic indoor human activity recognition and validated it on two separate data sets, each applicable to a different domain. For the sake of completeness, we explore the efficacy of an integrated system capable of predicting the correct activity on a combined data set of gestures and events. To that end, both data sets are merged, and the 3d-CNN and ResCNN networks are employed for the radar and camera sensors, respectively. The combined data set consists of 3852 samples distributed over 12 different activities. Table 4 lists the total number of samples per activity. As in the experiments of Sects. 7.2 and 7.3, the sample length is set to 2 s, or 30 frames.
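As a rough illustration of this merging step, the sketch below combines the gesture and event samples into a single 12-class problem and fixes every sample at 30 frames by clipping or zero-padding. The helper names and the label offset of six are assumptions for illustration, not the authors' actual preprocessing code.

```python
# Illustrative sketch of building the combined 12-class data set with
# fixed-length samples; names and the offset-based relabelling are assumed.
import numpy as np

def fix_length(frames: np.ndarray, target: int = 30) -> np.ndarray:
    """Clip or zero-pad a (frames, H, W) sample to exactly `target` frames."""
    if len(frames) >= target:
        return frames[:target]
    pad = np.zeros((target - len(frames), *frames.shape[1:]), dtype=frames.dtype)
    return np.concatenate([frames, pad], axis=0)

def merge_datasets(gesture_samples, gesture_labels, event_samples, event_labels):
    """Combine both sets; event labels are shifted by 6 to yield 12 classes."""
    samples = [fix_length(s) for s in list(gesture_samples) + list(event_samples)]
    labels = list(gesture_labels) + [6 + l for l in event_labels]
    return np.stack(samples), np.asarray(labels)
```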

Table 8 shows the results obtained by both the radar- and video-based models. The results suggest that our approach remains valid for the combined data set. The radar-based 3d-CNN achieves error rates of 14.40% and 6.67% under the cross-validation and random split evaluation approaches, respectively. These results are similar to those obtained on the gestures data set (cf. Sect. 7.2). Similarly, the video-based ResCNN network obtains error rates of 3.52% and 2.70% for \({\overline{S}}\) and RS, respectively.

Furthermore, an experiment is conducted to show the benefit of fusing both sensors. More precisely, artificially darkened frames (denoted by the \(*\) operator) are used as input for the video-based model. This input has a clear negative effect on the error rate of the ResCNN network, which degrades by nearly 20% and 13% for \({\overline{S}}\) and RS, respectively. However, through the combined use of both sensor-specific networks, this effect is largely mitigated in the late fusion approach (Fused*): its performance degrades by only 2% in comparison with the use of clean RGB data. Moreover, the fused approach that uses artificially darkened video data still outperforms the radar-only approach by a margin of 2%.

Table 8 Results for leave-one-subject \(S_i\)-out cross-validation (\({\overline{S}}\)), with \(i \in \{1\dots 9\}\), and stratified random split (RS) for the combined data set
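The fusion rule itself is not detailed in this appendix; as a minimal sketch, assuming a simple late fusion that averages the per-class probabilities of the radar- and video-based networks, the combination step could look as follows.

```python
# Hedged sketch of a late-fusion step: class probabilities of the radar
# and video networks are averaged before taking the argmax. The exact
# fusion rule used in the paper is assumed here, not quoted.
import torch
import torch.nn.functional as F

def late_fusion(radar_logits: torch.Tensor, video_logits: torch.Tensor) -> torch.Tensor:
    """Average the per-class probabilities of both sensor-specific networks."""
    p_radar = F.softmax(radar_logits, dim=-1)
    p_video = F.softmax(video_logits, dim=-1)
    return ((p_radar + p_video) / 2).argmax(dim=-1)

# Even with artificially darkened video input, unreliable video probabilities
# are tempered by the radar branch, keeping the fused prediction close to
# the clean-RGB case.
```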

About this article

Cite this article

Vandersmissen, B., Knudde, N., Jalalvand, A. et al. Indoor human activity recognition using high-dimensional sensors and deep neural networks. Neural Comput & Applic 32, 12295–12309 (2020). https://doi.org/10.1007/s00521-019-04408-1

