Video analytic system for activity profiling, fall detection, and unstable motion detection

Real-time detection of falls and unstable movement by elderly people is vital to their quality of life and safety. We present an edge processing device integrated with a cloud computation framework that can be used for activity profiling as well as to trigger alerts for falls and unstable motion by elderly people at home. The proposed system uses fixed cameras to track and analyze each visible person in the scene, classifying their actions into nine ordinary activities, a fall, or unstable movement. An alert notification is sent to caregivers whenever a fall or unstable movement is detected. The major components of the system include an embedded device (NVIDIA Jetson TX2) and cloud-based storage and analysis infrastructure. The system is composed of modules for detecting, tracking, and recognizing humans, a cascaded hierarchical classifier for nine ordinary activities and falls, and a long short-term memory (LSTM) module to predict unstable movement in video. The system is designed for accuracy, usability, and cost. A prototype system has been subjected to individual module tests along with a field test within a volunteer's household. It achieved an accuracy of 91.6% for ordinary actions and falls with a recall of 97.02% for unstable motion. Future phases will expand deployment to multiple homes.


Introduction
For elderly people, living at home without full-time assistance exposes them to health risks, especially falls. According to a study by Al-Aama [1], more than one-third of elderly persons fall each year, and in half of these cases, falls recur. With elderly populations worldwide set to increase in the coming decades, there is a need for effective health monitoring systems. When an elderly person falls, aid must arrive nearly instantaneously to prevent chronic injuries or, in the worst case, death. Health care providers need a real-time system to enable immediate response and care. Most of the current research on elderly care using video processing focuses on fall detection or activity classification over time. It would be more beneficial to send caregivers real-time alerts of unusual behavior, helping to prevent incidents rather than only speeding up reaction after the fact. To our knowledge, there are no existing video analytic products that provide both real-time monitoring and alerts to caretakers in case of a problem. In this paper, we describe the prototype of an Internet of Things device geared towards this goal, using video processing to detect humans, track and recognize them, detect falls and unstable motion, and profile activity. The prototype can monitor an elderly person's behavior in real time using fixed cameras, classify their activities, and generate alerts when unstable behavior is detected. We envision and prototype a system to be deployed in an elderly person's home or a nursing home that enables caretakers to provide proactive care to reduce serious complications from a sudden fall. The unique features of the system include an edge/cloud architecture with a capable embedded device (NVIDIA Jetson TX2) and a cloud-based storage and analysis infrastructure. The system can recognize nine ordinary human activities, a fall, or unstable movement in a scene. The ordinary activities are sitting, standing, walking, bending, clapping, checking the time on a watch, talking on the phone, pointing at something, and waving, representing some of the possible common daily recreation and communication activities. The system additionally recognizes falls and unstable movement. The prototype has been subjected to individual module tests along with a field test within a volunteer's household. We target human action recognition accuracy and fall detection accuracy over 80%, along with a recall of over 95% for unstable movement detection.
Through our research, we aim to accomplish three goals: improve the existing video technology used for care of the elderly, extend the longevity of in-home elderly care, and spur entrepreneurial activity in the innovative exploitation of video technology for elderly care. Preliminary field testing of the prototype increases confidence in the usefulness of the concept. The main outcome is that we have introduced and developed a system to enhance care for elderly people at home. Tests of the individual modules and an overall field integration test with volunteers indicate sufficient accuracy and usability to move forward toward a commercial exploitation phase.

Related work
When elders fall, prompt assistance is required, since any delay can cause complications leading to severe mobility issues. Toward this end, wearable sensors have been widely adopted. Examples include wrist band sensors and fall prevention shoes that come with embedded cameras and obstacle-detecting lasers [13]. However, wearable devices come with the caveat that they are sensitive to noise in the environment, and the wearer may forget to use them or feel uncomfortable using them [14]. Fixed optical sensors have also been deployed in many research studies [2], ranging from analysis of movement and silhouette information over time [29] to extraction of high-level human activity spanning longer periods of time [5,28]. A study by Zhou et al. [29] found it possible to extract accurate silhouettes of humans using a combination of techniques such as foreground segmentation, morphological filtering, and continuous background updating. These techniques help account for changes in lighting and other environmental conditions. In contrast to silhouette blobs, our approach makes use of 2D skeleton joint key points that encode high-level spatial information, simplifying the action recognition task. Skeleton-based features tend to be more discriminative than blobs, since they provide an accurate representation of the human body. Zhou et al. [30] propose a zero-shot video object segmentation method that learns spatiotemporal object representations. It uses an encoder-bridge-decoder structure with a two-stream encoder that takes an input image and an optical flow field. In contrast, our system simply uses a single camera input stream and represents people for inference by their skeletons. Depth cameras have been used for precise measurement of gait-related parameters in both spatial and temporal domains [19]. Unfortunately, depth cameras are relatively large, have a limited range for accurate depth measurements, and cost more than ordinary red-green-blue (RGB) cameras.
Wang et al. introduce a new feature, the local occupancy pattern, for representing depth appearance [21]. Local occupancy can capture relationships between body parts and objects in the environment of a scene. Further, using data mining on specific conjunctions of features for subsets of joints called actionlets, they discover which actionlets are most discriminative for classifying human actions. The authors make use of depth cameras to obtain depth map sequences, then extract 3D joint positions and local occupancy pattern feature sets. Our system, in comparison, is much simpler, since we use RGB cameras and multiple views during training to enable accurate monitoring with multi-view 2D skeleton joints. Li et al. use a skeleton-based convolutional neural network (CNN) fitted with a transformer module capable of rearranging joints [12]. It selects important joints automatically, depending on the output of the skeleton transformer module. Adapting to multiple people requires a maxout scheme. Our system performs predictions for multiple people in real time without such a scheme. Zhang et al. make use of long short-term memory (LSTM) modules trained on a combination of hand-crafted features and neural network parameters [26]. They exploit geometric relational features obtained from distances between joints and selected lines of 3D skeletons. They concede that utilizing large numbers of such feature combinations leads to overfitting. We process 2D poses directly after a normalization process, without any further hand-crafted expansion of the feature space. Ibrahim et al. [10] propose an effective automated approach to activity recognition incorporating feature fusion and machine learning methods. They start by detecting moving objects in video, then extract HOG, Gabor, and color features from the moving object region. They fuse the best features serially, then classify the object's activity. Their approach uses a pipeline consisting of various pre-processing steps, including background subtraction, morphological operations, and a three-feature extraction process. Experiments demonstrate that the approach works well in videos with single individuals. However, it can be impractical in the presence of multiple objects or persons in the scene. In comparison, our system uses pose estimation to detect multiple persons in the video stream and provides separate classifications of the activities of each person in the scene. Yair et al. make use of a temporal convolutional neural network to learn spatio-temporal features for analysis and recognition of human activities using short videos as input [3]. Their architecture is based on a 3D convolutional layer and a convolutional long short-term memory (LSTM) layer. Their system, like ours, only requires RGB images from videos for classification, but they do not demonstrate real-time performance of the integrated system. Some of the results reported here have appeared previously in a research project report [25].
In the following section, we proceed with a description of the complete prototype system. The main functions are activity profiling and analysis of mobility in a scene. We integrate several existing modules with new modules, all of which run on commodity embedded hardware or server components.

Video analytic system
In the prototype elder care system, fixed cameras are used to capture video frames of each visible person. Each individual is tracked, and their actions over time are classified into nine ordinary activities, falls, and unstable movement. We consider unstable movement to include the moment before or after a fall that could be due to physical collapse, chest pain, an accidental slip, tripping over an obstacle, or a sudden fit of dizziness. When the system detects a fall or unstable movement, it sends an alert. The system is built using an embedded edge processing device, the NVIDIA Jetson TX2, connected to cloud-based infrastructure for storage and analysis functionality. Video-based human detection, tracking, and recognition, along with fall detection and detection of painful, unstable, and confused motion likely to lead to falls, are the key modules of the system.
The system consists of hardware components such as IP cameras, a network switch, and an NVIDIA Jetson TX2 embedded system. Due to privacy concerns, bedrooms and bathrooms are off limits, but the entrances and exits of such rooms are monitored. Video processing tasks are performed on premises on the embedded system, with consent from the resident. The resulting skeleton configuration sequences are forwarded to the cloud service.
The system in Fig. 1 comprises several modules that realize an edge-cloud computing architecture. People detection and tracking tasks are performed by edge modules. The cloud service performs face detection, face alignment, face recognition, activity classification, and analysis of unstable movement. A new cascade framewise classifier is used for recognizing actions, and a long short-term memory (LSTM) module predicts unstable movement. The two classifiers run in parallel on the cloud service. When a fall or an unstable action sequence is detected, an alert is sent via a RESTful web service, which then sends an instant message to a registered family member or doctor. To enhance safety further, the system also sends an alert when the person is absent for an extended period of time. Normal activities are used to generate a summarized action profile for each individual. The number of cloud workers required depends on the workload created by the data streams from the edge nodes. In the deployment test, we learned that three workers were sufficient for two volunteer houses with relatively infrequent appearances of family members. Using three workers guaranteed real-time processing, which is beneficial for timely fall detection and notification. For a large-scale deployment, however, scaling of the cloud resources would be required, though this would be easy to implement using auto-scaling technology.
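As a rough sketch of this alert path, the snippet below shows how a cloud worker might post a detected event to the RESTful service for relay to a caregiver. The endpoint URL, payload fields, and token handling are illustrative assumptions, not the deployed interface.

```python
# Minimal sketch of cloud-side alert dispatch, assuming a hypothetical
# RESTful endpoint and payload; not the system's actual interface.
import requests

ALERT_ENDPOINT = "https://example-eldercare-cloud/api/alerts"  # hypothetical URL

def send_alert(event_type: str, person_id: str, image_url: str, line_token: str) -> None:
    """Relay a fall/unstable-movement event to the RESTful alert service."""
    payload = {
        "event": event_type,      # e.g. "fall", "unstable", "extended_absence"
        "person": person_id,
        "image_url": image_url,   # hashed, login-protected URL of the event frame
        "line_token": line_token, # token registered by the caregiver
    }
    resp = requests.post(ALERT_ENDPOINT, json=payload, timeout=5)
    resp.raise_for_status()
```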
Several categories of relevant daily physical, recreational, and communication activities for elderly people detectable through hand gestures and body postures have been proposed in previous research [6,16,24]. From this work, we decided on sit, stand, walk, and bend to indicate daily mobility, along with clapping, checking the time on a watch, talking on the phone, pointing at something, and waving to represent some of the possible daily recreation and communication activities. To model these tasks, we collected video recordings of individuals performing the aforementioned nine activities from different sources, along with examples of simulated falls and unstable movement. The complete data comprise four datasets:

1. The MoVi online dataset (90 people) [7].
2. Artificial Intelligence Center Lab (AIC Lab; 12 people).
3. Volunteer house (AIC Volunteer Set 1; three people).
4. Volunteer house (AIC Volunteer Set 2, recorded two weeks after Set 1; three people).

Datasets 2, 3, and 4 contain the nine activities plus falls and unstable motion. Dataset 1 contains the nine activities only (without falls or unstable motion). Dataset 1 has high-resolution activities captured in a lab at tripod level, whereas Dataset 2 has lower-resolution activities from a more typical CCTV camera viewpoint, also in a clean lab, and Datasets 3 and 4 utilize actual CCTV setups in real homes. Table 1 characterizes the datasets.
For the new datasets (Datasets 2, 3, and 4), at each location, we recorded video using eight cameras that together give a 360° view of the scene and cropped the recordings into short clips, each expressing one activity. We processed each frame of each clip to detect people and estimate their pose, then stored each resulting sequence of skeleton points with a corresponding activity label in a CSV file for training and testing. Figures 2, 3, and 4 show samples from each dataset. Figure 5 shows examples of simulated falls/unstable movement in Dataset 2 (Artificial Intelligence Center Lab), and Fig. 6 shows examples of simulated falls/unstable movement in Datasets 3 and 4 (volunteer house). We considered five different conditions leading to a fall: heart/chest pain, slipping, dizziness, collapsing, and tripping on an obstacle. We manually labeled the motion leading up to each such fall event as unstable movement.
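A minimal sketch of this preprocessing step follows: each frame's estimated 2D skeleton is flattened and stored together with the clip's activity label. The column layout and the 18-joint (36-coordinate) skeleton are our assumptions for illustration, not the exact on-disk format.

```python
# Sketch of skeleton-sequence-to-CSV preprocessing; the row layout
# (label followed by x,y pairs) is an assumed format.
import csv

NUM_JOINTS = 18  # assumption: an OpenPose-style 18-joint skeleton

def write_skeleton_clip(csv_path, skeleton_frames, label):
    """skeleton_frames: list of [(x, y), ...] with NUM_JOINTS entries per frame."""
    with open(csv_path, "a", newline="") as f:
        writer = csv.writer(f)
        for joints in skeleton_frames:
            row = [coord for (x, y) in joints for coord in (x, y)]
            writer.writerow([label] + row)
```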

People detection, tracking, and recognition
Our human detection system makes use of OpenPose [4] or TRTPose [23]. The OpenPose network is an open-source real-time system comprising multiple stages performing a series of detection passes. Provided an input image of 368 × 368 pixels, OpenPose utilizes an ImageNet-pretrained VGG-19 backbone [18] to extract basic appearance features. This backbone runs efficiently on the Jetson TX2, according to benchmarks [15], and it outperforms ResNet [9], using fewer parameters, obtaining lower loss scores, and obtaining higher accuracy, precision, and recall [17]. EfficientPose [8] is an alternative to OpenPose that uses EfficientNet [20] as the backbone and outperforms OpenPose in terms of accuracy, but it is a single-person pose estimation model that, to our knowledge, has not been benchmarked on the Jetson TX2. When using OpenPose, we create an expanded bounding box around the detected skeleton points to ensure the entire body is included, defining a region of interest (ROI) for each person in each frame. We use DeepSORT [22] to track the ROI for each subject from frame to frame. Recognition modules implement face detection and face alignment using MTCNN [27] and the FaceNet [11] pretrained network for extracting face embeddings from gallery images and probe images of family members in order to identify each family member in the scene. These modules work to detect people from every camera viewpoint and keep track of any person in the video. Simultaneously, we accumulate an ROI for each detected and tracked person. A sequence of these image patches is sent to the activity classifier and also to the unstable movement prediction module. A cloud-based storage and analysis infrastructure is integrated along with these modules. The cloud-based service sends a notification to caregivers when any unstable motion event, fall event, or extended absence of the person of interest is observed. For notifying caregivers, we make use of the LINE Application Programming Interface (API); LINE is a freeware app for instant communications on electronic devices such as smartphones, tablets, and personal computers, and we chose it because it is easy to use, in widespread use in Asia, and readily accessible by mobile phone or computer. A LINE token provided by the user is required for authentication and access to the web application. The notification message includes the type of incident that occurred along with a Uniform Resource Locator (URL) of an image of the concerned event. Image URLs contain hashes of random data to prevent guessing of valid URLs, and users also require a login session to access image files.
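The sketch below illustrates the expanded bounding box computed from detected skeleton key points; the 20% margin is an illustrative choice, as the exact expansion factor is not specified above.

```python
# Sketch of the expanded ROI around detected skeleton key points.
# The margin value is an assumption, not the system's exact parameter.
import numpy as np

def skeleton_roi(keypoints: np.ndarray, frame_w: int, frame_h: int, margin: float = 0.2):
    """keypoints: (N, 2) array of valid (x, y) joint positions for one person."""
    x_min, y_min = keypoints.min(axis=0)
    x_max, y_max = keypoints.max(axis=0)
    dx, dy = (x_max - x_min) * margin, (y_max - y_min) * margin
    # Expand and clip to the frame so the entire body falls inside the ROI.
    x0 = max(0, int(x_min - dx)); y0 = max(0, int(y_min - dy))
    x1 = min(frame_w, int(x_max + dx)); y1 = min(frame_h, int(y_max + dy))
    return x0, y0, x1, y1
```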

Activity classification
Figure 7 shows the workflow of the activity classification module, which contains a cascaded hierarchical classifier for nine ordinary activities and falls. Actions such as checking the time, talking on the phone, clapping, pointing, and waving may occur concurrently with sitting, standing, and walking, while they cannot (by assumption) co-occur with bending or falling. Hence, we can separate the major group of activities (sitting, standing, and walking) from the "minor" activities (check time, clap, phone call, point, and wave). The motivation for designing two different models is to differentiate these five minor activities, which can be performed contemporaneously with sitting or standing, from the others; moreover, not all combinations of sitting and standing with those five activities occur with equal frequency in the training data. By training on only the upper half of the body for these five activities, the minor-activity model can correctly classify the minor activity regardless of the major activity (sitting/standing posture). On the other hand, many activities that were not trained upon, such as cooking, washing hands, or scratching one's head, have body postures similar to the five minor activities, such as checking the time, clapping, or talking on the phone. To differentiate such untrained activities, we first classify the major activity (sitting, standing, or walking), then use a separate model for the minor activity. A single model would confuse activities such as cooking with activities such as checking the time due to similar body postures and activity profiles, but the dual model is well situated to respond with the major category only in the event that the minor activity is unfamiliar. The Activity-1 model is composed of five convolutional layers, each followed by batch normalization and ReLU. The output layer is a five-unit fully connected linear layer with softmax activation. Training minimizes cross entropy loss on 32-frame batches using the Adam optimizer with a learning rate of 0.001 for 480 iterations. Class sampling weights are used to balance training across classes. The three Activity-2 models have the same design as the Activity-1 model except that they have six units in the output layer rather than five. The detailed model parameters can be seen in Fig. 8.
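The following PyTorch sketch mirrors the Activity-1 design just described: five convolutional layers, each followed by batch normalization and ReLU, and a five-unit fully connected output trained with cross entropy and Adam at a learning rate of 0.001. Treating the 36 normalized skeleton coordinates as a 1D signal, and the channel widths and kernel size, are our assumptions; Fig. 8 gives the exact parameters.

```python
# Sketch of the Activity-1 framewise model. Channel width and kernel size
# are assumptions; the paper's optimizer and learning rate are used.
import torch
import torch.nn as nn

class Activity1Net(nn.Module):
    def __init__(self, in_len: int = 36, n_classes: int = 5, width: int = 64):
        super().__init__()
        layers, ch_in = [], 1
        for _ in range(5):  # five conv + batch norm + ReLU blocks
            layers += [nn.Conv1d(ch_in, width, kernel_size=3, padding=1),
                       nn.BatchNorm1d(width), nn.ReLU()]
            ch_in = width
        self.features = nn.Sequential(*layers)
        self.fc = nn.Linear(width * in_len, n_classes)

    def forward(self, x):             # x: (batch, 1, 36) skeleton coordinates
        h = self.features(x)
        return self.fc(h.flatten(1))  # logits; CrossEntropyLoss applies softmax

model = Activity1Net()
opt = torch.optim.Adam(model.parameters(), lr=0.001)  # as described in the text
loss_fn = nn.CrossEntropyLoss()  # class sampling weights can be passed via `weight=`
# Training would run 480 iterations over 32-frame batches, per the text.
```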
We combined Datasets 1-3 and split them randomly into 70% for training, 10% for validation, and 20% for testing the models (Dataset 4 was reserved for a final test). The validation data were used to fine-tune hyperparameters, then the resulting models were tested on the test dataset. Table 2 shows a summary of the training, validation, and test set distributions for the four models. Test confusion matrices are shown in Tables 3, 4, 5, and 6. The overall accuracy on the test set (considering all four models) is 96%.
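A minimal sketch of the 70/10/20 split follows; the placeholder arrays stand in for the pooled skeleton rows and activity labels from Datasets 1-3, and the stratified split is our assumption.

```python
# Sketch of the 70/10/20 train/validation/test split with placeholder data.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 36)             # placeholder skeleton features
y = np.random.randint(0, 10, size=1000)  # placeholder activity labels

X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=0.70, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=2/3, stratify=y_rest)  # 10% validation, 20% test
```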

Sequence-based stable/unstable movement classification
Figure 9 presents the architecture of the recurrent neural network (RNN) model trained to predict unstable movement. We trained LSTM cells to capture temporal information across a sequence of frames. The input to the model is a 40-frame sequence of 36 skeleton key point coordinates per frame; the training set contains consecutive framesets with 35 overlapping frames. The 40 × 36 input layer is followed by a fully connected layer, three LSTM layers, then a final fully connected layer with two outputs (stable or unstable motion). To reduce overfitting and improve generalization, 40% dropout is applied to the output of the second LSTM layer, 50% dropout is applied to the output of the final LSTM layer, and L2 regularization is applied throughout the model. A batch size of 2000, a learning rate of 0.0005, and cross entropy loss are used for training. We trained the model for 94 epochs with the Adam optimizer. The training, validation, and test data for the unstable motion classifier excluded Dataset 1 and the fall category examples in Datasets 2 and 3 but were otherwise the same as for the framewise activity classifier, except that sequences with fewer than 40 frames were excluded. Overall test set accuracy was 94.72%, as shown in Table 7, with the confusion matrix shown in Table 8.
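The following PyTorch sketch reconstructs this architecture from the description above: a 40 × 36 input, a fully connected layer, three LSTM layers with 40%/50% dropout after the second and third, and a two-way output. The hidden size is our assumption, and L2 regularization is approximated via Adam's weight_decay.

```python
# Sketch of the unstable-movement LSTM model. Hidden size and weight_decay
# value are assumptions; layer structure and dropout rates follow the text.
import torch
import torch.nn as nn

class UnstableMotionLSTM(nn.Module):
    def __init__(self, n_coords: int = 36, hidden: int = 64):
        super().__init__()
        self.fc_in = nn.Linear(n_coords, hidden)
        self.lstm1 = nn.LSTM(hidden, hidden, batch_first=True)
        self.lstm2 = nn.LSTM(hidden, hidden, batch_first=True)
        self.drop2 = nn.Dropout(0.4)   # 40% dropout after the 2nd LSTM layer
        self.lstm3 = nn.LSTM(hidden, hidden, batch_first=True)
        self.drop3 = nn.Dropout(0.5)   # 50% dropout after the final LSTM layer
        self.fc_out = nn.Linear(hidden, 2)  # stable vs. unstable

    def forward(self, x):              # x: (batch, 40, 36)
        h = torch.relu(self.fc_in(x))
        h, _ = self.lstm1(h)
        h, _ = self.lstm2(h)
        h = self.drop2(h)
        h, _ = self.lstm3(h)
        h = self.drop3(h[:, -1, :])    # classify from the last time step
        return self.fc_out(h)

model = UnstableMotionLSTM()
# lr 0.0005 per the text; weight_decay stands in for L2 regularization.
opt = torch.optim.Adam(model.parameters(), lr=0.0005, weight_decay=1e-4)
```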

Results
The final system was tested on Dataset 4. We first implemented and tested the system in a controlled lab environment with minimal background noise. Furthermore, participants performed similar activities during training and testing, which resulted in good accuracy.
We then transferred the models from the lab environment to a real environment, which required fine-tuning of the models, because real-world environments have a greater variety of human activities and background noise not encountered during training. We therefore repeatedly fine-tuned our models on data from the real environment until they achieved the system's accuracy targets on novel test data. Detailed results for the framewise activity classifier and the unstable motion detector follow.
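A sketch of such a site-adaptation loop is shown below, reusing the Activity1Net sketch from earlier. The checkpoint file name, reduced learning rate, epoch count, and placeholder data are illustrative assumptions.

```python
# Sketch of fine-tuning lab-trained weights on data from a new home.
import torch
from torch.utils.data import DataLoader, TensorDataset

model = Activity1Net()                                 # from the earlier sketch
model.load_state_dict(torch.load("activity1_lab.pt"))  # hypothetical checkpoint
opt = torch.optim.Adam(model.parameters(), lr=1e-4)    # reduced lr for fine-tuning
loss_fn = torch.nn.CrossEntropyLoss()

# Placeholder standing in for labeled skeleton frames from the new home.
site_data = TensorDataset(torch.randn(256, 1, 36), torch.randint(0, 5, (256,)))
site_loader = DataLoader(site_data, batch_size=32, shuffle=True)

model.train()
for epoch in range(20):   # repeat until accuracy targets are met on novel data
    for x, y in site_loader:
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
```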

Framewise activity classification
The final test data from Dataset 4 comprised 132 video files showcasing the 10 activities. Table 9 shows the dataset distribution for each activity, and Table 10 shows the final framewise test result of each of the models in the cascade.
Framewise classification tends to be noisy, with occasional misclassifications within a sequence. Figure 10 shows an example from a walking sequence in which an intermediate pose appears similar to a standing-still pose. To increase the accuracy of the framewise model, we therefore aggregate each sequence of 40 frames (1.6 s) and output the model's majority vote over the sequence, as sketched below. This improves accuracy substantially, as shown in Table 11. Out of 19 fall videos, 18 were correctly classified, and one was misclassified as bend, resulting in 95% accuracy for falls. Every activity was recognized with at least 80% accuracy, with 91.6% overall accuracy. Figure 11 shows samples of the results.
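A minimal sketch of the 40-frame majority-vote smoothing:

```python
# Majority vote over a 40-frame (1.6 s) window of framewise predictions.
from collections import Counter

def aggregate_window(framewise_labels):
    """Return the majority label over one 40-frame window."""
    return Counter(framewise_labels).most_common(1)[0][0]

# Example: a noisy walking sequence with a few 'stand' misclassifications.
window = ["walk"] * 34 + ["stand"] * 6
assert aggregate_window(window) == "walk"
```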
We further evaluated the system for two weeks at the same volunteer house. Figure 12 shows predicted activity statistics over the 14 days. Some activities that were not trained upon, such as cooking, resulted in a clapping prediction, as the two postures are very similar. There was a scene in which a person was cleaning the floor, resulting in a predicted fall, because the posture was not familiar to the trained model. In some cases, the person was occluded by clothing hangers, so fewer activities than actually occurred were detected.

Sequence-based stable/unstable movement classification
On Dataset 4, the model is able to successfully identify unstable movements (collapse, simulated chest pain, dizziness, unstable movement after a fall, slipping, and hitting an obstacle) as well as stable motion. The stable/unstable motion classifier achieves a recall of 97.02%, at the cost of lower precision. Table 12 shows the confusion matrix of the test results, and Fig. 13 shows samples of correct classification results.
As a performance baseline, we trained a feed-forward neural network on the same data as the LSTM to obtain a framewise classifier. It considers each frame individually rather than the sequence of 40 frames used by the LSTM. We trained on a subset of the full dataset, utilizing cropped short clips from Dataset 2, each expressing one activity. The baseline obtained a test accuracy of 90.70%. This shows that the LSTM is able to utilize sequence information representing motion effectively.

Overall system test
The edge-cloud system integrates modules for multi-person identification, cascading framewise activity recognition, unstable movement prediction, and activity profiling. The web service can display activity summaries for individuals as well as camera views and periods of absence. Figure 14 shows sample screenshots from the web service, and Fig. 15 shows sample notifications.

Usability from health care perspective
Besides the quantitative accuracy evaluations just described, we also offer a qualitative evaluation of the system's benefits to caregivers and families taking care of vulnerable elders. The system relies on a simple but effective arrangement that combines input from CCTV cameras and motion trackers with trained models capable of detecting various human activities, ranging from normal actions such as walking, standing, and sitting to unstable movements and falls that can lead to debilitating health problems. The system sends notifications to family members instantly in case of a fall incident, and it can track activities over a desired duration of time. It was designed with ease of use, privacy, and security as guiding principles, with appropriate authentication and authorization. Events detected in each camera, based on the camera setup, motion search, and analysis, are delivered reliably by the system, though more investigation of the system's reliability and accuracy is still needed. Since the system also provides a real-time monitoring mechanism, it has strong potential to enhance patient mobility as part of a remote physical therapy program; health care professionals could build adaptations into the system to enhance the mobility of elderly patients in conjunction with their own remote physical therapy programs. We conclude that artificial intelligence and information technology can improve health care for elderly people immensely, that the system merits further development toward this end, and that it is viable as a secure monitoring system with an effective use case as a health care tool.

Conclusion
Experiments conducted thus far with the prototype system support its potential efficacy in helping with in-home elderly care and its ease of use for caretakers and health care professionals. Accuracy achieved in the limited range of conditions in the field test meets the targets set out for the system and is sufficient for the next step of commercialization. Some of the errors that do occur are no fault of the system; occlusion of one individual by another or by furniture is an unavoidable problem. The relatively high cost of high-quality IP cameras and the NVIDIA Jetson platform is an obstacle for a typical household budget in Asia. Further, the models were trained under particular conditions that may affect their prediction ability; accuracy is therefore likely to fall when the same models are used in homes with different environmental conditions, especially different room lighting. In the same vein, the fidelity of the training data may affect outcomes, as there will always be a difference between the simulated behavior used for training and the real-world behavior of the residents of each home in which the system is installed. To overcome these issues, as with most deep learning models, a period of fine-tuning is needed to adapt the models to each new site until exposure to a sufficiently wide range of conditions is achieved. In this paper, we have introduced and described the design, evaluation, and deployment of a system to enhance care for elderly people at home. Tests of the individual modules and an overall field integration test with volunteers indicate sufficient accuracy and usability to move forward toward a commercial exploitation phase.
The next phase of research and development will need to focus on striking the right balance between the costs and benefits of the system to elderly people and their caregivers. Since this research began, NVIDIA has released a new embedded system board, the Jetson Xavier NX, that is cheaper and more powerful than the Jetson TX2, creating a new route to further reduce costs and increase the effectiveness of the elder care system.
One obstacle to large-scale, low-cost deployment of the elder care system as described here is the typically low bandwidth between households and cloud providers' data centers. We expect bandwidth bottlenecks to be eliminated through the introduction of low-cost 5G mobile access and/or fiber to the home with high bandwidth to the data center. This would enable us to lower edge device costs and do more of the processing in the data center, where economies of scale can be exploited.

Data Availability
Four datasets are described in the manuscript. Dataset 1 (the MoVi online dataset) is a human motion and video dataset that is publicly available and cited in the reference list. Datasets 2, 3, and 4 (Artificial Intelligence Center Lab and volunteer houses), generated during the current study, are not publicly available, for reasons of privacy and confidentiality protection required by the Institutional Review Board (IRB) at Mahidol University, Thailand.

Declarations
The manuscript describes recent work and is not under consideration for publication elsewhere. Some of the results have been previously publicized in a research report for the funding donor's annual online compendium, the "NBTC Journal." The new manuscript is substantially extended from the research report, with a more extensive review of the literature, more detailed descriptions of the methodologies, and additional experimental work. Experiments involving human participants were reviewed and approved by the Institutional Review Board at Mahidol University, Thailand. All authors have approved the manuscript and this submission. The authors certify that there is no conflict of interest with any financial, research, or academic organization with regard to the content or research work discussed in the manuscript.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Fig. 7
Fig. 7 Workflow of the cascade framewise activity classifier

Fig. 9
Fig. 9 Architecture of the unstable motion detection model

Fig. 10
Fig. 10 Misclassification by framewise classifier

Fig. 11
Fig. 11 Samples of activity classification results

Fig. 12
Fig. 12 Predicted activity statistics over 14 days

Table 11
Table 11 Aggregated framewise activity classification final test results

Table 1
Dataset characterization

Table 2
Dataset distribution for cascade framewise activity classification

Table 4
Activity-2 test set confusion matrix (sitting activities)

Table 5
Activity-2 test set confusion matrix (standing activities)

Table 6
Activity-2 test set confusion matrix (walking activities)

Table 7
Stable-unstable motion classifier dataset distribution and accuracy

Table 8
Stable-unstable motion classifier confusion matrix

Table 10
Framewise activity classification results

Table 12
Stable/unstable motion classifier final test results