Virtual reality in training artificial intelligence-based systems: a case study of fall detection

Artificial Intelligence (AI) systems generally require training data of sufficient quantity and appropriate quality to perform efficiently. However, in many areas, such training data is simply not available or extremely difficult to acquire. Recent developments in Virtual Reality (VR) have opened a new door for addressing this issue. This paper demonstrates the use of VR for generating training data for AI systems through a case study of human fall detection. Fall detection is a challenging problem in the public healthcare domain. Despite significant efforts devoted to developing reliable and effective fall detection algorithms and devices in the literature, only limited success has been achieved. The lack of recorded fall data and poor data quality have been identified as major obstacles. To address this issue, this paper proposes an innovative approach that removes the aforementioned obstacle using VR technology. In this approach, a framework is first proposed to generate human fall data in virtual environments. The generated fall data is then tested with state-of-the-art vision-based fall detection algorithms to gauge its effectiveness. The results indicate that the virtual human fall data generated using the proposed framework have sufficient quality to improve fall detection algorithms. Although the approach is proposed and verified in the context of human fall detection, it is applicable to other computer vision problems in different contexts, including human motion detection/recognition and self-driving vehicles.

trained model to evaluate and verify the proposed approach. Finally, Section 5 presents the concluding remarks and future work.

Related works
This section summarizes existing techniques used in the fall detection literature and their limitations. It also highlights the need for computer-generated (virtual) fall datasets through a critical review of the existing datasets used in fall detection research. Finally, related works on using synthesized data for training machine learning algorithms are reviewed to reinforce the research presented in this paper.

Fall detection approaches
Human falls, especially among elderly people, are a major cause of fatal injuries, and they can create serious impediments to independent living [31]. According to the World Health Organization, about 28-42% of elderly people aged 65 and older fall every year, and falls are the primary cause of injury-related deaths for this age group [53]. The frequency of falls increases with age and frailty level [40]. Studies have shown that early fall detection is one of the key factors in reducing the severity of fall consequences [31,40]. Without assistance, many fall victims are unable to get up, and long periods of lying on the floor can lead to hypothermia, dehydration, bronchopneumonia and pressure sores [40]. This is particularly critical if the person lives alone or loses consciousness after falling.
As shown in recent studies [4,15,17,23,26,35,37-39,48], it has become very important in the public healthcare domain to develop intelligent and reliable surveillance systems that can automatically track, detect, and promptly report falls. As a result, various techniques have been developed in the literature to automatically detect and prevent human falls [43,52,54]. In general, fall detection techniques can be divided into two main groups: (i) techniques that rely on wearable sensors, and (ii) techniques that are based on computer vision.
Techniques in the first group try to detect falls from abnormal changes in sensor readings. Different types of sensors have been used for this purpose, including dedicated health monitoring devices, such as blood pressure and heart rate sensors, as well as multipurpose devices, such as the accelerometers and gyroscopes in smartphones. Threshold-based and machine learning based classification algorithms have been used to detect whether an abnormal change in sensor readings is a fall. A drawback of using wearable sensors is that the person under monitoring must wear the sensors at all times. In addition, multiple sensors must be worn to reduce the rate of false positives. As pointed out in [43,52], the performance of many existing wearable-sensor-based algorithms was much lower in real-world scenarios than what they achieved under a simulated environment.
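To make the threshold-based idea concrete, the following is a minimal sketch (not code from any cited work) of detecting a fall from tri-axial accelerometer readings: a dip in acceleration magnitude suggesting free fall, followed by a spike suggesting impact. The threshold values are illustrative assumptions, not values from the literature.

```python
import math

# Illustrative thresholds in units of g; real systems tune these per device.
FREE_FALL_G = 0.4   # magnitude drop suggesting free fall
IMPACT_G = 2.5      # magnitude spike suggesting impact with the ground

def magnitude(sample):
    """Euclidean norm of one (ax, ay, az) accelerometer reading."""
    ax, ay, az = sample
    return math.sqrt(ax * ax + ay * ay + az * az)

def detect_fall(samples):
    """Flag a fall when a free-fall dip is followed by an impact spike."""
    saw_free_fall = False
    for s in samples:
        m = magnitude(s)
        if m < FREE_FALL_G:
            saw_free_fall = True
        elif saw_free_fall and m > IMPACT_G:
            return True
    return False

# Standing still reads roughly 1 g; a fall dips below 0.4 g then spikes.
walking = [(0.0, 1.0, 0.1)] * 10
fall = [(0.0, 1.0, 0.1)] * 3 + [(0.0, 0.2, 0.1), (0.5, 3.0, 0.4)]
```

A single-threshold rule of this kind is exactly what produces the high false-positive rates noted above, which is why multiple sensors or learned classifiers are preferred in practice.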
Unlike the first group, techniques in the second group (computer vision based) detect falls using video footage collected from various types of cameras operating at different frequencies, ranging from radio frequency to infrared and visible light. The generic model for fall detection used by techniques in this group is illustrated in Fig. 1. In this type of model, distinctive features of a fall, such as the magnitude of the acceleration and the angular velocity, are extracted from the video data and then fed to a classifier to distinguish fall from non-fall situations. Machine learning based classifiers are commonly used in this model for feature classification due to their performance in comparison to other classifiers, such as rule-based ones [43]. As camera footage contains more contextual information, techniques in the second group often perform better than those in the first group in terms of false-positive rate. In contrast, as it takes more time and resources to process video data, the detection time is often longer in the second group.
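The feature-extraction-plus-classifier pipeline of Fig. 1 can be sketched with one very simple feature: the vertical velocity of the tracked person's bounding-box centroid across frames. The rule-based classifier and the velocity threshold below are illustrative stand-ins, not the model of any cited paper.

```python
def centroid_y(bbox):
    """Vertical centre of an (x, y, w, h) bounding box, y growing downward."""
    x, y, w, h = bbox
    return y + h / 2.0

def vertical_velocity(bboxes):
    """Per-frame downward velocity of the centroid, in pixels per frame."""
    ys = [centroid_y(b) for b in bboxes]
    return [ys[i + 1] - ys[i] for i in range(len(ys) - 1)]

def classify(velocities, threshold=1.5):
    """'fall' if the centroid ever drops faster than the threshold."""
    return "fall" if any(v > threshold for v in velocities) else "no-fall"

# A sudden drop of the centroid between frames reads as a fall.
falling = [(100, 50, 40, 120), (100, 60, 40, 110), (100, 150, 60, 40)]
steady = [(100, 50, 40, 120)] * 4
```

In the machine learning variants discussed in this paper, the hand-set threshold is replaced by a trained classifier operating on richer features such as optical flow.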
Although machine learning algorithms have demonstrated superior performance in several fall detection studies [43], obtaining accurate fall detection results from machine learning based approaches requires a dataset with a large number of falls collected from different real scenarios [25]. However, there is a lack of real-world fall data [43], and collecting a quality real-world dataset is challenging for cost and privacy reasons. Therefore, it is essential to provide a solution to overcome this data scarcity issue. Before presenting such a solution, a review of existing fall datasets is given in the following subsection to discuss their features and limitations, and to further highlight the need for a VR based fall data generation approach.

Existing datasets for fall detection
Data play a vital role in training, verifying and testing machine learning and computer vision algorithms [16,20]. The performance of these algorithms largely depends on the quality of the datasets they were trained with. A comprehensive dataset should have sufficient data quantity and diversity to adequately represent the population of the problem under study. However, real-world datasets, especially fall datasets, are usually incomplete and do not have the required quality.
The literature on fall detection indicates that only a few fall datasets, including the UR Fall dataset [26], the Multiple Cameras (MC) Fall dataset [5], and the Fall Detection (FD) dataset [12], are publicly available and have been used in different research works. An overview of the statistics and characteristics of these datasets is provided in Table 1. Although the datasets contain a substantial set of human fall and non-fall scenarios, they do have some limitations. As shown in Table 1, there is a lack of diversity in the environment, lighting and camera settings. In particular, most of the scenarios in these datasets were recorded from the same camera angles, in the same place, under the same furniture setting and lighting condition. It is well known that application to a new location is one of the critical factors impacting the performance of computer vision and machine learning algorithms, as this type of algorithm generally performs best under conditions where the training and testing environments are closely similar [20,46]. In addition to the diversity issue, the number of fallers is very small, with only one faller in each video. The small number of fall events and the dimension of the fall data are other factors that heavily affect the generalization capability of any fall detection algorithm trained using these datasets. Another limitation of these datasets is that they contain only simulated fall data, i.e., data recorded while the falls were simulated by healthy and young individuals, whose falls may be quite different from those of elderly people. However, overcoming these limitations is challenging in practice. Real falls are often not predictable, and attempts to record real falls may raise privacy issues. Meanwhile, conducting simulated falls in a large number of scenarios is not practical for time and resource reasons.
Fig. 1 The generic fall detection model (adapted from [43])
Hence, generating and collecting data using VR technology, as proposed in this research work, can be a solution to some of the above-mentioned challenges.

Initial success of virtual datasets and research gaps
The virtual world has been recognized as an environment that facilitates creating online laboratories at a low cost [6]. In the computer vision domain, video games, synthetic images, and 3D modeling have been used by researchers as data sources for training various models, including object detection algorithms [22,41,45,49,50]. For example, crowdsourced 3D CAD models were used to train a Convolutional Neural Network (CNN) for object detection [56]. Synthetic images were used to train a CNN for vehicle detection in [27,28]. Moreover, video games, such as Half-Life, were utilized to generate a virtual dataset to train an SVM-based algorithm for pedestrian detection in video streams [30]. More recently, the ParallelEye Vision framework, which relies on VR technology, was proposed to generate synthetic natural scenes and virtual images with precise annotations to successfully pre-train a detection model, which was later fine-tuned using real datasets [28]. VR-generated datasets have further been used for the development of tree detection/recognition in a driver assistance system [22], and parts recognition in automated assembly lines and production [56]. As a follow-up to their previous work, Li et al. [28] used the ParallelEye-CS virtual dataset for the training and testing of their proposed system for intelligent vehicles.
In the field of fall detection research, only one group of researchers has recently used motion capture technology to simulate human falls. The publicly available FUKinect-Fall dataset [4] contains walking, bending, sitting, squatting, lying and falling actions performed by 21 actors aged between 19 and 72 years. The FUKinect-Fall dataset is very useful for constructing fall scenarios and can be used to train/test fall detection algorithms.
The initial success of the previous methods in using synthesized data for object detection has motivated us to build virtual fall datasets for human fall detection. To the best of our

Methodology
This section details the methodology for generating different fall scenarios. The block diagram of the proposed VR based human fall data generation is depicted in Fig. 2. It starts with the generation of humanoid models, followed by applying different animation methods for simulating the fall models, and finally discussing the construction of the fall datasets. Details of each step are presented in the subsequent subsections.

Humanoid 3D Model Generation
Although 3D humanoid models are popular in the game and entertainment field, most of them do not have the required quality to simulate human activities realistically for purposes other than gaming and entertainment. Very few research works have been conducted on creating 3D humanoid models for biomechanical purposes [1,8]. In an early work, Boulay et al. [8] used SimHuman and the Mesa library to create a humanoid model with 23 parameters for human posture recognition. Mesa is a 3D graphics library with an API (Application Programming Interface) very similar to that of OpenGL. Although this approach is highly flexible for modeling purposes, it requires a substantial coding effort to create a humanoid model. More recently, an adequate procedure has been created by the MakeHuman (MH) project (MH, 2020) to generate realistic humanoid models, mainly used for speech therapy and human anatomy [11]. The MH tool has many functionalities, including built-in functions for creating and calibrating a humanoid model with gender, age, and other biological characteristics. As the models generated by this software are realistic and can be exported to different 3D modeling and VR engines, such as Blender and Unity3D, with a choice of geometries, materials, and skeleton information, MH version 1.1.1 is used to create humanoid models for fall simulation in this research work. The current version of the software allows choosing between 4 skeleton models (or armatures) with either 31, 53, 137, or 163 bones. The options with 53, 137 and 163 bones allow capturing finger movement and some facial expressions. As this research work focuses on fall detection, the 31-bone option is appropriate and sufficient for generating falls from the external motion capture data, and it is the option used in our experimentation. Other options can also be used depending on the motion generation algorithms.
Fig. 2 The proposed human fall data generation framework
For example, the 163-bone option provides a much finer movement control when forward or inverse kinematics are used. Figure 3 shows four different skeleton models and their characteristics of which the 31-bone option is highlighted by the yellow rectangle.

Fall motion generation
In general, a fall motion in VR is generated by animating a humanoid model. The animation is controlled by the model's skeleton or armature, which can be rigged into different poses. Each bone in the skeleton drives a group of vertices of the mass model. The level of bone impact on vertices is set by a set of adjustable weights. Figure 4 demonstrates the impact of the shoulder bone on the mass model using a heatmap. The red area is where the highest impact is observed, and the impact gradually reduces from the yellow to the green areas. The blue area is where the bone has no impact. Although a fall simulation can be created by manually rigging each bone of the skeleton, this method is labor-intensive and not practical. In the following subsections, a few automatic and semi-automatic simulation methods used in our proposed framework for generating synthetic falls are discussed in detail.
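The weighted bone-to-vertex influence described above (the heatmap of Fig. 4) can be illustrated with linear blend skinning, reduced here to 2D translations for brevity; a real rig applies a full rotation-plus-translation matrix per bone. This is a generic sketch of the technique, not code from the tools used in the paper.

```python
def skin_vertex(rest_pos, bone_offsets, weights):
    """Blend each bone's displacement of a vertex by its skinning weight.

    rest_pos:     (x, y) rest position of the vertex
    bone_offsets: per-bone (dx, dy) displacement for the current pose
    weights:      per-bone influence, expected to sum to 1.0
    """
    x, y = rest_pos
    dx = sum(w * ox for w, (ox, oy) in zip(weights, bone_offsets))
    dy = sum(w * oy for w, (ox, oy) in zip(weights, bone_offsets))
    return (x + dx, y + dy)

# A vertex influenced 80% by one bone and 20% by a neighbouring bone:
# the heatmap's red region corresponds to weights near 1.0, blue to 0.0.
moved = skin_vertex((1.0, 2.0), [(0.0, 1.0), (1.0, 0.0)], [0.8, 0.2])
```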

Inverse kinematics
The skeleton of a humanoid model can be treated as a set of bones systematically interconnected by joints, so kinematic techniques can be used to perform the simulation. Therefore, the kinematic technique is used as the first simulation method in our proposed VR based fall data generation. Forward kinematics refers to the process of obtaining the position and angular velocity (direction vector) of an end effector, given the joint variables, i.e., their angles and angular velocities [2]. Figure 5 demonstrates a humanoid arm model with two bones and two joints. The position of the end effector can be computed using the forward kinematic function fk(b1, a1, b2, a2). In the simplest case, when the arm is assumed to move on a flat surface, e.g., on a table, the position (xe, ye) of the end effector can be computed using the forward kinematic Eqs. (1) and (2):

xe = b1 cos(a1) + b2 cos(a1 + a2)    (1)
ye = b1 sin(a1) + b2 sin(a1 + a2)    (2)

where b1 and b2 are the lengths of the arm and forearm bones, and a1 and (a1 + a2) are their angles with respect to the horizontal axis, respectively. The velocity can further be obtained by taking the derivative of fk() with respect to a1, a2, b1, and b2. Inverse kinematics is the process of finding the corresponding position and velocity of each joint in the system given the position and velocity of the end effector. Inverse kinematics provides the foundation for the automatic creation of a 3D animation, including a fall simulation, as it allows interpolation of the skeleton movements between two poses. However, when the number of bones is greater than two, the inverse kinematic problem is ill-posed and finding a general analytical solution is difficult [3]. There are several techniques to address this problem, including the Jacobian inverse technique and heuristic optimization methods [9].
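Eqs. (1) and (2) for the two-bone planar arm of Fig. 5 translate directly into code (angles in radians):

```python
import math

def forward_kinematics(b1, a1, b2, a2):
    """End-effector position of a planar two-bone arm, per Eqs. (1)-(2)."""
    x_e = b1 * math.cos(a1) + b2 * math.cos(a1 + a2)
    y_e = b1 * math.sin(a1) + b2 * math.sin(a1 + a2)
    return (x_e, y_e)
```

With both joints at zero the arm lies stretched along the horizontal axis, so the end effector sits at distance b1 + b2 from the shoulder.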
In the context of fall simulation, the inverse kinematic problem can be reformulated as an optimization problem, i.e., finding the optimal position of each bone in the skeleton system to capture a fall motion. However, solving this optimization problem for fall simulation is a challenging task. Motion capture and machine learning techniques can, however, facilitate solving it.
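For the two-bone arm the inverse problem still admits a closed form, which helps illustrate why larger skeletons do not: the law of cosines fixes the elbow angle, and the shoulder angle follows. With more bones this closed form disappears, which is why the text resorts to optimization and motion-capture constraints. A hedged sketch of the analytical two-bone case:

```python
import math

def inverse_kinematics(b1, b2, x, y):
    """Joint angles (a1, a2) reaching a target (x, y) with a two-bone arm.

    Returns the "elbow-down" solution; assumes the target is reachable,
    i.e. |b1 - b2| <= sqrt(x^2 + y^2) <= b1 + b2.
    """
    # Law of cosines gives the elbow angle; clamp for numeric safety.
    c2 = (x * x + y * y - b1 * b1 - b2 * b2) / (2.0 * b1 * b2)
    c2 = max(-1.0, min(1.0, c2))
    a2 = math.acos(c2)
    a1 = math.atan2(y, x) - math.atan2(b2 * math.sin(a2),
                                       b1 + b2 * math.cos(a2))
    return (a1, a2)
```

Note there are generally two solutions (elbow up/down) even in this simplest case, a first hint of the ambiguity that motion-capture data is used to resolve.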

Motion capture
Motion capture (mocap) is the process of recording the movements of a real person or object using mocap technologies. The captured data is then used as constraints to reduce the ambiguity of the inverse kinematic process. As a result, the recorded movements can be reproduced by the humanoid model in a more realistic manner. Motion capture is therefore also incorporated, as a simulation method, in our proposed VR based fall data generation framework.
Motion capture technologies can be categorized into two groups: online and offline [34]. The online technologies are often based on magnetic or infrared sensors, and their output can be used directly to control a virtual human in real time to mimic the human performer's movements. However, the current online mocap technologies have some limitations, including a small number of measurement points, noisy data and cumbersome sensors (although the sensors tend to become smaller). Therefore, the quality of the captured motion largely depends on the data processing software, which handles the data cleaning and data interpolation processes. An inexpensive but rather effective online mocap system of this type is the HTC Vive, a full-body tracking system used with the VR Mocap software package, as shown in Fig. 6. The specification of the latest HTC Vive tracker is provided in Table 2, showing the weight, tracking, dimensions and other characteristics of the HTC Vive Motion Tracker 3.0. The HTC Vive system has been successfully used in the proposed framework and experimental analysis in this research work to capture fall movements.
The offline mocap technologies are mainly based on multiple cameras, which capture optical motion. The cameras track markers attached to the body of the human performer being tracked. This class of technologies allows the acquisition of subtle gestures to produce high-quality, large and complex movements. However, offline technologies are considerably more expensive than their online counterparts, and they require more time to process the captured motions. Despite these drawbacks, offline mocap technologies are preferable for capturing motions in a clinical context, such as for the assessment of orthopaedic pathologies [36], and obviously for fall simulation. Figure 7 shows a conceptual camera-based mocap system. Data generated using mocap systems can be used to animate humanoid models. Several mocap datasets [21,47] are publicly available for research purposes; they capture common human movements, including walking, running, and jumping. Unfortunately, no public mocap datasets for fall motion are available at this stage. Therefore, in this research work, a mocap fall dataset was created for fall detection and will be made publicly available for research purposes.
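The data cleaning and interpolation step mentioned above can be illustrated with a tiny gap-filling routine: dropped samples in a marker track (e.g. from occluded markers) are filled by linear interpolation between the nearest valid neighbours. Real pipelines use spline fitting and filtering; this sketch, with assumed data layout, only conveys the idea.

```python
def fill_gaps(track):
    """Fill None entries of a 1D marker track by linear interpolation.

    Assumes the track starts and ends with valid (non-None) samples.
    """
    filled = list(track)
    for i, v in enumerate(filled):
        if v is None:
            # Nearest valid sample on each side of the gap.
            lo = max(j for j in range(i) if filled[j] is not None)
            hi = min(j for j in range(i + 1, len(track)) if track[j] is not None)
            t = (i - lo) / (hi - lo)
            filled[i] = filled[lo] + t * (track[hi] - filled[lo])
    return filled
```

In practice each marker has three such tracks (x, y, z), and the same logic applies per axis.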

Physics-based motion generation
To obtain a natural simulated motion, motion-capture data is usually utilized. However, this approach is limited to motions that can be replicated and captured. For example, it is difficult to ask an elderly person to simulate a fall in order to capture his/her fall motion. Moreover, the mocap process is time-consuming and may not be suitable for the creation of large datasets.
The use of physics-based simulation has been discussed for actively controlled virtual characters to automate the generation of natural and realistic human motions in an interactive setting without motion data [19]. This technique builds upon past research [18]. To induce motion, a finite state machine-based control system outputs muscle excitation signals to control the legs and produce locomotion, based on the previously modeled muscle dynamics. The pose and overall movement of the model are then optimized and improved.
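A toy sketch of the finite-state-machine control idea: each state emits an excitation level for the leg muscles and transitions after a fixed number of ticks. The states, signal levels and durations below are all illustrative assumptions, not the controller of [18,19], where transitions depend on contact events and the signals drive a full muscle model.

```python
class GaitFSM:
    """Minimal cyclic gait controller: swing and stance phases per leg."""

    # state -> (excitation signal, duration in ticks, next state)
    STATES = {
        "left_swing":   (0.8, 3, "left_stance"),
        "left_stance":  (0.2, 2, "right_swing"),
        "right_swing":  (0.8, 3, "right_stance"),
        "right_stance": (0.2, 2, "left_swing"),
    }

    def __init__(self, start="left_swing"):
        self.state = start
        self.ticks = 0

    def step(self):
        """Advance one tick and return the current excitation signal."""
        signal, duration, nxt = self.STATES[self.state]
        self.ticks += 1
        if self.ticks >= duration:
            self.state, self.ticks = nxt, 0
        return signal
```

In a physics-based character, each emitted signal would excite simulated muscles whose forces are then integrated by the physics engine to produce the pose.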

Video-based motion generation
Recently, the use of uncalibrated videos for synthesizing character movements has been rapidly increasing, owing to their large scalability and ability to produce novel motion data. Computer vision and machine learning techniques have been used to detect and estimate human poses in video sequences, which are then converted to motion data using model fitting algorithms [29,33,51]. The model is generally the human skeleton, and the model variables are the joints. To improve accuracy, some research works have additionally used data from inertial measurement units (IMUs). Although this motion generation approach is promising, it is still at an early stage, and the quality of the generated motions has not yet reached the level of accuracy required for use in the healthcare field.
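The model-fitting step can be illustrated at its smallest scale: recovering a single joint angle from three 2D keypoints of the kind a pose estimator outputs per frame. Real systems fit a full skeleton over time; this hedged sketch shows only the per-joint geometry.

```python
import math

def joint_angle(a, b, c):
    """Angle at keypoint b (in degrees) formed by segments b->a and b->c.

    e.g. a = shoulder, b = elbow, c = wrist gives the elbow flexion angle.
    """
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    n1 = math.hypot(*v1)
    n2 = math.hypot(*v2)
    cosang = max(-1.0, min(1.0, dot / (n1 * n2)))  # clamp for safety
    return math.degrees(math.acos(cosang))
```

Repeating this per joint and per frame yields the joint-angle trajectories that drive the humanoid skeleton.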

Fall scenario generation
A fall scenario is composed of at least one fall motion in a virtual environment. The virtual environment can be indoor or outdoor, and its settings can be fully customized in terms of surrounding objects and lighting conditions. Virtual fall datasets are generally constructed by shooting and annotating a generated fall scenario from one or multiple virtual cameras with different environment settings and camera angles. Theoretically, any amount of visual data with a high level of diversity can be generated for training fall detection algorithms. Although other software tools, such as Unreal and Godot, are available, there is no significant difference between these tools for this fall generation task, as they all offer high-quality and realistic data. Therefore, in the proposed framework, the Unity3D software is used to generate fall scenarios for experiments and data generation.
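One simple way to obtain the camera-angle diversity described above is to place virtual cameras on a ring around the fall location, each aimed back at the subject. The sketch below computes such placements in plain geometry; the function and its parameters are illustrative, not the Unity3D API (in Unity these positions would be assigned to camera Transforms).

```python
import math

def camera_ring(center, radius, height, n_cameras):
    """Return n (x, y, z) camera positions evenly spaced around `center`.

    center is the (x, z) ground position of the subject; y is up.
    """
    cx, cz = center
    cams = []
    for k in range(n_cameras):
        theta = 2.0 * math.pi * k / n_cameras
        cams.append((cx + radius * math.cos(theta),
                     height,
                     cz + radius * math.sin(theta)))
    return cams
```

Varying the radius, height and count per scenario multiplies the number of distinct viewpoints in the generated dataset at no extra recording cost.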
It is worth mentioning that Unity3D can simulate real-world physics; as a result, it is possible to simulate other types of motion and directional sensors, such as accelerometers and gyroscopes, to provide additional data dimensions in the generated fall datasets. These functionalities have not been considered in our fall datasets, and this possibility will be explored in our future work.

Case study
In this section, a simple case study is presented to verify the usability of the proposed framework for synthetic human fall generation. Our aim in this case study is to evaluate how an existing state-of-the-art fall detection algorithm performs on our synthetic falls. Would the algorithm detect a synthetic fall? How would its performance differ from what the algorithm was trained for? To answer these questions, three sample fall datasets were first created using the proposed fall data generation framework. A state-of-the-art fall detection algorithm was then chosen to perform prediction on the generated synthetic falls. Finally, the fall detection results are discussed to answer the above-mentioned questions. The following subsections detail the process.

Virtual fall dataset construction
We used several of the fall motion generation methods discussed in the previous section to create different sets of data. In particular, we constructed three datasets of fall motions in this case study. The first virtual fall dataset (VFD-1) contains a fall motion created manually using the Blender 2.8 tool. The second (VFD-2) and third (VFD-3) datasets contain the same fall motion captured using the HTC Vive full-body trackers and the OptiTrack motion capture system, respectively. The reason for using different methods for fall motion construction is that the quality of the synthetic falls constructed by each method differs: a manually constructed fall has the lowest quality, while an OptiTrack-captured fall has the highest quality. The synthesized fall motions were then used to drive a humanoid 3D model in a virtual environment to create a fall scenario using the Unity3D simulation engine. The fall scenarios were then captured by virtual cameras to create fall footage, which was used to evaluate fall detection algorithms. Figure 8 shows a photo of the OptiTrack motion capture system used in our experiments.
A summary of the key features of the virtual fall datasets (VFDs) is presented in Table 3. From Table 3, it can be noted that the fall simulation scenarios were conducted in an indoor environment setting. The lighting condition was the default ambient lighting and remained unchanged during the simulation. Five virtual cameras, operating at 30 fps, were used in the simulation to shoot the falls from different angles: two in the front, two from the back and one on top. Figure 9 shows the camera angles used for the fall simulation. Each fall simulation was carried out for 5 s, with two seconds of actual falling. In total, five fall motions were recorded in 300 frames. No-fall scenarios were also generated to test the fall detection algorithms; in these scenarios, the fall motion was replaced by a simple walking motion and the rest of the environment settings remained unchanged. As a result, the VFDs comprise 1,200 frames, in which five no-fall motions were also recorded.

Fall detection algorithm
The state-of-the-art fall detection algorithm proposed by Núñez-Marcos et al. [38] was used in this experimental study, as it has shown good performance for fall detection and has proven to be one of the best methods in the literature. The fall detection algorithm is based on the VGG16 Convolutional Neural Network, and the source code is available at https://github.com/AdrianNunez/Fall-Detection-with-CNNs-and-Optical-Flow. Since the algorithm operates on optical flow images, the same Dual TVL1 Optical Flow algorithm proposed by the authors was used to compute the sequences of optical flow images from our generated fall scenarios [55]. The algorithm was selected due to its performance and the availability of its source code.
Fig. 8 The OptiTrack motion capture system

Experiment results and discussion
To evaluate the performance of the chosen fall detection algorithm on our datasets (VFDs), we relied on the same evaluation metrics frequently used in the literature [38]. In particular, the algorithm was evaluated using three performance metrics: Precision, Recall, and Accuracy, computed using Eqs. (3), (4) and (5), respectively:

Precision = TP / (TP + FP)    (3)
Recall = TP / (TP + FN)    (4)
Accuracy = (TP + TN) / (TP + TN + FP + FN)    (5)

In Eqs. (3), (4) and (5), TP (True Positive) is an optical flow stack labelled as "fall" that was also predicted as fall, TN (True Negative) denotes an optical flow stack labelled as "no-fall" that was predicted as no-fall, FP (False Positive) is an optical flow stack labelled as "no-fall" that was predicted as fall, and FN (False Negative) is an optical flow stack labelled as "fall" that was predicted as no-fall. An optical flow stack is a sequence of 10 consecutive frames labelled as "fall" or "no-fall". The first experiment was carried out to evaluate the overall detection performance against all synthesized fall footage; the results are presented in Table 4. The performances obtained by the algorithm [38] trained and tested with three existing datasets (UR-Fall, MC-Fall, FD) are also presented in Table 4 for comparison.
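The three metrics of Eqs. (3)-(5), computed from the four counts defined above, are a straightforward restatement in code:

```python
def precision(tp, fp):
    """Eq. (3): fraction of predicted falls that are actual falls."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Eq. (4): fraction of actual falls that are detected."""
    return tp / (tp + fn)

def accuracy(tp, tn, fp, fn):
    """Eq. (5): fraction of all optical flow stacks classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)
```

Note that with few fall stacks relative to no-fall stacks, accuracy alone can look acceptable while recall is poor, which is why all three metrics are reported.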
From Table 4, it can be noted that the fall detection algorithm performed poorly compared to the results claimed by Núñez-Marcos et al. [38]. For example, on the VFD-3 dataset the algorithm achieved an accuracy of 76.6%, considerably lower than the results expected from the Núñez-Marcos et al. fall detection algorithm [38]. The accuracy obtained on VFD-1 is the worst compared to the performance of the fall detection algorithm on the UR-Fall, MC-Fall, and FD datasets. From Table 4, it can further be noted that the Recall values obtained on the VFDs are quite low compared to the fall detection results reported for the UR-Fall, MC-Fall, and FD datasets.
To understand the reasons for the inferior performance of the algorithm on the VFD datasets, the algorithm was subsequently tested against footage from individual virtual cameras. The evaluation results obtained using footage from the front cameras are shown in Table 5.
From the results shown in Table 5, it is clear that the fall detection algorithm performed significantly better on the image sequences obtained from the two front cameras. In particular, the fall detection performance obtained on VFD-3 was comparable to the performance achieved with the three existing datasets in all three metrics, i.e., Recall, Precision and Accuracy. This is because the front camera angles were similar to the angles used in the existing fall datasets with which the detection algorithm was trained. These results indicate that the synthetic falls, when recorded from a similar camera angle, have the same quality (of being close to real falls) as the simulated real falls. The experiment also indicates that the detection algorithm failed to detect falls in footage taken from unfamiliar camera angles, e.g., the rear view or the top view, as it was not trained with such footage. In summary, the results obtained from the experiments in this case study allow us to draw a few initial conclusions: (i) the lack of diversity in the training data, particularly fall motions taken from unfamiliar camera angles, could be the main reason for the poor performance of many fall detection algorithms in real-life scenarios, and (ii) the virtual fall footage has sufficient quality to trigger the fall detection algorithm. This also means that the synthetic data generated by the proposed framework can be used to train fall detection algorithms to improve their detection performance.

Conclusion and future work
Fall detection remains a challenging problem in the public health/aged care domain. In this paper, we presented an innovative application of Virtual Reality to address a major obstacle in the real-world fall detection problem: the lack of quality fall data. As a result, VR based fall datasets have been created for training/testing machine learning based fall detection algorithms. The virtual fall datasets will also be made publicly available to researchers for research purposes. The methodology for generating fall data in virtual environments was also discussed, and a case study was conducted to verify the quality of the data generated by the proposed approach. The results indicated that the approach can synthesize high-quality fall data, which can potentially be used to improve machine learning based fall detection algorithms.
In future work, this initial research will be expanded in two directions. First, an extensive fall simulation will be conducted using different fall motions captured with both the HTC Vive body motion tracker and the OptiTrack motion capture system to create larger datasets. Second, the generated virtual fall datasets will be used to train/test more fall detection algorithms together with real simulation footage. Other types of sensors, such as accelerometers and depth cameras, will also be simulated to provide additional dimensions of fall data for training and testing machine learning based fall detection algorithms. Although the proposed approach was intended for fall detection, it can be applied to other domains, such as training self-driving vehicles and robotics.
Funding Open Access funding enabled and organized by CAUL and its Member Institutions.
Data availability On request.
Code availability On request.

Declarations
Conflict of interest The authors declare that they have no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.