1 Introduction

The term “Metaverse” has gained widespread attention, with much of the associated research centered on developing its architecture and infrastructure. This surge in interest has led to the introduction of new technologies and workflows. The concept of the Metaverse can be traced back to Neal Stephenson’s 1992 novel, “Snow Crash” [1]. The contemporary vision of the Metaverse integrates Web 2.0 principles with virtual reality, augmented reality, and virtual avatars [2]. Technologies such as Unreal Engine and Omniverse [3] offer the potential to create photorealistic virtual worlds, complemented by applications such as Character Creator and Unreal’s MetaHumans for lifelike avatar generation. These advances provide fertile ground for research in various fields, not limited to the Metaverse itself.

The convergence of the Metaverse with domains such as computer vision and virtual reality holds immense promise. However, in the realm of computer vision, obtaining comprehensive datasets that accurately depict human actions poses a significant challenge. Human activities are multifaceted and complex, demanding elaborate capture and annotation processes. This complexity increases when moving into the immersive realm of virtual reality (VR), where accurately simulating natural human motion requires meticulous data collection. The Metaverse’s emergence emphasizes the urgency of acquiring diverse, authentic human action data, bridging the gap between real-world and digital experiences. Addressing these challenges is pivotal, as the fusion of the Metaverse with computer vision and VR has the potential to revolutionize human-computer interactions and reshape our navigation of immersive digital spaces.

We decided to take a closer look at the field of activity detection and recognition in computer vision (CV) because it is an emerging field. Recently, the use of CV in several settings, ranging from targeted advertising to law enforcement, has triggered ethical concerns. At the same time, mature CV algorithms are available that identify human actions in images and videos with promising results [4, 5]. However, there is evidence that some existing CV algorithms [6, 7] actively discriminate against specific groups of people [8], which is attributed to a lack of comprehensive training data before deployment. From this, we conclude that the lack of real-world data for training activity detection and recognition models (collectively referred to as ADR models) is a key problem [9]. To briefly summarize the difference, activity recognition predicts which action occurs in a trimmed clip, whereas activity detection also localizes when each action starts and ends in an untrimmed video. ADR methods rely heavily on data: although some datasets were created specifically for either activity recognition or activity detection, these datasets can be used for either problem, and the resulting models are only as good as the data used for their training. Research has shown that generating new datasets is costly and time-consuming, and privacy issues need to be resolved before these datasets can be distributed [10]. To avoid these pitfalls, we postulate synthetic data as a solution.
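
To make the distinction concrete, the sketch below illustrates how the two outputs are typically structured: recognition assigns one label to a trimmed clip, whereas detection returns temporally localized segments for an untrimmed video. The class names, clip identifiers and timestamps are hypothetical and for illustration only.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class RecognitionResult:
        """Activity recognition: one label for a whole (trimmed) clip."""
        clip_id: str
        label: str

    @dataclass
    class DetectionSegment:
        """Activity detection: a label plus its temporal extent in an untrimmed video."""
        label: str
        start_s: float  # segment start, in seconds
        end_s: float    # segment end, in seconds

    # Hypothetical outputs, for illustration only
    recognition = RecognitionResult(clip_id="clip_001", label="drinking")
    detection: List[DetectionSegment] = [
        DetectionSegment("walking", 0.0, 4.2),
        DetectionSegment("drinking", 4.2, 9.8),
    ]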

Current work is centered on models, as opposed to data. In his discussion, Andrew Ng suggested that “one of the keys to extending the benefits of AI to companies beyond the online giants is to use techniques that enable AI systems to be trained effectively from much smaller datasets” [9]. This suggests that researchers need to change course and focus on a data-centric approach rather than an algorithm-driven one. VR demands data-centric research to enhance its immersive potential. VR’s unique immersive nature requires realistic simulations of real-world interactions, environments, and behaviors. To achieve this realism, high-quality data are vital, as they enable the creation of lifelike avatars, accurate physics, dynamic lighting, and more. Data-centric research collects and analyzes relevant data to refine VR applications, resulting in more compelling and realistic virtual experiences. Most activity detection and recognition works target activities of daily living (ADL) [11–14] and report high accuracy on benchmark datasets. However, these datasets are not diverse enough to address real-life cases; they often focus on restricted subsets of ADL chosen according to the interests of different researchers. Furthermore, our preliminary literature review revealed that the collection and hand-labeling of real-world data have followed a fairly conventional route. Nevertheless, this approach is resource intensive, which limits the capacity to generate variability. Synthetic data provide a solution to these problems faced by real-world data.

1.1 Photorealistic vs. non-photorealistic synthetic data

When choosing synthetic data, a common fork in the road that researchers face is whether to use photorealistic or non-photorealistic synthetic data. Each option has its own challenges [15], such as the complexity of data generation and the usability of the data. In our extensive literature review of existing synthetic datasets [16–19], we found that photorealistic data are the consensus option across researchers in the field of ADR. In some ADR methods, pose extraction is a common step that can be hindered if the synthetic data have low resolution or high levels of noise [15]. Thus, we propose that synthetic data generation should prioritize photorealism; and with the development of Metaverse technologies, achieving photorealism has become more accessible to researchers than ever before.

Current studies on synthetic datasets for ADL generate data without using existing real-world datasets; examples include SARA, SURREAL and SynADL [17, 18, 20]. This approach is often labor-intensive and requires additional post-processing work [16, 20] to transfer real-world actions onto rigged synthetic humans, as shown in Fig. 1. Furthermore, evaluations are conducted on the authors’ own models [21] rather than existing ones, which limits the comparison of trained model performance between real-world and synthetic data. Additionally, the realism of previous ADL synthetic datasets is limited [20]. However, with RTX rendering technology [3], we seek to achieve a higher degree of photorealism when generating synthetic data.

Figure 1 Example of one synthetic dataset, SynADL [20], that uses real-world recordings to rig synthetic humans

The next steps in this process are (a) considering how we can leverage Metaverse technologies to create synthetic data, and (b) determining how we can use the synthetically generated data to train models to identify human actions performed by other avatars, regardless of location, appearance, duration or other factors. Users have always been able to create a personalized virtual world [22], which extends to avatars that may not always be human. Another challenge that researchers have been attempting to solve lies in how to generate large amounts of data [23].

Current models are trained on real footage, but little attention has been given to what happens when those models are tested against humanoid or non-human avatar activities. There is a crucial need to train existing models to detect the actions of non-typical avatar designs and to obtain high accuracy on the right type of data; such models will eventually be deployed in the Metaverse. However, how do we create the right type of data efficiently?

This paper will evaluate both real and synthetic datasets relevant to ADR models, and analyze the data these methods collect and generate. We also discuss methods for improving synthetic datasets based on the review. The overview of this paper is illustrated in Fig. 2.

Figure 2 Overview of the paper and what we discuss. VR: virtual reality; ADL: activities of daily living; ADR: action detection and recognition; SDG: synthetic data generation

This paper makes three main contributions to current research. First, it presents an in-depth look at the data collection methods of existing real and synthetic datasets for ADR. This includes a discussion on how Metaverse technologies can be leveraged to ease existing problems in the field. Second, we present preliminary works for the development of a novel pipeline for synthetic data generation, titled “SynDa” [24]. SynDa enables partial automation for quality synthetic data generation. To the best of our knowledge, this is the first pipeline that leverages existing real-world data to generate synthetic data for activity recognition. Finally, our preliminary results have revealed an increase (+2%) in the prediction accuracy of the model when trained with 50% synthetic data as compared to 100% real data.
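
As a rough illustration of the kind of real/synthetic mixing used in these preliminary experiments, the sketch below shows one way such a 50% mixture could be assembled with PyTorch utilities. The dataset objects, mixing policy and function name are placeholders, not the SynDa implementation.

    from torch.utils.data import ConcatDataset, DataLoader, Subset

    # real_dataset and synthetic_dataset are assumed to be torch.utils.data.Dataset
    # objects yielding (clip_tensor, label) pairs; both are hypothetical here.
    def mix_real_and_synthetic(real_dataset, synthetic_dataset, synthetic_fraction=0.5):
        """Keep a portion of real clips and top up with synthetic ones (illustrative only)."""
        n_real = len(real_dataset)
        n_keep = int(n_real * (1.0 - synthetic_fraction))
        n_syn = min(n_real - n_keep, len(synthetic_dataset))

        keep_real = Subset(real_dataset, list(range(n_keep)))
        add_syn = Subset(synthetic_dataset, list(range(n_syn)))
        return ConcatDataset([keep_real, add_syn])

    # Example usage (hypothetical datasets):
    # mixed = mix_real_and_synthetic(real_ds, syn_ds, synthetic_fraction=0.5)
    # loader = DataLoader(mixed, batch_size=8, shuffle=True)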

Several previous reviews [25–27] have discussed existing human activity recognition methods. Some papers examine specific areas such as bias in activity recognition datasets, while others review datasets collected in a specific way, or simply examine which datasets would be best suited for researchers. However, we note that existing systematic reviews do not examine the different factors that vary across video datasets, or the possible links between these factors and the amount of data produced. Furthermore, this systematic review examines video datasets focusing on ADL for both activity recognition and activity detection. Some researchers use these terms interchangeably; however, in this review, we treat these fields as closely related but not the same, as discussed in the introduction.

2 Selection process for real and synthetic ADL datasets

The selection was strictly focused on video datasets, both synthetic and real, that are used for evaluation/benchmarking by an existing ADR model.

2.1 Eligibility criteria

The authors shortlisted ADL datasets that were well documented, annotated and evaluated by publicly available ADR models; papers that did not meet these criteria were excluded. This is because we wanted to perform baseline comparisons against the original performances for the early SynDa [24] experiments (Sect. 4). Table 1 summarizes the data collection methods used for the real-life datasets, and Table 2 presents a summary of the real-life datasets themselves.

Table 1 Summary of the data collection methods used for real-life shortlisted ADR datasets for ADL. 1 Participants recorded themselves performing the actions. ADR: collectively referring to action detection and recognition; ADL: activities of daily living
Table 2 Summary of real-life datasets on ADL for ADR before eliminations (sorted by year). 1 “Biased” refers to datasets that include environmental/contextual information to infer the activity that occurred; “Free” means the paper uses models that focus only on the human actions to determine the activity taking place in the clip. 2 Datasets that did not meet the selection criteria are explained briefly in Sect. 3.1.1

2.2 Search strategy

To obtain datasets matching the criteria, we browsed “Papers with Code”, “arXiv.org” and “Google Scholar”. The keywords for real videos were “activities of daily living datasets”, “human activity recognition ADL datasets”, “indoor ADL datasets”, “activity recognition datasets”, and “daily activity video datasets”. For synthetic datasets, the search used the keywords “synthetic ADL”, “synthetic human activity”, “synthetic activity recognition”, and “synthetic data generation”, together with a video filter on “Papers with Code”.
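
As a rough sketch of how such keyword searches could be scripted (here against the public arXiv API, one of the sources listed above), the following snippet retrieves titles and links for one of the queries; the result count is an arbitrary choice and the query string mirrors the keywords above.

    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    ATOM = "{http://www.w3.org/2005/Atom}"

    def search_arxiv(query, max_results=25):
        """Query the public arXiv API and return (title, link) pairs."""
        url = ("http://export.arxiv.org/api/query?"
               + urllib.parse.urlencode({"search_query": f"all:{query}",
                                         "max_results": max_results}))
        with urllib.request.urlopen(url) as response:
            feed = ET.fromstring(response.read())
        return [(entry.find(ATOM + "title").text.strip(),
                 entry.find(ATOM + "id").text.strip())
                for entry in feed.findall(ATOM + "entry")]

    for title, link in search_arxiv("activities of daily living dataset"):
        print(title, link)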

2.3 Selection process

This search yielded 25 datasets, from which we selected 18 real and 5 synthetic datasets that met the eligibility criteria. We excluded datasets that

1) did not include human activities that were used for ADR;

2) did not detail the data collection/annotation methods;

3) did not show full-body human motions in synthetic footage;

4) did not have benchmarking results or results with a publicly available ADR model.

2.4 Study risk of bias assessment

Datasets are always subject to bias, which in turn affects the efficacy of model training [8, 41, 42]. A notable example is the recent exposure of bias in the popular CV dataset ImageNet. Scientists at Carnegie Mellon and George Washington University discovered significant bias stemming from ImageNet’s imbalance [42]. They trained two models, OpenAI’s iGPT and Google’s SimCLRv2, on ImageNet, compiled a representative set of “stimuli” images from Google, ran them against the two pretrained models, and found troubling results.

AI researcher Kate Crawford, in her book Atlas of AI [43], shared the following: “Many truly offensive and harmful categories hid in the depth of ImageNet’s Person categories. Some of the classifications were misogynist, racist, ageist, or enabled ableism. Insults, racist slurs, and moral judgments abound”. After further research, we selected some examples to illustrate the biased nature of the data in ImageNet. ImageNet does not have equal representation of race and gender [42]; for instance, the “groom” category mostly includes white people. There are also clear signs of stereotyping: white people with tools are labeled as such, while black people with tools are identified as “black people with weapons” [42]. This bias is similar to that shown by Google Cloud Vision, Google’s computer vision service, which was found to label images of dark-skinned people holding thermometers as “holding a gun” [41]. The researchers at Carnegie Mellon and George Washington University stated in their paper that “although models like these may be useful for quantifying contemporary social biases as they are portrayed in vast quantities of images on the Internet, our results suggest the use of unsupervised pre-training on images at scale is likely to propagate harmful biases.”

As a result, the video datasets reviewed here are chosen objectively for a fair comparison between different datasets and their collection methods. As part of future work, we have plans to study how existing factors in the ADR datasets affect model performance, and how this effect can be mitigated with the use of synthetic data. Synthetic data allow researchers to create “ideal” datasets by adjusting the factors needed to create data that yield strong performance. With synthetic data, the objective is to create unbiased datasets with variable factors to account for the different combinations of visual inputs that are present in real-life situations. We also discuss how we can learn from the factors that affect both real and synthetic datasets to create data that are useful and unbiased in Sect. 4.2.

2.5 Dataset selection

In Table 2, we present the shortlisted ADL datasets that meet the selection criteria, as well as datasets that were excluded despite meeting the preliminary search criteria. This table includes more datasets and is an improvement over the earlier table in our previous work [24].

3 Method

This work systematically reviews existing real and synthetic video datasets for ADR. ADR is essential for enabling seamless recognition of the behaviors of virtual humans, such as visitors, within the Metaverse. This review seeks to (a) establish gaps in existing real-life datasets, (b) analyze existing synthetic datasets and (c) discuss what can be done about the gaps. Furthermore, it is crucial to understand the characteristics found in ADL datasets, as these are the defining factors for achieving optimal results when synthesizing data from existing datasets.

In the current era of computer vision, there has been a shift in focus from improving algorithms to data-centric approaches [9]. We first describe our selection criteria in Sect. 2 and then examine existing real-life datasets (Sect. 3.1) and synthetic datasets (Sect. 3.2) for ADR in ADL. The selection criteria are focused on ADL datasets for activity recognition.

3.1 Real-world ADL datasets

3.1.1 Existing real-world datasets

A variety of data collection methods have been used in the field of ADR. Early methods, prior to 2010, used silhouettes and spatio-temporal interest points (STIPs) [44]. The early concept of using silhouettes as an input to recognize human actions follows the idea that human activity can be seen as a continuous evolution of the body pose over time [44]. This evolved with the introduction of the bag-of-points concept in MSR-Action3D [28]. In that work, a bag of three-dimensional (3D) points (BOPs) is efficiently sampled from the depth map and Gaussian mixture models are used to model the human postures. This method, introduced in 2010, was the first to use RGB and depth (RGB-D) sensor data and surpassed the two-dimensional (2D) silhouette-based methods.
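
The sketch below is a simplified, illustrative reconstruction of that idea, not the authors’ implementation: a sparse bag of 3D points is sampled from a depth frame and a Gaussian mixture model summarizes the posture. The camera intrinsics, sample size and number of mixture components are placeholder values.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def depth_to_point_bag(depth, fx=580.0, fy=580.0, cx=320.0, cy=240.0, n_points=256):
        """Back-project a depth map (H x W, metres) to 3D and sample a bag of points."""
        v, u = np.nonzero(depth > 0)              # pixels with valid depth
        z = depth[v, u]
        x = (u - cx) * z / fx
        y = (v - cy) * z / fy
        points = np.stack([x, y, z], axis=1)
        idx = np.random.choice(len(points), size=min(n_points, len(points)), replace=False)
        return points[idx]

    def fit_posture_model(point_bag, n_components=8):
        """Summarize the sampled bag of 3D points with a Gaussian mixture (illustrative)."""
        return GaussianMixture(n_components=n_components, covariance_type="diag").fit(point_bag)

    # Hypothetical usage:
    # depth = np.load("frame_depth.npy")          # one depth frame
    # gmm = fit_posture_model(depth_to_point_bag(depth))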

Following MSR-Action3D [28] and its introduction of the RGB-D sensor method, researchers in the field subsequently identified several weaknesses in the approach. First, depth information is used for recognition while color information is completely ignored [13]. Second, only 2D poses are used instead of 3D poses, which leads to a loss of information about human poses. In 2011, Sung et al. [45] introduced the idea of using the Microsoft Kinect to capture human poses. Their contribution was an indoor (e.g., office, kitchen, bedroom, bathroom, and living room) activity dataset. However, this dataset could not account for videos where skeleton data were not readily available.

MSRDailyActivity [29] is a daily activity dataset that contains depth sequences. The dataset contains 320 videos covering 16 different activities. The researchers used Kinect cameras to capture the depth sequences performed by 10 participants, each of whom carried out each activity twice as directed, once in a “sitting” and once in a “standing” position. In the paper, the researchers also address the issue of occlusion and noise in depth maps, arguing that previous depth-map methods are susceptible to these factors, which can affect the 3D positions of tracked joints in activity detection.

HuDaAct [13] improved upon this method [45] and used the Microsoft Kinect to create the RGBD-HuDaAct video database, which contains both RGB and depth sequences. The dataset consists of 1189 videos covering 12 human daily actions with high variation in video duration, recorded from 30 recruited participants. The dataset’s unique feature lies in its synchronized and aligned RGB and depth channels, facilitating localized multi-modal analysis of RGB-D signals [13]. Because it contains both RGB and depth videos, the dataset is relevant and usable in real-life deployments: it supports training models for 24-hour monitoring by accounting for low lighting and variable recording durations.

CAD-120 [31] is another sensor-based dataset that has RGB, depth and skeleton data for human ADL. It is one of the few datasets with camera angles beyond the usual front or side view. Twelve actions were captured in 5 different environments, namely, the bathroom, bedroom, kitchen, living room and office, to diversify the background of the captured videos. The activities were performed by 4 participants. The dataset has only 120 videos, which limits its use despite the variety of camera angles.

MPII-Cooking [30] is one of the first datasets dedicated to capturing ADL within homes. It has 44 videos and focuses on fine-grained activities of daily living. The dataset involves a variety of similar activities: participants prepare 14 different dishes across the dataset, with each participant using different ingredients and tools [30]. The duration of each video varies from 3 to 41 minutes.

DAHLIA [33] consists of 51 untrimmed videos (∼39 minutes long each) of high-level activities. The data were recorded with 3 Kinects surrounding the scene to address environmental occlusion and human self-occlusion. The untrimmed actions were performed by 44 participants, one of the highest subject counts, across 7 high-level activities. The original paper [33] describes the dataset as providing the following modalities: color video, depth maps, skeleton body-joint locations and body indices, and a binary mask for the detected body [33].

Kinetics [37] is linked to AVA [35]; thus this review analyzes them together. Kinetics-700 showed how difficult video datasets are to annotate: each video is afforded only a single label. AVA builds on Kinetics-700 [37] by labeling every person in a subset of frames [35]. The AVA dataset focuses on annotating 80 atomic visual actions in 430 15-minute movie clips. The Kinetics-700 dataset has 700 human action classes with at least 600 clips per class, and approximately 650k video clips in total [37]. Within each class, every clip comes from a different Internet video, lasts approximately 10 s and has a single label describing the dominant action occurring in the video [37]. The AVA dataset can thus be seen as an extension of Kinetics-700 in which the annotations have been improved.

LIRIS [32] was published in 2014, making it one of the older datasets evaluated here. LIRIS explored alternative methods of data collection to address “multi-modality, human-human interactions, human-object interactions and human-human object interactions” [32]. The clips were recorded with a Sony DCR-HC51 camera and a Kinect installed on a “mobile robot and full localization information with bounding boxes” [32]. Although this dataset extends beyond pure activity recognition, it was included in this review as one of the earliest works to analyze the data collection method before researchers began seeking alternative approaches.

Ego4D [39] is unique in that, unlike other datasets, it does not utilize a dedicated human photographer. Instead, it combines the efforts of 931 unique individuals to compile unscripted, first-person perspective footage of ADL. The dataset consists of 3670 hours of video and hundreds of diverse environments, with footage “sourced from 74 different worldwide locations spanning across 9 countries” [39]. Each clip is approximately 8 minutes of unscripted footage from participants’ daily lives. Ego4D strives to introduce diversity in the data collected and scales the collection effort across these worldwide locations [39], which sets it apart from the other datasets reviewed here.

PKU-MMD [14] is a large-scale benchmark for ADR. The dataset contains 1076 long (untrimmed) video sequences in 51 action categories, performed by 66 subjects from three camera views, amounting to almost 20,000 action instances and 5.4 million frames in total. Each three-to-four-minute sequence is annotated with a set of action labels containing start and end frames as well as one of the 51 class labels [14]. This untrimmed daily activity video dataset is available in four modalities, RGB, depth, infrared (IR) and skeleton, for the research field of action detection. The recording was performed with 3 Kinect cameras and involved 66 participants, and the dataset is limited to vision modalities.

Home Action Genome (HOMAGE) [34] is a multimodal dataset that contains synchronized videos from multiple viewpoints along with hierarchical action and atomic-action labels. HOMAGE builds upon Charades [12], recording 27 participants using 12 different types of sensors. The videos were recorded in 2 different houses, in kitchens, bathrooms, bedrooms, living rooms, and laundry rooms. The 12 types of sensors are “cameras (RGB), infrared (IR), microphone, RGB light, light, acceleration, gyro, human presence, magnet, air pressure, humidity and temperature” [34]. Each of the 5700 videos was annotated by hand, identifying 75 activities and 453 atomic actions. HOMAGE [34] targeted the problem of privacy-aware recognition and rationalized the need for 12 sensors as follows: sensors such as angular velocity, acceleration, and geomagnetic sensors capture motion details from an individual’s viewpoint (ego-view); environmental sensors such as temperature and humidity detect scene changes before and after an activity; thermal sensors identify heat sources, which is useful in settings such as kitchens; and human presence and light sensors detect individuals without visual cues.

MMAct [36] is a large-scale multi-modal ADR benchmark consisting of RGB videos, keypoints, acceleration, gyroscope, and orientation data. It provides an ego-view and 4 third-person views as well as temporally localized actions. However, MMAct does not provide bounding-box annotations for spatial localization or relationships between objects [36]. The study involved a total of 20 performers and approximately 36,000 temporally localized action instances distributed across 1900 uninterrupted action sequences, each lasting 3-4 minutes for desk-work scenes (containing 9 action instances) and 7-8 minutes for other scenes, with approximately 26-28 action instances in total [36]. Each subject was asked to perform each session approximately 5 times with random changes in motion, direction and position. The videos were recorded using four commercial surveillance cameras (Hitachi DI-CB520).

A well-known dataset, NTU-RGB+D 120 [21], is one of the largest datasets, as shown in Table 2. It consists of more than 100,000 video samples and over 4 million frames, collected from 40 distinct subjects across 120 action classes. This dataset was recorded with a limited variety of camera angles, and participants followed the researchers’ instructions to perform each specified movement. Similarly, Charades [12] does not capture natural ADL behavior: participants were requested to act out specific actions for a fixed duration of 30 seconds. Charades contains approximately 9848 videos, captured by participants in their own home environments.

In comparison, a recent high-variability dataset, Toyota Smarthome Untrimmed (TSU) [5], provides 7 camera angles and records ADL performed naturally by elderly people over a typical day. The factors discussed in this section make it a naturalistic, context-independent dataset, which is used to test SynDa’s pipeline (more details are given in Sect. 4). This dataset is synthesized and used for training in order to compare the performance of SynDa-generated data against existing data, using the TSU [5] model. The results and observations are further discussed in Sect. 4.

More details about each dataset can be found in their respective papers.

This section briefly discusses why certain datasets were not included in the shortlist presented in Table 1. UCF-101 [38], from 2012, is one of the older annotated video datasets available for ADR. It contains 13,000 videos (180 frames/video on average) annotated into 101 action classes. Several researchers contend that while these datasets have been highly valuable to the community, their relevance is now diminishing [46]. This is because UCF-101 has limited variation, and its 101 activity classes are fewer than those of newer datasets such as NTU-RGB+D 120 or Kinetics-700 [21, 37]. For datasets such as Ego4D [39], LIRIS [32], and D3D-HOI [40], the authors found that they do not fit the selection criteria discussed in Sect. 2: the videos are either not relevant to the type of ADL that we focus on, or are too old.

3.1.2 Identifying themes across datasets

From the above review of the datasets, we devised a study of these papers using common terminologies and factors that appeared repeatedly across many papers. The study was guided by the question: “How does a dominant focus in one of the specified areas affect the other aspects of the dataset?” The “areas” referenced are terminologies extracted from the papers, and are as follows:

1) Diversity in a dataset;

2) Scale of a dataset;

3) Classes and annotation detail in a dataset;

4) Camera and camera angles in a dataset.

These areas are identified as the most common factors that researchers tend to vary to ensure diversity in the datasets. In Table 3, we highlight some datasets that have undergone iterations to increase the size and diversity of the dataset.

Table 3 Summary of real ADL datasets that have been enhanced with classes/annotation improvements. Participants recorded themselves performing the actions

Diversity in participants. To gauge this, we list the number of participants recorded for each dataset. For original papers that did not report an exact number, we pegged the value at 10, indicating that there was a greater variety of participants. As there was a dichotomy in the type of diversity, we used the 2 most frequent diversity-related factors that appeared in the datasets: the appearance/number of humans (Table 4) and the location of the videos (Table 4). The appearance and number of humans gauge whether there is a good distribution in the appearance of the participants and whether there is a sizable number of participants, as this is directly related to appearance.

Table 4 Overview of real ADL datasets and the variations in common factors found across all the datasets. This table is further discussed and the trends are analyzed in Sect. 3.1.3. The numbers marked with “*” are not limited to 10 during the experiments

Diversity in locations. To gauge this, we list the number of locations used during the recording of each dataset. For original papers that did not report an exact number, we likewise pegged the value at 10, indicating that there was a greater variety of locations. The location of videos refers to whether the activities were filmed at an array of locations or at a few fixed locations (e.g., a lab or 1-2 houses).

Scale of dataset. For this section, we list out the number of videos in each dataset as reported in the original papers. These datasets are presented in Table 4.

Classes and annotation details in the dataset. There were different types of variation in the classes and annotation details across the datasets. However, to provide a succinct overview, we condensed this to the number of action classes per dataset, which was the focus of many papers.

Camera and camera angles in the dataset. Because of the large number of camera choices and camera angles used, we decided to record the number of camera angles reported for each dataset, as that is a common trend across papers. Some papers did not explicitly state the number of camera angles due to the source of their videos; in such cases, we indicate in Table 4 that the camera-angle count is more than 10.

For a deeper look into each of these factors, we also introduce a complete overview via a taxonomy diagram later in this paper to cover the permutations of each of these 4 major categories that the reviewers have identified across the papers. This discussion can be found in Sect. 3.1.3.

These factors are frequently used across the literature, and we discuss the implications of these factors and how they affect the datasets in the next section.

3.1.3 Focuses of data collection

In Table 1, we summarize the data collection methods and the focus of these datasets, which is a new development from our previous work [24]. In Sect. 3.1.2 we also discussed the common trends observed across the datasets and codified them. The diagrams in this section summarize the trends across these codes, which we refer to as the “focus” of each dataset. In this section, we take a step further and examine the correlation between each factor and the dataset size. We also break down the larger codes into sub-codes to account for more nuanced factors across datasets. These sub-codes for each focus are presented in the taxonomy diagram in Fig. 3 and are used for further discussion.
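
A minimal sketch of how this codification and trend analysis can be carried out is given below. The records shown are placeholders standing in for the values reported in Table 4, and Spearman rank correlation is one reasonable (assumed, not prescribed) choice for gauging the direction of each trend against dataset size.

    from dataclasses import dataclass
    from scipy.stats import spearmanr

    @dataclass
    class DatasetRecord:
        name: str
        n_videos: int
        n_participants: int
        n_locations: int
        n_classes: int
        n_camera_angles: int

    # Placeholder rows; the real values would be taken from Table 4.
    records = [
        DatasetRecord("DatasetA", 500, 18, 7, 51, 7),
        DatasetRecord("DatasetB", 10000, 40, 1, 60, 3),
        DatasetRecord("DatasetC", 100000, 40, 1, 120, 3),
    ]

    videos = [r.n_videos for r in records]
    for factor in ("n_participants", "n_locations", "n_classes", "n_camera_angles"):
        rho, p = spearmanr(videos, [getattr(r, factor) for r in records])
        print(f"{factor}: Spearman rho = {rho:.2f} (p = {p:.2f})")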

Figure 3 Taxonomy diagram of all real-world ADR datasets for activities of daily living (ADL)

1) Diversity in dataset focus

(1) Different locations: We developed Fig. 4 for reference. In general, we observe that as the number of locations increases, the number of videos decreases. In the case of Ego4D [39], the researchers reached 931 unique individuals across 74 worldwide locations in 9 countries, yet were only able to collect 3670 hours of footage.

Figure 4 Number of locations used for recording (diversity in location) versus the number of videos in a dataset. The general trend line shows that as the number of locations increases, the number of videos also increases. Note: each data point is represented as [{DatasetName}, number of videos]

(2) Various participants: Kinetics-700 [47] is another prominent dataset, capturing 700 human action classes; however, it was constructed from Kinetics-400 [46] rather than being a single project. Kinetics-700 [47] also diversified the geographic origin of its footage, covering multiple languages and nationalities and balancing the percentage distribution of video origin across continents. A summary chart of these observations is shown in Fig. 5.

Figure 5 Number of videos in a dataset versus the number of participants involved in the recording of actions (diversity in participants). The general trend line shows that as the number of videos in a dataset increases, the number of participants that are recorded decreases. Note: each data point is represented as [{DatasetName}, number of participants]

(3) Unscripted activities: MPII Cooking [30] was one of the first datasets focused on ADL within homes. To capture the diverse sets of actions that may result from a single complex activity such as “cooking”, the researchers captured the process of cooking 14 different dishes, each with its own set of steps and ingredients, to introduce diversity into their dataset. Despite such efforts, these datasets cannot capture all the variations, nor can they include a larger volume of footage, as researchers have to choose between diversity and volume.

2) Scale of dataset focus

Few datasets are given the “large-scale” stamp, due to the time and cost that are needed. In general, as illustrated in Fig. 7, we observed an upward trend in the number of annotated classes when the volume of the dataset increased.

(1) Large number of videos and participants: Notably, NTU-RGB+D [21] is one of the largest datasets, with approximately 55,000 videos, and its extension NTU-RGB+D 120 has approximately 114,000 videos. Forty participants were recruited to perform a variety of actions, all of which were scripted as directed by the researchers.

(2) Large variation for actions and camera: 3 main Kinect cameras were used, each at a specific horizontal angle (-45°, 0°, +45°), and each participant was asked to perform the specified action twice, once towards the left camera and once towards the right camera. In doing this, the NTU team captured two front views, one left side view, one right side view, one left side 45° view, and one right side 45° view. Similarly, in HOMAGE [34], the researchers used 12 different sensors to obtain a variety of data and modalities. Toyota Smarthome [5] similarly employed 7 different camera angles to capture people performing a particular action from various angles. In Table 4, we can observe that there are few large-scale datasets, and in the other focus sections, we note that there is generally a trade-off between the volume of a dataset and the variations achieved. We can deduce that large-scale data collection is manpower intensive and requires considerable hours and effort.

3) Classes and annotation detail in dataset focus

There are publications such as NTU-RGB+D 120, Kinetics-700 and AVA [21, 35, 47], which are improvements on previous iterations of previously published datasets. These improvements take the form of further annotations or the introduction of additional, well-annotated classes. In Fig. 7, a direct relationship between the number of activity classes in a dataset and the size of the dataset can be observed.

NTU-RGB+D 120 extends NTU RGB+D by adding another 60 classes and another 57,600 video samples. In the case of NTU-RGB+D, all the actions are scripted and performed by actors, so the team decided to increase the number of annotated action classes to account for more human actions. Each video is properly annotated thanks to a meticulous process ensuring that each action has its own labels and video samples. In the case of Kinetics-700, the first iteration of the dataset was Kinetics-400, which contained 400 annotated classes. The team subsequently released Kinetics-600 and Kinetics-700 to account for more annotated classes (600 and 700, respectively). We also note that in Kinetics-700, more attention was paid to incorporating rare situations [47], as most datasets often perform poorly when given a rare situation.

(1) Improved annotations. Similarly, AVA [35] is an extension of Kinetics. The AVA team proposed an improvement in the annotations of the Kinetics-700 dataset, and this improvement has been widely accepted by the research community.

(2) Small-scale dataset: CAD-120 [31] is a pioneering example: CAD-120 improved upon CAD-60 by having participants perform each high-level activity three times with different objects, so that annotations could be improved for sub-activities (e.g., reaching, moving, pouring, eating and drinking) and high-level actions (e.g., making cereal, taking medicine, stacking objects, unstacking objects and microwaving food). We can conclude that it would take considerable effort and coordination, spanning multiple projects, to obtain more human-action classes, and even then coverage may not be comprehensive.

4) Camera and camera angles in dataset focus

This review revealed that the number of cameras and the number of camera angles were common factors that researchers introduced in their experiments to increase the variety of data collected. In Fig. 6, we explore the correlation between the number of camera angles and the number of videos in a dataset.

Figure 6 Comparison of the number of camera angles against the number of videos in each dataset. Note: datasets with “10” camera angles have an unspecified number of camera angles, either due to the source of their videos or because the original paper did not explicitly state it. The general trend line shows that as the number of videos in a dataset increases, the number of camera angles that are varied decreases. Each data point is represented as [{DatasetName}, number of camera angles]

Figure 7 Number of action classes against the number of videos in a dataset. The general trend line shows that as the number of videos in a dataset increases, the number of annotated action classes also increases. Note: each data point is represented as [{DatasetName}, number of action classes]

Two prominent examples are MMAct [36] and Toyota Smarthome [5], both of which have more than 3 camera angles: 4 and 7, respectively.

1) Multiple cameras and ego-centric: In the case of MMAct, each participant was asked to perform actions 5 times with random changes in motion, direction and position, all of which were captured by 4 cameras and an ego-centric camera. This dataset contains over 1900 videos with 37 classes.

2) Multiple camera angles: In Toyota Smarthome, the researchers recorded real people performing their activities of daily living; however, the dataset contains only 536 videos with 51 classes.

Single camera angle: Similarly, in MSRDailyActivity 3D [29], participants performed the actions twice, in two positions, “sitting” and “standing”. More recent datasets also tend to use multiple Kinects as their camera of choice.

3) External video source: Several datasets, such as Kinetics-700 [47], have an almost unlimited number of camera angles because they take their videos from YouTube or existing movies. Once again, as in the earlier discussion in Sect. 3.1.3, researchers in this field generally have to choose (see Fig. 6) between volume and variety of angles during data collection, unless external sources are used.

3.1.4 Discussion for real-life data collection and the Metaverse

Activity recognition and activity detection are fields that have been of interest to researchers for many years. Currently, however, people are beginning to realize that despite this field being recognized as an active and challenging research area in computer vision during the last decade, great progress cannot be made if limited attention is given to the data [9]. Datasets are behind each ADR solution that is proposed, and the current data pool is limited—especially in the real-life area of ADL. The focus on ADL is relatively new and is steadily gaining traction. With the emergence of the Metaverse, people are looking for ways to incorporate ADR technologies into the Metaverse and vice versa.

There is no doubt that a significant amount of effort is needed to construct real-world activity datasets that can contribute to the field of ADR. Resources such as money, manpower, physical locations, coordination and time are key components of creating even a basic dataset. Basic datasets already exist; the challenge is to create a dataset that a) contributes to the field by introducing real-life challenges, b) is unique and large enough for models to be adequately trained before being deployed in real-life scenarios [42], and c) can actually be produced with the resources at one’s disposal, given the coordination required for a collection process of that magnitude. Furthermore, in this area of ADL, researchers must consider privacy concerns for the participants involved, which makes it harder for datasets to be made available to the research community. Given these concerns, creating real-world ADL datasets, whether in controlled laboratories or in people’s homes, is expensive and resource intensive. As a result, such datasets are extremely difficult to create.

Moreover, the range and variety of real-world ADL datasets are often constrained by the environment, so the type of diversity currently observed in the field is relatively limited, as described in Sect. 3.1.3. The factors we see varied most often are camera angles and the number of participants. In most datasets discussed, the common method of collecting data involves recording real people performing their daily activities, both scripted and unscripted, using multiple cameras, multiple Kinects, multiple angles or a combination of these [14, 21, 29, 34]. Some take the extra step of having participants perform the same action multiple times in front of each camera (from different angles) to add “diversity” to the dataset. However, current methods and approaches are insufficient. In addition to the effort needed to properly annotate the datasets, massive manpower and coordination efforts are required, and in some cases initial annotation efforts were insufficient, leading to subsequent contributions that improved upon previous works [21, 35, 46, 47]. In ADR, as discussed in Sect. 2.4, it is important to prevent bias and to ensure sufficient diversity of real-world examples in datasets before deploying the models. It is clear that when models are trained and tested with datasets that include a diverse and robust set of real-world challenges and activities, these models are extremely helpful in practical applications. From the literature review, we can conclude that a model trained with such a dataset would yield better results than models trained on less diverse datasets.

Data are the currency of the Metaverse, and researchers and everyone else involved in the Metaverse will need a constant influx of data to train new models, and to train different models for various fields such as CV. Given that generating a single dataset with large variety and at large scale is extremely time-consuming and resource-intensive, current real datasets are unlikely to keep up with the Metaverse’s thirst for new data. With the evolution of the Metaverse, however, we can expect an influx of new Metaverse technologies that can help us generate more data.

It is crucial to develop a data collection method that is cost-effective, user-friendly, and efficient while allowing us to generate large-scale, diverse, unbiased ADL datasets for ADR. How do we obtain such an ideal dataset, with all the necessary parameters and factors included, to train models on? Our proposed solution is synthetic data.

3.2 Synthetic ADL datasets

3.2.1 Existing synthetic datasets

Synthetic data generation is usually viewed as a low-cost avenue for obtaining unlimited reliable data for studies and research. In reality, much pre-processing is needed to ensure that synthetic data are useful. According to our review of the synthetic datasets in this field [16, 18–20, 48], the current methods for generating synthetic data are limited by their complexity. With COVID-19, creating datasets through on-site recordings with real humans has become a challenge. For this review, we exclude synthetic data papers that do not provide a publicly available dataset for ADR. A summary of the chosen datasets and their data generation methods can be found in Table 5.

Table 5 Summary of synthetic data generation methods and tools for videos. 1 Numbers in criteria correspond to Sect. 4.2. 2 MoCap: motion capture. 3 Maya: Autodesk Maya. 4 Pre-programmed motions within the SIMS game

The challenge, as discussed by Andrew Ng, is to create data that a) have data points that are indistinguishable from real-life data, b) have noticeable variations between generated data points and c) have context similar to the target environment, for example, camera angles in indoor environments for ADL [9].

A variety of methods have been explored for generating synthetic data, ranging from generative adversarial networks (GANs) to composite methods, games and synthetic data platforms [19, 49, 50]. We approach the analysis of synthetic ADL from the perspective of existing datasets and their data generation methods.

ElderSim [20] is a synthetic data generator used to create SynADL, a large-scale synthetic dataset of elderly individuals’ activities targeting real-life elderly-care applications. SynADL provides 462,000 RGB videos and also provides 2D and 3D skeleton data covering 55 action classes. The researchers incorporated 28 camera viewpoints, 15 synthetic elderly humans, five variations in lighting conditions and four backgrounds. The SynADL team used a real-time photorealistic rendering platform, Unreal Engine 4 (UE4), and a 3D computer animation and modeling software called Autodesk Maya (Maya). Using the two software programs, the environments of elderly living spaces are constructed. Next, the appearances and the movements of the synthetic humans are rigged according to MoCap data obtained from live recordings of real people [20]. Although the dataset produced is large, generating these data requires multiple steps and a combination of multiple software and hardware tools.

SURREAL [18] (Synthetic hUmans foR REAL tasks) is another large-scale dataset with images of humans that are synthetically generated but realistically rendered from 3D sequences of human motion capture data. This dataset consists of approximately 6 million frames together with ground truth poses, depth maps, and segmentation masks. The 6 million frames are photorealistic renderings of people under large variations in shape, texture, viewpoint and pose. Creating the dataset involved several steps, starting with 3D sequences of MoCap data from which the images are rendered. To ensure realism, the synthetic bodies are created using the skinned multi-person linear model (SMPL) [51], whose parameters are determined by the MoSh method given the raw 3D MoCap marker data [52]. In total, the dataset consists of 67,582 video sequences. To ensure that the dataset is diverse and variable, a human body with a random 3D pose, shape and texture is rendered from a random viewpoint under random lighting and background settings. The SURREAL team determined these to be the important factors to randomize when generating synthetic data. Each of these factors is associated with its own set of pre- and post-processing steps, which are further described in Ref. [18]. For example, randomizing a single factor such as “human texture” requires the researchers to extract the SMPL texture from CAESAR scans and to combine it with a second set of extractions performed in parallel, using 4Cap to obtain 3D scans of normal clothes to compensate for the low quality of the first extraction. To implement privacy-awareness measures, the faces of the real people from the MoCap are replaced with a generic CAESAR face. From these details, we can deduce that replicating this process of synthetic data generation may be challenging for researchers who do not have access to such extensive resources.
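
A schematic of this kind of per-sequence factor randomization is sketched below. It is not SURREAL’s actual tooling: the coefficient ranges, asset-bank sizes and the `render_frame` call are hypothetical stand-ins for the SMPL-based body model, texture bank and renderer described in Ref. [18].

    import random

    def sample_render_config(mocap_sequence, rng=random):
        """Draw one random configuration of rendering factors (illustrative only)."""
        return {
            "pose_sequence": mocap_sequence,                      # pose driven by MoCap data
            "shape": [rng.gauss(0.0, 1.0) for _ in range(10)],    # e.g. SMPL-style shape coefficients
            "texture_id": rng.randrange(1000),                    # index into a hypothetical texture bank
            "camera_azimuth_deg": rng.uniform(0.0, 360.0),        # random viewpoint
            "light_intensity": rng.uniform(0.5, 1.5),             # random lighting
            "background_id": rng.randrange(500),                  # index into a hypothetical background set
        }

    # Hypothetical usage:
    # for seq in mocap_sequences:              # list of MoCap clips (not provided here)
    #     cfg = sample_render_config(seq)
    #     frames = render_frame(cfg)           # hypothetical renderer, e.g. a Blender/UE wrapper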

Synthetic actors and real actions (SARA Motion) [17] is a synthetic skeletal dataset focused on creating data for training a model to produce motion embeddings suitable for reasoning about motion similarity. This dataset is included here because the motions it contains are a form of ADL and the method could be used in the field of ADR. The researchers of this dataset performed some pre-processing of the data: to account for a real-world environment, they adjusted the scale of the human character with respect to its distance from the camera. The skeleton size in each video is modified by a scaling factor randomly sampled between 0.5 and 1.5. Following this scaling adjustment, a reference joint for each body part is chosen, and all joint coordinates are shifted from absolute to relative positions. The SARA dataset includes four action categories (Combat, Adventure, Sport, and Dance) comprising variations of the captured motions. Each motion lasts approximately 32 frames, with 4428 basic motions (e.g., dancing, jumping) in the SARA dataset [17]. The tool used, Mixamo, “allows the users to control various characteristics of each motion (e.g., energy) that can be adjusted to create different motion characteristics” [17]. With the variations executed on each basic motion, the total number of videos in this dataset is 103,413.
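
A minimal numpy sketch of that pre-processing (random scaling in [0.5, 1.5] followed by conversion from absolute to reference-relative joint coordinates) might look like the following. The joint layout, the use of a single reference joint rather than one per body part, and the array shapes are simplifying assumptions for illustration.

    import numpy as np

    def preprocess_skeleton(joints, reference_idx=0, rng=np.random.default_rng()):
        """joints: (T, J, 2) array of absolute 2D joint coordinates over T frames.

        Returns scaled, reference-relative coordinates, loosely mimicking the
        scale augmentation and relative-coordinate step described for SARA [17].
        """
        scale = rng.uniform(0.5, 1.5)                              # random scale, as in the paper
        scaled = joints * scale
        reference = scaled[:, reference_idx:reference_idx + 1, :]  # per-frame reference joint
        return scaled - reference                                  # absolute -> relative positions

    # Hypothetical usage:
    # skeleton = np.load("motion_clip.npy")     # e.g. a (32, J, 2) basic motion
    # relative = preprocess_skeleton(skeleton)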

HOISim [16] is a 3D activity simulator that was used to produce a small-scale, procedurally generated synthetic dataset of two sample daily life activities, namely “lunch” and “breakfast”. Similar to previous synthetic datasets, the HOISim team introduced methods to meaningfully randomize activity sequences and the environment, with the objective of meaningfully generating additional synthetic data within the same scene. However, it should be noted that producing this synthetic dataset involved many steps [16]: a) ensuring the synthetic human knows the environment layout and can navigate it using a path-finding algorithm, b) using NavMesh to generate the environment map, c) using the software MakeHuman to model the synthetic human and rig its skeleton, d) using a built-in animator to manually animate basic actions such as reaching, picking and cutting, which can then be retargeted onto other characters, e) using the inverse kinematics solver plugin in the Unity Engine to make the movements of the joints and fingers reasonably realistic and f) using the ROS plugin “ROSbag” to export sensor data, labels and annotations from each scene [16]. A further explanation of each of these steps can be found in the original paper [16].

Sims4Action [19] is a dataset created with the popular commercial game THE SIMS 4. The dataset consists of 10 hours of footage capturing 10 different ADL activity classes. Using a game platform is a unique approach to generating synthetic data for ADL, especially since THE SIMS 4 allows players to control what their characters do in the game. The researchers used 6 different locations, 4 camera angles, and various human appearances (4 elderly, 2 adults and 2 young adults). It is one of the more recent related studies showing that deep learning-based ADL recognition approaches are highly sensitive to changes in data distribution [19].

In the next section, we will look at the common characteristics and trends across these synthetic datasets for ADL. In Sect. 4, we introduce a new streamlined approach to synthetic data generation, SynDa, which is still in its early work phase.

A full summary of the above review can be found in Table 5.

3.2.2 Discussion of synthetic datasets and their place in the Metaverse

Synthetic data generation is the process of producing artificial data that mimic a real-world dataset. The ideal set of synthetic data is regarded as one that cannot be distinguished from real-world data. With the ability to generate such data, researchers can feed an effectively unlimited amount of data to models for training, which in turn improves their performance. The team here sees synthetic data not as a full substitute for real-world data but rather as a complement to existing data to improve model performance. With Metaverse technologies, we can augment existing practices and accelerate the adoption of synthetic data in computer vision.

1) Privacy in synthetic data

As seen in the previous section, there were limited privacy concerns for most of the released datasets, as the synthetic actors used were digitally generated, while the actions were captured from anonymous real participants. In the small-scale dataset released with HOISim [16], the synthetic humans were created using the MakeHuman software, an open-source synthetic human generator; the generated humans bore no correlation to the participants who provided the actions, which were animated using a built-in animator for Unity3D. In Sims4Action [19], the synthetic humans are in-game assets, so no personal data are needed or used to generate the dataset. In datasets that require MoCap, such as SynADL [20] and SURREAL [18], there might be privacy concerns if the models resemble the real-world participants, or if there are physical identifiers pointing to a participant’s identity or location. However, in these cases the teams only used the MoCap data from the participants, with no clear markers of the participants’ identities or locations, which alleviates privacy concerns.

To fully understand the issue, we also consider the possible risks that researchers face when using real-world data to generate synthetic data. If the real-world data contain unique data points or identifiers that are captured during the real-to-synthetic process, naturally these unique characteristics will spill over into the synthetic dataset. These unique data points could be easily identified because they exist within the real-world dataset, and thus participant information can be leaked.

In the context of the Metaverse, privacy will be important for users, similar to how privacy matters in existing social networks. Users might be averse to using technologies that expose intimate and personal details about themselves without their consent. By generating synthetic real-world human data, real-world users can be assured that their privacy is not compromised.

2) Photorealism and quality of synthetic data in research and the Metaverse

Current approaches to synthetic data generation often prioritize quantity over the quality and photorealism of the generated data. Additionally, data produced by these methods may exhibit imperfections or inconsistencies, often referred to as “noise” or “jerkiness”, which are not present in real-world datasets. This raises the concern that if non-photorealistic or low-quality data significantly influence a model’s training, they could undermine the model’s effectiveness in real-world applications. In essence, the focus on quantity in synthetic data generation may compromise the model’s ability to perform as expected in practical scenarios.
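
One simple way such “jerkiness” in a generated pose sequence might be quantified is by looking at frame-to-frame joint motion, for example the mean magnitude of the second difference (acceleration). The sketch below is an illustration under that assumption, not a metric used by any of the datasets reviewed.

    import numpy as np

    def jerkiness_score(joints, fps=30.0):
        """joints: (T, J, 3) array of joint positions over T frames.

        Returns the mean per-joint acceleration magnitude; higher values
        suggest noisier, more 'jerky' motion (illustrative metric only).
        """
        velocity = np.diff(joints, axis=0) * fps          # (T-1, J, 3)
        acceleration = np.diff(velocity, axis=0) * fps    # (T-2, J, 3)
        return float(np.linalg.norm(acceleration, axis=-1).mean())

    # Hypothetical comparison between a real and a synthetic clip:
    # score_real = jerkiness_score(real_pose_sequence)
    # score_synth = jerkiness_score(synthetic_pose_sequence)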

There is little point in generating excessive amounts of data if we do so aimlessly with no quality control. Even with a clear objective, the quality of the data that can be generated is limited by the tools and technology available at that point in time. Real-world data are complex and rarely uniform. Currently, there is no universal solution for generating quality synthetic data in an efficient manner. We introduce SynDa in Sect. 4 as a semi-automated pipeline that uses existing datasets to generate synthetic data.

The generated data should ideally possess the characteristics and realistic challenges that real-world footage contains. However, most synthetic datasets are limited by the technology available at the time of publication. We note that photorealistic technologies such as NVIDIA's Omniverse and Unreal Engine 5 are on the rise and are enabling the replication of real-world actions, quirks, and environments. However, photorealism alone is insufficient if there is no clear way to capture real human motion and port it into these photorealistic environments.

3) Analysis of synthetic data generation methods for ADL

Currently in the field of computer vision, researchers are seeking additional data that will allow them to construct models that can detect a wider range of actions, both rare and common. As discussed in Sect. 1, real-world ADL datasets are tedious and costly to generate. Synthetic data provide researchers with a new avenue for generating meaningful and cost-efficient data. In the existing synthetic ADL datasets reviewed, we note that methods of generation are often inefficient—requiring many steps, which might limit the accessibility of synthetic data generation.

It is crucial that the movements of the synthetic humans mimic reality; to achieve this, the works behind SynADL and SURREAL [18, 20] used MoCap to capture real motions before using those data for rigging. In addition, the researchers needed to ensure that the motion data were properly translated into synthetic human movements before bringing them into UE4 or Unity3D. When we look at the scale of these datasets, from their action classes to their camera viewpoints, it can be deduced that considerable time, effort, and money are required for their generation. Unfortunately, resources at that scale are scarce and difficult to obtain, which leaves the synthetic data generation process limited by the rigor of the process itself. In cases where the animation of the synthetic human is not driven by real people [19], the data may lack realism. In the Sims4Action dataset [19], the animations are taken from The Sims 4 game, which was not created with realism or ADR algorithms in mind. As a result, the data obtained may not be as realistic as expected, and this may affect model performance because the models would be trained on data that are not representative of real-world movements.

One of the main advantages of synthetic data is that large volumes of data can be produced with relative ease. In the case of ADL, most of the datasets explored have produced large volumes of data. For datasets such as SARA [17], SynADL [20], and SURREAL [18], thousands of hours of footage were produced synthetically, whereas real-world ADL datasets [14, 21] required each hour of video to be recorded manually. While synthetic dataset generation is not as simple as clicking a single button, it still requires comparatively less time and money to produce large volumes of data. We also acknowledge that current synthetic data generation methods still have room to improve before reaching the level of ease we would hope to see in programmatically generated video data.

3.3 Using Metaverse technologies to propel synthetic data

There is no argument that the Metaverse contains infinite possibilities and potential to change the way we work, interact with others, play and exist. Synthetic data will have a part to play in enabling these changes. Synthetic data can be used to train models for simultaneous localization and mapping (SLAM) and even create realistic automated interactions such as virtual helpdesks and virtual assistants.

Among the essential devices for the Metaverse are VR and AR headsets. As the Metaverse grows and develops, so do the technologies behind AR and VR. Recently, devices such as the Microsoft HoloLens and HTC Vive have adopted inside-out positional tracking [53], where the cameras or sensors are located on the device being tracked (e.g., the HMD). The older iteration of VR technology used the "outside-in" approach, where the sensors are placed in a stationary location. This change has advanced the field of computer vision and has helped researchers make further headway into solving the SLAM problem. Previously, SLAM for VR headsets required special room setups before the VR/AR devices could be used. SLAM is a prime example of where synthetic data can be used: we can generate a realistic 3D scene and produce large volumes of data with the flawless annotations that are needed.
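As a rough illustration of the "flawless annotation" point, the sketch below pairs every frame of a simple virtual camera trajectory with its exact ground-truth pose, the kind of label that is essentially free in a synthetic scene but costly to obtain from real footage. The trajectory, output format, and function names are assumptions for illustration and are not tied to any particular rendering engine.

```python
# Illustrative sketch: in a synthetic 3D scene the exact camera pose for every
# rendered frame is known, so ground-truth labels for SLAM come "for free".
# The circular trajectory and JSON output are assumptions for illustration.
import json
import numpy as np


def camera_trajectory(n_frames: int, radius: float = 3.0, height: float = 1.6):
    """Yield (position, yaw) for a simple circular camera path around the scene origin."""
    for i in range(n_frames):
        angle = 2 * np.pi * i / n_frames
        position = np.array([radius * np.cos(angle), height, radius * np.sin(angle)])
        yaw = angle + np.pi  # always face the origin
        yield position, yaw


def export_ground_truth_poses(n_frames: int, out_path: str = "poses.json") -> None:
    """Record the exact pose of each frame; the renderer of choice would produce the images."""
    poses = [
        {"frame": i, "position": pos.tolist(), "yaw": float(yaw)}
        for i, (pos, yaw) in enumerate(camera_trajectory(n_frames))
    ]
    with open(out_path, "w") as f:
        json.dump(poses, f, indent=2)
```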

Another area that garners much interest is virtual avatars in the Metaverse. Many existing works deal with capturing human poses from footage, as discussed in Sect. 3.1.1. Nevertheless, we should also direct our attention towards how we can use Metaverse technologies to create realistic synthetic avatars and then train models on those data. This approach would be especially useful in creating synthetic data for computer vision model training. Our review is focused on computer vision, and researchers in this field have made forays into exploring virtual synthetic humans [5, 16, 17, 20]. The processes and steps required for these works have been discussed in Sect. 3.2. However, the technology used in many of those works created synthetic humans that are not yet ready for the Metaverse. There are issues to be solved, from a lack of photorealistic humans to easing the complexity of the data generation process. Technologies that have emerged with the growing popularity of the Metaverse, from NVIDIA's Omniverse [3] to Unreal Engine's Metahumans, can help address these issues. Therefore, we propose a new set of steps framed from the perspective of using Metaverse technologies to generate synthetic data for human activities.

Contributing to existing synthetic data generation (SDG) methods, we introduced SynDa in an earlier publication [24]: a semi-automatic synthetic data generation pipeline that leverages existing real-world ADL datasets to create synthetic equivalents. This is further discussed in Sect. 4.

4 Using Metaverse technologies: SynDa pipeline

This paper highlights SynDa, our streamlined method for generating synthetic data using RTX rendering technology and the Maxine Pose Tracker (MPT) [54]. In our previous paper, we used these tools, which are available in NVIDIA's Omniverse, to implement this pipeline.

SynDa [24] allows us to leverage existing footage to create synthetic datasets, reducing the mandatory human involvement required by previous dataset generation methods. Existing methods of synthetic data generation are often costly, require extensive pre- and post-processing of recorded data, and frequently require researchers to perform real recordings of human activities. At the time of writing, our survey found no other works that leverage existing footage to create synthetic datasets.

Figure 8 illustrates the complete SynDa pipeline. More details can be found in the original paper [24]. The following list provides a summary of the pipeline:

  1) The pipeline begins with an input of a real-world video.

  2) 2D camera capture - We convert the camera data into a vector representation.

  3) Retargeting the synthetic human - SynDa uses MPT [54], the skeletal pose estimator available in Omniverse, to retarget the synthetic humans and export the skeletal animation. MPT is based on a convolutional neural network whose architecture consists of a backbone network, an initial estimation stage that performs pixel-wise prediction of confidence maps, and multi-stage refinement of the initial predictions.

  4) Export the skeletal animation as a USD file - Upon export, users obtain the skeleton animation data from the MPT and use the Sequencer tool to animate characters with it.

  5) Attach the SkelAnim to a synthetic human - The user attaches the MPT-generated animation to a synthetic human in the synthetic scene.

  6) 3D scene manual reconstruction - The synthetic scene in which the recording takes place is created. To leverage existing datasets, the ground-truth (GT) annotations can be obtained from the existing GT data and re-used for the synthetic data, because every action remains the same.

  7) 2D synthetic scene camera capture - The 2D synthetic scene footage replaces the real-life footage in model training. By applying this pipeline to existing datasets, the new synthetic dataset can then be used for model training. An example of this pipeline can be found in Fig. 9, and a minimal code sketch of the flow is given after this list.
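To make the flow above concrete, the following is a minimal, hypothetical Python sketch of the pipeline's orchestration. The helper names and data structures are placeholders of our own; the published pipeline [24] performs these steps with the Omniverse tools described above (MPT for pose estimation, the Sequencer for animation, and RTX rendering for capture).

```python
# Hypothetical sketch of the SynDa flow; the helper names are placeholders and
# the published pipeline [24] performs these steps inside NVIDIA Omniverse
# (MPT for pose estimation, Sequencer for animation, RTX for rendering).
from dataclasses import dataclass, field
from typing import List


@dataclass
class SkeletalAnimation:
    # Per-frame joint transforms recovered by the pose estimator.
    frames: List[dict] = field(default_factory=list)


def estimate_poses(real_video: str) -> SkeletalAnimation:
    """Steps 1-3: 2D capture of the real video and skeletal pose estimation."""
    return SkeletalAnimation()  # placeholder output


def export_skeletal_usd(animation: SkeletalAnimation, usd_path: str) -> str:
    """Step 4: write the skeleton animation out as a USD file for the 3D engine."""
    return usd_path  # placeholder; the real export is handled by the engine tooling


def synthesize_clip(real_video: str, scene_usd: str, out_video: str) -> None:
    """Steps 5-7: attach the animation to a synthetic human in the reconstructed
    scene and capture the 2D synthetic camera footage."""
    animation = estimate_poses(real_video)
    anim_usd = export_skeletal_usd(animation, out_video + ".anim.usd")
    # Ground-truth labels are reused from the original dataset because the
    # actions themselves are unchanged by the real-to-synthetic conversion.
    print(f"Render {anim_usd} inside {scene_usd} -> {out_video}")
```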

Figure 8 Converting real video data [5] into synthetic video data using SynDa [24]

Figure 9 Results of leveraging real-world data [14] via pose estimation [54] to synthesize human actions and create synthetic data using SynDa (discussed further in Sect. 4)

4.1 Next iteration of SynDa

We have documented some of the early progress up to the time of writing in our other publications [24, 55]. The latest iteration of SynDa is referred to as SynthDa, which we will reference in future works [56].

4.2 Enhancing ADR datasets with Metaverse technology: improving synthetic data generation

In the realm of data generation with Metaverse technology, researchers and creators have gained substantial autonomy in crafting scenes and generating data. This newfound flexibility has led us to outline a preliminary list of essential factors to consider when creating synthetic data with Metaverse technologies, exemplified by Omniverse [3]. These factors are listed in Table 5.

While there is a wealth of research on key characteristics in real-world video datasets for ADL detection [5, 12–14, 18, 21], their counterparts in synthetic data have received comparatively less attention. This disparity may arise from the complexities involved in the data generation process or the relatively nascent state of this field.

The focal points in real-world ADL datasets are consolidated in Table 2, serving as a valuable reference for our synthetic data generation process. Drawing from this, we pinpoint the following characteristics as crucial for inclusion in synthetic datasets:

  C1. Camera Angles - Camera angles refer to multiple views of the scenes and activities.

  C2. Spontaneous Acting - Spontaneous acting refers to whether the activity is scripted or occurs naturally.

    – The team noted that this aspect is important for capturing the challenges of real-world ADL and for ensuring that models have realistic training datasets to train on.

  C3. Variable Duration - Variable duration considers whether the clips run for similar periods of time. Real-life challenges arise when not all actions are performed for the same duration.

  C4. Composite Activity - Composite activity considers whether the dataset includes larger composite activities and their fine-grained sub-activities, for example, cooking as the composite activity and stirring as the fine-grained activity.

  C5. Variation in Human Appearances - Variation in appearance notes whether a variety of humans is included in the dataset, to reduce the chance of the dataset being biased by appearance or gender.

  C6. Context - Context refers to the background information in the video and whether it is rich enough to recognize the activities in the scene.

    – Context is not currently supported by SynDa, and the current pipeline works best with context-free data. The general observation for the ADL datasets reviewed is that little context is available, owing to similarities in ADL environments (indoor residences) and insufficient background information. A minimal configuration sketch consolidating C1-C6 is given after this list.
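As an illustration, the characteristics C1-C6 could be consolidated into a single configuration object that parameterizes a synthetic generation run. The field names and default values below are assumptions for illustration, not part of the current SynDa implementation.

```python
# Minimal sketch: consolidating C1-C6 into a configuration object that a
# synthetic generation run could be parameterized with. The field names and
# defaults are illustrative assumptions, not part of the current SynDa pipeline.
from dataclasses import dataclass, field
from typing import List


@dataclass
class SyntheticAdlConfig:
    camera_angles: List[str] = field(default_factory=lambda: ["front", "side", "overhead"])  # C1
    spontaneous_acting: bool = False   # C2: scripted (False) vs. naturally occurring (True) motion
    variable_duration: bool = True     # C3: allow clips of differing lengths
    composite_activities: bool = True  # C4: composite labels (e.g. cooking) with fine-grained ones (e.g. stirring)
    appearance_variants: int = 10      # C5: number of distinct synthetic human appearances
    include_context: bool = False      # C6: render contextual background cues (not yet supported by SynDa)
```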

By identifying these characteristics, we aim to improve the quality and relevance of synthetic datasets in Metaverse technology, enhancing their utility for advancing research and applications.

To demonstrate SynDa’s capabilities, we utilized the TSU dataset [5] because it aligns with the criteria detailed in this section. The SynDa pipeline is designed to convert various video datasets with a single human focus into synthetic data through pose extraction.

In our preliminary testing, as shown in Table 6, we converted the TSU's existing data into synthetic data using the SynDa pipeline, with promising results. Models trained with up to 50% of the real-world data replaced by synthetic data outperformed models trained solely on real data. This novel, streamlined pipeline offers an innovative approach to generating synthetic data from existing datasets, with the potential to enhance model performance.

Table 6 Preliminary results from testing with the TSU model. The Real+Synthetic training set is an equal split of 50% real video data and 50% synthetic data generated using the SynDa pipeline. The best result is marked in italics
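For illustration, the sketch below shows one way the Real+Synthetic split in Table 6 could be constructed, assuming each real clip has a SynDa-generated counterpart with identical annotations. The function and variable names are hypothetical.

```python
# Illustrative sketch: building the 50/50 Real+Synthetic training split,
# assuming each real clip is paired with a SynDa-generated counterpart that
# shares its annotations. Names and the seed are assumptions for illustration.
import random


def mixed_training_split(real_clips, synthetic_clips, synthetic_fraction=0.5, seed=0):
    """Replace a random subset of real clips with their synthetic counterparts."""
    assert len(real_clips) == len(synthetic_clips), "clips must be paired one-to-one"
    rng = random.Random(seed)
    indices = list(range(len(real_clips)))
    rng.shuffle(indices)
    n_synth = int(len(indices) * synthetic_fraction)
    synth_idx = set(indices[:n_synth])
    # Labels stay the same; only the footage source changes for the chosen clips.
    return [synthetic_clips[i] if i in synth_idx else real_clips[i] for i in range(len(real_clips))]
```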

4.3 Crafting the right synthetic data

When we combine the factors discussed in Sect. 4.2 with the knowledge gained from the earlier discussions in Sect. 3.1.2 and Sect. 3.1.3, we note that these factors, when adjusted, have a direct impact on the quantity and quality of the data produced. The factors that affect real ADL datasets will also spill over to affect synthetic datasets, since synthetic datasets are created to mimic real datasets.

With reference to the taxonomy diagram derived in Sect. 3.1.3 and current datasets, we have determined that the factors discussed are important to consider when generating our own synthetic datasets using Metaverse technologies. The next question to ask is “to what extent do these factors impact the performance accuracy of ADR models?” This is the guiding question that will lead us in our future works on this topic.

5 Application of synthetic data for VR interactions and displays

The application of synthetic data in the realm of VR interactions and displays represents a groundbreaking approach with transformative implications. As VR technologies continue to advance, the demand for high-quality, diverse, and extensive datasets to fuel immersive experiences grows exponentially. However, acquiring real-world data for every possible scenario poses logistical and privacy challenges. Synthetic data bridge this gap.

We employ SynDa as one of the methods to produce synthetic data. Synthetic data generation involves creating human action datasets that emulate real-world scenarios while safeguarding sensitive information. In the context of VR interactions and displays, this approach becomes particularly powerful. By creating synthetic data that mimic various user behaviors and interactions, developers can design and test VR experiences more comprehensively and efficiently.

Moreover, with SynDa, synthetic data allow for rapid iteration and experimentation, enabling VR designers and developers to fine-tune their creations without the constraints of working solely with real-world data. This accelerates the innovation cycle, fostering the creation of more engaging and dynamic VR interactions and displays.

Furthermore, the application of synthetic data aligns seamlessly with the dynamic nature of the VR space. As VR hardware and software evolve, synthetic datasets can be adapted to reflect these changes swiftly. This flexibility enables the creation of VR experiences that are cutting-edge and synchronized with the latest technological advances.

6 Conclusion

6.1 Highlights

In conclusion, our review has provided an in-depth exploration of the current landscape of synthetic data generation and real data collection methods. We have highlighted the strengths and weaknesses of existing approaches, shedding light on the evolving challenges and opportunities in the field.

The weaknesses of the existing methods, such as limited variations, excessive manual labor, and high costs, have underscored the need for innovative solutions in data generation. These challenges have driven the data community to focus on key factors when creating new datasets. Among these focal points, we have identified critical elements that are shaping the future of data generation.

Camera angles. The ability to replicate various camera perspectives and angles is essential for training robust computer vision models that can handle real-world scenarios. It reflects the community’s recognition of the importance of diverse visual perspectives in data generation.

Spontaneous acting. To accurately represent natural human behavior, datasets need to include spontaneous, unscripted actions. The community’s emphasis on spontaneity reflects a commitment to capturing the unpredictability of real-world events.

Variable duration. Real-world activities often have varying duration. Reflecting this in datasets ensures that models can handle activities that do not conform to fixed time frames, making them more adaptable to real-world applications.

Variation in human appearance. Human subjects in datasets should exhibit diverse appearances, including variations in clothing, age, gender, and ethnicity. This factor acknowledges the importance of addressing potential biases and ensuring the inclusiveness of AI models.

Context. Real-world scenarios often unfold within specific environmental contexts. Incorporating these contexts into datasets is vital for enhancing the contextual understanding of AI models.

In essence, the data community's dedication to addressing these key factors in dataset creation demonstrates a forward-looking approach, aiming to bridge the gap between synthetic data and real-world applications. As these considerations become central to dataset development, we can expect more robust, adaptable, and inclusive AI models that are better equipped to tackle the complex challenges of the real world. The future of data generation holds great promise, with these focal points at its core, driving innovation and advancing the field of artificial intelligence.

In VR, creating immersive and realistic experiences relies on the accurate simulation of human actions and behaviors. SynDa's synthetic data generation method, which leverages real-world datasets, provides a powerful solution for enhancing VR experiences. This is crucial for VR applications that aim to replicate real-life scenarios, as it enhances the authenticity and engagement of virtual experiences. A limitation is the human effort still required to carry out the AI pose estimation and photorealistic rendering steps. With SynDa's streamlined pipeline, VR developers can access a wealth of high-quality synthetic data, reducing the need for extensive real data collection. This not only accelerates VR content creation but also ensures a richer and more immersive user experience, ultimately pushing the boundaries of what is achievable in the realm of virtual reality.

6.2 Future works based on limitations

Future work would address the limitations of the current SynDa [24] pipeline. First, we will extend testing of the pipeline from a sample of the Toyota Smarthome dataset to the full dataset, along with other datasets. Next, we will reduce the manual input needed from users to create synthetic video data with SynDa, as we want to make the pipeline as streamlined and automatic as possible. In doing so, we aim to support batch conversion of real-world data to synthetic data in our current pipeline.

To quantify the outcomes of these improvements, we will provide full results from the experiments conducted once the aforementioned goals are achieved. Additionally, we will explore preliminary ablation studies to investigate the correlation between the amount of synthetic data used and the performance of the resulting models.
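A minimal sketch of how such an ablation could be organized is given below; the fraction grid is an illustrative assumption, and train_and_evaluate is a placeholder standing in for the full ADR training and evaluation loop.

```python
# Exploratory sketch of the planned ablation: sweep the fraction of synthetic
# training data and record model performance. train_and_evaluate is a
# placeholder for the ADR training/evaluation loop; the fractions are assumed.
def train_and_evaluate(n_real: int, n_synthetic: int) -> float:
    """Placeholder: would train the ADR model on the given mix and return its score."""
    return 0.0


def ablation_over_synthetic_fraction(n_clips: int, fractions=(0.0, 0.25, 0.5, 0.75, 1.0)):
    results = {}
    for fraction in fractions:
        n_synth = int(n_clips * fraction)
        results[fraction] = train_and_evaluate(n_real=n_clips - n_synth, n_synthetic=n_synth)
    return results
```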

The preliminary conclusion from this paper is that when synthetic data are substituted for 50% of the real-world data, the trained model produces results that are better than those of both a 100% real-trained model and a 100% synthetic-trained model. This leads us to an exploratory hypothesis that synthetic data generalize human activity well enough to augment real-world data, even when the amount of real-world data is halved, suggesting that synthetic data can be substituted for a portion of real data. We propose that the way forward to achieve this is through Metaverse technologies.