Deep-learning for automated markerless tracking of infants general movements

The presence of abnormal infant General Movements (GMs) is a strong predictor of progressive neurodevelopmental disorders, including cerebral palsy (CP). Automation of the assessment will overcome scalability barriers that limit its delivery to at-risk individuals. Here, we report a robust markerless pose-estimation scheme, based on advanced deep-learning technology, to track infant movements in consumer mobile device video recordings. Two deep neural network models, namely Efficientnet-b6 and resnet-152, were trained on manually annotated data across twelve anatomical locations (3 per limb) in 12 videos from 6 full-term infants (mean age = 17.33 (SD 2.9) wks, 4 male, 2 female), using the DeepLabCut™ framework. K-fold cross-validation indicates the generalization capability of the deep-nets for GM tracking on out-of-domain data with an overall performance of 95.52% (SD 2.43) from the best performing model (Efficientnet-b6) across all infants (performance range: 84.32–99.24% across all anatomical locations). The paper further introduces an automatic, unsupervised strategy for performance evaluation on extensive out-of-domain recordings through a fusion of likelihoods from a Kalman filter and the deep-net. Findings indicate the possibility of establishing an automated GM tracking platform, as a suitable alternative to, or support for, the current observational protocols for early diagnosis of neurodevelopmental disorders in early infancy.


Introduction
Injury to the developing fetal or infant brain (for example, due to hypoxic-ischemic encephalopathy, perinatal stroke, or infection) can cause severe impairments that lead to physical disability and cerebral palsy [1][2][3]. The quality of General Movements (GMs) in infants up to around 5 months of corrected age [4,5] reflects the health of their neuromotor system. Early prediction of neurological disorders facilitates early intervention during critical periods of heightened neuroplasticity, which a growing body of evidence confirms improves clinical outcomes [6].
The General Movements Assessment (GMA) is highly predictive of [7][8][9], and sensitive (98%) to [10], the risk of developing CP. The GMA is performed by at least two highly-trained clinicians observing the infant awake, calm and alert, via video viewed at normal or increased speed, to evaluate the spatial and temporal complexity and variation of their movements [4,11,12]. The need for trained assessors is one barrier to widespread adoption of the GMA in standard clinical care. Manual assessment is time-consuming, requiring breaks from reviewing high numbers of videos to minimize fatigue-related error. Limitations to clinical resources typically mean that only infants meeting specific clinical criteria will undergo an assessment of their GMs, introducing the possibility that cases among those not meeting the initial criteria are missed. Hence, there is an urgent need to develop technological methods that can circumvent the bottleneck of the human assessor. Automated motion capture offers a low-cost, practical alternative to track and analyze anatomical movements effectively.
Automated neonatal GM tracking and analysis has been widely explored using various techniques [13], including utilization of wearable biotech [14,15], 3D motion capture [16,17], wearable accelerometers [18,19], 3D RGB-D camera recordings [20], and conventional video recordings [21][22][23][24][25][26][27], including with another deep-learning markerless pose estimation algorithm [28]. Most of these technologies require specialist wearable equipment, and the attachment of physical devices to the infant may alter their behavior and physiological responses [21]. 2D recordings from single-camera devices, such as mobile devices, web-cams, and baby-monitors, provide lower spatial resolution than gold-standard 3D motion capture systems. However, the low resolution required to capture movement patterns, along with the broad availability of single-camera devices, their reasonable price, and their flexibility of use (e.g., in home environments), make them an ideal, clinically relevant tool for capturing infant motor behavior. The utilization of technology readily available in the home creates the possibility of parents collecting data at the appropriate developmental stages, with little or no training, at convenient times for minimal disturbance of the infants' natural physical and mental states. Robust automated analysis, comprising movement tracking and GM classification, such as that proposed here, is required to process the extensive amount of data produced.
Deep neural network architectures are potent tools for reliably automating challenging signal processing tasks such as classification, identification, and segmentation [29][30][31][32][33][34]. These structures require large datasets for authentic learning and robust performance. Dataset construction can be even more challenging when it relies on manual annotation of data [35]. The application of deep-learning-based approaches is relatively new in the field of infant GMA, and only a few recent attempts have aimed to apply these novel techniques for movement classification [36][37][38]. Additionally, limited work has been done to investigate the utility of deep learning and semi-supervised frameworks for motion detection in 2D videos for body parsing, pose estimation, and GMA prediction [39,40]. To our knowledge, this is the first study to describe detailed analyses of automated marker placement accuracy. This work proposes a robust deep neural network, developed in the DeepLabCut environment [41,42], for automatic markerless motion tracking of infant movements in standard iPad video recordings, for the purpose of detecting GMs in the fidgety period. The approach's generalization performance will be assessed using k-fold cross-validation across 12 video recordings from six infants.

Ethics
All procedures in this study were approved by the Auckland Health Research Ethics Committee (AHREC 000146). Parents/caregivers were fully informed of the experimental purpose, filming procedure and methods, and provided informed written consent for their child's participation.

Clinical procedures
A cohort of six infants (four males and two females, mean height 62.66 (SD 2.56) cm, weight 6.98 (SD 0.24) kg) were assessed within the fidgety period: mean age of 121.3 (SD 20.4) days post-term corrected age (equal to 17.33 (SD 2.92) weeks) (Table S1). Infants were placed in a supine position, wearing only a nappy, on a white cotton mat. Two standard iPad minis (iPad mini 5, OS version 14.4.2, model MUQX2X/A, camera resolution: 8 MP; and iPad mini 4, software version 13.4.1, model MNY32X/A, camera resolution: 8 MP, 1080p HD video) were set up on tripods on both sides of the infant. Infants were filmed while awake, undistracted, and producing spontaneous movements, for 3-5 min.
Two videos (resolution: 1920 × 1080 pixels, frame rate: 29.97 frames/s) were captured from each of the six infants. Videos were post-processed in Adobe Premiere Pro 2021 to situate the infant in the center of the frame and crop any unneeded regions of the frames. Videos were then saved in their original quality. Twelve anatomical locations (3 per limb: shoulders, elbows, hands, hips, knees, and feet) were manually labeled by an expert (HA) in 200 frames from each video. These frames (2400 in total across the 12 recordings) were automatically extracted from each recording using k-means clustering implemented within DeepLabCut to maximize frame diversity. Examples of the manually labeled locations are shown with colored dots in Fig. 1.
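The idea behind k-means frame selection can be sketched as follows. This is a minimal NumPy illustration and not DeepLabCut's implementation: the function name `select_diverse_frames`, the use of flattened (downsampled, grayscale) frames as features, and the choice of the frame nearest each centroid are all assumptions made for the example.

```python
import numpy as np

def select_diverse_frames(frames, n_select, n_iter=20, seed=0):
    """Pick up to n_select frame indices by clustering frames with k-means.

    frames: array of shape (n_frames, height, width), e.g. downsampled grayscale.
    Returns a sorted list of distinct indices, one per non-empty cluster.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(frames, dtype=float).reshape(len(frames), -1)
    # Initialise centroids from randomly chosen distinct frames.
    centroids = X[rng.choice(len(X), size=n_select, replace=False)].copy()
    for _ in range(n_iter):
        # Squared distance from every frame to every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for k in range(n_select):
            members = X[labels == k]
            if len(members):
                centroids[k] = members.mean(axis=0)
    # Final assignment, then keep the frame closest to each centroid.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    chosen = set()
    for k in range(n_select):
        idx = np.flatnonzero(labels == k)
        if len(idx):
            chosen.add(int(idx[d2[idx, k].argmin()]))
    return sorted(chosen)
```

Because each cluster contributes at most one frame, visually similar frames (for example, long periods with little movement) are not over-represented among the frames sent for manual labelling.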

Deep-learning approach
Several open-source deep-learning-based markerless motion-tracking toolboxes, such as DeepLabCut (DLC) [42] and OpenPose [43], now exist that can follow specified points across video frames.
Nath et al. have also previously raised the possibility of using such a platform for human infant motion tracking [41,44]. Recent reports indicate the robust generalization capability of DLC in movement tracking of out-of-domain animal subjects (horses), with normalized errors close to those of the within-domain tested subjects [45].
Key features of DLC that help it achieve its high level of performance are (1) use of deep resnets and efficientnets that are pretrained on ImageNet [46], a benchmark dataset used for object recognition, and (2) deconvolutional layers, instead of a classification layer, for semantic segmentation [42]. Resnet-152 is a top-performing deep-net model created by Microsoft that won the ImageNet challenge in 2015 [47]. These deep artificial neural networks are designed to mimic the mechanism of certain biological neuronal cells, pyramidal neurons, to improve learning by optimally skipping intermediary connections in a deep structure [48]. On the other hand, the efficientnet architecture classes used for transfer learning [45] apply a compound scaling method that uniformly scales the network's depth, width, and resolution. This strategy has been shown to out-perform the resnet models on out-of-domain ImageNet data [45]. A recent study suggests that the efficientnet-b6 can achieve better accuracies than other efficientnet classes (i.e., b0 to b5) [45]; hence we used this model in the current study. For simplicity, we will call the resnet-152 and efficientnet-b6 models 'resnet' and 'efficientnet' from here on. Deconvolutional layers up-sample the visual details and generate spatial probability densities, which can later be used as evidence to locate a specific body part at a particular location.
In DLC, the deep-nets are iteratively fine-tuned by updating/adjusting the weight parameters using the manually labeled data. DLC automatically assigns high probabilities to the labeled anatomical sites and allocates low likelihoods to the rest of the image [42].
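How a spatial probability map becomes a marker position can be illustrated with a short sketch. It assumes a single 2D score map per body part and a uniform stride between score-map cells and image pixels; the function name `locate_from_scoremap` is hypothetical, and refinements used by real frameworks (such as sub-cell location offsets) are omitted.

```python
import numpy as np

def locate_from_scoremap(scoremap, stride=1.0):
    """Return (x, y, p) for the highest-scoring cell of a 2D probability map."""
    scoremap = np.asarray(scoremap, dtype=float)
    row, col = np.unravel_index(np.argmax(scoremap), scoremap.shape)
    p = float(scoremap[row, col])          # confidence of the detection
    # Map the cell centre back to image pixels (assumed uniform stride).
    x = (col + 0.5) * stride
    y = (row + 0.5) * stride
    return x, y, p
```

The returned `p` plays the role of the per-marker confidence value that is later thresholded when deciding whether a body part counts as identified.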
In this work, we first utilize the DLC environment to develop robust deep-learning models for markerless motion tracking in human infants' video recordings.

k-fold cross-validation
Subject-based k-fold leave-one-out cross-validation [49] can implicitly specify whether a model generalizes equally well across all subjects and helps to identify whether there is a significant variation in the dataset. The terms 'testing' and 'validation' are used somewhat interchangeably in the literature, which can cause confusion. In the following discussion we follow the conventions used by DLC, using the term 'testing' to refer to data withheld from training to assess performance and update training hyperparameters during the learning process, whereas the word 'validation' refers to a dataset that is withheld until the very end of the process for performance evaluation. In each training fold, data from one particular baby were left out for validation, and the two recordings associated with that baby were excluded from the training set. This strategy allows comparison and validation of the learning schedule for, and between, the resnet-152 and efficientnet-b6 models.
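The subject-based split described above can be sketched in a few lines. The file names and the helper `leave_one_subject_out` are hypothetical; the point is only that both recordings of the held-out infant are removed from training together.

```python
def leave_one_subject_out(videos_by_infant):
    """Yield (held_out_infant, training_videos, validation_videos) per fold."""
    infants = sorted(videos_by_infant)
    for held_out in infants:
        validation = list(videos_by_infant[held_out])
        training = [video
                    for infant in infants if infant != held_out
                    for video in videos_by_infant[infant]]
        yield held_out, training, validation

# Hypothetical file names: six infants with two recordings each.
videos = {f"infant{i}": [f"infant{i}_v1.mp4", f"infant{i}_v2.mp4"]
          for i in range(1, 7)}
folds = list(leave_one_subject_out(videos))
```

With six infants this produces six folds, each training on ten videos from five infants and validating on the two videos of the unseen infant.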
A 95%-to-5% training/testing scheme was used to train the deep-nets using an imgaug image augmentation strategy [50] in each training fold (round). The learning rate was initially set at 0.005 and decreased to 0.001 through a recommended multi-step updating regime. The cross-entropy loss function converged during training of all six folds, with a sharp decay during the first 30k iterations (for both models), and then gradually decayed further to lower values over 1030k and 700k iterations for the resnet and efficientnet, respectively. Efficientnet is a larger network, for which 700k training iterations were performed, as compared to 1030k iterations for the resnet. The root mean square errors (RMSE) between the labels identified by the deep-net classifier and the scorer's identifications were evaluated at each snapshot throughout the training process across all six folds, resulting in an overall training error of 2.73 (SD 0.06) pixels at the 1030k iteration for the resnet and 7.77 (SD 0.11) pixels at the 700k iteration for the efficientnet. Training took 126 (SD 3.5) hours across the six folds on the high-performance machines detailed in the "Computing infrastructure" section. In contrast, test errors were found to be higher at 6.73 (SD 0.17) pixels and 9.59 (SD 0.32) pixels for the resnet and efficientnet, respectively. These findings are consistent with previously described capabilities of DLC [42]. Figure S1 demonstrates evaluated RMSEs (pixel) across all six folds for training over 1030k and 700k iterations for the resnet and efficientnet. This plot confirms the fast fine-tuning of the models across all folds. After each training fold, the performance of the model was evaluated (validated) on the two videos from the excluded infant (out-of-domain/unseen participant).
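A pixel error of the kind reported above could be computed as in the following sketch. It assumes the error is the root mean square of the Euclidean distances between predicted and manually labelled points; the function name `rmse_pixels` and the handling of unlabelled points as NaN are illustrative choices, not a description of the DLC evaluation code.

```python
import numpy as np

def rmse_pixels(predicted, manual):
    """Root mean square of Euclidean distances between two sets of labels.

    predicted, manual: arrays of shape (n_points, 2) in pixels.
    Rows where the manual label contains NaN (not labelled) are skipped.
    """
    predicted = np.asarray(predicted, dtype=float)
    manual = np.asarray(manual, dtype=float)
    valid = ~np.isnan(manual).any(axis=1)
    distances = np.linalg.norm(predicted[valid] - manual[valid], axis=1)
    return float(np.sqrt(np.mean(distances ** 2)))
```

Evaluating this separately on the training and testing frames at each saved snapshot gives curves of the kind summarized in Figure S1.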

Kalman filter for an aPCK-based performance measure
To quantify the performance of markerless pose-estimation, automatically determined marker positions need to be compared against labels determined in another way, usually from manual labelling. To compare between evaluation metrics, the literature has established a trusted criterion, called the average Percentage of Correct Keypoints (aPCK) [45,51,52]. The aPCK approach typically needs a human assessor to annotate a validation set (in addition to the training/test set).
The aPCK then labels a predicted marker as 'correctly detected' if its location falls within a certain circular distance of the ground-truth manually annotated location. The distance is usually chosen as a fraction of an image-specific scaling factor such as the torso length. Because manual labelling of frames requires time-consuming, high-precision work, this approach is challenging when the validation set includes large datasets and numbers of frames. One mitigation of this problem is to validate against only a small number of additionally labelled frames. Here, we present an alternative novel automatic approach that allows for evaluation against an entire video. In our case, this would be approximately equivalent to manually labelling ~800k datapoints across ~60k frames. Conceptually, our proposed strategy follows a similar approach to aPCK, but instead of requiring manual measures at each frame it uses a Kalman filter (KF), building on the expectation of smooth and continuous motion, to estimate the position of a marker at time t based on the state (position, velocity, and acceleration) of the point in the previous frame t − 1. The KF uses automatically determined marker positions in previous frames as noisy measurements to update joint probability distributions for its position, velocity, and acceleration state variables, from which to predict a position in the current frame (Eq. 1) [53].
$x_k = F x_{k-1} + w_k$ (1)
where the state vector $x_k = [d_x, d_y, v_x, v_y, a_x, a_y]^\top$ comprises the x and y displacement, velocity, and acceleration, $F$ is the transition matrix (Eq. 2), $w_k$ is a noise term, and $dt$ is the time interval between frames.
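For a state ordered as displacement, velocity, and acceleration in x and y, the transition matrix would take the following form if the standard constant-acceleration kinematic model is assumed. This is a reconstruction under that assumption, written out for illustration, and may differ in detail from the matrix used in the study.

```latex
F =
\begin{bmatrix}
1 & 0 & dt & 0  & \tfrac{1}{2}dt^{2} & 0 \\
0 & 1 & 0  & dt & 0 & \tfrac{1}{2}dt^{2} \\
0 & 0 & 1  & 0  & dt & 0 \\
0 & 0 & 0  & 1  & 0  & dt \\
0 & 0 & 0  & 0  & 1  & 0 \\
0 & 0 & 0  & 0  & 0  & 1
\end{bmatrix}
```

Each row applies the usual kinematic update, for example $d_x \leftarrow d_x + v_x\,dt + \tfrac{1}{2} a_x\,dt^{2}$. If time is measured in seconds, video at 29.97 frames/s corresponds to $dt = 1/29.97 \approx 0.0334$ s.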
The KF-predicted marker positions can be used in place of the 'ground truth' manually annotated locations traditionally used in aPCK. After each prediction step, the prediction is updated with a measurement of the current observation (the current marker position provided by the deep-net classifier), weighted by the calculated certainty of the current state and observation. The KF also returns the likelihood of each measurement, depending on its relationship to the probability distribution of the expected state. In addition, each deep-net returns a probability (p) that indicates the confidence level for each of the predicted anatomical locations (markers) at the identified location on the image (DLC-p). Here we propose to combine likelihoods from the KF with the predictions from the deep-nets to identify true positive (TP), false positive (FP), false negative (FN), and true negative (TN) detections and automatically evaluate performance metrics for the markerless approach. The details of this approach are explained below. Through manual visual inspection and testing, a negative log-likelihood (KF-LL) threshold of 20 was selected, above which observations were considered poor tracking. Similarly, we define good labelling as the deep-net classifier identifying a marker with a confidence level of ≥ 0.6.
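The predict/update cycle and the per-measurement likelihood can be sketched with plain NumPy. This is an illustrative constant-acceleration filter, not the study's pykalman configuration: the function name `kalman_neg_log_likelihood`, the noise levels `q` and `r`, the initial covariance, and the use of one frame as the time unit are all assumptions.

```python
import numpy as np

def kalman_neg_log_likelihood(xy, dt=1.0, q=1.0, r=4.0):
    """Filter one marker's track and score each measurement.

    xy: array (n_frames, 2) of deep-net marker positions in pixels.
    Returns an array (n_frames,) holding the negative log-likelihood of each
    measurement under the prediction made from the previous frames.
    """
    xy = np.asarray(xy, dtype=float)
    # State order: [d_x, d_y, v_x, v_y, a_x, a_y]
    F = np.eye(6)
    F[0, 2] = F[1, 3] = F[2, 4] = F[3, 5] = dt
    F[0, 4] = F[1, 5] = 0.5 * dt ** 2
    H = np.zeros((2, 6))
    H[0, 0] = H[1, 1] = 1.0              # only position is observed
    Q = q * np.eye(6)                    # process noise (assumed value)
    R = r * np.eye(2)                    # measurement noise (assumed value)
    x = np.zeros(6)
    x[:2] = xy[0]
    P = 100.0 * np.eye(6)                # broad initial uncertainty (assumed)
    nll = np.zeros(len(xy))
    for k, z in enumerate(xy):
        if k > 0:                        # predict from the previous state
            x = F @ x
            P = F @ P @ F.T + Q
        innovation = z - H @ x
        S = H @ P @ H.T + R              # innovation covariance
        S_inv = np.linalg.inv(S)
        nll[k] = 0.5 * (innovation @ S_inv @ innovation
                        + np.log(np.linalg.det(S)) + 2.0 * np.log(2.0 * np.pi))
        K = P @ H.T @ S_inv              # Kalman gain
        x = x + K @ innovation           # update with the measurement
        P = (np.eye(6) - K @ H) @ P
    return nll
```

A marker that jumps far from its smooth trajectory produces a large negative log-likelihood at that frame, which is the quantity compared against the KF-LL threshold. Note that the corrupted state also affects the frames that follow, a point taken up in the next paragraph.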
KF likelihoods above a certain threshold [54] are not sufficient on their own to define ground truth, because the Kalman filter log-likelihood at each data step is influenced by the state(s) from the previous data steps, which may themselves derive from incorrect deep-net tracking. Additional information about the correctness of labelled locations can be derived by assuming that large contiguous blocks of low KF-LL values are correct, and that points that are neighbors to correct values, with low KF-LL values from running the Kalman filter either forwards or backwards through the data, are also correct. Figure S2 and Table S2 present a segment of data to illustrate how this approach handles deep-net errors in two scenarios. A state-machine approach to implementing the logic presented above is indicated in pseudocode in Table 1. In our experience, a common situation where a FN can arise is when a body-part, in particular a hand or foot, is either entirely or partially hidden behind or under other body-parts. In this case, the deep-net has correctly identified the body-part location (the good tracking criterion is met), but with sufficient uncertainty to classify the body-part as "unidentified" (DLC-p < 0.6).
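A simplified version of the fusion rule can be sketched as below. It uses only the two thresholds stated in the text (KF-LL of 20 and DLC-p of 0.6) and omits the contiguous-block and forward/backward neighbour logic of Table 1. The TP and FN cases follow the description above; the mapping of the remaining two combinations to FP and TN is an assumption made for this example, as is the function name `classify_detections`.

```python
import numpy as np

def classify_detections(kf_nll, dlc_p, nll_threshold=20.0, p_threshold=0.6):
    """Label each frame of one marker as TP, FN, FP, or TN.

    kf_nll: Kalman filter negative log-likelihood per frame.
    dlc_p:  deep-net confidence per frame.
    """
    kf_nll = np.asarray(kf_nll, dtype=float)
    dlc_p = np.asarray(dlc_p, dtype=float)
    good_tracking = kf_nll <= nll_threshold   # above threshold = poor tracking
    confident = dlc_p >= p_threshold
    labels = np.empty(len(kf_nll), dtype="<U2")
    labels[good_tracking & confident] = "TP"
    labels[good_tracking & ~confident] = "FN"    # right place, low confidence
    labels[~good_tracking & confident] = "FP"    # assumed mapping
    labels[~good_tracking & ~confident] = "TN"   # assumed mapping
    return labels
```

Counting the four labels per marker or per video yields the confusion-matrix entries from which the performance measures are computed.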
This automatic performance evaluation approach allows the validation of larger markerless datasets. Here, we report the overall performance (Eq. 5) as the average of precision (positive predictive value (PPV) or selectivity; Eq. 3) and sensitivity (true positive rate (TPR); Eq. 4) in each fold or at each anatomical location. We also evaluate accuracy (ACC; Eq. 6) to simultaneously consider FPs and FNs to identify which body-parts are associated with the poorest identification outcomes across all anatomical locations.
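The four measures can be written out from their standard definitions as in the sketch below. The function name `performance_metrics` and the counts in the usage example are hypothetical, and the sketch assumes the denominators are non-zero.

```python
def performance_metrics(tp, fp, fn, tn):
    """Precision, sensitivity, their average, and accuracy from raw counts."""
    ppv = tp / (tp + fp)                    # precision (PPV)
    tpr = tp / (tp + fn)                    # sensitivity (TPR)
    overall = (ppv + tpr) / 2               # overall performance
    acc = (tp + tn) / (tp + fp + fn + tn)   # accuracy (ACC)
    return {"PPV": ppv, "TPR": tpr, "overall": overall, "ACC": acc}

# Hypothetical counts for one marker in one video.
example = performance_metrics(tp=80, fp=20, fn=0, tn=100)
```

In this made-up example the precision is 0.8 and the sensitivity is 1.0, so the overall performance is 0.9. The accuracy is also 0.9 because 180 of the 200 frames are classified correctly.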

Computing infrastructure
The deep-net models, namely resnet-152 and efficientnet-b6, were trained using the Cray CS400 cluster of the New Zealand eScience Infrastructure (NeSI) high-performance computing facilities [55]. The training process was executed using enhanced NVIDIA Tesla P100 GPUs with 12 GB of CoWoS HBM2 stacked memory and a bandwidth of 549 GB/s (for resnet on DLC version 2.1.10.2) and NVIDIA Tesla A100 PCIe GPUs with 40 GB of HBM2 stacked memory and a bandwidth of 1555 GB/s (for efficientnet on DLC 2.2.0.2). The larger size of the efficientnet model and the associated memory requirement meant that A100 GPUs were needed instead of the P100s. Intel Xeon Broadwell CPUs (E5-2695v4, 2.1 GHz) were used on the cluster for handling the GPU jobs. The algorithms were run under Python environments (Python 3.7.9, pykalman 0.9.5).

Fold-based versus location-based performance metric
The sixfold cross-validation of the deep-net classifiers (trained on ten videos from 5 infants, tested on two videos from a novel infant) demonstrated a successful overall performance of 95.52% (SD 2.43) for marker tracking across all out-of-domain infants (subject-based metrics) using the efficientnet model, while the resnet resulted in an overall performance of 94.94% (SD 2.63). To give a better understanding of the performance with unseen data, we also calculated performance for each anatomical location (Fig. 2). Panels A-B and C-D show the precision and sensitivity measures associated with each anatomical location for each fold from the efficientnet and resnet models, respectively. The reported overall performance metrics in this article are the average of the values (mean %, SD) calculated from twelve evaluations. Figure 3 demonstrates how the accuracy of the resnet and efficientnet models varies across anatomical locations. As these measures include the simultaneous impact of FPs and FNs, they help to specify which anatomical locations were more or less challenging for each deep-net classifier to identify. Results in Fig. 3 highlight that performance was poorest for tracking of hands and hips, while both models achieved their best tracking results for the left shoulder, right knee, and right foot. Tables S3 and S4, and Figure S3, provide further analysis of the performance of the deep-net models across subjects and locations. Overall performance of the resnet and efficientnet for the subject-based and location-based approaches is shown in Fig. 4.
The small performance variations for both the subject-based and location-based schemes in Fig. 4 indicate the generalization capability of the schemes. However, performance across subjects was dominated by one poorly performing outlier, fold6_v2 (see also Fig. 2), which is examined in the Discussion.
To further illustrate the performance of the deep-net classifiers, we visualized the precision (selectivity) and sensitivity measures for each of the classifiers across all folds for each video in each validation set. The boxplots in Fig. S4 compare the precision and sensitivity measures separately for each individual video (v1 or v2) of the unseen infant using the resnet (S4A and S4C) and the efficientnet (S4B and S4D), respectively. A similar visualization approach was also carried out across all body-parts to illustrate which anatomical locations were associated with higher/lower precisions and sensitivities using each of the deep-net models, separately. The boxplots in Fig. S5 of the supplementary material section demonstrate these measures for each individual anatomical location across all folds using the resnet (S5A and S5C) and the efficientnet (S5B and S5D), respectively. Similarly, Fig. S4 demonstrates fold-based measurements across all anatomical locations.
Examples of the algorithm's predicted locations (in the test sets) versus manual annotations by the observer (HA) are shown as crosses and dots, respectively, in Fig. 5 (magnified representations from the left arms).
The scattered precision-recall (precision vs sensitivity) plots in Fig. S6 demonstrate how well the proposed deep-net models can track markers in each novel video of the unseen infant using data across 12 body-parts (subject-based scheme: resnet: A, efficientnet: C) and in each anatomical location across all tested videos (location-based scheme: resnet: B, efficientnet: D). An ideal precision-recall datapoint would be situated at the upper right corner, which indicates higher precisions and sensitivities, associated with fewer FPs and FNs respectively.
Comprehensive numerical details of the resnet and efficientnet models' performance across all anatomical locations at each fold are presented in Tables S5-10 and S11-16 in the supporting information section, respectively. Figure S7 demonstrates sample trajectories extracted from the validation sets (unseen tested videos) using 200 frames per infant.

Discussion
This work proposes a novel markerless pose-estimation scheme, based on deep neural networks, for accurate motion tracking of infants' movements using 2D video recordings from standard handheld devices (e.g., iPad). Subject-based performance assessment demonstrates generalization and performance consistency across out-of-domain data while identifying 12 different anatomical locations, with overall cross-validated performances across 12 out-of-domain recordings from six infants of 95.52% (SD 2.43) and 94.94% (SD 2.63) for the efficientnet and resnet, respectively. The work further introduced a novel unsupervised performance evaluation strategy by combining Kalman filter likelihoods and deep-net probabilities to automatically measure performance metrics on larger out-of-domain datasets where manual labeling is challenging. Despite dealing with a relatively large validation set that includes ~67k images, the performance range from this work is consistent with the performance range in recent markerless motion tracking studies performed on substantially smaller validation sets [45].
Types of errors: Results indicate that despite postural variations in landmarks caused by rotational and/or fast movements in some body-parts, the proposed models have been able to extract and learn features related to each anatomical location in the 2D recordings of the training sets and later identify these locations correctly in the unseen data from novel infants (validation sets; see Tables S3 and S4). Further, validation of the models on data from novel infants was associated with consistently higher precision measures (corresponding to fewer FPs) compared to sensitivity measures (corresponding to more variable FN performance) across all folds. This result is important, mainly because it shows good generalization capability for the model obtained from a relatively small number of participants in the database, presumably aided by the relatively large number of frames used per participant [45].
Effect of location: The location-based performance for the deep-net models ranged between 84.35-99.24% and 81.27-99.51% for the efficientnet and resnet, respectively, across all anatomical locations (Tables S3 and S4). The lowest accuracy measures across both models were consistently associated with hands, which naturally exist in a much larger range of postures and speeds, and hips, which lack detailed, precisely located landmarks for the algorithm to identify (Fig. 3). Patterns that were present near the hips were prints on the nappies of the infants, which varied in location and style between individuals and so were not useful for learning. Conversely, the left shoulder, right knee, and right foot were consistently associated with the best tracking results.
The labeled hand poses in the training set from the five infants in each fold may not have fully covered the range of hand poses in the videos of the sixth infant in the validation sets. In that case, the algorithms would fail to identify a true observation due to lack of confidence (as they may not have seen a similar case before) and would therefore label the observation as a false negative. This scenario was seen in the results for fold6_v2, where the baby had his hands locked together throughout almost the entire recording, resulting in poorer TPR performance of 45.95% and 45.10% for the resnet and efficientnet, respectively (see also Tables S3 and S4). This hand posture had not been observed in the training set from the other five infants associated with this fold, who mostly kept their hands separated.
Effect of velocity: We further investigated possible effects of velocity in hands and feet on the overall tracking performance. The tracking performance for feet markers was high despite the feet's often varying dynamics (excess of movement and range) and speed (Tables S3 and S4). Normalized histograms of the velocity of hands and feet across all 12 videos from all participants are shown in Figure S8. The close similarity between hand and feet velocity distributions suggests that the lower tracking performance of the hands is associated with the hands' natural morphological variations rather than with velocity itself: feet have a limited range of pose variations compared to hands, such that foot poses sampled in the training sets covered the range of potential poses, whereas this was not true for hands.
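A normalized speed histogram of this kind can be obtained from a marker trajectory as sketched below. The function name `speed_histogram`, the use of frame-to-frame differences, the pixel-per-second units, and the number of bins are assumptions for illustration; the study's exact procedure is not specified here.

```python
import numpy as np

def speed_histogram(xy, fps=29.97, bins=20):
    """Normalized histogram of frame-to-frame marker speed.

    xy: array (n_frames, 2) of marker positions in pixels.
    Returns (fractions, bin_edges), where fractions sum to 1.
    """
    xy = np.asarray(xy, dtype=float)
    # Displacement between consecutive frames, scaled to pixels per second.
    speed = np.linalg.norm(np.diff(xy, axis=0), axis=1) * fps
    counts, edges = np.histogram(speed, bins=bins)
    return counts / counts.sum(), edges
```

Computing the histogram with the same bin edges for hand and foot markers allows the two distributions to be overlaid and compared directly.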
Fast movement exacerbates motion blur, where objects in the image move during the acquisition duration of individual frames. One mitigation of motion blur is to use video with a faster frame rate. Standard video frame rate is around 30 Hz, but newer consumer devices can record at 60 Hz or higher. The benefit for tracking accuracy of acquiring data at these higher rates is a useful avenue for future research.
Effect of camera angle: A novel camera angle critically affected accuracy in our sample. Video fold6_v1 was filmed from quite a tilted angle on the left side compared to the other babies in the training set. The accuracies for this video were 84.35% and 81.27% across all folds for the efficientnet and resnet, respectively. In contrast, the overall performances improve to 91.73% and 93.62%, respectively, for the other video of the same infant's movements from an overhead position (fold6_v2). In this overhead video, the true positive rate for the right hand still remains low due to the hands being locked together. For other infants, the proposed scheme was able to perform almost equally on the two videos despite smaller differences in viewing angle (e.g., fold 1 video 1: 98.04% vs video 2: 98.08% for the efficientnet, and fold 1 video 1: 98.49% vs video 2: 98.22% for the resnet).
Other factors that might have impaired body-part labelling include high-contrast shadows, which occurred in our images, infants' skin color, and variation in the patterns on the nappies. Pre-processing of the videos to center the babies in their recordings, consistent use of a white background (mattress sheet), consistent lighting, and use of a fixed camera (as opposed to holding by hand) may have assisted with minimizing other sources of variation. These variables may be less well controlled when videos are recorded at home or in clinical settings.
Findings from this work indicate the feasibility of the proposed automated markerless movement tracking for reliable identification of infants' movements in their 2D video recordings, for the purpose of the GMA. Our results overall support the premise that a rich training dataset covering the range of features expected in the novel data will enable a high level of generalization to new individuals. These results indicate that our approach is potentially suitable for use in a clinical platform for automating the GMA, which could be used alongside the current observational protocols to diagnose at-risk infants. Future work involves validating the movement tracking model on a large dataset of clinically recorded videos, including infants identified as high-risk, and extending it to develop infant GMA classifiers.

Fig. 1 Examples of the twelve manually labelled anatomical locations (colored dots) in six different infants in the DeepLabCut environment

Fig. 2 PPV and TPR measures across anatomical locations for each fold (unseen tested video) using efficientnet (a, b) and resnet (c, d)

Fig. 5 Examples of the automatic (+) vs manual (o) identifications in the left arms of 6 infants