1 Introduction

Injury to the developing fetal or infant brain (for example, due to hypoxic-ischemic encephalopathy, perinatal stroke, or infection) can cause severe impairments that lead to physical disability and cerebral palsy (CP) [1,2,3]. The quality of General Movements (GMs) in infants up to around 5 months of corrected age [4, 5] reflects the health of their neuromotor system. Early prediction of neurological disorders facilitates early intervention during critical periods of heightened neuroplasticity, which a growing body of evidence indicates improves clinical outcomes [6].

The General Movements Assessment (GMA) is highly predictive [7,8,9] and sensitive (98%) [10] for the risk of developing CP. The GMA is performed by at least two highly trained clinicians observing the infant awake, calm, and alert, via video viewed at normal or increased speed, to evaluate the spatial and temporal complexity and variation of their movements [4, 11, 12]. The need for trained assessors is one barrier to widespread adoption of the GMA in standard clinical care. Manual assessment is also time-consuming: assessors reviewing large numbers of videos require regular breaks to minimize fatigue-related error. Limited clinical resources typically mean that only infants meeting specific clinical criteria undergo an assessment of their GMs, so cases that do not meet the initial criteria may be missed. Hence, there is an urgent need for technological methods that can circumvent the bottleneck of the human assessor. Automated motion capture offers a low-cost, practical alternative to track and analyze anatomical movements effectively.

Automated neonatal GM tracking and analysis has been widely explored using various techniques [13], including wearable biotech [14, 15], 3D motion capture [16, 17], wearable accelerometers [18, 19], 3D RGB-D camera recordings [20], and conventional video recordings [21,22,23,24,25,26,27], including one approach based on a deep-learning markerless pose-estimation algorithm [28]. Most of these technologies require specialist wearable equipment, and the attachment of physical devices to the infant may alter their behavior and physiological responses [21]. 2D recordings from single-camera devices, such as mobile devices, web-cams, and baby monitors, provide lower spatial resolution than gold-standard 3D motion capture systems. However, the modest resolution required to capture movement patterns, together with the broad availability, reasonable price, and flexibility of use of single-camera devices (e.g., in home environments), makes them an ideal, clinically relevant tool for capturing infant motor behavior. The use of technology readily available in the home creates the possibility of parents collecting data at the appropriate developmental stages, with little or no training, at convenient times, with minimal disturbance of the infants’ natural physical and mental states. Robust automated analysis, comprising movement tracking and GM classification, such as that proposed here, is required to process the large volume of data produced.

Deep neural network architectures are potent tools for reliably automating challenging signal processing tasks such as classification, identification, and segmentation [29,30,31,32,33,34]. These architectures require large datasets for effective learning and robust performance. Dataset construction is even more challenging when it relies on manual annotation [35]. The application of deep-learning-based approaches is relatively new in the field of infant GMA, and only a few recent attempts have applied these techniques to movement classification [36,37,38]. Additionally, limited work has investigated the utility of deep learning and semi-supervised frameworks for motion detection in 2D videos for body parsing, pose estimation, and GMA prediction [39, 40]. To our knowledge, this is the first study to describe detailed analyses of automated marker placement accuracy.

This work proposes a robust deep neural network, developed in the DeepLabCut environment [41, 42], for automatic markerless motion tracking of infant movements in standard iPad video recordings, for the purpose of detecting GMs in the fidgety period. The approach’s generalization performance will be assessed using k-fold cross-validation across 12 video recordings from six infants.

2 Data acquisition

2.1 Ethics

All procedures in this study were approved by the Auckland Health Research Ethics Committee (AHREC 000146). Parents/caregivers were fully informed of the experimental purpose, filming procedure and methods, and provided informed written consent for their child's participation.

2.2 Clinical procedures

A cohort of six infants (four males and two females; mean height 62.66 (SD 2.56) cm, weight 6.98 (SD 0.24) kg) was assessed within the fidgety period, at a mean age of 121.3 (SD 20.4) days post-term corrected age (equal to 17.33 (SD 2.92) weeks) (Table S1). Infants were placed in a supine position, wearing only a nappy, on a white cotton mat. Two standard iPad minis (iPad mini 5, OS version 14.4.2, model MUQX2X/A, camera resolution: 8 MP; and iPad mini 4, software version: 13.4.1, model MNY32X/A, camera resolution: 8 MP, 1080p HD video) were set up on tripods on both sides of the infant. Infants were filmed while awake, undistracted, and producing spontaneous movements, for 3–5 min.

Two videos (resolution: 1920 × 1080 pixels, frequency: 29.97 frames/s) were captured from each of the six infants. Videos were post-processed in Adobe Premiere Pro 2021 to situate the infant in the center of the frame and crop any unneeded regions, and were then saved at their original quality. Twelve anatomical locations (3 per limb: shoulders, elbows, hands, hips, knees, and feet) were manually labeled by an expert (HA) in 200 frames from each video. The 200 frames per recording (2400 frames in total) were automatically extracted using k-means clustering implemented within DeepLabCut to maximize frame diversity. Examples of the manually labeled locations are shown with colored dots in Fig. 1.
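For orientation, the labelling step above can be expressed with the standard DeepLabCut Python API roughly as in the following sketch. The project name, experimenter initials, and video file paths are hypothetical placeholders, and exact argument defaults may vary between DLC versions.

```python
# Minimal sketch of the DeepLabCut labelling workflow described above.
# Project name, experimenter initials, and video paths are hypothetical placeholders.
import deeplabcut

config_path = deeplabcut.create_new_project(
    "infant-gma", "HA",
    ["videos/infant01_v1.mp4", "videos/infant01_v2.mp4"],  # add all 12 recordings here
    copy_videos=False,
)

# In config.yaml, list the 12 anatomical locations under `bodyparts:` (e.g. left_shoulder,
# left_elbow, left_hand, ..., right_foot) and set `numframes2pick: 200` per video.

# Automatic k-means frame extraction to maximize pose diversity, then manual labelling.
deeplabcut.extract_frames(config_path, mode="automatic", algo="kmeans", userfeedback=False)
deeplabcut.label_frames(config_path)  # opens the DLC GUI for expert annotation
```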

Fig. 1
figure 1

Examples of the twelve manually labelled anatomical locations (colored dots) in six different infants in the DeepLabCut environment

3 Methods and computational approach

3.1 Deep-learning approach

Several open-source deep-learning-based markerless motion-tracking toolboxes, such as DeepLabCut (DLC) [42] and OpenPose [43], now exist that can follow specified points across video frames.

Nath et al. previously raised the possibility of using such a platform for human infant motion tracking [41, 44]. Recent reports indicate the robust generalization capability of DLC in movement tracking of out-of-domain animal subjects (horses), with normalized errors close to those of the within-domain tested subjects [45].

Key features of DLC that underpin its high level of performance are (1) the use of deep resnets and efficientnets pretrained on the ImageNet object-recognition benchmark [46], and (2) deconvolutional layers, instead of a classification layer, for semantic segmentation [42]. Resnet-152 is a top-performing deep-net model created by Microsoft for the ImageNet challenge in 2016 [47]. These deep residual networks are designed to mimic a mechanism of certain biological neurons (pyramidal cells), improving learning by optimally skipping intermediary connections in a deep structure [48]. Transfer learning with the efficientnet architecture family [45], on the other hand, builds on a compound scaling method that uniformly scales the network’s depth, width, and resolution. This strategy has been shown to outperform the resnet models on out-of-domain ImageNet data [45]. A recent study suggests that efficientnet-b6 can achieve better accuracies than the other efficientnet classes (i.e., b0 to b5) [45]; hence we used this model in the current study. For simplicity, we will refer to the resnet-152 and efficientnet-b6 models as ‘resnet’ and ‘efficientnet’ from here on. Deconvolutional layers up-sample the visual features and generate spatial probability densities, which can later be used as evidence to locate a specific body part at a particular location.

In DLC, the deep-nets are iteratively fine-tuned by updating the weight parameters using the manually labeled data. DLC automatically assigns high probabilities to the labeled anatomical sites and low likelihoods to the rest of the image [42].

In this work, we first utilize the DLC environment to develop robust deep-learning models for markerless motion tracking in human infants’ video recordings (resnet training: DLC 2.1.10.2, TensorFlow 1.14, CUDA 11.3, Python 3.7.9; efficientnet training: DLC 2.2.0.2, TensorFlow 2.5.0, CUDA 11.4, Python 3.8.6; efficientnets became available in the later DLC version).
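As an illustration, one fold’s training run can be written with the standard DLC API roughly as below. This is a sketch under assumed settings: the config path and video names are hypothetical, and the 95%/5% split is set via `TrainingFraction` in config.yaml rather than as a function argument.

```python
# Sketch of one training fold in DeepLabCut (hypothetical config path; settings assumed).
import deeplabcut

config_path = "infant-gma-HA-2021/config.yaml"  # project containing the 10 training videos

# Backbone selection: resnet_152 (DLC 2.1.x) or efficientnet-b6 (DLC 2.2+),
# with imgaug augmentation; the 95%/5% train/test split is set in config.yaml.
deeplabcut.create_training_dataset(config_path, net_type="resnet_152",
                                   augmenter_type="imgaug")

deeplabcut.train_network(config_path, maxiters=1_030_000, saveiters=50_000)
deeplabcut.evaluate_network(config_path)  # reports train/test RMSE for the selected snapshot(s)

# Out-of-domain validation: analyze both videos of the withheld infant.
deeplabcut.analyze_videos(config_path,
                          ["videos/infant06_v1.mp4", "videos/infant06_v2.mp4"],
                          save_as_csv=True)
```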

3.2 k-fold cross-validation

Subject-based leave-one-out k-fold cross-validation [49] indicates whether a model generalizes equally well across all subjects and helps to identify whether there is significant variation in the dataset. The terms ‘testing’ and ‘validation’ are used somewhat interchangeably in the literature, which can cause confusion. In the following we adopt the conventions used by DLC: ‘testing’ refers to data withheld from training to assess performance and update training hyperparameters during the learning process, whereas ‘validation’ refers to a dataset withheld until the very end of the process for performance evaluation. In each training fold, data from one particular baby were left out for validation, and the two recordings associated with that baby were excluded from the training set. This strategy allows comparison and validation of the learning schedule for, and between, the resnet-152 and efficientnet-b6 models.
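The fold construction itself is straightforward; the following sketch (with hypothetical file naming) shows how both videos of each infant are withheld together to form the subject-based folds.

```python
# Sketch of the subject-based leave-one-out folds: both videos of one infant are
# withheld for validation, and the remaining ten videos form that fold's training set.
# File names are hypothetical.
videos = {f"infant{i:02d}": [f"videos/infant{i:02d}_v1.mp4",
                             f"videos/infant{i:02d}_v2.mp4"] for i in range(1, 7)}

folds = []
for held_out_subject, validation_videos in videos.items():
    training_videos = [v for subject, vids in videos.items()
                       if subject != held_out_subject for v in vids]
    folds.append({"validation_subject": held_out_subject,
                  "train": training_videos,          # 10 videos from 5 infants
                  "validation": validation_videos})  # 2 videos from the withheld infant
```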

A 95%-to-5% training/testing split was used to train the deep-nets, with the imgaug image augmentation strategy [50], in each training fold (round). The learning rate was initially set at 0.005 and decreased to 0.001 through a recommended multi-step updating regime. The cross-entropy loss converged during training of all six folds, with a sharp decay during the first 30k iterations (for both models), followed by a gradual further decay over 1030k and 700k iterations for the resnet and efficientnet, respectively; fewer iterations (700k) were performed for the efficientnet because it is a larger network. The root mean square errors (RMSE) between the labels identified by the deep-net classifier and the scorer’s manual identifications were evaluated at each snapshot throughout the training process across all six folds, giving an overall training error of 2.73 (SD 0.06) pixels at the 1030k iteration for the resnet and 7.77 (SD 0.11) pixels at the 700k iteration for the efficientnet. Training took 126 (SD 3.5) hours across the six folds on the high-performance machines detailed in the “Computing infrastructure” section. Test errors were higher, at 6.73 (SD 0.17) pixels and 9.59 (SD 0.32) pixels for the resnet and efficientnet, respectively. These findings are consistent with previously described capabilities of DLC [42]. Figure S1 shows the evaluated RMSEs (in pixels) across all six folds over the 1030k (resnet) and 700k (efficientnet) training iterations, confirming fast fine-tuning of the models across all folds. After each training fold, the performance of the model was evaluated (validated) on the two videos from the excluded infant (out-of-domain/unseen participant).
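For clarity, the per-snapshot pixel error described above can be computed as in the short sketch below; the array names and shapes are assumptions for illustration.

```python
# Sketch: RMSE (in pixels) between deep-net predictions and manual labels for one snapshot.
# Arrays of shape (n_frames, n_bodyparts, 2) holding x, y pixel coordinates are assumed.
import numpy as np

def rmse_pixels(predicted_xy: np.ndarray, manual_xy: np.ndarray) -> float:
    distances = np.linalg.norm(predicted_xy - manual_xy, axis=-1)  # Euclidean error per marker
    return float(np.sqrt(np.mean(distances ** 2)))
```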

3.3 Kalman filter for an aPCK-based performance measure

To quantify the performance of markerless pose estimation, the automatically determined marker positions need to be compared against labels determined in another way, usually manual labelling. For this comparison, the literature has established a trusted criterion called the average Percentage of Correct Keypoints (aPCK) [45, 51, 52]. The aPCK approach typically requires a human assessor to annotate a validation set (on top of the training/test set).

The aPCK then labels a predicted marker as ‘correctly detected’ if its location falls within a certain circular distance of the ground-truth manually annotated location. The distance is usually chosen as a fraction of an image-specific scaling factor such as the torso length. Because manual labelling of frames is time-consuming, high-precision work, this approach is challenging when the validation set contains a large number of frames. One mitigation is to validate against only a small number of additionally labelled frames. Here, we present an alternative, novel, automatic approach that allows evaluation against an entire video; in our case, this would be approximately equivalent to manually labelling ~ 800k datapoints across ~ 60k frames. Conceptually, our proposed strategy follows a similar approach to aPCK, but instead of requiring manual measures at each frame it uses a Kalman filter (KF), building on the expectation of smooth and continuous motion, to estimate the position of a marker at time t based on the state (position, velocity, and acceleration) of the point in the previous frame t − 1. The KF uses automatically determined marker positions in previous frames as noisy measurements to update joint probability distributions over its position, velocity, and acceleration state variables, from which it predicts a position in the current frame (Eq. 1) [53].

$$x_{k}=Fx_{k-1}+w_{k}$$
(1)

where the state vector \(x_{k}={\left[\begin{array}{cccccc}d_{x}& d_{y}& v_{x}& v_{y}& a_{x}& a_{y}\end{array}\right]}^{\top}\) comprises the x and y displacements, velocities, and accelerations, F is the transition matrix (Eq. 2), \(w_{k}\) is a noise term, and dt is the time interval between frames.

$$F=\left[\begin{array}{cccccc}1& 0& \mathrm{dt}& 0& \frac{{\mathrm{dt}}^{2}}{2}& 0\\ 0& 1& 0& \mathrm{dt}& 0& \frac{{\mathrm{dt}}^{2}}{2}\\ 0& 0& 1& 0& \mathrm{dt}& 0\\ 0& 0& 0& 1& 0& \mathrm{dt}\\ 0& 0& 0& 0& 1& 0\\ 0& 0& 0& 0& 0& 1\end{array}\right]$$
(2)

The KF predicted marker positions can be used in place of the ‘ground truth’ manually annotated locations traditionally used in aPCK.
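A minimal sketch of this constant-acceleration KF, using the pykalman package employed in this study, is shown below. The noise covariances and the input file name are illustrative assumptions, not the values used in this work.

```python
# Sketch of the constant-acceleration Kalman filter of Eqs. (1)-(2) applied to the
# deep-net trajectory of one marker. Noise covariances and the input file are assumed
# for illustration; only the x, y positions are observed.
import numpy as np
from pykalman import KalmanFilter

dt = 1.0 / 29.97  # inter-frame interval (s) at 29.97 frames/s

# State x_k = [dx, dy, vx, vy, ax, ay]; transition matrix F from Eq. (2).
F = np.array([[1, 0, dt, 0, dt**2 / 2, 0],
              [0, 1, 0, dt, 0, dt**2 / 2],
              [0, 0, 1, 0, dt, 0],
              [0, 0, 0, 1, 0, dt],
              [0, 0, 0, 0, 1, 0],
              [0, 0, 0, 0, 0, 1]])

# Observation matrix: the deep-net supplies noisy measurements of position only.
H = np.array([[1, 0, 0, 0, 0, 0],
              [0, 1, 0, 0, 0, 0]])

kf = KalmanFilter(transition_matrices=F,
                  observation_matrices=H,
                  transition_covariance=1e-2 * np.eye(6),  # assumed process noise
                  observation_covariance=4.0 * np.eye(2))  # assumed measurement noise (px^2)

marker_xy = np.load("right_hand_xy.npy")        # hypothetical (n_frames, 2) deep-net predictions
state_means, state_covs = kf.filter(marker_xy)  # filtered position/velocity/acceleration
# Per-frame measurement (log-)likelihoods, used for the KF-LL criterion below, follow from
# how well each new observation fits the predicted state distribution at that frame.
```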

After each prediction step, the prediction is updated with the current observation (the marker position provided by the deep-net classifier), weighted by the calculated certainty of the current state and observation. The KF also returns the likelihood of each measurement, depending on its relationship to the probability distribution of the expected state. Each deep-net, in turn, returns a probability (p) that indicates the confidence level for each predicted anatomical location (marker) at the identified position in the image (DLC-p). Here we propose to combine the likelihoods from the KF with the probabilities from the deep-nets to identify true positive (TP), false positive (FP), false negative (FN), and true negative (TN) detections and automatically evaluate performance metrics for the markerless approach. The details of this approach are explained below. Through manual visual inspection and testing, a negative log-likelihood (KF-LL) threshold of 20 was selected, above which observations were considered to reflect poor tracking. Similarly, we defined good labelling as the deep-net classifier identifying a marker with a confidence level of ≥ 0.6.

Observations from the KF with likelihoods above a certain threshold [54] are not by themselves suitable to define ground truth, because the Kalman filter log-likelihood at each data step is influenced by the state(s) from previous steps, which may themselves be derived from incorrect deep-net tracking. Additional information about the correctness of labelled locations can be derived by assuming that large contiguous blocks of low KF-LL values are correct, and that points neighboring correct values, with low KF-LL values when running the Kalman filter either forwards or backwards through the data, are also correct. Figure S2 and Table S2 present a segment of data to illustrate how this approach handles deep-net errors in two scenarios. A state-machine approach to implementing this logic is given as pseudocode in Table 1, and a simplified per-frame sketch follows the table. In our experience, a common situation in which a FN arises is when a body part, in particular a hand or foot, is entirely or partially hidden behind or under other body parts. In this case, the deep-net has correctly identified the body-part location (the good tracking criterion is met), but with sufficient uncertainty to classify the body part as “unidentified” (DLC-p < 0.6).

Table 1 State-machine pseudocode for confusion matrix measures
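Table 1 gives the full state machine. As an orienting, simplified sketch, one plausible per-frame reading of the two criteria defined above (ignoring the contiguous-block and forward/backward refinements) could be written as follows; it is not the exact logic of Table 1.

```python
# Simplified per-frame sketch of the confusion-matrix logic; the full state machine in
# Table 1 additionally uses contiguous blocks of low KF-LL and forward/backward KF passes.
def classify_detection(kf_negative_loglik: float, dlc_probability: float,
                       ll_threshold: float = 20.0, p_threshold: float = 0.6) -> str:
    good_tracking = kf_negative_loglik < ll_threshold  # consistent with smooth-motion model
    confident_label = dlc_probability >= p_threshold   # deep-net confidence (DLC-p)
    if good_tracking and confident_label:
        return "TP"  # correctly detected marker
    if good_tracking and not confident_label:
        return "FN"  # e.g., occluded hand/foot: plausible location, low deep-net confidence
    if not good_tracking and confident_label:
        return "FP"  # confident label inconsistent with the expected motion
    return "TN"
```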

This automatic performance evaluation approach allows the validation of larger markerless datasets. Here, we report the overall performance (Eq. 5) as the average of precision (positive predictive value (PPV) or selectivity – Eq. 3) and sensitivity (true positive rate (TPR) – Eq. 4) for each fold or each anatomical location. We also evaluate accuracy (ACC – Eq. 6), which accounts simultaneously for FPs and FNs, to identify which body parts are associated with the poorest identification outcomes across all anatomical locations. A direct implementation of these measures is sketched after Eq. (6).

$$\mathrm{Precision}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}$$
(3)
$$\mathrm{Sensitivity}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}$$
(4)
$$\mathrm{Overall}\,\mathrm{performance}=\mathrm{Average}\left(\mathrm{Precision},\mathrm{Sensitivity}\right)$$
(5)
$$\mathrm{ACC}=\frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN}}.$$
(6)

3.4 Computing infrastructure

The deep-net models, namely resnet-152 and efficientnet-b6, were trained on the New Zealand eScience Infrastructure (NeSI) high-performance computing facility’s Cray CS400 cluster [55]. Training used enhanced NVIDIA Tesla P100 GPUs with 12 GB of CoWoS HBM2 stacked memory (549 GB/s bandwidth) for the resnet (DLC version 2.1.10.2), and NVIDIA Tesla A100 PCIe GPUs with 40 GB of HBM2 stacked memory (1555 GB/s bandwidth) for the efficientnet (DLC 2.2.0.2). The larger size of the efficientnet model, and its associated memory requirements, necessitated the A100 GPUs rather than the P100s. Intel Xeon Broadwell CPUs (E5-2695v4, 2.1 GHz) on the cluster handled the GPU jobs. The algorithms were run in Python environments (Python 3.7.9, pykalman 0.9.5).

4 Results

4.1 Fold-based versus location-based performance metric

The sixfold cross-validation of the deep-net classifiers (trained on ten videos from five infants, validated on two videos from a novel infant) demonstrated a successful overall performance of 95.52% (SD 2.43) for marker tracking across all out-of-domain infants (subject-based metrics) using the efficientnet model, while the resnet achieved an overall performance of 94.94% (SD 2.63). To better understand performance on unseen data, we also calculated performance for each anatomical location (Fig. 2). Panels A-B and C-D show the precision and sensitivity measures associated with each anatomical location for each fold from the efficientnet and resnet models, respectively. The overall performance metrics reported in this article are the average of the values (mean %, SD) calculated from twelve evaluations.

Fig. 2
figure 2

PPV and TPR measures across anatomical locations for each fold (unseen tested video) using efficientnet (a, b) and resnet (c, d)

Figure 3 demonstrates how the accuracy of the resnet and efficientnet models varies across anatomical locations. As these measures include the simultaneous impact of FPs and FNs, they help to identify which anatomical locations were more or less challenging for each deep-net classifier. Results in Fig. 3 highlight that performance was poorest for tracking of the hands and hips, while both models achieved their best tracking results for the left shoulder, right knee, and right foot. Tables S3 and S4, and Figure S3, provide further analysis of the performance of the deep-net models across subjects and locations. Overall performance of the resnet and efficientnet for the subject-based and location-based approaches is shown in Fig. 4.

Fig. 3
figure 3

Location-based accuracy measures for the resnet-152 and efficientnet-b6

Fig. 4
figure 4

Overall accuracy of the subject-based and location-based schemes on the validation set. Green triangles represent mean values

The small performance variations for both the subject-based and location-based schemes in this figure indicate the generalization capability of the schemes. However, performance across subjects was dominated by one poorly performing outlier, fold6_v2 (see also Fig. 2), which is considered further in the Discussion.

To further illustrate the performance of the deep-net classifiers, we visualized the precision (selectivity) and sensitivity measures for each classifier across all folds for each video in each validation set. The boxplots in Fig. S4 compare the precision and sensitivity measures separately for each individual video (v1 or v2) of the unseen infant using the resnet (S4A and S4C) and the efficientnet (S4B and S4D), respectively; these fold-based measures aggregate across all anatomical locations. A similar visualization was carried out across all body parts to illustrate which anatomical locations were associated with higher or lower precision and sensitivity for each deep-net model. The boxplots in Fig. S5 of the supplementary material demonstrate these measures for each individual anatomical location across all folds using the resnet (S5A and S5C) and the efficientnet (S5B and S5D), respectively.

Examples of the algorithm's predicted locations (in the test sets) versus manual annotations by the observer (HA) are shown as crosses and dots in Fig. 5 (magnified views of the left arms).

Fig. 5
figure 5

Examples of the automatic (+) vs manual (o) identifications in the left arms of 6 infants

The precision-recall (precision vs. sensitivity) scatter plots in Fig. S6 demonstrate how well the proposed deep-net models track markers in each novel video of the unseen infant using data across the 12 body parts (subject-based scheme: resnet: A, efficientnet: C) and at each anatomical location across all tested videos (location-based scheme: resnet: B, efficientnet: D). An ideal precision-recall datapoint would sit in the upper right corner, indicating high precision and sensitivity, i.e., few FPs and FNs, respectively.

Comprehensive numerical details of the resnet and efficientnet models’ performance across all anatomical locations at each fold are represented in Tables S5–10 and S11–16 in the supporting information section, respectively. Figure S7 demonstrates sample trajectories extracted from the validation sets (unseen tested videos) using 200 frames per infant.

5 Discussion

This work proposes a novel markerless pose-estimation scheme, based on deep neural networks, for accurate motion tracking of infants’ movements using 2D video recordings from standard handheld devices (e.g., iPad). Subject-based performance assessment demonstrates generalization and performance consistency across out-of-domain data while identifying 12 different anatomical locations, with overall cross-validated performances across 12 out-of-domain recordings from six infants of 95.52% (SD 2.43) and 94.94% (SD 2.63) for the efficientnet and resnet, respectively. The work further introduced a novel unsupervised performance evaluation strategy that combines Kalman filter likelihoods and deep-net probabilities to automatically measure performance metrics on larger out-of-domain datasets where manual labeling is challenging. Despite dealing with a relatively large validation set of ~ 67k images, the performance range from this work resonates well with that of recent markerless motion tracking studies performed on substantially smaller validation sets [45].

Types of errors Results indicate that despite postural variations in landmarks caused by rotational and/or fast movements in some body-parts, the proposed models have been able to extract and learn features related to each anatomical location in the 2D recordings of the training sets and later identify these locations correctly in the unseen data from novel infants (validation sets; see Tables S3 and S4). Further, validation of the models on data from novel infants was associated with consistently higher precision measures (corresponding to fewer FPs) compared to sensitivity measures (corresponding to more variable FN performance) across all folds. This result is important, mainly because it shows good generalization capability for the model obtained from a relatively small number of participants in the database, presumably aided by the relatively large number of frames used per participant [45].

Effect of location The location-based performance of the deep-net models across all anatomical locations ranged between 84.35% and 99.24% for the efficientnet and between 81.27% and 99.51% for the resnet (Tables S3 and S4). The lowest accuracy measures for both models were consistently associated with the hands, which naturally occur in a much larger range of postures and speeds, and the hips, which lack detailed, precisely located landmarks for the algorithm to identify (Fig. 3). The patterns present near the hips were prints on the infants’ nappies, which varied in location and style between individuals and so were not useful for learning. Conversely, the left shoulder, right knee, and right foot were consistently associated with the best tracking results.

The labeled hand poses in the training set from the five infants in each fold may not have fully covered the range of hand poses in the videos of the sixth infant in the validation sets. In that case, the algorithms would fail to identify a true observation due to lack of confidence (as they may not have seen a similar case before) and would therefore label the observation as a false negative. This scenario was seen in the results for fold6_v2, where the baby had his hands locked into each other almost throughout the entire recording, resulting in poorer TPR performance of 45.95% and 45.10% for the resnet and efficientnet, respectively (see also Tables S3 and S4). This hand posture had not been observed in the training set from the other five infants associated with this fold, who mostly kept their hands separated.

Effect of velocity We further investigated possible effects of hand and foot velocity on the overall tracking performance. The tracking performance for the feet markers was high despite the feet’s often varying dynamics (excess of movement and range) and speed (Tables S3 and S4). Normalized histograms of the velocity of the hands and feet across all 12 videos from all participants are shown in Figure S8. The close similarity between the hand and foot velocity distributions suggests that the lower tracking performance for the hands is associated with the hands’ natural morphological variation rather than with velocity itself: feet have a limited range of pose variations compared to hands, such that the foot poses sampled in the training sets covered the range of potential poses, whereas this was not true for hands.

Fast movement exacerbates motion blur, where objects in the image move during the acquisition duration of individual frames. One mitigation of motion blur is to use video with a faster frame rate. Standard video frame rate is around 30 Hz, but newer consumer devices can record at 60 Hz or higher. The benefit for tracking accuracy of acquiring data at these higher rates is a useful avenue for future research.

Effect of camera angle A novel camera angle critically affected accuracy in our sample. Video fold6_v1 was filmed from quite a tilted angle on the left side compared to the recordings of the other babies in the training set. The overall accuracies for this video were 84.35% and 81.27% for the efficientnet and resnet, respectively. In contrast, the overall performances improved to 91.73% and 93.62%, respectively, for the other video of the same infant’s movements, filmed from an overhead position (fold6_v2). In this overhead video, the true positive rate for the right hand still remains low due to the hands being locked into each other. For the other infants, the proposed scheme performed almost equally on the two videos despite smaller differences in viewing angle (e.g., fold 1 video 1: 98.04% vs video 2: 98.08% for the efficientnet, and fold 1 video 1: 98.49% vs video 2: 98.22% for the resnet).

Other factors that might have impaired body-part labelling include high-contrast shadows (which occurred in our images), the infants’ skin color, and variation of the patterns on the nappies. Pre-processing the videos to center the babies in their recordings, consistent use of a white background (mattress sheet), consistent lighting, and use of a fixed camera (as opposed to hand-held recording) may have helped to minimize other sources of variation. These variables may be less well controlled when videos are recorded at home or in clinical settings.

Findings from this work indicate the feasibility of the proposed automated markerless movement tracking for reliable identification of infants’ movements in 2D video recordings, for the purpose of the GMA. Our results overall support the premise that a rich training dataset covering the range of features expected in the novel data will enable a high level of generalization to new individuals. These results indicate that our approach is potentially suitable for use in a clinical platform for automating the GMA, which could be used alongside current observational protocols to identify at-risk infants. Future work involves validating the movement-tracking model on a large dataset of clinically recorded videos, including infants identified as high-risk, and extending it to the development of infant GMA classifiers.