1 Introduction

Preterm birth, occurring before 37 weeks of gestational age, is a risk factor for developmental disabilities such as behavioral difficulties, cognitive impairments, or cerebral palsy (CP) [29]. In many cases, these developmental disabilities cannot be identified before two years of age [29]. However, early identification of preterm infants with abnormal developmental trajectories is critical for initiating early developmental interventions that may prevent such disabilities [36].

The General Movement Assessment (GMA) analyses the complexity, variability, and fluidity of preterm spontaneous movements using video recordings [9]. The GMA is a reliable assessment of brain maturation and is able to identify preterm infants with abnormal developmental trajectories [26]. However, this method remains subjective and time-consuming, which motivates the need for automated GMA, especially given the continuous progress in computer vision and artificial intelligence tools. To achieve this goal, a primary task is to track preterm infant movements and estimate their poses in three-dimensional space using a markerless tool adapted to this specific population.

Thanks to the advances in deep learning, many reliable neural models are available for automatic 2D and 3D human pose estimation [6, 8, 38]. However, as these models are trained on general-purpose datasets containing mainly adults [2, 20], they perform poorly on infants, who are anatomically different. To address this problem, researchers retrained 2D neural networks either on synthetic datasets [16] or on datasets collected from the internet [7]. In both cases, the training data was not representative enough of the target population of preterm infants hospitalized in the Neonatal Unit. Moreover, 2D movement analysis remains limited since it does not capture the full spatial information of infant movements, and the information loss due to dimensionality reduction can reach 53% [3]. On the other hand, 3D analysis is advantageous since it provides a complete analysis of infant movements, which allows an objective classification of these movements as normal or abnormal according to their complexity, variability, and fluidity. The purpose of this work was first to introduce a real and representative dataset of preterm infants, and second to propose a markerless, accurate, and safe 3D pose estimation framework to automatically assess preterm infants’ spontaneous movements.

The main contributions of this work are (1) to propose the first framework for 3D infant pose estimation from stereoscopic images dedicated to preterm infants hospitalized in the Neonatal Unit; (2) to create a fully annotated dataset of real preterm infant images with a clinical protocol that follows the guidelines for the GMA [9]; (3) to provide models for the main state-of-the-art convolutional neural network (CNN) architectures specifically retrained with the new dataset; (4) to perform an evaluation of these models for 2D and 3D infant pose estimation, providing a new baseline in this field.

2 Related work

2.1 2D pose estimation for infants

During the last two decades, a tremendous effort was made toward developing automatic tools to assess the general movements of preterm infants. In the beginning, many researchers tested motion sensors, including accelerometers [12, 13, 25] or electromagnetic tracking systems (EMTS) [17, 18] to track infants’ movements.

The advances in computer vision techniques encouraged other groups to use visual sensors. Some researchers opted for marker-based approaches [4, 23], which can be accurate but require a specific setup that may not be practical. Markerless approaches, on the other hand, have become the state of the art for infant pose estimation, owing to their cost-effectiveness and to the continuous improvement of computer vision techniques.

Adde et al. [1] used motion images, computed as the difference between two consecutive frames, in which each pixel takes a value of 0 (black, no movement) or 1 (white, movement). These motion images were used to compute motiongrams, motion quantities, and motion centroids. A step further was taken with optical flow algorithms: Stahl et al. [37] extracted motion using optical flow, then extracted features using wavelet and frequency analysis to classify infants’ fidgety movements. Many others opted for large displacement optical flow (LDOF) [27, 30, 31] to track the movements and obtain their velocities. These methods have many limitations, especially under occlusions and illumination changes.
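As a minimal illustration of the motion-image idea (our sketch, not the actual implementation of [1]), a binary motion image can be computed by frame differencing with OpenCV; the threshold value here is an assumption chosen for the example:

```python
import cv2
import numpy as np

def motion_image(prev_frame, curr_frame, threshold=15):
    """Binary motion image: 1 (white) where a pixel changed between
    two consecutive frames, 0 (black) where it did not.
    The threshold is illustrative, not taken from [1]."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_frame, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(curr_gray, prev_gray)
    return (diff > threshold).astype(np.uint8)

def motion_quantity(mimg):
    """Fraction of pixels that moved, one of the simple descriptors
    derived from motion images."""
    return float(mimg.mean())
```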

In the last decade, convolutional neural networks have revolutionized the task of pose estimation for adults and achieved very promising results on the state-of-the-art datasets (e.g., COCO [20] or MPII [2]). This inspired many works on infant pose estimation [11, 22, 24, 32, 33], all of which are based on the OpenPose architecture [6], which was trained on datasets containing only images of adults, who are structurally and anatomically different from infants [34].

To address this problem, Chambers et al. [7] retrained the OpenPose network with their own labeled infant pose dataset of 9039 images, collected partially from YouTube videos and partially from clinical data, which reduced the mean error by 60%. Beyond the facts that OpenPose lacks sufficient scaling of network depth and is computationally inefficient [14], the ground-truth body joint annotation process using Vatic [7] was not very accurate and missed many occluded and non-visible joints, as can be seen in their publicly available dataset.

The scarcity of open-source infant pose images, mainly due to privacy concerns, is a problem that many researchers have tried to address. Hesse et al. [15] introduced a synthetic infant dataset, called MINI-RGBD (Moving INfants In RGB-D), created from only 12 RGB-D videos of moving infants using texture mapping onto a 3D model mesh called SMIL (Skinned Multi-Infant Linear model). Because of both its simplicity and its exclusively synthetic nature, pose estimation models trained on MINI-RGBD do not generalize well to real-world infant images [16]. This is why Huang et al. [16] created a hybrid synthetic and real infant pose dataset (SyRIP) from real images collected from YouTube and Google Images. They used an approach similar to SMIL, fitting the 3D model mesh to a real infant image and generating synthetic images with varying backgrounds, texture maps, lighting, and camera positions, resulting in 700 real and 1000 synthetic infant images. This small dataset was used to train an architecture (FiDIP) containing a domain classifier that encourages the feature extractor to retain the ability to extract keypoint information while ignoring the differences between real and synthetic input images [16].

The group tested the effect of retraining three existing models as backbone networks on their SyRIP dataset in different ways, but we found questionable both the small number of images used for testing (100 and 500) and the use of the mean average precision (mAP) evaluation metric with OKS thresholds [20], which was designed specifically for the COCO dataset (explained in Section 3.3). In this study, we provide a PCK-based evaluation of this model using our dataset of real images captured in a clinical environment.

2.2 3D pose estimation for infants

For an accurate infant movement assessment and to achieve a complete analysis of infant movements in space, 3D infant pose estimation is an essential task.

Meinecke et al. [23] were the first to address this approach, using the Vicon 370 motion analysis system, consisting of 7 infrared cameras on tripods and many reflective markers placed on different body joints. Even though a very good spatial precision of 2 mm was achieved, the setup remains too cumbersome to be practical for clinical use.

Other researchers tried to obtain a 3D keypoint representation using RGB images combined with depth images from depth cameras. Wu et al. [39] used a Kinect to capture color and depth videos of infant movements, then used OpenPose [6] to estimate 2D infant poses from the RGB images; the 3D coordinates of the infant’s joints were obtained by combining these with the corresponding depth images. Li et al. [19] used exactly the same approach, with the same pose estimation network and image acquisition protocol, but corrected the depth information to solve the problem of matching between RGB and depth images.

Although these approaches are simpler and faster than multi-camera motion capture systems, they may fail on complex infant poses with difficult viewpoints or obstructive gestures [40]. Moreover, the depth values captured by the Kinect represent body surface information only [39], so the depth image assigns the same depth value to two joints when one is occluded by the other. Finally, the use of infrared light-emitting cameras may have a health impact on infants [21].

The only work to our knowledge that presented a stereoscopic image acquisition protocol for infant pose estimation is that of Shivakumar et al. [35]. They used two stereoscopic cameras, one placed right above the baby and the other positioned on the baby’s right side. To obtain depth images, the team used a series of calibration, rectification, and matching operations to compute disparity maps and then depth maps.

Since the group used only a tracking algorithm, infants were dressed in a blue onesie to facilitate the tracking of the torso center using a segmented blue color mask. Optical flow was used for tracking the limbs, and a manual initialization was needed: the user clicked on a point within each limb region, and a marker was then set using their marker identification method.

3 Proposed 3D pose estimation framework

We propose a 3D pose estimation framework based on stereoscopic imaging (see Fig. 1). Stereoscopic video sequences of the infants were recorded with a stereoscopic camera located above the infant. Each pair of frames was subsequently rectified and processed by a 2D pose estimation model, giving 17 keypoints per image located at joint positions. A stereo triangulation was then performed to recover the 3D position of the joints. The output of the system is thus a temporal sequence of 3D joint positions.

Fig. 1. 3D infant pose estimation framework based on stereoscopic imaging, 2D pose estimation, and 3D reconstruction using triangulation

The ZED 2 stereoscopic camera (Stereolabs, San Francisco, CA) was chosen for its simplicity, its reliability, and the fact that it does not affect the infant’s health. The originality of our framework is to perform a joint-based 3D reconstruction. Instead of reconstructing the depth of every point of the scene using a stereo association algorithm, we only reconstruct the depth of the body joints, obtained separately on the two rectified views, by direct triangulation. This reconstruction is much more accurate, since it can infer the true depth of the joints even when they are not directly visible in the images. By comparison, classical association-based algorithms reconstruct the depth of points located on the surface of the body [39] and are not able to recover the true joint positions. More precisely, once the stereo system has been calibrated, the depth is recovered by direct triangulation (Fig. 2):

$$\begin{aligned} Z = \frac{Bf}{D} \end{aligned}$$
(1)

where B and f are the camera baseline and focal length, respectively, and D is the point disparity.
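As a minimal sketch (our illustration, not the authors’ released code), the joint-based reconstruction can be written as follows, assuming a calibrated and rectified stereo pair with equal horizontal and vertical focal lengths; all variable names are ours:

```python
import numpy as np

def triangulate_joints(kpts_left, kpts_right, f, cx, cy, baseline):
    """Recover 3D joint positions from 2D keypoints detected on a
    rectified stereo pair, using eq. (1): Z = B*f / D.
    kpts_left, kpts_right: (17, 2) arrays of (u, v) pixel coordinates.
    Assumes horizontal rectification, so the disparity is the
    difference of the u coordinates (non-zero for finite depth)."""
    disparity = kpts_left[:, 0] - kpts_right[:, 0]   # D, in pixels
    Z = baseline * f / disparity                     # depth, eq. (1)
    X = (kpts_left[:, 0] - cx) * Z / f               # back-project u
    Y = (kpts_left[:, 1] - cy) * Z / f               # back-project v
    return np.stack([X, Y, Z], axis=1)               # (17, 3) joint positions
```

Because the 2D model predicts a location for every joint on both views, this triangulation yields a depth even for joints whose surface is occluded, unlike dense association-based depth maps.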

This framework strongly relies on the precise estimation of the 2D location of the body joints, which depends on two key components: (1) the dataset used for training the model and (2) the deep learning model itself.

3.1 Dataset

The AGMA dataset was built by recording real images of 53 premature infants enrolled in the AGMA study, born between October 2020 and June 2021 (see Fig. 3). All infants born before 33 weeks of gestational age (GA) and hospitalized in the Neonatology department of the Centre Hospitalier Universitaire de Saint-Étienne, France, were included. Exclusion criteria were the ongoing presence of ventilatory support, contraindication to a radiant heat warmer, and absence of written parental consent. The features of the included children are summarized in Table 1.

Fig. 2. Stereo vision diagram using an aligned pair of cameras

Fig. 3. A snippet of AGMA dataset images collected in a clinical environment

The video recording protocol was controlled according to Prechtl’s method of GMA [9]. Infants wearing a diaper were placed in a radiant heat warmer in a supine position. The room light illuminance was controlled between 60 and 120 lux. Videos were recorded for one hour using the ZED 2 (Stereolabs, San Francisco, CA) stereoscopic camera positioned perpendicularly to the warmer, with frames of 1280 × 720 resolution at 30 FPS.

Each included infant was recorded one to three times, with a period \(\ge 7\) days between two consecutive recordings. The dataset contains 44,250 stereoscopic image pairs (88,500 images in total) from a selection of 175 videos of 53 infants acquired between 32 and 41 weeks of GA. All images were manually annotated and reviewed. The keypoint annotation process followed the COCO keypoint detection task, using 17 points forming the infant skeleton with a bounding box around them [20]. The annotation was performed manually by a group of annotators specifically trained for this task. They were given a composite video containing the two rectified left and right views, and the task consisted of successively pointing to each skeleton joint on both views for each video frame. The consistency of the annotation across time was subsequently checked for each video. Finally, the 3D annotation was obtained by triangulation using (1). The AGMA dataset was divided into three independent sets for training, validation, and testing, each containing distinct subjects with various poses, skin colors, and ages. The training set is composed of 156 stereoscopic videos of 5 seconds duration (46,800 frame pairs), the validation set of 13 videos of 5 seconds duration (4500 frame pairs), and the test set of 10 one-minute videos (36,000 frame pairs).

It should be noted that the AGMA dataset cannot be directly compared to publicly available state-of-the-art datasets such as SyRIP and MINI-RGBD, as our objective was to capture images following Prechtl’s method within a clinical setting. The MINI-RGBD dataset comprises images derived from 12 entirely synthetic videos featuring simplistic infant poses [16], whereas SyRIP is a partially synthetic dataset that does not accurately represent the context and poses encountered in the GMA method (see Fig. 4). Thus, the AGMA dataset stands as the first dataset captured within a genuine clinical environment, adhering to a controlled protocol that aligns with Prechtl’s method of GMA [9]. It specifically concerns an infant population aged between 32 and 41 weeks of gestational age, which is very hard to record due to medical constraints and practical considerations, making AGMA unique and the first of its kind. Table 2 summarizes the characteristics of the three datasets.

Table 1 Dataset population features
Fig. 4. (a) A snippet of SyRIP real images; (b) a snippet of MINI-RGBD images

3.1.1 Ethics

The AGMA study (IDRCB 2020-A03335-34) was approved by the Comité de Protection des Personnes - Sud-Est II Ethical Committee in February 2021. Written parental consent was obtained from each participant. The study was conducted in accordance with international ethical standards and the Declaration of Helsinki.

3.2 2D pose estimation

For the 2D pose estimation task, three deep neural architectures, originally created for adult human pose estimation and based on the High-Resolution Network (HRNet) [38], were studied and evaluated: HRNet itself, HigherHRNet, and DarkPose. These networks were chosen because they are still top-ranked in various human pose estimation challenges and because they use parallel networks of different resolutions instead of traditional in-series high-to-low networks.

Table 2 Comparison of AGMA dataset with other publicly available datasets
Fig. 5. 2D pose estimation network architectures: (a) HRNet [38]; (b) HigherHRNet [8] (HRNet + deconvolution module); (c) DarkPose [41] (HRNet + heatmap modulation and maximum re-localization)

3.2.1 HRNet

HRNet was introduced by Sun et al. [38] for different tasks, including human pose estimation, segmentation, and object detection. It became a state-of-the-art top-down pose estimation network thanks to its strategy of using parallel networks of different resolutions instead of traditional in-series high-to-low networks (Fig. 5). The network computes a high-resolution sub-network in parallel with lower-resolution sub-networks. The sub-networks are then fused through fuse layers such that each of the high-to-low resolution representations repeatedly receives information from the other parallel representations, leading to rich high-resolution representations. Maintaining high-resolution representations throughout the entire network makes the architecture suitable for infant pose estimation.
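The following PyTorch sketch illustrates a single HRNet-style fusion step between two branches. It is a simplified toy version under our own assumptions (two resolutions only, x_low at exactly half the spatial resolution of x_high); the actual network fuses up to four branches with repeated exchange units:

```python
import torch.nn as nn
import torch.nn.functional as F

class FuseTwoBranches(nn.Module):
    """Toy HRNet-style exchange unit between a high-resolution and a
    low-resolution branch: each branch receives information from the
    other resolution before processing continues."""
    def __init__(self, c_high=32, c_low=64):
        super().__init__()
        # low -> high: 1x1 conv to match channels, then bilinear upsampling
        self.low_to_high = nn.Conv2d(c_low, c_high, kernel_size=1)
        # high -> low: strided 3x3 conv halves the spatial resolution
        self.high_to_low = nn.Conv2d(c_high, c_low, kernel_size=3,
                                     stride=2, padding=1)

    def forward(self, x_high, x_low):
        up = F.interpolate(self.low_to_high(x_low),
                           size=x_high.shape[-2:],
                           mode="bilinear", align_corners=False)
        down = self.high_to_low(x_high)
        return x_high + up, x_low + down
```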

For this architecture, Sun et al. [38] created four different networks: HRNet-W32 and HRNet-W48, each with input sizes of 256 × 192 and 384 × 288. These networks were initialized with the weights of models pre-trained on the ImageNet dataset and then trained on the COCO train2017 dataset, which includes 57K images and 150K person instances.

3.2.2 HigherHRNet

HigherHRNet [8] is built on HRNet as a backbone, adding a deconvolution module to predict heatmaps at multiple and higher resolutions (Fig. 5). It uses bilinear interpolation to upsample all the predicted heatmaps to the resolution of the input image and averages the heatmaps from all scales for the final prediction, which can be beneficial for small-scale subjects in the image.
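A minimal sketch of this final aggregation step (our illustration; the function name and arguments are ours):

```python
import torch.nn.functional as F

def aggregate_heatmaps(heatmaps_per_scale, input_size):
    """Upsample the heatmaps predicted at every scale to the input
    resolution with bilinear interpolation, then average them.
    heatmaps_per_scale: list of (B, K, H_s, W_s) tensors."""
    upsampled = [F.interpolate(h, size=input_size,
                               mode="bilinear", align_corners=False)
                 for h in heatmaps_per_scale]
    return sum(upsampled) / len(upsampled)
```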

3.2.3 DarkPose

The added value of this architecture is that it can be used as a plugin that improves the performance of state-of-the-art human pose estimation models. By taking into account the heatmap decoding process, it improves the AP of the HRNet architecture by at least 3.8% on the COCO validation set [41]. This is achieved by a series of heatmap modulations, since heatmaps predicted by a human pose estimation model often lack the clean Gaussian structure of the training heatmaps. Then, instead of using the standard coordinate decoding method of (2) to predict joint locations, DarkPose performs a maximum re-localization based on a Taylor expansion to find the maximal activation of the modulated heatmap, as in (3), and finishes with resolution recovery operations (Fig. 5).

$$\begin{aligned} p = m + 0.25\frac{s-m}{\left\| s-m \right\| _{2}} \end{aligned}$$
(2)

where m and s are the coordinates of the maximal and second maximal predicted heatmap activations, respectively

$$\begin{aligned} \mu =m - ({D}''(m))^{-1}{D}'(m) \end{aligned}$$
(3)

where \(D''\) and \(D'\) are the second and first derivatives of the modulated heatmap, and m is the coordinate of the maximal activation of the predicted heatmap
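To make the two decoding schemes concrete, here is a NumPy sketch of eq. (2) and of the Taylor-based re-localization of eq. (3). This is our simplified illustration, not the official DarkPose implementation: the modulation is approximated by a Gaussian blur of the heatmap, the sigma value is an assumption, and the maximum is assumed to lie away from the heatmap border:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def decode_standard(heatmap):
    """Eq. (2): shift the maximum m a quarter of the way toward the
    second maximum s. Returns (x, y) sub-pixel coordinates."""
    idx = heatmap.argmax()
    y, x = np.unravel_index(idx, heatmap.shape)
    m = np.array([x, y], dtype=float)
    flat = heatmap.flatten().copy()
    flat[idx] = -np.inf
    y2, x2 = np.unravel_index(flat.argmax(), heatmap.shape)
    s = np.array([x2, y2], dtype=float)
    return m + 0.25 * (s - m) / (np.linalg.norm(s - m) + 1e-9)

def decode_dark(heatmap, sigma=2.0):
    """Eq. (3): Gaussian modulation, then one Newton step
    mu = m - D''(m)^-1 D'(m) with finite-difference derivatives."""
    h = gaussian_filter(heatmap, sigma)            # modulation (smoothing)
    h = np.log(np.maximum(h, 1e-10))               # Gaussian -> quadratic
    y, x = np.unravel_index(h.argmax(), h.shape)
    dx = 0.5 * (h[y, x + 1] - h[y, x - 1])         # first derivatives D'
    dy = 0.5 * (h[y + 1, x] - h[y - 1, x])
    dxx = h[y, x + 1] - 2 * h[y, x] + h[y, x - 1]  # second derivatives D''
    dyy = h[y + 1, x] - 2 * h[y, x] + h[y - 1, x]
    dxy = 0.25 * (h[y + 1, x + 1] - h[y + 1, x - 1]
                  - h[y - 1, x + 1] + h[y - 1, x - 1])
    hessian = np.array([[dxx, dxy], [dxy, dyy]])
    grad = np.array([dx, dy])
    offset = -np.linalg.solve(hessian, grad)       # assumes invertible Hessian
    return np.array([x, y], dtype=float) + offset
```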

3.3 Evaluation metrics

Even though our dataset ground-truth annotations were performed according to the COCO keypoint detection task [20], its associated evaluation metrics (AP, AP50, AP75) were not used, since COCO AP is calculated using the Object Keypoint Similarity (OKS) metric. As shown in (4), OKS uses per-keypoint constants that are calculated only from images in the COCO validation set [10] and therefore cannot be representative of other kinds of datasets.

$$\begin{aligned} OKS = \frac{\sum _{i} \exp \left( -\frac{d_{i}^{2}}{2s^{2}k_{i}^{2}}\right) \delta (\upsilon _{i}> 0)}{\sum _{i} \delta (\upsilon _{i}> 0) } \end{aligned}$$
(4)

where \(d_{i}\) is the Euclidean distance between the detected keypoint and the corresponding ground truth, \(\upsilon _{i}\) is the visibility flag of the ground truth, s is the object scale, and k is a per-keypoint constant that controls falloff.

The Percentage of Correct Keypoints (PCK) is more suitable for evaluating infants’ 2D pose estimation. It counts a joint as correctly detected if the distance between the predicted joint location and the ground truth is smaller than a threshold defined as a fraction of the torso size. In this work, the torso height (the distance between the right hip and the right shoulder) was used instead of the torso diagonal size (the distance between the right hip and the left shoulder), since the torso diagonal size can vary depending on the infant’s pose, as can be seen in Fig. 7. The notations PCK@0.2, PCK@0.1, and PCK@0.05 refer to the fractions of torso height used as thresholds: 0.2, 0.1, and 0.05 respectively (5). For 3D pose evaluation, 3DPCK was used: a keypoint is counted as correct if it falls within a given distance from the ground truth (1 cm, 2.5 cm, 5 cm, or 10 cm). These thresholds show how accurate the keypoint estimates are at different levels. For clinical usage, and considering that the average infant height is 50 centimeters, a 2.5 cm error represents 5% of the body size, which is acceptable, whereas errors exceeding this value (5 cm, 10 cm) are considered too large. On the other hand, errors below 1 cm are easily tolerated for this use case.

$$\begin{aligned} PCK@X = \frac{\sum _{i=1}^{N} \delta (d_{i} \le X\times T)}{N} \end{aligned}$$
(5)

where \(d_{i}\) is the Euclidean distance between the detected keypoint and the corresponding ground truth, X is the threshold fraction, T is the torso height, and N is the total number of keypoints.
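Both metrics reduce to a thresholded mean over per-keypoint Euclidean distances, as in this minimal sketch (our illustration; function and argument names are ours):

```python
import numpy as np

def pck_2d(pred, gt, torso_height, fraction=0.1):
    """PCK@X, eq. (5): share of keypoints whose distance to the
    ground truth is at most X times the torso height.
    pred, gt: (N, 2) arrays in pixels."""
    d = np.linalg.norm(pred - gt, axis=1)
    return float((d <= fraction * torso_height).mean())

def pck_3d(pred, gt, threshold_cm=2.5):
    """3DPCK: share of 3D keypoints within a fixed metric distance
    of the ground truth. pred, gt: (N, 3) arrays in centimeters."""
    d = np.linalg.norm(pred - gt, axis=1)
    return float((d <= threshold_cm).mean())
```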

4 Experiments

The goal of this section is twofold: first, to compare the different selected deep learning architectures and their retrained versions on our dataset; second, to quantify the precision and accuracy of the keypoint detection with a view to clinical use.

The OpenPose network, originally trained for adult human pose estimation, was retrained by Chambers et al. [7] on an infant dataset of 9039 images collected mainly from YouTube. This retrained network is widely used to automatically assess infants’ movements, so we test this model without further training on our images and check whether it generalizes well to data from a clinical environment. The DarkPose + FiDIP network [16] was proposed as a solution to the lack of clinical data: based on a hybrid dataset of synthetic and real images collected from the internet (SyRIP), a domain classifier was trained to encourage the feature extractor to retain the ability to extract keypoint information while ignoring the differences between real and synthetic input images. This network is also tested and compared without extra training, as an existing solution, in addition to the retrained OpenPose version.

The experiments in [38] showed that HRNet networks with higher input resolution (384 × 288) improve the AP compared to networks with lower input resolution (256 × 192). For that reason, two trained HRNet networks (W32 & W48) were chosen with an input resolution of 384 × 288, in addition to two trained HigherHRNet networks (W32 & W48) with an input resolution of 640 × 640, and finally two DarkPose-on-top-of-HRNet networks (W32 & W48) with an input resolution of 384 × 288. These networks were tested with and without retraining on our clinical infant dataset. For training, we used only 15 epochs, since we are only fine-tuning the network weights. An Adam optimizer was used, and the learning rate was kept at \(10^{-4}\) throughout the whole process. For data augmentation, the random scale was set to [0.9, 1.1] and the random rotation to \(\left[ -45^{\circ },45^{\circ } \right] \), without half-body data augmentation. These settings are summarized below.
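For reference, an illustrative configuration capturing the fine-tuning settings above (the dictionary structure is ours, not the authors’ actual training code):

```python
# Fine-tuning settings as described in the text; the format of this
# configuration dictionary is illustrative, not the authors' code.
FINETUNE_CFG = {
    "epochs": 15,                      # only adjusting pre-trained weights
    "optimizer": "Adam",
    "learning_rate": 1e-4,             # constant throughout training
    "augmentation": {
        "random_scale": (0.9, 1.1),
        "random_rotation_deg": (-45, 45),
        "half_body_transform": False,  # disabled
    },
    "input_resolution": {
        "HRNet-W32/W48": (384, 288),
        "HigherHRNet-W32/W48": (640, 640),
        "DarkPose-W32/W48": (384, 288),
    },
}
```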

5 Results and discussion

5.1 2D pose estimation

Figure 6 shows the results obtained after training and testing the different architectures. The error is calculated in pixels as the Euclidean distance between ground-truth and predicted keypoints and averaged for each image (36K images with 880 × 720 pixel resolution). The impact of retraining the networks on our infant dataset is obvious, as it led to a noticeable decrease in the mean error across all architectures, as well as a reduction in the standard deviation. Specifically, for the HigherHRNet32 and HigherHRNet48 networks, the mean error diminished from 30.7 to 3.7 pixels and from 28.5 to 5.3 pixels, respectively.

The two networks OpenPose and FiDIP were evaluated without any additional training. With a confidence score threshold of 0.1, OpenPose exhibited the lowest mean error of 6 pixels (see Fig. 6). This result is remarkably low; however, it should be noted that the mean error calculation considered only successfully predicted joints across the entire dataset, and the corresponding mean detection rate for this network was approximately 5 keypoints per image, in contrast to the other networks, which achieved a detection rate of 17 keypoints per image (100% of joints detected). This discrepancy becomes evident when assessing the network’s performance using PCK evaluation at various thresholds, as illustrated in Table 3. In fact, OpenPose performed poorly even in comparison to networks that were not trained on any infant images.

Fig. 6. 2D pose estimation error for different networks in pixels. The boxplot shows the median error, and the whiskers mark 1.5 times the interquartile range. DR refers to detection rate

Table 3 2D pose estimation results comparison of original versus retrained networks on our test dataset

As shown in Table 3, the FiDIP network exhibited a PCK (Percentage of Correct Keypoints) comparable to that of the non-retrained DarkPose48 network (95.35% versus 95.32% for PCK@0.2). Despite being retrained on the SyRIP dataset, which comprises 700 real and 1000 synthetic images, the FiDIP network did not demonstrate any advantage over DarkPose48. This lack of improvement can be attributed to the characteristics of the images within this dataset (see Fig. 4 and Table 2), which were sourced from the internet and encompass infant poses that do not adequately represent the typical video acquisition protocol for assessing infants. Consequently, retraining on synthetic images does not inherently enhance the efficiency of networks when processing real images captured in a clinical setting. This underscores the need for a real and sizable dataset in this particular research domain, which cannot be overlooked or substituted.

The HigherHRNet networks demonstrated a satisfactory level of performance in terms of PCK, but not as good as HRNet networks. This disparity can be attributed to the fact that the scale of our subjects within the image is already adequate. Consequently, upsampling the predicted heatmaps and averaging them across all scales may potentially impact the accuracy of the predictions. This observation aligns with the findings of Cheng et al. [8], who concluded that HigherHRNet performs exceptionally well when dealing with small scales. Furthermore, it is worth noting that the HigherHRNet32 network exhibited a notable standard deviation of error. Upon closer examination, it was identified that in certain test images, the network exhibited confusion between keypoints located on the left and right sides. However, this issue was successfully addressed through network retraining, as it was no longer observed thereafter.

Both DarkPose architectures (W32 & W48) achieved the best PCK (91.37% & 90.97% at the 0.1 threshold, and 98.30% & 98.28% at the 0.2 threshold, respectively). This shows that, instead of the standard coordinate decoding method, the heatmap decoding process described in Section 3.2.3 can improve infant pose accuracy, which is in accordance with the results in [41]. Another observation is that after training, the W32 and W48 versions have approximately the same performance; this is because our subjects’ bounding boxes are large enough, so the lighter version of these networks can be used to reduce time complexity and resource usage.

Table 4 3D pose estimation results comparison of the non-retrained networks on our test dataset
Table 5 Per joints 3D pose estimation results of the retrained DarkPose32 network

5.2 3D pose estimation

The 3D pose estimation results were in accordance with those obtained in the 2D analysis (see Table 4). OpenPose had the lowest 3DPCK at all thresholds, with the same detection rate of 30%, and FiDIP had results comparable to the non-retrained DarkPose48 (93.88% versus 93.86% at 3DPCK@5cm) despite being retrained on the SyRIP dataset, which confirms that training on synthetic images alone does not automatically improve the effectiveness of networks.

All the retrained networks gained at least 5% in 3DPCK@2.5cm, and their mean error was reduced. The retrained version of DarkPose32 achieved the best 3DPCK at all thresholds, with a minimum mean error of 1.72 cm, which is a very promising result given the complex poses in our test dataset. The results for the DarkPose48 network after training were very similar to those of the W32 version, with only a 3 mm difference in mean error.

Fig. 7. 3D pose estimation results of different networks: (a) HRNet32, (b) retrained HRNet32, (c) HigherHRNet32, (d) retrained HigherHRNet32, (e) DarkPose32, (f) retrained DarkPose32

Moreover, when evaluating the results of this latter network for each group of joints separately, the mean error can be even lower, particularly for keypoints that are easy to define, such as the nose, for which the retrained DarkPose32 network achieved 99.74% in 3DPCK@2.5cm and a mean error of 0.94 cm (see Table 5). The same table shows how much higher the error is for keypoints that are difficult to localize precisely, such as the hips, for which the mean error exceeds 2 cm. We can also note that better 2D pose detections lead to better 3D pose estimations, as is the case for the DarkPose networks; the opposite also holds, since HigherHRNet had a considerable standard deviation and mean error.

Figure 7 shows the results of some networks on different complex infant poses. It can be observed that none of the three networks provides accurate 3D pose estimations before training, and that the estimations become better after training, even when joints are partially or completely occluded, as is the case in the second image. A standard depth camera would not be able to estimate the depth of these joints, since it captures only visible body-surface depth information, which shows the benefits of using our framework to analyze infants’ movements, as can be seen in the video (Online Resource 1).

In summary, the main conclusions that can be drawn from these results are: (1) it is particularly important to use a dedicated annotated dataset to train pose detection models for infants in a real medical environment; (2) compared to the existing retrained models OpenPose and FiDIP, models built on the HRNet backbone provide better precision, particularly DarkPose32, which is the best performing one in our case; (3) the latter network provides a 3D localization error of 1.7 cm, with a 3DPCK@2.5cm of 83% and a 3DPCK@5cm of 97%.

6 Conclusion

The 3D automatic GMA method is a challenging field of study. An accurate estimation of infant poses is needed to automatically analyze the complexity, variability, and fluidity of infants’ movements. This study presents the first stereoscopic 3D infant pose estimation framework dedicated to preterm children hospitalized in the Neonatal Unit. Compared to other existing automatic infant movement analysis tools, the AGMA stereoscopic framework has successfully shown the possibility of estimating accurate 3D infant poses without the use of any markers or infrared cameras. Three state-of-the-art 2D human pose estimation networks (HRNet, HigherHRNet, and DarkPose) were retrained on a dataset of 88,500 preterm infant images collected in a real medical environment and manually annotated. The networks were tested on 18,000 stereoscopic images and compared to the latest works on 2D infant pose estimation retrained on real or synthetic images. The minimum mean 3D error on joint position achieved was 1.7 cm, with a 3DPCK@5cm of 97%. This study demonstrated that retraining on synthetic images does not necessarily make networks efficient on real images captured in a clinical environment, and that an adapted real dataset is more advantageous. In addition, it showed that an appropriate heatmap decoding process, rather than the standard coordinate decoding method, can improve infant pose accuracy. Another advantage of the presented framework is that any other advanced deep neural network can be used in the future for 2D pose estimation followed by 3D pose reconstruction. Beyond GMA, the AGMA 3D framework could also pave the way for the development of new tools based on the analysis of preterm infant movements, such as the automation of clinical seizure detection [28] and sleep quantification [5].