1 Introduction

Neck movement is essential and frequent in daily life. However, its functionality may decline due to factors such as aging, trauma, pain, muscle fatigue, abnormal posture, migraine, and cervical spine degenerative diseases such as cervical spondylotic myelopathy (CSM), cervical herniated disc, and cervical foraminal stenosis [1]. These conditions reduce functional activity and quality of life. The most popular physical examination of functional neck movement, used to assess severity and prognosis and to determine treatment progress and effectiveness, is cervical range of motion (CROM) measurement [2]. CROM measurement evaluates the maximum angles of neck movement in six directions: flexion, extension, left lateral bending, right lateral bending, left rotation, and right rotation.

The neck direction can be described by Euler angles (i.e., yaw, pitch, and roll) with respect to the Frankfort plane, which is determined by the centers of the two external auditory meatuses and the inferior margin of each orbit [3]. The yaw angle, related to left and right rotation, measures rotation around axis-1, perpendicular to the Frankfort plane. The pitch angle, related to flexion and extension, measures rotation around axis-2, parallel to the line passing through the two external auditory meatuses. The roll angle, related to left and right lateral bending, measures rotation around the axis orthogonal to both axis-1 and axis-2.

The gold-standard procedure for CROM measurement is radiography, in which radiographs are taken in different head positions: the neutral position and full flexion and extension. This technique provides the highest validity and accuracy in CROM measurement. However, it is impractical due to invasiveness, radiation exposure, and high operating cost [4].

Non-invasive CROM measurement methods include visual inspection, the goniometer, the cervical range of motion device (CROM device), ultrasonography, and electromagnetic devices. Some of them are impractical for clinical use because they require identifying anatomical landmarks, take a long time, are costly, or demand high technical expertise. Moreover, some of them have large measurement errors. Recently, vision-based CROM measurement methods have been developed using the Microsoft Kinect or the Cambridge Face Tracker [5]. They are more portable, affordable, and convenient because they are marker-less [4]. However, the Kinect yields large CROM measurement errors and is difficult to use in uncontrolled environments [4, 6]. The Cambridge Face Tracker must be integrated with a facial landmark detection module (such as Constrained Local Neural Fields (CLNF) [5]) and is applicable only to limited head pose ranges. The problematic issues for medical CROM measurement are summarized in Table 1.

Table 1 Summary of the problematic issues for medical CROM measurement

To overcome the above drawbacks, this study employs the head pose estimation (HPE) technique for CROM measurement. HPE is a computer vision technique that estimates head orientation with respect to the observing camera from a single digital image, where the orientation is described in terms of three Euler angles: yaw, pitch, and roll. HPE has several other applications, including driver attention detection [7], human–computer interaction, best frame selection, and face frontalization [8]. The challenge in head pose estimation lies in achieving robustness and invariance to variations in illumination, geometry, facial appearance, facial expression, and facial accessories such as glasses and hats. In recent years, HPE has progressed with the development of modern deep neural networks, which now set the benchmark performance.

To the best of our knowledge, no research has applied deep-learning-based HPE algorithms to CROM measurement. Herein, our work proposes an end-to-end HPE algorithm featuring multi-level feature extraction and cross-level attention and fusion to estimate the head pose (i.e., the associated Euler angles) accurately. The proposed method is evaluated in two scenarios. In the first, we train and test on three public datasets (300W_LP [9], AFLW2000 [9], and BIWI [10]); in the second, we use the pre-trained model to evaluate our private dataset for medical CROM measurement, which consists of RGB images collected from 15 healthy subjects with ground-truth angles measured with a goniometer [11].

There are three main contributions in this paper:

  1. This study proposes a novel landmark-free HPE algorithm, characterized by multi-level feature extraction and cross-level attention and fusion, to estimate the head pose from a single image. Our design aggregates multi-scale (i.e., pyramid-structured) semantic information and applies spatial and channel attention to selectively emphasize informative features, thus improving pose estimation accuracy.

  2. The proposed technique is evaluated under two common protocols and compared with state-of-the-art (SOTA) methods. Our accuracy is comparable to the benchmark performance.

  3. Our HPE design is novelly applied to medical CROM measurement and shows practical feasibility when compared with goniometer measurements (as ground truth) conducted by rehabilitation doctors.

2 Related works

Head pose estimation (HPE) is the process of estimating head orientation (in terms of Euler angles) from digital imagery. HPE algorithms can be classified into two categories: landmark-based and landmark-free.

2.1 Landmark-based techniques

Landmark-based HPE algorithms commonly estimate head orientation in three steps. These algorithms first detect 2D facial landmarks [12,13,14,15,16] (the projections of a set of feature points on a 3D face model) in the input image, then establish correspondences between the 2D landmarks and their 3D counterparts on the face model, and finally utilize the Perspective-n-Point (PnP) algorithm [17] to estimate the pose parameters.

Werner et al. [18] proposed a simple method of learning a head pose estimator (HPEL) on top of a facial landmark detector via Support Vector Regression (SVR). This approach can easily be combined with any facial landmark detector, such as OpenFace [19] and IntraFace [20]. However, facial landmark algorithms sometimes fail to track faces in extreme poses. Bulat and Tzimiropoulos [21] proposed the Face Alignment Network (FAN), which uses four stacked hourglass networks to estimate heatmaps of 2D facial landmarks. In the sequel, they proposed a 2D-to-3D FAN to generate 3D facial landmarks. The FAN [21] technique, however, is sensitive to very low facial image resolution, and there is no optimal number of landmarks for accurate head pose estimation. For these reasons, FAN is well known for 2D/3D facial landmark detection, but not for HPE.

Zhu et al. [9] proposed a 3D Dense Face Alignment (3DDFA) algorithm based on cascaded convolutional neural networks that converts a 2D facial image into dense vertices on a 3D facial model via 3D morphable model (3DMM) [22] fitting. The primary output of 3DDFA is 3D facial landmarks, with the fitted 3DMM as a by-product that can be transformed into the Euler angles of the corresponding head pose. This technique can estimate poses over the full head pose range, eliminating the inaccurate facial landmarks of semi-profile and profile faces that exhibit self-occlusions.

Wu et al. [22] proposed SynergyNet, a CNN that reversely regresses 3DMM parameters from 3D facial landmarks and iteratively establishes representations between the 3DMM and the 3D landmarks. As a result, the 3DMM and the 3D landmarks collaboratively learn 3D facial geometry better, leading to better head orientation estimation and finer facial landmarks, at the expense of additional iteration time. Xia et al. [23] proposed an HPE method that uses detected facial landmarks to apply an affine transformation to the facial image. The warped facial image is stacked with the landmark heatmap as input to a CNN that estimates the head angles. This technique can be considered a hybrid method, based on both landmarks and RGB image features.

Lie et al. [24] treated the facial landmarks as a graph and proposed a two-stream architecture in which an Adaptive Graph Convolutional Network (AGCN) processes the 2D and 3D facial landmarks extracted by FAN [21]. A teacher-student training policy was also used to distill knowledge from the teacher network (3D) to the student network (2D), so that the pose estimation accuracy after late fusion of the two-stream outputs can be enhanced. This technique mitigates some drawbacks of landmark-based algorithms through knowledge distillation and multi-stream fusion, but at the expense of a larger model size.

2.2 Landmark-free methods

Landmark-free HPE techniques directly extract features from 2D digital images and convert them into head pose parameters without using 2D/3D facial landmarks as an intermediate representation.

Albiero et al. [25] proposed the Img2pose network, a Faster R-CNN-based [26] technique with a ResNet-18 backbone that detects faces and estimates head poses in six degrees of freedom (6DoF), i.e., three rotational and three translational parameters. However, this technique requires manual annotation of bounding box labels for the training dataset. Ruiz et al. [17] proposed using a ResNet-50 backbone integrated with multiple fully connected layers to estimate the yaw, pitch, and roll angles (in a range between -99 and 99 degrees). This technique was further developed by Li et al. [6], who rectified the input image to align the subject’s head with a virtual optical center and tackle perspective distortion. As a result, it achieved good accuracy with a very lightweight model size of 0.88 MB. However, the camera intrinsic parameters are required, and the angular range of the head orientation remains limited.

There are two common ways to obtain the Euler angles of the head pose: direct regression [27] and bin or heatmap classification [6, 17]. FSA-Net [27], proposed by Yang et al., used a compact SSR-Net-based model with a soft-stage-wise regression scheme and a feature aggregation module to directly output the Euler angles. On the other hand, Hsu et al. [28] proposed QuatNet, a GoogLeNet-based architecture, to independently regress each of the 4 quaternion parameters; this representation disentangles the shared features better and requires less computation than predicting the 9 rotation matrix elements. Cao et al. [29] proposed TriNet, a ResNet-50 architecture with feature mapping and prediction modules based on the FSA-Net capsule network, to regress the 3 × 3 rotation matrix (i.e., 3 rotation vectors). Moreover, some studies introduced an orthogonality loss [29] or geodesic loss [12] to enforce the orthogonal relationship between each pair of the 3 predicted vectors and guarantee their correctness. This achieved full-range head poses with higher accuracy than prior work.

In fact, the 3 × 3 rotation matrix representation avoids the so-called gimbal lock problem of Euler angles (when an angle is near -90 or 90 degrees [28]) and the antipodal problem of quaternions (q and -q confusingly correspond to the same rotation [29]). Hempel’s RepVGG-based work [12] further reduced the rotation representation to 2 vectors (6 elements) via Gram-Schmidt orthogonalization, dropping the prediction of the third vector. Even so, Euler angles remain the most common representation in head pose estimation because of their more robust implementation via bin classification or heatmap estimation.

Landmark-based and landmark-free methods each have their own advantages and downsides. Landmark-based methods output accurate head poses with resistance to occlusions, illumination changes, and extreme pose angles. However, they are susceptible to low image resolution and require accurate landmark predictions as a premise. Landmark-free methods extract features from the raw images, aggregate them, and regress the head angles without any facial landmark dependency. In general, this category achieves higher accuracy than landmark-based techniques, except for [23], which is a hybrid method (stacking the warped image and the landmark heatmap together as input to the network for prediction).

3 Our proposed method

The design of our architecture, which falls in the landmark-free category, builds a pyramidal structure to extract image features at multiple levels of detail and then aggregates them to synergize the advantages of multi-scale semantic information (e.g., edges and corners in the bottom layers and abstract features for classification in the top layers [30, 31]). Moreover, our architecture applies attention at the spatial and channel levels to focus on where and what to attend to in each aggregated feature map [32]. Thus, our design improves the performance of head pose classification and regression.

3.1 Pyramid backbone for multi-level feature extraction

A pyramid structure is designed to extract multi-level image features (i.e., multiple scales of detail) after multiple convolution blocks, where the lower pyramid levels produce larger feature maps that reveal detailed edge and corner information, while the higher levels output smaller feature maps that reflect abstract characteristics. All of the pyramid outputs are combined through gradual aggregation via cross-level fusion (i.e., between two adjacent pyramid levels) to synergize for HPE improvement.

To achieve balanced architecture scaling, the scalable CNN backbone EfficientNet-Lite4 [33] is used, with the first six layers being mobile inverted bottleneck convolution (MBConv) blocks [34], as shown in Fig. 1; it takes an RGB image with a resolution of 224 × 224 pixels. MBConv [34] consists of a point-wise expansion layer and a depth-wise convolution layer to widen the number of channels, followed by a final 1 × 1 convolution layer that projects the inner layers to the output. Batch normalization and ReLU6 follow the expansion and depth-wise convolution layers. A shortcut connects the input of the expansion layer to the output of the final 1 × 1 convolution layer to carry the information needed to assist backpropagation. This increases prediction performance while keeping computational complexity low [34].
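For illustration, a minimal PyTorch sketch of such an MBConv block is given below. It is not the exact EfficientNet-Lite4 block: the kernel size, expansion ratio, and stride defaults are placeholders.

```python
import torch
import torch.nn as nn

class MBConvSketch(nn.Module):
    """Minimal MBConv sketch: point-wise expansion -> depth-wise conv -> 1x1 projection."""
    def __init__(self, in_ch, out_ch, expansion=6, stride=1, kernel=3):
        super().__init__()
        mid_ch = in_ch * expansion
        # Point-wise expansion widens the number of channels (BN + ReLU6 follow).
        self.expand = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False),
            nn.BatchNorm2d(mid_ch),
            nn.ReLU6(inplace=True),
        )
        # Depth-wise convolution processes each channel separately (BN + ReLU6 follow).
        self.depthwise = nn.Sequential(
            nn.Conv2d(mid_ch, mid_ch, kernel, stride=stride,
                      padding=kernel // 2, groups=mid_ch, bias=False),
            nn.BatchNorm2d(mid_ch),
            nn.ReLU6(inplace=True),
        )
        # Final 1x1 convolution projects back to the output channels (no activation).
        self.project = nn.Sequential(
            nn.Conv2d(mid_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # Shortcut is only valid when spatial size and channel count are preserved.
        self.use_shortcut = (stride == 1 and in_ch == out_ch)

    def forward(self, x):
        out = self.project(self.depthwise(self.expand(x)))
        return x + out if self.use_shortcut else out
```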

Fig. 1
figure 1

The overall architecture of the proposed HPE technique

To comply with the pyramid structure, the output of each MBConv stage, except for the first, is passed as an input to the Multiple Attention Modules (MAMs). The outputs of the 2nd and 3rd pyramid levels are the inputs to the 1st MAM. Each subsequent MAM aggregates the output of the previous MAM with the output of the next higher pyramid level of EfficientNet-Lite4. Notice that our feature aggregation by MAMs, in contrast to the popular UNet, ResUNet [35], or other pyramid networks [36], collects and fuses attention information at the smallest resolution (i.e., 7 × 7 × 272) so as to reduce the overall model size and speed up subsequent processing.
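The chaining described above can be sketched as follows. The class name, the injected `backbone`, and the channel handling are illustrative assumptions; the MAM internals are deferred to Sect. 3.2.

```python
import torch.nn as nn

class PyramidFusionSketch(nn.Module):
    """Illustrative chaining of MAM modules over the pyramid levels.

    `backbone` is assumed to return a list of feature maps, one per MBConv stage
    (index 0 = first stage); `mam_modules` are the cross-level attention modules
    of Sect. 3.2 (MAM #1 ... #4). Only the 7 x 7 x 272 final size is quoted in the
    text; other shapes are placeholders.
    """
    def __init__(self, backbone, mam_modules):
        super().__init__()
        self.backbone = backbone                 # e.g., EfficientNet-Lite4 feature extractor
        self.mams = nn.ModuleList(mam_modules)

    def forward(self, x):
        feats = self.backbone(x)                 # multi-level features, coarse levels last
        fused = self.mams[0](feats[1], feats[2]) # 2nd and 3rd pyramid levels feed MAM #1
        for mam, feat in zip(self.mams[1:], feats[3:]):
            fused = mam(fused, feat)             # fuse prior MAM output with the next level
        return fused                             # smallest resolution, e.g., 7 x 7 x 272
```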

3.2 Multiple attention modules (MAM) for feature fusion

In Fig. 1, MAMs are used to acquire and fuse attention features from different levels, where multiple cross-level attention modules process adjacent hierarchical features that carry different semantic meanings. As shown in Fig. 2, each MAM first builds an attention mask by fusing cross-level features, which is then used as a weight map for local feature enhancement by multiplication. Moreover, each MAM benefits from the pyramidal structure to better fuse semantic context information from the lower (2H × 2W resolution) to the higher (H × W resolution) level through two parts.

Fig. 2
figure 2

The structure of each cross-level attention module

The first part is an Attention Unit. Both inputs are first passed through conv blocks (shown in blue in Fig. 2). The processed output from the lower level (2H × 2W resolution) is passed through 2 × 2 max pooling, and the two outputs are then combined. Among MAMs #1-#4 in Fig. 1, the 3rd MAM omits the max pooling because its two adjacent levels already have the same feature size. Next, the input from the higher level (H × W resolution) is passed through another conv block, and its output is multiplied with the previously combined output. The second part is an Addition Unit. It is built as a shortcut connection that transfers features from the higher level (H × W resolution) to the output and prevents vanishing gradients during network backpropagation.
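A minimal sketch of one such module is shown below. How the two branches are "combined" and which activation forms the attention mask are not specified above, so element-wise addition and a sigmoid are assumptions here; the helper `conv_block` and the channel count `ch` are also placeholders.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """Plain 3x3 conv + ReLU6, standing in for the blue conv blocks of Fig. 2."""
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU6(inplace=True))

class MAMSketch(nn.Module):
    """One cross-level attention module (Fig. 2), written as an illustrative sketch.

    `low_feat` is the lower pyramid level (2H x 2W), `high_feat` the higher one (H x W).
    """
    def __init__(self, low_ch, high_ch, ch, pool=True):
        super().__init__()
        self.low_conv = conv_block(low_ch, ch)
        self.high_conv = conv_block(high_ch, ch)
        self.value_conv = conv_block(high_ch, ch)
        self.shortcut = nn.Conv2d(high_ch, ch, 1)               # addition-unit projection
        self.pool = nn.MaxPool2d(2) if pool else nn.Identity()  # MAM #3 skips pooling
        self.mask_act = nn.Sigmoid()                             # assumed mask activation

    def forward(self, low_feat, high_feat):
        # Attention unit: fuse cross-level features into a weight map.
        mask = self.mask_act(self.pool(self.low_conv(low_feat)) + self.high_conv(high_feat))
        # Enhance the higher-level features by multiplication, then add the shortcut.
        return self.value_conv(high_feat) * mask + self.shortcut(high_feat)
```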

3.3 Modified atrous spatial pyramid pooling (Modified ASPP)

A modified ASPP module placed after the MAMs, as shown in Fig. 1, is used for channel attention; it is detailed in Fig. 3. This part takes the output of the MAMs and processes it via Atrous convolutions [37] and the ECA (Efficient Channel Attention) [32] module.

Fig. 3
figure 3

The two-paths structure of the modified ASPP module for channel attention

An Atrous (dilated) convolution is a 2D convolution that computes responses using dilated kernel elements to increase the receptive field without increasing the actual kernel size or resizing the feature maps. However, the dilation rates must fit the feature map size to avoid degenerating the feature extraction from a 3 × 3 convolution into a simple 1 × 1 filter [37]. To prevent this problem, image-level features are employed as globally semantic information by using 2D global average pooling (GAP) and a 1 × 1 convolution to average the features over the whole image. This makes the network more robust at a small additional cost in training and inference time. Accordingly, this study employs Atrous convolutions with dilation rates of 1 and 2 to fit the output of the MAMs (7 × 7 × 272).
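The two dilated branches and the image-level branch can be sketched as follows; the 256-channel branch width and the nearest-neighbor upsampling of the image-level feature back to the 7 × 7 grid are assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class AtrousBranchesSketch(nn.Module):
    """Dilated 3x3 branches (rates 1 and 2) plus an image-level feature branch.

    Input is assumed to be the 7 x 7 x 272 MAM output; `ch` is a placeholder width.
    """
    def __init__(self, in_ch=272, ch=256):
        super().__init__()
        self.branch_r1 = nn.Conv2d(in_ch, ch, 3, padding=1, dilation=1)
        self.branch_r2 = nn.Conv2d(in_ch, ch, 3, padding=2, dilation=2)
        # Image-level branch: global average pooling followed by a 1x1 convolution.
        self.image_level = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, ch, 1),
        )

    def forward(self, x):
        h, w = x.shape[-2:]
        # Broadcast the global feature back to the spatial grid for later concatenation.
        img_feat = F.interpolate(self.image_level(x), size=(h, w), mode="nearest")
        return self.branch_r1(x), self.branch_r2(x), img_feat
```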

After the Atrous convolutions, precise and important pixel-level semantic information is acquired via a channel attention mechanism. This re-calibrates the channel weights to moderate channel interdependencies, selectively emphasize informative features, and suppress less useful ones. We modify ASPP by applying the Efficient Channel Attention (ECA) [32] module after the Atrous convolutions: features are aggregated by GAP without dimensionality reduction, and a 1 × 1 convolution with a sigmoid activation function learns the channel attention. This ECA module effectively enhances the channel-wise features and hence the prediction performance. After combining the ECA outputs and the image-level features by concatenation, the result is passed through 1 × 1 and 3 × 3 convolution modules to obtain a final output of 7 × 7 × 256 for multi-bin classification and regression.
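A sketch of this channel attention step, written to follow the wording above (GAP, 1 × 1 convolution, sigmoid), is given below. Note that the original ECA [32] formulation uses a small 1-D convolution across channels, so this is an approximation rather than the exact module.

```python
import torch.nn as nn

class ECASketch(nn.Module):
    """Channel attention as described in the text: GAP without dimensionality
    reduction, a 1x1 convolution, and a sigmoid producing per-channel weights."""
    def __init__(self, ch):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv2d(ch, ch, 1)
        self.gate = nn.Sigmoid()

    def forward(self, x):
        w = self.gate(self.conv(self.pool(x)))  # per-channel weights in [0, 1]
        return x * w                            # re-calibrate the channel responses
```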

3.4 Multi-bin classification and regression heads

Multi-bin classification and regression heads are used to convert the aggregated image features (7 × 7 × 256) into Euler angles. To avoid a large model size in the FC (fully connected) implementation, a GAP layer (output 1 × 1) is applied first (see Fig. 1) before entering three sets of FC layers. The output of each FC block contains 66 bins, each representing a quantized angle value, so that the regression problem becomes a classification problem over 66 categories, with each bin covering a 3-degree interval to keep each Euler angle (yaw, pitch, or roll) within the range of -99° to 99°. This multi-bin classification, together with the Arg_softmax function [38] that follows, has the advantages of easy convergence, as in most classification problems, and easy measurement of errors for the regression problem. The centroid of the probability distribution over the bins is computed via Eq. (1), known as the Arg_softmax function [38], to obtain a fine-grained Euler angle:

$${\theta }_{Pred}=3\sum\nolimits_{i=1}^{N}(\frac{{e}^{{x}_{i}}}{{\sum }_{c=1}^{N}{e}^{{x}_{c}}})*i-99,$$
(1)

Eq. (1) is computed as follows. First, a softmax normalizes each i-th bin prediction \({x}_{i}\) by the sum of exponentials of all bin predictions \({x}_{c}\). Next, the softmax outputs over the N = 66 bins are used to compute the expectation (i.e., centroid) of the bin index. The centroid is then multiplied by 3, and 99 is subtracted, to obtain the predicted angle (\({\theta }_{Pred}\)) in the range of -99° to 99°. The predicted angle \({\theta }_{Pred}\) can then be compared with its ground-truth value to solve the regression problem (see the loss functions below).
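The decoding step of Eq. (1) can be sketched as follows; the function name `decode_angle` is ours, and the assumption that bin indices run from 1 to N follows the equation as written.

```python
import torch
import torch.nn.functional as F

def decode_angle(logits, num_bins=66, bin_width=3.0, offset=99.0):
    """Arg_softmax decoding of Eq. (1): softmax over the bins, expectation of the
    bin index, then mapping back to degrees.

    `logits` has shape (batch, num_bins), i.e., one FC head output per Euler angle.
    """
    probs = F.softmax(logits, dim=-1)                     # per-bin probabilities
    idx = torch.arange(1, num_bins + 1,
                       dtype=probs.dtype, device=probs.device)  # bin indices i = 1..N
    expectation = (probs * idx).sum(dim=-1)               # centroid of the distribution
    return bin_width * expectation - offset               # predicted angle in degrees
```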

3.5 Loss function

Our loss function (\({L}_{total}\)) for each of yaw, pitch, and roll is composed of a classification sub-loss (\({L}_{cls}\)) and a regression sub-loss (\({L}_{reg}\)), expressed in Eqs. (2) and (3), respectively; the total loss is given in Eq. (4). The cross-entropy loss is adopted for \({L}_{cls}\) and the wrapped loss [39] for \({L}_{reg}\).

$${L}_{cls}=-\sum\nolimits_{i=1}^{N}{y}_{i}log({p}_{i})$$
(2)
$${L}_{reg}={\text{min}}\left[{\left|{\theta }_{pred}-{\theta }_{true}\right|}^{2},{(360-\left|{\theta }_{pred}-{\theta }_{true}\right|)}^{2}\right]$$
(3)
$${L}_{total}={L}_{cls}+{L}_{reg}$$
(4)

where \(\{{p}_{i}\}\) are the predicted bin outputs, \(\{{y}_{i}\}\) are the bin-label ground truths (converted from the Euler angle ground truth \({\theta }_{true}\)), N = 66, and \({\theta }_{pred}\) is calculated from Eq. (1). The wrapped loss reduces large angular errors, especially for the yaw angle. It converges more smoothly during training than the traditional mean squared error loss and thus yields a lower head pose estimation error.
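A PyTorch sketch of Eqs. (2)-(4) is given below; the helper names are ours, and any per-angle weighting is not modeled.

```python
import torch
import torch.nn.functional as F

def wrapped_reg_loss(theta_pred, theta_true):
    """Wrapped loss of Eq. (3): the squared angular difference taken the shorter
    way around the circle (angles in degrees)."""
    diff = torch.abs(theta_pred - theta_true)
    return torch.minimum(diff ** 2, (360.0 - diff) ** 2).mean()

def total_loss(bin_logits, bin_labels, theta_pred, theta_true):
    """Eq. (4): cross-entropy over the 66 bins (Eq. (2)) plus the wrapped
    regression loss (Eq. (3)). `bin_labels` holds the ground-truth bin indices
    derived from theta_true."""
    return F.cross_entropy(bin_logits, bin_labels) + wrapped_reg_loss(theta_pred, theta_true)
```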

4 Experiments

4.1 Datasets and configurations for experiments

Three popular datasets, 300W_LP [9], AFLW2000 [9], and BIWI [10], are used for HPE evaluation. 300W_LP [9] is a synthetic dataset created by expanding 2D images via 3DMM fitting [22] to obtain accurate ground-truth head poses. It was collected from four different sub-datasets, amounting to a total of 61,225 images; all images were then flipped to double the number to 122,450. AFLW2000 [9] is a challenging real-world dataset with a variety of identities, illuminations, and occlusions. It contains 2,000 in-the-wild images re-annotated with 68 3D facial landmarks. The BIWI dataset [10] was gathered in a laboratory environment by recording 20 subjects (6 females and 14 males) across different head poses, starting from a frontal position, with a Microsoft Kinect V2 sensor. The depth data were used to create head pose labels represented as rotation matrices. It contains 15,678 images. The head pose range of BIWI is -75 to 75 degrees for yaw, -60 to 60 degrees for pitch, and -50 to 50 degrees for roll.
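Since the BIWI labels are rotation matrices, they must be converted to Euler angles for training and evaluation. The sketch below shows one standard Z-Y-X decomposition; the exact axis order and sign convention used for BIWI are not stated here, so this is an illustration only.

```python
import numpy as np

def rotation_matrix_to_zyx_euler(R):
    """Decompose a 3x3 rotation matrix as R = Rz(a) @ Ry(b) @ Rx(c) and return
    (a, b, c) in degrees.

    Which of these corresponds to head yaw, pitch, and roll depends on the camera
    coordinate convention of the dataset; this only illustrates the conversion
    from rotation-matrix labels to Euler angles.
    """
    b = np.degrees(np.arctan2(-R[2, 0], np.hypot(R[0, 0], R[1, 0])))  # rotation about y
    c = np.degrees(np.arctan2(R[2, 1], R[2, 2]))                      # rotation about x
    a = np.degrees(np.arctan2(R[1, 0], R[0, 0]))                      # rotation about z
    return a, b, c
```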

For fair comparison, this study follows common protocols: (1) Protocol 1: train on 300W_LP and test on AFLW2000 and BIWI; (2) Protocol 2 [8]: use 70% of BIWI for training and 30% for testing; (3) keep only samples whose Euler angles are within the range of -99 to 99 degrees.

As is well known, increasing the number of training samples generally improves deep neural network performance. This study therefore augments the data with geometric and pixel-level transformations [40]. The geometric transformations are: (1) horizontal flipping, (2) shifting of the ROI (region of interest) within 10% of the image width, and (3) image scaling between 1.0 and 1.25. The pixel-level transformations are: (1) Gaussian noise, (2) brightness and contrast alteration, and (3) Gaussian blurring. For each training sample, the image was kept unaltered with 50% probability; otherwise, one of the six transformations above was selected uniformly at random.
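The augmentation policy can be sketched as follows; the OpenCV-based implementations and the parameter values are illustrative choices, not the exact settings of this study.

```python
import random
import numpy as np
import cv2

def augment(image):
    """50/50 policy: keep the image unaltered half of the time, otherwise apply one
    of the six transformations chosen uniformly at random. Note that a horizontal
    flip would also require negating the yaw and roll labels (not shown here)."""
    if random.random() < 0.5:
        return image
    h, w = image.shape[:2]
    choice = random.randrange(6)
    if choice == 0:                       # horizontal flipping
        return cv2.flip(image, 1)
    if choice == 1:                       # ROI shift within 10% of the image width
        tx, ty = random.uniform(-0.1, 0.1) * w, random.uniform(-0.1, 0.1) * h
        M = np.float32([[1, 0, tx], [0, 1, ty]])
        return cv2.warpAffine(image, M, (w, h))
    if choice == 2:                       # scaling between 1.0 and 1.25
        s = random.uniform(1.0, 1.25)
        M = cv2.getRotationMatrix2D((w / 2, h / 2), 0, s)
        return cv2.warpAffine(image, M, (w, h))
    if choice == 3:                       # additive Gaussian noise
        noise = np.random.normal(0, 10, image.shape)
        return np.clip(image.astype(np.float32) + noise, 0, 255).astype(np.uint8)
    if choice == 4:                       # brightness / contrast alteration
        alpha, beta = random.uniform(0.8, 1.2), random.uniform(-20, 20)
        return cv2.convertScaleAbs(image, alpha=alpha, beta=beta)
    return cv2.GaussianBlur(image, (5, 5), 0)   # Gaussian blurring
```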

This study trains the proposed network for 160 epochs using AdamW optimization with a learning rate of 0.00001, \({\beta }_{1}\) = 0.9, \({\beta }_{2}\) = 0.999, and \(\epsilon\) = \({10}^{-8}\) [41]. Image data are z-score-normalized before training using the mean and standard deviation. The platform for this study was an Intel Core i9 CPU equipped with an NVIDIA RTX 3080 GPU.
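For reference, a minimal sketch of this configuration is given below; the helper names are ours and the dataset statistics are placeholders.

```python
import torch

def make_optimizer(model: torch.nn.Module) -> torch.optim.AdamW:
    """AdamW with the hyper-parameters quoted above (lr 1e-5, betas 0.9/0.999, eps 1e-8)."""
    return torch.optim.AdamW(model.parameters(), lr=1e-5, betas=(0.9, 0.999), eps=1e-8)

def z_score_normalize(image, mean, std):
    """Per-channel z-score normalization of an input image; `mean` and `std` are
    the dataset statistics."""
    return (image - mean) / std
```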

4.2 Ablation study

This study adds/removes modules (e.g., the modified ASPP (Mod ASPP) and the MAMs) to examine their individual impact on HPE performance. Moreover, this study also removes the addition unit (see Fig. 2) inside each attention module (denoted MAM-w/o-Add), as shown in Fig. 4a, and removes the attention unit (see Fig. 2) inside each attention module by replacing it with a 2 × 2 convolution layer (denoted MAM-w/o-Att), as shown in Fig. 4b. The evaluation metric is the MAE (mean absolute error) [27, 42] given in Eq. (5):

$$MAE=\frac{1}{N}\sum\nolimits_{n=1}^{N}{\Vert {\widetilde{y}}_{n}-{y}_{n}\Vert }_{1}$$
(5)

where \({\widetilde{y}}_{n}\) is the predicted angle, \({y}_{n}\) is the ground-truth angle, and \(N\) is the number of images in the dataset. Table 2 shows the results of the ablation analysis. It shows that the use of MAMs, Mod ASPP, and data augmentation indeed enhances the overall performance in terms of the MAE metric.

Fig. 4
figure 4

a The MAM module without addition (MAM-w/o-Add) and b the MAM module without attention unit (replaced with a 2 × 2 convolution layer) (MAM-w/o-Att)

Table 2 The results (in terms of MAE) of ablation studies for Protocol 1 experiments

Removing each part (MAMs, Mod ASPP, or data augmentation) degrades HPE performance for three reasons: there is no pyramid-compliant acquisition of multi-level semantic information from hierarchical features with cross-level attention, no receptive field enhancement and channel attention mechanism, and no increase in the number of training samples to enhance robustness. Moreover, removing the addition or attention unit inside each MAM degrades performance accordingly, because there is then no cross-level spatial feature enhancement from adjacent-level outputs in the pyramid structure and no shortcut connection to assist backpropagation.

4.3 Comparison with state-of-the-art (SOTA) methods

Here we compare our method with others in terms of the MAE metric for Protocols 1 and 2 in Tables 3 and 4, respectively. Our proposed technique outperforms the SOTA methods in both Protocols 1 and 2, except for the lightweight CNN [6] in the Protocol 1 test on BIWI.

Table 3 Comparison of Euler angle estimation errors in MAE metric for Protocol 1 experiments
Table 4 Comparison of Euler angle estimation errors in MAE metric for Protocol 2 experiments

The results show the robustness of our proposed method across different datasets, thanks to the pyramid structure for multi-level feature extraction, the multiple spatial attention modules for fusing them, and the modified ASPP module for channel attention. Our technique is free of facial landmarks, a 3D head model, and a 2D-3D correspondence process.

Among the prior works listed in Tables 3 and 4, FSA-Net [27] and the lightweight CNN [6] perform considerably worse than ours but consume fewer hardware resources (see Sect. 4.4). Although the lightweight CNN [6] is closer to ours in accuracy, it requires an extra rectification of the input image for pre-alignment and needs the camera intrinsic parameters, which hinders practical applications. TriNet [29] and 6DRepNet [12] represent the output poses by 3 or 2 vectors of the rotation matrix, respectively. This shows that classification-based methods like ours (e.g., on Euler angles) can achieve higher accuracy than regression-based methods (e.g., on rotation matrices) [12, 29], at the expense of a possible gimbal lock situation [28]. Img2pose [25], based on Faster R-CNN [26] with a ResNet-18 backbone, achieves slightly inferior performance to ours but requires manual annotation of bounding box labels for training and much larger computation and inference time (see Sect. 4.4). WHENet [39] is not only inferior to ours in the dataset tests but also shows less robustness in realistic medical CROM measurement (Table 5 in Sect. 4.4).

Table 5 Results (in MAE metric) of the head pose estimation on CROM measurements when compared to the goniometer measurements by two rehabilitation doctors

It is observed from Tables 3 and 4 that larger estimation errors on the pitch angle, compared to the yaw and roll angles, are a common characteristic of most methods. The reason is the lack of extreme-pitch ground-truth samples in the training datasets. A possible solution is to augment the training dataset with synthesized samples generated from a 3D head model [17].

4.4 Applications of HPE to medical CROM measurement

To the best of our knowledge, this work may be the first study to apply a deep-learning-based HPE algorithm to measure medical CROM. Compared to a traditional standard medical instrument such as a goniometer [11], this application is fast, non-invasive, free of radiation exposure, low in operating cost, and free of anatomical landmark identification. Our dataset was collected from 15 subjects in a medical examination room (Department of Rehabilitation Medicine, Ramathibodi Hospital, Thailand). The measurements were approved by the institutional review board of the same hospital with the certificate of approval (COA) number MURA2021/73. Consent was obtained from all subjects. The inclusion criteria were:

  1. persons between 18 and 80 years of age,

  2. no relationship with the research team,

  3. no neck movement disorder, and

  4. no history of cervical surgery or trauma.

During video recording, the subject was asked to sit in front of a recording camera (Fig. 5). Calibration of the head pose was first conducted by asking the subject to move the head in the six directions (flexion, extension, left and right lateral bending, and left and right rotation; e.g., Fig. 6b, c) for two sessions. In each session, the measurement of each movement was performed twice using a goniometer. The maximum movement in each direction was captured by the webcam and saved as an image file. The number of files thus amounts to 360 (15 persons × 6 directions × 2 sessions × 2 repetitions). The ranges for the pitch, yaw, and roll angles were [-70, 68], [-87, 88], and [-57, 52] degrees, respectively.

Fig. 5
figure 5

Arrangement of a webcam in front of the subject

Fig. 6
figure 6

Pose calibration: a front position, b left rotation, and c right rotation

The camera was configured with an image resolution of 800 × 600 pixels at 30 fps (frames per second) and a bit rate of 2.5 Mbps. This bit rate was chosen considering the bandwidth, hardware, and software constraints of video transmission for telemedicine [43]; compressed videos without significant quality degradation are required for medical diagnosis.

The pre-processing of the collected video data consists of four steps, visualized in Fig. 7: (1) face bounding box detection with RetinaFace [44]; (2) facial landmark detection with the Face Alignment Network (FAN) [21]; (3) face ROI adjustment based on the detected landmarks, expanding the bounding box by an additional 20% to make a loose crop of the face that also covers the rest of the head (following the common pre-processing in [17, 27]); and (4) cropping and resizing of the expanded face ROI to 224 × 224 pixels.
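A sketch of these four steps is given below, with `detect_face` and `detect_landmarks` standing in for RetinaFace [44] and FAN [21]; their real APIs, and the exact way the ROI is adjusted to the landmarks, are not reproduced here.

```python
import cv2

def preprocess_face(image, detect_face, detect_landmarks, expand=0.2, size=224):
    """Sketch of the four pre-processing steps (Fig. 7).

    `detect_face(image)` is assumed to return (x1, y1, x2, y2);
    `detect_landmarks(image, box)` an (N, 2) NumPy array of landmark coordinates.
    """
    box = detect_face(image)                        # step 1: face bounding box
    landmarks = detect_landmarks(image, box)        # step 2: facial landmarks
    # Step 3: adjust the ROI to the landmark extent and expand it by 20%
    # so the loose crop also covers the rest of the head.
    lx1, ly1 = landmarks.min(axis=0)
    lx2, ly2 = landmarks.max(axis=0)
    w, h = lx2 - lx1, ly2 - ly1
    x1 = max(int(lx1 - expand * w), 0)
    y1 = max(int(ly1 - expand * h), 0)
    x2 = min(int(lx2 + expand * w), image.shape[1])
    y2 = min(int(ly2 + expand * h), image.shape[0])
    # Step 4: crop the expanded ROI and resize it to the network input size.
    return cv2.resize(image[y1:y2, x1:x2], (size, size))
```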

Fig. 7
figure 7

Visualization of the results of pre-processing

The pre-trained model based on the 300W_LP dataset was used to estimate the pose vector at the maximum of each head movement and compare it with the pose vector in the frontal view position to obtain the relative CROM angle, which reveals the maximum angle by which the subject can rotate. The results are compared to the ground truths measured with a goniometer by two rehabilitation doctors. Table 5 compares the results of various HPE methods (codes executed by us).
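A sketch of this relative CROM computation is given below, under the assumption that it is a per-axis difference of the estimated Euler angles; the function name and the example values are hypothetical.

```python
import numpy as np

def relative_crom(frontal_pose, movement_pose):
    """Relative CROM angle: the difference between the Euler angles estimated at the
    maximum movement and those estimated at the frontal (neutral) position.

    Both poses are (yaw, pitch, roll) in degrees as produced by the HPE model.
    """
    return np.asarray(movement_pose, dtype=float) - np.asarray(frontal_pose, dtype=float)

# Example: the left-rotation CROM would be read from the yaw component.
# yaw_range, _, _ = relative_crom(frontal_pose=(1.5, -2.0, 0.3), movement_pose=(72.0, -3.5, 1.0))
```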

It is observed that the proposed technique achieves the lowest MAE of 4.58 degrees, with statistical significance, compared to the other SOTA methods. Our MAE, less than 5 degrees, is acceptable according to the error threshold defined in the Cambridge Face Tracker research [5]. However, in Table 5, the pitch angles yield larger CROM estimation errors than the yaw and roll angles, for the same reason as in the previous section.

Figure 8 shows the relationship between the estimated angles and their corresponding ground truths (measured with the goniometer) for all 360 images. A linear model is fitted and shown in Fig. 8, revealing a high goodness-of-fit with \({R}^{2}\) equal to 0.90. This means that our HPE model estimates the angles faithfully.
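For clarity, the fit and its goodness-of-fit can be computed as in the sketch below; this is standard least-squares, not code from the study, and which variable serves as the regressor follows the caption of Fig. 8.

```python
import numpy as np

def fit_and_r2(gt_angles, pred_angles):
    """Least-squares line between goniometer ground truths and estimated angles,
    with the coefficient of determination R^2 (as reported in Fig. 8)."""
    gt = np.asarray(gt_angles, dtype=float)
    pred = np.asarray(pred_angles, dtype=float)
    slope, intercept = np.polyfit(gt, pred, 1)     # y = slope * x + intercept
    fitted = slope * gt + intercept
    ss_res = np.sum((pred - fitted) ** 2)          # residual sum of squares
    ss_tot = np.sum((pred - pred.mean()) ** 2)     # total sum of squares
    return slope, intercept, 1.0 - ss_res / ss_tot
```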

Fig. 8
figure 8

Assessing the goodness-of-fit of the linear regression model between the estimated angles and their corresponding ground truths measured with the goniometer. The solid line shows the linear regression between the two measures, given by \(y=0.9204x+4.305\) with an \({R}^{2}\) value of 0.90

Figure 9 plots the MAE distribution of HPE for the yaw, pitch, and roll angles on our dataset for CROM measurement. Our proposed technique achieves an MAE lower than 5 degrees within the range of [-90°, 79°] for yaw, a smaller range of [-60°, 59°] for pitch, and the smallest range of [-50°, 49°] for roll. These three ranges all cover the specified CROM ranges of [-70°, 70°], [-55°, 55°], and [-40°, 40°] for yaw, pitch, and roll defined in [45]. The capability to measure CROM within the specified ranges is mandatory in rehabilitation. This experiment shows the feasibility of our HPE technique in rehabilitation via CROM measurement.

Fig. 9
figure 9

The MAEs at different ground-truth angles (-99° to 99°) for the test on our dataset for CROM measurement

We also conducted a performance study of the SOTA methods on the CROM measurement task, evaluating the number of parameters, the computational complexity (in giga floating-point operations, GFLOPS), and the inference speed per image (in milliseconds). Table 6 provides the results, with all models pre-trained on the 300W_LP dataset.
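A sketch of how the parameter count and per-image inference time could be measured is shown below; the warm-up and run counts are assumptions, and GFLOPS counting (which needs an external profiler) is omitted.

```python
import time
import torch

def profile_model(model, device="cuda", input_size=(1, 3, 224, 224), runs=100):
    """Count parameters and time per-image inference, as reported in Table 6."""
    model = model.to(device).eval()
    n_params = sum(p.numel() for p in model.parameters())
    x = torch.randn(*input_size, device=device)
    with torch.no_grad():
        for _ in range(10):                    # warm-up iterations
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
    ms_per_image = (time.perf_counter() - start) * 1000.0 / runs
    return n_params, ms_per_image
```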

Table 6 Results of the performance study in the number of parameters, the computational complexity (GFLOPS), and the inference speed (ms)

For the medical CROM measurement application, accuracy and speed are the two main concerns. In this respect, our technique outperforms 3DDFA [9], FSA-Net [27], Img2Pose [25], TriNet [29], and 6DRepNet [12] in both speed and accuracy. Although HopeNet [17], WHENet [39], and the lightweight CNN [6] are faster than ours, HopeNet [17] has a larger model size and higher GFLOPS, while WHENet [39] and the lightweight CNN [6] have inferior accuracy (see Tables 3, 4, and 5). Our efficiency benefits from the use of EfficientNet-Lite [33], whose balanced scaling of channel size, network depth, and input image resolution effectively trades off FLOPS against accuracy.

Overall, our proposed method is well suited to realistic CROM measurement applications.

5 Conclusion

In this paper, a deep-learning-based HPE algorithm based on EfficientNet-Lite4, multiple attention modules, and a modified ASPP was proposed. Compared to the SOTA methods, the proposed technique achieves the lowest MAE on the two public datasets AFLW2000 and BIWI. Moreover, this study is the first to evaluate deep-learning-based HPE techniques in the medical context of CROM measurement. In real tests on our own collected dataset, the results also show better accuracy and robustness for medical CROM measurement than the SOTA methods. Our error analysis also shows conformity with medical standards, so the method is acceptable in real applications.

This work still has some remaining challenges. Better CROM measurement instruments, such as a CROM device [2], a 3D motion analysis technique [4], or even the gold-standard radiography [1], would be preferred over a goniometer for measuring the ground truths. Moreover, this study evaluated the proposed technique only on healthy subjects; other subject groups, such as CROM-limited patients affected by aging, trauma, pain, musculoskeletal problems, and cervical spine degenerative diseases, should also be evaluated.