Introduction

Adolescent idiopathic scoliosis (AIS) is a three-dimensional spinal disorder in which the spine exhibits a lateral curvature and axial vertebral rotation. This condition affects 1–3% of adolescents and can lead to back pain and cardiopulmonary compromise if left untreated [1]. To properly diagnose AIS and manage the treatment of children with the condition, the spine is routinely imaged to evaluate the severity of the lateral curvature in the coronal plane. This severity is known as the Cobb angle and is measured by identifying the most tilted vertebra at the top of the curve and the most oppositely tilted vertebra at the bottom, then calculating the difference between their two tilt angles. The Cobb angle is typically measured on a posteroanterior (PA) radiograph. However, radiography exposes these children to ionizing radiation, which increases their risk of developing cancer later in life [2, 3]. Children with AIS receive an average annual cumulative dose of 2.4–5.6 mSv [4].

Ultrasonography is an alternative imaging method that does not emit ionizing radiation and provides depth information, allowing for 3D spinal reconstructions [5]. Additionally, ultrasound systems cost less and are more portable than radiography systems. Measurements of various scoliotic parameters on 3D spinal ultrasonographs, including the Cobb angle, vertebral rotation, and sagittal curvature, have been shown to be comparably accurate and reliable to those made on PA and lateral radiographs [6,7,8]. However, ultrasonographs are more difficult to interpret and analyze. Accurately measuring scoliotic parameters on ultrasonographs is time-consuming, requires more training, and depends on making use of the 3D views. Consequently, these obstacles create a barrier to integrating ultrasonography into scoliosis clinics.

To mitigate the difficulties associated with manual measurement, research groups have developed algorithms to automatically measure the coronal curvature severity on ultrasonographs [9,10,11,12]. These groups measure the spinous process angle (SPA). This involves identifying the darker curve in the middle of the spine on the coronal projection image, which corresponds to the spinous process column (SPC), and deriving the SPA from the points at which the curve’s slopes are steepest. However, the SPA cannot directly replace the radiographic Cobb angle, as it only correlates with it. Additionally, identifying only the SPA does not allow measurement of sagittal or axial plane parameters, such as the kyphotic angle and vertebral rotation.

Using the center of lamina method to measure a coronal curve angle on ultrasonographs has produced comparable accuracy and reliability to the radiographic Cobb angle [13]. The laminae are flat surfaces to the left and right of each spinous process on a vertebral body. Because the surfaces of the laminae are flat and perpendicular to the ultrasound transducer during scanning, the laminae typically appear as bright isolated regions on the left and right of the dark SPC. Because different vertebral features are used to measure the coronal curvature on ultrasonographs compared to radiographs, the coronal curvature will be referred to as “coronal curve angle” instead of “Cobb angle”. Figure 1a, b depict the Cobb angle measured on a PA radiograph and its coronal curve angle counterpart measured on the coronal projection of an ultrasonograph of the same patient. Figure 1c, d illustrate the centers of laminae labelled on the sagittal projection and axial views, respectively. To measure the coronal curve angle on ultrasonographs, the centers of laminae for each vertebra must first be identified and paired up. Then, the tilt angles of the steepest lamina pairs are determined, with the coronal curve angle being measured similarly to the radiographic Cobb angle, except the angles of the center of lamina pairs are used instead of the angles of the upper and lower endplates. Identifying the center of lamina pairs also allows for measurement of the axial vertebral rotation and proxy kyphosis angle [7, 8]. Therefore, this study aimed to apply machine learning to automate coronal curve angle measurement on spinal ultrasonographs and to investigate the validity of the automatic method by comparing it with manual measurements performed on the same ultrasonographs and matching PA radiographs.

Fig. 1

a Cobb angle measured on a PA radiograph; b coronal curve angle measured on a coronal projection of an ultrasonograph with centers of laminae (green) and relevant pairs highlighted (red); c centers of laminae (green) on the sagittal projection of an ultrasonograph; d centers of laminae (green) labelled on two axial views of an ultrasonograph

Methods

Patient population

Ethics approval was received from the University of Alberta health research ethics board to extract 144 3D spinal ultrasonographs of children with AIS from the local scoliosis clinic. The inclusion criteria were: (1) diagnosed with AIS, (2) had a major radiographic Cobb angle between 10° and 55°, (3) had no prior surgery, and (4) had an out-of-brace PA radiograph taken on the same day. A 70-image training set was used to train a machine learning model, with data augmentation of random zooming, translation, rotation, contrast adjustment, and noise addition to increase the effective size of the data set to > 100,000. The remaining 74 ultrasonographs constituted a measurement validation set to evaluate the accuracy and reliability of the final algorithm. All ultrasonographs in the measurement validation set were randomly selected. This meant that the 74 ultrasonographs were selected without looking at a subject’s ultrasonograph, PA radiograph, and Cobb angle measurements.
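The augmentation step described above can be illustrated with a minimal sketch. It is not the study's implementation: the study augmented 3D ultrasonographs, whereas this example works on a small 2D image stored as nested lists with intensities assumed to lie in [0, 1]. The function name `augment` and all parameter ranges are hypothetical values chosen only for illustration.

```python
import math
import random


def augment(image, rng, max_zoom=0.1, max_shift=2.0, max_rot_deg=5.0,
            max_contrast=0.2, noise_sd=0.02):
    """Return one randomly augmented copy of a 2D image (list of rows).

    Applies random zoom, rotation, and translation about the image centre
    (nearest-neighbour resampling), then contrast adjustment and Gaussian
    noise. All ranges are illustrative assumptions, not the study's values.
    """
    h, w = len(image), len(image[0])
    zoom = 1.0 + rng.uniform(-max_zoom, max_zoom)
    theta = math.radians(rng.uniform(-max_rot_deg, max_rot_deg))
    dx = rng.uniform(-max_shift, max_shift)
    dy = rng.uniform(-max_shift, max_shift)
    contrast = 1.0 + rng.uniform(-max_contrast, max_contrast)
    cy, cx = (h - 1) / 2, (w - 1) / 2
    cos_t, sin_t = math.cos(theta), math.sin(theta)

    out = []
    for y in range(h):
        row = []
        for x in range(w):
            # Inverse-map each output pixel back to its source location.
            u, v = (x - cx - dx) / zoom, (y - cy - dy) / zoom
            sx = cos_t * u + sin_t * v + cx
            sy = -sin_t * u + cos_t * v + cy
            ix, iy = round(sx), round(sy)
            value = image[iy][ix] if 0 <= ix < w and 0 <= iy < h else 0.0
            # Contrast is scaled about mid-grey, then noise is added.
            value = (value - 0.5) * contrast + 0.5 + rng.gauss(0.0, noise_sd)
            row.append(min(1.0, max(0.0, value)))
        out.append(row)
    return out


# Example: one augmented copy of a synthetic 8 x 8 gradient image.
example_image = [[(x + y) / 14 for x in range(8)] for y in range(8)]
augmented = augment(example_image, random.Random(0))
```

Calling such a function repeatedly with different random states is one way a small labelled set can yield a much larger effective training set. For landmark prediction, any geometric transform applied to the image would also need to be applied to the labelled coordinates.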

Each ultrasonograph was obtained using a SonixTablet integrated with a SonixGPS system (BK Medical, Massachusetts, USA). The subjects were instructed to stand in a standardized posture, similar to the one used when PA radiographs were taken. The C5-2/60 GPS curvilinear convex transducer was then moved along the surface of the subject’s back, following the lateral curvature of the spine, from the lowest cervical vertebra (C7) to the lowest lumbar vertebra (L5). The system tracked the transducer's position and orientation throughout the scan. The ultrasound scanning parameters were set to a 2.5 MHz frequency, 6 cm imaging depth, and 10% gain with linear time gain compensation. Approximately 700–1000 axial B-mode images were acquired per spine, along with the position and orientation data of the transducer, at a resolution of 0.2 mm per pixel [14]. Using an in-house program called Medical Imaging Analysis System (version 10.3.41.6), the axial B-mode images were stitched together using the position and orientation data to generate 3D reconstructions. The matching PA radiographs were acquired on the same day using either a conventional digital X-ray system (Philips, Canada) or the EOS system (EOS Imaging Inc., France).
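The idea of placing tracked 2D frames into a common 3D space can be sketched as follows. This is only a hedged illustration of the general principle, not the in-house reconstruction software: the function `pixel_to_world`, the representation of orientation as a 3 × 3 rotation matrix, and the choice of image axes are all assumptions. Only the 0.2 mm per pixel resolution comes from the text.

```python
def pixel_to_world(u, v, position, rotation, mm_per_pixel=0.2):
    """Map pixel (u, v) of one B-mode frame to 3D tracker coordinates.

    position: (x, y, z) of the frame origin in millimetres (assumed).
    rotation: 3 x 3 rotation matrix (nested lists) giving the frame's
              orientation in the tracker's coordinate system (assumed).
    """
    # Pixel offset in millimetres within the image plane (local z = 0).
    local = (u * mm_per_pixel, v * mm_per_pixel, 0.0)
    # Rotate into the tracker frame, then translate by the frame origin.
    return tuple(
        sum(rotation[i][j] * local[j] for j in range(3)) + position[i]
        for i in range(3)
    )


# Example: a frame rotated 90 degrees about the z-axis.
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
rot_z_90 = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]
point_a = pixel_to_world(10, 5, (1.0, 2.0, 3.0), identity)
point_b = pixel_to_world(10, 5, (0.0, 0.0, 0.0), rot_z_90)
```

Applying such a mapping to every pixel of every frame, and resampling the result onto a regular grid, is one common way to build a volume from tracked freehand ultrasound.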

Automatic measurement method

A convolutional neural network (CNN) was trained to predict the locations of the centers of laminae on an input 3D spinal ultrasonograph. A CNN is a type of machine learning model designed to process image data. To train the CNN to predict the centers of laminae, a rater with over 20 years of scoliosis research experience labelled the centers of laminae in 3D on the 70-image training set. Labelling was performed in a custom graphical user interface developed in Python, with the subject’s matching PA radiograph used as a visual guide for placing the centers of laminae.

Once training was complete, the CNN could predict the 3D coordinates of the centers of laminae given an initial 3D ultrasonograph. A post-processing algorithm was developed to derive the coronal curve angle measurements from the predicted centers of laminae. This consisted of pairing up centers of laminae on the same vertebra and calculating the tilt angle for each lamina pair. The coronal curve angle was then measured by calculating the difference between the angles of the steepest opposing tilted vertebrae. While the centers of laminae were predicted in 3D coordinates, only the coordinates in the coronal plane, as defined by the global positioning system in the SonixGPS system, were used in the coronal curve angle calculation.
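The post-processing described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the study's algorithm: it assumes the left and right lamina centers have already been paired per vertebra, it uses a hypothetical coordinate convention (x to the patient's right, y pointing cranially), and it measures only a single curve by taking the two most oppositely tilted vertebrae, whereas a spine may contain several curves. The names `tilt_angle_deg`, `coronal_curve_angle`, and `make_pair` are hypothetical.

```python
import math


def tilt_angle_deg(left, right):
    """Tilt of the line joining a left/right lamina pair, in degrees.

    Points are (x, y) coordinates in the coronal plane.
    """
    dx = right[0] - left[0]
    dy = right[1] - left[1]
    return math.degrees(math.atan2(dy, dx))


def coronal_curve_angle(pairs):
    """Return (angle, upper_index, lower_index) for a single curve.

    pairs: list of (left, right) lamina centers ordered from the top of
    the spine to the bottom. The angle is the difference between the
    largest and smallest tilt angles, i.e. the two most oppositely
    tilted vertebrae.
    """
    tilts = [tilt_angle_deg(left, right) for left, right in pairs]
    i_max = max(range(len(tilts)), key=tilts.__getitem__)
    i_min = min(range(len(tilts)), key=tilts.__getitem__)
    upper, lower = sorted((i_max, i_min))
    return tilts[i_max] - tilts[i_min], upper, lower


def make_pair(y, tilt_deg, half_width=10.0):
    """Build a synthetic lamina pair with a known tilt (for the example)."""
    dy = half_width * math.tan(math.radians(tilt_deg))
    return ((-half_width, y - dy), (half_width, y + dy))


# Example: five synthetic vertebrae with tilts of 0, 10, 0, -15, 0 degrees.
example_pairs = [make_pair(100 - 20 * i, t) for i, t in enumerate([0, 10, 0, -15, 0])]
angle, upper_idx, lower_idx = coronal_curve_angle(example_pairs)
```

In this synthetic example the end vertebrae are the second and fourth pairs, and the measured angle is the difference between their tilts.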

After the measurement was complete, the machine learning algorithm also output a visualization of how the coronal curve angles were measured. This consisted of the ultrasonograph’s coronal projection image with the predicted lamina pairs overlaid on top of it and the vertebral tilts used for measurement highlighted.

Measurement validation

The performance of the developed measurement algorithm was assessed by performing all pairwise comparisons between automatic ultrasound (A-US) coronal curve angle measurements, manual ultrasound (M-US) coronal curve angle measurements, and manual radiographic (M-Xray) Cobb angle measurements on the 74-image measurement validation set. The A-US versus M-Xray comparison was investigated because measuring the Cobb angle on a PA radiograph is the gold standard, so evaluating the accuracy and reliability between the two directly informs the feasibility of replacing radiography with ultrasonography. The M-US versus M-Xray comparison was added as a control and indicated whether errors in the A-US measurements relative to the M-Xray measurements could be explained by inherent discrepancies between the imaging modalities rather than by faults in the automatic measurement algorithm. All A-US measurements were run on a Windows computer with an NVIDIA GeForce RTX 3060 Ti GPU and an Intel i7-12700 CPU. All M-US measurements were performed by a researcher with over 20 years of scoliosis research experience and were conducted using the aid of the previous radiograph method, which involved overlaying a subject’s previous PA radiograph on top of the coronal projection of the ultrasonograph to improve measurement accuracy [6]. All M-Xray measurements were performed by clinicians with over 20 years of experience. Both the researcher and the clinicians were blinded to the A-US, M-US, and M-Xray measurements.

For each of these comparisons, the accuracy and reliability between the methods were determined using the mean absolute error (MAE), standard deviation of absolute errors (SD), inter-method intraclass correlation coefficient (ICC2,1), and standard error of measurement (SEM). Koo’s definitions of poor (< 0.5), moderate (0.5–0.75), good (0.75–0.90), and excellent (≥ 0.90) were used to evaluate the ICC2,1 qualitatively [15]. The percentage of errors within clinical acceptance was also calculated, where clinical acceptance was defined as when the absolute difference between a pair of measurements was ≤ 5°. The threshold for clinical acceptance is based on the intra-observer and inter-observer manual Cobb angle measurement variation in both radiographs [16] and ultrasonographs [6]. Bland–Altman analysis was also performed to assess the levels of agreement between the different measurement methods [17]. Finally, the error-index was calculated to evaluate the vertebral level agreement in each comparison [18].
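The agreement statistics listed above can be sketched with the Python standard library. This is an illustrative sketch rather than the study's analysis code: the function names are hypothetical, the sample (n − 1) standard deviation is assumed, and the ICC follows the standard two-way random-effects, absolute-agreement, single-measurement formula with two methods. The SEM is omitted because the text does not state how it was derived.

```python
import statistics


def agreement_summary(a, b, tolerance_deg=5.0):
    """MAE, SD of absolute errors, % within tolerance, Bland-Altman stats."""
    diffs = [x - y for x, y in zip(a, b)]
    abs_err = [abs(d) for d in diffs]
    bias = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)
    return {
        "mae": statistics.mean(abs_err),
        "sd_abs": statistics.stdev(abs_err),
        "pct_within": 100.0 * sum(e <= tolerance_deg for e in abs_err) / len(abs_err),
        "bias": bias,
        # Limits of agreement: bias plus or minus 1.96 SD of the differences.
        "loa": (bias - 1.96 * sd_diff, bias + 1.96 * sd_diff),
    }


def icc_2_1(a, b):
    """ICC(2,1) for two methods measuring the same n curves."""
    n, k = len(a), 2
    rows = list(zip(a, b))
    grand = (sum(a) + sum(b)) / (n * k)
    # Mean squares for curves (rows), methods (columns), and residual error.
    ms_rows = k * sum((statistics.mean(r) - grand) ** 2 for r in rows) / (n - 1)
    ms_cols = n * sum((statistics.mean(c) - grand) ** 2 for c in (a, b)) / (k - 1)
    ss_total = sum((x - grand) ** 2 for r in rows for x in r)
    ss_err = ss_total - ms_rows * (n - 1) - ms_cols * (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)


def koo_label(icc):
    """Qualitative ICC category using the thresholds quoted in the text."""
    if icc < 0.5:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc < 0.90:
        return "good"
    return "excellent"


# Example with made-up angles (degrees): method b reads 2 degrees higher.
method_a = [10, 20, 30]
method_b = [12, 22, 32]
summary = agreement_summary(method_a, method_b)
icc = icc_2_1(method_a, method_b)
```

Because ICC(2,1) measures absolute agreement, the constant 2° offset in this toy example lowers the ICC slightly below 1 even though the two methods are perfectly correlated.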

To identify potential systematic biases in the measurement methods, results were analyzed by curve severity and curve region. Curve severity was separated into two groups: mild (< 25°) and moderate (≥ 25°). This threshold was chosen because it roughly coincides with the point at which clinicians consider bracing as a treatment option over observation [1]. ICC2,1 values were not reported for the curve severity groups because restricting the population variance of the coronal curve angle attenuates the ICC. The curve region was separated into four groups: upper thoracic (UT), main thoracic (MT), thoracolumbar (TL), and lumbar (L). These regions were defined according to the location of the curve's apex: T2–T6 for UT, T7–T11 for MT, T12–L1 for TL, and L2–L4 for L [19]. One-way analysis of variance (ANOVA) was conducted to evaluate whether the MAEs differed significantly between groups within the same category. Statistical analysis was conducted using the pandas [20] and SciPy [21] Python libraries with an alpha level of 0.05.
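The grouping described above can be illustrated with a short standard-library sketch. The region boundaries and the 25° threshold are taken from the text; everything else is assumed for illustration, including the record layout, the function names, and the choice to classify severity from the reference measurement.

```python
from collections import defaultdict
from statistics import mean

# Apex levels for each curve region, as listed in the text.
REGIONS = {
    "UT": ["T2", "T3", "T4", "T5", "T6"],
    "MT": ["T7", "T8", "T9", "T10", "T11"],
    "TL": ["T12", "L1"],
    "L": ["L2", "L3", "L4"],
}
APEX_TO_REGION = {level: region for region, levels in REGIONS.items() for level in levels}


def severity(angle_deg):
    """Mild (< 25 degrees) or moderate (>= 25 degrees)."""
    return "mild" if angle_deg < 25.0 else "moderate"


def mae_by_group(records, key):
    """Mean absolute error per group; `key` maps a record to its group."""
    groups = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(abs(rec["auto"] - rec["reference"]))
    return {group: mean(errors) for group, errors in groups.items()}


# Example with made-up curves (angles in degrees).
curves = [
    {"apex": "T8", "auto": 20, "reference": 22},
    {"apex": "T9", "auto": 30, "reference": 26},
    {"apex": "L2", "auto": 18, "reference": 19},
]
mae_region = mae_by_group(curves, lambda r: APEX_TO_REGION[r["apex"]])
mae_severity = mae_by_group(curves, lambda r: severity(r["reference"]))
```

The per-group lists of absolute errors built inside `mae_by_group` are the kind of samples that could then be passed to a one-way ANOVA routine such as `scipy.stats.f_oneway`.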

Results

Accuracy and reliability

The average age of the subjects in the 74-ultrasonograph measurement validation set (64 F, 10 M) was 14.4 ± 2.4 years. For the M-US measurements, 144 curves were measured with an average coronal curve angle of 21.5 ± 8.1° (range 9–45°). A total of 125 curves were automatically detected (A-US), and among those, 75% (94/125) were within clinical acceptance. The curves that were not automatically detected were primarily mild curves (18/19). The A-US measurement method demonstrated good reliability, with an ICC2,1 of 0.81 and an SEM of 1.4° when compared with M-US measurements (Table 1). A statistically significant difference was found between the mean errors of the curve severity categories (p = 0.003). Mild curves were measured significantly more accurately than moderate curves, by 1.8° on average. Bland–Altman analysis revealed a tendency of the A-US measurements to overestimate the M-US measurements with a non-significant bias of 0.5° (95% confidence interval [−0.4°, 1.4°]) (Fig. 2).

Table 1 A-US versus M-US coronal curve angle paired measurement comparison results
Fig. 2

Bland–Altman plot of A-US versus M-US Cobb angle measurements, color-coded by upper thoracic (UT), main thoracic (MT), thoracolumbar (TL), lumbar (L) region, with bias (black line) and limits of agreement defined as the bias plus or minus 1.96 times the standard deviation (SD) (red lines)

A-US versus M-Xray comparison

For the A-US versus M-Xray comparison, 129 curves were manually measured on the radiographs with an average Cobb angle of 23.8 ± 8.8° (range: 9–50°). Of the 129 curves measured on the radiographs, 113 were also measured automatically; 59% of these were within clinical acceptance, and reliability was moderate, with an ICC2,1 of 0.74 and an SEM of 1.9° (Table 2). Mild curves were measured more accurately than moderate ones by 2.0° on average, with statistical significance (p = 0.004). Bland–Altman analysis revealed a tendency of the A-US measurements to underestimate the M-Xray measurements with a non-significant bias of −0.9° (95% confidence interval [−2.1°, 0.3°]) (Fig. 3).

Table 2 A-US versus M-Xray Cobb angle paired measurement comparison results
Fig. 3

Bland–Altman plot of A-US versus M-Xray Cobb angle measurements, color-coded by upper thoracic (UT), main thoracic (MT), thoracolumbar (TL), lumbar (L) region, with bias (black line) and limits of agreement defined as the bias plus or minus 1.96 times the standard deviation (SD) (red lines)

M-US versus M-Xray comparison

For the M-US versus M-Xray comparison, 120 of the 129 curves measured on the radiographs were also manually measured on the ultrasonographs. Of these M-US measurements, 85% were within clinical acceptance, and reliability was good, with an ICC2,1 of 0.87 and an SEM of 1.1° (Table 3). Mild curves were measured more accurately than moderate ones by 2.1° on average, with statistical significance (p < 0.001). Bland–Altman analysis revealed that the M-US measurements underestimated the M-Xray measurements with a significant bias of −1.3° (95% confidence interval [−2.1°, −0.6°]) (Fig. 4).

Table 3 M-US versus M-Xray Cobb angle paired measurement comparison results
Fig. 4

Bland–Altman plot of M-US versus M-Xray Cobb angle measurements, color-coded by upper thoracic (UT), main thoracic (MT), thoracolumbar (TL), lumbar (L) region, with bias (black line) and limits of agreement defined as the bias plus or minus 1.96 times the standard deviation (SD) (red lines)

Measurement time and interpretability

The automated algorithm took 36 ± 7 s on average to measure the coronal curve angles per ultrasonograph. This improves upon the 3–5 min a rater takes to manually measure an ultrasonograph [6]. Examples of measurement and segmentation outputs are illustrated in Fig. 5, demonstrating the algorithm's interpretability.

Fig. 5

Measurement outputs from the coronal curve angle ultrasound automation algorithm visualized on the coronal projection images, with green dots indicating lamina pairs and magenta lines indicating lamina pairs used for measurements

Discussion

One identified trend is that moderate curves were measured less accurately than mild ones, both in the comparisons involving A-US measurements and in the M-US versus M-Xray comparison. The severity of the curvature often affects the image quality of the ultrasonograph. More severe curves typically exhibit greater vertebral rotation. This increases how often only one lamina appears on the scan for a given vertebra, which interferes with the derivation of the vertebral tilt angle. The ultrasound operator must also move the transducer more laterally with more severe curves. Adding this extra axis of movement during the scan increases the likelihood of human errors, such as improper contact between the transducer and skin, that can further reduce image quality. In these cases, a rater measuring manually relies on the neighboring lamina pairs and the spinous process column to label the centers of laminae. These factors were encoded into the automatic measurement algorithm, but encoding the whole procedure as fixed programmatic rules is naturally prone to errors. Therefore, the larger discrepancies between A-US and M-Xray measurements for moderate curves are likely attributable to this shortcoming.

The A-US measurements tended to perform better on curves in the main thoracic region when compared with both M-US and M-Xray measurements, with SEM values of 1.2° and 1.8°, respectively. This result is likely due to the laminae in this region typically being both brighter and more clearly isolated than the laminae in the upper thoracic or lumbar regions. The lumbar region is a more problematic area for both automatic and manual measurement. The thicker muscle and fat in this region result in more noise on the ultrasonograph. Consequently, the lumbar laminae are often not as isolated in terms of brightness. Additionally, because the lumbar vertebrae are larger than the thoracic vertebrae, each lamina typically produces two reflections. These two factors combined make accurate localization of the centers of the lumbar laminae difficult, resulting in less accurate coronal curve angle measurement.

There were 109 curves present in all of the A-US, M-US, and M-Xray measurements. Of the 44 Cobb angles that were measured inaccurately by the automatic algorithm within these 109 paired measurements, only 23% (10/44) were also measured outside of clinical acceptance in the M-US versus M-Xray comparison. This fact, coupled with the difference in accuracy between the A-US and M-US measurements (59% vs. 85% clinical acceptance rate), suggests that the measurement errors were not attributable solely to the difference in imaging modalities. A possible reason for the greater discrepancy is that the aid of the previous radiograph method was employed for the M-US measurements but not directly employed for the A-US measurements. Zheng et al. found that using the aid of previous radiographs significantly reduced the MAE from 4.6° to 2.7° when compared with radiographic measurements [6]. Similarly, the MAE for this study’s A-US versus M-Xray comparison was 5.2°, which is closer to the MAE of 4.6° reported for the blinded (without the aid of the previous radiograph) ultrasound versus radiograph comparison. The ultrasonograph training set was labelled with the aid of the previous radiograph. However, this might not be enough by itself, and the previous radiograph may need to be considered directly to improve measurement accuracy.

Conclusion

An automatic measurement algorithm based on machine learning was developed for the coronal curve angle on 3D ultrasonographs. The developed method achieved good reliability when compared with manual ultrasound measurements, but only moderate reliability when compared with manual radiographic measurements. With measurements being made in 36 s on average and images being output displaying how the measurements are obtained, the algorithm can be integrated into clinics to at least automate coronal curve angle measurement steps. In turn, this could increase the accessibility of ultrasound AIS diagnosis. However, further improvements may be required to produce a fully automatic measurement method that is clinically feasible.