Introduction

Drawbacks of traditional COBB method

COBB angle measurement on coronal whole-spine standing X-rays was established as the gold standard for quantifying scoliosis curve severity by the Scoliosis Research Society (SRS) in 1966. The method suffers from high inter- and intraobserver errors, ranging from 3° to 10° (95% CI), which can mainly be attributed to the manual selection and delineation of end vertebrae [1]. In addition to this inaccuracy, the COBB method neglects important curve characteristics, e.g., apical vertebra translation.

Whole-spine X-ray images are currently acquired with widely varying radiation exposure (ranging from conventional X-ray to modern low-dose imaging systems) and with variable image quality. Treatment decisions are thus based on highly variable image acquisition and only moderate measurement reliability. However, it has been shown that the error originating from the imaging modality ranges from none to below 2° [2, 3].

Approaches to mitigate COBB method-associated inaccuracy

To level out observer-based errors, a certain degree of automation (e.g., for the selection of end vertebrae) has been proposed. However, the original COBB method has turned out to be poorly suited for automation. To overcome this problem, it would be necessary to focus on vertebra centroids, which, following the original idea of Ferguson, generate spinal curves that resemble the global scoliotic deformity more closely. In addition, centroid-based curves are less prone to measurement errors because they are smoothed by cubic interpolation [1]. So far, the first semiautomated computerized measurement projects have reported mixed results in terms of reliability. However, all of them showed a good correlation between the COBB angle and the investigated method [4,5,6].

With the advent of modern neural network (NN)-driven image processing, automatic segmentation, curvature estimation and even 3D reconstruction from 2D data have come within reach. Wu et al. [7] were able to reduce the standard deviation of COBB angle estimation to 1° (circular mean absolute error of 4°) by iterative training of a convolutional NN incorporating AP and lateral views from 154 patients. As their method requires a rather standardized approach with biplanar imaging, it might not be of use for the above-mentioned scenario with varying image quality and only AP imaging information. Nevertheless, we used Wu’s results as a reference, as they represent the most precise measurement modality known to us.

The negligible role of X-ray image quality in proper scoliosis curve estimation was discussed in depth by Bonanni. He proposed an angle function over a structural curve drawn on either side of the spine to quantify the scoliotic deformation from AP X-ray or MRI images. He thus dropped unnecessary radiographic details (like endplates) and focused instead on the extraction of global spinal information (e.g., curve shape) [8, 9]. The group showed that the standard deviation of measurements could be brought below 5°, even in very low-quality images (54 patients). However, the measurements relied on a fixed scheme with preselected landmarks and were carried out by trained scoliosis surgeons [9].

Automatic vertebrae detection

In contrast to CT and MRI images, the automatic detection of vertebrae in X-ray images is a very challenging computer vision task, due to multiple overlapping shadows of the ribs and pelvis, relatively weak contrast, and the need to identify all thoracic and lumbar vertebrae individually [10, 11]. Lecron, Xu and coworkers were the first to show the possibility of vertebrae detection in X-ray images using classical machine learning techniques [12, 13]. With the advance of convolutional NN (CNN) architectures, the opportunities have improved dramatically. De Carvalho, Al Arif and coworkers showed groundbreaking results with extraordinary accuracy in the prediction of vertebrae in X-ray images [2, 14]. However, these methods were limited to a small part of the spine (lumbar or cervical) and were trained on high-quality images.

Aiming at higher precision at no additional cost

The aim of this project was to investigate whether there could be a solution for quantitative radiographic scoliosis assessment that:

  • offers high precision measurement,

  • is easy to use in an everyday clinical setting,

  • does not depend on special hard- or software,

  • even works in difficult situations (e.g., low-quality X-rays)—thus reducing the need of repeated radiation exposure,

  • incorporates modern machine learning techniques (e.g., deep learning),

  • incurs no additional cost.

We did not aim to challenge alternative scoliosis imaging/screening modalities such as ultrasound or photogrammetry [15,16,17].

Material and methods

Image material and labeling

This study was approved by our institutional review board (#EK339092018). The standing coronal X-rays displayed the whole vertebral column, or at least the most scoliotic part of the thoracic or lumbar column, and originated from different individuals with any form of idiopathic scoliosis who were subjected to bracing and had no prior treatment. Image quality ranged from digitally acquired images (the majority) to scans of film or even paper. Images were saved under a random 8-digit identifier alias.

The images (551 in total) were split into batches of 30 images that were successively labeled with a specially developed labeling tool by a group of four analyzers. All analyzers received appropriate training but had different educational levels, ranging from medical student (M) and orthopedic residents in training (C and W) to experienced scoliosis surgeon (P).

Labeling included centroid identification of each vertebra, description of the start and end vertebrae, and drawing of the angle lines following COBB’s method (Fig. 1; see the supplemental material for the workflow). Precision of the labeling task was supported by a magnifying glass in a second window. The selected centroids were automatically used to generate a spline by cubic interpolation. Lines perpendicular to this spline were generated at each vertebral level. The most tilted lines (with respect to the horizontal) were used to calculate the corresponding COBB angle. Only values > 10° were processed further.
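
The spline-based angle computation described above can be sketched as follows. This is a minimal illustration and not the labeling tool itself: it assumes a natural cubic spline x(y) through the centroids (the boundary condition is not stated in the text), image coordinates ordered from cranial to caudal, and it treats local extrema of the tilt as end vertebrae. The function names spline_slopes and cobb_angles are hypothetical.

```python
import math

def spline_slopes(ys, xs):
    """Return dx/dy of a natural cubic spline x(y) at every knot.

    ys must be strictly increasing (centroids ordered from cranial to caudal).
    The natural boundary condition is an assumption of this sketch.
    """
    n = len(ys) - 1                                     # number of intervals
    h = [ys[i + 1] - ys[i] for i in range(n)]           # knot spacing
    d = [(xs[i + 1] - xs[i]) / h[i] for i in range(n)]  # secant slopes
    m = [0.0] * (n + 1)  # second derivatives, natural ends: m[0] = m[n] = 0
    if n >= 2:
        # Tridiagonal system for the interior second derivatives (Thomas algorithm).
        diag = [2.0 * (h[j] + h[j + 1]) for j in range(n - 1)]
        rhs = [6.0 * (d[j + 1] - d[j]) for j in range(n - 1)]
        for j in range(1, n - 1):
            w = h[j] / diag[j - 1]
            diag[j] -= w * h[j]
            rhs[j] -= w * rhs[j - 1]
        sol = [0.0] * (n - 1)
        sol[-1] = rhs[-1] / diag[-1]
        for j in range(n - 3, -1, -1):
            sol[j] = (rhs[j] - h[j + 1] * sol[j + 1]) / diag[j]
        m[1:n] = sol
    slopes = [d[i] - h[i] * (2.0 * m[i] + m[i + 1]) / 6.0 for i in range(n)]
    slopes.append(d[n - 1] + h[n - 1] * (2.0 * m[n] + m[n - 1]) / 6.0)
    return slopes

def cobb_angles(centroids, min_angle=10.0):
    """Return (upper_index, lower_index, angle in degrees) for each curve.

    centroids: list of (x, y) pixel coordinates, one per vertebra.
    A line perpendicular to the spline is tilted against the horizontal by
    atan(dx/dy), so the tilt at each level follows directly from the slope.
    """
    xs = [c[0] for c in centroids]
    ys = [c[1] for c in centroids]
    tilts = [math.degrees(math.atan(s)) for s in spline_slopes(ys, xs)]
    # Candidate end vertebrae: first level, last level and local tilt extrema.
    ends = [0]
    for i in range(1, len(tilts) - 1):
        if (tilts[i] - tilts[i - 1]) * (tilts[i + 1] - tilts[i]) < 0:
            ends.append(i)
    ends.append(len(tilts) - 1)
    curves = []
    for upper, lower in zip(ends, ends[1:]):
        angle = abs(tilts[upper] - tilts[lower])
        if angle > min_angle:                            # only values > 10° are kept
            curves.append((upper, lower, angle))
    return curves

# Synthetic C-shaped curve with 17 centroids (illustrative data only).
demo = [(100.0 + 30.0 * math.sin(math.pi * y / 400.0), float(y)) for y in range(0, 401, 25)]
print(cobb_angles(demo))
```

Working on slopes at the centroids rather than on drawn endplate lines is what makes the result independent of manual line placement.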

Fig. 1

S: Manually labeled vertebrae (red dots) were used to algorithmically construct a spline and perpendiculars on each level. Maximum tilted perpendiculars (= turning points of the curve) were defined as end vertebrae if the tilt exceeded 10°. M: COBB lines were drawn on the respective vertebral structures using a magnifying glass (not shown here) to facilitate precision (hollow blue circles, connected by blue line). Note that this image was actually photographed from a film-based spine radiograph with already existing lines drawn on the image

Measurements were allocated to three distinct spinal regions: “high thoracic” if the curve was located above Th5, “thoracic” if it was located above L2, and “lumbar” if it was located at L2 or below [18].

Neural net training

Manually labeled vertebrae of all 4 analyzers were used to train an NN via deep learning. Splines were generated from the vertebral centroids detected by the NN. The architecture of the NN is based on UNet, which is a standard approach in medical image segmentation [19]. It comprises an encoder/decoder structure with skip connections between the encoder and decoder in order to retain important information over several layers [20]. To obtain better results than the original UNet, we implemented masked loss functions that learn to penalize areas outside the vertebrae. The whole dataset consisted of 571 X-ray images. For training, we randomly split the data into sets of 457 and 114 images for training and validation, respectively. The X-ray images were resized to 200 × 500 pixels (width × height). We also normalized the labeled vertebra centroids as a percentage of the scaled image.
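
The data preparation steps named above (random split and centroid normalization) can be illustrated with a short sketch. It is not the training pipeline itself: the helper names, the fixed random seed and the use of fractions in the range 0 to 1 rather than percent values are assumptions made for illustration.

```python
import random

def normalize_centroids(centroids, width, height):
    """Express pixel coordinates (x, y) as fractions of the scaled image size."""
    return [(x / width, y / height) for x, y in centroids]

def split_dataset(image_ids, n_validation, seed=0):
    """Shuffle the identifiers and split them into training and validation sets."""
    shuffled = list(image_ids)
    random.Random(seed).shuffle(shuffled)   # seeded only to make the sketch reproducible
    return shuffled[n_validation:], shuffled[:n_validation]

# 571 images split into 457 for training and 114 for validation.
train_ids, val_ids = split_dataset(range(571), n_validation=114)
print(len(train_ids), len(val_ids))

# Centroids on an image resized to 200 x 500 pixels (width x height).
print(normalize_centroids([(100, 250), (50, 125)], width=200, height=500))
```

Normalized coordinates keep the regression targets independent of the original image resolution, which varies between digital images and scans.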

Statistics

For the statistical analysis, we used standard methods such as Pearson’s correlation coefficient and Student’s t test. In case of missing values, the respective row was deleted from the observer’s dataset.
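
A minimal sketch of these computations, using only the Python standard library, is given here. The text does not state whether the t test was paired, so the paired form on per-image differences is an assumption; the function names are hypothetical, and only the t statistic (not the P value) is computed.

```python
import math
import statistics

def drop_missing(a, b):
    """Delete every row in which either measurement is missing (None)."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    return [p[0] for p in pairs], [p[1] for p in pairs]

def pearson_r(a, b):
    """Pearson's correlation coefficient of two equally long series."""
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def paired_t(a, b):
    """t statistic of a paired t test (assumed variant), with len(a) - 1 degrees of freedom."""
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.fmean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))

# Illustrative angles in degrees; None marks a missing measurement.
spline = [22.0, 31.0, None, 40.0, 27.0]
manual = [20.0, 30.0, 25.0, 37.0, None]
s, m = drop_missing(spline, manual)
print(pearson_r(s, m), paired_t(s, m))
```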

Results and discussion

End vertebra selection

Selection of end vertebrae revealed the typical double-curve scoliotic pattern with peaks at Th5/6, Th11 and L4 for every investigator as well as for the NN (Fig. 2).

Fig. 2

End vertebra detection by manual COBB measurement and spline-based computation. All investigators (C–W) showed the same pattern, manual COBB and auto-spline were identical for the NN (N) as those values were retrieved algorithmically. Mean standard deviation between all observers was 0.45 vertebrae for spline-based and 0.56 vertebrae for manual COBB-based measurements

Interobserver differences were 0.56 vertebrae over all measurements for manual end vertebra selection. When the selection was left to the spline algorithm, it was only a little more uniform (0.45 vertebrae deviation). Interestingly, the deviation of the centroids was not zero, which shows that even the simple task of setting centroid points onto vertebrae yields observer-dependent results.
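
How such a mean deviation in units of vertebrae can be obtained is sketched below. The encoding of vertebrae as consecutive integers (e.g., Th1 = 1 to L5 = 17) and the use of the sample standard deviation are assumptions of this sketch, not details taken from the analysis.

```python
import statistics

def mean_interobserver_sd(selections):
    """Average, over all curves, the standard deviation of the end vertebra
    index chosen by the observers.

    selections: one list per curve, holding one vertebra index per observer.
    """
    return statistics.fmean(statistics.stdev(per_curve) for per_curve in selections)

# Two illustrative curves rated by four observers each.
print(mean_interobserver_sd([[5, 5, 5, 5], [11, 12, 11, 12]]))
```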

Another source of inconsistency is that observers were required to indicate the range of vertebrae visible on each X-ray (e.g., from T2 to L5). Vertebral identification is not always easy because of the variable formation of the 12th rib, which might explain the observed variability. This seems to confirm Vrtovec’s view that end vertebra selection is one main source of error in COBB angle determination [1].

We therefore investigated more closely whether fuzzy end vertebra determination has an impact on measurement variance for the spline and NN approaches, which estimate the global curve characteristics.

Interobserver correlations

We found that spline-based COBB measurements showed higher interobserver correlations than manual COBB measurements (Fig. 3). This phenomenon was especially pronounced in the border regions (high thoracic and lumbar). Within the spline-based method, mean interobserver correlation coefficients ranged from 0.92 (high thoracic) to 0.95 (main thoracic and lumbar), whereas traditional COBB-based measurements yielded coefficients ranging from 0.83 (high thoracic) to 0.89 (lumbar) and 0.92 (main thoracic). De Carvalho et al. achieved an interobserver correlation of 0.88–0.9, using high-quality images with predefined end vertebrae [2]. However, neither they nor any other author reported whether they observed the same region dependence. We hypothesize that a certain amount of variability, especially in the case of manual COBB measurement, might result from kyphosis-dependent obfuscation of vertebral structures. This uncertainty translates directly into poor correlation. Consequently, interclass (spline versus COBB) correlation was rather poor, ranging from 0.84 (high thoracic) to 0.92 (main thoracic).

Fig. 3

Correlation heatmap showing Pearson correlation coefficients between different methods (A = automatic spline, M = Manual COBB) at different spinal regions (0 = high thoracic, 1 = thoracic, 2 = lumbar) for different observers (C, M, P, W = human investigators, N = NN). A higher correlation could be observed for interobserver automatic spline computations (upper left diagonal) compared to interobserver manual COBB measurements (lower right diagonal) or between automated spline and manual COBB measurements (upper right diagonal)

Interestingly, correlations between the NN and any observer (manual or spline) were always above 0.9, which further illustrates the NN’s potential to mitigate interobserver variance, even in difficult anatomic regions. Further analysis will reveal how correlation can be translated into angle measurement error.

Distribution of differences between investigators

Our dataset was derived from X-ray images of patients subjected to brace treatment, whose curves typically range from 20° to 40°. Especially in this range, the detection of minor angle changes is most important to optimize conservative treatment results. Manual COBB angle measurement is still considered the gold standard for scoliosis curve estimation, despite a known inaccuracy of 3°–10° [1]. This high deviation can obviously lead to erroneous decisions, including surgical ones.

As expected, we had to confirm the inaccurate nature of manual COBB angle measurement. We saw region-dependent interobserver medians spanning 8° (high thoracic), 4° (thoracic) and 3° (lumbar) of deviation; see Fig. 4. Using standardized measurement setups, other authors were able to decrease the average angle difference to 2°–3° [2, 21].

Fig. 4

Box plots depicting the median COBB angles of the complete dataset at three distinct anatomic regions. Blue boxes show spline-based COBB angle computations which have similar values for every investigator, whereas orange boxes depict manually measured COBB angles that display greater variability between investigators

To determine whether there was a training effect in measurement precision, a rolling mean evaluation of the above-mentioned deviations with a window size of 16 datasets was performed (Fig. 5). A trend toward deviation reduction over the dataset could be observed for only one observer (W). All investigators retained an error range, summing up to 4°–8° for all spinal regions. We therefore conclude that manual COBB measurement cannot be improved by training. This disadvantage is inherent to the method.
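
The rolling evaluation can be sketched in a few lines. The trailing (non-centered) window and the function name are assumptions; the input is the per-dataset pair of spline-based and manual angles in the order in which the datasets were evaluated.

```python
def rolling_mean_abs_diff(spline_angles, manual_angles, window=16):
    """Rolling mean of |spline - manual| over consecutive datasets.

    Returns one value per full window, so the result is shorter than the
    input by window - 1 entries.
    """
    diffs = [abs(s - m) for s, m in zip(spline_angles, manual_angles)]
    return [sum(diffs[i:i + window]) / window
            for i in range(len(diffs) - window + 1)]

# Small illustrative example with a window of 2 datasets.
print(rolling_mean_abs_diff([10, 10, 10, 10], [12, 8, 10, 14], window=2))
```

A training effect would show up as a falling trend in the returned series, whereas a flat series indicates an error that persists with experience.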

Fig. 5

Rolling window calculation of mean absolute measurement difference between spline-based and manual COBB method, summed for the three spinal regions and plotted against the time course of dataset evaluations. Learning patterns (red) as well as unsteady precision in the beginning (green) or relative stability (blue and orange) can be observed. However, there remains a constant deviation of about 4°–8° that cannot be trained away

Spline-based curve size estimation showed an overall constant median angle, deviating not more than 1° between observers in any spinal region (Fig. 4).

Stability of measurements

To demonstrate the stability of measurements over the angle range of the dataset, Bland–Altman plots were generated (Fig. 6a and additional Table 1). The less experienced analyzers (C, W, M) tended to significantly underestimate the curve when measuring manually (positive values), whereas the experienced analyzer (P) overestimated only the high thoracic curve (negative value) and stayed close to the spline values in the remaining spinal regions, despite some considerably high outliers (blue dots). However, no systematic bias for lower or higher angles could be found, demonstrating the robustness of the spline method.
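
The quantities behind a Bland–Altman plot can be computed as in the following sketch. The sign convention (spline minus manual, so that positive values mean the spline algorithm yields the larger angle) follows the caption of Fig. 6; the function name and the example values are illustrative assumptions.

```python
import statistics

def bland_altman(spline_angles, manual_angles):
    """Return per-pair means, per-pair differences, the mean difference (bias)
    and the limits of agreement (bias ± 1.96 standard deviations)."""
    means = [(s + m) / 2.0 for s, m in zip(spline_angles, manual_angles)]
    diffs = [s - m for s, m in zip(spline_angles, manual_angles)]
    bias = statistics.fmean(diffs)
    sd = statistics.stdev(diffs)
    return means, diffs, bias, (bias - 1.96 * sd, bias + 1.96 * sd)

# Illustrative angles in degrees.
means, diffs, bias, limits = bland_altman([22.0, 31.0, 40.0], [20.0, 30.0, 37.0])
print(bias, limits)
```

Plotting diffs against means then shows whether the disagreement depends on curve magnitude; a flat cloud around the bias line indicates no such dependence.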

Fig. 6

Bland–Altman plot of spline-based measurements and manual COBB measurements (a) and NN derived curve estimations and all human observations (b). Means are plotted against their differences. Positive deviation = spline algorithm (a) or NN (b) overestimates the curve magnitude. Dashed lines = 1.96-fold standard deviation, dash-dotted line = mean deviation

Table 1  Mean deviations of spline-based curve computation and manual COBB measurement for each investigator and respective P value. Positive values = spline algorithm overestimates curve magnitude, negative values = manual COBB measurement overestimates curve magnitude

The above-mentioned data demonstrate that the NN achieves high correlations with all observers. The stability of measurements is further displayed in the Bland–Altman plot in Fig. 6b. There, the NN’s measurements have been plotted against all human-obtained values (manual measurements and spline measurements). It can be seen that the confidence interval for any region is narrower, comprising only one half to two-thirds of the confidence intervals of spline versus manual COBB measurements. The reasons for this stability are explained in the next section.

The strength of NN-based measurements

Our NN achieves a sensitivity of 0.87 and a relative pixel error of 0.011 for detecting the correct vertebrae in an AP X-ray of the spine (Fig. 7, Table 3). This is more than three times as precise as the NN of Wu et al., who reported an error of 0.0398 [7]. It translates to an angle deviation of < 1° for any region, with high correlations in the main thoracic and lumbar spine (Table 2), which is also better than Wu’s error of 1°. Again, we need to point out that Wu used standardized biplanar imaging, whereas we used only AP images of heterogeneous quality (Table 3).
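
For orientation, detection metrics of this kind can be computed as in the sketch below. The matching rule is an assumption, since the text does not define it: each labeled centroid is greedily paired with the nearest unused predicted centroid and counts as detected if the distance in normalized image coordinates stays within a hypothetical tolerance. The relative pixel error is taken here as the mean distance of the matched pairs.

```python
import math

def detection_metrics(predicted, reference, tolerance=0.05):
    """Return (sensitivity, PPV, mean relative error).

    predicted, reference: lists of normalized (x, y) centroids.
    tolerance: hypothetical matching radius in normalized units.
    """
    unmatched = list(predicted)
    errors = []
    for ref in reference:
        if not unmatched:
            break
        nearest = min(unmatched, key=lambda p: math.dist(p, ref))
        dist = math.dist(nearest, ref)
        if dist <= tolerance:
            errors.append(dist)
            unmatched.remove(nearest)
    tp = len(errors)                 # labeled vertebrae that were found
    fn = len(reference) - tp         # labeled vertebrae that were missed
    fp = len(predicted) - tp         # predictions without a labeled vertebra
    sensitivity = tp / (tp + fn) if reference else 0.0
    ppv = tp / (tp + fp) if predicted else 0.0
    mean_error = sum(errors) / tp if tp else float("nan")
    return sensitivity, ppv, mean_error

# Illustrative example: four labeled vertebrae, three predictions.
labels = [(0.5, 0.1), (0.5, 0.2), (0.5, 0.3), (0.5, 0.4)]
preds = [(0.5, 0.11), (0.5, 0.2), (0.9, 0.9)]
print(detection_metrics(preds, labels))
```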

Fig. 7

A: Sensitivity and positive predictive values (PPV) of the estimated vertebrae locations from the NN. B: relative pixel error of the estimated vertebrae locations from the NN

Table 2  Mean deviation of NN versus gathered human COBB angle measurements. Positive values = NN overestimates COBB angle. Correlation = Pearson correlation of NN measurements with all human manual and spline-based values
Table 3 Sensitivity and positive predictive values (PPV) of the estimated vertebrae locations from the NN

Conclusion

In conclusion, we were able to show that, even in the case of poor image quality, the reliability of COBB angle measurement can be substantially raised by

  • investigating the spinal curve in a more global fashion by using a spline generated from vertebral centroids

  • automating vertebra detection with a neural network that was trained on the labels of several observers

Measurement error can be brought down below 0.5°, which is less than one-tenth of traditional COBB measurement error.

Limitations of our current approach include the uncertainty in the high thoracic spine, which will be subject to optimization in our further investigations. We will also need to include various other types of spinal X-ray imaging to enhance vertebra recognition even further. In addition, the angle range as well as the scoliosis spectrum (e.g., malformations) has to be explored further, as severely rotated curves might exhibit different geometric properties.

Based on the data of this work and taking into account the criticism voiced over the years, manual COBB measurement can no longer be recommended for scoliosis curve estimation. It is flawed by considerable interobserver deviations that can be eliminated completely by machine-based diagnosis, even if the image material is of poor quality.

The implementation of this knowledge into clinical practice is now mandatory. However, there are still further steps to accomplish. The current version of NN-based spline generation worked well with a limited set of images. Future model training has to include a broader image database, including, e.g., operated spines, early-onset and adult scoliosis, to mention just a few. We plan to provide open access to the model for the scientific and medical community.