Abstract
Objectives
To develop a deep learning methodology that distinguishes early from late stages of avascular necrosis of the hip (AVN) to determine treatment decisions.
Methods
Three convolutional neural networks (CNNs), VGG-16, Inception-ResNetV2, and InceptionV3, were trained with transfer learning (ImageNet) and finetuned on a retrospectively collected cohort of 104 MRI examinations of AVN patients to differentiate between early (ARCO 1–2) and late (ARCO 3–4) stages. A consensus CNN ensemble decision was recorded as the agreement of at least two CNNs. CNN and ensemble performance was benchmarked on an independent cohort of 49 patients from another country and was compared to the performance of two MSK radiologists. CNN performance was expressed as the area under the curve (AUC) with the respective 95% confidence intervals (CIs), and as precision, recall, and f1-scores. AUCs were compared with DeLong’s test.
Results
On internal testing, Inception-ResNetV2 achieved the highest individual performance with an AUC of 99.7% (95%CI 99–100%), followed by InceptionV3 and VGG-16 with AUCs of 99.3% (95%CI 98.4–100%) and 97.3% (95%CI 95.5–99.2%) respectively. The CNN ensemble achieved the same AUC as Inception-ResNetV2. On external validation, model performance dropped, with VGG-16 achieving the highest individual AUC of 78.9% (95%CI 51.6–79.6%). The best external performance was achieved by the model ensemble with an AUC of 85.5% (95%CI 72.2–93.9%). No significant difference was found between the CNN ensemble and expert MSK radiologists (p = 0.22 and 0.092 respectively).
Conclusion
An externally validated CNN ensemble accurately distinguishes between the early and late stages of AVN and has comparable performance to expert MSK radiologists.
Clinical relevance statement
This paper introduces the use of deep learning for the differentiation between early and late avascular necrosis of the hip, assisting in a complex clinical decision that can determine the choice between conservative and surgical treatment.
Key Points
• A convolutional neural network ensemble achieved excellent performance in distinguishing between early and late avascular necrosis.
• The performance of the deep learning method was similar to the performance of expert readers.
Introduction
Avascular necrosis (AVN) of the hip affects approximately 20,000 patients every year in the USA and is the most common cause of total hip arthroplasty (THA) at young ages, with bilateral involvement in > 70% of patients [1]. When AVN is left untreated, it progresses to joint collapse and secondary osteoarthritis, with THA being the only treatment option. However, in the early stages of the disease, prior to articular collapse, joint preservation techniques (core decompression, vascularized grafting, etc.) are available with the potential to avoid THA [2]. Therefore, differentiation between early and late AVN is of utmost importance for appropriate treatment selection.
Several AVN staging systems exist, with the system of the Association Research Circulation Osseous (ARCO) [3] being the most commonly used. ARCO staging is recommended in the latest (2019) international guidelines on the management of AVN [4]. The latest version of ARCO defines four main stages, with joint preservation techniques being available for the first two stages (ARCO < 3), whereas hip replacement is the recommended treatment for terminal disease (ARCO 3–4). Nonetheless, distinguishing between ARCO 2 (early) and ARCO 3A (late) is an extremely challenging task, requiring significant expertise in musculoskeletal radiology and a combination of imaging findings including indications of loss of femoral head sphericity and the presence of a subchondral fracture [4,5,6].
Artificial intelligence has been previously used for the diagnosis of AVN on plain radiographs [7] and MRI [8], to identify factors increasing the risk for collapse [9], and to differentiate late AVN from other causes of proximal femoral bone marrow edema such as transient osteoporosis [10, 11]. Attempts have also recently been made to quantify the necrotic volume and surface area and to associate them with the stage of AVN [12]. However, quantification of the necrotic volume is not part of any clinically relevant classification system, and volume cut-offs that would define optimal treatment have not been established.
The aim of this study was to develop a deep learning methodology to differentiate between the early (ARCO 1 and 2) and late (ARCO 3 and 4) stages of AVN. For this purpose, convolutional neural networks (CNNs) have been trained with a transfer learning methodology and finetuned with the use of a cohort of patients with AVN. The algorithm was internally tested and then subjected to external validation on an international cohort of AVN patients and its performance was benchmarked against the performance of musculoskeletal radiologists. The development of such a model would be invaluable in assisting clinical decisions between joint preservation surgery and total hip arthroplasty.
Materials and methods
Patients
A multi-institutional dataset (UHH cohort) of 104 consecutive hips (67 patients) with AVN was retrospectively compiled. The cohort contained 36 early and 68 late cases of AVN. A combination of transfer learning and data augmentation was used to address the small size of the dataset, as proposed by Candemir et al [13] (described below). All patients were evaluated in our specialized musculoskeletal imaging clinic and the second opinion bone marrow imaging clinic receiving domestic and international referrals in complicated hip cases. Cases were prospectively collected in an AVN registry of our MSK clinics, and imaging data were then retrospectively retrieved based on this registry. This cohort has been previously used to develop deep learning and radiomics methodology for the differentiation between avascular necrosis and transient osteoporosis of the hip [10, 11]. Exclusion criteria included hips with tumours, prior trauma, infection, inflammatory arthropathies, or surgery at the hip of interest, as well as insufficient image quality. Hips with extensive red marrow infiltration were also excluded to avoid confounding effects imitating bone marrow edema of the proximal hip. This cohort was used for the training and internal testing of CNNs.
External validation of the developed deep learning methodology was performed using an independent anonymized cohort from a center located in another country (TUM cohort, n = 49 hips) which was retrospectively selected based on the same criteria (Fig. 1). The study has been performed according to the Helsinki Declaration and has been approved by the ethical committee of the University Hospital of Heraklion (Ref. No. 360/08/29-04-2020). Informed consent has been waived for the retrospective use of anonymized data. This manuscript was prepared according to the Checklist for Artificial Intelligence in Medical Imaging (CLAIM) [14] and the Standards for Reporting of Diagnostic accuracy studies (STARD) checklist [15].
MR imaging and ground truth ARCO staging
Ground truth staging of AVN was established based on imaging, according to the ARCO classification [3]. This is currently the gold standard practice for clinical diagnosis and staging since the diagnosis of AVN does not warrant biopsy. Ground truth staging was performed independently by two MSK radiologists (40 and 10 years of experience, respectively) and in cases of disagreement, final stage was defined by consensus. To ensure accurate ground truth grading the experts had access to the whole MRI protocol including T1-w, STIR or PD/T2 fs, and high-resolution 3D gradient echo images. AVN was diagnosed by the presence of the “band-like” sign on T1-w images [16]. Subsequently, two groups of hips were defined based on T1-w, STIR, and high-resolution 3D gradient echo images: (i) cases with a subchondral fracture, cases with loss of head sphericity and/or associated bone marrow edema, and cases with signs of secondary osteoarthritis were classified as “late AVN” (ARCO 3–4) and (ii) cases without the aforementioned findings were classified as “early AVN” (ARCO 1-2) (Fig. 2) [1, 17, 18].
Data pre-processing and augmentation
Mid-coronal STIR images through each femoral head were used for model training, testing, and validation. Images were resized to 150 × 150 pixels and then randomly split 70:30 into training and testing sets. Data harmonization and bias correction were performed by matching image histograms to account for intra-scanner variability and achieve gray-level normalization. To eliminate group imbalance bias and to expose the model to additional training/testing data, images were augmented using rotation of 10° (clockwise and anti-clockwise) as well as horizontal image flipping. The final training and testing datasets consisted of 350 training and 150 testing images for each of the two groups (early vs late AVN) (Fig. 1).
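As a minimal illustration of the augmentation step, the following Python sketch produces rotated and flipped copies of a single image. It is not the code used in the study: the function names are hypothetical, nearest-neighbour sampling and zero padding of uncovered pixels are assumptions, and whether the original image was retained alongside its augmented copies is not stated in the text.

```python
import math

def hflip(img):
    """Mirror an image (list of pixel rows) left to right."""
    return [row[::-1] for row in img]

def rotate(img, degrees):
    """Rotate an image about its centre by the given angle.

    Uses inverse mapping with nearest-neighbour sampling (an assumption);
    output pixels whose source falls outside the image are set to 0.
    """
    h, w = len(img), len(img[0])
    cy, cx = (h - 1) / 2, (w - 1) / 2
    theta = math.radians(degrees)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = x - cx, y - cy
            # Locate the source pixel that lands on (x, y) after rotation.
            sx = round(cos_t * dx + sin_t * dy + cx)
            sy = round(-sin_t * dx + cos_t * dy + cy)
            if 0 <= sx < w and 0 <= sy < h:
                out[y][x] = img[sy][sx]
    return out

def augment(img, angle=10):
    """Return the original image plus +angle, -angle and flipped copies."""
    return [img, rotate(img, angle), rotate(img, -angle), hflip(img)]
```

Applied to a 150 × 150 image, `augment` returns four images of the same size; how many augmented copies per hip were needed to reach the balanced group sizes reported above is not specified here.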
Convolutional neural network development and external validation
A CNN ensemble was used as previously described [19]. Briefly, transfer learning was applied by initializing three individual CNN architectures, VGG-16, InceptionV3, and Inception-ResNetV2, with weights obtained by training on the ImageNet dataset, followed by weight freezing and finetuning of the final trainable layer with our training dataset [20]. Network performance was subsequently evaluated with the UHH testing dataset. A consensus ensemble decision of the three CNNs was recorded as the agreement of at least two out of three CNNs. To further benchmark the performance of the CNN ensemble, the resulting model was externally validated on a set of 49 hips from a radiology department of another country (TUM dataset). Images were resized and used without any further pre-processing. Ground truth for the TUM dataset was established with the same method as for the UHH dataset. External validation images were also assessed by two experienced MSK radiologists (10 and 7 years of MSK experience) blinded to the results of the ensemble, and their performance was compared to the performance of the ensemble. CNNs were trained for 100 epochs with early stopping at 10 rounds to avoid overfitting. Deep learning was performed with Python v.3.8, the Keras framework, and the TensorFlow backend on a Windows 10 Pro workstation with 32 GB RAM, an Intel i7-10700F 2.9 GHz CPU, and an NVIDIA GeForce RTX 2060 Super 8 GB GPU.
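The consensus rule described above (agreement of at least two out of three CNNs) can be sketched in a few lines of Python. This is an illustrative sketch rather than the study code: the function name, the label coding (1 = late AVN, 0 = early AVN), and the use of a 0.5 cut-off on each network's predicted probability are assumptions.

```python
def ensemble_vote(probabilities, threshold=0.5):
    """Majority vote over three per-network probabilities of late AVN.

    Each probability is first converted to a binary vote (assumed coding:
    1 = late, 0 = early); the ensemble returns 1 when at least two of the
    three networks vote late, and 0 otherwise.
    """
    if len(probabilities) != 3:
        raise ValueError("expected one probability per CNN (three in total)")
    votes = [1 if p >= threshold else 0 for p in probabilities]
    return 1 if sum(votes) >= 2 else 0


# Hypothetical outputs of VGG-16, InceptionV3 and Inception-ResNetV2 for one hip
decision = ensemble_vote([0.91, 0.42, 0.77])
```

With three binary votes a tie is impossible, so the rule always yields a decision.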
Statistical analysis
CNN performance was evaluated using precision, recall, and f1-scores for each individual CNN and the ensemble. CNN and MSK expert performance was also assessed with receiver operating characteristic (ROC) curves and the respective area under the curve (AUC). The 95% confidence intervals for the AUC were calculated with bootstrapping using the pROC package [21] as implemented in R (v.4.03, https://www.R-project.org/). A single threshold value of 0.5 was used for the ROC curves, given that the groups were balanced after augmentation. Comparisons between the AUCs of the models and experts were performed with DeLong’s test [22].
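To make the AUC and its bootstrapped confidence interval concrete, the following Python sketch computes both on a toy example. It stands in for, and is not equivalent to, the pROC routines used in the study: the function names are hypothetical, and the choice of a stratified percentile bootstrap with 2000 resamples is an assumption.

```python
import random

def auc(labels, scores):
    """AUC as the probability that a positive case scores higher than a
    negative one, counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(labels, scores, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the AUC.

    Positives and negatives are resampled separately (stratified), so every
    resample keeps the original group sizes.
    """
    rng = random.Random(seed)
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    stats = []
    for _ in range(n_boot):
        bp = [rng.choice(pos) for _ in pos]
        bn = [rng.choice(neg) for _ in neg]
        stats.append(auc([1] * len(bp) + [0] * len(bn), bp + bn))
    stats.sort()
    tail = (1 - level) / 2
    lower = stats[round(tail * n_boot)]
    upper = stats[round((1 - tail) * n_boot) - 1]
    return lower, upper


# Toy data (not from the study): 1 = late AVN, 0 = early AVN
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
point_estimate = auc(y_true, y_score)
ci_low, ci_high = bootstrap_auc_ci(y_true, y_score)
```

On such a small toy sample the interval is very wide; the sketch only shows the mechanics of the calculation.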
Results
Individual and ensemble CNN performance
The average age of patients in the training/testing (UHH) cohort was 43.7 ± 14.7 years including 48 female and 56 male patients with 56 right and 48 left hips. Each CNN architecture was initially subjected to internal testing with the UHH cohort where Inception-ResNetV2 achieved the highest individual performance with an AUC of 99.7% (95%CI 99–100%), followed by InceptionV3 and VGG-16 with AUCs of 99.3% (95%CI 98.4–100%) and 97.3% (95%CI 95.5–99.2%) respectively. VGG-16 had the highest number of misclassified cases with three early cases misclassified as late and five late cases misclassified as early. The model ensemble achieved an AUC similar to Inception-ResNetV2 with only one early case misclassified as late (Fig. 3 and Table 1). Training and validation accuracy/loss plots were observed to ensure that overfitting was avoided (Supplementary Fig. 1).
The performance of CNNs dropped when benchmarked with the TUM external validation cohort. VGG-16 achieved the highest individual AUC of 78.9% (95%CI 51.6–79.6%) followed by InceptionV3 and Inception-ResNetV2 with AUCs of 74.8% (95%CI 58.1–84.7%) and 76.59% (95%CI 58.1–84.7%) respectively. Despite the performance drop, VGG-16 exhibited excellent precision for the diagnosis of late AVN and recall for the diagnosis of early AVN without any early cases misclassified as late. The best performance was achieved by the model ensemble which achieved an excellent AUC of 85.5% (95%CI 72.2–93.9%) with only 3 late cases misclassified as early and 4 early cases misclassified as late. The performance of the CNN ensemble was significantly higher than all individual CNNs (p value 0.014, 0.01, and 0.028 for the comparison of the ensemble to VGG-16, Inception-ResNetV2, and InceptionV3 respectively) (Fig. 4 and Table 1).
Comparison between the CNN ensemble and human readers
The performance of the CNN ensemble was compared to the performance of expert readers on the TUM external validation cohort. The first MSK radiologist achieved an AUC of 75.7% (95%CI 62.7–87.9%), whereas the second achieved an AUC of 73.08% (95%CI 60.4–86.4%). No significant difference was found between the performance of each MSK radiologist and the CNN ensemble (p value 0.22 and 0.092 for the comparison of the CNN ensemble to the first and second MSK radiologist respectively). Both MSK radiologists achieved excellent recall values for the detection of late AVN (96.1% and 92.3% respectively) with only 1 and 2 late cases misclassified as early for the first and second MSK radiologists respectively (Fig. 5).
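The recall values above can be reproduced with simple arithmetic. The sketch below assumes that the TUM cohort contained 26 late and 23 early hips; this split is not stated in the text and is inferred here only because it is consistent with the reported recall values and misclassification counts. The helper names are hypothetical. The sketch also relies on the fact that, for a classifier evaluated at a single threshold, the area under the resulting two-segment ROC curve equals the mean of sensitivity and specificity.

```python
# Assumed split of the 49 TUM hips (not stated in the text)
N_LATE, N_EARLY = 26, 23

def recall(n_cases, n_missed):
    """Fraction of cases in a group that were classified correctly."""
    return (n_cases - n_missed) / n_cases

def single_threshold_auc(sensitivity, specificity):
    """Area under an ROC curve defined by a single operating point."""
    return (sensitivity + specificity) / 2

# Readers: 1 and 2 late cases misclassified as early
reader1_recall_late = recall(N_LATE, 1)   # 25/26, about 0.9615
reader2_recall_late = recall(N_LATE, 2)   # 24/26, about 0.9231

# Ensemble: 3 late cases called early, 4 early cases called late
ensemble_auc = single_threshold_auc(recall(N_LATE, 3), recall(N_EARLY, 4))
```

Under these assumptions the reader values correspond to the reported 96.1% and 92.3%, and the ensemble value of about 0.855 agrees with the reported external AUC of 85.5%.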
Discussion
Herein, a CNN ensemble was presented that achieved excellent performance in differentiating between early (ARCO 1–2) and late (ARCO 3–4) AVN of the hip. Three individual CNN architectures were trained and a consensus ensemble decision was derived. The excellent performance of the CNN ensemble was confirmed by external validation and was found to be comparable to the performance of experienced MSK radiologists.
The development of a deep learning methodology to differentiate early from late AVN can be of great value in everyday clinical practice. The difficulty of distinguishing between ARCO 2 and 3A has been highlighted in several publications [5, 6, 17] and several indirect findings have been proposed as indicators of late AVN including the presence of bone marrow edema [17], joint effusion, cystic changes, bone resorption [5], and a combination of T2 signal heterogeneity, articular surface irregularity and a necrotic-viable interface with a width >3 mm [6]. Nonetheless, this remains a challenging task, especially for inexperienced radiologists or in cases where high-resolution sequences, which have been shown to be suitable for the accurate evaluation of all associated findings of AVN [23], are not available. Given that AVN can be asymptomatic in the early stages [24], it can be incidentally identified in pelvic MRI examinations that are not tailored to the evaluation of the proximal femur. Our CNN ensemble achieved excellent performance in distinguishing between early and late AVN using only coronal STIR images. This presents a great advantage in the hands of inexperienced readers and in cases where high-resolution images through the femoral head are not available to allow comprehensive ARCO staging.
Interestingly enough, all individual CNNs presented an important drop in performance when validated on the external TUM cohort. Such a performance drop has been found in the majority of externally validated deep learning studies [25, 26]. External validation is of utmost importance in establishing the “real-world” performance of deep learning algorithms but, alas, it can be found in only 6% of AI manuscripts [27]. Despite the fact that the exact reasons responsible for performance drop during external validation are still largely unknown [25], the size of the training dataset or the number of participating institutions in the training dataset has been shown to have no effect on external performance [25]. Nonetheless, being able to achieve an ensemble AUC > 85% in a dataset acquired in another country provides strong evidence for the generalizability of our method.
MSK radiologists achieved AUCs in the range of 70–75% which reflects the difficulty in staging the disease, especially in the absence of high-resolution images focused on the femoral head which would allow visualization of subchondral fractures equally or better than CT [18]. MSK radiologists were presented with the same coronal STIR images as the ones used for the external validation of our deep learning method. Both MSK experts achieved a high recall (sensitivity) in detecting late AVN whereas the CNN ensemble achieved high precision and recall for both late and early disease. Achieving a similar performance to MSK experts highlights the clinical value of the proposed algorithm especially in the setting of general radiology practices where highly experienced MSK radiologists are not available and protocols are not focused on the evaluation of the hip.
Our work has certain strengths and limitations. The use of a multi-institutional training dataset, the validation on an external dataset, and the comparable performance to expert readers are important advantages of the proposed deep learning methodology. Limitations of the proposed work include the retrospective nature of the study and the limited training dataset. However, we have used transfer learning and data augmentation, which represent strategies suitable for deep learning with small datasets [13], alleviating this limitation as shown by the excellent performance in the internal and external cohorts. Training of the algorithms solely on coronal STIR images could be also considered as a limitation of our study. However, coronal fluid-sensitive sequences are part of most pelvic MRI protocols even when they are not focused on the hips. Such sequences can depict all the features required for ARCO staging including subchondral fractures, bone marrow edema, joint effusion, synovitis, and loss of head sphericity [1]. Therefore, being able to stage the disease based on a sequence present in most settings (even when AVN is an incidental finding), increases the clinical value of our method.
In conclusion, a CNN ensemble has been trained and validated that accurately distinguishes between the early and late stages of AVN. The ensemble performs well in external data from another country and has comparable performance to expert MSK radiologists. This deep learning methodology has the potential to assist the accurate staging of AVN without the need for expertise in MSK radiology ultimately leading to the correct treatment strategy.
Abbreviations
ARCO: Association Research Circulation Osseous
AVN: Avascular necrosis
CI: Confidence interval
CLAIM: Checklist for Artificial Intelligence in Medical Imaging
CNN: Convolutional neural network
MSK: Musculoskeletal
STARD: Standards for Reporting of Diagnostic accuracy studies
References
Karantanas AH, Drakonaki EE (2011) The role of MR imaging in avascular necrosis of the femoral head. Semin Musculoskelet Radiol 15:281–300
Petek D, Hannouche D, Suva D (2019) Osteonecrosis of the femoral head: pathophysiology and current concepts of treatment. EFORT Open Rev 4:85–97
Yoon B, Mont MA, Koo K et al (2020) The 2019 revised version of Association Research Circulation Osseous staging system of osteonecrosis of the femoral head. J Arthroplasty 35:933–940
Zhao D, Zhang F, Wang B et al (2020) Guidelines for clinical diagnosis and treatment of osteonecrosis of the femoral head in adults (2019 version). J Orthop Transl 21:100–110
Kim J, Lee SK, Kim J-Y, Kim J-H (2023) CT and MRI findings beyond the subchondral bone in osteonecrosis of the femoral head to distinguish between ARCO stages 2 and 3A. Eur Radiol. https://doi.org/10.1007/s00330-023-09403-8
Shi S, Luo P, Sun L et al (2022) Analysis of MR signs to distinguish between ARCO stages 2 and 3A in osteonecrosis of the femoral head. J Magn Reson Imaging 55:610–617
Li Y, Li Y, Tian H (2021) Deep learning-based end-to-end diagnosis system for avascular necrosis of femoral head. IEEE J Biomed Health Inform 25:2093–2102
Shen X, Luo J, Tang X et al (2022) Deep learning approach for diagnosing early osteonecrosis of the femoral head based on magnetic resonance imaging. J Arthroplasty. https://doi.org/10.1016/j.arth.2022.10.003
Hernigou P (2023) Revisiting prediction of collapse in hip osteonecrosis with artificial intelligence and machine learning: a new approach for quantifying and ranking the contribution and association of factors for collapse. Int Orthop 47:677–689
Klontzas ME, Manikis GC, Nikiforaki K et al (2021) Radiomics and machine learning can differentiate transient osteoporosis from avascular necrosis of the hip. Diagnostics 11:1686
Klontzas ME, Stathis I, Spanakis K et al (2022) Deep learning for the differential diagnosis between transient osteoporosis and avascular necrosis of the hip. Diagnostics 12(8):1870
Ruckli AC, Nanavati AK, Meier MK et al (2023) A deep learning method for quantification of femoral head necrosis based on routine hip MRI for improved surgical decision making. J Person Med 13(1):153
Candemir S, Nguyen XV, Folio LR, Prevedello LM (2021) Training strategies for radiology deep learning models in data-limited scenarios. Radiol Artif Intell 3(6):e210014
Mongan J, Moy L, Kahn CE (2020) Checklist for Artificial Intelligence in Medical Imaging (CLAIM): a guide for authors and reviewers. Radiol Artif Intell 2:e200029
Bossuyt PM, Reitsma JB, Bruns DE et al (2015) STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies. Radiology 277:826–832
Malizos KN, Karantanas AH, Varitimidis SE et al (2007) Osteonecrosis of the femoral head: etiology, imaging and treatment. Eur J Radiol 63:16–28
Meier R, Kraus TM, Schaeffeler C et al (2014) Bone marrow oedema on MR imaging indicates ARCO stage 3 disease in patients with AVN of the femoral head. Eur Radiol 24:2271–2278
Karantanas AH (2013) Accuracy and limitations of diagnostic methods for avascular necrosis of the hip. Expert Opin Med Diagn 7:179–187
Klontzas ME, Vassalou EE, Kakkos GA et al (2022) Differentiation between subchondral insufficiency fractures and advanced osteoarthritis of the knee using transfer learning and an ensemble of convolutional neural networks. Injury 53:2035–2040
Kim HE, Cosa-Linan A, Santhanam N et al (2022) Transfer learning for medical image classification: a literature review. BMC Med Imaging 22:1–13
Robin X, Turck N, Hainard A et al (2011) pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinform 12:77
DeLong ER, DeLong DM, Clarke-Pearson DL (1988) Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 44:837–845
Shuman WP, Castagno AA, Baron RL, Richardson ML (1988) MR imaging of avascular necrosis of the femoral head: value of small-field-of-view sagittal surface-coil images. AJR Am J Roentgenol 150:1073–8
Huang G-S, Chan WP, Chang Y-C et al (2003) MR imaging of bone marrow edema and joint effusion in patients with osteonecrosis of the femoral head: relationship to pain. AJR Am J Roentgenol 181:545–9
Yu AC, Mohajer B, Eng J (2022) External validation of deep learning algorithms for radiologic diagnosis: a systematic review. Radiol Artif Intell 4:e210064
Hsu W, Hippe DS, Nakhaei N et al (2022) External validation of an ensemble model for automated mammography interpretation by artificial intelligence. JAMA Netw Open 5:e2242343
Kim DW, Jang HY, Kim KW et al (2019) Design characteristics of studies reporting the performance of artificial intelligence algorithms for diagnostic analysis of medical images: results from recently published papers. Korean J Radiol 20:405–410
Acknowledgements
This research was partially funded by the Young Researchers Grant awarded to Michail Klontzas by the European Society of Musculoskeletal Radiology (ESSR).
Funding
Open access funding provided by HEAL-Link Greece. Young Researcher’s Grant awarded to Michail Klontzas by the European Society of Musculoskeletal Radiology (ESSR).
Ethics declarations
Guarantor
The scientific guarantor of this publication is Prof. Apostolos Karantanas.
Conflict of interest
The authors of this manuscript declare no relationships with any companies, whose products or services may be related to the subject matter of the article.
Statistics and biometry
One of the authors has significant statistical expertise.
Informed consent
Written informed consent to undergo the examination was obtained from all participating subjects. Informed consent was waived for the retrospective anonymized analysis of patient images.
Ethical approval
Institutional Review Board approval was obtained.
Study subjects or cohorts overlap
One of the two cohorts (UHH) used for the training of our deep learning models has been previously used to train another deep learning ensemble that distinguishes between avascular necrosis and transient osteoporosis. This work has no overlap with the present paper and has been published at https://doi.org/10.3390/diagnostics12081870. No staging of AVN had been attempted in the previous work and our current work does not attempt any diagnosis but only staging of the patients of the cohort. Therefore the work presented here is completely different without any overlap in aims or results.
Methodology
• retrospective
• cross-sectional study
• multicenter study
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Klontzas, M.E., Vassalou, E.E., Spanakis, K. et al. Deep learning enables the differentiation between early and late stages of hip avascular necrosis. Eur Radiol 34, 1179–1186 (2024). https://doi.org/10.1007/s00330-023-10104-5
DOI: https://doi.org/10.1007/s00330-023-10104-5