Introduction

Chronic kidney disease (CKD) describes a state of progressive structural and functional deterioration of the kidney, presenting as a reduced estimated glomerular filtration rate (eGFR). CKD can lead to end-stage kidney disease (ESKD) and is responsible for 9.1% and 4.6% of noncommunicable disease-related morbidity and mortality, respectively [1]. It is projected that CKD will become the fifth leading global cause of death by 2040 [2]. Consequently, timely diagnosis followed by early treatment initiation for those with CKD is crucial for optimizing their outcomes.

Interstitial fibrosis and tubular atrophy (IFTA) are tightly correlated with CKD severity and impact patients’ long-term prognosis. Moderate and severe IFTA, compared to mild IFTA, and global glomerulosclerosis are associated with more than a two- and three-fold increased risk of kidney function loss, respectively [3]. However, current methods for monitoring kidney fibrosis remain unsatisfactory. In clinical practice, eGFR is not always consistent with the degree of renal fibrosis. eGFR can be quite insensitive to subclinical kidney function impairment. Kidney biopsy is considered the gold standard for confirming CKD diagnosis and fibrosis grading [4,5,6]. However, kidney biopsy carries the risk of complications, and spatial sampling bias reduces the accuracy of pathological diagnosis; therefore, kidney biopsy has not been considered the preferred follow-up approach for patients with CKD [7, 8].

Morphological changes in the kidney cortex and volume mostly occur during ESKD. Under pathological examination, CKD is characterized by kidney fibrosis, the pathological deposition of excessive extracellular matrix associated with an increasing number of fibroblasts [9, 10]. These changes are associated with subsequent scarring and sclerosis of kidney tissues, leading to kidney morphological alterations [11]. Ultrasound examinations can assess changes in speckle pattern and signal scattering, both of which variably correlate with changes in kidney morphology and rising parenchymal stiffness. However, distinguishing diseased kidneys from healthy ones using two-dimensional (2D) ultrasound can be difficult for sonographers. These limitations have increased interest in radiomics, which quantifies medical images using statistical algorithms; the extracted features are then passed to machine learning models for outcome prediction. Radiomics aims to support diagnostic decisions by differentiating between tissue types [12, 13]. Within radiomics, texture analysis is an emerging tool for quantifying the severity of kidney diseases. Radiomics has been applied to different imaging modalities for identifying and differentiating kidney diseases, including kidney tumors and carcinomas [14,15,16,17], the discrimination of malignant from benign clinical T1 renal masses [18], renal tumor histological subtypes [19], early kidney damage in patients with diabetes mellitus [20], the detection of kidney stones [21], and the differentiation between normal and diseased kidneys in those with CKD [22].

For these reasons, we combined radiomics data from 2D ultrasound and Sound Touch Elastography (STE) images with clinical factors to construct and then verify predictive models. We used a nomogram to predict the degree of IFTA among CKD patients without histopathological data. We aimed to provide a non-invasive diagnostic approach for CKD that can be used to monitor treatment responses and disease course in these patients.

Materials and methods

Ethics statement

The current study complied with the Declaration of Helsinki and was approved by the local ethics review board (KY2021146). We obtained written informed consent from each participant.

Selection of study participants

CKD was defined as an eGFR < 60 mL/min/1.73 m² for at least 3 months [23]. We included CKD patients who had a clinical indication for kidney biopsy. The exclusion criteria were as follows: any contraindication for kidney biopsy, asymmetric bilateral kidney atrophy, abnormal kidney structure, or poor resolution of the kidney cortex and medulla on 2D ultrasound. Clinical and laboratory data were collected from each patient within 2 days before they underwent kidney biopsy.

Ultrasonography procedures

We used the Mindray Resona 7 Ultrasound System and SC5-1U convex array probe (bandwidth frequency of 1–5 MHz) (Mindray Bio-Medical Electronics Co., Ltd.) to perform 2D ultrasound and STE examinations. STE measurements were performed 5 times with uniform color fill, and the final standard deviation (SD) of the STE values was required to be less than 2.0 as quality control. All examinations were performed by a sonographer with 8 years of experience, who was blinded to serological, imaging, and kidney biopsy pathological results.
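The quality-control rule above (5 repeated STE measurements, SD below 2.0) can be sketched as a small check. This is only an illustration: the function name, the example values, and the use of the sample SD (n − 1 denominator) are assumptions, since the text does not state how the device computes the SD.

```python
from statistics import mean, stdev


def ste_quality_check(values, n_required=5, sd_limit=2.0):
    """Return (mean, sd, passed) for one set of repeated STE measurements.

    Assumption: the SD is the sample SD (n - 1 denominator); the text does
    not specify which definition the ultrasound system reports.
    """
    if len(values) != n_required:
        raise ValueError(f"expected {n_required} measurements, got {len(values)}")
    sd = stdev(values)
    return mean(values), sd, sd < sd_limit


# Hypothetical readings, for illustration only.
avg, sd, passed = ste_quality_check([8.1, 7.6, 8.4, 7.9, 8.0])
print(f"mean={avg:.2f}, sd={sd:.2f}, accepted={passed}")
```

A measurement set that fails the check would be repeated rather than averaged.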

Kidney biopsy and pathological examination

Renal biopsy specimens were obtained from patients with CKD within 3 days of renal ultrasound. A renal needle biopsy was done to sample the lower pole parenchyma of the target kidney under ultrasound (US) guidance. To ensure that the selected US images matched the biopsy location, the kidney puncture, 2D ultrasound, and STE examination were performed by the same sonographer. Two experienced pathologists scored the severity of glomerular sclerosis, tubulointerstitial injury, and vascular sclerosis based on the Banff scoring system and the approach of Farris et al. [24, 25]. Any disagreement between pathologists was resolved by consensus. We used the Image-Pro Plus 6.0 software to evaluate the proportion of tubulointerstitial fibrotic areas. Patients with CKD were classified according to the Banff scoring system for kidney cortical fibrosis [26]. In this scoring system, the severity of cortical tubulointerstitial fibrosis was divided into three grades: mild IFTA, fibrotic area < 25%; moderate IFTA, fibrotic area 26–50%; and severe IFTA, fibrotic area > 50%.
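As a minimal sketch, the grading rule can be written as a function of the percentage fibrotic area, together with the two binary targets used for the mild versus moderate-to-severe and mild-to-moderate versus severe comparisons. The function names are hypothetical, and the handling of values between 25% and 26% (counted as mild up to and including 25%, moderate above) is an assumption, because the stated cut-offs leave that interval unassigned.

```python
def ifta_grade(fibrotic_area_pct):
    """Map percentage fibrotic cortical area to an IFTA grade."""
    if not 0 <= fibrotic_area_pct <= 100:
        raise ValueError("fibrotic area must be a percentage between 0 and 100")
    if fibrotic_area_pct <= 25:  # assumption: exactly 25% is counted as mild
        return "mild"
    if fibrotic_area_pct <= 50:  # assumption: values in (25, 26) count as moderate
        return "moderate"
    return "severe"


def binary_labels(grade):
    """Return the two binary targets (1 = more severe group)."""
    return {
        "mild_vs_moderate_to_severe": int(grade != "mild"),
        "mild_to_moderate_vs_severe": int(grade == "severe"),
    }


# Fibrotic areas taken from the three example patients in Fig. 3.
for area in (5, 40, 70):
    grade = ifta_grade(area)
    print(area, grade, binary_labels(grade))
```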

Processing flow of radiomics

Image segmentation

Images in Digital Imaging and Communications in Medicine (DICOM) format acquired during B-mode and STE examination were imported into ITK-SNAP software for manual image segmentation. We evaluated the region of interest (ROI) containing the kidney cortex but removed the kidney medulla and perirenal fat tissues during image curation. Any difference between the two interpreters was resolved by group discussion.

Feature extraction and establishment of radiomics label

The DICOM images and ROIs obtained from ITK-SNAP software were imported into the AK software (Artificial Intelligence Kit, GE Healthcare) for extracting radiomics features. The extracted features included first-order (histogram and morphologic features) alongside second-order parameters. The second-order parameters mainly involved Gray Level Co-Occurrence Matrix (GLCM), Gray Level Run Length Matrix (GLRLM), Gray Level Size Zone Matrix (GLSZM), Neighboring Gray Tone Difference Matrix (NGTDM), and Gray Level Dependence Matrix (GLDM). The ROI of all images was delineated by two sonographers. The inter-observer agreement was evaluated using intraclass correlation coefficient (ICC) analysis, with values between 0.75 and 1 defined as good consistency, values between 0.4 and 0.75 as fair consistency, and values under 0.4 as poor consistency. Features with ICC values higher than 0.75 were selected for further analysis. Patients were randomly divided into training and validation cohorts at a ratio of 7:3. We planned for two types of comparisons: mild versus moderate-to-severe IFTA and mild-to-moderate versus severe IFTA.
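The reproducibility filter and the 7:3 split can be illustrated as follows. This is a sketch under stated assumptions: the text does not say which ICC form was used, so the two-way random-effects, absolute-agreement, single-rater form ICC(2,1) is assumed here; the split is a simple unstratified shuffle with a hypothetical seed; and the boundary handling in `consistency_label` is also an assumption.

```python
import random


def icc_2_1(ratings):
    """ICC(2,1) for a list of rows, one row per subject and one value per rater."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


def consistency_label(icc):
    """Apply the good / fair / poor bands described in the text."""
    if icc >= 0.75:
        return "good"
    if icc >= 0.4:
        return "fair"
    return "poor"


def split_7_3(patient_ids, seed=0):
    """Randomly split patients into training and validation cohorts at 7:3."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)
    cut = round(0.7 * len(ids))
    return ids[:cut], ids[cut:]


# One feature measured by two sonographers on three patients (made-up values).
print(consistency_label(icc_2_1([[1.0, 1.1], [2.0, 2.1], [3.0, 2.9]])))
train, validation = split_7_3(range(150))
print(len(train), len(validation))
```

In practice the ICC would be computed once per radiomics feature, and only features above the 0.75 threshold would be carried forward.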

Feature selection

Minimum redundancy maximum relevance (mRMR) was first used to eliminate redundant and irrelevant features and retain the most relevant ones. The optimal feature subset was then selected with the least absolute shrinkage and selection operator (LASSO) algorithm and used to build the final model. After determining the number of optimal features, we selected the most predictive feature subset and calculated the corresponding coefficients [27].
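The idea behind the mRMR step can be sketched with a greedy selection that rewards relevance to the label and penalizes redundancy with features already chosen. This is an illustration, not the procedure used in the study: absolute Pearson correlation stands in for the mutual-information criterion that mRMR implementations commonly use, the feature names and values are made up, and the subsequent LASSO step is not reproduced here.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def mrmr_select(features, labels, n_select):
    """Greedy selection: relevance minus mean redundancy with chosen features.

    features: dict mapping feature name -> list of values (one per patient).
    Assumption: |Pearson r| is used as a stand-in for mutual information.
    """
    relevance = {name: abs(pearson(vals, labels)) for name, vals in features.items()}
    selected = []
    remaining = set(features)
    while remaining and len(selected) < n_select:

        def score(name):
            if not selected:
                return relevance[name]
            redundancy = mean(
                abs(pearson(features[name], features[s])) for s in selected
            )
            return relevance[name] - redundancy

        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Made-up example: f_a_near_copy is relevant but almost duplicates f_a.
labels = [0, 0, 0, 1, 1, 1]
features = {
    "f_a": [1, 2, 3, 4, 5, 6],
    "f_a_near_copy": [1, 2, 3, 4, 5, 7],
    "f_b": [1, 0, 1, 1, 2, 1],
}
print(mrmr_select(features, labels, 2))
```

The near-duplicate feature is skipped in the second round even though its relevance is higher than that of `f_b`, which is the behavior the redundancy term is meant to produce.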

Model construction and result validation

Feature extraction based on B-mode and STE images yielded a radiomics score (Rad-Score), which was the radiomics label calculated as the sum of the selected features weighted by their coefficients. We used receiver operating characteristic (ROC) curve analysis to evaluate the performance of each constructed model. The Akaike information criterion was applied to select the most appropriate clinical model. Multivariate logistic regression combining clinical features with the Rad-Score was conducted to establish a predictive model and generate a clinical nomogram. The usefulness of a nomogram lies in its ability to map each predictor's contribution to points on a scale between 0 and 100. The total points accrued across the different types of features correspond to the predicted probability for the index patient [28, 29].
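The relationship between the Rad-Score, the logistic model, and the nomogram points can be sketched as below. All coefficients, value ranges, and names are hypothetical placeholders (the actual Rad-Score formula is in the Supplementary Materials), and the point scaling shown is one common convention, in which the predictor with the largest possible contribution spans 0 to 100.

```python
import math


def rad_score(features, coefficients, intercept=0.0):
    """Weighted sum of the selected features plus an intercept."""
    return intercept + sum(coefficients[name] * features[name] for name in coefficients)


def predicted_probability(linear_predictor):
    """Logistic transform of the model's linear predictor."""
    return 1.0 / (1.0 + math.exp(-linear_predictor))


def nomogram_points(betas, ranges):
    """Build a function that converts a predictor value to nomogram points.

    betas: dict name -> regression coefficient.
    ranges: dict name -> (min, max) of the predictor shown on the nomogram axis.
    """
    spans = {k: abs(betas[k]) * (ranges[k][1] - ranges[k][0]) for k in betas}
    max_span = max(spans.values())

    def points(name, value):
        lo, hi = ranges[name]
        start = lo if betas[name] >= 0 else hi  # axis end that scores 0 points
        return 100.0 * betas[name] * (value - start) / max_span

    return points


# Hypothetical coefficients and ranges, for illustration only.
betas = {"egfr": -0.04, "rad_score": 1.5}
ranges = {"egfr": (0.0, 120.0), "rad_score": (-3.0, 3.0)}
points = nomogram_points(betas, ranges)
total = points("egfr", 45.0) + points("rad_score", 1.2)
print(round(total, 1))
```

Because the points are a linear rescaling of the linear predictor, the total maps back to a predicted probability through the same logistic transform.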

The predictive accuracy of each model was assessed by the area under the ROC curve (AUC) value for the training and validation cohorts. We tested the performance of the nomogram in the validation cohort. Using the logistic regression model established in the training cohort, we calculated the total score for each patient in the validation cohort and obtained the AUC and calibration curve. To estimate the prediction error of each model, we further tested the proposed model using a 1000-iteration bootstrap analysis for both the training and validation cohorts. In each iteration, we randomly selected 70% of patients from the training or validation cohort and calculated the corresponding AUC value.
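The resampling procedure described above can be sketched as follows. The AUC is computed from its rank interpretation (the probability that a randomly chosen positive case scores higher than a randomly chosen negative case). Drawing 70% of patients without replacement, skipping draws that contain only one class, and summarizing with a percentile interval are assumptions; the text does not specify these details.

```python
import random


def auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute an AUC")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def resampled_aucs(labels, scores, n_iter=1000, fraction=0.7, seed=0):
    """AUCs from repeated random 70% subsamples of one cohort."""
    if len(set(labels)) < 2:
        raise ValueError("both classes are required")
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    k = round(fraction * len(indices))
    if k < 2:
        raise ValueError("subsample too small")
    results = []
    while len(results) < n_iter:
        pick = rng.sample(indices, k)
        ys = [labels[i] for i in pick]
        if len(set(ys)) < 2:  # assumption: redraw single-class subsamples
            continue
        results.append(auc(ys, [scores[i] for i in pick]))
    return results


def percentile_interval(values, lower=2.5, upper=97.5):
    """Simple percentile interval (an assumed way of summarizing the spread)."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return ordered[round(lower / 100 * last)], ordered[round(upper / 100 * last)]


# Made-up labels and predicted probabilities.
labels = [0, 0, 0, 1, 1, 1, 0, 1, 0, 1]
scores = [0.1, 0.3, 0.6, 0.7, 0.4, 0.9, 0.2, 0.8, 0.5, 0.65]
print(auc(labels, scores), percentile_interval(resampled_aucs(labels, scores)))
```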

Comparisons between AUCs were made with the DeLong test. The calibration curves and Hosmer–Lemeshow test were used to investigate the performance of the nomogram. Finally, to evaluate clinical practicability by quantifying the net benefits of the nomogram model in both the training and validation cohorts, decision curve analysis (DCA) was performed based on clinical features and radiomics labels from the B-mode model, STE model, B-mode plus STE model, and the combined models. DCA determines the clinical practicability of radiomics nomograms by quantifying the net benefits under different threshold probabilities in the validation set.
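The net benefit that DCA plots at each threshold probability can be illustrated with the standard definition, net benefit = TP/n − (FP/n) × pt/(1 − pt), where pt is the threshold probability. The sketch below uses made-up labels and predictions; the function names are hypothetical.

```python
def net_benefit(labels, probabilities, threshold):
    """Net benefit of treating patients whose predicted probability >= threshold."""
    n = len(labels)
    tp = sum(1 for y, p in zip(labels, probabilities) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(labels, probabilities) if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)


def net_benefit_treat_all(labels, threshold):
    """Reference strategy: classify every patient as positive."""
    prevalence = sum(labels) / len(labels)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Made-up labels and predicted probabilities.
labels = [1, 1, 0, 0]
probabilities = [0.9, 0.6, 0.7, 0.2]
for pt in (0.2, 0.5, 0.8):
    print(pt, net_benefit(labels, probabilities, pt), net_benefit_treat_all(labels, pt))
```

A model is considered useful at a given threshold when its net benefit exceeds both the treat-all curve and the treat-none line (net benefit of 0).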

Statistical analyses

SPSS (version 26.0; IBM), GraphPad Prism 8.0 (GraphPad Software), and R statistical software (version 4.0.2) were used for statistical tests. p < 0.05 was considered statistically significant.

Results

Basic clinical information from participants

The flowchart of patient selection is provided in Fig. 1. A total of 150 patients with CKD were included; their baseline characteristics are shown in Table 1, along with their pathological diagnoses (Supplementary Table 1). The radiomics processing workflow is shown in Fig. 2. We also illustrate how the ROI of the punctured kidney cortex was delineated in ITK-SNAP (Fig. 3). Table 2 shows the clinical characteristics of the training and validation cohorts.

Fig. 1

The flowchart of patient selection process

Table 1 Patients' characteristics at baseline

Fig. 2

Radiomics flow chart of this study. We exported the collected images in DICOM format, used the ITK-SNAP software to delineate ROIs, and performed image segmentation. We used the AK software to extract ultrasound radiomics features, and built models based on the clinical characteristics of patients with CKD. Later, we performed model calibration and validation.

Fig. 3

Pathological examinations using Masson staining from kidney biopsy, with corresponding B-mode images and ITK-SNAP ROIs, in patients with different IFTA severities. A–C A 38-year-old female patient with systemic lupus erythematosus. Kidney biopsy showing focal proliferative lupus nephritis, III-(A); tubulo-interstitial atrophy was 5% as assessed by Masson staining, mild IFTA. D–F A 36-year-old female patient with chronic kidney disease; tubulo-interstitial atrophy was 40% as assessed by Masson staining, moderate IFTA. G–I A 28-year-old male patient with chronic kidney disease; tubulo-interstitial atrophy was 70% as assessed by Masson staining, severe IFTA

Table 2 Clinical characteristics of the training and validation cohorts

Feature selection, model construction, and results validation

We extracted 1156 radiomics features from the B-mode and STE images for each participant. Based on the reproducibility analysis by two sonographers, 739 features had good consistency (ICC > 0.75), and 120 features were retained after filtering with the mRMR method. We then performed texture feature selection based on LASSO logistic regression (Supplementary Figure 1) and selected 36 features. These features were used to construct the radiomics signature (Fig. 4). The final formula for calculating Rad-Scores is shown in the Supplementary Materials. We compared the Rad-Scores between the training and the testing groups, as shown in Supplementary Figure 2.

Fig. 4

Radiomics signatures for B-mode and STE images. A Four features from B-mode images of mild versus moderate-to-severe IFTA; B Eleven features from STE images of mild versus moderate-to-severe IFTA; C Nine features from B-mode images of mild-to-moderate versus severe IFTA; D Twelve features from STE images of mild-to-moderate versus severe IFTA

We further compared the results of the B-mode model, STE model, clinical model, and combined model between groups of different IFTA grades, accompanied by model validation; prediction parameters were calculated at the cut-off given by the Youden index (Table 3). The decision curves used to evaluate the models are shown in Supplementary Figure 3. The prediction models built on clinical features for identifying mild vs. moderate-to-severe IFTA and mild-to-moderate vs. severe IFTA are shown in Supplementary Figure 4.
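For readers unfamiliar with the Youden index, J = sensitivity + specificity − 1, the sketch below shows how an operating cut-off can be chosen by maximizing J over candidate thresholds. The data are made up, the function name is hypothetical, and taking the first maximum when several cut-offs tie is an assumption.

```python
def youden_cutoff(labels, scores):
    """Return (J, cutoff, sensitivity, specificity) at the maximal Youden index.

    A case is called positive when its score is >= the cutoff.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes are required")
    best = None
    for cutoff in sorted(set(scores)):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= cutoff)
        tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < cutoff)
        sensitivity = tp / n_pos
        specificity = tn / n_neg
        j = sensitivity + specificity - 1
        if best is None or j > best[0]:  # assumption: keep the first maximum on ties
            best = (j, cutoff, sensitivity, specificity)
    return best


# Made-up labels and model outputs.
print(youden_cutoff([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```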

Table 3 Diagnostic performance of the different prediction models for the assessment of IFTA in the training and validation groups

Clinical features combined with ultrasound radiomics model performance and nomogram validation in analyses involving different IFTA group comparisons

During the validation of models comparing mild IFTA with moderate-to-severe IFTA, the clinical model established using serum albumin and eGFR achieved moderate prediction ability, as did the STE radiomics model (Table 3). Adding the STE radiomics model to the clinical model significantly improved predictive performance (DeLong test, p < 0.05); the resulting nomogram is shown in Fig. 5A. The combined model had AUCs of 0.91 (95% CI: 0.85–0.97) and 0.85 (95% CI: 0.77–0.98) in the training and testing cohorts, respectively (Fig. 5B, C). The nomogram calibration curves showed good agreement between predictions and observations in both cohorts (Fig. 5D, E). The DCA of the nomogram is shown in Fig. 5F. Across threshold probabilities of 20–80%, the DCA showed a greater net benefit for the combined (clinical plus STE) model than for the clinical or STE model alone in predicting IFTA severity.

Fig. 5

A Nomogram combining clinical features (albumin and eGFR) with STE Rad-Scores for mild vs. moderate-to-severe IFTA. B, C ROC curves of the clinical model combined with the STE model for mild vs. moderate-to-severe IFTA in the training and validation sets. D, E Calibration curves of the nomogram combining the clinical model with the STE model for mild vs. moderate-to-severe IFTA in the training and validation cohorts. F Decision curve analysis of the nomogram comparing mild with moderate-to-severe IFTA for the clinical model alone, STE model alone, and combined model. The Y-axis is net benefit. The blue line represents the decision curve of the STE model. The green line represents the clinical model curve, whereas the red line represents the decision curve of the STE model combined with the clinical model of patients with CKD

During the validation of models comparing mild-to-moderate IFTA with severe IFTA, the clinical model established based on age and eGFR achieved moderate prediction ability, with AUCs of 0.67 (95% CI: 0.55–0.79) and 0.55 (95% CI: 0.34–0.76) for the training and testing cohorts, respectively. Moderate prediction ability was also achieved using the B-mode radiomics model, with AUCs of 0.80 (95% CI: 0.71–0.90) and 0.78 (95% CI: 0.65–0.92) for the training and testing cohorts, respectively. Moderate prediction ability was similarly achieved using the STE radiomics model, with AUCs of 0.81 (95% CI: 0.73–0.89) and 0.73 (95% CI: 0.58–0.88) for the training and testing cohorts, respectively. Higher prediction ability was achieved using the B-mode plus STE radiomics model, with AUCs of 0.93 (95% CI: 0.88–0.98) and 0.86 (95% CI: 0.75–0.97) for the training and testing cohorts, respectively. Finally, the combined model established using age, eGFR, B-mode radiomics, and STE radiomics showed high prediction ability, with the nomogram shown in Fig. 6A. The AUCs of the training and testing cohorts were 0.93 (95% CI: 0.89–0.98) and 0.83 (95% CI: 0.70–0.95), respectively (Fig. 6B, C). The AUCs of the combined model significantly differed from those of the clinical model, B-mode model, or STE model (DeLong test, p < 0.005 for the training and validation cohorts). The nomogram calibration curves showed good agreement between predictions and observations in both cohorts (Fig. 6D, E). The DCA of the nomogram is shown in Fig. 6F. Compared to the other models, the combined nomogram model showed the highest area under the decision curve and therefore offered the greatest net benefit in classifying IFTA severity.

Fig. 6

A Nomogram combining clinical features (age and eGFR) with B-mode Rad-Scores and STE Rad-Scores for mild-to-moderate vs. severe IFTA. B, C ROC curves of the clinical model combined with the B-mode model and STE model for mild-to-moderate vs. severe IFTA in the training set and validation set. D, E Calibration curves of the nomogram combining the clinical model with the B-mode model and STE model for mild-to-moderate vs. severe IFTA in the training and validation cohorts. F Decision curve analysis of the nomogram comparing mild-to-moderate with severe IFTA for the clinical model alone, B-mode model alone, STE model alone, B+STE combined model, and clinical+B+STE combined model. The Y-axis is net benefit. The gray, green, blue, purple, and red lines represent the clinical model curve, B-mode model curve, decision curve of the STE model, B+STE model curve, and decision curve of the clinical+B+STE model, respectively

Discussion

The accurate and non-invasive classification of kidney fibrosis severities is crucial for clinical practice. Recently, researchers used machine learning based on elastography ultrasound images to gauge the severity of kidney fibrosis, with promising results [30]. The construction of a binary classification model is mostly used for comparing liver and kidney fibrosis severities [30,31,32,33]. In this study, we performed a binary classification by comparing one IFTA grade with the other grades as an approach. A combined model incorporating 2D ultrasound radiomics, STE radiomics, and clinical features for predicting IFTA severities was constructed and validated.

Among the clinical features analyzed in this study, eGFR was an independent parameter in both IFTA prediction models (Figs. 5 and 6), consistent with results from Zhu et al. [30]. eGFR is an important indicator for estimating kidney function and assessing IFTA severity in patients with CKD [34]. eGFR is calculated with a standardized formula from serum creatinine (Scr), a laboratory index that is widely used for the clinical follow-up of these patients [35]. In the model comparing mild with moderate-to-severe IFTA, the combined model based on clinical features (serum albumin and eGFR) and STE radiomics further improved the diagnostic performance. However, eGFR has limitations as an indicator of kidney fibrosis, since eGFR levels are frequently inconsistent with the degree of kidney fibrosis. Furthermore, eGFR is not sensitive to subclinical kidney damage [36]. When comparing mild-to-moderate IFTA with severe IFTA, the AUC of the clinical features (eGFR and age) model was 0.67 (95% CI: 0.55–0.79) and 0.55 (95% CI: 0.34–0.76) in the training and testing cohorts, respectively, suggesting that the clinical model alone performed worse than the B-mode model alone, the STE model alone, or the combined model (Fig. 6B, C). For these reasons, we selected IFTA severity rather than eGFR as the grouping variable for prediction model construction. A combined model consisting of clinical factors, including eGFR, and ultrasonography radiomics features can therefore be helpful for achieving non-invasive monitoring of kidney fibrosis.

The main factors affecting STE elasticity measurements are anisotropy and the heterogeneity of kidney fibrosis. Other confounding factors for STE measurement include age and BMI [30]. In this study, age was used to construct the nomogram for distinguishing mild-to-moderate from severe IFTA. The clinical model established with age and eGFR and the combined model both achieved fair predictive performance. In reality, kidneys become stiffened due to collagen deposition during ESKD, and STE measurement results will increase. However, with renal function further declining, kidneys may become softer due to poor blood perfusion, and the STE measurement results may decrease, whereas the kidney length becomes smaller on 2D ultrasound examination [37,38,39]. These factors likely lead to a complex nonlinear relationship between 2D ultrasound measurements, STE measurements, and IFTA severity. In our study, a combined model built on B-mode and STE results significantly improved the diagnostic performance of traditional ultrasound alone. Possible explanations for this finding include the ability of STE to capture the stiffness of patients' kidneys, which is well suited to machine learning, whose strengths include combining variables with nonlinear relationships and interactions [40]. Therefore, we used all variables, including 2D ultrasound radiomics, STE radiomics, and clinical factors from these patients with CKD, to model IFTA.

In the nomogram differentiating mild from moderate-to-severe IFTA, STE radiomics and clinical factors were included, whereas 2D ultrasound radiomics were not. The reason is that the 2D ultrasound radiomics consist of data including the diameter of the kidney's long axis and its cortical thickness. However, in patients with mild and moderate IFTA, changes in kidney morphology remain minimal due to their early CKD stages [41]. In this study, there were no differences in kidney lengths and cortical thickness between different IFTA groups (p = 0.487 and p = 0.927 for the mild and moderate IFTA groups, respectively). During our construction of a comparison model between the mild and moderate-to-severe IFTA groups, we extracted 2D ultrasound image features from those with moderate-to-severe IFTA. Since the radiomics features of moderate IFTA were included, kidney morphological features that did not significantly differ between those with mild and moderate IFTA were extracted. The presence of such redundant information might increase the probability of model overfitting, reducing model performance after constructing a joint model. However, when we compared those with mild-to-moderate and severe IFTA, patients with ESKD and severe IFTA were more likely to have morphological kidney atrophy and cortical thinning. When we compared renal long-axis diameter and cortical thickness between the severe and mild-to-moderate IFTA groups, there were differences between groups (p < 0.01). Therefore, the addition of radiomics features capturing morphological differences of the kidneys on 2D ultrasound greatly increased the diagnostic performance of the combined model.

The combined model incorporating B-mode, STE, and clinical features may be applicable to IFTA detection in patients outside our training cohort, particularly during the follow-up of patients unable to receive a renal biopsy. Establishing an ultrasound radiomics model can support clinical ultrasound practice, and radiomics findings may assist in IFTA prediction in the future.

This study has some limitations. Patients selected were those with CKD and renal biopsy indications. The renal cortical tissues of patients with ESKD could be thin, precluding the derivation of histopathological results based on renal biopsy. The sample size of patients with severe IFTA was small, necessitating further expansion to reduce data redundancy during model construction, in order to facilitate the establishment of multi-classification models. In addition, this study was done based on data from one center, using a single-mode ultrasound diagnostic apparatus to collect ultrasound radiomics. Multi-center and different ultrasound modes may be needed to extract more 2D and ultrasound elasticity radiomics to construct a combined model and to test the generalizability of our established combined model. Finally, changes in 2D ultrasound and STE features and the course of CKD among these patients need to be further monitored and validated in the future.

Conclusion

STE combined with 2D ultrasound examinations can improve the diagnostic performance of traditional ultrasound for tubulointerstitial fibrosis in patients with CKD. The radiomics nomograms constructed based on 2D ultrasound and STE imaging features in combination with clinical features are non-invasive tools with high accuracy in detecting renal fibrosis with different IFTA severities. This approach can be helpful for non-invasive monitoring of kidney fibrosis.