Development and validation of a prediction model for failed shockwave lithotripsy of upper urinary tract calculi using computed tomography information: the S3HoCKwave score

Purpose To develop and validate a new clinical prediction model that accurately predicts the failure of shockwave lithotripsy (SWL) using information obtained from non-contrast-enhanced computed tomography (NCCT).
Methods This multicentre retrospective cohort study consecutively enrolled patients diagnosed with upper urinary tract calculi by NCCT at five hospitals in Japan from January 1, 2006 to December 31, 2016. Among the candidate predictors, we selected the six most significant predictors a priori. The main outcome was SWL failure after three sessions. Model calibration was evaluated by the calibration slope and the Hosmer–Lemeshow test. Discrimination was evaluated by the receiver-operating characteristic curves and the area under the curve (AUC). A multivariable logistic regression analysis was performed; based on the estimated β coefficients, predictive scores were generated.
Results Of 2695 patients, 2271 were included. Patients were divided into the development cohort (1666 patients) and validation cohort (605 patients) according to geographical factors. We developed a clinical prediction model with scores ranging from 0 to 49 points. We named the prediction model the S3HoCKwave score based on the initials of the predictors (sex, skin-to-stone distance, size, Hounsfield units, colic, and kidney or ureter). As a result of internal validation, the optimism-corrected AUC was 0.72. In the validation cohort, the Hosmer–Lemeshow test did not show statistical significance (P = 0.33), and the AUC was 0.71 (95% confidence interval 0.65–0.76).
Conclusions The S3HoCKwave score is easy to understand, has a relatively high predictive value, and allows clinicians to make appropriate treatment selections.
Electronic supplementary material The online version of this article (10.1007/s00345-020-03125-y) contains supplementary material, which is available to authorized users.


Introduction
Upper urinary tract calculi (i.e., kidney stones and ureteral stones) are common, with a prevalence of 5.2% during 1988-1994 [1], and increasing trends in the United States and Japan [2]. Symptoms of upper urinary tract calculi vary and are sometimes serious [3]. Upper urinary tract calculi can be complicated by sepsis, which can be fatal [4]. Therefore, early diagnosis and early treatment interventions for upper urinary tract calculi are important in clinical practice.
In Japan, 56.6% of patients diagnosed with upper urinary tract calculi in 2015 received some type of surgical treatment, such as shockwave lithotripsy (SWL), ureteroscopic lithotripsy (URSL), or percutaneous nephrolithotomy (PCNL) [2]. The European Association of Urology (EAU) and Japanese Urological Association (JUA) recommend the use of either SWL or URSL when a patient has a single calculus with a diameter of 20 mm or smaller [5,6]. A recent systematic review indicated that the stone-free rate (SFR) of URSL is superior to that of SWL at 4 weeks after treatment, whereas the SFR of URSL at 3 months after treatment is equivalent to that of SWL [7]. Additionally, more complications and longer hospitalisation periods are associated with URSL than with SWL [7]. Therefore, SWL is a viable alternative treatment that may have clinical advantages over URSL for solitary calculi smaller than 20 mm.
Recently, shared decision-making (SDM) has become an important practice in urology [8], and it can be used with clinical prediction models [9]. Although there are some prediction models that predict the SWL success rate [10][11][12][13][14], they may be difficult to use because of their complexity. In addition, their performance has not yet been sufficiently evaluated. Therefore, our study aimed to develop and validate a novel and epidemiologically robust clinical prediction model that can provide clinically useful information regarding treatment selection to determine if SWL is appropriate for the treatment of upper urinary tract calculi.

Members of the Okayama-Ehime SWL Study Group are listed in the Acknowledgements section.
* Yosuke Yamamoto yamamoto.yosuke.5n@kyoto-u.ac.jp

Research design and setting
We conducted a multicentre retrospective cohort study at five Japanese community hospitals.

Inclusion and exclusion criteria
We consecutively included patients who were 20 years or older, were diagnosed with solitary upper urinary tract calculi by non-contrast-enhanced computed tomography (NCCT), and underwent SWL as an initial treatment from January 1, 2006 to December 31, 2016. We followed up patients for 3 months to determine outcomes, based on the recommendations of the JUA guidelines [5] and on actual practice patterns; the follow-up period ran from January 1, 2006 to March 31, 2017. We excluded patients who had urinary calculi of 20 mm or larger, had lower renal calyx calculi or multiple calculi, had indwelling ureteral stents, or did not have data regarding outcomes.

How to perform SWL
Patients were administered transrectal diclofenac and placed in the supine position to undergo treatment for upper ureter calculi or renal calculi. Treatment of mid-to-lower ureter calculi was performed with patients in the prone position. The shockwave rates were 60-90 per min. Shockwave energy was gradually increased to a tolerable level for patients, and involuntary patient movement and increased respiratory motion due to pain were prevented. The maximum number of shocks was 4000. Most SWL procedures were performed by well-trained Japanese board-certified urologists with 7 or more years of experience at each hospital.

Main outcome measures
The primary outcome was SWL failure after three sessions, which reflected the overall SWL outcome [5,6]. We defined treatment success as the resolution of the calculus as determined by abdominal X-ray examination during the subsequent clinical outpatient visit. Clinically insignificant residual fragments, defined as residual stones smaller than 4 mm in diameter [15,16], were considered to indicate successful treatment. Cases converted to URSL or PCNL were defined as SWL failure. SWL failure after one session and SWL failure after two sessions were also evaluated as secondary outcomes.

Sample size calculation
To develop the multivariable logistic regression model, at least ten events per variable were needed [17]. At least 100 events were needed for model validation [18]. Based on the JUA and EAU guidelines, SWL failure was expected to occur in 10-30% of patients [5,6]. We calculated that we needed 500-1500 patients for the development cohort and 330-1000 patients for the validation cohort.
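The arithmetic behind these ranges can be sketched as follows. The validation figures follow directly from the text (100 events at a 10-30% failure rate). The development figures are back-calculated: the stated 500-1500 range is consistent with roughly 150 events, that is, about 15 candidate parameters at ten events each. That count of 15 is an assumption made here for illustration and is not stated in the text.

```python
def patients_needed(required_events: int, event_rate_percent: int) -> int:
    """Smallest cohort size whose expected event count reaches required_events.

    Integer arithmetic (ceiling division) avoids floating-point surprises.
    """
    return -(-required_events * 100 // event_rate_percent)


# Validation cohort: at least 100 events, failure rate 10-30% (both from the text).
validation_range = (patients_needed(100, 30), patients_needed(100, 10))

# Development cohort: ten events per variable (from the text).
# ASSUMPTION: 15 candidate parameters, back-calculated from the 500-1500 range.
assumed_candidate_parameters = 15
development_events = 10 * assumed_candidate_parameters
development_range = (
    patients_needed(development_events, 30),
    patients_needed(development_events, 10),
)

print("validation cohort:", validation_range)
print("development cohort:", development_range)
```

Under these assumptions the sketch gives 334-1000 patients for validation (reported as 330-1000 after rounding) and 500-1500 patients for development.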

Development of a prediction model
First, we divided our cohort into two cohorts, the development cohort and the validation cohort, according to geographical factors. Second, based on previous studies, guidelines [5,6][11][12][13][14], and opinions from an expert panel comprising 13 urologists of our study team, the 6 most clinically significant predictors (sex, presence of colic, maximum length of calculi, localisation of calculi, skin-to-stone distance [SSD], and mean stone density) were selected a priori. Third, we converted continuous predictors to dichotomous variables according to previous studies. Fourth, we used a multivariable logistic regression analysis for the development cohort and calculated the β coefficients of each predictor. Fifth, we multiplied those β coefficients by 10 and rounded them to create the scores. As a result, we were able to develop an integer score-based prediction model [19].
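The conversion from β coefficients to integer points can be illustrated with a minimal sketch. The predictor names and coefficient values below are hypothetical placeholders, not the estimates reported in Table 2, and the rounding convention (half rounds up) is also an assumption.

```python
import math


def integer_points(beta: float, scale: int = 10) -> int:
    """Scale a logistic regression coefficient and round to the nearest integer.

    math.floor(x + 0.5) rounds halves up; Python's built-in round() would
    round halves to the nearest even integer instead.
    """
    return math.floor(beta * scale + 0.5)


# HYPOTHETICAL coefficients for three dichotomous predictors (illustration only).
example_betas = {"predictor_a": 0.42, "predictor_b": 1.18, "predictor_c": 0.73}
example_points = {name: integer_points(b) for name, b in example_betas.items()}


def total_score(present: dict, points: dict) -> int:
    """Sum the points of every predictor that is present (value 1) for a patient."""
    return sum(points[name] for name, flag in present.items() if flag)


patient = {"predictor_a": 1, "predictor_b": 1, "predictor_c": 0}
print(example_points)                        # points per predictor
print(total_score(patient, example_points))  # this patient's sum score
```

Because every predictor is dichotomous, the lowest possible score is 0 and the highest is the sum of all points, which is how a fixed score range arises.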

Internal and external validation of the model
We performed bootstrap validation with 100 repetitions as an internal validation [20]. We evaluated the internally validated model performance based on the calculated performance optimism. For external validation, we applied the developed prediction model to our validation cohort (geographic validation) and evaluated both calibration and discrimination [20]. Calibration showed the accuracy of absolute risk estimates, whereas discrimination showed how well the developed model differentiated those at higher risk from those at lower risk [9]. Calibration was assessed using the Hosmer-Lemeshow test and the calibration slope, and discrimination was assessed using the receiver-operating characteristic (ROC) curve and the area under the curve (AUC) [20]. For the secondary outcomes, we developed and validated prediction models and calculated the model performance in the same way.
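The bootstrap optimism correction mentioned above can be sketched in a few lines. This is a generic illustration of the technique, not the authors' Stata code: the "model" is a toy stand-in that only learns the direction of a single predictor, and the data are made up.

```python
import random


def auc(scores, labels):
    """Probability that a random event outscores a random non-event (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_direction(xs, ys):
    """Toy model fit (ASSUMPTION): learn whether higher x means higher risk."""
    events = [x for x, y in zip(xs, ys) if y == 1]
    others = [x for x, y in zip(xs, ys) if y == 0]
    sign = 1.0 if sum(events) / len(events) >= sum(others) / len(others) else -1.0
    return lambda x: sign * x


def optimism_corrected_auc(xs, ys, n_boot=100, seed=0):
    """Return (apparent AUC, optimism-corrected AUC) using bootstrap resampling."""
    rng = random.Random(seed)
    model = fit_direction(xs, ys)
    apparent = auc([model(x) for x in xs], ys)
    n, optimism = len(xs), []
    while len(optimism) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        bx, by = [xs[i] for i in idx], [ys[i] for i in idx]
        if len(set(by)) < 2:
            continue  # a resample needs both events and non-events
        boot_model = fit_direction(bx, by)
        auc_in_boot = auc([boot_model(x) for x in bx], by)
        auc_in_orig = auc([boot_model(x) for x in xs], ys)
        optimism.append(auc_in_boot - auc_in_orig)
    return apparent, apparent - sum(optimism) / n_boot


# Made-up data: one predictor, binary outcome.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
apparent, corrected = optimism_corrected_auc(xs, ys, n_boot=100, seed=1)
print(f"apparent AUC = {apparent:.4f}, optimism-corrected AUC = {corrected:.4f}")
```

The key design point is that each bootstrap model is evaluated twice, once on its own resample and once on the original data; the average difference estimates how much the apparent performance overstates future performance.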

Comparison with the triple D score and assessment of test performance
We applied the triple D score [12] to our validation cohort. We calculated the sensitivity, specificity, and likelihood ratios of the developed model for all possible cut-off scores.
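As a minimal sketch of how test performance at each cut-off can be tabulated, the following computes sensitivity, specificity, and likelihood ratios when a score at or above the cut-off is treated as a prediction of SWL failure. The scores and outcomes are invented for illustration, and the function name is hypothetical.

```python
def cutoff_performance(scores, failed, cutoff):
    """Test performance when 'score >= cutoff' is treated as predicting failure."""
    tp = sum(1 for s, f in zip(scores, failed) if s >= cutoff and f == 1)
    fp = sum(1 for s, f in zip(scores, failed) if s >= cutoff and f == 0)
    fn = sum(1 for s, f in zip(scores, failed) if s < cutoff and f == 1)
    tn = sum(1 for s, f in zip(scores, failed) if s < cutoff and f == 0)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # Likelihood ratios are undefined when their denominator is zero.
    lr_pos = sensitivity / (1 - specificity) if specificity < 1 else None
    lr_neg = (1 - sensitivity) / specificity if specificity > 0 else None
    return {"sensitivity": sensitivity, "specificity": specificity,
            "lr_pos": lr_pos, "lr_neg": lr_neg}


# INVENTED example data: sum scores and observed SWL failure (1 = failure).
scores = [5, 12, 20, 28, 35, 40, 10, 38]
failed = [0, 0, 0, 1, 1, 1, 0, 0]

# Evaluate every observed score as a candidate cut-off.
for cutoff in sorted(set(scores)):
    result = cutoff_performance(scores, failed, cutoff)
    print(cutoff, result)

at_35 = cutoff_performance(scores, failed, 35)
```

Raising the cut-off trades sensitivity for specificity, which is why a high cut-off is suited to identifying patients for whom SWL is unlikely to succeed.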

Statistical analysis
Data were analysed using STATA version 15.0 (Stata Corp., College Station, TX, USA). For the Hosmer-Lemeshow test, P > 0.05 was taken to indicate no statistically significant lack of fit.

Missing values
To compensate for missing values, we applied multiple imputation by chained equations, which created 20 imputed datasets; the estimates from these datasets were combined using Rubin's rules [21].
Figure S1 shows the flow diagram of this study. Table 1 shows patient characteristics. There were 1666 eligible patients in the model development cohort and 605 eligible patients in the model validation cohort. The average age was 55.1 years in the development cohort and 53.0 years in the validation cohort. Males comprised 75.0% of the development cohort and 81.8% of the validation cohort. The most common diagnosis was upper ureter calculi (61.8% in the development cohort and 59.0% in the validation cohort).

Differences between the development cohort and validation cohort
The development cohort included patients from hospitals in western Japan (Okayama Prefecture, Hiroshima Prefecture, Ehime Prefecture). In contrast, our validation cohort included patients from a hospital in eastern Japan (Chiba Prefecture).

Observed SWL failure
The development cohort included 182 patients with SWL failure after 3 sessions. The validation cohort included 111 patients with SWL failure after 3 sessions (Table S1).
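For orientation, the observed counts can be converted to proportions and set against the sample size requirements stated earlier. The sketch below uses only numbers reported in the text; the helper name is hypothetical.

```python
def failure_rate(events: int, patients: int) -> float:
    """Proportion of patients with SWL failure after three sessions."""
    return events / patients


cohorts = {"development": (182, 1666), "validation": (111, 605)}
for name, (events, patients) in cohorts.items():
    print(f"{name}: {events}/{patients} = {failure_rate(events, patients):.1%}")

# Events per selected predictor in the development cohort (six predictors).
events_per_predictor = 182 / 6
print(f"events per predictor: {events_per_predictor:.1f}")
```

Both proportions (about 10.9% and 18.3%) fall within the 10-30% range assumed in the sample size calculation, the validation cohort exceeds the 100 events required, and the development cohort has about 30 events for each of the six predictors.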

Development of the clinical prediction model
Results of the multivariable logistic regression analysis after multiple imputation are shown in Table 2. Table S2 shows the actual score calculation table. The developed prediction model was called the S3HoCKwave score; its name was based on the initials of the selected predictors (sex, SSD, size, Hounsfield units, colic, and kidney or ureter). The lowest score was 0 points, and the highest score was 49 points. Higher scores predicted a higher risk of SWL failure.

Model performance
The performance of the S3HoCKwave score after internal validation and external validation is shown in Figure S2. For the apparent performance of the S3HoCKwave score, the Hosmer-Lemeshow test gave P = 0.49, and the AUC was 0.75 (95% confidence interval [CI] 0.71-0.78). As a result of internal validation, the optimism-corrected AUC was 0.72. In external validation, the Hosmer-Lemeshow test was not statistically significant (P = 0.33), and the AUC was 0.71 (95% CI 0.65-0.76). The relationship between the score and the predicted probability is shown in Figure S3, and the test performance of the S3HoCKwave score is shown in Table S3.

Performance of the triple D score used in our validation cohort
When the triple D score was applied to our validation cohort, the Hosmer-Lemeshow test was statistically significant (P < 0.0001), indicating poor calibration, and the AUC was 0.68 (95% CI 0.60-0.77).

S3HoCKwave scores for one-session SWL and two-session SWL
S3HoCKwave scores for one-session SWL and for two-session SWL are shown in Table S3. The calibration slope and ROC curve after external validation are shown in Figure S4.

Overview
In this study, we developed and validated a new clinical prediction model called the S3HoCKwave score. This prediction model has two important characteristics. First, the S3HoCKwave score is a sum score consisting of only six predictors; therefore, it is very easy for clinicians to use and understand compared with clinical nomograms [22]. In addition, the S3HoCKwave score preserves an AUC of more than 0.70, which is classified as moderate accuracy [23]. These characteristics indicate that the S3HoCKwave score is also a good tool for SDM between clinicians and patients. Second, because of the sufficient sample size obtained from various types of hospitals in various areas of Japan, the developed model had better calibration and discrimination than the triple D score after external validation. Because of these characteristics, we believe that the S3HoCKwave score is a more useful clinical prediction model than others, and that it can be a better tool for SDM when determining whether SWL is appropriate. The S3HoCKwave score is perhaps the first prediction model for SWL outcomes that has undergone epidemiologically rigorous external validation.

Performance and clinical application of the prediction model
When the calculated score was 35, the predicted probability of SWL failure was almost 30%. At that cut-off value, the specificity of the model was 0.95 for the development cohort and 0.91 for the validation cohort. The positive predictive values were 0.87 for the development cohort and 0.80 for the validation cohort. This means that the S3HoCKwave score provides information for patients at high risk for SWL failure and has good predictive ability. Therefore, if the score is 35 points or more, then we may not recommend SWL.

Strength of our study compared with previous studies
Many prediction models have been developed [10][11][12][13][14]; however, most of them were nomograms and may be difficult to use in daily clinical practice because of their complexity [22]. In contrast, our prediction model using the sum score can be easily interpreted. Furthermore, the performance of previously reported prediction models has not been sufficiently validated; therefore, the reliability of their reported performance is uncertain. In fact, the reported AUC of the triple D score was 0.78 [12], but the AUC for our cohort was 0.68. In general, apparent performance is often overestimated; therefore, internal validation to correct model optimism and external validation are recommended [20]. Our study performed both internal and external validation and presented better performance compared with the previously reported triple D score. Most previous studies have not performed both internal and external validation. To our knowledge, the S3HoCKwave score is the first prediction model that has been externally validated in an epidemiologically robust way to predict SWL outcomes accurately.

Study limitations
This study had some limitations. First, it was a Japanese multicentre study, and only Asian patients participated. Although the most common components of calculi in Japan are calcium oxalate and calcium phosphate (82.8% in 2015) [2], which is comparable to the composition in the United States [24], the performance of our model for patients of other ethnicities is unknown. Further validation outside of Japan is necessary. Second, because the clinical prediction model used NCCT, the results are not applicable to patients who did not undergo NCCT. Although NCCT is recommended as a first-line diagnostic imaging tool because of its high sensitivity and specificity [25], a recent study recommended low-dose CT as a better diagnostic tool for urolithiasis because of its preserved diagnostic sensitivity and specificity and its reduced radiation dosage [26]. The predictors measured by NCCT (localisation of calculi, stone length, mean stone density, SSD) can be equivalently measured using low-dose CT [27]. Therefore, the S3HoCKwave score can be used even for patients diagnosed with upper urinary tract calculi by low-dose CT. Furthermore, we evaluated SWL outcomes using radiography, which has a lower sensitivity for small calculi than NCCT. Therefore, underdiagnosis of SWL failure might exist. However, the low diagnostic sensitivity of X-ray examinations was especially observed for stones smaller than 3 mm [28], which is smaller than our definition of clinically insignificant residual fragments. Additionally, radiographic examinations are superior to NCCT when it is necessary to limit radiation exposure [29]; therefore, we believe that outcomes measured using radiographic examinations support the usefulness of our prediction model in daily practice.

Conclusion
We used NCCT information for SWL to develop and validate a new clinical prediction model called the S3HoCKwave score. This model had a higher predictive value than previous models. Furthermore, it was useful for selecting the appropriate treatment strategies and for SDM. Additional external validation and studies are needed to enable healthcare providers to use this scale in clinical settings worldwide.

before performing the statistical analysis. Our study complies with the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement [20].
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.