A prognostic model for failure and worsening after lumbar microdiscectomy: a multicenter study from the Norwegian Registry for Spine Surgery

Objective To develop a prognostic model for failure and worsening 1 year after surgery for lumbar disc herniation. Methods This multicenter cohort study included 11,081 patients operated with lumbar microdiscectomy, registered at the Norwegian Registry for Spine Surgery. Follow-up was 1 year. Uni- and multivariate logistic regression analyses were used to assess potential prognostic factors for previously defined cut-offs for failure and worsening on the Oswestry Disability Index (ODI) scores 12 months after surgery. Since the cut-offs for failure and worsening are different for patients with low, moderate, and high baseline ODI scores, the multivariate analyses were run separately for these subgroups. Data were split into a training set (70%) and a validation set (30%). The model was developed in the training set and tested in the validation set. A prediction (%) of an outcome was calculated for each patient in a risk matrix. Results The prognostic model produced six risk matrices based on three baseline ODI ranges (low, medium, and high) and two outcomes (failure and worsening), each containing 7 to 11 prognostic factors. Model discrimination and calibration were acceptable. The estimated preoperative probabilities ranged from 3 to 94% for failure and from 1 to 72% for worsening in our validation cohort. Conclusion We developed a prognostic model for failure and worsening 12 months after surgery for lumbar disc herniation. The model showed acceptable calibration and discrimination, and could be useful in assisting physicians and patients in the clinical decision-making process prior to surgery.

equina syndrome) [3,28]. The majority of the operations are performed electively on relative indications.
Most clinical studies tend to focus on favorable outcomes after surgery based on mean improvements or success rates according to patient-reported outcome measures (PROMs) [2,3,20,28,37], and predictive models for such outcomes have been developed [22,24,25]. An efficient strategy for improving the quality and safety of the health service is to increase the focus on unfavorable outcomes [8,35]. Although the majority of patients experience substantial improvements, up to 30-40% report non-successful outcomes [2,12,23,38], and a large proportion of these cases cannot be classified as "failure" [6], indicating that non-success and failure are not interchangeable concepts.
The risk of a poor outcome is a frequent concern among patients being operated, especially the risk of getting worse, which indicates a harmful (adverse) treatment effect [32]. To enhance individualized risk prediction and prevention of unfavorable outcomes, we have previously defined benchmark criteria for both failure and worsening, based on frequently used PROMs [38]. A prediction model for unfavorable outcomes can be further developed into a risk calculator, which could enhance shared clinical decision-making and improve selection of patients prior to lumbar disc surgery.
The aim of this study was to develop a prognostic model calculating individual risk (%) for failure and worsening after surgery for lumbar disc herniation, based on a large cohort from the Norwegian Registry for Spine Surgery (NORspine). Data from this large registry cohort, collected in daily surgical practice, would ensure high external validity, and thus clinical relevance.

Design
Multicentre observational study following the recommendations for reporting in observational studies, STROBE criteria [36], and the methodological framework proposed by the PROGRESS group [34].

Study population and data collection
A total of 26,427 patients operated for degenerative disorders of the lumbar spine reported to the NORspine registry between January 1, 2007 and August 2, 2015 were screened for eligibility and followed for 12 months. The NORspine includes patients operated for degenerative disorders of the spinal column. It does not include patients with fractures, primary infections of the spine, or with spinal malignancies. Furthermore, it does not include children <16 years of age, as well as patients with known serious drug abuse or severe psychiatric disorders. For the purpose of this study, we included all patients who had a microscope or loupe assisted lumbar disc microdiscectomy for a magnetic resonance imaging (MRI) confirmed lumbar disc herniation. Both emergency and elective cases were registered. Patients diagnosed with lumbar spinal stenosis or spondylolisthesis, and those operated with more comprehensive decompression techniques including laminectomy, disc prosthesis or fusion procedures, were excluded.
The NORspine is a comprehensive clinical registry for quality control and research, covering 95% of public and private operating centers in Norway, with a completeness (proportion of operated patients reported to the registry) of 65% over the study period. It comprises a range of baseline data on known and potential predictors for different outcomes [27]. Participation in NORspine is not required for a patient to gain access to the health care, or to receive payment/reimbursement for a provider.
At admission for surgery (baseline), the patients completed a questionnaire on demographics, lifestyle issues, and the PROMs. During the hospital stay, the surgeon recorded data concerning diagnosis, treatment, and comorbidity on a standard registration form. Twelve months after surgery, a questionnaire identical to that used at baseline was distributed by regular mail. It was completed at home by the patients and returned to the central registry unit without involvement of the treating hospitals. One reminder with a new copy of the questionnaire was sent to those who did not respond.
Informed consent was obtained from all patients. The NORspine registry protocol has been approved by the Data Protection Authority of Norway. This study was submitted to the regional ethical committee for medical research which categorized it as a clinical audit study (2015/1829/ REK South-East Regional Health Authority).

Outcomes
Failure and worsening were defined according to validated cut-offs on the Oswestry Disability Index (ODI) version 2.1a, which showed the highest accuracy in identifying these outcomes when evaluated against the numeric rating scale for back pain, leg pain, and the EuroQol 5D (EQ-5D) [38]. The ODI contains ten questions about limitations of activities of daily living. Each item is rated from 0 to 5 and then transformed into a score ranging from 0 (none) to 100 (maximum pain-related disability) [4]. The ODI cut-offs have been determined according to an external anchor, the global perceived effect scale (GPE, 1-7): 1 "fully recovered," 2 "much better," 3 "somewhat better," 4 "unchanged," 5 "somewhat worse," 6 "much worse," 7 "worse than ever." Failure corresponds to GPE range 4-7, and worsening to GPE range 6-7 [38,39]. We have also shown that both the ODI change score and the final ODI score after 12 months are highly dependent on the preoperative ODI score [38,39]. Therefore, we stratified our model according to the preoperative ODI score (percentiles). Failure was defined as an ODI raw score 12 months after lumbar microdiscectomy ≥ 18 (low baseline ODI group, < 25th percentile), ≥ 29 (medium baseline ODI group, 25th to 75th percentile), and ≥ 34 (high baseline ODI group, > 75th percentile). Worsening was defined accordingly as an ODI raw score 12 months after lumbar discectomy ≥ 33 (low baseline ODI group), ≥ 47 (medium baseline ODI group), and ≥ 58 (high baseline ODI group) [38].
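The outcome definitions above can be sketched in code. This is a minimal illustration under stated assumptions, not the registry's implementation: the function names are hypothetical, the baseline group boundaries in points (<33, 33-58, >58) are taken from the statistical analyses section of this paper, and the ODI score is computed as the percentage of the maximum attainable item sum.

```python
# Illustrative sketch only; all function names are hypothetical.

# ODI raw-score cut-offs at 12 months, per baseline ODI group (from the text).
FAILURE_CUTOFF = {"low": 18, "medium": 29, "high": 34}
WORSENING_CUTOFF = {"low": 33, "medium": 47, "high": 58}


def odi_score(item_scores):
    """Convert ODI item ratings (each 0-5) into a 0-100 score."""
    if not item_scores or any(not 0 <= s <= 5 for s in item_scores):
        raise ValueError("each ODI item must be rated 0-5")
    return 100.0 * sum(item_scores) / (5 * len(item_scores))


def baseline_group(baseline_odi):
    """Assign the baseline ODI group: low (<33), medium (33-58), high (>58)."""
    if baseline_odi < 33:
        return "low"
    if baseline_odi <= 58:
        return "medium"
    return "high"


def classify_outcome(baseline_odi, odi_12_months):
    """Return (failure, worsening) flags for one patient."""
    group = baseline_group(baseline_odi)
    failure = odi_12_months >= FAILURE_CUTOFF[group]
    worsening = odi_12_months >= WORSENING_CUTOFF[group]
    return failure, worsening
```

For example, a patient with a baseline ODI of 40 (medium group) and a 12-month ODI of 30 would be classified as a failure (30 ≥ 29) but not as worsening (30 < 47).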

Possible prognostic factors
We included prognostic factors previously reported in the literature [10,12,15,17,18,29]. The sociodemographic and anthropometric factors included were gender, age > 60, obesity (body mass index, BMI ≥ 30), marital status (living alone yes/no), employment status (employed/unemployed), and low educational level (yes/no), i.e., less than 4 years of college/university education. Anxiety or depression was assessed by the corresponding item on the EuroQol-5D-3L questionnaire (yes = "moderate" to "severe" problems, no = "no problems"). In Norway, public health insurance is compulsory; thus, no distinction was made between public or private insurance, or between public and private hospitals. A recent study has shown equivalent effectiveness of lumbar disc surgery between the public and private sector [21]. Patients were also asked if they had a pending or unresolved claim or litigation issue (yes/no) against (1) the Norwegian public welfare agency fund concerning permanent disability pension or (2) a compensation claim against private insurance companies or the public Norwegian System of Compensation to Patients. As shown in the tables, we also assessed other clinical parameters, including the baseline PROM scores, smoking, duration of symptoms, previous lumbar spine surgery, and use of analgesics [12,15,17,18,29].
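As an illustration of how the sociodemographic factors described above reduce to yes/no indicators, the sketch below encodes one patient record. The record field names are hypothetical and do not reflect the NORspine data dictionary. The EQ-5D-3L anxiety/depression item is assumed to be coded 1 (no problems), 2 (moderate), or 3 (severe).

```python
# Illustrative sketch only; the record field names are hypothetical.

def dichotomize(patient):
    """Map a raw patient record (dict) to yes/no prognostic factors."""
    bmi = patient["weight_kg"] / patient["height_m"] ** 2
    return {
        "female": patient["gender"] == "female",
        "age_over_60": patient["age"] > 60,
        "obese": bmi >= 30,
        "living_alone": bool(patient["living_alone"]),
        "unemployed": not patient["employed"],
        # Low education: less than 4 years of college/university education.
        "low_education": patient["college_years"] < 4,
        # Assumed coding: 1 = no problems, 2 = moderate, 3 = severe.
        "anxiety_or_depression": patient["eq5d_anxiety_level"] >= 2,
    }
```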

Statistical analyses
All statistical analyses were performed with the Statistical Package for the Social Sciences (SPSS, IBM Version 23.0) and R (Version 2.13.1). To assess potential sources of selection bias among patients, baseline differences between respondents and non-respondents at 12 months of follow-up were evaluated using Student's t-test for continuous variables or the chi-square test for pairs of categorical variables. The proportions of missing data were small, <10% for all the analyzed variables. No imputation of missing values was performed.
Cases were selected for the training set (70%, n = 5741) and validation set (30%, n = 2218) by the random sample function in SPSS (Fig. 1) [7]. The models were built using the training set, and then the final models were assessed in the validation set. Since the ODI threshold values for failure and worsening after 12 months depend on the preoperative ODI baseline score, we stratified the prediction model into three groups by percentiles of the baseline ODI score, "low" (<33), "medium" (33-58), and "high" (>58), for each outcome [38,39].
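The random split into training and validation sets was done in SPSS. As a rough equivalent, the following Python sketch shows one way to draw such a split. The function name and the seed are assumptions for illustration. It draws an exact 70% share, so the resulting set sizes will not necessarily match the reported n.

```python
import random


def split_train_validation(patient_ids, train_fraction=0.7, seed=2015):
    """Randomly split patient IDs into a training and a validation set."""
    rng = random.Random(seed)  # fixed seed (arbitrary) for reproducibility
    shuffled = list(patient_ids)
    rng.shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]
```

Fixing the seed makes the split reproducible, so the validation set stays untouched while the models are developed.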

Training set
The outcomes failure versus no failure and worsening versus no worsening were modeled separately (Fig. 2). Crude associations between each selected covariate and the outcome were assessed using univariate logistic regression. Variables that reached p < 0.1 in these analyses were entered into the multivariate analyses (binary logistic regression model). In the next step, variables that were no longer statistically significant (p > 0.05) were removed from the model using backward selection. We chose to include gender and age in all models, irrespective of their statistical significance [31]. Continuous variables were dichotomized in order to be adapted into a risk matrix. Collinearity between possible predictors was assessed with Spearman's rho, with correlation coefficients (CC) >0.3 considered as weak, >0.5 as moderate, and >0.7 as strong. Associations between outcomes and prognostic factors were expressed as odds ratios (OR) with a 95% confidence interval (CI). Regression coefficients from the final models were converted into probabilities for the risk matrix. Depending on the presence or absence of the risk factors, the matrix then calculated a probability for both failure and worsening for each patient.
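The conversion of regression coefficients into risk-matrix probabilities can be sketched as follows. This is a minimal illustration of the standard logistic formula p = 1 / (1 + exp(-(b0 + Σ b_i x_i))) with dichotomized predictors. The function names, the intercept, and the coefficients are hypothetical and are not the values reported in Tables 2 and 3.

```python
import math
from itertools import product


def predicted_probability(intercept, coefficients, factors):
    """Logistic model: p = 1 / (1 + exp(-(b0 + sum of b_i for present factors)))."""
    linear = intercept + sum(
        beta for name, beta in coefficients.items() if factors.get(name, False)
    )
    return 1.0 / (1.0 + math.exp(-linear))


def risk_matrix(intercept, coefficients):
    """Enumerate every yes/no combination of factors with its probability."""
    names = sorted(coefficients)
    matrix = {}
    for combo in product([False, True], repeat=len(names)):
        factors = dict(zip(names, combo))
        matrix[combo] = predicted_probability(intercept, coefficients, factors)
    return names, matrix


# Hypothetical coefficients (log odds ratios), NOT the values from Tables 2-3.
example_coefficients = {
    "smoking": math.log(1.5),
    "previous_surgery": math.log(2.0),
}
# Hypothetical intercept: baseline odds of 0.25, i.e. a probability of 0.20.
example_intercept = math.log(0.25)
```

With these made-up numbers, a patient with neither factor has a predicted probability of 0.20, whereas a patient with both has odds of 0.25 × 1.5 × 2.0 = 0.75, i.e. a probability of about 0.43. Each coefficient is the natural logarithm of the corresponding odds ratio, which is why odds ratios multiply while coefficients add.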

Validation set
For each model, calibration was assessed by dividing the sample into four prediction groups (quartiles) with increasing probabilities for failure and worsening. We then plotted the observed proportion for these outcomes against the average predicted probability, using a logistic regression model with the observed binary outcome as dependent and the log odds of the validated regression model as independent. A chi-square test was used to assess the difference between the predicted coordinates and the optimal prediction line. Significant deviation, indicating over- or underestimation, was defined as p-values <0.1. Discrimination was assessed by the c-criterion (C), calculated as the area under the curve (AUC) in a receiver operating characteristic (ROC) analysis, plotting predicted probability against failure and worsening. C values >0.6 were considered acceptable [31].
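The two validation measures described above can be illustrated with a short sketch. The function names are hypothetical. The c-statistic is computed here as the proportion of concordant case/non-case pairs, which equals the AUC, and calibration is summarized as the mean predicted probability versus the observed proportion in four risk groups. The chi-square test of deviation from the optimal prediction line is not reproduced.

```python
# Illustrative sketch only; function names are hypothetical.

def c_statistic(predicted, observed):
    """AUC: probability that a random patient with the outcome has a higher
    predicted risk than a random patient without it (ties count one half)."""
    positives = [p for p, y in zip(predicted, observed) if y == 1]
    negatives = [p for p, y in zip(predicted, observed) if y == 0]
    if not positives or not negatives:
        raise ValueError("both outcome classes are required")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def calibration_by_quartile(predicted, observed):
    """Return (mean predicted, observed proportion) for four risk groups."""
    pairs = sorted(zip(predicted, observed))
    n = len(pairs)
    groups = []
    for q in range(4):
        chunk = pairs[q * n // 4:(q + 1) * n // 4]
        if chunk:
            mean_pred = sum(p for p, _ in chunk) / len(chunk)
            obs_rate = sum(y for _, y in chunk) / len(chunk)
            groups.append((mean_pred, obs_rate))
    return groups
```

In a well-calibrated model, the observed proportion in each group lies close to the mean predicted probability. The pairwise AUC loop is quadratic in the sample size, which is acceptable for a cohort of a few thousand patients.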

Study population and data collection
We included 11,081 patients in the analyses. Of these, 3621 (32.7%) were lost to follow-up 12 months after surgery (Fig. 1). Baseline characteristics for the entire study population are shown in Table 1.
Mean age was 47.8 years (SD 13.61), and 42% of patients were female. Non-respondents at 12 months were younger, more likely to be men, had less severe comorbidity and less severe limb paresis, but were more likely to be smokers, obese, anxious or depressed, and previously operated. There were no clinically relevant differences in baseline pain and disability (PROMs) between respondents and non-respondents. The amount (n, %) of missing data for the prognostic factors was low for age (6, 0.01), gender (none), non-native Norwegian speaker (19,

Prognostic factors and outcomes
Tables 6 and 7 in the supplementary appendix show the results from the univariate analyses for all potential prognostic factors for failure and worsening, in both the training and validation sets. The results from the multivariate regression analyses for all three ODI baseline groups are shown in Table 2 (failure) and Table 3 (worsening). Duration of preoperative back pain was highly correlated (CC >0.7) with duration of preoperative leg pain. Duration of preoperative leg pain was consequently excluded from the model because

Table 1 Baseline characteristics including patient-reported outcome measures of respondents vs. non-respondents (lost to follow-up)
1 Less than 4 years of college/university education. 2 Rheumatoid arthritis, ankylosing spondylitis, other rheumatic disorder, hip arthrosis, knee arthrosis, chronic generalized musculoskeletal pain, chronic neurologic disorder, cerebrovascular disorder, heart disease, vascular disease, chronic lung disease, cancer, osteoporosis, hypertension, diabetes mellitus, other endocrine disorder. 3 American Society of Anesthesiologists grade. 4 Body mass index ≥ 30. 5 EQ-5D 3L questionnaire; 5th item, moderate to severe problems. 6 Pending medical claim/litigation against the Norwegian public welfare agency fund concerning disability pension. 7 Pending medical compensation claim/litigation against private insurance companies or the public Norwegian System of Compensation to Patients. 8 Oswestry Disability Index, 0-100 (no-maximal disability). 9 Numeric rating scale (0-10)

Table 2 Results from the multiple regression model showing associations (odds ratio (OR) and 95% confidence intervals (CI)) between predictors and patient-reported "failure" (unchanged or worse, yes/no) of lumbar disc surgery, as defined by validated cut-offs on the Oswestry Disability Index (ODI), split by subgroups with low, medium, and high baseline ODI scores (percentiles). For all predictors, except age and gender, NS indicates statistical insignificance, p value > 0.05
1 Range: 0-100 (no-maximal disability). The ODI score was < 33, 33-58, and > 58 in the subgroups with low, medium, and high baseline disability, respectively. 2 Less than 4 years of college/university education. 3 American Society of Anesthesiologists grade. 4 Body mass index ≥ 30. 5 EQ-5D 3L questionnaire; 5th item, moderate to severe problems. 6 Numeric rating scale (0-10). 7 Pending medical claim/litigation against the Norwegian public welfare agency fund concerning disability pension. 8

The combination of the presence (yes) or absence (no) of each prognostic factor, together with their respective odds ratios (Tables 2 and 3), yields an overall probability for failure or worsening in each of the three ODI baseline groups. The matrices are shown as a flow chart (Fig. 3). Table 4 illustrates three example cases from the risk matrices applied to the validation set. Each patient was allocated to one of six matrices, based on the baseline ODI (3 subgroups) and the outcome (2 subgroups). In the validation cohort, the individual predicted risk ranged from 3 to 94% for failure, and from 1 to 72% for worsening.
The calibration plots showing agreement between the average predicted and observed proportion of failure and worsening (Fig. 2) illustrate that the predicted and observed probabilities coincided well. There was no statistically significant deviation of the coordinates from the optimal

Fig. 3 Model algorithm for the three ODI baseline groups. Based on the preoperative ODI the patient will be classified via one of the three pathways, calculating an overall risk for either failure or worsening. Risk is calculated from the odds of each risk factor. The risk factors are listed in random order, and their place in the sequence does not reflect their odds

Discussion
We have developed a prognostic model for unfavorable outcomes 12 months after surgery for lumbar disc herniation, based on validated and recommended PROMs [5]. The model can identify patients with a high and low baseline probability for those outcomes. Patients with low, medium, and high baseline ODI scores were associated with different sets of prognostic factors. Each factor has a different impact on the probability, shown as odds ratios in Tables 2 and 3. Higher odds ratios indicate higher probability for the outcome. The estimated preoperative probabilities in our study population ranged from 3% to 94% for failure and from 1% to 72% for worsening, exemplified by three cases. The model can be presented to surgeons and patients as a risk calculator, to facilitate individualized treatment recommendations.
It is important to acknowledge the conceptual differences between prognostic modeling and prognostic factor research. The prognostic model developed in our study aims at calculating the overall probability (individual absolute risk) for an outcome. Our study was not designed for prognostic factor research, which focuses on identifying independent prognostic (risk) factors [30,34]. Still, our results can lend support to previous studies identifying a long duration of low back pain and leg pain, anxiety and/or depression, previous back surgery, smoking, lower education, BMI, and unresolved disability pension or insurance issues as predictors for inferior outcomes [12,15,17,18,29].
Prediction models have to balance the need for accurate predictions against the risk of overfitting. Model overfitting implies lack of generalizability, i.e., the model might work well for the population it was developed on, but not for others [26]. For instance, it is important not to include too many and/or too specific covariates. Our model appeared to be well balanced between an acceptable accuracy and a limited number of predictors, which are available in most clinical trials and regular clinical practice at the hospitals. We stratified our model by different levels of baseline disability (low, medium, and high ODI score), since the outcome score is highly dependent on the baseline score, and the actual cut-offs for failure and worsening are different in these subgroups [16,18,38].
The discriminative ability of the risk matrices was acceptable. Calibration assessment showed that for patients with high baseline disability (>75th percentile of ODI) the model tended to underestimate the proportion of worsening, and the prediction of worsening among those cases was too inaccurate. A reason could be the small sample size (type II error) of this subgroup, or confounding due to unmeasured factors, such as widespread body pain and pain interference [1]. Confounding is the most likely source of bias in our study. We assessed anxiety and depression using one item of the EQ-5D 3L questionnaire, instead of a condition-specific questionnaire, which could be more sensitive. This may represent an information bias [12].
All cases of lumbar disc herniation were verified on MRI scans, evaluated by radiologists and surgeons. However, we did not have data on more specific morphological changes, e.g., contained versus uncontained herniation or additional Modic changes, which could influence the surgeon's recommendation about surgery. This illustrates that statistical probabilities cannot be used as a surrogate for clinical judgement, but rather as a supplementary decision support.

Table 4 Example cases from the validation set (patients 1-3) with different predicted probability (6 risk matrices) for failure and worsening based on baseline ODI score and presence (yes) or absence (no) of predictors. An open cell indicates that the predictor was not relevant for the risk matrix the patient was assigned to
1 Range: 0-100 (no-maximal disability). 2 Less than 4 years of college/university education. 3 American Society of Anesthesiologists grade. 4 Body mass index ≥ 30. 5 EQ-5D 3L questionnaire; 5th item, moderate to severe problems. 6 Numeric rating scale (0-10). 7 Pending medical claim/litigation against the Norwegian public welfare agency fund concerning disability pension. 8

We suggest that our model could be used in cases where the indication for surgery is uncertain. The model could also be helpful in calibrating surgeons' and patients' expectations about surgical outcomes.
To the best of our knowledge, this is the first registry study modeling unfavorable patient-reported outcomes after lumbar disc surgery. Three American studies have assessed patient populations operated for different degenerative spine disorders, including disc replacement and arthrodesis surgery [16,24,25]. The models were developed for predicting improvements, such as minimal clinically important change (MCIC), rather than unfavorable outcomes. Interestingly, 12 months of follow-up data from the latter paper by Khor et al. on a subgroup of 528 surgical patients showed that 222 of them reported an unsuccessful outcome (not reaching MCIC on the ODI scale) [16]. Of these, 86 (39%) reported to be unchanged or worse. The remaining 136 (61%) did not, hence representing a "grey zone" of patients with minor improvements. This supports our strategy of distinguishing failed from non-successful outcomes [38,39].
Registry-based studies collecting "real-life" data from daily clinical practice have advantages such as large sample sizes and high external validity, but also limitations such as lower follow-up rates [11]. Loss to follow-up at 12 months was 32.7%. Baseline characteristics linked to inferior outcomes seemed to be equally distributed between responders and non-responders. Still, loss to follow-up could represent a selection bias, especially when estimating exact failure and worsening rates. However, two Scandinavian registry studies on similar patient populations found that loss to follow-up did not bias conclusions about treatment effects [13,33]. Moreover, the objective of our study was not to evaluate effectiveness, but rather to develop a prediction model over a wide range of outcomes.
The model should be externally validated in other cohorts, and its feasibility should be confirmed by patients and clinicians before being implemented in regular clinical practice. Importantly, we have not assessed outcomes after non-operative treatment. Therefore, it is highly uncertain if the model could be useful in other settings, e.g., among patients seen in general practice.

Conclusion
We have developed a prognostic model to identify patients at risk of unfavorable outcomes after lumbar microdiscectomy, which could assist physicians and patients in clinical decision-making prior to surgery in cases where the indication for surgery is not clear cut. The model accounts for patients with different levels of preoperative disability and corresponding prognostic factors, facilitating individual based treatment recommendations.

Table 5 Failure and worsening 12 months after surgery for subgroups of different baseline disability (low, medium and high percentiles of the ODI score) in the training (n = 5741, 70%) and validation (n = 2218, 30%) set
1 Baseline ODI group based on the baseline percentile of the ODI score: low (<25th percentile, <33 points), medium (25th-75th percentile, 33-58 points), high (>75th percentile, >58 points). ODI range: 0-100 (no-maximal disability)

Table 6
Results from the univariate binary logistic regression analyses of failure in both the training (n = 5741) and validation (n = 2218) cohort, showing associations (odds ratio (OR) and 95% confidence intervals (CI)) between predictors and patient-reported "failure" (unchanged or worse, yes/no) of lumbar disc surgery, as defined by validated cut-offs on the Oswestry Disability Index (ODI), split by subgroups with low, medium, and high baseline ODI scores (percentiles). For all predictors, except age and gender, NS indicates statistical insignificance, p value > 0.1
1 Range: 0-100 (no-maximal disability). The ODI score was <33, 33-58, and >58 in the subgroups with low, medium, and high baseline disability. 2 Less than four years of college/university education. 3 Body mass index ≥30. 4 American Society of Anesthesiologists grade. 5 EQ-5D 3L questionnaire; 5th item, moderate to severe problems. 6 Pending medical claim/litigation with the Norwegian public welfare agency fund concerning disability pension. 7 Pending medical compensation claim/litigation against private insurance companies or the public Norwegian System of Compensation to Patients. 8 Numeric rating scale (0-10)

Table 7
Results from the univariate binary logistic regression analyses of worsening in both the training (n = 5741) and validation (n = 2218) cohort, showing associations (odds ratio (OR) and 95% confidence intervals (CI)) between predictors and patient-reported worsening (yes/no) of lumbar disc surgery, as defined by validated cut-offs on the Oswestry Disability Index (ODI), split by subgroups with low, medium, and high baseline ODI scores (percentiles). For all predictors, except age and gender, NS indicates statistical insignificance, p value > 0.1
1 Range: 0-100 (no-maximal disability). The ODI score was <33, 33-58, and >58 in the subgroups with low, medium, and high baseline disability. 2 Less than four years of college/university education. 3 Body mass index ≥30. 4 American Society of Anesthesiologists grade. 5 EQ-5D 3L questionnaire; 5th item, moderate to severe problems. 6 Pending medical claim/litigation with the Norwegian public welfare agency fund concerning disability pension. 7 Pending medical compensation claim/litigation against private insurance companies or the public Norwegian System of Compensation to Patients. 8

Funding The main author, David Werner, has received grants from the Regional Health Authority of Northern Norway, and the Norwegian Medical Association - Foundation for quality improvement and patient safety, for the purpose of this project.

Declarations
Ethics approval All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee (name of institute/committee) and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards. This project was drafted with the regional ethical committee, which categorized it as a clinical audit study, not in need of their formal approval (REK 22,845,518.06.2015).

Informed consent Informed consent was obtained from all individual participants of this study.
Conflict of interest The authors declare that they have no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.