Background

Worldwide, low back pain is the leading cause for years lived with disability [14]. The most common indication for low back surgery is sciatica caused by lumbar disc herniation (LDH) [9]. The lifetime prevalence of sciatica in the general population has been reported between 12 and 27% [19]. If left untreated, most patients with LDH will have a favorable outcome. Surgery is typically offered to patients with persisting and/or intolerable leg pain with or without low back pain, or with severe limb or bowel/bladder paresis (cauda equina syndrome) [3, 28]. The majority of the operations are performed electively on relative indications.

Most clinical studies tend to focus on favorable outcomes after surgery based on mean improvements or success rates according to patient-reported outcome measures (PROMs) [2, 3, 20, 28, 37], and predictive models for such outcomes have been developed [22, 24, 25]. An efficient strategy for improving the quality and safety of the health service is to increase the focus on unfavorable outcomes [8, 35]. Although the majority of patients experience substantial improvements, up to 30–40% report non-successful outcomes [2, 12, 23, 38], and a large proportion of these cases cannot be classified as “failure” [6], indicating that non-success and failure are not interchangeable concepts.

The risk of a poor outcome is a frequent concern among patients being operated, especially the risk of getting worse, which indicates a harmful (adverse) treatment effect [32]. To enhance individualized risk prediction and prevention of unfavorable outcomes, we have previously defined benchmark criteria for both failure and worsening, based on frequently used PROMs [38]. A prediction model for unfavorable outcomes can be further developed into a risk calculator, which could enhance shared clinical decision-making and improve selection of patients prior to lumbar disc surgery.

The aim of this study was to develop a prognostic model calculating individual risk (%) for failure and worsening after surgery for lumbar disc herniation, based on a large cohort from the Norwegian registry for spine surgery (NORspine). Data from this large registry cohort, collected in daily surgical practice, would ensure high external validity, and thus clinical relevance.

Material and methods

Design

Multicentre observational study following the recommendations for reporting in observational studies, STROBE criteria [36], and the methodological framework proposed by the PROGRESS group [34].

Study population and data collection

A total of 26,427 patients operated for degenerative disorders of the lumbar spine reported to the NORspine registry between January 1, 2007 and August 2, 2015 were screened for eligibility and followed for 12 months. The NORspine includes patients operated for degenerative disorders of the spinal column. It does not include patients with fractures, primary infections of the spine, or with spinal malignancies. Furthermore, it does not include children <16 years of age, as well as patients with known serious drug abuse or severe psychiatric disorders. For the purpose of this study, we included all patients who had a microscope or loupe assisted lumbar disc microdiscectomy for a magnetic resonance imaging (MRI) confirmed lumbar disc herniation. Both emergency and elective cases were registered. Patients diagnosed with lumbar spinal stenosis or spondylolisthesis, and those operated with more comprehensive decompression techniques including laminectomy, disc prosthesis or fusion procedures, were excluded.

The NORspine is a comprehensive clinical registry for quality control and research, covering 95% of public and private operating centers in Norway, with a completeness (proportion of operated patients reported to the registry) of 65% over the study period. It comprises a range of baseline data on known and potential predictors for different outcomes [27]. Participation in NORspine is not required for a patient to gain access to the health care, or to receive payment/reimbursement for a provider.

At admission for surgery (baseline), the patients completed a questionnaire on demographics, lifestyle issues, and the PROMs. During the hospital stay, the surgeon recorded data concerning diagnosis, treatment, and comorbidity on a standard registration form. Twelve months after surgery, a questionnaire identical to that used at baseline was distributed by regular mail. It was completed at home by the patients and returned to the central registry unit without involvement of the treating hospitals. One reminder with a new copy of the questionnaire was sent to those who did not respond.

Informed consent was obtained from all patients.

The NORspine registry protocol has been approved by the Data Protection Authority of Norway. This study was submitted to the regional ethical committee for medical research which categorized it as a clinical audit study (2015/1829/REK South-East Regional Health Authority).

Outcomes

Failure and worsening were defined according to validated cut-offs on the Oswestry Disability Index (ODI) version 2.1a, which showed the highest accuracy identifying these outcomes when evaluated against the numeric rating scale for back pain, leg pain, and the EuroQol 5D (EQ-5D) [38]. The ODI contains ten questions about limitations of activities of daily living. Each item is rated from 0 to 5 and then transformed into a score ranging from 0 (none) to 100 (maximum pain-related disability) [4]. The ODI cut-offs have been determined according to an external anchor, the global perceived effect scale (GPE, 1–7): 1 “fully recovered,” 2 “much better,” 3 “somewhat better,” 4 “unchanged,” 5 “somewhat worse,” 6 “much worse,” 7 “worse than ever.” Failure corresponds to GPE range 4–7, and worsening to GPE range 6–7 [38, 39]. We have also shown that that both the ODI change score, as well as the final ODI score after 12 months are highly dependent on the preoperative ODI score [38, 39]. Therefore, we stratified our model according to the preoperative ODI score (percentiles). Failure was defined as an ODI raw score 12 months after lumbar microdiscectomy ≥18 (low baseline ODI group, < 25 percentile), ≥ 29 (medium baseline ODI group, 25 to 75 percentile), and ≥ 34 (high baseline ODI group, > 75th percentile). Worsening was defined accordingly as an ODI raw score 12 months after lumbar discectomy ≥33 (low baseline ODI group), ≥ 47 (medium baseline ODI group), and ≥ 58 (high baseline ODI group) [38].

Possible prognostic factors

We included prognostic factors, previously reported in the literature [10, 12, 15, 17, 18, 29]. Sociodemographic and anthropometric factors included were; gender, age > 60, obesity (body mass index, BMI ≥ 30), marital status (living alone yes/no), employment status (employed/unemployed), and low educational level (yes/no), i.e., less than 4 years of college/university education. Anxiety or depression was assessed by the item on the EuroQol-5D-3L questionnaire, (yes = “moderate” to “severe” problems, no = “no problems”). In Norway, public health insurance is compulsory; thus, no distinction was made between public or private insurance, or between public and private hospitals. A recent study has shown equivalent effectiveness of lumbar disc surgery between the public and private sector [21]. Patients were also asked if they had a pending or unresolved claim or litigation issue (yes/no) against (1) the Norwegian public welfare agency fund concerning permanent disability pension or (2) a compensation claim against private insurance companies or the public Norwegian System of Compensation to Patients. As shown in the tables, we also assessed other clinical parameters, including the baseline PROM scores, smoking, duration of symptoms, previous lumbar spine surgery, and use of analgesics [12, 15, 17, 18, 29].

Statistical analyses

All statistical analyses were performed with the Statistical Package for the Social Sciences (SPSS, IBM Version 23.0) and R (Version 2.13.1.) To assess potential sources of selection bias among patients, baseline differences between respondents and non-respondents at 12 months of follow-up were evaluated using the Students t-test for continuous variables or chi-square test for pairs of categorical variables. The proportions of missing data were small, <10% for all the analyzed variables. No imputation of missing values was performed.

Cases were selected for the training set (70%, n = 5741) and validation set (30%, n = 2218,) by the random sample function in SPSS (Fig. 1) [7]. The models were built using the training set, and then the final models were assessed in the validation set. Since the ODI threshold values for failure and worsening after 12 months depend on the preoperative ODI baseline score, we stratified the prediction model into the three ODI percentiles of “low” ODI baseline scores (<33), “medium” (33–58), and “high” (>58) for each outcome[38, 39].

Fig. 1
figure 1

Flow diagram of patient enrollment, exclusion and allocation

Training set

The outcomes failure versus no failure and worsening versus no worsening were modeled separately (Fig. 2).

Fig. 2
figure 2

Model validation. Observed proportion of the outcome (with confidence interval) on the vertical axis against average predicted probability of the outcome on the horizontal axis. Each coordinate with whiskers represents one quartile of estimated probability and its 95% confidence interval, compared to the observed proportion of the predicted outcome. The p value from the chi square test for the coordinates vs the optimal prediction line is indicated in the lower right corner. A p value < 0.1 indicates significant deviation from the average predicted probability. a, b, c show prediction of failure for the three baseline invalidity groups (A: baseline ODI < 25th percentile, B 25–75th percentile, C > 75th percentile). d, e, f show prediction of worsening for the three baseline invalidity groups (D: baseline ODI < 25th percentile, E: 25th–75th percentile, F: > 75th percentile)

Crude associations between each selected covariate and the outcome were assessed using univariate logistic regression. Variables that reached p < 0.1 in these analyses were entered into the multivariate analyses (binary logistic regression model). In a next step, variables that were no longer statistically significant (p < 0.05) were removed from the model using backward selection. We chose to include gender and age in all models, irrespectively of their statistical significance [31]. Continuous variables were dichotomized in order to be adapted into a risk matrix. Collinearity between possible predictors was assessed with Spearmans rho, with correlation coefficients (CC) >0.3 considered as weak, >0.5 as moderate, and > 0.7 as strong. Associations between outcomes and prognostic factors were expressed as odds ratios (OR) with a 95% confidence interval (CI). Regression coefficients from the final models were converted into probabilities for the risk matrix. Depending on the presence or absence of the risk factors, the matrix then calculated a probability for both failure and worsening for each patient.

Validation set

For each model, calibration was assessed by dividing the sample into four prediction groups (quartiles) with increasing probabilities for failure and worsening. We then plotted the observed proportion for these outcomes against the average predicted probability, using a logistic regression model with the observed binary outcome as dependent and the log odds of the validated regression model as independent. Chi square test was used to assess difference between predicted coordinates and the optimal prediction line. Significant deviation, indicating over- or underestimation, was defined as p-values <0.1. Discrimination was assessed by the c-criterion (C), calculated as the area under the curve (AUC) in a receiver operating analysis (ROC), plotting predicted probability against failure and worsening. C values >0.6 were considered acceptable [31].

Results

Study population and data collection

We included 11,081 patients in the analyses. Of these, 3621 (32.7%) were lost to follow-up 12 months after surgery (Fig. 1). Baseline characteristics for the entire study population are shown in Table 1.

Table 1 Baseline characteristics including patient-reported outcome measures of respondents vs. non-respondents (lost to follow-up)

Mean age was 47.8 years (SD 13.61), and 42% of patients were females. Non-respondents at 12 months were younger, more likely to be men, had less severe comorbidity, and less severe limb paresis, but were more likely to be smokers, obese, anxious or depressed, and previously operated. There were no clinically relevant differences in baseline pain and disability (PROMS) between respondents and non-respondents. The amount (n, %) of missing data for the prognostic factors was low for age (6, 0.01), gender (none), non-native Norwegian speaker (19, 0) living alone (43, 0.01), smoking (76, 0.01), having low education (52, 0.01), BMI (522, 11.2), American Association of Anesthesiologists (ASA) grade > 2 (128, 1.8), unresolved disability pension issue (182, 3.4), unresolved insurance claim (171, 3.4), anxiety/depression (117, 1.6), duration of back pain >12 months (391, 5.6), back pain intensity (176, 2.4), and leg pain intensity (157, 2.2). Patient-reported outcomes by baseline ODI (percentiles) subgroups in the training and the validation sets are shown in table 5 (supplementary appendix). For the entire study population, a total of 1779 cases (24.1%) were classified as failed and 469 (6.3%) as worsened.

Prognostic factors and outcomes

Tables 6 and 7 in the supplementary appendix show the results from the univariate analyses for all potential prognostic factors for failure and worsening, in both the training and validation sets. The results from the multivariate regression analyses for all three ODI baseline groups are shown in Table 2 (failure) and 3 (worsening). Duration of preoperative back pain was highly correlated (CC >0.7) with duration of preoperative leg pain. Duration of preoperative leg pain was consequently excluded from the model because of suspected multi-collinearity. Otherwise, all correlations between potential prognostic factors were low (CC ≤0.3).

Table 2 Results from the multiple regression model showing associations (odds ratio (OR) and 95% confidence intervals (CI)) between predictors and patient-reported “failure” (unchanged or worse, yes/no) of lumbar disc surgery, as defined by validated cut offs on the Oswestry Disability Index (ODI), split on subgroups with low, medium and high baseline ODI scores (percentiles). For all predictors, except age and gender, NS indicates statistical insignificance, p value > 0.05
Table 3 Results from the multiple regression model showing associations (odds ratio (OR) and 95% confidence intervals (CI)) between predictors and patient-reported worsening (yes/no) after lumbar disc surgery, as defined by validated cut offs on the Oswestry Disability Index (ODI), split on subgroups with low, medium, and high baseline ODI scores (percentiles). For all predictors, except age and gender, NS indicates statistical insignificance, p value > 0.05

The combination of the presence (yes) or absence (no) of each prognostic factor, as well as their respective odds ratios (Tables 2 and 3), yield an overall probability for failure or worsening in each of the three ODI baseline groups. The matrices are shown as a flow chart (Fig. 3). Table 4 illustrates three example cases from the risk matrices applied on the validation set. Each patient was allocated into 1 out of 6 matrices, based the baseline ODI (3 subgroups) and outcomes (2 subgroups). In the validation cohort, the individual predicted risk ranged from 3 to 94% for failure, and from 1 to 72% for worsening.

Fig. 3
figure 3

Model algorithm for the three ODI baseline groups. Based on the preoperative ODI the patient will be classified via one of the three pathways, calculating an overall risk for either failure or worsening. Risk is calculated from the odds of each risk factor. The risk factors are listed in random order, and their place in the sequence does not reflect their odds

Table 4 Example cases from the validation set (patients 13) with different predicted probability (6 risk matrices) for failure and worsening based on baseline ODI score and presence (yes) or absence (no) of predictors. An open cell indicates that predictor was not relevant for the risk matrix the patient was assigned to

The calibration plots showing agreement between the average predicted and observed proportion of failure and worsening (Fig. 2) illustrate that the predicted and observed probabilities coincided well. There was no statistically significant deviation of the coordinates from the optimal prediction line, except for the model predicting worsening in the >75th percentile ODI baseline group.

C-criterion values (95% CI) were 0.68 (0.63–0.73), 0.74 (0.70–0.78), and 0.71 (0.66–0.76) for prediction of failure in the low, medium, and high baseline ODI groups, respectively, indicating acceptable discrimination. The corresponding c-criterion values for predicting worsening were similar: 0.68 (0.60–0.76), 0.74 (0.68–0.79), and 0.71 (0.61–0.81). All ROC curves for C calculations are shown in the supplementary appendix (Figs. 4, 5, 6, 7, 8 and 9).

Discussion

We have developed a prognostic model for unfavorable outcomes 12 months after surgery for lumbar disc herniation, based on validated and recommended PROMs [5]. The model can identify patients with a high and low baseline probability for those outcomes. Patients with low, medium, and high baseline ODI scores were associated with different sets of prognostic factors. Each factor has a different impact on the probability, shown as odds ratios in Tables 2 and 3. Higher odds ratios indicate higher probability for the outcome. The estimated preoperative probabilities in our study population ranged from 3% to 94% for failure and from 1% to 72% for worsening, exemplified by three cases. The model can be presented to surgeons and patients as a risk calculator, to facilitate individualized treatment recommendations.

It is important to acknowledge the conceptual differences between prognostic modeling and prognostic factors research. The prognostic model, developed in our study, aims at calculating the overall probability (individual absolute risk) for an outcome. Our study was not designed for prognostic factor research, which focuses on identifying independent prognostic (risk) factors [30, 34]. Still, our results can lend support to previously studies identifying a long duration of low back pain and leg pain, anxiety and/or depression, previous back surgery, smoking, lower education, BMI, and unresolved disability pension or insurance issues as predictors for inferior outcomes[12, 15, 17, 18, 29].

Prediction models have to balance the need for accurate predictions against the risk of overfitting. Model overfitting implies lack of generalizability, i.e., it might work well for the population it was developed on, but not for others [26]. For instance, it is important not to include too many and/or too specific covariates. Our model appeared to be well balanced between an acceptable accuracy and a limited number of predictors, which are available in most clinical trials and regular clinical practice at the hospitals. We stratified our model by different levels of baseline disability (low, medium, and high ODI score), since the outcome score is highly dependent on the baseline score, and the actual cut offs for failure and worsening are different in these subgroups [16, 18, 38].

The discriminative ability of risk the matrices was acceptable. Calibration assessment showed that for patients with high baseline disability (>75th percentile of ODI) the model tended to underestimate the proportion of worsening, and the prediction of worsening among those cases was too inaccurate. A reason could be the small sample size (type II error) of this subgroup, or confounding due to unmeasured factors, such as widespread body pain and pain interference [1]. Confounding is the most likely source of bias in our study. We assessed anxiety and depression using one item of the EQ-5D 3L questionnaire, instead of a condition specific questionnaire which could be more sensitive. This may represent an information bias [12].

All cases of lumbar disc herniation were verified on MRI scans, evaluated by radiologists and surgeons. However, we did not have data on more specific morphological changes, e.g., contained versus uncontained herniation or additional Modic changes, which could influence the surgeon’s recommendation about surgery. This illustrates that statistical probabilities cannot be used as surrogate for clinical judgement, but rather as a supplementary decision support. We suggest that our model could be used in cases where the indication for surgery is uncertain. The model could be also helpful in calibrating surgeons’ and patients’ expectations about surgical outcomes.

To the best of our knowledge, this is the first registry study modeling unfavorable patient-reported outcomes after lumbar disc surgery. Three American studies have assessed patient populations operated for different degenerative spine disorders, including disc replacement and arthrodesis surgery [16, 24, 25]. The models were developed for predicting improvements, such as minimal clinically important change (MCIC), rather than unfavorable outcomes. Interestingly, 12 months of follow-up data from the latter paper by Khor et al. on a subgroup of 528 surgical patients showed that 222 of them reported an unsuccessful outcome (not reaching MCIC on the ODI scale) [16]. Of these, 86 (39%) reported to be unchanged or worse. The remaining 136 (61%) did not, hence representing a “grey zone” of patients with minor improvements. This supports our strategy of distinguishing failed from non-successful outcomes [38, 39].

Registry-based studies collecting “real-life” data from daily clinical practice have advantages such as large sample sizes and high external validity, but also limitations such as lower follow-up rates [11]. Loss to follow-up at 12 months was 32.7%. Baseline characteristics-linked inferior outcomes seemed to be equally distributed between responders and non-responders. Still, loss to follow-up could represent a selection bias, especially when estimating exact failure and worsening rates. However, two Scandinavian registry studies on similar patient populations found that loss to follow-up did not bias conclusions about treatment effects [13, 33]. Moreover, the objective of our study was not effectiveness evaluations, but rather to develop a prediction model over a wide range of outcomes.

The model should be externally validated in other cohorts, and its feasibility should be confirmed by patients and clinicians before being implemented in regular clinical practice. Importantly, we have not assessed outcomes after non-operative treatment. Therefore, it is highly uncertain if the model could be useful in other settings, e.g., among patients seen in general practice.

Conclusion

We have developed a prognostic model to identify patients at risk of unfavorable outcomes after lumbar microdiscectomy, which could assist physicians and patients in clinical decision-making prior to surgery in cases where the indication for surgery is not clear cut. The model accounts for patients with different levels of preoperative disability and corresponding prognostic factors, facilitating individual based treatment recommendations.