Introduction

Low back pain (LBP) is very common and poses a great health risk for society. Worldwide, it is the number one cause of years lived with disability [1]. Up to 84% of the population will experience LBP at least once during their lifetime [2]. In roughly 90% of cases, a specific source for the LBP cannot be identified [3]. LBP is strongly associated with disability [1, 4], work absence [5, 6], and reduced quality of life [6, 7]. As a result, medical and particularly non-medical costs related to LBP are very high [8, 9].

Most patients improve substantially in the first six weeks after the onset of LBP [10]. However, one year after onset, approximately two thirds of patients still experience pain and disability [10,11,12]. Currently, LBP is looked at more and more as a long-lasting or recurrent condition rather than a series of unrelated episodes [9, 13]. A review on the long-term course (follow-up ranged from one to 28 years) of LBP in the general population found that most patients experienced a somewhat stable or fluctuating occurrence of LBP over time [14]. Becoming pain free was never reported as a common finding.

Despite the effects of LBP on physical, psychological, and social well-being, there are few longitudinal studies reporting multiple patient-centered outcomes. Cohort studies with long-term follow-up (> 2 years) are often confined to investigating the presence of pain (yes/no) or the number of days with pain over the past month(s) or year [13, 14]. Several consensus statements have been published on outcome measures in chronic (back) pain research [15,16,17]. Most reports specifically provide recommendations for the evaluation of clinical trials, but there is an overall understanding that reporting on pain alone in LBP research is insufficient. Other important outcome domains include measures of physical function, generic measures of health and well-being, quality of life, and work (dis)ability.

At present, it is unclear what evidence is available from long-term studies on chronic non-specific LBP, and more specifically from studies examining patient-centered outcomes other than pain. We conducted a scoping review with the objective of identifying and mapping the available evidence from studies on chronic LBP with long-term follow-up, examining how these studies are conducted, and addressing potential knowledge gaps. Whereas systematic reviews typically focus on narrower, well-defined questions with appropriate study designs chosen in advance, a scoping review tends to address broader topics where many different study designs might be applicable [18]. For the present study, we included experimental and observational studies reporting at least two-year follow-up on disability, quality of life, work participation or health care utilization in patients with chronic non-specific LBP. The results are not intended to provide evidence to inform clinical practice, but rather to gain insight into the scientific literature that is currently available. For studying the feasibility, appropriateness or effectiveness of a certain treatment or practice, a systematic review is a more valid approach [19].

Methods

The PRISMA Extension for scoping reviews (PRISMA-ScR) was used as a reporting guideline for this review [20]. Although critical appraisal is optional, for the present study we evaluated methodological quality of the included studies with a quality assessment tool in order to be able to address any potential gaps in the literature related to low quality of research [21].

Eligibility criteria

Types of studies

Both experimental and observational studies investigating non-specific LBP with baseline measures and a minimum (mean) follow-up of > 2 years were included. Case reports and review studies were excluded.

Participants

Study participants were adults with sub-acute (6–12 weeks) or chronic (> 12 weeks) non-specific low back pain at study baseline, with or without leg pain. The average age of the study population had to be between 18 and 65 years. Studies that reported on LBP due to a specified physical cause (e.g., infection, tumor, osteoporosis, fracture, structural deformity, inflammatory disorder, radicular syndrome or cauda equina syndrome) were excluded. Studies on patients with LBP due to failed back surgery syndrome (FBSS) and LBP due to degenerative changes such as disk degeneration, osteoarthritis of facet joints, and a grade 1 degenerative spondylolisthesis were included, provided that there were no neurological symptoms. Little to no association has been found between imaging findings of these types of spine degeneration and the presence of LBP [22,23,24,25,26]. We therefore classified these (radiological) diagnoses as non-specific. Studies with mixed LBP groups (specific and non-specific cause for LBP) or mixed pain populations (e.g., neck pain and LBP) were excluded unless subgroup data for baseline and follow-up were presented.

Outcome measures

To be included, studies had to report on at least one of the following outcome measures: disability, quality of life, work participation, or health care utilization. Pain was also an outcome measure, but studies that only reported on pain were not included.

Search methods for identification of studies

Electronic searches in MEDLINE and EMBASE were conducted using indexed terms and free text words. The searches were not restricted by date, language, or place of publication. The search strategy included terms related to LBP, long-term follow-up and outcome measures (Supplementary Digital Content [SDC] 1). The search results for both databases were downloaded into RefWorks and duplicates were removed. An initial literature search was performed, followed by several updates, of which the last took place on March 5 2021. Initially, search terms for spondylolysis and spondylolisthesis were included. However, these were removed in the updated searches. We found that studies that were retrieved with these search terms (and that would not have been retrieved by searching for terms related to low back pain) targeted patients with spondylolisthesis with a higher than grade 1 degree of severity.

Data collection and analysis

Study selection

Three review authors independently screened titles, abstracts, and full text of the studies retrieved from the databases. One author (AD) screened all studies and two authors (RS, RSP) each screened half. The inclusion criteria concerned type of participants, length of follow-up, and outcome measures. To determine interrater agreement, a sample of 200 studies was selected for the three reviewers to screen on title, abstract and full text. Agreement ranged from 98 to 99% between reviewers with kappa scores ranging from 0.56 to 0.98 (moderate or substantial agreement). However, kappa scores are deemed not very reliable for ‘rare findings’ [27] and in this sample of 200 studies ultimately only 3 studies were included after consensus was reached. Any disagreement in the selection of studies was discussed until consensus was reached. If the three reviewers could not reach consensus, the fourth reviewer (MR) was consulted.
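The sensitivity of kappa to rare findings can be illustrated with a short calculation. The sketch below computes Cohen's kappa for two raters making include/exclude decisions; the counts are hypothetical and are not the screening data of this review. They were chosen only to show that raw agreement above 98% can coincide with a moderate kappa when very few records are included.

```python
def cohens_kappa(both_include, only_a, only_b, both_exclude):
    """Return (observed agreement, Cohen's kappa) for two raters.

    The four arguments are the cells of a 2x2 table of
    include/exclude decisions made by rater A and rater B.
    """
    n = both_include + only_a + only_b + both_exclude
    observed = (both_include + both_exclude) / n
    # Marginal proportion of records each rater chose to include.
    a_inc = (both_include + only_a) / n
    b_inc = (both_include + only_b) / n
    # Agreement expected by chance given those marginals.
    expected = a_inc * b_inc + (1 - a_inc) * (1 - b_inc)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return observed, (observed - expected) / (1 - expected)


# Hypothetical counts for 200 screened records (illustration only).
agreement, kappa = cohens_kappa(both_include=2, only_a=2, only_b=1,
                                both_exclude=195)
print(f"agreement = {agreement:.3f}, kappa = {kappa:.2f}")
```

Because almost all records are excluded by both raters, chance agreement is already very high, so a handful of discordant decisions has a large effect on kappa.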

Quality assessment

The Effective Public Health Practice Project Quality Assessment Tool (EPHPP) was used to evaluate methodological quality of the studies [28]. The tool can be used to evaluate a variety of study designs such as RCTs, observational, cross sectional, and before-and-after studies. The EPHPP assesses six domains: (1) selection bias, (2) study design, (3) confounders, (4) blinding, (5) data collection method, and (6) withdrawals and dropouts. Each domain can be rated strong, moderate, or weak resulting in a global rating of strong (no weak ratings), moderate (one weak rating), or weak (two or more weak ratings) for each study. The confounders domain was scored ‘not applicable’ when there was no comparison or control group, since the corresponding question was phrased “Were there important differences between groups prior to the intervention?”. Content and construct validity of the EPHPP have been established and inter-rater reliability is fair for the individual domains (ICC = 0.60) and excellent for the global rating (ICC = 0.77) [28, 29]. Four reviewers assessed methodological quality of the studies. One author (AD) assessed all studies and three authors (RS, RSP, MR) each assessed one third of the studies. Disagreements were resolved between the authors assessing the study or when in doubt were discussed with all four assessing authors to reach consensus.
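The global rating rule described above can be written down compactly. The following is a minimal sketch of that rule only; the function name and the dictionary layout are assumptions made for illustration and are not part of the EPHPP tool itself.

```python
def ephpp_global_rating(domain_ratings):
    """Derive the global rating from the per-domain ratings.

    domain_ratings maps a domain name to 'strong', 'moderate', 'weak',
    or 'not applicable'. Only 'weak' ratings affect the global rating.
    """
    weak_count = sum(1 for rating in domain_ratings.values()
                     if rating == "weak")
    if weak_count == 0:
        return "strong"
    if weak_count == 1:
        return "moderate"
    return "weak"


# Hypothetical single-arm retrospective study (illustration only).
example_study = {
    "selection bias": "weak",
    "study design": "moderate",
    "confounders": "not applicable",
    "blinding": "moderate",
    "data collection method": "strong",
    "withdrawals and dropouts": "moderate",
}
print(ephpp_global_rating(example_study))
```

Under this rule a study with only moderate domain ratings would still receive a strong global rating, since the rule counts weak ratings and nothing else.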

Data extraction and synthesis

The following data were extracted by one author (AD) from each paper and presented in supplementary tables (SDC 2): first author, study setting and country, study design, intervention(s), patient characteristics (diagnoses, age, % female), outcome domain(s), instrument(s), duration of follow-up, and results of measurements taken at baseline and > 2 year follow-up. This includes the results of any responder analyses (i.e., the proportion of patients achieving a pre-defined level of improvement) [30]. For randomized controlled trials (RCT), results from the intention-to-treat analyses were reported. Studies were organized thematically according to intervention type. Study characteristics were also summarized in a narrative format and the overall findings were presented in a summary table. Per outcome, the number of treatment arms that showed a significant (p < 0.05) improvement, decline, or no change compared to baseline was reported. The number of studies that did not report p-values for the change in outcome at follow-up was also reported.

Results

Study selection

Together, the initial and updated searches returned 10,312 articles, of which 90 ultimately met the inclusion criteria (Fig. 1). Follow-up results of one study were presented in two different articles [31, 32]. An overview of study characteristics can be found in SDC 2. Studies (n = 89) were classified according to the type of treatment(s) investigated: invasive (72%, n = 64; Table 1, SDC 2), conservative (21%, n = 19; Table 2, SDC 2), or a comparison of invasive and conservative treatments (7%, n = 6; Table 3, SDC 2). By definition, (minimally) invasive procedures require (1) a method of access to the body (incision, natural orifice, or percutaneous access), (2) instrumentation (e.g., endoscopes, catheters, scalpels), and (3) operator skill [33]. All non-invasive treatments were classified under conservative treatments.

Fig. 1 Flow chart of the literature search

Quality assessment

Global quality rating was weak for 14 (16%), moderate for 56 (63%), and strong for 19 (21%) studies (Table 1). A global weak rating was more common with studies published before 2010, while most studies that rated strong were published in the last decade (Fig. 2). Most common design was either a prospective (44%, n = 39) or retrospective cohort study (31%, n = 28) (both rated ‘moderate’). Twenty-one studies (24%) conducted an RCT and one study was classified as a controlled clinical trial [34] (both rated ‘strong’). Weak ratings were prevalent with the domain ‘selection bias’, while strong ratings were prevalent for ‘data collection method’. Studies rated predominantly moderate (42%, n = 37) or strong (44%, n = 39) on ‘withdrawals and dropouts’. Sixty studies (67%) did not receive a rating on ‘confounders’ due to the absence of a comparison or control group. Twenty-six (29%) retrospective studies received a ‘moderate’ rating for scoring ‘not applicable’ on the item ‘percentage of patients completing the study’.

Table 1 EPHPP quality assessment scores of the included studies
Fig. 2 Number of studies by methodological quality rating per time period of publication

Study Characteristics

Year published

Studies were published between 1985 and 2021, with 52 out of 89 studies (58%) published in the last decade (Fig. 2).

Study Setting

The majority of selected studies (83%, n = 74) were from Western countries (SDC 2), more specifically from European countries (54%, n = 48), such as Germany (10%, n = 8), Sweden, the UK (both 9%, n = 7), Norway, the Netherlands (both 8%, n = 7), and from the USA (27%, n = 24). Thirteen studies (15%) were from Asian countries, of which seven (8%) were from China. Two studies were from Brazil (2%). There were no studies from African countries, Central America, or Eastern Europe.

Fewer than half of the selected studies (44%, n = 39) specified the setting in which they took place. Forty-six out of 64 studies on invasive treatments did not report, or were unclear about, where a specific intervention took place. The 18 remaining studies (20%) specified they took place in (university) hospitals or (out-patient) medical practices. Studies on conservative treatments mostly took place in (university) hospitals, physiotherapy clinics, and chiropractic and general practices. Five out of six studies that compared invasive with conservative treatments took place in university hospitals.

Interventions

Most common types of invasive treatment were lumbar fusion (38% of studies, n = 34) and disc arthroplasty (25%, n = 22), followed by intradiscal therapies (e.g., intradiscal electrothermal therapy or intradiscal bone marrow injection; 11%, n = 10), and implantable therapies (e.g., spinal cord stimulation) [35, 55, 86] (SDC 2). Less common were interspinous process devices [39, 63], dynamic spine stabilization systems [57, 85], and basivertebral nerve ablation [48]. Two studies used sham infiltration as a control for intradiscal bone marrow injection [36, 70].

The most common conservative treatments were multidisciplinary treatment (10% of studies, n = 9), physiotherapy or exercise training (7%, n = 6), cognitive therapies (4%, n = 4), and advice and/or education (4%, n = 4). Other treatments consisted of (non-operative) care as usual [108, 112, 121], chiropractic care or primary care by a medical doctor [102], anthroposophic medicine [103], rehabilitation treatment [109], or open label placebo pills [100].

With the exception of two control groups that were assessed in studies on conservative treatments [98, 110], there were no studies examining long-term outcomes of LBP in people receiving no treatment. Two studies reported examining the natural history of LBP; however, their patient samples completed Swedish Back School [106] or received two months of conservative treatment [108] and were therefore categorized under ‘conservative treatments’ in this review.

Patient characteristics

Selection criteria of this review were set to include only studies on adults with sub-acute or chronic non-specific LBP. This also included patients with LBP due to FBSS, or degenerative changes such as disk degeneration and grade 1 spondylolisthesis, provided that there were no neurological symptoms. One study exclusively included patients with sub-acute LBP [34] and five studies included both patients with sub-acute and CLBP [102,103,104, 110, 111].

The majority of studies (91%, n = 64) on invasive treatments (with or without conservative treatment as a control) included patients that fit their criteria for either degenerative disc disease (DDD), discogenic pain, internal disc disruption or a combination thereof. Other studies selected patients with Modic type 1 or 2 changes [48], patients with CLBP and radiating pain to the lower limb(s) [52], FBSS [55], either FBSS or mechanical LBP [86], or LBP originating from the endplate [77].

Only two studies investigating conservative treatment options sought to include patients with discogenic pain [108, 112]. One study specifically excluded patients with disk degeneration [100]. Commonly, patients with CLBP (58%, n = 11), sub-acute LBP [34], or both sub-acute and CLBP (29%, n = 5) were eligible for inclusion. Added criteria were: still working [111], permanent employment [110], or sick-leave due to LBP [34, 107]. One study reported results separately for patients with CLBP with or without Modic changes [113].

Outcome measurements

For the selected studies, disability (92%, n = 82) and pain (86%, n = 77) were the most commonly measured outcome domains, followed by work (25%, n = 22), and quality of life (15%, n = 13) (SDC 2). Only four studies (4%) measured health care use [85, 99, 101, 114]. Five of the seven most frequently used outcome measures were patient reported outcome measures (PROMs) of pain and disability (Fig. 3). The Oswestry Disability Index (ODI) and Visual Analogue Scale (VAS) back pain were used in the majority of studies. Less frequently used outcome measures were the SF-36 subscale ‘Bodily Pain’ (6%, n = 5) for measuring pain, the SF-36 subscales ‘Physical Functioning’ (4%, n = 4) and ‘Role Physical’ (3%, n = 3), the General Functioning Score (3%, n = 3) for disability, and ‘work status’ (3%, n = 3) for measuring work participation. The remaining 40 outcome measures, most of them measuring pain, were each used by fewer than three studies.

Fig. 3 Most commonly used (> 5 studies) patient-centered outcome measures with the corresponding outcome domain in brackets. VAS and NRS without specification of a certain body area (e.g., legs) were classified as VAS back pain and NRS back pain. The Physical Component Scale, General Health Scale and total score of the SF-36 were all classified as SF-36 quality of life measures. ODI, Oswestry Disability Index; VAS, Visual Analogue Scale; SF-36, Short Form (36) Health Survey; NRS, Numerical Rating Scale; RMDQ, Roland Morris Disability Questionnaire

Follow-up

Follow-up ranged between 26 months and 18 years with a median of 51 months (SDC 2). Forty-three studies (48%) reported an (average) duration of follow-up between 24 and 48 months, 22 (25%) studies between 49 months and six years, and 24 (27%) studies over six years. Only ten studies (11%) took more than one measurement later than two years after baseline. Follow-up was available for > 80% of patients in 39 studies, between 60 and 80% in 12 studies, and < 60% in six studies. The percentage was unclear in six studies. The remaining 26 studies were retrospective studies that included patients based on complete availability of follow-up. Furthermore, a total of 36 studies (all 28 retrospective studies, seven prospective studies, and one RCT [98]) reported baseline results only for those patients who completed a minimum length of follow-up.

Responder analyses

Twenty-six out of 89 studies (29%) reported the results of a responder analysis; 23 studies on invasive treatments, two studies on conservative treatments and one study that compared invasive with conservative treatments (SDC 2). An improvement in disability, measured with the ODI, was most commonly used to determine clinical success (85%, n = 22), followed by an improvement in back pain or leg pain (38%, n = 10) measured with VAS or NRS. The cut-off for clinical success varied greatly per instrument; 10 different cut-offs were used for the ODI and seven for the VAS or NRS. One study reported clinical success on pain and disability using an improvement in subscales for pain and functioning of the SF-36 [89]. Other studies analyzed improvement in quality of life (SF-36 Physical Component Scale) [67] or improvement in both pain and disability [44, 48, 60].

Summary of findings at long-term follow-up

Table 2 summarizes the overall findings of the selected studies per treatment type and duration of follow-up. Reported results were not specified for diagnoses or disease characteristics. Per outcome, the number of treatment arms that showed a significant improvement (‘+’), no significant change (‘0’), or a significant decline compared to baseline (‘−’) was reported. Several studies did not report p-values for the change in outcome at follow-up (‘?’). Results on work related outcomes were very rarely reported with a statistical level of significance. However, almost all results without a reported p-value showed some level of improvement between baseline and long-term follow-up. In general, pain, disability, and quality of life were significantly improved after an invasive intervention. Results after conservative treatments varied between significantly improved or unchanged. One study reported that patients had significantly worsened compared to baseline six years after following a rehabilitation program [109]. Since most studies reported significant improvement at follow-up, there was little difference in outcome at the different durations of follow-up.
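The tallying scheme used for Table 2 can be sketched as follows. This is an illustrative reconstruction under stated assumptions: the function name, the input layout, and the example arms are hypothetical, the 0.05 threshold follows the Methods, and the code uses an ASCII hyphen for the decline symbol.

```python
from collections import Counter


def classify_arm(improved, p_value, alpha=0.05):
    """Return '+', '0', '-' or '?' for one treatment arm and outcome.

    improved: True if the follow-up score is better than baseline.
    p_value:  p-value for the change from baseline, None if not reported.
    """
    if p_value is None:
        return "?"          # significance of the change not reported
    if p_value >= alpha:
        return "0"          # no significant change
    return "+" if improved else "-"


# Hypothetical treatment arms as (improved, p_value); illustration only.
arms = [(True, 0.001), (True, 0.20), (False, 0.03), (True, None), (True, 0.04)]
tally = Counter(classify_arm(improved, p) for improved, p in arms)
print(dict(tally))
```

An arm without a reported p-value is counted as '?' regardless of the direction of change, which is why improvements reported without a significance level do not appear under '+'.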

Table 2 Summary of reported results per treatment type and duration of follow-up (Non-specified for diagnosis or disease characteristics)

Responder analyses

Setting aside the variety in definitions to determine clinical success and irrespective of the type of treatment patients received, we found that response on pain measures at long-term follow-up varied between 20 and 90% (10 studies with 15 treatment arms) and response on disability measures varied between 15 and 91% (22 studies with 32 treatment arms) (SDC 2).

Looking at different treatment types and taking into account the number of patients per treatment arm, clinical success on disability was achieved in 73% of patients that underwent a disc arthroplasty (n = 14 treatment arms), 75% of patients that underwent lumbar fusion (n = 7 treatment arms), 61% of patients that received multidisciplinary treatment or physiotherapy/exercise training (n = 4 treatment arms), and 63% of patients that received intradiscal therapies (n = 3 treatment arms). The only treatment type with > 3 treatment arms reporting response rates on pain measures was intradiscal therapy (n = 5 treatment arms), with 57% of patients achieving clinical success.
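One way to take the number of patients per treatment arm into account is to pool responders over all arms, which is equivalent to weighting each arm's response rate by its sample size. The sketch below assumes that this is the kind of weighting meant here; the function name and the arm sizes are hypothetical.

```python
def pooled_response_rate(arms):
    """Patient-weighted response rate over several treatment arms.

    arms: list of (n_patients, n_responders) tuples, one per arm.
    """
    total_patients = sum(n for n, _ in arms)
    total_responders = sum(r for _, r in arms)
    return total_responders / total_patients


# Hypothetical arms: a small arm with 90% response, a large arm with 50%.
arms = [(20, 18), (180, 90)]
pooled = pooled_response_rate(arms)
unweighted = sum(r / n for n, r in arms) / len(arms)
print(f"pooled = {pooled:.2f}, unweighted mean = {unweighted:.2f}")
```

The pooled rate is dominated by the larger arm, whereas a simple mean of the per-arm percentages would give the small arm the same influence as the large one.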

Discussion

The general purpose of this study was to identify and map the available evidence from long-term studies on chronic non-specific LBP. Our findings confirm the notion that there is little to no information available from natural cohorts when it comes to reporting on patient-centered outcomes other than pain. The majority (> 75%) of papers that were included examined long-term outcomes after invasive treatments. Surgical interventions, specifically lumbar fusion and disc arthroplasty, were most commonly reported. Among studies examining conservative treatments, physical therapy and multidisciplinary programs were most common. Overall, included studies were predominantly of moderate quality and differed in design, patient samples, and methods of data collection. These differences were most profound between studies on invasive and conservative treatments. In general, most studies reported improvements in pain and disability and, when measured, quality of life at long-term follow-up.

This review identifies several knowledge gaps regarding research into long-term outcomes of non-specific chronic LBP. First, there is still little insight into the natural course of LBP regarding outcomes such as disability, quality of life, work, and health care utilization, because no natural cohorts met the inclusion criteria. In a natural cohort, subjects are followed in real life, in which numerous situations and interventions may occur; such a cohort is not limited to studying the effect of one or several specified interventions. The studies included in this review examined clinical outcomes of non-specific LBP and concerned patients that were actively seeking health care. Therefore, they might not be representative of people with sub-acute or chronic LBP in the general population. Second, we noticed that repeated measurements during long-term follow-up were scarce. Only ten studies (11%) took more than one measurement after the two-year mark. These studies reported lasting improvements in symptoms after lumbar fusion [31, 32, 40, 41, 59, 72], disc arthroplasty [53, 58, 76, 92], and chiropractic care or primary care by an MD [102]. Nonetheless, recurrence of LBP is very common and studies with less than two years of follow-up have also shown that post-treatment trajectories of pain and disability can vary a great deal between patients [122,123,124]. Third, the present review also affirms the notion that across LBP trials, the primary focus has been on pain and disability as outcome measures [125], even though other (generic) measures of health and well-being, such as quality of life and work (dis)ability, have been recommended in core outcome sets to reflect the multidimensionality of LBP [15, 126,127,128]. Furthermore, few studies seem to monitor health care utilization during follow-up. These data can be challenging to collect; however, they are an important piece of the puzzle in determining whether outcomes at long-term follow-up might be the result of the original intervention (at baseline) or other interventions that were provided during follow-up. To conclude, in order to really understand both the (natural) course of LBP and results of LBP-related interventions over time, frequent measurements of relevant patient-centered outcomes are needed, as well as the use of complete core outcome sets including quality of life and work disability, and an overview of patients’ health care utilization during follow-up.

Even though the patient reported outcome measures in this review seem to reflect more positive long-term pain, disability and quality of life status compared to baseline measurements, this should not be misinterpreted as treatment effectiveness. This scoping review was not designed to study long-term effectiveness of interventions. A number of factors might have contributed to the appearance of consistent improvement years after experiencing persistent LBP. First, the reported improvements derive from statistical significance and do not necessarily imply clinical relevance. It is unclear whether patients perceived their improvement on different outcome measures as clinically relevant. Only a select number of studies performed a responder analysis. A previous review on outcome measures also reported that merely 8% of 401 included LBP trials reported a number or proportion of improved patients [125]. Although most of the studies in the present review that included a responder analysis reported high percentages of patients with clinically relevant improvement, cut-off scores for clinical success varied greatly. For instance, in some studies relative improvements of 25–30% on VAS or ODI were deemed successful, while others aimed for 50% [35,36,37, 95].
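The influence of the chosen cut-off on the reported response rate can be shown with a small worked example. The sketch below applies a 30% and a 50% relative-improvement threshold to the same set of ODI scores, where a lower ODI score means less disability. The patient scores and function name are hypothetical and do not come from any included study.

```python
def response_rate(scores, cutoff):
    """Share of patients whose relative improvement meets the cut-off.

    scores: list of (baseline, follow_up) ODI scores; lower is better.
    cutoff: required relative improvement, e.g. 0.30 for 30%.
    """
    responders = sum(
        1 for baseline, follow_up in scores
        if (baseline - follow_up) / baseline >= cutoff
    )
    return responders / len(scores)


# Hypothetical (baseline, follow-up) ODI scores for five patients.
odi_scores = [(50, 20), (40, 26), (60, 45), (44, 22), (30, 30)]
rate_30 = response_rate(odi_scores, 0.30)
rate_50 = response_rate(odi_scores, 0.50)
print(f"30% cut-off: {rate_30:.0%}, 50% cut-off: {rate_50:.0%}")
```

With identical data, the proportion of responders drops when the stricter cut-off is applied, which is why response rates based on different definitions of clinical success are hard to compare across studies.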

Other factors might also have influenced improvement in LBP symptoms. A previous review in patients with non-specific LBP found that response to primary care treatment followed a pattern of rapid early improvement followed by a plateau, regardless of whether active treatment, usual care, or placebo treatment was used [129]. Natural prognosis could be one explanation [10, 11, 130]. However, the long-term natural prognosis is mostly unknown. People are also more likely to seek health care at a time when their pain and symptoms are at their worst or most debilitating, which could further explain a positive overall course. Regression to the mean could also have played a role in the improvements in symptoms that were found after the start of treatment [131]. Overall, these factors likely influenced short-term improvements in LBP complaints, but if maintained, could also explain the reported long-term beneficial outcomes. Finally, publication and reporting bias cannot be ruled out. Only one study reported that patients had significantly worsened at long-term follow-up. Future (systematic) reviews on long-term studies on LBP should consider checking their findings against reported study protocols and/or unpublished trial data.

Surgical treatments are relatively over-represented in the present review. Safety issues and long-term adverse events are of more concern in surgical trials than in trials of conservative interventions, which may be why long-term data are collected and analyzed more often for invasive interventions. Also, surgical studies more often seem to utilize data that are retrospectively obtained from patient medical records [132, 133]. This makes it easier to collect and report long-term follow-up data. In spine surgery, complication incidence is potentially underestimated with retrospective assessments [134]; however, the present review includes results from PROMs and not occurrence of adverse events.

Studies on invasive and conservative treatments were notably different in their patient inclusion criteria. Invasive studies sought to include patients with disc-related diagnoses or symptoms, whereas conservative studies defined symptom-related criteria more generally (‘low back pain’). Although diagnoses based on lumbar structures (e.g., discogenic pain, facet joint pain) were very common in some settings, diagnostic tests do not reliably identify these structures as a source of LBP. The usefulness of these tests in clinical practice remains unclear [22, 26, 135] and current guidelines on LBP usually classify these diagnoses as non-specific [136]. Nevertheless, spine surgeons have claimed that these diagnoses should classify as specific LBP and that better and earlier identification combined with, if indicated, invasive treatment would improve prognosis in these patients [137]. A Dutch task force that was tasked to develop a guideline for invasive treatment of lumbosacral pain syndromes has proposed to classify diagnoses such as facet joint pain, disc pain and FBSS as ‘degenerative uncomplicated spinal LBP syndromes’ [138]. In short, LBP diagnoses, as well as the decision to operate or treat conservatively, vary between countries and between medical disciplines. At present, there is no consensus among health care professionals on the classification of specific versus non-specific LBP. Improved consensus on a classification system could lead to more targeted care, reduce the need for expensive diagnostic methods, and facilitate comparison among LBP studies [17, 139, 140].

In line with worldwide research in the field of back pain, we identified a significant increase in annual publications on long-term outcomes of non-specific LBP [141]. The majority of selected studies were from Western countries, with the USA being the most productive (27% of studies). Little to no research took place in low- or middle-income countries, while in the past few decades the largest increases in disability due to LBP have occurred there [9, 142]. The impact of LBP in low- to middle-income countries potentially comes with disadvantages that differ from those in high-income countries and might therefore not be represented in the present review [9].

Finally, methodological quality of studies also seemed to increase over the years. Only prospectively conducted studies (prospective cohorts and RCT/CCTs) received a global ‘strong’ rating with the quality assessment tool that was utilized. Selection bias was often present in retrospectively conducted studies. In these instances, patients were included based on complete availability of follow-up data. Two sensitivity analyses were performed on the scoring method of the quality assessment tool. First, the global quality rating of a study was determined by the number of ‘weak’ ratings scored across the separate domains. This means that studies that scored ‘moderate’ on each separate domain would have received a ‘strong’ global rating. A separate analysis showed that changing the global rating from strong to moderate for these studies would have had no effect on the results, since there were no studies that rated moderate on each domain. Second, prospective cohort studies received a ‘moderate’ rating on the domain study design. It could be argued that prospective cohort studies are a strong design for studying long-term outcomes. However, changing these ratings from moderate to strong on this domain would have also had no effect on the global quality rating.

Limitations

As is to be expected, a number of studies on long-term LBP outcomes had to be excluded from this review because they did not meet our inclusion criteria. This occurred most often with studies on samples with non-specific LBP mixed with specific LBP, samples with acute mixed with sub-acute and chronic LBP, and studies that failed to report baseline results of the outcomes measured at long-term follow-up. The latter in particular was common for measures related to health care utilization, since information has to be available, or recalled, from before baseline. Ultimately, only four studies could be included that reported health care use in the period before baseline [85, 99, 101, 114]. Another limitation is that this review gives limited insight into when the improvements that we observed took place. We chose to only report results from long-term follow-up (> 2 years), since the focus of this review was on mapping evidence from long-term follow-up studies. The complete course or trajectory of LBP symptoms could be studied in future reviews with a narrower scope. Finally, the heterogeneity in the assessment and reporting of outcomes rendered it difficult to provide a qualitative synthesis of the results. A wide variety of instruments was used to measure pain, disability, quality of life, and work participation, and a considerable number of studies did not report whether changes in scores between baseline and follow-up were statistically significant.

Conclusion

Patients with persistent non-specific LBP report improvements in pain, disability and quality of life years after seeking treatment. However, it remains unclear what factors might have influenced these improvements and whether they are treatment-related, in part because there is very little long-term evidence available from natural cohorts. Finally, studies that examined long-term outcomes of LBP symptoms varied greatly in design, quality, patient samples, and methods of data collection, and only a few performed a responder analysis or applied repeated measurements after two years of follow-up.