Background

Low back pain (LBP) is a growing health problem in the industrialized world. Despite the high medical expenses required for its management, the prevalence of LBP is increasing [1]. LBP is a heterogeneous condition, and the identification of different sub-groups could help guide management decisions [2,3]. One of these sub-groups is lumbar segmental instability [4,5].

Radiologically determined instability is characterized by a loss of passive integrity, causing excessive vertebral translation or rotation. Maximum lumbar flexion-extension radiographs in the standing position are considered a reference standard for assessing the function of the passive stabilization system [6,7]. This imaging method is commonly used to evaluate lumbar segmental mobility in isthmic and degenerative spondylolisthesis and degenerative disc dysfunctions. The radiographic diagnosis of spondylolisthesis is considered to be one of the most efficient methods of identifying lumbar instability [8].

Some authors extend the concept of instability to the so-called “clinical” or “functional” instability, in which neither a defect of the bony architecture of the lumbar spine nor excessive detectable translation or rotation is shown. However, poor trunk muscle function and/or insufficient motor control is believed to be a factor in abnormal inter-segmental movement and LBP [9-11]. Although this type of instability has not been sufficiently demonstrated as a clinical entity and cannot be measured against any gold standard, it is one of the most frequent fields of interest for chiropractors and manual therapists.

Clinicians have used several clinical tests to detect the spinal instability and/or the ability of the muscles to stabilize the lumbar spine [12]. Recently, some of these tests have been suggested in the “Clinical Practice Guidelines linked to the International Classification of Functioning, Disability and Health from the Orthopaedic Section of the American Physical Therapy Association”, to assess the impairments of body functions in LBP [5]. The most commonly used tests are the Prone Instability Test (PIT), the Passive Lumbar Extension (PLE) test, the Aberrant Movements Pattern (AMP), the Posterior Shear Test (PST), the Prone Bridge Test (PBT), the Supine Bridge Test (SBT), and the Active Straight Leg Raise Test.

Previous reviews separately investigated the diagnostic accuracy [13] or the reliability [14] of the instability tests, but a complete picture of their diagnostic validity for detecting lumbar instability is lacking. No single literature review covers both the diagnostic accuracy (sensitivity, specificity and likelihood ratios) and the inter-rater reliability of these clinical tests. More specifically, a researcher could be interested in investigating the reliability of the tests that previously demonstrated sufficient face validity.

The objective of this literature review was to assess the diagnostic properties (primarily the accuracy, with additional reporting of reliability) of the clinical tests for lumbar instability in individuals with LBP, and to investigate their applicability in daily practice.

Methods

This is a literature review of all the studies evaluating the diagnostic properties of the clinical tests for lumbar instability in individuals with LBP. PRISMA Guidelines [15] were followed during the design, search and reporting stages of this review on diagnostic test studies.

Literature search

A search of the relevant literature was performed from July 2012 to December 2013. A comprehensive search, limited to articles in English, Italian and Spanish, was conducted in the following databases: Medline, Embase, Cinahl, PubMed and Scopus. Diagnostic test studies on humans published between 1972 and December 2013 were included. Narrative or systematic reviews, guidelines and meta-analyses were excluded.

Two authors (SF and TM) independently performed two different and parallel searches to avoid leaving out relevant articles. The search strategies are shown in Figure 1.

Figure 1 Flow chart.

The results of these seven searches were unified into a single item set. From the results of the initial search, duplicate citations were removed, and then the titles, abstracts and full texts of the retrieved articles were independently evaluated for definitive inclusion. When the two reviewers were unable to reach a consensus, a third reviewer (CV) was consulted. In addition to the Internet-assisted search, references were pulled from a textbook on the diagnostic accuracy of orthopedic clinical tests [16], and from the reference lists of included studies. Finally, an independent hand search, including scanning of reference lists from other systematic reviews [13,14], was performed.

Study selection

Several criteria were used to select eligible studies. Articles examining clinical tests for lumbar instability were included if they met the following criteria:

  1) Diagnostic accuracy studies on adult population with sub-acute or chronic LBP were considered if clinical instability tests were employed as index tests. Dynamic radiographs were the reference test to diagnose lumbar instability. The subject articles had to report data which would allow computation of parametric statistical tests of diagnostic accuracy [sensitivity, specificity, or positive and negative likelihood ratios (+LR and -LR)].

  2) Reliability studies on healthy or LBP adult population were considered if they concerned the use of clinical tests to diagnose lumbar instability by one or more clinicians. Articles had to report the parametric statistical tests of relationship or agreement.

  3) Finally, only the studies in which each test was investigated by at least one study concerning both the accuracy and the reliability were considered eligible.

Data extraction and quality assessment

One author (TM) gathered data regarding the clinical tests (with their description and scoring), study population (e.g. age, gender, setting, clinical characteristics), inclusion and exclusion criteria, diagnostic reference standard, differences in operationalizing the index tests, and study raters. Study results on sensitivity, specificity, +LR, -LR, and reliability were collected (or calculated, if the included articles did not provide these data). Other authors (SF and FB) verified data extraction once completed. The methodological quality of the included articles was independently assessed by 2 reviewers (TM and FB), using different tools for the 2 types of studies: the Quality Assessment of Diagnostic Accuracy Studies (QUADAS) tool for diagnostic accuracy articles [17] and the Quality Appraisal of Reliability Studies (QAREL) checklist for diagnostic reliability articles [18].

Data synthesis and analysis

Kappa statistics were used to assess agreement between the 2 raters on article selection and on QUADAS and QAREL ratings [19]. The QUADAS and QAREL tools delineate essential items to be reported in diagnostic test studies (Table 1 and Table 2).
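As an illustration of the agreement statistic mentioned above, the following is a minimal sketch of Cohen's kappa for two raters. The function name `cohen_kappa` and the include/exclude decisions are hypothetical and are not data from this review; the sketch only shows how observed agreement is corrected for chance agreement.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters scoring the same items (hypothetical helper)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: share of items on which both raters gave the same score.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal proportions, summed over categories.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_chance = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if p_chance == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical screening decisions of two reviewers on ten articles (illustrative only).
reviewer_1 = ["include", "include", "exclude", "exclude", "exclude",
              "include", "exclude", "exclude", "exclude", "exclude"]
reviewer_2 = ["include", "exclude", "exclude", "exclude", "exclude",
              "include", "exclude", "exclude", "include", "exclude"]

kappa = cohen_kappa(reviewer_1, reviewer_2)
print(f"kappa = {kappa:.2f}")
```

In this made-up example the reviewers agree on 8 of 10 articles, but because both exclude most articles the chance-corrected agreement is noticeably lower than the raw 80%.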

Table 1 QUADAS (Quality Assessment of Diagnostic Accuracy Study) tool results
Table 2 QAREL application results

Concerning sensitivity and specificity, values were interpreted on a scale ranging from 50% (unacceptable test) to 100% (perfect test) [20]. The diagnostic accuracy was considered satisfactory, thus affecting the probability of lumbar instability, with +LR ≥ 2.0 or -LR ≤ 0.50 [21].
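The accuracy measures and thresholds above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `diagnostic_accuracy` and `is_satisfactory` and the 2x2 counts are hypothetical, and the counts are not taken from any included study.

```python
def diagnostic_accuracy(tp, fp, fn, tn):
    """Sensitivity, specificity, +LR and -LR from a 2x2 table (hypothetical helper).

    tp/fn: subjects with radiographic instability who test positive/negative.
    fp/tn: subjects without radiographic instability who test positive/negative.
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one unstable and one stable subject")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # +LR = sensitivity / (1 - specificity); -LR = (1 - sensitivity) / specificity.
    pos_lr = sensitivity / (1 - specificity) if specificity < 1 else float("inf")
    neg_lr = (1 - sensitivity) / specificity if specificity > 0 else float("inf")
    return {"sensitivity": sensitivity, "specificity": specificity,
            "pos_lr": pos_lr, "neg_lr": neg_lr}


def is_satisfactory(pos_lr, neg_lr):
    """Apply the cut-offs stated in the text: +LR >= 2.0 or -LR <= 0.50."""
    return pos_lr >= 2.0 or neg_lr <= 0.50


# Hypothetical counts for one index test against dynamic radiographs (illustrative only).
result = diagnostic_accuracy(tp=30, fp=12, fn=10, tn=48)
print(result, is_satisfactory(result["pos_lr"], result["neg_lr"]))
```

With these made-up counts, sensitivity is 30/40 and specificity is 48/60, so the test would meet the +LR criterion.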

Concerning reliability, the following criteria were used to determine the strength of the coefficients: ≤ 0.25 = little or no relationship; 0.26 – 0.50 = fair degree of relationship; 0.51 – 0.75 = moderate to good relationship; 0.76 – 1.00 = good to excellent relationship [22].
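The reliability bands above can be expressed as a simple lookup. This is a sketch only: the function name `reliability_strength` is hypothetical, and because the bands are stated to two decimals, the treatment of values falling between bands (for example 0.255) is an assumption of this sketch, which assigns them to the higher band.

```python
def reliability_strength(coefficient):
    """Map a reliability coefficient to the descriptive bands stated in the text."""
    if coefficient > 1.0:
        raise ValueError("a reliability coefficient cannot exceed 1.0")
    if coefficient <= 0.25:
        return "little or no relationship"
    if coefficient <= 0.50:
        return "fair degree of relationship"
    if coefficient <= 0.75:
        return "moderate to good relationship"
    return "good to excellent relationship"


# Illustrative values, one per band (not results from the included studies).
for value in (0.10, 0.40, 0.60, 0.90):
    print(value, reliability_strength(value))
```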

Results

Figure 1 shows the process of study selection. Initial searching identified 773 citations. Following the first screening, 299 articles were excluded and 474 citations were retained for the second screening. After reviewing the titles, 446 were excluded and 28 were considered of interest; after looking at the abstracts, 16 were maintained and 13 were retrieved in full text. Using the inclusion and exclusion criteria, a further 7 articles were excluded. This review finally included 6 papers, covering 333 LBP patients [12,23-27].

Quality scores

Two of the 6 studies (33%) were identified as having high methodological rigor according to the QUADAS tool (Table 1). Table 2 shows the distribution of studies according to the scores obtained from the assessment of their methodological quality, following the QAREL tool.

Diagnostic accuracy of the tests

The diagnostic accuracy was investigated by only 2 studies: Fritz et al. [24] and Kasai et al. [25]. Four lumbar instability tests were considered: the PLE test, the PIT, the AMP, and the PST. The main characteristics of the studies on diagnostic accuracy are shown in Table 3, whereas Table 4 shows the results.

Table 3 Summary of the studies on diagnostic accuracy
Table 4 Results of diagnostic accuracy studies

Kasai et al. [25] found that the PLE test was the most accurate clinical test, with high sensitivity (0.84, 95% CI: 0.7 - 0.93) and specificity (0.90, 95% CI: 0.82 - 0.95), in a sample of subjects diagnosed with spinal stenosis, lumbar spondylolisthesis or lumbar degenerative scoliosis. The positive and negative LRs were informative.

The diagnostic accuracy of the AMP depends on the individual sign considered. Low sensitivity (0.26, 95% CI: 0.15 - 0.42) and good specificity (0.86, 95% CI: 0.77 - 0.92) were found by Kasai et al. [25] for the Instability Catch Sign. The Painful Catch Sign and the Apprehension Sign showed the same trend: low sensitivity (0.37, 95% CI: 0.24 - 0.54 and 0.18, 95% CI: 0.22 - 0.64 respectively) and good specificity (0.73, 95% CI: 0.61 - 0.8 and 0.88, 95% CI: 0.61 - 0.78 respectively). These signs are included in the AMP, which was also studied by Fritz et al. [24], who reported low sensitivity (0.18, 95% CI: 0.08 - 0.36) and high specificity (0.95, 95% CI: 0.77 - 0.99) for the AMP test in a cohort of patients with chronic LBP.

The article by Fritz et al. [24] is the only one that studied the diagnostic accuracy of the PIT and the PST. Both tests demonstrated fair to moderate diagnostic accuracy: PIT sensitivity = 0.71 (95% CI: 0.53 - 0.85), specificity = 0.57 (95% CI: 0.37 - 0.76); PST sensitivity = 0.50 (95% CI: 0.34 - 0.66), specificity = 0.48 (95% CI: 0.28 - 0.68).
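To make the relation between the reported point estimates and the likelihood ratios concrete, the sketch below recomputes +LR and -LR from the sensitivity and specificity values quoted in this section. The helper name `likelihood_ratios` is hypothetical, and because the inputs are rounded point estimates, the outputs may differ slightly from the values in Table 4.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (+LR, -LR) computed from sensitivity and specificity."""
    pos_lr = sensitivity / (1 - specificity)
    neg_lr = (1 - sensitivity) / specificity
    return pos_lr, neg_lr


# Point estimates (sensitivity, specificity) as quoted in the text above.
reported = {
    "PLE": (0.84, 0.90),  # Kasai et al. [25]
    "AMP": (0.18, 0.95),  # Fritz et al. [24]
    "PIT": (0.71, 0.57),  # Fritz et al. [24]
    "PST": (0.50, 0.48),  # Fritz et al. [24]
}

results = {name: likelihood_ratios(sens, spec) for name, (sens, spec) in reported.items()}
for name, (pos_lr, neg_lr) in results.items():
    print(f"{name}: +LR = {pos_lr:.2f}, -LR = {neg_lr:.2f}")
```

A +LR well above 1 together with a -LR well below 1, as obtained for the PLE test, is what makes both a positive and a negative result informative; a test with both ratios close to 1 barely shifts the probability of instability in either direction.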

Reliability of the tests

The reliability of the four clinical tests was studied in 5 papers [12,23,24,26,27]. The main characteristics of the studies on reliability and their results are shown in Table 5, whereas Table 6 shows the results in terms of inter-rater reliability.

Table 5 Summary of the articles on reliability
Table 6 Summary of results on reliability

The PLE test showed the best reliability, but this result comes from a single study [12]. The inter-rater reliability of this test was good (k = 0.76).

Five studies investigated the inter-rater reliability of the PIT. This reliability was considered fair by Schneider et al. [27] (k = 0.46) and Ravenna et al. [26] (k = 0.10 and 0.04), moderate by Fritz et al. [24] and Rabin et al. [12] (k = 0.69 and k = 0.67, respectively), and good by Hicks et al. [23] (k = 0.87).

The inter-rater reliability of the AMP was studied by Hicks et al. [23], Fritz et al. [24] and Rabin et al. [12]. Whereas Fritz et al. [24] found poor reproducibility (k = −0.07), Hicks et al. [23] (k = 0.60) and Rabin et al. [12] (k = 0.64) found moderate reliability. The inter-rater reliability of the Posterior Shear Test was studied only by Fritz et al. [24], who found poor reliability (k = 0.27).

Implications for clinical practice

The data from the studies provided information on the tests and methods used, the error of measurement and also the validity of the tests. However, only 5 studies (83.3%) provided information concerning the setting and the years of the raters' clinical experience, whereas all studies identified the person performing the assessment and his/her professional competence.

Discussion

This literature review aimed to identify the most reliable findings concerning the diagnostic properties of the clinical tests for lumbar instability in LBP subjects.

Lumbar instability is traditionally a field of debate. Lumbar segmental instability in the absence of defects of the bony architecture of the lumbar spine has also been cited as a significant cause of chronic low back pain [5,28]. The differences between surgical instability criteria and “functional instability” criteria were defined by Panjabi [29] decades ago. Chiropractors and manual therapists are more interested in the loss of motor control than in the hypermobility detectable with flexion/extension radiological imaging, which is more useful to spine surgeons. However, the difficulty of clinically detecting abnormal or excessive inter-segmental motion often makes these tests insensitive and unreliable, and this limits the clinical diagnosis of lumbar segmental instability [30,31]. The lack of studies in this field also emerges from our research, which found many studies on the reliability of the tests used by clinicians but few on their accuracy. Although we are aware that this criterion is very strict for manual therapists, we chose to be rigorous and based our research on the best available reference (gold standard) for instability, namely dynamic X-rays. As a result, many other tests used in manual clinical practice to detect lumbar clinical instability (e.g. the active hip abduction test or the hip extension test) were not considered because no study had investigated their accuracy. These tests are not present in this review, so that, in the final analysis, our study could be considered a literature review of the accuracy of lumbar clinical tests with additional reporting of reliability information.

Six high-quality studies were selected and four lumbar clinical instability tests (PLE test, PIT, AMP and PST) satisfied the inclusion criteria.

Accuracy

The characteristics of the samples of the 2 included studies [24,25] cannot be considered accurate. Fritz et al. [24] studied a population in which the majority had a prior history of LBP, and in which only 30.6% (n = 15) of people complained of symptoms distal to the knee. Kasai et al. [25], however, investigated a population with specific lumbar conditions (lumbar spinal canal stenosis, lumbar spondylolisthesis or lumbar scoliosis), most of whom had intermittent claudication, and 42.6% (n = 52) of whom had neurological leg symptoms.

The PLE test was the most accurate and informative test, even though it was measured by only one study, in patients affected by lumbar degenerative diseases. Although the PLE test appears to be a potentially effective clinical test for detecting lumbar instability, the characteristics of the investigated sample and the presence of only one study on its diagnostic accuracy suggest the need for studies on non-specific LBP patients.

The PIT demonstrated low to moderate sensitivity and specificity [24] indicating that this test has limited accuracy in diagnosing lumbar instability in patients with LBP.

The PST showed relatively poor sensitivity and specificity [24], indicating that this test is less accurate than the PLE test and the PIT to detect lumbar instability.

The Instability Catch Sign, the Painful Catch Sign and the Apprehension Sign are three of the five signs included in the AMP investigated by Fritz et al. [24]. The relatively low sensitivity and high specificity resulting from the study of Kasai et al. [25] suggest caution in the use of these tests to diagnose lumbar instability. According to Hicks et al. [23], these 5 tests should be used together, as a complete observation of the trunk movement and the 5 signs could be considered as only one comprehensive test. However, positive results on AMP and PIT, which demonstrated moderate sensitivity and specificity, were considered predictive for a favorable response to stabilization exercises [32].

Reliability

The characteristics of the samples were not always well explained or were not reliable. The PLE test [12] and the PIT [12,23,24] demonstrated good inter-rater reliability. The reliability of the PLE test is evident in younger subjects referred to outpatient physical therapy [12]. Five studies on the PIT reported very different inter-rater reliability scores. Nevertheless, the 2 studies showing fair reliability [26,27] are affected by possible bias: in the first case [27] due to a very limited sample size, and in the second case [26] due to procedural and methodological weaknesses such as the involvement of novice raters and the use of a modified test. The main statistical problem was the small sample size, which could invalidate the k score. Although the other 4 studies adopting the PIT closely followed its original description, some differences in the positivity criteria were found. Hicks et al. [23] and Schneider et al. [27] judged the test positive when the pain disappeared in the second part of the test, Fritz et al. [24] when the pain decreased, whilst for Rabin et al. [12] the pain had to be either relieved or abolished.

After excluding the two studies with the main methodological weaknesses, the reliability of the PIT appeared moderate to good.

The AMP reliability was investigated in three studies [12,23,24] but their results were not similar and ranged from insufficient reliability [24] to moderate reliability [12,23]. The PST was investigated by only one study and scored the lowest reliability [24], which is insufficient to recommend its use.

Implications for clinical practice

After an initial inspection of the articles, it appears that the information derived from the studies could provide a useful picture of the items that contribute to the definition of “applicability in rehabilitation practice”. Sufficient information was provided on the execution of the tests, whereas little information was given on their duration and on the time needed to process data. Considering that in clinical practice a standard manual therapy session normally lasts 30 minutes, it may be the case that a series of tests proposed in the literature cannot be repeated by clinicians due to lack of time. The attempt to identify methods for the evaluation of lumbar instability in patients with LBP allowed us to select some tests that are suitable for clinicians in everyday clinical practice. The time needed to perform the tests and process data is compatible with clinical practice and research purposes. Starting from the same key-words used for the search of the articles of the literature review, 4 clinical tests (PIT, PLE, AMP and PST) investigated by 2 studies [24,25] met the criteria of applicability in clinical practice.

Limits

The main limitation of this review is the small number of articles found on any single test. Only 2 studies concerned the diagnostic accuracy, while for the studies investigating the reliability, the results are limited by statistical or methodological weaknesses. For example, the conclusions of Ravenna et al. [26] should be interpreted cautiously, also because of some significant modifications made to standardize the PIT, such as the different hip and knee positions, the use of a stabilization scapular belt and a stool for foot placement.

The average age and the characteristics of the spinal dysfunctions of the samples were not homogeneous in the different studies, thus reducing the external validity of the results. Another limitation of this review concerns the insufficient homogeneity regarding the execution and interpretation of the tests. As already mentioned, a lack of standardization of a test affects comparative analyses among different studies and the implementation of that test in clinical practice.

Conclusions

The current state of the art of clinical tests for lumbar instability includes 6 studies of 333 patients and 4 clinical tests. Our data suggest that the PLE test is the most suitable test for detecting lumbar instability, thanks to its excellent diagnostic accuracy and good reliability. Further studies on the diagnostic properties of the PLE test to detect lumbar instability among different populations with LBP are suggested.

More than 20 years after the importance of diagnostic clinical tests for lumbar instability in individuals with LBP was first defined, clinicians can use some tests showing encouraging results in terms of accuracy and reliability. Nevertheless, their application in daily practice might be limited by insufficient research and evidence on their performance. Future research should compare different assessment methods on the same sample within a single study, in order to evaluate their reliability and validity.