Introduction

Degenerative cervical myelopathy (DCM) is a chronic, progressive degenerative disease that predominantly affects adults aged 50 and over [1,2,3]. DCM is usually diagnosed through a combination of radiological tests, clinical tests, and a series of functional scores. Hoffmann’s sign, Finger Escape Sign, Scapulo-humeral Reflex, and Reverse Supinator Reflex are tested as the special signs of DCM [4,5,6,7]. Functional deficits are assessed with the Nurick Scale and the Japanese Orthopaedic Association Scoring System for Cervical Myelopathy (JOA) [8, 9]. Several self-reported questionnaires are used to quantify disturbances in physical function and quality of life in DCM [10,11,12], including the JOA Cervical Myelopathy Evaluation Questionnaire (JOACMEQ), Neck Disability Index (NDI), Health-related Quality of Life Short Form-36 (SF-36), and EuroQol questionnaire (EQ-5D-5L). Hand clumsiness and gait disturbance are the characteristic clinical manifestations [1, 13,14,15,16], so physical performance tests (PPT) are a key reflection of functional capability in DCM. However, the 10-second Grip and Release Test (G&R) is currently the only PPT accepted in the diagnosis of DCM [17,18,19], which is insufficient to examine the wide-ranging functional deficits in DCM.

Currently, the clinical monitoring of DCM relies on a few neurological signs and self-reported questionnaires, which may be influenced by post-operative wound pain and associated physical limitations [20]. Function-based outcome indicators may improve the accuracy of clinical decisions. Consequently, PPT is gaining importance in evaluating outcomes following cervical spinal surgery [21]. A number of studies have investigated the role of PPT in post-operative monitoring of DCM [8, 19, 22, 23], yet knowledge is still limited [24]. How PPT, as an indicator of functional deficits in DCM, influences clinical management pathways remains unclear.

To strengthen the evidence base for clinical practice in DCM, effective tools for assessing physical performance and tracking treatment outcomes are essential. The present review aims (1) to investigate the effectiveness of PPT in differentiating between DCM and healthy controls; and (2) to identify the efficacy of PPT as outcome indicators during post-operative clinical monitoring of DCM.

Materials and methods

A systematic review was conducted in line with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. The review protocol was registered in the PROSPERO database (CRD42021220905). The literature search covered the online databases AMED, Cochrane Library, CINAHL, EMBASE, MEDLINE, PubMed, and Web of Science. It was restricted to English-language publications and spanned from database inception to April 13, 2022.

Search strategy

The literature search was implemented with the search terms [“CERVICAL” AND “DEGENERATI*” AND “MYELOPATH*”] AND [“CLINICAL” OR “PHYSICAL” OR “NEUROLOGICAL” OR “FUNCTIONAL”] AND [“ASSESSMENT” OR “TEST*” OR “EXAMINATION*” OR “EVALUATION*”] (Appendix 1). Additional relevant studies were retrieved through a forward citation search via Scopus, and reference lists were searched manually to avoid omitting relevant studies that the search strategy may have missed.

The inclusion criteria were strictly adhered to during study extraction: (1) study design: randomized controlled trials, controlled clinical trials, cohort or case-control studies; (2) study population: DCM; and (3) valid, reliable, non-instrumental, and quickly administered physical performance tests. Articles were excluded if (1) DCM patients had other neurological conditions; (2) the PPT required a sophisticated experimental setup; (3) official or legitimate registration was required to apply the PPT; (4) there was no statistical comparison between the PPT and the Modified Japanese Orthopaedic Association Scoring System for Cervical Myelopathy (mJOA); or (5) they were non-human studies, case reports, or review articles.

Study selection and data acquisition

Two reviewers (KP & KL) performed study selection independently, and inter-reviewer discrepancies were resolved by agreement between the reviewers. Study selection began with the elimination of duplicates, followed by title-abstract screening and full-text screening. The characteristics of each study were extracted, including the author’s name, year of publication, country of origin, research design, total sample size, DCM confirmation method, confounding factors (i.e., age and sex), testing functions, and measurements of PPT. Statistical data were tabulated as the sample size, mean, and standard deviation of each study group for further analysis.

Quality assessment

The risk of bias in the studies was assessed with the Newcastle-Ottawa Scale (NOS); the case-control and cohort studies were scored separately. The NOS is a 9-item, criterion-specific evaluation of sample selection, analysis of bias, and quality of exposure. One star was awarded only when the minimum standard for an item was met; the maximum score was nine stars. More stars indicated a lower risk of bias and a higher-quality article. A high-quality study with the lowest risk of bias scored 7 or more stars, while 4 to 6 stars suggested moderate quality and a medium risk of bias. A low-quality paper with a very high risk of bias scored 4 stars or fewer [25].
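The star thresholds above amount to a simple banding rule, sketched below for illustration. The function name `nos_quality` is hypothetical, and because the text lists 4 stars under both the moderate and the low band, the sketch assumes that a score of 4 counts as moderate.

```python
def nos_quality(stars: int) -> str:
    """Map a Newcastle-Ottawa Scale star count (0 to 9) to a quality band.

    Assumption (not stated in the source): a score of exactly 4 stars is
    treated as moderate, since the text places it in two bands.
    """
    if not 0 <= stars <= 9:
        raise ValueError("NOS scores range from 0 to 9 stars")
    if stars >= 7:
        return "high quality, low risk of bias"
    if stars >= 4:
        return "moderate quality, medium risk of bias"
    return "low quality, high risk of bias"


if __name__ == "__main__":
    for score in (3, 5, 7):
        print(score, "->", nos_quality(score))
```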

Meta-analysis

In the meta-analysis, mean PPT scores in DCM and non-DCM controls were compared in the case-control studies, while pre-operative (Pre-op) and post-operative (Post-op) PPT performance was compared in the cohort studies. Differences in the “DCM vs. Controls” and “Pre-op vs. Post-op” groups were assessed through pooled estimates obtained from a random-effects model with the inverse-variance method. The corresponding mean difference (MD) and 95% confidence interval (CI) were analyzed with the level of significance set at 0.05. Homogeneity among comparisons was assessed with the I² statistic [9], with a value ≤ 25% indicating high homogeneity, 26-74% indicating moderate heterogeneity, and ≥ 75% indicating extremely high heterogeneity.
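The pooling procedure described above can be sketched as follows. This is a minimal illustration, not the Review Manager implementation: it assumes the DerSimonian-Laird estimator for the between-study variance (Tau²), a normal-approximation 95% CI, and hypothetical function names. The input numbers in the demo are made up and do not come from the included studies.

```python
import math


def mean_difference(m1, sd1, n1, m2, sd2, n2):
    """Return the mean difference between two groups and its variance."""
    return m1 - m2, sd1 ** 2 / n1 + sd2 ** 2 / n2


def pool_random_effects(effects, variances):
    """Inverse-variance random-effects pooling (DerSimonian-Laird, assumed)."""
    w = [1.0 / v for v in variances]                       # fixed-effect weights
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))  # Cochran's Q
    df = len(effects) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0        # between-study variance
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0    # I² in percent
    w_star = [1.0 / (v + tau2) for v in variances]         # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return {"md": pooled, "ci": (pooled - 1.96 * se, pooled + 1.96 * se),
            "tau2": tau2, "i2": i2}


def describe_i2(i2):
    """Apply the review's I² bands; values between 25 and 26 count as moderate."""
    if i2 <= 25:
        return "high homogeneity"
    if i2 < 75:
        return "moderate heterogeneity"
    return "extremely high heterogeneity"


if __name__ == "__main__":
    # Made-up (mean, SD, n) values for a patient group and a control group.
    studies = [mean_difference(16.0, 4.0, 30, 21.0, 3.5, 30),
               mean_difference(15.5, 4.5, 25, 20.0, 4.0, 28)]
    result = pool_random_effects([s[0] for s in studies], [s[1] for s in studies])
    print(result, describe_i2(result["i2"]))
```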

Review Manager Version 5.4.1 (Cochrane Collaboration, UK) was used for data synthesis. The effect of PPT was confirmed by computing Cohen’s d effect size (ES) and its CI with the Campbell Collaboration effect size calculator [7]. An ES of 0.2 to 0.3 is considered “small,” 0.5 “medium,” and > 0.8 “large” [24].
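For readers unfamiliar with Cohen’s d, the sketch below shows the standard pooled-standard-deviation formula and a common large-sample approximation for its CI. It is an illustration under stated assumptions, not the Campbell Collaboration calculator itself. The function names are hypothetical, and the cut-offs between the bands quoted above (for example, treating 0.3 to 0.5 as “small”) are an assumption.

```python
import math


def cohens_d(m1, sd1, n1, m2, sd2, n2):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    pooled_sd = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2)
                          / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd


def d_confidence_interval(d, n1, n2, z_crit=1.96):
    """Approximate 95% CI for d using the usual large-sample variance."""
    var_d = (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))
    se = math.sqrt(var_d)
    return d - z_crit * se, d + z_crit * se


def label_effect(d):
    """Band |d|; boundaries between the quoted anchor values are assumed."""
    size = abs(d)
    if size > 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "negligible"


if __name__ == "__main__":
    # Made-up group summaries (mean, SD, n), for illustration only.
    d = cohens_d(10.0, 2.0, 20, 8.0, 2.0, 20)
    print(d, d_confidence_interval(d, 20, 20), label_effect(d))
```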

Results

The initial literature search yielded 3111 articles, of which 1531 remained after the removal of duplicates; 1505 articles were then excluded at title-abstract screening in strict adherence to the inclusion and exclusion criteria. An additional 8 citations were found in the forward citation search, and finally 26 studies remained for full-text screening. Among these, 15 studies were eliminated owing to the absence of correlation with mJOA or insufficient information for data synthesis and meta-analysis. In total, 19 studies were included: 5 prospective cohort studies, 13 prospective case-control studies, and 1 retrospective case-control study (Fig. 1). There were 6359 subjects altogether in this review (Tables 1, 2).

Fig. 1 PRISMA 2009 flow diagram of the literature search

Table 1 Study characteristics of Degenerative Cervical Myelopathy and physical performance tests (Case-control Study)
Table 2 Study characteristics of Degenerative Cervical Myelopathy and physical performance tests (Cohort Study)

Risk of bias assessment

The mean NOS scores for the case-control and cohort studies were 5.36 and 5.60, respectively, which suggested moderate quality with a medium risk of bias (Tables 3, 4). Four case-control studies were of high quality (28.6%), 10 were of moderate quality (71.4%), and only 1 low-quality study (7.1%), with a high risk of bias and a score of 3 stars, was included. All cohort studies were of moderate quality with a medium risk of bias; 2 studies (40%) scored 5 stars and 3 (60%) scored 6 stars.

Table 3 Quality assessment by Newcastle-Ottawa Scale (NOS) for case-control study
Table 4 Quality assessment by Newcastle-Ottawa Scale (NOS) for cohort study

Fulfilment of the NOS was generally low in the components “selection” (20-50%), “comparability” (0-43%), and “exposure” (36%), as shown in Tables 3 and 4. The absence of clear definitions of control subjects and confounding factors (e.g., age, sex), and the lack of blinding to subject status, were identified as key limitations of this review.

Table 5 Statistic of clinical measures on physical performances in Degenerative Cervical Myelopathy

Physical performance tests

Six PPT were identified and grouped into 2 domains: upper limb (13 studies, 50%) and lower limb (12 studies, 46%). All were time-speed tests assessing either the maximum performance within a fixed time limit or the time required to complete a structured task. The upper limb domain comprised the 10-second Grip and Release Test (G&R) (10 studies, 52.6%) and the Nine Hole Peg Test (9HPT) (3 studies, 11.5%). G&R assessed the maximum number of repetitions of reciprocal full opening and fisting of a single hand within 10 seconds. 9HPT tested fine finger dexterity by recording the time spent placing and removing nine round pegs on a pegboard. G&R and 9HPT assessed the dominant and non-dominant hands separately. Similarly, the 10-second Stepping Test (SST) (5 studies, 19.2%), 30-meter Walking Test (30MWT) (4 studies, 15.4%), Foot Tapping Test (FTT) (3 studies, 11.5%), and Triangle Stepping Test (TST) (1 study, 3.8%) formed the lower limb domain. SST and 30MWT evaluated reciprocal coordination of both lower limbs concurrently, while FTT and TST assessed each lower limb separately [9, 26, 27] (Table 5).

Meta-analysis on detection of DCM

Although 6 PPT were summarized for their effect in detecting DCM, TST was described in a single article; thus, only the 5 remaining tests were pooled and clustered into “upper limb” and “lower limb” groups for meta-analysis. The lower limb cluster consisted of SST, 30MWT, and FTT. Studies on SST demonstrated a high degree of homogeneity (Tau² = 0.00, I² = 0%) with an MD of -4.20 (95% CI = -4.91 to -3.49), and the ES was excellent at 11.53 (p < 0.00001). Likewise, 30MWT had a high degree of homogeneity (Tau² = 0.00, I² = 0%, MD = 0.86, 95% CI = -2.10 to 3.82); however, the effect size was small and not significant (ES = 0.57, p = 0.57). A satisfactory ES of 7.75 (p < 0.00001) was found for FTT, although the Tau² of 2.75 and I² of 91% indicated a high degree of heterogeneity (MD = -7.84, 95% CI = -9.82 to -5.89) (Fig. 2).
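A pooled MD and its 95% CI can be cross-checked against the accompanying test statistic and p value. The sketch below is a rough consistency check only. It assumes the CI is a symmetric normal-approximation interval, uses a hypothetical function name, and is not necessarily how Review Manager derives its output.

```python
import math


def z_and_p_from_ci(md, lower, upper, z_crit=1.959964):
    """Recover the standard error from a 95% CI, then the Z statistic and p value."""
    se = (upper - lower) / (2 * z_crit)        # half-width divided by 1.96
    z = md / se
    p = math.erfc(abs(z) / math.sqrt(2))       # two-sided p from the normal distribution
    return z, p


if __name__ == "__main__":
    # 30MWT values quoted in the paragraph above.
    z, p = z_and_p_from_ci(0.86, -2.10, 3.82)
    print(round(z, 2), round(p, 2))
```

With the 30MWT values quoted above (MD = 0.86, 95% CI = -2.10 to 3.82), the function returns a test statistic of roughly 0.57 and a two-sided p value of roughly 0.57.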

Fig. 2 Meta-analysis of physical performance tests for lower limb between DCM and controls

In the upper limb cluster, G&R showed a high degree of homogeneity (Tau² = 0.28, I² = 25%) with an MD of -5.58 (95% CI = -6.13 to -5.03) and a large ES of 19.85 (p < 0.00001), whereas 9HPT had a relatively lower ES of 5.11 (p < 0.00001) and was equally homogeneous (Tau² = 0.00, I² = 0%, MD = 9.89) (Fig. 3).

Fig. 3 Meta-analysis of physical performance tests for upper limb between DCM and controls

Meta-analysis on clinical monitoring of DCM

Five tests were pooled for the meta-analysis on clinical monitoring: G&R and 9HPT for the upper limbs, and 30MWT, SST, and FTT for the lower limbs. In the upper limb cluster, the pooled studies on 9HPT (Tau² = 0.00, I² = 0%, ES = 2.87, p = 0.004, MD = -7.63, 95% CI = -12.84 to -2.41) and G&R (Tau² = 0.06, I² = 15%, ES = 18.97, p < 0.00001, MD = 3.58, 95% CI = 3.21 to 3.95) demonstrated a high degree of homogeneity, as shown in Fig. 4.

Fig. 4 Meta-analysis of physical performance tests as clinical outcome indicators for upper limb

In the lower limb cluster, 30MWT showed the highest degree of homogeneity (Tau² = 0.00, I² = 0%) with an ES of 15.58 (p < 0.00001) and an MD of -12.58 (95% CI = -13.90 to -11.25). SST was as homogeneous as 30MWT (Tau² = 0.00, I² = 0%) with an ES of 13.36 (p < 0.00001) and an MD of 3.19 (95% CI = 2.72 to 3.66). FTT demonstrated substantial heterogeneity, with a high I² of 84% and a non-significant ES of 1.85 (p = 0.06; Tau² = 7.80, MD = 3.97, 95% CI = -0.23 to 8.18), as shown in Fig. 5.

Fig. 5 Meta-analysis of physical performance tests as clinical outcome indicators for lower limb

Discussion

Impaired functional performance is a crucial element in diagnosing DCM [28], yet few functional performance tests are available and accessible for clinical assessment. Functional performance testing is usually carried out in laboratories with a sophisticated setup, or requires expensive self-designed tools for assessing limb function [29, 30], such as the VICON three-dimensional motion capture system for motion analysis [31,32,33,34]. The psychometric properties of the tests were rarely analyzed, especially in experimental trials. Validated clinical evaluations for DCM, for instance the Graded Redefined Assessment of Strength, Sensation and Prehension for Myelopathy (GRASSP-M), require mandatory certification for practice [35] and are usually lengthy. In general, clinicians do not favor lengthy assessment tools owing to their packed schedules. To be practicable, PPT should preferably be non-instrumental and quickly administered. The use of PPT may enhance clinical documentation and narrow the operational gap between diagnosis, monitoring, and decision-making in DCM.

The present review summarized 6 PPT for the detection and clinical monitoring of DCM. G&R and 9HPT evaluated the upper limbs, while 30MWT, FTT, SST, and TST assessed the lower limbs. Performance in activities of daily living, specifically those involving fine hand manipulation and coordination in walking, is believed to be in line with the somatosensory and sensorimotor deficits resulting from cervical spinal cord compression in DCM [36,37,38,39,40]. Most commonly, the cord compression in DCM occurs in the sagittal plane [41]; the dorsal column and corticospinal tract, which are responsible for proprioception and motor coordination, respectively, are usually affected [9, 42]. Thus, DCM is predominantly associated with hand clumsiness and gait disturbance [10, 11, 43,44,45]. Because it results from incoordination of the upper or lower limbs, inadequate performance detected by PPT should indicate definite functional deficits in daily living. These PPT were all validated for DCM against the normal performance of healthy controls and have been developed for different ethnicities [19, 43, 46]. In this review, the sensitivity and specificity of PPT were not addressed statistically. Clinically, the sensitivity and specificity of each PPT are important in identifying the characteristics of DCM and the effect of treatment; therefore, further study is essential to strengthen the application of PPT in the diagnosis and monitoring of DCM.

Balance during standing and walking was identified as the most impactful functional deficit [47]. Proprioception at the ankle joints is a prerequisite for the body coordination needed to take steps during walking [1, 48]. In DCM, stiff and clumsy ankle movement caused by spasticity or incoordination is a key reason for seeking medical advice [49, 50]. Ankle motor deficiency was expected to be assessed effectively by FTT, a quickly administered, unilateral, single-joint time-speed test of ankle joint coordination [51,52,53]. Nevertheless, FTT was excluded from the meta-analysis on account of its high I², which denotes severe heterogeneity and may be explained by the insufficient number of articles [54,55,56]. Despite the extremely high I² value of 84% in the pooled analysis of FTT studies, the effect size was high at 7.75 (p < 0.00001). Hence, FTT could still be considered an effective tool, and further studies on its application may reduce heterogeneity in the detection and clinical monitoring of DCM. The present findings suggest a certain degree of inconsistency, and clinicians should use FTT with caution.

In the quality assessment, the overall mean NOS scores were 5.36 and 5.60 for the case-control and cohort studies, respectively, indicating fair quality and a moderate risk of bias in the analysis. This constraint was mitigated by the high homogeneity of the pooled studies, as almost all I² values were at or below 25% in this meta-analysis. Furthermore, “comparability” was the element that most increased the risk of bias in the quality assessment, a consistent limitation among the case-control studies. The confounding factors “Sex” and “Age” were not mentioned in most of the case-control studies or in subgroup analyses; only 36% to 43% of articles addressed the variance among the confounding factors. “Assessment Exposure” was missing in 9 studies, 47.4% of the included studies. Without blinding to subject-control assignment, evaluator bias may have been introduced during the tests. Thus, independent blinded assessment is preferred to avoid observer bias.

While Magnetic Resonance Imaging (MRI) is of great importance in diagnosing DCM [57,58,59,60], the functional deficits induced by spinal cord compression remain dependent upon clinical assessment rather than imaging [28]. Therefore, mJOA has been adopted as a clinical outcome measure for DCM since 1980, and later as a universal gold standard [24, 61,62,63,64]. Moreover, it was recently adopted as a triage tool for surgical intervention in DCM according to the AO Spine 2017 International Consensus Guidelines [28]. Although PPT has become more important in diagnosis, clinical monitoring, and surgical decision-making for DCM [6, 65, 66], a preference for G&R can be noted worldwide [13, 17, 67]. In addition, the general acceptance of other PPT as outcome measures in DCM was not high, and thus only a few studies on 9HPT, FTT, SST, and 30MWT were available for review, despite their good psychometric properties in assessing DCM [68, 69]. This was the most limiting constraint of this review; the lack of available studies may have had a distinct impact on the effect size in the meta-analysis, leading to a high degree of heterogeneity. The effect of PPT in detecting DCM may therefore have been underestimated through a type II error, that is, concluding that there was “no effect” when one actually existed [70]. Furthermore, several non-English articles on PPT for DCM were excluded, and some significant information may have been missed owing to the language restriction at the initial screening stage.

Conclusion

In the diagnosis of DCM, the combination of MRI, mJOA, and PPT is well accepted as the gold standard worldwide, and the choice of PPT is biased toward G&R owing to its popularity. Other PPT such as 9HPT, SST, 30MWT, and FTT were rarely used, even though they proved effective and specific in the detection and clinical monitoring of DCM. This review offers clinicians insight into adopting comprehensive assessments, including G&R, 9HPT, SST, 30MWT, and FTT, as complementary diagnostic and monitoring tools for early detection and along the clinical management pathway for DCM.

In view of the fair quality and insufficient number of available articles, the Foot Tapping Test (FTT) was found to be effective but heterogeneous; therefore, further studies on various PPT that address the confounding factors “Sex” and “Age” and the “Assessment Exposure” are necessary to strengthen the evidence for its efficacy.