Introduction

Neck pain is the second most common musculoskeletal problem [1]. It is one of the leading causes of years lived with disability worldwide and represents an increasing burden on healthcare systems [2,3,4]. The economic burden of neck pain, in terms of treatment costs, lost productivity and work-related problems, is high [1]. The point prevalence of neck pain in different countries ranges from 2443.9 to 6151.2 cases per 100,000 population, with the highest values in western Europe [1, 5]. The mean one-year and lifetime prevalences in adults worldwide are 37.2% and 48.5%, respectively [6]. Although acute neck pain usually resolves within two months, approximately 50% of patients are not completely pain free one year after an episode of neck pain [7,8,9]. This illustrates the often chronic-episodic course of the condition, with patients experiencing persistent or recurrent episodes of neck pain [10].

Management of patients with neck pain is a major challenge in physiotherapy, mainly because these patients form a very heterogeneous group in terms of the nature of symptoms, symptom distribution, and underlying pain mechanisms [11]. As neck pain is a multidimensional condition, management should consider multiple factors (e.g. pain mechanisms, and psychological, biological, movement and work-related factors). Among the work-related factors, workload, work or study time, sustained postures or body positions during work and computer work are considered as risk factors for the development of neck pain [1, 12]. The different factors can interact, and their expression may be more or less dominant in each patient, thus influencing the clinical approach [1, 12].

Deficits of sensorimotor capacities (SC) may be one of the factors contributing to neck pain, in particular the persistence or recurrence of neck pain [13]. The sensorimotor system is defined as an integrated whole, comprising afferent and efferent information, with central integration and processing components necessary to provide functional joint stability [14]. It is thought to influence, among others, joint position sense, activation of cervical flexor muscles and control of head-eye movement. The SC of the cervical spine are related to neck pain [15] and patients with neck pain often demonstrate reduced SC, e.g. reduced joint position sense [16,17,18], altered activation patterns of the cervical muscles [19,20,21], or disturbed head-eye movement control [22]. Furthermore, the persistence of deficits in the sensorimotor system can continue even after pain relief. It is hypothesized that persistence of these deficits may contribute to some patients experiencing recurrent episodes of neck pain [23,24,25] and the integration of sensorimotor training in the management of patients with neck pain has shown promising results [13]. Therefore, evaluation of SC in patients with neck pain is important [26]. Various tests to evaluate the sensorimotor system have been developed and are widely used in physiotherapy practice and research. However, the terminology used is often confusing, and there is no consensus on how SC of the neck should be assessed [14, 27]. Systematic reviews of tests for SC of the neck have investigated only a limited selection of tests assessing single aspects of SC, such as joint position sense [28] or muscle function [29,30,31]. A systematic review, providing a comprehensive overview of all available tests to assess all different aspects of SC of the neck, is lacking.

Given that many tests exist for assessment of SC of the neck, the challenge is to choose the most appropriate test for use in a specific situation. From a scientific perspective, knowledge about the quality of a test, i.e. measurement properties, is important when making this decision. The quality of a test depends on three criteria: reliability, validity and responsiveness [32].

This systematic review investigates the domain reliability. Reliability is the degree to which measurements are free from measurement error. The domain reliability includes three measurement properties: reliability (expressing the proportion of the total variance in the measurements that is due to ‘true’ differences between patients), measurement error (the systematic and random error of a patient’s score that is not attributed to true changes in the construct to be measured), and internal consistency [32]. Internal consistency is usually investigated in self-reported multi-item questionnaires and is therefore not relevant for the single-item tests used to assess SC.

The aim of this systematic review is to include all tests assessing any aspect of SC of the neck. Therefore, since many different tests are described in the literature, this review focusses only on reliability. Of course, when deciding which test to use, it would also be important to consider the different aspects of validity.

The concepts of reliability and measurement error are related, but focus on different purposes. Reliability focusses on the variability between patients or measurements and is influenced by the variation in the population where the test is used. On the other hand, measurement error is a relevant parameter for measurement of change over time, and it is not affected by population variability [33]. In clinical practice physiotherapists are interested in both concepts. The distinction between patients with and without deficits in the sensorimotor system (diagnostic purpose) is important. But measurement error is also an issue, as change over time, i.e. the evolution of the patient’s symptoms, is of interest.
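The difference between the two concepts can be illustrated with the usual variance-component notation. The following is only a sketch under a simple model: the symbols $\sigma_p^2$ (variance between patients), $\sigma_e^2$ (error variance, which depending on the model may include systematic differences between raters or occasions), SEM (standard error of measurement) and SD (standard deviation of the observed scores) are introduced here for illustration and are not taken from the included studies.

```latex
\mathrm{ICC} = \frac{\sigma_p^2}{\sigma_p^2 + \sigma_e^2},
\qquad
\mathrm{SEM} = \sqrt{\sigma_e^2} = \mathrm{SD}\,\sqrt{1 - \mathrm{ICC}},
\qquad
\mathrm{SD}^2 = \sigma_p^2 + \sigma_e^2
```

Under these assumptions, a more heterogeneous sample (larger $\sigma_p^2$) raises the ICC even when $\sigma_e^2$ is unchanged, whereas the SEM depends only on $\sigma_e^2$.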

The aims of this systematic review are: (a) to provide an overview of tests used in physiotherapy to assess SC in patients with neck pain; and (b) to provide information about the reliability and measurement error of these tests, to enable physiotherapists to select appropriate tests.

Methods

Design

A systematic review of studies investigating the reliability and measurement error of tests assessing SC of patients with neck pain in a physiotherapy setting.

Search strategy

The databases CINAHL, Embase and PsycINFO were searched up to July 2020, and Medline was searched up to May 2021. Blocks of search terms were developed for: (a) construct of interest (sensorimotor capacities), (b) population (patients with neck pain), (c) the sensitive PubMed filter developed by Terwee et al. [34] for the identification of studies about measurement properties of measurement instruments, and (d) the exclusion filter proposed by Terwee et al. [34] to exclude irrelevant studies. The two filters were adapted to the other databases, adopting the strategy used by Ammann-Reiffer et al. [35]. There was no language restriction. The reference lists of systematic reviews retrieved were hand searched for further eligible studies. The detailed search strategy is shown in Additional file 1.

Selection process

Two reviewers (SE and either RH or MT) screened the titles and abstracts independently, based on the predefined inclusion and exclusion criteria listed in Fig. 1. Disagreements were discussed and, if necessary, a third reviewer (CB) made a decision regarding inclusion. Reviewers in the team were able to read English, German, Dutch, French, Danish and Norwegian, and no exclusion of relevant papers based on language was noted.

Full-text screening was performed independently by two researchers (SE and RH) using the same predefined criteria (Fig. 1). After each screening step (title/abstract and full text), in the case of any disagreement about inclusion, consensus was reached through discussion with a third reviewer (CB). The screening was carried out using Covidence systematic review software [36].

Data extraction

Data extraction was conducted using REDCap electronic data capture tools hosted at the University of Applied Sciences and Arts Western Switzerland (HES-SO) Valais [37] by SE and RH. The first five studies were checked by a third researcher (CB) to ensure the correct procedure. Data were extracted on study characteristics, reliability, and measurement error of the different tests. Two researchers (SE and RH) assessed methodological quality, applying the COSMIN risk of bias tool in the adapted version for clinician-reported or performance-based outcome measures [38]. Each criterion was rated on a four-point rating system (i.e. very good, adequate, doubtful, or inadequate). The lowest rating determined the overall rating of the study (worst-score-counts method). The detailed tables for risk of bias assessment are shown in Additional file 2 (reliability) and Additional file 3 (measurement error). A third researcher (CB) performed a check of the first studies. Data extraction and synthesis was conducted with all included studies regardless of their methodological quality.
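The worst-score-counts rule described above can be sketched in a few lines. This is a minimal illustration, not the COSMIN tool itself: the function name `overall_rating` and the example item ratings are hypothetical, and only the four rating labels come from the text.

```python
# Ratings ordered from best to worst, as in the four-point system described above.
RATING_ORDER = ["very good", "adequate", "doubtful", "inadequate"]


def overall_rating(item_ratings):
    """Return the worst rating among the items (worst-score-counts)."""
    if not item_ratings:
        raise ValueError("at least one item rating is required")
    # The rating with the highest index in RATING_ORDER is the worst one.
    return max(item_ratings, key=RATING_ORDER.index)


# Hypothetical item ratings for one study (illustrative only, not extracted data).
example = ["very good", "adequate", "doubtful", "very good"]
print(overall_rating(example))  # doubtful
```

A single doubtful item is therefore enough to make the overall study rating doubtful, regardless of how many other items were rated very good.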

Data synthesis and analysis

The intraclass correlation coefficients of studies that used the same device and similar instructions for the corresponding test were quantitatively pooled. When pooling was not feasible, the study results were qualitatively summarized by reporting the lowest and highest values. Because of the large number of tests applied for different directions of movement of the neck (left rotation, right rotation, etc.), test directions whose reliability or measurement error values led to the same conclusion regarding the criteria for good measurement properties were summarized by reporting the lowest and highest value. Test directions with very different values (i.e. when the conclusion about the appropriateness of the reliability or the measurement error for this direction would be different from that for other directions) were reported separately.

The overall results for the reliability and/or measurement error of single studies or of summarized or pooled studies were compared against the criteria for good measurement properties. In a next step, the quality of evidence was graded according to the modified GRADE method proposed by the COSMIN group [38]. The quality of evidence was classified as high, moderate, low, or very low. The score was downgraded for risk of bias (minus one for serious, minus two for very serious and minus three for extremely serious risk of bias), inconsistency (where more than one study per test was available, minus one if I² > 0.5), and imprecision (minus one if total sample size n = 50–100, minus two if total sample size n < 50). The score was not downgraded for indirectness, due to the restrictive inclusion criteria used in the current study [38]. Detailed tables of the quality of evidence criteria are shown in Additional file 4 (reliability) and Additional file 5 (measurement error).
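The downgrading rules above can be written out as a short sketch. Several points are assumptions made for illustration rather than details confirmed by the text: grading is assumed to start at "high", the label "none" for no risk-of-bias downgrade is hypothetical, the inconsistency rule is read as applying only when more than one study is available and I² exceeds 0.5, and the function name `grade_quality` is invented here.

```python
# Quality levels ordered from best to worst.
LEVELS = ["high", "moderate", "low", "very low"]

# Downgrading steps for risk of bias; "none" is a hypothetical label for no downgrade.
RISK_OF_BIAS_STEPS = {"none": 0, "serious": 1, "very serious": 2, "extremely serious": 3}


def grade_quality(risk_of_bias, total_n, n_studies=1, i_squared=None):
    """Return the quality of evidence after downgrading (assumed to start at 'high')."""
    steps = RISK_OF_BIAS_STEPS[risk_of_bias]

    # Inconsistency: assumed to apply only with more than one study and I² > 0.5.
    if n_studies > 1 and i_squared is not None and i_squared > 0.5:
        steps += 1

    # Imprecision: minus two if n < 50, minus one if n = 50-100.
    if total_n < 50:
        steps += 2
    elif total_n <= 100:
        steps += 1

    # The grade cannot fall below "very low".
    return LEVELS[min(steps, len(LEVELS) - 1)]


# Hypothetical inputs, for illustration only.
print(grade_quality("serious", total_n=40))  # very low
print(grade_quality("none", total_n=60))     # moderate
```

Under these assumptions, a single small study (n < 50) with serious risk of bias already reaches the lowest level.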

Fig. 1

Criteria for inclusion or exclusion of studies

Results

In total 11,704 studies were found using the search strategy in four databases (Medline, CINAHL, Embase, PsycINFO). First, 3741 duplicates were removed. The remaining 7963 studies were screened for title and abstract, and 7803 were excluded based on the predefined criteria. Of the 160 full-text studies, 118 were excluded. The reasons for exclusion are listed in Fig. 2.

A final total of 42 studies, investigating a total of 206 tests for the assessment of SC in patients with neck pain, were included in the systematic review (Table 1).
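The screening counts reported above can be checked for internal consistency with simple arithmetic. The variable names below are chosen for illustration; the numbers are those stated in the text.

```python
# Counts as reported in the text.
records_found = 11704
duplicates_removed = 3741
excluded_title_abstract = 7803
excluded_full_text = 118

# Each step subtracts the records removed at that stage.
screened = records_found - duplicates_removed
full_text = screened - excluded_title_abstract
included = full_text - excluded_full_text

print(screened, full_text, included)  # 7963 160 42
```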

Fig. 2

Flow chart

Table 1 Characteristics of the included studies

Tests were categorized into 18 different groups (e.g. tests for active range of motion in the different movement directions of flexion, extension, lateral flexion, and rotation with the help of different devices were grouped together as active range of motion tests). Based on the classification of Riemann & Lephart [14], tests for the sensory and the motor components of the sensorimotor system were identified, but no tests for the central integration component were found. Within the sensory component, tests in the subcomponents “tactile” and “conscious proprioceptive senses” were found. As this study did not search for tests assessing pain, the subcomponent “pain” does not contain a test. A list of all groups of tests is shown in Fig. 3.

Fig. 3

Sensorimotor system definition (according to Riemann & Lephart [14]) and the 18 groups of tests included in this systematic review (pink boxes)

According to the COSMIN criteria, the following 12 tests were rated as good: craniocervical flexion test (test-retest reliability), neck flexor muscle endurance test (inter-rater and test-retest reliability), neck extensor muscle endurance test (inter-rater and test-retest reliability), sternocleidomastoid muscle strength (test-retest reliability), maximal voluntary isometric contraction (test-retest reliability), isometric strength with the help of different devices (test-retest reliability), flexion-relaxation ratio (test-retest reliability), active range of motion test with the help of different devices (inter-rater and test-retest reliability), figure of eight test (inter-rater and test-retest reliability), zigzag test (inter-rater and test-retest reliability), smooth pursuit neck torsion test (test-retest reliability), and rod and frame test (test-retest reliability). An overview of the ratings of all tests is shown in Table 2. However, regarding reliability, the quality of evidence was rated as low or very low for all included studies. The reasons for downgrading are shown in Additional file 4.

Table 2 Summary of findings

Regarding measurement error, the criteria for good measurement error were rated as unknown for all included tests, because the minimal clinically important change was not reported. The quality of evidence was rated very low to high (Table 2). Reasons for downgrading are shown in Additional file 5.

Discussion

This systematic review included 42 studies evaluating 206 tests, with the aim of investigating the reliability and measurement error of tests for SC in patients with neck pain. The main findings are as follows. Firstly, tests for the sensory and motor components of the sensorimotor system were found, but not for the central integration component; in addition, no data were found on reliability or measurement error in patients with neck pain for some tests that are used in practice, such as the movement control tests, which would belong to the motor component. Secondly, approximately half of the tests, particularly tests that are easier to standardize with regard to test position or movement direction, showed good reliability. Finally, tests evaluating more complex movements, which are more difficult to standardize, were less reliable.

In general, all included muscle endurance tests had good (relative) reliability values according to the criteria for good measurement properties proposed by COSMIN, except for the scapula muscle endurance test in standing position. The execution of this test is much more complex and more difficult to standardize than that of the other tests. Furthermore, scapula movements, compensatory movements, muscle recruitment etc. are more difficult to assess compared with neck movements where the movement directions follow the sagittal, frontal, or transversal plane in a more stable way. Similarly, regarding reliability of the isometric muscle strength tests, tests involving the judgement of movements or muscle recruitment around the scapula have lower values for reliability than tests for isometric activity of the head into flexion, extension, lateral flexion, or rotation. Again, this may be because scapula positions are more difficult to standardize, and isometric contractions of the scapula muscles are more difficult to assess regarding compensatory movements than isometric muscle activity of the muscles of the cervical spine.

The test of fast cervical rotations showed very low reliability, possibly due to the very complex characteristics of these movements, which make it difficult to standardize the test. The tests assessing active range of motion (AROM) of the cervical spine showed that assessment of rotation is more difficult compared with the other movement directions. This is particularly evident when the rotation is assessed as a single movement (combined right and left values) and when AROM is assessed with the help of a smartphone. In the current analysis, the values were less reliable for Android phones than for iPhones (see Table 2). This could be due to differences in the study protocols. In the study that used an Android phone, it was only held against the head, whereas in the study assessing AROM with an iPhone, the device was fastened securely to the forehead with a rigid strap, which might produce more reliable results. The assessment of AROM with the help of a dynamometer or goniometer showed good test-retest reliability results, but less good values for inter-rater reliability. It is evident that good values for inter-rater reliability are more difficult to achieve, because more sources of variation are included (e.g. different testers). Thus, the standardization of these types of tests is often a problem.

Using the example of the craniocervical flexion test (CCFT), this review shows that tests that require a substantial subjective rating (e.g. judgement of muscle recruitment or movement patterns) lead to lower reliability compared with more objective criteria (e.g. time). The current results are in line with a recent systematic review by Selistre and colleagues [30], investigating clinical tests for measuring strength or endurance of cervical muscles. They found moderate to good intra- and inter-rater reliability for the CCFT, cervical flexor endurance test, cervical extensor endurance test and cervical muscle strength assessed using a handheld dynamometer. The results of the current review are comparable for the CCFT, the cervical flexor endurance test and the cervical extensor endurance test. For the cervical muscle strength tests, the current review performed a more detailed analysis, e.g. Selistre et al. [30] described the cervical strength tests only with the handheld dynamometer and not with other devices. In the current study, the cervical strength tests with dynamometer showed good results for test-retest reliability, but poorer results for intra-rater reliability. The results of the current review are also comparable with those of a recent systematic review of the measurement properties of the CCFT [31]. The authors classified the inter-rater and the intra-rater reliability of the CCFT as positive and the level of evidence as moderate. The measurement error was classified as indeterminate and the level of evidence as unknown. The authors identified the same problems as found in the current review, such as low methodological quality of the included studies and missing data on minimal clinically important change.

The two recent systematic reviews on measurement properties of tests for the SC of the neck included studies with participants with and without neck pain [30, 31]. Both stated that studies on participants with neck pain were lacking, which is in line with the current results. The current review excluded several studies because the results for participants with neck pain were not reported separately but only together with those for people without neck pain. It was decided to include only studies with data for patients with neck pain, given our interest in the use of the tests in a clinical setting. Because the reliability of a test is influenced by the heterogeneity of the population in which the test is performed, it is important to know the reliability for a comparable population to that in which the test will be administered. It was also surprising that tests such as the CCFT, which is widely used in clinical practice, are so rarely investigated in patients with neck pain.

The major strength of this study is that it included all available tests for assessment of all aspects of sensorimotor control of the neck. However, the study also has a number of limitations. Firstly, many tests were performed only on healthy participants or in a mixed group of participants with and without neck pain. Several studies were excluded, including all studies assessing tests for movement control of the neck, as the authors did not report separate data for the patient group. Secondly, the quality of evidence was low to very low regarding reliability for all included studies. It was necessary to downgrade the level of evidence, mainly because of high risk of bias, inconsistency, and low precision. In the assessment of risk of bias, the item “patient stability” was rated as doubtful in particularly many cases. COSMIN recommends that patient stability should only be rated as very good if the study explicitly describes that the patients’ condition did not change between measurements. As this information was often missing, the current review had to rate the patient stability item as doubtful, even though the time interval between measurements was adequate. A further limitation of this review is that the included studies did not report data on interpretability and feasibility of the different tests, which would be important information for the recommendation of specific tests. Finally, this review did not assess aspects of validity, which would certainly also be important for the selection of appropriate tests.

Better studies are needed on reliability, measurement error and validity of tests in patients with neck pain, because the quality of evidence of the existing research is mainly low or very low, and the reliability of some tests (e.g. for movement control) was not evaluated in patients with neck pain at all.

Conclusion

Despite the large number of tests available, the quality of evidence is not yet high enough to conclusively inform clinicians which test to use to assess SC in patients with neck pain.

For clinical practice, this systematic review shows that tests with objective criteria and a thorough standardization should be chosen to ensure higher reliability.

Measurement error could not be evaluated because the minimal clinically important change was not available for any of the tests.