Chronic musculoskeletal pain is a major source of disability and morbidity in the USA,1 and affects approximately 60% of Veterans with chronic health conditions in Veterans Health Administration (VHA) primary care.2 Management remains challenging, and groups ranging from pain expert coalitions to the National Institutes of Health and the Institute of Medicine have called for more focused and strategic pain therapy research.3 As these groups note, successful development and testing of interventions to improve chronic musculoskeletal pain depends on the use of valid, reliable, and responsive measures of pain domains.

Existing pain outcome measures often span multiple physical, emotional, and social domains. To guide development and use of these measures, experts and stakeholders have formed such initiatives as Outcome Measures in Rheumatology (OMERACT), the Analgesic, Anesthetic, and Addiction Clinical Trial Translations, Innovations, Opportunities, and Networks (ACTTION) public-private partnership with the US Food and Drug Administration (FDA), the associated Initiative on Methods, Measurement and Pain Assessment in Clinical Trials (IMMPACT), and the NIH Task Force on Research Standards for Chronic Low Back Pain. These groups have published several reviews and compiled recommendations suggesting that pain outcome studies measure multiple domains via multiple modes of assessment.4,5,6,7,8,9 These groups have identified both pain intensity or severity (hereafter “severity”) and pain-related impairment of physical function (hereafter “functional impairment”) as key domains for study, as these reflect both pain symptoms and pain’s impact on people’s daily lives.4,8 Functional impairment has been identified as a priority concern for patients10 and is an increasingly common primary outcome domain alongside pain severity. Self-report measures remain the gold standard mode of assessing core pain outcomes, as they reflect subjective pain experience, and as existing observer- and laboratory-based pain measures do not consistently reflect clinically meaningful changes in key pain domains.4,6,11

The Department of Veterans Affairs 2016 State of the Art (SOTA) Conference on non-pharmacological approaches to chronic musculoskeletal pain management recognized the value of adopting a consistent core set of outcome measures for future chronic pain research. For example, such a core could facilitate cross-study comparisons of intervention effectiveness and other findings. To inform their choice of key measures, the VA Pain Measurement Outcomes Workgroup requested an evidence review focused on describing existing research on key psychometric properties of 17 commonly used self-report measures of pain severity and pain-related functional impairment. Research on such psychometric properties would not provide the only criterion for selecting core measures,12 but can be seen as a basic requirement of candidates for wide implementation. Our review addressed the following key question: Which of the 17 self-report pain measures nominated by the VA Pain Measurement Outcomes Workgroup had sufficient psychometric evidence to consider their adoption for use as core outcome measures in future clinical research? The findings in this manuscript are based on a VA Evidence-based Synthesis Program report available online.13


In conjunction with the topic nominators’ expert input, we developed a protocol for this review (registered in the PROSPERO database: CRD42017056610) and identified the populations of interest, study inclusion and exclusion criteria (Table 1), and our primary and secondary psychometric outcomes. The topic nominators requested a focus on chronic, non-traumatic musculoskeletal pain, which was defined as musculoskeletal pain of at least a 3-month duration. There was a particular interest in measures that had been used in Veteran populations and in multidimensional measures that assessed both pain severity and pain-related functional impairments, such as activity limitations and interference with physical function.

Table 1 Inclusion and Exclusion Criteria

Our primary outcome was whether a minimal important difference (MID) had been established for each measure, with a focus on distinguishing minimal clinically important differences from statistically detectable differences. Secondary outcomes were the measures’ responsiveness to change, validity, and test-retest reliability. The 17 pain measures assessed in this review were selected by pain experts in the SOTA workgroup and are outlined in more detail in Table 2.

Table 2 Overview of Pain Measures

Search Strategy

We followed a multi-pronged search strategy. First, we searched MEDLINE (Ovid) from January 2000 to January 2017 for English language publications. Our search strategy, developed with input from a medical librarian, included Medical Subject Heading (MeSH) terms for Pain Measurement and specific locations/types of pain (e.g., Low Back) along with title and abstract words. The search was designed to include all study designs, including systematic reviews. The full search strategy is presented in Supplemental Content Table 1. At the request of reviewers of the full evidence report, we repeated the search with MeSH and title/abstract terms for fibromyalgia. Second, we used Google Scholar, the National Center for Biotechnology Information (NCBI), and PubMed to identify articles not found through the MEDLINE search. Third, we searched for Web sites associated with each pain measure and hand-reviewed all Web references, including those that pre-dated 2000. We also searched for original development and validation papers associated with each measure, regardless of publication date. Fourth, we hand-reviewed the reference lists of all included studies and the reference lists of relevant systematic reviews identified through MEDLINE. Fifth, we invited the SOTA experts to identify additional key articles for review. Sixth, the draft evidence report underwent peer review (including SOTA experts), and peer reviewers were asked to identify any potentially eligible references. All identified references were assessed for eligibility. We set no date limitations on publications identified through hand reviews of reference lists, Web sites, or expert nomination.

Study Selection

Eligibility criteria are presented in Table 1. Abstracts of studies identified in our MEDLINE search were reviewed by trained staff. The full text of potentially eligible articles from abstract review, and of all articles identified from reference list searching or online sources, was reviewed independently by two researchers. Disagreements were resolved by consensus.

Data Abstraction and Quality Assessment

From each eligible study, trained staff abstracted (1) study/population characteristics: location of study, funding source, measurement scales evaluated, time period of assessment (e.g., reporting pain over past week, past month), mode of administration, setting, chronic pain condition, study inclusion/exclusion criteria, baseline pain characteristics, sample size, age, gender, and race/ethnicity, and (2) our psychometric outcomes of interest. For the primary outcome, we noted whether the minimal important difference was clinically anchored (e.g., based on the smallest difference at which participants felt better or worse) or based solely on statistical parameters (e.g., standard error of the measurement). Data were abstracted onto standardized forms piloted by research staff. All data abstraction was completed by one reviewer and verified by another. The psychometric properties represent quality measures; no further quality assessment was done.
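To illustrate what an MID "based solely on statistical parameters" can look like, the following minimal sketch computes a standard error of measurement (SEM) and a 95% minimal detectable change (MDC95). It assumes the commonly used formulas SEM = SD × sqrt(1 − reliability) and MDC95 = 1.96 × sqrt(2) × SEM. The function names and input values are hypothetical and are not drawn from any included study.

```python
import math


def standard_error_of_measurement(baseline_sd: float, reliability: float) -> float:
    """SEM = SD * sqrt(1 - r), where r is a reliability coefficient between 0 and 1."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return baseline_sd * math.sqrt(1.0 - reliability)


def minimal_detectable_change(sem: float, z: float = 1.96) -> float:
    """MDC for the confidence level implied by z; sqrt(2) reflects two measurement occasions."""
    return z * math.sqrt(2.0) * sem


# Hypothetical inputs: baseline SD of 2.0 points on a 0-10 scale, reliability of 0.84.
sem = standard_error_of_measurement(2.0, 0.84)
mdc95 = minimal_detectable_change(sem)
print(f"SEM = {sem:.2f}, MDC95 = {mdc95:.2f}")
```

An estimate of this kind describes change that exceeds measurement error; it does not by itself indicate whether participants perceive that change as meaningful.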

Data Synthesis

We summarized included studies to provide an overview of the populations and pain conditions for which the psychometric properties of measures have been evaluated. We present frequency of estimation of each psychometric outcome for each measure in the form of a heat map and provide a tabular summary of primary outcome results.
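The frequency counts behind such a heat map can be sketched as a simple tally of studies per measure and psychometric outcome. The example below is an illustration only: the record layout, study identifiers, and the `tally_studies` helper are hypothetical and do not reproduce the abstraction forms or data used in this review.

```python
from collections import Counter

# Hypothetical abstraction records: (study_id, measure, psychometric outcome reported).
records = [
    ("study_a", "ODI", "MID"),
    ("study_a", "ODI", "responsiveness"),
    ("study_b", "ODI", "responsiveness"),
    ("study_b", "RMDQ", "test-retest reliability"),
    ("study_c", "RMDQ", "validity"),
]

OUTCOMES = ["MID", "responsiveness", "validity", "test-retest reliability"]


def tally_studies(rows):
    """Count distinct studies reporting each (measure, outcome) pair."""
    unique_rows = set(rows)  # a study counts once per measure-outcome cell
    return Counter((measure, outcome) for _, measure, outcome in unique_rows)


counts = tally_studies(records)
measures = sorted({measure for _, measure, _ in records})
for measure in measures:
    # Counter returns 0 for cells with no studies, which become the empty heat map cells.
    cells = "  ".join(f"{outcome}: {counts[(measure, outcome)]}" for outcome in OUTCOMES)
    print(f"{measure:<6}{cells}")
```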


Literature Flow

The literature flow diagram (Fig. 1) illustrates the process of study review and selection. Using our various search strategies, we identified 1635 abstracts, of which 331 proceeded to full-text review. Over half of the articles excluded after full-text review did not report the psychometric properties of interest; over one-third did not assess a pain measure of interest and/or did not study a population documented to have chronic musculoskeletal pain.

Figure 1 Literature flow chart.

Overview of Study Characteristics

Table 3 summarizes the characteristics of the pain measurement studies included in the review. We included 43 studies: 23 from the USA,20,23,30,31,36,38,39,43,45,46,48,49,50,51,52,56,59,62,64,65,66,67,70 three from Canada,32,57,60 one from South America,41 five from Australia,34,35,47,54,63 and 11 from Europe.33,37,40,42,44,53,55,58,61,68,69 Of the US studies, four enrolled exclusively military Veterans20,48,52,65 and two enrolled both Veterans and non-Veterans.23,50 Study enrollments ranged from 30 participants53 to 998 participants,64 with 29 studies enrolling more than 100 and three enrolling more than 500.36,46,64 The most common chronic musculoskeletal pain condition was low back pain (LBP), with 16 studies enrolling only LBP patients.31,33,34,36,37,40,44,45,46,49,52,54,55,59,66,68 Thirteen studies included patients with any chronic musculoskeletal pain.20,30,35,38,41,48,51,53,57,61,64,65,70 Mean age, reported in 40 studies, ranged from 32 years69 to 80 years45: less than 50 years in 18 studies, 50 to 59 years in 15 studies, and 60 years and older in 7 studies. The percentage of women ranged from 8 to 19% in the studies that enrolled exclusively US military Veterans. Five of the remaining studies enrolled fewer than 50% women,34,43,53,58,62 29 enrolled 50% or more, and 5 did not report the percentage of women enrolled. Race/ethnicity was reported in 18 of the studies, all but one from the USA. The percentage of white enrollees was 75% or higher for 11 of the 18 studies. Additional study characteristics are reported in Supplemental Table 2 (available online).

Table 3 Overview of Included Studies

Heat Map

Figure 2 presents a heat map summarizing findings for the 17 pain measures on four psychometric outcomes of interest: MID, responsiveness, validity (concurrent and/or discriminant), and test-retest reliability. As the heat map shows, 14 measures had data reported on both responsiveness and concurrent validity, 5 measures had data reported on discriminant validity, and 10 measures had data reported on test-retest reliability. Data on all four main psychometric outcomes of interest were reported for five measures: Numeric Rating Scale (NRS), Oswestry Disability Index (ODI), Roland-Morris Disability Questionnaire (RMDQ), SF-36 Bodily Pain Scale (SF-36 BPS), and Visual Analog Scale (VAS). The highest numbers of relevant studies were also found for these five measures. Data on MID, responsiveness, and validity were reported for Brief Pain Inventory (BPI), Graded Chronic Pain Scale (GCPS), and Pain intensity, Enjoyment of life, and General activity (PEG). Data on responsiveness, validity, and test-retest reliability were reported for Multidimensional Pain Inventory (MPI)/West Haven-Yale Multidimensional Pain Inventory (WHYMPI), McGill Pain Questionnaire (MPQ), Patient-Reported Outcomes Measurement Information System - Pain Interference (PROMIS-PI), and Western Ontario and McMaster Universities Arthritis Index (WOMAC). We found no studies meeting eligibility criteria for the Defense and Veterans Pain Rating Scale (DVPRS) or the Knee injury and Osteoarthritis Outcome Score (KOOS). Screened studies of the DVPRS were not specific to chronic musculoskeletal pain, and studies of the KOOS did not administer the measure and/or report findings in English. Supplemental Table 3 (in electronic appendices) identifies specific reviewed studies within this evidence map configuration, and Supplemental Table 4 contains more details on reported quantitative indicators of psychometric properties and relevant study design features.

Figure 2 Number of studies reporting psychometric properties.

Primary Psychometric Outcome

Table 4 reports findings on the primary psychometric outcome, minimal important difference (MID). The VAS is reported twice in this table because the scoring range differed 10-fold across the two studies. Table 4 demonstrates the variety of statistical approaches used to estimate MID. Four studies calculated measure-specific minimal clinically important differences (MCIDs) using a clinically anchored approach,37,55,59,68 and one study used two different populations to calculate statistically detectable differences that were then compared to global ratings of change via kappa statistics.50 Three studies used distribution-based statistical estimations only.34,45,69

Table 4 Summary of Results: Minimal Important Difference (MID)


This focused evidence review evaluated published research on psychometric properties of 17 key patient-reported pain outcome measures assessed in chronic musculoskeletal pain populations. Of the five scales with reported data on all four psychometric outcomes (ODI, RMDQ, SF-36 BPS, NRS, and VAS), three (the ODI, RMDQ, and SF-36 BPS) measure multiple pain domains. The NRS and VAS varied among studies with respect to key construct (pain severity or pain-related functional impairment), phrasing, recall periods, and score ranges, making this overview more a cataloging of different numeric rating scales and visual analog scales than a review of two clearly defined pain measures. Seven additional scales (BPI, GCPS, MPI/WHYMPI, MPQ, PEG, PROMIS-PI, and WOMAC) also had evidence for three key psychometric properties. Findings are consistent with pain outcome measurement reviews focused on specific pain-related diagnoses: a review focused on responsiveness of patient-reported health outcome measures for LBP found the ODI and RMDQ to be the most comprehensively validated,71 and a previous review of back-specific functional status questionnaires for LBP found the ODI and RMDQ to have been most frequently studied, with good measurement properties in their original form as retested in multiple settings.72

The range of MID assessment methods identified in this review reflects variation in current MID-related research. Assessments of minimal clinically important difference (MCID) for a patient-reported outcome measure involve anchoring the measure to an indicator of meaningful patient-reported change in a clinical outcome.73,74,75 While some MID estimates reported here constitute MCIDs anchored to patient-reported clinical improvement via adaptations of the Patient Global Impression of Change (PGIC),37,55,59,68 others are purely estimates of statistical minimum detectable change (MDC) based on study population distribution characteristics34,45,69 without reference to the clinical import of that change. Comparing anchor-based MCID findings with distribution-based MDC findings can be useful in MID estimation, as this allows researchers to consider both an external benchmark of clinical change and a measure of change detectable despite variation.37,73,74 Reviewed studies, however, contained relatively few estimates via any method. Estimation methods also differed substantially, resulting in large discrepancies both within and across measures, and precluding comparison and generalization of measure-specific MIDs. The widespread practice of interpreting a 30% change from baseline as an MID (originally assessed using an NRS for pain severity22 and ultimately recommended for a range of patient-reported pain outcome measures78) may have discouraged measure-specific MID development. Further research should explore whether this approach is empirically generalizable. Consensus is needed on optimal approaches to developing and reporting MID for patient-reported measures in chronic musculoskeletal pain.
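The difference between an anchor-based estimate and the generic 30% criterion can be made concrete with a small example. The sketch below is illustrative only: it assumes one simple anchor-based approach (mean improvement among participants who rate themselves "minimally improved" on a global rating of change), and the participant data, field names, and function names are hypothetical.

```python
from statistics import mean

# Hypothetical participants: baseline and follow-up scores on a 0-10 NRS
# (higher = worse pain) plus a global rating of change used as the external anchor.
participants = [
    {"baseline": 8.0, "follow_up": 5.0, "anchor": "minimally improved"},
    {"baseline": 6.0, "follow_up": 5.0, "anchor": "minimally improved"},
    {"baseline": 7.0, "follow_up": 7.0, "anchor": "no change"},
    {"baseline": 5.0, "follow_up": 2.0, "anchor": "much improved"},
]


def improvement(p):
    """Reduction in score from baseline to follow-up (positive = less pain)."""
    return p["baseline"] - p["follow_up"]


def anchor_based_mcid(rows, anchor_level="minimally improved"):
    """Mean improvement among participants at the chosen anchor level."""
    changes = [improvement(p) for p in rows if p["anchor"] == anchor_level]
    if not changes:
        raise ValueError(f"no participants with anchor level {anchor_level!r}")
    return mean(changes)


def meets_percent_threshold(p, threshold=0.30):
    """True when improvement is at least `threshold` as a fraction of baseline."""
    if p["baseline"] <= 0:
        return False  # percent change is undefined without a positive baseline
    return improvement(p) / p["baseline"] >= threshold


mcid = anchor_based_mcid(participants)
flags = [meets_percent_threshold(p) for p in participants]
print(f"Anchor-based MCID estimate: {mcid:.1f} points")
print(f"Meets 30% criterion: {flags}")
```

In this toy data set, the second participant reports minimal improvement but falls short of the 30% criterion, which is the kind of discrepancy that measure-specific MID research would need to examine.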

There is no gold standard comparator for assessment of pain measure validity in the domains assessed. Included studies’ methods of assessing concurrent/criterion validity involved finding correlations between a measure of interest and another measure or subscale of interest. Other assessments arguably relevant to construct validity, such as relationships of self-reported pain-related functioning measures to objective physical performance measures, were less commonly identified, consistent with the state of current physical function research in pain.8 Perhaps unsurprisingly, therefore, our review identified a self-referential network of patient-reported outcome measures validated against one another, making validity estimates difficult to compare within or across measures. Future research could further investigate the network of validity comparisons to clarify underlying assumptions and identify gaps requiring conceptual research.

Responsiveness findings in reviewed studies were also challenging to compare both within and across measures. Some methods of comparing pain measure changes within clinical trials of pain interventions cannot separate an intervention’s estimated effectiveness (either true differences or chance findings of difference) from the responsiveness of the pain measure used to assess it. Few methods recognize the inherent challenge that short-term fluctuations in pain, which commonly occur in chronic musculoskeletal pain conditions, pose to the capacity of pre-post assessments to track pain trajectory over time. Interpreting test-retest reliability estimates poses similar conceptual challenges: separating undesirable measurement variability from variability that reflects actual fluctuations in pain can be difficult. Thus, short-term fluctuations in a measure may not indicate a lack of test-retest reliability, and may instead be evidence of true responsiveness. Researchers interested in comparing measures’ responsiveness and test-retest reliability should consider available psychometric evidence in the context of their own work, including the recall period of interest, the expected amount and time frame of change in the pain domains they plan to assess, and their desired study design (e.g., pre-post assessment vs. longitudinal repeated-measures assessment).
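Because the validity assessments discussed above rest on correlations between measures, a minimal sketch of that calculation may be helpful. The example assumes a Pearson correlation coefficient; included studies may have used other coefficients, and reliability studies often use different statistics altogether. The scores and the `pearson_r` helper are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for the same five participants on two self-report measures.
measure_a = [2.0, 4.0, 5.0, 7.0, 9.0]       # e.g., a 0-10 severity rating
measure_b = [10.0, 22.0, 24.0, 35.0, 46.0]  # e.g., a 0-100 disability score

r = pearson_r(measure_a, measure_b)
print(f"Correlation between measures: r = {r:.3f}")
```

A high coefficient here shows only that the two self-report measures move together; as noted above, it does not establish that either one captures the underlying pain domain.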

Chronic musculoskeletal pain definition and reporting varied widely across reviewed studies. The required duration for pain to be considered “chronic” was inconsistent and was not always reported. Pain type (e.g., musculoskeletal), primary diagnostic cause (e.g., osteoarthritis), and primary bodily site(s) (e.g., low back) were inconsistently reported, as were relevant characteristics such as pain duration and levels at baseline, treatment use, and co-existing physical or mental health conditions. Such differences reflect active discussion in current pain research: when and how duration, causal diagnoses, and bodily site affect key pain qualities, and when and how intermittent pain differs meaningfully from chronic continuous pain.11,79 Research is needed to define target populations and reporting standards for pain-relevant characteristics in psychometric research on chronic musculoskeletal pain.

The majority of studies were conducted in populations with over 50% women and mean ages 40–59. Most studies did not report race or ethnicity; of those that did, all included more than 50% white participants, and most included more than 75% white participants. No studies reported outcomes stratified by sex or gender, age range, or race/ethnicity. Generalizability of psychometric findings is thus limited by both demographic underreporting and population homogeneity. Given substantial evidence of the influence of age and psychosocial factors on individuals’ experiences and reporting of both pain-related functional impairment and pain severity,76,77,80,81 there is a need for consensus on key study population demographic and clinical characteristics, more consistent reporting of these population characteristics within studies, and further research on how measures’ psychometric properties generalize or change across age ranges and psychosocial categories.

Our review was limited to studies that published results in English. We also excluded studies that evaluated non-English language versions of eligible scales. This decision was supported by evidence on the limited generalizability of self-report measures’ psychometric properties across languages and highlights the need for linguistic and cultural validation of pain measures.80,82 With respect to search strategy, our primary abstract search was limited to 2000 onward. We complemented this, however, by applying no date limits to hand-searches of included studies’ reference lists, other reviews, and expert/peer reviewer suggestions. Finally, our criteria may have excluded some studies of psychometric properties of measures developed and validated prior to the popularization of specifying chronicity and duration of pain. Researchers considering such pain measures will need to consider the relevance of past psychometric work in the context of current conceptual pain research, and of their planned studies’ objectives and target populations.

This focused evidence review had key elements of an evidence mapping approach: systematically surveying the psychometric literature on expert-identified pain measures, summarizing quantities of studies on key psychometric outcomes, and identifying research gaps and relevant challenges to data synthesis.83 We developed this approach to illuminate the research gaps and data synthesis challenges that became evident through systematic review. Ultimately, we found that primary psychometric research on these measures within chronic musculoskeletal pain populations was limited, with the most evidence on reviewed psychometric properties found for the ODI, RMDQ, SF-36 BPS, NRS, and VAS. Key challenges in current musculoskeletal pain measurement research include substantial variation in methods of estimating psychometric properties, defining chronic musculoskeletal pain, and reporting patient demographics. Findings indicate that further methods research is needed to validate patient-reported pain outcome measures in populations with chronic musculoskeletal pain.