Introduction

Parkinson’s disease (PD) is the second most prevalent neurodegenerative disease after Alzheimer’s disease. PD afflicts about one million Americans, or about 1% of the population over 60 years of age [1]. As a chronic and progressive disease, PD may impact a person’s physical, mental and social health. PD patients may experience impairments in mood (especially depression and anxiety), orthostatic hypotension and other autonomic symptoms, sleep disturbances, and impulse control disorders, indicating the likelihood of a broad impact of this disease on health [2, 3].

Health-related quality of life (HRQOL) conceptualizes how well an individual functions and feels about his/her life. It encompasses physical, mental, and social dimensions of health [4]. There are two main types of HRQOL measures: generic and disease-targeted instruments, which differ in their form, content, and intended purpose. Generic HRQOL measures enable comparisons across populations, regardless of whether they have a particular condition [5]. The 36-Item Short-Form Health Survey (SF-36) is the most widely used HRQOL survey instrument in the United States [5, 6]. Its reliability and construct validity have been supported in studies of a number of other patient populations. The SF-36 includes eight health concepts judged as the most affected by disease and treatment, selected from 40 concepts assessed in the Medical Outcome Study [6].

Disease-targeted measures for several neurological conditions, such as multiple sclerosis and epilepsy, may provide additional key content over generic measures, tapping domains of HRQOL important to persons with these conditions [7, 8]. The most widely used PD-targeted HRQOL measure is the Parkinson’s Disease Questionnaire (PDQ-39), developed first as a 65-item questionnaire piloted on 359 individuals with PD attending a neurology outpatient clinic [9]. After testing for basic acceptability and comprehension, the number of questionnaire items was reduced to 39 items by a factor analysis, distributed across eight scales. The PDQ-39 has proved to have satisfactory reliability and construct validity in relation to other measures but limited evidence of responsiveness [10].

Another PD-targeted HRQOL measure, the Parkinson’s Disease Quality of Life (PDQUALIF) scale, was initially developed and evaluated in a cross-sectional study of 233 outpatient clinic attendees with physician-confirmed idiopathic PD [11]. Movement disorder specialists ranked a list of 73 indicators relevant for quality of life (QOL) in PD, and the top 32 ranked indicators were included in the measure. More than any other PD-targeted measure, the PDQUALIF taps many non-motor symptoms of PD including fatigue, sleep, autonomic dysfunction, and sexual function.

Patient-reported outcome measures are increasingly recognized as important for longitudinal studies including clinical trials of new treatments (http://www.fda.gov/fdac/features/2006/606_patients.html), and a review of a range of disease-targeted measures found overall better ability to detect change in clinically relevant domains relative to generic measures [12]. Yet, disease-targeted measures require investment of resources to develop and evaluate relative to “off-the-shelf” existing generic measures in widespread use, such as the SF-36. Thus, it is critical to compare generic and disease-targeted measures on their responsiveness to change in HRQOL over time. To date, responsiveness indices (effect sizes) have been reported only for the eight scales of the PDQ-39, and in that study, only a few scales detected any effect [10].

Our goals were to compare these two PD-targeted HRQOL measures with the widely used SF-36 on responsiveness, construct validity, internal consistency reliability, and scaling assumptions. Because the PDQ-39 is the most widely used PD-targeted HRQOL measure, and because the PDQUALIF was specifically intended to tap not only motor but also non-motor aspects of QOL in PD, we selected for inclusion these two PD-targeted measures out of the small group of existing PD-targeted measures at the time the study began [13]. We hypothesized that reliability would be comparable but that the PD-targeted measures would have better construct validity and responsiveness than the generic SF-36.

Methods

Sample

A convenience sample of patients who were 18 years old or older and English-speaking was recruited from the Greater Los Angeles VA Healthcare System Movement Disorders Clinic and from the University of California Los Angeles (UCLA) Movement Disorders Clinic. At UCLA, study flyers were handed to PD patients by their movement disorder physician at the time of the patient’s visit. At the time of check-out from a regular appointment in the VA Movement Disorders Clinic, patients with PD were informed of the study and offered the flyer. Recruiting clinicians and staff were asked to provide information about the study only to patients without diagnosed dementia. At both sites, the flyer contained information on how to contact the study team through a toll-free telephone number. If the patient expressed interest at check-out, the clinic clerk requested approval from the patient for the research team to initiate contact with the patient. Ninety-six patients provided verbal informed consent, were enrolled, and completed the baseline telephone interview. The study was approved by the Institutional Review Boards at the VA Greater Los Angeles Healthcare System (project number PD1-01-158-1), and at UCLA (approval number G050405204).

Study design

The baseline telephone interviews took place between March 2005 and February 2006, and the follow-up interviews between December 2006 and March 2007. The interval from the baseline to the follow-up telephone interview had a mean of 17.9 months (range = 11.1–24 months), a median of 17.9 months, and a standard deviation of 4.2 months. Measures were administered in the same way at both baseline and follow-up to avoid differential effects due to mode of administration [14].

Measures

Generic-HRQOL measure

The SF-36 (version 1.0) has 36 items covering eight scales: Physical Functioning, Role Limitations due to Physical Health, Role Limitations due to Emotional Problems, Pain, Emotional Well-Being, Energy, General Health, and Social Functioning. A Physical Health Composite score (PCS) and a Mental Health Composite score (MCS) can be derived from the SF-36 scales. The SF-36 is most commonly self-administered by mail survey or administered by telephone interview [15, 16].

PD-targeted HRQOL measures

The PDQ-39 has 39 items covering eight scales: Mobility, Activities of Daily Life, Emotional Well-Being, Stigma, Social Support, Cognitions, Communication, and Bodily Discomfort [9]. An overall score is constructed as the average of the eight scale scores. The PDQ-39 has been administered by telephone, with comparable levels of missingness, reliability, and construct validity to self-administration [3]. The PDQUALIF has 33 items covering seven scales: Social and Role Function; Self Image and Sexuality; Sleep; Outlook; Physical Functioning; Independence; and Urinary Function [11]. An overall score is the average of the seven scale scores.

In this study, all the PD-targeted HRQOL scales were scored on a 0–100 possible range with 0 representing the worst possible score and 100 the best possible score.

Criterion variables for evaluating validity of HRQOL measures

We used four criterion variables to assess validity of the HRQOL measures.

Criterion variable #1: “How PD affects you on a day-to-day basis?” This single item global rating of difficulty with day-to-day activities was developed specifically for PD based on interviews with PD specialist clinicians, patients, caregivers, and on a literature review; it has support for construct validity in terms of anticipated associations with depression, cognition, and PD severity in a community-based PD sample [17]. Subjects are asked to indicate one choice that “best describes how your Parkinson’s disease has affected your day-to-day activities in the last month:” (1) no difficulties, (2) mild difficulties, (3) moderate difficulties, (4) high levels of difficulties, or (5) extreme difficulties. Each choice is followed by a detailed example.

Criterion variable #2: “Current rating of overall QOL on scale of 1 to 10.” Subjects chose an integer between 1 (worst possible QOL, as bad as or worse than being dead) and 10 (the best possible QOL). This variable was adapted from other measures [18].

Criterion variable #3: “Rating of PD symptoms in the past 6 months”. In order to assess symptom severity, subjects were asked to rate the severity of their symptoms as (1) no symptoms, (2) mild symptoms, (3) moderate symptoms, and (4) severe symptoms.

Criterion variable #4: “Patient Health Questionnaire (PHQ)-9 Scoring Categories”. This nine-item self-rated depression screener maps directly onto the Diagnostic and Statistical Manual-IV (DSM-IV) criteria for major depression [19]. It has been evaluated in large studies of primary care patients and used in a recent large trial of depression care in the elderly [20]. Three categories can be derived based on responses to the nine items: (1) depression treatment may not be needed, (2) clinical judgment about treatment is needed, based on duration of symptoms and functional impairment, and (3) treatment for depression is warranted.

We hypothesized that the PD-targeted HRQOL measures would be more highly associated than the SF-36 with the two criterion variables that elicited ratings of day-to-day difficulties with PD (criterion variable #1) and PD symptoms (criterion variable #3). Because the two PD-targeted HRQOL measures each had one summary score and the SF-36 had separate physical and mental health composite scores, we hypothesized that the SF-36 mental health composite score would be more highly associated with the PHQ-9 (criterion variable #4) than the other three summary scores. We had no a priori hypotheses about how the summary scores of the PD-targeted versus generic measures would relate to the global QOL rating (criterion variable #2), nor did we make any formal a priori hypotheses about the relationship between individual scale scores on any measure and the four criterion variables.

Socio-demographic and clinical characteristics included gender, age, race/ethnicity, marital status, education, and employment. We also collected self-reported Activities of Daily Living (ADL) via the Unified Parkinson’s Disease Rating Scale (UPDRS) [21].

Data collection

Telephone interviews were administered by trained research assistants who followed protocols for quality of data collection by interview. Participants were paid $10 for each interview. The research assistant obtained verbal consent over the telephone. Data were directly entered into an electronic spreadsheet. Reasons for non-response among the 38 non-respondents (39.6% non-response) at the follow-up interview 1–2 years later included: unreachable despite multiple attempts by phone (n = 19), phone number disconnected (n = 6), asked to not be contacted again after the first survey (n = 4), declined (n = 4), unable to participate because of stroke/dementia (n = 3), deceased (n = 1), and other (n = 1).

Data analysis

Data were analyzed using SAS® software, version 9.1 (SAS Institute, Cary, NC).

Mean scores, standard deviations, ranges, and percentages of patients scoring the minimum = 0 (floor), and maximum = 100 (ceiling) possible scores were examined. Internal consistency reliability of each multi-item scale was assessed using Cronbach’s alpha [22]. Reliability of composite scores was estimated using Mosier’s formula [23]. We categorized scales as reliable if Cronbach’s alpha was greater than or equal to 0.70, a widely used threshold for adequate reliability in group comparisons [24, 25].
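As an illustration of the descriptive and reliability statistics described above, the following is a minimal Python sketch (the study itself used SAS). The function names `cronbach_alpha` and `floor_ceiling` and the data layout (one list of item scores per respondent) are assumptions made for this example, not the authors' code.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for one multi-item scale.

    rows: list of respondents, each a list of k item scores (hypothetical layout).
    """
    k = len(rows[0])
    # Sample variance of each item across respondents.
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    # Sample variance of the summed scale score.
    total_var = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


def floor_ceiling(scores, lo=0, hi=100):
    """Percent of respondents at the minimum (floor) and maximum (ceiling) score."""
    n = len(scores)
    floor_pct = 100 * sum(s == lo for s in scores) / n
    ceiling_pct = 100 * sum(s == hi for s in scores) / n
    return floor_pct, ceiling_pct


def is_reliable(alpha, threshold=0.70):
    """Apply the 0.70 threshold for group comparisons stated in the text."""
    return alpha >= threshold
```

In this sketch a scale whose items move together perfectly yields an alpha of 1, and a scale on which half the sample scores 0 has a 50% floor effect.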

Relative validity is reported as the ratio of the F-statistic of each scale of the three HRQOL measures to the F-statistic of a designated reference scale, usually the scale with the smallest F-statistic among the scales of the three HRQOL measures [26]. For a given criterion variable, the scale with the highest F-ratio is thus most sensitive to differences across categories of that criterion variable; for a fixed level of power, relative validity (F-ratio) values “are equivalent to the ratio of sample sizes that would be required to detect the known group difference using one measure versus the other” [27]. For each of the four criterion variables (baseline distributions in Table 1 and Appendix Table 7), we used analysis of variance (ANOVA) based F-statistics to compare mean HRQOL scale scores across patient groups defined by each patient’s level on that criterion variable [27].
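The relative validity calculation can be sketched as follows: a one-way ANOVA F-statistic is computed for each scale across the criterion groups, and each F is divided by the reference F. This is an illustrative Python sketch under the assumption that the reference defaults to the smallest F; `anova_f` and `relative_validity` are hypothetical names, and the numbers in the usage note are made up.

```python
from statistics import mean


def anova_f(groups):
    """One-way ANOVA F-statistic; groups is a list of lists of scale scores."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    k, n = len(groups), len(all_scores)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def relative_validity(f_by_scale, reference=None):
    """Divide each scale's F by the reference scale's F.

    If no reference scale is named, the smallest F is used (assumption
    matching the 'usually the smallest F-statistic' wording in the text).
    """
    ref = f_by_scale[reference] if reference else min(f_by_scale.values())
    return {scale: f / ref for scale, f in f_by_scale.items()}
```

For example, with made-up F-statistics `{"A": 13.0, "B": 2.0, "C": 26.0}`, scale B becomes the reference (relative validity 1.0) and scale C has a relative validity of 13.0.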

Table 1 Sample characteristics

Responsiveness was assessed using standard methods [28]. We examined the decline in responses between the baseline and the follow-up interviews for criterion variable #1 (“How PD affects you on a day-to-day basis”) and for criterion variable #2 (“Current rating of overall QOL”). We excluded responses from subjects who “improved” rather than combining them with “declined” because prior studies suggest that the magnitude of responsiveness differs between these two groups and is higher among those who “declined” [29], and because PD is a disease of progressive decline [29]. We examined the distribution of the change in responses and used the existing literature and clinical judgment to set the threshold for a change in each criterion variable. For criterion variable #1 of how PD affects you daily, each of the five response choices was developed as clinically distinct and meaningful [17]; thus, we set the threshold for true change at a change of at least one level. For criterion variable #2 of overall QOL, there are 10 response choices with anchors only at each extreme; we set the threshold for true change at a change of at least two levels based on our judgment. These thresholds were selected a priori. Unchanged was defined as responses on the second interview that did not meet the threshold for a true change from the first interview. The three most widely used responsiveness indices were calculated: effect size (ES), standardized response mean (SRM), and the Guyatt responsiveness statistic (GRS) [30]. For these indices, the numerator is the mean change in scale score for the declined group. The denominators are the standard deviation of the baseline scale score of the declined group (ES), the standard deviation of change in scale score for the declined group (SRM), and the standard deviation of change in scale score for the unchanged group (GRS) [27]. Because each of these indices looks at change for the declined group, we supplemented them by computing the F-statistic for the difference in change scores between the declined and unchanged groups. We categorized ES as large (greater than or equal to 0.80), medium (between 0.50 and 0.79), small (between 0.20 and 0.49), and not detectable (less than 0.20) according to well-known published benchmarks [31] and focused on ES in our interpretation, because such established benchmarks exist. (There is one published report providing regression equations linking different responsiveness indices [32].)
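The three responsiveness indices share a numerator and differ only in the denominator, which a short Python sketch makes concrete. The function names and argument layout are assumptions for illustration, and the sketch applies the benchmarks to the absolute value of ES on the assumption that a decline produces a negative ES.

```python
from statistics import mean, stdev


def responsiveness(base_declined, follow_declined, base_unchanged, follow_unchanged):
    """Return ES, SRM, and GRS for one scale.

    Each argument is a list of scale scores, paired by position within group.
    """
    change_declined = [f - b for b, f in zip(base_declined, follow_declined)]
    change_unchanged = [f - b for b, f in zip(base_unchanged, follow_unchanged)]
    mean_change = mean(change_declined)  # shared numerator
    return {
        "ES": mean_change / stdev(base_declined),      # SD of baseline, declined group
        "SRM": mean_change / stdev(change_declined),   # SD of change, declined group
        "GRS": mean_change / stdev(change_unchanged),  # SD of change, unchanged group
    }


def es_category(es):
    """Classify an effect size by magnitude using the benchmarks in the text."""
    size = abs(es)
    if size >= 0.80:
        return "large"
    if size >= 0.50:
        return "medium"
    if size >= 0.20:
        return "small"
    return "not detectable"
```

Under this classification, an ES of −0.86 falls in the "large" category.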

We used multivariate models to determine whether PD-targeted measures captured important HRQOL content beyond the SF-36. Each of the four criterion variables served as the dependent variable in a multivariate model, and the eight SF-36 scales served as the independent variables (Model 1). We then forced in the SF-36 scales that were significant at P < 0.10 in Model 1 and allowed items from the two PD-targeted HRQOL measures to enter at P < 0.05 (Model 2), using stepwise regression. We compared the improvement in adjusted R² from Model 1 to Model 2 for each of the criterion variables.
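For readers less familiar with the comparison metric, adjusted R² penalizes R² for the number of predictors, so adding PD-targeted items in Model 2 raises it only if they explain enough extra variance. The Python sketch below uses the standard formula; the function names are hypothetical and the inputs in the usage note are invented, not values from this study.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for a model with p predictors fit to n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


def adjusted_r2_gain(r2_model1, p1, r2_model2, p2, n):
    """Improvement in adjusted R-squared from Model 1 to Model 2."""
    return adjusted_r2(r2_model2, n, p2) - adjusted_r2(r2_model1, n, p1)
```

For instance, `adjusted_r2(0.5, 96, 3)` is slightly below 0.5 because of the penalty for three predictors.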

To evaluate the original scoring of the two PD-targeted measures, we used baseline data from all 96 subjects to estimate item-scale correlations from multitrait scaling analyses [33]. We computed product-moment correlations between items and scales, correcting for the overlap of the item with the scale where applicable. We inspected the correlation matrix for potential lack of item discrimination across scales by highlighting those correlations with other scales that were within ½ standard error below, or any amount above, the correlation of the item with the scale in which it was placed.
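The item discrimination check can be sketched in Python as below. Two assumptions are made for illustration and are not taken from the source: scale scores are simple sums of their items, and the standard error of a correlation is approximated as 1/√n (about 0.10 for n = 96, so half of it is about 0.05). The names `pearson`, `scale_score`, and `flag_items` are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def scale_score(data, items):
    """Sum the named items for each respondent (assumed scoring rule)."""
    n = len(data[items[0]])
    return [sum(data[i][r] for i in items) for r in range(n)]


def flag_items(data, scales):
    """Map each poorly discriminating item to the competing scales.

    data: dict of item name -> list of scores; scales: dict of scale -> item names.
    """
    n = len(next(iter(data.values())))
    margin = 0.5 / sqrt(n)  # half of the approximate standard error
    flagged = {}
    for own, items in scales.items():
        for item in items:
            # Correct for overlap: drop the item from its own scale score.
            rest = [i for i in items if i != item]
            r_own = pearson(data[item], scale_score(data, rest))
            rivals = [
                other for other, other_items in scales.items()
                if other != own
                and pearson(data[item], scale_score(data, other_items)) >= r_own - margin
            ]
            if rivals:
                flagged[item] = rivals
    return flagged
```

An item is flagged when its correlation with any other scale comes within the margin of, or exceeds, its overlap-corrected correlation with its own scale.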

Results

Sample characteristics

The mean age of the 96 enrolled PD patients was 72 years, and 88% were white. More than three-quarters (84%) were male (see Table 1). Sixty-five percent were currently married; 63% held a Bachelor’s degree or higher; 66% were retired and not working; and 58% reported moderate or severe PD symptoms.

For criterion variable #1 (How PD affects you on a day-to-day basis), 54% reported moderate, high, or extreme difficulties. For criterion variable #2 (rating of overall QOL), 61% rated their QOL as 7 or higher on the 1–10 QOL scale.

We compared characteristics of the 58 participants who completed the follow-up interview with those of the 38 who did not (all 96 participants had baseline data). The only significant difference was ethnicity, with a higher proportion of white participants among those who completed follow-up.

Descriptive statistics and reliability

There were noteworthy floor effects for the SF-36 Role Limitations due to Physical Health scale (51% of the sample scored the possible minimum) and ceiling effects for the SF-36 Role Limitations due to Emotional Problems scale (75% of the sample scored the possible maximum, see Table 2). On the PDQ-39, there was a ceiling effect for the Social Support scale (54% of the sample scored the possible maximum). On the PDQUALIF, there was a substantial ceiling effect for the Independence scale (60% of the sample scored the possible maximum).

Table 2 Descriptive statistics and reliability of HRQOL scales (N = 96)a

Internal consistency reliability was satisfactory for all the eight SF-36 scales (Cronbach’s alpha > 0.70). However, coefficient alpha for three of the seven PDQUALIF scales (Physical Function, Outlook, and Sleep scales) fell below the 0.70 threshold for adequate reliability to make group comparisons (ranging from 0.52 to 0.60). Likewise, alphas for two of eight PDQ-39 scales (Cognition and Bodily Discomfort scales) were 0.59 and 0.68.

Relative validity

Criterion variable #1: how PD affects you on a day-to-day basis

Ten patients had no difficulties, 34 reported mild difficulties, 38 patients had moderate difficulties and 14 patients had high levels or extreme difficulties (see Table 3). The PDQUALIF Social/Role Function scale and the PDQ-39 Mobility scale had the highest relative validity (13.3 and 11.7). The level of discrimination across the four categories of the criterion variable was higher for the overall score of the two PD-targeted measures (relative validity = 9.3 for the PDQUALIF, and 10.6 for the PDQ-39) compared to either composite score of the SF-36 (relative validity = 5.5 for SF-36 PCS, 2.9 for SF-36 MCS).

Table 3 Relative validity of HRQOL scales by how PD affects on day-to-day basis rating (N = 96)a

Criterion variable #2: rating of overall QOL

Twenty-eight patients who rated their overall QOL as 8, 9, or 10 were combined into one group, 42 patients whose ratings were 6 or 7 were combined into another group, and 26 patients whose ratings were 1–5 were combined into a third group (see Table 4). The highest relative validity was observed for the SF-36 Emotional Well Being scale (relative validity = 15.44). The SF-36 MCS had higher relative validity than the overall scores of the two PD-targeted HRQOL measures.

Table 4 Relative validity of HRQOL scales by quality of life rating (N = 96)a

Similarly, for the other two criterion variables, the overall scores of the two PD-targeted measures did not perform appreciably better than the SF-36 PCS and MCS. (See Appendix Tables 8 and 9 for details.)

Responsiveness of HRQOL measures

For criterion variable #1 (“How PD affects you on a day-to-day basis”), 20 patients reported at least one level of worsening on the second interview and were categorized as “declined” and 23 patients were categorized as “unchanged.” The highest ES for any overall or composite score was for the SF-36 PCS (ES = −0.86), corresponding to a large ES (see Table 5).

Table 5 Responsiveness indices: declined and unchanged groups based on “How Parkinson’s disease affects you on a day-to-day basis?”a

For criterion variable #2 (rating of overall QOL), 16 patients reported at least two levels of worsening on the second interview and were categorized as “declined” versus 35 patients who rated within one point of baseline and were categorized as “unchanged” (see Table 6). The SF-36 MCS (ES = −1.06) had the highest ES, again corresponding to a large ES.

Table 6 Responsiveness indices: declined and unchanged groups based on “On a scale of 1 to 10, where 10 is best possible quality of life and 1 is the worst possible quality of life (as bad or worse than being dead) overall, how would you rate your quality of life?”a

We found three of the eight SF-36 scales had a large ES for each criterion variable examined. In contrast, only the PDQUALIF Social/Role Function scale had a large ES for the criterion variable on PD’s day-to-day effects; the ES for the other six PDQUALIF and eight PDQ-39 scales were not as large for either criterion variable.

Contributions of PD-targeted and generic HRQOL content to explaining criterion variables

Using criterion variable #1 (PD’s day-to-day effects) as the dependent variable, the following three SF-36 scales entered the multivariate model at P < 0.10: Social Functioning, Physical Functioning, and Role Limitations due to Physical Health (Model 1). In Model 2, the following three PDQUALIF items entered the model at P < 0.05 after forcing in the above three SF-36 scales: Financial strain (Self-Image/Sexuality scale), Adjust to change (Social/Role Function scale), and Sleep with partner (Sleep scale); the following two PDQ-39 items also entered: Getting around house (Mobility scale) and Memory (Cognition scale). The adjusted R² improved from 0.48 in Model 1 to 0.65 in Model 2.

Using criterion variable #2 (rating of overall QOL), the following four SF-36 scales entered the model at P < 0.10: Role Limitations due to Physical Health, General Health, Role Limitations due to Emotional Problems, and Emotional Well-Being (Model 1). Then in Model 2, the following four PDQUALIF items entered at P < 0.05 after forcing in the above four SF-36 scales: Sexual ability (Self-Image/Sexuality scale), Future and Ask for help (both from the Outlook scale), and Independent hygiene (Independence scale); the following three PDQ-39 items also entered: Confined to the house (Mobility scale), Isolated and lonely (Emotional Well-Being scale), and Concentration (Cognition scale). The adjusted R² improved from 0.45 in Model 1 to 0.65 in Model 2. (See Appendix Table 10 for stepwise results for these two criterion variables and also for criterion variables #3 and #4.)

Evaluation of scoring of PD-targeted measures

Item-scale correlations using baseline data from the sample of 96 enrollees revealed that 18 of 32 PDQUALIF items correlated with other scales either more highly than, or within 0.05 (one-half the standard error in this dataset) below, their correlation with the scale they were supposed to represent. Particularly problematic were the Outlook and Physical Functioning scales: three of the four Outlook items correlated more highly with another scale than with Outlook, and all five Physical Functioning items correlated more highly with another scale than with the Physical Functioning scale (see Appendix Table 11).

Item-scale correlations for five of the eight PDQ-39 scales in general provided support for the arrangement of items by scale using our criteria for item discrimination across scales. However, all three items from the Bodily Discomfort scale, two out of the four items from the Cognition scale, and one out of the three items from the Communication scale loaded similarly or more highly on other scales than on the scale in which they are placed (see Appendix Table 12).

Discussion

We analyzed and compared the psychometric properties of a widely used generic measure of HRQOL, the SF-36, and two PD-targeted measures, the PDQ-39 and the PDQUALIF. While relative validity was somewhat better for the PD-targeted measures than the SF-36 on criterion variables that asked specifically about activities limited by PD or about PD symptoms, we found greater support for the responsiveness of the SF-36 than for the PD-targeted measures on both external criterion variables, including the variable on difficulties in day-to-day activities due to PD. Despite better responsiveness of the generic measure, however, multivariate regression models showed that items from the PD-targeted measures tap into additional HRQOL content not covered by the SF-36 scales. An analysis of the PD-targeted measures revealed multiple problems with items correlating as highly or more highly with other scales than with the scale they were intended to represent, potentially accounting for the unanticipated finding of superior responsiveness of the SF-36 compared to the PD-targeted measures.

Few studies have compared the psychometric properties of the SF-36 with a PD-targeted measure. The responsiveness of the SF-36 and PDQ-39 was tested among 132 PD patients by administering both measures at baseline and at 4 months and asking a criterion question of whether there was change in the effect of PD on everyday life [10]. In that study, none of the PDQ-39 scales had a large ES, the PDQ-39 Mobility scale showed a medium ES, the ADL and Social Support scales had a small ES, and the other five PDQ-39 scales did not detect any change. The three PDQUALIF scales that had less than adequate internal consistency reliability in our study also did not perform well in the original study introducing the PDQUALIF (in the original study, a fourth scale, Urinary Function, also had Cronbach’s alpha below 0.7) [11]. Using Hoehn and Yahr stage as a criterion variable for their relative validity analysis, the authors of that study found that the F-statistic for the overall PDQUALIF (10.8) was only a bit higher than that for the SF-36 PCS (9.1) though considerably higher than that for the SF-36 MCS (1.9). The poor performance of the SF-36 MCS is likely due to Hoehn and Yahr’s emphasis on balance and mobility. When we used the PHQ-9 as a criterion variable in our study, the SF-36 MCS outperformed overall scores of the two PD-targeted HRQOL measures, as would be anticipated given that depression is a stronger component of mental health than physical health. Emphasis of some criterion variables on certain aspects of HRQOL was also observed in our study with respect to the PD day-to-day activities criterion variable, for which relative validity and responsiveness were stronger for physical and social functioning scales of all the measures than for scales tapping mental health.

The following limitations to our study should be noted. We recruited a convenience sample of 96 PD patients, and the portion of our analyses involving longitudinal data (responsiveness) was based on the subset (60% of the enrolled sample) for whom we were able to collect follow-up data. Some of the sample sizes for subgroups in the responsiveness analyses were relatively small, and we recommend that our findings with regard to responsiveness be confirmed in samples having larger subgroups who changed. Power to detect a difference would have been increased with a larger sample size; for example, we observed an almost significant F-statistic of 3.52 for the responsiveness of the SF-36 physical health summary score in Table 6. With a larger sample size we might have found this test statistic to be statistically significant.

There was a higher proportion of men than in the general PD population because about half of this study’s sample was recruited from a VA clinic. Another potential limitation is that criterion variables were all self-reported, and it would have been useful to also include a clinical measure such as the motor UPDRS, an examination recorded by a trained clinician, or the Hoehn and Yahr stage. While we administered all the measures using the same modality at different points in time, the adequacy of telephone administration of the PDQUALIF is unknown.

The results of this study suggest that both generic and disease-targeted measures contribute important information about HRQOL. In the future, both generic and disease-targeted items tapping the same domain could be included together in an item bank and administered using computer adaptive testing [34].

Conclusion

A comparison of the psychometric properties between a generic and two PD-targeted HRQOL measures provides evidence for superior or equivalent responsiveness of the generic HRQOL measure over the PD-targeted HRQOL measures. However, the PD-targeted measures account for additional content beyond the generic HRQOL measure alone. The empirical findings related to lack of superior responsiveness of the PD-targeted measures relative to the SF-36 may in part be explained by inadequate scaling of the original PD-targeted measures.

The findings of this study provide support for use of a combination of generic and disease-targeted HRQOL measures in future research.