Background

Owing to advances in therapy and increased use of preventative medicine, more than ninety percent of individuals with sickle cell disease (SCD) now grow out of pediatric care. However, their optimal functioning and well-being may be compromised by chronic comorbid conditions including multi-organ failure, pain and neurocognitive deficits [1]-[3]. Social and economic challenges are common, as are barriers to accessing quality healthcare [1],[4]. There is thus a need for skilled adult-oriented health care providers to deliver sickle cell care and a corresponding need to improve understanding of the long-term needs of adults with SCD.

Beginning in 2002, the National Heart, Lung and Blood Institute (NHLBI) conducted a series of conferences and workshops to determine ways to improve treatment for adults living with SCD [5]. Stakeholders communicated the need for a systematic, reliable and valid method for documenting adult patient-reported outcomes (PRO) of care. We developed the Adult Sickle Cell Quality of Life Measurement Information System, or ASCQ-Me (pronounced “Ask me”), to meet this need. In this article we detail the statistical methods used to identify the most precise measures to include in ASCQ-Me item banks and continue this introduction by summarizing the formative research upon which the current work was based (detailed in Treadwell, Hassell, Levine and Keller, 2013 [6]).

Formation of the conceptual framework for ASCQ-Me

Because the epidemiology and psychosocial functioning of adults with SCD are understudied and not well documented, we conducted a comprehensive program of formative research including a systematic review of the literature. We also conducted detailed, structured interviews with 123 adults with SCD and 15 sickle cell experts who varied in geographic location, age and gender. On the basis of these data, we created an inclusive taxonomy of 140 life areas that were affected by SCD and summarized these in a model of relationships between the signs and symptoms of SCD and adult life experiences [6].

Grounded item technique (GIT)

We “grounded” ASCQ-Me questions in actual events in the lives of the interview participants through a rigorous analytic protocol applied to the audio-taped group or individual interviews [7],[8]. The content based on GIT was supplemented, when required, by questions based on legacy measures. For example, SCD providers and clinical investigators advocated for the importance of including questions on cognitive functioning as part of ASCQ-Me; yet no content related to attention, memory, language or problem solving emerged from the GIT interviews. So we based the draft cognitive functioning items for ASCQ-Me on the careful, thoroughly documented work of the Medical Outcomes Study [9].

ASCQ-Me questions were written in plain language comprehensible at the 6th grade level or below. We avoided the use of clauses, and restricted the item content to one topic. To promote standardization, we used item formats comparable to the cross-National Institutes of Health (NIH) generic PRO measure development effort, the Patient-Reported Outcomes Measurement Information System (PROMIS®) [10]. Experts in cognitive testing from the Cognitive Laboratory at the American Institutes for Research reviewed all of the items to identify those that should be rewritten. Those that could not be rewritten without destroying the intended meaning were submitted to cognitive testing with patients and, if they were correctly understood, they were retained (otherwise they were discarded). The field test version of ASCQ-Me was based on this set of culled items.

The purpose of the ASCQ-Me field test data collection was to identify the ASCQ-Me items that would be the most precise measures of their constructs for inclusion in item banks suitable for administration using computer adaptive assessment software, determine appropriate scoring algorithms for those items, and evaluate the reliability and validity of scores based on those algorithms.

Methods

Participants

ASCQ-Me field test data were collected at seven geographically diverse sites of care with the assistance of site coordinators trained in a standardized study protocol. Four methods were used to recruit participants: some were invited to participate by personnel at each clinic; others responded to flyers posted in or near the clinics; some responded to a posting on the website for the Sickle Cell Disease Association of America; and still others were recruited by participants who had already taken the assessments. Prior to enrollment, potential participants were administered a short screener. To be eligible for the data collection, participants were required to be 18 years of age or older and to be diagnosed with sickle cell disease. People were excluded if they were younger than 18, did not have a diagnosis of SCD, had a diagnosis of sickle cell trait, or could not read English. Based on previous research findings [11], the targeted enrollment across sites was set to obtain sufficient sample size for the psychometric analyses (500 patients) assuming a ten-percent rate of no-shows. Thus we targeted 550 patients with diversity in terms of age and gender.

Measures

Four item sets were developed: cognitive impact (28 items), emotional impact (28 items), social functioning impact (28 items) and physical impact (56 items). In addition, five items were developed for fixed format administration: pain episode severity (3 items) and frequency (2 items). Examples of the questions are presented in Table 1 below.

Table 1 ASCQ-Me item sets (Health topics) and example questions

We sought to include a measure of SCD severity to evaluate the ability of ASCQ-Me measures to reflect differences in groups of people who differed in the extent of their disease. The challenge for us was that there is no consensus method for assessing SCD severity. SCD is characterized by the type of mutations to the pair of beta-hemoglobin (Hb) genes. Variations include Hb-SS, Hb-SC and Hb-Sβ [11],[12], and individuals with Hb-SS usually, but not always, have more symptoms than those with other genotypes [13]-[15]. The variation of symptoms and sequelae within genotypes is so broad that genotype cannot serve as a reliable indicator of disease severity [16]-[19]. Incidence and frequency of hospitalizations for vaso-occlusive incidents have been used as a marker of disease severity [20]-[23]; however, data indicate that a large percentage of patients who suffer from extreme pain never go to the hospital [24]-[26].

Nevertheless, adult sickle cell providers seeing a patient for the first time ask that patient a set of questions to gauge the severity of his or her disease. Blood transfusions and daily use of pain medicine are types of health care utilization associated with severity of SCD. Complications of SCD include asplenia, retinopathy, avascular necrosis, leg ulcers, kidney disease, stroke, and pulmonary hypertension. A medical history characterized by prescription pain medication, blood transfusions and a number of these diagnoses in a patient presenting with SCD could indicate severe disease [27]-[32]. In the absence of a consensus method for determining severity, we reasoned that a method which mimicked the clinical interview in content would identify patients who differed in the amount of SCD-related damage caused by their sickle cell disease and that this could serve as a surrogate marker of disease severity.

Following the logic outlined above, we included a checklist of seven conditions usually secondary to SCD and two treatments indicative of severity as part of the data collection. For convenience, we refer to this indicator here as the SCD Medical History Checklist (SCD-MHC). Along with the checklist, we included 10 global health items from the PROMIS® and 12 comorbidity checklist questions unrelated to SCD taken from the PROMIS® comorbidity questions (e.g. diabetes, pregnancy, hepatitis-C, HIV/AIDS). We expected the SCD-MHC to be related to the PROMIS® globals as well as to reported health care utilization and pain episodes, but not to be related to the PROMIS® comorbidity questions, which referred to conditions not associated with SCD.

Data collection procedure

Patients signed a consent form after they arrived at one of seven geographically-dispersed sites of care participating in the ASCQ-Me field test. They were then seated at a computer and a site coordinator helped them to log onto the ASCQ-Me website. The coordinator then entered the SCD type of the respondent and assisted the respondent in reviewing a tutorial that demonstrated how to operate the mouse, select responses to questions, monitor progress toward completion, and take a break. Respondents proceeded to complete the survey on their own following the tutorial. The survey took about 55 minutes to complete on average. Respondents received an honorarium for their participation.

Analytic methods

Prior to applying statistical models to the data, we evaluated data quality by calculating the percent of missing data for each question and flagging participants who responded in less than one second to any question (on the assumption that they were not reading the questions; see van der Linden & Krimpen-Stoop, 2003 [33]). We also evaluated the plausibility of the SCD-MHC as a measure of severity by examining its relationship to several variables which, based on previous research, would be expected to be related to differences in health, including age, frequency and severity of vaso-occlusive episodes, frequency of emergency room visits, and the PROMIS® global ratings of health. The SCD-MHC was scored as the sum of the questions that were endorsed, as has been the method employed in previous research with such checklists [33]-[36], and as supported by research showing negligible differences between unit and alternative weighting methods for the scoring of checklists [37],[38].
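
The data quality checks and the unit-weighted checklist score described above can be sketched as follows. This is a minimal illustration only, not the code used in the study; the function names, the data layout (lists of responses and response times keyed by a participant ID) and the use of None for a skipped question are assumptions made for the example.

```python
from typing import Dict, List, Optional


def percent_missing(responses: List[Optional[int]]) -> float:
    """Percent of respondents who left one question unanswered (None)."""
    if not responses:
        return 0.0
    missing = sum(1 for r in responses if r is None)
    return 100.0 * missing / len(responses)


def flag_fast_responders(times: Dict[str, List[float]],
                         threshold_seconds: float = 1.0) -> List[str]:
    """IDs of participants who answered any question faster than the threshold."""
    return [pid for pid, ts in times.items()
            if any(t < threshold_seconds for t in ts)]


def checklist_score(endorsed: List[bool]) -> int:
    """Unit-weighted score: the count of endorsed checklist items."""
    return sum(1 for e in endorsed if e)


# Hypothetical example: nine checklist items, two endorsed.
example_items = [True, False, True] + [False] * 6
example_score = checklist_score(example_items)
```

Unit weighting simply counts endorsements, so every condition or treatment on the checklist contributes equally to the score.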

Factor analysis

Within each of the four item sets (i.e., cognitive impact, emotional impact, physical impact, and social functioning impact), we conducted confirmatory factor analysis (CFA) using structural equation modeling (SEM) and followed common current practice with regard to indications of model fit [39],[40]. We conducted exploratory factor analysis (EFA) with oblique rotation and identified the number of factors on the basis of a convergence among criteria such as the results of parallel analysis (PA) [41], a scree plot of the eigenvalues [42], the number of factors required to explain the majority of the variance, and the simplicity of the factor pattern [43]. All factor analyses were conducted on polychoric correlation matrices and used maximum likelihood estimation.

Multi-trait scaling analysis

The Multi-trait Analysis Program [44],[45] was used to: (1) calculate the correlation of an item with its scale, correcting for overlap [46]; (2) compare that correlation to its correlation with all other scales in the analysis; (3) estimate the internal consistency reliability coefficients for each hypothesized scale; and (4) calculate the percentage of respondents with the highest and lowest scores possible, respectively, on each scale.
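
Two of the quantities listed above, the item-scale correlation corrected for overlap and the internal consistency reliability coefficient (Cronbach's alpha), can be sketched in a few lines. This is an illustrative sketch rather than the Multi-trait Analysis Program itself; the function names and the data layout (one row of item scores per respondent, no missing values) are assumptions.

```python
from statistics import mean, variance
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; assumes both inputs have non-zero variance."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def corrected_item_total(data: List[List[float]], item: int) -> float:
    """Correlate one item with the sum of the OTHER items in its scale,
    so that the item does not inflate its own correlation (overlap correction)."""
    item_scores = [row[item] for row in data]
    rest_scores = [sum(row) - row[item] for row in data]
    return pearson(item_scores, rest_scores)


def cronbach_alpha(data: List[List[float]]) -> float:
    """Cronbach's alpha from item variances and the variance of the total score."""
    k = len(data[0])
    item_vars = [variance([row[j] for row in data]) for j in range(k)]
    total_var = variance([sum(row) for row in data])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)
```

Comparing each item's corrected correlation with its own scale against its correlations with the other scales is what allows items with weak or misplaced relationships to be identified.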

Evaluation of unidimensionality

Most health question sets exhibit a certain amount of multidimensionality due to repeated use of common phrases within the question stem (e.g. “how many” or “how much”) or to content balancing within the health domain (e.g. including items targeting attention and memory within the cognitive functioning domain). Following Reise et al. (2007) [47], we conducted bi-factor analysis and considered an item set to be essentially unidimensional if any identified multidimensionality did not have consequences for the interpretation of the underlying concept (i.e. did not consequentially affect the relationship of items to theta). We specified two models of the relationship of items to latent trait(s) but used Item Response Theory (IRT) Graded Response Models (GRMs) [48] instead of the SEM approach used by Reise and colleagues [47]. The unidimensional model specified a single latent trait and no correlated errors, whereas in the bi-factor model specification each item was allowed to have a discrimination parameter on the general factor and on one of the group factors (identified earlier by the parallel analysis and EFA). Both the unidimensional and bi-factor IRT models were fitted using the IRT-PRO software [49]. As this was new software, we confirmed all analyses by conducting unidimensional and bi-factor CFAs using the SAS CALIS procedure. Essential unidimensionality was supported if the Pearson correlation between the vectors of discrimination parameters under the unidimensional model and the bi-factor model was high (e.g. > 0.90) and the root mean squared difference of discrimination parameters between the two models was comparatively low [50].
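
The two summary statistics used as the criterion for essential unidimensionality can be illustrated with a short sketch. It assumes that the discrimination parameters have already been estimated elsewhere (for example, exported from the IRT software) and that the bi-factor vector holds each item's discrimination on the general factor; both the function names and that reading of "the bi-factor model" are assumptions of this example.

```python
from statistics import mean
from typing import Sequence, Tuple


def compare_discriminations(uni: Sequence[float],
                            bifactor_general: Sequence[float]) -> Tuple[float, float]:
    """Return (Pearson r, root mean squared difference) between two vectors of
    item discrimination parameters, one entry per item, in the same item order."""
    if len(uni) != len(bifactor_general):
        raise ValueError("vectors must have one entry per item, in the same order")
    mu, mb = mean(uni), mean(bifactor_general)
    sxy = sum((u - mu) * (b - mb) for u, b in zip(uni, bifactor_general))
    sxx = sum((u - mu) ** 2 for u in uni)
    syy = sum((b - mb) ** 2 for b in bifactor_general)
    r = sxy / (sxx * syy) ** 0.5
    rmsd = mean((u - b) ** 2 for u, b in zip(uni, bifactor_general)) ** 0.5
    return r, rmsd


# Hypothetical discrimination estimates for three items.
r, rmsd = compare_discriminations([1.0, 2.0, 3.0], [1.1, 2.1, 3.1])
```

A high correlation shows that the items are ordered similarly by both models, while the root mean squared difference shows whether the parameter values themselves shift when the group factors are added.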

Item-level measurement bias

In an IRT framework, an item is defined as displaying measurement bias (i.e., differential item functioning) if the item response curves (i.e., item parameters) are not the same for the reference and focal groups [51]. We implemented a rank-based strategy as proposed by Woods [52] to select anchor items, followed by the IRT-based Wald test method [53]-[55] (denoted here as IRT-Wald), to detect differential item functioning (DIF) for each item. The sample was homogeneous with regard to race, and the education level of respondents exceeded the question vocabulary (6th grade level), so we did not evaluate DIF due to race or education; but we did evaluate DIF for age and gender.
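
The idea behind a Wald test for DIF can be shown in a deliberately simplified, single-parameter form: an item parameter is estimated separately in the reference and focal groups, and the squared difference is compared with its sampling variance. The IRT-Wald method cited above tests all of an item's parameters jointly using their covariance matrix, so the sketch below is a simplification for illustration only; the function name and the example values are hypothetical.

```python
import math
from typing import Tuple


def wald_dif_single_parameter(est_ref: float, se_ref: float,
                              est_focal: float, se_focal: float) -> Tuple[float, float]:
    """Simplified Wald test for one item parameter estimated in two groups.

    Returns (chi-square statistic with 1 degree of freedom, p-value).
    Assumes the two estimates are independent and on a common scale
    (i.e., already linked through anchor items).
    """
    diff = est_ref - est_focal
    stat = diff ** 2 / (se_ref ** 2 + se_focal ** 2)
    # Upper tail of a chi-square(1) variable: P(Z^2 > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical example: discrimination of 2.0 (SE 0.1) versus 1.0 (SE 0.1).
stat, p_value = wald_dif_single_parameter(2.0, 0.1, 1.0, 0.1)
```

A small p-value indicates that the parameter differs between groups by more than sampling error would explain, which is the signal used to flag an item for possible bias.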

IRT calibration

We fit the GRM, using the marginal maximum likelihood method [56], to estimate discrimination and location parameters for each item. These parameters describe, respectively, the strength of the item’s relationship to the latent trait and the position of the item on the trait continuum; together, they determine the information function of the item.
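
How discrimination and location parameters determine an item's information function can be illustrated with a small sketch of Samejima's graded response model. It evaluates the model for given parameter values and does not perform the marginal maximum likelihood estimation; the function names and parameter values are assumptions made for the example.

```python
import math
from typing import List


def _cumulative(theta: float, a: float, b: List[float]) -> List[float]:
    """P(response >= k) for k = 0..m, bounded by 1 at the bottom and 0 at the top."""
    return [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - bk))) for bk in b] + [0.0]


def grm_category_probs(theta: float, a: float, b: List[float]) -> List[float]:
    """Probability of each of the len(b) + 1 ordered response categories.
    `a` is the discrimination; `b` holds the location (threshold) parameters,
    which must be in increasing order."""
    cum = _cumulative(theta, a, b)
    return [cum[k] - cum[k + 1] for k in range(len(b) + 1)]


def grm_item_information(theta: float, a: float, b: List[float]) -> float:
    """Fisher information of the item at theta: sum over categories of
    (dP_k / dtheta)^2 / P_k."""
    cum = _cumulative(theta, a, b)
    # Derivative of each cumulative curve is a * P * (1 - P); it is 0 at the bounds.
    dcum = [a * p * (1.0 - p) for p in cum]
    info = 0.0
    for k in range(len(b) + 1):
        pk = cum[k] - cum[k + 1]
        dpk = dcum[k] - dcum[k + 1]
        if pk > 0.0:
            info += dpk ** 2 / pk
    return info


# Hypothetical five-category item.
probs = grm_category_probs(0.0, 1.8, [-1.5, -0.5, 0.5, 1.5])
info = grm_item_information(0.0, 1.8, [-1.5, -0.5, 0.5, 1.5])
```

Larger discrimination values raise the information curve, and the location parameters determine where along the trait continuum that information is concentrated.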

Validity

Evidence for the content validity of items and measures was provided by the GIT [7] and review of legacy measures [6]. The validity of each item as a measure of the underlying health domain was evaluated via CFA. In addition, we determined the ability of each measure to discriminate among levels of SCD severity indicated by the SCD-MHC.

Results

Respondent characteristics

With a total of 561 respondents, we exceeded our targeted number of 550 patients by 11. ASCQ-Me field test participants represented a range of ages and a mix of gender and genotype (see Table 2). During the past 12 months, one in five respondents reported two or more sickle cell related pain episodes and the majority indicated considerable suffering during these episodes. On a scale of 0 to 10 where 0 was no pain and 10 was the worst pain imaginable, the average rating was 8 and nearly 30% indicated that the pain was the “worst pain imaginable”. When asked to rate interference from pain during the last episode, 37% indicated needing help from family or friends or constant care from family, friends or health care providers. On the other hand, nearly one in three said that the pain caused minimal or no interference. The majority of respondents indicated that their most recent pain episode lasted more than a day, with nearly 50% saying that it lasted more than the better part of a week, and more than 20% suffered more than a week with their latest attack.

Table 2 Characteristics of survey respondents

Data quality

None of the participants was flagged for responding too quickly to the questions. Ten questions which addressed work functioning had high rates of missing data (>200 respondents), because a large portion of respondents were not employed. We were therefore not able to include these 10 questions in the analysis of the Social Functioning Impact measure.

To evaluate the suitability of the SCD-MHC as an index of SCD severity we examined its relationship to several variables which are related to differences in health including age [57], frequency and severity of vaso-occlusive incidents [58],[25], frequency of emergency room visits [59],[60] and the PROMIS® global ratings of health [61]. Table 3 shows the results of a series of general linear models in which the variables listed in the row headings were regressed onto the SCD-MHC or the checklist of 12 co-morbidities which are not secondary to SCD (pulled from the PROMIS® co-morbidity checklist). For convenience, we refer to the checklist of 12 co-morbidities that are not associated with SCD as the Non-SCD-MHC. The pattern of relationships described in Table 3 supports the validity of the SCD-MHC as an indicator of SCD severity for these patients because the relationships to other indicators of health are highly significant and consistent and cannot be explained on the basis of common method bias. As shown in Table 3, the SCD-MHC was related to four of five items measuring the reported frequency and severity of vaso-occlusive incidents, while the Non-SCD-MHC was significantly related to only one of five. The SCD-MHC was also significantly related to age and number of emergency room visits, while the Non-SCD-MHC was not. Finally, the relationship of the SCD-MHC to general health as measured by the 10 PROMIS® Globals was far stronger and more consistent than the relationship of the Non-SCD-MHC index to the 10 PROMIS® Globals.

Table 3 Evaluation of the SCD-MHC as an indicator of SCD severity

Item bank construction

Unidimensionality

Although respondents were allowed to skip any of the field test questions, only five respondents had to be eliminated from the psychometric analysis due to missing data. This represented just one percent of the total number of respondents. Thus the psychometric analysis was conducted using 556 respondents. Item-total correlations for the Physical Impact item set confirmed our suspicion that the content of that bank would more appropriately reside in three groups representing pain (17 items), stiffness (19 items) and sleep functioning (20 items). Going forward, we evaluated the pain, stiffness and sleep functioning questions as three distinct item sets. This resulted in six item sets in total. Items were eliminated from three of the six due to low item-total correlations (see Table 4, 1st column under "causes for elimination").

Table 4 Number of items eliminated and cause for elimination

EFAs, restricted to the number of factors that emerged in the PA, were conducted on the remaining items in each question set to identify the simple structure pattern that would be modeled by the IRT bi-factor analysis. Items which did not conform to simple structure were eliminated from four of six question sets (see Table 4, 2nd column under "causes for elimination"). The subsequent IRT analysis identified local dependence among pairs of the remaining items in five of six question sets. One item from each pair was deleted from the sets (see Table 4, 3rd column under "causes for elimination").

DIF

A total of nine items from four of the question sets demonstrated measurement bias with regard to age or gender and so were eliminated (see Table 4, 4th column under "causes for elimination" and Table 5). Table 5 lists the items for which we found DIF due to gender or age following the methods detailed above.

Table 5 Items with biased measurement across genders or ages

Reliability

Reliability statistics for ASCQ-Me are presented in the first two columns after the row headings of Table 6 which show, respectively: (1) the range of scores (out of a possible range of 6) wherein measurement error is below the threshold that is associated with greater than 90% systematic or predictable variance based on the underlying construct (based on information curves produced by the IRT 2-parameter GRM); and (2) Cronbach’s alpha coefficient estimate of internal consistency reliability [62]. The first column shows highly reliable measurement in the range of 4–5 out of 6 of the possible score distribution. We do not expect highly reliable measurement in the tails of the score range, by definition, as restricted range decreases statistical power. The second column shows the average reliability of scores for the measures to be well above the recommended 0.90 for use at the individual-patient level.
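
The link between an IRT information curve and the 90% threshold mentioned above can be sketched as follows. The sketch relies on a common approximation, stated here as an assumption rather than as the authors' exact procedure: with the latent trait scaled to unit variance, the squared standard error at a given trait level is 1 / information, so reliability is approximately 1 - 1 / information and exceeds 0.90 when information exceeds 10. The function names, the grid of trait levels and the information values are hypothetical.

```python
from typing import Sequence


def reliability_from_information(information: float) -> float:
    """Approximate reliability at one trait level, assuming a unit-variance trait."""
    return 1.0 - 1.0 / information


def reliable_range_width(thetas: Sequence[float],
                         informations: Sequence[float],
                         min_reliability: float = 0.90) -> float:
    """Width of the span of grid points whose reliability exceeds the threshold.
    Assumes the qualifying points form one contiguous region of the grid."""
    ok = [t for t, i in zip(thetas, informations)
          if i > 0 and reliability_from_information(i) > min_reliability]
    return (max(ok) - min(ok)) if ok else 0.0


# Hypothetical test information values on a grid from -3 to 3 (a range of 6).
grid = [-3, -2, -1, 0, 1, 2, 3]
test_info = [2, 12, 20, 25, 20, 12, 2]
width = reliable_range_width(grid, test_info)
```

In this made-up example the measure would be highly reliable across 4 of the 6 units of the score range, the kind of summary reported in the first column of Table 6.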

Table 6 Reliability and validity of ASCQ-Me item banks

Validity

The third column of Table 6 displays statistics for bi-factor models of the relationship of items to a single underlying dimension with secondary dimensions modeling artifactual covariances for each ASCQ-Me item bank. All comparative fit indices (CFIs) are above the liberal criterion for good model fit (0.90) and three of six are at or above the more conservative criterion of 0.95. While the root-mean-square error of approximation (RMSEA) values are higher than the range of 0.06-0.10 that is typically recommended in psychometric texts [39],[63], they are consistent with those reported in the literature for other PRO measures of this type [64] and with findings regarding the relationship of fit indices to degrees of freedom [65].

The next-to-last column in Table 6 shows that the correlations between the vectors of discrimination parameters for the bi-factor and uni-factor IRT models exceed 0.95 in every case. This suggests that the secondary factors which were modeled dealt with sources of variability unrelated to the primary factor.

The last column in Table 6 displays additional evidence for the validity of the ASCQ-Me measures. We looked at the relationship of ASCQ-Me scores to the alternative marker of SCD condition severity based on the number of SCD sequelae and treatments endorsed from a list of nine in total (that is, the SCD-MHC). We divided respondents into tertiles based on the distribution of the SCD-MHC and tested the significance of the difference in ASCQ-Me scores across these tertiles (F-statistic from univariate analysis of variance models). The ASCQ-Me scores for all measures, with the exception of the cognitive functioning measure, differed significantly according to SCD severity level (p < 0.0001), such that those with the least severe disease had the highest scores (demonstrating better patient-reported outcomes).
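
The known-groups comparison described above (a tertile split on the severity checklist followed by a one-way analysis of variance) can be sketched as below. This is an illustrative sketch with hypothetical function names and data. The rank-based split is an assumption: checklist scores are integers with many ties, and the text does not say how ties at the tertile boundaries were handled.

```python
from statistics import mean
from typing import List, Sequence


def tertile_groups(severity: Sequence[float],
                   outcome: Sequence[float]) -> List[List[float]]:
    """Split outcome scores into three groups by rank on the severity score.
    Ties at a boundary are assigned by original order (an assumption)."""
    order = sorted(range(len(severity)), key=lambda i: severity[i])
    n = len(order)
    cuts = [0, n // 3, (2 * n) // 3, n]
    return [[outcome[i] for i in order[cuts[g]:cuts[g + 1]]] for g in range(3)]


def one_way_anova_f(groups: Sequence[Sequence[float]]) -> float:
    """F-statistic: between-group mean square over within-group mean square."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical data: higher checklist scores paired with lower outcome scores.
groups = tertile_groups([0, 1, 2, 3, 4, 5], [60, 58, 50, 52, 40, 42])
f_stat = one_way_anova_f(groups)
```

A large F-statistic indicates that mean scores differ across the severity groups by more than the variation within groups would suggest.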

Discussion

Summary

We evaluated the reliability and validity of 140 questions designed to provide data that could be used to measure the patient-reported functioning and well-being of adults with SCD by conducting psychometric analysis of data from 556 patients. These patients, varying in age, gender, SCD Type, and SCD severity, provided us with high-quality responses to questions administered through the internet in a range of clinical settings. We eliminated 48 questions that either did not form clean factors or provided biased measurement across subgroups defined by age and gender. As a result, we derived six item banks that measure cognitive, emotional and social functioning, sleep quality, pain and stiffness. These six measures, collectively called ASCQ-Me, provide highly reliable measurement according to IRT total information curves as well as internal consistency reliability estimates. Analysis of construct validity supported the essential unidimensionality of the item banks and the measures discriminate among levels of disease severity defined on the basis of medical history. With one exception, across multiple criteria, we recommend the use of these item banks to contribute to studies of the health of patients with SCD. We do not recommend the use of the Cognitive Impact measure at this time because, compared to the other measures, it provided reliable measurement across a more restricted score range and was far weaker in discriminating among levels of severity in SCD.

The psychometric analyses detailed here were used to identify a final set of questions which produce reliable health outcome scores using far fewer than the 140 questions to which patients in the field test responded. As shown in Table 4, the five recommended item banks (absent the cognitive bank) include just 77 items, following elimination of questions based on the psychometric analysis. Moreover, the purpose of constructing the IRT-calibrated item banks was to identify still smaller subsets of items within each that can be used to provide precise measurement for particular applications. IRT calibrations indicate which items are most informative at a particular level of the trait being measured and thus enable the construction of short form measures. For example, we have constructed reliable short form measures totaling just 5 questions for each of the five recommended item banks. This enables users to measure all five ASCQ-Me concepts using just 25 questions. Moreover, IRT-calibrated item banks such as those included in ASCQ-Me can be administered using computer adaptive software, which offers an alternative way to produce reliable measurement with as few questions as possible. Such software has been developed for ASCQ-Me.

Limitations

We were limited by the study design in the range of analyses we could do to address the validity of the item banks. Data were collected at one point in time, so we could not address the relationship of ASCQ-Me scores to change in condition. We had one clinical indicator of disease and this is known to be a poor measure of severity. Previous research supports the validity of self-report methods of multi-morbidity assessment [66], and so we put considerable thought and careful analysis into developing the self-reported medical history checklist (SCD-MHC). We described the relationship of checklist scores to other markers of health burden: age, utilization, pain episode recency and severity, and PROMIS® global ratings of health. Because they derive from the same source, the relationships between SCD-MHC scores and ASCQ-Me item bank scores might be artifacts of the data collection method. This hypothesis was not supported because the ASCQ-Me scores did not have a strong and consistent relationship to a comorbidity index that was composed of self-reported conditions which are not characteristic of SCD (e.g. migraine, cancer, rheumatoid arthritis).

The respondents included a mix of ages, genders, and disease types; however, those older than 54 (7%) and those with an SCD Type other than SS were in the minority. This prevented us from conducting psychometric analyses specific to patients with SC (21%) or Beta types (10%) or to those who were older than middle age.

We do not know how representative our field test sample is because a nationally-representative, descriptive study of the socio-demographic and health characteristics (e.g. severity of disease, incidence of comorbid conditions) of adults with SCD does not yet exist. Such a study is hampered by the difficulty in developing a comprehensive sampling frame. That frame cannot rely on medical records or registries alone because many adults with SCD are not included in those databases. Moreover, the stigma associated with SCD is a barrier to accurately identifying the names and contact information of individuals. However, this issue is not unique to the current research and indeed applies to all research results involving adults with SCD which seek to generalize to the population. Current scoring for the ASCQ-Me item banks is relative to this field test sample so that a score of 50 represents the average score for the 556 respondents. Ideally, we would like to be able to say that a score of 50 represents the average response of all adults with SCD. While we do not have a descriptive study of adults with SCD, available data suggest that the characteristics of our sample are likely to mirror those of the other populations with regard to age of participating adults and hemoglobin type [67]-[69]. Adult males with SCD may have been under-represented in our sample. Although results have been mixed [70], female gender has previously been associated with diminished health-related quality of life in the physical domain [68] and reports of a higher prevalence of pain episodes [6]. Our results should therefore be viewed with caution, until such time as a population-based description of socio-demographic and health characteristics of the U.S. SCD population is available.

Future research

Further analyses of the field test data will be conducted to evaluate the potential of identifying cut-off scores for the ASCQ-Me item banks and short forms. Such scores could serve as interpretative aids which would enhance the usefulness of the ASCQ-Me measures for clinical practice.

While sample size prevented us from including the work-functioning questions in the IRT analysis conducted to develop ASCQ-Me, we intend to explore the development of a static work functioning scale for adults with SCD based on the field test items and data. Employment and work functioning were of great concern to participants in our focus groups and these concerns frequently surfaced as well in our individual interviews with adult patients.

Finally, should resources become available, we intend to collect longitudinal data with the ASCQ-Me measures which would permit us to evaluate the sensitivity of these measures to change over time. An ideal study would be a placebo-controlled study in which the size of the change in ASCQ-Me scores associated with the introduction of a treatment of known efficacy was evaluated. We are aware, also, of ongoing studies conducted by other investigators in which ASCQ-Me data are being collected longitudinally to evaluate the efficacy of drug therapy and to evaluate the impact of differences in health care delivery systems, and eagerly anticipate their reports.

Conclusions

A valid measure of health outcome is required to inform the design and delivery of health care for adults with SCD. Building on a comprehensive program of formative research and statistical analysis of field test data on more than 550 patients, we applied advanced psychometric methods including those currently used by the PROMIS® initiative [10],[71] to construct the ASCQ-Me measures of Cognitive, Emotional, Pain, Sleep, Social and Stiffness Impact. Our results support the reliability and validity of all measures, with the exception of Cognitive Impact, and we encourage the use of the remaining five measures in future studies conducted by the broader research community.

The contribution of the research described herein was to develop a system called ASCQ-Me to provide a standard method for describing the life impact of SCD on adult functioning and wellbeing, which would enable the comparison of health outcomes for these patients across medical, clinical and health services research on SCD. Strengths of this research include: 1) roots in a rigorous program of formative research with adults with SCD [6]; 2) the participation of a large number of adults with SCD in the field test (>550 patients); 3) the application of advanced psychometric methods consistent with standards put forth by the PROMIS® initiative [10],[71]; 4) a careful focus on evaluating and eliminating sources of bias in measurement; 5) evaluation of the validity of the ASCQ-Me measures using a measure of SCD severity that does not suffer from the weaknesses of often-used measures such as number of hospitalizations or SCD type; and 6) the development of item banks which can support the construction of short sets of questions for each health concept suitable for administration via fixed forms or adaptively, to provide precise measurement with as few questions as possible. This system is currently in use in a number of studies which will provide further information on the validity of the scores and the usefulness of the system, including studies of the sensitivity of ASCQ-Me scores to change in health that might result from drug therapy or from changes in how health care is delivered.