Quality of Life Research, Volume 12, Issue 8, pp 981–1002

Using item response theory to calibrate the Headache Impact Test (HIT™) to the metric of traditional headache scales

  • Jakob B. Bjorner
  • Mark Kosinski
  • John E. Ware Jr


Background: Item response theory (IRT) scoring of health status questionnaires offers many advantages. However, to ensure 'backwards comparability' and to facilitate interpretation of results, we need the ability to express the IRT score in the metrics of the traditional scales.

Objectives: To develop procedures to calibrate IRT-based scores on the Headache Impact Test (HIT) into the metrics of the traditional headache scales, and to assess the degree to which the calibrated HIT scores agree with the observed traditional scores and lead to the same conclusions in group comparisons.

Methods: We used telephone interview data (n = 1016) and Internet data (n = 1103) from general population surveys of recent headache sufferers. Analyses were conducted in four steps: (1) develop IRT models for all items, (2) for each IRT score level, calculate the expected score on each of the traditional scales (calibration), (3) adjust this calibrated score for measurement error in the IRT score, (4) for each of the traditional scales, assess agreement between calibrated HIT scores and observed scores using intraclass correlation (ICC) and evaluate the agreement of mean scores and the relative validity (RV) in discriminating among groups differing in migraine diagnosis, headache severity, and change in impact over time.

Results: For the traditional categorical questionnaire items (the Migraine Specific Questionnaire (MSQ) and the Headache Disability Inventory (HDI)), the calibrated HIT agreed with the observed traditional scores: ICCs were between 0.80 and 0.94. In RV analyses the maximum mean difference between the observed and expected scores was 1.7 points on a 0–100 scale for comparisons at one point in time. Analyses of change over time and analyses calibrating scores from the fixed-form HIT-6 to the metric of other questionnaires were also satisfactory, although less precise. Analysis of non-standard questionnaire items (e.g. 'On how many days in the past 3 months did you have a headache?', from the HIMQ and the MIDAS) required special IRT models. Agreement was weaker: ICCs were between 0.56 and 0.61, and the maximum mean differences were 2.9 (on a 0–270 scale) and 3.8 (on a 0–450 scale) in RV analyses at one point in time. The ability of the calibrated scale scores to discriminate between groups was at least as good as the ability of the observed sum scales and often remarkably better.

Conclusion: The theoretical advantage of IRT models in scale calibration is supported by our results. This approach to achieving comparability of new and widely used scales, and to accelerating the accumulation of interpretation guidelines based on previous work, warrants testing for measures of other generic and disease-specific concepts.
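Steps (2) and (3) of the Methods can be illustrated with a small sketch. It is not the authors' implementation: it assumes a generalized partial credit model for the traditional-scale items, uses made-up item parameters (`ITEMS`), and treats the measurement-error adjustment as a simple average of the expected score over a normal distribution centred on the estimated IRT score. All function names are hypothetical.

```python
import math

def gpcm_category_probs(theta, slope, thresholds):
    """Category probabilities for one item under a generalized partial credit model.

    The item has len(thresholds) + 1 ordered categories, scored 0, 1, 2, ...
    """
    # Category 0 has logit 0; each further category adds slope * (theta - threshold).
    logits = [0.0]
    for b in thresholds:
        logits.append(logits[-1] + slope * (theta - b))
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(z - peak) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]

def expected_scale_score(theta, items):
    """Expected raw sum score on a traditional scale at IRT score theta (calibration)."""
    score = 0.0
    for slope, thresholds in items:
        probs = gpcm_category_probs(theta, slope, thresholds)
        score += sum(k * p for k, p in enumerate(probs))
    return score

def adjusted_expected_score(theta_hat, se, items, n_points=41, width=4.0):
    """Expected sum score averaged over a normal distribution N(theta_hat, se^2).

    Assumption: this averaging is one simple way to account for measurement
    error in the IRT score; the paper's actual adjustment may differ.
    """
    if se <= 0:
        return expected_scale_score(theta_hat, items)
    numerator = 0.0
    denominator = 0.0
    for i in range(n_points):
        # Evenly spaced grid from -width to +width standard errors.
        z = -width + 2.0 * width * i / (n_points - 1)
        weight = math.exp(-0.5 * z * z)
        numerator += weight * expected_scale_score(theta_hat + se * z, items)
        denominator += weight
    return numerator / denominator

# Hypothetical (slope, step thresholds) for a 3-item scale, each item scored 0-4.
ITEMS = [
    (1.2, [-1.5, -0.5, 0.5, 1.5]),
    (0.9, [-1.0, 0.0, 1.0, 2.0]),
    (1.6, [-2.0, -1.0, 0.0, 1.0]),
]

if __name__ == "__main__":
    for theta in (-2.0, 0.0, 2.0):
        plain = expected_scale_score(theta, ITEMS)
        adjusted = adjusted_expected_score(theta, 0.5, ITEMS)
        print(f"theta={theta:+.1f}  expected={plain:.2f}  adjusted={adjusted:.2f}")
```

Tabulating `expected_scale_score` over a grid of IRT scores gives a crosswalk from the IRT metric to the raw metric of the traditional scale. A real application would use item parameters estimated from data rather than the illustrative values above.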

Keywords: Computerized adaptive assessment; Disability; Headache; Health status; Impact; Item response theory; Migraine; Quality of life; Questionnaires; Severity





Copyright information

© Kluwer Academic Publishers 2003

Authors and Affiliations

  • Jakob B. Bjorner (1, 2)
  • Mark Kosinski (1)
  • John E. Ware Jr (1, 3)

  1. QualityMetric Incorporated, Lincoln, USA
  2. National Institute of Occupational Health, Copenhagen, Denmark
  3. Health Assessment Lab, Waltham, USA
