Best Practices in Detecting Bias in Nonverbal Tests

  • Susan J. Maller

Abstract

Group comparisons of performance on intelligence tests have been advanced as evidence of real similarities or differences in intellectual ability by Jensen (1980) and, more recently, by Herrnstein and Murray (1994). This purported evidence includes the mean intelligence score differences that have been reported for various ethnic groups (e.g., Jensen, 1969, 1980; Loehlin, Lindzey, & Spuhler, 1975; Lynn, 1977; Munford & Munoz, 1980) or gender (e.g., Feingold, 1993; Nelson, Arthur, Lautiger, & Smith, 1994; Smith, Edmonds, & Smith, 1989; Vance, Hankins, & Brown, 1988; Wessel & Potter, 1994; Wilkinson, 1993).
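
The reference list below cites several statistical procedures for detecting differential item functioning (DIF), among them the Mantel-Haenszel procedure (Mantel & Haenszel, 1959; Holland & Thayer, 1988; Dorans, 1989). As a rough illustration of that logic, here is a minimal Python sketch that stratifies examinees on total test score and estimates the Mantel-Haenszel common odds ratio for a single dichotomous item; the function name, the input record format, and the reference/focal labels are assumptions for illustration, not the chapter's own implementation.

    # Minimal Mantel-Haenszel DIF sketch (illustrative; not the chapter's code).
    from collections import defaultdict
    import math

    def mantel_haenszel_dif(records):
        """Return the MH common odds ratio and the ETS delta-scale effect size.

        records: iterable of (group, matching_score, correct) tuples, where
        group is "reference" or "focal", matching_score is the total test
        score used to stratify examinees, and correct is 1 or 0.
        """
        # One 2 x 2 table (group by right/wrong) per score stratum.
        tables = defaultdict(lambda: [0, 0, 0, 0])  # [A, B, C, D]
        for group, score, correct in records:
            cell = tables[score]
            if group == "reference":
                cell[0 if correct else 1] += 1  # A = right, B = wrong
            else:
                cell[2 if correct else 3] += 1  # C = right, D = wrong

        numerator = 0.0    # sum over strata of A * D / N
        denominator = 0.0  # sum over strata of B * C / N
        for a, b, c, d in tables.values():
            n = a + b + c + d
            if n == 0:
                continue
            numerator += a * d / n
            denominator += b * c / n

        alpha = numerator / denominator   # 1.0 indicates no DIF on this item
        delta = -2.35 * math.log(alpha)   # MH D-DIF on the ETS delta scale
        return alpha, delta

A common odds ratio near 1.0 (MH D-DIF near 0) indicates comparable item performance for matched examinees; in the ETS classification scheme, absolute MH D-DIF values of roughly 1.5 or more are conventionally flagged as large DIF.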

References

  1. Alwin, D. F., & Jackson, D. J. (1981). Applications of simultaneous factor analysis to issues of factorial invariance. In D. Jackson & E. Borgatta (Eds.), Factor analysis and measurement in sociological research: A multi-dimensional perspective (pp. 249–279). Beverly Hills, CA: Sage.
  2. American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Psychological Association.
  3. Anderson, R. J., & Sisco, F. H. (1977). Standardization of the WISC-R performance scale for deaf children (Office of Demographic Studies Publication Series T, No. 1). Washington, DC: Gallaudet College.
  4. Angoff, W. H. (1993). Perspectives on differential item functioning methodology. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 3–24). Hillsdale, NJ: Erlbaum.
  5. Angoff, W. H., & Ford, S. F. (1973). Item-race interaction on a test of scholastic aptitude. Journal of Educational Measurement, 10, 95–105.
  6. Bentler, P. M. (1990). Comparative fit indexes in structural models. Psychological Bulletin, 107, 238–246.
  7. Bentler, P. M. (1992). On the fit of models to covariances and methodology in the Bulletin. Psychological Bulletin, 112, 400–404.
  8. Berk, R. A. (Ed.). (1982). Handbook of methods for detecting test bias. Baltimore: Johns Hopkins University Press.
  9. Bock, R. D. (1997). The nominal categories model. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 33–49). New York: Springer.
  10. Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46, 443–459.
  11. Bollen, K. A. (1989). Structural equations with latent variables. New York: Wiley.
  12. Bracken, B. A., & McCallum, R. S. (1998). Universal Nonverbal Intelligence Test: Examiner's manual. Itasca, IL: Riverside Publishing.
  13. Braden, J. P. (1999). Straight talk about assessment and diversity: What do we know? School Psychology Quarterly, 14, 343–355.
  14. Breslow, N. E., & Day, N. E. (1980). Statistical methods in cancer research, Vol. 1: The analysis of case-control studies. Lyon: International Agency for Research on Cancer.
  15. Browne, M. W., & Cudeck, R. (1993). Alternative ways of assessing fit. In K. A. Bollen & J. S. Long (Eds.), Testing structural equation models (pp. 136–162). Newbury Park, CA: Sage.
  16. Bryk, A. (1980). Review of Bias in mental testing. Journal of Educational Measurement, 17, 369–374.
  17. Camilli, G. (1979). A critique of the chi-square method for assessing item bias. Unpublished paper, Laboratory of Educational Research, University of Colorado.
  18. Camilli, G., & Shepard, L. A. (1987). The inadequacy of ANOVA for detecting test bias. Journal of Educational Statistics, 12, 87–89.
  19. Camilli, G., & Shepard, L. A. (1994). Methods for identifying biased test items. Thousand Oaks, CA: Sage.
  20. Cattell, R. B. (1978). The scientific use of factor analysis in behavioral and life sciences. New York: Plenum.
  21. Clauser, B. E., Nungester, R. J., & Swaminathan, H. (1996). Improving the matching for DIF analysis by conditioning on both test score and an educational background variable. Journal of Educational Measurement, 33, 453–464.
  22. Cleary, T. A. (1968). Test bias: Prediction of grades of Negro and White students in integrated colleges. Journal of Educational Measurement, 5, 115–124.
  23. Cohen, A. S., Kim, S., & Wollack, J. A. (1996). An investigation of the likelihood ratio test for detection of differential item functioning. Applied Psychological Measurement, 20, 15–26.
  24. Cotter, D. E., & Berk, R. A. (1981, April). Item bias in the WISC-R using Black, White, and Hispanic learning disabled children. Paper presented at the annual meeting of the American Educational Research Association, Los Angeles (ERIC Document Reproduction Service ED 206 631).
  25. Diana v. the California State Board of Education, Case No. C-70–37 RFP (N.D. Cal. 1970).
  26. Dorans, N. J. (1989). Two new approaches to assessing differential item functioning: Standardization and the Mantel-Haenszel method. Applied Psychological Measurement, 2, 217–233.
  27. Dorans, N. J., & Kulick, E. (1983). Assessing unexpected differential item performance of female candidates on SAT and TSWE forms administered in December 1977: An application of the standardization approach (Research Rep. No. 83–9). Princeton, NJ: Educational Testing Service.
  28. Engelhard, G., Hansche, L., & Rutledge, K. E. (1990). Accuracy of bias review judges in identifying differential item functioning on teacher certification tests. Applied Psychological Measurement, 3, 347–360.
  29. Feingold, A. (1993). Cognitive gender differences: A developmental perspective. Sex Roles, 29, 91–112.
  30. Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 105–146). New York: American Council on Education & Macmillan.
  31. Frisby, C. L. (1998). Poverty and socioeconomic status. In J. L. Sandoval, C. L. Frisby, K. F. Geisinger, J. D. Scheuneman, & J. R. Grenier (Eds.), Test interpretation and diversity: Achieving equity in assessment (pp. 241–270). Washington, DC: American Psychological Association.
  32. Green, B. F., Crone, C. R., & Folk, V. G. (1989). A method for studying differential distractor functioning. Journal of Educational Measurement, 26, 147–160.
  33. Hammill, D. D., Pearson, N. A., & Wiederholt, J. L. (1997). Comprehensive test of nonverbal intelligence. Austin, TX: PRO-ED.
  34. Herrnstein, R. J., & Murray, C. (1994). The bell curve. New York: Free Press.
  35. Hills, J. R. (1989). Screening for potentially biased items in testing programs. Educational Measurement: Issues and Practice, 8(4), 5–11.
  36. Holland, P. W., & Thayer, D. T. (1988). Differential item functioning and the Mantel-Haenszel procedure. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 129–145). Hillsdale, NJ: Erlbaum.
  37. Hu, L., Bentler, P. M., & Kano, Y. (1992). Can test statistics in covariance structure analysis be trusted? Psychological Bulletin, 112, 351–362.
  38. Ilai, D., & Willerman, L. (1989). Sex differences in WAIS-R item performance. Intelligence, 13, 225–234.
  39. Jastak, J. E., & Jastak, S. R. (1964). Short forms of the WAIS and WISC vocabulary subtests. Journal of Clinical Psychology, 20, 167–199.
  40. Jensen, A. R. (1969). How much can we boost IQ and scholastic achievement? Harvard Educational Review, 39, 1–123.
  41. Jensen, A. R. (1974). How biased are culture-loaded tests? Genetic Psychology Monographs, 90, 185–224.
  42. Jensen, A. R. (1976). Test bias and construct validity. Phi Delta Kappan, 58, 340–346.
  43. Jensen, A. R. (1980). Bias in mental testing. New York: Free Press.
  44. Joint Committee on Testing Practices. (1988). Code of fair testing practices in education. Washington, DC: National Council on Measurement in Education.
  45. Jöreskog, K. G. (1971). Simultaneous factor analysis in several populations. Psychometrika, 57, 409–426.
  46. Jöreskog, K. G., & Sörbom, D. (1989). LISREL 7: A guide to the program and applications (2nd ed.). Chicago: SPSS.
  47. Koh, T., Abbatiello, A., & McLoughlin, C. S. (1984). Cultural bias in WISC subtest items: A response to Judge Grady's suggestions in relation to the PASE case. School Psychology Review, 13, 89–94.
  48. Kromrey, J. D., & Parshall, C. G. (1991, November). Screening items for bias: An empirical comparison of the performance of three indices in small samples of examinees. Paper presented at the annual meeting of the Florida Educational Research Association, Clearwater, FL.
  49. Larry P. et al. v. Wilson Riles, Superintendent of Public Instruction for the State of California, et al., Case No. C-71–2270 (N.D. Cal. 1979).
  50. Lawrence, I. M., & Curley, W. E. (1989, March). Differential item functioning of SAT-Verbal reading subscore items for males and females: Follow-up study. Paper presented at the annual meeting of the American Educational Research Association, San Francisco.
  51. Lawrence, I. M., Curley, W. E., & McHale, F. J. (1988, April). Differential item functioning of SAT-Verbal reading subscore items for male and female examinees. Paper presented at the annual meeting of the American Educational Research Association, New Orleans.
  52. Linn, R. L., Levine, M. V., Hastings, C. N., & Waldrop, J. L. (1980). An investigation of item bias in a test of reading comprehension (Technical Rep. No. 163). Urbana: Center for the Study of Reading, University of Illinois at Urbana-Champaign.
  53. Loehlin, J. C., Lindzey, G., & Spuhler, J. N. (1975). Race differences in intelligence. San Francisco: W. H. Freeman.
  54. Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.
  55. Lynn, R. (1977). The intelligence of the Japanese. Bulletin of the British Psychological Society, 30, 69–72.
  56. Maller, S. J. (1996). WISC-III Verbal item invariance across samples of deaf and hearing children of similar measured ability. Journal of Psychoeducational Assessment, 14, 152–165.
  57. Maller, S. J. (2000). Item invariance in four subtests of the Universal Nonverbal Intelligence Test across groups of deaf and hearing children. Journal of Psychoeducational Assessment, 18, 240–254.
  58. Maller, S. J. (2001). Differential item functioning in the WISC-III: Item parameters for boys and girls in the national standardization sample. Educational and Psychological Measurement, 61, 793–817.
  59. Maller, S. J., & Ferron, J. (1997). WISC-III factor invariance across deaf and standardization samples. Educational and Psychological Measurement, 7, 987–994.
  60. Maller, S. J., Konold, T. R., & Glutting, J. J. (1998). WISC-III factor invariance across samples of children displaying appropriate and inappropriate test-taking behavior. Educational and Psychological Measurement, 58, 467–475.
  61. Mantel, N., & Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute, 22, 719–748.
  62. McGaw, B., & Jöreskog, K. G. (1971). Factorial invariance of ability measures in groups differing in intelligence and socio-economic status. British Journal of Mathematical and Statistical Psychology, 24, 154–168.
  63. Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). New York: American Council on Education & Macmillan.
  64. Miele, F. (1979). Cultural bias in the WISC. Intelligence, 3, 149–164.
  65. Munford, P. R., & Munoz, A. (1980). A comparison of the WISC and WISC-R on Hispanic children. Journal of Clinical Psychology, 36, 452–458.
  66. National Council on Measurement in Education. (1995). Code of professional responsibilities in educational measurement. Washington, DC: NCME.
  67. Nelson, K. M., Arthur, P., Lautiger, J., & Smith, D. K. (1994, March). Does the use of color on the WISC-III affect student performance? Paper presented at the annual meeting of the National Association of School Psychologists, Seattle, WA.
  68. O'Neill, K. A., & McPeek, W. M. (1993). Item and test characteristics that are associated with differential item functioning. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 255–276). Hillsdale, NJ: Erlbaum.
  69. Plake, B. S. (1980). A comparison of a statistical and subjective procedure to ascertain item validity: One step in the test validation process. Educational and Psychological Measurement, 30, 397–404.
  70. Reynolds, C. R. (1982). The problem of bias in psychological assessment. In C. R. Reynolds & T. B. Gutkin (Eds.), The handbook of school psychology (pp. 178–108). New York: John Wiley.
  71. Rigdon, E. E. (1996). CFI versus RMSEA: A comparison of two fit indexes for structural equation modeling. Structural Equation Modeling, 3, 369–379.
  72. Roid, G. H., & Miller, L. J. (1997). Leiter International Performance Scale-Revised: Examiner's manual. Wood Dale, IL: Stoelting.
  73. Rogers, H. J., & Swaminathan, H. (1993). A comparison of logistic regression and the Mantel-Haenszel procedures for detecting differential item functioning. Applied Measurement in Education, 17, 105–116.
  74. Ross-Reynolds, J., & Reschly, D. J. (1983). An investigation of item bias on the WISC-R with four sociocultural groups. Journal of Consulting and Clinical Psychology, 51, 144–146.
  75. Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph, 4(2), Whole No. 17.
  76. Sandoval, J. (1979). The WISC-R and internal evidence of test bias with minority groups. Journal of Consulting and Clinical Psychology, 47, 919–927.
  77. Sandoval, J., & Millee, M. P. W. (1980). Accuracy of judgements of WISC-R item difficulty for minority groups. Journal of Consulting and Clinical Psychology, 48, 249–253.
  78. Sandoval, J., Zimmerman, I. L., & Woo-Sam, J. M. (1977). Cultural differences on the WISC-R verbal items. Journal of School Psychology, 21, 49–55.
  79. Satorra, A., & Bentler, P. M. (1988). Scaling corrections for chi-square statistics in covariance structure analysis. Proceedings of the Business and Economic Statistics Section of the American Statistical Association, 303–313.
  80. Scheuneman, J. (1975, April). A new method of assessing bias in test items. Paper presented at the annual meeting of the American Educational Research Association, Washington, DC.
  81. Scheuneman, J. D., & Gerritz, K. (1990). Using differential item functioning procedures to explore sources of item difficulty and group performance characteristics. Journal of Educational Measurement, 27, 109–131.
  82. Scheuneman, J. D., & Oakland, T. (1998). In J. Sandoval, C. L. Frisby, K. F. Geisinger, J. D. Scheuneman, & J. R. Grenier (Eds.), Test interpretation and diversity: Achieving equity in assessment (pp. 77–103). Washington, DC: American Psychological Association.
  83. Shelley-Sireci, & Sireci, S. G. (1998, August). Controlling for uncontrolled variables in cross-cultural research. Paper presented at the annual meeting of the American Psychological Association, San Francisco.
  84. Sireci, S. G., Bastad, B., & Allalouf, A. (1998, August). Evaluating construct equivalence across adapted tests. Paper presented at the annual meeting of the American Psychological Association, San Francisco.
  85. Simpson, E. H. (1951). The interpretation of interaction in contingency tables. Journal of the Royal Statistical Society, Series B, 13, 238–241.
  86. Smith, T. C., Edmonds, J. E., & Smith, B. (1989). The role of sex differences in the referral process as measured by the Peabody Picture Vocabulary Test-Revised and the Wechsler Intelligence Scale for Children-Revised. Psychology in the Schools, 26, 354–358.
  87. Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27, 361–370.
  88. Thissen, D. (1991). MULTILOG (Version 6.30) [Computer software]. Chicago: Scientific Software.
  89. Thissen, D., & Steinberg, L. (1997). A response model for multiple-choice items. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 51–65). New York: Springer.
  90. Thissen, D., Steinberg, L., & Gerrard, M. (1986). Beyond group mean differences: The concept of item bias. Psychological Bulletin, 99, 118–128.
  91. Thissen, D., Steinberg, L., & Wainer, H. (1988). Use of item response theory in the study of group differences in trace lines. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 149–169). Hillsdale, NJ: Erlbaum.
  92. Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential item functioning using the parameters of item response theory. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 67–114). Hillsdale, NJ: Erlbaum.
  93. Turner, R. G., & Willerman, L. (1977). Sex differences in WAIS item performance. Journal of Clinical Psychology, 33, 795–798.
  94. Vance, B., Hankins, N., & Brown, W. (1988). Ethnic and sex differences on the Test of Nonverbal Intelligence, Quick Test of Intelligence, and Wechsler Intelligence Scale for Children-Revised. Journal of Clinical Psychology, 44, 261–265.
  95. Veale, J. R. (1977). A note on the use of chi-square with “correct/incorrect” data to detect culturally biased items (Statistical Research in the Behavioral Sciences, Technical Report No. 4). (Available from J. R. Veale, PO Box 4036, Berkeley, CA 94704.)
  96. Welch, C., & Hoover, H. D. (1993). Procedures for extending item bias techniques to polytomously scored items. Applied Psychological Measurement, 6, 1–19.
  97. Wessel, J., & Potter, A. (1994, March). Analysis of WISC-III data from an urban population of referred children. Paper presented at the annual meeting of the National Association of School Psychologists, Seattle, WA.
  98. Wild, C. L., & McPeek, W. M. (1986, August). Performance of the Mantel-Haenszel statistic in identifying differentially functioning items. Paper presented at the annual meeting of the American Psychological Association, Washington, DC.
  99. Wilkinson, S. C. (1993). WISC-R profiles of children with superior intellectual ability. Gifted Child Quarterly, 2, 84–92.
  100. Zieky, M. (1993). Practical questions in the use of DIF statistics in item development. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 337–364). Hillsdale, NJ: Erlbaum.
  101. Zumbo, B. D. (1999). A handbook on the theory and methods of differential item functioning (DIF): Logistic regression modeling as a unitary framework for binary and Likert-type (ordinal) item scores. Ottawa: Directorate of Human Resources Research and Evaluation, Department of National Defense.
  102. Zwick, R., Donoghue, J. R., & Grima, A. (1993). Assessing differential item functioning in performance tasks. Journal of Educational Measurement, 30, 233–251.

Copyright information

© Springer Science+Business Media New York 2003

Authors and Affiliations

  • Susan J. Maller
  1. Department of Educational Studies, Purdue University, West Lafayette, USA
