
Disentangling Setting and Mode Effects for Online Competence Assessment

  • Ulf Kroehne
  • Timo Gnambs
  • Frank Goldhammer
Chapter
Part of the Edition ZfE book series (EZFE, volume 3)

Abstract

Many large-scale competence assessments such as the National Educational Panel Study (NEPS) have introduced novel test designs to improve response rates and measurement precision. In particular, unstandardized online assessments (UOA) offer an economical approach to reaching heterogeneous populations that otherwise would not participate in face-to-face assessments. Acknowledging the distinction between delivery, mode, and test setting, this chapter extends the theoretical background for dealing with mode effects in NEPS competence assessments (Kroehne and Martens in Zeitschrift für Erziehungswissenschaft 14:169–186, 2011) and discusses two specific facets of UOA: (a) the confounding of selection and setting effects and (b) the role of test-taking behavior as a mediator variable. We present a strategy that allows results from UOA to be integrated with results from proctored computerized assessments and that generalizes the idea of motivational filtering, known from the treatment of rapid guessing behavior in low-stakes assessments. We particularly emphasize the relationship between paradata and the investigation of test-taking behavior, and we illustrate how a reference sample assessed under standardized and supervised conditions can be used to increase the comparability of UOA in mixed-mode designs. The closing discussion reflects on the trade-off between data quality and the benefits of UOA.
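
To make the filtering idea concrete, the following minimal Python sketch illustrates response-time-based motivational filtering in the spirit of Wise and Kong (2005): responses are classified as solution behavior when their response time reaches an item-specific threshold, and examinees whose proportion of solution behavior (response time effort, RTE) falls below a cutoff are excluded before aggregation. The three-second thresholds, the RTE cutoff of .90, the variable names, and the toy data are illustrative assumptions, not values or procedures prescribed in the chapter.

import numpy as np

def response_time_effort(response_times, thresholds):
    """Proportion of items per examinee answered with solution behavior
    (response time >= item threshold), analogous to the RTE index."""
    rt = np.asarray(response_times, dtype=float)   # shape: (n_examinees, n_items)
    thr = np.asarray(thresholds, dtype=float)      # shape: (n_items,)
    return (rt >= thr).mean(axis=1)

def motivational_filter(scores, response_times, thresholds, rte_cutoff=0.90):
    """Keep only examinees whose response time effort reaches the cutoff.
    The cutoff of .90 is an illustrative assumption, not a recommendation."""
    rte = response_time_effort(response_times, thresholds)
    keep = rte >= rte_cutoff
    return np.asarray(scores, dtype=float)[keep], keep

# Toy data for illustration only (3 test takers, 4 items, times in seconds)
rts = [[12.1, 8.4, 15.0, 9.9],    # consistent solution behavior
       [1.2, 0.8, 1.5, 0.9],      # rapid guessing on all items
       [11.0, 1.1, 14.2, 10.5]]   # rapid guessing on one item
item_thresholds = [3.0, 3.0, 3.0, 3.0]   # assumed fixed 3-second threshold per item
proportion_correct = [0.75, 0.25, 0.50]

filtered_scores, kept = motivational_filter(proportion_correct, rts, item_thresholds)
print(filtered_scores)  # -> [0.75]
print(kept)             # -> [ True False False]

In the unstandardized online setting discussed in the chapter, such a filter would operate on response-time paradata and could be calibrated against a proctored reference sample; the sketch above only shows the basic mechanics.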

Keywords

Education · Panel study · Online testing · Computer-based competence test · Mode effects · Paradata · Test-taking behavior

Schlüsselwörter

Bildung · Panelstudie · Online-Testung · Computerbasierte Kompetenztests · Mode effects · Paradaten · Testbearbeitungsverhalten

Zusammenfassung

Viele großangelegte Assessmentprogramme wie das Nationale Bildungspanel führen neue Testdesigns ein, um die Antwortraten und die Messgenauigkeit zu verbessern. Insbesondere bieten unstandardisierte Online-Assessments (UOA) eine ökonomische Möglichkeit, heterogene Bevölkerungsgruppen zu erreichen, die ansonsten nicht an direkten Testungen teilnehmen würden. Unter Berücksichtigung des Unterschieds zwischen Testauslieferung, Testmodus und Testsetting erweitert dieses Kapitel den theoretischen Hintergrund für den Umgang mit Moduseffekten in der Kompetenztestung des Nationalen Bildungspanels (NEPS; Kroehne und Martens 2011) und diskutiert zwei spezifische Facetten von UOA: (a) die Konfundierung von Selektionseffekten und Effekten des Testsettings und (b) die Rolle des Testbearbeitungsverhaltens als Mediatorvariable. Wir stellen eine Strategie vor, die die Integration von Ergebnissen aus UOA in Ergebnisse computerbasierter Kompetenztestungen ermöglicht und die Idee des Motivationsfilterns verallgemeinert, das für die Behandlung von schnellem Rateverhalten in Low-Stakes-Assessments bekannt ist. Dabei wird insbesondere der Zusammenhang zwischen Paradaten und der Erforschung von Testbearbeitungsverhalten hervorgehoben. Es wird gezeigt, wie eine Referenzstichprobe mit Kompetenztestungen unter standardisierten und überwachten Testbedingungen verwendet werden kann, um die Vergleichbarkeit von UOA in Mixed-Mode-Designs zu verbessern. Die abschließende Diskussion reflektiert den aus dem Vorgehen resultierenden Kompromiss zwischen Datenqualität und den Vorteilen von UOA.

References

  1. Barry, C. L., & Finney, S. J. (2009). Does it matter how data are collected? A comparison of testing conditions and the implications for validity. Research & Practice in Assessment, 3, 1–15.
  2. Bartram, D. (2005). Testing on the internet: Issues, challenges and opportunities in the field of occupational assessment. In D. Bartram & R. K. Hambleton (Eds.), Computer-based testing and the internet (pp. 13–37). Chichester, England: John Wiley & Sons.
  3. Bayazit, A., & Aşkar, P. (2012). Performance and duration differences between online and paper–pencil tests. Asia Pacific Education Review, 13, 219–226.
  4. Beckers, T., Siegers, P., & Kuntz, A. (2011, March). Speeders in online value research. Paper presented at the GOR 11, Düsseldorf, Germany.
  5. Bennett, R. E. (2003). Online assessment and the comparability of score meaning (ETS-RM-03-05). Princeton, NJ: Educational Testing Service.
  6. Bloemers, W., Oud, A., & van Dam, K. (2016). Cheating on unproctored internet intelligence tests: Strategies and effects. Personnel Assessment and Decisions, 2, 21–29.
  7. Bodmann, S. M., & Robinson, D. H. (2004). Speed and performance differences among computer-based and paper-pencil tests. Journal of Educational Computing Research, 31, 51–60.
  8. Bosnjak, M., & Tuten, T. L. (2001). Classifying response behaviors in web-based surveys. Journal of Computer-Mediated Communication, 6(3).
  9. Buerger, S., Kroehne, U., & Goldhammer, F. (2016). The transition to computer-based testing in large-scale assessments: Investigating (partial) measurement invariance between modes. Psychological Test and Assessment Modeling, 58, 597–616.
  10. Callegaro, M. (2010). Do you know which device your respondent has used to take your online survey? Survey Practice, 3, 1–12.
  11. Couper, M. P., & Peterson, G. J. (2017). Why do web surveys take longer on smartphones? Social Science Computer Review, 35, 357–377.
  12. Csapó, B., Molnár, G., & Nagy, J. (2014). Computer-based assessment of school readiness and early reasoning. Journal of Educational Psychology, 106, 639–650.
  13. de Leeuw, E., Hox, J., & Scherpenzeel, A. (2011). Mode effect or question wording? Measurement error in mixed mode surveys. In Proceedings of the Survey Research Methods Section, American Statistical Association (pp. 5959–5967). Alexandria, VA: American Statistical Association.
  14. Diedenhofen, B., & Musch, J. (2017). PageFocus: Using paradata to detect and prevent cheating on online achievement tests. Behavior Research Methods, 49, 1444–1459.
  15. Dillman, D. A. (2000). Mail and internet surveys: The total design method. New York, NY: Wiley.
  16. Dirk, J., Kratzsch, G. K., Prindle, J. P., Kroehne, U., Goldhammer, F., & Schmiedek, F. (2017). Paper-based assessment of the effects of aging on response time: A diffusion model analysis. Journal of Intelligence, 5, 12.
  17. Finn, B. (2015). Measuring motivation in low-stakes assessments (Research Report No. RR-15-19). Princeton, NJ: Educational Testing Service.
  18. Frein, S. T. (2011). Comparing in-class and out-of-class computer-based tests to traditional paper-and-pencil tests in introductory psychology courses. Teaching of Psychology, 38, 282–287.
  19. Fricker, S. (2005). An experimental comparison of web and telephone surveys. Public Opinion Quarterly, 69, 370–392.
  20. Glas, C. A., & Meijer, R. R. (2003). A Bayesian approach to person fit analysis in item response theory models. Applied Psychological Measurement, 27, 217–233.
  21. Gnambs, T., & Kaspar, K. (2015). Disclosure of sensitive behaviors across self-administered survey modes: A meta-analysis. Behavior Research Methods, 47, 1237–1259.
  22. Goegebeur, Y., De Boeck, P., & Molenberghs, G. (2010). Person fit for test speededness: Normal curvatures, likelihood ratio tests and empirical Bayes estimates. Methodology, 6, 3–16.
  23. Goldhammer, F. (2015). Measuring ability, speed, or both? Challenges, psychometric solutions, and what can be gained from experimental control. Measurement: Interdisciplinary Research and Perspectives, 13, 133–164.
  24. Goldhammer, F., Lüdtke, O., Martens, T., & Christoph, G. (2016). Test-taking engagement in PIAAC (OECD Education Working Papers No. 133). Paris, France: OECD Publishing.
  25. Goldhammer, F., Naumann, J., Rölke, H., Stelter, A., & Tóth, K. (2017). Relating product data to process data from computer-based competency assessment. In D. Leutner, J. Fleischer, J. Grünkorn, & E. Klieme (Eds.), Competence assessment in education: Research, models and instruments (pp. 407–425). Cham, Switzerland: Springer.
  26. Guo, H., Rios, J. A., Haberman, S., Liu, O. L., Wang, J., & Paek, I. (2016). A new procedure for detection of students’ rapid guessing responses using response time. Applied Measurement in Education, 29, 173–183.
  27. Heine, J.-H., Mang, J., Borchert, L., Gomolka, J., Kroehne, U., Goldhammer, F., & Sälzer, C. (2016). Kompetenzmessung in PISA 2015. In K. Reiss, C. Sälzer, A. Schiepe-Tiska, E. Klieme, & O. Köller (Eds.), PISA 2015: Eine Studie zwischen Kontinuität und Innovation (pp. 383–540). Münster, Germany: Waxmann.
  28. Hox, J. J., De Leeuw, E. D., & Zijlmans, E. A. O. (2015). Measurement equivalence in mixed mode surveys. Frontiers in Psychology, 6, 1–11.
  29. Huff, K. C. (2015). The comparison of mobile devices to computers for web-based assessments. Computers in Human Behavior, 49, 208–212.
  30. Illingworth, A. J., Morelli, N. A., Scott, J. C., & Boyd, S. L. (2015). Internet-based, unproctored assessments on mobile and non-mobile devices: Usage, measurement equivalence, and outcomes. Journal of Business and Psychology, 30, 325–343.
  31. International Test Commission (2006). International guidelines on computer-based and internet-delivered testing. International Journal of Testing, 6, 143–171.
  32. Jäckle, A., Roberts, C., & Lynn, P. (2010). Assessing the effect of data collection mode on measurement. International Statistical Review, 78, 3–20.
  33. Johnston, M. M. (2016). Applying solution behavior thresholds to a noncognitive measure to identify rapid responders: An empirical investigation (PhD thesis). James Madison University, Harrisonburg, VA.
  34. Kim, Y., Dykema, J., Stevenson, J., Black, P., & Moberg, D. P. (2018). Straightlining: Overview of measurement, comparison of indicators, and effects in mail–web mixed-mode surveys. Social Science Computer Review, 29, 208–220.
  35. King, D. D., Ryan, A. M., Kantrowitz, T., Grelle, D., & Dainis, A. (2015). Mobile internet testing: An analysis of equivalence, individual differences, and reactions. International Journal of Selection and Assessment, 23, 382–394.
  36. Kitchin, R., & McArdle, G. (2016). What makes big data, big data? Exploring the ontological characteristics of 26 datasets. Big Data & Society, 3, 1–10.
  37. Klausch, T., Hox, J. J., & Schouten, B. (2013a). Measurement effects of survey mode on the equivalence of attitudinal rating scale questions. Sociological Methods & Research, 42, 227–263.
  38. Klausch, T., Hox, J. J., & Schouten, B. (2013b). Assessing the mode-dependency of sample selectivity across the survey response process (Discussion Paper 2013-03). The Hague, Netherlands: Statistics Netherlands. Available from https://www.cbs.nl/-/media/imported/documents/2013/12/2013-03-x10-pub.pdf
  39. Köhler, C., Pohl, S., & Carstensen, C. H. (2014). Taking the missing propensity into account when estimating competence scores: Evaluation of item response theory models for nonignorable omissions. Educational and Psychological Measurement, 75, 1–25.
  40. Könen, T., Dirk, J., & Schmiedek, F. (2015). Cognitive benefits of last night’s sleep: Daily variations in children’s sleep behavior are related to working memory fluctuations. Journal of Child Psychology and Psychiatry, 56, 171–182.
  41. Kong, X. J., Wise, S. L., & Bhola, D. S. (2007). Setting the response time threshold parameter to differentiate solution behavior from rapid-guessing behavior. Educational and Psychological Measurement, 67, 606–619.
  42. Kraus, R., Stricker, G., & Speyer, C. (Eds.). (2010). Online counseling: A handbook for mental health professionals (Practical resources for the mental health professional). Boston, MA: Academic Press.
  43. Kreuter, F. (Ed.). (2013). Improving surveys with paradata: Analytic uses of process information. Hoboken, NJ: Wiley & Sons.
  44. Kroehne, U., Hahnel, C., & Goldhammer, F. (2018, April). Invariance of the response process between modes and gender in reading assessment. Paper presented at the annual meeting of the National Council on Measurement in Education, New York, NY.
  45. Kroehne, U., & Martens, T. (2011). Computer-based competence tests in the national educational panel study: The challenge of mode effects. Zeitschrift für Erziehungswissenschaft, 14, 169–186.
  46. Kroehne, U., Roelke, H., Kuger, S., Goldhammer, F., & Klieme, E. (2016, April). Theoretical framework for log-data in technology-based assessments with empirical applications from PISA. Paper presented at the annual meeting of the National Council on Measurement in Education, Washington, DC.
  47. Lau, A. R., Swerdzewski, P. J., Jones, A. T., Anderson, R. D., & Markle, R. E. (2009). Proctors matter: Strategies for increasing examinee effort on general education program assessments. The Journal of General Education, 58, 196–217.
  48. Lee, Y.-H., & Jia, Y. (2014). Using response time to investigate students’ test-taking behaviors in a NAEP computer-based study. Large-scale Assessments in Education, 2, 8.
  49. Lievens, F., & Burke, E. (2011). Dealing with the threats inherent in unproctored Internet testing of cognitive ability: Results from a large-scale operational test program. Journal of Occupational and Organizational Psychology, 84, 817–824.
  50. Liu, O. L., Rios, J. A., & Borden, V. (2015). The effects of motivational instruction on college students’ performance on low-stakes assessment. Educational Assessment, 20, 79–94.
  51. Maddox, B. (2017). Talk and gesture as process data. Measurement: Interdisciplinary Research and Perspectives, 15, 113–127.
  52. McClain, C. A., Couper, M. P., Hupp, A. L., Keusch, F., Peterson, G., Piskorowski, A. D., & West, B. T. (2018). A typology of web survey paradata for assessing total survey error. Social Science Computer Review. Advance online publication.
  53. Pajkossy, P., Simor, P., Szendi, I., & Racsmány, M. (2015). Hungarian validation of the Penn State Worry Questionnaire (PSWQ): Method effects and comparison of paper-pencil versus online administration. European Journal of Psychological Assessment, 31, 159–165.
  54. Preckel, F., & Thiemann, H. (2003). Online- versus paper-pencil version of a high potential intelligence test. Swiss Journal of Psychology, 62, 131–138.
  55. Reips, U.-D. (2000). The Web experiment method: Advantages, disadvantages, and solutions. In M. H. Birnbaum (Ed.), Psychological experiments on the Internet (pp. 89–118). San Diego, CA: Academic Press.
  56. Rios, J. A., Guo, H., Mao, L., & Liu, O. L. (2017). Evaluating the impact of careless responding on aggregated-scores: To filter unmotivated examinees or not? International Journal of Testing, 17, 74–104.
  57. Rios, J. A., & Liu, O. L. (2017). Online proctored versus unproctored low-stakes internet test administration: Is there differential test-taking behavior and performance? American Journal of Distance Education, 31, 226–241.
  58. Rios, J. A., Liu, O. L., & Bridgeman, B. (2014). Identifying low-effort examinees on student learning outcomes assessment: A comparison of two approaches. New Directions for Institutional Research, 161, 69–82.
  59. Robling, M. R., Ingledew, D. K., Greene, G., Sayers, A., Shaw, C., Sander, L., Russell, I. T., Williams, J. G., & Hood, K. (2010). Applying an extended theoretical framework for data collection mode to health services research. BMC Health Services Research, 10, 180.
  60. Rölke, H. (2012). The ItemBuilder: A graphical authoring system for complex item development. In T. Bastiaens & G. Marks (Eds.), Proceedings of World Conference on E-Learning in Corporate, Government, Healthcare, and Higher Education (pp. 344–353). Chesapeake, VA: AACE. Retrieved from http://www.editlib.org/p/41614
  61. Russell, L. B., & Hubley, A. M. (2017). Some thoughts on gathering response processes validity evidence in the context of online measurement and the digital revolution. In B. D. Zumbo & A. M. Hubley (Eds.), Understanding and investigating response processes in validation research (pp. 229–249). Cham, Switzerland: Springer.
  62. Schnipke, D. L., & Scrams, D. J. (1997). Modeling item response times with a two-state mixture model: A new method of measuring speededness. Journal of Educational Measurement, 34, 213–232.
  63. Schouten, B., Cobben, F., & Bethlehem, J. (2009). Indicators for the representativeness of survey response. Survey Methodology, 35, 101–113.
  64. Sendelbah, A., Vehovar, V., Slavec, A., & Petrovčič, A. (2016). Investigating respondent multi-tasking in web surveys using paradata. Computers in Human Behavior, 55, 777–787.
  65. Shlomo, N., Skinner, C., & Schouten, B. (2012). Estimation of an indicator of the representativeness of survey response. Journal of Statistical Planning and Inference, 142, 201–211.
  66. Sinharay, S. (2015). Assessment of person fit for mixed-format tests. Journal of Educational and Behavioral Statistics, 40, 343–365.
  67. Sinharay, S., Wan, P., Choi, S. W., & Kim, D.-I. (2015). Assessing individual-level impact of interruptions during online testing. Journal of Educational Measurement, 52, 80–105.
  68. Sinharay, S., Wan, P., Whitaker, M., Kim, D.-I., Zhang, L., & Choi, S. W. (2014). Determining the overall impact of interruptions during online testing. Journal of Educational Measurement, 51, 419–440.
  69. Steger, D., Schroeders, U., & Gnambs, T. (in press). A meta-analysis of test scores in proctored and unproctored ability assessments. European Journal of Psychological Assessment.
  70. Stieger, S., & Reips, U.-D. (2010). What are participants doing while filling in an online questionnaire: A paradata collection tool and an empirical study. Computers in Human Behavior, 26, 1488–1495.
  71. Sun, N., Rau, P. P.-L., & Ma, L. (2014). Understanding lurkers in online communities: A literature review. Computers in Human Behavior, 38, 110–117.
  72. Vannieuwenhuyze, J., Loosveldt, G., & Molenberghs, G. (2011). A method for evaluating mode effects in mixed-mode surveys. Public Opinion Quarterly, 74, 1027–1045.
  73. Viswesvaran, C., & Ones, D. S. (1999). Meta-analyses of fakability estimates: Implications for personality measurement. Educational and Psychological Measurement, 59, 197–210.
  74. Wise, S. L. (2015). Effort analysis: Individual score validation of achievement test data. Applied Measurement in Education, 28, 237–252.
  75. Wise, S. L., & DeMars, C. E. (2005). Low examinee effort in low-stakes assessment: Problems and potential solutions. Educational Assessment, 10, 1–17.
  76. Wise, S. L., & DeMars, C. E. (2006). An application of item response time: The effort-moderated IRT model. Journal of Educational Measurement, 43(1), 19–38.
  77. Wise, S. L., Kingsbury, G. G., Thomason, J., & Kong, X. (2004, April). An investigation of motivation filtering in a statewide achievement testing program. Paper presented at the annual meeting of the National Council on Measurement in Education, San Diego, CA.
  78. Wise, S. L., & Kong, X. (2005). Response time effort: A new measure of examinee motivation in computer-based tests. Applied Measurement in Education, 18(2), 163–183.
  79. Wise, S. L., & Ma, L. (2012, May). Setting response time thresholds for a CAT item pool: The normative threshold method. Paper presented at the annual meeting of the National Council on Measurement in Education, Vancouver, Canada.
  80. Wise, V., Wise, S., & Bhola, D. (2006). The generalizability of motivation filtering in improving test score validity. Educational Assessment, 11, 65–83.

Copyright information

© Springer Fachmedien Wiesbaden GmbH, ein Teil von Springer Nature 2019

Authors and Affiliations

  1. German Institute for International Educational Research (DIPF), Frankfurt am Main, Germany
  2. Leibniz Institute for Educational Trajectories, Bamberg, Germany; Johannes Kepler University Linz, Linz, Austria
