Critically Understanding Impact Evaluations: Technical, Methodological, Organizational, and Political Issues

  • D. Brent Edwards Jr.


It is crucial that scholars, as well as policymakers and practitioners, have a critical understanding of the conceptual and technical limitations of impact evaluations, and of the ways that they are necessarily affected by organizational and political dynamics. To that end, this chapter directs its attention both to their most common form, regression analysis, and to the form widely seen as more robust, the randomized control trial (RCT). The methodological assumptions of each are discussed first in conceptual terms before moving on to a review of more technical issues. The final section of the chapter considers how the production of impact evaluations is affected by organizational and political dynamics. In all, this chapter advocates a critical understanding of impact evaluations in five senses: conceptual, technical, contextual, organizational, and political.


Knowledge production · Impact evaluation · Regression analysis · Randomized control trial · RCT · Political economy · Critical review



Copyright information

© The Author(s) 2018

Authors and Affiliations

  • D. Brent Edwards Jr.
  1. University of Hawaii at Manoa, Honolulu, USA
