Fair Statistical Communication in HCI

  • Pierre Dragicevic
Chapter
Part of the Human–Computer Interaction Series (HCIS)

Abstract

Statistics are tools that help end users accomplish their tasks. In research, to qualify as usable, statistical tools should help researchers advance scientific knowledge by supporting and promoting the effective communication of research findings. Yet areas such as human-computer interaction (HCI) have adopted tools, namely p-values and dichotomous testing procedures, that have proven to be poor at supporting these tasks. The misuse of these procedures has been severely criticized in a range of disciplines for several decades, suggesting that the tools, not their users, are to blame. This chapter explains in a non-technical manner why HCI would benefit from switching to an estimation approach, i.e., reporting informative charts with effect sizes and interval estimates, and offering nuanced interpretations of our results. Advice is offered on how to communicate our empirical results in a clear, accurate, and transparent way without using any tests or p-values.
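To make the estimation approach concrete, below is a minimal sketch (not from the chapter) of the kind of analysis it advocates: computing an effect size, here the difference in mean task completion times between two techniques, together with a percentile bootstrap 95% confidence interval, instead of a p-value. All data values and variable names are invented for illustration, and Python is used only as one possible vehicle.

    # Illustrative sketch of the estimation approach: report an effect size
    # with an interval estimate rather than a significance verdict.
    # The completion times below are hypothetical, not from the chapter.
    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical task completion times (seconds) for two techniques.
    technique_a = np.array([12.1, 9.8, 11.4, 10.2, 13.0, 9.5, 11.9, 10.7])
    technique_b = np.array([10.3, 8.9, 9.6, 9.1, 11.2, 8.4, 10.1, 9.0])

    # Effect size: difference between mean completion times (A minus B).
    observed_diff = technique_a.mean() - technique_b.mean()

    # Percentile bootstrap: resample each group with replacement many times
    # and recompute the mean difference to capture sampling variability.
    n_boot = 10_000
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        a = rng.choice(technique_a, size=technique_a.size, replace=True)
        b = rng.choice(technique_b, size=technique_b.size, replace=True)
        diffs[i] = a.mean() - b.mean()

    ci_low, ci_high = np.percentile(diffs, [2.5, 97.5])
    print(f"Mean difference: {observed_diff:.2f} s, "
          f"95% CI [{ci_low:.2f}, {ci_high:.2f}]")

The resulting point estimate and interval would then typically be shown graphically, for example as a dot with an error bar, and interpreted in a nuanced way rather than reduced to a significant/non-significant dichotomy.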

Keywords

Confidence interval · Credible interval · Interval estimate · Sampling variability · Usability problem

Acknowledgments

Many thanks to Elie Cattan, Fanny Chevalier, Geoff Cumming, Steven Franconeri, Steve Haroz, Petra Isenberg, Yvonne Jansen, Maurits Kaptein, Heidi Lam, Judy Robertson, Michael Sedlmair, Dan Simons, Chat Wacharamanotham and Wesley Willett for their helpful feedback and comments.


Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. Orsay Cedex, France
