This paper examines commonly applied methods of data analysis. Given these methods, the main issue concerns the plausibility of the studies’ end products, that is, their conclusions. I argue that the methods chosen often lead to unwarranted conclusions: the data analyses tend to produce the looked-for null rejections even when the null is much more plausible on prior grounds. Two aspects of the data analyses applied cause obvious problems. First, researchers tend to dismiss “preliminary” findings when those findings contradict the expected outcome of the research question (the “screen-picking” issue). Second, researchers rarely acknowledge that small p-values should be expected when the number of observations runs into the tens of thousands (the “large N” issue). This obviously increases the chance of a null rejection even if the null hypothesis holds for all practical purposes. The discussion elaborates on these two aspects to explain why researchers generally avoid trying to mitigate false positives via supplementary data analyses. In particular, for no apparent good reason, most research studiously avoids the use of hold-out samples. An additional topic in this paper concerns the dysfunctional consequences of the standard (“A-journal”) publication process, which tends to buttress the use of research methods prone to false or unwarranted null rejections.
This paper extends my previous work titled “Accounting Research and Common Sense” (Ohlson, 2015). Both papers raise similar concerns about inherent problems in empirical research. For other papers in a similar vein focusing on the accounting literature, see Kim et al. (2018), Dyckman (2016), and Dyckman and Zeff (2015).
The claim that the academic literature publishes papers with very high rates of FPs became prominent in 2005, when the Stanford statistician Ioannidis published his well-known (even celebrated) paper “Why Most Published Research Findings Are False.” In the title, the word “most” means what it says: more than half. What makes the paper particularly interesting is that it highlights the importance of the prior probability of a hypothesis being true and the role of bias in the research process. A best-selling book by Silver (2012) discusses the work of Ioannidis and others related to the suspected high incidence of false research findings. Silver further argues that “big data” will make things even worse, a matter clearly relevant in the case of accounting research.
Ioannidis (2005) raises points that seem to bear directly on accounting research: “Several methodologists have pointed out that the high rate of non-replication (lack of confirmation) of research discoveries is a consequence of the convenient, yet ill-founded, strategy of claiming conclusive research findings solely on the basis of a single study assessed by formal statistical significance, typically less than 0.05. Research is not most appropriately represented and summarized by p-values.”
Even more striking: “The greater the flexibility in designs, definitions, outcomes, and analytical models in a scientific field, the less likely the research findings are to be true.”
If one takes these statements seriously, combined with Ioannidis’s claim that more than half of all studies in the sciences are false, then one can reasonably claim that most articles in accounting A-journals fall into the category of “best characterized as belonging to a branch of creative writing somewhat connected to real world data.” In other words, accounting research is better viewed as a rather frivolous activity than as a serious academic discipline. That being said, there are arguments against this dystopian perspective. Social science methods can never formalize true versus false hypotheses in a rigorous sense; it is therefore not meaningful to compare a social science to the traditional sciences. Social science empirics can only provide evidence in the spirit of “it is not totally unreasonable to claim that the evidence supports that Y relates to X positively.” It is implicitly understood that “there may well be data analyses that run counter to this claim, but it is beyond the scope of the current research to consider such possibilities.”
It has also been suggested that the literature publishes an excess of false findings. See, for example, Powell et al. (2009), Lindley (2013), Harvey et al. (2016), Harvey (2017) (a presidential address), Moosa (2017), Chordia et al. (2017), and Hou et al. (2020). All of these papers underscore the high likelihood that the finance literature comprises an uncomfortable (indeed unacceptable) incidence of false positives (FPs). As for the field of economics, it too has a long history of addressing the validity (or lack thereof) of empirical research. A recent paper by Brodeur et al. (2016) shows how empirical analysis of the statistics reported in published articles supports the claim that the literature’s null rejections depend on screen-picking; specifically, there is a material “shortage” of p-values in the 10–25% range. That paper also provides extensive references.
As Leone et al. (2019) document, most papers now try winsorization or trimming schemes as a matter of routine. Results will generally depend on the details, a not unhelpful feature when a researcher pursues a table with the “right results.”
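The dependence on the details can be illustrated with a minimal sketch (entirely simulated data; the sample and the 1% cutoffs are hypothetical): winsorizing and trimming the same sample at the same percentile yield materially different summary statistics, leaving room to choose whichever produces the “right results.”

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated sample: 995 well-behaved observations plus a few extreme outliers.
x = np.concatenate([rng.normal(0.0, 1.0, 995),
                    np.array([60.0, 80.0, 90.0, 100.0, -5.0])])

def winsorize(a, p):
    """Replace values beyond the p-th / (1-p)-th percentiles with the cutoffs."""
    lo, hi = np.percentile(a, [100 * p, 100 * (1 - p)])
    return np.clip(a, lo, hi)

def trim(a, p):
    """Drop values beyond the p-th / (1-p)-th percentiles altogether."""
    lo, hi = np.percentile(a, [100 * p, 100 * (1 - p)])
    return a[(a >= lo) & (a <= hi)]

# The "details" matter: the same sample yields materially different means.
print(np.mean(x))                   # raw, dominated by the outliers
print(np.mean(winsorize(x, 0.01)))  # winsorized at the 1% level
print(np.mean(trim(x, 0.01)))       # trimmed at the 1% level
```

The same sensitivity carries over to regression coefficients and t-statistics, which is precisely what makes the choice of scheme a degree of freedom.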
From what I can judge, the profession at large became aware of screen-picking as an explicit “tool” during the 1980s. (I recall extensive discussions of a well-known JAE paper published in 1989: the results presented seemed too good to be true, an assessment that came to be widely shared as time passed.)
Uninhibited screen-picking can drift into unambiguously unethical behavior using input screens. Specifically, one can screen the data such that observations that run counter to the hypothesis get deleted. These deletions can then be justified, especially if they are comingled with deletions of observations whose effects on the hypothesis are neutral. This outcome becomes quite natural once a researcher asks the questions “Why do I not get what I want?” and “What is going wrong?” A close look at the data will then inform the researcher that certain observations, perhaps just a few, run strongly against the desired outcome.
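A minimal simulation makes the mechanics concrete (hypothetical data; this does not depict any actual study’s procedure). The null holds exactly, yet deleting the handful of observations that pull hardest against the desired result inflates the t-statistic substantially:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
y = rng.normal(size=n)   # true relation: none; the null holds exactly

def slope_t(x, y):
    """OLS slope and its t-statistic for a univariate regression with intercept."""
    xc = x - x.mean()
    b = (xc @ y) / (xc @ xc)
    resid = y - y.mean() - b * xc
    se = np.sqrt((resid @ resid) / (len(x) - 2) / (xc @ xc))
    return b, b / se

t_full = slope_t(x, y)[1]

# The "input screen": delete the 25 observations whose contribution pulls
# the estimated slope down the hardest.
contrib = (x - x.mean()) * (y - y.mean())
keep = np.argsort(contrib)[25:]
t_screened = slope_t(x[keep], y[keep])[1]
print(t_full, t_screened)   # the screened t-statistic is markedly larger
```

In practice such deletions would be rationalized as “data cleaning,” especially when commingled with deletions of neutral observations.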
How serious are accounting academics about producing research of integrity? Not very, if one considers the following. Many colleagues have suggested that it is inappropriate to challenge a presenter by pointing out that certain simple data analyses would most likely negate the claimed positive answer to the RQ. Why worry about the validity of conclusions presented? Will life not go on no matter what?
To be sure, there are good reasons why the academic community is disinclined to reprove authors who have published articles with false findings. Most important among them is that research is an intrinsically difficult activity, and researchers should be encouraged to take risks.
The Jeffreys–Lindley paradox effectively states that, under certain conditions, as N becomes large, a rejection of the null becomes a sure thing even though a rational Bayesian would conclude otherwise (see Spanos (2013) and the references in that paper).
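The large-N mechanics can be sketched with a short simulation (simulated data; the 1%-of-a-standard-deviation effect size is an illustrative assumption). An effect that is negligible for all practical purposes produces an overwhelming t-statistic once N is large enough, because the t-statistic grows roughly like the effect times the square root of N:

```python
import numpy as np

rng = np.random.default_rng(42)

def mean_t(n, effect):
    """t-statistic for H0: mean = 0 when the true mean equals `effect`."""
    x = rng.normal(effect, 1.0, n)
    return x.mean() / (x.std(ddof=1) / np.sqrt(n))

# An effect of 1% of a standard deviation, negligible in practical terms:
ts = {n: mean_t(n, 0.01) for n in (1_000, 100_000, 10_000_000)}
for n, t in ts.items():
    print(f"N = {n:>10,}   t = {t:6.1f}")
```

A rational Bayesian weighing a sharp null against a diffuse alternative would not treat these mechanically small p-values as decisive, which is the substance of the paradox.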
A Fama–MacBeth test need not be implemented in all of its details. On a basic level, if a regression is run for each of 25 years and the estimated coefficient of interest has the correct sign in 23 of the 25 years, then it seems reasonable to reject the null. However, judgment may be unavoidable. Compare this approach to a pooled-over-years fixed-effects approach when N equals, say, 20,000: it is now difficult to tell what minimum magnitude of the t-statistic should convince the reader that the null can comfortably be rejected.
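The sign-count logic can be sketched as a simple binomial computation (assuming, for illustration, independent years and a fair-coin null on the sign):

```python
from math import comb

def sign_count_p(correct, total):
    """One-sided p-value: probability of at least `correct` agreeing signs
    out of `total` years when each sign is a fair coin flip under the null."""
    return sum(comb(total, k) for k in range(correct, total + 1)) / 2 ** total

# 23 of 25 yearly coefficient signs agree with the hypothesis:
p = sign_count_p(23, 25)
print(p)   # roughly 1e-5, so the null can comfortably be rejected
```

Unlike a pooled t-statistic, this p-value does not balloon mechanically with the number of observations per year, only with the consistency of the yearly results.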
Ordinary least squares (OLS) regressions are central in accounting research conventions. The method originally achieved its central status because of the limits of computers. Nowadays, of course, no such reason for applying OLS remains. From a strict intellectual perspective, there are no merits associated with OLS. In fact, its application seems bizarre insofar as OLS assigns a material role to outliers while, at the same time, most research winsorizes the outliers; in other words, outliers are replaced by less extreme values that are disconnected from the real world. The staying power of OLS, I would argue, stems from its overstated t-statistics, which help when researchers try to get the “right result” via screen-picking. Injecting some irony (or humor, perhaps), the published tables by convention express the t-statistics (or standard errors) using no fewer than four digits. To add insult to injury, a critique of OLS t-statistics includes the observation that they become even more upward biased if one considers, realistically, that the independent variables are random rather than “predetermined.” (I am never quite sure whether “preordained” would be a better word in this context.) There is also the issue that the t-statistics depend on N, whereas the real world does not. For a succinct critique of classical statistics, see Silver’s book, pp. 251–61. In the finance field, Harvey (2017) discusses the issues and provides an extensive treatment of the intrinsic problems of classical statistics.
Many papers posit regressions where the dependent variable is a ratio, such as the market-to-book (M/B) ratio. (These M/B settings are often referred to as Tobin’s Q-ratios, and they purport to explain market values even though they do not.) Now, it is often the case that the variable of substantive interest on the left-hand side of the regression is M, while B merely serves to scale M to avoid severe cross-sectional heteroscedasticity, and so forth. In such a case, one obtains the inferred variable by multiplying both sides of Eqs. (1) and (2) by B. Thus, the inferred value using Eq. (1) now equals B*[a*X + b*Z], where (a, b) are set to their estimated values. It then becomes a trivial matter to ask whether, as an empirical matter, X helps beyond Z to explain the dependent variable of real interest, M. The procedure can also be applied in the case of logit regressions. I do not recall ever having seen a paper implement this procedure or anything similar. However, in private conversations, researchers have indicated that in most cases the procedure would most likely yield “undesirable results”; this may explain its complete absence from the literature.
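The multiply-back procedure can be sketched as follows (fully simulated data; the names M, B, X, Z follow the text, and the data-generating numbers are illustrative assumptions): fit the ratio regression, multiply the fitted values back by B, and judge the fit in the space of M, the variable of substantive interest.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2_000
B = np.exp(rng.normal(0.0, 0.5, n))   # positive scale variable (book value)
X = rng.normal(size=n)
Z = rng.normal(size=n)
M = B * (0.5 * Z + 0.02 * X + rng.normal(size=n))  # X matters very little

def fitted(y, cols):
    """Fitted values from an OLS regression of y on the given columns."""
    A = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ coef

ones = np.ones(n)
ratio = M / B
# Fit in ratio space, multiply the fitted values back by B, and measure
# the errors in the space of M.
err_with_x = M - B * fitted(ratio, [ones, X, Z])
err_without_x = M - B * fitted(ratio, [ones, Z])

rmse = lambda e: float(np.sqrt(np.mean(e ** 2)))
print(rmse(err_with_x), rmse(err_without_x))   # nearly identical
```

In this sketch X improves the explanation of M only trivially, which is the kind of “undesirable result” the text suggests the procedure would often produce.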
More generally, in settings where N is very large—for instance, more than 25,000—the power of the test is reduced by splitting the data into, say, 10 bins. Regressions can be estimated for the data in each bin, and one can check whether the coefficient of interest is of the correct sign in most bins. This procedure will inform the reader about the RQ at hand, and it is now much tougher to reject the null. Note that this procedure is never applied (from what I can tell). It illustrates the importance that researchers attach to the imperative of rejecting the null, regardless of whether it is true.
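A sketch of the binning procedure (simulated data; the N of 30,000, the 10 bins, and the small true effect are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30_000
x = rng.normal(size=n)
y = 0.02 * x + rng.normal(size=n)   # small but nonzero true effect

def slope_t(x, y):
    """OLS slope t-statistic for a univariate regression with intercept."""
    xc = x - x.mean()
    b = (xc @ y) / (xc @ xc)
    resid = y - y.mean() - b * xc
    se = np.sqrt((resid @ resid) / (len(x) - 2) / (xc @ xc))
    return b / se

print("pooled t:", slope_t(x, y))   # large N makes a rejection easy

# The binning alternative: estimate the slope separately in 10 random bins
# and ask how consistently its sign comes out positive.
bins = np.array_split(rng.permutation(n), 10)
pos = sum(slope_t(x[i], y[i]) > 0 for i in bins)
print(pos, "of", len(bins), "bins show a positive slope")
```

Typically the pooled t-statistic clears the conventional threshold while the sign count across bins falls short of what a binomial test would demand, so the binned procedure makes rejecting the null materially harder.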
Most people would agree that researchers spend considerable time guessing what the judgments of potential reviewers will be. An overwhelming majority of authors avoid any action that might increase the chances of a negative report, no matter how small the perceived change in that probability. In addition, as far as I can tell, the whole idea of not being 100% compliant is simply out of the question in second rounds.
To reduce the incidence of FPs, one might expect that, if the reviewers fall short, members of the community at large would step in and attempt to rectify erroneous papers that passed the publication hurdles. This, however, does not seem to be the case. Editorial policies effectively rule out publishing a “commentary” claiming to show that a prior paper is in error or that its conclusions are dubious. (Interestingly enough, those policies were not in place four decades ago.) The incentives to write such a commentary, let alone an entire paper on the subject, appear to be minuscule, because doing so is likely to entail serious professional hazards. Nonetheless, rumors constantly float that this or that paper cannot be replicated. This is an unhealthy situation, because the policy in place projects the impression that it protects the more important people in the system.
Brodeur, A., M. Lé, M. Sangnier, and Y. Zylberberg. 2016. Star Wars: The empirics strike back. American Economic Journal: Applied Economics 8 (1): 1–32.
Chordia, T., A. Goyal, and A. Saretto. 2017. Anomalies and false rejections. Swiss Finance Institute Research Paper.
Dyckman, T.R. 2016. Significance testing: We can do better. Abacus 52 (2): 319–342.
Dyckman, T.R., and S.A. Zeff. 2014. Some methodological deficiencies in empirical research articles in accounting. Accounting Horizons 28 (3): 695–712.
Dyckman, T.R., and S.A. Zeff. 2015. Accounting research: Past, present, and future. Abacus 51 (4): 511–524.
Harvey, C.R. 2017. Presidential address: The scientific outlook in financial economics. The Journal of Finance 72 (4): 1399–1440.
Harvey, C.R., Y. Liu, and H. Zhu. 2016. … and the cross-section of expected returns. The Review of Financial Studies 29 (1): 5–68.
Hou, K., C. Xue, and L. Zhang. 2020. Replicating anomalies. The Review of Financial Studies 33 (5): 2019–2133.
Ioannidis, J.P.A. 2005. Why most published research findings are false. PLoS Medicine 2 (8): e124.
Kim, J.H., K. Ahmed, and P.I. Ji. 2018. Significance testing in accounting research: A critical evaluation based on evidence. Abacus 54 (4): 524–546.
Leamer, E. 1978. Specification searches: Ad hoc inference with nonexperimental data. Wiley Series in Probability and Mathematical Statistics. New York: Wiley.
Leone, A.J., M. Minutti-Meza, and C.E. Wasley. 2019. Influential observations and inference in accounting research. The Accounting Review 94 (6): 337–364.
Lindley, D.V. 2013. Understanding uncertainty. Wiley Series in Probability and Statistics. New York: Wiley.
Moosa, I.A. 2017. Econometrics as a con art: Exposing the limitations and abuses of econometrics. Cheltenham: Edward Elgar.
Ohlson, J.A. 2015. Accounting research and common sense. Abacus 51 (4): 525–535.
Powell, J., J. Shi, T. Smith, and R. Whaley. 2009. Common divisors, payout persistence, and return predictability. International Review of Finance 9 (4): 335–357.
Silver, N. 2012. The signal and the noise: Why so many predictions fail but some don’t. New York: Penguin Press.
Spanos, A. 2013. Who should be afraid of the Jeffreys-Lindley paradox? Philosophy of Science 80 (1): 73–93.
Ohlson, J.A. 2021. Researchers’ data analysis choices: An excess of false positives? Review of Accounting Studies. https://doi.org/10.1007/s11142-021-09620-w
Keywords: Data analysis; False positives; Publication process