Abstract
Machine learning offers empirical methods to sift through accounting datasets with a large number of variables and limited a priori knowledge about functional forms. In this study, we show that these methods help detect and interpret patterns present in ongoing accounting misstatements. We use a wide set of variables from accounting, capital markets, governance, and auditing datasets to detect material misstatements. A primary insight of our analysis is that accounting variables, while they do not detect misstatements well on their own, become important with suitable interactions with audit and market variables. We also analyze differences between misstatements and irregularities, compare algorithms, examine one-year- and two-year-ahead predictions and interpret groups at greater risk of misstatements.
Notes
Another possible solution is to fit the model using a rolling window or exclude observations from firms used to build the model. However, both of these choices severely restrict the sample effectively used for cross-validation.
In a previous version of the manuscript, we focused on all restatements, including restatements not reported in 8-K; however, many of these restatements need not include large events. We thank Andy Imdieke for this suggestion.
Audit Analytics provides the restated amount for each year only for the five most recent years impacted by the restatement. However, firms’ restatements can often impact more than five years of financial data. The impact on accounting numbers prior to the most recent five years is usually reported as a cumulative charge to retained earnings, and, in practice, firms need not retrospectively adjust all prior years. To account for this, we assume that the cumulative effect to retained earnings is distributed evenly across the misstatement span identified in the restatement filing. If the span is missing, we allocate the unexplained cumulative change to the year prior to the last year with an income effect.
RUSBoost refers to random undersampling with boosting, such that balanced samples are constructed by randomly drawing from the sample. With a heavily imbalanced dataset, however, nonrandom undersampling may perform better than random sampling. In untabulated results, we used the sampling method of Perols et al. (2016), but it did not perform better than RUSBoost in our dataset. Under this alternate sampling method, the AUC is 69.4%, and the detection rate of restatements is 60.0% and of AAERs is 81.1%. This method ranks better than logistic models but slightly worse than GBRT and random undersampling in Tables 10 and 15. An important difference is that there are more material misstatements than AAERs, so the benefits of nonrandom sampling to alleviate imbalance are more muted.
We report the summary statistics of the important predictors in Table 6.
As in any multivariate descriptive analysis with multiple correlated variables, interpretation requires some caution since the method may select one variable over another for reasons that relate primarily to the fitting procedure. Later on, we list the set of important variables in other methods and observe that, while many variables are common to multiple algorithms, there are also some differences.
For variables combining market and accounting information, such as book-to-market and earnings-to-price, we allocate half weight to each category.
The theoretical model of Bertomeu and Marinovic (2015) also predicts this relation, as firms that endogenously retain more soft assets tend to be more credible.
We only document the results for the backward logistic model because the forward logistic and simple logistic models exhibit the same results. Backward and forward logistic models are much more sparse; that is, they use fewer variables than GBRT and the simple logistic model. However, they do not appear to perform better than a simple logistic model. This finding suggests that complex interactions across the entire population of potential predictors capture misstatements.
We report in Table 10 bootstrapped standard errors, retraining and testing the model 200 times on randomly drawn datasets. Differences between the performance of most models tend to be greater than two standard errors, indicating that these differences are significant. In untabulated analyses, we bootstrap differences in model performance and confirm that differences between models are significantly different from zero at conventional levels.
In untabulated results, we also estimate the model by separating the restatement sample into positive and negative period income effects, under the conjecture that positive effects may reflect reversals or incentives to influence the stock price downward; see Kasznik (1999) and the extensive literature that followed. We divide restatements into three categories: negative income effects (overstatement), zero income effects, and positive income effects (understatement). We then build three models and predict the probability of overstatement, understatement, and a zero income effect separately. We do not find any notable improvement in predictive power in the test sample, likely because these alternative methods reduce the size of the dataset used to estimate the model.
In Panel B of Table 12, we obtain similar results after excluding firms with restatements in the training sample. Machine learning algorithms continue to perform better than the logistic model but feature lower catch rates.
In untabulated analyses, we compute the number of misstatements caught at least a year before the AAERs. The 29 misstatements caught by GBRT in the test sample relate to 20 AAER filings, and all of them are detected at least a year (often more than a year) before the AAER is filed.
We still estimate the models as in Panel A using the entire population of misstatements and AAERs. One alternative would have been to estimate a model using only AAER-misstatement pairs as irregularities. However, the number of observations here becomes too small to build a model with reasonable out-of-sample performance.
In untabulated results, we find very low predictive ability when we predict the first misstatement year.
inTrees can imply redundant conditions if an inequality is repeated or is a subset of another inequality. In these cases, we only report the stricter condition.
This result coincides with the Stata package Boost, with the command boost Res EP Soft, distribution(logistic) train(1) bag(1) interaction(2) maxiter(1) shrink(1) predict(pred).
References
Abbasi, A., Albrecht, C., Vance, A., & Hansen, J. (2012). Metafraud: a meta-learning framework for detecting financial fraud. MIS Quarterly, 36(4), 1293–1327.
Avramov, D., Chordia, T., Jostova, G., & Philipov, A. (2009). Credit ratings and the cross-section of stock returns. Journal of Financial Markets, 12 (3), 469–499.
Bao, Y., Ke, B., Li, B., Yu, Y.J., & Zhang, J. (2020). Detecting accounting fraud in publicly traded U.S. firms using a machine learning approach. Journal of Accounting Research, 58(1), 199–235.
Barton, J., & Simko, P.J. (2002). The balance sheet as an earnings management constraint. The Accounting Review, 77(s-1), 1–27.
Beneish, M.D. (1999). The detection of earnings manipulation. Financial Analysts Journal, 55(5), 24–36.
Bertomeu, J., & Marinovic, I. (2015). A Theory of hard and soft information. The Accounting Review, 91(1), 1–20.
Blackburne, T., Kepler, J., Quinn, P., & Taylor, D. (2020). Undisclosed SEC investigations. Management Science, forthcoming.
Cheffers, M., Whalen, D., & Usvyatsky, O. (2010). 2009 financial restatements: A nine year comparison. Audit Analytics Sales (February).
Cheynel, E., & Levine, C. (2020). Public disclosures and information asymmetry: A theory of the mosaic. The Accounting Review, 95(1), 79–99.
Dechow, P.M., & Dichev, I.D. (2002). The quality of accruals and earnings: The role of accrual estimation errors. The Accounting Review, 77(s-1), 35–59.
Dechow, P.M., Ge, W., Larson, C.R., & Sloan, R.G. (2011). Predicting material accounting misstatements. Contemporary Accounting Research, 28(1), 17–82.
DeFond, M.L., Raghunandan, K., & Subramanyam, K.R. (2002). Do non-audit service fees impair auditor independence? Evidence from going concern audit opinions. Journal of Accounting Research, 40(4), 1247–1274.
Deng, H. (2018). Interpreting tree ensembles with inTrees. International Journal of Data Science and Analytics, 1–11.
Ding, K., Lev, B., Peng, X., Sun, T., & Vasarhelyi, M.A. (2020). Machine learning improves accounting estimates. Review of Accounting Studies, 1–37.
Dutta, I., Dutta, S., & Raahemi, B. (2017). Detecting financial restatements using data mining techniques. Expert Systems with Applications, 90, 374–393.
Ettredge, M.L., Sun, L., Lee, P., & Anandarajan, A.A. (2008). Is earnings fraud associated with high deferred tax and/or book minus tax levels?. Auditing: A Journal of Practice & Theory, 27(1), 1–33.
Fanning, K.M., & Cogger, K.O. (1998). Neural network detection of management fraud using published financial data. Intelligent Systems in Accounting, Finance & Management, 7(1), 21–41.
Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861–874.
Frankel, R.M., Johnson, M.F., & Nelson, K.K. (2002). The relation between auditors’ fees for nonaudit services and earnings management. The Accounting Review, 77(s-1), 71–105.
Friedman, J.H. (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5), 1189–1232.
Friedman, J., Hastie, T., & Tibshirani, R. (2001). The elements of statistical learning. New York: Springer Series in Statistics.
Garfinkel, J.A. (2009). Measuring investors’ opinion divergence. Journal of Accounting Research, 47(5), 1317–1348.
Glosten, L.R., & Milgrom, P.R. (1985). Bid, ask and transaction prices in a specialist market with heterogeneously informed traders. Journal of Financial Economics, 14(1), 71–100.
Green, B.P., & Choi, J.H. (1997). Assessing the risk of management fraud through neural network technology. Auditing: A Journal of Practice & Theory, 16, 14–28.
Guelman, L. (2012). Gradient boosting trees for auto insurance loss cost modeling and prediction. Expert Systems with Applications, 39(3), 3659–3667.
Gupta, R., & Gill, N.S. (2012). A solution for preventing fraudulent financial reporting using descriptive data mining techniques. International Journal of Computer Applications.
Hribar, P., Kravet, T., & Wilson, R. (2014). A New measure of accounting quality. Review of Accounting Studies, 19(1), 506–538.
Johnson, V.E., Khurana, I.K., & Reynolds, J.K. (2002). Audit-firm tenure and the quality of financial reports. Contemporary Accounting Research, 19(4), 637–660.
Kasznik, R. (1999). On the association between voluntary disclosure and earnings management. Journal of Accounting Research, 37(1), 57–81.
Kim, Y.J., Baik, B., & Cho, S. (2016). Detecting financial misstatements with fraud intention using multi-class cost-sensitive learning. Expert Systems with Applications, 62, 32–43.
Kleinberg, J., Lakkaraju, H., Leskovec, J., Ludwig, J., & Mullainathan, S. (2017). Human decisions and machine predictions. The Quarterly Journal of Economics, 133(1), 237–293.
Kornish, L.J., & Levine, C.B. (2004). Discipline with common agency: The case of audit and nonaudit services. The Accounting Review, 79(1), 173–200.
Larcker, D.F., Richardson, S.A., & Tuna, I. (2007). Corporate governance, accounting outcomes, and organizational performance. The Accounting Review, 82(4), 963–1008.
Laux, V., & Newman, P.D. (2010). Auditor liability and client acceptance decisions. The Accounting Review, 85(1), 261–285.
Lin, J.W., Hwang, M.I., & Becker, J.D. (2003). A Fuzzy neural network for assessing the risk of fraudulent financial reporting. Managerial Auditing Journal, 18(8), 657–665.
Lobo, G.J., & Zhao, Y. (2013). Relation between audit effort and financial report misstatements: Evidence from quarterly and annual restatements. The Accounting Review, 88(4), 1385–1412.
Perols, J. (2011). Financial statement fraud detection: An analysis of statistical and machine learning algorithms. Auditing: A Journal of Practice & Theory, 30(2), 19–50.
Perols, J.L., Bowen, R.M., Zimmermann, C., & Samba, B. (2016). Finding needles in a haystack: Using data analytics to improve fraud prediction. The Accounting Review, 92(2), 221–245.
Ragothaman, S., & Lavin, A. (2008). Restatements due to improper revenue recognition: a neural networks perspective. Journal of Emerging Technologies in Accounting, 5(1), 129–142.
Romanus, R.N., Maher, J.J., & Fleming, D.M. (2008). Auditor industry specialization, auditor changes, and accounting restatements. Accounting Horizons, 22(4), 389–413.
Samuels, D., Taylor, D.J., & Verrecchia, R.E. (2018). Financial misreporting: Hiding in the shadows or in plain sight?.
van Rijsbergen, C.J. (2004). The geometry of information retrieval. Cambridge: Cambridge University Press.
Whiting, D.G., Hansen, J.V., McDonald, J.B., Albrecht, C., & Albrecht, W.S. (2012). Machine learning methods for detecting patterns of management fraud. Computational Intelligence, 28(4), 505–527.
Zhang, Y., & Haghani, A. (2015). A gradient boosting method to improve travel time prediction. Transportation Research Part C: Emerging Technologies, 58, 308–324.
We gratefully thank B. Cadman, P. Dechow, C. Lennox, S.X. Li, D. Macciocchi, M. Plumlee, X. Peng, and seminar participants at LSE, University of Utah, the USC-UCLA-UCSD-UCI conference, MIT, and the CMU Accounting Mini Conference for valuable feedback. We also thank J. Engelberg for the many suggestions that were central in seeding the project.
Appendices
Appendix A: Omitted Tables
Appendix B: An illustration of GBRT
We now offer a more formal presentation of the main steps of the gradient-boosting algorithm. Let us hold the depth of the tree as fixed and denote \(\phi (x;\mathcal {S})\) as a mapping that yields a vector of predicted values for some variable to be predicted z, given x, under the regression tree applied to sample \(\mathcal {S}=(z_i,x_i)_{i=1}^N\), where \((z_i)_{i=1}^N\) is a set of observations to be predicted. Note that we focus here on the first-order aspects as we use it in this study; the reader should refer to Friedman (2001) for a general description that would apply to a broader class of datasets.
Let us subsequently set a loss function \(L(y,\hat {y})\) for the boosting algorithm, where y is any variable that could be predicted and \(\hat {y}\) is a prediction. For a typical application with continuous variables, we could set a quadratic loss \(L^q(y,\hat {y})=(y-\hat {y})^2/2\); for our purpose, given that we predict probabilities, we use a logistic loss function \(L(y, \hat {y})=\log (1+\exp (-2y\hat {y}))\). Note that, in the quadratic case, the residual \(y-\hat {y}\) is also the (negative) derivative of the loss function in its second argument, \(L_2^q(y,\hat {y})=y-\hat {y}\); gradient boosting builds on this intuition to compute a local version of the residual for a nonquadratic loss function, \(L_2(y,\hat {y})=2y/(1+\exp (2 y \hat {y}))\). Note here that a prediction \(\hat {y}\) that agrees with y reduces \(L_2\) in magnitude, as it would in the case of a residual in a linear regression. So, for later use, let us interpret \(L_2(y,\hat {y})\) as a measure of the component of y unexplained by \(\hat {y}\).
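This derivative interpretation can be checked numerically. The following Python sketch is illustrative only (the function names are ours, not part of the study):

```python
import math

def logistic_loss(y, y_hat):
    # L(y, y_hat) = log(1 + exp(-2 y y_hat)), with y coded in {-1, +1}
    return math.log(1.0 + math.exp(-2.0 * y * y_hat))

def pseudo_residual(y, y_hat):
    # L_2(y, y_hat) = 2y / (1 + exp(2 y y_hat)),
    # the negative derivative of the loss in its second argument
    return 2.0 * y / (1.0 + math.exp(2.0 * y * y_hat))

# Numerical check that L_2 equals minus the slope of the loss in y_hat
eps = 1e-6
slope = (logistic_loss(1.0, 0.3 + eps) - logistic_loss(1.0, 0.3 - eps)) / (2 * eps)
print(abs(-slope - pseudo_residual(1.0, 0.3)) < 1e-6)  # True
```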
In pseudo-code, the algorithm will operate as follows.
1. Initialize the analysis with a single prediction for the entire sample that minimizes the logistic loss function; that is,
$$ \hat{y}_{i}^{0}=\frac{1}{2}\log\frac{1+\overline{y}}{1-\overline{y}}, $$
where \(\overline {y}\) is the sample mean.
2. For each step m ∈ [1, M], where M is the number of trees:
2.1. Compute the residuals of the previous step and redefine this residual as the variable to be explained:
$$ z_{i}^{m}=L_{2}(y_{i},\hat{y}_{i}^{m-1}). $$
2.2. Fit a regression tree to \(\mathcal {S}^m=(z_i^m,x_i)_{i=1}^N\), where \(\mathcal {S}\) is a subset drawn randomly according to parameter (d). The prediction is updated to
$$ \hat{y}_{i}^{m}=\hat{y}_{i}^{m-1}+\nu \phi(x_{i};\mathcal{S}), $$
where ν is the speed-of-learning parameter (c).
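In code, the loop above can be sketched compactly. The following Python sketch is illustrative only, not the implementation used in the study: it fits depth-1 trees (stumps) with simple mean leaf values rather than the logistic leaf estimates of the full algorithm, and it assumes both classes appear in y.

```python
import math

def fit_stump(X, z):
    """Fit a depth-1 regression tree to pseudo-residuals z by minimizing
    quadratic loss over every (variable, cutoff) pair."""
    n, p = len(X), len(X[0])
    best = None
    for j in range(p):
        for cut in sorted({row[j] for row in X}):
            left = [z[i] for i in range(n) if X[i][j] < cut]
            right = [z[i] for i in range(n) if X[i][j] >= cut]
            if not left or not right:
                continue
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            loss = (sum((v - ml) ** 2 for v in left)
                    + sum((v - mr) ** 2 for v in right))
            if best is None or loss < best[0]:
                best = (loss, j, cut, ml, mr)
    _, j, cut, ml, mr = best
    return lambda x: ml if x[j] < cut else mr

def boost(X, y, M=2, nu=1.0):
    """Gradient boosting with logistic loss; y is coded +1/-1."""
    y_bar = sum(y) / len(y)
    f0 = 0.5 * math.log((1 + y_bar) / (1 - y_bar))      # step 1
    preds, trees = [f0] * len(y), []
    for m in range(M):                                   # step 2
        # 2.1: pseudo-residuals L_2(y, y_hat)
        z = [2 * yi / (1 + math.exp(2 * yi * pi)) for yi, pi in zip(y, preds)]
        tree = fit_stump(X, z)                           # 2.2: fit a tree...
        trees.append(tree)
        preds = [pi + nu * tree(xi) for pi, xi in zip(preds, X)]  # ...and update
    return f0, trees

def predict_proba(f0, trees, x, nu=1.0):
    """Convert the boosted score into a probability via the logistic map."""
    f = f0 + nu * sum(t(x) for t in trees)
    return 1.0 / (1.0 + math.exp(-2.0 * f))
```

On a toy sample where the outcome flips with a single variable, two boosting rounds already push the predicted probabilities close to 0 and 1.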
We next illustrate how gradient-boosted trees work using a simple numerical example, where we set the number of trees to M = 2 and the tree depth to 2. In Table 21, we create 16 firm-year observations, coding a misstatement as 1 and a nonmisstatement as − 1 (column 1). Of course, this is only meant to carry out the analysis with paper and pencil and is in no way a reasonable sample size for a machine learning analysis. The dependent variable is whether there is a misstatement, and we use two explanatory variables: the earnings-to-price ratio and the percentage of soft assets (columns 2 and 3).
Note that, in this created dataset, the pattern that ties the variables to misstatements is visually self-evident. Plotting the data in Fig. 7, it is clear that misstatements have a monotonic relationship with the two variables. But this relationship is not linear, and there is an interaction between the two variables such that the cutoff for soft assets does not change at low levels of earnings to price. Our objective is to show how machine learning can uncover the pattern that we have ourselves “learned” from visual inspection of the graph.
Following the pseudo-code, we first calculate the sample mean as \(\overline {y}=0.125\) and initialize our first predictor at a predicted probability for the whole sample given by
$$ \hat{y}_{i}^{0}=\frac{1}{2}\log\frac{1+0.125}{1-0.125}\approx 0.126. $$
Then we calculate the residuals for each of the 16 observations given by
$$ z_{i}^{1}=L_{2}(y_{i},\hat{y}_{i}^{0})=\frac{2y_{i}}{1+\exp(2y_{i}\hat{y}_{i}^{0})}, $$
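These first-step quantities can be verified by direct computation; a short Python sketch (the sample mean of 0.125 is taken from the example, with y coded +1/−1):

```python
import math

# Initial prediction for the worked example, from the reported sample mean
y_bar = 0.125
y_hat0 = 0.5 * math.log((1 + y_bar) / (1 - y_bar))

def residual(y, y_hat):
    # L_2(y, y_hat) = 2y / (1 + exp(2 y y_hat))
    return 2 * y / (1 + math.exp(2 * y * y_hat))

print(round(y_hat0, 4))                   # 0.1257
print(round(residual(1.0, y_hat0), 4))    # 0.875  (misstatement rows)
print(round(residual(-1.0, y_hat0), 4))   # -1.125 (nonmisstatement rows)
```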
as calculated in column 4 of Table 21. The next step is to fit the regression tree to \(z_i^1\), which involves finding one variable and one cutoff such that all observations on one side of the cutoff are given the same prediction. Of note, the regression tree does not use the same loss function as the boosting step because the fitted \(z_i^1\) are derivatives that need not be bounded. In particular, the loss function used for this step is simply the quadratic loss \(L^q\), and the best cutoff minimizes this loss.
In this example, there are two variables, and each takes four values, so we could place the cutoff on one of two variables and at three locations per variable, for six possible choices of the cutoff. For each possible cutoff, we compute a prediction \(\overline {z}_i^1\) as the average over all observations on one side of the cutoff and sum the loss function \(L^q(z_i^1,\overline {z}_i^1)\). Table 22 provides the loss function for every possible cutoff. Here, the best cutoff partitions the sample as a function of soft assets greater than or equal to 0.4, with 4 observations below this cutoff and 12 above.
Given that the number of leaves is set to three, we need to apply a second (and final) cutoff to obtain three end leaves. We proceed similarly in Table 23, except that we now choose whether to apply the cutoff on the left or the right side of Soft ≥ 0.4: there are five choices of cutoff across the two variables on the right node versus three on the left node. Note that, for the left node, all outcome variables are equal to zero, so a cutoff there would not increase explanatory power; hence we know, for this case, that the best cutoff must be on the right node.
Note that a variable that has been used in a prior cut can be used again in later steps, thus creating potentially complex interactions or nonlinearities. For this example, the best cutoff point is located at EP ≥ 0.07, which leads to a partition of the sample into three regions, as illustrated in Fig. 8.
Having completed the first tree, we compute an adjustment to our initial prediction from step 2.2, that is,
$$ \hat{y}_{i}^{1}=\hat{y}_{i}^{0}+\nu\phi(x_{i};\mathcal{S}), $$
where ϕ(.) is based on the logistic functional
$$ \phi(x_{i};\mathcal{S})=\frac{{\sum}_{x_{j}\in R_{1}(x_{i})}z_{j}^{1}}{{\sum}_{x_{j}\in R_{1}(x_{i})}|z_{j}^{1}|(2-|z_{j}^{1}|)}, $$
where \(R_{1}(x_{i})\) indicates the set of observations classified in the same region by the tree. Column 5 in Table 21 provides an updated estimate after the first tree. Note that these values refer to the argument inside the logistic transformation. To make a prediction, we need to convert this term into a probability (column 6) using a standard logistic transformation (see footnote 17).
Since we have set the number of trees to two, we need to repeat these steps a second time, using an updated set of residuals
$$ z_{i}^{2}=L_{2}(y_{i},\hat{y}_{i}^{1}). $$
Skipping the calculations (which are conceptually identical), the second tree, illustrated in Fig. 9, is now given by a cutoff on both high E/P and soft assets, which unsurprisingly corresponds to the area of predictors partly misclassified by the first tree.
After computing the corresponding \(\phi (x_i;\mathcal {S}_1)\), where \(\mathcal {S}_1=(z_i^2,x_i)\) is the sample for the second tree, and mapping back to an updated value for \(\hat {y}_i^2\) and the implied probability, we compute in Column 8 of Table 21 the resulting prediction. As a result of the two trees, the data is now classified into seven regions.
To further illustrate the tree component of the analysis, we estimate in Fig. 10 a single regression tree that partitions the probability of misstatements into 10 nodes. The probability of misstatements across nodes varies from 2% to 50%. In the first branch of the tree, the most important variable is whether the auditor has issued a qualified opinion. As expected, qualified opinions tend to be more frequent in the presence of an ongoing misstatement. Conditional on a qualified opinion, firms with higher book-to-market and higher deferred tax expenses tend to be the most likely to misstate, suggesting disagreements about deferred taxes. Conditional on an unqualified opinion, by contrast, a very different set of variables predicts misstatement: firms with high non-audit fees and high stock market volatility have a high probability of misstating.
Bertomeu, J., Cheynel, E., Floyd, E. et al. Using machine learning to detect misstatements. Rev Account Stud 26, 468–519 (2021). https://doi.org/10.1007/s11142-020-09563-8
Keywords
- Restatement
- Manipulation
- Earnings management
- Machine learning
- Data analytics
- Regression tree
- Misstatement
- Irregularity
- Fraud
- Prediction
- SEC
- Enforcement
- Gradient boosted regression tree
- Data mining
- Accounting
- Detection
- AAERs