
Using machine learning to detect misstatements

Published in: Review of Accounting Studies

Abstract

Machine learning offers empirical methods to sift through accounting datasets with a large number of variables and limited a priori knowledge about functional forms. In this study, we show that these methods help detect and interpret patterns present in ongoing accounting misstatements. We use a wide set of variables from accounting, capital markets, governance, and auditing datasets to detect material misstatements. A primary insight of our analysis is that accounting variables, while they do not detect misstatements well on their own, become important with suitable interactions with audit and market variables. We also analyze differences between misstatements and irregularities, compare algorithms, examine one-year- and two-year-ahead predictions and interpret groups at greater risk of misstatements.


Notes

  1. Another possible solution is to fit the model using a rolling window or exclude observations from firms used to build the model. However, both of these choices severely restrict the sample effectively used for cross-validation.

  2. In a previous version of the manuscript, we focused on all restatements, including restatements not reported in 8-K filings; however, many of these restatements need not involve large events. We thank Andy Imdieke for this suggestion.

  3. Audit Analytics provides the restated amount for each year only for the five most recent years impacted by the restatement. However, firms’ restatements can often impact more than five years of financial data. The impact on accounting numbers prior to the most recent five years is usually reported as a cumulative charge to retained earnings, and, in practice, firms need not retrospectively adjust all prior years. To account for this, we assume that the cumulative effect to retained earnings is distributed evenly across the misstatement span identified in the restatement filing. If the span is missing, we allocate the unexplained cumulative change to the year prior to the last year with an income effect.

  4. RUSBoost uses random undersampling: balanced samples are constructed by randomly drawing from the sample. With a heavily imbalanced dataset, however, nonrandom undersampling may perform better than random sampling. In untabulated results, we used the sampling method of Perols et al. (2016), but it did not perform better than RUSBoost in our dataset. Under this alternative sampling method, the AUC is 69.4%, the detection rate of restatements is 60.0%, and that of AAERs is 81.1%. This method ranks better than logistic models but slightly worse than GBRT and random undersampling in Tables 10 and 15. An important difference is that there are more material misstatements than AAERs, so the benefits of nonrandom sampling in alleviating imbalance are more muted.

  5. We report the summary statistics of the important predictors in Table 6.

  6. As in any multivariate descriptive analysis with multiple correlated variables, interpretation requires some caution since the method may select one variable over another for reasons that relate primarily to the fitting procedure. Later on, we list the set of important variables in other methods and observe that, while many variables are common to multiple algorithms, there are also some differences.

  7. For variables combining market and accounting information, such as book-to-market and earnings-to-price, we allocate half weight to each category.

  8. The theoretical model of Bertomeu and Marinovic (2015) also predicts this relation, as firms that endogenously retain more soft assets tend to be more credible.

  9. We only document the results for the backward logistic model because the forward logistic and simple logistic models exhibit the same results. Backward and forward logistic models are much more sparse; that is, they use fewer variables than GBRT and simple logistic models. However, they do not appear to perform better than a simple logistic model. This finding suggests that complex interactions within the entire population of potential predictors capture misstatements.

  10. We report in Table 10 bootstrapped standard errors, retraining and testing the model 200 times on randomly drawn datasets. Differences between the performance of most models tend to be greater than two standard errors, indicating that these differences are significant. In untabulated analyses, we bootstrapped differences in model performance and confirm that differences between models are significantly different from zero at conventional levels.

  11. In untabulated results, we also estimate the model by separating the restatement sample into positive and negative period income effects, under the conjecture that positive effects may reflect reversals or incentives to influence the stock price downward. See Kasznik (1999) for an extensive continuing literature. We divide restatements into three categories: negative income effects (overstatement), zero income effects, and positive income effects (understatement). We then build three models and predict the probability of overstatement, understatement, and a zero income effect separately. We do not find any notable improvement to predictive power in the test sample, likely because these alternative methods reduce the size of the dataset used to estimate the model.

  12. In Panel B of Table 12, we obtain similar results after excluding firms with restatements in the training sample. Machine learning algorithms continue to perform better, compared to the logistic model, but feature lower catch rates.

  13. In untabulated analyses, we compute the number of misstatements caught at least a year before the AAERs. The 29 misstatements caught by GBRT in the test sample relate to 20 AAER filings, and all of them are detected at least a year (often more than a year) before the AAER is filed.

  14. We still estimate the models as in panel A using the entire population of misstatements and AAERs. One alternative would have been to estimate a model using only AAER-misstatement pairs as irregularities. However, the number of observations here becomes too small to build a model with reasonable out-of-sample performance.

  15. In untabulated results, we find very low predictive ability when we predict the first misstatement year.

  16. InTrees can imply redundant conditions if an inequality is repeated or is a subset of another inequality. In these cases, we only report the stricter condition.

  17. This result coincides with the Stata package Boost, with the command: boost Res EP Soft, distribution(logistic) train(1) bag(1) interaction(2) maxiter(1) shrink(1) predict(pred).

References

  • Abbasi, A., Albrecht, C., Vance, A., & Hansen, J. (2012). MetaFraud: A meta-learning framework for detecting financial fraud. MIS Quarterly, 36(4), 1293–1327.

  • Avramov, D., Chordia, T., Jostova, G., & Philipov, A. (2009). Credit ratings and the cross-section of stock returns. Journal of Financial Markets, 12(3), 469–499.

  • Bao, Y., Ke, B., Li, B., Yu, Y.J., & Zhang, J. (2020). Detecting accounting fraud in publicly traded U.S. firms using a machine learning approach. Journal of Accounting Research, 58(1), 199–235.

  • Barton, J., & Simko, P.J. (2002). The balance sheet as an earnings management constraint. The Accounting Review, 77(s-1), 1–27.

  • Beneish, M.D. (1999). The detection of earnings manipulation. Financial Analysts Journal, 55(5), 24–36.

  • Bertomeu, J., & Marinovic, I. (2015). A theory of hard and soft information. The Accounting Review, 91(1), 1–20.

  • Blackburne, T., Kepler, J., Quinn, P., & Taylor, D. (2020). Undisclosed SEC investigations. Management Science, forthcoming.

  • Cheffers, M., Whalen, D., & Usvyatsky, O. (2010). 2009 financial restatements: A nine year comparison. Audit Analytics Sales (February).

  • Cheynel, E., & Levine, C. (2020). Public disclosures and information asymmetry: A theory of the mosaic. The Accounting Review, 95(1), 79–99.

  • Dechow, P.M., & Dichev, I.D. (2002). The quality of accruals and earnings: The role of accrual estimation errors. The Accounting Review, 77(s-1), 35–59.

  • Dechow, P.M., Ge, W., Larson, C.R., & Sloan, R.G. (2011). Predicting material accounting misstatements. Contemporary Accounting Research, 28(1), 17–82.

  • DeFond, M.L., Raghunandan, K., & Subramanyam, K.R. (2002). Do non-audit service fees impair auditor independence? Evidence from going concern audit opinions. Journal of Accounting Research, 40(4), 1247–1274.

  • Deng, H. (2018). Interpreting tree ensembles with inTrees. International Journal of Data Science and Analytics, 1–11.

  • Ding, K., Lev, B., Peng, X., Sun, T., & Vasarhelyi, M.A. (2020). Machine learning improves accounting estimates. Review of Accounting Studies, 1–37.

  • Dutta, I., Dutta, S., & Raahemi, B. (2017). Detecting financial restatements using data mining techniques. Expert Systems with Applications, 90, 374–393.

  • Ettredge, M.L., Sun, L., Lee, P., & Anandarajan, A.A. (2008). Is earnings fraud associated with high deferred tax and/or book minus tax levels? Auditing: A Journal of Practice & Theory, 27(1), 1–33.

  • Fanning, K.M., & Cogger, K.O. (1998). Neural network detection of management fraud using published financial data. Intelligent Systems in Accounting, Finance & Management, 7(1), 21–41.

  • Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861–874.

  • Frankel, R.M., Johnson, M.F., & Nelson, K.K. (2002). The relation between auditors’ fees for nonaudit services and earnings management. The Accounting Review, 77(s-1), 71–105.

  • Friedman, J.H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5), 1189–1232.

  • Friedman, J., Hastie, T., & Tibshirani, R. (2001). The Elements of Statistical Learning (Vol. 1). New York: Springer.

  • Garfinkel, J.A. (2009). Measuring investors’ opinion divergence. Journal of Accounting Research, 47(5), 1317–1348.

  • Glosten, L.R., & Milgrom, P.R. (1985). Bid, ask and transaction prices in a specialist market with heterogeneously informed traders. Journal of Financial Economics, 14(1), 71–100.

  • Green, B.P., & Choi, J.H. (1997). Assessing the risk of management fraud through neural network technology. Auditing: A Journal of Practice & Theory, 16, 14–28.

  • Guelman, L. (2012). Gradient boosting trees for auto insurance loss cost modeling and prediction. Expert Systems with Applications, 39(3), 3659–3667.

  • Gupta, R., & Gill, N.S. (2012). A solution for preventing fraudulent financial reporting using descriptive data mining techniques. International Journal of Computer Applications.

  • Hribar, P., Kravet, T., & Wilson, R. (2014). A new measure of accounting quality. Review of Accounting Studies, 19(1), 506–538.

  • Johnson, V.E., Khurana, I.K., & Reynolds, J.K. (2002). Audit-firm tenure and the quality of financial reports. Contemporary Accounting Research, 19(4), 637–660.

  • Kasznik, R. (1999). On the association between voluntary disclosure and earnings management. Journal of Accounting Research, 37(1), 57–81.

  • Kim, Y.J., Baik, B., & Cho, S. (2016). Detecting financial misstatements with fraud intention using multi-class cost-sensitive learning. Expert Systems with Applications, 62, 32–43.

  • Kleinberg, J., Lakkaraju, H., Leskovec, J., Ludwig, J., & Mullainathan, S. (2017). Human decisions and machine predictions. The Quarterly Journal of Economics, 133(1), 237–293.

  • Kornish, L.J., & Levine, C.B. (2004). Discipline with common agency: The case of audit and nonaudit services. The Accounting Review, 79(1), 173–200.

  • Larcker, D.F., Richardson, S.A., & Tuna, I. (2007). Corporate governance, accounting outcomes, and organizational performance. The Accounting Review, 82(4), 963–1008.

  • Laux, V., & Newman, P.D. (2010). Auditor liability and client acceptance decisions. The Accounting Review, 85(1), 261–285.

  • Lin, J.W., Hwang, M.I., & Becker, J.D. (2003). A fuzzy neural network for assessing the risk of fraudulent financial reporting. Managerial Auditing Journal, 18(8), 657–665.

  • Lobo, G.J., & Zhao, Y. (2013). Relation between audit effort and financial report misstatements: Evidence from quarterly and annual restatements. The Accounting Review, 88(4), 1385–1412.

  • Perols, J. (2011). Financial statement fraud detection: An analysis of statistical and machine learning algorithms. Auditing: A Journal of Practice & Theory, 30(2), 19–50.

  • Perols, J.L., Bowen, R.M., Zimmermann, C., & Samba, B. (2016). Finding needles in a haystack: Using data analytics to improve fraud prediction. The Accounting Review, 92(2), 221–245.

  • Ragothaman, S., & Lavin, A. (2008). Restatements due to improper revenue recognition: A neural networks perspective. Journal of Emerging Technologies in Accounting, 5(1), 129–142.

  • Romanus, R.N., Maher, J.J., & Fleming, D.M. (2008). Auditor industry specialization, auditor changes, and accounting restatements. Accounting Horizons, 22(4), 389–413.

  • Samuels, D., Taylor, D.J., & Verrecchia, R.E. (2018). Financial misreporting: Hiding in the shadows or in plain sight?

  • van Rijsbergen, C.J. (2004). The Geometry of Information Retrieval. Cambridge University Press.

  • Whiting, D.G., Hansen, J.V., McDonald, J.B., Albrecht, C., & Albrecht, W.S. (2012). Machine learning methods for detecting patterns of management fraud. Computational Intelligence, 28(4), 505–527.

  • Zhang, Y., & Haghani, A. (2015). A gradient boosting method to improve travel time prediction. Transportation Research Part C: Emerging Technologies, 58, 308–324.



Corresponding author

Correspondence to Jeremy Bertomeu.


We gratefully thank B. Cadman, P. Dechow, C. Lennox, S.X. Li, D. Macciocchi, M. Plumlee, X. Peng, and seminar participants at LSE, University of Utah, the USC-UCLA-UCSD-UCI conference, MIT, and the CMU Accounting Mini Conference for valuable feedback. We also thank J. Engelberg for the many suggestions that were central in seeding the project.

Appendices

Appendix A: Omitted Tables

Table 1 Sample selection
Table 2 Key summary statistics
Table 3 Distribution of restated firms by year
Table 4 Frequency of firm-year restatements
Table 5 Variable definitions
Table 6 Summary statistics of predictors in Table 7
Table 7 Importance of predictors (importance higher than 1.5%)
Table 8 Models using subgroups of variables
Table 9 Difference of means test between restatement and nonrestatement firm-years
Table 10 Fit and accuracy of the different models
Table 11 Catch rate when using small bandwidths
Table 12 Performance of the models for the top 1/3 predicted probabilities of firm-years
Table 13 Performance of alternative methods
Table 14 AAERs sample selection and description
Table 15 AAERs catch rates on test dataset (11,323 firm years)
Table 16 Top 10 variables (bold: same variable across panels; italics: same variable in at least two models)
Table 17 Performance on catching AAERs
Table 18 Results from predicting ahead models
Table 19 Importance of predictors from predicting ahead model (top 15)
Table 20 InTrees for top 10 and all variables

Appendix B: An illustration of GBRT

We now offer a more formal presentation of the main steps of the gradient-boosting algorithm. Hold the depth of the tree fixed, and let \(\phi (x;\mathcal {S})\) denote a mapping that yields a vector of predicted values for some variable to be predicted z, given x, under the regression tree applied to sample \(\mathcal {S}=(z_i,x_i)_{i=1}^N\), where \((z_i)_{i=1}^N\) is the set of observations to be predicted. Note that we focus here on the first-order aspects of the algorithm as used in this study; the reader should refer to Friedman (2001) for a general description that applies to a broader class of datasets.

Let us subsequently set a loss function \(L(y,\hat {y})\) for the boosting algorithm, where y is any variable that could be predicted and \(\hat {y}\) is a prediction. For a typical application with continuous variables, we could set a quadratic loss \(L^q(y,\hat {y})=(y-\hat {y})^2/2\); for our purpose, given that we will predict probabilities, we will use a logistic loss function \(L(y,\hat {y})=\log (1+\exp (-2y\hat {y}))\). Note that, in the quadratic case, the residual \(y-\hat {y}\) is also given by the derivative in the second component of the loss function, \(L_2^q(y,\hat {y})=y-\hat {y}\); gradient boosting builds on this intuition to compute a local version of the residual for a nonquadratic loss function, \(L_2(y,\hat {y})=2y/(1+\exp (2y\hat {y}))\). Note here that a correct prediction \(y=\hat {y}\) reduces \(L_2\), as it would in the case of a residual in a linear regression. So, for later use, let us interpret \(L_2(y,\hat {y})\) as a measure of the component of y unexplained by \(\hat {y}\).
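To make the two quantities concrete, here is a minimal Python sketch (ours, not the authors' code) of the logistic loss and its pseudo-residual; the numerical check uses the initial prediction value that appears later in this appendix.

```python
import math

def logistic_loss(y, y_hat):
    """L(y, y_hat) = log(1 + exp(-2*y*y_hat)), for labels y in {-1, +1}."""
    return math.log(1.0 + math.exp(-2.0 * y * y_hat))

def pseudo_residual(y, y_hat):
    """L_2(y, y_hat) = 2*y / (1 + exp(2*y*y_hat)): the component of y
    unexplained by y_hat, playing the role of a regression residual."""
    return 2.0 * y / (1.0 + math.exp(2.0 * y * y_hat))

# With the initial prediction 0.125657 computed later in the appendix,
# a misstatement (y = +1) leaves residual 0.875 and a nonmisstatement
# (y = -1) leaves residual -1.125.
print(round(pseudo_residual(+1, 0.125657), 3))  # 0.875
print(round(pseudo_residual(-1, 0.125657), 3))  # -1.125
```

A confident correct prediction leaves a residual near zero, while a confident wrong prediction leaves a residual near ±2, which is what drives the next tree toward the misclassified observations.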

In pseudo-code, the algorithm operates as follows.

  1. Initialize the analysis with a single prediction for the entire sample that minimizes the logistic loss function; that is,

     $$ \hat{y}_{i}^{0}=\frac{1}{2}\log\frac{1+\overline{y}}{1-\overline{y}}, $$

     where \(\overline {y}\) is the sample mean.

  2. For each step m ∈ [1, M], where M is the number of trees:

     2.1. Compute the residuals of the previous step and redefine this residual as the variable to be explained:

     $$ {z_{i}^{m}}=L_{2}(y_{i},\hat{y}_{i}^{m-1}). $$

     2.2. Fit a regression tree to \(\mathcal {S}^m=(z_i^m,x_i)_{i=1}^N\), where \(\mathcal {S}^m\) is a subset drawn randomly according to parameter (d). The prediction is updated to

     $$ \hat{y}_{i}^{m}=\hat{y}_{i}^{m-1}+\nu \phi(x_{i};\mathcal{S}^{m}), $$

     where ν is the speed-of-learning parameter (c).
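The loop above can be sketched end to end in Python. This is a simplified illustration under stated assumptions, not the authors' implementation: each tree is a depth-one stump fit by least squares on the full sample, and the leaf value is the mean residual rather than the Newton-style leaf update used later in this appendix.

```python
import math

def fit_stump(z, x):
    """Fit a depth-1 regression tree (one variable, one cutoff) to the
    residuals z by minimizing quadratic loss, as in step 2.2."""
    n, p = len(z), len(x[0])
    best = None
    for j in range(p):
        for cut in sorted({row[j] for row in x}):
            left = [z[i] for i in range(n) if x[i][j] < cut]
            right = [z[i] for i in range(n) if x[i][j] >= cut]
            if not left or not right:
                continue
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            loss = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if best is None or loss < best[0]:
                best = (loss, j, cut, ml, mr)
    _, j, cut, ml, mr = best
    return lambda row: ml if row[j] < cut else mr

def gbrt(y, x, n_trees=2, nu=1.0):
    """Gradient boosting with logistic loss, labels y in {-1, +1}.
    Assumes both classes are present (so the log in step 1 is finite)."""
    y_bar = sum(y) / len(y)
    f = [0.5 * math.log((1 + y_bar) / (1 - y_bar))] * len(y)          # step 1
    for _ in range(n_trees):                                          # step 2
        z = [2 * yi / (1 + math.exp(2 * yi * fi)) for yi, fi in zip(y, f)]  # 2.1
        tree = fit_stump(z, x)
        f = [fi + nu * tree(xi) for fi, xi in zip(f, x)]              # 2.2
    return [1 / (1 + math.exp(-2 * fi)) for fi in f]  # logistic probabilities
```

On a toy one-variable sample, e.g. `gbrt([-1, -1, -1, 1, 1, 1], [[0.1], [0.2], [0.3], [0.7], [0.8], [0.9]])`, the returned probabilities fall below one half for the first three observations and above it for the last three.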

We next illustrate how gradient-boosted trees work using a simple numerical example, setting the number of trees to M = 2 and the tree depth to 2. In Table 21, we create 16 firm-year observations, coding a misstatement as 1 and a nonmisstatement as − 1 (column 1). Of course, this is only meant to allow the analysis to be carried out by paper and pencil and is in no way a reasonable sample size for a machine learning analysis. The dependent variable is whether there is a misstatement, and we use two explanatory variables: the earnings-to-price ratio and the percentage of soft assets (columns 2 and 3).

Table 21 Example - sample 16 observations, nine of which have misstatements, with two observable explanatory variables E/P and Soft

Note that, in this constructed dataset, the pattern that ties the variables to misstatements is visually self-evident. Plotting the data in Fig. 7, it is clear that misstatements have a monotonic relationship with the two variables. But this relationship is not linear, and there is an interaction between the two variables such that the cutoff for soft assets does not change at low levels of earnings-to-price. Our objective is to show how machine learning can uncover the pattern that we have ourselves “learned” from visual inspection of the graph of observations.

Fig. 7 Graphical analysis: misstatements as a function of explanatory variables

Following the pseudo-code, we first calculate the sample mean as \(\overline {y}=0.125\) and initialize the prediction for the whole sample at

$$ \hat{y}_{i}^{0}=\frac{1}{2}\log\frac{1+0.125}{1-0.125}=0.125657. $$

Then we calculate the residuals for each of the 16 observations given by

$$ {z_{i}^{1}}=L_{2}(y_{i}^{},0.125657)=\frac{2 y_{i}}{1+exp(2 y_{i} \hat{y}_{i}^{0})}, $$

as calculated in column 4 of Table 21. The next step is to fit the regression tree to \(z_i^1\), which involves finding one variable and one cutoff such that all observations on the same side of the cutoff receive the same prediction. Of note, the regression tree does not use the same loss function as the boosting algorithm, because the fitted \(z_i^1\) are derivatives that need not be bounded. In particular, the loss function used for this step is simply the quadratic loss \(L^q\), and the best cutoff minimizes this loss.

In this example, there are two variables, and each has four values, so we could place the cutoff on one of two variables and at three locations per variable, for six possible choices of cutoff. For each possible cutoff, we compute a prediction \(\overline {z}_i^1\) as the average over all observations on each side of the cutoff and sum the loss \(L^q(z_i^1,\overline {z}_i^1)\). Table 22 provides the loss function for every possible cutoff. Here, the best cutoff partitions the sample on soft assets greater than or equal to 0.4, with 4 observations below this cutoff and 12 above.

Table 22 Squared loss for various cutoffs (first branch)
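The cutoff search behind Table 22 can be replicated mechanically. The sketch below is ours and uses hypothetical stand-in values (the actual Table 21 entries are not reproduced here): it enumerates every interior cutoff on each variable and scores it with the quadratic loss.

```python
def split_loss(obs, var, cut):
    """Sum of squared deviations from each side's mean when splitting
    observations on the rule x[var] >= cut."""
    loss = 0.0
    for side in ([z for x, z in obs if x[var] < cut],
                 [z for x, z in obs if x[var] >= cut]):
        if side:
            m = sum(side) / len(side)
            loss += sum((z - m) ** 2 for z in side)
    return loss

def best_cutoff(obs, n_vars):
    """Return (loss, variable index, cutoff) minimizing the quadratic loss
    over all interior cutoffs of all variables."""
    return min((split_loss(obs, v, c), v, c)
               for v in range(n_vars)
               for c in sorted({x[v] for x, _ in obs})[1:])

# Hypothetical data in the spirit of Table 21: variables (E/P, Soft) each
# take four values; residuals are -1.125 below Soft = 0.4 and 0.875 above.
obs = [((ep, soft), 0.875 if soft >= 0.4 else -1.125)
       for ep in (0.03, 0.05, 0.07, 0.09)
       for soft in (0.2, 0.4, 0.6, 0.8)]
print(best_cutoff(obs, 2))  # picks the split Soft >= 0.4
```

In this stand-in dataset the split on soft assets at 0.4 separates the residuals perfectly, so the search recovers it with zero loss; in the paper's actual example the chosen split is the same but the regions are not pure.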

Given that the number of leaves is set to three, we need to apply a second (and final) cutoff to obtain three end leaves. We proceed similarly in Table 23, except that we must now choose whether to apply the cutoff on the left or the right side of Soft ≥ 0.4, with five candidate cutoffs across the two variables on the right node versus three on the left node. Note that, for the left node, all outcome variables have the same value (zero), so that a cutoff there does not increase explanatory power; hence we know, for this case, that the best cutoff must be on the right node.

Table 23 Squared loss for various cutoffs (second branch)

Note that a variable used in a prior cut can be used again in later steps, thus creating potentially complex interactions or nonlinearities. In this example, the best cutoff is located at EP ≥ 0.07, which partitions the sample into three regions, as illustrated in Fig. 8.

Fig. 8 First regression tree - complete

Having completed the first tree, we compute an adjustment to our initial prediction from step 2.2 (with ν = 1 in this example), that is,

$$ \hat{y}_{i}^{m}=\hat{y}_{i}^{m-1}+ \phi(x_{i};\mathcal{S}), $$

where ϕ(.) is based on the logistic functional

$$ \phi(x_{i};\mathcal{S}_{0})=\frac{{\sum}_{(z_{i},x_{i})\in R^{1}(x_{i}) } {z_{i}^{1}}}{{\sum}_{(z_{i},x_{i})\in R^{1}(x_{i})} |{z_{i}^{1}}|(2-|{z_{i}^{1}}|)}, $$

where R1(xi) indicates the set of observations classified in the same region by the tree. Column 5 of Table 21 provides the updated estimate after the first tree. Note that these values refer to the term inside the logistic function; to make a prediction, we convert this term into a probability (column 6) using a standard logistic transformation (see footnote 17).
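The leaf value and the probability conversion can be written out directly. This is a sketch of the two formulas above, with our own (hypothetical) function names.

```python
import math

def leaf_value(residuals):
    """phi for one leaf: sum(z) / sum(|z| * (2 - |z|)) over the
    observations the tree classifies into the same region."""
    return sum(residuals) / sum(abs(z) * (2.0 - abs(z)) for z in residuals)

def to_probability(f):
    """Standard logistic transformation of the additive score f."""
    return 1.0 / (1.0 + math.exp(-2.0 * f))

# A leaf containing only misstatement residuals (0.875 each) gets value
# 1/1.125, pushing the predicted misstatement probability above one half.
phi = leaf_value([0.875, 0.875, 0.875])
print(round(phi, 4))                          # 0.8889
print(to_probability(0.125657 + phi) > 0.5)   # True
```

Because the leaf value scales with the residuals it contains, regions dominated by misclassified observations receive the largest adjustments, which is exactly what the second tree exploits below.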

Since we have set the number of trees to two, we need to repeat these steps a second time, using an updated set of residuals

$$ {z_{i}^{2}}=L_{2}(y_{i},\hat{y}_{i}^{1}). $$

Skipping the calculations (which are conceptually identical), the second tree, illustrated in Fig. 9, is now given by cutoffs on both high E/P and high soft assets, which unsurprisingly correspond to the area of predictors partly misclassified by the first tree.

Fig. 9 Second tree

After computing the corresponding \(\phi (x_i;\mathcal {S}_1)\), where \(\mathcal {S}_1=(z_i^2,x_i)\) is the sample for the second tree, and mapping back to an updated value \(\hat {y}_i^2\) and the implied probability, we report the resulting prediction in column 8 of Table 21. As a result of the two trees, the data are now classified into seven regions.

To further illustrate the tree component of the analysis, we estimate in Fig. 10 a single regression tree by partitioning the probability of misstatements into 10 nodes. The probability of misstatement across nodes varies from 2% to 50%. In the first branch of the tree, the most important variable is whether the auditor has issued a qualified opinion. As expected, qualified opinions tend to be more frequent in the presence of an ongoing misstatement. Conditional on a qualified opinion, firms with higher book-to-market and higher deferred tax expenses tend to be the most likely to misstate, suggesting disagreements about deferred taxes. Conditional on an unqualified opinion, by contrast, a very different set of variables predicts misstatement: firms with high non-audit fees and high stock market volatility have a high probability of misstating.

Fig. 10 Regression tree


Cite this article

Bertomeu, J., Cheynel, E., Floyd, E. et al. Using machine learning to detect misstatements. Rev Account Stud 26, 468–519 (2021). https://doi.org/10.1007/s11142-020-09563-8
