An out-of-sample evaluation framework for DEA with application in bankruptcy prediction

Data envelopment analysis (DEA) is a well-established non-parametric methodology for performance evaluation and benchmarking, and it has been widely used in many application areas since the publication of the seminal paper by Charnes, Cooper and Rhodes in 1978. However, to the best of our knowledge, no published work has formally addressed out-of-sample evaluation in DEA. In this paper, we fill this gap by proposing a framework for the out-of-sample evaluation of decision making units. We tested the performance of the proposed framework in risk assessment and bankruptcy prediction of companies listed on the London Stock Exchange. Numerical results demonstrate that the proposed out-of-sample evaluation framework for DEA is capable of delivering an outstanding performance. It thus opens a new avenue for research and applications in risk modelling and analysis using DEA as a non-parametric frontier-based classifier, and makes DEA a real contender in industry applications in banking and investment.


Introduction
Since the publication of the seminal paper by Charnes, Cooper and Rhodes in 1978, data envelopment analysis (DEA) has become a well-established non-parametric methodology for performance evaluation and benchmarking. DEA has been widely used in many application areas (see Liu et al. (2013) for a recent survey, and Mousavi et al. (2015) and Ouenniche (2011, 2012a, b) for a recent application area) and has benefited from many methodological contributions (see, for example, Banker et al. (1984), Andersen and Petersen (1993), Tone (2001, 2002) and Seiford and Zhu (2003)). Despite the growing use of DEA, to the best of our knowledge, no published work has formally addressed out-of-sample evaluation in DEA. In this paper, we fill this gap by proposing a framework for the out-of-sample evaluation of decision making units.
We illustrate the use of the proposed framework in bankruptcy prediction of companies listed on the London Stock Exchange. Note that prediction of risk class or bankruptcy is one of the major activities in auditing firms' risks and uncertainties. The design of reliable models to predict bankruptcy is crucial for many decision making processes. Bankruptcy prediction models could be divided into two broad categories depending on whether they are static (see, for example, Altman 1968, 1983; Taffler 1984; Theodossiou 1991; Ohlson 1980; Zmijewski 1984) or dynamic (see, for example, Shumway 2001; Bharath and Shumway 2008; Hillegeist et al. 2004). In this paper we shall focus on the first category of models to illustrate how out-of-sample evaluation of companies could be performed. The most popular static bankruptcy prediction models are based on statistical methodologies (e.g., Altman 1968, 1983; Taffler 1984), stochastic methodologies (e.g., Theodossiou 1991; Ohlson 1980; Zmijewski 1984), and artificial intelligence methodologies (e.g., Kim and Han 2003; Li and Sun 2011; Zhang et al. 1999; Shin et al. 2005). DEA methodologies are increasingly gaining popularity in bankruptcy prediction (e.g., Cielen et al. 2004; Paradi et al. 2004; Premachandra et al. 2011; Shetty et al. 2012). However, the issue of out-of-sample evaluation remains to be addressed when DEA is used as a classifier.
The remainder of this paper is organised as follows. In Sect. 2, we propose a formal framework for performing out-of-sample evaluation in DEA. In Sect. 3, we provide information on the bankruptcy data we used along with details on the design of our experiment, and present our empirical findings. Finally, Sect. 4 concludes the paper.

A framework for out-of-sample evaluation in DEA
Out-of-sample evaluation of statistical, stochastic and artificial intelligence methodologies for the prediction of both continuous and discrete variables is now commonly used to validate prediction models and test their performance before actual implementation. The rationale for out-of-sample testing lies in the following well-known facts. First, models or methods selected on the basis of in-sample performance may not best predict post-sample data. Second, in-sample errors are likely to understate prediction errors. Third, for continuous variables, prediction intervals built on in-sample standard errors are likely to be too narrow. The standard out-of-sample analysis framework requires one to split the historical data set into two subsets: the first subset, often referred to as a training set, an estimation set, or an initialization set, is used to estimate the parameters of a model, whereas the second subset, generally referred to as the test set or the hold-out set, is used to test the prediction performance of the fitted model. The counterpart of this testing framework is lacking in DEA. In this paper, we propose an out-of-sample evaluation framework for static DEA models. The proposed framework is general in that it can be used for any classification problem, any number of classes and any application. Note that, for the sake of illustrating the empirical performance of our framework and without loss of generality, the proposed framework is customized for a bankruptcy prediction application with two risk classes (e.g., bankrupt class and non-bankrupt class, or low risk of bankruptcy class and high risk of bankruptcy class), as is customary in most research on bankruptcy prediction.
Obviously this risk classification into two classes could be refined, if the researcher or analyst wished to do so, into more than two classes when the presence of non-zero slacks is suspected or proven to be a driver of bankruptcy; for example, one might be interested in refining each of the above mentioned risk classes into two subclasses depending on whether the slacks of a bankrupt (respectively, non-bankrupt) DMU sum to zero or not. In other practical settings, the researcher or analyst might be interested in the level or degree of distress prior to bankruptcy, in which case one might also consider more than two risk or distress classes. In the remainder of this paper, we denote the variable on risk class membership as $Y$. Hereafter, we describe the main steps of the proposed out-of-sample evaluation framework for DEA:
Input: data set of historical observations, say $X$, where each observation is a DMU (e.g., firm-year observations where firms are listed on the London Stock Exchange) along with the corresponding available information (e.g., financial ratios) and the observed risk or bankruptcy status $Y$;
Fig. 1 Flowchart of out-of-sample evaluation framework for static DEA models
[...] such application. For the bankruptcy application, two main categories of DEA models could be used; namely, best efficiency frontier-based models (e.g., Charnes et al. 1978; Banker et al. 1984; Tone 2001) and worst efficiency frontier-based models (e.g., Paradi et al. 2004). Within each of these categories one could choose from a variety of DEA models. Note that the main difference between the best efficiency frontier-based models and the worst efficiency frontier-based models lies in the choice of the definition of the efficiency frontier.
To be more specific, best efficiency frontier-based DEA models assume that the efficiency frontier is made of the best performers, whereas the worst efficiency frontier-based DEA models assume that the efficiency frontier is made of the worst performers (i.e., riskiest DMUs). In risk modelling and analysis applications, such as bankruptcy prediction, both types of frontiers or DEA models are appropriate to use; however, the classification rules used in step 2 and step 3 of the detailed procedure would have to be chosen accordingly.
For illustration purposes, in our empirical investigation, we used both a BCC model (Banker et al. 1984) and an SBM model (Tone 2001) and implemented each of them within the best efficiency frontier framework. Notice that, since our data consist of financial ratios which could take negative values, the SBM model was implemented within a variable returns-to-scale framework; that is, the convexity constraint was added to the model. These models are presented in Tables 1 and 2, where the parameter $x_{i,j}$ denotes the amount of input $i$ used by DMU$_j$, the parameter $y_{r,j}$ denotes the amount of output $r$ produced by DMU$_j$, the decision variable $\lambda_j$ denotes the weight assigned to DMU$_j$'s inputs and outputs in constructing the ideal benchmark of a given DMU, say DMU$_k$, the decision variable $\theta_k$ denotes the technical efficiency score of DMU$_k$, and the decision variable $\rho_k$ denotes the slacks-based measure (SBM) for DMU$_k$.
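For reference, the input-oriented BCC model and the oriented SBM measures can be written compactly in the notation just introduced. The block below is a sketch based on the textbook formulations of Banker et al. (1984) and Tone (2001), under the assumption that Tables 1 and 2 follow them. The slack symbols $s^-_{i,k}$ (input excess) and $s^+_{r,k}$ (output shortfall) are defined in the last line, and the ratio terms in the SBM objectives presuppose non-zero data.

```latex
\begin{align*}
\text{BCC, input-oriented:}\quad
  & \min\ \theta_k \\
  \text{s.t.}\quad
  & \sum_{j=1}^{n} \lambda_j x_{i,j} \le \theta_k \, x_{i,k}, && i = 1,\ldots,m, \\
  & \sum_{j=1}^{n} \lambda_j y_{r,j} \ge y_{r,k},             && r = 1,\ldots,s, \\
  & \sum_{j=1}^{n} \lambda_j = 1, \qquad \lambda_j \ge 0,     && j = 1,\ldots,n. \\[6pt]
\text{SBM, input-oriented:}\quad
  & \rho_k = \min\ 1 - \frac{1}{m} \sum_{i=1}^{m} \frac{s^-_{i,k}}{x_{i,k}} \\[6pt]
\text{SBM, output-oriented:}\quad
  & \rho_k = \min\ \frac{1}{1 + \dfrac{1}{s} \displaystyle\sum_{r=1}^{s} \frac{s^+_{r,k}}{y_{r,k}}} \\[6pt]
\text{with slacks}\quad
  & s^-_{i,k} = x_{i,k} - \sum_{j=1}^{n} \lambda_j x_{i,j} \ge 0, \qquad
    s^+_{r,k} = \sum_{j=1}^{n} \lambda_j y_{r,j} - y_{r,k} \ge 0 .
\end{align*}
```

Both SBM measures are minimized subject to the same convexity and non-negativity requirements as the BCC model, so that a DMU with all slacks equal to zero obtains $\rho_k = 1$.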
Table 1 BCC models
$\sum_{j=1}^{n} \lambda_j x_{i,j} \le x_{i,k};\ \forall i$ or $\sum_{j=1}^{n} \lambda_j x_{i,j} \le \theta_k \cdot x_{i,k};\ \forall i$: For each input $i$ ($i = 1, \ldots, m$), the amount used by DMU$_k$'s "ideal" benchmark, i.e., its projection on the efficient frontier ($\sum_{j=1}^{n} \lambda_j x_{i,j}$), should at most be equal to the amount used by DMU$_k$, whether revised (i.e., amount of input $i$ adjusted for the degree of technical efficiency of DMU$_k$) or not, depending on whether the model is input-oriented or not
$\sum_{j=1}^{n} \lambda_j y_{r,j} \ge y_{r,k};\ \forall r$ or $\sum_{j=1}^{n} \lambda_j y_{r,j} \ge \theta_k \cdot y_{r,k};\ \forall r$: For each output $r$ ($r = 1, \ldots, s$), the amount produced by DMU$_k$'s "ideal" benchmark, i.e., its projection on the efficient frontier ($\sum_{j=1}^{n} \lambda_j y_{r,j}$), should be at least as large as the amount produced by DMU$_k$, whether revised (i.e., amount of output $r$ adjusted for the degree of technical efficiency of DMU$_k$) or not, depending on whether the model is output-oriented or not
$\sum_{j=1}^{n} \lambda_j = 1$: The technology is required to be convex

Table 2 SBM models
Objective; that is, input-oriented SBM measure $\rho_k$
Objective; that is, output-oriented SBM measure $\rho_k$
$\sum_{j=1}^{n} \lambda_j x_{i,j} \le x_{i,k};\ \forall i$: For each input $i$ ($i = 1, \ldots, m$), the amount used by DMU$_k$'s "ideal" benchmark, i.e., its projection on the efficient frontier, should at most be equal to the amount used by DMU$_k$
$\sum_{j=1}^{n} \lambda_j y_{r,j} \ge y_{r,k};\ \forall r$: For each output $r$ ($r = 1, \ldots, s$), the amount produced by DMU$_k$'s "ideal" benchmark, i.e., its projection on the efficient frontier, should be at least as large as the amount produced by DMU$_k$
$\sum_{j=1}^{n} \lambda_j = 1$: The technology is required to be convex
$\lambda_j \ge 0;\ \forall j$; $s^-_{i,k} \ge 0;\ \forall i$; $s^+_{r,k} \ge 0;\ \forall r$: Non-negativity requirements

Table 3 Generic procedure for computing an optimal DEA score-based cut-off point and the corresponding classification
Input: choice of a performance measure $\pi$ and a non-linear programming search algorithm according to the properties of $\pi$
Step 1: compute $\xi_{LB}$ and $\xi_{UB}$
Step 2: find the optimal value of $\xi$ with respect to $\pi$, say $\xi^*$, within the interval $[\xi_{LB}, \xi_{UB}]$ using the chosen non-linear programming search algorithm
Step 3: classify DMUs in $X_E^{I-O}$ into two classes, namely bankrupt and non-bankrupt firms or DMUs; that is, DMUs with DEA scores less (respectively, greater) than $\xi^*$ are assigned to a bankruptcy class and those with DEA scores greater (respectively, less) than or equal to $\xi^*$ are assigned to a non-bankruptcy class if a best practice (respectively, worst practice) efficiency frontier framework was adopted to compute DEA scores
Output: optimal DEA score-based cut-off point $\xi^*$ along with the predicted risk classes $\hat{Y}_E^{I-O}$

Decision rule for classifying DMUs in the training sample
Several decision rules could be used to classify the DMUs in the training sample. Obviously the choice of a decision rule for classification depends on the nature of the classification problem. To be more specific, decision rules would vary depending on whether one is concerned with a two-class problem or a multi-class problem. In bankruptcy prediction we are concerned with a two-class problem; therefore, we shall provide a solution that is suitable for these problems. In fact, we propose a DEA score-based cut-off point procedure to classify DMUs in $X_E^{I-O}$. The proposed procedure involves solving an optimization problem whereby the DEA score-based cut-off point, say $\xi$, is determined so as to optimize a given performance measure, say $\pi$, over an interval with a lower bound, say $\xi_{LB}$, equal to the smallest DEA score of DMUs in $X_E^{I-O}$ and an upper bound, say $\xi_{UB}$, equal to the largest DEA score of DMUs in $X_E^{I-O}$. In sum, the proposed procedure is based on a performance measure-dependent approach (see Table 3 for a generic procedure). Note that, in most applications, the performance measure $\pi$ is a non-linear function. The choice of a specific optimization algorithm for the implementation of the generic procedure outlined in Table 3 depends on whether the performance measure $\pi$ is differentiable or not and, if it is non-differentiable, whether it is quasiconvex or not. To be more specific, if $\pi$ is differentiable, then one could choose Bisection Search; if $\pi$ is twice differentiable, then one could choose Newton's Method; if $\pi$ is non-differentiable but quasiconvex, then one could choose Golden Section Search, Fibonacci Search, Dichotomous Search, or a brute force search such as Uniform Search. For details on these standard non-linear programming algorithms, the reader is referred to the excellent book on non-linear programming by Bazaraa et al. (2006).
Notice that the last step of this generic procedure classifies DMUs in the training sample into two classes, namely bankrupt and non-bankrupt firms or DMUs, and thus the output is the optimal DEA score-based cut-off point $\xi^*$ along with the predicted risk classes $\hat{Y}_E^{I-O}$.
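The generic procedure of Table 3 can be sketched in a few lines of Python. The sketch below is illustrative only: it assumes a best practice frontier (low scores indicate bankruptcy), uses Uniform Search over an evenly spaced grid, and takes plain classification accuracy as the performance measure $\pi$. The function names, the grid size and the label coding (1 for bankrupt, 0 for non-bankrupt) are hypothetical choices, not part of the original procedure.

```python
from typing import Callable, List, Sequence, Tuple

def classify(scores: Sequence[float], cutoff: float) -> List[int]:
    """Best practice frontier rule: score < cutoff -> bankrupt (1), else non-bankrupt (0)."""
    return [1 if s < cutoff else 0 for s in scores]

def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Illustrative performance measure (an assumption); larger is better."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def best_cutoff(
    scores: Sequence[float],
    y_true: Sequence[int],
    measure: Callable[[Sequence[int], Sequence[int]], float] = accuracy,
    grid_size: int = 101,
) -> Tuple[float, List[int]]:
    """Uniform Search for the cut-off that maximizes `measure` on the training sample."""
    # Step 1: bounds are the smallest and largest DEA scores in the training sample.
    xi_lb, xi_ub = min(scores), max(scores)
    step = (xi_ub - xi_lb) / (grid_size - 1) if xi_ub > xi_lb else 0.0
    # Step 2: evaluate the measure at each grid point and keep the first best one.
    best_xi, best_val = xi_lb, float("-inf")
    for g in range(grid_size):
        xi = xi_lb + g * step
        val = measure(y_true, classify(scores, xi))
        if val > best_val:
            best_xi, best_val = xi, val
    # Step 3: classify the training DMUs with the selected cut-off.
    return best_xi, classify(scores, best_xi)
```

To minimize an error measure such as T1 or T2 instead, one would pass its negative as `measure`; a worst practice frontier would simply reverse the inequality in `classify`.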

Algorithm for classifying DMUs in the test sample
A variety of algorithms could be used for out-of-sample classification of DMUs in $X_T^{I-O}$, ranging from standard statistical and stochastic methodologies to artificial intelligence methodologies. In this paper, we propose an instance of our generic out-of-sample evaluation procedure for DEA where the out-of-sample classification of DMUs in $X_T^{I-O}$ is performed using a k-Nearest Neighbor (k-NN) algorithm, which itself is an instance of case-based reasoning. The pseudo-code for k-NN is customized for our application and is summarized in Fig. 2. Note that the k-NN algorithm is also generic in that a number of implementation decisions have to be made; namely, the size of the neighborhood k, the similarity or distance metric, and the classification criterion. In our experiments, we tested several values of k as well as several distance metrics (i.e., Euclidean, Standardized Euclidean, Cityblock, Hamming, Jaccard, Cosine, Correlation, Mahalanobis). As to the classification criterion, we opted for the most commonly used one; that is, majority vote. Note that, when computing the distance between two DMUs, each DMU is represented by its vector of inputs and outputs.
Fig. 2 Pseudo-code of the k-NN algorithm
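A minimal sketch of k-NN classification with majority vote is given below for illustration. It is not a transcription of Fig. 2: the function names are hypothetical, only the Euclidean metric is implemented, and the training labels are assumed to be whatever risk classes the analyst attaches to the training DMUs (for instance, the classes produced by the cut-off procedure). Each DMU is represented by its vector of inputs and outputs, as stated above.

```python
import math
from collections import Counter
from typing import Callable, Sequence

Vector = Sequence[float]

def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two input-output vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(
    train_X: Sequence[Vector],
    train_y: Sequence[int],
    query: Vector,
    k: int = 3,
    distance: Callable[[Vector, Vector], float] = euclidean,
) -> int:
    """Assign `query` the majority class among its k nearest training DMUs."""
    # Rank training DMUs by their distance to the test DMU.
    ranked = sorted(range(len(train_X)), key=lambda j: distance(train_X[j], query))
    # Majority vote over the k closest neighbours.
    votes = Counter(train_y[j] for j in ranked[:k])
    return votes.most_common(1)[0][0]
```

With two classes and an odd k, as in the values 3, 5 and 7, the vote cannot tie. Other metrics can be supplied through the `distance` argument.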

Computing efficiency scores of DMUs in the test sample
In order to compute the DEA score of those DMUs in $X_T^{I-O}$ [...]
To conclude this section, we would like to provide some explanation as to why the proposed framework should produce good results. As the reader is aware by now, the proposed out-of-sample evaluation framework is based on an instance of the case-based reasoning (CBR) methodology; namely, the k-NN algorithm. CBR is a generic problem solving methodology, which solves a specific problem by exploiting solutions to similar problems. In sum, CBR relies on past experience and its comparison with the current experience, and therefore uses analogy by similarity. To be more specific, the basic methodological process of this artificial intelligence methodology involves pattern matching and classification. In our bankruptcy application, pattern matching would serve to identify DMUs with similar risk profiles (e.g., liquidity profiles in our experiments) and therefore is well equipped to discriminate between bankrupt and non-bankrupt firms. The extent of its empirical performance, however, would depend on whether the data or case base is noisy or not, the choice of the similarity criteria and their measures, the relevance of the features selected (i.e., inputs and outputs in the DEA context) and their weights, if any, and the choice of the classification rule, also known as a target function, as well as the quality of approximation of the target function. In our case, k-NN serves as a local approximation. For more details on CBR, the reader is referred to, for example, Richter and Weber (2013).
In the next section, we shall test the performance of our out-of-sample evaluation framework for DEA and report our numerical results.

Empirical analysis
In this section, we first describe the process of data gathering and sample selection (see Sect. 3.1). Then, we present the design of our experiment (see Sect. 3.2). Finally, we present and discuss our numerical results (see Sect. 3.3).

Data and sample selection
In this paper, we first considered all UK firms listed on the London Stock Exchange (LSE) during a 5-year period from 2010 through 2014 and defined the bankrupt firms using the London Share Price Database (LSPD) codes 16 (i.e., firm has receiver appointed or is in liquidation), 20 (i.e., firm is in administration or administrative receivership), and 21 (i.e., firm is cancelled and assumed valueless); the remaining firms are classified as non-bankrupt. Then, we further reduced this dataset by excluding financial and utilities firms, on the one hand, and those firms with less than a 5-month lag between the reporting date and the fiscal year, on the other hand. As a result of these data reduction rules, the final dataset consists of 6605 firm-year observations, including 407 (6.16%) observations related to bankrupt firms and 6198 (93.84%) observations related to non-bankrupt firms. Therefore, we have a total of 6605 decision making units (DMUs). As to the selection of the training sample and the test sample, we have chosen the size of the training sample to be twice the size of the test sample; that is, 2/3 of the total number of DMUs were used in the training sample and the remaining 1/3 were used in the test sample. The selection of observations was done with random sampling without replacement so as to ensure that both the training sample and the test sample have the same proportions of bankrupt and non-bankrupt firms. A total of thirty pairs of training sample-test sample were generated.
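The sampling scheme described above can be illustrated with the following sketch, which assumes that equal class proportions are obtained by sampling without replacement within each class separately. The function name, the seeds and the label coding (1 for bankrupt, 0 for non-bankrupt) are hypothetical, and the label vector is a placeholder that only reproduces the class counts reported in the text, not the actual data.

```python
import random
from typing import List, Sequence, Tuple

def stratified_split(
    labels: Sequence[int], train_fraction: float = 2 / 3, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Split observation indices into training and test sets, class by class."""
    rng = random.Random(seed)
    train: List[int] = []
    test: List[int] = []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)                      # random order, no replacement
        cut = round(len(idx) * train_fraction)
        train.extend(idx[:cut])               # about 2/3 of the class
        test.extend(idx[cut:])                # remaining 1/3 of the class
    return sorted(train), sorted(test)

# Placeholder labels with the reported class counts: 407 bankrupt, 6198 non-bankrupt.
labels = [1] * 407 + [0] * 6198
# Thirty training sample-test sample pairs, one per (assumed) random seed.
pairs = [stratified_split(labels, seed=s) for s in range(30)]
```

Because the split is done within each class, every pair keeps the share of bankrupt observations close to 6.16% in both samples.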

Design of experiment
In our experiment, we reworked a standard and well known parametric model in the DEA framework; namely, the multivariate discriminant analysis (MDA) model of Taffler (1984) to provide some empirical evidence on the merit of the proposed out-of-sample evaluation framework for DEA. Recall that Taffler's model makes use of four explanatory variables; namely, current liabilities to total assets, number of credit intervals, profit before tax to current liabilities, and current assets to total liabilities. In our DEA models, current liabilities to total assets and number of credit intervals were used as inputs, whereas profit before tax to current liabilities and current assets to total liabilities were used as outputs. We report on the performance of our out-of-sample evaluation framework for DEA using the commonly used metrics; namely, type I error (T1), type II error (T2), sensitivity (Sen) and specificity (Spe). Recall that T1 is the proportion of bankrupt firms predicted as non-bankrupt; T2 is the proportion of non-bankrupt firms predicted as bankrupt; Sen is the proportion of non-bankrupt firms predicted as non-bankrupt; and Spe is the proportion of bankrupt firms predicted as bankrupt.
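The four metrics can be computed directly from their definitions above, as in the following sketch. The function name and the label coding (1 for bankrupt, 0 for non-bankrupt) are assumptions made for illustration; note that, under these definitions, Sen = 1 − T2 and Spe = 1 − T1.

```python
from typing import Dict, Sequence

BANKRUPT, NON_BANKRUPT = 1, 0  # hypothetical label coding

def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Return T1, T2, Sen and Spe as proportions, following the definitions in the text."""
    n_bankrupt = sum(1 for t in y_true if t == BANKRUPT)
    n_non_bankrupt = len(y_true) - n_bankrupt
    if n_bankrupt == 0 or n_non_bankrupt == 0:
        raise ValueError("both classes must be present in y_true")
    # Bankrupt firms predicted as non-bankrupt.
    missed = sum(1 for t, p in zip(y_true, y_pred) if t == BANKRUPT and p == NON_BANKRUPT)
    # Non-bankrupt firms predicted as bankrupt.
    false_alarms = sum(1 for t, p in zip(y_true, y_pred) if t == NON_BANKRUPT and p == BANKRUPT)
    t1 = missed / n_bankrupt
    t2 = false_alarms / n_non_bankrupt
    return {"T1": t1, "T2": t2, "Sen": 1.0 - t2, "Spe": 1.0 - t1}
```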

Results
Hereafter, we shall provide a summary of our empirical results and findings. Table 4 provides a summary of statistics on the performance of the MDA model of Taffler (1984) reworked within the best efficiency frontier framework using BCC and SBM models. Note that both in-sample and out-of-sample statistics reported correspond to DEA score-based cut-off points optimized for each performance measure separately (i.e., T1, T2, Sen, Spe). Note also that we ran tests for several values of the size of the neighborhood k (i.e., 3, 5, 7); however, the results reported are for k = 3 since higher values delivered very close performances but required more computations. With respect to in-sample performance, our results demonstrate that DEA provides an outstanding classifier regardless of the choices of classification measures and DEA models. Next, we provide empirical evidence to demonstrate that the proposed out-of-sample evaluation framework achieved a very high performance in classifying DMUs into the right risk category (see Tables 6, 7, 8, 9 and 10). In fact, regardless of which DEA model is chosen to compute the scores, the out-of-sample performance of the proposed framework is ideal, with T1 and T2 being 0% and sensitivity and specificity being 100%, when Hamming and Jaccard metrics are used to compute the distances between training sample and test sample observations or DMUs. As to the remaining metrics, they deliver average performances ranging from −0.05 to 18%. It is worth mentioning, however, that the choice of SBM-OO and SBM models combined with Euclidean and Cityblock metrics drives the performance of the proposed framework to an unexpectedly high level, with an average performance of −0.05%, suggesting that the proposed framework fed with the right decisions could even strengthen in-sample DEA analysis.
Once again, the proposed out-of-sample evaluation framework for DEA proves to be superior to Discriminant Analysis out-of-sample (see Table 5) with differences, for example, in average performance of 79-98% on T1, 0.26% on T2 and Sen, and 63-82% on Spe in favor of DEA.

Conclusions
Out-of-sample evaluation is commonly used for validating prediction models of both continuous and discrete variables and testing their performance. The counterpart of this evaluation framework is lacking in DEA. This paper fills this gap. In fact, we proposed a generic out-of-sample evaluation framework for DEA and tested the performance of an instance of it in bankruptcy prediction. The accuracy of our framework, as evidenced by our numerical results, suggests that this tool could prove valuable in industry implementations of DEA models in bankruptcy prediction and credit scoring. We also provided empirical evidence that DEA as a classifier is a real contender to Discriminant Analysis, which is one of the classifiers most commonly used by practitioners.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.