Introduction

Since the banking industry is highly competitive, performance assessment has been receiving growing attention. Banks are in a race to offer better, or the best, services, which intensifies competition in the marketplace. Therefore, bank management involves identifying and eliminating the underlying causes of inefficiency to help firms improve. In the literature, several frontier approaches are widely used for evaluating bank efficiency: data envelopment analysis (DEA), a leading approach for performance analysis and benchmark discovery, as in Sherman and Gold (1985), Soteriou and Zenios (1999), Golany and Storbeck (1999), and Athanassopoulos and Giokas (2000); the thick frontier approach (TFA), as in Berger and Humphrey (1991), Clark (1996), and DeYoung (1998); the free disposal hull (FDH), as in Tulkens (1993) and Chang (1999); the stochastic frontier approach (SFA), also called the econometric frontier approach (EFA), as in Kaparakis et al. (1994), Berger and Humphrey (1997), and Hao et al. (2001); and the distribution-free approach (DFA), as in Berger et al. (1993) and DeYoung (1997).

As DEA can hardly predict the performance of other decision-making units, Wang (2003) used an artificial neural network (ANN) to assist in estimating efficiency. Athanassopoulos and Curram (1996) first introduced the combination of neural networks and DEA for classification and/or prediction. They used DEA in a multi-input, multi-output banking setting (four inputs, three outputs) to screen the training cases in their study. Their comparison demonstrated that DEA is superior to ANN for measurement purposes. Azadeh et al. (2006, 2007a, b) utilized a highly flexible ANN algorithm to measure and rank the performance of decision-making units (DMUs). They applied the algorithm to the efficiency calculation of Iranian steam power plants in 2004. The results demonstrate that the proposed algorithm estimates efficiency values closer to the ideal efficiency. Finally, they showed that the results of the proposed algorithm are more robust than those of the conventional approach, as better performance patterns were discovered. Furthermore, they proposed a method to integrate their previous algorithms (Azadeh et al. 2007a, b). Azadeh et al. (2011) also used the combination of DEA, ANN, and rough set theory (RST) to determine the impact of critical personnel attributes on efficiency. Wu et al. (2006) combined DEA and ANN for measuring the performance of a large Canadian bank. They concluded that the DEA–ANN method produces a more robust frontier and helps to identify more efficient units; furthermore, for inefficient units it provides guidance on how to improve their performance to different efficiency ratings. Finally, they noted that the ANN approach requires no assumptions about the production function (the major drawback of the parametric approach) and is highly flexible, and that the weakness of DEA in forecasting is the reason to use ANN (Wu et al. 2006).

Along the same lines, Rahimi and Behmanesh (2012) employed the combined method to predict DMU performance.

Recently, Gutierrez and Lozano (2010) combined DEA and ANN to enhance the traditional Taguchi method, using ANN to estimate quality loss measures for unobserved factor combinations and exploiting the non-parametric character of DEA for the performance evaluation of all factor combinations. Subsequently, Bashiri et al. (2013) combined DEA and ANN to solve a Taguchi-based multi-response optimization problem for processes in which the controllable factors are smaller-the-better (STB) variables and the analyst seeks an optimal solution with smaller values of the controllable factors.

Classic DEA methods cannot provide benchmarks for the future, whereas ANN has proved a useful tool for predicting system behavior. This paper integrates DEA and neural networks to cover the shortcomings we faced while using DEA alone. As a result, benchmarks are based on future data, and inefficient DMUs obtain better performance patterns with which to improve their efficiency.

The paper is organized as follows. “Problem definition” section briefly reviews neural networks and DEA. “ANN–DEA” section demonstrates the models and methodology utilized in this paper. The DEA results and further discussion are given in “Computational results” section. Finally, our conclusions and future work are offered in “Conclusions and future works” section.

Problem definition

Data envelopment analysis

DEA is a non-parametric method, which uses linear programming to calculate the efficiency in a given set of decision-making units (DMUs).

The DMUs that make up the frontier envelope receive a score of 1. The remaining, less efficient firms receive relative efficiency scores on a scale of 0–1.

The envelopment surface represents best practice and indicates how inefficient DMUs can improve to become efficient. DEA provides a comprehensive analysis of relative efficiencies in multiple input–multiple output situations by evaluating each DMU’s performance relative to an envelopment surface composed of efficient DMUs. Units that lie on the surface are called efficient by DEA, while those that do not are called inefficient. The efficient reference set comprises the DMUs that form the peer group for the inefficient units.

The projection of an inefficient unit onto the envelopment surface is called its benchmark. Benchmarks indicate how the inefficient DMU can improve to become efficient: once the evaluated DMU attains these input and output levels, it is efficient.

Assume inputs and outputs \(\left( {X_{j} ,Y_{j} } \right)\) for j = 1,…,n DMUs, where \(X_{j} = \left( {x_{1j} , \ldots ,x_{ij} , \ldots ,x_{mj} } \right)\) is a vector of observed inputs and \(Y_{j} = \left( {y_{1j} , \ldots ,y_{rj} , \ldots ,y_{sj} } \right)\) is a vector of observed outputs for DMU\(_{j}\).

The production possibility set is as below:

$$T = \left\{ {\left( {X,Y} \right)|Y \ge 0\;{\text{can}}\;{\text{be}}\;{\text{produced}}\;{\text{from}}\;X \ge 0} \right\}$$

The input possibility set L(Y), for each Y, and the output possibility set P(X), for each X, are defined as below:

$$L\left( Y \right) = \left\{ {X|\left( {X,Y} \right) \in T} \right\}$$
$$P\left( X \right) = \left\{ {Y|\left( {X,Y} \right) \in T} \right\}$$

To construct the production possibility set T, the following properties are postulated:

  1. Convexity:

    $$\begin{array}{*{20}l} {{\text{If}}\left( {X_{j} ,Y_{j} } \right) \in T,\;j = 1, \ldots ,n,\;{\text{and}}\;\lambda _{j} \ge 0\;{\text{are}}\;{\text{nonnegative}}\;{\text{scalars}}\;{\text{such}}\;{\text{that}}\;\sum\nolimits_{{j = 1}}^{n} {\lambda _{j} = 1} ,\;{\text{then}}} \hfill \\ {\left( {\sum\nolimits_{{j = 1}}^{n} {\lambda _{j} X_{j} } ,\sum\nolimits_{{j = 1}}^{n} {\lambda _{j} Y_{j} } } \right) \in T} \hfill \\ \end{array}$$

    where λ is a vector of coefficients.

  2. Inefficiency postulate:

    $$\begin{array}{*{20}l} {\left( {\text{a}} \right)\;{\text{If}}\left( {X,Y} \right) \in T\;{\text{and}}\;\overline{X} \ge X,\;{\text{then}}\;\left( {\overline{X} ,Y} \right) \in T} \hfill \\ {\left( {\text{b}} \right)\;{\text{If}}\left( {X,Y} \right) \in T\;{\text{and}}\;\overline{Y} \le Y,\;{\text{then}}\;\left( {X,\overline{Y} } \right) \in T} \hfill \\ \end{array}$$
  3. Ray unboundedness:

    $${\text{If}}\left( {X,Y} \right) \in T,\;{\text{then}}\;\left( {KX,KY} \right) \in T\;{\text{for}}\;{\text{any}}\;K > 0$$
  4. Minimum extrapolation: T is the intersection of all sets \(\overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{T}\) satisfying postulates 1, 2, and 3, subject to the condition that each observed vector \({\kern 1pt} \left( {X_{j} ,Y_{j} } \right) \in \overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{T} ,\;j = 1, \ldots ,n\).

    Under variable returns to scale (i.e., dropping the ray unboundedness postulate), these assumptions yield the set \(T_{v}\) below:

    $$T_{v} = \left\{ {\left( {\begin{array}{*{20}c} X \\ Y \\ \end{array} } \right)\left| {X \ge \sum\limits_{j = 1}^{n} {X_{j} \lambda_{j} } ,\;Y \le \sum\limits_{j = 1}^{n} {Y_{j} \lambda_{j} } ,\;\sum\limits_{j = 1}^{n} {\lambda_{j} } = 1,\;\lambda \ge 0} \right.} \right\}$$
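To make the definition of \(T_v\) concrete, membership of a candidate point can be checked with a small linear feasibility problem: a point belongs to \(T_v\) exactly when some convex combination of the observed DMUs uses no more input and produces no less output. The sketch below is illustrative only (not part of the paper's method); it assumes NumPy and SciPy are available, and the toy data are invented for the example.

```python
import numpy as np
from scipy.optimize import linprog

def in_Tv(X, Y, x0, y0):
    """Check whether the point (x0, y0) belongs to the VRS production
    possibility set Tv spanned by the observed DMUs (columns of X, Y)."""
    m, n = X.shape
    c = np.zeros(n)                  # pure feasibility problem, objective 0
    A_ub = np.vstack([X, -Y])        # sum X_j lam_j <= x0, sum Y_j lam_j >= y0
    b_ub = np.concatenate([x0, -y0])
    A_eq = np.ones((1, n))           # convexity: sum lam_j = 1
    b_eq = np.array([1.0])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * n, method="highs")
    return res.status == 0           # status 0 means a feasible lambda exists

X = np.array([[2.0, 4.0], [4.0, 2.0]])   # two inputs, two DMUs
Y = np.array([[1.0, 1.0]])               # one output
inside = in_Tv(X, Y, np.array([3.0, 3.0]), np.array([1.0]))
outside = in_Tv(X, Y, np.array([1.0, 1.0]), np.array([1.0]))
```

Here the midpoint of the two observed DMUs lies inside \(T_v\), while a point that strictly dominates both observed units does not.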

Different models for calculating efficiency have been introduced; a classic one under variable returns to scale is the BCC model (Banker et al. 1984):

Input-oriented BCC Model

$$\begin{array}{*{20}l} {{\text{Minimize}}\;\theta - \varepsilon \left( {1s^{ + } + 1s^{ - } } \right)} \hfill \\ {{\text{Subject}}\;{\text{to}},} \hfill \\ \end{array}$$
$$\sum\limits_{j = 1}^{n} {x_{ij} \lambda_{j} + s_{i}^{ - } = \theta x_{iq} \quad i = 1,2, \ldots ,m}$$
$$\sum\limits_{j = 1}^{n} {y_{ij} \lambda_{j} - s_{r}^{ + } = y_{rq} \quad r = 1,2, \ldots ,s}$$
$$\sum\limits_{j = 1}^{n} {\lambda_{j} = 1}$$
$$\lambda_{j} ,s_{i}^{ - } ,s_{r}^{ + } \ge 0\quad i = 1,2, \ldots ,m\quad r = 1,2, \ldots s\quad j = 1, \ldots ,n$$

A DMU is called efficient, if it has \(\theta^{*} = 1,\,s_{i}^{ - *} = 0,\,s_{r}^{ + *} = 0\). Otherwise, it is called inefficient.

For an inefficient DMU (e.g., DMU q ), the DEA model calculates the benchmark, as follows:

$$\left( {\begin{array}{*{20}c} {X_{q} \cdot \theta^{*} - s^{ - *} } \\ {Y_{q} + s^{ + *} } \\ \end{array} } \right) = \left( {\begin{array}{*{20}c} {\sum\limits_{j = 1}^{n} {\lambda_{j}^{*} X_{j} } } \\ {\sum\limits_{j = 1}^{n} {\lambda_{j}^{*} Y_{j} } } \\ \end{array} } \right)$$
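The input-oriented BCC program and the benchmark projection above can be sketched as a single linear program. The following is a minimal illustration, not the authors' implementation; it assumes SciPy's `linprog` is available and uses an invented toy data set (two inputs, one output, four DMUs).

```python
import numpy as np
from scipy.optimize import linprog

def bcc_input_oriented(X, Y, q, eps=1e-6):
    """Solve the input-oriented BCC envelopment LP for DMU q.
    X: (m, n) inputs, Y: (s, n) outputs, columns are DMUs.
    Returns (theta, lambdas, input slacks, output slacks)."""
    m, n = X.shape
    s, _ = Y.shape
    # variable order: [theta, lambda_1..n, s_minus_1..m, s_plus_1..s]
    nvar = 1 + n + m + s
    c = np.zeros(nvar)
    c[0] = 1.0                       # minimize theta ...
    c[1 + n:] = -eps                 # ... minus eps * (sum of slacks)
    A_eq = np.zeros((m + s + 1, nvar))
    b_eq = np.zeros(m + s + 1)
    for i in range(m):               # sum_j x_ij lam_j + s_i^- = theta x_iq
        A_eq[i, 0] = -X[i, q]
        A_eq[i, 1:1 + n] = X[i, :]
        A_eq[i, 1 + n + i] = 1.0
    for r in range(s):               # sum_j y_rj lam_j - s_r^+ = y_rq
        A_eq[m + r, 1:1 + n] = Y[r, :]
        A_eq[m + r, 1 + n + m + r] = -1.0
        b_eq[m + r] = Y[r, q]
    A_eq[m + s, 1:1 + n] = 1.0       # sum_j lam_j = 1 (VRS)
    b_eq[m + s] = 1.0
    bounds = [(None, None)] + [(0, None)] * (n + m + s)
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
    return (res.x[0], res.x[1:1 + n],
            res.x[1 + n:1 + n + m], res.x[1 + n + m:])

# invented toy data: 2 inputs, 1 output, 4 DMUs (columns)
X = np.array([[2.0, 4.0, 8.0, 4.0],
              [4.0, 2.0, 2.0, 6.0]])
Y = np.array([[1.0, 1.0, 1.0, 1.0]])
theta, lam, s_minus, s_plus = bcc_input_oriented(X, Y, q=3)
benchmark_inputs = X @ lam           # projection: sum_j lam_j* X_j
benchmark_outputs = Y @ lam
```

For this toy data, DMU 3 (inputs 4 and 6) is projected onto a convex combination of the two efficient units, which is exactly the benchmark formula above.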

Benchmarks act as alerts when designing new strategies or revising old ones. For each DMU, two aspects should be taken into account:

  1. Eliminate the distance between each DMU and its peer group

  2. Display the frontier in a specific time horizon

Because these benchmarks are based on past data, they cannot show the frontier in a specific future time horizon, and they may still be less efficient than future benchmarks. Therefore, ANN is used to mitigate this issue and to predict the envelopment surface.

Artificial neural networks

The original inspiration for the structure of neural networks comes from the functioning of the human brain. The key factor of this paradigm is the novel structure of the information processing system: a large number of highly interconnected processing neurons working together to solve specific problems. Like people, ANNs learn by example. A neural network is trained by adjusting the weights between neurons so that a given input leads to a target output.

The fast growth of ANN over the last decade has introduced a new dimension into the field of performance measurement, especially in business applications. One of the major application areas of ANNs is forecasting (Sharda 1994). Many different ANN models have been proposed since the 1980s; the multilayer perceptron (MLP), Hopfield networks, and Kohonen’s self-organizing networks are the most influential.

MLP networks are used in many problems, especially forecasting, because of their inherent capability of arbitrary input–output mapping. An MLP comprises several layers of nodes. The lowest layer is the input layer, which receives the information; the highest layer is the output layer, where the problem solution is obtained. Between them lie the hidden layers, which separate the input and output layers. Nodes in adjacent layers are connected by acyclic arcs running from the lower layer to the higher one. Figure 1 shows an example of a fully connected MLP with one hidden layer.

Fig. 1
figure 1

The structure of three-layer MLP network

Most multilayer networks are trained using the back propagation (BP) algorithm for forecasting. BP neural networks consist of a collection of inputs and processing units known as neurons.

BP networks are a class of feed-forward neural networks (“feed-forward” refers to the direction of information flow, from the input to the output layer) with supervised learning rules. In such learning, the network’s forecasts are compared with the known correct answers, and the weights are adjusted based on the resulting forecast error so as to minimize the error function.

For example, to forecast the value x(t + 1) of a time series x(1)…x(t), the lagged values x(t − k + 1)…x(t) are chosen as the inputs to the multilayer network, and the output is the forecast. For training and testing, the network uses large sets of examples extracted from the historical time series.
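The sliding-window construction described above can be sketched as follows; the function name `make_windows` and the toy series are illustrative, not from the paper.

```python
import numpy as np

def make_windows(series, k):
    """Turn a series x(1)..x(t) into supervised pairs:
    inputs are k consecutive lagged values, target is the next value."""
    X, y = [], []
    for t in range(k, len(series)):
        X.append(series[t - k:t])    # the k lagged values
        y.append(series[t])          # the next value to forecast
    return np.array(X), np.array(y)

series = np.arange(10.0)             # invented toy series 0..9
X, y = make_windows(series, k=3)     # X[0] = [0, 1, 2], y[0] = 3
```

Each row of `X` then serves as one input pattern for the MLP, with the corresponding entry of `y` as its target.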

Before an ANN can perform any desired task, it must be trained to do so. Basically, training is the process of determining the arc weights, which are the key elements of an ANN. The learned knowledge is stored in the arcs and nodes in the form of arc weights and node biases. MLP training is supervised: the desired response of the network (the target value) for each input pattern (example) is always available. The training process usually proceeds as follows. First, examples of the training set are presented to the input nodes. Second, the activation values of the input nodes are weighted and accumulated at each node of the first hidden layer. Third, each total is transformed by an activation function into the node’s activation value, which in turn becomes an input to the nodes of the next layer. This process continues until the output activation values are found. The training algorithm seeks the weights that minimize the mean squared error (MSE) or the sum of squared errors (SSE).
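As an illustration of these training steps, the sketch below trains a one-hidden-layer MLP with plain batch gradient descent on the squared error (the paper itself uses Levenberg–Marquardt training); the toy task of learning y = x1 + x2 and all parameter values are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# invented toy task: learn y = x1 + x2 from 200 random points
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = (X[:, 0] + X[:, 1]).reshape(-1, 1)

# one hidden layer (8 tanh units), linear output node
W1 = rng.normal(0.0, 0.5, (2, 8)); b1 = np.zeros(8)
W2 = rng.normal(0.0, 0.5, (8, 1)); b2 = np.zeros(1)
lr = 0.1                              # learning rate

for epoch in range(500):
    # forward pass: weighted sums -> activation -> output
    h = np.tanh(X @ W1 + b1)
    out = h @ W2 + b2
    err = out - y                     # forecast error vs. target
    # backward pass: error gradients (constant factors folded into lr)
    dW2 = h.T @ err / len(X); db2 = err.mean(axis=0)
    dh = (err @ W2.T) * (1.0 - h ** 2)    # tanh derivative
    dW1 = X.T @ dh / len(X); db1 = dh.mean(axis=0)
    # weight update: step against the gradient
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

mse = float(((np.tanh(X @ W1 + b1) @ W2 + b2 - y) ** 2).mean())
```

After training, `mse` should be small, reflecting that the weight updates have minimized the squared error on the training set.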

ANN–DEA

In this research, a multilayer ANN was applied to forecast the inputs and outputs of each DMU over a 5-year horizon. After preliminary analyses and trials, the Levenberg–Marquardt algorithm (the fastest training algorithm) was chosen for the proposed MLP network. The Levenberg–Marquardt algorithm can be considered a trust-region modification of the Gauss–Newton algorithm. Two operations must be considered in MLP networks: training and prediction.
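The Levenberg–Marquardt algorithm is available in standard optimization libraries. As a minimal illustration (assuming SciPy; the exponential toy model below is invented and is not the bank data), a small nonlinear least-squares fit can be run with `method="lm"`:

```python
import numpy as np
from scipy.optimize import least_squares

# invented toy task: recover a and b in y = a * exp(b * x)
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * np.exp(0.5 * x)

def residuals(p):
    a, b = p
    return a * np.exp(b * x) - y      # residual vector to be minimized

# method="lm" selects Levenberg-Marquardt, which damps Gauss-Newton
# steps toward gradient descent when a full step is unreliable
fit = least_squares(residuals, x0=[1.0, 1.0], method="lm")
a_hat, b_hat = fit.x
```

The damping behavior is what makes the method behave like a trust-region modification of Gauss–Newton, as noted above.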

The MLP uses two data sets: the training set for training the MLP and the test set for prediction.

Training begins with arbitrary weight values, which may be random numbers. In each epoch, a complete pass of the training set through the network adjusts the weights, and adjusting the weights reduces the error. Prediction begins with information flowing from the inputs to the outputs: the network produces an estimate of the output from the input values. The resulting error demonstrates the prediction quality of the trained network. The parameters of the estimated artificial neural network are listed in Table 1.

Table 1 Estimated neural network parameters

The estimated neural network includes one hidden layer, and the Levenberg–Marquardt algorithm was selected for training. Figure 2 shows two samples of the training and test regression charts for the proposed ANN, illustrating the good quality of the trained network’s predictions.

Fig. 2
figure 2

Training and testing charts

After forecasting the inputs and outputs by the ANN, the DEA model must be selected for calculating the efficiency and benchmarks.

Since some inputs and outputs in this study can be negative, the DEA model selected for efficiency measurement and benchmarking should not be sensitive to negative data. One of the best models for dealing with negative data is the slacks-based measure (SBM) model.

The SBM model is as follows:

$${\text{Minimize}}\;\rho = \frac{{1 - \frac{1}{m}\sum\nolimits_{i = 1}^{m} {\left( {{{s_{i}^{ - } } \mathord{\left/ {\vphantom {{s_{i}^{ - } } {R_{i}^{ - } }}} \right. \kern-0pt} {R_{i}^{ - } }}} \right)} }}{{1 + \frac{1}{s}\sum\nolimits_{r = 1}^{s} {\left( {{{s_{r}^{ + } } \mathord{\left/ {\vphantom {{s_{r}^{ + } } {R_{r}^{ + } }}} \right. \kern-0pt} {R_{r}^{ + } }}} \right)} }}$$
$${\text{Subject}}\;{\text{to}},$$
$$\sum\limits_{j = 1}^{n} {x_{ij} \lambda_{j} + s_{i}^{ - } = x_{iq} \quad i = 1,2, \ldots ,m}$$
$$\sum\limits_{j = 1}^{n} {y_{ij} \lambda_{j} - s_{r}^{ + } = y_{rq} \quad r = 1,2, \ldots ,s}$$
$$\sum\limits_{j = 1}^{n} {\lambda_{j} = 1\quad j = 1, \ldots ,n}$$
$$\lambda_{j} ,s_{i}^{ - } ,s_{r}^{ + } \ge 0\quad i = 1,2, \ldots ,m\quad r = 1,2, \ldots ,s\quad j = 1, \ldots ,n$$

where \(R_{i}^{ - } = \hbox{max} \left\{ {x_{ij} :j = 1, \ldots ,n} \right\} - \hbox{min} \left\{ {x_{ij} :j = 1, \ldots ,n} \right\}\) and \(R_{r}^{ + } = \hbox{max} \left\{ {y_{rj} :j = 1, \ldots ,n} \right\} - \hbox{min} \left\{ {y_{rj} :j = 1, \ldots ,n} \right\}\). The variables s + and s − measure the distance of the inputs and outputs of a virtual unit from those of the unit evaluated (X q , Y q ). The numerator and the denominator of the objective function measure the average distance of inputs and outputs, respectively, from the efficiency threshold. The condition \(\sum\nolimits_{j = 1}^{n} {\lambda_{j} = 1}\) imposes variable returns to scale.
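The fractional SBM objective above can be linearized with the standard Charnes–Cooper transformation: scale λ and the slacks by t = 1/(1 + (1/s)Σ s⁺/R⁺) and solve a single LP. The sketch below is an illustration of that transformation, assuming SciPy; the one-input, one-output toy data are invented and are not the bank data.

```python
import numpy as np
from scipy.optimize import linprog

def sbm_vrs(X, Y, q):
    """Range-normalized SBM under VRS for DMU q, linearized via the
    Charnes-Cooper transformation (lambda and slacks scaled by t > 0)."""
    m, n = X.shape
    s, _ = Y.shape
    Rm = X.max(axis=1) - X.min(axis=1)        # input ranges R_i^-
    Rp = Y.max(axis=1) - Y.min(axis=1)        # output ranges R_r^+
    # variable order: [t, Lam_1..n, S_minus_1..m, S_plus_1..s]
    nvar = 1 + n + m + s
    c = np.zeros(nvar)
    c[0] = 1.0                                # minimize t - (1/m) sum S^-/R^-
    c[1 + n:1 + n + m] = -1.0 / (m * Rm)
    A_eq = np.zeros((1 + m + s + 1, nvar))
    b_eq = np.zeros(1 + m + s + 1)
    A_eq[0, 0] = 1.0                          # normalization:
    A_eq[0, 1 + n + m:] = 1.0 / (s * Rp)      #   t + (1/s) sum S^+/R^+ = 1
    b_eq[0] = 1.0
    for i in range(m):                        # sum_j x_ij Lam_j + S_i^- = t x_iq
        A_eq[1 + i, 0] = -X[i, q]
        A_eq[1 + i, 1:1 + n] = X[i, :]
        A_eq[1 + i, 1 + n + i] = 1.0
    for r in range(s):                        # sum_j y_rj Lam_j - S_r^+ = t y_rq
        A_eq[1 + m + r, 0] = -Y[r, q]
        A_eq[1 + m + r, 1:1 + n] = Y[r, :]
        A_eq[1 + m + r, 1 + n + m + r] = -1.0
    A_eq[1 + m + s, 0] = -1.0                 # sum_j Lam_j = t  (VRS)
    A_eq[1 + m + s, 1:1 + n] = 1.0
    bounds = [(1e-9, None)] + [(0, None)] * (n + m + s)
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
    rho = res.fun                             # optimal objective equals rho
    lam = res.x[1:1 + n] / res.x[0]           # recover the original lambda
    return rho, lam

X = np.array([[2.0, 4.0, 6.0, 5.0]])          # one input, four DMUs
Y = np.array([[2.0, 4.0, 5.0, 3.0]])          # one output
rho, lam = sbm_vrs(X, Y, q=3)                 # evaluate the dominated DMU
```

For this toy data, the dominated fourth unit receives ρ < 1, while a frontier unit receives ρ = 1, matching the efficiency characterization above.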

The stages involved in the proposed algorithm are illustrated in Fig. 3.

Fig. 3
figure 3

The steps of ANN–DEA

Computational results

One hundred branches of one of the Iranian commercial banks were selected and the related data were collected. The data cover the period of March to February during the years 2006–2011. Each branch represents a decision-making unit (DMU) and uses two inputs to produce seven outputs, as shown in Table 2.

Table 2 Inputs and outputs of branches

After implementing the ANN and computing the efficiencies by the SBM model, the efficiency and the benchmark that are calculated for the 11th DMU are as follows:

$$\begin{array}{*{20}l} {\rho = 0.005} \hfill \\ {{\text{Benchmark}}:\left( {\begin{array}{*{20}l} {\sum\limits_{j = 1}^{100} {\lambda_{j} X_{j} } } \hfill \\ {\sum\limits_{j = 1}^{100} {\lambda_{j} Y_{j} } } \hfill \\ \end{array} } \right)} \hfill \\ \end{array}$$

As is shown in Table 3, the 11th DMU should decrease its inputs and increase its outputs within a 5-year time horizon. Hence, the 11th DMU will be efficient in 5 years.

Table 3 Distance between DMU11 and its benchmark

For DMU 38, the benchmark is displayed in Table 4. To be efficient in 5 years, the 38th DMU should increase its inputs and outputs (ρ = 0.737).

Table 4 Distance between DMU38 and its benchmark

The efficient DMUs included in the reference set are the peer group for the inefficient units, and an efficient DMU serves as its own reference; therefore, its benchmark and the DMU itself coincide. For efficient DMUs, such as DMU1, the benchmark is given in Table 5 (ρ = 1).

Table 5 Distance between DMU1 and its benchmark

Annual predictions can help each bank branch draw up a strategic improvement plan. Hence, bank management can plan according to this guidance and reach the 5-year goal.

Conclusions and future works

This paper presents an ANN–DEA study of the branches of one of the Iranian commercial banks. The results help DMUs to improve their efficiency and give them a useful strategic plan for future development. Unlike DEA alone, the ANN–DEA approach guides weaker performers on how to improve their performance to different efficiency ratings in the future. We can also list the following directions for future research. First, ranking DMUs can be considered. Second, the Malmquist productivity index can be used to calculate each DMU’s progress or regress. Third, other prediction methods can be utilized for estimation. Fourth, combinatorial methods can be used to find the most productive scale size.