A bootstrapped Malmquist index applied to Swedish district courts

This study measures the total factor productivity (TFP) of the Swedish district courts by applying data envelopment analysis to calculate the Malmquist productivity index (MPI) of 48 Swedish district courts from 2012 to 2015. In contrast to the limited international literature on court productivity, this study uses a fully decomposed MPI. A bootstrapping approach is further applied to compute confidence intervals for each decomposed factor of TFP. The findings show a 1.7% average decline of TFP, annually. However, a substantial variation between years can be observed in the number of statistically significant courts below and above unity. The averages of the components show that the negative impact is mainly driven by negative technical change. Large variations are also observed over time where the small courts have the largest volatility. Two recommendations are: (1) that district courts with negative TFP growth could learn from those with positive TFP growth; and (2) that the back-up labour force could be developed to enhance flexibility.

(1994a). 1 Following Wheelock and Wilson (1999), the MPI is decomposed into four parts: changes in (1) pure technical efficiency; (2) scale efficiency; (3) pure technology; and (4) the scale of the technology. A well-known issue with all types of DEA is that statistical inference is not directly possible (Simar and Wilson 2000). In this study, a bootstrap approach is used to determine the confidence intervals of the components listed above (Efron 1979). Another issue with DEA is the influence of outliers (Kapelko and Oude Lansink 2015). The analysis of outliers is, to a large extent, omitted in previous studies on court performance. In this study, an outlier detection analysis is performed to investigate whether the results depend on a few extreme observations. 2 The findings indicate a decline in TFP of 1.7% on average. However, a substantial variation between courts and between years is present. Between 36 and 57% of the courts have a significantly negative change in TFP, while the share of courts with a significantly positive TFP change is 16-36%. The reason is that courts that improve in one year may show a decline in TFP the following year. Looking at the components, the negative impact is driven by a decline in pure technical change (TC) of 4.7% in 2012-2013. Further, the TFP change is significantly negative during 2014-2015. During this period, the number of courts with a significant decline in TFP is larger, and the number of courts with a positive and significant TFP development is smaller, than in the other years. The correlation analysis concludes that the rate of change in the caseload has a significantly positive correlation with TFP, which indicates flexibility problems. The caseload variable is defined as the number of pending cases and matters at the start of the year, plus new cases and matters during the year. This paper is organised as follows. 
Section 2 provides a brief summary of the Swedish judicial system. Section 3 presents the previous TFP and efficiency literature regarding courts. Section 4 describes the methodology. Section 5 examines the data, including outlier detection. The results are reported in Sect. 6. Finally, Sect. 7 concludes and discusses the policy implications.

The Swedish judicial system: a short description
The Ministry of Justice is responsible for matters that are related to the judicial system, which include legislation on the fields of civil law and criminal law, for example. 3 However, it is not allowed to interfere in the day-to-day work, since the aim of the Swedish legal justice system is to provide fair trials. This requires independence and autonomy between courts, in relation to the Parliament, Government, and other authorities. The judicial process differs, depending on whether it is a criminal case, a civil case, or a matter. The different processes are described in Fig. 1. Each stage has the general purpose of dealing with cases and matters in an efficient manner and in compliance with the rule of law.
Criminal cases, to the left in Fig. 1, are first handled by the police. These cases start with a police report, followed by a preliminary investigation. In the next step, the case can either be closed or sent to the prosecutor, who will decide whether the case will be prosecuted. If the case continues to prosecution, it will end up in a district court. Initially, a dispute is handled by the municipalities; however, if it remains unsolved, it becomes a civil case in a district court. Civil cases are related to a dispute between individuals or business firms. Matters are regulated in the Court Matters Act (SFS 1996:242) and can be separated into four categories: (1) debt clearances; (2) debt enforcements; (3) bankruptcies or company reconstructions; and (4) other matters. Categories 1-3 relate to payment problems, as shown in Fig. 1. Debt clearances and debt enforcements are decided by the Swedish Enforcement Agency and adjudicated, as a matter, by the district courts if the decision of the Swedish Enforcement Agency is appealed (SFS 1981:774). A decision of bankruptcy must be decided by the district court, which is also the case if a business firm applies for bankruptcy. The fourth category, named 'other matters' in Fig. 1, includes a variety of matters, for example: estate administrators, parking remarks, heritages, and custodians. 4 There are three different types of courts that build up the Swedish court system, namely, the general courts, the administrative courts, and special tribunals. The general courts consist of the district courts, the Courts of Appeal, and the Supreme Court. Each of the instances is important, since different instances provide possibilities to appeal to achieve a fair trial, which is a fundamental right in any legal justice system. The Supreme Court, which is the last instance, has the main mission to provide the district court system with legal practice to enhance the uniformity of actions in legal decisions.
This study focuses on the district courts, which have the mission to serve as the first instance in the legal system. Each district court mainly handles cases related to their catchment area, which corresponds to the surrounding geographical area. However, there are five courts that specialise in land and environment cases. These courts deal, for example, with environmental and water issues, property registration, and building matters. Within each court, there are Chief Judges, Senior Judges, and Judges who are considered as permanent judges, the former being the head of the court. Each judge is appointed by the government. There are also law clerks who work as non-permanent judges, including both recent law graduates in the training programme to become a permanent judge and regularly employed law clerks that are not included in the judge training. The work tasks of the law clerks normally consist of preparing cases, but can also include deciding simple cases as a nonpermanent judge. Finally, Lay Judges have experience from other occupations and politics and are chosen by the Municipal Council, but they are not educated in law. They work as judges for a period of 4 years.

Literature review
There is existing literature that focuses on the labour productivity of courts (Blank et al. 2004;SNCA 2015), but there is only a limited amount of literature that considers court TFP. Kittelsen and Førsund (1992) are the first to investigate TFP change over time. The efficiency scores of Farrell (1957) are used to calculate the MPI, which is decomposed into change in efficiency and technology, with the first year as the base (Caves et al. 1982). In terms of the decomposed factors, the catching-up was 4% and the technology shifted 2% from 1983 to 1988. Kittelsen and Førsund (1992) perform an outlier detection analysis, in which the MPI and its components are shown in a histogram, with the labour share on the x-axis. Based on the diagrams, three courts are considered to be outliers, due to a large improvement or decline in TFP. 5 Fauvrelle and Almeida (2016) calculate the MPI and decompose it into TC and efficiency change (EC).
Following Färe et al. (1994b), EC is further decomposed into a pure EC and a scale component. 6 The results show, on average, a positive TFP change of 1.5%, which is decomposed into a decline of 1.7% in TC, a pure EC of 3.3%, and a scale EC of 0.7%. 7 Both Fauvrelle and Almeida (2016) and Kittelsen and Førsund (1992) use the averages of TFP change and decompose them into, at most, three components. However, neither of them investigates whether the changes are statistically significant. Finally, Falavigna et al. (2017) contribute to the literature by applying a bootstrapped MPI in a two-stage analysis, proposed by Simar and Wilson (2007), to investigate the impact of structural changes in Italian district courts during 2009-2011. The MPI is found to be 0.3%, EC -0.1%, and TC 0.4%. Further, they conclude that the role of judges is correlated with court productivity and efficiency.
While only a few studies exist on TFP change, efficiency is measured more extensively. Such research is important for this study, since it deals with the question of which inputs and outputs are best for measuring performance. 8 Lewin et al. (1982) are the first to investigate inefficiency in district courts, using DEA. 9 Lewin et al. (1982), as well as all the other studies, use the number of employees as an input. In some studies, employees are measured as the number of judges (Falavigna et al. 2017;Ferrandino 2014;Finocchiaro Castro and Guccio 2014). In other studies, the personnel are separated into judges and office staff (Major 2015;Santos and Amado 2014). The caseload of a court is another input included in some studies. The caseload consists of pending and new cases; that is, the demand of justice services (Kittelsen and Førsund 1992;Schneider 2005). For instance, Nissi and Rapposelli (2010) and Schneider (2005) argue the importance of including the caseload, since an underestimation of productivity will occur, because the employees cannot perform their job without incoming or pending cases. However, this is a slightly contradicting argument when analysing TFP, since courts should be able to adjust inputs when justice demands change. This will be discussed, in more detail, in Sect. 5.
Moreover, Beenstock and Haitovsky (2004) argue that individual productivity increases if the work pressure is high. However, the caseload can also, as Kim and Min (2016) argue, correlate negatively with quality. For example, if the caseload is low, more time can be spent on each case, which, on average, generates a more precise judgement. Outputs normally consist of the number of decided cases (Falavigna et al. 2017; Nissi and Rapposelli 2010). In some studies, cases are separated by type; for example, criminal cases and civil cases (Finocchiaro Castro and Guccio 2016). However, due to data limitations, the studies cannot separate outputs within each category based on the resources spent. This aggregation equates, for example, a murder with a car crime. Different types of crimes require different amounts of resources, due to their differing complexity. This becomes a problem in court performance analysis if the mix of crime types differs between courts.
7 Moreover, an investigation is performed of what is described as '… productivity or technical efficiency …' (Schneider 2005, p. 133). It is, in fact, TE that is calculated, but it is referred to as court productivity each time, except on the previously cited occasion. Furthermore, Schneider (2005) investigates the determinants of efficiency, such as the share of Ph.D. judges and the share of judges older than 60, among others.
8 For a general descriptive survey of judicial efficiency, see Voigt (2016).
9 Other previous input-output-oriented studies exist concerning court performance (Nardulli 1978), but not concerning TE. There is also older research on courts that, for instance, focuses on organisational perspectives (Eisenstein and Jacob 1977; Feeley 1973).
Quality variables are argued to be important in some studies (Yeung and Azevedo 2011). Some attempts to investigate the impact of quality variables on performance can be found in the literature. Examples are judges' salaries and education, of which the former have a significantly positive effect on efficiency (Deyneli 2012). Furthermore, Schneider (2005) concludes that more PhD holders, as judges, increase the efficiency. Falavigna et al. (2015) use court delay, as an undesirable output, in a directional distance function. 10 Finally, Andersson et al. (2017b) include a quality measure that relates to the number of changed decisions by a superior court, but does not find any significant correlation with efficiency. All of these studies focus on TE, which is basically a measurement of similarity. For instance, Espasa and Esteller-Moré (2015) argue that the efficiency can be high, even if the courts perform poorly, as long as they are congested. Thus, this is not a good measurement of performance improvements over time; for example, lower inefficiency over time could occur due to a decline in performance of the best district courts.
To sum up, there is no research regarding TFP in Sweden and very little literature, internationally. Furthermore, the international studies do not, due to data limitations, investigate the potential heterogeneity in resource spending within the output categories. Moreover, statistical inference is left out, with the exception of Falavigna et al. (2017), and TFP is at most decomposed into three components.

Methodology
Different approaches can be applied in productivity and efficiency studies. Stochastic frontier analysis (SFA) is a widely used parametric methodology (Krüger 2012; Kumbhakar and Lovell 2003). SFA has the advantage of allowing directly for statistical noise. However, the disadvantage is that it requires a specific functional form. Another option is the DEA approach, which has the advantage of relying on few assumptions and of handling multiple outputs and inputs. Furthermore, DEA is relevant when analysing the public sector, in which the outputs are not sold on the market (Førsund 2016). However, DEA also has some disadvantages. Firstly, it does not provide a basis for statistical inference. To some extent this can be handled by using resampling methods, such as the bootstrap procedure proposed by Simar and Wilson (1998a). The second disadvantage of DEA is its sensitivity to outliers. This shortcoming is, to a large extent, neglected in previous literature on court performance.

Outlier detection
There is no optimal procedure to detect outliers, since no generally accepted definition of an outlier can be found (Davies and Gather 1993). However, plenty of methods are applied in different areas. For example, the outlier detection method by Wilson (1993) is useful when data checking is costly (i.e., when the data-set is large). Kapelko and Oude Lansink (2015) use a specific deviation from the median. In DEA, it is important to identify observations that substantially push the frontier, as proposed by Banker and Gifford (1988). This procedure, referred to as the method of super-efficiency, is found to perform well in practical applications in the experiments by Banker and Chang (2006) and Banker et al. (2017), the latter concluding that it is robust under different scale assumptions. The focus in this paper is TFP change, and a unit that is super-efficient in a single year may change the results. This paper identifies an observation as a potential outlier if the output-based super-efficiency score, assuming constant returns to scale (CRS), is below 0.75. This limit is used in, for example, the robustness investigation by Agrell and Niknazar (2014) and the empirical application by Edvardsen et al. (2017). 11 Finally, when a potential outlier is identified, the specific observation should be examined more closely to produce arguments for why it is an outlier (Simar 2003).
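The screening rule can be illustrated with a minimal Python sketch. It is deliberately restricted to one input and one output, in which case the CRS frontier built from the peers is simply their best output/input ratio and no linear programme is needed; the multi-input, multi-output case used in the paper requires solving a DEA problem per court. The function names and the data are hypothetical; only the 0.75 limit comes from the text.

```python
def super_efficiency_scores(inputs, outputs):
    """Output-based CRS super-efficiency for ONE input and ONE output.

    Each unit is compared with a frontier built from all other units.
    Under CRS with a single input and output, that frontier is the best
    output/input ratio among the peers, so the score is
        best peer ratio / own ratio.
    A score below 1 means the unit lies outside the peers' frontier.
    """
    ratios = [y / x for x, y in zip(inputs, outputs)]
    scores = []
    for i, own in enumerate(ratios):
        best_peer = max(r for k, r in enumerate(ratios) if k != i)
        scores.append(best_peer / own)
    return scores


def flag_potential_outliers(names, inputs, outputs, limit=0.75):
    """Return the names whose super-efficiency score is below the limit."""
    scores = super_efficiency_scores(inputs, outputs)
    return [name for name, s in zip(names, scores) if s < limit]


# Hypothetical courts: hours worked (input) and weighted decided cases (output).
names = ["A", "B", "C", "D"]
hours = [100.0, 200.0, 150.0, 120.0]
cases = [50.0, 90.0, 60.0, 96.0]

scores = super_efficiency_scores(hours, cases)
flagged = flag_potential_outliers(names, hours, cases)
print(flagged)
```

A flagged unit is only a candidate: as the text notes, the observation still has to be examined before it is treated as an outlier.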

DEA and the Malmquist productivity index
The point of reference can be taken either from an input perspective (i.e., minimise the inputs to produce a given level of output) or from an output perspective (i.e., maximise the output given the level of inputs). As in most studies of district courts, an output-based perspective is assumed. There are, in the context of courts, two reasons for choosing an output-based perspective. First, inputs are not easily changed in the short run. Second, the individual court has no incentive to change its inputs, since the budget for employees is given for a specific year. Thus, the maximum output should be produced using a given level of inputs. The production technology in time period $t$, for the 48 Swedish district courts, is defined in Eq. 1, where $S^t$ represents the technology. Each court, $i$, uses a vector of inputs, $x^t$, to produce a vector of outputs, $y^t$, in period $t$. Using the output distance function, 12 the technical efficiency (TE) in time period $t$ can be written as in Eq. 2, where $h$ is a scalar and the distance is the output distance function $D_O^t(x_i^t, y_i^t)$. 13 If TE is equal to unity, the court is on the frontier, meaning that it is technically efficient. However, if TE is larger than unity, the court is inefficient; for instance, TE equal to 1.1 means that the output can be increased by 10%, given the amount of inputs. To calculate the standard MPI, introduced by Caves et al. (1982), the same calculation needs to be performed for the following period, $t+1$. This is shown in Eq. 3. 14 In Eq. 3, and hereafter, the C subscript represents CRS. Similarly, Eq. 3 can be written for the variable returns to scale (VRS) case. If the technology is not CRS, the MPI does not accurately measure TFP, according to Griffel-Tatjé and Lovell (1995). However, Wheelock and Wilson (1999) state that using the CRS assumption when the true technology is VRS generates inconsistent distance estimates, which is an argument for not restricting the calculation to one scale assumption. 15 Using Eqs. 2 and 3, and taking the technology of period $t$ as the reference, Caves et al. (1982) define the MPI as in Eq. 4, that is, as the ratio of the output distance functions of the two periods. This paper uses the most common version of the MPI, based on Caves et al. (1982). To avoid an arbitrary choice of benchmark technology, the MPI in Eq. 4 is often defined as the geometric mean of two indices. 16
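For reference, Eqs. 1 to 4 can be sketched in their standard output-oriented form. The symbols $S^t$, $x_i^t$, $y_i^t$, $h$ and the C subscript follow the text; the symbol $M^t$ for the period-$t$ index, the set $S_C^{t+1}$ for the CRS technology, and the use of a maximum rather than a supremum are assumptions made here for illustration.

```latex
\begin{align}
S^t &= \{(x^t, y^t) : x^t \text{ can produce } y^t\} \tag{1} \\
TE_i^t &= \left[ D_O^t(x_i^t, y_i^t) \right]^{-1}
        = \max \{ h : (x_i^t, h\, y_i^t) \in S^t \} \tag{2} \\
\left[ D_C^{t+1}(x_i^{t+1}, y_i^{t+1}) \right]^{-1}
       &= \max \{ h : (x_i^{t+1}, h\, y_i^{t+1}) \in S_C^{t+1} \} \tag{3} \\
M^t &= \frac{D_C^t(x_i^{t+1}, y_i^{t+1})}{D_C^t(x_i^t, y_i^t)} \tag{4}
\end{align}
```

With this convention a court on the frontier has a distance of one and $TE = 1$, and an inefficient court has a distance below one and $TE > 1$, which matches the interpretation of TE given above.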

Decomposition
Decomposition of the productivity index was first proposed by Nishimizu and Page (1982), who define TFP change as the sum of EC and TC. The geometric mean of the two indices is, following Caves et al. (1982) and Färe et al. (1992, 1994a), obtained by rewriting Eq. 4 as Eq. 5, where the distance functions are defined assuming CRS. Based on the geometric mean defined in Eq. 5, the MPI can be decomposed into TC and EC. Following Wheelock and Wilson (1999), the decomposition, while allowing for VRS, is written as Eq. 6. 17 EC is interpreted as changes in the relative efficiency of a court (i.e., movements towards or away from the frontier), while TC measures the shift of the frontier itself. 18 An EC or TC that is larger (or smaller) than unity indicates an improvement (or decline) in EC or TC between period $t$ and period $t+1$. 19 Allowing both TC and EC to have either VRS or CRS makes the decomposition shown in Eq. 7 possible.
13 Hereafter, the 'O' subscript for the output distance function is omitted to avoid notational clutter.
14 The name of the index comes from the early work on price indexes by Malmquist (1953). In productivity analysis, it was Caves et al. (1982) who adopted the methodology developed by Malmquist.
15 See also Grosskopf (2003) for an overview of different arguments on technology assumptions and decompositions.
16 The geometric mean is commonly used. However, it is worth pointing out the potential issue of non-compatibility with the circularity assumption, which was already pointed out by Gini (1931). One possibility is to choose a base technology, but then the index depends on the chosen base (Berg et al. 1992; Pastor and Lovell 2007). Another is to use the 'transitivized' MPI by Balk and Althin (1996). However, the view of Färe (2008), in the spirit of Fisher, is that the natural order of time can be followed (i.e., circularity should not be a problem). For an empirical application of an alternative version of the MPI that fulfils this in a fairly similar application, see e.g., Førsund et al. (2015).
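Written out, the geometric-mean index and the four-way decomposition attributed to Wheelock and Wilson (1999) take the following standard form. This is a sketch under stated assumptions: the C subscript follows the text, the V subscript for VRS distance functions and the shorthand $D_s^{p}(q)$ are introduced here for compactness, and whether the paper's own Eq. 6 carries the C subscript explicitly is not known from the text.

```latex
% Shorthand (introduced here): D_s^{p}(q) = D_s^{p}(x_i^{q}, y_i^{q}),
% s in {C, V} is the scale assumption, p the technology period,
% q the period of the observation.
\begin{align}
M &= \left[ \frac{D_C^{t}(t+1)}{D_C^{t}(t)} \cdot
            \frac{D_C^{t+1}(t+1)}{D_C^{t+1}(t)} \right]^{1/2} \tag{5} \\
M &= \underbrace{\frac{D_C^{t+1}(t+1)}{D_C^{t}(t)}}_{EC} \times
     \underbrace{\left[ \frac{D_C^{t}(t+1)}{D_C^{t+1}(t+1)} \cdot
            \frac{D_C^{t}(t)}{D_C^{t+1}(t)} \right]^{1/2}}_{TC} \tag{6} \\
M &= \Delta PureEff \times \Delta Scale \times
     \Delta PureTech \times \Delta ScaleTech \tag{7}
\end{align}
\begin{align*}
\Delta PureEff &= \frac{D_V^{t+1}(t+1)}{D_V^{t}(t)}, \qquad
\Delta Scale = \frac{D_C^{t+1}(t+1) / D_V^{t+1}(t+1)}{D_C^{t}(t) / D_V^{t}(t)}, \\
\Delta PureTech &= \left[ \frac{D_V^{t}(t+1)}{D_V^{t+1}(t+1)} \cdot
            \frac{D_V^{t}(t)}{D_V^{t+1}(t)} \right]^{1/2}, \\
\Delta ScaleTech &= \left[
   \frac{D_C^{t}(t+1) / D_V^{t}(t+1)}{D_C^{t+1}(t+1) / D_V^{t+1}(t+1)} \cdot
   \frac{D_C^{t}(t) / D_V^{t}(t)}{D_C^{t+1}(t) / D_V^{t+1}(t)} \right]^{1/2}
\end{align*}
```

Multiplying the four components cancels every VRS term, so $\Delta PureEff \times \Delta Scale$ reproduces EC and $\Delta PureTech \times \Delta ScaleTech$ reproduces TC in Eq. 6, and the full product reproduces Eq. 5.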
ΔPureTech and ΔPureEff are both defined on the best-practice technologies, according to Ray and Desli (1997) and Färe et al. (1994b), respectively. The scale EC measures the movement towards or away from the technically optimal scale.
17 The inclusion of multiple inputs and outputs, when assuming VRS in TE computations, was first proposed by Banker (1984) and empirically applied by Banker et al. (1984). A graphical description of the DEA frontier, with CRS and VRS, is presented in Fig. 2.
18 Non-homogeneity occurs, since a scale assumption other than CRS is used. This is discussed in Griffel-Tatjé and Lovell (1995).
19 Färe et al. (1994b) relax the CRS assumption to allow for VRS and decompose EC into a pure effect and a scale effect, respectively. According to Ray and Desli (1997), the decomposition presented by Färe et al. (1994b) is wrong, because the EC is assumed to exhibit VRS while the technology has CRS. To clarify, if CRS holds, there will not be any scale effect, since scale optimality, pioneered by Frisch (1965), is assumed. On the other hand, if VRS is assumed, TC, as defined by Färe et al. (1994b), does not measure the shift in the CRS frontier. However, the error in calculation that Ray and Desli (1997) mention is, according to Simar and Wilson (1998b), only an error in the definition of Eqs. 6 and 7 in Färe et al. (1994b). In other words, the definitions by Färe et al. (1994b) assume CRS, but their calculations allow for VRS. According to Simar and Wilson (1998b), Eq. 6 in Färe et al. (1994b) should be written as Eq. 6 in this paper.
Finally, the scale of the technology (i.e., ΔScaleTech), proposed by Wheelock and Wilson (1999), represents the scale bias of TC (i.e., the geometric mean of two scale efficiency ratios). This means that any change in ΔScaleTech arises from a change in the shape of the technology. The first ratio consists of the change in the scale of the technology between $t$ and $t+1$. The reasoning of the second ratio is similar, specifically a change in the scale of the technology between $t$ and $t+1$, relative to the location of the production unit in period $t$. 20 Problems with this decomposition can occur when cross-period distance functions are calculated using the VRS assumption, since it can generate missing values for some components. 21 Finally, this decomposition is criticised slightly for its confusing interpretation. For example, Wheelock and Wilson (1999) interpret what we call ΔScaleTech as the shape of the technology, while Zofio and Lovell (1998) interpret it as the scale bias of the technology (Balk 2001; Ray 2001). 22
To examine TFP and its decomposed factors, the efficiency needs to be calculated. The reciprocal of the output-based Farrell (1957) measure of TE is formulated by Färe et al. (1994b) as a linear programming problem in which $h$ is maximised subject to the constraints that define the reference technology, where $z$ is the $N \times 1$ vector of intensity variables (i.e., weights). The objective is to maximise $h$, which corresponds to minimising the value of the distance function, $D^t(x_i^t, y_i^t)$. 23 For example, if $y_0$ is an arbitrarily chosen level of output, the maximum output, given the level of inputs, is calculated as $y_0 \cdot TE^t$ or, equivalently, as $y_0 / D^t(x_i^t, y_i^t)$. 24 To compute the MPI, four single-period problems are required, assuming CRS and VRS, as well as four mixed-period problems, under CRS and VRS. 25 These calculations will generate the average MPI. To improve the robustness of the calculated MPIs and draw conclusions based on statistical inference, a bootstrap approach is applied.
20 This decomposition is simultaneously proposed by Gilbert and Wilson (1998), Simar and Wilson (1998b), and Zofio and Lovell (1998).
21 The VRS assumption is more sensitive to infeasible values when the cross-period distance functions are computed (Briec and Kerstens 2009). This can be avoided by only using a CRS technology. However, CRS is only relevant when long-run equilibrium, in size, appears to be a reasonable assumption (Chambers and Pope 1996). This is not a valid assumption in our case.
22 An alternative decomposition can be found in Lovell (2003). In the Lovell (2003) case, both the output mix and the input mix are considered. This is, however, not relevant to district courts, since a single court does not have the power to decide the output mix.
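Once the eight distance values for a court are available, the index and its four components follow from simple ratios. The sketch below assumes the standard Wheelock and Wilson (1999) form of the decomposition; the dictionary layout, the function name, and the distance values are hypothetical and only serve to show that the components multiply back to the MPI.

```python
from math import sqrt, isclose


def decompose_mpi(d):
    """Decompose the geometric-mean MPI into four components.

    d maps (scale, technology_period, observation_period) to an output
    distance value, with scale "C" (CRS) or "V" (VRS) and periods
    0 (for t) and 1 (for t+1).
    """
    pure_eff = d["V", 1, 1] / d["V", 0, 0]
    scale_eff = (d["C", 1, 1] / d["V", 1, 1]) / (d["C", 0, 0] / d["V", 0, 0])
    pure_tech = sqrt(
        (d["V", 0, 1] / d["V", 1, 1]) * (d["V", 0, 0] / d["V", 1, 0])
    )
    scale_tech = sqrt(
        ((d["C", 0, 1] / d["V", 0, 1]) / (d["C", 1, 1] / d["V", 1, 1]))
        * ((d["C", 0, 0] / d["V", 0, 0]) / (d["C", 1, 0] / d["V", 1, 0]))
    )
    # Geometric mean of the period-t and period-(t+1) based CRS indices.
    mpi = sqrt(
        (d["C", 0, 1] / d["C", 0, 0]) * (d["C", 1, 1] / d["C", 1, 0])
    )
    return {
        "mpi": mpi,
        "pure_eff": pure_eff,
        "scale_eff": scale_eff,
        "pure_tech": pure_tech,
        "scale_tech": scale_tech,
    }


# Hypothetical distances for one court: four single-period and four
# mixed-period values, split over CRS and VRS.
distances = {
    ("C", 0, 0): 0.80, ("C", 0, 1): 0.90, ("C", 1, 0): 0.85, ("C", 1, 1): 0.88,
    ("V", 0, 0): 0.85, ("V", 0, 1): 0.95, ("V", 1, 0): 0.90, ("V", 1, 1): 0.92,
}

result = decompose_mpi(distances)
product = (
    result["pure_eff"] * result["scale_eff"]
    * result["pure_tech"] * result["scale_tech"]
)
print(isclose(product, result["mpi"]))
```

In practice a mixed-period VRS problem can be infeasible, in which case the affected components are missing for that court, as noted in the text.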

Bootstrapping the Malmquist productivity index
Statistical inference for DEA is most commonly based on bootstrapping (Efron 1979). Bootstrapping, like other resampling techniques, simulates the data-generating process multiple times by resampling from the data and applying the original estimator to each simulated sample. This generates an approximation of the sampling distribution that can be used for statistical inference; for example, confidence intervals for the DEA efficiency scores. These confidence intervals are based on a large number of bootstrap draws (Simar and Wilson 1998a). Further, the efficiency scores can be bias-corrected, as proposed by Simar and Wilson (1999). However, the rule of thumb is not to correct for this bias unless $s^2 < \tfrac{1}{3}\left[\widehat{\mathrm{Bias}}_B\big(\hat{h}(x_i^t, y_i^t)\big)\right]^2$, where $s^2$ is the variance of the bootstrapped values (Simar and Wilson 2000). The procedure can be summarised in four steps: (1) calculate the MPIs as previously described; (2) generate an i.i.d. bootstrap sample from the original sample; (3) calculate the MPIs based on the bootstrap sample; and (4) repeat steps 2 and 3 a sufficient number of times (e.g., 2000 repetitions in our study) to generate standard deviations to construct the confidence intervals of the MPI and its decomposed factors. 26 To summarise the methodology, the MPI, including all the decomposed factors and their confidence intervals, will be computed and bootstrapped. The bootstrap makes it possible to evaluate changes on the basis of statistical significance. The decomposition can serve as a good starting point for investigating the sources of the TFP change. This can be achieved without problems of sample noise, making the results statistically robust. 27
23 The condition that the distance does not exceed unity is fulfilled if the technology is defined on the production set, as in Eq. 3. However, it can be above 1 in the mixed-linear problem, indicating technical progress.
24 The returns to scale subscript is omitted to avoid notational clutter.
25 The necessary linear problems to be solved are the single-period and mixed-period distance functions for CRS and for VRS. These are calculated for each change in time, generating 8*3 problems for each court.
26 This calculation is performed using the FEAR software in R (Wilson 2008), and the bootstrap procedure used in this paper is proposed by Simar and Wilson (1999). A discussion about weaknesses with this bootstrap procedure can be found in Olesen and Petersen (2016).
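How the bootstrap replicates are turned into inference can be sketched as follows. This is not the Simar and Wilson (1999) procedure as implemented in FEAR; it is a simplified illustration that takes a set of already computed bootstrap values for one index, applies the bias rule of thumb quoted above, and forms a simple percentile interval. The function name, the percentile construction, and the replicate values (an evenly spaced grid standing in for real bootstrap draws) are assumptions.

```python
import statistics


def summarise_bootstrap(estimate, replicates, alpha=0.05):
    """Summarise bootstrap replicates of one index for one court.

    Returns the estimated bias, the variance of the replicates, whether
    the rule of thumb s^2 < bias^2 / 3 recommends bias correction, a
    percentile interval, and whether that interval excludes unity.
    """
    b = sorted(replicates)
    n = len(b)
    bias = statistics.mean(b) - estimate
    s2 = statistics.variance(b)
    bias_correct = s2 < bias ** 2 / 3
    lower = b[int((alpha / 2) * n)]
    upper = b[min(n - 1, int((1 - alpha / 2) * n))]
    significant = upper < 1.0 or lower > 1.0
    return {
        "bias": bias,
        "variance": s2,
        "bias_correct": bias_correct,
        "ci": (lower, upper),
        "significant": significant,
    }


# Hypothetical replicates: a grid from 0.95 to 0.99 in place of real draws.
replicates = [0.95 + 0.0001 * k for k in range(401)]
summary = summarise_bootstrap(estimate=0.96, replicates=replicates)
print(summary["ci"], summary["significant"])
```

An interval that lies entirely below unity would correspond to a court counted as having a significantly negative change, and one entirely above unity to a significantly positive change.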

Data
The data used were obtained from the SNCA and cover the time period 2012-2015. However, data on hearing times are available for a longer time period (2007-2015) and will be used for the weighting of the outputs. The available data are more detailed than the data used in previous research. For example, different cases and matters are reported in 292 sub-categories, which will be taken into consideration. The choice of input and output variables is based on what representatives of the courts have stated in interviews as reasonable resource and performance measures, as well as economic theory and previous research. 28 This very detailed information implies that the complexity of cases and matters within a given sub-category will not vary much, since cases and matters within the same sub-category will be similar.
In the first step, the outputs are aggregated into decided civil cases, decided criminal cases, and decided matters, which are described in Fig. 1. Simply adding these three groups together, however, will introduce aggregation errors. There are differences between courts regarding the type of cases and matters that they handle. Some courts handle more resource-intensive cases than others. For example, a murder case would most likely require more resources than a traffic case. This heterogeneity within different categories of cases and matters is not taken into account in previous studies (Kittelsen and Førsund 1992; Santos and Amado 2014; Yeung and Azevedo 2011). To account for this, the outputs are weighted by the hearing time (i.e., the time in the courtroom). 29 This means that courts with a large share of cases from a complicated category are not negatively affected in terms of TFP. The weights are based on the average hearing time in each sub-category; for example, criminal cases alone consist of almost 40 sub-categories, and each sub-category receives its own weight. 30 On the input side, labour is the largest cost share for Swedish district courts, with 70%, and the rental cost is about 13% for the period 2012-2015 (SNCA 2012, 2013, 2014, 2015, 2016). Labour is divided into three categories, specifically the number of hours worked for: (1) judges; (2) law clerks; and (3) other personnel.
27 Førsund (2016) argues that separation of the pure and the scale effects is problematic in the decompositions. Furthermore, the interpretation of multiplicative decomposition is argued to be unclear, and no clear-cut causality can be identified. However, other authors such as Lovell (2003) and Simar and Wilson (1998b) argue that the adopted decomposition is meaningful.
28 Wagner and Shimshak's (2007) strategy of model selection is also used to assess the plausibility of the model.
29 Andersson et al. (2017b) conducted a Confirmatory Factor Analysis (CFA) to investigate whether hearing time could be used as an approximation of resources used. The result of the CFA was that hearing time had a loading of 0.96, indicating that the hearing time is a good approximation for resources used.
30 To illustrate how the weights are constructed: (1) Economic crimes had a mean hearing time of 215 min during the time period 2007-2015; and (2) Drug offences, on the other hand, had a mean hearing time of 57 min during the same time period. This implies that 3.8 drug offences are assumed to use as many resources as one economic crime.
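The weighting in footnote 30 can be reproduced with a short Python sketch. The two mean hearing times come from the text; the choice of drug offences as the reference category, the function names, and the case counts of the two courts are hypothetical and only illustrate the mechanism.

```python
def relative_weights(mean_hearing_minutes, reference):
    """Express each sub-category's mean hearing time relative to a reference."""
    ref = mean_hearing_minutes[reference]
    return {name: minutes / ref for name, minutes in mean_hearing_minutes.items()}


def weighted_output(decided_counts, weights):
    """Sum decided cases after scaling each sub-category by its weight."""
    return sum(count * weights[name] for name, count in decided_counts.items())


# Mean hearing times (minutes), 2007-2015, as quoted in footnote 30.
mean_minutes = {"economic crime": 215.0, "drug offence": 57.0}
weights = relative_weights(mean_minutes, reference="drug offence")

# Hypothetical courts that both decide 110 cases, with different mixes.
court_a = {"economic crime": 10, "drug offence": 100}
court_b = {"economic crime": 0, "drug offence": 110}

print(round(weights["economic crime"], 1))
print(weighted_output(court_a, weights) > weighted_output(court_b, weights))
```

Both hypothetical courts decide the same number of cases, but the court with economic crimes in its mix receives a larger weighted output, which is the effect the weighting is meant to achieve.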
The reason for dividing labour into three categories is that the different types of staff conduct different tasks at the Swedish district courts, and the staff composition varies between courts, according to the SNCA. A measure of capital is omitted from previous research (Ferrandino 2014;Kittelsen and Førsund 1992), except for Elbialy and García-Rubio (2011), who incorporate computers. To incorporate capital, the office space of the court is used, following the assumption that the amount of capital (e.g., computers and office equipment) is proportional to the size of the premises. We argue that including a measure of capital is important, since it is, to some extent, possible to substitute labour with capital in court production. An example is the incorporation of video conferences, which, according to the SNCA, decreases the travelling time for judges.
Furthermore, the caseload, as described in the previous literature, is an important source of performance for several reasons; for example, if there is no caseload, there will not be any output. We argue that the caseload is an important variable to incorporate when calculating performance that focuses on improvements of technology, management, and so on. However, it is not recommendable to include the caseload in the main analysis when TFP is investigated, since an important factor of TFP changes is flexibility in inputs, i.e., adjustable inputs depending on changes in justice demand. Thus, the caseload is only included in a second-stage correlation analysis. The caseload in year t is defined as the stock of open cases and matters at the end of year t -1 plus the incoming cases and matters in the present year. A potential problem with this method is that the incoming cases, at the end of 2015, are not included. The correlation is invariant for addition and, therefore, only a problem when the difference is non-random between courts. In our case, however, it is more relevant to assume randomness between courts, meaning that it does not affect the correlations. 31

Outlier detection
The chosen limit of the super-efficiency scores is 0.75. This means that if an observation is identified with a super-efficiency score below the limit in any of the years, it will be considered for elimination from the main analysis. Four district courts are below this limit during at least one of the years. These are Eksjö, Uddevalla, Gotland and Nacka. 32 Gotland is unlikely to be super-efficient, in general, based on interviews with representatives for the district courts. However, 2014 is an exception, according to the representatives at the SNCA. Furthermore, from 2013 to 2014, Gotland has a TFP growth of 12%. Therefore, Gotland is eliminated from the main analysis. Thus, four courts are eliminated, meaning that there are 44 courts left in the sample. The descriptive statistics of the outputs and inputs, after the elimination of outliers, are reported in Table 1.

31 A measure of quality can potentially be included in the analysis. Andersson et al. (2017b) incorporate the change frequency in higher instances in a second-stage analysis and conclude zero correlation with the efficiency scores.

Table 1 shows that the differences in output over time are, on average, quite small. However, each of the inputs increased in size over time. For example, the number of full-time equivalent judges increased from 17.28 to 18.81 (9%). The caseload declined over time, but the dramatic drop in the last period arises because not all incoming cases were included, as previously described. Also note the large standard deviations, which are almost as large as the means. For instance, the Stockholm district court is, in terms of hours worked for judges, almost 31 times larger than Lycksele. Finally, the nonweighted caseload declined during the studied time period.
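The screening rule stated at the start of this subsection (a super-efficiency score below 0.75 in any year) can be sketched as follows. The scores and court names are hypothetical placeholders, not the values estimated in this study, and the function name `flag_candidates` is introduced only for illustration. Flagged courts are candidates for elimination; as described above, the final decision also draws on qualitative information.

```python
THRESHOLD = 0.75  # limit on the super-efficiency score stated in the text

# Hypothetical super-efficiency scores per court and year (not the paper's data)
scores = {
    "Court A": {2012: 0.98, 2013: 1.02, 2014: 0.95, 2015: 1.00},
    "Court B": {2012: 0.91, 2013: 0.72, 2014: 0.88, 2015: 0.93},
    "Court C": {2012: 0.80, 2013: 0.79, 2014: 0.77, 2015: 0.76},
}

def flag_candidates(scores, threshold=THRESHOLD):
    """Return courts with a score below the threshold in at least one year,
    together with the years in which this happens."""
    flagged = {}
    for court, by_year in scores.items():
        years = sorted(y for y, s in by_year.items() if s < threshold)
        if years:
            flagged[court] = years
    return flagged

candidates = flag_candidates(scores)
```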

Results
The results for the MPI and its two main components, EC and TC, are reported first. EC and TC are then each decomposed into a pure effect and a scale effect. A correlation analysis, based on the MPI and its components, follows. Finally, note that the observations identified as outliers are eliminated from the main results. 33

Malmquist index and its decomposed factors
The MPI and its components are reported in Table 2. 34 In Table 2 the TFP change, measured as the MPI, is negative for 2012-2013 and 2014-2015. Column 3 reports that TC is significantly negative during the first period and, on average, negative in the following periods. EC contributes positively to TFP growth during the period 2012-2014. For the last period, 2014-2015, both TC and EC affect TFP negatively, which generates a statistically significant decline in TFP. We do not aim to identify the causes of the different components of TFP change; however, potential sources of the results are discussed. A negative TC, which is observed for each period, has its original interpretation from other sectors, where it can arise from an absence of reinvestment in capital, so that fewer outputs can be produced. However, this is not likely for district courts. Instead, an inward shift of the frontier is most likely due to one of two reasons. First, if the turnaround time increases, due to more complicated cases, output falls, which appears as a negative TC in the model. This means that all courts are affected equally (i.e., efficiency remains constant, but all courts are closer to the origin). However, the turnaround time decreased during this time period, for both criminal cases and civil cases, according to the SNCA (2014, 2016), which means that the source is something else. Second, to interpret a decline in TC or EC, depending on whether the affected courts operate on the frontier, it is worth studying the descriptive statistics in Table 1. In Table 1, it can be observed that the average number of decided cases and matters fluctuates by between 1 and 2% for 2012-2014; thus, the changes are stable. However, for 2014-2015, the changes in decided criminal cases and matters are in the same range as previously described, but the decided civil cases declined by 4.2%, on average.
Thus, the produced output decreases in total, driven by a lower number of civil cases. This, however, only concerns decided cases. However, the number of incoming civil cases is also reported to decline by 5% during the period 2014-2015. 35 Thus, a potential explanation for the negative TFP change, which is driven by a decline in both TC and EC depending on whether the courts operate on the frontier, is the decline in the caseload during this period. This, in itself, should not decrease TFP if the inputs are fully flexible. However, TFP does decrease if the inputs are not flexible enough to compensate for the lower workload level. The MPI and its confidence intervals are graphically reported for each court and each year, excluding the outliers (see Figs. 3, 4, and 5 in the "Appendix"). 36 Furthermore, the geometric means of the MPI and its components are provided for the individual courts in Table 6.

Table note: The bootstrapped confidence intervals at the 5% level of significance are reported in parentheses, where ** symbolizes significance at the 5% level. *Below (above) unity means significantly below (above) unity at the 5% level of significance.

34 The bias is not corrected for, since $s^2 = 0.00089$ and $\frac{1}{3}\left[\mathrm{Bias}_B\left(\hat{h}(x_i^t, y_i^t)\right)\right]^2 = 0.00036$, calculated as an annual average (Simar and Wilson 2000).
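The decision reported in footnote 34, not to apply a bias correction, can be reproduced with a short sketch. It assumes the commonly cited rule of thumb that the bootstrap bias correction is applied only when the sample variance of the bootstrap values is smaller than one third of the squared bias estimate; the function name `should_correct_bias` is hypothetical, and only the two reported figures come from the text.

```python
def should_correct_bias(sample_variance, bias):
    """Rule of thumb (assumed here): correct for bias only if the variance of
    the bootstrap values is smaller than one third of the squared bias."""
    return sample_variance < (bias ** 2) / 3.0

# Annual averages reported in footnote 34
s2 = 0.00089                 # sample variance of the bootstrap values
one_third_bias_sq = 0.00036  # one third of the squared bias estimate

# Absolute bias implied by the reported figure
bias = (3 * one_third_bias_sq) ** 0.5

correct = should_correct_bias(s2, bias)  # False: 0.00089 is not below 0.00036
```

Since the variance exceeds one third of the squared bias, correcting for the bias would, under this rule, add more noise than it removes.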
In columns 5 and 6 in Table 2, it can be observed that the number of courts with a significantly negative TFP change is fairly stable during the period 2012-2014; however, the number increases in 2014-2015. Furthermore, the number of courts with significantly positive TFP growth decreases in each year. Both the fact that the production frontier moves towards the origin and the fact that fewer courts have a significant and positive TFP change indicate that this result is not driven by a few observations. In addition to the caseload, other differences may generate differences in TFP change. For example, the organisation within the courts may be an issue. Thus, adjusting the organisation to that of the best performing court can generate a better development. To gain more information, TC is decomposed into pure TC and scale TC.
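The counts of courts significantly below and above unity can be obtained from the bootstrapped confidence intervals as in the following sketch. The intervals are hypothetical and the helper `classify` is introduced only for illustration; a court is counted as significant when its interval excludes unity.

```python
# Hypothetical bootstrapped confidence intervals for the MPI of five courts
intervals = {
    "Court A": (0.91, 0.98),
    "Court B": (0.97, 1.05),
    "Court C": (1.02, 1.11),
    "Court D": (0.88, 0.99),
    "Court E": (0.95, 1.00),
}

def classify(lower, upper):
    """Classify an MPI interval relative to unity (no TFP change)."""
    if upper < 1.0:
        return "below"          # significant TFP decline
    if lower > 1.0:
        return "above"          # significant TFP growth
    return "not significant"    # interval contains unity

counts = {"below": 0, "above": 0, "not significant": 0}
for lower, upper in intervals.values():
    counts[classify(lower, upper)] += 1
```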

Decomposition of TC
The decomposition of TC into pure TC and scale TC is performed according to Eq. 7 in Sect. 4. The results are reported in Table 3. TC is defined as the product of pure TC and scale TC. 37 Pure TC, which concerns the best firms assuming CRS, shows a significant decline in 2012-2013 and a smaller negative change in the following periods. The movement of the technology from the optimal scale (i.e., scale TC) generates a regress of 2.2% for the same time period. This indicates that the largest decline has its source in pure TC, meaning that the frontier moves inwards; however, scale TC also contributes negatively. In November 2011, there was a reform so that a type of matter, handled only by a few courts, was moved to the Swedish mapping, cadastral, and land registration authority. In particular, these courts have a large decline in pure TC; for example, Ångermanland had a negative pure TC of 34%. The source of this decline is that cases were moved at the end of 2011, which generated a smaller stock and fewer incoming cases during 2012-2013. Therefore, fewer outputs are produced; meanwhile, the inputs are not changed accordingly, even though the mentioned change was known by the courts at least 1 year in advance, indicating a flexibility problem. In contrast to the previous interpretation over time (i.e., that the result is not driven by a few courts), it can be concluded that the significant decline in pure TC during 2012-2013 is driven by district courts where different types of matters were moved to another authority. To investigate the components of EC, its decomposition is now reported.

35 It can also be noted that the incoming cases in this category decrease by 8% from 2013 to 2015 (SNCA 2016).

36 Figures of TFP change of all courts, including the outliers, can be made available upon request.

37 As noted in the methodology, a VRS assumption, using cross-period distance functions, is sensitive to infeasible values.
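Because TC is defined as the product of pure TC and scale TC, the two percentage changes quoted for 2012-2013 do not simply add up. The sketch below multiplies the index values implied by the 4.7% decline in pure TC and the 2.2% regress in scale TC; the resulting figure is an illustration of the arithmetic and is not read from Table 3. The helper name `compose` is hypothetical.

```python
def compose(*components):
    """Multiply index components (values around unity) into a total index."""
    product = 1.0
    for c in components:
        product *= c
    return product

pure_tc = 1 - 0.047   # 4.7% decline in pure TC, 2012-2013
scale_tc = 1 - 0.022  # 2.2% regress in scale TC, same period

tc = compose(pure_tc, scale_tc)   # about 0.932
tc_change_pct = (tc - 1) * 100    # about -6.8%
```

Under the assumption that the full index is likewise multiplicative in its four parts, the same helper could combine pure EC, scale EC, pure TC and scale TC into the MPI.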

Decomposition of EC
EC is decomposed into pure EC and scale EC, according to Färe et al. (1994b). The results are reported in Table 4. EC is positive for 2012-2013 and 2013-2014, respectively. The positive effect has its source in the positive pure EC and scale EC for both periods. This indicates that district courts, on average, become more homogeneous, since their efficiency measures the distance from the frontier. However, based on previous arguments regarding TC, it is not necessarily the case that a positive EC of 5.7% has its source in better performance of inefficient courts. Instead, using the decomposition of MPI, inefficiency is reduced when courts on the frontier move towards the origin, meaning that such courts are closer in distance to the previously inefficient courts. In other words, the positive and significant EC during 2012-2013 is, most likely, due to a movement inwards of the frontier.
During the period 2014-2015, EC is negative, indicating greater heterogeneity between courts; that is, the average court is further away from the production frontier. This effect comes almost equally from both components of EC. Thus, it can be concluded that most changes in TFP during the last time period stem from the different components of EC. As previously described, a decline in EC can also stem from a lower justice demand for courts that do not operate on the frontier ex ante. However, it can also stem from organisational issues; for example, if high-skilled employees leave the court and there are difficulties finding replacement staff.

Correlation analysis
A few of the previous studies argue for the importance of incorporating the justice demand to avoid underestimating a court's TFP. However, as stated, the justice demand should not affect TFP if the inputs are fully flexible; that is, there should be a zero correlation if this is fulfilled. The interpretation of the previously presented result indicates that the MPI and its components are not independent of the changes in workload. In Table 5, the MPI and its decomposed factors are correlated with the rate of change in the caseload. From Table 5, it can be observed that the correlations of the MPI and its components with the rate of change in the caseload are positive in all cases. To a large extent, these correlations are also statistically significant. A positive correlation can mean either that the inputs do not decrease enough when the demand for justice services declines or that the employees work harder when the demand increases, generating increased output for the given inputs. Each of these reasons indicates slack in the courts; that is, more can be produced without increasing the inputs. Schneider (2005) argues that the exclusion of the caseload generates an underestimation of TFP.
However, despite the positive correlation concluded in this section, we argue that the correct measure of TFP is what we reported in the main analysis. Nevertheless, the caseload can, at least partly, explain the results, indicating that the inputs are not flexible enough. The positive relationship between the MPI and the rate of change in the caseload is also in line with Beenstock and Haitovsky (2004), who argue that individual productivity increases when the work pressure is high. These results strengthen the previous argument of low flexibility in inputs. However, they should be interpreted carefully, since no causality can be concluded, and other factors that are not included here are likely to affect the TFP change.

Conclusion and policy recommendations
This paper aimed to investigate the development of TFP from 2012 to 2015. The differences in comparison with previous research are: (1) more detailed data are used, which allow the outputs to be weighted based on the hearing time; and (2) TFP is decomposed into four components, in contrast to a maximum of three in the earlier literature.
The findings indicate a 1.7% decline in TFP, which is measured as an annual geometric mean. However, a substantial variation between courts is found; for example, 36-57% of the courts have a significantly negative change in TFP, while 16-36% of the courts have a significantly positive TFP change, depending on the year. The negative TFP change is mainly driven by a decline in TC during the first period. Looking at the components of TC, it can be observed that most of the decline has its source in pure TC, which is argued to be attributable to a decline in the caseload. However, the period 2014-2015 has a negative TFP change, arising from a decline in pure TC, pure EC, and scale EC. This decline is also likely due to a smaller demand for justice services. However, the different components are differently affected, depending on where the court operates in relation to the frontier. Furthermore, the correlation between TFP and the rate of change of the caseload is concluded to be positive and significant, which strengthens the previous argument and, therefore, indicates an insufficient level of flexibility in inputs. The policy conclusion is that there is room for improvement. A recommendation is that district courts with negative TFP growth could learn from those with positive TFP growth in aspects of organisation and internal development of working methods. Furthermore, since the smallest courts have the largest volatility in TFP change, smoother changes can be achieved by merging courts, which would improve TFP. However, merging is, to some extent, constrained by the social and geographical issues that need to be taken into consideration. To avoid this issue, a less controversial policy implication that achieves more flexibility in the Swedish district courts is to develop the back-up labour force, introduced in 2012, to include other personnel than judges. This will allow the inputs to be adjusted when the demand fluctuates, which generates a higher degree of flexibility on the regional level.
In particular, this will enhance the flexibility of the small courts. The smallest courts have close to the minimum number of employees. Small courts are, by construction, more sensitive to changes in the workload, since a small absolute change in the justice demand corresponds to a large percentage change. Therefore, the issue of large volatility in the justice demand could, at least partly, be solved by an expansion of the back-up labour force to enhance flexibility. Furthermore, more flexible inputs across Sweden could potentially make it possible to eliminate the requirement of a minimum number of employees in each court. Instead, the volatility in the smallest courts can be served by flexible personnel (e.g., the back-up labour force).
Finally, peer comparisons of courts could be used in many potential aspects of the work for improving efficiency and productivity. For example, differences can be present that are not directly possible to determine in this study, such as organisational problems. This is, however, an aspect that can be taken into consideration in future research.

Appendix
Illustration of DEA
Figure 2 shows the production frontier under constant returns to scale (CRS) and variable returns to scale (VRS), during year t and year t + 1, when one input is used to produce one output. Year t + 1 is on a higher level of output, given the level of input, meaning that technical change (TC) has occurred. Furthermore, increasing returns to scale (IRS) can be observed, assuming VRS, to the left of the tangency between the CRS and the VRS frontier, indicating that the firms using less than this level of input are too small. Similarly, to the right-hand side of the tangency point, there are decreasing returns to scale, meaning that firms observed there would increase their productivity by becoming smaller.
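For the one-input, one-output CRS case illustrated in Figure 2, efficiency and the frontier shift can be computed without linear programming, because the CRS frontier is the ray through the unit with the highest output-input ratio. The sketch below is a simplification under that assumption, with hypothetical data and function names; the VRS frontier, which requires solving a linear programme per unit, is left out.

```python
def crs_frontier_slope(units):
    """Best observed output/input ratio; with one input and one output this
    ratio defines the CRS frontier."""
    return max(y / x for x, y in units)

def crs_efficiency(unit, units):
    """Efficiency of a unit relative to the CRS frontier of `units` (<= 1)."""
    x, y = unit
    return (y / x) / crs_frontier_slope(units)

# Hypothetical (input, output) pairs for three units in year t and year t + 1
year_t = [(10.0, 20.0), (20.0, 30.0), (40.0, 40.0)]
year_t1 = [(10.0, 22.0), (20.0, 36.0), (40.0, 44.0)]

# Technical change: shift of the frontier between the two years
tc = crs_frontier_slope(year_t1) / crs_frontier_slope(year_t)

results = []
for before, after in zip(year_t, year_t1):
    # Efficiency change: movement of the unit relative to its own frontier
    ec = crs_efficiency(after, year_t1) / crs_efficiency(before, year_t)
    # In this special case the Malmquist index is the change in output/input
    mpi = (after[1] / after[0]) / (before[1] / before[0])
    results.append((ec, tc, mpi))
```

In this special case the index for every unit equals the product of its efficiency change and the common frontier shift, with values above unity indicating improvement.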

Results including all courts
To incorporate information of the results including all courts, the geometric means of all years for the individual courts, excluding the outliers, are reported in Table 6.
Finally, the total factor productivity (TFP) change and its confidence intervals are reported for each court and year, excluding the outliers, in Figs. 3, 4, and 5.
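The geometric means reported per court summarise the annual index values multiplicatively, so that compounding the mean over the periods reproduces the cumulative change. A minimal sketch with hypothetical annual MPI values for a single court (not taken from Table 6):

```python
from math import prod

def geometric_mean(values):
    """Geometric mean of positive index values."""
    return prod(values) ** (1.0 / len(values))

# Hypothetical MPI of one court for 2012-2013, 2013-2014 and 2014-2015
annual_mpi = [0.96, 1.01, 0.98]

gm = geometric_mean(annual_mpi)
avg_annual_change_pct = (gm - 1) * 100  # average annual TFP change in percent
cumulative = prod(annual_mpi)           # total change over the three periods
```

An arithmetic mean of the same values would not compound back to the cumulative change, which is why the geometric mean is the natural summary for a chained index.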