Measuring technical efficiency for multi-input multi-output production processes through OneClass Support Vector Machines: a finite-sample study

We introduce a new method for the estimation of production technologies in a multi-input multi-output context, based on OneClass Support Vector Machines with a piecewise linear transformation mapping. Through a finite-sample simulation study, we compare the new technique with Data Envelopment Analysis (DEA) for estimating technical efficiency. The criteria adopted for measuring the performance of the estimators are bias and mean squared error. The simulations reveal that the machine learning approach provides better results than DEA in our finite-sample scenarios. We also show how to adapt several well-known technical efficiency measures to the introduced estimator. Finally, we compare the new technique with DEA through an application to an empirical database of USA schools from the Programme for International Student Assessment, where we obtain statistically significant differences in the efficiency scores determined through the Slacks-Based Measure.


Introduction
Measuring the technical efficiency of firms and organizations in the context of production processes with multiple inputs and outputs is a problem that has received considerable attention in the specialized literature over the last decades [see, for example, the recent contributions by Aparicio et al. (2020), Ebrahimi et al. (2021), Liao et al. (2022)]. Since the seminal work by Farrell (1957), who dealt with single-output production frameworks, many methods have been proposed for this task, but only a few have received enough attention to develop into their own sub-areas of research. These methods include both non-parametric perspectives, such as Data Envelopment Analysis (DEA) (Charnes et al. 1978; Banker et al. 1984), as well as parametric approaches such as Stochastic Frontier Analysis (SFA) (Aigner et al. 1977; Meeusen and van Den Broeck 1977). Of these two approaches, the non-parametric perspective stands out as one of the most appealing, due to its flexibility, the mild conditions it requires, and the natural way in which it handles multi-input multi-output production contexts. This paper focuses on this sub-area of research.
In this context, the usual goal is to estimate the technical efficiency of a set of observed production units, usually called Decision Making Units (DMUs). This is often done by evaluating each unit against a set of feasible input-output bundles, also known as the production possibility set or technology, and considering how much each DMU can be modified before leaving this technology along certain improvement paths. In this framework, a certain part of the border of the technology, the so-called efficient frontier, plays a major role. In particular, a DMU is efficient when any permitted modification would project it outside the technology. In the case of inefficient DMUs, there are many possible directions in which a DMU can be projected towards the efficient frontier. Some of the approaches modify either only the inputs or only the outputs, leaving the rest unchanged. Amongst these, the first to be introduced were the Farrell radial input (output) measures, see Banker et al. (1984), where every input (output) is scaled by the same factor, thus keeping the mix of inputs (outputs) constant. Later, more general measures were introduced, such as the Russell measures (Färe and Lovell 1978), which allow the use of a different scaling factor along each input (output). Furthermore, by considering the ratio between input and output Russell efficiencies, Pastor et al. (1999) and Tone (2001) introduced the Slacks-Based Measure (SBM), which satisfies additional desirable properties. Another approach considers directional distance functions (DDF), which choose a directional vector and then project each DMU in this direction (Luenberger 1992a, b; Chambers et al. 1998). Other alternative measures have been introduced over the last decades. For example, the Additive Model introduced in Charnes et al. (1985) is an alternative formulation of the radial and directional models which measures inefficiency via slacks in each input-output dimension.
Through the weighting of each of the components, a Weighted Additive Model (WAM) was introduced in Lovell and Pastor (1995), which, through different choices of the weights, results in different technical inefficiency measures (see, e.g., Cooper et al. 1999).
Another attractive area for the quantitative analysis of data samples, which has largely grown separately from efficiency estimation, is that of Machine Learning (ML). In this area, certain algorithms are applied to determine a target function from available data. In our opinion, DEA can be seen as a ML technique whose target is the underlying technology responsible for generating the set of observations. Under the statistical approach in production theory postulated by Daraio and Simar (2007, Chapter 3), the so-called technology coincides with the support of the joint probability distribution of inputs and outputs. In this way, determining this support leads to the identification of the technology and vice versa.
Amongst machine learning algorithms, a family which is widely used and has solid theoretical foundations is that of Support Vector Machines (SVM), introduced in Vapnik (1998, 2013), which are based on the minimization of two types of errors, namely the empirical and the generalization errors, with a hyperparameter controlling the weighting between the two. Originally introduced for the classification of two classes of labelled data, many adaptations and extensions to other contexts have been proposed. In particular, an adaptation of the classical SVM classifier to the context of unsupervised learning is the OneClass Support Vector Machine algorithm (OneClassSVM), introduced in Schölkopf et al. (2001). This algorithm takes the view that an unlabelled dataset can be seen as a binary classification dataset where only examples from one of the two classes are available, and thus translates the separation of the classes into the problem of estimating the support of the available data and separating it from the rest of the coordinate space. In this way, the algorithm estimates the support of the joint probability distribution from which the available data are random samples. However, as far as we are aware, this technique has so far not been adapted to identify technologies in production theory, i.e., input-output sets satisfying certain microeconomic axioms (such as convexity or free disposability), or to determine the usual technical efficiency measures (radial models, Russell measures, directional distance functions, and so on). This is the main objective of this paper, thus narrowing the existing gap between machine learning techniques and technical efficiency measurement from a non-parametric perspective.
In this paper, for the first time, we show how to determine a list of well-known technical efficiency measures by tailoring the so-called OneClassSVM technique. As a way of gathering evidence on the validity of this adaptation, we compare our results with those obtained by the standard DEA methodology. We do so through two different strategies. First, we evaluate the goodness of our technology estimator by means of a simulation study based on finite samples, along the lines of papers such as Gong and Sickles (1992). Subsequently, we illustrate how the new methodology performs through an empirical example and the Slacks-Based Measure. One of the tools which makes the standard SVM techniques powerful estimators is the introduction of a transformation function and associated kernel on the data, which embeds the data into a higher-dimensional vector space where the separation can be performed better. For this purpose, and taking the piecewise linear technology estimated by DEA as inspiration, we choose a piecewise linear (PWL) transformation function introduced in Huang et al. (2013), which we adapt so that the estimated technology satisfies the usual microeconomic axioms in Production Theory (such as convexity and free disposability in inputs and outputs).
Overall, in this paper, we establish a new link between ML techniques and the measurement of technical efficiency, in the same line previously followed by authors such as Tsionas (2022), Esteve et al. (2020, 2022), Valero-Carreras et al. (2021, 2022), Ruggiero (2018, 2022), Daouia et al. (2016), Parmeter and Racine (2013), and Liao et al. (2022). In particular, Valero-Carreras et al. (2021, 2022) introduced an adaptation of SVM for regression, that is, they tailored a supervised machine learning technique to estimate production frontiers. In contrast, in this paper, we focus our attention on the adaptation of OneClassSVM, which is an unsupervised methodology. Other recent and related papers on ML and efficiency are Tsolas et al. (2020) and Thaker et al. (2022).
This paper follows the distinction between absolute and relative technical efficiency recently introduced by Aparicio and Esteve (2022). DEA identifies relative technical efficiency, that is, the degree of efficiency measured in comparative terms with respect to the performance of exactly the N observed units in the data sample. In contrast, absolute technical efficiency corresponds to the efficiency measured regarding the unknown Data Generating Process from which the data were drawn. In particular, the new method that we introduce attempts to measure absolute technical efficiency.
The remainder of the paper is structured as follows. Section 2 introduces the usual DEA context, with the netput notation that we use throughout the paper, the list of the most well-known efficiency measures, and the fundamentals of the standard OneClassSVM algorithm. Section 3 develops the model that we propose, as well as two strategies to obtain the hyperparameters involved. Then, in Sect. 4, we adapt multiple measures of efficiency to the context of our algorithm and introduce the linear problems that are solved to calculate the efficiency scores. Sect. 5 contains some computational experiments that we use to evaluate the performance of our approach and compare it with traditional DEA under a finite-sample analysis. In Sect. 6, we apply the new approach to an empirical database from the literature, consisting of schools from the USA involved in the 2015 Programme for International Student Assessment (PISA) report, to illustrate and compare the new approach and DEA using, in particular, the Slacks-Based Measure. Finally, Sect. 7 contains the conclusions of this research and outlines several possible future research lines.

Notation
In this section, we describe the notation that we use throughout this paper. We denote scalar variables by Roman letters, and vectors by lowercase boldface letters. We denote the d-dimensional Euclidean space by ℝ^d, and its non-negative (non-positive) orthant by ℝ^d_+ (ℝ^d_−). A dataset Z of DMUs contains N DMUs, where I is the number of inputs and O is the number of outputs. Usually, given a vector a, we denote its j-th component by a_j. However, in the case of the DMUs, which we denote by z_1, z_2, ..., z_N ∈ ℝ^{I+O}, we indicate the components of these vectors using brackets, that is, z = (z(1), ..., z(I+O)) and z_i = (z_i(1), ..., z_i(I+O)). We similarly denote other vectors indexed in two different ways, such as the coefficients of the hyperplanes in our programs, which are denoted by p_{I+O+1}, ..., p_{I+O+H}, as well as the slacks vectors. In particular, the j-th component of the k-th hyperplane is p_k(j). Bold numbers such as 0 and 1 denote constant vectors with this number in every component. The dimension of these vectors is clear from the context in which they appear.
For operations between vectors, given a = (a_1, ..., a_d) and b = (b_1, ..., b_d), we denote the componentwise vector product, also called the Hadamard product, by a ⊙ b = (a_1 b_1, a_2 b_2, ..., a_d b_d). Vector inequalities indicate that the specified inequality holds componentwise, e.g., a ≥ b indicates that a_i ≥ b_i for all i = 1, ..., d. We remark that a > 0 indicates that every component of a is strictly positive, whereas a ≥ 0 and a ≠ 0 indicate that every component of a is non-negative and at least one component is nonzero (some components may be 0).
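As an illustrative aside (not part of the paper's formulation), these componentwise conventions translate directly into array operations, e.g., in NumPy:

```python
import numpy as np

# Componentwise (Hadamard) product and componentwise vector inequalities,
# matching the netput notation conventions described above.
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

hadamard = a * b                    # a ⊙ b = (a1*b1, ..., ad*bd)
geq = bool(np.all(b >= a))          # "b >= a": holds in every component
strictly = bool(np.all(b > a))      # "b > a": strict in every component

print(hadamard)                     # [ 4. 10. 18.]
print(geq, strictly)                # True True
```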

Data envelopment analysis
Given a set Z of N Decision Making Units (DMUs) to be assessed, we denote DMU k in netput notation by z_k = (z_k(1), ..., z_k(I+O)) = (−x_k, y_k), where the DMU consumes x_k = (x_k(1), ..., x_k(I)) > 0 amounts of inputs and produces y_k = (y_k(1), ..., y_k(O)) > 0 amounts of outputs. Given this dataset Z, we have an underlying production process which we want to understand. In this context, the first goal is to estimate the technology or production possibility set, denoted by T, which consists of those pairs of input and output vectors that can be produced by the production process, i.e., the combinations of inputs and outputs which are feasible. In the netput notation, T = {(−x, y) ∈ ℝ^{I+O} : x can produce y}. In this context, DEA estimates a technology which coincides with the convex closure of the dataset, extended in the directions that are appropriate to satisfy the free disposability of inputs and outputs.
As such, the technology estimated by DEA is conservative, in the sense that it fits perfectly to the data. From a machine learning point of view, this results in overfitting, with an associated relatively weak power of generalization to unseen data. This is one of the issues that we wish to address by considering DEA as a machine learning problem, that is, we want to estimate technologies that are closer to the theoretical, underlying technology, rather than to the particular dataset that we have available.
This way of thinking about the underlying Data Generating Process (DGP) was formalized by Daraio and Simar in (Daraio and Simar 2007, Chapter 3), and allows us to consider the estimation of the technology as a problem of estimating the support of the underlying DGP.

Measures of efficiency in data envelopment analysis
In the context of the measurement of technical efficiency, in particular when working with DEA and related techniques, technical efficiency measures how much a DMU can be modified while staying within the technology. The direction of modification must be such that it reduces the inputs, increases the outputs, or a combination of both. In terms of netputs, it must either keep constant or increase every component of the DMU. The DMUs can be modified in many different ways while satisfying this condition and staying within the technology T, and we now introduce some of the methods described in the literature to project the DMUs in appropriate directions. We can obtain estimates of the technical efficiency by replacing the theoretical T by an appropriate estimator, such as T_DEA.
Some of the first methods for measuring technical efficiency in the DEA context are the radial models, both input and output-oriented (Farrell 1957;Charnes et al. 1978;Banker et al. 1984). These are usually called the Farrell input (output) distances, and project along the inputs (outputs) by multiplying all of them by the same constant, that is, in a radial direction, while leaving the outputs (inputs) constant.
The output-oriented Farrell distance is:

(2) F_out(z) = max{β : (−x, βy) ∈ T}.

The input-oriented Farrell distance is:

(3) F_in(z) = min{θ : (θ(−x), y) ∈ T}.

These measures are radial and scale every output (input) by the same constant. As such, they leave the relative proportions (mix) of outputs (inputs) constant. The range of values that they can attain is, whenever z ∈ T, F_out(z) ∈ [1, +∞), whereas F_in(z) ∈ (0, 1], where, in both cases, a value of 1 indicates technical efficiency. However, with the Farrell measures, a DMU may still have some room for improvement (slack) along some components of z and not others, which led to the introduction of other measures.
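For concreteness, the output-oriented Farrell distance over a DEA (variable returns to scale) technology can be computed with an off-the-shelf LP solver. The sketch below is ours, using the input/output form of the standard envelopment program rather than the paper's netput notation; the function name and toy data are illustrative:

```python
import numpy as np
from scipy.optimize import linprog

def farrell_output(X, Y, x0, y0):
    """Output-oriented radial (Farrell) efficiency under VRS (BCC model).

    X: (N, I) inputs, Y: (N, O) outputs of the observed DMUs.
    Returns beta >= 1; beta == 1 means the unit is radially output-efficient.
    """
    N, I, O = X.shape[0], X.shape[1], Y.shape[1]
    # Decision vector: [beta, lambda_1, ..., lambda_N]; maximize beta.
    c = np.concatenate(([-1.0], np.zeros(N)))
    # beta * y0 - Y^T lambda <= 0  (scaled outputs stay below the hull)
    A_out = np.hstack((y0.reshape(-1, 1), -Y.T))
    # X^T lambda <= x0             (inputs are not increased)
    A_in = np.hstack((np.zeros((I, 1)), X.T))
    A_ub = np.vstack((A_out, A_in))
    b_ub = np.concatenate((np.zeros(O), x0))
    # sum(lambda) = 1 (variable returns to scale)
    A_eq = np.concatenate(([0.0], np.ones(N))).reshape(1, -1)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(None, None)] + [(0, None)] * N)
    return res.x[0]

# Toy example: two single-input single-output DMUs with the same input.
X = np.array([[1.0], [1.0]])
Y = np.array([[2.0], [1.0]])
print(farrell_output(X, Y, X[1], Y[1]))  # DMU 2 can double its output -> 2.0
```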
The so-called Russell measures are a generalization of the Farrell measures which allow for the rescaling of each input (output) by a different constant, thus allowing for the presence of slacks along only some of the inputs (outputs). They are non-radial measures of efficiency, introduced in Färe and Lovell (1978), (Färe et al. 1985, p.149), and are also called the Färe-Lovell efficiency indices in the literature.
The Russell measure of output efficiency, in the netput context, is defined as:

R_out(z) = max{ (1/O) Σ_{r=I+1}^{I+O} β_r : (−x, β ⊙ y) ∈ T, β_r ≥ 1 for all r }.

Similarly, the Russell measure of input efficiency is defined as:

R_in(z) = min{ (1/I) Σ_{i=1}^{I} θ_i : (θ ⊙ (−x), y) ∈ T, 0 < θ_i ≤ 1 for all i }.

We observe that the difference between the two orientations of the Russell measure lies in which coordinates are allowed to be rescaled (outputs or inputs), while the others (inputs or outputs) are kept constant, and in the ranges allowed for the scaling factors, which ensure the appropriate orientation. As a result, whenever z ∈ T, we have R_in(z) ∈ (0, 1], whereas R_out(z) ∈ [1, +∞). In either case, a DMU is considered efficient when it attains a value of 1, which indicates that no improvement is possible along any input (output).
The Farrell measures are special cases of the corresponding Russell measures with the additional restriction that θ_1 = θ_2 = ... = θ_I for the input orientation and β_{I+1} = β_{I+2} = ... = β_{I+O} for the output orientation.
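The output-oriented Russell measure differs from the Farrell program only in that each output receives its own expansion factor. The following sketch (our illustration, in input/output rather than netput form) makes that difference concrete over a VRS DEA technology:

```python
import numpy as np
from scipy.optimize import linprog

def russell_output(X, Y, x0, y0):
    """Output-oriented Russell measure under VRS: each output r gets its own
    expansion factor phi_r >= 1, and the score is their average."""
    N, I, O = X.shape[0], X.shape[1], Y.shape[1]
    # Decision vector: [phi_1..phi_O, lambda_1..lambda_N]; maximize mean(phi).
    c = np.concatenate((-np.ones(O) / O, np.zeros(N)))
    # phi_r * y0_r - sum_j lambda_j y_jr <= 0, for each output r
    A_out = np.hstack((np.diag(y0), -Y.T))
    # sum_j lambda_j x_ji <= x0_i, for each input i
    A_in = np.hstack((np.zeros((I, O)), X.T))
    A_ub = np.vstack((A_out, A_in))
    b_ub = np.concatenate((np.zeros(O), x0))
    A_eq = np.concatenate((np.zeros(O), np.ones(N))).reshape(1, -1)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(1, None)] * O + [(0, None)] * N)
    return -res.fun

# Toy data: DMU 2 can double output 1 but not output 2 -> score (2+1)/2 = 1.5,
# whereas the Farrell measure would be limited by the binding output.
X = np.array([[1.0], [1.0]])
Y = np.array([[2.0, 2.0], [1.0, 2.0]])
print(russell_output(X, Y, X[1], Y[1]))
```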
Additionally, another alternative is the additive-type measure. Like the DDF, it allows for changes in both inputs and outputs simultaneously. Furthermore, by assigning an independent slack to each variable via a slacks vector s ∈ ℝ^{I+O}_+ (here all slacks are non-negative due to the netput notation) and weighting the penalties through a vector w ∈ ℝ^{I+O}_+ with w > 0, these measures allow for the detection of further inefficiencies along some of the directions. The basic formulation is the Weighted Additive Model (WAM, Lovell and Pastor 1995), and different choices of weights then result in different measures (see, e.g., Cooper et al. 1999). This model also treats inputs and outputs in a homogeneous way due to the netput notation. The WAM formulation is: WAM(z) = max{⟨w, s⟩ : z + s ∈ T, s ∈ ℝ^{I+O}_+}. Amongst the choices for weights, we consider the Measure of Inefficiency Proportions (MIP, Cooper et al. 1999) and the Range Adjusted Measure (RAM, Cooper et al. 1999). With these additive measures, as with the DDF, a DMU is considered to be efficient whenever its efficiency value is 0, and their range is [0, +∞). However, due to the variation in normalizing factors, their magnitudes are different and are not directly comparable.
We now introduce the Slacks-Based Measure, see Pastor et al. (1999), Tone (2001), also called the Enhanced Russell Graph measure. This measure is based on the ratio between the Russell input and Russell output measures. The SBM is formulated as an ordinary linear fractional programming model that can be linearized using a standard approach in the literature (see Charnes and Cooper 1962). The formulation which we take as a starting point for adaptation is the additive model given as Model (7) in Tone (2001). In this paper, we will tailor all these measures to the context of our ML model.

OneClass Support Vector Machines
We now introduce the unsupervised machine learning algorithm, called OneClass Support Vector Machine, sometimes abbreviated as OneClassSVM, 1CSVM or 1SVM, that we adapt for the estimation of production technologies. It was introduced in Schölkopf et al. (2001) as an adaptation of classification Support Vector Machines (SVM) to the setting of estimating the support of a high-dimensional distribution, bringing us to the context of estimating a technology as the support of the probability distribution of inputs and outputs discussed above. An observation at the core of this approach is that estimating the support of a sample can be seen as a particular case of binary classification where only examples of one of the classes are available. The OneClassSVM quadratic program is:

(9) min_{w, ξ, ρ} (1/2)‖w‖² + (1/(νN)) Σ_{i=1}^{N} ξ_i − ρ
s.t. ⟨w, φ(z_i)⟩ ≥ ρ − ξ_i, ξ_i ≥ 0, i = 1, ..., N.

This program will obtain a solution (w*, ξ*, ρ*), which defines the estimated support of the dataset, of the form:

{z ∈ ℝ^{I+O} : ⟨w*, φ(z)⟩ ≥ ρ*}.

This program involves a hyperparameter ν ∈ (0, 1] determining how much weight is given to each component of the objective function, and it also involves a transformation function φ : ℝ^{I+O} → ℝ^{I+O+H}, that is, a transformation from the space of netputs into a higher-dimensional real vector space that is key in determining the properties of the estimated set. We choose a transformation function based on the following piecewise linear transformation function (PWL) for our algorithm, which will result in a polyhedral set.
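The standard (kernelized) OneClassSVM is available off the shelf; the snippet below illustrates the role of ν on synthetic data using scikit-learn's RBF-kernel implementation. This is only to fix ideas about the baseline algorithm: the paper replaces the kernel with the PWL transformation described next, and the toy data are ours.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# OneClassSVM as in Schölkopf et al. (2001): nu upper-bounds the fraction of
# outliers and lower-bounds the fraction of support vectors.
rng = np.random.default_rng(0)
Z = rng.uniform(0.0, 1.0, size=(100, 2))   # toy "netput" sample

clf = OneClassSVM(kernel="rbf", nu=0.1).fit(Z)
n_outliers = int(np.sum(clf.predict(Z) == -1))   # points left outside
n_sv = len(clf.support_)                          # support vectors

# With nu = 0.1 and N = 100, roughly: n_outliers <= 10 <= n_sv.
print(n_outliers, n_sv)
```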
This piecewise linear transformation function was introduced and studied in [Huang et al. 2013, Expression (12)]. We choose it for its parallelism with DEA, which estimates technologies with a piecewise linear boundary, that is, a polyhedral set. This function involves a number H of hyperplanes, each defined by its slope vector p_k ∈ ℝ^{I+O} and its intercept q_k ∈ ℝ. These can either be treated as hyperparameters of the model, which requires large amounts of computation to tune the H(I + O + 1) hyperparameters involved, or they can be chosen by a reasoned heuristic, which is the approach that we take in this paper.
The role of the hyperplanes in the transformation is that the boundary of the estimated set consists of flat portions until it reaches one of the hyperplanes, where ⟨p_k, z⟩ + q_k = 0; at that point, at least one of the hyperplane components changes the function selected in the maximum, thus allowing the boundary to change direction. Therefore, the hyperplanes determine where the edges of the estimated polyhedral set are located.

The new approach
The model for the estimation of the technology that we use is the following adaptation of the OneClassSVM model (9) to the world of efficiency estimation.
The objective function of (12) and restrictions (12a) and (12b) are identical to the original OneClassSVM, while restriction (12c) guarantees convexity due to our choice of piecewise linear mapping (see Huang et al. 2013), which we use in the following adapted version: This transformation differs from the formulation in (11) in two ways. First, we take the negatives of each component, so that the estimated region coincides with the area where the data lie. Second, we replace the 0 in the hyperplane components by a hyperparameter δ, which we tune during the training process. The effect of this change is that, instead of allowing the edges of the technology to be at the hyperplanes satisfying ⟨p_k, z⟩ + q_k = 0, these edges will now be located in the regions where ⟨p_k, z⟩ + q_k = δ. By considering different values of δ, we allow these edges to be in slightly different regions, and the hyperparameter tuning will then compare them and select the value of δ which yields the best estimator. This, therefore, enables the estimator to select amongst various candidate sets, and helps reduce overfitting. We will show later in the paper how our estimator works through a computational experience.
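One plausible reading of the adapted PWL transformation, consistent with the two modifications just described, is sketched below. The exact component form is our assumption and may differ in detail from Expression (13); the sign flip and the δ threshold are the two changes described in the text.

```python
import numpy as np

def pwl_features(z, P, q, delta):
    """Sketch of an adapted PWL feature map (an assumption, not the paper's
    verbatim formula): keep the netput coordinates and append, for each
    hyperplane (p_k, q_k), the negative of max(<p_k, z> + q_k, delta).
    Each appended component is concave in z, and the pieces switch exactly
    where <p_k, z> + q_k = delta, which is where edges may appear."""
    hyper = np.maximum(P @ z + q, delta)
    return np.concatenate((z, -hyper))

# Toy netput z = (-x, y) with two hypothetical hyperplanes.
z = np.array([-1.0, 2.0])
P = np.array([[0.5, 0.5], [1.0, 0.0]])
q = np.array([-0.5, 0.0])
print(pwl_features(z, P, q, delta=-0.2))
```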
Problem (12) is solved with given values for ν and δ. This way, (w*, ξ*, ρ*) denotes an optimal solution of (12). From it, the technology estimated by program (12) is defined by:

T(ν, δ) = {z ∈ ℝ^{I+O} : ⟨w*, φ(z)⟩ ≥ ρ*}.

The estimated technology T(ν, δ) then satisfies the following microeconomic axioms:

Proposition 3.1 The following hold:
1. T(ν, δ) is convex.
2. T(ν, δ) satisfies free disposability of inputs and outputs.
3. The number of outliers (n_OL) is at most νN and the number of support vectors (n_SV) is at least νN. In other words, n_OL/N ≤ ν ≤ n_SV/N.

Convexity of T(ν, δ) follows as in (Huang et al. 2013, Section 5), given that the defined φ is concave and w ≥ 0. Free disposability of inputs and outputs is satisfied as in CNLS [see Kuosmanen and Johnson (2010), Section 2.2] when imposing additionally that p_k ≥ 0, so we will determine these parameters in a manner consistent with this choice. Finally, the bound on the fraction of outliers holds as in (Schölkopf et al. 2001, Proposition 3). The following corollary is a consequence of the principle of minimal extrapolation in DEA and Proposition 3.1(3). It states that the DEA estimator of the technology is always a subset of the estimator built from the adaptation of OneClassSVM.
Problem (12) with transformation (13) involves the hyperparameters ν and δ, which we tune via a train-test split in order to obtain the best ones for each situation. In particular, by Proposition 3.1(3), ν can be seen as a lower bound on the fraction of support vectors allowed, and an upper bound on the fraction of outliers. We choose ν in the range [1/(N + 1), 0.1], except when N ≤ 10, where we choose ν ∈ [0.1, 0.3]. This results in a minimum of 0 outliers and a maximum of 10% of the DMUs being outliers. The values of δ depend on the hyperplanes chosen for φ, and we describe its role in the next section.

Hyperplane parameters
The hyperplanes involved in the piecewise linear feature mapping that we use have a large impact on the performance of the estimator. The parameters p_k, q_k and δ define, as in Huang et al. (2013), the regions where the boundary of the technology has turning points. As such, we are interested in using hyperplanes which lie between the dataset and the edge of the theoretical technology. Some examples of such hyperplanes are given by the convex closure of the data, in other words, the hyperplanes obtained by the DEA estimator. In order to obtain them, we solve the directional distance function (DDF) DEA program in its multiplier form, with directional vector g = 1, corresponding to the Chebyshev norm l∞, see Briec (1999). We solve the program for each z_i ∈ Z to obtain its corresponding parameters p_k, q_k:

(14) β_i = min_{p, q} −(⟨p, z_i⟩ + q) s.t. ⟨p, z_j⟩ + q ≤ 0 for all z_j ∈ Z, ⟨p, 1⟩ = 1, p ≥ 0.

This is a netput-adapted form of the DDF program introduced in Equation (6) using the DEA-estimated technology, as found in (Pastor et al. 2012, Program 3). By solving this program, we ensure that the coefficients of the slope of each hyperplane add up to one, thus obtaining components in the transformation which have comparable magnitudes. Furthermore, they are non-negative, thus ensuring free disposability of the obtained estimators [see Proposition 3.1(2)].
Solving this problem for each DMU, we obtain N hyperplanes in the desired region. Furthermore, we obtain a set of values β_i which determine the distance, along the directional vector g = 1, of each DMU with respect to the DEA-estimated frontier, in other words, the convex closure of the dataset (extended by free disposability). This yields a minimum reasonable value for the offset hyperparameter δ, given by δ_min := min_i {−β_i}. When δ takes the value δ_min, the hyperplanes get offset so that every DMU is above at least one hyperplane and, if δ < δ_min, the hyperplanes are located in the region between the dataset and the origin, not enabling the frontier to have edges in the appropriate regions. Therefore, we choose the interval [δ_min, 0] as a suitable range of values for δ.
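The multiplier DDF program, as we reconstruct it above (the exact normalization details may differ in the paper), can be solved with a standard LP solver; the helper name and toy netput data below are ours:

```python
import numpy as np
from scipy.optimize import linprog

def ddf_hyperplane(Z, i):
    """Supporting-hyperplane (multiplier DDF) parameters for DMU i, with
    directional vector g = 1 (Chebyshev norm):
        beta_i = min_{p,q} -(<p, z_i> + q)
        s.t.  <p, z_j> + q <= 0 for all j,  <p, 1> = 1,  p >= 0.
    Returns (p, q, beta_i)."""
    N, d = Z.shape
    c = np.concatenate((-Z[i], [-1.0]))          # minimize -(p.z_i + q)
    A_ub = np.hstack((Z, np.ones((N, 1))))       # p.z_j + q <= 0 for all j
    b_ub = np.zeros(N)
    A_eq = np.concatenate((np.ones(d), [0.0])).reshape(1, -1)  # sum(p) = 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * d + [(None, None)])
    p, q = res.x[:d], res.x[d]
    return p, q, res.fun

# Toy netputs z = (-x, y): DMU 1 uses 1 input for 2 outputs, DMU 2 uses 2 for 1.
Z = np.array([[-1.0, 2.0], [-2.0, 1.0]])
p, q, beta = ddf_hyperplane(Z, 1)
print(round(beta, 6))   # distance of DMU 2 to the DEA frontier along g = 1 -> 1.0
```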
We now discuss two possible ways to define larger sets of hyperplanes for the PWL mapping: (1) the duplication of the hyperplane slopes with slightly modified offsets, and (2) the calculation of linear combinations of the existing hyperplanes, in particular via the mean.

Duplicate hyperplanes strategy
Following the process above, we obtained N hyperplanes with slopes p_k. A way to obtain a larger number of hyperplanes is to duplicate each hyperplane, keeping the same slope and offsetting the intercept slightly both upwards and downwards, in order to obtain more flexibility in the estimated technologies. After some testing, we choose the offset for the duplicated hyperplanes to be 0.05R, where R is the range of values taken by the data. This yields a value of H = 3N for the number of hyperplane components of the PWL mapping.
Thus, in this case, we work with a set of slopes p_k obtained by solving Problem (14) once for each DMU and, for each slope p_k, we take the corresponding intercept term q_k and consider three hyperplanes defined by (p_k, q_k − 0.05R), (p_k, q_k), and (p_k, q_k + 0.05R). As such, the piecewise linear transformation has I + O + H components, where H = 3N, which keeps a reasonable size. We provide an illustration of the types of hyperplanes thus obtained in Fig. 1a.

Footnote 1: We also considered other methods of obtaining the hyperplane parameters. Treating them as hyperparameters results in a large number H(I + O + 1) of hyperparameters to tune, and so involves large computational expense without significant improvements. A grid of "flat" hyperplanes with p_k = (0, ..., 1, ..., 0) performed worse than the DEA hyperplanes alone, and took much smaller weights when considered together. Various other pre-set values for the slopes, such as p_k = 1/(I + O), posed the same problem.
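The duplication step itself is a simple array operation; a minimal sketch (function name and toy data are ours):

```python
import numpy as np

def duplicate_hyperplanes(P, q, data_range):
    """Duplicate hyperplanes strategy: keep each base hyperplane (p_k, q_k)
    and add copies with intercepts offset by +/- 0.05 * R, where R is the
    range of the data, yielding H = 3N hyperplane components."""
    offset = 0.05 * data_range
    P3 = np.vstack((P, P, P))                            # same slopes, 3 copies
    q3 = np.concatenate((q - offset, q, q + offset))     # shifted intercepts
    return P3, q3

P = np.array([[0.5, 0.5], [1.0, 0.0]])   # N = 2 base slopes
q = np.array([-0.5, 1.0])
P3, q3 = duplicate_hyperplanes(P, q, data_range=2.0)
print(P3.shape, q3.tolist())
```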

Mean hyperplanes strategy
Another approach we consider in order to obtain a higher number of hyperplanes in the transformation consists of, after solving Problem (14) for every DMU, the definition of new hyperplanes by taking linear combinations of the existing hyperplanes. We take the hyperplanes defined by the mean of two existing hyperplanes. In other words, given two hyperplanes (p_k, q_k) and (p_l, q_l), we define the mean hyperplane by p_{k,l} = (p_k + p_l)/2 and q_{k,l} = (q_k + q_l)/2. This yields hyperplanes with varying slopes which still live in the appropriate region for the edges of the technology, and creates N(N − 1)/2 hyperplanes in the transformation. We illustrate this approach in Fig. 1b.
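The pairwise averaging can be sketched as follows (our helper, with toy data):

```python
import numpy as np
from itertools import combinations

def mean_hyperplanes(P, q):
    """Mean hyperplanes strategy: for every pair (k, l) of base hyperplanes,
    add the average hyperplane ((p_k + p_l)/2, (q_k + q_l)/2), giving
    N(N-1)/2 extra components."""
    pairs = list(combinations(range(len(q)), 2))
    P_mean = np.array([(P[k] + P[l]) / 2 for k, l in pairs])
    q_mean = np.array([(q[k] + q[l]) / 2 for k, l in pairs])
    return P_mean, q_mean

P = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])   # N = 3 base slopes
q = np.array([0.0, -1.0, 2.0])
P_m, q_m = mean_hyperplanes(P, q)
print(P_m.shape[0])   # 3*(3-1)/2 = 3 mean hyperplanes
```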
From our computational experience (Sect. 5), we conclude that the duplicate hyperplanes strategy is superior to the mean hyperplanes strategy and is less computationally expensive.

Technical inefficiency: the output-oriented directional distance function program
In order to evaluate and select the most appropriate technology estimator from the different choices of the hyperparameters ν and δ, we need a method to evaluate the different candidates and choose amongst them. We do so by calculating the Directional Distance Function (DDF) measure of inefficiency which, when we estimate a technology T(ν, δ), is defined by:

(15) D(z; g) = max{β : z + βg ∈ T(ν, δ)}.

This requires the specification of a directional vector g ∈ ℝ^{I+O}_+, with g ≠ 0. For this purpose, we choose the vector g_i associated with the Farrell output distance for each DMU z_i, that is, g_i = (0, y_i) = (0, ..., 0, z_i(I+1), ..., z_i(I+O)). Furthermore, while Program (15) is not directly implementable by usual solvers, it can be rewritten as a standard linear program. We do this in Sect. 4.5, since the linearization process and associated proofs are analogous for the measures introduced in that section; the resulting linear program is denoted (16).

Description of the algorithm: tuning the hyperparameters
First, for each DMU z_i ∈ Z, we solve Program (14) to obtain N basic hyperplanes, with their appropriate slopes p_k and intercepts q_k. We also define δ_min := min_i {−β_i}. We then choose a strategy and use it to obtain a larger number H of hyperplanes and their corresponding values p_k, q_k, which we use to define the transformation function φ. At this stage, Program (12) is ready to be solved for each choice of ν and δ.
The hyperparameters that remain to be tuned in the algorithm are ν and δ. Unless otherwise specified, we choose 5 values equally spaced in the interval [δ_min, 0] for δ, and we choose ν in the range [1/N, 0.1], except when N ≤ 10, where we choose ν ∈ [0.1, 0.3]. In order to choose amongst these candidate values, we randomly split the dataset Z into a training set Z_train containing 70% of the DMUs and a test set Z_test containing the remaining 30% of the data.
For each candidate pair of values of ν and δ, we train the model by solving (12) on the training set, obtaining a candidate estimator T(ν, δ).
We then evaluate the performance of each estimator T(ν, δ) on the test set Z_test by computing, for each DMU z_i ∈ Z_test, the predicted projection z_i + β_i g_i according to Program (16). This yields estimated output levels for z_i, which we use for each z_i ∈ Z_test in order to calculate the Mean Squared Error (MSE) associated with T(ν, δ). We then choose as the best hyperparameters those (ν*, δ*) that minimize this MSE on Z_test.
Finally, once the best hyperparameters (ε*, ν*) are determined, the model is retrained on the whole dataset Z by solving Program (12) with the chosen hyperparameters, yielding the final estimate of the technology, T(ε*, ν*). We refer to the method as 1SVM_d when the duplicate hyperplanes strategy is used and as 1SVM_m when the mean hyperplanes strategy is employed.
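The tuning procedure above can be sketched in a few lines. The code below is an illustrative outline, not the authors' implementation: the `fit` and `score` callables stand in for solving Program (12) and for the projection-based MSE of Program (16), respectively, and all names are assumptions.

```python
# Illustrative sketch of the hyperparameter tuning loop (not the authors' code).
# fit(train, eps, nu) stands in for solving Program (12), and
# score(model, test) for the projection-based MSE of Program (16).
import itertools
import random

def tune_hyperparameters(dmus, eps_min, fit, score, n_grid=5, seed=0):
    """Grid-search (eps, nu) on a 70/30 train-test split, minimizing test MSE."""
    rng = random.Random(seed)
    data = list(dmus)
    rng.shuffle(data)
    cut = int(0.7 * len(data))
    train, test = data[:cut], data[cut:]
    N = len(train)
    # Candidate values: 5 equally spaced points in [eps_min, 0] for eps,
    # and 5 points in [1/N, 0.1] for nu, as described above.
    eps_grid = [eps_min * (1 - k / (n_grid - 1)) for k in range(n_grid)]
    nu_grid = [1 / N + k * (0.1 - 1 / N) / (n_grid - 1) for k in range(n_grid)]
    best = None
    for eps, nu in itertools.product(eps_grid, nu_grid):
        mse = score(fit(train, eps, nu), test)
        if best is None or mse < best[0]:
            best = (mse, eps, nu)
    _, eps_star, nu_star = best
    # Retrain on the whole dataset with the selected hyperparameters.
    return fit(data, eps_star, nu_star), eps_star, nu_star
```

Any model-fitting routine with the same signature can be plugged in for `fit`; the split, the grids and the final retraining step mirror the description in the text.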

Measures of efficiency
In this section, we introduce the optimization problems that we use to obtain the various efficiency scores with respect to the estimator introduced in this paper, we linearize them, and we prove the equivalence of the solutions of both formulations. We begin with the Russell measures, since the proofs for this case can be adapted in a straightforward manner to the other measures of efficiency.
We remark that, depending on the value of ν*, this method may leave some DMUs as outliers, in which case some of the efficiency measures that we present lead to infeasible problems. This can be avoided by setting the hyperparameter so that 0 < ν* < 1/N, in which case no outliers are permitted and the issue does not arise.
Similarly, we recall our assumption that the DMUs do not have any zero values in their inputs or outputs. Otherwise, minor adjustments to the problems must be made to avoid issues of unboundedness or null denominators.
For every measure, we first obtain the estimated technology T(ε*, ν*), which, we recall, is defined by taking an optimal solution of (12) after the tuning of the hyperparameters ε*, ν*. Therefore, these values are fixed and are not decision variables of the programs presented below.

Russell output
The Russell input and output measures of efficiency are non-radial measures, and we consider them first. They can be seen as generalizations of the corresponding Farrell input or output measures which, instead of scaling every variable by the same scalar, allow for different slacks along the individual inputs (outputs). Thus, the arguments for these measures also apply to the Farrell measures. We begin by introducing the output-oriented Russell measure.
The Russell measure of output efficiency is defined as: Given z ∈ T(ε*, ν*), the output-oriented Russell measure takes values R_out(z) ≥ 1, with z being efficient whenever R_out(z) = 1. Due to the definition of the transformation function φ, which has nonlinear components involving a maximum function, Program (17) is not linear. However, we can linearize it by adding a new variable ξ ∈ ℝ^H which attains the value of the maximum at each component, ξ_j = max{ε*, ⟨β_j, z⟩ + q_j}. In order to force ξ to attain the maximum at each component, we penalize this variable in the objective function, introducing a constant M large enough so that changes in the measure affect the objective function more than the corresponding changes in ξ. The linearized model for the output-oriented Russell measure is: Program (18) is a linear program, where M is a large number. We now prove that a solution of the linearized program yields a solution of the original definition of the measure. We first prove the following auxiliary result: in an optimal solution, the linearizing variable introduced as a proxy for the terms involving the maximum of two numbers indeed attains this maximum value. In other words, at least one of the second and third restrictions becomes an equality at an optimal solution.
Proof Suppose (θ*, ξ*) is an optimal solution of (18). By the second and third restrictions of (18), for each j ∈ {I + O + 1, ..., I + O + H}, we have ξ*_j ≥ max{ε*, ⟨β_j, z⟩ + q_j}. Suppose this inequality were strict for some j, and consider the point obtained by replacing ξ*_j with this maximum value. This is still a feasible point of (18), as the LHS of the first restriction becomes greater, and the last two restrictions are still satisfied, so it is a feasible solution with a larger objective value due to the penalization term, contradicting the assumption that (θ*, ξ*) was optimal. ◻ We can now prove the following link between the solutions of both programs.
and M is large enough (to offset the effect of the change in ξ′_j), we have: hence (θ*, ξ*) is not an optimal solution of (18), contradicting our assumption. Thus, whenever (θ*, ξ*) is an optimal solution of (18), θ* is an optimal solution of (17). ◻ Thus, we can solve the linear program (18) to obtain the output-oriented Russell scores when the underlying technology is estimated by our OneClass SVM-based proposal. In what follows, we linearize the other programs similarly, and the same proof, suitably adapted, holds for the other measures.
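The tightness argument can be illustrated numerically. The toy problem below is not the authors' Program (18): it linearizes a single constraint ξ = max(1, x − 2) via the two inequalities ξ ≥ 1 and ξ ≥ x − 2, adds a small penalty M·ξ to the objective, and checks that, at the optimum, the auxiliary variable carries no slack.

```python
# Toy illustration of the big-M device (NOT the authors' Program (18)).
# We want   max x   s.t.  xi = max(1, x - 2),  0 <= x <= 10,
# and replace the equality by xi >= 1, xi >= x - 2, subtracting M*xi from the
# objective so that any slack in xi is strictly costly at an optimum.

M = 0.01  # small enough here not to distort the main objective

def penalized(x, xi):
    return x - M * xi

def solve():
    best = None
    # Enumerate candidate x values; for each one, the penalty drives xi down
    # to its smallest feasible value, i.e. exactly the maximum being linearized.
    for k in range(0, 1001):
        x = 10.0 * k / 1000
        xi = max(1.0, x - 2.0)
        val = penalized(x, xi)
        if best is None or val > best[0]:
            best = (val, x, xi)
    return best

val, x_star, xi_star = solve()
# The optimum sits at x = 10 with xi tight at max(1, 10 - 2) = 8.
```

The same mechanism underlies the auxiliary lemma above: if some ξ_j exceeded the maximum, lowering it would improve the penalized objective while preserving feasibility.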

Russell input
The Russell measure of input efficiency, analogous to the output one, is defined as: In this case, 0 ≤ R_in(z) ≤ 1, and a DMU is efficient when R_in(z) = 1, which happens when no input can be decreased without making the DMU infeasible.
As with the output-oriented Russell distance, we can linearize this problem. The arguments above hold mutatis mutandis: the only changes are the switch from maximizing to minimizing in the objective function (hence the penalization term for ξ has the opposite sign), the vector appearing in the objective function, and the range of possible values of the measure, while everything else stays the same: Thus, we obtain the following result:

Proposition 4.3 If (θ*, ξ*) is an optimal solution of (20), then θ* is an optimal solution of Program (19).
Proof The proof of Proposition 4.2 holds with the appropriate changes. ◻

Farrell output
The output-oriented radial measure, or output-oriented Farrell measure, for our technology estimator is: Regarding the chain of equalities, the first formulation is the definition of the Farrell output distance, and the second expression expands the values of the vector involved, implicitly showing that this is a particular case of the Russell output measure. Finally, the third and fourth formulations show how, via the transformation β = φ − 1, this measure can be seen as a particular case of the DDF with directional vector g(z) = (0, y). The Farrell output distance, given z ∈ T(ε*, ν*), satisfies φ ≥ 1, with φ = 1 whenever z is efficient. However, in the presence of outliers, the value φ = 1 does not yield a feasible solution, and so, to prevent leaving the quadrant, we add a restriction of the form β ≥ 0 to the corresponding linearized program. As before, Program (21) is not linear, due to the maximum function in the definition of φ(z), but we can linearize it using the same big-M technique as before. The resulting linear program, with variables β, ξ, is: As before, we can obtain an optimal solution of (21) from one of (22): Proposition 4.4 If (β*, ξ*) is an optimal solution of (22), then φ* = β* + 1 is an optimal solution of Program (21).
Proof The Farrell output measure is a special case of the output-oriented Russell measure, so this statement is a particular case of Proposition 4.2. ◻

Farrell input
The input-oriented Farrell measure of efficiency is analogous to the Farrell output measure, but with the scaling factor applied to the inputs instead of the outputs. In our setting, it is defined by: In this case, F_in(z) ∈ (0, 1] whenever z ∈ T(ε*, ν*), with F_in(z) = 1 whenever z is efficient. The last equality in (23) shows that the Farrell input distance can be seen as a special case of the Directional Distance Function with g = g(z) = (x, 0) = (−z(1), ..., −z(I), 0, ..., 0). Note that, in this latter case, the problem is a minimization, due to the relationship β = 1 − θ, which inverts the goal of the problem.
In the case of this distance, we can linearize the nonlinear problem above as in the output case, obtaining the following linear program: As before, we have the following relationship between the solutions of these programs: Proposition 4.5 If (θ*, ξ*) is an optimal solution of (24), then θ* is an optimal solution of Program (23).
Proof This is a special case of the input-oriented Russell function, so it is a particular case of Proposition 4.3. ◻

Directional distance function
We now consider the directional distance function (DDF), as described in Sects. 2.3 and 3.2. This is a measure of inefficiency, which quantifies how far a DMU z can be moved along a direction g ∈ ℝ^{I+O}_+ before leaving the technology. In order to obtain the inefficiency of a DMU z, we solve Problem (15) in its linearized form (16).
The DDF takes values such that efficient DMUs have inefficiency score β = 0, while DMUs within the technology have β ≥ 0. Furthermore, the DDF also assigns inefficiency scores to those DMUs outside the technology which can be projected along g into the technology, in which case β < 0. Then, the following holds, with a proof analogous to that of Proposition 4.2: Proposition 4.6 If (β*, ξ*) is an optimal solution of (16), then β* is an optimal solution of Program (15).
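For intuition, the sketch below computes the DDF score for a technology given in plain halfspace form T = {z : a_j · z ≤ b_j}. This is an illustrative simplification: the estimator T(ε*, ν*) of this paper carries the additional PWL/max structure handled by Program (16), which a plain halfspace description does not capture.

```python
# DDF for a polyhedral technology T = {z : a . z <= b for each halfspace}.
# Illustrative only: the paper's T(eps*, nu*) has extra PWL/max structure.

def ddf(z, g, halfspaces):
    """Largest beta such that z + beta * g remains in T (inf if unbounded)."""
    beta = float("inf")
    for a, b in halfspaces:
        num = b - sum(ai * zi for ai, zi in zip(a, z))  # slack of the halfspace at z
        den = sum(ai * gi for ai, gi in zip(a, g))      # rate at which g uses that slack
        if den > 0:
            beta = min(beta, num / den)
    return beta

# Netput example: one input x, one output y, z = (-x, y), frontier y = x,
# i.e. T = {z : z(1) + z(2) <= 0}.
T = [((1.0, 1.0), 0.0)]
# DMU (x, y) = (2, 1) with output direction g = (0, 1) can gain 1 unit of output;
# DMU (x, y) = (2, 3) lies outside T and receives a negative score, as noted above.
```

Negative scores for projectable DMUs outside the technology arise naturally here, matching the sign convention described in the text.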

Weighted additive model
Another well-known family of measures of efficiency consists of those based on the Weighted Additive Model (WAM), introduced in Lovell and Pastor (1995). Through various choices of weights, a variety of measures is obtained, such as the Measure of Inefficiency Proportions (MIP) and the Range Adjusted Measure (RAM); see Cooper et al. (1999). Regarding notation, since the slacks vector depends on the corresponding DMU, we denote it by s = (s(1), ..., s(I + O)), so that, when necessary, we can refer to the slacks of DMU i by s_i. This measure allows for slacks in both the inputs and the outputs. The Weighted Additive Model in the netput notation, with respect to the estimator introduced in this article, is: We remark that, as with the property of free disposability, the change of signs of the inputs in the netput notation allows a homogeneous treatment of the slacks, without the need to split the sum into its input and output terms. In this model, we require the weights to be strictly positive. We linearize Program (25) with the same technique as before. Hence, once the weights are chosen, the linearized model for the WAM becomes: We remark that the last constraint in Model (26) is redundant whenever s ∈ ℝ^I_− × ℝ^O_+, but we include it for parallelism with the other programs. The same proofs as before also establish the following relationship between the solutions of both programs. Proposition 4.7 If (s*, ξ*) is an optimal solution of (26), then s* is an optimal solution of Program (25).

Slacks-based measure
The last measure that we adapt to the context of our estimator is the Slacks-Based Measure (SBM); see Pastor et al. (1999), Tone (2001). Called the Enhanced Russell Graph Measure in Pastor et al. (1999), its original formulation has fractional terms in the objective, and we adapt the original linearization procedure to our context. We take as a starting point the formulation in terms of the additive model, i.e. model (7) in Tone (2001), where the ratio terms involve the input and output slacks. The fractional program to be solved with respect to the estimator, in the netput notation, can be expressed as: In order to linearize this program, we first use the following change of variables, following Charnes and Cooper (1962) (this change of variables is also presented as an exercise in Cooper et al. (2006)): We remark that t > 0. With these variables, we notice that 1/t is the denominator of the objective function of (27), so that this objective function can be rewritten in a form that is linear in t and the transformed variables. This is the new, linear, objective function.
In order to ensure that t takes the desired value, we add an extra restriction normalizing the denominator to equal 1. Furthermore, we expand the definition of the technology, taking into account the change of variables; the effect on the first restriction is that the projected point, rescaled by 1/t, must belong to T(ε*, ν*), yielding the following intermediate program: At this stage, we have linearized the objective function, and it remains to linearize the maximum function in the PWL mapping, which we do using the same big-M technique as before. We begin by expanding the definition of φ: Then, we introduce the usual auxiliary variable ξ ∈ ℝ^H and, for each hyperplane component of φ, we add, for each j ∈ {I + O + 1, ..., I + O + H}, the restrictions: We also add a penalization term with big M in the objective function to ensure that at least one of these bounds is tight, that is, that ξ_j takes the maximum of the two compared expressions. Restriction (33) is not linear, since it contains terms involving both ξ and t nonlinearly, in the form ξ/t. We linearize it by multiplying both sides by t, to obtain: Thus, we also change variables from ξ to τ = tξ ∈ ℝ^H. Notice that we also multiply restriction (31), the one describing φ, by t, in order to remove the 1/t factor (which would otherwise create a nonlinear term of the form τ(j)/t), as well as the restriction ξ_j ≥ ε*, in order to express it with respect to the same variables. After all these changes, the final linear problem, in t, w, τ, is: After solving Program (35), we obtain the efficiency measure as the optimal value of (30). Finally, we have the following result relating the solutions of both programs: Proposition 4.8 If (t*, w*, τ*) is an optimal solution of (35), then w*/t* is an optimal solution of (27).
Proof The proof has two stages: one where the equivalence under the changes of variables is established, and another where the linearization is handled. ◻
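For reference, the Charnes and Cooper (1962) device used above can be stated generically; this is the textbook form of the transformation, not the paper's exact Program (27):

```latex
% Linear-fractional program and its Charnes-Cooper linearization.
\min_{z}\ \frac{c^{\top} z + c_0}{d^{\top} z + d_0}
\quad \text{s.t.}\quad A z \le b, \qquad d^{\top} z + d_0 > 0 .
% Substituting t = 1/(d^{\top} z + d_0) and w = t z yields the linear program
\min_{w,\,t}\ c^{\top} w + c_0\, t
\quad \text{s.t.}\quad A w \le b\, t, \qquad d^{\top} w + d_0\, t = 1, \qquad t > 0 ,
% and an optimal (w^{*}, t^{*}) recovers z^{*} = w^{*}/t^{*}.
```

The normalization d⊤w + d₀t = 1 plays the role of the denominator-equal-to-one restriction added above.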

Computational experiments: a finite-sample study
In order to evaluate the new method introduced in this paper, this section reports the results of a computational experiment, based on a finite-sample analysis, comparing the DEA method with the duplicate and mean hyperplanes strategies, denoted by 1SVM_d and 1SVM_m, respectively. We summarize the results of the simulated technologies in Table 1. The comparison is based on the 2-input, 2-output technology proposed by Perelman and Santín (2009), designed to satisfy microeconomic behavioral regularity conditions. In this simulation context, the input values are generated from a uniform distribution Uni[5, 50], while the values for the output variables are generated according to the formula: Following Perelman and Santín (2009), we generate points on the frontier of the technology via Equation (36), and then introduce an inefficiency term u ∼ |N(0, √0.3)| with a half-normal distribution. We also incorporate random noise in some scenarios, as indicated in the "Noise" column of Table 1. Furthermore, we allow for a proportion of 0%, 10% or 25% of the simulated DMUs to lie on the true frontier. We ran 100 trials for each combination of sample size, presence or absence of noise, and percentage of units on the frontier, and we report in Table 1 the average values of the Mean Squared Error (MSE) and bias, as well as the percentage of improvement over the DEA estimator. We tested sample sizes of 30, 50, 70, 100 and 200 DMUs.
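The data-generating process can be sketched as follows. The frontier function below is a stand-in Cobb-Douglas form with a single output, NOT Perelman and Santín's Equation (36), which is not reproduced here; the input ranges, the half-normal inefficiency term and the optional noise follow the description above.

```python
# Sketch of the simulation design (stand-in frontier, NOT Equation (36)).
import math
import random

def simulate_dmus(n, noise=False, frontier_share=0.0, seed=0):
    """Generate n DMUs: two uniform inputs, one output below a stand-in frontier."""
    rng = random.Random(seed)
    dmus = []
    n_frontier = int(frontier_share * n)  # share of units placed on the frontier
    for i in range(n):
        x1, x2 = rng.uniform(5, 50), rng.uniform(5, 50)
        y_star = (x1 ** 0.4) * (x2 ** 0.4)  # stand-in Cobb-Douglas frontier output
        # Half-normal inefficiency |N(0, sqrt(0.3))|; frontier units get u = 0.
        u = 0.0 if i < n_frontier else abs(rng.gauss(0, math.sqrt(0.3)))
        v = rng.gauss(0, 0.1) if noise else 0.0  # optional random noise
        y = y_star * math.exp(-u + v)            # inefficiency pushes output below the frontier
        dmus.append((x1, x2, y))
    return dmus
```

Each (sample size, noise, frontier share) scenario of Table 1 corresponds to one call of this kind, repeated over 100 trials.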
Regarding the results in Table 1, the improvements in MSE range from 20% to 72% for the duplicate hyperplanes strategy and from 15% to 53% for the mean hyperplanes strategy, depending on the sample size and the number of DMUs on the true frontier. The improvements in bias reach up to 47% for the first strategy and up to 41% for the second, again depending on the sample size and the number of DMUs on the true frontier. The effect of the presence or absence of noise in the data can also be observed in Table 1. We conclude that both strategies seem to obtain better results than DEA in the finite-sample study carried out, with the duplicate hyperplanes strategy outperforming the mean hyperplanes strategy in MSE, bias and runtime, so we select it for the empirical application.

Empirical illustration: USA schools from PISA report 2015
In this section, we present the results of applying the estimator introduced in this paper to an empirical database from the literature, using in particular the Slacks-Based Measure (SBM) for illustration. This database consists of results from USA schools participating in PISA (the Programme for International Student Assessment) in 2015, used in Aparicio et al. (2019); further details can be found in OECD (2017). The dataset contains 162 DMUs (schools), and we report the descriptive statistics in Table 2. We used three inputs and two outputs for the inefficiency estimation. The three inputs were the Economic, Social and Cultural Status (ESCS), the school's educational resources (SCMATEDU) and the number of teachers per 100 students (TEACHERS). The two outputs were the scores in mathematics and reading (PVMATH and PVREAD). Regarding the hyperparameters used in 1SVM_d, for this application we fix ν* = 0.001. The reason is that the SBM, as well as some of the other measures introduced, is not defined for units outside the estimated technology, so by fixing a small value for ν* we bias the estimator towards the exclusion of outliers; in fact, there are no outliers with this setting. We treat ε as a hyperparameter and tune it via the usual train-test split (70%-30%).
With this dataset, we estimate the technology using the standard DEA methodology and our proposed estimator with the duplicate hyperplanes strategy (1SVM_d). We then calculate the efficiencies estimated by DEA and 1SVM_d for each school with respect to the SBM, and we compare them by means of the Li test (Simar and Zelenyuk 2006). The train-test process selects the value ε* = −0.1813 as the best one. The average efficiency reported by DEA is 0.739, whereas 1SVM_d attains 0.697, indicating lower efficiency estimates. We report descriptive statistics of these efficiency scores in Table 3. The DEA estimate of the technology classifies 7 DMUs as efficient under the SBM, whereas the proposed approach classifies only 5 units as efficient, all of which were already deemed efficient by DEA. The remaining 2 are deemed not completely efficient; their efficiency scores are reported in Table 4. We also report the 5 DMUs with the largest change in efficiency score between the 1SVM_d and the DEA scores with respect to the SBM, as well as the 5 DMUs deemed least efficient by DEA. Furthermore, Table 5 reports the SBM efficiency obtained by each DMU in the dataset with respect to both 1SVM_d and DEA, where values in bold indicate that the DMU is efficient. In order to compare the vectors of efficiencies associated with DEA and the new approach, we use the Li test, following Simar and Zelenyuk (2006), which tests whether there is a statistically significant difference between two random samples Z_A, Z_B with densities f_A, f_B. The null hypothesis is H_0: f_A = f_B, and the test provides the corresponding p-value for its possible rejection.
We apply the Li test to compare the SBM efficiency scores obtained by DEA and 1SVM_d in order to check whether the differences are statistically significant, obtaining a p-value of 0.0008; the estimated scores thus show statistically significant differences.
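The Li test compares kernel-smoothed densities and is available in specialized packages. As a simple, hedged stand-in that conveys the idea of testing H0: f_A = f_B, the sketch below runs a permutation test on the difference of mean scores; it is not the Li statistic.

```python
# Permutation test on the difference of mean efficiency scores.
# Illustrative stand-in only; the paper uses the Li test, not this statistic.
import random

def permutation_pvalue(scores_a, scores_b, n_perm=2000, seed=0):
    """Share of label permutations with a mean gap at least as large as observed."""
    rng = random.Random(seed)
    observed = abs(sum(scores_a) / len(scores_a) - sum(scores_b) / len(scores_b))
    pooled = list(scores_a) + list(scores_b)
    n_a = len(scores_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign scores to the two groups at random
        diff = abs(sum(pooled[:n_a]) / n_a
                   - sum(pooled[n_a:]) / (len(pooled) - n_a))
        if diff >= observed:
            hits += 1
    return hits / n_perm
```

A small p-value indicates that the two score vectors are unlikely to come from a common distribution, mirroring the conclusion drawn above from the Li test.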
From the density curves of the SBM efficiency scores (see Fig. 2), we observe that 1SVM_d classifies fewer DMUs as technically efficient and yields consistently slightly higher inefficiencies than the corresponding DEA scores. These results are consistent with our goal of estimating slightly larger technologies than those estimated by DEA, which is known to suffer from overfitting (Esteve et al. 2020).

Conclusions and future work
In this paper, we have explored methods to estimate a production technology by adapting the 1SVM algorithm with a piecewise linear feature mapping. We have thus built a bridge between the fields of unsupervised machine learning and efficiency estimation via DEA in the context of multi-input multi-output production processes. We have introduced the corresponding methodology, together with some variations on the hyperplane parameters involved in the feature mapping used to estimate the technology. We have evaluated the performance of the proposed estimators by comparing the results obtained in a finite-sample simulated environment with multiple inputs and outputs. From our results, we conclude that the approach which duplicates the DEA hyperplanes (1SVM_d) is superior to the one which averages the existing hyperplanes (1SVM_m), in addition to being less computationally expensive. It is also worth mentioning that both proposed approaches seem to obtain better results than the standard DEA approach regarding MSE and bias under our finite-sample analysis. However, this superiority cannot be claimed in general. From a statistical point of view, the frontier estimators could also be compared with regard to properties such as consistency, which indicates whether the estimator converges to the true target value as the sample size increases. Additionally, when the objective is to report the average efficiency score, the satisfaction of a central limit theorem is also a relevant property. For the DEA estimator, these properties have been studied in detail, even establishing the rate of convergence (see, for example, Kneip et al. 1998, 2008, 2011, 2015). One advantage of this knowledge is that, in the case of DEA, it is relatively easy to correct the potential bias of the estimator and to determine, for example, suitable confidence intervals through bootstrapping.
Unfortunately, the same cannot yet be said of the new approach. Consequently, a complete comparison of our technique and the standard DEA model is neither possible nor fair; for now, our approach should be seen as a complementary technique to DEA when the data sample is not large.
Moreover, we have introduced and adapted to our context multiple measures of efficiency found in the non-parametric literature on performance measurement. For illustrative purposes, we have also shown the results obtained by applying, in particular, the Slacks-Based Measure to an empirical database involving schools in the USA from the PISA study. We have compared the scores determined by our proposed approach with those obtained by classical DEA; in our empirical application, the differences between the two sets of scores were statistically significant.

Finally, we mention some possible avenues for further research. The choices we have made for the hyperplanes allow for different approaches and exploration, as well as for different transformation functions altogether; alternative choices of hyperplanes, or other types of transformation functions, may be worth pursuing to further improve the methodology. Furthermore, these parameters could themselves be treated as hyperparameters to tune, although this would be computationally expensive. This contribution is part of the larger effort to adapt machine learning algorithms to the estimation of production technologies, and another possible line of research is the comparison of these methods and their performance. Another area of interest is enriching this approach with feature selection methods, since the curse of dimensionality is an ever-arising issue, and such methods may extend the approach to problems where direct computation is not possible or yields insufficient results. In addition, using the new approach to measure productivity change over time and to decompose this measure into its usual drivers, i.e., efficiency change, scale efficiency change and technical change, is a topic that deserves further exploration. And, as pointed out above, the study of the asymptotic properties of the new frontier estimator can be understood as one of the most important possible extensions of the method.
Funding The authors thank the grant PID2019-105952GB-I00 funded by Ministerio de Ciencia e Innovación/Agencia Estatal de Investigación, Spain/10.13039/501100011033. R. Moragues wishes to thank the Cátedra Santander en Eficiencia y Productividad, Miguel Hernandez University (UMH), for the funding provided. Furthermore, M. Esteve was also supported by the Spanish Ministry of Science, Innovation and Universities under Grant FPU17/05365. Additionally, J. Aparicio thanks the grant PROMETEO/2021/063 funded by the Valencian Community (Spain), which partially supported this work. Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.