1 Introduction

Micro, Small and Medium Enterprises (MSMEs) play a key role in the global economy, accounting for about 90% of firms and creating more than 50% of employment worldwide (Ayyagari et al., 2003; IFC, 2012). In developing countries such as Vietnam, most MSMEs operate in the manufacturing sector (CIEM, 2016; GSO, 2016; Rand & Tarp, 2020), contributing to about 36% of the national value-added (OECD, 2021). It is therefore important to understand how efficiently MSMEs are operating and, especially, how to improve their performance. In the manufacturing sector, MSMEs are at the crossroads of technological advancement and operational excellence, where optimisation, Industry 4.0, and big data analysis are the buzzwords making the rounds (Schoenherr & Speier-Pero, 2015). A key research question arising from this situation is how to apply big data analytical tools such as machine learning (ML) to examine the performance of MSMEs, not only in terms of providing quicker results (regarding big data) but also in terms of recommending better and reliable solutions for improving their performance. Despite the growing body of literature on the application of analytics to solving operational problems (Kamble et al., 2020; Manimuthu et al., 2021; Wamba et al., 2017), research on MSMEs, especially in the manufacturing sector in developing countries, is still limited.

Data envelopment analysis (DEA) is a popular non-parametric tool for measuring efficiency and performance in various fields such as banking, healthcare, and aviation (Adler et al., 2002; Boubaker et al., 2018; Vidal-García et al., 2018; Yang, 2006). Zhu (2020) proposed that DEA should be viewed as a data-oriented analytical method for performance evaluations and benchmarking. The basic idea of DEA is that the individual decision-making unit (DMU) being examined can maximise its operational efficiency by using its own optimal weights regarding its inputs and outputs. The use of these so-called “dynamic weights” (Hammami et al., 2020) allows DEA to be price-free, and thus neither price information nor the functional form is needed. Consequently, DEA is more flexible with small samples, especially when the DMUs involved operate in a complex environment where it is difficult to define a production function (Ngo & Tsui, 2021). This has resulted in a much smaller number of DEA applications in the manufacturing sector (Tran & Ngo, 2014; Yang, 2006), where the data are large and, especially in the big data era, normally involve thousands of DMUs or observations. Such studies often use the parametric approach of stochastic frontier analysis instead (Bačić et al., 2018; Hailu & Tanaka, 2015; Ngo et al., 2019a; Verschelde et al., 2016). One weakness of DEA, compared with stochastic frontier analysis, is that the different optimal weights allow the DMUs to be evaluated from different aspects, thus making it difficult to rank these DMUs on the same basis. The remedy to this situation is to estimate a common set of weights (CSW) that can be applied to all DMUs to provide a common basis for comparisons involving ranking; however, this approach incurs a high computational burden and sometimes faces the problem of non-convexity with non-linear objectives (Davtalab-Olyaie, 2019; Hammami et al., 2020; Wang et al., 2017, 2021).

DEA studies do not stop at the first stage of measuring efficiency; in a second stage, they also examine the role of environmental factors, including corporate- and country-level governance, in explaining such efficiency (Boubaker et al., 2019, 2020; Le et al., 2021). In other words, one may use the explanatory variables to explain or predict the efficiency scores of the DMUs involved. However, this second stage often applies the conventional econometric analytic models of Tobit or (bootstrap) truncated regression. According to Daraio et al. (2010, p. 1), “papers that estimate technical efficiency in the first stage and then regress these estimates on some environmental variables in a second-stage Tobit model continue to appear”. In contrast, the more advanced estimators from ML such as Random Forest (RF) or neural network (NN) regressions have seldom been used. Since these ML estimators have better predictive power, can overcome the problem of multicollinearity, and are tolerant to outliers and noise, it is arguable that the application of such ML techniques can improve the explanatory or predictive results of two-stage DEA (Chen et al., 2021; Nandy & Singh, 2021; Thaker et al., 2021; Zhu et al., 2021).

Given the issues discussed above, the two specific research questions of this study were: “How can we efficiently measure the performance of Vietnamese manufacturing MSMEs using DEA but on the same basis?” and “How can we efficiently predict the performance of these MSMEs, given a set of corporate- and country-level variables?” For the former question, we need to have a novel CSW DEA model that can deal with big data, as this situation is problematic for both the dynamic weights and CSW DEA approaches. For the latter, we will need to compare several predictive methods, including both econometric and ML models. We expect to see the ML models perform better than the econometric ones.

This study, therefore, aimed to contribute to the literature in three aspects. First, we propose a novel method of estimating the CSW for measuring efficiency and comparing MSMEs by ranking via DEA. Since it is based on regression analysis (RA), this method helps overcome the time-consuming non-convexity issue of the previous CSW DEA methodologies. In this sense, it can be easily extended to other sectors where big data exist, thus widening the use of DEA in such studies. Second, by using data from more than 5400 Vietnamese manufacturing MSMEs operating during the 2010–2016 period, yielding a total of 37,557 observations, this study is among the first (DEA) studies focusing on the performance of manufacturing MSMEs in developing countries to use big data. It is noted that the performance of manufacturing MSMEs has been examined in a few countries such as India (Kamble et al., 2020), Brazil (Borchardt et al., 2021), and Turkey (Sariyer et al., 2021), but a study combining big data and predictive analysis has not been conducted in the Vietnamese context. More importantly, the novel use of CSW means that it can provide widely acceptable recommendations for the MSMEs to help them improve their performance. Third, we used several econometric and ML techniques such as Tobit, the least absolute shrinkage and selection operator (LASSO), and RF regressions to compare their predictions regarding the performance of the examined MSMEs. Given the advantages of these ML techniques, our results are therefore more efficient and more accurate than the econometric ones. Consequently, our combined CSW DEA–ML approach can shed new light on the two-stage DEA literature, especially in terms of predicting performance and using big data and predictive analysis.

Empirically, our CSW–RA–DEA approach in the first stage showed that the Vietnamese MSMEs performed quite well during the 2010–2016 period, with the average efficiency scores consistently ranging from 0.803 to 0.824. Compared with the conventional DEA estimates, which ranged from 0.261 to 0.388, our results are more consistent with previous studies on manufacturing firms in Vietnam and other developing countries (Hailu & Tanaka, 2015; Le et al., 2018; Ngo et al., 2019a). Furthermore, the second-stage results on the determinants of such efficiency scores are also in line with the literature: the performance of Vietnamese MSMEs was negatively influenced by the firm’s age, the ratio of female employees, and industrial zone status, but it was positively influenced by the firm’s foreign ownership and participation, export activities, municipality status, the provincial business environment, and asset size. For big data and predictive analytical applications to predict the performance of these MSMEs, a hybrid approach of two popular econometric models from the DEA literature (namely Tobit and truncated regressions) and four ML algorithms (namely LASSO, NN, support vector machine regression (SVR), and RF regression) was used in this study. Our findings suggest that the RF regression had the best in-sample predictive power (but this may have been caused by overfitting), the LASSO regression exhibited the best out-of-sample predictions, and the popular Tobit/truncated regressions were the worst performers for both in-sample and out-of-sample predictions. We argue that such econometric techniques are not suitable for predictive purposes, especially for big data.

We organised the rest of this article as follows. In the next section, we provide a brief discussion of DEA efficiency using the CSW, and the links between DEA and RA, as well as the increasing but limited uses of ML in DEA. Section 3 introduces the methodologies of conventional DEA and, more importantly, our novel CSW using RA in DEA (CSW-RA-DEA). Brief explanations of the ML techniques, including LASSO, SVR, and RF regressions, are also presented in this section. Section 4 then focuses on examining and predicting the performance of Vietnamese manufacturing MSMEs. Finally, Sect. 5 concludes the paper and suggests some directions for future research.

2 Literature review

2.1 DEA and the need for a CSW

It is acknowledged that DEA, which was developed by Charnes et al. (1978), is one of the most common methods used to evaluate efficiency in many fields (Contreras, 2020; Ngo & Tsui, 2021; Nguyen et al., 2019). In DEA, optimal weights are derived for the set of inputs and outputs, depending on the assumptions, which may be output-oriented, input-oriented, or even both (Hammami et al., 2020). This flexibility in the choice of weights may be both an advantage and a disadvantage of the method. When these weights are used, DEA becomes price-free, meaning that the relative efficiency of DMUs in the sample can be measured without the need for any functional form or price information (Contreras, 2020). However, different weights corresponding to different frontier surfaces could make it hard to compare and rank the DMUs, whether they are efficient or not (Jahanshahloo et al., 2008; Kao & Hung, 2005). Hence, ranking the DMUs on the basis of optimal weights that vary across DMUs (the so-called “dynamic set of weights”) may be inappropriate. This calls for different ranking approaches.

The literature includes a number of ranking methods based on DEA, which can be divided into six groups (Adler et al., 2002) or 11 groups (Jahanshahloo et al., 2008). Most of them are based on the dynamic set of weights; therefore, comparing the DMUs among different frontier surfaces becomes an issue (Hammami et al., 2020). Kao and Hung (2005) emphasised that it is crucial to construct a CSW in DEA because a common frontier hyperplane will rank the DMUs according to the same aspect or criterion. In other words, the CSW will allow us to compare DMUs or select the best DMU(s) in a fairer context (Contreras, 2020).

All CSW DEA involves two steps: (i) computing the DEA efficiency scores and the dynamic weights, then (ii) using optimisation, often as a programming problem with multiple objectives, to derive the CSW based on the dynamic weights (Davtalab-Olyaie, 2019; Wang & Chin, 2010; Wang et al., 2017, 2021), the efficiency distance (Kao & Hung, 2005; Wang et al., 2011), or the frontier distance (Hammami et al., 2020). This optimisation is time-consuming and sometimes faces the problem of convexity in the case of non-linear objectives. In line with the suggestion of Contreras (2020) that the CSW can be potentially determined by incorporating RA into DEA, this study proposed this use of RA to directly determine the CSW.

2.2 The integration of DEA, RA, and ML

The early work of Thanassoulis (1993) provided a comprehensive discussion comparing DEA and RA, and concluded that both methods can be used to complement each other where possible. Other authors also suggested that the corrected ordinary least squares frontier is analogous to DEA under the assumption of constant returns to scale (Greene, 2008). Ouenniche and Carrales (2018) further suggested that RA can provide DEA with feedback for variable selections in which the inputs (and outputs) are negatively (and positively) associated with the DEA efficiency scores (see Footnote 1). In a similar vein, Tone and Tsutsui (2009) suggested that regression can be used to predict and adjust the data for multi-stage DEA. Furthermore, the CSW approaches of Kao and Hung (2005) and Wang et al. (2011), which aimed to minimise the efficiency distance (see Sect. 2.3 of Hammami et al., 2020), can be seen as a special case of RA (see Sect. 4 of Wang et al., 2011). Taken together, this re-emphasises the importance of RA in DEA studies.

DEA studies therefore do not stop at the first stage of measuring efficiency. Environmental variables such as ownership, size, corporate governance, and other macro-economic factors can also be used in a second-stage regression to explain or predict the efficiency (Boubaker et al., 2019, 2020; Le et al., 2021). Since the DEA efficiency scores are bounded between 0 and 1, most of those studies used Tobit or truncated regressions (Daraio et al., 2010; Ho et al., 2021; Ngo et al., 2019b; Pilar et al., 2018). Given the big data era, there is an increasing but limited trend of using ML with DEA as a hybrid approach for analytical purposes (Khezrimotlagh et al., 2019). In particular, Zhu (2020) suggested that in big data situations, one needs to look at the possibility of combining DEA with other ML techniques, such as RF, support vector machines, and artificial neural networks. According to Tsai and Chen (2010) and Belhadi et al. (2021), among others, such hybrid combinations are superior to single models. For instance, Lee and Cai (2020) proposed using the least absolute shrinkage and selection operator (LASSO) for variable selection in DEA with small simulated datasets. Chen et al. (2021) extended this idea by using the elastic net (an extension of LASSO) in the more comprehensive setting of both small and big simulated data. Both studies showed that the hybrid approach performed better than the existing approaches. For second-stage regression, Wu et al. (2006) and Misiunas et al. (2016) demonstrated that an artificial NN can be trained by using data from the efficient DMUs; the results were used to adjust the dataset and selection of the variables to improve the predictive power of their DEA-NN model. Zhu et al. (2021) combined two ML techniques, NN and SVR, with DEA to predict the efficiency scores even when new DMUs were added into the sample. Nandy and Singh (2021) and Thaker et al. (2021) relied on the use of RF regression to examine the impacts of the second-stage explanatory variables on the predicted efficiency scores, especially for an out-of-sample dataset. These studies also agreed on the superiority of the hybrid DEA-ML models. However, since all previous studies were based on the dynamic weights of DEA but not on the CSW, the impacts of these explanatory variables on the observed and predicted efficiency scores were not examined on the same basis. In this sense, this study filled this research gap by combining big data analytical tools (i.e., ML) and operational research approaches (i.e., DEA), and simultaneously accounted for the CSW when evaluating and predicting the performance of MSMEs.

3 Methodologies

3.1 The research framework

This paper combines DEA analytics, econometric analytics, and ML analytics into a hybrid CSW-RA-DEA-ML predictive analytical method (see Fig. 1). Most ML studies forecast future outcomes on the basis of time series data; however, since our data spanned only 7 years (2010–2016), we did not have enough time-series datapoints for forecasting purposes. Instead, we focused on prediction, i.e., the use of pooled cross-sectional data, to answer our second research question about predicting the performance of Vietnamese MSMEs, given a set of corporate- and country-level information. To do so, our data were randomly split into two sub-samples, in which the training (in-sample) data consisted of 30,000 observations (about 80% of the total sample) and the predicting (out-of-sample) data consisted of 7557 observations (approximately 20% of the total sample).
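The random split described above can be illustrated with a minimal sketch. It assumes the observations are identified by row indices 0 to 37,556; the function name `split_train_test` and the seed value are hypothetical choices for illustration and are not taken from the study.

```python
import random


def split_train_test(n_total, n_train, seed=0):
    """Randomly split row indices 0..n_total-1 into training and test index lists."""
    indices = list(range(n_total))
    # A seeded generator keeps the (hypothetical) split reproducible.
    random.Random(seed).shuffle(indices)
    return sorted(indices[:n_train]), sorted(indices[n_train:])


# Sample sizes reported in the text: 30,000 training and 7557 out-of-sample observations.
train_idx, test_idx = split_train_test(37557, 30000)
```

Every observation falls in exactly one sub-sample, so the out-of-sample data are never used when the predictive equations are estimated.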

Fig. 1 The research framework

Specifically, our study followed a three-stage analysis as described below.

First stage: For the training data, we used CSW-RA-DEA (see Sect. 3.2) to estimate the efficiency of the 30,000 DMUs involved, using the firms’ input and output data.

Second stage: For the training data, we used different econometric (see Sect. 3.3) and ML (see Sect. 3.4) techniques to estimate the relationship between the CSW-RA-DEA efficiency (derived from stage 1) and the corporate- and country-level explanatory variables for the 30,000 DMUs involved, resulting in different predictive equations.

Third stage: For the total sample, we used the predictive equations (derived in stage 2) to predict the DEA efficiency scores of all 37,557 DMUs involved, given the corporate- and country-level explanatory variables. Our estimates were then compared with the efficiency scores derived by traditional DEA in terms of the root mean squared error (RMSE). The technique or equation with the lowest RMSE exhibited the best predictive power.
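The RMSE comparison in the third stage can be sketched as follows. This is only an illustration under the standard definition of RMSE; the helper names `rmse` and `best_model` and the model labels passed to them are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences of scores."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


def best_model(actual, predictions_by_model):
    """Return the name of the model with the lowest RMSE, plus all RMSEs."""
    scores = {name: rmse(actual, pred) for name, pred in predictions_by_model.items()}
    return min(scores, key=scores.get), scores
```

Calling `best_model` with the reference scores and a dictionary of predicted scores per method returns the method with the lowest RMSE, which is the selection rule stated above.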

3.2 The DEA analytics

Consider a set of n DMUs, each using k inputs to produce m outputs. The goal of DEA is to estimate the optimal weights for the inputs or outputs for each DMU so that they can bring the DMU as close as possible to the frontier envelope of the DMUs (i.e., to maximise the DMU’s efficiency). The mathematical expression of (constant returns to scale) DEA (see Footnote 2), as introduced by Charnes et al. (1978), is:

$$ EF_{j_{0}} = \max_{u,v} \frac{\sum\nolimits_{r = 1}^{m} u_{r} y_{rj_{0}}}{\sum\nolimits_{i = 1}^{k} v_{i} x_{ij_{0}}} $$
(1)

Subject to

$$ \begin{aligned} & \frac{\sum\nolimits_{r = 1}^{m} u_{r} y_{rj}}{\sum\nolimits_{i = 1}^{k} v_{i} x_{ij}} \le 1,\quad j = 1,2, \ldots ,n \\ & u_{r} ,v_{i} \ge \varepsilon ,\quad \forall i,r \\ \end{aligned} $$

where \(EF_{j_{0}}\) is the efficiency score of DMU \(j_{0}\) (\(j_{0} \in \{1,2,\ldots,n\}\)) to be maximised, given the output weight \(u_{r}\) of output \(y_{r}\) (r = 1,2,…,m) and the input weight \(v_{i}\) of input \(x_{i}\) (i = 1,2,…,k); ε is a non-Archimedean value designed to ensure positive weights. It is noted that Eq. (1) needs to be run n times, once for each of the DMUs in the sample, and that the optimal weights (\(v_{i}\), \(u_{r}\)) can vary among the DMUs; hence the so-called “dynamic set of weights” in DEA.
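Eq. (1) is a fractional programme. In practice it is commonly solved through the Charnes-Cooper linearisation, which fixes the weighted input of the evaluated DMU at one. The form below is a sketch in the notation of Eq. (1); it is the standard multiplier form and is not stated explicitly in the text above.

$$ \begin{aligned} EF_{j_{0}} = \max_{u,v} \quad & \sum\nolimits_{r = 1}^{m} u_{r} y_{rj_{0}} \\ \text{subject to} \quad & \sum\nolimits_{i = 1}^{k} v_{i} x_{ij_{0}} = 1, \\ & \sum\nolimits_{r = 1}^{m} u_{r} y_{rj} - \sum\nolimits_{i = 1}^{k} v_{i} x_{ij} \le 0,\quad j = 1,2, \ldots ,n, \\ & u_{r} ,v_{i} \ge \varepsilon ,\quad \forall i,r \\ \end{aligned} $$

Because one such linear programme is solved per DMU, the number of programmes grows linearly with n, which is part of the computational burden in big data settings.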

The CSW DEA seeks a common set of weights that can be applied to all DMUs in the sample instead of using different weights for each DMU. This makes sense for managers and decision-makers because it helps in benchmarking and ranking the DMUs in the same terms so that any recommendations or policies can be widely accepted and feasibly applied by these DMUs. Unlike previous CSW DEA studies, however, this study proposed the novel approach of CSW–RA–DEA to determine the CSW. The algorithm of our CSW-RA-DEA approach is described below.

Step 1: Compute the DEA efficiency scores for all DMUs in the sample as normal via Eq. (1). Note that all CSW DEA studies have applied this step.

Step 2: Regress those efficiency scores on the inputs xi and outputs yr of the DMUs. According to Ouenniche and Carrales (2018), among others, all inputs need to be negatively associated with the efficiency scores, whereas the relationship between the efficiency scores and the outputs should be positive. This regression has the form:

$$ EF_{j} = \alpha_{0} + \sum\limits_{i = 1}^{k} \beta_{i} x_{ij} + \sum\limits_{r = 1}^{m} \gamma_{r} y_{rj} + \varepsilon_{j} $$
(2)

where \({\beta }_{i}\) is expected to be negatively significant and \({\gamma }_{r}\) is expected to be positively significant.

Step 3: The CSW will be (− \(\beta_{i}\), \(\gamma_{r}\)), with a negative sign on \({\beta }_{i}\) to convert it to a positive value; accordingly, the CSW-RA-DEA efficiency scores can be estimated as (see Footnote 3):

$$ {\text{(CSW-RA-DEA) }}EF_{j} = \frac{\sum\nolimits_{r = 1}^{m} \gamma_{r} y_{rj}}{\sum\nolimits_{i = 1}^{k} \left( - \beta_{i} \right) x_{ij}} $$
(3)
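Steps 2 and 3 can be sketched in a few lines, assuming the Step 1 DEA scores are already available. The sketch uses ordinary least squares through the normal equations, which is one possible reading of the regression in Eq. (2), and it applies the sign convention of Step 3 (the CSW input weights are the negated input coefficients). The function names `solve_linear`, `ols`, and `csw_ra_dea` are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular design matrix")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][c] * x[c] for c in range(i + 1, n))) / m[i][i]
    return x


def ols(rows, target):
    """OLS with an intercept; returns [alpha, coefficient_1, ...]."""
    design = [[1.0] + list(r) for r in rows]
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * t for d, t in zip(design, target)) for i in range(p)]
    return solve_linear(xtx, xty)


def csw_ra_dea(dea_scores, inputs, outputs):
    """Steps 2-3: regress DEA scores on inputs/outputs, then apply Eq. (3)."""
    k = len(inputs[0])
    coef = ols([list(x) + list(y) for x, y in zip(inputs, outputs)], dea_scores)
    beta, gamma = coef[1:1 + k], coef[1 + k:]
    # The method expects negative input and positive output coefficients.
    if any(b >= 0 for b in beta) or any(g <= 0 for g in gamma):
        raise ValueError("expected beta < 0 and gamma > 0")
    v = [-b for b in beta]  # common input weights (sign flipped, Step 3)
    scores = [
        sum(g * yr for g, yr in zip(gamma, y)) / sum(vi * xi for vi, xi in zip(v, x))
        for x, y in zip(inputs, outputs)
    ]
    return v, gamma, scores
```

The sketch raises an error when a coefficient has the wrong sign, because the weights in Eq. (3) would then not be positive. How such cases should be handled is not specified in the text above.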

We also used two popular numerical examples in the CSW literature (Davtalab-Olyaie, 2019; Sexton et al., 1986; Wang & Chin, 2010; Wang et al., 2021) to compare the efficiency scores and the ranks derived by different CSW approaches, including our CSW–RA–DEA approach. Our results show that the CSW–RA–DEA approach provided consistent and even better results than the others (see the Appendix) and that, therefore, it was appropriate to use in our analysis.

3.3 The econometric analytics

Obviously, one can train a model to estimate the impacts of the explanatory variables on the dependent variable, such as the CSW–RA–DEA efficiency scores derived from the DEA approach, following traditional econometric approaches. Since the efficiency scores are bounded between 0 and 1, it can be argued that Tobit or truncated regression is more appropriate for this second-stage DEA (Boubaker et al., 2019; Ho et al., 2021; Ngo et al., 2019b). For example, a simple search on Google Scholar on 20 November 2021 with the keywords “DEA”, “efficiency”, “Tobit”, and “two stage” returned 6540 results; a similar search using the keywords “DEA”, “efficiency”, “truncated”, and “two stage” resulted in 4680 results. Both models have the form:

$$ EF = \alpha + \beta Z + \epsilon $$
(4)

in which

$$ EF = \left\{ {\begin{array}{*{20}l} 0 &\quad {{\text{if}}} & \quad {EF < 0} \\ {EF} & \quad {{\text{if}}} &\quad {0 \le EF \le 1} \\ 1 &\quad {{\text{if}}} &\quad {EF > 1} \\ \end{array} } \right\} $$
(5)

where \(EF\) is the CSW–RA–DEA efficiency scores, \(Z\) is the vector of the explanatory variables, \(\beta \) is the vector of the coefficients to be estimated, \(\alpha \) is the intercept, and \(\epsilon \) is the random error.
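The difference between the two econometric models lies in how scores outside the unit interval are treated. A minimal sketch of the data-side distinction, assuming the bounds 0 and 1 from Eq. (5), is given below; the function names `censor` and `truncate` are hypothetical, and the sketch does not estimate either regression.

```python
def censor(scores, lower=0.0, upper=1.0):
    """Tobit-style censoring as in Eq. (5): out-of-range values are set to the bound."""
    return [min(max(s, lower), upper) for s in scores]


def truncate(scores, lower=0.0, upper=1.0):
    """Truncation: observations outside the range are dropped from the sample."""
    return [s for s in scores if lower <= s <= upper]
```

Censoring keeps every observation but replaces out-of-range values with the bound, whereas truncation removes them, so the two approaches can work with samples of different sizes.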

3.4 The ML analytics

The current big data era is witnessing a growing body of literature on the use of ML for prediction purposes (Manimuthu et al., 2021; Schoenherr & Speier-Pero, 2015; Wamba et al., 2017; Zhu et al., 2021), particularly in DEA studies (Nandy & Singh, 2021; Thaker et al., 2021). We therefore follow the literature in using LASSO regression (Chen et al., 2021), NN regression (Wu et al., 2006; Zhu et al., 2021), SVR (Zhu et al., 2021), and RF regression (Nandy & Singh, 2021; Thaker et al., 2021) as the ML analytical techniques for training our prediction model. This section briefly introduces these ML algorithms; readers are encouraged to consult the relevant literature and the references therein for more technical information.

LASSO identifies the important explanatory variables of the dependent efficiency scores by minimising the sum of squared errors plus an L1 penalty on the total sum of the absolute coefficients (Lee & Cai, 2020):

$$ \mathop {\min }\limits_{\alpha ,\beta } \frac{1}{2}\mathop \sum \limits_{j = 1}^{n} \left( {EF_{j} - \alpha - \beta Z_{j} } \right)^{2} + \lambda \sum \left| \beta \right| $$
(6)

where \(\lambda \) is a penalty (or tuning parameter) chosen by the extended Bayesian information criterion. Note that when \(\lambda =0\), the LASSO model in (6) collapses into the traditional regression model in (4). It is noted that, as is the case in other ML algorithms, including the other ones presented in this section, the estimation of the vector \(\beta \) in LASSO does not focus on its significance; it focuses on the contributions of each explanatory variable to the construction or prediction of the efficiency scores instead.
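To make the role of \(\lambda\) in Eq. (6) concrete, the following sketch minimises that objective by cyclic coordinate descent with soft-thresholding. This is a common way of solving LASSO problems, but it is an assumption here: the text does not state which solver was used, and the sketch takes \(\lambda\) as given rather than selecting it by the extended Bayesian information criterion. The names `soft_threshold` and `lasso_cd` are hypothetical.

```python
def soft_threshold(rho, lam):
    """Shrink rho towards zero by lam; values within [-lam, lam] become zero."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_cd(Z, ef, lam, sweeps=200):
    """Minimise 0.5 * sum_j (ef_j - alpha - beta . Z_j)^2 + lam * sum_k |beta_k|."""
    n, p = len(Z), len(Z[0])
    alpha, beta = 0.0, [0.0] * p
    for _ in range(sweeps):
        # The intercept is not penalised: set it to the mean residual.
        alpha = sum(ef[j] - sum(beta[k] * Z[j][k] for k in range(p)) for j in range(n)) / n
        for k in range(p):
            rho, norm = 0.0, 0.0
            for j in range(n):
                # Residual that leaves out the contribution of variable k.
                partial = ef[j] - alpha - sum(beta[l] * Z[j][l] for l in range(p) if l != k)
                rho += Z[j][k] * partial
                norm += Z[j][k] ** 2
            beta[k] = soft_threshold(rho, lam) / norm if norm > 0 else 0.0
    return alpha, beta
```

With \(\lambda = 0\) the update reduces to ordinary least squares, and a sufficiently large \(\lambda\) sets every coefficient to zero, which is how LASSO performs variable selection.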

The NN model provides a different method that uses hidden layers to extract the important features or inputs of a given output (Wu et al., 2006; Zhu et al., 2021). In our case, it was appropriate to extract the important explanatory variables (inputs) influencing the CSW–RA–DEA efficiency scores (output) using NN. Specifically, the NN algorithm started by estimating the weights (or importance) of the inputs, then the relevant output was mapped via an activation function \(f\left( \bullet \right)\). This output was compared with the desired output, and the error was calculated accordingly. The error was then back-propagated to the NN to help adjust the weights, with the aim of reducing the error in each iteration. In our study, the NN output based on the activation function \(f\left( \bullet \right)\) had the form:

$$ EF = f\left( {\sum wZ} \right) - \theta $$
(7)

where \(\sum wZ\) is the weighted sum of the explanatory variables (or inputs) Z and \(\theta \) is the intercept.

RF is another ML algorithm which is based on a decision tree (Breiman et al., 1984) and the bootstrapping (or “bagging”) technique (Breiman, 1996). It randomly bootstraps the training dataset many times; at each iteration, the data are recursively partitioned by one input at a time (also called a “node”) to create a decision tree. By combining all these random “trees”, RF generates a “forest” where the dependent output can be predicted as the average of the predictions of all trees (Nandy & Singh, 2021; Thaker et al., 2021). According to Thaker et al. (2021), the RF’s predictor is computed as:

$$EF=\frac{1}{B}\sum_{b=1}^{B}Q\left(Z,{\Theta }_{b}\right)$$
(8)

where \(B\) is the number of randomised trees in the forest (i.e., the number of bootstrap iterations) and \(Q\) represents the predicted output of each tree, given the input \(Z\) and the independent and identically distributed random vector \({\Theta }_{b}\) that represents the relationship between the inputs and the output in the tree of the \(b\)-th iteration.
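The averaging in Eq. (8) can be illustrated with a deliberately simplified sketch. It assumes each tree is a one-split regression tree (a “stump”) grown on a bootstrap sample using one randomly chosen explanatory variable, which is far simpler than a full RF implementation. The names `fit_stump`, `fit_forest`, and `predict_forest` are hypothetical.

```python
import random


def fit_stump(Z, ef, feature):
    """One-split regression tree on a single feature; returns a predictor."""
    values = sorted(set(row[feature] for row in Z))
    overall = sum(ef) / len(ef)
    best = (float("inf"), None, overall, overall)
    for lo, hi in zip(values, values[1:]):
        cut = (lo + hi) / 2
        left = [e for row, e in zip(Z, ef) if row[feature] <= cut]
        right = [e for row, e in zip(Z, ef) if row[feature] > cut]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((e - lm) ** 2 for e in left) + sum((e - rm) ** 2 for e in right)
        if sse < best[0]:
            best = (sse, cut, lm, rm)
    _, cut, lm, rm = best
    if cut is None:  # the bootstrap sample had a single distinct value
        return lambda z: overall
    return lambda z: lm if z[feature] <= cut else rm


def fit_forest(Z, ef, n_trees=50, seed=0):
    """Grow n_trees stumps, each on a bootstrap sample and a random feature."""
    rng = random.Random(seed)
    n, p = len(Z), len(Z[0])
    trees = []
    for _ in range(n_trees):
        rows = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        feature = rng.randrange(p)
        trees.append(fit_stump([Z[i] for i in rows], [ef[i] for i in rows], feature))
    return trees


def predict_forest(trees, z):
    """Eq. (8): the forest prediction is the average over all B trees."""
    return sum(tree(z) for tree in trees) / len(trees)
```

Averaging over many bootstrapped trees reduces the variance of any single tree, although, as with any flexible learner, a close in-sample fit does not guarantee good out-of-sample predictions.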

The algorithm of SVR is slightly different from the ones described above. Instead of examining the relationship between the inputs and the output, SVR constructs a hyperplane to separate the data, given the multiple-dimensional space of the output and inputs. In other words, the aim of SVR is to find the optimal surface that minimises the error of all training datapoints on the hyperplane (Smola & Schölkopf, 2004; Zhu et al., 2021). The linear form of SVR is:

$$EF=wZ+b$$
(9)

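where \(w\) is the weight vector defining the hyperplane and \(b\) is the intercept, which are the optimal solutions of:

$$ \mathop {\min }\limits_{{w,b,\xi_{j} ,\xi_{j}^{*} }} \frac{1}{2}\left\| w \right\|^{2} + C\mathop \sum \limits_{j = 1}^{n} \left( {\xi_{j} + \xi_{j}^{*} } \right) $$
(10)

Subject to

$$ EF_{j} - \left( {wZ_{j} + b} \right) \le \epsilon + \xi_{j}^{*} ,\;j = 1,2, \ldots ,n $$
$$ \left( {wZ_{j} + b} \right) - EF_{j} \le \epsilon + \xi_{j} ,\;j = 1,2, \ldots ,n $$
$$ \xi_{j} ,\xi_{j}^{*} \; \ge \, 0,\forall j $$

Here, \(\xi_{j}\) and \(\xi_{j}^{*}\) are slack variables that measure deviations beyond the tolerance margin \(\epsilon\) (which differs from the random error in Eq. (4)), and \(C > 0\) is a constant that controls the penalty on such deviations.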
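For a given pair (\(w\), \(b\)), the smallest slacks that satisfy the constraints of Eq. (10) are the parts of the absolute residuals that exceed \(\epsilon\). The sketch below evaluates the objective on that basis; it does not solve the optimisation problem, and the function name `svr_objective` and the default values of `C` and `eps` are hypothetical.

```python
def svr_objective(w, b, Z, ef, C=1.0, eps=0.1):
    """Value of Eq. (10) for fixed (w, b), using the smallest feasible slacks."""
    regulariser = 0.5 * sum(wi * wi for wi in w)
    total_slack = 0.0
    for z, e in zip(Z, ef):
        residual = e - (sum(wi * zi for wi, zi in zip(w, z)) + b)
        # Deviations inside the eps-tube cost nothing; only the excess is penalised.
        total_slack += max(0.0, abs(residual) - eps)
    return regulariser + C * total_slack
```

Observations whose residuals lie within the tolerance margin do not contribute to the objective, which is the sense in which SVR tolerates small errors.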

4 Analytics using CSW–RA–DEA: the performance of Vietnamese manufacturing MSMEs

This section provides an analytical application of CSW–RA–DEA to a rich dataset of more than 37,000 observations on MSMEs in the Vietnamese manufacturing industry during 2010–2016. Given the rising use of big data, as in our case, defining the CSW via previous approaches by linear or non-linear optimisation of the secondary goal (Davtalab-Olyaie, 2019; Wang & Chin, 2010; Wang et al., 2021) is time-consuming, which justifies the use of our proposed CSW-RA-DEA approach.

4.1 Data and variable selection

Vietnam is an emerging economy that has witnessed impressive economic development over the last few decades. The driving force behind its economic growth is household businesses or MSMEs (CIEM, 2016; OECD, 2021; Rand & Tarp, 2020). For instance, Rand and Tarp (2020) emphasised that in Vietnam, private SMEs accounted for about 95% of all enterprises, employed about half of the workforce, and produced approximately 40% of the national GDP. Given the key role that the MSMEs play nationally and globally (Ayyagari et al., 2003; IFC, 2012; OECD, 2021), it is therefore important to examine the performance and efficiency of Vietnamese MSMEs, especially for making important recommendations to managers and policymakers to help improve the performance of this sector. Importantly, with data on Vietnamese MSMEs comprising more than 37,000 firm-year observations, this sample was suitable for a hybrid study combining DEA and ML techniques.

In line with the literature on evaluating the efficiency of manufacturing firms, such as Verschelde et al. (2016), Ngo et al. (2019a), and Sahoo et al. (2021), we examined the Vietnamese MSMEs in terms of three important inputs, namely labour (proxied by the number of employees, x1), capital (proxied by the value of total assets, x2), and materials (proxied by the amount of materials, x3), to produce a single output (total revenue, y). This information was extracted from the annual surveys of Vietnamese enterprises conducted by the national General Statistics Office (GSO, 2016); such data are popular in many studies (Dao et al., 2021; Le et al., 2018; Rand & Tarp, 2020). Since we focused our study on MSMEs only, we followed the IFC (2012) and filtered out firms with more than 250 employees, resulting in 37,557 firm-year observations for the Vietnamese MSMEs operating during the 2010–2016 period. Accordingly, our data covered 2011 observations for micro (one to nine employees), 12,494 observations for small (10–49 employees), and 23,052 observations for medium (50–249 employees) enterprises. Note that most previous studies applied a two-stage analysis, where the (dynamic) DEA efficiency scores were estimated (in the first stage) then regressed on a set of explanatory variables (the second stage) (see Footnote 4). More importantly, if these explanatory variables were found to significantly influence the CSW–RA–DEA efficiency, we could use them to predict the performance of the Vietnamese MSMEs. We therefore used several prediction methods, including recent ML techniques such as LASSO and RF regressions, in our second-stage analysis. The basic information on our data and variables is presented in Table 1.

Table 1 Descriptions of the variables

4.2 First-stage analytics: the CSW–RA–DEA efficiency of Vietnamese MSMEs

We report our CSW-RA-DEA efficiency scores for our sample of 37,557 MSME observations in Table 2, in which the estimated CSW–RA–DEA scores have higher means than the (dynamic) DEA scores. On the one hand, the average CSW-RA-DEA efficiency scores, which consistently ranged from 0.803 to 0.824 during the 2010–2016 period, are in closer agreement with previous studies on manufacturing firms in Vietnam and other developing countries (Hailu & Tanaka, 2015; Le et al., 2018; Ngo et al., 2019a) than the traditional DEA scores are. On the other hand, we argue that the use of RA (in step 2) allowed us to estimate the weighted inputs and outputs (in step 3 of the CSW-RA-DEA algorithm) as having greater variations; therefore, the CSW–RA–DEA efficiency scores can have a wider range. However, this is similar to the case of super-efficiency (see Sect. 3.3 above) or other econometric-based DEA results (Wu et al., 2006). More importantly, the results of both Spearman’s and Kendall’s ranking correlations in Table 3 confirmed that our CSW-RA-DEA estimations are consistent with the results of traditional DEA and thus are reliable. In this sense, it was justified to proceed with the second-stage regression.

Table 2 Average efficiency scores of DEA and CSW-RA-DEA for 37,557 MSME observations (2010–2016)
Table 3 Ranking correlations between DEA and CSW-RA-DEA scores
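The two ranking correlations reported in Table 3 can be computed as in the following sketch. It assumes there are no tied scores (the simple Spearman formula and Kendall's tau-a are used, without tie corrections), and the pairwise loop for Kendall's measure is only practical for small samples. The function names `ranks`, `spearman`, and `kendall_tau_a` are hypothetical.

```python
def ranks(values):
    """Rank from 1 (highest value) to n (lowest value); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for position, i in enumerate(order, start=1):
        result[i] = position
    return result


def spearman(a, b):
    """Spearman's rank correlation via 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ranks(a), ranks(b)))
    return 1 - 6 * d2 / (n * (n * n - 1))


def kendall_tau_a(a, b):
    """Kendall's tau-a: (concordant - discordant pairs) / total pairs."""
    n = len(a)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (a[i] - a[j]) * (b[i] - b[j])
            s += (product > 0) - (product < 0)
    return s / (n * (n - 1) / 2)
```

Both measures depend only on the ordering of the DMUs, so two sets of scores with very different means can still be perfectly rank-correlated.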

4.3 Second-stage analytics: predicting the Vietnamese MSMEs’ performance

The predictions of our econometric and ML analytics are presented in Table 4. Three important findings and their relevant managerial implications can be summarised as follows.

Table 4 Predictive performance of different methods

Firstly, from the managerial perspective, Table 4 suggests (and confirms) that the performance of Vietnamese MSMEs was (i) negatively influenced by the firm’s age, the ratio of female employees, and the industrial zone status; and (ii) positively influenced by the firm’s foreign ownership, export activities, the municipality status, the provincial business environment, and asset size. These findings are consistent with the literature. For instance, it can be argued that young firms are more likely to be involved with radical innovations (Acemoglu & Cao, 2015). Unlike in other sectors, where the use of technology and innovations may be an obstacle (Pellegrino, 2018), such innovations do not require many resources for MSMEs and are feasible. Because most Vietnamese MSMEs operate in the garment, textile, and footwear sector (Dao et al., 2021; Pham et al., 2010), where the productivity of female employees is still low, it is reasonable to see that firms with a higher female employee ratio tend to have lower efficiency. In contrast, the participation of foreign investors allows the firms to possess more advanced technologies and management knowledge and hence improve their performance (Huang & Yang, 2016; Ngo et al., 2019a). Similarly, MSMEs involved in export activities can benefit from a learning-by-exporting effect because they are more exposed to foreign technology and competition (Amiti & Konings, 2007; Baldwin & Gu, 2004; Pilar et al., 2018). The MSMEs also benefit from operating in large municipalities because of the effects of firm selection and agglomeration economies (Combes et al., 2012; Le et al., 2018; Vu et al., 2016), from a good provincial business environment (Ngo et al., 2019a; Dao et al., 2021; VCCI & USAID, 2022), and from economies of scale (Bačić et al., 2018; Ngo et al., 2019a), all of which further improve their efficiency. These findings were robust across different models, including Tobit, truncated, and LASSO regressions.
The RF regression did not provide coefficient estimates but confirmed the contributions of these explanatory variables: for instance, it identified MICRO as the most important factor (variable importance = 1.000), in line with MICRO having the largest coefficient magnitude in the other models: −0.1575, −0.1545, and −0.1374 in the Tobit, truncated, and LASSO regressions, respectively. The NN model supported most of the signs of the coefficients but not their significance; however, this model performed slightly worse than the other ML models (its in-sample and out-of-sample RMSEs were slightly higher, at 0.35410 and 0.35129, respectively).

Secondly, we observed disadvantages in the competitiveness and performance of micro and small enterprises compared with medium ones (Kamble et al., 2020), with the coefficients of both MICRO and SMALL being negative and statistically significant. We also found that high-tech MSMEs outperformed their counterparts, in line with the evidence provided by Anh and Gan (2020). We therefore suggest that Vietnamese manufacturing MSMEs should be encouraged to expand their scale (in terms of both assets and employment) and become more involved in export activities. Meanwhile, central and provincial Vietnamese governments should improve their (business) governance and allow more activities and greater involvement of foreign investors in the Vietnamese manufacturing sector. As discussed earlier, given that our estimates are based on the CSW–RA–DEA scores, we believe that these managerial suggestions and recommendations can be widely applied to all MSMEs in our sample.

Thirdly, from the methodological perspective, the last three rows of Table 4 suggest that the advanced ML analysis generally made better predictions for both in-sample and out-of-sample data (i.e., lower RMSEs) compared with the traditional econometric analytical methods (i.e., Tobit and truncated regressions). Among the ML techniques, LASSO regression was the best model for out-of-sample prediction. The RF model seemed to overfit the in-sample data (with an exceptionally low in-sample RMSE of only 0.1371), but its predictive ability for out-of-sample data was not remarkable (its out-of-sample RMSE was 0.34858). Although this is not reported in Table 4, a similar situation was found for SVR, for which the in-sample and out-of-sample RMSEs were 0.31632 and 0.36842, respectively. For the econometric models, truncated regression outperformed the Tobit model, supporting the argument of Daraio et al. (2010) that DEA efficiency scores are truncated rather than censored. Nevertheless, both models yielded high RMSE values. Therefore, although the two are popular in two-stage DEA studies, we suggest that they are not suitable for predictive purposes (see Footnote 5). We argue that for big data samples such as ours, censoring or truncating efficient observations (with DEA scores greater than or equal to 1) from the prediction model may result in missing information, and the predictive power of these models is accordingly weaker. We therefore support the ML literature (e.g., Belhadi et al., 2021; Tsai & Chen, 2010; Zhu et al., 2021) in confirming that the ML approach is superior to the econometric approach, and that our hybrid DEA-ML model is more efficient and more accurate than the traditional ones. Consequently, we conclude that LASSO regression is the best model of the ML approaches for predicting the efficiency of Vietnamese MSMEs.
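The overfitting argument above rests on the gap between in-sample and out-of-sample RMSE. As an illustrative sketch only (not the code used in this study), the snippet below defines the RMSE and computes that gap from the three pairs of RMSE values quoted in the text; the function names `rmse` and `generalisation_gap` and the dictionary `REPORTED_RMSE` are our own hypothetical labels. The LASSO, Tobit, and truncated values are omitted because they are not quoted in the text.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def generalisation_gap(in_sample, out_of_sample):
    """Out-of-sample RMSE minus in-sample RMSE; a large positive gap hints at overfitting."""
    return out_of_sample - in_sample


# (in-sample RMSE, out-of-sample RMSE) exactly as quoted in the text.
REPORTED_RMSE = {
    "RF": (0.1371, 0.34858),
    "SVR": (0.31632, 0.36842),
    "NN": (0.35410, 0.35129),
}

gaps = {
    model: round(generalisation_gap(ins, outs), 5)
    for model, (ins, outs) in REPORTED_RMSE.items()
}

# List the models from the largest gap to the smallest.
for model, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{model}: out-of-sample minus in-sample RMSE = {gap:+.5f}")
```

On these quoted figures, RF shows by far the largest gap, SVR a smaller positive one, and NN a gap close to zero, which matches the reading that RF and, to a lesser extent, SVR fit the training data more closely than they generalise.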

Nevertheless, we have shown that the CSW–RA–DEA yields consistent and better results than other CSW ranking methods such as cross-efficiency, super-efficiency, normalised common weights, and so on. Because our novel model is based on RA, it overcomes the time-consuming non-convexity issue of previous CSW DEA methodologies, especially for large samples. As such, the CSW–RA–DEA model could be extended to other fields where big data exist, thus widening the use of DEA in such fields.

5 Conclusions

This study proposed a novel method of estimating the common set of weights for evaluating performance and rankings via DEA based on regression analysis (CSW–RA–DEA). It then applied CSW–RA and several other prediction methods (two econometric models and four ML algorithms) to explain and predict the performance of more than 5400 Vietnamese manufacturing MSMEs operating during 2010–2016. In this sense, our study contributes to the literature in terms of methodological (the CSW–RA–DEA method itself as well as the hybrid DEA-ML approach), empirical (the use of CSW DEA for Vietnamese MSMEs), and managerial (recommendations for improving MSMEs’ performance) perspectives.

It should be noted that although we have examined four popular ML techniques in our hybrid DEA–ML model (i.e., LASSO, NN, SVR, and RF regressions), there are other ML algorithms and DEA models that could be investigated in future research. Regarding the DEA approach, one could apply the variable returns to scale assumption (Banker, 1984), the cost/profit measures (Ngo et al., 2019b; Pilar et al., 2018), the fuzzy approach (Boubaker et al., 2020; Nandy & Singh, 2021), or the Euclidean distance (Hammami et al., 2020) to measure the DEA efficiency in different settings. Regarding ML analytics, the ensemble approach (Belhadi et al., 2021) and other hybrid ML combinations (e.g., combining LASSO and NN) could also be used. Finally, the extension of this CSW–RA–DEA to other industries with big data, such as banking and finance (Tsai & Chen, 2010), healthcare (Misiunas et al., 2016), agriculture (Nandy & Singh, 2021), or energy (Khezrimotlagh et al., 2019), could help increase our understanding of the role of DEA in such fields.