1 Introduction

The world economy is characterized by the interdependence of all countries with respect to their industrial activities. It can be modeled as a network in which each “country–sector” pair exchanges goods and services with any other (and even the same) such pair, to a different extent and using different technologies (Cerina et al. 2015). The structure of this network may provide information about such interdependencies of national economies and their changes over time. The literature usually refers to the study of these interdependencies with the term “input–output (I/O) analysis.” I/O tables portray flows of what is produced by some economic agents and used by other economic agents (either as intermediate or final consumption). They are arranged into two types, either according to relationships between industries (industry-by-industry I/O tables) or according to relationships between products (product-by-product I/O tables). Many world I/O tables for selected countries are periodically produced by, e.g., the World Input–Output Database (WIOD, Timmer et al. 2015, 2016), the EORA Global Supply Chain initiative (Lenzen et al. 2012), EXIObase (Tukker et al. 2013), the Full International and Global Accounts for Research in input–output analysis (FIGARO, Rémond-Tiedrez and Rueda-Cantuche 2019) from EUROSTAT, and the OECD (2018).Footnote 1 By analyzing I/O matrices, several authors (Cerina et al. 2015; McNerney et al. 2013; Zhu et al. 2015) studied the network topology of inter-industry flows and others (Liang et al. 2016) investigated its scaling patterns. Some studies highlighted the presence of significant asymmetry between inflow data (i.e., what a “country–sector” pair uses from other “country–sector” pairs) and outflow data (i.e., what a “country–sector” pair sells to other “country–sector” pairs), and the emergence of a clustering pattern among either countries or sectors.
This can derive, e.g., from the presence of similar production technologies, i.e., a similar dependence of two specific countries or sectors on other “country–sector” pairs (Carvalho 2009). This issue was taken into account, for instance, by Zhu et al. (2018), who used an ad hoc measure to detect patterns of similarity and to group together similar countries and similar sectors, and by Oliva et al. (2016), who applied a spectral clustering approach to group similar sectors of the Italian economy.

Our work deals with the reconstruction of I/O matrices. The interest in this topic arises if one takes into account some aspects related to the methodology of data collection adopted by national statistical offices. Indeed, the works by Percoco et al. (2006) and Wen et al. (2014) provide the following motivations for applying a reconstruction method to I/O matrices:

  • Data collection for I/O matrices is typically based on direct methods, e.g., on surveys (made at different times and in different countries). This leads to classical sampling errors;

  • For large surveys, errors from the inference design can easily arise;

  • The elements of an I/O matrix change over time. However, collecting them in a timely fashion (for instance, once for each country every year) is an almost impossible task, due to limited resources. For this reason, historical data are typically used to approximate the current I/O matrix.

Direct compilation of I/O tables does not rely only on surveys but also on administrative registers (e.g., firms’ registers, accountancies of companies in official registers, public budget documents, information on trade customs, tax authorities’ reports, households’ registers), censuses (population, firms’ directories) and, among others, sectorial reports providing global economic information about specific sectors in an industry or in a country. It follows that, in order to compile an I/O table, many different statistical sources (sometimes even contradictory ones) have to be compared. This is a hugely time-consuming and resource-consuming operation, and the time gap between the publication of an I/O table and its reference year is one of the main reasons why indirect methods also exist. These methods were recently reviewed by Valderas-Jaramillo et al. (2019).

In this work, taking advantage of the similarity patterns discussed above, we apply, as an indirect method, matrix completion (MC, Hastie et al. 2015; Mazumder et al. 2010; Negahban and Wainwright 2012) to I/O submatrices associated with suitable groups of countries, which are obtained by judiciously clustering countries together via a proper clustering method. We recall here that MC refers to a set of advanced statistical methods that can be used to predict unobserved entries of a matrix in terms of the set of the remaining observed entries (more technical details about MC are reported later in Sect. 2.2). In this way, the partially observed matrix is “completed” by the predictions produced by MC. Our choice of MC as an imputation method is based on the growing interest in using such a method in the literature on economics (see Athey and Imbens 2019), and on the excellent performance achieved by MC in several applications (see, e.g., Hastie et al. 2015). We show that using a selected group of similar countries makes it possible to increase the effectiveness of MC. This is done by comparing these results with the ones obtained by MC when, on the contrary, the selected group is made up of highly dissimilar countries. To the best of the authors’ knowledge, this is the first article in which MC is applied to I/O tables in connection with clustering.

The aforementioned similarity patterns have important consequences on the structure of the I/O matrices, justifying the application of MC. In fact, due to the presence of countries that share similar technologies for producing the same goods, an I/O matrix might be low-rank, or, at least, might be well-approximated by a low-rank matrix (in the sense that a few singular values would dominate all the other ones, i.e., the singular values’ distribution would decay quickly to 0). This low-rank (approximation) property suggests, among other possible statistical or machine learning techniques, the adoption of MC to reconstruct potentially missing entries in an I/O matrix. Moreover, satisfying such a property is a necessary condition for obtaining good MC results (see Appendix 1 for a discussion on this issue). Nevertheless, the application of MC to a full I/O matrix is not straightforward, since elements in different blocks of that matrix can have quite different orders of magnitude. Having analyzed different real-world I/O matrices, we found that: (1) within-country values are much larger than cross-country ones; (2) I/O matrices are sparse, because of many cross-country zero entries, and (3) there is a clear separation between large-to-large countries’ values and small-to-small countries’ ones. This suggests performing a pre-processing step, in which some blocks are removed from the full I/O matrices. In particular, we focus our analysis on bilateral trade blocks (Dietzenbacher et al. 2013; Arto et al. 2019). Their investigation constitutes a relevant problem in multi-country and multi-regional I/O tables.

Based on cross-country subsets of real-world I/O matrices, we performed a panel analysis using MC based on a LASSO-like nuclear norm penalty (Mazumder et al. 2010; Negahban and Wainwright 2012). Through a suitable choice of the regularization parameter (based on a validation set), this makes it possible to select the number of nonzero singular values to be kept in the reconstructed matrix. The specific selection of countries was generated by the output of hierarchical clustering, whose application was based on a dissimilarity measure (the Average Absolute Correlation Distance or AACD, see later) highly related to the successive application of MC. Robustness of the results produced by hierarchical clustering was evaluated by considering synthetic counterparts of real-world I/O matrices, presenting a structure that is common to many I/O tables. This was done also in order to generalize the approach to all kinds of I/O matrices.

In summary, the main goal of our analysis is to compare the performance of MC when it is applied to I/O subtables made up, respectively, of similar or dissimilar blocks (the first case corresponding to a subset of countries belonging to the same cluster, the second one to a subset of countries belonging to different clusters). It is worth remarking that our specific choice of the dissimilarity measure has been guided by its relationship with the regularization term in the objective function of the optimization problem associated with MC.

In more detail, this paper proposes a two-step methodological approach where, in the first step, judiciously selected groups of countries, in terms of either i) the distribution by country–industry from where they buy inputs, or ii) the distribution by country–industry to which they sell outputs, are retrieved using a hierarchical cluster analysis (Revelle 1979). This choice is preferable to other clustering techniques (e.g., k-means clustering; MacQueen 1967) for grouping countries according to their I/O exchanges when the set of countries is not a priori partitioned into a certain number of groups, but its complex structure suggests a hierarchy of clusters (see also Sect. 2.2 for other technical motivations behind this choice). In the second step, an I/O submatrix associated with this selection of countries is analyzed based on a LASSO-like formulation of the MC optimization problem. In particular, our approach is applied to a 5-year panel of (cross-country subsets of) I/O matrices for subsets of countries selected after performing hierarchical clustering, where a known part of the matrix associated with a specific year (i.e., the latest one) has been artificially obscured. The validation/testing phase is based on the Root Mean Square Error (RMSE) and on the Symmetric Mean Absolute Percentage Error (SMAPE) between actual and estimated values of the obscured part of the matrix, which is not provided as input to the MC algorithm. Ad hoc analyses have also been performed in order to i) evaluate the improvement that could be achieved by a suitable pre-processing of raw data (by eliminating domestic blocks); ii) select the proper number of clusters; iii) evaluate the MC performance if the clustering failed in the first step. Regarding the latter point, we apply MC to subsets of either similar (clustering does not fail) or dissimilar (clustering fails) countries.

Results show the effectiveness of the proposed method to predict missing values in the current I/O matrix from both previous years’ data and current data related to countries similar to the one for which current data are obscured. In contrast, the effectiveness reduces, as expected, if similar countries are replaced by ones belonging to quite different clusters. This conclusion holds both for the real-world and the synthetic data examined.

The rest of the manuscript begins with presenting the proposed methodological approach in Sect. 2 and reports the specific application to WIOD and simulated matrices and its results in Sect. 3. Section 4 is dedicated to future research directions and conclusions. Further technical details are reported in the appendix.

2 Methods

2.1 The input–output model

The traditional I/O matrix (Leontief 1986) depicts inter-industry relationships within an economy (or country), showing how the output from one sector becomes an input to another sector (or to itself), or it contributes to the final demand. Row indices represent inputs (in nominal monetary values) from an industrial sector, while column indices represent intermediate outputs to a given sector or needed to produce a final output. This table shows how dependent each sector is on every other sector, both as a customer of outputs from other sectors, and as a supplier of inputs.

Suppose that an economy with n sectors and l final outputs is given, that constant returns to scale can be assumed, and that sectors use inputs in fixed proportions. Fix also a specific year. In that year, each sector i produces a monetary value \(x_{i}\) of good i. Let \(z_{i,j}\) be the value that sector i sells to sector j in that year, and let \(f_{i,j}\) be the value that sector i sells to the final user in that year, to produce the final output j. In matrix notation, if one lets

$$\begin{aligned} \mathbf{x} = \begin{bmatrix} x_1 \\ \cdot \\ \cdot \\ \cdot \\ x_n \end{bmatrix}\,,\,\, \mathbf{Z} = \begin{bmatrix} z_{1,1} &{} \cdot &{} \cdot &{} \cdot &{} z_{1,n} \\ \cdot &{} &{} &{} &{} \cdot \\ \cdot &{} &{} &{} &{} \cdot \\ \cdot &{} &{} &{} &{} \cdot \\ z_{n,1} &{} \cdot &{} \cdot &{} \cdot &{} z_{n,n} \end{bmatrix}\,,\,\, \mathbf{F} = \begin{bmatrix} f_{1,1} &{} \cdot &{} \cdot &{} \cdot &{} f_{1,l} \\ \cdot &{} &{} &{} &{} \cdot \\ \cdot &{} &{} &{} &{} \cdot \\ \cdot &{} &{} &{} &{} \cdot \\ f_{n,1} &{} \cdot &{} \cdot &{} \cdot &{} f_{n,l} \end{bmatrix}\,, \end{aligned}$$
(1)

one can write \(\mathbf{x} = \mathbf{Z} \mathbf{i} _n +\mathbf{F} \mathbf{i} _l\), where \(\mathbf{i} _n \in {\mathbb {R}}^{n \times 1}\) and \(\mathbf{i} _l \in \mathbb {R}^{l \times 1}\) are column vectors made of all ones.
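As a small numerical illustration of this accounting identity, the following Python sketch computes \(\mathbf{x}\) as the row sums of \(\mathbf{Z}\) plus the row sums of \(\mathbf{F}\). The toy values and the function name `total_output` are assumptions made only for this example and do not come from any real I/O table.

```python
# Toy economy with n = 2 sectors and l = 2 final outputs (illustrative values only).
Z = [[10.0, 5.0],
     [4.0, 8.0]]   # Z[i][j]: value that sector i sells to sector j
F = [[6.0, 1.0],
     [3.0, 2.0]]   # F[i][j]: value that sector i sells to final users for final output j


def total_output(Z, F):
    """Return x = Z i_n + F i_l, i.e. row sums of Z plus row sums of F."""
    return [sum(z_row) + sum(f_row) for z_row, f_row in zip(Z, F)]


x = total_output(Z, F)
print(x)
```

Multiplying by a vector of all ones is the same as summing along each row, which is why no explicit matrix product is needed here.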

Available I/O tables can also report the multi-national structure of intra- and inter-industry (product) exchanges. In this case, let m be the number of countries considered. Then, analogously to Eq. (1), one sets

$$\begin{aligned} \mathbf{x} = \begin{bmatrix} \mathbf{x} ^1 \\ \cdot \\ \cdot \\ \cdot \\ \mathbf{x} ^m \end{bmatrix}\,,\,\, \mathbf{Z} = \begin{bmatrix} \mathbf{Z} ^{1,1} &{} \cdot &{} \cdot &{} \cdot &{} \mathbf{Z} ^{1,m} \\ \cdot &{} \cdot &{} &{} &{} \cdot \\ \cdot &{} &{} \mathbf{Z} ^{h,k} &{} &{} \cdot \\ \cdot &{} &{} &{} \cdot &{} \cdot \\ \mathbf{Z} ^{m,1} &{} \cdot &{} \cdot &{} \cdot &{} \mathbf{Z} ^{m,m} \end{bmatrix}\,,\,\, \mathbf{F} = \begin{bmatrix} \mathbf{F} ^{1,1} &{} \cdot &{} \cdot &{} \cdot &{} \mathbf{F} ^{1,m} \\ \cdot &{} \cdot &{} &{} &{} \cdot \\ \cdot &{} &{} \mathbf{F} ^{h,k} &{} &{} \cdot \\ \cdot &{} &{} &{} \cdot &{} \cdot \\ \mathbf{F} ^{m,1} &{} \cdot &{} \cdot &{} \cdot &{} \mathbf{F} ^{m,m} \end{bmatrix}\,, \end{aligned}$$
(2)

where the generic (column) block \(\mathbf{x} ^{m}\) of the vector \(\mathbf{x}\) in Eq.  (2) can be expressed as

$$\begin{aligned} \mathbf{x} ^m = \begin{bmatrix} x^m_1 \\ \cdot \\ \cdot \\ \cdot \\ x^m_n \end{bmatrix}\,. \end{aligned}$$
(3)

In this extended framework, one can analogously write \(\mathbf{x} = \mathbf{Z} \mathbf{i} _{nm} +\mathbf{F} \mathbf{i} _{lm}\), where \(\mathbf{i} _{nm} \in \mathbb {R}^{nm \times 1}\) and \(\mathbf{i} _{lm} \in \mathbb {R}^{lm \times 1}\) are column vectors made of all ones and \(\mathbf{x}\), \(\mathbf{Z}\) and \(\mathbf{F}\) are those in Eq. (2).

In the above, the generic block \(\mathbf{Z} ^{h,k}\) of the matrix \(\mathbf{Z}\) in Eq.  (2) represents the I/O subtable where h is the country in input and k is the country in output. Such a block can be expressed as

$$\begin{aligned} \mathbf{Z} ^{h,k} = \begin{bmatrix} z^{h,k}_{1,1} &{} \cdot &{} \cdot &{} \cdot &{} z^{h,k}_{1,n} \\ \cdot &{} &{} &{} &{} \cdot \\ \cdot &{} &{} &{} &{} \cdot \\ \cdot &{} &{} &{} &{} \cdot \\ z^{h,k}_{n,1} &{} \cdot &{} \cdot &{} \cdot &{} z^{h,k}_{n,n} \end{bmatrix}\,, \end{aligned}$$
(4)

whereas the generic block \(\mathbf{F} ^{h,k}\) of the matrix \(\mathbf{F}\) in Eq.  (2) can be expressed as

$$\begin{aligned} \mathbf{F} ^{h,k} = \begin{bmatrix} f^{h,k}_{1,1} &{} \cdot &{} \cdot &{} \cdot &{} f^{h,k}_{1,l} \\ \cdot &{} &{} &{} &{} \cdot \\ \cdot &{} &{} &{} &{} \cdot \\ \cdot &{} &{} &{} &{} \cdot \\ f^{h,k}_{n,1} &{} \cdot &{} \cdot &{} \cdot &{} f^{h,k}_{n,l} \end{bmatrix}\,. \end{aligned}$$
(5)

The “transition” matrix \(\mathbf{T} =[\mathbf{Z} | \mathbf{F} ]\) is obtained from Eq. (2). Here, \(\mathbf{T} \in \mathbb {R}^{mn \times m(n+l)}\) is a matrix whose row and column indices refer to an ordered “country, intermediate/final output” pair.

Since hierarchical clustering will be applied, depending on either i) input or ii) output criteria, to specific I/O submatrices, it is worth specifying two types of submatrices of \(\mathbf{T}\) obtained by combining several blocks of the form \(\mathbf{T} ^{h,k}=[\mathbf{Z} ^{h,k} | \mathbf{F} ^{h,k}]\). As an example, for \(h \ne 1,m\), let

$$\begin{aligned} \mathbf{T} ^{h,.} = \begin{bmatrix} \mathbf{T} ^{h,1} | \cdots | \mathbf{T} ^{h,h-1} | \mathbf{T} ^{h,h+1} | \cdots | \mathbf{T} ^{h,m} \end{bmatrix} \end{aligned}$$
(6)

be the transition submatrix related to what all sectors of country h sell to all sectors/final users of all countries with the exception of country h; and similarly, for \(k \ne 1,m\), let

$$\begin{aligned} \mathbf{T} ^{.,k} = \begin{bmatrix} \mathbf{T} ^{1,k} \\ \cdots \\ \mathbf{T} ^{k-1,k} \\ \mathbf{T} ^{k+1,k} \\ \cdots \\ \mathbf{T} ^{m,k} \end{bmatrix} \end{aligned}$$
(7)

be the transition submatrix related to what all sectors/final users of country k buy from all sectors of all countries with the exception of country k. Similar definitions obviously hold for \(h,k \in \{1,m\}\). The submatrices \(\mathbf{T} ^{h,.}\) have n rows and \((m-1)(n+l)\) columns, whereas the submatrices \(\mathbf{T} ^{.,k}\) have \((m-1)n\) rows and \(n+l\) columns. Regarding the first kind of submatrix (criterion i of clustering), for example, the first block \(\mathbf{T} ^{h,1}\) expresses what all sectors/final users of country 1 buy from all sectors of country h (say, Italy). Regarding the second kind of submatrix (criterion ii of clustering), for example, the first block \(\mathbf{T} ^{1,k}\) expresses what all sectors of country 1 sell to all sectors/final users of country k (say, Italy).
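The construction of the two kinds of submatrices can be sketched in Python with NumPy as follows. This is only an illustration under stated assumptions: the function names are hypothetical, countries are indexed from 0 rather than from 1, and the columns of \(\mathbf{T}\) are assumed to be ordered as all columns of \(\mathbf{Z}\) followed by all columns of \(\mathbf{F}\).

```python
import numpy as np


def block(T, h, k, n, l):
    """Return T^{h,k} = [Z^{h,k} | F^{h,k}] from T = [Z | F] (0-based country indices)."""
    m = T.shape[0] // n                      # number of countries
    rows = slice(h * n, (h + 1) * n)         # sectors of the selling country h
    Z_hk = T[rows, k * n:(k + 1) * n]        # intermediate uses in country k
    F_hk = T[rows, m * n + k * l:m * n + (k + 1) * l]  # final uses in country k
    return np.hstack([Z_hk, F_hk])


def sales_submatrix(T, h, n, l):
    """T^{h,.}: blocks T^{h,k} for all k != h, placed side by side."""
    m = T.shape[0] // n
    return np.hstack([block(T, h, k, n, l) for k in range(m) if k != h])


def purchases_submatrix(T, k, n, l):
    """T^{.,k}: blocks T^{h,k} for all h != k, stacked vertically."""
    m = T.shape[0] // n
    return np.vstack([block(T, h, k, n, l) for h in range(m) if h != k])


# Synthetic table with m = 4 countries, n = 3 sectors, l = 2 final outputs.
m, n, l = 4, 3, 2
T = np.arange(m * n * m * (n + l), dtype=float).reshape(m * n, m * (n + l))
T_h = sales_submatrix(T, 1, n, l)      # shape (n, (m-1)(n+l))
T_k = purchases_submatrix(T, 1, n, l)  # shape ((m-1)n, n+l)
print(T_h.shape, T_k.shape)
```

The resulting shapes agree with the row and column counts stated above.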

2.2 Clustering and matrix completion

It is recalled here that clustering is an unsupervised learning technique whose goal consists in partitioning a data set into several subsets (called clusters), by aiming to make more similar data points belong to the same cluster, while trying to assign less similar data points to distinct clusters (Aggarwal and Reddy 2014).Footnote 2 Among several clustering methods, we choose a hierarchical method because, in I/O tables, the set of countries is not a priori partitioned into a certain number of groups and because their complex structure suggests a hierarchy of clusters. By using a set of pair-wise dissimilarities for m objects (i.e., countries), hierarchical clustering first assigns each object to its own cluster, then it proceeds iteratively by joining at each stage the two most similar clusters, continuing until there is just a single cluster. At each stage, distances between clusters are recomputed by the Lance–Williams formula (Lance and Williams 1967), based either on the complete or on the Ward linkage criterion (Murtagh and Legendre 2014). With hierarchical clustering, unlike (e.g.) the k-means clustering method, one can achieve different partitions of objects depending on the level of resolution one is looking at. Moreover, although k-means clustering is less computationally expensive than hierarchical clustering, it requires strict assumptions regarding the homoscedasticity and the spherical variance of the variables, and it requires that each cluster has, a priori, roughly an equal number of objects. Hierarchical clustering performs well even when those assumptions are not satisfied.
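The agglomerative procedure just described can be illustrated by the following naive Python sketch for the complete linkage criterion. It is not the implementation used in this work: instead of the Lance–Williams update, it recomputes cluster distances directly from the pair-wise dissimilarities at every stage, which gives the same merges for complete linkage but is slower. The function name and the toy dissimilarity matrix are assumptions.

```python
def complete_linkage(D, num_clusters):
    """Naive agglomerative clustering on a symmetric dissimilarity matrix D.

    Starts from singleton clusters and repeatedly merges the two closest
    clusters until num_clusters clusters remain.
    """
    clusters = [[i] for i in range(len(D))]
    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: the distance between two clusters is the
                # largest dissimilarity between any pair of their members.
                d = max(D[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Four objects forming two tight pairs (illustrative dissimilarities).
D = [[0.0, 0.1, 0.9, 0.9],
     [0.1, 0.0, 0.9, 0.9],
     [0.9, 0.9, 0.0, 0.2],
     [0.9, 0.9, 0.2, 0.0]]
print(complete_linkage(D, 2))
```

Stopping at different values of `num_clusters` corresponds to cutting the hierarchy at different levels of resolution.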

To measure how (dis)similar any two objects are, several (dis)similarity measures have been developed in the literature. Among them, the most commonly used are the \(l_1\) norm of the difference of data points (also called Manhattan distance), and the \(l_2\) norm of their difference (also called Euclidean distance). In this work, we use the Average Absolute Correlation Distance (AACD) as a dissimilarity measure for clustering. Specifically, the absolute value of Pearson’s correlation coefficient between the j-th corresponding non-constant columns \(\mathbf{b }^{c_1}_{j}\) and \(\mathbf{b }^{c_2}_{j}\) of blocks (Footnote 3) associated with two different countries \(c_1\) and \(c_2\) is evaluated, then it is averaged with respect to the columns, and subtracted from 1. In formulas, one has

$$\begin{aligned} {\mathrm{AACD}}_{c_1,c_2} = 1-\frac{\sum _{j=1}^{n+l}|{\mathrm{corr}}(\mathbf{b }^{c_1}_{j},\mathbf{b }^{c_2}_{j})|}{n+l}\,. \end{aligned}$$
(8)
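A minimal Python sketch of Eq. (8) is given below. As an assumption of this sketch, column pairs in which at least one column is constant (so that the correlation is undefined) are skipped and the average is taken over the remaining pairs; when all columns are non-constant this coincides with Eq. (8). The function names are hypothetical.

```python
import math


def pearson(u, v):
    """Pearson correlation of two equal-length sequences; None if either is constant."""
    size = len(u)
    mean_u, mean_v = sum(u) / size, sum(v) / size
    ss_u = sum((a - mean_u) ** 2 for a in u)
    ss_v = sum((b - mean_v) ** 2 for b in v)
    if ss_u == 0.0 or ss_v == 0.0:
        return None
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    return cov / math.sqrt(ss_u * ss_v)


def aacd(block_1, block_2):
    """Average Absolute Correlation Distance between two blocks of equal shape.

    Blocks are given as lists of rows; corresponding columns are compared.
    """
    columns_1 = list(zip(*block_1))
    columns_2 = list(zip(*block_2))
    corrs = [pearson(u, v) for u, v in zip(columns_1, columns_2)]
    abs_corrs = [abs(c) for c in corrs if c is not None]
    if not abs_corrs:
        raise ValueError("no pair of non-constant columns to compare")
    return 1.0 - sum(abs_corrs) / len(abs_corrs)


# First columns are perfectly correlated, second columns are uncorrelated.
b1 = [[1.0, 1.0], [2.0, -2.0], [3.0, 1.0]]
b2 = [[2.0, 1.0], [4.0, 2.0], [6.0, 3.0]]
print(aacd(b1, b2))
```

Because only correlations enter the formula, rescaling a block by a positive constant leaves the AACD unchanged.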

Our choice of the dissimilarity measure is motivated by the fact that AACD is highly related to the specific formulation of the regularization term in the optimization problem modeling MC, which is reported later in this subsection. Indeed, in a sense, it quantifies the average linear dependence of corresponding columns of blocks associated with different countries.Footnote 4 The adoption of this dissimilarity measure also (1) provides an additional motivation for the application of hierarchical clustering instead of a different clustering technique, since such a distance does not satisfy the triangle inequality, which is not required by hierarchical clustering but is required, e.g., by k-means clustering; (2) operates as a data pre-processing step because, differently from the \(l_1\) and \(l_2\) norms, it is not affected by each country’s average dimension. It is worth remarking that, even if countries in I/O tables present a spatial dimension, we are not going to use any kind of spatial clustering technique, due to the following reasons: (1) in this work, the clustering step aims at selecting a subset of similar countries, in terms of a suitable dissimilarity measure (AACD) derived from the comparison of different portions of I/O subtables, in order to make the successive application of MC to I/O subtables easier. In other words, we are not interested specifically in finding the presence of a spatial pattern; (2) volumes of I/O exchanges across countries are determined, possibly to different extents, both by network relationships and by spatial contiguity. Both of them are taken indirectly into account by using data coming from I/O subtables to express the dissimilarity measure.

In this work, we apply hierarchical clustering to all countries in the I/O matrix with the exception of a specific country h, in terms of what they use from country h. Similarly, we also apply hierarchical clustering to all countries in the I/O matrix with the exception of a specific country k, in terms of what they sell to country k. In other words, we consider, for various years, the submatrices \(\mathbf{T} ^{h,.}\) and \(\mathbf{T} ^{.,k}\) defined in Sect. 2.1 to compare any two countries \(c_1\) and \(c_2\) (different, respectively, from h and k) in terms of what they use from (or what they sell to) the specific country h or k. For illustrative purposes, in the simulations reported in the application, the choice \(h=k\) is made, and Italy is selected as such a specific country.

Clustering is often used as a preliminary data pre-processing step to a successive supervised learning task. In the present context, clustering is used as a pre-processing step for MC applied to a suitable submatrix of an I/O table.Footnote 5 The idea is that MC is expected to perform better if the submatrix refers to countries belonging to the same cluster. This expectation is based on one of the reasons provided in the literature as a motivation for the effectiveness of MC (Hastie et al. 2015), which is summarized as follows. Given a subset of observed entries of a matrix \(\mathbf{M} \in \mathbb {R}^{m \times n}\), MC works by finding a suitable low-rank approximation (say, with rank r) of \(\mathbf{M}\), by assuming the following model:

$$\begin{aligned} \mathbf{M}= \mathbf{C} \mathbf{G}^T + \mathbf{E}\,, \end{aligned}$$
(9)

where \(\mathbf{C} \in \mathbb {R}^{m \times r}\), \(\mathbf{G} \in \mathbb {R}^{n \times r}\), whereas \(\mathbf{E} \in \mathbb {R}^{m \times n}\) is a matrix of modeling errors. The rank-r approximating matrix \(\mathbf{C} \mathbf{G}^T\) is found by solving a suitable optimization problem (see, e.g., Eq. (11) reported later). Equation (9) can be written element-wise as

$$\begin{aligned} M_{i,j}=\sum _{l=1}^r C_{i,l} G_{j,l} + E_{i,j}\,. \end{aligned}$$
(10)

Often, \(C_{i,l}\) is interpreted as the degree of membership of row i of matrix \(\mathbf{M}\) to some “latent” cluster l (for a total of r such clusters), and \(G_{j,l}\) as the prediction of an element in column j of matrix \(\mathbf{M}\), conditioned on its row i belonging to the l-th cluster.Footnote 6 In our application, \(\mathbf{M}\) is composed of several cross-country blocks (coming from I/O tables in different years), whereas i and j refer, respectively, to an input sector of a country and an output sector/final user of another country. Moreover, l may be interpreted as a specific “latent” cluster, possibly discovered by the MC algorithm.

It is worth observing that, in order for MC to work properly in the case of an I/O table (possibly partially observed for a set of consecutive years), it can be useful to apply it to a suitable submatrix determined by a pre-processing step of clustering. Intuitively, missing blocks of an I/O table that, thanks to historical data in the past years, are expected to be similar to the other observed blocks in the current year, could be reconstructed more effectively than missing blocks in the current year that, again based on historical data in the past years, are expected to be less similar to the other observed blocks in the same year. Another reason to focus the analysis on a submatrix of an I/O table is that solving the MC problem becomes computationally more expensive as the size of the matrix \(\mathbf{M}\) increases.

In this work, we consider the following formulation for the MC optimization problem, which was investigated theoretically by Mazumder et al. (2010):

$$\begin{aligned} \underset{{\hat{\mathbf{M}}} \in \mathbb {R}^{m \times n}}{\mathrm{minimize}} \left( \frac{1}{2} \sum _{(i,j) \in \Omega ^{\mathrm{tr}}} \left( M_{i,j}-{\hat{M}}_{i,j} \right) ^2 + \lambda \Vert {\hat{\mathbf{M}}}\Vert _*\right) \,, \end{aligned}$$
(11)

where \(\Omega ^{\mathrm{tr}}\) (which, using a machine-learning expression, may be called training set) is a subset of pairs of indices \((i,j)\) corresponding to positions of known entries of \(\mathbf{M}\), \({\hat{\mathbf{M}}}\) is the completed matrix (to be optimized), \(\lambda \ge 0\) is a regularization constant, and \(\Vert {\hat{\mathbf{M}}}\Vert _*\) is the nuclear norm of the matrix \({\hat{\mathbf{M}}}\), i.e., the sum of all its singular values. The regularization constant \(\lambda\) controls the trade-off between fitting the known entries of the matrix \(\mathbf{M}\) and achieving a small nuclear norm. The latter requirement is often related to getting a small rank of the obtained optimal solution (Footnote 7) \({\hat{\mathbf{M}}}_\lambda ^\circ\) of the optimization problem (11), which follows by geometric arguments similar to the ones typically adopted to justify how the classical LASSO (Least Absolute Shrinkage and Selection Operator) penalty term achieves effective feature selection in linear regression (Tibshirani 1996).

The optimization problem (11) can also be written as

$$\begin{aligned} \underset{{\hat{\mathbf{M}}} \in \mathbb {R}^{m \times n}}{\mathrm{minimize}} \left( \frac{1}{2} \Vert \mathbf{P}_{\Omega ^{\mathrm{tr}}}(\mathbf{M})-\mathbf{P}_{\Omega ^{\mathrm{tr}}}({\hat{\mathbf{M}}})\Vert _F^2 + \lambda \Vert {\hat{\mathbf{M}}}\Vert _*\right) \,, \end{aligned}$$
(12)

where, for a matrix \(\mathbf{Y} \in \mathbb {R}^{m \times n}\),

$$\begin{aligned} (\mathbf{P}_{\Omega ^{\mathrm{tr}}}(\mathbf{Y}))_{i,j}:= {\left\{ \begin{array}{ll} Y_{i,j} &{} \mathrm{if\,\,} (i,j) \in \Omega ^{\mathrm{tr}}\,, \\ 0 &{} \mathrm{if\,\,} (i,j) \notin \Omega ^{\mathrm{tr}} \end{array}\right. } \end{aligned}$$
(13)

represents the projection of \(\mathbf{Y}\) onto the set of positions of observed entries of the matrix \(\mathbf{M}\), and \(\Vert \mathbf{Y}\Vert _F\) denotes the Frobenius norm of \(\mathbf{Y}\) (i.e., the square root of the summation of squares of all its entries).

It was shown by Mazumder et al. (2010) that the optimization problem (12) can be solved by applying Algorithm 1, named Soft Impute therein.Footnote 8 This is a state-of-the-art algorithm in the MC field.

In Algorithm 1, for a matrix \(\mathbf{Y} \in \mathbb {R}^{m \times n}\), \(\mathbf{P}_{\Omega ^{\mathrm{tr}}}^{\perp }(\mathbf{Y})\) represents the projection of \(\mathbf{Y}\) onto the complement of \(\Omega ^{\mathrm{tr}}\), whereas

$$\begin{aligned} \mathbf{S}_\lambda (\mathbf{Y}):= \mathbf{U} {\varvec{\Sigma }}_\lambda \mathbf{V}^T\,, \end{aligned}$$
(14)

where

$$\begin{aligned} \mathbf{Y}=\mathbf{U} {\varvec{\Sigma }} \mathbf{V}^T \end{aligned}$$
(15)

(with \({\varvec{\Sigma }}=\mathrm{diag} [\sigma _1,\ldots ,\sigma _r]\)) is the singular value decomposition of \(\mathbf{Y}\), and

$$\begin{aligned} {\varvec{\Sigma }}_\lambda :=\mathrm{diag} [(\sigma _1-\lambda )_+,\ldots ,(\sigma _r-\lambda )_+]\,, \end{aligned}$$
(16)

with \(t_+:=\max (t,0)\).
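The operator \(\mathbf{S}_\lambda (\cdot )\) of Eqs. (14)–(16) can be sketched in Python with NumPy as follows. This is a plain full-SVD illustration, not an optimized implementation; the function name is an assumption.

```python
import numpy as np


def soft_threshold_svd(Y, lam):
    """Return S_lambda(Y) = U diag((sigma_i - lambda)_+) V^T."""
    U, sigma, Vt = np.linalg.svd(Y, full_matrices=False)
    sigma_shrunk = np.maximum(sigma - lam, 0.0)   # (sigma_i - lambda)_+
    # Scaling the columns of U by the shrunk singular values equals U @ diag(...).
    return (U * sigma_shrunk) @ Vt


# Singular values 3 and 1 are shrunk by lambda = 2 to 1 and 0, respectively.
Y = np.diag([3.0, 1.0])
S = soft_threshold_svd(Y, 2.0)
print(S)
```

Singular values not larger than \(\lambda\) are set to zero, so the output has rank at most equal to the number of singular values of \(\mathbf{Y}\) exceeding \(\lambda\).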

It is worth mentioning that Li and Zhou (2017) proposed a particularly efficient implementation of the operator \(\mathbf{S}_{\lambda }(\cdot )\) defined in Eq. (14) (by means of the MATLAB function svt.m reported therein), which is based on the determination of only the singular values \(\sigma _i\) of \(\mathbf{Y}\) that are larger than \(\lambda\), and of their corresponding left-singular vectors \(\mathbf{u}_i\) and right-singular vectors \(\mathbf{v}_i\). Indeed, all the other singular values of \(\mathbf{Y}\) are annihilated in \({\varvec{\Sigma }}_\lambda\) (see Eq. 16).

[Figure a: Algorithm 1 (Soft Impute)]
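For illustration, the Soft Impute iteration of Mazumder et al. (2010) is commonly stated as \({\hat{\mathbf{M}}} \leftarrow \mathbf{S}_\lambda (\mathbf{P}_{\Omega ^{\mathrm{tr}}}(\mathbf{M})+\mathbf{P}_{\Omega ^{\mathrm{tr}}}^{\perp }({\hat{\mathbf{M}}}))\), repeated until the relative change of the iterate is small. The Python sketch below follows this scheme under several assumptions of ours: a zero initialization, a relative squared Frobenius stopping rule with a guard against division by zero, and a full SVD at each iteration. It is not the MATLAB code used in this work, and the function name and toy matrix are hypothetical.

```python
import numpy as np


def soft_impute(M, observed, lam, tol=1e-9, max_iter=500):
    """Sketch of Soft Impute for problem (12).

    M        : matrix whose entries are used only where observed is True
    observed : boolean mask of the training positions
    lam      : regularization constant lambda >= 0
    """
    M_hat = np.zeros(M.shape, dtype=float)
    for _ in range(max_iter):
        # Observed entries come from M, the others from the current estimate.
        filled = np.where(observed, M, M_hat)
        U, sigma, Vt = np.linalg.svd(filled, full_matrices=False)
        M_new = (U * np.maximum(sigma - lam, 0.0)) @ Vt
        change = np.linalg.norm(M_new - M_hat, "fro") ** 2
        scale = max(np.linalg.norm(M_hat, "fro") ** 2, np.finfo(float).eps)
        M_hat = M_new
        if change / scale < tol:
            break
    return M_hat


# Rank-one toy matrix with one entry hidden (marked by NaN).
M = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [3.0, 6.0, np.nan]])
observed = ~np.isnan(M)
M_hat = soft_impute(M, observed, lam=0.1)
print(M_hat)
```

The value imputed in the hidden position depends on \(\lambda\); in general it need not coincide with the rank-one completion.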

For this work, we combine the original MATLAB implementation of Soft Impute provided by Mazumder et al. (2010) with the MATLAB function svt.m developed by Li and Zhou (2017). Moreover, to avoid overfitting, we select the regularization constant \(\lambda\) via the following hold-out validation method. First, the set of positions of unobserved entries of the matrix \(\mathbf{M}\) is divided randomly into a validation set \(\Omega ^{\mathrm{val}}\) (about \(25\%\) of the positions of the unobserved entries) and a test set \(\Omega ^{\mathrm{test}}\) (the positions of the remaining entries). In the present context of application of MC to I/O subtables, the union of the validation and test sets corresponds to a block which is artificially obscured (but which is still available as a ground truth), whereas the training set corresponds to the positions of all the remaining entries of the submatrix considered. It is worth observing that, by the construction above, there is no overlap among the training, validation and test sets. Then, the optimization problem (12) is solved for several choices \(\lambda _k\) for \(\lambda\), exponentially spaced as \(\lambda _k=2^{k/2-10}\), for \(k=1,\ldots ,40\). For each \(\lambda _k\), the Root Mean Square Error (RMSE) of matrix reconstruction on the validation set is computed as

$$\begin{aligned} {\mathrm{RMSE}}_{\lambda _k}^{\mathrm{val}}:=\sqrt{\frac{1}{|\Omega ^{\mathrm{val}}|}\sum _{(i,j) \in \Omega ^{\mathrm{val}}} \left( M_{i,j}-\hat{M}_{\lambda _k,i,j} \right) ^2}\,, \end{aligned}$$
(17)

then the choice \(\lambda _k^\circ\) that minimizes \({\mathrm{RMSE}}_{\lambda _k}^{\mathrm{val}}\) for \(k=1,\ldots ,40\) is found.Footnote 9 Finally, the RMSE of matrix reconstruction on the test set is computed in correspondence of the optimal value \(\lambda _k^\circ\) as

$$\begin{aligned} {\mathrm{RMSE}}_{\lambda _k^\circ }^{\mathrm{test}}:=\sqrt{\frac{1}{|\Omega ^{\mathrm{test}}|}\sum _{(i,j) \in \Omega ^{\mathrm{test}}} \left( M_{i,j}-\hat{M}_{\lambda _k^\circ ,i,j} \right) ^2}\,. \end{aligned}$$
(18)
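The set-up of the hold-out procedure (grid of values \(\lambda _k\) and random split of the obscured positions) can be sketched in Python as follows. The function name, the fixed random seed and the toy block size are assumptions made for reproducibility of the example.

```python
import random

# Grid lambda_k = 2^(k/2 - 10) for k = 1, ..., 40.
lambdas = [2.0 ** (k / 2 - 10) for k in range(1, 41)]


def split_hidden_positions(hidden, val_fraction=0.25, seed=0):
    """Randomly split the obscured positions into validation and test sets."""
    rng = random.Random(seed)
    shuffled = list(hidden)
    rng.shuffle(shuffled)
    n_val = round(val_fraction * len(shuffled))
    return shuffled[:n_val], shuffled[n_val:]


# Positions (i, j) of a hypothetical obscured block with 4 rows and 5 columns.
hidden = [(i, j) for i in range(4) for j in range(5)]
val_set, test_set = split_hidden_positions(hidden)
print(len(lambdas), len(val_set), len(test_set))
```

The grid spans from \(2^{-9.5}\) to \(2^{10}\), and the two returned sets are disjoint by construction.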

A similar expression holds for the RMSE of matrix reconstruction on the training set, in correspondence of the optimal value \(\lambda _k^\circ\):

$$\begin{aligned} {\mathrm{RMSE}}_{\lambda _k^\circ }^{\mathrm{tr}}:=\sqrt{\frac{1}{|\Omega ^{\mathrm{tr}}|}\sum _{(i,j) \in \Omega ^{\mathrm{tr}}} \left( M_{i,j}-\hat{M}_{\lambda _k^\circ ,i,j} \right) ^2}\,. \end{aligned}$$
(19)
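
The hold-out procedure described above can be sketched in a few lines of Python. This is only an illustration, not the MATLAB code used in this work: the function names (`lambda_grid`, `split_obscured`, `rmse`, `select_lambda`), the fixed random seed and the representation of matrices as dictionaries from positions \((i,j)\) to values are all assumptions introduced here, and the completed matrices are assumed to be supplied by whatever MC routine is in use.

```python
import math
import random


def lambda_grid(k_max=40):
    """Candidate values lambda_k = 2^(k/2 - 10) for k = 1, ..., k_max."""
    return [2.0 ** (k / 2 - 10) for k in range(1, k_max + 1)]


def split_obscured(positions, val_fraction=0.25, seed=0):
    """Randomly split the obscured positions into validation and test sets."""
    shuffled = list(positions)
    random.Random(seed).shuffle(shuffled)
    n_val = round(val_fraction * len(shuffled))
    return shuffled[:n_val], shuffled[n_val:]


def rmse(truth, estimate, positions):
    """Root mean square error over a set of (i, j) positions."""
    squared = [(truth[p] - estimate[p]) ** 2 for p in positions]
    return math.sqrt(sum(squared) / len(squared))


def select_lambda(truth, estimates, lambdas, val_positions):
    """Return (best lambda, its validation RMSE) among the candidates.

    estimates[k] is the completed matrix obtained with lambdas[k];
    the MC routine that produces it is not implemented here.
    """
    scores = [rmse(truth, est, val_positions) for est in estimates]
    best = min(range(len(lambdas)), key=scores.__getitem__)
    return lambdas[best], scores[best]
```

Under this grid, \(k=32\) gives \(\lambda _k=2^{6}=64\). Once the optimal value has been selected on the validation set, the same `rmse` helper can be evaluated on the test and training positions.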

In our application of MC to submatrices of I/O tables, as a pre-processing step, the very few missing/negative entries of such submatrices (when present) are replaced by zeros before running Algorithm 1. The tolerance is chosen as \({ tol}=10^{-9}\). Moreover, when convergence is not achieved, in order to reduce the computational time, the algorithm is stopped after \(N^{\mathrm{it}}=500\) iterations. An additional post-processing step is included, thresholding to 0 any negative element (when present) of the completed submatrices.Footnote 10 In the following, in order to avoid introducing new notation, the expression \({\hat{\mathbf{M}}}_{\lambda _k}\) is actually used to denote each post-processed MC output.
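
The pre-processing and post-processing steps described above amount to simple element-wise thresholding. A minimal Python sketch follows; the helper name `zero_out_invalid` and the encoding of missing entries as `None` or NaN are assumptions made here for illustration, and the obscured block to be reconstructed is assumed to be handled separately by the MC routine.

```python
import math


def zero_out_invalid(matrix):
    """Replace missing (None or NaN) and negative entries by 0."""
    cleaned = []
    for row in matrix:
        new_row = []
        for value in row:
            invalid = value is None or math.isnan(value) or value < 0
            new_row.append(0.0 if invalid else float(value))
        cleaned.append(new_row)
    return cleaned
```

Applied to a completed submatrix, which contains no missing entries, the same helper reduces to thresholding negative elements to 0.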

Finally, as a measure of the MC performance, we also use a second metric, which is known in the literature as Symmetric Mean Absolute Percentage Error (SMAPE).Footnote 11 Differently from the RMSE, it takes into account the relative error of reconstruction. Its definition for the validation set is as follows:

$$\begin{aligned} {\mathrm{SMAPE}}_{\lambda _k}^{\mathrm{val}}:=\frac{100}{|\Omega ^{\mathrm{val}}|}\sum _{(i,j) \in \Omega ^{\mathrm{val}}} \frac{\left| M_{i,j}-\hat{M}_{\lambda _k,i,j} \right| }{\left| M_{i,j}\right| +\left| \hat{M}_{\lambda _k,i,j} \right| } \end{aligned}$$
(20)

(the constant 100 is usedFootnote 12 to make the metric range from 0 to 100; when both the numerator and the denominator are equal to 0, the ratio is assumed to be equal to 0, too). Similar definitions hold for the test set and the training set. Again, the metric is first evaluated on the validation set for different choices of \(\lambda _k\), then it is computed on both the training and test sets in correspondence of the value of \(\lambda _k\) that minimizes the SMAPE on the validation set. Differently from the RMSE, this metric is not directly related to the optimization problem (11) solved by MC, nor to the choice of AACD as the dissimilarity measure used by hierarchical clustering in the present article. Hence, for this metric, differently from the RMSE, one does not necessarily expect an improvement in MC performance when moving from “dissimilar” to “similar” countries.
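
As an illustration of definition (20), the following Python sketch computes the SMAPE on a set of positions, including the convention that a 0/0 ratio counts as 0. The function name `smape` and the dictionary-based representation of the matrices are assumptions made here for illustration.

```python
def smape(truth, estimate, positions):
    """Symmetric mean absolute percentage error, ranging from 0 to 100."""
    total = 0.0
    for p in positions:
        numerator = abs(truth[p] - estimate[p])
        denominator = abs(truth[p]) + abs(estimate[p])
        # Convention from the text: when both are 0, the ratio is 0.
        if denominator > 0:
            total += numerator / denominator
    return 100.0 * total / len(positions)
```

Note that each ratio equals 1 whenever a nonzero entry is predicted as exactly 0, so a prediction made of all zeros attains the maximum value of 100 on any set of nonzero entries.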

3 Application of clustering and matrix completion

In this section, we present an application of the proposed methodological approach to WIOD data. Before presenting the application to real data, some aspects need to be considered. In order to generalize our method to any kind of I/O table (whether industry-by-industry or product-by-product), in Sect. 3.2 we show how the structure of WIOD tables (described in Sect. 3.1) is similar to that of alternative I/O tables. Then, in Sect. 3.3 some simulation results are reported, based on synthetic I/O matrices generated from the raw WIOD tables, in order to discuss the benefits resulting from applying MC to properly pre-processed data and to determine the optimal number of clusters for the choice of similar and dissimilar countries. Later, in Sects. 3.4 and 3.5 we provide full details on how we apply the method to real data in practice (specifically, Sect. 3.4 reports a simple example of MC application without the clustering step, whereas Sect. 3.5 includes the clustering step). Additional analyses are performed in the remaining subsections. In more detail, Sect. 3.6 repeats the analysis of Sect. 3.5 for the case of a synthetic dataset, whereas Sect. 3.7 shows how the results obtained in Sect. 3.5 depend on the choices of the validation and test sets. Finally, Sect. 3.8 examines the MC performance when MC is applied to an I/O subtable containing both intra-country and inter-country blocks.

It is important to clarify why we use WIOD matrices in our application. According to Timmer et al. (2015), WIOD represents a real improvement over other existing databases (such as EORA, EXIObase and OECD tables) for several reasons: (i) its data are extrapolated by certified national statistical institutions; (ii) data for the rest of the world are determined using data from the United Nations (UN), the International Monetary Fund (IMF) and other international institutions; (iii) all versions of WIOD are available for free from the website www.wiod.org. Compared with other I/O datasets, we choose WIOD tables for two additional reasons: first, because they cover a fairly long period; second, because their size is fairly representative of the sizes of the other tables (i.e., both the number of countries considered and the sector disaggregation are neither too small nor too large with respect to the other tables). The latter issue is also important for computational reasons, because the application of MC is typically slow for large matrices.

3.1 Data

The WIOD database was constructed and developed within the Seventh Framework Programme funded by the European Commission in 2009, and is licensed under a Creative Commons Attribution 4.0 International license. From a technical point of view, WIOD tables are built from public databases provided by different national and international statistical offices. Currently, there exist two releases of WIOD: the 2013 and the 2016 release. The latest release covers the period between 2000 and 2014 and 43 of the most relevant countries in the world: EU-28 (including the UK), Australia, Brazil, Canada, Switzerland, China, Indonesia, India, Japan, South Korea, Mexico, Norway, Russia, Turkey, Taiwan, USA.Footnote 13 The yearly tables are split into 56 different macro-industries, classified according to the International Standard Industrial Classification Revision 4 (ISIC Rev. 4), and their pair-wise combinations.Footnote 14 Moreover, 5 final aggregated outputs (still classified according to ISIC Rev. 4) are reported in the tables. Finally, an estimation for the remaining non-covered part of the world economy (called “Rest Of the World,” ROW) is reported (details are provided by Timmer et al. 2015, 2016). Using WIOD thus supports detailed input–output analyses (some very recent applications being provided, e.g., by Bhattacharya et al. 2020; Chen et al. 2019; Wang et al. 2020; Xu and Liang 2019).

In the following subsection, we compare WIOD tables with OECD and FIGARO ones. The latest release of OECD inter-country input–output (ICIO) tables dates back to 2018. ICIO tables report yearly data from 2005 to 2015 among 64 countries (including ROW) and 36 industries (products). FIGARO tables, also known as EU inter-country Supply, Use and input–output tables (EU IC-SUIOTs), are available on a yearly basis from 2010 to 2019 and display exchanges among EU economies, the UK and the USA in 64 industries (products).

3.2 Characterization of the data

Here, we provide a descriptive analysis of WIOD tables in comparison with OECD and FIGARO ones (whose data are available for the same time span as WIOD tables) in terms of their within-country and cross-country value distributions, level of sparsity (i.e., percentage of zeros in each subtable) and separation between values of transactions between so-called large-to-large and small-to-small country pairs. These analyses are made on industry-by-industry tables, as the product-by-product ones differ from the former only slightly (Pearson’s correlation coefficient equal to 0.9958 for FIGARO, year: 2010), and are reported for the years from 2010 to 2014, which is the time span considered in our panel analysis. After removing ROW rows and columns, and also rows related to taxes and value added, WIOD tables have 2408 rows (the product of 56 intermediate industries and 43 countries) and 2623 columns (56 intermediate industries plus 5 final outputs, multiplied by 43 countries). The 56 macro-industries (which result from the aggregation of various micro-industries) are reported in the following order: primary industry appears in the first 4 positions, followed by secondary industry (18 positions), and finally by tertiary industry (34 positions). Regarding the sparsity of WIOD I/O matrices, Table 1 shows that, consistently over the years, the percentage of zeros is between 17% and 18%. Moreover, this percentage is in line with that of OECD tables, while FIGARO tables contain slightly more zeros.
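
The table dimensions and the sparsity measure mentioned above can be checked with a short Python sketch. The constant names and the helper `percent_zeros` are assumptions introduced here for illustration; the toy matrix in the usage note is not WIOD data.

```python
N_COUNTRIES = 43    # countries left after removing ROW
N_INDUSTRIES = 56   # intermediate macro-industries per country
N_FINAL_USES = 5    # final aggregated outputs per country

# Rows: one per (country, industry) pair.
n_rows = N_INDUSTRIES * N_COUNTRIES
# Columns: one per (country, industry or final output) pair.
n_cols = (N_INDUSTRIES + N_FINAL_USES) * N_COUNTRIES


def percent_zeros(matrix):
    """Percentage of entries of a matrix (list of rows) equal to zero."""
    entries = [value for row in matrix for value in row]
    return 100.0 * sum(1 for value in entries if value == 0) / len(entries)
```

For instance, `percent_zeros([[0, 1], [2, 0]])` evaluates to 50.0, since two of the four entries are zero.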

Table 1 Percentage of zero values in the I/O tables, by year

Figure 1 shows, as an example, a colored visualization of the elements of the 2013 I/O tables, where each colored rectangle corresponds to the exchange between country–industry pairs. In its subfigures, final consumption is reported on the right extremes. The figure sheds light on the fact that, consistently over the three compared I/O tables, the largest values (depicted in red) are concentrated in the domestic blocks (main diagonal blocks). Indeed, industries usually tend to consume (with respect to trade) products coming from their home country, for reasons such as higher proximity and safety (i.e., less uncertainty in terms of price, and more regularity in terms of supplies). Moreover, flows from a specific “country–industry” pair to the same “country–industry” pair (main diagonal of the tables) are generally much larger than the other flows (this holds especially for the case of secondary macro-sectors, as one can see from the right chart of Fig. 1a representing exchanges within and between Italy and Spain for the case of WIOD). This issue is partially motivated by outsourcing, but especially by the fact that in some industries (e.g., manufacturing industries) there are several concatenated products in the production line, which in the case of WIOD tables are aggregated in the same macro-sector. Moreover, it can be noticed from the figure that secondary industrial products (e.g., ore, iron, oil, metals, technical equipment, products from manufacturing industry in general) are more open to international trade (see the corresponding parts of the main diagonals of the off-diagonal blocks, also called international trade blocks), whereas services are less traded internationally. Finally, by looking at any one of the I/O tables in Fig. 1 and by comparing either single country blocks by row (supplying countries) or single country blocks by column (receiving countries), it is possible to notice how some countries are similar to each other in terms of their dependence on other specific countries.

Overall, real-world I/O tables provided by different institutions share very similar characteristics. This motivates us to work on just one data source (WIOD tables, in our specific selection) and to generalize the results obtained with it to different I/O tables.

Fig. 1

Colored visualization of the elements of the complete I/O tables (year: 2013). For a better visualization, a logarithmic scale is used in the subfigures

3.3 Clustering step and simulations

As discussed in Sect. 3.2, the main diagonal blocks in the I/O tables (also called domestic blocks, since each of them refers to trade inside the same country), and especially their entries which refer to exchanges within the same sector and the same country, are characterized by much larger values than the other blocks. This may hamper an effective application of MC, as the quite different orders of magnitude could make it difficult for MC to generalize well on both kinds of blocks.Footnote 15 Computational reasons (i.e., the need to perform a singular value decomposition step at each iteration of the Soft Impute algorithm) suggest applying MC to a submatrix associated with a small subset of countries, as this reduces the size of that submatrix. Moreover, as discussed in Sect. 2.2, we argue that MC is more effective when it is applied to an I/O subtable made of “similar” blocks (with respect to the case of “dissimilar” blocks). For this reason, we apply MC to submatrices of WIOD tables obtained by systematically excluding the main diagonal blocks,Footnote 16 where the selection of the countries associated with the submatrices is made by means of hierarchical cluster analysis. In this context, the choice of the number of clusters is crucial and so, in order to validate hierarchical clustering, we perform some simulation exercises, based on synthetic data.Footnote 17 We choose to generate our synthetic I/O matrices by adding a matrix \(\epsilon\) of normally distributed random terms to the subset of interest of the WIOD dataset, where each element of the matrix is distributed as \(\epsilon _{i,j} \sim \mathcal {N}(0,{ \sigma }_{\epsilon }), \forall i, \forall j\), and \({ \sigma }_{\epsilon }\) is drawn from a Gamma(\(\alpha\), \(\beta\)) distribution, so that different generated synthetic matrices display different levels of variability. Specifically, we choose \(\alpha =1\) and \(\beta = 1\).
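
The generation of one synthetic matrix can be sketched as follows in Python. This is an illustration under stated assumptions, not the code used for the simulations: the function name `synthetic_matrix` is hypothetical, \({ \sigma }_{\epsilon }\) is interpreted as a standard deviation drawn once per synthetic matrix, and \(\beta\) is treated as a scale parameter (for \(\beta =1\), the scale and rate parameterizations of the Gamma distribution coincide).

```python
import random


def synthetic_matrix(block, alpha=1.0, beta=1.0, rng=None):
    """Return (block + epsilon, sigma), with epsilon_ij ~ N(0, sigma) i.i.d.

    sigma is drawn once per synthetic matrix from Gamma(alpha, beta).
    Assumptions: sigma is a standard deviation and beta is a scale
    parameter, as in random.gammavariate.
    """
    rng = rng or random.Random()
    sigma = rng.gammavariate(alpha, beta)
    noisy = [[value + rng.gauss(0.0, sigma) for value in row] for row in block]
    return noisy, sigma
```

Calling the function repeatedly, e.g. 1000 times on the same block, yields perturbed copies whose noise level differs from one copy to the next, because a new value of sigma is drawn at each call.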

To find the correct number of groups in terms of similarity either with respect to input from Italy or with respect to output to Italy, we simulate:Footnote 18 (1) N = 1000 synthetic I/O matrices of dimension \(56 \times 2562\), where 56 is the number of industries in Italy, and 2562 is the product of the 61 industries (final sectors included) and the 42 countries other than Italy, (2) N = 1000 synthetic I/O matrices of dimension \(2352 \times 61\), where 2352 is the product of the 56 industries and the 42 countries other than Italy, and 61 is the number of industries in Italy (final sectors included). To select the number of clustersFootnote 19 we consider the ratio \(\frac{\mathrm{WSS}}{\mathrm{TSS}}\), where WSS is the “Within-cluster sum of squares” and TSS is the “Total sum of squares.” In more detail, the optimal number of clusters is the minimum such number for which \(\frac{\mathrm{WSS}}{\mathrm{TSS}} < K\), where K is a cutoff that we set to 0.5.
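
The selection rule based on the ratio \(\frac{\mathrm{WSS}}{\mathrm{TSS}}\) can be illustrated with the following Python sketch. The function names are hypothetical, and the sums of squares are computed here with respect to Euclidean centroids, which is an assumption of this sketch; the cluster labels are taken as given, for instance from cutting a dendrogram at a given number of groups.

```python
def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def sum_of_squares(points, center):
    """Sum of squared Euclidean distances of the points from a center."""
    return sum((x - c) ** 2 for p in points for x, c in zip(p, center))


def wss_over_tss(points, labels):
    """Within-cluster sum of squares divided by total sum of squares."""
    tss = sum_of_squares(points, centroid(points))
    wss = 0.0
    for label in set(labels):
        members = [p for p, l in zip(points, labels) if l == label]
        wss += sum_of_squares(members, centroid(members))
    return wss / tss


def smallest_k_below_cutoff(ratio_by_k, cutoff=0.5):
    """Smallest number of clusters whose WSS/TSS ratio is below the cutoff.

    ratio_by_k maps a number of clusters k to its WSS/TSS ratio;
    None is returned when no k satisfies the criterion.
    """
    admissible = [k for k, ratio in ratio_by_k.items() if ratio < cutoff]
    return min(admissible) if admissible else None
```

With a single cluster the ratio equals 1, and it decreases toward 0 as the clusters become more homogeneous, so the cutoff 0.5 selects the coarsest partition that explains more than half of the total sum of squares.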

It is worth observing that the volume of inputs (outputs) taken from (given to) a specific country is not evenly distributed across the other countries. Table 2 and Fig. 2 show, for the case of inputs from Italy in 2010, a strong dispersion both in terms of averages by country and in terms of within-country standard deviations. This evidence further motivated us to use AACD as a dissimilarity measure for hierarchical clustering. Table 3 reports the results of the simulation using (stacked) years from 2010 to 2013.

Table 2 Distribution of the average exchanges from Italy (in input), by country of output (first row) and of the within-country standard deviation (second row)
Fig. 2

Kernel density curves of the distribution of the average exchanges from Italy (in input), by country of output (averages by country) and of the within-country standard deviation (within-country standard deviation). All countries excluding Italy. WIOD data, year: 2010

Table 3 Simulation results for the choice of the optimal number of clusters

According to the results in Table 3, we obtain a fairly good clustering with about 21 or 22 groups.Footnote 20 Based on these results, in Sect. 3.5 we consider as similar those countries belonging to the same cluster in a configuration with 21 groups (when Italy is in input) or 22 groups (when Italy is in output).

Finally, MC has been applied to I/O subtables associated with suitable groups of “fictitious” similar and dissimilar countries obtained from synthetic matrices, and the performance has turned out to be similar to the one obtained in the case of real data, which is discussed extensively in Sect. 3.5. To ease the reading of the work and give more focus on the combined application of hierarchical clustering and MC to real-world data, the results obtained for the case of synthetic matrices are reported in Sect. 3.6, after presenting in detail the case of real-world matrices in the next two subsections.

3.4 Application to prediction of a subset of entries of an I/O table, based on historical data

In this subsection, we consider the application of MC to the prediction of a subset of current entries of an I/O table, based on historical data: more precisely, both past entries and other available current entries coming from other portions of such an I/O table are used to make predictions. In more detail, the following situation is considered. Starting from WIOD tables relative to some consecutive years, the information associated with a subset of countries is reported in a matrix, keeping only the off-main diagonal blocks, as done in the previous subsection. Then, for one ordered pair of countries, the information about the last year is obscured, and one tries to reconstruct it by MC. As an example, Table 4 refers to the case in which the countries considered are France and Italy, and the years analyzed are 2010, 2011, 2012, 2013 and 2014. All the entries related to French imports from Italy in 2014 (i.e., input sectors are from Italy and intermediate/final outputs are from France) are obscured,Footnote 21 then such entries are reconstructed by MC.Footnote 22 The rationale behind this application is that WIOD tables are obtained by combining information coming from different sources, and these are not necessarily synchronized. So, one could combine the complete information available in the past with the partial one currently available, to predict currently missing elements.

Table 4 Structure of the WIOD submatrix used for the example reported in Fig. 3a

In the following application of MC, our focus is not on the absolute RMSE, but on its percentage of reduction (in correspondence of the optimal value of the regularization parameter \(\lambda\)), with respect to a base caseFootnote 23 (represented by \(\lambda \simeq 0\)). Figure 3a illustrates the results of the application of the MC Algorithm 1 to the partially observed WIOD submatrix reported in Table 4 (details about the construction of the training, validation and test sets are provided at the end of Sect. 2.2). The figure shows that, in this case, MC is able to reduce significantly the RMSE of reconstruction on the missing elements in 2014, when moving from the case \(\lambda \simeq 0\) (for which the predictions of the missing elements are nearly equal to 0) to the optimal choice of \(\lambda\). More precisely, for \(\lambda \simeq 0\), one gets

$$\begin{aligned} {\mathrm{RMSE}}_{\lambda }^{\mathrm{val}}&= \sqrt{\frac{1}{|\Omega ^{\mathrm{val}}|}\sum _{(i,j) \in \Omega ^{\mathrm{val}}} \left( M_{i,j}-\hat{M}_{\lambda ,i,j} \right) ^2} \\&\simeq \sqrt{\frac{1}{|\Omega ^{\mathrm{val}}|}\sum _{(i,j) \in \Omega ^{\mathrm{val}}} M_{i,j}^2} \simeq 86.5711 \end{aligned}$$
(21)

(since, in this case, one gets \(\mathbf{S}_\lambda (\mathbf{Y}) \simeq \mathbf{Y}\) from Eq. (14), hence every time Step 2.a of Algorithm 1 is performed, one gets a matrix \({\hat{\mathbf{M}}}^{\mathrm{new}}\) whose entries are nearly equal to 0 in the positions corresponding to unobserved entries of \(\mathbf{M}\)). Instead, for the optimal value \(\lambda ^\circ =2^6=64\) of \(\lambda\) (whose location is highlighted in the figure), one gets

$$\begin{aligned} {\mathrm{RMSE}}_{\lambda ^\circ }^{\mathrm{val}} \simeq 47.2791\,, \end{aligned}$$

obtaining a reduction of the RMSE of about \(45\%\). A similar behavior is observed for the RMSE of matrix reconstruction on the test set, which decreases from 116.0874 (for \(\lambda \simeq 0\)) to 51.7544 (for \(\lambda = \lambda ^\circ\)), a reduction of about \(55\%\). Hence, a good generalization capability is observed, showing that no overfitting occurred in the application of MC.Footnote 24 Results in terms of SMAPE are reported in Fig. 3b. The irregular behavior of the curves associated with the SMAPE is due to the fact that MC does not address directly the SMAPE criterion, whereas the RMSE on the training set is part of the objective function of the MC optimization problem (11). Moreover, Fig. 3c compares the singular values’ distribution of the WIOD submatrix reported in Table 4, and the one of the completed submatrix produced as output by the algorithm, for both \(\lambda \simeq 0\) and the optimal value of \(\lambda\). The large overlap of the curves reported in the figure shows that MC was able to reconstruct very accurately the singular values’ distribution of the original WIOD submatrix (part of which was not observed). Moreover, such distribution decays rapidly to 0, which, as already reported in the Introduction, is a necessary (but not sufficient) condition for a good performance of MC. Indeed, the Eckart–Young theorem (see Appendix 1) provides an upper bound on the performance of MC, for a given number of singular values kept. Finally, Fig. 3d shows a colored visualization of the elements of the original WIOD submatrix, the positions of the missing entries (highlighted in red), the reconstructed submatrix obtained for the optimal value of the regularization constant \(\lambda\), and the element-wise absolute value of the reconstruction error.
It is worth recalling that the positions of the missing entries form the union of the validation and test sets, whereas the positions of the observed entries form the training set.Footnote 25 In this case, although the third column in Fig. 3d shows that MC looks able to reconstruct some pattern in the missing block of the matrix (with respect to the case \(\lambda \simeq 0\), for which the missing block is predicted as a block of all negligible elements), the reconstruction error looks to be still large (fourth column), having a similar pattern as the corresponding original non-obscured block (first column). This is partly due to the fact that a \(50\%\) reconstruction error corresponds to a reduction by 1 in logarithmic scale with base 2. Improved results are reported in the next subsection (see Figs. 6a and 8a), where the choice of the WIOD subtable to which MC is applied is guided by hierarchical clustering.
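
The percentage reductions quoted in this subsection, and the remark on the base-2 logarithmic scale, can be verified with a few lines of Python. The helper name `percent_reduction` is introduced here only for illustration; the numerical values are the RMSEs reported above.

```python
import math


def percent_reduction(baseline, optimal):
    """Relative reduction (in percent) from the baseline to the optimal RMSE."""
    return 100.0 * (baseline - optimal) / baseline


# RMSE for lambda close to 0 versus RMSE for the optimal lambda.
val_reduction = percent_reduction(86.5711, 47.2791)    # validation set
test_reduction = percent_reduction(116.0874, 51.7544)  # test set

# Halving a value lowers its base-2 logarithm by exactly 1.
log2_shift = math.log2(0.5 * 86.5711) - math.log2(86.5711)
```

The two reductions round to 45 and 55, respectively, and `log2_shift` equals −1, which is why an error of about half the original magnitude still displays a pattern similar to the original block when a logarithmic color scale is used.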

Fig. 3

a Results in terms of RMSE of the application of Algorithm 1 to the WIOD submatrix reported in Table 4. b Results in terms of SMAPE of the application of Algorithm 1 to the WIOD submatrix reported in Table 4. c Singular values’ distribution of the WIOD submatrix reported in Table 4, and the one of the completed submatrix produced by Algorithm 1 for the optimal regularization constant (RMSE criterion). d Colored visualization of the elements of the WIOD submatrix reported in Table 4, positions of the missing entries, reconstructed submatrix obtained for the optimal regularization constant (RMSE criterion), and element-wise absolute value of the reconstruction error (color figure online)

3.5 Matrix completion applied to historical data for groups of similar/dissimilar countries determined by hierarchical clustering

In this subsection, using data from the WIOD latest release, we compare the application of MC to WIOD submatrices obtained using a pre-processing step based on hierarchical clustering.Footnote 26 The dissimilarity between any two countries \(c_1\) and \(c_2\) is computed as the AACD between the corresponding blocks of \(\mathbf{T}\) in the WIOD table (stacked by considering several consecutive years), obtained by either choosing Italian sectors in input and intermediate/final outputs from the two countries \(c_1\) and \(c_2\) (\(\mathbf{T} ^{Italy,c_1}\) and \(\mathbf{T} ^{Italy,c_2}\), recalling the notation introduced in Sect. 2.1), or choosing Italian intermediate/final outputs and sectors from the two countries \(c_1\) and \(c_2\) in input (\(\mathbf{T} ^{c_1,Italy}\) and \(\mathbf{T} ^{c_2,Italy}\)). In other words, the dissimilarity of the two countries \(c_1\) and \(c_2\) in their Italian export patterns is evaluated in the first case, whereas their dissimilarity in the respective Italian import patterns is evaluated in the second case. Both the hierarchical clustering analyses are repeated taking as inputs stacked I/O blocks associated with several years (2010, 2011, 2012 and 2013), and using complete linkage to perform clustering. Figures 4 and 5 report the dendrograms obtained, where \(c_1\) and \(c_2\) are, respectively, both output countries (Fig. 4), and both input countries (Fig. 5).

In this way, it is possible to extract from Fig. 4 two groups of 4 output countries (see Tables 5 and 6) that are, respectively, in the same cluster, and in 4 different clusters.

Fig. 4

Dendrograms of output countries with Italian sectors in input, based on WIOD tables (stacked over the years 2010–2013). Hierarchical clustering performed with the AACD dissimilarity measure (y-axis) and complete linkage. 21 desired groups (compare with Table 3). Countries in the same cluster are depicted with the same color. Countries in singleton clusters are highlighted in black

Fig. 5

Dendrograms of input countries with Italian sectors in output, based on WIOD tables (stacked over the years 2010–2013). Hierarchical clustering performed with the AACD dissimilarity measure (y-axis) and complete linkage. 22 desired groups (compare with Table 3). Countries in the same cluster are depicted with the same color. Countries in singleton clusters are highlighted in black (color figure online)

Table 5 refers to a WIOD submatrix whose blocks have Italian sectors in input and intermediate/final outputs associated with the first group of extracted countries (specifically, Austria, Belgium, Germany and the Netherlands). In contrast, in Table 6, the intermediate/final outputs refer to the second group of extracted countries (specifically, Australia, Belgium, Japan and Malta). For predictive/MC purposes, the tables contain also data related to the year 2014.Footnote 27 Then, the MC Algorithm 1 is applied to both submatrices, after obscuring all the elements of their last block (highlighted in bold in Tables 5 and 6), which refers to a specific output country in 2014.

Table 5 Structure of the WIOD submatrix used for the example reported in Fig. 6a
Table 6 Structure of the WIOD submatrix used for the example reported in Fig. 7a

Figures 6a and 7a report the results of the application of the MC Algorithm 1 to the two WIOD submatrices whose structures are described in Tables 5 and 6, respectively. As expected, the results show a better performance of the MC algorithm, measured in terms of the percentage of reduction of the RMSE on the validation set from \(\lambda \simeq 0\) to the optimal choice of \(\lambda\), in the case of the first submatrix, whose intermediate/final outputs are associated with more similar countries.

It is worth observing that quite similar results are obtained if a different 2014 block, corresponding to another country in the group of 4 countries, is obscured in each of the two WIOD submatrices, or when the analysis is repeated by considering Italy in output and 4 similar/dissimilar countries in input (see the dendrogram reported in Fig. 5). In this second analysis, the selected subset of 4 similar countries in input consists of Belgium, Germany, Spain and France (see Table 7), whereas the selected subset of 4 dissimilar countries in input consists of Germany, India, Malta and Slovenia (see Table 8). Corresponding results of the MC analysis are reported in Figs. 8a and 9a. Again, similar comments as before apply: when more similar input countries are considered, the performance of MC under the RMSE criterion improves.

Moreover, a comparison of Figs. 6b, 7b, 8b and 9b shows that, also when the SMAPE performance measure is used, MC applied to similar countries produces better results (in terms of relative improvement with respect to the baseline case) than MC applied to dissimilar countries.

As shown later in Sect. 3.7, qualitatively similar results as in this subsection have been obtained by varying the random choices of the validation and test sets.

Fig. 6

a Results in terms of RMSE of the application of Algorithm 1 to the WIOD submatrix reported in Table 5. b Results in terms of SMAPE of the application of Algorithm 1 to the WIOD submatrix reported in Table 5. c Singular values’ distribution of the WIOD submatrix reported in Table 5, and the one of the completed submatrix produced by Algorithm 1 for the optimal regularization constant (RMSE criterion). d Colored visualization of the elements of the WIOD submatrix reported in Table 5, positions of the missing entries, reconstructed submatrix obtained for the optimal regularization constant (RMSE criterion), and element-wise absolute value of the reconstruction error (color figure online)

Fig. 7

a Results in terms of RMSE of the application of Algorithm 1 to the WIOD submatrix reported in Table 6. b Results in terms of SMAPE of the application of Algorithm 1 to the WIOD submatrix reported in Table 6. c Singular values’ distribution of the WIOD submatrix reported in Table 6, and the one of the completed submatrix produced by Algorithm 1 for the optimal regularization constant (RMSE criterion). d Colored visualization of the elements of the WIOD submatrix reported in Table 6, positions of the missing entries, reconstructed submatrix obtained for the optimal regularization constant (RMSE criterion), and element-wise absolute value of the reconstruction error (color figure online)

Table 7 Structure of the WIOD submatrix used for the example reported in Fig. 8a
Table 8 Structure of the WIOD submatrix used for the example reported in Fig. 9a
Fig. 8

a Results in terms of RMSE of the application of Algorithm 1 to the WIOD submatrix reported in Table 7. b Results in terms of SMAPE of the application of Algorithm 1 to the WIOD submatrix reported in Table 7. c Singular values’ distribution of the WIOD submatrix reported in Table 7, and the one of the completed submatrix produced by Algorithm 1 for the optimal regularization constant (RMSE criterion). d Colored visualization of the elements of the WIOD submatrix reported in Table 7, positions of the missing entries, reconstructed submatrix obtained for the optimal regularization constant (RMSE criterion), and element-wise absolute value of the reconstruction error (color figure online)

Fig. 9

a Results in terms of RMSE of the application of Algorithm 1 to the WIOD submatrix reported in Table 8. b Results in terms of SMAPE of the application of Algorithm 1 to the WIOD submatrix reported in Table 8. c Singular values’ distribution of the WIOD submatrix reported in Table 8, and the one of the completed submatrix produced by Algorithm 1 for the optimal regularization constant (RMSE criterion). d Colored visualization of the elements of the WIOD submatrix reported in Table 8, positions of the missing entries, reconstructed submatrix obtained for the optimal regularization constant (RMSE criterion), and element-wise absolute value of the reconstruction error (color figure online)

3.6 Performance of matrix completion on simulated matrices

In this subsection, we show that the application of MC to the simulated data of Sect. 3.3 produces results similar to those of its application to the original data (see Sect. 3.5). In the following, for illustrative purposes, we focus on just one of the simulated matrices considered in Sect. 3.3 (in the next figures, the synthetic countries keep the names of the original countries, since their respective data are obtained by perturbing those of the associated original countries).

Figures 10 and 11 show the results of the hierarchical clustering, obtained, respectively, with Italy in input and in output.

Fig. 10

Dendrograms of synthetic output countries with Italian sectors in input, based on synthetic WIOD tables (stacked over the years 2010–2013). Hierarchical clustering performed with the AACD dissimilarity measure (y-axis) and complete linkage. 21 desired groups (compare with Table 3). Countries in the same cluster are depicted with the same color. Countries in singleton clusters are highlighted in black (color figure online)

Fig. 11

Dendrograms of synthetic input countries with Italian sectors in output, based on synthetic WIOD tables (stacked over the years 2010–2013). Hierarchical clustering performed with the AACD dissimilarity measure (y-axis) and complete linkage. 22 desired groups (compare with Table 3). Countries in the same cluster are depicted with the same color. Countries in singleton clusters are highlighted in black (color figure online)

Then, based on the dendrogram shown in Fig. 10, Fig. 12 compares the MC performance (for brevity, only under the RMSE criterion) for two cases, analogous to the ones considered in Sect. 3.5, in which the 4 selected synthetic countries belong, respectively, to the same cluster (similar synthetic countries: ESP, FRA, GBR, USA; obscured one in the last year: ESP) and to different clusters (dissimilar synthetic countries: CYP, ESP, IDN, MEX; obscured one in the last year: CYP). Analogously, based on the dendrogram shown in Fig. 11, Fig. 13 compares the MC performance (again, only under the RMSE criterion) for two cases, analogous to the ones considered in Sect. 3.5, in which the 4 selected synthetic countries belong, respectively, to the same cluster (similar synthetic countries: BEL, DEU, ESP, FRA; obscured one in the last year: BEL) and to different clusters (dissimilar synthetic countries: CZE, DEU, EST, IND; obscured one in the last year: DEU). The results are qualitatively similar to the ones reported for the original data (see Sect. 3.5), and demonstrate the robustness of the proposed approach of analysis, which combines hierarchical clustering and MC. Similar results, not reported here, are obtained when the SMAPE criterion is used to compare the performance of MC for similar and dissimilar countries.
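For reference, the two error criteria used in these comparisons can be sketched as follows. The SMAPE convention below (absolute error divided by the mean of the absolute true and predicted values, with 0/0 treated as 0) is one common choice and is an assumption here; the definition adopted in the article may differ, e.g., by a constant factor. The function names are illustrative.

```python
import math


def rmse(true_values, predicted_values):
    """Root mean square error over a set of held-out entries."""
    n = len(true_values)
    squared = sum((t - p) ** 2 for t, p in zip(true_values, predicted_values))
    return math.sqrt(squared / n)


def smape(true_values, predicted_values):
    """Symmetric mean absolute percentage error (one common convention)."""
    total = 0.0
    for t, p in zip(true_values, predicted_values):
        denom = (abs(t) + abs(p)) / 2
        # Entries where both values are zero contribute no error.
        total += abs(t - p) / denom if denom > 0 else 0.0
    return total / len(true_values)
```

Unlike the RMSE, the SMAPE is scale-free, which is why the two criteria can rank reconstructions differently when the entries span several orders of magnitude.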

Fig. 12

a Results in terms of RMSE of the application of Algorithm 1 to one synthetic WIOD submatrix made of 4 similar synthetic countries, with Italy in input. b Results in terms of RMSE of the application of Algorithm 1 to one synthetic WIOD submatrix made of 4 dissimilar synthetic countries, with Italy in input

Fig. 13

a Results in terms of RMSE of the application of Algorithm 1 to one synthetic WIOD submatrix made of 4 similar synthetic countries, with Italy in output. b Results in terms of RMSE of the application of Algorithm 1 to one synthetic WIOD submatrix made of 4 dissimilar synthetic countries, with Italy in output

3.7 Results for different choices of the validation and test sets

In order to investigate how the results obtained in Sect. 3.5 may depend on the random choices of the validation and test sets inside the obscured blocks, we report in Fig. 14a–d some variations of Figs. 6a, 7a, 8a and 9a, obtained by considering, for illustrative purposes, 5 such random choices. Similarly, Fig. 15a–d reports the corresponding variations of Figs. 6b, 7b, 8b and 9b. Figures 14a–d and 15a–d show that the results obtained in this way are qualitatively similar to those of Sect. 3.5. Moreover, especially in the case of the RMSE criterion, they further justify focusing on the relative improvement achieved by MC, as there is some variability in the validation and test set RMSEs obtained for \(\lambda \simeq 0\). It is also worth noticing that the variability of the curves looks larger in Fig. 14b and d, which refer to situations in which MC is applied to I/O subtables made of dissimilar countries. This is in agreement with our intuition that MC performs better when it is applied to I/O subtables made of similar countries (see Fig. 14a and c).
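The repeated random choices of validation and test sets can be sketched as below. The helper name `split_obscured`, the 50/50 split, and the position of the obscured block are assumptions made for illustration; the article's actual proportions and block layout are specified in its earlier sections.

```python
import random


def split_obscured(obscured_positions, validation_fraction=0.5, seed=0):
    """Randomly split obscured (row, column) positions into two disjoint sets.

    Returns (validation_positions, test_positions).
    """
    rng = random.Random(seed)
    shuffled = list(obscured_positions)
    rng.shuffle(shuffled)
    cut = int(round(validation_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical obscured block: rows 0-3, columns 10-14 of a subtable.
obscured = [(i, j) for i in range(4) for j in range(10, 15)]

# Five independent random choices, one per seed.
splits = [split_obscured(obscured, seed=s) for s in range(5)]
```

Running the completion once per split and overlaying the resulting error curves gives a direct view of the variability discussed above.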

Qualitatively similar results, not reported here, have been obtained by randomly changing the validation and test sets related to Figs. 12 and 13 in Sect. 3.6.

Fig. 14

a Results in terms of RMSE of the application of Algorithm 1 to the WIOD submatrix reported in Table 5. b Results in terms of RMSE of the application of Algorithm 1 to the WIOD submatrix reported in Table 6. c Results in terms of RMSE of the application of Algorithm 1 to the WIOD submatrix reported in Table 7. d Results in terms of RMSE of the application of Algorithm 1 to the WIOD submatrix reported in Table 8

Fig. 15

a Results in terms of SMAPE of the application of Algorithm 1 to the WIOD submatrix reported in Table 5. b Results in terms of SMAPE of the application of Algorithm 1 to the WIOD submatrix reported in Table 6. c Results in terms of SMAPE of the application of Algorithm 1 to the WIOD submatrix reported in Table 7. d Results in terms of SMAPE of the application of Algorithm 1 to the WIOD submatrix reported in Table 8

3.8 Application of matrix completion to a WIOD submatrix containing both intra-country and inter-country blocks

In Table 9, we consider a variation of Table 4 that also includes the domestic block associated with Italy, evaluated in different years.

Table 9 Structure of the WIOD submatrix used for the example reported in Fig. 16

As for the application of MC, since the elements of the domestic blocks differ by orders of magnitude from those of the other blocks, the range of values for the regularization parameter has been enlarged for this specific example, by setting \(\lambda _k=2^{k/2-20}\), for \(k=1,\ldots ,80\). As highlighted by Fig. 16, in this case the performance of MC is poor, likely because of the highly different orders of magnitude of the elements in the various blocks.
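To make the extent of this grid concrete, the values \(\lambda _k\) can be generated as in the short sketch below (the variable name `lambdas` is illustrative). The grid is geometric with ratio \(\sqrt{2}\), ranging from \(2^{-19.5}\) to \(2^{20}\).

```python
import math

# Regularization grid lambda_k = 2^(k/2 - 20), for k = 1, ..., 80.
lambdas = [2.0 ** (k / 2 - 20) for k in range(1, 81)]

smallest = lambdas[0]               # 2^(-19.5), roughly 1.35e-6
largest = lambdas[-1]               # 2^20 = 1048576
ratio = lambdas[1] / lambdas[0]     # sqrt(2) between consecutive values
```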

Fig. 16

a Results in terms of RMSE of the application of Algorithm 1 to the WIOD submatrix reported in Table 9. b Results in terms of SMAPE of the application of Algorithm 1 to the WIOD submatrix reported in Table 9. c Singular values’ distribution of the WIOD submatrix reported in Table 9, and the one of the completed submatrix produced by Algorithm 1 for the optimal regularization constant (RMSE criterion). d Colored visualization of the elements of the WIOD submatrix reported in Table 9, positions of the missing entries, reconstructed submatrix obtained for the optimal regularization constant (RMSE criterion), and element-wise absolute value of the reconstruction error (color figure online)

4 Future research and concluding remarks

This work represents the first attempt to adopt a matrix completion (MC) algorithm, combined with a hierarchical clustering pre-processing step, to predict missing entries in submatrices of I/O tables in the context of a panel data analysis.

The particular structure of I/O tables, described in the article, makes the data reconstruction and prediction problems nontrivial. Hence, in the pre-processing phase, we have employed the dissimilarity pattern of countries to define low-rank I/O subtables with few dominant singular values. A panel matrix completion with nuclear norm penalty has been tested on those low-rank subtables. The method, which exploits historical data available from previous years, has been shown to be effective when the considered I/O subtables are obtained by selecting similar countries.
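As an illustration of what a completion with nuclear norm penalty involves, the following is a minimal sketch of a generic soft-impute iteration: unobserved entries are filled with the current estimate, and the singular values of the filled matrix are soft-thresholded by the regularization constant. It is not claimed to coincide with Algorithm 1 of the article; the function name `soft_impute`, the stopping rule, and the synthetic rank-2 data are assumptions made for this example.

```python
import numpy as np


def soft_impute(matrix, observed_mask, lam, n_iters=200, tol=1e-6):
    """Generic soft-impute iteration for nuclear-norm-regularized completion.

    matrix: 2-D array (values at unobserved positions are ignored).
    observed_mask: boolean array, True where an entry is observed.
    lam: regularization constant (soft-threshold on singular values).
    """
    completed = np.zeros_like(matrix, dtype=float)
    for _ in range(n_iters):
        # Keep observed entries, fill the others with the current estimate.
        filled = np.where(observed_mask, matrix, completed)
        u, s, vt = np.linalg.svd(filled, full_matrices=False)
        s_shrunk = np.maximum(s - lam, 0.0)  # soft-threshold singular values
        updated = (u * s_shrunk) @ vt
        # Relative change between consecutive estimates (Frobenius norm).
        change = np.linalg.norm(updated - completed) / max(
            np.linalg.norm(completed), 1e-12)
        completed = updated
        if change < tol:
            break
    return completed


# Synthetic rank-2 matrix with about 20% of the entries held out.
rng = np.random.default_rng(0)
low_rank = rng.normal(size=(20, 2)) @ rng.normal(size=(2, 15))
mask = rng.random(low_rank.shape) < 0.8  # True where observed
estimate = soft_impute(low_rank, mask, lam=0.1)
held_out_rmse = np.sqrt(np.mean((estimate - low_rank)[~mask] ** 2))
```

Larger values of `lam` shrink more singular values to zero and thus yield lower-rank reconstructions, which is why the regularization constant is selected on a validation set.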

A first possible extension of the analysis concerns the reconstruction of I/O tables (in one year, based on current and previous years) through the repeated application of matrix completion to several subtables (see Footnote 28) of the original I/O table, to be compared with a single, more computationally expensive and (presumably) less effective application to the whole table (possibly after removing domestic blocks, as done in this article). As for the possible dependence of the results on the cluster size (in the case of I/O subtables associated with countries coming from the same cluster), it is worth noticing that the results reported in the present work refer to clusters having slightly different sizes. So, the proposed approach has the potential to work well (compared to the alternative selection of countries from distinct clusters) with different cluster sizes. A more extensive analysis (based either on artificial data or on real-world data, possibly with various selections of the given country in input or output) would be needed to check this further. This is left as a future development, since it would require a much larger number of (computationally intensive) repeated applications of MC.

The proposed methodology is expected to be applicable, with similar results, also to other I/O tables (both industry-by-industry and product-by-product ones), because their structure is often similar to that of WIOD tables, as highlighted in this work (see Footnote 29).

As a second possible extension, algorithms for clustering and matrix completion different from those employed in the present article could be used. Moreover, matrix completion itself could be compared with other imputation methods for missing entries in panel data models. A comparison with alternative methods such as the one suggested by Rueda-Cantuche et al. (2018) is left for future research.

A third possible extension consists in applying the MC algorithm not to the original I/O subtable, but to a suitably pre-processed version of it, obtained by removing from that subtable its prediction provided by another method. This “ensemble learning” approach would combine the two methods, with the aim of possibly obtaining better predictions.
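One way to formalize this idea, under the assumption that “removing” means element-wise subtraction, is the following. Let \(M\) denote the I/O subtable, \(\Omega\) the set of its observed entries, \(P_\Omega\) the operator that keeps the entries in \(\Omega\) and sets the others to zero, \(\hat{M}^{(0)}\) the prediction provided by the other method, and \(\mathrm{MC}_\lambda\) the completion operator with regularization constant \(\lambda\) (all these symbols are introduced here only for illustration). The combined prediction would then read

\[
\hat{M} \;=\; \hat{M}^{(0)} + \mathrm{MC}_\lambda \left( P_\Omega \left( M - \hat{M}^{(0)} \right) \right),
\]

so that MC is only asked to reconstruct the residual left unexplained by the first method.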

Finally, as another possible extension, the approach adopted in the paper could be applied to generate counterfactuals of I/O submatrices: e.g., by predicting how the entries of a suitably specified input–output submatrix related to Japan would have changed if the March 2011 earthquake and tsunami and the subsequent Fukushima Daiichi nuclear disaster (Yonemoto 2016) had not occurred. To do this, one would first need to identify sectors of the economy that were not affected by such events (i.e., untreated sectors), then obscure (and reconstruct) the entries of that submatrix related to other sectors that were affected (i.e., treated sectors).