A model-based ultrametric composite indicator for studying waste management in Italian municipalities

Cavicchia, Carlo; Sarnacchiaro, Pasquale; Vichi, Maurizio; Zaccaria, Giorgia

doi:10.1007/s00180-023-01333-9

Original paper
Open access
Published: 16 March 2023

Volume 39, pages 21–50, (2024)

Computational Statistics


Abstract

A Composite Indicator (CI) is a useful tool to synthesize information on a multidimensional phenomenon and make policy decisions. Multidimensional phenomena are often modeled by hierarchical latent structures that reconstruct relationships between variables. In this paper, we propose an exploratory, simultaneous model for building a hierarchical CI system to synthesize a multidimensional phenomenon and analyze its several facets. The proposal, called the Ultrametric Composite Indicator (UCI) model, reconstructs the hierarchical relationships among manifest variables detected by the correlation matrix via an extended ultrametric correlation matrix. The latter has the feature of being one-to-one associated with a hierarchy of latent concepts. Furthermore, the proposal introduces a test to unravel relevant dimensions in the hierarchy and retain statistically significant higher-level CIs. A simulation study is illustrated to compare the proposal with other existing methodologies. Finally, the UCI model is applied to study Italian municipalities’ behavior toward waste management and to provide a tool to guide their councils in policy decisions.


1 Introduction

Composite Indicators (CIs) have become increasingly relevant in the last twenty years as statistical tools for policy making. In fact, they are useful to convey and synthesize information on complex multidimensional phenomena that are not directly observable. As established by the Joint Research Centre and the Organization for Economic Co-operation and Development, a CI is an unobserved (latent) variable resulting from the aggregation of manifest variables into a single synthetic measure grounded in an underlying model of the multidimensional phenomenon under study (Nardo et al. 2005; OECD-JRC 2008). These complex phenomena are generally characterized by latent dimensions (concepts) that are ordered in a hierarchical structure, cannot be observed directly and can therefore, in turn, be measured by CIs. Accordingly, two different types of CIs can be distinguished: the General Composite Indicator (GCI) to measure the multidimensional concept and the Specific Composite Indicators (SCIs) to represent its latent dimensions. This results in a CI system with an underlying hierarchical structure, where manifest variables are aggregated into first-level SCIs (specific dimensions), the latter into higher-level ones (broader dimensions) up to the GCI.

Some of the most severe criticisms of the use of CIs are related to the oversimplistic policy conclusions they can lead to and the normative approach to their construction, i.e., the fact that they are based on expert evaluation without any statistical assessment (Cavicchia and Vichi 2021; OECD-JRC 2008). However, both limitations can be overcome by building a model-based CI system. Indeed, even if a theoretical framework settled by a think tank can provide the interpretation of the phenomenon under study, the construction of a CI system via statistical models limits the researcher’s arbitrary choices and connects it to the data via a mathematical formalization.

In the specialized literature, several methodologies have been proposed with the aim of modeling the multivariate data matrix or the covariance/correlation matrix, to inspect the hierarchical relationships among manifest variables and detect latent dimensions and their quantification (Anderson and Rubin 1956; Cattell 1978; Wherry 1959; Schmid and Leiman 1957; Cavicchia and Vichi 2022, among others). These models were developed via a sequential approach, i.e., without optimizing an overall objective function, which can lead to inaccurate detection of the hierarchical relationships among manifest variables, or via a simultaneous approach, yet restricting the resulting hierarchy to a reduced number of levels. However, none of the existing methodologies builds a hierarchical structure over the manifest variables via a simultaneous approach while also testing which levels of the hierarchy are not statistically significant, so as to reduce their number and obtain a CI system representative of the multidimensional phenomenon under study. Therefore, this article aims to fill this gap by proposing a novel hierarchical methodology to build a CI system that considers all the hierarchical levels of the concept under study using a simultaneous model-based approach.

The proposal, called the Ultrametric Composite Indicator (UCI) model, unravels the hierarchical relationships between manifest variables by reconstructing the observed correlation matrix through an extended ultrametric one. The latter is a particular block matrix that is in one-to-one correspondence with a hierarchy of latent concepts and can be represented by a tree structure, where the leaves correspond to the manifest variables, the internal nodes represent the SCIs, and the root identifies the GCI. Thus, an extended ultrametric correlation matrix is well suited to model hierarchical relations among manifest variables and/or groups of them.

Although they may be interesting for interpretation, not all internal nodes obtained by aggregating lower-level nodes necessarily have to be retained in the hierarchy. In fact, if they are not statistically significant, some internal nodes corresponding to higher-level SCIs can be removed. In the UCI model, we introduce a test to evaluate the difference between two levels of the hierarchy engendered by the adopted ultrametric structure. The quantification of the CI system is based on the resulting hierarchy, where only statistically significant levels are maintained. It is worth underlining that the proposal is characterized by an overall objective function to optimize in order to obtain the optimal hierarchy in a simultaneous approach instead of a sequential and greedy manner. Moreover, the UCI model performs an exploratory analysis where only the number of first-level latent concepts is required beforehand. Unlike a confirmatory analysis, the exploratory one does not impose any relationships between manifest variables and first-level SCIs (or first-level and higher-level SCIs) but lets the data determine them, which can be extremely useful if a theoretical conceptualization of the phenomenon under study is not available or is not empirically confirmed (see Cavicchia and Vichi 2021, for further details on the difference between exploratory and confirmatory analyses).

The performance of the UCI model is first evaluated through a simulation study on synthetic data, where it is compared with other existing methodologies for detecting hierarchical structures of variables. The proposal is then applied to the study of waste management in the 40 largest Italian municipalities by identifying its relevant latent dimensions. Waste management represents a multidimensional phenomenon that policy makers have highly considered in the last few decades (Heads of State and Government and High Representatives 2015; European Parliament and Council of the European Union 1999, 2018). To monitor waste collection and recycling in Europe, Eurostat collects indicators and statistics under the Waste Statistics Regulation (European Commission 2010), which can be used to build a waste management CI in Europe (Cavicchia et al. 2021). Starting from variables such as Total costs of mixed waste management, Total costs of separated waste management and Percentage of separated waste over the total waste, the proposal aims to pinpoint SCIs related to the quantities, performances, and costs of waste management and allows assessing the importance of each SCI in the construction of the GCI. Furthermore, the resulting SCIs and GCI are used to unravel different behaviors of Italian municipalities in waste disposal and treatment, as well as to determine the dimensions of waste management on which governments must focus to improve the performance of municipalities (i.e., increasing recycling practices and investing in separated waste). When studying the performance of Italian municipalities, it should be considered that several aspects can affect waste management and its dimensions. For example, tourism can have an impact on waste generation and collection (e.g., Matai 2015; Mateu-Sbert et al. 2013; Diaz-Farina et al. 2020). For this reason, we implement a further analysis considering aspects that affect waste management as external information and applying the UCI model to the data net of these effects.

The paper is organized as follows. In Sect. 2, the notation used throughout the paper is introduced and the definitions necessary to follow the specification of the model proposed here are provided. Section 3 thoroughly discusses the proposal in all its aspects (model specification, estimation, CI system definition and treatment of the external variable effect). The performance of the proposed model is illustrated in Sect. 4 both on synthetic and real data. A final discussion completes the article in Sect. 5.

2 Notation and background

For the convenience of the reader, the notation used in this paper is listed here.

n, p	Number of units and manifest variables, respectively
Q	Number of variable groups corresponding to the first-level SCIs over which the hierarchy is built
$\textbf{X}= [x_{ij}]$	($n \times p$) data matrix
$\textbf{R} = [r_{jl}]$	Data correlation matrix of order p, where $r_{jl}$ is the correlation between the manifest variables j and l ($j,l = 1, \ldots , p$)
$\textbf{V}=[v_{jq}]$	($p \times Q$) membership matrix, where $v_{jq} = 1$ if the jth manifest variable belongs to the qth group; $v_{jq} = 0$ otherwise
$\textbf{R}_{\textrm{W}}= [_{W}r_{qq}]$	Diagonal matrix of order Q, whose diagonal entries represent the correlation within groups
$\textbf{R}_{\textrm{B}}= [_{B}r_{qh}]$	Matrix of order Q, whose off-diagonal entries represent the correlation between groups and diagonal ones are equal to zero
$\textbf{E}=[e_{jl}]$	Error square matrix of order p
$\textbf{Y}_{Q} = [y_{iq}^{(Q)}]$	($n \times Q$) score matrix of the first-level SCIs
$\textbf{A}_{Q} = [a_{jq}^{(Q)}]$	($p \times Q$) sparse loading matrix, with a nonnull value per row representing the unique loading of each manifest variable on the corresponding first-level SCI. The position of each nonnull value per row is determined according to $\textbf{V}$
$\textbf{1}_{p},\textbf{1}_{Q},\textbf{I}_{p}$	Unitary vector of order p and Q, identity matrix of order p, respectively

Before going into the details of the model proposed to build a CI system in Sect. 3, we need to recall and introduce some definitions.

Definition 1

A matrix $\textbf{U} = [u_{jl} \in {\mathbb {R}}_{\ge 0}]$ of order p is said to be ultrametric if

(i)
$u_{jl} = u_{lj}$ for all $j, l = 1, \ldots , p$ (symmetry);
(ii)
$u_{jj} \ge \max \{u_{lj}: l = 1, \ldots , p\}$ for all $j = 1, \ldots , p$ (column pointwise diagonal dominance);
(iii)
$u_{jl} \ge \min \{u_{ji}, u_{il}\}$, for all $i,j,l = 1, \ldots , p$ (ultrametric inequality).

An ultrametric matrix has two main characteristics that make it suitable for building a hierarchy of composite indicators, starting with studying the relationships among manifest variables. These characteristics can be summarized as follows.

Remark 1

Every ultrametric matrix turns out to be positive semidefinite (psd) (Dellacherie et al. 2014, pp. 58-59).

Remark 2

An ultrametric matrix is associated one-to-one with a hierarchy of latent concepts (Cavicchia et al. 2020, 2022).

Remark 1 is essential if we analyze the relationships among the manifest variables through their correlations. In fact, a nonnegative correlation matrix of order p is an ultrametric matrix if (iii) holds, since (i) and (ii) are satisfied by definition. As we will see later in the paper, Remark 2 relates an ultrametric correlation matrix to a hierarchical structure. However, Definition 1 is based on the nonnegativity assumption, which can be very restrictive in several real applications. To include negative values and thus make the notion of ultrametricity more applicable in practice, the extension of Definition 1 is provided as follows.

Definition 2

A matrix $\textbf{U} = [u_{jl} \in {\mathbb {R}}]$ of order p is said to be extended ultrametric if

(i)
$u_{jl} = u_{lj}$ for all $j, l = 1, \ldots , p$ (symmetry);
(ii.a)
$u_{jj} \ge 0$ for $j = 1, \ldots , p$ (nonnegativity of the diagonal);
(ii.b)
$u_{jj} \ge \max \{|u_{lj}|: l = 1, \ldots , p\}$ for $j = 1, \ldots , p$ (column pointwise diagonal dominance);
(iii)
$u_{jl} \ge \min \{u_{ji}, u_{il}\}$, for all $i,j,l = 1, \ldots , p$ (ultrametric inequality).

It is worth noting that, if the nonnegativity assumption does not hold for the entire matrix, condition (ii.b) is not sufficient to guarantee the positive semidefiniteness of an extended ultrametric matrix, and thus to apply Definition 2 to a correlation matrix. To overcome this drawback, we require that, if $\textbf{U}$ is not psd, it be replaced by $\textbf{U} + a\textbf{I}_{p}$, where a is the absolute value of the smallest eigenvalue of $\textbf{U}$ (Cailliez 1983). The shifted matrix satisfies the positive semidefiniteness condition needed to apply the notion of ultrametricity to generic correlation matrices. In the next section, we introduce a new model-based approach for building a composite indicator system based on an extended ultrametric matrix.
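The conditions of Definition 2 can be checked numerically. The following is a minimal sketch, assuming NumPy; the function name `is_extended_ultrametric` and the tolerance `tol` are illustrative choices and do not come from the paper. The last lines show a matrix that satisfies Definition 2 but is not psd, which is the situation the eigenvalue shift is meant to handle.

```python
import numpy as np

def is_extended_ultrametric(U, tol=1e-10):
    """Check conditions (i), (ii.a), (ii.b) and (iii) of Definition 2.

    Hypothetical helper written for illustration only.
    """
    U = np.asarray(U, dtype=float)
    p = U.shape[0]
    diag = np.diag(U)
    symmetric = np.allclose(U, U.T, atol=tol)                 # (i)
    nonneg_diag = np.all(diag >= -tol)                        # (ii.a)
    # (ii.b): each diagonal entry dominates the absolute values in its column
    dominant = np.all(diag + tol >= np.abs(U).max(axis=0))
    # (iii): u_jl >= min(u_ji, u_il) for every triplet (i, j, l)
    ultra = all(
        U[j, l] + tol >= min(U[j, i], U[i, l])
        for i in range(p) for j in range(p) for l in range(p)
    )
    return bool(symmetric and nonneg_diag and dominant and ultra)

# A matrix with equal negative off-diagonal entries satisfies Definition 2
# but has a negative eigenvalue, so it is not psd.
U_neg = np.full((3, 3), -0.8)
np.fill_diagonal(U_neg, 1.0)
smallest_eigenvalue = np.linalg.eigvalsh(U_neg).min()
```

The triple loop costs O(p^3) operations, which is acceptable for the small numbers of variables typical of composite indicators.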

3 The ultrametric composite indicator model

The Ultrametric Composite Indicator (UCI) model defines a hierarchy of composite indicators that starts with the study of the relationships among manifest variables and identifies broader dimensions associated with SCIs up to GCI. Therefore, we model the observed correlation matrix through an ultrametric structure to inspect the hierarchical relationships among manifest variables. This means that the UCI model reconstructs a correlation matrix $\textbf{R} = [r_{jl} \in {\mathbb {R}}]$ of order p through an extended ultrametric correlation matrix $\textbf{R}_{\textrm{EU}}$, which is therefore psd, and an error square matrix $\textbf{E}$ of the same order. Formally, the correlation matrix $\textbf{R}$ of an ($n \times p$) data matrix $\textbf{X}$ is modeled by

$$\begin{aligned} \textbf{R} = \textbf{R}_{\textrm{EU}}+ \textbf{E}, \end{aligned}$$

(1)

where $\textbf{R}_{\textrm{EU}}$ detects the hierarchical structure of the manifest variables. Specifically, $\textbf{R}_{\textrm{EU}}$ is parameterized as follows

$$\begin{aligned} \textbf{R}_{\textrm{EU}}= \textbf{V}\textbf{R}_{\textrm{W}}\textbf{V}^{\prime }-\text {diag}\big (\textbf{V}\textbf{R}_{\textrm{W}}\textbf{V}^{\prime }\big ) + \textbf{V}\textbf{R}_{\textrm{B}}\textbf{V}^{\prime } + \textbf{I}_{p}, \end{aligned}$$

(2)

subject to constraints

$$\begin{aligned}&\textbf{V}= [v_{jq} \in \{0,1\}: j = 1, \ldots , p, q = 1, \ldots , Q]; \end{aligned}$$

(3)

$$\begin{aligned}&\textbf{V}\textbf{1}_{Q} = \textbf{1}_{p} \quad \text {i.e.} \quad \sum _{q = 1}^{Q} v_{jq} = 1 \quad j = 1, \ldots , p; \end{aligned}$$

(4)

$$\begin{aligned}&\textbf{R}_{\textrm{W}}= \text {diag}([_{W}r_{11}, \ldots , _{W}r_{QQ}]); \end{aligned}$$

(5)

$$\begin{aligned}&\textbf{R}_{\textrm{B}}= \textbf{R}_{\textrm{B}}^{\prime }, \text {diag}(\textbf{R}_{\textrm{B}}) = \textbf{0}, {_{B}r_{qh}} \ge \min \{ {_{B}r_{qs}}, {_{B}r_{hs}} \} \; q, h, s = 1, \ldots , Q, \nonumber \\&s \ne h \ne q; \end{aligned}$$

(6)

$$\begin{aligned}&\min \{ {_{W}r_{qq}}: q = 1, \ldots , Q\} \ge \max \{ {_{B}r_{qh}}: q, h = 1, \ldots , Q, h \ne q\}; \end{aligned}$$

(7)

Note that $\text {diag}(\cdot )$ denotes the diagonal matrix whose diagonal elements are those of the parenthesized matrix. It can be easily proved that $\textbf{R}_{\textrm{EU}}$ is in agreement with Definition 2. In fact, it is symmetric since (5) and (6) hold; it is nonnegative on the diagonal and is column pointwise diagonally dominant since its diagonal corresponds to a unitary vector, that is, the diagonal of $\textbf{I}_{p}$ in Eq. (2), whereas its off-diagonal elements vary between $-1$ and 1; lastly, it fits the ultrametric condition thanks to Eqs. (6)–(7). Moreover, if $\textbf{R}_{\textrm{EU}}$ is not psd, it must be rewritten as follows $\textbf{R}_{\textrm{EU}}= \text {diag}(\widetilde{\textbf{R}}_{\textrm{EU}})^{-\frac{1}{2}} \, \widetilde{\textbf{R}}_{\textrm{EU}}\; \text {diag}(\widetilde{\textbf{R}}_{\textrm{EU}})^{-\frac{1}{2}}$, where $\widetilde{\textbf{R}}_{\textrm{EU}}= \textbf{R}_{\textrm{EU}}+ a\textbf{I}_{p}$ and a is set to the absolute value of the smallest eigenvalue of $\textbf{R}_{\textrm{EU}}$.
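The psd adjustment described above (eigenvalue shift followed by rescaling to a unit diagonal) may be sketched as follows. This is an illustrative NumPy sketch; the function name `psd_adjust` is an assumption and not part of the paper.

```python
import numpy as np

def psd_adjust(R_eu):
    """Shift by |smallest eigenvalue| and rescale to unit diagonal if R_eu is not psd.

    Hypothetical helper: mirrors the formula in the text, nothing more.
    """
    R_eu = np.asarray(R_eu, dtype=float)
    lam_min = np.linalg.eigvalsh(R_eu).min()
    if lam_min >= 0:
        return R_eu.copy()                      # already psd, nothing to do
    R_tilde = R_eu + abs(lam_min) * np.eye(R_eu.shape[0])
    d = 1.0 / np.sqrt(np.diag(R_tilde))         # diag(R_tilde)^(-1/2)
    return d[:, None] * R_tilde * d[None, :]

# Example: equal off-diagonal entries of -0.8 give a smallest eigenvalue of -0.6,
# so the matrix is shifted and then rescaled back to a correlation matrix.
R_bad = np.full((3, 3), -0.8)
np.fill_diagonal(R_bad, 1.0)
R_fixed = psd_adjust(R_bad)
```

After the adjustment the diagonal is again equal to one, so the result can still be read as a correlation matrix.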

The matrix $\textbf{R}_{\textrm{EU}}$ defined in Eq. (2) depends on three parameters: $\textbf{V}$, which represents the membership matrix that defines the partition of the variables into Q groups ($Q \le p$), each associated with a specific dimension, $\textbf{R}_{\textrm{W}}$ and $\textbf{R}_{\textrm{B}}$ that determine the characteristics of the groups. Specifically, $\textbf{R}_{\textrm{W}}$ is a diagonal matrix of order Q, whose diagonal entries represent the correlations within the variable groups, and $\textbf{R}_{\textrm{B}}$ is a matrix of order Q, whose off-diagonal elements represent the correlations between pairs of groups. Given the ultrametricity constraint (6), $\textbf{R}_{\textrm{B}}$ has a reduced number of different values that is at most $Q-1$. By construction, $\textbf{R}_{\textrm{EU}}$ is then a ($2Q-1$)-extended ultrametric correlation matrix since it has at most $2Q-1$ different values, i.e., Q in $\textbf{R}_{\textrm{W}}$ and ($Q-1$) in $\textbf{R}_{\textrm{B}}$. Moreover, recalling Remark 2, it should be noted that $\textbf{R}_{\textrm{EU}}$ is one-to-one associated with a hierarchy of latent concepts. In detail, since each variable belongs to only one latent dimension, any triplet (i, j, l) of variables will surely fall into one of the following possible scenarios: (a) all elements of the triplet belong to a single group; (b) the elements of the triplet belong to two distinct groups; (c) all elements of the triplet belong to different groups. These three scenarios correspond to the following correlation triplets: ($_{W}r_{qq}$, $_{W}r_{qq}$, $_{W}r_{qq}$), ($_{W}r_{qq}$, $_{B}r_{qh}$, $_{B}r_{qh}$) and ($_{B}r_{qh}$, $_{B}r_{qk}$, $_{B}r_{hk}$), respectively. Furthermore, all triplets verify the ultrametric inequality due to constraints (6) and (7). Thus, in $\textbf{R}_{\textrm{EU}}$ the Q values $_{W}r_{qq}$ ($q = 1,\dots ,Q$) correspond to the variable aggregations in groups defined by $\textbf{V}$, while the other $Q-1$ values $_{B}r_{qh}$ ($q,h = 1,\dots ,Q$, $h \ne q$) represent the aggregations in pairs of the Q variable groups. Therefore, $\textbf{R}_{\textrm{B}}$ defines the hierarchical structure of the Q variable groups considering its $Q-1$ values in decreasing order. This gives rise to broader groups and corresponding dimensions lumped together from the most concordant to the least concordant.
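To make the parameterization in Eq. (2) concrete, the sketch below assembles $\textbf{R}_{\textrm{EU}}$ from a membership matrix and the within- and between-group correlations. It assumes NumPy; the function name `build_r_eu` and the numerical values are illustrative and chosen only so that constraints (3)–(7) hold.

```python
import numpy as np

def build_r_eu(V, r_w, R_b):
    """Assemble R_EU as in Eq. (2): V R_W V' - diag(V R_W V') + V R_B V' + I_p."""
    V = np.asarray(V, dtype=float)
    VRwV = V @ np.diag(r_w) @ V.T
    within = VRwV - np.diag(np.diag(VRwV))      # remove the diagonal of V R_W V'
    return within + V @ np.asarray(R_b) @ V.T + np.eye(V.shape[0])

# Illustrative values: p = 6 variables in Q = 3 groups of two variables each.
labels = np.array([0, 0, 1, 1, 2, 2])
V = np.eye(3)[labels]                           # binary, row-stochastic membership
r_w = np.array([0.8, 0.7, 0.6])                 # within-group correlations
R_b = np.array([[0.0, 0.4, 0.2],                # Q - 1 = 2 distinct values,
                [0.4, 0.0, 0.2],                # ultrametric as required by (6)
                [0.2, 0.2, 0.0]])
R_eu = build_r_eu(V, r_w, R_b)
```

Reading the values of `R_b` in decreasing order gives the aggregation order: groups 1 and 2 merge first (0.4), and group 3 joins them at the root (0.2).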

It should be noted that constraint (7) guarantees that the variables belonging to the same group are more concordant with one another than with the variables belonging to other groups, which preserves the internal consistency of the Q variable groups. For this reason, data preprocessing is advisable. If a theory on the variable partition into Q groups exists, the UCI model can be applied in a semi-confirmatory approach, i.e., by constraining the membership of each variable to a specific group, where the polarity of the variables that are negatively related to the corresponding dimension is changed. $\textbf{R}_{\textrm{EU}}$ can also contain negative or zero values, in addition to positive ones. When this happens, the corresponding broader dimensions are defined by discordant or uncorrelated dimensions of lower levels, respectively.

An example of $\textbf{R}_{\textrm{EU}}$ and its parameters is provided in Fig. 1. Herein, four groups of variables can be detected: two variables are lumped together in the first group (first column of $\textbf{V}$), five in the second group (second column of $\textbf{V}$), three in the third group (third column of $\textbf{V}$), and the last two in the last group (fourth column of $\textbf{V}$). For simplicity, the rows of the membership matrix $\textbf{V}$ have been rearranged so that the variables belonging to the same group are contiguous. This variable partition corresponds to a block structure of $\textbf{R}_{\textrm{EU}}$, where the off-diagonal elements are equal to $_{W}r_{qq}$ ($q = 1, \ldots , 4$) if the corresponding two variables belong to the same group among the Q ones, or to $_{B}r_{qh}$ ($q, h = 1, \ldots , 4, h \ne q$) if the corresponding variables belong to two different groups and are lumped together further in the hierarchy. An example of the hierarchy corresponding to $\textbf{R}_{\textrm{EU}}$ is provided in Fig. 2a. Evidently, the order of aggregation between groups depends on the actual values of $\textbf{R}_{\textrm{B}}$ and therefore can be different from that shown in Fig. 2a.

3.1 Estimation of the UCI model

Model (1) is estimated in a least-squares framework by fitting the closest extended ultrametric correlation matrix $\textbf{R}_{\textrm{EU}}$ to the correlation matrix $\textbf{R}$. Hence, the optimization problem corresponds to minimizing the following loss function

$$\begin{aligned} F(\textbf{R}_{\textrm{W}}, \textbf{R}_{\textrm{B}}, \textbf{V}) = \Vert {\textbf{R} - \textbf{R}_{\textrm{EU}}} \Vert ^{2} \end{aligned}$$

(8)

w.r.t. the parameters of $\textbf{R}_{\textrm{EU}}$ in Eq. (2) and subject to constraints (3)–(7). The details of the parameters’ estimation are provided in Appendix A.

To find the parameter estimates $\widehat{\textbf{R}}_{\textrm{W}}$, $\widehat{\textbf{R}}_{\textrm{B}}$ and $\widehat{\textbf{V}}$ that minimize Eq. (8), the least-squares estimation is performed via an algorithm that consists of the following steps: (0, initialization) a random partition $\widehat{\textbf{V}}$ is generated from a Multinomial distribution in Q nonempty categories, each with equal probability, and the matrices reporting within and between groups correlations are computed accordingly; (1) the update of $\widehat{\textbf{V}}$, subject to (3) and (4); (2) the update of $\widehat{\textbf{R}}_{\textrm{W}}$ and $\widehat{\textbf{R}}_{\textrm{B}}$ conditionally on the current configuration of $\widehat{\textbf{V}}$ and subject to constraints (5)-(7); (3) the check on the positive semidefiniteness of the resulting extended ultrametric correlation matrix $\widehat{\textbf{R}}_{\textrm{EU}}$, which is obtained by substituting the results of Steps (1) and (2) into Eq. (2). Steps (1) to (3) are iteratively alternated, and the loss function is computed after each pass. The latter decreases, or at least does not increase, at each iteration. The algorithm stops when the difference between the loss function in two sequential iterations is negligible, i.e., lower than an arbitrary small positive constant, which is equal to $0.1^{6}$ in our experiments. Because random initialization is prone to local optima, the algorithm is run several times (e.g., 100 in our experiments), starting from different random partitions of the variable space, to increase the chance of obtaining a global minimum. However, the number of distinct solutions across the replications is limited; the algorithm is therefore stable, and local optima are not an issue when the algorithm is run 100 times.
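Some building blocks of the algorithm can be sketched in code. The following is a simplified illustration and not the estimation procedure of Appendix A: it shows the random initialization of Step (0), the loss of Eq. (8), and, for a fixed partition, the unconstrained least-squares values of the within- and between-group correlations, which are simply the means of the corresponding entries of $\textbf{R}$. Constraints (6)–(7) and the update of $\widehat{\textbf{V}}$ are deliberately left out. All function names are hypothetical, and setting the within-group value of a singleton group to 1 is an arbitrary convention of this sketch.

```python
import numpy as np

def random_membership(p, Q, rng):
    """Step (0): random partition of p variables into Q nonempty groups."""
    while True:
        labels = rng.integers(0, Q, size=p)     # equal probability per group
        if len(np.unique(labels)) == Q:         # reject partitions with empty groups
            return np.eye(Q)[labels]

def block_means(R, V):
    """Unconstrained LS values for fixed V (constraints (6)-(7) NOT imposed)."""
    Q = V.shape[1]
    idx = [np.flatnonzero(V[:, q]) for q in range(Q)]
    r_w, R_b = np.ones(Q), np.zeros((Q, Q))
    for q in range(Q):
        m = len(idx[q])
        if m > 1:                               # mean of the off-diagonal entries
            block = R[np.ix_(idx[q], idx[q])]
            r_w[q] = (block.sum() - np.trace(block)) / (m * (m - 1))
        for h in range(q + 1, Q):
            R_b[q, h] = R_b[h, q] = R[np.ix_(idx[q], idx[h])].mean()
    return r_w, R_b

def loss(R, V, r_w, R_b):
    """Eq. (8): squared Frobenius distance between R and R_EU of Eq. (2)."""
    VRwV = V @ np.diag(r_w) @ V.T
    R_eu = VRwV - np.diag(np.diag(VRwV)) + V @ R_b @ V.T + np.eye(V.shape[0])
    return float(np.sum((R - R_eu) ** 2))
```

In a multi-start scheme, these pieces would be called once per random start and the solution with the smallest loss retained.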

A detailed and complete presentation of the algorithm for the estimation of the UCI model is provided in Appendix B, also including the test on the hierarchical levels produced by $\widehat{\textbf{R}}_{\textrm{EU}}$ and the computation of the SCIs and GCI on its significant levels, as discussed in the following two sections.

3.2 Test on the difference between two levels of the hierarchy

The hierarchy corresponding to $\textbf{R}_{\textrm{EU}}$ is composed of Q disjoint variable groups that identify the first hierarchical levels (the first four internal nodes that start at the top of Fig. 2a) and $Q-1$ higher hierarchical levels that pinpoint their aggregations in pairs in broader groups, from the most concordant to the least concordant. As we will discuss in Sect. 3.3, the first Q internal nodes are crucial to unravel specific dimensions that account for the correlation among the manifest variables. Nonetheless, their aggregations, encoded in $\textbf{R}_{\textrm{B}}$, could be irrelevant, and the corresponding broader dimensions might not be statistically significant in the population. For this reason, it is pivotal to test whether each of the $Q-1$ higher levels is statistically significant in order to retain the relevant dimensions in the hierarchy.

The test introduced herein is based on the one proposed by Dunn and Clark (1969), and improved by Steiger (1980), for comparing correlations measured on the same individuals. We implement the test by analyzing the differences between the distinct values of $\textbf{R}_{\textrm{B}}$ that correspond to the aggregations between the variable groups. Starting from the last aggregation, which identifies the general concept (i.e., the root of the tree at the bottom of Fig. 2a), we test the difference between two subsequent values of $\textbf{R}_{\textrm{B}}$. Considering the example shown in Fig. 2a, the test is carried out by analyzing the difference between $_{B}r_{13}$ and $_{B}r_{12}$, and that between $_{B}r_{13}$ and $_{B}r_{34}$.

In order to assess which out of the $Q-1$ higher levels are significant or can be discarded, the following hypothesis testing is performed

$$\begin{aligned} {\left\{ \begin{array}{ll} \text {H}_{0}: {_{B}r_{qh}} - {_{B}r_{ls}} = 0 \\ \text {H}_{1}:{_{B}r_{qh}} - {_{B}r_{ls}} \ne 0 \end{array}\right. } \end{aligned}$$

where ${_{B}r_{qh}}$ and ${_{B}r_{ls}}$ are two correlations of $\textbf{R}_{\textrm{B}}$ that correspond to two sequential levels of the hierarchy. The test is performed by computing the following test statistic

$$\begin{aligned} Z = (z_{_{B}{\hat{r}}_{qh}} - z_{{_{B}{\hat{r}}_{ls}}}) \sqrt{\dfrac{n-3}{2 (1 - {\bar{s}}_{qh,ls })}} \approx N(0,1), \end{aligned}$$

(9)

where n is the sample size, $z_{_{B}{\hat{r}}_{qh}}$ and $z_{{_{B}{\hat{r}}_{ls}}}$ are the Fisher’s z-transformations (Fisher 1921) of the sample estimators ${_{B}{\hat{r}}_{qh}}$ and ${_{B}{\hat{r}}_{ls}}$, respectively, and ${\bar{s}}_{qh, ls}$ is the sample estimator of the asymptotic covariance between $z_{_{B}{\hat{r}}_{qh}}$ and $z_{{_{B}{\hat{r}}_{ls}}}$ calculated using a pooled estimate of the correlation coefficients that are equal under the null hypothesis (see Steiger 1980, for further details). If the null hypothesis is rejected according to the test statistic in Eq. (9), then the hierarchical level (and the corresponding dimension) will be retained.^{Footnote 1}
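The statistic in Eq. (9) is straightforward to compute once the covariance term is available. The sketch below uses only the Python standard library; the function name `level_difference_test` is hypothetical, and the covariance term ${\bar{s}}_{qh,ls}$ is taken as an input (`s_bar`) because its pooled formula is given in Steiger (1980) and is not reproduced here. The two-sided p-value follows from the standard normal approximation in Eq. (9).

```python
import math

def level_difference_test(r_qh, r_ls, n, s_bar):
    """Z statistic of Eq. (9) and its two-sided p-value under N(0, 1).

    r_qh, r_ls : estimated between-group correlations of two sequential levels
    n          : sample size
    s_bar      : covariance term of Eq. (9), supplied by the caller (see Steiger 1980)
    """
    z_diff = math.atanh(r_qh) - math.atanh(r_ls)        # Fisher z-transformations
    z = z_diff * math.sqrt((n - 3) / (2.0 * (1.0 - s_bar)))
    p_value = math.erfc(abs(z) / math.sqrt(2.0))        # equals 2 * (1 - Phi(|z|))
    return z, p_value
```

A level would be retained when the p-value falls below the chosen significance level, in line with the decision rule stated above.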

The test is implemented from the last level of the hierarchy (that is, from the bottom to the top of Fig. 2a), since retention of the latter is fundamental for the construction of the GCI. Moreover, this choice is motivated by the goal of identifying latent dimensions, which are obtained by merging two lower-level dimensions that are as highly correlated as possible. Therefore, if the difference between two subsequent hierarchical levels is not statistically significant, there is no reason to retain the lower level. Figure 2b displays an explanatory example of the effect of the test applied to the hierarchy obtained by the UCI model. The application of the test reveals only one statistically significant level in the hierarchy ($_{B}r_{12}$), in addition to the last level corresponding to the GCI ($_{B}r_{13}$); instead, the difference between $_{B}r_{13}$ and $_{B}r_{34}$ turns out not to be statistically significant and the corresponding hierarchical level is discarded. In this example, no other differences between hierarchical levels must be tested. The testing procedure stops when all the possible differences between two sequential hierarchical levels have been tested, or equivalently when further tests on differences would only include the first Q internal nodes.

3.3 Specific and General Composite Indicators scores

The test illustrated in Sect. 3.2 unravels which of the $Q-1$ higher levels resulting from $\widehat{\textbf{R}}_{\textrm{EU}}$ are statistically significant. According to its conclusion, the dimensions associated with the first Q internal nodes and the $H \leq Q-1$ statistically significant higher levels must be quantified. The quantification results in the definition of Q first-level^{Footnote 2} SCIs, $H-1$ SCIs of higher level associated with broader dimensions, and a GCI that describes the multidimensional phenomenon of interest. The SCIs and GCI make it possible to evaluate quantitatively the behaviors of units (e.g., countries) with respect to a dimension and/or a phenomenon and to make comparisons among them.

We can differentiate between the construction of first-level SCIs, higher-level SCIs and GCI as follows.

First-level SCIs: the first Q SCIs, say $\textbf{Y}_{Q}$, which correspond to the ones directly associated with manifest variables, are computed by selecting the principal component of maximum variance for each variable group. Therefore, for each $q =1, \ldots , Q$, the manifest variables belonging to the qth group are considered to compute the principal component of maximum variance for the group (i.e., the qth column of $\textbf{Y}_Q$). It should be noted that a reduced number of manifest variables is involved in the quantification of each first-level SCI since the Q variable groups are disjoint. For this reason, the loading matrix $\textbf{A}_{Q}$ that contains the weight of each manifest variable on the corresponding component is sparse. Due to condition (4), each row of $\textbf{A}_{Q}$ has only one nonnull element, which corresponds to the qth column of $\textbf{V}_{Q}$ s.t. $v_{jq} = 1$, $q \in \{1, \ldots , Q\}$.
Higher-level SCIs and GCI: for each higher hierarchical level, the corresponding SCI is computed by selecting the principal component of maximum variance for the SCIs of the lower level that compose it. The same holds for the GCI.

Looking at Fig. 2b, the first-level SCIs are those corresponding to the first four groups (from the top of the figure downward), each of which is calculated as the principal component of maximum variance for the manifest variables that define it (e.g., the second group is associated with variables 3, 4, 5, 6, 7); then the higher-level SCI, which is unique in this case, is obtained as the principal component of maximum variance resulting in a combination of the first-level SCIs of the groups 1 and 2; and finally, the GCI corresponding to the last aggregation is calculated as the principal component of maximum variance obtained considering the first-level SCIs associated with groups 3 and 4 and the higher-level SCI previously computed.
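The bottom-up computation for the hierarchy of Fig. 2b can be sketched as follows, assuming NumPy. The helper name `first_pc_scores`, the standardization of the inputs (consistent with working on correlations), the sign convention and the random placeholder data are all assumptions of this illustration; the data are not the waste management data.

```python
import numpy as np

def first_pc_scores(Z):
    """Scores on the principal component of maximum variance of the columns of Z."""
    Z = np.asarray(Z, dtype=float)
    Zs = (Z - Z.mean(axis=0)) / Z.std(axis=0, ddof=1)   # standardize (assumption)
    _, _, Vt = np.linalg.svd(Zs, full_matrices=False)
    w = Vt[0] if Vt[0].sum() >= 0 else -Vt[0]           # fix the sign (assumption)
    return Zs @ w

# Partition of Fig. 1: groups of 2, 5, 3 and 2 variables (p = 12).
labels = np.array([0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3])
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 12))                       # placeholder data only

# First-level SCIs: one principal component per variable group.
Y_Q = np.column_stack([first_pc_scores(X[:, labels == q]) for q in range(4)])
# Higher-level SCI: first component of the SCIs of groups 1 and 2.
sci_12 = first_pc_scores(Y_Q[:, [0, 1]])
# GCI: first component of the higher-level SCI and the SCIs of groups 3 and 4.
gci = first_pc_scores(np.column_stack([sci_12, Y_Q[:, 2], Y_Q[:, 3]]))
```

Each level takes as input only the scores of the level directly below it, so the manifest variables enter the GCI solely through the first-level SCIs.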

The choice of computing the principal components of maximum variance on the SCIs of lower levels is motivated by the aim of stressing the importance of the hierarchy. Indeed, if each higher-level SCI were directly computed on the manifest variables, it would not take the levels of the hierarchy into account. Instead, the objective of the model is to obtain consistent and reliable first-level SCIs representing groups of highly positively correlated manifest variables and to build a hierarchy on them.
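The construction described above for Fig. 2b can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions: the data are simulated, the group sizes (2, 5, 3, 3) are hypothetical apart from the second group having five variables, and the helper `first_pc` is not part of the authors' software.

```python
import numpy as np

def first_pc(X):
    """Scores on the principal component of maximum variance of the columns of X."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # z-scores
    R = (Z.T @ Z) / (len(Z) - 1)                        # correlation matrix
    _, vecs = np.linalg.eigh(R)                         # eigenvalues in ascending order
    w = vecs[:, -1]                                     # loadings of the largest eigenvalue
    if w.sum() < 0:                                     # fix the arbitrary sign
        w = -w
    return Z @ w

rng = np.random.default_rng(0)
n = 300
common = rng.normal(size=n)          # hypothetical general factor
sizes = [2, 5, 3, 3]                 # hypothetical sizes of the Q = 4 disjoint groups
blocks = []
for size in sizes:
    factor = 0.6 * common + 0.8 * rng.normal(size=n)
    blocks.append(factor[:, None] + 0.4 * rng.normal(size=(n, size)))

# First-level SCIs: one principal component per disjoint variable group.
Y_Q = np.column_stack([first_pc(B) for B in blocks])
# Higher-level SCI: computed on the first-level SCIs of groups 1 and 2.
higher = first_pc(Y_Q[:, [0, 1]])
# GCI: computed on the SCIs of groups 3 and 4 and on the higher-level SCI.
gci = first_pc(np.column_stack([Y_Q[:, 2], Y_Q[:, 3], higher]))
```

Note that each level only sees the indicators of the level below it, which is what makes the resulting GCI depend on the hierarchy rather than directly on the manifest variables.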

To define the variable groups and the corresponding first-level SCIs, Q must be determined. Indeed, the hierarchy obtained by $\widehat{\textbf{R}}_{\textrm{EU}}$ depends on the choice of Q, which identifies specific dimensions the multidimensional phenomenon is composed of. Q can be selected according to Kaiser’s method (Kaiser 1960) and/or the unidimensionality (Cavicchia and Vichi 2021) of the first-level SCIs, among others. The latter corresponds to the evaluation of the second largest eigenvalue of the correlation sub-matrix of each variable group associated with a first-level SCI: if this is less than 1, then the corresponding SCI is unidimensional. Therefore, the optimal Q is chosen from 1 up to the value that corresponds to the first Q unidimensional first-level SCIs. The two aforementioned methods are used to choose the optimal number of first-level SCIs in the application presented in Sect. 4.2.
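The two criteria mentioned above can be illustrated with a short NumPy sketch. The function names and the toy equicorrelation matrices are assumptions made for illustration; the sketch only checks the criteria on given correlation matrices and does not search over Q.

```python
import numpy as np

def is_unidimensional(R_sub):
    """True if the second largest eigenvalue of the group sub-matrix is below 1."""
    eig = np.linalg.eigvalsh(R_sub)          # ascending order
    return bool(eig.size < 2 or eig[-2] < 1.0)

def kaiser_count(R):
    """Number of eigenvalues of R greater than 1 (Kaiser's rule)."""
    return int(np.sum(np.linalg.eigvalsh(R) > 1.0))

def equicorrelation(p, rho):
    """Toy p x p correlation matrix with constant off-diagonal value rho."""
    return np.full((p, p), rho) + (1 - rho) * np.eye(p)

R_one = equicorrelation(3, 0.7)              # one homogeneous group
zeros = np.zeros((3, 3))
R_two = np.block([[R_one, zeros],            # two uncorrelated groups merged
                  [zeros, R_one]])

print(is_unidimensional(R_one), is_unidimensional(R_two), kaiser_count(R_two))
```

In the toy example the single group passes the check, whereas the merged matrix fails it and Kaiser's rule suggests two components.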

3.4 Cleaning composite indicators for external information

The researcher could be interested in considering additional information to build the CI system. In fact, the ranking of units based on the GCI (and SCIs) can be affected by some unit features that have not been considered in the analysis. In order to include external information, Takane and Shibayama (1991) proposed a decomposition of the original data into several components (see also Hunter and Takane 2002, for various applications of the proposed method). Specifically, we focus on the inclusion of auxiliary information on units, collected in the matrix ${\textbf{G}}$ of dimension $(n \times r)$, where r is the number of external variables (i.e., external with respect to those of the original analysis). The model proposed by Takane and Shibayama (1991) is made up of two analyses: the external analysis and the internal analysis. In the first, the data matrix $\textbf{X}$ is decomposed into a term that refers to what can be explained by ${\textbf{G}}$, thus including the effect of external information, and another term that concerns what cannot be explained by ${\textbf{G}}$, thus it is net of the effect of ${\textbf{G}}$. In the latter, Principal Component Analysis (PCA, Pearson 1901; Hotelling 1933) is applied to some of the components or each component separately. In our case, the internal analysis is replaced by considering the UCI model.

We can summarize the procedure to include external information into the UCI model as follows.

External Analysis: The data matrix $\textbf{X}$ is decomposed into two parts using the multivariate regression model, that is,
$$\begin{aligned} \textbf{X}= \textbf{G}\textbf{C} + \textbf{E}, \end{aligned}$$
(10)
where $\widehat{\textbf{C}} = (\textbf{G}^{\prime }\textbf{G})^{-1} \textbf{G}^{\prime }\textbf{X}$. By substituting $\widehat{\textbf{C}}$ into Eq. (10), we obtain
$$\begin{aligned} \textbf{X}= \textbf{P}_{G} \textbf{X}+ \textbf{Q}_{G} \textbf{X}, \end{aligned}$$
where $\textbf{P}_{G} = \textbf{G}(\textbf{G}^{\prime }\textbf{G})^{-1} \textbf{G}^{\prime }$ and $\textbf{Q}_{G} = \textbf{I} - \textbf{P}_{G}$ that, multiplied by $\textbf{X}$, represent the original data with the inclusion of the effect of external information and net of this effect, respectively.
Internal Analysis: The correlation matrices of $\textbf{P}_{G} \textbf{X}$ and $\textbf{Q}_{G} \textbf{X}$ are computed, i.e., $\textbf{R}^{(\textbf{P}_{G})}$ and $\textbf{R}^{(\textbf{Q}_{G})}$, respectively. The UCI model could be applied on both separately.

In Sect. 4.2.3, we will focus on $\textbf{R}^{(\textbf{Q}_{G})}$ in order to compute a CI system and evaluate differences in the GCI and SCI rankings of units net of the effect of additional information, that can affect the unit behavior towards the phenomenon under study.
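A minimal sketch of the external analysis in Eq. (10), assuming simulated data in place of the real matrices: $\textbf{X}$ is split into the part explained by $\textbf{G}$ and the residual part, and the correlation matrix of the residual part is the input for the internal analysis. The dimensions below mirror the application (40 units, 13 variables, 2 external variables), but the values are random.

```python
import numpy as np

rng = np.random.default_rng(1)
n, J, r = 40, 13, 2
G = rng.normal(size=(n, r))                                  # simulated external variables
X = G @ rng.normal(size=(r, J)) + rng.normal(size=(n, J))    # simulated data matrix

# Least-squares coefficients C_hat = (G'G)^{-1} G'X, computed without forming the inverse.
C_hat = np.linalg.solve(G.T @ G, G.T @ X)

PX = G @ C_hat        # P_G X: part of X explained by G
QX = X - PX           # Q_G X: part of X net of the effect of G

# Correlation matrix of the residual part, to be passed to the internal analysis.
R_Q = np.corrcoef(QX, rowvar=False)
```

By construction the two parts add up to $\textbf{X}$ and the residual part is orthogonal to the columns of $\textbf{G}$.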

4 Applications

We carry out two analyses on synthetic and real data to assess the performance of the UCI model. In Sect. 4.1, we provide a simulation study where we compare our proposal with other existing methodologies. The UCI model is then applied to a real data set to study waste management in Italy in Sect. 4.2.

4.1 Synthetic data analysis

The performance of the UCI model in detecting hierarchical structures of variables is evaluated in comparison with the existing methodologies based upon sequential applications of PCA followed by oblique rotation methods, such as oblimin, quartimin, and geomin.

Two different scenarios are structured: one with a small-scale correlation matrix and a small number of groups ($J = 30$ and $Q = 4$, respectively, Scenario 1), and another one with a large-scale correlation matrix and a large number of groups ($J = 100$ and $Q = 10$, respectively, Scenario 2). The correlation matrices are generated according to Eq. (1). Specifically, the three parameters of $\textbf{R}_{\textrm{EU}}$ in Eq. (2) are obtained as follows: $\textbf{V}$ is randomly generated from a Multinomial distribution in Q categories each with equal probability, where categories are not empty; the diagonal values of $\textbf{R}_{\textrm{W}}$ are generated as ${}_{W}r_{qq} = 0.85 + 0.1a$, where $a \sim N(0, 1), q = 1, \ldots , Q$; and the off-diagonal values of $\textbf{R}_{\textrm{B}}$ are set as ${}_{B}r_{qh} \in [0.4, 0.8]$, $q, h = 1, \ldots , Q, h \ne q$, by keeping constant the difference between two sequential correlation coefficients and such that constraint (6) holds. In Scenario 1, the lowest value of $\textbf{R}_{\textrm{B}}$ (the last aggregation) is set to be negative. For each scenario, three levels of error are fixed: $\sigma _{\textrm{E}}^\text {L} = 0.1$ (low error), $\sigma _{\textrm{E}}^\text {M} = 0.5$ (medium error), and $\sigma _{\textrm{E}}^\text {H} = 0.9$ (high error). Error levels affect the generation of the error matrix $\textbf{E}$, which is drawn from a uniform distribution in the interval $[0, \sigma _{\textrm{E}}]$, symmetrized, and made positive semidefinite. The effect of the error level on the generation of the correlation matrix is shown in Fig. 3, where it can be seen that the variable groups and their hierarchical structure become less visible as the error level increases. The properties of the correlation matrix resulting from Eq. (1), i.e., the positive semidefiniteness and the appropriate range for its values, are verified. For each scenario and error level, we generate 200 correlation matrices.
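Two of the generation steps described above can be sketched as follows. This is a partial illustration under assumptions: rejection sampling is one possible way to guarantee non-empty categories, the function names are hypothetical, and the step that makes $\textbf{E}$ positive semidefinite is not reproduced because the text does not specify how it is carried out.

```python
import numpy as np

def random_membership(J, Q, rng):
    """Binary J x Q membership matrix with equal-probability, non-empty groups."""
    while True:
        labels = rng.integers(0, Q, size=J)       # one category per variable
        if len(np.unique(labels)) == Q:           # reject draws with empty groups
            V = np.zeros((J, Q), dtype=int)
            V[np.arange(J), labels] = 1
            return V

def symmetric_uniform_error(J, sigma, rng):
    """Uniform draws on [0, sigma], symmetrized (PSD adjustment not included)."""
    E = rng.uniform(0.0, sigma, size=(J, J))
    return (E + E.T) / 2

rng = np.random.default_rng(2)
V = random_membership(J=30, Q=4, rng=rng)          # Scenario 1 dimensions
E = symmetric_uniform_error(J=30, sigma=0.1, rng=rng)
```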

The comparison of the hierarchical structures pinpointed by our proposal and the competitors is carried out according to the Adjusted Rand Index (ARI, Hubert and Arabie 1985), which evaluates the similarity between the generated and the estimated partitions of variables. The ARI equals 1 under perfect agreement between the generated and the estimated membership matrix, has expected value 0 for random partitions, and can take negative values; it is computed for each hierarchical level. For the UCI model the variable partitions in q, $q = Q-1, \ldots , 2$, groups are derived from the one in Q groups detected in $\textbf{V}$ and the aggregations defined in $\textbf{R}_{\textrm{B}}$, whereas for the competitors they are obtained by assigning each variable (component) to the component (higher-order component) on which it loads most in absolute terms. It should be noted that the last aggregation is not taken into account, since it corresponds to the group containing all the variables. Moreover, the Mean Squared Error (MSE) of the parameters $\textbf{R}_{\textrm{W}}$ and $\textbf{R}_{\textrm{B}}$ is computed for all scenarios.
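For reference, the ARI between two partitions of the same variables can be computed from pair counts, as in the following sketch. The function name and the toy label vectors are illustrative assumptions.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index (Hubert and Arabie 1985) between two partitions."""
    n = len(labels_a)
    # Pairs of objects placed together in both partitions.
    together = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs of objects placed together within each partition.
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / comb(n, 2)
    maximum = (pairs_a + pairs_b) / 2
    if maximum == expected:            # degenerate case, e.g. a single group in both
        return 1.0
    return (together - expected) / (maximum - expected)

same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])      # identical up to relabeling
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])   # systematic disagreement
print(same, crossed)
```

The first call returns 1 because the index ignores label names; the second is negative because the two partitions agree less than would be expected by chance.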

The results of the simulation study in terms of the mean of the ARI across the samples for the proposal and the competitors are provided in Table 1, whereas Table 2 shows the results of the MSE for the parameters of the UCI model. The proposed model achieves good results in terms of the mean of the ARI in all scenarios and for each level of error, outperforming the competitors. As expected, the performance of the UCI model, as well as that of competitors, decreases as the error level increases, as the latter tends to mask the hierarchical structure generated over the variables (Fig. 3). It is worth noting that, unlike for the UCI model, the mean of the ARI for the competitors usually declines as q decreases, which stresses the difficulty of correctly detecting hierarchical relationships of variables with sequential models, even if they perfectly recover the variable partition in Q groups, as in the low error case. The UCI model also shows good performance in terms of the MSE of $\textbf{R}_{\textrm{W}}$ and $\textbf{R}_{\textrm{B}}$, as shown in Table 2.

Table 1 Mean of the ARI for the UCI model, PCA + Oblimin, PCA + Quartimin, PCA + Geomin for each hierarchical level


Table 2 MSE for the UCI model parameters


4.2 Waste management in the largest Italian municipalities

In this section, the UCI model is applied to study waste management in the 40 largest Italian municipalities by identifying the latent dimensions and the corresponding SCIs that characterize it. The data set is presented in Sect. 4.2.1 and two analyses are performed. In the first one, the UCI model is implemented on the data set without considering any further information (Sect. 4.2.2); external variables are included in the second analysis to take into account characteristics of Italian municipalities that could influence their performance in waste management (Sect. 4.2.3).

4.2.1 Data

The data used for waste management analysis were collected from Eurostat, Joint Research Centre and Istituto Superiore per la Protezione e la Ricerca Ambientale for the 40 largest Italian municipalities (i.e., municipalities with more than 100,000 inhabitants), 22 in the north, 8 in the center and 10 in the south and islands, in 2019 (Table 3). The data set consists of 13 manifest variables (Table 4) that are related to two main dimensions: costs (from 1 to 5) and quantities (from 6 to 13). For comparability reasons, the population size was used to normalize the manifest variables, when necessary. A few missing values occurred in the data set. They were Missing Completely At Random and were imputed via the K-nearest neighbors method by setting $K = 4$ and using the Euclidean distance. The manifest variables were standardized to z-scores to eliminate the effect of different measurement units.
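The preprocessing steps mentioned above (K-nearest neighbors imputation with $K = 4$ and Euclidean distance, followed by standardization) could look like the sketch below. The exact imputation variant used by the authors is not stated, so this is an assumption: donors are restricted to complete rows, distances are computed on the observed columns only, and the toy matrix is invented.

```python
import numpy as np

def knn_impute(X, k=4):
    """Fill missing entries with the mean of the k nearest complete rows."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    complete = X[~np.isnan(X).any(axis=1)]             # candidate donor rows
    for i in np.flatnonzero(np.isnan(X).any(axis=1)):
        miss = np.isnan(X[i])
        # Euclidean distance on the columns observed for row i.
        d = np.sqrt(((complete[:, ~miss] - X[i, ~miss]) ** 2).sum(axis=1))
        nearest = complete[np.argsort(d)[:k]]
        filled[i, miss] = nearest[:, miss].mean(axis=0)
    return filled

def zscore(X):
    """Column-wise standardization to zero mean and unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Toy data (invented): the last row has a missing second variable.
X_raw = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0],
                  [4.0, 40.0], [100.0, 1000.0], [2.5, np.nan]])
X_filled = knn_impute(X_raw, k=4)
Z = zscore(X_filled)
```

In the toy matrix the four nearest donors of the incomplete row are the first four rows, so the missing value is replaced by the mean of their second column.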

Table 3 List of the 40 largest Italian municipalities


Table 4 List of the 13 manifest variables


In addition to the 13 manifest variables, 2 variables were included in the analysis as additional information for units: Density, which was computed as the ratio between the population size and the surface of the municipality (i.e., inhabitants per $\hbox {km}^2$), and Touristic rate, which was calculated as the total number of attendees in different accommodations over the population size of the municipality (i.e., total number of attendees per inhabitant). The municipalities with the highest density are Napoli, Milano, Torino, Palermo, Monza, Firenze, Pescara, Bergamo, Bologna, and Bari, while those with the highest touristic rate are Rimini, Venezia, Firenze, Ravenna, Roma, Verona, Trento, Milano, Bologna, and Padova (the Density and Touristic rate distributions are given in Fig. 1 of the Online Resource). Including these two variables allows us to take into account the influence of the density and touristic flows of a municipality on waste management, as we will see in Sect. 4.2.3.

4.2.2 The UCI of waste management

Before applying the UCI model to the data set described in the previous section, the optimal number of first-level SCIs was selected. We determined Q according to the two different methods presented in Sect. 3.3: Kaiser’s rule and unidimensionality. Both methods returned 4 as optimal Q.

Table 5 Results of the UCI model (loadings, unidimensionality, and Cronbach’s $\alpha$) in defining the dimensions of waste management


The UCI model unravels one statistically significant higher level (see Note 3) in the hierarchy, in addition to those corresponding to the first-level SCIs and the GCI of Waste Management (WM), as shown in Fig. 4. As reported in Table 5, the first first-level SCI, which we called Mixed Waste Costs (MWC), is characterized by Costs of mixed waste collection and transport and Total costs of mixed waste management, which are both related to costs of mixed waste management. The second first-level SCI, named Separated Waste Costs (SWC), is defined by the three variables related to the costs of separated waste management, i.e., Costs of separated waste collection and transport, Total costs of separated waste management and Percentage of costs of separated waste management over the total costs. The third first-level SCI is characterized by Organic waste collection, Glass waste collection, Metal waste collection, Plastic waste collection, Percentage of separated waste over the total waste, and thus called Household Separated Waste (HSW); and the fourth first-level SCI is named Large Packaging (LP), as it is defined by Paper waste collection, Wood waste collection and Waste from electrical and electronic equipment. All first-level SCIs turn out to be unidimensional and reliable according to Cronbach’s $\alpha$ (Cronbach 1951), since all values are greater than 0.7 (Table 5), which is considered the threshold for an acceptable value (Kline 2000). A higher-level SCI is obtained by merging SWC, HSW, and LP. This represents a latent dimension related to recycling (both costs and quantities), called Separated Waste (SW), which is mainly influenced by HSW and LP (see loadings in Table 5). Figure 5 shows positive relationships between SWC and HSW, and between SWC and LP, that is, large amounts of separated waste go hand in hand with high costs of separated waste management, for example, for collection and transportation.
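The reliability check mentioned above relies on Cronbach's $\alpha$, which for a group of $k$ variables equals $\frac{k}{k-1}\big (1 - \sum _{j} s_j^2 / s_T^2\big )$, where $s_j^2$ are the variable variances and $s_T^2$ is the variance of their sum. A minimal sketch on simulated data follows; the function name and the data-generating values are assumptions.

```python
import numpy as np

def cronbach_alpha(X):
    """Cronbach's alpha for the columns (items) of X."""
    X = np.asarray(X, dtype=float)
    k = X.shape[1]
    item_variances = X.var(axis=0, ddof=1).sum()     # sum of the item variances
    total_variance = X.sum(axis=1).var(ddof=1)       # variance of the item sum
    return k / (k - 1) * (1 - item_variances / total_variance)

rng = np.random.default_rng(3)
factor = rng.normal(size=500)                         # hypothetical latent dimension
items = factor[:, None] + 0.5 * rng.normal(size=(500, 3))
alpha = cronbach_alpha(items)
reliable = alpha > 0.7                                # threshold from Kline (2000)
```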

The GCI of WM is then obtained by lumping together MWC (one of the first-level SCIs) and SW (the higher-level SCI), where the latter loads more on the GCI while the former has a negative relationship with it (Table 5). This means that the higher the quantities and costs of separated waste and the lower the costs of mixed waste, the better the waste management of a municipality. In fact, waste segregation is essential for proper recycling and avoids the use of landfills for waste disposal. Therefore, Italian municipalities that produce more separated waste and also invest more in it are those with the highest performance in waste management. It should be noted that the correlation between WM and SW (Fig. 5) is extremely high, and consequently we can evaluate the relationships between the GCI and the first-level SCIs of SWC, HSW and LP considering those between the latter and SW.

Table 6 Rankings based on normalized GCI and SCIs scores. Partition into groups according to thresholds: normalized score $\ge 0.60$; normalized score $\ge 0.30$ and $<0.60$; normalized score $< 0.30$


In Table 6, the rankings based on the GCI and SCIs are provided. They were obtained after normalizing the composite indicators by the Min-Max transformation. The rankings are substantially different, meaning that the behavior of each municipality can differ in the dimensions of WM. Taking into account the group of the best municipalities (reported in bold italic in Table 6), no municipality is in that group for all the SCIs and GCI, except for Ferrara, Rimini and Reggio nell’Emilia. If we consider the group of the worst municipalities (reported in italic in Table 6) on the GCI, we can notice that Catania is also in that group for all the SCIs, Genova as well except for MWC, whereas Palermo, Foggia and Taranto are in this group for two out of the four SCIs (HSW and LP). Roma, Venezia, Milano, Firenze, Napoli, Torino, Bologna, Verona, and Bari, cities classified by ISTAT as “large”, are in the intermediate municipality group for the GCI. Other “large” cities such as Genova, Palermo, and Catania behave differently across the SCIs. For instance, Roma is in the group of the intermediate municipalities for MWC and HSW, in the group of the best municipalities for SWC and in the group of the worst municipalities for LP. Although, generally speaking, smaller quantities of waste are preferable, it should be noted that Percentage of separated waste over the total waste has the highest loading on HSW. For this reason, we can state that the different position of Roma in the rankings of SWC and HSW could be due to an investment by this municipality in separated waste that does not yet correspond to a high level of separated waste collection in terms of quantities.
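The normalization and grouping used for Table 6 can be sketched as follows. The thresholds (0.60 and 0.30) are those reported in the table caption, while the scores and the function names are invented for illustration.

```python
def min_max(scores):
    """Min-Max transformation of a list of scores onto [0, 1]."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def performance_group(normalized_score):
    """Assign a unit to a group according to the thresholds of Table 6."""
    if normalized_score >= 0.60:
        return "best"
    if normalized_score >= 0.30:
        return "intermediate"
    return "worst"

raw_scores = [-2.0, -1.0, 0.0, 1.5, 3.0]          # hypothetical GCI scores
normalized = min_max(raw_scores)
groups = [performance_group(s) for s in normalized]
print(groups)
```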

The territorial distribution of the normalized scores of the GCI and the SCIs is represented in Fig. 6. For readability, the map of Italy displays provinces instead of the municipalities to which the data refer; each municipality, however, is the main city of the corresponding province. The northern municipalities show a higher WM performance than the southern ones (Fig. 6a and Fig. 2 in Online Resource), which reflects their better behavior in separated waste and, in particular, in separated waste collection (Fig. 6e). It is noteworthy that the northern municipalities are also those with the lowest values of MWC (Fig. 6c), whereas LP in Fig. 6f shows values lower than those of the other SCIs in Italy. The latter may be due to the fact that the variable that loads most on LP is Wood waste collection (Table 5), whose collection also depends on specific characteristics of the municipalities, e.g., the presence of green areas.

However, several features of the municipalities can affect their waste management. In fact, if we consider the 10 municipalities with the highest density (see Sect. 4.2.1), 7 are in the group of the intermediate municipalities for WM and 1 is in that of the worst municipalities for WM (i.e., Palermo), whereas 6 out of the 10 municipalities with the highest touristic rate (see Sect. 4.2.1) are in the intermediate group of the WM ranking (Table 6).

In the next section, we analyze the UCI model applied to the data set net of the effect of Density and Touristic rate, which can affect the municipalities’ waste management and make it more difficult.

4.2.3 Influence of external variables

Table 7 Ranking based on the normalized scores of WM net of the effect of external information on municipalities, compared to the ranking based on WM. Partition into groups according to the thresholds: normalized score $\ge 0.60$; normalized score $\ge 0.30$ and $< 0.60$; normalized score $< 0.30$


As introduced in Sect. 3.4, we considered the effect of external variables which can affect the behavior of the municipalities in waste management. In this case, the matrix ${\textbf{G}}$ consists of the variables Density and Touristic rate measured in the 40 largest Italian municipalities. The goal of this analysis is to evaluate WM net of the effect of the Density and Touristic rate and to pinpoint differences in its ranking. Therefore, we focus on ${\textbf{Q}}_{G}\textbf{X}$. To compare the results, we fixed the membership of the 13 variables with the corresponding first-level SCI, according to the partition obtained in Sect. 4.2.2, and we let the UCI model identify the hierarchy and its statistically significant levels. Indeed, an important aspect of the UCI model is that it provides the possibility to fix some (or all) relationships between manifest variables and first-level SCIs in a semi-confirmatory approach when a theoretical framework on the phenomenon under study is known a priori or a previous analysis has already been carried out. The comparison can provide interesting information on differences among municipalities generated by effects external to the phenomenon analyzed. We thus implemented a semi-confirmatory approach for the UCI model, where only the first-level SCIs are fixed, as well as their number ($Q =4$).

Table 8 Rankings based on the normalized scores of the four SCIs net of the effect of external information on municipalities. Partition into groups according to the thresholds: normalized score $\ge 0.60$; normalized score $\ge 0.30$ and $< 0.60$; normalized score $< 0.30$


In this case, the UCI model does not pinpoint higher-level SCIs. Thus, only two levels exist in the hierarchy: one corresponding to the fixed first-level SCIs, and the other one to the GCI of WM. Looking at Fig. 7, it can be highlighted that the three first-level SCIs related to separated waste remain the most important in the definition of waste management, even if the loading of LP is reduced to 0.48, while that of SWC increases to 0.45, with respect to the loadings obtained without considering external information. The relationship most affected by the removal of the effect of external information is that between the GCI and MWC. Indeed, the loading of MWC shrinks in magnitude to $-0.01$, which virtually removes its impact on the definition of WM. It must be considered that both density and tourism have an impact on mixed waste. Specifically, density affects the production of mixed waste, as higher density limits the possibility of implementing door-to-door recycling collection due to smaller spaces. Furthermore, tourism waste is also mainly characterized by mixed waste and is therefore associated with higher costs. The tourist destinations often correspond to the cities’ historic centers, which are usually pedestrianized or restricted traffic zones. In the latter, mixed waste costs significantly increase because of the need to use vehicles of reduced dimensions, whose operating cost is higher than that of standard vehicles, and the higher presence of mixed waste bins.

Rankings based on the normalized scores of WM and first-level SCIs net of the effect of external information are shown in Tables 7 and 8, respectively. Large cities such as Milano, Torino, Napoli, Venezia, Firenze, and Bologna, having the highest values for one or both external variables and being in the group of intermediate municipalities for WM in the previous analysis, belong to the group of the best municipalities for WM after removing the effect of Density and Touristic rate. This result supports the hypothesis that the density of a municipality and the flows of tourists make waste management more difficult, as well as waste separation, regardless of the territorial distribution of the municipalities (see also Fig. 3 of the Online Resource). On the contrary, the bottom end of Table 7, that is, the group of the worst municipalities, remains substantially unchanged. Moreover, considering separated waste (costs and quantities), Napoli is in the group of the intermediate municipalities for SWC and HSW, and in the group of the worst municipalities for LP if no external information is considered, whereas if the latter is included in the analysis Napoli belongs to the group of the best municipalities for SWC and HSW, and to the group of the intermediate municipalities for LP.

5 Conclusions

In this paper, we propose the UCI model to reconstruct the main hierarchical relationships among the manifest variables, which are represented by the correlation matrix. Unlike the existing hierarchical methods, the proposal is simultaneous and minimizes an overall objective function for obtaining the hierarchical solution. To minimize the least-squares loss function, we present a block-coordinate descent algorithm. Moreover, the UCI model is characterized by the introduction of a statistical test for the hierarchical levels to retain in the hierarchy. The test leads to a further reduction in the number of CIs to include in the model by building a parsimonious CI system for the phenomenon studied.

Notwithstanding the fact that the model selection problems are addressed in the paper by providing indications on the appropriate selected number of first-level SCIs, it remains for future studies to consider other information criteria useful for such model selection.

The proposal has several applications in different fields, for example, to study climate change and its dimensions, or to build a model-based CI system to track the Sustainable Development Goals (Heads of State and Government and High Representatives 2015). In this paper, the UCI model is used to investigate waste management in the 40 largest Italian municipalities, showing its main characteristics and its potential to represent multidimensional hierarchical phenomena. The model provides a hierarchical system of CIs and corresponding rankings, which might be used for policy actions. An additional analysis that excludes the effect of two important external variables, namely Density and Touristic rate, shows another important feature of the model.

Data availability

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Notes

The rejection of $\hbox {H}_{{0}}$ occurs if $P(Z \ge |z_{obs}|) + P(Z \le -|z_{obs}|) \le \alpha$, where $z_{obs}$ is the realization of the test statistic Z and $\alpha$ is the level of significance of the test set a priori.
The internal nodes associated with the first Q SCIs have different levels of correlation, which correspond to the diagonal elements of $\textbf{R}_{\textrm{W}}$.
Significance level of the test: 0.05.

References

Anderson TW, Rubin H (1956) Statistical inferences in factor analysis. Proceedings of the Third Symposium on Mathematical Statistics and Probability 5:111–150
Cailliez F (1983) The analytical solution of the additive constant problem. Psychometrika 48(2):305–308
Cattell RB (1978) Higher-order factors: models and formulas. Springer, US, Boston, MA, pp 192–228
Cavicchia C, Vichi M (2021) Statistical model-based composite indicators for tracking coherent policy conclusions. Soc Indic Res 156(2):449–479
Cavicchia C, Vichi M (2022) Second-order disjoint factor analysis. Psychometrika 87(1):289–309
Cavicchia C, Vichi M, Zaccaria G (2020) The ultrametric correlation matrix for modelling hierarchical latent concepts. Adv Data Anal Classif 14(4):837–853
Cavicchia C, Sarnacchiaro P, Vichi M (2021) A composite indicator for the waste management in the eu via hierarchical disjoint non-negative factor analysis. Socio-Econ Plan Sci 73:100832
Cavicchia C, Vichi M, Zaccaria G (2022) Gaussian mixture model with an extended ultrametric covariance structure. Adv Data Anal Classif 16(2):399–427
Cronbach LJ (1951) Coefficient alpha and the internal structure of tests. Psychometrika 16(3):297–334
Dellacherie C, Martinez S, San Martin J (2014) Inverse M-matrices and ultrametric matrices. Lecture Notes in Mathematics, Springer International Publishing
Diaz-Farina E, Díaz-Hernández JJ, Padrón-Fumero N (2020) The contribution of tourism to municipal solid waste generation: A mixed demand-supply approach on the island of Tenerife. Waste Manage 102:587–597
Dunn J, Clark VA (1969) Correlation coefficients measured on the same individuals. J Am Stat Assoc 64(325):366–377
European Commission (2010) Commission regulation (EU) No 849/2010 of 27 September 2010 amending Regulation (EC) No 2150/2002 of the European parliament and of the Council on waste statistics. Off J Eur Union 53(L 253):2–41, https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32010R0849&from=EN
European Parliament, Council of the European Union (1999) Council Directive 1999/31/EC of 26 April 1999 on the landfill of waste. Off J Eur Communities 42(L 182):1–39, https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=OJ:L:1999:182:FULL&from=EN
European Parliament, Council of the European Union (2018) Directive (EU) 2018/850 of the European Parliament and the Council of 30 May 2018 amending Directive 1999/31/EC on the landfill of waste. Off J Eur Union 61(L 150):100–108, https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=OJ:L:2018:150:TOC
Fisher RA (1921) On the probable error of a coefficient of correlation deduced from a small sample. Metron 1:3–32
Heads of State and Government and High Representatives (2015) Transforming our world: the 2030 agenda for sustainable development, a/res/70/1. Tech. rep., United Nations, https://sustainabledevelopment.un.org/content/documents/21252030%20Agenda%20for%20Sustainable%20Development%20web.pdf
Hotelling H (1933) Analysis of a complex of statistical variables into principal components. J Educ Psychol 24(6):417–441, 498–520
Hubert L, Arabie P (1985) Comparing partitions. J Classif 2(1):193–218
Hunter MA, Takane Y (2002) Constrained principal component analysis: various applications. J Educ Behav Stat 27(2):105–145
Kaiser HF (1960) The application of electronic computers to factor analysis. Educ Psychol Meas 20(1):141–151
Kline P (2000) The handbook of psychological testing, 2nd edn. Routledge
Matai K (2015) Sustainable tourism: waste management issues. J Bas Appl Eng 2(1):1445–1448
Mateu-Sbert J, Ricci-Cabello I, Villalonga-Olives E, Cabeza-Irigoyen E (2013) The impact of tourism on municipal solid waste generation: The case of Menorca Island (Spain). Waste Manage 33(12):2589–2593
Nardo M, Saisana M, Saltelli A, Tarantola S (2005) Tools for composite indicators building. Tech. Rep. EUR 21682, Joint Research Centre, Ispra, Italy, https://knowledge4policy.ec.europa.eu/publication/tools-composite-indicators-building-0_en
OECD-JRC (2008) Handbook on constructing composite indicators: Methodology and user guide. Tech. rep., OECD Publishing, https://www.oecd.org/sdd/42495745.pdf
Pearson K (1901) On lines and planes of closest fit to systems of points in space. Philosophical Magazine and Journal of Science 2(11):559–572
Schmid J, Leiman JM (1957) The development of hierarchical factorial solutions. Psychometrika 22(1):53–61
Sokal RR, Michener CD (1958) A statistical method for evaluating systematic relationships. Univ Kansas Sci Bull 38(2):1409–1438
Steiger JH (1980) Tests for comparing elements of a correlation matrix. Psychol Bull 87(2):245–251
Takane Y, Shibayama T (1991) Principal component analysis with external information on both subjects and variables. Psychometrika 56(1):97–120
Wherry RJ (1959) Hierarchical factor solutions without rotation. Psychometrika 24(1):45–51


Funding

Open access funding provided by Università degli Studi di Milano - Bicocca within the CRUI-CARE Agreement. The authors did not receive support from any organization for the submitted work.

Author information

Authors and Affiliations

Erasmus University Rotterdam, Rotterdam, The Netherlands
Carlo Cavicchia
University of Naples Federico II, Naples, Italy
Pasquale Sarnacchiaro
University of Rome La Sapienza, Rome, Italy
Maurizio Vichi
University of Milano-Bicocca, Milan, Italy
Giorgia Zaccaria


Corresponding author

Correspondence to Giorgia Zaccaria.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (PDF 981 kb)

Appendices

Appendix A: Estimation of the parameters of $\textbf{R}_{\textrm{EU}}$

The estimates of $\textbf{R}_{\textrm{W}}, \textbf{R}_{\textrm{B}}$ and $\textbf{V}$ provided in the following are obtained by minimizing Eq. (8) subject to constraints (3)−(7).

(a) Estimation of $\textbf{R}_{\textrm{W}}$: for fixed $\widehat{\textbf{V}}$,
$$\begin{aligned} \widehat{\textbf{R}}_{\textrm{W}}= \text {diag}\big (\widehat{\textbf{V}}^{\prime } (\textbf{R}-\textbf{I}_{p}) \widehat{\textbf{V}}\big ) \big ((\widehat{\textbf{V}}^{\prime }\widehat{\textbf{V}})^{2} - \widehat{\textbf{V}}^{\prime }\widehat{\textbf{V}}\big )^{-1}. \end{aligned}$$
$\widehat{\textbf{R}}_{\textrm{W}}$ minimizes Eq. (8), given $\widehat{\textbf{R}}_{\textrm{B}}$ and $\widehat{\textbf{V}}$, and satisfies condition (5). Because the diagonal of $\widehat{\textbf{R}}_{\textrm{B}}$ is set to zero by constraint (6), $\widehat{\textbf{R}}_{\textrm{B}}$ does not affect the estimates of $\textbf{R}_{\textrm{W}}$. The ordinary inverse of $(\widehat{\textbf{V}}^{\prime }\widehat{\textbf{V}})^{2} - \widehat{\textbf{V}}^{\prime }\widehat{\textbf{V}}$ can be used because $\widehat{\textbf{V}}^{\prime }\widehat{\textbf{V}}$ is a diagonal matrix whose diagonal entries are the group sizes, and the Moore-Penrose inverse $\textbf{M}^{+}$ of a diagonal matrix $\textbf{M}$ with nonzero diagonal entries equals its inverse $\textbf{M}^{-1}$.
(b) Estimation of $\textbf{R}_{\textrm{B}}$: for fixed $\widehat{\textbf{V}}$, $\widehat{\textbf{R}}_{\textrm{B}}$ is calculated as the closest matrix to
$$\begin{aligned} \tilde{\textbf{R}}_{\textrm{B}}= \widehat{\textbf{V}}^{+}\textbf{R}(\widehat{\textbf{V}}^{+})^{\prime } \end{aligned}$$
in the LS sense that satisfies condition (6). Indeed, the off-diagonal elements of $\tilde{\textbf{R}}_{\textrm{B}}$ simply denote the correlations between Q variable groups, but they do not necessarily satisfy the ultrametric condition. An average linkage (UPGMA, Sokal and Michener 1958) algorithm for correlations can be used to compute $\widehat{\textbf{R}}_{\textrm{B}}$.
(c) Estimation of $\textbf{V}$: for fixed $\widehat{\textbf{R}}_{\textrm{W}}$ and $\widehat{\textbf{R}}_{\textrm{B}}$, each row of $\textbf{V}$, that is, $\textbf{v}_{j}, j = 1, \ldots , p$, is estimated by fixing the remaining rows and setting
$$\begin{aligned} {\left\{ \begin{array}{ll} {\hat{v}}_{jq} = 1 \quad \text {if} \quad q= \underset{{q^{\prime } = 1, \ldots , Q}}{\mathrm {arg\,min}}\, F(\widehat{\textbf{R}}_{\textrm{W}}, \widehat{\textbf{R}}_{\textrm{B}}, [\hat{\textbf{v}}_{1}, \ldots , \textbf{v}_{j} = \textbf{i}_{q^{\prime }}, \ldots , \hat{\textbf{v}}_{p}]^{\prime }) \\ {\hat{v}}_{jq} = 0 \quad \text {otherwise} \end{array}\right. } \end{aligned}$$
where $\textbf{i}_{q}$ is the qth row of the identity matrix of order Q. Therefore, estimating the rows of $\textbf{V}$ corresponds to assigning each variable to only one of the Q disjoint groups (conditions 3 and 4) to minimize the loss function.
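The three conditional updates above can be illustrated on a toy example. The following is a minimal pure-Python sketch, not the authors' MATLAB code: the function names (`update_rw`, `between_means`, `upgma_ultrametric`, `loss`, `reassign_rows`), the list encoding of $\textbf{V}$ and the 6-variable correlation matrix are all hypothetical. Since Eq. (8) is not reproduced in this appendix, the sketch assumes that the loss F is the sum of squared differences between $\textbf{R}$ and a model matrix whose entries are 1 on the diagonal, the within-group correlation for two variables in the same group, and the between-group correlation otherwise. It also assumes that every group holds at least two variables, and step (c) does not guard against emptying a group.

```python
# Toy sketch of the conditional updates (a)-(c); all names are illustrative.
# Assumption: V is stored as a list `groups`, groups[j] = q meaning v_jq = 1.

def update_rw(R, groups, Q):
    """Step (a): mean off-diagonal correlation inside each group."""
    p = len(R)
    sizes = [groups.count(q) for q in range(Q)]
    sums = [0.0] * Q
    for j in range(p):
        for l in range(p):
            if j != l and groups[j] == groups[l]:
                sums[groups[j]] += R[j][l]
    # (V'V)^2 - V'V has diagonal n_q^2 - n_q; assumes every n_q >= 2.
    return [sums[q] / (sizes[q] ** 2 - sizes[q]) for q in range(Q)]

def between_means(R, groups, Q):
    """Step (b), first part: V^+ R (V^+)' as group-by-group mean correlations."""
    p = len(R)
    sizes = [groups.count(q) for q in range(Q)]
    S = [[0.0] * Q for _ in range(Q)]
    for j in range(p):
        for l in range(p):
            S[groups[j]][groups[l]] += R[j][l] / (sizes[groups[j]] * sizes[groups[l]])
    return S

def upgma_ultrametric(S):
    """Step (b), second part: average linkage on similarities, zero diagonal."""
    Q = len(S)
    clusters = [[q] for q in range(Q)]
    U = [[0.0] * Q for _ in range(Q)]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = [(q, h) for q in clusters[a] for h in clusters[b]]
                avg = sum(S[q][h] for q, h in pairs) / len(pairs)
                if best is None or avg > best[0]:
                    best = (avg, a, b)
        level, a, b = best
        for q in clusters[a]:
            for h in clusters[b]:
                U[q][h] = U[h][q] = level  # merge level shared by both clusters
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return U

def loss(R, groups, rw, RB):
    """Assumed least-squares loss between R and the fitted block structure."""
    total = 0.0
    for j in range(len(R)):
        for l in range(len(R)):
            if j == l:
                fitted = 1.0
            elif groups[j] == groups[l]:
                fitted = rw[groups[j]]
            else:
                fitted = RB[groups[j]][groups[l]]
            total += (R[j][l] - fitted) ** 2
    return total

def reassign_rows(R, groups, rw, RB, Q):
    """Step (c): move each variable, one at a time, to its loss-minimizing group."""
    groups = list(groups)
    for j in range(len(R)):
        costs = [loss(R, groups[:j] + [q] + groups[j + 1:], rw, RB) for q in range(Q)]
        groups[j] = costs.index(min(costs))
    return groups

# Invented toy data: 3 groups of 2 variables with block-constant correlations.
true_groups = [0, 0, 1, 1, 2, 2]
within = [0.8, 0.6, 0.7]
between = {(0, 1): 0.4, (0, 2): 0.1, (1, 2): 0.2}
R = [[1.0 if j == l
      else within[gj] if gj == gl
      else between[(min(gj, gl), max(gj, gl))]
      for l, gl in enumerate(true_groups)]
     for j, gj in enumerate(true_groups)]

rw = update_rw(R, true_groups, 3)
RB = upgma_ultrametric(between_means(R, true_groups, 3))
new_groups = reassign_rows(R, [1, 0, 1, 1, 2, 2], rw, RB, 3)
print(rw, RB, new_groups)
```

On this toy input, the average-linkage step replaces the between-group means 0.1 and 0.2 by their common average 0.15, which makes the between-group matrix ultrametric, and the row-wise update moves the deliberately misassigned first variable back to the first group.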

Appendix B: The UCI model algorithm

The algorithm for the estimation of the UCI model is provided in Algorithm 1. The code for Algorithm 1 is written in MATLAB and is available from the authors upon request.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Cavicchia, C., Sarnacchiaro, P., Vichi, M. et al. A model-based ultrametric composite indicator for studying waste management in Italian municipalities. Comput Stat 39, 21–50 (2024). https://doi.org/10.1007/s00180-023-01333-9

Received: 02 May 2022
Accepted: 21 January 2023
Published: 16 March 2023
Issue Date: February 2024
DOI: https://doi.org/10.1007/s00180-023-01333-9

Symbol	Description
$n$, $p$	Number of units and manifest variables, respectively
$Q$	Number of variable groups corresponding to the first-level SCIs over which the hierarchy is built
$\textbf{X}= [x_{ij}]$	($n \times p$) data matrix
$\textbf{R} = [r_{jl}]$	Data correlation matrix of order $p$, where $r_{jl}$ is the correlation between the manifest variables $j$ and $l$ ($j,l = 1, \ldots , p$)
$\textbf{V}=[v_{jq}]$	($p \times Q$) membership matrix, where $v_{jq} = 1$ if the $j$th manifest variable belongs to the $q$th group; $v_{jq} = 0$ otherwise
$\textbf{R}_{\textrm{W}}= [{}_{W}r_{qq}]$	Diagonal matrix of order $Q$, whose diagonal entries represent the correlation within groups
$\textbf{R}_{\textrm{B}}= [{}_{B}r_{qh}]$	Matrix of order $Q$, whose off-diagonal entries represent the correlation between groups and whose diagonal entries are equal to zero
$\textbf{E}=[e_{jl}]$	Square error matrix of order $p$
$\textbf{Y}_{Q} = [y_{iq}^{(Q)}]$	($n \times Q$) score matrix of the first-level SCIs
$\textbf{A}_{Q} = [a_{jq}^{(Q)}]$	($p \times Q$) sparse loading matrix, with one nonnull value per row representing the unique loading of each manifest variable on the corresponding first-level SCI. The position of the nonnull value in each row is determined according to $\textbf{V}$
$\textbf{1}_{p},\textbf{1}_{Q},\textbf{I}_{p}$	Unitary vectors of order $p$ and $Q$, and identity matrix of order $p$, respectively
