## Abstract

Clustering of variables relies on relationships among them. The strength of those relationships is generally measured by the correlation coefficients between pairs of variables. This paper proposes specified variable weighted correlation coefficients and takes the clustering around latent variables (CLV) approach as an example to transform the common clustering method into a “weighted” clustering method. The aim is to eliminate factors that are unrelated to the variable that was adopted for weighting to ensure that the cluster centers are sufficiently different and have good correlations with the adopted variable. A log-transformed dataset was used to evaluate the proposed method. Three clusters were obtained under the restriction of the As element, and they represented three ore-controlling factors related to the Goldenville Formation, namely geologic features such as formation, fault contacts, and granitoid intrusions. Not only did the new cluster centers account for most of the variability related to the weighted element (As) but they also showed significant differences in spatial distributions.

## Introduction

As an important data analysis method, cluster analysis has been used broadly in geochemical data interpretation (Howarth & Jones, 1972; Castillo-Muñoz & Howarth, 1976; Gustavsson & Bjorklund, 1976; Vriend et al., 1988; Kramar, 1995; Rantitsch, 2000; Hanesch et al., 2001; Xie et al., 2004; Ji et al., 2007; Templ et al., 2008). The principal aim of clustering is to split multivariate observations into meaningful, multivariate, and homogeneous groups based on the dissimilarity between variables. To preserve the scales of measurements of variables for clustering, the Euclidean distance and Procrustean distance were discussed by Qannari et al. (1997, 1998). A new multivariate association measure was proposed by Soffritti (1999) to overcome the drawbacks of the typically employed bivariate correlation coefficient. Moreover, specific clustering methods based on principal component analysis (PCA), such as clustering around latent variables (CLV) (Vigneau & Qannari, 2003; Vigneau et al. 2011) and diametrical clustering (Dhillon et al., 2003), have been proposed to identify groups of highly correlated quantitative variables. Additional classification methods similar to CLV have been discussed based on the different methods for estimation of the cluster center. For example, sparse PCA (Jolliffe et al., 2003; Zou et al., 2006) and sparse partial least squares regression (Lê Cao et al., 2008; Chun & Keleş, 2010) have been explored to eliminate the disturbance of variables in the clustering process, and mixture models using factor analysis (Subedi et al., 2013) have been proposed for clustering of high-dimensional variables. A CLV method under the constraint of a specified variable has also been discussed to highlight the group structure among variables and to identify the most relevant groups of variables for prediction (Chen & Vigneau, 2014).

CLV simultaneously determines *K* clusters of variables and *K* latent components such that the variables in each cluster are strongly related to the corresponding latent component (Vigneau & Qannari, 2003). The latent variables extracted from geochemical elements represent geological factors hidden behind the geochemical data; they help to establish the connection between the geochemical data and the geological process. This paper proposes a “weighted” variable clustering method based on CLV and random sampling techniques. The traditional CLV and the new method proposed here differ in two main respects:

(1) The centroid (*p*_{k}) of the *k*^{th} cluster in the new method is the prediction of a specified response variable *y* from the variables in the *k*^{th} cluster (*G*_{k}), rather than the first principal component of the variables in the *k*^{th} cluster as in the CLV method.

(2) The similarity between the centroid *p*_{k} and variable \(x_{j}\) \(\left( x_{j} \in G_{k} \right)\) in the new method is defined as the ratio between two correlations \(\left( \rho_{x_{j},y} / \rho_{p_{k},y} \right)\) rather than the covariance between them as in the CLV method. This similarity measures the association between variable *x*_{j} and the corresponding centroid *p*_{k} in a regression with respect to the response variable *y*.

A case study is presented here to validate this new method. The study is based on a geochemical dataset that includes geochemical concentrations of 16 elements (Ag, As, Au, Cu, F, Li, Nb, Pb, Rb, Sb, Sn, Th, Ti, W, Zn, and Zr) from 671 lake sediment samples from southern Nova Scotia. Three clusters via the proposed partial clustering method were extracted from the 15 log-transformed geochemical elements (thus excluding the constraint element). The clustering result was applied to identify factors associated with gold mineralization in the study area.

## Methods

### Clustering of Variables Around Latent Variables

CLV is a type of K-means method that is used for variable clustering. It attempts to find cluster centers that represent specific regions of the data. Once the number of clusters *K* is fixed, the algorithm alternates between two steps: (1) assigning each object to the nearest cluster center and (2) setting each cluster center to the average of all objects assigned to it. It stops when the cluster assignments no longer change.

Consider a data matrix *X* of *n* observations (samples) evaluated using *p* variables, i.e., \(X = \left\{ x_{1} , \ldots ,x_{p} \right\} = \left( x_{ij} \right)_{n \times p}\). Let \(P_{K} = \left( G_{1} , \ldots ,G_{K} \right)\) be a partition of the *p* variables into *K* clusters associated with the *K* latent components *c*_{1}, *c*_{2}, …, *c*_{K}. CLV seeks the partition and the components that maximize:

\[T = \sum_{k = 1}^{K} \sum_{j = 1}^{p} \delta_{kj} \, \rho^{2} \left( x_{j} ,c_{k} \right) \quad (1)\]

where \(\delta_{kj} = 1\) if the *j*^{th} variable belongs to cluster *G*_{k} and \(\delta_{kj} = 0\) otherwise; \(\rho \left( x_{j} ,c_{k} \right)\) is the correlation coefficient between \(x_{j}\) and \(c_{k}\); and \(c_{k}\) is the centroid (latent component) of the *k*^{th} cluster, which is usually defined as the first standardized principal component of *X*_{k} (the variables belonging to cluster *G*_{k}).
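The alternating procedure described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the squared correlation \(\rho^{2}(x_{j}, c_{k})\) as the similarity, takes the first principal component of each cluster as its centroid, and uses the hypothetical function names `first_pc` and `clv`.

```python
import numpy as np

def first_pc(Zk):
    """First principal component of the columns of Zk, scaled to unit variance."""
    Zc = Zk - Zk.mean(axis=0)
    U, _, _ = np.linalg.svd(Zc, full_matrices=False)
    c = U[:, 0]                      # component scores up to a scale factor
    return c / c.std()

def clv(X, K, n_iter=50, seed=0):
    """Alternate between updating centroids and reassigning variables."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # z-standardize each variable
    labels = rng.permutation(p) % K            # random start; no empty cluster if p >= K
    C = np.zeros((n, K))
    for _ in range(n_iter):
        for k in range(K):
            members = labels == k
            if members.any():                  # keep the old centroid if a cluster empties
                C[:, k] = first_pc(Z[:, members])
        R2 = (Z.T @ C / n) ** 2                # squared correlations, shape (p, K)
        new_labels = R2.argmax(axis=1)         # assign each variable to its best centroid
        if np.array_equal(new_labels, labels):
            break                              # assignments stable: converged
        labels = new_labels
    T = R2[np.arange(p), labels].sum()         # criterion at the last evaluated centroids
    return labels, C, T
```

Because both the variables and the centroids are standardized, `Z.T @ C / n` gives the correlation matrix directly. Like K-means, the procedure can stop at a local optimum, so several random starts are advisable in practice.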

### Specified Variable Weighted CLV Algorithm

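The association between two variables (*x*_{1}, *x*_{2}) adopted in conventional CLV methods is the correlation coefficient \(\rho \left( x_{1} ,x_{2} \right)\):

\[\rho \left( x_{1} ,x_{2} \right) = \frac{Cov\left( x_{1} ,x_{2} \right)}{\sigma \left( x_{1} \right) \sigma \left( x_{2} \right)} \quad (2)\]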

where \(Cov\left( x_{1} ,x_{2} \right)\) is the covariance between *x*_{1} and *x*_{2}, and \(\sigma \left( x_{1} \right)\) and \(\sigma \left( x_{2} \right)\) are the standard deviations of *x*_{1} and *x*_{2}, respectively. A high value of \({\text{abs}}\left( \rho \left( x_{1} ,x_{2} \right) \right)\) indicates a strong relationship between *x*_{1} and *x*_{2}. However, the association of two variables with a large correlation coefficient can be weak if it is measured in a regression with respect to a response variable (*y*). For example, if *x*_{1} is uncorrelated with *y*, i.e., *ρ*_{x1,y} = 0, the conditional correlation coefficient of *x*_{1} and *x*_{2} in a regression with respect to *y* must be 0.

The response variable (*y*) weighted distance between two variables (*x*_{1} and *x*_{2}) is defined through the indirect effect of the response variable on the correlation between them. A structural equation model (SEM) can be used to describe the association of *x*_{1} and *x*_{2} under the given response variable (*y*) and to calculate the conditional correlation coefficient (Fig. 1). In Figure 1 (see Table 1 for explanation of parameters), the latent variable (*x*_{latent}) represents the factor shared by *x*_{1} and *x*_{2}, and its effect on *y* (\(\rho_{y,x_{\text{latent}}}\)) is equal to the combined effect of *x*_{1} on *y* and *x*_{latent} (\(\rho_{x_{1},x_{\text{latent}}} \times \rho_{x_{1},y}\)). Similarly, the effect of *x*_{latent} on *y* should be equal to the combined effect of *x*_{2} on *y* and *x*_{latent} (\(\rho_{x_{2},x_{\text{latent}}} \times \rho_{x_{2},y}\)). The above relationships can be expressed by Eq. 3.

Let \(x_{\text{latent}} = \lambda_{1} x_{1} + \lambda_{2} x_{2} + \lambda_{3} y\) and let all of the variables (\(x_{1} ,x_{2} ,y, x_{\text{latent}}\)) be *Z*-standardized. Thus, their variances and standard deviations become 1; in addition, their covariances are equal to their correlation coefficients. The relationship of *x*_{latent} with *x*_{1}, *x*_{2} and *y* is defined as:

From Eq. 3, the correlation coefficient between \(x_{latent}\) and \(x_{i}\) (i = 1, 2) is expressed as:

where \(Cov\left( {x_{i} ,x_{{{\text{latent}}}} } \right)\) is the covariance between \(x_{i}\) and \(x_{{{\text{latent}}}}\), which can be expressed as:

and \(\rho_{{x_{i} ,x_{{{\text{latent}}}} }}\) is defined as:

where \(\rho_{1i}\), \(\rho_{2i}\), and \(\rho_{yi}\) are the correlation coefficients of \(x_{i}\) with \(x_{1}\), \(x_{2}\), and y, respectively. According to Eqs. (3) to (6), one can derive the following relations:

The standard deviation of \(x_{{{\text{latent}}}}\) (\(\sigma_{{x_{{{\text{latent}}}} }}\)) is:

where \(\lambda_{1} , \lambda_{2} , \lambda_{3}\) can be estimated through Eqs. 3 and 10.

The coefficient \(\rho_{y,x_{latent}}\) can be used to quantify the association between *x*_{1} and *x*_{2} when they are applied in a regression with respect to *y*. The standardized \(\rho_{y,x_{latent}}\) is defined as:

where \(\rho_{{ y.{\text{x}}_{1} ,{\text{x}}_{2} }}\) is the multiple correlation coefficient of \(\left\{ {x_{1} ,x_{2} } \right\}\) with*y*.

The new index has the following characteristics:

(1) \(0 \le R \le 1\); a large *R* means a strong similarity between *x*_{1} and *x*_{2}. For example, if *x*_{1} = *x*_{2}, then \(x_{\text{latent}} = x_{1} = x_{2}\) and \(R = 1\).

(2) If *x*_{1} = *y* or *x*_{2} = *y*, then \(x_{\text{latent}} = x_{2}\) or \(x_{1}\), respectively, \(\rho_{x_{\text{latent}},y} = \rho_{x_{1},x_{2}}\), \(\rho_{\left( x_{1} ,x_{2} \right),y} = 1\), and \(R = \rho_{x_{1},x_{2}}\).

(3) If \(\rho_{{x_{1} ,y}} = 0\) or \(\rho_{{x_{2} ,y}} = 0\), then \(x_{{{\text{latent}}}}\) cannot be estimated, and if \(\rho_{{x_{{{\text{latent}}}} ,y}} = 0\), then*R* = 0.

(4) \({\text{R}}_{{\text{y}}} \left( {{\text{x}}_{1} ,{\text{x}}_{2} } \right) = {\text{ R}}_{{\text{y}}} \left( {{\text{x}}_{2} ,{\text{x}}_{1} } \right)\)

(5) \(\rho_{x_{1},y} = \rho_{y.x_{1},x_{\text{latent}}}\), \(\rho_{x_{2},y} = \rho_{y.x_{2},x_{\text{latent}}}\), and \(\rho_{y.x_{1},x_{2}} = \rho_{y.x_{1},x_{2},x_{\text{latent}}}\), where \(\rho_{y.x_{1},x_{\text{latent}}}\), \(\rho_{y.x_{2},x_{\text{latent}}}\) and \(\rho_{y.x_{1},x_{2},x_{\text{latent}}}\) are the multiple correlation coefficients of {*x*_{1}, *x*_{latent}}, {*x*_{2}, *x*_{latent}} and {*x*_{1}, *x*_{2}, *x*_{latent}} with *y*, respectively.

These characteristics indicate that the new index *R* is a symmetric statistical measure of the relative conditional correlation between *x*_{1} and *x*_{2} under the regression with respect to variable *y*.

In the weighted CLV, based on the abovementioned concepts, Eq. 1 can then be transformed to:

where \(p_{k}\) is a prediction of the response variable (*y*) by the variables in *G*_{k}, and \(\rho_{x_{j} y}\) and \(\rho_{p_{k} y}\) are the correlation coefficients of \(x_{j}\) and \(p_{k}\) with *y*, respectively. The ratio \(\frac{\rho_{x_{j} y}}{\rho_{p_{k} y}}\) is replaced by 1 in the calculation when \(\rho_{p_{k} y} = 0\). The ratio of the two values can be considered a “conditional” correlation given *y*, which represents the homogeneity of variables \(x_{j}\) and \(p_{k}\) in clustering.
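As an illustration of the ratio described above, the sketch below computes the centroid \(p_{k}\) and the ratio \(\rho_{x_{j} y} / \rho_{p_{k} y}\) for every variable in one cluster. It assumes that the prediction of *y* is an ordinary least-squares fit with an intercept, which the text does not specify; the function name `weighted_similarity` is hypothetical, and the sketch covers only the per-variable ratio, not the full target function.

```python
import numpy as np

def weighted_similarity(Xk, y):
    """Return (p_k, ratios) for one cluster.

    p_k    : least-squares prediction of y from the columns of Xk (assumed OLS).
    ratios : rho(x_j, y) / rho(p_k, y) for each column x_j of Xk.
    """
    A = np.column_stack([np.ones(len(y)), Xk])       # design matrix with intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    p_k = A @ beta
    rho_xy = np.array([np.corrcoef(Xk[:, j], y)[0, 1] for j in range(Xk.shape[1])])
    # A (numerically) constant prediction has no defined correlation with y.
    if p_k.std() < 1e-12 * (1.0 + np.abs(y).max()):
        rho_py = 0.0
    else:
        rho_py = np.corrcoef(p_k, y)[0, 1]
    if np.isclose(rho_py, 0.0):
        return p_k, np.ones_like(rho_xy)             # ratio replaced by 1, as in the text
    return p_k, rho_xy / rho_py
```

Under the OLS assumption, \(\rho_{p_{k} y}\) equals the multiple correlation coefficient of the cluster's variables with *y*, which is never smaller than any single \(|\rho_{x_{j} y}|\) in that cluster, so each ratio lies between -1 and 1.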

### Clustering Stages

A random sampling technique (Monte Carlo simulation) was employed as the calculation algorithm to obtain the clustering solutions. It can also be considered a type of “expectation maximization” method (Jain & Dubes, 1988). A target function was created, and the classifications generated by a random sampling method were tested against it. The classification for which the target function (Eq. 12) reached its maximum value was taken as the final result. The algorithm stages are as follows, and the flow chart is shown in Figure 2.

*Stage 1 Choose initial model parameters* Initial clusters are generated and the iteration count is set to 1.

*Stage 2 Calculate the target function value* The target function (T) is calculated.

*Stage 3 Evaluate the result* If the iteration count equals 1, record the clusters and the target function value in a dataset as the best clustering suggestion. Otherwise, compare the T value from stage 2 with the existing best suggestion. If the new T value is greater than the old value (recorded in the dataset as the best suggestion), set the current clusters and T value as the best suggestion. Then increase the iteration counter by 1.

*Stage 4 Output the results* Check whether a termination condition is met. If yes, terminate the calculation and output the best suggestion; if not, proceed to stage 5.

*Stage 5 Generate new clusters* A new cluster is generated through a random sampling process.

*Stage 6 Check for duplicates* Check whether the newly generated clusters already exist in the dataset. If they do not, record them in the dataset and return to stage 2 to calculate the target function T. Otherwise, return to stage 5 to generate new clusters.

The calculation can terminate in two possible ways. The first way is to generate a sufficient number of different classifications, which may be expressed as a ratio of the total number of possible classifications. The second way is to stop sampling when the final classification is unchanged after a specified number of samplings. This number of samplings depends on the total number of possible classifications. The random sampling algorithm adopted in this research is taken from the R package*sampling* (Tillé and Matei, 2011). Details about this algorithm and package are available in Särndal et al. (2003) and Tillé and Matei (2011).
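The stages and the two termination rules above can be sketched as a simple random search. This is an illustrative sketch rather than the authors' code: it uses Python's standard `random` module instead of the R package *sampling*, the target function is passed in as a placeholder callable, and the names `random_partition_search`, `max_unique` and `patience` are assumptions introduced here.

```python
import random

def random_partition_search(variables, K, target, max_unique=2000, patience=500, seed=0):
    """Randomly sample K-cluster partitions and keep the one maximizing `target`.

    Stops when `max_unique` distinct partitions have been tested, or when the
    best partition has not changed for `patience` consecutive draws.
    """
    rng = random.Random(seed)
    seen = set()                          # dataset of partitions already tested
    best, best_T = None, float("-inf")
    draws_since_change = 0
    while len(seen) < max_unique and draws_since_change < patience:
        draws_since_change += 1
        labels = [rng.randrange(K) for _ in variables]
        # Represent a partition as a set of sets so that label order is irrelevant.
        clusters = frozenset(
            frozenset(v for v, lab in zip(variables, labels) if lab == k)
            for k in range(K)
        )
        if len(clusters) < K or frozenset() in clusters or clusters in seen:
            continue                      # empty cluster or duplicate: draw again
        seen.add(clusters)
        T = target(clusters)
        if T > best_T:                    # larger target value is better
            best, best_T = clusters, T
            draws_since_change = 0
    return best, best_T, len(seen)
```

Counting every draw toward `patience`, including duplicates, guarantees termination even when the number of possible partitions is smaller than `max_unique`.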

## Case Study

### Dataset and Geological Background

The study area, located in the western Meguma Terrane of Nova Scotia, Canada, covers roughly 25,000 km^{2} and consists mainly of Cambrian–Ordovician low-middle grade metamorphosed sedimentary rocks and a suite of aluminous Devonian granitoid intrusions (Sangster, 1990; Ryan & Ramsay, 1996). The metamorphosed sedimentary strata of the Meguma Group include two rock formations: the lower sand-dominated flysch Goldenville Formation and the upper shaly flysch Halifax Formation. Both were deformed during the emplacement of the Devonian granitoid intrusions, resulting in NE–SW-trending folds (Kontak et al., 1998).

The South Mountain Batholith (SMB), which is a complex of multiple intrusions, occupies nearly one-third of the whole study area. Abundant Sn, W, U and Au mineralization and mineral deposits have been found in this area. While Sn, W and U mineralization occurs mainly inside the SMB and in the contact zones between the complex intrusion and the metamorphic sedimentary rocks, Au deposits occur mainly in the Meguma Group, especially around the Goldenville and Halifax contact (GHC) zones (Chatterjee, 1983).

Studies of known Au deposits and their regional geological environment have shown that they are turbidite-hosted (Mawer, 1986; Ryan & Ramsay, 1996). The major mineralization-related geological features described by previous researchers include the GHC, northeast-southwest-trending anticline axes and northeast-southwest-trending shear zones (Kontak et al., 1990; Sangster, 1990; Ryan & Ramsay, 1996; Kontak & Kerrich, 1997). Lithogeochemical analyses have shown that As, as a main pathfinder element of Au, has strong but complex relationships with Au mineralization. For example, Au and As are highly correlated in alteration zones related to Au mineralization controlled by fracture zones or faults and within the GHC but not in all gold-bearing quartz veins (Zentilli et al., 1985; Crocket et al., 1986; Kerswill, 1991).

A method of identifying geochemical factors based on this dataset was introduced by Liu et al. (2015), and additional information about the dataset and study area is available in the literature (Bonham-Carter et al., 1988; Rogers et al., 1990; Dunn et al., 1991). The locations of the lake sediment samples, the gold deposits and the geological background are shown in Figure 3.

The correlation coefficient matrix is an important basis for variable clustering, and its use requires the input data to follow a normal/multivariate normal distribution (Reimann et al., 2002). Because geochemical data are compositional (Reimann & Filzmoser, 2000), data transformation is an issue in geochemical element clustering. Some studies on compositional data have recently been published (Buccianti et al., 2018; Thiombane et al., 2018; Dmitrijeva et al., 2019; Greenacre et al., 2021).

Although several transformation methods have been developed for opening compositional data (Aitchison, 1982, 1984, 1986; Egozcue et al., 2003), log transformation is still an option for normalizing geochemical data in element clustering due to its ability to maintain relationships among elements after transformation (Templ et al., 2008). Other problems (e.g., algorithm choice) and possibilities (e.g., exploratory analysis based on clustering results) in geochemical data cluster analysis have been reviewed by Templ et al. (2008).

To extract geological factors related to gold mineralization based on clusters, As was selected as the constraint element for the classification instead of Au for two reasons: first, As is related to gold mineralization in the study area (Agterberg et al., 1990; Xu & Cheng, 2001); and second, concentration values of As exhibit more variability than those of Au. The densities of the samples with Au and As are shown in Figure 4. These indicate that the As data contain more information and are better suited as a control variable than Au in the current research. The spatial distributions of both elements were mapped (Fig. 5) through the inverse distance weighting (IDW) method (power = 2, search radius = 12 points, cell size = 200 m; these parameters were used for subsequent IDW). The matrix of correlation coefficients of the log-transformed data for the 16 elements is shown in Table 2.
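For readers unfamiliar with IDW, the following is a minimal sketch of the interpolation with power = 2. It assumes that "search radius = 12 points" means the 12 nearest samples are used for each estimate, and it uses a brute-force distance matrix; the function name `idw` and its parameters are illustrative only. The 200 m cell size would determine the spacing of the query grid and is not part of the function.

```python
import numpy as np

def idw(xy_obs, values, xy_query, power=2.0, k=12):
    """Inverse distance weighted estimate at each query point from its k nearest samples."""
    # Pairwise distances between query points (m) and sample points (n): shape (m, n).
    d = np.linalg.norm(xy_query[:, None, :] - xy_obs[None, :, :], axis=2)
    k = min(k, xy_obs.shape[0])
    idx = np.argpartition(d, k - 1, axis=1)[:, :k]   # indices of the k nearest samples
    d_k = np.take_along_axis(d, idx, axis=1)
    v_k = values[idx]
    exact = d_k == 0                                 # query coincides with a sample
    w = 1.0 / np.where(exact, 1.0, d_k) ** power     # weights 1 / d^power
    est = (w * v_k).sum(axis=1) / w.sum(axis=1)
    # Where a query point coincides with a sample, return the sampled value itself.
    hit = exact.any(axis=1)
    est[hit] = v_k[hit][np.arange(hit.sum()), exact[hit].argmax(axis=1)]
    return est
```

For a dataset of several hundred samples the full distance matrix is inexpensive; for dense grids a spatial index would be a more economical choice.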

### Partial Clustering Results Through CLV and Weighted CLV

The loadings of the first four principal components (Fig. 6) were calculated through PCA based on the matrix in Table 2. The eigenvalues of these components were > 1, and the cumulative variance explained was 73%. Figure 7 shows the clustering result obtained through CLV and the corresponding loadings of each centroid. The elements contained in clusters 1, 2 and 3 were {Ag, Au, Sn, W}, {Pb, Sb, Cu, Zn} and {Rb, Li, Ti, Zr, F, Th, Nb}, respectively, which were the main elements of principal components 3, 2 and 1 (Fig. 6), respectively. The results indicate that the CLV method is controlled mainly by the principal components of the variables.
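The PCA step described above can be illustrated as follows. The sketch assumes that components are retained when their eigenvalue exceeds 1 and that loadings are eigenvectors scaled by the square root of the eigenvalue; the function name `pca_from_correlation` is hypothetical, and the small matrix in the usage note is made up for illustration, not taken from Table 2.

```python
import numpy as np

def pca_from_correlation(R):
    """PCA of a correlation matrix R, retaining components with eigenvalue > 1."""
    eigvals, eigvecs = np.linalg.eigh(R)             # eigh: R is symmetric
    order = np.argsort(eigvals)[::-1]                # sort from largest to smallest
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > 1.0
    loadings = eigvecs[:, keep] * np.sqrt(eigvals[keep])
    explained = eigvals[keep].sum() / eigvals.sum()  # cumulative share of variance
    return eigvals, loadings, explained
```

For a toy 4 × 4 correlation matrix with two independent pairs of variables correlated at 0.8 and 0.6, the eigenvalues are 1.8, 1.6, 0.4 and 0.2, so two components are retained and together they explain 85% of the variance.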

### Partial Clustering Result from Weighted CLV

Through the weighted CLV algorithm introduced in the previous section, the partial clustering results were {Ag, Au, Nb, W}, {Cu, F, Li, Pb, Sb, Th, Zn, Zr} and {Rb, Sn, Ti}, corresponding to clusters 1, 2 and 3 in the CLV method. It should be noted that the partial clustering output depends on the initial input; therefore, this result represents only a local optimum solution.

Because the centroid of the *k*^{th} cluster is the prediction for As through variables in the *k*^{th} group, the loading of an element (\(x_{j}\)) on the corresponding centroid (\(p_{k}\)) is the absolute value of \(\rho_{x_{j} y} / \rho_{p_{k} y}\). Figure 8 shows the loadings of each group, which are the centroid loadings on elements.

Compared with the result in Figure 7, the new clustering result obtained through the new method is similar to that obtained through CLV based on the main elements in each cluster. For example, clusters 1, 2 and 3 in Figure 8 correspond to clusters 1, 2 and 3 in Figure 7, respectively. The main elements in the three clusters in Figures 7 and 8 are {Au, W, Ag}, {Zn, Cu, Pb, Sb} and {Rb, Ti}, but the importance of each element was changed, i.e., the roles of Au, W, and Ag in Figure 8 are strengthened compared with those in Figure 7.

### Spatial Distribution of Cluster Centroids

To compare the two methods, the centroids of the clusters in both methods were mapped (Fig. 9) through IDW interpolation (using the same parameters as those in Fig. 5). Because the centroid in the weighted CLV is the prediction for As, the error of prediction (Fig. 10) was also mapped through IDW interpolation.

For comparison, clusters with similar element combinations were combined. They contained the same combination of elements as {Rb, Ti}, {Zn, Cu, Pb, Sb} and {Au, W, Ag} in clusters 1, 2 and 3. The corresponding centroids of each cluster from the CLV and the weighted CLV were named *g*_{1}, *g*_{2}, and *g*_{3} and *p*_{1}, *p*_{2}, and *p*_{3}, respectively. The corresponding spatial distributions are mapped in Figure 9a and b (*g*_{1} and *p*_{1}), Figure 9c and d (*g*_{2} and *p*_{2}), and Figure 9e and f (*g*_{3} and *p*_{3}). From these figures, it can be deduced that the factors depicted by the three clusters relate to the intersections of geological boundaries and anticlines (Fig. 9a, b), to the Goldenville Formation (Fig. 9c, d) and to magmatic rocks (Fig. 9e, f). These factors are closely related to gold mineralization in the study area.

The predicted errors of each cluster from the weighted CLV (*e*_{1}, *e*_{2} and *e*_{3}) and the predicted errors of all elements are mapped in Figure 10, which depicts the difference between the estimated value (*p*_{i}) and the response element (As). As evident in Figure 9, the spatial distributions of the three clusters were different. *g*_{1} and *p*_{1} are related to the boundaries of geological features and to NE–SW-trending folds; these represent transportation channels for the gold mineralization. *g*_{2} and *p*_{2} include most of the information from As, with most of the high concentration values located within the Goldenville Formation; these values represent the material source of gold mineralization. *g*_{3} and *p*_{3} relate to granite intrusions, which represent the kinetic factors of gold mineralization. There are also some differences in the spatial distributions of the cluster centers obtained by the two methods. Compared with the CLV, the cluster centers obtained by the weighted CLV, under the constraint of element As, had a clearer spatial representation of the ore-controlling factors of gold mineralization.

In Figure 10, most of the error in *e*_{1} was located within the Goldenville Formation and in the southwest area, but the error around the lithological boundaries was small. Error in *e*_{2} shows that the prediction of *p*_{2} was good in most of the areas except at the center of the map. Error in *e*_{3} shows that most of the area of good prediction of *p*_{3} is located within the granite and granodiorite. The levels of the errors reflect the quality of the prediction. The combination of elements from the same source tended to have better predictability, and the elements from different sources were less predictable. The different error distributions indicate that the relationships between each cluster center and As were different, which suggests that As played a significant restrictive role in the clustering process.

To further compare the centroids obtained by the weighted CLV and CLV methods, a section from northwest to southeast is shown in Figure 11a, and scores of the centroids of clusters 1, 2 and 3 along this section are shown in Figure 11b–d. The comparison revealed that the score profile obtained by weighted CLV has the same trend as that of CLV, but the former was smoother than the latter. For example, in Figure 11b, the two profiles had obvious high-value anomalies at the contacts between different geological units, but the weighted CLV centroid profile suppressed the high-value anomalies inside the geological units, which enhances the control of the contact boundaries. In addition, Figure 12 shows the mean scores of the cluster centroids in different geological units obtained by the two clustering methods. Figure 12b and c show that the mean values of cluster centroids are significantly different among different geological units, and the cluster centroids had a good distinguishing effect on different geological units, which is consistent with the conclusions observed in Figure 9d and f.

## Conclusions

In this research, a weighted clustering method based on a new index \(\left( \left| \rho_{x_{j} y} / \rho_{p_{k} y} \right| \right)\) of a two-variable relationship is proposed. Unlike the CLV, which is based on the correlations among variables, the new index is based not only on the correlations among variables but also on their relationships with a constraint variable. The index is a “weighted” correlation coefficient via a specific constraint variable. In the proposed weighted clustering method, the correlation coefficients used in CLV are replaced by the new index for each variable pair. In addition, the centroid of each cluster, which is the first principal component in CLV, is replaced by the prediction for the constraint variable used in the “weighted” correlation coefficient. In this way, the CLV method is transformed into a “weighted” variable clustering method. The clustering result can respond to different factors related to a response variable. Through different constraint variables, it provides flexibility to build a clustering model.

Clustering results for a classic geochemical dataset demonstrate that the weighted CLV maintains the basic characteristics of the traditional CLV clustering method, and the structure of the clustering results of the elements is similar. However, by introducing a constraint variable that is closely related to other elements and the mineralization in the study area, there is clearer correspondence between the newly obtained cluster centroids and the geological units. By suppressing the unrelated correlation with the constraint variables, the clustering results can be fine-tuned, and the weighted CLV method has more prominent clustering center characteristics, which is more valuable for subsequent applications.

## References

Agterberg, F. P., Bonham-Carter, G. F., & Wright, D. F. (1990). Statistical pattern integration for mineral exploration. In *Computer applications in resource estimation* (pp. 1–21). Pergamon.

Aitchison, J. (1982). The statistical analysis of compositional data. *Journal of the Royal Statistical Society: Series B (Methodological)*, *44*(2), 139–160.

Aitchison, J. (1984). The statistical analysis of geochemical compositions. *Journal of the International Association for Mathematical Geology*, *16*(6), 531–564.

Aitchison, J. (1986). *The statistical analysis of compositional data*. London: Chapman & Hall.

Bonham-Carter, G. F., Agterberg, F. P., & Wright, D. F. (1988). Integration of geological datasets for gold exploration in Nova Scotia. *Photogrammetric Engineering and Remote Sensing*, *54*(11), 1585–1592.

Buccianti, A., Lima, A., Albanese, S., & De Vivo, B. (2018). Measuring the change under compositional data analysis (CoDA): Insight on the dynamics of geochemical systems. *Journal of Geochemical Exploration*, *189*, 100–108.

Castillo-Muñoz, R., & Howarth, R. J. (1976). Application of the empirical discriminant function to regional geochemical data from the United Kingdom. *Geological Society of America Bulletin*, *87*(11), 1567–1581.

Chatterjee, A. K. (1983). *Metallogenic map of the Province of Nova Scotia*. Department of Mines and Energy, Nova Scotia, Canada, ver. 1, scale 1:500,000.

Chen, M., & Vigneau, E. (2014). Supervised clustering of variables. *Advances in Data Analysis and Classification*, *10*(1), 85–101.

Chun, H., & Keleş, S. (2010). Sparse partial least squares regression for simultaneous dimension reduction and variable selection. *Journal of the Royal Statistical Society: Series B (Statistical Methodology)*, *72*(1), 3–25.

Crocket, J., Fueten, F., & Clifford, P. (1986). Distribution and localization of gold in Meguma Group rocks, Nova Scotia: Implications of metal distribution patterns in quartz veins and host rocks on mineralization processes at Harrigan Cove, Halifax County. *Atlantic Geology*, *22*(1), 15–33.

Dhillon, I. S., Marcotte, E. M., & Roshan, U. (2003). Diametrical clustering for identifying anti-correlated gene clusters. *Bioinformatics*, *19*(13), 1612–1619.

Dmitrijeva, M., Ehrig, K. J., Ciobanu, C. L., Cook, N. J., Verdugo-Ihl, M. R., & Metcalfe, A. V. (2019). Defining IOCG signatures through compositional data analysis: A case study of lithogeochemical zoning from the Olympic Dam deposit, South Australia. *Ore Geology Reviews*, *105*, 86–101.

Dunn, C. E., Coker, W. B., & Rogers, P. J. (1991). Reconnaissance and detailed geochemical surveys for gold in eastern Nova Scotia using plants, lake sediment, soil and till. *Journal of Geochemical Exploration*, *40*(1–3), 143–163.

Egozcue, J. J., Pawlowsky-Glahn, V., Mateu-Figueras, G., & Barcelo-Vidal, C. (2003). Isometric logratio transformations for compositional data analysis. *Mathematical Geology*, *35*(3), 279–300.

Greenacre, M., Grunsky, E., & Bacon-Shone, J. (2021). A comparison of isometric and amalgamation logratio balances in compositional data analysis. *Computers and Geosciences*, *148*, 104621.

Gustavsson, N., & Bjorklund, A. (1976). Lithological classification of tills by discriminant analysis. *Journal of Geochemical Exploration*, *5*, 393–395.

Hanesch, M., Scholger, R., & Dekkers, M. J. (2001). The application of fuzzy c-means cluster analysis and non-linear mapping to a soil data set for the detection of polluted sites. *Physics and Chemistry of the Earth, Part A: Solid Earth and Geodesy*, *26*(11–12), 885–891.

Howarth, R. J., & Jones, M. J. (1972). The pattern recognition problem in applied geochemistry. *Geochemical Exploration*, 259–273.

Jain, A. K., & Dubes, R. C. (1988). *Algorithms for clustering data*. Prentice-Hall.

Ji, H., Zeng, D., Shi, Y., Wu, Y., & Wu, X. (2007). Semi-hierarchical correspondence cluster analysis and regional geochemical pattern recognition. *Journal of Geochemical Exploration*, *93*(2), 109–119.

Jolliffe, I. T., Trendafilov, N. T., & Uddin, M. (2003). A modified principal component technique based on the LASSO. *Journal of Computational and Graphical Statistics*, *12*(3), 531–547.

Kerswill, J. A. (1991). Lithogeochemical indicators of gold potential in the eastern Meguma Terrane of Nova Scotia. *Papers-Geological Survey of Canada*, 19–19.

Kontak, D. J., Horne, R. J., Sandeman, H., Archibald, D., & Lee, J. K. (1998). 40Ar/39Ar dating of ribbon-textured veins and wall-rock material from Meguma lode gold deposits, Nova Scotia: Implications for timing and duration of vein formation in slate-belt hosted vein gold deposits. *Canadian Journal of Earth Sciences*, *35*(7), 746–761.

Kontak, D. J., & Kerrich, R. (1997). An isotopic (C, O, Sr) study of vein gold deposits in the Meguma Terrane, Nova Scotia; implication for source reservoirs. *Economic Geology*, *92*(2), 161–180.

Kontak, D. J., Smith, P. K., Kerrich, R., & Williams, P. F. (1990). Integrated model for Meguma group lode gold deposits, Nova Scotia, Canada. *Geology*, *18*(3), 238–242.

Kramar, U. (1995). Application of limited fuzzy clusters to anomaly recognition in complex geological environments. *Journal of Geochemical Exploration*, *55*(1–3), 81–92.

Kriegel, H. P., Kröger, P., Schubert, E., & Zimek, A. (2008). A general framework for increasing the robustness of PCA-based correlation clustering algorithms. In *International Conference on Scientific and Statistical Database Management* (pp. 418–435). Springer.

Lê Cao, K. A., Rossouw, D., Robert-Granié, C., & Besse, P. (2008). A sparse PLS for variable selection when integrating omics data. *Statistical Applications in Genetics and Molecular Biology*, *7*(1).

Liu, J., Cheng, Q., & Wang, J. (2015). Identification of geochemical factors in regression to mineralization endogenous variables using structural equation modeling. *Journal of Geochemical Exploration*, *150*, 125–136.

Mawer, C. K. (1986). The bedding-concordant gold-quartz veins of the Meguma Group, Nova Scotia. *Turbidite-Hosted Gold Deposits*, *32*, 135–148.

Qannari, E. M., Vigneau, E., & Courcoux, P. (1998). Une nouvelle distance entre variables. Application en classification. *Revue de Statistique Appliquée*, *46*(2), 21–32.

Qannari, E. M., Vigneau, E., Luscan, P., Lefebvre, A. C., & Vey, F. (1997). Clustering of variables, application in consumer and sensory studies. *Food Quality and Preference*, *8*(5–6), 423–428.

Rantitsch, G. (2000). Application of fuzzy clusters to quantify lithological background concentrations in stream-sediment geochemistry. *Journal of Geochemical Exploration*, *71*(1), 73–82.

Reimann, C., & Filzmoser, P. (2000). Normal and lognormal data distribution in geochemistry: Death of a myth. Consequences for the statistical treatment of geochemical and environmental data. *Environmental Geology*, *39*(9), 1001–1014.

Reimann, C., Filzmoser, P., & Garrett, R. G. (2002). Factor analysis applied to regional geochemical data: Problems and possibilities.

*Applied Geochemistry,**17*(3), 185–206.Rogers, P. J., Chatterjee, A. K., & Aucott, J. W. (1990). Metallogenic domains and their reflection in regional lake sediment surveys from the Meguma Zone, southern Nova Scotia, Canada.

*Journal of Geochemical Exploration,**39*(1–2), 153–174.Ryan, R. J., & Ramsay, W. R. H. (1996). Preliminary comparison of gold field in the Meguma Terrain, Nova Scotia, and Victoria, Australia.

*MacDonald, DR, Mills, & KA (eds). Mines and Mineral Branch. Report of Activities*, 97–1.Sangster, A. L. (1990). Metallogeny of the Meguma Terrane, Nova Scotia.

*Mineral Deposit Studies in Nova Scotia,**1*, 90–98.Särndal, C. E., Swensson, B., & Wretman, J. (2003).

*Model assisted survey sampling*. Springer.Soffritti, G. (1999). Hierarchical clustering of variables: A comparison among strategies of analysis.

*Communications in Statistics-Simulation and Computation,**28*(4), 977–999.Subedi, S., Punzo, A., Ingrassia, S., & McNicholas, P. D. (2013). Clustering and classification via cluster-weighted factor analyzers.

*Advances in Data Analysis and Classification,**7*(1), 5–40.Templ, M., Filzmoser, P., & Reimann, C. (2008). Cluster analysis applied to regional geochemical data: Problems and possibilities.

*Applied Geochemistry,**23*(8), 2198–2213.Thiombane, M., Martín-Fernández, J. A., Albanese, S., Lima, A., Doherty, A., & De Vivo, B. (2018). Exploratory analysis of multi-element geochemical patterns in soil from the Sarno River Basin (Campania region, southern Italy) through compositional data analysis (CODA).

*Journal of Geochemical Exploration,**195*, 110–120.Tillé, Y., & Matei, A. (2011). Sampling: Survey Sampling. R Package Version 2.4.

Vigneau, E., Endrizzi, I., & Qannari, E. M. (2011). Finding and explaining clusters of consumers using the CLV approach.

*Food Quality and Preference,**22*(8), 705–713.Vigneau, E., & Qannari, E. M. (2003). Clustering of variables around latent components.

*Communications in Statistics-Simulation and Computation,**32*(4), 1131–1150.Vriend, S. P., van Gaans, P. M., Middelburg, J., & De Nijs, A. (1988). The application of fuzzy c-means cluster analysis and non-linear mapping to geochemical datasets: Examples from Portugal.

*Applied Geochemistry,**3*(2), 213–224.Xie, X., Liu, D., Xiang, Y., Yan, G., & Lian, C. (2004). Geochemical blocks for predicting large ore deposits—concept and methodology.

*Journal of Geochemical Exploration,**84*(2), 77–91.Xu, Y., & Cheng, Q. (2001). A fractal filtering technique for processing regional geochemical maps for mineral exploration.

*Geochemistry: Exploration, Environment, Analysis,**1*(2), 147–156.Zentilli, M., Graves, M. C., Mulja, T., MacInnis, I., & Matheson, J. R. (1985). Geochemical characterization of the Goldenville-Halifax Transition of the Meguma Group of Nova Scotia; preliminary report.

*Information Series-Nova Scotia, Department of Mines and Energy*.Zou, H., Hastie, T., & Tibshirani, R. (2006). Sparse principal component analysis.

*Journal of Computational and Graphical Statistics,**15*(2), 265–286.

## Acknowledgments

This study was funded by the Foreign Aid Project of the Ministry of Commerce of the People’s Republic of China (2021-28) and the China National Major Water Conservancy Project Construction Fund (0001212012AC50001).

## Ethics declarations

### Conflict of Interest

The authors declare that they have no known competing financial interests or personal relationships that could affect the work reported in this article.

### Data Availability

The data that support the findings of this study are openly available from the Department of Natural Resources and Renewables, Nova Scotia, Canada, at https://novascotia.ca/natr/meb/geoscience-online/geochemistry.asp.

## About this article

### Cite this article

Liu, J., Cheng, Q., Wang, JG. *et al.* A “Weighted” Geochemical Variable Classification Method Based on Latent Variables. *Nat Resour Res* **31**, 1925–1941 (2022). https://doi.org/10.1007/s11053-022-10061-8

DOI: https://doi.org/10.1007/s11053-022-10061-8

### Keywords

- Variable clustering
- Clustering around latent variables (CLV)
- Weighted clustering
- Geochemical factor extraction