
A “Weighted” Geochemical Variable Classification Method Based on Latent Variables


Clustering of variables relies on relationships among them. The strength of those relationships is generally measured by the correlation coefficients between pairs of variables. This paper proposes specified variable weighted correlation coefficients and takes the clustering around latent variables (CLV) approach as an example to transform the common clustering method into a “weighted” clustering method. The aim is to eliminate factors that are unrelated to the variable that was adopted for weighting to ensure that the cluster centers are sufficiently different and have good correlations with the adopted variable. A log-transformed dataset was used to evaluate the proposed method. Three clusters were obtained under the restriction of the As element, and they represented three ore-controlling factors related to the Goldenville Formation, namely geologic features such as formation, fault contacts, and granitoid intrusions. Not only did the new cluster centers account for most of the variability related to the weighted element (As) but they also showed significant differences in spatial distributions.


As an important data analysis method, cluster analysis has been used broadly in geochemical data interpretation (Howarth & Jones, 1972; Castillo-Muñoz & Howarth, 1976; Gustavsson & Bjorklund, 1976; Vriend et al., 1988; Kramar, 1995; Rantitsch, 2000; Hanesch et al., 2001; Xie et al., 2004; Ji et al., 2007; Templ et al., 2008). The principal aim of clustering is to split multivariate observations into meaningful, multivariate, and homogeneous groups based on the dissimilarity between variables. To preserve the scales of measurements of variables for clustering, the Euclidean distance and Procrustean distance were discussed by Qannari et al. (1997, 1998). A new multivariate association measure was proposed by Soffritti (1999) to overcome the drawbacks of the typically employed bivariate correlation coefficient. Moreover, specific clustering methods based on principal component analysis (PCA), such as clustering around latent variables (CLV) (Vigneau & Qannari, 2003; Vigneau et al., 2011) and diametrical clustering (Dhillon et al., 2003), have been proposed to identify groups of highly correlated quantitative variables. Additional classification methods similar to CLV have been discussed based on different methods for estimating the cluster center. For example, sparse PCA (Jolliffe et al., 2003; Zou et al., 2006) and sparse partial least squares regression (Lê Cao et al., 2008; Chun & Keleş, 2010) have been explored to eliminate the disturbance of variables in the clustering process, and mixture models using factor analysis (Subedi et al., 2013) have been proposed for clustering of high-dimensional variables. A CLV method under the constraint of a specified variable has also been discussed to highlight the group structure among variables and to identify the most relevant groups of variables for prediction (Chen & Vigneau, 2014).

CLV simultaneously determines \(K\) clusters of variables and \(K\) latent components such that the variables in each cluster are strongly related to the corresponding latent component (Vigneau & Qannari, 2003). The latent variables that are extracted based on geochemical elements represent geological factors hidden behind the geochemical data; they help to establish the connection between the geochemical data and the geological process. This study proposes a “weighted” variable clustering method based on CLV and a random sampling technique. The main differences between the traditional CLV and the new method proposed here are the following two points:

(1) The centroid (\(p_{k}\)) of the \(k\)th cluster in the new method is assigned as the prediction for a specified response variable \(y\) from among the variables in the \(k\)th cluster (\(G_{k}\)) rather than the first principal component of the variables in the \(k\)th cluster, as in the CLV method.

(2) The similarity between the centroid \(p_{k}\) and variable \(x_{j} \left( {x_{j} \in G_{k} } \right)\) in the new method is defined as the ratio between two correlations \(\left( \rho_{x_{j},y} / \rho_{p_{k},y} \right)\) rather than the covariance between them, as in the CLV method. This similarity is a measure of the association between variable \(x_{j}\) and the corresponding centroid \(p_{k}\) in a regression with respect to the response variable \(y\).

A case study is presented here to validate this new method. The study is based on a geochemical dataset that includes geochemical concentrations of 16 elements (Ag, As, Au, Cu, F, Li, Nb, Pb, Rb, Sb, Sn, Th, Ti, W, Zn, and Zr) from 671 lake sediment samples from southern Nova Scotia. Three clusters via the proposed partial clustering method were extracted from the 15 log-transformed geochemical elements (thus excluding the constraint element). The clustering result was applied to identify factors associated with gold mineralization in the study area.


Clustering of Variables Around Latent Variables

CLV is a type of K-means method that is used for variable clustering. It attempts to find cluster centers that represent specific regions of the data. Once the value of \(k\) is determined, the algorithm alternates between the following two steps: (1) assigning each object to the nearest cluster center and (2) setting each cluster center to the average of all objects assigned to it. It ends when the cluster assignment no longer changes.

Consider a data matrix \(X\) of \(n\) observations (samples) evaluated using \(p\) variables, i.e., \(X = \left\{ {x_{1} , \ldots ,x_{p} } \right\} = \left( {x_{ij} } \right)_{n \times p}\). Let \(P_{K} = \left( {G_{1} , \ldots ,G_{K} } \right)\) be a partition of the variables into \(K\) clusters associated with the \(K\) latent components \(c_{1}, c_{2}, \ldots, c_{K}\). The clustering is obtained by maximizing:

$$T = \mathop \sum \limits_{k = 1}^{K} \mathop \sum \limits_{j = 1}^{p} \delta_{kj} \rho^{2} \left( {x_{j} ,c_{k} } \right) \tag{1}$$

where \(\delta_{kj} = 1\) if the \(j\)th variable belongs to cluster \(G_{k}\) and \(\delta_{kj} = 0\) otherwise; \(\rho \left( {x_{j} ,c_{k} } \right)\) represents the correlation coefficient between \(x_{j}\) and \(c_{k}\); and \(c_{k}\) is the centroid (latent component) of the \(k\)th cluster, which is usually defined as the first standardized principal component of \(X_{k}\) (the variables belonging to cluster \(G_{k}\)).
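The criterion in Eq. 1 and the K-means-style alternation can be sketched in a few lines of Python. This is a minimal illustration, not the reference CLV implementation: the function names (`first_pc`, `clv_criterion`, `clv`) are hypothetical, the latent component is taken as the first standardized principal component as described above, and the initial labels are assumed to leave no cluster empty.

```python
import numpy as np

def first_pc(Xk):
    """First standardized principal component of the columns of Xk."""
    Z = (Xk - Xk.mean(axis=0)) / Xk.std(axis=0)
    # The leading right-singular vector of Z holds the first PC weights.
    _, _, vt = np.linalg.svd(Z, full_matrices=False)
    score = Z @ vt[0]
    return (score - score.mean()) / score.std()

def clv_criterion(X, labels, K):
    """Eq. 1: sum of squared correlations between each variable and its cluster's latent component."""
    T = 0.0
    for k in range(K):
        cols = np.flatnonzero(labels == k)
        if cols.size == 0:
            continue
        c_k = first_pc(X[:, cols])
        T += sum(np.corrcoef(X[:, j], c_k)[0, 1] ** 2 for j in cols)
    return T

def clv(X, labels, K, max_iter=50):
    """Alternate between updating latent components and reassigning variables."""
    labels = np.asarray(labels).copy()
    for _ in range(max_iter):
        comps = [first_pc(X[:, labels == k]) for k in range(K)]
        r2 = np.array([[np.corrcoef(X[:, j], c)[0, 1] ** 2 for c in comps]
                       for j in range(X.shape[1])])
        new_labels = r2.argmax(axis=1)
        # Stop on convergence, or if a reassignment would empty a cluster.
        if np.array_equal(new_labels, labels) or len(set(new_labels)) < K:
            break
        labels = new_labels
    return labels
```

Calling `clv(X, initial_labels, K)` on an `n × p` array returns one cluster label per variable (column), and `clv_criterion` can be used to compare partitions.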

Specified Variable Weighted CLV Algorithm

The association of two variables (\(x_{1}\), \(x_{2}\)) adopted in conventional CLV methods is the correlation coefficient \(\rho \left( {x_{1} ,x_{2} } \right)\):

$$\rho \left( {x_{1} ,x_{2} } \right) = \frac{{Cov\left( {x_{1} ,x_{2} } \right)}}{{\sigma \left( {x_{1} } \right) \times \sigma \left( {x_{2} } \right)}} \tag{2}$$

where \(Cov\left( {x_{1} ,x_{2} } \right)\) is the covariance between \(x_{1}\) and \(x_{2}\), and \(\sigma \left( {x_{1} } \right)\) and \(\sigma \left( {x_{2} } \right)\) are the standard deviations of \(x_{1}\) and \(x_{2}\), respectively. A high value of \({\text{abs}}\left( {\rho \left( {x_{1} ,x_{2} } \right)} \right)\) indicates a strong relationship between \(x_{1}\) and \(x_{2}\). However, the association of two variables with a large correlation coefficient can be weak if it is measured in a regression with respect to a response variable (\(y\)). For example, if \(x_{1}\) is uncorrelated with \(y\), i.e., \(\rho_{x_{1},y} = 0\), the conditional correlation coefficient of \(x_{1}\) and \(x_{2}\) in a regression with respect to \(y\) must be 0.

The response variable (\(y\)) weighted distance between two variables (\(x_{1}\) and \(x_{2}\)) is defined as an indirect effect from the response variable in the correlation coefficient between them. A structural equation model (SEM) can be used to describe the association of \(x_{1}\) and \(x_{2}\) under the given response variable (\(y\)) and to calculate the conditional correlation coefficient (Fig. 1). In Figure 1 (see Table 1 for explanation of parameters), the latent variable (\(x_{{\text{latent}}}\)) represents a factor common to \(x_{1}\) and \(x_{2}\), and its effect on \(y\) (\(\rho_{{y,x_{{{\text{latent}}}} }}\)) is equal to the combined effect of \(x_{1}\) on \(y\) and \(x_{{\text{latent}}}\) (\(\rho_{{x_{1} ,x_{{{\text{latent}}}} }} \times \rho_{{x_{1} ,y}}\)). Similarly, the effect of \(x_{{\text{latent}}}\) on \(y\) should be equal to the combined effect of \(x_{2}\) on \(y\) and \(x_{{\text{latent}}}\) (\(\rho_{{x_{2} ,x_{{{\text{latent}}}} }} \times \rho_{{x_{2} ,y}}\)). The above relationships can be expressed by Eq. 3.

Figure 1 Structural equation model between observed variables \(x_{1}\) and \(x_{2}\) and a response variable \(y\). The standardized association of \(x_{1}\) and \(x_{2}\) in a regression with respect to \(y\) is the ratio of \(\rho_{{y,x_{{\text{latent}}} }}\) to \(\rho_{{y.x_{1} ,x_{2} }}\), where \(\rho_{{y.x_{1} ,x_{2} }}\) is the multiple correlation coefficient of \(\left( {x_{1} ,x_{2} } \right)\) with \(y\)

Table 1 Definition of parameters in Figure 1

Let \(x_{{{\text{latent}}}} = \lambda_{1} x_{1} + \lambda_{2} x_{2} + \lambda_{3} y\) and let all of the variables (\(x_{1} ,x_{2} ,y, x_{{{\text{latent}}}}\)) be Z-standardized. Thus, their variances and standard deviations become 1; in addition, their covariances are equal to their correlation coefficients. The relationship of \(x_{{\text{latent}}}\) with \(x_{1}\), \(x_{2}\) and \(y\) is defined as:

$$\rho_{{x_{1} ,x_{{{\text{latent}}}} }} \times \rho_{{x_{1} ,y}} = \rho_{{x_{2} ,x_{{{\text{latent}}}} }} \times \rho_{{x_{2} ,y}} = \rho_{{y,x_{{{\text{latent}}}} }} \tag{3}$$

From Eq. 3, the correlation coefficient between \(x_{{\text{latent}}}\) and \(x_{i}\) (i = 1, 2) is expressed as:

$$\rho_{{x_{i} ,x_{{{\text{latent}}}} }} = \frac{{Cov\left( {x_{i} ,x_{{{\text{latent}}}} } \right)}}{{\sigma \left( {x_{i} } \right) \times \sigma \left( {x_{{{\text{latent}}}} } \right)}} = Cov\left( {x_{i} ,x_{{{\text{latent}}}} } \right) \tag{4}$$

where \(Cov\left( {x_{i} ,x_{{{\text{latent}}}} } \right)\) is the covariance between \(x_{i}\) and \(x_{{{\text{latent}}}}\), which can be expressed as:

$$\begin{aligned} Cov\left( {x_{i} ,x_{{{\text{latent}}}} } \right) = & Cov\left( {x_{i} , \lambda_{1} x_{1} + \lambda_{2} x_{2} + \lambda_{3} y} \right) \\ = & \lambda_{1} Cov\left( {x_{i} ,x_{1} } \right) + \lambda_{2} Cov\left( {x_{i} ,x_{2} } \right) + \lambda_{3} Cov\left( {x_{i} ,y} \right) \\ = & \lambda_{1} \rho_{{x_{i} ,x_{1} }} + \lambda_{2} \rho_{{x_{i} ,x_{2} }} + \lambda_{3} \rho_{{x_{i} ,y}} \\ \end{aligned} \tag{5}$$

and \(\rho_{{x_{i} ,x_{{{\text{latent}}}} }}\) is defined as:

$$\rho_{{x_{i} ,x_{{{\text{latent}}}} }} = \left( {\lambda_{1} \rho_{1i} + \lambda_{2} \rho_{2i} + \lambda_{3} \rho_{yi} } \right) \tag{6}$$

where \(\rho_{1i}\), \(\rho_{2i}\), and \(\rho_{yi}\) are the correlation coefficients of \(x_{i}\) with \(x_{1}\), \(x_{2}\), and \(y\), respectively. According to Eqs. (3) to (6), one can derive the following relations:

$$\rho_{{x_{1} ,x_{{{\text{latent}}}} }} = \left( {\lambda_{1} + \rho_{12} \lambda_{2} + \rho_{1y} \lambda_{3} } \right) \tag{7}$$
$$\rho_{{x_{2} ,x_{{{\text{latent}}}} }} = \left( {\rho_{12} \lambda_{1} + \lambda_{2} + \rho_{2y} \lambda_{3} } \right) \tag{8}$$
$$\rho_{{y,x_{{{\text{latent}}}} }} = \left( {\rho_{1y} \lambda_{1} + \rho_{2y} \lambda_{2} + \lambda_{3} } \right) \tag{9}$$

The standard deviation of \(x_{{{\text{latent}}}}\) (\(\sigma_{{x_{{{\text{latent}}}} }}\)) is:

$$\begin{aligned} \sigma \left( {x_{{{\text{latent}}}} } \right) = & \sigma \left( { \lambda_{1} x_{1} + \lambda_{2} x_{2} + \lambda_{3} y} \right) \\ = & \sqrt {\lambda_{1}^{2} + \lambda_{2}^{2} + \lambda_{3}^{2} + 2\rho_{12} \lambda_{1} \lambda_{2} + 2\rho_{1y} \lambda_{1} \lambda_{3} + 2\rho_{2y} \lambda_{2} \lambda_{3} } = 1 \\ \end{aligned} \tag{10}$$

where \(\lambda_{1} , \lambda_{2} , \lambda_{3}\) can be estimated through Eqs. 3 and 10.

The quantity \(\rho_{{y,x_{{\text{latent}}} }}\) can be used to quantify the association between \(x_{1}\) and \(x_{2}\) when they are applied in a regression with respect to \(y\). The standardized \(\rho_{{y,x_{{\text{latent}}} }}\) is defined as:

$${\text{R}}_{{\text{y}}} \left( {{\text{x}}_{1} ,{\text{x}}_{2} } \right) = \left| {\frac{{\rho_{{y,{\text{x}}_{{{\text{latent}}}} }} }}{{\rho_{{ y.{\text{x}}_{1} ,{\text{x}}_{2} }} }}} \right| \tag{11}$$

where \(\rho_{{ y.{\text{x}}_{1} ,{\text{x}}_{2} }}\) is the multiple correlation coefficient of \(\left\{ {x_{1} ,x_{2} } \right\}\) with \(y\).

The new index has the following characteristics:

(1) \(0 \le {\text{R}} \le 1\); a large R means a strong similarity between \(x_{1}\) and \(x_{2}\). For example, if \(x_{1} = x_{2}\), then \(x_{{{\text{latent}}}} = x_{1} = x_{2}\) and \({\text{R}} = 1\).

(2) If \(x_{1} = y\) or \(x_{2} = y\), then \(x_{{{\text{latent}}}} = x_{2}\) or \(x_{{{\text{latent}}}} = x_{1}\), respectively, so that \(\rho_{{x_{{{\text{latent}}}} ,y}} = \rho_{{x_{1} ,x_{2} }}\), \(\rho_{{y.x_{1} ,x_{2} }} = 1\), and \(R = \rho_{{x_{1} ,x_{2} }}\).

(3) If \(\rho_{{x_{1} ,y}} = 0\) or \(\rho_{{x_{2} ,y}} = 0\), then \(x_{{{\text{latent}}}}\) cannot be estimated, and if \(\rho_{{x_{{{\text{latent}}}} ,y}} = 0\), then \(R = 0\).

(4) \({\text{R}}_{{\text{y}}} \left( {{\text{x}}_{1} ,{\text{x}}_{2} } \right) = {\text{ R}}_{{\text{y}}} \left( {{\text{x}}_{2} ,{\text{x}}_{1} } \right)\)

(5) \(\rho_{{x_{1} ,y}} = \rho_{{y.x_{1} ,x_{{{\text{latent}}}} }}\), \(\rho_{{x_{2} ,y}} = \rho_{{y.x_{2} ,x_{{{\text{latent}}}} }}\), and \(\rho_{{y.x_{1} ,x_{2} }} = \rho_{{y.x_{1} ,x_{2} ,x_{{{\text{latent}}}} }}\), where \(\rho_{{y.x_{1} ,x_{{{\text{latent}}}} }}\), \(\rho_{{y.x_{2} ,x_{{{\text{latent}}}} }}\) and \(\rho_{{y.x_{1} ,x_{2} ,x_{{{\text{latent}}}} }}\) are the multiple correlation coefficients of \(\{x_{1}, x_{{\text{latent}}}\}\), \(\{x_{2}, x_{{\text{latent}}}\}\) and \(\{x_{1}, x_{2}, x_{{\text{latent}}}\}\) with \(y\), respectively.

These characteristics indicate that the new index \(R\) is a symmetric statistical measure of the relative conditional correlation between \(x_{1}\) and \(x_{2}\) under the regression with respect to variable \(y\).
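One possible way to evaluate the index numerically is sketched below in Python. It rests on an interpretation that the text does not spell out: substituting the expressions for \(\rho_{x_1,x_{\text{latent}}}\), \(\rho_{x_2,x_{\text{latent}}}\) and \(\rho_{y,x_{\text{latent}}}\) into the equality chain of Eq. 3 gives two homogeneous linear equations in \((\lambda_1, \lambda_2, \lambda_3)\), whose solution direction is then scaled to unit variance. The function name `weighted_association` is hypothetical, and degenerate inputs (for example \(x_1 = x_2\) or \(x_1 = y\)) are not handled.

```python
import numpy as np

def weighted_association(r12, r1y, r2y):
    """Return (R_y(x1, x2), lambda) from the three pairwise correlations."""
    # Eq. 3 with the linear expressions for the correlations substituted:
    #   rho(x1, latent) * r1y - rho(y, latent) = 0
    #   rho(x2, latent) * r2y - rho(y, latent) = 0
    A = np.array([
        [0.0, r12 * r1y - r2y, r1y ** 2 - 1.0],
        [r12 * r2y - r1y, 0.0, r2y ** 2 - 1.0],
    ])
    lam = np.cross(A[0], A[1])          # direction orthogonal to both rows
    C = np.array([[1.0, r12, r1y],
                  [r12, 1.0, r2y],
                  [r1y, r2y, 1.0]])     # correlation matrix of (x1, x2, y)
    var = lam @ C @ lam
    if var <= 1e-12:
        raise ValueError("degenerate case: lambda is not uniquely determined")
    lam = lam / np.sqrt(var)            # unit variance for the latent variable
    rho_y_latent = C[2] @ lam           # correlation of y with the latent variable
    # Multiple correlation of {x1, x2} with y (two-predictor formula).
    r_mult = np.sqrt((r1y ** 2 + r2y ** 2 - 2 * r12 * r1y * r2y) / (1 - r12 ** 2))
    return abs(rho_y_latent / r_mult), lam
```

With \(\rho_{12} = 0.5\), \(\rho_{1y} = 0.6\) and \(\rho_{2y} = 0.4\), this sketch gives \(R \approx 0.64\), and swapping the roles of \(x_1\) and \(x_2\) leaves the value unchanged, in line with characteristic (4).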

In the weighted CLV, based on the abovementioned concepts, Eq. 1 can then be transformed to:

$$T = \mathop \sum \limits_{k = 1}^{K} \mathop \sum \limits_{j = 1}^{p} \delta_{kj} \left( {\frac{{\rho_{{x_{j} y}} }}{{\rho_{{p_{k} y}} }}} \right)^{2} , \rho_{{p_{k} y}} \ne 0 \tag{12}$$

where \(p_{k}\) is a prediction of the response variable (\(y\)) by the variables in \(G_{k}\), and \(\rho_{{x_{j} y}}\) and \(\rho_{{p_{k} y}}\) are the correlation coefficients of \(x_{j}\) and \(p_{k}\) with \(y\), respectively. \(\frac{{\rho_{{x_{j} y}} }}{{\rho_{{p_{k} y}} }}\) is replaced by 1 in the calculation when \(\rho_{{p_{k} y}} = 0\). The ratio of the two values can be considered a “conditional” correlation given \(y\), which represents the homogeneity of variables \(x_{j}\) and \(p_{k}\) in clustering.
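The weighted target function can be sketched as follows. The text states only that \(p_k\) is a prediction of \(y\) by the variables in \(G_k\); the use of ordinary least squares with an intercept here is an assumption, and the function name `weighted_clv_criterion` is hypothetical.

```python
import numpy as np

def weighted_clv_criterion(X, y, labels, K):
    """Sum over clusters of (rho(x_j, y) / rho(p_k, y))^2 for the variables in each cluster."""
    T = 0.0
    for k in range(K):
        cols = np.flatnonzero(labels == k)
        if cols.size == 0:
            continue
        # Assumption: p_k is the ordinary least-squares prediction of y
        # from the variables in cluster k (with an intercept).
        A = np.column_stack([np.ones(len(y)), X[:, cols]])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        p_k = A @ beta
        rho_p = np.corrcoef(p_k, y)[0, 1] if p_k.std() > 0 else 0.0
        for j in cols:
            rho_x = np.corrcoef(X[:, j], y)[0, 1]
            # The ratio is replaced by 1 when rho(p_k, y) = 0, as stated above.
            ratio = 1.0 if np.isclose(rho_p, 0.0) else rho_x / rho_p
            T += ratio ** 2
    return T
```

Under the least-squares assumption, \(\rho_{p_k y}\) is the in-sample multiple correlation of the cluster's variables with \(y\), so each ratio lies in \([-1, 1]\) and \(T\) cannot exceed the number of variables.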

Clustering Stages

A random sampling technique (Monte Carlo simulation) was employed as the calculation algorithm to obtain the clustering solutions. It can also be considered a type of “expectation maximization” method (Jain & Dubes, 1988). A target function was created, and all possible classifications generated by a random sampling method were tested. The classification was set as the final result when the target function (Eq. 12) reached the maximum value. The algorithm stages are as follows, and the flow chart is shown in Figure 2.

Figure 2 Flow chart of the new clustering algorithm

Stage 1: Choose initial model parameters. Initial clusters are generated and the iteration count is set to 1.

Stage 2: Calculate the target function value. The target function (T) is calculated.

Stage 3: Evaluate the result. If the iteration count equals 1, record the clusters and the target function value in a dataset as the best clustering suggestion. Otherwise, compare the T value from stage 2 with the existing best suggestion. Because the target function is maximized, if the new T value is greater than the old value (recorded in the dataset as the best suggestion), set the current clusters and T value as the best suggestion, and increase the iteration counter by 1.

Stage 4: Output the results. Check whether the termination condition is met. If yes, terminate the calculation and output the best suggestion; if not, proceed to stage 5.

Stage 5: Generate new clusters. A new set of clusters is generated through a random sampling process.

Stage 6: Check for duplicates. Check whether the new clusters already exist in the dataset. If the newly generated clusters are not found in the dataset, the target function T is calculated and the clusters are recorded in the dataset. Otherwise, return to stage 2.

The calculation can terminate in two possible ways. The first way is to generate a sufficient number of different classifications, which may be expressed as a ratio of the total number of possible classifications. The second way is to stop sampling when the final classification is unchanged after a specified number of samplings. This number of samplings depends on the total number of possible classifications. The random sampling algorithm adopted in this research is taken from the R package sampling (Tillé and Matei, 2011). Details about this algorithm and package are available in Särndal et al. (2003) and Tillé and Matei (2011).
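The stages and the second termination rule can be sketched as a simple random search. This is an illustrative Python outline, not the authors' implementation: it uses Python's standard `random` module instead of the R package sampling, the names `canonical`, `random_partition_search`, `max_draws` and `patience` are hypothetical, and duplicate or incomplete partitions are simply skipped, which is an assumption about the intended flow.

```python
import random

def canonical(labels):
    """Relabel clusters by order of first appearance so equal partitions compare equal."""
    mapping = {}
    return tuple(mapping.setdefault(l, len(mapping)) for l in labels)

def random_partition_search(p, K, objective, max_draws=5000, patience=500, seed=0):
    """Draw random partitions of p variables into K clusters and keep the one with the largest objective."""
    rng = random.Random(seed)
    seen = set()                                  # partitions already evaluated
    best_labels, best_T = None, float("-inf")
    draws_since_change = 0
    for _ in range(max_draws):
        if draws_since_change >= patience:        # best result unchanged for too long
            break
        draws_since_change += 1
        labels = canonical(rng.randrange(K) for _ in range(p))
        if len(set(labels)) < K or labels in seen:
            continue                              # skip empty-cluster or duplicate draws
        seen.add(labels)
        T = objective(labels)
        if T > best_T:                            # the target function is maximized
            best_labels, best_T = labels, T
            draws_since_change = 0
    return best_labels, best_T
```

Here `objective` is any function that maps a tuple of cluster labels to a target value, for example a wrapper around the weighted target function of Eq. 12.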

Case Study

Dataset and Geological Background

The study area, located in the western Meguma Terrain of Nova Scotia, Canada, covers roughly 25,000 km² and consists mainly of Cambrian–Ordovician low-middle grade metamorphosed sedimentary rocks and a suite of aluminous Devonian granitoid intrusions (Sangster, 1990; Ryan & Ramsay, 1996). The metamorphosed sedimentary strata of the Meguma Group include two rock formations: the lower sand-dominated flysch Goldenville Formation and the upper shaly flysch Halifax Formation. Both of them were deformed during the Devonian granitoid intrusion emplacement, resulting in NE–SW-trending folds (Kontak et al., 1998).

The South Mountain Batholith (SMB), which is a complex of multiple intrusions, occupies nearly one-third of the whole study area. Abundant Sn, W, U and Au mineralization and mineral deposits have been found in this area. While Sn, W and U mineralization occurs mainly inside the SMB and in the contact zones between the complex intrusion and the metamorphic sedimentary rocks, Au deposits occur mainly in the Meguma Group, especially around the Goldenville and Halifax contact (GHC) zones (Chatterjee, 1983).

Studies of known Au deposits and their regional geological environment have shown that they are turbidite-hosted (Mawer, 1986; Ryan & Ramsay, 1996). The major mineralization-related geological features described by previous researchers include the GHC, northeast-southwest-trending anticline axes and northeast-southwest-trending shear zones (Kontak et al., 1990; Sangster, 1990; Ryan & Ramsay, 1996; Kontak & Kerrich, 1997). Lithogeochemical analyses have shown that As, as a main pathfinder element of Au, has strong but complex relationships with Au mineralization. For example, Au and As are highly correlated in alteration zones related to Au mineralization controlled by fracture zones or faults and within the GHC but not in all gold-bearing quartz veins (Zentilli et al., 1985; Crocket et al., 1986; Kerswill, 1991).

A method of identifying geochemical factors has been introduced based on this dataset by Liu et al. (2015), and additional information about the dataset and study area is available in the literature (Bonham-Carter et al., 1988; Rogers et al., 1990; Dunn et al., 1991). The locations of the lake sediment samples, gold deposits and the geological background are shown in Figure 3.

Figure 3 Locations of lake sediment samples, gold deposits and the geological background in the study area

The correlation coefficient matrix is an important basis for variable clustering, and its use requires the input data to follow a normal/multivariate normal distribution (Reimann et al., 2002). Because geochemical data are compositional (Reimann & Filzmoser, 2000), data transformation is an issue in geochemical element clustering. Some studies on compositional data have recently been published (Buccianti et al., 2018; Thiombane et al., 2018; Dmitrijeva et al., 2019; Greenacre et al., 2021).

Although several transformation methods have been developed for opening compositional data (Aitchison, 1982, 1984, 1986; Egozcue et al., 2003), log transformation is still an option for normalizing geochemical data in element clustering due to its ability to maintain relationships among elements after transformation (Templ et al., 2008). Other problems (e.g., algorithm choice) and possibilities (e.g., exploratory analysis based on clustering results) in geochemical data cluster analysis have been reviewed by Templ et al. (2008).

To extract geological factors related to gold mineralization based on clusters, As was selected as the domain element for the classification instead of Au for two reasons: first, As is related to gold mineralization in the study area (Agterberg et al., 1990; Xu & Cheng, 2001); and second, concentration values of As exhibit more variability than those of Au. The densities of the samples with Au and As are shown in Figure 4. These indicate that the As data contain more information and are better suited as a control variable than Au in the current research. The spatial distributions of both elements were mapped (Fig. 5) through the inverse distance weighting (IDW) method (power = 2, search radius = 12 points, cell size = 200 m; these parameters were used for subsequent IDW). The matrix of correlation coefficients of the log-transformed data for the 16 elements is shown in Table 2.
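For readers unfamiliar with IDW, the interpolation step can be sketched as below. This is a brute-force Python illustration, not the GIS routine used for the maps: the function name `idw` is hypothetical, and "search radius = 12 points" is interpreted here as using the 12 nearest samples for each estimate, which is an assumption.

```python
import numpy as np

def idw(xy_obs, values, xy_grid, power=2.0, k=12):
    """Inverse distance weighted estimate at each grid point from its k nearest samples."""
    # Pairwise distances between grid points (rows) and samples (columns).
    d = np.linalg.norm(xy_grid[:, None, :] - xy_obs[None, :, :], axis=2)
    k = min(k, xy_obs.shape[0])
    idx = np.argpartition(d, k - 1, axis=1)[:, :k]   # indices of the k nearest samples
    d_k = np.take_along_axis(d, idx, axis=1)
    v_k = values[idx]
    # A tiny floor on the distance avoids division by zero at sample locations.
    w = 1.0 / np.maximum(d_k, 1e-12) ** power
    return (w * v_k).sum(axis=1) / w.sum(axis=1)
```

Evaluating `idw` on the nodes of a regular grid with 200 m spacing would give a raster comparable in layout to the maps described above, although the exact neighbourhood rules of a given GIS package may differ.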

Figure 4 Densities of samples; black curve: density of Au; red curve: density of As. Horizontal axis: concentration values of samples. Vertical axis: density

Figure 5 (a) and (b) show the log distributions of Au and As, respectively. Both maps were derived by IDW interpolation and reclassified into eight classes using the natural breaks (Jenks) method in ArcGIS 10.3

Table 2 Pearson's correlation coefficient matrix of the log-transformed dataset

Partial Clustering Results Through CLV and Weighted CLV

The loadings of the first four principal components (Fig. 6) were calculated through PCA based on the matrix in Table 2. The eigenvalues (lambda values) of these components were > 1, and the cumulative variance explained was 73%. Figure 7 shows the clustering result and corresponding loadings of each centroid. The elements contained in clusters 1, 2 and 3 were {Ag, Au, Sn, W}, {Pb, Sb, Cu, Zn} and {Rb, Li, Ti, Zr, F, Th, Nb}, respectively, which were the main elements of principal components 3, 2 and 1 (Fig. 6), respectively. The results indicate that the CLV method is controlled mainly by the principal components of the variables.
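The PCA step on a correlation matrix can be illustrated with a short Python sketch. The function name `pca_from_correlation` is hypothetical, the eigenvalue-greater-than-one rule is applied as described above, and the small matrix in the usage note is made up for illustration and is not the matrix in Table 2.

```python
import numpy as np

def pca_from_correlation(R):
    """Eigen-decomposition of a correlation matrix, sorted by decreasing eigenvalue."""
    eigvals, eigvecs = np.linalg.eigh(R)            # eigh: R is symmetric
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Loadings: eigenvectors scaled by the square root of their eigenvalues.
    loadings = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))
    explained = eigvals / eigvals.sum()             # share of total variance per component
    keep = eigvals > 1.0                            # components with eigenvalue > 1
    return eigvals, loadings, explained, keep
```

For example, `pca_from_correlation(np.array([[1, .8, .3], [.8, 1, .3], [.3, .3, 1]]))` retains a single component, and `explained[keep].sum()` gives the cumulative share of variance of the retained components.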

Figure 6 Loadings of the first four principal components calculated through PCA

Figure 7 Centroid loadings on elements per cluster. Blue, orange and purple bars represent clusters 1, 2 and 3, respectively

Partial Clustering Result from Weighted CLV

Through the weighted CLV algorithm introduced in the previous section, the partial clustering results were {Ag, Au, Nb, W}, {Cu, F, Li, Pb, Sb, Th, Zn, Zr} and {Rb, Sn, Ti}, corresponding to clusters 1, 2 and 3 in the CLV method. It should be noted that the partial clustering output depends on the initial input; therefore, this result represents only a local optimum solution.

Because the centroid of the \(k\)th cluster is the prediction for As through variables in the \(k\)th group, the loading of an element (\(x_{j}\)) on the corresponding centroid (\(p_{k}\)) is the absolute value of \(\rho_{x_{j} y} / \rho_{p_{k} y}\). Figure 8 shows the loadings of each group, which are the centroid loadings on elements.

Compared with the result in Figure 7, the new clustering result obtained through the new method is similar to that obtained through CLV based on the main elements in each cluster. For example, clusters 1, 2 and 3 in Figure 8 correspond to clusters 1, 2 and 3 in Figure 7, respectively. The main elements in the three clusters in Figures 7 and 8 are {Au, W, Ag}, {Zn, Cu, Pb, Sb} and {Rb, Ti}, but the importance of each element was changed, i.e., the roles of Au, W, and Ag in Figure 8 are strengthened compared with those in Figure 7.

Figure 8 Centroid loadings on elements per cluster calculated through the new method. Blue, orange and purple bars represent clusters 1, 2 and 3, respectively. The heights of the bars represent the loadings of each centroid on the corresponding elements

Spatial Distribution of Cluster Centroids

To compare the two methods, the centroids of the clusters in both methods were mapped (Fig. 9) through IDW interpolation (using the same parameters as those in Fig. 5). Because the centroid in the weighted CLV is the prediction for As, the error of prediction (Fig. 10) was also mapped through IDW interpolation.

Figure 9 (a), (c) and (e) Score maps of centroids of clusters 1, 2 and 3 obtained through CLV. (b), (d) and (f) Score maps of response centroids obtained through weighted CLV. All maps are interpolated by IDW and reclassified into eight classes using the natural breaks (Jenks) method in ArcGIS 10.3

Figure 10 Prediction errors in clusters 1 (a), 2 (b) and 3 (c). (d) Prediction error through all elements. All maps are interpolated through the IDW method

For comparison, clusters with similar element combinations were matched. They shared the element combinations {Au, W, Ag}, {Zn, Cu, Pb, Sb} and {Rb, Ti} in clusters 1, 2 and 3, respectively. The corresponding centroids of each cluster from the CLV and the weighted CLV were named \(g_{1}\), \(g_{2}\), and \(g_{3}\) and \(p_{1}\), \(p_{2}\), and \(p_{3}\), respectively. The corresponding spatial distributions are mapped in Figure 9a and b (\(g_{1}\) and \(p_{1}\)), Figure 9c and d (\(g_{2}\) and \(p_{2}\)), and Figure 9e and f (\(g_{3}\) and \(p_{3}\)). From these figures, it can be deduced that the factors depicted by the three clusters relate to the intersections of geological boundaries and anticlines (Fig. 9a, b), to the Goldenville Formation (Fig. 9c, d) and to magmatic rocks (Fig. 9e, f). These factors are closely related to gold mineralization in the study area.

The predicted errors of each cluster from the weighted CLV (\(e_{1}\), \(e_{2}\) and \(e_{3}\)) and the predicted errors of all elements are mapped in Figure 10, which depicts the difference between the estimated value (\(p_{i}\)) and the response element (As). As evident in Figure 9, the spatial distributions of the three clusters were different. \(g_{1}\) and \(p_{1}\) are related to the boundaries of geological features and to NE–SW-trending folds; these represent transportation channels for the gold mineralization. \(g_{2}\) and \(p_{2}\) include most of the information from As, with most of the high concentration values located within the Goldenville Formation; these values represent the material source of gold mineralization. \(g_{3}\) and \(p_{3}\) relate to granite intrusions, which represent the kinetic factors of gold mineralization. There are also some differences in the spatial distributions of the cluster centers obtained by the two methods. Compared with the CLV, the clustering center obtained by the weighted CLV, under the constraint of element As, had a clearer spatial representation of the ore-controlling factors of gold mineralization.

In Figure 10, most of the error in \(e_{1}\) was located within the Goldenville Formation and in the southwest area, but the error around the lithological boundaries was small. The error in \(e_{2}\) shows that the prediction of \(p_{2}\) was good in most of the areas except at the center of the map. The error in \(e_{3}\) shows that most of the area of good prediction of \(p_{3}\) is located within the granite and granodiorite. The levels of the errors reflect the quality of the prediction. The combination of elements from the same source tended to have better predictability, and the elements from different sources were less predictable. The different error distributions indicate that the relationships between the cluster centers and As were different, indicating that As played a significant restrictive role in the clustering process.

To further compare the difference between the centroids obtained by the weighted CLV and CLV methods, a section from northwest to southeast is shown in Figure 11a, and scores of the centroids of clusters 1, 2 and 3 along this section are shown in Figure 11b–d. The comparison revealed that the score profile obtained by weighted CLV has the same trend as that obtained by CLV, but the former was smoother than the latter. For example, in Figure 11b, the two profiles had obvious high-value anomalies at the contacts between different geological units, but the weighted CLV centroid profile suppressed the high-value anomalies inside the geological units, which enhances the control of the contact boundaries. In addition, Figure 12 shows the mean scores of the cluster centroids in different geological units obtained by the two clustering methods. Figure 12b and c show that the mean values of cluster centroids are significantly different among different geological units, and the clustering centroids had a good distinguishing effect on different geological units, which is consistent with the conclusions observed in Figure 9d and f.

Figure 11 (a) A northwest-southeast section through the study area; the parts of the section in the geological units Granite and Granodiorite, Goldenville Formation, and Halifax Formation are marked in blue, green and yellow, respectively. (b), (c) and (d) Scores of the centroids of clusters 1, 2 and 3 on this section obtained by the two clustering methods. The yellow and blue curves represent the centroids of the weighted CLV and CLV methods, respectively

Figure 12 Mean values of cluster centroid scores in different geological units obtained by the two clustering methods. Red = CLV. Green = weighted CLV. (a), (b) and (c) represent the centroids of clusters 1, 2 and 3, respectively


In this research, a weighted clustering method based on a new index \(\left( \left| \rho_{x_{j} y} / \rho_{p_{k} y} \right| \right)\) of a two-variable relationship is proposed. Unlike the CLV, which is based on the correlations among variables, the new index is based not only on the correlations among variables but also on their relationships with a constraint variable. The index is a “weighted” correlation coefficient via a specific constraint variable. In the proposed weighted clustering method, the correlation coefficients used in CLV are replaced by the new index for each variable pair. In addition, the centroid of each cluster obtained by CLV is replaced by the prediction for the variable in the “weighted” correlation coefficient rather than the first component. In this way, the CLV method is transformed into a “weighted” variable clustering method. The clustering result can respond to different factors related to a response variable. Through different constraint variables, it provides flexibility to build a clustering model.

Clustering results for a classic geochemical dataset demonstrate that the weighted CLV maintains the basic characteristics of the traditional CLV clustering method, and the structure of the clustering results of the elements is similar. However, by introducing a constraint variable that is closely related to other elements and the mineralization in the study area, there is clearer correspondence between the newly obtained cluster centroids and the geological units. By suppressing the unrelated correlation with the constraint variables, the clustering results can be fine-tuned, and the weighted CLV method has more prominent clustering center characteristics, which is more valuable for subsequent applications.


  • Agterberg, F. P., Bonham-Carter, G. F., & Wright, D. F. (1990). Statistical pattern integration for mineral exploration. In Computer applications in resource estimation (pp. 1–21). Pergamon.

  • Aitchison, J. (1982). The statistical analysis of compositional data. Journal of the Royal Statistical Society: Series B (Methodological), 44(2), 139–160.

  • Aitchison, J. (1984). The statistical analysis of geochemical compositions. Journal of the International Association for Mathematical Geology, 16(6), 531–564.

  • Aitchison, J. (1986). The statistical analysis of compositional data. London: Chapman & Hall.

  • Bonham-Carter, G. F., Agterberg, F. P., & Wright, D. F. (1988). Integration of geological datasets for gold exploration in Nova Scotia. Photogrammetric Engineering and Remote Sensing, 54(11), 1585–1592.

  • Buccianti, A., Lima, A., Albanese, S., & De Vivo, B. (2018). Measuring the change under compositional data analysis (CoDA): Insight on the dynamics of geochemical systems. Journal of Geochemical Exploration, 189, 100–108.

  • Castillo-Muñoz, R., & Howarth, R. J. (1976). Application of the empirical discriminant function to regional geochemical data from the United Kingdom. Geological Society of America Bulletin, 87(11), 1567–1581.

  • Chatterjee, A. K. (1983). Metallogenic map of the Province of Nova Scotia. Department of Mines and Energy, Nova Scotia, Canada, ver. 1, scale 1:500,000.

  • Chen, M., & Vigneau, E. (2014). Supervised clustering of variables. Advances in Data Analysis and Classification, 10(1), 85–101.

  • Chun, H., & Keleş, S. (2010). Sparse partial least squares regression for simultaneous dimension reduction and variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(1), 3–25.

  • Crocket, J., Fueten, F., & Clifford, P. (1986). Distribution and localization of gold in Meguma Group rocks, Nova Scotia: Implications of metal distribution patterns in quartz veins and host rocks on mineralization processes at Harrigan Cove, Halifax County. Atlantic Geology, 22(1), 15–33.

  • Dhillon, I. S., Marcotte, E. M., & Roshan, U. (2003). Diametrical clustering for identifying anti-correlated gene clusters. Bioinformatics, 19(13), 1612–1619.

  • Dmitrijeva, M., Ehrig, K. J., Ciobanu, C. L., Cook, N. J., Verdugo-Ihl, M. R., & Metcalfe, A. V. (2019). Defining IOCG signatures through compositional data analysis: A case study of lithogeochemical zoning from the Olympic Dam deposit, South Australia. Ore Geology Reviews, 105, 86–101.

  • Dunn, C. E., Coker, W. B., & Rogers, P. J. (1991). Reconnaissance and detailed geochemical surveys for gold in eastern Nova Scotia using plants, lake sediment, soil and till. Journal of Geochemical Exploration, 40(1–3), 143–163.

  • Egozcue, J. J., Pawlowsky-Glahn, V., Mateu-Figueras, G., & Barcelo-Vidal, C. (2003). Isometric logratio transformations for compositional data analysis. Mathematical Geology, 35(3), 279–300.

  • Greenacre, M., Grunsky, E., & Bacon-Shone, J. (2021). A comparison of isometric and amalgamation logratio balances in compositional data analysis. Computers and Geosciences, 148, 104621.

  • Gustavsson, N., & Bjorklund, A. (1976). Lithological classification of tills by discriminant analysis. Journal of Geochemical Exploration, 5, 393–395.

  • Hanesch, M., Scholger, R., & Dekkers, M. J. (2001). The application of fuzzy c-means cluster analysis and non-linear mapping to a soil data set for the detection of polluted sites. Physics and Chemistry of the Earth, Part A: Solid Earth and Geodesy, 26(11–12), 885–891.

  • Howarth, R. J., & Jones, M. J. (1972). The pattern recognition problem in applied geochemistry. Geochemical Exploration, 259–273.

  • Jain, A. K., & Dubes, R. C. (1988). Algorithms for clustering data. Prentice-Hall.

  • Ji, H., Zeng, D., Shi, Y., Wu, Y., & Wu, X. (2007). Semi-hierarchical correspondence cluster analysis and regional geochemical pattern recognition. Journal of Geochemical Exploration, 93(2), 109–119.

  • Jolliffe, I. T., Trendafilov, N. T., & Uddin, M. (2003). A modified principal component technique based on the LASSO. Journal of Computational and Graphical Statistics, 12(3), 531–547.

  • Kerswill, J. A. (1991). Lithogeochemical indicators of gold potential in the eastern Meguma Terrane of Nova Scotia. Papers-Geological Survey of Canada, 19–19.

  • Kontak, D. J., Horne, R. J., Sandeman, H., Archibald, D., & Lee, J. K. (1998). 40Ar/39Ar dating of ribbon-textured veins and wall-rock material from Meguma lode gold deposits, Nova Scotia: Implications for timing and duration of vein formation in slate-belt hosted vein gold deposits. Canadian Journal of Earth Sciences, 35(7), 746–761.

  • Kontak, D. J., & Kerrich, R. (1997). An isotopic (C, O, Sr) study of vein gold deposits in the Meguma Terrane, Nova Scotia; implication for source reservoirs. Economic Geology, 92(2), 161–180.

  • Kontak, D. J., Smith, P. K., Kerrich, R., & Williams, P. F. (1990). Integrated model for Meguma group lode gold deposits, Nova Scotia, Canada. Geology, 18(3), 238–242.

  • Kramar, U. (1995). Application of limited fuzzy clusters to anomaly recognition in complex geological environments. Journal of Geochemical Exploration, 55(1–3), 81–92.

  • Kriegel, H. P., Kröger, P., Schubert, E., & Zimek, A. (2008). A general framework for increasing the robustness of PCA-based correlation clustering algorithms. In International Conference on Scientific and Statistical Database Management (pp. 418–435). Springer.

  • Lê Cao, K. A., Rossouw, D., Robert-Granié, C., & Besse, P. (2008). A sparse PLS for variable selection when integrating omics data. Statistical Applications in Genetics and Molecular Biology, 7(1).

  • Liu, J., Cheng, Q., & Wang, J. (2015). Identification of geochemical factors in regression to mineralization endogenous variables using structural equation modeling. Journal of Geochemical Exploration, 150, 125–136.

  • Mawer, C. K. (1986). The bedding-concordant gold-quartz veins of the Meguma Group, Nova Scotia. Turbidite-Hosted Gold Deposits, 32, 135–148.

  • Qannari, E. M., Vigneau, E., & Courcoux, P. (1998). Une nouvelle distance entre variables. Application en classification. Revue de Statistique Appliquée, 46(2), 21–32.

  • Qannari, E. M., Vigneau, E., Luscan, P., Lefebvre, A. C., & Vey, F. (1997). Clustering of variables, application in consumer and sensory studies. Food Quality and Preference, 8(5–6), 423–428.

  • Rantitsch, G. (2000). Application of fuzzy clusters to quantify lithological background concentrations in stream-sediment geochemistry. Journal of Geochemical Exploration, 71(1), 73–82.

  • Reimann, C., & Filzmoser, P. (2000). Normal and lognormal data distribution in geochemistry: Death of a myth. Consequences for the statistical treatment of geochemical and environmental data. Environmental Geology, 39(9), 1001–1014.

  • Reimann, C., Filzmoser, P., & Garrett, R. G. (2002). Factor analysis applied to regional geochemical data: Problems and possibilities. Applied Geochemistry, 17(3), 185–206.

  • Rogers, P. J., Chatterjee, A. K., & Aucott, J. W. (1990). Metallogenic domains and their reflection in regional lake sediment surveys from the Meguma Zone, southern Nova Scotia, Canada. Journal of Geochemical Exploration, 39(1–2), 153–174.

  • Ryan, R. J., & Ramsay, W. R. H. (1996). Preliminary comparison of gold field in the Meguma Terrain, Nova Scotia, and Victoria, Australia. MacDonald, D. R., & Mills, K. A. (Eds.). Mines and Mineral Branch. Report of Activities, 97–1.

  • Sangster, A. L. (1990). Metallogeny of the Meguma Terrane, Nova Scotia. Mineral Deposit Studies in Nova Scotia, 1, 90–98.

  • Särndal, C. E., Swensson, B., & Wretman, J. (2003). Model assisted survey sampling. Springer.

  • Soffritti, G. (1999). Hierarchical clustering of variables: A comparison among strategies of analysis. Communications in Statistics-Simulation and Computation, 28(4), 977–999.

  • Subedi, S., Punzo, A., Ingrassia, S., & McNicholas, P. D. (2013). Clustering and classification via cluster-weighted factor analyzers. Advances in Data Analysis and Classification, 7(1), 5–40.

  • Templ, M., Filzmoser, P., & Reimann, C. (2008). Cluster analysis applied to regional geochemical data: Problems and possibilities. Applied Geochemistry, 23(8), 2198–2213.

  • Thiombane, M., Martín-Fernández, J. A., Albanese, S., Lima, A., Doherty, A., & De Vivo, B. (2018). Exploratory analysis of multi-element geochemical patterns in soil from the Sarno River Basin (Campania region, southern Italy) through compositional data analysis (CODA). Journal of Geochemical Exploration, 195, 110–120.

  • Tillé, Y., & Matei, A. (2011). Sampling: Survey Sampling. R Package Version 2.4.

  • Vigneau, E., Endrizzi, I., & Qannari, E. M. (2011). Finding and explaining clusters of consumers using the CLV approach. Food Quality and Preference, 22(8), 705–713.

  • Vigneau, E., & Qannari, E. M. (2003). Clustering of variables around latent components. Communications in Statistics-Simulation and Computation, 32(4), 1131–1150.

  • Vriend, S. P., van Gaans, P. M., Middelburg, J., & De Nijs, A. (1988). The application of fuzzy c-means cluster analysis and non-linear mapping to geochemical datasets: Examples from Portugal. Applied Geochemistry, 3(2), 213–224.

  • Xie, X., Liu, D., Xiang, Y., Yan, G., & Lian, C. (2004). Geochemical blocks for predicting large ore deposits—concept and methodology. Journal of Geochemical Exploration, 84(2), 77–91.

  • Xu, Y., & Cheng, Q. (2001). A fractal filtering technique for processing regional geochemical maps for mineral exploration. Geochemistry: Exploration, Environment, Analysis, 1(2), 147–156.

  • Zentilli, M., Graves, M. C., Mulja, T., MacInnis, I., & Matheson, J. R. (1985). Geochemical characterization of the Goldenville-Halifax Transition of the Meguma Group of Nova Scotia; preliminary report. Information Series-Nova Scotia, Department of Mines and Energy.

  • Zou, H., Hastie, T., & Tibshirani, R. (2006). Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2), 265–286.


This study was funded by the Foreign Aid Project of the Ministry of Commerce of the People’s Republic of China (2021-28) and the China National Major Water Conservancy Project Construction Fund (0001212012AC50001).

Author information


Corresponding author

Correspondence to Yusen Dong.

Ethics declarations

Conflict of Interest

The authors declare that they have no known competing financial interests or personal relationships that could affect the work reported in this article.

Data Availability

The data that supported the findings of this study are openly available in Department of Natural Resources and Renewables, Nova Scotia, Canada at

Cite this article

Liu, J., Cheng, Q., Wang, JG. et al. A “Weighted” Geochemical Variable Classification Method Based on Latent Variables. Nat Resour Res 31, 1925–1941 (2022).


  • Variable clustering
  • Clustering around latent variables (CLV)
  • Weighted clustering
  • Geochemical factor extraction