1 Introduction

Given a dissimilarity matrix between N objects, multidimensional scaling (MDS) aims to estimate a configuration of points in a space of low dimension p, such that the distances between them, usually Euclidean, are as close as possible to the observed dissimilarities between the objects. Since the seminal work of Sampson and Guttorp (1992), MDS has played an important role in spatial deformation modelling and estimation techniques, particularly in nonparametric approaches to the analysis of the covariance structure of the spatiotemporal processes underlying environmental studies. However, in applications involving the estimation of the covariance structure, locating environmental phenomena on the surface of the sphere rather than in a planar representation is of increasing interest, as it allows us to capture the spatiotemporal dependencies between observations and to develop accurate predictions about planet Earth (see, for example, Alegría et al. 2018).

Although one of the main goals of MDS is to facilitate the visual interpretation of relationships between objects, when the number of objects is large this representation tends to be rather confusing, making it difficult to identify structures in the data. In addition, the approximation error between the dissimilarities and the estimated distances increases with the number of points, in particular in low dimensions. This problem is usually addressed by the joint application of MDS and cluster analysis. However, a clustering that is optimal in the original space may not be optimal in a reduced space, so performing cluster analysis and then representing the objects in a low-dimensional space can lead to interpretation errors (Heiser and Groenen 1997). To enhance the interpretation of the MDS solution and/or to obtain an adequate fit of the model when the number of objects is too large, cluster-MDS models have proven useful, both in the classical context and in the least squares framework (Bock 1986, 1987; Heiser 1993; Heiser and Groenen 1997; Vera et al. 2008). In a probabilistic framework, Vera et al. (2009a, 2009b) have proposed latent class multidimensional scaling models for dissimilarity data in Euclidean spaces.

In some experimental situations, it is desirable to impose constraints on the MDS configuration to best represent the singularities in the relations between objects, usually obtaining a representation in a specific parametric space. In particular, constraints related to quadratic surfaces such as the sphere arise, for example, in the analysis of dissimilarities between geographical regions, of distances between cities (or countries), or in the measurement of large-scale environmental phenomena affecting different locations on the planet, and in general whenever the three-dimensional MDS solution tends to have a spherical shape. Other applications of quadratic surface embedding involve curves such as parabolas or ellipses, but in the present case we are interested in spherical surfaces. On the other hand, a different situation from the one addressed here arises when the true configuration of the objects is known in high dimensionality and the objective is to embed it in the surface of the sphere. In these cases, in addition to local MDS, there are other very efficient procedures that represent points while preserving local structures, such as SRCA (Luo et al. 2023), t-SNE (Van Der Maaten and Hinton 2008) or UMAP (McInnes et al. 2018). However, it is important to note that these methods do not perform clustering and, in general, are not designed to preserve global structures (Wang et al. 2021).

As a running example, we consider the monthly mean near-surface air temperature dataset (Harris et al. 2020a, b) of the Climatic Research Unit of the University of East Anglia (CRU TS 4.02). A random sample of 500 locations was analysed from the total of 67,420 locations, during the \(T=12\) months of 2017. The complete database, Version 4 of the CRU TS monthly high-resolution gridded multivariate climate dataset, can be found in Harris et al. (2020a). Here we focus on the spherical spatial representation of the locations based on their temperature-related dissimilarities using MDS, and in particular on the combined use of clustering and spherical MDS to facilitate the interpretation of the representation and to reduce the number of parameters to be estimated. In addition, the estimation accuracy of a spatial deformation procedure using spherical spline interpolation is analysed in relation to the representation error in MDS.

Our analysis considers the situation in which the MDS configuration must fall on a spherical surface. This case requires a measure of the distance between points that is more appropriate than the Euclidean one for approximating the dissimilarities in this framework. One such measure is the geodesic distance, that is, the length of the shortest geodesic along the surface. There exists a monotonic relation between Euclidean and geodesic distances, and various approaches to this problem have been proposed (see, for example, De Leeuw and Mair 2009, for further details). One of the most widely used methods for this purpose consists of applying MDS with quadratic restrictions on the configuration, with both Euclidean (MDS-Q) and geodesic (geodesic MDS-Q) distances. This model can be viewed as a weakly constrained MDS (Borg and Groenen 2005).

Gnanadesikan (1977) proposed a two-step method to impose constraints in MDS as a form of nonlinear component analysis, and Bookstein (1979) and Fitzgibbon et al. (1999) later proposed improvements to this method. In general, in the two-step procedure, first an unrestricted MDS solution is found and then the best quadratic surface is fitted. In a direct approach for Euclidean distances with spherical constraints on the configuration, Bentler and Weeks (1978) applied Gauss–Newton methods with linear constraints, while De Leeuw and Heiser (1980) developed a general theory of MDS with restrictions on the configuration. For the geodesic distance, Cox and Cox (1991) proposed circular and spherical non-metric MDS models, and Elad et al. (2005) solved the metric spherical MDS problem for geodesic distances, minimising the stress by means of a gradient method with line search.

De Leeuw and Mair (2009) proposed a majorisation-based methodology for quadratic surfaces in a least squares framework, based on two approaches, the primal or quadratic multidimensional scaling (Q-MDS) method and the dual method. In the first, the quadratic constraints are incorporated in parametric form directly into the loss function based on Euclidean distances, while in the second, the constraints are imposed at convergence by means of penalty or Lagrangian terms. Various dual methods have been proposed by Borg and Lingoes (1980) in which the constraints are imposed directly on the distances.

However, regardless of the estimation procedure used, the problem of achieving visualisation with a large number of objects is even more acute in the case of a spherical representation. This problem was addressed by Dzwinel et al. (2005), who used multi-resolutional clustering and nonlinear MDS separately to cluster the time events in the feature space and inspected the resulting clustering structures using three-dimensional MDS. For example, in Lopes et al. (2014), dissimilarities were obtained between fifty subjectively chosen zones or clusters, with the intention of investigating the behaviour of more than three million seismic occurrences around the world, and the clusters were represented by MDS.

The usefulness of a model that enables clusters to be estimated while the centre of each cluster is represented on the sphere is thus evident in a wide variety of practical applications. This approach is interesting, for instance, for the problem of estimating non-stationary spatial covariance when a broad domain of the spherical surface is involved (see for example Vera et al. 2008, 2009a). In this type of application, reducing the representation to clusters while preserving the structure between them also reduces the number of underlying parameters to be estimated, in addition to facilitating interpretation. This also helps to avoid known problems such as those of estimation in oversampled domains (Kovitz and Christakos 2004) or to reduce the appearance of non-injective mappings as a result of a “nugget effect” or folding problems (Sampson and Guttorp 1992; Vera et al. 2008, 2009a).

Additionally, since the dimensionality in this particular MDS model is fixed a priori, the use of criteria to select the number of clusters directly from the dissimilarity matrix is also advisable (see Vera and Macías 2017).

In the present paper, we propose a model that, given a dissimilarity matrix between a set of objects, obtains a classification of the objects into homogeneous clusters while, simultaneously, the centres of the clusters (rather than the objects themselves) are represented on a spherical surface using the geodesic distance. The parameters are estimated using an alternating estimation procedure in which, given a classification, the cluster centres are represented using the well-known monotonic relation between the geodesic and the Euclidean distances. For the representation step, we propose a Q-MDS approach for geodesic distances that estimates the configuration on the sphere surface, in combination with a ratio transformation that allows us to estimate the optimal radius of the sphere assuming that it is centred at the origin.

In the next section, we formulate the combined clustering and spherical representation model. Section 3 then describes the alternating estimation procedure together with the criterion applied to determine the number of clusters. In Sect. 4, we analyse the behaviour of the model, based on a Monte Carlo experiment, and in Sect. 5 its performance for empirical data is illustrated and compared with that of a two-step estimation procedure based on K-means clustering and a quadratic constraints MDS algorithm. In addition, the adjustment of a spatial deformation between the geographic configuration and that of the MDS by means of spherical splines is illustrated, and the interpolation and location errors of the estimated points on the sphere are compared, with or without clustering. Finally, we discuss the results obtained and present the main conclusions drawn.

2 The Clustering Spherical Scaling (CSS) Model

Let us denote by \(\varvec{\Delta }\) a dissimilarity matrix between N objects \(o_{i}\), with entries \(\delta _{ij}\), \(i,j=1,\ldots ,N\). It is assumed that the objects are grouped into K disjoint clusters, and thus we denote by \(\textbf{E}\) an \(N \times K\) matrix of binary entries \(e_{ik}\), where \(e_{ik}=1\) if the ith object belongs to the kth cluster, and zero otherwise.
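For concreteness, a minimal base-R sketch (the function name is ours) of how such an indicator matrix can be built from a vector of cluster labels:

```r
## Build the N x K binary indicator matrix E from a vector of cluster labels
## taking values in 1, ..., K.
make_indicator <- function(labels, K = max(labels)) {
  E <- matrix(0L, nrow = length(labels), ncol = K)
  E[cbind(seq_along(labels), labels)] <- 1L
  E
}

## Example: five objects assigned to K = 3 disjoint clusters
E <- make_indicator(c(1, 3, 2, 3, 1), K = 3)
rowSums(E)  # every row sums to one, since the clusters are disjoint
```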

We denote by \(\textbf{X}\) a configuration of K points representing the cluster centres on the surface of a sphere of radius \(\mu \), and denote by \(d_{kl}\), the Euclidean distance between the points \(\textbf{x}_{k}\), \(\textbf{x}_{l} \in \mathcal {S}(\mu )\), and by \(\breve{d}_{kl}\), the related geodesic great-circle distance given by

$$\begin{aligned} \breve{d}_{kl}=\breve{d}(\textbf{x}_{k},\textbf{x}_{l})= \mu \arccos \left( \frac{ \textbf{x}_{k}'\textbf{x}_{l}}{\mu ^2} \right) =\mu \arccos \left( \frac{2\mu ^2-{d}_{kl}^{2}}{2\mu ^2}\right) . \end{aligned}$$
(1)
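To illustrate Eq. (1), the following base-R sketch (function names are ours) computes the great-circle distance either from the coordinates of two points on \(\mathcal {S}(\mu )\) or from their chord (Euclidean) distance:

```r
## Great-circle (geodesic) distance of Eq. (1) on a sphere of radius mu.
geodesic_dist <- function(xk, xl, mu = 1) {
  cosang <- max(-1, min(1, sum(xk * xl) / mu^2))  # clamp against rounding error
  mu * acos(cosang)
}

## Same quantity obtained from the chord (Euclidean) distance d_kl.
geodesic_from_chord <- function(d, mu = 1) {
  cosang <- pmax(-1, pmin(1, (2 * mu^2 - d^2) / (2 * mu^2)))
  mu * acos(cosang)
}

geodesic_dist(c(0, 0, 1), c(0, 0, -1))  # antipodal points on S(1): pi
```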

Like the Euclidean distance, the geodesic distance suffers from scale indeterminacy, since for \(\textbf{x}_{k},\textbf{x}_{l}\in \mathcal {S}(\mu )\), and \(b>0\), \(b \breve{d}(\textbf{x}_{k},\textbf{x}_{l})=\breve{d}(b \textbf{x}_{k},b \textbf{x}_{l})\), where \(b\textbf{x}_{k},b\textbf{x}_{l}\in \mathcal {S}(b\mu )\), both distances being monotonically related. Without loss of generality, we can consider weights \(w_{ij}\) for the dissimilarities, which also enables us to deal with missing data. The loss function (stress) in this model is given by

$$\begin{aligned} \sigma \left( \textbf{X},\textbf{E}\right) = \sum _{k\le l}\sum _{i=1}^{N}\sum _{j=1}^{N}{e_{ik}e_{jl}\omega _{ij}\left( \delta _{ij}-\breve{d}_{kl}\left( \textbf{X}\right) \right) ^2}, \end{aligned}$$
(2)

and our aim, hence, is to minimise Eq. (2) in terms of a classification \(\textbf{E}\) and a configuration \(\textbf{X}\) on the surface of a sphere of radius \(\mu \).

According to least squares orthogonality (Heiser and Groenen 1997), the stress function Eq. (2) can be decomposed, considering geodesic distances, as follows:

$$\begin{aligned} \sigma \left( \textbf{X},\textbf{E}\right) = \sum _{k\le l}\sum _{i=1}^{N}\sum _{j=1}^{N}{e_{ik}e_{jl}\omega _{ij}\left( \delta _{ij}-{\widetilde{\delta }}_{kl}\right) ^2} +\sum _{k\le l}{{\widetilde{\omega }}_{kl}\left( {\widetilde{\delta }}_{kl}-\breve{d}_{kl}\left( \textbf{X}\right) \right) ^2} \end{aligned}$$
(3)

where \({\widetilde{\delta }}_{kl}\) is the Sokal–Michener dissimilarity (Sokal and Michener 1958) given by

$$\begin{aligned} {\widetilde{\delta }}_{kl}=\sum _{i=1}^{N}\sum _{j=1}^{N}e_{ik}e_{jl}\frac{\omega _{ij}\delta _{ij}}{{\widetilde{\omega }}_{kl}}, \quad \text{ with } \quad {\widetilde{\omega }}_{kl}=\sum _{i=1}^{N}\sum _{j=1}^{N}e_{ik}e_{jl}\omega _{ij}. \end{aligned}$$
(4)

The first term in Eq. (3) depends only on the classification \({\textbf {E}}\), while the last term depends on both the classification and the representation. Therefore, the parameter estimation in this model can be performed in an alternating least squares procedure.
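The quantities in Eqs. (3) and (4) are straightforward to compute. The following base-R sketch (function names are ours; non-empty clusters and strictly positive aggregated weights are assumed) evaluates the Sokal–Michener dissimilarities and the two terms of the decomposition, given the dissimilarity matrix Delta, the weight matrix W, the indicator matrix E and the matrix Dgeo of geodesic distances between the cluster centres:

```r
## Aggregated (Sokal-Michener) dissimilarities and weights of Eq. (4).
sokal_michener <- function(Delta, W, E) {
  Wt  <- t(E) %*% W %*% E                    # omega-tilde_{kl}
  Dlt <- (t(E) %*% (W * Delta) %*% E) / Wt   # delta-tilde_{kl}
  list(delta = Dlt, omega = Wt)
}

## The two terms of the decomposition in Eq. (3): the first depends on the
## classification only, the second on the classification and the representation.
stress_decomposition <- function(Delta, W, E, Dgeo) {
  K  <- ncol(E)
  sm <- sokal_michener(Delta, W, E)
  within <- 0; between <- 0
  for (k in 1:K) for (l in k:K) {
    ik <- which(E[, k] == 1); jl <- which(E[, l] == 1)
    within  <- within + sum(W[ik, jl, drop = FALSE] *
                            (Delta[ik, jl, drop = FALSE] - sm$delta[k, l])^2)
    between <- between + sm$omega[k, l] * (sm$delta[k, l] - Dgeo[k, l])^2
  }
  list(within = within, between = between)
}
```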

3 Parameter Estimation

The algorithm consists of two stages that alternate iteratively between an allocation step and a representation step, until the convergence criterion holds. The iterative cycle continues until the difference between two consecutive values of the stress function Eq. (2) is below a small value previously set by the investigator. The algorithm starts with an initial classification (allocation step). This can be set by the investigator, or taken randomly. Then, the Sokal–Michener dissimilarities are calculated, and the related configuration in a sphere for the cluster centres is estimated (representation step). The alternating procedure continues iteratively until the loss function is minimised (see Fig. 1). The two estimation phases are described in detail below.

3.1 Allocation Step

Given an estimated value of \(\textbf{X}\in \mathcal {S}(\mu )\), we classify each object into the cluster for which its dissimilarities are best approximated by the geodesic distances between the corresponding cluster centres. To this end, the following loss function is minimised (Heiser and Groenen 1997),

$$\begin{aligned} \min _{E}\kappa ^2\left( E\vert \textbf{X},{\varvec{\Delta }}\right) =\sum _{i}\sum _{k} e_{ik} \Vert \textbf{a}_i-\textbf{b}_k^{\left( i\right) }\Vert ^2, \end{aligned}$$
(5)

where \(\Vert \textbf{a}_{i}-\textbf{b}^{(i)}_{k}\Vert ^2\) denotes the squared Euclidean distance between the ith row of the matrix \(\textbf{A}=\left\{ a_{ir}\right\} \) of order \(N\times \left( N-1\right) \) and the kth row of the matrix \(\textbf{B}^{\left( i\right) }=\left\{ b_{kr}^{\left( i\right) }\right\} \) of order \(K\times \left( N-1\right) \). For \(i=1,\ldots ,N\), the elements of \(\textbf{A}\) and \(\textbf{B}^{\left( i\right) }\) are specified as \(a_{ir}=\delta _{ir}^*\) and \(b_{kr}^{\left( i\right) }=\breve{d}_{kr}^*\), where \(\delta _{ir}^*=\delta _{is}\) and \(\breve{d}_{kr}^*=\sum _{l} e_{sl}\breve{d}_{kl}\left( \textbf{X}\right) \), for \(r=1,\ldots ,N-1\), with \(s=r\) if \(r<i\), and \(s=r+1\) if \(r\ge i\), and where \(e_{sl}\) is the binary entry of \(\textbf{E}\) equal to one if object s belongs to cluster l.

The optimal classification stage consists of a nested iterative cycle of NK iterations, after which a new classification matrix \(\textbf{E}\) is obtained, from which the weights \({\widetilde{\omega }}_{kl}\), and the Sokal–Michener dissimilarities \({\widetilde{\delta }}_{kl}\) are updated. Heiser and Groenen (1997) have shown that the assignment step can be viewed as a K-means clustering procedure. A simple convergent algorithm is possible if we proceed row by row finding

$$\begin{aligned} \displaystyle \min _{\varvec{e}_{i}}\sum _{k}e_{ik}\Vert {\textbf {a}}_{i}-{\textbf {b}}^{(i)}_{k}\Vert ^{2}, \end{aligned}$$
(6)

where \(\varvec{e}_{i}=(e_{i1},\dots ,e_{iK})'\) is the ith row of the matrix \(\textbf{E}\), while the allocation of the other objects \(j\ne i\) is kept fixed. The minimum is attained at some row \(\kappa \) of \(\textbf{B}^{(i)}\); if \(\kappa \) does not change the current assignment, the algorithm moves to the next object, otherwise object i is reallocated and the ith row of \(\textbf{E}\) is updated first. The next object is then considered, until all objects have been examined, which concludes the allocation phase.
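A single pass of this allocation step can be sketched in base R as follows (the function name is ours; unit weights are assumed for simplicity, and Dgeo is the matrix of geodesic distances between the current cluster centres, with zero diagonal):

```r
## One pass of the allocation step (Eqs. (5)-(6)): each object is reallocated, in
## turn, to the cluster whose centre-to-centre geodesic distances best match its
## dissimilarities to the remaining objects, keeping the other labels fixed.
allocate_objects <- function(Delta, labels, Dgeo) {
  N <- nrow(Delta); K <- nrow(Dgeo)
  for (i in 1:N) {
    others <- setdiff(1:N, i)
    a_i <- Delta[i, others]                          # row a_i of A
    B_i <- Dgeo[, labels[others], drop = FALSE]      # K x (N-1) matrix B^(i)
    crit <- rowSums((matrix(a_i, K, N - 1, byrow = TRUE) - B_i)^2)
    labels[i] <- which.min(crit)                     # reallocate if it improves
  }
  labels
}
```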

3.2 Geodesic MDS-Q Step

For a given classification E, the corresponding Sokal–Michener dissimilarities \({\widetilde{\delta }}_{kl}\) are calculated, and the configuration for the cluster centres is estimated in a geodesic MDS-Q step by minimising,

$$\begin{aligned} \min _{\textbf{X},\mu } \sum _{k<l}{{\widetilde{\omega }}_{kl}\left( {\widetilde{\delta }}_{kl}-\mu \breve{d}_{kl}\left( \textbf{X}\right) \right) ^2 \ \ , \ \ \text{ with } \ \ \textbf{X} \in \mathcal {S}(1)}. \end{aligned}$$
(7)

The configuration \(\textbf{X}\) is estimated using Euclidean distances in the smacof framework using the Guttman transform, after which the points are projected onto the surface of a sphere for an optimal radius \(\mu \) (De Leeuw and Mair 2009). Here, we present a refinement of the MDS-Q algorithm in which the optimal radius of the sphere is estimated not in terms of the projection from Euclidean space onto the surface of the sphere, but so as to minimise the stress function Eq. (2) at each step. This task is performed by an alternating estimation procedure in the MDS-Q step that estimates the radius as the slope of a linear transformation of the MDS-Q geodesic distances, given its linear invariance property. The algorithm at this representation step can be summarised as follows (see also De Leeuw and Mair 2009, for further details).

  1.

    Given \(\textbf{E}^{(s)}\) at the sth iteration in the geodesic step, take \(\textbf{Z}=\textbf{Y}^{(s)}\) the classical MDS solution in three dimensions for the Sokal–Michener dissimilarities.

  2.

    Calculate the Guttman transform \(\textbf{Y}^{(s)}=\textbf{V}^{+}\textbf{B}(\textbf{Z})\textbf{Z}\), which improves the configuration in terms of Euclidean distances by minimising the second term in Eq. (3). Here \(\textbf{V}^{+}\) is the Moore–Penrose inverse of \(\textbf{V}\), \(\textbf{V}^+=\left( \textbf{V}+K^{-1}\mathbf {11'}\right) ^{-1}-K^{-1}\mathbf {11'}\), where \(\textbf{V}=\sum _{k<l}{\widetilde{w}_{kl}\textbf{M}_{kl}}\), with \(\textbf{M}_{kl}= \left( \textbf{m}_k-\textbf{m}_l\right) \left( \textbf{m}_k-\textbf{m}_l\right) ^\prime \), whose elements are \(m_{kk} = m_{ll} = 1\), \(m_{kl} = m_{lk}=-1\), and 0 elsewhere, and \(\textbf{B}\left( \textbf{Z}\right) =\sum _{k<l}{\widetilde{w}_{kl}\ s_{kl}\left( \textbf{Z}\right) \textbf{M}_{kl}}\), with \(s_{kl}(\textbf{Z})=\widetilde{\delta }_{kl}/d_{kl}\left( \textbf{Z}\right) \) if \(d_{kl}\left( \textbf{Z}\right) >0\) and \(s_{kl}(\textbf{Z})=0\) otherwise. (A base-R sketch of steps 2–6 is given at the end of this subsection.)

  3.

    Obtain the projection \(\textbf{X}^{(s)} \in \mathcal {S}(\lambda ^{(s)})\) of \(\textbf{Y}^{(s)}\),

    $$\begin{aligned} \min _{\lambda ^{(s)},\textbf{X}^{(s)}\in \mathcal {S}(\lambda ^{(s)})} tr(\textbf{X}^{(s)}-\textbf{Y}^{(s)})'\textbf{V}(\textbf{X}^{(s)}-\textbf{Y}^{(s)}). \end{aligned}$$

    Then consider \(\textbf{X}^{(s)}=\textbf{X}^{(s)}/\lambda ^{(s)}\in \mathcal {S}(1)\) (see De Leeuw and Mair 2009 for details on convergence).

  4.

    Calculate the geodesic distances \(\breve{d}(\textbf{X}^{(s)})\) using Eq. (1), and estimate \(\mu ^{(s)}\) (see Appendix) by minimising Eq. (7) at the fixed value of \(\textbf{X}^{(s)}\).

  5.

    Set \(\textbf{X}^{(s)}=\mu ^{(s)} \textbf{X}^{(s)}\), and evaluate the stress \(\sigma ^{(s)}(\textbf{X}^{(s)}\vert \textbf{E})\).

  6.

    For \(s > 0\), and \(\varepsilon >0\) small enough, if \((\sigma ^{(s-1)}-\sigma ^{(s)})\le \varepsilon \), then \(\textbf{X}=\textbf{X}^{(s)}\in \mathcal {S}(\mu ^{(s)})\) is found and the algorithm stops. Otherwise, update the iteration index and return to step 2.

Finally, the geodesic distances \(\breve{d}(\textbf{X})\) are calculated and the algorithm continues with the allocation step, minimising Eq. (2). The overall alternating estimation procedure converges to a local minimum. The allocation step is a nested iterative loop of N assignments (a total of NK iterations), while the MDS step is essentially a K-order ratio MDS procedure, for which the Guttman transformation can be seen as a steepest descent step with a fixed stepsize parameter (see de Leeuw 1988, for details on the convergence of smacof).
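A base-R sketch of this representation step (steps 2–6 above) follows. It is a simplified sketch, not the exact implementation: the weighted projection of step 3 is replaced by a plain radial projection onto the unit sphere, and the radius is obtained as the standard weighted least-squares slope for Eq. (7), which we take to correspond to the closed form in the Appendix. Dlt and Wt denote the K × K Sokal–Michener dissimilarity and weight matrices, and Z is a starting configuration (e.g. the classical MDS solution in three dimensions).

```r
## Simplified geodesic MDS-Q representation step (smacof-style majorisation).
geodesic_mdsq <- function(Dlt, Wt, Z, maxit = 100, eps = 1e-6) {
  K <- nrow(Dlt)
  W <- Wt; diag(W) <- 0
  V    <- diag(rowSums(W)) - W               # V = sum_{k<l} w_kl M_kl
  ones <- matrix(1, K, K)
  Vinv <- solve(V + ones / K) - ones / K     # Moore-Penrose inverse of V
  up <- upper.tri(Dlt)
  stress_old <- Inf
  for (s in 1:maxit) {
    ## Step 2: Guttman transform based on the Euclidean distances of Z
    D <- as.matrix(dist(Z))
    S <- ifelse(D > 0, Dlt / D, 0)
    B <- diag(rowSums(W * S)) - W * S
    Y <- Vinv %*% B %*% Z
    ## Step 3 (simplified): radial projection onto the unit sphere S(1)
    X <- Y / sqrt(rowSums(Y^2))
    ## Steps 4-5: geodesic distances on S(1), optimal radius mu, and stress
    G  <- acos(pmin(pmax(tcrossprod(X), -1), 1))
    mu <- sum(Wt[up] * Dlt[up] * G[up]) / sum(Wt[up] * G[up]^2)
    stress <- sum(Wt[up] * (Dlt[up] - mu * G[up])^2)
    ## Step 6: stop when the stress no longer decreases appreciably
    if (stress_old - stress <= eps) break
    stress_old <- stress
    Z <- mu * X
  }
  list(X = mu * X, mu = mu, stress = stress)
}
```

In the complete CSS algorithm this routine is called once per outer iteration, after the allocation step has updated \(\textbf{E}\) and hence Dlt and Wt.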

Fig. 1 Pseudocode of the CSS Model algorithm

3.3 Selecting the Number of Clusters

Several criteria can be employed to select the number of clusters for a dissimilarity matrix (see Vera and Macías 2017, 2021). Here, we consider the adapted version of Hartigan’s criterion \(H^{*}\) proposed by Vera and Macías (2017), which has been experimentally shown to obtain good results even when some degree of overlap is present. Since the clustering is optimal in the entire space, the criterion is used directly on the original dissimilarity matrix without imposing geometric restrictions. This is defined by

$$\begin{aligned} H^{*}\left( K\right) =\left[ \frac{W^{*}(K)}{W^{*}(K+1)} -1\right] \left( \frac{N(N-1)-K(K+1)}{2}-1\right) , \end{aligned}$$
(8)

where,

$$\begin{aligned} W^{*} \left( K\right) =\sum _{k\le l}\sum _{i=1}^{N}\sum _{j=1}^{N}{e_{ik}e_{jl}w_{ij}\left( \delta _{ij}-{\widetilde{\delta }}_{kl}\right) ^2}, \end{aligned}$$
(9)

and where \({\widetilde{\delta }}_{kl}\) is given by Eq. (4). According to the selection rule proposed by Vera and Macías (2017), the values of \(H^{*}(t)\) are calculated for \(t=1,\dots ,T\), where T is usually predetermined by the investigator. Then, the estimated number of clusters is the smallest value \(K\le T\) such that \(H^{*}(K) \le 5N\).

4 Monte Carlo Experiment

To test the performance of the proposed model, artificially clustered data sets were generated in the unit sphere, with mixtures of the well-known von Mises–Fisher distribution (Banerjee et al. 2005). To this end, we used the rmixvfm function of the Directional R package (v.6.0; Tsagris et al. 2022). All statistical analyses were performed in R v.4.1.3 (R Development Core Team 2023), working on an Intel(R) Core(TM) i5-6200U CPU 2.40 GHz computer with 6 GB of RAM.

The data sets were generated considering a structure of \(K=4,6,8,10\) clusters with equal prior probabilities in the unit sphere. The distribution of points within each cluster was selected considering different concentration indices \(k_{c}=6, 3, 1.5, 0.75\) for the mixtures of von Mises–Fisher distributions, taking into account that the higher the concentration value, the more separated the clusters will be from each other, and the lower this value, the more the clusters will overlap (see, for example, Chevallier et al. 2022 for alternative distributions on the sphere surface). For each combination of the above factors, ten clustered sets of points were generated in the unit sphere, for each of the sizes \(N=50,100,250,500\). Hence, a total of 640 data sets were analysed with the proposed model for the true number of clusters, and the results obtained were compared to the original ones in terms of the classification using the Adjusted Rand Index (ARI) of Hubert and Arabie (1985). The Tucker congruence coefficient (Tucker 1951) was calculated between the true configuration of the cluster centres and the outcome of the Procrustes transformation on the estimated configuration. For both indices, a value close to one indicates good performance.
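The two evaluation measures can be computed as in the following base-R sketch (function names are ours; packaged implementations of the ARI could equally be used, and the congruence coefficient is computed here on the vectorised configurations after an orthogonal Procrustes rotation of centred configurations):

```r
## Adjusted Rand Index (Hubert and Arabie 1985) between two partitions.
adjusted_rand <- function(a, b) {
  tab <- table(a, b); n <- sum(tab)
  sum_ij <- sum(choose(tab, 2))
  sum_a  <- sum(choose(rowSums(tab), 2))
  sum_b  <- sum(choose(colSums(tab), 2))
  expected <- sum_a * sum_b / choose(n, 2)
  (sum_ij - expected) / ((sum_a + sum_b) / 2 - expected)
}

## Orthogonal Procrustes rotation of Y towards X (both assumed centred),
## followed by Tucker's congruence coefficient on the rotated configuration.
procrustes_rotate <- function(X, Y) {
  s <- svd(t(Y) %*% X)
  Y %*% (s$u %*% t(s$v))
}
tucker_cc <- function(X, Y) sum(X * Y) / sqrt(sum(X^2) * sum(Y^2))

## e.g. tucker_cc(X_true, procrustes_rotate(X_true, X_est))
```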

Table 1 shows the values of the ARI, the congruence index and the CPU time in seconds, averaged over each set of ten simulated matrices, according to the sample size and the number of clusters, for the larger concentration index values \(k_{c}=6, 3\), which produce non-overlapping clusters. Almost all of these averaged values indicate excellent performance by the proposed procedure, both in terms of the quality of the classification obtained (ARI) and of recovering the cluster centres (CC). In general, as the size increased and the number of clusters decreased, the model performed better, as expected. In terms of CPU time, the procedure was very efficient for all data sets analysed, with the time increasing as the size of the data set and the number of clusters grew.

Table 1 Average values of ARI, congruence coefficient (CC) and CPU time (T) from simulation experiment with equal probability for each cluster, these being well separated (\(k_{c}=6\)) and moderately separated (\(k_{c}=3\))

Since the performance of K-means clustering is known to decrease for overlapping clusters, we investigated the performance of the proposed procedure when a degree of overlapping was present. Table 2 shows the corresponding averaged results for the lower concentration index values of \(k_{c}=1.5,0.75\), which are related to clusters that are somewhat overlapping. In general, the average ARI values decreased as the number of clusters K increased, and the concentration index \(k_{c}\) decreased, as expected. However, the congruence coefficient (CC index) values remained high, since the average values of dissimilarities between clusters did not seem to be greatly altered by the misclassified points.

Table 2 Average values of ARI, congruence coefficient (CC) and CPU time (T) from simulation experiment with equal probability for each cluster, these being somewhat overlapping (\(k_{c}=1.5\)) and very overlapping (\(k_{c}=0.75\))

We also investigated the performance of the model when the groups are unbalanced, repeating the two previous experiments but now considering unequal probabilities for the simulation of the data. The kth group (\(k=1,\dots , K\)) was weighted using the values \(p_{k}=k/\sum _{k} k\). Tables 3 and 4 show the results. In general, the performance of the procedure was good when the clusters were well separated, with little influence of the differing cluster sizes, while the ARI coefficient became poorer as the degree of overlap increased, as expected.

Table 3 Average values of ARI, congruence coefficient (CC) and CPU time (T) from simulation experiment with unequal probabilities for each cluster, these being well separated (\(k_{c}=6\)) and moderately separated (\(k_{c}=3\))
Table 4 Average values of ARI, congruence coefficient (CC) and CPU time (T) from simulation experiment with unequal probabilities for each cluster, these being somewhat overlapping (\(k_{c}=1.5\)) and very overlapping (\(k_{c}=0.75\))

Finally, the convergence rates of the model were tested for several large data sets, also to analyse the scalability of the model, considering the values \(K=15\), \(k_{c}=2\) and \(N=500,1000,2000,4000,6000\). Figure 2 shows the convergence plots in terms of the normalised stress value given for each main iteration, starting from the second iteration to better appreciate the differences. In terms of iterations, the stress value decreased rapidly for the overall iterative procedure, even for the largest data sets, indicating that most of the work is done in each of the stages of the alternating estimation process. The above experiment was repeated ten times, and the average execution times (rounded) were 14, 42, 189, 1094 and 2932 s, respectively. In terms of CPU time, the cost increases as the data size increases, as expected, although it remains competitive for large data sets. It seems evident that part of the efficiency of the procedure is due to the well-known speed of convergence towards a local minimum of the smacof procedure (see, for example, De Leeuw and Mair 2009), which here is performed for only \(K \ll N\) points.

Fig. 2 Convergence plots for each iteration for the normalised stress of the simulated data sets, considering \(K=6\), \(k_{c}=2\) and \(N=500,1000,2000,4000,6000\)

Fig. 3 Cluster configuration on the sphere for the real data (top panel). The geographical coordinates of the cluster centres (blue) together with the CSS configuration after Procrustes (red) are shown in the bottom panel

5 Illustrative Example

We now analyse the time series data of temperature measured at 500 locations introduced in Sect. 1. We first show the results of our new approach, the CSS model. We then show the results obtained using a two-step procedure that first performs clustering and then plots the cluster centres on the sphere. Finally, we illustrate the fitting behaviour of the spatial deformation estimation procedure using smoothing spline regression between geographic locations and the MDS spherical representation.

To perform MDS, the dissimilarities between time series within the sample were obtained using the cosine correlation-based dissimilarity measure. Considering the Pearson correlation coefficient between two time series \(\textbf{x}_{i}=\{x_{i1},\dots ,x_{iT}\}\) and \(\textbf{x}_{j}=\{x_{j1},\dots ,x_{jT}\}\),

$$\begin{aligned} \rho \left( \textbf{x}_i,\textbf{x}_j\right) = \frac{\sum _{t=1}^{T}\left( x_{it}-{\bar{x}}_i\right) \left( x_{jt}-{\bar{x}}_j\right) }{\sqrt{\sum _{t=1}^{T}\left( x_{it}-{\bar{x}}_i\right) ^2}\sqrt{\sum _{t=1}^{T}\left( x_{jt}-{\bar{x}}_j\right) ^2}}, \end{aligned}$$
(10)

with \({\bar{x}}_i\) and \({\bar{x}}_j\) the averaged temperature values of the respective time series realisations, the dissimilarities are calculated as \(\delta _{ij}=\sqrt{2\left( 1-\rho \left( \textbf{x}_i,\textbf{x}_j\right) \right) }\) (see, for example, Golay et al. 1998; Montero and Vilar 2014).
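For a matrix X whose rows contain the N temperature series (here 500 stations by \(T=12\) monthly values), these dissimilarities can be obtained in a few lines of base R (the function name is ours):

```r
## Correlation-based dissimilarities delta_ij = sqrt(2 (1 - rho_ij)) of Eq. (10).
corr_dissimilarity <- function(X) {
  R <- cor(t(X))                 # Pearson correlations between the rows of X
  sqrt(pmax(2 * (1 - R), 0))     # pmax() guards against tiny negative values
}
```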

The number of clusters was selected from the dissimilarity matrix using the \(H^*\) index (Vera and Macías 2017), for values of \(K=1,2,3,4,...,20\). For each value of K, the partition was estimated without imposing geometric constraints, using the allocation step and the Sokal–Michener dissimilarities between the clusters instead of the Euclidean distances, and the value of Eq. (8) was calculated. The lowest value of the \(H^{*}(K)\) index was found for \(K=15\) clusters (see Vera and Macías 2017, for further details).

In view of these considerations, the CSS model was run for \(K=15\) clusters, obtaining a normalised stress value of \(\sigma ^{2}=0.02235079\). The top panel of Fig. 3 shows the configuration for the cluster centres given by the CSS model, up to a transformation that leaves the geodesic distances invariant. Although related to the geographical location, the CSS configuration is not expected to exactly match the location of the geographic centres of the clusters. The bottom panel of Fig. 3 shows the geographic locations of the cluster centres together with the CSS configuration after applying the Procrustes transformation in three dimensions (\(CC=0.955686\)) and projecting the optimally transformed configuration onto the surface of the sphere. Despite the differences in some locations, in general the spatial relationship of the clusters of temperature time series can be appreciated.

Fig. 4 Time series for groups 1 and 11

The relationship is best appreciated for clusters in which the related stations are in the same hemisphere and therefore the centre is not deformed by curvature. For example, Fig. 4 shows the temperature time series for stations in groups 1 and 11. Although the average geographic location of the station locations in each cluster may be affected by the longitude effect, the averaged time series of both clusters are well related to their geographic centre.

Fig. 5 MDS solution on the sphere using geodesic distances for the 500 real data points, with colours assigned according to the classification provided by the CSS model

Figure 5 shows the MDS-Q solution on the unit sphere surface, obtained using geodesic distances for all the sampling series (without clustering). Different colours and numbers represent the classification provided by the CSS model for \(K=15\). In general, the different cluster structures can be appreciated, some of which are quite close together, while others are sparse, but in general a clear interpretation is difficult due to the large number of points represented. As an example, let us now illustrate how the representation error with or without clustering can influence the estimation of a spatial interpolator.

Mapping between geographic locations and the MDS representation is a widely used procedure which, for example, allows interpolation when the covariance structure, expressed in terms of spatial dispersion, is stationary and isotropic, since the MDS representation only depends on distances. It is particularly useful, for instance, when the aim is to estimate the spatial covariance in a non-stationary process, although this aspect is beyond the scope of this paper (see Sampson and Guttorp 1992 for further details). Denote by \({\textbf {g}}\in \mathcal {S}(1)\) the geographic location of a station, and consider \(0 \le \theta \le 2\pi \) and \(- \pi /2 \le \phi \le \pi /2\), the longitude and the latitude of \({\textbf {g}}\), respectively. Let us denote by \(W_{2}^{2}\) the model space for the spherical spline f of order 2 given by

$$\begin{aligned} W_{2}^{2}=\left\{ f : \left| \int _{\mathcal {S}(1)} f\ \textrm{d}{} {\textbf {g}} \right|< \infty , J(f)<\infty \right\} , \end{aligned}$$
(11)

where \(J(f)=\int _{0}^{2 \pi } \int _{- \pi /2}^{\pi /2} (\Delta f)^{2} \cos (\phi ) \textrm{d} \phi \textrm{d}\theta \), and \(\Delta f\) is the surface Laplacian on the unit sphere (see Wang 2011 for further details). Then, denoting by \(\vartheta _{i}\) and \(\psi _{i}\) the longitude and latitude of a point \({\textbf {x}}_{i}\), \(i=1,\dots ,N\), in the MDS configuration on \(\mathcal {S}(1)\), two smoothing spherical spline regression functions \(f=(f_{\vartheta },f_{\psi })\) are estimated, for the longitude and the latitude values, respectively. For each component \(v=\vartheta ,\psi \), \(f_{v}\) is estimated by minimising

$$\begin{aligned} \frac{1}{N} \sum _{i=1}^{N} (v_{i}- f_{v}({\textbf {g}}_{i}))^2+ \lambda J(f), \end{aligned}$$
(12)

where \({\textbf {g}}_{i}=(\theta _{i},\phi _{i})\) are the geographical coordinates, and \(v_{i}=\vartheta _{i}\) or \(v_{i}=\psi _{i}\), \(i=1,\dots ,N\), for the MDS configuration. We analyse here the location error (LE) in terms of the average of the squared geodesic distances between the interpolated geographic locations and the MDS representation, given by

$$\begin{aligned} LE=\frac{1}{N}\sum _{i=1}^{N}\breve{d}({\textbf {x}}_{i},\hat{{\textbf {x}}}_{i})^2, \end{aligned}$$
(13)

where \(\hat{{\textbf {x}}}_{i}=f({\textbf {g}}_{i})=(f_{\vartheta }({\textbf {g}}_{i}),f_{\psi }({\textbf {g}}_{i}))\) is the image under f of the ith geographical location. The analysis has been performed using the ssr function in the assist R package (v2.1.8; Wang et al. 2022). An LE value of 2.482347 was obtained using the CSS model, compared with 6.11973 when the interpolation process was performed using the MDS spherical representation for the \(N=500\) stations. In addition, we consider the mean squared error (MSE) of the fit for each component (longitude, latitude) after Procrustes, given by \(MSE_{v}=\sum _{i}(v_{i}-f_{v}({\textbf {g}}_{i}))^2/N\). For the longitude component, MSE values of 0.4730863 and 0.8416269 were obtained for the CSS and the spherical MDS (without clustering), respectively, while for the latitude the values were 0.8057067 and 1.262198, respectively. The results obtained show that the performance of the interpolation procedure is better with the CSS model, as expected.
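Once the two spline components have been fitted (e.g. with assist::ssr) and the fitted longitudes and latitudes converted back to points on the sphere, the LE of Eq. (13) and the per-component MSE reduce to simple averages. A base-R sketch (function names are ours; geodesic_dist() is the helper sketched after Eq. (1)):

```r
## Location error (Eq. (13)): X and Xhat are N x 3 matrices on the sphere,
## holding the MDS points and their fitted images under the spline map f.
location_error <- function(X, Xhat, mu = 1) {
  mean(sapply(seq_len(nrow(X)),
              function(i) geodesic_dist(X[i, ], Xhat[i, ], mu)^2))
}

## Per-component MSE of the fit, for v = longitude or latitude.
mse_component <- function(v, fitted_v) mean((v - fitted_v)^2)
```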

Finally, we investigated whether a two-step procedure, consisting of independent classification and representation stages, offered an adequate solution to the problem. Using the correlation-based dissimilarity matrix for the time series at the 500 locations, we first performed classical MDS in full dimension (dimension 499). Then, K-means clustering was performed for \(K=15\) using this configuration (see Vera and Macías 2021). The Sokal–Michener dissimilarity matrix was calculated, and the configuration of the cluster centres was estimated using the representation step described in Sect. 3.2. The normalised value of the total stress (2) given by the two-step procedure was 0.02449717, which is higher than that obtained with the proposed CSS model (0.02235079), as expected. In addition, the normalised value of the first term in (3) was 0.009104791 for the two-step model and 0.006082755 for the CSS model, while the normalised stress values in (7) were 0.01077115 and 0.009900273, respectively, reflecting the good performance of the proposed CSS model.
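Reusing the helper sketches introduced in Sects. 2 and 3, the two-step baseline can be assembled in a few lines of base R. This is a sketch, not the exact code used for the reported values; Delta is the 500 × 500 dissimilarity matrix and unit weights are assumed:

```r
K <- 15
## Step 1: classical MDS in full dimension, then K-means on that configuration
## (cmdscale drops the columns associated with non-positive eigenvalues).
Y_full <- cmdscale(Delta, k = nrow(Delta) - 1)
labels <- kmeans(Y_full, centers = K, nstart = 25)$cluster

## Step 2: Sokal-Michener dissimilarities between the fixed clusters, followed
## by the geodesic MDS-Q representation step of Sect. 3.2.
E    <- make_indicator(labels, K)
sm   <- sokal_michener(Delta, matrix(1, nrow(Delta), nrow(Delta)), E)
Dlt0 <- sm$delta; diag(Dlt0) <- 0               # zero diagonal for cmdscale
fit2step <- geodesic_mdsq(sm$delta, sm$omega, Z = cmdscale(Dlt0, k = 3))
```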

6 Discussion

This paper proposes a spherical-constrained cluster-MDS model for two-way one-mode dissimilarity data using geodesic distances. One of the main advantages of this model is that it enables us to address clustering and representation problems involving a large number of points, particularly on a spherical surface, based on any dissimilarity measure. Instead of a two-step procedure, and using only the information given by a dissimilarity matrix between a set of objects, the proposed model obtains a classification into homogeneous clusters while, simultaneously, the centres of the clusters (rather than the objects themselves) are represented on a spherical surface of optimal radius, using geodesic distances. Hence, for any partition, the dissimilarities are assumed to vary randomly within a cluster, while the corresponding distance is constant within the same cluster, whereas between clusters, differences in distance will reflect the tendency of the corresponding dissimilarities to vary systematically. Furthermore, when the representation of the points in high dimensionality is also known, any measure of dissimilarity can be used, and the proposed procedure, in addition to grouping, facilitates interpretation by preserving the global structure between the groups, unlike other spherical embedding-only procedures.

The parameter estimation is performed in an alternating procedure which consists of a dissimilarity-based assignment step using geodesic distances, and a double representation step in which a ratio transformation is considered in an MDS model for a configuration that is constrained to the sphere surface using geodesic rather than Euclidean distances, in a metric approach. The search for the optimal object classification is formulated using a minimum distance procedure (Heiser and Groenen 1997), together with geodesic distances, which can be seen as a generalised K-means clustering procedure on the dissimilarity matrix in terms of geodesic (instead of Euclidean) distances. For the representation step, the search for the overall optimal radius is performed by introducing a ratio transformation in the primal quadratic constraints MDS algorithm proposed by De Leeuw and Mair (2009), also using an alternating estimation procedure.

The performance of the CSS model was analysed considering 640 artificial clustered data sets, in an extensive Monte Carlo experiment with different sample sizes, numbers of clusters, cluster sizes and degrees of overlap. Among other aspects, we determined the quality of the classification obtained using the ARI (Hubert and Arabie 1985), and the degree to which the true configuration of the cluster centres was recovered using the congruence coefficient (Tucker 1951) after the Procrustes transformation. Regardless of whether the groups were balanced or not, the model showed good performance in all data sets considered with non-overlapping clusters. For somewhat overlapping groups, the ARI values worsened as the number of clusters increased, while for strongly overlapping groups, the ARI values were poor for clusters of both equal and unequal sizes. In all situations, the congruence coefficient values remained high, as the average values of the dissimilarities between the groups did not appear to be greatly altered by misclassified points.

To illustrate the performance of the model for real data sets, we analysed the monthly mean near-surface air temperature for 500 locations, worldwide. The results obtained were compared with those given by a two-step procedure in which first, K-means clustering was performed using the dissimilarity matrix (Vera and Macías 2021), after which the cluster centres were represented using the primal quadratic constraint MDS algorithm proposed by De Leeuw and Mair (2009), and implemented in the smacof package in R (Mair et al. 2021). As expected, our procedure outperformed the two-step algorithm. In addition to facilitating interpretation and reducing the number of parameters, the proposed procedure improves the fit when the interpolation procedure based on spherical spline is used on the centres of the clusters instead of on the complete data set.

The proposed model performs clustering based on any dissimilarity measure, but does not take into account the spatial proximity between objects, which may be a limitation for some practical applications. Clustering with constraints is necessary, for example, when we wish stations and clusters to retain their spatial relationships, and in this situation, additional spatial contiguity constraints are required (see Vera et al. 2008). Another limitation of the model is related to the assumption that the scale of the dissimilarities is metric. In many cases, dissimilarities are measured on an ordinal scale, that is, only the order between dissimilarities is preserved by distances, so in these cases a transformation based on monotonic regression must be considered when approximating them by geodesic distances. This results in a more flexible but imprecise clustering and representation model, more appropriate when it is not required to strictly preserve the global structure, but only the order of dissimilarities across the distances between group positions.

Spherical embedding is an important tool for data analysis in diverse areas of interest. Clustering and/or a planar representation using MDS have been considered for the problem of non-stationary spatial covariance structure estimation (Sampson and Guttorp 1992; Vera et al. 2008, 2009a; Vera et al. 2017). The model we present allows the extension of this procedure when the dissimilarities are determined by the location of the points on the surface of a sphere. This model, together with spatiotemporal processes defined on the sphere surface, is currently being investigated by the authors.