k-Means, Ward and Probabilistic Distance-Based Clustering Methods with Contiguity Constraint

We analyze possibilities of using the contiguity (neighbourhood) matrix as a constraint in clustering performed by the k-means and Ward methods, as well as by an approach based on distances and probabilistic assignments aimed at solving the multi-facility location problem (MFLP). That is, we propose special two-stage algorithms which are kinds of clustering with a relational constraint. They optimize the division of a set of objects into clusters while respecting the requirement that neighbours must belong to the same cluster. In the case of probabilistic d-clustering, a relevant modification of its target function is suggested and studied. An extensive simulation study and an empirical analysis verify the practical efficiency of these methods. The quality of clustering is assessed on the basis of indices of homogeneity, heterogeneity and correctness of clusters, as well as the silhouette index. Using these tools and similarity indices (Rand, Peirce, and Sokal and Sneath), it is shown that probabilistic d-clustering can produce better results than Ward's algorithm. In comparison with the k-means approach, probabilistic d-clustering (although it gives rather similar results) is more robust against the creation of trivial (including empty) clusters and produces less diversified results across replications (in terms of correctness), i.e. it is more predictable from the point of view of clustering quality.


Introduction
Cluster analysis is aimed at efficiently classifying multidimensional objects, i.e. objects described by many statistical features reflecting a composite social or economic phenomenon. … of statistical variables and proposed a three-step procedure to estimate the parameters of such a model. Wagner and Mantaj (2010) analyzed mathematical and practical properties of the neighbourhood matrix in the physical dimension and its applications in taxonomy. LeSage and Pace (2009) studied the usefulness of the spatial econometric Bayesian model which takes the neighbourhood of areas into account. The same researchers (LeSage and Pace (2010)) verified the hypothesis about the sensitivity of the spatial autoregressive (SAR) model to individual specifications and investigated how to interpret its relevant parameters. Młodak (2013) turned his attention to some practical properties of both types of neighbourhood matrix, proposed an efficient method of their comparison, and studied possibilities of estimating the variance of disturbances and spatial lag vectors, as well as of effects and prediction correlation, in CAR and SAR models. A general theory of the physical neighbourhood was also conceptualized there.
Taking the above results into account, it should not be surprising that the importance of neighbourhood has also been appreciated in small area estimation. One can mention in this context a generalization of the Fay-Herriot model taking correlated random effects between neighbouring areas into account, modelled using the SAR process (cf. Singh et al. (2005), Petrucci and Salvati (2006) and Pratesi and Salvati (2008)). Saefuddin et al. (2013) used the SAR and CAR (conditional autoregression) models to extend the hierarchical Bayes method towards improving the quality of poverty estimation in East Java, Indonesia. And, last but not least, both small area estimation models with sub-tools based on the neighbourhood of powiats (Polish NUTS 4 regions) were used by Kubacki and Jędrzejczak (2016) to estimate the average per capita available income of households with the support of data from the Polish tax register (POLTAX) as auxiliary variables. It is also worth recalling that clustering of units is one of the factors deciding the efficiency of stratified sampling (cf. e.g. Zimmer et al. (2012)).
In this paper, we present and study the efficiency of the k-means and Ward methods when a contiguity constraint is imposed, that is, when neighbours are required to belong to the same cluster. To satisfy this condition, a type of clustering algorithm with a relational constraint (cf. Ferligoj and Batagelj (1982), Basu et al. (2008)) is constructed. It consists of two steps. In the first of them, pre-clusters consisting of objects having at least one neighbour are determined. Next, these pre-clusters and the remaining objects (i.e. those which have no neighbour) are finally clustered. In the case of the k-means method, the pre-clusters are represented by their centroids.
We also make an attempt to introduce the factor of neighbourhood (and therefore the additional information which it carries) into probabilistic d-clustering. In the conventional approach formulated by Ben-Israel and Iyigun (2008), the target function of probabilistic assignments and cluster centres is optimized by a specific iterative algorithm based on Weiszfeld's idea (cf. Weiszfeld (1937)). Therefore, this approach is a type of multi-facility location problem (MFLP), which is aimed at the simultaneous optimization of the level of assignment and the location of objects. Here, we propose a modified target function, where the matrix of distances of objects from the cluster centres is multiplied by the corrected neighbourhood matrix (i.e. one taking also the trivial neighbourhood, of an object with itself, into account). We prove that many properties of the Ben-Israel and Iyigun model and its optimal solutions can also be observed (in a suitably adapted form) in this case, especially in terms of asymptotics. Additionally, some original features concerning the relations of the modified method with its conventional variant and restrictions on some parameters are also revealed. In particular, the dual problem (which becomes more complicated in this case) is also presented.
On the basis of this modification, an algorithm for clustering with a contiguity constraint is introduced. It also consists of two steps. First, the objects having some neighbours are grouped using the modified optimization procedure. Each of the resulting clusters is then represented by its centroid, and these "new" objects together with the objects having no neighbours are clustered using the basic probabilistic d-clustering approach. One can emphasize that the proposed algorithms operate not on the original objects but rather on the classes of the contiguity equivalence relation. This method can be useful when a predefinition of neighbourhood is necessary for some practical reasons.
The efficiency of the introduced approaches and a comparison of the impact of the neighbourhood matrix on each of them were verified by special simulation and empirical studies. In the simulation experiment, the number of clusters is arbitrarily established. It results from the pre-assumed groups of data drawn from relevant multivariate normal distributions. We analyze both the balanced variant of such a sample (i.e. the sizes of subsamples generated from the individual multivariate normal distributions are equal) and unbalanced realizations (the subsample sizes are drawn from the multinomial distribution). The impact of adding small and intensive noise to the data was also investigated. The neighbourhood is established here on the basis of the variables represented by some specific dimensions of these distributions (the others are regarded as the model diagnostic variables, which are the basis of clustering). Special indices of homogeneity, heterogeneity and correctness of clusters, as well as the silhouette index, are used to verify their quality. The empirical study is based on current statistical data concerning the labour market in powiats (NUTS 4 regions) of the Wielkopolskie Voivodship (the Polish NUTS 2 region located in the central-western part of the country). The criterion of neighbourhood of two powiats is defined here by the absolute difference between their population densities. Unlike in the simulation study, the number of clusters is established endogenously, by optimizing the index of correctness of clustering: the final set of clusters is the one for which the value of this index is smallest. These two exercises show the utility of the analyzed methods. In particular, it was observed that probabilistic d-clustering produces clusterings of better quality than Ward's approach, is more robust against producing trivial and empty clusters, and gives less diversified results across replications than the k-means approach (although in many cases the results of the probabilistic d-clustering and k-means algorithms are similar). Moreover, the obtained results call into question the appropriateness of the silhouette index in this case.
The paper is organized as follows. First (Section 2), we present the main assumptions of the concept of neighbourhood and the algorithms for the k-means and Ward methods which take the contiguity constraint into account. Next (Section 3), the multi-facility location problem and probabilistic d-clustering are discussed. In Section 4, the modification of the Ben-Israel and Iyigun model obtained by introducing the neighbourhood matrix is discussed, and its properties and the relevant dual problem are investigated. A suitably modified algorithm for this new approach is formulated. In Section 5, we describe the assumptions and results of the simulation study and, in Section 6, the results of the empirical analysis for the Wielkopolskie Voivodship in Poland. Finally, in Section 7, some major concluding remarks are formulated. Two appendices, containing the proofs of some mathematical results of the paper and the results of additional computations, are also included.

Preliminaries and Modified k-Means and Ward Algorithms
In this section, we recall the basic assumptions of cluster analysis and the concept of neighbourhood. We also show how one can modify the k-means and Ward algorithms when the contiguity constraint is imposed.

Basic Notions and Assumptions
Let n be a natural number and Γ = {Γ_1, Γ_2, …, Γ_n} the set of analyzed objects. We assume that these objects are characterized by m (where m is a natural number) features X_1, X_2, …, X_m. Any object Γ_i is then represented by a point γ_i ∈ ℝ^m. The aim of the analysis is to divide Γ into clusters which are maximally internally homogeneous and heterogeneous between themselves. The clusters can be defined e.g. using their centres c_1, c_2, …, c_s (often being their centroids). Let d(a, b) be the distance between vectors a and b in ℝ^m.
In our study, we will use the model of the neighbourhood matrix (also called the contiguity matrix). Such a matrix, of the form W = [w_ij] and size n × n, is defined as

$$w_{ij} = \begin{cases} 1 & \text{if } \Gamma_i \bowtie \Gamma_j, \\ 0 & \text{otherwise}, \end{cases}$$

for every i, j = 1, 2, …, n ("⋈" denotes the relation of neighbourhood, also called adjacency, of objects: the symbol Γ_i ⋈ Γ_j means that Γ_i and Γ_j are neighbours). The matrix W is symmetric with zero diagonal entries. Moreover, the number of entries of W which are equal to 1 is exactly twice the number of pairs of neighbouring objects. In some variants, instead of simple 1's, the neighbourhood matrix may contain levels of intensity of neighbourhood (belonging to (0, 1]), determined e.g. on the basis of the distances of the vectors describing the objects. An object which is not adjacent to any other (i.e. an object Γ_i for which w_ij = 0 for every j ≠ i, j = 1, 2, …, n) is called isolated. If w_ij > 0 for every j ≠ i, j = 1, 2, …, n, then Γ_i will be called familiar, because it is adjacent to every other object. If the objects are spatial areas, the neighbourhood can be defined in a geographical manner: two areas are then neighbours if they have a common border. However, the neighbourhood can also be established on the basis of some auxiliary features carrying information associated with the main subject of clustering. Example 1 illustrates this idea.
Example 1. Assume that n = 10 and that the neighbourhood is determined on the basis of three auxiliary features Q_1, Q_2 and Q_3. The features Q_1, Q_2 and Q_3 are of sufficient variability (their coefficients of variation vary from 62.05 to 401.37%) and are weakly correlated (the strongest correlation is observed between Q_2 and Q_3: Pearson's correlation coefficient amounts to 0.0865 in this case; the remaining ones are 0.0323 (for Q_1 and Q_3) and −0.0101 (Q_1 and Q_2)). The neighbourhood is defined using the Euclidean distances between the points γ_i, i = 1, 2, …, n. We assume that objects Γ_h and Γ_i, h ≠ i, are neighbours if the Euclidean distance between γ_h and γ_i is smaller than 3, h, i = 1, 2, …, 10. From the practical point of view, two manners of normalization of the neighbourhood matrix are promoted in the literature. First, Kelejian and Prucha (1998) use the maximum absolute eigenvalue of the neighbourhood matrix. Let this value be denoted by τ. All entries of the spatial weight matrix are then divided by τ; hence, the normalized matrix has entries equal to 0 or 1/τ. Another approach, suggested by LeSage and Pace (2009) and called row-standardization, is based on standardizing the rows of the neighbourhood matrix. Concretely, the matrix is normalized so that the entries in each row with at least one non-zero entry add up to one, i.e. each entry is divided by the sum of the elements of the row to which it belongs. Such normalization becomes necessary in the case of analysis of CAR models: it has to be performed in order to provide a parameter space for the spatial autoregressive parameters (cf. Młodak (2013)).
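For illustration, the construction of such a matrix and both normalizations can be sketched in a few lines (a minimal Python sketch; the auxiliary data are randomly generated placeholders, since the original table of Q_1-Q_3 values is not reproduced here):

```python
import numpy as np

def contiguity_matrix(Q, threshold):
    """0/1 neighbourhood matrix: objects are neighbours when the Euclidean
    distance between their auxiliary vectors is below the threshold; the
    diagonal stays zero (no trivial neighbourhood)."""
    D = np.linalg.norm(Q[:, None, :] - Q[None, :, :], axis=2)
    W = (D < threshold).astype(float)
    np.fill_diagonal(W, 0.0)
    return W

def row_standardize(W):
    """LeSage-Pace row-standardization: each non-zero row sums to one."""
    sums = W.sum(axis=1, keepdims=True)
    return np.divide(W, sums, out=np.zeros_like(W), where=sums > 0)

def eigenvalue_normalize(W):
    """Kelejian-Prucha normalization by the maximum absolute eigenvalue tau."""
    tau = np.max(np.abs(np.linalg.eigvals(W)))
    return W / tau

# Hypothetical auxiliary data standing in for the lost Q1-Q3 table
rng = np.random.default_rng(0)
Q = rng.normal(size=(10, 3))
W = contiguity_matrix(Q, threshold=3.0)
```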

k-Means Algorithm with Contiguity Constraint
In many practical situations, the neighbourhood is defined on the basis of some additional information connected with the main data used for clustering. Thus, a natural expectation is that neighbours should belong to the same cluster. In this way, a type of clustering with a relational constraint, as considered by Ferligoj and Batagelj (1982) and Basu et al. (2008), is constructed.
Below we show how one can adapt the k-means method to this situation.
The basic feature of this approach in these circumstances is the division of the set of objects into two disjoint subsets: Γ^(NB), containing the objects having at least one neighbour, and Γ^(NN), consisting of the objects which have no neighbour. Of course, Γ = Γ^(NB) ∪ Γ^(NN) and Γ^(NB) ∩ Γ^(NN) = ∅. The modified k-means algorithm then consists of two parts: clustering the objects from Γ^(NB), and clustering the clusters obtained in the first part (represented by their centroids) together with the objects from Γ^(NN). This approach is therefore of the form indicated by Algorithm 1. In Part I, the objects having at least one neighbour are grouped according to the minimax criterion of the distance of neighbouring objects from the centroids of the clusters obtained in the previous iteration (i.e. any two neighbours are included in the cluster whose centroid is closest to them, taking into account the greater of the distances of these two objects from it). In this way, any neighbours will belong to the same cluster. In Part II, the clusters obtained in Part I are represented by their centroids. This collection is extended by the objects having no neighbour (and therefore not considered in Part I), and k-means grouping is conducted on it. For instance, if we have the objects A, B, C, D, E and F with A⋈B and C⋈D, then we first group the objects A, B, C and D (so the structure of their clusters is {{A,B},{C,D}} or {{A,B,C,D}}), and next the k-means clustering is conducted on the collection of these clusters and the objects E and F. Algorithm 1.

Part I
Step 1. Let 1 < s_NB < n_NB be the assumed number of clusters for Γ^(NB) and c^(0)_(NB)1, c^(0)_(NB)2, …, c^(0)_(NB)s_NB their initial centroids. Determine the preliminary clusters (called also the transitive closure) Θ^(0)_(NB)1, Θ^(0)_(NB)2, …, Θ^(0)_(NB)s_NB so that any two neighbours are included in the same cluster.

Step r-th (r = 2, 3, …): assign every group of neighbours to the cluster whose centroid from the previous iteration is closest in the minimax sense and recompute the centroids, where c_(NB)i is the centroid of the i-th cluster Θ_(NB)i obtained finally in Part I, i = 1, 2, …, s_NB. If the change of the centroids between iterations r − 1 and r does not exceed ε, where ε > 0 is the maximum acceptable level of imprecision, then stop, else go to step r + 1.

Part II: Form the collection consisting of the s_NB centroids obtained in Part I and the objects from Γ^(NN), and cluster it into s groups using the conventional k-means algorithm.
We must prove that the relation expressed in the r-th step of Part I of Algorithm 1 is an equivalence relation. It is clear that it is reflexive and symmetric. We now show transitivity. Let, for simplification, … In both cases, this leads to the conclusion that min_l max{u_l, w_l} = max{u_{l*}, w_{l*}}. Thus, in each step of any part of Algorithm 1, the given object or objects is/are assigned to the cluster defined by the nearest centroid. The nearest centroid is determined by minimizing the distance between the object(s) and all analyzed centroids. One can ask how to treat the situation when two or more centroids are nearest to the same object(s) (i.e. there is more than one centroid whose distance from the given object(s) is the same and smallest). In such a situation, the object(s) is/are assigned to only one of these clusters. In practice, if the algorithm is implemented in a software programming language, it assigns such object(s) to the cluster which was first found to be nearest. In consequence, the final clusters are disjoint.
The initial centroids can be defined randomly, taking the empirical distribution of the diagnostic variables into account. The simulation and empirical studies (Sections 5 and 6) showed that this is a good solution: owing to the applied maximum level of imprecision, Algorithm 1 converges. Such convergence may, of course, simply mean that there are no further changes after a finite number of steps.
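A minimal sketch of Part I of the algorithm (assuming Euclidean distances; `groups` stands for the transitive-closure groups of neighbours, ties are broken by taking the first nearest centroid, and Part II amounts to running ordinary k-means on the returned centroids together with the isolated objects):

```python
import numpy as np

def minimax_kmeans(points, groups, s, eps=1e-3, max_iter=100, rng=None):
    """Part I of the constrained k-means: `groups` is a list of index arrays,
    each collecting mutually neighbouring objects (the transitive closure).
    A whole group is assigned to the centroid minimizing the *maximum*
    distance of its members from that centroid (minimax criterion)."""
    rng = rng or np.random.default_rng()
    centroids = points[rng.choice(len(points), size=s, replace=False)]
    for _ in range(max_iter):
        labels = np.empty(len(groups), dtype=int)
        for g, idx in enumerate(groups):
            # distance of each centroid to the farthest member of the group
            dmax = np.linalg.norm(points[idx, None, :] - centroids[None, :, :],
                                  axis=2).max(axis=0)
            labels[g] = int(dmax.argmin())   # first nearest centroid wins ties
        new_centroids = centroids.copy()
        for k in range(s):
            members = np.concatenate(
                [np.asarray(groups[g]) for g in range(len(groups))
                 if labels[g] == k] or [np.array([], dtype=int)])
            if members.size:                 # empty clusters keep old centroid
                new_centroids[k] = points[members].mean(axis=0)
        if np.linalg.norm(new_centroids - centroids) < eps:
            break
        centroids = new_centroids
    return labels, centroids
```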
The quality of such clustering can be assessed using the indices of homogeneity, heterogeneity and correctness of clusters (cf. e.g. Młodak (2006)). Especially useful in this context are the indices of this class based on the median of distances between objects; they are robust to incidental outliers within and between clusters. The index of homogeneity of the l-th cluster Θ_l is based on the median of the distances between the objects belonging to it (formula (1)), and the final aggregated index of homogeneity hm is obtained by aggregating these values over all clusters. The smaller the index hm, the more homogeneous the internal structure of the clusters. The heterogeneity of a cluster in relation to the remaining ones is, analogously, measured via the median of the distances between its objects and the objects of the other clusters (formula (2)); the aggregation of these indices leads to the final assessment of heterogeneity ht. The greater the index ht, the more heterogeneous the obtained clusters. The index of correctness reflects the ratio of homo- and heterogeneity of clusters: cr = hm/ht (formula (3)). Smaller values of cr indicate better correctness of clusters, i.e. that they are more homogeneous and more distinct from one another. Clustering can be regarded as efficient if cr < 1.
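A sketch of these median-based measures, under the assumption (one plausible reading of (1)-(3)) that hm_l is the median of within-cluster pairwise distances, ht_l the median of the distances from the objects of Θ_l to the objects outside it, and that both are aggregated by the median over clusters:

```python
import numpy as np

def _pairwise(A, B):
    """Euclidean distances between the rows of A and the rows of B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

def correctness_index(points, labels):
    """cr = hm / ht (formula (3)); hm_l and ht_l are medians of distances,
    aggregated here by the median over clusters (an assumed aggregation)."""
    hm, ht = [], []
    for l in np.unique(labels):
        inside, outside = points[labels == l], points[labels != l]
        if len(inside) > 1:
            D = _pairwise(inside, inside)
            hm.append(np.median(D[np.triu_indices(len(inside), k=1)]))
        if len(outside):
            ht.append(np.median(_pairwise(inside, outside)))
    return np.median(hm) / np.median(ht)
```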
The construction of the index (3), based on (1) and (2), is a proposal slightly different from many indices used for the evaluation of clustering algorithms. One can recall here the following known measures of clustering quality. The Dunn index (cf. Dunn (1974)) is based on the dissimilarity between clusters and their maximum diameter, i.e.

$$DU = \min_{k=1,2,\dots,s} \; \min_{\substack{l=1,2,\dots,s \\ l \neq k}} \frac{\delta(\Theta_k, \Theta_l)}{\max_{r=1,2,\dots,s} \operatorname{diam}(\Theta_r)},$$

where δ(Θ_k, Θ_l) is the smallest distance between an object of Θ_k and an object of Θ_l, diam(Θ_r) is the largest distance between two objects of Θ_r, and ‖•‖ is the Euclidean norm. The main disadvantage of DU is that, as observed e.g. by Kamaruddin and Ravi (2020), it is strongly affected by noisy data and outliers. The index cr overcomes these problems. The Davies-Bouldin measure (cf. Davies and Bouldin (1979)) was created using the distance between cluster centroids and the sum of their diameters, i.e.

$$DB = \frac{1}{s} \sum_{k=1}^{s} \max_{l \neq k} \frac{S_k + S_l}{d(c_k, c_l)}, \quad \text{where } S_r = \frac{1}{n_r} \sum_{\gamma_i \in \Theta_r} \lVert \gamma_i - c_r \rVert$$

and n_r is the number of objects in Θ_r, r = 1, 2, …, s.
This measure assesses the differences between clusters only via the distance between their centroids, which can sometimes neglect the diversity of distances between the individual objects belonging to different clusters. The index (3) largely helps to avoid this inconvenience.
The silhouette index (cf. Kaufman and Rousseeuw (2005)) is a function of a_i, the average distance of object Γ_i from all other objects in the same cluster, and b_i, the smallest average dissimilarity between Γ_i and the objects of any other cluster to which it does not belong, i = 1, 2, …, n:

$$SI = \frac{1}{n} \sum_{i=1}^{n} \frac{b_i - a_i}{\max\{a_i, b_i\}}. \qquad (4)$$

This index can be used to estimate only the first choice or the best partition and is therefore not useful when intermediate clusters contained in the target one occur. The index (3) seems better from this point of view.
There are also some other interesting proposals in this context. For example, the Maulik-Bandyopadhyay index (Maulik and Bandyopadhyay (2002)) is based on the relation of the intra-cluster distance for the trivial clustering (i.e. when the set of objects Γ forms the only cluster) to that of the current clustering, and on the inter-cluster distance defined as the maximal distance between cluster centroids. However, its construction depends on a subjectively chosen parameter. On the other hand, Saitta et al. (2007) proposed an index where a between-class distance (a sum of weighted distances between the centroid of each individual cluster and the centroid of all the clusters) and a within-class distance (a sum of average distances of objects from the centroids of the relevant clusters) are used. In both cases, the inter-cluster distance may not fully reflect the diversity between clusters. Moreover, the intra-cluster variance is sensitive to incidental outliers within clusters and hence could be overestimated.
As one can see, the index (3) makes it possible to avoid, or at least substantially reduce, many of the aforementioned drawbacks of other indices. That is, it is based on distances between pairs of objects, which makes it possible to take the internal diversity of clusters and the differences between them fully into account. On the other hand, the use of the median of distances robustifies the quality assessment against incidental outliers in this respect. For comparison, both indices (3) and (4) are used in the simulation studies described in Section 5. Moreover, the proposal allows for an easy separation of the intra- and inter-cluster components of the correctness of clustering (i.e. homo- and heterogeneity, respectively) and for a clear recognition of the contribution of each of these aspects to the total value of the final index.
The numbers of clusters in each part of Algorithm 1, i.e. s_NB and s, can be established arbitrarily. If there is no other obstacle, it is convenient to assume that s_NB = s. This number can, however, alternatively be optimized endogenously. That is, we put s = 2, 3, … and, after finishing Algorithm 1, we compute the index of correctness (3). As the final number of clusters we take the s for which the value of (3) is smallest. The drawback of this approach is that it can be time-consuming and demanding on software capabilities when the basic dataset is large.
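The endogenous choice can be sketched as a simple search (the clustering and scoring routines are passed in as parameters; `cluster_fn` and `score_fn` are hypothetical stand-ins for Algorithm 1 and the index (3)):

```python
def choose_s(points, groups, cluster_fn, score_fn, s_max=10):
    """Endogenous choice of the number of clusters: run the constrained
    clustering for each candidate s and keep the s minimizing the
    correctness index (3) returned by score_fn."""
    results = ((s, score_fn(points, cluster_fn(points, groups, s)))
               for s in range(2, s_max + 1))
    return min(results, key=lambda t: t[1])   # (optimal s, its cr value)
```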

Contiguity Constraint in the Ward Method
The Ward method (cf. Ward (1963)) is a type of hierarchical clustering algorithm. In its conventional form, it starts from the analyzed objects as singletons, i.e. from the collection of clusters {{Γ_1}, {Γ_2}, …, {Γ_n}}. These n clusters are combined in successive steps until one cluster containing all investigated objects is obtained. At each step, the pair of clusters whose merger minimizes the increase in the total within-group sum of squared error is merged. This increase is expressed as the sum of squared differences between the means of particular variables for these clusters, divided by the sum of the reciprocals of their cardinalities. In practice, to obtain an optimal clustering, the procedure stops when the distance between merged clusters exceeds an arbitrarily established threshold. When the Ward clustering is constrained by the restriction that neighbours should belong to the same cluster, it can be modified to the form presented in Algorithm 2.
Step 1. Create the transitive closure Θ_(NB)1, Θ_(NB)2, …, Θ_(NB)s_NB in such a way that any two neighbours are included in the same cluster. For instance, if we have five objects A, B, C, D and E with A⋈B, C⋈D and B⋈E, then A, B and E will belong to one cluster and C and D to a second one. In this way, all objects between which there exists a direct or indirect contiguity connection will belong to the same cluster. The transitive closure then consists of disjoint clusters.
Step 2. Set the starting collection as the clusters of the transitive closure together with the objects having no neighbours, treated as singleton clusters.
Step r-th (r = 3, 4, …): merge the two clusters which are closest according to the distance between their centres; if this minimum distance exceeds d*, where d* > 0 is an arbitrarily established level of precision, then the algorithm stops, else go to step r + 1.
The distance d is defined here, following the Ward criterion, as

$$d\left(\Theta_u^{(r)}, \Theta_v^{(r)}\right) = \frac{\left\lVert c_u^{(r)} - c_v^{(r)} \right\rVert^2}{1/n_u^{(r)} + 1/n_v^{(r)}},$$

where n_u^{(r)} denotes the cardinality (i.e. the number of objects) of the cluster Θ_u^{(r)}. Several issues connected with Algorithm 2 require more detailed explanation. One can imagine a situation where all the objects belong to one cluster already in the transitive closure (Step 1). However, this case occurs very rarely in practice. The assumptions of efficient cluster analysis (first of all, a sufficiently high coefficient of variation of the diagnostic features) impose a clear diversification of the analyzed objects from the point of view of the investigated composite phenomena, so the combination of all the objects into one cluster is practically impossible. On the other hand, due to the specific pre-clustering based on the neighbourhood of objects according to some associated phenomenon, conducted in Step 1, and the hierarchical clustering made in steps 3, 4, …, this method has a tendency to generate large final clusters. This can be observed e.g. in the results presented in Section 6. Thus, Algorithm 2 does not "inherit" the inclination to generate clusters of similar size, which is one of the immanent (and not always beneficial) properties of the conventional Ward method (cf. e.g. Everitt et al. (2011)). Of course, it may happen that at step r the minimum distance is achieved for two or more pairs of clusters from step r − 1. However, at each step only one of these pairs is merged (a program usually takes the first optimal pair found).
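The transitive closure used in Step 1 of Algorithm 2 (and in Part I of Algorithm 1) can be computed with a standard union-find structure; a minimal sketch:

```python
def transitive_closure(W):
    """Groups of objects connected by direct or indirect contiguity, derived
    from a 0/1 neighbourhood matrix W (isolated objects come out as
    singletons)."""
    n = len(W)
    parent = list(range(n))

    def find(i):                      # path-compressing root lookup
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if W[i][j]:               # neighbours -> same component
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```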
The threshold d* can be established in various ways, e.g. entirely exogenously or on the basis of descriptive statistics of the distances between the clusters merged at particular stages over the whole algorithm (i.e. when steps 2, 3, …, n* − 3, leading to the complete set Γ, are conducted). For instance, Mojena's rule (Mojena (1977)) establishes such a threshold as $d^* = \bar{d} + c\, s_d$, where $\bar{d}$ is the mean of the minimum distances, s_d their standard deviation, and c some constant, usually 2.5 < c < 3. The quality of the clustering can be assessed using the indices (1), (2) and (3). The index (3) can also be used as the optimization criterion of the algorithm, i.e. we regard as final the clustering for which the index (3) is smallest.
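Mojena's threshold itself is a one-liner over the merge distances recorded during the agglomeration (a sketch; `merge_distances` is assumed to collect the minimum inter-cluster distances from the successive steps):

```python
import numpy as np

def mojena_threshold(merge_distances, c=2.75):
    """Mojena's stopping rule: d* = mean + c * std of the merge distances
    observed over the whole agglomeration (2.5 < c < 3 is typical)."""
    d = np.asarray(merge_distances, dtype=float)
    return d.mean() + c * d.std(ddof=1)
```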

Probabilistic d-Clustering
Probabilistic d-clustering,3 proposed by Ben-Israel and Iyigun (2008), is based on the optimization of the probabilities of assignment of objects to clusters defined by their centroids, leading to the best clustering.
Formally, we will consider d(γ_i, c_k), the distance between the object Γ_i and the centre c_k of the cluster Θ_k, and p(Γ_i, Θ_k), the probability of assignment of the object Γ_i to the cluster Θ_k. For simplicity, we will also use in this context the shorter symbols d_ik and p_ik, i = 1, 2, …, n, k = 1, 2, …, s. According to the conventional definition of probability, two basic conditions hold:

$$0 \le p_{ik} \le 1 \quad \text{for every } i = 1, 2, \dots, n, \; k = 1, 2, \dots, s \qquad (5)$$

and

$$\sum_{k=1}^{s} p_{ik} = 1 \quad \text{for every } i = 1, 2, \dots, n. \qquad (6)$$

3 The term "d-clustering" is an abbreviation of distance-clustering.
Let p_k = (p_1k, p_2k, …, p_nk), k = 1, 2, …, s. Ben-Israel and Iyigun (2008) proposed a clustering method aimed at the minimization of the target function

$$f(p_1, p_2, \dots, p_s) = \sum_{k=1}^{s} \sum_{i=1}^{n} p_{ik}^2\, d_{ik} \qquad (7)$$

subject to the conditions (5) and (6) and the requirement that the probabilistic assignment be inversely proportional to the distance of an object from the relevant cluster,4 i.e.
$$p_{ik}\, d_{ik} = \alpha_i, \qquad (8)$$

where α_i is some constant depending only on i, i = 1, 2, …, n, k = 1, 2, …, s. It was proved that, under these assumptions, the optimal probability of assignment of Γ_i to the cluster Θ_k is of the form

$$p_{ik} = \frac{1/d_{ik}}{\sum_{l=1}^{s} 1/d_{il}} \qquad (9)$$

for every i = 1, 2, …, n and k = 1, 2, …, s.
In the target function (7), instead of the squares of probabilities (i.e. p_ik²), a generalized model based on a power with exponent u ≥ 1 can be used; the optimal assignments then take the form

$$p_{ik} = \frac{(1/d_{ik})^{1/(u-1)}}{\sum_{l=1}^{s} (1/d_{il})^{1/(u-1)}}, \quad i = 1, 2, \dots, n, \; k = 1, 2, \dots, s. \qquad (10)$$

On the basis of this result, Ben-Israel and Iyigun (2008), inspired by Weiszfeld's approach, suggested an algorithm iterating the centroids of clusters and the optimal probabilistic assignments of objects to these clusters. Its scheme is given in Algorithm 3.

Algorithm 3.
Step 1. We assume arbitrarily the initial forms of the centres c_1^(0), c_2^(0), …, c_s^(0) and the exponent u_0 ≥ 1, as well as the increase of the exponent, Δ. We put r ≔ 0.
Step 2. Assuming that c_1^(r), c_2^(r), …, c_s^(r) are given, we determine the optimal probabilistic assignments according to formula (10) and update the exponent: u⁺ ≔ u + Δ.
Step 3. On the basis of the results obtained in Step 2, we determine the optimal centres c_1^(r+1), c_2^(r+1), …, c_s^(r+1) for which the value of the right-hand side of (7) is minimal. This minimization, according to the proposal of Ben-Israel and Iyigun (2008), is conducted with respect to c_k in d(γ_i, c_k) = d_ik, with the values of p_ik calculated according to (9) using c_1^(r), …, c_s^(r) and assumed to be fixed. That is, the centres are approximated using the formula

$$c_k^{(r+1)} = \frac{\sum_{i=1}^{n} \frac{p_{ik}^2}{d_{ik}}\, \gamma_i}{\sum_{i=1}^{n} \frac{p_{ik}^2}{d_{ik}}}, \quad k = 1, 2, \dots, s.$$

Step 4. We repeat steps 2-3, putting r ≔ r + 1, until the change of the centres between consecutive iterations is smaller than ε, where ε is an arbitrarily established level of precision (e.g. ε = 0.0001).
Finally, each object is assigned to the cluster for which its probability of assignment is greatest. In the case of ambiguity (i.e. if there are two or more clusters for which the probability of assignment is equally maximal), the object is assigned to only one of these clusters. The obtained clusters are then disjoint.
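A compact sketch of Algorithm 3 for u = 2 and Euclidean distances (the small constant guarding against zero distances is an implementation detail added here, not part of the original formulation):

```python
import numpy as np

def d_clustering(points, s, eps=1e-4, max_iter=200, tiny=1e-12, rng=None):
    """Probabilistic d-clustering (Ben-Israel and Iyigun, 2008), u = 2.
    Alternates the closed-form assignments (9) with Weiszfeld-type centre
    updates until the centres stabilize."""
    rng = rng or np.random.default_rng()
    centres = points[rng.choice(len(points), size=s, replace=False)]
    for _ in range(max_iter):
        d = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2) + tiny
        p = (1.0 / d) / (1.0 / d).sum(axis=1, keepdims=True)   # formula (9)
        w = p**2 / d                                           # Weiszfeld weights
        new_centres = (w[:, :, None] * points[:, None, :]).sum(axis=0) \
                      / w.sum(axis=0)[:, None]
        converged = np.linalg.norm(new_centres - centres) < eps
        centres = new_centres
        if converged:
            break
    labels = p.argmax(axis=1)   # defuzzyfication: most probable cluster
    return labels, centres, p
```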
In fact, the subject of this study is the so-called multi-facility location problem (MFLP), consisting of the two following aspects:
• the assignment problem: having s centres c_1, c_2, …, c_s, assign each of the n objects to the cluster with the nearest centre;
• the location problem: having s clusters determined as a solution of the assignment problem, find the centre (i.e. centroid) of each cluster, using a relevant algorithm.
Probabilistic d-clustering resembles the fuzzy c-means method proposed by Dunn (1973) and Bezdek (1981). However, the target function is given in that case as

$$f(q_1, q_2, \dots, q_s) = \sum_{k=1}^{s} \sum_{i=1}^{n} q_{ik}^{u}\, d_{ik}^2, \qquad (11)$$

where q_k = (q_1k, q_2k, …, q_nk) and q_ik is the degree of membership of the object Γ_i in the cluster Θ_k. It is assumed that q_ik satisfy probability conditions analogous to (5) and (6). Function (11) achieves its minimum when

$$q_{ik} = \frac{(1/d_{ik}^2)^{1/(u-1)}}{\sum_{l=1}^{s} (1/d_{il}^2)^{1/(u-1)}}, \quad i = 1, 2, \dots, n, \; k = 1, 2, \dots, s.$$

The target function is thus based on the squared distances of objects from the cluster centroids (instead of simple distances, as in (7)). Moreover, the degree of membership q_ik is raised to some arbitrarily established power u. Algorithm 3 applies here analogously. It is worth noting that the optimization of q_ik is impossible when u = 1; thus, in practice, u ∈ (1, ∞). If u = 3, then q_ik = p_ik for every i = 1, 2, …, n and k = 1, 2, …, s. On the other hand, if u = 2, the Dunn-Bezdek target function (11) seems most similar to the Ben-Israel-Iyigun target function (7). The latter method also seems to yield more interpretable results than the former.
However, the Ben-Israel and Iyigun approach has some practical drawbacks. Firstly, the use of squares of probabilities in (7) is hard to interpret in practical applications. Secondly, formula (8) assumes a dependence of p_ik only on the distance of the i-th object from the k-th cluster. But in practice, the explanatory variables describing a given social or economic composite phenomenon are mutually connected (independently of their formal connections, such as e.g. correlation), for instance logically or by the type of information. Thus, the assignment of a given object to a given cluster depends to some extent on the assignments of the remaining objects to this and other clusters. The conditions (5) and (6) seem to confirm this statement. Moreover, note that

$$\frac{\partial f}{\partial p_{ik}} = 2\, p_{ik}\, d_{ik} \qquad (12)$$

for every i = 1, 2, …, n, k = 1, 2, …, s. From (6), we have ∑_{k=1}^{s} p_ik = 1, i = 1, 2, …, n. Using the Lagrange multipliers method and taking (12) into account, we obtain

$$2\, p_{ik}\, d_{ik} = \lambda_i, \qquad (13)$$

where λ_i are the Lagrange multipliers, i = 1, 2, …, n. Thus, from (13), we have p_ik = λ_i/(2 d_ik) and, from (6), $\lambda_i = 2 / \sum_{l=1}^{s} (1/d_{il})$. After determining λ_i and substituting it into the first of these equalities, we obtain (9). Thus, condition (8) is redundant and, in consequence, it is sufficient to restrict attention to the conditions (5) and (6).

The Neighbourhood Matrix in Clustering Using the Ben-Israel-Iyigun Method
Now we will show how one can exploit the additional information carried by the contiguity matrix in probabilistic d-clustering. To introduce the neighbourhood of objects into the cluster analysis, we should remember that any object can be perceived as its own neighbour (which is called the trivial neighbourhood). This is important when the distances between objects and cluster centres are multiplied by the entries of the contiguity matrix: the distance of a given object from the centres of clusters cannot be ignored. Therefore, in the further study the trivial neighbourhood has to be taken into account. To do so, we will apply an extended neighbourhood matrix, i.e. the n × n matrix W* = W + I_n (so that w*_ii = 1 and w*_ih = w_ih for i ≠ h). In this case, our problem is to minimize the target function

$$f(p_1, p_2, \dots, p_s) = \sum_{k=1}^{s} \sum_{i=1}^{n} p_{ik}^2 \sum_{h=1}^{n} w^*_{ih}\, d_{hk} \qquad (14)$$

with respect to p_1, p_2, …, p_s, together with the relevant optimization of the cluster centroids. The optimal values of the probabilities can be obtained analogously as for (9), i.e.

$$p_{ik} = \frac{1 \big/ \sum_{h=1}^{n} w^*_{ih}\, d_{hk}}{\sum_{l=1}^{s} \left( 1 \big/ \sum_{h=1}^{n} w^*_{ih}\, d_{hl} \right)}. \qquad (15)$$
It is not difficult to see that if Γ_i is an isolated object, then p_ik given by (15) is not greater than the probabilistic assignment of this object to the k-th cluster in the conventional form (9). Additionally, p_ik in (15) is equal to 1 if and only if γ_i = c_k and the object Γ_i is isolated.
In the case of optimizing the cluster centres when the probabilistic assignments are given, we use the target function with the centres as its arguments, i.e.:

$$f(c_1, c_2, \dots, c_s) = \sum_{k=1}^{s} \sum_{i=1}^{n} p_{ik}^2 \sum_{h=1}^{n} w^*_{ih}\, d(\gamma_h, c_k). \qquad (16)$$

If d is the Euclidean distance and ‖•‖ denotes the Euclidean norm, then

$$\nabla_{c_k} f = -\sum_{i=1}^{n} \tilde{\eta}_{ik}\, \frac{\gamma_i - c_k}{\lVert \gamma_i - c_k \rVert}, \quad \text{where } \tilde{\eta}_{ik} = \sum_{h=1}^{n} w^*_{hi}\, p_{hk}^2. \qquad (17){-}(18)$$

Assuming that c_k ∉ {γ_1, γ_2, …, γ_n} and using the fact that the right-hand side of (18) should be zero, we obtain

$$c_k = \frac{\sum_{i=1}^{n} \dfrac{\tilde{\eta}_{ik}}{\lVert \gamma_i - c_k \rVert}\, \gamma_i}{\sum_{i=1}^{n} \dfrac{\tilde{\eta}_{ik}}{\lVert \gamma_i - c_k \rVert}}, \quad k = 1, 2, \dots, s. \qquad (19)$$

If in the functions (14) and (16) the squares of probabilities are replaced by any normalized power with exponent u ≥ 1, i.e. by $p^{(u)}_{ik} = p^u_{ik} \big/ \sum_{k=1}^{s} p^u_{ik}$, then formulas (15) and (19) are modified accordingly (formula (20)). Formula (19) can serve as a basis for the iteration of the cluster centroids. The equalities (18) and (19) induce mappings T_k sending the current centres into the updated ones. The gradient (17) is undefined if c_k is one of the points γ_1, γ_2, …, γ_n. To obtain a universal gradient, we modify the definition given by Kuhn (1967, 1973) as in formula (21). Formula (21) is equivalent to the standard form (e.g. to (17) when u = 2) if the centroid does not coincide with any data point. Otherwise, the gradient is based on the prior expression in which the sum is taken only over the points different from the point coinciding with the centroid. Such a "trimmed" sum is next normalized by its length. The final value of the gradient is the product of this value and the difference between the aforementioned length and the sum of the probabilities of assignment to the given cluster of the neighbours of the object represented by the point identical with the centroid, if this difference is positive, and zero otherwise.
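Under the reading of (14), (15) and (19) given above, the modified iteration differs from Algorithm 3 only in that the distances entering the assignments are aggregated over the extended neighbourhood matrix; a plausible sketch, not the authors' implementation:

```python
import numpy as np

def d_clustering_contiguity(points, W, s, eps=1e-4, max_iter=200,
                            tiny=1e-12, rng=None):
    """Probabilistic d-clustering with contiguity information: distances are
    aggregated over W* = W + I before the assignments (15), and the centres
    are moved by the Weiszfeld-type update induced by (19)."""
    rng = rng or np.random.default_rng()
    n = len(points)
    W_star = np.asarray(W, dtype=float) + np.eye(n)
    centres = points[rng.choice(n, size=s, replace=False)]
    for _ in range(max_iter):
        d = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2) + tiny
        dn = W_star @ d                          # (W* D)_ik = sum_h w*_ih d_hk
        p = (1.0 / dn) / (1.0 / dn).sum(axis=1, keepdims=True)   # formula (15)
        eta = W_star.T @ p**2                    # weights eta~_ik from (19)
        w = eta / d
        new_centres = (w[:, :, None] * points[:, None, :]).sum(axis=0) \
                      / w.sum(axis=0)[:, None]
        converged = np.linalg.norm(new_centres - centres) < eps
        centres = new_centres
        if converged:
            break
    return p.argmax(axis=1), centres, p
```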
In the conventional case (cf. Iyigun and Ben-Israel (2013)), the minimization of (7) is a probabilistic approximation of the problem of minimizing the function

$$\mu(c_1, c_2, \dots, c_s) = \sum_{i=1}^{n} \omega_i \min_{k=1,2,\dots,s} d(\gamma_i, c_k), \qquad (22)$$

where ω_i, i = 1, 2, …, n, are some weights of the objects Γ_1, Γ_2, …, Γ_n, respectively. Thus, the minimal value of (22), denoted here by μ, is obtained by assigning the object Γ_i to the cluster with its nearest centre (23). In the analyzed situation, when the neighbourhood matrix W* is used, such minimization is an extension of (23) with this additional information introduced. That is, we minimize the function

$$\mu_{W^*}(c_1, c_2, \dots, c_s) = \sum_{i=1}^{n} \min_{k=1,2,\dots,s} \sum_{h=1}^{n} w^*_{ih}\, d(\gamma_h, c_k)$$

and obtain its minimal value μ_{W*}(c_1, c_2, …, c_s) accordingly. Assume that the distance d(•, •) is convex with respect to each argument. Then min_{k=1,2,…,s} d(•, c_k) is also a convex function.5 Hence, from Jensen's inequality, we obtain the lower bound (24), where ξ_i denotes the number of neighbours of the object Γ_i and a_i is some constant depending on the way the neighbourhood matrix is normalized. If W is not normalized, then a_i = 1. If the normalization is conducted using the maximal absolute eigenvalue τ of this matrix, then a_i = 1/τ. Finally, if W is row-standardized, then a_i = 1/∑_{h=1}^{n} w_hi, i = 1, 2, …, n. The right-hand side of the inequality (24) is thus the value of the function μ with ω_i = (a_i ξ_i + 1)/n, where γ_i is replaced by the centroid of the set of points consisting of γ_i and its neighbours, i = 1, 2, …, n.
On the other hand, using the Schwarz inequality, we obtain the upper bound (25). The last equality follows from the fact that summation over the neighbours of Γ_i (including Γ_i itself) is equivalent to summation over those h for which w*_ih = 1 and, because w*_ih = 0 if i ≠ h and Γ_i is not a neighbour of Γ_h, to summation over all w*_ih. In this case, the right-hand side of the inequality is the value of the function μ increased by the relevant values for the neighbours of each object Γ_i. The inequalities (24) and (25) provide lower and upper bounds for the minimum of the target function in terms of the numbers of neighbours of the objects and the minimal distances of the relevant points (the centroids of the sets consisting of objects and their neighbours in the former case, the objects themselves in the latter) from the nearest centre.
Assume that the distance d(•, •) between points is calculated using the standard Euclidean formula. Similarly as in the conventional case, the following theorem holds.
Theorem 1. a) For every set of points c_1, c_2, …, c_s ∈ ℝ^m, the condition R_k(c_k) = 0 (with R_k defined by (21)) for every k = 1, 2, …, s is sufficient and necessary for the points c_1, c_2, …, c_s to minimize the function f defined by (16).
b) The optimal centres of clusters c*_1, c*_2, …, c*_s lie in the convex hull of the points γ_1, γ_2, …, γ_n. c) If c*_k is the optimal centre of the k-th cluster, then it is a fixed point of the mapping T_k, i.e. T_k(c*_k) = c*_k.
This theorem shows that the version of probabilistic d-clustering using the neighbourhood matrix is in some sense a generalization of the conventional version of this approach described in Section 3. The proof of Theorem 1 is presented in Appendix 1.
The clustering is conducted according to a procedure with a scheme similar to those described in subsections 2.2 and 2.3, but using Algorithm 3 and some properties of the proposed modification based on the neighbourhood of objects. This procedure is described in Algorithm 4. Algorithm 4.

Part I
Step 1. Determine the preliminary clusters of the objects in Γ^(NB): Θ_i and Γ_(NB)k are placed together when they are neighbours, ambiguous cases being resolved by the smaller probability of assignment. The probabilities of assignment are determined by (15). Next, the centres of the preliminary clusters are computed.

Step r-th (r = 2, 3, …): update the probabilities of assignment for every i, k ∈ {1, 2, …, n_NB} (where the probabilities for the clusters k = 1, 2, …, s_NB are also computed by (15)) and determine the centres of these clusters; if the change of the centres is smaller than ε, where ε > 0 is an arbitrarily established level of precision, then stop, else go to step r + 1. The defuzzyfication of clusters is similar to that applied in Algorithm 3, i.e. each object (or group of neighbouring objects) is assigned to the group for which its probability of assignment is greatest. Of course, at each step of Algorithm 4 a situation is possible where, for the same object(s), two or more clusters are nearest according to the computed probabilities of assignment (i.e. the probabilities for these clusters are equal and maximal). Also in this case the object(s) is (are) assigned to only one of these clusters.
The basic scheme of the idea of Algorithm 4 is presented in Fig. 1. The objects and the neighbourhood matrix are the same as in Example 1.

The clustering is conducted using diagnostic features different from those used to define the pairs of neighbours. The dashed lines denote the scope of the current clusters.

Simulation Study
We will now investigate the practical differences and similarities between the analyzed clustering algorithms with contiguity constraint. Our study is based on simulated data combined from random samples produced using some types of multivariate normal distribution. That is, we created a data set in ℝ^7 with n = 100 objects. The data were simulated from four multivariate normal distributions N(μ_j, Σ_j), j = 1, 2, 3, 4. Thus, the analyzed data set was combined from four independent subsamples, one from each of these distributions. In this way, each object is described by seven (m = 7) artificial variables. The choice of the parameters of these distributions was motivated by several premises. Firstly, we assumed that some of the dimensions are diagnostic variables constituting the basis of clustering (i.e. they can be regarded as describing some composite phenomenon), while the rest of them illustrate some associated, but different, field which is used to define the neighbourhood of the analyzed objects. In our case, the first four dimensions reflect four diagnostic variables, and the remaining three are supporting variables used to determine the contiguity matrix. Hence, the variables from the former group are assumed to be correlated with the variables belonging to the latter. Secondly, to ensure sufficient variety of the sampled data, the covariance matrices have high diagonal entries and zeros in the blocks restricted to the variables used for clustering only. The same premise was the reason for defining the blocks for the variables used to determine the neighbourhood as identity matrices. Moreover, the entries connected with the relation between these two types of variables have high absolute values, which reflects the correlation between them. Another important question is the number of objects sampled from each of the four aforementioned distributions. Of course, the easiest solution in this respect is to set equal sizes of all subsamples, i.e. 25. Such an approach seems, however, slightly simplistic, and hence we also investigated an alternative choice of the sizes of the four subsamples. The new sizes were unbalanced, because they were sampled from the multinomial distribution with n = 100 trials and equal event probabilities (p_1 = p_2 = p_3 = p_4 = 0.25).
Moreover, to assess how sensitive the analyzed algorithms are to disturbances, we added to the simulated data two types of noise generated from the uniform distribution on [−0.5, 0.5] (small noise) and on [−2, 2] (intensive noise). These assumptions resulted in the production of the six following data sets:
• Set 1: balanced subsample sizes, no noise added;
• Set 2: balanced subsample sizes, added uniform noise from [−0.5, 0.5];
• Set 3: balanced subsample sizes, added uniform noise from [−2, 2];
• Set 4: unbalanced subsample sizes, no noise added;
• Set 5: unbalanced subsample sizes, added uniform noise from [−0.5, 0.5];
• Set 6: unbalanced subsample sizes, added uniform noise from [−2, 2].
The contiguity (neighbourhood) matrix was determined on the basis of the first (lower) quartile of the distances between objects in terms of the auxiliary variables (i.e. dimensions 5 to 7 of the input data). That is, two objects are regarded as neighbours if the Euclidean distance between the vectors representing them is not greater than 0.5 q_1(d), where q_1(d) is the first quartile of such distances across all pairs of objects (i.e. q_1(d) is the lower quartile of {d(γ_h, γ_i): h, i = 1, 2, …, n, h < i}). In this way, the neighbourhood information is incorporated into the model. Moreover, this choice ensures a rational proportion between the number of pairs of objects being neighbours and the number of pairs of non-neighbours. The input data were normalized using the zero-unitarization formula, i.e. each variable was transformed as (x − min x)/(max x − min x). The clustering of the analyzed objects was based on these normalized data and was constrained by the neighbourhood in the way described earlier. The three investigated methods of clustering with contiguity constraint were used, i.e.:
- the k-means algorithm (Algorithm 1), denoted by K,
- the Ward method (Algorithm 2), denoted by W,
- probabilistic d-clustering (Algorithm 4), denoted by S.
According to the assumptions of the experiment, the number of clusters was arbitrarily set to s = 4. This choice is motivated by the fact that the data were simulated from four different distributions, which defines four types of information. Therefore, one would expect these predefined types to be reflected in the obtained clusters. The initial centres of clusters were chosen randomly from the four aforementioned distributions (i.e. taking only their dimensions 1-4 into account). The level of precision (ε) below which the iteration terminates was set to ε = 0.001 in all cases.
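The data-generating scheme can be sketched as follows (the concrete means and covariance entries used in the paper are not reproduced above, so the values below are placeholders chosen to respect the described block structure):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100                                   # 4 diagnostic + 3 auxiliary dimensions

def make_cov(rho=0.4):
    """Block covariance: high diagonal and zero off-diagonal entries in the
    diagnostic block, identity for the auxiliary block, non-zero cross-block
    entries (placeholder values keeping the matrix positive definite)."""
    S = np.zeros((7, 7))
    S[:4, :4] = np.diag([4.0] * 4)        # diagnostic block
    S[4:, 4:] = np.eye(3)                 # auxiliary block
    S[:4, 4:] = rho                       # cross-correlation blocks
    S[4:, :4] = rho
    return S

means = [np.full(7, v) for v in (-3.0, 0.0, 3.0, 6.0)]   # placeholder centres
sizes = rng.multinomial(n, [0.25] * 4)                   # unbalanced variant
data = np.vstack([rng.multivariate_normal(mu, make_cov(), size=sz)
                  for mu, sz in zip(means, sizes)])
data += rng.uniform(-0.5, 0.5, size=data.shape)          # small-noise variant

# zero-unitarization of each column
data = (data - data.min(axis=0)) / (data.max(axis=0) - data.min(axis=0))

# neighbourhood from auxiliary dimensions 5-7: threshold 0.5 * lower quartile
aux = data[:, 4:]
D = np.linalg.norm(aux[:, None, :] - aux[None, :, :], axis=2)
q1 = np.quantile(D[np.triu_indices(n, k=1)], 0.25)
W = ((D <= 0.5 * q1) & ~np.eye(n, dtype=bool)).astype(float)
```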
The experiment was repeated 1000 times. The fundamental measure of the efficiency and quality of clustering for each replication was the index of correctness defined by formula (3) as the ratio of the indices of homogeneity (1) and heterogeneity (2). Alternatively, the silhouette index (4) was also computed. Of course, for each replication, the quality assessment concerns the optimal clustering of objects based on the data generated in that replication. Thus, we obtained 1000 values of each of these two indices. All computations were conducted using the SAS Enterprise Guide 4.3 software.
Figure 2 shows the distributions of the index of correctness and the silhouette index of the optimal clustering in each of the 1000 replications of the experiment for balanced subsample sizes without noise. Basic descriptive statistics of these distributions are also presented. Moreover, the best-fitted normal curves are overlaid. The relevant figures for the remaining cases (i.e. balanced sizes with U(−0.5, 0.5) noise, balanced sizes with U(−2, 2) noise, unbalanced sizes with no noise, unbalanced sizes with U(−0.5, 0.5) noise and unbalanced sizes with U(−2, 2) noise) are practically the same. The normal curve is placed here to show how close to (or far from) the normal distribution the whole shape of the empirical distributions of the analyzed indices (including kurtosis, variation, tails, etc.) is. We can observe that the values of the correctness index (3) are distributed much closer to the normal curve (around 1 or a slightly smaller value). In general, one can observe that probabilistic d-clustering provides better results than the Ward approach. The quartiles, mean, median and maximum of the cr index are smaller for S than for W. The skewness and kurtosis for S are also less intensive. Between S and the k-means clustering (K) there is, in this context, a greater similarity. It is worth noting, however, that S seems to be slightly more efficient in the cases where the analyzed index is high (i.e. when some problems with obtaining distinguishable clusters occur). That is, the 95% and 99% percentiles and the maximum of both indices are usually greater for K than for S (e.g. 1.0468, 1.1440 and 1.5381 for K versus 1.0458, 1.1022 and 1.3498 for S in the case of the correctness index for unbalanced sizes without noise).
These values, however, relate to data for which efficient clustering was especially difficult, and such cases were few. On the other hand, one can observe that the values of the index (3) for the results of d-clustering are the least diverse, and their distribution was similar to the normal one in terms of kurtosis. This shows that this approach is more stable in terms of quality than the other analyzed algorithms. One more property strictly connected with this stability can be underlined here: in several replications of some variants, the k-means method generated a half or more empty clusters. Such replications had to be omitted in the analysis of the final results (the relevant correctness index was extremely large). It is also worth noting that the same problem occurred in some trials of repeating the whole experiment, not only for the aforementioned variants but also for the remaining ones. Nevertheless, in no case did d-clustering ever produce so many empty clusters, although the number of clusters is also arbitrarily established here.
In connection with these results, the following question arises: do the situations when the analyzed indices take extreme values (cr > 1 or SI < 0) occur for the same clusterings? Moreover, it would be interesting to know in how many clusterings the indices provide quite different results (e.g. cr > 1 and SI > 0). Table 1 contains the percentages of clusterings for which such counterintuitive cases, as well as quite consistent ones (i.e. when cr < 1 and SI > 0), are observed. One can see that most clusterings obtained using the k-means and probabilistic d-clustering algorithms are of proper quality. However, in the case of the Ward method, for over 50% of clusterings the correctness index shows that they are bad, whereas the silhouette index suggests that their quality is quite satisfactory. This means that one of these indices is inappropriate in this case. Because SI is based on a sum of normalized average dissimilarities of objects within and between clusters, these differences can cancel one another, and the final result can then be distorted. The fact that the Ward method is hierarchical, and therefore generates internal clusters contained in the target ones, seems to contribute to this situation. In the correctness index, such a situation practically cannot appear, and hence it can be perceived as better. To investigate whether this really reflects an inconsistency of the clusterings, we will use three similarity indices belonging to the so-called α-family (cf. Albatineh (2010)). They are constructed as functions of the binary counts of pairs of objects placed together or apart by the two compared clusterings. The three chosen variants use the information potential of all these binary counts optimally and are simultaneously concentrated on different, but important, properties of both clusterings6: the Peirce index (Peirce (1884)), the Rand index (Rand (1971)) and the Sokal and Sneath index (Sokal and Sneath (1963)). The Rand and the Sokal and Sneath indices take values in [0, 1], whereas the Peirce index has the range [−1, 1]. In all three cases, the minimum value denotes complete dissimilarity and the maximum perfect similarity (i.e. identity) of both structures of clusters. These indices will be used to investigate how the clusters obtained by the three analyzed methods recover the original clear-cut 4-cluster partition. A detailed table containing basic descriptive statistics for the distributions of these indices over the 1000 replications of the experiment for all its variants is presented in Appendix 2. Of course, due to the contiguity constraint, the level of recovery of the original clusters cannot be very close to ideal. However, the situation when, in all variants, the Rand and the Sokal and Sneath indices take values greater than 0.5, and the Peirce index values greater than 0, can be regarded as satisfactory here. It occurs in many clusterings generated by the three algorithms, but only in the case of the probabilistic d-clustering approach, both for balanced and unbalanced sizes and with and without noise, did at least 50% of the replications of the experiment give such a favourable result. Moreover, the maximum values for data with subsamples of balanced sizes are usually greatest in the case of probabilistic d-clustering.
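The binary pair counts underlying the α-family, and the Rand index as their best-known function, can be computed as follows (a sketch; the Peirce and Sokal and Sneath indices are other functions of the same four counts, in the forms given by Albatineh (2010)):

```python
from itertools import combinations

def pair_counts(labels1, labels2):
    """Binary pair counts for comparing two clusterings:
    a - together in both, b - together only in the first,
    c - together only in the second, d - apart in both."""
    a = b = c = d = 0
    for i, j in combinations(range(len(labels1)), 2):
        same1 = labels1[i] == labels1[j]
        same2 = labels2[i] == labels2[j]
        if same1 and same2:
            a += 1
        elif same1:
            b += 1
        elif same2:
            c += 1
        else:
            d += 1
    return a, b, c, d

def rand_index(labels1, labels2):
    """Rand (1971): share of pairs treated consistently by both clusterings."""
    a, b, c, d = pair_counts(labels1, labels2)
    return (a + d) / (a + b + c + d)
```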
Because k-means and probabilistic d-clustering seem to be better solutions than the Ward approach, we will now analyze how close the clusterings they produce are. The basic descriptive statistics of the distributions of the similarity indices for the structures of clusters obtained in 1000 replications using the k-means and d-clustering methods are collected in Table 2.
As we can see, the results show that for the prevailing majority of the simulated data, the objects are clustered similarly whether the k-means or the d-clustering approach is used. The means and medians of the Rand and the Sokal and Sneath indices are greater than 0.59, and those of the Peirce index (which takes values in [−1, 1]) are greater than 0.20. The skewness, variation and kurtosis of the indices (especially Rand and Sokal and Sneath) are also not large. Thus, one may fear that the silhouette index often does not properly reflect the quality of clustering in the analyzed situation.
6 These three indices belong to the so-called L-family of similarity indices, i.e. each of them is a linear function of the sum of squared numbers of objects placed in one cluster according to one clustering and in another cluster according to the second compared clustering. Albatineh et al. (2006), Albatineh (2010) and Albatineh and Niewiadomska-Bugaj (2011) showed that any index φ from the L-family can be corrected for chance as φ̃ = (φ − E(φ))/(1 − E(φ)), where E(φ) is the expectation of φ conditional upon fixed marginal counts of the matrix whose entries are the numbers of objects belonging to individual clusters according to the compared clusterings. An example of such a correction is the Adjusted Rand Index (cf. Hubert and Arabie (1985)). However, in our studies we use the conventional forms of the indices, because they are easier to interpret and compare in this context.

Empirical Application
Now, we will present an example of the application of the investigated clustering models and their properties in practical circumstances. We will analyze the condition of the labour market in powiats (the Polish NUTS 4 regions) of the Wielkopolskie Voivodship in Poland in 2016. The Wielkopolskie Voivodship is a NUTS 2 unit of territorial division. It consists of 35 powiats. It is worth mentioning that four of them are the largest cities of the region (Poznań, Kalisz, Konin and Leszno), having powiat status.
To realize this task, an initial set of variables describing the composite social and economic phenomenon, i.e. the labour market, was collected.7 The main criteria of their choice were data availability and logical and economic connections between them and the situation on the labour market. All of them have the character of indices, i.e. they are ratios of quantities expressed in absolute values. This set was then verified. That is, variables having too small a coefficient of variation (which implies negligible discriminating power) and variables too strongly correlated with others (detected using the method of the inverse correlation matrix) were eliminated. More details about the methods and concepts of these verifications can be found in the paper by Młodak (2014). Finally, the following set of diagnostic features was obtained:
X_1 - registered unemployment rate (in %),
X_2 - coefficient of outflow from unemployment8 (in %),
X_3 - coefficient of inflow to unemployment9 (in %),
X_4 - number of unemployed persons per one job offer,
X_5 - average monthly wage and salary in the national economy in zlotys (PLN).
The data concerning these variables are presented in Table 3. Note that the variables X_1, X_3 and X_4 are destimulants (which means that the lower the value of the feature, the better the situation of an object in this context), whereas the remaining diagnostic features are stimulants (higher values are "better"). Therefore, to make the behaviour of all of them uniform, the destimulants were transformed into stimulants by taking their values with opposite signs.
The three analyzed algorithms of contiguity-constrained clustering were used. The neighbourhood is understood here in the traditional geographical sense. The main difference of the current exercise in comparison with the optimization methodology applied in the simulation study is that now the number of clusters was established endogenously. That is, from all divisions of the set of objects into clusters obtainable by Algorithms 1, 2 and 4, the one for which the correctness index (3) achieved its minimum was regarded as final. All other computational proceedings were the same as in Section 5 (except for the replications, noise addition and sampling, of course).
In Table 4, the summary statistics for the optimal divisions into clusters obtained using the three analyzed methods are presented. The table contains the number of clusters obtained and the indices of homogeneity, heterogeneity and correctness of these divisions.
We can observe that the best results are obtained when the probabilistic d-clustering method is used. The optimal index of correctness takes here the smallest value among the three analyzed options. Therefore, it is the best method in this situation.
The constrained Ward approach tends to generate a relatively small number of clusters, whereas the constrained k-means algorithm produces six clusters, but less homogeneous ones.
As regards the similarity of the structures of clusters, the relevant Rand, Peirce and Sokal and Sneath indices (which amounted to 0.9950, 0.9848 and 0.9942, respectively) show that the results of the constrained k-means and probabilistic d-clustering methods are very close. The effects obtained by the k-means and Ward approaches with contiguity constraint are slightly more distinct (the values of the discussed indices are 0.7916, 0.7523 and 0.7739, respectively). The difference is, however, most considerable in the case of the constrained Ward and probabilistic d-clustering algorithms, where the similarity indices took the values 0.7933, 0.3451 and 0.7748, respectively. These conclusions coincide with the regularities observed in the simulation study (Section 5).
On the other hand, if we compute the silhouette index, we can see that its value is very low (−0.8591 for K, −0.1267 for W and −0.4059 for S). This discrepancy (especially in comparison with the correctness index, which signals good quality of clustering) is a premise for fearing that the silhouette index could be inappropriate in this case.
Figures 3, 4 and 5 present the territorial arrangement of the individual clusters obtained by the three analyzed methods. The constrained k-means method produces 6 clusters. The most numerous of them (i.e. cluster 1) consists of 29 powiats. Cluster 6 is created by two powiats located in the northern part of the region, where the inflow of unemployed persons and their number per one job offer are especially high. The rest of the powiats are singletons. They are far from their neighbours in the analyzed context, and their labour markets have some specific features distinguishing them from the others. For instance, the ostrzeszowski powiat is characterized by a relatively low unemployment rate, but also by a high inflow of unemployed persons and a high average monthly wage and salary.
The results of the probabilistic d-clustering taking neighbourhood into account are similar to those obtained by the correspondingly modified k-means approach. The main difference between them is that now the złotowski powiat is not a singleton but belongs to cluster 3, together with the międzychodzki powiat. There also exists another two-element cluster, consisting of the wągrowiecki powiat (which was included by the constrained k-means algorithm in one cluster with the międzychodzki powiat) and the średzki powiat (a singleton in the constrained k-means clustering). The śremski powiat creates a single cluster both in k-means and in probabilistic d-clustering with contiguity constraint. Also, the largest cluster of the 29 remaining powiats is the same in both cases.

Concluding Remarks
The experiment and studies conducted in this paper showed that the contiguity constraint imposed on the k-means, Ward and probabilistic d-clustering methods offers the opportunity to construct efficient algorithms, where the neighbourhood plays an important role.
In the case when the final number of clusters is fixed (as in the simulation study), all these methods are, in general, robust both to unbalanced cluster sizes and to noisy data. However, Ward's approach produces the worst results in every case in comparison with the two remaining ones. On the other hand, the results of d-clustering are the least diverse across replications and never contain half or more of the k clusters empty (whereas the k-means algorithm can sometimes produce such undesirable solutions).
Another question which can sometimes be of importance is taking the cardinalities of the generated clusters into account in the algorithm. Although no special conditions of that type are imposed in the constraints used here (cf. Wagstaff et al. (2001)),10 a special situation may sometimes occur, when e.g. in the course of the iteration an empty cluster appears and, in consequence, the final number of non-trivial clusters is smaller than k. If this inconvenience is, for some practical reasons, harmful, it can be avoided e.g. by imposing a minimum cluster size and using the method proposed by Bradley et al. (2000).
Most of these conclusions were confirmed when the number of clusters was established endogenously, on the basis of the optimization of the correctness index. The results provided by probabilistic d-clustering are of the highest quality and also similar to the results of the k-means procedure. It is worth noting, however, that of these three approaches, probabilistic d-clustering has the smallest propensity to generate too small a number of clusters or too many singletons, which can be perceived as one of its main advantages, reducing some inconveniences that can be a side effect of taking the neighbourhood into account.
Incidentally, we have also compared the efficiency of the proposed correctness index and the silhouette index. It turned out that even when two clusterings obtained using various methods are similar, the values of the silhouette index for them can be quite different. This observation calls into question the utility of the latter index in such situations.

Appendix 1. Proof of Theorem 1
In some respects, the proof can be perceived as an adaptation of a similar theorem for the conventional situation (cf. Iyigun and Ben-Israel (2013)).
a) If c_k ∉ {γ_1, γ_2, …, γ_n}, the condition reduces to the vanishing of the gradient (17) at c_k, and the thesis of this item is, of course, true. Assume that c_k = γ_i for some i ∈ {1, 2, …, n}. Consider the change of the argument from γ_i to γ_i + tz, where z ∈ ℝ^m and ‖z‖ = 1. Thus,

Fig. 1
Fig. 1 Basic scheme of probabilistic d-clustering with contiguity constraint. Remark: C_1 and C_2 denote centroids of clusters. Source: Own elaboration based on Example 1.
Part II of Algorithm 4. Step 1: Create a new set of objects consisting of the clusters of neighbours determined in Part I and the objects having no neighbours, treated as singletons (cf. Step 1 of Part II of Algorithm 1). Step r-th (r = 2, 3, …): Compute the centroids of the clusters established in Step 1 and conduct Algorithm 3 in relation to them. The relation expressed in the r-th step of Part I of Algorithm 4 is an equivalence relation. One can prove this in quite a similar way as in the case of the relation used in the r-th step of Part I of Algorithm 1.

Fig. 2
Fig. 2 Distributions of the values of the correctness and silhouette indices obtained in particular replications, with best-fitted normal curves: balanced subsample sizes, no noise added. Source: Own computations using a special algorithm written in the IML environment of the SAS Enterprise Guide 4.3 software.

Fig. 3
Fig. 3 Clusters of powiats of the Wielkopolskie Voivodship according to the condition of the labour market, obtained using the k-means method with contiguity constraint. Remark: The miniature map placed in the bottom-left corner shows the location of the Wielkopolskie Voivodship in Poland. Source: Own computations using the SAS Enterprise Guide 4.3 software (including its IML environment).

Fig. 4
Fig. 4 Clusters of powiats of the Wielkopolskie Voivodship according to the condition of the labour market, obtained using the Ward method with contiguity constraint. Source: Own computations using the SAS Enterprise Guide 4.3 software (including its IML environment).

Fig. 5
Fig. 5 Clusters of powiats of the Wielkopolskie Voivodship according to the condition of the labour market, obtained using the probabilistic d-clustering method with contiguity constraint. Source: Own computations using the SAS Enterprise Guide 4.3 software (including its IML environment).
…the greatest decrease of the value of the function f_k is along R_ik, i.e. when z = R_ik/‖R_ik‖. Then c_k = γ_i if and only if the relevant sum condition of part a) holds. c) This follows from (18) and from part a), i.e. from the fact that if c*_k = γ_i for some i ∈ {1, 2, …, n}, then R_k(c*_k) = 0 (k = 1, 2, …, s). b) This follows from part a).

Table 1
Share of clusterings for which the correctness and silhouette indices assume favourable and unfavourable values (in %). Source: Own computations using the SAS Enterprise Guide 4.3 software (including its IML environment).

Table 2
Basic descriptive statistics of the distributions of similarity indices related to the results of the k-means and d-clustering methods

Table 3
Values of the diagnostic features of the labour market in Poland in 2016. Source: Own computations based on the resources of the Local Data Bank of Statistics Poland (http://bdl.stat.gov.pl)

Table 4
Summary results for optimal clusters of powiats of the Wielkopolskie Voivodship