Proximity Curves for Potential-Based Clustering

The concept of proximity curve and a new algorithm are proposed for obtaining clusters in a finite set of data points in the finite dimensional Euclidean space. Each point is endowed with a potential constructed by means of a multi-dimensional Cauchy density, contributing to an overall anisotropic potential function. Guided by the steepest descent algorithm, the data points are successively visited and removed one by one, and at each stage the overall potential is updated and the magnitude of its local gradient is calculated. The result is a finite sequence of tuples, the proximity curve, whose pattern is analysed to give rise to a deterministic clustering. The finite set of all such proximity curves in conjunction with a simulation study of their distribution results in a probabilistic clustering represented by a distribution on the set of dendrograms. A two-dimensional synthetic data set is used to illustrate the proposed potential-based clustering idea. It is shown that the results achieved are plausible since both the ‘geographic distribution’ of data points as well as the ‘topographic features’ imposed by the potential function are well reflected in the suggested clustering. Experiments using the Iris data set are conducted for validation purposes on classification and clustering benchmark data. The results are consistent with the proposed theoretical framework and data properties, and open new approaches and applications to consider data processing from different perspectives and interpret data attributes contribution to patterns.


Introduction
The history of work on clustering algorithms based on, and motivated by the gravitational (and electrostatic) models stretches back to the 1970s. They are too numerous to be exhaustively reviewed here; we review below a few relevant highlights, thereby placing our work into context and pointing out its distinguishing features. For a general review of clustering techniques, see Berkhin (2006) and for a recent survey see Xu and Tian (2015).
One of the earliest papers of cognate interest is Wright (1977b), where the attractive force (between the particles) under Newtonian gravity moves the data points to form successively larger and fewer clumps, eventually resulting in all the mass being lumped together in one single cluster.
Starting with a set of data points where each point has a potential, in Shi et al. (2002), an agglomerative hierarchical clustering algorithm is presented based on a comparison of distances between already defined clusters. The distance used in Shi et al. (2002) is calculated as some weighted value of differences of potentials.
In Lu and Wan (2012), each of the data points is endowed with a potential (of a Newtonian type), and the points are sorted in a list in ascending order of their potential. Cluster centres are nominated as the list is being worked through by using a preselected bandwidth parameter; being able to measure the distance between points is an essential feature of the algorithm.
An interesting recent paper with a philosophical flavour is Henning (2015). The authors reason that there is no such thing as the 'best' clustering algorithm since the notion of a cluster will be dependent on the context and on the intention of the investigator. On the other end of the spectrum, there is the early paper Wright (1977a), seeking to make clustering into an objective endeavour by assuming a priori a 'clustering function' which then is used to assess the quality of the clustering achieved.
The view taken here is that 'reality' and 'truth' are elusive concepts and can be at best of partial concern to the modeller; important will be that, initially, a collection of (more or less plausible) models is available, and out of these the one is chosen which most successfully predicts future events. We want our model to be understood and received in this spirit: the model put forward is rooted in the physics of own data attributes, it is rich in structure and appears plausible in certain circumstances.
The algorithm suggested here is an hierarchical clustering algorithm for a finite set of data points in the Euclidean space, each of which is assumed to be generating a potential field constructed by means of the density of a certain family of probability distributions. The individual point potentials can be anisotropic, thus allowing the surface representing the total potential to be thought of (in two dimensions for the sake of simplicity) as an undulating landscape with hills and valleys extending in any direction. This construction will allow clusters of data points on a 'real' surface to be modelled, a feature which may turn out to be useful in future work in epidemic modelling using networks (see Barabási 2016, Chapter 10). The novelty of our model also lies in the fact that the result is not a single dendrogram but a randomized dendrogram, defined as a convex combination of deterministic dendrograms, thus allowing the results to be formulated in a probabilistic framework.
The important novel feature introduced here is that of the proximity curve whose pattern will be the key for determining the suggested clustering. A welcomed by-product of our algorithm is that the black hole problem (as a local minima challenge discussed for example in Li and Fu 2011) does not appear here: all massive data points will be 'collected' prior to visiting genuine clusters comprising smaller data points.
In the next section, we place the paper's topic into context and introduce some notation. Then, in the main Section 3, the new proximity curve concept and proposed algorithm are described in detail and illustrated by using a synthetic data set. In Section 4, experiments using the Iris data set (Fisher 1936(Fisher , 2011 are conducted for validation purposes on classification and clustering benchmark data. In Section 5, the results and various features of the suggested technique and its merit are highlighted. In Section 6, our findings are summarised and some future research ideas are indicated.
Appendices A -C contain auxiliary theoretical material, whereas all nontextual material namely contour maps, proximity curve, potential curve, dendrogram, pseudocodes, tables and large matrices are collected in Appendix E.

Context and Assumptions
We are given N data points x 1 , . . . , x N in the Euclidean space, assumed here for the sake of simplicity to be the 2-dimensional plane R 2 . This is not a technical limitation as such; it is intended merely as a vehicle for conveying the underlying principles. The ith point induces the potentials v i , generating the total potential as follows: Depending on the context and emphases, the data points may be referred to also as nodes, point masses, point charges or simply points. The collection (set) of all data points may be referred to as a data cloud.
The structure of the point potentials v i is as follows: where M i > 0 is the 'mass' of point i and f (. |θ ) stands for a Cauchy probability density functions (pdf) with parameter θ. The parameter θ has five components, reflecting the geometric transformation steps carried out to convey the seed distribution (a standard Cauchy) to the required Cauchy distribution; the construction involved is described in detail in Appendices A -C. Our running example is a synthetic data set comprising N = 19 points in the plane shown in Fig. 2 with individual and joint equipotentials. The corresponding parameter values are shown in Table 2.
We will be interested in an algorithm for finding data clusters based on the points' location and the overall potential field generated by them.
The physical models of Gravity and Electrostatics serve us as an intuitive background (e.g. Ramsey 1959;Kip 1962). The field's gradient vector ∇V (y) is the force exerted onto an infinitesimal unit 'test mass' or 'test charge' located at y. The gradient vector ∇V (y) will be used to reflect qualitatively on where the point y in relationship with the others lies: is ∇V (y) , the magnitude of the gradient, close to zero, so is y near the bottom of a valley (i.e. near a local minimum of V ), whereas 'noticeably positive' values of ∇V (y) are associated with the point y being on a slope on the surface describing V . 1 Respective examples in Fig. 2 are the points #10 and #6. (See Table 2 for the points' numbering.) Our aim is to discover data clusters by considering properties derived from the gradient field. Finite lists of tuples of real numbers will be constructed (we shall call them proximity curves) whose pattern will suggest ways of clustering the data.

Proximity Curves
Initially, all data points are said to be active. Starting from some initial position, steepest descent is applied to the total potential generated by all the active points, guiding us to one of the data points. The point thus found is then deactivated, the total potential is redefined (i.e. updated) to be the one generated by the now active points and the process is restarted at the position of the point just deactivated. This process is carried out until the last point has been found upon which all points become inactive and the total potential is zero.
The core of this algorithm is PROXIMITYCURVE; it is shown as Pseudocode E-2. As can be gleaned from Pseudocode E-2, in the course of the computations, a list of pairs is being built up, each entry comprising the label of the node being deactivated and the logarithm of the vector norm of the gradient of the updated potential at the node's location. (The algorithm CLOSESTACTIVE, shown as Pseudocode E-3, is an auxiliary.) Figure 3 illustrates the progress of the algorithm with a snapshot of the equipotentials just after having deactivated seven data points.
A proximity curve, illustrated by Fig. 4, is a plot of the logarithm of the vector norm of the gradient of the successively updated potential (as described above) versus the points' position in the list (the logarithm applied to the magnitude of the gradient is merely there to express and amplify an existing pattern). In total, there will be at the most N proximity curves as in practice some nodes cannot be reached to be the first node to be visited and since the first node visited completely determines the rest of the proximity curve. The full information about all possible proximity curves can be conveyed by two square matrices of size N : -The rows of matrix S = s ij i,j =1,...,N record all possible sequences of nodes, thus obtainable. -The rows of matrix = λ ij i,j =1,...,N are the logarithm of the norm of the gradient of the corresponding successively updated potential at the respective node locations. This is illustrated in Fig. 4 by the proximity curve whose first node is #2. ( 1 )

Pattern Extraction
The curve shown in Fig. 4 has roughly a self-similar structure (in a statistical sense) akin to the Elliott Wave Pattern described in Casti (2002) in the context of financial data. The principle formulated below will guide us when using proximity curves for identifying clusters of data points. The pattern of a proximity curve is explored by a recursive algorithm.
-The recursive step is in locating the smallest local minimum. In the example shown in Fig. 4, this occurs at node #13 as is seen from the appropriate section of the list representation of the proximity curve (1) as follows: each of which is a proximity curve in its own right. Obviously, if local minima determine cluster boundaries, then L left and L right give rise to the clustering as follows: defining level 2 in the hierarchical clustering scheme. The next level of clustering will be arrived at by subjecting each of the lists L left and L right to the same recursive step. (A list not containing a local minimum will not be expanded further but will be replaced by a list containing the list itself as the only entry.) -The base case is arrived at if none of the input lists can be written as the concatenation of two non-empty lists as described above.

Deterministic Dendrograms
The algorithm in Section 3.2 can be used for obtaining a nested list representation of the set of nodes indicating the clustering thus found; for the present example, this is as follows:

Probabilistic Dendrograms
Every proximity curve gives rise uniquely to a fully developed dendrogram as described in Sections 3.2 -3.3. Not all conceivable proximity curves can be obtained by the present algorithm, however, as is illustrated by the running example. We are going to describe here how those reachable can be combined to a probabilistic (or randomized) dendrogram. Such a dendrogram allows the user to reason probabilistically and motivate a choice of starting point. Questions like "What is the probability that two given points belong to the same cluster?" can then be meaningfully addressed. Choose a window of interest W ⊂ R 2 . To any given position u ∈ W, a possible dendrogram is assigned in the following manner. W is chosen so that it defines a window that the entire data set fits into, and that will provide a selection of possible points which are "sensible", i.e. could have been observed given the context of the initial data.
To arrive at z = STEEPESTDESCENT (u, V ), use Pseudocode E-1 from Appendix E.3. The point z will be the position of a local minimum of the potential V . Find the data point x ι(u) ∈ {x 1 , . . . , x N } which is closest to z. Circumstances will be in practice such that z is unique (as far as the numerical procedure allows), resulting in a unique choice of the closest point x ι(u) . The process thus described defines a mapping as follows: Running Pseudocode E-2 (see Appendix E.3) with the initial vector z (start) = u will result in the proximity curve whose first node is labelled ι(u). The proximity curve thus found is unique in that no two different proximity curves have the same first node. The dendrogram based on this proximity curve will be denoted by D ι(u) . Let u 1 , . . . , u be a random sample of size from the uniform distribution on W. We define the dendrogram probability vector d by the following: is the relative frequency of the node k having been chosen as the first data point for the corresponding proximity curve, where is the indicator function of a given set B.
The probability of two given nodes m 1 = m 2 being in the same cluster can now be evaluated by the following: P (nodes m 1 and m 2 are in the same cluster) = N k=1 d k P (nodes m 1 and m 2 are in the same cluster|D k applies) .
The conditional probabilities in Eq. 5 take values in {0, 1}; they are obtained by inspecting dendrogram D k . There are in total N(N − 1)/2 probabilities defined in Eq. 5. Probabilities for i pairwise different nodes m 1 , . . . , m i being in the same cluster can be evaluated analogously; there are N i such values. Based on a sample of size = 1, 000, the dendrogram probabilities were calculated by Eq. 3; they are shown in Table 3. Notice that the data points 6, 7, 9, 13, and 15 (and the corresponding dendrograms) are unreachable, which is a consequence of none of these points being located close enough to a valley (a local minimum) of V , or, more precisely, for all of these data points there is another data point closer to the candidate minimum location; see Fig. 2, the right-hand box.
Consider m 1 = 16, m 2 = 18 as an example. It can be seen by inspection that out of the 14 possible dendrograms (not shown here), each of the following 9 will assign these two nodes to the same cluster: It is therefore P (nodes #16 and #18 are in the same cluster) In the above calculations and example, fully developed dendrograms were considered only (requiring us to descend to depth 5). In general, the probability of any given set of nodes belonging to the same cluster can be calculated for any required level in a similar manner.

Implementation
A suite of programmes has been written to implement the algorithms. The functions in Appendix E.3 were implemented in R since this has excellent programming and plotting facilities for data analytics. The pictorial and computational output of the R programs are the contour plots and the proximity curve, the latter internally represented as a list-of-tuples which then serves as an input to a recursive Haskell programme returning the dendrogram as a nested list. Haskell was also used to produce L A T E X code for drawing the dendrogram (as a tree) as exemplified in Fig. 5.
The associated code and data are available from http://www.comp.brad.ac.uk/ ∼ dneagu/proximity

Experimental Work
In order to evaluate the accuracy achieved by the clustering method proposed, we use the Iris data set (Fisher 2011), which is commonly used in the literature. The Cauchy parameters are selected from the data set as detailed below. The Iris data set has the following fields: In order to test how the algorithm behaves under different assignments of those fields to Cauchy distribution parameters, we varied which Iris data set field gets assigned to which parameter (μ 1 , μ 2 , c 1 , c 2 ).
For each of the inputs, the mass is assumed to be 1, and the angle parameters are computed as follows: It should be noted that this only one pragmatic assignment of α i , but in principle other functions of the other parameters can be used.
In addition, since we know which points should be clustered together (i.e. those belonging to the same Iris species), we can evaluate the accuracy for a given clustering (a given depth of the dendrogram). The accuracy is measured by determining what proportion of the data points belong to the correct cluster. The class label assigned to the cluster is chosen based on what class the majority of the points in the cluster belong to. The full results, for all possible parameter assignments, are given in Appendix E.5.
We consider here the 4 parameter assignments shown in Table 1, along with the related accuracies, for the sake of simplicity (the experimental work has covered the entire range of possible permutations; please see Appendix E.5 for the entire list of results). Accuracy at depth 1 is not shown, as that represents the root of the tree, i.e. the point before any clustering has been attempted. This information is visualised in Fig. 7. As evidenced by Table 1, the accuracy achieved is highly dependent on the parameter assignments chosen.
For example, the best overall result, as well as the fastest, is obtained for θ 3 . That, after the first split, gives the accuracy of 0.593, which grows to 0.607 after one more iteration, and reaches 0.860 at dendogram depth 5. These results validate our theoretical assumption that data attributes translated into Cauchy parameters (which give rise to a particular proximity curve) influence and determine cluster definitions. The proposed algorithm allows further tuning in subsequent iterations, creating the opportunity to further improve classification results.

Discussion
The distinguishing feature of our algorithm is that the data points are kept stationary while the space is explored with a moving unit test charge (as in electrostatics) or test mass (as in gravity) to discover clusters. In doing so, the test charge will be attracted to groups of charges (or masses), depending on the force of attraction as measured by the gradient of the potential. As a data point is 'discovered', it is deactivated (i.e. removed) and the gradient of the updated potential field is used to guide us to discover the next data point. By successively removing data points, the current cluster gets depleted, resulting in a steady weakening of attraction to what remains of the cluster. Eventually, the current cluster gets exhausted, indicated by the small magnitude of the gradient of the potential. This then is the sign for starting a new cluster: the magnitude of the gradient of the total potential of the remaining active points begins to increase. The pattern of the proximity curve thus constructed holds the clue to the definition of the clustering.
The process described above suggests a nested structure of clusters inherent in hierarchical clustering.
Individual proximity curves are usually suitable for finding the clustering at the local level because removal of points may eventually noticeably interfere with the global interplay of individual point potentials. The effect of 'point removal' inherent in the technique is intended to be counteracted by taking into account all possible (reachable) proximity curves in a probabilistic sense. The dendrogram in Fig. 5 in conjunction with the set's joint equipotentials in Fig. 2 shows that the clustering suggested is intuitively plausible. The fact that point #19 is not grouped together with the points {16, 17, 18} is attributable to the fact that #19, even though it is physically close to the group, is separated from it by a ridge. Figure 3 shows a snapshot of the node deactivation process. It is seen that the algorithm duly observes both the mutual closeness of data points as well as the topography of the potential surface.
Finally, it is instructive to see that the north-eastern group of data points {14, 12, 15, 13} is separated by the dendrogram from the south-western group {6, 7, 9} even though the locations of all these points are roughly on the same potential level, as shown in Fig. 6. This illustrates once more the point that the algorithm takes into account mutual geographic proximity of points as well as surface topography.

Conclusion and Outlook
The new concept of the proximity curve has been introduced and used for finding clusterings in finite sets of data points in the Euclidean space endowed with individual potentials derived from a multivariate Cauchy distribution. The finite collection of all proximity curves gives rise to a probabilistic dendrogram. A synthetic example comprising 19 data points was taken to illustrate the technique. The evaluation of the algorithm on a combination of Iris data set case studies reported in Section 4 proves that this new approach to consider data features' contributions to the global field potential as a way to discover clusters is promising and valuable.
Further research is planned to explore the technique for data sets in higher dimension, for large data sets, for more real-life data sets, and the associated computational complexity. More comprehensive evaluations on other cases with real data is envisaged as future work allowing an insight into the applicability of the proposed methodology. Another research direction could address how the technique generalises to the continuous case where the data set is so large and densely packed that it is reasonable to model it as a continuum. The authors see interesting research avenues on parameter tuning and extension to multidimensional spaces that can be treated as multi-objective optimisation challenges.

A.1 Single Point Mass
A point mass of size M 0 placed at the co-ordinate origin creates a gravitational potential which at a distance r from the origin is as follows: where κ is the gravitational constant. The force of attraction on a unit point mass is the negative of the derivative of V (r) in Eq. 6 w.r.t. r, The negative sign in Eq. 7 indicates that the force points towards the origin (the location of the point mass M 0 ). If Cartesian co-ordinates are used then Eq. 6 now reads as follows: We may visualise the potential in Eq. 8 as a funnel with rotational symmetry around the origin which descends to minus infinity. The origin is a singularity of the potential. (It is not defined there.) If the point mass is at (x 0 , y 0 , z 0 ), then its potential at the location (x, y, z) is as follows: A particular component of the attractive force is the partial derivative of −V in Eq. 9 in the respective direction; for example, the x component of the attractive force at location (x, y, z) is as follows: Notice that if x 0 = y 0 = z 0 = 0, then Eq. 10 specialises to the Cartesian form of Eq. 7, Also notice that the unit vector on the right-hand side of Eq. 11, 1 x 2 + y 2 + z 2 x y z T points in the direction of the point (x, y, z), and the magnitude of the attractive force in Eq. 11 is as follows:

A.2 Several Point Masses
Let us assume that we are given N points at respective locations (x i , y i , z i ) and masses M i , i = 1, . . . , N. Denoting by v individual points' potential, the total potential is by superposition and the force acting upon a point of unit infinitesimal mass is the negative of the gradient of the potential, The operation 'gradient' applied to a real function of several variables is sometimes denoted also by the operator ∇ (the nabla operator). Using this notation, Eq. 12 becomes as follows:

B.1 General Considerations
A data cloud is viewed here as a collection of point masses. The same data point recorded n times will carry the mass n. The potential functions used here were inspired by the Newtonian potential. The Newtonian gravity model is, however, not suitable since it has singularities and does not allow anisotropic potential fields to be modelled. To avoid singularities, each point mass will be assumed to be generating a potential field which is the weighted negative density of a (sufficiently 'smooth') 2-D distribution. This approach will also allow directionality (anisotropy) of the potential field to be modelled.
It will be assumed that a class of probability densities on R 2 is given by the following: and that the ith of the N point masses generates the point potential where M i in Eq. 14 is the mass of the ith point. The class of densities in Eq. 13 arises from a seed distribution by subjecting it to an affine linear transformation as described in Appendix C. In addition to the mass, the ith point is associated with five more parameters, -Two location parameters μ i1 , μ i2 ∈ R -Two stretch parameters c i1 , c i2 > 0. It is 0 < c ≤ 1 for 'squeezing' and 1 ≤ c < ∞ for 'stretching' -An angle of rotation paremeter α i ∈ [0, 2π) They will be called the geometric parameters.
The total potential, the sum of the point potentials, is the negative of a linear combination of 2-D densities of the class (13). We mention in passing that the negative of the normalised total potential is therefore the probability density of a mixture distribution.
The point potential of the point mass i is then The total potential generated by the N point masses is by superposition as follows: The force acting upon an infinitesimal unit mass in the plane is the negative of the gradient of V in that point.

B.2 Cauchy Type Potentials
The context is as described above where now the underlying class of distributions is Cauchy. More precisely, the class of densities in Eq. 13 is the set of all 2-D Cauchy densities obtainable from the seed density in Eq. 29 by applying an affine linear transformation. The Cauchy density in Eqs. 31-32 has the form as follows: with constants K defined in terms of the three of the five natural parameters as follows: By partially differentiating (17), we get the following: Similarly, From Eqs. 22-23, the gradient of f Y is seen to be as follows: y 1 ,y 2 )) 5/3 2K 1 K 12 K 12 2K 2 y 1 y 2 − μ 1 μ 2 (24) From Eqs. 14 and Eq. 24, the gradient of v i , the potential of the ith single point mass is by Eq. 15 as follows: where f i,Y stands for the density f Y from Eq. 17 with the parameters being μ i1 , μ i2 , c i1 , c i2 and α i . Furthermore, the quantities K i in Eq. 25 are defined by Eqs. 18-21; they are to be evaluated at the same set of parameter values μ i1 , μ i2 , c i1 , c i2 and α i . The gradient of the total potential is the sum of the values in Eq. 25,

Appendix C: Cauchy Distributions in 2-D
The type of point potentiused in this paper for data points derives from densities from the well-known Cauchy class of probability distributions. Facts are collected here for reference in the main body of the paper.

C.1 Seed Distribution
The standard one dimensional Cauchy distribution is usually presented as the prime example of a distribution on R whose mean does not exist. Its probability density function is as follows: The appeal of a function like the one in Eq. 26 lies in the fact that it does not tend 'too fast' to zero as the argument is taken to infinity; such distributions are sometimes referred to as 'fat tailed'. 2 2-D versions of the Cauchy distribution are much less well-known. The 2-D analogue of the Cauchy density (26) is as follows: where ρ ∈ (−1, +1) is a parameter and The density in Eqs. 27-28 is referred to as a standard bivariate Cauchy density, Jamalizadeh and Balakrishnan (2008). It will be transformed by a sequence of geometric operations. Assume that the 2-D random vector X is standard bivariate Cauchy distributed with density (27)-(28) with ρ = 0, Cauchy distributions, univariate and bivariate, are discussed extensively in Feller (1971) where (29) is termed the standard bivariate Cauchy density. It will be taken to be the seed for the geometric transformations carried out next in Appendix C.2.

C.2 Transformed Distribution
Assume now that Y comes about by subjecting X to the composition of the following three geometric operations: 1. Stretch by c 1 , c 2 > 0 along the respective axes 2. Rotate by the angle α ∈ [0, 2π) 3. Shift by the vector (μ 1 , μ 2 ) By solving (30) for X, we get the following: which then component-wise is written as following: According to the transformation rule of probability densities, it is as following: where the denominator D in Eq. 31 is as follows: Figure 1 illustrates the succession of transformations for as follows: Corresponding formulae can be developed also for the 2-D Gaussian class which, however, in contrast to the Cauchy class, turns out not to be suitable for our present purposes as the effect of distribution dies off fairly quickly for arguments further away from the distribution's centre.

Appendix D: Optimisation: Steepest Descent
The steepest descent algorithm is an iterative algorithm for unconstrained minimisation of a differentiable real function of several real variables (Ecker and Kupferschmid 1988;Marlow 1978;Rao 1984). The equal steplength version will be used in the algorithm for determining the proximity curve. Removing a point reached changes the landscape, assuring a new minimum can be reached (since the current position is no longer a minimum after a point is deactivated). The algorithm is shown in Appendix E.3 as Pseudocode E-1.

Pseudocode E-2 Proximity Curve
Pseudocode E-3 Closest Active Node Table 2 Nineteen data points in the plane with parameters Point no.

E.5.2 Further Accuracy Plots
Below, the accuracy plots for other assignments of Cauchy parameters are presented.