1 Introduction

Uncovering structural communities and clusters within complex data can be of interest across disciplines. In [1], the authors harness the richness of a social perspective to derive community network structure in the presence of heterogeneity. Therein, a key concept of locality to a pair of data points is provided, leading to informative measures of (local) depth and cohesion. In this paper, we provide a generalization of this approach, by distilling two key probabilistic concepts: local relevance and support division. The approach sheds light on the foundations of partitioned local depth, and removes reliance on static distance comparisons, to enable probabilistic consideration of uncertain, variable and potentially conflicting information.

The notion of local (community) depth introduced in [1] builds on existing approaches to data depth (see for instance [2, 3]). Partitioning the probabilities defining local depth leads to a quantity referred to as cohesion, which can be understood as a measure of locally perceived closeness. The resulting framework also gives rise to a natural threshold for distinguishing strongly and weakly cohesive pairs and provides an alternative perspective for the concept of near neighbors. Topological features of the data can be considered via networks of pairwise cohesion, and meaningful structure can be identified without additional inputs (e.g., number of clusters or neighborhood size), optimization criteria, iterative procedures or distributional assumptions. For a review of the general method, referred to as partitioned local depth (PaLD), see Sect. 2; for further details see [1], and the references therein.

It is crucial to note the importance of accounting for varying local density, particularly in applications involving complex evolutionary processes (see, for instance, [4,5,6,7], and examples in [1]). In [1], relative positioning is considered through distance comparisons within triples of points, which may be of value in non-metric and high-dimensional settings.

Now, consider a given finite set of interest, S. If, for \(x,y,z\in S\), we have definitive answers to questions such as “Is z more similar to x than to y?”, then PaLD community analysis can proceed directly [1, 8]. Still, these may not be the most informative answers to such queries. For example, answers might instead have inherent variability, e.g., 80% of information available suggests that z is more similar to x than to y. It may, on the other hand, be the case that there is some true, definitive answer but this answer is subject to inherent uncertainty.

As an example application of PaLD, as introduced in [1], Fig. 1 displays community structure for cultural distance information obtained in [9] from two recent waves of the World Values Survey (2005 to 2009 and 2010 to 2014) [10]. Distances are computed using the cultural fixation index (CFST), which is a measure built on the framework of fixation indices from population biology [11, 12]. Note that PaLD employs within-triplet comparisons and allows for the use of such application-dependent, non-Euclidean measures of dissimilarity. In Fig. 1A, colored edges correspond to strong mutual cohesion, obtained by partitioning local depths. For a review of the derivation of such networks, see Sect. 2. Histograms for within-group cohesion and distance are provided in Fig. 1B; colored bars for (mutual) cohesion indicate values above the threshold of 0.0217 (see (14) below). Note that community structure can be identified without additional inputs (e.g., number of clusters or neighborhood size), optimization criteria, iterative procedures or distributional assumptions.

The data reflect that while, culturally speaking, regions within the USA are far more similar to each other than regions within India, the latter displays similar levels of strong internal cohesion.

Fig. 1

Cultural communities from survey data; adapted from [1] with permission. In A, we display the community structure obtained from the cultural fixation index values from [9] for regions within the USA, China, India and the European Union. In B, we display the distribution of within-group cohesions and distances; colored bars for (mutual) cohesion indicate values above the threshold of 0.0217 (see (14)). Note that distances are brought to comparable levels of cohesion.

The remainder of the paper proceeds as follows. In Sect. 2, we provide some preliminaries and notation, including a review of the development of PaLD as introduced in [1] which highlights its formulation in terms of static dissimilarity comparisons. Section 3 provides an introduction to the abstracted concepts of local relevance and support division, and the given generalization of PaLD, to incorporate uncertainty, and Sect. 4 follows with theoretical results on properties of cohesion mirroring those in [1], for the new scenario. Section 5 includes mention of potential applications to multiple dissimilarity measures, event-based data and data uncertainty.

We now turn to some preliminaries and notation.

2 Preliminaries and Notation

Suppose \(S=\{a_1,a_2,\dots , a_n\}\) is a finite set with a corresponding notion of pairwise dissimilarity or distance \(d: S \times S \rightarrow \mathbb {R}\). For any pair \((x,y)\in S\times S\), the set of relevant local data (or local focus), \(U_{x,y}\), is defined to be the set of elements \(z\in S\) which are as close to x as y is to x, or as close to y as x is to y, i.e.,

$$\begin{aligned} U_{x,y} {\mathop {=}\limits ^{\textrm{def}}}\{z \in S \mid d(z,x) \le d(y,x) ~\textrm{or }~ d(z,y) \le d(x,y)\}. \end{aligned}$$
(1)
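As a small illustration of (1), the following sketch lists the local focus for one pair of points. The data set `S`, the helper name `local_focus` and the use of absolute difference as the dissimilarity are assumptions made only for this example.

```python
def local_focus(points, d, x, y):
    """Return the local focus U_{x,y} of (1) as a list of elements of `points`."""
    return [z for z in points
            if d(z, x) <= d(y, x) or d(z, y) <= d(x, y)]


# Hypothetical one-dimensional data; absolute difference plays the role of d.
S = [0.0, 1.0, 1.5, 4.0]


def dist(a, b):
    return abs(a - b)


# The point 1.5 is within distance 1 of y = 1.0, so it is local to (0.0, 1.0);
# the point 4.0 is farther than 1 from both x and y, so it is not.
focus = local_focus(S, dist, 0.0, 1.0)
```

Only comparisons of dissimilarities enter the computation, so any function `d` that supports `<=` comparisons could be substituted.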

From a social perspective, the set, \(U_{x,y}\), local to the pair of individuals (x, y), consists of individuals with alignment-based impetus for involvement in a “conflict” between x and y (see Social Framework in [1] for a discussion of the underlying social latent space and related references). In the case of a symmetric distance, \(U_{x,y}\) is comprised of those z as close to x or y as they are to each other. The sense of local could be altered depending on applications.

The local depth of x, \(\ell _S(x)\), is a measure of local support, which leverages the concept of local that is implicit in the definition of \(\{ U_{x,y} \}\):

$$\begin{aligned} \ell (x){\mathop {=}\limits ^{\textrm{def}}}\ell _{S,d}(x) = P(d(Z,x)<d(Z,Y))+ \frac{1}{2} P(d(Z,x)=d(Z,Y)), \end{aligned}$$
(2)

where Y is selected uniformly at random from the set \(S\setminus \{x\}\) and Z is selected uniformly at random from the local set \(U_{x,Y}\) (see Fig. 2). For convenience, the term resolving ties in distance (via coin flip), in (2), will be suppressed in what follows. The important concept of cohesion can then be obtained through partitioning of the probabilities defining \(\ell \). In particular, we have that \(C_{x,w}\), the cohesion of w to x, is given by

$$\begin{aligned} C_{x,w}{\mathop {=}\limits ^{\textrm{def}}}P\left( Z=w,d(Z,x)<d(Z,Y) \right) . \end{aligned}$$
(3)
Fig. 2

The local focus for a fixed point x and a random point Y, in two-dimensional Euclidean space. The points in red are outside the focus. Those in green (and Z in blue) are in the focus and closer to x, while those in gray are closer to Y

The cohesion network is the weighted, directed graph with node set S and edge weights \(\{C_{x,w}\}\); typically, an undirected version is displayed by considering the minimum of the bi-directional cohesions for each edge pair, with thicker edges depicting larger weights. Unless stated otherwise, we will employ the Fruchterman–Reingold algorithm [13] to display cohesion networks. Through cohesion, the dissimilarity measure, \(d\), is locally adapted, to reflect relative locally-based support (see, for example, Fig. 1). For additional discussion and applications of PaLD, in the context of considerations of data depth, embedding, clustering and near-neighbors, see [1].
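The definitions in (1), (2) and (3) can be sketched in a few lines of Python. This is a minimal illustration rather than the reference implementation of [1]: the function names, the toy data `pts` and the choice of absolute difference as the dissimilarity are assumptions, and ties are split evenly as in the tie term of (2).

```python
def cohesion_from_distances(D):
    """Cohesion matrix C[x][w] from a distance matrix D, following (1)-(3)."""
    n = len(D)
    C = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            if y == x:
                continue
            # Local focus U_{x,y} from (1).
            focus = [z for z in range(n)
                     if D[z][x] <= D[y][x] or D[z][y] <= D[x][y]]
            for z in focus:
                if D[z][x] < D[z][y]:
                    support = 1.0
                elif D[z][x] == D[z][y]:
                    support = 0.5  # tie resolved by a fair coin flip
                else:
                    support = 0.0
                # P(Y = y) = 1/(n-1) and P(Z = z | Y = y) = 1/|U_{x,y}|.
                C[x][z] += support / (len(focus) * (n - 1))
    return C


def mutual_cohesion(C):
    """Undirected weights: the minimum of the two directed cohesions."""
    n = len(C)
    return [[min(C[i][j], C[j][i]) for j in range(n)] for i in range(n)]


# Hypothetical one-dimensional data.
pts = [0.0, 1.0, 1.5, 4.0]
D = [[abs(a - b) for b in pts] for a in pts]
C = cohesion_from_distances(D)
M = mutual_cohesion(C)
depths = [sum(row) for row in C]  # local depths are the row sums of C
total = sum(depths)
```

In this sketch the local depths are recovered as row sums of `C`, and `M` holds the weights that would be drawn as undirected edges.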

As mentioned, though PaLD is formulated in terms of \(d\), the above definitions in (1), (2) and (3) depend only on relative closeness comparisons—e.g., whether z is closer to x than it is to y. Thus, as observed in [1, 8], an oracle for triplet comparisons is sufficient to determine the directed cohesion network. Note that previous work has suggested that one can often more reliably provide distance comparisons than exact numerical evaluations [2, 14].

As we will see in the remainder of the paper, due to its probabilistic formulation, PaLD is quite readily adapted to allow for uncertainty in dissimilarities.

3 Generalized PaLD

Whereas membership of a given z in the local focus \(U_{x,y}\) is assumed to be captured by an indicator in \(\{0, 1\}\) in (1), we will generalize the notion of “locality” to a pair (x, y), probabilistically. In a similar manner, support from Z can be formulated stochastically, to give generalized concepts of local depth as in (2) and cohesion as in (3).

Example 1

Before proceeding with formal definitions, for context, consider the simplistic generative process for triplet comparisons depicted in Fig. 3. Here, we assume that x, y and z are fixed, but distance comparisons are based on observed \(X^*\), \(Y^*\) and \(Z^*\) random in neighborhoods about x, y and z, respectively. We could be interested in uncertain events such as \(d(Z^*,X^*)<d(Y^*,X^*)\), say (see Fig. 3). Note that static comparisons such as \(d(z,x)<d(y,x)\) may not be fully informative, here.

Fig. 3

Conceptual generative process for random triplet comparisons

We now introduce the abstracted concepts of local relevance and support division.

3.1 Local Relevance and Support Division

We are interested in generalized definitions of local focus, local depth and cohesion, which reflect uncertainty in dissimilarities.

For fixed \(x,y\in S\), membership in the local focus \(U_{x,y}\) can be generalized as follows. For each \(x,y,z\in S\), define the local relevance of z to the pair (x, y), \(R_{x,y,z}\), as the probability that z is local to the pair (x, y), or more formally

$$\begin{aligned} R_{x,y,z}{\mathop {=}\limits ^{\textrm{def}}}P( z \in N (x,y) ), \end{aligned}$$
(4)

where \(\mathcal {N}{\mathop {=}\limits ^{\textrm{def}}}\{N(x,y):(x,y)\in S\times S\}\) is a random (pairwise) neighborhood structure on the set of pairs \(S\times S\). Note that we consider the elements in S here as fixed, with no required underlying sense of distance or position; stochasticity is provided through \(\mathcal {N}\) (akin to neighborhoods in random graphs). When convenient, we may also consider the full \(n \times n \times n\) array of probabilities, \(\varvec{R}{\mathop {=}\limits ^{\textrm{def}}}[R_{x,y,z}]\).

To obtain constructions for local depth and cohesion, here, we require a mechanism to sample an element, \(Z\in S\), local to (x, y). For this, we consider the process of selecting uniformly at random an element \({\tilde{Z}}\in S\), and with acceptance probability \(R_{x,y,{\tilde{Z}}}\) taking this as the value of Z, repeating the process until a Z is accepted. It is not difficult to see that, for \(z\in S\),

$$\begin{aligned} P\left( Z=z \right) = P_{x,y}\left( Z=z \right) = \frac{R_{x,y,z}}{\sum _{w\in S} R_{x,y,w}}. \end{aligned}$$
(5)
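One way to verify (5) is sketched here; the symbol \(\rho \) is introduced only for this derivation. Let \(\rho = \frac{1}{n}\sum _{w\in S} R_{x,y,w}\) denote the probability that a single proposal is accepted, and assume \(\rho >0\) (which holds, for instance, under assumption (d) of this section). Summing over the number k of rejected proposals preceding the accepted one gives

$$\begin{aligned} P\left( Z=z \right) = \sum _{k=0}^{\infty } (1-\rho )^k\, \frac{R_{x,y,z}}{n} = \frac{R_{x,y,z}}{n\rho } = \frac{R_{x,y,z}}{\sum _{w\in S} R_{x,y,w}}. \end{aligned}$$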

For fixed \(x,y,z\in S\), we may also consider the probability

$$\begin{aligned} Q_{x,y,z}{\mathop {=}\limits ^{\textrm{def}}}P( \mathcal {C}_z(\{x,y\})=x), \end{aligned}$$
(6)

where \(\mathcal {C}_z\) is a random choice function, defined on the set of two-element subsets of S, i.e., \(\mathcal {C}_z(A)\in A\). Here, \(Q_{x,y,z}\) reflects the support division for z with respect to the pair (x, y). Note that the choice mechanism defined through \(\{\mathcal {C}_z:z\in S\}\) need not be related, per se, to the neighborhood system \(\mathcal {N}\), emphasizing further that the points in S are not required to have fixed position in some underlying, say Euclidean, space. For convenience, we set \(\varvec{Q}{\mathop {=}\limits ^{\textrm{def}}}[Q_{x,y,z}]\). For general discussion of random choice, see, for instance, [15].

The local depth of x can then be given by

$$\begin{aligned} \ell (x){\mathop {=}\limits ^{\textrm{def}}}\ell _{S,\varvec{R},\varvec{Q}}(x) := P({\mathcal {C}_Z(\{x,Y\})=x}), \end{aligned}$$
(7)

where Y is selected uniformly from \(S\setminus \{x\}\), and Z is selected with relative weight as in (5). Likewise, the cohesion of w to x, \(C_{x,w}\), generalizes directly as in (3):

$$\begin{aligned} C_{x,w}{\mathop {=}\limits ^{\textrm{def}}}P( Z=w,\mathcal {C}_Z(\{x,Y\})=x). \end{aligned}$$
(8)

Note that the quantity \(C_{x,w}\) can be defined independently of \(\ell (x)\). We include (7), as the work here also generalizes the concept of local depth as defined in [1] (see (2)).

For examples of computing the arrays \(\varvec{R}\) and \(\varvec{Q}\), see Sect. 5.

We will assume, throughout, the following basic structural properties on the arrays \(\varvec{R}\) and \(\varvec{Q}\). Suppose \(x,y,z\in S\),

(a) \(0 \le R_{x,y,z},Q_{x,y,z} \le 1\),

(b) \(R_{x,y,z}=R_{y,x,z}\),

(c) \(Q_{x,y,z}=1-Q_{y,x,z}\),

(d) \(R_{x,y,x}=R_{x,y,y}=1\).

In (a), we are expressing the fact that the entries in \(\varvec{R}\) and \(\varvec{Q}\) represent probabilities; in (b), we have that local relevance does not depend on the ordering of x and y, (c) reflects the fact that Z supports either x or y (and there is no loss in probability) and (d) states that any individual is locally relevant to any pair in which it is an entry.

An algorithmic formalization of PaLD, generalized for uncertainty, then follows as in Algorithm 1. The implementation takes the specification of local relevance and support division (through \(\varvec{R}\) and \(\varvec{Q}\), respectively) as input, to output cohesion. Local depths can be obtained from the row sums of the output matrix, C.

Algorithm 1

Generalized partitioned local depth
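The computation can be sketched in Python as follows. This is written directly from (5) and (8), not transcribed from the authors' listing, and it assumes that the choice \(\mathcal {C}_z\) is made independently of the selection of Y and Z. The function name and the toy arrays `R` and `Q` (three elements, with relevance 1/2 and even support for the element outside each pair) are hypothetical.

```python
def generalized_cohesion(R, Q):
    """Cohesion C[x][w] as in (8), from arrays R[x][y][z] and Q[x][y][z].

    Uses P(Y = y) = 1/(n-1) for y != x and P(Z = w | Y = y) from (5).
    """
    n = len(R)
    C = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            if y == x:
                continue
            total_relevance = sum(R[x][y])  # at least 2 under assumption (d)
            for w in range(n):
                C[x][w] += (R[x][y][w] * Q[x][y][w]
                            / (total_relevance * (n - 1)))
    return C


# Hypothetical toy input satisfying assumptions (a)-(d).
n = 3
R = [[[1.0 if z in (x, y) else 0.5 for z in range(n)]
      for y in range(n)] for x in range(n)]
Q = [[[0.5 if x == y else (1.0 if z == x else (0.0 if z == y else 0.5))
       for z in range(n)] for y in range(n)] for x in range(n)]

C = generalized_cohesion(R, Q)
depths = [sum(row) for row in C]  # local depths as row sums of C
```

For this toy input each local depth equals 1/2, so the depths sum to n/2 = 1.5, in agreement with Theorem 4.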

Note that for a given distance function \(d:S\times S\rightarrow \mathbb {R}\), and \(U_{x,y}\) as in (1), setting

$$\begin{aligned} R_{x,y,z}= & {} {\left\{ \begin{array}{ll} 1, &{} \text {if } z\in U_{x,y} \\ 0, &{} \text {otherwise} \end{array}\right. }, \end{aligned}$$
(9)

and

$$\begin{aligned} Q_{x,y,z}= & {} {\left\{ \begin{array}{ll} 1, &{} \text {if } d(z, x) < d(z, y) \\ 1/2, &{} \text {if } d(z, x) = d(z, y)\\ 0, &{} \text {otherwise} \end{array}\right. }, \end{aligned}$$
(10)

the computation of cohesion in [1] is recovered.
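The indicator arrays in (9) and (10) can be built from a distance matrix as sketched below; the function name `indicator_arrays` and the toy data are assumptions for illustration. For a symmetric distance with zero diagonal, the resulting arrays satisfy assumptions (a)-(d) of Sect. 3.1.

```python
def indicator_arrays(D):
    """Build R as in (9) and Q as in (10) from a distance matrix D."""
    n = len(D)

    def r(x, y, z):
        # Membership of z in the local focus U_{x,y} of (1).
        return 1.0 if (D[z][x] <= D[y][x] or D[z][y] <= D[x][y]) else 0.0

    def q(x, y, z):
        # Support of z for x over y, with ties split evenly.
        if D[z][x] < D[z][y]:
            return 1.0
        if D[z][x] == D[z][y]:
            return 0.5
        return 0.0

    idx = range(n)
    R = [[[r(x, y, z) for z in idx] for y in idx] for x in idx]
    Q = [[[q(x, y, z) for z in idx] for y in idx] for x in idx]
    return R, Q


# Hypothetical one-dimensional data with absolute difference as distance.
pts = [0.0, 1.0, 1.5, 4.0]
D = [[abs(a - b) for b in pts] for a in pts]
R, Q = indicator_arrays(D)
```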

Before turning to some applications, we summarize some theoretical results, which generalize and shed light on those given in [1].

4 Results

In this section, we provide results regarding properties of cohesion, mirroring those in [1], including (a) dissipation of cohesion under separation, (b) irrelevance of density under separation and (c) dissipation of cohesion for concentrated sets of increasing size, in the context of uncertainty; proofs can be found in Appendix A. Throughout, unless stated otherwise, we will assume that the arrays \(\varvec{R}\) and \(\varvec{Q}\) are fixed, and satisfy the basic assumptions (a)–(d), listed in Sect. 3.1. In addition, \(x\in S\) is fixed, Y is selected uniformly at random from \(S\setminus \{x\}\) and Z is selected as in Eq. (5).

We begin with three definitions regarding structural properties of the set S with respect to the arrays \(\varvec{R}\) and \(\varvec{Q}\). The first provides conditions under which two disjoint subsets, A and B, of S are sufficiently separated. In essence, for \(c,c^*\in A\) and \(d \in B\), \(c^*\) is local to the pair (c, d) and fully supports c in that context, while d is not local to the pair \((c,c^*)\).

Definition

(Sufficiently Separated) Suppose \(A,B\subseteq S\). The set A is said to be sufficiently separated from B (with respect to \(\varvec{R}\) and \(\varvec{Q}\)) if \(A\cap B =\emptyset \), and for all \(c,c^*\in A\) and \(d\in B\), the following hold:

     (a) \(R_{c,d,c^*}=1\),    (b) \(R_{c,c^*,d}=0\),     (c) \(Q_{c,d,c^*}=1\).

The sets A and B are said to be (mutually) sufficiently separated if A is sufficiently separated from B, and B is sufficiently separated from A.

The second definition is crucial to stating Theorem 2, and addresses equivalence of ordinal structure for two subsets of S of equal cardinality.

Definition

(Equivalence of Ordinal Structure) Suppose two sets A, B satisfy \(A=\{a_1,a_2,\dots ,a_m\}\), and \(B=\{b_1,b_2,\dots ,b_m\}\), then A and B are said to have equivalent ordinal structure, if they are \((\varvec{R},\varvec{Q})\)-equivalent, i.e., for \(i,j,k\in \{1,2,\dots ,m\}\),

$$\begin{aligned} R_{a_i,a_j,a_k}=R_{b_i,b_j,b_k} \text{ and } Q_{a_i,a_j,a_k}=Q_{b_i,b_j,b_k}. \end{aligned}$$
(11)

Finally, the following definition suggests a point-like property of one subset, \(B \subseteq S\), with respect to another, A. In particular if locality to any given pair of elements of A is constant over the set B, and all elements of B fully support other elements of B in comparisons with elements of A, then B is concentrated with respect to A.

Definition

(Concentrated) Suppose \(A,B\subseteq S\), then B is said to be concentrated with respect to A (for given \(\varvec{R}\) and \(\varvec{Q}\)), if there exists a function \(f: A \times A \rightarrow [0,1]\), such that

$$\begin{aligned} R_{a,a^*,b} = f(a,a^*)~~~~ \text{ and } ~~~~ Q_{a,b,b^*} = 0, \end{aligned}$$
(12)

for \(a,a^* \in A\) and \(b,b^* \in B\).

We have the following results regarding properties of cohesion. Proofs are provided in Appendix A.

Theorem 1

(Dissipation of cohesion under separation) Suppose \(\varvec{R}\) and \(\varvec{Q}\) are fixed, S is a disjoint union of A and B, and A and B are sufficiently separated with respect to \(\varvec{R}\) and \(\varvec{Q}\), then the between-set cohesion values are zero, i.e., \(C_{a,b}\) = \(C_{b,a}=0\) for \(a \in A\) and \(b \in B\).

Theorem 2

(Irrelevance of density under separation) Suppose \(A=\{a_1,a_2,\dots ,a_m\}\) and \(A'=\{a'_1,a'_2,\dots ,a'_m\}\) have equivalent ordinal structure and \(S=A\cup B\) (resp. \(S'=A'\cup B\)), for some set B, where A and B (resp. \(A'\) and B) are sufficiently separated. Then for any \(1 \le i,j \le m\), \(C_{a_i,a_j} = C_{a'_i,a'_j}\), i.e., the corresponding (within-set) pairwise cohesion values are equal.

Theorem 3

(Dissipation of cohesion for concentrated sets of increasing size) Suppose S is a disjoint union of A and B, and B is sufficiently separated from, and concentrated with respect to A. Then, for \(a \in A\) and \(b \in B\), the cohesion of b to a is bounded above by \((|A|/ |B|) (1/n)\).

The next result follows from the probabilistic definition of local depth along with the assumptions (c) and (d), from Sect. 3.1, namely

$$\begin{aligned} Q_{x,y,z}=1-Q_{y,x,z} ~~~~ \text{ and } ~~~~ R_{x,y,x}=R_{x,y,y}=1. \end{aligned}$$

Here, the first assumption provides conservation of probability and the second guarantees proper selection of Z.

Theorem 4

(Conservation of Cohesion) We have

$$\begin{aligned} \frac{n}{2}=\sum _{x\in S} \ell _{S,\varvec{R},\varvec{Q}}(x) =\sum _{x,w\in S}C_{x,w}. \end{aligned}$$
(13)
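A brief sketch of why (13) holds is as follows; the complete argument is in Appendix A and may be organized differently. Write \(p_{x,y}(w) = R_{x,y,w}/\sum _{v\in S} R_{x,y,v}\), a shorthand used only here, and assume that the choice \(\mathcal {C}_z\) is made independently of the selection of Y and Z, so that \(C_{x,w} = \frac{1}{n-1}\sum _{y\ne x} p_{x,y}(w)\, Q_{x,y,w}\). By assumption (b), \(p_{x,y}=p_{y,x}\), and by (c), \(Q_{x,y,w}+Q_{y,x,w}=1\). Grouping the ordered pairs (x, y) and (y, x) then gives

$$\begin{aligned} \sum _{x,w\in S}C_{x,w} = \frac{1}{n-1}\sum _{\{x,y\}:\, x\ne y}\ \sum _{w\in S} p_{x,y}(w)\left( Q_{x,y,w}+Q_{y,x,w}\right) = \frac{1}{n-1}\left( {\begin{array}{c}n\\ 2\end{array}}\right) = \frac{n}{2}, \end{aligned}$$

where assumption (d) ensures that each \(p_{x,y}\) is a well-defined probability distribution on S.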

Finally, in [1], a threshold distinguishing strong from weak cohesion is provided. In particular, define

$$\begin{aligned} T_{S,d}{\mathop {=}\limits ^{\textrm{def}}}P(Z=W,d(Z,X)< d(Z,Y))=\frac{1}{2n}\sum _{x\in S}C_{x,x}, \end{aligned}$$
(14)

where X, Y, Z and W are selected uniformly at random from S, \(S\setminus \{X\}\), \(U_{X,Y}\) and \(U_{X,Y}\), respectively. For the generalization provided here, the analogue of the final equality in (14) no longer necessarily holds, but we do have the following.

Theorem 5

Set \(T{\mathop {=}\limits ^{\textrm{def}}}T_{S,\varvec{R},\varvec{Q}}=P(Z=W, \mathcal {C}_Z(\{X,Y\})=X)\). Then,

$$\begin{aligned} T\le \frac{1}{2n}\sum _{x\in S}C_{x,x}. \end{aligned}$$
(15)

Key to the proof of Theorem 5 is the fact that here, in place of \(P(Z=W)= P(Z=X)\) as available in [1], we only have \(P(Z=W)\le P(Z=X)\) due to selection being dependent on the local relevance array, \(\varvec{R}\). One would have equality in the case of a \((0,1)-\varvec{R}\) array, even in the presence of flexibility allowed for support division.

We now turn to discussion of some potential applications.

5 Applications

In this section, we consider applications of the concepts of local relevance and support division in revealing community structure in complex data. Results follow upon determination of the arrays \(\varvec{R}\) and \(\varvec{Q}\). Importantly, the foundational framework from [1] carries over (see the results in Sect. 4). Note at the outset that the perspective on community structure developed in [1] (and extended here) is quite distinct from clustering. See Discussion and Conclusions in [1] for further details on implications of the underlying PaLD perspective in this context.

5.1 Combining Multiple Dissimilarity Measures

Our ability to reason directly from local relevance and support division allows for flexibility to combine multiple, possibly conflicting, dissimilarity measures. Instead of linearly combining such measures to form one, say, we can proceed probabilistically.

Example 2

Recall the cultural values data considered earlier in Fig. 1. Distances for politically-related questions are provided at [16] for the dimensions of Politics, Democracy, Egalitarianism, Conservatism, Neoliberalism, Authoritarianism, Libertarianism, Change, and Social. We will focus, here, on the subset of 25 European countries (out of 27) for which there is complete data for these dimensions. Define the respective resulting distance matrices as \(\varvec{D}_1,\varvec{D}_2,\dots ,\varvec{D}_9\).

Consider two potential methods for combining the pairwise distance information for the countries, to obtain cohesion networks. In one, we could obtain a single distance matrix, via simple linear weighting, i.e., for a nonnegative weight vector, \(\varvec{w}=(w_1,w_2,\dots ,w_9)\), satisfying \(\sum _i w_i=1\),

$$\begin{aligned} \varvec{D}^*_{\varvec{w}}{\mathop {=}\limits ^{\textrm{def}}}w_1\varvec{D}_1+w_2\varvec{D}_2+\cdots +w_9\varvec{D}_9, \end{aligned}$$
(16)

and proceed with PaLD, as in [1]. Alternatively, we could obtain respective arrays \(\varvec{R}_1,\varvec{R}_2,\dots ,\varvec{R}_9\) and \(\varvec{Q}_1,\varvec{Q}_2,\dots ,\varvec{Q}_9\), (as in (9) and (10)), and weight these to give

$$\begin{aligned} \varvec{R}^*_{\varvec{w}}{\mathop {=}\limits ^{\textrm{def}}}w_1\varvec{R}_1+w_2\varvec{R}_2+\cdots +w_9\varvec{R}_9, \end{aligned}$$
(17)

and

$$\begin{aligned} \varvec{Q}^*_{\varvec{w}}{\mathop {=}\limits ^{\textrm{def}}}w_1\varvec{Q}_1+w_2\varvec{Q}_2+\cdots +w_9\varvec{Q}_9. \end{aligned}$$
(18)

Note that the (i, j, k)-entry in the array \(\varvec{R}^*_{\varvec{w}}\) can be viewed as

$$\begin{aligned} (\varvec{R}_{\varvec{w}}^*)_{i,j,k}=w_1(\varvec{R}_1)_{i,j,k}+w_2(\varvec{R}_2)_{i,j,k}+\cdots +w_9(\varvec{R}_9)_{i,j,k}, \end{aligned}$$
(19)

expressing the fact that \((\varvec{R}_{\varvec{w}}^*)_{i,j,k}\), the (i, j, k)-entry in the array \(\varvec{R}^*_{\varvec{w}}\), is the probability that when a dimension, \(\delta \), is selected according to the distribution \(\varvec{w}\), we have \(a_k \in N(a_i,a_j)\) under that respective distance. Since relations are considered solely under individual distance matrices, this allows dimensions of differing scales and data types to be readily considered.
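The entrywise weighting in (17), (18) and (19) might be sketched as below. The helper name `mix_arrays` and the two small support-division arrays are hypothetical; the example uses two elements and two dimensions only to keep the arithmetic visible.

```python
def mix_arrays(arrays, weights):
    """Entrywise convex combination of n x n x n arrays, as in (17)-(19)."""
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-12:
        raise ValueError("weights must be nonnegative and sum to one")
    n = len(arrays[0])
    return [[[sum(w * A[i][j][k] for w, A in zip(weights, arrays))
              for k in range(n)] for j in range(n)] for i in range(n)]


# Hypothetical support-division arrays for two elements.
# Under the first dimension each element fully supports itself;
# under the second every comparison is a tie.
Q1 = [[[0.5, 0.5], [1.0, 0.0]],
      [[0.0, 1.0], [0.5, 0.5]]]
Q2 = [[[0.5, 0.5], [0.5, 0.5]],
      [[0.5, 0.5], [0.5, 0.5]]]

Q_mix = mix_arrays([Q1, Q2], [0.25, 0.75])
# Q_mix[0][1][0] = 0.25 * 1.0 + 0.75 * 0.5 = 0.625
```

Because the combination is convex and entrywise, assumptions (a)-(d) of Sect. 3.1 are inherited from the individual arrays; for instance, `Q_mix[0][1][0] + Q_mix[1][0][0]` equals one.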

For illustration, Fig. 4 contains a display of the cohesion networks resulting from weight vectors where the relative weight on \(\varvec{D}_9\) (for the Social dimension, with equal weights for others), increases through the values 0, 0.5, 1.0, 2.0, 10, 100.

Fig. 4

Cohesion networks based on \(\varvec{D}^*_{\varvec{w}}\) as the relative weight on the Social dimension increases through the values 0, 0.5, 1.0, 2.0, 10, 100. Ties above the threshold in (14) are displayed. The layout for each plot is that based on the Social dimension in isolation

Figure 5 contains a display of the cohesion networks resulting from weight vectors where the relative weights on \(\varvec{R}_9\) and \(\varvec{Q}_9\) increase through the same values 0, 0.5, 1.0, 2.0, 10, 100. Note some potential added stability in the cohesion network, as the weight on the Social dimension increases.

Fig. 5

Cohesion networks based on \(\varvec{R}^*_{\varvec{w}}\) and \(\varvec{Q}^*_{\varvec{w}}\) as the relative weight on the Social dimension increases through the values 0, 0.5, 1.0, 2.0, 10, 100. Ties above the threshold in (15) are displayed. The layout for each plot is that based on the Social dimension in isolation

Typically, one may choose to use uninformative uniform weights \(\{w_i\}\), in obtaining \(\varvec{R}^*\) and \(\varvec{Q}^*\). Further consideration of combining measures, and of weight selection, is work in progress. For discussion of combining dissimilarity measures from mixed-type data in the context of clustering, see, for instance, [17], and the references therein. Note that, as mentioned, the probabilistic framework here maintains the properties of PaLD, and avoids the need to consider standardization choices within dimensions.

5.2 Event-Based Data

Another potential application of the concepts of local relevance and support division is to similarity determined by multiple events. For instance, consider a set S of individuals, where for each pair \((x,y)\in S\times S\), we have a set of dissimilarities \(A_{x,y}\), each with nonzero cardinality \(n_{x,y}{\mathop {=}\limits ^{\textrm{def}}}|A_{x,y}|\). Note that it is not necessary that \(n_{x,y}\) be constant over pairs (x, y). There are several ways in which such similarities might arise. Consider the following example.

Example 3

Suppose we have competing entities for which multiple events determine pairwise distance, e.g. firms in different markets or competitors in an athletic context. For fixed \(x,y,z\in S\), values for \(Q_{x,y,z}\) and \(R_{x,y,z}\) can be determined as probabilities through random (potentially weighted) selections from \(A_{x,y}\), \(A_{y,z}\) and \(A_{x,z}\). For concreteness, in Fig. 6, we consider a cohesion network based on pairwise similarities determined by competitiveness in games played between teams during the 2021–2022 season of the National Basketball Association (NBA). Here, dissimilarity in a particular event (game) was determined as the proportion of (absolute) point differential to overall game point total. For instance, a score of 110-90 would result in a non-competitiveness score of \(|110-90|/(110+90)=0.10\). Note that, in this case, the values of \(\{n_{x,y}\}\) vary between 2 and 4. The edges corresponding to strong pairwise cohesions above the threshold bound in (15) are displayed, in Fig. 6. Note that the figure shows a general gradient from weaker teams at the top right to stronger teams at the bottom left. The largest cohesion is between the Dallas Mavericks and Brooklyn Nets, while the lowest is between the Phoenix Suns and the Charlotte Hornets. Some weaker teams display relative competitiveness with stronger teams head-to-head, such as the Detroit Pistons with the Denver Nuggets (two games with proportional point differentials of 6/(117 + 111) and 5/(105 + 110)).

Fig. 6

The cohesion network for the 2021–2022 NBA basketball season based on proportional point differentials. Shading of nodes is according to mean proportional point differential; the highest is for the Phoenix Suns (0.034122; red) and lowest is for the Portland Trail Blazers (−0.04019; yellow). Edge width is proportional to mutual cohesion. Note that team names have been abbreviated for display.
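The per-game dissimilarity of Example 3, together with one possible way to turn event sets into a support-division entry, is sketched below. The rule used for \(Q_{x,y,z}\) (one dissimilarity drawn uniformly and independently from each of \(A_{z,x}\) and \(A_{z,y}\), with ties split evenly) is an assumption for illustration; the text specifies only that random, potentially weighted, selections are used. The function names and the numerical event lists are hypothetical.

```python
from itertools import product


def non_competitiveness(score_a, score_b):
    """Absolute point differential as a proportion of the game point total."""
    return abs(score_a - score_b) / (score_a + score_b)


def support_division(A_zx, A_zy):
    """Estimate Q_{x,y,z} as P(D_zx < D_zy) + 0.5 * P(D_zx = D_zy).

    D_zx and D_zy are drawn uniformly and independently from the event
    lists A_zx and A_zy (an assumed selection rule).
    """
    pairs = list(product(A_zx, A_zy))
    wins = sum(1.0 if a < b else (0.5 if a == b else 0.0) for a, b in pairs)
    return wins / len(pairs)


# The worked score from the text: 110-90 gives 20/200 = 0.10.
example_score = non_competitiveness(110, 90)

# Hypothetical event lists with differing numbers of games.
A_zx = [0.10, 0.30, 0.05]
A_zy = [0.20, 0.10]
q_xyz = support_division(A_zx, A_zy)  # 3.5 / 6
q_yxz = support_division(A_zy, A_zx)  # 2.5 / 6
```

Under this selection rule, `q_xyz + q_yxz` equals one, so assumption (c) of Sect. 3.1 holds by construction.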

Further applications could include any instances where event results determine distances. Similar ideas could also be used, when the events are drawn from sampling pairs of entities (and measuring dissimilarities) over time.

The final application included here is a line of potential further work, with applicability in the context of addressing discrete jumps in cohesion, adapting to cases where there are known levels of data precision, and considerations of structural persistence.

5.3 Data Uncertainty

If we have information regarding data uncertainty, then, for fixed \(x,y,z\in S\), it is possible to adjust \(R_{x,y,z}\) and \(Q_{x,y,z}\) from indicators, as in (9) and (10), directly to probabilities. That is, \(R_{x,y,z}\) could reflect the probability of membership of z in the local focus of (x, y) and \(Q_{x,y,z}\), the probability of z being closer to x than to y. More generally, adjustment for various sources of uncertainty becomes possible and has the potential advantage of making cohesion continuous in the data.

Example 4

If we assume a sufficiently simple model, exact calculations of \(\varvec{R}\) and \(\varvec{Q}\) are relatively straightforward. Suppose that \(\epsilon >0\) is fixed and each \(a\in S\subseteq \mathbb {R}\) has random associated value \(A^*\in S^*\subseteq \mathbb {R}\), uniformly distributed in an \(\epsilon \)-ball centered at a. Here, \(S^*\) is the set of associated values. We can then compute the arrays \(\varvec{R}\) and \(\varvec{Q}\), in terms of corresponding entries in the set \(S^*\). If \(\epsilon \) (under uniformity) accurately reflects measurement uncertainty, then cohesion can be more faithfully modeled.

In this scenario, cohesion can be seen to be stable with respect to small changes in the data. Rather than having discrete jumps, due to discontinuities in (9) and (10), a positive \(\epsilon \) (e.g., reflecting the precision used to store the data) makes cohesion a continuous function of S. For instance, for \(x,y,z\in S\), with \(x<y<z\), if z is sufficiently close to y (relative to \(\epsilon \)), then z is in the local focus with probability one, i.e., \(R_{x,y,z}=1\). However, if z is gradually increased (moving farther from y), \(R_{x,y,z}\) transitions to the value zero.
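The transition of \(R_{x,y,z}\) from one to zero can be illustrated numerically. The sketch below uses a Monte Carlo estimate in place of the exact calculation mentioned in Example 4, and assumes that the three perturbed values are drawn independently; the function name, sample size and seed are arbitrary choices.

```python
import random


def relevance_mc(x, y, z, eps, n_draws=20000, seed=0):
    """Monte Carlo estimate of R_{x,y,z} on the real line.

    Each point is replaced by an independent draw, uniform on the
    interval of radius eps around it, and membership in the local
    focus (1) is checked with absolute difference as the distance.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_draws):
        xs = rng.uniform(x - eps, x + eps)
        ys = rng.uniform(y - eps, y + eps)
        zs = rng.uniform(z - eps, z + eps)
        if abs(zs - xs) <= abs(ys - xs) or abs(zs - ys) <= abs(xs - ys):
            hits += 1
    return hits / n_draws


# With x = 0 and y = 1, the boundary of the local focus on the right is z = 2.
r_near = relevance_mc(0.0, 1.0, 1.05, eps=0.01)  # z close to y
r_edge = relevance_mc(0.0, 1.0, 2.0, eps=0.01)   # z on the boundary
r_far = relevance_mc(0.0, 1.0, 3.0, eps=0.01)    # z far from both
```

In this sketch `r_near` is exactly one and `r_far` exactly zero, since the perturbations are too small to change either comparison, while `r_edge` lies near one half by symmetry.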

Finally, we can consider how cohesion varies as \(\epsilon \) increases, in a manner similar to persistent homology [18]. Consideration of higher-dimensional scenarios and more complex settings is currently work in progress. For consideration of clustering in the context of uncertain data, see, for instance, [19, 20].

6 Conclusion

The generalization of partitioned local depth, developed here, enhances PaLD’s theoretical underpinnings and broadens the potential application of cohesion to complex data for which there may be uncertain, variable or conflicting information.

Two key probabilistic concepts, local relevance and support division, are introduced leading to an extended probabilistic framework for revealing communities in data.

Base properties of the resulting cohesion values have been proven and initial potential applications in the contexts of multiple dissimilarity measures, event-based data and data uncertainty are discussed. Several questions remain, as suggested throughout the manuscript. We have provided examples of applications in Sect. 5, but general determination of arrays \(\varvec{R}\) and \(\varvec{Q}\) (and their impact for representative community structure) is important for future work. It is hoped that the present work may lead to further consideration of communities in data.