Abstract
We employ random geometric digraphs to construct semiparametric classifiers. These data-random digraphs belong to parameterized random digraph families called proximity catch digraphs (PCDs). A related geometric digraph family, class cover catch digraph (CCCD), has been used to solve the class cover problem by using its approximate minimum dominating set, and has shown relatively good performance in the classification of imbalanced data sets. Although CCCDs have a convenient construction in \({\mathbb {R}}^d\), finding their minimum dominating sets is NP-hard and their probabilistic behaviour is not mathematically tractable except for \(d=1\). On the other hand, a particular family of PCDs, called proportional-edge PCDs (PE-PCDs), has mathematically tractable minimum dominating sets in \({\mathbb {R}}^d\); however, their construction in higher dimensions may be computationally demanding. More specifically, we show that the classifiers based on PE-PCDs are prototype-based classifiers such that the exact minimum number of prototypes (equivalent to minimum dominating sets) is found in time polynomial in the number of observations. We construct two types of classifiers based on PE-PCDs. One is a family of hybrid classifiers that depends on the location of the points of the training data set, and the other is a family of classifiers based solely on class covers. We assess the classification performance of our PE-PCD based classifiers by extensive Monte Carlo simulations, and compare them with that of other commonly used classifiers. We also show that, similar to CCCD classifiers, our classifiers tend to be robust to class imbalance in classification as well.
Introduction
Classification methods based on set covering algorithms have received considerable attention recently because of their use in prototype selection (Bien and Tibshirani 2011; Cannon and Cowen 2004; Angiulli 2012). Prototypes are selected members of a data set chosen so as to attain various tasks, including reducing, condensing or summarizing the data set. Many learning methods aim to carry out more than one of these tasks, thereby building efficient learning algorithms (Pȩkalska et al. 2006; Bien and Tibshirani 2011). A desirable prototype set reduces the data set in order to decrease running time, condenses the data set to preserve information, and summarizes the data set for better exploration and understanding. The methods we discuss in this work are considered decision boundary generators, where decisions are made based on class conditional regions, or class covers, that are composed of a collection of convex sets, each associated with a prototype (Toussaint 2002). The union of such convex sets constitutes a region for the class of interest, estimating the support of this class (Schölkopf et al. 2001). Support estimates have uses in both supervised and unsupervised learning schemes, offering solutions to many problems in machine learning (Marchette 2004). We propose supervised learning methods, or classifiers, based on these estimates of the supports, constructed with a random geometric digraph family called proximity catch digraphs.
Proximity Catch Digraphs (PCDs) are closely related to Class Cover Catch Digraphs (CCCDs) introduced by Priebe et al. (2001), and are vertex-random digraphs defined by the relationship between class-labeled observations. Priebe et al. (2001) introduced CCCDs to find graph theoretic solutions to the Class Cover Problem (CCP), and provided some results on the minimum dominating sets and the distribution of the domination number of such digraphs for one-dimensional data. The goal of the CCP is to find a set of hyperspheres (usually Euclidean balls) such that their union encapsulates, or covers, (a subset of) the training data set associated with a particular class, called the target class (Cannon and Cowen 2004). In addition, Priebe et al. (2003a) showed that approximate dominating sets of CCCDs, obtained by a greedy algorithm, can be used to establish efficient semiparametric classifiers. Moreover, DeVinney et al. (2002) defined random walk CCCDs (RW-CCCDs), where the balls of class covers are defined in a relaxed manner compared to the previously introduced CCCDs so as to avoid overfitting. These digraphs have been used, e.g., in face detection (Eveland et al. 2005) and in latent class discovery for gene expression data (Priebe et al. 2003b). CCCDs also show robustness to class imbalance in data sets (Manukyan and Ceyhan 2016). Class imbalance often occurs in real data sets; that is, some classes of a data set have a large number of members whereas the remaining classes have only a few, resulting in a bias towards the majority class (the class with the abundant number of members) which drastically decreases classification performance.
Class covers with Euclidean balls have been extended to allow the use of different types of regions to cover a class of interest. Serafini (2014) uses sets of boxes to find a cover of classes, and also defines the maximum redundancy problem; this is an optimization problem of covering as many points as possible by each box while the total number of boxes is kept to an (approximate) minimum. Hammer et al. (2004) investigate the CCP using boxes, with applications to logical data analysis. Moreover, Bereg et al. (2012) extend covering boxes to rectilinear polygons to cover classes, and they report on the complexity of the CCP algorithms using such polygonal covering regions. Takigawa et al. (2009) incorporate balls and establish classifiers similar to the ones based on CCCDs, and they also use sets of convex hulls. Ceyhan (2005) uses sets of triangles relative to the tessellation of the opposite class to analytically compute the minimum number of triangles required to establish a class cover. In this work, we study class covers with particular triangular regions (simplicial regions in higher dimensions).
CCCDs can be generalized using proximity maps (Jaromczyk and Toussaint 1992). Ceyhan (2005) defined PCDs, introduced three families of PCDs and investigated the distribution of the domination number of such digraphs in a two-class setting. The domination number, and another graph invariant called the arc density (the ratio of the number of arcs in a digraph to the total number of possible arcs), of these PCDs have been used for testing spatial patterns of segregation and association (Ceyhan and Priebe 2005; Ceyhan et al. 2006, 2007). In this article, we employ PCDs in statistical classification and investigate their performance. The PCDs of concern in this work are based on a particular family of proximity maps called proportional-edge (PE) proximity maps. The corresponding PCDs are called PE-PCDs, and are defined for target class (i.e. the class of interest) points inside the convex hull of non-target class points (Ceyhan 2005). However, this construction ignores the target class points outside the convex hull of the non-target class. We mitigate this shortcoming by partitioning the region outside of the convex hull into unbounded regions, called outer simplices, which may be viewed as extensions of outer intervals in \({\mathbb {R}}\) (i.e. intervals with infinite endpoints) to higher dimensions. We obtain proximity regions in these outer simplices by extending PE proximity maps to outer simplices. We establish two types of classifiers based on PE-PCDs, namely, hybrid and cover classifiers. The first type incorporates the PE-PCD covers of only the points in the convex hull and uses other classifiers for points outside the convex hull of the non-target class, hence the name hybrid classifier; the second type is based on two class cover models, where the first is a mixture of PE-PCDs and CCCDs (composite covers) and the second is purely based on PE-PCDs (standard covers).
One common property of most class covering (or set covering) methods is that none of the algorithms finds the exact minimum number of covering sets in polynomial time, and solutions are mostly provided by approximation algorithms (Vazirani 2001). However, for PE-PCDs, the exact minimum number of covering sets (equivalent to prototype sets) can be found much faster; that is, the exact minimum solution is found in a running time polynomial in the size of the data set but exponential in the dimensionality. PE-PCDs have computationally tractable (exact) minimum dominating sets in \({\mathbb {R}}^d\) (Ceyhan 2010). Since the complexity of class covers based on this family of proximity maps increases exponentially with dimensionality, we apply dimension reduction methods (such as principal components analysis) to substantially reduce the number of features and thus the dimensionality. Hence, based on the transformed data sets in the reduced dimensions, the PE-PCD based hybrid classifiers and, in particular, cover classifiers become more appealing in terms of both prototype selection and classification performance (in the reduced dimension). We use simulated and real data sets to show that these two types of classifiers based on PE-PCDs have either comparable or slightly better classification performance than other classifiers when the data sets exhibit the class imbalance problem.
The article is organized as follows: in Sect. 2, we introduce some auxiliary tools for defining PCDs, and in particular, in Sect. 3, we describe PE proximity regions and PE-PCDs. In Sect. 4, we introduce two types of class cover models that are called composite and standard covers. In Sect. 5, we establish two types of statistical classifiers based on PE-PCDs, which are called hybrid and cover PE-PCD classifiers. The latter type is defined for both class cover models described in Sect. 4. In Sect. 6, we assess the performance of PE-PCD classifiers and compare them with existing methods (such as \(k\hbox {NN}\) and support vector machine classifiers) on simulated data sets. Finally, in Sect. 7, we assess our classifiers on real data sets, and in Sect. 8, we present discussion and conclusions as well as some future research directions.
Tessellations in \({\mathbb {R}}^d\) and auxiliary tools
In this section, we introduce the tools required for constructing PE-PCD classifiers. Let \((\varOmega ,{\mathcal {M}})\) be a measurable space, and let the training data set be composed of two non-empty sets, \({\mathcal {X}}_0\) and \({\mathcal {X}}_1\), with sample sizes \(n_0:=|{\mathcal {X}}_0|\) and \(n_1:=|{\mathcal {X}}_1|\) from classes 0 and 1, respectively. Also let \({\mathcal {X}}_0\) and \({\mathcal {X}}_1\) be sets of \(\varOmega\)-valued random variables with class conditional distributions \(F_0\) and \(F_1\), with supports \(s(F_0)\) and \(s(F_1)\), respectively. We develop rules to define proximity maps and regions for the points from the class of interest, i.e. the target class, which is class j, with respect to the Delaunay tessellation of the points from the class of non-interest, i.e. the non-target class, which is class \(1-j\), for \(j=0,1\).
A tessellation in \({\mathbb {R}}^d\) is a collection of non-intersecting (or intersecting only on boundaries) convex \(d\)-polytopes such that their union covers a region. We partition \({\mathbb {R}}^d\) into non-intersecting \(d\)-simplices and \(d\)-polytopes to construct PE-PCDs that tend to have multiple disconnected components. We show that such a partitioning of the domain provides digraphs with computationally tractable minimum dominating sets. In addition, we use the barycentric coordinate system to characterize the target class points with respect to the Delaunay tessellation of the non-target class. Such a coordinate system simplifies the definitions of many tools associated with PE-PCD classifiers in \({\mathbb {R}}^d\), including minimum dominating sets of PE-PCDs and convex distance functions, which are defined in Sect. 5.1.
Delaunay tessellation of \({\mathbb {R}}^d\)
The convex hull of the non-target class points, \(C_H({\mathcal {X}}_{1-j})\), can be partitioned into Delaunay cells through the Delaunay tessellation of \({\mathcal {X}}_{1-j} \subset {\mathbb {R}}^d\). For \(d=1\), the Delaunay tessellation is an intervalization (i.e. a partition into intervals) of the convex hull of \({\mathcal {X}}_{1-j}\), which is \(C_H({\mathcal {X}}_{1-j})=\left( \min ({\mathcal {X}}_{1-j}),\max ({\mathcal {X}}_{1-j})\right)\), where the middle intervals (i.e. intervals in the convex hull) are based on the order statistics of \({\mathcal {X}}_{1-j}\), and the end intervals are \(\left( -\infty ,\min ({\mathcal {X}}_{1-j})\right)\) and \(\left( \max ({\mathcal {X}}_{1-j}),\infty \right)\). For \(d=2\), the Delaunay tessellation becomes a triangulation which partitions \(C_H({\mathcal {X}}_{1-j})\) into non-intersecting triangles. For points in general position, the triangles in the Delaunay triangulation satisfy the property that the circumcircle of a triangle contains no points of \({\mathcal {X}}_{1-j}\) except for the vertices of the triangle. In higher dimensions, Delaunay cells are \(d\)-simplices (for example, tetrahedra in \({\mathbb {R}}^3\)). Hence, \(C_H({\mathcal {X}}_{1-j})\) is the union of a set of disjoint \(d\)-simplices \(\{{\mathcal {S}}_k\}_{k=1}^K\), where K is the number of \(d\)-simplices, or Delaunay cells. Each \(d\)-simplex has \(d+1\) non-coplanar vertices such that none of the remaining points of \({\mathcal {X}}_{1-j}\) are in the interior of the circumsphere of the simplex (and the vertices of the simplex are points of \({\mathcal {X}}_{1-j}\)). Hence, the simplices of a Delaunay tessellation tend to be acute (simplices with no substantially small inner angles). Note that the Delaunay tessellation is the dual of the Voronoi diagram of the set \({\mathcal {X}}_{1-j}\). A Voronoi diagram is a partitioning of \({\mathbb {R}}^d\) into convex polytopes such that the points inside each polytope are closer to the point associated with the polytope than to any other point in \({\mathcal {X}}_{1-j}\).
Hence, the polytope \(V({\mathsf {y}})\) associated with a point \({\mathsf {y}}\in {\mathcal {X}}_{1-j}\) is defined as
$$V({\mathsf {y}}) = \left\{ x \in {\mathbb {R}}^d : \Vert x-{\mathsf {y}}\Vert \le \Vert x-{\mathsf {z}}\Vert \text{ for all } {\mathsf {z}}\in {\mathcal {X}}_{1-j} \right\}.$$
Here, \(\Vert \cdot \Vert\) stands for the usual Euclidean norm. Observe that the Voronoi diagram is unique for a fixed set of points. A Delaunay graph is constructed by joining the pairs of points in \({\mathcal {X}}_{1-j}\) whose Voronoi polytopes share a boundary. The edges of the Delaunay graph constitute a partitioning of \(C_H({\mathcal {X}}_{1-j})\), hence the Delaunay tessellation. By the uniqueness of the Voronoi diagram, the Delaunay tessellation is also unique (except for cases where \(d+1\) or more points lie on the same hypersphere).
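As a brief computational aside (our illustration, not part of the original development), the Delaunay cells of a planar point set can be computed with off-the-shelf tools; the sketch below uses `scipy.spatial.Delaunay` and numerically checks the empty-circumcircle property described above. The helper name `circumcircle` is ours.

```python
import numpy as np
from scipy.spatial import Delaunay

rng = np.random.default_rng(1)
pts = rng.random((10, 2))          # a sample playing the role of X_{1-j} in R^2
tri = Delaunay(pts)                # Delaunay triangulation of C_H(X_{1-j})

# Each row of tri.simplices indexes the d+1 = 3 vertices of one Delaunay cell.
K = len(tri.simplices)

def circumcircle(a, b, c):
    """Center and radius of the circle through three points in R^2."""
    ax, ay = a; bx, by = b; cx, cy = c
    d = 2 * (ax*(by - cy) + bx*(cy - ay) + cx*(ay - by))
    ux = ((ax**2 + ay**2)*(by - cy) + (bx**2 + by**2)*(cy - ay) + (cx**2 + cy**2)*(ay - by)) / d
    uy = ((ax**2 + ay**2)*(cx - bx) + (bx**2 + by**2)*(ax - cx) + (cx**2 + cy**2)*(bx - ax)) / d
    center = np.array([ux, uy])
    return center, np.linalg.norm(a - center)

# Empty-circumcircle property: no input point lies strictly inside the
# circumcircle of any cell (up to numerical tolerance).
for simplex in tri.simplices:
    center, r = circumcircle(*pts[simplex])
    inside = np.linalg.norm(pts - center, axis=1) < r - 1e-9
    others = np.setdiff1d(np.arange(len(pts)), simplex)
    assert not inside[others].any()
```

For points in general position the number of cells `K` corresponds to the number of \(d\)-simplices partitioning the convex hull.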
A Delaunay tessellation partitions only \(C_H({\mathcal {X}}_{1-j})\) and, unlike the Voronoi diagram, does not offer a partitioning of the complement \({\mathbb {R}}^d {\setminus } C_H({\mathcal {X}}_{1-j})\). As we will see in the following sections, this drawback makes the definition of our semiparametric classifiers more difficult. Let the facets of \(C_H({\mathcal {X}}_{1-j})\) be the simplices on the boundary of \(C_H({\mathcal {X}}_{1-j})\). To partition \({\mathbb {R}}^d {\setminus } C_H({\mathcal {X}}_{1-j})\), we define unbounded regions associated with each facet of \(C_H({\mathcal {X}}_{1-j})\), namely outer simplices in \({\mathbb {R}}^d\), or outer triangles in \({\mathbb {R}}^2\). Each outer simplex is constructed from a single facet of \(C_H({\mathcal {X}}_{1-j})\), denoted by \({\mathcal {F}}_l\) for \(l=1,\ldots ,L\), where L is the number of boundary facets; note that each facet is a \((d-1)\)-simplex. Let \(\{P_1,P_2,\ldots ,P_N\} \subseteq {\mathcal {X}}_{1-j}\) be the set of points on the boundary of \(C_H({\mathcal {X}}_{1-j})\), and let \(C_M:=\sum _{i=1}^N P_i/N\) be the center of mass of \(C_H({\mathcal {X}}_{1-j})\). We use the bisector rays of Deng and Zhu (1999) as a framework for constructing outer simplices; however, such rays are not well defined for convex hulls in \({\mathbb {R}}^d\) for \(d>2\). Let the ray emanating from \(C_M\) through \(P_i\) be denoted by \(\overrightarrow{C_{M} P_i}\). Hence, we define the outer simplices by rays emanating from each boundary vertex \(P_i\) toward the outside of \(C_H({\mathcal {X}}_{1-j})\) in the direction of \(\overrightarrow{C_{M} P_i}\). Each facet \({\mathcal {F}}_l\) has d boundary points adjacent to it, and the rays associated with these boundary points establish an unbounded region together with the facet \({\mathcal {F}}_l\).
Such a region can be viewed as an infinite “drinking glass” with \({\mathcal {F}}_l\) as its bottom and its top extending to infinity, similar to the end intervals in \({\mathbb {R}}\) with one endpoint at infinity. Let \({\mathscr {F}}_l\) denote the outer simplex associated with the facet \({\mathcal {F}}_l\). An illustration of a Delaunay triangulation and the corresponding outer triangles in \({\mathbb {R}}^2\) is given in Fig. 1, where \(C_H({\mathcal {X}}_{1-j})\) has six facets, hence \({\mathbb {R}}^2 {\setminus } C_H({\mathcal {X}}_{1-j})\) is partitioned into six disjoint unbounded regions.
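A minimal sketch of this construction in \({\mathbb {R}}^2\) (our illustration, with `scipy.spatial.ConvexHull`): we compute the center of mass \(C_M\) of the boundary vertices and, for each hull vertex \(P_i\), the direction of the ray \(\overrightarrow{C_M P_i}\); each facet together with the rays at its two endpoints then bounds one outer triangle. All variable names here are ours.

```python
import numpy as np
from scipy.spatial import ConvexHull

rng = np.random.default_rng(7)
pts = rng.random((20, 2))          # a sample playing the role of X_{1-j}
hull = ConvexHull(pts)

# Center of mass of the hull's boundary vertices, C_M = sum(P_i)/N.
P = pts[hull.vertices]
C_M = P.mean(axis=0)

# For each boundary vertex P_i, the ray defining the outer-simplex walls
# points from C_M through P_i; we store its unit direction.
directions = {i: (pts[i] - C_M) / np.linalg.norm(pts[i] - C_M)
              for i in hull.vertices}

# Each facet (an edge in R^2) together with the rays at its two endpoints
# bounds one unbounded outer triangle; hull.simplices lists the facets.
outer_triangles = [(tuple(facet), directions[facet[0]], directions[facet[1]])
                   for facet in hull.simplices]
assert len(outer_triangles) == len(hull.simplices)   # L facets -> L outer triangles
```

The outer triangles are disjoint and, together with the hull, cover \({\mathbb {R}}^2\), mirroring the six-facet example of Fig. 1.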
Barycentric coordinate system
The barycentric coordinate system was introduced by A. F. Möbius in his book “The Barycentric Calculus” in 1827. The idea is to assign weights \(w_1\), \(w_2\) and \(w_3\) to the points \({\mathsf {y}}_1\), \({\mathsf {y}}_2\) and \({\mathsf {y}}_3\), respectively, which constitute the vertices of a triangle T in \({\mathbb {R}}^2\) (Ungar 2010). Hence, the center of mass, or the barycenter, of \({\mathsf {y}}_1\), \({\mathsf {y}}_2\) and \({\mathsf {y}}_3\), for \(w_1+w_2+w_3 \ne 0\), is given by
$$W = \frac{w_1 {\mathsf {y}}_1 + w_2 {\mathsf {y}}_2 + w_3 {\mathsf {y}}_3}{w_1+w_2+w_3}.$$
Similarly, let \({\mathcal {S}}= {\mathcal {S}}({\mathcal {Y}})\) be a \(d\)-simplex defined by the \(d+1\) non-coplanar points \({\mathcal {Y}}=\{{\mathsf {y}}_1,{\mathsf {y}}_2,\ldots ,{\mathsf {y}}_{d+1}\} \subset {\mathbb {R}}^d\) with weights \((w_1,w_2,\ldots ,w_{d+1})\). Thus, the barycenter \(W \in {\mathbb {R}}^d\) is given by
$$W = \frac{\sum _{i=1}^{d+1} w_i {\mathsf {y}}_i}{\sum _{i=1}^{d+1} w_i}. \qquad (2)$$
The \((d+1)\)-tuple \({\mathbf {w}}=(w_1,w_2,\ldots ,w_{d+1})\) (also denoted by \((w_1:w_2:\ldots :w_{d+1})\)) can also be viewed as a set of coordinates of W with respect to the (vertex) set \({\mathcal {Y}}= \{{\mathsf {y}}_1,{\mathsf {y}}_2,\ldots ,{\mathsf {y}}_{d+1}\}\) for \(d>0\); hence the name barycentric coordinates. Observe that W in Eq. (2) is scale invariant (i.e. invariant under scaling of the weights of W). Therefore, the set of barycentric coordinates is homogeneous; i.e., for any \(\lambda \in {\mathbb {R}}_+\),
$$(w_1:w_2:\ldots :w_{d+1}) = (\lambda w_1:\lambda w_2:\ldots :\lambda w_{d+1}).$$
This gives rise to the normalized barycentric coordinates \({\mathbf {w}}'=(w'_1,w'_2,\ldots ,w'_{d+1})\) of a point \(x \in {\mathbb {R}}^d\) with respect to \({\mathcal {Y}}\) as follows:
$$w'_i := \frac{w_i}{w_{tot}}, \quad i=1,\ldots ,d+1,$$
where \(w_{tot}:=\sum _{j=1}^{d+1} w_j\). For simplicity, we refer to the normalized barycentric coordinates as “barycentric coordinates” throughout this work, and use \({\mathbf {w}}\) to denote the vector of the coordinates of x. That is, x is \((w_1,w_2,\ldots ,w_{d+1})\) in barycentric coordinates so that \(\sum _{i=1}^{d+1}w_i=1\). The vector \({\mathbf {w}}\) is the unique solution of the linear system of equations
$$x - {\mathsf {y}}_1 = \sum _{k=2}^{d+1} w_k\,({\mathsf {y}}_k - {\mathsf {y}}_1), \qquad w_1 = 1 - \sum _{k=2}^{d+1} w_k,$$
where the vectors \({\mathsf {y}}_k - {\mathsf {y}}_1 \in {\mathbb {R}}^d\) for \(k=2,\ldots ,d+1\) are linearly independent (Lawson 1986). The vector \({\mathbf {w}}\) is unique, but the \(w_i\) are not necessarily in (0, 1). Barycentric coordinates determine whether the point x is in \({\mathcal {S}}({\mathcal {Y}})\) or not, as follows:
\(x \in {\mathcal {S}}({\mathcal {Y}})^o\) if \(w_i \in (0,1)\) for all \(i=1,\ldots ,d+1\): the point x is in the interior of the \(d\)-simplex \({\mathcal {S}}({\mathcal {Y}})\), where \({\mathcal {S}}({\mathcal {Y}})^o\) denotes the interior of \({\mathcal {S}}({\mathcal {Y}})\),
\(x \in \partial ({\mathcal {S}}({\mathcal {Y}}))\) if \(w_i=0\) for all i in some nonempty \(I \subsetneq \{1,\ldots ,d+1\}\) and \(w_j \in (0,1]\) for all \(j \in \{1,\ldots ,d+1\} {\setminus } I\): the point x is on the boundary of \({\mathcal {S}}({\mathcal {Y}})\),
\(x={\mathsf {y}}_i\) if \(w_i=1\) for some \(i \in \{1,\ldots ,d+1\}\) and \(w_j=0\) for all \(j \ne i\): the point x is at a vertex of \({\mathcal {S}}({\mathcal {Y}})\),
\(x \not \in {\mathcal {S}}({\mathcal {Y}})\) if \(w_i \not \in [0,1]\) for some \(i \in \{1,\ldots ,d+1\}\): the point x is outside of \({\mathcal {S}}({\mathcal {Y}})\).
Barycentric coordinates of a point \(x \in {\mathcal {S}}({\mathcal {Y}})\) can also be viewed as the coefficients expressing x as a convex combination of the points of \({\mathcal {Y}}\), the vertices of \({\mathcal {S}}({\mathcal {Y}})\).
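The linear system above and the sign-pattern rules can be sketched in a few lines of code (our illustration; the function names `barycentric` and `locate` are hypothetical):

```python
import numpy as np

def barycentric(x, Y):
    """Normalized barycentric coordinates of x w.r.t. the d-simplex with
    vertex array Y of shape (d+1, d); solves the linear system in the text."""
    Y = np.asarray(Y, dtype=float)
    # x - y_1 = sum_{k>=2} w_k (y_k - y_1), with w_1 = 1 - sum_{k>=2} w_k.
    A = (Y[1:] - Y[0]).T                      # d x d matrix of edge vectors
    w_rest = np.linalg.solve(A, np.asarray(x, float) - Y[0])
    return np.concatenate(([1.0 - w_rest.sum()], w_rest))

def locate(x, Y, tol=1e-12):
    """Classify x relative to the simplex via the sign pattern of w."""
    w = barycentric(x, Y)
    if (w > tol).all():
        return "interior"
    if (w >= -tol).all():
        return "boundary"
    return "outside"

# Example in R^2: the triangle with vertices (0,0), (1,0), (0,1).
T = [(0, 0), (1, 0), (0, 1)]
assert np.allclose(barycentric((1/3, 1/3), T), [1/3, 1/3, 1/3])
assert locate((0.2, 0.2), T) == "interior"
assert locate((0.5, 0.5), T) == "boundary"
assert locate((1.0, 1.0), T) == "outside"
```

The solve step relies on the linear independence of the edge vectors \({\mathsf {y}}_k - {\mathsf {y}}_1\), exactly as required in the text.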
M-vertex regions
A \(d\)-simplex is the smallest convex polytope in \({\mathbb {R}}^d\) constructed from a set of non-coplanar vertices \({\mathcal {Y}}=\{{\mathsf {y}}_1,{\mathsf {y}}_2,\ldots ,{\mathsf {y}}_{d+1}\}\). The boundary of a \(d\)-simplex consists of \(k\)-simplices called \(k\)-faces for \(0 \le k < d\). Each \(k\)-face is a simplex defined by a subset of \({\mathcal {Y}}\) with \(k+1\) elements; hence there are \(\left( {\begin{array}{c}d+1\\ k+1\end{array}}\right)\) \(k\)-faces in a \(d\)-simplex. Let \({\mathcal {S}}({\mathcal {Y}})\) be the simplex defined by the set of points \({\mathcal {Y}}\). Given a simplex center \(M \in {\mathcal {S}}({\mathcal {Y}})^o\) (e.g. a triangle center in \({\mathbb {R}}^2\)), there are \(d+1\) M-vertex regions constructed from the set \({\mathcal {Y}}\). The M-vertex region of the vertex \({\mathsf {y}}_i\) is denoted by \(R_M({\mathsf {y}}_i)\) for \(i=1,2,\ldots ,d+1\).
For \(i=1,\ldots ,d+1\), let \(f_i\) denote the \((d-1)\)-face opposite to the vertex \({\mathsf {y}}_i\). Observe that the line through the points \({\mathsf {y}}_i\) and M crosses the face \(f_i\), a \((d-1)\)-face, at the point \(M_i\). Similarly, since the face \(f_i\) is a \((d-1)\)-simplex with center \(M_i\) for any \(i=1,\ldots ,d+1\), we can find the centers of the \((d-2)\)-faces of this \((d-1)\)-simplex. Note that both \(M_i\) and M are the same type of centers of their respective simplices \(f_i\) and \({\mathcal {S}}({\mathcal {Y}})\). The vertex region \(R_M({\mathsf {y}}_i)\) is the convex hull of the points \({\mathsf {y}}_i\), M, \(\{M_j\}^{d+1}_{j=1;j \ne i}\), and the centers of all \(k\)-faces (which are also \(k\)-simplices) adjacent to \({\mathsf {y}}_i\) for \(k=1,\ldots ,d-2\). In Fig. 2, we illustrate the vertex regions of an acute triangle in \({\mathbb {R}}^2\) and the vertex regions \(R_M({\mathsf {y}}_1)\) and \(R_M({\mathsf {y}}_3)\) of a 3-simplex (tetrahedron). In \({\mathbb {R}}^2\), the 2-simplex is a triangle with vertices \({\mathcal {Y}}=\{{\mathsf {y}}_1,{\mathsf {y}}_2,{\mathsf {y}}_3\}\), denoted by \(T({\mathcal {Y}})=T({\mathsf {y}}_1,{\mathsf {y}}_2,{\mathsf {y}}_3)\), and the corresponding vertex regions are \(R_M({\mathsf {y}}_1)\), \(R_M({\mathsf {y}}_2)\), and \(R_M({\mathsf {y}}_3)\) (see Fig. 2a, b). Notice that \(M_i\) lies on the edge \(e_i\) opposite to the vertex \({\mathsf {y}}_i\) for \(i=1,2,3\). Observe that, in Fig. 2c and d, each 2-face of this 3-simplex is a 2-simplex (a triangle). For example, in Fig. 2c, the points \(M_2\), \(M_3\) and \(M_4\) are the centers of \(f_2\), \(f_3\) and \(f_4\), respectively. Moreover, these 2-simplices also have faces (1-faces, or edges of the 3-simplex), and the centers of these faces are \(\{M_{ij}\}^4_{i,j=1;i \ne j}\).
Hence, the vertex region \(R_M({\mathsf {y}}_1)\) is the convex polytope with vertices \(\{{\mathsf {y}}_1,M,M_2,M_3,M_4,M_{32},M_{42},M_{43}\}\), and \(R_M({\mathsf {y}}_3)\) is the convex polytope with vertices \(\{{\mathsf {y}}_3,M,M_2,M_4,M_1,M_{42},M_{41},M_{21}\}\). Just as we can write a vertex region as the intersection of two triangles in \({\mathbb {R}}^2\), we can write a vertex region as the intersection of three tetrahedra in \({\mathbb {R}}^3\). For example, \(R_M({\mathsf {y}}_1)\) is the intersection of the tetrahedra \(T({\mathsf {y}}_1,M_{42},{\mathsf {y}}_2,{\mathsf {y}}_4)\), \(T({\mathsf {y}}_1,M_{43},{\mathsf {y}}_3,{\mathsf {y}}_4)\) and \(T({\mathsf {y}}_1,M_{32},{\mathsf {y}}_2,{\mathsf {y}}_3)\).
Ceyhan and Priebe (2005) introduced the vertex regions as auxiliary tools to define proximity regions. They also gave the explicit functional forms of these regions as a function of the coordinates of vertices \(\{{\mathsf {y}}_1,{\mathsf {y}}_2,{\mathsf {y}}_3\}\). However, we characterize these regions based on barycentric coordinates as given in Proposition 1 and Theorem 1, as this coordinate system is more convenient for computation in higher dimensions.
Proposition 1
Let \({\mathcal {Y}}=\{{\mathsf {y}}_1,{\mathsf {y}}_2,{\mathsf {y}}_3\} \subset {\mathbb {R}}^2\) be a set of three non-collinear points, and let \(\{R_M({\mathsf {y}}_i)\}_{i=1,2,3}\) be the vertex regions that partition \(T({\mathcal {Y}})\). Then for \(x \in T({\mathcal {Y}})\) and \(M\in T({\mathcal {Y}})^o\), we have \(x \in R_M({\mathsf {y}}_i)\) if and only if
$$w_T^{(i)}(x) \ge \max _{j \ne i} \frac{m_i\, w_T^{(j)}(x)}{m_j}$$
for \(i=1,2,3\), where \({\mathbf {w}}_T(x)=\left( w_T^{(1)}(x),w_T^{(2)}(x),w_T^{(3)}(x)\right)\) and \({\mathbf {m}}=(m_1,m_2,m_3)\) are the barycentric coordinates of the points x and M with respect to the triangle \(T({\mathcal {Y}})\), respectively.
Proof
It is sufficient to show the result for \(i=1\), as the \(i=2,3\) cases follow similarly. So we will show that \(x \in R_M({\mathsf {y}}_1)\) iff
$$w_T^{(1)}(x) \ge \max \Bigg \{ \frac{m_1 w_T^{(2)}(x)}{m_2},\frac{m_1 w_T^{(3)}(x)}{m_3} \Bigg \}.$$
Let \(T_2({\mathcal {Y}})\) and \(T_3({\mathcal {Y}})\) be the two triangles formed by the sets of points \(\{{\mathsf {y}}_1,{\mathsf {y}}_2,M_{2}\}\) and \(\{{\mathsf {y}}_1,{\mathsf {y}}_3,M_{3}\}\), respectively. First, we observe that \(x \in R_M({\mathsf {y}}_1)\) if and only if \(x \in T_2({\mathcal {Y}}) \cap T_3({\mathcal {Y}})\). So, for the forward direction, assume \(x \in T_2({\mathcal {Y}}) \cap T_3({\mathcal {Y}})\). Then \(x \in T_2({\mathcal {Y}})\) and \(x \in T_3({\mathcal {Y}})\). Since \(x \in T_2({\mathcal {Y}})\), we have \(x=\alpha _1 {\mathsf {y}}_1 + \alpha _2 {\mathsf {y}}_2 + \alpha _3 M_{2}\), i.e., the barycentric coordinate vector of the point x with respect to the triangle \(T_2({\mathcal {Y}})\) is \({\mathbf {w}}_{T_2}(x)=(\alpha _1,\alpha _2,\alpha _3)\). But since \(M_2\) lies on the edge \(e_2\), we can write it as \(M_2=b {\mathsf {y}}_1 + (1-b) {\mathsf {y}}_3\) for some \(b \in (0,1)\). Then \(x=(\alpha _1+\alpha _3 b ){\mathsf {y}}_1 + \alpha _2 {\mathsf {y}}_2 + \alpha _3 (1-b) {\mathsf {y}}_3\). Hence, by the uniqueness of the barycentric coordinates of x with respect to \(T({\mathcal {Y}})\), it follows that \(w_T^{(1)}(x) = \alpha _1+\alpha _3 b\), \(w_T^{(2)}(x) = \alpha _2\) and \(w_T^{(3)}(x) = \alpha _3 (1-b)\). Also, since \(M_2\) and M are on the same line which crosses the edge \(e_2\), we have \(M=c {\mathsf {y}}_2 + (1-c) M_{2}\) for some \(c \in (0,1)\). Then
$$M = b(1-c)\, {\mathsf {y}}_1 + c\, {\mathsf {y}}_2 + (1-b)(1-c)\, {\mathsf {y}}_3.$$
Hence, by the uniqueness of the barycentric coordinates of M with respect to \(T({\mathcal {Y}})\), it follows that \(m_1=b(1-c)\), \(m_2=c\), and \(m_3=(1-b)(1-c)\). Hence
$$\frac{m_1}{m_3}=\frac{b(1-c)}{(1-b)(1-c)}=\frac{b}{1-b}.$$
Then, \(x \in T_2({\mathcal {Y}})\) iff \(w_T^{(1)}(x) \ge (m_1/m_3) w_T^{(3)}(x)\), and similarly, \(x \in T_3({\mathcal {Y}})\) iff \(w_T^{(1)}(x) \ge (m_1/m_2) w_T^{(2)}(x)\). So, \(x \in T_2({\mathcal {Y}}) \cap T_3({\mathcal {Y}})=R_M({\mathsf {y}}_1)\) implies \(\displaystyle w_T^{(1)}(x) \ge \max \Bigg \{ \frac{m_1 w_T^{(2)}(x)}{m_2},\frac{m_1 w_T^{(3)}(x)}{m_3} \Bigg \}\).
For the reverse direction, for a contradiction, assume that \(x \in R_M({\mathsf {y}}_1)\) and \(\displaystyle w_T^{(1)}(x) < \max \Bigg \{ \frac{m_1 w_T^{(2)}(x)}{m_2},\frac{m_1 w_T^{(3)}(x)}{m_3} \Bigg \}\). Without loss of generality, assume that \(\displaystyle w_T^{(1)}(x) < \frac{m_1 w_T^{(3)}(x)}{m_3}\). Since \(x \in T_2({\mathcal {Y}})\), as before, we have \(\displaystyle \frac{w_T^{(1)}(x)}{w_T^{(3)}(x)}=\frac{\alpha _1+\alpha _3 b}{\alpha _3 (1-b)}\), which is less than \(\displaystyle \frac{m_1}{m_3}=\frac{b}{1-b}\). That is, \(\displaystyle \frac{\alpha _1+\alpha _3 b}{\alpha _3 (1-b)} < \frac{b}{1-b}\), which implies \(\alpha _1 < 0\), which in turn implies \(x \not \in T_2({\mathcal {Y}})\), contradicting the assumption that \(x \in T_2({\mathcal {Y}})\). Thus,
$$x \in R_M({\mathsf {y}}_1) \iff w_T^{(1)}(x) \ge \max \Bigg \{ \frac{m_1 w_T^{(2)}(x)}{m_2},\frac{m_1 w_T^{(3)}(x)}{m_3} \Bigg \}.$$
\(\square\)
Note that, when \(M:=M_C\) (i.e., M is the centroid or the center of mass of the triangle \(T({\mathcal {Y}})\)), we can further simplify the result of Proposition 1; that is, for any point \(x \in T({\mathcal {Y}})^o\), we have \(x \in R_{M_C}({\mathsf {y}}_i)\) if and only if \(w_T^{(i)}(x)=\max _{j=1,2,3} w_T^{(j)}(x)\) since the vector of (special) barycentric coordinates of \(M_C\) is \({\mathbf {m}}_C=(1/3,1/3,1/3)\). The following theorem is an extension of Proposition 1 to higher dimensions.
Theorem 1
Let \({\mathcal {Y}}=\{{\mathsf {y}}_1,{\mathsf {y}}_2,\ldots , {\mathsf {y}}_{d+1}\} \subset {\mathbb {R}}^d\) be a set of non-coplanar points for \(d>0\), and let \(\{R_M({\mathsf {y}}_i)\}_{i=1}^{d+1}\) be the M-vertex regions that partition \({\mathcal {S}}({\mathcal {Y}})\). Then, for \(x \in {\mathcal {S}}({\mathcal {Y}})\) and \(M \in {\mathcal {S}}({\mathcal {Y}})^o\), we have \(x \in R_M({\mathsf {y}}_i)\) if and only if
$$w_{{\mathcal {S}}}^{(i)}(x) \ge \max _{j \ne i} \frac{m_i\, w_{{\mathcal {S}}}^{(j)}(x)}{m_j},$$
where \({\mathbf {w}}_{{\mathcal {S}}}(x)=\left( w_{{\mathcal {S}}}^{(1)}(x),\ldots ,w_{{\mathcal {S}}}^{(d+1)}(x)\right)\) and \({\mathbf {m}}=(m_1,\ldots ,m_{d+1})\) are the barycentric coordinates of the points x and M with respect to the simplex \({\mathcal {S}}({\mathcal {Y}})\), respectively.
See “Appendix” for the proof.
As in the triangle case above, for \(M=M_C\) and for any point \(x \in {\mathcal {S}}({\mathcal {Y}})^o\), we have \(x \in R_{M_C}({\mathsf {y}}_i)\) if and only if \(w_{{\mathcal {S}}}^{(i)}(x)=\max _{j} w_{{\mathcal {S}}}^{(j)}(x)\), since the vector of barycentric coordinates of \(M_C\) is \({\mathbf {m}}_C=(1/(d+1),1/(d+1), \ldots , 1/(d+1))\). The \(M_C\)-vertex regions are particularly appealing for our proportional-edge proximity regions.
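Assuming the characterization above, locating the vertex region of a point reduces to comparing the ratios \(w^{(i)}/m_i\); here is a small sketch (our illustration; the helper name `vertex_region` is hypothetical):

```python
import numpy as np

def vertex_region(w, m):
    """Index i of the M-vertex region containing the point with barycentric
    coordinates w, for a center M with barycentric coordinates m:
    x is in R_M(y_i) iff w_i/m_i >= w_j/m_j for all j."""
    w, m = np.asarray(w, float), np.asarray(m, float)
    return int(np.argmax(w / m))

# For M = M_C the centroid, m = (1/(d+1), ..., 1/(d+1)) and the rule reduces
# to taking the largest barycentric coordinate.
d = 2
m_C = np.full(d + 1, 1.0 / (d + 1))
w = np.array([0.2, 0.5, 0.3])      # a point inside a triangle
assert vertex_region(w, m_C) == int(np.argmax(w)) == 1
```

For a non-centroid center, say \({\mathbf {m}}=(0.5,0.25,0.25)\), the same point may fall in a different vertex region, which is the point of allowing a general M.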
Proximity regions and proximity catch digraphs
We consider proximity regions for the (supervised) two-class classification problem, then perform complexity reduction via minimum dominating sets of the associated proximity catch digraphs. For \(j=0,1\), the proximity map \(N(\cdot ): \varOmega \rightarrow 2^{\varOmega }\) associates with each point \(x \in {\mathcal {X}}_j\) a proximity region \(N(x) \subset \varOmega\). Consider the data-random (or vertex-random) proximity catch digraph \(D_j=({\mathcal {V}}_j,{\mathcal {A}}_j)\) with vertex set \({\mathcal {V}}_j={\mathcal {X}}_j\) and arc set \({\mathcal {A}}_j\) defined by \((u,v) \in {\mathcal {A}}_j \iff \{u,v\}\subset {\mathcal {X}}_j\) and \(v \in N(u)\), for \(j=0,1\). The digraph \(D_j\) depends on the (joint) distribution of the sets of points \({\mathcal {X}}_0\) and \({\mathcal {X}}_1\), and on the map \(N(\cdot )\). The adjective proximity (for the digraph \(D_j\) and for the map \(N(\cdot )\)) comes from thinking of the region N(x) as representing those points in \(\varOmega\) “closer” to x (Toussaint 1980; Jaromczyk and Toussaint 1992). Our proximity catch digraphs (PCDs) for \({\mathcal {X}}_j\) against \({\mathcal {X}}_{1-j}\) are defined by specifying \({\mathcal {X}}_j\) as the target class and \({\mathcal {X}}_{1-j}\) as the non-target class. Hence, in the definitions of our PCDs, the only difference is the switching of the roles of \({\mathcal {X}}_0\) and \({\mathcal {X}}_1\): for \(j=0\), class 0 is the target class and class 1 is the non-target class, and vice versa for \(j=1\).
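Given any proximity map, the arc set of the corresponding PCD follows directly from this definition; below is a small illustrative sketch (our code; the fixed-radius spherical map is a toy stand-in, not the PE proximity map defined later):

```python
import numpy as np

def pcd_arcs(X, N):
    """Arc set of a proximity catch digraph on vertex set X:
    (u, v) is an arc iff u != v and v lies in the proximity region N(u)."""
    return [(i, j)
            for i, u in enumerate(X)
            for j, v in enumerate(X)
            if i != j and N(u)(v)]

# Toy spherical proximity map of fixed radius 0.5 (hypothetical choice);
# any map x -> N(x) with a membership test could be plugged in instead.
X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0]])
N_ball = lambda x: (lambda v: np.linalg.norm(v - x) < 0.5)
arcs = pcd_arcs(X, N_ball)
assert (0, 1) in arcs and (1, 0) in arcs and (0, 2) not in arcs
```

Note that the digraph is directed: with a data-dependent radius (as in CCCDs below), \((u,v)\) being an arc does not imply \((v,u)\) is.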
The proximity regions associated with the PCDs introduced by Ceyhan and Priebe (2005) are simplicial proximity regions (regions that constitute simplices in \({\mathbb {R}}^d\)) defined for the target class points \({\mathcal {X}}_j\) in the convex hull of the non-target class points, \(C_H({\mathcal {X}}_{1-j})\). However, by introducing the outer simplices associated with the facets of \(C_H({\mathcal {X}}_{1-j})\), we extend the definition of the simplicial proximity regions to \({\mathbb {R}}^d {\setminus } C_H({\mathcal {X}}_{1-j})\). Such simplicial regions are \(d\)-simplices in \(C_H({\mathcal {X}}_{1-j})\) (intervals in \({\mathbb {R}}\), triangles in \({\mathbb {R}}^2\) and tetrahedra in \({\mathbb {R}}^3\)) and \(d\)-polytopes in \({\mathbb {R}}^d {\setminus } C_H({\mathcal {X}}_{1-j})\). After partitioning \({\mathbb {R}}^d\) into disjoint regions, we further partition each simplex \({\mathcal {S}}_k\) (only the ones inside \(C_H({\mathcal {X}}_{1-j})\)) into vertex regions, and define the simplicial proximity regions N(x) for \(x \in {\mathcal {S}}_k\). Here, we define the regions N(x) as open sets in \({\mathbb {R}}^d\).
Class cover catch digraphs
Class Cover Catch Digraphs (CCCDs) are graph-theoretic representations of the CCP (Priebe et al. 2001, 2003a). In a CCCD, for \(x,y \in {\mathcal {X}}_j\), let \(B=B(x,\varepsilon )\) be the ball centered at x with radius \(\varepsilon =\varepsilon (x)\). A CCCD is a digraph \(D_j=({\mathcal {V}}_j,{\mathcal {A}}_j)\) with vertex set \({\mathcal {V}}_j={\mathcal {X}}_j\) and arc set \({\mathcal {A}}_j\) where \((x,y) \in {\mathcal {A}}_j\) iff \(y \in B\). One particular family of CCCDs is called pure-CCCDs wherein, for all \(x \in {\mathcal {X}}_j\), no non-target class point lies in B. Hence, for some \(\theta \in (0,1]\) and for all \(x \in {\mathcal {X}}_j\), the open ball B is denoted by \(B_{\theta }(x,\varepsilon _{\theta }(x))\) with the radius \(\varepsilon _{\theta }(x)\) given by
where \(u(x):={{\,\mathrm{argmin}\,}}_{y \in {\mathcal {X}}_{1-j}} d(x,y)\) and \(\ell (x):={{\,\mathrm{argmax}\,}}_{z \in {\mathcal {X}}_j} \{d(x,z): d(x,z) < d(x,u(x))\}\). Here, \(d(\cdot ,\cdot )\) can be any dissimilarity measure, but we use the Euclidean distance henceforth. For all \(x \in {\mathcal {X}}_j\), the definition of the radius \(\varepsilon _{\theta }(x)\) keeps any non-target class point \(v \in {\mathcal {X}}_{1-j}\) out of the ball B; that is, \({\mathcal {X}}_{1-j} \cap B = \emptyset\). We say the CCCD \(D_j\) is "pure" since the balls include only target class points and none of the non-target class points. The CCCD \(D_j\) is invariant to the choice of \(\theta\), but this parameter affects the classification performance: a suitable choice of \(\theta\) potentially yields classifiers with increased performance (Priebe et al. 2003a). An illustration of the effect of the parameter \(\theta\) on the radius of \(B_{\theta }(x,\varepsilon _{\theta }(x))\) is given in Fig. 3 (DeVinney 2003). In fact, CCCDs can be viewed as a family of PCDs using spherical proximity maps, letting \(N(x):=B(x,\varepsilon (x))\). We denote the proximity regions associated with pure-CCCDs by \(N_S(x,\theta )=B_{\theta }(x,\varepsilon _{\theta }(x))\). For simplicity, we refer to pure-CCCDs as CCCDs throughout this article.
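The displayed radius formula is elided in this excerpt; in the pure-CCCD literature it is typically taken to be the convex combination \(\theta \, d(x,u(x)) + (1-\theta )\, d(x,\ell (x))\). A minimal sketch under that assumption (the function name and the fallback \(\ell (x)=x\) when no other target point is closer than \(u(x)\) are ours):

```python
import numpy as np

def pure_cccd_radii(X_target, X_nontarget, theta=1.0):
    """Radii eps_theta(x) of the pure-CCCD balls B_theta(x, eps_theta(x)).

    Assumes the (elided) radius formula is the convex combination
    theta * d(x, u(x)) + (1 - theta) * d(x, l(x)), with u(x) the nearest
    non-target point and l(x) the farthest target point closer to x than
    u(x); if no such target point exists, l(x) is taken as x itself.
    """
    radii = []
    for x in X_target:
        d_u = np.linalg.norm(X_nontarget - x, axis=1).min()   # d(x, u(x))
        d_targets = np.linalg.norm(X_target - x, axis=1)
        closer = d_targets[d_targets < d_u]                   # target points nearer than u(x)
        d_l = closer.max() if closer.size else 0.0            # d(x, l(x))
        radii.append(theta * d_u + (1.0 - theta) * d_l)
    return np.array(radii)
```

With \(\theta =1\) each ball extends to the nearest non-target point; since the balls are open, \({\mathcal {X}}_{1-j} \cap B = \emptyset\) still holds.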
Proportional-edge proximity maps
We use a type of proximity map with expansion parameter r, namely the proportional-edge (PE) proximity map, denoted by \(N_{PE}(\cdot ,r)\). The PE proximity map and the associated digraphs, PE-PCDs, are defined in Ceyhan and Priebe (2005). Currently, PE-PCDs are only defined for the points in \({\mathcal {X}}_j \cap C_H({\mathcal {X}}_{1-j})\). Hence, for the remaining target class points, i.e. \({\mathcal {X}}_j {\setminus } C_H({\mathcal {X}}_{1-j})\), we extend the definition of PE proximity maps to the outer simplices. As a result, we will be able to show in the subsequent sections that the resulting PCDs have computationally tractable minimum dominating sets, which are equivalent to exact minimum prototype sets of PE-PCD classifiers for the entire data set.
PE proximity maps for the interior of the convex hull of non-target points
For \(r \in [1,\infty )\), we define \(N_{PE}(\cdot ,r)\) to be the PE proximity map associated with a triangle \(T=T({\mathcal {Y}})\) formed by the set of non-collinear points \({\mathcal {Y}}= \{{\mathsf {y}}_1,{\mathsf {y}}_2,{\mathsf {y}}_3\} \subset {\mathbb {R}}^2\). Let \(M_C\) be the center of mass of T, and let \(R_{M_C}({\mathsf {y}}_1)\), \(R_{M_C}({\mathsf {y}}_2)\) and \(R_{M_C}({\mathsf {y}}_3)\) be the vertex regions associated with the vertices \({\mathsf {y}}_1\), \({\mathsf {y}}_2\) and \({\mathsf {y}}_3\). Note that the barycentric coordinates of \(M_C\) are (1/3:1/3:1/3). For \(x \in T^o\), let \(v(x) \in {\mathcal {Y}}\) be the vertex whose region contains x; hence \(x \in R_{M_C}(v(x))\). If x falls on the boundary of two vertex regions, or on \(M_C\), we assign v(x) arbitrarily. Let e(x) be the edge of T opposite to v(x), let \(\ell (v(x),x)\) be the line parallel to e(x) through x, and let \(d(v(x),\ell (v(x),x))\) be the Euclidean (perpendicular) distance from v(x) to \(\ell (v(x),x)\). For \(r \in [1,\infty )\), let \(\ell _r(v(x),x)\) be the line parallel to e(x) such that \(d(v(x),\ell _r(v(x),x)) = r\,d(v(x),\ell (v(x),x))\) and \(d(\ell (v(x),x),\ell _r(v(x),x)) < d(v(x),\ell _r(v(x),x))\). Let \(T_r(x)\) be the triangle similar to and with the same orientation as T such that \(T_r(x)\) has v(x) as a vertex and the edge opposite v(x) lies on \(\ell _r(v(x),x)\). Then the proportional-edge proximity region \(N_{PE}(x,r)\) is defined to be \((T_r(x) \cap T)^o\). Figure 4a illustrates a PE proximity region \(N_{PE}(x,r)\) of a point x in an acute triangle.
The extension of \(N_{PE}(\cdot ,r)\) to \({\mathbb {R}}^d\) for \(d > 2\) is straightforward. Now, let \({\mathcal {Y}}= \{{\mathsf {y}}_1,{\mathsf {y}}_2,\ldots ,{\mathsf {y}}_{d+1}\}\) be a set of \(d+1\) non-coplanar points, and denote the simplex formed by these points by \({\mathcal {S}}={\mathcal {S}}({\mathcal {Y}})\). We define the PE proximity map as follows. Given a point \(x \in {\mathcal {S}}^o\), let v(x) be the vertex in whose region x falls (if x falls on the boundary of two vertex regions or on \(M_C\), we assign v(x) arbitrarily). Let \(\varphi (x)\) be the face opposite to the vertex v(x), and \(\eta (v(x),x)\) be the hyperplane parallel to \(\varphi (x)\) which contains x. Let \(d(v(x),\eta (v(x),x))\) be the (perpendicular) Euclidean distance from v(x) to \(\eta (v(x),x)\). For \(r \in [1,\infty )\), let \(\eta _r(v(x),x)\) be the hyperplane parallel to \(\varphi (x)\) such that \(d(v(x),\eta _r(v(x),x))=r\,d(v(x),\eta (v(x),x))\) and \(d(\eta (v(x),x),\eta _r(v(x),x)) < d(v(x),\eta _r(v(x),x))\). Let \({\mathcal {S}}_r(x)\) be the polytope similar to and with the same orientation as \({\mathcal {S}}\) having v(x) as a vertex and \(\eta _r(v(x),x)\) as the opposite face. Then the proportional-edge proximity region is given by \(N_{PE}(x,r):=({\mathcal {S}}_r(x) \cap {\mathcal {S}})^o\).
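In barycentric coordinates this construction reduces to a one-line membership test: for center-of-mass vertex regions, \(v(x)\) corresponds to the largest barycentric coordinate of x, and \(y \in {\mathcal {S}}^o\) lies in \(N_{PE}(x,r)\) exactly when \(w^{(i)}_{{\mathcal {S}}}(y) > 1 - r\,(1 - w^{(i)}_{{\mathcal {S}}}(x))\), with i the index of \(v(x)\). A sketch under that characterization (function names are ours; ties on vertex-region boundaries are broken arbitrarily by `argmax`):

```python
import numpy as np

def barycentric(S, x):
    """Barycentric coordinates of x w.r.t. the simplex with vertex rows S ((d+1) x d)."""
    A = np.vstack([S.T, np.ones(len(S))])        # solve sum_i w_i y_i = x, sum_i w_i = 1
    return np.linalg.solve(A, np.append(x, 1.0))

def in_NPE(S, x, y, r):
    """Is y in the (open) PE proximity region N_PE(x, r) within the simplex S?

    Uses center-of-mass vertex regions: v(x) is the vertex with the largest
    barycentric coordinate of x, and y is in N_PE(x, r) iff
    w_i(y) > 1 - r * (1 - w_i(x)), with y interior to S.
    """
    wx, wy = barycentric(S, x), barycentric(S, y)
    i = int(np.argmax(wx))                       # index of v(x); boundary ties broken arbitrarily
    return bool(np.all(wy > 0) and wy[i] > 1.0 - r * (1.0 - wx[i]))
```

For \(r>1\) the test gives \(x \in N_{PE}(x,r)\), so every point dominates itself, as used in the domination arguments later in this section.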
Notice that, so far, we have assumed a single d-simplex for simplicity. For \(n_{1-j}=d+1\), the convex hull of the non-target class, \(C_H({\mathcal {X}}_{1-j})\), is a d-simplex. If \(n_{1-j}>d+1\), then we consider the Delaunay tessellation (assumed to exist) of \({\mathcal {X}}_{1-j}\), where \({\mathfrak {S}}^{\text {in}}_{1-j} =\{{\mathcal {S}}_1,\ldots ,{\mathcal {S}}_K\}\) denotes the set of all Delaunay cells (which are d-simplices). We construct the proximity region \(N_{PE}(x,r)\) of a point \(x \in {\mathcal {X}}_j\) depending on which d-simplex \({\mathcal {S}}_k\) the point resides in. Observe that this construction pertains to points in \({\mathcal {X}}_j \cap C_H({\mathcal {X}}_{1-j})\) only.
PE proximity maps for the exterior of the convex hull of non-target points
For the target class points outside the convex hull of the non-target class points, i.e. \({\mathcal {X}}_j {\setminus } C_H({\mathcal {X}}_{1-j})\), we define PE proximity maps similar to those defined for d-simplices. Let \({\mathscr {F}}\subset {\mathbb {R}}^2\) be an outer triangle defined by two adjacent boundary points of \(C_H({\mathcal {X}}_{1-j})\), assumed without loss of generality to be \(\{{\mathsf {y}}_1,{\mathsf {y}}_2\} \subset {\mathbb {R}}^2\), and by the rays \(\overrightarrow{C_{M} {\mathsf {y}}_1}\) and \(\overrightarrow{C_{M} {\mathsf {y}}_2}\), where \(C_M\) is the centroid of the boundary points of \(C_H({\mathcal {X}}_{1-j})\). Also, let \(e={\mathcal {F}}\) be the edge (or facet) of \(C_H({\mathcal {X}}_{1-j})\) adjacent to the vertices \(\{{\mathsf {y}}_1,{\mathsf {y}}_2\}\). Note that there is no center in an outer triangle, and hence no need for vertex regions. For \(r \in [1,\infty )\), we define \(N_{PE}(\cdot ,r)\) to be the PE proximity map of the outer triangle as follows. For \(x \in {\mathscr {F}}^o\), let \(\ell (x,e)\) be the line parallel to e through x, and let \(d(e,\ell (x,e))\) be the Euclidean distance from e to \(\ell (x,e)\). For \(r \in [1,\infty )\), let \(\ell _r(x,e)\) be the line parallel to e such that \(d(e,\ell _r(x,e)) = r\,d(e,\ell (x,e))\) and \(d(\ell (x,e),\ell _r(x,e)) < d(e,\ell _r(x,e))\). Let \({\mathscr {F}}_r(x)\) be the polygon similar to the outer triangle \({\mathscr {F}}\) such that \({\mathscr {F}}_r(x)\) has e and \(e_r(x)=\ell _r(x,e) \cap {\mathscr {F}}\) as two of its edges; note that \({\mathscr {F}}_r(x)\) is a bounded region whereas \({\mathscr {F}}\) is not. Then, the proximity region \(N_{PE}(x,r)\) is defined to be \({\mathscr {F}}^o_r(x)\). Figure 4b illustrates a PE proximity region \(N_{PE}(x,r)\) of a point x in an outer triangle.
The extension of \(N_{PE}(\cdot ,r)\) from outer triangles in \({\mathbb {R}}^2\) to \({\mathbb {R}}^d\) for \(d > 2\) is also straightforward. Let \({\mathscr {F}}\subset {\mathbb {R}}^d\) be an outer simplex defined by d adjacent boundary points of \(C_H({\mathcal {X}}_{1-j})\), assumed without loss of generality to be \(\{{\mathsf {y}}_1,\ldots ,{\mathsf {y}}_d\} \subset {\mathbb {R}}^d\), and by the rays \(\{\overrightarrow{C_{M} {\mathsf {y}}_1},\ldots ,\overrightarrow{C_{M} {\mathsf {y}}_d}\}\). Also, let \({\mathcal {F}}\) be the facet of \(C_H({\mathcal {X}}_{1-j})\) adjacent to the vertices \(\{{\mathsf {y}}_1,\ldots ,{\mathsf {y}}_d\}\). We define the PE proximity map as follows. Given a point \(x \in {\mathscr {F}}^o\), let \(\eta (x,{\mathcal {F}})\) be the hyperplane parallel to \({\mathcal {F}}\) through x, and let \(d({\mathcal {F}},\eta (x,{\mathcal {F}}))\) be the Euclidean distance from \({\mathcal {F}}\) to \(\eta (x,{\mathcal {F}})\). For \(r \in [1,\infty )\), let \(\eta _r(x,{\mathcal {F}})\) be the hyperplane parallel to \({\mathcal {F}}\) such that \(d({\mathcal {F}},\eta _r(x,{\mathcal {F}})) = r\,d({\mathcal {F}},\eta (x,{\mathcal {F}}))\) and \(d(\eta (x,{\mathcal {F}}),\eta _r(x,{\mathcal {F}})) < d({\mathcal {F}},\eta _r(x,{\mathcal {F}}))\). Let \({\mathscr {F}}_r(x)\) be the polytope similar to the outer simplex \({\mathscr {F}}\) such that \({\mathscr {F}}_r(x)\) has \({\mathcal {F}}\) and \({\mathcal {F}}_r(x)=\eta _r(x,{\mathcal {F}}) \cap {\mathscr {F}}\) as two of its faces. Then, the proximity region \(N_{PE}(x,r)\) is defined to be \({\mathscr {F}}^o_r(x)\).
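For outer simplices, the membership test reduces to comparing distances to the hull facet: with both points already inside the outer simplex \({\mathscr {F}}\), \(y \in N_{PE}(x,r)\) iff \(d(y,{\mathcal {F}}) < r\,d(x,{\mathcal {F}})\). A sketch of just this distance condition (the facet is given by a point p on it and a normal vector; containment of y in \({\mathscr {F}}\) itself is assumed to be checked separately):

```python
import numpy as np

def in_NPE_outer(p, normal, x, y, r):
    """Distance part of the outer-simplex PE test: y in N_PE(x, r) iff
    d(y, F) < r * d(x, F), where F is the convex-hull facet through the
    point p with (not necessarily unit) normal `normal`. Both x and y are
    assumed to lie in the outer simplex spanned by F.
    """
    n = np.asarray(normal, dtype=float)
    n /= np.linalg.norm(n)
    d_x = abs(np.dot(np.asarray(x) - p, n))   # perpendicular distance from x to F
    d_y = abs(np.dot(np.asarray(y) - p, n))
    return bool(d_y < r * d_x)
```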
The convex hull \(C_H({\mathcal {X}}_{1-j})\) has at least \(d+1\) facets (exactly \(d+1\) when \(n_{1-j}=d+1\)), and since each outer simplex is associated with a facet, the number of outer simplices is at least \(d+1\). Let \({\mathfrak {S}}^{\text {out}}_{1-j} =\{{\mathscr {F}}_1,\ldots ,{\mathscr {F}}_L\}\) denote the set of all outer simplices. This construction handles the points in \({\mathcal {X}}_j {\setminus } C_H({\mathcal {X}}_{1-j})\) only. Together with the points inside \(C_H({\mathcal {X}}_{1-j})\), the PE-PCD \(D_j\), whose vertex set is \({\mathcal {V}}_j={\mathcal {X}}_j\), has at least
many components where \(I(\cdot )\) stands for the indicator function.
Minimum dominating sets of PCDs
We develop prototype-based classifiers with computationally tractable exact minimum prototype sets. We model the target class with a digraph D such that prototype sets of the target class are equivalent to dominating sets of D. Ceyhan (2010) determined the appealing properties of minimum dominating sets of CCCDs in \({\mathbb {R}}\) and used them as a guideline in defining new parametric digraphs relative to the Delaunay tessellation of the points from the non-target class. In \({\mathbb {R}}\), finding the minimum dominating sets of CCCDs is computationally tractable, and the exact distribution of the domination number is known for target class points that are uniformly distributed within each cell (Priebe et al. 2001). However, there is no known polynomial-time algorithm for finding the exact minimum dominating sets of CCCDs in \({\mathbb {R}}^d\) for \(d>1\). In this section, we provide a characterization of the minimum dominating sets of PE-PCDs via the barycentric coordinate system and employ those coordinates to introduce algorithms that find their minimum dominating sets in polynomial time.
We model the support of the class conditional distributions, i.e. \(s(F_j)\), by a mixture of proximity regions. For a general proximity region \(N(\cdot )\), we estimate the support of class j by \(Q_j:=\cup _{x \in {\mathcal {X}}_j}N(x)\), so that \({\mathcal {X}}_j \subset Q_j\). Nevertheless, the support of the target class j can be estimated by a cover of lower complexity (i.e. with fewer proximity regions). For this purpose, we wish to reduce the model complexity by selecting an appropriate subset of proximity regions that still gives approximately the same estimate as \(Q_j\). Let this (approximate) cover be defined as \(C_j:=\cup _{x \in S_j} N_{PE}(x,r)\), where \(S_j\) is a prototype set of points for \({\mathcal {X}}_j\) such that \({\mathcal {X}}_j \subset C_j\). A reasonable choice of prototype set for our class covers is the minimum dominating set of the PE-PCD, whose elements are often more "central" than arbitrary sets of the same size. Dominating sets of minimum size are desirable, since the size of the prototype set determines the complexity of the model: the smaller the prototype set (hence the lower the model complexity), the higher the expected classification performance (Mehta et al. 1995; Rissanen 1989; Gao et al. 2013).
In a digraph \(D=({\mathcal {V}},{\mathcal {A}})\) of order \(n=|{\mathcal {V}}|\), a vertex v dominates itself and all vertices of the form \(\{u:\,(v,u) \in {\mathcal {A}}\}\). A dominating set \(S_D\) for the digraph D is a subset of \({\mathcal {V}}\) such that each vertex \(v \in {\mathcal {V}}\) is dominated by a vertex in \(S_D\). A minimum dominating set (MDS) \(S_{MD}\) is a dominating set of minimum cardinality, and the domination number \(\gamma (D)\) is defined as \(\gamma (D):=|S_{MD}|\). Finding a minimum dominating set is, in general, an NP-hard optimization problem (Karr 1992; Arora and Lund 1996). However, an approximately minimum dominating set can be obtained in \(O(n^2)\) time using a well-known greedy algorithm as in Algorithm 1 (Chvatal 1979; Parekh 1991). PCDs using \(N_S(\cdot ,\theta )\) (or CCCDs with parameter \(\theta\)) are examples of such digraphs. However, the (exact) MDSs of PE-PCDs are computationally tractable, unlike those of PCDs based on \(N_S(\cdot ,\theta )\). Many attributes of these PE proximity maps, and the proof of the existence of an algorithm to find an MDS, are conveniently handled through the barycentric coordinate system. Before proving the results on the MDS, we give the following proposition.
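The greedy heuristic can be sketched as follows; this is a standard greedy dominating-set routine in the spirit of Algorithm 1 (the exact bookkeeping of the paper's Algorithm 1 may differ):

```python
def greedy_dominating_set(closed_neighbors):
    """Approximate minimum dominating set of a digraph (greedy heuristic).

    closed_neighbors[v] is the set of vertices dominated by v, i.e. v itself
    together with its out-neighbors. Repeatedly picks the vertex covering the
    most not-yet-dominated vertices; the result is approximately (not
    exactly) minimum in general.
    """
    undominated = set(closed_neighbors)
    S = []
    while undominated:
        v = max(closed_neighbors,
                key=lambda u: len(closed_neighbors[u] & undominated))
        S.append(v)
        undominated -= closed_neighbors[v]
    return S
```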
Proposition 2
Let \({\mathcal {Y}}=\{{\mathsf {y}}_1,{\mathsf {y}}_2,\ldots , {\mathsf {y}}_{d+1}\} \subset {\mathbb {R}}^d\) be a set of non-coplanar points for \(d>0\). For \(x,x^* \in {\mathcal {S}}={\mathcal {S}}({\mathcal {Y}})\), we have \(d(x,f_i) < d(x^*,f_i)\) if and only if \(w^{(i)}_{{\mathcal {S}}}(x) < w^{(i)}_{{\mathcal {S}}}(x^*)\) for all \(i=1,\ldots ,d+1\), where \(d(x,f_i)\) is the distance between the point x and the face \(f_i\).
Proof
For \(i=1,\ldots ,d+1\), note that \(f_i\) is the face of the simplex \({\mathcal {S}}\) opposite to the vertex \({\mathsf {y}}_i\). Let \(L({\mathsf {y}}_i,x)\) be the line through the points x and \({\mathsf {y}}_i\), and let \(z \in f_i\) be the point at which \(L({\mathsf {y}}_i,x)\) crosses \(f_i\). Also, recall that \(\eta ({\mathsf {y}}_i,x)\) denotes the hyperplane through the point x and parallel to \(f_i\). Hence, for \(\alpha \in (0,1)\),
and since z is a convex combination of the set \(\{{\mathsf {y}}_k\}_{k \ne i}\),
for \(\beta _k \in (0,1)\) for all k. Thus, \(w^{(i)}_{{\mathcal {S}}}(x)=\alpha\) by the uniqueness of \({\mathbf {w}}_{{\mathcal {S}}}(x)\) for x with respect to \({\mathcal {S}}\). Observe that \(\alpha =d(x,z)/d({\mathsf {y}}_i,z)=d(x,f_i)/d({\mathsf {y}}_i,f_i)\), since distances d(x, z) and \(d(x,f_i)=d(\eta ({\mathsf {y}}_i,x),f_i)\) are directly proportional (and so are \(d({\mathsf {y}}_i,z)\) and \(d({\mathsf {y}}_i,f_i)\)). In fact, points that are on the same plane parallel to \(f_i\) have the same \(i^{th}\) barycentric coordinate as \(w^{(i)}_{{\mathcal {S}}}(x)=\alpha\) corresponding to the vertex \({\mathsf {y}}_i\). Also, recall that, with decreasing \(\alpha\), the point x gets closer to \(f_i\) (\(x \in f_i\) if \(\alpha =0\), and \(x={\mathsf {y}}_i\) if \(\alpha =1\)). Then, for any two points \(x,x^* \in {\mathcal {S}}\), we have \(w^{(i)}_{{\mathcal {S}}}(x)=d(x,f_i)/d({\mathsf {y}}_i,f_i)\) and \(w^{(i)}_{{\mathcal {S}}}(x^*)=d(x^*,f_i)/d({\mathsf {y}}_i,f_i)\). Thus \(w^{(i)}_{{\mathcal {S}}}(x) < w^{(i)}_{{\mathcal {S}}}(x^*)\) if and only if \(d(x,f_i) < d(x^*,f_i)\). \(\square\)
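A quick numeric illustration of Proposition 2 in \({\mathbb {R}}^2\) (the helper names are ours): for each vertex index i, the ordering of the i-th barycentric coordinates of two points matches the ordering of their distances to the opposite face \(f_i\).

```python
import numpy as np

def barycentric_coords(S, x):
    """Barycentric coordinates of x w.r.t. the simplex with vertex rows S."""
    A = np.vstack([S.T, np.ones(len(S))])
    return np.linalg.solve(A, np.append(x, 1.0))

def dist_to_face(S, i, x):
    """Distance from x to the face of triangle S opposite vertex i (2-D case)."""
    a, b = np.delete(S, i, axis=0)                        # the two vertices spanning f_i
    e = b - a
    n = np.array([-e[1], e[0]]) / np.linalg.norm(e)       # unit normal of f_i
    return abs(np.dot(x - a, n))

# Proposition 2: w_i(x) < w_i(x*)  <=>  d(x, f_i) < d(x*, f_i), for every i
S = np.array([[0., 0.], [4., 0.], [1., 3.]])
x, x_star = np.array([1.0, 0.5]), np.array([1.5, 1.5])
for i in range(3):
    wi, wi_star = barycentric_coords(S, x)[i], barycentric_coords(S, x_star)[i]
    assert (wi < wi_star) == (dist_to_face(S, i, x) < dist_to_face(S, i, x_star))
```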
Barycentric coordinates of a set of points in \({\mathcal {S}}({\mathcal {X}}_{1-j})\) are useful in characterizing the set of local extremum points, which are extreme (having maximum or minimum distance) with respect to a subset of the class supports. A subset of the local extremum points constitutes the minimum dominating set \(S_{MD}\). We use Proposition 2 to prove the following theorem on the MDS of a PE-PCD, D.
Theorem 2
Let \({\mathcal {Z}}_n=\{z_1,z_2,\ldots , z_n\} \subset {\mathbb {R}}^d\) and \({\mathcal {Y}}=\{{\mathsf {y}}_1,{\mathsf {y}}_2,\ldots , {\mathsf {y}}_{d+1}\}\subset {\mathbb {R}}^d\) for \(d>0\), and let \({\mathcal {S}}={\mathcal {S}}({\mathcal {Y}})\) be the d-simplex given by the set \({\mathcal {Y}}\) such that \({\mathcal {Z}}_n \subset {\mathcal {S}}\). Let D be the PE-PCD associated with the proximity map \(N_{PE}(\cdot ,r)\) with vertex set \({\mathcal {V}}={\mathcal {Z}}_n\); then we have \(\gamma (D) \le d+1\) for all \(r>1\).
Proof
Let \(x_{[i]}:={{\,\mathrm{argmin}\,}}_{x \in {\mathcal {Z}}_n \cap R_M({\mathsf {y}}_i)} d(x,f_i)\); i.e., \(x_{[i]}\) is the closest point of \({\mathcal {Z}}_n \cap R_M({\mathsf {y}}_i)\) to the face \(f_i\) (so \(x_{[i]}\) is a local extremum of \({\mathcal {Z}}_n\) with respect to \(R_M({\mathsf {y}}_i)\)), provided \({\mathcal {Z}}_n \cap R_M({\mathsf {y}}_i) \ne \emptyset\). By Proposition 2, note that \(d\left( x_{[i]},f_i\right) \le \min _{z \in {\mathcal {Z}}_n \cap R_M({\mathsf {y}}_i)} d(z,f_i)\) if and only if \(w^{(i)}_{{\mathcal {S}}}(x_{[i]}) \le \min _{z \in {\mathcal {Z}}_n \cap R_M({\mathsf {y}}_i)} w^{(i)}_{{\mathcal {S}}}(z)\). Hence, the local extremum point \(x_{[i]}\) satisfies
Clearly, \({\mathcal {Z}}_n \cap R_M({\mathsf {y}}_i) \subset N_{PE}\left( x_{[i]},r\right)\) for all \(r > 1\). Hence, \({\mathcal {Z}}_n \subset \cup _{i=1}^{d+1} N_{PE}\left( x_{[i]},r\right)\). So, the set of all such local extremum points \(E_L:=\{x_{[1]},\ldots ,x_{[d+1]}\}\) (provided they all exist) is a dominating set for the PEPCD with vertices \({\mathcal {Z}}_n\). If some of the \(x_{[i]}\) do not exist, the set of such local extremum points will be a proper subset of \(E_L\). Hence, we obtain \(\gamma (D) \le d+1\). \(\square\)
For \(r=1\), \(x \not \in N_{PE}(x,r)\), so \(N_{PE}\left( x_{[i]},r\right)\) does not cover the points on its boundary, in particular on its face coincident with \(\eta _r\left( x_{[i]},f_i\right)\) which is the same as \(\eta \left( x_{[i]},f_i\right)\) for \(r=1\). But \(\eta \left( x_{[i]},f_i\right)\) has Lebesgue measure zero in \({\mathbb {R}}^d\).
With \(M=M_C\), MDSs of PE-PCDs are found by locating the closest point \(x_{[i]}\) to the face \(f_i\) in the vertex region \(R_{M_C}({\mathsf {y}}_i)\) for all \(i=1,\ldots ,d+1\). By Theorem 2, in \(R_{M_C}({\mathsf {y}}_i)\), the point \(x_{[i]}\) is the closest point among \({\mathcal {X}}_j \cap R_{M_C}({\mathsf {y}}_i)\) to the face \(f_i\). For the set of d-simplices given by the Delaunay tessellation of \({\mathcal {X}}_{1-j}\), Algorithm 2 identifies all such local extremum points of each d-simplex in order to find the (exact) minimum dominating set \(S_j=S_{MD}\).
Let \(D_j=({\mathcal {V}}_j,{\mathcal {A}}_j)\) be the PE-PCD with vertex set \({\mathcal {V}}_j={\mathcal {X}}_j\). In Algorithm 2, we partition \({\mathcal {X}}_j\) into subsets such that each subset falls into a single d-simplex of the Delaunay tessellation of the set \({\mathcal {X}}_{1-j}\). Let \({\mathfrak {S}}_{1-j}\) be the set of all d-simplices associated with \({\mathcal {X}}_{1-j}\). Moreover, for each \({\mathcal {S}}\in {\mathfrak {S}}_{1-j}\), we further partition the subset \({\mathcal {X}}_j \cap {\mathcal {S}}\) into subsets such that each falls into a single vertex region of \({\mathcal {S}}\). In each vertex region \(R_{M_C}({\mathsf {y}}_i)\), we find the closest point \(x_{[i]}\) to the face \(f_i\), provided \({\mathcal {X}}_j \cap R_{M_C}({\mathsf {y}}_i) \ne \emptyset\). Let S(D) denote the minimum dominating set and \(\gamma (D)\) the domination number of a digraph D. Also, let \(D_j[{\mathcal {S}}]\) be the digraph induced by the points of \({\mathcal {X}}_j\) inside the d-simplex \({\mathcal {S}}\), i.e. \({\mathcal {X}}_j \cap {\mathcal {S}}\). Recall that, as a result of Theorem 2, \(\gamma (D_j[{\mathcal {S}}]) \le d+1\) since \({\mathcal {X}}_j \cap {\mathcal {S}}\subset \cup _{i=1}^{d+1} N_{PE}\left( x_{[i]},r\right)\). To find \(S(D_j[{\mathcal {S}}])\), we sort all subsets of the set of such local extremum points, from smallest cardinality to largest, and check whether \({\mathcal {X}}_j \cap {\mathcal {S}}\) is contained in the union of the proximity regions of each subset.
For example, \(S(D_j[{\mathcal {S}}])=\left\{ x_{[l]}\right\}\) and \(\gamma (D_j[{\mathcal {S}}])=1\) if \({\mathcal {X}}_j \cap {\mathcal {S}}\subset N_{PE}\left( x_{[l]},r\right)\) for some \(l \in \{1,2,\ldots ,d+1\}\); else \(S(D_j[{\mathcal {S}}])=\left\{ x_{[l_1]},x_{[l_2]}\right\}\) and \(\gamma (D_j[{\mathcal {S}}])=2\) if \({\mathcal {X}}_j \cap {\mathcal {S}}\subset N_{PE}\left( x_{[l_1]},r\right) \cup N_{PE}\left( x_{[l_2]},r\right)\) for some 2-element subset \(\{l_1,l_2\}\) of \(\{1,2,\ldots ,d+1\}\); or else \(S(D_j[{\mathcal {S}}])=\{x_{[l_1]},x_{[l_2]},x_{[l_3]}\}\) and \(\gamma (D_j[{\mathcal {S}}])=3\) if \({\mathcal {X}}_j \cap {\mathcal {S}}\subset \cup _{i=1}^{3} N_{PE}\left( x_{[l_i]},r\right)\) for some 3-element subset \(\{l_1,l_2,l_3\}\) of \(\{1,2,\ldots ,d+1\}\), and so on. The resulting minimum dominating set of \(D_j\) for \({\mathcal {X}}_j \cap C_H({\mathcal {X}}_{1-j})\) is the union of these sets, i.e., \(S_j=\cup _{{\mathcal {S}}\in {\mathfrak {S}}_{1-j}} S(D_j[{\mathcal {S}}])\) and \(\gamma (D_j)=|S_j|\). Observe that \(S(D_j[{\mathcal {S}}]) = \emptyset\) if \({\mathcal {X}}_j \cap {\mathcal {S}}= \emptyset\). This algorithm is guaranteed to terminate as long as \(n_0\) and \(n_1\) are both finite.
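This subset search can be sketched as below for a single d-simplex (a sketch, not the paper's Algorithm 2 itself; we use the barycentric form of the PE region with center-of-mass vertex regions, and closed inequalities for numerical robustness, since the boundary cases have Lebesgue measure zero):

```python
import numpy as np
from itertools import combinations

def barycentric(S, x):
    A = np.vstack([S.T, np.ones(len(S))])
    return np.linalg.solve(A, np.append(x, 1.0))

def covers(S, x, y, r):
    # y in N_PE(x, r): barycentric test with center-of-mass vertex regions
    wx, wy = barycentric(S, x), barycentric(S, y)
    i = int(np.argmax(wx))
    return bool(np.all(wy >= 0) and wy[i] >= 1.0 - r * (1.0 - wx[i]))

def exact_mds_in_simplex(S, Z, r):
    """Exact MDS of the PE-PCD induced by the points Z inside the simplex S.

    First finds the local extremum point of each nonempty vertex region
    (smallest i-th barycentric coordinate, i.e. closest to face f_i), then
    checks subsets of these points by increasing cardinality.
    """
    W = np.array([barycentric(S, z) for z in Z])
    ext = []
    for i in range(len(S)):                                  # d + 1 vertex regions
        region = np.where(W.argmax(axis=1) == i)[0]
        if region.size:
            ext.append(region[np.argmin(W[region, i])])      # closest to f_i in region i
    for size in range(1, len(ext) + 1):
        for sub in combinations(ext, size):
            if all(any(covers(S, Z[s], z, r) for s in sub) for z in Z):
                return [Z[s] for s in sub]
    return [Z[s] for s in ext]
```

By Theorem 2, the search terminates with a set of cardinality at most \(d+1\) for \(r>1\).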
The level of reduction in the training data also depends on the magnitude of the expansion parameter r. In fact, the larger the magnitude of r, the more likely \(S(D_j[{\mathcal {S}}])\) is to have smaller cardinality, i.e. the greater the reduction in the data set. Thus, we have the following stochastic ordering:
Theorem 3
Let \({\mathcal {S}}\) be a d-simplex in \({\mathbb {R}}^d\) for \(d>0\) with \(d+1\) non-coplanar vertices, and let \({\mathcal {Z}}_n=\{X_1,X_2,\ldots ,X_n\}\) be a random sample from a continuous distribution F whose support is \({\mathcal {S}}\). Also, let the PE-PCD be defined with vertices \({\mathcal {Z}}_n\) (i.e., \({\mathcal {Z}}_n\) is from the target class j) and expansion parameter \(r\ge 1\). Denote the domination number of this PE-PCD by \(\gamma _d(r)=\gamma ({\mathcal {Z}}_n,D_j,r)\). Then for \(r_1<r_2\), we have \(\gamma _d(r_2) \le ^{ST} \gamma _d(r_1)\), where \(\le ^{ST}\) stands for "stochastically smaller than".
Proof
Suppose \(r_1<r_2\). Then for any \(x\in {\mathcal {S}}\), we have \(N_{PE}(x,r_1) \subseteq N_{PE}(x,r_2)\). Let \(A_S:=\{x \in {\mathcal {S}}: N_{PE}(x,r_1) \subsetneq N_{PE}(x,r_2) \}\). Since \(r_1<r_2\), we have \(\lambda (A_S)>0\), where \(\lambda\) is the Lebesgue measure. For any \(t \in {\mathbb {Z}}_+\), \(\gamma _d(r) \le t\) iff there exist \(x_1,x_2,\ldots ,x_t \in {\mathcal {Z}}_n\) such that \({\mathcal {Z}}_n \subset \cup _{i=1}^{t} N_{PE}(x_i,r)\). But since \(r_1<r_2\), \(\cup _{i=1}^{t} N_{PE}(x_i,r_1) \subseteq \cup _{i=1}^{t} N_{PE}(x_i,r_2)\); hence \(\gamma _d(r_1) \le t\) implies \(\gamma _d(r_2) \le t\), so \(P(\gamma _d(r_1) \le t) \le P(\gamma _d(r_2) \le t)\) for all \(t>0\). Furthermore, strict inequality holds for at least one \(t \in {\mathbb {Z}}_+\), e.g., \(t=1\), since \(N_{PE}(X,r_2)\) is more likely to cover all of \({\mathcal {Z}}_n\) than \(N_{PE}(X,r_1)\) for any \(X \in {\mathcal {Z}}_n\), as \(N_{PE}(x,r_1) \subsetneq N_{PE}(x,r_2)\) for \(x \in A_S\). Hence, the desired result follows. \(\square\)
Algorithm 2 ignores the target class points outside the convex hull of the non-target class. This is not the case with Algorithm 1, since the map \(N_S(\cdot ,\theta )\) is defined for all points in \({\mathcal {X}}_j\) whereas the original PE proximity map \(N_{PE}(\cdot ,r)\) is not. Hence, with Algorithm 2, the prototype set \(S_j\) only yields a reduction in the set \({\mathcal {X}}_j \cap C_H({\mathcal {X}}_{1-j})\). We tackle this issue with various approaches. One approach is to define covering methods with two proximity maps: the PE proximity map and another one that does not require the target class points to be inside the convex hull of the non-target class points, e.g. spherical proximity regions (i.e. the proximity maps \(N_S(\cdot ,\theta )\)).
Algorithm 3 uses both maps \(N_{PE}(\cdot ,r)\) and \(N_S(\cdot ,\theta )\) to generate a prototype set \(S_j\) for the target class points \({\mathcal {X}}_j\). There are two separate MDSs: \(S^{\text {in}}_j\), which is an exact MDS, and \(S^{\text {out}}_j\), which is an approximate MDS. The two maps are associated with two distinct digraphs such that \({\mathcal {X}}_j \cap C_H({\mathcal {X}}_{1-j})\) constitutes the vertex set of one digraph and \({\mathcal {X}}_j {\setminus } C_H({\mathcal {X}}_{1-j})\) constitutes the vertex set of the other, where the non-target class is always \({\mathcal {X}}_{1-j}\). Algorithm 2 finds a prototype set \(S^{\text {in}}_j\) for \({\mathcal {X}}_j \cap C_H({\mathcal {X}}_{1-j})\), and then the prototype set \(S^{\text {out}}_j\) for \({\mathcal {X}}_j {\setminus } C_H({\mathcal {X}}_{1-j})\) is merged with it to give the overall prototype set, i.e. \(S_j=S^{\text {in}}_j \cup S^{\text {out}}_j\), as in Algorithm 3. Note that the set \(S_j\) is an approximate minimum dominating set, since \(S^{\text {out}}_j\) is an approximate minimum dominating set.
Algorithm 4 uses only the PE proximity map \(N_{PE}(\cdot ,r)\): the original version inside \(C_H({\mathcal {X}}_{1-j})\) and the extended version outside \(C_H({\mathcal {X}}_{1-j})\). The cover is a mixture of d-simplices and d-polytopes. Given a set of d-simplices \({\mathfrak {S}}^{\text {in}}_{1-j}\) and a set of outer simplices \({\mathfrak {S}}^{\text {out}}_{1-j}\), we find the respective local extremum points of each d-simplex and each outer simplex. A local extremum point of a d-simplex is, within a vertex region, the closest data point to the face opposite the relevant vertex; a local extremum point of an outer simplex is the data point furthest from the facet that constitutes the bottom of the outer simplex. Local extremum points of d-simplices are found as in Algorithm 2, and then we find the local extremum points of the remaining points to get the prototype set for the entire set of target class points \({\mathcal {X}}_j\). The following theorem provides a result on the local extremum points in an outer simplex \({\mathscr {F}}\). Note that, in Algorithm 4, the set \(S_j\) is the exact minimum dominating set, since both \(S^{\text {in}}_j\) and \(S^{\text {out}}_j\) are exact MDSs for the PE-PCDs induced by \({\mathcal {X}}_j \cap C_H({\mathcal {X}}_{1-j})\) and \({\mathcal {X}}_j {\setminus } C_H({\mathcal {X}}_{1-j})\), respectively.
Theorem 4
Let \({\mathcal {Z}}_n=\{z_1,z_2,\ldots , z_n\} \subset {\mathbb {R}}^d\), let \({\mathcal {F}}\) be a facet of \(C_H({\mathcal {X}}_{1-j})\), and let \({\mathscr {F}}\) be the associated outer simplex such that \({\mathcal {Z}}_n \subset {\mathscr {F}}\). Then the point of \({\mathcal {Z}}_n\) furthest from the facet \({\mathcal {F}}\) constitutes a minimum dominating set \(S_{MD}\) of the PE-PCD restricted to \({\mathscr {F}}\) and is found in linear time. Moreover, the domination number of this restricted PE-PCD equals 1 (provided \(n>0\)).
Proof
We show that there is a point \(s \in {\mathcal {Z}}_n\) such that \({\mathcal {Z}}_n \subset N_{PE}(s,r)\) for all \(r \in (1,\infty )\). Note that \(\eta (x,{\mathcal {F}})\) denotes the hyperplane through x and parallel to \({\mathcal {F}}\). Thus, for \(x,x^* \in {\mathscr {F}}\), observe that \(d(x,{\mathcal {F}}) < d(x^*,{\mathcal {F}})\) if and only if \(d(\eta (x,{\mathcal {F}}),{\mathcal {F}}) < d(\eta (x^*,{\mathcal {F}}),{\mathcal {F}})\). It then follows that \(N_{PE}(x,r) \subsetneq N_{PE}(x^*,r)\), which implies \(\{x,x^*\} \subset N_{PE}(x^*,r)\). Then, for \(s := {{\,\mathrm{argmax}\,}}_{x \in {\mathcal {Z}}_n} \> d(x,{\mathcal {F}})\), we have \({\mathcal {Z}}_n \subset N_{PE}(s,r)\). So \(S_{MD}=\{s\}\) and \(\gamma =1\). Also, since s is the point of \({\mathcal {Z}}_n\) furthest from the facet \({\mathcal {F}}\), finding the MDS is linear in n. \(\square\)
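The linear-time claim of Theorem 4 amounts to a single scan for the point farthest from the facet; a sketch (the facet is given by a point on it and a normal vector, and the names are ours):

```python
import numpy as np

def outer_simplex_mds(p, normal, Z):
    """Index of the MDS point of the PE-PCD restricted to an outer simplex:
    the point of Z farthest from the hull facet F (through p, with normal
    `normal`) dominates all of Z for r > 1; found in one linear scan."""
    n = np.asarray(normal, dtype=float)
    n /= np.linalg.norm(n)
    dists = np.abs((np.asarray(Z) - p) @ n)      # perpendicular distances to F
    return int(np.argmax(dists))
```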
For \(r=1\), some \({\mathcal {Z}}_n\) points may fall on \(\eta (s,{\mathcal {F}})\) (i.e., on the boundary of \(N_{PE}(s,r)\)), so \(\{s\}\) is not a dominating set in such a case; however, \(\eta (s,{\mathcal {F}})\) has Lebesgue measure zero in \({\mathbb {R}}^d\).
Given Theorems 2 and 4, Algorithm 4 may be the most appealing one, since it gives the exact minimum dominating set for the complete target class j. However, the following theorem shows that the cardinality of such sets increases exponentially with the dimensionality of the data set, even though it is polynomial in the number of observations.
Theorem 5
Algorithm 4 finds an exact minimum dominating set \(S_j\) of the target class points \({\mathcal {X}}_j\) in \({\mathcal {O}}\left( d^k n^2_{1-j} + 2^d n_{1-j}^{\lceil d/2\rceil }\right)\) time for \(k >1\), where \(|S_j| = {\mathcal {O}}\left( dn_{1-j}^{\lceil d/2\rceil }\right)\).
Proof
A Delaunay tessellation of the non-target class points \({\mathcal {X}}_{1-j} \subset {\mathbb {R}}^d\) can be found in \({\mathcal {O}}(d^k n^2_{1-j})\) time with the Bowyer-Watson algorithm for some \(k>1\), depending on the complexity of the algorithm that finds the circumcenter of a d-simplex (Watson 1981). The resulting tessellation with \(n_{1-j}\) vertices has at most \({\mathcal {O}}\left( n_{1-j}^{\lceil d/2\rceil }\right)\) simplices and at most \({\mathcal {O}}(n_{1-j}^{\lfloor d/2\rfloor })\) facets (Seidel 1995). Hence, the union of the set of d-simplices \({\mathfrak {S}}^{\text {in}}_{1-j}\) and the set of outer simplices \({\mathfrak {S}}^{\text {out}}_{1-j}\) has cardinality at most \({\mathcal {O}}\left( n_{1-j}^{\lceil d/2\rceil }\right)\). Now, for each simplex \({\mathcal {S}}\in {\mathfrak {S}}^{\text {in}}_{1-j}\) or each outer simplex \({\mathscr {F}}\in {\mathfrak {S}}^{\text {out}}_{1-j}\), the local extremum points are found in linear time. Each simplex is divided into \(d+1\) vertex regions, each having its own extremum point. Hence, a minimum cardinality subset of the set of local extremum points has cardinality at most \(d+1\) and is found in a brute-force fashion. For outer simplices, however, the local extremum point is the point furthest from the associated facet of the Delaunay tessellation. Thus, it takes at most \({\mathcal {O}}(2^d)\) and \({\mathcal {O}}(n)\) time to find the exact minimum dominating sets, which are subsets of the local extremum points, for each (inner) simplex and each outer simplex, respectively. Hence, the desired result follows. \(\square\)
Theorem 5 shows the exponential increase in the number of prototypes as the dimensionality increases. Hence, the complexity of the class cover model also increases exponentially, which might lead to overfitting. We investigate this issue further in Sects. 6 and 7.
PCD covers
We establish class covers with the PE proximity map \(N_{PE}(\cdot ,r)\) and the spherical proximity map \(N_{S}(\cdot ,\theta )\). We define two types of class covers: one is the composite cover, which covers the points in \({\mathcal {X}}_j \cap C_H({\mathcal {X}}_{1-j})\) with PE proximity maps and the points in \({\mathcal {X}}_j {\setminus } C_H({\mathcal {X}}_{1-j})\) with spherical proximity maps, and the other is the standard cover, which uses the PE proximity maps for all points in \({\mathcal {X}}_j\). We use these two types of covers to establish a specific type of classifier that is more appealing in the sense of prototype selection.
Our composite covers are mixtures of simplicial and spherical proximity regions. Specifically, given a set of simplices and a set of spheres, the composite cover is the union of both sets, which constitute proximity regions of two separate PCD families, hence the name composite cover. Let \(N_{\text {in}}(\cdot )\) and \(N_{\text {out}}(\cdot )\) be the proximity maps associated with the sets \({\mathcal {X}}_j \cap C_H({\mathcal {X}}_{1-j})\) and \({\mathcal {X}}_j {\setminus } C_H({\mathcal {X}}_{1-j})\), respectively. The set \(Q_j\) is partitioned into two: the cover \(Q^{\text {in}}_j\) of target class points inside the convex hull of non-target class points, i.e., \({\mathcal {X}}_j \cap C_H({\mathcal {X}}_{1-j})\), and the cover \(Q^{\text {out}}_j\) of target class points outside, i.e., \({\mathcal {X}}_j {\setminus } C_H({\mathcal {X}}_{1-j})\). Let \(Q^{\text {in}}_j:=\cup _{x \in {\mathcal {X}}_j \cap C_H({\mathcal {X}}_{1-j})}N_{\text {in}}(x)\) and \(Q^{\text {out}}_j:=\cup _{x \in {\mathcal {X}}_j {\setminus } C_H({\mathcal {X}}_{1-j})} N_{\text {out}}(x)\) such that \(Q_j:=Q^{\text {in}}_j \cup Q^{\text {out}}_j\). Hence, in composite covers, target class points inside \(C_H({\mathcal {X}}_{1-j})\) are covered with the PE proximity map \(N_{\text {in}}(\cdot )=N_{PE}(\cdot ,r)\), and the remaining points are covered with the spherical proximity map \(N_{\text {out}}(\cdot )=N_{S}(\cdot ,\theta )\). Given the covers \(Q^{\text {in}}_j\) and \(Q^{\text {out}}_j\), let \(C^{\text {in}}_j\) and \(C^{\text {out}}_j\) be the class covers with lower complexity associated with the dominating sets \(S^{\text {in}}_j\) and \(S^{\text {out}}_j\): \(\displaystyle C^{\text {in}}_j:= \cup _{s \in S^{\text {in}}_j} N_{\text {in}}(s)\) and \(\displaystyle C^{\text {out}}_j:= \cup _{s \in S^{\text {out}}_j} N_{\text {out}}(s)\). Then, the composite cover is given by
An illustration of the class covers \(C_0\) and \(C_1\) with \(N_{\text {in}}(\cdot )=N_{PE}(\cdot ,r=2)\) and \(N_{\text {out}}(\cdot )=N_{S}(\cdot ,\theta =1)\) is given in Fig. 5b.
By definition, the spherical proximity map \(N_{S}(\cdot ,\theta )\) yields class covers for all points in \({\mathcal {X}}_j\). Figure 5a illustrates the class covers of the map \(N_{S}(\cdot ,\theta =1)\). We call covers that consist of only a single type of proximity map standard covers. Hence, the standard cover of the PE-PCD \(D_j\) is a union of \(d\)-simplices and \(d\)-polytopes:
Here, \(N_{\text {in}}(\cdot )=N_{\text {out}}(\cdot )=N_{PE}(\cdot ,r)\). An illustration is given in Fig. 5c.
PCD covers can easily be generalized to the multi-class case with J classes. To establish the set of covers \({\mathcal {C}} = \{C_1,C_2, \ldots , C_J\}\), the set of PCDs \(\mathscr {D}=\{D_1,\ldots ,D_J\}\), and the set of MDSs \({\mathfrak {S}}=\{S_1,S_2,\ldots ,S_J\}\) associated with a set of classes with labels \({\mathfrak {C}} = \{1,2,\ldots ,J\}\), we gather the classes into two classes as \(C_T=j\) and \(C_{NT}=\cup _{t \ne j} \{t\}\) for \(t,j=1,\ldots ,J\). We refer to \(C_T\) and \(C_{NT}\) as the target and non-target classes, respectively. More specifically, the target class is the class whose cover we wish to find, and the non-target class is the union of the remaining classes. We thus transform the multi-class case into the two-class setting and find the cover of the jth class, \(C_j\), for each \(j=1,2,\ldots ,J\).
Classification with PCDs
The elements of the minimum dominating set \(S_j\) are the selected prototypes for the problem of modelling the class conditional discriminant regions via a collection of proximity regions (balls, simplices, polytopes, etc.). The size of each region represents an estimate of the domain of influence, which is the region in which a given prototype should influence the class labelling. Our semiparametric classifiers depend on the class covers given by these proximity regions. We define various classifiers based on the class covers (composite or standard) and some other classification methods. We approach the classification of points in \({\mathbb {R}}^d\) in two ways:
 Hybrid classifiers:
Given the class covers \(C^{\text {in}}_0\) and \(C^{\text {in}}_1\) associated with the classes labeled 0 and 1, we classify a given point \(z \in {\mathbb {R}}^d\) with \(g_P\) if \(z \in C^{\text {in}}_0 \cup C^{\text {in}}_1\), and with \(g_A\) otherwise. Here, \(g_P\) is the pre-classifier and \(g_A\) is an alternative classifier.
 Cover classifiers:
These classifiers are constructed from class covers only; that is, a given point \(z \in {\mathbb {R}}^d\) is classified as \(g_C(z)=j\) if \(z \in C_j {\setminus } C_{1-j}\) or if \(\rho (z,C_j) < \rho (z,C_{1-j})\); hence, the class of the point z is estimated as j if z is only in the cover \(C_j\) or is closer to \(C_j\) than to \(C_{1-j}\). Here, \(\rho (z,C_j)\) is a dissimilarity measure between the point z and the cover \(C_j\). Cover classifiers depend on the type of cover, which is either composite or standard.
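As a concrete illustration of this decision rule, the sketch below implements a two-class cover classifier with spherical proximity regions (balls stand in for the covers purely to keep the example short; the function names and the dissimilarity \(\rho (z,B(c,\varepsilon ))=d(z,c)/\varepsilon\) for balls follow the conventions of the text, but the code itself is our own assumption-laden sketch, not the authors' implementation):

```python
import numpy as np

def rho_ball(z, center, radius):
    """Convex distance to a ball: < 1 inside, > 1 outside."""
    return np.linalg.norm(z - center) / radius

def cover_classify(z, prototypes):
    """Cover classifier g_C: label z with the class whose cover is closest,
    i.e. the class attaining the smaller dissimilarity rho(z, C_j).
    `prototypes[j]` lists (center, radius) pairs for class j's cover."""
    rho = [min(rho_ball(z, c, r) for c, r in prototypes[j]) for j in (0, 1)]
    return int(rho[1] < rho[0])

protos = {0: [(np.array([0.0, 0.0]), 1.0)],
          1: [(np.array([3.0, 0.0]), 1.0)]}
label_a = cover_classify(np.array([0.5, 0.0]), protos)  # inside class-0 cover
label_b = cover_classify(np.array([2.8, 0.0]), protos)  # inside class-1 cover
```

A point inside exactly one cover has \(\rho < 1\) for that class only, so the minimum-dissimilarity rule subsumes the "z is only in cover \(C_j\)" case.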
We incorporate PE-PCDs to establish both of these types of classifiers; hence, we refer to them as hybrid PE-PCD and cover PE-PCD classifiers. Since the PE proximity maps were originally defined for points in \({\mathcal {X}}_j \cap C_H({\mathcal {X}}_{1-j})\), we develop hybrid PE-PCD classifiers to account for points outside of the convex hull of the non-target class in a convenient fashion. However, as we shall see later, cover PE-PCD classifiers have more appealing properties than hybrid PE-PCD classifiers in terms of both efficiency and classification performance. Nonetheless, we consider and compare both types of classifiers, but first we define the PE-PCD pre-classifier.
PE-PCD pre-classifier
Let \(\rho (z,C)\) be a dissimilarity measure between z and the class cover C. The PE-PCD pre-classifier is given by
Here, \(g_P(z)=-1\) denotes a “no decision” case. Given that the class covers \(C^{\text {in}}_0\) and \(C^{\text {in}}_1\) are the unions of PE proximity regions \(N_{PE}(x,r)\) of the points in the dominating sets \(S^{\text {in}}_0\) and \(S^{\text {in}}_1\), the closest cover to a new point z is found by first finding the proximity region of a point in the cover closest to z:
which is expressed in terms of a dissimilarity measure between the point z and the region \(N_{PE}(s)\). For such measures, we employ convex distance functions. Let H be a convex set in \({\mathbb {R}}^d\) with \(x \in H\), where the point x may be viewed as the center of the set H. Then, the convex distance (or dissimilarity) between z and H is defined by
where \(d(\cdot ,\cdot )\) is the Euclidean distance and t is the point of intersection of the half line \(L(x,z):=\{x+\alpha (z-x):\alpha \in [0,\infty )\}\) with \(\partial (H)\). An illustration is given in Fig. 6 for several convex sets, including balls and simplices in \({\mathbb {R}}^2\).
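The convex distance only needs a membership oracle for H. The following sketch (our own illustration with hypothetical names, assuming H is bounded with x in its interior) locates the boundary point t by bisection along the half line and recovers \(\rho (z,H)=d(x,z)/d(x,t)\):

```python
import numpy as np

def convex_distance(z, x, contains, tol=1e-9):
    """rho(z, H) = d(x, z) / d(x, t), with t the point where the half line
    from the center x through z crosses the boundary of the (bounded)
    convex set H.  `contains` is a membership oracle; x must be interior."""
    z, x = np.asarray(z, float), np.asarray(x, float)
    if np.allclose(z, x):
        return 0.0
    lo, hi = 0.0, 1.0
    while contains(x + hi * (z - x)):   # expand until we step outside H
        hi *= 2.0
    while hi - lo > tol:                # bisect for the boundary crossing
        mid = 0.5 * (lo + hi)
        if contains(x + mid * (z - x)):
            lo = mid
        else:
            hi = mid
    # at convergence lo ~ d(x,t)/d(x,z), hence rho = 1/lo
    return 1.0 / lo

ball = lambda p: np.linalg.norm(p) <= 1.0   # unit ball centered at origin
r_in = convex_distance([0.5, 0.0], [0.0, 0.0], ball)   # interior: rho ~ 0.5
r_out = convex_distance([3.0, 0.0], [0.0, 0.0], ball)  # exterior: rho ~ 3.0
```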
For the spherical proximity map \(N_{S}(\cdot ,\theta )\), the dissimilarity function is obtained by putting the radius of the ball constituting the spherical proximity region, \(d(x,t)=\varepsilon _{\theta }(x)\), into the denominator (Priebe et al. 2003a). For \(d\)-simplices, however, we characterize the dissimilarity measure in terms of the barycentric coordinates of z with respect to \({\mathcal {S}}(x)=N_{PE}(x,r)\).
Proposition 3
Let \(\{t_1,t_2, \ldots ,t_{d+1}\} \subset {\mathbb {R}}^d\) be a set of non-coplanar points that are the vertices of the simplex \({\mathcal {S}}(x)=N_{PE}(x,r)\) with centroid \(M_C(x) \in {\mathcal {S}}(x)^o\). Then, for \(z \in {\mathbb {R}}^d\) and \(t \in \partial ({\mathcal {S}}(x))\), the convex distance between z and \({\mathcal {S}}(x)\), defined as \(\rho (z,{\mathcal {S}}(x))=d(M_C(x),z)/d(M_C(x),t)\), satisfies
$$\begin{aligned} \rho (z,{\mathcal {S}}(x)) = 1-(d+1)\,w^{(k)}_{{\mathcal {S}}(x)}(z), \end{aligned}$$
where t is on the face \(f_k\) of \({\mathcal {S}}(x)\) closest to z, and \(w^{(k)}_{{\mathcal {S}}(x)}(z)\) is the kth barycentric coordinate of z with respect to \({\mathcal {S}}(x)\). Moreover, \(\rho (z,{\mathcal {S}}(x)) \le 1\) if and only if \(z \in {\mathcal {S}}(x)\).
Proof
Let the line segment \(L(M_C(x),z)\) cross \(\partial ({\mathcal {S}}(x))\) at the point \(t \in f_k\), where \(f_k\) is the face of \({\mathcal {S}}(x)\) opposite the vertex \(t_k\). Thus, for \(\alpha _i \in (0,1)\) and \(\beta > 0\),
$$\begin{aligned} z=(1-\beta )\,M_C(x)+\beta \,t, \qquad t=\sum _{i \ne k} \alpha _i\, t_i \ \text { with } \sum _{i \ne k} \alpha _i=1. \end{aligned}$$
Here, note that \(\beta =d(M_C(x),z)/d(M_C(x),t)=\rho (z,{\mathcal {S}}(x))\), since z is a convex combination of t and \(M_C(x)\). Also, since \(M_C(x)\) is the centroid,
$$\begin{aligned} M_C(x)=\frac{1}{d+1}\sum _{i=1}^{d+1} t_i. \end{aligned}$$
Hence, \((1-\beta )/(d+1)=w^{(k)}_{{\mathcal {S}}(x)}(z)\), which implies \(\beta =1-(d+1)\,w^{(k)}_{{\mathcal {S}}(x)}(z)\).
For the second part, first assume \(\rho (z,{\mathcal {S}}(x)) \le 1\). Then \(d(M_C(x),z) \le d(M_C(x),t)\), and \(M_C(x)\) is in the interior of \({\mathcal {S}}(x)\) since \({\mathcal {S}}(x)\) is convex. Since t is on the face \(f_k\) of \({\mathcal {S}}(x)\) closest to z, the point z falls on the line segment joining \(M_C(x)\) and t, denoted \([M_C(x),t]\), which lies in \({\mathcal {S}}(x)\) as well. Hence, \(z \in {\mathcal {S}}(x)\). For the reverse direction, assume \(z \in {\mathcal {S}}(x)\). Then there exists a face \(f_k\) of \({\mathcal {S}}(x)\) closest to z. Since \(f_k\) is the face closest to z and \(M_C(x)\) is in the interior of the convex set \({\mathcal {S}}(x)\), the line segment \([M_C(x),t]\) lies in \({\mathcal {S}}(x)\), and z also lies on this segment. It then follows that \(d(M_C(x),z) \le d(M_C(x),t)\), so \(\rho (z,{\mathcal {S}}(x)) = d(M_C(x),z) / d(M_C(x),t) \le 1\). \(\square\)
Observe that in Proposition 3, it also follows (from the proof) that \(\rho (z,{\mathcal {S}}(x)) < 1\) iff \(z \in {\mathcal {S}}(x)^o\) and \(\rho (z,{\mathcal {S}}(x)) = 1\) iff \(z \in \partial ({\mathcal {S}}(x))\).
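Proposition 3 gives a closed form that avoids any boundary search: compute the barycentric coordinates of z and evaluate \(1-(d+1)\min _k w^{(k)}\), the minimum coordinate corresponding to the closest face. A short sketch with names of our own choosing (the linear system solved below is the standard definition of barycentric coordinates, not code from the paper):

```python
import numpy as np

def barycentric(z, vertices):
    """Barycentric coordinates of z w.r.t. a d-simplex given by its
    (d+1) x d vertex array."""
    V = np.asarray(vertices, float)
    d = V.shape[1]
    # solve  sum_i w_i v_i = z  together with  sum_i w_i = 1
    A = np.vstack([V.T, np.ones(d + 1)])
    b = np.append(np.asarray(z, float), 1.0)
    return np.linalg.solve(A, b)

def rho_simplex(z, vertices):
    """Dissimilarity of Proposition 3: rho = 1 - (d+1) * min_k w_k(z).
    rho < 1 iff z is interior, = 1 on the boundary, > 1 outside."""
    w = barycentric(z, vertices)
    d = len(w) - 1
    return 1.0 - (d + 1) * w.min()

tri = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]      # a 2-simplex in R^2
rho_centroid = rho_simplex([1 / 3, 1 / 3], tri)  # centroid of the simplex
rho_vertex = rho_simplex([1.0, 0.0], tri)        # vertex, on the boundary
```

The centroid gives \(\rho =0\) (all coordinates equal \(1/(d+1)\)), a vertex gives \(\rho =1\), and any exterior point has a negative coordinate and hence \(\rho >1\), matching the remark above.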
For a (convex) proximity region \(N_{PE}(x,r)\), the dissimilarity measure \(\rho (z,{\mathcal {S}}(x))=\rho (z,N_{PE}(x,r))\) indicates whether or not the point z is in the proximity region \(N_{PE}(x,r)\), since \(\rho (z,{\mathcal {S}}(x)) < 1\) if \(z \in N_{PE}(x,r)\) and \(\rho (z,{\mathcal {S}}(x)) \ge 1\) otherwise. Hence, the PE-PCD pre-classifier \(g_P\) may be simplified to
Here, without loss of generality, \(z \in C^{\text {in}}_0 {\setminus } C^{\text {in}}_1\) if and only if \(\rho (z,C^{\text {in}}_0) < 1\) and \(\rho (z,C^{\text {in}}_1) \ge 1\). Letting \(\rho (z,x):=\rho (z,{\mathcal {S}}(x))\) denote the dissimilarity between x and z, the measure \(\rho (\cdot ,\cdot )\) violates the symmetry axiom of a metric, since \(\rho (x,z) \ne \rho (z,x)\) unless \(d(x,t(x)) = d(z,t(z))\), where the proximity regions \(N_{PE}(x,r)\) and \(N_{PE}(z,r)\) intersect the lines \(L(M_C(x),z)\) and \(L(M_C(z),x)\) at the points t(x) and t(z), respectively.
Hybrid PE-PCD classifiers
Constructing hybrid classifiers serves many purposes. Some classifiers are designed to solve harder classification problems by gathering many weak learners (such classifiers are often known as ensemble classifiers), while others have advantages only when combined with another single classifier (Woźniak et al. 2014). Our hybrid classifiers are of the latter type. The PE-PCD pre-classifier, \(g_P\), is able to classify points in the union of the class covers, \(C^{\text {in}}_0 \cup C^{\text {in}}_1\); however, classifying the remaining points in \({\mathbb {R}}^d\) requires incorporating an alternative classifier, often one that works for all points in \({\mathbb {R}}^d\). We use the PE-PCD pre-classifier, \(g_P(\cdot )\), to classify all points of the test data, and if no decision is made for some of these points, we classify them with the alternative classifier \(g_A\). Hence, let \(g_H\) be the hybrid PE-PCD classifier such that
That is, for “no decision” cases where \(g_P(z)=-1\), we rely on the alternative classifier \(g_A\); we use the \(k\hbox {NN}\), SVM and CCCD classifiers as alternative classifiers. The parameters are k, the number of nearest neighbors used in the majority vote of the \(k\hbox {NN}\) classifier; \(\gamma\), the scaling parameter of the radial basis function (RBF) kernel of the SVM classifier; and \(\theta\), the parameter of the CCCD classifier that regulates the size of each ball, as described in Sect. 3.1.
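A minimal sketch of this hybrid rule follows, with balls standing in for the PE proximity regions and a plain 1-NN rule standing in for the alternative classifier \(g_A\); all function names are ours, and the sketch only illustrates the control flow "decide with \(g_P\) where possible, otherwise fall back to \(g_A\)":

```python
import numpy as np

def g_pre(z, covers):
    """PE-PCD pre-classifier (simplified form): class j if z lies in
    exactly one cover (dissimilarity < 1), -1 ('no decision') otherwise.
    Balls stand in for the simplicial proximity regions here."""
    inside = [any(np.linalg.norm(z - c) / r < 1 for c, r in covers[j])
              for j in (0, 1)]
    if inside[0] != inside[1]:
        return 0 if inside[0] else 1
    return -1

def g_hybrid(z, covers, X, y):
    """Hybrid classifier g_H: use g_P where it decides, otherwise fall
    back to an alternative classifier g_A (here a plain 1-NN rule)."""
    label = g_pre(z, covers)
    if label != -1:
        return label
    return int(y[np.argmin(np.linalg.norm(X - z, axis=1))])

covers = {0: [(np.array([0.0, 0.0]), 1.0)],
          1: [(np.array([4.0, 0.0]), 1.0)]}
X = np.array([[0.0, 0.0], [4.0, 0.0]])
y = np.array([0, 1])
in_cover = g_hybrid(np.array([0.2, 0.0]), covers, X, y)   # decided by g_P
fallback = g_hybrid(np.array([2.5, 0.0]), covers, X, y)   # decided by 1-NN
```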
Composite and standard cover PE-PCD classifiers
We propose PE-PCD classifiers \(g_C\) based on composite and standard covers. The classifier \(g_C\) is defined as
The cover is based on either composite covers or standard covers, wherein \({\mathcal {X}}_j \subset C_j\) for \(j=0,1\); hence, a decision can be made without an alternative classifier. Note that composite cover PE-PCD classifiers are, in fact, hybrid classifiers of a different kind, modelled solely by class covers but with multiple types of PCDs. Compared to hybrid PE-PCD classifiers, cover PE-PCD classifiers have many appealing properties. Since a reduction is performed over all target class points \({\mathcal {X}}_j\), classifying a new point \(z \in {\mathbb {R}}^d\) is computationally faster and more efficient depending on the percentage of reduction, whereas an alternative classifier might not provide such a reduction.
Note that, given the multi-class prototype sets \(S_j\), the two-class cover PE-PCD classifier \(g_C\) can be modified for the multi-class case as
for a general proximity map \(N(\cdot )\).
Consistency analysis
In this section, we prove the consistency of cover and hybrid PCD classifiers when the two class conditional distributions are strictly \(\delta\)-separable. For \(\delta \in [0,\infty )\), the regions \(A,B \subset {\mathbb {R}}^d\) are called \(\delta\)-separable if
$$\begin{aligned} \inf _{x \in A,\, y \in B} d(x,y) \ge \delta , \end{aligned}$$
and strictly \(\delta\)-separable if, moreover, \(\delta >0\). Notice that the definition of \(\delta\)-separability allows overlap between the sets A and B when \(\delta =0\). Furthermore, if the continuous distributions F and G have \(\delta\)-separable supports, then they are called \(\delta\)-separable distributions, and if \(\delta > 0\), they are called strictly \(\delta\)-separable distributions (Devroye et al. 1996).
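For finite samples, \(\delta\)-separability can be probed directly via the smallest inter-class distance, which is at least \(\delta\) whenever the samples come from strictly \(\delta\)-separable supports. A small sketch of our own (names hypothetical):

```python
import numpy as np

def min_pairwise_distance(A, B):
    """Smallest Euclidean distance between a point of A and a point of B;
    an empirical lower bound witness for delta-separability."""
    A, B = np.asarray(A, float), np.asarray(B, float)
    diff = A[:, None, :] - B[None, :, :]          # all pairwise differences
    return float(np.sqrt((diff ** 2).sum(axis=2)).min())

A = [[0.0, 0.0], [0.2, 0.0]]
B = [[1.0, 0.0], [1.3, 0.0]]
delta_hat = min_pairwise_distance(A, B)           # 0.8 for these samples
```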
Recall that cover classifiers are characterized by PCDs associated with proximity regions N(x) for \(x \in {\mathbb {R}}^d\), and thus, the consistency of such PCD classifiers depends on the proximity map \(N(\cdot )\). We require that the proximity map \(N(\cdot )\) satisfies the following properties:
 P1:
For all \(x \in {\mathbb {R}}^d\), the proximity region N(x) is either an open set or \(N(x)=\{x\}\), and x is in the interior of N(x) almost everywhere (a.e.) with respect to Lebesgue measure.
 P2:
For two classes, the proximity map N(x) is a function of the target class point x that also depends on the non-target class points y in such a way that \(N(x) \cap \{y\} = \emptyset\) a.e. with respect to Lebesgue measure.
Notice that P1 implies that N(x) is an open set a.e. with respect to \({\mathbb {R}}^d\)-Lebesgue measure, and P2 implies that the set \(\{(x,y):N(x) \cap \{y\} \ne \emptyset \}\) has zero \({\mathbb {R}}^{2d}\)-Lebesgue measure. Both \(N_{S}(\cdot ,\theta )\) for \(\theta \in (0,1]\) and \(N_{PE}(\cdot ,r)\) for \(r \in (1,\infty )\) satisfy properties P1 and P2. These properties will be useful in showing that the classifiers based on our class covers attain the Bayes-optimal classification performance for classes with (strictly) \(\delta\)-separable continuous distributions.
In the rest of this section, we assume that we have a random sample \({\mathcal {X}}_j\) of size \(n_j\) from class j with continuous distribution \(F_j\) whose support is \(s(F_j)\) for \(j=0,1\). Recall that the PCD class cover for class j based on \(N(\cdot )\) is \(C_j=\cup _{x \in S_j}N(x)\) with \(S_j\) being a prototype set of points for \({\mathcal {X}}_j\) (so \(S_j \subseteq {\mathcal {X}}_j\)). Note that all target class (say, class j) points reside inside the class cover \(C_j\) w.p. 1 by P1, i.e. \({\mathcal {X}}_j \subset C_j\) w.p. 1 for all \(n_j>0\). Hence, we have the following lemma.
Lemma 1
Let \({\mathcal {X}}_j=\{X_1,X_2,\ldots ,X_{n_j}\}\) be a random sample of size \(n_j\) from a class j with a continuous distribution \(F_j\) whose support is \(s(F_j) \subseteq {\mathbb {R}}^d\). Also, let the class cover for class j based on the proximity map \(N(\cdot )\) be denoted \(C_j:=C({\mathcal {X}}_j)\) such that \(C_j=\cup _{X \in S_j}N(X)\) with prototype set \(S_j\subseteq {\mathcal {X}}_j\). If \(N(\cdot )\) satisfies property P1, then \(\lambda (s(F_j) {\setminus } C_j) \rightarrow 0\) a.s. as \(n_j \rightarrow \infty\).
Proof
Assume that \(N(\cdot )\) satisfies property P1. Then, N(X) is open w.p. 1, and \(P(X \in N(X))=1\) for \(X \sim F_j\), which implies \(P({\mathcal {X}}_j \subset C_j)=1\) for all \(n_j>0\). Assume, for a contradiction, that \(\lambda (s(F_j) {\setminus } C_j) \rightarrow \varepsilon\) a.s. for some \(\varepsilon >0\) as \(n_j \rightarrow \infty\). Then, as \(n_j \rightarrow \infty\), there exists a region \(s_\varepsilon (F_j)\) in \(s(F_j)\) with positive Lebesgue measure so that \(P(X \in s_\varepsilon (F_j))>0\) and such X’s are not in \(C_j\) with positive probability. That is, \(P(X_i \in s_\varepsilon (F_j))>0\) for any \(X_i \in {\mathcal {X}}_j\), which implies \(P({\mathcal {X}}_j \cap s_\varepsilon (F_j) \ne \emptyset )>0\) for all \(n_j>0\) and also in the limit. Therefore, it follows that \(P({\mathcal {X}}_j \subset C_j)<1\) for all \(n_j>0\) and also in the limit, which is a contradiction. \(\square\)
Lemma 1 shows that the class cover, \(C_j\), almost surely covers the support of its associated class (except perhaps on a region of Lebesgue measure zero) as \(n_j \rightarrow \infty\). In particular, if the support \(s(F_j)\) is bounded, then \(P(\lambda (s(F_j) {\setminus } C_j)=0)=1\) for sufficiently large \(n_j\) and if the support \(s(F_j)\) is unbounded, then \(P(\lambda (s(F_j) {\setminus } C_j)>0)>0\) for all \(n_j\) but this probability converges to 0 as \(n_j \rightarrow \infty\).
To show the consistency of classifiers based on PCD class covers, we investigate the class covers under the assumption of (strict) \(\delta\)-separability of the class supports. Let the two classes be labeled 0 and 1 with strictly \(\delta\)-separable continuous distributions (i.e., \(\delta >0\)); then a proximity map \(N(\cdot )\) satisfying property P2 establishes pure class covers that include none of the non-target class points w.p. 1, i.e., \(C_j \cap {\mathcal {X}}_{1-j} = \emptyset\) w.p. 1. In this case, we have the following lemma showing that the intersection of the cover of the target class and the support of the non-target class is almost surely empty (except perhaps for a region of Lebesgue measure zero) as \(n_{1-j} \rightarrow \infty\). Let \(P_j\) be the probability with respect to the distribution \(F_j\) for \(j=0,1\), and let \(P_{01}\) be with respect to the joint distribution \(F_{01}\) of (X, Y) for \(X \sim F_0\) and \(Y \sim F_1\). Then, P2 also implies that \(P_1(s(F_0) \cap {\mathcal {X}}_1 = \emptyset )=1\). Hence, it also follows that \(P_{01}({\mathcal {X}}_0 \cap {\mathcal {X}}_1 = \emptyset )=1\), since \(P_0({\mathcal {X}}_0 \subset s(F_0))=1\).
Lemma 2
Let the target and non-target classes be labeled 0 and 1, and let \({\mathcal {X}}_0=\{X_1,X_2,\ldots ,X_{n_0}\}\) and \({\mathcal {X}}_1=\{Y_1,Y_2,\ldots ,Y_{n_1}\}\) be two random samples from classes 0 and 1 with class conditional continuous distributions \(F_0\) and \(F_1\) whose supports in \({\mathbb {R}}^d\) are strictly \(\delta\)-separable (i.e., \(\delta >0\)), respectively. If the proximity map \(N(\cdot )\) satisfies properties P1 and P2, then, for any fixed \(n_0>0\), we have \(\lambda (C_0 \cap s(F_1)) \rightarrow 0\) a.s. as \(n_1 \rightarrow \infty\).
Proof
Let \(n_0>0\). Notice that \(s(F_1)\) is fixed and the randomness in \(\lambda (C_0 \cap s(F_1))\) is due to \({\mathcal {X}}_0\) and \({\mathcal {X}}_1\), both of which are used in the construction of N(X). Moreover, recall that \(C_0=\cup _{X \in S_0} N(X)\) for \(S_0 \subseteq {\mathcal {X}}_0\) being a minimum prototype set of \({\mathcal {X}}_0\). From P2, it follows that \(P_1(N(x) \cap {\mathcal {X}}_1=\emptyset )=1\) for all \(x \in s(F_0)\) and \(P_{01}(N(X) \cap {\mathcal {X}}_1 = \emptyset )=1\) for \(X \sim F_0\). Then, as \(n_1 \rightarrow \infty\), \(P_{01}(C_0 \cap {\mathcal {X}}_1 = \emptyset ) \rightarrow 1\) (or equivalently, \(P_{01}(C_0 \cap {\mathcal {X}}_1 \ne \emptyset ) \rightarrow 0\)), since \(C_0\) is the union of N(X) for \(X \in S_0 \subseteq {\mathcal {X}}_0\). Now assume, for a contradiction, that \(\lambda (C_0 \cap s(F_1)) \rightarrow \varepsilon\) w.p. 1 as \(n_1 \rightarrow \infty\) for some \(\varepsilon > 0\). Then, as \(n_1 \rightarrow \infty\), there exists a region \(s_\varepsilon (F_1)\) in \(s(F_1)\) with positive measure such that \(P_{01}(C_0 \cap s_\varepsilon (F_1) \ne \emptyset )\) is positive in the limit, and hence \(P_{01}(C_0 \cap {\mathcal {X}}_1 \ne \emptyset )\) is positive in the limit, which is a contradiction. \(\square\)
Recall that PCD cover classifiers are defined with either standard covers which employ only one type of proximity map, or composite covers which employ two (or more) types of proximity maps. On the other hand, hybrid classifiers use cover classifiers for data points from one class in the convex hull of points from the other class(es), and use an alternative classifier elsewhere.
We now show the consistency of cover and hybrid PCD classifiers; that is, we show that the error rate of, e.g., the cover classifier, \(L(g_C)\), converges to the Bayes optimal error rate, which is 0 for continuous class conditional distributions with strictly \(\delta\)-separable supports, as \(n_0,n_1 \rightarrow \infty\) (Devroye et al. 1996). We then have the following theorem.
Theorem 6
Let \({\mathcal {X}}_0\) and \({\mathcal {X}}_1\) be two random samples of sizes \(n_0\) and \(n_1\) from classes 0 and 1, respectively, such that the data set \({\mathcal {X}}={\mathcal {X}}_0 \cup {\mathcal {X}}_1\) is a random sample from the distribution \(F=\pi _0\,F_0+\pi _1\,F_1\) for some \(\pi _0, \pi _1 \in [0,1]\) with \(\pi _0 + \pi _1=1\), where \(F_0\) and \(F_1\) are continuous class conditional distributions with finite dimensional strictly \(\delta\)-separable supports \(s(F_0)\) and \(s(F_1)\), respectively. Then we have the following results.
 (1)
Let the cover classifier \(g_C\) be based on a standard cover with a proximity map \(N(\cdot )\) satisfying P1 and P2, or on a composite cover with proximity maps \(N_i(\cdot )\) for \(i=1,\ldots ,k\), each of which satisfies P1 and P2. Then \(g_C\) is consistent; that is, \(L(g_C)\rightarrow L^*=0\) as \(n_0,\,n_1 \rightarrow \infty\).
 (2)
Let the hybrid classifier \(g_H\) be based on \(g_C\) in \(C^{\text {in}}:=C_0^{\text {in}} \cup C_1^{\text {in}}\), where \(C_j^{\text {in}}\) is the cover of the points \({\mathcal {X}}_j \cap C_H({\mathcal {X}}_{1-j})\) for \(j=0,1\), and on an alternative classifier \(g_A\) different from \(g_C\). If \(g_C\) is as in part (1) and \(g_A\) is consistent as \(n_0,\,n_1 \rightarrow \infty\), then \(g_H\) is consistent as \(n_0,\,n_1 \rightarrow \infty\).
Proof

(1)
It suffices to prove part (1) for standard cover classifiers, as the extension to the composite cover case is straightforward, since each \(N_i(\cdot )\) also satisfies P1 and P2. Let Z be a random variable from F. Then \(Z=Z_j \sim F_j\) with probability \(\pi _j\) for \(j=0,1\).
Then, by Lemma 1, we have \(P(Z_j \in C_j) \rightarrow 1\) as \(n_j \rightarrow \infty\), and by Lemma 2, \(P(Z_j \in C_j {\setminus } s(F_{1-j})) \rightarrow 1\) as \(n_{1-j} \rightarrow \infty\) for any fixed \(n_j>0\). Furthermore, \(P(Z \in s(F))=1\), where \(s(F)=s(F_0) \cup s(F_1)\). By Lemmas 1 and 2, as \(n_0,n_1 \rightarrow \infty\), we have \(\lambda (s(F){\setminus } (C_0 \cup C_1)) \rightarrow 0\) w.p. 1 and \(P(C_0 \cap C_1 \subset s(F)^c) \rightarrow 1\), and so \(P(Z \in C_0 \triangle C_1) \rightarrow 1\), where \(C_0 \triangle C_1\) is the symmetric difference of \(C_0\) and \(C_1\). Then, we have \(\lambda (C_{j}\cap s(F_{1-j})) \rightarrow 0\) w.p. 1 for \(n_j>0\) as \(n_{1-j} \rightarrow \infty\). Also, \(P(Z_j \in C_j {\setminus } C_{1-j})\rightarrow 1\) as \(n_0,n_1 \rightarrow \infty\), so \(P(g_C(Z_j)=j) \rightarrow 1\) (or equivalently \(P(g_C(Z_j)\ne j) \rightarrow 0\)) for \(j=0,1\).
Therefore,
$$\begin{aligned} L(g_C)&= \sum _{j=0,1}P(g_C(Z) \ne j \mid Z \text { is from class } j)\,P( Z \text { is from class } j)\\&= P(g_C(Z_0) \ne 0)\, \pi _0 + P(g_C(Z_1) \ne 1)\, \pi _1. \end{aligned}$$Hence, \(L(g_C) \rightarrow 0\) as \(n_0,n_1 \rightarrow \infty\).

(2)
First observe that, for \(j=0,1\), \(C_H({\mathcal {X}}_j)\) converges to \(C_H(s(F_j))\) as \(n_j \rightarrow \infty\) in the sense that \(\lambda (C_H(s(F_j)) {\setminus } C_H({\mathcal {X}}_j)) \rightarrow 0\) w.p. 1 as \(n_j \rightarrow \infty\), which implies, for \(X \sim F_j\), \(P(X \in C_H({\mathcal {X}}_j)) \rightarrow 1=P(X \in s(F_j))=P(X \in C_H(s(F_j)))\) as \(n_j \rightarrow \infty\), since \(s(F_j) \subseteq C_H(s(F_j))\). Let \(s_j^{\text {in}}:=s(F_j) \cap C_H({\mathcal {X}}_{1-j})\). Without loss of generality, we assume that \(s_0^{\text {in}}\) or \(s_1^{\text {in}}\) has positive Lebesgue measure, since otherwise \(P(g_H(Z)=g_A(Z)) \rightarrow 1\) and the result follows from the consistency of \(g_A\). So, \(C_j^{\text {in}}\) is constructed with \(S_j^{\text {in}} \subset s_j^{\text {in}}\) w.p. 1 for \(j=0,1\). Let \(F^{\text {in}}_j\) be the distribution \(F_j\) restricted to \(s_j^{\text {in}}\) (which can also be denoted \(F_j\vert _{s_j^{\text {in}}}\)). Then \(F^{\text {in}}_0\) and \(F^{\text {in}}_1\) are continuous and strictly \(\delta\)-separable as well. Furthermore, \(g_H=g_C\) for points in \(C^{\text {in}}\), and \(g_C\) is consistent by part (1). For \(j=0,1\), let \(Z_j \sim F_j\), let \(\varUpsilon _j\) be the event that \(Z_j \in C^{\text {in}}\), and let \(\upsilon _j := P(\varUpsilon _j)\). Since \(s_0^{\text {in}}\) or \(s_1^{\text {in}}\) has positive Lebesgue measure, we cannot have \(\upsilon _0=\upsilon _1=0\). Since \(C^{\text {in}}\) is not unique, there exist \(\upsilon _j^{\sup }\) and \(\upsilon _j^{\inf }\) such that \(\upsilon _j^{\inf } \le \lim _{n_0,n_1 \rightarrow \infty } \upsilon _j \le \upsilon _j^{\sup }\), where \(\upsilon _j^{\sup }\) corresponds to the supremum and \(\upsilon _j^{\inf }\) to the infimum of the volume of \(C^{\text {in}}\) in the limit. Note also that
$$\begin{aligned} L(g_H) = P(g_H(Z_0) \ne 0) \pi _0 + P(g_H(Z_1) \ne 1) \pi _1. \end{aligned}$$And, for \(j=0,1\),
$$\begin{aligned} P(g_H(Z_j) \ne j)&= P(g_H(Z_j) \ne j ,\, \varUpsilon _j) + P(g_H(Z_j) \ne j ,\, \varUpsilon _j^c) \\&= P(g_H(Z_j) \ne j \mid \varUpsilon _j) P(\varUpsilon _j) + P(g_H(Z_j) \ne j \mid \varUpsilon _j^c) (1-P(\varUpsilon _j)) \\&= P(g_C(Z_j) \ne j \mid \varUpsilon _j) P(\varUpsilon _j) + P(g_A(Z_j) \ne j \mid \varUpsilon _j^c) (1-P(\varUpsilon _j)). \end{aligned}$$Hence, \(L(g_H) \rightarrow L_H \le L_C \max \left( \upsilon _0^{\sup },\upsilon _1^{\sup }\right) + L_A \left( 1-\min \left( \upsilon _0^{\inf },\upsilon _1^{\inf }\right) \right)\) as \(n_0,n_1 \rightarrow \infty\). But, as \(n_0,n_1 \rightarrow \infty\), \(g_C\) is consistent by part (1) (i.e., \(P(g_C(Z_j) \ne j \mid \varUpsilon _j) \rightarrow 0\) for both \(j=0,1\)); hence \(L_C=0\). Moreover, \(\sum _{j=0,1} P(g_A(Z_j) \ne j) \rightarrow L_A\) as \(n_0,n_1 \rightarrow \infty\), since the classifier \(g_A\) is consistent with Bayes error \(L_A\) in the limit. Notice also that \(L_A=0\), since the supports of the distributions restricted to the complement of \(C^{\text {in}}\) are also strictly \(\delta\)-separable. Hence, \(L_H = 0\), which is the desired result.
\(\square\)
As a corollary to Theorem 6 part (1), the classifier \(g_C\) based on standard and composite covers with proximity maps \(N_{S}(\cdot ,\theta )\) for \(\theta \in (0,1]\) and \(N_{PE}(\cdot ,r)\) for \(r > 1\) is consistent; and as a corollary to part (2), the classifier \(g_H\) is consistent provided that \(g_C\) is based on standard and composite covers with these proximity maps and \(g_A\) is also consistent. A special case occurs when \(r=1\): observe that \(x \in \partial (N_{PE}(x,r=1))\), and hence \(N_{PE}(\cdot ,r=1)\) does not satisfy P1. Moreover, in part (1) we showed that a cover PE-PCD classifier is consistent since, as \(n_0,n_1 \rightarrow \infty\), the PE-PCD cover excludes all non-target class points almost surely; and if the support of the target class is bounded, it is a subset of the class cover, while if the support is unbounded, the probability of observing a point in the support but outside of the cover converges to zero. To show that the hybrid PE-PCD classifiers are consistent in part (2), we required the alternative classifiers to be consistent as well.
In proving Theorem 6 using Lemmas 1 and 2, the assumption of strict \(\delta\)-separability is crucial. If this assumption is dropped, that is, if \(\lambda (s(F_0) \cap s(F_1))>0\), then no proximity map \(N(\cdot )\) satisfies both P1 and P2, and consistency is no longer guaranteed.
Monte Carlo simulations and experiments
In this section, we assess the classification performance of hybrid and cover PE-PCD classifiers. We perform simulation studies wherein observations of two classes are drawn from separate distributions: \({\mathcal {X}}_0\) is a random sample from the multivariate uniform distribution \(U([0,1]^d)\) and \({\mathcal {X}}_1\) is a random sample from \(U([\nu ,1+\nu ]^d)\) for \(d = 2,3,5\), with the overlapping parameter \(\nu \in [0,1]\). Here, \(\nu\) determines the level of overlap between the two class supports. We regulate \(\nu\) in such a way that the overlapping ratio \(\zeta ={{\,\mathrm{Vol}\,}}(s(F_0) \cap s(F_1))/{{\,\mathrm{Vol}\,}}(s(F_0) \cup s(F_1))\) is fixed for all dimensions. When \(\zeta =0\), the supports are well separated, and when \(\zeta =1\), the supports are identical, i.e., \(s(F_0) = s(F_1)\). Hence, the closer \(\zeta\) is to 1, the more the supports overlap. Observe that \(\nu \in [0,1]\) can be expressed in terms of the overlapping ratio \(\zeta\) and the dimensionality d:
$$\begin{aligned} \nu = 1 - \left( \frac{2\zeta }{1+\zeta }\right) ^{1/d}. \end{aligned}$$
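Assuming the geometry described above (two unit cubes offset by \(\nu\) in every coordinate, so that the intersection has volume \((1-\nu )^d\) and the union has volume \(2-(1-\nu )^d\)), the relation between \(\nu\) and \(\zeta\) can be inverted as in the following sketch (function names are ours):

```python
def zeta_from_nu(nu, d):
    """Overlapping ratio of U([0,1]^d) and U([nu, 1+nu]^d):
    zeta = Vol(intersection) / Vol(union) = (1-nu)^d / (2 - (1-nu)^d)."""
    inter = (1.0 - nu) ** d
    return inter / (2.0 - inter)

def nu_from_zeta(zeta, d):
    """Inverse map: solving (1-nu)^d / (2 - (1-nu)^d) = zeta for nu
    gives (1-nu)^d = 2*zeta/(1+zeta)."""
    return 1.0 - (2.0 * zeta / (1.0 + zeta)) ** (1.0 / d)

nu = nu_from_zeta(0.5, d=2)   # the setting zeta = 0.5 used below
```

Sanity checks: \(\zeta =1\) gives \(\nu =0\) (identical supports) and \(\zeta =0\) gives \(\nu =1\) (supports that share only a boundary), matching the interpretation in the text.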
In this simulation study, we train the classifiers with \(n_0=400\) and \(n_1 = q\,n_0\), with the imbalance level \(q=|{\mathcal {X}}_1|/|{\mathcal {X}}_0| \in \{0.1,0.5,1.0\}\) and the overlapping ratio \(\zeta =0.5\). For values of q closer to zero, the classes of the data set are more imbalanced. On each replication, we form a test data set with 100 random observations drawn from each of \(F_0\) and \(F_1\), resulting in a test data set of size 200. This setting is similar to one used by Manukyan and Ceyhan (2016), who showed that CCCD classifiers are robust to imbalance in data sets; in this article, we show that the same robustness extends to PE-PCD classifiers. With each classifier, at each replication, we record the F-measure for the test data, and we also record the correct classification rate (CCR) of each class of the test data separately. We perform these replications until the standard errors of the F-measures of all classifiers are below 0.0005. We refer to the CCRs of the two classes as “CCR0” and “CCR1”, respectively. We consider the expansion parameters \(r=1,1.1,1.2,\ldots ,2.9,3,5,7,9\) for the PE-PCD classifiers. Our hybrid PE-PCD classifiers are referred to as the PE-SVM, PE-\(k\hbox {NN}\) and PE-CCCD classifiers, with alternative classifiers SVM, \(k\hbox {NN}\) and CCCD, respectively.
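The training-data generation just described can be sketched as follows (the function name is ours, and \(\nu\) here is any value obtained from the chosen \(\zeta\)):

```python
import numpy as np

def simulate(n0, q, nu, d, rng):
    """Draw the two training samples of the simulation study:
    X0 ~ U([0,1]^d) of size n0 and X1 ~ U([nu, 1+nu]^d) of size q*n0."""
    n1 = int(q * n0)
    X0 = rng.random((n0, d))          # uniform on the unit cube
    X1 = nu + rng.random((n1, d))     # uniform on the shifted cube
    return X0, X1

rng = np.random.default_rng(1)
X0, X1 = simulate(400, 0.1, 0.18, d=2, rng=rng)   # imbalanced case q = 0.1
```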
Before the main Monte Carlo simulation, we perform a preliminary (pilot) Monte Carlo simulation study to determine the optimum parameter values of the SVM, CCCD and \(k\hbox {NN}\) classifiers. The same values are used for the alternative classifiers as well. We train the \(g_{svm}\), \(g_{cccd}\) and \(g_{knn}\) classifiers and classify the test data sets with each classifier to find the optimum parameters. We perform Monte Carlo replications until the standard errors of all F-measures are below 0.0005 and record which parameter produced the maximum F-measure among the set of all parameters in a trial. Specifically, on each replication, we (1) classify the test data set with each \(\theta\) value, (2) record the \(\theta\) values with maximum F-measures, and (3) update the counts of the recorded \(\theta\) values. Finally, given the set of counts associated with each \(\theta\) value, we appoint the \(\theta\) with the maximum count as \(\theta ^*\), the optimum (or best performing) \(\theta\). Later, we use \(\theta ^*\) as the parameter of the alternative classifier \(g_{cccd}\) in our main simulations. The optimal parameter selection process is similar for the classifiers \(g_{knn}\) and \(g_{svm}\), associated with the parameters k and \(\gamma\), respectively.
The optimum parameters of each simulation setting are listed in Table 1. We consider the parameters \(\gamma =0.1,0.2, \ldots ,4.0\) for SVM, \(\theta =0,0.1,\ldots ,1\) for CCCD (here, \(\theta =0\) is actually equivalent to \(\theta =\epsilon\), the machine epsilon), and \(k=1,2,\ldots ,30\) for \(k\hbox {NN}\). In Table 1, as q and d increase, the optimal parameters \(\gamma\) and \(\theta\) decrease whereas k increases. Manukyan and Ceyhan (2016) showed that the dimensionality d may affect the imbalance between classes when the supports overlap. Observe in Table 1 that, with increasing d, the optimal parameters are more sensitive to changes in the imbalance level q. For the CCCD classifier, \(\theta =1\) is usually preferred when the data set is imbalanced, i.e. \(q=0.1\) or \(q=0.5\). Bigger values of \(\theta\) are better for the classification of imbalanced data sets since, with \(\theta =1\), the cover of the minority class is substantially bigger, which increases the domain of influence of the points of the minority class. For \(\theta\) closer to 0, the class cover of the minority class is much smaller than the class cover of the majority class, and hence the CCR1 is much smaller. Bigger values of the parameter k are also detrimental for imbalanced data sets: the bigger the parameter k, the more likely a new point is classified as the majority class, since points tend to be labelled as the class of the majority of their k nearest neighbors. As for the parameter \(\gamma\), support vectors have more influence over the domain as \(\gamma\) decreases (Wang et al. 2003). Note that \(\gamma =1/(2\sigma ^2)\) in the radial basis function (RBF) kernel; the smaller the \(\gamma\), the bigger the \(\sigma\). Hence, more points are classified as the majority class with decreasing \(\gamma\) since the majority class has more influence. Thus, bigger values of \(\gamma\) are better for imbalanced data sets.
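The \(\gamma\)–\(\sigma\) relationship can be checked numerically. The snippet below evaluates the RBF kernel \(k(x,y)=\exp(-\gamma \Vert x-y\Vert^2)\) to show that smaller \(\gamma\) (i.e. larger bandwidth \(\sigma\)) yields larger kernel values at a fixed distance, which is the "wider influence" effect discussed above:

```python
import math

def rbf(x, y, gamma):
    """RBF kernel k(x, y) = exp(-gamma * ||x - y||^2), with
    gamma = 1 / (2 * sigma^2): smaller gamma <=> larger sigma,
    so a support vector influences points farther away."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * d2)
```

For two points at distance 2, a small \(\gamma=0.1\) gives \(e^{-0.4}\approx 0.67\) while \(\gamma=1\) gives only \(e^{-4}\approx 0.018\): the kernel with smaller \(\gamma\) still exerts substantial influence at that distance.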
Averages of the F-measures and CCRs of the three hybrid PE-PCD classifiers are presented in Fig. 7. The classifier PE-\(k\hbox {NN}\) performs best for \(q=0.1\), PE-CCCD for \(q=0.5\), and PE-SVM for \(q=1.0\). Especially when the data set is imbalanced, the CCR1 determines the performance of a classifier (and thus its F-measure); that is, generally, the better a method classifies the minority class, the better the method performs overall. When the data set is balanced (i.e. \(q=1\)), PE-SVM is expected to perform well; however, it is known that SVM classifiers are confounded by imbalanced data sets (Akbani et al. 2004). Moreover, when \(q=0.1\), PE-\(k\hbox {NN}\) performs better than PE-CCCD. This result contradicts the results of Manukyan and Ceyhan (2016). The reason is that hybrid PE-PCD classifiers incorporate the alternative classifiers for points outside of the convex hull, and \(k\hbox {NN}\) might perform better for these points. The \(k\hbox {NN}\) classifier is prone to misclassifying points closer to the decision boundary when the data set is imbalanced, and we expect points outside the convex hull to be far away from the decision boundary in our simulation settings.
In Fig. 7, CCR1 increases while CCR0 decreases for some settings of q and d, and vice versa for some other settings. Recall that Theorem 3 establishes a stochastic ordering in the expansion parameter r; that is, with increasing r, the probability that the cardinality of the exact MDS is at most some \(\kappa =1,\ldots ,d+1\) increases. Hence, with increasing r, the proximity region \(N_{PE}(x,r)\) gets bigger and the cardinality of the prototype set \(S_j\) decreases. Therefore, we achieve a bigger cover of the minority class and more reduction in the majority class. The bigger the cover, the higher the CCR1 in imbalanced data sets. However, the decrease in performance as r increases may suggest that the alternative classifiers perform better in these settings. For example, the CCR1 of PE-SVM increases as r increases for \(q=0.1,0.5\) and \(d=2,3\), but the CCR1 of PE-CCCD and PE-\(k\hbox {NN}\) decreases for \(r \ge 1.6\). The higher the r, the more the data set is reduced; however, higher values of r may confound the classification performance. Hence, we choose an optimum value of r. Observe that for \(d=5\), the F-measures of all hybrid PE-PCD classifiers are equal for all r. With increasing dimensionality, the probability that a target class point falls in the convex hull of the non-target class points decreases, hence most target class points remain outside of the convex hull of the non-target class points.
In Fig. 8, we compare the composite cover and standard cover PE-PCD classifiers. The standard cover is slightly better in classifying the minority class, especially when there is imbalance between the classes. In general, the standard cover PE-PCD classifier appears to have higher CCR1 than the composite cover PE-PCD classifier. However, the composite covers are better when \(d=5\); the PE-PCD class covers are surely influenced by the increasing dimensionality. Moreover, for \(q=0.1,0.5\), we see that the CCR1 of the standard cover PE-PCD classifier slightly decreases with r, even though the data set is reduced more with increasing r. Hence, we should choose an optimum value of r that both substantially reduces the data set and achieves a good classification performance.
In Fig. 9, we compare all five classifiers: three hybrid and two cover PE-PCD classifiers. We consider the expansion parameter \(r=3\) since, in both Figs. 7 and 8, class covers with \(r=3\) perform well and, at the same time, substantially reduce the data set. For all \(d=2,3,5\), all classifiers show comparable performance when \(q=1\), with PE-SVM and SVM giving slightly better results. However, when there is imbalance in the data sets, the performances of PE-SVM and SVM degrade, and the hybrid and cover PE-PCD classifiers and the CCCD classifiers have higher F-measures than the others. On the other hand, the standard cover PE-PCD classifier is clearly the best performing one for \(d=2,3\) and \(q=0.1,0.5\): observe that it achieves the highest CCR1 among all classifiers. Apparently, the standard cover constitutes the classifier most robust to class imbalance. The performance of the standard cover PE-PCD classifier is usually comparable to, but slightly better than, that of the composite cover PE-PCD classifier. However, for \(d=5\), the performance of the standard cover PE-PCD classifier degrades and the composite cover PE-PCD classifier usually performs better. These results show that cover PE-PCD classifiers are more appealing than hybrid PE-PCD classifiers: cover PE-PCD classifiers both achieve good classification performance and reduce the data considerably more, since hybrid PE-PCD classifiers provide data reduction only for \({\mathcal {X}}_j \cap C_H({\mathcal {X}}_{1j})\) whereas cover PE-PCD classifiers reduce the entire data set. The level of reduction, however, may decrease as the dimensionality of the data set increases.
In Fig. 10, we compare all five classifiers (three hybrid and two cover PE-PCD classifiers) in a slightly different simulation setting with an inherent class imbalance. We draw equal numbers of observations \(n_0=n_1=n\) from separate distributions, where \({\mathcal {X}}_0\) is a random sample from the multivariate uniform distribution \(U([0,1]^d)\) and \({\mathcal {X}}_1\) is a random sample from \(U([0.3,0.7]^d)\), for \(d = 2,3,5\) and \(n=50,100,200,500\). Observe that the support of one class is entirely inside that of the other, i.e. \(s(F_1) \subset s(F_0)\). The same simulation setting has been used to highlight the robustness of CCCD classifiers to imbalanced data sets (Manukyan and Ceyhan 2016). In Fig. 10, the performance of the \(k\hbox {NN}\) and PE-\(k\hbox {NN}\) classifiers degrades as d increases and n decreases. With sufficiently high d and low n, the points of \({\mathcal {X}}_0\) are sparsely distributed around the overlapping region of the class supports, \(s(F_1) \cap s(F_0)\), which is the support of \({\mathcal {X}}_1\). Hence, although the number of observations is equal in both classes, there exists a “local” imbalance between the classes (Manukyan and Ceyhan 2016). However, the CCCD and SVM classifiers, including the associated hybrid PE-PCD classifiers, perform fairly well. Although the cover PE-PCD classifiers have considerably smaller CCR1, they perform relatively well compared to the other classifiers and generally have higher CCR0. As in the other simulation settings, the cover PE-PCD classifiers are also affected by the increasing dimensionality in this setting.
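The nested-support setting above is easy to reproduce. A sketch (the function name and seed handling are ours) that draws the two samples with \(s(F_1) \subset s(F_0)\):

```python
import numpy as np

def nested_uniform_sample(n, d, seed=0):
    """Draw equal-sized samples X0 ~ U([0,1]^d) and X1 ~ U([0.3,0.7]^d),
    so the support of class 1 lies entirely inside that of class 0 and
    a 'local' imbalance arises in the overlap region."""
    rng = np.random.default_rng(seed)
    x0 = rng.uniform(0.0, 1.0, size=(n, d))
    x1 = rng.uniform(0.3, 0.7, size=(n, d))
    return x0, x1
```

Even with \(n_0=n_1\), only the fraction of \({\mathcal X}_0\) falling in \([0.3,0.7]^d\) (about \(0.4^d\) in expectation) competes with all of \({\mathcal X}_1\) there, which is the source of the local imbalance.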
Although the PE-PCD based standard cover classifiers are competitive in classification performance, we should also consider how much they reduce the data sets during the training phase. In Fig. 11, we illustrate the percentage of reduction in the training data set, and separately in the minority and majority classes, using PE-PCDs for \(r=1,2,3\). The overall reduction increases with r, as also indicated by Theorem 3, and the reduction in the majority class is much larger than in the minority class when \(q=0.1,0.5\), since the proximity regions of the majority class catch more points than those of the minority class. The majority class is reduced by nearly \(60\%\) when \(q=0.1\), and by \(40\%\) when \(q=0.5\). Indeed, the higher the imbalance between classes, the higher the reduction in the abundantly populated class. On the other hand, as the dimensionality increases, composite covers reduce the data set more than standard covers do. The number of facets and simplices increases exponentially with d, and hence the cardinality of the minimum dominating set (or the prototype set) also increases exponentially with d (see Theorem 5). As a result, composite PE-PCD covers achieve much higher reduction than standard PE-PCD covers.
Real data examples
In this section, we apply the hybrid and cover PE-PCD classifiers to UCI and KEEL data sets (Dua and Graff 2019; Alcalá et al. 2011). Most of these data sets were subjected to preprocessing before analysis, such as log transformation, deletion of outliers and missing value imputation. We start with a simple but popular data set, iris, with 150 flowers classified into three types based on their petal and sepal widths and lengths. In Fig. 12, we illustrate standard and composite PE-PCD covers, and CCCD covers, of the first and third variables of the iris data set, sepal and petal lengths. Observe that in the composite covers of Fig. 12b, only a few or no triangles are used to cover the setosa and virginica classes. Points from these classes are almost all outside of the convex hull of the versicolor class points, and hence are covered mostly by spherical proximity regions. The standard cover of Fig. 12c, however, covers the setosa and virginica classes with polygons, since these classes are in the outer triangles of the convex hull of the versicolor class.
We first assess the performance of the PE-PCD classifiers and the other classifiers (i.e. \(k\hbox {NN}\), SVM and CCCD) on two real data sets. The first data set, the High Time Resolution Universe Survey (HTRU), is composed of 91192 signals, of which only 1196 are labelled as pulsars (Jameson et al. 2010; Morello et al. 2014). A pulsar is a radio-emitting neutron star, the collapsed remnant of a formerly massive star. Detecting whether a signal indicates the existence of a candidate pulsar is of considerable interest in the field of radio astronomy (Lyon et al. 2016). This data set was preprocessed by Lyon et al. (2016) such that eight variables are generated to predict whether a signal is a pulsar or not. These features are the mean, standard deviation, kurtosis and skewness values of the integrated pulse profiles and of the DM-SNR (dispersion measure and signal-to-noise) curves. In the HTRU data set, there are 1196 pulsar candidates and 89995 non-pulsar candidates (stars that are not pulsars), which gives the data set an imbalance ratio of \(89995/1196=75.24\). We randomly split the HTRU data set into training and test data sets comprising 75% and 25% of the original HTRU data set, respectively, where both training and test sets have approximately the same level of imbalance.
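A train/test split that preserves the level of imbalance is a stratified split: sample within each class separately. A sketch (the function name and seed are hypothetical; the authors' exact splitting procedure may differ):

```python
import numpy as np

def stratified_split(y, train_frac=0.75, seed=0):
    """Split indices into train/test so that each class contributes
    (approximately) train_frac of its members to the training set,
    keeping the imbalance ratio roughly equal in both sets."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    train, test = [], []
    for label in np.unique(y):
        idx = np.where(y == label)[0]
        rng.shuffle(idx)
        cut = int(round(train_frac * len(idx)))
        train.extend(idx[:cut])
        test.extend(idx[cut:])
    return np.array(train), np.array(test)
```

For a class of 80 majority and 20 minority points with `train_frac=0.75`, both the training and test sets retain the 4:1 ratio.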
Lyon et al. (2016) show that all eight variables based on the pulse profiles and DM-SNR curves are fundamental for predicting a pulsar signal, but three variables, namely the mean, kurtosis and skewness of the pulse profiles, are more explanatory than the others. Also, recall that the number of prototypes of a PE-PCD classifier increases exponentially with d, as shown by Theorem 5. The simulation studies in Sect. 6 also indicated that the dimensionality of a data set affects the classification performance. Hence, we apply dimensionality reduction to the HTRU data set to mitigate the effect of dimensionality. After preprocessing the HTRU data set, we used principal component analysis (PCA) to extract the three principal components, with 96% of the variation explained. In Fig. 13, we illustrate scatter diagrams of the first two principal components. We observe that the classes are almost separated, with a mild level of overlap.
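The PCA step can be sketched with a plain SVD (a minimal version, not necessarily the authors' exact pipeline): center the data, project onto the first k right singular vectors, and report the fraction of variance those components explain:

```python
import numpy as np

def pca_reduce(X, k):
    """SVD-based PCA: return the projection of X onto its first k
    principal components and the fraction of variance they explain."""
    Xc = X - X.mean(axis=0)                     # center each feature
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    var = s ** 2                                # variances along components
    explained = var[:k].sum() / var.sum()
    return Xc @ Vt[:k].T, explained
```

One keeps the smallest k whose explained fraction meets the target (here, three components for 96%).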
We establish the PE-PCD covers of the HTRU data sets for increasing values of \(r=1,1.1,1.2,\ldots ,2\). In Fig. 14, for all values of r, we give the levels of reduction and the imbalance ratio of the two classes after reduction, i.e. the reduced imbalance ratio. In Fig. 14a, we ignore values \(r > 2\) since there is no substantial change in either the reduction percentage or the reduced imbalance ratio. For increasing values of r, there is considerable reduction in the number of observations. With \(r=1\), the percentage of reduction in the number of non-pulsar candidates is almost 99%, and the reduction in the pulsar candidates is 40%, reaching up to 50% with increasing values of r. PE-PCDs achieve a reduced imbalance ratio of approximately 3, where the global imbalance ratio of the HTRU data set was originally 75.24. In Fig. 14b, we illustrate the reduction inside and outside of the convex hulls \(C_H({\mathcal {X}}_{1j})\), \(j=0,1\), of both classes for \(r=2\). Both inside and outside of the convex hull of the non-target class, where the set of pulsar candidates is the non-target class, the non-pulsar candidates achieve more than 90% reduction. However, only 50% of the pulsar candidates in the convex hull of the non-pulsar candidates are chosen as members of the prototype set. In both classes, the reduction is over 90% for those target class points that are outside of the convex hull of the non-target class. The reduction in the minority class (the pulsar candidates) is indeed lower than the reduction in the majority class (the non-pulsar candidates), but this amounts to an undersampling of the majority class which successfully reduces the imbalance between the numbers of pulsar and non-pulsar signals.
In Fig. 15a, we illustrate the F-measures of the standard cover classifier, SVM, \(k\hbox {NN}\) and CCCD with parameters \(r=1,1.1,\ldots ,2.9,3,5,6,7,8,9\), \(\gamma =0.1,0.2,\ldots ,4\), \(k=1,2,\ldots ,40\) and \(\theta =0,0.1,\ldots ,1\), respectively, measured on the HTRU test data set. The classification performance of SVM seems unaffected by increasing values of \(\gamma\); that is, SVM achieves an F-measure of approximately 0.80 for all \(\gamma\). The parameter \(k=1\) is not the best performing one for \(k\hbox {NN}\); the best performance is achieved at \(k=5\) since, as seen in Fig. 13, there is considerable separation between the pulsar and non-pulsar candidates. Although the HTRU data set is highly imbalanced with IR = 75.24, due to the good separation of the pulsar and non-pulsar candidates, moderate values of k may perform better. The CCCD classifiers, however, achieve the lowest F-measure on the HTRU test data set for all values of \(\theta\). Moreover, standard cover classifiers with expansion parameter r have higher F-measures than CCCDs, with the standard cover performing best at \(r=1.5\). It was also observed in the simulation studies of Sect. 6 that an optimum value of r is preferred to achieve a considerable level of reduction while keeping the classification performance high. We observe comparable F-measures for the \(k\hbox {NN}\) and standard cover classifiers, while the cover also reduces the number of observations and mitigates the effects of class imbalance.
In Fig. 15b, we illustrate the F-measures of the hybrid classifiers and the composite cover classifier for the PE-PCD parameter \(r=1,1.1,\ldots ,2.9,3,5,6,7,8,9\), measured on the HTRU test data set. The k, \(\gamma\) and \(\theta\) parameters of the alternative classifiers and the parameter of the spherical proximity regions of the composite cover are fixed to the values that performed best in the experiments of Fig. 15a. We observe that the PE-SVM hybrid classifier slightly outperforms the SVM classifier, with the best performance achieved at \(r=1.4\). PE-CCCDs also have higher F-measures than CCCDs, considerably increasing the prediction accuracy of the CCCDs with the addition of the PE-PCDs. All hybrid classifiers seem to have similar classification performance, even though lower values of r may produce composite cover classifiers with slightly smaller F-measures. The increase in the classification performance of the hybrid classifiers may indicate that correctly classifying the pulsar candidates in the overlapping region of the two classes is of higher importance. Both Fig. 15a and b indicate that, if the dimensionality of the data set is sufficiently reduced, standard classifiers and PE-PCD based hybrid classifiers may produce comparable classification performance.
Cover PE-PCD classifiers perform better if the data set has low dimensionality. Hence, we reduce the dimensionality of data sets by means of PCA and then classify each data set with the cover PE-PCD classifiers trained in the reduced dimension. Although PE-PCD classifiers have computationally tractable MDSs and potentially comparable performance to the other classifiers, moderately high dimensionality is detrimental for classifiers based on PE-PCD class covers. Next, we apply all classifiers to the Letter data set, which is composed of 20,000 black-and-white images, each representing one of the 26 capital letters (Frey and Slate 1991; Dua and Graff 2019) and converted into 16 integer features. To investigate the performance of standard cover classifiers under class imbalance, we restrict our attention to recognizing the letter “M”, which only 792 of the examples represent. Thus, the transformed data set “LetterM” has an imbalance ratio of 19208/792 = 24.25. We apply dimensionality reduction to the data set after some preprocessing and extract 5 principal components with a total explained variance of 72%. In Fig. 16, we illustrate two of these principal components, which show the separability of the LetterM data set. There exists a medium level of separability between the classes that may help in achieving a high classification performance. We then randomly split the LetterM data set into training and test data sets, each constituting 50% of all observations of the LetterM data set, with approximately the same level of class imbalance.
In Fig. 17, we establish the PE-PCD covers of the LetterM data set for all values of r, and then observe the reduction in the number of observations and the reduced imbalance ratios obtained by means of the prototype sets. Although the LetterM data set exhibits some level of separability between classes, similar to the HTRU data set, the reduction in the minority class is far less than that for HTRU. Most reduction is achieved among the observations outside of the convex hull of the non-target class, but the number of observations in the minority class is reduced by only 40% outside of the convex hull. Here, the convex hull becomes smaller compared to the entire domain with increasing dimensionality. Outside the convex hull of the minority class, we observe a 90% reduction in the majority class while the minority class achieves only a 20% reduction, but the reduced imbalance level is approximately 2.4. The reduced imbalance ratio does not change considerably with increasing expansion parameter r since most points are outside of the convex hull and, by Theorem 4, the number of prototypes (dominating points) outside of the convex hull is fixed for all r.
In Fig. 18a, we illustrate the F-measures of the standard cover classifier, SVM, \(k\hbox {NN}\) and CCCD measured on the LetterM test data set. Contrary to its performance on the HTRU data set, the classifier based on the standard PE-PCD cover performs much worse than the other classifiers. There is almost no change in the performance of the hybrid classifiers with increasing r since only a few target class points fall into the convex hull of the non-target class. Composite cover classifiers achieve nearly 0.70 F-measure with increasing r, but both the PE-CCCD hybrid and the CCCD classifier outperform the composite cover classifier. CCCDs, \(k\hbox {NN}\), SVM and all hybrid classifiers achieve nearly 0.80 F-measure on the test data, while the cover classifiers achieve F-measures approximately between 0.60 and 0.70. The best performing values of r for the composite and standard covers are \(r=1.7\) and \(r=1.9\), respectively. The F-measure seems to be stable for increasing values of \(\theta\) and \(\gamma\), but lower values of k perform better for \(k\hbox {NN}\). Although it is possible to achieve approximately 0.80 F-measure with the other classifiers, the standard cover classifiers suffer from the model complexity of the PE-PCD cover, which depends on the dimensionality d of the data set. This is again due to Theorem 5, by which a dominating set \(S_j\) of a PE-PCD is of complexity \({\mathcal {O}}\left( dn_{1j}^{\lceil d/2\rceil }\right)\). Here, the cardinality of the dominating set, which is the exact minimum total number of d-simplices and outer simplices needed to cover the target class, increases exponentially in d. Although dimensionality reduction may help in reducing the complexity of the class cover, one may not be able to reduce the dimensionality of the data in such a way that the reduced data set retains a considerably high fraction of explained variance and is still eligible to be trained by PE-PCD based classifiers.
In Table 2, we apply all classifiers to 17 UCI and KEEL data sets, including the iris, HTRU and LetterM data sets. To test the statistical difference between the F-measures of the classifiers, we employ the combined \(5 \times 2\) CV F-test (Dietterich 1998; Alpaydın 1999). We use the micro F-measure for data sets with multiple classes since the micro F-measure is more suitable for multiple imbalanced classes than the macro F-measure (Narasimhan et al. 2016). The test works as an omnibus test over all ten possible \(5 \times 2\) CV t-tests (for each of the five repetitions there are two folds, hence ten folds in total). Basically, if a majority of the ten \(5 \times 2\) CV t-tests suggest that two classifiers differ significantly in performance, the F-test also suggests a significant difference. Hence, an F-test with a high p-value suggests that some of the ten t-tests fail to reject the null hypothesis (i.e. they have high p-values); that is, it is very likely that there exists no significant difference between the F-measures of the two classifiers. We only report the p-values of the differences between the F-measure of the standard cover classifier and those of all other classifiers (including the composite cover and hybrid classifiers). In Table 2, we report the F-measures of all classifiers along with the optimum values of the associated tuning parameters. We either reduce the dimensionality of each data set, empirically select some subset of features, or use the original set of features of each data set (hence called the unreduced data set). We report the best performing number of extracted or selected features for each data set. We mostly avoid applying PE-PCD based hybrid and cover classifiers to unreduced data sets due to the high model complexity of PE-PCDs with a moderately high number of dimensions (or features); hence, we only apply the SVM, \(k\hbox {NN}\) and CCCD classifiers to these unreduced data sets.
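The combined \(5 \times 2\) CV F-test statistic of Alpaydın (1999) is simple to compute from the ten fold-wise performance differences; under the null hypothesis of equal performance it follows an F distribution with (10, 5) degrees of freedom. A sketch (the array layout is our choice):

```python
import numpy as np

def combined_5x2cv_f(p):
    """Alpaydin's combined 5x2-cv F statistic.  `p` is a 5x2 array of
    performance differences p_ij between two classifiers on fold j of
    replication i.  f = (sum of p_ij^2) / (2 * sum of s_i^2), where
    s_i^2 is the variance estimate of replication i."""
    p = np.asarray(p, dtype=float)
    pbar = p.mean(axis=1, keepdims=True)          # per-replication mean
    s2 = ((p - pbar) ** 2).sum(axis=1)            # per-replication variance
    return (p ** 2).sum() / (2.0 * s2.sum())
```

The statistic is then compared against the upper critical value of F(10, 5) (approximately 4.74 at the 5% level), or a p-value can be obtained with `scipy.stats.f.sf(f, 10, 5)`.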
Alongside Table 2, we report in Table 3 the reduced imbalance ratios (the ratio between the majority and minority classes after the PE-PCDs are applied and the prototype set is extracted) and the reduction rates of the standard and composite cover classifiers for all 17 data sets. Here, we employ a multi-class imbalance ratio notation which indicates the reduced imbalance ratio of all classes with one class being the reference class; for example, IR = \(n_3/n_1 \mid n_2/n_1 \mid n_1/n_1 = n_3/n_1\mid n_2/n_1 \mid 1\). We report the global imbalance ratio \(q=|{\mathcal {X}}_0|/|{\mathcal {X}}_1|\), where \(|{\mathcal {X}}_0|\) is the cardinality of the majority class and \(|{\mathcal {X}}_1|\) that of the minority class. We also report the local imbalance ratio of the two classes and the percentage of minority class members in the overlapping region of the two classes. The local imbalance ratio and the overlapping ratio of imbalanced classes have also been investigated by Manukyan and Ceyhan (2016), who showed that, even when two classes are balanced, there may be some region \(E \subset {\mathbb {R}}^d\) in which the two classes show some level of imbalance, and this region E is usually where subsets of the two classes are close in proximity, e.g., where the two classes overlap. We employ one-class SVMs with radial basis function (RBF) kernels to estimate the support of each class, with \(\nu =0.01\) and an optimum \(\gamma\). We calculate the imbalance ratio of the two classes within the overlapping region of their estimated supports, i.e. the local class imbalance, and also provide the percentage of minority class members falling into this region. The point is that the higher the local class imbalance, or the higher the percentage of minority class members in the overlapping region, the smaller the F-measure, since it is harder to predict the true labels of the minority class (Manukyan and Ceyhan 2016).
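The micro F-measure used for the multi-class data sets pools true positives, false positives and false negatives over all classes before computing precision and recall, so every observation carries equal weight regardless of its class size. A sketch (for single-label multi-class problems, this quantity coincides with accuracy):

```python
def micro_f(y_true, y_pred, classes):
    """Micro-averaged F-measure: pool TP/FP/FN over all classes, then
    compute precision, recall and their harmonic mean once."""
    tp = fp = fn = 0
    for c in classes:
        tp += sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp += sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn += sum(t == c and p != c for t, p in zip(y_true, y_pred))
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)
```

By contrast, the macro F-measure averages per-class F-measures, which lets a tiny class dominate the score; the micro version is the one preferred for multiple imbalanced classes above.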
In Tables 2 and 3, we observe that the composite and standard cover classifiers usually perform best in lower dimensions; that is, dimensionality reduction helps to increase the F-measure and to mitigate the drawbacks caused by a moderately high number of features. However, for the feature selection or extraction to succeed, the reduced set of features should be explanatory enough to help the classifiers achieve a considerable performance. We select two features from the AlivaD data set, where we aim to predict whether a letter is recognized as “D” or not. Hence, we reduce the dimensionality of the data set drastically, and as a result, the cover classifiers nearly achieve the Bayes optimal performance. However, applying the classifiers to the Pageblocks0 data set, we require four extracted principal components to achieve 99% explained variance, and the cover classifiers perform poorly against the other classifiers because of the rapidly increasing complexity of the PE-PCD based covers. A similar argument can be made for the LetterM data set, since four principal components are clearly not enough. The F-measure is naturally affected by the overlapping ratio of the two classes, since it is harder to correctly predict the minority class members closer to the points of the majority class. As seen in Table 3, the overlapping ratios of data sets like Iris, Thyroid (1 and 2) and Banknote are small, and these data sets are also quite balanced. Hence, only a handful of minority class members are in the support of the majority class, which results in a high F-measure. There are some data sets, however, that exhibit global class imbalance but do not have any local imbalance within the overlapping region.
The Shuttle0vs4 and Segment0 data sets are examples of such cases, where all classifiers, including the cover classifiers, perform well even in reduced dimensions, with the PE-\(k\hbox {NN}\) classifier being an exception; that is, the hybrid of the PE-PCD and \(k\hbox {NN}\) classifiers drastically confounds the performance. On the other hand, the Yeast data sets have high overlapping ratios, and both the local and global imbalance ratios of these data sets are also notably high. Although PE-PCD based (hybrid and cover) classifiers perform best in lower dimensions, this may result in a considerable loss of information since the percentage of explained variance is, for example in the Yeast data sets, nearly 35%. Therefore, all other (non-hybrid and non-cover) classifiers enjoy high F-measures due to the employment of all the features (of the unreduced data set). PE-PCD based cover classifiers may also perform well for classifying data sets like the Ionosphere and Ozone data with moderately high numbers of features; that is, after dimensionality reduction, the standard cover classifier achieves comparable performance to the other classifiers and only underperforms against the SVM and PE-SVM classifiers in classifying the Ionosphere data set. Hybrid classifiers often perform comparably to their non-hybrid counterparts, but in some cases they slightly increase the classification performance; see, for example, the F-measures of the PE-SVM classifier on the Yeast4 and Ozone data sets.
In Table 3, we observe that all data sets with more than 5000 observations are reduced to at most 10% of the original number of observations; that is, the cover classifiers prune almost 90% of all observations by choosing only nearly 10% of them as members of the prototype (i.e. minimum dominating) set. Moreover, both types of covers reduce the imbalance between the two classes; that is, in all data sets, the reduced imbalance ratio is nearly 2 or lower. The level of reduction of PE-PCD covers is highly dependent on the dimensionality. In Table 2, the standard and composite cover classifiers perform better on Ozone, which has fewer dimensions, than on the Ionosphere data set; accordingly, the cardinality of the dominating sets of both covers is higher for the Ionosphere data, where approximately 35% of all observations constitute the dominating set. The exponentially increasing complexity of the Delaunay tessellation affects both the performance and the model complexity of the PE-PCD covers. The optimal values of k, \(\gamma\), \(\theta\) and r are similarly affected by the global (or local) class imbalance levels and the dimensionality. Lower values of k perform better for locally imbalanced data sets like Yeast4 and Yeast1289vs7, but \(\gamma\) is mostly affected by the dimensionality. The higher the dimension d, or the more features extracted, the lower the \(\gamma\), and hence the higher the bandwidth \(\sigma\) of the RBF kernel. Most importantly, there is a positive trend in \(\gamma\) as the class imbalance increases. CCCDs mostly perform best on unreduced data sets, and therefore higher values of \(\theta\) are preferred. However, Shuttle0vs4 is a well separated data set on which CCCD achieves a higher F-measure in reduced dimension, and hence the optimal \(\theta\) is set to the lowest value, i.e. \(\theta =0\). The cardinality of the minimum dominating set of a PE-PCD decreases with r, but, as also demonstrated in Sect. 6, a high value of r is detrimental for the performance of PE-PCD based cover classifiers. Hence, in most data sets, moderate values of r achieve the best F-measure. One apparent difference in the optimum values of r among data sets is between locally imbalanced and balanced data sets. The Yeast5, Yeast6, LetterM and Segment0 data sets have high global imbalance but low local imbalance, despite the fact that the majority and minority classes overlap. An optimum value of r for such “globally only” imbalanced data sets is high, which undersamples the majority class as much as possible. The bias in globally imbalanced data sets originates from the abundance of majority class members close to the decision boundary but not in the overlapping region; hence, the higher the r, the better the performance.
Summary and discussion
We use proximity catch digraphs (PCDs) to construct semiparametric classifiers that show potential in solving problems with substantial class imbalance. These families of random geometric digraphs constitute class covers of a class of interest (i.e. the target class) in order to generate decision boundaries for classifiers. PCDs are generalized versions of class cover catch digraphs (CCCDs). For imbalanced data sets, CCCDs showed better performance than some other commonly used classifiers in previous studies (Manukyan and Ceyhan 2016; DeVinney et al. 2002). CCCDs are in fact examples of PCDs with spherical proximity maps. Our PCDs, however, are based on simplicial proximity maps, e.g. proportional-edge (PE) proximity maps. Our PCD, or PE-PCD, class covers are extended to be unions of simplicial and polygonal regions, whereas the original PE-PCD class covers were composed of only simplicial regions. The most important advantage of this family of PE proximity maps is that their respective digraphs, namely PE-PCDs, have computationally tractable minimum dominating sets (MDSs). The class covers of such digraphs are of minimum complexity, offering maximum reduction of the entire data set with comparable and, potentially, better classification performance. PE-PCDs are one of many PCD families using simplicial proximity maps investigated in Ceyhan (2010). The construction of those other families is also based on the Delaunay tessellation of the non-target class and, similar to PE-PCDs, they enjoy various properties that CCCDs do in \({\mathbb {R}}\); they can also be used to establish PCD classifiers.
The PE-PCDs are defined on the Delaunay tessellation of the points from the non-target class (i.e. the class not of interest). PE-PCDs, and the associated proximity maps, were only defined for the points inside the convex hull of the non-target class points, \(C_H({\mathcal {X}}_{1j})\), in previous studies. Here, we introduce the outer simplices associated with the facets of \(C_H({\mathcal {X}}_{1j})\) and thus extend the definition of the PE proximity maps to these outer simplices. Hence, the class covers of PE-PCDs apply to all target class points \({\mathcal {X}}_j\). PE-PCDs are based on regions of simplices associated with the vertices of these simplices, called M-vertex regions. We characterize these vertex regions via the barycentric coordinates of target class points with respect to the vertices of the d-simplices. However, barycentric coordinates only apply to the target class points inside the convex hull of the non-target class points \(C_H({\mathcal {X}}_{1j})\). For the points outside the convex hull, we may incorporate generalized barycentric coordinates, for example, the coordinate system of Warren (1996). Such coordinate systems are convenient for locating points outside \(C_H({\mathcal {X}}_{1j})\) since outer simplices are similar to convex d-polytopes even though they are unbounded. However, generalized barycentric coordinates of points with respect to these convex polytopes are not unique. Hence, the associated properties of MDSs and convex distance measures are not well-defined.
We define two types of classifiers based on PE-PCDs, namely hybrid and cover PE-PCD classifiers. We show that these classifiers are better at classifying the minority class in particular. This makes cover PE-PCD classifiers the more appealing ones, since they present slightly better performance than other classifiers (including hybrid PE-PCD classifiers) along with a high reduction of the data set. In hybrid PE-PCD classifiers, alternative classifiers are used when PE-PCD pre-classifiers are unable to make a decision on a query point. These pre-classifiers are only defined by the simplices provided by the Delaunay tessellation of the set \({\mathcal {X}}_{1j}\), hence only for target class points in \(C_H({\mathcal {X}}_{1j})\). We considered \(k\hbox {NN}\), SVM and CCCD as alternative classifiers. In both our simulation studies and real data experiments, there are cases where hybrid classifiers outperform their non-hybrid counterparts; for example, PE-SVMs outperform SVM classifiers on the dimensionally reduced HTRU data set. This may be an indication that, if used alongside proper alternative classifiers, PE-PCD classifiers could better model the decision boundary close to the overlapping region of the classes. The cover PE-PCD classifiers, on the other hand, are based on two types of covers: composite covers, where the target class points inside and outside of the convex hull of the non-target class points are covered with separate proximity regions, and standard covers, where all points are covered with regions based on the same family of proximity maps. For composite covers, we consider a composition of spherical proximity maps (used in CCCDs) and PE proximity maps. We observe that, in general, standard cover classifiers perform slightly better than or comparably to composite cover classifiers on reduced and imbalanced data sets unless the standard cover suffers from high dimensionality.
In general, however, results on both hybrid and cover PE-PCD classifiers indicate that when the dimensionality is low and the classes are imbalanced, standard cover PE-PCD classifiers achieve either comparable or slightly better classification performance than the others.
PE-PCD class covers are low in complexity (with respect to the number of observations); that is, by finding the MDSs of these PE-PCDs, we can construct class covers with the minimum number of proximity regions. The minimum dominating set, or the prototype set, is viewed as a reduced data set that potentially increases the testing speed of a classifier. CCCDs have the same properties, but only for data sets in \({\mathbb {R}}\). By extending end intervals, i.e. intervals with infinite end points, to outer simplices in \({\mathbb {R}}^d\) for \(d>1\), we established classifiers having the same appealing properties as CCCDs have in \({\mathbb {R}}\). Experiments on both simulated and real data sets indicate that the expansion parameter r of the PE proximity maps substantially decreases the cardinality of the minimum dominating set, but the classification performance degrades if r is very large. Although PE-PCDs substantially reduce the number of observations of almost all of the real data sets considered in this work, higher values of r actually degrade the classification performance of both types of PE-PCD based classifiers. Hence, an optimal choice of the r value is preferred. A major drawback of PE-PCDs, however, is the exponentially increasing complexity of the prototype set in the dimensionality d of the data set. This is due to the Delaunay tessellation of the non-target class, since the number of simplices and facets increases exponentially in d (see Theorem 5). Therefore, these class covers become inconvenient for modelling the supports of the classes in high dimensions. We employ methods of dimensionality reduction, e.g. principal components analysis, to mitigate the effects of high dimensionality. PE-PCD cover classifiers perform well in reduced dimensions only if the extracted set of principal components provides a high percentage of explained variance. For some real data sets, however, it is inefficient to rely on a small number of features, and hence PE-PCD classifiers are outperformed by other classifiers which make use of a larger number of explanatory input variables.
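For context, the generic greedy heuristic for dominating sets (in the spirit of Chvatal 1979; Parekh 1991) can be sketched in a few lines. Note that this is only the approximate heuristic of the kind used with CCCDs; the point of PE-PCDs is precisely that their special structure admits exact MDSs in polynomial time, which this generic sketch does not provide.

```python
def greedy_dominating_set(adj):
    """Greedy heuristic for a dominating set of a digraph.

    adj: dict mapping each vertex to the set of vertices it dominates
    (its closed out-neighborhood, so every vertex dominates itself).
    Repeatedly picks the vertex covering the most still-uncovered vertices.
    """
    uncovered = set(adj)
    dom = []
    while uncovered:
        v = max(adj, key=lambda u: len(adj[u] & uncovered))
        dom.append(v)
        uncovered -= adj[v]
    return dom
```

On a star digraph whose center reaches every vertex, the heuristic returns just the center, mirroring how a single prototype can cover a whole class region.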
PE-PCDs offer classifiers of (exact) minimum complexity based on estimation of the class supports. The MDSs of PE-PCDs are computationally tractable, and hence the maximum reduction is achieved in polynomial time (in the size of the training data set). This property of PE-PCDs, however, is achieved by partitioning \({\mathbb {R}}^d\) with a Delaunay tessellation; as a result, the number of simplices and facets of the convex hull of the non-target class determines the complexity of the model, which increases exponentially fast with the dimensionality d of the data set, i.e. \({\mathcal {O}}\left( n_{1j}^{\lceil d/2\rceil }\right)\) for \(n_{1j}\) being the number of non-target class points. Indeed, this may lead to overfitting of the data set. We employ PCA to extract the features with the most variation, and thus reduce the dimensions to mitigate the effects of dimensionality. PCA, however, is one of the oldest dimensionality reduction methods, and there are many dimension reduction methods in the literature that may potentially increase the classification performance of PCD classifiers. One other point to be made on PE-PCD covers is that, under the assumption of strict \(\delta\)-separability between the two classes of a data set, PE-PCDs can be shown to be consistent, achieving the Bayes optimal performance of \(L^*=0\). However, Devroye et al. (1996) suggest that classifiers with homogeneous decision regions often lead to overfitting and are best suited to data sets with separable class conditional distributions. The PCDs considered in this work, including CCCDs, are pure of non-target class points; hence none of the non-target class points reside in the class cover. DeVinney et al. (2002) introduced random walk CCCDs (RW-CCCDs), non-pure alternatives to CCCDs where some non-target class points are allowed inside the class cover in order to mitigate the effects of overfitting.
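The growth of the worst-case Delaunay complexity \({\mathcal {O}}\left( n_{1j}^{\lceil d/2\rceil }\right)\) can be tabulated directly. A tiny sketch (our own helper, purely illustrative, with constants omitted):

```python
import math

def delaunay_simplex_bound(n, d):
    """Worst-case order of the number of simplices in a d-dimensional
    Delaunay tessellation of n points: O(n^{ceil(d/2)}), constants omitted."""
    return n ** math.ceil(d / 2)

# For a fixed number of non-target points, the bound grows
# exponentially fast in the dimension d:
growth = [delaunay_simplex_bound(100, d) for d in (1, 2, 3, 4, 5)]
```

Here `growth` jumps by a factor of \(n\) every two dimensions, which is why the covers become impractical without dimensionality reduction.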
Although PE-PCDs offer class covers with computationally tractable minimum prototype sets, PE-PCDs are also pure class covers, and hence they also build homogeneous regions for decision making. We believe that, as a follow-up to this work, it is worthwhile to define non-pure and relaxed PCDs that yield both exact minimum dominating sets and heterogeneous class covers. Such digraphs would enjoy the appealing properties of PCDs, and they would also be the building blocks of classifiers that are consistent for a general set of real-life data sets.
Although our work demonstrates that relatively well-performing classifiers with minimum prototype sets can be provided by PCDs, a question arises as to whether there exist PCDs with alternative partitioning methods whose exact minimum dominating sets are fixed-parameter tractable with respect to d, unlike PCDs based on Delaunay tessellations. A problem is said to be fixed-parameter tractable (FPT) if there exists an algorithm solving the problem that runs in \(f(k)\,x^c\) time, where \(c \in {\mathbb {R}}^+\) is a constant, f is an arbitrary computable and nondecreasing function of the parameter k, and x is the size of the input (Downey and Fellows 2013). It is often appealing to try to find an FPT algorithm for a problem initially shown to be solvable in \(O(n^{f(k)})\) time (which is not FPT). Note that, as in Theorem 5, it only takes at most \({\mathcal {O}}(2^d)\) time, i.e. \({\mathcal {O}}(1)\) for fixed d, to find the exact extremum points of each d-simplex \({\mathcal {S}}\). Hence, a possible line of research for PCDs could be to employ alternative partitioning methods such that \({\mathbb {R}}^d\) is partitioned in at most \({\mathcal {O}}(n_{1j}^c)\) time and the extremum points are found in \({\mathcal {O}}(2^k)\) time with parameter \(k=d\). We believe such a partitioning method, for example a rectangular partitioning scheme with polynomial running time in both n and d, that produces fewer cells than a Delaunay tessellation, could be more appealing for the class cover. Classifiers based on such PCDs and their classification performance are topics of ongoing research.
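The gap between an FPT running time \(f(k)\,x^c\) and a non-FPT time \(x^{f(k)}\) is easy to see numerically. A hypothetical sketch with the illustrative choice \(f(k)=2^k\) (the function names are ours):

```python
def fpt_time(x, k, c=1):
    """FPT running time f(k) * x**c with the illustrative choice f(k) = 2**k."""
    return 2 ** k * x ** c

def xp_time(x, k):
    """Non-FPT (slice-wise polynomial) running time x ** f(k), same f(k) = 2**k."""
    return x ** (2 ** k)
```

Even for modest inputs the gap is astronomical: with \(x=1000\) and \(k=10\), the FPT bound is about \(10^6\) steps while the non-FPT bound is \(1000^{1024}\).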
References
Akbani, R., Kwek, S., & Japkowicz, N. (2004). Applying support vector machines to imbalanced datasets. In Proceedings of 15th European conference on machine learning, Pisa (pp. 39–50).
Alcalá, J., Fernández, A., Luengo, J., Derrac, J., García, S., Sánchez, L., et al. (2011). Keel data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework. Multiple-Valued Logic and Soft Computing, 17(2–3), 255–287.
Alpaydın, E. (1999). Combined \(5\times 2\) CV \(F\) test for comparing supervised classification learning algorithms. Neural Computation, 11(8), 1885–1892.
Angiulli, F. (2012). Prototype-based domain description for one-class classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(6), 1131–1144.
Arora, S., & Lund, C. (1996). Approximation algorithms for NP-hard problems. Boston, MA: PWS Publishing.
Bereg, S., Cabello, S., DíazBáñez, J. M., PérezLantero, P., Seara, C., & Ventura, I. (2012). The class cover problem with boxes. Computational Geometry, 45(7), 294–304.
Bien, J., & Tibshirani, R. (2011). Prototype selection for interpretable classification. The Annals of Applied Statistics, 5(4), 2403–2424.
Cannon, A. H., & Cowen, L. J. (2004). Approximation algorithms for the class cover problem. Annals of Mathematics and Artificial Intelligence, 40(3–4), 215–223.
Ceyhan, E. (2005). An investigation of proximity catch digraphs in Delaunay tessellations. PhD thesis, Johns Hopkins University, Baltimore, MD.
Ceyhan, E. (2010). Extension of one-dimensional proximity regions to higher dimensions. Computational Geometry, 43(9), 721–748.
Ceyhan, E., & Priebe, C. E. (2005). The use of domination number of a random proximity catch digraph for testing spatial patterns of segregation and association. Statistics & Probability Letters, 73(1), 37–50.
Ceyhan, E., Priebe, C. E., & Marchette, D. J. (2007). A new family of random graphs for testing spatial segregation. Canadian Journal of Statistics, 35(1), 27–50.
Ceyhan, E., Priebe, C. E., & Wierman, J. C. (2006). Relative density of the random \(r\)-factor proximity catch digraph for testing spatial patterns of segregation and association. Computational Statistics & Data Analysis, 50(8), 1925–1964.
Chvatal, V. (1979). A greedy heuristic for the set-covering problem. Mathematics of Operations Research, 4, 233–235.
Deng, X., & Zhu, B. (1999). A randomized algorithm for the Voronoi diagram of line segments on coarse-grained multiprocessors. Algorithmica, 24(3–4), 270–286.
DeVinney, J., Priebe, C., Marchette, D., & Socolinsky, D. (2002). Random walks and catch digraphs in classification. In Proceedings of the 34th symposium on the interface, volume 34: Computing science and statistics, Montreal, Quebec.
DeVinney, J. G. (2003). The class cover problem and its application in pattern recognition. PhD thesis, Johns Hopkins University, Baltimore, MD.
Devroye, L., Gyorfi, L., & Lugosi, G. (1996). A probabilistic theory of pattern recognition. New York: Springer.
Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7), 1895–1923.
Downey, R. G., & Fellows, M. R. (2013). Fundamentals of parameterized complexity (Vol. 4). New York: Springer.
Dua, D., & Graff, C. (2019). UCI machine learning repository. Irvine, CA: University of California, School of Information and Computer Science. http://archive.ics.uci.edu/ml. Accessed 10 July 2019.
Eveland, C. K., Socolinsky, D. A., Priebe, C. E., & Marchette, D. J. (2005). A hierarchical methodology for class detection problems with skewed priors. Journal of Classification, 22(1), 17–48.
Frey, P. W., & Slate, D. J. (1991). Letter recognition using Holland-style adaptive classifiers. Machine Learning, 6(2), 161–182.
Gao, B. J., Ester, M., Xiong, H., Cai, J. Y., & Schulte, O. (2013). The minimum consistent subset cover problem: A minimization view of data mining. IEEE Transactions on Knowledge and Data Engineering, 25(3), 690–703.
Hammer, P., Liu, Y., Simeone, B., & Szedmák, S. (2004). Saturated systems of homogeneous boxes and the logical analysis of numerical data. Discrete Applied Mathematics, 144(1–2), 103–109.
Jameson, A., Possenti, A., Stappers, B. W., Levin, L., Bailes, M., Burgay, M., et al. (2010). The high time resolution universe pulsar survey – I. System configuration and initial discoveries. Monthly Notices of the Royal Astronomical Society, 409(2), 619–627.
Jaromczyk, J. W., & Toussaint, G. T. (1992). Relative neighborhood graphs and their relatives. Proceedings of the IEEE, 80(9), 1502–1517.
Karr, A. F. (1992). Probability (1st ed.). New York: Springer.
Lawson, C. L. (1986). Properties of n-dimensional triangulations. Computer Aided Geometric Design, 3(4), 231–246.
Lyon, R. J., Stappers, B., Cooper, S., Brooke, J., & Knowles, J. (2016). Fifty years of pulsar candidate selection: From simple filters to a new principled real-time classification approach. Monthly Notices of the Royal Astronomical Society, 459(1), 1104–1123.
Manukyan, A., & Ceyhan, E. (2016). Classification of imbalanced data with a geometric digraph family. Journal of Machine Learning Research, 17(189), 1–40.
Marchette, D. J. (2004). Random graphs for statistical pattern recognition. Hoboken: Wiley.
Mehta, M., Rissanen, J., & Agrawal, R. (1995). MDL-based decision tree pruning. In Knowledge discovery and data mining (pp. 216–221).
Morello, V., Barr, E., Bailes, M., Flynn, C., Keane, E., & van Straten, W. (2014). Spinn: A straightforward machine learning solution to the pulsar candidate selection problem. Monthly Notices of the Royal Astronomical Society, 443(2), 1651–1662.
Narasimhan, H., Pan, W., Kar, P., Protopapas, P., & Ramaswamy, H. G. (2016). Optimizing the multiclass F-measure via biconcave programming. In 2016 IEEE 16th international conference on data mining (ICDM) (pp. 1101–1106). IEEE.
Parekh, A. K. (1991). Analysis of a greedy heuristic for finding small dominating sets in graphs. Information Processing Letters, 39, 237–240.
Pȩkalska, E., Duin, R. P., & Paclík, P. (2006). Prototype selection for dissimilarity-based classifiers. Pattern Recognition, 39(2), 189–208.
Priebe, C. E., DeVinney, J. G., & Marchette, D. J. (2001). On the distribution of the domination number for random class cover catch digraphs. Statistics & Probability Letters, 55(3), 239–246.
Priebe, C. E., Marchette, D. J., DeVinney, J., & Socolinsky, D. (2003a). Classification using class cover catch digraphs. Journal of Classification, 20(1), 3–23.
Priebe, C. E., Solka, J. L., Marchette, D. J., & Clark, B. T. (2003b). Class cover catch digraphs for latent class discovery in gene expression monitoring by DNA microarrays. Computational Statistics & Data Analysis, 43(4), 621–632.
Rissanen, J. (1989). Stochastic complexity in statistical inquiry. River Edge: World Scientific Publishing Co. Inc.
Schölkopf, B., Platt, J. C., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (2001). Estimating the support of a high-dimensional distribution. Neural Computation, 13(7), 1443–1471.
Seidel, R. (1995). The upper bound theorem for polytopes: An easy proof of its asymptotic version. Computational Geometry, 5(2), 115–116.
Serafini, P. (2014). Classifying negative and positive points by optimal box clustering. Discrete Applied Mathematics, 165, 270–282.
Takigawa, I., Kudo, M., & Nakamura, A. (2009). Convex sets as prototypes for classifying patterns. Engineering Applications of Artificial Intelligence, 22(1), 101–108.
Toussaint, G. T. (1980). The relative neighborhood graph of a finite planar set. Pattern Recognition, 12(4), 261–268.
Toussaint, G. T. (2002). Proximity graphs for nearest neighbor decision rules: Recent progress. In Proceedings of the 34th symposium on the interface, Montreal, Quebec (Vol. 34).
Ungar, A. A. (2010). Barycentric calculus in euclidean and hyperbolic geometry: A comparative introduction. Singapore: World Scientific Publishing Co. Pte. Ltd.
Vazirani, V. V. (2001). Approximation algorithms. New York: Springer.
Wang, W., Xu, Z., Lu, W., & Zhang, X. (2003). Determination of the spread parameter in the Gaussian kernel for classification and regression. Neurocomputing, 55(3–4), 643–663.
Warren, J. (1996). Barycentric coordinates for convex polytopes. Advances in Computational Mathematics, 6(1), 97–108.
Watson, D. F. (1981). Computing the \(n\)dimensional Delaunay tessellation with application to Voronoi polytopes. The Computer Journal, 24(2), 167–172.
West, D. B. (2000). Introduction to graph theory (2nd ed.). Englewood Cliffs: Prentice Hall.
Woźniak, M., Graña, M., & Corchado, E. (2014). A survey of multiple classifier systems as hybrid systems. Information Fusion, 16, 3–17.
Acknowledgements
Most of the Monte Carlo simulations presented in this article were executed at Koç University High Performance Computing Laboratory, and the remaining numerical calculations on real data sets were performed at TUBITAK ULAKBIM, High Performance and Grid Computing Center (TRUBA resources).
Editor: Tapio Elomaa.
Appendix
Proof of Theorem 1
We prove this theorem by induction on dimension d. The proof of the case \(d=1\) is trivial. For \({\mathcal {S}}({\mathcal {Y}})=({\mathsf {y}}_1,{\mathsf {y}}_2) \subset {\mathbb {R}}\) and \({\mathsf {y}}_1 < {\mathsf {y}}_2\), the vertex regions \(R_M({\mathsf {y}}_1)\) and \(R_M({\mathsf {y}}_2)\) are the intervals \(({\mathsf {y}}_1,M)\) and \((M,{\mathsf {y}}_2)\), respectively (\(\{x=M\}\) and \(\{x={\mathsf {y}}_i\}\) for \(i=1,2\) have zero \({\mathbb {R}}\)-Lebesgue measure). For \(\alpha _1 \in (0,1)\) and \(\alpha _2=1-\alpha _1\), let \(\alpha _1 {\mathsf {y}}_1 + \alpha _2 {\mathsf {y}}_2\) be the convex (or barycentric) combination of \(x \in {\mathcal {S}}({\mathcal {Y}})\). Hence, for \(m_1 {\mathsf {y}}_1 +m_2 {\mathsf {y}}_2\) being the convex combination of M, we have \(x \in ({\mathsf {y}}_1,M)=R_M({\mathsf {y}}_1)\) if and only if \(\alpha _1 / \alpha _2 > m_1/m_2\). The case \(d=2\) is proved in Proposition 1. Thus, there only remains the case \(d>2\). Suppose the statement is true for all faces of the d-simplex, which are \((d-1)\)-dimensional; based on that, we show that the statement is also true for the d-simplex itself, which is d-dimensional.
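The base case \(d=1\) can be checked numerically. The following sketch (with helper names of our own) verifies that the barycentric criterion \(\alpha _1/\alpha _2 > m_1/m_2\) agrees with direct membership in the interval \(({\mathsf {y}}_1, M)\):

```python
def convex_coords_1d(y1, y2, p):
    """Coefficients (a1, a2) with p = a1*y1 + a2*y2 and a1 + a2 = 1."""
    a2 = (p - y1) / (y2 - y1)
    return 1.0 - a2, a2

def in_vertex_region_y1(y1, y2, M, x):
    """Vertex-region membership via the barycentric criterion:
    x lies in (y1, M) iff alpha1/alpha2 > m1/m2."""
    a1, a2 = convex_coords_1d(y1, y2, x)
    m1, m2 = convex_coords_1d(y1, y2, M)
    return a1 / a2 > m1 / m2
```

For \({\mathsf {y}}_1=0\), \({\mathsf {y}}_2=1\) and \(M=0.5\), the criterion holds exactly for points in \((0, 0.5)\), as the base case asserts.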
It is sufficient to show the result for \({\mathsf {y}}_1\), as the others follow by symmetry. Let \(x \in R_M({\mathsf {y}}_1)\) and note that the elements of the set of \((d-1)\)-faces, \(\{f_j\}^{d+1}_{j=2}\), are adjacent to \({\mathsf {y}}_1\). Each of these faces is of dimension \(d-1\). Hence, they are \((d-1)\)-simplices and they also have their own vertex regions. Thus, let \(R_{M_i}({\mathsf {y}}_j,f_i)\) be the vertex region of \({\mathsf {y}}_j\) with respect to the \((d-1)\)-simplex \(f_i\) for \(j \ne i\). Note that \(M_i\) is the center of \(f_i\). Now, let \(w_{f_i}(z,{\mathsf {y}}_j) = w_{ij}\) be the barycentric coordinate of a point z corresponding to \({\mathsf {y}}_j\) with respect to \(f_i\). Observe that \(w_{ii}\) is not defined since \({\mathsf {y}}_i\) is not a vertex of the face \(f_i\).
Moreover, let \({\mathbf {m}}'=(m'_1,\ldots ,m'_{i-1},m'_{i+1},\ldots ,m'_{d+1})\) be the vector of barycentric coordinates of \(M_i\) with respect to \(f_i\), and note that \(M_i\) is a linear combination of M and \({\mathsf {y}}_i\). Also, observe that \(m'_i\) is not defined since \({\mathsf {y}}_i\) is not a vertex of \(f_i\). Hence, for \(\beta \ge 1\),
Therefore, by the uniqueness of barycentric coordinates, we have \(m'_t=\beta m_t\) for \(t=1,\ldots ,d+1\) and \(t \ne i\). Note that \((1-\beta )=0\) since \(M_i \in f_i\) and also \(f_i \subset \partial ({\mathcal {S}}({\mathcal {Y}}))\). Hence, \(\beta =1\), which implies \(m'_t=m_t\) for all \(t \ne i\). Then, \(m'_1/m'_j=m_1/m_j\) for \(j=2,3,\ldots ,d+1\) and \(j \ne i\). We use this result in our induction hypothesis.
Now, for \(i=2,\ldots ,d+1\), let the face \(f_i\) and the line defined by x and \({\mathsf {y}}_i\) cross at the point \(z_i\). Observe that \(z_i \in f_i\), and since \(f_i\) is a \((d-1)\)-simplex and \(x \in R_M({\mathsf {y}}_1)\), we see that \(z_i \in R_{M_i}({\mathsf {y}}_1,f_i)\). By the induction hypothesis and Equation (14), we observe that \(z_i \in R_{M_i}({\mathsf {y}}_1,f_i)\) if and only if \(w_{i1} > (m'_1/m'_j) w_{ij}\), if and only if \(w_{i1} > (m_1/m_j) w_{ij}\), for \(j=2,3,\ldots ,d+1\) and \(j \ne i\). Since the point x is a convex (and linear) combination of \(z_i\) and \({\mathsf {y}}_i\), for \(\alpha \in (0,1)\), we have
By the uniqueness property of barycentric coordinates, it follows that \(w_{{\mathcal {S}}}^{(1)}(x) = \alpha w_{i1}\) and \(w_{{\mathcal {S}}}^{(j)}(x) = \alpha w_{ij}\). Hence,
Since Eq. (15) is true for all \(i=2,\ldots ,d+1\), we see that \(x \in R_M({\mathsf {y}}_1)\) if and only if \(w_{{\mathcal {S}}}^{(1)}(x) > (m_1/m_i) w_{{\mathcal {S}}}^{(i)}(x)\). Hence, the desired result follows. \(\square\)
Cite this article
Manukyan, A., Ceyhan, E. Classification using proximity catch digraphs. Mach Learn 109, 761–811 (2020). https://doi.org/10.1007/s10994-020-05878-4
Keywords
 Class cover problem
 Delaunay tessellation
 Digraph
 Domination
 Prototype selection
 Support estimation