1 Introduction

In this paper we introduce a new embedding technique that uses a barycentric coordinate system to embed input points into a higher-dimensional, yet quite sparse, representation. The embedding, which we term the Nested Barycentric Coordinate System (NBCS), has the useful property that a piecewise linear function in the origin space can be represented as a single linear function in the target space. We will demonstrate that the NBCS has application to multiple problems, including the following:

Approximating and learning polyhedra. The problem of finding a simple polyhedron consistent with a labelled set is known to be computationally intractable (Khot & Saket, 2011), although better bounds are known for polyhedra with margin (Gottlieb et al., 2018). Still, state-of-the-art algorithms for this problem feature a steep dependence on the inverse margin and on the number of halfspaces defining the polyhedron. We will show that NBCS can be used to model this problem as the well-studied maximum-margin hyperplane problem. In fact, our technique can even solve the case where the labelled space is defined by a union of multiple disjoint polyhedra.

The related problem of learning a separating polyhedron is known to require a large sample size (Goyal & Rademacher, 2009), although this too can be mitigated when the polyhedron has a large margin (Gottlieb et al., 2018). We show that NBCS can be used for this learning problem as well.

Function approximation and regression. The problem of computing a piecewise linear approximation to an input function is of significant mathematical and numerical interest. The learning counterpart of this problem is to model a finite number of function observations using a piece-wise linear regressor, while avoiding overfitting. However, algorithms for these problems often scale poorly with increased dimension or observation size (Hannah & Dunson, 2013). As with polyhedron approximation, we can show that NBCS can be used to transform the problem of finding a piece-wise linear approximator or regressor to that of finding a single linear approximator or regressor in the embedded space.

Techniques. Our proposed embedding technique partitions the input space into a nested hierarchy of simplices, and then embeds each data point into features corresponding to the barycentric coordinates of its containing simplex. This yields an explicit feature map, and allows us to fit a linear separator (or train a linear classifier) in the rich feature space obtained from the simplices.

For sample size n in d-dimensional space, our method has runtime \(O(d^2n)\) regardless of the dimension of the embedding space (when the approximation parameter is taken to be fixed; see Sects. 5 and 6 for exact bounds). In contrast, standard polynomial embedding techniques for p-degree polynomials typically run in time \(O(d^p n)\). Likewise, typical kernel embeddings have runtime \(O(dn^2)\), and so scale poorly to big data.

Our embedding technique allows for highly non-linear decision boundaries, although these are linear within each simplex (and hence piece-wise-linear overall), as explained in Sect. 3. At the same time, our approach is sufficiently robust to closely approximate realizable convex bodies or functions for any given error, in only linear time for fixed dimension and precision (Sect. 4). We also give generalization bounds based on empirical margin (Theorem 16) and a novel hybrid sample compression technique (Theorem 17). Finally, we perform an extensive empirical evaluation in which our method compares favorably with a wide range of other popular kernel and embedding methods, for classification and regression (Sect. 8).

An extended abstract of this work previously appeared in Gottlieb et al. (2021b), and this current version expands on that abstract. In this full version we present complete proofs and new material: for example, we show how our system can approximate a polytope under the Hausdorff measure and functions under the min-squared-error measure, and we also present new hybrid PAC-compression bounds (all in Sect. 4).

2 Related work

Approximating convex polyhedra

Learning arbitrary convex bodies requires a very large sample size (Goyal & Rademacher, 2009), and so we focus instead on convex polyhedra defined by a small number of halfspaces. However, the problem of finding consistent polyhedra is known to be \(\textrm{NP}\)-complete even when the polyhedron is simply the intersection of two halfspaces (Megiddo, 1988). In fact, Khot and Saket (2011) showed that “unless \(\textrm{NP}=\textrm{RP}\), it is hard to (even) weakly PAC-learn intersection of two halfspaces”, even when allowed the richer class of O(1) intersecting halfspaces. Klivans and Sherstov (2009) showed that learning an intersection of \(n^\varepsilon\) halfspaces is intractable regardless of hypothesis representation (under certain cryptographic assumptions). These negative results have motivated researchers to consider the problem of discovering consistent polyhedra which have some separating margin. Several approximation and learning algorithms have been suggested for this problem, featuring bounds with steep dependence on the inverse margin and number of halfspaces forming the polyhedron (Arriaga & Vempala, 2006; Klivans & Servedio, 2008; Gottlieb et al., 2018; Goel & Klivans, 2018).

In contrast, we show in Sect. 4 that our method is capable of approximating any convex polyhedron in time independent of the halfspace number: We can achieve linear runtime (in fixed dimension) with mild dependence on the inverse margin, or quadratic runtime with only polylogarithmic dependence on the inverse margin. We accomplish this by finding a linear separator in the higher-dimensional embedded space, and projecting the solution back into the origin space. However, our approach is not strictly comparable to those above, as they are concerned with minimizing the disagreement between the computed polyhedron (or object) and the true underlying polyhedron with respect to the point space, while we minimize the volume of the region between these two polyhedra.

Embedding and Kernel maps

No embedding or kernel trick is known for finding piecewise linear functions. Kernel methods provide two principal benefits over embedding techniques: (1) they implicitly induce a non-linear feature map, which allows for a richer space of classifiers, and (2) when the kernel trick is available, they effectively replace the dimension d of the feature space with the sample size n as the computational complexity parameter. As such, kernel methods are well-suited for the ‘high dimension, moderate data size’ regime. For very large datasets, however, naive use of kernel methods becomes prohibitive. The cost is incurred both at the training stage, where an optimal classifier is searched for over an n-dimensional space, and at the hypothesis evaluation stage, where a sum of n kernel evaluations must be computed. For these reasons, for large data sets, explicit feature maps are preferred. Various explicit feature-map approximations have been proposed to mitigate these computational challenges, including Chang et al. (2010), Maji and Berg (2012), Perronnin et al. (2010), Rahimi and Recht (2007), Vedaldi and Zisserman (2012), Li et al. (2010), Shahrampour and Tarokh (2018), Chum (2015), Zafeiriou and Kotsia (2013). Kernel approximations for explicit feature maps come in two basic varieties:

Data-dependent kernel approximations. This category includes Nystrom’s approximation (Williams & Seeger, 2000), which projects the data onto a suitably selected subspace. If \(K(x, z_i)\) is the projection of example \(x\) onto the basis element \(z_i\), the points \(\{z_{1}, \ldots , z_{n}\}\) are chosen to maximally capture the data variability. Some methods select \(z_i\) from the sample. The selection can be random (Williams & Seeger, 2001), greedy (Smola & Schölkopf, 2000), or involve an incomplete Cholesky decomposition (Fine & Scheinberg, 2001). Perronnin et al. (2010) applied Nystrom’s approximation to each dimension of the data independently, greatly increasing the efficiency of the method.

Data-independent kernel approximations. This category includes sampling the Fourier domain to compute explicit maps for translation-invariant kernels. Rahimi and Recht (2007, 2009) do this for the radial basis function kernel, a technique also known as Random Kitchen Sinks. Li et al. (2010) and Vedaldi and Zisserman (2012) applied this technique to certain group-invariant kernels, and proposed an adaptive approximation to the \(\chi ^2\) kernel. Porikli and Ozkan (2011) map the input data onto a low-dimensional spectral (Fourier) feature space via a cosine transform. Vempati et al. (2010) proposed a skewed chi-squared kernel, which allows for a simple Monte Carlo approximation of the feature map. Maji and Berg (2012) approximated the intersection kernel and the \(\chi ^2\) kernel by a sparse closed-form feature map. Pele et al. (2013) suggested using not only piece-wise linear functions in each feature separately, but also adding all pairs of features. Chang et al. (2010) conducted an extensive study on the usage of the second-order polynomial explicit feature map. Bernal et al. (2012) approximated second-order feature relationships via a Conditional Random Field model.
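As an illustration of the data-independent approach (and not part of our method), the following Python sketch implements the random Fourier feature map of Rahimi and Recht (2007) for the RBF kernel; the function name and default parameters are ours.

```python
import numpy as np

def rff_features(X, D=500, sigma=1.0, seed=0):
    """Illustrative random Fourier features for the RBF kernel.

    Maps X (n x d) to an explicit D-dimensional feature space in which
    z(x) . z(y) approximates exp(-||x - y||^2 / (2 sigma^2)).
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(d, D))   # random frequencies
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)        # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)
```

A linear classifier trained on these explicit features approximates kernel SVM with the RBF kernel at a cost linear, rather than quadratic, in the sample size n.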

SVM decompositions

Simplex decompositions have been used to produce proximity-based classifiers (Belkin et al., 2018; Davies, 1996), but to the best of our knowledge, ours is the first work to utilize either nested simplex decompositions or barycentric coordinates in conjunction with SVM. Simplex decompositions are related to the quadtree, and the quadtree has been used together with SVM for various learning tasks (Saavedra et al., 2004; Beltrami & da Silva, 2015), but not for the creation of kernel embeddings. Simplex decompositions are more efficient than quadtrees, since a simplex naturally decomposes into only \(d+1\) sub-simplices (Sect. 3), while a quadtree cell naturally decomposes into \(2^d\) sub-cells. In particular, our simplex decomposition allows us to decompose the entire space into a nested continuous sequence of cells completely covering a convex area. We may choose the number of such cells to our liking – for example, in Sect. 6 we stipulate that each cell must contain an input point (or at least neighbor a cell containing an input point), and this allows us to decompose into only O(dn) cells. In contrast, a similar quadtree decomposition completely covering a convex area would require \(2^d n\) cells. Our continuous convex cover allows us to efficiently build piece-wise continuous classifiers and functions on top of the decomposition.

As mentioned, our emphasis in this paper is specifically on explicit feature maps, but there are numerous approaches to reducing kernel SVM runtime (for example the CoreSVM of Tsang et al. (2005, 2007)). Another related paradigm is that of Local SVM (Zhang et al., 2006; Gu & Han, 2013), which assumes continuity of the labels with respect to spatial proximity: similarly labeled points tend to cluster together. This differs from the underlying assumption motivating kernel SVM, which assumes that the data is approximately linearly separable, but not necessarily clusterable. These approaches find success in distinct settings, and are incomparable.

Regression and function approximation

Univariate piece-wise linear regression is defined by a union of lines corresponding to distinct segments of the x-axis. Given a partition of the x-axis into K segments (defined by endpoints \(a_1,\ldots ,a_{K+1}\)) and associated weights \(\{w_k\}_{k=1}^K\) and offsets \(\{b_k\}_{k=1}^K\), define the functions \(g_k(x) = x\) when \(x \in [a_k,a_{k+1}]\) and \(g_k(x) = 0\) otherwise. The hypothesis takes the form:

$$\begin{aligned} {\tilde{f}}(x) = \sum _{k \in [K]} w_k g_k(x) + b_k. \end{aligned}$$
(1)

Typically the objective is to reduce the mean square error between the target function f and a piece-wise linear approximation of f. The extension to higher dimensions is far from trivial, as the choice of high-dimensional partition is not immediate, and potentially quite complex. But if we add assumptions on f such as monotonicity and convexity, then the problem can take the form of a convex program, yielding a convex regression problem:

$$\begin{aligned} {\tilde{f}}(x) = \max _{k \in [K]} {w_k^T \cdot x + b_k}. \end{aligned}$$
(2)

Here \(w_k \in \mathbb {R}^d\) and \({\tilde{f}}(x)\) is a piece-wise linear convex function whose knots (split points) are parametrically estimated.

Solving the above convex optimization problem has a computational complexity of \(O((d+1)^4n^5)\) (Monteiro & Adler, 1989), which quickly becomes impractical. Some methods impose convexity constraints only over pairs of observations (Hildreth, 1954; Kuosmanen, 2008; Seijo & Sen, 2011; Allon et al., 2007; Lim & Glynn, 2012), while others impose semidefinite constraints over all observations (Aguilera, 2008; Wang & Ni, 2012). Aguilera et al. (2011) proposed a two-step smoothing and fitting process: first, the data is smoothed and functional estimates are generated over an \(\epsilon\)-net over the domain; then the convex hull of the smoothed estimate is used as a convex estimator. Although this algorithm does scale to larger data sets, it does not scale gracefully to large dimension, since the \(\epsilon\)-net divides each dimension separately into K partitions, thereby resulting in a new embedding space of order \(O(d^K)\). Hannah and Dunson (2011) proposed a Bayesian model that places a prior over the set of all piece-wise linear models. They were able to show adaptive rates of convergence, but the inference algorithm did not scale to more than a few thousand observations. Koushanfar et al. (2010) transformed the ordering problem associated with shape-constrained inference into a combinatorial optimization problem solved with dynamic programming; this scales to a few hundred observations. Magnani and Boyd (2009) proposed dividing the data into K random subsets and fitting a linear model within each subset; a convex function is then generated by taking the maximum over these hyperplanes. Hannah and Dunson (2013) introduced convex adaptive partitioning, which creates a globally convex regression model from locally linear estimates fit on adaptively selected covariate partitions. Our work suggests a different approach: embedding the problem directly into a Hilbert space via an explicit feature map. In the embedded space, the regression problem is an unconstrained linear regression problem instead of a piece-wise regression problem. After solving the problem in the embedded space, the projection back to the original space yields a continuous, convex, piecewise linear manifold.
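To make the convex-regression hypothesis (2) concrete, the following sketch evaluates a max-affine model; the array shapes and names are ours and serve only as an illustration.

```python
import numpy as np

def max_affine(X, W, b):
    """Evaluate the convex piece-wise linear hypothesis of Eq. (2):
    f_tilde(x) = max_k (w_k . x + b_k).

    X: (n, d) inputs, W: (K, d) slopes, b: (K,) offsets.
    """
    return (X @ W.T + b).max(axis=1)

# Example: with W = I and b = 0 the hypothesis is f(x) = max(x_1, x_2).
X = np.array([[0.2, 0.7], [0.9, 0.1]])
print(max_affine(X, np.eye(2), np.zeros(2)))   # [0.7 0.9]
```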

3 The Nested barycentric coordinate system

Here we describe the nested barycentric coordinate system. We explain its construction and description, how to embed a point from the origin space into the new coordinate system, and how a point in the embedded system can be projected back into the origin space (Sect. 3.1). We then show that if we associate a weight with each simplex point, then the embedding and weights together imply some (not necessarily convex) polyhedron on the origin space (Sect. 3.2). Later in Sect. 4 we will show that this system is sufficiently robust that it can be used to approximate any convex body.

3.1 Nested barycentric embedding

Let \(S\subset {\mathbb {R}}^d\) be a regular simplex of unit side-length, and let \((q_0, \ldots , q_d)\) be its vertices. Each point x inside the simplex can be written using the barycentric coefficients:

$$\begin{aligned} \begin{aligned}&x = \sum _{i=0}^{d} \alpha _{i} q_i\\&\sum _{i=0}^{d} \alpha _{i} = 1 \quad 0 \le \alpha _{i} \le 1 \end{aligned} \end{aligned}$$
(3)

Here \(\alpha _i\) denotes the coefficient of point x corresponding to vertex \(q_i\). Denote the vector of \(\alpha\)’s corresponding to x by \(\phi _{d+1}(x)=(\alpha _0, \ldots , \alpha _d )\). This vector can be computed by letting

$$\begin{aligned} \phi _{d+1}(x) = Q^{-1}\cdot (x, 1)^T, \end{aligned}$$
(4)

where Q is the \((d+1)\times (d+1)\) matrix whose ith column is \((q_i, 1)^T\).
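As a minimal illustration of Eq. (4), the following sketch computes the barycentric coefficients of a point with respect to a given simplex; the function name is ours.

```python
import numpy as np

def barycentric_coords(x, vertices):
    """Barycentric coefficients of x with respect to a simplex (Eq. (4)).

    vertices: (d+1, d) array whose rows are q_0, ..., q_d.
    Returns alpha with x = sum_i alpha_i q_i and sum_i alpha_i = 1.
    """
    Q = np.vstack([np.asarray(vertices).T, np.ones(len(vertices))])  # columns (q_i, 1)^T
    return np.linalg.solve(Q, np.append(x, 1.0))                     # Q^{-1} (x, 1)^T

# The barycenter of the standard 2-simplex has coefficients (1/3, 1/3, 1/3):
V = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(barycentric_coords(V.mean(axis=0), V))
```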

We can further refine the system by introducing a new point \(q_{d+1}\) inside the simplex, thereby inducing a partition of the simplex into \(d+1\) new sub-simplices, \(S_1,\ldots ,S_{d+1}\). We order the coordinates of our system as \((q_0, \ldots , q_{d+1})\). A point x inside the system is embedded by first identifying the sub-simplex \(S_i\) containing x, and then utilizing the \(d+1\) vertices of \(S_i\) to compute the barycentric coefficients (the \(\alpha\)’s) of equation (3). Then x is assigned a vector wherein each coordinate corresponding to each vertex of \(S_i\) is set to the coefficient of that vertex, and the remaining coordinates are set equal to 0. This defines the embedding \(\phi _{d+2}(x): \mathbb {R}^{d} \rightarrow \mathbb {R}^{d+2}\).

The refinement process can be continued by choosing a single point inside a simplex to further split that simplex. In general, after a sequence \(Q_t = (q_0,..., q_t)\) of \(t+1\) points has been chosen, we have a nested architecture represented by a \((d+1)\)-ary tree \(B_t\) data structure (illustrated in Fig. 1): The root of this tree is labeled by the vertices \(q_0,..., q_d\) of the original simplex S. Each node v of tree \(B_t\) is labeled by a \((d+1)\)-tuple of points of \(Q_t\), which are the vertices of the simplex \(s(v) \subseteq S\) associated with v. Given a new split point \(q_{t+1}\), we form \(B_{t+1}\) from \(B_t\) by traversing the tree from the root down, at each level choosing the node whose simplex contains \(q_{t+1}\), until encountering the leaf v whose simplex contains \(q_{t+1}\). Then we add \(d+1\) children to leaf v, each one labeled by a different \((d+1)\)-tuple in which \(q_{t+1}\) replaces a vertex of v. Note that the simplices corresponding to the leaves of \(B_t\) form a partition of S.

The tree \(B_t\) defines an embedding \(\phi _t(x): \mathbb {R}^d \rightarrow \mathbb {R}^{t+1}\) as follows: Given x, let v be the leaf of \(B_t\) whose simplex s(v) contains x. Let \(q_{i_0},..., q_{i_d}\) be the vertices of s(v). Write \(x = \Sigma \alpha _{i_j} q_{i_j}\) as a convex combination of these vertices, and let \(\alpha _k = 0\) for all other points \(q_k\). Then \(\phi _t(x) = (\alpha _0,..., \alpha _t)\). (If x lies on the boundary of several leaf simplices, then we can arbitrarily choose any one of them, since the resulting embedding \(\phi _t(x)\) will be the same in all cases.)

We note that the embedding — the nested barycentric coordinate system — is sparse, as at most \(d+1\) coefficients are non-zero, and also that the embedded points lie on the \(L_{1}\) sphere (\(\sum \alpha = 1\)).
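The following Python sketch is provided only as an illustration of the construction above: it maintains the list of leaf simplices explicitly (a production implementation would instead walk the \((d+1)\)-ary tree \(B_t\)), and all class and method names are ours.

```python
import numpy as np

class NestedBarycentricEmbedding:
    """Minimal sketch of the nested coordinate system of Sect. 3.1.

    `points` holds q_0, q_1, ...; `leaves` holds, for each leaf simplex,
    the (d+1)-tuple of indices of its vertices.
    """

    def __init__(self, simplex_vertices):
        self.points = [np.asarray(q, float) for q in simplex_vertices]  # q_0..q_d
        self.leaves = [tuple(range(len(self.points)))]

    def _coords(self, x, leaf):
        V = np.array([self.points[i] for i in leaf])
        Q = np.vstack([V.T, np.ones(len(leaf))])
        return np.linalg.solve(Q, np.append(x, 1.0))

    def _containing_leaf(self, x, tol=1e-9):
        # linear scan over leaves; boundary points go to the first match
        for leaf in self.leaves:
            a = self._coords(x, leaf)
            if np.all(a >= -tol):
                return leaf, a
        raise ValueError("x lies outside the root simplex")

    def split(self, q_new):
        """Add split point q_new, replacing its containing leaf by d+1 children."""
        leaf, _ = self._containing_leaf(q_new)
        self.points.append(np.asarray(q_new, float))
        j = len(self.points) - 1
        self.leaves.remove(leaf)
        for k in range(len(leaf)):                       # d+1 children
            self.leaves.append(leaf[:k] + (j,) + leaf[k + 1:])

    def embed(self, x):
        """phi_t(x): sparse vector of barycentric coords in the containing leaf."""
        leaf, a = self._containing_leaf(x)
        phi = np.zeros(len(self.points))
        phi[list(leaf)] = a
        return phi
```

After `split` has been called for each chosen split point, `embed(x)` returns the sparse vector \(\phi _t(x)\), with at most \(d+1\) non-zero entries.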

Fig. 1: Constructing the nested system

Fig. 2: An example of three disjoint polyhedra (in red) created by NBCS (Color figure online)

A point in the embedded space can be projected back into the original space by taking

$$\begin{aligned} \begin{aligned}&x = \sum _{i=0}^{t} \alpha _{i} q_i.\\ \end{aligned} \end{aligned}$$
(5)

3.2 Weights, hyperplanes and polyhedra

Given an embedding, we will assign a sequence of real-valued weights \(w=(w_0, \ldots , w_t)\) to the vertices \(( q_0, \ldots , q_t)\). Then the set of points R such that

$$\begin{aligned} R= \{ x\in S: w\cdot \phi _t(x) \ge 0\} \end{aligned}$$
(6)

is a union of interior-disjoint convex regions, each of which equals the intersection of one of the simplices with a half space (see Fig. 2 for an illustration).

Lemma 1

For any hyperplane that crosses a given simplex P, there exists a sequence of weights \(w = (w_0, \ldots , w_d)\), such that for all points \(x \in P\), x lies in the hyperplane if and only if it satisfies \(w \cdot \phi _{d+1}( x) = 0\).

Proof

Choose d affinely independent points \(\{v_i\} \subset P\) on the given hyperplane. Each such point has a unique set of barycentric coefficients with respect to the containing simplex, and thus a unique representation. Finding the desired weights is equivalent to solving \(A \cdot w = {0}\), where A is the \(d \times (d+1)\) matrix whose rows are the embeddings \(\phi _{d+1}(v_i)\) of these points. Since this homogeneous linear system has more unknowns than equations, it has a nontrivial solution.

If we add the constraint that \(\Vert w\Vert =1\), then we can solve the equations for w. Any point \(x \in P\) on the hyperplane can be written as an affine combination of the points \(\{v_i\}\), namely \(x=\sum _{i=1}^{d}\alpha _i v_i\). Writing this equation in the basis of the vertices \(\{q_i\}\) gives \(\phi _{d+1}(x)=\sum _{i=1}^{d}\alpha _i \phi _{d+1}(v_i)\), or in matrix form \(\phi _{d+1}(x) = A^T \alpha\). From this we conclude that \(w \cdot \phi _{d+1}(x) = \alpha ^T A w = 0\). \(\square\)

In Sect. 4 we will show that a simple nested barycentric system, together with a prudent choice of weights, can be used to closely approximate any given convex body. To this end, we will require a useful property of these systems — essentially, that splitting a simplex cannot decrease the expressiveness of the system. For this, it is enough to show the following:

Theorem 2

Let tree \(B_{d+1}\) be defined by points \((q_0, \ldots , q_d)\) (so that it consists of a single simplex P), let \(w=(w_0, \ldots , w_d)\) be a sequence of weights, and let \(H=\{x\in P: w\cdot \phi _{d+1}(x)=0\}.\) Let \(q_{d+1}\in P\) be a split point, and let tree \(B_{d+2}\) be defined by points \((q_0, \ldots , q_{d+1}).\) Then there exists a weight \(w_{d+1}\) for \(q_{d+1}\) such that, letting \(w' = (w_0, \ldots , w_{d+1})\), we have

$$\begin{aligned} \{x\in P: w'\cdot \phi _{d+2}(x) = 0\} = H. \end{aligned}$$

Proof

We will show that the desired weight is \(w_{d+1} = \phi _{d+1}(q_{d+1})\cdot w\). Indeed, let

$$\begin{aligned} \phi _{d+1}(x)&= (\alpha _0, \ldots , \alpha _d),\\ \phi _{d+2}(x)&= (\beta _0, \ldots , \beta _{d+1}),\\ \phi _{d+1}(q_{d+1})&= (\gamma _0, \ldots , \gamma _d). \end{aligned}$$

Then since \(q_{d+1} = \sum _{i=0}^d \gamma _i q_i\) we have

$$\begin{aligned} x=\sum _{i=0}^{d+1} \beta _i q_i = \sum _{i=0}^d (\beta _i+\beta _{d+1}\gamma _i) q_i. \end{aligned}$$

Since the barycentric representation is unique, this implies that \(\alpha _i = \beta _i + \beta _{d+1}\gamma _i\) for all \(i\le d\). Hence,

$$\begin{aligned} w'\cdot \phi _{d+2}(x)= & {} \sum _{i=0}^d \beta _i w_i + \beta _{d+1}\sum _{i=0}^d \gamma _i w_i\\= & {} \sum _{i=0}^d \alpha _i w_i\\= & {} w \cdot \phi _{d+1}(x), \end{aligned}$$

and this implies the claim. \(\square\)

The significance of the above theorem is that an unnecessary split is not harmful, in that the subsequent optimization problem can assign weights which neutralize the effect of the unnecessary split.
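The following self-contained sketch checks Theorem 2 numerically on random data: assigning the new split point the weight \(w_{d+1}=\phi _{d+1}(q_{d+1})\cdot w\) leaves the value \(w\cdot \phi _{d+1}(x)\) unchanged. The setup (a right simplex, Dirichlet sampling) is our own illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
V = np.vstack([np.zeros(d), np.eye(d)])        # vertices q_0, ..., q_d of a simplex
w = rng.normal(size=d + 1)                      # arbitrary weights

def coords(x, verts):
    """Barycentric coefficients of x in the simplex spanned by `verts`."""
    Q = np.vstack([verts.T, np.ones(len(verts))])
    return np.linalg.solve(Q, np.append(x, 1.0))

# split point q_{d+1} strictly inside S, with the weight from Theorem 2
gamma = rng.dirichlet(np.ones(d + 1))
q_new, w_new = gamma @ V, gamma @ w             # w_{d+1} = phi_{d+1}(q_{d+1}) . w

for _ in range(100):
    x = rng.dirichlet(np.ones(d + 1)) @ V       # random point in S
    before = coords(x, V) @ w                   # w . phi_{d+1}(x)
    for k in range(d + 1):                      # locate the child simplex containing x
        child = V.copy()
        child[k] = q_new
        beta = coords(x, child)
        if np.all(beta >= -1e-9):
            w_child = w.copy()
            w_child[k] = w_new
            assert np.isclose(before, beta @ w_child)   # equals w' . phi_{d+2}(x)
            break
```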

3.3 Relation between margins

For some simplex lying in the origin space, let \(\epsilon\) be the margin separating the points inside this simplex, that is \(w\cdot x\ge \epsilon\) for all x in the simplex, where w is a unit vector. It follows that

$$\begin{aligned} w\cdot x= w x^T = (w,0)(x,1)^T= (w,0) Q Q^{-1} (x,1)^T = {\tilde{w}} {\tilde{x}}^T={\tilde{w}}\cdot {\tilde{x}}, \end{aligned}$$

where the matrix Q is constructed as defined above, using the vertices of the simplex containing x. Here \({\tilde{x}}^T = Q^{-1}(x,1)^T\) is the point in the embedded space, and \({\tilde{w}} = (w,0) Q\) is the representation of the hyperplane w in the embedded space. Setting \(\lambda _{\textrm{max}}\) to be the maximal eigenvalue of \((Q^T Q)^{-1}\), it follows that the margin separating these points in the embedded space is \(\frac{\varepsilon }{\sqrt{\lambda _{\textrm{max}}}}\).
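As a small illustration of this computation, the sketch below forms the matrix Q for a given simplex and returns the embedded margin \(\varepsilon /\sqrt{\lambda _{\textrm{max}}}\); the function name and the example simplex are ours.

```python
import numpy as np

def embedded_margin(eps, vertices):
    """Margin in the embedded space for a simplex with the given vertices,
    following Sect. 3.3:  eps / sqrt(lambda_max((Q^T Q)^{-1})).
    """
    d1 = len(vertices)                                   # d + 1 vertices
    Q = np.vstack([np.asarray(vertices).T, np.ones(d1)]) # i-th column (q_i, 1)^T
    lam_max = np.linalg.eigvalsh(np.linalg.inv(Q.T @ Q)).max()
    return eps / np.sqrt(lam_max)

# Example: a right simplex in the plane with origin-space margin 0.1
V = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(embedded_margin(0.1, V))
```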

4 Convex body approximation

In this section we prove the following properties regarding convex body approximation using NBCS: In Sect. 4.1 we show that NBCS can approximate any convex manifold in the Hausdorff sense, and can construct a consistent manifold for a given convex body with margin \(\varepsilon\). In Sect. 4.2 we show that NBCS can approximate any concave function in the MSE sense using an Archimedean-style subdivision. Finally, in Sect. 5 we present hybrid PAC-compression out-of-sample error bounds for our methods.

4.1 NBCS polyhedron Hausdorff approximation and consistent polyhedron inside a margin

In this section we show that the nested barycentric coordinate system (NBCS) can represent an arbitrarily close approximation to any convex body. As stated the NBCS produces a (not necessarily convex) piece-wise linear classifier. In fact, this method can approximate multiple convex bodies. For simplicity, we focus on the case of a single convex body, and demonstrate how our method approximates it. This will be done by placing split points at the barycenters of their containing simplices, where the barycenter of a simplex with vertices \(p_0, \ldots , p_{d}\) is given by \((p_0 + \cdots + p_d)/(d+1)\). We note that in this section we focus on proving the existence of a coordinate system satisfying the above goal; questions of optimality and runtime will be addressed in later sections.

In order to state our result formally, we introduce some notation: Given a point \(p\in {\mathbb {R}}^d\) and a parameter \(\varepsilon >0\), let \(B_\varepsilon (p) = \{q\in {\mathbb {R}}^d: \Vert q-p\Vert _2\le \varepsilon \}\) be the ball of radius \(\varepsilon\) centered at p. Given a set \(X\subseteq {\mathbb {R}}^d\), let

$$\begin{aligned} X^{(-\varepsilon )} = \{p\in X:B_\varepsilon (p)\subseteq X\} \end{aligned}$$

be the set of all points of X that are at distance at least \(\varepsilon\) from the boundary of X. See Fig. 3 (left) for an example in the plane. Recall that S denotes the simplex of side-length 1.

Theorem 3

Let \(P\subseteq S\) be an n-vertex convex polytope, and let \(0<\varepsilon <1\) be given. Set \(s = 2^{O(d)}\ln ^2(1/\varepsilon )\). Then there exists a nested system \(B_t\) of \((d+1)^s\) distinct simplices obtained by always placing split points at the barycenters of their containing simplices, and a corresponding set of weights w, such that

$$\begin{aligned} {\tilde{P}}= \{ x\in S: w\cdot \phi _t( x) \ge 0\} \end{aligned}$$
(7)

satisfies the following:

  1. 1.

    \({{\,\textrm{vol}\,}}({\tilde{P}}\setminus P) < \varepsilon {{\,\textrm{vol}\,}}(S)\).

  2. 2.

    \(P^{(-\varepsilon )} \subseteq {\tilde{P}}^{(-\varepsilon )} \subseteq P \subseteq {\tilde{P}}\).

Further, this system may be computed in time

$$\begin{aligned} O((d+1)^s(d^2n + e^{O(\sqrt{d \log d})})). \end{aligned}$$
Fig. 3: Left: Example of a set X (light gray) and the corresponding set \(X^{(-\varepsilon )}\) (dark gray). Right: Example of an approximation \({\tilde{P}}\) (light gray) produced by the NBCS to a given polytope P (dark gray)

The approximation \({\tilde{P}}\) produced by the NBCS typically has long ‘spikes,’ but as these are necessarily thin, they disappear in \({\tilde{P}}^{(-\varepsilon )}\). See Fig. 3 (right) for an example in the plane.

The importance of Theorem 3 is that a simple NBCS can closely approximate any convex body P. Moreover, if P has margin \(\varepsilon\), then NBCS produces a set \({\tilde{P}}^{(-\varepsilon )}\) which falls fully within the margin of P. As described in Sect. 2, finding a polytope consistent with a convex body with margin is a problem of great interest which has been widely investigated.

Proof

The construction proceeds in stages \(i=0, 1, \ldots , s\). At stage 0 the only points present are the vertices of S. At each stage i, \(i\ge 1\), a new split point is placed at the barycenter of each existing simplex, and the final construction is called the s-stage uniform subdivision of S. Let \(A_i\) be the set of simplices present at stage i; clearly \(|A_i| = (d+1)^i\). Note that all simplices in \(A_i\) have the same volume.

The weights \(w_i\) are assigned as follows: Initially, vertices \(q_0, \ldots , q_d\) of S are assigned weights \(w_0 = \cdots = w_d = -1\). For the first introduced split point, we assign the new point the smallest possible weight that ensures \({\tilde{P}}\supseteq P\), where \({\tilde{P}}\) is given by (7). For this, it suffices to ensure that the n vertices defining P are inside \({\tilde{P}}\), and this involves solving a linear system. For each subsequent split point, the weight is assigned as follows: If the simplex which was split does not intersect P, then the split point is assigned weight \(-\infty\). If the simplex which was split intersects P, then the split point is assigned the average weight of the \(d+1\) vertices of the split simplex. Once a weight is assigned to a point, it is never changed again. In other words, for those points of \(B_{i+1}\) that already belonged to \(B_i\), their weights at \(B_{i+1}\) are the same as their weights at \(B_i\). This completes the system construction.

We first derive the runtime of the construction. One can determine whether P intersects a given simplex by solving a linear program. This can be done in time \(O(d^2n + e^{O(\sqrt{d \log d})})\) (Gärtner, 1995). As each stage i has \((d+1)^i\) simplices, the runtime follows (for appropriate constant in the exponent of s). We proceed to prove the remaining bounds of the Theorem.

Let \(S'\in A_i\) be a simplex with vertices \(q_{i_0}, \ldots , q_{i_d}\) and weights \(w_{i_0}, \ldots , w_{i_d}\), respectively. Let \(q' = (q_{i_0} + \cdots + q_{i_d})/(d+1)\) be the barycenter of \(S'\). By Theorem 2, if \(q'\) is assigned weight \(w_{\textrm{avg}} = (w_{i_0} + \cdots + w_{i_d})/(d+1)\), then \({\tilde{P}}\cap S'\) remains unchanged. (Although we may be able to assign \(q'\) a weight \(w'\le w_{\textrm{avg}}\) satisfying \({\tilde{P}}\supseteq P\), computing the smallest such \(w'\) may be computationally expensive, and so we avoid this.) Hence, only the first split point may be assigned a weight larger than \(w_{\textrm{avg}}\); the weight \(w'\) assigned to every subsequent split point satisfies \(w'\le w_{\textrm{avg}}\). Therefore, after the first split, \({\tilde{P}}\) only shrinks at each subsequent stage.

Let us denote by \({\tilde{P}}_s\) the region \({\tilde{P}}\) produced by this construction after stage s. (See Fig. 4 for an illustration in the plane.) We will now prove that, if s is made large enough, then \({\tilde{P}}_s\) approximates the given convex body P arbitrarily well, as stated in the theorem.

Fig. 4: Four stages of the approximation of a given convex polygon in the plane (Color figure online)

The diameter of a compact subset of \({\mathbb {R}}^d\) is the maximum distance between two points in the set. In particular, the diameter of a simplex is the largest distance between two vertices of the simplex. \(\square\)

Lemma 4

Let \(S'\) be a simplex with vertices \(p_0, \ldots , p_d\), let c be the diameter of \(S'\), and let q be the barycenter of \(S'\). Then the distance between q and any vertex \(p_i\) is at most \(cd/(d+1)\).

Proof

Fix \(p_i = 0\) for concreteness. Then, under the constraints \(\Vert p_j\Vert _2 \le c\) for \(j\ne i\), the distance between q and \(p_i\) is maximized by the degenerate simplex that has \(p_j = (c, 0, \ldots , 0)\) for all \(j\ne i\), in which case \(q = (cd/(d+1), 0, \ldots , 0)\), which yields the claimed distance. \(\square\)

Lemma 5

Let \(S'\) be a simplex with diameter c. Let A be the collection of the \((d+1)^d\) simplices obtained by a d-stage uniform subdivision of \(S'\). Then there are at least \((d+1)!\) simplices in A with diameter at most \(cd/(d+1)\).

Proof

By Lemma 4, every simplex in A that contains at most one vertex of \(S'\) will have diameter at most \(cd/(d+1)\). Each time a simplex \(S''\) is subdivided into \(d+1\) simplices by an interior point q, the new simplices share only d of their vertices with \(S''\). Hence, at stage 1 of the subdivision of \(S'\), there are \(d+1\) simplices that share only d vertices with \(S'\); at stage 2, there are \((d+1)d\) simplices that share only \(d-1\) vertices with \(S'\); and so on, until at stage d there are \((d+1)d\cdots 2=(d+1)!\) simplices that share only one vertex with \(S'\). \(\square\)

Recall that \(A_i\) denotes the collection of simplices present in the i-stage uniform subdivision of S.

Lemma 6

Let k, z be integers, and set \(s = zkd\). Then at most a \(\bigl (z(1-e^{-d})^k\bigr )\)-fraction of the simplices in \(A_s\) have diameter larger than \((d/(d+1))^z\).

Proof

The proof is by repeated application of Lemma 5. After kd stages, at most an \(\alpha\)-fraction of the simplices in \(A_{kd}\) have diameter larger than \(d/(d+1)\), for \(\alpha = \left( 1-\frac{(d+1)!}{(d+1)^d}\right) ^k\). All the other simplices have diameter at most \(d/(d+1)\). Of the latter simplices, after kd more stages, at most an \(\alpha\)-fraction of their descendants have diameter larger than \((d/(d+1))^2\). Hence, in \(A_{2kd}\), the fraction of simplices with diameter larger than \((d/(d+1))^2\) is at most \(\alpha + (1-\alpha )\alpha < 2\alpha\). Continuing in this manner, in \(A_{zkd}\), the fraction of simplices with diameter larger than \((d/(d+1))^z\) is at most \(z\alpha\). Since \((d+1)!/(d+1)^d > e^{-d}\) for all d, the lemma follows. \(\square\)

We can now complete the proof of Theorem 3. Given \(\varepsilon\), let \(\rho = \varepsilon / (2\sqrt{2}d^2)\). Choose z minimally so that \((d/(d+1))^z \le \rho\), and then choose k minimally so that \(z(1-e^{-d})^k \le \varepsilon /2\). Let \(s=1+zkd\). (Hence, we have \(s\le c^d\ln ^2(1/\varepsilon )\) for some c.)

As mentioned above, for every simplex in \(A_{s-1}\) that is completely disjoint from P, its barycenter will be given a weight of \(-\infty\) in stage s, so the simplex will be completely disjoint from \({\tilde{P}}_s\). In the worst case, every other simplex in \(A_{s-1}\) will be completely contained in \({\tilde{P}}_s\). Hence, let \(Z_1\) be the region surrounding P that is at distance at most \(\rho\) from P, and let \(Z_2\) be the union of all the simplices in \(A_{s-1}\) with diameter larger than \(\rho\). Thus, every point in \({\tilde{P}}_s \setminus P\) belongs to \(Z_1\cup Z_2\). Let us bound each of \({{\,\textrm{vol}\,}}(Z_1)\) and \({{\,\textrm{vol}\,}}(Z_2)\).

As \(\rho \rightarrow 0\) (keeping P fixed) we have \({{\,\textrm{vol}\,}}(Z_1) \le (1+o(1))\rho {{\,\textrm{surf}\,}}(P)\), where \({{\,\textrm{surf}\,}}\) denotes the \((d-1)\)-dimensional surface volume.

Furthermore, P and S are both convex with \(P\subseteq S\), so \({{\,\textrm{surf}\,}}(P)\le {{\,\textrm{surf}\,}}(S)\). Since \(S = S_d\) where \(S_d\subset {\mathbb {R}}^d\) is a regular simplex of unit side-length, we have \({{\,\textrm{vol}\,}}(S_d) = \sqrt{d+1}/(d!\sqrt{2^d})\) and \({{\,\textrm{surf}\,}}(S_d) = (d+1){{\,\textrm{vol}\,}}(S_{d-1}) \approx \sqrt{2}d^2{{\,\textrm{vol}\,}}(S_d)\). Hence, by the choice of \(\rho\), we have \({{\,\textrm{vol}\,}}(Z_1) \le (\varepsilon /2){{\,\textrm{vol}\,}}(S)\). By Lemma 6, we also have \({{\,\textrm{vol}\,}}(Z_2) \le (\varepsilon /2){{\,\textrm{vol}\,}}(S)\). Hence, \({{\,\textrm{vol}\,}}({\tilde{P}}_s\setminus P)\le \varepsilon {{\,\textrm{vol}\,}}(S)\), and the first item follows.

For the second item, by construction \(P \subseteq {\tilde{P}}\). Now given a parameter \(\varepsilon >0\), apply the first part of the theorem with \(\varepsilon ' = {{\,\textrm{vol}\,}}(B_\varepsilon )/(2{{\,\textrm{vol}\,}}(S))\), where \({{\,\textrm{vol}\,}}(B_\varepsilon ) \approx \varepsilon ^d \pi ^{d/2} / (d/2)!\) is the volume of a d-dimensional ball of radius \(\varepsilon\). (A calculation shows that \(\varepsilon ' \ge \varepsilon ^d\), so it suffices to take \(s=(c')^d\ln ^2(1/\varepsilon )\) for an appropriate constant \(c'\).) Suppose for a contradiction that there exists a point \(p\in {\tilde{P}}_s^{(-\varepsilon )}\) that is outside of P. Then the ball \(B=B_\varepsilon (p)\) is contained in \({\tilde{P}}\). But since P is convex, more than half of B is outside of P. Hence, \({{\,\textrm{vol}\,}}({\tilde{P}}_s\setminus P) > {{\,\textrm{vol}\,}}(B)/2 = \varepsilon '{{\,\textrm{vol}\,}}(S)\), contradicting the first part of the theorem. This implies that \(\tilde{P}_s^{(-\varepsilon )}\subseteq P\). Finally, \(P \subset {\tilde{P}}\) implies that \(P^{(-\varepsilon )} \subset {\tilde{P}}^{(-\varepsilon )}\), concluding the second item and the proof of Theorem 3. \(\square\)
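For concreteness, the following sketch evaluates the parameter choices used in the proof of Theorem 3 (\(\rho\), z, k and \(s=1+zkd\)) for a given \(\varepsilon\) and d; it simply reproduces the arithmetic of the proof and is not an optimized bound, and the function name is ours.

```python
import math

def subdivision_depth(eps, d):
    """Number of uniform subdivision stages s used in the proof of Theorem 3."""
    rho = eps / (2.0 * math.sqrt(2.0) * d ** 2)
    # smallest z with (d/(d+1))^z <= rho
    z = math.ceil(math.log(rho) / math.log(d / (d + 1.0)))
    # smallest k with z * (1 - e^{-d})^k <= eps / 2
    k = math.ceil(math.log(eps / (2.0 * z)) / math.log(1.0 - math.exp(-d)))
    return 1 + z * k * d                                 # s = 1 + zkd

print(subdivision_depth(0.1, 2))
```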

4.2 NBCS MSE function approximation

In this section we show how to approximate any concave function \(f(x)\) using a \(B_t\) NBCS. Our objective is to reduce the mean square error between the target function \(f(x)\) and its piece-wise linear approximation \({\tilde{f}}_t(x) = w \cdot \phi _t(x)\):

$$\begin{aligned} {\text {err}}_t(x)&= f(x) - {\tilde{f}}_t(x)\\ L(w,t)&= \int _{\textbf{V}}|| {\text {err}}_t(x)||^2 dV \end{aligned}$$

where \({\textbf{V}}\) is the domain of integration (the volume of the simplex) and \(dV= dx_1\, dx_2 \cdots dx_d\).

For simplicity we will add a split point inside each simplex at stage \(t+1\), while retaining all the same weights from stage t.

Lemma 7

Given a NBCS of rank t with a set of weights that minimizes \(L(w,t)\), for any choice of a new coordinate \(q_{t+1}\) inside simplex \(S'\) there exists a weight \(w_{t+1}\) such that:

$$\begin{aligned} L(w,t+1) \le L(w,t). \end{aligned}$$
(8)

Proof

Let

$$\begin{aligned} \phi _{t}(x)&= (\alpha _0, \ldots , \alpha _t),\\ \phi _{t+1}(x)&= (\beta _0, \ldots , \beta _{t+1}),\\ \phi _{t}(q_{t+1})&= (\gamma _0, \ldots , \gamma _t). \end{aligned}$$

As shown in Sect. 3.2, \(\alpha _i = \beta _i + \beta _{t+1}\gamma _i\), and so it follows that

$$\begin{aligned} L(w,t+1) = & \int _{\textbf{V}} || f(x) - \sum _{i=0}^{t+1} w_i \beta _i(x)||^2\, dV \\ = & \int _{\textbf{V}} || f(x) - \sum _{i=0}^{t} w_i \beta _i(x) - w_{t+1} \beta _{t+1}(x)||^2\, dV \\ = & \int _{\textbf{V}} || f(x) - \sum _{i=0}^{t} w_i \alpha _i(x) + \beta _{t+1}(x)\sum _{i=0}^{t} w_i \gamma _i - w_{t+1} \beta _{t+1}(x)||^2\, dV \\ = & \int _{\textbf{V}} || f(x) - \sum _{i=0}^{t} w_i \alpha _i(x) - \beta _{t+1}(x)\Bigl (w_{t+1} - \sum _{i=0}^{t} w_i \gamma _i \Bigr )||^2\, dV \end{aligned}$$
(9)

The above equation implies the following:

  1. 1.

    If point \(x \not \in S'\) then \(\beta _{t+1}(x) = 0\) and the contribution of \(x\) remains the same as in the previous step. We will therefore only integrate over \(S'\).

  2. 2.

    If we assign the weight \(w_{t+1}=\sum _{i=0}^t w_i \gamma _i\) then \(L(w,t+1)=L(w,t)\).

  3. 3.

    If we assign the weight \(w_{t+1}\) such that:

    $$\begin{aligned} \sum _{i=0}^t w_i \gamma _i< w_{t+1} < \sum _{i=0}^t w_i \gamma _i + 2 \frac{{\text {err}}_t(x)}{\beta _{t+1}(x)} \forall x \in S' \end{aligned}$$
    (10)

    then \(L(w,t+1) < L(w,t)\) and the algorithm strictly reduces the objective function at step \(t+1\).

In order to find the optimal weight \(w_{t+1}\), we will differentiate with respect to \(w_{t+1}\):

$$\begin{aligned} \begin{aligned}&\frac{\partial L}{\partial w_{t+1}}=0 \\&\implies \int _{\textbf{V}} \Bigl ( -2 {\text {err}}_t(x) \beta _{t+1}(x) + 2\beta ^2_{t+1}(x) \bigl (w_{t+1}-\sum _{i=0}^t w_i \gamma _i \bigr ) \Bigr )\, dV =0 \\&\implies w_{t+1}=\sum _{i=0}^t w_i \gamma _i + \frac{\int _{S'} {\text {err}}_t(x)\, \beta _{t+1}(x)\, dV}{\int _{S'} \beta ^2_{t+1}(x)\, dV}. \end{aligned} \end{aligned}$$
(11)

This equation can be interpreted as follows: the new weight is the linear estimate at the point \(q_{t+1}\), since \(\sum _{i=0}^t w_i \gamma _i = {\tilde{f}}(q_{t+1})\), plus a weighted average over the simplex of the deviations from the previous model, where each point x is weighted by the contribution \(\beta _{t+1}(x)\) of the new coordinate \(q_{t+1}\) to its new representation. \(\square\)

Lemma 7 establishes that any new split point added to the hierarchical structure of NBCS can potentially reduce the objective. But it does not establish which candidate split point is the best, nor does it bound the number of times this procedure must be repeated in order to reduce the objective below a given error \(\epsilon\). In order to find an upper bound on the number of splits necessary to achieve a given error \(\epsilon\), and in order to describe a procedure to find the best split points, we need to assume additional properties of \(f(\cdot )\) — specifically, continuity and concavity — and also make the trivial stipulation that we choose \(w_{t+1}= f(q_{t+1})\); that is, the weight of point \(q_{t+1}\) must be the value of the function at that point.
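To complement Lemma 7, the following Monte Carlo sketch estimates the least-squares optimal weight for a candidate split point inside a simplex \(S'\). It samples \(S'\) uniformly via symmetric Dirichlet barycentric coordinates and uses the fact (easily checked) that, for a split point strictly inside \(S'\), the coefficient of the new vertex at x is \(\beta _{t+1}(x)=\min _i \alpha _i(x)/\gamma _i\); all function and variable names are ours.

```python
import numpy as np

def optimal_split_weight(f, verts, w_simplex, q_new, m=20000, seed=0):
    """Monte Carlo estimate of the least-squares optimal weight for a new
    split point q_new strictly inside the simplex S' (cf. Lemma 7).

    verts:     (d+1, d) array of the vertices of S'.
    w_simplex: current weights of those vertices.
    """
    rng = np.random.default_rng(seed)
    verts = np.asarray(verts, float)
    w = np.asarray(w_simplex, float)
    d1 = len(verts)

    Q = np.vstack([verts.T, np.ones(d1)])
    gamma = np.linalg.solve(Q, np.append(q_new, 1.0))    # phi(q_new) w.r.t. S'

    A = rng.dirichlet(np.ones(d1), size=m)               # barycentric samples (uniform on S')
    X = A @ verts
    err = np.array([f(x) for x in X]) - A @ w            # err_t at the samples
    beta_new = (A / gamma).min(axis=1)                    # coefficient of q_new at each sample

    # minimize sum (err - beta_new * (w_new - gamma . w))^2 over w_new
    return gamma @ w + (err @ beta_new) / (beta_new @ beta_new)
```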

Theorem 8

Let \(f: \mathbb {R}^d \rightarrow \mathbb {R}\) be a concave function within the unit cube, and let \(\varepsilon >0\) be given. Then there exists a \(B_t\) NBCS and a corresponding set of weights \(w\) such that the integrated mean square error satisfies \(L(w,t) \le \varepsilon\), where \(t = O(d \ln (1/\varepsilon ))\).

Proof

The proof is constructive. The proposed construction proceeds in stages \(i=0, 1, \ldots , t\). At stage 0 the only points present are the vertices of the original d-dimensional simplex to which all data points are confined. At stage 0 the weight of coordinate \(q_i\) is given by \(w_i = f(q_i)\). Let \(S_0\) be the simplex \(S_0 = \{ (x,y) \mid x \in \mathbb {R}^d, y=w \cdot \phi _0(x) \}\). Likewise let the manifold \(F = \{(x,y) \mid x \in \mathbb {R}^d, y = f(x)\}\). Next, we perform the shearing transformation \(f'(x) = f(x) - w \cdot \phi _0(x)\). Now the function \(f'(x)\) evaluated at the vertices of \(S_0\) is zero, and all subsequent stages will produce concave functions with this property. Since the volume under the curve remains the same under the shearing transformation and the concavity property is unaffected, \(f'(x)\) is a concave function with the same initial error. Next, create the simplex \(S'_0\) by translating \(S_0\) along the y axis until it is tangent to the manifold F at some point \(p'\) (see Fig. 5 for an illustration).

The concavity property implies that every \((x,y) \in S'_0\) satisfies \(y \ge f(x)\). We will say that manifold A is above manifold B if for any given x with \((x,y_1) \in A\) and \((x,y_2) \in B\), we have that \(y_1 \ge y_2\). Project the point \(p'\) back to \(S_0\), and call the projection point p (\(p \in S_0, p' \in S'_0\)). The point p is our choice of the new knot (or split point): p splits \(S_0\) into \(d+1\) simplices, and we let \(S_k\) be the simplex wherein the point p substitutes for the vertex \(q_k\). Likewise, the point \(p'\) splits \(S'_0\) into \(d+1\) simplices, and we let \(S'_k\) be the simplex wherein the point \(p'\) substitutes for the vertex \(q'_k\). Thus, we have constructed \(d+1\) parallel simplices (\(\{S_k,S'_k\}_{k \in [d+1]}\)). If we take the vertices of each pair of simplices, we can construct from them a parallelogram in the \(d+1\) space – we call this parallelogram \(\Pi _k\). Each \(\Pi _k\) contains the pair \(\{S_k,S'_k\}\), and the part of F which is contained in \(\Pi _k\) is termed \(F_k\).

We can now consider each of the \(\Pi _k\) separately. Take the point \(p'\), and construct within each \(\Pi _k\) the hyperplane \(S''_k\) spanning the point set which includes \(p'\) and the d vertices of \(S_k\) excluding p (see Fig. 5 for an illustration). The integrated area under \(S''_k\) is \(\frac{{{\,\textrm{vol}\,}}(\Pi _k)}{d+1}\). We note that due to concavity, \(F_k\) is above \(S''_k\) and under \(S'_k\), and intersects simplex \(S''_k\) at the latter’s vertices. Informally, the integrated error between \(F_k\) and our new linear approximation \(S''_k\) is at most a factor \(\frac{d}{d+1}\) of the integrated error in the previous iteration. This process may be repeated inductively, where \(S''_k\) within \(\Pi _k\) can be translated and now play the role of \(S_0\). More formally, it follows that the error at stage \(t+1\) satisfies \(L(w,t+1) \le \frac{d}{d+1}L(w,t) \le (\frac{d}{d+1})^t L(w,0) = (1-\frac{1}{d+1})^t L(w,0) \le e^{-\frac{t}{d+1}} L(w,0) \le e^{-\frac{t}{d+1}}\), and therefore for \(t \ge (d+1) \ln (1/\varepsilon )\) the error is smaller than any given \(\varepsilon\).

By adding the point p to the coordinate system with the weight f(p), the collection of all \(S''_k\) is exactly the set \((x, \sum w_i \phi _1(x))\), and the shearing transformation \(f'(x) = f(x) - w \cdot \phi _1(x)\) breaks F into \(d+1\) segments where each \(F'_k\) is above \(S''_k\). \(\square\)

We call this the generalized Archimedes integration process; it is reminiscent of Archimedes’ method for computing the area of a parabolic segment by triangulation, except that it generalizes the procedure to any concave function and to multiple dimensions.

Fig. 5: Three steps of the function approximation method. For each step, the left side illustrates the constructed polytope in the origin space, and the right hand side shows the function after applying the transformation

5 Hybrid PAC-compression bounds

In this section, we present preliminary hybrid compression bounds which we shall use in the derivation of learning bounds for classification and regression using naive and adaptive barycentric embeddings. These final learning bounds are found in Theorems 16, 17 and 18 in Sect. 6.

General theory; agnostic case

The novelty of our setting is that we use data-dependent feature embeddings that may be analyzed as a novel type of sample compression scheme: Each sample k-tuple represents some barycentric embedding upon which a classifier may be computed. Hence, the k-tuple encodes not a single function (as assumed classically) but rather a class of possible functions. Our goal in this section is to derive bounds for this novel setting.

It will be convenient to present our bounds for the more general case and then specialize these to our setting. Our notation and terminology (compression scheme, etc.) will be in line with Hanneke and Kontorovich (2019). Let P be a distribution on \(\mathcal {Z}\). We write \(Z_{[n]}=(Z_1,\ldots ,Z_n)\sim P^n\) and for function \(f\in [0,1]^\mathcal {Z}\) we define its true mean

$$\begin{aligned} R(f,P):=\mathop {\mathbb {E}}_{Z\sim P}f(Z), \end{aligned}$$
(12)

and empirical mean

$$\begin{aligned} {\hat{R}}(f,Z_{[n]}):=\frac{1}{n}\sum _{i=1}^nf(Z_i). \end{aligned}$$

We represent the difference between the true and empirical means by

$$\begin{aligned} \Delta _n(f)= \Delta _n(f,P,Z_{[n]}):= |R(f,P)-{\hat{R}}(f,Z_{[n]}) |, \end{aligned}$$

and then our main object of interest is the maximum value of this difference

$$\begin{aligned} {\bar{\Delta }}_n(\mathcal {F}):=\sup _{f\in \mathcal {F}}\Delta _n(f,P,Z_{[n]}), \end{aligned}$$
(13)

for function class \(\mathcal {F}\subset [0,1]^\mathcal {Z}\).

The catch for our setting is that the function class \(\mathcal {F}\) may itself be random, determined by the random variables \(Z_{[n]}\) via some compression scheme. For a fixed \(k\in {\mathbb {N}}\), consider a fixed pre-defined mapping \(\rho :\mathcal {Z}^k\mapsto \mathcal {F}\subseteq [0,1]^\mathcal {Z}\). In words, \(\rho\) maps k-tuples over \(\mathcal {Z}\) into real-valued function classes over \(\mathcal {Z}\). Denote by \(\mathcal {F}_\rho (Z_{[n]})\) the (random) collection of all functions constructable by \(\rho\) via all k-subsets of \(Z_{[n]}\):

$$\begin{aligned} \mathcal {F}_\rho (Z_{[n]}) = \bigcup _{I\in {[n]\atopwithdelims ()k}} \rho (Z_I). \end{aligned}$$
(14)

Here \([n]\atopwithdelims ()k\) denotes the set of all k-subsets of [n], and \(Z_I\) is the restriction of \(Z_{[n]}\) to the index set I.

A trivial application of the union bound yields

$$\begin{aligned} \mathop {\mathbb {P}}\left( {\bar{\Delta }}_n(\mathcal {F}_\rho (Z_{[n]})) \ge \varepsilon \right) \le {n\atopwithdelims ()k}\max _{ I\in {[n]\atopwithdelims ()k}} \mathop {\mathbb {P}}\left( {\bar{\Delta }}_n(\rho (Z_I))\ge \varepsilon \right) . \end{aligned}$$
(15)

The key observation is that, conditioned on \(Z_I\), the function \(\rho (Z_I)\) becomes deterministic and independent of \(Z_J\), where \(J:=[n]\setminus I\). Thus,

$$\begin{aligned} \mathop {\mathbb {P}}\left( {\bar{\Delta }}_n(\rho (Z_I))\ge \varepsilon \right)= & {} \mathop {\mathbb {E}}_{Z_I}\left[ \mathop {\mathbb {P}}\left( {\bar{\Delta }}_n(\rho (Z_I))\ge \varepsilon \, | \,Z_I \right) \right] . \end{aligned}$$
(16)

Conditional on \(Z_I\), we have, for \(f=\rho (Z_I)\),

$$\begin{aligned} \Delta _n(f,P,Z_{[n]})= & {} |R(f,P)-{\hat{R}}(f,Z_{[n]}) |\\= & {} \left| \frac{1}{n}\sum _{i=1}^n\mathop {\mathbb {E}}f(Z)-f(Z_i) \right| \\\le & {} \left| \frac{1}{n}\sum _{i\in J}\mathop {\mathbb {E}}f(Z)-f(Z_i) \right| + \left| \frac{1}{n}\sum _{i\in I}\mathop {\mathbb {E}}f(Z)-f(Z_i) \right| \\\le & {} \frac{1}{n-k}\left| \sum _{i\in J}\mathop {\mathbb {E}}f(Z)-f(Z_i) \right| + \frac{k}{n}\max _{i\in I}|\mathop {\mathbb {E}}f -f(Z_i)| \\= & {} \Delta _{n-k}(f,P,Z_{J}) + \frac{k}{n}\max _{i\in I}|\mathop {\mathbb {E}}f (Z) -f(Z_i)| \\\le & {} \Delta _{n-k}(f,P,Z_{J}) + \frac{k}{n}. \end{aligned}$$

Thus, conditioned on \(Z_I\), we may express \(\Delta _n\) via \(\Delta _{n-k}\) up to a small error term:

$$\begin{aligned} \Delta _n(\rho (Z_I),P,Z_{[n]})\le & {} \Delta _{n-k}(\rho (Z_I),P,Z_{[n]\setminus I}) + \frac{k}{n} \end{aligned}$$

holds with probability 1. The main result of this section now follows from (15) and (16):

Theorem 9

$$\begin{aligned} \mathop {\mathbb {P}}\left( {\bar{\Delta }}_n(\mathcal {F}_\rho (Z_{[n]})) \ge \frac{k}{n}+\varepsilon \right)\le & {} {n\atopwithdelims ()k}\! \mathop {\mathbb {P}}\left( {\bar{\Delta }}_{n-k}(\rho (Z_{[k]}),P,Z_{[n]\setminus [k]}) \ge \varepsilon \right) . \end{aligned}$$

In the sequel, we will derive specific bounds for the cases where \(\mathcal {F}_\rho\) is a VC class or a margin class.

General theory; realizable case

In the realizable case there is a consistent hypothesis, which means that \({\hat{R}}(f,Z_{[n]})=0\) is achievable. In this case our basic object of interest is not the \({\bar{\Delta }}_n\) of (13). Rather, we are interested in the ‘bad event’ \(E_n\) wherein a hypothesis is consistent with the sample (so that its empirical mean is 0), but achieves poor generalization performance on the distribution (so that its true mean is at least \(\varepsilon\)):

$$\begin{aligned} E_n(\mathcal {F},P,\varepsilon ) = \left\{ z\in \mathcal {Z}^n: \exists f\in \mathcal {F}, R(f,P)>\varepsilon , {\hat{R}}(f,z)=0 \right\} . \end{aligned}$$
(17)

Using \(\mathcal {F}_\rho (Z_{[n]})\) as already defined in (14), the union-bound and conditioning analysis performed in (15) and (16) again applies:

$$\begin{aligned} \mathop {\mathbb {P}}(E_n(\mathcal {F}_\rho (Z_{[n]}),P,\varepsilon ))\le & {} {n\atopwithdelims ()k}\max _{ I\in {[n]\atopwithdelims ()k}} \mathop {\mathbb {P}}(E_n(\mathcal {F}_\rho (Z_{I}),P,\varepsilon )) \end{aligned}$$
(18)
$$\begin{aligned}\le & {} {n\atopwithdelims ()k}\max _{ I\in {[n]\atopwithdelims ()k}} \mathop {\mathbb {E}}_{Z_I}[ \mathop {\mathbb {P}}(E_n(\mathcal {F}_\rho (Z_{I}),P,\varepsilon )\, | \,Z_I) ]. \end{aligned}$$
(19)

Of course, for a fixed \(Z_I\) we have that \(\rho\) is fixed as well, and so the probability term \(\mathop {\mathbb {P}}(E_n(\mathcal {F}_\rho (Z_{I}),P,\varepsilon )\, | \,Z_I)\) is nothing but a standard error tail bound for some deterministic function class \(\mathcal {F}^{\tiny {\text {FIX}}}\) which acts upon an iid sample of size \(n-k\) (not including the points of I). Then we have

Theorem 10

If \(\rho\) always maps k-tuples into a function class that is a subset of \(\mathcal {F}^{\tiny {\text {FIX}}}\) and furthermore, for the latter, we have the generalization bound \(\mathop {\mathbb {P}}(E_n(\mathcal {F}^{\tiny {\text {FIX}}},P,\varepsilon )) \le \delta\), where \(\varepsilon =G_n(\mathcal {F}^{\tiny {\text {FIX}}},\delta )\), then

$$\begin{aligned} \mathop {\mathbb {P}}(E_n(\mathcal {F}_\rho (Z_{[n]}),P,\varepsilon ))\le & {} \delta , \end{aligned}$$

where \(\varepsilon =G_{n-k}(\mathcal {F}^{\tiny {\text {FIX}}},\delta /{n\atopwithdelims ()k})\).

We will also derive specific bounds for the cases where \(\mathcal {F}_\rho\) is a VC class or a margin class.

Example: VC classes

We provide more concrete examples of the application of the above abstract framework. In our first example, suppose that \(\rho\) maps k-tuples of \(\mathcal {Z}\) to binary concept classes — which might well be different for each k-tuple — of VC-dimension at most d. More precisely, we take \(\mathcal {Z}=\mathcal {X}\times \left\{ 0,1 \right\}\), where \(\mathcal {X}\) is an instance space. Let \(\mathcal {H}=\mathcal {H}_z\subseteq \left\{ 0,1 \right\} ^\mathcal {X}\) be a concept class defined by the k-tuple \(z\in \mathcal {Z}^k\), with VC-dimension d. Define \(\mathcal {F}\subseteq \left\{ 0,1 \right\} ^\mathcal {Z}\) to be its associated loss class, where we write \(\mathcal {F}^{\tiny {\text {FIX}}}\) to distinguish fixed (i.e., deterministic) vs. random function classes:

$$\begin{aligned} \mathcal {F}^{\tiny {\text {FIX}}}=\left\{ f_h: (x,y)\mapsto \varvec{1}_{\left\{ h(x)\ne y \right\} }; h\in \mathcal {H} \right\} . \end{aligned}$$
(20)

We call this setting a hybrid (k, d) VC sample-compression scheme. The well-known agnostic VC bound (see, e.g., Anthony & Bartlett, 1999, Theorem 4.9) states that

$$\begin{aligned} \mathop {\mathbb {E}}[ {\bar{\Delta }}_{n}(\mathcal {F}^{\tiny {\text {FIX}}})] \le c\sqrt{{d}/{n}}, \end{aligned}$$
(21)

where \(c>0\) is a universal constant. Further, \({\bar{\Delta }}_{n}(\mathcal {F}^{\tiny {\text {FIX}}})\) is known to be concentrated about its mean (see, e.g., Mohri et al, 2012, Theorem 3.1):

$$\begin{aligned} \mathop {\mathbb {P}}\left( {\bar{\Delta }}_{n}(\mathcal {F}^{\tiny {\text {FIX}}}) \ge \mathop {\mathbb {E}}[ {\bar{\Delta }}_{n}(\mathcal {F}^{\tiny {\text {FIX}}})] +\varepsilon \right) \le \exp (-2n\varepsilon ^2). \end{aligned}$$
(22)

Combining Theorem 9 with (21) and (22), we conclude:

Corollary 11

Define the learner’s sample error \(\widehat{{\text {err}}}({\hat{h}}_n):={\hat{R}}(f)\) and generalization error \({\text {err}}({\hat{h}}_n):=R(f)\), where \({\hat{R}},R\) are defined in (12) and \(f(x,y):=\varvec{1}_{\left\{ {\hat{h}}_n(x)\ne y \right\} }\). In a hybrid (k, d) VC sample compression scheme, on a sample of size n, \(\widehat{{\text {err}}}({\hat{h}}_n)\) and \({\text {err}}({\hat{h}}_n)\) satisfy

$$\begin{aligned} {\text {err}}({\hat{h}}_n) \le \widehat{{\text {err}}}({\hat{h}}_n) + c\sqrt{\frac{d}{n-k}} + \sqrt{ \frac{\log [\delta ^{-1}{n\atopwithdelims ()k}]}{2(n-k)} } + \frac{k}{n} \end{aligned}$$

with probability at least \(1-\delta\).
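To illustrate the scale of this bound, the following sketch evaluates its right-hand side numerically; the universal constant c is unspecified by the theory, so the default value below is only a placeholder, and the function name is ours.

```python
import math

def hybrid_vc_bound(emp_err, n, k, d, delta, c=1.0):
    """Numeric sketch of the Corollary 11 generalization bound.

    `c` stands in for the (unspecified) universal constant of the agnostic
    VC bound; c=1.0 is a placeholder for illustration only.
    """
    log_binom = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return (emp_err
            + c * math.sqrt(d / (n - k))
            + math.sqrt((log_binom + math.log(1.0 / delta)) / (2.0 * (n - k)))
            + k / n)

print(hybrid_vc_bound(emp_err=0.05, n=100000, k=50, d=20, delta=0.05))
```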

In the realizable case, it is well-known (see, e.g., Anthony & Bartlett 1999, Theorem 4.8) that \(\mathop {\mathbb {P}}(E_n(\mathcal {F}^{\tiny {\text {FIX}}},P,\varepsilon )) \le \delta\), where \(E_n\) is as in (17) and

$$\begin{aligned} \varepsilon \le \frac{c}{n}\left( d\log \frac{n}{d} +\log \frac{1}{\delta } \right) , \end{aligned}$$

where \(c>0\) is a universal constant. Combining this with Theorem 10 yields

Corollary 12

Define the generalization error \({\text {err}}({\hat{h}}_n):=R(f)\) with \(f(x,y):=\varvec{1}_{\left\{ {\hat{h}}_n(x)\ne y \right\} }\). In a realizable hybrid (k, d) VC sample compression scheme, on a sample of size n,

$$\begin{aligned} {\text {err}}({\hat{h}}_n) \le \frac{c}{n-k}\left( d\log \frac{n-k}{d} +\log \frac{{n\atopwithdelims ()k}}{\delta } \right) \end{aligned}$$

holds with probability at least \(1-\delta\).

Example: margin classes

In our second more concrete example, we take \(\mathcal {X}\) to be an abstract set, \(\mathcal {Y}=\left\{ -1,1 \right\}\), \(\mathcal {Z}=\mathcal {X}\times \mathcal {Y}\), and define the real-valued class

$$\begin{aligned} {\tilde{\mathcal {H}}}=\left\{ h_w:\mathcal {X}\ni x\mapsto w\cdot \Psi (x); \left\| w \right\| \le 1 \right\} , \end{aligned}$$
(23)

where \(\Psi (x)=\Psi _z(x)\) is a map from \(\mathcal {X}\) to \({\mathbb {R}}^N\) determined by some k-tuple \(z\in \mathcal {Z}^k\), with \(\left\| \Psi _z(\cdot ) \right\| \le 1\). Associate to \({\tilde{\mathcal {H}}}\) the \(\gamma\)-margin loss class

$$\begin{aligned} \mathcal {F}_\gamma ^{\tiny {\text {FIX}}}=\left\{ f_h:\mathcal {X}\times \left\{ -1,1 \right\} \ni (x,y)\mapsto \Phi _\gamma (yh(x)); h\in {\tilde{\mathcal {H}}} \right\} , \end{aligned}$$
(24)

where \(\Phi _\gamma (t)=\max (0,\min (1,1-t/\gamma ))\). We refer to this setting as a hybrid \((k,\gamma )\) margin sample compression scheme. The following agnostic margin-based bound is well-known (see, e.g., Mohri et al, 2012, Theorem 4.4):

$$\begin{aligned} \mathop {\mathbb {P}}\left( {\bar{\Delta }}_{n}(\mathcal {F}_\gamma ^{\tiny {\text {FIX}}}) \ge \frac{2}{\gamma \sqrt{n}} +\varepsilon \right) \le \exp (-2n\varepsilon ^2). \end{aligned}$$
(25)

Combining Theorem 9, (25), and a standard stratification argument (see Mohri et al., 2012, Theorem 4.5) yields the following result. Fix a map \(\rho\) assigning to each k-tuple \(z\in \mathcal {Z}^k\) a feature map \(\Psi _z\). Given a sample \(Z_{[n]}=(X_i,Y_i)_{i\in [n]}\) drawn i.i.d., the learner chooses some k examples to define the random mapping \(\Psi _z:\mathcal {X}\rightarrow {\mathbb {R}}^N\); having mapped the sample to \({\mathbb {R}}^N\), the learner obtains some hyperplane w. Then:

Corollary 13

With probability at least \(1-\delta\), we have

$$\begin{aligned} \mathop {\mathbb {E}}_{(X,Y)}[{\text {sgn}}(Yw\cdot \Psi (X))\le 0\, | \,Z_{[n]}] &\le \frac{1}{n}\sum _{i=1}^n \varvec{1}_{\left\{ Y_iw\cdot \Psi (X_i)<\gamma \right\} } + \frac{4}{\gamma \sqrt{n-k}} \\ &\quad + \sqrt{\frac{\log \log _2 \frac{2 }{ \gamma ^2 } }{n-k}} + \sqrt{\frac{\log (2{n\atopwithdelims ()k}/\delta )}{2(n-k)}} +\frac{k}{n}. \end{aligned}$$

In contrast to the agnostic case above, in the realizable case we do not associate to \({\tilde{\mathcal {H}}}\) the \(\gamma\)-margin loss class \(\mathcal {F}_\gamma ^{\tiny {\text {FIX}}}\) of (24); instead, we associate the following ‘bad event’: the event that some hypothesis achieves margin \(\gamma\) on the entire sample yet has true error at least \(\varepsilon\):

$$\begin{aligned} E_n(\varepsilon ,\gamma ) = \left\{ (x,y)\in \mathcal {X}^n\times \mathcal {Y}^n: \exists h\in {\tilde{\mathcal {H}}}, \forall _{i\in [n]} y_i h(x_i)\ge \gamma , \mathop {\mathbb {P}}_{(X,Y)\sim P}(Yh(X)\le 0)\ge \varepsilon \right\} . \end{aligned}$$

It follows from Bartlett & Shawe-Taylor (1999, Theorem 1.5) that for the realizable margin classifier, \(\mathop {\mathbb {P}}(E_n(\varepsilon ,\gamma ))\le \delta\) where

$$\begin{aligned} \varepsilon \le \frac{c}{n}\left( \frac{\log ^2 n}{\gamma ^2}+\log \frac{1}{\delta } \right) , \end{aligned}$$
(26)

and \(c>0\) is a universal constant. Hence Corollary 12 implies

Corollary 14

With probability at least \(1-\delta\), if \(\min _{i\in [n]}Y_iw\cdot \Psi (X_i)\ge \gamma\) then

$$\begin{aligned} \mathop {\mathbb {E}}_{(X,Y)}[{\text {sgn}}(Yw\cdot \Psi (X))\le 0\, | \,Z_{[n]}] \le \frac{c}{n-k}\left( \frac{\log ^2 (n-k)}{\gamma ^2} +\log \frac{ {n\atopwithdelims ()k} }{\delta } \right) . \end{aligned}$$

Example: kernel ridge regression

Kernel ridge regression may be viewed as the real-valued-prediction variant of the margin-based binary classification described in the previous example. As before, \(\mathcal {X}\) is an abstract set, but now \(\mathcal {Y}=[-1,1]\) and \(\mathcal {Z}=\mathcal {X}\times \mathcal {Y}\). The hypothesis class is exactly the \({\tilde{\mathcal {H}}}\) defined in (23) and the loss class is

$$\begin{aligned} \mathcal {F}^{\tiny {\text {FIX}}}=\left\{ f_h:\mathcal {X}\times [-1,1]\ni (x,y)\mapsto (h(x)-y)^2; h\in {\tilde{\mathcal {H}}} \right\} . \end{aligned}$$
(27)

Under the assumption that the feature map \(\Psi (\cdot )\) has bounded norm (i.e., \(\sup _{x\in \mathcal {X}}\left\| \Psi (x) \right\| _2\le 1\), which is the case for our barycentric embedding assuming \(\left\| x \right\| \le 1\)), the following classic generalization bound holds (Mohri et al., 2018, Theorem 11.11):

$$\begin{aligned} {\bar{\Delta }}(\mathcal {F}^{\tiny {\text {FIX}}}) \le \frac{8}{\sqrt{n}} + 4\sqrt{\frac{\log (2/\delta )}{2n}} \end{aligned}$$
(28)

with probability at least \(1-\delta\). Applying Theorem 9 to (28) yields

Corollary 15

With probability at least \(1-\delta\), each \(h\in {\tilde{\mathcal {H}}}\) satisfies

$$\begin{aligned} \mathop {\mathbb {E}}_{(X,Y)}[(h(X)-Y)^2] \le \frac{1}{n}\sum _{i=1}^n (h(X_i)-Y_i)^2 + \frac{8}{\sqrt{n-k}} + 4\sqrt{\frac{\log (2{n\atopwithdelims ()k}/\delta )}{2(n-k)}}. \end{aligned}$$

6 Learning algorithms

In Sect. 4, we demonstrated that the uniform subdivision embedding, coupled with an appropriate choice of weights, can represent an approximation to any given n-vertex polytope. This motivates using the embedding for statistical learning as well.

For some parameter q (determined by cross-validation), our classification algorithm produces a q-stage uniform subdivision: beginning with a single simplex covering the entire space, at each stage we add to the system the barycenter of each simplex, thereby splitting each simplex into \(d+1\) sub-simplices. We call the set of \(d+1\) simplices formed by a split siblings. The procedure stops after q stages, having produced at most \((d+1)^q\) simplices. We choose not to split empty simplices (that is, simplices that contain no point of the input set), so the algorithm may ignore these; not splitting them also helps us avoid creating thin simplices. An empty simplex must then have a sibling that contains points, and since each simplex has d siblings, the total number of simplices is at most \(\min \{(d+1)^q, dnq \}\). Parameter q is analogous to the depth parameter s of Lemma 6; however, we have consistently observed that in practice it suffices to take q to be a small constant, and so in the analysis below we treat q as a constant.
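To make a single subdivision step concrete, the following sketch (a minimal numpy illustration of the construction just described, not our actual implementation) computes barycentric coordinates with respect to a simplex and splits a simplex at its barycenter into its \(d+1\) sibling sub-simplices.

```python
import numpy as np

def barycentric_coords(x, vertices):
    """Barycentric coordinates of x w.r.t. a simplex given by its d+1 vertices (rows).

    Solves sum_i b_i * v_i = x together with sum_i b_i = 1; all b_i are
    non-negative exactly when x lies inside the simplex.
    """
    V = np.asarray(vertices, dtype=float)
    A = np.vstack([V.T, np.ones(len(V))])
    return np.linalg.solve(A, np.append(np.asarray(x, dtype=float), 1.0))

def split_at_barycenter(vertices):
    """Split a simplex into d+1 siblings: sibling i replaces vertex i with the barycenter."""
    V = np.asarray(vertices, dtype=float)
    center = V.mean(axis=0)
    siblings = []
    for i in range(len(V)):
        child = V.copy()
        child[i] = center
        siblings.append(child)
    return siblings

# Tiny usage example in the plane (d = 2): the unit right triangle.
tri = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(barycentric_coords([0.2, 0.3], tri))    # [0.5 0.2 0.3]: non-negative, so the point is inside
for child in split_at_barycenter(tri):        # three sub-simplices sharing the barycenter
    print(child)
```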

Having computed the nested coordinate system, we use it to embed all points into high-dimensional space. To find an appropriate weight assignment \(w\) for the simplex points, we compute a linear classifier on the embedded space to separate the data. A linear classifier takes the form \(h(x) = {\text {sign}}( w \cdot x)\), and this \(w\) serves as our weight vector for the embedding. We note that since \(w\) is computed globally over all points (and not locally, as was done in Sect. 4), we expect it to produce a smoother classifier with no rigid spikes. We use soft SVM as our linear classifier, and note that the training phase can be executed in time O(dn) on \((d+1)\)-sparse vectors (Joachims, 2006); here, the sparsity improves both the space and runtime complexity. The total runtime of the algorithm is bounded by the cost of executing the sparse SVM plus the total number of simplex points, that is,

$$\begin{aligned} O(\min \{d(d+1)^q + dn, d^2qn \}) = \min \{ d^{O(1)} + dn, O(d^2n) \}. \end{aligned}$$

To classify a new point, we simply search top-down for its lowest containing simplex: we begin at the initial simplex, determine which of its \(d+1\) sub-simplices contains the query point, and iterate on that simplex. This can all be done in time \(O(qd^2) = O(d^2)\).
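The following self-contained sketch illustrates the pipeline just described: it builds a q-stage subdivision while leaving empty simplices unsplit, embeds each point into the barycentric coordinates of its lowest containing simplex (a \((d+1)\)-sparse vector), trains a linear SVM on the resulting sparse features, and locates query points top-down. It is an illustrative reimplementation with our own naming and simplifications; in particular, the covering root simplex is supplied by the caller, and scikit-learn's LinearSVC stands in for the sparse SVM solver of Joachims (2006).

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.svm import LinearSVC


def _bary(x, V):
    """Barycentric coordinates of x w.r.t. the (d+1) x d vertex matrix V."""
    A = np.vstack([V.T, np.ones(len(V))])
    return np.linalg.solve(A, np.append(x, 1.0))


class NestedBarycentricEmbedding:
    """Illustrative q-stage uniform subdivision (not our actual implementation)."""

    def __init__(self, root_vertices, q=3):
        self.q = q
        self.vertex_index = {}   # simplex vertex -> feature index
        self.root = self._make_node(np.asarray(root_vertices, dtype=float))

    def _vertex_id(self, v):
        key = tuple(np.round(v, 12))
        if key not in self.vertex_index:
            self.vertex_index[key] = len(self.vertex_index)
        return self.vertex_index[key]

    def _make_node(self, V):
        return {"V": V, "ids": [self._vertex_id(v) for v in V], "children": None}

    def fit(self, X):
        """Build the subdivision, splitting only simplices that contain data points."""
        X = np.asarray(X, dtype=float)
        stack = [(self.root, np.arange(len(X)), 0)]
        while stack:
            node, pts, depth = stack.pop()
            if depth == self.q or len(pts) == 0:
                continue                      # empty simplices are never split
            center = node["V"].mean(axis=0)   # split at the barycenter
            node["children"] = []
            for i in range(len(node["V"])):
                child_V = node["V"].copy()
                child_V[i] = center
                node["children"].append(self._make_node(child_V))
            for child in node["children"]:    # route each point to a containing child
                if len(pts) == 0:
                    stack.append((child, pts, depth + 1))
                    continue
                inside = np.array([np.all(_bary(X[p], child["V"]) >= -1e-9) for p in pts])
                stack.append((child, pts[inside], depth + 1))
                pts = pts[~inside]
        return self

    def _leaf(self, x):
        """Top-down point location: descend into the child simplex containing x."""
        node = self.root
        while node["children"] is not None:
            for child in node["children"]:
                if np.all(_bary(x, child["V"]) >= -1e-9):
                    node = child
                    break
            else:
                break                         # numerical fallback: stop the descent
        return node

    def transform(self, X):
        """(d+1)-sparse embedding: barycentric coordinates in the lowest containing simplex."""
        rows, cols, vals = [], [], []
        for r, x in enumerate(np.asarray(X, dtype=float)):
            leaf = self._leaf(x)
            b = _bary(x, leaf["V"])
            rows.extend([r] * len(b))
            cols.extend(leaf["ids"])
            vals.extend(b)
        return csr_matrix((vals, (rows, cols)), shape=(len(X), len(self.vertex_index)))


# Usage sketch: a linear threshold on points lying inside a covering simplex.
rng = np.random.default_rng(0)
root = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
X = rng.uniform(0.1, 1.0, size=(200, 2))
y = (X[:, 0] + 2 * X[:, 1] > 1.5).astype(int)
emb = NestedBarycentricEmbedding(root, q=3).fit(X)
clf = LinearSVC(C=1.0).fit(emb.transform(X), y)
print("train accuracy:", clf.score(emb.transform(X), y))
```

Having bounded the runtime, we now turn to bounding the out-of-sample error: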

Theorem 16

If our classifier achieves sample error \({\hat{R}}\) with margin \(\gamma\) (i.e., \({\hat{R}}\) is the fraction of the points whose margin is less than \(\gamma\)) on a sample of size n after stopping at stage q, its generalization error R is bounded by

$$\begin{aligned} {\hat{R}}+O(1/(\gamma \sqrt{n})+\sqrt{\log (q/\delta )/n}) \end{aligned}$$
(29)

with probability at least \(1-\delta\). In the realizable case (\({\hat{R}}=0\)), we have

$$\begin{aligned} R \le O(n^{-1}(\log ^2(n)/\gamma ^2+\log (q/\delta ))) \end{aligned}$$
(30)

with probability at least \(1-\delta\).

Proof

The agnostic bound in (29) follows from (25), coupled with a union bound over the q stages. The realizable bound in (30) follows from (26), coupled with a union bound over the q stages. \(\square\)

This bound is a consequence of the SVM margin bound (Mohri et al., 2012, Theorem 4.5) and the stratification technique (Shawe-Taylor et al., 1998), where the q-th stage receives weight \(1/2^q\).

Adaptive splitting strategies

The above algorithm is data-independent in its selection of split points. It is reasonable to expect that a data-dependent choice of split points can improve the performance of the learning algorithm, and may also be less prone to creating thin simplices. Several greedy strategies suggest themselves; after empirical trials, we suggest the following split heuristic: at every stage, a linear classifier is computed in the embedded space. For each simplex, we identify the points in the simplex that have been misclassified so far, and choose the data point closest to the barycentric center of these misclassified points. As before, it is not necessary to subdivide an empty simplex, or one that contains few misclassified points. (See Sect. 8 for empirical results.)
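In code, the split rule for a single simplex is brief. The sketch below is illustrative only: the threshold min_errors is our own knob, and the choice is made among all data points lying in the simplex.

```python
import numpy as np

def adaptive_split_point(points, y_true, y_pred, min_errors=1):
    """Pick a split point for one simplex, following the heuristic above (illustrative sketch).

    points : (m, d) array of the data points lying in the simplex
    y_true, y_pred : length-m label arrays for these points
    Returns the chosen data point, or None if the simplex contains fewer than
    `min_errors` misclassified points and is therefore not worth splitting.
    """
    points = np.asarray(points, dtype=float)
    wrong = np.asarray(y_true) != np.asarray(y_pred)
    if wrong.sum() < min_errors:
        return None
    center = points[wrong].mean(axis=0)          # centroid of the misclassified points
    dists = np.linalg.norm(points - center, axis=1)
    return points[np.argmin(dists)]              # data point closest to that centroid

# Usage sketch: four points in the plane, two of them misclassified.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
print(adaptive_split_point(pts, y_true=[1, 1, 0, 0], y_pred=[1, 0, 1, 0]))   # -> [0.5 0.5]
```

For this adaptive variant, we obtain the following generalization bounds: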

Theorem 17

If our adaptive classifier achieves sample error \({\hat{R}}\) with margin \(\gamma\) (i.e., \({\hat{R}}\) is the fraction of the points whose margin is less than \(\gamma\)) on a sample of size n after stopping at stage q and retaining k split points, its generalization error R is bounded by

$$\begin{aligned} {\hat{R}}+O(1/(\gamma \sqrt{n-k})+\sqrt{ k\log (qn/\delta )/(n-k) }) \end{aligned}$$
(31)

with probability at least \(1-\delta\). In the realizable case (\({\hat{R}}=0\)), we have

$$\begin{aligned} R \le O((n-k)^{-1}(\log ^2(n-k)/\gamma ^2+ k\log (qn/\delta )/(n-k) )) \end{aligned}$$
(32)

with probability at least \(1-\delta\).

Proof

The agnostic bound in (31) follows from Corollary 13, coupled with a union bound over the q stages. The realizable bound in (32) follows from Corollary 14, coupled with a union bound over the q stages. \(\square\)

We note that in the above algorithms, computations done on individual simplices are easily parallelizable.

NBCS Regression

Having shown in Sect. 4 how to use NBCS to approximate convex functions, we can now apply this tool to regression. Consider a regression problem with \(y_i =f(x_i)+ \epsilon _i\), where \(y_i \in \mathbb {R}\), \(x_i \in \mathbb {R}^d\) and \(\epsilon _i \sim N(0,\sigma ^2)\) (\(\sigma \le 1\)) is an error term. Theorem 8 implies that the NBCS embedding reduces the mean square error at every step. We take our target error to be \(\mathop {\mathbb {E}}[\epsilon _i^2] = O(\sigma ^2)\), and so by the theorem the number of steps can be bounded by \(O(\log \frac{1}{\sigma })\). In order to find an appropriate weight assignment \(w\) for the simplex points, we compute a linear regressor on the embedded space, where the regressor is of the form \(h(x) = w \cdot x + b\). Following the algorithm for concave function approximation, after each step we compute the residual \(f'(x) = f(x) - \sum _i w_i \phi _i(x)\) and take, within each new simplex, the data point which maximizes the value of \(f'(x)\). If \(f'(x)\) evaluated at the knot is less than the predefined constant parameter \(\epsilon\), then there is nothing to be gained by further splitting this simplex. This causes the algorithm to make more splits in the ‘curved’ parts of the manifold than in the nearly linear ones.
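The knot-selection rule for a single simplex can be sketched as follows (illustrative only; eps plays the role of the constant \(\epsilon\) above, and the residuals are assumed to have been computed with the current regressor).

```python
import numpy as np

def choose_regression_knot(residuals, points, eps=0.05):
    """Pick the next knot for one simplex from the current residuals (illustrative sketch).

    residuals : length-m array of f'(x_i), the gap between f(x_i) and the current fit,
                for the data points lying in this simplex
    points    : (m, d) array of those data points
    Returns the knot, or None if the largest residual is below eps, in which case
    further splitting of this simplex gains nothing.
    """
    residuals = np.asarray(residuals, dtype=float)
    if len(residuals) == 0 or residuals.max() < eps:
        return None
    return np.asarray(points, dtype=float)[np.argmax(residuals)]

# Usage sketch with made-up residuals for three 1-D points.
pts = np.array([[0.1], [0.4], [0.7]])
print(choose_regression_knot([0.01, 0.2, 0.08], pts, eps=0.05))   # -> [0.4]
```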

The following theorem provides a generalization bound for our regression algorithm.

Theorem 18

If our (non-adaptive) regressor achieves sample risk \({\hat{R}}\) with \(\left\| w \right\| \le 1\) on a sample of size n after stopping at stage q, its true error R is bounded by

$$\begin{aligned} {\hat{R}} + O\left( \sqrt{\frac{\log (q/\delta )}{n}} \right) \end{aligned}$$
(33)

with probability at least \(1-\delta\). The adaptive version of our regressor satisfies

$$\begin{aligned} R\le {\hat{R}} + O\left( \sqrt{ k\log (qn/\delta )/(n-k) } \right) \end{aligned}$$
(34)

with probability at least \(1-\delta\).

Proof

The non-adaptive bound in (33) follows from (28), coupled with a union bound over the q stages. The adaptive bound in (34) follows from Corollary 15, coupled with a union bound over the q stages. \(\square\)

7 Extension to multiclass and prioritized separators

In this section, we show how our results for polyhedral separability and classification can be directly extended to two new models: the multiclass case, wherein there are multiple labels amongst the points, and the prioritized case, wherein one class may be more important than another, so that false negatives for that class incur a greater penalty in the target optimization function.

The multiclass case is familiar and intuitive, and does not require additional motivation. Some popular approaches to the multiclass problem reduce it to the binary problem using the one-vs-one or one-vs-all paradigms; however, we prefer a solution which operates on the master problem directly, as presented below. The prioritized case can be used to model scenarios in which a false negative is significantly more severe than a false positive, for example in medical applications, where a false negative amounts to overlooking a dangerous illness. An additional use of this approach is to deal with imbalanced classes: one may increase the out-of-sample error margin of a class in order to compensate for subsampling in that class.

In approaching either of these problems, we may utilize the same barycentric embedding presented earlier in Sect. 3.1, using either the standard splitting strategy or the more advanced adaptive strategy of Sect. 6. The central challenge is to compute in the target space an SVM variant which handles multiple classes, prioritization between the classes, or both. To address this problem, we adopt the cost-sensitive multiclass SVM variant presented in Gottlieb et al. (2021a), known as the apportioned margin approach. We first briefly summarize this variant, and then explain how to integrate it into our barycentric embedding.

Technical background

In reviewing the technique of Gottlieb et al. (2021a), let us first consider the prioritized binary case. In the realizable binary case, the classic SVM searches for a solution vector \({w}\) defining a linear separator with margin depending on \(\left\Vert {w}\right\Vert\). At the two edges of this margin lie the respective hyperplanes \({w} \cdot {x} \pm 1 = 0\). Now suppose the positive and negative sets are equipped with respective priorities \(\theta _+,\theta _-\). In this case, the decision boundary is shifted in the direction of the positive points by a ratio of \(\frac{\theta _+}{\theta _-}\); this ratio corresponds to the Bayesian probabilities of the sets, as shown there. The respective support hyperplanes can then be formulated as \(w \cdot x = \theta _+\) and \(-w \cdot x = \theta _-\). Importantly, this formulation extends to multiclass categorization, producing for each class i, equipped with priority \(\theta _i\), a vector \({w}_i\) representing a hyperplane \(w_i \cdot x = \theta _i\). For each pair of classes i, j, the decision rule between them is determined by the hyperplane \(\left( \frac{w_i}{\theta _i} - \frac{w_j}{\theta _j} \right) \cdot x = 0\). See Fig. 6.
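Given the trained per-class vectors \(w_i\) and priorities \(\theta _i\), the pairwise rule above amounts to comparing the scaled scores \(w_i \cdot x/\theta _i\), so the overall prediction is the class maximizing that score. The sketch below applies this decision rule only; it assumes the \(w_i\) have already been produced by the cost-sensitive training of Gottlieb et al. (2021a), which it does not perform, and the weight vectors in the usage example are made up.

```python
import numpy as np

def apportioned_margin_predict(X, W, theta):
    """Decision rule of the apportioned-margin classifier (illustrative sketch).

    X     : (n, N) embedded points
    W     : (num_classes, N) one weight vector w_i per class
    theta : (num_classes,) class priorities theta_i
    Class i beats class j exactly when (w_i/theta_i - w_j/theta_j) . x > 0,
    so the overall prediction is argmax_i (w_i . x) / theta_i.
    """
    scores = (np.asarray(X) @ np.asarray(W).T) / np.asarray(theta)
    return scores.argmax(axis=1)

# Usage sketch with made-up weight vectors for two classes in a 2-D feature space.
W = np.array([[1.0, 0.0], [0.0, 1.0]])
x = np.array([[0.45, 0.55]])
print(apportioned_margin_predict(x, W, theta=[1.0, 1.0]))   # -> [1]
print(apportioned_margin_predict(x, W, theta=[1.0, 2.0]))   # -> [0]
```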

Fig. 6: Support hyperplanes and their bisector

Use for barycentric embedding

To address the multiclass problem, we first create a barycentric embedding as above, choosing the split points either using the standard oblivious strategy, or using a splitting strategy adaptive to the underlying point set. In the embedded space, we compute the apportioned margin multiclass SVM variant of Gottlieb et al. (2021a), taking all priorities \(\theta _i\) equal to 1. The procedure produces a separating hyperplane for each class pair, and the intersection of these gives for each class a polyhedron separating it from all other classes. Now recall that a hyperplane in the embedded space corresponds to a polyhedron in the origin space. Hence, the separating polyhedron for a given class in the embedded space (which is itself an intersection of hyperplanes in the embedded space) corresponds to an intersection of multiple distinct polyhedra in the origin space. The intersection of these polyhedra is itself a polyhedron, and so the procedure produces for each class a separating polyhedron in the origin space (Fig. 7).

The above is easily extended to the prioritized case by taking the input priorities \(\theta _i\) and using these to construct the SVM variant in the embedded space.

8 Experiments

Our theory has important applications to polyhedron approximation, classification and regression. We first present results on synthetic data, in Sect. 8.1. Synthetic data allows us to guarantee polyhedral separability, meaning that the two labels are separated by a polyhedron with some margin. For regression, our synthetic datasets similarly guarantee that all points lie within some small distance of a target polyhedron.

We then consider real-world data in Sect. 8.2. For these datasets, the actual separator (if one exists) is typically more complex and not necessarily linear. It may lack a defined margin and be further characterized by noise and outliers. Nevertheless, we find that our method consistently produces useful piecewise linear separators for this data.

8.1 Synthetic datasets: polyhedron approximation, regression and multiclass classification

In this section we present experiments on synthetic datasets.

Polyhedron approximation

Let us first present a simple example that illustrates the power of our approach in approximating polyhedra. We constructed a random dataset in the plane wherein all positive examples were drawn from inside a convex 5-gon and all negative examples from outside it, with margin 0.05.
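Data of this kind can be generated along the following lines (a sketch with our own choice of pentagon and sampling domain): a candidate point is kept as positive if it lies at least the margin inside every edge's supporting line, kept as negative if it lies at least the margin outside some supporting line, and rejected otherwise.

```python
import numpy as np

def pentagon_dataset(n=500, margin=0.05, seed=0):
    """Labelled sample for polyhedron approximation (illustrative; parameters are ours)."""
    rng = np.random.default_rng(seed)
    # A convex pentagon (counter-clockwise vertices) inside the unit square.
    P = np.array([[0.5, 0.9], [0.15, 0.6], [0.3, 0.15], [0.7, 0.15], [0.85, 0.6]])
    normals = []
    for a, b in zip(P, np.roll(P, -1, axis=0)):   # inward unit normal of each edge
        t = b - a
        n_in = np.array([-t[1], t[0]]) / np.linalg.norm(t)
        normals.append((a, n_in))
    X, y = [], []
    while len(X) < n:
        x = rng.uniform(0.0, 1.0, size=2)
        s = np.array([np.dot(n_in, x - a) for a, n_in in normals])   # signed dists to edge lines
        if np.all(s >= margin):
            X.append(x); y.append(1)              # at least `margin` inside the pentagon
        elif np.any(s <= -margin):
            X.append(x); y.append(-1)             # at least `margin` outside the pentagon
        # otherwise: the point falls in the margin band and is rejected
    return np.array(X), np.array(y)

X, y = pentagon_dataset()
print(X.shape, np.bincount((y + 1) // 2))         # sample size and class counts
```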

Figure 8 shows the iterative boundary formation, where the bold black line is the decision boundary and the dotted lines are the margin (\(w \cdot \phi _t(x) = \pm 1\)). For each iteration, the nested barycentric system is illustrated by the red lines. Across multiple runs, a consistent approximation of the underlying polyhedron was achieved after only 3 iterations. Notice how the margins become smaller at each iteration until reaching their predetermined size. We find that the method does not tend to overfit, and also that the constructed polyhedron is non-convex, although there exists a convex polyhedron consistent with the points.

Regression

We also present a regression simulation executed for three different functions: (i) a piecewise linear function consisting of three lines, with noise; (ii) a piecewise linear function consisting of four lines, with noise; (iii) a function consisting of two lines with a cosine-shaped segment in the middle.

As in Birke and Dette (2007), we ran simulations with uniformly distributed design points for the explanatory variables, and added normal noise with standard deviation \(\sigma = 0.05\) to the response variable. Figure 9 shows the result of \(\{1,2,4\}\) steps along with the \(\log _{10}\)-normalized MSE for all steps from 1 to 6. Notice how the algorithm places the knots at the vertices of the underlying piecewise linear function and does not introduce additional knots within the linear parts of the edges. For the hybrid linear-cosine function, the algorithm quickly found the lines and the vertices. The algorithm did not overfit by adding more knots to the linear parts, but rather introduced evenly spaced knots along the cosine segment. Stipulating a minimum inter-point distance serves to prevent the creation of simplices with very thin volume (spikes), and also acts as a regularizer to prevent overfitting.
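For example, the first of these target functions can be simulated along the following lines (the knot locations and values are our own, chosen only for illustration).

```python
import numpy as np

def piecewise_linear_sample(n=300, sigma=0.05, seed=0):
    """Noisy observations of a 3-piece linear function (illustrative parameters)."""
    rng = np.random.default_rng(seed)
    x = np.sort(rng.uniform(0.0, 1.0, size=n))     # uniformly distributed design points
    knots_x = np.array([0.0, 0.35, 0.7, 1.0])      # two interior knots -> three linear pieces
    knots_y = np.array([0.0, 0.8, 0.3, 0.9])
    y = np.interp(x, knots_x, knots_y) + rng.normal(0.0, sigma, size=n)
    return x.reshape(-1, 1), y

X, y = piecewise_linear_sample()
print(X.shape, y.shape)
```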

Cost-sensitive multiclass classification

For this experiment, we created a dataset by placing points of four different classes near the respective corners of a unit square. We then constructed a barycentric decomposition on top of this dataset, and computed the apportioned margin multiclass SVM variant of Gottlieb et al. (2021a) on top of this partition. We ran two experiments: in the first, all classes were assigned equal priority, while in the second, one of the four classes was given a cost twice that of the other classes.

The results of this experiment are illustrated in Fig. 7. In the left figure, all classes have equal costs, and so the constructed polyhedra have similar margins. In the right figure, the blue class was assigned the greater cost, and so its corresponding polyhedron occupies a larger area.

Fig. 7: An example of prioritized multiclass classification. Left: all classes have equal costs. Right: the blue class is assigned priority twice as large as the others

Fig. 8: Learning a polyhedron separating the triangle and circle points. Note that by the third step the polyhedron is consistent with the data, so that no change in the embedding is effected by the fourth step

Fig. 9: Function approximation over a piecewise linear function with noise and the overall log error. Top: a piecewise linear function consisting of 3 lines with noise. Middle: a piecewise linear function consisting of 4 lines with noise. Bottom: a cosine function

8.2 Real-world datasets: classification and regression

Our embedding technique is motivated by provable bounds for convex polyhedra, but we find that it is sufficiently robust to yield strong empirical results for classification and regression on real-world data. We compared our method against other methods on several widely used benchmark datasets of varying size and dimension.

From the UCI repository: Wine, Iris, Cardiotocography (Cardio) with 3 or 10 labels, Image Segmentation (Segment), Blood Transfusion Service (Transf), Steel Plates Faults (Steel), Diabetes, Statlog (Landsat Satellite), Statlog (Shuttle), Ionosphere (Ion), Libras Movement (Libras), Connectionist Bench (Sonar), Covertype and Skin Segmentation (SkinNonSkin). From the LIBSVM repository we used, for regression: Bodyfat, Cadata, Cpusmall, Eunite2001, Housing, Triazines, Prim, Space_ga; we also used Cod-rna from LIBSVM for classification.

We first considered relatively small datasets; these are described in Table 1. For these, the explicit feature-map methods discussed in Sect. 2 are computationally feasible, and may even run faster and with greater precision than standard kernel methods. For these experiments we focused on accuracy, since all these methods have very similar run-time complexity. Figure 10 compares the average accuracy of our embedding technique (adapt-NBCS), Kitchen Sinks (KS) (Rahimi & Recht, 2007), the Nyström approximation (Williams & Seeger, 2000) and the adaptive \(\chi ^2\) embedding (Vedaldi & Zisserman, 2012), all of which have open-source implementations within the scikit-learn library. For classification we also included the accuracy of random forests (RF), a technique which tied for first place in the comparisons made by Delgado et al. (2014); the reported results for the RF comparison were taken directly from Delgado et al. (2014). Random forests are a natural comparison to our method since, much like our method, they create piecewise linear decision boundaries, but without a maximal margin. Our algorithm’s accuracy compared favorably with the others.

We also ran classification experiments on very large datasets. For these, techniques based on explicit feature maps may be computationally infeasible due to memory and runtime constraints. In these experiments we compared our methods against advanced kernel methods which were specifically designed for large datasets. We experimentally confirmed that our runtimes on large datasets are competitive with specially designed algorithms such as Core-SVM, without significant differences in accuracy (Table 2).

For regression tests, we compared our method to other embedding and convex regression methods on real-world datasets varying in size and dimension. The datasets and experimental results are presented in Table 3, which compares the squared correlation coefficient (\(R^2\)) of our embedding technique (NBCS) against the aforementioned embedding and kernel techniques; in all cases, linear SVR was applied in the embedded space. The \(R^2\) coefficient was 5-fold cross-validated, and the table reports the average \(R^2\) value along with the standard deviation among the tests. Our results compare favorably with the baselines provided by the other kernel SVM techniques.

Our experiments utilized the Python scikit-learn library (Pedregosa et al., 2011). The SVC/SVR regularization parameter C was 5-fold cross-validated over the set \(\{2^{-5}, 2^{-3}, \ldots , 2^{15}\}\), and for the RBF kernel, the \(\gamma\) parameter was 5-fold cross-validated over the set \(\{2^{-15}, 2^{-13},\ldots , 2^{3}\}\). For the polynomial kernel, we cross-validated the polynomial exponent d over the set \(\{2, \ldots ,10\}\). For our methods, the maximum iteration parameter q was cross-validated over the set \(\{2, \ldots ,\log _d(n)\}\). Our algorithms usually converged before reaching the maximum number of allowed iterations, and in fact we have never observed more than 5 iterations. Our results demonstrate that NBCS compares favorably to the other methods across all datasets.
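For reference, the grid search for the RBF kernel baseline corresponds roughly to the following scikit-learn configuration (a sketch: the real-world datasets and the NBCS embedding itself are omitted, and a small synthetic sample is included only so that the snippet runs on its own).

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Hyperparameter grids as described above.
param_grid = {
    "C": [2.0 ** e for e in range(-5, 16, 2)],        # 2^-5, 2^-3, ..., 2^15
    "gamma": [2.0 ** e for e in range(-15, 4, 2)],    # 2^-15, 2^-13, ..., 2^3
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)

# Small synthetic sample so the snippet runs on its own.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] ** 2 > 0.5).astype(int)
search.fit(X, y)
print(search.best_params_)
```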

Table 1 Small datasets
Fig. 10: Classification results for different embedding techniques

Table 2 Classification results for large datasets
Table 3 Regression results for different embedding techniques

9 Discussion and future work

In this paper, we introduced the nested barycentric coordinate system embedding, and showed its applicability to problems such as finding consistent polyhedra, classification, function approximation and regression. We derived a statistical foundation for this approach, and presented experiments demonstrating promising empirical results. Future work includes analytical and empirical investigations of other natural splitting strategies.