Abstract
We present a dataadaptive multivariate histogram estimator of an unknown density f based on n independent samples from it. Such histograms are based on binary trees called regular pavings (RPs). RPs represent a computationally convenient class of simple functions that remain closed under addition and scalar multiplication. Unlike other density estimation methods, including various regularization and Bayesian methods based on the likelihood, the minimum distance estimate (MDE) is guaranteed to be within an \(L_1\) distance bound from f for a given n, no matter what the underlying f happens to be, and is thus said to have universal performance guarantees (Devroye and Lugosi, Combinatorial methods in density estimation. Springer, New York, 2001). Using a form of tree matrix arithmetic with RPs, we obtain the first generic constructions of an MDE, prove that it has universal performance guarantees and demonstrate its performance with simulated and realworld data. Our main contribution is a constructive implementation of an MDE histogram that can handle large multivariate data bursts using a treebased partition that is computationally conducive to subsequent statistical operations.
1 Introduction
Suppose our random variable X has an unknown density f on \(\mathbb {R}^d\), then for all Borel sets \(A \subseteq \mathbb {R}^d\),
Any density estimate \(f_n(x) := f_n(x;X_1,X_2,\ldots ,X_n) : \mathbb {R}^d \times \left( \mathbb {R}^d \right) ^n \rightarrow \mathbb {R}\) is a map from \(\left( \mathbb {R}^d\right) ^{n+1}\) to \(\mathbb {R}\). The objective in density estimation is to estimate the unknown f from an independent and identically distributed (IID) sample \(X_1,X_2,\ldots ,X_n\) drawn from f. Density estimation is often the first step in many learning tasks, including anomaly detection, classification, regression and clustering.
The quality of \(f_n\) is naturally measured by how well it performs the assigned task of computing the probabilities of sets under the total variation criterion:
where \({\mathcal {B}}^d\) are Borel sets in \(\mathbb {R}^d\). The last equality above is due to Scheffé’s identity and this equates the \(L_1\) distance between \(f_n\) and f, in the absolute scale of [0, 1], to the total variation distance between them.
A nonparametric density estimator is said to have universal performance guarantees when the underlying f is allowed to be any density in \(L_1\) (Devroye and Lugosi 2001, p. 1). Histograms and kernel density estimators can approximate f in this universal sense in an asymptotic setting, i.e., as the number of data points n approaches infinity (the socalled asymptotic consistency of the estimator \(f_n\)). But for a fixed n, however large but finite, classical studies of the rate of convergence of \(f_n\) to f require additional assumptions on the smoothness class (to solve this socalled smoothing problem), such as \(f \in L_2 \ne L_1\) or \(f \in C^k\), the set of k times differentiable functions, as opposed to letting f simply belong to the set where densities exist, i.e., \(f \in L_1\), and thereby violate the universality property.
Universal performance guarantee is provided by the minimum distance estimate (MDE) due to Devroye and Lugosi (2001, 2004). Their fundamentally combinatorial approach combined ideas from Yatracos (1985, 1988) on minimum distance methods and from Vapnik and Chervonenkis (1971) on uniform convergence of empirical probability measures over classes of sets. See Devroye and Lugosi (2001) for a selfcontained introduction to combinatorial methods in density estimation. Unlike the likelihood based methods, MDE gives universal performance guarantees, i.e., MDE does not assume that f is in \(L_2\) in order to address the smoothing problem for the given sample of size n, by directly minimizing the \(L_1\) distance over the socalled Yatracos class—a certain class of subsets of the support set that are induced by the partitions of each ordered pair of histograms in the set of histograms from which one has to choose the optimally smoothed histogram (Devroye and Lugosi 2001).
The Yatracos class is not trivial to represent for the purposes of concretely obtaining the MDE in a nonparametric multivariate setting involving large sample sizes. The particular class of MDEs studied in Devroye and Lugosi (2001, 2004) were limited to kernel estimates and histograms under simpler partitioning rules. Inspired by this, here we develop an MDE over statistical regular pavings using treebased partitioning strategies to produce a much more general nonparametric MDE that has (1) dataadaptive partitions (2) in d dimensions with (3) partitioning structures imbued with arithmetic for downstream statistical operations. Briefly, our approach exploits a recursive arithmetic using nodes imbued with recursively computable statistics and a specialized collator structure to compute the supremal deviation of the heldout empirical measure over the Yatracos class of the candidate densities.
Unlike other treebased partitions, our regular paving structure restricts partitioning by only bisecting a box along its first widest coordinate to make the countable set of such trees closed under addition and scalar multiplication and thereby allowing for computationally efficient computer arithmetic over a dense set of simple functions. See Harlow et al. (2012) for statistical applications of this arithmetic, including conditional density regression and multivariate tail probability computations for anomaly detection. Although a more efficient algorithm (up to preprocessing the \(L_1\) distances for each pair of densities) is characterized in Mahalanabis and Stefankovic (2008), we are not aware of any publicly available implementations of the MDE using dataadaptive multivariate histograms for bursts of data common in many industrial applications today, especially for downstream statistical operations with the density estimate, including anomaly detection (with \(n \approxeq 10^7\) in dimensions up to 6 for instance in a nondistributed computational setting over one commodity machine).
To the best of our knowledge, the accompanying code of this paper in mrs2 Sainudiin et al. (2008–2019) is the only publicly available implementation of such an MDE estimator. Our main contribution in this work is a rigorous implementation of the minimum distance estimate proposed by Devroye and Lugosi (2001) for the nonparametric multivariate setting that can handle large bursts of data. The estimator has been successfully used in industryscale problems where one needs to construct a multivariate density estimate in a “batch” setting and use this estimate for producing anomaly scores.
In the next two sections, we give the definitions, algorithms, theorems and proofs needed for our minimum distance estimator. Three core algorithms are given in the Appendix for completeness. We finally conclude after evaluating the performance of the estimator on simulated and realworld datasets.
2 Regular pavings and histograms
Let \({\varvec{x}}:=[\underline{x},\overline{x}]\) be a compact real interval with lower bound \(\underline{x}\) and upper bound \(\overline{x}\), where \(\underline{x} \le \overline{x}\). Let the space of such intervals be \(\mathbb {I}\mathbb {R}\). The width of an interval \({\varvec{x}}\) is \({\mathrm{wid\,}}({\varvec{x}}) := \overline{x}\underline{x}\). The midpoint is \({\mathrm{mid\,}}({\varvec{x}}) := \left( \underline{x}+\overline{x}\right) /2\). A box of dimension d with coordinates in \(\varDelta := \{1,2,\ldots ,d\}\) is an interval vector with \(\iota\) as the first coordinate of maximum width:
The set of all such boxes is \(\mathbb {I}\mathbb {R}^d\), i.e., the set of all interval real vectors in dimension d. A bisection or split of \({\varvec{x}}\) perpendicularly at the midpoint along this first widest coordinate \(\iota\) gives the left and right child boxes of \({\varvec{x}}\):
Such a bisection is said to be regular. Note that this bisection gives the left child box a halfopen interval \([\underline{x}_{\iota },{\mathrm{mid\,}}({\varvec{x}}_{\iota }))\) on coordinate \(\iota\) so that the intersection of the left and right child boxes is empty. A recursive sequence of selective regular bisections of boxes, with possibly open boundaries, along the first widest coordinate, starting from the root box \({\varvec{x}}_{\rho }\) in \(\mathbb {I}\mathbb {R}^d\) is known as a regular paving (Kieffer et al. 2001) or ntree (Samet 1990) of \({\varvec{x}}_{\rho }\). A regular paving of \({\varvec{x}}_{\rho }\) can also be seen as a binary tree formed by recursively bisecting the box \({\varvec{x}}_{\rho }\) at the root node. Each node in the binary tree has either no children or two children. These trees are known as plane binary trees in enumerative combinatorics (Stanley 1999, Ex. 6.19(d), p. 220) and as finite, rooted binary trees (frbtrees) in geometric group theory (Meier 2008, Chap. 10). The relationship of trees, labels and partitions is illustrated in Fig. 1 via a sequence of bisections of a square (2dimensional) root box by always bisecting on the first widest coordinate.
Let \({\mathbb {N}}:=\{1,2,\ldots \}\) be the set of natural numbers. Let the jth interval of a box \({\varvec{x}}_{{{\rho \mathsf {v}}}}\) be \([\underline{x}_{{{\rho \mathsf {v}}},j}, \overline{x}_{{{\rho \mathsf {v}}},j}]\), the volume of a ddimensional box \({\varvec{x}}_{{{\rho \mathsf {v}}}}\) be \({\mathrm{vol\,}}({\varvec{x}}_{{{\rho \mathsf {v}}}}) = \prod _{j=1}^d (\overline{x}_{{{\rho \mathsf {v}}},j}  \underline{x}_{{{\rho \mathsf {v}}},j})\). Let the set of all nodes, leaf nodes and internal nodes (or splits) of a regular paving \(s\) be \(\mathbb {V}(s) := \rho \cup \{ \rho \{{{\mathsf {L}}}, {{\mathsf {R}}} \}^j : j \in \mathbb {N}\}\), \(\mathbb {L}(s)\) and \(\breve{\mathbb {V}}(s) :=\mathbb {V}(s)\setminus \mathbb {L}(s)\), respectively. The set of leaf boxes of a regular paving \(s\) with root box \({\varvec{x}}_{\rho }\) is denoted by \({\varvec{x}}_{\mathbb {L}(s)}\) and it specifies a partition of the root box \({\varvec{x}}_{\rho }\). Let \(\mathbb {S}_k\) be the set of all regular pavings with root box \({\varvec{x}}_{\rho }\) made of k splits. Note that the number of leaf nodes \(m= \mathbb {L}(s) =k+1\) if \(s\in \mathbb {S}_k\). The number of distinct binary trees with k splits is equal to the Catalan number \(C_k\):
For \(i,j\in \mathbb {Z}_+\), where \(\mathbb {Z}_+ := \{0,1,2,\ldots \}\) and \(i \le j\), let \(\mathbb {S}_{i:j}:=\cup _{k=i}^j \mathbb {S}_{k}\) be the set of regular pavings with k splits where \(k \in \{i,i+1,\ldots ,j\}\). Let the set of all regular pavings be \(\mathbb {S}_{0:\infty } := \lim _{j \rightarrow \infty } \mathbb {S}_{0:j}\).
A statistical regular paving (SRP) denoted by \(s\) is an extension of the RP structure that is able to act as a partitioned ‘container’ and responsive summarizer for multivariate data. An SRP can be used to create a histogram of a data set. A recursively computable statistic (Fisher 1925; Gray and Moore 2003) that an SRP node \({{\rho \mathsf {v}}}\) caches is \(\#{\varvec{x}}_{{{\rho \mathsf {v}}}}\), the count of the number of data points that fell into \({\varvec{x}}_{{{\rho \mathsf {v}}}}\). A leaf node \({{\rho \mathsf {v}}}\) with \(\#{\varvec{x}}_{{{\rho \mathsf {v}}}} > 0\) is a nonempty leaf node. The set of nonempty leaves of an SRP \(s\) is \(\mathbb {L}^{+}(s) := \{{{\rho \mathsf {v}}}\in \mathbb {L}(s) : \#{\varvec{x}}_{{{\rho \mathsf {v}}}} > 0\} \subseteq \mathbb {L}(s)\).
Figure 2 depicts a small SRP \(s\) with root box \({\varvec{x}}_{\rho } \in \mathbb {I}\mathbb {R}^2\). The number of sample data points in the root box \({\varvec{x}}_{\rho }\) is 10. Figure 2a shows the tree, including the count associated with each node in the tree and the partition of the root box represented by the leaf boxes of this tree, with the sample data points superimposed on the boxes. Figure 2b shows how the density estimate is computed from the count and the volume of leaf boxes to obtain the density estimate \(f_{n,s}\) as an SRP histogram.
An SRP histogram is obtained from n data points that fell into \({\varvec{x}}_{\rho }\) of SRP s as follows:
It is the maximum likelihood estimator over the class of simple (piecewiseconstant) functions given the partition \({\varvec{x}}_{\mathbb {L}(s)}\) of the root box of \(s\). We suppress subscripting the histogram by the SRP s for notational convenience. SRP histograms have some similarities to dyadic histograms [for eg. Klemelä (2009, chap. 18), Lu et al. (2013)]. Both are binary treebased and partition so that a box may only be bisected at the midpoint of one of its coordinates, but the RP structure restricts partitioning further by only bisecting a box on its first widest coordinate in order to make \(\mathbb {S}_{0:\infty }\) closed under addition and scalar multiplication and thereby allowing for computationally efficient computer arithmetic over a dense set of simple functions [see Harlow et al. (2012) for statistical applications of this arithmetic]. Crucially, when data bursts have large sample sizes, this restrictive partitioning does not affect the \(L_1\) errors when compared to a computationally more expensive Bayes estimator (see Sect. 4).
A statistically equivalent block (SEB) partition of a sample space is some partitioning scheme that results in equal numbers of data points in each element (block) of the partition (Tukey 1947). The output of \(\texttt {SEBTreeMC}(s, \overline{\#}, \overline{m})\) of Algorithm 1 is \([s(0),s(1),\ldots ,s(T)]\), a sequence of SRP states visited by a sample path of the Markov chain \(\{S(t)\}_{t \in \mathbb {Z}_+}\) on \(\mathbb {S}_{0:\overline{m}1}\), such that, \(\mathbb {L}^{\bigtriangledown }(s(T)) = \emptyset\), or \(\#({{\rho \mathsf {v}}}) \le \overline{\#}\) \(\forall {{\rho \mathsf {v}}}\in \mathbb {L}^{\bigtriangledown }(s(T))\), or \(\mathbb {L}(s(T)) = \overline{m}\) and T is a corresponding random stopping time. As the initial state \(S(t=0)\) is the root \(s\in \mathbb {S}_0\), the Markov chain \(\{S(t)\}_{t \in \mathbb {Z}_+}\) on \(\mathbb {S}_{0:\overline{m}1}\) satisfies \(S(t) \in \mathbb {S}_{t}\) for each \(t \in \mathbb {Z}_+\), i.e., the state at time t has \(t+1\) leaves or t splits. The operation may only be considered to be successful if \(\mathbb {L}(s) \le \overline{m}\) and \(\#{\varvec{x}}_{{{\rho \mathsf {v}}}} \le \overline{\#}\) \(\forall {{\rho \mathsf {v}}}\in \mathbb {L}^{\bigtriangledown }(s)\). Therefore, the sequence of SRP histogram states visited by \(\texttt {SEBTreeMC}\) that successfully terminates at stopping time T will have the terminal histogram with at most \(\overline{\#}\) many of the n data points in each of its leaf nodes and with at most \(\overline{m}\) many leaf nodes.
Intuitively, \(\texttt {SEBTreeMC}(s, \overline{\#}, \overline{m})\) prioritizes the splitting of leaf nodes with the largest numbers of data points associated with them. As we will see in Theorem 1, the \(L_1\) consistency of \(\texttt {SEBTreeMC}\) requires that \(\overline{m}\) must grow sublinearly (i.e., \(\overline{m}/n \rightarrow 0\) as \(n \rightarrow \infty\)) while the volume of leaf boxes shrink such that a combinatorial complexity measure of the partitions in the support of the \(\texttt {SEBTreeMC}\) grows subexponentially. Figure 3 shows two different SRP histograms constructed using two different values of \(\overline{\#}\) for the same dataset of \(n=10^5\) points simulated under the standard bivariate Gaussian density. A small \(\overline{\#}\) produces a histogram that is undersmoothed with unnecessary spikes (Fig. 3 left), while the other histogram with a larger \(\overline{\#}\) is oversmoothed (Fig. 3 right). We will obtain the minimum distance estimate from the SRP histograms visited by the \(\texttt {SEBTreeMC}\) in Theorem 3.
3 Minimum distance estimation using statistical regular pavings
We show that the SRP density estimate from the \(\texttt {SEBTreeMC}\)based partitioning scheme is asymptotically \(L_1\)consistent as \(n\rightarrow \infty\) provided that \(\overline{\#}\), the maximum sample size in any leaf box in the partition, and \(\overline{m}\), the maximum number of leaf boxes in the partition grow with the sample size n at appropriate rates. This is done by proving the three conditions in Theorem 1 of Lugosi and Nobel (1996). We will need to show that as the number of sample points increases linearly, the following conditions are met:

1.
the number of leaf boxes grows sublinearly;

2.
the partition grows subexponentially in terms of a combinatorial complexity measure;

3.
and the volume of the leaf boxes in the partition is shrinking.
Let \(\{S_{n}(i)\}_{i =0}^{\dot{I}}\) on \(\mathbb {S}_{0:\infty }\) be the Markov chain of algorithm \(\texttt {SEBTreeMC}\). The Markov chain terminates at some state \({\dot{s}}\) with partition \(\mathbb {L}({\dot{s}})\). Associated with the Markov chain is a fixed collection of partitions:
and the size of the largest partition \(\mathbb {L}({\dot{s}})\) in \({\mathcal {L}}_{n}\) is given by
such that \({\mathcal {L}}_n \subseteq \{\mathbb {L}(s): s\in \mathbb {S}_{0:\overline{m}  1} \}\).
Given n fixed points \(\{x_{1}, \ldots , x_{n}\} \in \left( \mathbb {R}^d\right) ^n\). Let \(\varPi \left( {\mathcal {L}}_{n}, \{x_{1}, \ldots , x_{n}\}\right)\) be the number of distinct partitions of the finite set \(\{x_{1}, \ldots , x_{n}\}\) that are induced by partitions \(\mathbb {L}(\dot{s}) \in {\mathcal {L}}_{n}\):
For any fixed set of n points, the growth function of \({\mathcal {L}}_{n}\) is then
Let \(A \subseteq \mathbb {R}^d\). Then, the diameter of A is the maximum Euclidean distance between any two points of A, i.e., \(\mathrm{diam}(A) := \sup _{x,y\in A} \sqrt{\sum _{i=1}^d (x_iy_i)^2}\). Thus, for a box \({\varvec{x}}= [\underline{x}_{1}, \overline{x}_{1}] \times \cdots \times [\underline{x}_{d}, \overline{x}_{d}]\), \(\mathrm{diam}({\varvec{x}}) = \sqrt{\sum _{i=1}^d(\overline{x}_i\underline{x}_i)^2}\).
Theorem 1
(\(L_{1}\)Consistency) Let \(X_{1}, X_{2}, \ldots\) be independent and identical random vectors in \({\mathbb {R}}^{d}\) whose common distribution \(\mu\) has a nonatomic density f, i.e., \(\mu \ll \lambda\). Let \(\{S_{n}(i)\}_{i =0}^{\dot{I}}\) on \(\mathbb {S}_{0:\infty }\) be the Markov chain formed using \(\texttt {SEBTreeMC}\) (Algorithm 1) with terminal state \(\dot{s}\) and histogram estimate \(f_{n,{\dot{s}}}\) over the collection of partitions \({\mathcal {L}}_n\). As \(n \rightarrow \infty\), if \(\overline{\#}\rightarrow \infty,\) \(\overline{\#}/n \rightarrow 0,\) \(\overline{m}\ge n/\overline{\#},\) and \(\overline{m}/n \rightarrow 0,\) then the density estimate \(f_{n,{\dot{s}}}\) is asymptotically consistent in \(L_{1},\) i.e.,
Proof
We will assume that \(\overline{\#}\rightarrow \infty\), \(\overline{\#}/n \rightarrow 0\), \(\overline{m}\ge n/\overline{\#}\), and \(\overline{m}/n \rightarrow 0\), as \(n \rightarrow \infty\), and show that the three conditions:
are satisfied. Then, by Theorem 1 of Lugosi and Nobel (1996) our density estimate \(f_{n,{\dot{s}}}\) is asymptotically consistent in \(L_{1}\).
Condition (a) is satisfied by the assumption that \(\overline{m}/n \rightarrow 0\) since \(m({\mathcal {L}}_n) \le \overline{m}\).
The largest number of distinct partitions of any n point subset of \({\mathbb {R}}^{d}\) that are induced by the partitions in \({\mathcal {L}}_{n}\) is upper bounded by the size of the collection of partitions \({\mathcal {L}}_{n} \subseteq \mathbb {S}_{0:\overline{m}1}\), i.e.,
where k is the number of splits.
The growth function is thus bounded by the total number of partitions with 0 to \(\overline{m}1\) splits, i.e., the \((\overline{m}1)\)th partial sum of the Catalan numbers. The partial sum can be asymptotically equivalent to (Mattarei 2010):
Taking logs and dividing by n on both sides of the above two equations, and using the assumption that \(\overline{m}/n \rightarrow 0\) as \(n \rightarrow \infty\), we can see that condition (b) is satisfied
We now prove the final condition. Fix \(\gamma , \xi > 0\). There exists a box \(\check{{\varvec{x}}} = [M, M]^{d}\) for a large enough M, such that \(\mu (\check{{\varvec{x}}}^{c}) < \xi\), where \(\check{{\varvec{x}}}^{c} := \mathbb {R}^d \setminus [M,M]^d\). Consequently
Using \(2^{di}\) hypercubes of equal volume \((2M)^d/2^{di}, i = \left\lceil \log _2 \left( 2M\sqrt{d}/\gamma \right) \right\rceil\) with side length \(2M/2^i\) and diameter \(\sqrt{d \left( \frac{2M}{2^i}\right) ^2}\), we can have at most \(m_{\gamma }< 2^{di}\) boxes in \(\check{{\varvec{x}}}\) that have diameter greater than \(\gamma\). By choosing i large enough, we can upper bound \(m_{\gamma }\) by \((2M\sqrt{d}/\gamma )^d\), a quantity that is independent of n, such that
The first term in the parenthesis converges to zero since \(\overline{\#}/n \rightarrow 0\) by assumption. For \(\epsilon > 0\) and \(n>4d\), the second term goes to zero by applying the Vapnik–Chervonenkis (VC) theorem to boxes in \(\mathbb {I}\mathbb {R}^d\) with VC dimension 2d and shatter coefficient \(S(\mathbb {IR}^d, n) \le (e n/2d)^{2d}\) (Devroye et al. 1996, Thms. 12.5, 13.3 and p. 220), i.e.,
For any \(\epsilon >0\) and finite d, the righthand side of the above inequality can be made arbitrarily small for n large enough. This convergence in probability is equivalent to the following almost sure convergence by the bounded difference inequality:
Thus, for any \(\gamma , \xi > 0\),
Therefore, condition (c) is satisfied and this completes the proof. \(\square\)
Let \(\varTheta\) index a set of finitely many density estimates: \(\{f_{n, \theta }: \theta \in \varTheta \}\), such that \(\int f_{n,\theta } = 1\) for each \(\theta \in \varTheta\). We can index the SRP trees by \(\{s_{\theta } : \theta \in \varTheta \}\), where \(\theta\) is the sequence of leaf node depths that uniquely identifies the SRP tree, and denote the density estimate corresponding to \(s_{\theta }\) by \(f_{n,s_{\theta }}\) or simply by \(f_{n,\theta }\). Now, consider the asymptotically consistent path taken by the Markov chain of \(\texttt {SEBTreeMC}\). For a fixed sample size n, let \(\{ s_{\theta } : \theta \in \varTheta \}\) be an ordered subset of states visited by the Markov chain, with \(s_{\theta } \prec s_{\vartheta }\) if \(s_{\vartheta }\) is a refinement of \(s_{\theta }\), i.e., if \(s_{\theta }\) is visited before \(s_{\vartheta }\). The goal is to select the optimal estimate from \(\varTheta \) many candidates.
When our candidate set of densities are additive like the histograms, we can use the holdout method proposed by Devroye and Lugosi (2001, Sec. 10.1) for minimum distance estimation as follows. Let \(0< \varphi < 1/2\). Given n data points, use \(n\varphi n\) points as the training set and the remaining \(\varphi n\) points as the validation set (by \(\varphi n\) we mean \(\lfloor \varphi n \rfloor\)). Denote the set of training data by \({\mathcal {T}} := \{x_1, \ldots , x_{n  \varphi n}\}\) and the set of validation data by \({\mathcal {V}} := \{x_{n  \varphi n + 1}, \ldots , x_{n}\} = \{y_1, \ldots , y_{\varphi n}\}\). For an ordered pair \((\theta , \vartheta ) \in \varTheta ^2\), with \(\theta \ne \vartheta\), the set
is known as a Scheffé set. The Yatracos class (Yatracos 1985) is the collection of all such Scheffé sets over \(\varTheta\):
Let \(\mu _{\varphi n}\) be the empirical measure of the validation set \({\mathcal {V}}\). Then, the minimum distance estimate or MDE \(f_{n\varphi n,\theta ^*}\) is the density estimate \(f_{n\varphi n, \theta }\) constructed from the training set \({\mathcal {T}}\) with the smallest index \(\theta ^*\) that minimizes:
Thus, the MDE \(f_{n\varphi n,\theta ^*}\) minimizes the supremal absolute deviation from the heldout empirical measure \(\mu _{\varphi n}\) over the Yatracos class \({\mathcal {A}}_\varTheta\).
The SRP is adapted for MDE to mutably cache the counts for training and validation data separately and the \(n\varphi n\) training data points in \({\mathcal {T}}\) and the \(\varphi n\) validation data points in \({\mathcal {V}}\) are accessible from any leaf node \(\rho v\) of the SRP via pointers to \(x_i \in {\mathcal {T}}\) and \(y_i \in {\mathcal {V}}\), respectively. The training data drive the Markov chain \(\texttt {SEBTreeMC}(s, \overline{\#}, \overline{m})\) to produce a sequence of SRP states: \(s_{\theta _1}, s_{\theta _2}, \ldots\) that are further selected to build the candidate set of adaptive histogram density estimates given by \(\{f_{n\varphi n,\theta _i}: \theta _i \in \varTheta \}\). For each \(\theta _i \in \varTheta\), the validation data are allowed to flow through \(s_{\theta _i}\) and drop into the leaf boxes of \(s_{\theta _i}\). A graphical representation of an SRP with training counter \(\#{\varvec{x}}_{\rho v}\) and validation counter \(\check{\#}{\varvec{x}}_{\rho v}\) is shown in Fig. 4. Computing the MDE objective \(\varDelta _{\theta _i}\) in (3) requires the histogram estimate \(f_{n\varphi n}(\rho v) = {\#{\varvec{x}}_{\rho v}}/{n \lambda ({\varvec{x}}_{\rho v})}\) and the empirical measure of the validation data \(\mu _{\varphi n}({\varvec{x}}_{\rho v}) = \check{\#}{\varvec{x}}_{\rho v}/\varphi n\) at any node \(\rho v\). These can be readily obtained from \(\#{\varvec{x}}_{\rho v}\) and \(\check{\#}{\varvec{x}}_{\rho v}\).
Our approach to obtaining the MDE \(f_{n\varphi n,\theta ^*}\) with optimal SRP \(s_{\theta ^*}\) exploits the partition refinement order in \(\{ s_{\theta } : \theta \in \varTheta \}\), a subset of states along the path taken by the \(\texttt {SEBTreeMC}\). Using nodes imbued with recursively computable statistics for both training and validation data, and a specialized collation according to SRPCollate (Algorithm 3) over SRPs, we compute the objective \(\varDelta _{\theta }\) in (3) using GetDelta (Algorithm 2) via a dynamically grown Yatracos Matrix with pointers to all Scheffé sets constituting the Yatracos class according to GetYatracos (Algorithm 4). We briefly outline the core ideas in these three algorithms next [see Appendix for their pseudocode and mrs2 Sainudiin et al. (2008–2019) for details].
In the MDE procedure, pairwise comparisons of the heights of the candidate density estimates \(f_{n  \varphi n, \theta }\) and \(f_{n  \varphi n, \vartheta }\) are needed to get the Scheffé sets that make up the Yatracos class. An efficient way to approach this is to collate the SRPs corresponding to the density estimates onto a collator regular paving (CRP) where the space of CRP trees is also \(\mathbb {S}_{0:\infty }\). Consider now two SRPs \(s_{\theta }\) and \(s_{\vartheta }\) for which the corresponding histogram estimates \(f_{n, \theta }\) and \(f_{n, \vartheta }\) are computed. Both SRPs \(s_{\theta }\) and \(s_{\vartheta }\) have the same root box \({\varvec{x}}_{\rho }\). By collating the two SRPs, we get a CRP c with the same root box and the tree obtained from a union of \(s_{\theta }\) and \(s_{\vartheta }\). Unlike the union operation over RPs (Harlow et al. 2012, Algorithm 1), each node \(\rho v\) of the SRP collator c stores \(f_{n, {\theta }}\) and \(f_{n, {\vartheta }}\) as a vector \({\varvec{f}}_{n, c}(\rho v) := (f_{n, \theta }(\rho v), f_{n, \vartheta }(\rho v))\). The empirical measure of the validation data \(\mu _{\varphi n}({\varvec{x}}_{\rho v})\) will also be stored at each node \(\rho v\) and can be easily accessed via pointers. Figure 5 shows how CRP c can collate two SRPs \(s_{\theta }\) and \(s_{\vartheta }\) using SRPCollate.
We now use Theorem 10.1 of Devroye and Lugosi (2001, p. 99) and Theorem 6.6 of Devroye and Lugosi (2001, p. 54) to obtain the \(L_1\)error bound of the minimum distance estimate \(f_{n\varphi n,\theta ^*}\), with \(\theta ^*\in \varTheta\) and \(\varTheta  < \infty\).
Theorem 2
If \(\int f_{n\varphi n, \theta } = 1\) for all \(\theta \in \varTheta,\) then for the minimum distance estimate \(f_{n\varphi n, \theta ^*}\) obtained by minimizing \(\varDelta _{\theta }\) in (3),we have
where
Theorem 2 can be proved directly by a conditional application of Theorem 6.3 of Devroye and Lugosi (2001, p. 54) and is nothing but the finite \(\varTheta\) version of their Theorem 10.1 (Devroye and Lugosi 2001, p. 99) without the additional 3 / n term due to \(\varTheta <\infty\).
When f is unknown and \(2^n > {\mathcal {A}}_{\varTheta }\), \(\varDelta\) may be approximated using the cardinality bound (Devroye et al. 1996, Theorem 13.6, p. 219) for the shatter coefficient of \({\mathcal {A}}_{\varTheta }\). Given \(\{x_1, \ldots , x_n\}\) the nth shatter coefficient of \({\mathcal {A}}_{\varTheta }\) is defined as
Since \({\mathcal {A}}_{\varTheta }\) is finite, containing at most quadratically many Scheffé sets \(A_{\theta ,\vartheta }\) with distinct ordered pairs \((\theta ,\vartheta ) \in \varTheta ^2\) given by the nondiagonal elements of the Yatracos matrix returned by GetYatracos , by Theorem 13.6 of Devroye et al. (1996, p. 219) its nth shatter coefficient is bounded as follows:
Finally, given that adaptive multivariate histograms based on statistical regular pavings in \({\mathbb {S}}_{0:\infty }\) form a class of regular additive density estimates, we can slightly modify Theorem 10.3 of Devroye and Lugosi (2001, p. 103) for the case with finite \(\varTheta\) to get the following error bound that further accounts for splitting the data.
Theorem 3
Let \(0< \varphi < 1/2\) and \(n < \infty.\) Let the finite set \(\varTheta\) determine a class of adaptive multivariate histograms based on statistical regular pavings with \(\int f_{n\varphi n, \theta } = 1\) for all \(\theta \in \varTheta.\) Let \(f_{n, \theta ^*}\) be the minimum distance estimate. Then for all n, \(\varphi n,\) \(\varTheta\) and \(f \in L_1:\)
Proof
By Theorem 2,
Taking expectations on both sides and using Theorem 10.2 in Devroye and Lugosi (2001, p. 99)
Finally, by Theorem 3.1 in Devroye and Lugosi (2001, p. 18) and (6),
\(\square\)
4 Performance evaluation
4.1 Practical minimum distance estimation
To effectively use the error bound, we need to ensure that \(\varTheta \) is not too large and the densities in \(\varTheta\) are close to the true density f. Next, we highlight the effectiveness and limitations of our MDE.
The size of \(\varTheta\) is kept small (typically less than 100) and independent of n by an adaptive search. Note that \(\varTheta \) is upperbounded by \(\overline{m}\) if we were to exhaustively consider each SRP state along the entire path of the \(\texttt {SEBTreeMC}\) in \(\varTheta\), our set of candidate SRP partitions. Such an exhaustive approach is computationally inefficient as the Yatracos matrix that updates the Scheffé sets grows quadratically with \(\varTheta \). We take a simple adaptive search approach by considering only k (typically \(10 \le k \le 20\)) SRP states in each iteration. In the initial iteration, we add k states to \(\varTheta\) by picking uniformly spaced states from a longenough \(\texttt {SEBTreeMC}\) path that starts from the root node and ends at a state with a large number of leaves and a significantly higher \(\varDelta _{\theta }\) score than its preceding states. Then, we simply zoomin around the states with the lowest \(\varDelta _{\theta }\) values and add another k states along the same \(\texttt {SEBTreeMC}\) path close to such optimal states from the first iteration. We repeat this adaptive search process until we are unable to zoomin further. Typically, we are able to find nearly optimal states within 5 or fewer iterations. By Theorem 1, we know that the histogram partitioning strategy of \(\texttt {SEBTreeMC}\) is asymptotically consistent. Thus, the adaptive search set \(\varTheta\) that is selected iteratively from the set of histogram states along the path of \(\texttt {SEBTreeMC}\) with optimal \(\varDelta _{\theta }\) values will naturally contain densities that approach f as n increases. However, the rate at which the \(L_1\) distance between the best density in \(\varTheta\) and f approach 0 will depend on the complexity of f in terms of the number of leaves needed to uniformly approximate f using simple functions with SRP partitions, a class that is dense in \({\mathcal {C}}({\varvec{x}}_{\rho },\mathbb {R})\), the algebra of realvalued continuous functions over the root box \({\varvec{x}}_{\rho }\) by the Stone–Weierstrass Theorem (Harlow et al. 2012, Theorem 4.1). This dependence on the structural complexity of f is evaluated next.
4.2 Simulations
To evaluate the performance of our MDE we first choose the unstructured multivariate uniform density. Although the dimension d of the uniform density on \([0,1]^d\) ranges in \(\{1,10,100,1000\}\), the true density is given by the root box, the first candidate density indexed by \(\varTheta\). Based on the mean integrated absolute errors (MIAE) shown in Table 1 for each d and n in \(\{10^2, 10^3, 10^4, 10^5, 10^6, 10^7\}\), there is a dimensionfree performance by the MDE for such a target density that immediately belongs to the set of candidate densities indexed by \(\varTheta\). The sample mean of the integrated absolute errors was taken over five replicate simulations with standard error less than half of the MIAE values. When the sample size is \(10^7\) and dimension is 1000, the data cannot be represented in a machine with 32GB of memory (as indicated by the ‘–’ entry in Table 1).
We independently verified the inequalities in Eqs. 4 and 7 of Theorems 2 and 3, respectively, by explicitly computing the left and right handsides of the inequalities for MDEs and checking that they are indeed satisfied for the simulated datasets in the previous subsection. This verification is in examples/StatsSubPav/MinimumDistanceEstimation/MDETest of the mrs2 module companions/mrs1.0YatracosThis/. These results are shown for multivariate uniform densities in Fig. 6. The bounds are not sharp although they do decrease with the sample size. This is because they are extremely general by construction and based merely on the cardinality of the set of candidate densities.
To evaluate the performance of our MDE, we also chose two structured multivariate densities: the spherically symmetric Gaussian with a simple concentrated structure and the highly structured Rosenbrock density [whose expression up to normalization is given in (8)] in d dimensions for various sample sizes:
The sample standard deviations about the mean integrated absolute errors or MIAEs for the MDE method, i.e., \(L_1(f_{n,\theta ^*},f)\) (shown in the top panel of Table 2), based on ten trials, are below \(10^{3}\) and \(10^{4}\) for values of n in \(\{10^4,10^5\}\) and \(\{10^6,10^7\}\), respectively. Thus, these standard errors are not shown. However, the \(L_1\) distance between the MDE and the best estimate in the candidate set \(\varTheta\), \(L_1(f_{n,\theta ^*},f)\min _{\theta \in \varTheta } L_1(f_{n,\theta },f)\), is shown in Table 2 for each density and sample size. For comparison, as shown in the bottom panel of Table 2, we used the Bayes estimator from the posterior mean histograms (Sainudiin et al. 2013, see for details on this evaluation). Note how the \(L_1\) errors decrease with the sample size and how the errors are comparable between the methods, albeit the MDE method is at least an order of magnitude faster than the posterior mean histogram that does not provide universal performance guarantees like most density estimators.
Due to the use of spacepartitioning regularly paved trees, our MDE histograms cannot provide small \(L_1\) errors for highly structured densities beyond 10 or so dimensions on the basis of sample sizes in the order of millions. The reason is simply due to the large \(L_1\) distance between the best candidate density in \(\varTheta\) based on a reasonable maximal number of splits. However, using modern dimensionality reduction techniques including autoencoders we can often project high dimensional data nonlinearly to a lower dimensional space and use the MDE histograms to construct a density estimate and do further statistical processing as we show below.
All experiments were performed on the same physical machine that is currently considered to be commodity hardware (Sainudiin et al. 2013, for machine specifications and CPU times).
4.3 Detecting bot flows using MDE tail probabilities
We apply the MDE histogram on the realworld scenario 11 of the CTU13 dataset of botnet traffic on a computer network (Garcia et al. 2014). The dataset captured 8164 real botnet traffic flows mixed with 99087 normal and background traffic flows. These flows were augmented into 80 dimensions using Word2Vec embeddings of the flows (datapoints) and reduced to 8 dimensions by training a deep autoencoder with a bottleneck layer of eight nodes by Ramström (2019), who gives domainspecific details on how the raw data was augmented and fitted to an autoencoder. For the purpose of our application, it suffices to note that a deep autoencoder was trained on appropriately augmented normal flows in order to nonlinearly reduce the dimensions from 80 to 8. We then used the \(n=99087\) samples in 8 dimensions to obtain the MDE histogram. For each data point x from the normal as well as the botnet flow, we computed its multivariate tail probability (Harlow et al. 2012, Algorithm 9). Briefly, this is given by 1 minus the sum of the probability mass of all leaf nodes in the MDE histogram whose density (“height”) is larger than that of the leaf node whose box contains x. The tail probability can be directly used as a score of how unlikely an event is under the density estimate constructed with the normal flows. We obtain these tail probabilities for all 107251 flows (mixed with 7.6% botnet flows) and sort them by their tail probabilities. Our histogram estimate was able to identify 87% and 99.1% of the botnet flows, i.e., 7115 and 8090 out of 8164 botnet flows were within the lowest 7.6% and 10% of the tail probabilities, respectively. Thus, using the tail probabilities of the MDE histogram estimated from the normal flows was extremely effective in identifying the botnet flows.
5 Conclusion and future directions
Thus, using the collator regular paving (CRP), we obtain the minimum distance estimate (MDE) with universal performance guarantees. All the methods are implemented and available in mrs2 Sainudiin et al. (2008–2019), including the downstream statistical operations for anomaly detection and conditional density regression (Harlow et al. 2012). We limited our minimum distance estimate (MDE) to the candidate set given by the SRP histograms visited along the path of the Markov chain \(\texttt {SEBTreeMC}\). This was done to take advantage of the structure of consecutive refinements of the tree partitions along a single path of \(\texttt {SEBTreeMC}\).
However, obtaining the MDE from an arbitrary set of SRP histograms taken from \(\mathbb {S}_{0:\infty }\) will need more sophisticated collators. Initial experiments using the Scheffé tournament approach (as opposed to the MDE) to find the best estimate in a candidate set of arbitrary SRP histograms (not just those along a path in \(\mathbb {S}_{0:\infty }\)) look feasible. Such a Scheffé tournament will allow us to compare estimates from entirely different methodological schools (Bayesian, penalized likelihood, etc.). Finally, the pure tree structure allows one to possibly extend this MDE to a distributed faulttolerant computational setting such as Zaharia et al. (2016) as the sample size becomes too large for the memory of a single machine.
References
Devroye, L., Györfi, L., & Lugosi, G. (1996). A probabilistic theory of pattern recognition. New York: SpringerVerlag.
Devroye, L., & Lugosi, G. (2001). Combinatorial methods in density estimation. New York: SpringerVerlag.
Devroye, L., & Lugosi, G. (2004). Bin width selection in multivariate histograms by the combinatorial method. TEST, 13(1), 129–145.
Fisher, R. A. (1925). Theory of statistical estimation. Mathematical Proceedings of the Cambridge Philosophical Society, 22, 700–725.
Garcia, S., Grill, M., Stiborek, H., & Zunino, A. (2014). An empirical comparison of botnet detection methods. Computers and Security Journal, 45, 100–123.
Gray, A. G., & Moore, A. W. (2003). Nonparametric density estimation: Towards computational tractability. In SIAM international conference on data mining (pp. 203–211). San Francisco, California, USA: SIAM.
Harlow, J., Sainudiin, R., & Tucker, W. (2012). Mapped regular pavings. Reliable Computing, 16, 252–282.
Kieffer, M., Jaulin, L., Braems, I., & Walter, E. (2001). Guaranteed set computation with subpavings. In W. Kraemer & J. Gudenberg (Eds.), Scientific computing, validated numerics, interval methods, proceedings of SCAN 2000 (pp. 167–178). New York: Kluwer Academic Publishers.
Klemelä, J. (2009). Smoothing of multivariate data: density estimation and visualization. Chichester: Wiley.
Lu, L., Jiang, H., & Wong, W. H. (2013). Multivariate density estimation by bayesian sequential partitioning. Journal of the American Statistical Association, 108(504), 1402–1410. https://doi.org/10.1080/01621459.2013.813389.
Lugosi, G., & Nobel, A. (1996). Consistency of datadriven histogram methods for density estimation and classification. The Annals of Statistics, 24(2), 687–706.
Mahalanabis, S., & Stefankovic, D. (2008). Density estimation in linear time. In R. A. Servedio & T. Zhang (Eds.), 21st annual conference on learning theory—COLT 2008 (pp. 503–512). Finland: Omnipress, Helsinki.
Mattarei, S. (2010). Asymptotics of partial sums of central binomial coefficients and Catalan numbers. arXiv.0906.4290v3
Meier, J. (2008). Groups, graphs and trees: an introduction to the geometry of infinite groups. Cambridge: Cambridge University Press.
Ramström, K. (2019). Botnet detection on flow data using the reconstruction error from Autoencoders trained on Word2Vec network embeddings. Msc thesis, Uppsala University
Sainudiin, R., Teng, G., Harlow, J., & Lee, D. S. (2013). Posterior expectation of regularly paved random histograms. ACM Transactions on Modeling and Computer Simulation, 23(26), 6:1–6:20.
Sainudiin, R., York, T., Harlow, J., Teng, G., Tucker, W., & George, D. (2008–2019). MRS 2.0, a C++ class library for statistical set processing and computeraided proofs in statistics. https://github.com/lamastex/mrs2
Samet, H. (1990). The design and analysis of spatial data structures. Boston: AddisonWesley Longman.
Stanley, R.P. (1999). Enumerative combinatorics. Vol. 2, Cambridge Studies in Advanced Mathematics, vol 62. Cambridge University Press, Cambridge. https://books.google.fr/books?id=zg5wDqT6TUC&hl=fr&source=gbs_book_other_versions
Tukey, J. W. (1947). Nonparametric estimation II. Statistically equivalent blocks and tolerance regions—The continuous case. The Annals of Mathematical Statistics, 18(4), 529–539.
Vapnik, V. N., & Chervonenkis, A. Y. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab Appl, 16, 264–280.
Yatracos, Y. G. (1985). Rates of convergence of minimum distance estimators and kolmogorov’s entropy. The Annals of Statistics, 13(2), 768–774.
Yatracos, Y. G. (1988). A note on l1 consistent estimation. The Canadian Journal of Statistics, 16(3), 283–292.
Zaharia, M., Xin, R. S., Wendell, P., Das, T., Armbrust, M., Dave, A., et al. (2016). Apache spark: A unified engine for big data processing. Commun ACM, 59(11), 56–65. https://doi.org/10.1145/2934664.
Acknowledgements
RS and GT proved the theorems and GT implemented the three MDE algorithms based on codes by Jennifer Harlow and RS in mrs2. This research began from a conversation RS had with Luc Devroye at the World Congress in Probability and Statistics in 2008 and was partly supported by RS’s external consulting revenues from the New Zealand Ministry of Tourism, UC College of Engineering Sabbatical Grant and Visiting Scholarship at Department of Mathematics, Cornell University, Ithaca NY, USA and completed through the project CORCON: Correctness by Construction, Seventh Framework Programme of the European Union, Marie Curie ActionsPeople, International Research Staff Exchange Scheme (IRSES) with counterpart funding from the Royal Society of New Zealand. The application to botnet detection was partially supported by Combient Mix AB and the Department of Mathematics, Uppsala University.
Funding
This study was funded by FP7 People: MarieCurie Actions (project CORCON) (Grant number: 612638).
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix: MDE algorithms
Appendix: MDE algorithms
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Sainudiin, R., Teng, G. Minimum distance histograms with universal performance guarantees. Jpn J Stat Data Sci 2, 507–527 (2019). https://doi.org/10.1007/s4208101900054y
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s4208101900054y
Keywords
 Rooted planar binary tree
 Yatracos class
 Tree matrix arithmetic
 Model selection
 Regular paving
 Density estimation