# Learning data discretization via convex optimization

## Abstract

Discretization of continuous input functions into piecewise constant or piecewise linear approximations is needed in many mathematical modeling problems. It has been shown that choosing the length of the piecewise segments adaptively based on data samples leads to improved accuracy of the subsequent processing, such as classification. Traditional approaches are often tied to a particular classification model, which results in local greedy optimization of a criterion function. This paper proposes a technique for learning the discretization parameters along with the parameters of a decision function via convex optimization of the true objective. The general formulation is applicable to a wide range of learning problems. Empirical evaluation demonstrates that the proposed convex algorithms yield models with fewer parameters and comparable or better accuracy than existing methods.

## Introduction

Many mathematical modeling problems involve discretization of continuous input functions to convert them into their discrete counterparts (García et al. 2013; Liu et al. 2002). The goal of the discretization is to reduce the number of values a continuous attribute assumes by grouping them into a number of predefined bins (Cios et al. 2007, Chapter 8). The discretization is useful for reducing the data and subsequently generated model complexity and it is a necessary preprocessing step for data mining and machine learning algorithms that operate on discrete attribute spaces (Cios et al. 2007, Chapter 8). The discrete functions are then piece-wise constant (0th order) or piece-wise linear (1st order) approximations of the continuous input functions.

The discretization is useful, for example, when estimating probability density functions. The density functions are typically represented as multi-dimensional histograms in the discrete domain (Silverman 1986, Chapter 2). Although histograms can asymptotically approximate any continuous density, the accuracy of such approximation depends on the histogram bin size (Rao et al. 2005, Chapter 2). In another example, the domain of input features is discretized in order to apply a linear decision function, as in logistic regression or Support Vector Machine classification (Chapelle et al. 1999). Input feature discretization is a simple method for learning non-linear decision functions, where the algorithms essentially learn linearly parameterized rules. Clearly, the accuracy of the decision functions directly depends on the discretization of the feature values. One common difficulty in the discretization process is the choice of the discretization step, which determines the size of the piece-wise segments, e.g. the histogram bins or the quantization of the feature representation. The parameters of the decision function are typically estimated in a separate subsequent process (Dougherty et al. 1995; Pele et al. 2013).

Most algorithms employ the simplest unsupervised discretization by choosing a fixed number of bins with the same size (equal interval width) or the same number of samples in each bin (equal frequency interval) (Dougherty et al. 1995). The total number of bins is tuned for a specific application balancing two opposite considerations. Wider bins reduce the effects of noise in regions where the number of input samples is low. On the other hand, narrower bins result in more accurate function approximation in regions where there are many samples and thus the effects of the noise are suppressed. Equal frequency intervals have been extended by using Shannon entropy over discretized space to adjust the bin boundaries (Dougherty et al. 1995). Evidently, using varying bin sizes during discretization can be beneficial.

Supervised discretization algorithms use sample labels to improve the binning (Kerber 1992; Dougherty et al. 1995), often in an optimization step when learning a classifier (Boullé 2006; Fayyad and Irani 1992; Friedman and Goldszmidt 1996). One widely-adopted approach is to initially start with a large number of bins and then merge neighboring bins while optimizing a criterion function (Boullé 2006). In Boullé (2006), the discretization is based on a search algorithm that finds Bayes-optimal interval splits using a prior distribution on a model space. In Fayyad and Irani (1992), the recursive splitting is based on an information entropy minimization heuristic. The algorithm is extended with a Minimum Description Length stopping criterion in Fayyad and Irani (1993) and embedded into a dynamic programming algorithm in Elomaa and Rousu (1999). These techniques introduce supervision for finding the optimal discretization but are tied to a particular classification model (Naïve Bayes, decision trees, or Bayesian Networks (Friedman and Goldszmidt 1996; Yang and Webb 2008)). As a result, they rely on local greedy optimization of the criterion function (Hue and Boullé 2007).

This paper proposes an algorithm for learning piece-wise constant or piece-wise linear embeddings from data samples along with the parameters of a decision function. Similarly to several previous techniques, an initial fine-grained discretization with many histogram bins is adjusted by optimizing a criterion function. Our algorithm offers several important contributions. First, when training a decision function, the algorithm optimizes the true objective (or its surrogate), which includes both the discretization parameters and the parameters of the decision function. This is in contrast to previous methods that rely on two separate steps, discretization and classifier learning, which can deviate from the true objective (Pele et al. 2013). Second, the parameter learning is transformed into a convex optimization problem that can be solved effectively. Other techniques do not provide a global solution and resort to a greedy strategy where the features are processed sequentially (Hue and Boullé 2007). Third, our formulation is general: the piece-wise embeddings are learned jointly with a linear decision function, which makes the approach applicable to a wide range of linear models. Other methods are specific to a particular classification model (Boullé 2006; Fayyad and Irani 1992; Friedman and Goldszmidt 1996; Yang and Webb 2008).

Our experiments demonstrate that the learned discretization is effective when applied to various data representation problems. The first set of results combines the estimation of the malware traffic representation parameters with learning a linear decision function in a joint convex optimization problem. The learned linear embedding needs much lower dimensionality than equidistant discretization (by a factor of three on average) to achieve the same or better malware detection accuracy level. The second set of results shows a piece-wise linear approximation of a probability density function using non-equidistant bins when estimating the density from data samples. The proposed algorithm achieves lower KL-divergence between the estimate and the ground truth than histograms with equidistant bins. Comparison to the previously published piece-wise linear embedding for non-linear classification (Pele et al. 2013) shows higher accuracy of the proposed technique on a number of datasets from the UCI repository. These encouraging results show the promise of the technique which could be extended to other linear embedding and classification problems.

## Learning piece-wise constant functions

We consider learning a univariate piece-wise constant (PWC) function

\begin{aligned} f_{\mathrm{pwc}}(x, {\varvec{w}},{\varvec{\theta }}) = \sum _{i=1}^{B} \llbracket x\in [\theta _{i-1},\theta _{i})\rrbracket w_{i} = w_{k(x,{\varvec{\theta }})} \end{aligned}
(1)

where $$x\in {\mathbb {R}}$$ is the input variable, B is the number of bins, $${\varvec{\theta }}=(\theta _0,\ldots ,\theta _B)^T\in {\mathbb {R}}^{B+1}$$ is a vector defining the bin edges, $${\varvec{w}}=(w_1,\ldots ,w_B)^T\in {\mathbb {R}}^B$$ is a vector of weights each of them associated to a single bin and the function

\begin{aligned} k(x,{\varvec{\theta }}) = 1 + \max \{ i\in \{0,\ldots ,B-1\} \mid x \ge {\theta _{i}} \} \end{aligned}

returns the bin number into which the variable x falls. For notational convenience, we omit the additive scalar bias $$w_0$$ in the definition (1), which does not affect the discussion that follows. We denote the inclusive interval boundary by '[' and the exclusive boundary by ')'. The operator $$\llbracket A\rrbracket$$ evaluates to 1 if the logical statement A is true and to 0 otherwise.
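For concreteness, the PWC evaluation in (1) can be sketched in a few lines of Python; the function names are ours, and we assume right-open bins as in the definition:

```python
import bisect

def bin_index(x, theta):
    """k(x, theta): 1-based index of the right-open bin [theta[k-1], theta[k]) containing x."""
    k = bisect.bisect_right(theta, x)        # number of edges theta[i] <= x
    if not 1 <= k <= len(theta) - 1:
        raise ValueError("x lies outside [theta[0], theta[-1])")
    return k

def f_pwc(x, w, theta):
    """PWC function of Eq. (1): the weight attached to the bin containing x."""
    return w[bin_index(x, theta) - 1]

theta = [0.0, 1.0, 2.5, 4.0]                 # B = 3 bins
w = [10.0, 20.0, 30.0]                       # one weight per bin
```

The use of `bisect_right` encodes the right-open bin convention: an input equal to an interior edge falls into the bin to its right.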

Let $$g:{\mathbb {R}}^{m}\rightarrow {\mathbb {R}}$$ be a convex function whose value $$g(f(x^{1}),\ldots ,f(x^{m}))$$ measures how well the function $$f:{\mathbb {R}}\rightarrow {\mathbb {R}}$$ evaluated on m training inputs $${\mathcal {T}}=\{x^{i} \in {\mathbb {R}}\mid i=1,\ldots ,m\}$$ explains the data. For example, g can be the empirical risk or the likelihood of f evaluated on $${\mathcal {T}}$$. Assuming the bin edges $${\varvec{\theta }}$$ are fixed, the weights $${\varvec{w}}$$ of the PWC function (1) can be learned by solving the minimization problem

\begin{aligned} {\varvec{w}}^{*} \in \mathop {\mathrm{argmin}}_{{\varvec{w}}\in {\mathbb {R}}^{B}} F_{\mathrm{pwc}}({\varvec{w}},{\varvec{\theta }}) \end{aligned}
(2a)

where

\begin{aligned} F_{\mathrm{pwc}}({\varvec{w}},{\varvec{\theta }}) = g\big ( f_{\mathrm{pwc}}(x^{1},{\varvec{w}},{\varvec{\theta }}),\ldots , f_{\mathrm{pwc}}(x^m,{\varvec{w}},{\varvec{\theta }}) \big )\, \end{aligned}
(2b)

is convex in $${\varvec{w}}$$ since it is a composition of a convex function g and $$f_{\mathrm{pwc}}$$ which is linear in $${\varvec{w}}$$ (Boyd and Vandenberghe 2004).

In practice the bin edges $${\varvec{\theta }}$$ are often selected before learning $${\varvec{w}}$$. The simplest way is to use equidistantly spaced bins,

\begin{aligned} \begin{array}{l} \theta _{i} = i({\hbox {Max}}-{\hbox {Min}})/B + {\hbox {Min}} \,, \quad \forall i\in \{0,\ldots ,B\}\,, \end{array} \end{aligned}

where $$\mathrm{Min}$$ and $$\mathrm{Max}$$ are defined such that every admissible input x is inside the interval $$[\mathrm{Min}, \mathrm{Max})$$. The bin edges are constructed for different values of B and the optimal setting is typically tuned on validation examples. This procedure involves minimization of $$F_\mathrm{pwc}({\varvec{w}},{\varvec{\theta }})$$ for all proposal discretizations $${\varvec{\theta }}$$. In principle, one could optimize the width of individual bins as well, however, this would generate exponentially many proposal discretizations making this naive approach intractable.

We propose a method which can simultaneously learn the discretization $${\varvec{\theta }}$$ and the weights $${\varvec{w}}$$ of the PWC function (1). The important feature is that the resulting bins do not have to be equidistant. We only assume that the bin edges are selected from a finite monotonically increasing sequence of real numbers $$\nu _{0}< \nu _{1}< \cdots < \nu _{D}$$, in further text represented by a vector $${\varvec{\nu }}=(\nu _{0},\ldots ,\nu _{D})^{T}\in {\mathbb {R}}^{D+1}$$ and denoted as the initial discretization. The set of admissible discretizations $$\varTheta _{B}\subset {\mathbb {R}}^{B+1}$$ of the variable $$x\in {\mathbb {R}}$$ into B bins contains all vectors $${\varvec{\theta }}$$ satisfying:

\begin{aligned}&\theta _{i} = \nu _{l_{i}}\;, i\in \{0,\ldots ,B\},\end{aligned}
(3a)
\begin{aligned}&l_{0} = 0\,, \qquad l_{i} < l_{i+1}\,,\; i\in \{0,\ldots , B-1\}\,, \qquad l_{B} = D\,. \end{aligned}
(3b)

The Eq. (3a) states that the bin edges $${\varvec{\theta }}\in \varTheta _B$$ form a subset of the initial discretization $${\varvec{\nu }}$$. The indices $$\{l_0,\ldots ,l_{B}\}$$ define which edges are selected from $${\varvec{\nu }}$$. The Eq. (3b) states that the leftmost bin edge is $$\nu _{0}$$, the rightmost edge is $$\nu _{D}$$, and the intermediate edges form an increasing sequence.

Given a convex learning algorithm (2), we propose to learn the discretization $${\varvec{\theta }}^{*}\in \varTheta _{B}$$ simultaneously with the weights $${\varvec{w}}^*\in {\mathbb {R}}^B$$ by solving

\begin{aligned} ({\varvec{w}}^{*},{\varvec{\theta }}^{*}) \in \mathop {\mathrm{argmin}}_{{\varvec{w}}\in {\mathbb {R}}^{B}, {\varvec{\theta }}\in \varTheta _{B}} F_{\mathrm{pwc}}({\varvec{w}},{\varvec{\theta }}) \,. \end{aligned}
(4)

Unlike (2) where $${\varvec{\theta }}$$ is fixed, the problem (4) is almost always non-convex and hard to optimize. In the sequel we derive a convex approximation of (4) which can be solved efficiently provided the original problem (2) can be solved efficiently.

Let us define a function $$c:{\mathbb {R}}^{D}\rightarrow \{0,\ldots ,D-1\}$$

\begin{aligned} c({\varvec{v}}) = \sum _{i=1}^{D-1} \llbracket v_{i}\ne v_{i+1}\rrbracket \end{aligned}

returning the total number of different neighboring elements of the vector $${\varvec{v}}\in {\mathbb {R}}^{D}$$. Let $$\varTheta _{B}$$ be induced by the initial discretization $${\varvec{\nu }}$$ as defined by (3). It is easy to see that for any $$({\varvec{w}},{\varvec{\theta }})\in ({\mathbb {R}}^{B}\times \varTheta _{B})$$ there exists a unique $${\varvec{v}}\in \mathrm{V}_{B}=\big \{{\varvec{v}}\in {\mathbb {R}}^{D}\big | c({\varvec{v}}) \le B-1 \big \}$$ such that

\begin{aligned} f_\mathrm{pwc}(x,{\varvec{w}},{\varvec{\theta }}) = f_\mathrm{pwc}(x,{\varvec{v}},{\varvec{\nu }})\,, \quad \forall x\in {\mathbb {R}}\,, \end{aligned}
(5)

holds. In particular, $${\varvec{v}}$$ is constructed from $$({\varvec{w}},{\varvec{\theta }})$$ by

\begin{aligned} v_{l} = f_\mathrm{pwc}(\nu _{l},{\varvec{w}},{\varvec{\theta }}) \;, \qquad l\in \{1,\ldots ,D\}\,. \end{aligned}
(6)

Computing $$({\varvec{w}},{\varvec{\theta }})$$ from $$({\varvec{v}},{\varvec{\nu }})$$ is also straightforward but not unique in general. It can be seen that for any $${\varvec{v}}\in \mathrm{V}_B$$ it is possible to construct a finite number of pairs $${\mathcal {W}}({\varvec{v}})=\{({\varvec{w}}_{1},{\varvec{\theta }}_{1}),\ldots ,({\varvec{w}}_{L},{\varvec{\theta }}_{L})\}\in ({\mathbb {R}}^{B}\times \varTheta _{B})^{L}$$, such that the equality (5) holds for any $$({\varvec{w}},{\varvec{\theta }})\in {\mathcal {W}}({\varvec{v}})$$. Provided $$c({\varvec{v}}) = B - 1$$, the conversion from $${\varvec{v}}$$ to $$({\varvec{w}},{\varvec{\theta }})$$ is unique, i.e. $$|{\mathcal {W}}({\varvec{v}})|=1$$. The parametrization $$({\varvec{w}},{\varvec{\theta }})$$ is a shortened representation of $$({\varvec{v}},{\varvec{\nu }})$$ in which sub-sequences of equal components are collapsed. Therefore we will denote $$({\varvec{v}},{\varvec{\nu }})$$ as the uncompressed parametrization and $$({\varvec{w}},{\varvec{\theta }})$$ as the compressed parametrization. The equivalence between the compressed parametrization and the uncompressed one is illustrated in Fig. 1. The equivalence implies that

\begin{aligned} \min \big \{ F_\mathrm{pwc}({\varvec{w}},{\varvec{\theta }}) \mid ({\varvec{w}},{\varvec{\theta }}) \in \big ({\mathbb {R}}^{B} \times \varTheta _{B}\big )\big \} = \min \big \{F_\mathrm{pwc}({\varvec{v}},{\varvec{\nu }}) \mid {\varvec{v}}\in \mathrm{V}_{B} \big \} \end{aligned}

and therefore the problem (4) can be solved by finding

\begin{aligned} {\varvec{v}}^{*} \in \mathop {\mathrm{argmin}}_{{\varvec{v}}\in {\mathbb {R}}^{D}} F_\mathrm{pwc}({\varvec{v}},{\varvec{\nu }}) \quad \text{ s.t. }\quad c({\varvec{v}}) \le B-1 \,, \end{aligned}
(7)

and solving (6) for $$({\varvec{w}}^{*},{\varvec{\theta }}^{*})$$ with $${\varvec{v}}$$ set to $${\varvec{v}}^{*}$$. Methods for an efficient computation of the compressed parametrization $$({\varvec{w}}^{*},{\varvec{\theta }}^{*})$$ from the uncompressed one $$({\varvec{v}}^{*},{\varvec{\nu }})$$ are the subject of Sect. 4.
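The change of parametrization can be made concrete with a short Python sketch (names are ours): one function maps a compressed pair (w, θ) to the uncompressed vector v over the initial discretization ν in the spirit of Eq. (6), evaluating each initial cell at its midpoint to sidestep the open/closed boundary convention, and the other merges runs of equal neighbors back into bins:

```python
import bisect

def uncompress(w, theta, nu):
    """v[l-1] is the weight of the theta-bin covering the initial cell
    [nu[l-1], nu[l]); midpoints avoid the open/closed boundary issue."""
    v = []
    for l in range(1, len(nu)):
        mid = 0.5 * (nu[l - 1] + nu[l])
        k = bisect.bisect_right(theta, mid)   # 1-based bin index k(mid, theta)
        v.append(w[k - 1])
    return v

def compress(v, nu):
    """Inverse map: merge runs of equal neighbors into bins (w, theta)."""
    w, theta = [], [nu[0]]
    for i, vi in enumerate(v):
        if i + 1 == len(v) or v[i + 1] != vi:
            w.append(vi)                      # close the current run
            theta.append(nu[i + 1])           # its right edge enters theta
    return w, theta

nu = [0.0, 1.0, 2.0, 3.0, 4.0]                # initial discretization, D = 4
v = uncompress([5.0, 7.0], [0.0, 2.0, 4.0], nu)
```

Round-tripping through the two maps recovers the compressed pair exactly whenever the number of runs in v equals the number of bins.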

The problem (7) has a convex objective but its single constraint remains non-convex. Introducing the vector $${\varvec{d}}=(v_{1}-v_{2},\ldots ,v_{D-1}-v_{D})^{T}$$, we see that the function $$c({\varvec{v}})$$ can be written as the $$L_0$$-norm of $${\varvec{d}}$$, i.e. $$c({\varvec{v}}) = \Vert {\varvec{d}}\Vert _{0}$$. It has been demonstrated that the $$L_{1}$$-norm is often a good convex proxy of the $$L_{0}$$-norm, see e.g. Candes and Tao (2005), Candes et al. (2006), Donoho (2006). Therefore, we propose to approximate the non-convex function $$c({\varvec{v}})=\Vert {\varvec{d}}\Vert _{0}$$ by the convex function $$\tilde{c}({\varvec{v}})=\Vert {\varvec{d}} \Vert _{1}$$. A convex relaxation of the problem (7) then reads

\begin{aligned} {\varvec{v}}^{*} \in \mathop {\mathrm{argmin}}_{{\varvec{v}}\in {\mathbb {R}}^{D}} F_\mathrm{pwc}({\varvec{v}},{\varvec{\nu }}) \quad \text{ s.t. }\quad \sum _{i=1}^{D-1} | v_{i} - v_{i+1} | \le B-1 \,, \end{aligned}

or, equivalently, we can solve an unconstrained problem

\begin{aligned} {\varvec{v}}^{*}\in \mathop {\mathrm{argmin}}_{{\varvec{v}}\in {\mathbb {R}}^{D}} \Big [F_\mathrm{pwc}({\varvec{v}},{\varvec{\nu }}) + \gamma \sum _{i=1}^{D-1} | v_{i} - v_{i+1} | \Big ]\,, \end{aligned}
(8)

where the constant $$\gamma \in {\mathbb {R}}_{+}$$ is monotonically dependent on $$B-1$$.
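The effect of the relaxation in (8) can be checked directly on a toy instance. Below, g is a small least-squares fit and the candidate vectors v range over a tiny value grid, so the penalized objective can be minimized by brute force: with γ = 0 the minimizer follows the data, while a large γ forces neighboring components to coincide, i.e. a single bin. The toy data and grid are ours, and brute force stands in for a proper convex solver only at this scale:

```python
import itertools

y = [0.0, 0.4, 0.6, 1.0]     # toy targets defining g(v) = sum_i (v_i - y_i)^2
grid = [0.0, 0.5, 1.0]       # candidate values for each component of v

def objective(v, gamma):
    g = sum((a - b) ** 2 for a, b in zip(v, y))
    tv = sum(abs(a - b) for a, b in zip(v, v[1:]))   # L1 proxy of c(v) from Eq. (8)
    return g + gamma * tv

def solve(gamma):
    """Brute-force minimizer of the penalized objective over the grid."""
    return min(itertools.product(grid, repeat=len(y)),
               key=lambda v: objective(v, gamma))

def jumps(v):                # c(v): number of differing neighboring pairs
    return sum(1 for a, b in zip(v, v[1:]) if a != b)
```

Increasing γ trades data fit for fewer jumps, mirroring the monotone relation between γ and the bin budget B − 1 noted above.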

## Learning piece-wise linear functions

The previous section shows how to transform learning of PWC functions to a convex optimization problem. A similar idea can be applied to learning piece-wise linear (PWL) functions which provide more accurate approximation of the input data. In this case, we want to learn a function

\begin{aligned} f_\mathrm{pwl}(x, {\varvec{w}},{\varvec{\theta }}) = w_{k(x,{\varvec{\theta }})-1} \cdot (1- \alpha (x,{\varvec{\theta }})) + w_{k(x,{\varvec{\theta }})} \cdot \alpha ( x,{\varvec{\theta }}) \end{aligned}
(9)

where $$x\in {\mathbb {R}}$$ is the input variable as before, $${\varvec{\theta }}\in {\mathbb {R}}^{B+1}$$ is a vector defining the bin edges, B is the number of bins and $${\varvec{w}}\in {\mathbb {R}}^{B+1}$$ is the vector of weights. The function $$\alpha :{\mathbb {R}}\times {\mathbb {R}}^{B+1}\rightarrow [0,1]$$, defined as

\begin{aligned} \alpha (x,{\varvec{\theta }}) = \frac{ x - \theta _{k(x,{\varvec{\theta }})-1}}{\theta _{k(x,{\varvec{\theta }})} - \theta _{k(x,{\varvec{\theta }})-1}} \end{aligned}

is the normalized distance between x and the left edge of the $$k(x,{\varvec{\theta }})$$-th bin.
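The PWL evaluation in (9) is ordinary linear interpolation between the knots (θ_i, w_i); a minimal Python sketch (names ours, with the last edge clamped into the last bin):

```python
import bisect

def f_pwl(x, w, theta):
    """PWL function of Eq. (9): linear interpolation of the knots (theta[i], w[i])."""
    k = bisect.bisect_right(theta, x)        # 1-based bin index k(x, theta)
    k = min(max(k, 1), len(theta) - 1)       # clamp so x == theta[-1] falls in the last bin
    alpha = (x - theta[k - 1]) / (theta[k] - theta[k - 1])
    return w[k - 1] * (1 - alpha) + w[k] * alpha

theta = [0.0, 1.0, 3.0]                      # B = 2 bins
w = [0.0, 2.0, 4.0]                          # B + 1 = 3 knot weights
```

For array inputs, `numpy.interp(x, theta, w)` produces the same values on the interval covered by the knots.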

Analogously to the previous section, we want to learn simultaneously the discretization $${\varvec{\theta }}^{*}\in \varTheta _{B}$$ and the weights $${\varvec{w}}^{*}\in {\mathbb {R}}^{B+1}$$ by solving

\begin{aligned} ({\varvec{w}}^{*},{\varvec{\theta }}^{*})\in \mathop {\mathrm{argmin}}_{{\varvec{w}}\in {\mathbb {R}}^{B+1},{\varvec{\theta }}\in \varTheta _{B}} F_\mathrm{pwl}({\varvec{w}},{\varvec{\theta }}) \end{aligned}
(10)

where

\begin{aligned} F_\mathrm{pwl}({\varvec{w}},{\varvec{\theta }}) = g\big ( f_\mathrm{pwl}(x^{1},{\varvec{w}},{\varvec{\theta }}),\ldots ,f_\mathrm{pwl}(x^{m},{\varvec{w}},{\varvec{\theta }})\big ) \end{aligned}

is the learning objective depending on the responses of the PWL function (9) and evaluated on the training inputs $$\{x^{1},\ldots ,x^{m}\}\in {\mathbb {R}}^{m}$$. In the sequel, we derive a convex relaxation of (10).

Let us define a function $$e:{\mathbb {R}}^{D+1}\rightarrow \{0,\ldots ,D-1\}$$,

\begin{aligned} e({\varvec{u}}) = \sum _{i=1}^{D-1} \llbracket u_{i} \ne \frac{1}{2}(u_{i-1}+u_{i+1})\rrbracket \end{aligned}

which returns the number of weights that cannot be expressed as an average of the neighboring weights. Let $$\varTheta _{B}$$ be induced by the initial discretization $${\varvec{\nu }}$$ as defined by (3). It can be seen that for any $$({\varvec{w}},{\varvec{\theta }})\in ({\mathbb {R}}^{B+1}\times \varTheta _{B})$$ we can construct a unique $${\varvec{u}}\in {\mathcal {U}}_B = \{{\varvec{u}}\in {\mathbb {R}}^{D+1}\mid e({\varvec{u}}) \le B - 1\}$$ such that

\begin{aligned} f_\mathrm{pwl}(x,{\varvec{w}},{\varvec{\theta }}) = f_\mathrm{pwl}(x,{\varvec{u}},{\varvec{\nu }})\,,\quad \forall x\in {\mathbb {R}} \, \end{aligned}
(11)

holds, in particular, $${\varvec{u}}$$ is constructed from $$({\varvec{w}},{\varvec{\theta }})$$ by

\begin{aligned} u_{l} = f_\mathrm{pwl}(\nu _{l},{\varvec{w}},{\varvec{\theta }})\,, \quad l\in \{0,\ldots ,D\}\,. \end{aligned}
(12)

In addition, for any $${\varvec{u}}\in {\mathcal {U}}_{B}$$ it is possible to construct a finite number of pairs $${\mathcal {W}}'({\varvec{u}})=\{({\varvec{w}}_{1},{\varvec{\theta }}_{1}),\cdots ,({\varvec{w}}_{L},{\varvec{\theta }}_{L})\}\in ({\mathbb {R}}^{B+1}\times \varTheta _{B})^{L}$$ such that the equality (11) holds for any $$({\varvec{w}},{\varvec{\theta }})\in {\mathcal {W}}'({\varvec{u}})$$. For $$e({\varvec{u}})=B-1$$ the conversion from $${\varvec{u}}$$ to $$({\varvec{w}},{\varvec{\theta }})$$ is unique, i.e. $$|{\mathcal {W}}'({\varvec{u}})|=1$$. The equivalence between the compressed parametrization $$({\varvec{w}},{\varvec{\theta }})$$ and the uncompressed parametrization $$({\varvec{u}},{\varvec{\nu }})$$ is illustrated in Fig. 2. The equivalence implies that

\begin{aligned} \min \big \{ F_\mathrm{pwl}({\varvec{w}},{\varvec{\theta }}) \mid ({\varvec{w}},{\varvec{\theta }})\in \big ({\mathbb {R}}^{B+1}\times \varTheta _{B}\big ) \big \} = \min \big \{ F_\mathrm{pwl}({\varvec{u}},{\varvec{\nu }}) \mid {\varvec{u}}\in {\mathcal {U}}_{B} \big \}\,. \end{aligned}

Consequently, the problem (10) can be solved by choosing

\begin{aligned} {\varvec{u}}^{*} \in \mathop {\mathrm{argmin}}_{{\varvec{u}}\in {\mathbb {R}}^{D+1}} F_\mathrm{pwl}({\varvec{u}},{\varvec{\nu }}) \quad \text{ s.t. }\quad e({\varvec{u}}) \le B - 1 \,, \end{aligned}
(13)

and computing $$({\varvec{w}}^{*},{\varvec{\theta }}^{*})$$ from $${\varvec{u}}^{*}$$ by solving the Eq. (12) for $$({\varvec{w}},{\varvec{\theta }})$$ with $${\varvec{u}}={\varvec{u}}^{*}$$. Efficient methods for computing the compressed parametrization $$({\varvec{w}}^{*},{\varvec{\theta }}^{*})$$ from the uncompressed one $$({\varvec{u}}^{*},{\varvec{\nu }})$$ are discussed in Sect. 4. As before, replacing the $$L_{0}$$-norm in the non-convex constraint $$e({\varvec{u}})\le B-1$$ by the $$L_{1}$$-norm, we obtain a convex relaxation of (10) which reads

\begin{aligned} {\varvec{u}}^*\in \mathop {\mathrm{argmin}}_{{\varvec{u}}\in {\mathbb {R}}^{D+1}} F_\mathrm{pwl}({\varvec{u}},{\varvec{\nu }}) \quad \text{ s.t. }\quad \sum _{i=1}^{D-1} \big |u_{i} - \frac{1}{2}u_{i-1} - \frac{1}{2}u_{i+1}\big | \le B -1 \,, \end{aligned}

or, equivalently, we can solve an unconstrained problem

\begin{aligned} {\varvec{u}}^{*}\in \mathop {\mathrm{argmin}}_{{\varvec{u}}\in {\mathbb {R}}^{D+1}} \Big [F_\mathrm{pwl}({\varvec{u}},{\varvec{\nu }}) + \gamma \sum _{i=1}^{D-1} \big |u_{i} - \frac{1}{2}u_{i-1} - \frac{1}{2}u_{i+1}\big | \Big ] \,, \end{aligned}
(14)

where the constant $$\gamma \in {\mathbb {R}}_{+}$$ is monotonically dependent on $$B-1$$.
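As a sanity check of the construction (12), the following sketch samples a PWL function with knots θ ⊂ ν at an equidistant initial discretization ν and verifies that the average condition counted by e(·) fails only at interior knots, so e(u) ≤ B − 1. The names are ours, and note that reading the condition u_i = (u_{i−1} + u_{i+1})/2 as collinearity presumes an equidistant ν:

```python
import bisect

def f_pwl(x, w, theta):
    """PWL function of Eq. (9): linear interpolation of the knots (theta[i], w[i])."""
    k = min(max(bisect.bisect_right(theta, x), 1), len(theta) - 1)
    a = (x - theta[k - 1]) / (theta[k] - theta[k - 1])
    return w[k - 1] * (1 - a) + w[k] * a

def e(u, tol=1e-9):
    """Count interior components breaking the average condition of e(u)."""
    return sum(1 for i in range(1, len(u) - 1)
               if abs(u[i] - 0.5 * (u[i - 1] + u[i + 1])) > tol)

nu = [0.0, 1.0, 2.0, 3.0, 4.0]         # equidistant initial discretization, D = 4
theta = [0.0, 2.0, 4.0]                # B = 2 bins; knots are a subset of nu
w = [0.0, 4.0, 2.0]                    # one weight per knot
u = [f_pwl(x, w, theta) for x in nu]   # uncompressed weights via Eq. (12)
```

Only the interior knot at ν = 2 breaks the average condition, matching e(u) = B − 1.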

## Rounding of piece-wise functions

The uncompressed parametrization $$({\varvec{v}},{\varvec{\nu }})$$ of the PWC function [e.g. found by the proposed algorithm (8)] can be converted to the compressed one $$({\varvec{w}},{\varvec{\theta }})$$ by splitting $${\varvec{v}}$$ into sub-vectors of equal weights, i.e. $${\varvec{v}}=({\varvec{v}}_{1}^{T}, {\varvec{v}}_{2}^{T},\ldots , {\varvec{v}}_{B}^{T})$$ where each $${\varvec{v}}_i$$ can be written as $${\varvec{v}}_{i} = w_{i} [1,\ldots ,1]^{T}$$. Obtaining the compressed parametrization $$({\varvec{w}},{\varvec{\theta }})$$ is then straightforward (see Fig. 1). An analogous procedure can be applied to the PWL parametrizations, in which case we search for sub-vectors whose intermediate components can be expressed as an average of their neighbors. In practice, however, the components of the uncompressed solution can be noisy due to the convex relaxation and the use of approximate solvers for finding the uncompressed parameters. For this reason, it is useful to round the uncompressed solution before converting it to the compressed one. The rounding procedures for the PWC and the PWL parametrization are described in the next sections.

### Rounding of PWC parametrization

A very crude rounding is obtained by splitting $${\varvec{v}}$$ into the sub-vectors $${\varvec{v}}=({\varvec{v}}_{1}^{T}, {\varvec{v}}_{2}^{T},\ldots , {\varvec{v}}_{B}^{T})$$ such that all components of each $${\varvec{v}}_{i}$$ have the same sign. The rounded uncompressed solution is then

\begin{aligned} \bar{{\varvec{v}}} = (\hbox {mean}({\varvec{v}}_1) [1,\ldots ,1]^{T},\ldots ,\hbox {mean}({\varvec{v}}_{B})[1,\ldots ,1]^{T})^{T} \end{aligned}
(15)

where the operator $$\mathrm{mean}({\varvec{v}}')$$ returns the mean value of the components of the vector $${\varvec{v}}'$$. Note that this procedure does not require any parameter.
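A possible implementation of this sign-based rounding (15) in Python (names ours; we count zero as positive, a convention the text leaves open):

```python
def crude_round(v):
    """Eq. (15): split v into maximal runs of equal sign, replace each by its mean."""
    out, run = [], [v[0]]
    for x in v[1:]:
        if (x >= 0) == (run[-1] >= 0):       # same sign: extend the current run
            run.append(x)
        else:                                # sign change: close the run with its mean
            out += [sum(run) / len(run)] * len(run)
            run = [x]
    out += [sum(run) / len(run)] * len(run)
    return out
```

The result is a vector of the same length whose runs of equal values directly yield the compressed pair (w, θ).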

Another method is to find the parameters $$\bar{{\varvec{v}}}$$ of the PWC function with B bins which have the shortest Euclidean distance to the given $${\varvec{v}}$$ by solving

\begin{aligned} \bar{{\varvec{v}}} \in \mathop {\mathrm{argmin}}_{{\varvec{v}}'\in {\mathbb {R}}^{D}} \Vert {\varvec{v}} - {\varvec{v}}'\Vert ^{2}\quad \text{ s.t. }\quad c({\varvec{v}}') = B -1 \,. \end{aligned}
(16)

The problem (16) is the Euclidean projection of $${\varvec{v}}$$ onto the set $$\mathrm{V}_{B}=\big \{{\varvec{v}}\in {\mathbb {R}}^{D}\big | c({\varvec{v}}) \le B-1 \big \}$$. The optimal solution of (16) can be found by dynamic programming as follows. Provided that $$c({\varvec{v}}') = B -1$$ holds, the vector $${\varvec{v}}'$$ can be parametrized by a shorter vector $${\varvec{w}}=(w_{1},\ldots ,w_{B})\in {\mathbb {R}}^{B}$$ and indices $$(l_{0},\ldots ,l_{B})$$ satisfying (3b) which define bins and hence also sub-sequences within $${\varvec{v}}'$$ with the same value, i.e. $$v'_{l_{i-1}+1}=\cdots =v'_{l_{i}}=w_{i}$$ holds for all $$i\in \{1,\ldots ,B\}$$. With this notation, we can rewrite the distance $$\Vert {\varvec{v}} - {\varvec{v}}'\Vert ^{2}$$ as

\begin{aligned} \Vert {\varvec{v}} - {\varvec{v}}'\Vert ^{2} = \sum _{i=1}^{D} (v_{i} - v'_{i})^{2} = \sum _{j=1}^{B} \sum _{i = l_{j-1}+1}^{l_{j}} (v_{i} - w_{j})^{2} = \sum _{j=1}^{B} R(l_{j-1},l_{j},w_{j})\;. \end{aligned}

It is clear that the weight $$\bar{w}_{j}$$ solving the sub-problem $$\min _{w_{j}} R(l_{j-1},l_{j},w_{j})$$ equals the arithmetic mean of the corresponding components of $${\varvec{v}}$$, $${\bar{w}_{j}}=\sum _{i = l_{j-1}+1}^{l_{j}}v_{i} / (l_{j}-l_{j-1})$$. Let us define $${\hat{R}}(l_{j-1},l_{j}) = R(l_{j-1},l_{j},{\bar{w}}_{j})$$. Instead of solving (16) we can now search for the optimal indices:

\begin{aligned} ({\bar{l}}_{0},\ldots ,{\bar{l}}_{B}) \in \mathop {\mathrm{argmin}}_{(l_{0},\ldots ,l_{B})} \sum _{j=1}^{B} {\hat{R}}(l_{j-1},l_{j}) \quad \text{ s.t. }\quad l_{0} = 0\,,\;\; l_{i} < l_{i+1}\,,\; i\in \{0,\ldots , B-1\}\,,\;\; l_{B} = D\,. \end{aligned}
(17)

The problem (17) can be solved by dynamic programming since its objective decomposes into a sum of functions, each of which shares at most two variables with the others. The overall computational time of the DP procedure is $${\mathcal {O}}(B\cdot D^{2})$$.
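A direct Python implementation of this dynamic program might look as follows (names ours): `dp[j][t]` is the best cost of covering the first t components of v with j segments, prefix sums make each segment cost O(1), and the triple loop visits O(B · D²) state transitions, matching the bound above:

```python
def project_pwc(v, B):
    """DP for Eq. (17): best B-segment PWC approximation of v in the
    Euclidean sense; returns (cost, rounded vector of segment means)."""
    D = len(v)
    S1 = [0.0] * (D + 1)                     # prefix sums of v
    S2 = [0.0] * (D + 1)                     # prefix sums of v squared
    for i, x in enumerate(v):
        S1[i + 1] = S1[i] + x
        S2[i + 1] = S2[i] + x * x

    def cost(a, b):                          # R-hat: distance of v[a:b] to its mean
        s = S1[b] - S1[a]
        return S2[b] - S2[a] - s * s / (b - a)

    INF = float("inf")
    dp = [[INF] * (D + 1) for _ in range(B + 1)]
    arg = [[0] * (D + 1) for _ in range(B + 1)]
    dp[0][0] = 0.0
    for j in range(1, B + 1):
        for t in range(j, D + 1):
            for s in range(j - 1, t):        # last segment is v[s:t]
                c = dp[j - 1][s] + cost(s, t)
                if c < dp[j][t]:
                    dp[j][t], arg[j][t] = c, s

    bounds, t = [D], D                       # backtrack the indices l_0 < ... < l_B
    for j in range(B, 0, -1):
        t = arg[j][t]
        bounds.append(t)
    bounds.reverse()
    rounded = []
    for a, b in zip(bounds, bounds[1:]):
        rounded += [(S1[b] - S1[a]) / (b - a)] * (b - a)
    return dp[B][D], rounded

cost_opt, rounded = project_pwc([1.0, 1.2, 5.0, 5.2], 2)
```

On the toy vector above, the DP splits between the second and third components and replaces each half by its mean.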

Because the number of bins is often unknown a priori, we can search for the minimal number of bins which explain the given solution $${\varvec{v}}$$ with a prescribed precision $$\varepsilon$$, i.e. we can solve

\begin{aligned} \bar{{\varvec{v}}} =\mathop {\mathrm{argmin}}_{B\in {\mathcal {N}}, {\varvec{v}}'\in {\mathbb {R}}^D} B\quad \text{ s.t. }\quad \Vert {\varvec{v}}-{\varvec{v}}'\Vert ^{2} \le \varepsilon \quad \text{ and }\quad \, c({\varvec{v}}') = B - 1 \,. \end{aligned}
(18)

The problem (18) can be solved by repeatedly solving (16) with increasing B until the constraint $$\Vert {\varvec{v}}-{\varvec{v}}'\Vert ^{2} \le \varepsilon$$ is satisfied.

### Rounding of PWL parametrization

The parameter vector $$\bar{{\varvec{u}}}$$ describing the PWL function with B bins which has the shortest Euclidean distance to given $${\varvec{u}}$$ can be found by solving

\begin{aligned} {\bar{{\varvec{u}}}} \in \mathop {\mathrm{argmin}}_{{\varvec{u}}'\in {\mathbb {R}}^{D+1}} \Vert {\varvec{u}} - {\varvec{u}}'\Vert ^{2}\quad \text{ s.t. }\quad e({\varvec{u}}') = B -1 \,. \end{aligned}
(19)

If the number of bins is unknown, we can search for the minimal number of bins which explain the given solution $${\varvec{u}}$$ with a prescribed precision $$\varepsilon$$ by solving

\begin{aligned} {\bar{{\varvec{u}}}} =\mathop {\mathrm{argmin}}_{B\in {\mathcal {N}}, {\varvec{u}}'\in {\mathbb {R}}^{D+1}} B\quad \text{ s.t. }\quad \Vert {\varvec{u}}-{\varvec{u}}'\Vert ^2 \le \varepsilon \quad \text{ and }\quad \, e({\varvec{u}}') = B - 1 \,. \end{aligned}
(20)

The problems (19) and (20) can be solved exactly by procedures analogous to the ones for the PWC function described in the previous section; they are hence omitted to save space.

## Examples of the proposed framework

The previous sections describe a generic framework that makes it possible to modify a wide class of convex algorithms so that they can learn the PWC and the PWL functions. In this section, we give three instances of the proposed framework. We also show how the same idea can be applied to learning multi-variate PWC and PWL functions. In particular, we consider learning of linear classifiers using sequential data represented by PWC histograms (Sect. 5.1), estimation of PWL probability density models (Sect. 5.2) and learning of non-linear classifiers via PWL data embedding (Sect. 5.3). It will be shown that, in all cases, learning leads to a convex optimization problem. Moreover, the optimization can be reformulated as a well-understood Quadratic Programming task.

### Classification of histograms

In many applications, the object to be classified is described by a set of sequences sampled from some unknown distributions. A simple yet effective representation of the sequential data is the normalized PWC histogram, which is then used as an input to a linear classifier. This classification model has been successfully used, e.g., in computer vision (Dalal and Triggs 2005) or in computer security (Bartos and Sofka 2015), as will be described in Sect. 7.1.

Assume we have a training set $${\mathcal {T}}=\{({\varvec{X}}^{1},y^{1}),\ldots ,({\varvec{X}}^{m},y^{m})\}\in ({\mathbb {R}}^{n\times d}\times \{-1,+1\})^{m}$$ where the input matrix $${\varvec{X}}^{i}$$ describes n sequences each having d elements. The linear classifier $$h({\varvec{X}},{\varvec{w}},{\varvec{\theta }})=\mathop {{\mathrm{sign}}}( f_{{\mathrm{pwc}}}({\varvec{X}},{\varvec{w}},{\varvec{\theta }}))$$ assigns $${\varvec{X}}$$ into a class based on the sign of the discriminant function

\begin{aligned} f_{{\mathrm{pwc}}}({\varvec{X}},{\varvec{w}},{\varvec{\theta }}) = \sum _{i=1}^{n} \sum _{j=1}^{b_{i}} \frac{1}{d}\sum _{k=1}^{d}\llbracket X_{i,k} \in [\theta _{i,j-1}, \theta _{i,j})\rrbracket w_{i,j} \end{aligned}
(21)

where $${\varvec{\theta }}_{i}=(\theta _{i,0},\ldots ,\theta _{i,b_{i}})^{T}\in {\mathbb {R}}^{b_{i}+1}$$ is a vector defining bin edges of the i-th histogram and $${\varvec{\theta }}=({\varvec{\theta }}_{1}^{T},\ldots ,{\varvec{\theta }}_{n}^{T})^{T}\in {\mathbb {R}}^{n+B}$$ is their concatenation, $$B=\sum _{i=1}^{n} b_{i}$$ denotes the total number of bins, $${\varvec{w}}_{i}=(w_{i,1},\ldots ,w_{i,b_{i}})^{T}\in {\mathbb {R}}^{b_{i}}$$ are bin heights of the i-th histogram and $${\varvec{w}}=({\varvec{w}}_{1}^{T},\ldots ,{\varvec{w}}_{n}^{T})\in {\mathbb {R}}^{B}$$ is their concatenation.
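The discriminant (21) is linear in w: it is an inner product between w and a concatenation of normalized histograms of the n sequences. A Python sketch of that feature map (names ours; out-of-range values are clamped into the boundary bins for simplicity):

```python
import bisect

def hist_features(X, edges):
    """Concatenated normalized histograms: Eq. (21) becomes f = <w, phi(X)>."""
    phi = []
    for row, th in zip(X, edges):             # row = i-th sequence, th = its bin edges
        counts = [0] * (len(th) - 1)
        for x in row:
            k = min(max(bisect.bisect_right(th, x), 1), len(th) - 1)
            counts[k - 1] += 1                # right-open bins, boundaries clamped
        d = len(row)
        phi.extend(c / d for c in counts)
    return phi

X = [[0.5, 1.5], [2.5, 2.6]]                  # n = 2 sequences of length d = 2
edges = [[0.0, 1.0, 2.0], [2.0, 3.0]]         # b_1 = 2 bins, b_2 = 1 bin
w = [1.0, -1.0, 2.0]                          # concatenated bin weights
phi = hist_features(X, edges)
f = sum(wi * pi for wi, pi in zip(w, phi))    # the discriminant of Eq. (21)
```

This linearity in w is what makes the SVM objective below convex once the initial discretization is fixed.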

In order to learn the bin edges $${\varvec{\theta }}$$ and the weights $${\varvec{w}}$$ from the examples $${\mathcal {T}}$$ simultaneously, we apply the general framework described in Sect. 2. In particular, we show how to adapt the Support Vector Machine (SVM) algorithm. First, we define the initial discretization $${\varvec{\nu }}_i=(\nu _{i,0},\ldots , \nu _{i,D})^{T}\in {\mathbb {R}}^{D+1}$$ for each of the n histograms. For example, we place $$D+1$$ edges equidistantly between the minimal value $$\hbox {Min}_{i}$$ and the maximal value $$\hbox {Max}_{i}$$ that can appear in the i-th sequence, so that $$\nu _{i,j}=j(\mathrm{Max}_i - \mathrm{Min}_i)/D + \mathrm{Min}_i$$. Second, we combine the SVM objective function

\begin{aligned} F^\mathrm{svm}_\mathrm{pwc}({\varvec{w}},{\varvec{\theta }},\lambda ) = \frac{\lambda }{2}\Vert {\varvec{w}}\Vert ^{2} + \frac{1}{m}\sum _{i=1}^{m} \max \big \{0,1-y^{i} f_\mathrm{pwc}({\varvec{X}}^{i},{\varvec{w}},{\varvec{\theta }}) \big \} \end{aligned}

with the $$L_{1}$$-norm approximation of the function

\begin{aligned} c_{n}({\varvec{v}}) = \sum _{i=1}^{n} \sum _{j=1}^{D-1} \llbracket v_{i,j} \ne v_{i,j+1} \rrbracket \end{aligned}

The function $$c_{n}({\varvec{v}})$$ counts the pairs of neighboring components of $${\varvec{v}}$$ that differ. Analogously to the general formulation (8), we obtain the following convex program

\begin{aligned} {\varvec{v}}^{*} \in \mathop {\mathrm{argmin}}_{{\varvec{v}}\in {\mathbb {R}}^{n D}} \bigg [ F^\mathrm{svm}_\mathrm{pwc}({\varvec{v}},{\varvec{\nu }},\lambda ) + \gamma \sum _{i=1}^{n} \sum _{j=1}^{D-1} | v_{i,j} - v_{i,j+1} | \bigg ]\,. \end{aligned}
(22)

The constant $$\lambda \in {\mathbb {R}}_{+}$$ controls the empirical error. The constant $$\gamma \in {\mathbb {R}}_{+}$$ influences the number of equal neighboring components in $${\varvec{v}}^{*}$$ and thus the number of emerging bins. The uncompressed parametrization $$({\varvec{v}}^{*},{\varvec{\nu }})$$ is converted to the compressed one $$({\varvec{w}}^{*},{\varvec{\theta }}^{*})$$ via the rounding methods from Sect. 4, which are applied to each pair $$({\varvec{v}}^{*}_{i},{\varvec{\nu }}_{i}), i\in \{1,\ldots ,n\}$$, separately.
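The objective of the convex program (22) is cheap to evaluate, which is useful for sanity-checking any solver applied to it. Below is a minimal sketch; the data layout (one histogram per row of $${\varvec{v}}$$) and the function names are our own assumptions:

```python
import numpy as np

def objective_pwc(v, nu, X_list, y, lam, gamma):
    """Value of the objective in (22): SVM term plus total-variation penalty.

    v      : (n, D) uncompressed weights, one row per histogram
    nu     : (n, D+1) initial equidistant bin edges
    X_list : list of m input matrices, each of shape (n, d)
    y      : list of m labels in {-1, +1}
    """
    m = len(y)
    loss = 0.0
    for X, yi in zip(X_list, y):
        n, d = X.shape
        score = 0.0
        for i in range(n):
            # normalized histogram of the i-th sequence on the fine grid nu[i]
            counts, _ = np.histogram(X[i], bins=nu[i])
            score += np.dot(counts / d, v[i])
        loss += max(0.0, 1.0 - yi * score)   # hinge loss
    reg = 0.5 * lam * np.sum(v ** 2)         # quadratic SVM regularizer
    # L1 penalty on neighboring-weight differences (fused-lasso term)
    tv = gamma * np.sum(np.abs(np.diff(v, axis=1)))
    return reg + loss / m + tv
```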

### Estimation of PWL histograms

Given a training sample $${\mathcal {T}}=\{x^{i} \in {\mathbb {R}}\mid i=1,\ldots ,m\}$$ drawn from i.i.d. random variables with unknown distribution p(x), the goal is to find $${\hat{p}}(x)$$ approximating p(x) accurately based on the samples $${\mathcal {T}}$$. Assume we want to model the unknown p.d.f. by the PWL function

\begin{aligned} {\hat{p}}_\mathrm{pwl}(x,{\varvec{w}},{\varvec{\theta }}) = \big (1-\alpha (x,{\varvec{\theta }})\big )w_{k(x,{\varvec{\theta }})-1} + \alpha (x,{\varvec{\theta }})w_{k(x,{\varvec{\theta }})}, \end{aligned}
(23)

where $${\varvec{\theta }}\in {\mathbb {R}}^{B+1}$$ are the bin edges and $${\varvec{w}}\in {\mathbb {R}}_{+}^{B+1}$$ is a vector of non-negative weights selected such that $$\int \hat{p}_\mathrm{pwl}(x,{\varvec{w}},{\varvec{\theta }}) \mathrm{d}x=1$$. To learn the unknown parameters $$({\varvec{w}},{\varvec{\theta }})$$ from the sample $${\mathcal {T}}$$, we instantiate the framework proposed in Sect. 3 for the Maximum Likelihood method. Let the initial discretization $${\varvec{\nu }}\in {\mathbb {R}}^{D+1}$$ be equidistantly spaced between the minimal and the maximal value, i.e. $$\nu _{j}=\frac{j}{D}( \hbox {Max} - \hbox {Min})+ \hbox {Min}, \forall j\in \{0,\ldots ,D\}$$, where D is set to be sufficiently high. We substitute the negative log-likelihood

\begin{aligned} F^\mathrm{nll}_\mathrm{pwl}({\varvec{w}},{\varvec{\theta }})=-\sum _{i=1}^{m} \log {\hat{p}}_\mathrm{pwl}(x^{i},{\varvec{w}},{\varvec{\theta }}) \end{aligned}

to the general formulation (14) which yields the following convex problem

\begin{aligned} {\varvec{u}}^{*}= \displaystyle \mathop {\mathrm{argmin}}_{{\varvec{u}}\in {\mathbb {R}}^{D+1}} \bigg [F^\mathrm{nll}_\mathrm{pwl}({\varvec{u}},{\varvec{\nu }}) + \gamma \sum _{j=1}^{D-1} \big | u_{j} - \frac{1}{2}u_{j-1} -\frac{1}{2}u_{j+1} \big | \bigg ] \end{aligned}
(24a)

subject to

\begin{aligned} \displaystyle u_{0} + u_{D} + 2\sum _{i=1}^{D-1} u_{i} = \frac{2D}{\mathrm{Max} - \mathrm{Min}}\,,\qquad \quad \displaystyle u_{i} \ge 0 \,, \;\;\;\ i\in \{0,\ldots ,D\}\,, \end{aligned}
(24b)

where $$\gamma \in {\mathbb {R}}_{+}$$ is a constant controlling the number of bins. The introduced constraints (24b) ensure that $$\int _\mathrm{Min}^\mathrm{Max} {\hat{p}}_\mathrm{pwl}(x,{\varvec{u}},{\varvec{\nu }}) \mathrm{d}x=1$$ holds for any feasible $${\varvec{u}}$$. The solution $${\varvec{u}}^{*}$$ defines a PWL probability density model $${\hat{p}}_\mathrm{pwl}(x,{\varvec{u}}^{*},{\varvec{\nu }})$$. If necessary, the compressed parameters $$({\varvec{w}}^{*},{\varvec{\theta }}^{*})$$, defining the non-equidistant bins, can be recovered from $${\varvec{u}}^{*}$$ by the rounding methods described in Sect. 4.
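The constraint (24b) is precisely the trapezoidal-rule statement that the PWL density integrates to one over $$[\mathrm{Min}, \mathrm{Max}]$$. A small numerical check (function names are ours):

```python
import numpy as np

def pwl_density(u, nu):
    """PWL density (23) on equidistant edges nu with weights u (via np.interp)."""
    return lambda x: np.interp(x, nu, u)

def satisfies_24b(u, lo, hi):
    """Check the normalization constraint (24b) for D = len(u) - 1 bins."""
    D = len(u) - 1
    lhs = u[0] + u[-1] + 2.0 * np.sum(u[1:-1])
    return bool(np.isclose(lhs, 2.0 * D / (hi - lo)) and np.all(u >= 0))

# sanity check: the uniform density on [0, 4] is feasible and has unit mass
lo, hi, D = 0.0, 4.0, 8
nu = np.linspace(lo, hi, D + 1)
u = np.full(D + 1, 1.0 / (hi - lo))
x = np.linspace(lo, hi, 10001)
y = pwl_density(u, nu)(x)
mass = np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x))   # trapezoidal rule
```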

### PWL embedding for non-linear classification

The PWL embedding (Pele et al. 2013) is a simple yet efficient way to learn highly non-linear decision functions by a linear algorithm such as the Support Vector Machine. Let $${\mathcal {T}}=\{({\varvec{x}}^{1},y^{1}),\ldots ,({\varvec{x}}^{m},y^{m})\}\in ({\mathbb {R}}^{n}\times \{-1,+1\})^{m}$$ be a training set of input features and output binary labels. A linear classifier $$h({\varvec{x}},{\varvec{w}},{\varvec{\theta }})=\mathop {{\mathrm{sign}}}(f_\mathrm{pwl}({\varvec{x}},{\varvec{w}},{\varvec{\theta }}))$$ assigns the input $${\varvec{x}}=(x_{1},\ldots ,x_{n})^{T}$$ into classes based on the sign of the discriminant function

\begin{aligned} f_\mathrm{pwl}({\varvec{x}}, {\varvec{w}},{\varvec{\theta }}) = \sum _{i=1}^{n} \big ( w_{i,k(x_{i},{\varvec{\theta }}_{i})-1} \cdot (1- \alpha (x_{i},{\varvec{\theta }}_{i})) + w_{i,k(x_{i},{\varvec{\theta }}_{i})} \cdot \alpha (x_{i}, {\varvec{\theta }}_{i}) \big ) \end{aligned}
(25)

which is a sum of n PWL functions, each defined for a single input feature. The vector $${\varvec{\theta }}_{i}\in {\mathbb {R}}^{b_{i}+1}$$ contains the bin edges of the i-th feature and $${\varvec{\theta }}=({\varvec{\theta }}_{1}^{T},\ldots ,{\varvec{\theta }}_{n}^{T})^{T}\in {\mathbb {R}}^{n+B}$$ is their concatenation, where $$B=\sum _{i=1}^{n} b_{i}$$ denotes the total number of bins. The vector $${\varvec{w}}_{i}\in {\mathbb {R}}^{b_{i}+1}$$ contains the weights associated with the edges $${\varvec{\theta }}_{i}$$ and $${\varvec{w}}=({\varvec{w}}_{1}^{T},\ldots ,{\varvec{w}}_{n}^{T})\in {\mathbb {R}}^{B+n}$$ is their concatenation.
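Each per-feature term of (25) is ordinary linear interpolation of the weights over the bin edges, so NumPy's np.interp evaluates it directly. The data layout below is our own assumption:

```python
import numpy as np

def f_pwl(x, w, theta):
    """Discriminant (25): a sum of per-feature piece-wise linear responses.

    x     : (n,) feature vector
    w     : list of n arrays, w[i] holds the b_i + 1 edge weights
    theta : list of n arrays, theta[i] holds the b_i + 1 bin edges
    """
    # np.interp performs exactly the interpolation
    # (1 - alpha) * w_{k-1} + alpha * w_k between the two edges enclosing x_i
    return sum(np.interp(xi, ti, wi) for xi, ti, wi in zip(x, theta, w))
```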

We can learn the weights $${\varvec{w}}$$ as well as the discretization $${\varvec{\theta }}$$ using the SVM algorithm adapted by the framework from Sect. 3. First, for each input feature we define the initial discretization $${\varvec{\nu }}_{i}= (\nu _{i,0},\dots ,\nu _{i,D})^{T}\in {\mathbb {R}}^{D+1}$$, e.g. as before by setting $$\nu _{i,j} = \frac{j}{D}(\hbox {Max}_{i} - \hbox {Min}_{i})+ \hbox {Min}_{i}$$, where $$\hbox {Min}_{i}$$ and $$\hbox {Max}_{i}$$ are the minimal and the maximal value of the i-th feature and D is the maximal number of bins per feature. Let $${\varvec{\nu }}=({\varvec{\nu }}_{1}^{T},\ldots ,{\varvec{\nu }}_{n}^{T})^{T}\in {\mathbb {R}}^{n(D+1)}$$ be the concatenation of the initial discretizations of all features. The SVM objective function

\begin{aligned} F^\mathrm{svm}_\mathrm{pwl}({\varvec{w}},{\varvec{\theta }},\lambda ) = \frac{\lambda }{2}\Vert {\varvec{w}}\Vert ^{2} + \frac{1}{m}\sum _{i=1}^{m} \max \big \{0,1-y_{i} f_\mathrm{pwl}({\varvec{x}}^{i},{\varvec{w}},{\varvec{\theta }}) \big \} \end{aligned}

is then combined with the $$L_{1}$$-norm approximation of the function

\begin{aligned} e_{n}({\varvec{u}}) = \sum _{i=1}^{n} \sum _{j=1}^{D-1} \llbracket u_{i,j} \ne \frac{1}{2}(u_{i,j-1}+u_{i,j+1}) \rrbracket . \end{aligned}

The function $$e_{n}({\varvec{u}})$$ counts the weights that cannot be expressed as the average of their neighbors. Analogously to the general formulation (14), we obtain the following convex program

\begin{aligned} {\varvec{u}}^{*} = \mathop {\mathrm{argmin}}_{{\varvec{u}}\in {\mathbb {R}}^{n (D+1)}} \bigg [F^\mathrm{svm}_\mathrm{pwl}({\varvec{u}},{\varvec{\nu }},\lambda ) + \gamma \sum _{i=1}^{n}\sum _{j=1}^{D-1}\big | u_{i,j}-\frac{1}{2}u_{i,j-1} - \frac{1}{2} u_{i,j+1} \big | \bigg ]\,. \end{aligned}
(26)

The constant $$\lambda \in {\mathbb {R}}_{+}$$ controls the empirical error as in the standard SVM. The constant $$\gamma \in {\mathbb {R}}_{+}$$ controls the number of emerging bins, and thus the complexity (or smoothness) of the decision function. The uncompressed parameters $${\varvec{u}}^{*}$$ can be converted to the compressed ones $$({\varvec{w}}^{*},{\varvec{\theta }}^{*})$$ by the projection methods from Sect. 4 which are applied to each pair $$({\varvec{u}}^{*}_{i},{\varvec{\nu }}_{i}), i\in \{1,\ldots ,n\}$$, separately.
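The $$L_{1}$$ term in (26) vanishes exactly when three consecutive weights are collinear, which is why a large $$\gamma$$ merges bins into long linear segments. A sketch for a single feature (function names are ours):

```python
import numpy as np

def second_diff_penalty(u):
    """L1 regularizer of (26) for one feature:
    sum_j |u_j - (u_{j-1} + u_{j+1}) / 2| over interior indices j."""
    return np.sum(np.abs(u[1:-1] - 0.5 * (u[:-2] + u[2:])))

grid = np.linspace(0.0, 1.0, 11)
linear = 3.0 * grid - 1.0       # weights on a single line: penalty is (near) zero
kinked = np.abs(grid - 0.5)     # one kink at 0.5: exactly one non-zero term
```

For n features the penalty is simply applied row-wise and summed, matching the double sum in (26).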

## Limitations of the convex relaxation

In this section we discuss the limitations of the proposed convex relaxation. We concentrate on the PWC learning which leads to solving a convex minimization problem (8). The objective function (8) is a weighted sum of the original task objective $$F_\mathrm{pwc}({\varvec{v}},{\varvec{\nu }})$$ and a regularization term $$\sum _{i=1}^{D-1} | v_{i} - v_{i+1} |$$ that was introduced to control the number of different neighboring weights. However, the regularization term can also have an undesired influence on the solution as shown below.

Every candidate solution $${\varvec{v}}\in {\mathbb {R}}^{D}$$ can be factorized into monotonic subsequences whose endpoints are local extrema of the sequence $$(v_{1},v_{2},\ldots ,v_{D})$$. The value of the regularization term is determined solely by the weights at these endpoints. Indeed, let $$v_{k}, v_{k + 1}, \ldots , v_{k + l}$$ be one of the monotonic subsequences and let it be an increasing one, that is, $$v_{k} \le v_{k+1} \le \ldots \le v_{k+l}$$. In this case, we have

\begin{aligned} \sum _{i=k}^{k+l-1} | v_i - v_{i+1} | = \sum _{i=k}^{k+l-1} (v_{i+1} - v_i ) = v_{k+l} - v_{k}. \end{aligned}

The same argument holds for decreasing subsequences, $$v_{k} \ge v_{k+1} \ge \ldots \ge v_{k+l}$$, in which case $$\sum _{i=k+1}^{k+l} | v_{i-1} - v_{i} | = v_{k} - v_{k+l}$$. This observation has two implications. First, the values of the intermediate weights, that is, those between the local extrema, have no influence on the regularization term. Second, the only way to decrease the value of the regularization term is to either decrease a local maximum or increase a local minimum. Consequently, the regularization term tends to push the local extrema towards each other and to make the resulting sequence flat.
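This telescoping property is easy to verify numerically: the penalty depends only on the local extrema, so perturbing an intermediate weight anywhere within its monotone run leaves the penalty unchanged (a small check; names are ours):

```python
import numpy as np

def total_variation(v):
    """The regularization term of (8): sum_i |v_i - v_{i+1}|."""
    return np.sum(np.abs(np.diff(v)))

# local extrema of this sequence are 0.0 (min), 4.0 (max), 1.0 (min),
# so the penalty is (4 - 0) + (4 - 1) = 7 regardless of the interior weights
v = np.array([0.0, 1.0, 3.0, 4.0, 2.0, 1.0])

# move an intermediate weight while keeping it between its neighbors:
# the monotone run is preserved and the penalty does not change
u = v.copy()
u[2] = 2.5
```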

Estimation of PWC histograms by the ML method is an example application in which the described behavior is not favorable. In principle, we can use the same method as for estimation of the PWL histograms described in Sect. 5.2. The only difference is that for PWC histograms we use the regularization term $$\gamma \sum _{i=1}^{D-1} | v_{i} - v_{i+1} |$$ and we have to adjust the normalization constraint (24b) correspondingly. We applied the method to estimate a PWC histogram from $$10^6$$ examples randomly drawn from a known bimodal distribution (the large sample size was used to suppress the estimation error). We estimated the PWC histogram with the regularization constant set to $$\gamma \in \{0.1,1,10,100\}$$. The effect of increasing $$\gamma$$ is illustrated in Fig. 3. It is seen that the PWC regularization term favors uniform distributions, which may not be desirable. This problem can be addressed by using PWL histograms instead, as shown in Sect. 7.2. In other applications, like learning linear classifiers on top of PWC histograms, the preference for uniform weights can be beneficial, as shown in Sect. 7.1.

## Experiments

This section provides empirical evaluation of the algorithms proposed in Sect. 5. Section 7.1 describes learning of malware detector representing the network communication by PWC histograms. Section 7.2 evaluates the proposed PWL density estimator on synthetic data. Section 7.3 evaluates the proposed algorithm for learning PWL data embedding on classification benchmarks selected from the UCI repository.

### Malware detection by classification of histograms

The proposed approach was applied in the network security domain to classify unseen malware samples from HTTP traffic. The data was obtained from several months (January–July 2015) of real network traffic of companies of various sizes in the form of proxy log records. The logs contain anonymized HTTP/HTTPS web requests, where one request represents one connection, defined as a group of packets from a single user and source port with a single server IP address, port, and protocol. Each request consists of the following fields: user name, source IP address, destination IP address, source port, destination port, protocol, number of bytes transferred from client to server and from server to client, flow duration, timestamp, user agent, URL, referer, MIME-Type, and HTTP status. The most informative field is the URL, which can be decomposed further into several parts, e.g. protocol, hostname, path, file name, or query.

We grouped all connections into bags, where one bag contains all connections of the same user going to the same domain. We then extracted 115 feature values for each connection (see Table 1), computed a histogram representation of each bag and used the histograms as input to a linear two-class classifier (21), as described in Bartos and Sofka (2015).

We compare two methods learning the classifier parameters:

1. The linear SVM using histograms with equidistantly spaced bins. The number of bins per feature varied over $$\{8,12,\ldots ,256\}$$.

2. The proposed algorithm learning non-equidistant bins from examples. The uncompressed weights $${\varvec{v}}^{*}$$ are obtained by solving (22) with the initial discretization $${\varvec{\nu }}$$ set to split each feature equidistantly into $$D=256$$ bins. The constant $$\gamma$$, which controls the number of bins, varied from $$10^{-1}$$ to $$10^{-6}$$. The compressed weights $$({\varvec{w}}^{*},{\varvec{\theta }}^{*})$$ were obtained by the rounding procedure (15). Finally, the linear SVM was re-trained on the learned bins $${\varvec{\theta }}^*$$.

The optimal value of the SVM constant $$\lambda$$ used by both compared methods was selected from $$\{10^{-1},\ldots ,10^{-5}\}$$ based on the validation error.

The data consists of 7028 positive (malware) and 44,338 negative (legitimate) samples. The positive samples are from 32 classes representing 32 different malware types. The samples were split into training, validation and test sets in a ratio of 3/1/1. We ensured that the same malware class never appears simultaneously in the training, validation and test sets. We report the average performance and its standard deviation computed over five test splits.

We analyzed the effect of the number of bins on the precision and recall of the linear SVM classifier. Figure 4 shows precision-recall curves (PRCs) for the representation with different numbers of equidistant bins. Figure 5 shows PRCs of the representation with different numbers of non-equidistant bins whose positions are learned by the proposed method. It is seen that the representation with non-equidistant bins achieves higher precision using a substantially smaller number of bins. In Fig. 6 we show the recall as a function of the number of bins in order to emphasize the difference between the equidistant and the learned non-equidistant representation. The precision was set at $$95\%$$, a common operating point of the detector used in real-life deployment. The figures are numerically summarized in Table 2.

Figure 7 illustrates the weights $${\varvec{w}}$$ of the linear SVM classifier trained with three different representations. First, we used histograms with 256 equidistant bins (black line), resulting in a large number of weights. Second, we learned the weights and bins simultaneously by the proposed algorithm (red line), which significantly decreased the number of bins without sacrificing efficacy. Finally, we applied the second step to define new bins by rounding the uncompressed solution and then re-trained a new linear SVM classifier on top of them (blue line). The discretization of the second and the third method is thus similar, but the third method retrained the classifier and boosted the weights to further maximize the separability of the positive and negative samples.

### Non-parametric distribution estimation

We evaluated the algorithm finding a PWL approximation of an unknown distribution, described in Sect. 5.2. The samples were drawn from a mixture of two Gaussians, $$p(x) = 0.4 \cdot {\mathcal {N}}(x;-2,1) + 0.6\cdot {\mathcal {N}}(x;2,0.5)$$, with means and standard deviations $$(-2,1)$$ and $$(2,0.5)$$, respectively. We compared three methods:

1. The proposed algorithm estimating a non-equidistant PWL histogram. The uncompressed weights $${\varvec{u}}^{*}$$ were obtained by solving (24). The initial $$D=100$$ bins $${\varvec{\nu }}$$ were placed equidistantly between the minimal and the maximal value in the training set. The optimal value of the constant $$\gamma$$ was selected from $$\{10,\ldots ,10{,}000\}$$ based on the log-likelihood evaluated on the validation set. The compressed parameters $$({\varvec{w}}^*,{\varvec{\theta }}^*)$$ of the PWL histogram (23) were computed from $${\varvec{u}}^*$$ by the rounding procedure (20) with the precision parameter $$\varepsilon =0.001$$.

2. The PWL histogram with bin edges $${\varvec{\theta }}$$ placed equidistantly between the minimal and the maximal value in the training set. The weights $${\varvec{w}}$$ were found by maximizing the likelihood function, which is equivalent to solving (24) with $$\gamma =0$$. The optimal number of bins was selected from $$\{5,10,\ldots , 100\}$$ based on the log-likelihood evaluated on the validation set.

3. The standard PWC histogram with equidistant bins whose number was selected from $$\{5,10,\ldots , 100\}$$ based on the log-likelihood evaluated on the validation set. The bin heights were found by the ML method.

We used the distribution p(x) to generate training and validation sets whose size varied from 100 to 10,000. For each method we recorded the optimal number of bins and the KL-divergence between the estimated distribution and the ground-truth distribution p(x). The results are averages and standard deviations computed over ten generated data sets.

Figure 8a shows the KL-divergence and Fig. 8b the number of bins as a function of the training set size. As expected, the equidistant PWC histogram provides the least precise (highest KL-divergence) and the most complex (highest number of bins) model. We also see that the PWL model with equidistantly spaced bins is as accurate as the model with non-equidistant bins learned from examples; however, the non-equidistant model is consistently more compact, i.e. it uses fewer bins.

Figure 9 shows examples of PWC histograms and PWL histograms with learned non-equidistant bins. It is seen that the non-equidistant PWL histogram can closely approximate the ground-truth model even from small training sets.

### PWL embedding for non-linear classification

In this section we evaluate the algorithm for learning the PWL data embedding proposed in Sect. 5.3. We learned a two-class classifier $$h({\varvec{x}},{\varvec{w}},{\varvec{\theta }})=\mathop {{\mathrm{sign}}}(f_\mathrm{pwl}({\varvec{x}},{\varvec{w}},{\varvec{\theta }}))$$ with the discriminant function $$f_\mathrm{pwl}({\varvec{x}},{\varvec{w}},{\varvec{\theta }})$$ defined by (25) for a set of classification problems selected from the UCI repository (Lichman 2013) which are summarized in Table 3. We evaluated three methods:

1. The proposed algorithm learning $${\varvec{\theta }}$$ and $${\varvec{w}}$$ simultaneously. The uncompressed parameters $${\varvec{u}}^*$$ were found by solving (26) with the initial discretization $${\varvec{\nu }}$$ equidistantly splitting each feature into $$D=100$$ bins. The compressed parameters $$({\varvec{w}}^*,{\varvec{\theta }}^*)$$ were computed from $$({\varvec{u}}^*,{\varvec{\nu }})$$ by the rounding procedure (20) with the precision parameter $$\varepsilon =0.1$$. Finally, a linear SVM was re-trained on the learned bins $${\varvec{\theta }}^*$$. The constant $$\gamma$$, which controls the number of bins, was varied from 0.1 to 0.0001.

2. The parameters $${\varvec{w}}$$ were trained by the linear SVM on top of equidistantly constructed bins $${\varvec{\theta }}$$. The number of bins per feature varied from 5 to 20.

3. The method of Pele et al. (2013), which was shown to outperform the non-linear SVM with many state-of-the-art kernels and data embeddings. Namely, we re-implemented the “PL1 algorithm”, applying the PWL embedding to individual features as we do. The non-equidistant bins were found for each feature independently by constructing the edges as the mid-points between the cluster centers obtained from the k-means algorithm. The number of bins was varied from 3 to 20.

The optimal value of the SVM constant $$\lambda$$ used by all three methods was selected from $$\{0.1,\ldots ,0.00001\}$$ based on the validation error.

We used the same evaluation protocol as in Pele et al. (2013). Each data set was ten times randomly split into training, validation and test part in the ratio 60/20/20. The reported results are averages and the standard deviations computed on the test part over the ten splits.

Figure 10 shows the test classification accuracy of the compared methods as a function of the number of bins. The accuracy for the best choice of bins and the corresponding number of bins is summarized in Table 3. The baseline PWL embedding with equidistant bins provides slightly but consistently worse accuracy compared to the other two methods which learn the non-equidistant bins. The proposed method yields comparable or slightly better results than the approach of Pele et al. (2013) both in terms of the classification accuracy and the complexity of the embedding (number of bins). However, the proposed method relies solely on solving convex optimization problems unlike the method of Pele et al. (2013) which involves a highly non-convex clustering problem.

### The computational time

In this section we provide an empirical estimate of the computational time required to learn a discretization by the proposed framework. As a benchmark we use the task of learning the PWL embedding for non-linear classification described in Sect. 7.3. In this case, learning leads to a convex optimization problem (26) with two hyper-parameters. First, the parameter $$\lambda$$ controls the quadratic regularization, as in the standard SVM formulation. Second, the parameter $$\gamma$$ implicitly controls the number of bins of the resulting feature discretization. The time required to solve the convex problem thus depends on the hyper-parameters $$\lambda$$ and $$\gamma$$ and on the size of the problem, specified by the number of examples m and the number of features n. We empirically measured the dependency of the computational time on these four variables as described below.

Besides the four variables, the runtime obviously depends on the particular optimization solver. There are many optimization methods applicable to the convex problem (26). For example, the problem (26) can be expressed as an equivalent quadratic program (QP) and solved by any QP solver. In this work we used the Optimized Cutting Plane Algorithm for large-scale risk minimization (OCA) (Franc and Sonnenburg 2009). The OCA is a general-purpose and easy-to-implement solver for minimization of convex functions containing a quadratic regularization term, like the problem (26). The empirical study presented below uses this particular solver, which is sufficiently efficient for the problem sizes considered in our experiments. Other solvers might be even more efficient.

We used a Matlab implementation of the OCA solver. The experiments were performed on a Linux machine with 32 CPUs (AMD Opteron 1.4GHz) and 256 GB RAM. We stopped the OCA solver when the objective value reached a level not more than $$1\%$$ above the optimal value. Such precision is likely unnecessarily high for practical purposes, but it removes the influence of the optimization error on the resulting accuracy.

Figure 11 shows the average runtime required by the OCA solver as a function of $$\lambda$$, $$\gamma$$, the number of examples m and the number of features n. The runtime is measured on a subset of the classification problems listed in Table 3. For $$\lambda$$ and $$\gamma$$ we use problems with more than 1000 examples (6 problems in total). The dependency on the number of examples and the number of features is measured on the “miniboo” and “musk” datasets, respectively.

We found that the runtime scales gracefully with respect to the quadratic regularization parameter $$\lambda$$; specifically, it grows approximately as $${\mathcal {O}}(\lambda ^{-0.8})$$. On the other hand, the runtime grows much faster, approximately as $${\mathcal {O}}(e^{20\sqrt{\gamma }})$$, with increasing $$\gamma$$ controlling the number of bins. This means that the lower the number of bins, the higher the computational time. A low number of bins is enforced by increasing the weight of the $$L_1$$-regularization term in the objective function, which brings the task closer to a linear program. The dominant linear term impairs the regularization effect of the quadratic term, which causes the “zig-zag” behavior of the cutting plane solver, leading to a higher number of iterations. The dependency of the running time on the number of training examples is linear, which is consistent with the theoretical upper bound on the number of iterations proved in Franc and Sonnenburg (2009). Finally, the dependency on the number of features is approximately quadratic.

In absolute numbers, the longest time, around 160 min, was required for training a single model on “miniboo” dataset (130,065 examples, 50 features) with the highest value of $$\lambda =0.1$$.

## Conclusions

We proposed a generic framework which allows a wide class of convex learning algorithms to be modified so that they can learn the parameters of piece-wise constant (PWC) and piece-wise linear (PWL) functions from examples. The learning objective of the original algorithm is augmented by a convex term which enforces compact bins to emerge from an initial fine discretization. In contrast to existing methods, the proposed approach learns the discretization and the parameters of the decision function simultaneously. In addition, learning is converted into a convex problem which is solvable efficiently by global methods. We instantiated the proposed framework for three problems, namely, learning a PWC histogram representation of sequential data, estimation of a PWL probability density function and learning a PWL data embedding for non-linear classification. The proposed algorithms were evaluated on synthetic data and standard public benchmarks, and applied to malware detection in network traffic data. It was demonstrated that the proposed convex algorithms yield models with fewer parameters and comparable or better accuracy than the existing methods. The main disadvantage of the proposed method, compared to heuristic local methods like Pele et al. (2013), is the higher computational time required to solve the convex problem.

1. Note that the PWC function has B weights associated with the bins while the PWL function has $$B+1$$ weights associated with the bin edges.

## References

1. Bartos, K., & Sofka, M. (2015). Robust representation for domain adaptation in network security. In Proceedings of ECML/PKDD, volume 3 (pp. 116–132).

2. Bhatt, R., & Dhall, A. (2010). Skin segmentation dataset. UCI Machine Learning Repository. https://archive.ics.uci.edu/ml/datasets/skin+segmentation

3. Boullé, M. (2006). MODL: A Bayes optimal discretization method for continuous attributes. Machine Learning, 65(1), 131–165.

4. Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge: Cambridge University Press.

5. Candes, E., Romberg, J., & Tao, T. (2006). Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8), 1207–1223.

6. Candes, E., & Tao, T. (2005). Decoding by linear programming. IEEE Transactions on Information Theory, 51(12), 4203–4215.

7. Chapelle, O., Haffner, P., & Vapnik, V. N. (1999). Support vector machines for histogram-based image classification. IEEE Transactions on Neural Networks, 10(5), 1055–1064.

8. Cios, K., Pedrycz, W., Swiniarski, R., & Kurgan, L. (2007). Data Mining: A Knowledge Discovery Approach. Berlin: Springer.

9. Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In Proceedings of computer vision and pattern recognition, volume 1 (pp. 886–893).

10. Donoho, D. (2006). Compressed sensing. IEEE Transactions on Information Theory, 52(4), 1289–1306.

11. Dougherty, J., Kohavi, R., & Sahami, M. (1995). Supervised and unsupervised discretization of continuous features. In Proceedings of international conference on machine learning, Morgan Kaufmann, (pp. 194–202).

12. Elomaa, T., & Rousu, J. (1999). General and efficient multisplitting of numerical attributes. Machine Learning, 36(3), 201–244.

13. Fayyad, U. M., & Irani, K. B. (1992). On the handling of continuous-valued attributes in decision tree generation. Machine Learning, 8(1), 87–102.

14. Fayyad, U. M., & Irani, K. B. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of international joint conference on artificial intelligence, (pp. 1022–1029).

15. Franc, V., & Sonnenburg, S. (2009). Optimized cutting plane algorithm for large-scale risk minimization. Journal of Machine Learning Research, 10, 2157–2232.

16. Friedman, N., & Goldszmidt, M. (1996). Discretizing continuous attributes while learning Bayesian networks. In Proceedings of international conference on machine learning (pp. 157–165).

17. García, S., Luengo, J., Sáez, J. A., López, V., & Herrera, F. (2013). A survey of discretization techniques: Taxonomy and empirical analysis in supervised learning. IEEE Transactions on Knowledge and Data Engineering, 25(4), 734–750.

18. Hue, C., & Boullé, M. (2007). A new probabilistic approach in rank regression with optimal bayesian partitioning. Journal of Machine Learning Research, 8, 2727–2754.

19. Johnson, B. A., Tateishi, R., & Hoan, N. T. (2013). A hybrid pansharpening approach and multiscale object-based image analysis for mapping diseased pine and oak trees. International Journal of Remote Sensing, 34(20), 6969–6982.

20. Kerber, R. (1992). Chimerge: Discretization of numeric attributes. In Proceedings of the tenth national conference on artificial intelligence, AAAI’92, (pp. 123–128).

21. Lichman, M. (2013). UCI machine learning repository. Irvine, CA: University of California, School of Information and Computer Science.

22. Liu, H., Hussain, F., Tan, C. L., & Dash, M. (2002). Discretization: An enabling technique. Data Mining and Knowledge Discovery, 6(4), 393–423.

23. Pele, O., Taskar, B., Globerson, A., & Werman, M. (2013). The pairwise piecewise-linear embedding for efficient non-linear classification. In Proceedings of the international conference on machine learning, (pp. 205–213).

24. Rao, C. (2005). Data mining and data visualization. In C. R. Rao, E. J. Wegman, & J. L. Solka (Eds.), Handbook of Statistics, volume 24. New York: Elsevier.

25. Silverman, B.W. (1986). Density Estimation for Statistics and Data Analysis. London, New York: Chapman & Hall.

26. Yang, Y., & Webb, G. I. (2008). Discretization for naive-Bayes learning: Managing discretization bias and variance. Machine Learning, 74(1), 39–74.

27. Yeh, I.-C., Yang, K.-J., & Ting, T.-M. (2008). Knowledge discovery on RFM model using bernoulli sequence. Expert Systems with Applications, 36(3), 5866–5871.

## Acknowledgements

VF was supported by Czech Science Foundation Grant 16-05872S. OF was supported by the internal CTU Funding SGS17/185/OHK3/3T/13.

## Author information

Authors

### Corresponding author

Correspondence to Vojtech Franc.