# Learning data discretization via convex optimization


## Abstract

Discretization of continuous input functions into piecewise constant or piecewise linear approximations is needed in many mathematical modeling problems. It has been shown that choosing the length of the piecewise segments adaptively based on data samples leads to improved accuracy of the subsequent processing such as classification. Traditional approaches are often tied to a particular classification model which results in local greedy optimization of a criterion function. This paper proposes a technique for learning the discretization parameters along with the parameters of a decision function in a convex optimization of the true objective. The general formulation is applicable to a wide range of learning problems. Empirical evaluation demonstrates that the proposed convex algorithms yield models with fewer parameters and comparable or better accuracy than the existing methods.

## Keywords

Piecewise constant embedding, Piecewise linear embedding, Parameter discretization, Convex optimization, Classification, Histograms

## 1 Introduction

Many mathematical modeling problems involve discretization of continuous input functions to convert them into their discrete counterparts (García et al. 2013; Liu et al. 2002). The goal of the discretization is to reduce the number of values a continuous attribute assumes by grouping them into a number of predefined bins (Cios et al. 2007, Chapter 8). Discretization reduces the complexity of the data and of the subsequently generated model, and it is a necessary preprocessing step for data mining and machine learning algorithms that operate on discrete attribute spaces (Cios et al. 2007, Chapter 8). The discrete functions are then piece-wise constant (0th order) or piece-wise linear (1st order) approximations of the continuous input functions.

The discretization is useful, for example, when estimating probability density functions. The density functions are typically represented as multi-dimensional histograms in the discrete domain (Silverman 1986, Chapter 2). Although histograms can asymptotically approximate any continuous density, the accuracy of such approximation depends on the histogram bin size (Rao et al. 2005, Chapter 2). In another example, the domain of input features is discretized in order to apply a linear decision function as in logistic regression or Support Vector Machine classification (Chapelle et al. 1999). The input feature discretization is a simple method to learn non-linear decision functions where algorithms essentially learn linearly parameterized rules. Clearly, the accuracy of the decision functions directly depends on the discretization of the feature values. One common difficulty in the discretization process is the choice of the discretization step which then indicates the size of the piece-wise segments, e.g. histogram bins, or parameters of the feature representation quantization. The parameters of the decision function are typically estimated in a separate subsequent process (Dougherty et al. 1995; Pele et al. 2013).

Most algorithms employ the simplest unsupervised discretization by choosing a fixed number of bins with the same size (*equal interval width*) or the same number of samples in each bin (*equal frequency interval*) (Dougherty et al. 1995). The total number of bins is tuned for a specific application balancing two opposite considerations. Wider bins reduce the effects of noise in regions where the number of input samples is low. On the other hand, narrower bins result in more accurate function approximation in regions where there are many samples and thus the effects of the noise are suppressed. Equal frequency intervals have been extended by using Shannon entropy over discretized space to adjust the bin boundaries (Dougherty et al. 1995). Evidently, using varying bin sizes during discretization can be beneficial.
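The two unsupervised schemes can be sketched in a few lines of Python. This is only an illustration: the function names and the toy sample are assumptions made here, not part of the cited methods.

```python
def equal_width_edges(samples, num_bins):
    """Return num_bins + 1 edges that split [min, max] into bins of equal width."""
    lo, hi = min(samples), max(samples)
    width = (hi - lo) / num_bins
    return [lo + j * width for j in range(num_bins + 1)]


def equal_frequency_edges(samples, num_bins):
    """Return num_bins + 1 edges so that each bin holds roughly the same number of samples."""
    ordered = sorted(samples)
    m = len(ordered)
    # Interior edges sit at evenly spaced ranks of the sorted sample.
    inner = [ordered[(j * m) // num_bins] for j in range(1, num_bins)]
    return [ordered[0]] + inner + [ordered[-1]]


# Toy sample: seven values crowded near zero and one outlier.
samples = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 8.0]
width_edges = equal_width_edges(samples, 4)     # [0.0, 2.0, 4.0, 6.0, 8.0]
freq_edges = equal_frequency_edges(samples, 4)  # [0.0, 0.2, 0.4, 0.6, 8.0]
```

With equal widths, seven of the eight samples land in the first bin, whereas the equal-frequency edges follow the sample density. This is the kind of imbalance that motivates varying bin sizes.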

Supervised discretization algorithms use sample labels to improve the binning (Kerber 1992; Dougherty et al. 1995), often in an optimization step when learning a classifier (Boullé 2006; Fayyad and Irani 1992; Friedman and Goldszmidt 1996). One widely adopted approach is to start with a large number of bins and then merge neighboring bins while optimizing a criterion function (Boullé 2006). In Boullé (2006), the discretization is based on a search algorithm that finds Bayes-optimal interval splits using a prior distribution on a model space. In Fayyad and Irani (1992), the recursive splitting is based on an information entropy minimization heuristic. The algorithm is extended with a Minimum Description Length stopping criterion in Fayyad and Irani (1993) and embedded into a dynamic programming algorithm in Elomaa and Rousu (1999). These techniques introduce supervision for finding the optimal discretization but are tied to a particular classification model (Naïve Bayes, decision trees, or Bayesian Networks (Friedman and Goldszmidt 1996; Yang and Webb 2008)). As a result, they rely on local greedy optimization of the criterion function (Hue and Boullé 2007).

This paper proposes an algorithm for learning piece-wise constant or piece-wise linear embeddings from data samples along with the parameters of a decision function. Similarly to several previous techniques, the initial fine-grained discretization with many histogram bins is adjusted by optimizing a criterion function. The paper makes three main contributions. First, when training a decision function, the algorithm optimizes the true objective (or its surrogate) which includes the discretization parameters and the parameters of the decision function. This is in contrast to previous methods that rely on two separate steps, discretization and classifier learning, which can deviate from the true objective (Pele et al. 2013). Second, the parameter learning is transformed into a convex optimization problem that can be solved effectively. Other techniques do not provide a global solution and resort to a greedy strategy, where the features are processed sequentially (Hue and Boullé 2007). Third, our formulation is general: the piece-wise embeddings are used when training a linear decision function, which makes the approach applicable to a wide range of linear models. Other methods are specific to a particular classification model (Boullé 2006; Fayyad and Irani 1992; Friedman and Goldszmidt 1996; Yang and Webb 2008).

Our experiments demonstrate that the learned discretization is effective when applied to various data representation problems. The first set of results combines the estimation of the malware traffic representation parameters with learning a linear decision function in a joint convex optimization problem. The learned linear embedding needs much lower dimensionality than equidistant discretization (by a factor of three on average) to achieve the same or better malware detection accuracy level. The second set of results shows a piece-wise linear approximation of a probability density function using non-equidistant bins when estimating the density from data samples. The proposed algorithm achieves lower KL-divergence between the estimate and the ground truth than histograms with equidistant bins. Comparison to the previously published piece-wise linear embedding for non-linear classification (Pele et al. 2013) shows higher accuracy of the proposed technique on a number of datasets from the UCI repository. These encouraging results show the promise of the technique which could be extended to other linear embedding and classification problems.

## 2 Learning piece-wise constant functions

*B* is the number of bins, \({\varvec{\theta }}=(\theta _0,\ldots ,\theta _B)^T\in {\mathbb {R}}^{B+1}\) is a vector defining the bin edges, \({\varvec{w}}=(w_1,\ldots ,w_B)^T\in {\mathbb {R}}^B\) is a vector of weights, each associated with a single bin, and the function

*x* falls into. For notational convenience, we omit the additive scalar bias \(w_0\) in the definition (1), which does not affect the discussion that follows. We denote the inclusive bracket by '[' and the exclusive bracket by ')'. The operator \(\llbracket A\rrbracket \) evaluates to 1 if the logical statement *A* is true and to zero otherwise.
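Definition (1) is not reproduced above, so the following Python sketch rests on an assumption: the PWC function returns the weight \(w_k\) of the bin \([\theta _{k-1},\theta _k)\) that contains *x*. The function names are illustrative only.

```python
from bisect import bisect_right


def bin_index(x, theta):
    """1-based index k of the bin [theta[k-1], theta[k]) that contains x."""
    if not (theta[0] <= x < theta[-1]):
        raise ValueError("x lies outside [theta_0, theta_B)")
    return bisect_right(theta, x)


def f_pwc(x, w, theta):
    """Assumed form of the PWC function: the weight of the bin x falls into."""
    return w[bin_index(x, theta) - 1]


def f_pwc_indicator(x, w, theta):
    """The same value written with the bracket operator [[A]] as a sum over all bins."""
    return sum(w[i] * (theta[i] <= x < theta[i + 1]) for i in range(len(w)))


theta = [0.0, 1.0, 2.0, 4.0]   # B + 1 = 4 edges
w = [5.0, -1.0, 2.0]           # B = 3 weights
value = f_pwc(1.0, w, theta)   # the left edge is inclusive, so x = 1.0 belongs to bin 2
```

The binary search makes a single evaluation logarithmic in *B*, while the indicator form mirrors the notation of the text.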

*m* training inputs \({\mathcal {T}}=\{x_{i} \in {\mathbb {R}}\mid i=1,\ldots ,m\}\) explains the data. For example, *g* can be the empirical risk or the likelihood of *f* evaluated on \({\mathcal {T}}\). Assuming the bin edges \({\varvec{\theta }}\) are fixed, the weights \({\varvec{w}}\) of the PWC function (1) can be learned by solving the minimization problem

*g* and \(f_{\mathrm{pwc}}\) which is linear in \({\varvec{w}}\) (Boyd and Vandenberghe 2004).

*x* is inside the interval \([\mathrm{Min}, \mathrm{Max})\). The bin edges are constructed for different values of *B* and the optimal setting is typically tuned on validation examples. This procedure involves minimization of \(F_\mathrm{pwc}({\varvec{w}},{\varvec{\theta }})\) for all proposal discretizations \({\varvec{\theta }}\). In principle, one could optimize the width of individual bins as well; however, this would generate exponentially many proposal discretizations, making this naive approach intractable.

*B* bins contains all vectors \({\varvec{\theta }}\) satisfying:

## 3 Learning piece-wise linear functions

*B* is the number of bins and \({\varvec{w}}\in {\mathbb {R}}^{B+1}\) is the vector of weights.^{1} The function \(\alpha :{\mathbb {R}}\times {\mathbb {R}}^{B+1}\rightarrow [0,1]\), defined as

*x* and the right edge of the \(k(x,{\varvec{\theta }})\)-th bin.
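The defining equations of the PWL function are not reproduced above. The sketch below therefore assumes the usual linear interpolation between the weights attached to the two edges of the bin containing *x*, with \(\alpha \) taken as the distance from *x* to the right edge divided by the bin width. Both the formula and the function names are assumptions made for illustration.

```python
from bisect import bisect_right


def alpha(x, theta):
    """Assumed normalized distance between x and the right edge of its bin, in (0, 1]."""
    k = bisect_right(theta, x)  # x lies in [theta[k-1], theta[k])
    return (theta[k] - x) / (theta[k] - theta[k - 1])


def f_pwl(x, w, theta):
    """Linear interpolation between the weights of the two edges enclosing x."""
    if not (theta[0] <= x < theta[-1]):
        raise ValueError("x lies outside [theta_0, theta_B)")
    k = bisect_right(theta, x)
    a = alpha(x, theta)
    return a * w[k - 1] + (1.0 - a) * w[k]


theta = [0.0, 1.0, 3.0]   # B = 2 bins, B + 1 = 3 edges
w = [0.0, 2.0, 6.0]       # one weight per edge
mid_first = f_pwl(0.5, w, theta)    # halfway between w[0] and w[1]
mid_second = f_pwl(2.0, w, theta)   # halfway between w[1] and w[2]
```

Under this assumption the function passes through \(w_k\) at every edge \(\theta _k\), which is why the PWL parametrization needs \(B+1\) weights.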

## 4 Rounding of piece-wise functions

The uncompressed parametrization \(({\varvec{v}},{\varvec{\nu }})\) of the PWC function [e.g. found by the proposed algorithm (8)] can be converted to the compressed one \(({\varvec{w}},{\varvec{\theta }})\) by splitting \({\varvec{v}}\) into sub-vectors of equal weights, i.e. \({\varvec{v}}=({\varvec{v}}_{1}^{T}, {\varvec{v}}_{2}^{T},\ldots , {\varvec{v}}_{B}^{T})\) where each \({\varvec{v}}_i\) can be written as \({\varvec{v}}_{i} = w_{i} [1,\ldots ,1]^{T}\). Obtaining the compressed parametrization \(({\varvec{w}},{\varvec{\theta }})\) is then straightforward (see Fig. 1). An analogous procedure can be applied to the PWL parametrization, in which case we search for sub-vectors whose intermediate components can each be expressed as an average of their neighbors. In practice, however, the components of the uncompressed solution can be noisy owing to the convex relaxation and the use of approximate solvers to find the uncompressed parameters. For this reason, it is useful to round the uncompressed solution before converting it to the compressed one. The rounding procedures for the PWC and the PWL parametrizations are described in the next sections.
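For the PWC case, the conversion amounts to merging runs of equal neighboring weights. A minimal Python sketch, assuming the weights within a run are exactly equal (i.e. already rounded) and using an illustrative function name:

```python
def compress_pwc(v, nu):
    """Convert uncompressed (v, nu) with D weights and D + 1 edges to compressed (w, theta)."""
    w = [v[0]]
    theta = [nu[0]]
    for i in range(1, len(v)):
        if v[i] != w[-1]:
            # A new run of equal weights starts at edge nu[i].
            theta.append(nu[i])
            w.append(v[i])
    theta.append(nu[-1])
    return w, theta


v = [1.0, 1.0, 1.0, 3.0, 3.0, 2.0]             # D = 6 uncompressed weights
nu = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]       # D + 1 = 7 initial edges
w, theta = compress_pwc(v, nu)                 # ([1.0, 3.0, 2.0], [0.0, 3.0, 5.0, 6.0])
```

The compressed function has three bins instead of six and takes the same value as the uncompressed one at every point of the domain.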

### 4.1 Rounding of PWC parametrization

*B* bins which have the shortest Euclidean distance to the given \({\varvec{v}}\) by solving

*B* until the constraint \(\Vert {\varvec{v}}-{\varvec{v}}'\Vert ^{2} \le \varepsilon \) is satisfied.
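The rounding problem itself is not reproduced above, so the following Python sketch is only one possible realization. It uses a textbook dynamic program for least-squares segmentation to find the closest vector with *B* constant runs, and increases *B* until the precision constraint holds. The solver used by the authors may differ, and all names are illustrative.

```python
def best_pwc_fit(v, num_bins):
    """Closest vector (in squared Euclidean distance) to v made of num_bins constant runs."""
    D = len(v)
    s1, s2 = [0.0], [0.0]  # prefix sums of v and of v squared
    for x in v:
        s1.append(s1[-1] + x)
        s2.append(s2[-1] + x * x)

    def cost(i, j):
        # Squared error of replacing v[i:j] by its mean.
        s = s1[j] - s1[i]
        return (s2[j] - s2[i]) - s * s / (j - i)

    inf = float("inf")
    # best[b][j]: minimal error of covering v[:j] with b runs; split[b][j]: start of the last run.
    best = [[inf] * (D + 1) for _ in range(num_bins + 1)]
    split = [[0] * (D + 1) for _ in range(num_bins + 1)]
    best[0][0] = 0.0
    for b in range(1, num_bins + 1):
        for j in range(b, D + 1):
            for i in range(b - 1, j):
                c = best[b - 1][i] + cost(i, j)
                if c < best[b][j]:
                    best[b][j], split[b][j] = c, i
    # Backtrack the optimal run boundaries and fill each run with its mean.
    rounded, j = [0.0] * D, D
    for b in range(num_bins, 0, -1):
        i = split[b][j]
        rounded[i:j] = [(s1[j] - s1[i]) / (j - i)] * (j - i)
        j = i
    return rounded, max(best[num_bins][D], 0.0)


def round_pwc(v, eps):
    """Smallest number of runs B whose best fit satisfies ||v - v'||^2 <= eps."""
    for num_bins in range(1, len(v) + 1):
        rounded, err = best_pwc_fit(v, num_bins)
        if err <= eps:
            return rounded, num_bins
    return list(v), len(v)  # fall back to v itself if round-off prevents err <= eps


noisy = [1.0, 1.1, 0.9, 5.0, 5.1, 4.9]
rounded, num_bins = round_pwc(noisy, eps=0.1)
```

On this toy vector a single run leaves a large residual, while two runs with means close to 1 and 5 already meet the tolerance. The dynamic program costs \(O(BD^2)\) operations for each tested *B*.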

### 4.2 Rounding PWL function

*B* bins which has the shortest Euclidean distance to the given \({\varvec{u}}\) can be found by solving

## 5 Examples of the proposed framework

The previous sections describe a generic framework for modifying a wide class of convex algorithms so that they can learn the PWC and the PWL functions. In this section, we give three instances of the proposed framework. We also show how the same idea can be applied to learning multi-variate PWC and PWL functions. In particular, we consider learning of linear classifiers from sequential data represented by PWC histograms (Sect. 5.1), estimation of PWL probability density models (Sect. 5.2) and learning of non-linear classifiers via the PWL data embedding (Sect. 5.3). It will be shown that learning leads to an instance of a convex optimization problem in all cases. Moreover, the optimization can be reformulated as a well-understood Quadratic Programming task.

### 5.1 Classification of histograms

In many applications, the object to be classified is described by a set of sequences sampled from some unknown distributions. A simple yet effective representation of the sequential data is the normalized PWC histogram which is then used as an input to a linear classifier. This classification model has been successfully used, e.g., in computer vision (Dalal and Triggs 2005) and in computer security (Bartos and Sofka 2015), as will be described in Sect. 7.1.

*n* sequences each having *d* elements. The linear classifier \(h({\varvec{X}},{\varvec{w}},{\varvec{\theta }})=\mathop {{\mathrm{sign}}}( f_{{\mathrm{pwc}}}({\varvec{X}},{\varvec{w}},{\varvec{\theta }}))\) assigns \({\varvec{X}}\) to a class based on the sign of the discriminant function

*i*-th histogram and \({\varvec{\theta }}=({\varvec{\theta }}_{1}^{T},\ldots ,{\varvec{\theta }}_{n}^{T})^{T}\in {\mathbb {R}}^{n+B}\) is their concatenation, \(B=\sum _{i=1}^{n} b_{i}\) denotes the total number of bins, \({\varvec{w}}_{i}=(w_{i,1},\ldots ,w_{i,b_{i}})^{T}\in {\mathbb {R}}^{b_{i}}\) are bin heights of the *i*-th histogram and \({\varvec{w}}=({\varvec{w}}_{1}^{T},\ldots ,{\varvec{w}}_{n}^{T})\in {\mathbb {R}}^{B}\) is their concatenation.
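The discriminant function is not reproduced above. The Python sketch below assumes it sums, over the *n* sequences, the inner product between the bin heights \({\varvec{w}}_i\) and the normalized histogram of the *i*-th sequence computed with the edges \({\varvec{\theta }}_i\). Function names and the toy data are illustrative.

```python
from bisect import bisect_right


def normalized_histogram(sequence, edges):
    """Fraction of sequence elements per bin; all elements are assumed to lie in [edges[0], edges[-1])."""
    counts = [0] * (len(edges) - 1)
    for x in sequence:
        counts[bisect_right(edges, x) - 1] += 1
    return [c / len(sequence) for c in counts]


def discriminant(X, w, theta):
    """Assumed discriminant: sum over sequences of <w_i, histogram of the i-th sequence>."""
    score = 0.0
    for x_i, w_i, theta_i in zip(X, w, theta):
        hist = normalized_histogram(x_i, theta_i)
        score += sum(w_ib * h_ib for w_ib, h_ib in zip(w_i, hist))
    return score


def classify(X, w, theta):
    return 1 if discriminant(X, w, theta) >= 0 else -1


X = [[0.1, 0.2, 0.9, 0.8], [3.0, 1.0]]       # n = 2 sequences (lengths differ only for brevity)
theta = [[0.0, 0.5, 1.0], [0.0, 2.0, 4.0]]   # bin edges per sequence
w = [[1.0, -1.0], [0.5, -2.0]]               # bin heights per sequence
score = discriminant(X, w, theta)
label = classify(X, w, theta)
```

Because the histogram is normalized, the contribution of each sequence equals the average PWC weight of its elements, so the score is linear in \({\varvec{w}}\) once the edges are fixed.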

*n* histograms. For example, we place \(D+1\) edges equidistantly between the minimal \(\hbox {Min}_{i}\) and the maximal \(\hbox {Max}_{i}\) value that can appear in the *i*-th sequence, so that \(\nu _{i,j}=\frac{j}{D}(\mathrm{Max}_i - \mathrm{Min}_i) + \mathrm{Min}_i\) for \(j=0,\ldots ,D\). Second, we combine the SVM objective function

### 5.2 Estimation of PWL histograms

*p*(*x*), the goal is to find \({\hat{p}}(x)\) approximating *p*(*x*) accurately based on the samples \({\mathcal {T}}\). Assume we want to model the unknown p.d.f. by the PWL function

*D* is set to be sufficiently high. We substitute the negative log-likelihood

### 5.3 PWL embedding for non-linear classification

*n* PWL functions each defined for a single input feature. The vector \({\varvec{\theta }}_{i}\in {\mathbb {R}}^{b_{i}+1}\) contains the bin edges of the *i*-th feature and \({\varvec{\theta }}=({\varvec{\theta }}_{1}^{T},\ldots ,{\varvec{\theta }}_{n}^{T})^{T}\in {\mathbb {R}}^{n+B}\) is their concatenation where \(B=\sum _{i=1}^{n} b_{i}\) denotes the total number of bins. The vector \({\varvec{w}}_{i}\in {\mathbb {R}}^{b_{i}+1}\) contains the weights associated with the edges \({\varvec{\theta }}_{i}\) and \({\varvec{w}}=({\varvec{w}}_{1}^{T},\ldots ,{\varvec{w}}_{n}^{T})\in {\mathbb {R}}^{B+n}\) is their concatenation.

*D* is the maximal number of bins per feature. Let \({\varvec{\nu }}=({\varvec{\nu }}_{1}^{T},\ldots ,{\varvec{\nu }}_{n}^{T})^{T}\in {\mathbb {R}}^{n(D+1)}\) be the concatenation of initial discretizations for all features. The SVM objective function

## 6 Limitations of the convex relaxation

In this section we discuss the limitations of the proposed convex relaxation. We concentrate on the PWC learning which leads to solving a convex minimization problem (8). The objective function (8) is a weighted sum of the original task objective \(F_\mathrm{pwc}({\varvec{v}},{\varvec{\nu }})\) and a regularization term \(\sum _{i=1}^{D-1} | v_{i} - v_{i+1} |\) that was introduced to control the number of different neighboring weights. However, the regularization term can also have an undesired influence on the solution as shown below.
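One way to see why the regularization term is only a proxy for the number of bins is to compare the two quantities directly. In the illustrative Python sketch below (names and vectors chosen here), both weight vectors define two bins, yet their penalties differ by a factor of ten because the term grows with the size of the jump.

```python
def tv_penalty(v):
    """Regularization term: sum of |v_i - v_{i+1}| over neighboring weights."""
    return sum(abs(a - b) for a, b in zip(v, v[1:]))


def num_bins(v):
    """Number of bins after merging equal neighboring weights."""
    return 1 + sum(1 for a, b in zip(v, v[1:]) if a != b)


small_jump = [0.0, 0.0, 1.0, 1.0]
large_jump = [0.0, 0.0, 10.0, 10.0]
penalties = (tv_penalty(small_jump), tv_penalty(large_jump))   # (1.0, 10.0)
bins = (num_bins(small_jump), num_bins(large_jump))            # (2, 2)
```

A minimizer of the relaxed objective is therefore also pushed towards smaller differences between neighboring weights, not just towards fewer of them.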

## 7 Experiments

This section provides an empirical evaluation of the algorithms proposed in Sect. 5. Section 7.1 describes learning of a malware detector that represents network communication by PWC histograms. Section 7.2 evaluates the proposed PWL density estimator on synthetic data. Section 7.3 evaluates the proposed algorithm for learning the PWL data embedding on classification benchmarks selected from the UCI repository.

### 7.1 Malware detection by classification of histograms

Table 1 List of selected connection-based features extracted from HTTP/HTTPS traffic

| Features applied on URL, path, query, file name |
|---|
| Length; digit ratio |
| Lower/upper case ratio; ratio of digits |
| Vowel changes ratio |
| Ratio of a character with max occurrence |
| Has a special character |
| Max length of consonant/vowel/digit stream |
| Number of non-base 64 characters |
| Has repetition of parameters |

| Other features |
|---|
| Number of bytes from client to server |
| Number of bytes from server to client |
| Length of referer/file extension |
| Number of parameters in query |
| Number of ’/’ in path/query/referer |

We grouped all connections into bags, where one bag contains all connections of the same user going to the same domain. We then extracted 115 feature values for each connection (see Table 1), computed a histogram representation of each bag and used the histograms as input to a linear two-class classifier (21) as described in Bartos and Sofka (2015).

1. The linear SVM using histograms with equidistantly spaced bins. The number of bins per feature was varied over \(\{8,12,\ldots ,256\}\).

2. The proposed algorithm learning non-equidistant bins from examples. The uncompressed weights \({\varvec{u}}^{*}\) are obtained by solving (22) with the initial discretization \({\varvec{\nu }}\) set to split each feature equidistantly into \(D=256\) bins. The constant \(\gamma \), which controls the number of bins, varied from \(10^{-1}\) to \(10^{-6}\). The compressed weights \(({\varvec{w}}^{*},{\varvec{\theta }}^{*})\) were obtained by the rounding procedure (15). Finally, the linear SVM was re-trained on the learned bins \({\varvec{\theta }}^*\).

Table 2 Performance comparison of a linear SVM classifier trained from a histogram representation with equidistant bins with the two proposed methods: learned soft bins (when the bins and SVM weights are learned simultaneously) and learned rounded bins (retraining new SVM weights once the bins are learned from the samples)

| Equidistant bins: bins per feature | Equidistant bins: B | Equidistant bins: recall at 95\(\%\) | Learned soft bins: \(\gamma \) | Learned soft bins: recall at 95\(\%\) | Learned rounded bins: bins per feature | Learned rounded bins: B | Learned rounded bins: recall at 95\(\%\) |
|---|---|---|---|---|---|---|---|
| 256 | 58,880 | 53.5 (25.4) | \(5\times 10^{-7}\) | 58.2 (24.4) | 58 | 13,316 | 58.9 (23.6) |
| 128 | 29,440 | 51.0 (27.9) | \(1\times 10^{-6}\) | 56.4 (23.9) | 40 | 9196 | 58.3 (22.9) |
| 64 | 14,720 | 51.2 (26.5) | \(5\times 10^{-6}\) | 56.4 (22.6) | 20 | 4634 | 55.0 (20.6) |
| 32 | 7360 | 50.3 (26.3) | \(1\times 10^{-5}\) | 56.2 (24.3) | 13 | 2991 | 54.5 (22.2) |
| 16 | 3680 | 46.7 (26.9) | \(5\times 10^{-5}\) | 54.6 (25.4) | 3 | 741 | 51.1 (25.7) |
| 8 | 1840 | 45.6 (28.5) | \(1\times 10^{-4}\) | 51.2 (22.5) | 2 | 510 | 50.0 (27.5) |

### 7.2 Non-parametric distribution estimation

1. The proposed algorithm estimating a non-equidistant PWL histogram. The uncompressed weights \({\varvec{u}}^{*}\) were obtained by solving (24). The initial \(D=100\) bins \({\varvec{\nu }}\) were placed equidistantly between the minimal and the maximal value in the training set. The optimal value of the constant \(\gamma \) was selected from \(\{10,\ldots ,10{,}000\}\) based on the log-likelihood evaluated on the validation set. The compressed parameters \(({\varvec{w}}^*,{\varvec{\theta }}^*)\) of the PWL histogram (23) were computed from \({\varvec{u}}^*\) by the rounding procedure (20) with the precision parameter \(\varepsilon =0.001\).

2. The PWL histogram with bin edges \({\varvec{\theta }}\) placed equidistantly between the minimal and the maximal value in the training set. The weights \({\varvec{w}}\) were found by maximizing the likelihood function which is equivalent to solving (24) with \(\gamma =0\). The optimal number of bins was selected from \(\{5,10,\ldots , 100\}\) based on the log-likelihood evaluated on the validation set.

3. The standard PWC histogram with equidistant bins whose number was selected from \(\{5,10,\ldots , 100\}\) based on the log-likelihood evaluated on the validation set. The bin heights were found by the ML method.

*p*(*x*) to generate training and validation sets, the size of which varied from 100 to 10,000. For each method we recorded the optimal number of bins and the KL-divergence between the estimated and the ground-truth distribution *p*(*x*). The results are averages and standard deviations computed over ten generated data sets.

Figure 8a shows the KL-divergence and Fig. 8b the number of bins as a function of the training set size. As expected, the equidistant PWC histogram provides the least precise (high KL-divergence) and the most complex (high number of bins) model. We also see that the PWL model with equidistantly spaced bins is as accurate as the model with non-equidistant bins learned from examples; however, the non-equidistant model is consistently more compact, i.e. it uses fewer bins.
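The text does not say how the KL-divergence was computed. One simple possibility is numerical integration on a fine grid, sketched below in Python. The triangular ground-truth density, the two histogram estimates and all names are hypothetical examples chosen here, not the distributions used in the experiment.

```python
import math


def kl_divergence(p, q, lo, hi, steps=10000):
    """Midpoint-rule approximation of KL(p || q), the integral of p(x) log(p(x) / q(x)) over [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        px, qx = p(x), q(x)
        if px > 0.0:
            total += px * math.log(px / qx) * h
    return total


def pwc_density(heights):
    """PWC density on [0, 1) with equidistant bins and the given bin heights."""
    n = len(heights)
    return lambda x: heights[min(int(x * n), n - 1)]


def truth(x):
    """Hypothetical ground truth: triangular density p(x) = 2x on [0, 1]."""
    return 2.0 * x


# Bin heights equal to the exact probability mass of each bin divided by the bin width.
two_bins = pwc_density([0.5, 1.5])
four_bins = pwc_density([0.25, 0.75, 1.25, 1.75])

kl_two = kl_divergence(truth, two_bins, 0.0, 1.0)
kl_four = kl_divergence(truth, four_bins, 0.0, 1.0)
```

For this toy density the four-bin histogram gives a lower divergence than the two-bin one, illustrating the trade-off between accuracy and the number of bins that the experiment measures.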

### 7.3 PWL embedding for non-linear classification

1. The proposed algorithm learning simultaneously \({\varvec{\theta }}\) and \({\varvec{w}}\). The uncompressed parameters \({\varvec{u}}^*\) were found by solving (26) with the initial discretization \({\varvec{\nu }}\) equidistantly splitting each feature into \(D=100\) bins. The compressed parameters \(({\varvec{w}}^*,{\varvec{\theta }}^*)\) were computed from \(({\varvec{u}}^*,{\varvec{\nu }})\) by the rounding procedure (20) with the precision parameter \(\varepsilon =0.1\). Finally, a linear SVM was re-trained on the learned bins \({\varvec{\theta }}^*\). The constant \(\gamma \), which controls the number of bins, was varied from 0.1 to 0.0001.

2. The parameters \({\varvec{w}}\) were trained by the linear SVM on top of equidistantly constructed bins \({\varvec{\theta }}\). The number of bins per feature varied from 5 to 20.

3. Method of Pele et al. (2013) which was shown to outperform the non-linear SVM with many state-of-the-art kernels and data embeddings. Namely, we re-implemented the “PL1 algorithm” applying the PWL embedding on individual features as we do. The non-equidistant bins were found for each feature independently by constructing edges as the mid-points between the cluster centers obtained from the k-means algorithm. The number of bins was varied from 3 to 20.
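The edge construction used by the third method can be sketched as follows. This is not the implementation used in the experiments: the one-dimensional k-means below uses plain Lloyd iterations with a deterministic quantile-based initialization, and closing the outer bins with the minimal and maximal feature value is an assumption made here.

```python
def kmeans_1d(values, k, iters=100):
    """Plain Lloyd iterations on scalar data; initial centers are taken at evenly spaced ranks."""
    ordered = sorted(values)
    m = len(ordered)
    centers = [ordered[((2 * j + 1) * m) // (2 * k)] for j in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in ordered:
            nearest = min(range(k), key=lambda c: abs(x - centers[c]))
            groups[nearest].append(x)
        # Empty clusters keep their previous center.
        updated = [sum(g) / len(g) if g else centers[c] for c, g in enumerate(groups)]
        if updated == centers:
            break
        centers = updated
    return sorted(centers)


def midpoint_edges(values, k):
    """Bin edges at the mid-points between neighboring cluster centers (outer edges assumed)."""
    centers = kmeans_1d(values, k)
    inner = [(a + b) / 2.0 for a, b in zip(centers, centers[1:])]
    return [min(values)] + inner + [max(values)]


feature = [0.0, 0.2, 0.4, 5.0, 5.2, 5.4, 10.0, 10.2, 10.4]
edges = midpoint_edges(feature, 3)   # inner edges close to 2.7 and 7.7
```

Each feature is processed independently and without labels, which is the main difference from learning the edges jointly with the classifier.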

Table 3 A summary of two-class classification problems selected from the UCI repository (Lichman 2013) and used to evaluate the linear embedding algorithms

| Name | Num. of examples | Num. of features | Pele et al. (2013): acc (\(\%\)) | Pele et al. (2013): bins | Equidist. bins: acc (\(\%\)) | Equidist. bins: bins | Learned bins: acc (\(\%\)) | Learned bins: bins |
|---|---|---|---|---|---|---|---|---|
| Eyestate | 14,980 | 15 | 64.6 ± 7.1 | 20.0 | 60.5 ± 5.1 | 21.0 | | 13.3 |
| Ionosphere | 351 | 34 | 92.1 ± 1.9 | 3.0 | 90.9 ± 2.3 | 6.0 | | 8.4 |
| Magic | 19,020 | 11 | 85.4 ± 0.3 | 20.0 | 85.4 ± 0.3 | 21.0 | | 11.7 |
| Miniboo | 130,065 | 50 | 91.0 ± 0.2 | 3.0 | 84.4 ± 0.6 | 21.0 | | 5.7 |
| Musk | 6598 | 167 | 98.8 ± 0.3 | 10.0 | 98.4 ± 0.4 | 11.0 | | 20.2 |
| Skin | 245,057 | 4 | 96.5 ± 0.2 | 20.0 | 96.6 ± 0.1 | 21.0 | | 19.3 |
| Sonar | 208 | 60 | | 5.0 | 79.3 ± 5.8 | 6.0 | 82.0 ± 3.7 | 3.7 |
| Spectf | 267 | 44 | | 10.0 | 79.2 ± 0.0 | 6.0 | 79.3 ± 3.2 | 4.6 |
| Transf | 748 | 5 | 76.4 ± 0.7 | 3.0 | | 11.0 | 76.3 ± 0.6 | 3.0 |
| Wilt | 4889 | 6 | 97.0 ± 0.6 | 20.0 | 94.8 ± 0.3 | 21.0 | | 9.7 |

### 7.4 The computational time

In this section we provide an empirical estimate of the computational time required when learning the discretization by the proposed framework. As a benchmark we use the task of learning the PWL embedding for non-linear classification described in Sect. 7.3. In this case learning leads to a convex optimization problem (26) which has two hyper-parameters. First, the parameter \(\lambda \) controls the quadratic regularization, as in the standard SVM formulation. Second, the additional parameter \(\gamma \) implicitly controls the number of bins of the resulting feature discretization. The time required to solve the convex problem thus depends on the hyper-parameters \(\lambda \) and \(\gamma \) and on the size of the optimized problem specified by the number of examples *m* and the number of features *n*. We empirically measured the dependency of the computational time on these four variables as described below.

Besides the four variables, the runtime obviously depends on the particular optimization solver. Many optimization methods are applicable to the convex problem (26). For example, the problem (26) can be expressed as an equivalent quadratic program (QP) and solved by any QP solver. In this work we used the Optimized Cutting Plane Algorithm for Large-Scale Risk Minimization (OCA) (Franc and Sonnenburg 2009). The OCA is a general-purpose and easy-to-implement solver for minimization of convex functions containing a quadratic regularization term like the problem (26). The empirical study presented below uses this particular solver which is sufficiently efficient for the problem sizes considered in our experiments. Other solvers might be even more efficient.

Figure 11 shows the average runtime required by the OCA solver as a function of \(\lambda \), \(\gamma \), the number of examples *m* and the number of features *n*. The runtime is measured on a subset of classification problems listed in Table 3. In case of \(\lambda \) and \(\gamma \) we use problems with more than 1000 examples (6 problems in total). The dependency on the number of examples and the number of features is measured on “miniboo” and “musk” dataset, respectively.

We found that the runtime scales gracefully with respect to the quadratic regularization parameter \(\lambda \); specifically, it grows approximately as \({\mathcal {O}}(\lambda ^{-0.8})\). On the other hand, the runtime grows much faster, approximately as \({\mathcal {O}}(e^{20\sqrt{\gamma }})\), with increasing \(\gamma \), the parameter controlling the number of bins. That is, the lower the number of bins, the more computational time is needed. A low number of bins is enforced by increasing the weight of the \(L_1\)-regularization term in the objective function, which brings the task closer to a linear program. The dominant linear term impairs the regularization effect of the quadratic term, which causes the “zig-zag” behavior of the cutting plane solver and leads to a higher number of iterations. The dependency of the running time on the number of training examples is linear, which is consistent with the theoretical upper bound on the number of iterations proved in Franc and Sonnenburg (2009). Finally, the dependency on the number of features is approximately quadratic.

In absolute numbers, the longest time, around 160 min, was required for training a single model on “miniboo” dataset (130,065 examples, 50 features) with the highest value of \(\lambda =0.1\).

## 8 Conclusions

We proposed a generic framework for modifying a wide class of convex learning algorithms so that they can learn parameters of the piece-wise constant (PWC) and the piece-wise linear (PWL) functions from examples. The learning objective of the original algorithm is augmented by a convex term which enforces compact bins to emerge from an initial fine discretization. In contrast to existing methods, the proposed approach learns the discretization and the parameters of the decision function simultaneously. In addition, learning is converted to a convex problem which is solvable efficiently by global methods. We instantiated the proposed framework for three problems, namely, learning the PWC histogram representation of sequential data, estimation of the PWL probability density function and learning the PWL data embedding for non-linear classification. The proposed algorithms were evaluated on synthetic data and standard public benchmarks and applied to malware detection in network traffic data. It was demonstrated that the proposed convex algorithms yield models with fewer parameters and comparable or better accuracy than the existing methods. The main disadvantage of the proposed method, when compared to heuristic local methods like Pele et al. (2013), is the higher computational time required to solve the convex problem.

## Footnotes

1. Note that the PWC function has *B* weights associated with the bins while the PWL function has \(B+1\) weights associated with the bin edges.

## Notes

### Acknowledgements

VF was supported by Czech Science Foundation Grant 16-05872S. OF was supported by the internal CTU Funding SGS17/185/OHK3/3T/13.

## References

- Bartos, K., & Sofka, M. (2015). Robust representation for domain adaptation in network security. In *Proceedings of ECML/PKDD*, volume 3, (pp. 116–132).
- Bhatt, R., & Dhall, A. (2010). Skin segmentation dataset. *UCI Machine Learning Repository*. https://archive.ics.uci.edu/ml/datasets/skin+segmentation
- Boullé, M. (2006). Modl: A bayes optimal discretization method for continuous attributes. *Machine Learning*, *65*(1), 131–165.
- Boyd, S., & Vandenberghe, L. (2004). *Convex Optimization*. Cambridge: Cambridge University Press.
- Candes, E., Romberg, J., & Tao, T. (2006). Stable signal recovery from incomplete and inaccurate measurements. *Communications on Pure and Applied Mathematics*, *59*(8), 1207–1223.
- Candes, E., & Tao, T. (2005). Decoding by linear programming. *IEEE Transactions on Information Theory*, *51*(12), 4203–4215.
- Chapelle, O., Haffner, P., & Vapnik, V. N. (1999). Support vector machines for histogram-based image classification. *IEEE Transactions on Neural Networks*, *10*(5), 1055–1064.
- Cios, K., Pedrycz, W., Swiniarski, R., & Kurgan, L. (2007). *Data Mining: A Knowledge Discovery Approach*. Berlin: Springer.
- Dalal, N., & Triggs, B. (2005). Histogram of oriented gradients for human detection. In *Proceedings of computer vision and pattern recognition*, volume 1, (pp. 886–893).
- Donoho, D. (2006). Compressed sensing. *IEEE Transactions on Information Theory*, *52*(4), 1289–1306.
- Dougherty, J., Kohavi, R., & Sahami, M. (1995). Supervised and unsupervised discretization of continuous features. In *Proceedings of international conference on machine learning*, Morgan Kaufmann, (pp. 194–202).
- Elomaa, T., & Rousu, J. (1999). General and efficient multisplitting of numerical attributes. *Machine Learning*, *36*(3), 201–244.
- Fayyad, U. M., & Irani, K. B. (1992). On the handling of continuous-valued attributes in decision tree generation. *Machine Learning*, *8*(1), 87–102.
- Fayyad, U. M., & Irani, K. B. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. In *Proceedings of international joint conference on artificial intelligence*, (pp. 1022–1029).
- Franc, V., & Sonnenburg, S. (2009). Optimized cutting plane algorithm for large-scale risk minimization. *Journal of Machine Learning Research*, *10*, 2157–2232.
- Friedman, N., & Goldszmidt, M. (1996). Discretizing continuous attributes while learning bayesian networks. In *Proceedings of international conference on machine learning*, (pp. 157–165).
- García, S., Luengo, J., Sáez, J. A., López, V., & Herrera, F. (2013). A survey of discretization techniques: Taxonomy and empirical analysis in supervised learning. *IEEE Transactions on Knowledge and Data Engineering*, *25*(4), 734–750.
- Hue, C., & Boullé, M. (2007). A new probabilistic approach in rank regression with optimal bayesian partitioning. *Journal of Machine Learning Research*, *8*, 2727–2754.
- Johnson, B. A., Tateishi, R., & Hoan, N. T. (2013). A hybrid pansharpening approach and multiscale object-based image analysis for mapping diseased pine and oak trees. *International Journal of Remote Sensing*, *34*(20), 6969–6982.
- Kerber, R. (1992). Chimerge: Discretization of numeric attributes. In *Proceedings of the tenth national conference on artificial intelligence, AAAI’92*, (pp. 123–128).
- Lichman, M. (2013). *UCI machine learning repository*. Irvine, CA: University of California, School of Information and Computer Science.
- Liu, H., Hussain, F., Tan, C. L., & Dash, M. (2002). Discretization: An enabling technique. *Data Mining and Knowledge Discovery*, *6*(4), 393–423.
- Pele, O., Taskar, B., Globerson, A., & Werman, M. (2013). The pairwise piecewise-linear embedding for efficient non-linear classification. In *Proceedings of the international conference on machine learning*, (pp. 205–213).
- Rao, C. (2005). Data mining and data visualization. In C. R. Rao, E. J. Wegman, & J. L. Solka (Eds.), *Handbook of Statistics*, volume 24. New York: Elsevier.
- Silverman, B. W. (1986). *Density Estimation for Statistics and Data Analysis*. London, New York: Chapman & Hall.
- Yang, Y., & Webb, G. I. (2008). Discretization for naive–bayes learning: Managing discretization bias and variance. *Machine Learning*, *74*(1), 39–74.
- Yeh, I.-C., Yang, K.-J., & Ting, T.-M. (2008). Knowledge discovery on RFM model using bernoulli sequence. *Expert Systems with Applications*, *36*(3), 5866–5871.