Abstract
Kernel-based methods (KBMs) such as support vector machines (SVMs) are popular data mining tools for solving classification and regression problems. Due to their high prediction accuracy, KBMs have been successfully used in various fields. However, KBMs have three major drawbacks. First, it is not easy to obtain an explicit description of the discrimination (or regression) function in the original input space or to make variable selection decisions in the input space. Second, depending on the magnitude and numeric range of the given data points, the resulting kernel matrices may be ill-conditioned, so the learning algorithms may suffer from numerical instability. Although data scaling can generally be applied to deal with this problem and related issues, it may not always be effective. Third, the selection of an appropriate kernel type and its parameters can be a complex undertaking, and the choice greatly affects the performance of the resulting functions. To overcome these drawbacks, we present here the sparse signomial classification and regression (SSCR) model. SSCR seeks a sparse signomial function by solving a linear program that minimizes the weighted sum of the ℓ1-norm of the coefficient vector of the function and the ℓ1-norm of the violation (or loss) caused by the function. SSCR employs a signomial function in the original variables and can therefore capture nonlinearity in the data. SSCR is also less sensitive to the numerical values or numeric ranges of the given data and yields a sparse, explicit description of the resulting function in the original input space, which is useful for interpretation: it indicates which original input variables and/or interaction terms are more meaningful than others. We also present column generation techniques to select important signomial terms in the classification and regression processes and explore a number of theoretical properties of the proposed formulation.
Computational studies demonstrate that SSCR is at least competitive with, and can even outperform, other widely used learning methods for classification and regression.
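The ℓ1-penalized linear program at the heart of SSCR can be illustrated with a minimal sketch. Everything below is an assumption for illustration: a fixed toy dictionary of signomial-style terms, hinge-type violations for binary classification, an arbitrary trade-off weight `lam`, and SciPy's LP solver. The paper's actual formulation, loss definition, and column-generation scheme for selecting terms differ in detail.

```python
# Hypothetical, simplified SSCR-style LP:
#   minimize  lam * ||w||_1 + sum_i xi_i
#   s.t.      y_i * (phi(x_i) . w + b) >= 1 - xi_i,   xi_i >= 0,
# where phi() maps inputs to a fixed, small dictionary of signomial-style
# terms. The l1-norm of w is linearized via the split w = w_plus - w_minus.
import numpy as np
from scipy.optimize import linprog

def signomial_features(X):
    # toy dictionary (an assumed choice): x1, x2, x1*x2, x1^2, x2^2
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1 * x2, x1**2, x2**2])

def fit_sscr_lp(X, y, lam=0.1):
    Phi = signomial_features(X)
    m, p = Phi.shape
    # decision vars: [w+ (p), w- (p), b+, b-, xi (m)], all nonnegative
    c = np.concatenate([lam * np.ones(2 * p), np.zeros(2), np.ones(m)])
    # margin constraint rewritten as:
    #   -y_i*(Phi_i (w+ - w-) + b+ - b-) - xi_i <= -1
    Y = y[:, None]
    A_ub = np.hstack([-Y * Phi, Y * Phi, -Y, Y, -np.eye(m)])
    b_ub = -np.ones(m)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, None)] * (2 * p + 2 + m))
    z = res.x
    w = z[:p] - z[p:2 * p]
    b = z[2 * p] - z[2 * p + 1]
    return w, b, res

# tiny separable toy data (illustrative only)
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b, res = fit_sscr_lp(X, y)
pred = np.sign(signomial_features(X) @ w + b)
```

Because both the objective and the constraints are linear in the split variables, an off-the-shelf LP solver suffices, and the ℓ1 penalty drives many coefficients of `w` to zero, which is what gives the explicit, sparse description in the original input space.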
References
Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and regression trees. Belmont: Wadsworth International.
Burges, C. J. C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2, 121–167.
Chang, C. C., & Lin, C. J. (2001). LIBSVM: a library for support vector machines. http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Chapelle, O. (2007). Training a support vector machine in the primal. Neural Computation, 19(5), 1155–1178.
Chen, S.-H., Sun, J., Dimitrov, L., Turner, A. R., Adams, T. S., Meyers, D. A., Chang, B.-L., Zheng, S. L., Gronberg, H., Xu, J., & Hsu, F.-C. (2008). A support vector machine approach for detecting gene-gene interaction. Genetic Epidemiology, 32, 152–167.
Chou, P.-H., Wu, M.-J., & Chen, K.-K. (2010). Integrating support vector machine and genetic algorithm to implement dynamic wafer quality prediction system. Expert Systems with Applications, 37, 4413–4424.
Chvátal, V. (1983). Linear programming. New York: Freeman.
Fang, Y., Park, J. I., Jeong, Y. S., Jeong, M. K., Baek, S., & Cho, H. (2010). Enhanced predictions of wood properties using hybrid models of PCR and PLS with high-dimensional NIR spectra data. Annals of Operations Research, 190, 3–15.
Frank, M., & Wolfe, P. (1956). An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3, 95–110.
Friedman, J. (1991). Multivariate adaptive regression splines. Annals of Statistics, 19, 1–67.
Glasmachers, T., & Igel, C. (2010). Maximum likelihood model selection for 1-norm soft margin SVMs with multiple parameters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(8), 1522–1528.
Gunn, S. R. (1998). Support vector machines for classification and regression. Technical report. School of Electronics and Computer Science, University of Southampton.
Hosmer, D., & Lemeshow, S. (2000). Applied logistic regression (2nd ed.). New York: Wiley.
Huang, K., Zheng, D., King, I., & Lyu, M. R. (2009). Arbitrary norm support vector machines. Neural Computation, 21, 560–582.
Kang, P., Lee, H., Cho, S., Kim, D., Park, J., Park, C.-K., & Doh, S. (2009). A virtual metrology system for semiconductor manufacturing. Expert Systems with Applications, 36, 12554–12561.
Kim, H., & Loh, W. Y. (2001). Classification tree with unbiased multiway splits. Journal of American Statistical Association, 96, 598–604.
Mangasarian, O. L. (1999). Arbitrary-norm separating plane. Operations Research Letters, 24, 15–23.
Mangasarian, O. L. (2006). Exact 1-norm support vector machines via unconstrained convex differentiable minimization. Journal of Machine Learning Research, 7, 1517–1530.
Mangasarian, O. L., & Thomson, M. E. (2008). Chunking for massive nonlinear kernel classification. Optimization Methods and Software, 23, 265–274.
MATLAB Statistics Toolbox (2008). http://www.mathworks.com.
Mixture Flexible Discriminant Analysis Package (2009). http://cran.r-project.org/web/packages/mda.
Montgomery, D. C., Peck, E. A., & Vining, G. G. (2006). Introduction to linear regression analysis (4th ed.). New York: Wiley.
Murphy, P. M., & Aha, D. W. (1992). UCI machine learning repository. www.ics.uci.edu/~mlearn/MLRepository.html.
Nemhauser, G. L., & Wolsey, L. A. (1988). Integer and combinatorial optimization. New York: Wiley.
Smola, A. J., & Schölkopf, B. (2004). A tutorial on support vector regression. Statistics and Computing, 14, 199–222.
SOCR body density data. http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_BMI_Regression.
Tax, D. M. J., & Duin, R. P. W. (2004). Support vector data description. Machine Learning, 54, 45–66.
Vapnik, V. N. (1995). The nature of statistical learning theory. New York: Springer.
Veenman, C. J., & Tax, D. M. J. (2005). LESS: a model-based classifier for sparse subspaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27, 1496–1500.
Wang, S., Jiang, W., & Tsui, K.-L. (2010). Adjusted support vector machines based on a new loss function. Annals of Operations Research, 174, 83–101.
Weston, J., Elisseeff, A., Schölkopf, B., & Tipping, M. (2003). Use of zero-norm with linear models and kernel methods. Journal of Machine Learning Research, 3, 1439–1461.
Xpress (2009). http://www.fico.com.
Acknowledgement
The authors thank the editor and anonymous referees for their constructive and helpful comments.
Additional information
This research for the first author was supported by Hankuk University of Foreign Studies Research Fund.
Appendix
Proof of Theorem 1
The result can be established by showing that the well-known maximum independent set problem, which is known to be \(\mathcal{NP}\)-hard (Nemhauser and Wolsey 1988), is a special case of CGP. For a given undirected simple graph G=(V,E), an independent set of G is a subset U of V such that \((i,j) \notin E\) for every pair of nodes \(i,j \in U\). The problem is to find a maximum cardinality independent set of G.
For each node \(v \in V\), we define an n-dimensional positive vector \(\mathbf{x}_v\) with components \(x_{vk}\) for all \(k \in V\), such that \(x_{vk}=1\) for \(k \neq v\) and \(x_{vv}=1/n^3\), where \(n=|V|\). We also define, for each edge \(e \in E\), an n-dimensional vector \(\mathbf{y}_e\) with components \(y_{ek}\) for all \(k \in V\), such that \(y_{ek}=1/n^6\) if \(k=i\) or \(k=j\), and \(y_{ek}=1\) otherwise, where \(e=(i,j)\). Let \(S_1=\{\mathbf{x}_v \mid v \in V\}\) and \(S_2=\{\mathbf{y}_e \mid e \in E\}\). We specify the four parameters of D as \(d_{\min}=0\), \(d_{\max}=1\), \(T=1\), and \(L=n\); the set D is then the set of all n-dimensional binary vectors. We now set \(a_1(\mathbf{x}_v)=1\) for each \(v \in V\) and set each element of \(a_2(\mathbf{y}_e)\) to n for each \(e \in E\), and define a special case of CGP with \(S_1\) and \(S_2\), along with \(a_1\), \(a_2\), and the four parameters of D. This special case of CGP is the problem of maximizing the function \(z(\mathbf{d}) = \sum_{i \in V} (1/n^{3})^{d_{i}} - n\sum_{(i,j) \in E} (1/n^{6})^{d_{i}} (1/n^{6})^{d_{j}}\) over all \(\mathbf{d} \in \mathbb{B}^{n}\).
Now, we show that a maximum independent set U with \(|U|=K\) exists for some positive integer \(K<n\) if and only if \(K \leq \hat{z} < K+1\), where \(\hat{z} = \max\{z(\mathbf{d}) \mid \mathbf{d} \in \mathbb{B}^{n}\}\). Suppose that we have a maximum independent set U of G such that \(|U|=K\). Then, for the vector \(\mathbf{d}\) such that \(d_i=0\) if \(i \in U\) and \(d_j=1\) otherwise, we have \(z(\mathbf{d}) \geq K + (n-K)/n^3 - n|E|/n^6 \geq K\) and \(z(\mathbf{d}) \leq K + (n-K)/n^3 < K+1\). Now, suppose that \(K \leq \hat{z} < K+1\) for some positive integer \(K<n\), and let \(\mathbf{d}\) be a binary vector attaining \(\hat{z}\). Then at least one of \(d_i\) and \(d_j\) must equal 1 for every \((i,j) \in E\); otherwise \(\hat{z} \leq 0\). Moreover, the number of zero elements of \(\mathbf{d}\) equals K. If we define \(U=\{i \in V \mid d_i=0\}\), then U is an independent set of cardinality K. Therefore, the result follows. □
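The correspondence between the maximum independent set size K and \(\hat{z}\) can be checked numerically on a small instance. The sketch below brute-forces both quantities on a 4-node path graph; the graph and the exhaustive enumeration are illustrative only and play no role in the proof itself.

```python
# Brute-force check of the Theorem 1 reduction on a small graph:
#   z(d) = sum_i (1/n^3)^{d_i} - n * sum_{(i,j) in E} (1/n^6)^{d_i} (1/n^6)^{d_j}
# and floor(max z) should equal the maximum independent set size K (K < n).
from itertools import combinations, product

V = [0, 1, 2, 3]
E = [(0, 1), (1, 2), (2, 3)]   # a 4-node path (illustrative instance)
n = len(V)

def z(d):
    node_part = sum((1.0 / n**3) ** di for di in d)
    edge_part = sum((1.0 / n**6) ** d[i] * (1.0 / n**6) ** d[j]
                    for i, j in E)
    return node_part - n * edge_part

# maximize z over all binary vectors d
z_hat = max(z(d) for d in product([0, 1], repeat=n))

def is_independent(U):
    return all((i, j) not in E and (j, i) not in E
               for i, j in combinations(U, 2))

# maximum independent set size by enumeration
K = max(len(U) for r in range(n + 1)
        for U in combinations(V, r) if is_independent(U))
```

On this instance the maximum independent sets have size K = 2 (for example {0, 3}), and \(\hat{z}\) lands in the interval [K, K+1), as the proof asserts.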
Proof of Theorem 2
The result can also be established by showing that the maximum independent set problem is a special case of CGP with \(S_1=S_2\). For any given instance of the maximum independent set problem, we define the vectors \(\mathbf{x}_v\), \(v \in V\), and \(\mathbf{y}_e\), \(e \in E\), as in the proof of Theorem 1. Let \(S_1=S_2=\{\mathbf{x}_v \mid v \in V\} \cup \{\mathbf{y}_e \mid e \in E\}\). In addition to \(a_1(\mathbf{x}_v)\) for each \(v \in V\) and \(a_2(\mathbf{y}_e)\) for each \(e \in E\) as defined in the proof of Theorem 1, we define \(a_1(\mathbf{y}_e)=0\) for each \(e \in E\) and \(a_2(\mathbf{x}_v)=0\) for each \(v \in V\). We can then define a special case of CGP with \(S_1\), \(S_2\), \(a_1\), and \(a_2\), along with the four parameters of D specified as in the proof of Theorem 1. Observe that this special case of CGP with \(S_1=S_2\) is the same problem as that discussed in the proof of Theorem 1. Therefore, the result follows. □
Cite this article
Lee, K., Kim, N. & Jeong, M.K. The sparse signomial classification and regression model. Ann Oper Res 216, 257–286 (2014). https://doi.org/10.1007/s10479-012-1198-y