Abstract
In many real classification problems a monotone relation between some predictors and the classes may be assumed, in the sense that higher (or lower) values of those predictors are associated with higher levels of the response. In this paper, we propose new boosting algorithms, based on LogitBoost, that incorporate this isotonicity information and yield more accurate and more easily interpretable rules. These algorithms rest on theoretical developments that rely on isotonic regression. We show the good performance of these procedures not only in simulations but also on real data sets from two very different contexts, namely cancer diagnosis and failure of induction motors.
Acknowledgements
The authors thank the Associate Editor and two anonymous reviewers for suggestions that led to this improved version of the paper.
Funding
Funding was provided by Ministerio de Economía, Industria y Competitividad, Gobierno de España (Grant No. MTM2015-71217-R).
Appendix
A.1 Theoretical justification for algorithms under the adjacent categories model
Let \(\mathbf{x} \in \mathbb {R}^d,y\in \{1,\dots ,K\}, y_k^*=I_{\left[ y=k\right] },k=1,\dotsc ,K\) and assume the adjacent probabilities model (2). Denote further \(F_1(\mathbf{x} )=0\), so the a posteriori probabilities are:
Now, the expected log-likelihood is:
Conditioning on \(\mathbf{x} \), the score vector \(\mathbf{S} (\mathbf{x} )=(s_k(\mathbf{x} ))\) for the population Newton algorithm is:
The Hessian is a \((K-1)\times (K-1)\) matrix, \(\mathbf{H} (\mathbf{x} )=(H_{km}(\mathbf{x} ))\), \(2\le k,m\le K\), where each element \(H_{km}(\mathbf{x} )\) is:
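The displays for the quantities introduced so far can be sketched as follows. The sketch assumes that model (2) is the usual adjacent-categories logit model, \(\log \{p_k(\mathbf{x} )/p_{k-1}(\mathbf{x} )\}=F_k(\mathbf{x} )\), \(k=2,\dotsc ,K\); this form and the shorthand \(G_k(\mathbf{x} )=\sum _{j=1}^k F_j(\mathbf{x} )\) and \(\gamma _k(\mathbf{x} )=\sum _{j=k}^K p_j(\mathbf{x} )\) are assumptions of the sketch, not notation taken from the text.

\[
p_k(\mathbf{x} )=\frac{\exp \{G_k(\mathbf{x} )\}}{\sum _{m=1}^{K}\exp \{G_m(\mathbf{x} )\}},\qquad k=1,\dotsc ,K,
\]
\[
l(F_2,\dotsc ,F_K)=E\left[ \sum _{k=1}^{K}y_k^*\,G_k(\mathbf{x} )-\log \sum _{m=1}^{K}\exp \{G_m(\mathbf{x} )\}\right] ,
\]
\[
s_k(\mathbf{x} )=E\left[ \sum _{j=k}^{K}\left( y_j^*-p_j(\mathbf{x} )\right) \,\Big |\,\mathbf{x} \right] ,\qquad k=2,\dotsc ,K,
\]
\[
H_{km}(\mathbf{x} )=-\gamma _{\max (k,m)}(\mathbf{x} )\left\{ 1-\gamma _{\min (k,m)}(\mathbf{x} )\right\} ,\qquad 2\le k,m\le K.
\]

The score uses \(\sum _k y_k^*=1\) and \(\partial G_j/\partial F_k=I_{\left[ j\ge k\right] }\); the Hessian comes only from the log-sum term, since \(\partial \gamma _k/\partial F_m=\gamma _{\max (k,m)}-\gamma _k\gamma _m\).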
If \(\mathbf{W} (\mathbf{x} )=-diag(\mathbf{H} (\mathbf{x} ))\), a quasi-Newton update for the ASILB algorithm is:
The full Newton update, which is implemented in the AMILB algorithm, is:
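The two updates can be illustrated numerically for a single observation. The following is a minimal sketch, not the authors' implementation: it assumes the adjacent-categories form \(\log \{p_k/p_{k-1}\}=F_k\) with \(F_1=0\), takes the diagonal update as \(F_k \leftarrow F_k + s_k/W_{kk}\) and the full update as \(\mathbf{F} \leftarrow \mathbf{F} -\mathbf{H} ^{-1}\mathbf{S} \), and all function names are hypothetical.

```python
import math

def adjacent_probs(F):
    """p_1..p_K from F = [F_2, ..., F_K], with F_1 = 0 (assumed model form)."""
    G = [0.0]
    for f in F:
        G.append(G[-1] + f)               # G_k = F_1 + ... + F_k
    top = max(G)                          # subtract the max for numerical stability
    e = [math.exp(g - top) for g in G]
    total = sum(e)
    return [v / total for v in e]

def upper_tails(v):
    """[sum_{j>=k} v_j for k = 2..K], where v holds values for classes 1..K."""
    tails, acc = [], 0.0
    for x in reversed(v[1:]):
        acc += x
        tails.append(acc)
    return tails[::-1]

def score_and_hessian(F, y):
    """Score and Hessian of log p_y with respect to (F_2, ..., F_K)."""
    K = len(F) + 1
    onehot = [1.0 if k == y else 0.0 for k in range(1, K + 1)]
    gamma = upper_tails(adjacent_probs(F))            # gamma_k = P(Y >= k | x)
    S = [a - g for a, g in zip(upper_tails(onehot), gamma)]
    H = [[-gamma[max(a, b)] * (1.0 - gamma[min(a, b)]) for b in range(K - 1)]
         for a in range(K - 1)]
    return S, H

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][j] * z[j] for j in range(r + 1, n))) / M[r][r]
    return z

def quasi_newton_step(F, y):
    """Diagonal update: F_k + s_k / W_kk with W = -diag(H)."""
    S, H = score_and_hessian(F, y)
    return [f + s / (-H[k][k]) for k, (f, s) in enumerate(zip(F, S))]

def newton_step(F, y):
    """Full update: F - H^{-1} S."""
    S, H = score_and_hessian(F, y)
    return [f - d for f, d in zip(F, solve(H, S))]

F = [0.3, -0.2, 0.5]      # hypothetical current values of F_2, F_3, F_4 at one x
print(quasi_newton_step(F, y=3))
print(newton_step(F, y=3))
```

In this sketch the Hessian does not depend on the observed class, so the two updates differ only in whether the off-diagonal terms are used.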
A.2 Theoretical justification for algorithms under the cumulative probabilities model
In this case we have to update the function \(F(\mathbf{x} )\) and the parameters \(\alpha _k,k=2,\dotsc ,K\). We will perform a “two step” update, first on \(F(\mathbf{x} )\) and then on the \(\alpha \) parameters.
Let us denote \(\gamma _k(\mathbf{x} )=\sum _{j=k}^K p_j(\mathbf{x} ),k=1,\dotsc ,K\), and assume the cumulative probabilities model (3). For this model
with \(\gamma _1(\mathbf{x} )=1\) and \(\gamma _{K+1}(\mathbf{x} )=0.\)
First, we perform the \(F(\mathbf{x} )\) update. Here, as in the previous model, we consider a single observation as this step is used for updating the weights and the values to be adjusted. The expected log-likelihood is:
Conditioning on \(\mathbf{x} \), the first and second derivatives for the population Newton algorithm are:
and
so that the Newton update for \(F(\mathbf{x} )\) is
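The displays for this first step can be sketched as follows, assuming that model (3) is the usual cumulative logit model \(\log [\gamma _k(\mathbf{x} )/\{1-\gamma _k(\mathbf{x} )\}]=\alpha _k+F(\mathbf{x} )\), \(k=2,\dotsc ,K\). This form is an assumption of the sketch and is not taken from the text.

\[
\gamma _k(\mathbf{x} )=\frac{\exp \{\alpha _k+F(\mathbf{x} )\}}{1+\exp \{\alpha _k+F(\mathbf{x} )\}},\quad k=2,\dotsc ,K,\qquad p_k(\mathbf{x} )=\gamma _k(\mathbf{x} )-\gamma _{k+1}(\mathbf{x} ),\quad k=1,\dotsc ,K,
\]
\[
l(F)=E\left[ \sum _{k=1}^{K}y_k^*\log \left\{ \gamma _k(\mathbf{x} )-\gamma _{k+1}(\mathbf{x} )\right\} \right] ,
\]
\[
\frac{\partial l}{\partial F}(\mathbf{x} )=E\left[ \sum _{k=1}^{K}y_k^*\left\{ 1-\gamma _k(\mathbf{x} )-\gamma _{k+1}(\mathbf{x} )\right\} \,\Big |\,\mathbf{x} \right] ,
\]
\[
\frac{\partial ^2 l}{\partial F^2}(\mathbf{x} )=-E\left[ \sum _{k=1}^{K}y_k^*\left\{ \gamma _k(\mathbf{x} )(1-\gamma _k(\mathbf{x} ))+\gamma _{k+1}(\mathbf{x} )(1-\gamma _{k+1}(\mathbf{x} ))\right\} \,\Big |\,\mathbf{x} \right] ,
\]
\[
F(\mathbf{x} )\leftarrow F(\mathbf{x} )+\frac{E\left[ \sum _{k}y_k^*\left\{ 1-\gamma _k(\mathbf{x} )-\gamma _{k+1}(\mathbf{x} )\right\} \mid \mathbf{x} \right] }{E\left[ \sum _{k}y_k^*\left\{ \gamma _k(\mathbf{x} )(1-\gamma _k(\mathbf{x} ))+\gamma _{k+1}(\mathbf{x} )(1-\gamma _{k+1}(\mathbf{x} ))\right\} \mid \mathbf{x} \right] }.
\]

The derivatives follow from \(\partial \gamma _k/\partial F=\gamma _k(1-\gamma _k)\) for \(k=2,\dotsc ,K\), which gives \(\partial \log p_k/\partial F=1-\gamma _k-\gamma _{k+1}\); the boundary terms vanish because \(\gamma _1(\mathbf{x} )=1\) and \(\gamma _{K+1}(\mathbf{x} )=0\).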
As for the parameters \(\alpha _k,k=2,\dotsc ,K\), let us denote \(\varvec{\alpha }=(\alpha _2,\dots ,\alpha _K)'\). Here we use all the observations \(\mathbf{x} _i\), since we are going to perform a Newton step to update the \(\alpha \)’s, which do not depend on \(\mathbf{x} \). The log-likelihood is then:
The score for the Newton algorithm \(\mathbf{S} =(s_2,\dots ,s_K)'\) is:
And the Hessian is a tri-diagonal symmetric \((K-1)\times (K-1)\) matrix \(\mathbf{H} =(H_{km})\) with \(H_{km} = \frac{\partial ^2 l(\varvec{\alpha })}{\partial \alpha _k\partial \alpha _m},\) for \(2\le k,m\le K\), such that
In these conditions the Newton update is \(\varvec{\alpha } \leftarrow \varvec{\alpha } - \mathbf{H} ^{-1} \mathbf{S} .\)
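A minimal sketch of this Newton step for the \(\alpha \)'s, exploiting the tri-diagonal Hessian. It assumes the cumulative logit form \(\gamma _k(\mathbf{x} )=1/[1+\exp \{-(\alpha _k+F(\mathbf{x} ))\}]\) for model (3), with the values \(F(\mathbf{x} _i)\) held fixed; the function names and the toy data are hypothetical, and the \(\alpha \)'s must be decreasing in \(k\) so that all class probabilities are positive.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def gammas(alpha, f):
    """[gamma_1, ..., gamma_{K+1}] at F(x) = f, for alpha = [alpha_2, ..., alpha_K]."""
    return [1.0] + [sigmoid(a + f) for a in alpha] + [0.0]

def loglik(alpha, F, y):
    """Sum over observations of log p_y, with p_y = gamma_y - gamma_{y+1}."""
    total = 0.0
    for f, yi in zip(F, y):
        g = gammas(alpha, f)
        total += math.log(g[yi - 1] - g[yi])
    return total

def score_and_hessian(alpha, F, y):
    """Score, Hessian diagonal and Hessian off-diagonal for the alpha vector."""
    n = len(alpha)                          # n = K - 1; position r holds alpha_{r+2}
    S, diag, off = [0.0] * n, [0.0] * n, [0.0] * (n - 1)
    for f, yi in zip(F, y):
        g = gammas(alpha, f)
        d = [v * (1.0 - v) for v in g]      # d_k = d gamma_k / d alpha_k
        p = g[yi - 1] - g[yi]
        lo, hi = yi - 2, yi - 1             # positions of alpha_y and alpha_{y+1}
        if lo >= 0:                         # p_y increases with alpha_y
            S[lo] += d[yi - 1] / p
            diag[lo] += d[yi - 1] * (1 - 2 * g[yi - 1]) / p - (d[yi - 1] / p) ** 2
        if hi < n:                          # p_y decreases with alpha_{y+1}
            S[hi] -= d[yi] / p
            diag[hi] -= d[yi] * (1 - 2 * g[yi]) / p + (d[yi] / p) ** 2
        if lo >= 0 and hi < n:              # only neighbouring alphas interact
            off[lo] += d[yi - 1] * d[yi] / p ** 2
    return S, diag, off

def solve_tridiagonal(diag, off, b):
    """Thomas algorithm for a symmetric tri-diagonal system (no pivoting)."""
    n = len(b)
    c, d = [0.0] * n, [0.0] * n
    c[0] = off[0] / diag[0] if n > 1 else 0.0
    d[0] = b[0] / diag[0]
    for r in range(1, n):
        denom = diag[r] - off[r - 1] * c[r - 1]
        if r < n - 1:
            c[r] = off[r] / denom
        d[r] = (b[r] - off[r - 1] * d[r - 1]) / denom
    z = [0.0] * n
    z[-1] = d[-1]
    for r in range(n - 2, -1, -1):
        z[r] = d[r] - c[r] * z[r + 1]
    return z

def newton_step_alpha(alpha, F, y):
    """One Newton update: alpha - H^{-1} S."""
    S, diag, off = score_and_hessian(alpha, F, y)
    delta = solve_tridiagonal(diag, off, S)
    return [a - dl for a, dl in zip(alpha, delta)]

alpha = [1.0, -0.5]                         # hypothetical alpha_2, alpha_3 (K = 3)
F = [-1.0, -0.3, 0.2, 0.8, 1.5, 0.0]        # hypothetical fitted values F(x_i)
y = [1, 1, 2, 3, 3, 2]                      # hypothetical observed classes
print(newton_step_alpha(alpha, F, y))
```

Each observation of class \(y\) only involves \(\alpha _y\) and \(\alpha _{y+1}\), which is why only neighbouring entries of the Hessian are non-zero and a linear-time tri-diagonal solver suffices.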
A.3 Full numerical results for the simulations performed
This subsection contains the tables with the full mean results for TMP, MAE and AUC in the two sets of simulations. Tables 7, 8 and 9 show the results for the first set of simulations, performed under model (2) for the different \(F\) functions listed in Table 3. Tables 10, 11 and 12 contain the mean results for the simulations performed under the uniform order-restricted predictors scheme. In all cases the best results appear in bold. Notice that there are no results for CSILB and CMILB when \(K=2\), because in that case those algorithms coincide with ASILB and AMILB. In this case the TMP and MAE values also coincide; they are given in both tables for completeness.
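The coincidence of TMP and MAE for \(K=2\) can be checked on a toy example. This is a minimal sketch that assumes TMP is estimated by the proportion of misclassified cases and MAE by the mean absolute difference between true and predicted class labels; the function names and the labels are hypothetical.

```python
def misclassification_rate(y_true, y_pred):
    """Proportion of cases whose predicted class differs from the true class."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_absolute_error(y_true, y_pred):
    """Mean absolute difference between true and predicted class labels."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# K = 2: every error has absolute size 1, so both measures agree.
two_true, two_pred = [1, 2, 2, 1, 2], [1, 1, 2, 2, 2]
print(misclassification_rate(two_true, two_pred), mean_absolute_error(two_true, two_pred))

# K = 3: an error can span two classes, so MAE can exceed the error rate.
three_true, three_pred = [1, 2, 3, 3], [3, 2, 3, 1]
print(misclassification_rate(three_true, three_pred), mean_absolute_error(three_true, three_pred))
```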
About this article
Cite this article
Conde, D., Fernández, M.A., Rueda, C. et al. Isotonic boosting classification rules. Adv Data Anal Classif 15, 289–313 (2021). https://doi.org/10.1007/s11634-020-00404-9
DOI: https://doi.org/10.1007/s11634-020-00404-9