Isotonic boosting classification rules

  • Regular Article, published in Advances in Data Analysis and Classification

Abstract

In many real classification problems, a monotone relation between some predictors and the classes may be assumed, in the sense that higher (or lower) values of those predictors are related to higher levels of the response. In this paper, we propose new boosting algorithms, based on LogitBoost, that incorporate this isotonicity information, yielding more accurate and easily interpretable rules. These algorithms rely on theoretical developments involving isotonic regression. We show the good performance of these procedures not only on simulations, but also on real data sets coming from two very different contexts, namely cancer diagnosis and failure of induction motors.


Acknowledgements

The authors thank the Associate Editor and two anonymous reviewers for suggestions that led to this improved version of the paper.

Funding

Funding was provided by Ministerio de Economía, Industria y Competitividad, Gobierno de España (Grant No. MTM2015-71217-R).

Author information

Corresponding author

Correspondence to Miguel A. Fernández.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

A.1 Theoretical justification for algorithms under the adjacent categories model

Let \(\mathbf{x} \in \mathbb {R}^d\), \(y\in \{1,\dots ,K\}\), \(y_k^*=I_{\left[ y=k\right] }\), \(k=1,\dotsc ,K\), and assume the adjacent probabilities model (2). Denote further \(F_1(\mathbf{x} )=0\), so that the a posteriori probabilities are:

$$\begin{aligned} p_k(\mathbf{x} )=\frac{\exp \left( \sum _{j=1}^k F_j(\mathbf{x} )\right) }{\sum _{m=1}^K \exp \left( \sum _{j=1}^m F_j(\mathbf{x} )\right) },\quad k=1,\dotsc ,K. \end{aligned}$$
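
For illustration, the short sketch below (Python with NumPy, not the paper's R implementation) shows how these posterior probabilities are obtained from the \(F_j\) values at a single point \(\mathbf{x} \); the numerical \(F\) values are hypothetical.

```python
import numpy as np

def adjacent_posteriors(F):
    """Posterior probabilities p_k(x) under the adjacent categories model.

    F holds (F_1(x), ..., F_K(x)) with F_1(x) = 0 by convention;
    p_k(x) is proportional to exp(F_1(x) + ... + F_k(x)).
    """
    cum_f = np.cumsum(F)              # sum_{j <= k} F_j(x)
    w = np.exp(cum_f - cum_f.max())   # subtract the max for numerical stability
    return w / w.sum()

# Hypothetical values F_2(x) = 0.8, F_3(x) = -0.3 at one point x (F_1(x) = 0)
p = adjacent_posteriors(np.array([0.0, 0.8, -0.3]))   # (p_1, p_2, p_3), summing to 1
```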

Now, the expected log-likelihood is:

$$\begin{aligned} E l(F_2,\dots ,F_K)= E\left[ \sum _{k=2}^K y_k^* \left( \sum _{j=2}^k F_j(\mathbf{x} )\right) - \log \left( 1+\sum _{k=2}^K \exp \left( \sum _{j=2}^k F_j(\mathbf{x} )\right) \right) \right] . \end{aligned}$$

Conditioning on \(\mathbf{x} \), the score vector \(\mathbf{s} (\mathbf{x} )=(s_k(\mathbf{x} ))\), \(k=2,\dotsc ,K\), for the population Newton algorithm is:

$$\begin{aligned} s_k(\mathbf{x} )= \frac{\partial E l(F_2(\mathbf{x} ),\dots ,F_K(\mathbf{x} ))}{\partial F_k(\mathbf{x} )}=E\left( \left. \sum _{j=k}^K (y_j^*-p_j(\mathbf{x} ))\right| \mathbf{x} \right) ,k=2,\dotsc ,K. \end{aligned}$$

The Hessian is a \((K-1)\times (K-1)\) matrix, \(\mathbf{H} (\mathbf{x} )=(H_{km}(\mathbf{x} ))\), \(2\le k,m\le K\), where each element \(H_{km}(\mathbf{x} )\) is:

$$\begin{aligned} H_{km}(\mathbf{x} ) = \frac{\partial ^2 E l(F_2(\mathbf{x} ),\dots ,F_K(\mathbf{x} ))}{\partial F_k(\mathbf{x} ) \partial F_m(\mathbf{x} )}= \left\{ \begin{array}{l l} -(\sum _{j=m}^K p_j(\mathbf{x} ))(1-\sum _{j=k}^K p_j(\mathbf{x} )), &{} m\ge k\\ -(\sum _{j=k}^K p_j(\mathbf{x} ))(1-\sum _{j=m}^K p_j(\mathbf{x} )), &{} m < k \end{array} \right. \end{aligned}$$

Setting \(\mathbf{W} (\mathbf{x} )=-\mathrm{diag}(\mathbf{H} (\mathbf{x} ))\), the quasi-Newton update used in the ASILB algorithm is:

$$\begin{aligned} \begin{bmatrix} F_2(\mathbf{x} ) \\ \vdots \\ F_K(\mathbf{x} ) \end{bmatrix} \leftarrow \begin{bmatrix} F_2(\mathbf{x} ) \\ \vdots \\ F_K(\mathbf{x} ) \end{bmatrix} + E_W\left( \left. \mathbf{W} ^{-1}(\mathbf{x} ) \mathbf{s} (\mathbf{x} )\right| \mathbf{x} \right) . \end{aligned}$$

The full Newton update, which is implemented in the AMILB algorithm, is:

$$\begin{aligned} \begin{bmatrix} F_2(\mathbf{x} ) \\ \vdots \\ F_K(\mathbf{x} ) \end{bmatrix} \leftarrow E_H\left( \left. \begin{bmatrix} F_2(\mathbf{x} ) \\ \vdots \\ F_K(\mathbf{x} ) \end{bmatrix} - \mathbf{H} ^{-1}(\mathbf{x} ) \mathbf{s} (\mathbf{x} )\right| \mathbf{x} \right) \end{aligned}$$
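
The sketch below computes, for a single observation, the score, the Hessian and the two update directions above (Python/NumPy, hypothetical inputs). The smoothing operators \(E_W\) and \(E_H\) of the updates are deliberately left out, so only the raw Newton directions are computed.

```python
import numpy as np

def newton_directions(y, p):
    """Score, Hessian and the two update directions of Appendix A.1 for a
    single observation; the E_W / E_H smoothing steps are omitted.

    y : one-hot vector (y_1*, ..., y_K*);  p : posteriors (p_1(x), ..., p_K(x)).
    """
    K = len(p)
    gamma = np.cumsum(p[::-1])[::-1]            # gamma[k-1] = sum_{j >= k} p_j
    # score s_k = sum_{j >= k} (y_j* - p_j), k = 2, ..., K
    s = np.array([np.sum(y[k - 1:] - p[k - 1:]) for k in range(2, K + 1)])
    # Hessian H_{km} = -gamma_{max(k,m)} (1 - gamma_{min(k,m)}), 2 <= k, m <= K
    H = np.empty((K - 1, K - 1))
    for a, k in enumerate(range(2, K + 1)):
        for b, m in enumerate(range(2, K + 1)):
            H[a, b] = -gamma[max(k, m) - 1] * (1.0 - gamma[min(k, m) - 1])
    W = -np.diag(np.diag(H))                    # W(x) = -diag(H(x))
    asilb_dir = np.linalg.solve(W, s)           # quasi-Newton increment (ASILB)
    amilb_dir = -np.linalg.solve(H, s)          # full Newton increment (AMILB)
    return s, H, asilb_dir, amilb_dir

# Hypothetical observation from class 2 with K = 3
s, H, d_asilb, d_amilb = newton_directions(np.array([0.0, 1.0, 0.0]),
                                           np.array([0.2, 0.5, 0.3]))
```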

A.2 Theoretical justification for algorithms under the cumulative probabilities model

In this case we have to update both the function \(F(\mathbf{x} )\) and the parameters \(\alpha _k\), \(k=2,\dotsc ,K\). We perform a two-step update, first on \(F(\mathbf{x} )\) and then on the \(\alpha \) parameters.

Let us denote \(\gamma _k(\mathbf{x} )=\sum _{j=k}^K p_j(\mathbf{x} ),k=1,\dotsc ,K\), and assume the cumulative probabilities model (3). For this model

$$\begin{aligned} \gamma _k(\mathbf{x} ) = \frac{\exp (\alpha _k+F(\mathbf{x} ))}{1+\exp (\alpha _k+F(\mathbf{x} ))},k=2,\dotsc ,K, \end{aligned}$$

with \(\gamma _1(\mathbf{x} )=1\) and \(\gamma _{K+1}(\mathbf{x} )=0.\)
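
As a quick check of the notation, here is a short sketch evaluating the \(\gamma _k(\mathbf{x} )\) and the induced class probabilities from hypothetical values of \(\alpha _k\) and \(F(\mathbf{x} )\) (Python/NumPy; the \(\alpha \)'s are taken non-increasing so that every \(p_k\) is non-negative).

```python
import numpy as np

def cumulative_gammas(alpha, Fx):
    """gamma_k(x) = P(y >= k | x) under the cumulative probabilities model,
    gamma_k = expit(alpha_k + F(x)) for k = 2, ..., K, with gamma_1 = 1 and
    gamma_{K+1} = 0; the class probabilities are p_k = gamma_k - gamma_{k+1}.

    alpha : array (alpha_2, ..., alpha_K), assumed non-increasing;
    Fx    : scalar value F(x).
    """
    g = 1.0 / (1.0 + np.exp(-(alpha + Fx)))     # gamma_2, ..., gamma_K
    gamma = np.concatenate(([1.0], g, [0.0]))   # gamma_1, ..., gamma_{K+1}
    p = gamma[:-1] - gamma[1:]                  # p_1, ..., p_K
    return gamma, p

# Hypothetical parameters for K = 4 classes at one point x
gamma, p = cumulative_gammas(alpha=np.array([1.0, 0.0, -1.5]), Fx=0.3)
```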

First, we perform the \(F(\mathbf{x} )\) update. Here, as in the previous model, we consider a single observation, since this step is used to update the weights and the values to be adjusted. The expected log-likelihood is:

$$\begin{aligned} E l(F)= E\left( \sum _{k=1}^K y_k^* \log (\gamma _k(\mathbf{x} ) - \gamma _{k+1}(\mathbf{x} ))\right) . \end{aligned}$$

Conditioning on \(\mathbf{x} \), the first and second derivatives for the population Newton algorithm are:

$$\begin{aligned} \frac{\partial E l(F(\mathbf{x} ))}{\partial F(\mathbf{x} )}= E\left( \left. \sum _{k=1}^K y_k^* [1-\gamma _k(\mathbf{x} ) - \gamma _{k+1}(\mathbf{x} )]\right| \mathbf{x} \right) , \end{aligned}$$

and

$$\begin{aligned} w(\mathbf{x} )= -\frac{\partial ^2 E l(F(\mathbf{x} ))}{\partial F(\mathbf{x} )^2} = E\left( \left. \sum _{k=1}^K y_k^* [\gamma _k(\mathbf{x} )(1-\gamma _k(\mathbf{x} )) + \gamma _{k+1}(\mathbf{x} )(1-\gamma _{k+1}(\mathbf{x} ))]\right| \mathbf{x} \right) , \end{aligned}$$

so that the Newton update for \(F(\mathbf{x} )\) is

$$\begin{aligned} F(\mathbf{x} ) \leftarrow E_H\left( \left. F(\mathbf{x} ) + \frac{1}{w(\mathbf{x} )}\frac{\partial E l(F(\mathbf{x} ))}{\partial F(\mathbf{x} )}\right| \mathbf{x} \right) . \end{aligned}$$
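
A corresponding one-observation sketch of the two terms entering this update, again in Python/NumPy, with the smoothing operator \(E_H\) omitted and hypothetical inputs:

```python
import numpy as np

def f_update_terms(y, gamma):
    """First derivative and weight w(x) for the F(x) Newton step of A.2,
    computed for one observation (the E_H smoothing step is omitted).

    y : one-hot (y_1*, ..., y_K*);  gamma : (gamma_1, ..., gamma_{K+1}).
    """
    gk, gk1 = gamma[:-1], gamma[1:]            # gamma_k and gamma_{k+1}, k = 1, ..., K
    grad = np.sum(y * (1.0 - gk - gk1))        # d E l / d F(x)
    w = np.sum(y * (gk * (1.0 - gk) + gk1 * (1.0 - gk1)))   # w(x)
    return grad, w

# One Newton step on F(x) would then be:  F(x) <- F(x) + grad / w
grad, w = f_update_terms(np.array([0.0, 1.0, 0.0, 0.0]),
                         np.array([1.0, 0.79, 0.57, 0.23, 0.0]))
```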

As for the parameters \(\alpha _k\), \(k=2,\dotsc ,K\), let us denote \(\varvec{\alpha }=(\alpha _2,\dots ,\alpha _K)'\). We now use all the observations \(\mathbf{x} _i\), since in this case the Newton step updates the \(\alpha \)’s, which do not depend on \(\mathbf{x} \). The log-likelihood is:

$$\begin{aligned} l(\varvec{\alpha })= \sum _{i=1}^n \sum _{k=1}^K y_{k,i}^* \log (\gamma _k(\mathbf{x} _i) - \gamma _{k+1}(\mathbf{x} _i)). \end{aligned}$$

The score for the Newton algorithm \(\mathbf{S} =(s_2,\dots ,s_K)'\) is:

$$\begin{aligned} s_k= \frac{\partial l(\varvec{\alpha })}{\partial \alpha _k} = \sum _{i=1}^n \left( \frac{y_{k,i}^*}{p_k(\mathbf{x} _i)}- \frac{y_{k-1,i}^*}{p_{k-1}(\mathbf{x} _i)}\right) \gamma _k(\mathbf{x} _i)(1-\gamma _k(\mathbf{x} _i)),k=2,\dotsc ,K. \end{aligned}$$
(4)

The Hessian is a tridiagonal symmetric \((K-1)\times (K-1)\) matrix \(\mathbf{H} =(H_{km})\), with \(H_{km} = \frac{\partial ^2 l(\varvec{\alpha })}{\partial \alpha _k\partial \alpha _m}\) for \(2\le k,m\le K\), such that

$$\begin{aligned} H_{kk}&= -\sum _{i=1}^n \gamma _k(\mathbf{x} _i)(1-\gamma _k(\mathbf{x} _i)) \left[ \frac{y_{k,i}^*}{p_k^2(\mathbf{x} _i)}(p_k^2(\mathbf{x} _i)+\gamma _{k+1}(\mathbf{x} _i)(1-\gamma _{k+1}(\mathbf{x} _i)))\right. \nonumber \\&\left. \quad +\frac{y_{k-1,i}^*}{p_{k-1}^2(\mathbf{x} _i)} (p_{k-1}^2(\mathbf{x} _i)+\gamma _{k-1}(\mathbf{x} _i)(1-\gamma _{k-1}(\mathbf{x} _i)))\right] \end{aligned}$$
(5)
$$\begin{aligned} H_{k,k-1}&=H_{k-1,k}= \sum _{i=1}^n \frac{y_{k-1,i}^*}{p_{k-1}^2(\mathbf{x} _i)}\gamma _k(\mathbf{x} _i)(1-\gamma _k(\mathbf{x} _i))\gamma _{k-1}(\mathbf{x} _i)(1-\gamma _{k-1}(\mathbf{x} _i)) \end{aligned}$$
(6)
$$\begin{aligned} H_{km}&=H_{mk}=0 \text { otherwise.} \end{aligned}$$
(7)

Under these conditions, the Newton update is \(\varvec{\alpha } \leftarrow \varvec{\alpha } - \mathbf{H} ^{-1} \mathbf{S} .\)
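
Finally, a vectorised sketch of this \(\varvec{\alpha }\) update over all \(n\) observations, following the score (4) and the Hessian (5)-(7); variable names and matrix layouts are illustrative and need not match the authors' R implementation.

```python
import numpy as np

def alpha_newton_step(alpha, Y, gamma, p):
    """One Newton update of alpha = (alpha_2, ..., alpha_K)' following (4)-(7).

    Y     : n x K matrix of indicators y_{k,i}^*.
    gamma : n x (K+1) matrix of gamma_k(x_i), including gamma_1 = 1, gamma_{K+1} = 0.
    p     : n x K matrix of p_k(x_i) = gamma_k(x_i) - gamma_{k+1}(x_i), assumed > 0.
    """
    n, K = Y.shape
    S = np.zeros(K - 1)
    H = np.zeros((K - 1, K - 1))
    for a, k in enumerate(range(2, K + 1)):                 # a indexes alpha_k, k = 2..K
        gk = gamma[:, k - 1] * (1.0 - gamma[:, k - 1])      # gamma_k (1 - gamma_k)
        gk1 = gamma[:, k] * (1.0 - gamma[:, k])             # gamma_{k+1} (1 - gamma_{k+1})
        gkm1 = gamma[:, k - 2] * (1.0 - gamma[:, k - 2])    # gamma_{k-1} (1 - gamma_{k-1})
        pk, pkm1 = p[:, k - 1], p[:, k - 2]
        yk, ykm1 = Y[:, k - 1], Y[:, k - 2]
        S[a] = np.sum((yk / pk - ykm1 / pkm1) * gk)         # score (4)
        H[a, a] = -np.sum(gk * (yk / pk ** 2 * (pk ** 2 + gk1)
                                + ykm1 / pkm1 ** 2 * (pkm1 ** 2 + gkm1)))   # diagonal (5)
        if a > 0:                                           # off-diagonal terms (6)
            H[a, a - 1] = H[a - 1, a] = np.sum(ykm1 / pkm1 ** 2 * gk * gkm1)
    return alpha - np.linalg.solve(H, S)                    # alpha <- alpha - H^{-1} S
```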

A.3 Full numerical results for the simulations performed

This subsection contains the tables with the full numerical mean results for TMP, MAE and AUC for the two sets of simulations. Tables 7, 8 and 9 show the results for the first set of simulations, performed under model (2) for the different F functions appearing in Table 3. Tables 10, 11 and 12 contain the mean results for the simulations performed under the uniform order-restricted predictors scheme. In all cases the best results appear in bold. Note that there are no results for CSILB and CMILB when \(K=2\), since in that case those algorithms coincide with ASILB and AMILB; the TMP and MAE values also coincide, but they are given in both tables for completeness.

Table 7 Mean TMP for the first simulation scheme for different classification rules, number of groups K, predictors d and functions F
Table 8 Mean MAE for the first simulation scheme for different classification rules, number of groups K, predictors d and functions F
Table 9 Mean AUC for the first simulation scheme for different classification rules, number of groups K, predictors d and functions F
Table 10 Mean TMP for different classification rules, number of groups K, predictors d and training sample sizes n, for the second set of simulations
Table 11 Mean MAE for different classification rules, number of groups K, predictors d and training sample sizes n, for the second set of simulations
Table 12 Mean AUC for different classification rules, number of groups K, predictors d and training sample sizes n, for the second set of simulations

Cite this article

Conde, D., Fernández, M.A., Rueda, C. et al. Isotonic boosting classification rules. Adv Data Anal Classif 15, 289–313 (2021). https://doi.org/10.1007/s11634-020-00404-9

