Isotonic boosting classification rules

  • Regular Article, published in Advances in Data Analysis and Classification

Abstract

In many real classification problems, a monotone relation between some predictors and the classes may be assumed, in the sense that higher (or lower) values of those predictors are related to higher levels of the response. In this paper, we propose new boosting algorithms, based on LogitBoost, that incorporate this isotonicity information, yielding more accurate and easily interpretable rules. These algorithms rely on theoretical developments involving isotonic regression. We show the good performance of these procedures not only on simulations, but also on real data sets coming from two very different contexts, namely cancer diagnosis and failure of induction motors.


Acknowledgements

The authors thank the Associate Editor and two anonymous reviewers for suggestions that led to this improved version of the paper.

Funding

Funding was provided by Ministerio de Economía, Industria y Competitividad, Gobierno de España (Grant No. MTM2015-71217-R).

Author information

Corresponding author

Correspondence to Miguel A. Fernández.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

A.1 Theoretical justification for algorithms under the adjacent categories model

Let \(\mathbf{x} \in \mathbb {R}^d\), \(y\in \{1,\dots ,K\}\), \(y_k^*=I_{\left[ y=k\right] }\), \(k=1,\dotsc ,K\), and assume the adjacent probabilities model (2). Denote further \(F_1(\mathbf{x} )=0\), so that the a posteriori probabilities are:

$$\begin{aligned} p_k(\mathbf{x} )=\frac{\exp \left( \sum _{j=1}^k F_j(\mathbf{x} )\right) }{\sum _{m=1}^K \exp \left( \sum _{j=1}^m F_j(\mathbf{x} )\right) },\quad k=1,\dotsc ,K. \end{aligned}$$
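
For illustration, the short sketch below (Python with NumPy, not the paper's R implementation) shows how these posterior probabilities are obtained from the \(F_j\) values at a single point \(\mathbf{x} \); the numerical \(F\) values are hypothetical.

```python
import numpy as np

def adjacent_posteriors(F):
    """Posterior probabilities p_k(x) under the adjacent categories model.

    F holds (F_1(x), ..., F_K(x)) with F_1(x) = 0 by convention;
    p_k(x) is proportional to exp(F_1(x) + ... + F_k(x)).
    """
    cum_f = np.cumsum(F)              # sum_{j <= k} F_j(x)
    w = np.exp(cum_f - cum_f.max())   # subtract the max for numerical stability
    return w / w.sum()

# Hypothetical values F_2(x) = 0.8, F_3(x) = -0.3 at one point x (F_1(x) = 0)
p = adjacent_posteriors(np.array([0.0, 0.8, -0.3]))   # (p_1, p_2, p_3), summing to 1
```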

Now, the expected log-likelihood is:

$$\begin{aligned} E l(F_2,\dots ,F_K)= E\left[ \sum _{k=2}^K y_k^* \left( \sum _{j=2}^k F_j(\mathbf{x} )\right) - \log \left( 1+\sum _{k=2}^K \exp \left( \sum _{j=2}^k F_j(\mathbf{x} )\right) \right) \right] . \end{aligned}$$

Conditioning on \(\mathbf{x} \), the score vector \(\mathbf{s} (\mathbf{x} )=(s_k(\mathbf{x} ))\), \(k=2,\dotsc ,K\), for the population Newton algorithm is:

$$\begin{aligned} s_k(\mathbf{x} )= \frac{\partial E l(F_2(\mathbf{x} ),\dots ,F_K(\mathbf{x} ))}{\partial F_k(\mathbf{x} )}=E\left( \left. \sum _{j=k}^K (y_j^*-p_j(\mathbf{x} ))\right| \mathbf{x} \right) ,k=2,\dotsc ,K. \end{aligned}$$

The Hessian is a \((K-1)\times (K-1)\) matrix, \(\mathbf{H} (\mathbf{x} )=(H_{km}(\mathbf{x} ))\), \(2\le k,m\le K\), where each element \(H_{km}(\mathbf{x} )\) is:

$$\begin{aligned} H_{km}(\mathbf{x} ) = \frac{\partial ^2 E l(F_2(\mathbf{x} ),\dots ,F_K(\mathbf{x} ))}{\partial F_k(\mathbf{x} ) \partial F_m(\mathbf{x} )}= \left\{ \begin{array}{l l} -(\sum _{j=m}^K p_j(\mathbf{x} ))(1-\sum _{j=k}^K p_j(\mathbf{x} )), &{} m\ge k\\ -(\sum _{j=k}^K p_j(\mathbf{x} ))(1-\sum _{j=m}^K p_j(\mathbf{x} )), &{} m < k \end{array} \right. \end{aligned}$$

Setting \(\mathbf{W} (\mathbf{x} )=-\mathrm{diag}(\mathbf{H} (\mathbf{x} ))\), the quasi-Newton update used in the ASILB algorithm is:

$$\begin{aligned} \begin{bmatrix} F_2(\mathbf{x} ) \\ \vdots \\ F_K(\mathbf{x} ) \end{bmatrix} \leftarrow \begin{bmatrix} F_2(\mathbf{x} ) \\ \vdots \\ F_K(\mathbf{x} ) \end{bmatrix} + E_W\left( \left. \mathbf{W} ^{-1}(\mathbf{x} ) \mathbf{s} (\mathbf{x} )\right| \mathbf{x} \right) . \end{aligned}$$

The full Newton update, which is implemented in the AMILB algorithm, is:

$$\begin{aligned} \begin{bmatrix} F_2(\mathbf{x} ) \\ \vdots \\ F_K(\mathbf{x} ) \end{bmatrix} \leftarrow E_H\left( \left. \begin{bmatrix} F_2(\mathbf{x} ) \\ \vdots \\ F_K(\mathbf{x} ) \end{bmatrix} - \mathbf{H} ^{-1}(\mathbf{x} ) \mathbf{s} (\mathbf{x} )\right| \mathbf{x} \right) \end{aligned}$$
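
The sketch below computes, for a single observation, the score, the Hessian and the two update directions above (Python/NumPy, hypothetical inputs). The smoothing operators \(E_W\) and \(E_H\) of the updates are deliberately left out, so only the raw Newton directions are computed.

```python
import numpy as np

def newton_directions(y, p):
    """Score, Hessian and the two update directions of Appendix A.1 for a
    single observation; the E_W / E_H smoothing steps are omitted.

    y : one-hot vector (y_1*, ..., y_K*);  p : posteriors (p_1(x), ..., p_K(x)).
    """
    K = len(p)
    gamma = np.cumsum(p[::-1])[::-1]            # gamma[k-1] = sum_{j >= k} p_j
    # score s_k = sum_{j >= k} (y_j* - p_j), k = 2, ..., K
    s = np.array([np.sum(y[k - 1:] - p[k - 1:]) for k in range(2, K + 1)])
    # Hessian H_{km} = -gamma_{max(k,m)} (1 - gamma_{min(k,m)}), 2 <= k, m <= K
    H = np.empty((K - 1, K - 1))
    for a, k in enumerate(range(2, K + 1)):
        for b, m in enumerate(range(2, K + 1)):
            H[a, b] = -gamma[max(k, m) - 1] * (1.0 - gamma[min(k, m) - 1])
    W = -np.diag(np.diag(H))                    # W(x) = -diag(H(x))
    asilb_dir = np.linalg.solve(W, s)           # quasi-Newton increment (ASILB)
    amilb_dir = -np.linalg.solve(H, s)          # full Newton increment (AMILB)
    return s, H, asilb_dir, amilb_dir

# Hypothetical observation from class 2 with K = 3
s, H, d_asilb, d_amilb = newton_directions(np.array([0.0, 1.0, 0.0]),
                                           np.array([0.2, 0.5, 0.3]))
```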

A.2 Theoretical justification for algorithms under the cumulative probabilities model

In this case we have to update both the function \(F(\mathbf{x} )\) and the parameters \(\alpha _k\), \(k=2,\dotsc ,K\). We perform a two-step update, first on \(F(\mathbf{x} )\) and then on the \(\alpha \) parameters.

Let us denote \(\gamma _k(\mathbf{x} )=\sum _{j=k}^K p_j(\mathbf{x} ),k=1,\dotsc ,K\), and assume the cumulative probabilities model (3). For this model

$$\begin{aligned} \gamma _k(\mathbf{x} ) = \frac{\exp (\alpha _k+F(\mathbf{x} ))}{1+\exp (\alpha _k+F(\mathbf{x} ))},k=2,\dotsc ,K, \end{aligned}$$

with \(\gamma _1(\mathbf{x} )=1\) and \(\gamma _{K+1}(\mathbf{x} )=0.\)
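
As a quick check of the notation, here is a short sketch evaluating the \(\gamma _k(\mathbf{x} )\) and the induced class probabilities from hypothetical values of \(\alpha _k\) and \(F(\mathbf{x} )\) (Python/NumPy; the \(\alpha \)'s are taken non-increasing so that every \(p_k\) is non-negative).

```python
import numpy as np

def cumulative_gammas(alpha, Fx):
    """gamma_k(x) = P(y >= k | x) under the cumulative probabilities model,
    gamma_k = expit(alpha_k + F(x)) for k = 2, ..., K, with gamma_1 = 1 and
    gamma_{K+1} = 0; the class probabilities are p_k = gamma_k - gamma_{k+1}.

    alpha : array (alpha_2, ..., alpha_K), assumed non-increasing;
    Fx    : scalar value F(x).
    """
    g = 1.0 / (1.0 + np.exp(-(alpha + Fx)))     # gamma_2, ..., gamma_K
    gamma = np.concatenate(([1.0], g, [0.0]))   # gamma_1, ..., gamma_{K+1}
    p = gamma[:-1] - gamma[1:]                  # p_1, ..., p_K
    return gamma, p

# Hypothetical parameters for K = 4 classes at one point x
gamma, p = cumulative_gammas(alpha=np.array([1.0, 0.0, -1.5]), Fx=0.3)
```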

First, we perform the \(F(\mathbf{x} )\) update. Here, as in the previous model, we consider a single observation, since this step is used to update the weights and the values to be adjusted. The expected log-likelihood is:

$$\begin{aligned} E l(F)= E\left( \sum _{k=1}^K y_k^* \log (\gamma _k(\mathbf{x} ) - \gamma _{k+1}(\mathbf{x} ))\right) . \end{aligned}$$

Conditioning on \(\mathbf{x} \), the first and second derivatives for the population Newton algorithm are:

$$\begin{aligned} \frac{\partial E l(F(\mathbf{x} ))}{\partial F(\mathbf{x} )}= E\left( \left. \sum _{k=1}^K y_k^* [1-\gamma _k(\mathbf{x} ) - \gamma _{k+1}(\mathbf{x} )]\right| \mathbf{x} \right) , \end{aligned}$$

and

$$\begin{aligned} w(\mathbf{x} )= -\frac{\partial ^2 E l(F(\mathbf{x} ))}{\partial F(\mathbf{x} )^2} = E\left( \left. \sum _{k=1}^K y_k^* [\gamma _k(\mathbf{x} )(1-\gamma _k(\mathbf{x} )) + \gamma _{k+1}(\mathbf{x} )(1-\gamma _{k+1}(\mathbf{x} ))]\right| \mathbf{x} \right) , \end{aligned}$$

so that the Newton update for \(F(\mathbf{x} )\) is

$$\begin{aligned} F(\mathbf{x} ) \leftarrow E_H\left( \left. F(\mathbf{x} ) + \frac{1}{w(\mathbf{x} )}\frac{\partial E l(F(\mathbf{x} ))}{\partial F(\mathbf{x} )}\right| \mathbf{x} \right) . \end{aligned}$$
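
A corresponding one-observation sketch of the two terms entering this update, again in Python/NumPy, with the smoothing operator \(E_H\) omitted and hypothetical inputs:

```python
import numpy as np

def f_update_terms(y, gamma):
    """First derivative and weight w(x) for the F(x) Newton step of A.2,
    computed for one observation (the E_H smoothing step is omitted).

    y : one-hot (y_1*, ..., y_K*);  gamma : (gamma_1, ..., gamma_{K+1}).
    """
    gk, gk1 = gamma[:-1], gamma[1:]            # gamma_k and gamma_{k+1}, k = 1, ..., K
    grad = np.sum(y * (1.0 - gk - gk1))        # d E l / d F(x)
    w = np.sum(y * (gk * (1.0 - gk) + gk1 * (1.0 - gk1)))   # w(x)
    return grad, w

# One Newton step on F(x) would then be:  F(x) <- F(x) + grad / w
grad, w = f_update_terms(np.array([0.0, 1.0, 0.0, 0.0]),
                         np.array([1.0, 0.79, 0.57, 0.23, 0.0]))
```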

As for the parameters \(\alpha _k\), \(k=2,\dotsc ,K\), let us denote \(\varvec{\alpha }=(\alpha _2,\dots ,\alpha _K)'\). We now use all the observations \(\mathbf{x} _i\), since in this case the Newton step updates the \(\alpha \)’s, which do not depend on \(\mathbf{x} \). The log-likelihood is:

$$\begin{aligned} l(\varvec{\alpha })= \sum _{i=1}^n \sum _{k=1}^K y_{k,i}^* \log (\gamma _k(\mathbf{x} _i) - \gamma _{k+1}(\mathbf{x} _i)). \end{aligned}$$

The score for the Newton algorithm \(\mathbf{S} =(s_2,\dots ,s_K)'\) is:

$$\begin{aligned} s_k= \frac{\partial l(\varvec{\alpha })}{\partial \alpha _k} = \sum _{i=1}^n \left( \frac{y_{k,i}^*}{p_k(\mathbf{x} _i)}- \frac{y_{k-1,i}^*}{p_{k-1}(\mathbf{x} _i)}\right) \gamma _k(\mathbf{x} _i)(1-\gamma _k(\mathbf{x} _i)),k=2,\dotsc ,K. \end{aligned}$$
(4)

The Hessian is a tridiagonal symmetric \((K-1)\times (K-1)\) matrix \(\mathbf{H} =(H_{km})\), with \(H_{km} = \frac{\partial ^2 l(\varvec{\alpha })}{\partial \alpha _k\partial \alpha _m}\) for \(2\le k,m\le K\), such that

$$\begin{aligned} H_{kk}&= -\sum _{i=1}^n \gamma _k(\mathbf{x} _i)(1-\gamma _k(\mathbf{x} _i)) \left[ \frac{y_{k,i}^*}{p_k^2(\mathbf{x} _i)}(p_k^2(\mathbf{x} _i)+\gamma _{k+1}(\mathbf{x} _i)(1-\gamma _{k+1}(\mathbf{x} _i)))\right. \nonumber \\&\left. \quad +\frac{y_{k-1,i}^*}{p_{k-1}^2(\mathbf{x} _i)} (p_{k-1}^2(\mathbf{x} _i)+\gamma _{k-1}(\mathbf{x} _i)(1-\gamma _{k-1}(\mathbf{x} _i)))\right] \end{aligned}$$
(5)
$$\begin{aligned} H_{k,k-1}&=H_{k-1,k}= \sum _{i=1}^n \frac{y_{k-1,i}^*}{p_{k-1}^2(\mathbf{x} _i)}\gamma _k(\mathbf{x} _i)(1-\gamma _k(\mathbf{x} _i))\gamma _{k-1}(\mathbf{x} _i)(1-\gamma _{k-1}(\mathbf{x} _i)) \end{aligned}$$
(6)
$$\begin{aligned} H_{km}&=H_{mk}=0 \text { otherwise.} \end{aligned}$$
(7)

Under these conditions, the Newton update is \(\varvec{\alpha } \leftarrow \varvec{\alpha } - \mathbf{H} ^{-1} \mathbf{S} .\)
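
Finally, a vectorised sketch of this \(\varvec{\alpha }\) update over all \(n\) observations, following the score (4) and the Hessian (5)-(7); variable names and matrix layouts are illustrative and need not match the authors' R implementation.

```python
import numpy as np

def alpha_newton_step(alpha, Y, gamma, p):
    """One Newton update of alpha = (alpha_2, ..., alpha_K)' following (4)-(7).

    Y     : n x K matrix of indicators y_{k,i}^*.
    gamma : n x (K+1) matrix of gamma_k(x_i), including gamma_1 = 1, gamma_{K+1} = 0.
    p     : n x K matrix of p_k(x_i) = gamma_k(x_i) - gamma_{k+1}(x_i), assumed > 0.
    """
    n, K = Y.shape
    S = np.zeros(K - 1)
    H = np.zeros((K - 1, K - 1))
    for a, k in enumerate(range(2, K + 1)):                 # a indexes alpha_k, k = 2..K
        gk = gamma[:, k - 1] * (1.0 - gamma[:, k - 1])      # gamma_k (1 - gamma_k)
        gk1 = gamma[:, k] * (1.0 - gamma[:, k])             # gamma_{k+1} (1 - gamma_{k+1})
        gkm1 = gamma[:, k - 2] * (1.0 - gamma[:, k - 2])    # gamma_{k-1} (1 - gamma_{k-1})
        pk, pkm1 = p[:, k - 1], p[:, k - 2]
        yk, ykm1 = Y[:, k - 1], Y[:, k - 2]
        S[a] = np.sum((yk / pk - ykm1 / pkm1) * gk)         # score (4)
        H[a, a] = -np.sum(gk * (yk / pk ** 2 * (pk ** 2 + gk1)
                                + ykm1 / pkm1 ** 2 * (pkm1 ** 2 + gkm1)))   # diagonal (5)
        if a > 0:                                           # off-diagonal terms (6)
            H[a, a - 1] = H[a - 1, a] = np.sum(ykm1 / pkm1 ** 2 * gk * gkm1)
    return alpha - np.linalg.solve(H, S)                    # alpha <- alpha - H^{-1} S
```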

A.3 Full numerical results for the simulations performed

This subsection contains the tables with the full numerical mean results for TMP, MAE and AUC for the two sets of simulations. Tables 7, 8 and 9 show the results for the first set of simulations, performed under model (2) for the different F functions appearing in Table 3. Tables 10, 11 and 12 contain the mean results for the simulations performed under the uniform order-restricted predictors scheme. In all cases the best results appear in bold. Note that there are no results for CSILB and CMILB when \(K=2\), since in that case those algorithms coincide with ASILB and AMILB; the TMP and MAE values also coincide, but they are given in both tables for completeness.

Table 7 Mean TMP for the first simulation scheme for different classification rules, number of groups K, predictors d and functions F
Table 8 Mean MAE for the first simulation scheme for different classification rules, number of groups K, predictors d and functions F
Table 9 Mean AUC for the first simulation scheme for different classification rules, number of groups K, predictors d and functions F
Table 10 Mean TMP for different classification rules, number of groups K, predictors d and training sample sizes n, for the second set of simulations
Table 11 Mean MAE for different classification rules, number of groups K, predictors d and training sample sizes n, for the second set of simulations
Table 12 Mean AUC for different classification rules, number of groups K, predictors d and training sample sizes n, for the second set of simulations

Cite this article

Conde, D., Fernández, M.A., Rueda, C. et al. Isotonic boosting classification rules. Adv Data Anal Classif 15, 289–313 (2021). https://doi.org/10.1007/s11634-020-00404-9

