Logistic Regression Tree Analysis
 Wei-Yin Loh
Abstract
This chapter describes a tree-structured extension and generalization of the logistic regression method for fitting models to a binary-valued response variable. The technique overcomes a significant disadvantage of logistic regression, namely the difficulty of interpreting the model in the face of multicollinearity and Simpson's paradox. Section 29.1 summarizes the statistical theory underlying the logistic regression model and the estimation of its parameters. Section 29.2 reviews two standard approaches to model selection for logistic regression, namely, model deviance relative to its degrees of freedom and the Akaike information criterion (AIC). A dataset on tree damage during a severe thunderstorm is used to compare the approaches and to highlight their weaknesses. A recently published partial one-dimensional model that addresses some of the weaknesses is also reviewed.
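For reference, the standard quantities named above can be written as follows. The notation is an assumption made here and is not taken from the chapter: $K$ predictors $x_1, \dots, x_K$, ungrouped binary responses $y_i \in \{0, 1\}$, and fitted probabilities $\hat{p}_i$.

```latex
\begin{align}
\log\frac{p(\mathbf{x})}{1 - p(\mathbf{x})} &= \beta_0 + \sum_{j=1}^{K} \beta_j x_j, \\
D &= -2 \sum_{i=1}^{n} \bigl[\, y_i \log \hat{p}_i + (1 - y_i) \log(1 - \hat{p}_i) \,\bigr], \\
\mathrm{AIC} &= D + 2(K + 1).
\end{align}
```

The first line is the linear logistic model, the second is the deviance for ungrouped binary data (where it equals minus twice the maximized log-likelihood), and the third penalizes the deviance by twice the number of estimated coefficients.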
Section 29.3 introduces the idea of a logistic regression tree model. The latter consists of a binary tree in which a simple linear logistic regression (i.e., a linear logistic regression using a single predictor variable) is fitted to each leaf node. A split at an intermediate node is characterized by a subset of values taken by a (possibly different) predictor variable. The objective is to partition the dataset into rectangular pieces according to the values of the predictor variables such that a simple linear logistic regression model adequately fits the data in each piece. Because the tree structure and the piecewise models can be presented graphically, the whole model can be easily understood. This is illustrated with the thunderstorm dataset using the LOTUS algorithm.
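The structure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the LOTUS algorithm: the split variable and threshold are supplied by hand instead of being selected, there is no pruning, and the function names (`fit_simple_logistic`, `fit_one_split_tree`, `predict`) and the toy data are hypothetical.

```python
import math

def sigmoid(z):
    # Clamp z so math.exp cannot overflow on extreme inputs.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def fit_simple_logistic(x, y, iters=25):
    """Fit logit P(y=1) = b0 + b1*x by Newton-Raphson; returns (b0, b1)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = sigmoid(b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += yi - p              # score for the intercept
            g1 += (yi - p) * xi       # score for the slope
            h00 += w                  # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if abs(det) < 1e-12:          # singular information matrix: stop
            break
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

def fit_one_split_tree(rows, split_var, threshold, fit_var):
    """One binary split, then a single-predictor logistic fit in each leaf."""
    def leaf(subset):
        return fit_simple_logistic([r[fit_var] for r in subset],
                                   [r["y"] for r in subset])
    left = [r for r in rows if r[split_var] <= threshold]
    right = [r for r in rows if r[split_var] > threshold]
    return {"split": (split_var, threshold), "fit_var": fit_var,
            "left": leaf(left), "right": leaf(right)}

def predict(tree, row):
    """Route the row to its leaf and evaluate that leaf's logistic model."""
    var, thr = tree["split"]
    b0, b1 = tree["left"] if row[var] <= thr else tree["right"]
    return sigmoid(b0 + b1 * row[tree["fit_var"]])

# Hypothetical toy data: the effect of x on y reverses between the groups z=0 and z=1.
rows = []
for x in range(1, 10):
    for i in range(10):
        rows.append({"z": 0, "x": x, "y": 1 if i < x else 0})
        rows.append({"z": 1, "x": x, "y": 1 if i < 10 - x else 0})

tree = fit_one_split_tree(rows, split_var="z", threshold=0.5, fit_var="x")
```

In this toy setup the two leaves recover slopes of opposite sign, which a single logistic model in `x` fitted to the pooled data could not represent. Each leaf model has only an intercept and one slope, so it can be drawn as a single curve.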
Section 29.4 describes the basic elements of the LOTUS algorithm, which is based on recursive partitioning and cost-complexity pruning. A key feature of the algorithm is a correction for bias in variable selection at the splits of the tree. Without bias correction, the splits can yield incorrect inferences. Section 29.5 shows an application of LOTUS to a dataset on automobile crash tests involving dummies. This dataset is challenging because of its large size, its mix of ordered and unordered variables, and its large number of missing values. It also provides a demonstration of Simpson's paradox. The chapter concludes with some remarks in Sect. 29.6.
 Title
 Logistic Regression Tree Analysis
 Reference Work Title
 Springer Handbook of Engineering Statistics
 Reference Work Part
 Part D
 Pages
 pp 537-549
 Copyright
 2006
 DOI
 10.1007/978-1-84628-288-1_29
 Print ISBN
 978-1-85233-806-0
 Online ISBN
 978-1-84628-288-1
 Publisher
 Springer London
 Copyright Holder
 Springer-Verlag London
 Editors

 Prof. Hoang Pham ^{(1)}
 Editor Affiliations

 1. Department of Industrial and Systems Engineering, Rutgers, the State University of New Jersey
 Authors

 Wei-Yin Loh ^{(1)}
 Author Affiliations

 1. Department of Statistics, University of Wisconsin-Madison, 1300 University Avenue, Madison, WI 53706, USA