Tree-structured scale effects in binary and ordinal regression

In binary and ordinal regression one can distinguish between a location component and a scaling component. While the former determines the location within the range of the response categories, the scaling indicates variance heterogeneity. In particular since it has been demonstrated that misleading effects can occur if one ignores the presence of a scaling component, it is important to account for potential scaling effects in the regression model, which is not possible in available recursive partitioning methods. The proposed recursive partitioning method yields two trees: one for the location and one for the scaling. They show in a simple interpretable way how variables interact to determine the binary or ordinal response. The developed algorithm controls for the global significance level and automatically selects the variables that have an impact on the response. The modeling approach is illustrated by several real-world applications.


Introduction
Tree-based models are strong non parametric tools that allow to investigate interaction effects of covariates on responses.The basic concept is very simple: By binary recursive partitioning the predictor space is partitioned into a set of rectangles and on each rectangle a simple model (for example a constant) is fitted.
The most popular versions are CART (Breiman et al., 1984), which is an abbreviation for classification and regression trees, and conditional inference trees, abbreviated by CTREE (Hothorn et al., 2006).Introductions and overviews were given, among others, by Loh (2014) and Strobl et al. (2009).Recursive partitioning methods, or simply trees, have several advantages: (i) they can be used in high dimensional settings because they provide automatic variable selection, (ii) they have a built-in interaction detector, and (iii) they are easy to interpret and visualize.Besides classical regression trees for metrically scaled response variables, also versions for binary and ordinal responses are available, see Piccarreta (2008), Archer (2010) and Galimberti et al. (2012).
The objective of the present paper is to introduce trees in regression structures with ordinal responses that include scale effects, which are needed if unobserved heterogeneity of variances is present.The modelling of scale effects in ordinal regression was already considered by McCullagh (1980), who introduced the socalled location-scale model and gave a simple example with one binary covariate dealing with the quality of right eye vision for men and women.The locationscale model was considered and extended, among others, by Cox (1995) and Tutz and Berger (2017); Ishwaran and Gatsonis (2000) investigated the link to ROC analysis, Hedeker et al. (2008), Hedeker et al. (2009) and Hedeker et al. (2012) showed how to use it in the case of repeated ordinal measurements.
Scale effects are also found in binary data.Their potential impact found much attention since Allison (1999) demonstrated that comparisons of binary model coefficients across groups can be misleading if one has underlying heterogeneity of residual variances.The problem has been investigated in various papers since then, see Williams (2009), Mood (2010), Karlson et al. (2012), Breen et al. (2014) and Rohwer (2015).One strategy to account for heterogeneity is to use McCullagh's location-scale model, which in the social sciences is also known as the heterogeneous choice or heteroskedastic logit model (Alvarez and Brehm, 1995;Williams, 2009).It is included in various program packages as Stata, Limdep, SAS, and R.
As a parametric model that uses linear predictors the location-scale model is rather restrictive.In particular interactions of higher order are hard to include and lower order interactions are restricted to linear interactions.Tree-based methods offer a non parametric alternative to investigate the interaction structure and automatically select variables.Variable selection is important since typically it is not known which variables contribute to location and to scaling.Since there are two components in the model, location and scaling, classical recursive partitioning methods can not be used.The method developed in the following is explicitly designed to account for these two components.Two separate trees are obtained, one for each component.
In Section 2 the basic approach is introduced and illustrated by an application.In Section 3 the proposed algorithm is given in detail.More applications are considered in Section 4. The paper concludes with a summary given in Section 5.

Trees with Scale Effects
In the following we first consider basic ordinal models and the problems that might occur if variance heterogeneity is ignored.Then we introduce the treestructured modelling approach that is proposed.

Proportional Odds and Location-Scale Model
A common way to derive ordinal regression models is to assume that a latent variable is behind the ordinal response Y .Let the latent regression model have the form i is the latent variable, x i is a vector of covariates, and σ is the standard deviation of the noise variable ε i , which has symmetric distribution function F (.).The essential concept is to consider the ordinal response as a categorized version of the latent variable with the link between the observable ordinal variable Y i with k categories and the latent variable Y * i given by where are thresholds on the latent scale.Simple derivation yields that the response probabilities are given by where α 0r = θ r − α 0 .However, the model parameters are not identifiable.An identifiable version is obtained by setting σ = 1 or, equivalently, using β 0r = α 0r /σ, β = α/σ, which yields the cumulative model The most prominent member of the family of cumulative models is the proportional odds model, which uses the logistic distribution function The strength of model ( 3) is that the parameters have an easily accessible interpretation.Let γ r (x i ) = P (Y i > r|x i )/P (Y i ≤ r|x i ) denote the cumulative odds for category r.Then one can derive that the effect of the jth variable is given by which does not depend on r.That means that e β j represents the multiplicative change in cumulative odds if x ij increases by one unit for each category.Of course, the interpretation holds only if the model holds or is at least a good approximation to the data generating model.It has been shown that the cumulative model (2) can yield very misleading results if there is variance heterogeneity in the underlying continuous regression model.Allison (1999) considered an example with the binary response being the promotion to an associate professor from the assistant professor level.It turned out that the number of published articles had a much stronger effect for male researchers than for female researchers, which seems rather unfair.He demonstrated that this effect could be due to heterogeneous variances.
The effect of heterogeneous variances is easily seen.Let the latent regression model be given by Y * i = α 0 + x T i α + σ i ε i , where σ i now depends on the specific observation i.In the simplest case one has σ i = z i γ, where z i is an indicator variable, which takes the value one for group 1 (for example males) and the value zero for group 0 (for example females).Then the simple cumulative model ( 2) is mis-specified.The derivation from the latent variable yields for observations from group 1 and for observations from group 0 .
(5) Thus, effects of covariates differ between the groups.One has α/σ in group 1 and α in group 0. If, for example, σ = 0.5 the effect strength in group 1 is twice the effect strength in group 0. The dependence on the group is simply ignored if one sets σ = 1, which is typically assumed in categorical regression.It means that in both groups the same scaling is used, although different ones are needed, see also Williams (2009), Mood (2010).This form of mis-specification can be avoided by explicit modelling of the heterogeneity of variances.Let the standard deviation be determined by σ i = exp(z T i γ), where z i is an additional vector of covariates, then one obtains from assumption (1) the location-scale model which for the logistic distribution function yields The model contains two terms in the predictor that specifies the impact of covariates.The first is the location term β 0r +x T i β, the second is the variance or scaling term exp(z T i γ), which derives from the "variance equation" σ i = exp(z T i γ).Importantly, if x i and z i are distinct the interpretation of the x-variables is the same as in the proportional odds model.With γ r (x i , z i ) = P (Y i > r|x i , z i )/P (Y i ≤ r|x i , z i ) denoting the cumulative odds for category r one obtains again the relation (4) and therefore an interpretation of parameters that does not depend on the category.
The location-scale model was introduced by McCullagh (1980) but is also known as heterogeneous choice model or heteroscedastic logit model (Alvarez and Brehm, 1995).It should be noted that although the scaling component is typically motivated from variance heterogeneity it can also be seen as representing interactions or effect-modifying effects, see Rohwer (2015), Tutz (2018).As Williams (2010) noted, it is also strongly related to the logistic response model with proportionality constraints proposed by Hauser and Andrew (2006) and extended by Fullerton and Xu (2012).

Tree-Structured Location-Scale Models
Recursive partitioning methods for ordinal responses have been proposed by Archer (2010), Galimberti et al. (2012), Janitza et al. (2016) and are available in R packages.Also the conditional unbiased recursive partitioning framework as proposed by Hothorn et al. (2006) allows to fit trees for ordinal responses.However, all of these methods do not account for possible heterogeneity induced by variance.
The problem with modelling heterogeneity is that one has to fit two separate predictors, the location term and the variance term.In the traditional locationscale model ( 7) they are represented by the linear predictor β 0r − x T i β and the variance term exp(z T i γ), respectively.The tree proposed here also distinguishes between location and variance; for both components separate trees are fitted.It is crucial that the partitioning of location and variance terms has to be done in a coordinated way.Trees have to be grown by taking both components into account simultaneously.
In the following, we first sketch the basic algorithm, which will be given in more detail in Section 3. The basic concept is to replace the predictor η ir = (β 0r − x T i β)/exp(z T i γ) of the location-scale model ( 7) by coordinated recursive partitioning terms.

Basic Algorithm
Let us consider the building of a tree when starting at the root.We will focus on metrically scaled and ordinal (including binary) covariates.In this case the partition of a node A into two subsets A 1 and A 2 has the form with regard to threshold c on variable x j .

First Step
For each variable x j and all corresponding thresholds c that can be built for this variable one investigates the following fits: Alternatively one can replace I(.) by I * (.) = 2I(.)− 1, which means one uses effect coding and replaces the 0−1 dummy variable by the variable (b) Variance Term: One fits the location-scale model with one split in the variance term and predictor .
Then one obtains One chooses the best split according to an appropriate splitting criterion (for details, see Section 3) among all the fitted models from (a) and (b).Thus in the first step one split is performed either in the location term or the variance term.

Later Steps
In later steps the splitting is done in a similar way.Let A loc 1 , . . ., A loc m loc denote the nodes (subsets of the predictor space) of the location term from the previous steps.Accordingly, let A sc 1 , . . ., A sc msc denote the nodes (subsets of the predictor space) of the variance term from the previous steps.Note, that all nodes are determined by a product of indicator functions.For example, if the splits were in the metric variables x 3 and x 7 a node may be determined by One fits all the candidate models (a) for the splitting of A loc k , k = 1, . . ., m loc in the location term with predictors to obtain the (m loc + 1)-th node in the location term with parameter estimate β, (b) for the splitting of A sc k , k = 1, . . ., m sc in the variance term with predictor .
to obtain the (m sc + 1)-th node in the variance term with parameter estimate γ.
One chooses the best split according to an appropriate splitting criterion among all the possible models from (a) and (b).Again, each step means an update of the location term or the variance term.After termination of the algorithm according to an appropriate stopping criterion, the final model consists of two trees, one for the location component and one for the scale component, with different partitions.
We refer to the concept as tree-structured model building to distinguish it from the model-based recursive partitioning models as considered by Zeileis et al. (2008).
The basic idea of model-based recursive partitioning is to fit models in subspaces of the predictor space and then decide which partitioning explains the predictorresponse relationships best.Of course elaborated methods are needed to ensure that the splits represent relevant information, for example, by using appropriate tests, see Zeileis et al. (2008).Although in principle this approach could also be used in the location-scale framework the obtained tree would not separate between the the two types of influential terms.The main difference between tree-structured modelling and model-based recursive partitioning is that treestructured model building means that the predictor structure is determined by trees, whereas model-based approaches do not structure the predictor but fit the whole model in sub spaces.Model-based trees yield separate trees for the two influential terms, one tree for the location and one tree for the variance heterogeneity.Thus, it is easily seen which variables contribute to which component.Tree structures in the predictor have been considered before, but in a quite different context; Berger and Tutz (2017) and Tutz and Berger (2018) considered trees to model the effect of categorical predictors on the response if the predictors have a very large number of categories.Before considering an illustrative example we briefly consider the interpretation of parameters.Let A loc 1 , . . ., A loc m loc denote the end nodes of the location term, and A sc 1 , . . ., A sc msc denote the end nodes of the variance term.Then one has the predictor The interpretation is similar to the interpretation of parameters in the locationscale model, the β-parameters indicate the location and the γ-parameters variance heterogeneity.For illustration let us consider extreme cases. -If That means the size of β s indicates the preference for high categories.
-If γ → ∞ one obtains for x i ∈ A loc (fixed location component) the probabilities P (Y i = 1|x i ) = P (Y i = k|x i ) = 0.5, that means maximal heterogeneity with all responses in the extreme categories.

Illustrative Example Confidence Data
We consider data from the general social survey of social science, in short ALLBUS, a study by the German institute GESIS.The data is available from http://www.gesis.org/allbus.Our analysis is based on a subset containing 2935 respondents of the ALLBUS in 2012.The response is the confidence in the federal government measured on a symmetric scale from 1 (no confidence at all/excessive distrust) to 7 (excessive confidence).As explanatory variables we consider the gender (0: male, 1: female), the income in thousands of Euros, the age in decades (centered at 50) and the self reported interest in politics from 1 (very strong interest) to 5 (no interest at all).
Figure 1 shows the tree obtained for the location term and Figure 2 the tree for the variance term.It is seen that the main drivers of confidence are interest in politics and age.Among respondents that have strong interest in political issues (interest=5) those above 40 years of age have weak confidence (node 5) whereas those below 40 years tend to prefer higher categories (node 4).Among respondents that are less interested in politics in particular young people (age lesser than 25) and older people (age above 74) show a strong tendency to choose high confidence categories ( βs = 0.960 and βs = 0.841).From the variance tree it is seen that females with high interest (node 9; γ = 0.168) and males with low income (node 4; γ = 0.227) are the most heterogeneous groups with comparatively large variance.Older males with high income (node 7) followed by younger females with moderate interest form the most homogeneous groups.

The Algorithm in Detail
In all tree-based methods, one has to decide in particular how to split and how to determine the size of the trees.In traditional approaches, one typically grows large trees and prunes them to an adequate size afterwards, see Breiman et al. (1984) and Ripley (1996).An alternative strategy, which was propagated within the conditional unbiased recursive partitioning framework (Hothorn et al., 2006), is to directly control the size of the trees by early stopping.We also use this approach and control the significance of splits by using tests for cumulative regression models.
Let us consider again the construction of the first split.A split in the location term with regard to the j-th variable yields the model with predictor a split in the variance term with regard to the j-the variable yields the model with predictor .
To test for the best split among all the covariates, the set of possible split points and the two components (location or variance) one examines all the null hypotheses H 0 : β j = 0 and H 0 : γ j = 0 and selects that split as the optimal one that has the smallest p-value.As test statistic, we use the LR test statistic.Computing the LR test statistic requires fitting of both models, the full model and the restricted model under H 0 .We nevertheless prefer the LR statistic because it corresponds to selecting the model with minimal deviance.This criterion is also equivalent to minimizing the entropy, which belongs to the family of impurity measures.
To decide whether the selected split should be performed, we apply a concept based on maximally selected statistics.The basic idea is to investigate the dependence of the ordinal response and the selected variable at a global level that takes the number of splits into account.For one fixed component and variable j, one simultaneously considers all LR test statistics T jc j , where c j are from the set of possible split points, and computes the maximal value statistic T j = max c j T jc j .The p-value that can be obtained by the distribution of T j provides a measure for the relevance of variable j.The result is not influenced by the number of split points since they have been taken into account yielding unbiased selection; for similar approaches, which inspired the proposed method, see Hothorn and Lausen (2003), Shih (2004), Shih and Tsai (2004), Strobl et al. (2007).The method explicitly accounts for the involved multiple testing problem.As the distribution of T j in general is unknown we use a permutation test to obtain a decision on the null hypothesis.The distribution of T j is determined by computing the maximal value statistics based on random permutations of variable j.A random permutation of variable j breaks the relation of the covariate and the response in the original data.By computing the maximal value statistics for a large number of permutations one obtains an approximation of the distribution under the null hypothesis and the corresponding p-value.Importantly, to determine the p-value with sufficient accuracy, the number of permutations should increase with the number of covariates.
In all later steps the basic procedure is the same, one searches for the statistic with the maximal value trying all combinations of variables and split points in both components.For the components that already have been split (location, variance or both) one starts from already built nodes.Given overall significance level α the significance level for the permutation test that tests splits in one variable is chosen by α/p, where p denotes the number of covariates that are available.

(Tree Building).
(a) For all explanatory variables x j , j = 1, . . ., p, fit all the candidate models with one additional split in one of the already built nodes in both components.
(b) Select the best model using the p-values of the LR test statistics.
(c) Carry out the permutation test for the selected node (defined by a combination of variable, split point and component) with significance level α/p.If significant, fit the selected model and continue with Step 2(a), else continue with Step 3.

(Selected Model)
. Fit the final model with components β0r , β and γ.
The final model consists of one or two separate trees, one referring to the location component and one referring to the variance component.In general, the trees will be different but can also yield the same partitioning.It should be noted that in contrast to the way trees are grown in traditional recursive partitioning all parameter estimates change if an additional split is performed.

Biochemists Data
Let use consider the application used by Allison (1999) when investigating the problem if effects of variables differ over gender groups.The data set, which has also been used by Long et al. (1993) and Williams (2009), investigates the careers of 301 male and 177 female biochemists (the following description is adapted from Allison, 1999).Binary regression is used to predict the probability of promotion to associate professor from the assistant professor level (1: no promotion, 2: promotion).The variables in the model are the number of years since the beginning of the assistant professorship (years), undergraduate selectivity as a measure of the selectivity of the colleges where scientists received their bachelor's degrees (select), the number of articles (articles) representing the cumulative number of articles published by the end of each person year, and job prestige (prestige) measuring the prestige of the department in which scientists were employed.Figures 3 and 4 show the fitted trees for location and variance, respectively.While Allison (1999) focused on gender as a relevant variable in the variance term, it is seen from the trees that gender does not seem to be very influential; neither in the location term nor in the variance term gender is present.A similar result was obtained by Williams (2010).When he used a stepwise forward strategy to select variables in the parametric location-scale model the only variable that entered the variance equation was the number of articles.He also made a plausible argument for this by stating that "there may be little residual variability among biochemists with few articles (with most of them being denied tenure) but there may be much more variability among biochemists with more articles (having many articles may be a necessary but not sufficient condition for tenure)." It is seen from the trees that the chances of a promotion to associate professor are best for biochemists who have spent at least three years at a department with not the highest prestige (node 6).Applicants with articles ≤ 6 or articles > 6 in combination with year ≤ 4 seem to form the most homogeneous groups.To evaluate the issue of unfairness further we fitted trees when only the covariates gender and number of articles are included in the analysis.The corresponding trees are given in Figure 5.It is seen that only the number of articles was found to have an impact on location as well as on variance.There is no indication that gender plays a crucial role for the promotion to associate professor.

Retinopathy Data
In a 6-year followup study on diabetes and retinopathy status reported by Bender and Grouven (1998) the interesting question was how the retinopathy status is associated with risk factors.The considered risk factors wer smoking (SM = 1: smoker, SM = 0: non-smoker), diabetes duration (DIAB) measured in years, glycosylated hemoglobin (GH), which is measured in percent, and diastolic blood pressure (BP) measured in mmHg.The response variable retinopathy status has three categories (1: no retinopathy; 2: nonproliferative retinopathy; 3: advanced retinopathy or blind).
It is seen from Figure 6 that in particular the duration of diabetes is influential followed by glycosylated hemoglobin.The lowest risk is found in node 10 (DIAB ≤ 13.57, GH ≤ 7.36).Even if GH > 7.36 but DIAB ≤ 11.53 the risk is still very low.The highest risks are found for long duration of diabetes DIAB ≤ 23.34 in combination with low values of glycosylated hemoglobin GH ≤ 7.96 (node 7) and in node 9, which combines long diabetes duration and high values of glycosylated hemoglobin and diastolic blood pressure.Figure 7 shows that patients with longer duration of diabetes are more homogeneous (sharing higher risk) than patients with lower values of diabetes duration.

Summary and Concluding Remarks
Let us summarize the strengths of the proposed tree method.
-One obtains two trees, one for the location and one for the variance.Thus, it is clearly seen which variables have an impact on which component.
-The obtained trees have a simple interpretation showing which combinations of variables determine the preference of categories, and which sub populations form more homogeneous or heterogeneous groups.
-By fitting a scale (or variance) component the method avoids misleading effects that may occur if one ignores potential variance heterogeneity.
-As in all tree-based methods interactions are explicitly modelled and there is a built-in variable selection procedure.
The presented algorithm is constructed such that only variables for which a significant effect can be detected are included.By controlling for the overall significance level the inclusion of irrelevant variables is avoided.It has the effect that the procedure tends to include relatively few variables, in particular if many variables are available.However, the method can also be used in an exploratory way.If one uses a significance level distinctly larger than .05one obtains much larger trees, which might hint at further possible interaction effects.Nevertheless, we think it is essential to control for the significance level, which gets lost in many procedures, especially if one first fits trees and then starts pruning as in conventional trees.An R implementation of the proposed tree-structured model including an auxiliary function to plot the trees, as well as exemplary code to reproduce the illustrative example is available from GitHub (https://github.com/jmober/LocationScaleTree).
(a) Location term: One fits the location-scale model with one split in the location term and predictor η ir = β 0r − βI(x ij ≤ c) , where I(.) is the indicator function.Then one obtains

Figure 1 :Figure 2 :
Figure 1: Tree for location term of confidence data.The parameter estimates βs are given in the terminal nodes.variance term

Figure 3 :Figure 4 :
Figure 3: Tree for location term of biochemists example.The parameter estimates βs are given in the terminal nodes.varianceterm

Figure 5 :
Figure 5: Tree for location (left) and variance (right) of biochemists example with only gender and articles included.The parameter estimates βs and γ are given in the terminal nodes, respectively.

Figure 6 :
Figure 6: Tree for location term of retinopathy data.The parameter estimates βs are given in the terminal nodes.variance term

Figure 7 :
Figure 7: Tree for variance term of retinopathy data.The parameter estimates γ are given in the terminal nodes.