1 Introduction

In many scientific disciplines, quantitative information is collected on a set of variables for a number of objects or participants. When the set of variables can be divided into a set of P predictors and a set of R responses, regression-type models are needed. Researchers often neglect the multivariate nature of the response set, and consequently univariate multiple regression models are fitted, one for each response variable separately. Such an approach does not take into account that the response variables might be correlated and does not provide insight into the relationships between the response variables. In this paper, the interest lies in the case where the R response variables are binary.

For binary response variables, the typical regression model is a logistic regression (Agresti 2013), that is, a generalized linear model (McCullagh and Nelder 1989) with logit link function and a Bernoulli or binomial distribution. In logistic regression models, the probability that person \(i\) answers yes (or 1) on the response variable Y is defined as \(\pi _{i} = P(Y_{i} = 1)\). These probabilities are commonly defined in terms of the log-odds form (in GLMs the “linear predictor”), denoted by \(\theta _{i}\), that is,

$$\begin{aligned} \pi _{i} = \frac{\exp (\theta _{i})}{1 + \exp (\theta _{i})} = \frac{1}{1 + \exp (-\theta _{i})}, \end{aligned}$$

and similarly

$$\begin{aligned} 1 - \pi _{i} = \frac{1}{1 + \exp (\theta _{i})} = \frac{\exp (-\theta _{i})}{1 + \exp (-\theta _{i})}. \end{aligned}$$

Finally, the log-odds form is a (linear) function of the P predictor variables, that is

$$\begin{aligned} \theta _i = m + \sum _{p=1}^P x_{ip}a_p, \end{aligned}$$

where \(x_{ip}\) is the observed value for person i on predictor variable p, and m and \(a_p\) are the parameters that need to be estimated. The intercept m is the expected log-odds when all predictor variables equal zero, and the regression weight \(a_p\) indicates the difference in log-odds between two observations that differ by one unit in predictor variable p and have equal values for all other predictor variables.
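To make this concrete, the following minimal R sketch simulates data under this model and recovers the intercept and weights with glm(); the variable names and generating values are our own illustration, not taken from the paper.

```r
# Minimal sketch: univariate logistic regression on simulated data.
# Variable names and generating values are illustrative only.
set.seed(1)
N <- 200; P <- 3
X <- matrix(rnorm(N * P), N, P)          # predictor matrix
a <- c(0.8, -0.5, 0.3); m <- -0.2        # 'true' weights and intercept
theta <- m + X %*% a                     # log-odds (linear predictor)
prob  <- 1 / (1 + exp(-theta))           # inverse logit
y <- rbinom(N, 1, prob)                  # binary response
fit <- glm(y ~ X, family = binomial)     # logit link by default
coef(fit)                                # estimates of m and the a_p
```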

With R outcome variables (\(Y_r\), \(r = 1,\ldots , R\)), a multivariate model can be defined with probabilities \(\pi _{ir} = P(Y_{ir} = 1)\) that are parameterized as

$$\begin{aligned} \pi _{ir} = \frac{\exp (\theta _{ir})}{1 + \exp (\theta _{ir})} = \frac{1}{1 + \exp (-\theta _{ir})}, \end{aligned}$$

and the log-odds form is

$$\begin{aligned} \theta _{ir} = m_r + \sum _{p=1}^P x_{ip}a_{pr}. \end{aligned}$$

The intercepts can be collected in a vector \({\varvec{m}}\) and the regression weights can be collected in a matrix \({\varvec{A}}\) of size \(P \times R\).

For multivariate outcomes, Yee and Hastie (2003) proposed reduced rank vector generalized linear models, which are multivariate models with a rank constraint on the matrix of regression weights, that is,

$$\begin{aligned} {\varvec{A}} = {\varvec{BV}}' \end{aligned}$$

where \({\varvec{B}}\) is a matrix of size \(P \times S\) and \({\varvec{V}}\) a matrix of size \(R \times S\). The rank of the matrix \({\varvec{A}}\) is S, a number in the range 1 to \(\min (P, R)\), and when \(S < \min (P, R)\) the rank is reduced, hence the name reduced rank regression. The matrix \({\varvec{B}}\) has elements \(b_{ps}\) for \(s = 1,\ldots , S\) and \({\varvec{V}}\) has elements \(v_{rs}\). The S elements for the predictor variable p (i.e., the p-th row of \({\varvec{B}}\)), are collected in the column vector \({\varvec{b}}_p\); similarly, the S elements for response variable r are collected in the column vector \({\varvec{v}}_r\).
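As a small illustration of this constraint, the R sketch below (with arbitrary dimensions and random values of our own choosing) builds a coefficient matrix of the form \({\varvec{A}} = {\varvec{BV}}'\) and verifies that its rank equals S.

```r
# Sketch: a rank-S coefficient matrix A = B V' (dimensions chosen arbitrarily).
set.seed(2)
P <- 6; R <- 4; S <- 2
B <- matrix(rnorm(P * S), P, S)   # P x S
V <- matrix(rnorm(R * S), R, S)   # R x S
A <- B %*% t(V)                   # P x R, but of rank S
qr(A)$rank                        # equals S (= 2) since S < min(P, R)
```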

At this point, it is instructive to take a step back to models for continuous outcomes. Reduced rank regression (Anderson 1951; Izenman 1975; Tso 1981; Davies and Tso 1982), also called redundancy analysis (Van den Wollenberg 1977), has been proposed as a multivariate tool for simultaneously predicting the responses from the set of predictors. Reduced rank regression can be motivated from a regression point of view, but also from a principal component point of view.

From the regression point of view of reduced rank regression, the goal is to predict the response variables using the set of predictor variables. We, therefore, set up a multivariate regression model

$$\begin{aligned} {\varvec{Y}} = {\varvec{1}}{\varvec{m}}' + {\varvec{XA}} + {\varvec{E}}, \end{aligned}$$

where \({\varvec{m}}\) denotes the vector with the R intercepts, \({\varvec{A}}\) is a \(P \times R\) matrix with regression weights, and \({\varvec{E}}\) is a matrix with residuals. In the usual multivariate regression model the matrix of regression coefficients is unrestricted. In reduced rank regression, the restriction \({\varvec{A}} = {\varvec{B}}{\varvec{V}}'\) is imposed.

Reduced rank regression can also be cast as a constrained principal component analysis (PCA, Takane (2013)). In PCA, the matrix with multivariate responses (\({\varvec{Y}}\)) of size \(N \times R\) is decomposed into a matrix with object scores (\({\varvec{U}}\)) of size \(N \times S\) and a matrix with variable loadings (\({\varvec{V}}\)) of size \(R \times S\). The rank, or dimensionality, should be smaller than or equal to \(\min (N, R)\). We can write PCA as

$$\begin{aligned} {\varvec{Y}} = {\varvec{1}}{\varvec{m}}' + {\varvec{UV}}' + {\varvec{E}}, \end{aligned}$$

where usually, identifiability constraints are imposed such as \({\varvec{V}}'{\varvec{V}} = {\varvec{I}}\) or \({\varvec{U}}'{\varvec{U}} = N{\varvec{I}}\). To estimate the PCA parameters, usually the means of the responses are computed, that is \(\hat{{\varvec{m}}} = N^{-1} {\varvec{Y}}'{\varvec{1}}\). Then, the centered response matrix \({\varvec{Y}}_c = {\varvec{Y}} - {\varvec{1}}\hat{{\varvec{m}}}'\) is computed, and subsequently \({\varvec{U}}\) and \({\varvec{V}}\) are estimated by minimizing the least squares loss function

$$\begin{aligned} L({\varvec{U}}, {\varvec{V}}) = \Vert {\varvec{Y}}_c - {\varvec{UV}}' \Vert ^2, \end{aligned}$$

where \(\Vert \cdot \Vert ^2\) denotes the squared Frobenius norm of a matrix. Eckart and Young (1936) show that this can be achieved by a singular value decomposition. With predictor variables, the scores (\({\varvec{U}}\)) can be restricted to be a linear combination of these predictor variables, sometimes called external variables (i.e., \({\varvec{X}}\)), that is, \({\varvec{U}} = {\varvec{XB}}\), with \({\varvec{B}}\) a \(P \times S\) matrix to be estimated. Ten Berge (1993) shows that \({\varvec{B}}\) and \({\varvec{V}}\) can be estimated using a generalized singular value decomposition in the metrics \({\varvec{X}}'{\varvec{X}}\) and \({\varvec{I}}\) (see Appendix A for details); that is, we decompose \(({\varvec{X}}'{\varvec{X}})^{-\frac{1}{2}}{\varvec{X}}'{\varvec{Y}}_c\) by a singular value decomposition,

$$\begin{aligned} ({\varvec{X}}'{\varvec{X}})^{-\frac{1}{2}}{\varvec{X}}'{\varvec{Y}}_c = {\varvec{P}}{\varvec{\Phi }}{\varvec{Q}}' \end{aligned}$$
(1)

where \({\varvec{P}}'{\varvec{P}} = {\varvec{Q}}'{\varvec{Q}} = {\varvec{I}}\) and \({\varvec{\Phi }}\) is a diagonal matrix with singular values. Subsequently, define the following estimates

$$\begin{aligned} \hat{{\varvec{B}}} = ({\varvec{X}}'{\varvec{X}})^{-\frac{1}{2}}{\varvec{P}} \end{aligned}$$
(2)
$$\begin{aligned} \hat{{\varvec{V}}} = {\varvec{Q}}{\varvec{\Phi }}. \end{aligned}$$
(3)

The intercepts \({\varvec{m}}\) are subsequently estimated as

$$\begin{aligned} \hat{{\varvec{m}}} = N^{-1} ({\varvec{Y}} - {\varvec{XBV}}')'{\varvec{1}}. \end{aligned}$$
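For continuous responses, these closed-form estimates are easy to compute directly. The R sketch below is our own translation of Expressions 1, 2, and 3 (function and variable names are ours); it assumes \({\varvec{X}}'{\varvec{X}}\) is invertible.

```r
# Sketch of the closed-form reduced rank regression estimates (Eqs. 1-3).
# Our own helper, not code from any package; assumes X'X is invertible.
rrr_ls <- function(X, Y, S) {
  Yc <- scale(Y, center = TRUE, scale = FALSE)       # centered responses Y_c
  e  <- eigen(solve(crossprod(X)), symmetric = TRUE)
  XtX_msqrt <- e$vectors %*% diag(sqrt(e$values)) %*% t(e$vectors)  # (X'X)^{-1/2}
  sv <- svd(XtX_msqrt %*% crossprod(X, Yc))          # P Phi Q'  (Eq. 1)
  B  <- XtX_msqrt %*% sv$u[, 1:S, drop = FALSE]                     # Eq. (2)
  V  <- sv$v[, 1:S, drop = FALSE] %*% diag(sv$d[1:S], S)            # Eq. (3)
  m  <- colMeans(Y - X %*% B %*% t(V))               # intercepts
  list(m = m, B = B, V = V)
}
```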

Reduced rank regression can thus be understood from two different points of view: as a multivariate regression model with constraints on the regression weights, or as a principal component analysis with constraints on the object scores. Yee and Hastie (2003) approached logistic reduced rank regression as a multivariate regression model with a rank constraint on the regression weights. Logistic reduced rank regression can also be approached from the PCA point of view. PCA for binary variables has received considerable attention lately (Collins et al. 2001; Schein et al. 2003; De Leeuw 2006; Landgraf and Lee 2020).

De Leeuw (2006) defined the log-odds term \(\theta _{ir}\) in terms of a principal component analysis, \(\theta _{ir} = m_r + {\varvec{u}}_{i}'{\varvec{v}}_{r}\), where \({\varvec{u}}_i\) is the vector with object scores for participant i. As in standard PCA, object scores and variable loadings are obtained. These scores and loadings, however, reconstruct the log-odds term, not the response variables themselves. De Leeuw (2006) also proposed a Majorization Minimization (MM) algorithm (Heiser 1995; Hunter and Lange 2004; Nguyen 2017) for maximum likelihood estimation of the parameters. In the majorization step, the negative log-likelihood is majorized by a least squares function. This majorizing least squares function is minimized by applying a singular value decomposition to a matrix of working responses, that is, variables that play the role of the response variables within each iteration of the algorithm and are updated from iteration to iteration.

Yee and Hastie (2003, Sect. 3.4) note that “in general, minimization of the negative likelihood cannot be achieved by use of the singular value decomposition” and therefore proposed an alternating algorithm, where in each iteration first \({\varvec{B}}\) is estimated with \({\varvec{V}}\) held fixed, and subsequently \({\varvec{V}}\) is estimated with \({\varvec{B}}\) held fixed. For both steps a weighted least squares update is derived, where in every iteration the weights and the responses need to be redefined based on the current set of parameters. In the next section, we develop an MM algorithm for logistic reduced rank regression (note: not for all reduced rank generalized linear models) based on the work of De Leeuw (2006), where in each iteration a generalized singular value decomposition is applied. We compare the two algorithms in terms of computation speed in Sect. 4.

For interpretation of (logistic) reduced rank models, a researcher can inspect the estimated coefficients \({\varvec{A}} = {\varvec{BV}}'\). Coefficients in this matrix can be interpreted like the usual regression weights in (logistic) regression models. Because the number of coefficients is usually large (i.e., \(P \times R\)), it is difficult to obtain a holistic interpretation from this matrix. Visualization can help to obtain such a holistic interpretation of the reduced rank model. PCA solutions can be graphically represented by biplots (Gabriel 1971; Gower and Hand 1996; Gower et al. 2011). Biplots are generalizations of the usual scatterplots to multivariate data, where the observations (\(i = 1,\ldots , N\)) and the variables (\(r = 1,\ldots , R\)) are represented in low dimensional visualizations. The observations are represented by points, whereas the variables are represented by variable axes. These biplots have been extended for reduced rank regression (Ter Braak and Looman 1994); such displays represent not only observations and response variables but also predictor variables by variable axes, that is, they represent three different types of information, and therefore we call them triplots.

For visualization of the logistic reduced rank regression model, two types of triplots have been proposed. Vicente-Villardón et al. (2006) and Vicente-Villardón and Vicente-Gonzalez (2019) modified the usual biplots/triplots for the representation of binary data based on logistic models. Another type of triplot, proposed by Poole and Rosenthal (1985), Clinton et al. (2004), Poole et al. (2011), and De Rooij and Groenen (2023), represents the logistic reduced rank model in a distance framework. In these triplots, each of the response variables is represented by two points instead of a variable axis, one for each response category. In Sect. 3, we discuss the two types of triplots, show the advantages and disadvantages of both, and propose a hybrid type of triplot that combines elements of both.

In Sect. 4, we compare our new algorithm in terms of speed against that of Yee and Hastie (2003) on two empirical data sets. Using these data sets, we also show the hybrid triplot and provide an interpretation. We end this paper with a brief discussion.

2 An MM-algorithm for logistic reduced rank regression

Logistic models are often fitted by maximizing the likelihood, or equivalently minimizing the negative log-likelihood, that is

$$\begin{aligned} {\mathcal {L}}({\varvec{\theta }}) = -\sum _{i=1}^N\sum _{r = 1}^R \left[ y_{ir} \log (\pi _{ir}) + (1 - y_{ir}) \log (1 - \pi _{ir}) \right] , \end{aligned}$$

where the probabilities (\(\pi _{ir}\)) are functions of \(\theta _{ir}\), with \(\theta _{ir} = m_r + {\varvec{x}}_i'{\varvec{B}}{\varvec{v}}_r\). Let us define \(q_{ir} = 2 y_{ir} - 1\), so that the loss function can be written as

$$\begin{aligned} {\mathcal {L}}({\varvec{\theta }}) = -\sum _{i=1}^N\sum _{r = 1}^R \log \frac{1}{1 + \exp (-q_{ir}\theta _{ir})}. \end{aligned}$$

To estimate the parameters, we derive an MM-algorithm below, following the ideas of De Leeuw (2006). We implemented this algorithm in the R package lmap (De Rooij and Busing 2022).

2.1 General theory about MM algorithms

The idea of MM for finding a minimum of the function \({\mathcal {L}}({\varvec{\theta }})\), where \({\varvec{\theta }}\) is a vector of parameters, is to define an auxiliary function, called a majorization function, \({\mathcal {M}}({\varvec{\theta }}|{\varvec{\vartheta }})\), with two characteristics:

$$\begin{aligned} {\mathcal {L}}({\varvec{\vartheta }}) = {\mathcal {M}}({\varvec{\vartheta }}|{\varvec{\vartheta }}) \end{aligned}$$

where \({\varvec{\vartheta }}\) is a supporting point, and

$$\begin{aligned} {\mathcal {L}}({\varvec{\theta }}) \le {\mathcal {M}}({\varvec{\theta }}|{\varvec{\vartheta }}). \end{aligned}$$

The two equations tell us that \({\mathcal {M}}({\varvec{\theta }}|{\varvec{\vartheta }})\) is a function that lies above (i.e., majorizes) the original function and touches the original function at the supporting point. Because of these two properties, iteratively minimizing the majorization function defines a convergent algorithm, because by construction

$$\begin{aligned} {\mathcal {L}}({\varvec{\theta }}^+) \le {\mathcal {M}} ({\varvec{\theta }}^+|{\varvec{\vartheta }}) \le {\mathcal {M}} ({\varvec{\vartheta }}|{\varvec{\vartheta }}) = {\mathcal {L}}({\varvec{\vartheta }}), \end{aligned}$$

where \({\varvec{\theta }}^+\) is

$$\begin{aligned} {\varvec{\theta }}^+ = \textrm{argmin}_{{\varvec{\theta }}} \ {\mathcal {M}}({\varvec{\theta }}|{\varvec{\vartheta }}), \end{aligned}$$

the updated parameter. A main advantage of MM algorithms is that they always converge monotonically to a (local) minimum. The challenge is to find a parametrized function family, \({\mathcal {M}}({\varvec{\theta }}|{\varvec{\vartheta }})\), that can be used in every step.

In our case, the original function equals the negative log-likelihood. We majorize this function with the least squares function, that is, \({\mathcal {M}}({\varvec{\theta }}|{\varvec{\vartheta }})\) is a least squares function. We use majorization via a quadratic upper bound, that is, for a twice differentiable function \({\mathcal {L}}({\varvec{\theta }})\) and for each \({\varvec{\vartheta }}\) the function

$$\begin{aligned} {\mathcal {M}}({\varvec{\theta }}|{\varvec{\vartheta }}) = {\mathcal {L}}({\varvec{\vartheta }}) + {\mathcal {L}}'({\varvec{\vartheta }}) ({\varvec{\theta }} - {\varvec{\vartheta }}) + \frac{1}{2}({\varvec{\theta }}-{\varvec{\vartheta }})'{\varvec{A}}({\varvec{\theta }}-{\varvec{\vartheta }}) \end{aligned}$$

majorizes \({\mathcal {L}}({\varvec{\theta }})\) at \({\varvec{\vartheta }}\) when the matrix \({\varvec{A}}\) is such that

$$\begin{aligned} {\varvec{A}} - \partial ^2 {\mathcal {L}}({\varvec{\theta }}) \end{aligned}$$

is positive semidefinite. We also use the property that majorization is closed under summation, that is, when \({\mathcal {M}}_1\) majorizes \({\mathcal {L}}_1\) and \({\mathcal {M}}_2\) majorizes \({\mathcal {L}}_2\), then \({\mathcal {M}}_1 + {\mathcal {M}}_2\) majorizes \({\mathcal {L}}_1 + {\mathcal {L}}_2\).

2.2 The algorithm

Let us recap our loss function

$$\begin{aligned} {\mathcal {L}}({\varvec{\theta }}) = \sum _{i=1}^N\sum _{r = 1}^R {\mathcal {L}}_{ir}(\theta _{ir}) = \sum _{i=1}^N\sum _{r = 1}^R -\log \frac{1}{1 + \exp (-q_{ir}\theta _{ir})}. \end{aligned}$$

Because of the summation property, we can focus on a single element, \({\mathcal {L}}_{ir}(\theta _{ir})\). The first derivative of \({\mathcal {L}}_{ir}(\theta _{ir})\) with respect to \(\theta _{ir}\) is

$$\begin{aligned} \begin{aligned} \xi _{ir} \equiv \frac{\partial {\mathcal {L}}_{ir} (\theta _{ir})}{\partial \theta _{ir}}&= -(y_{ir} - \pi _{ir}) \end{aligned} \end{aligned}$$

Filling in the derivative and using the upper bound \(A = \frac{1}{4}\) (Böhning and Lindsay 1988; Hunter and Lange 2004), we have that

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{ir}(\theta _{ir})&\le {\mathcal {L}}_{ir}(\vartheta _{ir}) + \xi _{ir}(\theta _{ir} - \vartheta _{ir}) + \frac{1}{8}(\theta _{ir} - \vartheta _{ir})(\theta _{ir} - \vartheta _{ir}) \\&\le {\mathcal {L}}_{ir}(\vartheta _{ir}) + \xi _{ir}\theta _{ir} - \xi _{ir}\vartheta _{ir} + \frac{1}{8}(\theta _{ir}^2 + \vartheta _{ir}^2 -2\theta _{ir}\vartheta _{ir}) \\&\le {\mathcal {L}}_{ir}(\vartheta _{ir}) + \frac{1}{8}\theta _{ir}^2 + \xi _{ir}\theta _{ir} -2\frac{1}{8}\theta _{ir}\vartheta _{ir} - \xi _{ir}\vartheta _{ir} + \frac{1}{8}\vartheta _{ir}^2. \end{aligned} \end{aligned}$$

Let us now define \(z_{ir} = \vartheta _{ir} - 4\xi _{ir}\) to obtain

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{ir}(\theta _{ir})&\le {\mathcal {L}}_{ir}(\vartheta _{ir}) + \frac{1}{8}\theta _{ir}^2 -2\frac{1}{8}\theta _{ir}z_{ir} + \left( \frac{1}{8}z_{ir}^2 - \frac{1}{8}z_{ir}^2 \right) - \xi _{ir}\vartheta _{ir} + \frac{1}{8}\vartheta _{ir}^2 \\&\le {\mathcal {L}}_{ir}(\vartheta _{ir}) + \frac{1}{8}(\theta _{ir} - z_{ir})^2 - \frac{1}{8}z_{ir}^2 - \xi _{ir}\vartheta _{ir} + \frac{1}{8}\vartheta _{ir}^2 \\&\le \frac{1}{8}(\theta _{ir} - z_{ir})^2 + c_{ir}, \end{aligned} \end{aligned}$$

where \(c_{ir} = {\mathcal {L}}_{ir}(\vartheta _{ir}) - \frac{1}{8}z_{ir}^2 - \xi _{ir}\vartheta _{ir} + \frac{1}{8}\vartheta _{ir}^2\) is a constant.

Now as

$$\begin{aligned} {\mathcal {L}}({\varvec{\theta }}) = \sum _{i=1}^N\sum _{r = 1}^R {\mathcal {L}}_{ir}(\theta _{ir}) \end{aligned}$$

we have that

$$\begin{aligned} {\mathcal {L}}({\varvec{\theta }}) \le \sum _{i=1}^N\sum _{r = 1}^R \frac{1}{8}(\theta _{ir} - z_{ir})^2 + c, \end{aligned}$$

a least squares majorization function, with \(c = \sum _i \sum _r c_{ir}\).

For logistic principal component analysis, De Leeuw (2006) defined \(\theta _{ir} = m_r + {\varvec{u}}_i'{\varvec{v}}_r\). Collecting the elements \(z_{ir}\) in the matrix \({\varvec{Z}}\), in every iteration of the MM-algorithm he minimizes

$$\begin{aligned} \Vert {\varvec{Z}} - {\varvec{1m}}' - {\varvec{UV}}' \Vert ^2. \end{aligned}$$

For logistic reduced rank regression, we define \(\theta _{ir} = m_r + {\varvec{x}}_i'{\varvec{B}}{\varvec{v}}_r\). In every iteration of the MM-algorithm we minimize

$$\begin{aligned} \Vert {\varvec{Z}} - {\varvec{1m}}' - {\varvec{XBV}}' \Vert ^2, \end{aligned}$$

which can be done by computing the mean and a generalized singular value decomposition as in Expressions 1, 2, and 3. In detail, in every iteration we compute \({\varvec{Z}} = {\varvec{1m}}' + {\varvec{XBV}}' + 4({\varvec{Y}} - {\varvec{\Pi }})\), where \({\varvec{\Pi }}\) is the matrix with elements \(\pi _{ir}\), evaluated at the current parameter values, and update the parameters as

  • \({\varvec{m}}^+ = N^{-1} ({\varvec{Z}} - {\varvec{XBV}}')'{\varvec{1}}\)

  • \(({\varvec{X}}'{\varvec{X}})^{-\frac{1}{2}}{\varvec{X}}'({\varvec{Z}} - {\varvec{1m}}') = {\varvec{P}}{\varvec{\Phi }}{\varvec{Q}}'\)

  • \({\varvec{B}}^+ = \sqrt{N} ({\varvec{X}}'{\varvec{X}})^{-\frac{1}{2}} {\varvec{P}}_S\)

  • \({\varvec{V}}^+ = (\sqrt{N})^{-1} {\varvec{Q}}_S{\varvec{\Phi }}_S\)

where \({\varvec{P}}_S\) are the singular vectors corresponding to the S largest singular values, similarly for \({\varvec{Q}}_S\), and \({\varvec{\Phi }}_S\) is the diagonal matrix with the S largest singular values.
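The following R sketch is our own compact translation of these updates (it is not the code of the lmap package); it monitors the deviance for convergence.

```r
# Sketch of the MM algorithm for logistic reduced rank regression.
# A simplified translation of the updates above, not the lmap implementation.
lrrr_mm <- function(X, Y, S, maxit = 1000, tol = 1e-8) {
  N <- nrow(Y)
  e <- eigen(solve(crossprod(X)), symmetric = TRUE)
  XtX_msqrt <- e$vectors %*% diag(sqrt(e$values)) %*% t(e$vectors)  # (X'X)^{-1/2}
  m <- qlogis(pmin(pmax(colMeans(Y), 0.01), 0.99))     # start from intercepts only
  B <- matrix(0, ncol(X), S); V <- matrix(0, ncol(Y), S)
  dev_old <- Inf
  for (iter in 1:maxit) {
    Theta <- outer(rep(1, N), m) + X %*% B %*% t(V)    # current log-odds
    Pi    <- plogis(Theta)
    dev   <- -2 * sum(Y * log(Pi) + (1 - Y) * log(1 - Pi))
    if (dev_old - dev < tol) break
    dev_old <- dev
    Z  <- Theta + 4 * (Y - Pi)                         # working responses
    m  <- colMeans(Z - X %*% B %*% t(V))               # update intercepts
    sv <- svd(XtX_msqrt %*% crossprod(X, sweep(Z, 2, m)))
    B  <- sqrt(N) * XtX_msqrt %*% sv$u[, 1:S, drop = FALSE]
    V  <- sv$v[, 1:S, drop = FALSE] %*% diag(sv$d[1:S], S) / sqrt(N)
  }
  list(m = m, B = B, V = V, deviance = dev, iterations = iter)
}
```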

3 Visualization

A rank 2 model can be visualized with a two-dimensional representation in a so-called triplot, that shows simultaneously three types of information: the predictor variables, the response variables, and the participants (objects). When a higher rank model is fitted, triplots can be constructed for any pair of dimensions.

We discuss two types of triplots for logistic reduced rank regression. The first is based on the inner product relationship where we project points, representing the participants, on response variable axes with markers indicating probabilities of responding with yes (or 1). We call this a triplot of type I. Another type of triplot was recently described in detail by De Rooij and Groenen (2023) and uses a distance representation. We call this a triplot of type D. The two triplots are equivalent in the sense that they represent the same information but in a different way. In this section, we describe the two triplots in detail, make a comparison, and propose a new hybrid type of triplot that combines the advantages of the two types of triplots.

Fig. 1 Visualization of Logistic Reduced Rank Models. a graphical representation of two predictor variables and the process of interpolation for two observations A and B; b type I representation of one response variable and the process of prediction for the two observations; c type D representation of one response variable, where the distances between the two observations A and B and the two response classes determine the probabilities

3.1 The type I triplot

This type of logistic triplot was proposed by Vicente-Villardón et al. (2006) and Vicente-Villardón and Vicente-Gonzalez (2019). The objects, or participants, are depicted as points in a two-dimensional Euclidean space with coordinates \({\varvec{u}}_i = {\varvec{B}}'{\varvec{x}}_i\).

Each of the predictor variables is represented by a variable axis through the origin of the Euclidean space with direction \(b_{p2}/b_{p1}\). Markers can be added to the variable axis representing units \(t = \pm 1, \pm 2, \ldots\) with coordinates \(t\,{\varvec{b}}_p = (t b_{p1}, t b_{p2})\). We use the convention that the variable axis has a dotted and a solid part. The solid part represents the observed range of the variable in the data; the dotted part extends the variable axis to the border of the display. The variable label is printed on the side with the highest value of the variable.

The object coordinates (\({\varvec{u}}_i\)) follow directly from these predictor variable axes through the process of interpolation, as described in Gower and Hand (1996) and Gower et al. (2011).

This is illustrated in Fig. 1a, where two predictor variables are represented: one with regression weights 0.55 and -0.45 (the diagonal variable axis), the other with regression weights -0.05 and -0.75 (the almost vertical variable axis). Markers for the values -3 to 3 are added to both variable axes. Also included are two observations, A and B, with values 2 and 1 (A) and -3 and 2 (B) on the two predictor variables, respectively. The process of interpolation is illustrated by the grey dotted lines, that is, we add the two vectors to obtain the coordinates for the two observations.

Each of the response variables is also represented by a variable axis through the origin. The direction of the variable axis is \(v_{r2}/v_{r1}\). Markers can be added to these variable axes as well. We add markers that represent probabilities of responding 1 equal to \(\pi = \{0.1, 0.2, \ldots , 0.9\}\). The location of these markers is given by \(\lambda ({\varvec{v}}_r'{\varvec{v}}_r)^{-1}{\varvec{v}}_r\), where \(\lambda = \log (\pi /(1-\pi )) - {\hat{m}}_r\) (based on Gower et al. 2011, page 24). To obtain the predicted probabilities for an object or participant on response variable r, the process called prediction (Gower and Hand 1996; Gower et al. 2011) has to be used, where the point representing this object is projected onto the variable axis.
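The marker locations and the projection step can be computed directly from the fitted values. The R sketch below uses the example values of Fig. 1; it is our own illustration of the formulas above, not code from any package.

```r
# Sketch: probability markers on a response variable axis (type I triplot),
# using the example values of Fig. 1 (v_r = (-0.8, -0.5), m_r = -1).
v_r <- c(-0.8, -0.5); m_r <- -1
pi_marks <- seq(0.1, 0.9, by = 0.1)
lambda   <- log(pi_marks / (1 - pi_marks)) - m_r   # lambda = logit(pi) - m_r
markers  <- outer(lambda, v_r / sum(v_r^2))        # marker coordinates, one row per pi
rownames(markers) <- pi_marks

# Prediction for observation A of Fig. 1a: interpolate, then project onto v_r.
b1 <- c(0.55, -0.45); b2 <- c(-0.05, -0.75)
u_A <- 2 * b1 + 1 * b2                             # object coordinates of A
plogis(m_r + sum(u_A * v_r))                       # predicted probability
```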

This is illustrated in Fig. 1b for a single response variable with \(v_{r1} = -0.8\) and \(v_{r2} = -0.5\); the value of \(m_r\) is \(-1\). On the variable axis there are markers indicating the expected probabilities. By projecting the points of the observations A and B onto this variable axis, we obtain the expected probabilities. For observation A this is approximately 0.28, while for observation B the expected probability is approximately 0.61.

The two-dimensional space can be partitioned into two parts by a decision line for response variable r. These decision lines are perpendicular to the response variable axis and pass through the \(\pi = 0.5\) marker. As we have a decision line for each response variable, we have R of these lines in total, together partitioning the space into at most \(\sum _{s=0}^S \left( {\begin{array}{c}R\\ s\end{array}}\right)\) regions (Coombs and Kao 1955), each region having its own most likely response profile.

3.2 The type D triplot

These types of triplots are described in detail in De Rooij and Groenen (2023). Similar work has been proposed earlier, mainly in the context of political voting, by Poole and Rosenthal (1985), Clinton et al. (2004), and Poole et al. (2011).

In Type D triplots, the response variables are represented in a different manner, while the object points and the variable axes for the predictors remain the same (see Fig. 1a). Each response variable is represented by two points, one for the no-category (with coordinates \({\varvec{w}}_{r0}\)) and one for the yes-category (with coordinates \({\varvec{w}}_{r1}\)). The squared distances between an object location and these two points define the probability, that is

$$\begin{aligned} \pi _{ir} = \frac{\exp \left( -\frac{1}{2}d^2({\varvec{u}}_i,{\varvec{w}}_{r1}) \right) }{\exp \left( -\frac{1}{2}d^2({\varvec{u}}_i,{\varvec{w}}_{r0})\right) + \exp \left( -\frac{1}{2}d^2({\varvec{u}}_i,{\varvec{w}}_{r1})\right) }, \end{aligned}$$
(4)

where \(d^2({\varvec{u}}_i,{\varvec{w}}_{r1})\) is the squared Euclidean two-mode distance

$$\begin{aligned} d^2({\varvec{u}}_i,{\varvec{w}}_{r1}) = \sum _{s=1}^S (u_{is} - w_{r1s})^2. \end{aligned}$$

The coordinates \({\varvec{w}}_{r0}\) and \({\varvec{w}}_{r1}\) for all r can be collected in the \(2R \times S\) matrix \({\varvec{W}}\). This matrix can be reparametrized as

$$\begin{aligned} {\varvec{W}} = {\varvec{A}}_l {\varvec{L}} + {\varvec{A}}_k {\varvec{K}} \end{aligned}$$

with \({\varvec{A}}_l = {\varvec{I}}_R \otimes [1, 1]'\) and \({\varvec{A}}_k = {\varvec{I}}_R \otimes [1, -1]'\), where \(\otimes\) denotes the Kronecker product, and where \({\varvec{L}}\) is the \(R \times S\) matrix with response variable locations and \({\varvec{K}}\) the \(R \times S\) matrix representing the discriminatory power for the response variables. Elements of \({\varvec{K}}\) and \({\varvec{L}}\) are denoted by \(k_{rs}\) and \(l_{rs}\), respectively.
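The design matrices \({\varvec{A}}_l\) and \({\varvec{A}}_k\) are simple Kronecker products, as the short R sketch below illustrates (the dimensions and the values of \({\varvec{L}}\) and \({\varvec{K}}\) are arbitrary illustrative choices of our own).

```r
# Sketch: the reparametrization W = A_l L + A_k K for R = 3 responses, S = 2.
set.seed(4)
R <- 3; S <- 2
A_l <- diag(R) %x% matrix(c(1,  1), 2, 1)   # I_R (Kronecker) [1, 1]'
A_k <- diag(R) %x% matrix(c(1, -1), 2, 1)   # I_R (Kronecker) [1, -1]'
L <- matrix(rnorm(R * S), R, S)             # response variable locations
K <- matrix(rnorm(R * S), R, S)             # discriminatory power
W <- A_l %*% L + A_k %*% K                  # 2R x S; rows 2r-1 and 2r hold the
                                            # two category points of response r
```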

To obtain the coordinates \({\varvec{W}}\) from the parameters of the logistic reduced rank regression we take the following steps (for details, see De Rooij and Groenen 2023). The matrix \({\varvec{K}}\) is defined as \({\varvec{K}} = -{\varvec{V}}/2\). The matrix \({\varvec{L}}\) can be obtained from \({\varvec{m}}\) and \({\varvec{K}}\). For every response variable we have that

$$\begin{aligned} m_r = 2\sum _{s=1}^S k_{rs}l_{rs}. \end{aligned}$$

The solution that is closest to the origin of the Euclidean space is

$$\begin{aligned} l_{rs} = \frac{m_r k_{rs}}{2\sum _s k_{rs}^2}. \end{aligned}$$

Classification of an object is straightforward: choose the response category whose point is closest to the object location. Probabilities are more difficult to obtain from this Type D triplot, as the mental operation described in Eq. 4 has to be performed. Because the model is defined in squared Euclidean distances, it is not the distance to the category points that is needed per se; the distance to the decision line suffices to derive the probabilities (De Rooij and Groenen 2023). The decision line is defined as the line orthogonal to the line joining the two category points and through their midpoint. As discussed by De Rooij and Groenen (2023), the distance between the two category points relates to the discriminatory power: points far apart can be discriminated well, while points close to each other cannot be discriminated on the basis of the predictor variables.

This is illustrated in Fig. 1c for the same response variable and the same two observations. It is directly clear from this figure that observation A is closer to category 0, and observation B closer to category 1. Therefore, A has a higher probability of responding no (0) and B a higher probability of responding yes (1). The distance between the two points indicates the discriminatory power: the further the two points of a response variable are apart, the better the two classes are distinguished by the predictor variables.

3.3 Comparison

Both triplots are equal on the predictor side of the model, that is, the object positions and the variable axes for the predictor variables. The response side of the model, however, has a different geometric representation. For the Type I triplot, the object points need to be projected onto the variable axes to obtain the probability of answering yes. For the Type D triplot, the distances of the object point to the two category points of a response variable have to be inspected. The probability is highest for the response category closest to the object point. Determining the exact probability from the Type D triplot is, however, more involved. On the other hand, for classification the Type D plot has advantages, as an object is simply classified to the closest class. Another advantage of the Type D plot is that the discriminatory power of a response variable can be inspected directly, as it depends on the distance between the two class points: when the two class points of a response variable are further apart, the classes are overall better distinguished by the predictor variables. The discriminatory power can also be obtained from the Type I plot, but as the markers are not evenly spread over the variable axis this is usually cumbersome. We would need to inspect the distance between, say, the \(\pi = 0.5\) marker and the \(\pi = 0.6\) marker for two variables to make such a comparison, where smaller distances indicate higher discriminatory power. Which predictor variables are responsible for the discrimination of the two categories of a given response variable is most easily judged from the Type I plot, as it corresponds to the angle between the variable axis for a predictor and that of a response. The same information can be obtained from the Type D plot, but first we would need to draw the line connecting the two category points and subsequently inspect the angle.

3.4 The hybrid triplot

We propose a hybrid triplot, combining features of both the Type I and the Type D plots. As the objects and predictor variables are represented in exactly the same way in the two triplots, these remain the same. The hybrid plot uses both the variable axes with markers and the information of the category points in its representation.

The two points \({\varvec{w}}_{r0}\) and \({\varvec{w}}_{r1}\) of the Type D triplot, derived as outlined above, lie on the variable axis for response variable r of the Type I triplot. The midpoint of these two points coincides with the marker for \(\pi = 0.5\). These two properties allow us to combine the two types of triplots in a hybrid visualization, where the response variable axes are drawn as dotted lines with a solid part from \({\varvec{w}}_{r0}\) to \({\varvec{w}}_{r1}\). The endpoints of this solid part of the variable axis represent the two category points. The length of this solid part indicates the discriminatory power.
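A small R sketch of these relations (our own helper, reusing the example values of Fig. 1) computes the two category points from \(m_r\) and \({\varvec{v}}_r\) and checks that their midpoint equals the \(\pi = 0.5\) marker.

```r
# Sketch: category points w_r0 and w_r1 for the hybrid triplot.
hybrid_segment <- function(m_r, v_r) {
  k_r <- -v_r / 2                         # K = -V/2
  l_r <- m_r * k_r / (2 * sum(k_r^2))     # location closest to the origin
  rbind(w0 = l_r + k_r, w1 = l_r - k_r)   # endpoints of the solid part
}
seg <- hybrid_segment(m_r = -1, v_r = c(-0.8, -0.5))
colMeans(seg)                                   # midpoint of the segment
-(-1) * c(-0.8, -0.5) / sum(c(-0.8, -0.5)^2)    # the pi = 0.5 marker: identical
```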

4 Empirical examples

4.1 Drug consumption data set

The drug consumption data (Fehrman et al. 2017) has records for 1885 respondents and has been analyzed by De Rooij and Groenen (2023) before. Here we replicate that analysis, but focus on the new, hybrid, triplot. For each respondent there is information about age and gender plus measurements for the big five personality traits, neuroticism (N), extraversion (E), openness to experience (O), agreeableness (A), and conscientiousness (C), and another personality characteristic, namely sensation seeking (SS).

In addition, participants were questioned concerning their use of 18 legal and illegal drugs. For each drug, participants indicated whether they had used that drug in the last year (yes or no). We focus on the 11 drugs with a prevalence of use between 10% and 90%: amphetamine (Am), benzodiazepine (Be), cannabis (Ca), cocaine (Co), ecstasy (Ex), ketamine (Ke), legal highs (Le), LSD, methadone (Me), mushrooms (Mu), and nicotine (Ni) (\(R = 11\)).

4.1.1 Comparison of algorithms

In this section, we compare the two algorithms in terms of speed. The IRLS algorithm of Yee and Hastie (2003) has been implemented in the VGAM package (Yee 2022); the MM algorithm is implemented in the lmap package (De Rooij and Busing 2022). An exact comparison is difficult, because the implementation in VGAM is called with a formula, so the design matrices have to be built, whereas the lmap package starts with the two data matrices. Furthermore, the VGAM implementation uses several criteria for convergence, while the lmap convergence criterion is based solely on the deviance. In the comparison, we first checked, using the complete data set, which convergence criterion in lmap leads to the same solution. Nevertheless, the comparison is not completely fair. We estimate the same rank 2 model using the two algorithms. The data set has \(N = 1885\), \(P = 9\), and \(R = 11\). To compare the speed of the two algorithms we use the microbenchmark package (Mersmann 2021), where each algorithm is applied ten times. Results are shown in Table 1, where it can be seen that the MM algorithm is much faster.

Table 1 Timing of rrvglm-algorithm in the VGAM package and the lpca-algorithm in the lmap package for the drug consumption and NCD data sets
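The timing setup can be reproduced along the following lines. The sketch below uses simulated data of the same shape; the exact argument names of rrvglm() and lpca() are assumptions and may differ across package versions, so they should be checked against the package documentation.

```r
# Sketch of the timing comparison; argument names for rrvglm() and lpca()
# are assumptions and should be verified against the package documentation.
library(VGAM)             # rrvglm: the IRLS algorithm
library(lmap)             # lpca:   the MM algorithm
library(microbenchmark)

set.seed(5)
N <- 1885; P <- 9; R <- 11
X <- matrix(rnorm(N * P), N, P)
Y <- matrix(rbinom(N * R, 1, 0.4), N, R)
dat <- data.frame(Y = I(Y), X = I(X))

microbenchmark(
  IRLS = rrvglm(Y ~ X, family = binomialff(multiple.responses = TRUE),
                data = dat, Rank = 2),
  MM   = lpca(Y = Y, X = X, S = 2),    # assumed interface: two data matrices
  times = 10
)
```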

For our MM algorithm, \(({\varvec{X}}'{\varvec{X}})^{-\frac{1}{2}}\) and \(({\varvec{X}}'{\varvec{X}})^{-\frac{1}{2}}{\varvec{X}}'\) need to be evaluated only once. During the iterations, an SVD of a \(P \times R\) matrix has to be computed, which is a relatively small matrix. Therefore, the computational burden within each iteration is small. As is usual for MM algorithms, convergence is relatively slow in terms of the number of iterations needed (Heiser 1995). For the drug consumption data, 42 iterations were needed.

The IRLS algorithm alternates between an update of \({\varvec{B}}\) and \({\varvec{V}}\) (Yee 2015, Sect. 5.3.1). For both updates, a matrix with working weights needs to be inverted. As this matrix depends on the current estimated values, the inverse needs to be re-evaluated in every iteration. These matrix inversions are computationally heavy. In terms of the number of iterations the IRLS algorithm is faster, that is, only 6 iterations were needed for the drug consumption data.

Yee (2015, Sect. 3.2.1) points out that, apart from some matrix multiplications, the number of floating point operations (flops) for the IRLS algorithm is \(2NS^3(P^2 + R^2)\) per iteration, which for this application is 6092320. The SVD in the MM algorithm takes \(2PR^2 + 11R^3 = 16819\) flops per iteration (cf. Trefethen and Bau 1997). These computations exclude the intercepts (\({\varvec{m}}\)). Our algorithm also has an advantage with respect to storage: the MM algorithm works with the matrix \({\varvec{X}}\), while the IRLS algorithm works with \({\varvec{X}} \otimes {\varvec{I}}_S\), a much larger matrix.

4.1.2 Visualization

De Rooij and Groenen (2023) introduced a quality of representation measure for the response variables. The quality of representation measure, \(Q_r\), compares the fit of a single response variable in the rank 2 model with the fit obtained in a separate logistic regression based on the same predictor variables. \(Q_r\) is defined by

$$\begin{aligned} Q_r = ({\mathcal {D}}_{(0,r)} - {\mathcal {D}}_r)/({\mathcal {D}}_{(0,r)} - {\mathcal {D}}_{lr}), \end{aligned}$$

where \({\mathcal {D}}_{(0,r)}\) is the deviance of the intercept-only logistic regression model for response variable r, \({\mathcal {D}}_r\) is the part of the deviance of the rank 2 model (i.e., \(2{\mathcal {L}}({\varvec{\theta }})\)) corresponding to response variable r, and \({\mathcal {D}}_{lr}\) is the deviance of a logistic regression of response variable r on the same predictor variables as in the reduced rank model. \(Q_r\) ranges from 0 (very bad fit) to 1 (no loss due to the rank restriction). As such, we see how much fit is lost for every response variable by reducing the rank of the model. For this analysis, the quality measures are all very high, ranging between 0.90 and 1.00, with the worst fit for cocaine.
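The measure is easy to compute from fitted probabilities. The following R sketch is our own helper (the names pi_rank2 and pi_logreg are hypothetical); it takes the observed responses for variable r together with the fitted probabilities of the rank 2 model and of a separate logistic regression.

```r
# Sketch: quality of representation Q_r for response variable r.
# pi_rank2 and pi_logreg are fitted probabilities from the rank-2 model and
# from a separate logistic regression; both are assumed to be available.
Q_r <- function(y_r, pi_rank2, pi_logreg) {
  dev <- function(y, p) -2 * sum(y * log(p) + (1 - y) * log(1 - p))
  D0  <- dev(y_r, mean(y_r))    # intercept-only model for response r
  Dr  <- dev(y_r, pi_rank2)     # response r's part of the rank-2 deviance
  Dlr <- dev(y_r, pi_logreg)    # separate logistic regression for response r
  (D0 - Dr) / (D0 - Dlr)
}
```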

The hybrid triplot for the rank 2 model of the drug consumption data is shown in Fig. 2. First, let us inspect the response variables. All labels are printed on the left-hand side of the triplot, indicating that the probability of drug use is highest on that side of the display and lowest on the right-hand side. The variable axes have solid parts of different lengths; for example, the solid part for benzodiazepine (Be) and methadone (Me) is much shorter than the solid part for cannabis (Ca) or LSD. Therefore, overall, cannabis and LSD use are better discriminated by the predictor variables than benzodiazepine or methadone use. The endpoints of each solid line correspond to the positions of the yes and no points for each response variable in the type D plot, where in this particular triplot the yes marker is always on the left-hand side of the display.

Some predictor variable axes have sharp angles with the variable axes of the responses, whereas others have obtuse angles. Openness, extraversion, age, sensation seeking, and gender have sharp angles, indicating that these predictor variables discriminate well between drug use and abstinence.

The variable axes for openness, extraversion, and age are close together; therefore, these variables have similar effects on the response variables. More open, younger, and more introverted participants tend to have a higher probability of using LSD, mushrooms (Mu), and ecstasy (Ex). Participants who score higher on sensation seeking have a higher probability of using benzodiazepine (Be), methadone (Me), and cocaine (Co). Female participants (i.e., gender = 1) have a lower probability of drug use than male participants.

Neuroticism (N) and agreeableness (A), on the other hand, are almost orthogonal to the response variable axes, indicating that the relationships between these two predictor variables and drug use are weak.

The variable axis for conscientiousness has a sharp angle with, for example, benzodiazepine (Be) and methadone (Me), but an obtuse angle with the variable axes for LSD, mushrooms, and ecstasy. Less conscientious participants therefore have a higher probability of using benzodiazepine and methadone, while more conscientious participants have a lower probability. The degree of conscientiousness, however, does not result in large differences in the probability of, say, LSD use.

Fig. 2 The hybrid triplot for the drug consumption data. The grey points represent the participants. The blue lines are the variable axes for the predictor variables (age, gender, neuroticism (N), extraversion (E), openness (O), conscientiousness (C), agreeableness (A), sensation seeking (SS)). The green lines are the variable axes for the responses (amphetamine (Am), benzodiazepine (Be), cannabis (Ca), cocaine (Co), ecstasy (Ex), ketamine (Ke), legal highs (Le), LSD, methadone (Me), mushrooms (Mu), and nicotine (Ni)). The variable labels are printed on the positive side of the variable, that is, the highest values of the predictor variables and the highest probabilities for the response variables (color figure online)

4.2 Non communicable diseases data set

Noncommunicable diseases (NCDs) such as diabetes, chronic lung disease, stroke, and depression tend to be of long duration and are the result of a combination of genetic, physiological, environmental, and behavioural factors. The 2030 Agenda for Sustainable Development recognizes NCDs as a major challenge for sustainable development. Here, data (wave 1) of the Survey of Health, Ageing and Retirement in Europe (SHARE; Börsch-Supan et al. 2013) were used. In this data set, we have for \(N = 29207\) subjects information on seven NCDs: diabetes (Di), hypertension (H), chronic lung disease (CL), joint disorders (JD), angina (An), stroke (S), and depression (De). Overall, the prevalence of NCDs is not large; percentages range from 3.5% for stroke to 37.8% for depression. However, relatively many people develop multiple NCDs: in the current data set, 6021 participants have two NCDs, 2451 have three, 894 have four, 214 have five, 40 have six, and 3 participants suffer from all seven NCDs.

Participants of this study live in various European countries: Austria (AT, 1538), Germany (DE, 2902), Sweden (SE, 2937), The Netherlands (NL, 2805), Spain (ES, 2226), Italy (IT, 2494), France (FR, 2890), Denmark (DK, 1665), Greece (EL, 2800), Switzerland (CH, 957), Belgium (BE, 3735), and Israel (IL, 2258). In the current analysis we are interested in the relationship between country of origin and NCDs, taking into account the gender and age of the participants. As country is coded by 11 dummy variables, with The Netherlands as the reference category, the number of predictor variables in this analysis equals \(P = 13\).

4.2.1 Comparison of algorithms

Again, we compare the two algorithms in terms of speed. This data set has \(N = 29207\), \(P = 13\), and \(R = 7\). To compare the speed of the two algorithms we use the microbenchmark package (Mersmann 2021). Results are shown in Table 1, where it can be seen that again the MM algorithm is much faster.

4.2.2 Visualization

The quality of representation is not as good as for the drug consumption data. Some response variables are well represented (H, 0.86; JD, 0.93; An, 0.96; S, 0.82; De, 0.81), but diabetes and chronic lung disease are not: their quality of representation values are 0.66 and 0.57, respectively. The two-dimensional triplot is shown in Fig. 3. As can be seen, the solid green part is not drawn for every response variable (Di, CL, and S). The reason is that this part of the variable axis lies far outside the range of the participant points; extending the boundaries of the triplot to include it would clutter the center of the display. An important conclusion for these three variables is that for all participants the probability of these diseases is small, that is, smaller than 0.3 for stroke and diabetes and smaller than 0.2 for chronic lung disease. The decision lines fall outside the triplot, indicating that all participants are classified as not having those noncommunicable diseases. For the other response variables, we see that joint disorders have a relatively high discriminatory power (long solid part), whereas hypertension has a relatively low discriminatory power (short solid part). Angina and depression are in between.

Concerning the relationships between the predictor variables and response variables, we see that being female is associated with a higher probability of depression and a lower probability of angina. Higher age is associated with angina, stroke, diabetes, and hypertension, but age has no relationship with depression and only a small relationship with joint disorders.

Considering the countries, their effects are compared against the profile of The Netherlands (i.e., The Netherlands is the baseline category). Austria and Sweden have very small vectors, and Germany and Greece small vectors, indicating that their average profiles of non-communicable diseases are very similar to that of The Netherlands. The countries that differ the most from The Netherlands are Spain, Italy, France, Israel, and Switzerland.

Interpretation again proceeds by inspecting the angles between the vectors corresponding to the countries and the variable axes of the responses. A small angle suggests that a country has, on average, a higher probability for that response, a \(90^{\circ }\) angle suggests a similar probability, while an obtuse angle indicates a lower probability for a noncommunicable disease. Switzerland scores lower on angina, stroke, diabetes, chronic lung disease, and hypertension, but higher on joint disorders and depression. France, Spain, and Italy, and to a lesser degree also Denmark and Belgium, score higher on depression, joint disorders, and hypertension (small angles), but similar on angina, stroke, diabetes, and chronic lung disease (\(90^{\circ }\) angles). Israel has higher probabilities for angina, stroke, diabetes, chronic lung disease, and hypertension, but similar probabilities to The Netherlands for joint disorders and depression.

Fig. 3 The hybrid triplot for the NCD data. The grey points represent the participants. The blue lines are the variable axes for the predictor variables (age, gender, Austria, Germany, Sweden, Spain, Italy, France, Denmark, Greece, Switzerland, Belgium, and Israel). The green lines are the variable axes for the responses (diabetes (Di), hypertension (H), chronic lung disease (CL), joint disorders (JD), angina (An), stroke (S), and depression (De)). The variable labels are printed on the positive side of the variable, that is, the highest values of the predictor variables and the highest probabilities for the response variables (color figure online)

5 Discussion

Logistic reduced rank regression is a useful tool for regression analysis of multivariate binary response variables. We developed a new algorithm based on previous work by De Leeuw (2006) and implemented it in the lmap package (De Rooij and Busing 2022). The algorithm uses the majorization inequality of De Leeuw (2006) to move from the negative log-likelihood to a least squares function, and the generalized singular value decomposition to minimize the least squares function. Each of these two steps is known, and as such the algorithm is a straightforward combination of the two. The inequality of De Leeuw (2006), which gives a least squares majorization function for the negative log-likelihood of binomial or multinomial models, can be used more generally to develop logistic models for categorical data.

We compared the new algorithm to the IRLS algorithm (Yee and Hastie 2003; Yee 2015) on two empirical data sets; our new algorithm is about ten times faster but uses more iterations. MM algorithms are known for slow convergence in terms of the number of iterations needed (see Heiser 1995). Yet, the updates within the iterations are computationally cheap: a singular value decomposition of a \(P \times R\) matrix needs to be computed, where usually P and R are relatively small. Our MM algorithm is only applicable to logistic reduced rank models, whereas the IRLS algorithm of Yee (2015) is designed for the whole family of reduced rank generalized linear models.

In the VGAM package (Yee 2022, 2015), standard errors of the model parameters can be obtained in a second step. In an MM algorithm, the negative log-likelihood is not minimized directly; instead, the majorization function is minimized iteratively. Therefore, our algorithm does not automatically provide an estimate of the Hessian matrix or the standard errors. Is this a bad thing? Buja et al. (2019a, 2019b) recently argued that if we know statistical models are approximations, we should also carry the consequences of this knowledge. The computation of standard errors assumes the model to be true, that is, not an approximation. Assuming the model is true while knowing it is not results in estimated standard errors that are biased. A better approach to obtain standard errors or confidence intervals is to use the so-called pairs bootstrap, where the predictors and responses are jointly re-sampled. For the bootstrap it is useful if the algorithm is fast.
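A minimal sketch of such a pairs bootstrap is given below; it reuses the lrrr_mm() helper sketched in Sect. 2.2 (our own function, not the lmap API) and collects the bootstrap distribution of the identified coefficient matrix \({\varvec{A}} = {\varvec{BV}}'\).

```r
# Sketch of the pairs bootstrap for the coefficient matrix A = BV'.
# lrrr_mm() is the helper sketched in Sect. 2.2, not a function of lmap.
pairs_bootstrap <- function(X, Y, S, nboot = 500) {
  N <- nrow(Y)
  replicate(nboot, {
    idx <- sample(N, replace = TRUE)              # resample rows of X and Y jointly
    fit <- lrrr_mm(X[idx, , drop = FALSE], Y[idx, , drop = FALSE], S)
    fit$B %*% t(fit$V)                            # P x R coefficient matrix
  })
}
# Percentile confidence intervals per coefficient, e.g.:
# A_boot <- pairs_bootstrap(X, Y, S = 2)
# apply(A_boot, c(1, 2), quantile, probs = c(0.025, 0.975))
```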

Two types of triplots were discussed and compared, the Type I and Type D triplots. The two types of triplots are equal on the predictor side of the model, but differ in the representation of the response variables. Whereas the Type I triplot uses an inner product relationship where object points have to be projected onto response variable axes, the Type D uses a distance relationship, where the distance of an object point towards the yes and no point for each response variable determines the probabilities. We discussed advantages and disadvantages of both approaches, and were then able to develop a new, hybrid, type of triplot by combining the two.

In the Type D triplot we make use of the two-mode Euclidean distance. This distance is often used to model single-peaked response functions and the object points are then called ideal points. The single-peaked response function is usually contrasted with the dominance response function (see, for example, Drasgow et al. 2010, and references therein). Whereas in the latter the probability of answering yes is a monotonic function of the position on the latent trait, in the former this probability is a single-peaked function defined by distances. In the Type D triplot, however, the relationship is still of the dominance type, where the probability of answering yes goes monotonically up or down for objects located on any straight line in the Euclidean representation. The main reason is that the model is in terms of the distance towards the categories of the response variable, not the response variable itself. De Rooij et al. (2022) recently developed a distance model where the distance between the object and the response variable (i.e., not the categories) determines the probability of answering yes. Such a representation warrants an interpretation in terms of single peaked relationships. Logistic reduced rank regression models assume a monotonic predictor-response relationship, no matter how the model is visualized.