Introduction

In contemporary statistical inference and machine learning, classification and prediction are of great importance, and many approaches have been proposed. Typical methods include the support vector machine (SVM), linear discriminant analysis (LDA), and K-nearest neighbors (KNN) (Hastie et al. 2008; James et al. 2017). These methods have widespread applications, and extensions to accommodate complex settings have been proposed. For example, Lee and Lee (2003) studied multicategory support vector machines for the classification of multiple types of cancer. Cristianini and Shawe-Taylor (2000) presented comprehensive discussions of SVM methods. Guo et al. (2007) discussed the LDA method and its application in microarray data analysis. Safo and Ahn (2016) considered multiclass analysis via generalized sparse linear discriminant analysis. Regarding multiclass classification problems, Bagirov et al. (2003) proposed a new algorithm for multiclass cancer data. Bicciato et al. (2003) presented disjoint models for multiclass cancer analysis using the principal component technique. Liu et al. (2005) proposed a genetic algorithm (GA)-based method to carry out multiclass cancer classification.

Recent developments in classification further incorporate the dependence structures of predictors. For example, Cetiner and Akgul (2014) developed a graphical-model-based method for multi-label classification. Zhu and Pan (2009) proposed the network-based support vector machine for binary classification of microarray samples. Zi et al. (2016) discussed identification of rheumatoid arthritis-related genes using the network-based support vector machine. Cai et al. (2018) considered network linear discriminant analysis. Huttenhower et al. (2007) proposed the nearest neighbor network approach. In the Bayesian paradigm, various classification approaches that accommodate network structures have been explored, such as Bielza et al. (2011), Miguel Hernández-Lobato et al. (2011), Baladandayuthapani et al. (2014), and Peterson et al. (2015).

Although there have been methods handling network structures in classification, these methods basically assume a common network structure for the predictors of all subjects, without taking into account possible heterogeneity across classes. To overcome these shortcomings, in this paper we propose classification methods that take possibly class-dependent network structures of predictors into account. Our methods utilize graphical model theory and allow the predictors to follow an exponential family distribution rather than a restrictive normal distribution. Furthermore, we develop a prediction criterion for multiclass classification which accommodates pairwise dependence structures among the predictors. Our methods incorporate informative predictors with pairwise dependence structures into the classification procedures, and they are computationally easy to implement.

The remainder of the paper is organized as follows. In “Data structure and framework” section, we introduce the data structure and review a convenient multiclass classification method for simple settings. In “Classification with predictor graphical structures accommodated” section, we describe the basics of graphical model theory and propose two methods for multiclass classification that accommodate network structures of predictors. In “Evaluation of the performance” section, we describe the criteria for evaluating the performance of the proposed methods and briefly review several competing classification methods for comparison. In “Numerical studies” section, we conduct simulation studies to assess the performance of the proposed methods and apply them to analyze a real dataset for illustration. A general discussion is presented in the last section.

Data structure and framework

In this section, we present the data structure with multiclass responses and introduce the basic notation.

Notation

Suppose the data of n subjects come from I classes, where I is an integer no smaller than 2 and the classes are free of order, i.e., they are nominal. Let ni be the class size in class i with i=1,⋯,I, and hence \(n = \sum \limits _{i=1}^{I} n_{i}\). Define Yik=i for class i=1,⋯,I and subject k=1,⋯,ni, and let \(Y = \left (Y_{11}, Y_{12}, \cdots, Y_{1n_{1}}, Y_{21}, \cdots, Y_{2n_{2}}, \cdots, Y_{I1}, \cdots, Y_{In_{I}} \right)^{\top }\) denote the n-dimensional random vector of response. Let Y·j denote the jth component of Y. In other words, if we ignore the class information, then Y·j represents the response (or the class membership) for the jth subject in the sample, where j=1,⋯,n.

For i=1,⋯,I, let \(X_{li} = \left (X_{li1},\cdots,X_{li{n_{i}}} \right)^{\top }\) denote the lth predictor (or covariate) vector associated with class i, where l=1,⋯,p for a positive integer p. We write \(X_{l} = \left (X_{l1}^{\top },\cdots, X_{lI}^{\top } \right)^{\top }\) for l=1,⋯,p, and let X=(X1,⋯,Xp) denote the n×p matrix of predictors. Let X·j=(X·j1,⋯,X·jp) denote the jth row of X, which represents the p-dimensional predictor vector for the jth subject. Without loss of generality, the {X·j,Y·j} are treated as independent and identically distributed (i.i.d.) for j=1,⋯,n. We let lower case letters represent realized values for the corresponding random variables. For example, x·j stands for a realized value of X·j. The data structure is shown in Table 1.

Table 1 Two ways to display data with a multiclass response and predictors

The objective here is to use the observed data to build models in order to predict the class label for a new subject using his/her observed predictor measurement.

Logistic regression model for multiclass response

With the multiclass response, we may consider the use of the logistic regression model by adapting the discussion of Agresti (2012, Section 7.1). For i=1,⋯,I and j=1,⋯,n, let πij(x·j)=P(Y·j=i|X·j=x·j) denote the conditional probability that subject j is selected from class i, given the predictor information X·j=x·j.

Noting the constraint \(\sum \limits _{i=1}^{I} \pi _{ij}(x_{\cdot j}) = 1\) for every j=1,⋯,n, we need only model (I−1) of the probabilities πij(x·j) rather than all of them. Without loss of generality, we take the Ith conditional probability πIj(x·j) as the reference and consider the logistic model

$$\begin{array}{@{}rcl@{}} \log \left\{ \frac{\pi_{ij}(x_{\cdot j})}{\pi_{Ij}(x_{\cdot j})} \right\} = \gamma_{0i} + \gamma_{i}^{\top} x_{\cdot j} \end{array} $$
(1)

for i=1,⋯,I−1 and j=1,⋯,n, where \(\gamma = \left (\gamma _{01}, \gamma _{1}^{\top }, \gamma _{02}, \gamma _{2}^{\top },\cdots,\gamma _{0,I-1}, \gamma _{I-1}^{\top } \right)^{\top }\) is the vector of parameters with the intercepts γ0i and a p-dimensional vector γi of parameters.

Equivalently, (1) shows that for i=1,⋯,I−1 and j=1,⋯,n,

$$\begin{array}{@{}rcl@{}} \pi_{ij}(x_{\cdot j}) = \frac{\exp\left(\gamma_{0i} + \gamma_{i}^{\top} x_{\cdot j} \right)}{1 + \sum \limits_{l=1}^{I-1} \exp\left(\gamma_{0l} + \gamma_{l}^{\top} x_{\cdot j} \right)} \end{array} $$
(2)

and

$$\begin{array}{@{}rcl@{}} \pi_{Ij}(x_{\cdot j}) = 1 - \sum \limits_{i=1}^{I-1} \pi_{ij}(x_{\cdot j}). \end{array} $$
(3)

Since the distribution of the Y·j can be delineated by a multinomial distribution, the likelihood function for the observed data is given by

$$\begin{array}{@{}rcl@{}} L\left(\gamma \right) = \prod \limits_{i=1}^{I} \left\{ \prod \limits_{j=1}^{n} \pi_{ij}(x_{\cdot j})^{y_{ij}} \right\}, \end{array} $$
(4)

where yij is the indicator that subject j belongs to class i, and πij(x·j) is determined by (2) or (3). Estimation of γ proceeds by maximizing (4). Let \(\widehat {\gamma } = \left (\widehat {\gamma }_{01}, \widehat {\gamma }_{1}^{\top }, \widehat {\gamma }_{02}, \widehat {\gamma }_{2}^{\top },\cdots, \widehat {\gamma }_{0,I-1}, \widehat {\gamma }_{I-1}^{\top } \right)^{\top } \) denote the resulting maximum likelihood estimate of γ.

To predict the class label for a new subject with a p-dimensional predictor vector \(\widetilde {x}\), we first calculate the right-hand sides of (2) and (3) with \(\left (\gamma _{0i}, \gamma _{i}^{\top } \right)^{\top }\) replaced by the corresponding estimates obtained from the training data, and let \(\widehat {\pi }_{1},\cdots,\widehat {\pi }_{I}\) denote the resulting values. Let i∗ denote the index which corresponds to the largest value of \(\left \{ \widehat {\pi }_{1},\cdots,\widehat {\pi }_{I} \right \}\). Then the class label for this new subject is predicted as i∗.

Classification with predictor graphical structures accommodated

In this section, we propose two classification methods for prediction which incorporate the network structure of the predictors. We first describe the use of graphical models to facilitate the association structure of the predictors, and then explore two methods of building prediction models using the identified association structures.

Predictor network structure

Graphical models are useful to facilitate the network structures of the predictors. Here we describe the way of using graphical models to delineate possible association structures of the predictors. For j=1,⋯,n, we use an undirected graph, denoted Gj=(Vj,Ej), to describe the relationship among the components of X·j=(X·j1,⋯,X·jp), where Vj={1,⋯,p} includes all the indices of predictors and Ej⊆Vj×Vj contains pairs with unequal coordinates. A covariate X·jr is called a vertex of the graph Gj if r∈Vj; a pair of predictors {X·jr,X·js} is called an edge of the graph Gj if (r,s)∈Ej. In the setting we consider, the sets Vj and Ej are common for j=1,⋯,n, so we let V and E denote the common vertex set and edge set of the graph, respectively.

To characterize the distribution of the predictor X·j, we consider the graphical model with the exponential family distribution,

$$ {}f(x_{\cdot j};\beta,\Theta) = \exp \left\{ \sum \limits_{r \in V} \beta_{r} B(x_{\cdot jr}) + \sum \limits_{(s,t) \in E} \theta_{st} B(x_{\cdot js})B(x_{\cdot jt}) + \sum \limits_{r \in V} C(x_{\cdot jr}) - A(\beta,\Theta) \right\}, $$
(5)

where β=(β1,⋯,βp) is a p-dimensional vector of parameters, Θ=[θst] is a p×p symmetric matrix with zero diagonal elements, and B(·) and C(·) are given functions. The function A(β,Θ) is the normalizing constant which makes (5) integrate to 1; it is also called the log-partition function, given by

$$\begin{array}{@{}rcl@{}} A(\beta,\Theta) = \log \int \exp \left\{ \sum \limits_{r \in V} \beta_{r} B(x_{\cdot jr}) + \sum \limits_{(s,t) \in E} \theta_{st} B(x_{\cdot js})B(x_{\cdot jt}) + \sum \limits_{r \in V} C(x_{\cdot jr}) \right\} dx_{\cdot j}. \end{array} $$

Formulation (5) gives a broad class of models which essentially covers most commonly used distributions. For example, if \(B(x) = \frac {x}{\sigma }\) and \(C(x) = -\frac {x^{2}}{2 \sigma ^{2}}\) where σ is a positive constant, then (5) yields the well-known Gaussian graphical model (Friedman et al. 2008; Hastie et al. 2015; Lee and Hastie 2015). If B(x)=x and C(x)=0 with x∈{0,1}, then with the βr set to be zero, (5) reduces to

$$\begin{array}{@{}rcl@{}} \exp \left\{ \sum \limits_{(s,t) \in E} \theta_{st} x_{\cdot js} x_{\cdot jt} - A(\Theta) \right\}, \end{array} $$
(6)

which is the Ising model without the singletons (Ravikumar et al. 2010).
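As a concrete check of the pairwise form (6), the following sketch evaluates Ising probabilities by brute-force enumeration of the log-partition function A(Θ) over {0,1}^p; this is only illustrative for small p, since A(Θ) is intractable for large p. The helper names are ours.

```python
import itertools
import numpy as np

def ising_log_partition(Theta):
    """Log-partition A(Theta) of the Ising model (6), by brute-force
    enumeration over all binary configurations (feasible only for small p)."""
    p = Theta.shape[0]
    # sum over theta_st * x_s * x_t, counting each unordered pair (s, t) once
    energies = [
        sum(Theta[s, t] * x[s] * x[t] for s in range(p) for t in range(s + 1, p))
        for x in itertools.product([0, 1], repeat=p)
    ]
    return float(np.log(np.sum(np.exp(energies))))

def ising_prob(x, Theta):
    """Probability of a binary configuration x under the Ising model (6)."""
    p = Theta.shape[0]
    energy = sum(Theta[s, t] * x[s] * x[t] for s in range(p) for t in range(s + 1, p))
    return float(np.exp(energy - ising_log_partition(Theta)))
```

Summing `ising_prob` over all 2^p configurations returns 1, which verifies that A(Θ) normalizes the model.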

To focus on featuring the pairwise association among the components of X·j, similar to the structure of (6), we consider the following graphical model

$$\begin{array}{@{}rcl@{}} f(x_{\cdot j};\Theta) = \exp \left\{ \sum \limits_{(s,t) \in E} \theta_{st} x_{\cdot js} x_{\cdot jt} + \sum \limits_{r \in V} C(x_{\cdot jr}) - A(\Theta) \right\}, \end{array} $$
(7)

where the function A(Θ) is the normalizing constant, and the θst and C(·) are defined as for (5). Model (7) is a special case of (5) which constrains the main-effect parameters βr in (5) to be zero; a nonzero parameter θst implies that X·js and X·jt are conditionally dependent given the other predictors.

To estimate Θ, one may apply the likelihood method using the distribution (7) directly. Alternatively, a simpler estimation method can be carried out based on a conditional distribution derived from (7) (Meinshausen and Bühlmann 2006; Hastie et al. 2015, p.254). For every s∈V, let X·j,V∖{s} denote the (p−1)-dimensional subvector of X·j with its sth component deleted, i.e., X·j,V∖{s}=(X·j1,⋯,X·j,s−1,X·j,s+1,⋯,X·jp). By some algebra, we have

$$ f\left(x_{\cdot js} | x_{\cdot j,V \setminus \{s\}}; \theta_{s}\right) = \exp \left\{ x_{\cdot js} \left(\sum \limits_{t \in V \setminus \{s\}} \theta_{st} x_{\cdot jt} \right) + C\left(x_{\cdot js} \right) - D \left(\sum \limits_{t \in V \setminus \{s\}} \theta_{st} x_{\cdot jt} \right) \right\}, $$
(8)

where D(·) is the normalizing function ensuring that (8) integrates to one, and θs=(θs1,⋯,θs,s−1,θs,s+1,⋯,θsp) is a (p−1)-dimensional vector of parameters indicating the relationship of X·js with all the other predictors X·jr for r∈{1,⋯,p}∖{s}.

Let ℓ(θs) be the log-likelihood for θs multiplied by \(- \frac {1}{n}\), with the constant omitted, i.e.,

$$\begin{array}{@{}rcl@{}} \ell \left(\theta_{s} \right) &=& - \frac{1}{n} \log \left\{ \prod \limits_{j=1}^{n} f\left(x_{\cdot js} | x_{\cdot j, V \setminus \{s\}} ; \theta_{s}\right) \right\} \\ &=& \frac{1}{n} \sum \limits_{j=1}^{n} \left\{ -x_{\cdot js} \left(\sum \limits_{t \in V \setminus \{s\}} \theta_{st} x_{\cdot jt} \right) + D \left(\sum \limits_{t \in V \setminus \{s\}} \theta_{st} x_{\cdot jt} \right) \right\}. \end{array} $$

Then an estimator of θs can be obtained as

$$\begin{array}{@{}rcl@{}} \widehat{\theta}_{s}(\lambda) = \underset{\theta_{s}}{\text{argmin}} \left\{ \ell \left(\theta_{s} \right) + \lambda \left\| \theta_{s} \right\|_{1} \right\}, \end{array} $$
(9)

where λ is a tuning parameter and ∥·∥1 is the L1-norm. In principle, the L1-norm in (9) may be replaced by other penalty functions such as the weighted L1-norm (Zou 2006) and the nonconcave function (Fan and Li 2001). Here we focus on using the L1-norm, the well-known LASSO penalty (Tibshirani 1996), to determine informative pairwise dependent predictors. The LASSO penalty is frequently considered when dealing with graphical models; it has been implemented in R. For instance, R packages huge and XMRF use the LASSO penalty to determine the network structure.
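As an illustration of the node-wise estimation (9), the following sketch treats the Ising special case, where the conditional distribution (8) of a binary predictor given the others is logistic, so (9) becomes an L1-penalized logistic regression of each column on the remaining columns. This is only a sketch under stated assumptions (binary 0/1 predictors; scikit-learn's liblinear solver parameterizes the L1 penalty through C = 1/(nλ)); the helper name `neighbourhood_fit` is ours, and the text's own computations use the R packages huge and XMRF instead.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def neighbourhood_fit(X, s, lam):
    """Estimate theta_s in (9) for binary (Ising-type) predictors.

    Regresses column s of X on all other columns with an L1 penalty,
    mirroring the neighborhood-selection approach of Meinshausen and
    Buhlmann (2006).  Returns a length-p vector with theta_ss = 0.
    """
    n, p = X.shape
    others = np.delete(np.arange(p), s)
    fit = LogisticRegression(penalty="l1", solver="liblinear",
                             C=1.0 / (n * lam), fit_intercept=False)
    fit.fit(X[:, others], X[:, s])
    theta_s = np.zeros(p)
    theta_s[others] = fit.coef_.ravel()   # no self-edge: theta_ss stays 0
    return theta_s
```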

We comment that the estimator obtained from (9) depends on the choice of the tuning parameter λ. There is no unique way of selecting a suitable tuning parameter; methods such as the Akaike information criterion (AIC), the Bayesian information criterion (BIC), cross-validation (CV), and generalized cross-validation (GCV) may be considered. As suggested by Wang et al. (2007), BIC tends to outperform the others in many situations, especially in settings with a penalized likelihood function. Consequently, we employ the BIC approach to select the tuning parameter λ.

Define

$$\begin{array}{@{}rcl@{}} BIC(\lambda) = 2n \ell \left(\widehat{\theta}_{s}(\lambda) \right) + \log(n)\times \text{df} \left\{ \widehat{\theta}_{s}(\lambda) \right\}, \end{array} $$
(10)

where \(\text {df} \left \{ \widehat {\theta }_{s}(\lambda) \right \}\) represents the number of non-zero elements in \(\widehat {\theta }_{s}(\lambda)\) for a given λ. The optimal tuning parameter λ, denoted by \(\widehat {\lambda }\), is determined by minimizing (10) within a suitable range of λ. As a result, the estimator of θs is determined by \(\widehat {\theta }_{s} = \widehat {\theta }_{s}\left (\widehat {\lambda }\right)\).
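The BIC selection (10) can be sketched as follows, again for the Ising case where ℓ(θs) is the averaged negative conditional log-likelihood of the logistic model (8); `bic_select` is a hypothetical helper and the λ grid below is purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def bic_select(X, s, lambdas):
    """Select the tuning parameter in (9) by minimizing the BIC (10),
    sketched for binary predictors where (8) is a logistic regression."""
    n, p = X.shape
    others = np.delete(np.arange(p), s)
    y, Z = X[:, s], X[:, others]
    best = None
    for lam in lambdas:
        fit = LogisticRegression(penalty="l1", solver="liblinear",
                                 C=1.0 / (n * lam), fit_intercept=False).fit(Z, y)
        theta = fit.coef_.ravel()
        eta = Z @ theta
        # ell(theta_s): (1/n) * sum{ -y*eta + D(eta) } with D(eta) = log(1 + e^eta)
        ell = np.mean(np.log1p(np.exp(eta)) - y * eta)
        # BIC(lambda) = 2n * ell + log(n) * df, as in (10)
        bic = 2 * n * ell + np.log(n) * np.count_nonzero(theta)
        if best is None or bic < best[0]:
            best = (bic, lam, theta)
    return best[1], best[2]   # (lambda-hat, theta_s-hat)
```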

The preceding procedure is repeated for every s∈V, yielding the estimator \(\widehat {\theta }_{s}\) for all s∈V. One important point deserves attention: for (s,t)∈E, the estimates \(\widehat {\theta }_{st}\) and \(\widehat {\theta }_{ts}\) are not necessarily identical, although θst and θts are constrained to be equal. To overcome this problem, we apply the AND rule (Meinshausen and Bühlmann 2006; Hastie et al. 2015, p.255): if both \(\widehat {\theta }_{st}\) and \(\widehat {\theta }_{ts}\) are nonzero, their final common estimate is taken to be their maximum; if either of them is zero, both \(\widehat {\theta }_{st}\) and \(\widehat {\theta }_{ts}\) are set to zero.

To determine an estimated set of edges, we define

$$\begin{array}{@{}rcl@{}} \widehat{\mathcal{N}}(s) = \left\{ t \in V : \widehat{\theta}_{st} \neq 0 \right\} \end{array} $$

for s∈V. Then

$$\begin{array}{@{}rcl@{}} \widehat{E} = \left\{ (s,t) : s \in \widehat{\mathcal{N}}(t) \ \text{and} \ t \in \widehat{\mathcal{N}}(s) \right\} \end{array} $$
(11)

is taken as the set of edges that are estimated to exist. The R package huge can be used to visualize the resulting graph.
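Once the node-wise estimates are arranged in a matrix, the AND rule and the edge set (11) are straightforward to code; a minimal numpy sketch, with `and_rule_edges` a name of our own choosing:

```python
import numpy as np

def and_rule_edges(Theta_hat):
    """Symmetrize node-wise estimates by the AND rule and return E-hat (11).

    Theta_hat[s, t] holds the estimate of theta_st from the regression of
    node s.  An edge (s, t) is kept only when both theta_st-hat and
    theta_ts-hat are nonzero; following the text, the common value is then
    taken as their maximum, and otherwise both entries are set to zero.
    """
    p = Theta_hat.shape[0]
    Theta = np.zeros_like(Theta_hat)
    edges = []
    for s in range(p):
        for t in range(s + 1, p):
            if Theta_hat[s, t] != 0 and Theta_hat[t, s] != 0:
                val = max(Theta_hat[s, t], Theta_hat[t, s])
                Theta[s, t] = Theta[t, s] = val
                edges.append((s, t))
    return Theta, edges
```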

Under mild regularity conditions, the estimated set of edges \(\widehat {E}\) recovers the true network structure E with probability tending to one, as stated in the following proposition, which is available from Ravikumar et al. (2010, Section 2.2) and Theorem 5(b) of Yang et al. (2015).

Proposition 1

(Network Recovery) Suppose E is the set of edges, and let \(\widehat {E}\) be the estimated set of edges. Under the regularity conditions in Meinshausen and Bühlmann (2006), we have that as n→∞,

$$\begin{array}{@{}rcl@{}} P\left(\widehat{E} = E\right) \rightarrow 1. \end{array} $$

Logistic regression with homogeneous graphically structured predictors

To incorporate the network structures of the predictors into building a prediction model, in the next two subsections, we present two methods which can be readily implemented using the R package huge and the R function glm for fitting a logistic regression model.

In the first method, called the logistic regression with homogeneous graphically structured predictors (LR-HomoGraph) method, we consider the case where the subjects in different classes share a common network structure in the predictors. To build a prediction model, we make use of the development of the logistic model with multiclass responses, discussed by Agresti (2007, Section 6.1) and Agresti (2012, Section 7.1).

We first identify the pairwise dependence of the predictors using the measurements of all the subjects without distinguishing their class labels. Let \(\widehat {\theta }_{st}\) be the estimate of θst obtained from (9) by using all the predictor measurements {X·j:j=1,⋯,n}, and let \(\widehat {E} = \left \{ (s,t) : \widehat {\theta }_{st} \neq 0 \right \}\) denote the resulting estimated set of edges.

Next, for i=1,⋯,I and j=1,⋯,n, we let

$$\begin{array}{@{}rcl@{}} p_{ij}(x_{\cdot j}) = P\left(\left. Y_{\cdot j} = i \right| X_{\cdot j} = x_{\cdot j} \right) \end{array} $$

be the conditional probability of Y·j=i given X·j=x·j. Consider the logistic regression model

$$\begin{array}{@{}rcl@{}} p_{ij}(x_{\cdot j}) = \frac{\exp\left(\alpha_{i0} + \sum \limits_{(s,t) \in \widehat{E}} \alpha_{i,st} x_{\cdot js} x_{\cdot jt} \right)}{1 + \sum \limits_{l=1}^{I-1} \exp\left(\alpha_{l0} + \sum \limits_{(s,t) \in \widehat{E}} \alpha_{l,st} x_{\cdot js} x_{\cdot jt} \right)} \end{array} $$
(12)

for i=1,2,⋯,I−1, where (αi0,αi,st) is the vector of parameters associated with class i and the constraint \(\sum \limits _{i=1}^{I} p_{ij}(x) = 1\) is imposed for every j=1,⋯,n.

For subject j=1,⋯,n, we let \(Y_{ij}^{\ast } = 1\) if subject j is in class i and \(Y_{ij}^{\ast } = 0\) otherwise, and hence \(\sum \limits _{i=1}^{I} Y_{ij}^{\ast } = 1\) for every j. Let \(y_{ij}^{\ast }\) denote a realized value of \(Y_{ij}^{\ast }\). For i=1,⋯,I and j=1,⋯,n, the likelihood function is given by (Agresti 2012, p.273)

$$\begin{array}{@{}rcl@{}} L(\alpha) = \prod \limits_{i=1}^{I} \left\{ \prod \limits_{j=1}^{n} p_{ij}(x_{\cdot j})^{y_{ij}^{\ast}} \right\}, \end{array} $$
(13)

where \(\alpha = \left (\alpha _{10}, \alpha _{1\cdot }^{\top },\cdots, \alpha _{(I-1)0}, \alpha _{(I-1)\cdot }^{\top } \right)^{\top }\) is the vector of parameters with vector \(\alpha _{i\cdot } = \left (\alpha _{i,st} : (s,t) \in \widehat {E} \right)^{\top }\) for i=1,⋯,I−1.

The estimator \(\widehat {\alpha }\) can be derived by maximizing (13) with respect to α. Therefore, for the realization x·j of the p-dimensional vector X·j, pij(x·j) is estimated as

$$\begin{array}{@{}rcl@{}} \widehat{p}_{ij}(x_{\cdot j}) = \frac{\exp\left(\widehat{\alpha}_{i0} + \sum \limits_{(s,t) \in \widehat{E}} \widehat{\alpha}_{i,st} x_{\cdot js} x_{\cdot jt} \right)}{1 + \sum \limits_{l=1}^{I-1} \exp\left(\widehat{\alpha}_{l0} + \sum \limits_{(s,t) \in \widehat{E}} \widehat{\alpha}_{l,st} x_{\cdot js} x_{\cdot jt} \right)}\ \ \text{for} \ \ i = 1,\cdots,I-1, \end{array} $$
(14)

and pIj(x·j) is estimated as

$$\begin{array}{@{}rcl@{}} \widehat{p}_{Ij}(x_{\cdot j}) = 1 - \sum\limits_{i=1}^{I-1} \widehat{p}_{ij}(x_{\cdot j}). \end{array} $$
(15)

Finally, to predict the class label for a new subject with a p-dimensional predictor \(\widetilde {x}\), we first calculate the right-hand sides of (14) and (15), and let \(\widetilde {\widehat {p}}_{1},\cdots,\widetilde {\widehat {p}}_{I}\) denote the corresponding values. Let i∗ denote the index which corresponds to the largest value of \(\left \{ \widetilde {\widehat {p}}_{1},\cdots,\widetilde {\widehat {p}}_{I} \right \}\), i.e., \(i^{\ast } = \underset {1 \leq i \leq I}{\text {argmax}}\ \widetilde {\widehat {p}}_{i}\). Then the class label for this new subject is predicted as i∗.
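The LR-HomoGraph fit-and-predict cycle can be sketched as follows, assuming the edge set Ê has already been estimated; the edge set, data, and labels below are synthetic placeholders, and in recent scikit-learn versions `LogisticRegression` fits the multinomial (softmax) model corresponding to (12) on the pairwise features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def interaction_features(X, edges):
    """Build the pairwise features x_s * x_t for (s, t) in the edge set."""
    return np.column_stack([X[:, s] * X[:, t] for s, t in edges])

# Hypothetical estimated edge set E-hat and synthetic training data
edges = [(0, 1), (1, 2)]
rng = np.random.default_rng(2)
X_train = rng.normal(size=(90, 3))
y_train = rng.integers(1, 4, size=90)          # class labels in {1, 2, 3}

# Fit the multinomial logistic model (12) on the interaction features
clf = LogisticRegression().fit(interaction_features(X_train, edges), y_train)

# Predict a new subject: the label maximizing the estimated (14)-(15)
x_new = rng.normal(size=(1, 3))
label = clf.predict(interaction_features(x_new, edges))[0]
```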

Logistic regression with class-dependent graphically structured predictors

We now present an alternative to the method described in “Logistic regression with homogeneous graphically structured predictors” section. Instead of pooling all the covariates to feature the covariate network structure, this method, called the logistic regression with class-dependent graphically structured covariates (LR-ClassGraph) method, stratifies the covariate information by class when characterizing the covariate network structures.

We first introduce a binary, surrogate response variable \(Y_{ij}^{i}\) for every i and j, where i=1,⋯,I and j=1,⋯,n. Let

$$\begin{array}{@{}rcl@{}} Y_{ij}^{i} = \left\{ \begin{array}{c c} 1, & Y_{ij} = i,\\ 0, & \text{otherwise}, \end{array} \right. \end{array} $$

and define \(Y^{i} = \left (0,\cdots,0,Y_{i1}^{i},\cdots,Y_{in_{i}}^{i},0,\cdots,0 \right)^{\top }\) to be an n-dimensional vector whose elements corresponding to class i are \(Y_{i1}^{i},\cdots,Y_{in_{i}}^{i}\), respectively, and whose other elements are zero. That is, \(Y^{i} = (\underbrace {0,\cdots,0}_{n_{1} + \cdots + n_{i-1}}, \underbrace {1,\cdots,1}_{n_{i}}, \underbrace {0,\cdots,0}_{n_{i+1}+ \cdots + n_{I}})^{\top }\) for i=1,⋯,I. We now implement the following steps.

Step 1: (Class-Dependent Predictor Network) For each class i=1,⋯,I, we apply the procedure described in “Predictor network structure” section to determine the network structure of the predictors in class i. Let \(\widehat {E}^{i} = \left \{ (s,t) : \widehat {\theta }_{st}^{i} \neq 0 \right \}\) denote the estimated set of edges for class i, where \(\widehat {\theta }_{st}^{i}\) is the estimate of θst derived from (9) based on the predictor measurements in class i.

Step 2: (Class-Dependent Model Building) For each class i=1,⋯,I, fit a logistic regression model using the surrogate response vector Yi with the estimated covariate network structure \(\widehat {E}^{i}\) incorporated. Specifically, for the jth component \(Y^{i}_{j}\) of \(Y^{i}\), define \({\pi }_{j}^{i}(x_{\cdot j}) = P\left (Y_{j}^{i} = 1 | X_{\cdot j} = x_{\cdot j} \right)\) and consider the logistic regression model

$$\begin{array}{@{}rcl@{}} \text{logit} \left\{ {\pi}_{j}^{i}(x_{\cdot j})\right\} = \gamma_{0}^{i} + \sum \limits_{(s,t) \in \widehat{E}^{i}} \gamma_{st}^{i} x_{\cdot js} x_{\cdot jt}, \end{array} $$
(16)

where j=1,⋯,n, and \(\left (\gamma _{0}^{i}, \gamma _{st}^{i}\right)^{\top }\) is the vector of parameters associated with class i. By the theory of maximum likelihood (e.g., Agresti 2012), we obtain the estimate \(\left (\widehat {\gamma }_{0}^{i}, \widehat {\gamma }_{st}^{i} \right)^{\top }\) of \(\left (\gamma _{0}^{i}, \gamma _{st}^{i} \right)^{\top }\).

Step 3: (Prediction) For a realization x·j of the p-dimensional vector X·j, based on (16), \({\pi }_{j}^{i}(x_{\cdot j})\) can be estimated by

$$\begin{array}{@{}rcl@{}} \widehat{\pi}_{j}^{i}(x_{\cdot j}) = \frac{\exp\left(\widehat{\gamma}_{0}^{i} + \sum \limits_{(s,t) \in \widehat{E}^{i}} \widehat{\gamma}_{st}^{i} x_{\cdot js} x_{\cdot jt}\right)}{1+\exp\left(\widehat{\gamma}_{0}^{i} + \sum \limits_{(s,t) \in \widehat{E}^{i}} \widehat{\gamma}_{st}^{i} x_{\cdot js} x_{\cdot jt}\right)}\ \ \ \text{for} \ \ i = 1,\cdots, I. \\ \end{array} $$
(17)

To predict the class label for a new subject with a p-dimensional covariate vector \(\widetilde {x}\), we first calculate (17) with x·j replaced by \(\widetilde {x}\) for i=1,⋯,I, and let \(\widetilde {\widehat {\pi }^{1}},\cdots,\widetilde {\widehat {\pi }^{I}}\) denote the corresponding values. Let i∗ denote the index which corresponds to the largest value of \(\left \{ \widetilde {\widehat {\pi }^{1}},\cdots,\widetilde {\widehat {\pi }^{I}} \right \}\), i.e.,

$$\begin{array}{@{}rcl@{}} \widetilde{\widehat{\pi}^{i^{\ast}}} = \max \limits_{1 \leq i \leq I} \widetilde{\widehat{\pi}^{i}}. \end{array} $$
(18)

Then the class label for this new subject is predicted as i∗.
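The three steps above can be sketched as follows, assuming the class-specific edge sets Ê^i have already been estimated (here they are synthetic placeholders); `fit_class_graph` and `predict_class_graph` are our own helper names, one binary fit of (16) per class and a final argmax as in (18).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_class_graph(X, y, edge_sets):
    """Step 2: one binary logistic fit (16) per class i, each on its own
    class-specific interaction features from E-hat^i."""
    fits = {}
    for i, edges in edge_sets.items():
        Z = np.column_stack([X[:, s] * X[:, t] for s, t in edges])
        fits[i] = LogisticRegression().fit(Z, (y == i).astype(int))
    return fits

def predict_class_graph(fits, edge_sets, x_new):
    """Step 3: predict the class with the largest estimated pi-hat^i, as in (18)."""
    scores = {}
    for i, fit in fits.items():
        z = np.array([[x_new[s] * x_new[t] for s, t in edge_sets[i]]])
        scores[i] = fit.predict_proba(z)[0, 1]
    return max(scores, key=scores.get)
```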

Comparison of decision boundaries

As noted in “Logistic regression with homogeneous graphically structured predictors” and “Logistic regression with class-dependent graphically structured predictors” sections, while both the LR-HomoGraph and LR-ClassGraph methods employ logistic regression for classification, they differ in how the predictor structures are featured. We may further compare the two methods in terms of their decision boundaries.

First, we examine the decision boundaries for the LR-HomoGraph method. For ik, the boundary between the ith and kth classes is determined by

$$\widehat{p}_{ij}(x_{\cdot j}) =\widehat{p}_{kj}(x_{\cdot j}) $$

for a new instance with the predictor value x·j, where \(\widehat {p}_{ij}(x_{\cdot j})\) and \(\widehat {p}_{kj}(x_{\cdot j})\) are given by (14) or (15). To be more specific, for any i=1,...,I−1, if k=1,...,I−1 and k≠i, then by (14), the boundary between the ith and kth classes is

$$ \sum \limits_{(s,t) \in \widehat{E}} (\widehat{\alpha}_{i,st}-\widehat{\alpha}_{k,st}) x_{\cdot js} x_{\cdot jt} +(\widehat{\alpha}_{i0} - \widehat{\alpha}_{k0})=0; $$
(19)

and the boundary between the ith and Ith classes is, by (15),

$$ \sum \limits_{(s,t) \in \widehat{E}} \widehat{\alpha}_{i,st} x_{\cdot js} x_{\cdot jt} +\widehat{\alpha}_{i0}=0. $$
(20)

Similarly, the decision boundaries for the LR-ClassGraph method can be determined based on (17). For ik, equating \(\widehat {\pi }_{j}^{i}(x_{\cdot j})\) and \( \widehat {\pi }_{j}^{k}(x_{\cdot j})\) for a covariate value x·j gives the boundary between the ith and kth classes

$$ \sum \limits_{(s,t) \in \widehat{E}^{i}} \widehat{\gamma}_{st}^{i} x_{\cdot js} x_{\cdot jt}-\sum \limits_{(s,t) \in \widehat{E}^{k}} \widehat{\gamma}_{st}^{k} x_{\cdot js} x_{\cdot jt} +\left(\widehat{\gamma}_{0}^{i} -\widehat{\gamma}_{0}^{k}\right)=0. $$
(21)

Comparing (21) with (19) or (20) shows that the decision boundaries of both the LR-HomoGraph and LR-ClassGraph methods are quadratic surfaces determined by the features selected from the graphical models. However, the two methods incorporate the features differently. The boundaries (21) are determined by the quadratic terms identified using instances from classes i and k separately, whereas the quadratic terms in the boundary (19) or (20) are not distinguished by the class labels. In addition, the coefficients \(\widehat {\gamma }_{st}^{i}\) and \(\widehat {\alpha }_{i,st}\) associated with the decision boundaries are generally different.

Evaluation of the performance

In this section we discuss the evaluation of the procedures proposed in “Logistic regression with homogeneous graphically structured predictors” and “Logistic regression with class-dependent graphically structured predictors” sections. For comparisons, we also examine some conventional classification methods in machine learning, including support vector machine (SVM), linear discriminant analysis (LDA), K-nearest neighbor (KNN), and extreme gradient boosting (XGBOOST). We first describe the measures of assessing the prediction error that are commonly used, and then we briefly review the four classification methods.

Criteria for performances

In this subsection, we describe several criteria for evaluating prediction performance. To show the overall performance of prediction, we consider either micro-averaged or macro-averaged metrics (Parambath et al. 2018). For subject j=1,⋯,n, let \(\widehat {y}_{\cdot j}\) denote the predicted class label. For class i=1,⋯,I, we calculate the numbers of true positives, false positives, and false negatives, respectively, given by

$$\text{TP}_{i} = \sum \limits_{j=1}^{n} \mathbb{I} \left(y_{\cdot j} = i, \widehat{y}_{\cdot j} = i\right), \text{FP}_{i} = \sum \limits_{j=1}^{n} \mathbb{I} \left(y_{\cdot j} \neq i, \widehat{y}_{\cdot j} = i\right), $$

and

$$\begin{array}{@{}rcl@{}} \text{FN}_{i} = \sum \limits_{j=1}^{n} \mathbb{I} \left(y_{\cdot j} = i, \widehat{y}_{\cdot j} \neq i\right), \end{array} $$

where \(\mathbb {I}(\cdot)\) is the indicator function. For micro averaged metrics, we define precision and recall, respectively, given by

$$\begin{array}{@{}rcl@{}} PRE_{micro} = \frac{\sum \limits_{i=1}^{I} \text{TP}_{i}}{\sum \limits_{i=1}^{I} \text{TP}_{i} + \sum \limits_{i=1}^{I} \text{FP}_{i}} \ \ \text{and} \ \ REC_{micro} = \frac{\sum \limits_{i=1}^{I} \text{TP}_{i}}{\sum \limits_{i=1}^{I} \text{TP}_{i} + \sum \limits_{i=1}^{I} \text{FN}_{i}}. \end{array} $$

Then Micro-F-score is defined as

$$\begin{array}{@{}rcl@{}} F_{micro} = 2 \times \frac{PRE_{micro} \times REC_{micro}}{PRE_{micro} + REC_{micro}}. \end{array} $$
(22)

On the other hand, for macro averaged metrics, for i=1,⋯,I, let \(PRE_{i} = \frac {\text {TP}_{i}}{\text {TP}_{i} + \text {FP}_{i}}\) denote precision for class i, and let \(REC_{i} = \frac {\text {TP}_{i}}{\text {TP}_{i} + \text {FN}_{i}}\) denote recall for class i. Then the overall precision and recall are, respectively, defined as

$$\begin{array}{@{}rcl@{}} PRE_{macro} = \frac{1}{I} \sum \limits_{i=1}^{I} PRE_{i} \ \ \text{and} \ \ REC_{macro} = \frac{1}{I} \sum \limits_{i=1}^{I} REC_{i}; \end{array} $$

and Macro-F-score is defined as

$$\begin{array}{@{}rcl@{}} F_{macro} = 2 \times \frac{PRE_{macro} \times REC_{macro}}{PRE_{macro} + REC_{macro}}. \end{array} $$
(23)

In principle, higher values of the precision, recall, and F-score, whether micro- or macro-averaged, reflect better performance (Parambath et al. 2018; Sokolova et al. 2006).
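For concreteness, the counts TPi, FPi, FNi and the scores (22)-(23) can be computed as below; `micro_macro_f` is a hypothetical helper, and a class that is never predicted would need a zero-division guard not shown here.

```python
import numpy as np

def micro_macro_f(y_true, y_pred, classes):
    """Micro- and Macro-F-scores (22)-(23) from per-class TP/FP/FN counts."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    TP = np.array([np.sum((y_true == i) & (y_pred == i)) for i in classes])
    FP = np.array([np.sum((y_true != i) & (y_pred == i)) for i in classes])
    FN = np.array([np.sum((y_true == i) & (y_pred != i)) for i in classes])
    # Micro: pool the counts over classes, then form precision and recall
    pre_mic = TP.sum() / (TP.sum() + FP.sum())
    rec_mic = TP.sum() / (TP.sum() + FN.sum())
    # Macro: average the per-class precision and recall
    pre_mac = np.mean(TP / (TP + FP))
    rec_mac = np.mean(TP / (TP + FN))
    f = lambda p, r: 2 * p * r / (p + r)
    return f(pre_mic, rec_mic), f(pre_mac, rec_mac)
```

For example, with true labels (1,1,2,2,3,3) and predictions (1,2,2,2,3,1), the pooled counts give a Micro-F-score of 2/3, while the per-class averages give a slightly different Macro-F-score.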

Support vector machine for multiclass responses

The support vector machine (SVM) was originally designed for two-class classification (Hastie et al. 2008, Sec. 12.2), and its extensions to multiclass responses have been discussed by many authors. An early extension of the SVM to multiclass classification is the one-against-all method (Hsu and Lin 2002). The main idea is that the ith SVM is trained on all subjects, with positive labels assigned to those in the ith class and negative labels to all others. This type of SVM for multiclass classification, however, ignores the heterogeneity among the subjects in each class.

A useful multiclass SVM is the one-against-one method (Knerr et al. 1990), which is implemented in the R package e1071. Different from the one-against-all method, the one-against-one method produces I(I−1)/2 pairwise classifiers, each trained on the data from two selected classes, by applying the binary-classification SVM to each pair of classes. To see this, for i1,i2∈{1,⋯,I} with i1<i2, we consider the following optimization:

$$\begin{array}{*{20}l} &\mathop{\text{min}}_{w^{i_{1}i_{2}},b^{i_{1}i_{2}},\xi_{j}^{i_{1}i_{2}}} \quad \left\{ \frac{1}{2}\left(w^{i_{1}i_{2}}\right)^{\top} w^{i_{1}i_{2}} + C \sum_{j=1}^{n}\xi_{j}^{i_{1}i_{2}} \right\} \\ \text{subject to}&\\ &\text{for} \ \ j=1,\ldots,n,\ \ \\ &\qquad\ \xi_{j}^{i_{1}i_{2}} \ge 0,\\ &\qquad\left\{\left(w^{i_{1}i_{2}}\right)^{\top} \phi(X_{\cdot j}) + b^{i_{1}i_{2}} \right\} \ge 1-\xi_{j}^{i_{1}i_{2}} \ \ \text{if} \ \ Y_{\cdot j} = i_{1}, \\ &\qquad \left\{ \left(w^{i_{1}i_{2}}\right)^{\top} \phi(X_{\cdot j}) + b^{i_{1}i_{2}}\right\} \le -1+\xi_{j}^{i_{1}i_{2}} \ \ \text{if} \ \ Y_{\cdot j} = i_{2}, \end{array} $$
(24)

where ϕ(·) is a non-linear mapping from a p-dimensional vector to a q-dimensional vector with q>p (Hsu and Lin 2002), \(w^{i_{1}i_{2}}\) is a q-dimensional vector of parameters associated with the comparison between classes i1 and i2, \(b^{i_{1}i_{2}}\) is a scalar, \(\xi _{j}^{i_{1}i_{2}}\) is the slack variable for the soft margin solution, and C is a cost parameter controlling the balance between maximizing the margin and minimizing the training error.

Solving (24) for arbitrary i1,i2∈{1,⋯,I} with i1<i2 yields I(I−1)/2 classifiers, and those classifiers can then be used for classification of a new instance, say \(\widetilde {X} = \widetilde {x}\). This can be done through a voting process (Hsu and Lin 2002). Specifically, let \(\mathcal {L} = \left \{(1,2), (1,3), \cdots, (1,I), (2,3), \cdots, (2,I),\cdots, (I-1,I)\right \}\) be the collection of all pairwise class labels, which contains I(I−1)/2 elements. For each class i with i=1,⋯,I, we let vote(i) denote the “number of votes” received by class i. Then we carry out the following three steps.

  • For class i=1,⋯,I, the initial value of vote(i) is set as 0.

  • For any given class i, we consider a subcollection of \(\mathcal {L}\), \(\left \{ (i,i'): i'=i+1,\cdots,I \right \}\), which is associated with class i. Calculate \(\text {sign}\left \{ (w^{ii'})^{\top } \phi (\widetilde {x}) + b^{ii'} \right \}\) repeatedly for i′=i+1,⋯,I, and then determine the values of vote(i) and vote(i′) iteratively by the rule:

    $$\begin{array}{@{}rcl@{}} &&\text{If}\ \text{sign}\left\{ (w^{ii'})^{\top} \phi(\widetilde{x}) + b^{ii'} \right\} > 0,\ \text{then we let} \\ & \ \ & \ \ \ \ \ \ \ \ \ \ vote(i) = vote(i) + 1; \\ &&\text{otherwise}, \\ &\ \ & \ \ \ \ \ \ \ \ \ \ vote(i') = vote(i') + 1; \end{array} $$

    where vote(i) on the right-hand side of the equation is the value determined by the previous step, vote(i) on the left-hand side of the equation represents the newly determined value, and i′=i+1,⋯,I.

  • Repeat Step 2 for i=1,⋯,I. In this way, we determine all the final values of vote(1),⋯,vote(I). Let \(i^{\ast }\) denote the class index corresponding to the largest value of {vote(1),⋯,vote(I)}, i.e., \(i^{\ast } = \underset {1 \leq i \leq I}{\text {argmax}} \left \{vote(i)\right \}\). Then we let \(i^{\ast }\) be the predicted class for the new instance.
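The three-step voting scheme above can be sketched as follows. Here `decide(i1, i2, x)` is a hypothetical stand-in for the trained pairwise rule \(\text {sign}\left \{ (w^{i_{1}i_{2}})^{\top } \phi (\widetilde {x}) + b^{i_{1}i_{2}} \right \}\); the function name and interface are ours, not part of the paper.

```python
import itertools

def ovo_predict(x, classes, decide):
    """One-against-one voting: decide(i1, i2, x) > 0 favours class i1,
    otherwise class i2; the class with the most votes is returned."""
    vote = {i: 0 for i in classes}                       # Step 1: all votes start at 0
    for i1, i2 in itertools.combinations(classes, 2):    # Steps 2-3: all pairs (i, i')
        if decide(i1, i2, x) > 0:
            vote[i1] += 1
        else:
            vote[i2] += 1
    # i* = argmax vote(i); ties are broken toward the earlier class label.
    return max(classes, key=lambda i: vote[i])
```

For example, a `decide` that always returns a negative sign makes every pair vote for its second class, so the largest class label collects the most votes and is returned.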

Linear discriminant analysis

The idea of LDA is to model the distribution of the predictors X·j separately for each of the classes Y·j, and then use the Bayes theorem to obtain the conditional probabilities P(Y·j=i|X·j=x·j) (e.g., James et al. 2017). For i=1,⋯,I and j=1,⋯,n, let fj|i(x·j) denote the conditional probability density function of the predictor X·j taking value x·j given that subject j comes from the ith class. Let πi,j=P(Y·j=i) denote the probability that the jth subject is randomly selected from class i. It is immediate that \(\sum \limits _{i=1}^{I} \pi _{i,j} = 1\) for j=1,⋯,n. By some algebra (Hastie et al. 2008, p.108) and the Bayes theorem, we obtain the posterior probability

$$\begin{array}{@{}rcl@{}} P\left(Y_{\cdot j} = i | X_{\cdot j} = x_{\cdot j} \right) = \frac{f_{j|i}(x_{\cdot j})\pi_{i,j}}{\sum \limits_{l = 1}^{I} f_{j|l}(x_{\cdot j})\pi_{l,j}} \end{array} $$
(25)

for i=1,⋯,I and j=1,⋯,n.

To compare two classes i and l with i≠l, we calculate the log-ratio of (25) for classes i and l, given by

$$\begin{array}{@{}rcl@{}} \log \left\{ \frac{P\left(Y_{\cdot j} = i | X_{\cdot j} = x_{\cdot j} \right)}{P\left(Y_{\cdot j} = l | X_{\cdot j} = x_{\cdot j} \right)} \right\} &=& \log \left(\frac{f_{j|i}(x_{\cdot j})}{f_{j|l}(x_{\cdot j})} \right) + \log \left(\frac{\pi_{i,j}}{\pi_{l,j}} \right). \end{array} $$
(26)

To elaborate on the idea, we particularly consider the case where the conditional distribution fj|i(x·j) of X·j given Y·j=i is assumed to be the normal distribution N(μi,Σi) with the probability density function

$$\begin{array}{@{}rcl@{}} f_{j|i}(x_{\cdot j}) = \frac{1}{\left(2 \pi \right)^{p/2} \left| \Sigma_{i} \right|^{1/2}} \exp \left\{- \frac{1}{2} \left(x_{\cdot j} - \mu_{i} \right)^{\top} \Sigma_{i}^{-1} \left(x_{\cdot j} - \mu_{i} \right) \right\}. \end{array} $$
(27)

If the covariance matrices Σi in (27) are assumed to be common, i.e., Σi=Σ for every i where Σ is a positive definite matrix, (26) becomes

$$\begin{array}{@{}rcl@{}} \log \left(\frac{\pi_{i,j}}{\pi_{l,j}} \right) - \frac{1}{2} \left(\mu_{i} + \mu_{l} \right)^{\top} \Sigma^{-1} \left(\mu_{i} - \mu_{l} \right) + x_{\cdot j}^{\top} \Sigma^{-1} \left(\mu_{i} - \mu_{l} \right). \end{array} $$
(28)

If the quantity in (28) is positive, then

$$P\left(Y_{\cdot j} = i | X_{\cdot j} = x_{\cdot j} \right) > P\left(Y_{\cdot j} = l | X_{\cdot j} = x_{\cdot j} \right), $$

showing that subject j with predictors X·j=x·j is more likely to be selected from class i than from class l. Consequently, (28) defines a boundary between classes i and l which is a linear function of x·j.

Motivated by the form of (28), we consider a linear function in x

$$\begin{array}{@{}rcl@{}} \delta_{i}(x) = \log \left(\pi_{i} \right) - \frac{1}{2} \mu_{i}^{\top} \Sigma^{-1} \mu_{i} + x^{\top} \Sigma^{-1} \mu_{i}, \end{array} $$
(29)

where μi,πi, and Σ are estimated by \(\widehat {\mu }_{i} = \frac {1}{n_{i}} \sum \limits _{y_{\cdot j} = i} x_{\cdot j}, \widehat {\pi }_{i} = \frac {n_{i}}{n}\), and \(\widehat {\Sigma } = \frac {1}{n-I} \sum \limits _{i=1}^{I} \sum \limits _{y_{\cdot j} = i} \left (x_{\cdot j} - \widehat {\mu }_{i} \right)\left (x_{\cdot j} - \widehat {\mu }_{i} \right)^{\top }\), respectively. That is, (29) can be estimated by

$$\begin{array}{@{}rcl@{}} \widehat{\delta}_{i}(x) = \log \left(\widehat{\pi}_{i} \right) - \frac{1}{2} \widehat{\mu}_{i}^{\top} \widehat{\Sigma}^{-1} \widehat{\mu}_{i} + x^{\top} \widehat{\Sigma}^{-1} \widehat{\mu}_{i}. \end{array} $$
(30)

Function (30) is called the linear discriminant function and is used to determine the class label for a new instance (James et al. 2017, p.143; Hastie et al. 2008, p. 109). For the prediction of a new subject with covariate \(\widetilde {x}\), we first calculate \(\widehat {\delta }_{i}(\widetilde {x})\) using (30) for i=1,⋯,I. Next, we find \(i^{\ast }\), which is defined as

$$\begin{array}{@{}rcl@{}} i^{\ast} = \underset{i=1,\cdots,I}{\text{argmax}}\ \widehat{\delta}_{i}(\widetilde{x}); \end{array} $$

and the class label for this subject is then predicted as \(i^{\ast }\).
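A minimal sketch of the plug-in steps in (29) and (30) follows, assuming the rows of X are subjects and y holds the class labels, and using the estimators \(\widehat {\mu }_{i}\), \(\widehat {\pi }_{i}\), and \(\widehat {\Sigma }\) given above. The function names are ours; this is an illustration, not the paper's implementation.

```python
import numpy as np

def lda_fit(X, y):
    """Return a predictor implementing (30): i* = argmax_i delta_hat_i(x)."""
    classes = np.unique(y)
    n, I = len(y), len(classes)
    mu = {i: X[y == i].mean(axis=0) for i in classes}      # class means mu_hat_i
    pi = {i: (y == i).mean() for i in classes}             # priors pi_hat_i = n_i / n
    # Pooled within-class covariance with divisor n - I, as in the text.
    S = sum((X[y == i] - mu[i]).T @ (X[y == i] - mu[i]) for i in classes) / (n - I)
    Sinv = np.linalg.inv(S)

    def predict(x):
        # delta_hat_i(x) = log(pi_i) - mu_i' Sinv mu_i / 2 + x' Sinv mu_i
        d = {i: np.log(pi[i]) - 0.5 * mu[i] @ Sinv @ mu[i] + x @ Sinv @ mu[i]
             for i in classes}
        return max(d, key=d.get)                           # i*, ties by class order
    return predict
```

Calling the returned `predict` on a new \(\widetilde {x}\) evaluates the estimated discriminant function for every class and reports the maximizer.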

K-nearest neighbor

The third classification method we compare with is the K-nearest neighbor (KNN) method, a non-parametric approach. The key idea of KNN is to use the available instances to estimate the conditional probability of Y·j given X·j, and then classify a new instance to a certain class based on the highest estimated conditional probability.

For a positive integer K and a new instance with predictor value \(\widetilde {x}\), the first step of KNN is to identify the K training points which are closest to \(\widetilde {x}\); let \(\mathcal {N}_{0}\left (\widetilde {x} \right)\) denote the set containing these K nearest points of \(\widetilde {x}\). Next, for i=1,⋯,I, we calculate

$$\begin{array}{@{}rcl@{}} \widehat{\pi}_{i} = \frac{1}{K} \sum \limits_{j' \in \mathcal{N}_{0}(\widetilde{x})} \mathbb{I}(y_{\cdot j'} = i). \end{array} $$

Finally, let \(i^{\ast }\) denote the class label which corresponds to the largest value of \(\left \{ \widehat {\pi }_{1},\cdots, \widehat {\pi }_{I} \right \}\). Then the class label for this new subject is predicted as \(i^{\ast }\).
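The KNN rule just described can be sketched as follows, assuming Euclidean distance and arbitrary tie-breaking; the function name is ours and the code is illustrative only.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, K):
    """Classify x_new by majority vote among its K nearest training points."""
    dist = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distances
    neigh = np.argsort(dist)[:K]                     # indices of N_0(x~)
    counts = Counter(y_train[j] for j in neigh)      # K * pi_hat_i per class
    return counts.most_common(1)[0][0]               # i* with largest pi_hat_i
```

Dividing the class counts by K would recover the estimated probabilities \(\widehat {\pi }_{i}\) themselves; only their maximizer is needed for classification.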

For the KNN method, a crucial issue is the selection of K. A small value of K usually yields an over-flexible decision boundary, which makes the classifier have a small bias but a large variance. In contrast, with a large K, the boundary becomes less flexible and close to linear, and the classifier would have a small variance but a large bias. To determine an optimal K from the theoretical perspective, James et al. (2017, p. 184 and p. 186) suggested using the cross-validation method to select K; from the computational viewpoint, however, a choice of K may sometimes be based on a random guess, as commented by James et al. (2017, p. 167).

Extreme gradient boosting

Extreme gradient boosting (XGBOOST) is a tree-based ensemble method created under the gradient boosting framework (e.g., Chen and Guestrin 2016), and it can be implemented by the R package xgboost.

Let \(\mathcal {F}\) denote the space of functions representing regression trees. For \(f \in {\mathcal {F}}\) with \(f(x) = w_{q(x)}\), the map \(q: \mathbb {R}^{p} \rightarrow {\mathcal {L}}\) reflects the structure of the tree f by assigning an example to the corresponding leaf index, \({\mathcal {L}}\) is the set of the leaf indices, \(w \in \mathbb {R}^{T}\) is the vector of leaf weights, and T is the number of leaves in the tree. Suppose that K regression trees \(f_{k} \in {\mathcal {F}}\) with k=1,⋯,K are used to predict the output:

$$\begin{array}{@{}rcl@{}} \widehat{y}_{\cdot j} = \sum \limits_{k=1}^{K} f_{k}\left(x_{\cdot j} \right) \end{array} $$

for an example with the input x·j.

To learn the set of functions used for classification, we minimize the regularized objective function

$$\begin{array}{@{}rcl@{}} \mathcal{L}(y,\widehat{y}) = \sum \limits_{j=1}^{n} L(y_{\cdot j},\widehat{y}_{\cdot j}) + \sum \limits_{k=1}^{K} \Omega(f_{k}), \end{array} $$
(31)

where Ω is the regularization used to measure the model complexity, given by

$$ {\kern110pt}\Omega(f) = \gamma T+ \frac{1}{2}\lambda \left\|w\right\|^{2} $$
(32)

with tuning parameters γ and λ. Here L(·,·) is the loss function which measures how well the model fits the training data. For the multiclass classification problem discussed in “Classification with predictor graphical structures accommodated” section, we specify L(·,·) as

$$\begin{array}{@{}rcl@{}} \sum \limits_{j=1}^{n} L(y_{\cdot j},\widehat{y}_{\cdot j}) = - \sum \limits_{i=1}^{I} \sum \limits_{j=1}^{n} y_{{ij}} \log \left(p_{{ij}} \right) \end{array} $$

with \(p_{{ij}} = \frac {\exp \left (\widehat {y}_{{ij}}\right)}{1 + \sum \limits _{l=1}^{I-1} \exp \left (\widehat {y}_{{lj}}\right)}\) for i=1,…,I−1 and \(p_{{Ij}} = 1 - \sum \limits _{i=1}^{I-1} p_{{ij}}\).

While the objective function in (31) conceptually balances the tradeoff between predictive accuracy and model complexity, minimizing (31) cannot be directly carried out using traditional optimization procedures. One approach is to train the model additively, as in the gradient tree boosting algorithm, and to invoke a second-order approximation to the objective function at each iteration. Specifically, at iteration t, we define

$$\begin{array}{@{}rcl@{}} \widehat{y}_{\cdot j}^{(t)} = \sum \limits_{k=1}^{t} f_{k}\left(x_{\cdot j} \right) = \widehat{y}_{\cdot j}^{(t-1)} + f_{t}\left(x_{\cdot j} \right) \end{array} $$

with \( \widehat {y}_{\cdot j}^{(0)} = 0\), and hence the objective function

$$\begin{array}{@{}rcl@{}} \mathcal{L}^{(t)}(y,\widehat{y}) = \sum \limits_{j=1}^{n} L\left(y_{\cdot j},\widehat{y}_{\cdot j}^{(t)}\right) + \Omega(f_{t}). \end{array} $$
(33)

Applying the second-order approximation to (33) gives

$$ \mathcal{L}^{(t)}(y,\widehat{y}) \approx \sum \limits_{j=1}^{n} \left\{ L\left(y_{\cdot j},\widehat{y}_{\cdot j}^{(t-1)}\right) + g_{j} f_{t}\left(x_{\cdot j} \right) + \frac{1}{2} h_{j} f_{t}^{2}\left(x_{\cdot j} \right) \right\} + \Omega(f_{t}), $$
(34)

where gj and hj are the first- and second-order derivatives of the loss function \(L(y_{\cdot j},\widehat {y}^{(t-1)})\) with respect to \(\widehat {y}^{(t-1)}\), respectively.

Let Im={j:q(x·j)=m} denote the instance set of leaf m. Then by (32), (34) becomes

$$\begin{array}{@{}rcl@{}} \mathcal{L}^{(t)}(y,\widehat{y}) &\approx & \sum \limits_{m=1}^{T} \left\{ \left(\sum \limits_{j \in I_{m}} g_{j} \right) w_{m} + \frac{1}{2} \left(\sum \limits_{j \in I_{m}} h_{j} + \lambda \right) w_{m}^{2} \right\} + \gamma T. \end{array} $$
(35)

For a given tree structure q(·), minimizing (35) gives the optimal weight \(\widehat {w}_{m}\) of leaf m and the optimal value of (35), respectively, given by

$$\widehat w_{m} = - \frac{\sum \limits_{j \in I_{m}} g_{j}}{\sum \limits_{j \in I_{m}} h_{j} + \lambda} \ \ \ \text{and} \ \ \ \widehat {\mathcal{L}}^{(t)} = - \frac{1}{2} \sum \limits_{m=1}^{T} \frac{\left(\sum \limits_{j \in I_{m}} g_{j}\right)^{2}}{\sum \limits_{j \in I_{m}} h_{j} + \lambda} + \gamma T. $$
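These expressions follow by minimizing each leaf's contribution to (35) separately. Writing \(G_{m} = \sum \limits _{j \in I_{m}} g_{j}\) and \(H_{m} = \sum \limits _{j \in I_{m}} h_{j}\) as shorthand (introduced here for brevity), leaf m contributes the quadratic \(G_{m} w_{m} + \frac {1}{2}\left (H_{m} + \lambda \right) w_{m}^{2}\) to (35), and setting its derivative with respect to wm to zero gives

$$\begin{array}{@{}rcl@{}} G_{m} + \left(H_{m} + \lambda \right) w_{m} = 0 \ \ \Longrightarrow \ \ \widehat{w}_{m} = - \frac{G_{m}}{H_{m} + \lambda}; \end{array} $$

substituting \(\widehat {w}_{m}\) back yields the contribution \(- G_{m}^{2}/\left \{2 \left (H_{m} + \lambda \right)\right \}\) of leaf m, which sums over m to \(\widehat {\mathcal {L}}^{(t)}\) above.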

Numerical studies

In this section, we first conduct simulation studies to evaluate the performance of the proposed procedures in “Classification with predictor graphical structures accommodated” section, and then we apply the procedures to analyze a real dataset to illustrate their usage. The discussion is carried out in contrast to the classification methods reviewed in “Evaluation of the performance” section as well as the usual multiclass logistic regression model in “Logistic regression model for multiclass response” section. The R functions svm (package e1071), lda (MASS), knn.cv (class), and xgboost (xgboost) are used to implement the SVM, LDA, KNN, and XGBOOST methods, respectively.

Simulation study

For class i=1,⋯,I, the predictors are generated from the multivariate normal distribution with mean zero and covariance matrix \(\Sigma _{i} = \Omega _{i}^{-1}\), where Ωi is a matrix associated with the network structure in class i with all diagonal elements 1 and off-diagonal elements 0 or 1; for s≠t, entry (s,t) is 1 if an edge exists between Xs and Xt and 0 otherwise. The relationship between a multivariate normal distribution N(0,Σi) and the Gaussian graphical model with edges determined by \(\Omega _{i} = \Sigma _{i}^{-1}\) is discussed by Hastie et al. (2015, p.246 and p.263).
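The covariate-generation step can be sketched as follows, assuming a supplied precision matrix Ωi that is positive definite; the function name and the seed handling are ours.

```python
import numpy as np

def sample_from_precision(Omega, n, seed=2023):
    """Draw n p-vectors from N(0, Sigma) with Sigma = Omega^{-1},
    so that the zero pattern of Omega encodes the absent edges."""
    rng = np.random.default_rng(seed)
    Sigma = np.linalg.inv(np.asarray(Omega, dtype=float))
    p = Sigma.shape[0]
    return rng.multivariate_normal(np.zeros(p), Sigma, size=n)
```

Fixing the seed makes a simulation run reproducible; in practice one would vary the seed across the simulation replicates.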

We specifically consider two scenarios of network structures where the dimension of predictors is p=12. In the first scenario, we specify Ωi to reflect the network structures displayed in Fig. 1. For example, element (1,5) of Ω1 is 1, but element (1,5) of Ωi is 0 for i=2,3,4. For a given class i and a subject j in this class, we calculate \(\pi _{j}^{i}(x_{\cdot j})\) by (16), where we set \(\gamma _{0}^{i} = \gamma _{{st}}^{i} = 1\). The outcome measurements are set to be \(Y_{j}^{i} = 1\) if \(\pi _{j}^{i}(x_{\cdot j}) > c\), and \(Y_{j}^{i} = 0\) otherwise, where the threshold c is chosen such that the sample size of class i equals ni.

Fig. 1
figure 1

Covariate network structures of four classes for the simulation study

In the second scenario, Ωi is taken as the identity matrix for i=1,⋯,I, so that the predictors have no network structures. For subject j, the predictor X·j is generated from the multivariate normal distribution with mean zero and the identity covariance matrix. To generate Y·j for subject j, we first calculate πij(x·j) for every i=1,⋯,I by (2) and (3), where γ0i and γi are both set as log(i)+1 for class i. Then we set Y·j=i∗, where \(i^{\ast } = \underset {i}{\text {argmax}}\ \pi _{{ij}}(x_{\cdot j})\). We continue this process until the desired size ni is achieved for i=1,⋯,I. We consider the case with I=4 and ni=50 for i=1,⋯,I and run 500 simulations. We use criteria (22) and (23) to report the performance of each method. The results are summarized in Table 2. It is seen that the proposed LR-ClassGraph method outperforms all the other classification methods, with larger values of PRE, REC and F from both micro and macro viewpoints. The SVM performs the second best, and the performance of the LR-HomoGraph method is ranked the third, followed by that of the XGBOOST method.

Table 2 Simulation study with and without network structures for covariates, respectively, indicated by Scenarios 1 and 2: I=4

To understand how the proposed methods perform with the binary classification, we repeat the preceding simulations by setting I to be 2 and taking the network structures of classes 1 and 2 when considering scenario 1. The results are in Table 3. When covariates are associated with a network structure, the proposed LR-ClassGraph method still performs the best, and the improvement of the LR-ClassGraph method over existing classifiers is a lot more noticeable for I=2 than for I=4. Interestingly, when covariates are uncorrelated, unlike the multiclass case with I=4, the LR-HomoGraph method outperforms the LR-ClassGraph method; and in this case, the SVM is the best classifier.

Table 3 Simulation study with and without network structures for covariates, respectively, indicated by Scenarios 1 and 2: I=2

Glass identification dataset

We analyze a dataset concerning glass identification. The study of classification of glass types was motivated by criminological investigation: glass left at the scene of a crime can be used as evidence if it is correctly identified. It is of interest to predict the glass type based on the information in the predictors.

The dataset contains 7 types of glass, including

  • building_windows_float_processed (Glass-1),

  • building_windows_non_float_processed (Glass-2),

  • vehicle_windows_float_processed (Glass-3),

  • vehicle_windows_non_float_processed (Glass-4),

  • containers (Glass-5),

  • tableware (Glass-6), and

  • headlamps (Glass-7),

and the predictors include 9 measurements: the refractive index (RI) and the contents of 8 chemical elements, Sodium (NA), Magnesium (MG), Aluminum (AL), Silicon (SI), Potassium (K), Calcium (CA), Barium (BA), and Iron (FE). The complete dataset is available at https://archive.ics.uci.edu/ml/datasets/glass+identification. The sample size in each class is, respectively, n1=70, n2=76, n3=17, n4=0, n5=13, n6=9, and n7=29, yielding the total sample size \(n = \sum \limits _{i=1}^{7} n_{i} = 214\). To see the correlation among the predictors, we draw pairwise scatter plots of those 9 predictors, displayed in Fig. 2. It is seen that some predictors, such as RI and CA, are highly correlated, and that many pairs of predictors are generally correlated.

Fig. 2
figure 2

Pairwise scatter plots of predictors in glass data

We first present the network structures for the different chemical materials in each class. The network structure for each class is determined by (9) and (11). The graphical results are reported in Fig. 4. It is seen that the network structure of the predictors differs from class to class. We notice that RI has no connection with the other variables in any class, and the predictor FE also has no connection with the others except in class 6.

We next evaluate the performance of our proposed methods as opposed to the conventional approaches, SVM, LDA, KNN, and XGBOOST, which are implemented by the R functions svm (e1071), lda (MASS), knn.cv (class), and xgboost, respectively. To examine the performance of LR-HomoGraph proposed in “Logistic regression with homogeneous graphically structured predictors” section, we first construct the network structures, displayed in Fig. 3, of the predictors with the class information ignored, and we then apply the procedure described in “Logistic regression with homogeneous graphically structured predictors” section. To implement the LR-ClassGraph method in “Logistic regression with class-dependent graphically structured predictors” section, we apply model (16) with respect to the six different network structures in Fig. 4, and then determine the predicted class using (18).

Fig. 3
figure 3

Network structures of the covariates in glass data: without classes distinguished

Fig. 4
figure 4

Network structures of the covariates in glass data: within each class

To measure the classification results in each class, we define the misclassification rate in class i to be

$$\begin{array}{@{}rcl@{}} \text{MIS}_{i} = \frac{1}{n_{i}} \sum \limits_{j=1}^{n} \mathbb{I} \left(y_{\cdot j} = i, \widehat{y}_{\cdot j} \neq i\right) \ \ \text{for} \ \ i=1,\cdots,I. \end{array} $$

The results obtained from SVM, LDA, KNN, XGBOOST, and the proposed methods are reported in Table 4. The misclassification rates of our proposed methods in each class are smaller than those of the other methods, and the LR-ClassGraph method yields the smallest misclassification rate in each class. Among the four compared methods, the SVM outperforms the other three methods.

Table 4 Classification results for glass data

Finally, we use criteria (22) and (23) to compare the overall performance of all the methods and summarize the results in Table 5. It is clear that both LR-HomoGraph and LR-ClassGraph produce higher values of the F, PRE and REC measures, under both micro and macro averaging, implying that our proposed methods perform better than the other multiclass classification methods considered here. In addition, we further implement the two methods in “Classification with predictor graphical structures accommodated” section by respectively extending models (12) and (17) to include the linear terms in each predictor; we denote these methods as LR-HomoGraph+main and LR-ClassGraph+main, respectively, and report the results in the last two columns of Table 5. Such an extension of the models, however, does not help increase the values of these measures.

Table 5 Overall performance of classification methods applied to glass data

Discussion

In this paper, we propose to use logistic regression methods to make predictions for data with network structures in predictors. In our methods, we first identify the network structures of the predictors for every class using graphical models, and then we capitalize on the identified network structures to fit a logistic regression model for classification and prediction. Simulation studies demonstrate that in the presence of network structures for covariates, our proposed methods produce more precise classification results than conventional methods, such as SVM, LDA, KNN, and XGBOOST. To allow interested readers to use the algorithms developed in “Classification with predictor graphical structures accommodated” section, the implementation procedures will be posted on CRAN.

Our development here focuses on examining pairwise dependence structures among predictors using the formulation (7). This is primarily driven by the consideration that such a dependence structure is intuitively interpretable and commonly exists in many problems. Extensions to accommodating triplewise or higher order dependence structures among predictors, or even the main effects (i.e., single-variable effects), can be carried out by extending (7) to the form (9.5) of Hastie et al. (2015). Such extensions are, in principle, straightforward to implement technically, but the issue of overfitting may arise. In addition, underlying constraints on the model parameters may become a complex concern in numerical implementation. Discussions on this aspect were given by many authors, including Yang et al. (2015), Yi (2017), and Yi et al. (2017). Our discussion in this paper is directed to using the exponential family distribution to accommodate continuous predictors. It is possible to extend our methods to accommodate mixed graphical models which feature both continuous and discrete predictors.

In obtaining the estimator (9), we use the L1-norm or the LASSO penalty, which is driven by its popularity as well as the availability of the implementation software packages (e.g., R packages huge and XMRF). However, the methods described in “Classification with predictor graphical structures accommodated” section are not just confined to the LASSO penalty. Our methods apply as well when other penalty functions are used. For instance, penalty functions, such as the elastic-net, SCAD, adaptive LASSO, L2-norm penalties can be used to replace the LASSO penalty in deriving the estimator (9); the remaining procedures developed in “Classification with predictor graphical structures accommodated” section still carry through. It will be interesting to conduct numerical studies for the use of different penalty functions to compare how results may differ with and without incorporating the network structure in the analysis, as noted by a referee. Though in this paper we are not able to exhaust numerical explorations for all possible penalty functions, the implementation framework presented in “Classification with predictor graphical structures accommodated” section allows the users to take any penalty functions that suit their own problems.

Finally, we comment that several aspects of the methods described in “Classification with predictor graphical structures accommodated” section warrant further research. As pointed out by a referee, our methods are developed for problems with low-dimensional data (i.e., p<n) and they are not applicable to sizable data with p≫n. In the current digital world, it is not uncommon to handle data with thousands of predictor variables but a much smaller sample size. In such circumstances, dimension reduction or feature screening techniques would be employed before proceeding with formal data analysis. It is interesting to generalize our methods to handle high-dimensional data with p being of a polynomial order of n, or even ultrahigh-dimensional data with p being of an exponential order of n.

Our methods basically involve two steps in using measurements for the covariates and class labels. In the first step, we utilize undirected graphs to examine the covariate measurements alone, and the class information only comes into play in the second step when using logistic regression for classification. Alternatively, one may consider using directed acyclic graphs to feature conditional independencies among variables and develop probabilistic graphical models for classification. To evaluate the performance of the proposed methods, we focus on the comparisons with the competing classifiers reviewed in “Evaluation of the performance” section. While those algorithms cover a good range of available classifiers, they are not exhaustive, or even far from being comprehensive, in comparisons. Despite the frequentist nature of our methods, it is interesting to compare the proposed methods to the Bayesian network classifiers which have proven useful in applications (e.g., Geiger and Heckerman 1996; Pérez et al. 2006; Bielza and Larrañaga 2014). Furthermore, it is worthwhile to employ rigorous hypothesis testing procedures to evaluate whether the differences in the results obtained from different classifiers are statistically significant.