1 Introduction

Epidemiological studies in a clustered or longitudinal data setting often generate multivariate (repeated) outcomes that are analyzed under the ubiquitous multivariate normal (MVN) assumptions on the random terms (random effects and within-subject random errors) via standard software, such as SAS or R. However, violations of those assumptions can lead to imprecise parameter estimates (Bandyopadhyay et al. 2010). These non-Gaussian features usually manifest as skewness and/or heavy tails of the response vector. Although achieving near-normality via suitable transformations of the responses (such as log, or Box–Cox) for standard linear mixed model (LMM) analysis is possible, such transformations may be avoided due to their non-universality and the difficulty of interpreting covariate effects on the original scale (Jara et al. 2008). To address this, various flexible (parametric) alternatives to the MVN density exist, such as the multivariate skew-normal density (Azzalini and Capitanio 1999; Gupta et al. 2004; Azzalini 2010), the heavy-tailed multivariate skew-t density (Azzalini and Capitanio 2003), and others, which can accommodate departures from normality without resorting to ad hoc data transformations.

In practice, this setup can be further complicated in the presence of multiple outcomes recorded at each cluster unit. The motivating data example in this paper comes from a clinical study of periodontal disease (PD) conducted on Gullah-speaking African-American Type-2 diabetics (henceforth, GAAD). Here, the multiple outcomes of interest are the tooth-level (mean) probed pocket depth (PPD) and clinical attachment level (CAL), which are recorded (in mm, via a periodontal probe) simultaneously for each tooth nested/clustered within a subject. While PPD quantifies the current PD status, CAL measures the (past) disease history and progression (Page and Eke 2007). An oral clinician may be interested in studying the joint evolution of these outcomes as functions of covariates, and the complexity is induced by two different sources of correlation: (a) between repeated observations of any given outcome (PPD or CAL) measured at a cluster unit (tooth), and (b) between the multiple outcomes (PPD and CAL) measured at the same tooth. The existing literature (both classical and Bayesian) on modeling multiple repeated outcomes is rich (Luo and Wang 2014; Verbeke et al. 2014; Lin and Wang 2013; Michaelis et al. 2018; Bandyopadhyay et al. 2010). However, the vast majority of these models are developed under the restrictive assumption of linearity of the covariate effects on the multivariate responses.

Fig. 1

GAAD Data: Plots of the empirical Bayes’ estimates of random effects (panels a and b), corresponding Q-Q plots (panels c and d), and observed versus estimated (non-linear) curve (panels e and f), obtained from fitting a linear mixed model separately to the PPD and CAL responses

To motivate further, consider Fig. 1, which presents plots of the empirical Bayes’ estimates of random effects (panels a and b), corresponding Q-Q plots (panels c and d), and observed versus estimated (non-linear) curves (panels e and f), obtained from fitting a LMM separately to the PPD and CAL responses in the GAAD data, using the lme function in R. The plots clearly reveal evidence of asymmetry (departures from the Gaussian assumptions), which cannot be explained by a standard LMM fit. In addition, a predictor space restricted to linear combinations of covariates may not provide an adequate picture of their cross-sectional association with the (bivariate) response. Formulating an index for PD (one that handles possible non-linearity, confounding, and interaction effects between the PD outcomes and the covariates) via a single-index model, or SIM (Hardle et al. 1993), can be a clinically elegant alternative. SIMs are a popular class of semiparametric regression models that relax the assumption of linearity and bypass the ‘curse of dimensionality’ by reducing the multi-dimensional predictor space \({\textbf {X}}\) to a univariate (scalar) index \(U = {\textbf {X}}^{T}\varvec{\beta }\). A link function g(.) then connects the covariate space to the response Y, offering a pragmatic compromise between a fully nonparametric (and often non-interpretable) multiple regression and a restrictive (parametric) linear regression. Here, the magnitude of the index coefficient \(\beta _j\) determines the relative importance of the j-th predictor in the index, and g(U) denotes the location of interest on the response curve at the index U. In biomedical research, the recent work by Wu and Tu (2016) develops an adiposity index via a (multivariate) SIM to efficiently predict multiple longitudinal outcomes (systolic and diastolic blood pressure) in children. However, their proposal adopts the usual MVN assumptions for the random terms (errors and effects), and may not accommodate heavy-tailed and other non-Gaussian features well. Furthermore, they did not provide a rigorous theoretical justification.
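As a toy illustration of the single-index idea (a sketch only, not the estimator developed in this paper: it assumes a hypothetical link \(g(u)=\sin (\pi u)\), a crude running-mean smoother in place of the spline step, and a grid search over the angle parametrizing a unit-norm \(\varvec{\beta }\) in two dimensions):

```python
import numpy as np

# toy SIM: y = g(X beta) + noise with unknown g and ||beta|| = 1
rng = np.random.default_rng(2)
beta = np.array([0.8, 0.6])                       # true (unit-norm) index direction
X = rng.uniform(-1, 1, size=(500, 2))
y = np.sin(np.pi * (X @ beta)) + 0.1 * rng.standard_normal(500)

def profile_rss(phi):
    """RSS after smoothing y against the index X beta(phi), beta = (cos phi, sin phi)."""
    b = np.array([np.cos(phi), np.sin(phi)])
    order = np.argsort(X @ b)                     # sort observations along the index
    yhat = np.convolve(y[order], np.ones(25) / 25, mode="same")  # running mean
    return np.sum((y[order] - yhat) ** 2)

phis = np.linspace(0.0, np.pi, 361)
best = phis[np.argmin([profile_rss(p) for p in phis])]
betahat = np.array([np.cos(best), np.sin(best)])
print(betahat)                                    # should point close to (0.8, 0.6)
```

Profiling out g for each candidate direction and minimizing the residual sum of squares recovers the index direction; the methods developed below replace the grid search and running mean with spline approximation and likelihood-based estimation.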

Considering Wu and Tu (2016) as our starting point, we seek to develop an index that can efficiently predict the clustered bivariate (PPD and CAL) PD outcomes. Such a clinical index linking both outcomes is largely absent from the oral health literature. Our bivariate single-index mixed (BV-SIM) model tackles non-Gaussian features in the responses via multivariate asymmetric Laplace density (ALD; Kotz et al. 2001) assumptions on the random terms. The multivariate ALD can accommodate asymmetric, peaked, and heavy-tailed data using fewer parameters than the popular multivariate skew-t density (Gupta 2003). The multivariate symmetric Laplace density (Naik and Plungpongpun 2006), a special case of the ALD, has been applied in other fields, such as speech clustering, classification problems, and image/signal analysis. Under this framework, we consider a polynomial spline approximation to the nonparametric index function, and propose an efficient EM-type algorithm for estimation and inference. The spline approximation and the normal mixture representation of the multivariate ALD yield a computationally efficient and intuitively appealing estimation setup, quantifying correlations from both sources.

The rest of the paper is organized as follows. In Sect. 2, we propose the BV-SIM model under the assumptions of a multivariate asymmetric Laplace density. Using the polynomial spline approximation for the nonparametric (index) functions, we derive the maximum likelihood (ML) estimate and establish the large sample properties of the proposed estimators in Sect. 3, with the detailed technical proofs relegated to the Appendix, where we use the projection method to prove the asymptotic normality of the parametric part. In Sect. 4, we develop an efficient ML estimation procedure based on the EM algorithm. Simulation studies comparing the finite sample performance of our approach to other alternatives appear in Sect. 5, while Sect. 6 illustrates the method via application to the PD dataset. Finally, some concluding remarks are presented in Sect. 7.

2 Statistical Model

We begin with a sketch of the multivariate asymmetric Laplace density and its shifted extension (Kotz et al. 2001), and then develop our SIM mixed-effects framework for bivariate clustered data. The multivariate ALD has the density

$$\begin{aligned} p({\textbf {y}};\varvec{\Sigma },\varvec{\gamma })=\frac{2\exp \{{\textbf {y}}{}^{\textrm{T}}\varvec{\Sigma }^{-1}\varvec{\gamma }\}}{(2\pi )^{d/2}|\varvec{\Sigma }|^{1/2}}\times \left( \frac{{\textbf {y}}{}^{\textrm{T}}\varvec{\Sigma }^{-1}{\textbf {y}}}{2+\varvec{\gamma }{}^{\textrm{T}}\varvec{\Sigma }^{-1}\varvec{\gamma }}\right) ^{\nu /2}K_{\nu }(u), \end{aligned}$$
(2.1)

where \(K_\nu\) is the modified Bessel function of the third kind with index \(\nu\), \(\nu =(2-d)/2\), \(u=\sqrt{(2+\varvec{\gamma }{}^{\textrm{T}}\varvec{\Sigma }^{-1}\varvec{\gamma })({\textbf {y}}{}^{\textrm{T}}\varvec{\Sigma }^{-1}{\textbf {y}})}\), \(\varvec{\gamma }\in \mathbb {R}^d\) is a skewness parameter and \(\varvec{\Sigma }\) is a positive definite (p.d.) scatter matrix with dimension \(d\times d\). We denote (2.1) as \(\textrm{ALD}_d(\varvec{\Sigma },\varvec{\gamma })\). Note, the ALD forces each component density to be joined at the same origin. An extension, the multivariate shifted asymmetric Laplace distribution (SALD; Kotz et al. 2001), has the form

$$\begin{aligned} p({\textbf {y}};\varvec{ \mu }, \varvec{\Sigma },\varvec{\gamma })=\frac{2\exp \{({\textbf {y}}-\varvec{ \mu }){}^{\textrm{T}}\varvec{\Sigma }^{-1}\varvec{\gamma }\}}{(2\pi )^{d/2}|\varvec{\Sigma }|^{1/2}}\times \left( \frac{\delta ({\textbf {y}},\varvec{ \mu },\varvec{\Sigma })}{2+\varvec{\gamma }{}^{\textrm{T}}\varvec{\Sigma }^{-1}\varvec{\gamma }}\right) ^{\nu /2}K_{\nu }(u), \end{aligned}$$
(2.2)

where \(u=\sqrt{(2+\varvec{\gamma }{}^{\textrm{T}}\varvec{\Sigma }^{-1}\varvec{\gamma })\delta ({\textbf {y}},\varvec{ \mu },\varvec{\Sigma })}\), \(\delta ({\textbf {y}},\varvec{ \mu },\varvec{\Sigma })=({\textbf {y}}-\varvec{ \mu }){}^{\textrm{T}}\varvec{\Sigma }^{-1}({\textbf {y}}-\varvec{ \mu })\), and \(\nu ,\varvec{\gamma },\varvec{\Sigma }\) are as defined in (2.1). Here, we write \({\textbf {Y}}\sim \textrm{SAL}_d(\varvec{ \mu },\varvec{\Sigma },\varvec{\gamma })\) to denote a random vector \({\textbf {Y}}\) following the d-dimensional SALD. After some calculation, the mean and variance of the SALD are given by

$$\begin{aligned}\textrm{E}({\textbf {Y}})=\varvec{ \mu }+\varvec{\gamma }\ \ \textrm{and} \ \ \textrm{Var}({\textbf {Y}})=\varvec{\Sigma }+\varvec{\gamma }\varvec{\gamma }{}^{\textrm{T}}.\end{aligned}$$

It is clear that the mean depends on the shifted location parameter \(\varvec{ \mu }\) and skewness parameter \(\varvec{\gamma }\), while the variance depends on the scatter matrix \(\varvec{\Sigma }\) and skewness parameter \(\varvec{\gamma }\). Also, \(\varvec{\Sigma }+\varvec{\gamma }\varvec{\gamma }{}^{\textrm{T}}\) is p.d. whenever \(\varvec{\Sigma }\) is p.d. The parameter \(\varvec{\gamma }\) plays an important role in multivariate asymmetric data analysis, besides the location \(\varvec{ \mu }\) and scatter matrix \(\varvec{\Sigma }\). Note, the multivariate density in (2.2) reduces to (2.1) when \(\varvec{ \mu }=\varvec{0}\), and it further reduces to the multivariate symmetric Laplace distribution (Eltoft et al. 2006) when \(\varvec{\gamma }=\varvec{0}\). Moreover, (2.2) reduces to the univariate ALD when the dimension \(d=1\), \(\gamma =(1-2\tau )/\tau (1-\tau )\) and \(\varvec{\Sigma }_{1\times 1}=2/\tau (1-\tau )\); this density is popularly used in the likelihood framework for quantile regression, with density \(p(y)=\tau (1-\tau )\exp \{-\rho _\tau (y-\mu )\}\), where \(\rho _\tau (u)=u(\tau -I(u<0))\). The SALD in (2.2) has the following stochastic representation

$$\begin{aligned} {\textbf {Y}}=\varvec{ \mu }+V \varvec{\gamma }+\sqrt{V}{\textbf {Z}}, \end{aligned}$$
(2.3)

where V is a random variable from an exponential distribution with mean 1 and \({\textbf {Z}}\sim \textrm{N}_d(0,\varvec{\Sigma })\) is generated independent of V. Using Bayes’s theorem, the density of V given \({\textbf {Y}}={\textbf {y}}\) is generalized inverse Gaussian, with the density

$$\begin{aligned} p_V(v|{\textbf {Y}}={\textbf {y}})=\frac{v^{\nu -1}}{2K_\nu (u)}\left( \frac{\delta ({\textbf {y}}, \varvec{ \mu },\varvec{\Sigma })}{2+\varvec{\gamma }{}^{\textrm{T}}\varvec{\Sigma }^{-1}\varvec{\gamma }}\right) ^{-\nu /2} \exp \left\{ -\frac{1}{2v}\delta ({\textbf {y}}, \varvec{ \mu },\varvec{\Sigma })-\frac{v}{2}(2+\varvec{\gamma }{}^{\textrm{T}}\varvec{\Sigma }^{-1}\varvec{\gamma })\right\} , \end{aligned}$$
(2.4)

where \(\nu , \varvec{\gamma }, \varvec{ \mu }, \varvec{\Sigma }, \delta ({\textbf {y}}, \varvec{ \mu },\varvec{\Sigma })\) and u are as defined in (2.2). The SALD allows for peakedness, heavy tails, and skewness, and hence provides more flexibility in modeling multivariate data with non-Gaussian features. More properties, extensions, and applications of the SALD appear in Kozubowski and Podgórski (2001), Franczak et al. (2014), and Bouveyron and Brunet-Saumard (2014).
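Both the stochastic representation (2.3) and the density formula (2.2) can be checked numerically. The sketch below (Python, with illustrative parameter values) simulates from \(\textrm{SAL}_2\) via (2.3), compares sample moments against \(\varvec{ \mu }+\varvec{\gamma }\) and \(\varvec{\Sigma }+\varvec{\gamma }\varvec{\gamma }{}^{\textrm{T}}\), and verifies that the \(d=1\) density integrates to one:

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import kv

def rsald(n, mu, Sigma, gamma, rng):
    """Draw n samples from SAL_d(mu, Sigma, gamma) via (2.3): Y = mu + V*gamma + sqrt(V)*Z."""
    d = len(mu)
    V = rng.exponential(1.0, size=n)                          # V ~ Exp(1)
    Z = rng.multivariate_normal(np.zeros(d), Sigma, size=n)   # Z ~ N_d(0, Sigma), indep. of V
    return mu + V[:, None] * gamma + np.sqrt(V)[:, None] * Z

def sald_pdf(y, mu, Sigma, gamma):
    """SAL_d density (2.2) at a single point y, with nu = (2 - d)/2."""
    d = len(mu)
    nu = (2.0 - d) / 2.0
    Sinv = np.linalg.inv(Sigma)
    z = np.asarray(y, float) - mu
    delta = z @ Sinv @ z
    a = 2.0 + gamma @ Sinv @ gamma
    const = 2.0 * np.exp(z @ Sinv @ gamma) / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma)))
    return const * (delta / a) ** (nu / 2) * kv(nu, np.sqrt(a * delta))

rng = np.random.default_rng(0)
mu = np.array([1.0, -0.5]); gamma = np.array([0.8, -0.4])
Sigma = np.array([[1.0, 0.3], [0.3, 0.5]])
Y = rsald(200_000, mu, Sigma, gamma, rng)
print(Y.mean(axis=0), np.cov(Y.T))    # approx mu + gamma and Sigma + gamma gamma^T

# for d = 1 the density (2.2) should integrate to one
m1, S1, gam1 = np.array([0.5]), np.array([[1.0]]), np.array([0.5])
total = quad(lambda y: sald_pdf([y], m1, S1, gam1), -40, 40, points=[0.5])[0]
print(total)
```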

2.1 Single-Index Mixed-Effects Model

Let \({\textbf {y}}_{ij}=(y_{ij}^{(1)}, y_{ij}^{(2)}){}^{\textrm{T}}\) be the observed values of two response variables (here, mean PPD and CAL) for the ith subject at the jth location (here, tooth), where \(i=1,\ldots ,n\) and \(j=1,\ldots ,m_i\). We assume

$$\begin{aligned} \left\{ \begin{array}{l} {\textbf {y}}_{ij}=\widetilde{\varvec{ \mu }}_{ij}+\varvec{\epsilon }_{ij}, \ \ \widetilde{\varvec{ \mu }}_{ij}=({\widetilde{\mu }}_{ij}^{(1)},{\widetilde{\mu }}_{ij}^{(2)}){}^{\textrm{T}},\\ \\ {\widetilde{\mu }}_{ij}^{(1)}=g_1(({\textbf {x}}_{ij}^{(1)}){}^{\textrm{T}}\varvec{\beta }_1)+({\textbf {z}}_{ij}^{(1)}){}^{\textrm{T}}{\textbf {b}}_{i1}, \ \ {\widetilde{\mu }}_{ij}^{(2)}=g_2(({\textbf {x}}_{ij}^{(2)}){}^{\textrm{T}}\varvec{\beta }_2)+({\textbf {z}}_{ij}^{(2)}){}^{\textrm{T}}{\textbf {b}}_{i2},\\ \\ \varvec{\epsilon }_{ij} \sim {\textrm{SAL}}_2 (\varvec{0},\varvec{\Sigma },\varvec{\gamma }), i.i.d. \ \ \forall \ \ i, j, \end{array} \right. \end{aligned}$$
(2.5)

where \(g_1\) and \(g_2\) are two unknown nonparametric functions, \({\textbf {x}}_{ij}^{(1)}=(x_{ij1}^{(1)},\ldots ,x_{ijp_1}^{(1)}){}^{\textrm{T}}\), \({\textbf {x}}_{ij}^{(2)}=(x_{ij1}^{(2)},\ldots ,x_{ijp_2}^{(2)}){}^{\textrm{T}}\), and \({\textbf {z}}_{ij}^{(1)}=(1,z_{ij1}^{(1)},\ldots ,z_{ijq_1}^{(1)}){}^{\textrm{T}}\), \({\textbf {z}}_{ij}^{(2)}=(1,z_{ij1}^{(2)},\ldots ,z_{ijq_2}^{(2)}){}^{\textrm{T}}\), \(\varvec{\beta }_k \in \mathbb {R}^{p_k}\) and \({\textbf {b}}_{ik}\in \mathbb {R}^{q_k+1}\) are the (fixed) index coefficients and random effects for the k-th response (\(k=1\) or 2), \(\varvec{\gamma }\) is a \(2\times 1\) vector of skewness parameters, and \(\varvec{\Sigma }\) is the \(2\times 2\) scatter matrix for the random error \(\varvec{\epsilon }\). To accommodate a robust specification, we also assume the random effects \({\textbf {b}}_i=({\textbf {b}}_{i1}{}^{\textrm{T}},{\textbf {b}}_{i2}{}^{\textrm{T}}){}^{\textrm{T}}\sim \textrm{SAL}_{(q_1+q_2+2)}({\varvec{0}}, \varvec{\Omega }, \varvec{0})\), where \(\varvec{\Omega }\) is an unstructured covariance matrix of dimension \((q_1+q_2+2)\times (q_1+q_2+2)\). Note, \(\varvec{\Omega }\) carries information on both the within-response clustering correlation, found on the two diagonal sub-blocks of dimensions \((q_1+1)\times (q_1+1)\) and \((q_2+1)\times (q_2+1)\), and the cross-correlations between responses, found on the off-diagonal sub-blocks. In addition, we further assume the joint density of \((\varvec{\epsilon }_{ij}{}^{\textrm{T}}, {\textbf {b}}_i{}^{\textrm{T}}){}^{\textrm{T}}\) is \(\textrm{SAL}_{(q_1+q_2+4)}(\varvec{0}_{(q_1+q_2+4)}, \textrm{blockdiag}(\varvec{\Sigma }, \varvec{\Omega }), (\varvec{\gamma }{}^{\textrm{T}},\varvec{0}{}^{\textrm{T}}_{q_1+q_2+2}){}^{\textrm{T}})\). We call model (2.5) the single-index mixed-effects (SIME) model for bivariate clustered data.
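A minimal data-generating sketch for model (2.5) (random intercepts only, hypothetical index functions \(g_1=\sin\) and \(g_2=\cos\), and illustrative parameter values) draws a shared subject-level mixing variable \(V_i\sim \textrm{E}(1)\) and then the random effects and errors conditionally normal, reflecting the joint SAL assumption on \((\varvec{\epsilon }_{ij}{}^{\textrm{T}}, {\textbf {b}}_i{}^{\textrm{T}}){}^{\textrm{T}}\) via the representation (2.3):

```python
import numpy as np

def simulate_sime(n, m, beta1, beta2, g1, g2, Sigma, gamma, Omega, rng):
    """Simulate bivariate clustered data from model (2.5), random intercepts only.
    A single mixing variable V_i per subject is shared by the random effects and
    all within-subject errors, so that (eps_ij, b_i) are jointly SAL."""
    p1, p2 = len(beta1), len(beta2)
    data = []
    for i in range(n):
        V = rng.exponential(1.0)                                        # V_i ~ Exp(1)
        b = np.sqrt(V) * rng.multivariate_normal(np.zeros(2), Omega)    # b_i | V_i ~ N(0, V_i Omega)
        X1 = rng.uniform(-1, 1, size=(m, p1))
        X2 = rng.uniform(-1, 1, size=(m, p2))
        for j in range(m):
            mean = np.array([g1(X1[j] @ beta1) + b[0],
                             g2(X2[j] @ beta2) + b[1]])
            eps = V * gamma + np.sqrt(V) * rng.multivariate_normal(np.zeros(2), Sigma)
            data.append((i, j, mean + eps))
    return data

rng = np.random.default_rng(1)
beta1 = np.array([0.8, 0.6]); beta2 = np.array([0.6, 0.8])   # unit-norm index coefficients
Sigma = np.array([[1.0, 0.2], [0.2, 1.0]])
gamma = np.array([0.5, -0.3])
Omega = np.array([[0.5, 0.1], [0.1, 0.5]])
data = simulate_sime(n=100, m=10, beta1=beta1, beta2=beta2,
                     g1=np.sin, g2=np.cos, Sigma=Sigma, gamma=gamma, Omega=Omega, rng=rng)
print(len(data), data[0][2].shape)
```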

For identifiability, we assume that \(\Vert \varvec{\beta }_1\Vert =1\) and \(\Vert \varvec{\beta }_2\Vert =1\), and that the first component of each is positive. In this paper, the popular “delete one component” method is used to avoid the equality constraints (Yu and Ruppert 2002; Cui et al. 2011). Specifically, we write \(\varvec{\beta }_1=((1-\Vert \varvec{\beta }_1^{(-1)}\Vert ^2)^{1/2},\beta _{12},\ldots ,\beta _{1p_1}){}^{\textrm{T}}\), where \(\varvec{\beta }_1^{(-1)}=(\beta _{12},\ldots ,\beta _{1p_1}){}^{\textrm{T}}\). Under this parametrization, \(\varvec{\beta }_1\) is a smooth deterministic function of \(\varvec{\beta }_1^{(-1)}\), with its Jacobian matrix given by

$$\begin{aligned} {\textbf {J}}_1=\frac{\partial \varvec{\beta }_1}{\partial \varvec{\beta }_1^{(-1)}}=\left( \begin{array}{c} -\frac{(\varvec{\beta }_1^{(-1)}){}^{\textrm{T}}}{(1-\Vert \varvec{\beta }_1^{(-1)}\Vert ^2)^{1/2}}\\ {\textbf {I}}_{p_1-1} \end{array} \right) , \end{aligned}$$

where \({\textbf {I}}_{p_1-1}\) is the identity matrix with \(p_1-1\) rows/columns. The true parameter \(\varvec{\beta }_1^{(-1)}\) satisfies the constraint \(\Vert \varvec{\beta }_1^{(-1)}\Vert < 1\), which implies that it is an interior point of the unit ball in \(\mathbb {R}^{p_1-1}\). Therefore, \(\varvec{\beta }_1\) is infinitely differentiable in a neighborhood of \(\varvec{\beta }_1^{(-1)}\). Similarly, we define \(\varvec{\beta }_2^{(-1)}\) and \({\textbf {J}}_2\), and let \(\varvec{\beta }^{(-1)}=((\varvec{\beta }_1^{(-1)}){}^{\textrm{T}},(\varvec{\beta }_2^{(-1)}){}^{\textrm{T}}){}^{\textrm{T}}\) and \({\textbf {J}}=\textrm{blockdiag}({\textbf {J}}_1,{\textbf {J}}_2)\). Applying the stochastic representation in (2.3), model (2.5) admits the following hierarchical structure:

$$\begin{aligned} \left\{ \begin{array}{l} {\textbf {y}}_i|{\textbf {b}}_i,V_i \ \sim \ \textrm{N}_{2 m_i}({\widetilde{\varvec{ \mu }}}_i+V_i(\varvec{1}_{m_i}\otimes \varvec{\gamma }), V_i \varvec{\Lambda }_i),\\ \\ {\textbf {b}}_i|V_i \sim \textrm{N}_{q_1+q_2+2}({\varvec{0}}, V_i\varvec{\Omega }), \ \ V_i \sim \textrm{E} (1), \end{array}\right. \end{aligned}$$
(2.6)

where \({\textbf {y}}_i=({\textbf {y}}_{i1}{}^{\textrm{T}},\ldots ,{\textbf {y}}_{im_i}{}^{\textrm{T}}){}^{\textrm{T}}\), \({\widetilde{\varvec{ \mu }}}_i=({\widetilde{\varvec{ \mu }}}_{i1}{}^{\textrm{T}},\ldots ,{\widetilde{\varvec{ \mu }}}_{im_i}{}^{\textrm{T}}){}^{\textrm{T}}\), \(\textrm{E}\) denotes the exponential distribution, \(\varvec{\Lambda }_i={\textbf {I}}_{m_i}\otimes \varvec{\Sigma }\), where \(\otimes\) denotes the Kronecker product, and \({\varvec{1}}_{m_i}\) is the \(m_i\times 1\) vector of ones. From (2.5) and (2.6), it is clear that, conditional on \(V_i\), \(\varvec{\epsilon }_{ij}\) and \({\textbf {b}}_i\) are independent. Integrating out \({\textbf {b}}_i\) in (2.6), we have the following hierarchical model

$$\begin{aligned} {\textbf {y}}_i|V_i \sim \textrm{N}_{2 m_i} ({\varvec{ \mu }}_i+V_i(\varvec{1}_{m_i}\otimes \varvec{\gamma }), V_i{\textbf {G}}_i), \ \ V_i\sim \textrm{E}(1), \end{aligned}$$
(2.7)

where \(\varvec{ \mu }_i=((\varvec{ \mu }_{i1}){}^{\textrm{T}},\ldots ,(\varvec{ \mu }_{im_i}){}^{\textrm{T}}){}^{\textrm{T}}\) with \(\varvec{ \mu }_{ij}=(g_1(({\textbf {x}}_{ij}^{(1)}){}^{\textrm{T}}\varvec{\beta }_1), g_2(({\textbf {x}}_{ij}^{(2)}){}^{\textrm{T}}\varvec{\beta }_2)){}^{\textrm{T}}\), \({\textbf {Z}}_i=({\textbf {Z}}_{i1},\ldots ,{\textbf {Z}}_{im_i})\), \({\textbf {Z}}_{ij}=\textrm{blockdiag}({\textbf {z}}_{ij}^{(1)},{\textbf {z}}_{ij}^{(2)})\), \({\textbf {G}}_i={\textbf {Z}}_i{}^{\textrm{T}}\varvec{\Omega }{\textbf {Z}}_i+{\varvec{\Lambda }}_i\). Moreover, it follows from (2.7) that the \({\textbf {y}}_i\) are independent and marginally distributed as

$$\begin{aligned} {\textbf {y}}_i \sim \textrm{SALD}_{2 m_i} ({\varvec{ \mu }}_i, {\textbf {G}}_i, \varvec{\gamma }_i^*), \ \ i=1,\ldots ,n, \end{aligned}$$
(2.8)

where \(\varvec{\gamma }_i^*=\varvec{1}_{m_i}\otimes \varvec{\gamma }\). From (2.7) and by the properties of the generalized inverse Gaussian distribution in (2.4), we have

$$\begin{aligned} \mathbb {E}(V_i|{\textbf {y}}_i)=\sqrt{\frac{b_i}{a_i}}R_\nu (\sqrt{a_ib_i}) \ \ \textrm{and} \ \ \mathbb {E}(V_i^{-1}|{\textbf {y}}_i)=\sqrt{\frac{a_i}{b_i}}R_\nu (\sqrt{a_ib_i})-\frac{2\nu }{b_i}, \end{aligned}$$
(2.9)

where \(a_i=2+(\varvec{\gamma }_i^*){}^{\textrm{T}}{\textbf {G}}_i^{-1}\varvec{\gamma }_i^*\), \(b_i=({\textbf {y}}_i-\varvec{ \mu }_i){}^{\textrm{T}}{\textbf {G}}_i^{-1}({\textbf {y}}_i-\varvec{ \mu }_i)\), \(R_\nu (u)=K_{\nu +1}(u)/K_\nu (u)\) and \(\nu =1-m_i\).
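The conditional expectations in (2.9) drive the E-step of the EM-type algorithm of Sect. 4. A numerical sketch (with illustrative values of \(\nu , a_i, b_i\)) evaluates them via the modified Bessel function and cross-checks against direct integration of the generalized inverse Gaussian density (2.4):

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import kv

def estep_weights(nu, a, b):
    """Conditional moments (2.9): E(V_i | y_i) and E(V_i^{-1} | y_i)."""
    s = np.sqrt(a * b)
    R = kv(nu + 1.0, s) / kv(nu, s)     # R_nu(u) = K_{nu+1}(u) / K_nu(u)
    return np.sqrt(b / a) * R, np.sqrt(a / b) * R - 2.0 * nu / b

# hypothetical values: m_i = 3 locations, so the response is 6-dimensional and nu = 1 - m_i = -2
nu, a, b = -2.0, 3.5, 1.7
EV, EVinv = estep_weights(nu, a, b)

# cross-check against direct integration of the generalized inverse Gaussian density (2.4)
dens = lambda v: v ** (nu - 1.0) * np.exp(-b / (2.0 * v) - a * v / 2.0)
Z = quad(dens, 0.0, np.inf)[0]
EV_num = quad(lambda v: v * dens(v), 0.0, np.inf)[0] / Z
EVinv_num = quad(lambda v: dens(v) / v, 0.0, np.inf)[0] / Z
print(EV, EV_num, EVinv, EVinv_num)
```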

2.2 Modeling the Index Functions

Since the two functions \(g_1\) and \(g_2\) in (2.5) are unknown, we use polynomial splines to approximate them in the subsequent ML estimation. Polynomial splines are simple yet practical tools, offering computational tractability and statistical efficiency, and have proven to be a powerful method for smoothing.

For simplicity, we assume that the covariates \({\textbf {x}}_{ij}^{(1)}\) and \({\textbf {x}}_{ij}^{(2)}\) are bounded and that the supports of \(({\textbf {x}}^{(1)}){}^{\textrm{T}}\varvec{\beta }_{01}\) and \(({\textbf {x}}^{(2)}){}^{\textrm{T}}\varvec{\beta }_{02}\) are contained in a finite interval [a, b]. Such a compactness assumption is almost always used in nonparametric regression with spline approximation. Let \(t_0=a<t_1<\cdots<t_{K'}<b=t_{K'+1}\) be a partition of [a, b] into subintervals \([t_k,t_{k+1}),k=0,\ldots ,K'\) with \(K'\) internal knots. A polynomial spline of order d is a function whose restriction to each subinterval is a polynomial of degree \(d-1\) and which is globally \(d-2\) times continuously differentiable on [a, b]. The collection of splines with a fixed sequence of knots has a B-spline basis \(\{B_{1}(x),\ldots ,B_{K}(x)\}\), with \({K}=K'+d\). We assume the B-spline basis is normalized to have \(\sum _{k=1}^KB_k(x)=\sqrt{K}\), although any scaling can be used without changing the theoretical results.

Let \({\textbf {B}}_1(\cdot )=(B_1(\cdot ),\ldots ,B_{K_1}(\cdot )){}^{\textrm{T}}\) and \({\textbf {B}}_2(\cdot )=(B_1(\cdot ),\ldots ,B_{K_2}(\cdot )){}^{\textrm{T}}\), where \(K_1=K_1'+d\) and \(K_2=K_2'+d\) with number of knots \(K'_1\) and \(K'_2\) for \(g_1\) and \(g_2\). Then, we have \(g_k(\cdot )\approx {\textbf {B}}_k{}^{\textrm{T}}(\cdot )\varvec{\theta }_k, k=1,2\) where \(\varvec{\theta }_k=(\theta _{k1},\ldots , \theta _{kK_k}){}^{\textrm{T}}, k=1,2\). As a result, we can write

$$\begin{aligned} \mu _{ij}^{(1)}\approx {\textbf {B}}_1{}^{\textrm{T}}(({\textbf {x}}_{ij}^{(1)}){}^{\textrm{T}}\varvec{\beta }_1)\varvec{\theta }_1 \ \ \textrm{and} \ \ \mu _{ij}^{(2)}\approx {\textbf {B}}_2{}^{\textrm{T}}(({\textbf {x}}_{ij}^{(2)}){}^{\textrm{T}}\varvec{\beta }_2)\varvec{\theta }_2 \end{aligned}$$
(2.10)

for \(i=1,\ldots ,n, j=1,\ldots ,m_i\). By letting the number of knots increase with the sample size at an appropriate rate, the spline estimate of the unknown function can achieve the optimal nonparametric convergence rate.
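The approximation (2.10) can be illustrated numerically. The sketch below uses the standard partition-of-unity B-spline scaling (rather than the \(\sqrt{K}\) normalization, which, as noted above, is immaterial) and a hypothetical smooth index function; it fits least-squares spline coefficients and shows the sup-norm approximation error shrinking as the number of knots grows, in the spirit of (3.1):

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_design(u, inner_knots, a, b, order=4):
    """Design matrix of the K = K' + d B-spline basis functions of order d on [a, b]."""
    t = np.r_[np.full(order, a), inner_knots, np.full(order, b)]  # clamped knot sequence
    K = len(inner_knots) + order
    return BSpline(t, np.eye(K), order - 1)(u)                    # shape (len(u), K)

a, b = 0.0, 1.0
g = lambda u: np.sin(2.0 * np.pi * u)          # hypothetical smooth index function
u = np.linspace(a, b, 400)
errs = []
for Kp in (2, 8, 32):                          # K' interior knots, equally spaced
    B = bspline_design(u, np.linspace(a, b, Kp + 2)[1:-1], a, b)
    theta, *_ = np.linalg.lstsq(B, g(u), rcond=None)
    errs.append(np.max(np.abs(B @ theta - g(u))))   # sup-norm error
print(errs)                                    # decreasing in the number of knots
```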

3 Theoretical Properties

In this section, we investigate the theoretical properties of the index parameters and the index functions. In the following, we establish the large sample properties based on the marginal distribution (2.8) of the proposed BV-SIM model in (2.5). For simplicity, we assume \(m_i\equiv m\), with the responses viewed as i.i.d. data, \({\textbf {y}}_i \sim \textrm{SALD}_{2m} ({\varvec{ \mu }}_i, {\textbf {G}}_i, \varvec{\gamma }^*), \ \ i=1,\ldots ,n\). In (2.8), \(\varvec{\gamma }^*=\varvec{1}_m\otimes \varvec{\gamma }\) and \({\textbf {G}}_i={\textbf {Z}}_i{}^{\textrm{T}}\varvec{\Omega }{\textbf {Z}}_i+{\varvec{\Lambda }}\), with \(\varvec{\Lambda }={\textbf {I}}_{m}\otimes \varvec{\Sigma }\). We first introduce some notation.

Let \(\varvec{\beta }_{01}\) and \(\varvec{\beta }_{02}\) be the true index parameters, and \(g_{01}\) and \(g_{02}\) the corresponding true index functions. Let \(\varvec{\beta }_0=(\varvec{\beta }_{01}{}^{\textrm{T}},\varvec{\beta }_{02}{}^{\textrm{T}}){}^{\textrm{T}}\), \(\varvec{\beta }_0^{(-1)}=((\varvec{\beta }_{01}^{(-1)}){}^{\textrm{T}},(\varvec{\beta }_{02}^{(-1)}){}^{\textrm{T}}){}^{\textrm{T}}\), and \(\varvec{ \mu }_i^0=((\varvec{ \mu }_{i1}^0){}^{\textrm{T}},\ldots ,(\varvec{ \mu }_{im_i}^0){}^{\textrm{T}}){}^{\textrm{T}}\) with \(\varvec{ \mu }_{ij}^0=(g_{01}(({\textbf {x}}_{ij}^{(1)}){}^{\textrm{T}}\varvec{\beta }_{01}), g_{02}(({\textbf {x}}_{ij}^{(2)}){}^{\textrm{T}}\varvec{\beta }_{02})){}^{\textrm{T}}\). Denote the support of \(\{{\textbf {X}}_{i}{}^{\textrm{T}}\varvec{\beta }_0\}\) as [a, b], where \(a=\min _{i} \{ {\textbf {X}}_{i}{}^{\textrm{T}}\varvec{\beta }_0\}\) and \(b=\max _{i} \{ {\textbf {X}}_{i}{}^{\textrm{T}}\varvec{\beta }_0\}\), and \({\textbf {X}}_i=({\textbf {X}}_{i1},\ldots ,{\textbf {X}}_{im_i})\) with \({\textbf {X}}_{ij}=\textrm{blockdiag}({\textbf {x}}_{ij}^{(1)},{\textbf {x}}_{ij}^{(2)})\). Let \(\mathcal {H}_s\) be the collection of all functions on the support [a, b] whose l-th order derivative satisfies the Hölder condition of order r, with \(s=l+r\); that is, for each \(g \in \mathcal {H}_s\), there exists a positive constant \(C_0\) such that \(|g^{(l)}(u)-g^{(l)}(v)|\le C_0|u-v|^r, \ \ \forall u, v \in [a,b]\). From De Boor (2001, page 149), there exists a constant C such that

$$\begin{aligned} \sup _{u\in [a,b]}|g_k(u)-{\textbf {B}}_k^T(u)\varvec{\theta }_{0k}|\le C K_k^{-s}, \end{aligned}$$
(3.1)

if \(g_k \in \mathcal {H}_s\), where \(\varvec{\theta }_{0k}=(\theta _{0k1},\ldots ,\theta _{0kK_k})^T, k=1,2\), are the true values of the spline coefficients, which can be viewed as the best-approximation coefficient vectors for \(g_k\).

Let \(\varvec{\delta }=(\varvec{\gamma }{}^{\textrm{T}},\textrm{vech}(\varvec{\Omega }){}^{\textrm{T}},\textrm{vech}(\varvec{\Sigma }){}^{\textrm{T}}){}^{\textrm{T}}\), and let \(\varvec{\Theta }\) denote the parameter space of \(\varvec{\zeta }=(\varvec{\beta }{}^{\textrm{T}},\varvec{\theta }{}^{\textrm{T}},\varvec{\delta }{}^{\textrm{T}}){}^{\textrm{T}}\). Given the covariates \({\textbf {X}}_i\) and \({\textbf {Z}}_i\), let \(\ell _m(\varvec{ \mu }_i,\varvec{\delta }, {\textbf {y}}_i)\) be the log-likelihood of the marginal distribution for the response \({\textbf {y}}_i\) in (2.8), and let \(\ell _m(\varvec{\zeta }, {\textbf {y}}_i)\triangleq \ell _m({\textbf {W}}_i{}^{\textrm{T}}({\textbf {X}}_i{}^{\textrm{T}}\varvec{\beta })\varvec{\theta },\varvec{\delta }, {\textbf {y}}_i)\) be the corresponding spline-approximated log-likelihood. Let \(\varvec{\delta }_0\) be the true value of \(\varvec{\delta }\) and \(\varvec{\theta }_0=(\varvec{\theta }_{01}{}^{\textrm{T}},\varvec{\theta }_{02}{}^{\textrm{T}}){}^{\textrm{T}}\). Define the MLE \({\widehat{\varvec{\zeta }}}=({\widehat{\varvec{\beta }}}{}^{\textrm{T}},{\widehat{\varvec{\theta }}}{}^{\textrm{T}},{\widehat{\varvec{\delta }}}{}^{\textrm{T}}){}^{\textrm{T}}\), given by

$$\begin{aligned} {\widehat{\varvec{\zeta }}}=\textrm{argmax}_{\varvec{\zeta }}\sum _{i=1}^n \ell _m({\textbf {W}}_i{}^{\textrm{T}}({\textbf {X}}_i{}^{\textrm{T}}\varvec{\beta })\varvec{\theta },\varvec{\delta }, {\textbf {y}}_i), \end{aligned}$$
(3.2)

where \({\textbf {W}}_i({\textbf {X}}_i{}^{\textrm{T}}\varvec{\beta })=({\textbf {W}}_{i1},\ldots ,{\textbf {W}}_{im_i})\), \({\textbf {W}}_{ij}=\textrm{blockdiag}({\textbf {B}}_{ij}^{(1)}, {\textbf {B}}_{ij}^{(2)})\) with \({\textbf {B}}_{ij}^{(k)}={\textbf {B}}_k(({\textbf {x}}_{ij}^{(k)}){}^{\textrm{T}}\varvec{\beta }_k), k=1,2\). Define the space of square integrable single-index functions \(\mathcal {G}=\{\textbf{g}: \mathbb {E}\Vert \textbf{g}({\textbf {X}}_i{}^{\textrm{T}}\varvec{\beta }_0)\Vert ^2<\infty \}\), where \(\textbf{g}({\textbf {X}}_i{}^{\textrm{T}}\varvec{\beta })=(\textbf{g}{}^{\textrm{T}}({\textbf {X}}_{i1}{}^{\textrm{T}}\varvec{\beta }),\ldots ,\textbf{g}{}^{\textrm{T}}({\textbf {X}}_{im_i}{}^{\textrm{T}}\varvec{\beta })){}^{\textrm{T}}\) with \(\textbf{g}({\textbf {X}}_{ij}{}^{\textrm{T}}\varvec{\beta })=(g_1(({\textbf {x}}_{ij}^{(1)}){}^{\textrm{T}}\varvec{\beta }_{1}), g_2(({\textbf {x}}_{ij}^{(2)}){}^{\textrm{T}}\varvec{\beta }_{2})){}^{\textrm{T}}\). Denote \({\textbf {C}}_i(\varvec{ \mu }_i,\varvec{\delta })=-\partial ^2 \ell _m(\varvec{ \mu }_i,\varvec{\delta },{\textbf {y}}_i)/\partial \varvec{ \mu }_i\partial \varvec{ \mu }_i{}^{\textrm{T}}\) and \({\textbf {C}}_i^0={\textbf {C}}_i(\varvec{ \mu }_{i}^0,\varvec{\delta }_0)\). Then, the projection of a 2m-dimensional random vector \(\varvec{\Gamma }\) onto \(\mathcal {G}\), denoted \(\mathbb {E}_{\mathcal {G}}[\varvec{\Gamma }] = \textbf{g}({\textbf {X}}_i{}^{\textrm{T}}\varvec{\beta }_0)\), is the minimizer of

$$\begin{aligned}\min _{\textbf{g}\in \mathcal {G}}\mathbb {E}\left[ (\varvec{\Gamma }-\textbf{g}({\textbf {X}}_i{}^{\textrm{T}}\varvec{\beta }_0)){}^{\textrm{T}}{\textbf {C}}_i^0 (\varvec{\Gamma }-\textbf{g}({\textbf {X}}_i{}^{\textrm{T}}\varvec{\beta }_0))\right] .\end{aligned}$$

Note, the definition of the projection involves the distributions of \({\textbf {X}}_i\), \({\textbf {Z}}_i\), and \(\varvec{\Gamma }\), since the expectation is taken over these random variables. The definition extends to any \(2m\times L\) matrix by column-wise projection. In the following, we list the regularity conditions (Wang et al. 2014; Lian and Liang 2013; Zhao et al. 2017) needed to study the asymptotic behavior of the MLEs.

  1. (A1)

    Both \(g_1(\cdot ) \in \mathcal {H}_s\) and \(g_2(\cdot ) \in \mathcal {H}_s\) for some \(s\ge 2\).

  2. (A2)

    Both \({\textbf {x}}_{ij}^{(1)}\) and \({\textbf {x}}_{ij}^{(2)}\), \(i=1,\ldots ,n, j=1,\ldots , m_i\), are bounded, with density supported on a convex set.

  3. (A3)

    The true parameter point \(\varvec{\zeta }_0\) is an interior point of the parameter space \(\varvec{\Theta }\).

  4. (A4)

    The log-likelihood \(\ell _m(\varvec{\zeta },{\textbf {y}}_i)\) is at least three times differentiable in the parameters \(\varvec{\zeta }\). Furthermore, the second derivatives of the likelihood function satisfy

    $$\begin{aligned}\mathbb {E}\left\{ \left( \frac{\partial \ell _m(\varvec{\zeta },{\textbf {y}}_i)}{\partial \varvec{\zeta }}\right) \left( \frac{\partial \ell _m(\varvec{\zeta },{\textbf {y}}_i)}{\partial \varvec{\zeta }}\right) {}^{\textrm{T}}\right\} = -\mathbb {E}\left\{ \frac{\partial ^2 \ell _m(\varvec{\zeta },{\textbf {y}}_i)}{\partial \varvec{\zeta }\partial \varvec{\zeta }{}^{\textrm{T}}}\right\} .\end{aligned}$$

    Also, there exist functions \(M_{jkl}({\textbf {y}}_i)\) such that

    $$\begin{aligned}\left| \frac{\partial ^3 \ell _m(\varvec{\zeta },{\textbf {y}}_i)}{\partial \varvec{\zeta }_j\partial \varvec{\zeta }_k\partial \varvec{\zeta }_l}\right| \le M_{jkl}({\textbf {y}}_i)\end{aligned}$$

    for \(\varvec{\zeta }\in \varvec{\Theta }\), and \(\mathbb {E}[M_{jkl}({\textbf {y}}_i)]<C_3<+\infty\). Here \(\varvec{\zeta }_j\) denotes the j-th component of \(\varvec{\zeta }\).

  5. (A5)

    The Fisher information matrix \(\mathcal {I}(\varvec{\zeta }_0)=-\mathbb {E}\left. \left\{ \frac{\partial ^2 \ell _m(\varvec{\zeta },{\textbf {y}}_i)}{\partial \varvec{\zeta }\partial \varvec{\zeta }{}^{\textrm{T}}}\right\} \right| _{\varvec{\zeta }_0}\) satisfies the conditions

    $$\begin{aligned}0<C_1<\lambda _{\min }\{\mathcal {I}(\varvec{\zeta }_0)\}\le \lambda _{\max }\{\mathcal {I}(\varvec{\zeta }_0)\}<C_2<+\infty ,\end{aligned}$$

    where \(\lambda _{\min }\) and \(\lambda _{\max }\) denote the smallest and largest eigenvalues of a matrix.

  6. (A6)

    Suppose \(\mathbb {E}_{\mathcal {G}}[{\textbf {X}}_{i}\textrm{diag}\{\dot{\textbf{g}}({\textbf {X}}_i{}^{\textrm{T}}\varvec{\beta }_0)\}]=(h_1({\textbf {X}}_i{}^{\textrm{T}}\varvec{\beta }_0),\ldots , h_{p_1+p_2}({\textbf {X}}_i{}^{\textrm{T}}\varvec{\beta }_0)){}^{\textrm{T}}\). Assume all \(h_j \in \mathcal {H}_{s'}\) with \(s'>1\). We also assume that

    $$\begin{aligned}\mathbb {E}\left[ ({\textbf {J}}{}^{\textrm{T}}{\textbf {X}}_i \textrm{diag}\{\dot{\textbf{g}}({\textbf {X}}_i{}^{\textrm{T}}\varvec{\beta }_0)\}- \mathbb {E}_{\mathcal {G}}[{\textbf {J}}{}^{\textrm{T}}{\textbf {X}}_{i}\textrm{diag}\{\dot{\textbf{g}}({\textbf {X}}_i{}^{\textrm{T}}\varvec{\beta }_0)\}])^{\otimes 2}\right] \end{aligned}$$

    is positive definite, where \({\textbf {J}}\) is evaluated at \(\varvec{\beta }_0\).

Remark 1

The smoothness condition in (A1) is required to attain the best convergence rate for single-index functions approximated in the spline space. Condition (A2) is widely used in the single-index modeling literature; it ensures that the index functions are defined on a compact set, which facilitates the technical derivations. Conditions (A3) and (A4) are two common assumptions in the literature on maximum likelihood estimation with spline approximations (Wang et al. 2011, 2014), implying that the information matrix of the likelihood function is positive definite. Condition (A5) is slightly stronger than that used in the usual asymptotic likelihood theory; however, it is widely used in the high-dimensional likelihood estimation literature (Fan and Peng 2004). Finally, Condition (A6) is related to the ‘projection’, or ‘orthogonalization’, technique common in semiparametric setups, including the partially linear model (Li 2000), the partially linear additive model (Lian and Liang 2013), and single-index models (Cui et al. 2011; Zhao et al. 2017).

Denote \(K=\max \{K_1,K_2\}\), and let \(r_n=\sqrt{K/n}+K^{-s}\). Then, we have the following result.

Theorem 1

Under Conditions (A1)–(A5), suppose that \(K^4/n\rightarrow 0\) and \(\sqrt{n}K^{-2s+1}\rightarrow 0\). Then we have

$$\begin{aligned}\Vert {\widehat{\varvec{\beta }}}-\varvec{\beta }_0\Vert +\Vert {\widehat{\varvec{\theta }}}-\varvec{\theta }_0\Vert =O_p(r_n).\end{aligned}$$

As an immediate implication of Theorem 1, we have \(\Vert {\widehat{g}}_1-g_1\Vert =O_p(r_n)\) and \(\Vert {\widehat{g}}_2-g_2\Vert =O_p(r_n).\)

Remark 2

Note that if K is chosen at the optimal order \(K\sim n^{1/(2s+1)}\), the rate of convergence for the nonparametric functions is \(O_p(n^{-s/(2s+1)})\), matching the standard rate in the nonparametric and semiparametric literature.
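Indeed, the two terms of \(r_n\) balance at this choice of \(K\):

```latex
% With K \sim n^{1/(2s+1)}:
r_n = \sqrt{K/n} + K^{-s}
    \sim \sqrt{n^{\,1/(2s+1)-1}} + n^{-s/(2s+1)}
    = n^{-s/(2s+1)} + n^{-s/(2s+1)}
    = O\bigl(n^{-s/(2s+1)}\bigr),
% since (1/(2s+1) - 1)/2 = -2s/(2(2s+1)) = -s/(2s+1).
```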

Theorem 2

Under Conditions (A1)–(A6), suppose that \(K^4/n\rightarrow 0\), \(\sqrt{n}K^{-2s+1}\rightarrow 0\) and \(\sqrt{n}K^{-s-s'}\rightarrow 0\). Then, we have

$$\begin{aligned}\sqrt{n}({\widehat{\varvec{\beta }}}^{(-1)}-{\varvec{\beta }}^{(-1)}_0) {\mathop {\longrightarrow }\limits ^{\textrm{d}}} N(\varvec{0},{\varvec{{\Psi }}}^{-1}),\end{aligned}$$

where

$$\begin{aligned}\begin{array}{lll} {\varvec{{\Psi }}}&{}=&{}\mathbb {E}\left[ ({\textbf {J}}{}^{\textrm{T}}{\textbf {X}}_i\textrm{diag}\{\dot{\textbf{g}}({\textbf {X}}_i{}^{\textrm{T}}\varvec{\beta }_0)\}-{\textbf {J}}{}^{\textrm{T}}\mathbb {E}_{\mathcal {G}}[{\textbf {X}}_i\textrm{diag}\{\dot{\textbf{g}}({\textbf {X}}_i{}^{\textrm{T}}\varvec{\beta }_0)\}])\cdot {\textbf {C}}_i^0\cdot \right. \\ &{}&{}\left. \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ ({\textbf {J}}{}^{\textrm{T}}{\textbf {X}}_i\textrm{diag}\{\dot{\textbf{g}}({\textbf {X}}_i{}^{\textrm{T}}\varvec{\beta }_0)\}-{\textbf {J}}{}^{\textrm{T}}\mathbb {E}_{\mathcal {G}}[{\textbf {X}}_i\textrm{diag}\{\dot{\textbf{g}}({\textbf {X}}_i{}^{\textrm{T}}\varvec{\beta }_0)\}]){}^{\textrm{T}}\right] \end{array}\end{aligned}$$

and \({\textbf {J}}\) is evaluated at the true \(\varvec{\beta }_0\).

Following Theorem 2 and invoking the Delta method, we have

$$\begin{aligned}\sqrt{n}({\widehat{\varvec{\beta }}}-{\varvec{\beta }}_0) {\mathop {\longrightarrow }\limits ^{\textrm{d}}} N(\varvec{0},{\textbf {J}}{\varvec{{\Psi }}}^{-1}{\textbf {J}}{}^{\textrm{T}}).\end{aligned}$$

4 Maximum Likelihood Estimation

In this section, we develop ML estimation for our BV-SIM model. We utilize EM-type algorithms for obtaining the MLE, based on the two types of missing data in (2.6). The EM algorithm is a popular iterative algorithm for ML estimation in models with incomplete data (Dempster et al. 1977); each iteration consists of two steps, the expectation (E) step and the maximization (M) step. Despite its desirable features, the M-step of the EM algorithm is often difficult to implement for complicated models; in such cases, it is replaced by a sequence of computationally simple conditional maximization (CM) steps, i.e., maximizing over one parameter with the other parameters held fixed. This leads to a simple extension of the EM algorithm, called the ECM algorithm (Meng and Rubin 1993).
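The role that the mixing variable \(V_i\) plays below is easiest to see in a simpler member of the same normal scale-mixture family. The following Python sketch (illustrative only, not the BV-SIM estimator) runs EM for a univariate Student-t location/scale with known degrees of freedom, where the E-step weight `w` plays exactly the role of \(\widehat{d}_i=\mathbb {E}(V_i^{-1}|{\textbf {y}})\): observations far from the center receive small weight, which is the source of the robustness.

```python
import numpy as np

def t_location_em(y, nu=5.0, n_iter=100, tol=1e-8):
    """EM for the location mu and scale sig2 of a univariate Student-t
    with known df nu, written as a normal scale mixture:
    y_i | V_i ~ N(mu, sig2 / V_i), with V_i ~ Gamma(nu/2, rate=nu/2)."""
    mu, sig2 = np.median(y), np.var(y)
    for _ in range(n_iter):
        # E-step: w_i = E[V_i | y_i] down-weights outlying points
        w = (nu + 1.0) / (nu + (y - mu) ** 2 / sig2)
        # M-steps: weighted closed-form updates
        mu_new = np.sum(w * y) / np.sum(w)
        sig2_new = np.mean(w * (y - mu_new) ** 2)
        converged = abs(mu_new - mu) + abs(sig2_new - sig2) < tol
        mu, sig2 = mu_new, sig2_new
        if converged:
            break
    return mu, sig2

rng = np.random.default_rng(0)
y = rng.standard_t(df=5, size=2000) * 1.5 + 3.0   # synthetic t5 data
mu_hat, sig2_hat = t_location_em(y, nu=5.0)
```

The multivariate Laplace model in (2.6) has the same architecture, with an exponential mixing law instead of a gamma one and the extra skewness term \(V_i\varvec{\gamma }\).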

Consider the hierarchical multivariate Laplace model in (2.6), where both \(V_i\) and \({\textbf {b}}_i\) are missing data. Let \({\textbf {y}}=({\textbf {y}}_1{}^{\textrm{T}},\ldots ,{\textbf {y}}_n{}^{\textrm{T}}){}^{\textrm{T}}\), \({\textbf {b}}=({\textbf {b}}_1{}^{\textrm{T}},\ldots ,{\textbf {b}}_n{}^{\textrm{T}}){}^{\textrm{T}}\), \({\textbf {V}}=(V_1,\ldots ,V_n){}^{\textrm{T}}\) and \(\varvec{\theta }=(\varvec{\theta }_1{}^{\textrm{T}},\varvec{\theta }_2{}^{\textrm{T}}){}^{\textrm{T}}\). The log-likelihood for the complete data in the multivariate Laplace single-index mixed-effects model up to an additive constant can be written as

$$\begin{aligned} \ell (\varvec{\beta },\varvec{\theta },\varvec{\gamma },\varvec{\Sigma },\varvec{\Omega }|{\textbf {y}},{\textbf {b}},{\textbf {V}})=\ell _1(\varvec{\beta },\varvec{\theta },\varvec{\gamma },\varvec{\Sigma }|{\textbf {y}},{\textbf {b}},{\textbf {V}})+ \ell _2(\varvec{\Omega }|{\textbf {b}},{\textbf {V}}), \end{aligned}$$
(4.1)

where

$$\begin{aligned} \ell _1(\varvec{\beta },\varvec{\theta },\varvec{\gamma },\varvec{\Sigma }|{\textbf {y}},{\textbf {b}},{\textbf {V}})=-\frac{N}{2}\log |\varvec{\Sigma }|-\frac{1}{2}\sum _{i=1}^n\sum _{j=1}^{m_i} V_{i}^{-1}({\textbf {y}}_{ij}-{\widetilde{\varvec{ \mu }}}_{ij}-V_{i}\varvec{\gamma }){}^{\textrm{T}}\varvec{\Sigma }^{-1}({\textbf {y}}_{ij}-{\widetilde{\varvec{ \mu }}}_{ij}-V_{i}\varvec{\gamma }) \end{aligned}$$

and

$$\begin{aligned}\ell _2(\varvec{\Omega }|{\textbf {b}},{\textbf {V}})=-\frac{n}{2} \log |\varvec{\Omega }|-\frac{1}{2}\textrm{trace}\left( \varvec{\Omega }^{-1} \sum _{i=1}^n V_i^{-1}{\textbf {b}}_i{\textbf {b}}_i{}^{\textrm{T}}\right) ,\end{aligned}$$

where \({\widetilde{\varvec{ \mu }}}_{ij}\) is defined in (2.5) and \(N=\sum _{i=1}^n m_i\). Note that \(\ell _1\) can be further written as

$$\begin{aligned} \ell _1&=-\frac{N}{2}\log |\varvec{\Sigma }|-\frac{1}{2}\sum _{i=1}^n V_{i}^{-1} ({\textbf {y}}_{i}-{\textbf {W}}_{i}{}^{\textrm{T}}\varvec{\theta }){}^{\textrm{T}}\varvec{\Lambda }_i^{-1}({\textbf {y}}_{i}-{\textbf {W}}_{i}{}^{\textrm{T}}\varvec{\theta }) -\frac{1}{2}\sum _{i=1}^n V_{i}^{-1}{\textbf {b}}_i{}^{\textrm{T}}{\textbf {Z}}_{i}\varvec{\Lambda }_i^{-1}{\textbf {Z}}_{i}{}^{\textrm{T}}{\textbf {b}}_i \\&+\sum _{i=1}^nV_i^{-1}({\textbf {y}}_{i}-{\textbf {W}}_{i}{}^{\textrm{T}}\varvec{\theta }){}^{\textrm{T}}\varvec{\Lambda }_i^{-1}{\textbf {Z}}_{i}{}^{\textrm{T}}{\textbf {b}}_i -\sum _{i=1}^n(\varvec{\gamma }^*_i){}^{\textrm{T}}\varvec{\Lambda }_i^{-1}{\textbf {Z}}_{i}{}^{\textrm{T}}{\textbf {b}}_i +\sum _{i=1}^n({\textbf {y}}_{i}-{\textbf {W}}_{i}{}^{\textrm{T}}\varvec{\theta }){}^{\textrm{T}}\varvec{\Lambda }_i^{-1}\varvec{\gamma }_i^* \\&-\frac{1}{2}\sum _{i=1}^n V_i(\varvec{\gamma }_i^*){}^{\textrm{T}}\varvec{\Lambda }_i^{-1}\varvec{\gamma }_i^*. \end{aligned}$$

Denote \(\varvec{ \eta }\) as the full parameter vector to be estimated. We first compute the conditional posterior mean and variance of \({\textbf {b}}_i\) at the current estimate \({\widehat{\varvec{ \eta }}}\), leading to

$$\begin{aligned} \begin{array}{ll} \textrm{Cov}({\textbf {b}}_i|\varvec{ \eta }={\widehat{\varvec{ \eta }}}, {\textbf {y}}, {\textbf {V}})=V_i\left( {{\widehat{\varvec{\Omega }}}}^{-1}+{\textbf {Z}}_{i}{\widehat{\varvec{\Lambda }}}_i^{-1}{\textbf {Z}}_{i}{}^{\textrm{T}}\right) ^{-1}\triangleq V_i \cdot {\widehat{\varvec{\Delta }}}_i, \\ \\ \mathbb {E}({\textbf {b}}_i|\varvec{ \eta }={\widehat{\varvec{ \eta }}}, {\textbf {y}}, {\textbf {V}})={\widehat{\varvec{\Delta }}}_i {\textbf {Z}}_i {\widehat{\varvec{\Lambda }}}_i^{-1}({\textbf {y}}_{i}-{\textbf {W}}_{i}{}^{\textrm{T}}{\widehat{\varvec{\theta }}}-V_i{\widehat{\varvec{\gamma }}}^*_i) \triangleq {\widehat{{\textbf {R}}}}_{i1}-V_i{\widehat{{\textbf {R}}}}_{i2}, \end{array} \end{aligned}$$

for \(i=1,\ldots ,n\), where

$$\begin{aligned} {\widehat{\varvec{\Delta }}}_i=\left( {\widehat{\varvec{\Omega }}}^{-1}+{\textbf {Z}}_{i}{\widehat{\varvec{\Lambda }}}_i^{-1}{\textbf {Z}}_{i}{}^{\textrm{T}}\right) ^{-1}, \ \ {\widehat{{\textbf {R}}}}_{i1}={\widehat{\varvec{\Delta }}}_i {\textbf {Z}}_i {\widehat{\varvec{\Lambda }}}_i^{-1}({\textbf {y}}_{i}-{\textbf {W}}_{i}{}^{\textrm{T}}{\widehat{\varvec{\theta }}}) \ \ \textrm{and} \ \ {\widehat{{\textbf {R}}}}_{i2}={\widehat{\varvec{\Delta }}}_i {\textbf {Z}}_i {\widehat{\varvec{\Lambda }}}_i^{-1}\widehat{\varvec{\gamma }}^*_i. \end{aligned}$$
(4.2)
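In implementation, (4.2) is a few lines of linear algebra. A numpy sketch under assumed stacked shapes (\({\textbf {Z}}_i\) of size \(q\times M_i\), mapping the q random effects to the stacked \(M_i\)-dimensional cluster response; all inputs here are hypothetical):

```python
import numpy as np

def posterior_b_terms(Omega, Z, Lam, y, Wt_theta, gamma_star):
    """Compute Delta_i, R_i1, R_i2 of Eq. (4.2).
    Omega:      (q, q) random-effect covariance
    Z:          (q, M) random-effect design, stacked over cluster members
    Lam:        (M, M) stacked error scatter Lambda_i
    y:          (M,)  stacked response
    Wt_theta:   (M,)  fixed-effect (spline) fit W_i^T theta
    gamma_star: (M,)  stacked skewness vector gamma_i^*."""
    Lam_inv = np.linalg.inv(Lam)
    Delta = np.linalg.inv(np.linalg.inv(Omega) + Z @ Lam_inv @ Z.T)
    R1 = Delta @ Z @ Lam_inv @ (y - Wt_theta)
    R2 = Delta @ Z @ Lam_inv @ gamma_star
    return Delta, R1, R2
```

The conditional moments then follow as \(\textrm{Cov}({\textbf {b}}_i|\cdot )=V_i\,{\widehat{\varvec{\Delta }}}_i\) and \(\mathbb {E}({\textbf {b}}_i|\cdot )={\widehat{{\textbf {R}}}}_{i1}-V_i{\widehat{{\textbf {R}}}}_{i2}\).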

After obtaining the estimates of the conditional mean and covariance of the random effect \({\textbf {b}}_i\), we proceed to calculate \(\mathbb {E}(\ell (\cdot ))=\mathbb {E}_{\textbf {V}}\{\mathbb {E}_{\textbf {b}}[\ell (\cdot )|{\textbf {V}}]\}\). Define the quantities

$$\begin{aligned} \widehat{c}_i=\mathbb {E}(V_i|\varvec{ \eta }={\widehat{\varvec{ \eta }}}, {\textbf {y}}) \ \ \textrm{and } \ \ \widehat{d}_i=\mathbb {E}(V_i^{-1}|\varvec{ \eta }={\widehat{\varvec{ \eta }}}, {\textbf {y}}), \end{aligned}$$
(4.3)

which can be computed from (2.9), using the current estimate \({\widehat{\varvec{ \eta }}}\). After some simple calculations, we have

$$\begin{aligned} \begin{array}{lll} Q_1&{}\triangleq &{} \mathbb {E}\left[ \ell _1(\cdot |{\textbf {y}},{\textbf {b}},{\textbf {V}})|{\textbf {y}}, \varvec{ \eta }={\widehat{\varvec{ \eta }}}\right] \\ \\ &{} = &{} -\frac{N}{2}\log |\varvec{\Sigma }|-\frac{1}{2}\sum _{i=1}^n \widehat{d}_i ({\textbf {y}}_{i}-{\textbf {W}}_{i}{}^{\textrm{T}}\varvec{\theta }){}^{\textrm{T}}\varvec{\Lambda }_i^{-1}({\textbf {y}}_{i}-{\textbf {W}}_{i}{}^{\textrm{T}}\varvec{\theta }) -\frac{1}{2}\sum _{i=1}^n \widehat{c}_i(\varvec{\gamma }_i^*){}^{\textrm{T}}\varvec{\Lambda }_i^{-1}\varvec{\gamma }_i^* \\ \\ &{}&{} -\frac{1}{2}\sum _{i=1}^n \textrm{trace}\left\{ {\textbf {Z}}_{i}\varvec{\Lambda }_i^{-1}{\textbf {Z}}_{i}{}^{\textrm{T}}\left[ \widehat{d}_i{\widehat{{\textbf {R}}}}_{i1}{\widehat{{\textbf {R}}}}_{i1}{}^{\textrm{T}}-{\widehat{{\textbf {R}}}}_{i1}{\widehat{{\textbf {R}}}}_{i2}{}^{\textrm{T}}-{\widehat{{\textbf {R}}}}_{i2}{\widehat{{\textbf {R}}}}_{i1}{}^{\textrm{T}}+\widehat{c}_i{\widehat{{\textbf {R}}}}_{i2}{\widehat{{\textbf {R}}}}_{i2}{}^{\textrm{T}}+{\widehat{\varvec{\Delta }}}_i\right] \right\} \\ \\ &{}&{} +\sum _{i=1}^n \widehat{d}_i ({\textbf {y}}_{i}-{\textbf {W}}_{i}{}^{\textrm{T}}\varvec{\theta }){}^{\textrm{T}}\varvec{\Lambda }_i^{-1}{\textbf {Z}}_{i}{}^{\textrm{T}}{\widehat{{\textbf {R}}}}_{i1} -\sum _{i=1}^n({\textbf {y}}_{i}-{\textbf {W}}_{i}{}^{\textrm{T}}\varvec{\theta }){}^{\textrm{T}}\varvec{\Lambda }_i^{-1}[{\textbf {Z}}_{i}{}^{\textrm{T}}{\widehat{{\textbf {R}}}}_{i2}-\varvec{\gamma }_i^*]\\ \\ &{}&{} -\sum _{i=1}^n(\varvec{\gamma }^*_i){}^{\textrm{T}}\varvec{\Lambda }_i^{-1}{\textbf {Z}}_{i}{}^{\textrm{T}}{\widehat{{\textbf {R}}}}_{i1} +\sum _{i=1}^n \widehat{c}_i(\varvec{\gamma }^*_i){}^{\textrm{T}}\varvec{\Lambda }_i^{-1}{\textbf {Z}}_{i}{}^{\textrm{T}}{\widehat{{\textbf {R}}}}_{i2}, \end{array} \end{aligned}$$
(4.4)

and

$$\begin{aligned} \begin{array}{lll} Q_2&{}\triangleq &{} \mathbb {E}\left[ \ell _2(\cdot |{\textbf {y}},{\textbf {b}},{\textbf {V}})|{\textbf {y}}, \varvec{ \eta }={\widehat{\varvec{ \eta }}}\right] \\ \\ &{} = &{} -\frac{n}{2} \log |\varvec{\Omega }|-\frac{1}{2}\sum _{i=1}^n \textrm{trace}\left\{ \varvec{\Omega }^{-1}\left[ \widehat{d}_i{\widehat{{\textbf {R}}}}_{i1}{\widehat{{\textbf {R}}}}_{i1}{}^{\textrm{T}}-{\widehat{{\textbf {R}}}}_{i1}{\widehat{{\textbf {R}}}}_{i2}{}^{\textrm{T}}-{\widehat{{\textbf {R}}}}_{i2}{\widehat{{\textbf {R}}}}_{i1}{}^{\textrm{T}}+\widehat{c}_i{\widehat{{\textbf {R}}}}_{i2}{\widehat{{\textbf {R}}}}_{i2}{}^{\textrm{T}}+{\widehat{\varvec{\Delta }}}_i\right] \right\} +C. \end{array} \end{aligned}$$
(4.5)
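For scale mixtures of this type, the conditional law of \(V_i\) given the data is typically generalized inverse Gaussian (GIG), so the expectations \(\widehat{c}_i\) and \(\widehat{d}_i\) in (4.3) reduce to ratios of modified Bessel functions. A hedged sketch of the moment computation (the exact GIG parameters p, a, b would be read off (2.9), which is not reproduced here; the sanity check against direct numerical integration uses illustrative values):

```python
import numpy as np
from scipy.special import kv          # modified Bessel function K_p
from scipy.integrate import quad

def gig_moments(p, a, b):
    """E[V] and E[1/V] for V ~ GIG(p, a, b), with density
    f(v) proportional to v^{p-1} exp(-(a*v + b/v)/2), v > 0."""
    w = np.sqrt(a * b)
    c = kv(p, w)
    EV = np.sqrt(b / a) * kv(p + 1, w) / c
    EVinv = np.sqrt(a / b) * kv(p - 1, w) / c
    return EV, EVinv

# sanity check against direct numerical integration (illustrative p, a, b)
p, a, b = 0.5, 2.0, 3.0
f = lambda v: v ** (p - 1) * np.exp(-(a * v + b / v) / 2)
norm = quad(f, 0, np.inf)[0]
EV_num = quad(lambda v: v * f(v), 0, np.inf)[0] / norm
EVinv_num = quad(lambda v: f(v) / v, 0, np.inf)[0] / norm
EV, EVinv = gig_moments(p, a, b)
```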

Next, maximizing \(Q_1\) over the parameters \(\varvec{\theta }\), \(\varvec{\gamma }\), \(\varvec{\beta }\) and \(\varvec{\Sigma }\), and maximizing \(Q_2\) over \(\varvec{\Omega }\), we obtain their estimates, which constitute CM-steps 1–5 of the following ECM algorithm:

  1. E-step

    Given the current parameter estimates, for \(i=1,\ldots ,n\), update \(\widehat{c}_i\) and \(\widehat{d}_i\) using (4.3), and update \(\widehat{\varvec{\Delta }}_i\), \({\widehat{{\textbf {R}}}}_{i1}\) and \({\widehat{{\textbf {R}}}}_{i2}\) by (4.2).

  2. CM-step 1

    Fix \({\widehat{\varvec{\beta }}}, {\widehat{\varvec{\gamma }}}\) and \({\widehat{\varvec{\Sigma }}}\), and update \({\widehat{\varvec{\theta }}}\) by maximizing (4.4) over \(\varvec{\theta }\), which gives

    $$\begin{aligned}{\widehat{\varvec{\theta }}}=\left( \sum _{i=1}^n \sum _{j=1}^{m_i} \widehat{d}_i{\textbf {W}}_{ij}{\widehat{\varvec{\Sigma }}}^{-1}{\textbf {W}}_{ij}{}^{\textrm{T}}\right) ^{-1}\sum _{i=1}^n \sum _{j=1}^{m_i} {\textbf {W}}_{ij}{\widehat{\varvec{\Sigma }}}^{-1}\left[ \widehat{d}_i({\textbf {y}}_{ij}-{\textbf {Z}}_{ij}{}^{\textrm{T}}{\widehat{{\textbf {R}}}}_{i1})+{\textbf {Z}}_{ij}{}^{\textrm{T}}{\widehat{{\textbf {R}}}}_{i2}-\widehat{\varvec{\gamma }}\right] .\end{aligned}$$
  3. CM-step 2

    Fix \({\widehat{\varvec{\beta }}}, {\widehat{\varvec{\theta }}}\) and \({\widehat{\varvec{\Sigma }}}\), and update \({\widehat{\varvec{\gamma }}}\) by maximizing (4.4) over \(\varvec{\gamma }\), i.e.,

    $$\begin{aligned}{\widehat{\varvec{\gamma }}}=\frac{\sum _{i=1}^n\sum _{j=1}^{m_i} ({\textbf {y}}_{ij}-{\textbf {W}}_{ij}{}^{\textrm{T}}{\widehat{\varvec{\theta }}}-{\textbf {Z}}_{ij}{}^{\textrm{T}}{\widehat{{\textbf {R}}}}_{i1}+\widehat{c}_i{\textbf {Z}}_{ij}{}^{\textrm{T}}{\widehat{{\textbf {R}}}}_{i2})}{\sum _{i=1}^n m_i \widehat{c}_i}.\end{aligned}$$
  4. CM-step 3

    Fix \({\widehat{\varvec{\theta }}}\), \({\widehat{\varvec{\gamma }}}\) and \({\widehat{\varvec{\Sigma }}}\), and update \({\widehat{\varvec{\beta }}}\) by maximizing (4.4) over \(\varvec{\beta }\). Since there is no explicit expression for the estimate of the index parameter \(\varvec{\beta }\), we use the Newton–Raphson method to obtain \({\widehat{\varvec{\beta }}}\), leading to the following iterative formula

    $$\begin{aligned}\begin{array}{lll} \left( {\widehat{\varvec{\beta }}}^{(-1)}\right) ^\textrm{new}&{}=&{}\left( {\widehat{\varvec{\beta }}}^{(-1)}\right) ^\textrm{old}+\left( \sum _{i=1}^n\sum _{j=1}^{m_i}\widehat{d}_i{\textbf {H}}_{ij}{\widehat{\varvec{\Sigma }}}^{-1}{\textbf {H}}_{ij}{}^{\textrm{T}}\right) ^{-1}\times \\ \\ &{}&{} \sum _{i=1}^n\sum _{j=1}^{m_i} {\textbf {H}}_{ij}{\widehat{\varvec{\Sigma }}}^{-1}\left[ \widehat{d}_i({\textbf {y}}_{ij}-{\textbf {W}}_{ij}{}^{\textrm{T}}{\widehat{\varvec{\theta }}}-{\textbf {Z}}_{ij}{}^{\textrm{T}}{\widehat{{\textbf {R}}}}_{i1})+{\textbf {Z}}_{ij}{}^{\textrm{T}}{\widehat{{\textbf {R}}}}_{i2}-{\widehat{\varvec{\gamma }}}\right] \end{array} \end{aligned}$$

    where \({\textbf {H}}_{ij}=\left[ \begin{array}{cc} {\textbf {J}}_1{}^{\textrm{T}}{\textbf {x}}_{ij}^{(1)}\{\dot{{\textbf {B}}}_1{}^{\textrm{T}}(({\textbf {x}}_{ij}^{(1)}){}^{\textrm{T}}{\widehat{\varvec{\beta }}}_1^\textrm{old}){\widehat{\varvec{\theta }}}_1\} &{} {\varvec{0}}_{(p_1-1)\times 1}\\ {\varvec{0}}_{(p_2-1)\times 1} &{} {\textbf {J}}_2{}^{\textrm{T}}{\textbf {x}}_{ij}^{(2)}\{\dot{{\textbf {B}}}_2{}^{\textrm{T}}(({\textbf {x}}_{ij}^{(2)}){}^{\textrm{T}}{\widehat{\varvec{\beta }}}_2^\textrm{old}){\widehat{\varvec{\theta }}}_2\} \end{array} \right]\), and \(\dot{{\textbf {B}}}(\cdot )\) denotes the first derivative of the spline basis \({\textbf {B}}(\cdot )\).

  5. CM-step 4

    Fix \({\widehat{\varvec{\beta }}}\), \({\widehat{\varvec{\theta }}}\) and \({\widehat{\varvec{\gamma }}}\), and update \({\widehat{\varvec{\Sigma }}}\) by maximizing (4.4) over \(\varvec{\Sigma }\). Denote

    $$\begin{aligned}{\widehat{{\textbf {D}}}}=\sum _{i=1}^n \sum _{j=1}^{m_i} \left\{ \left[ \widehat{d}_i ({\textbf {y}}_{ij}-{\textbf {W}}_{ij}{}^{\textrm{T}}{\widehat{\varvec{\theta }}}-2{\textbf {Z}}_{ij}{}^{\textrm{T}}\widehat{{\textbf {R}}}_{i1})+2({\textbf {Z}}_{ij}{}^{\textrm{T}}\widehat{{\textbf {R}}}_{i2} -\widehat{\varvec{\gamma }})\right] ({\textbf {y}}_{ij}-{\textbf {W}}_{ij}{}^{\textrm{T}}{\widehat{\varvec{\theta }}}){}^{\textrm{T}}+\widehat{c}_i{\widehat{\varvec{\gamma }}}{\widehat{\varvec{\gamma }}}{}^{\textrm{T}}\right\} +\end{aligned}$$
    $$\begin{aligned}\sum _{i=1}^n \sum _{j=1}^{m_i}{\textbf {Z}}_{ij}{}^{\textrm{T}}\left[ \widehat{d}_i{\widehat{{\textbf {R}}}}_{i1}{\widehat{{\textbf {R}}}}_{i1}{}^{\textrm{T}}-{\widehat{{\textbf {R}}}}_{i1}{\widehat{{\textbf {R}}}}_{i2}{}^{\textrm{T}}-{\widehat{{\textbf {R}}}}_{i2}{\widehat{{\textbf {R}}}}_{i1}{}^{\textrm{T}}+\widehat{c}_i{\widehat{{\textbf {R}}}}_{i2}{\widehat{{\textbf {R}}}}_{i2}{}^{\textrm{T}}+{\widehat{\varvec{\Delta }}}_i\right] {\textbf {Z}}_{ij} +\sum _{i=1}^n \sum _{j=1}^{m_i} ({\textbf {Z}}_{ij}{}^{\textrm{T}}{\widehat{{\textbf {R}}}}_{i1}- \widehat{c}_i{\textbf {Z}}_{ij}{}^{\textrm{T}}{\widehat{{\textbf {R}}}}_{i2}){\widehat{\varvec{\gamma }}}{}^{\textrm{T}}.\end{aligned}$$

    Applying the result in Lemma 1, we obtain \({\widehat{\varvec{\Sigma }}}=\frac{1}{N}{\widehat{{\textbf {D}}}}\).

  6. CM-step 5

    Update \({\widehat{\varvec{\Omega }}}\) by maximizing (4.5) over \(\varvec{\Omega }\), which gives

    $$\begin{aligned}{\widehat{\varvec{\Omega }}}=\frac{1}{n}\sum _{i=1}^n \left[ \widehat{d}_i{\widehat{{\textbf {R}}}}_{i1}{\widehat{{\textbf {R}}}}_{i1}{}^{\textrm{T}}-{\widehat{{\textbf {R}}}}_{i1}{\widehat{{\textbf {R}}}}_{i2}{}^{\textrm{T}}-{\widehat{{\textbf {R}}}}_{i2}{\widehat{{\textbf {R}}}}_{i1}{}^{\textrm{T}}+\widehat{c}_i{\widehat{{\textbf {R}}}}_{i2}{\widehat{{\textbf {R}}}}_{i2}{}^{\textrm{T}}+{\widehat{\varvec{\Delta }}}_i\right] .\end{aligned}$$
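The Newton structure of CM-step 3 (update the free components \(\varvec{\beta }^{(-1)}\), then recover the full unit-norm index through the Jacobian \({\textbf {J}}\)) can be sketched on a toy problem. Everything below is illustrative: the data are synthetic and the objective is ordinary least squares rather than (4.4), but the delete-one-component parameterization and the Gauss-Newton step mirror the update above.

```python
import numpy as np

def J_mat(phi):
    """Jacobian d(beta)/d(phi) of the delete-one-component map
    beta = (sqrt(1 - ||phi||^2), phi) used for unit-norm indices."""
    b1 = np.sqrt(1.0 - phi @ phi)
    return np.vstack([-phi / b1, np.eye(len(phi))])

# toy zero-residual least-squares problem (illustrative data only)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
beta_true = np.array([2.0, 1.0, 1.0]) / np.sqrt(6.0)
y = X @ beta_true

phi = np.array([0.3, 0.3])               # free components beta^{(-1)}
for _ in range(50):                      # Gauss-Newton iterations
    beta = np.concatenate(([np.sqrt(1.0 - phi @ phi)], phi))
    J = J_mat(phi)
    resid = y - X @ beta
    grad = J.T @ X.T @ resid             # score direction w.r.t. phi
    hess = J.T @ X.T @ X @ J             # Gauss-Newton Hessian
    phi = phi + np.linalg.solve(hess, grad)
beta_hat = np.concatenate(([np.sqrt(1.0 - phi @ phi)], phi))
```

Because the full index is rebuilt from \(\varvec{\beta }^{(-1)}\) at every step, the unit-norm identifiability constraint holds exactly throughout the iterations.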

Repeat the above E-step and CM-steps until all parameter estimates meet the desired convergence criterion. Since our estimation procedure requires initial values, we set \({\widehat{\varvec{\gamma }}}^{(0)}=(0,0){}^{\textrm{T}}\) and \({\widehat{\varvec{\Sigma }}}^{(0)}={\textbf {I}}_2\), while the initial estimates \({\widehat{\varvec{\beta }}}_1^{(0)}\), \({\widehat{\varvec{\beta }}}_2^{(0)}\) and \({\widehat{\varvec{\Omega }}}^{(0)}\) are obtained by fitting a linear mixed model via the function lmer in the R package lme4, where \({\textbf {X}}_{ij}=\textrm{blockdiag}({\textbf {x}}_{ij}^{(1)},{\textbf {x}}_{ij}^{(2)})\) and \({\textbf {Z}}_{ij}\) are the design matrices corresponding to the fixed and random effects, respectively. Simulation studies (in Sect. 5) show that this strategy works well.
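Schematically, the whole procedure is an outer loop of the following form. The step functions and the convergence rule (relative change of the stacked parameter vector) are hypothetical placeholders standing in for (4.2)-(4.5) and CM-steps 1-5:

```python
import numpy as np

def ecm_fit(params0, e_step, cm_steps, tol=1e-6, max_iter=500):
    """Generic ECM driver: alternate one E-step with a list of CM-steps
    until the relative change of the stacked parameter vector is small.
    `e_step` maps the current parameter dict to the E-step statistics;
    each entry of `cm_steps` maps (params, stats) to updated params.
    All callables are hypothetical placeholders."""
    params = dict(params0)
    for it in range(max_iter):
        old = np.concatenate([np.ravel(v) for v in params.values()])
        stats = e_step(params)              # E-step
        for cm in cm_steps:                 # CM-steps, in order
            params = cm(params, stats)
        new = np.concatenate([np.ravel(v) for v in params.values()])
        if np.linalg.norm(new - old) <= tol * (1.0 + np.linalg.norm(old)):
            break
    return params, it + 1
```

As a trivial usage check, alternately updating the mean and variance of a Gaussian sample through two CM-steps converges in two sweeps.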

5 Simulation Studies

In this section, we conduct extensive simulation studies using synthetic data to assess the finite-sample performance of the parameter estimates under our proposed method (Simulation 1), and the robustness of our method relative to existing alternatives, with data generated under various settings (Simulation 2).

5.1 Knot Selection

It is well known that the performance of any spline estimation depends on the choice of knots. Here, we employ the Schwarz information criterion (SIC) for adaptive knot selection (Ma and Song 2015; Lu 2017; Zhao et al. 2017). In view of the order \(n^{1/(2s+1)}\) of knots needed to attain the optimal convergence rate of the nonparametric functions (see Remark 2), candidate numbers of knots are selected in a neighborhood of \(n^{1/(2s+1)}\), namely \(\left[ 0.5N_s, \min (5N_s, n^{1/2})\right]\), where \(N_s=\lfloor n^{1/(2s+1)}\rfloor\) and s is the smoothness parameter. We choose \(s=2\) in both the simulation studies and the real data application. For simplicity, we use cubic polynomial splines, and the numbers of interior knots \(K_1=K_2\equiv K\) are taken to be the same for the two nonparametric link functions. The optimal number of knots \(K_\textrm{opt}\) is the minimizer of \(\textrm{SIC}(K)=-\sum _{i=1}^n \log \hat{L}_i^K+\log n\times 2K\), where \(\log \hat{L}_i^K\) denotes the estimated value of the log-likelihood function obtained from (2.8), with the given K knots.
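In code, the knot search described above is a one-dimensional grid search. In the sketch below, `loglik_fn` is a hypothetical callable wrapping a full model fit for a given number of interior knots K:

```python
import numpy as np

def select_knots_sic(n, s, loglik_fn):
    """Grid-search the number of interior knots K by SIC:
    K ranges over a neighborhood of N_s = floor(n^{1/(2s+1)}), and
    SIC(K) = -loglik(K) + log(n) * 2K.  `loglik_fn(K)` is a hypothetical
    callable returning the fitted log-likelihood with K interior knots."""
    Ns = int(np.floor(n ** (1.0 / (2 * s + 1))))
    K_lo = max(1, int(np.ceil(0.5 * Ns)))
    K_hi = int(min(5 * Ns, np.sqrt(n)))
    sic = {K: -loglik_fn(K) + np.log(n) * 2 * K for K in range(K_lo, K_hi + 1)}
    return min(sic, key=sic.get)
```

The \(\log n \times 2K\) penalty charges 2K free spline coefficients (one set per link function), so larger K is admitted only when the likelihood gain exceeds the SIC penalty.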

5.2 Simulation 1: Assessing Finite-Sample Properties

Here, data is generated from the model (2.5), where the two nonparametric functions are \(g_1(u)=2\sin (\pi u)\) and \(g_2(u)=8u(1-u)\), with the true index parameters \(\varvec{\beta }_1=(1/\sqrt{3},-1/\sqrt{3},1/\sqrt{3}){}^{\textrm{T}}\) and \(\varvec{\beta }_2=(2/\sqrt{6},1/\sqrt{6},1/\sqrt{6}){}^{\textrm{T}}\), respectively. Both covariates \({\textbf {x}}_{ij}^{(1)}\) and \({\textbf {x}}_{ij}^{(2)}\) are generated independently from the trivariate uniform distribution \(U^3(0,1)\). The random effects \({\textbf {b}}_i=({\textbf {b}}_{i1}{}^{\textrm{T}},{\textbf {b}}_{i2}{}^{\textrm{T}}){}^{\textrm{T}}\) are generated from \(\textrm{SAL}_{4}({\varvec{0}}, \varvec{\Omega }, \varvec{0})\), with covariance matrix

$$\begin{aligned}\varvec{\Omega }=\left( \begin{array}{cccc} 9 &{} 4.8 &{} 3.6 &{} 0.6\\ 4.8 &{} 4 &{} 2 &{} 1.2\\ 3.6 &{} 2 &{} 4 &{} 1 \\ 0.6 &{} 1.2 &{} 1 &{} 1 \end{array}\right) ,\end{aligned}$$

and the corresponding covariates \({\textbf {z}}_{ij}^{(1)}=(1,z_{ij1}^{(1)}){}^{\textrm{T}}\) and \({\textbf {z}}_{ij}^{(2)}=(1,z_{ij1}^{(2)}){}^{\textrm{T}}\), where \(z_{ij1}^{(1)}\) and \(z_{ij1}^{(2)}\) are generated from the standard normal distribution. The random error \(\varvec{\epsilon }_{ij}\) is generated from \(\textrm{SAL}_2 (\varvec{0},\varvec{\Sigma },\varvec{\gamma })\) with \(\varvec{\Sigma }=\left( \begin{array}{cc}1 &{} 0.6\\ 0.6 &{} 1 \end{array}\right)\) and \(\varvec{\gamma }=(2,1.5){}^{\textrm{T}}\). The sample size n is set to be 50, 100 and 200, and the number of cluster members \(m_i\) in each subject is generated from the discrete uniform distribution on \(5,6,\ldots ,10\). Table 1 presents the averages of bias, absolute bias, and the empirical standard error estimates for the index parameters and the skewness parameter, over 400 replications.
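The SAL draws above can be generated through the usual normal mean-variance mixture representation of the asymmetric Laplace law, \({\textbf {Y}}=\varvec{\gamma }V+\sqrt{V}\,\varvec{\Sigma }^{1/2}{\textbf {Z}}\) with \(V\sim \textrm{Exp}(1)\) and \({\textbf {Z}}\) standard normal; a numpy sketch of the error draws for one simulated dataset:

```python
import numpy as np

rng = np.random.default_rng(42)

def rsal(n, Sigma, gamma, rng):
    """Draw n variates from SAL(0, Sigma, gamma) via the normal
    mean-variance mixture Y = gamma*V + sqrt(V) * L z, with V ~ Exp(1),
    z ~ N(0, I), and L the lower Cholesky factor of Sigma."""
    L = np.linalg.cholesky(Sigma)
    V = rng.exponential(1.0, size=n)
    Z = rng.standard_normal((n, len(gamma)))
    return V[:, None] * gamma + np.sqrt(V)[:, None] * (Z @ L.T)

# the Simulation 1 error setting: Sigma with correlation 0.6, gamma = (2, 1.5)
Sigma = np.array([[1.0, 0.6], [0.6, 1.0]])
gamma = np.array([2.0, 1.5])
eps = rsal(5000, Sigma, gamma, rng)
```

Since \(\mathbb {E}(V)=1\), the draws have mean \(\varvec{\gamma }\), which is a convenient sanity check on the generator.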

Table 1 Table entries are the average bias (BIAS), average absolute bias (ABIAS), and empirical standard error (ESE) estimates for \(n = 50, 100, 200\), calculated over 400 replications, corresponding to Simulation 1

From Table 1, all biases are close to zero for all sample sizes, implying that our proposed estimators are consistent. Moreover, the absolute biases and the standard errors decrease with increasing sample size, with the estimation performance for the index parameters significantly better than that for the skewness parameter. To further assess the estimation results, we calculate the integrated mean squared error (IMSE), defined as

$$\begin{aligned}\textrm{IMSE}(g_l)=\frac{1}{400 }\sum _{s=1}^{400}\sqrt{\frac{1}{N}\sum _{i=1}^n\sum _{j=1}^{m_i} \{\widehat{g}^{(s)}_l(({\textbf {x}}_{ij}^{(l)}){}^{\textrm{T}}{\widehat{\varvec{\beta }}}_l)-g_l(({\textbf {x}}_{ij}^{(l)}){}^{\textrm{T}}\varvec{\beta }_l)\}^2},\ \ l=1,2,\end{aligned}$$

where \(\widehat{g}_l^{(s)}(\cdot )\) is the spline approximation to \(g_l(\cdot )\) in the sth simulation run. We report the average of the IMSE, \(\textrm{AIMSE}=\frac{1}{2}\sum _{l=1}^2 \textrm{IMSE}(g_l)\), in Table 2. For evaluating the estimation performance for the scatter matrix \(\varvec{\Sigma }\) (corresponding to the bivariate responses) and the covariance matrix \(\varvec{\Omega }\) (for the random effects), we use the Frobenius norm of the matrix of differences between the estimated and true values, i.e., \(\Vert {\textbf {A}}\Vert _F=\sqrt{\textrm{trace}({\textbf {A}}{}^{\textrm{T}}{\textbf {A}})}\), where \({\textbf {A}}\) is either \(\widehat{\varvec{\Sigma }}-\varvec{\Sigma }\) or \({\widehat{\varvec{\Omega }}}-\varvec{\Omega }\). Simulation results, together with the root mean squared error (RMSE) for \(\varvec{\beta }_1\), \(\varvec{\beta }_2\) and \(\varvec{\gamma }\), are listed in Table 2, where the RMSE for an arbitrary parameter \(\varvec{\delta }\) is defined as \(\textrm{RMSE}_{\varvec{\delta }}=\sqrt{({\widehat{\varvec{\delta }}}-\varvec{\delta }){}^{\textrm{T}}({\widehat{\varvec{\delta }}}-\varvec{\delta })}\). It is clear from Table 2 that the finite-sample performance of our proposed estimation procedure improves with increasing sample size. In sum, the simulation results show that the index parameters, the nonparametric functions, and the other parameters associated with the mixed-effects model are all reliably estimated, confirming that our proposed algorithm works well in synthetic data settings.
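Both error summaries reduce to one-liners; a sketch matching the definitions above:

```python
import numpy as np

def frob_err(A_hat, A):
    """Frobenius norm of the estimation error matrix A_hat - A."""
    D = A_hat - A
    return np.sqrt(np.trace(D.T @ D))

def rmse(d_hat, d):
    """Per-replication RMSE as defined in the text:
    sqrt((d_hat - d)^T (d_hat - d)), i.e. the Euclidean error norm."""
    e = np.asarray(d_hat) - np.asarray(d)
    return np.sqrt(e @ e)
```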

Table 2 Table entries are the averages of the IMSE (AIMSE), the Frobenius-norms for \(\varvec{\Sigma }\) and \(\varvec{\Omega }\), and the root of mean squared errors (RMSE) of the model parameters, under various sample sizes \((n = 50, 100, 200)\), calculated over 400 replications, corresponding to Simulation 1

5.3 Simulation 2: Assessing Robustness in Light of Competing Methods

Here, the data is generated similarly to Simulation 1 (from a BV-SIM), except that the random effects and errors are independently generated under the following four distributional assumptions:

  1. Case 1:

    \({\textbf {b}}_i \sim N(\varvec{0}, \varvec{\Omega }), \ \ \varvec{\epsilon }_{ij} \sim N(\varvec{0}, \varvec{\Sigma })\);

  2. Case 2:

    \({\textbf {b}}_i \sim t(\varvec{0},\varvec{\Omega }, v), \ \ \varvec{\epsilon }_{ij} \sim t(\varvec{0},\varvec{\Sigma },v)\);

  3. Case 3:

    \({\textbf {b}}_i \sim \textrm{SAL}_{4}({\varvec{0}}, \varvec{\Omega }, \varvec{0}), \ \ \varvec{\epsilon }_{ij} \sim \textrm{SAL}_{2}({\varvec{0}}, \varvec{\Sigma }, \varvec{0})\);

  4. Case 4:

    \({\textbf {b}}_i \sim 0.8 N(\varvec{0}, \varvec{\Omega })+0.2N(\varvec{0}, 10 \varvec{\Omega }), \ \ \varvec{\epsilon }_{ij} \sim 0.8 N(\varvec{0}, \varvec{\Sigma })+0.2 N(\varvec{0}, 10\varvec{\Sigma })\),

for \(i=1,\cdots ,n, \ \ j=1,\cdots , m_i\).

Here, Case 1 corresponds to random effects and errors independently generated from the multivariate normal distribution. For Case 2, both are generated from the multivariate t-distribution with degrees of freedom v (setting \(v=5\)). For Case 3, the random effects and errors are generated from the multivariate symmetric Laplace distribution with covariance matrices \(\varvec{\Omega }\) and \(\varvec{\Sigma }\), respectively. Finally, Case 4 corresponds to generating both random terms (effects and errors) from multivariate normal mixtures. Note that, for the above four cases, the bivariate clustered response is symmetric, since both the random effects and errors are generated from symmetric distributions. This makes our approach comparable to the following two existing alternatives: (a) the bivariate normal mixed-effects single-index model of Wu and Tu (2016), and (b) a bivariate mixed-effects single-index model using the multivariate t-distribution, which extends the univariate linear mixed model proposal of Pinheiro et al. (2001). In (a), penalized splines were used to approximate the nonparametric index function, whereas we use polynomial splines. At each replication, we use the same dataset to obtain the estimates from these three competing methods. We focus on the estimation of the index parameters and the index functions for the fixed-effect part, which carry the same interpretation in all cases.
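For reference, the Case 4 contaminated-normal draws can be generated as follows (a hedged sketch; `prob` is the mixing proportion 0.8 and `infl` the variance inflation factor 10):

```python
import numpy as np

def rcontam_normal(n, Sigma, prob=0.8, infl=10.0, rng=None):
    """Draw n variates from the contaminated normal of Case 4:
    N(0, Sigma) with probability `prob`, else N(0, infl * Sigma)."""
    rng = np.random.default_rng() if rng is None else rng
    L = np.linalg.cholesky(Sigma)
    Z = rng.standard_normal((n, Sigma.shape[0])) @ L.T
    # per-row scale: 1 for the clean component, sqrt(infl) for the outliers
    scale = np.where(rng.random(n) < prob, 1.0, np.sqrt(infl))
    return Z * scale[:, None]
```

The marginal covariance of these draws is \((0.8 + 0.2\times 10)\varvec{\Sigma } = 2.8\,\varvec{\Sigma }\), so the contamination inflates both tails and spread without introducing skewness.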

Table 3 Table entries are the root of mean squared errors (RMSE) of \(\varvec{\beta }_1\) and \(\varvec{\beta }_2\), and the Average Integrated Mean Squared Error (AIMSE) from our model and the 2 competing models (Wu and Pinheiro), for \(n = 50, 100, 200\), with data generated from the 4 cases described in Sect. 5.3

The results are summarized in Table 3. For all cases, the RMSEs and AIMSEs decrease quickly with increasing sample size for all three methods. That said, our proposed method performs well in all four cases, and is significantly better than both alternatives in Cases 3 and 4. The advantages of our method appear more prominent if we further reduce the mixing proportion of the mixture distribution in Case 4 from 0.8 to 0.7, 0.6 or 0.5 (results not reported here). In Cases 1 and 2, the performance of our method is comparable to that of the other two. In particular, our method performs almost identically to Pinheiro's t-distribution method in Case 2 when \(n=200\), while both are better than the normal mixed-effects method of Wu and Tu (2016). To summarize, the performance of our proposed method appears satisfactory in all cases, and is robust to misspecified (non-Gaussian) random effects and errors under a bivariate mixed model framework.

6 Application: GAAD Dataset

In this section, we illustrate our method via application to the GAAD dataset. Here, the tooth-level mean PPD and CAL measures are non-Gaussian bivariate responses representing PD status, and our objective is to evaluate the distribution of PD status for this population, and to quantify the effects of various subject-level covariates, such as Age (in years), body mass index (BMI), Gender (\(1 = \textrm{Female}, 0 = \textrm{Male}\)), Smoking status (\(1 = \textrm{Smoker}, 0 = \mathrm{Never \ Smoker}\)) and glycemic level or HbA1c (\(1 = \mathrm{High/Uncontrolled}, 0 = \textrm{Controlled}\)), on the PD status. For our analysis, we have \(n = 288\) subjects with complete covariate information. About 30% of the subjects are smokers. The mean age of the subjects is about 54 years, with a range of 26–87 years. There is a predominance of female subjects (around 76%) in the data. Around 60% of the subjects are obese (BMI \(\ge 30\)), and 59% have uncontrolled HbA1c. Each subject has a varying number of teeth, ranging from 3 to 28, with a total of 5461 observations. A full dentition constitutes 28 teeth; however, missing teeth are very common in oral health studies, with the actual cause of missingness mostly unknown. Hence, to avoid unverifiable missing data assumptions, we did not resort to missing data analysis, and present only a complete-case analysis.

Fig. 2
figure 2

Bivariate kernel density estimate (left panel) and boxplots (right panel) for PPD and CAL responses, from the GAAD data

As part of the exploratory analysis, we present the bivariate kernel density estimate of the PPD and CAL responses in Fig. 2 (left panel). The plot reveals significant (right) skewness for both responses. Also, the right panel in Fig. 2 indicates the presence of possible outliers. Recent research (Zhao et al. 2018) confirmed a possible non-linear relationship between oral health responses and continuous covariates, such as Age. Motivated by this, we set out to estimate a clinically meaningful single-index structure determining PD for the subjects in this database.

Table 4 Estimates of the index parameters, the skewness parameter and their 95% confidence intervals, corresponding to the PPD and CAL responses from the GAAD study

We consider fitting the following model to the GAAD data

$$\begin{aligned}\left\{ \begin{array}{l} \textrm{PPD}_{ij}=g_1({\textbf {x}}_{ij}{}^{\textrm{T}}\varvec{\beta }_1)+{\textbf {z}}_{ij}{}^{\textrm{T}}{\textbf {b}}_{i1}+\epsilon _{ij1},\\ \\ \textrm{CAL}_{ij}=g_2({\textbf {x}}_{ij}{}^{\textrm{T}}\varvec{\beta }_2)+{\textbf {z}}_{ij}{}^{\textrm{T}}{\textbf {b}}_{i2}+\epsilon _{ij2}, \end{array} \right. i=1,\ldots , 288, j=1,\ldots ,m_i, \end{aligned}$$

where \({\textbf {x}}_{ij}=(x_{ij1},\ldots ,x_{ij5}){}^{\textrm{T}}\) with \(x_{ij1}=\) Age, \(x_{ij2}=\) BMI, \(x_{ij3}=\) Gender, \(x_{ij4}=\) Smoker, \(x_{ij5}=\) HbA1c, and \({\textbf {z}}_{ij}=(1,z_{ij1},z_{ij2},z_{ij3}){}^{\textrm{T}}\) with \(z_{ij1}=\) Gender, \(z_{ij2}=\) Smoker, \(z_{ij3}=\) HbA1c. We further assume \({\textbf {b}}_i=({\textbf {b}}_{i1}{}^{\textrm{T}},{\textbf {b}}_{i2}{}^{\textrm{T}}){}^{\textrm{T}}\sim \textrm{SAL}_8(\varvec{0},\varvec{\Omega },\varvec{0})\) and \(\varvec{\epsilon }_{ij}=(\epsilon _{ij1},\epsilon _{ij2}){}^{\textrm{T}}\sim \textrm{SAL}_2(\varvec{0},\varvec{\Sigma },\varvec{\gamma })\). The estimates of the index parameters and the skewness parameter, along with their 95% confidence intervals, are presented in Table 4, where the confidence intervals are obtained by bootstrap resampling with 200 replications. We observe that all parameter estimates, except that of \(\beta _{13}\) (corresponding to Gender in the PPD regression), are positive and significant. Interestingly, the estimate of Gender (\(\beta _{13}\)) is negative yet significant for PPD, while the corresponding estimate (\(\beta _{23}\)) for CAL is positive and significant, implying that Gender contributes to the index development for the two responses in opposite directions. Figure 3 presents the estimated curves corresponding to the two index functions, along with their 95% bootstrap confidence bands. The 95% band is tighter for PPD than for CAL.
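The bootstrap intervals above are computed at the subject (cluster) level, so that within-mouth dependence is preserved under resampling. A generic sketch with a hypothetical `estimator` callable standing in for a full model refit:

```python
import numpy as np

def cluster_bootstrap_ci(clusters, estimator, B=200, level=0.95, rng=None):
    """Percentile bootstrap CIs respecting a clustered design:
    subjects (clusters) are resampled with replacement.  `clusters` is a
    list of per-subject data objects and `estimator` a hypothetical
    callable returning a parameter estimate from a list of clusters."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(clusters)
    stats = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)           # resample subjects
        stats.append(estimator([clusters[i] for i in idx]))
    stats = np.asarray(stats)
    alpha = (1.0 - level) / 2.0
    return np.quantile(stats, [alpha, 1.0 - alpha], axis=0)
```

Resampling whole subjects, rather than individual teeth, keeps each bootstrap replicate a valid draw from the clustered data-generating mechanism.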

It is immediate that the correlation between PPD and CAL is significant, implying the need to account for both the crosswise correlation between the two responses and the cluster-wise correlation of the responses within the same subject, while modeling the bivariate clustered responses. Furthermore, Fig. 4 presents the bivariate kernel density surface of the estimated residuals (left panel), and the same from random draws of \(n=5461\) observations from the bivariate ALD density \(\textrm{ALD}({\widehat{\varvec{\Sigma }}},{\widehat{\varvec{\gamma }}})\), where \({\widehat{\varvec{\Sigma }}}\) and \({\widehat{\varvec{\gamma }}}\) are plugged-in estimates derived from our fit. We observe that the estimated surfaces are very similar, confirming the adequacy of the model fit to the GAAD dataset.

Fig. 3
figure 3

Estimated curves for the two index functions \(\widehat{g}_1\) and \(\widehat{g}_2\), along with the 95% confidence bands. The left and right panels correspond to PPD and CAL regressions, respectively

The matrices \(\varvec{\Sigma }\) and \(\varvec{\Omega }\) are estimated as:

$$\begin{aligned}\widehat{\varvec{\Sigma }}=\left( \begin{array}{ll} 1.2429 &{}0.7937\\ 0.7937 &{}0.9024 \end{array}\right) \end{aligned}$$

and

$$\begin{aligned}{\widehat{\varvec{\Omega }}}=\left( \begin{array}{cccccccc} 1.6589 &{}-0.0089&{} -0.0461&{} -0.2792&{} 1.5780&{} -0.1832&{} -0.1815&{} -0.5760 \\ -0.0089 &{} 0.8797&{} -0.4081&{} 0.1553&{} -0.1685&{} 0.5379&{} -0.0466&{} 0.4289 \\ -0.0461 &{}-0.4081&{} 0.8423&{} 0.3273&{} -0.0808&{} 0.1296&{} 0.1931&{} 0.1264 \\ -0.2792 &{} 0.1553&{} 0.3273&{} 0.7782&{} -0.4164&{} 0.3802&{} 0.1290&{} 0.6585 \\ 1.5780 &{}-0.1685&{} -0.0808&{} -0.4164&{} 2.1987&{} -0.8975&{} -0.4840&{} -0.8462 \\ -0.1832 &{} 0.5379&{} 0.1296&{} 0.3802&{} -0.8975&{} 1.0517&{} 0.3364&{} 0.6420 \\ -0.1815 &{}-0.0466&{} 0.1931&{} 0.1290&{} -0.4840&{} 0.3364&{} 0.2016&{} 0.1681 \\ -0.5760 &{} 0.4289&{} 0.1264&{} 0.6585&{} -0.8462&{} 0.6420&{} 0.1681&{} 0.8158 \end{array}\right) . \end{aligned}$$
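The cross-response dependence between the PPD and CAL error terms can be read directly off \({\widehat{\varvec{\Sigma }}}\) by standardizing the off-diagonal entry by the diagonal ones. A quick check of the implied correlation:

```python
import numpy as np

Sigma_hat = np.array([[1.2429, 0.7937],
                      [0.7937, 0.9024]])

# correlation implied by the estimated scatter matrix
rho = Sigma_hat[0, 1] / np.sqrt(Sigma_hat[0, 0] * Sigma_hat[1, 1])
print(round(rho, 3))  # -> 0.749
```

The implied correlation of about 0.75 is consistent with the strong PPD–CAL association noted earlier.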
Fig. 4
figure 4

Plots of bivariate kernel density estimates from model residuals (left panel), and from random draws of \(n=5461\) observations following \(\textrm{ALD}({\widehat{\varvec{\Sigma }}},{\widehat{\varvec{\gamma }}})\)

To further evaluate the usefulness of the proposed model, we consider the fitted and prediction errors in light of two alternatives, denoted as “AM1” (bivariate normal, mixed-effects SIM) and “AM2” (bivariate, asymmetric Laplace SIM, without random effects). We randomly partition the data into training and test sets, where the training data is used to fit the three models, and the test data to evaluate the prediction errors. Using varying sizes of training and test data, the average absolute fitted errors (AAFE) and the average absolute prediction errors (AAPE) for the two responses, based on 200 random partitions, are reported in Table 5, where

$$\begin{aligned}\textrm{AAFE}_k=\frac{1}{\sum _{i=1}^{nb}m_i}\sum _{i=1}^{nb}\sum _{j=1}^{m_i}|y_{ijk}-\widehat{y}_{ijk}|\end{aligned}$$

and

$$\begin{aligned}\textrm{AAPE}_k=\frac{1}{\sum _{i=1}^{n-nb}m_i}\sum _{i=1}^{n-nb}\sum _{j=1}^{m_i}|y_{ijk}-\widetilde{y}_{ijk}|,\end{aligned}$$

for \(k=1\) and 2, where \(\widehat{y}_{ijk}\) is the fitted value based on the training data, \(\widetilde{y}_{ijk}\) is the predicted value for the test data, and nb denotes the number of subjects in the training data.
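Both AAFE and AAPE are mean absolute deviations pooled over subjects and teeth; only the data blocks (training vs. test) and the type of predicted value differ. A minimal sketch for one response \(k\), with ragged per-subject observation vectors:

```python
import numpy as np

def mean_abs_error(y_blocks, yhat_blocks):
    """Pooled mean absolute error:
    sum_i sum_j |y_ij - yhat_ij| / sum_i m_i,
    where each block holds the m_i observations of one subject."""
    num = sum(np.abs(np.asarray(y) - np.asarray(yh)).sum()
              for y, yh in zip(y_blocks, yhat_blocks))
    den = sum(len(y) for y in y_blocks)
    return num / den
```

Calling this on the training subjects with fitted values gives \(\textrm{AAFE}_k\); calling it on the test subjects with predicted values gives \(\textrm{AAPE}_k\).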

Table 5 Average absolute fitted and prediction errors for our model and 2 competing models (AM1 and AM2), for the PPD and CAL responses in the GAAD data, based on 200 random partitions

From Table 5, we observe that our model performs the best in terms of AAFE and AAPE, for various sizes of the training and test sets. More specifically, our proposed mixed-effects SIM is superior to the bivariate asymmetric Laplace SIM (excluding random effects), underscoring the need to account for the within-subject correlation. Furthermore, our proposed model is also better than the SIM with the usual multivariate normal specification for the random effects, thereby providing evidence of the gain from accounting for data asymmetry during modeling.

7 Conclusions

Derivation of useful medical indices that correlate with multiple health outcomes is an issue of significant practical importance. In this paper, we propose a single-index mixed-effects regression model for bivariate responses, where both the error terms and random effects are assumed to follow multivariate asymmetric Laplace distributions. Using polynomial spline smoothing for the index functions, we propose a scalable ML estimation method based on an EM-type algorithm, and study the asymptotic properties of the ML estimates under mild conditions. Simulations and real data analysis reveal the potential of the proposed model under data asymmetry, compared to existing alternatives.

There are a number of future directions to pursue. To further improve model fit and prediction, we can consider the joint modeling of the location, skewness, and scatter matrix within a multivariate ALD setup. When the number of covariates is large in both the fixed and random effects, it is of interest to select important variables in both parts to obtain a parsimonious model. Some variable selection methods for linear mixed-effects models are available in the univariate response case; see, for example, Kinney and Dunson (2010); Bondell et al. (2010); Fan and Li (2012); Schelldorfer and Geer (2011); Pan and Huang (2014), and others. However, for single-index mixed-effects models with multivariate responses, there is limited work, and pursuing variable selection there is a non-trivial undertaking. Another extension is to consider mixed-effects quantile regression (Waldmann and Kneib 2015) for bivariate responses. These will be pursued elsewhere.