Mixed-effects location-scale model based on generalized hyperbolic distribution

Motivated by the need for better modeling of intra-individual variability in longitudinal data, we propose a class of location-scale mixed-effects models in which the data of each individual are modeled by a parameter-varying generalized hyperbolic distribution. We first study the local maximum-likelihood asymptotics and reveal the instability in the numerical optimization of the log-likelihood. We then construct an asymptotically efficient estimator by applying a single Newton–Raphson step to the original log-likelihood function, with a naive least-squares-type initial estimator. Numerical experiments show that the proposed one-step estimator is not only theoretically efficient but also numerically much more stable and much less time-consuming than the maximum-likelihood estimator.


Introduction
The key step in the population approach [14] is modeling the dynamics of many individuals by introducing a flexible probabilistic structure for the random vector $Y_i = (Y_i(t_{ij}))_{j=1}^{n_i} \in \mathbb{R}^{n_i}$ representing the (univariate) time-series data from the $i$th individual. Here $t_{i1} < \cdots < t_{i n_i}$ denote the sampling times, which may vary across individuals, with possibly different $n_i$ for $i = 1, \dots, N$. The model should be tractable from both theoretical and computational points of view.
In the classical linear mixed-effects model [13], the target variable $Y_i \in \mathbb{R}^{n_i}$ is described by
$$Y_i = X_i\beta + Z_i b_i + \epsilon_i \qquad (1.1)$$
for $i = 1, \dots, N$, where the explanatory variables $X_i \in \mathbb{R}^{n_i} \otimes \mathbb{R}^{p}$ and $Z_i \in \mathbb{R}^{n_i} \otimes \mathbb{R}^{q}$ are known design matrices, and where $\{b_i\}$ and $\{\epsilon_i\}$ are mutually independent centered i.i.d. sequences with covariance matrices $G \in \mathbb{R}^{q} \otimes \mathbb{R}^{q}$ and $H_i \in \mathbb{R}^{n_i} \otimes \mathbb{R}^{n_i}$, respectively; typical examples of $H_i = (H_{i,kl})$ include $H_i = \sigma^2 I_{n_i}$ ($I_m$ denotes the $m$-dimensional identity matrix) and $H_{i,kl} = \sigma^2 \rho^{|k-l|}$ with $\rho$ denoting the correlation coefficient. Although the model (1.1) is quite popular in studying longitudinal data, it is not adequate for modeling intra-individual variability: formally, for each $i$ the conditional covariance of $Y_i$ given $b_i$ does not depend on $b_i$. The model is therefore not suitable if one wants to incorporate a random effect across individuals into the covariance and into higher-order structures such as skewness and kurtosis.
1.1. Mixed-effects location-scale model. Let us briefly review the previous study that motivated the present one. The paper [9] introduced a variant of (1.1), called the mixed-effects location-scale (MELS) model, for analyzing ecological momentary assessment (EMA) data; the MELS model was further studied in [10], [8], and [11] from applied and computational points of view. EMA is also known as the experience sampling method: it is not retrospective, and the individuals are required to answer immediately after an event occurs. Modern EMA data in mental health research are longitudinal, typically consisting of possibly irregularly spaced sampling times for each patient. To avoid the so-called "recall bias" of retrospective self-reports from patients, the EMA method records many events in daily life at the moment of their occurrence. The primary interest is modeling both between- and within-subject heterogeneities, hence one is naturally led to incorporate random effects into both the trend and the scale structures. We refer to [16] for detailed information on EMA data.
In the MELS model, the $j$th sample $Y_{ij}$ from the $i$th individual is given by
$$Y_{ij} = x_{ij}^\top\beta + e^{z_i^\top\alpha/2}\,\epsilon_{1,i} + e^{(w_{ij}^\top\tau + \sigma_w \epsilon_{2,i})/2}\,\epsilon_{3,ij} \qquad (1.2)$$
for $1 \le j \le n_i$ and $1 \le i \le N$. Here, $(x_{ij}, z_{ij}, w_{ij})$ are non-random explanatory variables, $(\epsilon_{1,i}, \epsilon_{2,i})$ denote the i.i.d. random effects, with standard normal marginals and correlation $\rho$, and $\epsilon_{3,ij}$ denote the driving noises for each $i \le N$, with $\epsilon_{3,i1}, \dots, \epsilon_{3,i n_i} \sim$ i.i.d. $N(0, 1)$ and $(\epsilon_{1,i}, \epsilon_{2,i})$ and $(\epsilon_{3,ij})_{j \le n_i}$ mutually independent. Direct computations give the following expressions: $E[Y_{ij}] = x_{ij}^\top\beta$, $\mathrm{Var}[Y_{ij}] = \exp(w_{ij}^\top\tau + \sigma_w^2/2) + \exp(z_i^\top\alpha)$, and $\mathrm{Cov}[Y_{ik}, Y_{il}] = \exp(z_i^\top\alpha)$ for $k \ne l$; this covariance structure is to be compared with the one (2.2) of our model. Further, their conditional versions given the random-effect variable $R_i := (\epsilon_{1,i}, \epsilon_{2,i})$ are as follows: $E[Y_{ij}|R_i] = x_{ij}^\top\beta + \exp(z_i^\top\alpha/2)\epsilon_{1,i}$, $\mathrm{Var}[Y_{ij}|R_i] = \exp(w_{ij}^\top\tau + \sigma_w \epsilon_{2,i})$, and $\mathrm{Cov}[Y_{ik}, Y_{il}|R_i] = 0$ for $k \ne l$. We also note the conditional distribution
$$\mathcal{L}(Y_i \,|\, R_i) = N_{n_i}\Big(X_i\beta + e^{z_i^\top\alpha/2}\epsilon_{1,i}\mathbf{1}_{n_i},\ \mathrm{diag}\big(e^{w_{i1}^\top\tau + \sigma_w\epsilon_{2,i}}, \dots, e^{w_{i n_i}^\top\tau + \sigma_w\epsilon_{2,i}}\big)\Big),$$
where $X_i := (x_{i1}, \dots, x_{i n_i})^\top$ and $\mathbf{1}_{n_i} \in \mathbb{R}^{n_i}$ has all entries equal to $1$. Importantly, the marginal distribution $\mathcal{L}(Y_{i1}, \dots, Y_{i n_i})$ is not Gaussian. See [9] for details about the data-analysis aspects of the MELS model.
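The moment formulas above are easy to check by simulation. The following Python sketch draws from a model of the MELS form (1.2) and compares empirical moments with the stated expressions; all parameter values and the scalar covariates fixed at 1 are hypothetical, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical dimensions and parameter values (illustration only).
N, n = 20000, 5
beta, alpha, tau, sigma_w, rho = 0.6, -0.3, -0.5, 0.9, -0.3
x = z = w = 1.0   # scalar covariates fixed at 1 so moments match across j

# Random effects (eps1, eps2) ~ N2(0, [[1, rho], [rho, 1]]), one pair per subject.
cov = np.array([[1.0, rho], [rho, 1.0]])
eps12 = rng.multivariate_normal(np.zeros(2), cov, size=N)
eps1, eps2 = eps12[:, 0], eps12[:, 1]

# Within-subject noise with log-linear conditional variance.
eps3 = rng.standard_normal((N, n))
sd_within = np.exp(0.5 * (w * tau + sigma_w * eps2))[:, None]
Y = x * beta + np.exp(0.5 * z * alpha) * eps1[:, None] + sd_within * eps3

# Theoretical moments stated in the text.
var_theory = np.exp(w * tau + sigma_w**2 / 2) + np.exp(z * alpha)
cov_theory = np.exp(z * alpha)

print(Y.mean(), Y.var(), np.cov(Y[:, 0], Y[:, 1])[0, 1])
print(x * beta, var_theory, cov_theory)
```

The empirical mean, variance, and within-subject covariance should land close to $x^\top\beta$, $\exp(w^\top\tau + \sigma_w^2/2) + \exp(z^\top\alpha)$, and $\exp(z^\top\alpha)$, respectively.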
The third term on the right-hand side of (1.2) obeys a normal variance mixture whose variance-mixing distribution is log-normal, introducing so-called leptokurtosis (a heavier tail than the normal distribution). Further, the last two terms on the right-hand side enable us to incorporate skewness into the marginal distribution $\mathcal{L}(Y_{ij})$; it is symmetric around $x_{ij}^\top\beta$ if $\rho = 0$.
The optimization of the corresponding likelihood function is quite time-consuming since we need to integrate out the latent variables $(\epsilon_{1,i}, \epsilon_{2,i})$: representing $R_i$ through a two-dimensional standard normal variable, the log-likelihood function of $\theta := (\beta, \alpha, \tau, \sigma_w, \rho)$ is given by
$$\ell_N(\theta) = \sum_{i=1}^N \log \int_{\mathbb{R}^2} \phi_{n_i}\Big(Y_i;\ X_i\beta + e^{z_i^\top\alpha/2}x_1\mathbf{1}_{n_i},\ \mathrm{diag}\big(e^{w_{i1}^\top\tau + \sigma_w(\rho x_1 + \sqrt{1-\rho^2}\,x_2)}, \dots, e^{w_{i n_i}^\top\tau + \sigma_w(\rho x_1 + \sqrt{1-\rho^2}\,x_2)}\big)\Big)\,\phi_2(x_1, x_2; 0, I_2)\,dx_1\,dx_2, \qquad (1.3)$$
where $w_i := (w_{ij})_{j \le n_i}$, $z_i := (z_{ij})_{j \le n_i}$, and $\phi_m(\cdot; \mu, \Sigma)$ denotes the $m$-dimensional $N(\mu, \Sigma)$-density. Just for reference, we present a numerical experiment in R for computing the maximum-likelihood estimator (MLE). We set $N = 1000$ and $n_1 = \cdots = n_{1000} = 10$ and generated $x_{ij}, z_{ij}, w_{ij} \sim$ i.i.d. $N_2(0, I_2)$ independently; the target parameter is then 8-dimensional. The true values were set as follows: $\beta = (0.6, -0.2)$, $\alpha = (-0.3, 0.5)$, $\tau = (-0.5, 0.3)$, $\sigma_w = \sqrt{0.8} \approx 0.894$, and $\rho = -0.3$. The results based on a single data set are given in Table 1. It took more than 20 hours for our R code to obtain one MLE (Apple M1 Max, 64GB memory; the R function adaptIntegrate was used for the numerical integration); we also ran the simulation code for $N = 500$ and $n_1 = \cdots = n_{500} = 5$, which took about 8 hours. The program should run much faster if other software such as Fortran or MATLAB is used instead of R, but we do not pursue that direction here. Although it gives the optimizer an unrealistic advantage, the numerical search was started from the true values; it would be much more time-consuming and unstable if the initial values were far from the true ones. An EM-algorithm-type approach for handling the latent variables would work at least numerically, while it too is expected to be time-consuming even if a specific numerical recipe is available. Advanced tools for numerical integration would help to some extent, but we do not pursue them here.

1.2. Our objective. In this paper, we propose an alternative, computationally much simpler way of jointly modeling the mean and within-subject variance structures. Specifically, we construct a class of
parameter-varying models based on the univariate generalized hyperbolic (GH) distribution and study its theoretical properties. The model can be seen as a special case of inhomogeneous normal variance-mean mixtures and may serve as an alternative to the MELS model; see Section A for a summary of the GH distributions. Recently, this family has received attention for modeling non-Gaussian continuous repeated-measurement data [2], but our model is constructed from a different perspective, directly making some parameters of the GH distribution covariate-dependent.
This paper is organized as follows. Section 2 introduces the proposed model and presents the local-likelihood analysis, followed by numerical experiments. Section 3 considers the construction of a specific asymptotically optimal estimator and presents its finite-sample performance in comparison with the MLE. Section 4 gives a summary and some potential directions for future work.
The proposed model (2.1) is given by
$$Y_{ij} = x_{ij}^\top\beta + s(z_{ij}, \alpha)\,v_i + \sigma(w_{ij}, \tau)\sqrt{v_i}\,\epsilon_{ij} \qquad (2.1)$$
for $j = 1, \dots, n_i$ and $i = 1, \dots, N$, where $x_{ij} \in \mathbb{R}^{p_\beta}$, $z_{ij} \in \mathbb{R}^{p_\alpha}$, and $w_{ij} \in \mathbb{R}^{p_\tau}$ are given non-random explanatory variables; $v_1, \dots, v_N \sim$ i.i.d. $\mathrm{GIG}(\lambda, \delta, \gamma)$, where GIG refers to the generalized inverse Gaussian distribution (see Section A); and the $\epsilon_{ij}$ are i.i.d. standard normal, independent of $(v_i)$. As mentioned in the introduction, for (2.1) one may think of the corresponding continuous-time model without system noise, where $t_{ij}$ denotes the $j$th sampling time for the $i$th individual. We will write $Y_i := (Y_{ij})_{j \le n_i}$ and so on for $i = 1, \dots, N$, and also $\theta := (\beta, \alpha, \tau, \lambda, \delta, \gamma) \in \Theta$, where $\Theta$ is supposed to be a convex domain and $p := p_\beta + p_\alpha + p_\tau + 3$. We will use the notation $(P_\theta)_{\theta \in \Theta}$ for the family of distributions of $\{(Y_i, v_i, \epsilon_i)\}_{i \ge 1}$, which is completely characterized by the finite-dimensional parameter $\theta$. The associated expectation and covariance operators will be denoted by $E_\theta$ and $\mathrm{Cov}_\theta$, respectively. Let us write $s_{ij}(\alpha) = s(z_{ij}, \alpha)$ and $\sigma_{ij}(\tau) = \sigma(w_{ij}, \tau)$. For each $i \le N$, the variables $Y_{i1}, \dots, Y_{i n_i}$ are $v_i$-conditionally independent and normally distributed under $P_\theta$, with
$$\mathrm{Cov}_\theta[Y_{ij}, Y_{ik}] = s_{ij}(\alpha)s_{ik}(\alpha)\,\mathrm{Var}_\theta[v_i] + \mathbf{1}_{\{j = k\}}\,\sigma_{ij}(\tau)^2\,E_\theta[v_i]. \qquad (2.2)$$
The marginal distribution $\mathcal{L}(Y_{i1}, \dots, Y_{i n_i})$ is the multivariate GH distribution; a more flexible dependence structure could be incorporated by introducing a non-diagonal scale matrix (see Section 4 for a formal explanation). By the definition of the GH distribution, the variables $Y_{ij}$ and $Y_{ik}$ may be uncorrelated for some $(z_{ij}, \alpha)$ while they cannot be mutually independent. We can explicitly write down the log-likelihood function of $(Y_1, \dots, Y_N)$ as in (2.3), where $\sum_{i,j}$ is shorthand for $\sum_{i=1}^N \sum_{j=1}^{n_i}$. The detailed calculation is given in Section B.1.
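The normal variance-mean mixture structure of (2.1) is easy to simulate, which also makes the moment formulas concrete. In the Python sketch below (SciPy; all parameter values and the constant covariate effects are hypothetical), SciPy's `geninvgauss(p, b, scale)` is matched to $\mathrm{GIG}(\lambda, \delta, \gamma)$ via $p = \lambda$, $b = \delta\gamma$, and $\mathrm{scale} = \delta/\gamma$, and the first two moments of the mixture are checked empirically.

```python
import numpy as np
from scipy.stats import geninvgauss

rng = np.random.default_rng(2)

# Sketch of the normal variance-mean mixture (2.1):
#   Y = x'beta + s(z, alpha) * v + sigma(w, tau) * sqrt(v) * eps,
# with v ~ GIG(lam, delta, gamma). All values below are hypothetical.
lam, delta, gamma = -0.5, 1.0, 1.5
s, sig, xb = 0.8, 1.2, 0.6       # s(z, alpha), sigma(w, tau), x'beta as constants

# scipy's geninvgauss(p, b, scale) equals GIG(lam, delta, gamma) with
# p = lam, b = delta * gamma, scale = delta / gamma.
gig = geninvgauss(lam, delta * gamma, scale=delta / gamma)

N = 200000
v = gig.rvs(size=N, random_state=rng)
Y = xb + s * v + sig * np.sqrt(v) * rng.standard_normal(N)

# Theoretical moments of the mixture: mean xb + s E[v],
# variance s^2 Var[v] + sig^2 E[v], matching the covariance structure (2.2).
mean_theory = xb + s * gig.mean()
var_theory = s**2 * gig.var() + sig**2 * gig.mean()
print(Y.mean(), mean_theory, Y.var(), var_theory)
```

Note how the mixing variable $v$ enters both the location and the scale, which is exactly why the model has no conventional location-scale structure (see Section 3.1).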
To deduce the asymptotic property of the MLE, there are two typical routes: the global- and the local-consistency arguments. In the present inhomogeneous model, where the variables $(x_{ij}, z_{ij}, w_{ij})$ are non-random, the two asymptotics have different features: on the one hand, the global-consistency route generally entails rather messy regularity conditions, as was detailed in the previous study [7], while yielding theoretically stronger global claims; on the other hand, the local route only guarantees the existence of good local maxima of $\ell_N(\theta)$, while requiring much weaker regularity conditions local around $\theta_0$.
For a domain $A$, let $C^k(A)$ denote the set of real-valued $C^k$-class functions whose $l$th partial derivatives ($0 \le l \le k$) admit continuous extensions to the boundary of $A$. The asymptotic symbols are used for $N \to \infty$ unless otherwise mentioned.

Assumption 2.1.
We are going to prove the local asymptotics of the MLE by applying the general result [17, Theorems 1 and 2].
Under Assumption 2.1 and using basic facts about the Bessel function $K_\cdot(\cdot)$ (see Section A), we can find a compact neighborhood $B_0 \subset \Theta$ of $\theta_0$ on which the required uniform bounds hold. Note that $\min\{\delta, \gamma\} > 0$ inside $B_0$.
Let $M^{\otimes 2} := M M^\top$ for a matrix $M$, denote by $\lambda_{\max}(M)$ and $\lambda_{\min}(M)$ the largest and smallest eigenvalues of a square matrix $M$, and by $\partial_\theta^k$ the $k$th-order partial-differentiation operator with respect to $\theta$. Write $\ell_N(\theta)$ for the right-hand side of (2.3). Then, the independence across individuals yields the basic moment bounds; just for reference, the specific forms of $\partial_\theta \ell_N(\theta)$ and $\partial_\theta^2 \ell_N(\theta)$ are given in Section B.2. Further, by differentiating $\theta \mapsto \partial_\theta^2 \ell_N(\theta)$ and recalling Assumption 2.1, it can be seen that the corresponding estimates hold for $m = 1, 2$, together with the associated $\limsup$ condition. These moment estimates will be used later on; unlike the global-asymptotic study [7], we do not need the explicit form of $\partial_\theta^2 \ell_N(\theta)$. We additionally assume the diverging-information condition, which is inevitable for consistent estimation. Under Assumption 2.1, we may and do suppose that the matrix $A_N(\theta)$ is well-defined, where $M^{1/2}$ denotes the symmetric positive-definite square root of a positive definite matrix $M$. We also have $\sup_{\theta \in B_0} |A_N(\theta)|^{-1} N^{-1/2} \to 0$. This $A_N(\theta)$ will serve as the norming matrix of the MLE; see Remark 2.5 below for Studentization. Further, the standard argument through the Lebesgue dominated convergence theorem ensures the required continuity. For $c > 0$, Assumption 2.2 yields the corresponding convergence; here the last convergence holds since the function $\theta \mapsto N^{-1/2} A_N(\theta)$ is uniformly continuous over $B_0$.
Define the normalized observed information accordingly. Then, (2.6) ensures its convergence, followed by the property that, for all $\epsilon > 0$, the associated supremum tends to zero, and we obtain:

(1) For any bounded sequence of normalized parameter perturbations, the LAN-type expansion holds.
(2) There exists a local maximum point $\hat\theta_N$ of $\ell_N(\theta)$ with $P_{\theta_0}$-probability tending to 1, for which the normalized estimation errors are regular and asymptotically efficient in the sense of Hájek–Le Cam. See [3] and [12] for details.
Remark 2.5 (Studentization of (2.10)). Here is a remark on the construction of approximate confidence sets. Define the statistic $\hat{A}_N$ as in (2.12). Then, to make inferences about $\theta_0$, we can use the distributional approximations (2.13) with $\hat{A}_N(\hat\theta_N)$. To see this, it is enough to show (2.14) under $P_{\theta_0}$. We have $\sqrt{N}(\hat\theta_N - \theta_0) = O_p(1)$ by Theorem 2.3 and Assumption 2.2. This, together with the Burkholder inequality and (2.6), yields the required estimate, concluding (2.14). Note that, instead of (2.12), we may also use the square root of the observed information matrix, concluding the same weak convergence as in (2.13). In our numerical experiments, we made use of this variant, $\hat{A}_{2,N}$, for computing the confidence intervals and the empirical coverage probabilities. The elements of $\hat{A}_{2,N}$ are explicit, though rather lengthy: see Section B.2.
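As a toy illustration of this Studentization, the Python sketch below builds Wald-type 95% intervals from the inverse observed information, the same construction as with $-\partial_\theta^2\ell_N(\hat\theta_N)$ in our experiments. It uses a hypothetical i.i.d. Gaussian model $N(m, s^2)$ so that the MLE and the observed information are available in closed form.

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(1.0, 2.0, size=4000)   # hypothetical data
n = y.size

# Closed-form Gaussian MLE of (m, s2).
m_hat = y.mean()
s2_hat = y.var()
theta_hat = np.array([m_hat, s2_hat])

# Observed information -d^2 l(theta_hat) of the Gaussian log-likelihood,
# evaluated at the MLE (diagonal in this parameterization).
obs_info = np.array([
    [n / s2_hat, 0.0],
    [0.0, n / (2 * s2_hat**2)],
])

# Studentized (Wald) intervals: theta_hat +/- z_{0.975} * sqrt(diag(info^{-1})).
se = np.sqrt(np.diag(np.linalg.inv(obs_info)))
z975 = 1.959963984540054
ci = np.stack([theta_hat - z975 * se, theta_hat + z975 * se], axis=1)
print(ci)
```

For the GH model the same recipe applies verbatim, with the analytic Hessian of (2.3) replacing the closed-form matrix above.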
Remark 2.6 (Misspecification). In addition to the linear form $x_{ij}^\top\beta$ in (2.1), misspecification of the parametric forms of the functions $s(z_{ij}, \alpha)$ and $\sigma(w_{ij}, \tau)$ is always a concern. Using M-estimation theory (see, for example, [20] and [6, Section 5]), it is possible to handle misspecified parametric forms under appropriate identifiability conditions. In that case, however, the target of maximum-likelihood estimation, say $\theta^*$, is the optimal parameter (assumed to be uniquely determined) in terms of the Kullback–Leibler divergence, and we do not have the LAN property of Theorem 2.3 in the usual sense, while an asymptotic normality result of the sandwich form $\sqrt{N}(\hat\theta_N - \theta^*) \Rightarrow N(0, \Gamma_0^{-1}\Sigma_0\Gamma_0^{-1})$ could be given, where the (non-random) matrices $\Sigma_0$ and $\Gamma_0$ are specified through the quasi-score and its derivative. Finally, we note that the statistical problem becomes non-standard if the true value of $(\delta, \gamma)$ for the GIG distribution $\mathcal{L}(v_i)$ satisfies $\delta_0 = 0$ or $\gamma_0 = 0$; we have excluded these boundary cases at the beginning of Section 2.2.

2.3. Numerical experiments.
For simulation purposes, we consider the model (2.16), whose ingredients are specified as follows.
The computation time for one MLE was about 8 minutes in case (i) and about 6 minutes in case (ii). Estimation performance for $(\lambda, \delta, \gamma)$ was less efficient than for $(\beta, \alpha, \tau)$; the unobserved nature of the GIG variables presumably makes the standard-normal approximations relatively worse. It is worth mentioning that case (ii) shows better normal approximations, in particular for $(\lambda, \delta, \gamma)$; case (ii) would be simpler in the sense that the data from each individual have similar trend (mean) structures.
Table 2 shows the empirical 95% coverage probability for each parameter in both (i) and (ii), based on the confidence intervals for $\hat\theta^{(k)}_N$, $k = 1, \dots, 9$, where $\hat\theta_N =: (\hat\theta^{(k)}_N)_{k \le 9}$ and $\alpha = 0.05$. We had 365 and 65 numerically unstable cases among the 1000 trials, respectively (mostly caused by a degenerate $\det(-\partial_\theta^2\ell_N(\hat\theta_N))$); the coverage probabilities were therefore computed from the remaining cases. Let us note the crucial problem in the above Monte Carlo trials: the objective log-likelihood is highly non-concave, hence, as usual, the numerical optimization suffers from initial-value and local-maxima problems. Here is a numerical example based on a single data set with $N = 1000$ and $n_1 = \cdots = n_{1000} = 10$ as before. The same model as in (2.16), with the same subsequent settings, was used, except that $\lambda = -1/2$ was taken as known from the beginning, so that the latent variables $v_1, \dots, v_N$ have the inverse-Gaussian population $\mathrm{IG}(\delta, \gamma) = \mathrm{GIG}(-1/2, \delta, \gamma)$. For the true parameter values specified in Table 3, we ran two choices of initial values for the numerical optimization. The results in Table 3 clearly show that the inverse-Gaussian parameter $(\delta, \gamma)$ can be quite sensitive to a bad starting point of the numerical search. In the next section, to bypass this numerical instability, we construct easier-to-compute initial estimators and their improved versions, asymptotically equivalent to the MLE.

Asymptotically efficient estimator
Building on Theorem 2.3, we now turn to global asymptotics through the classical Newton–Raphson-type procedure. A systematic account of the theory of the one-step estimator can be found in many textbooks, such as [19, Section 5.7]. Let us briefly review the derivation in the current matrix-norming setting.
Suppose that we are given an initial estimator $\hat\theta_N^0$ which, by Theorem 2.3 and Assumption 2.2, satisfies $\sqrt{N}(\hat\theta_N^0 - \theta_0) = O_p(1)$; this is condition (3.1). We define the one-step estimator $\hat\theta_N^1$ by
$$\hat\theta_N^1 := \hat\theta_N^0 + \big(-\partial_\theta^2\ell_N(\hat\theta_N^0)\big)^{-1}\partial_\theta\ell_N(\hat\theta_N^0)$$
on the event that $-\partial_\theta^2\ell_N(\hat\theta_N^0)$ is invertible, the $P_{\theta_0}$-probability of which tends to 1. Using a Taylor expansion of $\partial_\theta\ell_N$ around $\hat\theta_N^0$ together with the arguments in Section 2.2, we obtain (3.2) and (3.3); combining them and recalling Remarks 2.4 and 2.5, we obtain the asymptotic representation (2.11) for $\hat\theta_N^1$, followed by its asymptotic standard normality and asymptotic optimality.
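The construction can be illustrated on a toy model where the MLE has no closed form. The Python sketch below uses Cauchy location (a hypothetical stand-in for $\ell_N$) with the sample median as the cheap $\sqrt{N}$-consistent initial estimator; the one-step estimator is then a single Newton update of the score, mirroring the definition of $\hat\theta_N^1$ above.

```python
import numpy as np

rng = np.random.default_rng(4)
theta0_true = 0.7
y = theta0_true + rng.standard_cauchy(20000)   # simulated Cauchy(theta) sample

def score(th):
    # d/d th of sum_i log f(y_i; th) for the Cauchy location family.
    d = y - th
    return np.sum(2 * d / (1 + d**2))

def neg_hessian(th):
    # -d^2/d th^2 of the log-likelihood (the observed information).
    d = y - th
    return np.sum(2 * (1 - d**2) / (1 + d**2) ** 2)

th_init = np.median(y)                                      # initial estimator
th_one = th_init + score(th_init) / neg_hessian(th_init)    # one Newton step
print(th_init, th_one)
```

A single step already moves the median (asymptotic variance $\pi^2/4N$) to an estimator asymptotically equivalent to the Cauchy MLE (asymptotic variance $2/N$), with no iterative search and no initial-value tuning.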
3.1. Construction of the initial estimator. This section aims to construct a $\sqrt{N}$-consistent estimator $\hat\theta_N^0$ satisfying (3.1) through stepwise least-squares-type estimators based on the first three moments of $Y_{ij}$. We note that the model (2.1) does not have a conventional location-scale structure because of the presence of $v_i$ in two different terms.
We assume that the parameter space has compact closure. Write $\vartheta = (\lambda, \delta, \gamma)$ for the parameters contained in $\mathcal{L}(v_1)$, the true value being denoted by $\vartheta_0$, and $\rho_0 = \rho(\theta_0)$ correspondingly. Further, we introduce the sequences of symmetric random matrices used in the sequel. To state our global consistency result, we need additional assumptions.
Assumption 3.1.In addition to Assumption 2.1, the following conditions hold.
To construct $\hat\theta_N^0$, we proceed as follows.
Step 1: We introduce the (heteroscedastic) least-squares contrast $M_{1,N}$ for estimating the trend-related parameters, to be regarded as targeting the corresponding unobserved quantities.

Step 2: We estimate the variance-component parameter $(\tau, \alpha)$ by minimizing the corresponding contrast based on the squared residuals.

Step 3: We estimate the remaining third-moment quantity explicitly.

Step 4: Finally, under Assumption 3.1(4), we construct $\hat\theta_N^0 = (\hat\lambda_N^0, \hat\delta_N^0, \hat\gamma_N^0)$ through the delta method by inverting the moment estimators $(\hat\mu_N^0, \hat{c}_N^0, \hat\rho_N^0)$.

In the rest of this section, we go into detail about Steps 1 to 3 and show that the estimator $\hat\theta_N^0$ thus constructed satisfies (3.1); Step 4 is the standard method of moments [19, Chapter 4]. For convenience, let us introduce some notation. The multilinear-form notation $M[u_1, \dots, u_k]$ is used for $M = \{M_{i_1, \dots, i_k}\}$ and $u = \{u_{i_1}, \dots, u_{i_k}\}$. For a sequence of random functions $\{F_N(\theta)\}_N$ and a non-random sequence $(a_N)_N \subset (0, \infty)$, we write $F_N = O_p^*(a_N)$ and $F_N = o_p^*(a_N)$ for the corresponding uniform-in-$\theta$ stochastic orders under $P_{\theta_0}$. Further, we denote by $m_i = (m_{i1}, \dots, m_{i n_i}) \in \mathbb{R}^{n_i}$ any zero-mean (under $P_{\theta_0}$) random variables such that $m_1, \dots, m_N$ are mutually independent and $\sup_{i \ge 1}\max_{1 \le j \le n_i} E_{\theta_0}[|m_{ij}|^K] < \infty$ for every $K > 0$; the specific form will be of no importance.
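To convey the flavor of Steps 1 and 2, here is a stripped-down Python sketch: a linear trend fitted by least squares, followed by a log-linear variance model fitted to the log squared residuals. The model (scalar covariates, variance $\exp(\tau_0 w + \tau_1)$, no random effect) is a hypothetical simplification of (2.1) for illustration only; the constant $-1.2704 \approx E[\log\chi^2_1]$ corrects the intercept of the second regression.

```python
import numpy as np

rng = np.random.default_rng(5)
M = 50000
x = rng.standard_normal(M)
w = rng.standard_normal(M)
beta_true, tau_true = 1.5, (0.4, -0.2)

# Hypothetical simplified model: y = beta * x + exp((tau0 * w + tau1)/2) * eps.
sd = np.exp(0.5 * (tau_true[0] * w + tau_true[1]))
y = beta_true * x + sd * rng.standard_normal(M)

# Step 1 analogue: least squares for the trend parameter.
beta_hat = np.sum(x * y) / np.sum(x * x)

# Step 2 analogue: least squares for the variance parameters applied to the
# log squared residuals; log r^2 = tau0 * w + tau1 + log(eps^2), and
# E[log chi^2_1] = psi(1/2) + log 2 = -1.2704 shifts the intercept.
r2 = (y - beta_hat * x) ** 2
A = np.column_stack([w, np.ones(M)])
coef, *_ = np.linalg.lstsq(A, np.log(r2), rcond=None)
tau_hat = (coef[0], coef[1] + 1.2704)   # undo the E[log chi^2_1] shift
print(beta_hat, tau_hat)
```

Both stages are explicit or low-dimensional least squares, which is the point: such stepwise estimators are cheap, $\sqrt{N}$-consistent under mild conditions, and thus suitable as the starting point $\hat\theta_N^0$ of the one-step procedure.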

3.1.1.
Step 1. Put $a = (\alpha, \beta, \mu)$ and $a_0 = (\alpha_0, \beta_0, \mu_0)$. By (2.1) and (3.4), we have the decomposition in which $\bar\alpha = \bar\alpha(\alpha, \alpha_0)$ is a point lying on the segment joining $\alpha$ and $\alpha_0$. The first term on the rightmost side equals $O_p^*(N^{-1/2})$. The second term equals $o_p^*(1)$ by Assumption 3.1(1), hence we conclude the consistency of $\hat{a}_N$. We may and do focus on the event $\{\partial_a M_{1,N}(\hat{a}_N) = 0\}$, on which the usual Taylor expansion holds with $\tilde{a}_N$ a random point lying on the segment joining $\hat{a}_N$ and $a_0$. Observe that, on the right-hand side, the first term equals $o_p(1)$ while the inverse of the second term is $O_p(1)$. The last two displays, combined with Assumption 3.1(1) and (3.7), conclude that $\sqrt{N}(\hat{a}_N - a_0) = O_p(1)$; under additional conditions one could show that $\sqrt{N}(\hat{a}_N - a_0)$ is asymptotically centered normal, but this is not necessary here.

3.1.2. Step 2. Put $b := (\tau, c)$ and $b_0 := (\tau_0, c_0)$. As in Section 3.1.1, we observe that the analogous decomposition holds for some point $\bar\tau = \bar\tau(\tau, \tau_0)$ lying on the segment joining $\tau$ and $\tau_0$. Thus Assumption 3.1(2) yields the consistency of $\hat{b}_N$. The $\sqrt{N}$-rate of $(\hat{b}_N)$ can also be deduced as in Section 3.1.1: it suffices to note the corresponding estimate for every random sequence $(\tilde{b}_N)$ converging in probability to $b_0$, under Assumption 3.1(3).

We end this section with a few remarks.

Remark 3.2. As an alternative to (3.4), one could use the profile least-squares estimator [15]: first construct the explicit least-squares estimator of $(\beta, \mu)$ for given $\alpha$, and then optimize $\alpha \mapsto M_{1,N}(\alpha, \hat\beta_N(\alpha), \hat\mu_N(\alpha))$ to obtain an estimator of $\alpha$.
Remark 3.3. If one component of $(\lambda, \delta, \gamma)$ is known from the very beginning, then it is enough to consider the estimation of $(\mu, c)$, and we can remove Assumption 3.1(3) upon modifying Assumption 3.1(4).
Remark 3.4. Because of the asymptotic nature of the arguments, the same flow of estimation procedures (the MLE, the initial estimator, and the one-step estimator) remains valid even if we replace the trend term $x_{ij}^\top\beta$ in (1.2) by a nonlinear one, say $\mu(x_{ij}, \beta)$, under associated identifiability conditions.

Remark 3.5. We can construct a one-step estimator for the MELS model (1.2) in a manner similar to Steps 1 to 3 of Section 3.1, constructing the initial estimators in order through suitable contrast functions to be minimized. As in the case of (3.6), the resulting $\hat\rho_N^0$ is explicit, although the meaning of the parameter $\rho$ differs in the present context. It is also possible to develop an asymptotic theory for the MLE of the MELS model and the related one-step estimator along the lines of the present study. However, the one-step estimator based on the log-likelihood function (1.3) still necessitates numerical integration over $\mathbb{R}^2$ with respect to the two-dimensional standard normal variables; the integration would need to be performed for every $i = 1, \dots, N$ and $j = 1, \dots, n_i$, hence the computational load would remain significant.
In each case, we computed $\sqrt{N}(\hat\xi_N - \theta_0)$ for $\hat\xi_N = \hat\theta_N^0$, $\hat\theta_N^1$, and $\hat\theta_N$, each based on 1000 Monte Carlo trials. To estimate the empirical 95% coverage probabilities as in Section 2.3, we computed $-\partial_\theta^2\ell_N(\hat\theta_N)$ and $-\partial_\theta^2\ell_N(\hat\theta_N^1)$ through the function $\theta \mapsto -\partial_\theta^2\ell_N(\theta)$ for the approximate 95% confidence intervals for each parameter. The results are shown in Table 4; there, we obtained 4 numerically unstable MLEs and 5 unstable one-step estimators in case (i'), and 299 unstable MLEs and 6 unstable one-step estimators in case (ii'), and computed the coverage probabilities from the remaining cases. In Figures 4 and 5 (for cases (i') and (ii'), respectively), we drew histograms of $\hat\theta_N^1$ and $\hat\theta_N$ together with those of the initial estimator $\hat\theta_N^0$ for comparison. In each figure, the histograms in the first and fourth columns are those for $\hat\theta_N^0$, those in the second and fifth columns for $\hat\theta_N^1$, and those in the third and sixth columns for $\hat\theta_N$; the red solid line shows the zero-mean normal density with the consistently estimated Fisher information determining the variance.

Table 4. The empirical 95% coverage probabilities of the MLE and the one-step estimator in cases (i') and (ii') based on 1000 trials; the MLE of $(\delta, \gamma)$ in case (ii') showed instability in the numerical optimization, while the one-step estimator is stable, as in case (i').
Here is a summary of the important findings.
• Approximate computation times for obtaining one set of estimates are as follows: (i') 0.2 seconds for $\hat\theta_N^0$, 10 seconds for $\hat\theta_N^1$, and 2 minutes for $\hat\theta_N$; (ii') 0.2 seconds for $\hat\theta_N^0$, 10 seconds for $\hat\theta_N^1$, and 9 minutes for $\hat\theta_N$. A considerable reduction can be seen for $\hat\theta_N^1$ compared with $\hat\theta_N$.
• About Figures 4 and 5:
– In both cases (i') and (ii'), the inferior performance of $\hat\theta_N^0$ is drastically improved by $\hat\theta_N^1$, which in turn behaves asymptotically equivalently to the MLE $\hat\theta_N$.
– On the one hand, as in Section 2.3, the MLE $\hat\theta_N$ is strongly affected by the initial value of the numerical optimization, partly because of the non-concavity of the likelihood function $\ell_N(\theta)$; in case (ii'), we observed instability in computing the MLE of $(\delta, \gamma)$ (bottom panels of Figure 5), showing the local-maxima problem. On the other hand, we did not observe the local-maxima problem in computing $\hat\theta_N^0$, and the one-step estimator $\hat\theta_N^1$ does not require an initial value for numerical optimization.

In sum, $\hat\theta_N^1$ is not only asymptotically equivalent to the efficient MLE but also much more robust in numerical optimization than the MLE. We recommend using the one-step estimator $\hat\theta_N^1$ rather than the MLE $\hat\theta_N$, from both theoretical and computational points of view.
We end this section with applications of the proposed one-step estimator $\hat\theta_N^1$ for (3.8) to the two real data sets riesby_example.dat and posmood_example.dat, borrowed from the supplemental material of [11]. Here are brief descriptions.
Figures 6 and 7 show data plots and histograms, respectively; the former data set is positively skewed while the latter is negatively skewed. We could apply our one-step estimation method to these data sets, although they can also be seen as categorical data (with a moderately large number of categories). The results are given in Table 5; the parameters $\beta_0$, $\alpha_0$, and $\tau_0$ denote the intercepts. The skewness mentioned above is reflected in the estimates of $\alpha_0$ and $\alpha_1$.

Table 5. One-step estimates for the two data sets riesby_example.dat and posmood_example.dat; the computations took 0.9 and 6.7 seconds, respectively.

Concluding remarks
We proposed a class of mixed-effects models with non-Gaussian marginal distributions which can incorporate random effects into the skewness and the scale simply and transparently through the normal variance-mean mixture. The associated log-likelihood function is explicit, and the MLE is asymptotically efficient (Remark 2.4) while computationally demanding and unstable. To bypass this numerical issue, we proposed the easy-to-use one-step estimator $\hat\theta_N^1$, which not only attains a significant reduction of computation time compared with the MLE but also retains the asymptotic efficiency property.
Here are some remarks on important related issues.

(1) Intra-individual dependence structure. A drawback of the model (2.1) is that its intra-individual dependence structure is not flexible enough. Specifically, let us again note the covariance structure (2.2) for $j, k \le n_i$: it implies in particular that $Y_{i1}, \dots, Y_{i n_i}$ cannot be correlated as long as $s(z, \alpha) \equiv 0$. Nevertheless, it is formally straightforward to extend the model (2.1) so that the distribution of $Y_i \in \mathbb{R}^{n_i}$ obeys a multivariate GH distribution with a non-diagonal scale matrix. Briefly, suppose that the sample vector $Y_i = (Y_{i1}, \dots, Y_{i n_i}) \in \mathbb{R}^{n_i}$ from the $i$th individual is given by the analogous multivariate form, with $v_1, \dots, v_N \sim$ i.i.d. $\mathrm{GIG}(\lambda, \delta, \gamma)$ as before, while we now incorporate the scale matrix $\Lambda(w_i, \tau)$, which should be symmetric positive definite but may be non-diagonal. Then, the dependence structure of $Y_{i1}, \dots, Y_{i n_i}$ can be much more flexible than in (2.1).

(2) Forecasting random-effect parameters. As in the familiar Gaussian linear mixed-effects model, where the conditional expectation of the random effect given the data serves as its predictor, the conditional distribution $\mathcal{L}(v_i \,|\, Y_i)$ is explicit in our setting; this is a direct consequence of the general results about the multivariate GH distribution, and we refer to [5] and the references therein for details. As in the Gaussian case, we can make use of the plug-in quantities $\hat\nu_i := \nu_i(\hat\theta_N)$, $\hat\eta_i := \eta_i(\hat\theta_N)$, and $\hat\psi_i := \psi_i(\hat\theta_N)$; formally, $\hat\theta_N$ could be replaced by the one-step estimator $\hat\theta_N^1$. Then, it is natural to regard $\hat{Y}_{ij} := x_{ij}^\top\hat\beta_N + s(z_{ij}, \hat\alpha_N)\hat{v}_i$ as a predicted value of $Y_{ij}$ at $(x_{ij}, z_{ij})$. This includes forecasting the value of the $i$th individual at a future time point.
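A minimal numerical sketch of this prediction step (Python with SciPy; all numbers are hypothetical): completing the square in $v$ in the conditional density shows that, for a mixture of the form (2.1), $\mathcal{L}(v_i \,|\, Y_i)$ is $\mathrm{GIG}(\lambda - n_i/2, \tilde\delta_i, \tilde\gamma_i)$ with $\tilde\delta_i^2 = \delta^2 + \sum_j (y_{ij} - x_{ij}^\top\beta)^2/\sigma_{ij}^2$ and $\tilde\gamma_i^2 = \gamma^2 + \sum_j s_{ij}^2/\sigma_{ij}^2$, so the predictor $\hat v_i = E[v_i \,|\, Y_i]$ is explicit.

```python
import numpy as np
from scipy.stats import geninvgauss

# Hypothetical parameter values and one individual's data (n = 3).
lam, delta, gamma = -0.5, 1.0, 1.5
mu = np.array([0.6, 0.6, 0.6])    # x_ij' beta
s = np.array([0.8, 0.8, 0.8])     # s(z_ij, alpha)
sig = np.array([1.2, 1.2, 1.2])   # sigma(w_ij, tau)
y = np.array([1.4, 0.9, 1.8])     # observed responses

# Conditional law of v given Y is again GIG (complete the square in v).
n = y.size
lam_post = lam - n / 2
delta_post = np.sqrt(delta**2 + np.sum((y - mu) ** 2 / sig**2))
gamma_post = np.sqrt(gamma**2 + np.sum(s**2 / sig**2))

post = geninvgauss(lam_post, delta_post * gamma_post, scale=delta_post / gamma_post)
v_hat = post.mean()               # E[v_i | Y_i], a ratio of Bessel functions
y_pred = mu + s * v_hat           # plug-in predicted values at (x_ij, z_ij)
print(v_hat, y_pred)
```

In practice the unknown parameters would be replaced by $\hat\theta_N$ or $\hat\theta_N^1$, exactly as described above.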
Alternatively, one may consider information criteria such as the conditional AIC [18] and a BIC-type criterion [4]. To develop these devices rigorously, we would need several further analytical results: the uniform integrability of $(|\sqrt{N}(\hat\theta_N - \theta_0)|^2)_N$ for the AIC, a stochastic expansion of the marginal likelihood for the BIC, and so on.
Making the change of variables $S_i^2 v_i / T_i = u_i$, we can continue the calculation, which leads to the expression (2.3).

Figure 2. Standardized distributions of the MLE in case (i). The lower rightmost panel shows the chi-square approximation based on (2.13).

Figure 3. Standardized distributions of the MLE in case (ii). The lower rightmost panel shows the chi-square approximation based on (2.13).

Figure 4. Case (i'): Histograms of the initial estimator $\hat\theta_N^0$ (first and fourth columns), the one-step estimator $\hat\theta_N^1$ (second and fifth columns), and the MLE $\hat\theta_N$ (third and sixth columns). In each histogram panel, the solid red line shows the estimated asymptotically best possible normal density.

Figure 5. Case (ii'): Histograms of the initial estimator $\hat\theta_N^0$ (first and fourth columns), the one-step estimator $\hat\theta_N^1$ (second and fifth columns), and the MLE $\hat\theta_N$ (third and sixth columns). In each histogram panel, the solid red line shows the estimated asymptotically best possible normal density.

Figure 6. Data plots of riesby_example.dat (left) and posmood_example.dat (right), borrowed from the supplemental material of [11]; the former shows data of 10 patients over 6 time points with a few missing values, and the latter those of 3 people over 26 time points with no missing values.

Table 1. MLE results; the computation time for one pair was about 21 hours.

2. Parameter-varying generalized hyperbolic model

2.1. Proposed model. We model the objective variable at the $j$th sampling time point from the $i$th individual by (2.1). Let $\Rightarrow$ denote convergence in distribution. Having obtained (2.7), (2.8), and (2.9), we can conclude the following theorem by applying [17, Theorems 1 and 2].

Theorem 2.3. Under Assumptions 2.1 and 2.2, the following statements hold under $P_{\theta_0}$.

Table 2. The empirical 95% coverage probabilities of the MLE in cases (i) and (ii) based on 1000 trials.