Controlling the familywise error rate when performing multiple comparisons in a linear latent variable model

In latent variable models (LVMs) it is possible to analyze multiple outcomes and to relate them to several explanatory variables. In this context many parameters are estimated and it is common to perform multiple tests, e.g. to investigate outcome-specific effects using Wald tests or to check the correct specification of the modeled mean and variance using a forward stepwise selection (FSS) procedure based on Score tests. Controlling the family-wise error rate (FWER) at its nominal level involves adjustment of the p-values for multiple testing. Because of the correlation between test statistics, the Bonferroni procedure is often too conservative. In this article, we extend the max-test procedure to the LVM framework for Wald and Score tests. Depending on the correlation between the test statistics, the max-test procedure is equally or more powerful than the Bonferroni procedure while also providing, asymptotically, a strong control of the FWER for non-iterative procedures. Using simulation studies, we assess the finite sample behavior of the max-test procedure for Wald and Score tests in LVMs. We apply our procedure to quantify the neuroinflammatory response to mild traumatic brain injury in nine brain regions.


Introduction
Latent variable models (LVMs) are an attractive tool for studying systems of variables where an exposure, e.g. a treatment or a disease, is to be related to several outcome variables, e.g. the concentration of a specific protein in various brain regions. They are able to jointly analyze several dependent variables, relate them to exogenous factors, and investigate shared correlation structures. They encompass linear regressions, probit models, and mixed models as special cases (Holst and Budtz-Jørgensen 2013). They also admit a graphical representation called a path diagram (Fig. 1).
In applications, LVMs include a high number of parameters, and several of them (or combinations thereof) are of interest. Global tests, such as likelihood ratio tests, make it possible to test several null hypotheses simultaneously but provide no guidance as to which null hypotheses are false. However, this is of critical importance, e.g. when investigating the effect of a disease on several brain regions, and motivates the use of separate tests. Traditional adjustments for handling multiple testing, like Bonferroni, ignore the correlation between test statistics. If the correlation is strong, e.g. above 0.7 in our real data application, power is lost, which is problematic in fields like neuroscience where the inclusion of many subjects is expensive.
Another situation where multiple testing arises is model checking. Traditionally, practitioners specify LVMs by drawing a path diagram based on a priori information, fit the model, and assess its goodness of fit. In the absence of a priori information, one often considers a parsimonious structure, e.g. a single latent variable sufficient to capture the covariance structure (i.e. no double-headed arrows). Searching for local dependencies (i.e. conditional dependencies between two variables that are not connected in the path diagram) is then a recommended practice (Ropovik 2015), and identifying omitted local dependencies will raise doubts about the validity of the LVM. A possible remedy is to sequentially include the omitted local dependencies until the fit of the model is considered satisfactory. At each step, the most relevant local dependency is identified using Score tests over the set of possible additional local dependencies. While widely used, such forward stepwise selection (FSS) procedures have been criticized due to low reproducibility (MacCallum et al. 1992) and inflation of the type 1 error (Ropovik 2015).
Efficient methods for handling multiple comparisons have been developed for a number of years (see Dmitrienko and D'Agostino 2013 for an introduction in the context of clinical trials), but they are not often used in conjunction with LVMs. For instance, the max-test procedure is not implemented in statistical software specialized in LVMs such as M-plus (Muthén and Muthén 2017), the PROC CALIS procedure in SAS, or the R packages lava (Holst and Budtz-Jørgensen 2013) and lavaan (Rosseel 2012). A few articles have stressed the importance of controlling the FWER in the LVM literature, promoting the Bonferroni procedure (Cribbie 2000; Cudeck and O'Dell 1994) or a procedure similar to Bonferroni-Holm when performing backward stepwise selection (Green et al. 2001). Interestingly, Smith and Cribbie (2013) proposed a modified Bonferroni procedure to account for the correlation between the test statistics. This is performed in an ad hoc way by correcting the number of tests to adjust for using the average absolute correlation between estimated coefficients. In particular, there is no guarantee that the FWER is appropriately controlled (it is not difficult to construct examples where it is not the case). In comparison, a max-test procedure can be shown to control the FWER while having a power advantage over the Bonferroni procedure. It can be carried out in a parametric way for normally (or Student's t) distributed test statistics (Hothorn et al. 2008).

[Fig. 1 Path diagram of the generative LVM used for the simulations in Sect. 5.2. The outcomes are represented in blue, the latent variable in red, and the covariates in green. Regression links are indicated with black single-headed arrows, covariance links with red double-headed arrows, and the absence of an arrow or a gray arrow indicates conditional independence between two variables. Dashed arrows indicate the links that are tested in the FSS (not displayed in a traditional path diagram).]
Westfall and Troendle (2008) showed that the procedure generalizes to other distributions provided that one can compute the cumulative distribution function (cdf) of the maximum. However, for Score tests, the cdf of the maximum of χ² variables is difficult to calculate.
In this article, we extend the parametric max-test procedure to LVMs (i) when testing multiple parameters using Wald statistics and (ii) when using Score tests. To achieve (i), we apply the max-test procedure proposed by Hothorn et al. (2008) in conjunction with a modification of the classical Wald statistic (Ozenne et al. 2020) to obtain a max-test procedure for LVMs that is valid in small samples (e.g. n = 36 in our real data application). To achieve (ii), we have developed two novel procedures to approximate the max-distribution for χ²-distributed variables. The max-test procedures are implemented in a package for the R software called lavaSearch2, available on CRAN (https://cran.r-project.org/web/packages/lavaSearch2/index.html). The code used for the simulation studies and for the data analysis is available at https://github.com/bozenne/Article-lvm-multiple-comparisons.

Latent variable model (LVM) framework
Let us consider a vector of outcome variables Y = (Y_1, ..., Y_m) and a vector of covariates X = (X_1, ..., X_l) with arbitrary distribution. We observe a sample (X_i)_{i∈{1,...,n}} = (y_i, x_i)_{i∈{1,...,n}} of n replications of X = (Y, X). We assume that the sample contains independent and identically distributed (iid) replicates. For this we consider a LVM, denoted M(Θ), which models the conditional distribution of the outcomes as a function of the covariates and the vector of model parameters Θ through a normal distribution:

Y | X ~ N(μ(Θ, X), Ω(Θ))

To express the conditional mean μ(Θ, X) and the conditional variance Ω(Θ), we introduce a set of latent variables η and relate them to the observed variables via the measurement model:

Y = ν + Λη + K X + ε, with ε ~ N(0, Σ_ε),

and the structural model:

η = α + B η + Γ X + ζ, with ζ ~ N(0, Σ_ζ),

where ε ⊥⊥ ζ (⊥⊥ denotes stochastic independence) and B is a matrix with 0 on its diagonal and such that 1 − B is invertible. We also impose constraints on the parameters α and Λ, typically that their first element is, respectively, 0 and 1, to ensure identifiability of the model. Thus Θ contains the unconstrained parameters of ν, Λ, K, Σ_ε, α, B, Γ, Σ_ζ. See supplementary material E for an example. In this model, the conditional mean and variance are:

μ(Θ, X) = ν + Λ(1 − B)^{-1}(α + Γ X) + K X
Ω(Θ) = Λ(1 − B)^{-1} Σ_ζ ((1 − B)^{-1})^⊤ Λ^⊤ + Σ_ε

and the log-likelihood is:

l(Θ) = Σ_{i=1}^{n} [ −(m/2) log(2π) − (1/2) log|Ω(Θ)| − (1/2)(y_i − μ(Θ, x_i))^⊤ Ω(Θ)^{-1} (y_i − μ(Θ, x_i)) ].

Modifications of this model to handle non-linear effects of some covariates, clustered data, binary data, or censored data can be found in Holst and Budtz-Jørgensen (2013). Note that in the following, we will assume that four regularity conditions are satisfied, at least in a neighborhood of Θ_0: (i) Θ_0 is interior to the space of possible parameter values, (ii) distinct Θ values represent distinct distributions, (iii) Σ_ε and Σ_ζ are positive definite, and (iv) ∂μ(Θ)/∂Θ and ∂Ω(Θ)/∂Θ are of full column rank. We denote by Θ̂ the estimate of Θ obtained by maximum likelihood (ML) estimation.
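As a quick numerical illustration of the conditional moment formulas μ(Θ, X) and Ω(Θ) above, the sketch below evaluates them for a toy model with three outcomes and one latent variable. All parameter values are ours, chosen purely for illustration; this is not code from the article or from lava.

```python
# Toy evaluation of the LVM conditional moments (illustrative values only,
# not taken from the article).
import numpy as np

nu = np.array([0.0, 0.5, 1.0])            # measurement intercepts
Lam = np.array([[1.0], [0.8], [1.2]])     # factor loadings (first fixed to 1)
K = np.zeros((3, 1))                      # no direct covariate effect on Y
alpha = np.array([0.3])                   # structural intercept
B = np.zeros((1, 1))                      # no latent-on-latent regression
Gamma = np.array([[0.5]])                 # covariate effect on the latent variable
Sig_eps = np.eye(3)                       # residual variance (diagonal)
Sig_zeta = np.array([[2.0]])              # latent variance
x = np.array([1.0])                       # one covariate value

IB = np.linalg.inv(np.eye(1) - B)         # (1 - B)^{-1}
mu = nu + (Lam @ IB @ (alpha + Gamma @ x)).ravel() + (K @ x).ravel()
Omega = Lam @ IB @ Sig_zeta @ IB.T @ Lam.T + Sig_eps
```

With B = 0 the formulas collapse to the familiar factor-model moments ν + Λ(α + ΓX) and ΛΣ_ζΛ^⊤ + Σ_ε, which is easy to verify by hand here.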
The estimation can be carried out using the Newton-Raphson algorithm, iteratively computing the vector of scores:

S(Θ) = Σ_{i=1}^{n} S_i(Θ) = Σ_{i=1}^{n} ∂l_i(Θ)/∂Θ

and the expected information matrix:

I(Θ) = −E[∂²l(Θ)/∂Θ∂Θ^⊤]

and updating Θ^{(k+1)} = Θ^{(k)} + I(Θ^{(k)})^{-1} S(Θ^{(k)}) until convergence. Explicit expressions for S and I in LVMs can be found in supplementary material A. They can be used to show that I is invertible under (i)-(iv). We can then obtain an estimate Σ̂_Θ̂ of the variance-covariance matrix of Θ̂ (denoted Σ_Θ̂) in two ways: by estimating the model-based variance-covariance matrix, Σ̂_{m,Θ̂} = I(Θ̂)^{-1}, or by using the robust variance-covariance matrix, Σ̂_{r,Θ̂} = I(Θ̂)^{-1} [Σ_{i=1}^{n} S_i(Θ̂) S_i(Θ̂)^⊤] I(Θ̂)^{-1}. In addition, from the theory of M-estimators (e.g. see Tsiatis 2006, Section 3.2), we get that ML estimators for LVMs are asymptotically linear. This means that there exists a function ψ_Θ, called the influence function, such that:

√n (Θ̂ − Θ_0) = (1/√n) Σ_{i=1}^{n} ψ_Θ(Θ_0, X_i) + o_p(1)    (1)

where Θ_0 denotes the true value of Θ and ψ_Θ(Θ_0, X_i) has mean 0 and finite variance.

Hypothesis testing in LVMs estimated by ML can be done using classical tests such as likelihood ratio tests, Score tests, or Wald tests. When testing parameters or combinations of parameters, Wald tests are the privileged approach since confidence intervals can easily be obtained. For model building, e.g. when deciding whether to include a new parameter, Score tests are often preferred for computational reasons.
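The score-and-information update described above can be sketched in the simplest possible case, a univariate normal model Y_i ~ N(μ, σ²), where both quantities are available in closed form. This is our own minimal example, not the LVM implementation used by lava or lavaSearch2:

```python
# Fisher scoring for a univariate normal model: iterate
# theta <- theta + I(theta)^{-1} S(theta) until convergence.
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.5, size=200)
n = len(y)

mu, s2 = 0.0, 1.0                          # starting values for (mu, sigma^2)
for _ in range(50):
    # score vector for (mu, sigma^2)
    score = np.array([np.sum(y - mu) / s2,
                      -n / (2 * s2) + np.sum((y - mu) ** 2) / (2 * s2 ** 2)])
    # expected information, diagonal for this model
    info = np.diag([n / s2, n / (2 * s2 ** 2)])
    step = np.linalg.solve(info, score)
    mu, s2 = mu + step[0], s2 + step[1]
    if np.max(np.abs(step)) < 1e-10:       # convergence criterion
        break
# converges to the ML estimates: sample mean and (biased) sample variance
```

For this model the iteration reaches the closed-form ML solution after a couple of steps; in an LVM the same loop runs with the (much larger) score and information expressions of supplementary material A.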

Multiple inference using Wald tests
In this section, we consider a LVM M(Θ) and assume that it is correctly specified. This means that there exists a vector Θ_0 such that M(Θ_0) is the distribution that has generated (y_i)_{i∈{1,...,n}} given (x_i)_{i∈{1,...,n}}. We also denote by Σ_{Θ̂,0} the true value of Σ_Θ̂; in a univariate linear model Y = Xβ + ε where ε ~ N(0, σ²) we have Σ_{β̂,0} = σ²(X^⊤X)^{-1}. We are interested in testing c null hypotheses:

(H_0^j : β_j = β_{j,0})_{j∈{1,...,c}}

where (β_1, ..., β_c) are distinct elements of Θ and β_{1,0} denotes the true value of β_1, e.g. if β_1 = Θ_1 then β_{1,0} = Θ_{1,0}. This set of null hypotheses can also be written in matrix form: β − β_0 = 0 or, more generally, CΘ = b where C is any full rank matrix (often called the contrast matrix) and b is a vector. From maximum likelihood theory (e.g. Van der Vaart (2000), Section 5.5), we know that:

√n (Θ̂ − Θ_0) ~d N(0, I*_1(Θ_0)^{-1})

where ~d denotes convergence in distribution as n → ∞ and I*_1(Θ) the large sample limit of I_1(Θ). We introduce D_β, the diagonal matrix containing the diagonal elements of Σ_β, and the vector of Wald statistics T = D_β^{-1/2}(β̂ − β_0), with |T|_max = max_{j∈{1,...,c}} |T_j|. Since Σ̂_Θ̂ converges in probability toward Σ_{Θ̂,0}, and using the same arguments as in Hothorn et al. (2008), we can approximate (under the null hypothesis) the distribution of T by a normal distribution with mean 0 and variance-covariance matrix Σ_T = D_β^{-1/2} Σ_β D_β^{-1/2}. Denoting by T_j the j-th element of T, an adjusted p-value for the j-th statistic can be obtained by computing 1 − P(|T|_max ≤ |t_j|) where:

P(|T|_max ≤ t) = ∫_{[−t;t]^c} f_T(u) du    (2)

Here f_T denotes the joint density of T and [−t;t]^c the Cartesian product of c intervals [−t;t]. Asymptotically, f_T equals the density of a multivariate Gaussian distribution with mean 0 and variance-covariance matrix Σ_T. In finite samples, one can use the asymptotic distribution as an approximation for the distribution f_T. This procedure is called a max-z test.
The previous derivations account neither for the fact that, in practice, D_β is estimated and plugged in to compute T, nor for the small sample bias of the ML estimator. As a result, Wald tests based on the asymptotic ML theory generally show inflated type 1 error rates in small samples. For linear models and linear mixed models it is recommended to model the distribution of the Wald statistic using a Student's t-distribution and to estimate the variance parameters using restricted maximum likelihood (REML). Because REML has not been developed for LVMs, Ozenne et al. (2020) recently proposed a procedure, hereafter called "small sample correction", to correct the finite sample bias of the ML estimator of Σ_ε, Σ_ζ, and used a Satterthwaite approximation to estimate the degrees of freedom of the corresponding Student's t-distribution. This enables us to use a multivariate Student's t-distribution in equation (2) instead of a Gaussian distribution, and will be referred to as a max-t test. As an approximation, the degrees of freedom of the multivariate Student's t-distribution are computed as the average of the Wald degrees of freedom. We summarize the multiple testing procedure in the following definition:

Definition 1 (Single step max-test procedure for Wald tests in LVM) Given a LVM estimated by ML with estimated parameters Θ̂ and estimated variance-covariance matrix Σ̂_Θ̂, the max-test procedure for testing the set of null hypotheses using Wald tests is:
1. Extract β̂ = CΘ̂, the estimated parameters relative to each null hypothesis, from Θ̂.
2. Extract Σ̂_β = CΣ̂_Θ̂C^⊤, the variance-covariance matrix of β̂, from Σ̂_Θ̂ and denote σ̂_β its diagonal elements. Create D̂_β, the diagonal matrix containing σ̂_β. When using a max-t test, extract df_β, the degrees of freedom relative to σ̂_β, and use the bias-corrected estimate of the variance-covariance matrix obtained from the small sample correction for Σ̂_Θ̂.
3. Compute the Wald statistics T = D̂_β^{-1/2}(β̂ − β_0).
4. Compute p-values using formula (2), with f_T being the density of a multivariate normal distribution (max-z test) or of a multivariate Student's t-distribution (max-t test) with variance-covariance matrix Σ̂_T = D̂_β^{-1/2} Σ̂_β D̂_β^{-1/2}. When using a max-t test, estimate the degrees of freedom of the Student's t-distribution by the average of df_β.

Note that at step 4, resampling techniques (e.g. Chernozhukov et al. (2013)) could also be used to obtain a non-parametric estimator of P[|T|_max ≤ |t|] based on the iid decomposition of equation (1), instead of assuming a multivariate normal or Student's t-distribution and performing numerical integration to compute the right-hand side of equation (2).
The max-test procedure enjoys several desirable properties: asymptotically, it provides a strong control of the FWER, i.e. the probability of incorrectly rejecting at least one true null hypothesis is at most 5%. It is asymptotically exact in the sense that the FWER tends to 5% as the sample size tends to infinity (the Bonferroni correction does not have this property with correlated test statistics). Consequently, the max-test procedure will be equally or more powerful than tests adjusted with the Bonferroni procedure. It is also known to be a coherent and consonant procedure (Bretz et al. (2011), Section 2.1.2), leading to decision patterns that are logical and simple to communicate: rejection of any null hypothesis implies rejection of the global null hypothesis (the intersection of the c null hypotheses), and rejection of the global null hypothesis implies rejection of at least one null hypothesis. The power of the procedure could be further improved, e.g., by considering step-down or step-up max-test procedures. However, this complicates the definition of the confidence intervals.
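Under the asymptotic Gaussian approximation, the adjusted p-values only require evaluating P(|T|_max ≤ t) for T ~ N(0, Σ_T). Below is a minimal sketch of our own, estimating this probability by Monte Carlo rather than by the numerical integration of equation (2); the function name and example correlation are ours:

```python
# Single-step max-z adjusted p-values: p_j = 1 - P(|T|_max <= |t_j|)
# with T ~ N(0, R); the probability is estimated by Monte Carlo.
import numpy as np

def maxz_adjusted(t, R, n_sim=200_000, seed=0):
    t = np.asarray(t, dtype=float)
    rng = np.random.default_rng(seed)
    T = rng.multivariate_normal(np.zeros(len(t)), R, size=n_sim)
    Tmax = np.abs(T).max(axis=1)               # realizations of |T|_max
    return np.array([np.mean(Tmax >= abs(tj)) for tj in t])

# example: three correlated Wald statistics, equicorrelation 0.5
R = np.full((3, 3), 0.5) + 0.5 * np.eye(3)
p_adj = maxz_adjusted([2.0, 1.0, 0.5], R)
```

Each adjusted p-value lies between the unadjusted p-value and its Bonferroni adjustment, which is the power advantage discussed above; with correlated statistics the gap to Bonferroni can be substantial.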

Multiple inference using Score tests
To motivate the use of Score tests, we consider a LVM M_p(θ) with p parameters, defined blinded to the data. We also consider the set of LVMs with p+1 parameters (referred to as the extended models) that contain M_p as a submodel and that are identifiable. For instance, if M_p is the LVM defined by Fig. 1, M^(1)_{p+1} may add a regression parameter between X_1 and η, while M^(2)_{p+1} may add, instead, a regression parameter between X_2 and η (and so on). We denote by c the number of extended LVMs and by Θ_j = (θ, β_j) the parameters of M^(j)_{p+1}, the j-th extended LVM. As a diagnostic tool, the practitioner would like to know whether any of the extended LVMs has a significantly better fit compared to the original LVM. We therefore test, for each extended model M^(j)_{p+1}, the null hypothesis H_0^j : β_j = 0 using a Score test. Score tests do not require fitting any additional model; only the score and the expected information matrix of the extended model (respectively denoted S_{M^(j)_{p+1}} and I_{M^(j)_{p+1}}) need to be computed, evaluated at Θ̃_j = (θ̂, 0), the ML estimator of the extended model under the null, i.e. under the constraint that β_j = 0. In contrast, we denote by Θ̂_j the unconstrained ML estimator of the extended model. The Score statistic is:

U_{M^(j)_{p+1}} = S_{M^(j)_{p+1}}(Θ̃_j)^⊤ I_{M^(j)_{p+1}}(Θ̃_j)^{-1} S_{M^(j)_{p+1}}(Θ̃_j)

It is a classical result from maximum likelihood theory that, under the null hypothesis, U_{M^(j)_{p+1}} ~d χ²_1, where χ²_1 denotes the χ² distribution with one degree of freedom. Using the same reasoning as in the previous section, we are once more interested in a max statistic:

U_max = max_{j∈{1,...,c}} U_{M^(j)_{p+1}}.

If the test statistics were independent then, up to a linear transformation, U_max is known to converge toward a Gumbel distribution when c tends to infinity (Gasull et al. 2015). In practice, the number of tests can be small (e.g. c < 10) and the test statistics in LVMs are typically correlated; we therefore need an alternative approach.

Resampling of the score under the null hypothesis
The principle of this approach is first to identify the joint distribution of the scores across all extended LVMs under the global null hypothesis. We then resample from this distribution, compute the Score statistic for each extended LVM, and take their maximum, thereby obtaining iid realizations of U_max under the global null hypothesis. The p-value can then be computed as the frequency at which the sampled realizations of U_max are more extreme than the realization of U_max obtained from the data. If we had an iid decomposition of the scores in each extended model, we could use the same approach as in Pipper et al. (2012) to identify the joint distribution of the scores: stack the iid decompositions across models and use the multivariate central limit theorem to show that the scores are asymptotically jointly normally distributed. A consistent estimator of the variance-covariance matrix of this distribution could then be deduced from the iid decomposition.
An intuitive idea for the iid decomposition would be to use the individual score contributions directly. While this is a valid iid decomposition at Θ_0, it is not at Θ̃_j or at Θ̂_j, due to the constraints on the score implied by the ML estimation. For instance, S_{M^(j)_{p+1}}(Θ̂_j) = 0 and has variance 0, while the individual terms have non-0 variance. We therefore developed another decomposition (see supplementary material B for details), first expressing Θ̃_j as a function of Θ̂_j and then using a Taylor expansion of the score around Θ̂_j. We obtain that the first p components of the score evaluated at Θ̃_j are 0 while the last component (corresponding to β_j) is, in general, non-0. Therefore, denoting by ψ_{β_j}(Θ_0, X_i) the contribution of the i-th observation to the iid decomposition of β̂_j (equation (1)), we can introduce the normalized score U_{M^(j)_{p+1}}, a vector of length p+1 with an iid decomposition (equation (4)). Once squared, U_{M^(j)_{p+1}}(Θ̃_j) is equivalent to the Score statistic. This leads to the following resampling procedure (Definition 2), whose sampling step (step 4) can be performed in two ways:

(a) sampling from a multivariate normal distribution with covariance matrix Σ_U;
(b) using a wild bootstrap, i.e., weighting the individual iid terms with individual-specific weights sampled from a standard normal distribution and summing the weighted terms over the individuals to obtain a sample of the normalized score.

5. For each sample: compute the Score statistic for each extended model and take the maximum over the extended models. Estimate the p-value as the relative frequency of the event that the sampled maximum is greater than the observed Score statistic.
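Steps 4(b) and 5 can be sketched as follows. This is our own simplification, with a single score component per extended model rather than the full (p+1)-dimensional normalized score; `score_contrib` plays the role of the individual iid terms:

```python
# Wild-bootstrap approximation of the null distribution of the maximal
# Score statistic (simplified sketch: one score component per model).
import numpy as np

def max_score_pvalue(score_contrib, n_boot=2000, seed=0):
    """score_contrib: (n, c) array; row i holds the contribution of
    observation i to the score of each of the c extended models,
    evaluated under the null model fit."""
    n, c = score_contrib.shape
    sd = score_contrib.std(axis=0) * np.sqrt(n)      # sd of each total score
    U_obs = ((score_contrib.sum(axis=0) / sd) ** 2).max()
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_boot, n))             # one N(0,1) weight per subject
    U_boot = ((W @ score_contrib / sd) ** 2).max(axis=1)
    return np.mean(U_boot >= U_obs)
```

For instance, with contributions x_ij·ε̂_i from a null linear model, a strong omitted covariate effect yields a small adjusted p-value, while pure-noise contributions leave the observed maximum inside the bootstrap distribution.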
This procedure can be limited in practice by step 4(a) or step 4(b).
Step 4(a) involves sampling from a Gaussian distribution of dimension (p+1)c, which can be very large for complex LVMs. For instance, in our illustration this dimension is (45+1)*36 = 1656, which is already quite large for a rather simple LVM where we limit the search to covariance links. Using step 4(b) may then be more tractable. Moreover, step 4(b) does not rely on the assumption that U follows a normal distribution, which may not be valid in small samples. However, it involves sampling a weight for each individual, so step 4(b) may be slow for very large n. In the following subsection, we propose a procedure that is numerically more efficient.

Approximation of the max-distribution using latent Gaussian variables
Krishnaiah and Armitage (1965) proposed to compute the distribution of the maximum of χ² variables using the joint distribution of the underlying Gaussian variables: given a vector (T_j)_{j∈{1,...,c}} such that each T_j² is χ²-distributed with one degree of freedom, the marginal distribution of each T_j is a standard normal distribution. It therefore only remains to identify R_T, the correlation matrix of (T_j)_{j∈{1,...,c}}. We introduce an individual-level analogue of the normalized score (equation (5)), where the difference with Eq. (3) is that the first element on the right-hand side is the individual score instead of the total score. We can then express the Score statistic in terms of these individual contributions and propose an estimator for the pairwise correlation between two underlying Gaussian variables j and j' (equation (6)). As illustrated in the following example, the estimator defined in equation (6) may not always be consistent but can provide a reasonable approximation of the magnitude of the correlation.

Example: consider for M_p the univariate model Y_i = ν + ε_i where ε_i ~ N(0, σ²), and for the alternative models Y_i = ν + K_j X_{ij} + ε_i for j ∈ {1, ..., q}. To simplify, we assume that each X_j has mean 0 and variance 1. Writing s_j = (1/n) Σ_{i=1}^{n} s_{ij} with s_{ij} = X_{ij}(Y_i − ν̂), the Score statistic for K_j is n s_j²/σ̂², and we can identify the latent Gaussian variable T_j = √n s_j/σ̂. For (j, j') ∈ {1, ..., q}², Cor(T_j, T_{j'}) = Cor(s_j, s_{j'}), which can be consistently estimated by computing the correlation between the vectors (s_{ij})_{i∈{1,...,n}} and (s_{ij'})_{i∈{1,...,n}}. However, the empirical correlation used in equation (6) need not coincide with this quantity, which is why the estimator may not be consistent.

The final steps of the resulting procedure (Definition 3) are: 4. Estimate the correlation matrix R_T using equation (6). 5. Compute p-values by applying formula (2) to (T_j)_{j∈{1,...,c}}, where f_T is the density of a multivariate normal distribution with mean zero and variance-covariance matrix R_T.
This procedure is expected to be numerically more efficient than the resampling procedure proposed in the previous subsection. Indeed, equation (5) involves a univariate influence function, compared to formula (4) where it is (p+1)-dimensional, making step 5 reasonably fast. However, the validity of the approximation performed by this procedure is unclear and will be empirically assessed in simulation studies (see Sect. 5.2).
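The following sketch illustrates the resulting procedure; it is ours, and the per-observation correlation estimator (a plain sample correlation of the score contributions) is a simplified stand-in for equation (6):

```python
# Latent-Gaussian approximation of the max-chi2(1) distribution:
# estimate R_T from individual score contributions, then use
# max_j T_j^2 with T ~ N(0, R_T).
import numpy as np

def approx_max_score(score_contrib, U_obs, n_sim=100_000, seed=0):
    """score_contrib: (n, c) individual score contributions under the null;
    U_obs: the c observed chi2(1) Score statistics."""
    R = np.corrcoef(score_contrib, rowvar=False)   # stand-in for equation (6)
    rng = np.random.default_rng(seed)
    T = rng.multivariate_normal(np.zeros(R.shape[0]), R, size=n_sim)
    Umax = (T ** 2).max(axis=1)                    # max of c chi2(1) variables
    return np.array([np.mean(Umax >= u) for u in U_obs])
```

Note that only a c-dimensional Gaussian vector is sampled, regardless of p, which is where the computational gain over the (p+1)c-dimensional resampling of Definition 2 comes from.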

Multiple comparisons using Wald tests
We consider a latent factor model with 9 outcomes and two binary covariates called group and gene. In the generative model all the outcomes are equally correlated, normally distributed, independent of the group variable but dependent on the gene variable. This corresponds to the following measurement and structural models:

Y_{i,j} = ν_j + λ_j η_i + K_{1,j} Group_i + K_{2,j} Gene_i + ε_{i,j}, j ∈ {1, ..., 9}    (7)
η_i = α + ζ_i    (8)
ε_i = (ε_{i,1}, ..., ε_{i,9}) ~ N(0, Σ_ε) with Σ_ε diagonal    (9)
ζ_i ~ N(0, σ²_ζ) with ζ_i ⊥⊥ ε_i    (10)

in the special case where we constrain (i) ∀j ∈ {1, ..., 9}, λ_j = λ_1 and (ii) ∀j ∈ {1, ..., 9}, K_{1,j} = 0. The values of the remaining parameters were obtained by fitting the unconstrained model to the data used for the illustration (see Sect. 6).

[Fig. 2 FWER when testing 9 null hypotheses with Wald tests, using different procedures to adjust the p-values for multiple comparisons (columns). The rows indicate whether a small sample correction is used: for instance, in the third column, the upper row uses a max-z test while the lower row uses a max-t test. The correlation reported on the x-axis is the median Pearson correlation between the test statistics computed over the repetitions. A logarithmic scale was used for the y-axis.]

To assess the control of the FWER of the max-test procedure, we consider the LVM defined by equations (7), (8), (9), and (10) under the constraint λ_1 = 1. This model will be referred to as the investigator model hereafter. The set of null hypotheses being tested is (K_{1,j} = 0)_{j∈{1,...,9}}, i.e. the group effects are zero. We consider several scenarios where we varied the sample size, n ∈ {30, 50, 75, 100, 150, 200, 300, 500}, and the covariance between the outcomes, λ_1 ∈ {0.1, 0.2, 0.35, 0.65, 1, 5}. We generated 10 000 datasets, analyzed them using the investigator model, and computed the p-value for the global null hypothesis using no adjustment for multiple comparisons, the Bonferroni procedure, or the max-test procedure with the model-based variance-covariance matrix. To improve the control of the FWER in finite samples, the p-values were also computed after application of the small sample correction.
The upper panel of Fig. 2 shows the FWER in absence of adjustment, when using the Bonferroni procedure, and when using the max-test procedure. For large sample sizes (e.g. n = 500), the FWER was above its nominal level in absence of adjustment (except when the test statistics were perfectly correlated) and below its nominal level when using the Bonferroni procedure (except when the test statistics were independent). The max-test procedure kept the FWER at its nominal level regardless of the correlation. Without small sample correction, the FWER increased when the sample size decreased, e.g. the max-test procedure had a FWER of approximately 0.1 for n = 30. This was corrected when using the small sample correction (Fig. 2, lower panel).

[Fig. 3 Power when testing 9 null hypotheses with Wald tests using the Bonferroni or the max-test procedure to adjust the p-values for multiple comparisons. The last column displays the difference in power between the two procedures.]
To assess the gain in power when using the max-test procedure instead of the Bonferroni procedure, we simplified the generative model used in the previous simulation. We set the intercepts to 0 and the other coefficients to 1, except for the group effects, where the first was set to 0.4 and the others to 0 (i.e. K_{1,1} = 0.4 and ∀j ∈ {2, ..., 9}, K_{1,j} = 0). The loadings (λ_j)_{j∈{1,...,9}} were all set to √a and the residual variances (σ²_{ε,j})_{j∈{1,...,9}} were all set to 5.25 − a, to vary the degree of correlation between the outcomes. This ensured that the conditional variance of the outcomes (Var[Y_j | Group, Gene] = λ_j² + σ²_{ε,j}) remained constant when we varied a ∈ {0.25, 1, 2, 3, 4, 5}. We generated 5000 datasets for each configuration, and the power was computed as the frequency at which the p-value for the global null hypothesis was lower than 0.05. Figure 3 shows that the max-test procedure was always more powerful than the Bonferroni procedure, with a gain in power that ranged between 0% and 22% when the correlation between the test statistics was respectively low and high. We also compared the max-test procedure to step-up procedures (Hochberg and Hommel, see Figure A in supplementary material C) and found that the power improvement obtained using a step-up procedure was negligible.
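The power gain can be reproduced in stylized form with a small Monte Carlo experiment. This is our own construction, not the article's simulation code: draw c = 9 equicorrelated Wald statistics with a single non-zero mean (the analogue of K_{1,1} ≠ 0) and compare the rejection frequencies of the two critical values (the correlation and effect size below are illustrative):

```python
# Power of max-test vs Bonferroni for 9 equicorrelated statistics
# with one true effect (illustrative parameter values, ours).
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(1)
c, rho, alpha, effect = 9, 0.7, 0.05, 3.0
R = np.full((c, c), rho) + (1 - rho) * np.eye(c)

# critical values under the global null
null_max = np.abs(rng.multivariate_normal(np.zeros(c), R, size=100_000)).max(axis=1)
z_max = np.quantile(null_max, 1 - alpha)             # max-test critical value
z_bonf = NormalDist().inv_cdf(1 - alpha / (2 * c))   # Bonferroni critical value

# power for the global null hypothesis, one shifted statistic
shift = np.zeros(c)
shift[0] = effect
alt_max = np.abs(rng.multivariate_normal(shift, R, size=100_000)).max(axis=1)
power_max, power_bonf = np.mean(alt_max > z_max), np.mean(alt_max > z_bonf)
```

Because z_max ≤ z_bonf whenever the statistics are positively correlated, the max-test rejects on a superset of the draws rejected by Bonferroni, so its power can only be equal or higher.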

Multiple comparisons using Score tests
We now assess the control of the FWER when testing multiple hypotheses using Score tests. The generative model is a latent factor model with 5 outcomes loading on a single latent variable η. The latent variable was correlated with a single variable called treatment. 15 other covariates (X_1, ..., X_15) were simulated, and the investigator aimed to assess whether they had an effect on the latent variable. The covariates were simulated with a common pairwise covariance that was varied: a ∈ {0, 0.6, 1, 1.5, 2.5, 5}. In each scenario, the 15 possible extended models were formed and the Score statistics computed. The corresponding p-values were calculated using no adjustment, the Bonferroni procedure, the max-test procedure with resampling (i.e. Definition 2 with step 4(a)), or the approximate max-test procedure (i.e. Definition 3). For each sample size and covariance value, 10 000 datasets were simulated and analyzed. The FWER was computed as the relative frequency at which the smallest p-value was below 5%. Figure 4 displays the FWER relative to each procedure. Results are similar to those of the previous simulation, the max-test correction providing a good control of the FWER regardless of the correlation, while the Bonferroni procedure was too conservative for correlated test statistics. In small samples, the approximate max-test procedure appeared to provide a better control of the FWER than the resampling approach. In 0.01% of the datasets, the p-values adjusted using the approximate max-test procedure were greater than those obtained with Bonferroni. This only occurred when the non-adjusted p-value was small (< 0.001), and we think this is due to inaccuracies in the numerical integration required to compute the integral in Eq. (2).
In terms of computation time, the approximate max-test procedure was similar to the bootstrap in small samples, e.g. for n = 50 the median [5% quantile; 95% quantile] computation time in seconds was 1. We also repeated this simulation including an additional variable Z in the generative model. This variable had an effect on the outcomes through the latent variable (in the structural model η_i = α + γ Z_i + ζ_i, the regression coefficient γ was set to 0.25) but was independent of the covariates X_1, ..., X_15 (Fig. 1; the correlation between the variables X_1, ..., X_15 is omitted for readability). The investigator aimed to assess whether Z or any of the other 15 covariates had an effect on the latent variable. This corresponds to the first step of FSS: the variable Z is selected if the p-value relative to the parameter γ is significant and if γ has the greatest test statistic (in absolute value). The p-values were adjusted for multiple testing using either the Bonferroni procedure or the approximate max-test procedure. The power was then defined as the proportion of simulations where Z was selected. 5000 datasets were generated for each scenario. The upper panels of Fig. 5 display the relative frequency at which the effect of Z reached the significance level. As expected, when the test statistics were correlated, the effect of Z reached significance more often with the max-test procedure than with the Bonferroni procedure. The observed increase in frequency varied between 0% and 20%, depending on the sample size and the correlation. A similar pattern was observed when looking at the empirical power, i.e. when the effect of Z additionally had to have the largest test statistic for Z to be selected (lower panels of Fig. 5).

Illustration
Mild traumatic brain injury (mTBI) is an injury to the head inducing disruption of brain function, e.g. loss of consciousness not exceeding 30 min and dysfunction of memory around the time of injury not exceeding 24 h. Because the pathogenesis of symptoms following mTBI is poorly understood and no evidence-based treatments are available for patients with poor recovery, there is a medical interest in a more precise and objective characterization of mTBI using medical imaging. It has been hypothesized that mTBI induces a neuroinflammatory response that could act as a therapeutic target. The neuroinflammatory response is expected to vary over the brain depending on the trauma mechanism in the individual subject, but with deeper-lying brain regions generally being more vulnerable due to local concentration of shock waves. Neuroinflammation can be measured indirectly using single-photon emission computed tomography (SPECT) and injection of the radioligand [123I]-CLINDE, which visualizes the translocator protein (TSPO), a protein upregulated in active immune cells. A genetic polymorphism of TSPO is known to affect [123I]-CLINDE binding to TSPO and partially explains interindividual variability.

[Fig. 5 Percentage of simulations where the p-value of the Score test γ = 0 was significant (upper panels) and power (lower panels). The third column displays the difference between the values obtained using the approximate max-test procedure (second column) vs. Bonferroni (first column). The correlation reported on the x-axis is the median Pearson correlation between the test statistics computed over the datasets.]
Clinical, genetic and [123I]-CLINDE-SPECT data of 14 patients with mTBI and 22 healthy controls were collected. Patients were scanned a first time one to two weeks after the injury and a second time three to four months after the injury (this second scan will be ignored in the following); healthy controls were only scanned once. One of the aims of the study (Ebert et al. 2019) was to compare [123I]-CLINDE binding to TSPO between the two groups in 9 brain regions: thalamus, pallidostriatum, impact region (patients) or neocortex (healthy controls), midbrain, pons, cingulate gyrus, hippocampus, supramarginal gyrus, and corpus callosum. To quantify [123I]-CLINDE binding to TSPO, regional distribution volumes of [123I]-CLINDE were calculated with a two-tissue compartment model using arterial plasma as the input function. The distribution volume is the ratio between the radiotracer concentration in the brain and in the blood at equilibrium. In the following we ignore the uncertainty related to this method of quantification and treat the distribution volumes as if they were directly measured.
The LVM defined with the investigator included the log of the TSPO distribution volumes of [123I]-CLINDE in the 9 regions as outcomes, a single latent variable to account for the covariance between the outcomes, and region-specific genetic and group effects. This corresponds to the LVM defined previously (Eqs. (7), (8), (9), (10) under the constraint λ_1 = 1), which contains 45 parameters: 8 parameters from ν, 8 parameters from Λ, 18 parameters from K, 9 parameters from Σ_ε, 1 parameter from α, no parameters from B or Γ (since here B and Γ are null), and 1 parameter from Σ_ζ. Their values are reported in supplementary material E. With this model, the χ² statistic testing whether the modeled variance-covariance matrix differs from the observed covariance matrix was significant (p=0.0016). To assess which covariance parameters should be added, we used an FSS over the 36 possible covariance links. The Score statistics were weakly correlated, with a median absolute pairwise Pearson correlation of 0.17 (2.5% quantile: 0.013; 97.5% quantile: 0.59). Table 1 contains the results of the FSS using no adjustment, the Bonferroni procedure, or the approximate max-test procedure. Without adjustment two covariance parameters would be added, whereas with the Bonferroni or max-test adjustment only one covariance parameter would be added. The max-test approach multiplied the unadjusted p-value by a factor of 18 (first step), 26 (second step) and 13 (third step), instead of 36, 35 and 34 for the Bonferroni approach. Including the first covariance parameter yields a non-significant χ² test (p=0.31), and this model was retained for performing inference. The Q-Q plots of the residuals of the measurement models did not show any clear violation of the normality assumption.
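A single FSS step with max-test adjustment, as applied here to the 36 candidate covariance links, can be sketched as follows. This is an illustrative Python sketch under a hypothetical equicorrelated joint null (the function name and all numerical values are our own assumptions, not the study's):

```python
import numpy as np
from scipy.stats import norm

def forward_step(z_stats, R, alpha=0.05, n_sim=100_000, seed=0):
    """One step of FSS: add the candidate with the largest |statistic|
    if its max-test adjusted p-value, P(max_k |Z_k| >= |z_best|) under
    the joint null Z ~ N(0, R), is below alpha."""
    rng = np.random.default_rng(seed)
    Z = rng.multivariate_normal(np.zeros(len(z_stats)), R, size=n_sim)
    max_null = np.abs(Z).max(axis=1)
    best = int(np.argmax(np.abs(z_stats)))
    p_adj = float((max_null >= abs(z_stats[best])).mean())
    return (best if p_adj < alpha else None), p_adj

# 36 weakly correlated Score statistics, one clearly misspecified link
m, rho = 36, 0.17
R = np.full((m, m), rho) + (1 - rho) * np.eye(m)
z = np.zeros(m)
z[5] = 4.0
selected, p_adj = forward_step(z, R)
multiplier = p_adj / (2 * norm.sf(4.0))  # effective multiplier of the p-value
```

The ratio between adjusted and unadjusted p-value is bounded by the number of tests and shrinks as the correlation grows, in line with the factors 18, 26 and 13 (instead of 36, 35 and 34) observed in the illustration.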
Using an F-test for testing the global null hypothesis of no effect of mTBI on the distribution volume in any region gives a p-value of 0.011. While this p-value supports that mTBI induces neuro-inflammation, it does not inform on which brain regions are affected, which limits its practical value. This motivates the use of the max-test procedure when assessing the significance of the region-specific effects. The 9 Wald statistics were highly correlated, with a median correlation of 0.831 (min 0.702, max 0.960). The p-values adjusted with the max-test approach were only between 1.17 and 3.3 times larger than the unadjusted p-values, instead of 9 times larger for the Bonferroni approach (see Table 2). The corpus callosum was the region with the largest effect and test statistic: the unadjusted p-value was 0.026, the Bonferroni-adjusted p-value 0.234, and the max-test-adjusted p-value 0.086. The next region was the cingulate gyrus (max-test-adjusted p=0.108), while the remaining regions had large adjusted p-values (>0.3).
In these data the smallest adjusted p-value was larger than the F-test p-value. This may happen when the effect is similar across regions; the opposite tends to happen when the effect is present in only a few regions.

Concluding remarks
When dealing with complex systems of variables, LVMs are a convenient modeling tool that provides, under some assumptions, interpretable and efficient estimates. They enable the investigator to translate her hypotheses into a function of the parameters and to test whether this function, evaluated at the estimated parameters, equals a particular value. As in many other models, the statistical testing framework in LVMs is well established for a single statistical test. However, investigators are often interested in testing multiple clinical hypotheses (using Wald tests) or performing multiple diagnostic tests (using Score tests). In this article, we present adjustments for multiple comparisons applicable to both Wald and Score tests that appropriately control the FWER without sacrificing statistical power. While both procedures rely on asymptotic results, we found via simulation studies that they behave satisfactorily in finite samples. The procedures are implemented in a freely available R package (lavaSearch2). Our implementation of the max-test procedure relies on numerical integration (Genz et al. 2018) to compute tail probabilities of the multivariate Gaussian or Student's t-distributions, restricting its applicability to low-dimensional problems.
The power of the max-test procedure can be further increased using a step-down max-test procedure (analogous to the Bonferroni-Holm procedure, but accounting for the correlation between the test statistics). While the most significant p-value is not affected, the other p-values can sometimes be substantially reduced (e.g. compare the last two columns of Table 2). The power of the proposed procedure could also be improved by taking advantage of logical restrictions between null hypotheses (Westfall and Tobias 2007). While none were present in our simulation study and illustration, they typically arise when considering all pairwise differences between exposures (A vs. B, A vs. C, B vs. C). One common limitation of these improved procedures is that it is difficult to obtain simultaneous confidence intervals that are informative (i.e. provide information beyond the rejection of the null hypothesis) and consistent with the adjusted p-values. This is also why we focused on the single-step max-test procedure in this article. We refer to Strassburger and Bretz (2008) and Brannath and Schmidt (2014) for a more detailed discussion of simultaneous confidence intervals. Another possible improvement would be to handle sequential hypothesis testing. For instance, in our illustration, we first performed several Score tests until finding a satisfying model and then tested the clinical hypothesis based on the retained model. Based on a simulation study (supplementary material D.3), the type 1 error appeared to be properly controlled in that example. However, this is unlikely to be the case if the model misspecifications are directly related to the clinical hypothesis (e.g. region-specific group effects). Since resampling procedures (e.g. supplementary material D.2) are generally too computer-intensive to be used routinely, more efficient post-selection procedures would be beneficial.
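The step-down variant can be sketched as follows, again using a Monte Carlo approximation of the joint null distribution (an illustrative Python sketch; the function name, correlation structure and statistic values are our own assumptions):

```python
import numpy as np

def stepdown_maxtest(z, R, n_sim=100_000, seed=0):
    """Step-down max-test (analogous to Holm): the j-th most significant
    statistic is compared with the null maximum over the hypotheses not
    yet rejected, and monotonicity of the p-values is enforced."""
    rng = np.random.default_rng(seed)
    m = len(z)
    order = np.argsort(-np.abs(z))       # most significant first
    Z = np.abs(rng.multivariate_normal(np.zeros(m), R, size=n_sim))
    p_adj = np.empty(m)
    running_max = 0.0
    for step, j in enumerate(order):
        max_null = Z[:, order[step:]].max(axis=1)
        running_max = max(running_max, (max_null >= abs(z[j])).mean())
        p_adj[j] = running_max
    return p_adj

# three correlated Wald statistics (rho = 0.7); values are hypothetical
rho = 0.7
R = np.full((3, 3), rho) + (1 - rho) * np.eye(3)
z = np.array([4.0, 3.0, 0.5])
p_step = stepdown_maxtest(z, R)
```

The most significant hypothesis receives the same p-value as with the single-step procedure, while later hypotheses are adjusted against a shrinking set of competitors; this is the gain visible when comparing the last two columns of Table 2.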
As suggested by a reviewer, post-selection methods could also be used to avoid multiple comparisons, e.g. by using part of the data (Cox 1975; DiCiccio et al. 2020) to identify the most promising region and another part to assess its statistical significance. In the present application, this led to a median p-value of 0.066 at a critical threshold of 0.025, i.e. no apparent gain in power. We believe that this approach is mostly relevant when a large number of hypotheses are tested and there is no interest in assessing the individual null hypotheses (i.e., here, identifying which brain regions were subject to inflammation).
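The splitting strategy can be sketched as follows, here with simple per-region two-sample t-tests in place of the LVM Wald tests (an illustrative Python sketch; the split fraction, sample sizes, number of regions and effect location are our own assumptions, not the study's):

```python
import numpy as np
from scipy.stats import ttest_ind

def split_select_test(Y_cases, Y_controls, frac=0.5, seed=0):
    """Data splitting: one part of the sample picks the most promising
    region (largest |t|), the held-out part tests only that region."""
    rng = np.random.default_rng(seed)
    i1 = rng.permutation(len(Y_cases))
    i0 = rng.permutation(len(Y_controls))
    k1, k0 = int(frac * len(Y_cases)), int(frac * len(Y_controls))
    t_sel = ttest_ind(Y_cases[i1[:k1]], Y_controls[i0[:k0]]).statistic
    region = int(np.argmax(np.abs(t_sel)))             # selection half
    p = ttest_ind(Y_cases[i1[k1:], region],
                  Y_controls[i0[k0:], region]).pvalue  # confirmation half
    return region, float(p)

# simulated example: 9 regions, group difference only in region 2
rng = np.random.default_rng(1)
n, m = 200, 9
Y_controls = rng.normal(size=(n, m))
Y_cases = rng.normal(size=(n, m))
Y_cases[:, 2] += 1.0
region, p = split_select_test(Y_cases, Y_controls)
```

Only one unadjusted test is performed in the confirmation half, so no multiplicity correction is needed; the price is that half the sample is spent on selection, which is consistent with the absence of a power gain in this application.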