Data generation for composite-based structural equation modeling methods

Examining the efficacy of composite-based structural equation modeling (SEM) features prominently in research. However, studies analyzing the efficacy of corresponding estimators usually rely on factor model data. Thereby, they assess and analyze their performance on erroneous grounds (i.e., factor model data instead of composite model data). A potential reason for this malpractice lies in the lack of available composite model-based data generation procedures for prespecified model parameters in the structural model and the measurements models. Addressing this gap in research, we derive model formulations and present a composite model-based data generation approach. The findings will assist researchers in their composite-based SEM simulation studies.


Introduction
Research in the social sciences often involves inference about concepts such as attitudes, perceptions, and behavioral intentions. Since such concepts cannot be measured directly, observed variables (also referred to as indicators) are used to represent them as latent variables (or constructs) in statistical models. Structural equation modeling (SEM) has become the standard tool to validate the indirect measurement of unobservable concepts and analyze complex interrelations between latent variables. Researchers can choose from two conceptually different approaches to SEM: factorand composite-based SEM (Jöreskog and Wold 1982;Rigdon et al. 2017).
In factor-based SEM unobservable conceptual variables are approximated by common factors under the assumption that each latent variable exists as an entity independent of observed variables. This latent variable serves as the sole source of the associations among the observed variables. That is, when controlling for the impact of the latent variable, the indicator correlations are zero. One of the first and most prominent formulations of factor-based SEM has been established by Jöreskog (1978). On the contrary, composite-based SEM represents latent variables by weighted composites of observed variables, assuming each one to be an aggregation of observed variables . Although many methods fall into the domain of compositebased SEM, partial least squares (PLS ;Lohmöller 1989;Wold 1982) and generalized structured component analysis (GSCA; Hwang and Takane 2004) constitute the most advanced and frequently used approaches in the field (Hwang et al. 2020;Hwang and Takane 2014).
As factor-and composite-based SEM both try to achieve the same aim -estimating a series of structural equations that represent causal processes -researchers have routinely compared their relative efficacy on the grounds of simulated data (Rigdon et al. 2017). However, the studies usually have evaluated composite-based SEM methods on the grounds of factor model data, where the indicator covariances define the nature of the data . These studies univocally show that composite-based SEM methods produce biased results that typically manifest themselves in measurement model parameters (i.e., indicator loadings and weights) being overestimated and structural model parameters being underestimated (Goodhue et al. 2012;Lu et al. 2011;Reinartz et al. 2009). However, these results are not considering that the estimated models were misspecified with regard to the data generation process in the simulation studies-as noted by numerous authors (Marcoulides et al. 2012;Rigdon 2012;Rigdon et al. 2017). 1 In fact, very few simulation studies have assessed composite-based SEM using data that are consistent with the assumptions of the method. We believe that the reason for the scarcity of research in this field lies in the lack of suitable date generation procedures. Specifically, while the data generation process for factor-based SEM is well documented and frequently discussed in the literature (e.g., Reinartz et al. 2002), this is not the case with composite-based SEM. Generating data for composite-based simulation studies in an SEM context is challenging because the size of the path coefficients, which define the strength of relationships between latent variables, are inextricably tied to the target variable's coefficient of determination. A composite model-based data generating process must consider such dependencies. Even though needed for simulation studies, corresponding procedures have remained nontransparent.
Our research seeks to fill this gap by discussing the specification of covariance matrices in composite-based data generation, which can serve as input for simulation studies. Our approach allows researchers to generate data for composite models with pre-specified indicator weights and path coefficients or coefficients of determination to assess the method's efficacy. The package cbsem (Schlittgen 2019) of the statistical software R (R Core Team 2019) contains all functions described in the further course of this article.

The composite-based model
Consider two sets of indicator variables, x = (X 1 , . . . , X p 1 ) and y = (Y 1 , . . . , Y p 2 ), whereby all the variables should be standardized, E(X i ) = 0 and Var(X i ) = 1; the same applies to Y i . The relationships between these two sets of variables are modelled using composites in the structural model. 2 The independent composites ξ use x as indicators in their measurement models, whereas the dependent composites η employ the indicators y. Independent composites do not depend on any other composite in the structural model. Each of the composites η, which result from the y indicator variables, is dependent and, as such, is regressed on at least one other composite, regardless of whether it is an independent composite ξ or another dependent composite η. The number of independent composites is q 1 , while the number of dependent composites is q 2 .
The measurement models allow to determine the composites of the structural model (i.e., their scores) by using a specific set of observed variables as indicators for each composite. Linear combinations of the x and y indicator variables generate the scores of each composite. The indicators of ξ g build a subvector x g of x, g = 1, . . . , q 1 . The corresponding weights vectors are denoted by w The parameter vectors are column vectors whereas the random vectors are row vectors. This formal representation is not very common but has the advantage that the equations have the same appearance as the corresponding data matrices' relations. The weights relations are (Semadeni et al. 2014): with The composites have unit variances, Var(ξ g ) = 1 and Var(η h ) = 1. This implies that the weights are standardized, w (1) g While the measurement models determine the composites using the weights W 1 and W 2 , the structural model provides the relationships between the two sets of indicators by means of the resulting two sets of composites: The matrix B can be arranged as a lower triangular with zeros on the diagonal for recursive models, which applies here; ζ is a vector of errors, whereby the errors are presumed to be uncorrelated and also uncorrelated in respect of the other random vectors. The formulation with row vectors implies that the transposes of and B appear in Eq.
(2). The path coefficients in and B are the parameters of primary interest. They describe the composites' interrelations. From the structural model's recursiveness, it follows that (I − B ) is regular and a reduced form of Equation (2) exists:

The covariance matrix of the composites
Establishing the covariance matrix of a path model with composites requires determining the main parameters. In the structural model, these include (a) the path coefficients, (b) the independent composites' correlations, and (c) the dependent composites' coefficients of determination; in the measurement model, the relevant parameters are (d) the weights. The specification of the path coefficients and the coefficients of determination are interrelated. When path coefficients are of primary concern, the coefficients of determination result from the structural model requiring uncorrelated errors. Researchers can establish the covariance matrix of the dependent composites, ηη, as follows: The computation of ηη employs a nonlinear optimization to determine the diagonal matrix ζ ζ such that the composites have unit variances (Fig. 1). When specifying the dependent composites' coefficients of determination a priori, researchers must determine the path coefficients accordingly. Consider the structural regression equation for the dependent composite η c given in Eq. (2): Here β c,1:c−1 is the row vector consisting of the first c − 1 elements of row c of B. η 1:c−1 is the vector of the dependent composites related to rows 1 to c − 1 of B. The coefficients of the composites that do not appear in the regression equation of η c are zero. These considerations, together with the covariance matrix (q 1 +c−1),(q 1 +d−1) of (ξ , η 1:c−1 ) and (ξ , η 1:d−1 ), results in the following equations: Cov(η c , ξ ) = (γ c , β c,1:q 1 +c−1 ) (q 1 +c−1),q1 , Cov(η c , η d ) = (γ c , β c,1:c−1 ) (q 1 +c−1),(q 1 +d−1) , 1 ≤ d ≤ c.
These equations provide the relations required to compute the composites' covariance matrix.
For simulations that focus on the path coefficients in the structural model, no further information is needed. Here, the R 2 depends on the pre-specified structural model relationships. In contrast, one determines B a priori to obtain a specific vector r 2 = (R 2 1 , . . . , R 2 q 2 ) of the dependent composites' coefficients of determination in the structural model. More specifically, the coefficient of determination for the regression of η c on (ξ , η 1:c−1 ), which is based on Eq. (6c), follows with the assumption Var(η c ) = 1: One needs to work through matrix B from row q 1 + 1 to the last one in order to modify the path coefficients in a way that they arrive at the desired coefficients of determination. The first part of the covariance matrix is given by ξξ . After the modification of the path coefficients in row q 1 + c of B, the covariance matrix of the composites must be augmented by row and column c before the coefficients of row c + 1 can be modified. Initially, choose the row vector β q 1 +c as preferred. Subsequently, this preliminary value is multiplied by a factor τ , which allows to fulfill Eq. (7):

Computation
The covariance matrix of the indicators is used to simulate the model. With a choice of ξξ , the covariance matrix of the x-indicators and the weights W 1 must be determined so that is fulfilled. This formulation is en par with the general comprehension of compositebased models as formative measurement (Rhemtulla et al. 2020). Several options are available to choose xx and the standardized weights, resulting in a given ξξ . For instance, researchers can first deal with each block of indicators of the different exogenous composites separately, which only requires to ensure the standardization of the composites. This means that ξ g = x g w g ,w g x g x g w g = 1 must be fulfilled. One can meet this requirement, for example, by setting x g x g as the identity matrix and choosing the weights vectors such that w g w g = 1. In an alternative approach, researchers can choose the covariance matrix arbitrarily and subsequently scale it to fulfill Eq. (9). If the exogenous composites are uncorrelated, one uses x g x h = 0 for g = h. In contrast, if two composites are correlated, one must appropriately select the correlations between the indicators in the two related blocks of indicators. A straightforward solution uses x g x h and scales it such that w g x g x h w h = σ ξ g ξ h . Becker et al. (2013) used this approach in their study on latent class analysis in PLS.
In the next step, B is given, or must be determined according to the given vector r 2 of the coefficients of determination (Sect. 3). With this information, one can obtain ηη as described in Sect. 3. y y and the weights W 2 are determined in the same way as the covariance matrix of the X -indicators, using The covariances of the exogenous and the endogenous composites can be used to determine x y . First, from Eq. (1) it follows that: whereas Eq. (3) leads to: The combination of these two equations provides a necessary condition that must be fulfilled: Choosing the covariance matrix x y as permits to meet the requirement of Eq. (13). To arrive at this result, it is necessary to insert this expression into the left-hand side of Eq. (13) and to consider the relations for the covariance matrices of the composites. Figure 2 offers a quasi-code for the computation of the covariance matrices of the indicators. Equation (14) ensures that Eq. (13) is fulfilled. In special constellations other solutions may exist for the given matrices W 1 , W 2 and ξη . In any case, the resulting covariance matrices of the composites are the same. Therefore, a possible non-uniqueness does not affect the estimated results of the structural model.
Computing the factor of the third regression computation requires researchers to establish the covariance matrix of (ξ , η 1 , η 2 ). Equations (6a)  Next, the computation of the complete covariance matrix of the composites, again, uses Eqs. (6a) to (6c).
Finally, the indicators' covariance matrix is determined on the basis of previously established parameters. For this purpose, we build on the results already obtained in Step 1 of Fig. 2. For the next Step 2, let where 1 is a 3×3 matrix of ones, w 1 = (0.4, 0.5, 0.6) and 0 a vector of zeros. First, W 1 has to be standardized. This is done by computing w 1 / √ f with f = w 1 Kw 1 = 1.106, and by substituting the new vector for the old w 1 . w 2 and w 3 are standardized analogously. Subsequently, blocks of ones in xx have to be changed such that the covariances in ξξ are recovered. For example, to obtain σ 13 = 0.469, the ones in the first three rows and the last three columns are modified to 0.469/ (w 1 1w 3 ).
Analogously, one obtains the matrix W 2 by considering ηη . Finally, x y is computed using Eq. (14). As a result, one receives the complete covariance for data generation. Based on this covariance matrix follows the data generation as explained in the following section.

Data generation
The covariance matrix can be used to generate a dataset for composite model-based simulation studies. This is particularly easy when the indicators are normally distributed. Then a (n, p 1 + p 2 ) matrix of independent standard normal random variables is generated and multiplied from the right by the Cholesky factor of the covariance matrix. On the other hand, several suggestions exist for generating data from nonnormal distributions with pre-specified parameters. For instance, Vale and Maurelli (1983) extended the Fleishman (1978) method to generate multivariate random numbers with specified intercorrelations and univariate means, variances, skewness values, and kurtoses. To begin with, they produce a suitably sized matrix of independent, normally distributed random numbers. Then, they subsequently compute the Fleishman's transformation coefficients and use them an intermediate correlation matrix from the desired indicators' correlation matrix. A principal components factorization allows to obtain the intermediate correlation matrix. The resulting factor is multiplied with the matrix of independent normally distributed random numbers. Finally, the componentwise application of the Fleishman transformation follows to generate the indicator data.
This method was used for a small simulation experiment to compare the estimates of GSCA and PLS. The experiment changes the generated indicator data's levels of skewness √ β 1 and excess kurtosis β 2 . These levels correspond to normal, Laplace, exponential and t 5 -distributions (although the empirical values of the kurtosis are smaller than those of the target ones). We used the model of the example in Sect. 4 to generate 50 samples of size n = 100 for each distribution. Schlittgen's (2018) gscals (i.e., for GSCA) and plspath (i.e., for PLS) implementations have been used to obtain the model estimation results. Figure 3 shows the differences between the estimates and the path coefficients used for the simulation.
The results show that the normal data situation does not produce different results compared to the other distributions. Overall, the differences between the two estimation methods' results are marginal. However, the GSCA results are a bit closer to prespecified value (higher precision) while the PLS estimates are more closely grouped around the pre-specified value (higher robustness).

Conclusion
The data generation of pre-specified models is an important issue in composite-based SEM, especially when conducting simulation studies. Reinartz et al. (2002) investigate the simulation of common factor-based models when the latent variables are generated first. This is a sensible approach in these models, but not in composite-based ones since they comprise linear combinations of indicators . Their distributions, therefore, depend on the distributions of the indicators and will be nearer to the normal distribution if the weights do not deviate strongly from each other.
This article contributes to the literature on SEM by discussing properties of data generation in composite-based models. The pre-specified model parameters allow to obtain the indicators' covariance matrices to be used as input for data generation. Furthermore, we offer an example of nonnormally distributed indicators using Vale and Maurellis' (1983) approach.
Our findings are important for researchers who run simulation studies to compare the efficacy of existing, expanded, and newly developed algorithms for the estimation of composite-based SEM models. Also, researchers who like to analyze methodological extensions for composite-based SEM-such as the efficiency of existing and new segmentation algorithms (e.g., Schlittgen et al. 2016)-will take advantage of this research. Future research should further evaluate our approach, for example, in terms of more extreme forms of nonnormality or multimodal distributions. A promising extension would be to adjust the approach to accommodate nonlinear relationships whose use has gained momentum in applications of composite-based SEM (Sarstedt et al. 2020).