Evaluation of Formative Measurement Models

PLS-SEM is the preferred approach when formatively specified constructs are included in the PLS path model. In this chapter, we discuss the key steps for evaluating formative measurement models. These include the assessment of (1) convergent validity, (2) indicator collinearity, and (3) statistical significance and relevance of the indicator weights. We introduce key criteria and their thresholds and illustrate their use with an extended version of the corporate reputation model estimated with SEMinR.

The third step in assessing formatively measured constructs is examining the statistical significance and relevance (i.e., size) of the indicator weights. The indicator weights result from regressing each formatively measured construct on its associated indicators. As such, they represent each indicator's relative importance for forming the construct. Significance testing of the indicator weights relies on the bootstrapping procedure, which facilitates deriving standard errors from the data without relying on any distributional assumptions (Hair, Sarstedt, Hopkins, & Kuppelwieser, 2014).
The bootstrapping procedure yields t-values for the indicator weights (and other model parameters). We need to compare these t-values with the critical values from the standard normal distribution to decide whether the coefficients are significantly different from zero. Assuming a significance level of 5%, a t-value above 1.96 (two-tailed test) suggests that the indicator weight is statistically significant. The critical values for significance levels of 1% (α = 0.01) and 10% (α = 0.10) probability of error are 2.576 and 1.645 (two-tailed), respectively.
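These critical values can be obtained directly from the standard normal quantile function. A minimal sketch in base R, independent of SEMinR:

```r
# Two-tailed critical values from the standard normal distribution
# for common significance levels (alpha)
alpha <- c(0.10, 0.05, 0.01)
critical <- qnorm(1 - alpha / 2)
round(critical, 3)
# -> 1.645 1.960 2.576
```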
Confidence intervals are an alternative way to test the significance of indicator weights. They represent the range within which the population parameter will fall, assuming a certain level of confidence (e.g., 95%). In the PLS-SEM context, we speak of bootstrap confidence intervals because the construction of the confidence interval is inferred from the estimates generated by the bootstrapping process (Henseler, Ringle, & Sinkovics, 2009). Several types of confidence intervals have been proposed in the context of PLS-SEM; see Hair et al. (2022, Chap. 5) for an overview. Results from Aguirre-Urreta and Rönkkö (2018) indicate that the percentile method is preferred, as it exceeds other methods in terms of coverage and balance while producing comparably narrow confidence intervals. If a confidence interval does not include the value zero, the weight can be considered statistically significant, and the indicator can be retained. Conversely, if the confidence interval of an indicator weight includes zero, the weight is not statistically significant (assuming the given significance level, e.g., 5%). In such a situation, the indicator should be considered for removal from the measurement model.
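The logic of the percentile method can be illustrated outside of SEMinR with a simple regression weight. The data below are simulated for illustration only; SEMinR's bootstrap_model() automates this procedure for all PLS path model parameters:

```r
# Percentile bootstrap confidence interval for a regression weight
# (toy illustration with simulated data)
set.seed(123)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)
y  <- 0.4 * x1 + 0.3 * x2 + rnorm(n)
dat <- data.frame(y, x1, x2)

boot_w1 <- replicate(2000, {
  d <- dat[sample(nrow(dat), replace = TRUE), ]  # resample with replacement
  coef(lm(y ~ x1 + x2, data = d))[["x1"]]
})

ci <- quantile(boot_w1, c(0.025, 0.975))  # 95% percentile interval
# the weight is significant at the 5% level if the interval excludes zero
excludes_zero <- ci[[1]] > 0 || ci[[2]] < 0
```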
However, a nonsignificant indicator weight should not necessarily be interpreted as evidence of poor measurement model quality. We recommend you also consider the absolute contribution of a formative indicator to the construct, which is determined by the formative indicator's loading. At a minimum, a formative indicator's loading should be statistically significant. Indicator loadings of 0.5 and higher suggest the indicator makes a sufficient absolute contribution to forming the construct, even if it lacks a significant relative contribution. Fig. 5.2 shows the decision-making process for testing formative indicator weights.
In bootstrapping, a large number of samples (i.e., bootstrap samples) is drawn from the original sample, with replacement (Davison & Hinkley, 1997). The number of bootstrap samples should be high but must be at least equal to the number of valid observations in the dataset. Reviewing prior research on bootstrapping implementations, Streukens and Leroi-Werelds (2016) recommend that PLS-SEM applications should be based on at least 10,000 bootstrap samples. The PLS path model is then estimated once for each bootstrap sample (i.e., 10,000 times). The resulting parameter estimates, such as the indicator weights or path coefficients, form a bootstrap distribution that can be viewed as an approximation of the sampling distribution. Based on this distribution, it is possible to calculate the standard error, which is the standard deviation of the estimated coefficients across bootstrap samples. Using the standard error as input, we can evaluate the statistical significance of the model parameters.
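This standard-error logic can be sketched in a few lines of base R. The example uses simulated data, not the corporate reputation model:

```r
# Bootstrap standard error and empirical t-value for a coefficient:
# the standard error is the standard deviation of the estimate across
# bootstrap samples, and t = original estimate / bootstrap SE
set.seed(42)
n <- 300
x <- rnorm(n)
y <- 0.25 * x + rnorm(n)
dat <- data.frame(y, x)

original <- coef(lm(y ~ x, data = dat))[["x"]]  # original estimate

boot_est <- replicate(2000, {
  d <- dat[sample(n, replace = TRUE), ]
  coef(lm(y ~ x, data = d))[["x"]]
})

se <- sd(boot_est)          # bootstrap standard error
t_value <- original / se    # compare with 1.960 (5%, two-tailed)
```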

Excurse
When deciding whether to delete formative indicators based on statistical outcomes, researchers need to be cautious for the following reasons. First, formative indicator weights are a function of the number of indicators used to measure a construct: the greater the number of indicators, the lower their average weight. Formative measurement models are therefore inherently limited in the number of indicator weights that can be statistically significant. Second, indicators should seldom be removed from formative measurement models, since formative measurement requires the indicators to fully capture the entire domain of a construct, as defined by the researcher in the conceptualization stage. In contrast to reflective measurement models, formative indicators are not interchangeable, and removing even one indicator can therefore reduce the measurement model's content validity (Bollen & Diamantopoulos, 2017).
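The first point can be demonstrated with a small simulation: regressing a standardized composite on k uncorrelated, equally contributing indicators yields standardized weights of roughly 1/√k, so the average weight necessarily shrinks as indicators are added. The sketch below uses simulated data, not the corporate reputation model:

```r
# Average indicator weight as a function of the number of indicators
set.seed(1)
avg_weight <- sapply(c(2, 4, 8), function(k) {
  X <- matrix(rnorm(5000 * k), ncol = k)        # k uncorrelated indicators
  composite <- as.vector(scale(rowSums(X)))     # equally weighted composite
  mean(coef(lm(composite ~ X))[-1])             # average standardized weight
})
round(avg_weight, 2)
# weights shrink toward 1/sqrt(k): about 0.71, 0.50, 0.35
```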

> Important
> Formative indicators with nonsignificant weights should not automatically be removed from the measurement model, since this step may compromise the content validity of the construct.
After the statistical significance of the formative indicator weights has been assessed, the final step is to examine each indicator's relevance. With regard to relevance, indicator weights are standardized to values between −1 and +1. Thus, indicator weights closer to +1 (or −1) indicate strong positive (or negative) relationships, and weights closer to 0 indicate relatively weak relationships. Table 5.1 summarizes the rules of thumb for formative measurement model assessment.

Model Setup and Estimation
The simple corporate reputation model introduced in Chap. 3 (Fig. 3.2) and evaluated in Chap. 4 describes the relationships between the two dimensions of corporate reputation (i.e., competence and likeability) and the two key target constructs (i.e., customer satisfaction and loyalty). While the simple model is useful to explain how corporate reputation affects customer satisfaction and customer loyalty, it does not indicate how companies can effectively manage (i.e., improve) their corporate reputation. Schwaiger (2004) identified four driver constructs of corporate reputation that companies can manage by means of corporate-level marketing activities. Table 5.2 lists and defines the four driver constructs of corporate reputation. All four driver constructs are (positively) related to the competence and likeability dimensions of corporate reputation in the path model. Figure 5.3 shows the constructs and their relationships, which represent the extended structural model for our PLS-SEM example in the remaining chapters of the book. To summarize, the extended corporate reputation model has three main conceptual/theoretical components:
1. The target constructs of interest (CUSA and CUSL)
2. The two corporate reputation dimensions, COMP and LIKE, that represent key determinants of the target constructs
3. The four exogenous driver constructs (i.e., ATTR, CSOR, PERF, and QUAL) of the two corporate reputation dimensions

The endogenous constructs on the right-hand side in Fig. 5.3 include a single-item construct (i.e., CUSA) and three reflectively measured constructs (i.e., COMP, CUSL, and LIKE). In contrast, the four new driver constructs (i.e., exogenous latent variables) on the left-hand side of Fig. 5.3 (i.e., ATTR, CSOR, PERF, and QUAL) have formative measurement models in accordance with their role in the reputation model (Schwaiger, 2004). Specifically, the four new constructs are measured by a total of 21 formative indicators (detailed in Table 5.3) that have been derived from literature, qualitative studies, and quantitative pretests (for more details, see Schwaiger, 2004). Table 5.3 also lists the single-item reflective global measures for validating the formative driver constructs when executing the redundancy analysis.

We continue to use the corp_rep_data dataset with 344 observations introduced in Chap. 3 for our PLS-SEM analyses. Unlike in the simple model used in the previous chapter, we now also have to consider the formative measurement models when deciding on the minimum sample size required to estimate the model. The maximum number of arrowheads pointing at a particular construct occurs in the measurement model of QUAL, which has eight formative indicators. All other formatively measured constructs have fewer indicators. Similarly, there are fewer arrows pointing at each of the endogenous constructs in the structural model. Therefore, when building on the 10-time rule of thumb, we would need 8 · 10 = 80 observations. Alternatively, following Cohen's (1992) recommendations for multiple ordinary least squares regression analysis or running a power analysis using the G*Power program (Faul, Erdfelder, Buchner, & Lang, 2009), we would need only 54 observations to detect R² values of around 0.25, assuming a significance level of 5% and a statistical power of 80%. When considering the more conservative approach suggested by Kock and Hadaya (2018), we obtain a higher minimum sample size. Considering prior research on the corporate reputation model, we expect a minimum path coefficient of 0.2 in the structural model. Assuming a significance level of 5% and statistical power of 80%, the inverse square root method yields a minimum sample size of approximately 155 (see Chap. 1 for a discussion of sample size and power considerations).
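The inverse square root method can be reproduced in one line. The constant 2.486 corresponds to a 5% significance level and 80% power, and a minimum path coefficient of about 0.2 yields the minimum sample size of 155 reported above:

```r
# Minimum sample size via the inverse square root method
# (Kock & Hadaya, 2018): n_min > (2.486 / p_min)^2
p_min <- 0.2                          # expected minimum path coefficient
n_min <- ceiling((2.486 / p_min)^2)   # 2.486: alpha = 5%, power = 80%
n_min
# -> 155
```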

Excurse
The corporate reputation data file and project are also bundled with SEMinR. Once the SEMinR library has been loaded, we can access the demonstration code for Chap. 5 by using the demo() function on the object "seminr-primer-chap5".

# Estimate the model; the measurement and structural model
# arguments are defined as in the previous chapters
corp_rep_pls_model_ext <- estimate_pls(
  data = corp_rep_data,
  ...,
  missing = mean_replacement,
  missing_value = "-99")
# Summarize the model results
summary_corp_rep_ext <- summary(corp_rep_pls_model_ext)

Just like the indicator data used in previous chapters, the corp_rep_data dataset has very few missing values. The number of missing observations is reported in the descriptive statistics object nested within the summary return object. This report can be accessed by inspecting the summary_corp_rep_ext$descriptives$statistics object. Only the indicators cusl_1 (three missing values, 0.87% of all responses on this indicator), cusl_2 (four missing values, 1.16%), cusl_3 (three missing values, 0.87%), and cusa (one missing value, 0.29%) include missing values. Since the number of missing values is relatively small (i.e., less than 5% per indicator; Hair et al., 2022, Chap. 2), we use mean value replacement to deal with missing data when running the PLS-SEM algorithm (see also Grimm & Wagner, 2020).
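What mean value replacement does can be sketched on a toy vector (hypothetical values, not the corp_rep_data indicators), where −99 codes a missing response:

```r
# Mean value replacement for missing data coded as -99
x <- c(5, 7, -99, 6, -99, 4)
x[x == -99] <- NA                      # recode the missing-value marker
pct_missing <- 100 * mean(is.na(x))    # share of missing responses
x[is.na(x)] <- mean(x, na.rm = TRUE)   # replace with the indicator mean
x
# -> 5.0 7.0 5.5 6.0 5.5 4.0
```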
When the PLS-SEM algorithm stops running, check whether the algorithm converged (Hair et al., 2022, Chap. 3). For this example, the PLS-SEM algorithm will stop when the maximum number of 300 iterations or the stop criterion of 1.0E-7 (i.e., 0.0000001) is reached. To check convergence, inspect the summary_corp_rep_ext object by using the $ operator:

# Iterations to converge summary_corp_rep_ext$iterations
The results show that the model estimation converged after eight iterations. Next, the model must be bootstrapped to assess the indicator weights' significance. For now, we run a simple bootstrap as conducted in Chap. 4; later in this chapter, we discuss the bootstrap function in further detail when assessing the formative indicator weights' significance. To run the bootstrapping procedure in SEMinR, we use the bootstrap_model() function and assign the output to a variable; we call our variable boot_corp_rep_ext. Then, we run the summary() function on the boot_corp_rep_ext object and assign it to another variable, such as sum_boot_corp_rep_ext.

# Bootstrap the model
boot_corp_rep_ext <- bootstrap_model(
  seminr_model = corp_rep_pls_model_ext,
  nboot = 1000)
# Store the summary of the bootstrapped model
sum_boot_corp_rep_ext <- summary(boot_corp_rep_ext, alpha = 0.10)

Reflective Measurement Model Evaluation
An important characteristic of PLS-SEM is that the model estimates will change when any of the model relationships or variables are changed. We thus need to reassess the reflective measurement models to ensure that this portion of the model remains reliable and valid before continuing to evaluate the four new exogenous formative constructs. To do so, we follow the reflective measurement model assessment procedure in Fig. 4.1 (for a refresher on this topic, return to Chap. 4). The reflectively measured constructs meet all criteria discussed in Chap. 4; for a detailed discussion of the assessment of reflectively measured constructs for this model, see Appendix B.
! Each redundancy analysis model is included in the SEMinR demo file accessible at demo("seminr-primer-chap5"), so that the code can easily be replicated. Alternatively, we can create the four models for the convergent validity assessment manually using the code outlined above. Following the steps described in previous chapters, a new structural and measurement model must be created using the SEMinR syntax for each redundancy analysis, and the subsequently estimated model object must be inspected for the path coefficients.
Figure 5.5 shows the output of the redundancy analysis for the four formatively measured constructs (source: authors' screenshot from R). For the ATTR construct, this analysis yields a path coefficient of 0.874, which is above the recommended threshold of 0.708 (Table 5.1), thus providing support for the formatively measured construct's convergent validity. The redundancy analyses of CSOR, PERF, and QUAL yield estimates of 0.857, 0.811, and 0.805, respectively. Thus, all formatively measured constructs exhibit convergent validity.
In the second step of the assessment procedure (Fig. 5.1), we check the formative measurement models for collinearity by examining the formative indicators' VIF values. The summary_corp_rep_ext object can be inspected for the indicator VIF values by considering the vif_items element nested in the validity element: summary_corp_rep_ext$validity$vif_items.

# Collinearity analysis summary_corp_rep_ext$validity$vif_items
Note that SEMinR also provides VIF values for reflective indicators. However, since we expect high correlations among reflective indicators, we do not interpret these results but focus on the formative indicators' VIF values.
According to the results in Fig. 5.6, qual_3 has the highest VIF value (2.269). Hence, all VIF values are uniformly below the conservative threshold value of 3 (Table 5.1). We therefore conclude that collinearity does not reach critical levels in any of the formative measurement models and is not an issue for the estimation of the extended corporate reputation model.
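The VIF values that SEMinR reports can be reproduced manually: regress each formative indicator on all other indicators of the same construct and compute VIF = 1 / (1 − R²). The sketch below uses simulated stand-ins for the quality indicators, not the actual data:

```r
# Computing an indicator's VIF manually
set.seed(7)
n <- 344
qual_1 <- rnorm(n)                                # hypothetical indicators
qual_2 <- 0.6 * qual_1 + rnorm(n)
qual_3 <- 0.4 * qual_1 + 0.4 * qual_2 + rnorm(n)

# regress the focal indicator on the remaining indicators
r2  <- summary(lm(qual_3 ~ qual_1 + qual_2))$r.squared
vif <- 1 / (1 - r2)   # values below 3 suggest collinearity is uncritical
```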
Next, we need to analyze the indicator weights for their significance and relevance (Fig. 5.1). We first consider the significance of the indicator weights by means of bootstrapping. To run the bootstrapping procedure, we use the bootstrap_model() function. The first parameter (i.e., seminr_model) specifies the model on which we apply bootstrapping. The second parameter, nboot, sets the number of bootstrap samples to use. By default, we should use 10,000 bootstrap samples (Streukens & Leroi-Werelds, 2016). Since using such a large number of samples requires substantial computational time, we may choose a smaller number of samples (e.g., 1,000) for the initial model estimation. For the final result reporting, however, we should use the recommended number of 10,000 bootstrap samples.
The cores parameter enables us to use multiple cores of the computer's central processing unit (CPU). We recommend using this option since it makes bootstrapping much faster. As you might not know the number of cores in your device, we recommend using the parallel::detectCores() function to automatically detect and use the maximum number of cores available. By default, cores is set to the maximum value; as such, if you do not specify this parameter, your bootstrap will use the maximum computing power of your CPU. Finally, seed allows reproducing the results of a specific bootstrap run while maintaining the random nature of the process. We assign the output of the bootstrap_model() function to the boot_corp_rep_ext object. Finally, we run the summary() function on the boot_corp_rep_ext object and set the alpha parameter, which selects the significance level (the default is 0.05) for two-tailed testing. When testing indicator weights, we follow general convention and apply two-tailed testing at a significance level of 5%.

# Bootstrap the model
# seminr_model is the SEMinR model to be bootstrapped
# nboot is the number of bootstrap iterations to run
# cores is the number of CPU cores to use in multicore bootstrapping
# parallel::detectCores() allows for using the maximum cores on your device
# seed is the seed to be used for making the bootstrap replicable
boot_corp_rep_ext <- bootstrap_model(
  seminr_model = corp_rep_pls_model_ext,
  nboot = 10000,
  cores = parallel::detectCores(),
  seed = 123)  # the seed value shown here is illustrative; any fixed value works
# Summarize the results of the bootstrap
sum_boot_corp_rep_ext <- summary(boot_corp_rep_ext, alpha = 0.05)

At this point in the analysis, we are only interested in the significance of the indicator weights and therefore consider only the measurement model. We thus inspect the sum_boot_corp_rep_ext$bootstrapped_weights object to obtain the results in Fig. 5.7.
Figure 5.7 shows the t-values for the measurement model relationships produced by the bootstrapping procedure. Note that bootstrapped values are generated for all measurement model weights, but we only consider the indicators of the formative constructs. The original estimate of an indicator weight (shown in the second column, Original Est., of Fig. 5.7) divided by the bootstrap standard error, which equals the bootstrap standard deviation (column: Bootstrap SD), for that indicator weight yields its empirical t-value, displayed in the third-to-last column of Fig. 5.7 (column: T Stat.). Recall that the critical values for significance levels of 1% (α = 0.01), 5% (α = 0.05), and 10% (α = 0.10) probability of error are 2.576, 1.960, and 1.645 (two-tailed), respectively.

! Attention
The bootstrapping results shown in Fig. 5.7 will differ from your results. A seed is used in random computational processes to make the random process reproducible. However, note that for the same seed, different hardware and software combinations will generate different results. The important feature of the seed is that it ensures the results are replicable on your computer or on computers with a similar hardware and software setup. Recall that bootstrapping builds on randomly drawn samples, so each time you run the bootstrapping routine with a different seed, different samples will be drawn. The differences become very small, however, if the number of bootstrap samples is sufficiently large (e.g., 10,000).
The bootstrapping result report also provides bootstrap confidence intervals using the percentile method (see Hair et al., 2022, Chap. 5). The lower boundary of the 95% confidence interval (2.5% CI) is displayed in the second-to-last column, whereas the upper boundary of the confidence interval (97.5% CI) is shown in the last column. We can readily use these confidence intervals for significance testing. Specifically, a null hypothesis H0 that a certain parameter, such as an indicator weight w1, equals zero in the population (i.e., H0: w1 = 0) is rejected at a given level α if the corresponding (1 − α) bootstrap confidence interval does not include zero. In other words, if the confidence interval for an estimated coefficient, such as an indicator weight w1, does not include zero, the hypothesis that w1 equals zero is rejected, and we assume a significant effect.
Looking at the significance levels, we find that all formative indicators are significant at the 5% level, except csor_2, csor_4, qual_2, qual_3, and qual_4. For these indicators, the 95% confidence intervals include the value zero. For example, for csor_2, our analysis produced a lower boundary of −0.097 and an upper boundary of 0.173. Similarly, these indicators' t-values are clearly lower than 1.960, confirming their lack of statistical significance.
To assess these indicators' absolute importance, we examine the indicator loadings by running sum_boot_corp_rep_ext$bootstrapped_loadings. The bootstrapping results in Fig. 5.8 show that the t-values of the five indicator loadings (i.e., csor_2, csor_4, qual_2, qual_3, and qual_4) are clearly above 2.576, suggesting that all indicator loadings are significant at the 1% level. Moreover, prior research and theory also provide support for the relevance of these indicators for capturing the corporate social responsibility and quality dimensions of corporate reputation (Eberl, 2010; Sarstedt, Wilczynski, & Melewar, 2013; Schwaiger, 2004; Schwaiger, Sarstedt, & Taylor, 2010). Thus, we retain all indicators in the formatively measured constructs, even though not every indicator weight is significant.
The analysis of indicator weights concludes the evaluation of the formative measurement models. Considering the results from Chaps. 4 and 5 jointly, all reflective and formative constructs exhibit satisfactory levels of measurement quality. Thus, we can now proceed with the evaluation of the structural model (Chap. 6).

Summary
The evaluation of formative measurement models starts with convergent validity to ensure that the entire domain of the construct and all of its relevant facets have been covered by the indicators. In the next step, researchers assess whether pronounced levels of collinearity among indicators exist, which would inflate standard errors and potentially lead to sign changes in the indicator weights. The final step involves examining each indicator's relative contribution to forming the construct. Hence, the significance and relevance of the indicator weights must be assessed. It is valuable to also report the bootstrap confidence interval that provides additional information on the stability of the coefficient estimates. Nonsignificant indicator weights should not automatically be interpreted as indicating poor measurement model quality. Rather, researchers should also consider a formative indicator's absolute contribution to its construct (i.e., its loading). Only if both indicator weights and loadings are low or even nonsignificant should researchers consider deleting a formative indicator.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.