Beyond the number of classes: separating substantive from non-substantive dependence in latent class analysis

Latent class analysis (LCA) for categorical data is a model-based clustering and classification technique applied in a wide range of fields including the social sciences, machine learning, psychiatry, public health, and epidemiology. Its central assumption is conditional independence of the indicators given the latent class, i.e. “local independence”; violations can appear as model misfit, often leading LCA practitioners to increase the number of classes. However, when not all of the local dependence is of substantive scientific interest, this leaves two options, both of which are problematic: modeling uninterpretable classes, or retaining a lower number of substantive classes but incurring bias in the final results and classifications of interest due to remaining assumption violations. This paper suggests an alternative procedure, applicable in cases when the number of substantive classes is known in advance, or when substantive interest is otherwise well-defined. In such cases, I suggest modeling substantive local dependencies as additional discrete latent variables, while absorbing nuisance dependencies in additional parameters. An example application to the estimation of misclassification and turnover rates of the decision to vote in elections of 9510 Dutch residents demonstrates the advantages of this procedure relative to increasing the number of classes.


Introduction
Latent class (finite mixture) models for categorical variables are applied in a broad range of fields including stratification research in the social sciences (Savage et al. 2013), document classification in machine learning (Hastie et al. 2009), psychological measurement (Heinen 1996), and public health and epidemiology (Collins and Lanza 2010).
The key assumption of such models is conditional independence of the observed variables given the latent class (mixture component). Violations of this assumption may occur when there are unmodeled latent classes, and a common reaction to detected misfit is therefore to increase the number of classes based on criteria such as $L^2$, $\chi^2$, (C)AIC, BIC, CVIC or ICL (McLachlan and Peel 2000, Ch. 6). In response, a literature has developed to aid the researcher in finding the "correct" number of classes (see Nylund et al. 2007; Tofighi and Enders 2008). As an alternative to deciding the "correct" number of classes, Hennig and Liao (2013) suggested distance-based clustering methods as an exploratory method to find clusters that bring similar observations together, while Anderlucci and Hennig (2014) compared this approach to latent class analysis.
However, not all of the local dependence and pursuant additional latent classes may be of substantive interest to the researcher. For example, Hagenaars and McCutcheon (2002) suggested that local dependence between items in a questionnaire or psychological test can occur because respondents attempt to make their responses consistent; and Oberski and Vermunt (2014) found that ethnicity measurements discussed by Johnson (1990) were locally dependent due to the fact that some were measured on the same occasion. In these instances, additional classes do not yield substantively useful results.
To deal with the problem of non-substantive classes, one might select the number of classes based on their relationship with external, substantively meaningful, variables (Baudry et al. 2014). While this approach does prevent the modeling of non-substantive classes, the nuisance local dependence within the substantive classes remains. Local dependence is still problematic in such cases, because unmodeled local dependencies may bias model parameters of interest as well as posterior classifications (Vacek 1985; Qu et al. 1996; Hadgu et al. 2005).
This paper demonstrates another alternative to increasing the number of classes: modeling additional discrete latent variables when the dependence between items is substantively interesting, while modeling local dependencies directly when dependence is considered a nuisance. Local fit measures are used to detect conditional dependence, and substantive considerations are then used to decide how detected dependencies should be modeled. The goal of this approach is to deal with the problem of non-substantive classes in latent class analysis while avoiding the bias associated with ignoring nuisance dependencies.
Multiple discrete latent variable models and latent class models with local dependencies have a long history (Harper 1972; Clogg 1981; Hagenaars 1988a, b; Skrondal and Rabe-Hesketh 2004; Vermunt and Magidson 2013). However, the circumstances under which it may be preferable to use these techniques rather than increasing the number of classes have never been clarified. This has led the model-based clustering literature to develop methods for selecting the number of classes somewhat separately from the consideration of these alternatives (Hennig and Liao 2013). This paper therefore aims to reconnect the fields by clarifying the connection between these models and demonstrating the use in data analysis of recently developed local fit measures.
Section 2 introduces the data used to illustrate the suggested approach to modeling local dependence. The latent class model with several discrete variables and local dependence for binary data is presented in Sect. 3, together with the use of the "bivariate residual" (BVR) for detecting local dependence. Subsequently, Sect. 4 demonstrates the advantages of the approach introduced here over simply increasing the number of classes, after which Sect. 5 concludes.

Example application data
Why citizens vote in elections is studied intensively in political science (e.g. Campbell et al. 1960; Franklin 2004; Gallego and Oberski 2012). Even so, instead of citizens' actual turnout decisions, the answer to the survey question "did you vote in the last election?" is usually observed. The conclusions of such studies are therefore potentially threatened by misclassification in the answers to this question, and indeed validation studies (see Ansolabehere and Hersh 2012) have found that respondents are reluctant to admit not having voted. Estimating this misclassification, so that parameter estimates of substantive interest to political science may be corrected for its biasing effects (Vermunt 2010), is therefore an important endeavour for the field.
In this application, the goal was therefore to estimate misclassification in voting and turnover of vote decisions between elections by applying latent class analysis to repeated survey measurements. Latent class analysis has the advantage that it can be applied to existing panel surveys in which respondents are asked about their turnout decisions, without requiring difficult-to-obtain administrative data on voting. The disadvantage of latent class models is, however, that they make assumptions of local independence that may be incorrect. We demonstrate how this issue may be dealt with by applying latent class analysis to the LISS panel, a Dutch probability sample of 9510 voters. For more information on the design of the study, response rates, and recruitment efforts, please see Scherpenzeel (2011). All data used in this application are publicly available online (http://lissdata.nl/).
The 9510  Strikingly, this means that initially reported turnout exceeded actual turnout, possibly due to nonresponse error. But even though the same respondents were asked whether they had voted in the same elections, over time the claimed turnout rate declined toward the actual turnout rates. We therefore suspect that misclassification plays a role that may change over time.

Latent class model with possible local dependencies
Suppose an i.i.d. sample of size N is obtained on J observed binary variables, aggregated by the R response patterns into Y. Let n be the R-vector of observed response pattern counts. We also postulate K discrete latent variables $\xi_k$, collected in a vector $\xi$, whose distribution is to be estimated. The K-way cross-table of $\xi$ yields T unobserved patterns. In the case of latent structure analysis, there is only one discrete latent variable and T will equal the number of latent classes. The log-likelihood for the latent class model is then the discrete mixture (e.g. Formann 1992)
$$\ell(\theta) = n' \log \Pr(Y) = n' \log \Big[ \sum_{\xi} \Pr(\xi) \Pr(Y \mid \xi) \Big], \qquad (1)$$
where log and exp denote elementwise operations,
$$\Pr(\xi) = \frac{\exp(\eta_{\xi})}{1' \exp(\eta_{\xi})}, \qquad \Pr(Y \mid \xi) = \frac{\exp(\eta_{Y|\xi})}{1' \exp(\eta_{Y|\xi})}, \qquad (2)$$
with the normalization of $\Pr(Y \mid \xi)$ taken over the response patterns within each latent pattern. The GLM linear predictors $\eta_{Y|\xi}$ and $\eta_{\xi}$ are parameterized using effect-coded design matrices (Evers and Namboodiri 1979):
$$\eta_{Y|\xi} = X_{(Y)} \tau + X_{(YY)} \psi + X_{(Y\xi)} \lambda, \qquad \eta_{\xi} = X_{(\xi)} \alpha + X_{(\xi\xi)} \beta, \qquad (3)$$
where $X_{(Y)}$, $X_{(YY)}$ and $X_{(Y\xi)}$ are design matrices for the observed variables' main effects $\tau$, bivariate associations $\psi$, and associations with the latent discrete variables $\lambda$ ("slopes"), respectively. Similarly, $X_{(\xi)}$ and $X_{(\xi\xi)}$ are design matrices for the discrete unobserved variables' main effects $\alpha$ and associations $\beta$. This parameterization of the local dependence latent class model is similar to that adopted by Hagenaars (1988b) and Formann (1992, section 4.3), except that we additionally allow for explicit modeling of multiple discrete latent variables and their interrelations (Magidson and Vermunt 2001; Vermunt and Magidson 2013). For example, with two binary discrete latent variables and choosing "dummy coding", there are four unobserved patterns, T = 4, and the main effects and associations design matrices are
$$X_{(\xi)} = \begin{pmatrix} 0 & 0 \\ 0 & 1 \\ 1 & 0 \\ 1 & 1 \end{pmatrix}, \qquad X_{(\xi\xi)} = \begin{pmatrix} 0 \\ 0 \\ 0 \\ 1 \end{pmatrix}.$$
The $\beta$ parameter is then the log-odds ratio in the 2 × 2 cross-table of the two latent variables. A similar interpretation holds for the $\lambda$ parameters, while the $\psi$ parameters can be interpreted as conditional log-odds ratios in the cross-tables of the observed variables after conditioning on the latent variables.
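The dummy-coded example with two binary latent variables can be checked numerically. The following is a minimal sketch, assuming the latent patterns are ordered (0,0), (0,1), (1,0), (1,1); the function names and the parameter values are illustrative and are not taken from the application.

```python
import numpy as np

# Latent patterns (xi_1, xi_2), in the assumed order (0,0), (0,1), (1,0), (1,1).
X_xi = np.array([[0, 0],
                 [0, 1],
                 [1, 0],
                 [1, 1]], dtype=float)               # main-effects design matrix
X_xixi = (X_xi[:, 0] * X_xi[:, 1]).reshape(-1, 1)    # association design matrix


def latent_distribution(alpha, beta):
    """Pr(xi) over the T = 4 patterns from the log-linear predictor eta_xi."""
    alpha = np.asarray(alpha, dtype=float)
    beta = np.atleast_1d(beta).astype(float)
    eta = X_xi @ alpha + X_xixi @ beta
    w = np.exp(eta - eta.max())                      # subtract max for stability
    return w / w.sum()


def log_odds_ratio(p):
    """Log-odds ratio of the 2 x 2 table with cells p = (p00, p01, p10, p11)."""
    return np.log(p[0] * p[3] / (p[1] * p[2]))


# Illustrative values only (not estimates from the paper).
p = latent_distribution(alpha=[0.4, -0.2], beta=1.3)
print(p.round(4), log_odds_ratio(p))
```

Under these assumptions the log-odds ratio of the latent 2 × 2 table recovers the value passed as `beta`, whatever the main effects are.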
The q-vector of parameters $\theta$ can be defined as $\theta := (\alpha', \beta', \tau', \lambda', \psi')'$. There are thus $q \le T(J + 1) - 1 + \binom{J}{2}$ (possible) parameters. The standard local independence latent class model, however, has as its key assumption that $\psi = 0$. In addition, the slopes $\lambda$ are typically restricted such that, given exactly one unobserved discrete variable, each indicator is conditionally independent from all other latent variables; in analogy with linear factor analysis this might be termed "simple structure".
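The bound on the number of parameters can be tallied term by term. The sketch below assumes J binary indicators and T latent patterns; the helper name `max_parameters` is hypothetical.

```python
from math import comb


def max_parameters(T, J):
    """Upper bound T(J + 1) - 1 + C(J, 2) on the number of parameters q."""
    latent = T - 1           # free probabilities of the latent distribution
    conditional = T * J      # one response probability per indicator and latent pattern
    local_dep = comb(J, 2)   # one psi per pair of indicators
    return latent + conditional + local_dep


# Five binary indicators and two binary latent variables (T = 4).
J, T = 5, 4
print(max_parameters(T, J), 2 ** J)
```

With J = 5 and T = 4 the bound is 33, which exceeds the R = 32 response patterns, so restrictions such as ψ = 0 for most pairs and simple structure on the slopes are needed before the model can be identified.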
Maximum likelihood estimates of the parameters of the model are usually obtained as $\hat{\theta} = \arg\max_{\theta \in \mathbb{R}^q} \ell(\theta)$ by expectation-maximization (see Formann 1992), quasi-Newton methods, or a combination of both (Vermunt and Magidson 2013). Goodman (1974) showed that the parameters of the model are locally identifiable when the Jacobian $S := \partial \Pr(Y) / \partial \theta$ is of full column rank. A necessary but not sufficient condition for this is that R > q. In practice, local identifiability can be evaluated empirically by examining the rank of the information matrix at the maximum likelihood solution, or by randomly sampling many parameter values in the parameter space and evaluating the information matrix at each point (Forcina 2008). For a general discussion of identification in latent class models, we refer to Huang and Bandeen-Roche (2004); for a discussion of identifiability of the local dependence parameters, see Oberski and Vermunt (2014, Appendix).
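The rank condition can be examined numerically at a given parameter value. The following sketch assumes a single latent variable with T classes, local independence, and a logit parameterization; it approximates the Jacobian by central differences rather than analytically, and the step size and rank tolerance are arbitrary choices.

```python
import itertools
import numpy as np


def pattern_probs(theta, J, T):
    """Pr(Y) over all 2**J response patterns for a local-independence model.

    theta holds T - 1 class logits (first class as reference) followed by
    T * J item logits, one per class and item.
    """
    theta = np.asarray(theta, dtype=float)
    class_logits = np.concatenate([[0.0], theta[:T - 1]])
    class_probs = np.exp(class_logits) / np.exp(class_logits).sum()
    item_probs = 1.0 / (1.0 + np.exp(-theta[T - 1:].reshape(T, J)))
    patterns = np.array(list(itertools.product([0, 1], repeat=J)))
    # Pr(y | class) is a product over items under local independence.
    cond = np.prod(
        np.where(patterns[:, None, :] == 1, item_probs, 1.0 - item_probs), axis=2
    )
    return cond @ class_probs


def jacobian_rank(theta, J, T, h=1e-6, tol=1e-6):
    """Rank of a central-difference approximation to S = dPr(Y)/dtheta."""
    theta = np.asarray(theta, dtype=float)
    cols = []
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = h
        diff = pattern_probs(theta + step, J, T) - pattern_probs(theta - step, J, T)
        cols.append(diff / (2 * h))
    return np.linalg.matrix_rank(np.column_stack(cols), tol=tol)


# Two classes, three items: q = 7 parameters and R = 8 patterns.
theta3 = [0.4, 2.2, 1.4, 0.8, -1.4, -0.8, -0.4]
print(jacobian_rank(theta3, J=3, T=2), len(theta3))

# Two classes, two items: q = 5 parameters but only R = 4 patterns.
theta2 = [0.4, 2.2, 1.4, -1.4, -0.8]
print(jacobian_rank(theta2, J=2, T=2), len(theta2))
```

In the second model the pattern probabilities sum to one, so the Jacobian can have rank at most 3 and cannot be of full column rank. A check at one point is only indicative; repeating it at many randomly drawn parameter values follows the sampling idea described above.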

Model misfit and local dependence
After estimation, for each response pattern expected frequencies $\hat{\mu}_r := N \cdot \Pr(Y_r \mid \theta = \hat{\theta})$ are obtained, which can be compared with the observed frequencies $n_r$. Overall goodness of fit measures based on this comparison such as the chi-square and likelihood ratio ($L^2$), as well as information criteria such as BIC, AIC, CAIC, CVIC, and ICL are often used to evaluate whether the latent class model adequately describes the observed data (see McLachlan and Peel 2000, chapter 6).
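The comparison of observed and expected frequencies can be sketched as follows. The function name is hypothetical, and the BIC shown is the log-likelihood-based version (minus twice the log-likelihood plus q log N); software may report an $L^2$-based variant instead.

```python
import numpy as np


def fit_statistics(n, mu, q):
    """Pearson chi-square, likelihood-ratio L2, and a log-likelihood-based BIC.

    n  : observed counts per response pattern
    mu : expected counts under the fitted model (all positive, same total as n)
    q  : number of free parameters
    """
    n = np.asarray(n, dtype=float)
    mu = np.asarray(mu, dtype=float)
    N = n.sum()
    chi2 = np.sum((n - mu) ** 2 / mu)
    pos = n > 0                                     # 0 * log(0) is taken as 0
    L2 = 2.0 * np.sum(n[pos] * np.log(n[pos] / mu[pos]))
    loglik = np.sum(n[pos] * np.log(mu[pos] / N))   # multinomial kernel, constant dropped
    bic = -2.0 * loglik + q * np.log(N)
    return {"chi2": chi2, "L2": L2, "BIC": bic}


# Hypothetical counts for four response patterns.
print(fit_statistics(n=[30, 20, 25, 25], mu=[28.0, 22.0, 27.0, 23.0], q=2))
```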
Since the key assumption is that of local independence (ψ = 0), a major source of misfit will be locally dependent item pairs. In our example, local dependence may, for instance, arise because respondents remember their answer on the first measurement occasion and try to remain consistent on later occasions (Hagenaars and McCutcheon 2002). Assuming the model is overidentified, such local dependence will be picked up by the overall fit statistics and information criteria. When these indicate a problem, additional latent classes are then included in the model to account for the dependence. This will lead to a latent class model in which some of the classes represent, for instance, "consistent answering".
However, local dependencies and the pursuant additional classes are not necessarily of scientific interest. For theoretical reasons, one may prefer a model with fewer classes in the voting data application: we know that respondents have either voted or not and that the measurements pertain to two separate elections. Two classes are also preferred when evaluating diagnostic tests for disease/non-diseased status (Qu et al. 1996).
When a specific number of classes is desired or local dependence is not substantively meaningful, it may be preferable to model local dependencies by freeing elements of $\psi$. Freeing all local dependencies is, however, usually not desirable for reasons of model stability and (sometimes) identifiability (Oberski and Vermunt 2014). We therefore use the "bivariate residual" (BVR) between item pairs to monitor whether it might be necessary to free local dependencies (Vermunt and Magidson 2013). The bivariate residual is an intuitively attractive fit index measuring the degree to which the bivariate cross-table between a pair of observed variables fits the model:
$$\mathrm{BVR}_{jj'} := \sum_{k} \sum_{l} \frac{r_{kl}^2}{\hat{\mu}_{kl}} = r_{11}^2 \sum_{k} \sum_{l} \frac{1}{\hat{\mu}_{kl}}, \qquad (4)$$
where the raw residuals $r_{kl} := n_{kl} - \hat{\mu}_{kl}$, and $n_{kl}$ and $\hat{\mu}_{kl}$ now indicate observed and expected frequencies in the bivariate 2 × 2 cross-table of the observed variables $y_j$ and $y_{j'}$ ($j \neq j'$). The last step is a simplification possible with binary indicators, for which the marginals are perfectly reproduced, so that the four raw residuals are equal in absolute value. A BVR can be obtained for each of the $\binom{J}{2}$ pairs of observed variables; in this way, for each pair it can be investigated whether the cross-table between this pair appears to fit the hypothesis of local independence.
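The bivariate residual for one item pair can be computed directly from the observed and expected 2 × 2 tables. The sketch below assumes the Pearson form described above; the tables are hypothetical, and the shortcut function is valid only when the expected table reproduces the observed margins.

```python
import numpy as np


def bvr(observed, expected):
    """Bivariate residual: sum over cells of (n_kl - mu_kl)^2 / mu_kl."""
    n = np.asarray(observed, dtype=float)
    mu = np.asarray(expected, dtype=float)
    return np.sum((n - mu) ** 2 / mu)


def bvr_binary(observed, expected):
    """Shortcut for binary items when the margins are reproduced exactly."""
    n = np.asarray(observed, dtype=float)
    mu = np.asarray(expected, dtype=float)
    r11 = n[0, 0] - mu[0, 0]
    return r11 ** 2 * np.sum(1.0 / mu)


observed = np.array([[50.0, 10.0], [15.0, 25.0]])
# Hypothetical expected table with the same row and column totals as `observed`.
expected = np.array([[45.0, 15.0], [20.0, 20.0]])
print(bvr(observed, expected), bvr_binary(observed, expected))
```

Because the margins agree, every cell has a raw residual of plus or minus 5 and the two functions return the same value.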
The BVR has the same form as a Pearson residual and is often treated in applied research as though its asymptotic distribution were chi-square. It has been shown that this is not a good approximation; the BVR is a score test uncorrected for cell interdependencies and is far from chi-square distributed. The score test for residual dependencies does asymptotically follow a chi-square distribution; Oberski and Vermunt (2014) give example applications of its usage. Alternatively, p values for the BVR very close to Rao's (1948) classic efficient score test can be obtained by a parametric bootstrap (Efron 1982; Langeheine et al. 1996). The software Latent Gold 5.0 (Vermunt and Magidson 2013) implements these procedures. Since there are many item pairs, we will also adjust the obtained p values for multiple testing using the procedure of Benjamini and Hochberg (1995).
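The Benjamini and Hochberg (1995) adjustment itself is short enough to write out. The following sketch is a plain NumPy version of the step-up procedure, not the Latent Gold implementation; the input p values are hypothetical.

```python
import numpy as np


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity from the largest p value downwards, then cap at 1.
    adjusted_sorted = np.minimum(np.minimum.accumulate(scaled[::-1])[::-1], 1.0)
    adjusted = np.empty(m)
    adjusted[order] = adjusted_sorted
    return adjusted


# Hypothetical bootstrap p values for four item pairs.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```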

Example application results
To demonstrate the approach introduced here, we now follow two procedures for data analysis of the Dutch voting example. The first procedure is a standard single nominal latent class model, which is fitted to the data with an increasing number of classes. BIC and CAIC are used to select the number of classes, after which these are interpreted. We compare this standard procedure with one in which two discrete latent variables are modeled jointly, one for voting in each of the 2006 and 2010 elections, and the bivariate residuals are inspected to decide which local dependencies should be freed. The substantive interest of a typical political scientist would focus here on true voting behavior and its relationship with other variables, rather than measured voting behavior. Figure 1 shows criteria used to select the number of classes. Both BIC and CAIC select the four-class model. When this model is fit to the five claims of having voted, the conditional probabilities shown in Fig. 2 result. The left-hand side of Fig. 2 shows the probability of claiming to have voted on each of the five measurement occasions given the four latent classes, indicated by the different lines (colors, point shapes). The right-hand side of Fig. 2 provides a legend and shows class size estimates with 2 s.e. error bars. Figure 2 shows that class 1 is the class of people who voted in both elections, while class 3 is voting in neither election. Class 4 appears to represent voting in 2010 but not in 2006, although the probability of claiming to have voted in 2006 in this class is still around 0.25. Class 2, containing 10 % of observations, is the most difficult to explain; it contains people who initially claim to have voted, but, as time goes by, become more likely to admit that they did not.
The standard latent class model procedure applied to these data is somewhat unsatisfactory. Considering that there are only two actual elections, the only latent classes that represent the "true voting" variable of substantive interest to political scientists would correspond to the 2 × 2 = 4 cells in the cross-table of voting or not in 2006 and 2010. Four classes are indeed selected, but instead of a class "voting in 2006 and not in 2010", the difficult-to-interpret class 2 results, which partially also represents artefacts that are not of interest to political scientists.
An alternative procedure is to fit a model with two discrete latent variables, one for each election, each with two classes (voted/did not vote). The first three answers, being about the 2006 elections, are related to the first latent variable and the last two answers, about the 2010 elections, to the second latent variable. Conditional probabilities then represent misclassification rates with respect to true turnout in the 2006 and 2010 elections, which is the question of scientific interest.
Initially a model is fit in which all $\binom{5}{2} = 10$ possible local dependencies are set to zero. This "Model 1" is shown as a graph in Fig. 3. The table under "Model 1" in Fig. 3 provides p values for the 10 bivariate residuals obtained by parametric bootstrapping. All p values have been adjusted for multiple testing using the procedure of Benjamini and Hochberg (1995). The BVRs of the dependence between answers in 2008 and in other years correspond to Hagenaars and McCutcheon's (2002) suggestion that respondents sometimes attempt to make their answers consistent with the first occasion. Based on these and the values of the BVRs (not shown for conciseness), we free the local dependence between the answer in 2008 and in 2009 and re-fit the model to obtain the model and BVR p values shown under "Model 2". One adjusted p value is then still < 0.01 and in line with the memory effect theory: the corresponding dependence is therefore freed. The final model ("Model 3") does not have any BVR with adjusted bootstrapped p value < 0.01. The overall bootstrapped likelihood ratio test $L^2$ indicates a good fit as well.
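The sequence from Model 1 to Model 3 can be summarized as a loop. The sketch below is a simplification: `adjusted_pvalues` is a hypothetical user-supplied callable that refits the model and returns adjusted bootstrap p values, and the automatic choice of the smallest p value stands in for the combination of BVR values and substantive theory used in the text.

```python
def free_dependencies_stepwise(adjusted_pvalues, threshold=0.01, max_steps=10):
    """Free one local dependence at a time until no adjusted p value is below threshold.

    adjusted_pvalues : hypothetical callable. Given the set of item pairs
        already freed, it refits the model and returns a dict mapping each
        remaining pair to its adjusted bootstrap p value for the BVR.
    """
    freed = []
    for _ in range(max_steps):
        pvals = adjusted_pvalues(frozenset(freed))
        flagged = {pair: p for pair, p in pvals.items() if p < threshold}
        if not flagged:
            break
        # Simplification: pick the smallest p value. In practice the BVR
        # values and substantive considerations also enter at this step.
        freed.append(min(flagged, key=flagged.get))
    return freed


def toy_refit(freed):
    """Stand-in for model fitting, with made-up p values for three pairs."""
    if ("y1", "y2") in freed:
        return {("y1", "y3"): 0.20, ("y2", "y3"): 0.60}
    return {("y1", "y2"): 0.001, ("y1", "y3"): 0.004, ("y2", "y3"): 0.50}


print(free_dependencies_stepwise(toy_refit))
```

In the toy example, freeing the first flagged pair also removes the misfit for the second, which is why the model is refitted after every single change rather than freeing all flagged pairs at once.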
This final model has several advantages over the four-class model. First, it explicitly models true turnout in the two elections so that the conditional probabilities may be interpreted as misclassification rates ("specificity" and "sensitivity"). These misclassification rates are of interest to political scientists. Second, the two-variable classification allows researchers to relate voting in these two elections to external variables (Vermunt 2010). Third, nuisance local dependencies such as memory effects are not part of the classification but are accounted for by local dependence parameters.
Sensitivity and specificity (misclassification rates) are shown on the left-hand side of Fig. 4. The figure shows that the probability of a respondent claiming to have voted when they have not decreases as the election period becomes more distant. This finding corresponds to the idea that false positives are due to social desirability, since the "norm" of voting will be less salient three years after the election than during election season. This explains the overall pattern that claimed turnout rates approached the actual turnout rates as time went by.
An interesting external validation of our analysis is to compare the latent class model estimates of misclassification with those obtained using administrative data. Based on comparing US vote validation data with the National Election Study, Ansolabehere and Hersh (2012, Table 1 on p. 446) report the false negative rate as between 0.002 and 0.012, whereas the latent class model used here estimates it at 0.019 and 0.024 for 2006 and 2010 respectively. Similarly, the true negatives from validation were between 0.723 and 0.747, while we have estimated them at 0.739 and 0.780. The estimates obtained here, even though they do not use expensive validation data and are for a Dutch rather than a US population, were therefore very close to other results in the literature.
Fig. 4 Left: sensitivity (green) and specificity (red) of voting, shown as probability by year. Right: estimated turnover table of true turnout from 2006 to 2010.
The right-hand side of Fig. 4 shows the estimated turnover table of true turnout from 2006 to 2010 with class sizes in the margins. The class prevalences of 81 and 82 percent are higher than the actual turnout rates 80.4 and 75.4 percent, although they are much closer to true turnout than the raw reported rates (around 87 %). The turnover table suggests that voters mostly remained voters whereas non-voters in 2006 had a chance of 0.287 of voting in the 2010 election. If such a pattern were to be predicted for future elections, it would suggest that efforts to encourage citizens to vote would be best focused on non-voters in previous elections.
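A turnover table of this kind is the joint distribution of the two latent variables, conditioned on the row variable. The sketch below uses a made-up joint distribution purely to show the computation; the numbers are not the estimates reported for the application.

```python
import numpy as np


def turnover_table(joint):
    """Row-conditional transition probabilities from a joint 2 x 2 latent distribution."""
    joint = np.asarray(joint, dtype=float)
    return joint / joint.sum(axis=1, keepdims=True)


# Hypothetical joint distribution of true turnout.
# Rows: 2006 (did not vote, voted); columns: 2010 (did not vote, voted).
joint = np.array([[0.14, 0.06],
                  [0.05, 0.75]])

print(joint.sum(axis=1))      # class sizes for 2006 (row margins)
print(joint.sum(axis=0))      # class sizes for 2010 (column margins)
print(turnover_table(joint))  # Pr(2010 status | 2006 status)
```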
Finally, Fig. 5, which shows the posterior classifications under two different models (Model 1 vs. Model 3 in Fig. 3), demonstrates that ignoring the local dependencies may lead to bias. Posterior classifications shown in Fig. 5 are different for the two latent class variables depending on whether the local dependencies are taken into account or not, potentially biasing subsequent analyses of the classifications. This shows that simply ignoring non-substantive local dependence is not an attractive option in this case.
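For the independence case, posterior classification follows from Bayes' rule with a likelihood that factorizes over items. The sketch below assumes a single latent variable with local independence and uses illustrative probabilities; when local dependence parameters are freed, the likelihood term no longer factorizes in this way, which is one route by which the two sets of classifications can differ.

```python
import numpy as np


def posterior(y, class_probs, item_probs):
    """Pr(class | y) for one binary response pattern under local independence.

    class_probs : length-T prior class sizes
    item_probs  : T x J matrix of Pr(y_j = 1 | class)
    """
    y = np.asarray(y)
    class_probs = np.asarray(class_probs, dtype=float)
    item_probs = np.asarray(item_probs, dtype=float)
    # Likelihood of the pattern within each class: product over items.
    lik = np.prod(np.where(y == 1, item_probs, 1.0 - item_probs), axis=1)
    joint = class_probs * lik
    return joint / joint.sum()


# Illustrative values: two classes, two items, both answered "voted".
print(posterior([1, 1], class_probs=[0.5, 0.5], item_probs=[[0.9, 0.9], [0.1, 0.1]]))
```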

Fig. 5 Posterior classifications for voting in 2006 under the dependence model and the independence model.

Summary
Latent class analysis often involves selecting the number of classes. Several approaches to do this have been suggested in the literature, focusing on the purely statistical concerns of balancing model complexity and model fit (for an exception, see Hennig and Liao 2013). Since the identifying assumption of latent class models is local independence, data dependencies that fail to be predicted by the local independence model are absorbed as additional classes. This approach can be expected to work adequately (e.g. Nylund et al. 2007) when the model is correctly specified and all latent classes are of substantive interest. It can also be useful when the analysis is of an entirely exploratory nature and the researcher simply wishes to determine groups that are maximally "different" by some criterion (Anderlucci and Hennig 2014). This paper demonstrated a possible problem with this approach when not all of the model fit violations are of substantive interest. An application demonstrated that it may sometimes be more advantageous to model substantively interesting local dependence as additional discrete latent variables, while modeling nuisance dependencies using additional local dependence parameters rather than additional classes. When estimating misclassification and turnover rates of the decision to vote in an election, increasing the number of classes led to better fit but uninterpretable classes, whereas retaining a lower number of classes or otherwise ignoring the non-substantive local dependence led to bias. The alternative procedure suggested here yielded a model that was more interpretable in the sense that the results corresponded more directly to those of substantive interest to most political scientists. Moreover, the resulting estimates were close to those from independent external validation studies. Local fit measures such as the bivariate residual or the score test can be used as a guide in this procedure.
Although the BVR used here to detect local dependence is closely connected to the score test for local independence, it can be seen as a general measure of model misfit. It could therefore also be used to guide decisions on the number of classes when increasing classes is of interest. In other words, the decision to increase the number of classes or follow the procedure suggested here is not a statistical issue but a theoretical one that should be based on the substantive interest of the researcher.
The approach suggested here does have several limitations. First, it is inapplicable when the substantive interest is not well defined. In such more exploratory cases, increasing the number of classes may be more attractive. Second, when allowing for local dependencies, there is a certain danger that substantively interesting dependencies are inadvertently modeled as nuisance dependencies. In the example application, a survey methodologist's interest might focus exactly on the "consistent answering", for instance. In other words, a consequence of the procedure is that the model selected depends on the goal of the analysis, which must be kept in mind when introducing local dependence parameters. Overall, however, many data analyses in model-based clustering and classification may be more amenable to the approach discussed here than to an increase in the number of classes.