6.1 Introduction

One aim of educational assessment is to find out the strengths and difficulties of students regarding a particular concept or skill (Pellegrino et al., 2001). To achieve this, teachers and researchers construct items, score them as correct or incorrect, and analyze which items a student solved and which they did not. However, as the number of students and items increases, an item-wise interpretation for every student becomes complex and time consuming. Rasch analysis offers a more efficient method to link students’ scores (number of solved items) to the difficulty of items. A Rasch analysis represents students’ abilities and item difficulties on the same continuous scale, so that both can be compared directly. If a student’s ability matches the item difficulty, the student typically has a 50% chance of solving the item. For items with a higher difficulty, the probability of solving decreases, while for items with a lower difficulty it increases.

However, teachers and researchers are often interested in students’ preconceptions, i.e., their ideas about natural phenomena which are based on their experiences, use of language, or inappropriate instruction. Preconceptions are often quite coherent ideas that have explanatory power with respect to multiple phenomena but are to some extent incompatible with scientific concepts, and thus cause erroneous responses to items (Vosniadou, 2019). If we only look at dichotomously coded data, we will not gain insight into the students’ preconceptions that influence a particular response. Instead, we have to code which preconceptions might have caused erroneous student responses. Since multiple ideas about a task can exist, such an analysis will not result in dichotomous data but in multiple categories representing the diverse ideas of students regarding the task. However, if we use multiple tasks, the data can be complex and hard to interpret because many combinations of tasks and categories can exist (e.g., 10 tasks with 3 categories each can result in up to 3^10 = 59,049 combinations). Latent class analysis (LCA) is a method to analyze such complex data by grouping students with similar patterns of responses (combinations of preconceptions and items) into one class. A basic LCA requires only categorical data and no information about correct/incorrect responses or a hierarchical order of categories. By describing and interpreting the response patterns of students grouped into the same class, or group, researchers can investigate whether and which coherent conceptions the students share and quantify the influence of each conception on a student’s response. Thus, LCA offers qualitatively different information about students’ responses than methods based on dichotomous coding, while additionally reducing the complexity of categorical data.

In this section, we describe the general idea of LCA within the Rasch measurement framework based on a concrete example from a study by Schwichow et al. (2022), which utilizes an LCA to analyze patterns of students’ responses to items asking them to design and interpret controlled experiments. We describe the assumptions and the mathematical model underlying LCA. We explain how missing data can be handled in LCA, as well as criteria and conventions for reporting and interpreting the results of an LCA. Finally, we combine the results of the LCA with the findings of a unidimensional Rasch model based on the same data. In an online appendix, we present a step-by-step guide to running the LCA with the open-source R package poLCA (Linzer & Lewis, 2011, 2022). The R script to run the analysis as well as the data set of the example are available as an appendix, too.

6.2 Commonalities and Differences Between LCA and Rasch Analysis

Before discussing the details of the LCA, we will summarize general commonalities and differences between the LCA (details in Sect. 6.4) and the Rasch analysis (details e.g., Rost, 2004; Sick, 2010; Boone et al., 2014; Boone & Staver, 2020) (overview: Fig. 6.1).

Fig. 6.1 Venn diagram of the commonalities and differences between LCA and Rasch analysis. Shared features shown in the overlap: latent variable model, categorical observed data, mixture distribution model, local independence of responses, and aim

Overall, LCA is structurally parallel to the Rasch analysis – the only difference is the scaling level of the latent variable (Rost, 2004). Rasch models result in a continuous ordered ability θ, while LCA models lead to an unordered categorical class membership c, unless further specified. To describe the item-response probability based on a person’s class membership c, we need to take into account that c is a discrete number. We therefore do not calculate (parallel) item functions, as in Rasch analysis, but item profiles.

LCA models are, like Rasch models, mixture distribution models, meaning the latent variable (the continuous θ or the categorical c) is derived in such a way that local independence of the observed variables holds – dependencies between items disappear if only persons from one class c or with one ability θ are considered (Rost, 2004). This assumption is a strong simplification of the complex reasons behind concrete response behavior and might only seldom be true. However, as “all models are wrong but some are useful” (Box, 1979, p. 202), we should still use LCA to reduce the complexity of categorical data but be aware that latent classes are one, but not the only, way to analyze these data.

Beside these commonalities of LCA and Rasch analysis, there are some crucial differences. In Rasch models, there is only one item parameter, the difficulty σ, per item, while LCA models have one item parameter, the item-response probability π, for each item for every class c (assuming dichotomous data).

Additionally, when using a Rasch model, only the sum score matters for a person’s ability θ; it does not matter which items were solved correctly (Rost, 2004). In contrast, LCA does not use such data aggregation. The class c depends on the frequency of the response patterns of all students, and therefore LCA is a full information method (Rost, 2004).

Consequently, LCA needs a larger sample size than Rasch analysis – four to five items need 60–70 persons, and ten items already need around 500 persons (Rost, 2004), whereas a Rasch model for 20 items needs around 200 persons (Yen & Fitzpatrick, 2006). Furthermore, LCA requires complete data sets – that is, the detailed response category of each person for every item – while Rasch analysis can handle missing values. In return, LCA provides a more detailed picture of the response patterns within a population.

Finally, it should be noted that measuring a quantitative latent trait (Rasch analysis) is a special case of measuring a qualitative latent trait (LCA) in which persons with response patterns that have an identical sum score are treated equally (Rost, 2004).

6.3 Empirical Example

We describe the assumptions of LCA and how it works based on a data set from a study that investigates response patterns on items asking students to plan, identify, and interpret controlled experiments. The skills associated with the design and interpretation of controlled experiments are summarized under the term control-of-variables strategy (CVS). Research on elementary school students’ experimentation skills shows that they have basic conceptions of what a controlled experiment is, as they can correctly identify controlled experiments to test given hypotheses and interpret the results of controlled experiments. Students perform poorly, however, when they have to plan experiments and interpret confounded experiments. Furthermore, Siler and Klahr (2012) identified eight misleading preconceptions regarding the design of experiments that cause invalid experimental designs, called design errors (Table 6.1). Examples are designing confounded experiments (besides the investigated variable, one or more additional variables differ between conditions) or non-contrastive experiments (contrasting identical conditions). Based on students’ oral and written responses, they additionally conducted interviews to identify the underlying preconceptions that produced the design errors (Table 6.1).

Table 6.1 Overview of CVS preconceptions, visible errors and misconceived aspects of the CVS according to Siler and Klahr (2012) complemented by the hotat strategy described by Tschirgi (1980)

The overview in Table 6.1 shows that the preconceptions numbered two to five cause the same design error, resulting in a contrast of confounded or multiply confounded conditions (more than one variable differs between conditions). With respect to the underlying conceptual idea, these preconceptions are quite different from each other, but they nonetheless all lead to the design of confounded experiments as a visible result. Reasons for designing confounded experiments are that students (1) believe that the additional varying variables have no causal effects, (2) just ignore them, (3) want to test multiple effects in the same experiment, or (4) want to produce extreme differences on the dependent variable. The remaining four preconceptions cause unique design errors. Students who understand the logic of controlling variables but who have problems identifying the correct independent variable will compare controlled conditions but focus on the wrong independent variable (according to the given hypothesis). If students understand the importance of controlling variables, they might over-interpret this idea and thus build “totally fair conditions” that differ in no variable and are thus non-informative. Students who do not understand why conditions are compared may design an “experiment” consisting of only one condition and neglect to compare at least two different conditions. Tschirgi (1980) describes a further design error that she calls “hold-one-thing-at-a-time” (hotat). In this case, students design experiments in which the independent variable does not vary between conditions, whereas all other variables differ. This design error is thus a special case of a confounded experiment. It is based on the idea that one has to find the one and only variable that has to be held constant in order to get identical results under varying conditions.

6.3.1 Sample and Assessment Instrument

Students from 17 elementary schools (grades 2–4; ages 7–10) in southern Germany (in the state of Baden-Württemberg) participated in this study. The final sample of our analyses is N = 496 (48% female, 51% male). To assess students’ CVS skills, we adapted and extended the CVS Inventory (CVSI) by Schwichow et al. (2016) to the elementary school level. For each context (dropping of parachutes, floating of boats, scaling of lemonade, crawling of snails), we created four items: two items for the identification subskill (students identify one controlled experiment out of a selection of four experiments) and one each for the interpretation subskill (students interpret the outcome of controlled experiments) and the understanding subskill (students reject drawing causal inferences from confounded experiments). We did not create items for the CVS subskill planning because this subskill is very challenging for primary school students (Bullock & Ziegler, 1999; Peteranderl & Edelsbrunner, 2020) and we wanted to keep a reasonable test length. Every item presents a short story and a hypothesis about a causal relationship. The adapted version of the CVSI includes 16 multiple-choice/multiple-select items across four contexts (questionnaire available in Schwichow et al., 2022).

6.3.2 Data Scaling and Descriptive Statistics

We assigned every response option to the design error it represents. For missing responses, we used hot deck imputation (Sect. 6.4.2). We combined items of the same item type from different contexts so that every student can be characterized by two responses within each of the four item types. In the following sections, we will cut this original data set in half and use only one item per item type, which makes the presentation of the general idea of the LCA in Sect. 6.4 less complex and easier to follow. Based on this categorization, we calculated the observed frequencies of design errors for every item type (Table 6.2).

Table 6.2 Observed frequency of students’ design errors

6.3.3 Research Questions

Table 6.3 shows the categorical responses of four students for the four items Xi. The value 1 represents a correct understanding of CVS; values 2–7 represent a wrong response with its associated design error (as described in Table 6.2). From a quantitative point of view, we can count the correct responses and design errors (as shown for all students in Table 6.2), but we cannot determine whether there are qualitative differences between the response patterns. We presume that different patterns of design errors are based on different underlying CVS misconceptions. However, we do not have strong theoretical assumptions about the number of sub-groups within the students or about how the misconceptions are manifested in the responses.

Table 6.3 Response pattern of design errors for four students. p represents the person ID of the data set in the supplemental material

Thus, our first research question is: Which patterns of design errors regarding the CVS subskills can be identified within elementary school students based on an LCA? (Examine a categorical/discrete latent variable.)

Moreover, we expect that items representing different CVS subskills differ in their item difficulty because each subskill requires specific cognitive tasks. Items of the identification subskill require only the selection of a controlled experiment and thus might be easier than items that require justified conclusions (subskills: interpretation and understanding).

This assumption is further investigated based on a Rasch analysis in research question 2: How difficult are the items of different CVS subskills for elementary school students? (Examine a continuous latent variable.)

Finally, we expect to get different perspectives on students’ CVS skills from LCA and Rasch analysis. Thus, our third research question is: What additional insight can be gained by combining the results of the LCA and the Rasch analysis?

We assume that the reader of this chapter is familiar with the basic principles of Rasch analysis (for an introduction see, e.g., Boone et al., 2014; Boone & Staver, 2020). However, we explain LCA, as a different type of latent variable model, in more detail by describing its theoretical background with reference to the empirical example.

6.4 Latent Class Analysis

6.4.1 Overview & Introduction

In the 1950s, Paul Lazarsfeld developed a conceptual foundation of latent structure models (Andersen, 1982). The concept and methods were refined over time and Lazarsfeld and Henry (1968) published the first detailed work (conceptual and mathematical) covering LCA (Collins & Lanza, 2010). By now, LCA is widely used in social sciences (e.g., Collins & Lanza, 2010; Davier & Lee, 2019; Fulmer et al., 2014; Vincent-Ruz & Schunn, 2021) along with other latent variable models.

Latent variable models aim to describe an underlying latent construct (i.e., a construct that is not directly measurable) based on observed data. The observed data as well as the latent variable can be continuous or categorical/discrete. As a result, there are four possible combinations that form the different latent variable models (Table 6.4). Models based on Rasch’s (1960) measurement theory utilize discrete data (often dichotomous) to construct continuous latent variables. In contrast, in LCA both the observed data and the latent variable are discrete categories, and neither has to be ordinal (Rost, 1988). Thus, the aim of LCA is not to arrange persons on a latent continuum but to group persons into “classes” that show similar response patterns (Rost, 2004).

Table 6.4 Overview of latent variable models

In our empirical example, the observed data are the categorical design errors, as shown in Table 6.2. The latent variable is the underlying CVS misconception, also a categorical variable. Our aim is to identify classes, or groups, of students that show qualitatively similar response patterns in their design errors.

6.4.1.1 Assumptions of the LCA and the General Model

LCA uses probabilistic procedures to identify classes of qualitatively different response patterns. In doing so, LCA makes no assumptions about the response patterns or the number of classes. Furthermore, it does not require that all items have the same number of categories (Collins & Lanza, 2010; Davier & Lee, 2019; Rost, 2004). In our empirical example, items 1 and 2 have four categories, while items 3 and 4 have six categories (Table 6.2). However, as with all probabilistic methods, LCA rests on assumptions that we will explain next, together with the theoretical background and the general (mathematical) model of the LCA based on our empirical example. We will use the four students from Table 6.3 and an already estimated model with three classes. In section “Choosing a model”, we will explain why we have chosen this model.

Annotation to Mathematical Notations

  • Class: c ∈ {1, …, C}

  • Person: p ∈ {1, …, N}

  • Item: i ∈ {1, …, j}

  • Response: x ∈ {1, …, m}

  • Person’s response to an item: Xpi = x; for example X14 = 2: the response of person 1 to item 4 is 2.

  • Person’s response pattern: \( {\underset{\_}{x}}_p \); for example the response pattern for Theo is \( {\underset{\_}{x}}_{Theo}=1,3,4,3 \).

  • Person’s conditional probability: P(response| condition); for example P(Xp3 = 4| c = 1): the probability of a student within class 1 responding to item 3 with 4.

  • Unconditional probability: P(response); for example P(Xp3 = 4): the probability over all students responding to item 3 with 4.

6.4.1.2 Assumption 1: Constant Response Probability for all Persons Within a Class

Within a class, every person has the same probability of responding to a specific item Xi (Collins & Lanza, 2010; Rost, 2004). In the 3-class model of the LCA, both Theo and Julia are assigned to class 2. Therefore, both Theo and Julia have the same probability of 74% (Table 6.5) of responding to item 1 with correct understanding of CVS: P(XTheo, 1 = 1| c = 2) = P(XJulia, 1 = 1| c = 2) = .74. Students in other classes, like Tim (class 1) and Aylin (class 3), may have different probabilities of responding to item 1 with correct understanding of CVS. In a model that fits well, the estimated expected frequencies of responses should closely match the observed frequencies (Collins & Lanza, 2010). Therefore, we can verify the probability estimated by the model by counting the observed responses in the class. In class 2, 79 out of 107 students responded with correct understanding of CVS, which is, based on the law of large numbers, equivalent to a probability of 79/107 = 74%, matching the item-response probability estimated by the latent class model.

Table 6.5 Estimated model parameters πixc for item X1 in the 3-class model

In general, we can equate the conditional item-response probability to response x for all students:

$$ P\left({X}_{1i}=x|c\right)=P\left({X}_{2i}=x|c\right)=\dots =P\left({X}_{pi}=x|c\right)={\pi}_{ixc} $$
(6.1)

We use the probability parameter πixc to describe the item-response probability for response x to item i in class c (Table 6.5 for item 1, Appendix Table 6.1 for all other parameters). Within each class, the πixc of an item i need to sum to 1 across all responses x (e.g., each row in Table 6.5; within rounding errors). For good latent class separation, different classes should have different item-response probabilities for (at least some) items, so the classes are conceptually distinct and can be labeled according to these differences (e.g., correct CVS understanding vs. change of too many variables) (Collins & Lanza, 2010).

In the previous example, 74% is a fairly high probability of responding with correct CVS. Nevertheless, a student assigned to class 2 can still respond differently, like Julia (Table 6.3). The observed responses of a student are determined by their latent trait (class membership) and a random error (Collins & Lanza, 2010), so two students with the same latent trait (Theo and Julia) can respond differently, which is represented by the item-response probability. The closer the item-response probabilities within a class are to 0 or 1 (0 to .2 and .8 to 1), the higher is the homogeneity of the latent class – the members of the latent class are more likely to provide the same observed response patterns (Collins & Lanza, 2010). Furthermore, persons have a particular probability of being members of any class (section “Assigning the class membership”).

6.4.1.3 Assumption 2: Disjunctive and Exhaustive Latent Classes With Unknown Prevalence

Every person is a member of one class (exhaustive) and one class only (disjunctive), which represents the person’s latent trait (Collins & Lanza, 2010; Rost, 2004). We use the probability parameter πc to describe the probability that a randomly chosen student from the data set is a member of a specific class c – the expected class prevalence (Table 6.6). Moreover, because the class is a latent variable, the class membership of individual persons estimated by the model is probabilistic. Every person has a particular probability of membership in every class (section “Assigning the class membership”).

Table 6.6 Expected class prevalence πc for the 3-class model

Additionally, we know, based on the second assumption (disjunctive and exhaustive classes), that the expected class prevalences across all classes need to sum up to 1 (Eq. 6.2).

$$ \sum \limits_{c=1}^C{\pi}_c=1 $$
(6.2)

The number of classes C is not a parameter of the model. Although it is unknown, it is not estimated within the model (Rost, 2004). To decide on the number of classes, we either need a strong theoretical assumption (confirmatory approach) or we need to estimate several models (C = 1, C = 2, C = 3, …) and compare their goodness of fit (Sect. 6.4.4, explorative approach).

Assumption 1 gives us πixc – the conditional item-response probability of every student within a particular class c responding to an item i with response x. With πc from assumption 2, we know the expected class prevalence. Multiplying the expected class prevalence with the conditional item-response probability will give us, summed up over all classes, the unconditional item-response probability P(Xpi = x) of all students (Rost, 2004; Eq. 6.3).

$$ P\left({X}_{pi}=x\right)=\sum \limits_{c=1}^C{\pi}_c\cdotp {\pi}_{ixc} $$
(6.3)

We can check the plausibility of Eq. 6.3 with an example. We use the conditional item-response probability π11c for responding to item 1 with correct CVS (Table 6.5) and the expected class prevalences estimated for the 3-class model (Table 6.6). We calculate the unconditional item-response probability following Eq. 6.3: P(Xp1 = 1) = .53 ∙ .89 + .24 ∙ .74 + .23 ∙ .54 = .77. In a well-fitting model, the expected frequencies should again closely match the observed frequencies. In our sample, 384 out of 496 students showed correct CVS on item 1 (Table 6.2). Based on the law of large numbers, the observed probability over all students of responding to item 1 with correct CVS is 384/496 = .77 – the same probability as calculated from Eq. 6.3.
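A minimal R sketch of this calculation, using the rounded parameter values from Tables 6.5 and 6.6:

```r
# Eq. 6.3 for item 1 and response "correct CVS" (x = 1), using the rounded
# class prevalences (Table 6.6) and item-response probabilities (Table 6.5)
prevalence   <- c(.53, .24, .23)  # pi_c for classes 1-3
pi_item1_cvs <- c(.89, .74, .54)  # pi_1,1,c = P(X_p1 = 1 | c)

sum(prevalence * pi_item1_cvs)    # approx. .77, matching the observed 384/496
```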

6.4.1.4 Assumption 3: Homogeneity of Items

Within the latent class model, we assume that all items measure the same latent variable (Collins & Lanza, 2010; Rost, 2004), the categorical underlying CVS misconception. Therefore, we can calculate an unconditional pattern probability \( P\left({\underset{\_}{x}}_p\right) \) (Eq. 6.4) in the same fashion as Eq. 6.3 by multiplying the expected class prevalence with the conditional pattern probability \( P\left({\underline{x}}_p|c\right) \).

$$ P\left({\underline{x}}_p\right)=\sum \limits_{c=1}^C{\pi}_c\cdotp P\left({\underline{x}}_p|c\right) $$
(6.4)

6.4.1.5 Assumption 4: Local Independence of Observed Responses

Only a person’s membership of a specific latent class c explains the different observed responses to an item (Collins & Lanza, 2010; Rost, 2004). Therefore, the probability of a student within a latent class c responding in a particular pattern P(1, 3, 4, 3|c), like Theo, is the same as the product of the probability of every item: P(1, 3, 4, 3|c) = P(X1 = 1|c) ∙ P(X2 = 3|c) ∙ P(X3 = 4|c) ∙ P(X4 = 3|c).

We can formulate the general conditional probability for a response pattern \( {\underset{\_}{x}}_p \):

$$ P\left({\underline{x}}_p|c\right)=P\left({X}_{p1}=x|c\right)\cdotp \dots \cdotp P\left({X}_{pj}=x|c\right)=\prod \limits_{i=1}^jP\left({X}_{pi}=x|c\right)=\prod \limits_{i=1}^j{\pi}_{ixc} $$
(6.5)

6.4.1.6 General Model

If we combine the assumptions and Eqs. 6.4 and 6.5, we can set up the general model of the LCA (Collins & Lanza, 2010; Rost, 2004). Eq. 6.6 describes the unconditional pattern probability \( P\left({\underline{x}}_p\right) \) as a function of the expected class prevalence πc and the conditional item-response probability πixc. \( P\left({\underline{x}}_p\right) \) describes the estimated probability of a specific observed response pattern \( {\underline{x}}_p \) being present in a data set, under the assumption that the LCA model is valid (Rost, 2004). For example, Theo’s observed pattern \( {\underline{x}}_{\mathrm{Theo}}=1,3,4,3 \) has an estimated probability of 4.66 ∙ 10−3 based on the parameters of our 3-class model (Table 6.6 and Appendix Table 6.1). For a well-fitting model, we want to maximize the estimated probability of all observed pattern frequencies (maximum likelihood approach, Sect. 6.4.3).

$$ P\left({\underline{x}}_p\right)=\sum \limits_{c=1}^C{\pi}_c\cdotp \prod \limits_{i=1}^j{\pi}_{ixc} $$
(6.6)
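Eqs. 6.5 and 6.6 translate directly into a few lines of R. The following sketch is illustrative only: `probs` stands for a list of item-response probability matrices (rows = classes, columns = response categories) such as those in Appendix Table 6.1, which are not reproduced here.

```r
# Unconditional probability of a response pattern (Eq. 6.6), built from the
# conditional pattern probabilities per class (Eq. 6.5)
pattern_prob <- function(pattern, prevalence, probs) {
  cond <- sapply(seq_along(prevalence), function(cl) {
    # Eq. 6.5: product of the item-response probabilities within class cl
    prod(mapply(function(item_probs, x) item_probs[cl, x], probs, pattern))
  })
  list(conditional   = cond,                     # P(x_p | c) for each class
       unconditional = sum(prevalence * cond))   # P(x_p), Eq. 6.6
}
# With the estimated parameters, pattern_prob(c(1, 3, 4, 3), prevalence, probs)
# should reproduce the value of 4.66e-3 reported for Theo's pattern.
```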

6.4.1.7 Assigning the Class Membership

In the previous sections, we calculated the probability of a response pattern conditional on a distinct class membership: \( P\left({\underline{x}}_p|c\right) \) (Eq. 6.5). However, to estimate how precisely the observed variables measure the underlying latent variable, we have to turn things around (Collins & Lanza, 2010). By calculating the probability of class membership conditional on a distinct response pattern, \( P\left(c|{\underline{x}}_p\right) \), we can determine the probability of each person’s latent class membership (based on the response pattern) and therefore assign a class membership.

We calculate the conditional probability of the class membership c of a person with the response pattern \( {\underset{\_}{x}}_p \) in Eq. 6.7 based on Bayes’ theorem (Rost, 2004) using Eq. 6.5. This probability is also called classification probability or posterior probability (Collins & Lanza, 2010).

$$ P\left(c|{\underset{\_}{x}}_p\right)=\frac{\pi_c\cdot P\left({\underset{\_}{x}}_p|c\right)}{\sum \limits_{c=1}^C{\pi}_c\cdot P\left({\underset{\_}{x}}_p|c\right)} $$
(6.7)

To assign a person to a class, we need to estimate the person’s classification probability for all classes and assign the person to the class with the highest classification probability. We can check the assignment by calculating the classification probabilities based on Eq. 6.7 for the 3-class model by using the probability of Theo’s response pattern in different classes (Eq. 6.5 with parameters πixc from Appendix Table 6.1) and the expected class prevalence πc (Table 6.6). The classification probability is clearly the highest for class 2 (71%) – the known class membership of Theo.

$$ P\left(1|1,3,4,3\right)=\frac{\pi_1\cdot P\left(1,3,4,3|1\right)}{\sum \limits_{c=1}^C{\pi}_c\cdot P\left(1,3,4,3|c\right)}=\frac{.53\cdot 0\cdot {10}^{-3}}{4.66\cdot {10}^{-3}}=.00 $$
$$ P\left(2|1,3,4,3\right)=\frac{\pi_2\cdot P\left(1,3,4,3|2\right)}{\sum \limits_{c=1}^C{\pi}_c\cdot P\left(1,3,4,3|c\right)}=\frac{.24\cdot 13.86\cdot {10}^{-3}}{4.66\cdot {10}^{-3}}=.71 $$
$$ P\left(3|1,3,4,3\right)=\frac{\pi_3\cdot P\left(1,3,4,3|3\right)}{\sum \limits_{c=1}^C{\pi}_c\cdot P\left(1,3,4,3|c\right)}=\frac{.23\cdot 5.81\cdot {10}^{-3}}{4.66\cdot {10}^{-3}}=.29 $$
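The same computation in a short R sketch, using the conditional pattern probabilities quoted in the equations above (poLCA returns these posterior probabilities for all students in its `posterior` matrix):

```r
# Eq. 6.7 for Theo's pattern 1,3,4,3 in the 3-class model
prevalence   <- c(.53, .24, .23)            # pi_c (Table 6.6)
cond_pattern <- c(0.00, 13.86e-3, 5.81e-3)  # P(1,3,4,3 | c) from Eq. 6.5

posterior <- prevalence * cond_pattern / sum(prevalence * cond_pattern)
round(posterior, 2)  # .00 .71 .29 -> Theo is assigned to class 2
```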

Every person who is an assigned member of a class still has particular classification probabilities for the alternative classes; for example, Theo has a probability of 29% of having the latent trait of class 3. Moreover, in all data sets there will always be individuals with very ambiguous classification probabilities (Collins & Lanza, 2010). However, a good model with high homogeneity of the latent classes and good class separation will – over all students – produce classes where the classification probability for one class is high and for the other classes low (Collins & Lanza, 2010). From the mean classification probabilities of all students (Table 6.7), we calculate the “hit rate” – a measure similar to reliability (Rost, 2004). For example, a person with class membership 2 has a hit rate of 82% of being in class 2, and only a probability of 5% (class 1) / 13% (class 3) of being a member of the other two classes. Our model produces a relatively good hit rate (>.85 is recommended; Rost, 2004) for all classes (diagonal in Table 6.7); thus, we conclude a reasonably good reliability of the model based on the halved original data set.

Table 6.7 Mean classification probability; hit rate in the diagonal

When a model is estimated several times using different starting values (Sect. 6.4.3), the same latent classes will be produced, but their order might vary: class 1 in solution 1 might be labeled class 3 in solution 2. This effect is called “label switching”. Label switching is not a statistical problem; however, it can be confusing (Collins & Lanza, 2010). It is helpful to re-name the classes in descending order of their expected prevalence, so that the labeling is always the same.

Still, it is not recommended to base subsequent analyses on the class assignment derived from the classification probabilities (except as a rough exploratory or heuristic device) (Collins & Lanza, 2010), because doing so can underestimate the uncertainty that lies in the class probabilities of every student.

6.4.2 Missing Data

So far, we have assumed that our sample is complete and that we do not have to deal with missing data. However, when collecting data, especially in an educational context, missing values are frequent. We can distinguish two types of missing data: values that are missing because students did not choose a response for an item (missing at random) and values that are missing because test booklets were used (missing by design).

As we describe in Sect. 6.4.3, the LCA uses the complete response patterns to estimate the model parameters. Therefore, a complete sample is necessary for running an LCA. Nevertheless, imputation methods can replace missing values with plausible estimated values.

In performance tests, coding missing values as “wrong response” is widely used. Even though the responses in our CVS test include “correct CVS understanding,” there is no general category for “wrong response” because we collected qualitative data about design errors – although it is also possible to use the category “missing” to gain further insight into the reasons for missing data (Rost, 2004). Therefore, if we want to replace missing responses, we have to impute the responses the students presumably would have given. In our original data sample, less than 3% of the data was missing. We utilized “hot deck imputation” to impute values for missing responses (Rubin, 1987). This technique selects a (weighted) random value out of the distribution of the given responses. For example, if overall 77% of the responses to an item were correct CVS, an imputed response has a probability of 77% of being correct CVS. With hot deck imputation, we generated similar variation in the imputed data as in our complete data. This non-parametric method has several advantages (e.g., any type of variable can be replaced and no strong distributional assumptions are required) (Pérez et al., 2002). If we know the distribution of the chosen responses for each item and the number of missing cases, we can simply calculate how often each response should appear in the missing data and randomly allocate these responses to the missing cases.
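A minimal sketch of such an imputation step, assuming the responses are stored in a data.frame with NA marking missing values (the script used for the actual analysis is provided in the online appendix):

```r
# Hot deck imputation: missing responses to an item are replaced by draws
# from the observed response distribution of that item
hot_deck <- function(data) {
  as.data.frame(lapply(data, function(item) {
    miss <- is.na(item)
    if (any(miss)) {
      # sampling with replacement reproduces each category proportionally
      # to its observed frequency
      item[miss] <- sample(item[!miss], sum(miss), replace = TRUE)
    }
    item
  }))
}
```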

6.4.3 Parameter Estimation With Maximum Likelihood Approach

As seen in section “General model”, the model for the LCA only includes the expected class prevalence πc and the conditional item-response probability πixc. Because these probability parameters sum up to 1 (Collins & Lanza, 2010), the number of model parameters t that has to be estimated for a model with C classes and j items with m categories each is:

$$ t=C\cdotp \left(j\cdotp \left(m-1\right)+1\right)-1 $$
(6.8)

For our example, a C = 3 class model with 4 items (two items with m = 4 categories and two items with m = 6 categories), we need to estimate t = 3 ∙ (2 ∙ (4 − 1) + 2 ∙ (6 − 1) + 1) − 1 = 50 parameters.
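The count generalizes to any mix of category numbers; a small sketch:

```r
# Number of free parameters of a C-class model (generalisation of Eq. 6.8);
# m is a vector with the number of categories of each item
n_lca_params <- function(C, m) {
  C * (sum(m - 1) + 1) - 1
}
n_lca_params(C = 3, m = c(4, 4, 6, 6))  # 50, as calculated above
```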

We use LCA in an explorative way and therefore we have no strong theoretical assumptions for restricting the parameters. However, it is possible to fix parameters (e.g., fixing the expected class prevalence πc to equal sized classes) or constrain parameters (e.g., constraining that two item-response probability parameters πixc should be equal between the classes). With restricted parameters, LCA makes it possible to test different a priori hypotheses (Rost, 2004).

We use the maximum of the likelihood function (Rost, 2004; Eq. 6.9) to estimate the parameters (maximum likelihood estimation, MLE). In this way, the estimated parameters πc and πixc represent the model under which the probability of all observed response patterns is maximal. As in Rasch analysis (Rost, 2004), we obtain the likelihood function L as the product of the unconditional pattern probabilities \( P\left({\underline{x}}_p\right) \) from the general model (section “General model”).

$$ L=\prod \limits_{p=1}^NP\left({\underline{x}}_p\right) $$
(6.9)

A basic approach to finding the maximum of the likelihood function is to take its derivative and set it equal to zero (Rost, 2004). As stated before, because the latent variable is categorical, this method cannot be used for LCA. Instead, LCA uses the EM algorithm (E for “expectation” and M for “maximization”; Dempster et al., 1977) to estimate the model parameters at the maximum of the likelihood function (MLE). The EM algorithm is an iterative process in which every step yields a higher value of the likelihood function (Rost, 2004).

Overall, there are several aspects determining if a latent class model can be found with the MLE approach. A large sample size N, a large ratio between observed response patterns/sample size and possible response patterns (low sparseness), a good latent class homogeneity, and good class separation benefit the identification of a latent class model (Collins & Lanza, 2010).

One noteworthy problem of MLE in LCA is that the likelihood function might have several local maxima, so we might only find a local, but not the global, maximum (Collins & Lanza, 2010). This happens more frequently with five or more classes and 12 or more items (Rost, 2004) or with complex categories. To minimize the probability of finding only a local maximum, we can use different (random) start values and compare the maximum likelihoods of the estimated model parameters. If we have found the global maximum, other start values should lead to the same maximum likelihood. Collins and Lanza (2010) recommend estimating at least 10 models.
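A hedged sketch of the estimation with poLCA (Linzer & Lewis, 2011); the data.frame `d` and the item names ID1, ID2, IN, and UN are illustrative placeholders for our four categorical items, coded as integers starting at 1:

```r
library(poLCA)

# formula for an LCA without covariates
f <- cbind(ID1, ID2, IN, UN) ~ 1

# nrep repeats the EM algorithm with different random start values to reduce
# the risk of reporting only a local maximum
m3 <- poLCA(f, data = d, nclass = 3, maxiter = 5000, nrep = 10)

m3$llik   # maximum log-likelihood over the repeated runs
m3$probs  # conditional item-response probabilities pi_ixc
m3$P      # expected class prevalences pi_c
```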

6.4.4 Goodness of Fit / Model Selection

Even after estimating a model, we do not know whether the estimated model fits the data. We need this information, for example, to decide on the number of classes by comparing the fit of several models (C = 1, C = 2, C = 3, …), because the number of classes is not a model parameter. However, there is no general “yes/no” answer to the question of model fit – every estimated model fits the data more or less well, and we need criteria to set boundaries for an acceptable fit (Rost, 2004). Besides the model fit, we need to consider the complexity of the model that produces a good fit. In LCA, the best fitting model, the so-called saturated model, is a model with as many classes as there are observed response patterns (161 classes in our example), resulting in a very large number of model parameters. In general, the more classes we choose, the better the fit might be, but at the cost of a more complex model. Lastly, we need to consider the state of research and the potential for theoretical insight. We need to keep in mind that model selection always requires statistical and theoretical considerations, and a good fit alone cannot be used as “proof” of a psychological structure (Edelsbrunner & Dablander, 2019). Therefore, model selection is, within certain statistical boundaries, a judgement call (Collins & Lanza, 2010).

6.4.4.1 Simplicity and Usefulness

Rost (2004) suggests that the aim of building a theory (in our example: explaining design errors by underlying misconceptions) is not solely to achieve the best possible statistical fit but also to rely on few and simple assumptions. Models should be empirically valid and as simple as possible. This philosophical principle, quite similar to Ockham’s razor, is called “parsimony” (Collins & Lanza, 2010). Therefore, a simpler model should be preferred to a more complex one whenever the more complex model does not provide substantially higher theoretical insight, usefulness, or interpretability (Collins & Lanza, 2010).

6.4.4.2 Likelihood Ratio Test G2 and Observed Pattern Frequency Test χ2

The likelihood of a model gives insight into how well the model fits the data – the higher the likelihood, the better the model fit. In Sect. 6.4.3, we calculated the likelihood of a latent class model by multiplying the unconditional pattern probabilities. The likelihood of one estimated model can be compared to the likelihood of other (nested) estimated models or to the saturated model (Collins & Lanza, 2010; Rost, 2004). Of course, we can only compare the likelihoods of models that are estimated on the same data.

The empirical comparison of the likelihoods is based on a likelihood ratio test: \( \mathrm{LR}=\frac{L_0}{L_1} \). The two compared models have to be nested. Models are statistically nested when one model (L0) is a restricted version of the other model (L1) (Collins & Lanza, 2010). Furthermore, the restrictions must not be achieved by restricting model parameters to 0. In other words, all compared models need to have the same number of classes and response patterns but can differ regarding the allocation of response patterns to specific classes based on fixed or constrained model parameters. If these requirements are met, we can calculate a χ2 distributed test statistic G2 (Eq. 6.10) with the difference in the number of model parameters t as degrees of freedom df = t(L1) − t(L0) (Rost, 2004).

$$ {G}^2=-2\cdotp \log \left(\frac{L_0}{L_1}\right) $$
(6.10)

As an alternative to the likelihood ratio test, we can compare the observed frequency of patterns (\( {o}_{\underline{x}} \)) to the expected frequency of patterns (\( {e}_{\underline{x}} \)) implied by the model parameters using Pearson’s χ2 test statistic (Rost, 2004; Eq. 6.11). A lower value of χ2 is associated with a better model fit (Collins & Lanza, 2010; Linzer & Lewis, 2011). Again, we get a χ2 distributed test statistic with the same degrees of freedom as in the likelihood ratio test against the saturated model. The likelihood ratio test and the χ2 test usually lead to the same results (Rost, 2004).

$$ {\chi}^2=\sum \limits_{\underline{x}}\frac{{\left({o}_{\underline{x}}-{e}_{\underline{x}}\right)}^2}{e_{\underline{x}}} $$
(6.11)

Unfortunately, we cannot compare models with different numbers of classes using G2, because the test statistic then follows an unknown distribution rather than a χ2 distribution (Collins & Lanza, 2010; Rost, 2004). The model parameters of the LCA are the expected class prevalences πc and the conditional item-response probabilities πixc, but not the number of classes (section “General model”). Therefore, a model with fewer classes can only be obtained by setting a model parameter of the model with more classes to 0 (e.g., for three classes C = 3, the expected class prevalence of a fourth class is zero: π4 = 0). Furthermore, comparisons to the saturated model with G2 as well as the χ2 test for the pattern frequencies are not allowed for data sets with large sparseness, because then the test statistic also has an unknown distribution (Collins & Lanza, 2010; Linzer & Lewis, 2011; Rost, 2004). To make sure data are not sparse, 80% of the possible patterns should have an expected frequency greater than 5 and no possible pattern should have an expected frequency of less than 1 – therefore, as a rule of thumb, the sample size should equal at least the number of possible patterns multiplied by 5 when using a χ2 test (McHugh, 2013). Sparseness occurs quickly in more complex latent class models with many latent classes (Collins & Lanza, 2010; McHugh, 2013). Our data are sparse: the number of observed patterns (161) is much smaller than the number of possible patterns (576).
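Both statistics can be computed directly from the observed and the model-implied expected pattern frequencies; a minimal sketch (poLCA reports them as `Gsq` and `Chisq` in the fitted object):

```r
# G^2 (Eq. 6.10, against the saturated model) and Pearson's chi^2 (Eq. 6.11)
# obs: observed frequency of every possible response pattern
# exp_freq: expected frequency N * P(x_p) implied by the model (Eq. 6.6)
fit_stats <- function(obs, exp_freq) {
  c(G2   = 2 * sum(ifelse(obs > 0, obs * log(obs / exp_freq), 0)),
    chi2 = sum((obs - exp_freq)^2 / exp_freq))
}
```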

6.4.4.3 Bootstrapping

To solve the problem of an unknown distribution of the test statistic, Aitkin et al. (1981) (also see Davier, 1997; Rost, 2004) suggest performing a parametric bootstrap. During parametric bootstrapping, random data sets are resampled repeatedly out of a distribution based on the original model’s estimated parameters. As we now have multiple data sets, we can estimate the distribution of the test statistics from these resampled data.

After resampling the data, we compare test statistics of the original model with the now known empirical distribution of the test statistics. As test statistic, we can use G2 or χ2 (section “Likelihood ratio test G2 and observed pattern frequency test χ2”). However, with sparse data, only the use of χ2 is recommended (Davier, 1997).

We can localize the test statistic of the original model within the empirical distribution of the resampled models. Similar to a normal χ2-test with a χ2 distribution, we can conclude the original model fits the data if the p-value is bigger than .05 because then there is no significant difference between the original model and most of the resampled models (Rost, 2004).
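The resampling step can be sketched in base R as follows; `prevalence` and `probs` denote the estimated parameters (as in the earlier sketches) and `chi2_obs` the χ2 statistic of the original model, while the refitting step is only indicated, since it depends on the estimation function used:

```r
# Simulate one data set from the estimated LCA parameters: draw a class for
# every person, then draw each item response from that class's probabilities
simulate_lca <- function(N, prevalence, probs) {
  cl <- sample(seq_along(prevalence), N, replace = TRUE, prob = prevalence)
  sapply(probs, function(item_probs)
    sapply(cl, function(c_i) sample(ncol(item_probs), 1, prob = item_probs[c_i, ])))
}

# Parametric bootstrap (schematic):
# boot_chi2 <- replicate(500, {
#   d_sim <- as.data.frame(simulate_lca(496, prevalence, probs))
#   ... refit the latent class model on d_sim and return its chi^2 ...
# })
# p_boot <- mean(boot_chi2 >= chi2_obs)  # empirical p-value
```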

6.4.4.4 Information Criteria

We can use information criteria to determine which model is better in relative terms (Collins & Lanza, 2010; Rost, 2004). The only restriction on the use of information criteria is that the models must be estimated on the same data. On the downside, we do not know how much lower the information criterion of a model needs to be to indicate a substantially better fit.

Following the simplicity criterion, models should be empirically valid and as simple as possible. Information criteria consider both requirements because they combine the log-likelihood log(L) (a measure of empirical fit) with the number of model parameters t (a measure of complexity). Additionally, the sample size N is used. There are several information criteria that differ in how they weight the number of model parameters t. We calculate four information criteria: the well-known AIC (Akaike’s information criterion; Akaike, 1974) and BIC (Bayesian information criterion; Schwarz, 1978) as well as the CAIC (consistent AIC; Bozdogan, 1987) and the aBIC (adjusted BIC; Sclove, 1987). Lower values are associated with a better model.

$$ \mathrm{AIC}=-2\log (L)+2\cdotp t $$
$$ \mathrm{BIC}=-2\log (L)+\log (N)\cdotp t $$
$$ \mathrm{CAIC}=-2\log (L)+\left[\log (N)+1\right]\cdotp t $$
$$ \mathrm{aBIC}=-2\log (L)+\log \left(\frac{N+2}{24}\right)\cdotp t $$

The AIC takes an equal weighting of the log-likelihood and the number of parameters. The BIC weights the number of model parameters more, based on the sample size. The CAIC is a correction of the AIC for bigger sample sizes. The aBIC further adjusts the weight of the sample size.
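For reference, the four criteria as a small R function of the log-likelihood, the number of model parameters, and the sample size:

```r
# Information criteria from log-likelihood logL, number of parameters t,
# and sample size N (formulas as given above)
info_criteria <- function(logL, t, N) {
  c(AIC  = -2 * logL + 2 * t,
    BIC  = -2 * logL + log(N) * t,
    CAIC = -2 * logL + (log(N) + 1) * t,
    aBIC = -2 * logL + log((N + 2) / 24) * t)
}
```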

Which information criterion to use depends on the sample size and the aim of the analysis. Dziak et al. (2020) discuss several information criteria, including their sensitivity and specificity. Their simulations for latent class models (Dziak et al., 2020, pp. 13–14) show, for a 3-class model with N around 500 cases, like our empirical example, that the AIC has a substantial and the aBIC a marginal overfitting rate (choosing a model with too many classes), whereas the BIC and CAIC have marginal underfitting rates (choosing a model with too few classes). Furthermore, BIC and aBIC have a similar correct fit rate close to 100%, whereas AIC and CAIC have a lower fit rate. Based on their simulations, Dziak et al. (2020) provide heuristics for choosing a useful information criterion based on the aim of the analysis. For a sufficiently rich model with more theoretical insight, where describing heterogeneity within the population is more important than the simplicity of the model, or when expecting similar classes that still have distinct differences, AIC or aBIC are preferable to BIC or CAIC. For obtaining fewer, larger classes, BIC is more fitting. If the AIC favors a solution with a large, difficult-to-interpret number of classes, BIC is the better choice. In general, information criteria are often not used to identify one single, best fitting model, but to reduce the number of viable models (Collins & Lanza, 2010). To find an appropriate model, researchers should also consider the usefulness of the found model (Rost, 2004).

6.4.4.5 Hit Rate and Assigned Class Membership

As described in section “Assigning the class membership”, we can calculate a hit rate as the mean classification probability of the students assigned to a class. The hit rate is a measure similar to reliability, and good models should have a high hit rate (>.85; Rost, 2004). Additionally, we report the percentage of students assigned to each class to provide a better understanding of the distribution of students over the classes.

6.4.4.6 Choosing a Model

Until now, we have worked with the 3-class model without giving a reason for this choice. In this section, we explain how we exploratively chose a latent class model for our empirical example based on the goodness of fit statistics (Table 6.8) discussed in Sect. 6.4.4.

Table 6.8 Goodness of fit statistics for the empirical example (N = 496)

As expected, the log-likelihood of the data shows a better fit for models with more classes. The hit rate of the models decreases with more classes but remains acceptable for the 2- and 3-class models. Looking at the empirical p-values of the bootstrap, we clearly need to reject the 1-class model (p < .05) because the empirical data differ significantly from the data resampled from the model parameters. The 2-, 3- and 4-class models show adequate fit according to the empirical χ2 test statistic (p > .05). Following the simplicity criterion, we reject the 4-class model and focus further on comparing the 2-class and 3-class models.
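Fit statistics such as those in Table 6.8 can be collected by estimating models with an increasing number of classes; a hedged sketch continuing the earlier poLCA example (the bootstrap p-values and hit rates are computed separately):

```r
# Estimate 1- to 4-class models and collect basic fit statistics
models <- lapply(1:4, function(C)
  poLCA(f, data = d, nclass = C, maxiter = 5000, nrep = 10, verbose = FALSE))

data.frame(classes = 1:4,
           logL = sapply(models, `[[`, "llik"),
           npar = sapply(models, `[[`, "npar"),
           AIC  = sapply(models, `[[`, "aic"),
           BIC  = sapply(models, `[[`, "bic"))
```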

Comparing the information criteria, the 2-class model has a slightly worse AIC than the 3-class model (by 13), but a better BIC (by 59). The CAIC is also in favor of the 2-class model (by 76), whereas the aBIC of the 2-class model is only slightly better (by 5). As discussed in section “Information criteria”, AIC and aBIC might favor overfitting models, whereas BIC and CAIC might lead to underfitting.

However, for choosing a model, we cannot rely solely on statistical considerations but need to include theoretical aspects such as potential insight and usefulness. If we take a closer look at the actual class distributions, we notice that the potential theoretical insight of a 2-class model is small. In the 2-class model, class 1 (π1 = .52, 58% of the students) includes students with high CVS skill (overall item-response probability for correct CVS: .73) and class 2 (π2 = .47, 42% of the students) includes students with low CVS skill (overall item-response probability for correct CVS: .36). Splitting the sample roughly in half into “good” and “bad” students gives us only little additional insight into patterns of design errors and, therefore, into underlying CVS misconceptions. A 3-class model will split one of the classes found in the 2-class model; we therefore obtain similar classes with distinct differences, which leads to a more differentiated model and favors the AIC for model selection.

In conclusion, we will choose the 3-class model, which shows overall a good model fit (bootstrapping χ2), potentially offers additional theoretical insight into distinct patterns of design errors (Sect. 6.5.1) and is still simple enough (AIC).

6.5 Findings

6.5.1 Finding for Research Question 1: Identifying Patterns of Design Errors With LCA

In this section, we will analyze and interpret the item-response probabilities in the classes and find common characteristics to label the three classes. Figure 6.2 shows the item-response probabilities (Appendix Table 6.1) for the four items.

Fig. 6.2 Item-response probabilities for the four items in the three classes. Note. cvs: correct application of the control-of-variables strategy; cwv: CVS for wrong variable; nce: noncontrastive target variable; ooe: single-condition experiment; ce1 & ce2: confounded experiments; hotat: hold-one-thing-at-a-time

In our example, we describe the distinct underlying preconception in every class based on the probabilities of the design errors in the four item types derived from the CVS subskills: identification 1 + 2 (ID1, ID2), interpretation (IN), and understanding (UN). The first step in describing the latent classes is to find a label that describes well what the response patterns grouped into the same class have in common. To do so, we can refer to dominant response categories (e.g., correct response) and to the shape of the response patterns, like “high ce1 and low correct” (Marsh et al., 2009). In the following, we describe the three classes, label them, and discuss the extent of class homogeneity. We labeled the three classes only by their dominant response patterns and not by the shape, because the classes differ characteristically in their dominant design errors. Overall, our model shows good class separation, because the item-response probabilities differ between the three classes for almost all response options. In general, high class homogeneity and good class separation help to interpret the found model.

Class 1 (correct CVS understanding, π1 = .53, 60% of the students): Students in class 1 are very likely to respond with correct CVS in items regarding all subskills. The item-response probabilities for correct CVS were close to 1 for ID1 (.89) and ID2 (.98). In the subskill ‘interpretation’, they still have the highest item-response probability of all classes (.47). However, the probability for the response ce1 is also comparatively high (.38). We conclude that if students in class 1 make errors, it is mostly the choice of a confounded experiment (two variables change). Nevertheless, class 1 students understand the problem of confounded experiments, as they have a quite high probability of .56 of responding that the experiments are poorly designed because too many variables differ between conditions. Overall, class 1 shows a relatively high class homogeneity, with item-response probabilities for correct CVS being close to 1 (cvs: ID1 .89, ID2 .98) or standing out in relation to the other response options (cvs: IN .47, UN .56).

Class 2 (change of too many variables, π2 = .24, 22% of the students): Students in class 2 basically understand that variables need to differ between contrasted conditions. Nonetheless, they do not comprehend why all non-investigated variables should be kept equal. Accordingly, their item-response probabilities for responses representing design errors associated with the choice or justification of confounded experiments (ce1, ce2) are higher than in class 1. Such design errors appear particularly in interpretation and understanding items. In interpretation items, students are likely to state that in a controlled experiment more variables have to be changed (.15 ce1, .29 ce2 within IN items). In understanding items, they are more likely to respond that a purely planned experiment has too few changed variables (ce1 .38, ce2 .23 within UN items). Overall, class 2 shows a lower class homogeneity than class 1, because no single item-response probability is close to 1. However, the item-response probabilities for changing too many variables (ce1/ce2) stand out in relation to the other response options different from correct CVS.

Class 3 (non-contrastive experiments, π3 = .23, 19% of the students): Students in class 3 basically understand that it is part of CVS to keep variables unchanged. However, they tend to overgeneralize this concept. In the subskill ‘identification’, they are more likely than any other class to show the error ‘only one experiment’ (.16 within ID1) and stick to this concept even if the non-contrastive experiment is the only choice with unchanging variables (.24 within ID2). Furthermore, they are more likely to show these design errors in understanding items, where they respond that the experiment is flawed because all variables are changed (nce .24, hotat .27). However, in interpretation items, the most likely design errors of class 3 students are associated with confounded experiments (ce1 .45). Overall, class 3 also shows a lower class homogeneity than class 1, but the item-response probabilities for changing too few variables (nce/ooe/hotat) are higher than in the other classes.

6.5.2 Finding for Research Question 2: Difficulty of Items With Rasch Analysis

To gain insight into the difficulty of the CVS items, we conducted a unidimensional Rasch analysis. To do this, we transformed the categorical data including the students’ errors into dichotomous data (correct CVS response / not correct). We used the complete data set with 8 items because the halved data set with 4 items is not suitable for a Rasch analysis. Therefore, every item type is present twice, marked with .1 and .2. We utilized the R package TAM to fit our data to a unidimensional Rasch model (MML 1PL; Robitzsch et al., 2022). We set the average latent ability θ of the students to 0 because we want to make statements about the distribution within the students, not the items. To estimate the students’ latent abilities, we drew five plausible values (PVs). The items’ wMNSQ fit statistics range between .96 and 1.08, and the EAP/PV reliability of .60 is acceptable. In summary, our data fit the Rasch model. We use PVs because, compared to EAP or WLE estimates, PVs allow measurement-error-adjusted estimations for the sample (Lüdtke & Robitzsch, 2017). It should be noted, however, that PVs are not suitable as individual estimators (Rost, 2004). For all following computations of statistics based on the PVs, we followed the recommendations of the OECD (2009). The final mean estimate is equal to the average of the five mean estimates. The final error variance was calculated from the final sampling variance and the imputation variance (for details: OECD (2009): PISA Data Analysis Manual SPSS, p. 118).
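A hedged sketch of this analysis with TAM; `resp` stands for the dichotomized data set (8 items, 1 = correct CVS response, 0 = otherwise), the name is illustrative, and option names and defaults (e.g., the constraint on the person distribution) should be checked against the TAM documentation:

```r
library(TAM)

# Unidimensional Rasch model (1PL, marginal maximum likelihood)
rasch <- tam.mml(resp)

rasch$xsi       # item difficulties and standard errors
rasch$EAP.rel   # EAP reliability
tam.fit(rasch)  # item fit statistics (incl. weighted MNSQ)

# Draw five plausible values per student
pv <- tam.pv(rasch, nplausible = 5)$pv
```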

The students’ mean latent ability over all drawn PVs is \( \overline{\theta}=0.00 \) (SD = 1.04). Table 6.9 lists the item difficulties and standard errors. The mean difficulty of the items is −.24, meaning the CVS test is overall slightly too easy for the students.

Table 6.9 Item difficulties of the Rasch model

In the Wright Map, we see a clear separation of the item difficulties (Fig. 6.3). ID1 and ID2 items (M = −1.24, SD = .20) are easier than IN and UN items (M = .77, SD = .38), with a large effect (t(6) = −9.36, p < .001, d = −5.29). We can conclude that identifying controlled experiments (ID1, ID2) is easier than interpreting controlled experiments (IN) or explaining why a confounded experiment cannot be meaningfully interpreted (UN). The different difficulties are consistent with the different requirements of the subskills. For identification, only one controlled experiment had to be selected, whereas for interpretation and understanding, conclusions had to be drawn from (un-)controlled experiments and justified.

Fig. 6.3 Wright Map of the unidimensional Rasch model (respondent distribution and item difficulties on a common logit scale)

To answer research question 2, we used a unidimensional Rasch model. However, as discussed in Brandenburger et al. (2022), the data on students’ CVS skills better fit a multidimensional model based on the subskills. Nevertheless, the fit to the unidimensional model is acceptable for answering research question 2 and the planned combination with an LCA for research question 3.

6.5.3 Finding for Research Question 3: Combination of LCA and Rasch Analysis

To combine the classes found by the LCA with the student scores from the unidimensional Rasch model, we calculated the mean student score of the unidimensional Rasch model within each of the three LCA classes, keeping in mind that the class membership carries unavoidable uncertainty (section “Assigning the class membership”). Overall, the mean student score of class 1 (correct CVS understanding) is above the average of 0 (M = .43, SD = .90, N = 296). The mean student scores of class 2 (change of too many variables) and class 3 (non-contrastive experiments) are lower than 0 (class 2: M = −.50, SD = .86, N = 107; class 3: M = −.81, SD = .96, N = 93). An ANOVA confirms that there are differences in the student scores between the classes (F(2, 493) = 87.35, p < .001, η2 = .26, a large effect). A Tukey post-hoc analysis confirmed that the differences between all classes are significant (p < .05). We can conclude that students who follow the dominant design error of class 3 (non-contrastive experiments) have an overall lower CVS skill than students from class 2 (change of too many variables) – a conclusion we could not have drawn from the LCA alone. However, the boxplot in Fig. 6.4 shows that there is considerable overlap between the student scores associated with the different classes.
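A base-R sketch of this comparison; `pv_mean` (the average of a student’s five PVs) and `lca_class` (the assigned class) are illustrative variable names, and a complete analysis would repeat the tests for each plausible value and pool the results (OECD, 2009):

```r
d_combined <- data.frame(score = pv_mean, class = factor(lca_class))

aggregate(score ~ class, data = d_combined, FUN = mean)  # mean score per class

fit <- aov(score ~ class, data = d_combined)
summary(fit)   # F test for differences between the classes
TukeyHSD(fit)  # pairwise post-hoc comparisons
```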

Fig. 6.4 Boxplot of the Rasch PVs of every student per LCA class

6.6 Discussion

By utilizing LCA, we identified three classes of patterns within students’ responses to different items regarding the design and interpretation of experiments (research question 1, Sect. 6.5.1). The three identified classes are easy to interpret because the number of classes is low, and students within each class choose characteristic distractors that are related to common design errors. This was possible because the response options of our multiple-choice items correspond to typical experimental design errors. However, LCA is not restricted to data from multiple-choice tests but can be used to analyze all kinds of categorical data, including a categorical coding of open-response items or student drawings (e.g., Zarkadis et al., 2021). Researchers can utilize LCA to analyze data from all kinds of tasks that encourage students to respond according to their preconceptions. Additionally, to improve the performance of the LCA, covariates (like gender or reading skill) can be used to predict the latent class membership based on logistic regression (Collins & Lanza, 2010).

Our findings from the Rasch analysis (research question 2, Sect. 6.5.2) show that items of the interpretation and understanding subskills are more difficult than items of the identification subskill. By combining the findings of the LCA and the Rasch analysis (research question 3, Sect. 6.5.3), we were able to identify a progression in students’ understanding of CVS. Students in class 3 “non-contrastive experiments” have the lowest CVS abilities. Students in class 2 “change of too many variables” have higher abilities, while the students in class 1 “correct understanding of CVS” have the highest CVS abilities and can solve even the challenging interpretation and understanding items. Generally, LCA groups together students that have similar response patterns, while Rasch analysis groups together items that have similar item difficulty.

LCA and Rasch analysis have their distinct areas of application. Based on a similar data set, LCA allows us to find classes characterized by dominant underlying design errors (research question 1; Schwichow et al., 2022), whereas a Rasch analysis gives insight into the difficulty of items (research question 2) or the dimensionality of the CVS subskills (Brandenburger et al., 2022). By combining both analysis approaches (research question 3), researchers can get a more detailed picture of students’ response behavior and the item characteristics (see also Fulmer et al., 2014).

However, LCA also has some disadvantages compared to a Rasch analysis. LCA needs more complex handling of missing data because it is a full information method. Rasch analysis allows easy handling of missing data because it uses aggregated data. Thus, researchers using a Rasch analysis can divide the items of a larger item pool into different test booklets and thereby reduce the number of items each student has to work on without reducing the total number of items. A further disadvantage of LCA is that LCA models do not always have clear and easy-to-interpret outcomes. If the number of classes is high and the patterns within classes are diverse, the meaning of the classes is hard to understand.

If LCA delivers clear results, it offers detailed information that can help identify the students’ conceptions underlying their incorrect responses. LCA seems to be an ideal tool for analyzing student preconceptions because the method aligns with theories of preconceptions (Vosniadou, 2019). These theories describe preconceptions as rather coherent ideas that students can apply to different tasks. Accordingly, students’ responses to different tasks should cluster together, so that specific preconceptions result in specific response patterns. Indeed, as many cognitive models are analogies of statistical models (Gigerenzer, 1991), the mathematical model behind LCA can serve as a blueprint to understand and describe how specific preconceptions (represented as classes) result in probabilities for specific responses to tasks (represented as conditional item-response probabilities). This probabilistic view of the relation between preconception and item response matches the empirical finding that although students have prevalent preconceptions, they do not respond consistently with these conceptions in all items (Vosniadou, 2019).