With advances in technology, the use of computers as the delivery platform for assessments facilitates the development of innovative item types, such as simulated interactive tasks (Xiao et al., 2021). Such tasks usually require respondents to interact with the problem scenario to uncover information, filter and integrate it, and make multistep decisions to approach the solution. Thus, interactive tasks can be used to measure higher-order thinking skills that involve more complex cognitive processes. This approach has been put into practice in many large-scale assessments, especially for measuring problem-solving competency, such as the computer-based problem-solving assessments in the Programme for International Student Assessment (PISA), the Programme for the International Assessment of Adult Competencies (PIAAC), and the National Assessment of Educational Progress (NAEP). One typical design framework for interactive problem-solving tasks is finite-state automata (FSA; Buchner & Funke, 1993; Funke, 2001), which have a normative design and easily defined actions and optimal solutions.

For computer-based simulated tasks, a broader range of data can be collected in log files, including not only the final outcomes but also information about how respondents approach the solution (He & von Davier, 2016; Xiao et al., 2021). All actions of each respondent during the problem-solving process are typically recorded as ordered sequences of multi-type events with timestamps, which are referred to as process data. This type of data is valuable when examining interactive tasks (He et al., 2019, 2021). It can promote the understanding of human problem solving, for example, by identifying the problem-solving strategies used by respondents and detecting the typical behavioral characteristics of different groups (e.g., Arieli-Attali et al., 2019; He & von Davier, 2015, 2016; Liao et al., 2019; Xiao et al., 2021). More importantly, because the problem-solving process determines the final outcome, process data contain rich information about respondents' problem-solving ability beyond the outcome. Stadler et al. (2020) showed that individual differences in test-taking behavior sequences indicated differences in problem-solving ability even among respondents with identical scores.

However, how to measure individual ability based on process data is a considerable challenge. Unlike traditional test data in which a univariate response is observed for each item, process data are highly unstructured. Specifically, each response process is a sequence of categorical actions (or events). The sequences of different respondents may be completely different, with different lengths and different events that occur at different time points. Moreover, information about the order of actions is critical and should be taken into account in modeling. Therefore, it is difficult to directly apply traditional measurement models to the process data, or even to fully extract meaningful information from it.

To draw valuable inferences from process data, an increasing number of statistical methods have been proposed in recent years. Based on the information they yield, existing approaches for the analysis of process data can be roughly divided into two categories: (a) methods that extract features from process data, and (b) measurement models that infer respondents' latent ability. The first category, feature extraction methods, includes extracting summary statistics according to expert input (e.g., Greiff et al., 2015, 2016), data mining techniques (e.g., He & von Davier, 2015, 2016; Kerr et al., 2011; Liao et al., 2019; Qiao & Jiao, 2018), and the use of numerical values or vectors to represent sequences (e.g., the multidimensional scaling approach and the sequence-to-sequence autoencoder; Tang et al., 2020, 2021). This class of methods facilitates the discovery and understanding of respondents' problem-solving strategies and behavioral characteristics. However, these techniques cannot directly provide information about latent ability, and it is difficult to link the obtained features with latent traits due to a lack of interpretability or theoretical support.

To infer latent traits from process data, several measurement models have been proposed, such as the Markov-IRT [item response theory] model (Shu et al., 2017), the Markov decision process measurement model (MDP-MM; Lamar, 2018), the modified multilevel mixture IRT model (MMixIRT; Liu et al., 2018), and the continuous-time dynamic choice (CTDC) measurement model (Chen, 2020). These models are built on action sequences, account for serial dependence in different ways, such as through the Markov property assumption (e.g., Lamar, 2018; Shu et al., 2017), and draw on traditional measurement models to derive latent trait levels (e.g., Chen, 2020; Lamar, 2018; Liu et al., 2018; Shu et al., 2017).

However, these models have their own limitations. Most utilize only limited information about the problem-solving process. For example, in the Markov-IRT model, the transitions between adjacent actions are used as indicators and are scored only on their frequency of occurrence, so the sequence order of the response process is not actually preserved in the constructed indicators. In the modified MMixIRT model, the person-level ability estimates are based only on the last step, not the whole process. In addition, these models concentrate on latent abilities and pay little attention to task characteristics at the process level, which may contribute to understanding individuals' behavioral features when solving the task.

In this paper, we propose a measurement model for process data that extracts information about both respondents' latent trait and task characteristics from the response process. Specifically, we start with FSA tasks, which are commonly used in problem-solving assessments and have been discussed in many studies on process data analysis methods (e.g., Chen, 2021; Han et al., 2021; Liu et al., 2018; Zhan & Qiao, 2022), and develop the state response (SR) measurement model, which can be applied to process data from one or more tasks. This model focuses on the individual's action choice at each step of the response process and links these choices with the respondent's latent problem-solving ability and the characteristics of task steps or events. The proposed SR model is closely related to the action sub-model of the CTDC model (hereafter referred to as the DC model), which also models the probability of choosing the next action as depending on the respondent's latent ability and task parameters. The major difference between the two models lies in whether task characteristics at the process level are taken into account: the DC model only considers the overall difficulty of a task, whereas our model goes deeper into each problem state encountered during the problem-solving process.

In the next section, we first briefly describe the FSA tasks and then introduce the proposed model in detail, including the model specification and its estimation, as well as the connection with the related DC model. A simulation study is presented in Section 3 to illustrate the parameter recovery of the proposed model. For comparison, the DC model was also included. Afterward, an empirical study using the real data from PISA 2012 is provided to illustrate the application and rationality of the SR model. Finally, we end this article with a discussion.

State response measurement model

Before clarifying our proposed model, we first briefly introduce the finite-state automata (FSA) tasks. In an FSA task, there are a finite set of system states, a finite set of input signals (i.e., allowable actions), and a transition function that determines which state will follow from a given state depending on an input signal (Buchner & Funke, 1993; Funke, 2001). Figure 1 presents a graphical representation of an FSA with three states (A, B, C) and two possible actions (X1, X2).

Fig. 1 A graphical representation of an FSA with three states (A, B, C) and two possible actions (X1, X2)

In such tasks, each action can be represented as the resulting state of the problem scenario, which is the cumulative result of system changes caused by all actions that have occurred before. Accordingly, problem states contain part of the information accumulated from the beginning to the current point, and each action sequence can be represented as a corresponding state sequence. In the example of Fig. 1, the action sequence {X1, X1, X2, X1} can be represented as the state sequence {A, B, C, C, A} if the initial problem state is A. A more concrete example can be found in the first task used in the empirical study: the problem scenario is described in the "Empirical study" section, and the problem state definitions and the state sequence of its optimal solution are provided in Appendix Table 9 and Appendix Fig. 5, respectively.
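For readers who wish to reproduce this recoding step, the R sketch below replays the example action sequence through a toy transition function. Only the transitions implied by the example in the text are certain; the remaining entries (the effect of X2 in states A and B) are placeholder assumptions, not taken from Fig. 1.

```r
# Toy R sketch: replaying an action sequence into a state sequence for the FSA in Fig. 1.
# Transitions not implied by the example in the text are illustrative assumptions.
transition <- list(A = c(X1 = "B", X2 = "A"),
                   B = c(X1 = "C", X2 = "A"),
                   C = c(X1 = "A", X2 = "C"))

actions <- c("X1", "X1", "X2", "X1")
states  <- "A"                                   # initial problem state
for (a in actions) {
  states <- c(states, transition[[tail(states, 1)]][[a]])
}
states  # "A" "B" "C" "C" "A"
```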

Model specification

According to the task structure of FSAs introduced above, when the respondent is in a certain problem state, the reachable states in the next step form a finite set that depends on the current state. In other words, each time the respondent takes an action, they are making a choice among a set of optional events for a certain problem state. According to the problem-solving goal and the events that have occurred before, each choice can be classified as correct or incorrect. Inspired by the idea of IRT modeling, we therefore view each state as an item and each action choice (i.e., state choice) in the process as a response to the current state, and then model the relationship of these state responses with the characteristics of both persons and task events. However, in contrast to IRT modeling, which assumes conditional independence between item responses given latent ability, in the proposed model each action choice in the sequence may depend on the previous actions. In addition, a state may appear more than once in a respondent's sequence, unlike the item response data in IRT. The more times a state is visited, the more choice data the respondent produces in that state, and the more information about the state and the person is provided for parameter estimation.

Specifically, the SR model describes the conditional probability of respondent i choosing to reach state s′ when they are in problem state s of task k, taking the form

$$P\left(Y_{ik(j+1)}=s^{\prime}\mid Y_{ikj}=s,\ \theta_i,\ \beta_{ks},\ \mathcal{R}\right)=\frac{\exp\left[\left(\beta_{ks}+\theta_i\right)\cdot I_{ss^{\prime}}\right]}{\sum_{r\in M_s}\exp\left[\left(\beta_{ks}+\theta_i\right)\cdot I_{sr}\right]},\quad s^{\prime}\in M_s$$
(1)

where Yikj denotes the jth state in the sequence of respondent i when solving task k; θi is the latent ability of respondent i; and βks is the easiness parameter for state s of task k. Ms represents the set of reachable states in the next step given the current state s (which can be understood as the optional actions at the current state). The states in Ms can be classified as correct or incorrect according to whether they move closer to the target state given the current situation, and Iss′ is an indicator variable for the correctness of the reachable state s′ when the respondent is in state s. Specifically, if moving from state s to state s′ brings the respondent closer to the target state, Iss′ = 1; otherwise, Iss′ = 0. For example, suppose that state A has three reachable states {B, C, D}, in which B is the correct choice and the other two are incorrect. The correctness values for the reachable states of state A are IAB = 1 and IAC = IAD = 0. Sometimes, the correctness of reachable states may also depend on previous events in addition to the current state. The problem states of each task, the set of reachable states for each state, and the correctness of each reachable state need to be predefined before data analysis; in FSA tasks, the states and their transitions (i.e., their reachable states) are already built into the task design, so usually only the correctness needs to be specified. These predefined rules of the task(s) are denoted by \(\mathcal{R}\). Therefore, in the proposed model, conditional dependencies between actions are taken into account through the defined states and the correctness of each state's reachable states.

In the model, the state-specific easiness parameter βks reflects the characteristics of each unique state of the task, representing the propensity to choose a correct next state given the current state s. Since respondents often face the same choices with the same correctness values each time they are in the same state, it is assumed here that each state usually has only one easiness parameter. The state easiness parameter is independent of the latent ability parameter θi, and the two parameters jointly determine the probability of a respondent making a particular choice each time they are in state s. According to Eq. (1), if state s′ is a correct choice, the numerator is exp(βks + θi); otherwise, the numerator is exp(0) = 1. The denominator is the sum of the exponential terms over all reachable states of state s and serves as the normalizing constant. Therefore, the larger the βks, the more likely respondents are to take correct actions in state s in general, indicating that state s is easier. Given βks, students with larger values of θ have a higher probability of choosing a correct next state when they are in state s.
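To make Eq. (1) concrete, the following R sketch computes the choice probabilities over the reachable set of a state from given values of βks and θi. The function name and data layout are illustrative assumptions, not part of the authors' released code.

```r
# Minimal sketch of Eq. (1): probability of choosing each reachable state s' in M_s,
# given the current state's easiness beta_ks, the ability theta_i, and the 0/1
# correctness indicators I_{ss'} ('correct'). Illustrative only.
sr_choice_prob <- function(theta_i, beta_ks, correct) {
  logits <- (beta_ks + theta_i) * correct     # exp(beta + theta) for correct options, exp(0) otherwise
  exp(logits) / sum(exp(logits))              # normalize over the reachable set M_s
}

# Example from the text: state A with reachable states {B, C, D}, where only B is correct
sr_choice_prob(theta_i = 0.5, beta_ks = 1.0, correct = c(B = 1, C = 0, D = 0))
```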

Note that in some cases, a state can have more than one easiness parameter, which is related to previously occurring events (i.e., event history). Specifically, although the action options for a state are usually the same each time it is visited, the correctness of those options may vary according to the information status determined by the event history. For example, in the second task of the TICKETS unit of the PISA 2012 problem-solving assessment (OECD, 2014), students should check the prices of two alternative tickets (ticket 1 and ticket 2) and then buy the cheaper one, i.e., ticket 2. Based on the event history, it can be determined at each step which ticket prices are already known, resulting in four possible information statuses during the problem-solving process. Suppose that state A is the situation where students must choose which of the two tickets to check, with reachable states B and C representing the choice of ticket 1 and ticket 2, respectively. Another reachable state, D, is the initial state, meaning that the student selects reset in state A to start over. When the prices of both tickets are unknown (the first information status), states B and C are both correct and only state D is incorrect; when the price of only one ticket is known (the second or third information status), it is correct to choose the other ticket (state B or C); and when the prices of both tickets are known (the fourth information status), only state C (i.e., choosing ticket 2) is the correct option and states B and D are both incorrect. Figure 2 shows the transitions from state A with four different sets of correctness. Logically, the easiness of state A may vary across the four cases. However, if we introduce state-history-specific parameters (that is, estimate the easiness of each state under each history separately), the model becomes much more complex, with a large number of parameters, and the estimation may be poor.

Fig. 2 Transitions from state A with different sets of correctness in four information statuses. Panel a corresponds to the case where both ticket prices are unknown. Panels b and c respectively correspond to the cases where the price of only ticket 1 or 2 is known. Panel d corresponds to the information status where both ticket prices are known. A solid arrow indicates a correct transition, and a dotted arrow an incorrect transition. The numerator of each transition probability is annotated beside the corresponding arrow, while the denominator is the sum of numerators across the three transitions from state A, for example, exp(0) + exp(βA + θ) + exp(βA + θ) in panel (a)

To balance the parsimony and interpretability of the model, the SR model makes a simplifying assumption for such cases. That is, given different event histories (or different information statuses determined by event history), if the numbers of correct and incorrect action options for state s remain the same, the easiness parameter for the state (βs) is assumed to be the same; otherwise, state s under different event histories is treated as different states with different easiness parameters. In the above example, when respondents are in state A, there are two correct options and one incorrect option given the first information status, whereas in the latter three information statuses there is always one correct option and two incorrect options. Therefore, state A given the first event history and state A given the latter three event histories are treated as two different problem states, and their easiness parameters are estimated separately (βA and \({\beta}_A^{\prime }\) as shown in Fig. 2). In other words, the impact of event history, which can also be considered a form of temporal dependence, is incorporated into the definition of task states.

Denote the sequence length of respondent i in task k as Jik. Assuming conditional independence between tasks given the latent ability θi, the conditional likelihood of the action sequences Yi = {Yi1, …, Yik, …, YiK} of respondent i across all K tasks can be written as:

$$L\left(\boldsymbol{Y}_i\mid\theta_i,\boldsymbol{\beta}_1,\cdots,\boldsymbol{\beta}_K,\mathcal{R}\right)=\prod_{k=1}^{K}L\left(\boldsymbol{Y}_{ik}\mid\theta_i,\boldsymbol{\beta}_k,\mathcal{R}\right)=\prod_{k=1}^{K}\prod_{j=1}^{J_{ik}-1}P\left(Y_{ik(j+1)}\mid Y_{ikj},\theta_i,\boldsymbol{\beta}_k,\mathcal{R}\right)$$
(2)

where \({\boldsymbol{\beta}}_k=\left({\beta}_{k1},\dots, {\beta}_{k{S}_k}\right)\) is the vector of easiness parameters of all Sk problem states in task k. For model identification, the mean of θ is set to 0.
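As a sketch of how Eq. (2) can be evaluated in R, the function below accumulates the log choice probabilities along one state sequence of a task, building on sr_choice_prob() above. The 'rules' list encoding \(\mathcal{R}\) (reachable states and their correctness for each state) and the named vector of state easiness parameters are assumed data structures, not the authors' implementation.

```r
# Conditional log-likelihood of one state sequence for one task under the SR model.
# 'rules[[s]]$reachable' and 'rules[[s]]$correct' encode the predefined rules R;
# 'beta_k' is a named vector of state easiness values. States whose correctness
# depends on event history are assumed to have been split beforehand, as described above.
sr_loglik_sequence <- function(states, theta_i, beta_k, rules) {
  ll <- 0
  for (j in seq_len(length(states) - 1)) {
    s      <- states[j]
    s_next <- states[j + 1]
    p      <- sr_choice_prob(theta_i, beta_k[[s]], rules[[s]]$correct)
    ll     <- ll + log(p[match(s_next, rules[[s]]$reachable)])
  }
  ll
}
```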

If the easiness parameters of all states within the same task are constrained to be equal, we obtain a simplified version of the proposed SR model. Its form is essentially the same as the action sub-model of the CTDC measurement model of Chen (2020), which is referred to here as the DC model. In the DC model, the easiness parameter is task-specific and is the same for all states in a task. Such a specification, however, is restrictive and often unrealistic: since the information needed for the solution is revealed gradually in interactive tasks, the difficulty of choosing a correct action may differ considerably across problem states. In this sense, the DC model ignores the differences between task states, does not probe into the process characteristics of tasks, and is not really built at the process level. By contrast, the proposed SR model constructs parameters for problem states, so different unique events in the response process can be distinguished, and action choices in different states provide differentiated information for latent ability estimation. In this sense, the SR model better captures and reflects the dynamics in process data.

Model estimation

In this study, we adopted the Markov chain Monte Carlo (MCMC) method to estimate the proposed model. The observed data, namely all sequences of N respondents in K tasks, are denoted as Y. The parameters to be estimated include the individual latent abilities θ = (θ1, …, θN) and the state easiness parameters β = (β1, …, βk, …, βK), in which \({\boldsymbol{\beta}}_k=\left({\beta}_{k1},{\beta}_{k2},\dots, {\beta}_{k{S}_k}\right)\). The joint posterior distribution of interest is

$$p\left(\boldsymbol{\theta},\boldsymbol{\beta}\mid\boldsymbol{Y},\mathcal{R}\right)\propto p\left(\boldsymbol{Y}\mid\boldsymbol{\theta},\boldsymbol{\beta},\mathcal{R}\right)\cdot p\left(\boldsymbol{\theta},\boldsymbol{\beta}\right),$$
(3)

where

$$p\left(\boldsymbol{Y}\mid\boldsymbol{\theta},\boldsymbol{\beta},\mathcal{R}\right)=\prod_{i=1}^{N}\prod_{k=1}^{K}\prod_{j=1}^{J_{ik}-1}P\left(Y_{ik(j+1)}\mid Y_{ikj},\theta_i,\boldsymbol{\beta}_k,\mathcal{R}\right),$$
(4)
$$p\left(\boldsymbol{\theta},\boldsymbol{\beta}\right)=p\left(\boldsymbol{\theta}\right)\cdot p\left(\boldsymbol{\beta}\right)=\prod_{i=1}^{N}p\left(\theta_i\right)\cdot\prod_{k=1}^{K}\prod_{s=1}^{S_k}p\left(\beta_{ks}\right).$$
(5)

In Eq. (5), p(θ) and p(β) are the prior distributions of the latent ability and the state parameters, respectively, and they are assumed to be independent of each other. Following the priors commonly used in MCMC algorithms (e.g., Fox, 2010; Han et al., 2021; Kim & Bolt, 2007; Patz & Junker, 1999b), the priors for the latent ability and state easiness parameters are set to the standard normal distribution N(0, 1). The initial values of the parameters are randomly assigned, yielding the collections θ0 and β0, where the superscript indexes the iteration l (l = 0 indicating initial values). The Metropolis-Hastings-within-Gibbs sampling approach is used to empirically approximate the posterior distributions of the parameters (Patz & Junker, 1999a, 1999b). The sampling procedure comprises the following steps for iteration l + 1:

  • Step 1. Sample a latent ability θi for each respondent. Specifically, draw a candidate value \({\theta}_i^{\ast }\) from a proposal distribution centered on the current value \({\theta}_i^l\), \({\theta}_i^{\ast}\sim N\left({\theta}_i^l,{\sigma}_{\theta}^2\right)\), independently for each i = 1, 2, …, N. Then calculate the acceptance probability for \({\theta}_i^{\ast }\)

$$\alpha_i=\min\left\{1,\ \frac{p\left(\theta_i^{\ast}\mid\boldsymbol{\beta}^{l},\boldsymbol{Y}_i,\mathcal{R}\right)}{p\left(\theta_i^{l}\mid\boldsymbol{\beta}^{l},\boldsymbol{Y}_i,\mathcal{R}\right)}\right\}=\min\left\{1,\ \frac{p\left(\boldsymbol{Y}_i\mid\theta_i^{\ast},\boldsymbol{\beta}^{l},\mathcal{R}\right)\cdot p\left(\theta_i^{\ast}\right)}{p\left(\boldsymbol{Y}_i\mid\theta_i^{l},\boldsymbol{\beta}^{l},\mathcal{R}\right)\cdot p\left(\theta_i^{l}\right)}\right\}$$
(6)

where \(p\left({\theta}_i^{\ast}\right)\) and \(p\left({\theta}_i^l\right)\) denote the prior probability densities of \({\theta}_i^{\ast }\) and \({\theta}_i^l\), respectively. Draw a random value r~Uniform(0, 1). Accept \({\theta}_i^{l+1}={\theta}_i^{\ast }\) if αi ≥ r; otherwise, \({\theta}_i^{l+1}={\theta}_i^l\).

  • Step 2. Sample the easiness parameter βks for each problem state. Draw a candidate value \({\beta}_{ks}^{\ast }\) from a proposal distribution, \({\beta}_{ks}^{\ast}\sim N\left({\beta}_{ks}^l,{\sigma}_{\beta}^2\right)\), independently for each s = 1, 2, …, Sk and k = 1, 2, …, K. Calculate the acceptance probability

$$\alpha_{ks}=\min\left\{1,\ \frac{p\left(\beta_{ks}^{\ast}\mid\boldsymbol{\theta}^{l+1},\boldsymbol{Y}_{ks},\mathcal{R}\right)}{p\left(\beta_{ks}^{l}\mid\boldsymbol{\theta}^{l+1},\boldsymbol{Y}_{ks},\mathcal{R}\right)}\right\}=\min\left\{1,\ \frac{p\left(\boldsymbol{Y}_{ks}\mid\boldsymbol{\theta}^{l+1},\beta_{ks}^{\ast},\mathcal{R}\right)\cdot p\left(\beta_{ks}^{\ast}\right)}{p\left(\boldsymbol{Y}_{ks}\mid\boldsymbol{\theta}^{l+1},\beta_{ks}^{l},\mathcal{R}\right)\cdot p\left(\beta_{ks}^{l}\right)}\right\}$$
(7)

where \(p\left({\beta}_{ks}^{\ast}\right)\) and \(p\left({\beta}_{ks}^l\right)\) denote the prior probability densities of \({\beta}_{ks}^{\ast }\) and \({\beta}_{ks}^l\), respectively, and Yks denotes the collection of action choices made by all respondents when they are in state s of task k. Draw a random value u~Uniform(0, 1). Set \({\beta}_{ks}^{l+1}={\beta}_{ks}^{\ast }\) if αks ≥ u; otherwise, \({\beta}_{ks}^{l+1}={\beta}_{ks}^l\).

The variances of the proposal distributions, \({\sigma}_{\theta}^2\) and \({\sigma}_{\beta}^2\), govern the variability of the sampled candidate values and thus affect the estimation efficiency. In preliminary runs of the chains, \({\sigma}_{\theta}^2\) and \({\sigma}_{\beta}^2\) are tuned so that the acceptance rate of each parameter falls in the range commonly recommended in practice, roughly 20% to 60% (Junker et al., 2016; Rosenthal, 2011). Afterward, multiple chains of length L are run and a number of initial iterations are discarded as burn-in.
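A minimal R sketch of the random-walk Metropolis update for a single θi (Step 1 above) is given below. Here, sr_loglik_person() is a hypothetical helper that sums sr_loglik_sequence() over the respondent's tasks, and the proposal standard deviation is an arbitrary illustrative value; this is not the authors' released code. The update for each βks (Step 2) has the same form, with the likelihood restricted to the choices made by all respondents in state s of task k.

```r
# Illustrative Metropolis-Hastings update for one theta_i, assuming a standard normal
# prior and a hypothetical helper sr_loglik_person() that sums sr_loglik_sequence()
# over the respondent's K tasks.
update_theta_i <- function(theta_cur, beta, Y_i, rules, sd_theta = 0.3) {
  theta_prop <- rnorm(1, mean = theta_cur, sd = sd_theta)          # random-walk proposal
  log_alpha  <- sr_loglik_person(Y_i, theta_prop, beta, rules) + dnorm(theta_prop, 0, 1, log = TRUE) -
                sr_loglik_person(Y_i, theta_cur,  beta, rules) - dnorm(theta_cur,  0, 1, log = TRUE)
  if (log(runif(1)) < log_alpha) theta_prop else theta_cur         # accept or retain current value
}
```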

The convergence of the Markov chains is monitored using the potential scale reduction factor (\(\hat{R}\); Brooks & Gelman, 1998; Gelman & Rubin, 1992). An \(\hat{R}\) close to 1 indicates that the Markov chains have converged to the target distribution.
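In R, \(\hat{R}\) can be obtained, for example, with the coda package, as sketched below; the chain objects are hypothetical matrices of posterior draws, and the paper does not specify which tooling was actually used.

```r
# One possible way to compute the potential scale reduction factor using the coda
# package; 'chain1', 'chain2', and 'chain3' are assumed matrices of posterior draws
# (iterations in rows, parameters in columns) from three independent chains.
library(coda)
draws <- mcmc.list(mcmc(chain1), mcmc(chain2), mcmc(chain3))
gelman.diag(draws, multivariate = FALSE)   # R-hat point estimate and upper CI per parameter
```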

Simulation study

Design

Three factors were manipulated: (1) sample size (800, 1500, 3000), (2) sequence length (short, medium, long), and (3) the easiness of problem states within a task (equal, unequal). The sequence length was mainly controlled by the number of tasks and the number of problem states within each task. Specifically, we simulated two FSA tasks (Task T1, Task T2), involving 9 and 15 problem states, respectively. The task structure, including the problem states, their reachable states, and the corresponding correctness, is listed in Table 1. We then approximated the short, medium, and long sequence-length conditions by conducting analyses for Task T1 alone, Task T2 alone, and the two tasks (T1 and T2) together, respectively; the three levels of sequence length are therefore labeled Task T1, Task T2, and Two Tasks. In addition, equal or unequal easiness of problem states within a task indicates that the true (i.e., generating) model behind the data was the DC model or the SR model, respectively. The true values of the state easiness parameters when they were unequal within a task are listed in Table 1. When the state easiness parameters within a task were equal, their true values in tasks T1 and T2 were 1.0 and −0.5, respectively.

Table 1 Structures of two simulated tasks and the true values of state easiness parameters when they were unequal within each task

In total, we simulated 3 × 3 × 2 = 18 different conditions. For each condition, 50 independent replications were generated from the corresponding true model, with latent abilities θi drawn from N(0, 1). Each dataset was analyzed with both the SR and DC models. The MCMC sampling algorithm for parameter estimation was implemented in R (R Core Team, 2018), using three chains of 10,000 iterations, with the first 2000 iterations discarded as burn-in and every fifth iteration kept. The R code for simulating data and implementing the MCMC algorithm is available at https://osf.io/w9dvf/?view_only=832f623510ba4a7a82ac35fd875e5e30

Note that a few generated sequences were extremely long due to the randomness of the simulation, even though such sequences are almost impossible in practice and always resulted in log-likelihood values of negative infinity during estimation. To address this issue, we added a restriction on the input data in the algorithm: only the first 200 problem states of each sequence in each task were used for estimation. The proportion of such overly long sequences was very low in every generated dataset (at most 0.75%), so using only the first 200 states should not affect the estimation.
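The data-generating step can be sketched in R as below, reusing sr_choice_prob() and the assumed 'rules' structure from the model section; the start state, target state, and the 200-state cap follow the description above, while the function name and arguments are illustrative.

```r
# Sketch of generating one response sequence for one task under the SR model.
# 'start' is the initial problem state, 'target' the goal state that ends the task,
# and max_len mirrors the 200-state restriction described in the text.
simulate_sequence <- function(theta_i, beta_k, rules, start, target, max_len = 200) {
  states <- start
  while (tail(states, 1) != target && length(states) < max_len) {
    s      <- tail(states, 1)
    probs  <- sr_choice_prob(theta_i, beta_k[[s]], rules[[s]]$correct)
    states <- c(states, sample(rules[[s]]$reachable, size = 1, prob = probs))
  }
  states
}
```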

Evaluation

Five commonly used indices were applied for model comparison, namely the Akaike information criterion (AIC; Akaike, 1974), Bayesian information criterion (BIC; Schwarz, 1978), the sample size-adjusted BIC (SABIC; Sclove, 1987), deviance information criterion (DIC; Spiegelhalter et al., 1998), and pseudo-Bayes factor (PsBF; Geisser & Eddy, 1979; Gelfand & Dey, 1994). For AIC, BIC, SABIC, and DIC, smaller values indicate a better model fit. PsBF is calculated as the ratio of the conditional predictive ordinates (CPOs) of two models.

$$CPO=\prod\nolimits_{i=1}^N\frac{1}{\frac{1}{R}{\sum}_{r=1}^R{\left[p\left({\boldsymbol{x}}_i|{\Theta}^{(r)}\right)\right]}^{-1}},$$
(8)
$$PsBF=\frac{CPO\left( Model\ 1\right)}{CPO\left( Model\ 2\right)},$$
(9)

in which R is the number of MCMC iterations, N denotes the number of persons, xi denotes the sequence(s) of person i in the data, and Θ(r) contains values of all parameters to be estimated in the rth iteration. A value of PsBF greater than 3 provides positive (or stronger) evidence in favor of Model 1 and against Model 2 (Levy & Mislevy, 2016, p. 246).
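As an illustration, the CPO and PsBF computations in Eqs. (8) and (9) can be carried out on the log scale from a matrix of per-person log-likelihoods evaluated at each retained draw; the matrix layout and object names below are assumptions made for this sketch.

```r
# Sketch of Eqs. (8)-(9): 'loglik' is an R x N matrix of log p(x_i | Theta^(r)) values
# (retained MCMC iterations in rows, persons in columns), an assumed layout.
log_cpo <- function(loglik) {
  # log CPO_i = -log( (1/R) * sum_r exp(-loglik[r, i]) ), computed stably via logSumExp
  apply(loglik, 2, function(ll) -(matrixStats::logSumExp(-ll) - log(length(ll))))
}

# PsBF comparing Model 1 against Model 2 (values > 3 favor Model 1)
psbf <- exp(sum(log_cpo(loglik_m1)) - sum(log_cpo(loglik_m2)))
```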

Parameter estimation was evaluated using three criteria: the bias and root mean squared error (RMSE) of the estimates, and their correlations with the true values. Note that when evaluating the accuracy of ability estimation, we used the average ability associated with each unique action sequence rather than the abilities of single persons.
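For completeness, these criteria can be computed with simple helpers such as the following, where 'est' and 'true' denote vectors of estimates and generating values for one parameter type (illustrative names only).

```r
# Simple helpers for the three evaluation criteria (illustrative sketch)
bias_fn <- function(est, true) mean(est - true)            # average signed error
rmse_fn <- function(est, true) sqrt(mean((est - true)^2))  # root mean squared error
cor_fn  <- function(est, true) cor(est, true)              # correlation with true values
```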

Results

In all conditions, all parameters converged successfully, with \(\hat{R}\) values smaller than 1.1. In the nine conditions with unequal state easiness parameters within a task, AIC, BIC, SABIC, DIC, and PsBF all strongly supported the correct SR model across the 50 replications, as shown in Table 2. The corresponding estimation accuracy of the state easiness and latent ability parameters is shown in the left panels of Figs. 3 and 4, respectively. As seen in Fig. 3, the SR model estimates of the state parameters were reasonably accurate under all simulation settings, as shown by negligible average bias and RMSEs below 0.1. In addition, the estimation accuracy improved as the sample size increased. According to the left panel of Fig. 4, the SR estimates of latent ability were acceptable, with correlations with the true values higher than 0.8 and bias between −0.01 and 0.01. Moreover, a longer sequence length resulted in higher estimation accuracy. When comparing the two models, the SR model estimates of all parameters were generally more accurate than the DC model estimates, especially for the state easiness parameters.

Table 2 Percentages of replications in which the true SR model was supported in 9 conditions with unequal state easiness parameters within each task
Fig. 3 Estimation accuracy for state easiness parameters using SR and DC models under different conditions in which state easiness parameters within task were unequal (left panel) or equal (right panel)

Fig. 4 Estimation accuracy for latent ability using SR and DC models under different conditions in which state easiness parameters within task were unequal (left column) or equal (right column). In panels d and f, the solid and dashed lines overlap, since the results of the two models are essentially the same

In the nine conditions with equal state easiness parameters within a task, the percentages of replications supporting the DC model are listed in Table 3. As seen from Table 3, only BIC always supported the correct and more parsimonious DC model, followed by SABIC, while DIC and PsBF were the least effective. In these conditions, the SR model could still provide good estimates: the estimation biases of all parameters were close to zero, the RMSEs of the state parameters were below 0.1, and the correlation between the ability estimates and the true values was higher than 0.9. The estimation accuracy of the two models for latent ability was very close (see the right column of Fig. 4). The difference in estimation accuracy for state easiness between the two models was also unsubstantial, although the RMSE of the DC model was slightly smaller (see the right panel of Fig. 3).

Table 3 Percentages of replications in which the true DC model was supported in 9 conditions with equal state easiness parameters within each task

Empirical study

Data

Task description

To demonstrate the practical applicability of the proposed SR model, we used data from the first two items of the TICKETS unit in PISA 2012. The problem scenario of the TICKETS unit is an automated ticketing machine, including five interfaces. In the first three interfaces, students can choose the train network (CITY SUBWAY, COUNTRY TRAINS, CANCEL), fare type (FULL FARE, CONCESSION, CANCEL), and ticket type (DAILY, INDIVIDUAL, CANCEL) in order. If a student selects DAILY, the next interface will show the price of the selected ticket, and two options, BUY or CANCEL. Alternatively, if the student selects INDIVIDUAL, the next interface will show the available number of individual trips (1 to 5), as well as BUY and CANCEL buttons. After the student selects a certain number, the price of the ticket will be presented. When the student clicks BUY, the task terminates. The CANCEL button in each interface allows the student to reset all choices and navigate to the initial interface. More details about the unit can be found in the PISA 2012 results report (OECD, 2014).

The first item of the TICKETS unit required students to buy a full-fare, country train ticket with two individual trips. The requirements for the ticket were very clear, and students only needed to make choices on the machine following those requirements. The optimal solution was to select the network “COUNTRY TRAINS,” the fare type “FULL FARE,” the ticket type “INDIVIDUAL,” and the number of tickets “2” in that order, and finally click BUY. This item was dichotomously scored based on whether the student purchased the correct ticket.

The second item was more complicated. Students were asked to find and buy the cheapest ticket that allowed them to take four trips around the city on the subway within a day, and they were told that they could use concession fares. To complete this task, students had to find and compare the prices of two possible alternatives that satisfied the ticket requirements, which were a daily subway ticket with concession fare, and an individual concession fare subway ticket with four trips. Afterward, the student had to purchase the cheaper ticket, which was the individual concession fare subway ticket with four trips. In PISA 2012, this task was polytomously scored as 0/1/2. Only if the student compared the two prices and purchased the correct ticket would they be considered to have successfully solved the task and receive full credit. If the student purchased one of the two tickets without comparing prices, they could be given only partial credit.

The raw process data and item scores of the two tasks are available from the OECD website: http://www.oecd.org/pisa/pisaproducts/database-cbapisa2012.htm. Students’ process data were organized into state sequences according to the definition of problem states.

Definitions of problem states and correctness

All the problem states for the two tasks, their reachable states, and corresponding correctness are provided in the Appendix Tables 9 to 11, which are the same as in Chen (2020). For an intuitive understanding, screenshots of the optimal solution for the first task, as well as the corresponding defined problem states, are provided in Appendix Fig. 5 as an example.

Note that in the second task, the correct and incorrect options of states S7, S9, S10, and S11 vary with the information status determined by previous actions (see Appendix Table 11). For example, S7 represents the case in which the participant faces the choice of ticket type after choosing the correct network (CITY SUBWAY) and the correct fare type (CONCESSION). If the participant does not yet know the prices of the two tickets that meet the travel requirement (information status A), both ticket types are correct choices. If the price of one of the two tickets is already known (information status B or C), the other ticket type is the only correct choice. If the prices of both tickets are known (information status D), only the INDIVIDUAL ticket type is correct. According to the simplifying assumption described in the Model specification section, states S9, S10, and S11 have one correct option and three incorrect options in all information statuses, and therefore each of them is assigned a single easiness parameter regardless of the information status. By contrast, given information status A, the numbers of correct and incorrect options for S7 are 2 and 1, respectively, whereas the numbers are 1 and 2 in the other three information statuses B–D. Accordingly, S7 given information status A and S7 given statuses B–D were treated as two different states, S7_1 and S7_2, whose state parameters were estimated separately.

Sample

After data cleaning, the sequences of 27,616 students who completed the two tasks were used for process data analysis. The sequence length of the first task ranged from 5 to 146, with a mean of 7.39 and median of 6. The sequence length of the second task ranged from 5 to 91, with a mean of 10.05 and median of 6.

After the process data analysis, we examined the ability estimates obtained, in which item scores were used. Since a small number of students had missing data in the scores of one or both tasks, after matching the item scores and ability estimates, the data of only 26,718 students were included in this stage.

Analysis

We applied both the SR model and the DC model to the process data from the two tasks. According to the findings of the simulation study, longer sequences contribute to better estimation of the latent trait, so the sequences from the two tasks were analyzed together. In the SR and DC models, the priors of the latent ability and state easiness parameters were specified as N(0, 1). The model fit indices AIC, BIC, SABIC, DIC, and PsBF were used for model comparison.

The latent ability estimates from the two models were compared in terms of their correlations with the task outcome scores and their explanatory power for overall problem-solving performance in PISA 2012. That is, we regressed the overall performance scores on the ability estimates from each model and compared the R2 values. In PISA 2012, plausible values are a selection of likely proficiencies for students based on the scores of the tasks they received, and five plausible values were generated for each student (OECD, 2014). Following Greiff et al. (2015) and Chen (2020), the first plausible value of problem-solving proficiency provided in the PISA 2012 products was used as the overall performance score.
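A sketch of this comparison in R is given below; 'pv1' (the first plausible value), 'theta_sr', 'theta_dc', and 'student_dat' are hypothetical names for the merged student-level data, not objects provided with the paper.

```r
# Illustrative sketch of the regression comparison; variable names are assumptions.
m_sr <- lm(pv1 ~ theta_sr, data = student_dat)   # overall performance on SR ability estimates
m_dc <- lm(pv1 ~ theta_dc, data = student_dat)   # overall performance on DC ability estimates
summary(m_sr)$r.squared
summary(m_dc)$r.squared
```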

Results

In the analysis of two tasks, both models successfully converged, as the potential scale reduction factor values for all parameters were between 1 and 1.01. Table 4 lists the model fit for the two models. All five indices strongly supported the SR model over the DC model. Therefore, the easiness parameters of the problem states within each task might be quite different and should not be fixed to be equal.

Table 4 Model fit of two models in the empirical study

The state parameter estimates for the two tasks are presented in Tables 5 and 6. The easiness parameters of states within the same task were quite different, and their 95% credible intervals showed almost no overlap, which is consistent with the model comparison results. It can be seen from Tables 5 and 6 that the easiness parameters of states on the optimal solution path were generally higher than those of other states. For example, S16 in the first task had the highest estimated value (2.578). This implies that respondents were very likely to directly click the BUY button when they arrived at the ticket purchase interface after choosing the correct network (COUNTRY TRAINS), fare type (FULL FARE), ticket type (INDIVIDUAL), and number of individual trips (2). S14 in the first task also had a high easiness estimate (2.575), indicating that respondents who had selected the correct network (COUNTRY TRAINS), fare type (FULL FARE), and ticket type (INDIVIDUAL) were very likely to choose the correct number of individual trips. In the second task, S7_1 had the highest easiness estimate (3.590). This state represents respondents who had correctly selected CITY SUBWAY and the CONCESSION fare and needed to choose a ticket type (INDIVIDUAL or DAILY) without knowing the prices of the two tickets that met the trip requirement; it should be very easy to make a right response here, as both ticket types were correct choices at that point. By contrast, the estimated value of S7_2 was lower (1.318), which means that after obtaining the price of one or both tickets that met the trip requirement, students needed to be more careful to correctly choose the other ticket type to check its price or to correctly choose the cheaper of the two tickets.

Table 5 Estimates of state easiness parameters of the first task in the empirical study
Table 6 Estimates of state easiness parameters of the second task in the empirical study

By contrast, the DC model provided only task-level parameters, which carried limited information. In addition, it is counterintuitive that the estimated easiness of the second task was higher than that of the first task. Chen (2020) speculated that familiarity with the task interface was absorbed into the task-level easiness parameter and that it was not difficult to partially solve the second task, thereby reducing its overall difficulty. Nevertheless, this result remains difficult to interpret.

As for the latent ability, the estimates provided by the SR model and the DC model were highly consistent, with a correlation coefficient of 0.977. The correlations between the ability estimates and the outcome scores are given in Table 7. For the first task, the difference between the correlations of the two models was not substantial. However, for the second task, the SR model estimates were more strongly correlated with the task outcome than the DC model estimates.

Table 7 Correlations between ability estimates from two models and task outcomes

We further compared the R2 values of regressions of individuals' overall performance on the ability estimates obtained by the two process data analysis models. Both regression models were significant (p < 0.01), and the corresponding estimation results are shown in Table 8. The slope parameters were significantly positive, indicating that students with higher process-based ability estimates tended to have better overall problem-solving performance, which is in line with expectations. Furthermore, the SR model estimates had higher explanatory power for the overall performance (R2 = 0.384) than the DC model estimates (R2 = 0.361). This implies that, owing to its consideration of process steps, the SR model's latent ability estimates are more informative about individuals' overall problem-solving competence than the DC model's estimates.

Table 8 Estimation results for two regression models that regress the overall performance score on ability estimates from the SR model (M1) and DC model (M2) in the empirical study

Discussion

Different people usually react differently to the same task, resulting in a variety of action sequences. These sequences always contain richer information than the outcomes, not only about the respondents, but also about the tasks. In this study, starting with FSA tasks, we develop a state response measurement model for problem-solving process data, which is a discrete choice model and can reflect the characteristics of both persons and task steps. Through the predefined correctness of events that are available as the next action, the SR model links the action choice with the latent ability of the respondent and the easiness of the current problem state. Results of the simulation study show that the proposed SR model could provide a reasonably accurate estimation of parameters regardless of whether the state easiness parameters were indeed equal within tasks. Longer sequences (or more tasks) helped to improve the estimation accuracy of ability parameters, and a larger sample size contributed to a better estimation of state parameters.

The proposed SR model was also applied to the process data from two problem-solving tasks in PISA 2012. For each problem state, an estimate of its easiness was obtained, and the value made sense for characterizing the corresponding task step. In addition, SR model estimates for ability parameters explained nearly 40% of the variance in students’ overall performance scores reported by PISA 2012 and had a certain degree of correlation with the outcome scores of the two tasks.

In both the simulation and empirical studies, we also included the DC model (i.e., the action sub-model of Chen's (2020) CTDC model) for comparison. The DC model can be viewed as a special case of the proposed SR model in which the easiness of all states within the same task is constrained to be equal. Accordingly, its easiness parameters are task-specific rather than state-specific. This constraint on task states is unrealistic and ignores task characteristics at the process level. As shown in our simulation study, the task easiness parameters in the DC model provided limited and possibly inaccurate information about the tasks. The proposed SR model overcomes this disadvantage: its state-specific parameters reflect the process features of each task, that is, the difficulty of the different task states (or steps). Such a specification is closer to reality, and the estimation results are more accurate and informative. As shown in the empirical study, because the SR model considers task states, its ability estimates contain more information about overall performance than the DC model estimates, and they are more consistent with the outcome scores of the more complex second task, whose scores were assigned with partial consideration of the response process.

Above all, the proposed model provides an effective measurement framework for analyzing process data, revealing information about both persons and tasks. Most existing research on process data focuses primarily on person-level information or characteristics, such as the strategies used, the types of mistakes made, and the latent traits (e.g., Shu et al., 2017; Stadler et al., 2020; Tang et al., 2020; Zhan & Qiao, 2022). By contrast, the proposed model not only provides estimates of individual ability but also considers the process characteristics of tasks, that is, the difficulty of each step in the task, which can aid in understanding interactive problem-solving tasks and individuals' behavioral characteristics. Parameter estimates of problem states may also give item designers and researchers in the field of cognition more specific directions for task improvement. In addition, the SR model can handle data with different types of missingness. Because it focuses on the action choice in each state and incorporates dependence between states through the predefined correctness, the SR model can be applied as usual when some participants complete only a subset of tasks due to the test design, or when individuals' actions after a certain time point are not observed, for example, because of time limits.

Although the SR model is constructed based on the FSA tasks in this paper, the model application is not limited to this type of task. Actually, the key to using the SR model lies in the predefinition of all problem states, the relations between them (that is, the optional next states of each state), and their correctness as the next state. In FSA tasks, problem states and the transitions between them are built into the task design, and thus researchers usually only need to define the correctness of the reachable states for each state. For other types of tasks without built-in problem states, the data preprocessing work, such as the definition of problem states and their correctness and data recoding, can be implemented manually, which requires the involvement of content experts. In this step, some data-driven algorithms (e.g., hidden Markov modeling) can be additionally considered to provide information about problem-solving sub-phases, thereby assisting in the identification of problem states.

Limitations and future directions

Despite its flexibility, the proposed model has some limitations that remain to be improved in the future. First, although the correctness value of each reachable state in the SR model is dichotomous (0 or 1) in this paper, it can be defined as a value between 0 and 1 (or other lower and upper limits) and can be different for different reachable states, indicating the efficiency of choosing different next actions for achieving the target state. In future research, this correctness can also be included as a parameter to be estimated in the model, similar to the reward function in Lamar’s (2018) Markov decision process measurement model.

Second, when the correct action choice of a state varies with the event history, we treat the state under different information statuses as different states based on the numbers of its correct and incorrect options. This choice is partly driven by considerations of parameter estimation: the model already includes a number of parameters because it considers each task state, and further introducing state-history-specific parameters may result in poor model performance. However, the state easiness is likely to be related to the specific correct and incorrect events in that state. Additionally, the proposed model assumes that all parameters are static; that is, the state easiness and students' ability parameters remain constant throughout the problem-solving process. This assumption may hold for relatively simple tasks without feedback, such as the TICKETS tasks used in this study. However, in more complex dynamic interactive tasks, respondents may receive feedback from the task scenario, resulting in an increase in their ability, and a state may become easier after the respondent has visited it several times. Therefore, finding a more reasonable way to incorporate the influence of previous events into the model is an interesting issue that requires careful consideration.

Third, some states may be rarely reached if there are many allowable actions in a task that lead to many possible states. In such cases, the easiness parameters for these states with few response data may not be stably estimated. For this issue, one way is to reduce the number of states in the predefinition stage. For example, some unimportant states can be combined into a more general state. Solving this issue from the perspective of model estimation can also be considered. Specifically, parameter estimation methods for IRT models dealing with sparse response matrices, small samples, and missing data, such as regularized estimation (Battauz, 2020; Chen et al., 2021), can be introduced and adapted to the current model framework, or some improvements to the current Bayesian estimation procedure used in this paper can be attempted along the lines of these methods, such as the use of hierarchical priors (e.g., Gilholm et al. 2021; König et al. 2020).

Conclusions

In this study, we propose a new SR measurement model for process data analysis by incorporating the characteristics of action sequences and the concept of IRT modeling. The SR model takes full advantage of the whole solution sequence by focusing on the action choice at each response step, and takes into account the temporal dependence in the sequence by predefining the correctness of each choice. The application of the SR model holds promise in providing deeper insights into individuals' behavioral characteristics in interactive tasks and their latent ability levels, and equally importantly, it offers a new perspective for understanding interactive tasks, which can be helpful in designing, evaluating, and improving new types of technology-based assessments with interactive modes. Overall, the SR model provides an analytical framework with great potential for process data in computer-based interactive tasks.