Introduction

Recently, the importance of nurturing writing skills in higher education has been widely acknowledged (Uto and Ueno 2015). A typical instruction method for writing is an instructor providing feedback on a completed text. As another approach, instruction methods that focus on the writing process have attracted attention in recent years (Deane and Zhang 2015; Leijten and Waes 2013; Seow 2002; Zhang et al. 2016; Bayat 2014; Conijn et al. 2018).

The writing process can generally be regarded as a sequence of subtasks, such as planning, formulation, and revision (Flower and Hayes 1981; Seow 2002; Bayat 2014; Southavilay et al. 2010b). These processes are known to depend on writing skills (Stevenson et al. 2006; Hayes and Flower 1980; Sasaki 2000, 2002; Larios et al. 2008; Chan 2017). For example, learners with advanced skills tend to formulate faster but spend more time on revisions, as compared with those with lower skills (Sasaki 2000, 2002; Larios et al. 2008). In addition, learners with advanced skills tend to make major edits to logical structures and main arguments, while those with lesser skills primarily perform superficial corrections such as rewording expressions and fixing typographical errors (Sasaki 2000, 2002; Lester and Witte 1981; Barkaoui 2016). Because there is a relation between writing skills and writing processes, instruction based on the writing process can be an effective approach toward improving writing skills (Bayat 2014; Conijn et al. 2018).

Instruction focused on the writing process is often based on the appearance pattern of the above-described subtasks (Bayat 2014; Conijn et al. 2018). As methods for analyzing the appearance patterns of these subtasks, the think-aloud technique and the video playback stimulation method have long been used (Stevenson et al. 2006; de Larios et al. 2008). In the think-aloud technique, learners verbalize their thoughts as they write. The playback stimulation method presents learners with videos of their own writing process and asks them to describe their thoughts. However, these methods require considerable time for analysis, so they are impractical when there are many learners.

To address this issue, writing process analysis methods using keystroke log data recorded during composition on a computer have been recently proposed (Deane and Zhang 2015; Leijten and Waes 2013; Zhang et al. 2016; Chan 2017; Conijn et al. 2018). For example, Zhang et al. (2016) proposed a method in which the writing process is categorized based on the distribution of intervals between word inputs. However, while existing methods can categorize stages in the writing processes, there is no method for estimating subtask appearance patterns by learner.

We therefore propose a method for estimating subtask appearance patterns by learner from keystroke log data. Specifically, we propose a method for converting each learner's keystroke log data into time-series data with multiple features that express writing characteristics, so that an unsupervised machine learning method can be applied. Feature extraction is based on a sliding window approach, which divides keystroke log data into analytical frames with a small time-width and extracts features from each frame. The Gaussian hidden Markov model (GHMM) is a typical unsupervised machine learning method for such time-series data. GHMM assumes that observational data at an arbitrary point in time arise depending on a latent variable called the state, and that the state sequence can be estimated from the data. Therefore, by applying GHMM under the assumption that the latent state of each analytical frame is a subtask, a sequence of subtasks for each learner can be estimated from keystroke log data. However, because this approach estimates a subtask for each analytical frame with a small time-width, the variety of subtask sequences becomes extremely large, hindering interpretation of the writing process for each learner. For educational application, knowing which subtask is performed at each moment is not necessarily important. What we need, instead, is to know the subtask appearance patterns of each learner, as discussed earlier. The information required to understand these patterns is the appearance ratio of each subtask within time intervals of a certain span and the temporal changes in those ratios. For example, we wish to know the extent to which each learner performs subtasks such as formulation, major edits, and superficial corrections in every quarter of the writing time. Information on temporal changes in the subtask appearance ratios allows learners and instructors to characterize each learner's writing patterns quantitatively, and it also helps instructors give appropriate feedback and instruction toward improving learners' writing activities.

For the above reasons, this paper proposes an extension of GHMM that incorporates parameters representing temporal change in the subtask appearance distribution for each learner. For this model, we first divide the feature vector sequences obtained from each learner's keystroke data into a few time intervals. We then incorporate into the GHMM parameters that express state appearance probabilities for each learner in each time interval. The characteristics of the proposed model are as follows:

1. Because the incorporated parameters represent temporal change in subtask appearance patterns for each learner, by interpreting the parameters we can understand the writing process of each learner.

2. By comparing differences in state appearance distributions between learners, we can quantitatively analyze between-learner differences in the writing process.

3. The writing processes of learners can be categorized by applying typical cluster analyses, taking differences in state appearance distributions between learners as the distance function.

As a method for estimating parameters for the proposed model, we propose a collapsed Gibbs sampling algorithm, which is a type of Markov-chain Monte Carlo method. This paper demonstrates the effectiveness of the proposed method through evaluation testing applied to actual keystroke log data.

Related Works

This section describes related works on keystroke data applications.

The most common application of keystroke data is user authentication in the security domain. Many user authentication methods that use keystroke data have been proposed (Karnan et al. 2011; Teh et al. 2013; Quraishi and Bedi 2018). In these methods, users are identified by using supervised classifiers trained on keystroke features. Various statistical and probabilistic models and machine learning methods have been used for classification (Karnan et al. 2011; Teh et al. 2013; Quraishi and Bedi 2018). Hidden Markov models (HMMs), which are used in this study, have also been used for user authentication (Chen and Chang 2004; Rodrigues et al. 2005; Ali et al. 2016). In HMM-based authentication methods, an HMM is trained for each user from sequences of keystroke timing features such as the durations of key presses and the time elapsed between key presses. User authentication is performed by checking how well a set of keystroke data from an authentication attempt fits the pre-trained HMMs.

Another application of keystroke data is emotion recognition (Epp et al. 2011; Salmeron-Majadas et al. 2018). Emotion-estimation methods classify user emotion by applying a supervised machine learning classifier trained from keystroke features for each emotion class.

As approaches for general pattern recognition tasks including writing-pattern recognition, temporal interval Bayesian networks (Zhang et al. 2013) and a generative probabilistic model with Allen’s interval-based relations (Liu et al. 2016, 2018) have recently been proposed. These methods can capture complex temporal relations among the occurrences of observable features (so-called atomic/primitive actions) by using Allen’s interval algebra (Allen 1983) and Bayesian networks. Such approaches have been used for classification of sequential data, achieving state-of-the-art accuracy for applications such as classification of sports videos (Zhang et al. 2013; Liu et al. 2016), human actions (Liu et al. 2018), and facial expressions (Wang et al. 2013).

It is important to note that the above-mentioned approaches focus on classification. The approaches are not suitable for our research objective because the purpose of this study is to estimate the hidden processes underlying observable keystroke activities, not to classify the datasets.

For a method focused on estimation of the writing process, Southavilay et al. (2010a) proposed a method for estimating the process of collaborative writing activities by using an HMM. That method collects the versions of a document produced during collaborative writing and estimates changes in semantic meaning between consecutive versions according to predefined heuristics (Southavilay et al. 2010b). The process of collaborative writing is then analyzed by an HMM trained on the sequences of semantic meanings. The purpose and approach are similar to those of our study, but that method does not use keystroke data. Moreover, that study analyzes the writing process by interpreting the parameters of an HMM trained separately for each collaborative group. In this approach, writing processes cannot be compared among groups because the meanings of the latent states will differ among the groups. Although this approach might be extendable to writing process analysis for each learner, interpreting each process would be infeasible because a separate model would have to be trained and interpreted for every learner.

Keystroke Logging System and Log Data

This study assumes that writing tasks are presented to learners, and that keystroke log data are collected as learners compose their responses. To collect keystroke log data, we developed a keystroke logging system similar to those in previous studies (Deane and Zhang 2015; Leijten and Waes 2013; Salmeron-Majadas et al. 2018). Figure 1 shows the interface of the developed system. The system presents learners with a writing task in the left panel and a text input area in the right panel. The system records information on typed characters, cursor position, and timestamps for each input a learner makes to the text area using the keyboard or mouse. The time at which a learner accesses the system is recorded as the response start time. Therefore, the keystroke log data for each learner consist of a sequence of tuples in the format 〈written text, cursor position, timestamp〉, with the number of tuples being the number of keyboard operations plus one. The keystroke logging function is implemented in JavaScript and works on our e-testing platform, developed in Java. The system stores the obtained keystroke log data in an SQL database.
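For illustration, a stored log entry can be represented as a simple record like the following (a minimal sketch; the actual field names and database schema of our system are simplified here):

```python
from dataclasses import dataclass

@dataclass
class KeystrokeLogEntry:
    """One tuple <written text, cursor position, timestamp> for a learner."""
    learner_id: str
    text: str             # full text in the input area after the operation
    cursor_position: int  # character index of the cursor
    timestamp_ms: int     # milliseconds elapsed from the response start time

# A learner's log is a sequence of such entries: one per keyboard operation,
# plus the initial entry recorded at the response start time.
entry = KeystrokeLogEntry("learner_001", "Writing is ...", 14, 53210)
```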

Fig. 1 Interface of keystroke logging system (the writing task on the left side is hidden for copyright reasons)

In this study, we use keystroke log data obtained in this manner to analyze the writing process.

Feature Extraction

Previous studies on keystroke log data analyzed the writing process based on multiple features, such as the number of characters and keystroke interval times, extracted from each learner’s keystroke series (Deane and Zhang 2015; Leijten and Waes 2013; Zhang et al. 2016; Chan 2017). As described in “Introduction”, those studies extracted one set of feature values for each learner and used those features to categorize writing patterns. In contrast, the present study aims to estimate temporal change in the subtasks of each learner. Thus, keystroke log data for each learner must be defined as time-series feature data.

To extract feature sequences, we use a sliding window approach like that widely used in image processing and speech recognition (Leijten and Waes 2013). The sliding window approach builds frames with a small time-width from time-series data and extracts features in units of these analytical frames. The width of each analytical frame W is called the frame width, and the movement width between adjacent frames H is called the step width. If H < W, overlapping of adjacent frames is permitted. Figure 2 illustrates the sliding window approach.

Fig. 2 Feature extraction based on the sliding window approach

In this study, we apply the sliding window approach to keystroke log data for each learner. We extract the seven features listed in Table 1 from each analytic frame (those features are designated as writing features). These features are commonly used in similar studies (Deane and Zhang 2015; Leijten and Waes 2013; Zhang et al. 2016).

Table 1 Writing features

By extracting these features for each analytical frame, we can define keystroke log data as series data for seven dimensions of writing features. Specifically, letting \(X_{ijf}\in \mathbb {R}\) be feature \(f \in {\mathcal F} = \{1,\cdots , F=7\}\) for the \(j\in {\mathcal J} = \{1,\cdots , J\}\)-th analytical frame of learner \(i \in {\mathcal I} = \{1, \cdots , I\}\), series data for writing features of learner i are defined as Xi = {Xi1,⋯ ,XiJ} (where Xij = {Xij1,⋯ ,XijF}).
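As a concrete illustration of this step, the frame-wise extraction can be sketched as follows; the helper compute_features, which maps the log entries in one frame to the seven writing features of Table 1, and the timestamp layout of the log are assumptions made for this sketch.

```python
import numpy as np

def sliding_window_features(log, total_time, compute_features, W=30.0, H=10.0):
    """Split one learner's keystroke log into (possibly overlapping) frames
    and extract a feature vector from each frame.

    log              : list of (text, cursor_position, timestamp_sec) tuples
    total_time       : writing time in seconds
    compute_features : function mapping the entries of one frame to the
                       seven writing features of Table 1 (assumed given)
    W, H             : frame width and step width in seconds (H < W overlaps)
    """
    frames = []
    start = 0.0
    while start + W <= total_time:
        frame_entries = [e for e in log if start <= e[2] < start + W]
        frames.append(compute_features(frame_entries))
        start += H
    return np.asarray(frames)  # X_i with shape (J, F) = (num. frames, 7)
```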

We expect that by applying a machine learning method to dataset X = {X1,⋯ ,XI}, the subtask for each analytical frame of each learner can be estimated. We use an unsupervised machine learning method for this estimation, because it is unclear what subtask types exist within the analytical frame.

Gaussian Hidden Markov Model–based Writing Process Analysis

As discussed in “Feature Extraction”, data Xi are time-series data, and there is likely a dependency between the subtasks of adjacent analytical frames. A typical unsupervised classifier for such time-series data is the hidden Markov model (HMM). When the observed data take continuous values, as in the present study, the GHMM is generally used. The following provides details of the GHMM, assuming application to our dataset X.

GHMM assumes a latent variable \(S_{ij} \in {\mathcal S} = \{1, \cdots , S \}\), called the state, for each feature vector Xij. Each state Sij is obtained according to a transition probability that depends on the immediately preceding state Si,j− 1. Specifically, letting the transition probability from state s to state \(s^{\prime }\) be \(A_{ss^{\prime }}\) (where \(0 \le A_{ss^{\prime }} \le 1,{\sum }_{s^{\prime }=1}^{S} A_{ss^{\prime }} = 1\)) and letting As be the S-dimensional multinomial distribution {As1,⋯ ,AsS}, the probability of state Sij given state Si,j− 1 = s can be written as \(P(S_{ij} |S_{i,j-1} = s, {\boldsymbol A}_{s}) = A_{s,S_{ij}}\). The initial state Si1 is obtained following \(p(S_{i1} | \boldsymbol {\pi }) = \pi _{S_{i1}}\) in accordance with the initial probabilities, which are defined as the S-dimensional multinomial distribution \(\boldsymbol {\pi }=\{ \pi _{1},\cdots , \pi _{S}\}\) (where \(0 \le \pi _{s} \le 1,{\sum }_{s=1}^{S} \pi _{s} = 1\)).

The features for each analytic frame Xij follow a normal distribution dependent on state Sij. In this study, given state Sij = s, we assume that the f -th feature Xijf follows a normal distribution with values of mean μsf and variance \(\sigma _{sf}^{2}\): \(p(X_{ijf} | \mu _{sf}, \sigma _{sf}^{2}) = N(\mu _{sf}, \sigma _{sf}^{2})\). Therefore, the emission probability for features Xij can be obtained from the following equation when state Sij = s:

$$ p(\boldsymbol{X}_{ij} | S_{ij}=s, \boldsymbol{\mu}, \boldsymbol{\sigma}) = {\prod}_{f=1}^{F} p(X_{ijf} | \mu_{sf}, \sigma_{sf}^{2}), $$
(1)

where μ = {μ11,⋯ ,μSF} and σ = {σ11,⋯ ,σSF}.
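For reference, a standard GHMM of this form (diagonal covariance, matching the independent per-feature normal emissions in (1)) could be fit to the stacked feature sequences with an off-the-shelf library such as hmmlearn. This is only a baseline sketch, under the assumption that hmmlearn is available, not the proposed model:

```python
import numpy as np
from hmmlearn import hmm  # assumed available

# X_list: one (J_i, F) writing-feature matrix per learner, as extracted above
X_list = [np.random.rand(268, 7) for _ in range(72)]  # placeholder data
X_concat = np.vstack(X_list)
lengths = [x.shape[0] for x in X_list]

# Diagonal covariance corresponds to independent normal emissions per feature
model = hmm.GaussianHMM(n_components=9, covariance_type="diag", n_iter=100)
model.fit(X_concat, lengths)

# Most likely state (subtask) for every analytical frame of every learner
states = model.predict(X_concat, lengths)
```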

By applying GHMM under the assumption that the latent state of each analytical frame is a subtask, a sequence of subtasks for each learner is expected to be estimated. In this approach, however, a subtask is estimated for each analytical frame, which is defined with a small time-width and which overlaps adjacent frames. Therefore, the variety of subtask sequence patterns becomes extremely large, making it difficult to grasp temporal changes in subtasks for each learner.

Proposed Model

To address this issue, this study proposes an expanded GHMM that incorporates parameters representing temporal change in subtasks for each learner. We divide the series data for each learner into a small number of time intervals and incorporate into the GHMM a parameter representing the subtask appearance probability for each time interval for each learner. Specifically, time-series data are divided into T time intervals \({\mathcal T} = \{1,\cdots , T\}\) with a constant time span. We then incorporate a state appearance probability ϕits that learner i is in state (subtask) s in time interval \(t \in {\mathcal T}\), as shown in Fig. 3. In this figure, ϕit represents the state appearance distribution of learner i in time interval t, defined as the S-dimensional multinomial distribution {ϕit1,⋯ ,ϕitS}.

Fig. 3 State appearance distributions for each time interval

In the proposed model, state probabilities for individual analytical frames are assumed to follow the product of the transition probability and the state appearance probability, as

$$ P(S_{ij} |S_{i,j-1}=s, {\boldsymbol A}_{s}, \boldsymbol{\phi}_{i}) \propto A_{s,S_{ij}} \cdot \phi_{i, t_{ij}, S_{ij}}, $$
(2)

where ϕi = {ϕi11,⋯ ,ϕiTS} and \(t_{ij}\in {\mathcal T}\) is the time interval to which data Xij belong.

Similarly, we assume that the initial probability is defined as

$$ P(S_{i1} |{\boldsymbol \pi}, \boldsymbol{\phi}_{i}) \propto \pi_{S_{i1}} \cdot \phi_{i, 1, S_{i1}}. $$
(3)

The features are obtained using (1) in the same manner as in GHMM.
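As a small illustration of (2) and (3), the normalized state probabilities for one analytical frame can be computed as follows (the array names A, pi, and phi are illustrative; A has shape (S, S), pi shape (S,), and phi shape (I, T, S)):

```python
import numpy as np

def state_probs(A, pi, phi, i, t_ij, s_prev=None):
    """Eqs. (2)-(3): state probabilities for one analytical frame of
    learner i in time interval t_ij. s_prev is None for the first frame
    (Eq. (3)); otherwise it is the previous state (Eq. (2))."""
    base = pi if s_prev is None else A[s_prev]
    p = base * phi[i, t_ij]   # elementwise product over the S states
    return p / p.sum()        # normalize, since (2) and (3) are proportional
```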

In the proposed model, the subtask appearance distribution ϕit can be estimated for an arbitrary time interval for each learner. Thus, by interpreting temporal change in the distribution, the writing pattern trends of each learner can be quantitatively grasped. Furthermore, by measuring differences in the subtask appearance distributions ϕi between learners, using for example the Kullback–Leibler divergence or the Jensen–Shannon divergence, differences in the writing process between learners can be quantitatively evaluated. In addition, writing processes can be categorized using typical clustering methods with those divergence functions, as sketched below.
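The following sketch shows one way such comparisons could be computed, assuming the estimated distributions are stored as an array phi of shape (I, T, S); averaging the JS divergence over time intervals is one possible convention, not a prescription of the proposed method:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def learner_distance(phi_i, phi_j):
    """Average JS divergence over the T time intervals."""
    return np.mean([js_divergence(p, q) for p, q in zip(phi_i, phi_j)])

phi = np.random.dirichlet(np.ones(9), size=(72, 10))  # placeholder (I, T, S)
I = phi.shape[0]
D = np.zeros((I, I))
for i in range(I):
    for j in range(i + 1, I):
        D[i, j] = D[j, i] = learner_distance(phi[i], phi[j])

# Hierarchical clustering of learners with the JS-based distance matrix
Z = linkage(squareform(D), method="average")
clusters = fcluster(Z, t=2, criterion="maxclust")  # e.g., two clusters
```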

Parameter Estimation Algorithm

Representative parameter estimation methods for GHMM are maximum likelihood estimation using the Baum–Welch algorithm and Bayesian estimation using Markov chain Monte Carlo (MCMC) methods (Bishop 2006). For complex models, Bayesian estimation by MCMC tends to provide higher accuracy (Bishop 2006; Brooks et al. 2011). This approach estimates the posterior distribution of each parameter and uses the expected or maximum value as a point estimate. MCMC approximates posterior distributions via sampling. This study proposes a collapsed Gibbs sampling (CGS) algorithm for the proposed model. CGS has been widely used for learning topic models and HMMs as a way to improve the efficiency of MCMC by marginalizing out a part of the parameter set (Griffiths and Steyvers 2004; Griffiths et al. 2004; Paisley and Carin 2009).

CGS repeatedly samples parameter values from the conditional posterior distribution for each parameter, approximating the posterior distribution of the parameters using the obtained samples. A conditional posterior distribution is the distribution of a parameter of interest given all other parameters, after marginalizing a specific parameter set out of the joint distribution. In this study, we marginalize out the initial probabilities π, transition probabilities A, and state appearance probabilities ϕ, and sample the latent states S = {S11,⋯ ,SIJ} and the emission distribution parameters ξ = {μ,σ}.

The remainder of this subsection presents the details of this algorithm. Figure 4 shows a graphical representation of the proposed model for the following derivation. In the figure, α, β, and γ represent hyperparameters for the distributions of A, ϕ, and π, respectively, while μ0, n0, g1, and g2 are hyperparameters for the emission distribution.

Fig. 4 Graphical representation of time- and learner-dependent GHMM

Conditional Posterior Distribution for Sampling State Sij for j > 1

We first derive the conditional posterior distribution of state Sij for j > 1.

Letting \(\boldsymbol {X}^{\backslash ij} = \boldsymbol {X} \backslash \{ \boldsymbol {X}_{ij} \}\) and \(\boldsymbol {S}^{\backslash ij} = \boldsymbol {S} \backslash \{ S_{ij} \}\), the conditional posterior probability that Sij = s for j > 1 can be written as

$$ p(S_{ij}=s|\boldsymbol{X}_{ij}, \boldsymbol{X}^{\backslash ij}, \boldsymbol{S}^{\backslash ij}, \boldsymbol{\xi}) \propto p(\boldsymbol{X}_{ij} | S_{ij} = s, \boldsymbol{\xi}) \cdot p(S_{ij} = s | \boldsymbol{S}^{\backslash ij}). $$
(4)

The first term on the right side of (4) is obtained using (1). Furthermore, by omitting constant terms, the second term can be written as

$$ \begin{array}{@{}rcl@{}} {} p(S_{ij} = s | \boldsymbol{S}^{\backslash ij}) \!&\propto &\! p(S_{ij} = s, S_{i,j+1}| \boldsymbol{S}^{\backslash i,j,j+1})\\ &\propto &\! p(S_{i,j+1}| S_{ij} = s, \boldsymbol{S}^{\backslash i,j,j+1}) \!\cdot\! p(S_{ij} = s| S_{i,j-1}, \boldsymbol{S}^{\backslash i,j-1,j,j+1}), \end{array} $$
(5)

where \(\boldsymbol {S}^{\backslash i,j,j+1} = \boldsymbol {S} \backslash \{S_{ij}, S_{i,j+1}\}\) and \(\boldsymbol {S}^{\backslash i,j-1,j,j+1} = \boldsymbol {S} \backslash \{S_{i,j-1}, S_{ij}, S_{i,j+1}\}\).

If Dirichlet priors with hyperparameters α and β are respectively used for the transition probabilities As and the state appearance probabilities ϕit, then by omitting constant terms, the first term on the right side of (5) can be reorganized as

$$ \begin{array}{@{}rcl@{}} p(S_{i,j+1}| S_{ij} &=& s, \boldsymbol{S}^{\backslash i,j,j+1}) \\ &\propto& \int p(S_{i,j+1}| \boldsymbol{A}_{s}) \cdot p(\boldsymbol{A}_{s}| \boldsymbol{S}^{\backslash i,j,j+1}) d \boldsymbol{A}_{s} \\ &&\ \cdot \int p(S_{i,j+1}| \boldsymbol{\phi}_{i,t_{i,j+1}}) \cdot p(\boldsymbol{\phi}_{i,t_{i,j+1}}| \boldsymbol{S}^{\backslash i,j,j+1}) d \boldsymbol{\phi}_{i, t_{ij+1}} \\ &=& \frac{n_{s,S_{i,j+1}}^{\backslash i,j,j+1} + \alpha}{{\sum}_{s^{\prime}=1}^{S} \left( n_{s,s^{\prime}}^{\backslash i,j,j+1} + \alpha\right) }, \end{array} $$
(6)

where \(n_{s,s^{\prime }}^{\backslash i,j,j+1}\) represents the frequency at which state s transitioned to state \(s^{\prime }\) among \(\boldsymbol {S}^{\backslash i,j,j+1}\).

The second term on the right side of (5) is similarly rewritten as

$$ \begin{array}{@{}rcl@{}} p(S_{ij}&=&s| S_{i,j-1}, \boldsymbol{S}^{\backslash i,j-1,j,j+1}) \\ &&\propto \int p(S_{ij}=s| \boldsymbol{A}_{S_{i,j-1}}) \cdot p(\boldsymbol{A}_{S_{i,j-1}}| \boldsymbol{S}^{\backslash i,j-1,j,j+1}) d \boldsymbol{A}_{S_{i,j-1}} \\ &&\ \ \ \ \ \ \cdot \int p(S_{ij}=s| \boldsymbol{\phi}_{i,t_{ij}}) \cdot p(\boldsymbol{\phi}_{i,t_{ij}}| \boldsymbol{S}^{\backslash i,j-1,j,j+1}) d \boldsymbol{\phi}_{i, t_{ij}} \\ &&= \frac{n_{S_{i,j-1},s}^{\backslash i,j-1,j,j+1} + \alpha}{{\sum}_{s^{\prime}=1}^{S} \left( n_{S_{i,j-1},s^{\prime}}^{\backslash i,j-1,j,j+1} + \alpha \right) } \cdot \frac{n_{i, t_{ij}, s}^{\backslash i,j-1,j,j+1} + \upbeta}{{\sum}_{s^{\prime}=1}^{S} \left( n_{i, t_{ij}, s^{\prime}}^{\backslash i,j-1,j,j+1} + \upbeta \right) } \\ &&\propto (n_{S_{i,j-1},s}^{\backslash i,j-1,j,j+1} + \alpha) \cdot (n_{i, t_{ij}, s}^{\backslash i,j-1,j,j+1} + \upbeta). \end{array} $$
(7)

In these equations, \(n_{s,s^{\prime }}^{\backslash i,j-1,j,j+1}\) represents the frequency at which state s transitioned to state \(s^{\prime }\) among \(\boldsymbol {S}^{\backslash i,j-1,j,j+1}\). Furthermore, \(n_{i,t,s}^{\backslash i,j-1,j,j+1}\) represents the appearance frequency of state s in the state set \(\{S_{ij} \in \boldsymbol {S}^{\backslash i,j-1,j,j+1} | t_{ij} = t\}\) for learner i.

From the above, the conditional posterior distribution of Sij for j > 1 can be described as

$$ \begin{array}{@{}rcl@{}} p(S_{ij}&=&s|\boldsymbol{X}_{ij}, \boldsymbol{X}^{\backslash ij}, \boldsymbol{S}^{\backslash ij}, \boldsymbol{\xi}) \\ &\propto& \left[ {\prod}_{f=1}^{F} p(X_{ijf} | \mu_{sf}, \sigma_{sf}^{2}) \right] \cdot \frac{n_{s,S_{i,j+1}}^{\backslash i,j,j+1} + \alpha}{{\sum}_{s^{\prime}=1}^{S} \left( n_{s,s^{\prime}}^{\backslash i,j,j+1} + \alpha\right) } \\ &&\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \cdot (n_{S_{i,j-1},s}^{\backslash i,j-1,j,j+1} + \alpha) \cdot (n_{i,t_{ij},s}^{\backslash i,j-1,j,j+1} + \upbeta). \end{array} $$
(8)
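In code, the unnormalized conditional of (8) could be evaluated roughly as follows. The count arrays n_trans and n_interval are assumed to have been computed with the states S_{i,j-1}, S_{ij}, and S_{i,j+1} removed, as in the derivation above; sigma holds standard deviations, and all names are illustrative:

```python
import numpy as np
from scipy.stats import norm

def sample_state(x_ij, s_prev, s_next, t_ij, i,
                 n_trans, n_interval, mu, sigma, alpha, beta, rng):
    """Draw S_ij (j > 1) from its collapsed conditional posterior, Eq. (8)."""
    S = mu.shape[0]
    log_p = np.zeros(S)
    for s in range(S):
        # Emission term: product of per-feature normal densities, Eq. (1)
        log_p[s] = norm.logpdf(x_ij, loc=mu[s], scale=sigma[s]).sum()
        # Transition into S_{i,j+1}, cf. Eq. (6)
        log_p[s] += np.log((n_trans[s, s_next] + alpha)
                           / (n_trans[s].sum() + S * alpha))
        # Transition from S_{i,j-1} and interval-wise appearance, cf. Eq. (7)
        log_p[s] += np.log(n_trans[s_prev, s] + alpha)
        log_p[s] += np.log(n_interval[i, t_ij, s] + beta)
    p = np.exp(log_p - log_p.max())
    return rng.choice(S, p=p / p.sum())
```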

Conditional Posterior Distribution for Sampling Initial States

The conditional posterior distribution of initial state Si1 can be written as

$$ p(S_{i1}=s|\boldsymbol{X}_{i1}, \boldsymbol{X}^{\backslash i1}, \boldsymbol{S}^{\backslash i1}, \boldsymbol{\xi}) \propto p(\boldsymbol{X}_{i1} | S_{i1} = s, \boldsymbol{\xi}) \cdot p(S_{i1} = s | \boldsymbol{S}^{\backslash i1}). $$
(9)

Here, the first term on the right side of (9) is obtained using (1), while the second term can be written as

$$ p(S_{i1} = s |\boldsymbol{S}^{\backslash i1}) \propto p(S_{i2}| S_{i1} = s, \boldsymbol{S}^{\backslash i,1,2}) \cdot p(S_{i1}=s| \boldsymbol{S}^{\backslash i,1,2}). $$
(10)

The first term on the right side of the above equation is calculable from (6). When the Dirichlet prior with hyperparameter γ is used for the initial distribution π, the second term can be expressed, omitting constant terms, as

$$ \begin{array}{@{}rcl@{}} p(S_{i1}&=&s| \boldsymbol{S}^{\backslash i,1,2}) \\ &\propto& \int p(S_{i1}=s| \boldsymbol{\pi}) \cdot p(\boldsymbol{\pi}| \boldsymbol{S}^{\backslash i,1,2}) d \boldsymbol{\pi} \cdot \int p(S_{i1}=s| \boldsymbol{\phi}_{i}) \cdot p(\boldsymbol{\phi}_{i}| \boldsymbol{S}^{\backslash i,1,2}) d \boldsymbol{\phi} \\ &=&\frac{n^{\backslash i,1,2}_{s} + \gamma}{{\sum}_{s^{\prime}=1}^{S} \left( n^{\backslash i,1,2}_{s^{\prime}} + \gamma \right)}\cdot \frac{n_{i,1,s}^{\backslash i,1,2} + \upbeta}{{\sum}_{s^{\prime}=1}^{S} \left( n_{i,1,s^{\prime}}^{\backslash i,1,2} + \upbeta \right) } \\ &\propto& (n^{\backslash i,1,2}_{s} + \gamma) \cdot (n_{i,1,s}^{\backslash i,1,2} + \upbeta), \end{array} $$
(11)

where \(n^{\backslash i,1,2}_{s}\) represents the frequency at which initial states among \(\boldsymbol {S}^{\backslash i,1,2}\) equal state s, while \(n_{i,1,s}^{\backslash i,1,2}\) represents the appearance frequency of state s in the state set \(\{S_{ij} \in \boldsymbol {S}^{\backslash i,1,2} | t_{ij} = 1\}\) for learner i.

Thus, the sampling distribution of Si1 is obtained as

$$ \begin{array}{@{}rcl@{}} &&p(S_{i1}=s|\boldsymbol{X}_{i1}, \boldsymbol{X}^{\backslash i1}, \boldsymbol{S}^{\backslash i1}, \boldsymbol{\xi}) \\ &&\propto \left[ {\prod}_{f=1}^{F} p(X_{i1f} | \mu_{sf}, \sigma_{sf}^{2}) \right] \cdot \frac{n_{s,S_{i2}}^{\backslash i,1,2} + \alpha}{{\sum}_{s^{\prime}=1}^{S} \left( n_{s,s^{\prime}}^{\backslash i,1,2} + \alpha\right) }\cdot (n^{\backslash i,1,2}_{s} + \gamma) \cdot (n_{i,1,s}^{\backslash i,1,2} + \upbeta). \end{array} $$
(12)

Conditional Posterior Distribution of Emission Probability Parameters

The conditional posterior distributions of emission probability parameters μsf and \(\sigma ^{2}_{sf}\) for feature f can be expressed as

$$ \begin{array}{@{}rcl@{}} p(\mu_{sf}|\boldsymbol{X}, \boldsymbol{S}, \boldsymbol{\xi}^{\backslash s,f}, \sigma^{2}_{sf}) \propto p(\mu_{sf} |\boldsymbol{X}^{s}_{f}, \sigma^{2}_{sf}) \end{array} $$
(13)
$$ p(\sigma^{2}_{sf}|\boldsymbol{X}, \boldsymbol{S}, \boldsymbol{\xi}^{\backslash s,f}, \mu_{sf}) \propto p(\sigma^{2}_{sf} |\boldsymbol{X}^{s}_{f}, \mu_{sf}), $$
(14)

where \(\boldsymbol {\xi }^{\backslash s,f}=\boldsymbol {\xi } \backslash \{ \mu _{sf}, \sigma ^{2}_{sf} \}\) and \(\boldsymbol {X}^{s}_{f}=\{X_{ijf} \in \boldsymbol {X} | S_{ij} = s, i \in {\mathcal I}, j \in {\mathcal J}\}\). These equations are consistent with the conditional posterior distributions of a typical normal distribution for a sample set \(\boldsymbol {X}^{s}_{f}\). The normal distribution \(N(\mu _{0}, \sigma _{sf}^{2}/n_{0})\) is generally used as the conjugate prior of the mean parameter μsf, where μ0 and n0 are hyperparameters. Concretely, the conditional posterior distribution of μsf is written as follows (Uto and Ueno 2016; Fox 2010):

$$ p(\mu_{sf} |\boldsymbol{X}^{s}_{f}, \sigma^{2}_{sf}) = N(\frac{n_{0}\mu_{0} + |\boldsymbol{X}^{s}_{f}| \cdot \bar{{X^{s}_{f}}}}{n_{0} + |\boldsymbol{X}^{s}_{f}|}, \frac{\sigma_{sf}^{2}}{n_{0} + |\boldsymbol{X}^{s}_{f}|}), $$
(15)

where \(\bar {{X^{s}_{f}}} = {\sum }_{x \in \boldsymbol {X}^{s}_{f}} \frac {x}{|\boldsymbol {X}^{s}_{f}|}\) and \(|\boldsymbol {X}^{s}_{f}|\) indicates the number of data points in \(\boldsymbol {X}^{s}_{f}\).

The inverse gamma distribution IG(g1/2,g2/2), a conjugate prior, is often used as the prior distribution of the normal distribution variance \(\sigma ^{2}_{sf}\) (Gelman 2006), where g1 and g2 are hyperparameters that are generally set to small positive real numbers, such as g1 = g2 = 0.01. Specifically, the conditional posterior distribution of the variance \(\sigma ^{2}_{sf}\) can be expressed as follows (Uto and Ueno 2016; Fox 2010):

$$ p(\sigma^{2}_{sf} |\boldsymbol{X}^{s}_{f}, \mu_{sf}) = IG(\frac{g_{1} +|\boldsymbol{X}^{s}_{f}|}{2}, \frac{{\sigma_{0}^{2}}}{2} ), $$
(16)

where

$$ \sigma_{0}^{2} = g_{2} + \sum\limits_{x \in \boldsymbol{X}^{s}_{f}} (x - \bar{{X^{s}_{f}}} )^{2} + \frac{|\boldsymbol{X}^{s}_{f}| \cdot n_{0}}{|\boldsymbol{X}^{s}_{f}| + n_{0}} \cdot(\bar{{X^{s}_{f}}} - \mu_{0})^{2}. $$
(17)

The algorithm proposed by Tanizaki (2008) is useful for obtaining random samples from an inverse gamma distribution.
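A sketch of these conjugate updates, (15)–(17), is shown below; here scipy.stats.invgamma is used in place of the Tanizaki (2008) sampler, which is an implementation choice for illustration rather than part of the proposed method:

```python
import numpy as np
from scipy.stats import invgamma

def sample_emission_params(X_sf, sigma2_sf, mu0, n0, g1, g2, rng):
    """Sample mu_sf and sigma^2_sf for one state s and feature f.
    X_sf is the vector of feature values currently assigned to state s."""
    n = X_sf.size
    x_bar = X_sf.mean() if n > 0 else mu0

    # Eq. (15): normal posterior for the mean, given the current variance
    post_mean = (n0 * mu0 + n * x_bar) / (n0 + n)
    mu_sf = rng.normal(post_mean, np.sqrt(sigma2_sf / (n0 + n)))

    # Eqs. (16)-(17): inverse-gamma posterior for the variance
    sigma0_sq = g2 + np.sum((X_sf - x_bar) ** 2) \
                + (n * n0) / (n + n0) * (x_bar - mu0) ** 2
    sigma2_sf_new = invgamma.rvs(a=(g1 + n) / 2.0, scale=sigma0_sq / 2.0,
                                 random_state=rng)
    return mu_sf, sigma2_sf_new
```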

Estimation of Marginalized Parameters

Given the obtained state samples, we can estimate the initial probabilities π, transition probabilities A, and state appearance probabilities ϕ as follows:

$$ \pi_{s} = \frac{n_{s} + \gamma}{{\sum}_{s^{\prime}=1}^{S} \left( n_{s^{\prime}} + \gamma \right)} $$
(18)
$$ A_{ss^{\prime}} = \frac{n_{ss^{\prime}} + \alpha}{{\sum}_{s^{\prime}=1}^{S} \left( n_{ss^{\prime}} + \alpha \right) } $$
(19)
$$ \phi_{its} = \frac{n_{its} + \upbeta}{{\sum}_{s^{\prime}=1}^{S} \left( n_{its^{\prime}} + \upbeta \right)} $$
(20)

In these equations, ns is the frequency at which initial state Si1 becomes state s, \(n_{ss^{\prime }}\) is the frequency at which the state transitioned from s to \(s^{\prime }\), and nits is the frequency of the state becoming s among a state set with time interval t for learner i.
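In code, (18)–(20) are simple smoothed normalizations of the corresponding count arrays (the array names are illustrative):

```python
import numpy as np

def estimate_marginalized_params(n_init, n_trans, n_interval, alpha, beta, gamma):
    """n_init: (S,) initial-state counts; n_trans: (S, S) transition counts;
    n_interval: (I, T, S) per-learner, per-interval state counts."""
    pi = (n_init + gamma) / (n_init + gamma).sum()                        # Eq. (18)
    A = (n_trans + alpha) / (n_trans + alpha).sum(axis=1, keepdims=True)  # Eq. (19)
    phi = (n_interval + beta) / (n_interval + beta).sum(axis=2, keepdims=True)  # Eq. (20)
    return pi, A, phi
```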

Algorithm

CGS of the proposed model repeatedly samples states S and the parameters of the emission probability distribution ξ = {μ,σ} according to the equations introduced in the previous subsection. Specifically, (8) is used to sample \(\{S_{ij} \in \boldsymbol {S} \mid j > 1, i \in {\mathcal I}\}\), and (12) is used for \(\{S_{i1} \in \boldsymbol {S} \mid i \in {\mathcal I}\}\). Equations (15) and (16) are used to sample μ and σ, respectively. Additionally, in the CGS algorithm, the initial probability π, transition probability A, and state appearance probability ϕ are calculated from the obtained state samples using (18), (19), and (20), respectively. Finally, the expected values for the obtained parameters are calculated. Algorithm 1 shows pseudocode for the algorithm. A burn-in period is required to remove the effect of initial values.

Algorithm 1 Collapsed Gibbs sampling algorithm for the proposed model

Application and Evaluation

In this section, we evaluate the effectiveness of the proposed model using actual keystroke log data collected using the keystroke logging system introduced in “Keystroke Logging System and Log Data”. In this experiment, we collected actual keystroke log data as follows.

We assigned a writing task to 72 subjects and collected keystroke log data while the subjects composed their responses. The task was a reading-to-write task in which the subjects read a short text and related material, then wrote their opinion. This task required no prior knowledge. The subjects were 37 boys/men and 35 girls/women. The range of ages was 16–23, and the median age was 19 years. Of the subjects, 34 were high school students and 38 were university students. Among the university students, 18 were studying arts/humanities and 20 were studying science of some kind. All subjects had experience using a keyboard to create documents. We provided 45 min to respond, and subjects were not allowed to finish before 45 min had elapsed. The total number of keystroke operations obtained in the experiment was 184,916. The mean and standard deviation of the number of keystroke operations by learners were 2568.28 and 852.53, respectively. The mean and standard deviation values for the final number of characters by learner were 608.92 and 168.30.

Keystroke log data were transformed to writing feature vectors using the sliding window approach, with frame width W = 30 s and step width H = 10 s. W and H are hyperparameters, as described in “Feature Extraction”. W controls the granularity of subtask estimation. A small W value enables the capture of more detailed subtasks, although extremely small values of W increase the number of frames with no or only a few keystroke operations, which makes the subtask estimation unstable. A small H value increases the smoothness of the temporal change in the state appearance probabilities, although overlap among the frames is increased by using a small H. Excessive overlap is not desirable because the computational cost increases rapidly with overlap. Based on the above-mentioned factors, W = 30 and H = 10 were selected by empirical observation. As a result of this feature extraction, the features Xi for each learner i were obtained as the seven previously described writing features at 268 time points. In the remainder of this section, we evaluate the proposed model through application to this dataset X.

Model Selection Using Information Criteria

Writing process analysis using the proposed model depends on the number of states S and the number of time intervals T. To select the state number in GHMM, the Akaike information criterion (AIC) (Akaike 1974) and the Bayesian information criterion (BIC) (Schwarz 1978) have been widely used. AIC and BIC assume asymptotic normality of maximum likelihood estimates (Watanabe 2010, 2013). However, GHMM does not satisfy this assumption, so these information criteria are not theoretically appropriate. When MCMC is used, a log-marginal likelihood (log-ML) that does not assume asymptotic normality is approximately calculable (Newton and Raftery 1994). In recent years, various studies have used the log-ML calculated using MCMC for model selection (Uto et al. 2017; Griffiths and Steyvers 2004; Wallach et al. 2009; Taddy 2012; Uto and Ueno 2018). Therefore, in this study, we use the log-ML to select S and T. Specifically, we calculate the log-ML while changing the number of states S = {2,⋯ ,10} and the number of time intervals T = {1,⋯ ,12}. Note that S = 1 is meaningless for writing process analysis because the same writing process will be estimated for all learners. Thus, S = 1 was ignored in this experiment. Furthermore, the upper limit values of S = 10 and T = 12 were selected because extremely large values make the interpretation of the estimated writing process difficult. We discuss the appropriateness of the upper limit values later, using data for justification. For comparison, we also calculated the log-ML for each state number with GHMM.
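One estimator in this family is the harmonic-mean estimator of Newton and Raftery (1994), which uses the per-sample log-likelihoods collected after burn-in; the sketch below shows only that computation and is given for illustration under the assumption that this estimator is used:

```python
import numpy as np
from scipy.special import logsumexp

def harmonic_mean_log_ml(log_likelihoods):
    """Harmonic-mean approximation of the log-marginal likelihood from
    log p(X | theta^(m)) values of the retained MCMC samples."""
    log_liks = np.asarray(log_likelihoods)
    M = log_liks.size
    # 1/ML ~ (1/M) * sum_m exp(-log p(X | theta^(m)))
    return np.log(M) - logsumexp(-log_liks)
```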

Table 2 shows the experimental results, where a larger value for log-ML indicates increased appropriateness of the model. Table 2 shows that the proposed model tends to produce higher values than does the GHMM when the state number increases. This suggests that trends in subtask appearance differ among learners and among time intervals, and that the proposed model represents them appropriately. Here, to confirm the appropriateness of the upper limits for S and T, Figs. 5 and 6 show, respectively, the average log-ML for each S ∈{1,⋯ ,10} and for each T ∈{1,⋯ ,12}. Figure 5 shows that the log-ML values rapidly increase until around S = 7, and the increase rate is slow for S > 7. Furthermore, the value with S = 10 is smaller than that with S = 9. Figure 6 shows that the log-ML values tend to increase until T = 11. When T = 12, however, the value is sharply reduced. These results suggest that the optimal values probably lie within S = {2,⋯ ,10} and T = {1,⋯ ,12}. In these ranges, the proposed model with S = 9 and T = 10 had the highest indicator value, so we used those values for S and T in the following experiments.

Table 2 Log-marginal likelihood values for each number of states and time interval
Fig. 5 Average log-ML value for each S

Fig. 6 Average log-ML value for each T

Interpretation of States

To analyze the writing patterns of each learner based on the proposed model, we first need to interpret the characteristics of each state. For this interpretation, Table 3 presents the mean and standard deviation parameters of the emission distribution for each feature in each state. Furthermore, for ease of interpretation, Fig. 7 shows the mean feature values for each state, which are normalized so that the maximum is 1 and the minimum is 0. From these results, the characteristics of each state can be interpreted as follows:

Table 3 Mean and standard deviation parameters of emission distributions for each state
Fig. 7 Normalized mean values of emission distributions for each state

State 1 can be interpreted as a waiting stage, because the number of stops is the highest and there are no addition or subtraction operations.

States 2, 3, 7, and 9 can be interpreted as formulation stages, because the cursor is at the end of the text, and some number of addition and subtraction operations can be seen. Here, the numbers of bursts, addition operations, and subtraction operations exhibit the following relation: state 7 < state 3 < state 9 < state 2. In other words, these four states can be differentiated by keystroke speed.

States 4, 5, 6, and 8 can be interpreted as a revision stage, because the cursor positions are relatively toward the beginning of the text, the numbers of characters are relatively high, and there are a certain number of bursts, addition operations, and subtraction operations. A characteristic of state 5 is that the cursor moves often, while characteristics of state 6 are that the cursor is positioned relatively toward the beginning of the text and there are few cursor moves. Moreover, a characteristic of state 4 is that there are extremely few addition and subtraction operations. Conversely, there are many addition and subtraction operations in state 8. From these analyses, we can interpret state 4 as a revision state involving few edits, state 5 as a revision state involving overall edits, state 6 as a revision state involving edits of specific parts in the beginning and middle parts of the text, and state 8 as a revision state with many edits.

Table 4 summarizes the characteristics based on the above analyses.

Table 4 Interpretation of each state

It is worth noting that we might be able to evaluate the appropriateness of our state interpretation by comparing the interpretations with the subjects’ intentions. Subjects’ intentions can be investigated via traditional writing process analysis methods, such as the video playback stimulation method. For this analysis, however, we must create the subtask labels summarized in Table 4 in advance by estimating the proposed model parameters using data from all subjects. Due to our experimental constraints, we could not gather the same subjects after all the data had been collected. The evaluation of appropriateness by comparison with subjects’ intentions remains for future work.

Interpretation of State Appearance Distribution

This section discusses interpretation of the state appearance distribution for each learner based on the above interpretation of states. To that end, Figs. 8 and 9 show the state appearance distribution ϕi for two learners. The horizontal axes in these figures show the time interval, while vertical axes show the appearance probability of each state. Line types show individual states. The figures lead to the following interpretations of the writing process for each learner.

Fig. 8 State appearance distribution of learner 1

Fig. 9 State appearance distribution of learner 2

Learner 1 (Fig. 8) has a high ratio of the waiting state in the first time interval, which we interpret as the learner reading the task and planning. As time progresses, the appearance ratio increases in the order of state 9 (formulation stage with fast writing), state 8 (revision stage with many edits), and state 5 (revision of the overall text). The learner returns to the waiting state in later time intervals, suggesting that this learner has an ideal writing process.

In contrast, learner 2 (Fig. 9) shows high appearance ratios for states 3 and 7, representing slow writing formulation stages across all time intervals. The appearance ratio of revision states is low. This learner might not have been spending enough time on revisions.

The above analysis shows that the proposed model allows quantitative analysis of temporal changes in subtask appearance patterns for each learner.

Validity Evaluation of State Appearance Distributions

This subsection evaluates the validity of subtask appearance distribution estimations by the proposed method.

For this experiment, we randomly selected ten learners from among the subjects. We then showed a replay of their keystrokes to two experts (evaluators A and B, below). We asked the evaluators to assess the appearance ratio of the nine states for each time interval for each learner. However, it might be difficult for humans to directly differentiate between the nine states. Therefore, we first asked the evaluators to assess the ratios of waiting, formulation, and revision, which are major divisions of the nine states, for each time interval of each learner using five categories: 5. appears very often, 4. appears often, 3. appears relatively often, 2. appears somewhat, and 1. does not appear. If revision was scored 2 or higher in this assessment, the range and amount of revision were additionally evaluated as 2. wide or 1. narrow, and 2. large amount or 1. small amount, respectively. Furthermore, keystroke speed was scored as 2. fast or 1. slow for all learners.

In this experiment, we calculated subtask distributions for each learner from these evaluation data (hereinafter called the correct distribution). To create the correct distribution, we calculated scores for each subtask subdivision in Table 4, based on each evaluator’s assessment data. Concretely, for each time interval for each learner, the scores for the seven subtask subdivisions were calculated as follows:

Waiting: Use the evaluation score for waiting.

Formulation (Fast writing): If input speed is 2. fast, use the evaluation score for formulation. If not, use half of the score.

Formulation (Slow writing): If input speed is 1. slow, use the evaluation score for formulation. If not, use half of the score.

Revision (Many edits): If the number of edits is 2. large amount, use the evaluation score for revision. If not, use half of the score.

Revision (Few edits): If the number of edits is 1. small amount, use the evaluation score for revision. If not, use half of the score.

Revision (Overall edits): If the edit range is 2. wide, use the evaluation score for revision. If not, use half of the score.

Revision (Individual edits): If the edit range is 1. narrow, use the evaluation score for revision. If not, use half of the score.

The correct distribution was created by normalizing those scores for each evaluator. Note that this experiment did not distinguish between states 2 and 9 or between states 3 and 7, because those pairs are difficult for humans to differentiate.

We evaluated the validity of the proposed model by comparing the correct distributions with the state appearance distributions from the proposed model. To evaluate these differences, we used the Jensen–Shannon (JS) divergence, which is widely used to evaluate differences between probability distributions. The JS divergence is zero when the distributions are identical and increases as the distributions diverge. To gauge the magnitude of the differences between the correct distributions and the state appearance distributions from the proposed model, we also calculated the JS divergence between the uniform distribution and each distribution. Because the correct distribution combined states 2 and 9 and states 3 and 7 as described above, the JS divergence was calculated after the state appearance probabilities for those state pairs were summed, as illustrated below.
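For reference, the merging of those state pairs before computing the JS divergence can be sketched as follows, using the state interpretations of Table 4 (the ordering of the merged categories is an illustrative choice):

```python
import numpy as np

def merge_state_probs(phi_it):
    """Sum the probabilities of states 2 & 9 (fast formulation) and
    states 3 & 7 (slow formulation). phi_it is ordered state 1, ..., state 9."""
    p = np.asarray(phi_it, dtype=float)
    return np.array([p[0],         # state 1: waiting
                     p[1] + p[8],  # states 2 + 9: formulation (fast writing)
                     p[2] + p[6],  # states 3 + 7: formulation (slow writing)
                     p[3],         # state 4: revision (few edits)
                     p[4],         # state 5: revision (overall edits)
                     p[5],         # state 6: revision (individual edits)
                     p[7]])        # state 8: revision (many edits)
```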

Table 5 shows the mean and standard deviation for the JS divergence calculated between each distribution. These results demonstrate that the difference between the state appearance distribution from the proposed model and each correct distribution is smaller than differences between the uniform distribution and each correct distribution. We performed paired multiple comparison using the Bonferroni method to evaluate whether significant differences are confirmed for those JS divergence means. Table 6 shows the results. In that table, A (or B) indicates evaluator A (or B), X/Y refers to the JS divergence between distributions of methods X and Y, and values in each cell are the p-value for the mean difference of the JS divergences. For example, the cell in the first row and first column shows the p-value of the mean difference between the JS divergences of A/B and those of A/Proposed. The results show that JS divergences between state appearance distributions from the proposed model and the correct distributions as calculated by both evaluators present no significant differences, while those between uniform and correct distributions reveal significant differences.

Table 5 JS divergence between methods
Table 6 Results of statistical tests

These results indicate that state appearance distributions from the proposed model have trends similar to the interpretations by the expert evaluators. This suggests that using the proposed model to analyze subtask appearance patterns for each learner is appropriate.

Analysis of the Relation Between Skills and the Writing Process

As discussed in “Introduction”, the writing process and writing skills are known to be related. Therefore, if analyses of this relation based on the proposed model are consistent with findings from existing studies, the validity of analyses using the proposed model can be confirmed.

For this evaluation, we classified learners with similar processes and analyzed relations between these clusters and writing skills. Specifically, we performed hierarchical clustering using the JS divergence of the state appearance distribution between learners as the distance function. We used the pseudo-F criterion to determine the optimum cluster number, and two clusters were supported. Therefore, in this experiment we classified learners into two clusters, with 26 learners in one group and 46 learners in another. Figures 10 and 11 show mean values of state appearance probabilities for learners belonging to each cluster. Those figures show the following characteristics for each cluster.

Fig. 10 Average state appearance probabilities of cluster 1

Fig. 11 Average state appearance probabilities of cluster 2

Writers in cluster 1 wait for a certain amount of time, followed by a fast-writing formulation stage (states 9 and 2) and then, in the latter half, transition to the waiting state with a certain ratio of overall revision (state 5) and minor edits (state 4). This can be considered as a good writing process, because there is good balance between planning, formulation, and revision, and writing is completed with time to spare.

Writers in cluster 2 are relatively slow to start writing, and the start of writing is followed by slow-writing formulation (states 7 and 3) for a long time, with minor edits (state 4) and edits at specific locations (state 6) conducted just before the end of the writing period. This can be seen as a cluster of learners who formulate slowly and cannot secure sufficient time for revisions.

Learners who write fast and spend more time performing revisions are generally known to have high writing skills (Sasaki 2000, 2002; Larios et al. 2008). Therefore, if the product quality in cluster 1 is high, analyses based on the proposed model are validated.

To evaluate the quality of the writing products, we had two experts score the writing of each learner from two perspectives: 1) organization and 2) readability. Organization was evaluated using five categories: 1. extremely poor, 2. poor, 3. neither poor nor skilled, 4. skilled, and 5. extremely skilled. Readability was evaluated using four categories: 1. very difficult to read and understand, 2. difficult to read and understand, 3. somewhat difficult to read and understand, and 4. no problem with readability. Evaluators were not informed of the cluster to which each learner belonged. The average of the scores given by the two experts was used as the final score. We conducted the Wilcoxon rank-sum test to examine differences in mean score between the clusters for each evaluation perspective. We also performed the same test for the total score over the two perspectives.

Table 7 shows the results. The scores for cluster 1 are higher than those for cluster 2 for each evaluation perspective. In addition, the differences in the readability and total scores are significant. We also examined relations between the attributes of the subjects and their scores. Concretely, Table 8 shows the averaged total scores (standard deviations) and the p-values of the Wilcoxon rank-sum test for gender and educational background. These tests found no significant difference for gender or background. In addition, the correlation between age and total score was 0.17, which was not significant (p = 0.15). These results indicate that subject attributes did not substantially affect the outcome of this experiment.

Table 7 Scores for each cluster
Table 8 Scores for each attribute of subjects

From the experimental results, we can confirm that the product quality in cluster 1 was higher than that in cluster 2, as expected. This suggests that writing process analyses based on the proposed model derive findings consistent with those of previous studies (e.g., Sasaki 2000, 2002; Larios et al. 2008), suggesting that such analyses are appropriate.

Finally, to show some examples of the writing processes of subjects with advanced skills and those with lesser skills, Figs. 12 and 13 depict the state appearance probabilities of the subjects with the top 3 and bottom 3 scores, respectively. Figure 12 shows that high-performing subjects have high ratios of formulation stages (states 2, 3, 7, and 9) in the first half of the total writing time. In the second half, the ratios of waiting (state 1) and revision actions (states 4, 5, 6, and 8) increase. Although the time allocated to each stage differs among the subjects, they tend to divide the formulation and revision phases consciously. In contrast, Fig. 13 shows that low-performing subjects have a low ratio of waiting (state 1) across all time intervals, meaning that they continued to write until just before the end of the writing period. Concretely, the subject shown in the center of Fig. 13 has high ratios of formulation stages (states 3, 7, and 9) overall, and the subjects shown on the left and right of Fig. 13 have high ratios of editing/revision actions (states 4, 5, 6, and 8) across all time intervals. These results suggest that the common characteristics of good writers are that (1) the formulation and revision phases are divided, and (2) time management is practiced. In addition, the figures show that the temporal changes in subtasks differ between subjects, although good (and poor) writers share roughly similar trends. Interpreting these detailed temporal changes would help learners and instructors grasp the writing characteristics of each learner quantitatively, as we discussed in “Introduction”. Furthermore, such data also provide useful information for instructors to use in providing feedback and instruction based on the writing patterns of each learner.

Fig. 12 State appearance probabilities of subjects with top 3 scores

Fig. 13 State appearance probabilities of subjects with bottom 3 scores

Conclusion

In this study, we proposed a machine learning method that estimates temporal change in learners’ subtask appearance patterns from keystroke log data. Specifically, we defined keystroke log data as series data of writing features and developed an expanded GHMM that estimates the subtask appearance distribution from these data. The proposed model regards the latent states of the GHMM as subtasks and incorporates parameters that express state appearance probabilities for each time interval for each learner. Furthermore, we proposed a Bayesian estimation method based on collapsed Gibbs sampling for estimating the parameters of the proposed model.

We used actual data to show that the proposed model produces higher fit than does GHMM. Furthermore, we demonstrated that the writing process of each learner could be quantitatively interpreted based on the state appearance distribution from the proposed model, and that the distribution was valid and similar to interpretations by expert evaluators. We also showed that differences in learner writing processes can be measured based on JS divergence in state appearance distributions, and that clustering of writing processes can be conducted using that measure. We also demonstrated the validity of writing process analysis based on the clustering.

In the future, we would like to examine the generalizability of the proposed method through applications to various actual datasets. We will also examine an extension of the proposed method to account for reports that writing activity depends on emotion (Epp et al. 2011; Salmeron-Majadas et al. 2018). In the current method, the state appearance probabilities ϕi for each learner will reflect the effects of emotion, but those effects cannot be differentiated from writing characteristics; they might be explicitly captured by incorporating emotion parameters. This extension will be explored in future work. We also aim to develop a writing learning support system that visualizes the estimated results as feedback to learners.