Eye movement analysis with hidden Markov models (EMHMM) with co-clustering

The eye movement analysis with hidden Markov models (EMHMM) method provides quantitative measures of individual differences in eye-movement patterns. However, it is limited to tasks where stimuli have the same feature layout (e.g., faces). Here we propose to combine EMHMM with the data-mining technique co-clustering to discover participant groups with consistent eye-movement patterns across stimuli for tasks involving stimuli with different feature layouts. By applying this method to eye movements in scene perception, we discovered explorative (switching between the foreground and background information or different regions of interest) and focused (mainly looking at the foreground with less switching) eye-movement patterns among Asian participants. Higher similarity to the explorative pattern predicted better foreground object recognition performance, whereas higher similarity to the focused pattern was associated with better feature integration in the flanker task. These results have important implications for using eye tracking as a window into individual differences in cognitive abilities and styles. Thus, EMHMM with co-clustering provides quantitative assessments of eye-movement patterns across stimuli and tasks. It can be applied to many other real-life visual tasks, making a significant impact on the use of eye tracking to study cognitive behavior across disciplines.
Supplementary Information: The online version contains supplementary material available at 10.3758/s13428-021-01541-5.

In this Appendix we provide the formal derivation of VHEM with co-clustering. The original variational hierarchical EM (VHEM) algorithm (Coviello et al., 2014) clusters a set of individuals' HMMs into groups and forms a representative HMM for each group. Here we modify VHEM to perform co-clustering over several sets of HMMs, where each set corresponds to individuals' HMMs for one stimulus. In particular, VHEM with co-clustering is equivalent to running VHEM separately on each set of HMMs (i.e., for each stimulus), but it computes consistent cluster assignments of individuals to groups across all stimuli (i.e., all runs of VHEM). The output of co-clustering is a set of representative HMMs (one for each stimulus) for each group.

A.1 Problem formulation
Consider an experiment where subjects view multiple image stimuli, which have different spatial arrangements (e.g., images of different scenes). The goal is to cluster the subjects into groups so that group members share similar eye-gaze behaviors across all stimuli, although the shared behavior for a specific stimulus may differ from that for other stimuli.
Let $B^{(s)}_i$ be the individual (base) HMM learned for subject $i$ on stimulus $s$. Table 1 shows the arrangement of the individual HMMs. The goal of co-clustering is to find subsets of rows in Table 1 that are similar, i.e., for each stimulus, the HMMs are similar for all subjects in the group. Associated with each group are representative HMMs (i.e., cluster "centers"), one for each stimulus. Let $R^{(s)}_j$ be the representative HMM for group $j$ and stimulus $s$. Table 2 shows an example of co-clustering, where similar HMMs are grouped together to form representative HMMs.
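To make the data layout concrete, the following Python sketch (using hypothetical placeholder structures, not the EMHMM toolbox API) shows how the individual HMMs form the subjects-by-stimuli grid of Table 1, and how the co-clustering output pairs one shared group assignment per subject with per-stimulus representative HMMs, as in Table 2.

```python
# Sketch of the co-clustering input/output layout (placeholder strings
# stand in for fitted HMM objects; names here are purely illustrative).

n_subjects, n_stimuli, n_groups = 4, 3, 2

# base_hmms[i][s] is subject i's HMM for stimulus s (rows of Table 1)
base_hmms = [[f"B({s+1})_{i+1}" for s in range(n_stimuli)]
             for i in range(n_subjects)]

# co-clustering output: one group label per subject, shared across all
# stimuli, plus one representative HMM per (group, stimulus) (Table 2)
assignments = [0, 1, 0, 1]
reps = [[f"R({s+1})_{j+1}" for s in range(n_stimuli)]
        for j in range(n_groups)]

for i, g in enumerate(assignments):
    print(f"subject {i+1} -> group {g+1}:", base_hmms[i])
```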

Table 2: Co-clustering groups similar rows (i.e., groups similar HMMs across stimuli), and forms representative HMMs for each group.

A.2 Derivation
We will use the same notation as the VHEM paper (Coviello et al., 2014), but we will use $B$ and $R$ to represent the base and reduced HMM mixtures (H3Ms), and superscript $(s)$ to represent the stimulus index $s$. The set of base H3Ms $B = \{B^{(1)}, \cdots, B^{(S)}\}$ contains the base H3M $B^{(s)}$ for each stimulus $s$, and likewise for the set of reduced H3Ms $R = \{R^{(1)}, \cdots, R^{(S)}\}$, where $S$ is the number of stimuli. For stimulus $s$, the base H3M $B^{(s)}$ is composed of the individuals' HMMs $\{B^{(s)}_i\}_{i=1}^{K^{(b)}}$, where $K^{(b)}$ is the number of individuals and $K^{(r)}$ is the number of groups.

A.2.1 Formulation
For the $s$-th stimulus, the likelihood functions for an observed sequence $y^{(s)}$ under the base (individuals') H3M $B^{(s)}$ and the reduced (groups') H3M $R^{(s)}$ are

$$p(y^{(s)}|B^{(s)}) = \sum_{i=1}^{K^{(b)}} \omega_i^{(b)}\, p(y^{(s)}|B_i^{(s)}), \qquad p(y^{(s)}|R^{(s)}) = \sum_{j=1}^{K^{(r)}} \omega_j^{(r)}\, p(y^{(s)}|R_j^{(s)}).$$

Note that the cluster priors $\omega_i^{(b)}$ and $\omega_j^{(r)}$ are the same for each stimulus $s$, since we have the same number of subjects for each stimulus, and the same group sizes for each stimulus.
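Since each H3M is a finite mixture of HMMs, its sequence log-likelihood is a log-sum-exp over components. The Python sketch below illustrates this computation; the per-component log-likelihoods are assumed to come from each component HMM's forward algorithm, which is not implemented here.

```python
import numpy as np
from scipy.special import logsumexp

def h3m_log_likelihood(log_omega, per_component_loglik):
    """Log-likelihood of one sequence under an H3M (mixture of HMMs).

    log_omega            : (K,) log mixture weights (cluster priors)
    per_component_loglik : (K,) log p(y | HMM_k), e.g., from the
                           forward algorithm of each component HMM
    Returns log p(y | H3M) = log sum_k omega_k * p(y | HMM_k).
    """
    return logsumexp(log_omega + per_component_loglik)

# toy usage: 3 component HMMs with equal priors
print(h3m_log_likelihood(np.log(np.ones(3) / 3),
                         np.array([-10.2, -8.7, -12.1])))
```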
The VHEM algorithm first draws a set of virtual samples from each base mixture component (Coviello et al., 2014). For stimulus $s$, let $Y_i^{(s)}$ denote the set of $N_i$ virtual sequences drawn from the individual's HMM $B_i^{(s)}$. For stimulus $s$, the reduced HMMs are obtained by maximizing the expected log-likelihood of the virtual samples on stimulus $s$,

$$J(R^{(s)}) = \sum_{i=1}^{K^{(b)}} \mathbb{E}_{B_i^{(s)}}\big[\log p(Y_i^{(s)}|R^{(s)})\big].$$

For co-clustering, we aggregate the expected log-likelihood of the virtual samples from all stimuli $\{Y_i^{(s)}\}$. Since the virtual samples are independent, we have

$$J(R) = \sum_{s=1}^{S}\sum_{i=1}^{K^{(b)}} \mathbb{E}_{B_i^{(s)}}\big[\log p(Y_i^{(s)}|R^{(s)})\big]. \tag{4}$$

Finally, the reduced model $R$ is estimated by maximizing the expected log-likelihood in (4),

$$R^{*} = \arg\max_{R}\, J(R).$$

A.2.2 Variational Lower Bound
As with the original formulation, (4) is analytically intractable to evaluate. Hence, we construct a variational lower bound by deriving a lower bound for each term in the sum,

$$J(R) \geq \sum_{s=1}^{S}\sum_{i=1}^{K^{(b)}} \mathcal{L}_{H3M}^{(s),i},$$

where $\mathcal{L}_{H3M}^{(s),i}$ is a variational lower bound for each term in the summation, $\mathbb{E}_{B_i^{(s)}}\big[\log p(Y_i^{(s)}|R^{(s)})\big] \geq \mathcal{L}_{H3M}^{(s),i}$. The term $\mathcal{L}_{H3M}^{(s),i}$ takes the same form as in the original VHEM derivation, except that the assignment variables $z_{ij}$ are shared across stimuli. Summing over stimuli,

$$\sum_{s=1}^{S}\mathcal{L}_{H3M}^{(s),i} = \max_{z_{ij}} \sum_{s=1}^{S}\sum_{j=1}^{K^{(r)}} z_{ij}\Big[\log\frac{\omega_j^{(r)}}{z_{ij}} + N_i\,\mathcal{L}_{HMM}^{(s),i,j}\Big], \tag{10}$$

where in (10) the "max" operator can be swapped with the summation over $s$ because $z_{ij}$ is the same for each stimulus $s$. Summing (10) over subjects gives the overall lower bound,

$$J(R) \geq \max_{\{z_{ij}\}} \sum_{i=1}^{K^{(b)}}\sum_{j=1}^{K^{(r)}} z_{ij}\Big[S\log\frac{\omega_j^{(r)}}{z_{ij}} + N_i\sum_{s=1}^{S}\mathcal{L}_{HMM}^{(s),i,j}\Big]. \tag{13}$$

Defining the stimulus-averaged term

$$\bar{\mathcal{L}}_{HMM}^{\,i,j} = \frac{1}{S}\sum_{s=1}^{S}\mathcal{L}_{HMM}^{(s),i,j}, \tag{14}$$

the maximum with respect to the assignment variables $z_{ij}$ is obtained using results from Appendix D.2 of Coviello et al. (2014),

$$\hat{z}_{ij} = \frac{\omega_j^{(r)}\exp\big(N_i\,\bar{\mathcal{L}}_{HMM}^{\,i,j}\big)}{\sum_{j'}\omega_{j'}^{(r)}\exp\big(N_i\,\bar{\mathcal{L}}_{HMM}^{\,i,j'}\big)}. \tag{15}$$

The main difference between the assignment variable in (15) and the corresponding Eq. 19 in the VHEM paper (Coviello et al., 2014) is that (15) averages the expected log-likelihoods between HMMs $\mathcal{L}_{HMM}^{(s),i,j}$ over all the stimuli $s$. After calculating the assignment variables $\hat{z}_{ij}$, the remaining E-step is equivalent to using the original E-step of VHEM on each stimulus separately. That is, the summary statistics $\Theta_{ij}^{(s)}$ are computed separately for each stimulus $s$. The M-step then updates the models using the shared assignment variables $\hat{z}_{ij}$ for each stimulus $s$. Finally, convergence is tested using the lower bound on $J(R)$ obtained by substituting the optimal $\hat{z}_{ij}$ into (13),

$$J(R) \geq \sum_{i=1}^{K^{(b)}}\sum_{j=1}^{K^{(r)}} \hat{z}_{ij}\Big[S\log\frac{\omega_j^{(r)}}{\hat{z}_{ij}} + S N_i\,\bar{\mathcal{L}}_{HMM}^{\,i,j}\Big] = S\sum_{i=1}^{K^{(b)}}\log\sum_{j=1}^{K^{(r)}}\omega_j^{(r)}\exp\big(N_i\,\bar{\mathcal{L}}_{HMM}^{\,i,j}\big),$$

where the last line follows from the property that the assignment variables sum to 1 (over $j$).
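The shared E-step in (14)-(15) reduces to a softmax over groups of the prior log-weights plus the stimulus-averaged expected log-likelihoods. Below is a minimal Python sketch, assuming the tensor of expected log-likelihoods $\mathcal{L}_{HMM}^{(s),i,j}$ has already been computed (e.g., by the VHEM lower-bound recursions, not shown here).

```python
import numpy as np
from scipy.special import softmax

def coclustering_assignments(L_hmm, log_omega, N_i):
    """Sketch of the shared assignment computation in (14)-(15).

    L_hmm     : (S, I, J) expected log-likelihoods L_HMM^{(s),i,j}
                between base HMM i and group HMM j on stimulus s
    log_omega : (J,) log group weights
    N_i       : (I,) number of virtual samples per subject
    Returns z_hat of shape (I, J), shared across all stimuli.
    """
    L_bar = L_hmm.mean(axis=0)                 # (14): average over stimuli s
    logits = log_omega[None, :] + N_i[:, None] * L_bar
    return softmax(logits, axis=1)             # (15): normalize over groups j

# toy usage: 3 stimuli, 4 subjects, 2 groups
rng = np.random.default_rng(0)
z_hat = coclustering_assignments(rng.normal(size=(3, 4, 2)),
                                 np.log([0.5, 0.5]), np.full(4, 50.0))
print(z_hat.round(3))
```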

A.2.3 VHEM Algorithm with co-clustering
The VHEM algorithm with co-clustering is summarized in Algorithm 1, where $S$ is the number of stimuli, $K^{(b)}$ is the number of base HMMs, and $K^{(r)}$ is the number of reduced HMMs. Each iteration calculates the assignment variables $\hat{z}_{ij}$ using (14) and (15), computes the summary statistics separately for each stimulus, and updates each stimulus's group HMMs in the M-step, until the lower bound converges.
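Under the reconstruction above, substituting the optimal $\hat{z}_{ij}$ turns the lower bound into a log-sum-exp over groups, which gives a convenient scalar for the convergence test. A sketch of this computation (Python; `L_hmm` is assumed precomputed as in the E-step sketch above):

```python
import numpy as np
from scipy.special import logsumexp

def coclustering_lower_bound(L_hmm, log_omega, N_i):
    """Lower bound on J(R) after substituting the optimal z_hat
    (the log-sum-exp form above); used as the convergence test."""
    S = L_hmm.shape[0]
    L_bar = L_hmm.mean(axis=0)                       # (I, J), average over s
    logits = log_omega[None, :] + N_i[:, None] * L_bar
    return S * logsumexp(logits, axis=1).sum()       # sum over subjects i

# toy usage, matching the shapes from the E-step sketch above
rng = np.random.default_rng(0)
print(coclustering_lower_bound(rng.normal(size=(3, 4, 2)),
                               np.log([0.5, 0.5]), np.full(4, 50.0)))
```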

B. Derivation of VBHEM with Co-clustering
In this Appendix we provide the formal derivation of VBHEM with co-clustering. The original variational Bayesian hierarchical EM (VBHEM) algorithm is a Bayesian version of the variational hierarchical EM (VHEM) algorithm (Coviello et al., 2014), which clusters a set of individuals' HMMs into groups and forms a representative HMM for each group.
Here we modify VBHEM to perform co-clustering over several sets of HMMs, where each set corresponds to individuals' HMMs for one stimulus. In particular, VBHEM with co-clustering is equivalent to running VBHEM separately on each set of HMMs (i.e., for each stimulus), but it computes consistent cluster assignments of individuals to groups across all stimuli (i.e., all runs of VBHEM). The output of co-clustering is a set of representative HMMs (one for each stimulus) for each group. The main difference between VBHEM and VHEM is that VBHEM assumes prior distributions on the HMM parameters of the group models. The number of groups can then be estimated by maximizing the marginal likelihood of the data.

B.1 Problem Formulation
The co-clustering problem formulation and notation are the same as in the VHEM derivation above. In particular, we use $B$ and $R$ to represent the base and reduced HMM mixtures (H3Ms), and superscript $(s)$ to represent the stimulus index $s$. The set of base H3Ms $B = \{B^{(1)}, \cdots, B^{(S)}\}$ contains the base H3M $B^{(s)}$ for each stimulus $s$, and likewise for the set of reduced H3Ms $R = \{R^{(1)}, \cdots, R^{(S)}\}$, where $S$ is the number of stimuli. For stimulus $s$, the base H3M $B^{(s)}$ is composed of the individuals' HMMs $\{B^{(s)}_i\}_{i=1}^{K^{(b)}}$, where $K^{(b)}$ is the number of individuals viewing stimulus $s$ and $K^{(r)}$ is the number of groups.

B.2 Evidence Lower Bound
While VHEM considers the parameters of the reduced model $R^{(s)}_j$ as unknown quantities and computes their point estimates, VBHEM considers the parameters of the reduced model $R^{(s)}_j$ as hidden variables, places a prior on each parameter, and computes its posterior. Let $Y$ denote the observations and $H$ the hidden variables. The VB framework for an H3M with $K$ components and $S$ states is formulated as follows. The marginal log-likelihood (i.e., model evidence) decomposes as

$$\log p(Y) = \mathcal{L}(q) + \mathrm{KL}(q\,\|\,p),$$

where

$$\mathcal{L}(q) = \sum_{K,S}\int q(H,K,S)\,\log\frac{p(Y,H,K,S)}{q(H,K,S)}\,dH$$

and $\mathrm{KL}(q\|p) = \int q(z)\log\frac{q(z)}{p(z)}\,dz$ is the Kullback-Leibler divergence (KLD) between distributions $q$ and $p$, which satisfies $\mathrm{KL}(q\|p) \geq 0$. Thus, we have $\log p(Y) \geq \mathcal{L}(q)$, which holds for any distribution $q(H,K,S)$, and equality occurs when $q(H,K,S) = p(H,K,S|Y)$ (i.e., $\mathrm{KL}(q\|p) = 0$). Therefore, $\mathcal{L}(q)$ is a lower bound on $\log p(Y)$, and optimizing $\mathcal{L}(q)$ w.r.t. $q(H,K,S)$ yields an approximation of the true posterior distribution $p(H,K,S|Y)$.
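The decomposition $\log p(Y) = \mathcal{L}(q) + \mathrm{KL}(q\|p)$ can be verified numerically on a toy example with a single discrete hidden variable; the sketch below only illustrates the identity itself, not the H3M model.

```python
import numpy as np

# Toy check of log p(Y) = L(q) + KL(q || p(H|Y)) for one discrete
# hidden variable H with three values (illustrative numbers only).
p_joint = np.array([0.3, 0.1, 0.2])        # p(Y, H=h) for a fixed Y
p_Y = p_joint.sum()                        # marginal likelihood p(Y)
posterior = p_joint / p_Y                  # p(H | Y)

q = np.array([0.5, 0.25, 0.25])            # any distribution over H
L_q = np.sum(q * np.log(p_joint / q))      # lower bound L(q)
KL = np.sum(q * np.log(q / posterior))     # KL(q || p(H|Y)) >= 0

print(np.log(p_Y), L_q + KL)               # identical values
print(L_q <= np.log(p_Y))                  # True for any q
```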
However, if we maximize $\mathcal{L}(q)$ w.r.t. $q(H|K,S)$, the results for different pairs of $(K,S)$ are coupled since they are conditioned on $K$ and $S$. We proceed instead by first optimizing each of the $q(H|K,S)$ individually through optimization of $\mathcal{L}^{(K,S)}(q)$. Assume that $q(H|K,S) = \prod_{l=1}^{L} q_l(H_l|K,S)$, where $\{H_l\}_{l\in[L]}$ is a partition of $H$. Then, the optimal solution $q_l^*(H_l|K,S)$ is given by (Bishop, 2006):

$$\log q_l^*(H_l|K,S) = \mathbb{E}_{q(H_{\setminus l}|K,S)}\big[\log p(Y,H|K,S)\big] + \mathrm{const},$$

where $H_{\setminus l}$ denotes all hidden variables except $H_l$.

Model Selection. For a set of candidate models, i.e., different pairs of $(K,S)$, we can rewrite (21) as

$$\mathcal{L}(q) = \sum_{K,S} q(K,S)\Big[\mathcal{L}^{(K,S)}(q) + \log\frac{p(K,S)}{q(K,S)}\Big], \tag{24}$$

recognizing (24) as the negative KLD between $q(K,S)$ and the (not necessarily normalized) distribution $p(K,S)\exp\{\mathcal{L}^{(K,S)}(q)\}$. The lower bound $\mathcal{L}(q)$ will be maximized when the KLD is minimized, which is the case when

$$q^*(K,S) \propto p(K,S)\exp\{\mathcal{L}^{(K,S)}(q)\}.$$

Thus, the optimal model structure can be found as $(K^*, S^*) = \arg\max_{K,S}\, q^*(K,S)$.
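With a uniform prior $p(K,S)$, maximizing $q^*(K,S)$ amounts to picking the $(K,S)$ pair with the largest lower bound. A small Python sketch with hypothetical $\mathcal{L}^{(K,S)}$ values:

```python
import numpy as np
from scipy.special import softmax

# Model selection sketch: with uniform p(K, S), q*(K, S) is
# proportional to exp{L^(K,S)(q)}, so the optimal structure is the
# (K, S) pair with the largest lower bound (values are hypothetical).
K_vals, S_vals = [1, 2, 3], [2, 3]
L_grid = np.array([[-520.4, -515.9],
                   [-498.2, -501.7],
                   [-499.0, -503.3]])       # L^(K,S) for each pair

q_star = softmax(L_grid.ravel()).reshape(L_grid.shape)
k_idx, s_idx = np.unravel_index(np.argmax(q_star), q_star.shape)
print(f"(K*, S*) = ({K_vals[k_idx]}, {S_vals[s_idx]})")   # -> (2, 2)
```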

B.3 Formulation
represents a "base" hidden Markov model mixtures (H3M) for the s-th stimulus with K (s,b) components (HMMs) and S (b) states for each component. The reduced model for s-th stimulus is denoted by r) . Note that in the co-clustering setting, the different stimuli share the same mixture weight ω (r) j . In the Bayesian framework, we assume priors on all unknown parameters in R (s) , where Dir(·|α) is a Dirichlet distribution with concentration vector α, N (·|µ, Σ) is a Gaussian distribution with mean µ and covariance Σ, and W(·|W , ν) is a Wishart distribution with scale matrix W and degrees-of-freedom ν.
We also assume prior distributions over the number of components $K^{(r)}$ and states $S^{(r)}$ in the reduced model, as uniform distributions on $K^{(r)}$ and $S^{(r)}$, where $K_{\min}$ and $K_{\max}$ ($S_{\min}$ and $S_{\max}$) are the minimum and maximum possible values that $K^{(r)}$ ($S^{(r)}$) can take. Table 3 summarizes the notation used in the derivation, including the variable names, latent variables, model names, and hyperparameters. The algorithm is summarized in Algorithm 2.
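For intuition about these prior families, the following Python sketch draws one sample from each using SciPy, with hypothetical hyperparameter values for two-dimensional fixation data; in the actual algorithm the hyperparameters are optimized as in Section B.6.

```python
import numpy as np
from scipy.stats import dirichlet, wishart, multivariate_normal

# Sampling from priors of the stated families (hyperparameter values
# here are hypothetical, chosen only for illustration).
rng = np.random.default_rng(0)

alpha0 = np.ones(3)                           # Dirichlet concentration (3 groups)
omega = dirichlet.rvs(alpha0, random_state=rng)[0]        # mixture weights

W0, nu0 = np.eye(2), 4.0                      # Wishart scale and dof (nu > d - 1)
Lam = wishart.rvs(df=nu0, scale=W0, random_state=rng)     # emission precision

m0, beta0 = np.zeros(2), 1.0                  # Gaussian prior on the ROI mean
mu = multivariate_normal.rvs(mean=m0, cov=np.linalg.inv(beta0 * Lam),
                             random_state=rng)
print(omega, mu, Lam, sep="\n")
```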
In order to compute the posteriors of each parameter using (23), we first consider fixed $K^{(r)}$ and $S^{(r)}$ and write the joint log-probability as

$$\log p(Y, Z, R, \Omega^{(r)}) = \log p(Y|Z, R) + \log p(Z|\Omega^{(r)}) + \log p(R) + \log p(\Omega^{(r)}),$$

where the hidden variables are $H = \{Z, \Omega^{(r)}, R\}$. Note that $z^{(s)}_{i,j} = z_{i,j}$, i.e., the cluster assignments are the same among all stimuli, which follows from the co-clustering framework. Within Algorithm 2, for each candidate pair $(K^{(r)}, S^{(r)})$, the responsibilities $\hat{z}_{i,j}$ are computed using (26) and the variational posteriors are updated until convergence of $\mathcal{L}^{(K^{(r)},S^{(r)})}(q^*)$; the reduced H3M $R$ with maximum $\mathcal{L}^{(K^{(r)},S^{(r)})}(q^*)$ is then selected.

B.4 Optimal Variational Distribution
The hidden variables in our model are the assignment variables $Z$, the group HMMs $R$, and the group HMM component weights $\Omega^{(r)}$. The optimal variational distributions $q^*$ are found via (23); for each hidden variable, we obtain its posterior in turn. Normalizing $z_{i,j}$ yields $\hat{z}_{i,j}$, the probability that the $i$-th subject is assigned to the $j$-th group, and $N_j$ is the expected number of samples that have been assigned to $R_j$. Removing the terms in $\mathcal{L}^{(s),i,j}_{HMM}$ that are irrelevant to the mean and covariance gives the updates for the emission parameters. Here $\hat{\nu}^{(s),i,j}(\rho,\beta)$ has the same form as in the VHEM algorithm (Coviello et al., 2014), and $N^{(s)}_{j,\rho}$ is the expected number of samples that have been assigned to $R_j$ with state $\rho$ over all time steps. From (29), as more samples are assigned to $R_j$ with state $\rho$ (i.e., $N^{(s)}_{j,\rho}$ increases), $\gamma^{(s)}_{j,\rho}$ will increase and the covariance of the posterior of $\mu^{(s,r)}_{j,\rho}$ will decrease; at the same time, the degrees of freedom $\nu^{(s)}_{j,\rho}$ will increase, which leads to increasing precision of the posterior of $\mu^{(s,r)}_{j,\rho}$. For $\log q^*(\pi^{(s,r)}_j)$, the terms $\hat{\nu}^{(s),i,j}_1(\rho_1)$ and $\hat{\xi}^{(s),i,j}(\rho,\rho')$ have the same form as in the VHEM algorithm (Coviello et al., 2014).
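The shrinkage behavior noted above mirrors a standard conjugate Gaussian update, where the posterior precision of a mean grows with the (expected) number of assigned samples. A toy Python illustration (hypothetical numbers; not the full Normal-Wishart update of the algorithm):

```python
import numpy as np

# As N (expected assigned samples) grows, the posterior precision of
# the group mean mu grows and its posterior covariance shrinks.
beta0, m0 = 1.0, np.zeros(2)        # prior precision scale and prior mean
xbar = np.array([0.4, -0.2])        # weighted sample mean of assigned data

for N in [1, 10, 100]:
    beta_N = beta0 + N                         # posterior precision scale
    m_N = (beta0 * m0 + N * xbar) / beta_N     # posterior mean shifts to xbar
    print(N, beta_N, m_N.round(3))             # precision grows with N
```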

B.5 Lower Bound
With the optimal variational distribution $q^*$, we now compute the lower bound $\mathcal{L}^{(K^{(r)},S^{(r)})}(q)$, which is used for model selection. To keep the notation uncluttered, we omit the superscript $*$ on the $q$ distributions and the superscript $(r)$ on the parameters, along with the subscripts on the expectation operators, because each expectation is taken with respect to all of the random variables in its argument. We then evaluate each term. For example, for the term $\mathbb{E}[\log p(Z|\Omega)]$,

$$\mathbb{E}\big[\log p(Z|\Omega)\big] = \sum_{i}\sum_{j} \hat{z}_{i,j}\big(\psi(\alpha_j) - \psi(\bar{\alpha})\big),$$

where $\bar{\alpha} = \sum_j \alpha_j$ and $\psi(x)$ is the digamma function.
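This term is straightforward to compute from the responsibilities and the Dirichlet posterior parameters; a minimal Python sketch:

```python
import numpy as np
from scipy.special import digamma

def e_log_p_z_given_omega(z_hat, alpha):
    """E[log p(Z|Omega)] under a Dirichlet posterior q(Omega) = Dir(alpha):
    sum_{i,j} z_hat[i,j] * (psi(alpha_j) - psi(sum_j alpha_j))."""
    e_log_omega = digamma(alpha) - digamma(alpha.sum())
    return float(np.sum(z_hat * e_log_omega[None, :]))

# toy usage: 4 subjects, 2 groups
z_hat = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.1, 0.9]])
print(e_log_p_z_given_omega(z_hat, alpha=np.array([3.0, 2.5])))
```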

B.6 Lower bound derivatives
To obtain the hyperparameters of the prior distributions, we maximize the lower bound w.r.t. the hyperparameters. This is performed using gradient ascent, which requires derivatives of L w.r.t. each hyperparameter.
• Diagonal $W_0$: We assume $W_0 = \mathrm{diag}(w_0)$, where $w_0 > 0$ is a parameter vector. The derivative of $\mathcal{L}$ w.r.t. $w_0$ then follows, where all operations are element-wise products and divisions.
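As an illustration of such hyperparameter updates, the sketch below runs generic gradient ascent in log-space so that the positivity constraint $w_0 > 0$ is maintained element-wise; `grad_L` stands in for the actual derivative of the lower bound, and the toy gradient used here is purely hypothetical.

```python
import numpy as np

# Generic element-wise gradient ascent for a positive hyperparameter
# vector w0 (a sketch; grad_L is a stand-in for the real derivative).
def maximize_hyperparam(grad_L, w0_init, lr=1e-2, n_steps=500):
    log_w = np.log(w0_init)                   # optimize in log-space: w > 0
    for _ in range(n_steps):
        w = np.exp(log_w)
        log_w += lr * grad_L(w) * w           # chain rule: dL/d(log w) = dL/dw * w
    return np.exp(log_w)

# toy objective with maximum at w0 = [1, 2] (purely illustrative)
target = np.array([1.0, 2.0])
print(maximize_hyperparam(lambda w: -(w - target), np.ones(2)))
```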