An interpretable measure of semantic similarity for predicting eye movements in reading

Predictions about upcoming content play an important role in language comprehension and processing. Semantic similarity has been used as a metric to predict how words are processed in context in language comprehension and processing tasks. This study proposes a novel, dynamic approach for computing contextual semantic similarity, evaluates the extent to which semantic similarity measures computed with this approach can predict fixation durations recorded in a corpus of eye-tracking data from reading tasks, and compares the performance of these measures with that of semantic similarity measures computed using the cosine and Euclidean methods. Our results reveal that the semantic similarity measures generated by our approach significantly predict fixation durations in reading and outperform those generated by the two existing approaches. These findings contribute to a better understanding of how humans process words in context and make predictions during language comprehension and processing. The effective and interpretable approach to computing contextual semantic similarity proposed here can also facilitate further exploration of other experimental data on language comprehension and processing.

Supplementary Information: The online version contains supplementary material available at 10.3758/s13423-022-02240-8.


Supplementary Materials
We explored dataset 2 using both content and function words. In this dataset, the cosine semantic similarity adopted from Frank and Willems (2017) is excluded. We employed five measures: dynamic semantic similarity, simpler dynamic similarity, new cosine similarity, Euclidean semantic similarity, and simpler Euclidean similarity. The statistical methods are the same as those used in the main text.
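To give a concrete sense of the two baseline families of measures, the sketch below computes cosine similarity and a distance-based Euclidean similarity between a target word vector and a context vector with NumPy. The toy vectors, the inverse-distance transform, and the function names are illustrative assumptions, not the exact formulations used in the paper.

```python
import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two word vectors; ranges over [-1, 1].
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def euclidean_similarity(u, v):
    # One common way to turn Euclidean distance into a similarity in (0, 1];
    # an assumption here, not necessarily the paper's exact transform.
    return float(1.0 / (1.0 + np.linalg.norm(u - v)))

# Toy 4-dimensional embeddings for a target word and its context.
target = np.array([0.2, 0.7, 0.1, 0.4])
context = np.array([0.3, 0.6, 0.2, 0.5])
print(cosine_similarity(target, context))
print(euclidean_similarity(target, context))
```

Both measures are larger for words that lie close to their context in embedding space, which is why higher values are expected to go with shorter fixation durations.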

A The predictability of dynamic semantic similarity from dataset 2
We fitted 10 GAMM models to analyze the five types of semantic similarity as predictors of two dependent variables (first fixation duration and total fixation duration). The main predictor of interest is modeled as a tensor product smooth. The models also include word length and word frequency as control predictors, modeled as a tensor interaction, and participant as a random effect. Note that dataset 2 (39,210 tokens) is much larger than dataset 1 (6,548 tokens).
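A tensor product smooth builds a two-dimensional surface over two predictors by multiplying together one-dimensional bases for each. A minimal sketch of that construction, using a simple polynomial basis as a stand-in for the spline bases an R package such as mgcv would use (the predictor values and function names are illustrative assumptions):

```python
import numpy as np

def poly_basis(x, degree):
    # Simple polynomial marginal basis (a stand-in for a spline basis);
    # columns are x^0 .. x^degree.
    x = np.asarray(x, dtype=float)
    return np.vander(x, degree + 1, increasing=True)

def tensor_product_basis(b1, b2):
    # Row-wise Kronecker product: every column of b1 times every column of b2.
    # This is the core idea behind a 2-D smooth of (length, frequency).
    n = b1.shape[0]
    return np.einsum('ni,nj->nij', b1, b2).reshape(n, -1)

# Hypothetical word lengths and log frequencies for four tokens.
length = np.array([3.0, 5.0, 7.0, 9.0])
freq = np.array([2.1, 4.8, 1.3, 3.5])
B = tensor_product_basis(poly_basis(length, 2), poly_basis(freq, 2))
print(B.shape)  # (4, 9): 3 x 3 = 9 tensor-product basis functions
```

The fitted smooth is then a penalized linear combination of these columns, which is what allows length and frequency to interact flexibly rather than entering as separate additive terms.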
Table 3 summarizes the 10 GAMM models fitted to the eye-tracking data. The semantic similarity data are largely normally distributed. With the exception of Euclidean semantic similarity and simpler Euclidean similarity, which only significantly predict first fixation duration, the semantic similarity measures all significantly predict both first and total fixation duration. The tensor interaction of word length and word frequency also significantly predicts fixation duration, indicating that both variables contribute to predicting eye movements in reading. The random effect of participant is significant in all GAMM models as well.
Figure 7 presents the fixed effects of the semantic similarity variables on first fixation and total fixation duration in the GAMM models. As is the case for dataset 1, all significant fixed effects are negative; that is, both first fixation duration and total fixation duration tend to decrease as semantic similarity increases. These results suggest that words that are less likely to occur in a given context have smaller contextual semantic similarity and require more time for language users to process, whereas words that are more likely to occur in the same context have larger semantic similarity and require less time to process. When content and function words are both examined, the fixed effects of the two dynamic semantic similarity measures appear stronger than those for dataset 1 (see Figure 4).
Using compareML and BIC/AIC, and referring to the fixed effect of each semantic similarity metric, we obtained the following overall ranking of the GAMM models: dynamic semantic similarity > simpler dynamic similarity > new cosine similarity > simpler Euclidean similarity > Euclidean semantic similarity.

B Comparison of the five measures in qgam models
In this section, we compare the performance of the five semantic similarity measures in qgam models on dataset 2. Two models were built for each semantic similarity measure, one with first fixation duration and one with total fixation duration as the response variable. Each model includes the main effect of semantic similarity, the tensor interaction of word length and word frequency, and the random effect of participant. The effects of each semantic similarity measure in predicting the two response variables across deciles are visualized in Figures 8 and 9.
We first examined the p-values and the trends of the fixed effects for the independent variables across deciles. In Figures 8 and 9, each panel in a row shows how a given semantic similarity metric behaves at one of the five deciles. Models whose effects alternate between significant and insignificant across the five deciles are less stable than those that consistently show significant effects. We set the alpha level at .01 for this comparison. The results show that the effects of the two dynamic semantic similarity measures and the two cosine similarity measures remain stable across deciles. In contrast, the simpler Euclidean similarity measure has an insignificant effect at one decile, while the Euclidean similarity measure has an insignificant effect at nine deciles. The effect curves for the metrics look similar, with some fluctuation at the beginning followed by a gradually steepening slope, indicating strong effects. However, the curves for the Euclidean semantic similarity measure tend to fluctuate around zero, suggesting that the model does not converge well.
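Quantile GAMs such as those fitted with qgam target a specific quantile of the response by minimizing an asymmetric "pinball" (check) loss rather than squared error, which is what makes decile-by-decile effects possible. A minimal sketch of that loss, with made-up fixation durations for illustration:

```python
import numpy as np

def pinball_loss(residuals, tau):
    # Asymmetric "check" loss minimized by quantile regression at quantile tau:
    # under-predictions (r >= 0) are weighted by tau, over-predictions by (1 - tau).
    r = np.asarray(residuals, dtype=float)
    return float(np.mean(np.where(r >= 0, tau * r, (tau - 1) * r)))

# The constant that minimizes the pinball loss over a sample is its tau-quantile.
durations = np.array([180., 200., 210., 230., 250., 300., 320., 360., 400., 500.])
for tau in (0.1, 0.5, 0.9):  # three example deciles
    grid = np.linspace(durations.min(), durations.max(), 2001)
    losses = [pinball_loss(durations - c, tau) for c in grid]
    best = grid[int(np.argmin(losses))]
    print(tau, best)
```

Fitting the same smooth model under this loss at each decile yields one effect estimate per decile, which is what the panels in Figures 8 and 9 display.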
Next, we employed both compareML and BIC/AIC to make further model comparisons, in each case comparing models at the same decile. Although the performance of each model may vary across deciles and/or between the two response variables, the overall performance of these models is very consistent. Specifically, with compareML, for both first and total fixation duration the models rank as follows: dynamic semantic similarity > simpler dynamic similarity > new cosine similarity > simpler Euclidean similarity > Euclidean semantic similarity. We also calculated the BIC and AIC values for these models at the same decile. With the criterion that smaller BIC or AIC values indicate better performance, the models rank consistently across all deciles and both response variables as follows: dynamic semantic similarity > simpler dynamic similarity > simpler Euclidean similarity > new cosine similarity > Euclidean semantic similarity.
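The BIC/AIC comparison uses the standard definitions, where smaller values indicate a better trade-off between fit and model complexity. A small sketch of the criteria (the log-likelihoods and parameter counts below are hypothetical, not values from the paper):

```python
import math

def aic(log_lik, k):
    # Akaike information criterion: 2k - 2 ln L; smaller is better.
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    # Bayesian information criterion: k ln n - 2 ln L; penalizes extra
    # parameters more heavily than AIC once n exceeds about 7 observations.
    return k * math.log(n) - 2 * log_lik

# Hypothetical log-likelihoods for two competing similarity models fitted
# to the same n = 39,210 tokens (illustrative values only).
n = 39210
aic_dyn = aic(-151200.0, 25)
aic_euc = aic(-151450.0, 25)
print(aic_dyn < aic_euc)  # the better-fitting model has the smaller AIC
```

With equal parameter counts, as in this toy comparison, the ranking reduces to comparing log-likelihoods; the penalty terms matter when the candidate models differ in complexity.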
Overall, the performance of the semantic similarity metrics in the qgam models is largely consistent with the results of model comparisons using compareML and BIC/AIC, pointing to the same overall ranking of the performance of the GAMM models on dataset 2: dynamic semantic similarity > simpler dynamic similarity > new cosine similarity > simpler Euclidean similarity > Euclidean semantic similarity.

Figure 9: Fixed effects for semantic similarity in predicting total fixation duration with qgam models. Note: simpler dyn_sim = simpler dynamic similarity; cosine similarity = new cosine similarity; simpler Eucl_sim = simpler Euclidean similarity.