Are emojis processed like words? As with words, a core function of emojis is to convey semantic information (Lo, 2008; Riordan, 2017b). Although emojis serve many different functions to support reading, it is unknown whether the same cognitive and perceptual processes that support the identification and integration of words during reading also extend to emojis. Given the growing popularity of emojis on the internet and social media (Novak, Smailović, Sluban, & Mozetič, 2015) and in text-based communication (Tigwell & Flatla, 2016; e.g., more than 3,000 emojis are available to date; Unicode 12.1; https://unicode.org/emoji/charts/full-emoji-list.html), we aimed to determine whether the time course of semantic processing for emojis is similar to that of words. The time course of semantic processing provides empirical constraints for the development of models of eye-movement control during reading (for reviews, see Rayner, 1998, 2009). Considering the ubiquity of emojis, their inclusion in these models will provide a more comprehensive understanding of reading under ecologically valid conditions. Toward this goal, we used eye tracking to examine the time course of semantic processing of emojis that accompanied text during natural sentence reading. Accordingly, we will briefly summarize prior work on the semantic processing of both words and emojis, and we will then introduce the present study’s objectives and predictions.

To delineate the time course of semantic processing, a large body of work has manipulated the extent to which target words are semantically congruent versus incongruent with prior sentence contexts (i.e., semantic congruency). Semantic processing can have both an early and a long-lasting effect on target word processing, as indexed by effects on both early and late eye-tracking measures (for reviews, see Rayner, 1998, 2009), and by event-related potential (ERP) effects on the N1 component (e.g., Penolazzi, Hauk, & Pulvermüller, 2007; Sereno, Brewer, & O’Donnell, 2003), and the later N400 component (e.g., Dimigen, Kliegl, & Sommer, 2012; Kutas & Hillyard, 1980).

Evidence that the semantic processing of words can begin in the parafovea (i.e., before a word is directly fixated) derives from studies that used the gaze-contingent boundary paradigm (Rayner, 1975) to replace preview words (i.e., upcoming words in the parafovea) with a target word during a saccade towards the preview word. This paradigm revealed that reading is facilitated when the preview is semantically congruent with the target (i.e., semantic preview benefits; Hohenstein & Kliegl, 2014; Rayner & Schotter, 2014; Schotter, 2013; Schotter, Lee, Reiderman, & Rayner, 2015), and, more specifically, when the preview is plausible given the sentence context (i.e., plausibility preview benefits; Schotter & Jia, 2016; Veldre & Andrews, 2016, 2017). Intriguingly, semantic preview benefits have typically been more reliable for text in Chinese than for text in alphabetic languages (e.g., Yan, Richter, Shu, & Kliegl, 2009; Yan, Zhou, Shu, & Kliegl, 2012; Yang, Wang, Tong, & Rayner, 2012). As discussed by Schotter, Angele, and Rayner (2012), one possible explanation for this cross-language difference is that the written form of Chinese words is spatially more compact than that of words in alphabetic languages. This compact layout could facilitate parafoveal processing by positioning more of the form of upcoming words closer to the fovea. Like Chinese characters, emojis also convey rich semantic information in a compact spatial layout. Thus, emojis seem ideal for exploring the perceptual and cognitive limits of the effect of semantic processing on eye-movement control during reading.

Although the time course of semantic processing for words has been extensively investigated, much less is known about the processing of emojis. Prior investigations of emoji processing have used a substitution approach whereby emojis substituted for specific words in a sentence (e.g., the pizza and smiley face emojis could replace the words pizza and happy, respectively). Using this substitution approach with the self-paced reading paradigm (Cohn, Roijackers, Schaap, & Engelen, 2018) and the rapid serial visual presentation (RSVP) paradigm (Tang, Chen, Zhao, & Zhao, 2020; Weissman, 2019), studies have demonstrated that emojis that are semantically congruent with the word they replace are read faster (Cohn et al., 2018) and elicit reduced N400 amplitudes (Tang et al., 2020; Weissman, 2019), compared with emojis that are semantically incongruent with the word they replace. Akin to the N400 effect with words (e.g., Dimigen et al., 2012; Kutas & Hillyard, 1980), the reduction in N400 amplitudes for congruent compared with incongruent emojis suggests that emojis are integrated with text in a manner similar to words (Weissman, 2019).

While these prior findings offer a glimpse into the online processing of emojis, details about their time course remain unclear. In particular, it is unclear how early the semantic processing of emojis begins during natural reading, as well as the extent to which emoji processing can begin in the parafovea. To address these questions, we tested whether semantic congruency can rapidly modulate the eye-movement record for emojis, as was previously shown for words. In doing so, we build on a large prior literature that has used eye tracking to provide fine-grained time-course information about word processing during reading (for reviews, see Rayner, 1998, 2009). Extending this literature, we aimed to use emojis to assess the generalizability of the assumptions of contemporary models of eye-movement control, such as E-Z Reader (Reichle, Pollatsek, Fisher, & Rayner, 1998; Reichle, Pollatsek, & Rayner, 2012) and SWIFT (Engbert, Longtin, & Kliegl, 2002; Engbert, Nuthmann, Richter, & Kliegl, 2005). Of particular relevance to our study, one of the longstanding controversies among models of eye-movement control during reading concerns the extent to which lexical and linguistic variables, such as semantic congruency, can rapidly modulate fixation durations from moment to moment during reading (i.e., the direct cognitive control hypothesis). As reviewed elsewhere (e.g., Reingold, Sheridan, & Reichle, 2015), some models adopt the assumption of direct cognitive control of eye movements by high-level variables, whereas other models assume that the online control of fixation durations is instead primarily driven by lower-level visual and/or oculomotor variables. To the extent that semantic congruency rapidly impacts the eye-movement record for emojis, our study would provide further evidence for the direct cognitive control hypothesis in a new type of naturalistic reading setting (i.e., emojified text). It is unknown to what extent the assumptions of current models of eye-movement control will extend to emojis, given that emojis differ from words in alphabetic languages due to their compact visual format, and because emojified text entails the integration of verbal information (i.e., text) with nonverbal information (i.e., emojis). Thus, by examining the extent to which the time course of emoji processing is “word-like,” we aimed to explore the generalizability of the assumptions of current models of eye-movement control during reading.

We also extend prior emoji research that has largely focused on emoji–word substitutions (Cohn et al., 2018; Gustafsson, 2017; Tang et al., 2020; Weissman, 2019) by instead presenting the text and the emojis in our study at the same time (e.g., the apple emoji accompanied the word apple in the sentence). Although emojis commonly replace a word in text, they are also commonly appended to the end of a message (Cramer, Juan, & Tetreault, 2016). Concurrent presentation of text and emoji allowed us to study how the verbal information conveyed by text interacts with the nonverbal information conveyed by the emojis. Finally, as an additional methodological refinement, our design allowed us to contrast the same sentences (i.e., they served as their own controls) with and without emojis, as well as in the presence of semantically congruent and incongruent emojis.

To summarize, we used eye tracking to document the time course of semantic processing of emojis during natural sentence reading. We presented sentences containing target words (e.g., coffee in the sentence “My tall coffee is just the right temperature”) in three different conditions: (1) the synonymous condition, in which the sentence ended with an emoji that has the same meaning as the target word (e.g., the coffee cup emoji); (2) the incongruent condition, in which the sentence ended with an emoji that did not depict the target word (e.g., the beer mug emoji); and (3) the no-emoji condition, in which the sentence did not contain an emoji. We selected emojis that could replace target words and maintain sentence meaning (i.e., “emoji synonyms”) in the synonymous condition so as to create a condition of strong semantic congruency, thereby building on prior work that used synonyms to probe for semantic preview benefits during reading (Schotter, 2013).

Whereas prior work on emoji processing has largely relied on self-reports (e.g., Kelly & Watts, 2015; Riordan, 2017a, 2017b), self-paced reading paradigms (e.g., Cohn et al., 2018), modified RSVP paradigms (e.g., Weissman, 2019), and overall reading times (e.g., Gustafsson, 2017), we obtained fine-grained time-course information by analyzing both early and late eye-movement measures of emoji processing. We predicted reduced processing for congruent (i.e., synonymous) relative to incongruent emojis. Under the premise that emojis and words are processed similarly (Weissman, 2019), and given the breadth of prior work documenting both early and late word congruency effects (Dimigen et al., 2012; Kutas & Hillyard, 1980; Penolazzi et al., 2007; Rayner, 1998, 2009; Sereno et al., 2003), we expected the effect of emoji congruency to emerge in early measures (i.e., first-pass fixation measures and skipping rates) and to persist into later measures (i.e., total time). Likewise, we expected reduced processing of target words that were accompanied by congruent emojis, particularly for later measures (e.g., total time on the target word) that include regressions back to the target word. More specifically, we expected this congruency effect on the target word to be particularly salient for late measures because conditions of congruency were established by the emoji, which always occurred after the target word in the sentence. Finally, based on previous studies with emojified text (Cohn et al., 2018; Gustafsson, 2017), we anticipated longer overall sentence reading times for sentences with emojis relative to sentences without emojis.

Method

Participants

Sixty native-English-speaking undergraduate students (39 females, 21 males, Mage = 19 years) with normal or corrected-to-normal vision were recruited from the SUNY Albany SONA system. All participants reported owning a smartphone, with 55 having Apple and five having Android mobile operating systems (MOS). On average, participants received their first smartphone at age 12.9 years and reported 5.3 hours of daily usage. All participants reported using emojis. Because emojis are rendered differently across platforms and our stimuli were rendered in Apple iOS, we removed the five participants who reported having Android MOS. Our final sample consisted of 55 participants; the sample size was determined a priori based on previous studies in the reading and eye-movements literature (for reviews, see Rayner, 1998, 2009). We selected a large sample because it was difficult to predict the effect sizes that we would observe given the novelty of our approach.

Materials and design

There were 89 experimental sentences (see Table 5 in Appendix A) that ranged in length from five to 13 words (M = 8.99 words, excluding the emoji). Each sentence contained a target word that ranged in length from four to 13 letters (M = 6.4), with a mean word frequency of 25.77 words per million (SUBTLEX-US; Brysbaert & New, 2009). Each sentence appeared on a single line, and the target word could appear in any location in the sentence, with the exception of the first or last word.

The 89 experimental sentences were displayed in three conditions: (1) the synonymous condition contained an emoji that was semantically congruent with the target word (e.g., Homemade cookies are delicious, followed by the cookie emoji); (2) the incongruent condition contained an emoji that was semantically incongruent with the target word (e.g., the same sentence followed by an unrelated emoji); and (3) the no-emoji condition did not contain an emoji (e.g., Homemade cookies are delicious). The 178 emojis in our study (see Table 5 in Appendix A) were from Apple iOS 12.2. In the synonymous and incongruent conditions, the emoji was always located immediately after the last word in the sentence, with a minimum of two and a maximum of eight words between the emoji and the target word (M = 4.8 intervening words). We counterbalanced the assignment of sentences to conditions such that, across participants, each sentence occurred equally often in each of the three conditions (synonymous, incongruent, no emoji), but each participant saw only one version of each sentence.
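
For illustration, a minimal sketch of such a Latin-square counterbalancing scheme (with hypothetical variable names; not the study’s actual code) is:

```r
# Illustrative Latin-square assignment: each participant sees one
# version of each sentence, and across every three participants each
# sentence appears once in each of the three conditions.
conditions <- c("synonymous", "incongruent", "no_emoji")
assign_conditions <- function(participant_id, n_sentences = 89) {
  idx <- (seq_len(n_sentences) + participant_id) %% 3 + 1
  data.frame(sentence = seq_len(n_sentences),
             condition = conditions[idx])
}
head(assign_conditions(participant_id = 1))
```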

To confirm that the emojis in the synonymous and incongruent conditions were equated for processing difficulty, an additional 30 participants read sentences that contained a general target word that fit semantically with either emoji (see Table 5 in Appendix A). With these general target words, emoji processing did not differ between conditions on any of the measures (see Table 7 in Appendix A).

Apparatus and procedure

Eye movements were measured with an SR Research EyeLink 1000 Plus system with a sampling rate of 1000 Hz. Viewing was binocular, but only the right eye was monitored. A chin rest and a forehead rest were used to minimize head movements. Following calibration, average gaze-position error was less than 0.5°. The sentences were presented on a 24-inch Asus VG248QE monitor, with a refresh rate of 144 Hz and a screen resolution of 1,920 × 1,080 pixels. All letters were lowercase (except where capitals were appropriate) and were shown in single-spaced, 22-point black Courier New font on a white background. The emoji was presented in 28-point font. Participants were seated 92 cm from the monitor. Approximately 1 degree of visual angle was equivalent to 3.21 letter characters and 1.69 emoji characters (see Fig. 1 for an example trial).
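
These values can be sanity-checked with a back-of-the-envelope calculation under assumed monitor geometry and font metrics (the exact rendered character width depends on the operating system):

```r
# Approximate check of the reported characters-per-degree scaling,
# assuming a 24-inch 16:9 panel and Courier New's 0.6-em advance width.
screen_width_cm <- 24 * 2.54 * 16 / sqrt(16^2 + 9^2)  # ~53.1 cm
cm_per_deg      <- 2 * 92 * tan(pi / 360)             # ~1.61 cm per degree at 92 cm
char_width_cm   <- 0.6 * (22 / 72) * 2.54             # ~0.47 cm per 22-point character
cm_per_deg / char_width_cm                            # ~3.4 characters per degree
```

The result (~3.4 characters per degree) is close to the reported 3.21 characters per degree; the small discrepancy plausibly reflects platform-specific font rendering.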

Fig. 1. Example of the three sentence types, in which the sentence contained (a) a congruent emoji, (b) an incongruent emoji, or (c) no emoji. Approximately 1 degree of visual angle was equivalent to 3.21 letter characters and 1.69 emoji characters.

Participants were instructed to read for comprehension. They initiated each trial by pressing a button while fixating on a cross on the left side of the screen, and they ended each trial by pressing a button. To ensure that participants read for comprehension, 30 additional filler sentences (approximately 25% of trials) were followed by a comprehension question. To ensure that participants processed both the emoji and the sentence, we included two types of comprehension questions, which queried either the sentence content or the emoji specifically (see Table 6 in Appendix A). Average comprehension accuracy was 97%. The order of sentences was randomized, with a new random order for each participant.

Results

We analyzed the data using the lme4 package (Version 1.1-12; Bates, Mächler, Bolker, & Walker, 2015) within the RStudio environment to run linear and generalized linear mixed-effects models (LMMs). For each model, emoji congruency was entered as a fixed effect, and subjects and items were treated as random effects.

To identify the optimal model, we first fitted a full model using the maximal random-effects structure (Barr, Levy, Scheepers, & Tily, 2013). If a model failed to converge, it was systematically trimmed until it converged: we first removed the correlations between random effects, and we then removed the random effects associated with the smallest variance. Once models converged, we used restricted maximum likelihood (REML) estimates for model comparisons involving correlation parameters or variance components (see Table 8 in Appendix A for a summary of the random-effects structures). Patterns of significance were the same for the log-transformed data as for the untransformed data, so we report the analysis of the untransformed data to maintain transparency of the effect sizes.
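
A minimal sketch of this trimming sequence (hypothetical data-frame and variable names; not the study’s actual code) is shown below, with congruency coded numerically so that lme4’s double-bar syntax removes the random-effect correlations as intended:

```r
library(lme4)

# Maximal model: by-subject and by-item random intercepts and slopes,
# with congruency coded numerically (e.g., -0.5/0.5).
m_full <- lmer(gaze ~ congruency +
                 (1 + congruency | subject) + (1 + congruency | item),
               data = emoji_data)

# If the maximal model fails to converge, first remove the correlations
# between random intercepts and slopes...
m_no_corr <- lmer(gaze ~ congruency +
                    (1 + congruency || subject) + (1 + congruency || item),
                  data = emoji_data)

# ...and then remove the random effect with the smallest variance (the
# by-item slope is dropped here purely for illustration).
m_trimmed <- lmer(gaze ~ congruency +
                    (1 + congruency || subject) + (1 | item),
                  data = emoji_data)
```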

Linear mixed-effects regressions were used for the fixation duration measures and for sentence reading time; we report the regression coefficients, which represent the effect size (in milliseconds) of the reported comparison, as well as the associated t values. Logistic mixed-effects regressions were used for the binary variables (probability of single fixations and probability of skipping); we report the regression coefficients, which represent the effect size in log-odds space, as well as the z and p values of the effect coefficients. Absolute values of the t and z statistics greater than or equal to 1.96 indicate a significant effect at approximately the .05 alpha level.
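
For the binary measures, a logistic counterpart of the models above might look as follows (again a sketch with hypothetical names):

```r
library(lme4)

# Logistic mixed-effects model for a binary measure; skipped is coded
# 0/1 per trial, and the fixed-effect coefficient is in log-odds space.
m_skip <- glmer(skipped ~ congruency + (1 | subject) + (1 | item),
                data = emoji_data, family = binomial)
summary(m_skip)  # |z| >= 1.96 corresponds to p < .05 (two-tailed)
```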

Overview of eye-tracking measures

To examine the processing of the emojis and the target words, we analyzed the following eye-tracking measures: (1) first-fixation duration (i.e., the duration of the very first fixation on a target); (2) single-fixation duration (i.e., the first-fixation duration for the subset of trials in which there was only one first-pass fixation on the target); (3) gaze duration (i.e., the sum of all consecutive first-pass fixations on the target, prior to a saccade to another word); (4) total time (i.e., the sum of all the fixations on the target, including regressions back to the target); (5) probability of skipping (i.e., the proportion of trials in which there was no first-pass fixation on the target, regardless of whether or not the target was fixated later in the trial); and (6) probability of a single fixation (i.e., the proportion of trials in which there was only a single first-pass fixation on the target). Fixations that were shorter than 80 ms or more extreme than three standard deviations from each subject’s mean on the emoji (1.7% of trials) or the target word (2.6% of trials) were excluded from the fixation-based analyses (i.e., first fixation, single fixation, gaze duration, total time, probability of a single fixation).
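
A sketch of this exclusion rule (hypothetical data frame and column names) is:

```r
library(dplyr)

# Exclude fixations under 80 ms or more than 3 SDs from each subject's
# mean, computed separately for the emoji and target-word regions.
clean <- fixations %>%
  group_by(subject, region) %>%
  filter(duration >= 80,
         abs(duration - mean(duration)) <= 3 * sd(duration)) %>%
  ungroup()
```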

Emoji processing

To investigate the time course of emoji processing, LMMs were built with one planned contrast to test the difference between synonymous and incongruent emojis. This contrast was achieved by setting the synonymous condition as the baseline (intercept) in the model and using the default treatment contrast for the comparison of the synonymous condition with the incongruent condition. For the fixation-based measures (i.e., first fixation, single fixation, gaze duration, total time, probability of a single fixation), trials were excluded from the analysis if readers skipped the emoji (48.5%) or if there was a blink before or after the emoji (10.7%). Means are depicted in Fig. 2.
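
In R, this baseline choice corresponds to releveling the congruency factor (a sketch with hypothetical names):

```r
library(lme4)

# Relevel so the synonymous condition is the intercept; R's default
# treatment contrast then estimates the synonymous-vs-incongruent
# difference directly.
emoji_data$congruency <- relevel(factor(emoji_data$congruency),
                                 ref = "synonymous")
m_ff <- lmer(first_fixation ~ congruency + (1 | subject) + (1 | item),
             data = emoji_data)
```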

Fig. 2. Means (aggregated by subject) for the eye-movement measures and sentence reading time. Specifically, we analyzed eye movements (a: first fixation; b: single fixation; c: gaze duration; d: total time; e: probability of single fixations; f: probability of skipping) for the emoji as a function of emoji congruency (synonymous vs. incongruent), as well as for the target word as a function of emoji condition (synonymous vs. incongruent vs. no emoji). In addition, we analyzed sentence reading time (g) as a function of emoji condition (synonymous vs. incongruent vs. no emoji). The error bars depict the standard error of the mean.

As shown in Table 1, synonymous emojis exhibited faster processing relative to incongruent emojis, as revealed by shorter first-fixation durations (b = 25.69, t = 2.27), single-fixation durations (b = 36.52, t = 2.96), gaze durations (b = 59.01, t = 3.57), and total times on the emoji (b = 103.77, t = 4.83). There was also an increase in the proportion of single fixations for synonymous emojis compared with incongruent emojis, as shown in Table 2 (z = −2.53, p < .05). As evidence that emoji processing may begin in the parafovea, skipping was higher for synonymous than for incongruent emojis (z = −4.74, p < .001; see Table 2).

Table 1 Results of the linear mixed-effects models for the emoji for the fixation time measures (first fixation, single fixation, gaze duration, and total time) as a function of emoji congruency (synonymous vs. incongruent)
Table 2 Results of the linear mixed-effects models for the emoji for the fixation probability measures (probability of single fixations, probability of skipping) as a function of emoji congruency (synonymous vs. incongruent)

Target word and sentence processing

To examine how emojis (i.e., nonverbal elements) interact with text (i.e., verbal elements) during natural reading, target-word LMMs were built with two planned (Helmert) contrasts such that the intercept corresponded to the grand mean: The first contrast tested for a difference in processing between target words followed by synonymous versus incongruent emojis (i.e., a congruency effect), and the second contrast tested for a difference in processing between target words followed by an emoji (i.e., the mean of the synonymous and incongruent conditions) and target words followed by no emoji. For the fixation-based measures (i.e., first fixation, single fixation, gaze duration, total time, probability of a single fixation), trials were excluded if readers skipped the target word (17.2%) or if there was a blink before or after the target word (6.8%). Means are depicted in Fig. 2.
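
A sketch of this contrast coding (hypothetical names; both contrast columns sum to zero, which makes the intercept the grand mean) is:

```r
library(lme4)

# Two planned contrasts: c1 compares synonymous vs. incongruent, and
# c2 compares the mean of the two emoji conditions against no emoji.
target_data$condition <- factor(target_data$condition,
                                levels = c("synonymous", "incongruent",
                                           "no_emoji"))
contrasts(target_data$condition) <- cbind(c1 = c(1, -1, 0),
                                          c2 = c(0.5, 0.5, -1))
m_total <- lmer(total_time ~ condition + (1 | subject) + (1 | item),
                data = target_data)
```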

With the exception of the probability of skipping, the early eye-tracking measures of target-word processing (i.e., first fixation, single fixation, gaze duration, probability of single fixations; see Tables 3 and 4) did not show significant effects. Nevertheless, the two later measures of processing (i.e., total time on the target word and sentence reading time) revealed significant effects (see Table 3). Specifically, the presence of an emoji (compared with its absence) led to greater skipping (z = −2.13, p < .05) and shorter total time on the target word (b = 13.11, t = 4.86), but longer overall sentence reading times (b = −78.89, t = −5.85). These effects of emoji presence replicate those of Cohn et al. (2018). Semantic processing of the emoji did not affect the probability of skipping or total time on the target word, as the synonymous and incongruent conditions did not differ (b = 7.09, t = 1.50). Nonetheless, overall sentence reading times were shorter when the emoji was synonymous than when it was incongruent (b = 115.27, t = 4.96).

Table 3 Results of the linear mixed-effects models for the target word for the fixation time measures (first fixation, single fixation, gaze duration, and total time) as well as sentence reading times as a function of emoji congruency (synonymous, incongruent, no emoji)
Table 4 Results of the linear mixed-effects models for the target word for the fixation probability measures (probability of single fixations, probability of skipping) as a function of emoji congruency (synonymous, incongruent, no emoji)

Discussion

To study the extent to which emojis are processed like words, we used eye tracking to document the time course of the semantic processing of emojis during natural sentence reading. Although a large prior literature has used eye tracking (Rayner, 1998, 2009) and ERPs (e.g., Dimigen et al., 2012; Kutas & Hillyard, 1980) to explore the time course of semantic processing for words, our study provides one of the first demonstrations that the time course of semantic congruency effects on eye movements for emojis is analogous to the effects previously shown for words. Specifically, the effect of congruency on emoji processing emerged in both early and late eye-tracking measures. Relative to synonymous emojis, incongruent emojis elicited longer fixation durations and reductions in skipping and in the probability of single fixations. By providing fine-grained time-course information, our study extends prior work that used self-paced reading paradigms and RSVP paradigms to investigate the semantic processing of emojis (Cohn et al., 2018; Weissman, 2019). Taken together, our results indicate that emojis show a timeline of semantic processing similar to that of words.

To investigate how emojis affect reading, we also contrasted the “synonymous” and “incongruent” emoji conditions with the “no-emoji” condition. Replicating Cohn et al. (2018), the presence of an emoji, relative to its absence, lengthened sentence reading times. Skipping rates on the target word increased in the presence of an emoji relative to its absence, suggesting that attention is prioritized toward the emoji. In addition, total time on the target word decreased in the presence of an emoji, which potentially indicates that emojis can facilitate the postlexical integration of preceding text. Building on these findings, future work could further explore how verbal elements (i.e., text) and nonverbal elements (i.e., emojis) interact to facilitate reading. The nonverbal information provided by emojis is analogous to gestures (e.g., Feldman, Aragon, Chen, & Kroll, 2017; McCulloch & Gawne, 2018), which are known to enhance communication (Goldin-Meadow & Alibali, 2013), but emojis could serve many functions, including enhancing reading comprehension (Walther & D’Addario, 2001), memory for text (Halvorson & Hilverman, 2018), the communication of emotion (Riordan, 2017a, 2017b), and the comprehension of irony and humor (Dresner & Herring, 2010; Weissman & Tanner, 2018). Finally, as with gestures, emojis can exhibit different semantic relationships with the words that accompany them (Na’aman, Provenza, & Montoya, 2017), such as the contrasting contributions of face and object emojis (Barach, Srinivasan, Fernandes, Feldman, & Shaikh, 2020). Building on Cohn (2016), future work could examine how these different relationships influence the interaction of verbal and nonverbal processing.

Regarding their theoretical implications, our results provide empirical constraints for models of eye movement control during reading. By showing a rapid time course of semantic congruency effects on the eye movement record for emojis, our results support the assumption that higher level lexical and linguistic factors can rapidly affect fixation durations during online reading (i.e., direct cognitive control; for a review of this debate, see Reingold et al., 2015). Also, contemporary models of eye movement control, including E-Z Reader and SWIFT, assume that reading entails a substantial amount of parafoveal processing, and there is growing evidence that semantic processing can begin in the parafovea during reading, as previously revealed by semantic preview effects (Hohenstein & Kliegl, 2014; Rayner & Schotter, 2014; Schotter, 2013; Schotter et al., 2015; Schotter & Jia, 2016; Veldre & Andrews, 2016; Yan et al., 2009; Yan et al., 2012; Yang et al., 2012). As evidence that the semantic processing of emojis also begins in the parafovea, here we show that semantic congruency modulates skipping rates for emojis. By demonstrating a time course of processing for emojis that overlaps with that of words, our results indicate that the time-course assumptions of current models of eye-movement control can potentially generalize to a novel condition that mixes a verbal element (i.e., the target word) with a spatially compact nonverbal element (i.e., an emoji). In essence, we identify a new experimental condition under which semantic similarity across verbal and nonverbal codes facilitates processing.

Our results may also clarify why semantic preview effects are easier to detect in the written form of some languages than of others. Emojis, like Chinese characters, are physically smaller than typical words written in alphabetic scripts. The Chinese script is logographic, and words are composed of characters that are not separated by spaces. Semantic preview effects are robust in Chinese, and they arise not only for single characters (Yu et al., 2016) but also for their semantic sublexical components (Yan et al., 2009). The implication is that the spatial layout of a written word can influence the extent to which semantic parafoveal preview effects arise (for a related discussion, see Schotter et al., 2012).

In summary, emoji–word congruency effects provide an index of early semantic processing, with implications for the control of saccades. One limitation of the present study is that the emoji was always located in its conventional position at the end of the sentence, with at least one intervening word between the emoji and its word counterpart. Therefore, future work would benefit from manipulating the spatial proximity of the emoji and the target word to further explore the time course of semantic processing. Most relevant is whether the parafoveal processing of emojis can affect the first-pass reading of preceding words (i.e., parafoveal-on-foveal effects; for a review, see Drieghe, 2011). Due to their spatially compact format, emojis are ideal for studying the time course of semantic processing, as well as the extent to which reading processes generalize across visual formats. Future work could use emojis to further explore a variety of reading variables, including word frequency and predictability.