Convolutional Neural Networks Trained to Identify Words Provide a Surprisingly Good Account of Visual Form Priming Effects

A wide variety of orthographic coding schemes and models of visual word identification have been developed to account for masked priming data that provide a measure of orthographic similarity between letter strings. These models tend to include hand-coded orthographic representations with single-unit coding for specific forms of knowledge (e.g., units coding for a letter in a given position). Here we assess how well a range of these coding schemes and models account for the pattern of form priming effects taken from the Form Priming Project and compare these findings to results observed with 11 standard deep neural network models (DNNs) developed in computer science. We find that deep convolutional neural networks (CNNs) perform as well as or better than the coding schemes and word recognition models, whereas transformer networks perform less well. The success of CNNs is remarkable given that their architectures were not developed to support word recognition (they were designed to perform well on object recognition), they classify pixel images of words (rather than artificial encodings of letter strings), and their training was highly simplified (not respecting many key aspects of human experience). In addition to these form priming effects, we find that the DNNs can account for visual similarity effects on priming that are beyond all current psychological models of priming. The findings add to the recent work of Hannagan et al. (2021) and suggest that CNNs should be given more attention in psychology as models of human visual word recognition.


Introduction
Skilled visual word identification requires extensive experience with written words. It entails identifying the visual characteristics of letters (e.g., oriented lines), mapping these features onto letters, coding for letter order, and eventually selecting word candidates from one's vocabulary (Carreiras et al., 2014). The process of representing the identity and order of letters in a letter string is referred to as orthographic coding, and it constitutes a crucial component of word identification, with different models of word identification adopting different orthographic coding schemes.
Much of the empirical research directed at distinguishing different orthographic coding schemes and models of word identification more generally comes from priming studies that vary the similarity of the prime and target. Various priming procedures have been used, but the most common is the masked Lexical Decision Task (LDT) as introduced by Forster et al. (1987b). The procedure involves measuring how quickly people classify a target stimulus as a word or nonword when it is briefly preceded by a prime. The basic finding is that orthographically similar prime strings speed responses to the targets relative to unrelated prime strings (e.g., Schoonbaert and Grainger 2004; Burt and Duncum 2017; Bhide et al. 2014). The assumption is that the greater the priming, the greater the orthographic similarity between the prime and the target. A variety of different models of letter coding and models of word identification more broadly have been developed in an attempt to account for more variance in masked form priming experiments.
The primary objective of the current study is to investigate to what extent artificial Deep Neural Network (DNN) models developed in computer science, with architectures designed to classify images of objects, can account for masked form priming data, and in addition, to compare their successes to some standard models of orthographic coding and word recognition. In all, we test 11 DNNs (different versions of convolutional and transformer networks), five orthographic coding schemes, and three models of visual word recognition, as well as five control conditions, as detailed below.

Orthographic coding schemes and models of word identification
Any model of word recognition needs to include a series of basic processes. These include encoding the letters and their order, a process of mapping the ordered letters onto lexical representations, and finally a manner of selecting one lexical entry from others. Different models adopt different accounts of these basic processing steps. Most relevant for present purposes, models in the psychological literature have taken three basic approaches to encoding letters and letter order, namely, slot-based coding, context-based coding, and context-independent coding.
In slot-based coding schemes, separate slots for position-specific letter codes are assumed. For example, the word CAT might be coded by activating the three letter codes C1, A2, and T3, whereas the word ACT would be coded as A1, C2, and T3 (where the subscript indexes letter position). Because letter codes are conjunctions of letter identities and letter positions, the letters A1 and A2 are simply different letters (accordingly, CAT and ACT share only one letter, namely T3).
In context-based coding schemes, letters are coded in relation to other letters in the letter string. For example, in open-bigram coding schemes (e.g., Grainger and Whitney 2004), a letter string is coded in terms of all of the ordered letter pairs that it contains: CAT is coded by the letter pairs CA, AT, and CT, whereas ACT is coded as AC, AT, and CT. Various versions of context-based coding schemes (and different versions of open bigrams) have been proposed, and these again impact how transposing letters and other manipulations affect the orthographic similarity between letter strings.
Finally, in context-independent coding schemes, letter units are coded independently of position and context. That is, a node that codes the letter A is activated whenever the input stimulus contains an A, irrespective of its serial position or surrounding context (the same C, A, and T letter units are part of CAT and ACT). The ordering of letters is computed on-line (in order to distinguish between CAT and ACT), and this is achieved in various ways. For example, in the spatial coding scheme (Davis, 2010b), the precise time at which units are activated codes for letter order. Again, various versions of context-independent coding schemes have been proposed, with consequences for the orthographic similarity of letter strings.
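To make the three families concrete, the following Python sketch (our own illustration, not an implementation of any published model) computes a simple Dice-style match value for CAT versus ACT under each scheme:

```python
# Illustrative unit inventories for the three coding-scheme families.
# The match function is a generic Dice coefficient, not the exact
# similarity metric of any particular model.

def slot_units(word):
    # Slot-based: letter identity conjoined with absolute position.
    return {(letter, pos) for pos, letter in enumerate(word)}

def open_bigram_units(word):
    # Context-based: all ordered letter pairs (open bigrams).
    return {(word[i], word[j])
            for i in range(len(word)) for j in range(i + 1, len(word))}

def context_free_units(word):
    # Context-independent: bare letter identities (order computed elsewhere).
    return set(word)

def match(units_a, units_b):
    # Dice coefficient: shared units relative to mean inventory size.
    return 2 * len(units_a & units_b) / (len(units_a) + len(units_b))

for name, scheme in [("slot", slot_units),
                     ("open-bigram", open_bigram_units),
                     ("letters", context_free_units)]:
    print(name, round(match(scheme("CAT"), scheme("ACT")), 2))
```

The three schemes thus make different predictions for transposition primes: slot coding treats CAT and ACT as sharing only one unit (T in position 3), open bigrams give partial overlap, and bare letter identities give a perfect match.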
These different orthographic coding schemes form the front end of more complete models of visual word identification that include processes that select word representations from these orthographic encodings. For example, the Interactive Activation (IA) Model (McClelland and Rumelhart, 1981), the overlap model (Gomez et al., 2008), and the Bayesian Reader (Norris, 2006) all adopt different versions of slot coding; the open bigram (Grainger et al., 2004) and Seriol (Whitney, 2001) models use different context-based encoding schemes; and the spatial coding (Davis, 2010b) and SOLAR (Davis, 1999) models use context-independent encoding schemes. Note that the predictions of masked priming in these models are the product of both the encoding schemes and the additional processes that support lexical selection. In addition to these models, the Letters in Time and Retinotopic Space (LTRS; Adelman 2011) model is agnostic to the encoding scheme and instead makes predictions based on the rate at which different features of the stimulus (which could take different forms) are extracted. We will consider how well various orthographic coding schemes, as well as models of word identification, account for masked priming results reported in the Form Priming Project (Adelman et al., 2014).

Deep Neural Networks
DNNs are a type of artificial neural network in which the input and output layers are separated by multiple (hidden) layers, forming a hierarchical structure. Two of the most common types of DNNs are convolutional neural networks (CNNs) and transformers.
CNNs are inspired by biological vision (Felleman and Van Essen 1991; Krubitzer and Kaas 1993; Sereno et al. 2015). The convolutions in CNNs refer to a set of feature detectors that repeat at different spatial locations to produce a series of feature maps (analogous to simple cells in V1, for example, where the same feature detector, such as a vertical line detector, repeats at multiple retinal locations). The convolutions are followed by a pooling operation in which corresponding features in nearby spatial locations are mapped together (analogous to complex cells that map together corresponding simple cells in nearby retinal locations), after which more feature detectors and pooling operations are applied, hierarchically, to form more and more complex feature detectors that are increasingly invariant to spatial location. Different CNNs differ in various ways, including the number of hidden layers (many models have over 100 hidden layers), but they generally include "localist" or "one-hot" representations in the output layer, such that a single output unit codes for a specific category (e.g., an object class such as banana, or in the current case, a specific word), and CNNs tend to be trained through backpropagation, as is the case with older parallel distributed processing models (Rumelhart et al., 1986).
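The convolution-plus-pooling idea can be illustrated with a toy 1-D sketch (our own simplification; real CNNs use 2-D kernels, many channels, and learned weights):

```python
# A minimal 1-D illustration of convolution + pooling: the same feature
# detector (kernel) is applied at every position of the input, and
# max-pooling then collapses nearby positions into a single response.

def convolve(signal, kernel):
    # Slide the kernel across the signal, computing a dot product at each step.
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def max_pool(feature_map, width=2):
    # Keep only the strongest response within each non-overlapping window.
    return [max(feature_map[i:i + width])
            for i in range(0, len(feature_map) - width + 1, width)]

edge_detector = [1, -1]          # a toy "feature detector"
signal = [0, 0, 1, 1, 0, 0]      # a step edge in the input
fmap = convolve(signal, edge_detector)
print(fmap)             # responses at the rising and falling edges
print(max_pool(fmap))   # pooled responses, coarser in position
```

The pooled output responds to the edge regardless of small shifts in its position, which is the mechanism behind the increasing spatial invariance described above.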
In contrast, Vision Transformers (ViTs; Dosovitskiy et al. 2020) do not include any convolutions but instead introduce a self-attention mechanism. In object classification, this mechanism divides the input image into patches, and a similarity score between every pair of patches is computed. These scores are used to compute a new representation of the image that emphasizes the most relevant features for the task at hand. Overall, these models have much more complicated architectures, and describing them is beyond the scope of this paper. But again, the models include multiple layers, tend to include localist output codes, and are trained with backpropagation.
Importantly for present purposes, CNNs and ViTs are not only highly successful engineering tools that support a wide range of challenging AI tasks, but they are also often claimed to provide good models of the human visual system. Indeed, they are the most successful models at predicting judgements of category typicality (Lake et al., 2015), object classification errors (Jozwik et al., 2017), and human similarity judgements for natural images (Peterson et al., 2018) on several datasets. DNNs have also been good at predicting the neural activation patterns elicited during object recognition in both human and non-human primates' ventral visual processing streams (Cichy et al., 2016; Storrs et al., 2021). A benchmark called Brain-Score has been developed to assess similarities between biological visual systems and DNNs (Schrimpf et al., 2018). The best performing models on the Brain-Score benchmarks are often described as the best models of human vision, and CNNs are currently the best performing models on this benchmark. More recently, Biscione and Bowers (2022a) have demonstrated that CNNs can acquire a variety of visual invariance phenomena found in humans, namely, translation, scale, rotation, brightness, contrast, and to some extent, viewpoint invariance.
More relevant for the present context, Hannagan et al. (2021) demonstrated that training a biologically inspired, recurrent CNN (CORnet-Z; Kubilius et al. 2018) to recognise words led the model to reproduce some key findings regarding visual-orthographic processing in the human visual word form area (VWFA) as observed with fMRI, such as case, font, and word length invariance. In addition, the model's word recognition ability was mediated by a restricted set of reading-selective units. When these units were removed to simulate a lesion, it caused a reading-specific deficit, similar to the effects produced by lesions to the VWFA.
The current work explores this topic further by determining to what extent CNNs and ViTs account for human orthographic coding as measured through masked form priming effects. Given past reports of DNN-human similarity, it might be predicted that DNNs will account for some form priming effects. What is less clear is how well DNNs will capture orthographic similarity effects in comparison to various orthographic coding schemes and models of word identification specifically designed to explain form priming effects, among other findings. Given that DNNs do not include any hand-built orthographic knowledge and are trained to classify pixel images of words, it would be impressive if these models performed comparably. This would be particularly so given that we trained the models in a highly simplified manner that ignores many important features of how humans learn to identify words, as described below.

Human Priming Data
Human priming data were sourced from the Form Priming Project (FPP; Adelman et al. 2014) and used to assess how well the various psychological and DNN models account for orthographic similarity. The FPP contains reaction times for 28 prime types across 420 six-letter word targets, gathered from over 924 participants. The prime types and priming effects are shown in Table 1. To measure the priming effect size, the mean reaction time (mRT) of each prime condition is compared to the mRT of unrelated arbitrary strings (e.g., 'pljokv' for the word 'design').
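The priming-effect measure can be sketched as follows (the RT values are fabricated for illustration; the FPP reports mRTs per condition):

```python
# Priming effect = mean RT in the unrelated-arbitrary condition minus
# mean RT in the related prime condition; larger values mean more priming.

def priming_effect(related_rts, unrelated_rts):
    mean = lambda xs: sum(xs) / len(xs)
    return mean(unrelated_rts) - mean(related_rts)

# Hypothetical per-trial RTs in milliseconds:
related = [610, 590, 600]     # targets preceded by orthographically similar primes
unrelated = [640, 650, 630]   # targets preceded by unrelated arbitrary strings
print(priming_effect(related, unrelated))  # 40.0 ms of priming
```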

Orthographic coding schemes and word recognition models
As discussed above, numerous orthographic coding schemes and word recognition models have been developed to account for orthographic priming effects in humans. Here we assess how well five coding schemes and three models of word identification account for the priming effects reported in the FPP dataset, as detailed below.

Table 1: The 28 prime types from the FPP. Each prime type denotes the transformation of a given target word (in this example the target word is DESIGN) into a string via transposing, removing, or adding letters. The numbers in the second column indicate the letters of the prime relative to the target. For example, the prime type 'final deletion' in the second row transforms the word 'DESIGN' by deleting the last letter, yielding the string 'desig'. When multiple codes (e.g., 123/456) are specified, each of these sub-conditions contains an equal number of targets. When 'd' or 'D' is specified, a random letter not found in the target is used. When 'd' is specified multiple times, the same letter is not reused. When 'D' is specified multiple times, the same letter is reused. The same transformations were applied to all targets. Adapted from Adelman et al. (2014).
In order to determine the degree of priming, we used the match value calculator created by Davis (2010a), which implements the five orthographic coding schemes. For each coding scheme, the calculator takes two strings as input and returns a match value that indicates their predicted similarity. It is assumed that the greater the orthographic similarity, the greater the priming. Note that, for any given coding scheme and priming condition, the match value is the same across target words. For example, in the final deletion condition noted in Table 1, the example target given is DESIGN, but the exact same similarity score is computed for all the targets in this condition (because the prime and target always share the first five letters, with only the final letter mismatching).
For the models of visual word identification, we assessed orthographic similarity on the basis of their predicted priming score. The IA and SCM implementations were taken from Davis (2010b), and the LTRS model was taken from a simulator developed by Adelman (2011). For each model, the mRT was computed for each related prime condition and subtracted from the mRT in the unrelated arbitrary condition to produce the priming score.

DNN models
We trained seven common convolutional networks (CNNs) and four Vision Transformer networks (ViTs). The convolutional models belong to the families of AlexNet (Krizhevsky et al., 2012), VGG (Simonyan and Zisserman, 2014), ResNet (He et al., 2016), DenseNet (Huang et al., 2016), and EfficientNet (Tan and Le, 2019), and all transformers were from the ViT family (Dosovitskiy et al., 2020). All models were pretrained on ImageNet (Deng et al., 2009) to initialise the weights. The ViTs listed in Table 2 vary in their complexity and in a number of properties, including the number of layers and the way that attention is implemented.
After pretraining on ImageNet, the final classifier layer was removed and the models were trained to classify images of 1000 different words (the same number used by Hannagan et al. 2021), with each word represented locally. All 420 six-letter words from the Form Priming Project were used, and the remaining 580 were sourced from Google's Trillion Word Corpus (Google, 2011). All words were presented in upper-case letters, and the lengths of the 580 words were evenly distributed between three, four, five, seven, and eight letters, with 116 words chosen at each length. As with the Form Priming Project's 420 words, the 580 words were chosen so as not to contain the same letter twice. All words were trained in parallel (there was no age-of-acquisition manipulation) and for the same number of trials (there was no frequency manipulation). The complete list is available through the GitHub repository of the current study.
We employed data augmentation techniques to diversify the visual representation of each word by manipulating font, size, and rotation of letters, and by translating the position of the word in the image (see Figure 1 for some examples; for details of the augmentation see Figure B.5). We generated 6,000 images for each word, resulting in a dataset containing 6,000,000 images; 5,000,000 images were used for training, while the remaining 1,000,000 were used for performance validation. The algorithm for generating the datasets is described in detail in Appendix B. For training, the Adam optimizer and the cross-entropy loss function were used. A hyperparameter search was performed for the learning rate using a random grid-search, yielding a value of 1e-5. When the average training loss stopped improving by more than a threshold of 0.0025, training was terminated. The accuracy of the models on the validation set of words is reported in Table 2.
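A minimal sketch of the image-generation step, assuming PIL/Pillow is available (the real pipeline varies font, size, and rotation; here the library's built-in bitmap font and a translation offset stand in for the full augmentation):

```python
# Render an upper-case word onto a fixed-size canvas, shifted by `offset`
# (the translation augmentation). Canvas size and font are stand-ins, not
# the exact parameters used in the study.

from PIL import Image, ImageDraw, ImageFont

def render_word(word, size=(224, 224), offset=(0, 0)):
    image = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(image)
    font = ImageFont.load_default()
    x = size[0] // 2 + offset[0]
    y = size[1] // 2 + offset[1]
    draw.text((x, y), word, fill="black", font=font)
    return image

img = render_word("DESIGN", offset=(5, -3))
print(img.size)  # (224, 224)
```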
We then generated a dataset of prime words. To generate this dataset, each of the 420 Form Priming Project words was transformed into its 28 prime types. For example, the target word 'ABDUCT' was transformed into 'baduct' for the 'initial transposition' condition, 'abdutc' for the 'final transposition' condition, and so forth. Each prime was used to generate an image using the Arial font, resulting in 11,760 images (420 target words × 28 prime types). No rotation or translation variations at the letter level were introduced, and all strings were positioned at the centre of the image using the same font size of 26, which is the average size used for training.
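Two of the 28 transformations can be sketched as follows (the function names are ours; Table 1 defines the conditions via position codes):

```python
# Illustrative prime-type transformations. Primes are rendered in lower
# case, following the masked-priming convention used in the FPP.

def initial_transposition(word):
    # Swap the first two letters: ABDUCT -> baduct.
    return (word[1] + word[0] + word[2:]).lower()

def final_transposition(word):
    # Swap the last two letters: ABDUCT -> abdutc.
    return (word[:-2] + word[-1] + word[-2]).lower()

print(initial_transposition("ABDUCT"))  # baduct
print(final_transposition("ABDUCT"))    # abdutc
```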

Measuring orthographic similarity of the various DNNs
To measure the orthographic similarity between prime-target images, we compared the unit activations at the penultimate layer using Cosine Similarity (CS) after each image was presented as an input to the network; CS ranges from -1 (opposite internal representations) to 1 (identical internal representations). We then computed the overall relation between human priming scores and model cosine similarity scores by calculating the correlation (Kendall's τ) between the mean human priming scores and the mean cosine similarities across conditions. We also consider the relation between humans and models by exploring the (mis)match in priming and cosine similarity scores in the individual conditions, as discussed below.
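The similarity measure itself is standard cosine similarity between activation vectors; a minimal sketch (with made-up activation values standing in for penultimate-layer outputs):

```python
# Cosine similarity between two activation vectors: the dot product
# normalised by the product of the vector magnitudes.

import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

prime_act = [0.2, 0.9, 0.1, 0.4]     # hypothetical penultimate-layer activations
target_act = [0.25, 0.8, 0.05, 0.5]  # for the prime and target images
print(round(cosine_similarity(prime_act, target_act), 3))
```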
We also included five baseline conditions to better understand the priming effects observed. First, the CS between the pixels of the prime and target images was computed. This served as a baseline for determining the extent to which the models contribute to orthographic similarity scores beyond the pixel values of the stimuli. In addition, we used ImageNet-pretrained models (without any training on letter strings) and ImageNet-pretrained models fine-tuned on 1000 classes of six-letter random strings, for the CNNs and ViTs, as four additional baselines. Rather than reporting τ for the individual models, we report the average τ across all the CNNs and ViTs, respectively. This was done to assess the role of training DNNs on English words in the pattern of form priming effects obtained. Using the aforementioned method, the mean cosine similarity score was calculated for each model and condition, and the average correlation coefficient was reported for each model class.

Results
Figure 2 plots Kendall's correlation, over the 28 prime types, between the human priming data and the various orthographic coding schemes (as measured by match values), priming models (as measured by predicted priming scores), DNNs (as measured by cosine similarity scores), as well as various baseline measures of similarity.
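Kendall's τ over conditions can be sketched as follows (a plain tau-a implementation; the priming and cosine values below are fabricated for illustration):

```python
# Kendall's tau (tau-a): the proportion of concordant minus discordant
# pairs among all pairs of conditions, comparing the ranking of human
# priming scores against the ranking of model cosine similarities.

from itertools import combinations

def kendall_tau(xs, ys):
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n = len(xs)
    return (concordant - discordant) / (n * (n - 1) / 2)

human_priming = [42.7, 30.1, 25.4, 12.0, 5.2]   # hypothetical priming scores (ms)
model_cosine  = [0.95, 0.88, 0.90, 0.60, 0.41]  # hypothetical mean CS per condition
print(round(kendall_tau(human_priming, model_cosine), 2))  # 0.8
```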
The most striking finding is that the CNNs did a good job of predicting the pattern of human priming scores across conditions, with correlations ranging from τ = .49 (AlexNet) to τ = .71 (ResNet101), all p-values < .01. Indeed, the CNNs performed similarly to the various orthographic coding schemes and word recognition models, and often better. This contrasts with the relatively poor performance of the transformer networks, with τ ranging from .25 to .38.
Importantly, the good performance of the CNNs was not due to the pixel value similarity between the prime-target images, as the pixel control condition (pixCS) has no significant correlation with the human priming data. It is also not simply the product of the architectures of the CNNs, as the predictions were much poorer for the CNNs that were pretrained on ImageNet but not trained on English words. Rather, it is the combination of the CNN architectures with training on English words that led to good performance. For a complete set of correlations between human priming data, DNNs' cosine similarity scores, orthographic coding similarity scores, and priming scores in psychological models, see Appendix B, Figure B.6.
A more detailed assessment of the overlap between DNNs, orthographic coding schemes, and word identification models is provided in Figures 3 and 4, which depict the distribution of responses in each condition for all models and summarise the priming results per condition. From this, it is clear that all the CNNs had particular difficulty in predicting the priming in the 'half' condition (indicated by the red arrow), in which either the first three letters or the final three letters served as primes (the CNNs substantially underestimated the priming in this condition). A similar difficulty was found in many of the psychological models as well, but the effect was not quite so striking. No other prime condition led to such a large and consistent error in any model or coding scheme.
One surprising result from the Form Priming Project was that there was little evidence that external letters were more important than internal letters. For instance, final and initial substitutions produced more priming than medial substitutions, and similar priming effects were obtained for final, medial, and initial transpositions, with slightly less priming for initial transpositions. This contrasts with the common claim that external letters are more important than medial letters for visual word identification (Estes et al., 1976), although the past evidence for this in masked priming is somewhat mixed (e.g., Perea and Lupker 2003). Interestingly, most of the CNNs and ViTs showed similar effects across the three substitution conditions, and slightly less priming in the initial transposition condition, and thus also attributed little extra importance to external letters.

[Figure 3 caption: For each subplot, the x-axis is the metric-specific similarity measure and the y-axis is the prime condition. Prime types are ordered according to the size of the priming effect in the human data (largest to smallest), as in Table 1. As illustrated in the first row, the 'identity' condition has the strongest priming effect, with the highest priming score of 42.69 ms. The priming score for a condition is the difference between its mRT and that of the 'unrelated arbitrary' condition. See Table 1 for the index of the 28 conditions.]

[Figure 4 caption: For each subplot, the x-axis is the metric-specific similarity measure and the y-axis is the prime conditions. See Table 1 for the index of the 28 conditions. The Interactive Activation and Spatial Coding Model values are made negative to obtain a positive correlation coefficient with the human priming data, as they represent estimated RTs that are negatively correlated with priming effect size.]

A key feature of all the psychological models is that they code letters in an abstract format, such that there is no variation in visual similarity between letters (different letters are simply unrelated in their visual form). This manifests itself in the fact that there is no distribution of priming scores within each of the 28 prime conditions for the orthographic coding schemes and the LTRS model. For the IA and Spatial Coding models there is variation in the priming score within each condition, but this reflects the impact of lexical access in the models (e.g., the impact of word frequency or lexical competition) rather than any influence of visual similarity.
By contrast, in the case of the CNNs and ViTs, the input is an image in pixel space, and accordingly, it is possible that visual similarity contributes to the distribution of priming scores observed in each of the priming conditions. To test for this, we obtained human visual similarity ratings between upper-case letters (Simpson et al., 2012) and assessed whether these scores correlate with the cosine similarity scores observed in the initial, middle, and final substitution conditions. In these conditions, the visual similarity of all the letters is the same other than the substituted letter, and the question is whether the similarity score computed with the model correlates with the similarity scores produced by humans. As can be seen in Table 3, there was a strong correlation for all models in the letter substitution conditions. This is an advantage of DNNs over current psychological models, given that masked priming in humans is also sensitive to the visual similarity of letter transpositions (Kinoshita et al., 2013; Forster et al., 1987a; Perea et al., 2008).

Discussion
A wide variety of orthographic coding schemes and models of visual word identification have been designed to account for masked form priming effects that provide a measure of the orthographic similarity between letter strings. Here, we assessed how well these standard approaches account for the form priming effects reported in the Form Priming Project (Adelman et al., 2014) and compared the results to two different classes of DNNs (CNNs and transformers). Strikingly, the CNNs we tested performed similarly to, and in some cases better than, the psychological models specifically designed to explain form priming effects. This is despite the fact that the CNN architectures were designed to perform well in object identification rather than word identification, despite the fact that the models were trained to classify pixel images of words rather than hand-built and artificially encoded letter strings, and despite the fact that the models were trained to classify 1000 words in a highly simplified manner.
By contrast, we found that vision transformers are less successful in accounting for form priming effects in humans, suggesting that these models identify images of words in a non-human-like way. Still, both CNNs and transformers did better than psychological models in one important respect, namely, they can account for the impact of the visual similarity of primes and targets on masked priming. By contrast, all current psychological models cannot, given that their input coding schemes treat all letters as unrelated in form. This highlights a key advantage of inputting pixel images into a model (similar to a retinal encoding) as opposed to abstract letter codes that lose relevant visual information.
In some respects the poorer performance of transformers compared to CNNs is surprising, given past findings that the pattern of errors observed in object recognition is more similar between vision transformers and humans than between CNNs and humans (Tuli et al., 2021). But at the same time, our findings are consistent with the finding that CNNs provide the best predictions of neural activation in the ventral visual stream during object recognition as measured by Brain-Score (Schrimpf et al., 2018) and other brain benchmarks. Indeed, vision transformers (similar to the ones we tested here) do much worse on Brain-Score than CNNs, with the top performing transformer model falling outside the top 100 models on the current Brain-Score leaderboard. To the extent that better performance on Brain-Score reflects a greater similarity between DNNs and humans, our finding that CNNs do a better job of accounting for masked form priming makes sense. But what specific features of CNNs lead to better performance is currently unclear.
Our findings are also consistent with recent work by Hannagan et al. (2021), who found that a CNN trained to classify images of words and objects showed a number of hallmark findings of human visual word recognition. This includes CNNs learning units that are both selective for words (analogous to neurons in the visual word form area) and invariant to letter size, font, or case. Furthermore, lesions to these units led to selective difficulties in identifying words (analogous to dyslexia following lesions to the visual word form area). Interestingly, the authors also provided evidence that the CNN learned a complex combination of position-specific letter codes as well as bigram representations. It seems that these learned representations are also able to account for a substantial amount of the form priming effects observed in humans. The observation that CNNs not only account for a range of empirical phenomena regarding human visual word identification but, in addition, perform well on various brain benchmarks for visual object identification lends some support to the "recycling hypothesis", according to which a subpart of the ventral visual pathway initially involved in face and object recognition is repurposed for letter recognition (Hannagan et al., 2021).
Despite these successes, it is important to note that a growing number of studies highlight that CNNs fail to capture most key low-level, mid-level, and high-level vision findings reported in psychology (Bowers et al., 2022). Indeed, most CNNs that perform well on Brain-Score do not even classify objects on the basis of shape, but rather classify objects on the basis of texture (Geirhos et al., 2018). And when models are trained to have a shape bias when classifying objects, they have a non-human shape bias (Malhotra et al., 2021a). In addition, when tested on stimuli designed to elicit Gestalt effects, most CNNs exhibit a limited ability to organise elements of a scene into a group or whole, and the grouping only occurs at the output layer, suggesting that these models may learn fundamentally different perceptual properties than humans (Biscione and Bowers, 2022b).
How is it possible for CNNs to perform so well on benchmarks such as Brain-Score while accounting for so few findings from psychology? Bowers et al. (2022) (see also Malhotra et al. 2021b) argued that good performance on these benchmarks may provide a misleading estimate of CNN-human similarity, with good performance reflecting two very different systems picking up on different sources of information that are correlated with each other. For instance, it is possible that texture representations in CNNs are used not only to identify objects but also to predict the neural activation of the ventral visual system that identifies objects based on shape. Indeed, Malhotra et al. (2021b) have run simulations showing that CNNs designed to recognise objects in a very different way can nevertheless support good predictions of brain activations based on confounds that are commonplace in image datasets.
In some ways, this makes the current findings all the more impressive, as the CNNs are doing reasonably well at accounting for a complex set of priming conditions that were specifically designed to contrast different hypotheses regarding orthographic coding schemes. Of course, there were some notable failures in the CNNs' accounting for form priming (most notably, all the CNNs underestimated the amount of priming in the half condition, in which the primes were composed of the first three or final three letters of the target), and there may be many additional priming conditions that prove problematic for CNNs. Nevertheless, the current results do highlight that CNNs should be given more attention in psychology as models of human visual word recognition. It is possible that developing new CNN architectures motivated by biological and psychological findings and adopting more realistic training conditions will lead to even more impressive performance and new insights into human visual word identification.

Conflict of Interest Statement
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Author Contributions
J.S.B. and V.B. were responsible for the design and supervision of the study. D.Y. specified the statistical approach and wrote the scripts for data generation and analyses. All authors contributed to the analysis and interpretation of the data and to the writing of the paper.

Data Availability
The code that generates the data supporting the findings of this study is available at: https://github.com/Don-Yin/Orthographic-DNN

Code availability
The code that supports the findings of this study is available at: https://github.com/Don-Yin/Orthographic-DNN

Ethics statement
The present study was approved by the School of Psychological Science Research Ethics Committee.

Consent to Publish
We certify that the paper contains no personal information (names, initials, or any other information which could identify an individual person) that would infringe upon that person's right to privacy.

Funding
This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 741134).

Consent to Participate
Not Applicable.

Figure 1 :
Figure 1: Image examples of (a): training data; (b): priming data using the word ABDUCT.

Figure 2 :
Figure 2: Priming (correlation) scores between model predictions and human data over the prime types for each DNN, coding scheme, priming model, and baseline. The term "LTRS" refers to the Letters in Time and Retinotopic Space model. To obtain the standard error for each bar, we computed Kendall's τ 1,000 times by randomly sampling 28 mean cosine similarity scores with replacement across conditions. The error bar corresponds to the standard error of this vector.
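The bootstrap procedure described in the caption can be sketched in pure Python. This is a hypothetical re-implementation rather than the authors' script, and `kendall_tau` here is the simple tie-ignoring τ-a variant:

```python
import random
import statistics

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    n = len(x)
    num = 0
    for i in range(n):
        for j in range(i + 1, n):
            sx = (x[i] > x[j]) - (x[i] < x[j])  # sign of x difference
            sy = (y[i] > y[j]) - (y[i] < y[j])  # sign of y difference
            num += sx * sy
    return num / (n * (n - 1) / 2)

def bootstrap_tau_se(model_sims, human_priming, n_boot=1000, seed=0):
    """Standard error of tau: resample the condition-level scores with
    replacement, recompute tau each time, and take the standard deviation."""
    rng = random.Random(seed)
    n = len(model_sims)
    taus = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        taus.append(kendall_tau([model_sims[i] for i in idx],
                                [human_priming[i] for i in idx]))
    return statistics.stdev(taus)
```

In the study each resample would contain the 28 condition-level mean cosine similarities paired with the human priming effects; the numbers above are placeholders.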

Figure 3 :
Figure 3: Distributions of human priming and the DNN models' perception. For each subplot, the x-axis is the metric-specific similarity measure and the y-axis is the prime condition. Prime types are ordered according to the size of the priming effect in the human data (largest to smallest), as in Table 1. As illustrated in the first row, the 'identity' condition has the strongest priming effect, with the highest priming score of 42.69 ms. The priming score for a condition is the difference between the mean RT (mRT) of the 'unrelated arbitrary' condition and the condition's own mRT. See Table 1 for the index of the 28 conditions.
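Under this convention, a condition's priming score is simply the baseline mRT minus that condition's mRT. A minimal sketch, where the function name and the example RTs are illustrative (made up) rather than values from the study:

```python
def priming_scores(mean_rts, baseline="unrelated arbitrary"):
    """Priming effect per prime condition (ms): positive values mean
    targets were identified faster than in the baseline condition."""
    base = mean_rts[baseline]
    return {cond: base - rt
            for cond, rt in mean_rts.items() if cond != baseline}

# Illustrative (made-up) mean RTs, in ms:
rts = {"identity": 580.0,
       "one-letter different": 595.0,
       "unrelated arbitrary": 622.0}
scores = priming_scores(rts)  # identity primes most: 622 - 580 = 42 ms
```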

Figure 4 :
Figure 4: Distributions of human priming and the coding schemes. For each subplot, the x-axis is the metric-specific similarity measure and the y-axis is the prime condition. See Table 1 for the index of the 28 conditions. The Interactive Activation and Spatial Coding Model values are made negative to obtain a positive correlation coefficient with the human priming data, as they represent estimated RTs, which are negatively correlated with priming effect size.

Figure B. 5 :
Figure B.5: Illustration of the process of generating training/validation images from target words (e.g., 'ABDUCT'). The word is drawn with a random font and size. Each letter is then rotated and translated randomly. This procedure is applied to each target word to generate 6,000 images, resulting in 6,000,000 images (1,000 words × 6,000 images).
4. Add random rotation to individual letters, using an angle drawn from the normal distribution N(0, 2π/45).
5. Add random translation to individual letters so that the updated letter coordinates (x, y) meet the condition shown in Equation B.1, where (a, b) are the letter's initial coordinates.
6. Transform the image into grayscale.
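The per-letter perturbation in the rotation and translation steps can be sketched as follows. This is a minimal illustration, not the authors' code: `sample_letter_transform` is a hypothetical helper, and the translation bound `max_shift` is a placeholder since the exact constraint of Equation B.1 is not reproduced here.

```python
import math
import random

# Ten even font sizes, {x in 2Z : 18 <= x < 38} -> 18, 20, ..., 36
FONT_SIZES = list(range(18, 38, 2))

def sample_letter_transform(a, b, max_shift=3.0, rng=random):
    """Sample a per-letter perturbation: a rotation angle drawn from
    N(0, 2*pi/45) and a small random translation of the letter's
    initial coordinates (a, b). max_shift is a placeholder for the
    bound given by Equation B.1 in the paper."""
    angle = rng.gauss(0.0, 2 * math.pi / 45)  # std of roughly 8 degrees
    x = a + rng.uniform(-max_shift, max_shift)
    y = b + rng.uniform(-max_shift, max_shift)
    return angle, (x, y)
```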

Figure B. 6 :
Figure B.6: Pairwise correlation matrix between human priming data, the DNN models, and other similarity metrics. The values represent Kendall's τ correlation coefficients. Priming-ARB is the human priming size using the arbitrary unrelated prime condition as the baseline; pixCS is the pixel cosine similarity; SCM is the Spatial Coding Model. It should be noted that, in order to obtain positive correlation values, the LDist and SCM values are made negative. * p < .05. ** p < .01. *** p < .001.

Table 2 :
DNN models' performance on the word recognition task using the validation dataset. Accuracy denotes the probability that a model's prediction (the class with the highest probability) matches the correct response.
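The accuracy measure described here is ordinary top-1 classification accuracy, which can be sketched as follows (`top1_accuracy` is a hypothetical helper, not taken from the study's code):

```python
def top1_accuracy(prob_rows, labels):
    """Fraction of items whose highest-probability class matches the
    correct label (top-1 accuracy)."""
    correct = sum(
        max(range(len(probs)), key=probs.__getitem__) == label
        for probs, label in zip(prob_rows, labels)
    )
    return correct / len(labels)
```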

Table 3 :
Pearson correlation coefficients between visual similarity ratings of upper-case letters and cosine similarity scores observed in the initial, middle, and final substitution conditions for various DNN models. The SN (substitution) conditions I, M, and F denote initial, middle, and final substitution, respectively. All p values < .001.
Table A.5: Coding schemes. LTRS: predicted match value; SCM and IA: predicted mean reaction time (ms). In order to obtain a positive correlation value, the SCM and IA values are made negative.

Figure B.5 illustrates how the 1,000 target words are used to generate 6,000 images per word algorithmically, which involves the following steps:
1. Apply one of ten common fonts (e.g., Arial); the complete list can be found at the GitHub repository of the current study.
2. Apply one of ten even font sizes selected from {x ∈ 2Z : 18 ≤ x < 38}.
3. Draw the target word as an image.