BACS: The Brussels Artificial Character Sets for studies in cognitive psychology and neuroscience
Written symbols such as letters have been used extensively in cognitive psychology, whether to understand their contributions to written word recognition or to examine the processes involved in other mental functions. Sometimes, however, researchers want to manipulate letters while removing their associated characteristics. A powerful solution to do so is to use new characters, devised to be highly similar to letters, but without the associated sound or name. Given the growing use of artificial characters in experimental paradigms, the aim of the present study was to make available the Brussels Artificial Character Sets (BACS): two full, strictly controlled, and portable sets of artificial characters for a broad range of experimental situations.
Keywords: Artificial characters · Letters · Uppercase/lowercase · Similarity
Given the growing use of artificial scripts in cognitive sciences, the aim of the present study was to generate and make available a full, strictly controlled, and portable set of artificial characters. In the following, we first review the different types of studies using unknown or artificial characters, and we then present the critical elements to take into consideration when devising and using a set of artificial characters.
Using unknown and artificial characters: State of the art
Researchers resort to unfamiliar symbols in three main situations: (1) to understand letter/word recognition processes, (2) to create a control condition in experiments involving letters, and (3) to investigate nonlinguistic learning processes.
Unsurprisingly, most of the studies using artificial characters are found in the field of psycholinguistics. The starting point was the heated debate about reading instruction (Valentine, 1913): Should reading be taught by means of a phonic method (systematic teaching of print-to-sound mapping) or a whole-word method (teaching associations between orthographic word form and meaning, without the code being explicitly provided)? Artificial or unknown scripts started to be used in the 1960s to investigate this issue, by manipulating print-to-sound mapping. Typically, each character was mapped onto a phoneme or a syllable of the language (most of the time a phoneme, to mimic the print-to-sound mapping in English), so that word-like pronunciations could be generated from groups of characters. This mapping either was or was not explicitly taught to new readers. The first studies showed that explicit teaching of print-to-sound correspondences facilitates novel word reading in the unfamiliar script (e.g., Bishop, 1964, in adults; Jeffrey & Samuels, 1967, in children), thus supporting the phonic method. Follow-up studies were mostly run in adults (who could be extensively trained and who were quickly able to use a completely new alphabet) and largely confirmed the first results (e.g., Baron & Hodge, 1978; Bitan & Booth, 2012; Bitan & Karni, 2003, 2004; Bitan, Manor, Morocz, & Karni, 2005; Brooks, 1977, 1978; Yoncheva, Blau, Maurer, & McCandliss, 2010; Yoncheva, Wise, & McCandliss, 2015). Along the same lines, artificial characters were used to examine the impact of letter sound or letter name knowledge (e.g., Chisholm & Knafle, 1975; Jenkins, Bausell, & Jenkins, 1972; Samuels, 1972) and phonetic feature knowledge (e.g., Byrne, 1984; Byrne & Carroll, 1989) on reading acquisition. Other factors potentially influencing learning were also examined (e.g., letter discrimination: Williams, 1969; the grain size of print-to-sound mapping: Hirshorn & Fiez, 2014).
Gradually, unknown and artificial scripts came to be used differently. The aim was no longer to simulate reading acquisition per se (which is actually hardly possible; see Knafle & Legenza, 1978, for a discussion), but rather to examine the developmental course of letter string processing (e.g., acquisition of visual expertise in reading: Maurer, Blau, Yoncheva, & McCandliss, 2010; development of high-quality lexical representations: Hart & Perfetti, 2008; letter position coding: García-Orza, Perea, & Muñoz, 2010) or to finely investigate processes that occur during letter/word processing (e.g., effects of orthographic or graphotactic regularities: Samara & Caravolas, 2014; Singer, 1980; Mason & Katz, 1976; print-to-sound consistency effects: Taylor, Plunkett, & Nation, 2011; influence of first-language characteristics on the acquisition of a second language: Ehrich & Meuter, 2009; Meuter & Ehrich, 2012; influence of handwriting knowledge on letter recognition: Longcamp, Boucard, Gilhodes, & Velay, 2006). For such studies, the relevance of using unknown or artificial characters lies in the possibility of investigating issues in a “pure” way, in the sense that the degree of familiarity with the script is fully controlled and that it is easier to avoid confounds that are inevitable with natural stimuli. We know, for example, that in real orthographies frequent words entail frequent letter clusters. Because of this confound, it can be tricky to examine pure effects of cluster frequency (i.e., disentangled from word frequency effects; see Chetail, 2015). With an artificial script, by contrast, it is possible to devise combinations of artificial characters made of either rare or recurrent character clusters, while holding constant the frequency at which each artificial word is presented to the participants.
More recently, there has been a renewal of interest in artificial scripts, combined with the development of neuro-imaging techniques (electroencephalography and functional magnetic resonance imaging, especially). Using characters unfamiliar to readers makes it possible to precisely track the development of the neural networks underpinning letter and written word recognition, from lack of knowledge of the script to high familiarity (e.g., Callan, Callan, & Masaki, 2005; Moore, Brendel, & Fiez, 2014; Xue, Chen, Jin, & Dong, 2006). Brain plasticity associated with the development of orthography-phonology relationships was also examined (e.g., Hashimoto & Sakai, 2004), as well as the impact of script characteristics on neural activation during reading (e.g., Mei et al., 2013).
In other studies, researchers used unfamiliar characters, but without these characters being the focus of interest. For example, to investigate the mechanisms of letter perception (in real scripts), one can use an alphabetic decision task (e.g., Cosky, 1976; Marzouki, Grainger, & Theeuwes, 2007; New & Grainger, 2011). Symbols are presented (either letters or unknown characters), and participants have to decide whether each symbol is a letter from the Latin alphabet. With this task, New and Grainger (2011) tested the effect of letter frequency in letter recognition. The artificial characters were therefore only used as filler items, for the negative responses.
More generally, pseudoletter strings are frequently used to provide a control condition (usually referred to as a false-font condition). In this case, the experiment deals with the processing of real letters or written words, and pseudoletters are used as a baseline to control for task execution processes that are not specific to real letters/words (e.g., detection of visual features) (e.g., Ben-Shachar, Dougherty, Deutsch, & Wandell, 2007; Longcamp, Anton, Roth, & Velay, 2003; Turkeltaub, Gareau, Flowers, Zeffiro, & Eden, 2003). Another reason to use unknown characters for the control condition is that it makes it possible to reduce familiarity with the symbols while maintaining visual characteristics identical to those of the letters used in the experimental conditions (e.g., Chanceaux, Mathôt, & Grainger, 2014; Petersen, Fox, Snyder, & Raichle, 1990; Vinckier et al., 2007).
Importantly, the use of false fonts as a baseline or filler condition is not restricted to experiments on letter/word processing (e.g., Awh & Jonides, 2001; de Gardelle, Sackur, & Kouider, 2009; Maki & Mebane, 2006). For example, to show that the richness of phenomenal experience (i.e., the feeling that our perceptual experience is richer than what we can express) is an illusion, de Gardelle et al. (2009) used a classical partial-report paradigm with letters. Participants were briefly presented with a matrix of letters and they had to report the cued row. In some trials, the uncued rows contained pseudoletters. The results of free reports showed that in these rows, participants had the illusory impression that there were only letters.
Unknown characters are also used to investigate learning beyond print, because they offer a good alternative to the objects, letters, or digits that are traditionally used in learning paradigms. In the field of concept learning, for example, it is well known that people can learn a new concept from a few examples (see Feldman, 1997), leading to the acquisition of rich representations that enable them to generate new exemplars and parse objects. In several concept learning experiments, artificial characters were used to understand how people learn categories (e.g., Lake, Salakhutdinov, & Tenenbaum, 2015; see also Feldman, 1997). For example, participants are first exposed to a target image and to new examples of that character, and they are then asked to devise a new exemplar or to parse the exemplars into parts. The reason to use pseudoletters in such experiments is that they are cognitively natural and can serve as a benchmark for comparing learning algorithms. Moreover, parsing (on the basis of visual features) can easily be tested, as can generalization (be it by humans or machines; Lake et al., 2015). Yet another example comes from the field of sequence learning, dedicated to understanding how we use sequences of information or sequences of actions to which we are exposed. Sequence learning is also used to examine the acquisition of new skills such as the capacity to draw inferences. For instance, participants first learn the sequential relation between adjacent elements (e.g., A < B, B < C, C < D), and they are then tested on their capacity to infer the transitive relation between nonadjacent stimulus elements (i.e., B < D; see, e.g., Van Opstal, Verguts, Orban, & Fias, 2008). In such experiments, letters or digits are frequently used, but to avoid the knowledge of the ordinal sequence of numbers and letters that is highly reinforced throughout the lifespan, one can rather use pseudoletters or shapes (e.g., Acuna, Sanes, & Donoghue, 2002; Van Opstal et al., 2008).
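The transitive-inference logic described above can be made concrete with a few lines of code. This is a minimal sketch of the paradigm's structure, not any study's actual materials; the labels A–D stand in for the pseudoletters or shapes that would be used as stimuli:

```python
# Adjacent premise pairs (A < B, B < C, C < D) implicitly define a full
# ordering, from which untrained, non-adjacent test pairs (e.g., B < D)
# can be inferred. Labels A-D are placeholders for pseudoletters/shapes.
order = ["A", "B", "C", "D"]
premises = [(order[i], order[i + 1]) for i in range(len(order) - 1)]

rank = {symbol: position for position, symbol in enumerate(order)}

def holds(x, y):
    """True if x precedes y under the transitive closure of the premises."""
    return rank[x] < rank[y]

# ("B", "D") was never presented as a premise, yet the relation follows.
assert ("B", "D") not in premises
assert holds("B", "D")
```

The point of the sketch is that a correct answer to the test pair ("B", "D") cannot come from memorizing trained pairs; it requires representing the underlying order.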
Why and how to devise a set of artificial characters?
The previous overview shows the benefits of using artificial characters, whatever the domain of research. In the field of visual word recognition specifically, designing experiments with an artificial script is a unique way to thoroughly examine the developmental course of a given orthographic process or effect that is stable in adults. Children could still be tested in their native writing system, but this is often made difficult by the presence of developmental confounds and by the practical difficulties of running training experiments in children. In addition, using an artificial script enables one to perfectly control the amount of exposure to the symbols across participants, so that one can be sure that there is no difference in familiarity. It also makes it possible to independently manipulate variables that covary in real scripts and that are therefore hard to isolate in native-language studies. Moreover, it is easy to take into account the mapping of “artificial words” onto linguistic features (phonology, semantics), either to avoid confounds or to examine their impact, while generating a large number of stimuli. More generally, in any study including letter stimuli, unknown or artificial characters are an ideal control condition (provided they have characteristics similar to those of letters; see below). Furthermore, they enable one to use letter-like stimuli while eliminating the knowledge associated with letters (e.g., shape, sound, ordinal arrangement).
Until now, the character sets that have been used have varied strongly from one study to another, and there is no accepted rule of thumb for selecting or devising symbols. Sometimes the new characters are devised from a recombination of the features of real letters (e.g., Park et al., 2014; Stevens et al., 2013). Sometimes characters are borrowed from other, unfamiliar scripts (e.g., Bishop, 1964; Callan et al., 2005) or result from modifications of borrowed symbols (e.g., Williams, 1969). Sometimes pseudoletters are simply nonalphanumeric symbols (e.g., *, /, ^), which do not necessarily entail letter features and are rather familiar to the participants (e.g., Bitan & Karni, 2003; Gombert & Peereman, 2001). Thus, the character sets used vary considerably and are more or less similar to the native script of the participants. Furthermore, the characters used in previous studies are most of the time not available, so that replications are difficult and comparisons among studies are questionable (e.g., Knafle & Legenza, 1978). In the following, we highlight the characteristics that need to be considered when devising and using artificial characters. This enables us to present the main features of the Brussels Artificial Character Sets (BACS).
Configurations of strokes
Despite a great deal of variation, characters of different writing systems share several properties. A cross-linguistic study comparing more than 100 alphabetic and nonalphabetic scripts showed that writing systems share a similar number of strokes per symbol, with three strokes per character on average (e.g., Changizi & Shimojo, 2005). Moreover, there is high redundancy within sets (around 50%), reflecting the tendency to re-use the same types of strokes rather than to create new ones. Along the same lines, Changizi, Zhang, Ye, and Shimojo (2006) showed that the topological configurations of strokes (i.e., the organization of strokes relative to each other) are very similar across writing systems. According to them, the high similarity of basic features and stroke configurations between strongly different scripts can be explained by the fact that characters are largely made of strokes that are commonly found in natural scenes, and that are thus easily processed by the visual system. The first criterion for devising BACS was therefore to meet these characteristics shared by most writing systems.
Similarity with the native script
As already mentioned, characters borrowed from unknown scripts can be used (e.g., Thai characters for monolingual French speakers) rather than artificial symbols. In that case, the characters are necessarily made of attested configurations of strokes. However, as characters vary in complexity, the risk is to use symbols of higher complexity (e.g., a higher average number of strokes) than those of the native script. This can be an issue because characters more complex than those of the native script could alter processing relative to simpler symbols. Knafle and Legenza (1978) showed, for example, that the positive influence of letter name knowledge on reading acquisition in English (see Levin, Shatil-Carmon, & Asif-Rave, 2006) was present in artificial scripts only when the characters were of similar complexity to the letters of the Latin alphabet. To meet the complexity and familiarity constraints, devising new characters (based on the native system) thus appears to be a good alternative. The BACS characters were therefore made so that they share most of the characteristics of the Latin alphabet (note that the procedure described in the next section can be applied to any writing system).
BACS provides two sets of characters. In the first one (BACS-1), character strokes were borrowed from existing writing systems, and the characters were controlled overall relative to major features of the Latin alphabet. Thus, the set shares the same average number of strokes per character and the same number of different types of strokes as the Latin alphabet. In the second set (BACS-2), each character was matched with a Latin letter on size and number of strokes. Characters were also matched on the number of junctions, that is, vertices (e.g., Lanthier, Risko, Stolz, & Besner, 2009; Szwed, Cohen, Qiao, & Dehaene, 2009; Szwed et al., 2011), and on the number of terminations (Fiset et al., 2009; Fiset et al., 2008). Studies have shown that these characteristics are critical for letter identification (e.g., Fiset et al., 2009; Fiset et al., 2008; Lanthier et al., 2009; Szwed et al., 2009; Szwed et al., 2011), although this result has not been consistently replicated (e.g., Petit & Grainger, 2002; Rosa, Perea, & Enneson, 2016). Most characters were also matched on the presence/absence of axes of symmetry.
Note that although BACS-2 is more strictly matched on Latin letters than BACS-1, characters of the latter set are more distinct from the Latin letters, which may be preferable for certain studies.
Similarity between characters
Similarity within the set of characters should also be taken into account when devising new symbols. In the Latin alphabet—as in any other system—some symbols are highly similar (e.g., O, Q), whereas others are very different (e.g., O, W). It is well known that similarity between characters influences their identification, with similar letters being less easily recognizable than dissimilar letters (see Mueller & Weidemann, 2012, for a review). An artificial set of characters mimicking a real script should therefore include high- and low-similarity symbols. This was taken into account in BACS-1 and BACS-2. Furthermore, we provide objective measures of similarity (i.e., similarity matrices and clustering; Mueller & Weidemann, 2012; Podgorny & Garner, 1979; Simpson et al., 2013) so that researchers can easily select more similar or less similar characters.
Many studies with artificial characters have used restricted sets (e.g., only 6–12 characters; e.g., Bitan & Karni, 2004; Jeffrey & Samuels, 1967; Singer, 1980; Yoncheva et al., 2010). This may be sufficient for certain nonlinguistic studies, but it is not adequate when the aim is to closely reproduce situations of exposure to natural print. In real writing systems, the number of characters varies from 6 to 180, but only two writing systems (out of more than 100) have fewer than ten characters, and the average number is 32 (Changizi & Shimojo, 2005). In both sets, we therefore created a number of characters similar to the number of letters in the Latin alphabet (i.e., 24 for BACS-1 and 26 for BACS-2). An additional strength of our scripts—rendering them complete and unique—is that they contain three different series: uppercase and lowercase computerized characters (which are delivered as OpenType fonts) as well as lowercase handwritten characters. Furthermore, BACS-2 has one version with serifs and one without serifs.
For each set, three groups of characters were devised, corresponding to the three usual versions of alphabets: uppercase characters, lowercase computerized characters, and handwritten lowercase characters (see Appendices 1 and 2). The uppercase characters and the computerized lowercase version were generated with the FontCreator software (.otf format). The fonts can be used in text editors as well as in experimental programming software (e.g., PsychoPy: Peirce, 2007; Psychophysics Toolbox: Brainard, 1997). Size and colour can be changed, and bold and italic variants are available. The handwritten lowercase characters were created manually on a sheet of paper before being scanned. All the files are available at https://osf.io/dj8qm/. In the following sections, we present the procedure followed to design the sets.
Handwritten lowercase characters
Computerized lowercase characters
The lowercase computerized characters were then derived from the handwritten lowercase characters, with some modifications. The line strokes included for character linking were removed (e.g., for the character corresponding to m), the curved strokes that had been added to facilitate handwriting were removed or replaced by straight lines (e.g., for the character corresponding to j), and, as in the Latin alphabet, some of the lowercase computerized characters differ from their handwritten versions (e.g., the character corresponding to r).
Unlike BACS-1, BACS-2 was devised by directly pairing each character with a Latin letter. In addition to overall control of the type, number, and configuration of strokes, this set provides characters paired with letters on size, number of strokes, presence/absence of symmetry, number of junctions, and number of terminations. Furthermore, given the fairly high number of fonts with serifs, we devised, for each case, versions of the characters with and without serifs.
Computerized lowercase characters
Handwritten lowercase characters
Characters were derived from the computerized form without additional changes (see Appendix 2).
Character similarity measurements
Letter similarity can be a source of confusion when perceiving strings, leading to false recognitions (e.g., reporting P instead of R; see Mueller & Weidemann, 2012), but similarity between characters is inherent to any script, since the same strokes are used in several characters (cf. the type/token ratio in the Latin alphabet; e.g., Changizi et al., 2006). To mimic real scripts, BACS includes both similar and dissimilar characters. To enable researchers to precisely select groups of characters according to their similarity, and to facilitate cross-script comparisons, we provide here objective measures of similarity.
Among the different methods of measuring letter similarity (e.g., Bagnara, Boles, Simion, & Umiltà, 1983; Boles & Clifford, 1989; Mueller & Weidemann, 2012), we used a similarity judgment task on a rating scale (e.g., Podgorny & Garner, 1979; Simpson, Mousikou, Montoya, & Defior, 2013). In this task, two characters are presented and participants have to assess how similar/dissimilar they are. This technique was favoured over others (e.g., speeded same–different matching) because it does not require rapid presentation. Although rapid presentation may be adequate for familiar symbols, which have robust memory representations, one cannot be sure that new characters would be precisely processed under such conditions.
Separate groups of 31 and 75 students estimated similarity for the characters of BACS-1 and BACS-2, respectively. For BACS-2, the pool of participants was divided into four groups of 18–19 participants, so that each participant was exposed to only one of the four versions of the set (sans-serif lowercase, sans-serif uppercase, serif lowercase, serif uppercase). All participants were native French speakers and reported having normal or corrected-to-normal vision. They received a small financial compensation for their participation. Four participants were excluded (n = 2 in the BACS-2 sans-serif lowercase group, n = 2 in the BACS-2 sans-serif uppercase group) because they misunderstood the instructions (e.g., incorrect use of the scale, incomplete task); their data were not considered in the following analyses.
For the first group of participants, the stimuli were the computerized BACS-1 characters (24 uppercase characters, 24 lowercase characters). The 576 (24 × 24) combinations of each case were presented, thus including reversed versions of the same pairs as well as identical pairs, leading to a total of 1,152 trials. In the second group of participants, each individual was exposed to one group of computerized BACS-2 characters (26 sans-serif lowercase characters, 26 sans-serif uppercase characters, 26 serif lowercase characters, 26 serif uppercase characters), again including the reversed version of each pair and identical pairs. This led to 676 trials (26 × 26) for each participant.
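The trial counts above follow directly from taking all ordered combinations of a set's characters, identical pairs included. A quick sketch:

```python
from itertools import product

# All ordered combinations of a set's characters, including reversed
# and identical pairs, as in the similarity judgment task.
def all_ordered_pairs(n_chars):
    chars = range(n_chars)
    return list(product(chars, chars))

bacs1_one_case = all_ordered_pairs(24)    # 24 x 24 = 576 pairs per case
bacs2_one_case = all_ordered_pairs(26)    # 26 x 26 = 676 pairs per case

n_bacs1_trials = 2 * len(bacs1_one_case)  # uppercase + lowercase = 1,152
n_bacs2_trials = len(bacs2_one_case)      # one version per participant = 676
```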
Participants were tested individually or in groups of up to six, for approximately 35 to 45 min. The task was programmed with the PsychoPy toolbox (Peirce, 2007, version 1.81). The session started with a familiarisation phase: each character was presented once on the screen, and participants had to copy it by hand on paper. Before moving to the similarity judgment task, they received a sheet with all the characters and had to examine the whole set for 45 s. Then, on each trial, a pair of characters was displayed at the centre of the screen, together with a continuous rating scale (ranging from 0 to 1) at the bottom of the screen, until response. Participants were asked to judge to what extent the two characters were similar by placing the cursor on the scale, with 0 and 1 corresponding to very dissimilar and very similar characters, respectively. They were encouraged to use the whole scale, and not just its extremities. For BACS-1, the 1,152 pairs were randomly distributed among 16 blocks of 72 trials, separated by brief breaks and mixing uppercase and lowercase pairs. The order of presentation was randomized for each participant. For BACS-2, the 676 pairs were randomly distributed among 13 blocks of 52 trials. The computer recorded the similarity score, corresponding to the distance from 0 to the position of the cursor on the scale, thus ranging from 0 to 1.
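The trial-distribution scheme just described can be sketched as follows. This is an illustrative reimplementation under our own assumptions, not the authors' actual PsychoPy script, and the function and variable names are ours:

```python
import random
from itertools import product

# Sketch of the randomization scheme: all pairs are shuffled anew for
# each participant, then split into equal-sized blocks (16 blocks of 72
# trials for the 1,152 BACS-1 pairs; 13 blocks of 52 for BACS-2).

def make_blocks(trials, n_blocks, rng):
    trials = list(trials)
    rng.shuffle(trials)                      # fresh order per participant
    size = len(trials) // n_blocks
    assert size * n_blocks == len(trials)    # blocks must tile the list
    return [trials[i * size:(i + 1) * size] for i in range(n_blocks)]

# BACS-1: uppercase and lowercase pairs mixed, 2 x 24 x 24 = 1,152 trials.
bacs1_trials = [(case, a, b)
                for case in ("upper", "lower")
                for a, b in product(range(24), repeat=2)]
blocks = make_blocks(bacs1_trials, n_blocks=16, rng=random.Random(0))
```

Because the shuffle precedes the split, each block naturally mixes uppercase and lowercase pairs, as in the reported procedure.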
BACS provides two original collections of artificial characters devised to closely match the visual characteristics of Latin letters. In both sets, the total number and the types of strokes are similar to what is found in scripts overall, and in the Latin alphabet in particular. Moreover, in BACS-2, each character is paired with a Latin letter on the number of strokes, junctions, terminations, and serifs, and the number of stroke types is similar to that of the Latin alphabet. Furthermore, the similarity matrices confirmed that, as in the Latin alphabet, the characters are relatively dissimilar, except for a few of them, so that it is possible to select very similar symbols as well as very dissimilar ones.
BACS is therefore a powerful tool for investigating letter and word processing through artificial scripts. It enables one to create new print-to-sound correspondence systems (be they alphabetic or not) and thus to examine in a unique way the developmental course of a given orthographic process or effect, to precisely control or manipulate print exposure, and to disentangle the effects of variables that are confounded in real scripts. Furthermore, the three versions of the two sets (uppercase, handwritten lowercase, and computerized lowercase) strengthen the similarity with existing scripts. The different versions can be used to address precise issues, such as upper/lowercase learning or the development of abstract letter representations. Alternatively, researchers can choose a subset of symbols among the 3 × 24 characters. More generally, due to its precise controls, BACS can be used in any experimental situation requiring either the manipulation of letter-like stimuli or the inclusion of a baseline condition for letters (e.g., Lake et al., 2015; Longcamp et al., 2003; Vinckier et al., 2007). Finally, because BACS is available to the research community and easy to use when designing experiments, its use should improve comparability between studies using artificial characters.
The work reported here was supported by the Interuniversity Attraction Poles Program of the Belgian Science Policy Office (Project P7/33). We adhere to the PRO initiative for open science (https://opennessinitiative.org). All of the files (e.g., BACS fonts, raw data, script for analyses, matrices of similarity) are available at https://osf.io/dj8qm/. We thank the anonymous reviewers for their helpful comments on an earlier version of the manuscript.
- Baron, J., & Hodge, J. (1978). Using spelling–sound correspondences without trying to learn them. Visible Language, 12, 55–70.
- Brooks, L. (1977). Visual pattern in fluent word identification. In A. S. Reber & D. L. Scarborough (Eds.), Toward a psychology of reading (pp. 143–181). Hillsdale, NJ: Erlbaum.
- Brooks, L. (1978). Non-analytic correspondences and pattern in word pronunciation. In J. Requin (Ed.), Attention and performance VII (pp. 163–177). Hillsdale, NJ: Erlbaum.
- Chisholm, D., & Knafle, J. D. (1975). Letter-name knowledge as a prerequisite to learning to read. Reading Improvement, 15(1), 2.
- Gombert, J. E., & Peereman, R. (2001). Training children with artificial alphabet. Psychology, 8, 338–357.
- Hart, L., & Perfetti, C. A. (2008). Learning words in Zekkish: Implications for understanding lexical representations. In E. L. Grigorenko & A. J. Naples (Eds.), Single word reading: Behavioral and biological perspectives (pp. 107–128). New York, NY: Taylor & Francis.
- Jenkins, J. R., Bausell, R. B., & Jenkins, L. M. (1972). Comparisons of letter name and letter sound training as transfer variables. American Educational Research Journal, 75–86. doi:10.3102/00028312009001075
- Longcamp, M., Boucard, C., Gilhodes, J.-C., & Velay, J.-L. (2006). Remembering the orientation of newly learned characters depends on the associated writing knowledge: A comparison between handwriting and typing. Human Movement Science, 25, 646–656. doi:10.1016/j.humov.2006.07.007
- Meuter, R. F. I., & Ehrich, J. F. (2012). The acquisition of an artificial logographic script and bilingual working memory: Evidence for L1-specific orthographic processing skills transfer in Chinese–English bilinguals. Writing Systems Research, 4(1), 8–29. doi:10.1080/17586801.2012.665011
- R Development Core Team. (2015). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from www.R-project.org
- Stevens, C., McIlraith, A., Rusk, N., Niermeyer, M., & Waller, H. (2013). Relative laterality of the N170 to single letter stimuli is predicted by a concurrent neural index of implicit processing of letter names. Neuropsychologia, 51, 667–674. doi:10.1016/j.neuropsychologia.2012.12.009
- Valentine, C. W. (1913). Experiments on the method of teaching reading. Journal of Experimental Pedagogy, 2, 99–112.