The study of letter similarity (or confusability) and letter frequency has a long history over several decades within the fields of psychology and psychophysics (see Mueller & Weidemann, 2012, for a review). Continued interest in the study of this topic is predicated on the widely held belief that a good understanding of what drives perceived similarity among letters and the availability of reliable statistics regarding their distributional properties are crucial for a number of reasons. First, the study of letter properties lays the groundwork for the study of how letters are represented in the cognitive system, since letters of individual words are thought to represent the first “language-specific” stage of the reading process, following the work done by oculomotor control mechanisms enabling fixation on the word and the early visual processing that allows visual feature extraction (Carreiras, Armstrong, Perea, & Frost, 2014; Dehaene, Cohen, Sigman, & Vinckier, 2005; Grainger, 2008). Second, since mastery of alphabetic reading is generally thought to require, as a first step, the ability to map letters and letter strings onto the sounds of the language (Bowey, 2005; Snowling & Hulme, 2011), the study of letter properties can provide valuable information to educators regarding the complexity of letter forms and guide the choice of the order in which the learner is exposed to these letters. Finally, the investigation of letter properties promotes empirical investigations with a view toward gaining a better understanding of how the visual system functions.

For many years, researchers have sought to establish letter frequency databases for different languages such as Russian (Gusein-Zade, 1988), English (Mayzner & Tresselt, 1965), and Spanish (Li & Miramontes, 2011) in order to provide normative frequency data for researchers interested in verbal learning and retention, anagram problem solving, word recognition thresholds, and linguistic analyses. Similar interest in developing letter similarity/confusability matrices is evident in a long research tradition spanning several decades, with the early work, mainly on English, seeking to identify typefaces, fonts, and letters that were more or less legible, with the aim of improving printing and typesetting (Roethlein, 1912; Tinker, 1928). More recently, research has come to focus on understanding the visual system and how it represents and processes letters as visual objects, without losing interest, however, in attempting to make written text more comprehensible or helping learners to acquire reading skills more easily (Boles & Clifford, 1989; Fiset et al., 2009; Liu & Arditi, 2001; Mueller & Weidemann, 2012). Collectively, these studies have played a fundamental role in enabling the design and implementation of many well-controlled empirical studies seeking to pin down the dynamics of letter processing (e.g., Evans, Lambon Ralph, & Woollams, 2017; Grainger, Dufau, Montant, Ziegler, & Fagot, 2012; Kinoshita & Kaplan, 2008; Schelonka, Graulty, Canseco-Gonzalez, & Pitts, 2017).

Despite the importance of having reliable letter similarity matrices and letter frequency counts, this type of information is available only for a handful of Indo-European languages. Other languages, such as Modern Standard Arabic (henceforth MSA), suffer from a lack of lexical resources in general and computerized databases about letter similarity and letter frequency in particular. MSA is the language taught at most schools, colleges, and universities in the Arab world and is the one used in the media, literature, and formal settings such as political meetings (e.g., Kamusella, 2017; Versteegh, 2014). This language, despite its importance for the study of letter processing and letter representation by virtue of its very special writing system, as we will detail below, has very few published lexical resources. Notable exceptions are Aralex (Boudelaa & Marslen-Wilson, 2010) and Arabicorpus (Parkinson, 2000). Therefore, researchers interested in the study of Arabic letter processing, Arabic reading, and developing better Arabic reading tools, and psycholinguists interested in cross-linguistic investigations of letter and word processing, are in dire need of reliable information about the distributional characteristics of letters and their similarities.

The aim of this study is to provide, for the first time, (a) comprehensive statistical information about Arabic letters and their allographs and (b) a similarity/confusability matrix of Arabic letters and allographs in the visual, auditory, and motoric domains. We begin by providing some relevant background about the orthographic system and its importance for the study of letter processing. Second, we provide a detailed statistical count of the frequencies of Arabic letters and their allographs based on a 40-million-word corpus. Third, we present a visual similarity matrix of Arabic letters and their allographs based on ratings by 125 participants, followed by a phonetic similarity matrix based on theory-driven phonetic features and a motoric similarity matrix based on the strokes required to write each letter and its allographic variants. We conclude by highlighting the importance of this new set of information on the distributional and structural properties of Arabic for future investigation of this language in different research fields.

The Arabic writing system

MSA is a Semitic language written from right to left in a cursive manner. The MSA alphabet consists of 28 letters, 22 of which always connect to the following letter using a ligature, while the remaining 6 connect to the preceding but not the following letter. MSA is the fifth most common language in the world, with over 300 million speakers. One of the most important features of the Arabic writing system is “allography,” whereby the shapes of 15 of the 28 letters differ considerably depending on their location within the letter sequence (initial, middle, final, and isolated). For instance, the letter ع, which stands for a voiced pharyngeal fricative represented by /ʕ/ in IPA notation, takes the shape عــ word-initially, ــعــ word-medially, ــع word-finally when preceded by a ligating letter, and ع word-finally when preceded by a non-ligating letter. The remaining 13 letters (e.g., ب, ث, د, ر) preserve their shapes regardless of their position within the word, but have ligature marks on either side (e.g., ـبـ, ـثـ) or only on their right-hand side (e.g., ـد, ـر). Another important feature of the MSA orthographic system is the use of a cursive writing system even in typing, a rare feature among the world’s writing systems, including typologically related languages such as Hebrew. A final unique aspect of MSA is that a given letter can have up to three diacritic symbols superposed on it, thus creating a highly complex visual percept. This is illustrated by the second letter خ of the word مـــخٌّ “brain,” which shows a single dot diacritic underneath a gemination sign indicating that the consonant خ is doubled, and the nunation sign, which denotes the indefinite article -un.

The complexity of this orthographic system has given rise to many studies across several research areas. For instance, in the field of reading, Asadi, Khateb, and Shany (2017) showed that unlike Indo-European languages, where reading processes are seen as the product of decoding abilities and listening comprehension, MSA requires an extended model that includes the orthographic and the morphological domains in order to capture the intricacies of reading in Arabic. Similarly, some researchers have suggested that the complexity of the Arabic orthographic system leads to slower processing than in related languages such as Hebrew (Ibrahim, Eviatar, & Aharon-Peretz, 2002), while others (Taha & Saiegh Haddad, 2017) have argued that this feature leads Arabic orthography learners to rely on morphological structure much earlier in the course of learning to read and spell than their Indo-European counterparts.

In the visual word recognition domain, researchers have been interested in establishing the role of allography and whether Arabic cognitive representations contain a level that corresponds to abstract letter identities (Boudelaa, Norris, Mahfoudhi, & Kinoshita, 2019; Carreiras, Perea, & Abu Mallouh, 2012; Friedmann & Haddad-Hanna, 2012; Perea, Abu Mallouh, & Carreiras, 2010). This line of research relates to a much broader set of issues in cognitive science regarding the types of representations used in reading and whether letter recognition is subserved by a hierarchical processing system that involves both case-specific and case-independent representations of alphabetic stimuli (Petit, Midgley, Holcomb, & Grainger, 2006; Rothlein & Rapp, 2014, 2017). In this respect, Boudelaa et al. (2019) reported a series of priming experiments looking at whether a target word (e.g., يسعدون “be happy”) is facilitated more by a nonword transposed letter (TL) prime that does not cause allographic changes (e.g., يعسدون) than a TL prime that causes such changes (e.g., يسدعون). The results showed that the non-allographic TL primes produced significantly greater facilitation than allographic TL primes, indicating that Arabic readers use allographic variation to resolve the uncertainty in letter order during the early stages of orthographic processing. Similar results were reported by Yakup, Abliz, Sereno, & Perea (2014, 2015) for Uyghur, a Turkic language spoken in Western China that uses the Arabic orthographic system, suggesting that visual form changes that Arabic letters undergo as a function of their position in the word play a critical role in guiding the reading process.

Finally, in the field of automatic language processing, there has been a recent surge in the study of the characteristics of typed and handwritten Arabic letters to develop algorithms that can automatically process Arabic written scripts (Abandah, Younis, & Khedher, 2014; Cowell & Hussain, 2002; Khorsheed, 2002). The development of new lexical resources related to letter frequency and letter similarity can only help to further spur interest in MSA and provide the tools necessary to conduct well-controlled and replicable research.

Letter and allograph frequencies

Here we provide the frequency of Arabic letters and their allographs based on the 40-million-word corpus previously used by Boudelaa and Marslen-Wilson (2010) to develop the Aralex database. These frequency figures were calculated as percentages over the non-diacritized version of Aralex. In Table 1, we provide the frequencies of the 28 letters of the alphabet along with the letter frequencies published online by Mohsen Madi (2010) for comparison.

Table 1 Percentage frequencies of the 28 Arabic letters in the current study and in Madi (2010)

There are numerous similarities between the frequency statistics of the current study and Madi’s (2010), as demonstrated by a Pearson correlation analysis (r = 0.9), suggesting a close match between the two sets of frequencies. The small discrepancies in the frequency counts between the two studies are probably due to the use of different kinds of corpora. The current study’s 40-million-word corpus comes from contemporary written sources, namely newspaper articles, as detailed in Boudelaa and Marslen-Wilson (2010). In contrast, Madi (2010) relied on a small corpus of a little more than one million words derived mainly from old Arabic books such as البداية والنهاية The Beginning and The End of Ibn Katheer (1300–1373) and الرحيق المختوم The Sealed Nectar by Al Mubarkafoori, which is a compilation of the sayings of the Prophet of Islam produced in classical Arabic 14 centuries ago, or on books that deal with Islamic jurisprudence and hence use mostly older Arabic, such as تحفة العروسين The Masterpiece of the Brides by Al-Shuri.

It is important to further note that the current letter frequency values make intuitive sense, because the four letters with the highest frequencies are on the one hand the letters و and ل, which respectively correspond to the function words “and” and “in order to,” and the letters ي, ت on the other, which are in fact inflectional affixes. At the same time, the letters with the lowest frequencies correspond either to marked sounds that are very rare across the world languages, such as the pharyngealized alveolar ض and the pharyngealized interdental ظ, or indeed to letters that do not correspond to function words or affixes, such as ث, ذ and خ.

In Table 2 below we present for the first time the frequencies of Arabic letters broken down by allograph.

Table 2 Percentage frequency (% Frq) of 116 Arabic letter allographs (Allog)

For each letter of the alphabet, we determined the frequency of its allographic form in isolation and at the onset, middle, and offset of the word. Thus, for the majority of letters, such as ع ain and غ ghayn, we report the frequencies of four allographs, whereas for others, such as د daal, ذ thaal, ر raa, and ز zein, we report only two values because they have only two allographic forms. For the letter أ alif, we report values for seven allographic forms because this letter has different interchangeable variants such as ـأ ,أ, and ا. Finally, for the letter ت taa, we report values for six allographs, four of which are for the taa maftuuha, “open taa,” and two for the taa marbuuta, “closed taa.” As can be clearly seen from Table 2, allographs of the same letter do not occur with the same frequency across the board. For instance, the allograph بـ, baa, with a frequency of 2%, is much more common than the allograph ـبـ with only 0.22%. The frequencies of other letter allographs (e.g., ـط 0.26, طـ 0.76) are much more evenly distributed.

An interesting theoretical question that allograph frequencies can help address is whether the effects of allographic changes in visual word recognition experiments, such as those reported by Friedmann and Haddad-Hanna (2012) and Boudelaa et al. (2019), can be modulated by allographic frequency. From a practical point of view, these data can help educators not only in making informed choices about the development of teaching materials that reflect the frequency of different letters and their allographs, but also in modulating their instructional focus. For instance, when teaching the letter ع the instructor can, based on allograph frequency data, dedicate more time to teaching the allograph ع than the allograph ـع, given that the latter is much more frequent than the former and may not need as much time to be learned.

Subjective Letter Similarity Experiment

The technique that we employed to construct the similarity matrix is based on data obtained under normal (untimed) reading conditions and is comparable to the approach used in previous studies examining letter knowledge in children (Treiman, Kessler, & Polo, 2006; Treiman, Levin, & Kessler, 2007, 2012) and letter similarity in adults (Simpson, Mousikou, Montoya, & Defior, 2013). Participants were speakers of MSA who were required to rate letter pairs on a scale from 1 (not similar at all) to 7 (very similar). We anticipate that the matrix presented here will also prove useful to researchers in any field of investigation in which Arabic letters are used as stimuli and where a measure of visual similarity between stimuli is required.

Method

Participants

A total of 125 participants, aged 20 to 24, were recruited to take part in this experiment. All participants were literate MSA speakers who were undergraduate students on the female campus of the Faculty of Humanities and Social Sciences at United Arab Emirates University. All participants spoke English as a second language but declared Arabic (i.e., MSA and the Emirati dialect) to be their dominant language. This experiment was approved by the ethics committee of United Arab Emirates University, and all participants gave their written consent to take part in it in return for 50 AED.

Stimuli

As in the previous study, we selected four allographs for each letter of the alphabet except for the letters (a) ط ,ز ,ر, and ظ, for which only two allographs were included; (b) the letter ه, for which only three allographs were used; and (c) the letter أ alif, for which eight different allographs were included. This choice, which was based on pilot testing, resulted in a total of 110 allographs. Each allograph was paired with every other allograph, including itself, resulting in 6105 pairs. These were used to build 15 experimental lists consisting of 407 experimental pairs each. Each participant was randomly assigned to one list. To ensure that subjects were assessing the visual, and not phonetic, similarity between the different allograph pairs, a further 32 foil pairs were built consisting of the 28 Arabic letters paired with Latin letters to create four conditions. The first consisted of cross-alphabet letter pairs that were both phonetically and visually similar. These were pairs like ل– L, which share the straight downward-directed stroke. The second condition consisted of Arabic-Latin pairs which were phonetically similar but visually dissimilar, such as ن-N, which share phonetic features [+coronal, +nasal, +continuant, +sonorant] but look very different visually. The third condition consisted of cross-alphabet pairs that were phonetically dissimilar but visually similar, like خ-G, which share the downward-directed semicircular stroke. The final condition comprised pairs that were neither phonetically nor visually similar, such as ذ- I. The ordering of the letters within each pair was counterbalanced across lists, such that each letter appeared almost half of the time in the first position and half in the second.
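The pair and list counts reported above follow from simple combinatorics. The following is a minimal sketch, assuming placeholder integer labels in place of the 110 allographs; the round-robin split into lists is an illustrative assumption, since the actual assignment of pairs to lists is not described here.

```python
from itertools import combinations_with_replacement

N_ALLOGRAPHS = 110  # number of allographs reported in the text
N_LISTS = 15        # number of experimental lists reported in the text

# Placeholder labels; the real stimuli are Arabic allographs.
allographs = list(range(N_ALLOGRAPHS))

# Unordered pairs, including each allograph paired with itself.
pairs = list(combinations_with_replacement(allographs, 2))

# Hypothetical round-robin split into equally sized lists.
lists = [pairs[i::N_LISTS] for i in range(N_LISTS)]

print(len(pairs))                   # 6105
print({len(lst) for lst in lists})  # {407}
```

With n items, the number of unordered pairs including self-pairs is n(n + 1)/2, which gives 110 × 111 / 2 = 6105, and 6105 / 15 = 407 pairs per list.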

Design and procedure

The presentation of the stimuli and recording of responses were controlled by desktop computers running SuperLab 5. On each trial, two stimulus allographs appeared at the center of the screen in Traditional Arabic 72-point font size in black against a white background. Participants were instructed to ignore the sounds of the letters and to rate the letter pairs on the computer keyboard based purely on visual similarity on a scale from 1 (not at all similar) to 7 (very similar). No time limits were imposed, and participants responded at their own pace. Participants could advance to the following trial only after providing a response to the current trial. To emphasize the importance of paying attention to the shape of the allograph, participants were also asked to rate a number of geometrical shapes (e.g., squares, rectangles, circles) on their similarity in shape. The experiment lasted about 15 minutes.

Results and discussion

An initial screening was performed on the data in order to detect cases in which the participants may have misunderstood or not correctly followed the instructions. No data points were excluded as a result. A second screening process tested whether participants’ knowledge of the letter sounds exerted a strong influence on their responses, by examining the ratings assigned to the Arabic-Latin letter pairs. We rescaled the similarity ratings on the 1–7-point scale into distances on a 0–1 scale. In order to take into account the fact that human-generated similarity judgments are likely to be logarithmic on actual distance, we used the following formula: Distance = [exp(7) − exp(Distance1)]/[exp(7) − exp(1)], where Distance1 is the similarity rating given to a pair of letter allographs, so that a rating of 7 maps to a distance of 0 and a rating of 1 to a distance of 1. This formula simply rescales the similarity score provided by the participants into a distance metric that can be fed to the hierarchical clustering technique.
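The rescaling formula can be written as a small function. This is a sketch only, assuming the input is the 1–7 similarity rating for a pair; the function and parameter names are illustrative and do not come from the study's own scripts.

```python
import math


def rating_to_distance(rating: float, lo: float = 1.0, hi: float = 7.0) -> float:
    """Map a similarity rating in [lo, hi] to a distance in [0, 1].

    Implements Distance = [exp(hi) - exp(rating)] / [exp(hi) - exp(lo)].
    """
    if not lo <= rating <= hi:
        raise ValueError("rating lies outside the rating scale")
    return (math.exp(hi) - math.exp(rating)) / (math.exp(hi) - math.exp(lo))


# The highest rating maps to distance 0, the lowest to distance 1.
for r in range(1, 8):
    print(r, round(rating_to_distance(r), 4))
```

Because of the exponential, ratings at the low end of the scale are compressed toward a distance of 1, while differences among high ratings are stretched.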

Table 3 suggests that although the overall perceived visual distance among cross-alphabet letters is large, the +P+V pairs (e.g., ل-L) and –P+V (e.g., غ-G) pairs were perceived as significantly closer in visual space than the +P−V (e.g., ب-B) and the –P−V pairs (e.g., ش-E). Thus, phonetic similarity did not modulate the perceived distance among the cross-alphabet pairs, with the visually similar pairs perceived to be the same distance from each other regardless of phonetic similarity, and the visually dissimilar pairs being rated as maximally distant from each other regardless of whether they were phonetically similar. A series of paired two-tailed t tests revealed +P+V to be significantly different from +P−V (p < 0.00) and –P−V (p < 0.00), but not from –P+V (p = 0.48). More interestingly, –P+V was also reliably different from +P−V (p < 0.01) and –P−V (p < 0.02). This pattern of results clearly demonstrates that participants carried out the task solely based on the visual similarities of the letter pairs and completely ignored the phonetic dimension as instructed.

Table 3 Mean (and standard deviation) of the visual distance between cross-alphabet Roman–Arabic letter pairs

Where the within-alphabet letter and allograph pairs are concerned, the full visual similarity matrix for 110 allographs can be accessed at https://osf.io/yqns4/, with the distance measures rescaled using the distance formula mentioned above. The dendrogram in Fig. 1 displays the hierarchical relationships of the 110 Arabic allographs used in this experiment.

Fig. 1

Hierarchical clustering (dendrogram) using the nearest neighbor method. The vertical axis of the dendrogram represents the distance or dissimilarity between clusters. The horizontal axis represents the 110 Arabic allographs

The general technique we use here is hierarchical clustering, which aims to group similar objects into groups called clusters (Kassambara, 2017; Jajuga, Sokolowski, & Bock, 2002; Stahl, Leese, Landau, & Everitt, 2011). The end point of such an approach is to create a set of clusters that are distinct from each other, while the objects within each cluster are broadly similar to each other. Hierarchical clustering typically operates on a distance matrix. It starts by treating each observation as a separate cluster, then it iteratively identifies the two clusters closest to each other and merges them until no clusters are left unmerged. The main output of hierarchical clustering is a dendrogram, which is simply a diagram that shows the hierarchical relationships between objects. The main use of a dendrogram is to work out the best way to allocate objects to clusters, and this usually requires (1) the computation of the distance (similarity) between two given clusters using a distance metric (e.g., Euclidean distance, city block) and (2) selecting a linkage criterion to determine whether the distance is computed between the two most similar parts of a cluster (single-linkage), the two least similar bits of a cluster (complete-linkage), the center of the clusters (mean or average-linkage), or some other criterion.
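The merging procedure described above can be sketched in a few lines. This is a naive illustration under stated assumptions: the distance matrix is invented toy data, and only single, complete, and average linkage are implemented here, not the other criteria mentioned.

```python
from itertools import combinations


def agglomerate(dist, linkage="single"):
    """Naive agglomerative clustering on a symmetric distance matrix.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    where each cluster is a tuple of item indices.
    """
    reducers = {
        "single": min,                            # two most similar members
        "complete": max,                          # two least similar members
        "average": lambda ds: sum(ds) / len(ds),  # mean pairwise distance
    }
    reduce_fn = reducers[linkage]

    def cluster_distance(a, b):
        return reduce_fn([dist[i][j] for i in a for j in b])

    clusters = [(i,) for i in range(len(dist))]  # every item starts alone
    merges = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest linkage distance.
        a, b = min(combinations(clusters, 2), key=lambda p: cluster_distance(*p))
        merges.append((a, b, cluster_distance(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Toy distance matrix for four hypothetical items.
toy = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.7, 0.9],
    [0.9, 0.7, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]
for step in agglomerate(toy, linkage="single"):
    print(step)
```

Reading the merge history from first to last corresponds to reading a dendrogram from the bottom up: each tuple records which two clusters were joined and at what height.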

In this study, all dendrograms are based on the standard Euclidean distance metric and use “ward.D2” as a linkage criterion to determine the distance between sets of observations as a function of the pairwise comparisons (Murtagh & Legendre, 2014). However, since hierarchical cluster analysis can typically yield as many cluster solutions as there are cases to be clustered (Clatworthy, Buick, Hankins, Weinman, & Horne, 2005), one needs to determine the appropriate cluster solution using objective formal rules and equations to identify the optimal number of clusters in a sample. Here we have opted for the “gap statistic,” which takes the input of the hierarchical clustering analysis and compares the change in within-cluster dispersion with that expected under a reference null distribution. The gap statistic has been reported to outperform other methods (Tibshirani & Walther, 2005) and to provide quite stable solutions (Yan & Ye, 2007). Applying this method to our data yielded a maximum gap statistic of 0.94, with an optimal number of 19 clusters (Table 4).

Table 4 Optimal number of clusters based on visual similarity as suggested by the gap method, the members of each class, and its within-cluster sum of squares

Table 4 shows that the largest of the 19 clusters consisted of nine allographs, and the smallest consisted of two. The within-cluster sum of squares (SS), which measures the amount of variance in the data, is < 2 for all clusters except Cluster 7. Although the within-cluster SS is influenced by the number of observations and is therefore often not directly comparable across clusters with different numbers of observations, the preponderance of low SS for all clusters save one suggests that the clusters are highly consistent, with very little variability. In addition, the total SS is 40.62 and the between-cluster SS is 21.97, suggesting that data points cluster neatly in a 19-dimensional space of visual attributes.
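The relation between total, within-cluster, and between-cluster SS can be illustrated with a small example. This is a sketch under stated assumptions: the points and cluster labels are invented, and the usual decomposition (total SS = pooled within-cluster SS + between-cluster SS) is assumed to be the one behind the reported figures.

```python
def sum_of_squares(points):
    """Sum of squared Euclidean distances from each point to the centroid."""
    n = len(points)
    centroid = [sum(col) / n for col in zip(*points)]
    return sum((x - c) ** 2 for p in points for x, c in zip(p, centroid))


# Hypothetical 2-D observations with hypothetical cluster labels.
points = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0), (4.0, 2.0)]
labels = [0, 0, 1, 1]

total_ss = sum_of_squares(points)
within_ss = {
    k: sum_of_squares([p for p, lab in zip(points, labels) if lab == k])
    for k in sorted(set(labels))
}
between_ss = total_ss - sum(within_ss.values())

print(total_ss)    # 20.0
print(within_ss)   # {0: 2.0, 1: 2.0}
print(between_ss)  # 16.0
```

If the same decomposition applies to the values reported above, the pooled within-cluster SS would be 40.62 − 21.97 = 18.65.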

The component members of each cluster share a number of characteristics that the participants relied on to assign their similarity ratings. For example, Cluster 14 in Table 4 features the allographs ط ظ ـطـ ـظـ, which share the egg-shaped loop with a vertical stroke, and the only difference between them is the dot above the first and third members of this set. Similarly, the eighth cluster in the same table features the six allographs ج ح خ ـج ـح ـخ, with the first three ligating to the right (i.e., to the preceding letter), while the second three do not. Two main features cut across the members of this cluster: the downward-directed semicircle and the acute angle it makes at its upper end. Even Cluster 7, which consists of nine seemingly heterogeneous allographs overall, reveals a clear structure at a lower level of granularity, with the allographs ن and ـن sharing the downward-directed semicircle, while the ـهـ, ـة ,ـه ,ة ,ه share the closed loop written on or above the line. The final two members of this cluster are the isolated ك and the right-ligating ـك. One reason these two allographs are grouped with Cluster 7 is arguably the small dot-like shape in the middle of the two allographs, which allies them with the four dot-bearing allographs in this cluster.

Table 4 further suggests that phonetic similarity among allographs played little or no role in the similarity judgment process. This is clearly illustrated by Cluster 1, for example, where the allograph ء corresponds to a voiceless glottal stop sound, whereas the allographs ع عـ ـع and the allographs غ غـ ـغ correspond to a voiced pharyngeal fricative and a voiced velar fricative, respectively. More importantly, perhaps, the cluster membership as illustrated in Table 4 is in keeping with recent psycholinguistic and neurolinguistics research on Arabic letter allography (Boudelaa et al., 2019; Friedmann & Haddad-Hanna, 2012; Yakup, Abliz, Sereno, & Perea, 2014, 2015). For instance, the allographs ـج and جـ are two different instantiations of the abstract letter ج, but they belong to Clusters 8 and 9, respectively. This strongly suggests that different allographic shapes of the same abstract letter were treated as two different perceptual objects in our similarity judgment task. Further credence for this idea comes from the recent demonstration by Boudelaa et al. (2019) that transposed-letter priming (TL-priming) is modulated by allographic changes, such that a target word like يسعدون “be happy” is easier to recognize when preceded by the non-allographic TL-prime يعسدون than when preceded by the allographic TL-prime يسدعون. Similar results were reported by Yakup et al. (2014, 2015) for Uyghur, a non-Semitic language that uses the Arabic writing system, and by Friedmann and Haddad-Hanna (2012), who showed that Arabic dyslexic patients’ letter migration errors when reading aloud were reduced for words in which letter transposition or letter substitution caused allographic changes.

The current experiment refines and extends the recent findings of Wiley, Wilson, and Rapp (2016) in a number of ways. For example, those authors studied the similarity structure of 45 Arabic letter shapes in a timed same–different judgment task with experienced and novice speakers. Our study included 110 allographs, allowing us to provide the principled similarity structure displayed in Fig. 1 above for allograph groups absent from Wiley et al.’s study. Consider, for instance, the letter ي: In our study, this letter meaningfully clusters with its allographic variant in a right-ligating context (i.e., ـي), with the allograph called alif maqsuura in isolation with or without a hamza ئ ى, and with the alif maqsuura ligating to the right with and without the glottal stop, hamza ـى ـئ. The same letter ي in Wiley et al. (2016) clusters with م and هـ in the latency and accuracy data of the expert subjects, respectively, making it more difficult to isolate the basis of the visual similarity underlying such clusters. Further, Wiley et al. (2016) did not include the glottal stop, hamza, either by itself (ء) or in the context of the different letters that can support it, such as alif أ, alif maqsuura ئ, waaw ؤ, or nabrah ـئـ. Presumably, Wiley et al.’s choice is reasonably predicated on the standard view that the hamza is not a letter of the alphabet. We have opted for completeness and included the glottal stop in our analysis. In doing so, we have gained the novel insight that this letter is typically treated like a dot when it occurs in the context of a supporting letter. Thus, ؤ clusters with و ـو, while ـئـ clusters with ـبـ ـتـ ـنــيـ. In contrast, isolated ء is treated like a full-fledged letter allograph and clusters with ع عـ غ غـ ـع ـغ, arguably because it is perceived as a miniature عـ.

Finally, our study provides strong empirical support for Wiley et al.’s observation that allographs of letters in the middle position (e.g., ـبـ ـتـ ـثـ ـحـ ـخـ ـجـ) are identical to the corresponding allographs in the initial position when the ligature to the right is ignored (i.e., بـ تـ ثـ حـ خـ جـ). Based on the structure of Clusters 5 and 9 in our data, it is clear that participants ignored the right ligation of the middle allographs and grouped them with their counterparts in the initial position. This is a seemingly surprising outcome, since ligation is not only taught as part of the letter form to Arabic learners, but it also provides crucial information about word length and lexical stress position (Boudelaa et al., 2019). It is, however, consistent with recent research that reports comparable masked repetition priming effects for isolated letter pairs with similar (e.g., ـفـ فـ) and with dissimilar (ـعـ ع) visual features across letter positions (Carreiras, Perea, Gil-López, & Abu Mallouh, 2013). Furthermore, event-related potential (ERP) data recorded continuously while subjects performed a masked same–different matching task with visually similar (e.g., ط ـط ) and visually dissimilar (e.g., ع ـعـ) allographs clearly show an early ERP (P/N150) associated with visual form similarity, and a later ERP component (P300) related to abstract letter representations. Specifically, allographs like ـعـ-ع showed clear electrophysiological response differences early on in processing, while brain responses later in processing were modulated by abstract letter representations such that ـعـ-ع were perceived as equally similar as ـط-ط (Carreiras, Perea, Gil-López, Abu Mallouh, & Salillas, 2013).

Phonetic letter similarity

The ability to quantify the phonetic similarity between words is important in many fields, including computational linguistics, dialectometry, applied linguistics, psycholinguistics, and cognitive neuroscience. The literature provides a number of methods for measuring the degree of phonetic similarity between segments. Some of these are based on experimental studies showing, for instance, the degree of confusability of different segments (Klatt, 1968; Greenberg & Jenkins, 1964; Mohr & Wang, 1968). Others are based on more theoretical arguments (Austin, 1957). Others still have opted for quantifying the degree of similarity between segments by counting the number of differences in their specifications in terms of phonetic/phonological features (Ladefoged, 1970). Here we opted for the use of phonetic features to quantify the amount of similarity/difference among the various Arabic letter sounds. Our choice is predicated on recent reports in the literature suggesting that similarity between component speech sounds is much better captured by theoretically driven measures based on phonetic/phonological features than empirically derived measures based on confusability (Bailey & Hahn, 2005; Hahn & Bailey, 2005). Accordingly, we focused on providing a similarity metric that simultaneously compares consonants and vowels using 16 features from phonological theory. Specifically, these consist of a first set of three Major Class features that define the major classes of sounds in the language into consonantal, sonorant, and approximant. A second set consists of seven Place of Articulation features, namely, labial, coronal, dorsal, pharyngeal, anterior, distributed, and high, serving to define the specific articulator involved in producing the sound. A third set of four features, continuous, lateral, nasal, and strident, pertains to the manner in which the letter sound is produced. Finally, a fourth set consists of one Laryngeal feature, voicing, that distinguishes voiced from voiceless segments, and a fifth set comprises a Quantity feature, categorizing segments as long and short. The full matrix of features for the 28 consonants and 6 vowels of the language is accessible here: https://osf.io/mx5t7/.

Using these features, each letter was then converted into a vector consisting of 16 elements of 0s and 1s (0 if the feature did not apply to the letter and 1 if it did). We then performed the same hierarchical clustering procedure on these vectors as before in order to determine the similarity structure underlying them (see Fig. 2).
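The procedure just described can be illustrated with a minimal sketch. The feature values below are hypothetical toy data for a handful of sounds, not the values in the OSF matrix, and the use of a count of mismatching features as the distance is an assumption made for illustration; the text states only that binary vectors were submitted to hierarchical clustering with the nearest neighbor method.

```python
from itertools import combinations

# Hypothetical toy feature vectors (NOT the values from the OSF matrix).
# Columns: consonantal, sonorant, labial, coronal, voiced, nasal
TOY_FEATURES = {
    "b": (1, 0, 1, 0, 1, 0),
    "m": (1, 1, 1, 0, 1, 1),
    "t": (1, 0, 0, 1, 0, 0),
    "d": (1, 0, 0, 1, 1, 0),
    "n": (1, 1, 0, 1, 1, 1),
}

def hamming(u, v):
    """Number of features on which two binary vectors disagree."""
    return sum(a != b for a, b in zip(u, v))

def single_linkage(features):
    """Agglomerative clustering with the nearest-neighbor (single-linkage) rule.

    Returns the merge history as (members_a, members_b, distance) tuples.
    """
    clusters = [frozenset([name]) for name in sorted(features)]
    history = []
    while len(clusters) > 1:
        # Distance between two clusters = smallest distance between any two members.
        def link(pair):
            a, b = pair
            return min(hamming(features[x], features[y]) for x in a for y in b)
        a, b = min(combinations(clusters, 2), key=link)
        history.append((sorted(a), sorted(b), link((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return history

history = single_linkage(TOY_FEATURES)
for left, right, dist in history:
    print(left, "+", right, "at distance", dist)
```

The merge distances in the history correspond to the heights at which branches join in a dendrogram such as the one in Fig. 2.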

Fig. 2

Hierarchical clustering (dendrogram) using the nearest neighbor method. The vertical axis of the dendrogram represents the distance or dissimilarity between clusters. The horizontal axis represents the 34 Arabic sounds

Visual inspection of the dendrogram in Fig. 2 suggests that there are seven distinct phonetic sound clusters, with the number of letter sounds per cluster ranging from two to eight. However, to determine more objectively the optimal number of groups into which the 34 letter sounds cluster, we used the gap statistic as before. The results of this analysis suggest that the optimal number of clusters is five, with a maximal gap value of 0.23. The sizes of these clusters, displayed in Table 5, range from 5 to 10 members.
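For readers unfamiliar with the gap statistic, the following is a rough sketch of the idea: for each candidate number of clusters k, the log of the pooled within-cluster sum of squares of the data is compared with its average over reference data sets that have no cluster structure. Everything specific here is an assumption for illustration: the small k-means routine stands in for whatever clustering routine was used, the reference sets are uniform draws over the bounding box of the data, and the toy data are two artificial blobs rather than the phonetic feature vectors.

```python
import math
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def within_ss(points, labels, k):
    """Pooled within-cluster sum of squared distances to each cluster centroid."""
    total = 0.0
    for c in range(k):
        members = [p for p, l in zip(points, labels) if l == c]
        if not members:
            continue
        centroid = [sum(col) / len(members) for col in zip(*members)]
        total += sum(sq_dist(p, centroid) for p in members)
    return total

def kmeans(points, k, iters=20):
    """Small deterministic k-means (farthest-point start); a stand-in clustering routine."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def gap_statistic(points, max_k, n_refs=20, seed=0):
    """Gap(k) = mean over reference sets of log(W*_k) minus log(W_k)."""
    rng = random.Random(seed)
    lows = [min(col) for col in zip(*points)]
    highs = [max(col) for col in zip(*points)]
    gaps = {}
    for k in range(1, max_k + 1):
        log_w = math.log(within_ss(points, kmeans(points, k), k))
        ref_logs = []
        for _ in range(n_refs):
            # Reference set: uniform draws over the bounding box of the data.
            ref = [[rng.uniform(lo, hi) for lo, hi in zip(lows, highs)] for _ in points]
            ref_logs.append(math.log(within_ss(ref, kmeans(ref, k), k)))
        gaps[k] = sum(ref_logs) / n_refs - log_w
    return gaps

# Artificial toy data: two well-separated blobs of 10 points each.
toy_rng = random.Random(1)
blob_a = [(toy_rng.uniform(-0.5, 0.5), toy_rng.uniform(-0.5, 0.5)) for _ in range(10)]
blob_b = [(toy_rng.uniform(9.5, 10.5), toy_rng.uniform(9.5, 10.5)) for _ in range(10)]
gaps = gap_statistic(blob_a + blob_b, max_k=3)
best_k = max(gaps, key=gaps.get)  # k with the largest gap value
```

Selecting the k with the largest gap is one common reading of the "maximal value" reported in the text; other selection rules that take the variability of the reference sets into account also exist.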

Table 5 Optimal number of clusters based on phonetic letter similarities as suggested by the gap method, the members of each class, and its within-cluster sum of squares

Interestingly, the different clusters make intuitive sense. For instance, the members of Cluster 1 are all back fricative consonants except for the voiceless glottal stop أ /ʔ/, which is part of this cluster because it shares many features with the voiceless glottal fricative هـ /h/, which in turn naturally clusters with the back fricatives غ ح خ ع /ʕ x ħ γ/. Similarly, the members of Cluster 2 are all bilabial consonants except for the palatal approximant ي /y/ arguably added to this cluster due to its similarity to the bilabial approximant و /w/, which shares the place feature of bilabial with all the other members of the cluster. The largest cluster, Cluster 3 with 10 members, consists of consonants that are all non-back consonants with places of articulation starting with the ج /j/ at the palate and progressing anteriorly to the dental area with the ذ /ð/ and ث /θ/ sounds. Cluster 4 includes seven sounds, all emphatic. In the environment of such sounds, the low front vowel phoneme /æ/ of the language is standardly pronounced as a low back vowel /a/, which is the typical manifestation of phonetic emphasis in Arabic. The only non-emphatic sound in this cluster is the velar ك /k/, arguably added to this cluster by virtue of sharing the features back, voiceless, and plosive with the sound ق /q/. Finally, Cluster 5 includes the six vowels of the language.

It is interesting to note that the within-cluster SS is 8.59 on average, while the total SS and between-cluster SS stand at 90.6 and 47.6, respectively, suggesting a high degree of consistency within the component members of each cluster. Furthermore, our theoretically driven measure of similarity based on phonetic features is in agreement with empirically derived measures based on confusability as shown by hidden Markov recognition systems. For instance, Maaly, Elobeid, and Ahmed (2002) reported that the sounds /ħ/ and /ʔ/ are highly confusable and that their automatic Arabic phoneme recognizer failed to distinguish between them. It is also consistent with the phonological neutralization processes at play in many Arabic dialects. For instance, in the Egyptian dialect spoken in Cairo, the interdental voiceless fricative ث /θ/ is typically realized as ت /t/ (e.g., ثمن /θæmæn/ “price” pronounced تمن /tæmæn/) or س /s/ (e.g., ثانية /θaanyæ/ “second” pronounced سانية /saanyæ/). These phonemes /θ, t, s/ are members of Cluster 3. Analogously, phonological speech errors made by children learning Arabic (e.g., قلبي /qalbi/ “my heart” pronounced as كلبي /kalbi/ “my dog”) also seem to target phonemes that are members of the same clusters (Dyson & Amayreh, 2000).
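The sums of squares reported above are related by the standard decomposition in which the total SS equals the within-cluster SS plus the between-cluster SS. The sketch below, which uses a hypothetical helper name and artificial toy points, shows how the three quantities can be computed and then checks the reported figures: (90.6 - 47.6) / 5 = 8.6, which agrees with the reported average of 8.59 up to rounding.

```python
def ss_decomposition(points, labels):
    """Return (total_ss, within_ss, between_ss) for labeled points."""
    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    def ss(rows, center):
        return sum(sum((x - c) ** 2 for x, c in zip(r, center)) for r in rows)

    grand = centroid(points)
    total = ss(points, grand)
    within = between = 0.0
    for lab in set(labels):
        members = [p for p, l in zip(points, labels) if l == lab]
        center = centroid(members)
        within += ss(members, center)
        # Each cluster contributes its size times the squared distance
        # between its centroid and the grand centroid.
        between += len(members) * sum((c - g) ** 2 for c, g in zip(center, grand))
    return total, within, between

# Artificial toy example: two clusters of two points each.
toy_total, toy_within, toy_between = ss_decomposition(
    [(0, 0), (0, 2), (4, 0), (4, 2)], [0, 0, 1, 1]
)

# Consistency check on the figures reported in the text (five clusters).
reported_total, reported_between, n_clusters = 90.6, 47.6, 5
mean_within = (reported_total - reported_between) / n_clusters
```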

Finally, it is important to note that as far as we know, there are no phonetic confusion tables for Arabic like those available for English (e.g., Luce, 1986; Shattuck-Hufnagel & Klatt, 1979; Wickelgren, 1966). Interestingly, however, Bailey and Hahn (2005) have forcefully argued that measures of similarity based on theoretically motivated phonetic features, as we have applied here, are superior to similarity measures based on confusability from speech perception, speech production, and short-term memory. Therefore, we feel confident that the current phonetic similarity matrix can serve as the basis for further explorations either within a language (Kishon-Rabin & Rosenhouse, 2000) or across languages (Boudelaa, 2018; Khattab, 2002).

Motoric letter similarity

Our ability to generate similar shapes with different limbs or execution modes suggests the existence of a relatively abstract, effector-independent level of representation that specifies the forms of letters (Keele, 1981; Rapp & Caramazza, 1997). If this is so, then language users must somehow develop a motoric scheme that represents information about the characteristics of the strokes required to write down a given allograph. Research into the written spelling performance of patients with dysgraphia strongly supports the involvement of multiple representational types, including a relatively abstract, effector-independent representational level that specifies the features of the component strokes of letters (Rapp & Caramazza, 1997). Specifically, individuals with dysgraphia seem to make well-formed letter substitution errors in written spelling, such as writing “F-A-P-L-E” for TABLE, while correctly spelling the target word as [ti, ei, bi, el, i]. Similarly, neuroimaging research suggests that the motoric features of letters activate significant portions of the brain in the left intraparietal sulcus and in areas previously associated with spelling processes (Rothlein & Rapp, 2014).

Given the importance of understanding the content of motor plans used to execute letter writing, we sought to develop a motoric letter similarity matrix for Arabic letters and their allographs based on 26 stroke features we established to be necessary to uniquely identify each of 100 letter allographs of Arabic (see Footnote 1). We used 10 generic features to capture the visuospatial characteristics of each allograph in terms of a set of strokes. Accordingly, for each letter allograph, we specified the number of strokes (1 to 5) required to create it and the shape of those strokes (i.e., line, curve). When the stroke was a line, we specified its shape as downward- or upward-directed and its orientation, horizontal or vertical. When the stroke was a curve, we defined its shape (clockwise or anticlockwise). We also included the number and position of the dots as well as the overall shape of the allograph and the number of angles it contained. Finally, we determined whether the allograph’s main part was above or below the line and whether its overall shape was a half or full loop with no dots. The combination of these features allowed us to quantize each of the 100 letter allographs into a 26-element vector that captured the motor scheme necessary to create it. These vectors, accessible at https://osf.io/v2gb7/, were then submitted to a hierarchical clustering analysis with a view to determining the similarity structure underlying the motor plans of the different allographs. The dendrogram in Fig. 3 displays the clusters yielded by the nearest-neighbor method.
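As an illustration of how a stroke description might be turned into a binary vector, consider the following sketch. The field names, the subset of features, and the one-hot coding of counts are all assumptions made for illustration; the actual 26-element vectors are those available at the OSF link above, and this example produces a shorter, 13-element vector.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Allograph:
    """Hypothetical stroke description; field names are illustrative only."""
    n_strokes: int         # 1 to 5, as in the text
    n_dots: int            # 0 to 3
    dots_above: bool       # dot position (ignored when n_dots == 0)
    has_curve: bool
    curve_clockwise: bool  # ignored when has_curve is False
    has_closed_loop: bool

def encode(a):
    """One-hot encode the counts and append binary flags, giving a tuple of 0s and 1s."""
    strokes = [int(a.n_strokes == s) for s in range(1, 6)]
    dots = [int(a.n_dots == d) for d in range(0, 4)]
    flags = [
        int(a.dots_above and a.n_dots > 0),
        int(a.has_curve),
        int(a.has_curve and a.curve_clockwise),
        int(a.has_closed_loop),
    ]
    return tuple(strokes + dots + flags)

# A made-up allograph: two strokes, one dot above, an anticlockwise curve, no loop.
example = Allograph(n_strokes=2, n_dots=1, dots_above=True,
                    has_curve=True, curve_clockwise=False, has_closed_loop=False)
vector = encode(example)
```

Vectors of this kind can then be compared feature by feature, so that allographs sharing more stroke properties end up closer together in the clustering.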

Fig. 3

Dendrogram of 100 Arabic letter allographs based on the motor scheme needed to produce them in writing

The gap statistic suggests that the data optimally cluster into 12 groups, with a maximal gap value of 0.40. The average within-cluster SS is 16.46, while the total SS is 418.62 and the between-cluster SS is 221.07, thus suggesting a high degree of consistency within clusters. Table 6 displays the members of each cluster along with the associated within-cluster SS.

Table 6 Optimal number of clusters based on motoric letter similarity as suggested by the gap method, the members of each class, and within-cluster sum of squares

According to Table 6, a number of motoric features seem to underlie the way in which the 100 Arabic allographs used here cluster. Specifically, these are the presence and to some extent the number and position of the dots, as well as the presence and shape of a loop. Thus, for instance, the six members of Cluster 12, ق قـ ـقـ ـق ي ـي share two dots, and four of them exhibit a clockwise downward-directed loop. Similarly, the seven members of Cluster 10, ض ضـ ـضــض ـ ـنـ ـن feature a single dot above the allograph, while those of Cluster 5 ـثـ ـث ش شـ ـشـ ـش share the three dots above the allograph itself. The importance of the presence and number of dots in this context is that they define whether the abstract motoric program required to write down a letter allograph can be completed with or without lifting the pen: When a dot is present, the letter allograph cannot be written without lifting the pen. Another dimension of similarity arising from Cluster 1, ـا بـ جـ حـ ـحـ خـ ـخـ د ذ ك كـ لـ نـ يـ ـيـ, is the presence of an angle, which can be either a right angle, as in ك لـ ـا بــ نــ يـ ــيــ, or an acute angle, as in جـ حـ ـحـ خـ كـ ـخـ د ذ. A final example is Cluster 9, صـ ـصـ ط طــ ـطـ ـط ـلـ مـ ـمـ ـم, where the presence of a closed loop in all allographs save ـلـ appears to underlie the motoric similarity of this group of allographs. One obvious reason the allograph ـلـ clusters with this group is the presence of the line segment that it shares in shape and orientation with ط طــ ـطـ ـط and in shape only with ـم.

Overall, then, there is a clear sense in which one might claim that similarity in terms of the characteristics of the strokes—number, orientation, and direction—that are required to produce the different allographs has a significant weight in the structure of each cluster. The viability of the present matrix as a measure of similarity between the motoric plans required to write each letter allograph is consistent with the performance of patients with dysgraphia as described by Nashaat, Kilany, Hasan, Helal, Gebril, and Abdelraouf (2016). Some of these patients made letter substitution errors in writing (e.g., دأيت for رأيت), where the downward-directed stroke that starts above the “discontinuous” line and ends with a straight stroke on the line --د-- substitutes for a downward-directed stroke that begins on the line and ends underneath it, -ر--. Further research is needed to examine the extent to which the motoric plan of allograph writing maps onto the neurocognitive domains of Arabic processing.

Conclusion

We present new data on the frequencies and similarities of Arabic letters and their allographs in the visual, phonetic, and motoric domains. These sets of frequencies of Arabic letters and their allographs, which are based on a 40-million-word corpus, comprise the only frequencies of letter allographs available for MSA. The visual similarity matrix is based on ratings collected from untimed responses of 125 participants to clearly presented allographic variants of the same letter. This methodology preempts serious issues likely to be inherent in matrices formed from data generated in atypical reading conditions, using, for example, speeded naming or degraded presentation conditions. Our visual similarity matrix builds on and significantly extends previous findings in the literature (e.g., Wiley et al., 2016). The phonetic similarity matrix is based on theoretically motivated major phonetic/phonological class features, an approach that has recently been demonstrated to be efficient in identifying cognitively relevant similarities while at the same time largely avoiding the spurious task-specific similarities that characterize similarity metrics based on the perception of speech in noise (Bailey & Hahn, 2005). Finally, the motoric similarity matrix is based on a set of stroke features necessary to implement each letter and its allographs. This sort of similarity matrix is not very common across languages, and the only one we know of is the motoric similarity matrix developed for English (Rapp & Caramazza, 1997). Collectively, these new data will be a valuable tool for psycholinguistic research directed toward the study of letter stimuli and the effects and time courses of their visual similarity (Boudelaa et al., 2019; Carreiras et al., 2012; Gutiérrez-Sigut, Marcet, & Perea, 2019; Perea et al., 2010). They will be equally useful in informing cognitive neuropsychological reading research (Friedmann & Haddad-Hanna, 2012; Khwaileh, Body, & Herbert, 2014; Prunet, Béland, & Idrissi, 1998). Finally, since alphabet knowledge is consistently recognized as the strongest and most durable predictor of later literacy achievement (Jones, Clark, & Reutzel, 2012), the current results have clear practical implications for developing strategies to increase the effectiveness of teaching alphabet knowledge to young MSA learners by capitalizing on the similarity structure underlying the different letter and allograph groups (Mahfoudhi, Everatt, & Elbeheri, 2011; Perea, Abu Mallouh, & Carreiras, 2013; Taha, 2013).

Authors’ note

This research was funded by two United Arab Emirates University College of Humanities and Social Sciences grants to Sami Boudelaa (G00002367 and G00003158).