The world’s writing systems contain graphs that span a wide variety of visual forms. Much of this variety is associated with variable mappings that graphic units can have to linguistic units (abjad, alphabetic, syllabary, alphasyllabary, and morphosyllabary). This mapping variety has been the focus of comparative reading research (e.g., universal grammar of reading, Perfetti, 2003; phonological grain size, Ziegler & Goswami, 2005; orthographic depth, Katz & Frost, 1992; semantic transparency, Wydell, 2012; for reviews, see Frost, 2012; Perfetti & Harris, 2013; Seidenberg, 2011). The actual forms of the graphic units have received less attention. However, the visual forms of graphs—reflecting their visual complexity and discriminability—have the potential to affect the identification of both individual graphs and graph combinations (e.g., single letters and letter combinations in alphabets and abjads, akshara in alphasyllabaries, syllables in syllabaries, and characters in morphosyllabaries; Pelli, Burns, Farell, & Moore-Page, 2006), and thus to affect learning to read.

To study the effects of graphic challenges to learning to read in a universal way, free of biases based on a particular writing system, it is important to have a measure of graphic complexity that is sensitive to the variety of devices used in writing. We report here such a measure, a multidimensional measurement system for quantifying graphic complexity, GraphCom, and its application to 131 written languages. We demonstrate the value of the system in predicting similarity ratings made by speakers of different languages.

In what follows, we first discuss the broader context for descriptions of graphic units, delineating the units that are the object of our study and reviewing previously developed measures of graphic complexity (i.e., perimetric complexity: Pelli et al., 2006; Watson, 2012) and considering perceptual principles in human cognition. We then present the rationale and descriptions of our new multi-dimensional measure, the results of applying the measure to 131 languages, and the performance of the measure in predicting visual similarity judgments.

Graphic units: Graphs and graphemes

It is common in alphabetic reading research to refer to a grapheme as the basic unit of writing—in particular, one or more letters that map onto a single phoneme. Such a definition lacks universality (e.g., Chinese characters do not map to phonemes) and departs from the logic of linguistic descriptions. A definition of grapheme that conforms to linguistic analysis by being parallel to descriptions of phoneme and morpheme is this: a grapheme is a functional unit of writing that abstracts over variations in graphs—allographs; for instance, all the fonts for the letter b that exist in a given language. The unit is functional in that the grapheme is the minimal graphic unit distinguishing two written morphemes, thus analogous to the phoneme, which distinguishes two spoken morphemes. For example, in English all letters are graphemes as well as graphs, because all letters distinguish among written English morphemes. According to this definition, the functional role of graphemes does not depend on mapping to phonemes, as attested by the contrast between homophonic morphemes such as buy/bye and reel/real. This technical definition of grapheme also includes nonletter graphemes such as the apostrophe, which distinguishes teacher’s from teachers. What counts as a grapheme is language-dependent even within a writing system. Thus, a capital letter and a lowercase letter seem to be allographs of a single grapheme in English, but probably not in German, where capitalization distinguishes between grammatically derived morphemes (Wissen as noun vs. wissen as verb). This sense of a grapheme as a distinguisher of written morphemes is more systematic and universal than the commonly used definition in the English-language research literature. Thus, it applies to Chinese as well, where a character is a grapheme as well as a morpheme and distinguishes between multimorpheme words.

For purposes of measuring graphic complexity, our view is that the common psychological use of “grapheme,” which originated in alphabetic research, is too narrow. However, the more universal linguistic definition requires a detailed morphological analysis of each written language, a goal that is beyond the scope of our research. These issues of the definition of grapheme have led us to focus instead on the minimal unit of the graph, a written form that can be combined with other graphs to form graphemes (in any sense of grapheme). These graphs are readily recognized by literate users of a language as basic writing units—several thousand characters in Chinese, 26 letters in English, 33 letters in Russian, and 46 kana in the Japanese syllabaries—that are combined to produce written language, whatever their mapping. Most important is that a metric based on writing graphs (rather than graphemes) can be applied to any written language according to the goals of a researcher. For an English example, the complexity measure applied to the letters S and H separately can also be applied to the combination SH if one wants to measure “grapheme” complexity.

Writing graphs, as cultural products, are different from other visual categories

Every writing graph (henceforth, simply “graph”) is a basic, two-dimensional visual form that participates alone or in combination in coding a linguistic unit (e.g., phoneme, syllable, or morpheme). In the information they convey, these graphic forms are different from other visual categories such as natural scenes, objects, and faces, but similar to line drawings (Changizi, Zhang, & Shimojo, 2006). Scenes carry more complex information about color, texture, shading, illumination, and occlusion (Sayim & Cavanagh, 2011). Objects, similar to scenes, provide more information about three-dimensional space, depth, and texture than line drawings do. Faces, although they are composed of fewer elements than scenes and objects, are still more complex than line drawings because faces are usually seen from many viewpoints. In contrast, line drawings are simple. Because their complexity varies along fewer visual dimensions, indices that are useful for natural visual categories—for example, entropy for information content, Fourier analysis for spatial-frequency content, and JPEG compression for image size (see Chikhman, Bondarko, Danilova, Goluzina, & Shelepin, 2012)—are not applicable.

Although graphs and line drawings share the general properties of two-dimensional simplicity, graphs become differentiated from line drawings with the earliest emergence of literacy contexts. Letters, the graphs used in alphabetic systems, for example, are differentiated from line drawings by children by the age of three (Levin & Bus, 2003; Robins & Treiman, 2009). English-speaking preschoolers 3–5 years of age who are preliterate have some understanding that a written word represents a specific spoken word, differing in this way from a drawing (Treiman, Hompluem, Gordon, Decker, & Markson, 2016). Moreover, these children are sensitive to the visual spatial layout of their own writing system, as compared to foreign writing systems (Treiman, Mulqueeny, & Kessler, 2014). When they are asked what writing is, these young children are more likely to choose sets of graphs from their own language (i.e., English) as instances of writing than graphs in other languages (e.g., Chinese characters; Lavine, 1977). These observations point to a categorical importance of graphs, as they become functionally distinct perceptual objects in learning to read.

Previous measures of graphic complexity

One well-defined and well-attested dimension for quantifying graphs is perimetric complexity (Pelli, Burns, Farell, & Moore-Page, 2006): the ratio of the square of the sum of the inside and outside perimeters to the product of 4π and the area of the foreground (Pelli et al., 2006; Watson, 2012; see Tables 1 and 2 for examples, and the Method section for the algebraic expression). More informally, perimetric complexity captures the density of the written marks (“black ink”) relative to the background space in which they are located. Perimetric complexity has some valuable characteristics. First, it is objective, quantitative, and size-invariant. Thus, its values are not affected by font size. Second, it is empirically tested and correlates well with subjective measures, such as pattern goodness and information load (for a discussion, see Jiang, Shim, & Makovski, 2008). Third, it is computerized, and the algorithm can be used for binary-code (black-and-white) images (Watson, 2012), making it a tool that is general across visual categories.

Table 1 Comparison of two graphs in terms of their visual complexity
Table 2 Five graphs with complexity values using GraphCom, the measurement system with four dimensions

Pelli et al. (2006) applied perimetric complexity to a range of graphs and demonstrated that perimetric complexity is inversely proportional to graph identification efficiency. Specifically, they sampled graphs across a wide range of written languages (i.e., Arabic, Armenian, Chinese, Devanagari, English, and Hebrew) and different fonts (e.g., Bookman, Courier, Helvetica, Künstler, and Sloan). They asked participants (ranging from 3 to 68 years of age) to look at a briefly displayed graph and then to identify it from a list of graphs in the given language. Graphic complexity was negatively correlated with human identification efficiency. Given the reliability and validity of perimetric complexity, it became a useful measure for controlling the complexity of stimuli in studies on learning to read (e.g., Liu, Chen, & Wang, 2016; Wang, McBride-Chang, & Chan, 2014; Yin & McBride, 2015).

Research on the relation between visual complexity and learning to read across writing systems suggests that learning to read a more visually complex first language (L1) may require stronger visual skills and may, in turn, strengthen such skills. In particular, a series of studies by Nag and colleagues provided evidence that the visual skills required for reading Indian languages tend to be relatively high as compared with alphabetic languages (e.g., Nag, 2008; Nag & Snowling, 2011; Nag, Snowling, Quinlan, & Hulme, 2014; Nag, Treiman, & Snowling, 2010). These high demands come from the large number of graphs in these “extensive” written languages (Nag, 2007, 2014) and impose a strong influence on the pace of learning to read (Nag, Caravolas, & Snowling, 2011). It is possible that meeting the higher learning demands imposed by visually complex writing systems leads to improved visual skills: in a cross-writing-system study (McBride-Chang et al., 2011), children learning to read traditional Chinese outperformed age-matched kindergarteners who were learning to read less complex languages (Hebrew and Spanish) in a visual–spatial processing task. Similarly, in a comparison of 8- to 14-year-old readers of Chinese and Greek, controlling for reading experience, Chinese readers of all ages outperformed their age-matched Greek counterparts on visual–spatial processing (Demetriou et al., 2005). Collectively, these findings underscore the role of graphs’ visual complexity in learning to read across writing systems.

Complexity characteristics in different writing systems

Not all characteristics of graphs found to be important in reading research are captured by perimetric complexity. There are many examples of two graphs that share the same perimetric complexity value while differing substantially in other ways. For instance, in Table 1, perimetric complexity quantifies both the graph <w> (an English letter) and the graph <> (a Thai letter) as 13; however, the two graphs have salient visual differences in their numbers of disconnected components (<w> has one component, and <> has two) and thus also in their numbers of connected points: <w> has three connected points, each formed by two lines, and <> has two, each formed by one circle and one line.

Variation in disconnected components is typical of alphasyllabaries, and variation in the number of connected points is typical of alphabets. The connected points in letters of the Roman alphabet used for English (e.g., the line terminations in <R>) are among the features most critical for letter identification (Fiset et al., 2008). In alphasyllabaries, letters featuring disjointed components (e.g., the Thai letter <>) are highly associated with visual confusion in early literacy (Winskel, 2010). However, it is unclear whether the number of connected points, an important factor in the recognition of alphabetic writing, also affects letter identification in an alphasyllabary; similarly, we do not know whether the number of disconnected components, a salient measure in an alphasyllabary, plays a role in early alphabetic literacy.

In the morphosyllabic system for Chinese languages, the number of strokes (usually defined as a single continuous movement of the pen) has long been used as a complexity index with demonstrated psychological reality. For instance, Su and Samuels (2010) reported that, in a Chinese character recognition task, response latencies to characters increased with the number of strokes for Chinese-speaking second-graders. In a study of Japanese kanji, Tamaoka and Kiyama (2013) found that both lexical decision times and naming depended on the number of strokes as well as on kanji frequency. Chinese character reading studies have also examined (and experimentally controlled) the number of strokes in both simplified (Wu, Zhou, & Shu, 1999) and traditional (Y. P. Chen, Allport, & Marshall, 1996) Chinese. Although all writing varies in the number of strokes, this measure has not been applied to writing systems other than Chinese.

To examine complexity characteristics of graphs in different written languages and to examine and compare reading and writing across writing systems, a general, multidimensional measure that can apply to all writing systems is needed.

Gestalt principles for perceptual organization of graphs

Some of the features highlighted in research (i.e., the numbers of connected points, disconnected components, and strokes) seem to echo principles of the perceptual organization of relations among visual components (proximity, symmetry, convexity, closure, connectedness, and continuation) that were emphasized in Gestalt theory (Koffka, 1935/1963). These principles were proposed as a partial answer to the question of how individual elements group into parts that then group into the larger perceptual object that is separated from other perceptual objects (Ehrenstein, 2008; Spillmann & Ehrenstein, 2004). For example, continuation affords clues to the relationship between simple features (Biederman, 1987), and connectedness is sensitive to information regarding continuity (Lanthier, Risko, Stolz, & Besner, 2009). In contrast, discontinuity highlights relations between more complex features.

An emphasis on continuity and discontinuity echoes the criteria for a well-designed written language suggested by Watt (1983, 1994; see also Treiman & Kessler, 2011). Watt argued that the shapes in such a written language should be (1) similar, or have a degree of homogeneity; (2) contrasting, or distinguishable from one another; (3) economical, or easy to perceive and produce; (4) redundant; (5) attractive; and (6) expressive. The systematicity of graph shapes was also emphasized by Treiman and Kessler (2014), who observed that, across writing systems, there is a tendency for graphs to look similar. This similarity may reflect basic principles of learning, one of which is that learners abstract patterns that hold across a set of graphs and use these patterns to supplement their memory for individual graphs.

Consideration of the different ways in which graphs vary across writing systems led us to develop a new measure that uses these different complexity-related variations while also building on perimetric complexity. This measure, GraphCom, consists of four dimensions: perimetric complexity, number of disconnected components, number of connected points, and number of simple features (strokes). We applied this visual complexity measure to a large number of written languages, representing all five of the major writing systems. To the best of our knowledge, this is the first attempt to apply a multidimensional complexity measure to quantify such a large number of graphs, in order to provide a valid tool for studying the visual forms of those graphs.

The graph complexity measure, GraphCom

GraphCom includes four dimensions of graph measurement. The three dimensions added to perimetric complexity are quantified in terms of the following basic units: A simple feature, following Pelli et al.’s (2006) definition, is a discrete element of an image that can be discriminated independently from other features. For example, <T> has two simple features, a vertical segment and a horizontal segment. A connected point (or junction) is an adjoining of at least two features. For example, <T> has one connected point (the junction of the horizontal line and the vertical line), and <F> has two (the junctions of one vertical line with two horizontal lines). A disconnected component is a simple feature that is not linked to other features in a set. For example, <i> has two disconnected components (the dot and the vertical line), and <> has two (the horizontal line on the top and the integral-shaped component at the bottom). Given these basic definitions, we can describe our four dimensions:

Perimetric complexity (PC)

PC is the ratio of the squared perimeter of a graph (in pixels) to the foreground area of the graph (in pixels), scaled by 4π. Specifically, PC is \( \frac{P^2}{4\pi A} \): the square of the sum of the inside and outside perimeters of the foreground (P), divided by the foreground area (A), divided by 4π (Pelli et al., 2006; Watson, 2012). For example, if upper-case <W> has a 4,656-pixel perimeter and a 136,602-square-pixel area, its perimetric complexity is 12.6287 (= 4,656 × 4,656 / 136,602 / 4π). This dimension is sensitive to changes in luminance across space (i.e., the spatial frequency) of a graph, and its value is invariant to the size of the graph (Grainger, Rey, & Dufau, 2008).
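
As a concrete illustration, PC can be approximated directly from a binary image. The sketch below (Python with NumPy) counts foreground–background pixel edges as the perimeter; this is an illustrative approximation, not the exact algorithm of Pelli et al. (2006) or Watson (2012).

```python
import numpy as np

def perimetric_complexity(img):
    """Perimetric complexity PC = P^2 / (4*pi*A) for a binary image.

    img: 2-D array, nonzero = foreground ("ink").
    P counts foreground-background pixel edges (inside and outside
    boundaries alike); A is the foreground area in pixels.
    A simplified pixel-edge approximation of the published measure.
    """
    fg = (np.asarray(img) != 0)
    padded = np.pad(fg, 1, constant_values=False)
    perim = 0
    for axis in (0, 1):
        for d in (1, -1):
            neighbor = np.roll(padded, d, axis=axis)
            # foreground pixels whose neighbor in this direction is background
            perim += np.count_nonzero(padded & ~neighbor)
    area = fg.sum()
    return perim ** 2 / (4 * np.pi * area)
```

For a solid square this yields 4/π regardless of the square's size, illustrating the size invariance noted above.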

Number of disconnected components (DC)

A DC is a simple feature, or a group of connected features, that is not linked to the other features of a graph. If a graph is composed of multiple disconnected components, there are spaces among these components; for instance, <> has four disconnected components, created by the spaces among the circle and the three dots. This dimension is sensitive to discontinuity information (Gibson, 1969).
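
The DC count can be computed mechanically by connected-component labeling of a binary image. The following sketch uses a plain flood fill; the choice of 8-connectivity (diagonal contact joins features) is an assumption for illustration, not a specification from the text.

```python
import numpy as np

def count_disconnected_components(img):
    """Count disconnected components in a binary graph image via
    flood fill with 8-connectivity."""
    fg = (np.asarray(img) != 0)
    seen = np.zeros_like(fg, dtype=bool)
    h, w = fg.shape
    n = 0
    for r in range(h):
        for c in range(w):
            if fg[r, c] and not seen[r, c]:
                n += 1                      # found a new component
                stack = [(r, c)]
                seen[r, c] = True
                while stack:                # flood-fill the component
                    y, x = stack.pop()
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < h and 0 <= nx < w
                                    and fg[ny, nx] and not seen[ny, nx]):
                                seen[ny, nx] = True
                                stack.append((ny, nx))
    return n
```

Applied to an <i>-like image (a dot above a vertical line), this returns 2.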

Number of connected points (CP)

A CP is a point of contact between features. This dimension is sensitive to information regarding continuity (Lanthier et al., 2009) and provides clues to the relations between simple features (Biederman, 1987), in contrast to the DC dimension. Note that CP is not simply the inverse of DC; for instance, the Vai syllables <> and <> have the same number of disconnected components (three), but the number of connected points of <> is four (for the diamond), whereas the number of connected points of <> is zero.
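
For illustration, connected points can be approximated on a one-pixel-wide skeleton of a graph by flagging pixels with three or more stroke neighbors. The 4-neighbor junction test below is a hypothetical stand-in for the manual CP counting described in the text, and will miss cases a human coder would judge differently.

```python
import numpy as np

def count_connected_points(skel):
    """Approximate CP count on a one-pixel-wide 'skeleton' image:
    a junction is taken to be a stroke pixel with three or more
    4-neighbors that are also stroke pixels (a heuristic, not the
    manual counting procedure in the text)."""
    s = np.pad((np.asarray(skel) != 0).astype(int), 1)
    # number of stroke pixels among the up/down/left/right neighbors
    nb = (np.roll(s, 1, 0) + np.roll(s, -1, 0)
          + np.roll(s, 1, 1) + np.roll(s, -1, 1))
    return int(np.count_nonzero((s == 1) & (nb >= 3)))
```

On a skeletal <T> (a bar meeting a stem) this finds the single T-junction, matching the count given for <T> above, while an L-shaped corner is correctly not counted as a junction.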

Number of simple features (SF)

An SF is a discrete element that can be discriminated from others (Pelli et al., 2006); a typical example is a stroke within a Chinese character (Wu, Zhou, & Shu, 1999). Other examples of simple features include a line, a dot, a circle, and a curved line. To keep the measure size-invariant, length, width, and thickness are not considered properties of features. This dimension is sensitive to the extent to which a graph combines simple features.

Collectively, these four dimensions provide objective, quantitative, and size-invariant estimations of graphic complexity. Table 2 shows how these four dimensions of GraphCom capture different characteristics of five example graphs.

Method

The written languages

For the application of GraphCom to actual writing, we selected 131 written languages to represent five writing systems (Footnote 1): alphabet, 60; abjad, 16; alphasyllabary, 41; syllabary, 11; morphosyllabary, 3. We used languages examined in previous cross-writing-system (Changizi & Shimojo, 2005), cross-alphabet (Seymour, Aro, & Erskine, 2003), and cross-script (traditional vs. simplified Chinese; H. C. Chen, Chang, Chiou, Sung, & Chang, 2011) studies. To identify the inventory of graphs and writing system categories for these languages, we followed Changizi and Shimojo (2005), who used Ager’s Omniglot: A Guide to Writing Systems (Ager, 1998). For the three languages on which Omniglot offers no information, we consulted other sources: H. C. Chen et al. (2011) for the two major scripts of Chinese (i.e., traditional and simplified Chinese), and an official list of 1,006 Japanese kanji by school year (Ministry of Education in Japan, 2015). Finally, for purposes of the complexity measure, we used only the forms of isolated graphs. For most written languages (ignoring handwriting), this is of no consequence. However, in some, especially the akshara of alphasyllabaries, graphs can change shape when they are combined in actual writing: Vowel graphs are reduced to diacritics when conjoined with consonants. These variations, which are important in actual writing, are not captured in our analyses, which define the graphs of every language in their canonical forms.

Graphic complexity quantification

We generated images of each of the 21,550 graphs using the Processing software (www.processing.org; Reas & Fry, 2010). Graphs were presented in black Arial font against a 500 × 500 pixel white background. In all, 25% of the selected languages are not supported by the Arial font; for these, an alternative font similar to Arial was adopted. Appendix Table 10 summarizes the detailed information about these 131 written languages. Measures on the four dimensions of GraphCom were then applied to each of these 21,550 images.

Results

Complexity variation along individual dimensions

We describe the complexity of a graph as a set of values along the four dimensions of GraphCom. Figure 1 shows the complexity variations across writing systems as boxplots for each of the four dimensions: perimetric complexity (PC), number of disconnected components (DC), number of connected points (CP), and number of simple features (SF).

Fig. 1

Boxplots comparing graphic complexity across writing systems, for each dimension. The boxplots indicate, for each writing system (coded by color), the range of written languages (in terms of Quartile 1, Quartile 2, Quartile 3, and outliers)

To assess the relationships among these dimensions, we correlated the complexity values on each of the four dimensions across the five writing systems, as well as separately for each writing system. Table 3 summarizes the overall correlations, collapsed across writing systems: All correlations are greater than .82 (all ps < .001), except for the r = .65 correlation of DC (the number of disconnected components) with CP (the number of connected points). Perimetric complexity shows high correlations with the other dimensions, although a lower correlation with DC, reflecting PC’s ability to capture indirectly much of what the other dimensions target specifically. However, the measure with the greatest shared variance is the number of simple features, the building blocks of the graphs. Finally, the correlations show that the number of disconnected components (DC) is the most distinctive measure, sharing no more than 67% of its variance with the other measures, and only 42% with the number of connected points. Notably, not all writing systems showed the same pattern of correlations among the dimensions. These specific writing-system differences are discussed in Chang (2015).
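
The shared-variance figures quoted here are squared Pearson correlations (e.g., r = .65 gives .65² ≈ 42%; r = .82 gives ≈ 67%). A minimal sketch:

```python
import numpy as np

def shared_variance(dim_a, dim_b):
    """Pearson correlation between two complexity dimensions and the
    proportion of variance they share (r squared)."""
    r = np.corrcoef(dim_a, dim_b)[0, 1]
    return r, r * r
```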

Table 3 Correlations of graphic complexity across writing systems

Dimensions differentiate writing system pairs

Next, we determined which dimension best differentiates among the writing systems. If different dimensions play a role in such differentiation, this would support the value of the multidimensional approach. For this analysis, we used the nonparametric Kolmogorov–Smirnov (KS) distance (Footnote 2; Stephens, 1974), one of the most commonly used distance measures for comparing two samples; in our case, the two samples correspond to two writing systems. The KS distance, which does not assume a normal distribution, is sensitive to the difference in the cumulative distribution functions of two samples, and thus is suitable for the highly nonnormal distributions of our writing systems on the various dimensions. Our five writing systems yielded ten writing system pairs. For each pair, we calculated the KS distance on each dimension (see Table 4); the dimension responsible for the greatest KS distance was taken as the one most sensitive to differences between those two writing systems.
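
The two-sample KS distance is the maximum absolute difference between the two samples' empirical cumulative distribution functions. A minimal sketch (equivalent in intent to the statistic returned by scipy.stats.ks_2samp):

```python
import numpy as np

def ks_distance(a, b):
    """Two-sample Kolmogorov-Smirnov distance: the maximum absolute
    difference between the empirical CDFs of samples a and b."""
    a, b = np.sort(a), np.sort(b)
    values = np.concatenate([a, b])
    # ECDF of each sample evaluated at every observed value
    cdf_a = np.searchsorted(a, values, side="right") / a.size
    cdf_b = np.searchsorted(b, values, side="right") / b.size
    return np.abs(cdf_a - cdf_b).max()
```

Identical samples give a distance of 0; fully separated samples give 1.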

Table 4 KS distances between two given writing systems for each dimension

Table 5 shows the complexity dimension that maximally differentiates each pair of writing systems. Thus, the alphabet and abjad writing systems are most differentiated by their numbers of disconnected components; alphasyllabaries and morphosyllabaries are most differentiated by their numbers of simple features; and so forth. The number of connected points was not a maximal differentiator for any pair of systems. Interestingly, perimetric complexity, which has been the only dimension used to compare graphic complexity across writing systems in prior research (Pelli et al., 2006), was the most reliable differentiator only for the alphasyllabary–alphabet pair. These results suggest that the most effective dimension for differentiating writing system pairs is the number of disconnected components (DC); in Table 5, DC provides maximal differentiation for six of the ten writing system pairs. More generally, the results highlight the value of the multidimensional approach. No single dimension is universally the most effective at distinguishing any two arbitrarily selected writing systems.

Table 5 Dimensions that maximally differentiate writing system pairs

Behavioral validation: Similarity ratings of graph pairs

To provide a behavioral test of GraphCom and its individual dimensions, we had participants with different first language (L1) backgrounds make similarity ratings on pairs of graphs from a single written language. We chose similarity ratings because they represent a paradigm commonly used in visual science and psychology over the past 130 years (for a review, see Mueller & Weidemann, 2012). The assumption is that two graphs that are more similar in complexity will be judged more similar than two graphs that are less similar in complexity.

Method

Stimuli

To select a language representative of its writing system, we identified a centroid for each writing system within a multidimensional complexity space (Footnote 3) defined by the four dimensions of GraphCom. A centroid is the geometric center of a set of points in a multidimensional space; in our case, the centroid of a writing system is the location of the unweighted mean of all the written languages within that writing system—that is, the average of their coordinates along the four dimensions. For each writing system, the language closest to this point was designated its centroid written language: Hebrew (abjad), Russian (alphabet), Telugu (alphasyllabary), Cree (syllabary), and Chinese (morphosyllabary). The stimuli included graphs from these five centroid written languages.
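
The centroid-language selection can be sketched as follows: average the languages' coordinates on the four dimensions and pick the language nearest that point. Whether the dimensions were standardized first is not specified in the text, so this unstandardized Euclidean version is an assumption, and the language names below are placeholders.

```python
import numpy as np

def centroid_language(names, coords):
    """Given each language's mean complexity coordinates along the four
    GraphCom dimensions (PC, DC, CP, SF), return the language closest
    (Euclidean distance) to the unweighted mean of all languages."""
    X = np.asarray(coords, dtype=float)
    center = X.mean(axis=0)               # the centroid of the writing system
    dists = np.linalg.norm(X - center, axis=1)
    return names[int(dists.argmin())]
```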

We created two categories for Chinese, because it contains thousands of characters and more than one type of graphic unit. Basic components (including radicals), which are the functional “building blocks” in the Chinese language (Shen & Ke, 2007), can stand alone as characters and are composed of a small number of strokes (average: 4.52). Compound characters, which are composed from these building blocks, have a large number of strokes (average: 13.21). Thus, we classified all characters as either basic or compound; note that these characters have the same forms in the traditional and simplified Chinese systems.

The division of Chinese into basic and compound types resulted in six groups of graphs based on the centroid written languages. In order of increasing complexity, these are Hebrew, Russian, Cree, Telugu, basic Chinese characters, and compound Chinese characters. For the similarity ratings, graphs were paired within each written language, with the graphs in each pair matched on case (upper or lower) and, where applicable, on vowel versus consonant status; all graphs in each written language (except Chinese) were used exhaustively.

We created four stimulus lists, each consisting of six groups of graphs. Each list contained 180 pairs, more than the 120 pairs that are adequate to induce meaningful similarity judgments (Simpson, Mousikou, Montoya, & Defior, 2013). Appendix Table 11 shows the graph pairs for each list; Table 6 provides further information regarding these pairs of graphs.

Table 6 Characteristics and numbers of graph pairs for each written language in similarity rating (per list; four lists in total)

Observers

A total of 180 observers participated in this experiment. All reported normal or corrected-to-normal vision. Table 7 presents demographic information about these observers. We chose observers whose first language was among the most widely spoken languages worldwide (Arabic, English, and Hindi) and for whom the graphs to be judged were not from their first language.

Table 7 Demographic information for the participating observers (n = 180 in total)

Procedure

The experiment was carried out via a large crowdsourcing platform, Amazon Mechanical Turk (MTurk). MTurk data have been demonstrated to be indistinguishable from laboratory data in different research fields (e.g., economics: Horton, Rand, & Zeckhauser, 2011; politics: Berinsky, Huber, & Lenz, 2012; social science: Buhrmester, Kwang, & Gosling, 2011; psycholinguistics: Sprouse, 2011; and psychology: Simcox & Fiez, 2014); to ensure data quality, we also followed the principles for using MTurk (Chandler, Mueller, & Paolacci, 2014) in designing our online experiment. Four human intelligence tasks (HITs) for recruiting observers from four writing systems were posted on MTurk’s online recruitment interface. Each HIT had a two-hour completion limit. Consent was obtained prior to the experiment; after MTurk volunteers agreed to participate, they were directed via a Web link to one of the four stimulus lists for similarity ratings.

The sequence of tasks was the same for each observer: a similarity rating, a language history questionnaire, a demographic background task, and a translation task (except for the English HIT) for verifying the observer’s L1 background. After completing the last task, a unique 13-digit code associated with the observer’s responses appeared on the screen automatically, along with debriefing information. The observer was instructed to report the code to MTurk to obtain monetary compensation. Successful generation of the 13-digit code also indicated that all of the observer’s responses had been successfully sent from their local machine to our server. Each task is briefly described below.

Similarity rating task

This task was designed to tap variability in observers’ judgments of visual similarity. Each trial began with a black fixation cross appearing for 300 ms, followed by a pair of graphs at the center of the screen for up to 5,000 ms, followed by a blank for 1,000 ms. The scale “1 = very different 2 = mainly different 3 = mainly similar 4 = very similar” appeared at the bottom of the screen. Observers rated how visually similar the two graphs were by pressing one of four number keys in the alphanumeric portion of the keyboard (not the numeric keypad). Once the observer had responded, the screen moved on to the next trial.

After the instructions, observers completed 12 demonstration trials with explicit statements about the degree of similarity, 12 practice trials without feedback, and 180 experimental trials; the order of graph pairs was randomized. Table 8 shows example trials from the demonstration, practice, and experimental phases. Responses and response times were recorded. The task took approximately 15 minutes to complete.

Table 8 Examples for graph pairs at different phases in the similarity rating task

Language questionnaire

The language history questionnaire (Tokowicz, Michael, & Kroll, 2004) was used to assess participants’ language-learning experience both quantitatively (e.g., ratings of general language-learning skill) and qualitatively (e.g., comments about language-learning experience). Observers were encouraged to give their best answers to the questions, with no time limit.

Demographic background questionnaire

The demographic background questionnaire was developed to learn more about observers’ educational, cultural, and health status (e.g., vision and hearing problems) and their surroundings during participation in this study. Responses to the vision and hearing questions were used to filter the data for quality. We imposed no time limit for completing this survey.

Translation task

The translation task was developed to filter the data for quality. It consisted of 20 English words chosen from the instructions for this experiment. Observers saw one word at a time and were asked to type the first L1 translation that came to mind within 12 seconds; the time limit was determined in a pilot study. Observers who failed to provide translations in a written language consistent with their reported L1 were excluded from the analysis.

Results

Each dimension played a prominent role in predicting human similarity ratings.

To test the effects of complexity on the perceptual judgments, we used a mixed-effects modeling approach, which is well suited to assessing the effects of both items (graphs) and subjects (observers) (Baayen, Davidson, & Bates, 2008). We assumed that two graphs that were more similar in complexity, as measured in GraphCom, would be judged more visually similar than two graphs that were less similar in complexity. Accordingly, we expected that a model using all four dimensions of graphic complexity would provide the best fit to the human similarity ratings. We therefore tested alternative models by means of a backward elimination procedure, which ensured that any joint predictive capability of the dimensions could be observed (Burnham & Anderson, 2003). We first tested the full model containing all four predictors (the four complexity dimensions); we then constructed a second model that removed one of the predictors, to test whether removing that predictor reduced predictive performance. If so, this was evidence that the predictor should remain in the model.

Our predictors were the absolute differences between the two graphs of each pair on each of the complexity dimensions (i.e., PC, DC, CP, and SF). We performed a series of model comparisons with Laplace estimation, using the lmer() function of the lme4 package (Bates, Maechler, & Dai, 2010) to fit the models and the likelihood ratio test (Lehmann, 1986) to determine model performance. The lme4 model formulae used to fit each model are displayed in Appendix A.

The full mixed-effects model (FULL) included fixed effects of the four predictors and crossed random effects for subjects and items. The additional four models had fixed effects for three predictors (one predictor removed for each model) and crossed random effects for subjects and items. Thus, four model comparisons were carried out. Table 9 summarizes the model tests in terms of the Akaike information criterion (AIC) and Bayesian information criterion (BIC), two common criteria for model selection, as well as the chi-square values (and associated degrees of freedom) for the likelihood ratio test. A lower AIC/BIC indicates a better-fitting model (Wasserman, 2006). As is shown in Table 9, both AIC and BIC suggested that the FULL model scored best on these criteria. Similarly, for all likelihood ratio tests, the FULL model showed significant advantages over any reduced model (p values below .001): [FULL without PC vs. FULL, χ²(8) = 372.88; FULL without DC vs. FULL, χ²(8) = 138.44; FULL without CP vs. FULL, χ²(8) = 390.08; FULL without SF vs. FULL, χ²(8) = 558.83]. These tests indicate that removing any one of the predictor dimensions made the model significantly worse in accounting for variance in the data. This suggests that each dimension played a role in accounting for observers’ judgments of similarity.
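The logic of this nested-model comparison can be sketched in Python. The sketch below is illustrative only: the original analysis used R’s lme4 with crossed random effects, whereas this minimal version uses ordinary least squares on synthetic data, with made-up coefficients standing in for the four complexity-difference predictors. It shows how the log-likelihoods of a full and a reduced model yield the AIC, BIC, and likelihood-ratio chi-square statistic.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 500
# Synthetic predictors standing in for the four absolute-difference
# dimensions (PC, DC, CP, SF); coefficients are hypothetical.
X = rng.normal(size=(n, 4))
y = X @ np.array([0.5, 0.3, 0.4, 0.6]) + rng.normal(scale=1.0, size=n)

def fit_gaussian_ols(X, y):
    """Fit OLS by maximum likelihood; return (log-likelihood, n parameters)."""
    X1 = np.column_stack([np.ones(len(y)), X])     # add intercept
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    sigma2 = resid @ resid / len(y)                # ML variance estimate
    ll = -0.5 * len(y) * (np.log(2 * np.pi * sigma2) + 1)
    return ll, X1.shape[1] + 1                     # coefficients + variance

def aic_bic(ll, k, n):
    return 2 * k - 2 * ll, k * np.log(n) - 2 * ll

ll_full, k_full = fit_gaussian_ols(X, y)
ll_red, k_red = fit_gaussian_ols(X[:, :3], y)      # drop one predictor

# Likelihood ratio test: the reduced model is nested in the full model
chi2 = 2 * (ll_full - ll_red)
df = k_full - k_red
p = stats.chi2.sf(chi2, df)
```

A lower AIC/BIC for the full model, together with a significant chi-square, mirrors the pattern reported in Table 9: dropping a genuinely predictive dimension worsens the fit.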

Table 9 A summary for four model comparisons, using dimensions to predict human similarity ratings (n = 180)

Discussion

The multidimensional measure of graphic complexity, GraphCom, is a useful tool for assessing visual complexity in any writing system. Its dimensions are grounded in basic perceptual factors—the number of simple visual features (lines, curves, and dots), the number of connected points, and discontinuities in the configural form. These dimensions are added to perimetric complexity, a proven measure that captures overall configurational complexity (Pelli et al., 2006). We applied GraphCom to 131 written languages across the world’s five major writing systems, demonstrating that this measurement system surpassed previous measures in predicting human perceptual judgments. Importantly for research, GraphCom can be applied to any of the many other written languages beyond our sample of 131.

The value of GraphCom is supported by several results. First, it produced an ordering of complexity among the 131 languages that aligns with informal observations of those languages. Thus, Chinese written in its traditional script is measured as the most complex written language, more complex than the simplified Chinese script. At the other end of the scale, abjads and alphabets show similarly low levels of complexity and are distinguished primarily by their numbers of discontinuous elements. Of course, these alignments are to be expected to some extent, because we developed the GraphCom measures to reflect properties of real writing. Thus, the ordering of the written languages is not a validation, but a demonstration that the measure produces sensible outcomes.

More interesting are the results concerning the individual dimensions. Perimetric complexity, the measure most commonly used in prior research to capture the configurational complexity of graphs, may not be suitable in all situations. When we applied each dimension to pairs of writing systems, using the nonparametric Kolmogorov–Smirnov (KS) distance measure, we found that perimetric complexity was not the best differentiator among writing systems: it was the most successful differentiator only for separating alphabetic from alphasyllabic languages. The number of disconnected components was generally the strongest differentiator of writing systems.
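The KS distance used here is the maximum vertical gap between two empirical cumulative distribution functions. A minimal Python sketch, using hypothetical per-graph scores for two writing systems (the distributions and parameters below are invented for illustration, not taken from the GraphCom database):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical per-graph counts on one dimension (e.g., disconnected
# components) for graphs sampled from two writing systems.
alphabet_scores = rng.poisson(lam=1.2, size=300)
morphosyllabary_scores = rng.poisson(lam=3.5, size=300)

# Two-sample Kolmogorov-Smirnov test: the statistic is the largest
# absolute difference between the two empirical CDFs (0 = identical
# distributions, 1 = fully separated).
result = stats.ks_2samp(alphabet_scores, morphosyllabary_scores)
```

A larger KS statistic for a given dimension means that dimension separates the two writing systems’ complexity distributions more cleanly, which is how the differentiating power of each dimension was compared.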

Also relevant are the results of a modeling study that simulated graph learning across hundreds of languages (Chang, Plaut, & Perfetti, 2016). In the learning model, each dimension of GraphCom was found to uniquely account for the training time the model needed to reach mastery. Indeed, perimetric complexity was the weakest predictor in the graph-learning simulation; the number of simple features was the strongest.

The most direct validation of the measure comes from its prediction of human perceptual similarity judgments. In fitting the perceptual judgment data to regression models, we found that all dimensions contributed to explaining the data. Removing any one dimension score from the model significantly reduced the model’s ability to predict visual similarity judgments.

We emphasize that these dimensions are not independent; indeed, they are highly intercorrelated when the data are collapsed across writing systems to allow a correlation based on 21,550 graphs. Writing systems differ in how they use the visual, graphic characteristics that are measured by the GraphCom dimensions (Chang, 2015). In alphabets, connected points (or line junctions, such as in <L>, <T>, and <Y>) are especially important in letter identification (Lanthier et al., 2009; Szwed, Cohen, Qiao, & Dehaene, 2009). This importance reflects the relatively small number of graphs needed in most alphabetic languages, which allows the reuse of a small set of simple features that can be combined at junctions to form unique graphs. In contrast, morphosyllabic writing (Chinese and Japanese kanji) requires a very large number of graphs to code syllable morphemes. As the number of graphs increases, recombining features through connected points becomes impossible; instead, additional graphs must add more simple features, which also create more connected points and discontinuous components. Overall, graphic complexity is largely driven by the number of graphs needed in a written language. Collapsed over all 131 orthographies in our study, the number of graphs is highly correlated with the GraphCom measure of written-language complexity, r = .78 (p < .001). This correlation is governed by how the written language manages the mapping of graphs to linguistic units in spoken language, because the writing system largely determines the number of graphs required (for a discussion, see Perfetti & Harris, 2013; Perfetti & Verhoeven, in press).

Although the neural basis of visual perception is beyond the scope of our study, it seems relevant to consider the relation between the properties of the graphs developed for written language and the properties of human vision. Hubel and Wiesel (1962, 1965) established that receptive fields in the cat visual system include line, curvature, and edge detectors, along with computations that estimate the numbers of these features. Primate visual systems have layered receptive fields that respond selectively to specific dimensions—for instance, V1 neurons to orientations; V2 neurons to corners; and V4 neurons to linear gratings, colors, angles, and curves—and computational abilities that operate across these layers (Van Essen, Anderson, & Felleman, 1992; for more recent work, see Coen-Cagli & Schwartz, 2013; Grill-Spector & Malach, 2004; Troncoso, Macknik, & Martinez-Conde, 2011).

Details aside, it is reasonable to suggest that the development of written graphs has become aligned with human vision capabilities, within other constraints—especially the time and effort of graph production (Changizi & Shimojo, 2005). The three dimensions GraphCom adds beyond perimetric complexity seem to align with basic detection functions (simple features) and computational capabilities (connected points, discontinuous components) of human vision. Perimetric complexity seems to capture most of these detection and computational capabilities indirectly: graphs’ simple features and their junctions contribute substantially to its measures of inside and outside perimeters. Indeed, the number of simple features and the number of connected points together account for over 88% of the variance in perimetric complexity.
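The variance-explained figure above comes from regressing perimetric complexity on the two other dimensions. A minimal Python sketch of that computation on synthetic data (the generating coefficients and Poisson parameters below are invented; only the R² procedure itself mirrors the analysis):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
# Hypothetical per-graph counts (illustrative, not the database values)
simple_features = rng.poisson(lam=4, size=n).astype(float)
connected_points = rng.poisson(lam=3, size=n).astype(float)
# Assume perimetric complexity grows with both, plus unexplained variation
perimetric = (2.0 * simple_features + 1.5 * connected_points
              + rng.normal(scale=1.5, size=n))

# Ordinary least squares with an intercept, then R-squared
X = np.column_stack([np.ones(n), simple_features, connected_points])
beta, *_ = np.linalg.lstsq(X, perimetric, rcond=None)
resid = perimetric - X @ beta
r_squared = 1 - resid @ resid / np.sum((perimetric - perimetric.mean()) ** 2)
```

An R² above .88, as reported for the actual graph data, indicates that most of what perimetric complexity measures is already carried by the simple-feature and connected-point counts.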

Finally, we note the practical value of GraphCom as a research tool. Researchers can access the dimension-specific complexity values of the 21,550 graphs from 131 written languages in the graphic complexity database, available at https://dl.dropboxusercontent.com/u/28768192/GraphemeAll/GraphDataset_131_languages.zip. The database can be used in various applications, depending on the research goals. For example, within a single language, graphic complexity measures can be applied to the graphs a child encounters in reading instruction; across languages, graphic complexity in one language can be compared with that of another. For some research aims, specific complexity dimensions can be applied to within-language and between-language comparisons; for other aims, researchers can create composite scores at the level of individual graphs, the language using them, or the writing system to which the language belongs. More generally, data at the graph, grapheme, written-language, or writing-system level can be useful for a wide range of applications, from comparative writing studies to learning to read to models of graph processing—in short, for studies that take account of visual factors in written language.

Summary and conclusion

We introduced GraphCom, a multidimensional measurement system for quantifying the visual complexity of graphs across the world’s writing systems. Starting with perimetric complexity, a well-validated single measure of complexity, GraphCom adds three dimensions that reflect the ways that graph forms differ in their composition of simple features, their connected points, and their discontinuities. These four dimensions were validated by their ability to predict human perceptual judgments on graphs that varied in complexity as measured by GraphCom. As a tool for research, the GraphCom measures are available online for 131 written languages and 21,550 graphs. In addition, its measures are defined precisely, allowing application to any of the world’s writing systems. This provides a practical research tool for constructing studies of perception and orthographic learning by children and adults, as well as cross-language studies of reading and writing.