1 Reading Text in Serif and Sans Serif Typefaces

Ovink (1938) included a sans serif typeface (Futura) as well as a variety of serif typefaces in experiments where participants read lines of text presented for a limited exposure (pp. 84–88) or where their eye movements were monitored while they were reading passages held in their hands in clear view (pp. 88–100). In the latter case, Ovink used a mechanical apparatus in which a rubber pad rested on one of the participants’ upper eyelids. He also included a sans serif typeface (Gill Sans) as well as three serif typefaces in a further study where the short-exposure method was used to present material in the form of dictionary entries and where the participants subsequently rated the clarity of the typefaces that had been used (pp. 100–106). There were no marked differences between their responses to the sans serif typefaces and their responses to the serif typefaces.

Wendt (1969, 1994) carried out an investigation to compare the effects of typographic factors, including the legibility of a serif typeface (Bodoni) and a sans serif typeface (Futura). He constructed 16 passages in German and presented each in five columns across a sheet of paper. He asked roughly 2,000 students to read one of the passages and recorded the number of words read in 3 min. There was a slight mean advantage of 8.62 words for passages in the Futura typeface, but this did not achieve statistical significance. Wendt argued that modern readers were equally familiar with both styles of typeface. Indeed, in a subsequent survey of Australian students, the serif typeface Press Roman was ranked only marginally higher in their overall preference than the sans serif typeface Univers (Bell & Sullivan, 1981).

Taylor (1990) arranged for the instructors of four remedial reading classes at a US high school to administer reading tests to groups of students aged 15–16. Over three weeks, the students were timed reading excerpts from their reading workbooks, one printed in the sans serif typeface Helvetica, the other printed in the serif typeface Times Roman. (As was mentioned in Sect. 1.2, this is a similar typeface to Times New Roman that was developed by a rival printing company; it is sometimes known just as “Times”.) Across 74 students, the median difference in their reading rate scores on the two typefaces was zero (p. 50). Over the next two weeks, they were given sheets of paper containing other excerpts printed in both of the typefaces, and they were asked to choose one version to read aloud. There was no significant difference in their preferences in either week.

A major research issue is that actual examples of serif and sans serif typefaces tend to differ on a variety of other characteristics (Beier & Larson, 2010). Arditi (2004) had devised software to generate typefaces that differed only in the presence or absence of slab serifs and in the size of the serifs. Arditi and Cho (2005) used this tool to construct lowercase typefaces of uniform thickness with slab serifs that extended for 0% (sans serif), 5%, or 10% of their cap height. Two individuals with normal vision and two with impaired vision were asked to read aloud three “passages” in which 400 words had been randomly ordered to yield “scrambled” text. Arditi and Cho calculated each participant’s “reading speed” by dividing the number of characters in the correctly read words by the time taken to read each passage. The participants with normal vision obtained higher scores than those with impaired vision, but there was no significant variation in their reading speed as a function of serif size and a fortiori no significant effect of the presence or absence of the slab serifs. As Arditi and Cho noted, the small number of participants was a major limitation of their study.
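The “reading speed” measure described by Arditi and Cho can be illustrated with a minimal sketch. The function name, the input format, and the choice of characters per second as the unit are assumptions made here for illustration; the source specifies only that the number of characters in correctly read words was divided by the time taken.

```python
def reading_speed(words, read_correctly, seconds):
    """Characters in correctly read words divided by reading time.

    words          -- the words of one scrambled passage, in order
    read_correctly -- one boolean per word (True if read aloud correctly)
    seconds        -- time taken to read the passage (unit assumed here)
    """
    if len(words) != len(read_correctly):
        raise ValueError("need exactly one correctness flag per word")
    if seconds <= 0:
        raise ValueError("reading time must be positive")
    # Only words that were read correctly contribute their characters.
    chars = sum(len(w) for w, ok in zip(words, read_correctly) if ok)
    return chars / seconds


# Hypothetical three-word passage: two words read correctly in 4 s.
speed = reading_speed(["serif", "type", "reads"], [True, False, True], 4.0)
print(speed)  # 10 characters / 4 s = 2.5
```

Counting characters rather than words means that a misread long word costs more than a misread short word, which matters when the words appear in random order and cannot be guessed from context.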

2 Comprehending Text in Serif and Sans Serif Typefaces

In Sect. 2.2, it was noted that asking participants to read continuous meaningful text provides less opportunity for researchers to impose experimental control over their reading behaviour. To address this issue, some researchers have focused on their participants’ comprehension of such material rather than upon its legibility per se. This could be justified on the grounds that, in order to be comprehended, written material must first be read; thus the legibility of a text places an upper limit on how much of it can be comprehended. A different argument was put forward by Gasser et al. (2005). They suggested that, insofar as a message was easy to read, fewer attentional resources would need to be devoted to reading the message, leaving more resources available for the processing of the information that it contained. However, Reynolds (1979) argued that the value of comprehension tests in the measurement of legibility was questionable, insofar as comprehension implicated cognitive skills of a higher order than those that were required for the accurate perception of the text. Wilkins et al. (1996) also claimed that conventional reading tests tended to emphasise the linguistic and semantic aspects of reading rather than the purely visual aspects (see Sect. 4.1). Thus, factors affecting comprehension might not be relevant to measuring legibility.

Fox (1963) compared two typefaces intended for use on typewriters: Standard Elite, a conventional serif typeface; and Gothic Elite, a sans serif typeface in which lowercase letters are replaced by small capitals. Both typefaces are monospaced or non-proportional (each character occupies the same width), and small capitals in Gothic Elite occupy the same width as lowercase letters in Standard Elite. The participants read two passages, one in each typeface, silently but as quickly as possible. After reading each passage, they were required to recount the story that it contained, and their comprehension was rated as “good” or “poor”. There was no significant difference in the time taken to read passages in the two typefaces or in their comprehension of their content.

Poulton (1965) asked 375 adult volunteers to read two passages of about 450 words in the same typeface. They were allowed 90 s to read each passage and were given a test of their comprehension of ten key points from the passage. The passages had been printed in seven different typefaces: three serif typefaces (Baskerville, Bembo, and Modern) and four sans serif typefaces (Gill Medium, Grotesque 215, and two versions of Univers). On the first passage, Poulton found no significant differences among the comprehension scores, which he ascribed to the participants becoming familiar with the general procedure and the particular typeface that they had to read. On the second, there was significant variation among the sans serif typefaces but not among the serif typefaces; more importantly, none of the serif typefaces yielded scores that were significantly different from those of any of the sans serif typefaces.

A fundamental question about this paradigm is whether a test on the content of a passage administered immediately after its presentation is a test of comprehension or simply a test of factual recall or verbatim memory (Hartley et al., 1975). Poulton and Brown (1967) used the same procedure as in Poulton’s (1965) study, but they remarked that their measure of “comprehension” was “more correctly described as a measure of memory” (p. 219). They found that requiring the participants to read a passage aloud led to poorer performance on the early key points in the passage but to better performance on the last key point in comparison with requiring the participants to read the passage silently. Unfortunately, Poulton and Brown had not matched the key points and their associated questions for difficulty, and consequently the theoretical interpretation of these results remains unclear.

Soleimani and Mohammadi (2012) evaluated different typefaces with Iranian students who had been selected for having an intermediate proficiency in English. Their study included a reading comprehension test based on a passage from a widely used English-language textbook: it was presented in the sans serif typeface Arial for 42 students and in the serif typeface Bookman Old Style for 47 students. There was no significant difference between the two groups in their reading speed, in an immediate test of their reading comprehension, or in a multiple-choice test of their memory for ten key points from the passage that was administered 2 weeks later.

Serif and sans serif typefaces are also used in some non-Western alphabets. Akhmadeeva et al. (2012) asked 238 Russian medical students to read a passage about the history of neurology in Russia. The passage had been printed in Cyrillic script using ParaType, a family of artificial typefaces: 108 students were shown the passage in a serif typeface, and 130 students were shown the passage in a sans serif typeface with the same x-height. The students were given 1 min to read the passage and were then asked ten multiple-choice questions about its content. Akhmadeeva et al. found no sign of any difference either in the mean number of words that the two groups had read or in the mean number of questions that they had answered correctly.

3 The Connotative Meaning of Typefaces

Some researchers have argued that typefaces can serve as carriers of connotative meaning (reflecting their associations with different attitudes, experiences, and emotions) as well as carriers of denotative meaning (reflecting factual information). Subjective impressions of the legibility of different typefaces can be regarded as just one aspect of their connotative meaning. (German-speaking writers sometimes refer to this quality as their Atmosphärenwert or “atmosphere value”. North American writers sometimes talk about the “personality” of different typefaces.) This aspect might in principle affect their legibility for different readers. In this kind of research, participants are asked to report on their experiences and preferences when reading material printed in different typefaces. Once again, examples can be presented either individually or in groups of two or more for comparison, and the self-reports can be collected either informally (for instance, through interviews) or more formally (for instance, through the use of rankings or rating scales).

As an example of a formal approach, Tinker and Paterson (1942) asked different groups of participants to arrange samples of ten different typefaces in order from the most legible to the least legible and from the most pleasing to the least pleasing. They found that their judgements of legibility and pleasantness demonstrated “remarkable agreement” (p. 40), to the extent that they could be regarded as being equivalent to one another. As Dreyfus (1985) pointed out, readers’ preferences may be irrelevant if they have no choice in whether or not to read something (such as an airline schedule or a railway timetable) but may be crucial if they can choose what they read (as in product information or voting literature).

Another example of a formal approach is the semantic differential. This was devised by Osgood et al. (1957) to measure participants’ attitudes to objects and concepts. The participants provide evaluations of these using bipolar rating scales; typically, these are 7-point scales in which the middle category is neutral between the two poles. Their responses are subjected to factor analysis to yield higher-order dimensions that are regarded as reflecting underlying aspects of connotative meaning. Research studies in a wide variety of domains and cultures converged on three overarching dimensions that reflected variations in people’s attitudes: evaluation (good vs. bad), potency (strong vs. weak), and activity (active vs. passive) (Osgood et al., 1975). Hofstätter (1966) devised a similar methodology for German-speaking countries, which he described as a Polaritätsprofil (polarity profile).

4 Connotations of Serif and Sans Serif Typefaces

Connotative meaning is potentially important in the world of advertising. Berliner (1920) initiated this line of inquiry by asking students to rank order different hand-lettered styles on different dimensions for advertising different products. She found that their rankings of appropriateness were different for different products. Subsequent researchers confirmed this when using actual typefaces (Davis & Smith, 1933; Poffenberger & Franken, 1923; Schiller, 1935). They found no clear difference between the ratings given to serif and sans serif typefaces; other features seemed to be more important in determining the appropriateness of different typefaces to different products. Ovink (1938, pp. 127–177) noted that these studies had only used typefaces that were in common use in advertising displays in the United States. He chose 17 typefaces, including some that were more widely used in Europe. He asked 68 participants to rate the appropriateness of each of the typefaces for advertising purposes on eight dimensions. The 17 typefaces varied on most dimensions, but there was no clear difference overall between serif typefaces and sans serif typefaces.

Using the semantic differential methodology developed by Osgood et al. (1957), Tannenbaum et al. (1964) presented participants with the English alphabet printed in both uppercase and lowercase in both upright and italic fonts in two serif typefaces, Bodoni and Garamond, and two sans serif typefaces, Spartan and Kabel. A total of 75 participants rated each display on 25 scales. There were no significant differences between the ratings given to the serif typefaces and those given to the sans serif typefaces.

Wendt (1968) asked 70 participants to rate 35 typefaces using an adapted version of Hofstätter’s (1966) semantic differential. Eleven were sans serif typefaces (seven variants of Folio and four variants of Futura). Each typeface was presented on a printed card containing the alphabet in lowercase, the alphabet in uppercase, and the ten single digits. Each was evaluated by ten participants on 63 7-point rating scales. Factor analysis was used to reduce their ratings to four broad dimensions, but there were no clear differences between the serif and the sans serif typefaces on these dimensions. Cluster analysis of the ratings yielded three clusters, but each of these contained both serif and sans serif typefaces. In other words, the participants’ ratings differentiated among the 35 typefaces, but they did not differentiate systematically between the serif typefaces and the sans serif typefaces.

Benton (1979; Rowe, 1982) asked 24 participants to evaluate ten typefaces on 26 bipolar 7-point scales. Five were typefaces in general use, including one sans serif typeface (Helvetica); five were novelty typefaces. Each was presented on a printed sheet containing the alphabet in lowercase, the alphabet in uppercase, the ten single digits, and common punctuation marks. Factor analysis was used to reduce their ratings to five broad dimensions, which Benton labelled “potency”, “elegance”, “novelty”, “antiquity”, and “evaluation”, and which she used to calculate scale scores for the ten typefaces. The serif typefaces (Bodoni, Garamond, Palatino, and Times Roman) did not differ significantly from each other on any of the five dimensions. Moreover, Helvetica only differed from these typefaces on antiquity, where it was seen as being relatively modern.

Bartram (1982) conducted a similar study in which 38 design students and 52 students of other disciplines were asked to rate 12 typefaces on 18 bipolar scales. A factor analysis of the ratings provided by the design students yielded four dimensions: the three hypothesised by Osgood et al. (1957) (evaluation, potency, and activity) and a fourth dimension concerned with mood. Bartram then compared the scores on these dimensions given to the 12 typefaces by the design students and the other students. They included three regular upright typefaces: the serif typeface Times New Roman and the sans serif typefaces Futura Medium and Univers 67. Bartram did not compare the ratings of these typefaces directly, but their profiles were relatively similar.

Morrison (1986) obtained ratings of four serif typefaces (Egyptian, Modern, Old Style, and Transitional) and one sans serif typeface (Contemporary) on 25 bipolar dimensions taken from Tannenbaum et al. (1964). There were three groups, each of 14 participants: typography students, technology students, and students of other subjects. They were presented with “text” consisting of statistical approximations to English in all five typefaces in three weights and in both upright and italic font. There were no significant differences among the three groups in terms of their mean ratings on seven scales that included the three primary factors identified by Osgood et al. (1957). The only significant difference between the serif typefaces and the sans serif typeface was that the latter received slightly higher ratings than the former on the potency factor, which Morrison attributed to sans serif typefaces being used on public signs that had an association with authority (such as freeway and airport signage).

Tantillo et al. (1995) asked 250 students to evaluate examples of six typefaces on 28 bipolar 7-point scales. There were three serif typefaces (Century Schoolbook, Goudy Old Style, and Times New Roman) and three sans serif typefaces (Avant Garde Gothic, Helvetica, and Univers). Significant differences emerged on 26 scales: “The serif type styles... are rated as more elegant, charming, emotional, distinct, beautiful, interesting, extraordinary, rich, happy, valuable, new, gentle, young, calm, and less traditional than the sans serif type styles. Serif styles have more personality, freshness, high quality, vitality, and legibility, but the sans serif group is more manly, powerful, smart, upper-class, readable, and louder than the serif styles” (p. 452).

These results are anomalous when compared with the findings of earlier research. A procedural difference is that Tantillo et al. only presented the nonsense word “NRESTA” in uppercase as the example of each typeface to be evaluated, but other studies had used longer examples involving both uppercase and lowercase letters. The rating forms that Tantillo et al. employed often showed the more positive pole on the left end of each scale, but it sometimes appeared on the right end, which may have led to a degree of confusion on the students’ part. This might explain two curious findings: first, sans serif typefaces are usually regarded as being more modern in their appearance than serif typefaces, yet Tantillo et al.’s participants rated serif typefaces as being younger and less traditional; second, the sans serif typefaces were rated as being significantly more readable but as significantly less legible than the serif typefaces.
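The practical consequence of mixing pole orientations on a rating form can be illustrated with a minimal sketch of reverse scoring. This is not a description of how Tantillo et al. analysed their data; the function names, the convention that responses are recorded from 1 (left end) to 7 (right end), and the example values are all assumptions made for illustration.

```python
SCALE_POINTS = 7  # bipolar 7-point scale; 4 is the neutral midpoint


def align_rating(rating, positive_on_left):
    """Re-express a recorded rating so that higher always means more positive.

    rating           -- recorded position, 1 (left end) to 7 (right end)
    positive_on_left -- True if the form printed the positive pole on the left
    """
    if not 1 <= rating <= SCALE_POINTS:
        raise ValueError("rating must lie between 1 and 7")
    # A mark near the left end of a left-positive scale is a favourable
    # response, so it is mirrored around the midpoint: 1 <-> 7, 2 <-> 6, ...
    if positive_on_left:
        return SCALE_POINTS + 1 - rating
    return rating


def mean_aligned(ratings, positive_on_left):
    """Mean of one scale's ratings after alignment."""
    aligned = [align_rating(r, positive_on_left) for r in ratings]
    return sum(aligned) / len(aligned)


# Hypothetical responses on a scale printed with the positive pole on the left.
print(mean_aligned([2, 1, 3], positive_on_left=True))   # (6 + 7 + 5) / 3 = 6.0
print(mean_aligned([2, 1, 3], positive_on_left=False))  # (2 + 1 + 3) / 3 = 2.0
```

The same three marks yield a strongly favourable mean under one orientation and a strongly unfavourable mean under the other, so a respondent who misreads the orientation of even a few scales could reverse the apparent direction of a difference between typefaces.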

5 Conclusions

A number of studies have evaluated the role of typographic variables (including the presence or absence of serifs) in reading continuous text. Asking participants to read continuous text allows less scope for experimental control, and so other researchers have instead focused on participants’ comprehension of written material. In both cases, the modal finding is that there are no significant differences between text printed in serif typefaces and text printed in sans serif typefaces. Subjective impressions of the legibility of different typefaces can be regarded as one aspect of their connotative meaning, and other researchers have asked participants to evaluate typefaces on different dimensions using single rating scales or semantic differentials. The modal finding is that there are no significant differences in readers’ overall preference between serif and sans serif typefaces, nor any significant differences in the connotations of serif and sans serif typefaces.