1 Introduction

Privacy policies are a vital instrument for communicating with customers about how and why sensitive data (personal data revealing, for example, racial or ethnic origin, political opinions, religious or philosophical beliefs) is collected [1]. It has been shown that about 90% of all websites in the US collect data from users during their visit [2]. Companies, in particular those doing business on the internet, are required by laws and regulations to explain the reasons for, the amount of and the handling of the personal data they collect. However, these statements are often full of expert language on technological or legal issues [3]. The language used is therefore hard to understand for most people without technical or legal education. Some regulations, like the GDPR [4, 5], require companies to use clear and understandable language in privacy statements, yet do not elaborate on what this means. Some studies have already addressed readability (e.g. [3, 6]) and its influence on users’ behaviour [7, 8]. Although general measures for assessing the readability of text exist, their applicability to privacy policies has rarely been tested (e.g. [6]). In this study, we therefore investigate seven common approaches to assessing readability by applying them to selected privacy statements of fifteen companies. The goal of this research is to show how existing approaches can be applied to assess the readability of privacy policies. We thus aim to close the gap between current research and the possibilities for measuring the readability of privacy policies and, as a result, to suggest an approach for doing so. The remainder of the paper is structured as follows. First, we give a brief overview of related research, followed by a description of our approach and research design. We then briefly describe the seven approaches and report the results of applying them to measure readability.
In addition, we exposed the text to consumers of different grade levels, asking them to assess the text in terms of readability. Next, we discuss the results and provide recommendations based on the insights gained. Finally, we show limitations and provide some insights into further research.

2 Related Research

Privacy can be understood as a “claim of individuals, groups and institutions to determine for themselves, when, how and to what extent information about them is communicated to others” [9]. This offers individuals the possibility to decide which of their private data is transferred to someone else [10]. As companies need user data to stay competitive [1, 11], a conflict evolves between customers’ right to privacy and organizations’ need to know about their customers to enable targeted marketing [12, 13]. The increasing reluctance to provide private [10] and sensitive data (Personally Identifiable Information – PII) [14] has been investigated widely. On the other hand, it has been shown that most people disclose a lot more data than they think, which is often referred to as the privacy paradox [13, 16]. Research suggests that people make informed considerations and even calculate the pros and cons before transferring PII [15]. To communicate with users and overcome their concerns [17], organizations use privacy policies to inform about how, which and when data is collected and stored [3]. This is a common way of communicating data collection activities and establishing an agreement [18] or informed consent [19] between organization and user. The role of privacy policies in e-commerce has already been discussed intensively, as they are necessary to establish a relationship [6] and trust [15, 20, 21] and influence customers’ purchase behaviour [22]. Privacy policies in general consist of information mainly regarding legal [25, 26] and technical issues or technology applied [3, 27], but also social issues [18, 23, 24]. The language used to explain and describe legal issues and technical activities is often rather complex and has been found to protect the organization rather than explain the rights of users [23, 28]. Complex and hard-to-assess privacy statements influence the balance of power between users and organizations [29].

Not only researchers have claimed that the readability of privacy statements has to be improved to enable all customers and users to understand them [8]. Policymakers have also found this to be important, resulting in the General Data Protection Regulation (GDPR), which became applicable on 25 May 2018 [5]. However, over the last twenty years, evidence has shown that privacy statements are not written in plain and clear language [3, 7, 30]. By contrast, privacy policies often require a high level of reading ability and an above-average vocabulary [3, 30, 31], deterring users from reading them [6]. Readability has been defined as the “sum total (including the interactions) of all those elements within a given piece of printed material that affects the success that a group of readers have with it” [32]. In the early 1960s, Klare defined readability as “the ease of understanding or comprehension due to the style of writing” [33], focusing on writing style but not text structure. Later, McLaughlin defined readability as “the degree to which a given class of people find certain reading matter compelling and comprehensible” [34], identifying classes of people as a determining factor for the readability of a text. In particular for young people, it was found that simple language, paragraph organization, avoidance of redundant information spread across the text and highlighting of important information improve readability [27]. Moreover, the organization and structure (see [32]) or design [27] of the text influence readability. Readability is also influenced by the readers’ general ability to read and by how interested they are in the text [32]. Other authors identified more quantitative characteristics of text as influential, such as length (i.e., word count and words per sentence), negative words (negativity rating), immediacy, vagueness or subordinating conjunctions [29].
Such measurable variables qualify for calculating scores, making the readability of texts comparable [29]. Different approaches for calculating scores are at hand [35, 36], such as the Flesch-Kincaid Grade Level (respectively Flesch Reading Ease) [37, 38, 39], the Gunning Fog Index Readability Formula [40], the SMOG Grading by McLaughlin [34], the Coleman-Liau Index [41], and the Automated Readability Index [42]. Other scoring schemes like the Dale-Chall Index [32, 43] and the Fry Graph Readability Formula [44] are also important. It is important to note that these approaches have been developed for and applied to books, websites, and medical, military or official documents.

3 Research Design

To show how existing approaches can be applied to assess the readability of privacy policies, we applied the scales to a sample of privacy statements of 15 companies listed in Forbes Magazine. To allow comparability and control over biasing factors, only a very specific part existing in every privacy statement was used as the basis for assessment and calculation. All privacy statements were downloaded in spring 2019 and preprocessed (i.e., converted into machine-readable text). The analysis of the privacy statements was supported by software and conducted by two researchers independently. In general, the process started with an in-depth literature review to identify approaches to assess readability. Next, we identified privacy policies and decided which part of them should be assessed for readability. We first applied a qualitative analysis to these parts, relying on qualities of text that influence readability. Next, we applied the quantitative approaches and exposed the sample to people on different grade levels.

3.1 Approaches to Assess Readability

Approaches to measure the readability of different types of English text (e.g. military or medical context) were already developed in the last century and are still the basis for current computer-based readability assessment [45, 46]. Based on prior studies [35, 36, 45, 47], we selected seven scales with low standard errors (Dale-Chall Index [32, 43], Automated Readability Index [42], Gunning Fog Index Readability Formula [40], SMOG Grading [34], Fry Graph Readability Formula [44], Flesch-Kincaid Grade Level [37, 38, 39], Coleman-Liau Index [41]). They have been discussed extensively in the literature and are among the most used readability scales [45, 46]. In general, the approaches use different variables derived from the text (e.g., word count, sentence length) and combine the numbers with constants to adjust the weight of the variables.

The Dale-Chall Index [32, 43] (DCI) relies on the average sentence length (ASL) and the percentage of unfamiliar words (words not on a list of 3,000 commonly used words) [32, 43]. Calculated are the Dale score (ratio of familiar and unfamiliar words as a percentage) and the raw score (RS), which is converted into grades based on a correction table. The Automated Readability Index (ARI) [42] uses sentence length (words per sentence - SL) and word length (number of characters or letters per word - WL). It has been developed to overcome flaws in other approaches [38, 43]. The Gunning Fog Index Readability Formula (FOG Index) [40] uses average sentence length (ASL) and the number of complex words of a text of approximately 100 words [48]. Complexity of words is related to length, represented by the syllables of a word, whereby, for example, three or more syllables in general qualify a word as complex or long. The number of complex words is divided by the number of overall words, leading to a percentage of hard words (PHW). The results correspond to the US grade levels. The SMOG Grading [34] is related to the FOG Index, as it uses the idea of complex words, expressed in syllables. In addition, it has some rules to split the text and count polysyllabic words (PSW, with three or more syllables). It relies on 30 sentences only; hence, some improved formulas have been developed [35, 49]. The Fry Graph Readability Formula [44] relies on three different parts at the beginning, middle and end of the text of 100 words each (resulting in 300 words). The average number of sentences as well as the average number of syllables are calculated. The values are applied to the so-called Fry Graph [44], where word length (short words – long words) is on one axis and sentence length (short sentence – long sentence) is on the other. By locating the results in the corresponding section of the graph, the grade level can be derived [44].
The Flesch Reading Ease (FRE) focuses on the relationship between the average sentence length (ASL) and the average number of syllables per word (ASW) or syllables per 100 words (ASW/100). The Flesch-Kincaid Grade Level [29, 37] (FKGL) extends the FRE, but focuses more on the length of sentences due to the constants used in the formula. Finally, the Coleman-Liau Index (CLI) [41] focuses on a readability formula that can be applied by computer programs. The variables are the average sentence length (ASL), represented by the average number of sentences per 100 words, and the average number of characters per 100 words (AC). Table 1 briefly summarizes variables, formulas and results of the seven approaches.

Table 1. Approaches Chosen – Overview (incl. formulas)
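To make the variables described above concrete, the following minimal sketch collects six of the scores as Python functions. The constants are those of the commonly published formulations, not values taken from this study, so they are assumptions with regard to the exact variants applied here. The function and parameter names are hypothetical, and all inputs are raw counts that have to be obtained beforehand. The Dale-Chall Index and the Fry Graph are left out because they additionally need a word list and a graph lookup, respectively.

```python
import math


def ari(chars: int, words: int, sentences: int) -> float:
    """Automated Readability Index: characters per word and words per sentence."""
    return 4.71 * (chars / words) + 0.5 * (words / sentences) - 21.43


def fog(words: int, sentences: int, complex_words: int) -> float:
    """Gunning Fog Index: ASL plus percentage of hard words (3+ syllables)."""
    return 0.4 * ((words / sentences) + 100.0 * (complex_words / words))


def smog(polysyllables: int, sentences: int) -> float:
    """SMOG grade: polysyllabic words normalised to a 30-sentence sample."""
    return 1.0430 * math.sqrt(polysyllables * (30.0 / sentences)) + 3.1291


def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease: higher values mean easier text (not a grade level)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def fkgl(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: same variables as FRE, different constants."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def cli(letters: int, words: int, sentences: int) -> float:
    """Coleman-Liau Index: letters and sentences per 100 words."""
    letters_per_100 = letters / words * 100.0
    sentences_per_100 = sentences / words * 100.0
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8
```

As an illustration with made-up counts, a passage of 100 words, 5 sentences, 500 letters and 170 syllables yields an ARI and a CLI of about 12.1 and an FKGL of about 12.3, that is, scores that agree closely although they rely on different variables.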

3.2 Selection and Analysis of Privacy Statements

Our study focuses on companies operating in Europe to reflect the GDPR [5], which requires companies to aim at understandability of privacy policies. As sensitive data is mainly an issue for end consumers, B2B companies were not selected, leading to fifteen companies (Allianz, Axa Group, Banco Santander, BMW Group, BNP Paribas, Daimler, Gazprom, HSBC Holding, ING Group, Nestle, Sberbank, Shell, Siemens, Total, Volkswagen – links to the privacy statements are provided upon request). We pre-screened the privacy statements to identify particular parts which are appropriate (e.g., in terms of length) for analysis. By comparing the privacy statements, we identified the part describing the ‘cookie policy’ (further referred to as the sample) as most appropriate. First, this part was present in every document. Second, it is a rather technical part which is not easy to comprehend without further knowledge. Third, in all statements, this part was long enough for applying all seven approaches.

To set a baseline, we analyzed the text parts based on qualitative parameters such as simple language, paragraph organization, avoidance of redundant information spread across the text and highlighting of important information (to improve readability) [27]. In addition, we assessed organization and structure (see [32]) as well as design [27]. We developed a coding sheet and, to control for bias, two researchers independently analyzed the text parts and compared and discussed the results only after having finished their analysis. The coding sheet included the qualitative parameters and scores from ‘fulfilled’ to ‘not fulfilled’ in five grades (ordinal scale). We calculated inter-coder and intra-coder reliability based on Holsti [50], resulting in scores of 0.97 and 0.95 (intracoder reliability) and 0.81 (intercoder reliability). We used MAXQDA, which is a tool for analyzing text in a semi-automated way, to count text length, sentences, words, syllables and characters based on the corresponding rules. In addition, we used the software for quantifying readability parameters such as negativity rate or vagueness and qualification [29]. Further rules had to be established due to some specific elements in cookie policies (e.g., the name of the company, contact details or tables). We applied all seven approaches to the sample and documented the results. In a final step to verify the quantitative results, we exposed the sample to 15 people on five different grade levels in accordance with the Dale-Chall approach (i.e., 7–8, 9–10, 11–12, 13–15, above 16 – three people per grade level). The participants were selected from students of an international school (with English as their first language, age 13–18 years) and an international study program (age 19–24 years).
Based on variables used in the selected scales, we developed a validation scheme of five questions (5-point Likert scales): length of text (short – long), length of sentences (short – long), complex sentences (some – many), unfamiliar words (some – many) and general readability (very easy – very hard). We also documented reading speed (expressed in minutes per 1,500 words). However, we only assessed the participants’ impression, not how well they understood the text.
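The reliability scores reported above follow Holsti's coefficient, which is commonly stated as 2M / (N1 + N2), where M is the number of coding decisions on which two codings agree and N1 and N2 are the numbers of decisions in each coding. A minimal sketch, assuming that both codings are available as equally ordered lists of grades from the coding sheet (the function name and the example grades are hypothetical):

```python
def holsti_reliability(codes_a, codes_b):
    """Holsti's coefficient of reliability: 2M / (N1 + N2).

    codes_a and codes_b hold the grades assigned to the same items,
    in the same order, by two coders (or by one coder at two points
    in time for intracoder reliability).
    """
    if not codes_a or not codes_b:
        raise ValueError("both codings must contain at least one decision")
    # M: number of items on which both codings agree
    agreements = sum(1 for a, b in zip(codes_a, codes_b) if a == b)
    return 2 * agreements / (len(codes_a) + len(codes_b))


# Hypothetical grades (1 = 'not fulfilled' ... 5 = 'fulfilled') for six items
coder_1 = [5, 4, 4, 2, 1, 3]
coder_2 = [5, 4, 3, 2, 1, 3]
reliability = holsti_reliability(coder_1, coder_2)
```

The coefficient only counts exact agreement and does not correct for agreement by chance, which should be kept in mind when interpreting values on a five-grade ordinal scale.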

4 Results

Although quantitative approaches to measure readability are the focus of this study, we first summarize the results from the qualitative analysis. Regarding the simple language used in the sample, we found some parts easy to read (‘fulfilled’), whereas others require rather deep technical knowledge (‘hardly fulfilled’ or ‘not fulfilled’). Parameters like paragraph organization, avoidance of redundant information spread across the text and highlighting of important information [27] were found in most of the documents (‘partly fulfilled’ or ‘fulfilled’). The same applies to structure [32] and design [27], which were ‘partly fulfilled’ or ‘fulfilled’ in all but one text of the sample. Regarding quantifiable parameters, we first give an overview of word count (WC), sentence count (SC), average sentence length in words (ASL), average syllables per word (ASW), average characters per word (AC), negativity rating (NR), and vagueness and qualification (V&Q) [29] (see Table 2). The word count (WC) is approximately 849 words on average, whereas the average sentence count (SC) is 41. A rather wide variation can be found in the average sentence length (ASL), varying from 18.06 to 40.26. The longest text has 1518 words and 75 sentences (BMW Group). The shortest text by word count has only 303 words (Allianz), and the text with the fewest sentences has 19 (Banco Santander); the latter also shows the longest average sentence length (40.26). Average syllables per word (ASW) are rather stable between 1.6 and 1.8, whereas the average characters per word (AC) vary from 4.7 to 5.1. The rather low variation in ASW and AC can be explained by the limited variety of words in the text. Regarding negativity rating, a share below 1% of the words being negative is seen as not being too complex [29]. In five cases, the result is above 1%, but still rather low.
In the same manner, vagueness and qualification is within the range (below 2%) that is considered acceptable for comprehensibility [29]. Table 2 summarizes the results.
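The parameters WC, SC, ASL, ASW and AC can be approximated with a few lines of code. The sketch below is an illustration only and does not reproduce the counting rules of the software used in this study: sentence splitting is done on end punctuation, and syllables are estimated with a rough vowel-group heuristic, both of which are assumptions. Abbreviations, email addresses or company names would need the kind of additional rules mentioned in Section 3.2.

```python
import re


def count_syllables(word: str) -> int:
    """Estimate syllables by counting vowel groups (heuristic, not exact)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing 'e' as silent, except in endings such as '-le'
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def text_parameters(text: str) -> dict:
    """Return WC, SC, ASL, ASW and AC for a plain-text passage."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:['’][A-Za-z]+)?", text)
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    wc, sc = len(words), len(sentences)
    syllables = sum(count_syllables(w) for w in words)
    letters = sum(1 for w in words for c in w if c.isalpha())
    return {
        "WC": wc,
        "SC": sc,
        "ASL": wc / sc,
        "ASW": syllables / wc,
        "AC": letters / wc,
    }


params = text_parameters("We use cookies. Cookies are small files.")
```

Because the syllable heuristic miscounts some words (for example plurals ending in "-es"), values for ASW obtained this way should be treated as approximations and checked against a manual count on a small sample.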

Table 2. Quantifiable Parameters of the Sample - Legend: word count (WC), sentence count (SC), average sentence length in words (ASL), average syllables per word (ASW), average characters per word (AC), negativity rating (NR), vagueness and qualification (V&Q)

Next, all seven approaches were applied and compared to each other. We do not report on the Dale-Chall Index (as the results were not conclusive, probably biased by Dale’s list of familiar words) and the FRE, which is part of the FKGL. In general, considering all results, the sample seems to require at least grade 10. As we can see, the Coleman-Liau Index (CLI) and the Flesch-Kincaid Grade Level (FKGL) are often slightly lower compared to the other scores. However, some scores are not in line with each other. For example, the CLI for the text from Banco Santander is pretty much average, whereas the other scores (ARI, FOG, SMOG, FKGL) are rather high, and so is the average of all scores for this text. This is interesting, since the text from Banco Santander has already been identified as the one with the longest average sentence length and the lowest number of sentences. The scores for the text from Daimler, interestingly, are rather low (average 11.16, all scores below 12). Another low average score (11.31) has been calculated from the scores for the cookie policy of Daimler; however, the FOG is above 12. A summary of the results of six approaches, including an average (excluding Fry), is given in Table 3.

Table 3. Results of Approaches - Scores per Approach and Assessment of Participants (Exp. = Experience, HTR = Hard to read for all participants, ETR = Easy to read for all participants, M = mixed results)

By exposing the sample to fifteen consumers, we gained more insights for the interpretation of the results. The text with the highest average score (17.75 - Banco Santander, Table 3) was experienced by all participants as the longest, most complex and hardest to understand. However, the text experienced as holding many unfamiliar words was from Gazprom. Interestingly, fast readers (5 min per 1,500 words) and slow readers (12 min per 1,500 words) were found on all grade levels. In general, texts experienced by all participants as very easy or easy to read (ETR) showed average scores below 12. By contrast, texts experienced as hard to read (HTR) by all participants showed average scores above 14. Average scores between 12 and 14 showed an expected result: hard to read for participants falling into grade levels below 12, easy or very easy to read for grade levels above 12. The texts from Daimler and Volkswagen were assessed as being very easy to read by all participants.
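Read in the reverse direction, the observation above suggests a simple rule of thumb for anticipating how a text will be experienced from its average score. The following sketch encodes the two thresholds (12 and 14) reported for this sample; treating them as a general classifier, as well as the handling of scores exactly on a boundary, is an assumption for illustration, and the function name is hypothetical.

```python
def expected_experience(average_score: float) -> str:
    """Map an average grade score to the labels used in Table 3.

    Thresholds follow the pattern observed in this sample of fifteen
    cookie policies and are not validated beyond it.
    """
    if average_score < 12:
        return "ETR"  # easy to read for all participants
    if average_score > 14:
        return "HTR"  # hard to read for all participants
    return "M"  # mixed results, depending on the reader's grade level


labels = [expected_experience(score) for score in (11.16, 13.0, 17.75)]
```

With the lowest and highest averages mentioned in the text (11.16 and 17.75), the function returns "ETR" and "HTR", which matches the participants' assessments of these texts.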

5 Discussion

As discussed already, due to the drawbacks of qualitative assessment of readability (e.g. requiring trained individuals, subjective assessment, low bias control), automated approaches gain importance [46]. In this study, we investigated how quantitative approaches to measure readability, as the basis for computer-based approaches, can be applied to privacy statements, in particular cookie policies. The selected approaches use similar variables (e.g., average sentence length, average word length) and combine them [45, 47] to generate easy-to-assess results. The approaches provide scores or grades (related to the US education scheme), indicating the minimum requirement for being able to understand the text. Because the variables are simple and the results are expressed as grade scores, these approaches have gained some importance for organizations aiming to improve the readability of their privacy policies. This study consequently contributes to both research and practice. For research, it provides evidence on an ongoing issue, since readability of privacy statements has been addressed [7, 8] without providing tools to measure it. For companies, some interesting insights and hints for improving the readability of privacy statements may be derived.

Besides similarities, the approaches investigated differ significantly in terms of variables, rules and sample size. Whereas some approaches can be applied to small text parts (e.g., 100 words for FRE and FKGL), others require, for example, 300 words from a longer text (e.g., Fry Graph Readability Formula). Regarding the variables used, sentence length seems to be an accurate measure, whereas the average number of syllables per word is reported not to be a good parameter for readability assessment [36]. This is interesting, as in our study the results of the approaches using syllables (Fry Graph Readability Formula, Flesch-Kincaid Grade Level and SMOG Grading) do not differ notably from those of the other approaches. However, privacy statements (as a text type) differ from traditional text types (e.g., in textbooks), sharing some characteristics with web pages, for which the application of readability measures has been questioned [36]. In addition, for specific text types, such as cookie policies, the existing rules may not fit. Hence, defining rules for new terms (e.g., email addresses) is necessary. A list of familiar words (like in the Dale-Chall formula [43]) would have to be adapted to the specific context (i.e., privacy statements on web pages) and updated frequently.

These quantitative approaches to measure readability ignore abilities of the readers which are not reflected by grade levels (e.g., vocabulary knowledge in the specific field) [3]. Traditional readability metrics do not consider people with intellectual disabilities [52]. The approaches are also not designed to measure how well a single person would comprehend the content of the written text, since they exclude the reader’s attitude towards the text [51]. For example, a person on grade level 8 who is interested in computers and online games may have more of the specific vocabulary knowledge required than a person on grade level 16 with no interest in computers at all. Moreover, as semantic complexity is measured, additional factors influencing readability (e.g., writing style, negativity rating) are not considered [3]. However, our additional assessment by exposing the sample to fifteen participants more or less supported the results of the quantitative approaches. We do not want to overemphasize the results, but there seems to be a borderline marking a text as being in general easy to read or hard to read regardless of grade levels. To overcome the problems of traditional readability measures [3], statistical language models [36] or cloze procedures [53, 54] have been discussed. Combining lexical and syntactical traits of texts [52] as well as integrating content in the formula [36] seems to be a valid solution. Further research is necessary in order to test whether this approach would lead to significantly different results, in particular when applied to privacy statements. Cloze procedures [53, 54] integrate the knowledge of the readers by involving people from the target group of the text to assess readability. In a rather complex procedure, participants are asked to replicate certain passages by filling in removed words [3].
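The cloze procedure mentioned above can be illustrated with a short sketch. It assumes the common variant in which every n-th word is removed and responses are scored by exact match; the deletion interval, the blank marker and the function names are hypothetical choices and are not taken from [53, 54].

```python
import string


def make_cloze(text: str, every: int = 5, blank: str = "_____"):
    """Replace every n-th word with a blank; return gapped text and answers."""
    if every < 2:
        raise ValueError("'every' must be at least 2")
    gapped, answers = [], []
    for position, token in enumerate(text.split(), start=1):
        core = token.strip(string.punctuation)  # keep surrounding punctuation
        if position % every == 0 and core:
            answers.append(core)
            gapped.append(token.replace(core, blank, 1))
        else:
            gapped.append(token)
    return " ".join(gapped), answers


def score_cloze(answers, responses) -> float:
    """Share of blanks filled with exactly the removed word (case-insensitive)."""
    if not answers:
        raise ValueError("no blanks to score")
    hits = sum(
        1 for a, r in zip(answers, responses) if a.lower() == r.strip().lower()
    )
    return hits / len(answers)


passage = "We use cookies to remember your settings and to measure traffic."
gapped_text, removed_words = make_cloze(passage, every=5)
```

For the example passage, the fifth and tenth words ("remember" and "measure") are removed. A participant's score is then the share of blanks restored correctly, which reflects the reader's own vocabulary knowledge rather than text properties alone.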

Companies interested in improving the readability of privacy statements could combine different scales, as already discussed in the literature [35], to avoid over- and underestimating the scores. Although not specifically mentioned in the GDPR, a grade level of 9–10 seems to be appropriate to provide privacy statements readable for the public in general. Based on the insights gained from this study, we suggest combining approaches that are syllable-based with approaches that are word-based, so as not to favour one over the other. Due to cost and time constraints, companies would not apply all seven approaches. With our sample, we therefore tested different combinations and compared them to the average score. Using a combination of FOG, SMOG and CLI changes the results by 0.05 grades on average. Only the average score of the text from Banco Santander changed by −1.4 (average grade level 16). We suggest combining the three approaches, enriched by specific rules related to privacy policies (e.g., for contact email addresses), with vagueness and negativity rating (weighted). Two different scenarios to improve the readability of privacy statements are proposed. First, different versions of a privacy statement can be compared to identify the one with the best readability score. Second, a privacy policy can be rewritten and measured repeatedly until the readability formulas yield a certain grade level (e.g., 9–10).
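The comparison of a reduced combination with the average of all reported scores can be sketched as follows. The scores in the example are invented for illustration and are not taken from Table 3, and the function names are hypothetical. The weighting with vagueness and negativity rating suggested above is not included, since no weights are specified.

```python
from statistics import mean


def combined_grade(scores: dict, scales=("FOG", "SMOG", "CLI")) -> float:
    """Average grade over a chosen subset of readability scales."""
    return mean(scores[name] for name in scales)


def shift_from_full_average(scores: dict, scales=("FOG", "SMOG", "CLI")) -> float:
    """Difference between the subset average and the average of all scores."""
    return combined_grade(scores, scales) - mean(scores.values())


# Invented scores for one text (not from Table 3)
example_scores = {"ARI": 13.0, "FOG": 14.0, "SMOG": 13.0, "FKGL": 12.0, "CLI": 12.0}
subset_average = combined_grade(example_scores)
shift = shift_from_full_average(example_scores)
```

In this invented example, the three-scale average is 13.0 compared with 12.8 for all five scores, a shift of 0.2 grades. Applied per text, the same calculation shows for which policies a reduced combination deviates noticeably from the full average.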

6 Conclusion, Limitations and Further Research

As we have shown in our research, existing approaches to measure readability can be applied to privacy policies, but require some additional rules. The results can be used as a basis for decision making, but do not explicitly suggest what to change. A combination of different scales, supplemented by some of the qualitative parameters, might be a solution. This study certainly has some limitations, as we investigated only the parts of privacy statements that explain cookies and how they are handled. However, we think that this is not only one of the most recognized parts, but also one loaded with technical terms. Another limitation is due to the fact that most measures are only reliable for text in English. Extending the study to other parts and other companies may reveal different results, in terms of additional rules or ways to improve the measurement. As we have only started this research, we plan to elaborate on it further; in particular, we would like to compare scores with a more in-depth assessment by users. Moreover, since readability does not mean comprehensibility, more research has to be done on the link between actual privacy statements, readability and comprehensibility.