Introduction

The CDC defines “health literacy” as “the degree to which an individual has the capacity to obtain, communicate, process, and understand basic health information” [127]. Although correlation does not equate to causation, previous studies have shown associations between health literacy and healthcare outcomes, noting that poor health literacy correlates with increased healthcare costs, hospitalization rates, and mortality [35, 95, 118, 119]. Accordingly, the CDC encourages physicians to maximize health literacy by adapting patient materials to the level of knowledge of the intended audience [99]. For written materials, the American Medical Association (AMA) suggests writing at or below the 6th grade reading level; similarly, the NIH suggests writing at the 7th or 8th grade reading level [128, 135]. This “readability” can be determined using various validated formulas that incorporate factors such as sentence length, word count, and word complexity. While studies have shown that patient-directed materials, such as consent forms, educational materials, and discharge summaries, often are written above suggested reading levels [7, 24, 123, 135], few studies have analyzed materials used by patients to self-report their health status [3, 38]. Known as “patient-reported outcome measures” (PROMs), these questionnaires are used clinically and in research to quantify patients’ perceptions of their conditions, functional abilities, baseline health status, treatment success, and physician competency [6, 8, 18, 103, 108].
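As an illustration of how such formulas work, the widely published Flesch-Kincaid Grade Level combines average sentence length with average syllables per word (a representative example only; the formulas evaluated in this study differ in their exact terms and coefficients):

```latex
% Flesch-Kincaid Grade Level: a representative grade-level readability
% formula. Longer sentences and more polysyllabic words raise the estimate.
\[
\text{FKGL} = 0.39\,\frac{\text{total words}}{\text{total sentences}}
  + 11.8\,\frac{\text{total syllables}}{\text{total words}} - 15.59
\]
```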

Validation of PROMs is important to ensure that they measure the endpoints of interest accurately and reproducibly. Elements such as reliability, consistency, content/construct validity, and sensitivity to change are often considered part of this validation process; however, readability is seldom mentioned as a factor in validation [11, 88, 97, 110]. Thus, while a PROM may have the elements of a well-designed survey, low readability could impair its practical value and clinical utility. However, readability is not synonymous with comprehension, as comprehension is a multifaceted concept that depends on both aesthetic presentation and academic content. Thus, even materials deemed “highly readable” may be difficult to comprehend if they are presented with poor font or color choices. Furthermore, with diverse knowledge levels among patients, the level of comprehension is unique to every patient, regardless of the readability level of the PROM content. Even so, investigating the readability of patient materials offers practitioners a sense of whether broad, diverse populations of patients are likely to be able to use these tools in real-world practice.

Two studies exist regarding the readability of orthopaedic-specific PROMs [3, 38], and both are limited in their scope of PROM selection and their heavy reliance on a single readability measurement test, the Flesch Reading Ease [3, 38]. While the Flesch Reading Ease is one of the most commonly used readability scores in health literature, its continued utility with modern syntax has been called into question as newer, more broadly applicable readability measures have been developed [130]. In the absence of a single, accepted, validated readability measure for healthcare materials, the use of a lone readability measure could unnecessarily skew the results of these studies. To account for this, one systematic review supports using multiple readability measures to evaluate a passage [45]. Additionally, prior studies have not assessed whether PROMs with more questions traditionally were written at higher reading levels. With trends toward increasingly brief surveys [37], the question arises whether lack of readability was an issue in longer surveys. Finally, if PROMs are written at too high a reading level, an effort must be made to improve their readability to warrant their continued clinical use. While this has been shown for patient education materials [52, 115], it has not been performed for PROMs. Therefore, questions arise regarding whether the same techniques used to improve patient education materials are also applicable to PROMs, and whether edited versions of PROMs retain their prior reliability and validation.

We therefore asked: (1) What proportion of orthopaedic-related PROMs and orthopaedic-related portions of the NIH Patient Reported Outcomes Measurement Information System (PROMIS®) are written at or below the 6th and 8th grade levels? (2) Is there a correlation between the number of questions in a PROM and its reading level? (3) Using systematic edits based on guidelines from the Centers for Medicare & Medicaid Services (CMS) [90], what proportion of PROMs achieved NIH-recommended reading levels?

Materials and Methods

Selection of Patient-Reported Outcomes Instrument List

A PubMed search was conducted to identify an inclusive list of orthopaedic-associated PROMs. The most relevant article identified, “Are patient-reported outcome measures in orthopaedics easily read by patients?”, included a list of 59 PROMs [38]. We supplemented this list with orthopaedic PROMs from the following resources: (1) “Guide to outcomes instruments for musculoskeletal trauma research,” if they were specified as “patient” reported, and not “combined” or “physician” reported [1]; (2) the American Academy of Orthopaedic Surgeons’ (AAOS) website [5]; and (3) the quick-link Orthopaedic Scores website [74].

Preparation of Patient-Reported Outcomes Documents

A total of 86 independent PROMs were identified for inclusion and obtained in their published form (ie, as original journal publications or via the authors’ respective websites). These PROMs were grouped as follows: general health/musculoskeletal/pain status (15) (Table 1), upper extremity (21) (Table 2), lower extremity (41) (Table 3), and spine (nine) (Table 4). In addition, four PROMIS® Adult Short Forms and one investigator-compiled “PROMIS® Bank” consisting of questions from 11 relevant PROMIS® Adult Item Banks were assessed (Table 5). Individual PROMs and PROMIS® materials were obtained in Portable Document Format (PDF), manually converted to Microsoft Word® format (Microsoft Corporation, Redmond, WA, USA), and reviewed for accuracy by the authors. All advertisements, hyperlinks, pictures, copyright notices, and other text that was not a direct element of the questionnaire were removed. Each PROM or section of the PROMIS® (item bank or short form) was then saved as a text-only file for analysis by the readability software.
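Although the conversion and review in this study were performed manually, the final cleanup step can be approximated in a short script. The sketch below is illustrative only; the file names and boilerplate patterns are hypothetical and would need tailoring to each source document:

```python
# Minimal sketch of the text-only preparation step (hypothetical file names).
# The actual conversion and review in this study were performed manually.
import re

# Hypothetical patterns for non-questionnaire text (copyright lines, links).
BOILERPLATE = re.compile(r"(copyright|all rights reserved|https?://|www\.)",
                         re.IGNORECASE)

def to_text_only(lines):
    """Keep only non-empty lines that look like questionnaire content."""
    return [ln for ln in lines if ln.strip() and not BOILERPLATE.search(ln)]

with open("prom_draft.txt", encoding="utf-8") as src:
    cleaned = to_text_only(src.readlines())

with open("prom_clean.txt", "w", encoding="utf-8") as dst:
    dst.writelines(cleaned)
```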

Table 1 Median reading grade levels of 15 common, orthopaedic-related, patient-reported outcome measures for general/musculoskeletal health or pain status, as determined by 19 unique readability algorithms
Table 2 Median reading grade levels of 21 common, orthopaedic-related, patient-reported outcome measures of the upper extremity, as determined by 19 unique readability algorithms
Table 3 Median reading grade levels of 41 common, orthopaedic-related, patient-reported outcome measures of the lower extremity, as determined by 19 unique readability algorithms
Table 4 Median reading grade levels of nine common, orthopaedic-related, patient-reported outcome measures of the spine, as determined by 19 unique readability algorithms
Table 5 Median reading grade levels of the NIH PROMIS® question sets

Readability Assessment

Readability tests were chosen based on the following inclusion criteria: (1) intended for English text; (2) intended for adult use or used in a previously published study; and (3) score output on a grade-level scale, with higher grade levels corresponding to text that is more difficult to comprehend. Additionally, we included the Flesch Reading Ease readability index score (scale, 0–100) owing to its simple conversion to a grade scale and its relevance for comparison with previously published work [38]. In the absence of any single, accepted readability measure for healthcare-related materials, each document was analyzed by 19 unique readability algorithms, each meeting the criteria above (Table 6). Assessment was performed with Readability Studio 2015 (Oleander Software, Ltd, Pune, Maharashtra, India). Descriptions and algorithms for each readability test were adapted from the Readability Studio descriptions (Appendix 1. Supplemental material is available with the online version of CORR®.).
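For readers unfamiliar with how a per-document grade estimate arises, the sketch below illustrates the computation under stated assumptions: it implements only three well-known grade-level formulas with a crude syllable heuristic, whereas the study itself relied on Readability Studio’s 19 algorithms, and the input file name is hypothetical:

```python
# Illustrative only: three published grade-level formulas and the median
# across them; the study used Readability Studio 2015 with 19 algorithms.
import re
import statistics

def count_syllables(word: str) -> int:
    # Crude heuristic: count runs of vowels; dedicated software is more careful.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def grade_levels(text: str) -> dict:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    n_words = max(1, len(words))
    syllables = sum(count_syllables(w) for w in words)
    chars = sum(len(w) for w in words)
    return {
        # Flesch-Kincaid Grade Level
        "FKGL": 0.39 * n_words / sentences + 11.8 * syllables / n_words - 15.59,
        # Automated Readability Index
        "ARI": 4.71 * chars / n_words + 0.5 * n_words / sentences - 21.43,
        # Coleman-Liau Index (letters and sentences per 100 words)
        "CLI": 0.0588 * (100 * chars / n_words)
               - 0.296 * (100 * sentences / n_words) - 15.8,
    }

with open("prom_clean.txt", encoding="utf-8") as f:  # hypothetical file
    scores = grade_levels(f.read())
print(f"MGL = {statistics.median(scores.values()):.1f}")
```

Taking the median across formulas, as in the study’s MGL, damps the influence of any single algorithm’s idiosyncrasies.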

Table 6 Readability tests used in this study, with the MGL and IQR produced by each

Descriptive Statistics

Descriptive statistics were performed on the readability test results, and the median grade level (MGL) and interquartile range (IQR) were reported. Spearman’s correlation coefficient was used to determine whether the number of survey items in each PROM correlated with its readability level. All statistical analyses were performed using SPSS Version 22.0 (IBM SPSS Statistics for Macintosh, Armonk, NY, USA).
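The statistics reported here reduce to a median, an IQR, and a Spearman rank correlation; the following sketch reproduces those computations in Python with SciPy (the study used SPSS; the per-PROM arrays below are hypothetical placeholders):

```python
# Hedged sketch of the descriptive statistics; the study used SPSS 22.0.
import numpy as np
from scipy.stats import spearmanr

mgls = np.array([5.0, 4.2, 6.1, 8.4, 3.9])   # hypothetical median grade levels
n_items = np.array([12, 10, 24, 8, 20])      # hypothetical question counts

q1, median, q3 = np.percentile(mgls, [25, 50, 75])
rho, p = spearmanr(n_items, mgls)            # rank correlation: length vs MGL
print(f"MGL {median:.1f} (IQR, {q1:.1f}-{q3:.1f}); r = {rho:.3f}; p = {p:.3f}")
```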

Readability Improvement Editing Process

For PROMs with median readability scores above the 8th grade level, the following editing steps, based on the CMS Toolkit for Making Written Material Clear and Effective [90], were applied: we edited the PROMs to use the active voice; short, simple sentences; and simplified vocabulary [90]. For example, a hypothetical item such as “Are you able to perform activities of daily living without assistance?” might become “Can you do your daily activities without help?” After these three steps, the MGL was reassessed. All PROMs meeting criteria for inclusion underwent each editing step as outlined above (Appendix 2. Supplemental material is available with the online version of CORR®.).

Results

Sixty-four of 86 PROMs (74%) had an MGL at or below the AMA-recommended 6th grade reading level, while 81 of 86 (94%) were at or below the NIH-recommended 8th grade level (Fig. 1). The overall MGL of independent PROMs was 5.0 (IQR, 4.6–6.1), corresponding approximately to the start of the United States’ 5th grade school year. The investigator-compiled PROMIS® Bank had an MGL of 4.1 (IQR, 3.5–4.8). The four selected PROMIS® Adult Short Forms had an MGL of 4.2 (IQR, 4.2–4.3) (Table 5). The Nottingham Health Profile had the lowest MGL of the independent PROMs (MGL, 2.6; IQR, 0.2–3.8) (Table 1), followed by the American Shoulder and Elbow Surgeons Questionnaire (MGL, 3.8; IQR, 0.9–5.2) (Table 2), Marx Activity Rating Scale (MGL, 3.9; IQR, 2.0–4.5) (Table 3), RAND 20-item Short Form (MGL, 3.9; IQR, 2.1–4.5) (Table 1), and Simple Shoulder Test (MGL, 3.9; IQR, 2.3–4.7) (Table 2). The PROMs with the highest MGLs were the UCLA Activity Score (MGL, 12.1; IQR, 7.0–13.9), Modified Cincinnati Rating System (MGL, 9.1; IQR, 6.2–10.5), Lower Extremity Measure (MGL, 8.9; IQR, 5.5–10.9), Lysholm Knee Score (MGL, 8.4; IQR, 6.7–13.2), and Tegner Activity Level Scale (MGL, 8.4; IQR, 6.1–9.8) (Table 3). All item banks and short forms of the PROMIS® met AMA and NIH recommendations (Table 5).

Fig. 1

The median grade level (MGL) distribution of the included independent patient-reported outcome measures (PROMs) is shown. Sixty-four of 86 met the American Medical Association recommendation of at or below the 6th grade reading level (black line with *); 81 met the NIH recommendation of at or below the 8th grade reading level (black line with #).

No correlation was observed between the MGL and the number of questions contained in a PROM (r = −0.081; p = 0.460).

Following edits, all five PROMs (UCLA Activity Score, Modified Cincinnati Rating System, Lysholm Knee Score, Tegner Activity Level Scale, Lower Extremity Measure) achieved the NIH-recommended 8th grade level, while three (Modified Cincinnati Rating System, Tegner Activity Level Scale, Lower Extremity Measure) achieved the AMA-recommended 6th grade level (Fig. 2). Editing improved readability by 4.3 MGL (before: 8.9 [IQR, 8.4–9.1]; after: 4.6 [IQR, 4.6–6.4]; difference of medians, 4.3; p = 0.008).

Fig. 2

The median reading grade level (MGL) improvements of low-readability PROMs (MGL > 8.0) after the Centers for Medicare & Medicaid Services (CMS)-derived editing process are shown. LEM = Lower Extremity Measure; TALS = Tegner Activity Level Scale; LKS = Lysholm Knee Score; MCRS = Modified Cincinnati Rating System; UCLA = University of California, Los Angeles Activity Score.

Discussion

PROMs have been increasingly implemented in orthopaedic practice to objectively quantify surgical outcomes and assist in guiding surgical decision making [6, 8, 47]. However, their utility has been questioned by a recent report suggesting that most PROMs are written at levels too difficult for the average adult to comprehend [38]. That study is limited by its use of only one readability measure, the Flesch Reading Ease. Our study, using multiple readability measures and giving equal weight to each, sought to assess the true readability of orthopaedic-related PROMs. Therefore, we asked: (1) What proportion of orthopaedic-related PROMs and orthopaedic-related portions of the NIH PROMIS® are written at or below the 6th and 8th grade levels? (2) Is there a correlation between the number of questions in a PROM and its reading level? (3) Using systematic edits based on CMS guidelines [90], what proportion of PROMs achieved NIH-recommended reading levels?

This study has limitations. First, the readability scores were determined by giving equal weight to each algorithm used. This could be a weakness, as some formulas may be better equipped and more reliable for assessing the readability of healthcare documents, and might deserve greater weight in the determination of MGLs. Additionally, the CMS Toolkit [89] highlights the importance of aesthetics to readability; however, we did not assess such aspects because they could not be analyzed by the software. We also did not evaluate whether the editing process altered the clinical validity and utility of the five edited PROMs; the possible effects of the editing process on clinical and diagnostic validity therefore merit additional investigation. In addition, this analysis excluded non-English PROMs, as they could not be assessed by the readability algorithms used in MGL calculation. Finally, this readability analysis cannot assess the literacy of the patients completing PROMs. Readability equations evaluate PROMs based solely on quantifiable metrics, while literacy involves numerous qualitative factors that this study was not designed to measure. Although a low MGL does not necessarily translate to higher comprehension and clinical utility, MGLs are the best method currently available to broadly gauge how understandable healthcare documents are likely to be across varied patient populations.

The finding that more than 90% of PROMs and all areas of the PROMIS® are written at acceptable reading levels refutes the study by El-Daly et al. [38], which raised fears of widespread failure of PROMs. Based on their assessment, only 12% of PROMs had a reading grade level congruent with the average UK literacy level (reported as that of 11-year-old students, or 6th grade), calling into question the accuracy and reliability of data obtained through PROMs, a sentiment further endorsed in a response by Brown [20]. The inconsistencies between our findings and those of El-Daly et al. likely center on their use of a single readability score, the Flesch Reading Ease. While this readability algorithm is mentioned by the CMS, CDC, and NIH as having utility in assessing patient-related documents, neither it nor any other readability algorithm is recognized as a gold-standard instrument intended to be used in isolation; each entity encourages the use of multiple readability algorithms rather than one test alone [23, 90, 128]. In our analysis, the Flesch Reading Ease algorithm yielded the third highest MGL of the 19 readability tests used (Table 6). Additionally, the grade-level extrapolation of this index score (with original outputs on a scale of 0–100) has a 5th grade minimum, likely falsely elevating scores from an exaggerated baseline [44]. The Flesch Reading Ease is also a dated measure, and questions have arisen regarding its continued utility in assessing health literature [130]. In short, while the Flesch Reading Ease is a commonly used score, its aggressive grade-level conversions and lack of adaptation to modern syntax make it a poor choice on which to base sweeping conclusions about PROM readability or calls for reform. The alarm initiated by El-Daly et al. [38] and endorsed by Brown [20] appears overstated and potentially misleading. However, our findings should be met with guarded optimism. Even though most PROMs are readable by the average American, patients in traditionally low-literacy areas, such as the rural southeastern United States where illiteracy rates encompass more than one in three adults [98], may continue to have difficulty comprehending PROMs. In these areas of decreased literacy, physicians might better serve their patients by selecting PROMs written at a 3rd to 5th grade level [100].
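For reference, the Flesch Reading Ease index (standard published form, not specific to this study) uses the same sentence- and syllable-level inputs as grade-level formulas but on an inverted 0 to 100 scale, where higher scores indicate easier text:

```latex
% Flesch Reading Ease: higher scores indicate easier text.
\[
\text{FRE} = 206.835 - 1.015\,\frac{\text{total words}}{\text{total sentences}}
  - 84.6\,\frac{\text{total syllables}}{\text{total words}}
\]
```

Under the conventional conversion, even the easiest band (scores of 90 to 100) maps to the 5th grade, so no text, however simple, converts to a grade below the 5th; this floor is the exaggerated baseline noted above.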

No correlation was found between the number of questions in a PROM and its reading level. While PROM design is shifting away from arduous, multipage surveys with numerous subsections toward short, high-impact question sets [14, 37], it is noteworthy that reading level is not associated with PROM length. However, readability algorithms do not account for document length or possible reader fatigue; they instead analyze features such as sentence and paragraph length. Therefore, while readability may not be affected in longer, more detailed PROMs, mental fatigue in the patients completing them could still play a role. Mental fatigue, studied after traumatic brain injury, has been shown to negatively affect a patient’s ability to comprehend new information [62], and tired patients have been shown to leap to conclusions prematurely [134]. While readability is not affected by PROM length, future research is required to assess the possible effects of reader fatigue on comprehension of longer PROMs.

Editing according to CMS guidelines improved all five PROMs and brought them to or under the 8th grade reading level. These guidelines address many aspects of readability, from text selection to aesthetic appeal; in Part 4, Section 3 of the CMS Toolkit, multiple specific suggestions are made, including limiting the number and length of sentences, using the active voice, avoiding acronyms, and using a conversational style with nontechnical terms [90]. These suggestions were adopted to formulate our editing process (Appendix 2. Supplemental materials are available with the online version of CORR®.), which yielded satisfactory results, lowering the MGL of documents with poor readability by 45% and allowing all to score under the 8th grade reading level (Fig. 2). With the emergence of PROMs as clinical and research tools, steps must be taken to ensure improved readability and sustained validity of measures written above recommended reading levels. Although the CMS-based editing process must itself be validated for use with PROMs, the improvements in MGL after editing are encouraging. The edited PROMs also would need to be revalidated: research has shown that minor changes may significantly alter the questions being asked and thus the nature of the responses [94, 105]. While onerous, this revalidation process may be necessary before the five edited PROMs are used for clinical research.

PROMs are increasingly used in patient-centered healthcare and outcomes research; their readability is therefore vital for accurate, valid responses. We disagree with the previous conclusion that the majority of PROMs used in orthopaedics are “incomprehensible to most patients asked to complete them” [38]. In contrast, our study, the most comprehensive analysis of PROM readability to date, revealed that more than 90% of orthopaedic PROMs are written at or below the 8th grade reading level. Additionally, our study tested a method of editing PROMs that reliably decreased the MGL, although validation of this method and of the edited PROMs is still required. Our analysis contradicts previous concerns and supports the use of nearly all commonly used PROMs in clinical orthopaedic practice.