New Frontiers in Health Literacy: Using ChatGPT to Simplify Health Information for People in the Community

Background: Most health information does not meet the health literacy needs of our communities. Writing health information in plain language is time-consuming but the release of tools like ChatGPT may make it easier to produce reliable plain language health information.
Objective: To investigate the capacity for ChatGPT to produce plain language versions of health texts.
Design: Observational study of 26 health texts from reputable websites.
Methods: ChatGPT was prompted to ‘rewrite the text for people with low literacy’. Researchers captured three revised versions of each original text.
Main Measures: Objective health literacy assessment, including Simple Measure of Gobbledygook (SMOG), proportion of the text that contains complex language (%), number of instances of passive voice and subjective ratings of key messages retained (%).
Key Results: On average, original texts were written at grade 12.8 (SD = 2.2) and revised to grade 11.0 (SD = 1.2), p < 0.001. Original texts were on average 22.8% complex (SD = 7.5%) compared to 14.4% (SD = 5.6%) in revised texts, p < 0.001. Original texts had on average 4.7 instances (SD = 3.2) of passive text compared to 1.7 (SD = 1.2) in revised texts, p < 0.001. On average 80% of key messages were retained (SD = 15.0). The more complex original texts showed more improvements than less complex original texts. For example, when original texts were ≥ grade 13, revised versions improved by an average 3.3 grades (SD = 2.2), p < 0.001. Simpler original texts (< grade 11) improved by an average 0.5 grades (SD = 1.4), p < 0.001.
Conclusions: This study used multiple objective assessments of health literacy to demonstrate that ChatGPT can simplify health information while retaining most key messages. However, the revised texts typically did not meet health literacy targets for grade reading score, and improvements were marginal for texts that were already relatively simple.
Supplementary Information The online version contains supplementary material available at 10.1007/s11606-023-08469-w.

In recent years, health literacy has come to the forefront of public health research and practice, with persistent calls to provide health information that is easy to access and understand. 1,2 [4][5][6] This includes information developed by government, health services and non-government organisations. 7,8 Addressing this issue is challenging given the vast amount of health information available online. [10][11][12] This is a process that demands considerable expertise and time. Though there are tools for objectively assessing the health literacy of health information and automating text-simplification, [13][14][15] revisions are still largely carried out by humans.
Recent advances in large language models present new opportunities that might transform our ability to develop plain language health information at scale. For example, in November 2022, OpenAI publicly released ChatGPT, a large language model that has been trained on a large database of text data to produce plausible, contextually appropriate and human-like responses to prompts, typically questions or requests to produce writing meeting certain constraints. Large language models do not synthesise or evaluate evidence, but rather they predict what should come next in a piece of text by learning from large volumes of training data. 16 ChatGPT is also capable of adapting text to different writing styles and audiences, has a simple user interface that does not require software or programming expertise, and is freely available.
There is limited evidence showing that ChatGPT can produce information that adheres to health literacy guidelines. For example, one study has shown that ChatGPT prompts can produce patient letters that are written at 9th grade reading level, 17 and another rated ChatGPT output describing patient postoperative instructions as adequately understandable, actionable and generally complete. 18 However, there is substantial room for improvement, both in terms of optimising the ChatGPT prompts and employing more comprehensive assessment of plain language. Other studies have found that ChatGPT outputs in health domains were generally correct and complete, with low potential for harm, though the complexity of the language was not assessed. 19,20 [23][24] This study sought to investigate the capacity for ChatGPT (GPT-3.5) to produce plain language versions of health texts across a range of health topics. To our knowledge, no studies have evaluated the appropriateness of plain language health information generated by ChatGPT using multiple objective assessments.

Text Selection
The research team collected extracts from patient-facing online health information published by recognised national and international health information provider websites such as the World Health Organization, Centers for Disease Control and Prevention and National Health Service (UK) (Appendix 1). Extracts were at least 300 words and did not rely on images to explain the text.

ChatGPT
ChatGPT-3.5 was accessed via chat.openai.com between 28 April 2023 and 8 May 2023. The platform allows users to 'converse' with the model via API, by sending text-based prompts which the model then responds to. The model seeks to supply users with plausible, human-like responses. However, responses reflect statistical patterns based on training data, rather than knowledge synthesis. 16 Given the risks associated with delivering unsupervised health advice, ChatGPT includes some safeguards to reduce unsafe or harmful prompts. For example, the model is known not to give personalised health advice.

ChatGPT Prompt Development and Text Revision
To develop a prompt that applies health literacy principles to written text, several prompts were tested on four sample texts. Two types of prompts were tested: (a) prompts that described specific health literacy principles (e.g. simple language, active voice, minimal jargon); and (b) prompts that described the target audience. The latter reflected typical health literacy priority groups such as people who do not speak English as their main language, people who read at a school student level and people without health or medical training. 25 Each candidate prompt was used in a separate 'chat' to reduce the risk of interference from previous instructions to revise other texts (13 March to 11 April 2023). The research team generated two revised texts per candidate prompt and assessed these for grade reading score, complex language, passive voice and subjective appraisals of retention of key messages (Appendix 2). Findings were discussed across the whole research team. The prompt 'rewrite the text for people with low literacy' was ultimately selected for this study because it more consistently produced texts with a lower grade reading score across the four sample texts and each of two iterations, avoided passive voice, used simpler language and is a brief prompt that is easy to use. We collected three responses from each prompt using the 'regenerate' function. Examples of revised text are shown in Appendix 3.

Text Assessment
Each text was assessed using the Sydney Health Literacy Lab Health Literacy Editor, which we developed. 15 This is an online tool designed to objectively assess the extent that written health information is written in plain language. There are several ways to calculate grade reading score. This study used a formula called the Simple Measure of Gobbledygook (SMOG). 27 The SMOG formula is widely used in health research. 28
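To illustrate how a SMOG grade reading score is derived, the sketch below applies McLaughlin's published formula, grade = 1.0430 × sqrt(polysyllabic words × 30 / sentences) + 3.1291. It is a minimal illustration only: the function names are hypothetical, the syllable counter is a crude vowel-group heuristic, and it should not be read as the implementation used by the Health Literacy Editor.

```python
import math
import re


def count_syllables(word: str) -> int:
    """Estimate syllables by counting groups of consecutive vowels.

    Assumption: a rough heuristic for illustration, not a dictionary lookup.
    """
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def smog_grade(text: str) -> float:
    """Return the SMOG grade for a text using McLaughlin's formula."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        raise ValueError("text must contain at least one sentence")
    words = re.findall(r"[A-Za-z]+", text)
    # Polysyllabic words are those with three or more syllables.
    polysyllables = sum(1 for w in words if count_syllables(w) >= 3)
    # The factor 30 / len(sentences) normalises the count to a 30-sentence sample.
    return 1.0430 * math.sqrt(polysyllables * 30 / len(sentences)) + 3.1291
```

Because the heuristic miscounts some words (for example, words ending in a silent 'e'), scores from this sketch will differ from those of validated tools.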

Complex language: The proportion of the text (%) that contains acronyms, uncommon words (as defined by an existing English-language corpus), or terms listed as public health or medical jargon. 15 Lower scores indicate lower levels of complex language. For each text, the research team identified up to 5 key topic words that were excluded from complex language assessment as these words were inherent to the text.
Passive voice: The number of times a passive voice construction appeared in the text.
Dot points for lists: Using dot points for long lists is recommended in some plain language guidelines. 11,29

Completeness was assessed by subjectively rating whether the key messages were retained in each text. Key messages were developed independently by authors JA and OM, with discrepancies resolved through discussion. The two people who assessed the completeness of the revised text were not involved in selecting the text or developing key messages. One consumer and one academic researcher rated each. Scores represent the average number of key messages retained across both assessors.
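The complex language measure described above can be sketched as follows. This is an illustration under stated assumptions rather than the Health Literacy Editor's method: the word lists passed in are hypothetical placeholders for the corpus and jargon lists, acronyms are detected simply as all-capital tokens, and excluded topic words are assumed to remain in the denominator.

```python
import re


def complex_language_percent(text, common_words, jargon, topic_words=()):
    """Return the share of words (%) flagged as complex.

    common_words, jargon and topic_words are hypothetical illustrative lists;
    the study's actual corpus and jargon lists are not reproduced here.
    """
    common = {w.lower() for w in common_words}
    jargon_set = {w.lower() for w in jargon}
    excluded = {w.lower() for w in topic_words}

    tokens = re.findall(r"[A-Za-z']+", text)
    if not tokens:
        return 0.0

    flagged = 0
    for token in tokens:
        lowered = token.lower()
        if lowered in excluded:
            continue  # key topic words are never counted as complex
        is_acronym = len(token) >= 2 and token.isupper()
        if is_acronym or lowered in jargon_set or lowered not in common:
            flagged += 1
    # Assumption: the denominator is all words, including excluded topic words.
    return 100 * flagged / len(tokens)
```

For example, with `common_words={"take", "the", "tablet", "for"}` and `jargon={"hypertension"}`, the text "Take the tablet for hypertension" has one flagged word out of five; listing "hypertension" as a topic word removes that flag.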

Analysis
Descriptive statistics were calculated for each text and averaged across the three texts generated by the ChatGPT prompt. Results also present the minimum and maximum scores of individual ChatGPT revisions to provide a sense of the reliability of the prompt. For continuous outcome variables, differences between original and revised text assessments were analysed using paired-sample t tests. ANOVA was used to explore these differences across texts with low, medium and high complexity in the original versions, and Pearson's correlations were used to explore the relationships between continuous variables. For the categorical outcome variable (presence/absence of dot points), differences between original and revised texts were analysed using McNemar's test.
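As a brief illustration of the two paired comparisons named above, the sketch below computes a paired-sample t statistic and an exact two-sided McNemar p-value using only the Python standard library. The function names are hypothetical and any inputs are for demonstration, not study data. The exact (binomial) form of McNemar's test is an assumption, as the variant used in the analysis is not specified.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(original, revised):
    """Return (t, degrees of freedom) for paired original/revised scores."""
    diffs = [o - r for o, r in zip(original, revised)]
    n = len(diffs)
    if n < 2:
        raise ValueError("at least two pairs are required")
    # t = mean difference divided by its standard error.
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


def mcnemar_exact_p(b, c):
    """Two-sided exact McNemar p-value from the two discordant pair counts.

    b: pairs with dot points in the original only
    c: pairs with dot points in the revision only
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Binomial tail probability under the null of equal discordance (p = 0.5).
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

The t statistic would then be compared against a t distribution with n − 1 degrees of freedom to obtain a p-value, which a statistics package would normally handle.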

RESULTS
On average, the 26 original texts had a grade 12.8 reading level. Almost one quarter (22.8%) of the words were assessed as 'complex' and there were on average 4-5 instances of passive voice (Table 2). Texts revised by ChatGPT were on average 1.8 grade reading scores lower (M = 11.0, p < 0.001), with significantly less complex language (14.4%, p < 0.001) and less use of passive voice (1.7, p < 0.001). Fourteen of the 26 original texts (54%) showed lists as dot points. When these texts were revised, only 4 of the 56 revised versions (7%) used the same format (p < 0.001). No revised texts introduced dot points where there were none in the original text.
ChatGPT was also more effective at revising texts that were more complex to begin with (Table 3). For example, when ChatGPT revised texts that were originally grade 13 or higher, the grade reading score was lowered by an average 3.3 grades. This was a much larger improvement than revisions to texts that were originally grade 11 or lower (mean decrease of 0.5, p = 0.009) or that were originally grades 11 to 12 (mean decrease of 1.4, p = 0.032). Similar patterns were observed for complex language and passive voice.
Original texts had on average 6.5 key messages (SD = 2.0), with a range of 3 to 10. When rating whether key messages were retained in the revised texts, we observed 84.3% agreement (across 510 ratings). On average 79.8% of key messages were retained in revised texts (SD = 15.0), ranging from 20% in one instance to as high as 100%. Completeness of revised texts was not related to the number of key messages in an original text (p = 0.43), its length (p = 0.84), or health literacy assessment (grade reading score: p = 0.39; complex language: p = 0.53; passive voice: p = 0.68).
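The agreement and retention figures above rest on simple arithmetic, sketched below for a single revised text. The function names and the 0/1 coding of ratings (1 = key message retained) are assumptions made for illustration; the study's own scoring sheets are not reproduced here.

```python
def percent_agreement(rater_a, rater_b):
    """Percentage of key messages on which two raters gave the same rating."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return 100 * matches / len(rater_a)


def retention_percent(rater_a, rater_b):
    """Percentage of key messages retained, averaged across both raters.

    Each rating is assumed to be 1 (retained) or 0 (not retained).
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    pct_a = 100 * sum(rater_a) / len(rater_a)
    pct_b = 100 * sum(rater_b) / len(rater_b)
    return (pct_a + pct_b) / 2
```

For a text with four key messages rated `[1, 1, 0, 1]` by one assessor and `[1, 0, 0, 1]` by the other, the raters agree on three of four messages and the averaged retention is the mean of 75% and 50%.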

DISCUSSION
When asked to simplify existing health information, ChatGPT on average improved the grade reading score of texts, used less complex language, and removed instances of the passive voice. It achieved this while retaining 80% of the key messages. These improvements were particularly notable for texts that were more complex to begin with, though almost all revised texts were above the recommended grade 8 reading score. Together this suggests that ChatGPT may provide a useful 'first draft' of plain language health information that can be further refined through human revision and checking processes.
These findings are consistent with other studies evaluating the capacity of ChatGPT to develop community-facing health information. For example, clinicians have rated ChatGPT summaries of radiology reports as relatively accurate, clear and concise. 19,20 A previous study also reported that ChatGPT typically produced health information above a grade 8 reading level. 17 However, the prompt used in the current study generated texts of a lower grade reading score than the previous study, which produced a SMOG grade reading score of 12.5 17 compared to our score of 11.0.
These findings highlight some of the benefits and limitations of using ChatGPT to improve access to plain language health information. [18][19][20][21][22] Due to ChatGPT's reliance on human input for training, users should also carefully reflect on its potential to perpetuate biases relating to, e.g. race, age, gender, and ethnicity. 16 [10][11][12] Although it is not a complete solution, ChatGPT's strength lies in the speed at which it can redraft plain language content for further review, rather than its ability to generate a 'final' public-facing resource.
This study had several strengths. We evaluated the use of ChatGPT across a wide range of health topics, generated three versions of each text and used multiple objective health literacy assessments. Key messages were developed prior to the study and key message retention ratings were double coded, including by a consumer. Lastly, by documenting how the prompt was developed we highlight the potential pitfalls of other prompts to our readers.
The main limitation of this study is that we did not evaluate how easily consumers could understand the revised texts, using either subjective assessment such as Likert rating scales or objective assessment such as knowledge questions. Another limitation is that we did not explicitly assess potential for harm (e.g. through omission of key messages that are essential for patient safety). ChatGPT will also continue to evolve and will likely improve over time. Results presented in this study reflect ChatGPT-3.5 at the time of data collection, and do not reflect the performance of more recent versions of ChatGPT, which may become more widely used in the future.
Future research could vary the parameters of the original texts. For example, it is unclear how well ChatGPT can simplify information for less prevalent health conditions, different types of resources, longer texts and texts written in different languages or for different regions or cultural contexts. Research could also explore changes in ChatGPT performance over time, and performance of other emerging publicly accessible interfaces to large language models such as Google Bard and Bing Chat. In this study, no personal information is included in the original text because the information is general, but in cases where personal information about a diagnosis or prognosis is entered into ChatGPT, data privacy and ethical concerns may become an issue. With further evidence that ChatGPT can reliably, ethically, and safely produce health information that most people can easily understand, it would be valuable to explore how the platform can be systematically implemented into health literacy tools and health organisation practices.
Interfaces into large language models have the potential to rapidly transform the way plain language health information is produced, especially given the rapid improvements to large language models and the interfaces that make them accessible and useful. This study used multiple objective assessments of health literacy to demonstrate that ChatGPT was able to simplify health information while retaining key messages. However, human oversight remains essential to ensure safety, accuracy, completeness, and effective application of health literacy guidelines.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Table 1 Objective Assessments of Text Health Literacy
Assessment: Description
Number of words: Number of words is not a health literacy assessment but provides context about the extent that ChatGPT 'summarises' the text.
Grade reading score: Grade reading score estimates how difficult a text is to read, and roughly corresponds to expected reading ability for US school students in different grades. In Australia, a grade reading score of 8 or lower is a common target (see for example, Clinical Excellence Commission).

Table 2 Summary of Objective Text Characteristics, Original and Revised Texts (N = 26)
Minimum and maximum scores represent the lowest and highest scores recorded for any ChatGPT text. Target for grade reading score is grade 8; there is no target for complex language (but lower scores are more favourable); target for passive voice is < 2.