Identifying and Extracting Rare Diseases and Their Phenotypes with Large Language Models

Purpose Phenotyping is critical for informing rare disease diagnosis and treatment, but disease phenotypes are often embedded in unstructured text. While natural language processing (NLP) can automate extraction, a major bottleneck is developing annotated corpora. Recently, prompt learning with large language models (LLMs) has been shown to lead to generalizable results without any (zero-shot) or with only a few (few-shot) annotated samples, but no studies have explored this for rare diseases. Our work is the first to study prompt learning for identifying and extracting rare disease phenotypes in the zero- and few-shot settings. Methods We compared the performance of prompt learning with ChatGPT and fine-tuning with BioClinicalBERT. We engineered novel prompts for ChatGPT to identify and extract rare diseases and their phenotypes (e.g., diseases, symptoms, and signs), established a benchmark for evaluating its performance, and conducted an in-depth error analysis. Results Overall, fine-tuning BioClinicalBERT resulted in higher performance (F1 of 0.689) than ChatGPT (F1 of 0.472 and 0.610 in the zero- and few-shot settings, respectively). However, ChatGPT achieved higher accuracy for rare diseases and signs in the one-shot setting (F1 of 0.776 and 0.725). Conversational, sentence-based prompts generally achieved higher accuracy than structured lists. Conclusion Prompt learning using ChatGPT has the potential to match or outperform fine-tuning BioClinicalBERT at extracting rare diseases and signs with just one annotated sample. Given its accessibility, ChatGPT could be leveraged to extract these entities without relying on a large, annotated corpus. While LLMs can support rare disease phenotyping, researchers should critically evaluate model outputs to ensure phenotyping accuracy.


Introduction
Rare diseases are chronically debilitating, often life-limiting conditions that affect 300 million individuals worldwide (Nguengang Wakap et al., 2020). Though individually rare (defined as affecting fewer than 200,000 individuals in the United States), rare diseases are collectively common and represent a serious public health concern (Chung et al., 2022). Because of the lack of knowledge and effective treatment options for rare diseases, patients undergo diagnostic and therapeutic odysseys, in which diagnosis is delayed and effective therapies are difficult to find (Childerhose et al., 2021; Insights, 2020). Rare disease odysseys have devastating medical, psychosocial, and economic consequences for patients and families, resulting in irreversible disease progression, physical suffering, emotional turmoil, and ongoing high medical costs (Cohen and Biesecker, 2010; Carmichael et al., 2015; Yang et al., 2022). Thus, there is an urgent need to shorten rare disease odysseys, and reaching this goal requires effective diagnostic and treatment strategies.
Phenotyping is crucial for informing both strategies. Ongoing initiatives like the National Institutes of Health's Undiagnosed Diseases Network rely on deep phenotyping to generate candidate diseases for diagnosis, identify additional patients with similar clinical manifestations, and personalize treatment or disease management strategies (Tifft and Adams, 2014; Macnamara et al., 2019). In addition, phenotyping can facilitate cohort identification and recruitment for clinical trials critical to the development of novel treatment regimes (Ahmad et al., 2020; Chapman et al., 2021). Because of scarce nosological guidelines, however, rare diseases and their associated phenotypes are seldom represented in international classifications as structured data (Rath et al., 2012). Instead, they are often embedded in unstructured text and require manual extraction by highly trained experts, which is laborious, costly, and susceptible to bias depending on the clinician's background and training. A promising alternative is to leverage natural language processing (NLP) models, which can automatically identify and extract rare disease entities, reduce manual workload, and improve phenotyping efficiency.
Automatic recognition of disease entities, or named entity recognition (NER), is an NLP task that involves the identification and categorization of disease information from unstructured text. This task is especially challenging due to the diversity, complexity, and ambiguity of rare diseases and their phenotypes, which can have different synonyms (e.g., cystic fibrosis and mucoviscidosis), abbreviations (e.g., CF for cystic fibrosis), and modifiers such as body location (e.g., small holes in front of the ear) and severity (e.g., extreme nearsightedness). Descriptions of rare disease phenotypes that are discontinuous, nested, or overlapping present additional challenges; moreover, descriptions that range from short phrases in layman's terms (e.g., distention of the kidney) to medical jargon (e.g., hydronephrosis) may further complicate NER.
Over the last few decades, the rapid evolution of NLP models has led to significant advancements in NER. Early approaches relied on rules derived from extensive manual analysis (Wang et al., 2018); these were later superseded by sequence labeling models, including conditional random fields and recurrent neural networks, that capture contextual information between adjacent words (Li et al., 2015; Patil et al., 2020). Over the last several years, the NER paradigm has shifted toward transformer-based language models like BERT (Bidirectional Encoder Representations from Transformers), which achieved state-of-the-art performance on benchmark datasets (Vaswani et al., 2017; Devlin et al., 2018). Despite their success, a major bottleneck of training models for rare diseases, or for biomedical applications in general, is the development of large, annotated corpora, which is a laborious process that requires manual annotation by domain experts. Recently, OpenAI released ChatGPT, a GPT-based (Generative Pre-trained Transformer) language model capable of following complex human prompts and generating high-quality responses without any annotated data (zero-shot) or with just a few examples (few-shot) (OpenAI, 2022; Agrawal et al., 2022; Hu et al., 2023; Chen et al., 2023). This capability, which provides opportunities to significantly reduce the manual burden of annotation without sacrificing model performance, is especially attractive for NER in the context of rare diseases.
Despite the proliferation of studies on biomedical NER, few have explored this topic for rare diseases. Davis et al. (2013) and Lo Barco et al. (2021) developed NLP algorithms using the Unified Medical Language System Metathesaurus to recognize phenotypes for multiple sclerosis and Dravet syndrome, respectively. Nigwekar et al. (2014) used an unnamed NLP software to identify patients with the terms "calciphylaxis" or "calcific uremic arteriolopathy" in their medical records. Recently, Fabregat et al. (2018) and Segura-Bedmar et al. (2022) leveraged deep learning techniques, including Bidirectional Long Short-Term Memory networks and BERT-based models, to recognize rare diseases and their clinical manifestations from texts. While some studies explored the potential of ChatGPT for diagnosing rare diseases with human-provided suggestions (Lee et al., 2023; Mehnen et al., 2023), none have studied its performance for NER in the zero- or few-shot settings.
To this end, our study makes the following contributions. 1) We designed new prompts for ChatGPT to extract rare diseases and their phenotypes (i.e., diseases, symptoms, and signs) in the zero- and few-shot settings. 2) To the best of our knowledge, this work is the first to establish a benchmark for evaluating ChatGPT's NER performance on a high-quality corpus of annotated texts on rare diseases (Martínez-deMiguel et al., 2022). In addition, we compared prompt learning to fine-tuning by training and evaluating a domain-specific BERT-based model on the annotated corpus. 3) We conducted an in-depth error analysis to elucidate the models' performance and 4) provided suggestions to help guide future work on NER for rare diseases.

Dataset
We used the RareDis corpus, which consists of n = 832 texts from the National Organization for Rare Disorders database (Martínez-deMiguel et al., 2022). This corpus was annotated with four entities (rare diseases, diseases, signs, and symptoms) with an inter-annotator agreement of 83.5% under exact match. Table 1 provides the entity definitions. Unlike corpora with distinct entity types, e.g., {person, location, organization} or {problem, test, treatment}, RareDis consists of entities with considerable semantic overlap. Specifically, rare diseases are a subset of diseases. Diseases can cause or be associated with other diseases as a symptom or sign. The distinction between symptoms and signs is very subtle; while both are abnormalities that may indicate a disease, symptoms are subjective to the patient and cannot be measured by tests or observed by physicians (e.g., pain or loss of appetite). On the other hand, a sign can be measured or observed (e.g., high blood pressure, poor lung function). Across n = 832 texts, there were a total of 7,354 sentences, 4,065 rare diseases, 1,814 diseases, 316 symptoms, and 3,317 signs. Rare diseases and signs were more common than diseases and symptoms, accounting for about 78% of all entities in the corpus (7,382 of 9,512). Fig. 1 provides a summary of counts per text. The RareDis corpus is publicly available and distributed in the Brat standoff format (Stenetorp et al., 2012).
We refer readers to Martínez-deMiguel et al. (2022) for details on the annotation guidelines.
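To make the corpus format concrete, the following is a minimal sketch of reading entity annotations from a Brat standoff file. It is an illustration rather than the loader used in this study. The function name `parse_brat_entities`, the `Entity` tuple, and the label strings in the sample (`RAREDISEASE`, `SIGN`) are assumptions; the sketch relies only on the general Brat convention that text-bound annotations start with "T" and carry tab-separated ID, type plus offsets, and surface text, with discontinuous fragments separated by semicolons.

```python
from typing import List, NamedTuple, Tuple


class Entity(NamedTuple):
    ann_id: str
    label: str
    spans: List[Tuple[int, int]]  # one (start, end) pair per fragment
    text: str


def parse_brat_entities(ann_text: str) -> List[Entity]:
    """Parse text-bound ("T") annotations from the contents of a .ann file."""
    entities = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, attributes, notes, and blank lines
        ann_id, type_and_offsets, surface = line.split("\t", 2)
        label, offsets = type_and_offsets.split(" ", 1)
        spans = []
        # Discontinuous entities list several fragments separated by ";"
        for fragment in offsets.split(";"):
            start, end = fragment.split()
            spans.append((int(start), int(end)))
        entities.append(Entity(ann_id, label, spans, surface))
    return entities


# Hypothetical annotation content, for illustration only.
sample = (
    "T1\tRAREDISEASE 38 58\tabetalipoproteinemia\n"
    "T2\tSIGN 100 104;110 118\tpoor function\n"
    "R1\tProduces Arg1:T1 Arg2:T2\n"
)
entities = parse_brat_entities(sample)
for entity in entities:
    print(entity.label, entity.spans, entity.text)
```

Keeping every fragment of a discontinuous entity, instead of collapsing it to a single span, preserves the information needed to evaluate the discontinuous mentions discussed in the Introduction.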

NER Paradigms and Models
We considered two popular NER paradigms for comparison: 1) pretraining + fine-tuning, and 2) prompt learning (Radford et al., 2018; Liu et al., 2023). The former involves a two-step process where a language model (e.g., BERT) is first trained on a massive amount of unlabeled text data and then fine-tuned on specific downstream NER tasks with labeled data. In the case of BERT models, the objective is to learn general language representations through masked language modeling during the pre-training phase, where BERT learns to predict masked portions of the input based on surrounding text. During the fine-tuning phase, the model is further trained using labeled data from the target task, and its parameters are jointly fine-tuned via supervised learning, allowing BERT to adapt its predictions to the specific task at hand.
In contrast, prompt learning is a more recent paradigm that reformulates the NER task as textual prompts so that the model itself learns to predict the desired output. Prompt learning has been shown to have better generalizability for unseen data with few or even no labeled samples (Agrawal et al., 2022). This is especially attractive for biomedical applications where annotations often require domain expertise and are not widely accessible due to data privacy. We compared BERT- and GPT-based models within the fine-tuning and prompt learning paradigms, respectively, due to their promising empirical performance on NER tasks in the biomedical domain (Yan et al., 2021; Chen et al., 2021, 2023).

Data Pre-processing and Fine-tuning BioClinicalBERT
To pre-process our data, we split the texts into individual words (or subwords) with the BERT tokenizer and added special tokens (i.e., CLS and SEP) to the beginning and end of each tokenized sequence, respectively. We converted the tokens to their respective IDs, padded (or truncated) text sequences to obtain fixed-length inputs, and created an attention mask to distinguish between actual and padding tokens. Last, we mapped our labels, {rare disease, disease, symptom, sign}, to corresponding numerical values.
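The pre-processing steps above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the pipeline used in this study: a toy whole-word vocabulary stands in for the BERT WordPiece tokenizer, and the names `encode`, `vocab`, and `LABEL2ID`, as well as the specific ID values, are hypothetical.

```python
from typing import Dict, List, Tuple

PAD, UNK, CLS, SEP = "[PAD]", "[UNK]", "[CLS]", "[SEP]"


def encode(words: List[str], vocab: Dict[str, int],
           max_len: int) -> Tuple[List[int], List[int]]:
    """Add special tokens, map to IDs, then truncate or pad to max_len."""
    # Reserve two positions for [CLS] and [SEP]; truncate the rest.
    body = [w.lower() for w in words][: max_len - 2]
    tokens = [CLS] + body + [SEP]
    input_ids = [vocab.get(tok, vocab[UNK]) for tok in tokens]
    attention_mask = [1] * len(input_ids)  # 1 marks a real token
    n_pad = max_len - len(input_ids)
    input_ids += [vocab[PAD]] * n_pad
    attention_mask += [0] * n_pad          # 0 marks padding
    return input_ids, attention_mask


# Toy vocabulary (assumption): a real tokenizer would supply subword IDs.
vocab = {PAD: 0, UNK: 1, CLS: 2, SEP: 3, "cystic": 4, "fibrosis": 5}

# Map the four entity labels to numerical values.
LABEL2ID = {label: i for i, label in
            enumerate(["rare disease", "disease", "symptom", "sign"])}

input_ids, attention_mask = encode(["Cystic", "fibrosis"], vocab, max_len=6)
print(input_ids, attention_mask, LABEL2ID)
```

In practice, a token-level tagging scheme would also need a label for tokens outside any entity; that detail is not described in the text and is omitted here.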
We partitioned the data into a training, validation, and test set based on an 8:1:1 ratio. For the base architecture, we selected BioClinicalBERT (Alsentzer et al., 2019), a variant of BERT that was pre-trained on large-scale biomedical (PubMed, ClinicalTrials.gov) and clinical corpora (MIMIC-III; Johnson et al., 2016). To fine-tune BioClinicalBERT on our corpus, we trained the model on the training set and selected hyperparameters using the validation set. We used the test set to evaluate model performance.
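As an illustration of the 8:1:1 partition, the sketch below shuffles document indices with a fixed seed and slices them into three sets. The text does not state how the split was drawn, so the document-level random shuffle, the seed, and the function name `split_811` are assumptions.

```python
import random
from typing import List, Sequence, Tuple


def split_811(items: Sequence, seed: int = 0) -> Tuple[List, List, List]:
    """Shuffle items reproducibly and split them 8:1:1."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(0.8 * len(shuffled))
    n_val = int(0.1 * len(shuffled))
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# 832 documents, identified here by index.
train_docs, val_docs, test_docs = split_811(range(832))
print(len(train_docs), len(val_docs), len(test_docs))
```

Splitting at the document level rather than the sentence level keeps all sentences of a text in the same partition, which avoids leaking a test document's content into training.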

Prompt Learning using ChatGPT (GPT-3.5-turbo)
In this section, we describe our approach to reformulating NER as a text generation task in the zero- and few-shot settings. The former refers to instructing the model to extract entities directly from an input text in the test set, and the latter is similar except we also provide an example of extracted entities from a training text.
Prompt design. Table 2 provides a summary of prompts in the zero- and few-shot settings. The five main building blocks of our prompt designs were 1) task instruction, 2) task guidance, 3) output specification, 4) output retrieval, and, in the few-shot setting, 5) a specific example. Task instruction conveys the overall set of directions for NER in a specific but concise manner. To prevent ChatGPT from rephrasing entities, we instructed it to extract their exact names from the input text. Task guidance provides entity definitions from the original RareDis annotation guidelines. The objective is to help ChatGPT differentiate between entity types within the context of the input text, as all four entities overlap semantically. Output specification instructs ChatGPT to output the extracted entities in a specific format to reduce post-processing workload. Output retrieval prompts the model to generate a response. In the few-shot setting, we also provided an example with an input text from the training set and its gold standard labels (i.e., entities labeled by the annotators).
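The sketch below shows how these building blocks might be assembled into a prompt string. The zero-shot wording follows the prompt example reproduced with the tables of this paper; the layout of the few-shot example ("Example passage" and "Example output") and the function names are hypothetical and are not taken from Table 2.

```python
from typing import List


def build_zero_shot_prompt(entity_plural: str, definition: str,
                           text: str) -> str:
    """Combine task instruction, guidance, and output specification."""
    return (
        f"Extract the exact names of {entity_plural}, "  # task instruction
        f"which are {definition}, "                      # task guidance
        "from this passage and output them in a list: "  # output specification
        f'"{text}"'                                      # input text
    )


def build_few_shot_prompt(entity_plural: str, definition: str,
                          example_text: str, example_entities: List[str],
                          text: str) -> str:
    """Prepend one training example with its gold standard entities.

    The example layout is an assumption made for illustration.
    """
    example_block = (
        f'Example passage: "{example_text}"\n'
        f"Example output: {', '.join(example_entities)}\n"
    )
    return example_block + build_zero_shot_prompt(
        entity_plural, definition, text)


zero_shot = build_zero_shot_prompt(
    "rare diseases",
    "diseases that affect a small number of individuals",
    "[text from test set]",
)
few_shot = build_few_shot_prompt(
    "rare diseases",
    "diseases that affect a small number of individuals",
    "[text from training set]",
    ["[gold standard entity]"],
    "[text from test set]",
)
print(zero_shot)
print(few_shot)
```

Keeping each building block as a separate argument makes it straightforward to swap entity definitions or examples without rewriting the instruction.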
Prompt format. In each setting, we experimented with two prompt formats: simple and structured (Table 2). The former presents the prompt as a simple sentence, and the latter as a structured list. The simple sentence is shorter in length and resembles human instructions provided in a conversational setting where different building blocks (i.e., task instruction, task guidance, and output specification) are woven together as a single unit. Agrawal et al. (2022) and Hu et al. (2023) used a similar approach to extract medications and clinical entities, respectively. In contrast, the structured list resembles a recipe or outline that consists of multiple sub-prompts in a specific order. Chen et al. (2023) used a similar format for evaluating ChatGPT and GPT-4's NER performance on benchmark datasets.
Few-shot example selection. We explored two strategies for selecting an example text in the few-shot setting. The first strategy involved randomly selecting a text from the training set, and the second involved selecting the training text that was most similar to the test text. The motivation for the second strategy is that different rare diseases may have similar etiology, course of progression, and symptoms/signs. For example, Creutzfeldt-Jakob disease and CARASIL (cerebral autosomal recessive arteriopathy with subcortical infarcts and leukoencephalopathy) are neurological conditions that share similar signs, including progressive deterioration of cognitive processes and memory. Thus, providing a training text (and the corresponding gold standard entities) that was most similar to the test text may improve ChatGPT's performance. For each input text from the test set, we selected the training text that had the highest similarity score based on spaCy pre-trained word embeddings (spaCy).
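The similarity-based selection can be sketched as a nearest-neighbor lookup over the training texts. To keep the example self-contained, it swaps in bag-of-words count vectors with cosine similarity in place of the spaCy word embeddings used in this study; with spaCy installed, a document-level score such as `Doc.similarity` would be a natural substitute. The function names and the two toy training texts are hypothetical.

```python
import math
from collections import Counter
from typing import List


def bag_of_words(text: str) -> Counter:
    """Lower-cased whitespace tokens with their counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[tok] * b[tok] for tok in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar_index(test_text: str, train_texts: List[str]) -> int:
    """Index of the training text with the highest similarity score."""
    query = bag_of_words(test_text)
    scores = [cosine(query, bag_of_words(t)) for t in train_texts]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy training texts, for illustration only.
train_texts = [
    "progressive deterioration of memory and cognition",
    "thick mucus in the lungs and pancreas",
]
best = most_similar_index("rapid deterioration of memory", train_texts)
print(best, train_texts[best])
```

The selected training text, together with its gold standard entities, would then be supplied as the single example in the few-shot prompt.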

Evaluation
To evaluate model performance on the test set, we computed the following evaluation metrics: precision, recall, and F1-score. Precision is the percentage of entities extracted by the model that were correct, and recall is the percentage of gold standard entities extracted by the model. F1 accounts for both metrics by taking the harmonic mean of precision and recall. We calculated these metrics under two evaluation settings: exact and relaxed. For an exact match, the extracted and true entity must share the same text span (i.e., boundary) and entity type. For a relaxed match, the extracted and true entity must overlap in boundary and have the same entity type. To ensure that stop words did not influence the evaluation, we removed them from both the gold standard and model-extracted entities.
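The two evaluation settings can be illustrated with a short sketch. It assumes that each entity is represented as a (start, end, type) tuple with character offsets, which is a simplification: ChatGPT returns strings rather than offsets, and the stop-word removal step described above is omitted. The function names and the toy entities are hypothetical.

```python
from typing import List, Tuple

Entity = Tuple[int, int, str]  # (start offset, end offset, entity type)


def is_match(pred: Entity, gold: Entity, relaxed: bool) -> bool:
    """Exact: same span and type. Relaxed: overlapping span and same type."""
    if pred[2] != gold[2]:
        return False
    if relaxed:
        return pred[0] < gold[1] and gold[0] < pred[1]
    return pred[:2] == gold[:2]


def precision_recall_f1(preds: List[Entity], golds: List[Entity],
                        relaxed: bool = False) -> Tuple[float, float, float]:
    # Precision: share of extracted entities that match a gold entity.
    matched_preds = sum(
        any(is_match(p, g, relaxed) for g in golds) for p in preds)
    # Recall: share of gold entities matched by an extracted entity.
    matched_golds = sum(
        any(is_match(p, g, relaxed) for p in preds) for g in golds)
    precision = matched_preds / len(preds) if preds else 0.0
    recall = matched_golds / len(golds) if golds else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0  # harmonic mean
    return precision, recall, f1


golds = [(0, 15, "rare disease"), (30, 49, "sign")]
preds = [(0, 15, "rare disease"), (35, 49, "sign"), (60, 64, "symptom")]

exact = precision_recall_f1(preds, golds, relaxed=False)
relaxed = precision_recall_f1(preds, golds, relaxed=True)
print(exact, relaxed)
```

In this toy case the second prediction overlaps a gold sign without sharing its boundaries, so it counts only under relaxed match: F1 is 0.4 under exact match and 0.8 under relaxed match.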

Overall Results
Table 3 provides a summary of the model performance by entity type. Overall, BioClinicalBERT achieved an F1-score of 0.689 under relaxed match. In the zero-shot setting, ChatGPT achieved F1-scores of 0.472 and 0.407 with the simple sentence and structured list prompts, respectively. Performance generally improved in the few-shot setting with F1-scores of 0.591 and 0.469; choosing the training text based on a similarity score led to additional improvement, resulting in F1-scores of 0.610 and 0.544. For some entities, ChatGPT had similar or better performance than its supervised counterpart, achieving F1-scores of 0.776 (vs. 0.755) and 0.725 (vs. 0.704) for rare diseases and signs, respectively, in the few-shot setting. Compared to prompts written as a structured list, simple sentences generally achieved similar or better performance, suggesting that ChatGPT may be more receptive to conversational prompts. Moreover, simple sentences required fewer tokens and were preferred over structured lists from a cost perspective. In the few-shot setting, selecting a training example that was similar to the input text led to better performance than random selection.
Among the four entities, rare diseases were associated with the highest accuracy for both models across all settings. In contrast, diseases were more challenging for both models. While BioClinicalBERT performed similarly at extracting signs and symptoms, ChatGPT achieved significantly better performance for signs. Because the only difference between the prompts for these entities was the task guidance, i.e., specifying symptoms as problems that cannot be measured, whereas signs can be measured, this finding suggests that ChatGPT is sensitive to even small variations in the prompt.

Detailed Error Analysis
We conducted an in-depth error analysis to elucidate ChatGPT's performance. This analysis was crucial for gaining additional insight, as unlike other biomedical corpora, RareDis contains entities with overlapping semantics. Specifically, rare diseases are similar to diseases, and symptoms to signs. Depending on the context of the input text, diseases can also be symptoms or signs.
In our analysis, we considered five types of errors: 1) incorrect boundary, 2) incorrect entity type, 3) incorrect boundary and entity type, 4) spurious, and 5) missed. The first and second refer to an extracted entity whose boundaries or type do not match those of the gold standard label, respectively. The third refers to the case where neither the extracted entity's boundaries nor its type match those of the true label. Spurious entities are extracted entities that do not correspond to gold standard labels (false positives), and missed entities are entities that the model failed to extract (false negatives).
Table 4 shows the distribution of errors in the few-shot setting under exact match. The most common error type for rare diseases was false negatives (45%), followed by incorrect entity type (31%). In the case of entity type errors, ChatGPT tended to label rare diseases as diseases. These errors may be attributed to the fact that there is no single definition of rare diseases; rather, the definition can vary by country or location (e.g., a disease is a rare disease if it affects fewer than 200,000 people in the United States or no more than 1 in 2,000 in the European Union). Moreover, this definition is subject to change over time, as a disease that was rare at the time of annotation may have become more prevalent, or vice versa. Because annotations are subjective, it is possible that what the domain experts deemed rare diseases may not be reflected in textual information on the Internet before September 2021, ChatGPT's knowledge cut-off date. For instance, the annotators labeled "gastrointestinal anthrax" and "cutaneous anthrax" as rare diseases based on domain knowledge, but neither was listed in rare disease databases at the time of writing this manuscript. For diseases, signs, and symptoms, false positives and false negatives were the most common error types. Based on manual review, many of these errors can be attributed to the challenge of differentiating among these entities. Specifically, diseases could be signs or symptoms, and the difference between signs and symptoms is very subtle. In some cases, gold standard labels deviated from the definitions provided in the annotation guidelines, as the lack of abnormalities was also labeled as an entity (e.g., "asymptomatic during infancy or childhood" was labeled as a symptom by the annotators). As such, a portion of false negatives could be attributed to these edge cases.
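One way to operationalize these five error types is sketched below. It assumes entities are (start, end, type) tuples with character offsets, that an extracted entity "corresponds" to a gold label when their spans overlap, and that when several gold labels overlap a prediction, the least severe category is reported. These rules, the category ranking, and the function names are assumptions made for illustration; the manual analysis in this study may have applied different criteria.

```python
from collections import Counter
from typing import List, Tuple

Entity = Tuple[int, int, str]  # (start offset, end offset, entity type)

# Lower rank wins when a prediction overlaps several gold entities.
RANK = {
    "correct": 0,
    "incorrect boundary": 1,
    "incorrect entity type": 2,
    "incorrect boundary and entity type": 3,
    "spurious": 4,
}


def overlaps(a: Entity, b: Entity) -> bool:
    return a[0] < b[1] and b[0] < a[1]


def classify_prediction(pred: Entity, golds: List[Entity]) -> str:
    best = "spurious"  # default: no corresponding gold label
    for gold in golds:
        if not overlaps(pred, gold):
            continue
        same_span = pred[:2] == gold[:2]
        same_type = pred[2] == gold[2]
        if same_span and same_type:
            category = "correct"
        elif same_type:
            category = "incorrect boundary"
        elif same_span:
            category = "incorrect entity type"
        else:
            category = "incorrect boundary and entity type"
        if RANK[category] < RANK[best]:
            best = category
    return best


def error_distribution(preds: List[Entity], golds: List[Entity]) -> Counter:
    counts = Counter(classify_prediction(p, golds) for p in preds)
    # Missed: gold entities that no extracted entity overlaps.
    counts["missed"] = sum(
        not any(overlaps(p, g) for p in preds) for g in golds)
    return counts


golds = [(0, 10, "rare disease"), (20, 30, "sign"),
         (40, 50, "symptom"), (60, 70, "disease")]
preds = [(0, 10, "disease"), (22, 30, "sign"),
         (41, 50, "sign"), (80, 90, "sign")]
errors = error_distribution(preds, golds)
print(errors)
```

In this toy example each of the five error types occurs exactly once, which makes it easy to check the categorization rules by hand.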

Discussion
In this work, we reformulated NER as a text generation task and established a benchmark for ChatGPT's performance on extracting rare disease phenotypes. Overall, while fine-tuning a pre-trained biomedical language model led to better performance, prompt learning with ChatGPT achieved similar or higher accuracy for some entities (i.e., rare diseases and signs) with a single example, demonstrating its potential for out-of-the-box NER in the few-shot setting. Given ChatGPT's performance in the zero-shot setting, the model could be leveraged as a pre-annotation tool to accelerate annotation start-up times for rare diseases and signs (F1-scores of 0.761 and 0.627, respectively). Overall, we recommend simple, sentence-based prompts, as they performed similarly to or better than lists and were shorter in length, leading to lower computational cost.
While other studies explored supervised deep learning techniques for extracting rare disease phenotypes, ours is the first to study ChatGPT in the zero- and few-shot settings. Segura-Bedmar et al. (2022) compared the NER performance of base BERT, BioBERT, and ClinicalBERT, and found that ClinicalBERT had the highest overall F1-score (0.695). This is comparable to BioClinicalBERT's performance in the current study (0.689). Fabregat et al. (2018) used support vector machines and neural networks with a long short-term memory architecture to extract disabilities associated with rare diseases and obtained an F1-score of 0.81. While this is much higher than the overall F1-score in the current study, Fabregat et al. (2018) focused on extracting a single entity, i.e., disabilities, whereas our goal was to recognize and differentiate among four entities with overlapping semantics. Hu et al. (2023) and Chen et al. (2023) evaluated ChatGPT on biomedical NER and found that it had lower performance than fine-tuning pre-trained language models. While our overall results aligned with this finding, we discovered that ChatGPT had similar or better performance on specific entities, suggesting that with appropriate prompt engineering, the model has the potential to match or outperform fine-tuned language models for certain entity types.
Our work has several potential limitations and extensions. First, we only had access to a subset of the RareDis corpus (832 out of 1,041 texts), so our results may not fully reflect ChatGPT's performance across the entire spectrum of rare diseases. Second, the current work focuses on ChatGPT and does not include GPT-4 or other large language models (e.g., LLaMA, Alpaca), so broadening the current set of experiments to include such models is a natural extension. Last, though manually created prompts are intuitive and interpretable, evidence suggests that small changes can lead to variations in performance (Cui et al., 2021). A promising alternative is to automate the prompt engineering process. To this end, Gutiérrez et al. (2022) employed a semi-automated approach combining manually created prompts with an automatic procedure to choose the best prompt combination with cross-validation. In addition, fully automated prompt learning approaches, where the prompt is described directly in the embedding space of the underlying language model, are also interesting extensions of the current work (Ma et al., 2021; Taylor et al., 2022).
The advent of large language models is creating unprecedented opportunities for rare disease phenotyping by automatically identifying and extracting disease-related concepts. While these models provide valuable insights and assistance, researchers and clinicians should critically evaluate model outputs and be well-informed of their limitations when considering them as tools for supporting rare disease diagnosis and treatment.

Figure 1: Number of sentences and entities per document.

Table 1: Summary of entity definitions.
Prompt template: Extract the exact names of rare diseases, which are [defn], from this passage and output them in a list: "[text from test set]".
Prompt example: Extract the exact names of rare diseases, which are diseases that affect a small number of individuals, from this passage and output them in a list: "The exact prevalence and incidence of abetalipoproteinemia is unknown, but it is estimated to affect ...".

Table 2: Summary of prompts. Different parts of the prompt are color-coded as follows: task instruction, task guidance, output specification, output retrieval, and specific example.

Table 4: Error analysis for ChatGPT in the few-shot setting under exact match.