1 Introduction

Disease-symptom knowledge bases are the foundation for many medical tasks, including medical diagnosis [9] and the discovery of unexpected associations between diseases [12, 14]. Most knowledge bases capture only a binary relationship between diseases and symptoms, neglecting how important a symptom is for a disease. For example, abdominal pain and nausea are both symptoms of appendicitis, but while abdominal pain is a key differentiating factor, nausea does little to distinguish appendicitis from other diseases of the digestive system. Although several disease-symptom extraction methods have been proposed that retrieve a ranked list of symptoms for a disease [7, 10, 13, 14], no collection is available to systematically evaluate the performance of such methods [11]. These methods are used extensively in downstream tasks, e.g., to increase the accuracy of computer-assisted medical diagnosis [9], yet their effectiveness for disease-symptom extraction remains unclear.

In this paper, we introduce the Disease-Symptom Relation Collection (dsr-collection) for the evaluation of graded disease-symptom relations. The collection is annotated by five physicians and contains 235 symptoms for 20 diseases. We label the symptoms using graded judgments [5], differentiating between relevant symptoms (graded as 1) and primary symptoms (graded as 2). Primary symptoms, also called cardinal symptoms, are the leading symptoms that guide physicians in the process of disease diagnosis. The graded judgments allow us, for the first time, to measure the importance of different symptoms with grade-based metrics such as nDCG [4].

As baselines, we implement two methods from previous studies to compute graded disease-symptom relations. In the first method [10], the relation is the cosine similarity between the word vectors of a disease and a symptom, taken from a word embedding model. In the second method [14], the relation between a disease and a symptom is calculated based on their co-occurrence in the MeSH-keywords of medical articles. We describe limitations of the keyword-based method [14] and propose an adaption in which we calculate the relations not only on the keywords of medical articles, but also on the full text and the title.

We evaluate the baselines on the dsr-collection to compare their effectiveness in the extraction of graded disease-symptom relations. As evaluation metrics, we consider precision, recall, and nDCG. For all three metrics, our proposed adapted version of the keyword-based method outperforms the other methods, providing a strong baseline for the dsr-collection.

The contributions of this paper are the following:

  • We introduce the dsr-collection for the evaluation of graded disease-symptom relations. We make the collection freely available to the research community.

  • We compare various baselines on the dsr-collection to give insights on their effectiveness in the extraction of disease-symptom relations.

2 Disease-Symptom Relation Collection

In this section, we describe the new Disease-Symptom Relation Collection (dsr-collection) for the evaluation of disease-symptom relations. We create the collection in two steps. In the first step, relevant disease-symptom pairs (e.g. appendicitis-nausea) are collected by two physicians. They collect the pairs in a collaborative effort from high-quality sources, including medical textbooks and an online information service that is curated by medical experts.

In the second step, the primary symptoms among the collected disease-symptom pairs are annotated. This annotation incorporates graded relevance information into the collection. For the annotation procedure, we develop guidelines that briefly describe the task, as well as an online annotation tool. The annotation of primary symptoms is then conducted by three physicians, and the final label is obtained by majority voting. Based on these labels, we assign the relevance score 2 to primary symptoms and 1 to the other symptoms, which we call relevant symptoms.
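The label aggregation described above can be sketched as follows. This is a minimal illustration, not the authors' annotation tool: the function name `grade_symptoms`, the input layout (one boolean per annotator), and the example votes are all hypothetical.

```python
def grade_symptoms(annotations):
    """Assign grade 2 (primary) or 1 (relevant) to each symptom by majority vote.

    `annotations` maps a symptom name to a list of booleans, one per annotator;
    True means that annotator marked the symptom as primary. This layout is an
    assumption made for illustration.
    """
    grades = {}
    for symptom, votes in annotations.items():
        primary_votes = sum(votes)  # True counts as 1
        # Strict majority of annotators -> primary symptom (grade 2).
        grades[symptom] = 2 if primary_votes > len(votes) / 2 else 1
    return grades


# Hypothetical votes of three annotators for two symptoms of one disease.
example_votes = {
    "abdominal pain": [True, True, False],
    "nausea": [False, False, True],
}
grades = grade_symptoms(example_votes)
```

With three annotators a tie cannot occur, so no tie-breaking rule is needed in this sketch.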

In total, the dsr-collection contains relevant symptoms and primary symptoms for 20 diseases. We give an overview of the collection in Table 1. For the 20 diseases, the collection contains a total of 235 symptoms, of which 55 are labeled as primary symptoms (about 23%). The three most frequent symptoms are fatigue, which appears for 15 of the 20 diseases, fever, which appears for 10, and coughing, which appears for 7. Note that the diseases are selected from different medical disciplines: mental (e.g. Depression), dental (e.g. Periodontitis), digestive (e.g. Appendicitis), and respiratory (e.g. Asthma).

Table 1. Overview of the dsr-collection. For each disease, we display the number of relevant symptoms (#S), the number of primary symptoms (#P), and the Fleiss’ inter-annotator agreement (\(\kappa \)).

We calculate the inter-annotator agreement using Fleiss’ kappa [2], a statistical measure to compute the agreement for three or more annotators. For the annotation of the primary symptoms, we measure a kappa value of \(\kappa =0.61\), which indicates a substantial agreement between the three annotators [6]. Individual \(\kappa \)-values per disease are reported in Table 1. By analyzing the disagreements, we found that the annotators labeled primary symptoms with varying frequencies: The first annotator annotated on average 2.1 primary symptoms per disease, the second 2.8, and the third 3.8.
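For readers unfamiliar with the measure, the following is a minimal sketch of Fleiss' kappa in its standard textbook form. The function name and the input layout (a count matrix of items by categories) are assumptions for illustration; the toy counts are not taken from the collection.

```python
def fleiss_kappa(ratings):
    """Compute Fleiss' kappa.

    `ratings[i][j]` is the number of annotators who assigned item i to
    category j. Every item must be rated by the same number of annotators.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_categories = len(ratings[0])

    # Share of all assignments that fall into each category.
    p_j = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per item: fraction of agreeing annotator pairs.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_bar = sum(p_i) / n_items          # mean observed agreement
    p_e = sum(p * p for p in p_j)       # agreement expected by chance
    # Note: kappa is undefined (division by zero) if all ratings fall
    # into a single category, i.e. p_e == 1.
    return (p_bar - p_e) / (1 - p_e)


# Toy example: 4 symptoms, 3 annotators, categories = [primary, not primary].
toy_counts = [[3, 0], [0, 3], [2, 1], [0, 3]]
kappa = fleiss_kappa(toy_counts)
```

In the setting described above, each item would be one disease-symptom pair and the two categories would be "primary" and "not primary".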

Vocabulary Compatibility: We map each disease and symptom of the collection to the Unified Medical Language System (UMLS) vocabulary. The UMLS is a compendium of over 100 vocabularies (e.g. ICD-10, MeSH, SNOMED-CT) that are cross-linked with each other. This makes the collection compatible with the UMLS vocabulary and also with each of the over 100 cross-linked vocabularies.

Although the different vocabularies are compatible with the collection, a fair comparison of methods is only possible when the methods utilize the same vocabulary since the vocabulary impacts the evaluation outcome. For instance, the symptom loss of appetite is categorized as a symptom in MeSH; whereas, in the cross-linked UMLS vocabulary, it is categorized as a disease. Therefore, the symptom loss of appetite can be identified when using the MeSH vocabulary, but it cannot be identified when using the UMLS vocabulary.

Evaluation: We consider the following evaluation metrics for the collection: Recall@k, Precision@k, and nDCG@k at the cutoffs \(k=5\) and \(k=10\). Recall measures how many of the relevant symptoms are retrieved, Precision measures how many of the retrieved symptoms are relevant, and nDCG is a standard metric to evaluate graded relevance [5].
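The three metrics can be sketched as below. This is an illustrative implementation under stated assumptions: the nDCG variant uses the graded label directly as gain with a \(\log_2\) rank discount, which is one common formulation and not necessarily the exact one used for the reported results. The function names and the example ranking are hypothetical.

```python
import math


def precision_at_k(ranked, grades, k):
    """Fraction of the top-k retrieved symptoms that are relevant (grade > 0)."""
    return sum(1 for s in ranked[:k] if grades.get(s, 0) > 0) / k


def recall_at_k(ranked, grades, k):
    """Fraction of all relevant symptoms that appear in the top-k."""
    n_relevant = sum(1 for g in grades.values() if g > 0)
    hits = sum(1 for s in ranked[:k] if grades.get(s, 0) > 0)
    return hits / n_relevant if n_relevant else 0.0


def dcg(gains):
    """Discounted cumulative gain with a log2 discount (rank 1 is undiscounted)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg_at_k(ranked, grades, k):
    """DCG of the top-k ranking divided by the DCG of the ideal top-k ranking."""
    gains = [grades.get(s, 0) for s in ranked[:k]]
    ideal = sorted(grades.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg else 0.0


# Hypothetical judgments (2 = primary, 1 = relevant) and a system ranking.
grades = {"abdominal pain": 2, "nausea": 1, "fever": 1}
ranked = ["nausea", "obesity", "abdominal pain", "fatigue", "fever"]

p5 = precision_at_k(ranked, grades, 5)
r5 = recall_at_k(ranked, grades, 5)
n5 = ndcg_at_k(ranked, grades, 5)
```

Because nDCG rewards placing grade-2 symptoms above grade-1 symptoms, it is the only one of the three metrics that makes use of the primary-symptom labels.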

3 Disease-Symptom Extraction Methods

3.1 Related Methods

In this section, we discuss disease-symptom extraction methods used in previous studies. A commonly used resource for the extraction of disease-symptom relations is the PubMed database. PubMed contains more than 30 million biomedical articles, including the abstract, title, and various meta-data. Previous work [3, 7] uses the abstracts of the PubMed articles together with rule-based approaches. In particular, Hassan et al. [3] derive patterns of disease-symptom relations from dependency graphs, followed by the automatic selection of the best patterns based on proposed selection criteria. Martin et al. [7] generate extraction rules automatically, which are then inspected for their viability by medical experts. Xia et al. [13] design special queries that include the name and synonyms of each disease and symptom. They use these queries to retrieve the relevant articles, and use the number of retrieved results to rank symptoms via Pointwise Mutual Information (PMI).
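As a rough illustration of ranking by PMI from result counts, the sketch below uses the standard definition \(\text{PMI}(x,s)=\log_2 \frac{p(x,s)}{p(x)\,p(s)}\) with probabilities estimated as hit counts divided by the corpus size. How [13] estimates these quantities is not described here, so the estimator, the function name, and the counts are assumptions.

```python
import math


def pmi(hits_joint, hits_disease, hits_symptom, total_articles):
    """Pointwise mutual information estimated from retrieval hit counts.

    hits_joint:     articles matching both the disease and the symptom query
    hits_disease:   articles matching the disease query
    hits_symptom:   articles matching the symptom query
    total_articles: size of the searched corpus
    """
    if hits_joint == 0:
        return float("-inf")  # the pair never co-occurs
    return math.log2(hits_joint * total_articles / (hits_disease * hits_symptom))


# Hypothetical counts: the pair co-occurs 25 times more often than chance.
score = pmi(hits_joint=50, hits_disease=100, hits_symptom=200, total_articles=10_000)
```

A score of 0 corresponds to statistical independence; positive scores indicate that the disease and symptom co-occur more often than expected by chance.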

The mentioned studies use resources that are not publicly available, i.e., rules in [3, 7] and special queries in [13]. To enable reproducibility in future studies, we define our baselines based on the methods that only utilize publicly available resources, described in the next section.

3.2 Baseline Methods

Here, we first describe two recently proposed methods [10, 14] for the extraction of disease-symptom relations as our baselines. Afterwards, we describe limitations of the method described in [14] and propose an adapted version in which the limitations are addressed. We apply the methods on the open-access subset of the PubMed Central (PMC) database, containing 1,542,847 medical articles. To have a common representation for diseases/symptoms across methods (including a unique name and identifier), we consider the 382 symptoms and 4,787 diseases from the Medical Subject Headings (MeSH) vocabulary [14]. Given the set of diseases (X) and symptoms (S), each method aims to compute a relation scoring function \(\lambda (x,s) \in \mathbb {R}\) between a disease \(x \in X\) and a symptom \(s \in S\). In the following, we explain each method in detail.

\(\small \textsc {Embedding}\): Proposed by Shah et al. [10], the method is based on the cosine similarity of the vector representations of a disease and a symptom. We first apply MetaMap [1], a tool for the identification of medical concepts within a given text, to the full text of all PMC articles to substitute the identified diseases/symptoms by their unique names. Then, we train a word2vec model [8] with 300 dimensions and a window size of 15, following the parameter setting in [10]. Using the word embedding, the disease-symptom relation is defined as \(\lambda (x,s)=\text {cos}(\mathbf {e}_x, \mathbf {e}_s)\), where \(\mathbf {e}\) refers to the vector representation of a word.
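The scoring step \(\lambda (x,s)=\text {cos}(\mathbf {e}_x, \mathbf {e}_s)\) can be sketched as follows. The sketch assumes the embeddings are already available as a plain dictionary from concept name to vector; training the word2vec model and running MetaMap are outside its scope. The function names and the three-dimensional toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_symptoms(disease, symptoms, embeddings):
    """Return (symptom, score) pairs sorted by descending cosine similarity."""
    e_x = embeddings[disease]
    scored = [
        (s, cosine(e_x, embeddings[s]))
        for s in symptoms
        if s in embeddings  # skip symptoms that have no vector
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors for illustration only; real vectors would have 300 dimensions.
toy_embeddings = {
    "appendicitis": [1.0, 0.0, 0.0],
    "abdominal pain": [0.9, 0.1, 0.0],
    "obesity": [0.0, 1.0, 0.0],
}
ranking = rank_symptoms("appendicitis", ["obesity", "abdominal pain"], toy_embeddings)
```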

\(\small \textsc {CoOccur}\): This method, proposed by Zhou et al. [14], calculates the relation between a disease and a symptom by measuring the degree of their co-occurrence in the MeSH-keywords of medical articles. The raw co-occurrence of the disease x and the symptom s is denoted by \(\text {co}(x,s)\). The raw co-occurrence does not consider the overall appearance of each symptom across diseases. For instance, symptoms like pain or obesity tend to co-occur with many diseases, and are therefore less informative. Hence, the raw co-occurrence is normalized by an Inverse Symptom Frequency (ISF) measure, defined as \(\text {ISF}(s)=\frac{|X|}{n_s}\), where |X| is the total number of diseases and \(n_s\) is the number of diseases that co-occur with s in at least one article. Finally, the disease-symptom relation is defined as \(\lambda (x,s)=\text {co}(x,s)\times \text {ISF}(s)\). We compute three variants of the \(\small \textsc {CoOccur}\) method:

  • \(\small \textsc {Kwd}\): The disease-symptom relations are computed using the MeSH-keywords of the \(\approx \)1.5 million PMC articles.

  • \(\small \textsc {KwdLarge}\): While \(\small \textsc {Kwd}\) uses the 1.5 million PMC articles, Zhou et al. [14] apply the exact same method on the \(\approx \)30 million articles of the PubMed database. While they did not evaluate the effectiveness of their disease-symptom relation extraction method, they published their relation scores which we will evaluate in this paper.

  • \(\small \textsc {FullText}\): Applying the \(\small \textsc {CoOccur}\) method only on MeSH-keywords has two disadvantages: First, keywords are not available for all articles (e.g. only 30% of the \(\approx \)1.5 million PMC articles have keywords), and second, usually only the core topics of an article occur as keywords. We address these limitations by proposing an adaption of the \(\small \textsc {CoOccur}\) method, in which we use the full text, the title, and the keywords of the \(\approx \)1.5 million PMC articles. Specifically, we adapt the computation of the co-occurrence \(\text {co}(x,s)\) as follows: We first retrieve the set of articles relevant to a disease x, where an article is relevant if the disease occurs in either the keywords or the title of the article. Given these relevant articles and a symptom s, we compute the adapted co-occurrence \(\text {co}(x,s)\) as the number of relevant articles in which the symptom occurs in the full text. The identification of the diseases in the title and of the symptoms in the full text is done using the MetaMap tool [1].
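The scoring \(\lambda (x,s)=\text {co}(x,s)\times \text {ISF}(s)\) and the difference between the variants can be sketched as follows. The sketch assumes that concept identification has already been done, so each article is reduced to a pair of sets: the disease concepts found in its "disease fields" and the symptom concepts found in its "symptom fields". The function name, this input layout, and the toy articles are hypothetical.

```python
from collections import defaultdict


def cooccurrence_scores(articles, diseases, symptoms):
    """Compute lambda(x, s) = co(x, s) * ISF(s) for all co-occurring pairs.

    `articles` is a list of (disease_terms, symptom_terms) pairs of sets.
    For the keyword variants both sets would come from the MeSH-keywords;
    for the full-text variant, disease_terms would come from keywords and
    title, and symptom_terms from the full text.
    """
    # Raw co-occurrence: number of articles in which x and s both occur.
    co = defaultdict(int)
    for disease_terms, symptom_terms in articles:
        for x in disease_terms & diseases:
            for s in symptom_terms & symptoms:
                co[(x, s)] += 1

    # n_s: number of distinct diseases co-occurring with s in >= 1 article.
    diseases_per_symptom = defaultdict(set)
    for x, s in co:
        diseases_per_symptom[s].add(x)

    scores = {}
    for (x, s), count in co.items():
        isf = len(diseases) / len(diseases_per_symptom[s])
        scores[(x, s)] = count * isf
    return scores


# Toy corpus of three articles (illustrative only).
toy_diseases = {"appendicitis", "asthma"}
toy_symptoms = {"abdominal pain", "pain"}
toy_articles = [
    ({"appendicitis"}, {"abdominal pain", "pain"}),
    ({"appendicitis"}, {"abdominal pain"}),
    ({"asthma"}, {"pain"}),
]
scores = cooccurrence_scores(toy_articles, toy_diseases, toy_symptoms)
```

In this toy corpus, pain co-occurs with both diseases and therefore receives ISF 1, while abdominal pain co-occurs with only one disease and receives ISF 2. Keeping the scoring function independent of where the term sets come from means the three variants differ only in how the input pairs are built.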

4 Evaluation Results and Discussion

We now compare the disease-symptom extraction baselines on the proposed dsr-collection. The results for various evaluation metrics are shown in Table 2. The \(\small \textsc {FullText}\)-variant of the \(\small \textsc {CoOccur}\) method outperforms the other baselines on all evaluation metrics. This demonstrates the high effectiveness of our proposed adaption to the \(\small \textsc {CoOccur}\) method.

Further, we see a clear advantage of the \(\small \textsc {CoOccur}\)-method with MeSH-keywords from \(\approx \)30 million PubMed articles as the resource (\(\small \textsc {KwdLarge}\)) – in comparison to the same method with keywords from approximately 1.5 million PMC articles (\(\small \textsc {Kwd}\)). This highlights the importance of the number of input samples to the method.

Table 2. Comparison of the disease-symptom extraction methods using our proposed dsr-collection. Significant improvements are indicated by letters: a refers to \(\small \textsc {Embedding}\), b to \(\small \textsc {Kwd}\), and c to \(\small \textsc {KwdLarge}\) (two-sided, paired t-test: \(p<0.01\)).

Table 3. Top-4 extracted symptoms of each method for the disease appendicitis. The retrieved relevant symptoms and primary symptoms are highlighted.

Error Analysis: A common source of errors is the fine granularity of the symptoms in the medical vocabularies. For example, the utilized MeSH vocabulary contains the symptoms abdominal pain and abdomen, acute. Both symptoms can be found in the top ranks of the evaluated methods for the disease appendicitis (see Table 3). However, since the collection is not labeled at such a fine-grained level, the symptom abdomen, acute is counted as a false positive.

Another source of errors is the bias in medical articles towards specific disease-symptom relationships. For instance, a special relationship exists between the symptom obesity and periodontitis, which is the topic of various publications. Although obesity is not a characteristic symptom of periodontitis, all methods return the symptom in the top-3 ranks. A promising research direction is the selective extraction of symptoms from biomedical literature by also considering the context (e.g. the sentence) in which a disease/symptom appears.

5 Conclusion

We introduced the Disease-Symptom Relation Collection (dsr-collection) for the evaluation of graded disease-symptom relations. We provided baseline results for two recent methods, one based on word embeddings and the second on the co-occurrence of MeSH-keywords of medical articles. We proposed an adaption to the co-occurrence method to make it applicable to the full text of medical articles and showed significant improvement of effectiveness over the other methods.