DSR: A Collection for the Evaluation of Graded Disease-Symptom Relations

Zlabinger, Markus; Hofstätter, Sebastian; Rekabsaz, Navid; Hanbury, Allan

doi:10.1007/978-3-030-45442-5_54

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12036)

Included in the following conference series:

European Conference on Information Retrieval

Abstract

The effective extraction of ranked disease-symptom relationships is a critical component in various medical tasks, including computer-assisted medical diagnosis or the discovery of unexpected associations between diseases. While existing disease-symptom relationship extraction methods are used as the foundation in the various medical tasks, no collection is available to systematically evaluate the performance of such methods. In this paper, we introduce the Disease-Symptom Relation Collection (dsr-collection), created by five physicians as expert annotators. We provide graded symptom judgments for diseases by differentiating between relevant symptoms and primary symptoms. Further, we provide several strong baselines, based on the methods used in previous studies. The first method is based on word embeddings, and the second on co-occurrences of MeSH-keywords of medical articles. For the co-occurrence method, we propose an adaption in which not only keywords are considered, but also the full text of medical articles. The evaluation on the dsr-collection shows the effectiveness of the proposed adaption in terms of nDCG, precision, and recall.

1 Introduction

Disease-symptom knowledge bases are the foundation for many medical tasks, including medical diagnosis [9] or the discovery of unexpected associations between diseases [12, 14]. Most knowledge bases only capture a binary relationship between diseases and symptoms, neglecting the degree of importance of a symptom for a disease. For example, abdominal pain and nausea are both symptoms of appendicitis, but while abdominal pain is a key differentiating factor, nausea does little to distinguish appendicitis from other diseases of the digestive system. While several disease-symptom extraction methods have been proposed that retrieve a ranked list of symptoms for a disease [7, 10, 13, 14], no collection is available to systematically evaluate the performance of such methods [11]. Although these methods are extensively used in downstream tasks, e.g., to increase the accuracy of computer-assisted medical diagnosis [9], their effectiveness for disease-symptom extraction remains unclear.

In this paper, we introduce the Disease-Symptom Relation Collection (dsr-collection) for the evaluation of graded disease-symptom relations. The collection is annotated by five physicians and contains 235 symptoms for 20 diseases. We label the symptoms using graded judgments [5], where we differentiate between relevant symptoms (graded as 1) and primary symptoms (graded as 2). Primary symptoms, also called cardinal symptoms, are the leading symptoms that guide physicians in the process of disease diagnosis. The graded judgments allow us, for the first time, to measure the importance of different symptoms with grade-based metrics, such as nDCG [4].

As baselines, we implement two methods from previous studies to compute graded disease-symptom relations: In the first method [10], the relation is the cosine similarity between the word vectors of a disease and a symptom, taken from a word embedding model. In the second method [14], the relation between a disease and symptom is calculated based on their co-occurrence in the MeSH-keywords^{Footnote 1} of medical articles. We describe limitations of the keyword-based method [14] and propose an adaption in which we calculate the relations not only on keywords of medical articles, but also on the full text and the title.

We evaluate the baselines on the dsr-collection to compare their effectiveness in the extraction of graded disease-symptom relations. As evaluation metrics, we consider precision, recall, and nDCG. For all three metrics, our proposed adapted version of the keyword-based method outperforms the other methods, providing a strong baseline for the dsr-collection.

The contributions of this paper are the following:

We introduce the dsr-collection for the evaluation of graded disease-symptom relations. We make the collection freely available to the research community.^{Footnote 2}
We compare various baselines on the dsr-collection to give insights on their effectiveness in the extraction of disease-symptom relations.

2 Disease-Symptom Relation Collection

In this section, we describe the new Disease-Symptom Relation Collection (dsr-collection) for the evaluation of disease-symptom relations. We create the collection in two steps: In the first step, relevant disease-symptom pairs (e.g. appendicitis-nausea) are collected by two physicians. They collect the pairs in a collaborative effort from high-quality sources, including medical textbooks and an online information service^{Footnote 3} that is curated by medical experts.

In the second step, the primary symptoms of the collected disease-symptom pairs are annotated. The annotation of primary symptoms is conducted to incorporate graded relevance information into the collection. For the annotation procedure, we develop guidelines that briefly describe the task and an online annotation tool. Then, the annotation of primary symptoms is conducted by three physicians. The final label is obtained by majority voting. Based on the labels obtained from the majority voting, we assign the relevance score 2 to primary symptoms and 1 to the other symptoms, which we call relevant symptoms.
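The grading step can be illustrated with a minimal sketch. The function name `grade_symptoms` and the example votes are hypothetical and only serve to show how a majority vote of three annotators could be turned into the relevance scores 1 and 2; this is not the annotation tool used for the collection.

```python
def grade_symptoms(annotations):
    """Map each (disease, symptom) pair to a relevance grade.

    annotations: dict mapping a pair to a list of booleans, one per
    annotator (True = the annotator marked the symptom as primary).
    Returns grade 2 (primary) if a strict majority voted True, else 1.
    """
    grades = {}
    for pair, votes in annotations.items():
        primary_votes = sum(1 for vote in votes if vote)
        grades[pair] = 2 if primary_votes > len(votes) / 2 else 1
    return grades


# Invented votes from three annotators, for illustration only.
example_votes = {
    ("appendicitis", "abdominal pain"): [True, True, False],
    ("appendicitis", "nausea"): [False, True, False],
}
grades = grade_symptoms(example_votes)
```

With three annotators, a tie cannot occur, so the strict-majority rule always yields a label.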

In total, the dsr-collection contains relevant symptoms and primary symptoms for 20 diseases. We give an overview of the collection in Table 1. For the 20 diseases, the collection contains a total of 235 symptoms, of which 55 are labeled as primary symptoms (about 23%). The three most frequent symptoms are fatigue, which appears for 15 of the 20 diseases; fever, which appears for 10; and coughing, which appears for 7. Note that the diseases are selected from different medical disciplines: mental (e.g. Depression), dental (e.g. Periodontitis), digestive (e.g. Appendicitis), and respiratory (e.g. Asthma).

Table 1. Overview of the dsr-collection. For each disease, we display the number of relevant symptoms (#S), the number of primary symptoms (#P), and the Fleiss’ inter-annotator agreement (\(\kappa \)).

We calculate the inter-annotator agreement using Fleiss’ kappa [2], a statistical measure of agreement for three or more annotators. For the annotation of the primary symptoms, we measure a kappa value of \(\kappa =0.61\), which indicates substantial agreement between the three annotators [6]. Individual \(\kappa \)-values per disease are reported in Table 1. By analyzing the disagreements, we found that the annotators labeled primary symptoms with varying frequencies: The first annotator annotated on average 2.1 primary symptoms per disease, the second 2.8, and the third 3.8.
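For readers unfamiliar with the measure, the following is a minimal sketch of the standard Fleiss' kappa computation [2]. The function name `fleiss_kappa` and the count matrix are illustrative assumptions; the numbers are invented and are not taken from the collection.

```python
def fleiss_kappa(counts):
    """Compute Fleiss' kappa.

    counts[i][j] is the number of raters who assigned item i to
    category j. Every item must be rated by the same number of raters.
    Note: the value is undefined (division by zero) when all ratings
    fall into a single category.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])

    # Observed agreement per item, averaged over all items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Expected agreement from the overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in proportions)

    return (p_observed - p_expected) / (1 - p_expected)


# Invented example: 4 symptoms, 3 raters, categories [primary, not primary].
example_counts = [[3, 0], [0, 3], [2, 1], [0, 3]]
kappa = fleiss_kappa(example_counts)
```

In this toy example the raters disagree on one of four symptoms, which yields a kappa of roughly 0.66.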

Vocabulary Compatibility: We map each disease and symptom of the collection to the Unified Medical Language System (UMLS) vocabulary. The UMLS is a compendium of over 100 vocabularies (e.g. ICD-10, MeSH, SNOMED-CT) that are cross-linked with each other. This makes the collection compatible with the UMLS vocabulary and also with each of the over 100 cross-linked vocabularies.

Although the different vocabularies are compatible with the collection, a fair comparison of methods is only possible when the methods utilize the same vocabulary since the vocabulary impacts the evaluation outcome. For instance, the symptom loss of appetite is categorized as a symptom in MeSH; whereas, in the cross-linked UMLS vocabulary, it is categorized as a disease. Therefore, the symptom loss of appetite can be identified when using the MeSH vocabulary, but it cannot be identified when using the UMLS vocabulary.

Evaluation: We consider the following evaluation metrics for the collection: Recall@k, Precision@k, and nDCG@k at the cutoffs \(k=5\) and \(k=10\). Recall measures how many of the relevant symptoms are retrieved, Precision measures how many of the retrieved symptoms are relevant, and finally, nDCG is a standard metric to evaluate graded relevance [5].
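The three metrics can be sketched as follows. This is an illustration under stated assumptions, not the evaluation script used for the reported results: Precision@k is divided by k, and nDCG uses the common \(\log_2(\text{rank}+1)\) discount with the grade as gain, which is one of several variants in the literature. The function names, the ranking, and the grades are all hypothetical.

```python
import math


def precision_at_k(ranked, grades, k):
    """Fraction of the top-k retrieved symptoms with a grade > 0."""
    return sum(1 for s in ranked[:k] if grades.get(s, 0) > 0) / k


def recall_at_k(ranked, grades, k):
    """Fraction of all judged symptoms (grade > 0) found in the top k."""
    n_relevant = sum(1 for g in grades.values() if g > 0)
    if n_relevant == 0:
        return 0.0
    return sum(1 for s in ranked[:k] if grades.get(s, 0) > 0) / n_relevant


def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(ranked, grades, k):
    """DCG of the top-k ranking divided by the DCG of an ideal ranking."""
    gains = [grades.get(s, 0) for s in ranked[:k]]
    ideal = sorted(grades.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Invented judgments (2 = primary, 1 = relevant) and an invented ranking.
grades = {"abdominal pain": 2, "nausea": 1, "fever": 1}
ranked = ["abdominal pain", "abdomen, acute", "nausea", "obesity", "fever"]

p5 = precision_at_k(ranked, grades, 5)
r5 = recall_at_k(ranked, grades, 5)
n5 = ndcg_at_k(ranked, grades, 5)
```

Unlike precision and recall, nDCG rewards a ranking that places the primary symptom above the merely relevant ones.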

3 Disease-Symptom Extraction Methods

3.1 Related Methods

In this section, we discuss disease-symptom extraction methods used in previous studies. A commonly used resource for the extraction of disease-symptom relations is the PubMed database. PubMed contains more than 30 million biomedical articles, including the abstract, title, and various meta-data. Previous work [3, 7] uses the abstracts of the PubMed articles together with rule-based approaches. In particular, Hassan et al. [3] derive patterns of disease-symptom relations from dependency graphs, followed by the automatic selection of the best patterns based on proposed selection criteria. Martin et al. [7] generate extraction rules automatically, which are then inspected for their viability by medical experts. Xia et al. [13] design special queries that include the name and synonyms of each disease and symptom. They use these queries to retrieve the relevant articles, and use the number of retrieved results to perform a ranking via Pointwise Mutual Information (PMI).
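As an illustration of a PMI-based ranking from result counts, consider the following sketch. It uses the textbook definition of PMI estimated from hit counts; how Xia et al. [13] estimate the probabilities in detail is not described here, so the function `pmi` and all counts below are assumptions made for illustration.

```python
import math


def pmi(hits_both, hits_disease, hits_symptom, n_articles):
    """Pointwise mutual information estimated from result counts.

    PMI(x, s) = log2( p(x, s) / (p(x) * p(s)) ), where each probability
    is a hit count divided by the total number of articles.
    """
    if hits_both == 0:
        return float("-inf")  # the pair is never retrieved together
    return math.log2(hits_both * n_articles / (hits_disease * hits_symptom))


# Invented counts: 1000 articles, 100 mention the disease, 200 the symptom.
independent = pmi(20, 100, 200, 1000)   # co-occurrence as expected by chance
associated = pmi(50, 100, 200, 1000)    # co-occurrence above chance level
```

A PMI of zero corresponds to statistical independence, and positive values indicate that the pair is retrieved together more often than chance would predict.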

The mentioned studies use resources that are not publicly available, i.e., rules in [3, 7] and special queries in [13]. To enable reproducibility in future studies, we define our baselines based on the methods that only utilize publicly available resources, described in the next section.

3.2 Baseline Methods

Here, we first describe two recently proposed methods [10, 14] for the extraction of disease-symptom relations as our baselines. Afterwards, we describe limitations of the method described in [14] and propose an adapted version in which the limitations are addressed. We apply the methods on the open-access subset of the PubMed Central (PMC) database, containing 1,542,847 medical articles. To have a common representation for diseases/symptoms across methods (including a unique name and identifier), we consider the 382 symptoms and 4,787 diseases from the Medical Subject Headings (MeSH) vocabulary [14]. Given the set of diseases (X) and the set of symptoms (S), each method aims to compute a relation scoring function \(\lambda (x,s) \in \mathbb {R}\) between a disease \(x \in X\) and a symptom \(s \in S\). In the following, we explain each method in detail.

\(\small \textsc {Embedding}\): Proposed by Shah et al. [10], the method is based on the cosine similarity of the vector representations of a disease and a symptom. We first apply MetaMap [1], a tool for the identification of medical concepts within a given text, to the full text of all PMC articles to substitute the identified diseases/symptoms by their unique names. Then, we train a word2vec model [8] with 300 dimensions and a window size of 15, following the parameter setting in [10]. Using the word embedding, the disease-symptom relation is defined as \(\lambda (x,s)=\text {cos}(\mathbf {e}_x, \mathbf {e}_s)\), where \(\mathbf {e}\) refers to the vector representation of a word.
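The scoring step of this method can be sketched as follows. The sketch assumes that the embedding is already available as a mapping from concept names to vectors; the three-dimensional toy vectors and the function names are invented for illustration and do not come from a trained word2vec model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_symptoms(disease, symptoms, embedding):
    """Rank symptoms by lambda(x, s) = cos(e_x, e_s), highest first."""
    scores = {
        s: cosine(embedding[disease], embedding[s])
        for s in symptoms
        if s in embedding  # skip symptoms without a vector
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy vectors, invented for illustration only.
toy_embedding = {
    "appendicitis": [1.0, 0.0, 1.0],
    "abdominal_pain": [1.0, 0.0, 0.8],
    "coughing": [0.0, 1.0, 0.1],
}
ranking = rank_symptoms("appendicitis", ["abdominal_pain", "coughing"], toy_embedding)
```

Because every concept is substituted by its unique name before training, each disease and symptom corresponds to a single token and therefore to a single vector.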

\(\small \textsc {CoOccur}\): This method, proposed by Zhou et al. [14], calculates the relation of a disease and a symptom by measuring the degree of their co-occurrence in the MeSH-keywords of medical articles. The raw co-occurrence of the disease x and symptom s is denoted by \(\text {co}(x,s)\). The raw co-occurrence does not consider the overall appearance of each symptom across diseases. For instance, symptoms like pain or obesity tend to co-occur with many diseases, and are therefore less informative. Hence, the raw co-occurrence is normalized by an Inverse Symptom Frequency (ISF) measure, defined as \(\text {ISF}(s)=\frac{|X|}{n_s}\), where |X| is the total number of diseases and \(n_s\) is the number of diseases that co-occur with s in at least one article. Finally, the disease-symptom relation is defined as \(\lambda (x,s)=\text {co}(x,s)\times \text {ISF}(s)\). We compute three variants of the \(\small \textsc {CoOccur}\) method:

\(\small \textsc {Kwd}\): The disease-symptom relations are computed using the MeSH-keywords of the \(\approx \)1.5 million PMC articles.
\(\small \textsc {KwdLarge}\): While \(\small \textsc {Kwd}\) uses the 1.5 million PMC articles, Zhou et al. [14] apply the exact same method on the \(\approx \)30 million articles of the PubMed database. Although they did not evaluate the effectiveness of their disease-symptom relation extraction method, they published their relation scores, which we evaluate in this paper.
\(\small \textsc {FullText}\): Applying the \(\small \textsc {CoOccur}\) method only on MeSH-keywords has two disadvantages: First, keywords are not available for all articles (e.g. only 30% of the \(\approx \)1.5 million PMC articles have keywords) and second, usually only the core topics of an article occur as keywords. We address these limitations by proposing an adaption of the \(\small \textsc {CoOccur}\) method, in which we use the full text, the title, and the keywords of the \(\approx \)1.5 million PMC articles. Specifically, we adapt the computation of the co-occurrence \(\text {co}(x,s)\) as follows: We first retrieve a set of articles relevant to a disease x, where an article is relevant if the disease appears in either the keywords or the title of the article. Given these relevant articles and a symptom s, we compute the adapted co-occurrence \(\text {co}(x,s)\), which is the number of relevant articles in which the symptom occurs in the full text. The identification of the diseases in the title and symptoms in the full text is done using the MetaMap tool [1].
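The difference between the keyword-based and the full-text co-occurrence can be sketched as follows. The sketch assumes that the concepts of each article have already been identified (e.g. by a tool such as MetaMap) and are stored under the hypothetical fields `keywords`, `title_concepts`, and `fulltext_concepts`; the three toy articles are invented, and |X| is taken as the size of the toy disease set.

```python
from collections import defaultdict


def keyword_cooccurrence(articles, diseases, symptoms):
    """co(x, s): number of articles whose keywords contain both x and s."""
    co = defaultdict(int)
    for article in articles:
        keywords = set(article["keywords"])
        for x in keywords & diseases:
            for s in keywords & symptoms:
                co[(x, s)] += 1
    return co


def fulltext_cooccurrence(articles, diseases, symptoms):
    """Adapted co(x, s): an article is relevant to x if x appears in its
    keywords or title; count relevant articles with s in the full text."""
    co = defaultdict(int)
    for article in articles:
        found = set(article["keywords"]) | set(article["title_concepts"])
        for x in found & diseases:
            for s in set(article["fulltext_concepts"]) & symptoms:
                co[(x, s)] += 1
    return co


def relation_scores(co, n_diseases):
    """lambda(x, s) = co(x, s) * ISF(s), with ISF(s) = |X| / n_s."""
    diseases_per_symptom = defaultdict(set)
    for (x, s), count in co.items():
        if count > 0:
            diseases_per_symptom[s].add(x)
    return {
        (x, s): count * n_diseases / len(diseases_per_symptom[s])
        for (x, s), count in co.items()
        if count > 0
    }


# Toy data, invented for illustration only.
diseases = {"appendicitis", "asthma"}
symptoms = {"abdominal pain", "nausea", "fever", "coughing"}
articles = [
    {"keywords": ["appendicitis", "abdominal pain"],
     "title_concepts": ["appendicitis"],
     "fulltext_concepts": ["abdominal pain", "nausea"]},
    {"keywords": [],
     "title_concepts": ["appendicitis"],
     "fulltext_concepts": ["abdominal pain", "fever"]},
    {"keywords": ["asthma"],
     "title_concepts": ["asthma"],
     "fulltext_concepts": ["coughing", "fever"]},
]

kwd_scores = relation_scores(
    keyword_cooccurrence(articles, diseases, symptoms), len(diseases))
full_scores = relation_scores(
    fulltext_cooccurrence(articles, diseases, symptoms), len(diseases))
```

In the toy data, the second article has no keywords and therefore contributes nothing to the keyword-based scores, while the full-text variant still counts it. The symptom fever co-occurs with both diseases, so its ISF halves its score relative to a symptom that is specific to one disease.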

4 Evaluation Results and Discussion

We now compare the disease-symptom extraction baselines on the proposed dsr-collection. The results for various evaluation metrics are shown in Table 2. The \(\small \textsc {FullText}\)-variant of the \(\small \textsc {CoOccur}\) method outperforms the other baselines on all evaluation metrics. This demonstrates the high effectiveness of our proposed adaption to the \(\small \textsc {CoOccur}\) method.

Further, we see a clear advantage of the \(\small \textsc {CoOccur}\)-method with MeSH-keywords from \(\approx \)30 million PubMed articles as the resource (\(\small \textsc {KwdLarge}\)) over the same method with keywords from \(\approx \)1.5 million PMC articles (\(\small \textsc {Kwd}\)). This highlights the importance of the number of input samples to the method.

Table 2. Comparison of the disease-symptom extraction methods using our proposed dsr-collection. Significant improvements over another method are marked with a letter: a refers to \(\small \textsc {Embedding}\), b to \(\small \textsc {Kwd}\), and c to \(\small \textsc {KwdLarge}\) (two-sided, paired t-test: \(p<0.01\)).

Table 3. Top-4 extracted symptoms of each method for the disease appendicitis. The retrieved relevant and primary symptoms are highlighted.

Error Analysis: A common source of errors is the fine granularity of the symptoms in the medical vocabularies. For example, the utilized MeSH vocabulary contains the symptoms abdominal pain and abdomen, acute^{Footnote 4}. Both symptoms can be found in the top ranks of the evaluated methods for the disease appendicitis (see Table 3). However, since the collection is not labeled on such a fine-grained level, the symptom abdomen, acute is counted as a false positive.

Another source of errors is the bias in medical articles towards specific disease-symptom relationships. For instance, a special relationship exists between the symptom obesity and periodontitis^{Footnote 5}, which is the topic of various publications. Despite obesity not being a characteristic symptom of periodontitis, all methods return the symptom in the top-3 ranks. A promising research direction is the selective extraction of symptoms from biomedical literature by also considering the context (e.g. in a sentence) in which a disease/symptom appears.

5 Conclusion

We introduced the Disease-Symptom Relation Collection (dsr-collection) for the evaluation of graded disease-symptom relations. We provided baseline results for two recent methods, one based on word embeddings and the second on the co-occurrence of MeSH-keywords of medical articles. We proposed an adaption to the co-occurrence method to make it applicable to the full text of medical articles and showed a significant improvement in effectiveness over the other methods.

Notes

1. MeSH-keywords are meta-data that indicate the core topics of a medical article.
2. Contact this paper’s first author to gain access.
3. The website netdoktor.at, which is certified by the Health on the Net Foundation.
4. Symptom for acute abdominal pain.
5. A dental disease where the gum that surrounds the teeth retreats.

References

1. Aronson, A.R.: Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. In: Proceedings of the American Medical Informatics Association Symposium, pp. 17–21 (2001)
2. Fleiss, J.L.: Measuring nominal scale agreement among many raters. Psychol. Bull. 76(5), 378 (1971)
3. Hassan, M., Makkaoui, O., Coulet, A., Toussaint, Y.: Extracting disease-symptom relationships by learning syntactic patterns from dependency graphs. In: Proceedings of BioNLP 2015, pp. 71–80. Association for Computational Linguistics, Beijing (2015)
4. Järvelin, K., Kekäläinen, J.: Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst. (TOIS) 20(4), 422–446 (2002)
5. Kekäläinen, J.: Binary and graded relevance in IR evaluations - comparison of the effects on ranking of IR systems. Inf. Process. Manag. 41(5), 1019–1033 (2005)
6. Landis, J.R., Koch, G.G.: The measurement of observer agreement for categorical data. Biometrics 33, 159–174 (1977)
7. Martin, L., Battistelli, D., Charnois, T.: Symptom extraction issue. In: Proceedings of BioNLP 2014, pp. 107–111. Association for Computational Linguistics, Baltimore (2014)
8. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119 (2013)
9. Ni, J., Fei, H., Fan, W., Zhang, X.: Automated medical diagnosis by ranking clusters across the symptom-disease network. In: 2017 IEEE International Conference on Data Mining (ICDM), pp. 1009–1014, November 2017
10. Shah, S., Luo, X., Kanakasabai, S., Tuason, R., Klopper, G.: Neural networks for mining the associations between diseases and symptoms in clinical notes. Health Inf. Sci. Syst. 7(1), 1–9 (2018). https://doi.org/10.1007/s13755-018-0062-0
11. Shen, Y., Li, Y., Zheng, H.T., Tang, B., Yang, M.: Enhancing ontology-driven diagnostic reasoning with a symptom-dependency-aware Naïve Bayes classifier. BMC Bioinform. 20(1), 330 (2019)
12. del Valle, E.P.G., García, G.L., Santamaría, L.P., Zanin, M., Ruiz, E.M., González, A.R.: Evaluating Wikipedia as a source of information for disease understanding. In: 2018 IEEE 31st International Symposium on Computer-Based Medical Systems (CBMS), pp. 399–404, June 2018
13. Xia, E., Sun, W., Mei, J., Xu, E., Wang, K., Qin, Y.: Mining disease-symptom relation from massive biomedical literature and its application in severe disease diagnosis. In: 45 - AMIA 2018 Annual Symposium, pp. 1118–1126 (2018)
14. Zhou, X., Menche, J., Barabási, A.L., Sharma, A.: Human symptoms-disease network. Nat. Commun. 5(1), 1–10 (2014)

Author information

Authors and Affiliations

TU Wien, Vienna, Austria
Markus Zlabinger, Sebastian Hofstätter & Allan Hanbury
Johannes Kepler University, Linz, Austria
Navid Rekabsaz

Corresponding author

Correspondence to Markus Zlabinger.

Editor information

Editors and Affiliations

University of Glasgow, Glasgow, UK
Joemon M. Jose
University College London, London, UK
Emine Yilmaz
Universidade NOVA de Lisboa, Lisbon, Portugal
João Magalhães
Universidad Autónoma de Madrid, Madrid, Spain
Pablo Castells
University of Padua, Padua, Italy
Nicola Ferro
Universidade de Lisboa, Lisbon, Portugal
Mário J. Silva
Universidade NOVA de Lisboa, Lisbon, Portugal
Flávio Martins

Cite this paper

Zlabinger, M., Hofstätter, S., Rekabsaz, N., Hanbury, A. (2020). DSR: A Collection for the Evaluation of Graded Disease-Symptom Relations. In: Jose, J., et al. Advances in Information Retrieval. ECIR 2020. Lecture Notes in Computer Science, vol 12036. Springer, Cham. https://doi.org/10.1007/978-3-030-45442-5_54

DOI: https://doi.org/10.1007/978-3-030-45442-5_54
Published: 08 April 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-45441-8
Online ISBN: 978-3-030-45442-5
eBook Packages: Computer Science (R0)
