Gene fingerprint model for literature based detection of the associations among complex diseases: a case study of COPD
Disease comorbidity is very common and has a significant impact on disease treatment. Revealing the associations among diseases may help to understand the mechanisms of diseases, improve the prevention and treatment of diseases, and support the discovery of new drugs or new uses of existing drugs.
In this paper, we introduced a mathematical model to represent gene related diseases with a series of associated genes based on the overrepresentation of genes and diseases in PubMed literature. We also illustrated an efficient way to reveal the implicit connections between COPD and other diseases based on this model.
We applied this approach to analyze the relationships between Chronic Obstructive Pulmonary Disease (COPD) and other diseases under the Lung diseases branch in the Medical subject heading index system and detected 4 novel diseases relevant to COPD. As judged by domain experts, the F score of our approach is up to 77.6%.
The results demonstrate the effectiveness of the gene fingerprint model for diseases on the basis of medical literature.
Keywords: Disease connection; Gene fingerprint model; Chronic obstructive pulmonary disease; COPD
Abbreviations
COPD: Chronic Obstructive Pulmonary Disease
DAVID: The database for annotation, visualization and integrated discovery
GWAS: Genome wide association study
IPA: Ingenuity pathway analysis
LSA: Latent semantic analysis
MeSH: Medical subject heading
SemMedDB: Semantic MEDLINE database
SVD: Singular value decomposition
The coexistence of diseases, termed comorbidity, describes the presence of multiple diseases or conditions in the same person. Comorbidity is very common in clinical practice; for example, 31% of adult patients with arthritis had obesity, 47% had diabetes and 49% had heart disease in the United States in 2013–2015 (cdc.gov). Comorbidity is not a simple addition of one disease to another, with each independently following its usual trajectory. Due to rapid advances in genomic technologies, genetic analyses have become vital in clinical practice and research for understanding gene-disease relationships. Revealing the associations between diseases and genes, as well as between diseases, may help to understand the mechanisms of diseases, improve the prevention and treatment of diseases, and support the discovery of new drugs or new uses for existing drugs [3, 4, 5].
Literature in the biomedical domain, as a significant addition to experimental data, has been broadly used by researchers for the inference of gene regulatory networks, analysis of the relationships between drugs, genes and diseases, and other biomedical research purposes. For example, researchers inferred disease-disease associations from PubMed abstracts and biological pathways and used large-scale knowledge bases such as the Online Mendelian Inheritance in Man (OMIM) to find disease-causing genes [8, 9].
Chronic obstructive pulmonary disease (COPD) is a common respiratory disease ranked as the third leading cause of death and the second leading cause of disability in the world. COPD also continues to be a major cause of morbidity and mortality in the United States. Approximately 6.5% of U.S. adults (an estimated 15 million) have been diagnosed with COPD. COPD develops through the interaction of environmental and genetic factors, and the exact etiology is still not clear. Therefore, the study of COPD is an important topic in biomedical research. Genome wide association studies (GWAS) and other biomedical research have found many candidate susceptibility genes for COPD, including but not limited to SERPINA1, EPHX1, GST, MMP12, TGFB1, SERPINE2, CHRNA3/5 and HHIP [12, 13, 14, 15]. Finding additional genes and understanding their role in COPD may lead to the development of specific treatments and promote early prevention, detection and treatment.
Many experimental and quantitative studies have focused on predicting COPD genes and mining knowledge about them. Using known COPD gene information, these studies identified genetic factors associated with COPD, discovered clinical features and genetic risk factors that overlap between COPD and asthma, found genetic determinants of quantitative imaging phenotypes, and detected a deletion affecting total lung capacity among subjects. However, experimental data collection is a long and laborious process. Bioinformatics efforts, on the other hand, could speed up our understanding of the molecular mechanism of COPD. Some of these approaches have focused on mining relevant knowledge from medical literature [19, 20, 21] and building biological pathways through visualization. Novel methods such as Ontology Fingerprints have been successfully used to infer active signaling pathways in cancer cells, to develop biological networks, and to help with personalized cancer therapy.
In this paper we report a novel approach to discover the relationships between COPD and other diseases. We also introduce a mathematical model to represent gene-related diseases with a series of associated genes based on PubMed literature and Medical Subject Headings (MeSH). MeSH is a hierarchically structured index catalogue for the life sciences, used to annotate journal articles and books in databases such as MEDLINE and the Clinical Trials registry. Moreover, we illustrate an efficient way to reveal the implicit connections between COPD and other diseases based on this model. Our results not only confirmed known disease-disease relationships for COPD, but also identified novel diseases related to COPD. The findings build a solid foundation for understanding how COPD is related to other diseases, and drugs treating these diseases could be a useful resource in treating COPD.
Materials and methods
Data and materials
In this study, we focused on evaluating the relationships between COPD and the other lung diseases under the Lung disease branch in MeSH (MeSH tree id C08.384). We used this approach since the relationships between COPD and most of these diseases have been well studied, providing useful evidence to evaluate our methods. Among these diseases, we ignored those not linked to any gene and used the resulting 82 diseases for the study. The publication to gene relationship was obtained from the Oct 5, 2017 version of the gene2pubmed file downloaded from the National Center for Biotechnology Information (NCBI). The PubMed citations used were last updated on Sept 28, 2017.
Gene fingerprint for disease
Inspired by the development of gene Ontology Fingerprint and the success of its applications in several fields [23, 24, 25], we developed the Gene Fingerprints for a disease — a set of genes that are over-represented in the literature relevant to the disease together with the enrichment p-value, and thus established a mathematical model to represent the disease with a series of associated genes. However, to eliminate the possibility of propagating noise through this process, only directly co-occurring genes were taken into account, and further processing (see below) was applied to ensure implicit relationships could be detected.
| | Disease di | Not disease di | Total |
| Gene gj | PubMed for di & gj (k) | PubMed for gj, not di | All PubMed for gj (K) |
| Not gene gj | PubMed for di, not gj | PubMed not for di or gj | All PubMed for not gj |
| Total | All PubMed for di (n) | All PubMed for not di | All PubMed (diseases, N) |
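The four labelled counts in this contingency table (k, K, n, N) are the usual inputs to a one-sided hypergeometric test for over-representation. The following Python sketch assumes that the enrichment p-value is the upper tail P(X ≥ k) of that distribution; the function name and the toy counts in the usage note are illustrative and are not taken from the study.

```python
from math import comb


def enrichment_p_value(k: int, K: int, n: int, N: int) -> float:
    """Upper-tail hypergeometric probability P(X >= k).

    k: citations annotated with both disease d_i and gene g_j
    K: all citations linked to gene g_j
    n: all citations annotated with disease d_i
    N: all citations considered
    """
    if not (0 <= k <= min(K, n) and max(K, n) <= N):
        raise ValueError("counts are inconsistent with a 2x2 contingency table")
    total = comb(N, n)  # number of ways to choose the n disease citations
    upper = min(K, n)   # largest possible overlap
    tail = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1))
    return tail / total
```

For instance, `enrichment_p_value(1, 1, 1, 10)` evaluates to 0.1: with a single gene citation and a single disease citation among ten, the chance that they coincide is one in ten. Smaller p-values indicate stronger over-representation of the gene in the disease literature.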
Analysis of disease relationships with the Low-rank matrix approximation
Using matrix approximation for information retrieval was initially introduced for Latent semantic analysis (LSA) [32, 33], which replaces the original term-document matrix with a low-rank approximation of it. A typical technique for producing low-rank matrix approximations is singular value decomposition (SVD). With SVD, an approximation of a matrix can be produced by replacing the smallest singular values on the diagonal of the scaling matrix with zeros. The logic behind the process is that, with the linear transformation, the vectors for documents are rescaled towards their latent principal components in proportion to the rank of the approximate matrix [34, 35]. Through this approach, the implicit relationships among documents that do not share common terms could be discovered.
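In generic notation (the symbols below are standard textbook choices, not taken from the paper), the truncation described here can be written as follows, where the matrix A has p nonzero singular values and r ≤ p is the retained rank:

```latex
A = U \Sigma V^{\top} = \sum_{i=1}^{p} \sigma_i \, u_i v_i^{\top},
\qquad \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_p > 0,
\qquad
A_r = \sum_{i=1}^{r} \sigma_i \, u_i v_i^{\top}.
```

Setting the singular values σ_{r+1}, …, σ_p to zero yields A_r, which by the Eckart–Young theorem is the closest matrix of rank at most r to A in the Frobenius norm.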
We created a primary matrix with rows representing diseases and columns representing genes for 82 lung diseases based on the disease Gene Fingerprint model. Genes associated with only one disease were removed from the matrix. Using this primary matrix, we then established a disease-to-disease matrix measured by Spearman correlation distance based on a low-rank approximation of the primary matrix. The diseases in this matrix were clustered with the Spectral clustering algorithm to reveal relationships between diseases.
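As a minimal sketch of the disease-to-disease step, the Python code below computes a pairwise distance matrix from the rows of a disease-by-gene matrix. It assumes that "Spearman correlation distance" means 1 − ρ, where ρ is the Spearman rank correlation with tied values given average ranks; the function names are hypothetical, and the spectral clustering step is not shown.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def spearman_distance(x, y):
    """Return 1 - Spearman rank correlation of two equal-length vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("rank correlation is undefined for a constant vector")
    return 1.0 - cov / (sx * sy)


def distance_matrix(rows):
    """Symmetric matrix of pairwise Spearman distances between rows."""
    n = len(rows)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = spearman_distance(rows[i], rows[j])
    return dist
```

Under this definition the distance ranges from 0 (identical gene rankings) to 2 (exactly reversed rankings), so two diseases whose fingerprints rank genes similarly end up close together.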
Model evaluations with COPD case
The diseases under the Lung diseases branch in the MeSH tree were categorized by five independent lung disease experts into three mutually exclusive groups: related to, not related to, and undefined relationship to COPD. The label of a disease was derived as follows: a disease was marked as related if 4 or 5 experts agreed, as non-related if 3 or more experts agreed, and as undefined otherwise. Among the 82 Lung diseases, 49 were marked as related, 24 as non-related, and 9 as undefined.
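The labelling rule can be sketched in a few lines of Python. This is one reading of the rule, assuming that each of the five experts casts exactly one vote per disease and that the thresholds count votes for the "related" and "non-related" groups respectively; the label strings and the function name are hypothetical.

```python
from collections import Counter


def consensus_label(votes):
    """Derive a disease label from five expert votes.

    Each vote is one of 'related', 'non-related' or 'undefined'.
    """
    if len(votes) != 5:
        raise ValueError("expected exactly five expert votes")
    counts = Counter(votes)
    if counts["related"] >= 4:
        return "related"
    if counts["non-related"] >= 3:
        return "non-related"
    return "undefined"
```

With five votes the two conditions cannot hold at once, so the order of the checks does not matter. Under this reading, a disease with three "related" and two "non-related" votes stays "undefined", which reflects the stricter threshold applied to the positive class.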
We used the 49 positive and 24 negative cases as training data to estimate the rank of the approximation of the matrix. The best performing matrix compared with the experts’ annotation was selected as the most efficient approximation of the primary matrix. This approximate matrix was then used to assess the relationship between the 9 undefined diseases and COPD, from which the novel associations were detected.
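Comparing a candidate approximation with the experts' annotation requires a score. The sketch below assumes the balanced F score (F1) for the "related" class, which the text does not state explicitly; the function name and the dictionary-based interface are illustrative.

```python
def f_score(predicted, expert):
    """Balanced F score (F1) for the 'related' class.

    Both arguments map a disease name to True (related to COPD) or False.
    Only diseases with an expert label are scored; a disease missing from
    `predicted` is treated as predicted non-related.
    """
    tp = sum(1 for d in expert if expert[d] and predicted.get(d, False))
    fp = sum(1 for d in expert if not expert[d] and predicted.get(d, False))
    fn = sum(1 for d in expert if expert[d] and not predicted.get(d, False))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

One way to use it would be to compute this score for each candidate rank of the approximation and keep the rank with the highest value.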
Evaluations of detected novel diseases
To evaluate the novel disease associations for COPD, we analyzed the gene-to-gene and gene-to-disease relationships, as well as the associations from literature, using the following methods and systems.
Analysis through the disease associated gene fingerprints
We assessed the contribution of the genes in the diseases’ Gene Fingerprints to the relationship between COPD and other detected diseases. We gradually removed the genes whose enrichment p-values were less significant than a threshold. The connections between COPD and these diseases were then re-evaluated using the filtered Gene Fingerprints.
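A minimal Python sketch of this filtering step follows. It assumes a Gene Fingerprint is stored as a mapping from gene symbol to enrichment p-value and uses the genes that survive in both fingerprints as a simple stand-in for the re-evaluated connection; the function names are hypothetical, and the paper's actual re-evaluation is not reproduced here.

```python
def filter_fingerprint(fingerprint, threshold):
    """Keep only genes whose enrichment p-value is at or below the threshold."""
    return {gene: p for gene, p in fingerprint.items() if p <= threshold}


def shared_genes_by_threshold(fp_a, fp_b, thresholds):
    """For each threshold, list the genes that survive in both fingerprints.

    Thresholds are processed from the most lenient to the strictest, which
    mirrors the gradual removal of less significant genes.
    """
    shared = {}
    for t in sorted(thresholds, reverse=True):
        kept_a = filter_fingerprint(fp_a, t)
        kept_b = filter_fingerprint(fp_b, t)
        shared[t] = sorted(set(kept_a) & set(kept_b))
    return shared
```

Tracking how the shared set shrinks as the threshold tightens gives a rough picture of which genes carry the connection between two diseases.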
Ingenuity pathway analysis (IPA™)
IPA™ [36, 37] has been widely used by the research community to explore the relationships among genes, diseases and pathways. Many results obtained from IPA analysis have been experimentally validated, indicating IPA as a credible source for analyzing these relationships. We explored the relationships between COPD and the detected diseases in IPA™ as a way to validate our findings and to provide additional insight into the mechanisms of discovered disease connections.
Semantic MEDLINE database (SemMedDB)
SemMedDB [38, 39] is an NIH-maintained repository of semantic predications extracted using SemRep, and it covers the relationship information of the medical concepts in 32 categories in MEDLINE. We used SemMedDB to trace the paths between COPD and the detected novel diseases through semantic relationships.
The database for annotation, visualization and integrated discovery (DAVID)
DAVID [40, 41] is an online bioinformatics resource developed by the Laboratory of Immunopathogenesis and Bioinformatics (ncifrederick.cancer.gov), an NCI lab located in Frederick, Maryland. It provides integrated functional annotation tools for significant gene sets obtained from genome studies. DAVID tests the enrichment of functional annotations such as biological process, molecular function and pathway for a gene set. We used DAVID to evaluate the Gene Fingerprint models of the diseases and the novel relationships between COPD and the detected diseases.
Using the 49 positive and 24 negative cases as training data, we identified the approximation of the primary matrix that retained 95% of the energy as the most efficient matrix for assessing the relationships between Lung diseases. This approximation selects the r largest eigenvalues such that their sum accounts for 95% of the sum of all eigenvalues, for all cases.
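The 95% energy criterion can be illustrated with a short Python sketch. It assumes that r is the smallest number of leading values whose sum reaches the requested fraction of the total; the input is a plain list of non-negative values (eigenvalues, as worded here, or singular values), and the function name is hypothetical.

```python
def rank_for_energy(values, energy=0.95):
    """Smallest r such that the r largest values sum to >= energy * total."""
    if not 0 < energy <= 1:
        raise ValueError("energy must be in (0, 1]")
    if any(v < 0 for v in values):
        raise ValueError("values must be non-negative")
    ordered = sorted(values, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("values must have a positive sum")
    running = 0.0
    for r, value in enumerate(ordered, start=1):
        running += value
        if running >= energy * total:
            return r
    return len(ordered)  # guards against floating-point shortfall
```

For the toy spectrum `[50, 46, 3, 1]` the two largest values already account for 96% of the total, so the call returns 2 and the remaining values would be set to zero in the approximation.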
The performance of the model on diseases with minimum number of required genes in their Gene Fingerprints
Minimum # associated genes
Lung Injury, Sarcoidosis-Pulmonary, Acute Lung Injury, Bird Fancier’s Lung
Lung Injury, Sarcoidosis-Pulmonary, Acute Lung Injury, Bird Fancier’s Lung
Lung Injury, Sarcoidosis-Pulmonary, Acute Lung Injury, Bird Fancier’s Lung, Eosinophilic Granuloma, Pulmonary Veno-Occlusive Disease, Meconium Aspiration Syndrome
The 16 genes on the right of Fig. 3 all belong to the nicotinic cholinergic receptor (CHRN) family, a well-known susceptibility gene family for COPD. CHRNA1–7 and CHRNA9–10 are CHRN α genes, CHRNB1–4 are CHRN β genes, and CHRND, CHRNE and CHRNG are the CHRN δ, ε and γ genes, respectively. CHRNA3, CHRNB4 and CHRNA5 are the most recognized susceptibility genes for COPD. CHRNA7 is located on the surface of immune cells. After activation, it mediates cholinergic regulation of inflammation and results in a decrease in pro-inflammatory cytokine production. CHRNA7 is also associated with pulmonary sarcoidosis: its expression is significantly elevated in peripheral blood mononuclear cells of patients with pulmonary sarcoidosis compared with healthy controls. Therapeutic activation of the CHRNA7-dependent nicotinic anti-inflammatory pathway represents a theoretical intervention to prevent progression of sarcoidosis. CHRNA7 also plays a role in acute lung injury and is a potential target for the treatment of this disease. These findings indicate that the newly detected connections between diseases and COPD are supported by common molecular mechanisms related to GR, the CHRN family and inflammation. Notably, only two out of 17 genes appear in these diseases’ Gene Fingerprints, indicating that IPA™ and the Gene Fingerprint approach are complementary and supportive of each other.
We also explored the disease relevance from highly relevant genes in the Gene Fingerprints of the diseases. One hundred and seventy-two highly relevant genes were obtained from the Gene Fingerprints of the 4 novel diseases and COPD after applying an association p-value cutoff of 0.01, that is, significance at the 0.01 level (99% confidence).
For each disease, its highly relevant genes were analyzed by DAVID to obtain the enriched KEGG pathways with a Bonferroni cutoff of 0.05. We obtained 72 significantly enriched pathways for COPD, which include all 14 enriched pathways associated with Acute Lung Injury and all 4 associated with Lung Injury. There are 17 enriched pathways associated with Pulmonary Sarcoidosis, 12 of which are among the 72 pathways associated with COPD. Three of these 12 pathways also overlap with pathways associated with Acute Lung Injury, and the remaining 5 of the 17 are unique to Pulmonary Sarcoidosis.
Discussion & Conclusions
In this project, we introduced a mathematical model based on the gene to PubMed mapping to characterize a disease, and the performance of this approach was evaluated with a case study of COPD. Applying this model, we analyzed all the diseases in the Lung diseases branch of the MeSH tree and were able to distinguish the COPD related diseases from the non-related ones.
Our model predicted 4 novel COPD related diseases. Three of these diseases, Acute Lung Injury, Pulmonary Sarcoidosis, and Lung Injury, were identified as closely related to COPD based on gene information (Figs. 2, 3, 5) and literature (Fig. 4). Our analysis has also shown that lung injury, acute lung injury, COPD and pulmonary sarcoidosis are all related to inflammation and injury in the lung. However, because acute lung injury is a branch of lung injury, and not all the children of lung injury relate to COPD, the relationship between Lung injury and COPD could be due to the contribution of acute lung injury as a child of Lung injury.
The identified relationship between Bird Fancier’s Lung and COPD has only shallow semantic connections. One possible reason is that Bird Fancier’s Lung has not been extensively studied and sufficient experimental evidence is lacking. This is reflected in the fact that no search results were returned for Bird Fancier’s Lung from IPA™ and DAVID, two integrative, widely used annotation databases for genes and diseases. The sensitivity of the Gene Fingerprint approach in detecting disease-disease relationships could be a strength for studying diseases with limited experimental data. Refinements such as replacing the Spectral clustering algorithm with deep learning could further improve the performance of our approach in the future, using large amounts of training and testing data from literature and other sources.
We would like to thank the experts in the Department of Development Pediatrics in the Second Affiliated Hospital of Jilin University for the annotation of the diseases studied in this research.
NIH R01AI130460 (Tao).
NIH R01LM011829 (Tao).
NIH 1U01HG009454 (Tao).
The National Natural Science Foundation of China (NSFC) Grant #81672297 (Zhang).
Guocai Chen was partly supported by CPRIT R1307.
Publication of this article is sponsored by CPRIT RP170668 (Zheng) grant.
Availability of data and materials
The datasets used in this study were obtained from PubMed, which is accessible from the National Center for Biotechnology Information website (ncbi.nlm.nih.gov).
About this supplement
This article has been published as part of BMC Medical Informatics and Decision Making Volume 19 Supplement 1, 2019: Selected articles from the International Conference on Intelligent Biology and Medicine (ICIBM) 2018: medical informatics and decision making. The full contents of the supplement are available online at https://bmcmedinformdecismak.biomedcentral.com/articles/supplements/volume-19-supplement-1.
GC, YJ, CT, WZ - Conceptualization and design of the study. GC, YJ – Data acquisition and analysis. PL, LZhang – data annotation. GC – model design and coding. GC, LZhu – evaluation and analysis of result. GC, YJ, CT, WZ, LZhu - Drafting and/or revising the manuscript. All of the authors have read and approved the final manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
11. Centers for Disease Control and Prevention. Chronic obstructive pulmonary disease among adults--United States, 2011. MMWR Morb Mortal Wkly Rep. 2012;61(46):938.
25. Qin T, et al. Finding pathway-modulating genes from a novel ontology fingerprint-derived gene network. Nucleic Acids Res. 2014;42(18):e138.
26. Chen G, et al. Using Ontology Fingerprints to disambiguate gene name entities in the biomedical literature. Database: the journal of biological databases and curation. 2015;2015:bav034. https://doi.org/10.1093/database/bav034.
28. Smola AJ, Schölkopf B. Sparse greedy matrix approximation for machine learning; 2000.
29. Ng AY, Jordan MI, Weiss Y. On spectral clustering: Analysis and an algorithm. In: Advances in neural information processing systems; 2002.
30. Gasper G, Rahman M. Basic hypergeometric series. In: Basic Hypergeometric Series (Encyclopedia of Mathematics and its Applications, pp. 1–37). Cambridge: Cambridge University Press; 2004. https://doi.org/10.1017/CBO9780511526251.004.
35. Banerjee A, et al. A generalized maximum entropy approach to bregman co-clustering and matrix approximation. J Mach Learn Res. 2007;8(Aug):1919–86.
37. Smith JA, Osborn M. Interpretative phenomenological analysis. Doing social psychology research. 2004:229–54.
38. Kilicoglu H, et al. Semantic MEDLINE: A Web application to manage the results of PubMed searches. In: Proceedings of SMBM’08. 2008. p. 69–76.
42. Zhang L, Lin W. Selective visual attention: computational models and applications. Wiley; 2013. http://site.ebrary.com/id/10674838.
45. King T. Treatment of pulmonary sarcoidosis: Initial therapy with glucocorticoids. UpToDate; 2017. Obtained on November 28, 2018 from http://www.uptodate.com/contents/treatment-of-pulmonary-sarcoidosis-initialtherapy-with-glucocorticoids.
57. Kiszałkiewicz J, Piotrowski WJ, Brzeziańska-Lasota E. Selected molecular events in the pathogenesis of sarcoidosis—recent advances. Advances in Respiratory Medicine. 2015;83(6):462–75.
64. Piotrowski WJ, et al. Expression of HIF-1A/VEGF/ING-4 axis in pulmonary sarcoidosis. In: Pokorski M, editor. Noncommunicable Diseases. Cham: Springer International Publishing; 2015. p. 61–69.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.