An automated framework for hypotheses generation using literature
In bio-medicine, exploratory studies and hypothesis generation often begin with researching existing literature to identify a set of factors and their association with diseases, phenotypes, or biological processes. Many scientists are overwhelmed by the sheer volume of literature on a disease when they plan to generate a new hypothesis or study a biological phenomenon. The situation is even worse for junior investigators, who often find it difficult to formulate new hypotheses or, more importantly, to corroborate whether their hypothesis is consistent with existing literature. It is a daunting task to keep abreast of so much published work and also remember all combinations of direct and indirect associations. Fortunately, there is a growing trend of using literature mining and knowledge discovery tools in biomedical research. However, there is still a large gap between the huge amount of effort and resources invested in disease research and the little effort invested in harvesting the published knowledge. The proposed hypothesis generation framework (HGF) finds “crisp semantic associations” among entities of interest, which is a step towards bridging such gaps.
The proposed HGF shares similar end goals with SWAN but is more holistic in nature, and it was designed and implemented using scalable and efficient computational models of disease-disease interaction. The integration of mapping ontologies with latent semantic analysis is critical in capturing domain-specific direct and indirect “crisp” associations, and in making assertions about entities (such as disease X is associated with a set of factors Z).
Pilot studies were performed using two diseases. A comparative analysis of the computed “associations” and “assertions” with curated expert knowledge was performed to validate the results. It was observed that the HGF is able to capture “crisp” direct and indirect associations, and provide knowledge discovery on demand.
The proposed framework is fast, efficient, and robust in generating new hypotheses to identify factors associated with a disease. A fully integrated Web service application is being developed for wide dissemination of the HGF. A large-scale study by domain experts and associated researchers is underway to validate the associations and assertions computed by the HGF.
Keywords: Disease network; Disease model; Biological literature-mining; Hypothesis generation; Knowledge discovery; MeSH ontology
Abbreviations
HGF: Hypothesis generation framework
LSA: Latent semantic analysis
MeSH: Medical subject heading
NMF: Non-negative matrix factorization
POLSA: Parameter optimized latent semantic analysis
The explosion of OMICS-based technologies, such as genomics, proteomics, and pharmaco-genomics, has generated a wave of information retrieval tools, such as SWAN, to mine heterogeneous, high-dimensional, and large databases, as well as complex biological networks. The general characteristics of such complex systems, as well as their robustness and dynamical properties, have been reported by many researchers (e.g., [2, 3]). Designing scalable and efficient knowledge discovery tools can further our understanding of complex biological systems. However, the effort and investment made to acquire knowledge about the complexities of biological systems are disproportionately large compared to the effort devoted to developing knowledge discovery tools that can be used for effectively disseminating the acquired knowledge, generating and validating hypotheses, and understanding complex causal relationships. Despite a plethora of efforts in reverse-engineering complex systems to predict responses to perturbations, there has been little effort to create a higher-level abstraction of such complex biological systems using sources of information other than genetics data [2, 4]. A high-level view of complex systems would be very useful in generating new hypotheses and connecting seemingly unrelated entities. Such an abstraction could facilitate translational research and may prove vital in clinical studies by providing a valuable reference to clinicians, researchers, and other domain experts.
Disease networks can provide a high-level view of complex systems; however, the reported networks are mostly based on genetic and proteomic data [2, 4]. Such networks could also be constructed from literature data to incorporate a wider range of factors, such as side effects and risk factors. Generating disease models based on literature data is a natural and efficient way to better understand and summarize the current knowledge about different high-level systems. A connection between two diseases can be formalized through shared risk factors, symptoms, treatment options, or other diseases, rather than only through common disease genes. The relations between diseases can provide a systematic approach to identifying missing links and potential associations. They can also create new avenues for collaboration and interdisciplinary research.
To construct a disease network based on literature data, it is imperative to have a scalable and efficient literature-mining tool to explore the huge textual resources. Nevertheless, mining biological and medical literature is a very challenging task [5, 6, 7]. This can be further complicated by the challenges of implementing relevant information extraction through deep parsing, which is built on formal mathematical models. The formal grammars underlying deep parsing attempt to describe how text is generated in the human mind. Deterministic or probabilistic context-free grammars are probably the most popular formal grammars. Grammar-based information extraction techniques are computationally expensive because they require the evaluation of alternative ways to generate the same sentence. Grammar-based information extraction could therefore be more precise, but at the cost of reduced processing speed.
An alternative to grammar-based methods is the family of factorization methods, such as Latent Semantic Analysis (LSA) and Non-negative Matrix Factorization (NMF) [9, 10]. Factorization methods rely on the bag-of-words concept and therefore have reduced computational complexity. LSA is a well-known information retrieval technique that has been applied to many areas in bioinformatics. Arguably, LSA captures semantic relations between various concepts based on their distance in the reduced eigenspace. It has the advantage of extracting both direct and indirect associations between entities. A commonly used distance measure in LSA is the cosine of the angle between the document and the query in the reduced eigenspace.
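The cosine measure mentioned above can be illustrated with a minimal sketch. It assumes the query and the documents have already been projected into a reduced space; the vectors, the document names, and the helper functions `cosine_similarity` and `rank_by_similarity` are hypothetical and are not taken from the authors' implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_vec, doc_vecs):
    """Return (name, score) pairs sorted from most to least similar to the query."""
    scores = {name: cosine_similarity(query_vec, vec) for name, vec in doc_vecs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical coordinates in a 3-dimensional reduced space (made-up numbers).
query = [1.0, 0.0, 0.0]
documents = {
    "doc_a": [2.0, 0.0, 0.0],  # same direction as the query
    "doc_b": [1.0, 1.0, 0.0],  # 45 degrees away from the query
    "doc_c": [0.0, 0.0, 3.0],  # orthogonal to the query
}
ranking = rank_by_similarity(query, documents)
```

Because the cosine depends only on direction, `doc_a` ranks first even though its length differs from that of the query.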
Over the past two decades, medical text-mining has proved valuable in generating exciting new hypotheses. For instance, titles from MEDLINE were used to connect previously disconnected topics: 1) migraine and magnesium deficiency, a connection that has been verified experimentally; 2) indomethacin and Alzheimer’s disease; and 3) Curcuma longa and retinal diseases. Hypothesis generation in literature-mining relies on the fact that chance connections can turn out to be meaningful.
This paper designs and implements an efficient and scalable literature-mining framework to generate and validate plausible hypotheses about various entities that include (but are not limited to) risk factors, environmental factors, lifestyle, diseases, and disease groups. The proposed hypothesis generation framework (HGF) is based on parameter optimized latent semantic analysis (POLSA) and is suitable for capturing direct and indirect associations among concepts. The overall performance and quality of results obtained through LSA-based systems depend on the dictionary used. The concept of mapping ontologies was integrated with the POLSA to overcome this limitation and to provide crisp associations. In particular, the Medical Subject Headings (MeSH) vocabulary is used to construct the dictionary. Such a dictionary allows a more efficient use of the LSA technique in finding semantically related entities in the biological and medical sciences. This framework can be used to generate customized disease-disease interaction networks, to facilitate interdisciplinary collaborations between scientists and organizations, to discover hidden knowledge, and to spawn new research directions. In addition, the concept of statistical disease modeling was introduced to distinguish strongly related, related, and unrelated concepts.
The following section describes the proposed hypothesis generation framework and its evaluation. Two case studies were performed to showcase the potential and utility of the proposed method. Finally, the paper ends with a brief conclusion and discussions about the strengths and weaknesses of the method.
Results and discussion
Hypotheses generation framework (HGF)
MeSH is used to generate the dictionary in the POLSA model. Mapping the MeSH ontology to create the dictionary for the POLSA significantly enhances the quality of results and provides a crisp association of semantically related entities in biological and medical science. All MeSH headings are reduced to single words to create the context-specific and data-driven dictionary (see Figure 1B). For instance, “Reproductive and Urinary Physiological Phenomena” is a MeSH term and is reduced to five words in the dictionary (1. Reproductive, 2. and, 3. Urinary, 4. Physiological, and 5. Phenomena). In the filtering step, duplicates, stop words such as “and”, and words containing fewer than three characters are removed. The final size of this dictionary is 19,165 words. Any dictionary word can be used as a query to the HGF. For instance, the disease “stroke” is a query in this study. The factors ranked highly with respect to a query disease are considered factors associated with that disease. The cosine similarity measure is used as the metric in the HGF.
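The dictionary construction described above (splitting headings into single words, then filtering duplicates, stop words, and short words) can be sketched as follows. This is an illustrative sketch only: the function name `build_dictionary`, the lower-casing of words, the punctuation handling, and the tiny stop-word list are assumptions and are not taken from the paper.

```python
def build_dictionary(mesh_headings, stop_words, min_length=3):
    """Reduce headings to unique single words, dropping stop words and short words."""
    seen = set()
    dictionary = []
    for heading in mesh_headings:
        # Assumption: commas are treated as separators and case is ignored.
        for token in heading.replace(",", " ").split():
            word = token.lower()
            if len(word) < min_length:
                continue  # fewer than three characters
            if word in stop_words:
                continue  # stop word such as "and"
            if word in seen:
                continue  # duplicate
            seen.add(word)
            dictionary.append(word)
    return dictionary


# Hypothetical input: one heading from the text plus one made-up heading.
headings = [
    "Reproductive and Urinary Physiological Phenomena",
    "Urinary Tract",
]
words = build_dictionary(headings, stop_words={"and", "the", "of"})
# words == ['reproductive', 'urinary', 'physiological', 'phenomena', 'tract']
```

Note that “and” has exactly three characters, so it is removed by the stop-word filter rather than by the length filter.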
In order to develop an effective literature-mining framework to model disease-disease interaction networks, generate plausible new hypotheses, and support knowledge discovery by finding semantically related entities, a Parameter Optimized LSA (POLSA) was re-designed and adopted in the proposed HGF.
Potential risk factors and/or contributing factors selected by medical expert
Potential contributing factors, grouped by category:
Disease / medical condition: Asthma, autism, schizophrenia, HIV, immunological disorder, bipolar, hypertension, osteoporosis, coronary heart disease (CHD), diabetes, allergy, herpes, leukemia, breast cancer, lymphoma, hypothyroidism, hyperthyroidism, insomnia, depression, viral infection, bacterial infection, hepatitis B virus, retrovirus, enterovirus
Sign / symptom: morning cortisol level, cholesterol level, head trauma, abdominal adiposity, fracture, bone mineral density (BMD), body mass index (BMI), pregnancy outcome, maternal influenza, postmenopause, mood, volume of cerebrum, volume of hippocampus, volume of lateral ventricle, family history, motor activity assessment
(category label missing): caffeine, hormone, aflatoxin, calcium deficiency or calcium overdose, phosphorus deficiency or phosphorus overdose, magnesium deficiency or magnesium overdose, sodium deficiency or sodium overdose, potassium deficiency or potassium overdose, sulphur deficiency or sulphur overdose, chloride deficiency or chloride overdose, chromium deficiency or chromium overdose, copper deficiency or copper overdose, fluoride deficiency or fluoride overdose, iodine deficiency or iodine overdose, iron deficiency or iron overdose, manganese deficiency or manganese overdose, molybdenum deficiency or molybdenum overdose, selenium deficiency or selenium overdose, zinc deficiency or zinc overdose, vitamin A or Retinol, vitamin B1 or Thiamine, vitamin B2 or Riboflavin, vitamin B3 or Niacin, vitamin B5 or Pantothenic acid, vitamin B6 or Pyridoxine, vitamin B7 or Biotin, vitamin B9 or Folic acid, vitamin B12 or Cyanocobalamin, vitamin C or Ascorbic acid, vitamin D or Calciferol, vitamin E or Tocopherol, vitamin K or Phylloquinone, Cannabis, cocaine, bisphenol-A (BPA), diethylstilbestrol (DES), estradiol (E2), oral contraceptive (OC)
Environmental / lifestyle and behavioral factors: air pollutants, volatile organic compounds, Pesticide, chemical agents, wood dust (exposure), silica dust (exposure), night shift work, outdoor workers, indoor workers, exposure polycyclic aromatic hydrocarbons, heterosexual, homosexual, Tobacco smoking, alcohol consumption, health education and health promotion, addiction, lifestyle intervention, diet nutrition, stress, age gender, breast-feeding
Where α1, α2, and α3 are the scaling factors; μ1, μ2, and μ3 are the positions of the centers of the peaks; and σ1, σ2, and σ3 control the widths of the distributions. The goodness of fit was measured using an R-square score.
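To make the roles of these parameters concrete, the sketch below evaluates a three-peak curve and an R-square score. The Gaussian form of each peak, with 2σ² in the exponent, is an assumption made for illustration, since the text here only names the parameters. The function names and all numeric values are likewise hypothetical, and the fitting step itself is not shown.

```python
import math


def gaussian(x, alpha, mu, sigma):
    """One peak: alpha scales the height, mu is the center, sigma sets the width."""
    return alpha * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))


def trimodal(x, params):
    """Sum of three peaks; params is [(alpha1, mu1, sigma1), (alpha2, ...), (alpha3, ...)]."""
    return sum(gaussian(x, alpha, mu, sigma) for alpha, mu, sigma in params)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Made-up parameters and a grid of similarity scores between -1 and +1.
params = [(1.0, -0.5, 0.05), (2.0, 0.0, 0.05), (0.5, 0.5, 0.05)]
xs = [i / 100.0 for i in range(-100, 101)]
fitted = [trimodal(x, params) for x in xs]

# Stand-in for an observed histogram: the curve plus a small alternating deviation.
observed = [y + (0.01 if i % 2 == 0 else -0.01) for i, y in enumerate(fitted)]
score = r_squared(observed, fitted)
```

In practice the nine parameters would be estimated from the histogram of similarity scores with a nonlinear least-squares routine, and the R-square score would then be reported for the fitted curve.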
Separating the three distributions allows implementation of a dynamic and data-driven threshold calculation. Hence, the parameters of the distributions can be used to model a cut-off threshold for the factors that are established, potential, or unknown. This method is empirical and provides an intuitive approach to evaluating the results. The score can be further optimized heuristically by using a large-scale and comprehensive ground truth set. Furthermore, the factors most highly associated with the disease are the well-known factors; the hidden knowledge, on the other hand, resides in the region where the associations are positive yet weak.
The tri-modal distribution model is used to group the associated factors into different levels. The cut-off values that differentiate the association levels vary slightly depending on the distribution of the similarity scores. The ideal decision boundary could be found if a large number of ground truth cases were available; in this study, the decision boundary is selected intuitively based on the shape of the distributions. For example, in the case of ischemic stroke (IS), factors are considered highly associated if their cosine score is greater than 0.3, possibly associated if their score is between 0.1 and 0.3, and possibly not associated if their score is lower than 0.1. In the case of Parkinson’s disease (PD), factors are considered highly associated if their cosine score is greater than 0.2, possibly associated if their score is between 0.1 and 0.2, associated at a low level if their score is between 0.05 and 0.1, and possibly not associated with PD if their score is lower than 0.05.
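The cut-offs quoted above translate directly into a small lookup, sketched below. The function names `level_is` and `level_pd` are hypothetical, and the handling of scores that fall exactly on a boundary is an assumption, since the text does not specify it.

```python
def level_is(cosine_score):
    """Association level for ischemic stroke (IS), using the cut-offs 0.3 and 0.1."""
    if cosine_score > 0.3:
        return "highly associated"
    if cosine_score >= 0.1:  # assumption: a score of exactly 0.1 counts as "possibly associated"
        return "possibly associated"
    return "possibly not associated"


def level_pd(cosine_score):
    """Association level for Parkinson's disease (PD), using the cut-offs 0.2, 0.1, and 0.05."""
    if cosine_score > 0.2:
        return "highly associated"
    if cosine_score >= 0.1:
        return "possibly associated"
    if cosine_score >= 0.05:
        return "associated at low level"
    return "possibly not associated"


# Made-up scores, used only to exercise each branch.
examples = [(s, level_is(s), level_pd(s)) for s in (0.48, 0.25, 0.12, 0.07, 0.01)]
```

The same score can therefore fall into different levels for the two diseases: a score of 0.25 is "possibly associated" for IS but "highly associated" for PD.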
In the case of IS, the distribution of known associated factors is shifted further to the right than the corresponding distribution for PD; hence, the separation between the known and unknown factors is more pronounced. In addition, associations at both extremes (close to +1.0 and −1.0) are likely to be common knowledge, whereas the hidden knowledge tends to be captured at similarity scores that are low yet positive. Nonetheless, it is not realistic to compare precise similarity scores in order to give more importance to one factor over another, mainly because there is a systemic bias inherent to biological text data that causes the scores of generic factors to underestimate the true values (data not shown). Hence, a direct comparison would fail in this case unless additional normalization steps are taken.
Figure 3 summarizes a comparative analysis of MedLink Neurology and HGF for IS and PD. Overall, in the case of IS, twelve factors were identified by both systems and six factors were identified only by the HGF. In the case of PD, twelve factors were identified by both systems, ten factors were identified only by the HGF, and five factors were identified only by MedLink Neurology, although these five factors had a low association level in the HGF. The five factors were either very generic or were not exactly mapped in the set of the 96 factors; hence, a direct comparison could not be made. Finally, this small-scale comparative analysis corroborates the hypothesis that a literature-based HGF can better predict the associated factors for diseases such as IS, for which the risk and associated factors are well studied and documented. In both cases, MedLink Neurology and HGF predicted twelve common associated factors; however, ten new factors were predicted in the case of PD, compared to six in the case of IS.
A subset of factors identified only by the hypothesis generation framework
Level of association (cosine score)
Depression (morning cortisol level, mood, stress)
0.48, 0.18, and 0.12
There are three main limitations in the presented framework, and we are currently working on solutions for them. 1) Manual selection of the factors creates bias in the dataset and also limits scalability. To alleviate this problem, the MeSH hierarchy will be used to generate the set of factors. MeSH comprises more than 25,000 subject headings organized in an eleven-level hierarchy. 2) In the set of 96 factors, some factors were very generic and some very specific; therefore, there was a systemic bias in the dataset, which caused the scores for generic factors to underestimate the true values and the scores for factors with limited information to be overestimated (data not shown). To partially solve this technical difficulty, an improved method based on local LSA is being developed in our lab. 3) Finally, looking only at literature from the past twenty years was not sufficient for the HGF. Expanding the literature set is necessary, based on the observation that the association between head trauma and PD was significantly lower than expected.
Generating new hypotheses by mining a vast amount of raw, unstructured knowledge from the archived literature may help in identifying new research trends as well as promoting interdisciplinary studies. In addition, the presented framework is not limited to uncovering disease-disease interactions; any word from MeSH can be used to query the system, and its associated factors can be identified accordingly. Disease-disease interaction networks, interaction networks among chemical compounds, drug-drug interaction networks, or any specific type of interaction network can be constructed using the HGF. The common basis for all these networks is the knowledge embedded in the literature. The application of this framework is broad, as its usage is not limited to any specific domain. For instance, uncovering drug-drug interactions is valuable in drug development and drug administration, and uncovering disease-disease interactions is important in understanding disease mechanisms and advancing biology through integrated interdisciplinary research. Even though the framework is not limited to diseases, in this study two neurological diseases were used to test the system and demonstrate the power and applicability of the framework.
In addition to addressing the limitations of the framework, work is in progress to expand the HGF to allow the user to generate disease networks based on a number of user-defined queries. Such customized networks can be valuable to a wide range of scientists by promoting faster identification of associated factors and detection of disease-disease interactions. Disease networks based on genetics and proteomics data display many connections between individual disorders and disease categories [2, 4]. Therefore, as expected, human disorders do not seem to have unique origins or to be independent of one another. To uncover potential links between two disorders, knowledge extraction from the medical literature could be greatly beneficial and reliable.
VA is a Ph.D. candidate in Electrical and Computer Engineering at the University of Memphis; she has a B.A.Sc. in Computer Engineering and a B.Sc. in Biochemistry, in addition to an M.Sc. in Cellular Molecular Medicine and a second M.Sc. in Bioinformatics. Her research interests are interdisciplinary research in Medical Informatics and Systems Biology. VA’s research incorporates a systems approach to understanding gene regulatory networks, which combines mathematical modeling and molecular biology wet lab techniques. Her recent contributions are in medical informatics, where her broad understanding of interdisciplinary issues as well as deep knowledge in mathematics and experimental biology are fundamental in designing and performing experiments in translational research.
RZ is an M.D. in the department of Neurology at the University of Tennessee. He also holds a Master of Public Health. His research interests include Vascular Neurology and Bioinformatics. Over the past few years, RZ has contributed to bridging the gap between clinical findings and the application of bioinformatics tools.
FEF is a Ph.D. candidate in Electrical and Computer Engineering at The University of Memphis; he has a B.Sc. in Computer Science and Engineering, an M.Sc. in Computer Science and Engineering, and a second M.Sc. in Bioinformatics. His research interests are biological information retrieval and data mining. FEF has good knowledge of software design and development. He has participated in software development for national and international research projects, such as the Codewitz Asia-Link Project of the European Union.
MY is an Associate Professor in the department of Electrical and Computer Engineering, an adjunct faculty member of the Biomedical Engineering and Bioinformatics Program, and an affiliated member of the Institute for Intelligent Systems (IIS) at The University of Memphis (U of M). He is a senior member of the IEEE. He has made significant contributions to the research and development of real-time computer vision solutions for academic research and commercial applications. He has been involved with several technological innovations, including classification of gender, age group, ethnicity, and emotion; face detection; recognition of human activities in video; and speech-gesture enabled natural human-computer interfaces. Some of his research on facial image analysis and hand gesture recognition has been used in several commercial products developed by Videomining Inc.
This work was supported by the Electrical and Computer Engineering Department and Bioinformatics Program at the University of Memphis, by the University of Tennessee Health Science Center (UTHSC), as well as by NSF grant NSF-IIS-0746790. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding institution.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.