Bridge: a GUI package for genetic risk prediction
Risk prediction models capitalizing on genetic and environmental information hold great promise for individualized disease prediction and prevention. Nevertheless, linking the genetic and environmental risk predictors into a useful risk prediction model remains a great challenge. To facilitate risk prediction analyses, we have developed a graphical user interface package, Bridge.
The package is built for both designing and analyzing a risk prediction model. In the design stage, it provides an estimated classification accuracy of the model using essential genetic and environmental information gained from public resources and/or previous studies, and determines the sample size required to verify this accuracy. In the analysis stage, it adopts a robust and powerful algorithm to form the risk prediction model.
The package is developed based on the optimality theory of the likelihood ratio and therefore theoretically could form a model with high performance. It can be used to handle a relatively large number of genetic and environmental predictors, with consideration of their possible interactions, and so is particularly useful for studying risk prediction models for common complex diseases.
Keywords: Gene-gene interactions; Optimal receiver operating characteristic curve
Abbreviations:
- ROC curve: Receiver operating characteristic curve
- AUC: Area under ROC curve
- O-ROC: Optimal ROC curve method
- F-ROC: Forward ROC curve method
- GUI: Graphical user interface
- SNP: Single nucleotide polymorphism
The translation of human genome discoveries into health practice represents one of the major challenges in the coming decades [1, 2]. The use of emerging genetic knowledge for early disease prediction, prevention and pharmacogenetics will advance future genomic medicine and lead to more effective prevention and treatment strategies. Among those, disease prediction based on genetic and environmental information is the first step in translating genomics into health. It assesses an individual’s risk of future disease, so that early preventive interventions can be adopted to reduce morbidity and mortality. For this reason, studies to assess the combined role of genetic and environmental information in early disease prediction represent a high priority, as manifested in multiple risk prediction studies now underway [6, 7, 8, 9, 10, 11, 12].
The yield from these studies can be enhanced by adopting powerful and computationally efficient study design and analytic tools. We have previously developed an optimal ROC curve (O-ROC) method to quickly evaluate new genetic and environmental findings for potential clinical practice by designing a new risk prediction model, estimating its classification accuracy, and calculating the sample size needed for evaluating the model.
If, in the design stage, a proposed risk prediction model appears to be superior to existing models, or if it reaches a desired accuracy level, it may be worth developing further for clinical use. To evaluate the risk prediction model on a study sample, we developed a forward ROC curve (F-ROC) method. F-ROC builds on the optimality theory of the likelihood ratio, and is thus powerful for risk prediction analysis. It adopts a stepwise selection algorithm to efficiently deal with a large number of predictors and their possible high-order interactions.
To facilitate designing and analyzing risk prediction models, we have implemented the above two methods in the graphical user interface (GUI) software, Bridge. Bridge comprises two modules, Test Design and Test Build. The O-ROC approach has been implemented in the Test Design module, for designing a risk prediction model. The Test Design module uses the essential information (e.g., allele frequencies) of risk predictors from previously published studies or publicly available resources to design a risk prediction model, calculating its estimated accuracy and the sample size required to further investigate the model. The F-ROC approach has been built into the Test Build module. The Test Build module is developed for risk prediction modeling on known risk predictors, as well as for high-dimensional risk prediction based on a large number of potential risk predictors. Bridge is freely accessible online at https://www.msu.edu/~qlu/Software.html.
R is open-source software for statistical computing and graphics. With many built-in statistical functions and excellent scientific graphing capacity, R is now one of the most widely used statistical software environments. Although R is widely used in statistics and related fields, it has a limited graphical interface, which makes it difficult for new users. Bridge provides an R graphical user interface (GUI) that offers an intuitive and interactive experience. Instead of writing code in the R console, which can be less convenient for new users, users can load datasets and run analyses by clicking menu options or toolbar buttons. For users who prefer the R console, Bridge also makes its functions available there. In this paper, we give an overview of the package. A detailed description of the installation and use of the package can be found in the software vignette.
Bridge comprises two independent modules, Test Design and Test Build, for the design and construction of a risk prediction model, respectively. The Test Design module serves as a tool for designing a risk prediction study. Given the prevalence of a disease of interest and essential information on the known risk predictors (e.g., relative risks) from previous studies and/or public resources, the Test Design module plots an estimated receiver operating characteristic (ROC) curve of the proposed predictive model, so that users can easily visualize the estimated discriminating ability of the model. If the model reaches the desired level of discriminating ability and is worth further investigation, a power analysis can be conducted to ensure that the study has sufficient power. Given the power and type I error, the Test Design module determines the sample size required to further investigate the proposed model and verify its classification accuracy.
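The calculation behind an estimated ROC curve of this kind can be sketched as follows. This is a minimal illustration written in Python, not Bridge's R code. It assumes independent loci in Hardy-Weinberg equilibrium and a multiplicative risk model, and the function names, input format and numeric values are all hypothetical. Multi-locus genotypes are ranked by their likelihood ratio, and the ROC curve and AUC follow from the cumulative genotype distributions in cases and controls.

```python
from itertools import product


def genotype_freqs(raf):
    """Hardy-Weinberg frequencies for 0, 1 and 2 copies of the risk allele (assumption)."""
    return [(1 - raf) ** 2, 2 * raf * (1 - raf), raf ** 2]


def design_roc(snps, prevalence):
    """Estimate the ROC curve and AUC of a model before any data are collected.

    snps: list of (risk_allele_freq, rr_het, rr_hom) tuples, a hypothetical
          input format; relative risks are taken against the non-carrier genotype.
    Returns (points, auc), where points are (FPR, TPR) pairs from (0, 0) to (1, 1).
    """
    cells = []  # one (population frequency, relative risk) pair per multi-locus genotype
    for combo in product(range(3), repeat=len(snps)):
        freq, rel = 1.0, 1.0
        for g, (raf, rr_het, rr_hom) in zip(combo, snps):
            freq *= genotype_freqs(raf)[g]
            rel *= (1.0, rr_het, rr_hom)[g]
        cells.append((freq, rel))

    # Scale relative risks so the population-average risk equals the prevalence.
    baseline = prevalence / sum(f * r for f, r in cells)
    if baseline * max(r for _, r in cells) > 1:
        raise ValueError("implied absolute risk exceeds 1; check the inputs")

    # Rank genotypes by likelihood ratio, which is monotone in absolute risk.
    cells.sort(key=lambda c: c[1], reverse=True)

    points, auc, fpr, tpr = [(0.0, 0.0)], 0.0, 0.0, 0.0
    for freq, rel in cells:
        risk = baseline * rel
        d_tpr = freq * risk / prevalence              # P(genotype | case)
        d_fpr = freq * (1 - risk) / (1 - prevalence)  # P(genotype | control)
        auc += d_fpr * (tpr + d_tpr / 2)              # trapezoid under this segment
        fpr, tpr = fpr + d_fpr, tpr + d_tpr
        points.append((fpr, tpr))
    return points, auc


# Hypothetical inputs: three SNPs and a 1% disease prevalence.
example_snps = [(0.30, 1.3, 1.7), (0.10, 1.5, 2.5), (0.45, 1.2, 1.4)]
roc_points, auc = design_roc(example_snps, prevalence=0.01)
print(f"estimated AUC = {auc:.3f}")
```

Because genotypes are ordered by likelihood ratio, the resulting curve is concave, and a predictor with no effect (all relative risks equal to 1) yields the diagonal with an AUC of 0.5.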
At least two strategies can be used to select single-nucleotide polymorphisms (SNPs) for designing a risk prediction model. One strategy is to include only disease-susceptibility SNPs that have been replicated in multiple studies, and the other is to include as many potential disease-susceptibility SNPs as possible in the model. Each strategy has its own advantages and disadvantages. Given the limited number of SNPs identified for most common complex diseases and their small effect sizes, a risk prediction model formed by the former strategy is likely to have a low AUC value but could perform robustly across different studies. The latter strategy could result in a risk prediction model with high accuracy, especially when gene-gene interactions exist. Nevertheless, the resulting risk prediction model tends to be less stable.
If data are collected to investigate the proposed risk prediction model, we can then use the Test Build module of Bridge to form and evaluate the proposed model. The Test Build module can be used to assess the combined effect of known risk predictors (i.e., those identified from previous association studies) in disease prediction, with the consideration of possible high-order interactions. In addition to risk prediction on known risk predictors, the Test Build module also allows users to explore a large ensemble of potential risk predictors and their interactions for improved disease prediction. This strategy is particularly useful for complex diseases where a majority of the genetic and environmental risk predictors are unknown. For this strategy, the potential disease-susceptibility predictors can be chosen based on both biological knowledge and statistical evidence. For instance, we can follow a simple strategy previously used to evaluate different sets of SNPs based on their marginal p-value thresholds (i.e., 10^-1, 10^-2, …, 10^-8) [8, 17]. The Test Build module has a built-in forward selection algorithm to handle a large set of predictors. The algorithm is capable of searching for important risk predictors and interactions from a large number of environmental and genetic predictors to further improve the risk prediction model.
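The idea of forward selection driven by the ROC curve can be illustrated with a short sketch. It is written in Python rather than R, and it is a simplified stand-in, not the F-ROC algorithm as implemented in Bridge: it greedily adds whole predictors, groups subjects by their joint genotype on the selected predictors, ranks those groups by the empirical likelihood ratio, and keeps the predictor that raises the training AUC most. All names and the toy data are hypothetical.

```python
from collections import defaultdict


def cell_auc(cells, y):
    """AUC of a rule that ranks genotype cells by their empirical likelihood ratio."""
    cases, ctrls = defaultdict(int), defaultdict(int)
    for cell, label in zip(cells, y):
        (cases if label == 1 else ctrls)[cell] += 1
    n_case, n_ctrl = sum(cases.values()), sum(ctrls.values())

    # The case fraction within a cell is monotone in the likelihood ratio.
    keys = set(cases) | set(ctrls)
    ordered = sorted(keys, key=lambda k: cases[k] / (cases[k] + ctrls[k]), reverse=True)

    auc, tpr = 0.0, 0.0
    for k in ordered:
        d_tpr, d_fpr = cases[k] / n_case, ctrls[k] / n_ctrl
        auc += d_fpr * (tpr + d_tpr / 2)  # trapezoid under this ROC segment
        tpr += d_tpr
    return auc


def forward_select(X, y, max_predictors=3, min_gain=0.0):
    """Greedy forward selection of predictor columns by training AUC.

    X: list of rows of genotype codes (e.g. 0/1/2); y: list of 0/1 disease labels.
    Returns (selected column indices, training AUC of the final model).
    """
    selected, best_auc = [], 0.5
    while len(selected) < max_predictors:
        best = None
        for j in range(len(X[0])):
            if j in selected:
                continue
            cols = selected + [j]
            cells = [tuple(row[c] for c in cols) for row in X]  # joint genotypes
            auc = cell_auc(cells, y)
            if best is None or auc > best[1]:
                best = (j, auc)
        if best is None or best[1] - best_auc <= min_gain:
            break  # no remaining predictor improves the model enough
        selected.append(best[0])
        best_auc = best[1]
    return selected, best_auc


# Toy data (hypothetical): column 1 tracks disease status, column 0 is noise.
X = [[0, 0], [1, 0], [0, 0], [1, 1], [0, 1], [1, 1], [0, 1], [1, 2]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
print(forward_select(X, y, max_predictors=1))
```

Training AUC can only stay the same or increase as the genotype partition is refined, so in practice a stopping rule of this kind should be paired with validation on held-out data and a cap on the number of predictors.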
In addition, the Test Build module has a built-in function for dealing with missing data and provides options for model building and validation (e.g., an option to control the maximum number of risk predictors to be included in the model). The Test Build module uses k-fold cross-validation to provide internal validation, and can also provide external validation if an independent dataset is available. The summary results (e.g., the AUC values) for the risk prediction models built on the training and validation datasets are reported in the Bridge output window. Users can also view the proposed model via ROC-curve plots and tree structure plots. The detailed selection process is available under the Test Build. Results tab in the output area.
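As a rough illustration of internal validation by k-fold cross-validation, the sketch below estimates an out-of-sample AUC for a simple genotype-based risk model. It is written in Python and is not Bridge's implementation: the model (the case fraction within each joint-genotype cell), the handling of unseen genotypes, the unstratified fold assignment and all function names are assumptions made for the example.

```python
import random
from collections import defaultdict


def rank_auc(scores, labels):
    """AUC as the probability that a case outscores a control (ties count one half)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_cells(X, y):
    """Estimate the case fraction in each joint-genotype cell of the training data."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [controls, cases]
    for row, label in zip(X, y):
        counts[tuple(row)][label] += 1
    model = {cell: n1 / (n0 + n1) for cell, (n0, n1) in counts.items()}
    return model, sum(y) / len(y)  # overall case fraction as a fallback score


def kfold_auc(X, y, k=5, seed=0):
    """Average validation AUC over k folds; folds lacking a class are skipped."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]

    aucs = []
    for test in folds:
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        model, fallback = fit_cells([X[i] for i in train], [y[i] for i in train])
        # Genotype cells unseen in training receive the overall case fraction.
        scores = [model.get(tuple(X[i]), fallback) for i in test]
        labels = [y[i] for i in test]
        if 0 < sum(labels) < len(labels):
            aucs.append(rank_auc(scores, labels))
    return sum(aucs) / len(aucs) if aucs else float("nan")


# Toy data (hypothetical): one SNP that separates cases from controls perfectly.
X_toy = [[0]] * 20 + [[2]] * 20
y_toy = [0] * 20 + [1] * 20
print(kfold_auc(X_toy, y_toy, k=5, seed=0))
```

Comparing the cross-validated AUC with the training AUC gives a sense of how much of the apparent accuracy is due to overfitting; external validation on an independent dataset follows the same scoring step with the model fitted once on the full training sample.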
Results and discussion
We used an empirical study of Crohn’s disease (CD) as an example to illustrate how to use Bridge to design and form a risk prediction model.
Use the Test Design module to design a risk prediction model
Use the Test Build module to form a risk prediction model
The above analysis was limited to 3 well-established CD SNPs. To consider additional predictors that might further improve the 3-locus model, we extended the analysis to 29 potential CD-related SNPs. Using the Wellcome Trust CD dataset (see Additional files 4 and 5), the Test Build module identified 5 SNPs and formed a five-locus model with an AUC value of 0.63. The five-locus model was further validated in the testing sample with a predicted AUC value of 0.62. By considering 29 potential CD-related SNPs, the Test Build module was able to select 2 additional predictors, rs3764147 and rs4263839, into the model, further improving the accuracy of the CD risk prediction model.
With increasing genetic findings from large-scale genetic studies, risk prediction studies are being conducted to evaluate the role of potential genetic and environmental predictors in early disease prediction. While there is increasing interest in such risk prediction research, bioinformatics tools for this emerging area of research have not been well developed. We developed a GUI package, Bridge, to facilitate risk prediction modeling. The software will help an investigator design a study to evaluate a new risk prediction model. It can also be used to form a new risk prediction model based upon multiple genetic and environmental risk predictors, with the consideration of possible interactions. Bridge is built around a graphical user interface, which makes it readily accessible to basic science and clinical researchers.
Availability and requirements
- Project name: Bridge package
- Project home page: https://www.msu.edu/~qlu/Software.html
- Operating system(s): Linux, Windows, Mac OS X
- Programming language: R
- Other requirements: R (≥3.0.0)
- License: GNU GPL
- Any restrictions to use by non-academics: none except those posed by the license
CY: Department of Health Management, Medical School, Hangzhou Normal University, Hangzhou, Zhejiang 310036 P.R. China; Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, MI, 48824 USA. QL: Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, MI, 48824 USA.
This study makes use of data generated by the Wellcome Trust Case Control Consortium. This work was supported by the National Institute of Dental and Craniofacial Research under Award Number R03DE022379 and the National Institute on Drug Abuse under Award Number K01DA033346.
- 7. Meigs JB, Shrader P, Sullivan LM, McAteer JB, Fox CS, Dupuis J, Manning AK, Florez JC, Wilson PW, D’Agostino RB, et al: Genotype score in addition to common risk factors for prediction of type 2 diabetes. N Engl J Med. 2008, 359: 2208-2219. 10.1056/NEJMoa0804742.
- 8. Wei Z, Wang K, Qu HQ, Zhang H, Bradfield J, Kim C, Frackleton E, Hou C, Glessner JT, Chiavacci R, et al: From disease association to risk assessment: an optimistic view from genome-wide association studies on type 1 diabetes. PLoS Genet. 2009, 5: e1000678. 10.1371/journal.pgen.1000678.
- 12. Skafidas E, Testa R, Zantomio D, Chana G, Everall IP, Pantelis C: Predicting the diagnosis of autism spectrum disorder using gene pathway analysis. Mol Psychiatry. 2012, in press.
- 16. Egan JP: Signal Detection Theory and ROC Analysis. 1975, New York: Academic Press.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.