Efficient Genome-wide Association in Biobanks Using Topic Modeling Identifies Multiple Novel Disease Loci
Biobanks and national registries represent a powerful tool for genomic discovery, but they rely on diagnostic codes that can be unreliable and that fail to capture relationships between related diagnoses. We developed an efficient means of conducting genome-wide association studies using combinations of diagnostic codes from electronic health records for 10,845 participants in a biobanking program at two large academic medical centers. Specifically, we applied latent Dirichlet allocation to fit 50 disease topics based on diagnostic codes, then conducted a genome-wide common-variant association analysis for each topic. In sensitivity analyses, these results were contrasted with those obtained from traditional single-diagnosis phenome-wide association analysis, as well as with those from analyses in which only a subset of diagnostic codes was included per topic. In meta-analysis across three biobank cohorts, we identified 23 disease-associated loci with p < 1e-15, including previously associated autoimmune disease loci. In all cases, the significant topic-based associations were of greater magnitude than those observed for single diagnostic codes in phenome-wide analysis, and incorporating less strongly loading diagnostic codes enhanced association. This strategy provides a more efficient means of identifying phenome-wide associations in biobanks with coded clinical data.
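The per-topic association step described above can be illustrated with a minimal sketch. It assumes that each participant already has a numeric loading on one disease topic (for example, from a fitted latent Dirichlet allocation model) and that each variant is coded as an allele dosage of 0, 1, or 2. The function names, the toy data, and the choice of an unadjusted ordinary least squares regression are illustrative assumptions, not the authors' pipeline, which would typically also adjust for covariates such as ancestry components.

```python
import math
from statistics import mean


def linear_association(dosages, loadings):
    """Regress a per-participant topic loading on allele dosage (0, 1, or 2).

    Hypothetical helper: ordinary least squares with an intercept and no
    covariates. Returns (beta, standard_error, t_statistic).
    """
    n = len(dosages)
    if n != len(loadings) or n < 3:
        raise ValueError("need at least 3 paired observations")
    x_bar, y_bar = mean(dosages), mean(loadings)
    sxx = sum((x - x_bar) ** 2 for x in dosages)
    if sxx == 0:
        raise ValueError("variant is monomorphic in this sample")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(dosages, loadings))
    beta = sxy / sxx
    intercept = y_bar - beta * x_bar
    # Residual sum of squares around the fitted line.
    rss = sum((y - (intercept + beta * x)) ** 2 for x, y in zip(dosages, loadings))
    se = math.sqrt(rss / (n - 2) / sxx)
    t_stat = beta / se if se > 0 else math.copysign(math.inf, beta)
    return beta, se, t_stat


def scan_topic(genotypes, loadings):
    """Run the regression for every variant against one topic's loadings."""
    return {
        variant: linear_association(dosages, loadings)
        for variant, dosages in genotypes.items()
    }


# Toy data (invented for illustration): six participants, two variants.
topic_loadings = [0.05, 0.10, 0.20, 0.40, 0.45, 0.60]
genotypes = {
    "variant_A": [0, 0, 1, 1, 2, 2],
    "variant_B": [1, 0, 2, 0, 1, 2],
}
results = scan_topic(genotypes, topic_loadings)
for variant, (beta, se, t_stat) in results.items():
    print(f"{variant}: beta={beta:.3f} se={se:.3f} t={t_stat:.2f}")
```

Treating the topic loading as a quantitative trait is what allows a single regression per variant to draw on many related diagnostic codes at once. Repeating the scan for each of the 50 topics would give one set of association statistics per topic.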
This study was funded by the National Human Genome Research Institute and the National Institute of Mental Health. The authors acknowledge the Partners HealthCare Biobank for providing genomic data and health information data.
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, and provide a link to the Creative Commons license. You do not have permission under this license to share adapted material derived from this article or parts of it.
The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this license, visit https://creativecommons.org/licenses/by-nc-nd/4.0/.