Molecular Medicine, Volume 23, Issue 1, pp 285–294

Efficient Genome-wide Association in Biobanks Using Topic Modeling Identifies Multiple Novel Disease Loci

  • Thomas H. McCoy Jr
  • Victor M. Castro
  • Leslie A. Snapper
  • Kamber L. Hart
  • Roy H. Perlis
Research Article


Biobanks and national registries represent a powerful tool for genomic discovery, but rely on diagnostic codes that can be unreliable and fail to capture relationships between related diagnoses. We developed an efficient means of conducting genome-wide association studies using combinations of diagnostic codes from electronic health records for 10,845 participants in a biobanking program at two large academic medical centers. Specifically, we applied latent Dirichlet allocation to fit 50 disease topics based on diagnostic codes, then conducted a genome-wide common-variant association for each topic. In sensitivity analysis, these results were contrasted with those obtained from traditional single-diagnosis phenome-wide association analysis, as well as those in which only a subset of diagnostic codes were included per topic. In meta-analysis across three biobank cohorts, we identified 23 disease-associated loci with p < 1e-15, including previously associated autoimmune disease loci. In all cases, observed significant associations were of greater magnitude than single phenome-wide diagnostic codes, and incorporation of less strongly loading diagnostic codes enhanced association. This strategy provides a more efficient means of identifying phenome-wide associations in biobanks with coded clinical data.
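As an illustration of the topic-fitting step, the sketch below treats each participant as a "document" whose "words" are diagnostic codes and fits latent Dirichlet allocation to them. It is a minimal example under stated assumptions, not the authors' pipeline: it uses collapsed Gibbs sampling in plain Python rather than whatever implementation the study used, the code labels (RA, SLE, MDD, ANX, GAD) are made-up placeholders rather than real diagnostic codes, and the function name fit_lda_gibbs, the hyperparameters alpha and beta, and the two-topic toy data are all hypothetical.

```python
import random


def fit_lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling.

    docs: one list of diagnostic-code strings per participant.
    Returns (theta, vocab, phi): per-participant topic proportions,
    the sorted code vocabulary, and per-topic code distributions.
    """
    rng = random.Random(seed)
    vocab = sorted({code for doc in docs for code in doc})
    index = {code: i for i, code in enumerate(vocab)}
    n_codes = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]              # codes per (participant, topic)
    topic_code = [[0] * n_codes for _ in range(n_topics)]   # codes per (topic, code)
    topic_total = [0] * n_topics                            # codes per topic
    assignments = []

    # Start from a random topic assignment for every code occurrence.
    for d, doc in enumerate(docs):
        z_doc = []
        for code in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_code[k][index[code]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, code in enumerate(doc):
                w = index[code]
                k = assignments[d][i]
                # Remove the current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_code[k][w] -= 1
                topic_total[k] -= 1
                # Resample the topic from its full conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_code[t][w] + beta)
                    / (topic_total[t] + n_codes * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_code[k][w] += 1
                topic_total[k] += 1

    # Smoothed point estimates from the final counts.
    theta = [
        [(doc_topic[d][t] + alpha) / (len(doc) + n_topics * alpha)
         for t in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(topic_code[t][w] + beta) / (topic_total[t] + n_codes * beta)
         for w in range(n_codes)]
        for t in range(n_topics)
    ]
    return theta, vocab, phi


# Toy cohort: placeholder code labels, not real diagnostic codes.
patients = [
    ["RA", "SLE", "RA"],
    ["RA", "SLE"],
    ["MDD", "ANX", "MDD"],
    ["ANX", "MDD", "GAD"],
    ["RA", "MDD"],
]
theta, vocab, phi = fit_lda_gibbs(patients, n_topics=2)

for row in theta:
    print([round(p, 3) for p in row])
```

Each row of theta is one participant's loading on each topic. In a workflow like the one the abstract describes, such per-topic loadings could serve as the phenotype for one association scan per topic, although how the study converted loadings into phenotypes is not stated here.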



This study was funded by the National Human Genome Research Institute and the National Institute of Mental Health. The authors acknowledge the Partners HealthCare Biobank for providing genomic data and health information data.

Supplementary material

10020_2017_2301285_MOESM1_ESM.pdf (supplementary material, approximately 2.89 MB)



Copyright information

© The Author(s) 2017

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, and provide a link to the Creative Commons license. You do not have permission under this license to share adapted material derived from this article or parts of it.

The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this license, visit (

Authors and Affiliations

  • Thomas H. McCoy Jr (1)
  • Victor M. Castro (1, 2)
  • Leslie A. Snapper (1)
  • Kamber L. Hart (1)
  • Roy H. Perlis (1)

  1. Center for Quantitative Health, Division of Clinical Research and Center for Human Genetic Research, Massachusetts General Hospital, Boston, USA
  2. Partners Research Information Systems and Computing, Partners HealthCare System, Boston, USA
