Extracting Phenotypes from Patient Claim Records Using Nonnegative Tensor Factorization

  • Joyce C. Ho
  • Joydeep Ghosh
  • Jimeng Sun
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8609)


Electronic health records (EHRs) are becoming an increasingly important source of patient information. Unfortunately, EHR data do not always map directly and reliably to the medical concepts that clinical researchers need or use. Some recent studies have focused on EHR-derived phenotyping, which aims to map EHR data to specific medical concepts; however, most of these approaches require labor-intensive supervision from experienced clinical professionals.

In this paper, we use Limestone, a nonnegative tensor factorization method, to derive phenotype candidates from claims data with virtually no human supervision. Limestone naturally represents the interactions between diagnoses and procedures among patients using tensors (a generalization of matrices). The resulting tensor factors are reported as phenotype candidates that automatically reveal patient clusters on specific diagnoses and procedures. To the best of our knowledge, this is the first study that successfully extracts useful phenotypes by applying sparse nonnegative tensor factorization to a large, public-domain EHR dataset covering a broad range of diseases. Our experiments demonstrate the interpretability and the promise of high-throughput phenotypes generated from tensor factorization.
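To make the tensor representation concrete, the following is a minimal sketch under stated assumptions, not the Limestone algorithm itself (the abstract does not specify its updates). It builds a patient × diagnosis × procedure count tensor from toy, made-up records and fits a rank-2 nonnegative CP model using generic least-squares multiplicative updates, with no sparsity step. The function names `build_tensor`, `reconstruct`, and `nonneg_cp` are hypothetical.

```python
import numpy as np

def build_tensor(records, n_patients, n_diagnoses, n_procedures):
    """Count co-occurrences of (patient, diagnosis, procedure) index triples."""
    X = np.zeros((n_patients, n_diagnoses, n_procedures))
    for p, d, q in records:
        X[p, d, q] += 1.0
    return X

def reconstruct(A, B, C):
    """Sum of rank-one terms: X_hat[i, j, k] = sum_r A[i, r] * B[j, r] * C[k, r]."""
    return np.einsum('ir,jr,kr->ijk', A, B, C)

def nonneg_cp(X, rank, n_iter=200, eps=1e-9, seed=0):
    """Nonnegative CP fit with multiplicative updates (least-squares loss).

    Assumption: this is a generic stand-in, not the method from the paper.
    Returns the three factor matrices and the relative error per iteration.
    """
    rng = np.random.default_rng(seed)
    A = rng.random((X.shape[0], rank))  # patient factors
    B = rng.random((X.shape[1], rank))  # diagnosis factors
    C = rng.random((X.shape[2], rank))  # procedure factors
    norm_x = np.linalg.norm(X)
    errs = [np.linalg.norm(X - reconstruct(A, B, C)) / norm_x]
    for _ in range(n_iter):
        # Each update multiplies by a ratio of nonnegative terms,
        # so the factors stay nonnegative throughout.
        A *= np.einsum('ijk,jr,kr->ir', X, B, C) / (A @ ((B.T @ B) * (C.T @ C)) + eps)
        B *= np.einsum('ijk,ir,kr->jr', X, A, C) / (B @ ((A.T @ A) * (C.T @ C)) + eps)
        C *= np.einsum('ijk,ir,jr->kr', X, A, B) / (C @ ((A.T @ A) * (B.T @ B)) + eps)
        errs.append(np.linalg.norm(X - reconstruct(A, B, C)) / norm_x)
    return A, B, C, errs

# Toy records (made up): patients 0-2 share diagnoses {0, 1} with procedure 0;
# patients 3-5 share diagnoses {2, 3} with procedures {1, 2}.
records = [(p, d, 0) for p in range(3) for d in (0, 1)]
records += [(p, d, q) for p in range(3, 6) for d in (2, 3) for q in (1, 2)]

X = build_tensor(records, n_patients=6, n_diagnoses=4, n_procedures=3)
A, B, C, errs = nonneg_cp(X, rank=2)

print("relative error: %.4f -> %.4f" % (errs[0], errs[-1]))
print("diagnosis factors:\n", np.round(B, 2))
```

In this sketch, each column index `r` plays the role of one phenotype candidate: the large entries of `B[:, r]` and `C[:, r]` name its diagnoses and procedures, and `A[:, r]` indicates which patients express it.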


Keywords: EHR phenotyping; tensor factorization; dimensionality reduction





Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Joyce C. Ho (1)
  • Joydeep Ghosh (1)
  • Jimeng Sun (2)
  1. Electrical and Computer Engineering Department, The University of Texas at Austin, Austin, USA
  2. College of Computing, Georgia Institute of Technology, Atlanta, USA
