Cohort analytics: efficiency and applicability


The abundant availability of health-care data calls for effective analysis methods to help medical experts gain a better understanding of their patients and their health. The focus of existing work has been largely on prediction. In this paper, we introduce Core, a framework for cohort “representation” and “exploration.” Our contributions are twofold. First, we formalize cohort representation as the problem of aggregating the trajectories of its patients. This problem is challenging because cohorts often consist of hundreds of patients who underwent medical actions of various types at different points in time. We prove that producing a representative cohort trajectory is NP-complete by a reduction from the multiple sequence alignment problem. We propose a heuristic that extends the Needleman–Wunsch algorithm for sequence matching to handle temporal sequences. To further improve the efficiency of cohort representation, we introduce “trajectory families” and “stratified sampling.” Our second contribution is formalizing the problem of cohort exploration as finding a set of cohorts that are similar to a cohort of interest and that maximize entropy. This problem is challenging because the potential number of similar cohorts is huge. We prove NP-completeness by a reduction from the maximum edge subgraph problem. To address this complexity, we develop a multi-staged approach based on limiting the search space to “contrast cohorts.” To speed up the computation of cohort similarity, we use “event sets,” which are inspired by the double dictionary encoding proposed for keyword search. We evaluate the usefulness and efficiency of Core with an extensive set of qualitative and quantitative experiments on two real health-care datasets. In a user study with medical experts, we show that Core reduces time-to-insight from hours to seconds and helps them find better insights than baseline approaches.
We also show that the obtained cohort representations offer the right trade-off between quality and performance. We study the benefits of trajectory families and stratified sampling for cohort representation and show their applicability to large and heterogeneous cohorts. Finally, we show the benefit of event sets for cohort exploration in providing interactive performance.
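
For readers unfamiliar with the alignment step mentioned above, the following is a minimal sketch of the classical Needleman–Wunsch global alignment, applied to two sequences of medical action labels. It is not Core's heuristic: it ignores timestamps entirely, whereas the paper extends the algorithm to temporal sequences. The function name `needleman_wunsch`, the scoring values (match = 1, mismatch = -1, gap = -1), and the action labels in the usage example are assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple

# A gap in an aligned sequence is represented by None.
AlignedPair = Tuple[Optional[str], Optional[str]]


def needleman_wunsch(
    a: Sequence[str],
    b: Sequence[str],
    match: int = 1,      # assumed score for identical actions
    mismatch: int = -1,  # assumed score for different actions
    gap: int = -1,       # assumed penalty for skipping an action
) -> Tuple[int, List[AlignedPair]]:
    """Classical global alignment of two action sequences (no timestamps)."""
    n, m = len(a), len(b)

    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def substitution(i: int, j: int) -> int:
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + substitution(i, j),  # align a[i-1], b[j-1]
                score[i - 1][j] + gap,                     # gap in b
                score[i][j - 1] + gap,                     # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    aligned: List[AlignedPair] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + substitution(i, j):
            aligned.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned.append((a[i - 1], None))
            i -= 1
        else:
            aligned.append((None, b[j - 1]))
            j -= 1
    aligned.reverse()
    return score[n][m], aligned


if __name__ == "__main__":
    # Hypothetical trajectories of two patients.
    p1 = ["admission", "cpap", "checkup", "discharge"]
    p2 = ["admission", "checkup", "discharge"]
    best, pairs = needleman_wunsch(p1, p2)
    print(best)
    for left, right in pairs:
        print(left, right)
```

The dynamic program runs in time proportional to the product of the two sequence lengths. Aligning all trajectories of a cohort at once is the multiple sequence alignment problem, which the abstract names as the source of hardness; this is why a pairwise heuristic is used.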




  1. Continuous Positive Airway Pressure.

  2. Throughout the paper, we use the shorter term “age” for “age category,” and “life” for “life status.”

  3. Throughout the paper, the dot notation represents the invocation of a function on its right-hand side for the object (e.g., a patient) on its left-hand side.




Funding was provided by CDP LIFE (Grant No. C7H-ID16-PR4-LIFELIG).

Author information



Corresponding author

Correspondence to Behrooz Omidvar-Tehrani.


Cite this article

Omidvar-Tehrani, B., Amer-Yahia, S. & Lakshmanan, L.V.S. Cohort analytics: efficiency and applicability. The VLDB Journal 29, 1527–1550 (2020).



  • Health-care data analysis
  • Cohort analytics
  • Cohort representation
  • Cohort exploration