Cohort analytics: efficiency and applicability

Omidvar-Tehrani, Behrooz; Amer-Yahia, Sihem; Lakshmanan, Laks V. S.

doi:10.1007/s00778-020-00625-6

Regular Paper
Published: 27 August 2020

Volume 29, pages 1527–1550, (2020)

The VLDB Journal

Behrooz Omidvar-Tehrani ORCID: orcid.org/0000-0002-9405-3386¹,
Sihem Amer-Yahia² &
Laks V. S. Lakshmanan³

Abstract

The abundant availability of health-care data calls for effective analysis methods that help medical experts better understand their patients and their health. Existing work has focused largely on prediction. In this paper, we introduce Core, a framework for cohort “representation” and “exploration.” Our contributions are twofold. First, we formalize cohort representation as the problem of aggregating the trajectories of a cohort's patients. This problem is challenging because cohorts often consist of hundreds of patients who underwent medical actions of various types at different points in time. We prove that producing a representative cohort trajectory is NP-complete by a reduction from the multiple sequence alignment problem. We propose a heuristic that extends the Needleman–Wunsch algorithm for sequence matching to handle temporal sequences. To further improve the efficiency of cohort representation, we introduce “trajectory families” and “stratified sampling.” Second, we formalize cohort exploration as the problem of finding a set of cohorts that are similar to a cohort of interest and that maximize entropy. This problem is challenging because the number of potentially similar cohorts is huge. We prove NP-completeness by a reduction from the maximum edge subgraph problem. To address this complexity, we develop a multi-staged approach that limits the search space to “contrast cohorts.” To speed up the computation of cohort similarity, we use “event sets,” which are inspired by the double dictionary encoding proposed for keyword search. We evaluate the usefulness and efficiency of Core through an extensive set of qualitative and quantitative experiments on two real health-care datasets. In a user study with medical experts, we show that Core reduces time-to-insight from hours to seconds and helps them find better insights than baseline approaches. We also show that the obtained cohort representations offer the right trade-off between quality and performance. We study the benefits of trajectory families and stratified sampling for cohort representation and show their applicability to large and heterogeneous cohorts. Finally, we show the benefit of event sets for cohort exploration in providing interactive performance.
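
For readers unfamiliar with the Needleman–Wunsch algorithm mentioned in the abstract, the following is a minimal sketch of the classic (non-temporal) global alignment of two sequences of medical actions. It is not the heuristic proposed in the paper, which extends the algorithm to temporal sequences. The function name, the scoring values (match, mismatch, gap) and the action labels are assumptions chosen for illustration only.

```python
from typing import List, Sequence, Tuple

GAP = "-"  # placeholder symbol for a gap (assumption, not from the paper)


def needleman_wunsch(
    a: Sequence[str],
    b: Sequence[str],
    match: int = 1,      # illustrative scoring values, not the paper's
    mismatch: int = -1,
    gap: int = -1,
) -> Tuple[int, List[str], List[str]]:
    """Classic global alignment of two action sequences (no timestamps)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a: List[str] = []
    out_b: List[str] = []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append(GAP); i -= 1
        else:
            out_a.append(GAP); out_b.append(b[j - 1]); j -= 1
    return score[n][m], out_a[::-1], out_b[::-1]


# Hypothetical patient trajectories, each a sequence of action labels.
p1 = ["visit", "drug", "test"]
p2 = ["visit", "test"]
total, al1, al2 = needleman_wunsch(p1, p2)
```

Here the second trajectory receives a gap where the first has the extra "drug" action. A temporal extension such as the one described in the abstract would additionally need to account for when each action occurred, which this sketch ignores.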

Notes

Continuous Positive Airway Pressure.
Throughout the paper, we use the shorter term “age” for “age category,” and “life” for “life status.”
Throughout the paper, the dot notation represents the invocation of a function on its right-hand side for the object (e.g., a patient) on its left-hand side.

Acknowledgements

Funding was provided by CDP LIFE (Grant No. C7H-ID16-PR4-LIFELIG).

Author information

Authors and Affiliations

NAVER LABS Europe, Meylan, France
Behrooz Omidvar-Tehrani
CNRS, University of Grenoble Alpes, Grenoble, France
Sihem Amer-Yahia
University of British Columbia, Vancouver, Canada
Laks V. S. Lakshmanan

Corresponding author

Correspondence to Behrooz Omidvar-Tehrani.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Omidvar-Tehrani, B., Amer-Yahia, S. & Lakshmanan, L.V.S. Cohort analytics: efficiency and applicability. The VLDB Journal 29, 1527–1550 (2020). https://doi.org/10.1007/s00778-020-00625-6

Received: 10 May 2019
Revised: 30 March 2020
Accepted: 08 August 2020
Published: 27 August 2020
Issue Date: November 2020
DOI: https://doi.org/10.1007/s00778-020-00625-6
