Using automated electronic medical record data extraction to model ALS survival and progression
To assess the feasibility of using automated capture of Electronic Medical Record (EMR) data to build predictive models for amyotrophic lateral sclerosis (ALS) outcomes.
We used an Informatics for Integrating Biology and the Bedside search discovery tool to identify and extract data from 354 ALS patients from the University of Kansas Medical Center EMR. The completeness and integrity of the data extraction were verified by manual chart review. A linear mixed model was used to model disease progression. Cox proportional hazards models were used to investigate the effects of BMI, gender, and age on survival.
Data extracted from the EMR was sufficient to create simple models of disease progression and survival. Several key variables of interest were unavailable without a manual chart review. The average ALS Functional Rating Scale – Revised (ALSFRS-R) baseline score at first clinical visit was 34.08, and the average decline was −0.64 points per month. Median survival was 27 months after first visit. Higher baseline ALSFRS-R score and BMI were associated with improved survival, whereas higher baseline age was associated with decreased survival.
This study shows that EMR-captured data can be extracted and used to track outcomes in an ALS clinic setting, which is potentially important for post-marketing research of new drugs or for use as historical controls in future studies. However, as automated EMR-based data extraction becomes more widely used, there will be a need to standardize ALS data elements and clinical forms for data capture so that data can be pooled across academic centers.
Keywords: Amyotrophic lateral sclerosis; Motor neuron disease; Disease progression; Electronic medical record
Abbreviations
ALS: Amyotrophic lateral sclerosis
ALSFRS-R: Amyotrophic lateral sclerosis functional rating scale – revised
EMR: Electronic medical record
FVC: Forced vital capacity
HERON: Healthcare Enterprise Repository for Ontological Narration
i2b2: Informatics for Integrating Biology and the Bedside
KUMC: University of Kansas Medical Center
Amyotrophic lateral sclerosis (ALS) is a fatal neurodegenerative disease. Although over 50 clinical trials have been conducted over the last two decades, none have been successful except those of riluzole and edaravone, which at best offer modest improvements in survival or function. While many studies may have failed because the drugs were ineffective, a recurring theme in ALS is trials that do not meet their primary outcome but yield indeterminate results. Two major hurdles to conducting ALS trials are the rarity of ALS (3.9 in every 100,000 people in the US) and the disease’s heterogeneity, which is a barrier to properly powered studies.
Methodology for rare-disease clinical trials is an important area of study for ALS researchers. Enriching trials with historical controls has become possible due to the creation of large pooled placebo data sets and is an approach used to select drugs for larger studies, as in the lithium and rasagiline studies [8, 9, 10]. Other benefits of large databases of ALS patients include constructing predictive models for screening particular subgroups of patients, which could reduce the heterogeneity of disease progression observed in a trial, and making interim decisions during the conduct of a clinical trial based on predicted and observed disease progression.
The wide implementation of Electronic Medical Record (EMR) systems across the United States, using one of two commercial systems, and the development of automated abstraction and de-identification of data create opportunities to: 1) better understand ALS disease progression and determinants of survival in the clinical setting; 2) use clinical data to enrich existing placebo-arm data sets to improve the power of trials; and 3) leverage this electronic infrastructure to run clinical trials, including EMR-based screening, randomization, and data collection. For these approaches to be worthwhile, we need to demonstrate the feasibility of automatically extracting the data required for modelling ALS disease progression and survival directly from the EMR.
We consider the feasibility of constructing statistical models from automatically captured EMR patient data from our ALS clinic at the University of Kansas Medical Center (KUMC). This is a key first step in utilizing the EMR to augment clinical trials.
We first determined what specific data was necessary to build models for ALS disease progression and survival. Variables of interest for such models include, at a minimum, demographic information (age, race, and gender), survival information (vital status and date of death), ages of disease onset and diagnosis [5, 11], site of disease onset (typically bulbar or limb) [5, 12, 13, 14], riluzole use, BMI [5, 15], FVC [15, 16], and ALS Functional Rating Scale – Revised (ALSFRS-R) score [13, 17, 18, 19]. The ALSFRS-R, which is the gold standard for measuring ALS disease progression, is a clinician-administered series of twelve questions concerning the ability to perform basic functional activities such as eating, walking, dressing, and breathing. Each question is rated on a 0–4 scale, with the maximum total score of 48 representing normal function.
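The scoring rule described above (twelve items, each rated 0–4, summed to a total out of 48) can be sketched in a few lines. This is only an illustrative sketch: the function name and the validation behaviour are our own assumptions and are not part of HERON, i2b2, or the study's analysis code.

```python
from typing import Sequence

N_ITEMS = 12   # the ALSFRS-R has twelve questions
MAX_ITEM = 4   # each question is rated on a 0-4 scale


def alsfrs_r_total(item_scores: Sequence[int]) -> int:
    """Return the ALSFRS-R total (0-48) from twelve item ratings.

    Hypothetical helper for illustration; raises ValueError on malformed input
    so that incomplete or mis-keyed records are caught before modelling.
    """
    if len(item_scores) != N_ITEMS:
        raise ValueError(f"expected {N_ITEMS} item scores, got {len(item_scores)}")
    for score in item_scores:
        is_int = isinstance(score, int) and not isinstance(score, bool)
        if not is_int or not 0 <= score <= MAX_ITEM:
            raise ValueError(f"item score must be an integer in 0-{MAX_ITEM}: {score!r}")
    return sum(item_scores)


# A patient with normal function scores 4 on every item.
print(alsfrs_r_total([4] * N_ITEMS))  # 48
```

A check of this kind is one simple way to confirm that extracted sub-scores and totals agree before they are used in a model.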
To determine whether these variables could be automatically extracted from the EMR, we conducted a retrospective chart review of patients seen at the KUMC ALS Clinic between summer 2013 and summer 2016. We obtained this data directly from the EMR using the KUMC Healthcare Enterprise Repository for Ontological Narration (HERON), powered by Informatics for Integrating Biology and the Bedside (i2b2), a discovery tool that allows searches of de-identified EMR data [21, 22, 23]. KUMC’s EMR is provided by Epic (EPIC EMR system, Epic Systems Corporation, Verona, USA, 2015). Using patients’ medical record numbers, this dataset was then verified for completeness and accuracy by manual review of the EMR records. Because we were interested in the efficiency of automated tools versus manual review, the number of hours spent performing the automated extraction and the manual review was tracked.
Analysis of disease progression
Disease progression was measured by patients’ average change per month in ALSFRS-R score. Each patient’s ALSFRS-R score over time (in months since first clinical visit, where the first clinical visit is time 0) was modelled via a linear mixed model that included random slopes and random intercepts (these were allowed to correlate); the fixed effects for the intercept and slope of this model represent the average baseline ALSFRS-R score and the average change in ALSFRS-R per month for the clinic. An individual estimate of a patient’s baseline ALSFRS-R score can be obtained by adding the estimated fixed intercept effect to the patient’s estimated random intercept effect; similarly, the individual estimate of a patient’s change in ALSFRS-R per month can be obtained by adding the fixed slope effect to the patient’s estimated random slope effect. Linearity was assumed from the literature [14, 24, 25] and verified via diagnostic plots (Additional file 1). These models were fit using the nlme package in R.
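The model described above can be written out explicitly. The notation below is ours, introduced for illustration, and the normality assumptions are the conventional ones for this model class rather than something stated in the text.

```latex
\begin{align*}
y_{ij} &= (\beta_0 + b_{0i}) + (\beta_1 + b_{1i})\, t_{ij} + \varepsilon_{ij}, \\
\begin{pmatrix} b_{0i} \\ b_{1i} \end{pmatrix}
  &\sim \mathcal{N}\!\left(
    \begin{pmatrix} 0 \\ 0 \end{pmatrix},
    \begin{pmatrix} \sigma_0^2 & \sigma_{01} \\ \sigma_{01} & \sigma_1^2 \end{pmatrix}
  \right), \qquad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2).
\end{align*}
```

Here $y_{ij}$ is the ALSFRS-R score of patient $i$ at visit $j$, and $t_{ij}$ is the time in months since that patient's first clinical visit. The fixed effects $\beta_0$ and $\beta_1$ are the clinic-wide average baseline score and average monthly change; $\beta_0 + b_{0i}$ and $\beta_1 + b_{1i}$ are the patient-specific baseline and monthly change. A nonzero covariance $\sigma_{01}$ is what allows the random intercepts and slopes to correlate.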
Analysis of survival
Our survival model analyzed time from patients’ first clinical visit to death (or censoring). Survival data captured by HERON includes data from both the EMR and the Social Security Death Index. Median survival was estimated via the Kaplan-Meier method, with the confidence interval given by the log-log transformation. A Cox proportional hazards model was employed to assess the simultaneous effects of the available predictors: BMI, age, and ALSFRS-R score at first visit, and gender. Seventy-two patients were missing baseline BMI and were excluded from the Cox model. All analyses were done using R (version 3.2.4).
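To make the median-survival estimate concrete, the following is a minimal sketch of the Kaplan-Meier product-limit estimator. The function names and the toy data are hypothetical and are not the study data; the analysis itself was done in R, and this sketch does not compute the log-log confidence interval.

```python
from typing import List, Optional, Sequence, Tuple


def kaplan_meier(times: Sequence[float], died: Sequence[bool]) -> List[Tuple[float, float]]:
    """Return [(t, S(t))] at each distinct death time.

    times: months from first clinic visit to death or censoring.
    died:  True if the patient died at that time, False if censored.
    """
    if len(times) != len(died):
        raise ValueError("times and died must have the same length")
    death_times = sorted({t for t, d in zip(times, died) if d})
    curve = []
    surv = 1.0
    for t in death_times:
        # Patients still under observation just before time t.
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, d in zip(times, died) if d and u == t)
        surv *= 1.0 - deaths / at_risk
        curve.append((t, surv))
    return curve


def median_survival(curve: Sequence[Tuple[float, float]]) -> Optional[float]:
    """Smallest death time at which estimated survival is at most 0.5."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None  # median not reached within follow-up


# Toy example (made-up values): eight patients, three of them censored.
toy_times = [3, 8, 8, 14, 20, 25, 31, 40]
toy_died = [True, True, False, True, True, False, True, False]
print(median_survival(kaplan_meier(toy_times, toy_died)))  # 20
```

Censored patients contribute to the at-risk count until they leave observation, which is why the estimate differs from simply taking the median of observed death times.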
Accuracy of EMR data
A general search based on ICD-10 code identified 572 subjects; 354 patients had at least one ALSFRS-R recorded in the EMR (62.4%), 352 of whom were deemed eligible for analysis (two were excluded due to nonsensical death dates) (Fig. 1). Manual review verified that the ALSFRS-R totals and sub-scores were accurate.
Specific data automatically extracted from the KUMC EMR by HERON, and data that required a manual chart review
Information automatically extracted from the EMR using HERON:
- Subject age, race, gender, ethnicity
- ALSFRS-R and its sub-scores, BMI, FVC (raw and percent-predicted)
- Death status, date of death
Information requiring manual chart review:
- Date of disease onset, date of diagnosis, site of disease onset
The time spent coordinating with the team at HERON to properly identify and extract variables of interest took roughly 3 h. The manual review took over 30 h. Once the variables of interest were properly identified within the EMR, obtaining the data through HERON became a matter of minutes rather than hours.
Demographic information of KUMC ALS clinical patients. Baseline is defined as the time of a patient’s first recorded ALSFRS-R score at KUMC
Number of patients
Percent female / Male
Percent Caucasian / Non-Caucasian
Percent limb onset / Bulbar / Other: 65 / 27 / 8
Percent using riluzole Yes / No / NA: 63 / 35 / 2
Percent survived during follow up
Median time from baseline to last record, months (IQR)
Median age at baseline (IQR)
Median number of months from onset to baseline (IQR)
Median baseline FVC percent predicted (IQR)
Median BMI at baseline (IQR)
Median number of ALSFRS-R assessments (IQR)
Median baseline ALSFRS-R (SD)
Analysis of disease progression
Analysis of survival
Hazard ratios from Cox model
Age (at onset)
ALSFRS-R total (baseline)
Here we demonstrate the feasibility of using an automated extraction tool (HERON) to obtain ALS patient data directly from the KUMC EMR which could be used for analysis of ALS disease progression and survival. While data pertaining to demographic, ALSFRS-R, and survival information was both readily obtainable and accurate, some key variables (especially disease onset time and riluzole use) were only available via manual EMR review and/or suffered from large amounts of missing data.
The main advantage of automated tools such as HERON is that they drastically reduce the time needed to accurately capture EMR data compared with a manual review of the EMR. This methodology is generalizable across other research sites: Epic is one of the two major EMR systems in the US, serving over 50% of patients in the US, and is used by a large number of academic centers with ALS clinics. The automatic extraction tool HERON is powered by i2b2, which is used by dozens of research institutions within the US and abroad.
Looking towards the future, as EMR data becomes more complete, other advantages of this approach will emerge. Complete and comprehensive ALS records in the EMR would allow clinicians to track the performance of their patients clinic-wide and compare it with that of other ALS clinics, for both research and quality control purposes. For example, the average ALSFRS-R decline per month in the KUMC clinic of 0.64 is at the high end of reports from other clinics, which report monthly ALSFRS-R declines of between 0.36 and 0.65 [14, 31, 32, 33]. Note that this may be because we were unable to adjust for how long patients have had the disease.
Other future advantages include the ability to perform retrospective studies quickly and efficiently, which could create support for new therapeutics or improvements to standards of care. This depends heavily on tracking patients’ use of therapeutics in a way that is accessible in the EMR. EMR data could also be used to augment clinical trial data, serving either as a placebo/standard-of-care arm or as historical controls. This has become a vital issue for the broader ALS community. For example, approval of edaravone in the US has raised many questions about which patients will benefit from this therapy and for how long. These questions could be answered by pooling ALS clinic data. In addition, edaravone has put a limit on how broadly existing placebo data sets like PRO-ACT can be used for historical controls in clinical trials. Contemporary controls captured through automated EMR data abstraction could be one solution to this problem [1, 35, 36].
One current criticism of ALS clinical trials is that the ALS patients who take part in these trials are not representative of the general ALS population, which is likely due to the rigorous inclusion/exclusion criteria for these trials. One simple way to make ALS trials more representative is to modify the inclusion/exclusion criteria; however, the resulting increase in patient variability would require very large studies. Again we see the potential utility of EMR data: with a more general trial population, we would be free to use the EMR to augment the control population for these trials. Networks such as the Northeast or Western ALS Study Groups could provide placebo or standard-of-care arms in a variety of designs, and could make such large-scale studies possible.
The main disadvantage of this approach is the current lack of completeness of the EMR with respect to critical ALS data, resulting in incomplete statistical models. To use the EMR as we propose across multiple academic centers, the ALS community would need to agree on a set of common data elements or ALS-related forms to capture in the EMR. Such agreement could allow common data dictionaries to be used to allow for automated data capture not just across academic centers, but across different EMR platforms (i.e. Epic and Cerner). Furthermore, physicians and their clinic personnel would need to adhere to these data dictionaries, and then rigorously enter all the required data for each patient at each visit. Many efforts have already been made toward developing these common data sets for ALS: much of the field already captures the ALSFRS-R, the FVC, and details about the diagnosis at each visit. In addition several initiatives are underway to standardize forms across institutions, with a suite of ALS clinic forms available for download through Epic Central.
One example of critical information that needs to be collected in a standardized way is disease onset time. Because disease duration (which is derived from disease onset time) is critical for both survival and disease progression modelling [5, 12, 24, 25, 39], ALS clinics need to dedicate a data-capture form to it, as opposed to entering it in free-text notes or comments where it is difficult to find systematically. Other critical variables include usage of approved therapeutics (such as riluzole or edaravone), time of diagnosis, and location of symptom onset.
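As an illustration of what a dedicated, structured capture of onset information might look like, the sketch below defines a small record type and derives disease duration from it. All names and fields here are hypothetical; this is not an existing Epic form, i2b2 ontology, or agreed common data element, and the rule that diagnosis cannot precede symptom onset is our own assumption.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class OnsetSite(Enum):
    BULBAR = "bulbar"
    LIMB = "limb"
    OTHER = "other"


@dataclass(frozen=True)
class OnsetRecord:
    """Hypothetical structured fields replacing free-text onset notes."""
    symptom_onset: date
    diagnosis: date
    onset_site: OnsetSite

    def __post_init__(self) -> None:
        # Assumed consistency rule: diagnosis should not precede symptom onset.
        if self.diagnosis < self.symptom_onset:
            raise ValueError("diagnosis date precedes symptom onset")


def months_between(start: date, end: date) -> int:
    """Whole calendar months elapsed from start to end."""
    if end < start:
        raise ValueError("end date precedes start date")
    months = (end.year - start.year) * 12 + (end.month - start.month)
    if end.day < start.day:
        months -= 1  # the final month is not yet complete
    return months


def disease_duration_months(record: OnsetRecord, visit: date) -> int:
    """Disease duration at a clinic visit, derived from the onset date."""
    return months_between(record.symptom_onset, visit)


record = OnsetRecord(date(2014, 1, 15), date(2014, 9, 1), OnsetSite.LIMB)
print(disease_duration_months(record, date(2015, 1, 15)))  # 12
```

Storing onset as typed fields rather than free text is what would let an extraction tool compute disease duration at every visit without manual review.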
We were able to use automated extraction tools to accurately obtain necessary variables from the EMR with which to create simple statistical models of both ALS disease progression and survival time. Key variables that might offer large improvements to these models (such as disease onset time or riluzole use) were unavailable via automatic extraction. In the future, as automated EMR data abstraction becomes increasingly important for post-marketing surveillance of FDA approved drugs, or for use as concurrent controls, the ALS community will need to adopt common data elements for the EMR. Optimal use of the EMR requires disease-specific key variables, such as disease-onset time for ALS, to be identifiable and obtainable by data extraction tools as well as rigorous data entry by clinical staff.
The Mabel A. Woodyard Fellowship in Neurodegenerative Disorders and the Roofe Fellowship in Neuroscience Research funded the writing of the manuscript, the statistical programming, and the analyses. All other funding bodies supported the authors’ time to work on this project during design, analysis, and manuscript preparation. This work was supported by a CTSA grant from NCRR and NCATS awarded to the University of Kansas Medical Center for Frontiers: The Heartland Institute for Clinical and Translational Research # UL1TR000001 (formerly #UL1RR033179). The contents are solely the responsibility of the authors and do not necessarily represent the official views of the NIH, NCRR, or NCATS. J.S.’s work on this project was supported by a fellowship grant from the NCATS / Clinical Research in ALS and Related Disorders for Therapeutic Development Consortium awarded to the University of Miami (U54NS092091). HERON is supported by a CTSA grant from NCRR and NCATS awarded to the University of Kansas Medical Center for Frontiers: University of Kansas Clinical and Translational Science Institute # UL1TR002366 (formerly # UL1TR000001 and #UL1RR033179).
Availability of data and materials
The datasets generated and/or analysed during the current study are not publicly available because they contain identifying medical information, but de-identified data are available from the corresponding author on reasonable request.
AK managed the data, performed statistical analyses, drafted and revised the manuscript. JS proposed the study, and assisted in the analysis and interpretation of data, drafting/revising the manuscript for content, and also provided study supervision. LW, OJ, and RB assisted with drafting/revising the manuscript and the acquisition of the data. BG and JH assisted with analysis and interpretation of data, verification of statistical methods, and drafting/revising the manuscript. All authors read and approved the final manuscript.
Ethics approval and consent to participate
This retrospective chart review was IRB-approved and received a waiver of consent from the University of Kansas Medical Center Human Research Protection Program Institutional Review Board (IRB# STUDY00004291). Patient data was managed on secure servers at KUMC.
Consent for publication
Competing interests
JS is a consultant for aTyr, Acceleron, Fulcrum, Regeneron, and Strongbridge. The other authors declare that they have no competing interests.
- 1. WRITING GROUP ON BEHALF OF THE EDARAVONE (MCI-186) ALS 18 STUDY GROUP. Exploratory double-blind, parallel-group, placebo-controlled study of edaravone (MCI-186) in amyotrophic lateral sclerosis (Japan ALS severity classification: grade 3, requiring assistance for eating, excretion or ambulation). Amyotroph Lateral Scler Frontotemporal Degener. 2017;18(sup1):40–8.
- 3. Katyal N, Govindarajan R. Shortcomings in the Current Amyotrophic Lateral Sclerosis Trials and Potential Solutions for Improvement. Front Neurol. 2017;8. Available from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5626834/ [cited 17 Sep 2018].
- 4. Paul Mehta M. Prevalence of Amyotrophic Lateral Sclerosis — United States, 2012–2013. MMWR Surveill Summ. 2016;65. Available from: https://www.cdc.gov/mmwr/volumes/65/ss/ss6508a1.htm [cited 17 Sep 2018].
- 6. Hilgers R-D, König F, Molenberghs G, Senn S. Design and analysis of clinical trials for small rare disease populations. J Rare Dis Res Treat. 2016:53–60.
- 8. Statland JM, Moore D, Wang Y, Walsh M, Mozaffar T, Elman L, et al. Rasagiline for amyotrophic lateral sclerosis: a randomized controlled trial. Muscle Nerve. 2018. https://www.ncbi.nlm.nih.gov/pubmed/30192007.
- 23. Murphy SN, Mendis ME, Berkowitz DA, Kohane I, Chueh HC. Integration of clinical and genetic data in the i2b2 architecture. AMIA Annu Symp Proc. 2006;1040. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1839291/.
- 26. Pinheiro J, Bates D, DebRoy S, Sarkar D, Heisterkamp S, R-core. nlme: Linear and Nonlinear Mixed Effects Models. 2018. Available from: https://CRAN.R-project.org/package=nlme [cited 17 Sep 2018].
- 27. Social Security Death Master File. Available from: https://ladmf.ntis.gov/ [cited 17 Sep 2018].
- 28. R Core Team. R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2017. Available from: https://www.r-project.org/.
- JG. Epic Systems draws on literature greats for its next expansion. madison.com. Available from: https://madison.com/news/local/govt-and-politics/epic-systems-draws-on-literature-greats-for-its-next-expansion/article_4d1cf67c-2abf-5cfd-8ce1-2da60ed84194.html [cited 17 Sep 2018].
- 30. i2b2: Informatics for Integrating Biology & the Bedside. Available from: https://www.i2b2.org/work/i2b2_installations.html [cited 17 Sep 2018].
- 35. Kalin A, Medina-Paraiso E, Ishizaki K, Kim A, Zhang Y, Saita T, et al. A safety analysis of edaravone (MCI-186) during the first six cycles (24 weeks) of amyotrophic lateral sclerosis (ALS) therapy from the double-blind period in three randomized, placebo-controlled studies. Amyotroph Lateral Scler Frontotemporal Degener. 2017;18(sup1):71–9.
Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.