Background

The electronic health record (EHR) contains a vast quantity of data due to its observational nature, holding great promise as a valuable, efficient, and cost-effective tool. These data can inform quality improvement and research initiatives, especially those related to medical resources and patient outcomes [1, 2, 3]. In its initial implementation, however, the EHR rarely captures outcomes of interest to key stakeholders reliably and accurately, because variables are often disorganized, incorrect, or missing and lack rigorous extraction methodologies. Together, these problems limit the data’s validity and utility [4]. To provide validated results for scientific interpretation, rigorous, reproducible, and validated techniques must be established for each EHR variable of interest.

Many institutions rely on a structured repository of data drawn from the EHR, a so-called data warehouse or data mart, to facilitate ongoing access [5, 6]. This repository is frequently built after the EHR has been implemented and, at many institutions, is created and maintained by data analysts working in isolation from front-line clinicians. The intensive care unit (ICU) is a particularly challenging area for creation of a data mart. Critically ill patients suffer life-threatening pathology in at least one, if not many, organ systems. These patients are generally intensively monitored, with a very high frequency of physiologic data capture. Laboratory data may be obtained multiple times per day. Multiple organ support modalities may be employed, with complex documentation and monitoring to quantify the degree of support. Once located, however, these data can support surveillance, decision support, and modeling of outcomes [7].

We describe a methodology for the creation of a structured, rigorously constructed intensive care unit (ICU) data mart based on data automatically and routinely derived from the EHR. This was performed through identification of data elements commonly collected as part of routine clinical care that also hold value for quality improvement and research purposes. We then present a methodology for collecting, structuring, and assessing accuracy of data elements through sequential collection cycles. Importantly, this work was completed with data analysts and clinicians working side-by-side throughout the process to ensure data and clinical accuracy.

Methods

Identification

A multidisciplinary project team including clinical intensivists and data analysts met daily (virtually or in person) throughout construction, assembling a priority list of physiologic, laboratory, demographic, and billing data. Availability of these data in routine clinical practice was confirmed through extensive chart review. Data analysts worked to identify the location of variables within the data architecture underlying the EHR (Epic, Verona, WI). All elements were investigated for extent of chart documentation, frequency, duplication in different sites within the medical record, and agreement with the patient’s clinical course. As is common within our EHR, data can be entered under variations of variable names housed on different database tables. Database tables were therefore screened methodically, using clinician feedback, to ensure capture of variables across electronic sources. Variable location was confirmed once the location(s) of these key variables within the EHR had been vetted by both data analysts and clinicians. Key team members involved in identifying variables included anesthesiology and internal medicine faculty and trainees, business intelligence analysts, and database administrators, who function together as a team and all take part in the daily core meetings.
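As a minimal sketch of how screening for variable-name variations across tables might look, the following Python example searches a catalog of table labels for clinician-suggested synonyms. The function `find_variable_locations`, the catalog structure, and all table and label names are hypothetical illustrations and do not reflect the actual EHR schema or the tooling used in this work.

```python
import re


def find_variable_locations(catalog, synonyms):
    """Return sorted (table, label) pairs whose label matches any synonym.

    catalog:  dict mapping a table name to the variable labels stored in it.
    synonyms: name variants suggested by clinicians for a single concept.
    """
    # Case-insensitive literal matching; re.escape keeps symbols literal.
    patterns = [re.compile(re.escape(s), re.IGNORECASE) for s in synonyms]
    hits = []
    for table, labels in catalog.items():
        for label in labels:
            if any(p.search(label) for p in patterns):
                hits.append((table, label))
    return sorted(hits)


# Hypothetical catalog; real table and label names will differ.
catalog = {
    "flowsheet_rows": ["MAP (mmHg)", "Mean Arterial Pressure", "Heart Rate"],
    "device_feed": ["ART MAP", "SpO2"],
    "lab_results": ["Creatinine", "Lactate"],
}

locations = find_variable_locations(catalog, ["MAP", "mean arterial"])
for table, label in locations:
    print(table, "->", label)
```

Candidate locations found this way would still need to be vetted by clinicians and analysts, since a name match alone does not confirm that a field reflects the intended clinical concept.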

Extraction

We created algorithmic definitions for complex data elements, including most outcomes, leveraging existing literature when available. Test patients were extracted and algorithms were iteratively refined at least weekly based on the results of the preliminary validation methods described below. The number of iterations required was variable dependent, influenced by the fidelity of each variable within the EHR. The following preliminary validation checks were used to assess data quality for each variable throughout subsequent extractions: range check (are the data within a physiologic range?), type/format check (are the data presented in the format expected for that value?), check digit/length check (have the data been entered or saved correctly within the EHR?), and finally lookup (within a random sample of patients, are the data obtained an accurate reflection of the patients’ clinical course?).
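The four checks above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code used in this work: the function names, the heart-rate example, and the limits of 20 to 300 beats per minute are hypothetical, and the lookup step only draws the random sample, since the chart review itself is manual.

```python
import random
import re


def range_check(value, low, high):
    """Range check: is the value within a plausible physiologic range?"""
    return low <= value <= high


def format_check(raw, pattern):
    """Type/format check: does the raw entry match the expected format?"""
    return re.fullmatch(pattern, raw) is not None


def length_check(raw, min_len, max_len):
    """Length check: was the entry saved with a plausible number of characters?"""
    return min_len <= len(raw) <= max_len


def sample_for_lookup(patient_ids, n, seed=0):
    """Lookup: draw a reproducible random sample of patients for chart review."""
    rng = random.Random(seed)
    return rng.sample(sorted(patient_ids), min(n, len(patient_ids)))


def screen_record(raw):
    """Apply the automated checks to one raw heart-rate entry.

    The limits (1-3 digits, 20-300 beats/min) are illustrative assumptions.
    """
    ok_format = format_check(raw, r"\d{1,3}")
    ok_length = length_check(raw, 1, 3)
    # Only attempt the numeric range check when the entry parses as digits.
    ok_range = ok_format and range_check(int(raw), 20, 300)
    return {"format": ok_format, "length": ok_length, "range": ok_range}


for entry in ["72", "5", "7200", "abc"]:
    print(entry, screen_record(entry))

print(sample_for_lookup(range(1000), 5))
```

Keeping each check as a separate function makes it possible to report which specific check a variable fails, which is useful when an algorithm is refined over successive extraction cycles.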

Transformation and loading

Once a variable was shown to be reproducible and appropriately valid by the above methodologies in a broad cohort of patients, structured query language (SQL) was used to extract, transform, and load data from the EHR into a relational database housed on a departmental server. This same departmental server houses intraoperative variables within a perioperative data warehouse (PDW) that were obtained through similar methodologies [8]. The accuracy of the loaded databases was again formally assessed using lookup strategies on large representative cohorts of ICU patients. For variables with insufficient accuracy, investigators either returned to the identification or extraction steps to improve data quality or, if no improvement strategies could be identified, removed the data element from final inclusion within the data mart. After formal lookup assessments, patient outcome results were cataloged in appropriate outcomes tables for presentation and dissemination. This process is illustrated in Fig. 1. The total timeline for identification, extraction, and loading was approximately 3–6 months.
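To illustrate the extract, transform, and load pattern, the sketch below uses Python's built-in sqlite3 module as a stand-in for the departmental relational database. The table names (`raw_labs`, `mart_creatinine`), the columns, and the choice of creatinine as the example variable are assumptions made for illustration only; the unit conversion uses the standard factor of 88.4 µmol/L per 1 mg/dL of creatinine.

```python
import sqlite3


def run_etl(source, target):
    """Extract raw creatinine rows, harmonize units, and load them into the mart."""
    target.execute(
        "CREATE TABLE IF NOT EXISTS mart_creatinine ("
        "encounter_id INTEGER, result_time TEXT, value_mg_dl REAL)"
    )
    # Extract: pull only the rows for the variable of interest.
    rows = source.execute(
        "SELECT encounter_id, result_time, value, unit FROM raw_labs "
        "WHERE lab_name = 'creatinine'"
    ).fetchall()
    loaded = 0
    for encounter_id, result_time, value, unit in rows:
        # Transform: convert umol/L to mg/dL (1 mg/dL = 88.4 umol/L).
        if unit == "umol/L":
            value = value / 88.4
        elif unit != "mg/dL":
            continue  # unknown unit: hold back for review rather than guess
        # Load: write the harmonized value into the mart table.
        target.execute(
            "INSERT INTO mart_creatinine VALUES (?, ?, ?)",
            (encounter_id, result_time, round(value, 2)),
        )
        loaded += 1
    target.commit()
    return loaded


# Hypothetical source data held in memory for the demonstration.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE raw_labs (encounter_id INTEGER, result_time TEXT, "
    "lab_name TEXT, value REAL, unit TEXT)"
)
source.executemany(
    "INSERT INTO raw_labs VALUES (?, ?, ?, ?, ?)",
    [
        (1, "2020-01-01T08:00", "creatinine", 1.1, "mg/dL"),
        (2, "2020-01-01T09:00", "creatinine", 88.4, "umol/L"),
        (3, "2020-01-01T10:00", "lactate", 2.0, "mmol/L"),
        (4, "2020-01-01T11:00", "creatinine", 0.9, "unknown"),
    ],
)
target = sqlite3.connect(":memory:")
n_loaded = run_etl(source, target)
print(n_loaded, "rows loaded")
```

Rows with an unrecognized unit are skipped rather than converted, mirroring the principle that elements which cannot be validated are not advanced into the data mart.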

Fig. 1

Progression Through Variable Identification. Variables presented have been identified by clinicians as high-yield variables for quality improvement and research purposes. As these variables are located and validated within the electronic health record, they are added to the data mart. These data can then be analyzed to draw conclusive findings regarding outcomes including acute kidney injury, reintubation rates, and 7-day mortality, to name a few. Directionality depends on the quality of the variable and its accuracy across study stages, represented by the double-sided arrows.

Results

A total of 459,465 ICU patient encounters were identified and included in the ICU data mart, which is accessible and maintained through the Division of Informatics within the Department of Anesthesiology. These encounters include over 460,000,000 individual laboratory results and 4,610,776 vital signs (with 1-min fidelity in the first 24 h of admission). Using the above methodologies, a total of 26 outcomes were compiled (Table 1). These data have been structured within 19 tables, all of which have a sensitivity and specificity of greater than 95%. The iterative construction of our processes allows for continual structured updates and assessment of variables to maintain accuracy. When variables could not reach sufficient accuracy despite iterations, data analysts held collaborative meetings to review, check, and improve techniques. If accuracy could not be increased, variables were not advanced or included within final tables or projects because of insufficient accuracy.
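For readers less familiar with these metrics, the sketch below shows one way sensitivity and specificity could be computed by comparing algorithm-extracted outcome flags against chart-review results for the same patients. The function names and the example data are hypothetical; only the 95% threshold is taken from the text above.

```python
def sensitivity_specificity(extracted, reference):
    """Compare algorithm flags with chart-review flags for the same patients.

    Both arguments map patient id -> bool (outcome present or absent).
    Patients missing from `extracted` are treated as flagged absent.
    """
    tp = fp = tn = fn = 0
    for pid, truth in reference.items():
        predicted = extracted.get(pid, False)
        if truth and predicted:
            tp += 1
        elif truth and not predicted:
            fn += 1
        elif not truth and predicted:
            fp += 1
        else:
            tn += 1
    # Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


def meets_threshold(sensitivity, specificity, threshold=0.95):
    """True when both metrics exceed the threshold (95% in the text above)."""
    return sensitivity > threshold and specificity > threshold


# Hypothetical chart-review reference and algorithm output.
reference = {1: True, 2: True, 3: False, 4: False, 5: False}
extracted = {1: True, 2: False, 3: False, 4: False, 5: True}

sens, spec = sensitivity_specificity(extracted, reference)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
print("advance variable:", meets_threshold(sens, spec))
```

In this toy example the variable would not be advanced, since neither metric exceeds the threshold.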

Table 1 Variables Currently Validated Organized by Variable Type

Variables can be accessed or combined on request by data analysts. Reports are generated using data dictionaries to identify relevant variables dependent on the research or quality improvement project aims. Following institutional ethics approval, relevant data can be extracted and accessed in the desired format for clinicians or researchers.

Confirmed data are used to interpret patient outcomes for all patients within the ICU and ICU data mart. Additionally, these data can be joined to the 120 tables, comprising more than 1900 unique variable columns, within the existing perioperative data warehouse. Although the Division of Informatics resides under the Department of Anesthesiology, the data mart also includes data from ICU patients under the Medicine and Surgery departments and is accessible on request. The division is responsible for supporting administrators, clinicians, and researchers aiming to utilize ICU data for quality improvement or research endeavors following appropriate institutional approval.

Discussion

We present a methodology for building a robust and highly granular ICU data mart, leveraging the synergistic expertise of clinicians and data analysts. Optimizing the quality of data obtained from large databases will improve the accuracy of, and confidence in, informatics results used for quality improvement and research.

These processes can be adapted to new variables as they arise, providing real-time clinical data on large populations of patients within our ever-changing clinical environment. Several hospital systems involved in informatics research have already established similar organizational methodologies to ensure the quality of data obtained within their data warehouses [6, 9, 10]. It is often accepted that data are available through the EHR, but the difficulty lies in the disconnects that arise when extracting and utilizing those data [11]. The methodology presented provides a structure under which these data can be collected, incorporating key stakeholders’ requests and expertise, to draw results across institutional interfaces and patient locations. Together, these structured and validated methodologies strengthen the results obtained and their validity and trust within our research community. Even after these methodologies are established, systems require consistent upkeep and maintenance. Continual data maintenance and validation are not included within these methods but are equally essential to ensuring continued collection of usable data. The major value of this established methodology lies in its application to the additional variables and patient markers that are added to the EHR and identified as priorities for inclusion within the data mart. The same processes are adapted to ensure quality data collection and trust in the information obtained. As an example of the evolving needs addressed by adapting this same methodological structure, workgroups are extracting variables that identify positive coronavirus disease 2019 (COVID-19) test results for our variable lists, using the presented methodology to confirm the accuracy of results obtained across a variety of available laboratory data.

Similar to much of informatics research, our results are limited by the quality of data entered into the EHR. Missing and inaccurate data elements can be screened and eliminated when detected, but such errors are difficult to prevent entirely. The ability of our algorithms and methods to identify accurate data with high fidelity on repeated queries is evidence of the rigor of our data extraction method. We recognize, however, that this number is not validated and that some level of inaccuracy cannot be eliminated. As the architecture underlying our EHR changes and is updated, aspects of our data extraction may need to be updated as well to ensure continued legitimacy.

Our methodology and accuracy provide a strong foundation for the results obtained through our large ICU data mart. As we plan to add patient data throughout the hospitalization and perioperative periods, we will continue to establish structured methodologies to ensure data accuracy. Future work will aim to rigorously validate our results and variables within our institution and across multiple health care centers, with the goal of creating multicenter perioperative data warehouses with rigorously validated patient variables for quality improvement and research purposes.