Background

All over the world, emergency departments (EDs) are struggling with an increasing inflow of patients, especially elderly patients with complex pathology that is difficult to assess due to simultaneous chronic diseases, risk factors and/or polypharmacy [1, 2]. ED clinicians need to make fast and accurate risk estimates, and optimal management from the start is crucial for good patient outcomes. At the same time, the amount of available clinical information in electronic medical records is increasing, as is the total body of medical knowledge. Often the ED physician can no longer grasp and process all available information, making it impossible for an individual clinician to provide the theoretically best possible care.

Artificial intelligence (AI) and machine learning (ML) are now developing fast, and most industries will likely be fundamentally changed by AI in the coming years [3]. In medicine, AI and ML provide new possibilities when applied to extensive electronic health records and registers [4]. The most impressive advances have occurred in radiology and pathology, where ML accuracy of image classifications now exceeds that of humans [5]. In emergency medicine, AI/ML-driven decision support tools have the potential to improve diagnostic accuracy [5], alleviate ED crowding [6, 7], and decrease the use of inpatient beds and healthcare costs [8]. The Swedish Board of Health and Welfare has therefore emphasized the great potential of AI/ML in emergency medicine [9]. So far, however, there have been few AI/ML studies in the ED setting, and practically no implementation in routine ED care. The creation of ML-based decision support for ED use requires large amounts of high-quality clinical data, preferably from representative unselected ED patients in routine care.

In the present paper we describe the rationale for, and construction of, the Skåne Emergency Medicine (SEM) cohort and outline possible studies. The SEM cohort is a recently established data platform for developing clinical decision support systems (CDSS) based on traditional statistical methods or AI/ML, to be used in ED triage or later in the management of specific patient conditions. Specific aims include the prediction of diagnoses, critical interventions (e.g. defibrillation in cardiac arrest, thrombolysis in stroke) or inpatient care within 30 days of the ED visit, and mortality up to 1 year after the ED visit. We also describe the process of building the SEM dataset with careful consideration of ethics, data protection, and bias. With the SEM cohort, we hope to create CDSS that can be tested in randomized trials in routine emergency care.

Methods/design

The formation of the SEM cohort was an initiative within the Artificially Intelligent use of Registers at Lund University (AIR Lund) research environment [10], which is a multidisciplinary collaboration between Lund University (Emergency medicine, Internal medicine, Epidemiology and biostatistics, Computational biology, Technology and society/ethics, and Law), Halmstad University (Information technology), and the Swedish health care regions Skåne and Halland.

Setting

Skåne is Sweden’s southernmost region and has some 1.4 million inhabitants. Healthcare is publicly financed with a small copayment at every visit. Patients in Region Skåne almost always go to the nearest ED, and in general do not seek care outside the region. The SEM cohort includes data from patients presenting at eight general EDs in Skåne from January 1st, 2017 to December 31st, 2018. The characteristics of these EDs are described in Table 1. Five EDs are open 24/7/365 (Skåne university hospital at Lund and Malmö, Helsingborg general hospital, Kristianstad central hospital and Ystad hospital) and three EDs are open during office hours (Landskrona, Trelleborg and Hässleholm hospitals). These EDs see very few patients with psychiatric disorders or with problems related to obstetrics/gynecology or ophthalmology, and very few pediatric patients without orthopedic problems, since there are specialized EDs for these patients in the region. As shown in Table 1, the yearly ED census ranges from 5000 (Landskrona) to 80 000 (Malmö) patient cases, and admission rates to in-hospital care range from 20% (Helsingborg) to 32% (Hässleholm and Landskrona). All EDs use the rapid emergency triage and treatment system (RETTS [11]), which includes five priority levels: Highest priority 1 (Red); Priority 2 (Orange); Priority 3 (Yellow); Lowest priority 4 (Green); and Priority primary care (Blue). The RETTS set of chief complaints is thus common to all EDs in the SEM cohort. All EDs have similar access to patient testing, and clinical guidelines are generally the same in the entire region.

Table 1 Characteristics of the EDs included in the SEM cohort, after Welch et al. [21]. *Trauma level according to the American College of Surgeons [22]. EM, emergency medicine; ENT, Ear nose and throat; Ob/Gyn, Obstetrics/Gynecology

During and after the data collection period, patients were informed in writing, via public advertising on a website, of the purpose and structure of the SEM cohort and that they could decline participation at any time, for any reason, by contacting a research nurse or the first author at Lund. The creation of the SEM cohort and its use for AI/ML research and cross-sectional analyses have been approved by the Swedish Ethical Review Authority (Dnr 2019–05783), and by Region Skåne (302-19). There is no approval for commercial use of the data.

Data collection

During the study period, all patients at the eight EDs were included in the SEM cohort by default via identification in the common ED patient log system (Patientliggaren™, Tietoevry [12]), and data from the other registers (below) were then linked by each patient’s unique Swedish identification (ID) number, which is universally used in Swedish healthcare and all government registers. After collection and linkage, all data were pseudonymized with patient study ID numbers and kept on secure servers behind firewalls at Lund University where access is logged. The key between personal and study IDs is kept separately on a Region Skåne server with standard healthcare data security.
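The pseudonymization step described above can be illustrated with a minimal sketch. It is not the procedure actually used for SEM; the function name `pseudonymize`, the field names `personal_id` and `study_id`, the `SEM-` prefix, and the example rows are all hypothetical. The sketch only shows the general idea that random study IDs replace personal IDs in the research data, while the key linking the two is returned separately so that it can be stored elsewhere.

```python
import secrets


def pseudonymize(records, id_field="personal_id"):
    """Replace personal IDs with random study IDs (illustrative only).

    Returns (pseudonymized_records, key). The key maps personal ID to
    study ID and is meant to be stored separately from the research data.
    """
    key = {}
    used = set()
    output = []
    for record in records:
        personal_id = record[id_field]
        if personal_id not in key:
            # Random token, so the study ID cannot be derived from the personal ID.
            study_id = "SEM-" + secrets.token_hex(8)
            while study_id in used:  # regenerate in the unlikely event of a collision
                study_id = "SEM-" + secrets.token_hex(8)
            used.add(study_id)
            key[personal_id] = study_id
        # Copy every field except the personal ID, then attach the study ID.
        cleaned = {k: v for k, v in record.items() if k != id_field}
        cleaned["study_id"] = key[personal_id]
        output.append(cleaned)
    return output, key


# Hypothetical linked rows with made-up IDs: two rows for one patient, one for another.
linked = [
    {"personal_id": "A", "register": "ed_log", "triage": 3},
    {"personal_id": "A", "register": "lab", "test": "troponin"},
    {"personal_id": "B", "register": "ed_log", "triage": 2},
]
research_rows, id_key = pseudonymize(linked)
```

In this sketch, rows belonging to the same patient share one study ID, so linkage across registers is preserved after the personal ID has been removed.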

The data sources include healthcare databases and registers with complete national or regional coverage, which should ensure close to complete data on all patient visits. As much as possible, we used well-described high-quality data sources (see e.g. references [13,14,15,16]) to collect the SEM data in order to decrease bias and data errors. The amount of missing data varies across the sources but is generally very low. Data variables were chosen based on importance in the emergency care process as well as availability in the source registers. The collected data were the same as used in clinical care, and there was no major change in data labelling during 2017–2018. The SEM cohort was not designed with a specific CDSS or study in mind, but the size of the cohort (below) and the number of variables and data included were chosen to ensure sufficient statistical power for most CDSS research projects.

Data from the source registers were kept in their exported form with no deletion or curation, and software scripts are used to extract data to form tailor-made new datasets for each specific research project. Data curation or deletion will generally take place in each CDSS project, and only as needed in the original SEM cohort data.
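As a rough illustration of this approach, the sketch below builds a project-specific dataset from exported rows without modifying the source. The function `extract_project_dataset`, the variable names, and the example rows are assumptions made for this example and do not reflect the actual SEM scripts or variable lists.

```python
import copy


def extract_project_dataset(source_rows, variables, include=lambda row: True):
    """Return new rows restricted to `variables`, for rows that pass `include`.

    Values are deep-copied, so later curation of the project dataset
    cannot alter the exported source rows.
    """
    return [
        {name: copy.deepcopy(row.get(name)) for name in variables}
        for row in source_rows
        if include(row)
    ]


# Hypothetical exported rows (made-up values).
source = [
    {"study_id": "S1", "age": 71, "chief_complaint": "chest pain", "admitted": True},
    {"study_id": "S2", "age": 34, "chief_complaint": "hand injury", "admitted": False},
    {"study_id": "S3", "age": 58, "chief_complaint": "chest pain", "admitted": False},
]

# A project on chest pain patients that only needs three variables.
chest_pain = extract_project_dataset(
    source,
    variables=["study_id", "age", "admitted"],
    include=lambda row: row["chief_complaint"] == "chest pain",
)
```

Keeping selection and curation in per-project scripts of this kind means that each derived dataset can be regenerated from the unchanged export.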

As shown in Table 2, the available data for each patient visit include the patient’s baseline data, data on the ED visit, and outcomes within 30 days and up to three years after the ED visit: diagnoses, ED returns, hospital admissions, death, and healthcare costs. In total, the SEM data include several hundred variables for each patient, and many more that can be calculated from the original variables, such as ED crowding or boarding data, return visits, and mortality at different times after ED arrival. Detailed variable lists are available on reasonable request.
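Two of the calculated variables mentioned above, mortality at a given time after ED arrival and return visits, could be derived roughly as follows. This is a sketch under assumptions: the function names, the 3-day return window, and the example timestamps are hypothetical and not taken from the SEM variable definitions.

```python
from datetime import datetime, timedelta


def died_within(arrival, death, days):
    """True if a death was registered within `days` days after ED arrival."""
    return death is not None and arrival <= death <= arrival + timedelta(days=days)


def has_return_visit(arrival, other_arrivals, days):
    """True if the same patient has a later ED arrival within `days` days."""
    limit = arrival + timedelta(days=days)
    return any(arrival < other <= limit for other in other_arrivals)


# Made-up example: one index visit, two later visits, and a date of death.
index_visit = datetime(2017, 3, 1, 14, 30)
later_visits = [datetime(2017, 3, 3, 9, 0), datetime(2017, 6, 1, 12, 0)]
date_of_death = datetime(2017, 3, 20, 8, 0)

mortality_7d = died_within(index_visit, date_of_death, 7)    # False: death on day 19
mortality_30d = died_within(index_visit, date_of_death, 30)  # True
return_3d = has_return_visit(index_visit, later_visits, 3)   # True: visit two days later
```

The time window is a parameter, so the same functions yield mortality or return visits at any follow-up time covered by the data.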

Table 2 Available data for each patient visit in the SEM cohort

The SEM cohort is thus mainly based on register data and does not include free text information such as the patient’s detailed symptom history, findings at the physical examination, reasons for decisions and preliminary assessments. Also missing are the initial ED vital signs (blood oxygen saturation, respiratory rate, pulse rate, blood pressure, consciousness level and body temperature) and pharmacological treatment in the ED, since these data are primarily recorded on paper in the region. However, all this missing information can be obtained as needed by manual review of the individual patient records. As for diagnostic tests, ECG data are available as the raw signal, amplitude/interval measurements as well as the machine interpretation, and imaging and functional test data are available as the free text results. The images are not part of the SEM cohort data but can be obtained in specific projects.

Basic cohort characteristics

The SEM cohort is briefly described in Table 3 and includes 325 539 unique patients with 630 275 ED visits during 2017 and 2018. Fewer than five patients declined participation, which makes the cohort almost 100% complete. The mean age was 55 years, 49% were male and 23.5% of all patients arrived by ambulance. The most common triage category was 3, Yellow, and 15.0% of the patients had no registered triage category, mostly due to immediate referral from the ED to external primary care or self-care. Of the patients, 11% had previous diagnoses of diabetes, 10% of cancer, 8% of pulmonary disease, and 1.7% suffered from dementia.

Table 3 Baseline patient characteristics and management in the SEM cohort. Std, standard deviation. *Among the unique patients

Table 4 shows that the most common chief complaint in SEM was abdominal pain, followed by chest pain, dyspnea, hand injury and unspecific disorder. (The term “unspecific disorder” is used when the triage nurse is unable to classify the patient’s problem using the more specific terms in the system.) Some 9% of all visits had no registered chief complaint, again mostly because of immediate referral to primary or self-care. The median time to doctor was 70 min and the median length of stay was 206 min. In 24% of all ED visits the patient was admitted to in-hospital care.

Table 4 Twenty most common chief complaints in the SEM cohort, according to the RETTS system [11]. The term “Unspecific disorder” is used when the triage nurse is unable to classify the patient’s problem using the more specific terms in the system

As can be seen in Table 5, the most common discharge diagnoses were bacterial pneumonia, cerebrovascular incident, and acute myocardial infarction. Mortality was 0.2% at the ED, 0.9% within 7 days, and 2.2% within 30 days.

Table 5 Selected discharge diagnoses from the ED or from in-hospital care directly following the ED visit, in the SEM cohort

Discussion

In addition to CDSS development, SEM’s large amount of real-world ED patient data with almost complete follow-up will allow research in many fields of emergency medicine: Epidemiology, patient management, diagnostics, prognostics, ED crowding, resource allocation, and social medicine. Some of these studies may need supplementary ethics approval. The SEM cohort is currently being used to analyze cases of missed acute aortic syndrome, for prediction of venous thromboembolism, mapping of characteristics and outcomes in patients with dizziness or with head trauma, and for the evaluation of emergency care for adult patients with congenital heart disease.

Studies of the epidemiology of ED patients may be beneficial for public health surveillance, resource planning, evaluating healthcare delivery and for facilitating research, e.g. sample size calculations for prospective studies. Epidemiological information supports clinical evidence-based decision-making and enables the ED to organize according to the needs of the population. The SEM cohort includes almost all patients presenting at eight EDs in southern Sweden during two years, and it should therefore be possible to obtain reasonably accurate and generalizable data on chief complaints and underlying disease states in the entire population as well as in subgroups based on age, sex, comorbidities or sociodemographics. Also, diurnal, weekly, and seasonal variations may be described.
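Diurnal and weekly variation can be described by simple counts of arrival times, as in the following sketch. The function name and the example timestamps are hypothetical, and a real analysis would normally also account for the number of days observed per weekday.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def arrivals_by_hour_and_weekday(arrival_times):
    """Count ED arrivals per hour of day (0-23) and per weekday name."""
    by_hour = Counter(t.hour for t in arrival_times)
    # datetime.weekday() returns 0 for Monday through 6 for Sunday.
    by_weekday = Counter(WEEKDAYS[t.weekday()] for t in arrival_times)
    return by_hour, by_weekday


# Made-up arrival times: two on Monday 2 January 2017, one on the following Tuesday.
arrivals = [
    datetime(2017, 1, 2, 10, 15),
    datetime(2017, 1, 2, 22, 40),
    datetime(2017, 1, 3, 10, 5),
]
by_hour, by_weekday = arrivals_by_hour_and_weekday(arrivals)
```

Seasonal variation could be tabulated in the same way by counting on `t.month` instead.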

ED patient management and its impact on outcomes may be studied in the SEM cohort by analyzing e.g. waiting times, length of ED stay, admissions to intensive care, as well as patients who left without being seen by a physician or who returned to the ED. These analyses may also be made in the absence or presence of ED crowding. As mentioned, pharmaceutical treatment at the ED is not immediately available but can be extracted for all patients from the digitized (scanned) ED patient paper records.

The SEM cohort allows analysis of the accuracy of diagnostic and functional testing by comparing pre-test probability with short or medium-term outcomes such as diagnoses or death.
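One conventional way to set up such an analysis is a 2x2 table of test result against outcome, combined with Bayes' rule in odds form to relate pre-test to post-test probability. The sketch below is an illustration with made-up counts, not an analysis plan for SEM; the function names are hypothetical, and the code assumes that the data contain at least one patient with and one without the outcome.

```python
def diagnostic_accuracy(pairs):
    """Sensitivity and specificity from (test_positive, outcome_present) pairs."""
    tp = fp = fn = tn = 0
    for test_positive, outcome_present in pairs:
        if test_positive and outcome_present:
            tp += 1
        elif test_positive:
            fp += 1
        elif outcome_present:
            fn += 1
        else:
            tn += 1
    return {"sensitivity": tp / (tp + fn), "specificity": tn / (tn + fp)}


def post_test_probability(pre_test_probability, likelihood_ratio):
    """Convert probability to odds, multiply by the likelihood ratio, convert back."""
    pre_odds = pre_test_probability / (1 - pre_test_probability)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)


# Made-up counts: 8 true positives, 2 false negatives, 5 false positives, 85 true negatives.
pairs = ([(True, True)] * 8 + [(False, True)] * 2
         + [(True, False)] * 5 + [(False, False)] * 85)
accuracy = diagnostic_accuracy(pairs)

# Positive likelihood ratio: sensitivity / (1 - specificity).
lr_positive = accuracy["sensitivity"] / (1 - accuracy["specificity"])
post_test = post_test_probability(0.10, lr_positive)
```

With these invented counts, a positive test raises a 10% pre-test probability to roughly 62%.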

The utilization and costs of diagnostic testing, hospital admission, and care at specific wards, available for each patient up to 30 days after the ED visit, can be used to analyze resource use in all patients and in specific subgroups. Also, the SEM cohort may be used to evaluate ED care and acute healthcare consumption in different socioeconomic and demographic groups, as well as inequalities and possible discrimination.

Strengths and limitations

SEM includes real-world clinical data from consecutive patients presenting to eight different EDs during two years. The large number of patient visits, variables, and clinical events should be sufficient for most analyses of interest. Data were collected in regular care and there are several general advantages with using routine care data when building CDSS. Firstly, it provides access to large amounts of data from a diverse and unselected patient population, which is crucial for developing CDSS that work across different patient demographics. Secondly, routine care data may be immediately available, reducing the cost and time required to collect data. Finally, routine care data collection will often allow simple tracking of patient outcomes and evaluation of the effectiveness of the CDSS, especially in a country with comprehensive healthcare databases like Sweden. In the future, it may be possible to use native, uncurated electronic health records directly for medical research [17]. Another strength of the multimodal SEM cohort is its potential utility in developing CDSS that provide relative risks of multiple diagnoses, in contrast to algorithms based on a single type of input and output (e.g. radiology algorithms detecting cancer), and current clinical decision support tools which often serve merely as rule-out tests, e.g. the PERC rule for pulmonary embolism.

SEM includes data from ED patient visits in one Swedish region, and the data may therefore not be generalizable to other populations or healthcare settings. There are few patients in the SEM cohort with problems related to psychiatry, obstetrics/gynecology, and ophthalmology, as well as few pediatric patients without orthopedic problems. Some clinical variables are missing or less readily available in SEM, e.g. free text imaging results that require manual review, and this will prevent or complicate the creation of some types of CDSS, as well as some data disaggregation. Missing data in SEM are rare, but there may be errors in the data, which can lead to biased or inaccurate CDSS. Since SEM data were registered as part of regular care, bias may also arise from different patient evaluation and management based on previous clinical findings (verification bias) or based on patients’ ethnic or socioeconomic background. Also, historical bias will exist in any clinical database, i.e. when the data no longer accurately reflect a new healthcare reality.

Several variables in the SEM database, such as time stamps in the ED and discharge diagnoses, were originally manually entered or determined subjectively and may therefore contain errors or bias. Diagnoses might also have been registered several times for the same care episode. Bias or errors in the training data will cause a high risk of bias in the final CDSS, but the size and impact of the problem will vary in different CDSS. The optimal approach to the potential problem with bias is therefore best determined in each use case and CDSS. Before clinical implementation, any CDSS based on SEM data should be carefully reviewed and prospectively tested in a clinical trial in the specific healthcare setting.

On the other hand, it should be noted that if a CDSS is intended to operate in real time with standard register data as input, it is preferable that the underlying ML model is developed using this type of data rather than curated data that do not reflect the “dirty” truth of day-to-day operations. With sufficiently large training data, current ML algorithms can cope with a fair amount of noise and navigate between varying levels of noise in different types of input data.

In addition to algorithm quality, several barriers to successful implementation and use of AI/ML-based CDSS must be considered: IT problems, low model transparency (black box algorithms), proprietary code, lack of trust and knowledge among physicians and decision-makers, legal framework (oversight, malpractice issues) and ethical issues, integrity risks and financial challenges [18,19,20]. However, the size and implications of these barriers will vary in different use cases.

In conclusion, the SEM cohort provides a platform for collaborative CDSS research. SEM’s large amount of real-world patient data with almost complete follow-up will also allow research in epidemiology, patient management, diagnostics, prognostics, ED crowding, resource allocation, and social medicine.

SEM cohort access

So far, collaborations have been established with other research groups at Lund and Halmstad Universities in Sweden. We welcome initiatives on international collaborative projects using the SEM cohort. Anonymized parts of the SEM database will be available for sharing on reasonable request, as will detailed variable lists. Please contact the corresponding author via email (ulf.ekelund@med.lu.se).