Deep reinforcement learning for multi-class imbalanced training: applications in healthcare

With the rapid growth of memory and computing power, datasets are becoming increasingly complex and imbalanced. This is especially severe in the context of clinical data, where there may be one rare event for many cases in the majority class. We introduce an imbalanced classification framework, based on reinforcement learning, for training on extremely imbalanced datasets, and extend it for use in multi-class settings. We combine dueling and double deep Q-learning architectures, and formulate a custom reward function and episode-training procedure, specifically designed to handle multi-class imbalanced training. Using real-world clinical case studies, we demonstrate that our proposed framework outperforms current state-of-the-art imbalanced learning methods, achieving fairer and more balanced classification, while also significantly improving the prediction of minority classes. Supplementary Information: The online version contains supplementary material available at 10.1007/s10994-023-06481-z.

United Kingdom National Health Service (NHS) approval via the national oversight/regulatory body, the Health Research Authority (HRA), has been granted for development and validation of artificial intelligence models to detect COVID-19 using routinely collected hospital data (CURIAL; NHS HRA IRAS ID: 281832).

D.2 Data Inclusion and Exclusion
All data used is part of the NHS data for research and subject to data opt-out (i.e. patients can apply to the NHS to stop their data from being used for research). Patients opting out of electronic health record (EHR) research were excluded.
Oxford University Hospitals NHS Foundation Trust (OUH): We included all patients attending acute and emergency care settings at OUH who received routine blood tests on arrival, considering presentations before December 1, 2019, and thus before the pandemic, as the COVID-19-negative (control) cohort. We considered presentations during the 'first wave' of the UK COVID-19 pandemic (December 1, 2019 to June 30, 2020) with PCR-confirmed SARS-CoV-2 infection as the COVID-19-positive (cases) cohort. We excluded patients who did not receive laboratory blood tests or were younger than 18 years of age. Due to incomplete penetrance of testing during the first wave of the pandemic, and imperfect sensitivity of the PCR test, there is uncertainty in the viral status of patients presenting during the pandemic who were untested or tested negative. We therefore selected a pre-pandemic control cohort during training to ensure absence of disease in patients labelled as COVID-19-negative. Clinical features extracted for each presentation included first-performed blood tests, blood gases, vital signs measurements and PCR testing for SARS-CoV-2 (Abbott Architect [Abbott, Maidenhead, UK], TaqPath [Thermo Fisher Scientific, Massachusetts, USA] and Public Health England-designed RNA-dependent RNA polymerase assays).
Portsmouth Hospitals NHS Foundation Trust (PUH): PUH considered all patients admitted to the Queen Alexandra Hospital, serving a population of 675,000 and offering tertiary referral services to the surrounding region, between March 1, 2020 and February 28, 2021. Confirmatory COVID-19 testing was by laboratory SARS-CoV-2 RT-PCR assay, considering any positive PCR result within 48 hours of admission as a true positive.

D.3 Preprocessing
We used electronic health record (EHR) data with linked, deidentified demographic information for all patients presenting to emergency departments. To better compare our results to previously published studies using the same datasets (Soltan et al., 2022; Yang et al., 2022a; Yang et al., 2022b), we used the same focused subset of routinely collected clinical features (including blood tests and vital signs) and patient cohorts.
The OUH training set consisted of COVID-free cases prior to the outbreak, so we matched every COVID-positive case to twenty COVID-free presentations based on age, representing a simulated prevalence of 5%. Consistent with previous studies, we also used population median imputation to replace any missing values. We then standardized all features in our data to have a mean of 0 and a standard deviation of 1.
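The imputation and standardization steps described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `fit_preprocessor` and `transform` are hypothetical, missing values are assumed to be encoded as `None`, and the statistics are assumed to be estimated on the training rows only (the text does not state which population the medians were taken from).

```python
from statistics import mean, median, pstdev

def fit_preprocessor(train_rows):
    """Learn per-feature (median, mean, std) from training rows; None marks a missing value."""
    n_features = len(train_rows[0])
    params = []
    for j in range(n_features):
        observed = [row[j] for row in train_rows if row[j] is not None]
        med = median(observed)
        # Impute first, then compute the scaling statistics on the imputed column.
        filled = [row[j] if row[j] is not None else med for row in train_rows]
        mu = mean(filled)
        sigma = pstdev(filled) or 1.0  # assumption: leave constant features unscaled
        params.append((med, mu, sigma))
    return params

def transform(rows, params):
    """Apply median imputation, then scale each feature to mean 0 and standard deviation 1."""
    return [
        [((x if x is not None else med) - mu) / sigma
         for x, (med, mu, sigma) in zip(row, params)]
        for row in rows
    ]

# Toy data with two features and two missing entries (illustrative values only).
train = [[1.0, 10.0], [None, 20.0], [3.0, None]]
params = fit_preprocessor(train)
scaled = transform(train, params)
```

Held-out rows would be passed through `transform` with the same `params`, so that no test-set statistics enter the preprocessing.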
A training set was used for model development, hyperparameter selection, and training; a validation set was used for threshold adjustment; and after successful development and training, held-out test sets were then used to evaluate the performance of the final model. Hyperparameters and threshold values used in the final models can be found in Supplementary Tables 7, 8, and 9 (Section F in the Supplementary Material).
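One common way to carry out threshold adjustment on a validation set is sketched below. The selection rule is an assumption made for illustration: the text does not specify how the thresholds were chosen, so this sketch simply takes the largest threshold whose validation sensitivity reaches a target. The names `sensitivity_specificity` and `pick_threshold`, and the toy labels and scores, are hypothetical.

```python
def sensitivity_specificity(y_true, scores, threshold):
    """Sensitivity and specificity when predicting positive for score >= threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return sens, spec

def pick_threshold(y_true, scores, target_sensitivity):
    """Largest candidate threshold whose validation sensitivity meets the target."""
    # Sensitivity can only rise as the threshold falls, so scan from high to low.
    for t in sorted(set(scores), reverse=True):
        sens, _ = sensitivity_specificity(y_true, scores, t)
        if sens >= target_sensitivity:
            return t
    return min(scores)  # fallback when the target cannot be met

# Toy validation labels and model scores (illustrative values only).
y_val = [0, 0, 1, 0, 1, 1]
val_scores = [0.1, 0.4, 0.35, 0.2, 0.8, 0.6]
threshold = pick_threshold(y_val, val_scores, target_sensitivity=0.85)
sens, spec = sensitivity_specificity(y_val, val_scores, threshold)
```

The chosen threshold is then frozen and applied unchanged to the held-out test sets.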

E.2 Data Inclusion and Exclusion
In terms of clinical applications of AI, patient diagnosis has been a popular problem to address (Sheikhalishahi et al., 2021; Lipton et al., 2015; Razavian et al., 2016), as it can directly influence clinical decision-making, resource allocation, and healthcare costs.
Here, the task was to predict which acute condition might be developed by a patient during the course of an ICU stay, as defined through ICD-9 codes. A similar task that included both acute and chronic conditions was previously investigated using the eICU-CRD dataset by grouping 767 ICD-9 codes into 25 overarching diagnoses, and then predicting these using a BiLSTM model (Sheikhalishahi et al., 2021). Using similar inclusion and exclusion criteria, we selected adult patients (age > 18) with a minimum of 15 ICU records, and grouped these records into 1-hour windows. Our clinical team reviewed the list of 25 diagnoses, removed 13 diagnoses considered chronic, non-acute, or poorly defined, and grouped the remaining 12 diagnoses into their relevant system and clinical specialties. This resulted in five labels: acute cardiovascular event, acute respiratory event, acute gastrointestinal event, acute systemic event, and acute renal event. This grouping was selected to reflect clinical reality, where an emergency physician might consult with a system specialist to rule out a severe condition before admission to ICU, and to account for the relatedness of diagnoses within a system. For example, pneumonia is a leading cause of respiratory failure, and combining both diagnoses into a single "acute respiratory event" category reflects the systemic nature of the disease. We removed any samples that did not have a differentiable ICD-9 code, or did not belong to any of the curated groups, resulting in 24,102 samples for training and testing.
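The cohort filter and 1-hour windowing described above can be illustrated with a short sketch. Several details are assumptions not stated in the text: records are represented as hypothetical `(offset_minutes, value)` pairs, and values falling in the same window are averaged (the aggregation rule is not given in the source). The function names are also hypothetical.

```python
from collections import defaultdict

MIN_RECORDS = 15      # minimum number of ICU records per patient (from the text)
WINDOW_MINUTES = 60   # 1-hour windows (from the text)

def eligible(age, records):
    """Adult patients (age > 18) with at least MIN_RECORDS ICU records."""
    return age > 18 and len(records) >= MIN_RECORDS

def window_records(records):
    """Group (offset_minutes, value) pairs into 1-hour windows.

    Returns {window_index: mean value}; averaging within a window is an
    assumption made for this sketch.
    """
    buckets = defaultdict(list)
    for offset_minutes, value in records:
        buckets[offset_minutes // WINDOW_MINUTES].append(value)
    return {w: sum(v) / len(v) for w, v in sorted(buckets.items())}

# Toy heart-rate records for one stay (illustrative values only).
records = [(0, 80.0), (30, 90.0), (65, 100.0)]
windows = window_records(records)
```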

E.3 Preprocessing
Further preprocessing was performed to remove samples with any missing values, one-hot encode categorical features, and standardize all continuous features to have a mean of 0 and a standard deviation of 1.
We used a 75:25 training and test ratio, resulting in 18,076 training and 6,026 test samples, respectively. As before, the training set was used for model development, hyperparameter selection, and training; and after successful development and training, the held-out test set was used to evaluate the performance of the final model. It should be noted that, as this is a multiclass task, standard threshold adjustment cannot be used, and thus, we did not split the data to include an additional validation set.
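As a quick check of the split sizes, the sketch below performs a shuffled 75:25 split over 24,102 sample indices. The function name `train_test_split_indices`, the seed, and the use of a plain random (non-stratified) shuffle are assumptions; the text does not say how samples were assigned to each set. Rounding the training size down reproduces the reported counts.

```python
import random

def train_test_split_indices(n_samples, train_fraction=0.75, seed=0):
    """Shuffle sample indices and split them into train and test index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = int(n_samples * train_fraction)  # rounds 18,076.5 down to 18,076
    return indices[:n_train], indices[n_train:]

train_idx, test_idx = train_test_split_indices(24102)
```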

G.1 DDQN and Dueling DDQN Comparison
The dueling DDQN consistently outperforms the DDQN across all four test sets. This can also be seen in the training curves, where the dueling DDQN appears to learn a better policy.
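For readers less familiar with the two variants being compared, the standard double deep Q-learning target can be sketched as below. This is the textbook form (the online network selects the next action, the target network evaluates it), written as a hypothetical helper `double_dqn_target`; it is not taken from the authors' implementation, and the numbers are illustrative only.

```python
def double_dqn_target(reward, next_q_online, next_q_target, gamma, terminal):
    """Double DQN target: select the action with the online net, evaluate it with the target net."""
    if terminal:
        return reward
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]

# Toy Q-value estimates for the next state (illustrative values only).
y = double_dqn_target(
    reward=1.0,
    next_q_online=[0.2, 0.9, 0.5],   # online network prefers action 1
    next_q_target=[0.7, 0.3, 1.0],   # target network's estimate for action 1 is 0.3
    gamma=0.5,
    terminal=False,
)
```

A plain DQN target would instead use the maximum of `next_q_target` directly, which tends to overestimate action values; decoupling selection from evaluation reduces that bias.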

Figure 1: A typical single-stream Q-network is shown in a). A dueling architecture, with two streams to independently estimate the state-values (scalar) and advantages (vector) for each action, is shown in b) (this implements equation 9).
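The way the two streams of a dueling architecture are recombined into Q-values can be sketched as follows. Equation 9 is not reproduced in this section, so the sketch assumes the commonly used mean-subtracted aggregation, Q(s, a) = V(s) + (A(s, a) - mean over actions of A(s, a')); the helper name `dueling_q_values` and the numbers are hypothetical.

```python
def dueling_q_values(state_value, advantages):
    """Combine a scalar state-value and a vector of advantages into per-action Q-values."""
    mean_advantage = sum(advantages) / len(advantages)
    # Subtracting the mean advantage keeps the value/advantage split identifiable.
    return [state_value + (a - mean_advantage) for a in advantages]

# Toy outputs of the two streams for a 3-action problem (illustrative values only).
q_values = dueling_q_values(state_value=2.0, advantages=[1.0, 3.0, 2.0])
```

With this aggregation, the mean of the resulting Q-values equals the state-value, so the advantage stream only encodes how each action differs from the average.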

F.1 COVID-19 Diagnosis
Table 7: Final Hyperparameter Values Used in Reinforcement Learning, Neural Network, and XGBoost-Based Models in COVID-19 Prediction Task
stopping, which monitored validation performance, optimizing training for a sensitivity of >0.85 and specificity of >0.75. These thresholds were set to ensure that the model would be able to detect positive COVID-19 cases.
University Hospitals Birmingham NHS Foundation Trust (UHB): UHB considered all patients admitted to The Queen Elizabeth Hospital, Birmingham, between December 1, 2019 and October 29, 2020. The Queen Elizabeth Hospital is a large tertiary referral unit within the UHB group which provides healthcare services for a population of 2.2 million across the West Midlands. Confirmatory COVID-19 testing was performed by laboratory SARS-CoV-2 RT-PCR assay.
Category | Features
Vital Signs | Heart rate, respiratory rate, systolic blood pressure, diastolic blood pressure, temperature
Full Blood Count | Haemoglobin, haematocrit, mean cell volume, white cell count, neutrophil count, lymphocyte count, monocyte count, eosinophil count, basophil count, platelets
Liver Function Tests & C-reactive protein | Albumin, alkaline phosphatase, alanine aminotransferase, bilirubin, C-reactive protein
Urea & Electrolytes | Sodium, potassium, creatinine, urea, estimated glomerular filtration rate
E Multiclass Patient Diagnosis Data and Preprocessing
E.1 Ethics statement
The eICU Collaborative Research Database (eICU-CRD) is a publicly available, anonymized database with pre-existing institutional review board (IRB) approval. The database is released under the Health Insurance Portability and Accountability Act (HIPAA) safe harbor provision. The re-identification risk was certified as meeting safe harbor standards by Privacert (Cambridge, MA) (HIPAA Certification no. 1031219-2).

Table 5: Acute event groups and respective prevalences.

Table 6: Clinical predictors considered for predicting patient discharge status and patient diagnosis.

Table 8: Adjusted Threshold Values Used in Reinforcement Learning, Neural Network, and XGBoost Models, for COVID-19 status prediction.

Table 9: Final Hyperparameter Values Used in Reinforcement Learning, Neural Network, and XGBoost-Based Models in ICU Diagnosis Prediction Task