Introduction/Methods

A combination of developments in stroke interventions, biomedical informatics, and computer science in the last 10 years has transformed the way we approach the management of stroke. The publication of five landmark trials in 2015 established mechanical thrombectomy as the preferred treatment for selected patients [1,2,3,4,5]. However, while interventions like thrombolytics and thrombectomy have improved patient outcomes, significant gaps remain in our ability to select patients for specific treatments and predict complications. The proliferation of medical data and advances in computational techniques have created new ways to address these gaps. From 2012 to 2014, the proportion of US hospitals using an electronic health record increased from 44% to 97% [6]. As a result, more digital medical data are being generated at an increasing pace. The opportunity and challenge of leveraging this exponentially increasing amount of information are often described as the problem of big data.

Techniques to manage and analyze big data have also advanced rapidly. In 2015, Google Brain released TensorFlow [7], a machine learning software package that helped usher in the era of widely available deep learning algorithms. These algorithms, which have revolutionized technologies such as self-driving cars and virtual assistants, have also found success in the medical domain with applications ranging from drug discovery to imaging analysis to seizure detection. Big data and the novel computational techniques required to process it have been used in the domain of stroke management to identify additional patients who may benefit from acute intervention, standardize the detection of large vessel occlusions (LVOs), predict the location and extent of hemorrhagic transformation, stratify stroke risk, and make personalized treatment recommendations.

This narrative review highlights recent applications of big data and machine learning in stroke management. A review of the literature was conducted by searching PubMed using ((“big data” OR “machine learning”) AND “stroke”) filtered to results published in the last 10 years. The results were sorted by “Best Match” and reviewed by the primary author for relevancy. Further publications were added to the review based on the authors’ expert opinion.

Big Data

The proliferation of diverse digital medical data such as electronic medical records, digital imaging, genomics, and research registries has outpaced our capacity to analyze the data. While the term big data is often used when discussing the opportunities and challenges associated with the exponential accumulation of data, it has been difficult to settle on a commonly agreed-upon definition. To address this, the National Institute of Standards and Technology (NIST) published a report in 2019 aimed at better defining big data. Broadly, four characteristics of data describe a big data problem: volume, velocity, variety, and variability (Table 1) [8]. The NIST report specifically does not define concrete metrics for each of these variables, nor does it prescribe requirements of magnitude before data can be considered “big.” Rather, they are meant as an aid to think about a class of problems inadequately addressed by traditional analytic methods. Medical data, when considered in this context, is big data [9]. The large numbers of patients and variables pose a volume problem. Methods are needed to integrate a variety of data sources, including charts, imaging, genetics, and administrative data. The amount of data is rapidly accumulating (velocity), and, with the development of new tests, imaging modalities, and multi-omics testing, the nature of the data is rapidly changing as well (variability).

Table 1 Four V characteristics defining big data

Investigations using big data are facilitated by novel computational techniques often referred to as artificial intelligence (AI) or machine learning (ML, Table 2, Glossary). Biomedical studies have traditionally used statistical methods to analyze a handful of variables derived from structured clinical data that was hand-collected to either test a hypothesis or estimate a parameter of a model. For example, an investigation may require an estimation of hematoma volume. While delineating the precise three-dimensional extent of a hemorrhage can be done by humans, it is time-consuming and not feasible to perform by hand beyond several hundred scans. As a result, estimation methods such as ABC/2 (height × width × depth, divided by 2) were developed to produce fast approximations [10]. In contrast, automated techniques based upon machine learning can reliably estimate hematoma volumes and can be scaled up to efficiently process thousands to millions of images, which is helpful for conducting research with large data sets [11].
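
To make the trade-off concrete, the following minimal sketch (synthetic data only, not drawn from the cited studies) compares the ABC/2 approximation against a direct voxel count on a simulated hematoma segmentation.

```python
# Minimal illustration (synthetic data) of the ABC/2 approximation versus a
# direct voxel count on a segmented hematoma.
import numpy as np

# Hypothetical CT volume: 1 mm isotropic voxels containing an ellipsoidal "hematoma".
shape = (64, 64, 64)
zz, yy, xx = np.indices(shape)
center = np.array([32, 32, 32])
semi_axes = np.array([10.0, 15.0, 20.0])   # mm, along z, y, x
mask = (((zz - center[0]) / semi_axes[0]) ** 2 +
        ((yy - center[1]) / semi_axes[1]) ** 2 +
        ((xx - center[2]) / semi_axes[2]) ** 2) <= 1.0

voxel_volume_ml = 1.0 / 1000.0             # 1 mm^3 = 0.001 mL

# "Ground truth": count every segmented voxel.
true_volume_ml = mask.sum() * voxel_volume_ml

# ABC/2: product of the three orthogonal diameters, divided by 2.
coords = np.where(mask)                                   # voxel indices of the hematoma
extents_mm = [c.max() - c.min() + 1 for c in coords]      # largest diameter along each axis
abc2_volume_ml = np.prod(extents_mm) / 2.0 * voxel_volume_ml

print(f"Voxel count volume: {true_volume_ml:.1f} mL")
print(f"ABC/2 estimate:     {abc2_volume_ml:.1f} mL")
```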

Table 2 Glossary of commonly used terms

Common AI and ML algorithms are described in Table 3, and further reading can be found in [12]. These algorithms are used not only for parameter estimation but also for tasks like classification and clustering, and they can be incorporated into clinical decision tools. The goal of classification is to predict the value of an output variable (sometimes referred to as a label), such as clinical outcome or development of hemorrhagic transformation. The performance of machine learning methods is often evaluated based on their ability to accurately predict on a validation dataset, as opposed to traditional statistical methods, which typically focus on the estimation of the parameters of a model using all the available data.
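
The following sketch, using synthetic data as a stand-in for tabular clinical variables, illustrates this evaluation convention: the model is fit on a training split and judged by its predictive performance on a held-out validation split.

```python
# Hedged sketch (synthetic data) of the classification workflow described above.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Stand-in for tabular clinical features and a binary label
# (e.g., hemorrhagic transformation: yes/no).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8,
                           random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3,
                                                  stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
print(f"Training AUC:   {train_auc:.2f}")
print(f"Validation AUC: {val_auc:.2f}")   # the figure usually reported in ML studies
```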

Table 3 Common artificial intelligence and machine learning techniques

Two categories of ML methods are notable for their typically superior performance in complex classification tasks and as such are commonly found in recent studies using machine learning. The first category is ensemble methods, a family of algorithms that function by aggregating the predictions of a collection of simple classifiers to produce a more accurate prediction. Examples of ensemble methods include random forest and gradient boosting. The second category of methods is collectively termed deep learning. Deep learning uses a multi-layered artificial neural network (ANN) to capture complex relationships in the training data to make predictions. While the concept of an artificial neural network dates back to the first mathematical models of the neuron in 1943 [13], the ability to capture complex relationships in the data using a multi-layered model was only recently made possible by the development of widely available and sufficiently powerful graphics processing units (GPUs) and parallel processing algorithms to perform the enormous number of calculations necessary to train the model. While deep learning models can demonstrate outstanding predictive performance, the sheer complexity of the models renders them nearly uninterpretable to humans, and, as such, a deep learning algorithm effectively becomes a black box. Making deep learning models human interpretable is an ongoing area of research [14].
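
As an illustration of the first category, the sketch below (synthetic data) compares a single decision tree with a random forest and a gradient boosting ensemble; the scikit-learn estimators and the dataset are illustrative choices, not those of any cited study.

```python
# Illustrative comparison (synthetic data) of a single simple classifier with the
# two ensemble families named above: bagged trees (random forest) and boosting.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=2000, n_features=30, n_informative=10,
                           random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=1)

for name, clf in [("single decision tree", DecisionTreeClassifier(random_state=1)),
                  ("random forest", RandomForestClassifier(n_estimators=300, random_state=1)),
                  ("gradient boosting", GradientBoostingClassifier(random_state=1))]:
    clf.fit(X_tr, y_tr)
    auc = roc_auc_score(y_va, clf.predict_proba(X_va)[:, 1])
    print(f"{name:22s} validation AUC = {auc:.2f}")
```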

New challenges arise when addressing big data problems using AI and ML. Unlike traditional datasets, big data cannot be feasibly curated by hand, and mis-formatted or missing data are common. This must be addressed automatically, and the choice of method for handling missing data may bias the results [15]. As the number of variables in a dataset increases, the space of possible models increases exponentially as well, and it becomes easy to overfit a model to training data, resulting in a model that does not generalize and performs poorly when attempting to make predictions on validation data. Methods for selecting models that are neither too complex nor too simple, or for reducing the number of variables, help address this problem.
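
A minimal example of both issues, again on synthetic data, is shown below: missing values are imputed automatically within a pipeline (a choice that can itself bias results), and cross-validation selects the degree of regularization so the model is neither too complex nor too simple.

```python
# Minimal sketch (synthetic data) of automatic missing-data handling and
# complexity selection by cross-validation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=500, n_features=40, n_informative=5,
                           random_state=0)
# Introduce missing values at random, as commonly seen in uncurated EHR extracts.
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.10] = np.nan

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # choice of strategy can bias results
    ("model", LogisticRegression(penalty="l1", solver="liblinear")),
])
# Cross-validation chooses how strongly to regularize (stronger penalty -> simpler model).
search = GridSearchCV(pipe, {"model__C": [0.01, 0.1, 1.0, 10.0]}, cv=5, scoring="roc_auc")
search.fit(X, y)
print("Best C:", search.best_params_["model__C"], "CV AUC:", round(search.best_score_, 2))
```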

Although electronic health records collectively contain data on hundreds of millions of patients, the amount of data that is practically available for a given study in a specific disease is likely more limited. A combination of factors such as disease prevalence, patient consent to research, and data sharing agreements limits the amount of data available. In practice, studies in a particular medical domain may range from tens to hundreds of patients on the low end (observational or prospective studies) to hundreds of thousands of patients in large national cohorts, registries, or biobanks. The availability of amalgamated registries (e.g., Epic Cosmos) raises the possibility of research using data from hundreds of millions of patients [16]. In stroke, efforts to increase the availability of imaging data for research are underway as part of the NIH Stroke Trials Network (StrokeNet) [17].

Human Bias in Machine Learning

While ML is useful for finding otherwise unknown associations between variables, it does not inherently understand the plausibility or context of the learned associations. Though it is possible to infer causal relationships from observational data alone using machine learning, the field of causal ML is still growing, and the majority of machine learning algorithms do not infer causality [18, 19]. As such, a researcher must be thoughtful about the selection of input variables and about making inferences about causality from learned models. A model trained using clinical data that includes ethnicity, socioeconomic status, or hospital location might capture the effect of health disparities rather than the biological associations the researcher intended to study. For example, in a machine learning model using clinical data to predict outcome in a Chinese stroke cohort, significant clinical factors included the specific hospital at which the patient was treated [20]. One explanation is that different hospitals delivered different quality stroke care, which translated to different outcomes. Another explanation could be that sicker patients went to specific hospitals. The learned model is thus confounded by the inclusion of the presenting hospital in the input.

There exists a harmful misperception that decisions made with algorithmic assistance are not susceptible to human biases. Because these algorithms rely on models learned from data collected and curated in a biased society, they are manifestly as susceptible to systemic societal bias and inequity as human-made decisions. For example, studies using data from personal fitness trackers are likely to be biased toward people who are wealthy enough to afford them. Underserved communities are likely to be underrepresented in datasets, which are typically collected from academic research institutions. The topics of fairness, social justice, and bias in machine learning are increasingly researched, and methods to measure and correct for bias have been introduced [21]. In order to do so, terms such as fairness, bias, and protected attribute must first be explicitly defined in a computational context. Then, a fairness metric must be designed to quantify the degree of fairness so that a machine learning algorithm can be designed to optimize for it. Unfortunately, there are many possible ways to define fairness mathematically, some of which are mutually exclusive [22]. As such, it falls on the researcher to choose the most appropriate definition of fairness for the application at hand. Once a metric of fairness has been defined, a machine learning algorithm can then search for a model that maximizes fairness. Training data must contain sufficient and unbiased representation of protected groups to allow for accurate training. Expressly including measures of social determinants of inequity may improve the fairness of AI models [23].
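
As a small illustration of why fairness must first be made computable, the sketch below (synthetic predictions and a hypothetical binary protected attribute) calculates two common but potentially conflicting fairness metrics, demographic parity difference and equal opportunity difference.

```python
# Hedged sketch of how a fairness notion must be made computable before it can be
# optimized. The group variable, outcomes, and predictions here are synthetic.
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
group = rng.integers(0, 2, n)            # protected attribute (two groups, coded 0/1)
y_true = rng.integers(0, 2, n)           # observed outcome
# A hypothetical model whose positive-prediction rate differs by group.
y_pred = (rng.random(n) < np.where(group == 1, 0.45, 0.30)).astype(int)

def demographic_parity_difference(y_pred, group):
    """Difference in positive prediction rates between groups."""
    return abs(y_pred[group == 1].mean() - y_pred[group == 0].mean())

def equal_opportunity_difference(y_true, y_pred, group):
    """Difference in true-positive rates between groups."""
    tpr = lambda g: y_pred[(group == g) & (y_true == 1)].mean()
    return abs(tpr(1) - tpr(0))

print("Demographic parity difference:", round(demographic_parity_difference(y_pred, group), 3))
print("Equal opportunity difference: ", round(equal_opportunity_difference(y_true, y_pred, group), 3))
```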

Trust in novel medical interventions stems from high-quality clinical trials that balance measured and unmeasured confounders; in AI/ML, unmeasured confounders can yield apparently more accurate models that more fully incorporate the human bias already contained in the data. The CONSORT guidelines provide evidence-based recommendations for transparency and completeness of reporting of randomized clinical trials [24]. As AI and ML methods are increasingly applied to medical problems in the age of big data, an extension of trial guidelines is needed to address the unique challenges of interpreting these studies. Because AI algorithms often construct complex predictive models using large numbers of variables and complicated processing pipelines, it can be difficult for a human reviewer to judge the models for bias. CONSORT-AI is an extension of the CONSORT guidelines that provides reporting guidance for trials incorporating AI and machine learning to address these concerns [25]. In particular, the CONSORT-AI extension stipulates that authors should make clear how an AI algorithm is integrated into the trial setting, how input data are acquired, how poor-quality or missing data are handled, how much human input is involved in handling the data, how the algorithm output is used in the trial, an analysis of errors, and whether code is accessible. SPIRIT-AI is a complementary extension of guidelines for the reporting of clinical trial protocols [26]. STARD-AI, an effort to develop consensus guidelines on the reporting of AI diagnostic accuracy studies, is currently underway [27]. In a review of 41 randomized controlled trials using AI or machine learning for medical decisions, no trials were found to have met all of the CONSORT-AI guidelines for reporting, and most trials failed to discuss how they handled poor-quality or missing data and failed to assess performance errors [28]. This review provides a broad overview of the different areas in which artificial intelligence and machine learning have been applied to stroke research but will not explicitly evaluate each study on the basis of CONSORT-AI, as most studies referenced are not clinical trials.

Acute Treatment of Stroke

The widespread adoption of thrombolytics and thrombectomy for the acute treatment of stroke poses new big data challenges. Of particular importance is the timely selection of eligible patients, which is critical to the success of these treatments. Trials have demonstrated the effectiveness and safety of alteplase within 4.5 h of stroke onset [29,30,31], while an optimal window for tenecteplase remains unclear [32,33,34,35,36]. Although early trials failed to find a benefit from thrombectomy, subsequent landmark trials demonstrated clear improvement in neurologic outcome in patients treated with thrombectomy up to 24 h after symptom onset [1,2,3,4,5, 37,38,39,40,41]. Robust, standardized patient selection using automated image processing algorithms such as RAPID [42] played a significant role in the success of subsequent trials.

The determination of time of stroke onset can be unreliable in a significant proportion of patients who present to the ED with an acute stroke, due either to stroke onset during sleep or to unwitnessed strokes in which the patient is not cognitively intact or able to communicate the time of stroke onset. The proportion of patients who present with a wake-up stroke ranges from 14 to 24%, with a smaller fraction of additional patients who present with non-wake-up strokes with unknown time of onset [43, 44]. If the time of stroke onset could be estimated for these patients from other data, some of them might benefit from acute stroke interventions.

To address this problem, MRI diffusion weighted imaging (DWI) and fluid attenuated inversion recovery (FLAIR) mismatch has been proposed as an imaging biomarker for predicting time of stroke onset within 4.5 h and has been studied in the context of the WAKE-UP trial [45]. While neuroradiologist protocols have been developed to classify the presence or absence of DWI-FLAIR mismatch, human readers were found to be around 60% sensitive and 80% specific in predicting stroke onset < 4.5 h [46]. ML methods are well suited to this problem and could potentially improve upon human predictive performance. Indeed, several studies, using region of interest (ROI)-based or deep-learning-extracted imaging features, have classified time of stroke onset as less than or greater than 4.5 h with better sensitivity and specificity than manual DWI-FLAIR mismatch protocols [47,48,49]. Another study used quantitative imaging features (radiomics) to measure the degree of DWI-FLAIR mismatch beyond the typical human-adjudicated categories of absent, subtle, or obviously present [50]. These studies are limited in their sample sizes, which are on the order of hundreds of stroke MRIs, but demonstrate how machine learning methods can leverage the full depth of imaging data to improve on manual reading.

The risk–benefit discussion with patients for acute interventions is informed by our assessment of their current deficits as well as an estimation of what their stroke might look like if it were to complete (i.e., all ischemic tissue became infarcted). The current clinical standard for predicting final stroke volume uses standardized thresholds on diffusion and perfusion maps; these thresholds, however, are susceptible to artifacts, have not been validated in a large cohort, and do not capture individual variability in physiology [51]. ML methods, particularly deep learning methods such as convolutional neural networks (CNN, Table 3), can better predict final stroke volume compared to the current clinical standard [52,53,54,55,56]. While these methods perform well with large infarcts, they are less accurate with smaller infarcts (e.g., < 20 mL infarct volume). The incorporation of anatomical information about each voxel and the probability of infarct at that location, in addition to DWI and PWI data, can increase the performance of predictive models [57]. Deep learning methods have also been able to predict tissue at risk using arterial spin labeling MRI, circumventing the need for intravenous contrast for perfusion imaging [58]. Most studies on automated segmentation and prediction of infarct volume have thus far been limited by the small amount of training data, with only tens to hundreds of samples available. There is evidence that training on larger datasets from repositories produces models that perform better than models trained on smaller, single-center datasets [59].
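
For orientation, the following schematic (random stand-in data; the architecture, channel names, and sizes are illustrative assumptions rather than any published model) shows the general form of a convolutional network that maps co-registered multi-channel input slices to a per-voxel probability of final infarction.

```python
# Schematic only: a small fully convolutional network of the general kind used in
# the cited studies, not a reproduction of any published model.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

def build_infarct_cnn(input_shape=(64, 64, 4)):
    """4 input channels could stand for DWI, ADC, Tmax, and CBF slices (assumed names)."""
    inputs = tf.keras.Input(shape=input_shape)
    x = layers.Conv2D(16, 3, padding="same", activation="relu")(inputs)
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(x)
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(x)   # per-voxel infarct probability
    return tf.keras.Model(inputs, outputs)

model = build_infarct_cnn()
model.compile(optimizer="adam", loss="binary_crossentropy")

# Random stand-in data: 8 image slices and matching follow-up infarct masks.
x = np.random.rand(8, 64, 64, 4).astype("float32")
y = (np.random.rand(8, 64, 64, 1) > 0.9).astype("float32")
model.fit(x, y, epochs=1, batch_size=4, verbose=0)
print(model.predict(x[:1], verbose=0).shape)   # (1, 64, 64, 1): a predicted infarct probability map
```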

Intracranial vessel and perfusion imaging are critical to decision-making for thrombectomy, and big data has made possible the development of artificial intelligence imaging interpretation and decision aids to streamline the process of making timely decisions for intervention, particularly in resource-limited settings. Software suites, such as RAPID (iSchemaView), e-Stroke Suite (Brainomix), and VIZ.ai, provide interpretations of perfusion and vessel imaging [60]. Software such as RAPID automatically calculates diffusion and perfusion parameters, performs rudimentary segmentation to produce useful clinical maps such as diffusion-perfusion (core-penumbra) mismatch, calculates ASPECTS, and predicts large vessel occlusion [42, 61]. The development of this software was made possible using datasets from thrombectomy trials, and the software was subsequently used in the decision-making process in trials such as EXTEND IA, DEFUSE 3, and DAWN, allowing for standardization of the information available across multiple medical centers [1, 40, 41]. These software suites have since been introduced into clinical practice: RAPID has been deployed to more than 1800 hospitals worldwide, and Brainomix has won a tender to be deployed to the national healthcare system in Hungary [62, 63].

Recent additions of automated ASPECTS calculation in these software systems typically use random forests or CNNs to make their predictions, and several are available commercially, including RAPID ASPECTS and Brainomix e-ASPECTS [64]. Similar work has been done using CNNs for LVO detection [65] and has been commercialized in VIZ.ai LVO/CTP, though RAPID uses a non-machine learning–based algorithm for this purpose. VIZ.ai LVO detection is only trained to detect occlusions at the carotid terminus, M1, and M2 locations, and, in real-world studies, performs better at detecting carotid terminus (100% sensitivity) and M1 (93% sensitivity) occlusions compared to M2 (50% sensitivity for proximal, 28% sensitivity for distal) [66]. Specificities were reported in the 90% range, and, given that most imaging studies in the dataset did not contain an LVO, positive predictive values were only in the 30–40% range. A similar study looking at RAPID’s detection of intracranial LVOs demonstrated sensitivity of 95–96%, with specificity of 74–79% without and with the inclusion of M2 occlusions, respectively [67]. A CNN-based method was also developed for automatically detecting LVOs on digital subtraction angiography instead of CT scans for the purposes of standardization in thrombectomy studies; however, on average the predicted locations differed from ground truth by about 1.2 cm for carotid terminus occlusions and 1.9 cm for distal occlusions [68].
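
The low positive predictive values despite high sensitivity and specificity follow directly from the low prevalence of LVO among scanned patients, as the back-of-the-envelope calculation below illustrates (the prevalence figures are assumptions for illustration, not values from the cited studies).

```python
# Why ~90% specificity and high sensitivity can still yield a PPV in the 30-40%
# range when only a small fraction of scanned patients actually have an LVO.
def ppv(sensitivity, specificity, prevalence):
    tp = sensitivity * prevalence              # true positives per scanned patient
    fp = (1 - specificity) * (1 - prevalence)  # false positives per scanned patient
    return tp / (tp + fp)

for prev in (0.05, 0.10, 0.20):                # assumed LVO prevalence among scans
    print(f"prevalence {prev:.0%}: PPV = {ppv(0.93, 0.90, prev):.0%}")
```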

In practice, algorithmically detected LVOs still require expert validation but have a role in the triage of strokes in limited-resource settings where expert radiologist or neurologist review may not be quickly available. In this sense, these algorithms help address the velocity of data arriving in the modern stroke code. Work has been done to expand automated algorithms to detect other neurological problems, such as intracranial hemorrhage, fracture, or mass effect, from imaging data [69, 70]. These algorithms can be helpful for screening in settings with limited access to expert interpretation.

Management of Complications

Early neurologic complications of acute stroke include cerebral edema, hemorrhagic transformation, and early post-stroke seizures. The ability to accurately predict complications before they happen or early in their course allows for the implementation of early interventions to minimize the amount of irreversible neurologic injury. While a clinician may be able to qualitatively estimate the risk of developing malignant edema or hemorrhagic transformation based on clinical characteristics and approximation tools such as the ASPECTS score, there is no standardized set of biomarkers against which predictive performance can be evaluated, and studies of single biomarkers can produce conflicting results [71]. Challenges include the limitations of approximation tools (e.g., the ASPECTS score measures only the MCA territory) and the inability of any single biomarker to capture the heterogeneity of the problem. Machine learning can provide a standardized method to integrate multiple and more sophisticated biomarkers to estimate the risk of developing such complications [72], which can inform decisions on monitoring, reperfusion therapy, and goals of care.

Artificial intelligence is helpful for recognizing complications of stroke interventions. ANNs (Table 3) were used to predict the presence or absence of hemorrhagic transformation at 48 h from clinical and demographic variables, achieving an AUC of 0.84 [73]. While the study authors did not compare performance against non-ML-based methods, previous work on other datasets using clinical biomarkers as risk scores achieved AUCs ranging from 0.50 to 0.86 [74,75,76]. As such, ML algorithms likely achieve performance comparable to clinical risk scores. The study also confirmed that important predictors of hemorrhagic transformation include stroke severity as represented by the NIH Stroke Scale (NIHSS), cardioembolism as the stroke etiology, blood glucose, and systolic blood pressure. Long short-term memory (LSTM) neural networks, a deep learning algorithm that incorporates time series data, were used on the temporal information stored in the perfusion signal on MRIs from before reperfusion therapy to predict the extent and location of hemorrhagic transformation at 24 h after reperfusion therapy [77, 78]. Compared to traditional ML methods, LSTMs demonstrated superior performance on classification of hemorrhagic transformation, achieving an AUC on a voxel-by-voxel basis of 0.89. From a clinical perspective, this algorithm is able to predict not only whether hemorrhagic transformation will happen but where in the brain it will happen. Knowing the likely location and extent of hemorrhagic transformation would allow a physician to stratify the clinical significance of a potential hemorrhage and, thus, would inform decisions on reperfusion therapy. If hemorrhagic transformation does occur, current guidelines generally recommend management similar to that of spontaneous ICH [76]. Given the different comorbidities and etiologies of hemorrhagic transformation, however, more work is needed to identify areas in which management might differ. While hematoma expansion has been identified as a modifiable factor that can improve outcomes in spontaneous ICH [79], the same has not been established in hemorrhagic transformation. Identifying modifiable risk factors for worse outcomes in hemorrhagic transformation could inform specific management practices in that setting.
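
The sketch below outlines the general LSTM idea with synthetic stand-in data: each voxel contributes a perfusion time series, and the network outputs the probability that the voxel will show hemorrhagic transformation. Shapes, sizes, and data are illustrative assumptions rather than the published pipeline.

```python
# Hedged sketch of voxel-wise prediction from a dynamic perfusion signal using an LSTM.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

n_voxels, timepoints = 2048, 60            # e.g., 60 dynamic perfusion frames per voxel
voxel_series = np.random.rand(n_voxels, timepoints, 1).astype("float32")
transformed = (np.random.rand(n_voxels, 1) > 0.9).astype("float32")   # 1 = transforms at 24 h

inputs = tf.keras.Input(shape=(timepoints, 1))
h = layers.LSTM(32)(inputs)                # summarizes the temporal perfusion pattern
h = layers.Dense(16, activation="relu")(h)
outputs = layers.Dense(1, activation="sigmoid")(h)
model = tf.keras.Model(inputs, outputs)

model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])
model.fit(voxel_series, transformed, epochs=1, batch_size=256, verbose=0)
print(model.predict(voxel_series[:5], verbose=0).ravel())   # per-voxel probabilities
```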

A similar rationale applies to predicting malignant cerebral edema (edema severe enough to cause mass effect and neurologic injury) as to predicting hemorrhagic transformation. The Monro-Kellie doctrine provides a mathematical basis for the estimation of intracranial pressure and suggests that an estimation of intracranial CSF volume, or reserve, may predict the development of malignant edema [80]. In a study of hemispheric stroke patients, automated image processing was used to extract features representing intracranial reserve from baseline and 24-h CT scans, and these features were used to train a logistic regression model to predict the development of malignant cerebral edema (defined as either needing decompressive hemicraniectomy or death related to at least 5 mm of midline shift of the brain) with better accuracy than clinical variables alone [81]. In the future, such algorithms could predict deterioration and anticipate the need for additional surveillance or targeted treatments. By catching deterioration early, or before it happens, clinicians can intervene early to limit the amount of neurologic injury that would otherwise be caused.

Seizures can complicate ischemic strokes as well as hemorrhagic strokes. Lobar location and ICH (as opposed to ischemic infarct) are both associated with increased seizure risk after stroke [82]. Risk scores for prediction of late seizures include the SeLECT score in ischemic stroke (AUC 0.76) [83] and the CAVE score in ICH (AUC 0.69) [84]. Early seizures after ICH are associated with worse quality of life; unfortunately, so is the use of prophylactic anti-seizure medication in unselected patients [85, 86]. As such, being able to predict early seizures may be helpful in selecting appropriate patients for closer monitoring or selective prophylactic management. In spontaneous ICH, gradient boosting has been shown to have improved performance at predicting early seizures compared to a subset of the CAVE score, achieving an AUC of 0.79 compared to 0.72 [87]. A meta-analysis of studies on seizures in ischemic stroke found that risk factors for early seizures included cortical involvement, severe stroke, hemorrhagic transformation, age < 65, large lesion, and presence of atrial fibrillation [88], though only one study evaluated the predictive accuracy of a risk score for early seizures [89]. That study compared several risk scores using discretized clinical variables and achieved an AUC of 0.73 using a subset of 5 variables. Instead of manually choosing variables to consider, however, an ML approach such as decision trees could automatically find the most discriminative variables. ML to predict seizures could make antiseizure medication treatment more precisely targeted while sparing potential adverse effects in patients less likely to have a seizure in the future.
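
The sketch below (synthetic data and hypothetical variable names) illustrates how a tree ensemble can rank candidate predictors of early seizures automatically instead of relying on hand-chosen, discretized variables.

```python
# Illustrative sketch: gradient boosting ranks candidate predictors by importance.
# Variable names are hypothetical and the label is synthetic, loosely tied to a
# few of the variables for demonstration only.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(3)
n = 1500
X = pd.DataFrame({
    "cortical_involvement": rng.integers(0, 2, n),
    "nihss_admission": rng.integers(0, 30, n),
    "hemorrhagic_transformation": rng.integers(0, 2, n),
    "age": rng.normal(70, 12, n),
    "lesion_volume_ml": rng.exponential(30, n),
    "atrial_fibrillation": rng.integers(0, 2, n),
})
# Synthetic label generated from a few of the variables above.
logit = (1.2 * X["cortical_involvement"] + 0.05 * X["nihss_admission"]
         + 0.8 * X["hemorrhagic_transformation"] - 4)
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

clf = GradientBoostingClassifier(random_state=0).fit(X, y)
for name, imp in sorted(zip(X.columns, clf.feature_importances_), key=lambda t: -t[1]):
    print(f"{name:28s} importance = {imp:.2f}")
```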

While non-invasive scalp EEG can reliably detect sufficiently large seizures affecting the cortex, hippocampal seizures are often more difficult to detect without invasive electrodes. This is relevant to ischemic strokes because the hippocampus is particularly vulnerable to ischemic insults. The ability to detect deep hippocampal seizures from non-invasive EEG would be safer and better tolerated by patients. An ensemble CNN-based algorithm was able to detect hippocampal epileptiform activity from scalp EEG alone, achieving an AUC of 0.89 at detecting individual hippocampal epileptiform events recorded from invasive electrodes [90]. The algorithm was able to distinguish temporal lobe epilepsy from healthy controls with AUCs of 0.88 and 0.95 in two separate validation data sets. ML can help identify subtle patterns on the scalp EEG, not detectable by humans, that are predictive of deeper hippocampal seizures, avoiding the need for invasive monitoring.

Stroke Outcomes and Prognosis

ML has been used to predict length of stay, functional outcome, and risk for readmission, which can be helpful in discharge planning and care coordination. Imaging, text analysis, and structured clinical data have all been used to predict outcome [91]. To simplify the classification task, outcome is often defined as a binary variable, where a favorable outcome is a modified Rankin Scale (mRS) score ≤ 2 and an unfavorable outcome is an mRS score > 2. The ASTRAL score is an integer-based scoring system derived from logistic regression on clinical variables present on admission and predicts the probability of unfavorable outcome at 3 months with an AUC of 0.90 and maximum accuracy around 0.8 in a pooled validation cohort [92]. It has also been used to predict 5-year dependence and mortality with similar performance [93]. A support vector machine (SVM, Table 3) model trained using anatomical information on the extent of infarcts in conjunction with patient age and NIHSS on admission predicted favorable outcome with an accuracy of 0.85 [94]. Neural networks had superior performance compared to the ASTRAL score at predicting favorable outcome at 3 months [95]. Instead of clinical data, one study used natural language processing (NLP) on MRI radiology reports to predict outcome, achieving an AUC of 0.78 with random forest and 0.80 with a CNN [96]. In a study predicting outcome at 90 days using combined clinical, multimodal imaging, and angiographic data, a gradient boosting algorithm found that NIHSS at 24 h, premorbid mRS, and final infarct volume were the most important predictors of long-term outcome, and a combined multimodal model achieved an AUC of 0.85 [97]. Overall, machine learning methods perform as well as or better than the ASTRAL score at predicting 3-month functional outcome. In terms of mortality, ensemble machine learning methods such as random forest and gradient boosting have demonstrated increased predictive accuracy compared to logistic regression in the prediction of mortality after rehabilitation, increasing AUC from 0.74 to 0.92 [98]. Machine learning methods perform better than simple integer-based scores at predicting outcome and can help in planning for the recovery process.
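
The following brief sketch (synthetic data) shows the outcome definition used in most of these studies: the ordinal mRS is binarized at ≤ 2 before a model's predictions are scored by AUC.

```python
# Small sketch of the favorable/unfavorable outcome definition and AUC scoring.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(7)
mrs_90day = rng.integers(0, 7, 500)                  # ordinal mRS, 0-6
favorable = (mrs_90day <= 2).astype(int)             # binary label used in most studies

# Stand-in for a model's predicted probability of a favorable outcome.
predicted_prob = np.clip(0.8 - 0.1 * mrs_90day + rng.normal(0, 0.15, 500), 0, 1)
print("AUC for favorable outcome:", round(roc_auc_score(favorable, predicted_prob), 2))
```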

While other outcome measures such as the Barthel Index (BI) and NIHSS exist, most stroke trials have used mRS as the primary outcome as it appears to correlate most closely with patient-reported quality-of-life metrics, such as the Stroke Impact Scale (SIS) [99]. While mRS correlates well with quality of life on a population level, it cannot account for personal values, which may dramatically impact health-related quality of life (HRQoL) on an individual level. A variety of methods for estimating multi-domain HRQoL are available, including the NIH Patient Reported Outcomes Measurement Information System (PROMIS), Neuro-QOL (a set of measures similar to PROMIS that were validated for proxy report and in patients with neurological diseases), and EuroQOL. A substantial amount of multi-domain HRQoL data is available for patients with stroke [100,101,102]. However, machine learning methods generally perform better with simple classification tasks compared to prediction of multi-domain scores [103]. Attempts to use ML to study HRQoL often rely on simplifying HRQoL scales such as the SIS into a composite score and further binarizing the composite score into good response or poor response [104]. Other strategies include limiting the number of domains being investigated and focusing on unsupervised instead of supervised learning (Table 2, Glossary). For example, clustering algorithms have been used to identify distinct phenotypes of 4-domain HRQoL responses after subarachnoid hemorrhage [105]. Overall, computational models remain a poor substitute for compassionate care when discussing detailed prognosis with patients or family.
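
As a small illustration of the unsupervised strategy, the sketch below clusters synthetic four-domain HRQoL scores into a few response phenotypes; the domain names and data are placeholders, not the measures used in the cited study.

```python
# Hedged sketch: cluster multi-domain HRQoL scores into response phenotypes
# rather than predicting a single composite score.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(11)
scores = pd.DataFrame(
    rng.normal(50, 10, size=(300, 4)),
    columns=["physical", "cognitive", "emotional", "social"],   # hypothetical T-score-like scales
)
X = StandardScaler().fit_transform(scores)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(scores.groupby(labels).mean().round(1))    # mean domain profile of each phenotype
```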

From a systems improvement and resource utilization perspective, outcome metrics like length of hospital stay and 30-day readmission rates are also important. Unfortunately, attempts to predict length of stay [106] and 30-day readmission rates [107] have not been as successful. These outcome measures likely depend on other factors that are not well captured in clinical data, such as hospital administrative policies, the availability and quality of disposition facilities, and the support systems in place for a patient after hospital discharge. Further work will likely need to better characterize these social factors and inequities in order to provide more accurate predictions. Machine learning may also be less useful for prediction tasks that lack biologically plausible predictors, depend on data that are neither documented in nor inferable from the electronic health record, and are more closely associated with social determinants (e.g., resources for continued medical care).

Stroke Prevention

Management decisions on secondary prevention of stroke may be made on up to 690,000 patients with acute ischemic stroke and 240,000 patients with transient ischemic attack each year in the USA [108], while primary prevention applies to the entire population. As such, even small increases in the performance of risk prediction using big data and machine learning have the potential to benefit a large number of people. Although the use of risk scores in patient-provider communication does not appear to change patient beliefs or behavior for stroke prevention [109], individualized predictions of stroke risk can, nevertheless, be helpful for the provider in recommending initiation of preventative therapy. Decisions on the use of anti-platelet therapy for primary prevention of stroke often depend on the calculation of risk scores such as the ASCVD score or the Framingham stroke risk profile, which are derived using Cox regression [108, 110]. A study of more than 500,000 Chinese patients used an ensemble method to combine Cox regression predictions with gradient boosting predictions, increasing the positive predictive value (PPV) for future stroke by 1% compared to Cox regression alone [111]. A similar study of 57,000 hypertensive patients in China found that gradient boosting predicted subsequent stroke within 3 years with a better AUC than the Framingham stroke risk profile [112]. While a 1% increase in PPV may seem small, when applied to an eligible population in the hundreds of millions, it could mean that an additional million people are appropriately screened for preventative therapy.
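
The population-scale arithmetic behind this argument is straightforward, as the rough calculation below shows; the population and baseline PPV figures are round illustrative assumptions, not values from the cited studies.

```python
# Rough arithmetic: a 1 percentage point gain in PPV, applied to tens of millions
# of people flagged as high risk, changes the absolute numbers substantially.
screened_high_risk = 50_000_000        # hypothetical number flagged for preventive therapy
ppv_baseline, ppv_improved = 0.10, 0.11  # assumed baseline and improved PPV

additional_true_positives = screened_high_risk * (ppv_improved - ppv_baseline)
print(f"Additional correctly identified people: {additional_true_positives:,.0f}")
```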

Automated methods for data extraction can be helpful due to the large number of patients in population risk factor studies. An automated NLP algorithm was found to be superior to manual coders in the detection of stroke comorbidities in data from the Sentinel Stroke National Audit Programme in the UK [113]. Another study used ML on administrative data and echocardiogram reports to identify likely cardioembolic strokes for the purposes of ensuring appropriate follow-up [114]. When using automated methods, however, it is important to understand the source of input data. Depending on the country, sources like billing or administrative data may misrepresent the prevalence of risk factors compared to clinical notes or structured data, as coders may over-code to maximize reimbursements [115]. This, in turn, may result in biased predictive models.
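
The sketch below (toy sentences and hypothetical labels) shows the general form of such an NLP pipeline: free-text snippets are vectorized and a classifier flags documentation of a comorbidity such as atrial fibrillation. It is not the algorithm used in the cited studies.

```python
# Minimal illustration of text-based comorbidity detection with a bag-of-words model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "history of atrial fibrillation on apixaban",
    "irregularly irregular rhythm consistent with afib",
    "no history of arrhythmia",
    "sinus rhythm on telemetry throughout admission",
]
train_labels = [1, 1, 0, 0]   # 1 = atrial fibrillation documented (toy labels)

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(train_texts, train_labels)

# Probability of [no afib, afib] for a new sentence.
print(clf.predict_proba(["paroxysmal atrial fibrillation noted on prior ECG"]))
```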

The nature of genetics studies necessitates a big data approach; because of the large amount of data contained in an individual human genome, a large study population is needed to identify relevant genetic markers. An in-depth review of computational techniques for genomic analysis is outside the scope of this paper. The role of genetics in stroke risk remains an active area of research. Genetics research has identified at least 35 genetic loci that are associated with increased stroke risk as well as a number of inherited stroke syndromes, such as CADASIL, CARASIL, and PADMAL [116]. The MEGASTROKE study, a genome-wide association study (GWAS) of more than half a million patients, discovered 22 new loci associated with stroke risk, adding to the 10 previously known [117]. While work remains to validate these discovered associations, their discovery advances our progress toward stratifying individual genetic risk and using that information to guide surveillance or preventative therapy. A major weakness of many GWAS studies is the biased representation of ethnicities, which can limit generalizability [118]. For example, of the half a million patients included in the MEGASTROKE study, the majority were European, and fewer than 2000 were Latin American. This can bias the discovered associations toward polymorphisms disproportionately affecting Europeans and miss important polymorphisms affecting Latin Americans.

Future Directions

The current time-from-onset thresholds in guidelines for the use of thrombolytics are, unfortunately, a relic of the design of randomized controlled trials. Discontinuity analysis suggests there is little difference in outcomes shortly before and shortly after the time thresholds of 3 h or 4.5 h [119]. While current work focuses on predicting stroke onset within the first few hours after symptom onset, future work could eliminate the need for strict time-based exclusion criteria altogether. Instead, image processing techniques could shift the decision-making paradigm from a time-based approximation of the likelihood to benefit to a tissue-based one, which would better account for individual variability in the rates of stroke progression.

Algorithms for the prediction of LVOs and segmentation of stroke volumes are increasingly available in clinical practice and can be helpful for decision support in limited-resource settings. Further gains in predictive accuracy are needed to increase their clinical utility. As these algorithms become increasingly deployed in the clinical setting, proactive machine learning can help identify weak spots in the collected data (e.g., patients for whom a prediction of an LVO is more uncertain) and guide a continuous cycle of data augmentation and model evaluation to effectively improve algorithm performance [120].
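
One simple version of this idea is uncertainty sampling, sketched below with synthetic data: cases whose predicted probability lies closest to 0.5 are queued for expert review and folded back into the next training round. The data, model, and threshold logic are illustrative assumptions rather than any deployed system.

```python
# Hedged sketch of uncertainty-based case selection for continuous model improvement.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=3000, n_features=15, random_state=0)
X_labeled, y_labeled = X[:500], y[:500]          # initial training pool
X_stream = X[500:]                               # incoming, not-yet-reviewed cases

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_labeled, y_labeled)
probs = model.predict_proba(X_stream)[:, 1]
uncertainty = np.abs(probs - 0.5)                # 0 = most uncertain

# The most uncertain cases would be prioritized for expert labeling, then folded
# back into the training set in the next evaluation cycle.
review_queue = np.argsort(uncertainty)[:20]
print("Indices queued for expert review:", review_queue[:10], "...")
```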

Beyond making accurate predictions of risk, future work with big data may help establish precise treatment recommendations. For example, current guidelines only provide standardized blood pressure recommendations for both acute treatment and prevention. In the acute stroke setting, exceeding individualized autoregulatory blood pressure goals may be associated with worse outcomes [121]. In patients with intracranial atherosclerotic disease, while systolic blood pressure under 140 mm Hg has been associated with a lower rate of stroke recurrence, it remains unclear whether there is a subset of patients who might benefit from more permissive hypertension in the long term [122]. Instead of a general recommendation for long-term blood pressure control, a big data-driven ML approach may be able to identify individualized goals for blood pressure.

Despite the rapid adoption of big data and ML into everyday life (e.g., virtual assistants, semi-autonomous driving, social media feeds, bank fraud detection, and AI-generated art and writing), their adoption into medicine has lagged behind. Barriers to adoption include the lack of transparently reported prospective studies, concerns about the generalizability of models developed with research data to real-world applications, concerns about whether algorithms can explain their decision-making, concerns about bias in training data as well as population shifts over time, potential liability, and technical issues with implementation involving security, privacy, and interoperability [123, 124]. A multi-faceted approach is necessary to address these diverse challenges. Future studies should strive to incorporate more prospective clinical trials and conform to CONSORT-AI guidelines on reporting to achieve transparency. Future work on machine learning and, particularly, deep learning should explore methods to increase human interpretability and build trust. Medical information technology infrastructure must evolve toward interoperability, using standards such as Health Level 7 (HL7) Fast Healthcare Interoperability Resources (FHIR). Legislation is needed to promote standardization and data sharing while protecting privacy and security.

Conclusion

Machine learning and big data analytics are rapidly developing assets for improving the acute management and prevention of stroke. These algorithms can help identify additional patients who may benefit from intervention, automate and standardize the detection of LVOs to facilitate the triage of patients, predict the development of hemorrhagic transformation or malignant edema, better stratify risk for stroke prevention, and make personalized treatment recommendations. Some ML and AI techniques are already being introduced into clinical practice for neuroimaging. High-quality clinical trials with transparent, AI-conscious reporting are needed to explicitly evaluate their utility for patient care. Barriers to the adoption of big data and AI in medicine will need to be addressed to benefit from these advances.