A classification framework for exploiting sparse multivariate temporal features with application to adverse drug event detection in medical records
 179 Downloads
Abstract
Background
Adverse drug events (ADEs) as well as other preventable adverse events in the hospital setting incur a yearly monetary cost of approximately $3.5 billion, in the United States alone. Therefore, it is of paramount importance to reduce the impact and prevalence of ADEs within the healthcare sector, not only since it will result in reducing human suffering, but also as a means to substantially reduce economical strains on the healthcare system. One approach to mitigate this problem is to employ predictive models. While existing methods have been focusing on the exploitation of static features, limited attention has been given to temporal features.
Methods
In this paper, we present a novel classification framework for detecting ADEs in complex Electronic health records (EHRs) by exploiting the temporality and sparsity of the underlying features. The proposed framework consists of three phases for transforming sparse and multivariate time series features into a singlevalued feature representation, which can then be used by any classifier. Moreover, we propose and evaluate three different strategies for leveraging feature sparsity by incorporating it into the new representation.
Results
A largescale evaluation on 15 ADE datasets extracted from a realworld EHR system shows that the proposed framework achieves significantly improved predictive performance compared to stateoftheart. Moreover, our framework can reveal features that are clinically consistent with medical findings on ADE detection.
Conclusions
Our study and experimental findings demonstrate that temporal multivariate features of variable length and with high sparsity can be effectively utilized to predict ADEs from EHRs. Two key advantages of our framework are that it is method agnostic, i.e., versatile, and of low computational cost, i.e., fast; hence providing an important building block for future exploitation within the domain of machine learning from EHRs.
Keywords
Electronic health records Adverse drug events Data mining Sparse multivariate features Temporal abstraction Machine learning ShapeletsAbbreviations
 AUC
Area under receiver operator curve
 ADE
Adverse drug event
 ATC
Anatomical therapeutic chemical classification
 EHR
Electronic health record
 DFT
Discrete Fourier transformation
 DWT
Discrete wavelet transformation
 ICD
International classification of disease
 MNAR
Missing notatrandom
 NPU
Nomenclature, Properties and Units codes
 PAA
Piecewise aggregate approximation
 ROC
Receiver operator curve
 RF
Random forest
 SAX
Symbolic aggregate approximation
Background
Although electronic health records (EHRs) have been extensively exploited for developing robust predictive models and for solving challenging predictive modeling tasks in healthcare [1, 2], EHRs still present critical problems that need to be solved so as to fully exploit the complex interactions and information they contain. Recent studies estimate that in the United States adverse drug events (ADEs), and other preventable adverse reactions in the hospital setting, incur, apart from the human suffering, a yearly monetary cost of approximately $3.5 billion [3]. Therefore, it is of paramount importance to reduce the impact and prevalence of ADEs within the healthcare sector, not only since it will result in reducing human suffering, but also since it can substantially reduce the economical strains on the healthcare system. Although benefit–risk analysis of newly developed drugs is already conducted during clinical trials, postmarketing detection and surveillance are necessary to detect unanticipated events. Clinical trials are normally performed with a limited sample of patients, who are followed for a limited period of time. As a result, not all serious adverse events can be detected prior to market deployment, which results in drugs being withdrawn due to serious adverse reactions not detected during clinical trials. To overcome some of these limitations, several attempts have been made to manually encode rules for detecting ADEs in EHRs [4, 5, 6]. However, in addition to requiring substantial efforts by domain experts to formulate such rules, the objectives typically change over time which requires the manually encoded rules to be frequently updated. More importantly, however, many ADEs are not identified, due to the limited knowledge about effects of medical treatments, such as drugs being tested only in limited trials and under controlled conditions.
Hence, an alternative approach towards ADE detection is to resort to machine learning for exploiting the constantly growing volume of EHR data, and more specifically to effectively exploit the inherently complex nature of these data sources. Indeed, a lack of investigation in utilizing EHR predictive modeling for ADEs built on structured medical data (e.g, laboratory test results) has led to interest in the development of machine learning models, such as random forests, which can aid in altering the clinical courses of ADE vulnerable patients [7]. The development and application of predictive models in a clinical setting can result in substantial improvements when it comes to ADE detection while minimizing the inherent costs.
The adoption of EHRs has increased the interest towards secondary use of clinical and medical data by researchers and practitioners [8, 9]. Examples of the obvious and detrimental benefits of EHR systems include public health surveillance, pharmacovigilance, healthcare quality assessment and monitoring [10]. Moreover, the employment of EHRs facilitates opportunities for ADE investigations to move from individuallevel to populationlevel research, a facet which has broadly received increased attention within clinical and translational research [8].
The vast majority of research on learning from EHRs has been focusing mainly on four key categories [11]: (1) comorbidity detection and analysis, (2) patient clustering, (3) predictive modeling, and (4) cohort analysis and querying. Specific examples of such categories include: association rule mining, classification or prediction of patient conditions such as identification of the smoking status of patients [12], patient safety and automated surveillance of adverse events [13], comorbidity and disease networks [6], processing of clinical text [14], identification of suitable individuals for clinical trials [15], and the identification of temporal associations between medical events and first prescriptions of medicines for signaling the presence of an ADE [16].
Temporal abstractions of EHRs
Additional complexity is also induced on the EHR feature space from time series variables which can often be of different lengths, measured at irregular time intervals, or exhibit high levels of sparsity. Several attempts have been made to address this problem, and retrofit standard machine learning methods for building predictive models from such feature spaces. One family of studies resort to handling temporal variables by considering simpler mappings, i.e., converting each time series feature value to a static representation, also known as temporal abstraction. Examples of such simple mappings include the length, average, mean, slope, or the weighted sum of all values of the time series [17, 18, 19, 20]. Although the mapping heuristics allow standard predictive modeling techniques to be employed directly, they compromise the quality of the feature space by nullifying the underlying temporal information of the variables; information which may be immensely useful for the predictive task at hand [21]. As shown in earlier studies and data domains where variables are time series, and characterized by high missingvalue rates, the best performing temporal abstraction has been to consider counts of the values in each variable.
An alternative approach is to employ offtheshelf time series summarization techniques, where the goal is to reduce the length of the time series by transforming them into more compact representations without loss of information and by preserving the notion of temporal order. Examples of such summarization techniques are, among others, the piecewise aggregate approximation (PAA) [22] and its followup version called symbolic aggregate approximation (SAX) [23, 24], the discrete Fourier transformation (DFT) [25] in the frequency domain, and the discrete wavelet transformation (DWT) [26]. Among those representations, SAX, maps the time series to a symbolic sequence; in particular, this mapping is achieved by assigning each continuous value to a symbol from a discrete alphabet that follows from a Gaussian distribution. More importantly, SAX is oftentimes preferred against other representations due to its simplicity [24].
Moreover, a basic concept that has been used extensively for featurebased time series classification is that of time series shapelets, which typically refer to classdistinctive time series subsequences [27, 28, 29]. In a typical time series classification problem, an object is described either by only a single (univariate) or a set (multivariate) of time series of equal length, that are measured at equal time intervals. The multivariate time series case could be considered as an equivalent formulation to our problem, as each feature could be seen as a single “channel” of a multivariate time series. Nonetheless, our setup differs substantially as in our case: (1) each individual time series is not necessarily sampled at fixed time intervals and (2) our multivariate features are highly sparse, i.e., they contain many missing values.
Approaches focusing on classification of epidemiological longitudinal data [30, 31] handle multivariate feature variables by employing what is known as populationbased feature extraction. Populationbased methods are, however, limited to very short time series variables of up to three measurements, and are not suited for long and sparse multivariate features. In fact, the closest approach and direct competitor of our work is the random dynamic subsquence method proposed by Zhao et al. [32], where the main idea is to convert the temporal features to SAX sequences, choose a representative subsequence for each feature, and compute the distance of all features to the representative. This process results in singlevalued features. Unfortunately, the approach by Zhao et al. [32] suffers from two main weaknesses: (1) the chosen distance function, i.e., Levenshtein distance, is highly dependent on the sequence length and (2) it cannot effectively handle and exploit the high degree of sparsity in the feature space.
ADE knowledge extraction from EHRs
Previous work automatic detection of ADEs from EHRs has consisted of a variety of approaches such as discovering statistical links between certain ADEs and drug dosage [33] alongside natural language processing using unstructured data such as clinical notes [34, 35]. However, there remains a deficiency for investigating ADEs with predictive modelling on structured EHR data. Existing predictive modelling studies have utilised standard techniques such as regression for largescale mining of ADEs [36] and random forest classifiers applied to clinical codes and measurements [37]. However, what most studies lack is the exploitation of crucial information regarding the temporal order of clinical events which may have a profound impact on the classification performance of particular ADEs. One such approach was proposed with the objective of detecting ADE signals focusing on laboratory abnormalities after treatment with specific medication [38]. The formulation of our paper is rather different, as in our case we are using all clinical measurement signals concurrently to learn temporal features for building an ADE classifier. These features can be of any type, normal or abnormal, and they are used by the classifier as long as they constitute good ADE predictors.
Recent work by Zhao et al. [32] has incorporated temporal information through the use of symbolic sequence representations of EHR time series data for the purpose of ADE detection. In this study we seek to improve upon the work of Zhao et al. by accounting for and exploiting high levels of feature sparsity inherent to EHR data.
The main focus of our paper is proposing a predictive modeling based framework that utilizes multivariate temporal features, allowing traditional machine learning algorithms to work for complex and timeevolving data sources. EHRs contain such complex data, but predictive modeling methods typically rely on data sources to be in a structured form, i.e., a tabular format, where objects (in our case patient records) correspond to rows and attributes (in our case patient variables) correspond to columns [39]. Nonetheless, EHR data can rarely fit into such format due to its inherent complexity, induced, for example, by the prevalence of longitudinal observations. A clinical variable for a particular patient, for instance, is not always described by a single value, but by a series of values over time. Consequently, the induced data table may contain features for which their data type is a time series variable, instead of a real or categorical variable.
Contributions
 a)
we propose a sparse symbolic representation for multivariate feature spaces, with emphasis on temporal features of arbitrary lengths and high degree of sparsity, i.e., missing values. The proposed representation is based on the SAX time series summarization technique as well as on the concept of sshapelets, which correspond to classdistinctive discrete event subsequences;
 b)
we propose three strategies for dealing with such feature spaces: (1) length encoding or plain (which is an extension of Zhao et al. [32]), mostcommon encoding or mc, and leftright optimized encoding or lr;
 c)
we provide an extensive experimental evaluation of the three strategies on 15 real datasets taken from the healthcare domain involving ADEs. Moreover, we study the utility of the chosen sshapelets, as well as their consistency to medical findings in the context of ADE detection.
Methods
Overview
Our hypothesis is that sparse and unevenly sampled feature variables from EHRs can serve as strong predictors of an ADE. More concretely, given an EHR dataset represented in the form of unevenly sampled and sparse multivariate features, our goal is to infer a classification model that is able to correctly predict the presence of an ADE for a previously unseen medical record.
 Phase A: each feature in the training set is first transformed into a discrete symbolic sequence representation.

Phase B: a set of candidate subsequences is generated from the discretized features, which are then evaluated based on their classdistinctive power (referred to as utility), and finally the set of subsequences with the highest utility, called sshapelets, is generated. An important component of this phase concerns the way empty data records, i.e., records with empty sequences, are processed. Towards this end we propose three strategies for handling and exploiting empty sequence features.

Phase C: using the sshapelets extracted from the training set, each multivariate data feature (of the training and test sets) is converted to a realvalued feature.
Definitions and problem formulation
We now provide some basic definitions and the formulation of the problem at hand.
Definition 1
A time series S={s_{1},…,s_{d}} is an ordered set of d real values, where each \(s_{k} \in \mathbb {R}\), with S=d.
The classification task involves a set of data features, where each feature is represented by a time series. We denote such features multivariate features.
Definition 2
A multivariate feature space \(\mathcal {A} =\{A_{1}, \ldots, A_{m}\}\) is a set of m multivariate features, where each \(A_{j} \in \mathcal {A}\) is a time series.
Using this multivariate feature space \(\mathcal {A}\) we can define a multivariate object\(O \in \mathcal {A}\) as an instantiation of that feature space. Moreover, given a set of predefined class labels \(\mathcal {Y}=\{y_{1}, \ldots, y_{\sigma }\}\), we can define the universe of labeled multivariate objects in that space. Due to the particular prediction task, the experiments only considers binary labels, σ=2. However, the proposed strategy handles multiple labels, i.e., σ>2 without modification.
Definition 3
Given a multivariate feature space \(\mathcal {A}\) and a set of predefined class labels \(\mathcal {Y}\), the universe of labeled multivariate objects is defined as a set of tuples \(\mathcal {O}=\{(O_{i},y_{i})~~O_{i}\in \mathcal {A}, y_{i}\in \mathcal {Y}\}\), where y_{i} is the class label assigned to object O_{i}, \(\forall (O_{i}, y_{i})\in \mathcal {O}\), with i∈[ 1,n].
In a typical classification setting, we are given a set of training objects, where each object is associated with a class label, called the training set. In our case, we employ a multivariate training set, denoted as \(\mathcal {L}\), which is simply drawn from our universe of labeled multivariate objects \(\mathcal {O}\). In other words, \(\mathcal {L}\subseteq \mathcal {O}\).
Using the above definitions, the problem studied in this paper is defined as follows:
Problem 1
In the following three subsections we describe each of the three phases of the proposed framework in detail.
Phase A: Multivariate feature discretization
The objective of the first phase is to discretize the space of multivariate features \(\mathcal {A}\), resulting into a new feature space, where feature values correspond to symbolic sequences. We denote this target space \(\mathcal {\hat {A}}\) and refer to it as multivariate symbolic feature space. The discretization process follows four steps: (1) normalization, (2) summarization, (3) symbolic representation, and (4) alphabet calibration. Since the multivariate features are practically instantiated as time series variables, we employ standard time series normalization, summarization, and symbolic representation techniques for the following steps. Nonetheless, in principle one could use any alternative technique for each of the four steps.
Normalization
where μ(S) and σ(S) correspond to the mean and standard deviation of the values of S.
Summarization
Hence, PAA results in a length reduction from d to w, by splitting S into w partitions of equal size, and assigning each partition the mean value of the points of the original time series falling into that partition.
Symbolic representation
Next, each value of \(\bar {S}\) is mapped to a discrete symbol defined over an alphabet Σ of size α. For this symbolic representation we employ a standard time series summarization technique called SAX [23, 24].
More concretely, a mapping is defined between \(\mathbb {R}\) and an alphabet Σ of α symbols. One assumption is that each time series variable is generated by an underlying distribution, e.g., a Gaussian. Next, a set of α−1 breakpoints \(\mathcal {B} =\{\beta _{1}, \ldots, \beta _{\alpha 1}\}\) are defined, so that the area under the Gaussian normal curve N(0,1) from each pair (β_{i}, β_{i+1}) is equal to 1/α, assuming that β_{0}=−∞ and β_{α}=∞. Hence, given a desired alphabet size α, the breakpoints can easily be defined by consulting a statistical table. Once the breakpoints are obtained, \(\bar {S}\) is mapped to a sequence of symbols \(\hat {S}\) as follows: all coefficients that are lower than the first breakpoint are mapped to the first alphabet symbol, e.g., a; the next set of coefficients with values between the first and the second breakpoints are mapped to the second available symbol, e.g., b; and so on. The resulting symbolic representation \(\hat {S}\) is called the SAX approximation of S, defined over a SAX alphabet Σ of α symbols. By applying SAX, the initial multivariate feature space \(\mathcal {A}\) is converted to its SAX representation, which defines the symbolic multivariate feature space \(\mathcal {\hat {A}} = \left \{\hat {A}_{1}, \ldots, \hat {A}_{m}\right \}\), comprising symbolic sequence representations of variable lengths.
Alphabet calibration
As it can be seen, a parameter to calibrate when using SAX is the alphabet size α, which corresponds to the number of symbols that are used for mapping the normalized time series values. Ideally, a minimum number of symbols is needed to reflect the underlying dynamics of a time series. Since the latter is typically unknown, the choice of a proper alphabet size is mostly an empirical task [44].
At the end of Phase A, each multivariate object O is converted to its symbolic counterpart, denoted as \(\hat {O}\). Hence, the universe of multivariate objects \(\mathcal {O}\) is converted to a universe of labeled symbolic objects, denoted as \(\mathcal {\hat {O}}\), and the converted training set is now denoted as \(\mathcal {\hat {L}}\), while empty records in a data object are represented as the empty set (∅). Note that the instantiation in Phase A replicates the method described by Zhao et al. [32] to allow for assessing the gain of using the subsequent phases.
Phase B: subsequence enumeration
In the second phase, after the original multivariate feature space has been transformed to its symbolic representation, a pool of candidate representative subsequences is generated, evaluated, and the subset of most representative ones is finally selected. The overall objective of this phase is to identify a classdistinctive subsequence, which we denote as sshapelet, for each multivariate feature. This process, which follows three steps, is detailed below.
Subsequence generation
Assume we are given a symbolic sequence \(\hat {S}\) of length \(\left \hat {S}\right \), corresponding to an instantiation of feature \(\hat {A}_{i}\), which is the symbolic representation of the original multivariate feature A_{i}. A subsequence s of \(\hat {S}\) is defined as a sampling of length s of contiguous symbols from \(\hat {S}\), such that \(s\leq \left \hat {S}\right \), i.e., \(s = \left \{\hat {S}_{t}, \ldots, \hat {S}_{t+s1}\right \}\), with \(s\leq t \leq \left \hat {S}\right +1\).
Given the symbolic representation \(\mathcal {\hat {L}}\) of a multivariate training set \(\mathcal {L}\), and an alphabet size α, we generate a pool of candidate subsequences, denoted as \(\mathcal {S}_{\alpha }\), by randomly sampling the sequences in \(\mathcal {\hat {L}}\). Practically, \(\mathcal {S}_{\alpha }\) contains snippets of existing symbolic sequences in \(\mathcal {\hat {L}}\) of arbitrary lengths in [1, l_{max}], where l_{max} is the length of the longest sequence in \(\mathcal {\hat {L}}\).
Subsequence evaluation
Intuitively, the above distance corresponds to the dissimilarity between s and its best matching subsequence in \(\mathcal {\hat {S}}\). Although D(·,·) can be any distance function for string matching, in this paper we use the edit distance [45], as it is one of the most widely used measures for evaluating string similarity.
We should note that in the approach described in Zhao et al. [32], referred to as random dynamic subsequence, a similar idea was used for measuring the distance between a candidate subsequence and a data sequence. The key difference in our paper is that we employ a modified edit distance function (Eq. 3), which computes the best subsequence match of candidate s in the target sequence, as opposed to a full sequence match computed by random dynamic subsequence. This is a substantial improvement of the competitor method as in its original version the used distance function is highly affected by the length of the target sequence; especially when \(s << \left \mathcal {\hat {S}}\right \), the distance value becomes meaningless. In our case, the solution we propose (Eq. 3) makes the distance function invariant to the length difference of the two compared sequences.
The utility of a subsequence in our pool of candidates \(\mathcal {S}_{\alpha }\) is computed following the approach by Ye et al. [46]. More precisely, for each subsequence \(s \in \mathcal {S}_{\alpha }\), we compute the dissimilarity between s and all the sequences in \(\mathcal {\hat {L}}\), using function Dist(·,·) (Eq. 3) to induce a partitioning of \(\mathcal {\hat {L}}\) into two disjoint subsets \(\mathcal {\hat {L}}_{1}\) and \(\mathcal {\hat {L}}_{2}\). For simplicity, we consider twoway splits, i.e., q=2, but the approach is generalizable to any number of partitions. Consequently the information gain given by s is evaluated (using Eq. 6) and measures the ability of s to separate \(\mathcal {\hat {L}}\) into two partitions with class distributions of low entropy.
Assuming that our multivariate feature space is sparse, i.e., a large fraction of the feature space in our raw dataset contains time series of zero length, represented as ∅, missing data plays a crucial role when the optimal distance threshold is computed. In “Exploiting sparsity” section, we will explain how missing entries should be treated in this phase, and propose alternative methods to achieve this goal.
Subsequence selection
So far, we have transformed raw multivariate features into a symbolic sequence dataset and used the latter to generate a set of candidate representative subsequences. We have also defined a way of ranking such candidates according to their utility, i.e., their ability to separate the dataset into two partitions with class distributions of low entropy.
Now, we want to identify the representative subsequence with the highest utility per multivariate feature variable. We call this subsequence a sequence shapelet or sshapelet.
Definition 4
where \(\mathcal {I}\subseteq \mathbb {N}\), i.e., is the set of candidate alphabet sizes.
In summary, for each feature \(\mathcal {\hat {A}}_{j} \in \mathcal {\hat {A}}\) in the training set \(\mathcal {\hat {L}}\), the optimum sshapelet, \(s^{*}\left ({\mathcal {\hat {A}}_{j}}\right)\), is selected which in fact is the sshapelet with the highest utility for \(\mathcal {\hat {A}}_{j}\) across all possible alphabet sizes in \(\mathcal {I}\). The final product of this phase is the set of m optimum sshapelets, one per feature in \(\mathcal {\hat {A}}\), which we denote as \(\mathcal {Z}^{*}=\left \{s^{*}\left ({\mathcal {\hat {A}}_{1}}\right), \ldots, s^{*}\left ({\mathcal {\hat {A}}_{m}}\right)\right \}\).
The only missing part of this phase is how we deal with feature sparsity, i.e., feature values corresponding to empty sequences. In “Exploiting sparsity” section, we will describe three strategies for dealing with sparse multivariate features.
Phase C: data transformation
Hence, any multivariate object \(O\in \mathcal {A}\) can be converted to a realvalued feature object, denoted as \(\tilde {O}\), by applying function τ^{∗}(·) to its original representation, i.e., \(\tilde {O} = \tau ^{*}(O)\).
In practice, this transformation is performed by computing the distance (Eq. 3) between each \(s^{*} \in \mathcal {Z}^{*}\) and its corresponding symbolic object feature. This transformation is performed to both the training set \(\mathcal {\hat {L}}\), during the model training phase, as well as to the test set at prediction time. Intuitively, our data objects are transformed from a set of multivariate features (columns), with each feature being a timeseries variable, to a set of singlevalued features, where each feature value corresponds to the distance between its symbolic representation to the selected optimum sshapelet for that feature.
Finally, we should stress that, again, object instances with empty records require special attention. To this end, several design choices are needed in order to investigate whether or not to consider these empty records and how to represent them. These choices are described next.
Exploiting sparsity
Throughout the transformation framework, mainly in Phase B, described in the previous sections, several design choices need to be made for handling and exploiting the sparsity of the feature space, i.e., data entries that are missing not at random. In this section, we highlight the steps where such choices are critical for exploiting these missing values towards improving prediction performance. To this end, we propose three strategies, which we call length encoding (or plain), mostcommon encoding (or mc), and leftright optimized encoding (or lr).
Strategy I: Length encoding (plain)
An encoding for missing entries is first needed when raw time series are mapped into SAX sequences, which. as mentioned earlier, are marked with ∅. Based on Algorithm 1, when a candidate subsequence s is evaluated, an optimal distance threshold is determined in order to compute the information gain achieved by the data split induced by s. At this step, a decision has to be taken on whether or not and how to consider empty sequences (i.e., empty feature records) when computing the optimal threshold. For example, suppose we have a very sparse multivariate training set \(\mathcal {L}\), which, after its conversion to \(\mathcal {\hat {L}}\), is mapped to a feature space of symbolic sequences with many empty strings (i.e., having a large fraction of ∅).
A simple strategy is to apply Algorithm 1 directly (see lines (4)–(6)), so that the distance between s and ∅ will be simply equal to the length of the candidate subsequence, that is, Dist(s,∅)=s. As a result, all empty feature records will be assigned either to \(\mathcal {\hat {L}}_{1}\) or \(\mathcal {\hat {L}}_{2}\) based solely on the length of s. This strategy is referred to as lengthencoding or plain.
In summary, plain treats empty multivariate feature records as regular entries, and replaces them with the distance between their symbolic representation and the optimum sshapelet, that is, it replaces them with \(\left s^{*}\left (\mathcal {\hat {A}}_{j}\right)\right \) for each feature \(\mathcal {\hat {A}}_{j}\). This approach is a modified and improved version of random dynamic subsequence used in Zhao et al. [32]. In particular, our method corrects for bias introduced by random dynamic subsequences, i.e., to favor longer subsequences, using our redefined subsequence distance measure. As a consequence, we use plain as a baseline for methods that consider the temporal information, similar to how sl acts as a baseline for methods that does not consider temporal information in our experimental evaluation.
Strategy II: mostcommon encoding (mc)
An alternative strategy is to ignore empty feature records corresponding to empty sequences, and compute the optimal distance split by using only nonempty feature records. We call this strategy most common encoding or mc. In fact, when building the singlevalued features at training and prediction time, mc replaces ∅s with the distance value Dist(s^{∗},·) that occurs most frequently within the training set. When dealing with very sparse feature spaces, this choice can be interpreted as a way of considering missing entries as “frequent”.
Consider, for example, the feature space corresponding to clinical measurements of patients taken over different time periods. If a clinical measurement has been recorded only for a relatively small number of patients, in the corresponding dataset empty feature records will be ubiquitous. Thus, associating a missing entry with the most frequently (or commonly) observed value will mark it as a recurring event. Of course, this strategy is not guaranteed to work for dense feature sets, while replacing empty sequences with the most common distance may not capture the actual meaning of a missing measure.
Strategy III: Leftright optimized encoding (lr)
In “Utility of sshapelets selected by lr vs. plain” section, we elaborate on the lr strategy in terms of interpretability and show that a dynamic encoding of missing data can (also) help understand the sshapelet that has been selected for classifying a multivariate feature dataset.
Data source
The experiments carried out in this work are based on HealthBank [1, 48], which is an EHR database containing deidentified health records for approximately 1.2 million patients admitted to a hospital or local care facility in the Stockholm County region. The data was collected during 2009 to 2015 by Karolinska University Hospital. The data source contains a total of 11,623 unique diagnoses codes defined by ICD10SE codes (The 10th revision of the International Statistical Classification of Diseases and Related Health Problems).
ADEs are reported using codes from the seven ADE categories proposed by Stausberg and Hasford [49]. Among these, A.1 and A.2 – a drugrelated and a drug or other substancerelated causation was noted in the diagnosis code, respectively – indicate a clear sign of an ADE occurrence and are, hence, included as possible datasets in this study. Note that our data source does not contain a sufficient number of patient covering all of the ADE categories mentioned by Stausberg and Hasford [49].
Empirical evaluation
We formulate our experiments as binary classification tasks. Hence, for each ADE we create a dataset where data examples correspond to patients that either have or have not been assigned with the corresponding ADE diagnosis code. In each dataset, positive examples are those patients who have been diagnosed with a specific ADE, while negative examples are those patients who have been given a diagnosis code that belongs to the same disease taxonomy, i.e., sharing the same first three levels of the ICD10 hierarchy, but is not an ADE. For example, if the ADE under consideration is I95.2 (Druginduced hypotension), patients who share the same code up to the third position are considered as negative examples, i.e., I95.* (Hypotension) (where * denotes any character) except for I95.2. In the experiments, we include inpatient encounter, but only predict the possibility of an ADE for their last encounter.
Datasets
Positive denotes patients which are ADE positive, whereas Negative corresponds to patients that are ADE negative
Adverse drug event  Positive  Negative 

D61.1: Aplastic anaemia  593  105 
Average age  50.3  55.6 
Gender distribution (% female)  42.7  49.5 
E27.3 Adrenocortical insufficiency  70  259 
Average age  61.8  56 
Gender distribution (% female)  58.6  58.6 
G62.0 Polyneuropathy  96  783 
Average age  59.8  72.7 
Gender distribution (% female)  47.9  40.1 
I95.2 Hypotension  115  1287 
Average age  79.3  74.7 
Gender distribution (% female)  40.9  49.6 
L27.0 Generalized skin eruption  182  468 
Average age  60.1  48.7 
Gender distribution (% female)  55.5  48.4 
L27.1 Localized skin eruption  151  498 
Average age  59.8  55.9 
Gender distribution (% female)  50.3  54.5 
M80.4 Osteoporosis  52  1170 
Average age  65.8  70.9 
Gender distribution (% female)  71.15  81 
O35.5 Damage to fetus by drugs  146  260 
Average age  38.5  38.9 
Gender distribution (% female)  100  100 
T78.2 Anaphylactic shock  131  856 
Average age  50.9  45.46 
Gender distribution (% female)  50.4  60.7 
T78.3 Angioneurotic oedema  283  720 
Average age  56.4  42.35 
Gender distribution (% female)  59  59.9 
T78.4 Allergy  574  415 
Average age  41.2  52.5 
Gender distribution (% female)  65.2  51.2 
T80.1 Vascular complications  66  609 
Average age  66.2  63.2 
Gender distribution (% female)  48.5  64.7 
T80.8 Infusion complications  538  138 
Average age  64.3  60.4 
Gender distribution (% female)  65.8  52.2 
T88.6 Anaphylactic shock  89  1506 
Average age  56.9  58.5 
Gender distribution (% female)  51.7  57.6 
T88.7 Unspecified adverse effect  1047  550 
Average age  60.9  53.6 
Gender distribution (% female)  60.2  51.3 
For each ADE dataset, the number of features included in the learning process with different sparsity requirements
0.2  0.3  0.5  0.7  0.9  0.95  1.0  

D61.1  16  21  23  34  72  90  186 
E27.3  11  12  14  19  42  88  137 
G62.0  4  11  16  19  40  62  151 
I95.2  11  13  14  20  30  56  180 
L27.0  4  12  18  25  33  54  162 
L27.1  6  11  17  24  35  62  169 
M80.4  9  11  14  19  42  62  170 
O35.5  1  2  4  15  24  38  73 
T78.2  8  9  12  17  29  50  168 
T78.3  8  9  12  17  27  43  131 
T78.4  8  9  13  17  29  51  194 
T80.1  11  13  19  25  33  40  131 
T80.8  11  14  19  25  33  43  128 
T88.6  11  12  15  21  33  59  202 
T88.7  11  12  16  21  33  62  217 
Benchmarked methods
The performance of the proposed multivariate feature representation framework using the three missing value strategies, i.e., plain, mc, and lr, has been evaluated using the Random Forest algorithm (RF, [51]). Since our proposed framework is model agnostic, we note that alternative predictive models can be used. The hyperparameters for RF have been configured as follows: (i) we set the number of trees to 100, (ii) information gain was used as the split criterion (being consistent with the way random subsequences are evaluated and an sshapelet is selected), and (iii) the number of features to consider at each decision split was set to the default value \(\sqrt {m}\), where m is the number of features in the dataset.
As a baseline we used a method referred to as sequence length representation or sl [20]. This representation has been shown to be the best singlevalued representation for clinical laboratory measurement features in the context of detecting ADEs in EHRs, compared to several other multivariate feature representation techniques that do not take into account the temporal order of the measurements [20].
Alphabet tuning
Following the best practice introduced by Zhao et al. [32], we include three different configurations for the SAX alphabet size (α). These are α={2,3,5}, where the smallest alphabet size is used to reflect two simple states: low or high, i.e., simply below or above the mean. To allow for more fine grained representations, both an alphabet size of three and five are used to indicate that values can be contained in different regions of the feature value distribution. Since the best choice of α is unknown apriori, we here employ a simple strategy to let the learning algorithm dynamically choose the best alphabet size.
Sparsity tolerance
We explored different levels of feature sparsity by introducing a threshold τ_{sp}, which limits the maximum fraction of missing data that a feature can contain in order to be selected (and transformed) for training. For instance, τ_{sp}=0.2 indicates that a feature is accepted for training, if it contains no more than 20% of empty sequence records. More specifically, the higher the sparsity threshold the more features are selected to be transformed and, hence, used by the predictive model, in our case RF. Clearly, when τ_{sp}=1.0, all available features are taken into account, regardless of the percentage of empty sequence records.
We refer to Table 2 for the number of features selected for each ADE dataset and sparsity threshold.
Evaluation metrics
Since the datasets employed in this study are imbalanced, we used the Area Under the ROC Curve (AUC) [52, 53], which has been shown to be a more appropriate classification performance metric compared to accuracy or classification error, when the data source is imbalanced ([54, 55]). AUC represents a range of tradeoffs between sensitivity and specificity, and since both are invariant to the actual balance between classes in the test set, AUC is not biased towards the majority class. Finally, all the AUC results reported in this paper are obtained by stratified 10fold crossvalidation.
Results
Sparsity tolerance of baseline
Sparsity tolerance of plain, mc, and lr
Next, we investigate whether the consideration of the temporal information provided by the multivariate features can further improve predictive performance for ADE detection. Moreover, we study whether any further improvements are achieved when including very sparse features, and we explore the most efficient way of representing empty multivariate feature records. To carry out these experiments, we train an RF by using the three proposed feature representations, namely plain, mc and lr, and measure their respective AUCs for each ADE dataset and for different sparsity levels.
Our findings are depicted in Table 5, where each row refers to one of the 15 ADE datasets, columns correspond to increasing values of τ_{sp}, and each cell reports the average AUC of the corresponding model, obtained by stratified 10fold crossvalidation. For each dataset, the best result is marked in bold. It can be clearly seen how predictive performance in terms of AUC increases as sparser columns are included in the learning process. Indeed, the column corresponding to the best performance is that related to the maximum sparsity threshold, namely τ_{sp}=1.
Overall comparison
Comparing the performance of all four methods (including sl) at each sparsity level – the highest AUC reached for a given threshold is underlined – we can see how plain and lr are the most effective feature representations, followed by sl. In particular, lr outperforms the other strategies (in terms of number of best results obtained on all of the 15 ADE datasets) for τ_{sp} equal to 0.2, 0.7, 0.95, and 1, while plain is the best method for τ_{sp}=0.3. For the other two sparsity levels the two methods perform equally well. When considering the best overall result on each ADE dataset, lr reaches the highest AUC on 8 out of 15 cases.
Comparison between lr and plain (first row) and between lr and sl (second row), for different values of the sparsity threshold
0.2  0.3  0.5  0.7  0.9  0.95  1.0  Overall  

lr vs plain  lr (10)  plain (8)  plain (8)  lr (11)*  lr(11)  lr(9)  lr(10)*  lr (9) 
lr vs sl  lr (13)*  lr (14)*  lr (13)*  lr (10)*  lr (13)*  lr (10)  lr(9)  lr (12)* 
Statistical significance
Furthermore, we provide a statistical analysis of the previous experiment. Following Demvsar et al. [56] regarding the case of individual comparisons between methods on different datasets, we use the Wilcoxon signedrank test [57] for rejecting the nullhypothesis that the compared methods perform equally well. Those entries of Table 3 where the nullhypothesis is rejected within a confidence interval of 0.05 are marked with an asterisk. In the first comparison, lr performs statistically better than plain when τ_{sp}=0.7,1.0, with a pvalue of p<0.05 in both cases. Conversely, the two cases in which plain outperforms lr are not statistically significant. Concerning the comparison between lr and sl, the nullhypothesis is rejected for τ_{sp}=0.2,0.3,0.5,0.9 (with p<0.01), and τ_{sp}=0.7 (with p<0.05). Finally, lr proves to be statistically better (p<0.05) than sl also when considering the overall best performance on each ADE dataset, as it can be noticed from the last column of Table 3.
Utility of sshapelets selected by lr vs. plain
By inspecting Fig. 5, we can clearly see how lr is consistently able to select sshapelets with a higher information gain compared to plain. Also, interestingly, we can identify ADE cases, where the above holds for extremely sparse features, such as T78.3, T80.8, T88.7, T80.1, L27.1, and G62.0. As confirmed by our results, this information gain difference between the two strategies results in a model with higher classification performance.
Investigation of three ADEs
We further investigated the top5 features (i.e., features with the highest utility) for lr and plain for three ADEs: E27.3, L27.1, and G62.0. These adverse effects were chosen since they are relatively frequent and the information gain between the baseline and our proposed method differed the most, i.e., there is a conflicting explanation between the two models.
The most important multivariate features used by the transformation framework for three of the datasets (from left, E27.3, L27.1 and G62.0)
Adrenocortical insufficiency  Localized skin eruption  Polyneuropathy  

lr  plain  lr  plain  lr  plain  
1  Dehydroepiandrosterone sulfate  Lymphocytes  Erythrocytes  MCHC  Erythrocytes  MCHC 
2  Potassium ion  Cortisol  MCHC  Erythrocytes  MCHC  Erythrocytes 
3  Neutrophilocytes  Hemoglobin  Calcium  Calcium  MCH  MCH 
4  Lymphocytes  Sedimentation reaction  Bilirubins  Creatininium  Bilirubins  Calcium 
5  Sodium ion  Erythrocytes  Creatininium  Carbamide  Calcium  Creatininium 
Regarding L27.1 (i.e., localized skin eruption), both methods mostly agree on the most informative features, which include high levels of erythrocytes and increased bilirubin levels in the liver. Both are reported as signs of druginduced skin disorder, with the second one also indicating anemia in the blood, which is a typical cause of localized skin eruption [59].
Finally, polyneuropathy, which refers to the damage of peripheral nerves can be medication induced. As shown in Table 4, calcium is a consistent predictor for both plain and lr. This also abides to the findings of Fernybough and Calcutt [60] that reduced or irregular levels of calcium can indicate peripheral neuropathy. More importantly, patients with peripheral neuropathy typically exhibit elevated erythrocyte sedimentation rate in their blood [61], which is also consistent with our findings.
In summary, it is clear that both plain and lr manage to identify important features that are connected with the corresponding ADEs. Based on the existing literature and our consultation with our collaborating clinical pharmacologists, our findings presented in Table 4 are not substantially significant as they are already known to clinical pharmacologists. Nonetheless, they demonstrate that our proposed method works in practice, while lr is shown to be highly robust to sparse temporal features without compromising predictive performance.
Discussion
The classical approach for dealing with time evolving representations of medical records is to either summarize the data using simple features [17, 18, 19] or to employ populationbased feature extraction [30, 31]. However, these summarization heuristics compromise the quality of the feature space by either ignoring the temporal dimension of the data or by simplifying it. The closest related work is the random dynamic subsequence method [32], which is included in the experimental evaluation. The results of our empirical evaluation demonstrate that the random dynamic subsequence approach suffers from two main weaknesses: (1) the chosen distance function, i.e., Levenshtein distance, is highly dependent on the sequence length and (2) it cannot effectively handle and exploit the high degree of sparsity in the feature space. To overcome these limitations, we first extend the method to allow for different distance functions and compute these on equilength representations, and secondly, we introduce three novel strategies for inferring the importance of values missing not at random. As such, our findings in this study advance the current stateoftheart in terms of detecting ADEs from timeevolving and sparsely recorded medical record systems.
More concretely, we observed how predictive performance in terms of AUC increases as sparser features are included in the learning process when employing the baseline method, sl, which simply employs the number of recordings (length) per clinical laboratory measurement as a feature for the ADE prediction task. This suggests that the presence or absence, as well as the number of times a clinical laboratory measurement has been taken for a patient can constitute a promising ADE predictor.
Moreover, as far as the three strategies for handling multivariate feature sparsity are concerned, our findings suggest that the proposed techniques provide good tradeoffs between feature sparsity and predictive performance in terms of AUC. Consequently, this indicates that strategies that take into account the sparsity of the data typically outperform those that do not. Hence, this study demonstrates that temporal multivariate EHR features of variable length and with high levels of sparsity can be effectively utilized to predict ADEs.
Furthermore, special attention has to be given on how to encode empty feature records in the transformed dataset when evaluating the utility of candidate sshapelet. Indeed, the sshapelet interpretation discussed above can easily become infeasible when the binary split introduced by the sshapelet is affected by a large amount of empty sequences. For example, consider the feature representation of sl. According to this strategy, ∅s are replaced with 0, regardless of the class distribution. The same holds for plain and mc, which map ∅ with s and the most common observed distance value, respectively. More generally, a static encoding of missing data, ignoring the information given by class distribution, affects the quality of an sshapelet. Conversely, by employing a strategy such as lr, the choice of the encoding of empty sequences dynamically fits the class balance.
One limitation of the current study, is that, while the framework is step agnostic, a single configuration is evaluated in the experiments. As such, there might be configurations that provide further improvements in terms predictive performance against the stateoftheart. We plan to explore this in future studies. Moreover, the novel framework is only evaluated on a single task (albeit using many instantiations of different datasets and targets). As such, it is rather difficult to generalize the results outside the scope of the current study, i.e., to detect other medical conditions except for ADEs. However, due to the fact that the results are consistent between different datasets, the relative performance of the investigated strategies are not expected to differ, only the absolute performance.
From a clinical perspective we acknowledge the limitations of the quality of our data sets pertaining to the nature in which ADEs are underreported by clinicians. ADEs are often not the primary reason for clinical encounters and can be seldom investigated or recognized by clinicians with a high degree of certainty. For this reason our negative set may indeed contain a high proportion of undetected and unreported ADE cases which could be detrimental to classification performance results. Nevertheless, we emphasize that our classification results demonstrate a clear ability to differentiate reported ADE cases from similar nonADE cases or cases where the relevant ADE was overlooked during a clinical encounter.
Additionally, for future studies we would wish to exploit, not only laboratory test data, but medication data due to the high clinical relevancy of this information for improving classification and for the potential of uncovering relations between certain medications and ADEs. For the purposes of the current study, laboratory test data was utilized as it’s sequential numerical nature was most relevant for the approach of our framework which involves summarizing numerical medical information for application with traditional machine learning algorithms. However, the uncovering of medicationbased insights could indeed validate our approach by discovering known relations between ADEs and certain medications while also generating novel hypotheses for medications which contribute to ADEs under particular conditions.
Conclusions
We have proposed a novel threephase symbolic transformation framework for classification of complex and sparse temporal multivariate feature datasets. These types of complex features are common in EHRs and can be effectively used for various prediction tasks, such as ADE detection. Moreover, applying machine learning algorithms to learn models from such data is typically challenging, due to the variable length and the sparsity of such features. In this study, we proposed and formalized a way of handling these types of multivariate and sparse features, with focus on ADE prediction from EHRs.
AUC obtained by RF on 15 ADE datasets: 4 methods are compared, respectively sl, plain, mc and lr
0.2  0.3  0.5  0.7  0.9  0.95  1.0  

D61.1  plain  78.151  80.842  80.817  80.909  84.107  84.801  82.484 
mc  76.876  78.132  80.163  78.854  81.259  82.946  81.720  
lr  77.978  80.720  79.961  79.675  83.624  83.610  81.510  
sl  73.846  78.653  78.928  80.756  83.016  82.412  82.817  
E27.3  plain  51.916  58.345  56.302  59.870  58.882  62.518  66.854 
mc  48.953  50.052  46.695  50.126  61.773  59.444  58.854  
lr  58.445  58.637  56.816  61.179  59.170  64.162  67.266  
sl  60.763  62.054  59.670  58.718  58.035  64.686  66.500  
G62.0  plain  64.443  71.325  75.452  77.247  78.839  79.074  80.279 
mc  66.228  72.700  71.376  76.974  74.769  74.146  74.713  
lr  64.713  70.756  74.787  77.169  78.992  79.357  79.418  
sl  61.473  69.429  76.335  78.379  81.980  81.509  82.222  
I95.2  plain  56.157  57.338  54.871  52.257  54.046  59.855  56.369 
mc  57.486  57.390  54.750  49.333  53.480  51.742  57.391  
lr  58.052  54.797  58.165  53.184  56.145  57.145  56.362  
sl  47.671  49.289  51.080  51.509  53.065  54.267  53.943  
L27.0  plain  60.374  68.063  66.689  67.815  65.273  67.416  65.322 
mc  56.065  62.683  66.766  65.527  67.431  66.443  64.500  
lr  56.585  66.422  66.256  67.020  66.277  68.424  66.110  
sl  55.400  62.981  61.919  64.648  64.597  65.136  68.271  
L27.1  plain  58.130  61.600  63.471  64.835  63.012  62.454  61.798 
mc  56.093  55.029  67.179  61.970  64.556  61.222  62.455  
lr  58.636  64.100  65.983  66.264  67.196  63.884  64.012  
sl  53.709  61.418  62.955  66.461  66.554  67.397  67.316  
M80.4  plain  55.646  54.745  59.375  59.029  68.713  67.004  65.396 
mc  55.729  55.770  52.641  58.996  66.863  63.000  63.507  
lr  55.753  58.056  58.849  59.760  68.912  69.119  67.756  
sl  47.933  57.149  52.678  58.414  67.945  65.199  64.845  
O35.5  plain  51.650  52.557  64.963  73.161  77.383  78.590  80.298 
mc  52.741  52.962  55.361  68.843  70.114  72.079  70.104  
lr  52.756  53.551  68.318  76.971  78.332  80.322  81.324  
sl  52.401  52.188  67.201  73.696  75.025  78.140  79.142  
T78.2  plain  53.028  51.573  56.977  56.538  57.153  58.709  59.229 
mc  55.811  54.349  53.146  55.771  53.260  54.417  53.092  
lr  52.924  52.369  55.286  59.569  63.066  57.775  59.254  
sl  53.113  50.471  50.492  57.561  56.850  58.646  59.421  
T78.3  plain  51.177  54.073  52.614  54.575  58.973  59.449  63.647 
mc  48.308  50.692  51.378  53.746  56.715  55.334  58.111  
lr  51.762  53.238  55.155  56.952  59.213  61.924  66.209  
sl  50.514  51.851  54.273  53.531  56.337  59.759  64.728  
T78.4  plain  51.937  54.675  58.151  56.976  57.124  56.506  59.162 
mc  52.176  55.459  55.644  54.720  53.475  55.145  56.218  
lr  54.434  58.122  54.655  57.019  56.627  58.496  58.549  
sl  49.903  49.876  47.081  50.235  54.326  56.986  58.444  
T80.1  plain  78.962  77.270  84.589  83.097  86.832  85.994  84.189 
mc  73.207  72.912  73.492  78.945  74.963  79.334  74.592  
lr  78.930  80.478  83.233  83.739  84.677  84.879  85.803  
sl  75.125  76.542  79.956  81.759  81.413  83.918  82.726  
T80.8  plain  80.744  81.769  83.486  83.704  85.648  86.136  86.751 
mc  71.968  71.717  76.488  77.570  78.308  78.182  80.725  
lr  81.119  81.211  85.431  85.210  86.319  86.715  86.627  
sl  80.537  80.973  84.689  85.569  86.332  86.875  87.351  
T88.6  plain  61.998  60.731  62.299  62.830  65.113  63.499  62.249 
mc  58.342  58.993  63.476  57.263  59.652  59.568  61.812  
lr  60.283  60.159  62.019  63.256  60.864  61.744  64.120  
sl  57.623  57.248  57.450  57.417  58.190  57.672  60.403  
T88.7  plain  58.971  60.742  62.399  62.239  62.552  63.803  65.352 
mc  56.875  57.931  62.472  60.892  63.013  62.435  64.209  
lr  60.580  60.462  62.803  61.543  62.588  64.137  66.276  
sl  60.297  59.862  61.498  61.811  62.259  64.001  65.139 
Finally, an advantage of the proposed framework is that its three phases are method agnostic, i.e., the framework allows for exchanging the subtasks of each phase. For instance, one could use different machine learning models, different normalization techniques, symbolic approximations, or distance measures. As such, the proposed framework provides an important building block for future exploitation with one important avenue for future work being the evaluation into alternative choices which may affect the predictive performance of the learned representations.
Notes
Acknowledgements
We thank Birgitta Norstedt Wikner (MD, PhD), medical expert in clinical pharmacology and Inger Öhman (MD, PhD), medical expert in clinical pharmacology and clinical chemistry from Karolinska Institute, Centre for Pharmacoepidemiology, for the valuable feedback and consultation regarding the evaluation of our findings. Both has given written permission to acknowledge.
Funding
IK and PP were supported by grants from the Stockholm County Council (SUSLL), which assisted in method development. PP was also supported by the VR201603372 Swedish Research Council Starting Grant, which provided the basis for the design of the study, method development, analysis and access to medical data. Both funding bodies assisted in preparing the manuscript.
Availability of data and materials
Due to the private nature of the data employed in this study, the electronic health record data cannot be openly published. Researchers with interest can gain access to the data under strict privacy constraints. A software implementation of the framework is available at https://github.com/fbagattini/sparse_symbolic_representation. The system requires Python and Numpy to function. Both are openly available online.
Authors’ contributions
FB developed the method, performed the analysis, and drafted the manuscript. IK and PP participated in the design of the method, performed the analysis, and helped to draft the manuscript. JR helped draft the manuscript and provided clinical insights. All authors read and approved the final manuscript.
Ethics approval and consent to participate
The use of the HealthBank has been approved by the Regional Ethical Review Board in Stockholm (permission number 2012/83431/5).
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary material
References
 1.Dalianis H, Hassel M, Henriksson A, Skeppstedt M. Stockholm EPR Corpus: a clinical database used to improve health care. In: Swedish Language Technology Conference.2012. p. 17–8.Google Scholar
 2.Karlsson I, Boström H. Predicting adverse drug events using heterogeneous event sequences. In: Healthcare Informatics (ICHI), 2016 IEEE International Conference On. IEEE: 2016. p. 356–62.Google Scholar
 3.Aspden P BJ, Wolcott J LRC. Generalized random shapelet forests. In: Committee on Identifying and Preventing Medication Errors.2007.Google Scholar
 4.Freeman R, Moore L, García Álvarez L, Charlett A, Holmes A. Advances in electronic surveillance for healthcareassociated infections in the 21st century: a systematic review. J Hosp Infect. 2013; 84(2):106–19.PubMedCrossRefGoogle Scholar
 5.Henriksson A, Zhao J, Boström H, Dalianis H. Modeling electronic health records in ensembles of semantic spaces for adverse drug event detection. In: IEEE International Conference on Bioinformatics and Biomedicine.2015. p. 343–50.Google Scholar
 6.Cao H, Markatou M, Melton GB, Chiang MF, Hripcsak G. Handling temporality of clinical events for drug safety surveillance. In: AMIA Annual Symposium Proceedings, vol. 2005. American Medical Informatics Association: 2005. p. 106–110.Google Scholar
 7.Ouchi K, Lindvall C, Chai PR, Boyer EW. Machine learning to predict, detect, and intervene older adults vulnerable for adverse drug events in the emergency department. J Med Toxicol. 2018; 14(3):248–52. https://doi.org/10.1007/s1318101806673.PubMedCrossRefGoogle Scholar
 8.Hersh WR. Adding value to the electronic health record through secondary use of data for quality assurance, research, and surveillance. Clin Pharmacol Ther. 2007; 81:126–8.CrossRefGoogle Scholar
 9.Weiskopf NG, Hripcsak G, Swaminathan S, Weng C. Defining and measuring completeness of electronic health records for secondary use. J Biomed Inform. 2013; 46(5):830–6.PubMedCrossRefGoogle Scholar
 10.Safran C, Bloomrosen M, Hammond WE, Labkoff S, MarkelFox S, Tang PC, Detmer DE, et al.Toward a national framework for the secondary use of health data: an american medical informatics association white paper. J Am Med Inform Assoc. 2007; 14(1):1–9.PubMedPubMedCentralCrossRefGoogle Scholar
 11.Jensen PB, Jensen LJ, Brunak S. Mining electronic health records: towards better research applications and clinical care. Nat Rev Genet. 2012; 13(6):395–405.PubMedCrossRefGoogle Scholar
 12.Uzuner Ö, Goldstein I, Luo Y, Kohane I. Identifying patient smoking status from medical discharge records. J Am Med Inform Assoc. 2008; 15(1):14–24.PubMedPubMedCentralCrossRefGoogle Scholar
 13.Honigman B, Lee J, Rothschild J, Light P, Pulling R, Yu T, Bates D. Using computerized data to identify adverse drug events in outpatients. J Am Med Inform Assoc. 2001; 8(3):254–66.PubMedPubMedCentralCrossRefGoogle Scholar
 14.Henriksson A, Kvist M, Dalianis H, Duneld M. Identifying adverse drug event information in clinical notes with distributional semantic representations of context. J Biomed Inform. 2015; 57:333–49.PubMedCrossRefGoogle Scholar
 15.Pakhomov SV, Buntrock J, Chute CG. Prospective recruitment of patients with congestive heart failure using an adhoc binary classifier. J Biomed Inform. 2005; 38(2):145–53.PubMedCrossRefGoogle Scholar
 16.Norén GN, Bergvall T, Ryan PB, Juhlin K, Schuemie MJ, Madigan D. Empirical performance of the calibrated selfcontrolled cohort analysis within temporal pattern discovery: Lessons for developing a risk identification and analysis system. Drug Saf. 2013; 36(1):107–21. https://doi.org/10.1007/s402640130095x.CrossRefGoogle Scholar
 17.Singh A, Nadkarni G, Gottesman O, Ellis SB, Bottinger EP, Guttag JV. Incorporating temporal ehr data in predictive models for risk stratification of renal function deterioration. J Biomed Inform. 2015; 53:220–8.PubMedCrossRefGoogle Scholar
 18.Zhao J, Henriksson A, Kvist M, Asker L, Boström H. Handling temporality of clinical events for drug safety surveillance. In: AMIA Annual Symposium Proceedings. American Medical Informatics Association: 2015. p. 1371.Google Scholar
 19.Zhao J. Temporal weighting of clinical events in electronic health records for pharmacovigilance. In: IEEE International Conference on Bioinformatics and Biomedicine.2015. p. 375–81.Google Scholar
 20.Zhao J, Henriksson A, Asker L, Boström H. Detecting adverse drug events with multiple representations of clinical measurements. In: IEEE International Conference on Bioinformatics and Biomedicine.2014. p. 536–43.Google Scholar
 21.Augusto JC. Temporal reasoning for decision support in medicine. Artif Intell Med. 2005; 33(1):1–24.PubMedCrossRefGoogle Scholar
 22.Chakrabarti K, Keogh E, Mehrotra S, Pazzani M. Locally adaptive dimensionality reduction for indexing large time series databases. ACM Trans Database Syst. 2002; 27(2):188–228.CrossRefGoogle Scholar
 23.Lin J, Keogh E, Lonardi S, Chiu B. A symbolic representation of time series, with implications for streaming algorithms. In: Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery. ACM: 2003. p. 2–11.Google Scholar
 24.Lin J, Keogh E, Wei L, Lonardi S. Experiencing sax: a novel symbolic representation of time series. Data Min Knowl Disc. 2007; 15(2):107–44.CrossRefGoogle Scholar
 25.Agrawal R, Faloutsos C, Swami A. Efficient Similarity Search in Sequence Databases. In: Foundations of Data Organization and Algorithms. Berlin Heidelberg: Springer: 1993.Google Scholar
 26.Chan KP, Fu AWC. Efficient time series matching by wavelets. In: Proceedings of 15th International Conference on Data Engineering. IEEE: 1999. p. 126–33.Google Scholar
 27.Ye L, Keogh E. Time series shapelets: a new primitive for data mining. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM: 2009. p. 947–56.Google Scholar
 28.Hills J, Lines J, Baranauskas E, Mapp J, Bagnall A. Classification of time series by shapelet transformation. Data Min Knowl Disc. 2014; 28(4):851–81.CrossRefGoogle Scholar
 29.Karlsson I, Papapetrou P, Boström H. Generalized random shapelet forests. Data Min Knowl Disc. 2016; 30(5):1053–85.CrossRefGoogle Scholar
 30.Hielscher T, Spiliopoulou M, Völzke H, Kühn J. Mining longitudinal epidemiological data to understand a reversible disorder. In: International Symposium on Intelligent Data Analysis.2014. p. 120–30.Google Scholar
 31.Hielscher T, Spiliopoulou M, Völzke H, Papapetrou P. Discovering, selecting and exploiting feature sequence records of study participants for the classification of epidemiological data on hepatic steatosis.2017.Google Scholar
 32.Zhao J, Papapetrou P, Asker L, Boström H. Learning from heterogeneous temporal data in electronic health records. J Biomed Inform. 2017; 65:105–19.PubMedCrossRefGoogle Scholar
 33.Eriksson R, Werge TM, Jensen LJ, Brunak S. Dosespecific adverse drug reaction identification in electronic patient records: Temporal data mining in an inpatient psychiatric population. In: Drug Safety.2014.Google Scholar
 34.Melton GB, Hripcsak G. Automated detection of adverse events using natural language processing of discharge summaries. J Am Med Inform Assoc. 2005; 12(4):448–57.PubMedPubMedCentralCrossRefGoogle Scholar
 35.Eriksson R, Jensen PB, Frankild S, Jensen LJ, Brunak S. Dictionary construction and identification of possible adverse drug events in danish clinical narrative text. J Am Med Inform Assoc. 2013; 20(5):947–53.PubMedPubMedCentralCrossRefGoogle Scholar
 36.Harpaz R, Haerian K, Chase HS, Friedman C. Mining electronic health records for adverse drug effects using regression based methods. In: the 1st ACM International Health Informatics Symposium. ACM: 2010. p. 100–107.Google Scholar
 37.Zhao J, Henriksson A, Asker L, Boström H. Predictive modeling of structured electronic health records for adverse drug event detection. BMC Med Informat Decis Making. 2015; 15(Suppl 4):1.Google Scholar
 38.Park MY, Yoon D, Lee K, Kang SY, Park I, Lee SH, Kim W, Kam HJ, Lee YH, Kim JH, Park RW. A novel algorithm for detection of adverse drug reaction signals using a hospital electronic medical record database. Pharmacoepidemiol Drug Saf. 2011; 20(6):598–607. https://doi.org/10.1002/pds.2139. https://onlinelibrary.wiley.com/doi/pdf/10.1002/pds.2139.PubMedCrossRefGoogle Scholar
 39.Larrañaga P, Calvo B, Santana R, Bielza C, Galdiano J, Inza I, Lozano JA, Armañanzas R, Santafé G, Pérez A, et al.Machine learning in bioinformatics. Brief Bioinform. 2006; 7(1):86–112.PubMedCrossRefGoogle Scholar
 40.Haneuse S, Daniels M. A general framework for considering selection bias in ehrbased studies: what data are observed and why?. eGEMs. 2016; 4(1):1–17.CrossRefGoogle Scholar
 41.Johnson SG, Speedie S, Simon G, Kumar V, Westra BL. A data quality ontology for the secondary use of ehr data. In: AMIA Annual Symposium Proceedings, vol. 2015. American Medical Informatics Association.2015. p. 1937.Google Scholar
 42.Li X, Shen C, Li L. Effectiveness research using electronic health records (ehrs). In: Wiley StatsRef: Statistics Reference Online: 2016.Google Scholar
 43.Chakrabarti K, Keogh E, Mehrotra S, Pazzani M. Locally adaptive dimensionality reduction for indexing large time series databases. ACM Trans Database Syst. 2002; 27(2):188–228.CrossRefGoogle Scholar
 44.Sant’Anna A, Wickström N. Symbolization of timeseries: An evaluation of sax, persist, and aca. In: 4th International Congress on Image and Signal Processing, vol. 4. IEEE: 2011. p. 2223–8.Google Scholar
 45.Levenshtein V. Binary codes capable of correcting spurious insertions and deletions of ones. Probl Inf Transm. 1965; 1(1):8–17.Google Scholar
 46.Ye L, Keogh E. Time series shapelets: a new primitive for data mining. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM: 2009. p. 947–56.Google Scholar
 47.Rakthanmanon T, Keogh E. Fast shapelets: A scalable algorithm for discovering time series shapelets. In: Proceedings of the 2013 SIAM International Conference on Data Mining. SIAM: 2013. p. 668–76.Google Scholar
 48.Dalianis H, Henriksson A, Kvist M, Velupillai S, Weegar R. Health bank  a workbench for data science applications in healthcare. CAiSE2015 Industry Track colocated with 27th Conference on Advanced Information Systems Engineering (CAiSE  CEUR), International Conference on Advanced Information Systems. 2015; 1381:1–18.Google Scholar
 49.Stausberg J, Hasford J. Drugrelated admissions and hospitalacquired adverse drug events in germany: a longitudinal analysis from 2003 to 2007 of icd10coded routine data. BMC Health Serv Res. 2011; 11(1):134.PubMedPubMedCentralCrossRefGoogle Scholar
 50.Pontet F, Petersen UM, FuentesArderiu X, Nordin G, Bruunshuus I, Ihalainen J, Karlsson D, Forsum U, Dybkaer R, Schadow G, Kuelpmann W, Férard G, Kang D, McDonald CJ, Hill G. Clinical laboratory sciences data transmission: The npu coding system. Stud Health Technol Inform. 2009; 150:265–9.PubMedPubMedCentralGoogle Scholar
 51.Breiman L. Random forests. Mach Learn. 2001; 45(1):5–32.CrossRefGoogle Scholar
 52.Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (roc) curve. Radiology. 1982; 143(1):29–36.PubMedCrossRefGoogle Scholar
 53.Bradley AP. The use of the area under the roc curve in the evaluation of machine learning algorithms. Pattern Recognit. 1997; 30(7):1145–59.CrossRefGoogle Scholar
 54.Ferri IC, Flach P, Orallo J, Lachice N. ECAI’2004 First Workshop on ROC Analysis in AI. In: European Conference on Artificial Intelligence: 2004.Google Scholar
 55.Fawcett T. An introduction to roc analysis. Pattern Recogn Lett. 2006; 27(8):861–74.CrossRefGoogle Scholar
 56.Demšar J. Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res. 2006; 7:1–30.Google Scholar
 57.Wilcoxon F. Individual comparisons by ranking methods. Biom Bull. 1945; 1(6):80–3.CrossRefGoogle Scholar
 58.Bornstein S, Allolio B, Arlt W, et al.Diagnosis and treatment of primary adrenal insufficiency: An endocrine society clinical practice guideline. J Clin Endocrinol Metab. 2016; 101(2):364–89.PubMedCrossRefGoogle Scholar
 59.Verma R, Vasudevan B, Pragasam V. Severe cutaneous adverse drug reactions. Med J Armed Forces. 2013; 69(4):375–83.CrossRefGoogle Scholar
 60.Fernyhough P, Nigel A C. Abnormal calcium homeostasis in peripheral neuropathies. Cell calcium 47.2. 2010; 47(2):130–9.CrossRefGoogle Scholar
 61.Sim M, Kim D, Yoon J, Park D, Kim Y. Assessment of peripheral neuropathy in patients with rheumatoid arthritis who complain of neurologic symptoms. Ann Rehabil Med. 2014; 38(2):249–55.PubMedPubMedCentralCrossRefGoogle Scholar
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.