1 Introduction

Sequences of temporal intervals are defined as ordered sets of events occurring over time, where each event has a time duration and may co-occur with other events. As a result, several temporal relations are possible between pairs of events, such as one event overlapping another or two events starting concurrently with one ending before the other. Such sequences, also known as e-sequences, can be found in a variety of application domains, including sign language transcription (Papapetrou et al. 2009), human activity recognition and monitoring (Uddin and Uddiny 2015), music classification (Pachet et al. 1996), and predicting clinical outcomes from medical records (Kosara and Miksch 2001; Moskovitch and Shahar 2015a).

An example of an e-sequence, taken from the healthcare domain, is depicted in Fig. 1. The example e-sequence contains six events describing an Adverse Drug Reaction (ADR) caused by the use of the medication “procainamide” on a patient suffering from arrhythmia. We observe that the patient underwent an episode of arrhythmia (first event) before being hospitalized (second event) and administered procainamide (third event). A second episode of arrhythmia occurred shortly after (fourth event) and another dosage of procainamide (fifth event) was provided to the patient. Eventually, the patient developed ventricular tachycardia (last event), which is an ADR in relation to procainamide.

Fig. 1

Example of an e-sequence describing an Adverse Drug Reaction (ADR) of ventricular tachycardia following the administration of procainamide to a patient suffering from arrhythmia. The e-sequence consists of six event intervals, each corresponding to a medical event, and the x-axis corresponds to the hours elapsed since the start of the first arrhythmia event

Earlier work in the area of e-sequence classification has mostly focused on distance-based and feature-based classifiers.

For the case of distance-based classifiers, two state-of-the-art distance measures have been developed, i.e., Artemis (Kostakis et al. 2011) and IBSM (Kotsifakos et al. 2013). The first measure quantifies the distance between two e-sequences by measuring the fraction of temporal relations shared between them using a bipartite graph mapping, while ignoring the time duration of the individual events. By contrast, the second measure maps the e-sequences to vectors, with each time point described by a binary vector indicating the active and non-active events at that time point. Despite the promising classification results obtained by both measures when used in a k-NN formulation, this family of classifiers is hampered by the fact that only global properties are exploited as classification features, while local temporal properties are ignored, potentially leading to detrimental effects on predictive performance. Moreover, an extension of IBSM, called ABIDE, has been proposed for subsequence matching in event-interval sequences; this distance measure is employed by the framework proposed in this paper.

In the case of feature-based classifiers for e-sequences, the most conventional solution is to extract patterns of temporal interval relation pairs, defined based on Allen’s temporal logic (Allen 1983), and use them as potential classification features (Bornemann et al. 2016) along with additional static features. This idea falls within the concept of temporal abstractions of multi-variate time series, where the main objective is to map each time series channel to an interval and then employ pattern extraction methods, such as the Karma–Lego framework (Moskovitch and Shahar 2015a) or its follow-up variants (Moskovitch and Shahar 2015b; Moskovitch et al. 2015; Batal et al. 2013; Patel et al. 2008; Karlsson and Boström 2016). The latter are, however, not direct competitors for our problem, as their target data space is multi-variate time series and not event-interval sequences. The main drawback of existing feature-based classifiers is that the temporal abstractions they employ only consider the relation types between the involved event intervals, while ignoring their actual time duration. This can be a severe limitation in application domains where duration matters. For example, in healthcare, the duration of an overlap between two medications could have a detrimental effect on the probability of the occurrence of an adverse drug event.

In this paper, we address the deficiencies of both distance-based and feature-based classifiers by (1) considering both global and local class-predictive features, and (2) taking into account both the event relation types in these discriminant features and their time duration.

Fig. 2

A database \(\mathcal {D}\) of 5 interval sequences of max length 8 with alphabet \(\varSigma =\{A,B, C\}\). Sequences \(S_1, S_3, S_5\) are classified as “−”, while \(S_2, S_4\) are classified as “+”. An example of a class-predictive e-let is highlighted in red. We observe that the pattern A followed by C occurs in all five sequences, however, with different time durations. The indicated e-let distinguishes the positive class (‘+’) from the negative class (‘−’) as it only occurs in sequences \(S_2\) and \(S_4\) (Color figure online)

1.1 Example

We illustrate the aforementioned deficiency with a simple example. Consider the five e-sequences depicted in Fig. 2, with event labels defined from a given alphabet \(\varSigma =\{A,B,C\}\). Assume that the sequences are classified as either positive or negative. That is, \(S_2, S_4\) are positive examples and \(S_1, S_3, S_5\) are negative examples. Let us now consider a simple temporal pattern A followed by C. Note that with the term temporal pattern we refer to any combination of event labels described by their pair-wise temporal relations. Observe that this pattern occurs in all five sequences. However, for the positive class both event intervals A and C have a shorter time duration than those present in the negative class. Hence, any feature-based method that only considers the relation type between the intervals, ignoring their time duration, will be unable to identify the class-separation power of A followed by C.

Consider now a more descriptive representation of the same pattern that contains the event labels as well as their start and end times, i.e., \(\mathcal {P} = ((A, 0, 1), (C, 2, 3))\). We refer to this representation as an e-let. Using this representation, we can capture both the temporal relation between A and C and the time duration of each interval. Hence, if we compute the similarity of \(\mathcal {P}\) in terms of relation type (in our case followed by) against all sequences, by counting the number of times this relation occurs, we can easily observe that the similarity score of \(\mathcal {P}\) is the same for all five sequences; \(\mathcal {P}\) occurs once in each of them, resulting in an information gain of 0. On the other hand, if we also consider the time duration of the intervals, we can see that \(\mathcal {P}\) has a higher similarity to sequences \(S_2\) and \(S_4\) (due to the shorter time duration of A and C) compared to \(S_1\), \(S_3\) and \(S_5\). Assuming the similarity function used in the latter case is ABIDE (Kostakis and Papapetrou 2017), for each sequence we obtain the following similarity scores: \(S_2=1,S_4=1, S_1=0.67, S_3=0.33, S_5=0.33\). These scores yield 4 possible attribute split points with, e.g., a decision tree classifier, out of which only \(\frac{1+0.67}{2}\) separates the classes. This separation achieves the highest attainable information gain. Hence, employing a similarity measure that takes into account both the relation type and the time duration of the event intervals is capable of identifying temporal patterns with higher class-separation power.
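To make the split-point computation concrete, the following minimal Python sketch (our own illustrative code, using plain Shannon information gain; the function names are not from the paper) evaluates the candidate split points over the similarity scores of the running example:

```python
from math import log2

def entropy(labels):
    """Shannon entropy of a multiset of class labels."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

def best_split_gain(scores, labels):
    """Scan the candidate split points between consecutive distinct
    sorted scores and return the highest information gain."""
    pairs = sorted(zip(scores, labels))
    base, n = entropy(labels), len(pairs)
    best = 0.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can fall between two equal scores
        left = [c for _, c in pairs[:i]]
        right = [c for _, c in pairs[i:]]
        gain = base - len(left) / n * entropy(left) - len(right) / n * entropy(right)
        best = max(best, gain)
    return best

# ABIDE similarity scores of P against S1..S5, and the class of each sequence
scores = [0.67, 1.0, 0.33, 1.0, 0.33]
labels = ['-', '+', '-', '+', '-']
# The split (1 + 0.67) / 2 separates the classes perfectly, so the best
# gain equals the class entropy of the dataset
print(best_split_gain(scores, labels))
```

The perfect split attains the maximum possible (unnormalized) gain, while the relation-type-only representation, which gives every sequence the same score, attains a gain of 0.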

1.2 Contributions

The main contributions of our paper include a novel framework for event-interval sequence classification, a novel concept defining temporal features for this task, and a thorough empirical evaluation on 20 real-world datasets. More concretely:

  • We introduce a novel, generalized framework, called SMILE, building upon the STIFE framework introduced by Bornemann et al. (2016), for the classification of event-interval sequences. The key novelty of SMILE is that it reduces the complexity of event-interval sequences by performing four levels of temporal abstraction, starting from simple, global abstraction features and gradually moving to more complex local class-predictive temporal features. These features take into consideration both the temporal relation types between the event-intervals and their time duration.

  • We introduce and define a new concept for temporal interval sequence classification, which we refer to as e-lets. This primitive concept describes class-predictive subsequences of interval-based events, and it is one of the key abstraction features for our framework.

  • We present an extensive experimental evaluation of the proposed framework using 4 different classification models on 6 commonly used benchmark datasets, as well as on 14 datasets from the medical domain corresponding to electronic patient records of adverse drug reactions. The proposed framework achieves statistically significantly improved performance in terms of AUC over its competitors.

The remainder of this paper is organized as follows: in Sect. 2 we present the related work in the area of temporal interval sequence classification, while in Sect. 3 we formalize the classification problem studied in this paper along with the required technical background and definitions. Moreover, in Sect. 4, we present SMILE, the proposed framework of this paper. In Sect. 5, we provide and discuss the experimental evaluation and findings, while Sect. 6 concludes the paper and introduces directions for future work.

2 Related work

Research into sequences of temporal intervals has attracted attention within the areas of data mining and databases, with original motivations focusing on simplifying complex temporal data while minimizing the loss of information. Earlier work, such as Lin (2003), demonstrated a method to mine maximal frequent intervals, yet in the process information loss increases as the different dimensions of the intervals are discarded. Another common form of simplification is the direct mapping of sequences of temporal intervals to temporally ordered events; however, such a simplification does not consider the actual duration of the intervals, as seen in Giannotti et al. (2006). Various Apriori-based techniques, such as those of Höppner and Klawonn (2001), Mooney and Roddick (2004), and Laxman et al. (2007), exist for the discovery of temporal patterns, episodes, and association rules on interval-based event sequences. In addition, various candidate generation techniques employ approaches to reduce the exponential complexity of the mining problem, such as those of Winarko and Roddick (2007) and Papapetrou et al. (2005, 2009).

Recent similarity measures for sequences of temporal intervals have been used as tools in this data domain for similarity search, clustering, and k-NN classification. As mentioned, for the k-NN family of classifiers, two state-of-the-art distance measures have been developed: Artemis, introduced by Kostakis et al. (2011), and IBSM, introduced by Kotsifakos et al. (2013). More recently, a state-of-the-art similarity search framework for the e-sequence domain, known as ABIDE, has been introduced by Kostakis and Papapetrou (2017). ABIDE supports accurate similarity search in sequences of temporal intervals with no false dismissals at a relatively low computational cost, which is achieved by combining lower bounds with early-abandoning methods. Importantly, ABIDE should be preferred over Artemis as a more informative measure, since ABIDE takes into account both the absolute values of the interval durations and the time between intervals, while Artemis does not. ABIDE should also be preferred over IBSM, as the latter may result in false dismissals.

Building upon the foundations of these previous works, a classification framework known as STIFE has been introduced in Bornemann et al. (2016), achieving state-of-the-art performance in temporal interval sequence classification. STIFE exploits such sequences in a variety of manners, employing static features, class-defined medoid features, as well as class-distinctive temporal relation pairs. However, the main limitation of STIFE is its inability to exploit information regarding event durations and relational information over a broad range of event types, as its temporal features only consider the occurrence of single event pair relations. Hence, potentially class-discriminatory temporal arrangements of events are ignored, such as those spanning combinations across the entire set of event labels and in which the duration of the events may be highly relevant for the classification task at hand. In a different line of research targeting classification of multi-variate time series using temporal abstractions, the Karma–Lego framework (Moskovitch and Shahar 2015a) and similar variants (Moskovitch and Shahar 2015b; Moskovitch et al. 2015; Batal et al. 2013; Patel et al. 2008) are employed on healthcare temporal measurements. The key objective of Karma–Lego and its variants is the efficient discovery of frequent patterns of interval-based events of any size, which can then be employed as features for any off-the-shelf predictive model. The temporal features are constructed by enumerating the set of possible pair-wise temporal relations that may occur between the event labels contained in the training set, using the seven types of temporal relations defined in Allen’s temporal logic (Allen 1983). If the e-sequences in the training set are defined over m possible event labels, then, in order to account for all possible relations between event labels, the feature space comprises \(7\cdot \frac{m(m-1)}{2}\) features.
The corresponding feature values are determined by the number of occurrences of each temporal relation in a given e-sequence.

Despite the competitive predictive performance of these approaches, their main drawback is the fact that they only consider the types of temporal relations occurring between the involved events, while ignoring the actual duration of these relations. In many practical scenarios, such as in healthcare, the duration of, e.g., an overlap, or a gap, may convey critical value in terms of acting as a class-discriminant feature and predicting a future event of interest.

In addition, there exist alternative approaches that extract sequential patterns as a means of building meaningful temporal features for classifiers. Such methods include SPAM (Ayres et al. 2002) for mining traditional sequential patterns, CloSpan for mining closed sequential patterns (Yan et al. 2003), GoKrimp for mining compressing sequential patterns (Lam et al. 2014), and SCIP for building classifiers based on mined interesting patterns (Zhou et al. 2015). As such, we consider these approaches competitors for the classification of sequences of temporal intervals.

3 Problem setting

The problem studied in this paper is the classification of sequences of temporal intervals. In this section, we introduce the problem by providing the necessary definitions followed by the problem formulation.

Let \(\varSigma = \{e_1, \ldots , e_m\}\) be an alphabet of m event labels. An event-interval is an event that occurs over a time interval, while an ordered multi-set of event-intervals defines an event-interval sequence. Next, we define these two terms more formally.

Definition 1

(Event-interval) An event-interval is formally defined as a triplet \(S = \langle e,t_{s},t_{e}\rangle \), where \(S.e \in \varSigma \) is the event label for that time interval, and \(S.t_{s},S.t_{e}\) correspond to the start and end times of S, respectively. Naturally, it holds that \(S.t_{s}\le S.t_{e}\), where the equality is satisfied when the event is instantaneous.

We say that an event-interval \(S = \langle e,t_{s},t_{e}\rangle \) is active during its defined time span, i.e., from \(t_{s}\) to \(t_{e}\).

Definition 2

(e-sequence) A sequence of event-intervals, also known as event-interval sequence, or e-sequence, denoted as \(\mathcal {S} = \{S_1,\ldots ,S_n\}\), is an ordered multi-set of n event-intervals. The temporal order of the event-intervals in \(\mathcal {S}\) is ascending based on their start time and in the case of ties it is descending based on their end time. If ties still exist, the event-intervals are sorted alphabetically.

The length of an e-sequence \(\mathcal {S}\) is defined as the time-span of the e-sequence, i.e., \(|\mathcal {S}| = S_n.t_{e}-S_1.t_{s}\), while the size of \(\mathcal {S}\) is the number of event-intervals in the e-sequence. For example, the e-sequence depicted in Fig. 1 is of length 10 and its size is 6.
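For illustration, the ordering of Definition 2 can be sketched in Python as follows (the `EventInterval` type and function names are our own illustrative constructs, not part of the paper):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EventInterval:
    e: str    # event label from the alphabet Sigma
    t_s: int  # start time, with t_s <= t_e
    t_e: int  # end time

def sort_esequence(intervals):
    """Order event-intervals by ascending start time, breaking ties by
    descending end time and then alphabetically by label (Definition 2)."""
    return sorted(intervals, key=lambda s: (s.t_s, -s.t_e, s.e))

seq = sort_esequence([
    EventInterval('B', 0, 2),
    EventInterval('A', 0, 5),
    EventInterval('C', 3, 7),
])
# A and B start together, but A ends later, so A comes first
print([s.e for s in seq])  # ['A', 'B', 'C']
```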

Let \(\mathcal {D} = \{\mathcal {S}_1, \ldots , \mathcal {S}_N\}\) define an e-sequence dataset, i.e., a collection of e-sequences. Moreover, let us assume that each e-sequence \(\mathcal {S}_i \in \mathcal {D}\) is assigned with a class label \(c_i\in \mathcal {C}\), with \(\mathcal {C}\) being a predefined set of class labels. Hence, let \(\mathcal {X}=\{\{\mathcal {S}_1,c_1\}, \ldots , \{\mathcal {S}_N,c_N\} \}\) denote a labelled e-sequence dataset that is defined over \(\mathcal {D}\) and \(\mathcal {C}\).

The problem studied in this paper is to learn a classification model f from \(\mathcal {X}\) that can correctly assign a new (previously unseen) e-sequence \(\mathcal {S}\) with a class label from \(\mathcal {C}\).

Problem 1

(e-sequence classification) Given a labeled e-sequence dataset \(\mathcal {X}\), with each e-sequence assigned a class label from \(\mathcal {C}\), we want to learn a mapping function \(f_{\mathcal {X},\mathcal {C}}: \mathcal {S} \rightarrow \mathcal {C}\) defined over \(\mathcal {X}\) and \(\mathcal {C}\), with \(\mathcal {S}\in \mathcal {X}\), such that for an independent labeled dataset of previously unseen e-sequences \(\mathcal {X}'\), the expected classification loss \(E_{(\mathcal {S}_i,c_i) \in \mathcal {X}'}[\mathcal {L}(c_i,f(\mathcal {S}_i))]\) is minimized. The classification loss function \(\mathcal {L}\) is defined as follows:

$$\begin{aligned} \mathcal {L}_{\mathcal {X}'}\left( c_i, f\left( \mathcal {S}_i\right) \right) = {\left\{ \begin{array}{ll} 0 &{}\quad \text {if } f\left( \mathcal {S}_i\right) = c_i \text {, with } \mathcal {S}_i \in \mathcal {X}' \ ,\\ 1 &{}\quad \text {otherwise. } \end{array}\right. } \end{aligned}$$
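The empirical average of this 0/1 loss over a labeled set can be sketched as follows (a minimal illustration; the function name is ours):

```python
def zero_one_loss(y_true, y_pred):
    """Empirical mean of the 0/1 classification loss over a labeled set."""
    return sum(int(c != p) for c, p in zip(y_true, y_pred)) / len(y_true)

# One of the three e-sequences is misclassified
print(zero_one_loss(['+', '-', '+'], ['+', '+', '+']))
```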

4 SMILE: a generalized temporal abstraction framework for classifying sequences of temporal intervals

We introduce SMILE, a generalized framework for classification of sequences of temporal intervals, that performs four levels of temporal abstraction. Towards this end, the following four types of abstraction features are considered, the first three of which having also been employed by the STIFE framework (Bornemann et al. 2016):

  • Static corresponding to a set of static features providing global, aggregate statistics of an e-sequence (Sect. 4.1).

  • Medoids corresponding to a set of similarity values of an e-sequence compared to class-medoid e-sequences from the training set (Sect. 4.2).

  • Interval relation pairs corresponding to pairs of temporal intervals covering all combinations of pair-wise temporal relations (as defined by Allen’s temporal logic (Allen 1983)) between all pairs of event labels in the given alphabet \(\varSigma \) (Sect. 4.3).

  • E-lets corresponding to class-predictive subsequences of event-intervals with high utility, called e-lets (Sect. 4.4).

Next, we introduce the above feature types starting from simple static metrics and progressing to more complex temporal abstractions, exploiting both global and local temporal information from the labeled training dataset of e-sequences.

4.1 Static features

For a given e-sequence \(\mathcal {S} = \{S_1,\ldots ,S_n\}\), as defined in Sect. 3, we consider 14 static features. For each feature described below, we also compute its corresponding value using the example e-sequence depicted in Fig. 1.

  (i) Duration: the e-sequence length \(|\mathcal {S}|\); in our example, \(\mathrm {duration}(\mathcal {S})=10\).

  (ii) Size: the total number of event-intervals in \(\mathcal {S}\); in our example, \(\mathrm {size}(\mathcal {S})=6\).

  (iii) Dim_count: the number of unique event labels in \(\mathcal {S}\), i.e.,

    $$\begin{aligned} \mathrm {dim\_count}(\mathcal {S}) = | \{S.e | S \in \mathcal {S} \} | , \end{aligned}$$

    with \(\mathrm {dim\_count}(\mathcal {S})=4\).

  (iv) Start: the start time of the first event-interval in \(\mathcal {S}\), i.e., \(S_1.t_{s}\); in our example, \(\mathrm {start}(\mathcal {S})=0\).

  (v) Majority: the alphabet label \(e^{*}\in \varSigma \) that has the highest occurrence frequency in \(\mathcal {S}\); ties are resolved randomly. That is

    $$\begin{aligned} e^{*} = \mathrm {max\_freq}(\mathcal {S}) = \text {arg}\,\max \limits _{e_j \in \varSigma }\ \sum _{i=1}^n \mathbf {1}(e_j = S_i.e). \end{aligned}$$

    In our example, \(e^{*}= \) ’arrhythmia’.

  (vi) Density: the sum of interval duration values in \(\mathcal {S}\), i.e.,

    $$\begin{aligned} \mathrm {density}(\mathcal {S}) = \sum _{i=1}^n \left( S_i.t_{e}- S_i.t_{s}\right) \ , \end{aligned}$$

    with \(\mathrm {density}(\mathcal {S})= 20\), for our example.

  (vii) \(\mu \)Density: the mean density of event-intervals in \(\mathcal {S}\); in our example, \(\mathrm {\mu density}(\mathcal {S})= 3.33\).

  (viii) Concurrency: the maximum number of event-intervals that are concurrently active in \(\mathcal {S}\). Let \(\mathcal {V}(\mathcal {S})=\{V_1, \ldots , V_{|\mathcal {S}|}\}\) be the binary representation of \(\mathcal {S}\), where each \(V_j\) is an n-dimensional binary vector, with n being the number of event-intervals in \(\mathcal {S}\). Moreover, \(V_j[i]=1\) if event-interval \(S_i\in \mathcal {S}\) is active at time point j and \(V_j[i]=0\), otherwise (similar to the formulation in Kotsifakos et al. (2013)). Then, the maximum number of concurrently active event-intervals in \(\mathcal {S}\) is

    $$\begin{aligned} \mathrm {concurrent}(\mathcal {S})^* = \max _{j=1}^{|\mathcal {S}|} \sum _{i=1}^{n} V_j[i] \ \end{aligned}$$

    In our example, \(\mathrm {concurrent}(\mathcal {S})^* = 3\).

  (ix) Max_concurrency: the time duration of the period with the highest number of concurrent intervals in \(\mathcal {S}\), i.e., \(\mathrm {concurrent\_dur}(\mathcal {S})\) = 1.

  (x) \(\mu \)Concurrency: the maximum concurrent interval duration normalized by length, i.e., \(\mathrm {\mu concurrent\_dur}(\mathcal {S})\) = 0.1.

  (xi) Pause_time: the total duration in \(\mathcal {S}\) with no active event-interval, i.e., \(pause\_time(\mathcal {S}) = 0\).

  (xii) \(\mu \)Pause_time: the pause time normalized by length, i.e., \(\mu pause\_time(\mathcal {S}) = 0\).

  (xiii) Activity: the total duration with at least one active event-interval, i.e., the complement of the pause time. In our example, \(activity(\mathcal {S}) = 10\).

  (xiv) \(\mu \)Activity: the active time normalized by length, i.e., \(\mu activity(\mathcal {S}) = 1\).
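For illustration, several of the static features above can be computed in a single pass over a sorted e-sequence. The sketch below is our own illustrative code (not the paper's implementation), assumes integer time points with intervals active on \([t_s, t_e)\), and covers only a subset of the 14 features:

```python
def static_features(seq):
    """A subset of the static features for an e-sequence given as sorted
    (label, t_s, t_e) triplets; intervals are treated as active on the
    unit grid [t_s, t_e)."""
    start = min(t_s for _, t_s, _ in seq)
    end = max(t_e for _, _, t_e in seq)
    duration = end - start
    density = sum(t_e - t_s for _, t_s, t_e in seq)
    # concurrency: the maximum number of intervals active at one time point
    concurrency = max(
        sum(1 for _, t_s, t_e in seq if t_s <= t < t_e)
        for t in range(start, end)
    )
    # pause time: time points at which no interval is active
    pause = sum(
        1 for t in range(start, end)
        if not any(t_s <= t < t_e for _, t_s, t_e in seq)
    )
    return {
        'duration': duration,
        'size': len(seq),
        'dim_count': len({e for e, _, _ in seq}),
        'density': density,
        'mu_density': density / len(seq),
        'concurrency': concurrency,
        'pause_time': pause,
        'activity': duration - pause,
    }

feats = static_features([('A', 0, 3), ('B', 1, 4), ('C', 2, 6), ('A', 5, 8)])
print(feats)
```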

Complexity We observe that the above set of features consists of static summarization metrics of an e-sequence, while requiring low computational runtime. In fact, after sorting the event-intervals of an e-sequence (i.e., \(\varTheta (n \cdot log(n))\)), all metrics can be calculated in either \(\varTheta (1)\) or \(\varTheta (n)\) time. Thus, the overall runtime complexity of extracting static features from a dataset of N e-sequences is \(\varTheta (N\cdot n\cdot log(n))\), while only requiring \(\varTheta (n)\) additional memory, since the number of static features is constant. It follows that the time required to extract static features for an unseen sequence \(\mathcal {S}_{new}\) is \(\varTheta (n \cdot log(n))\).

4.2 Class-based medoid distance features

The main idea behind this type of e-sequence summarization approach is to extract a set of representative e-sequences from the training set and use them to map the e-sequences to a vector space. More concretely, for each set of e-sequences belonging to a class label, we compute their corresponding within-class medoid using the IBSM distance function, as defined in Bornemann et al. (2016). Hence, for a k-class classification problem, we extract a set \(\mathcal {M}=\{M_1, \ldots , M_k\}\) of k medoids functioning as class representatives. We assume that each medoid \(M_j\in \mathcal {D}\).

Then, given a dataset \(\mathcal {D} = \{\mathcal {S}_1,\ldots ,\mathcal {S}_N\}\), each \(\mathcal {S}_i \in \mathcal {D}\) is mapped to a k-dimensional vector defined as

$$\begin{aligned} \mathcal {C}_{\mathcal {M}} \left( \mathcal {S}_i\right) = \left\{ IBSM\left( \mathcal {S}_i, M_1\right) , \ldots , IBSM\left( \mathcal {S}_i, M_k\right) \right\} . \end{aligned}$$

For each \(\mathcal {S}_i\), the resulting k-dimensional vector acts as a set of k features that are passed over to the classifier.
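The medoid extraction and mapping can be sketched as follows. This is a minimal illustration of the mechanics only: a simple length-difference function stands in as a placeholder for the IBSM distance, and all names are ours:

```python
def medoid(items, dist):
    """The element with the minimum total distance to all others of its class."""
    return min(items, key=lambda x: sum(dist(x, y) for y in items))

def medoid_features(seq, medoids, dist):
    """Map an e-sequence to a k-dimensional vector of distances to the
    k class medoids."""
    return [dist(seq, m) for m in medoids]

# Stand-in distance: absolute length difference (a placeholder for IBSM)
dist = lambda a, b: abs(len(a) - len(b))

pos = [[1, 2], [1, 2, 3], [1, 2, 3, 4]]  # toy "sequences" of one class
neg = [[1] * 7, [1] * 8, [1] * 9]        # toy "sequences" of the other class
medoids = [medoid(pos, dist), medoid(neg, dist)]
print(medoid_features([1] * 5, medoids, dist))  # [2, 3]
```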

Complexity Since the class labels (and thus cluster labels) are given, the clustering takes \(\varTheta (N)\) time. Afterwards, we need to calculate the medoid of each cluster and subsequently calculate the distances of all training sequences to these medoids. Assuming the number k of classes is constant, we know that the size of each cluster can be at most \(\varTheta (N)\). For each cluster, all compressed event tables (produced by IBSM) and their pairwise distances (\(\varTheta (N^2)\)) need to be computed and stored. Thus, finding the distances to all class-cluster medoids requires \(\varTheta (N^2 \cdot m \cdot (|\varSigma | + log(m)))\) time and \(\varTheta (N^2 \cdot m \cdot |\varSigma |)\) memory. The online feature extraction requires \(\varTheta (m \cdot (|\varSigma | + log(m)))\) time and \(\varTheta (m \cdot |\varSigma |)\) memory.

4.3 Interval relation-pair features (2-lets)

This set of features involves temporal relation features between pairs of event intervals, also referred to as 2-shapelets (Bornemann et al. 2016) or 2-lets. In simple terms, this stage of the framework incorporates e-sequence features corresponding to temporal relations across all possible event label pairs in \(\mathcal {D}\). We consider seven types of temporal arrangements, as defined in Allen (1983) and depicted in Fig. 3.

More concretely, let A and B be two event intervals with the following property: \(A.t_{s} \le B.t_{s}\) (B does not start before A). We define the set of possible temporal relations that can describe the temporal arrangement of A and B as \(\mathcal {R}=\{\textit{meets}, \textit{matches}, \textit{overlaps with}, \textit{left-contains}, \textit{contains}, \textit{right-contains}, \textit{is followed by}\}\). The individual definitions of these relations are visualized in Fig. 3 and can be found in Papapetrou et al. (2009). We denote as \(rel(A, B) \in \mathcal {R}\) the temporal relation between A and B.

These temporal relations for event intervals have already been used in the context of distance measures for e-sequences on multiple occasions such as by Kotsifakos et al. (2013) and Kostakis et al. (2011). Note that for an ordered pair of event intervals exactly one of these relations applies, meaning the temporal relation of two event intervals is unambiguous. Based on this, a 2-let can be defined as follows.

Definition 3

(2-let) Given two alphabet labels \(e_i, e_j \in \varSigma \), a 2-let is defined as \(l_2= (e_i,e_j,r)\), where \(r \in \mathcal {R}\) is the temporal relation between \(e_i\) and \(e_j\).

Given an e-sequence \(\mathcal {S}\) and a 2-let \(l_2= (e_i,e_j,r)\), we say that \(l_2\) occurs in \(\mathcal {S}\) if there exist two event intervals \(S_k\) and \(S_l\) in \(\mathcal {S}\), such that \(S_k.e = e_i\), \(S_l.e = e_j\), and \(rel(S_k, S_l) = r\).

All 2-lets of an e-sequence \(\mathcal {S}\) can be found by simply determining the relations of all pairs of event-intervals (A, B), where \(A,B \in \mathcal {S}\) and B does not start before A. The idea for the resulting features is simply to treat the number of occurrences of each 2-let as a feature of the sequence. This results in exactly \(7\cdot \frac{|\varSigma |(|\varSigma |-1)}{2}\) possible features, a number that grows quickly with the alphabet size. Thus, it is necessary to perform feature selection afterwards, which is achieved by applying information gain as a feature selection criterion. The algorithm for 2-let feature extraction is summarized in Algorithm 1. To count all 2-let occurrences, an \(N \times 7\cdot |\varSigma |^2\) matrix is used (one row per sequence), which is denoted as SM.
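A minimal sketch of 2-let occurrence counting follows. The boundary conventions in `relation` are one plausible reading of the seven relations (the exact definitions are given in Papapetrou et al. 2009), and all function names are our own:

```python
from collections import Counter
from itertools import combinations

def relation(a, b):
    """Temporal relation of an ordered pair of (label, t_s, t_e) triplets,
    where a does not start after b. The boundary conventions here are one
    plausible reading of the seven relations (cf. Papapetrou et al. 2009)."""
    (_, a_s, a_e), (_, b_s, b_e) = a, b
    if a_e < b_s:
        return 'is-followed-by'
    if a_e == b_s:
        return 'meets'
    if a_s == b_s:
        return 'matches' if a_e == b_e else 'left-contains'
    if a_e == b_e:
        return 'right-contains'
    return 'contains' if a_e > b_e else 'overlaps-with'

def two_let_counts(seq):
    """Count the 2-let occurrences (e_i, e_j, r) over all ordered pairs of
    a sorted e-sequence; each count is one candidate feature value."""
    return Counter((a[0], b[0], relation(a, b)) for a, b in combinations(seq, 2))

counts = two_let_counts([('A', 0, 2), ('B', 1, 5), ('A', 3, 4)])
print(counts)
```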

Complexity For each sequence, all correctly ordered pairs need to be examined, which amounts to a \(\varTheta (m^2)\) runtime per e-sequence. Thus, the runtime for 2-let occurrence counting is \(\varTheta (N \cdot m^2)\), while the memory footprint is \(\varTheta (N \cdot |\varSigma |^2)\). Calculating the information gain of a numeric attribute requires \(\varTheta (N\cdot log(N))\) runtime. This is performed for each feature, which means the total runtime of feature selection via information gain is \(\varTheta (N\cdot log(N) \cdot |\varSigma |^2)\). Memory remains at \(\varTheta (N \cdot |\varSigma |^2)\). Thus, putting the two steps together, we arrive at \(\varTheta (N \cdot ( m^2 + log(N) \cdot |\varSigma |^2))\) runtime and \(\varTheta (N \cdot |\varSigma |^2)\) memory to execute 2-let extraction and select the best 2-lets as features. Calculating the occurrences of the selected 2-lets for a new e-sequence takes \(\varTheta (m^2)\) time in the worst case, since once again all its correctly ordered event-interval pairs need to be considered. Note that this is always independent of \(|\varSigma |\), since a constant number of 2-lets are selected in the feature selection step.

Algorithm 1

4.4 Interval segment features (e-lets)

The features we have discussed so far do not capture relations among more than a single pair of event-intervals. As such, these features ignore potentially class-predictive information from event subsequences that occur concurrently. In addition, class-predictive information related to varying event time spans cannot be fully captured by the previous three feature types. These limitations are demonstrated in Fig. 4.

Fig. 3

The seven temporal relations between an ordered pair of event-intervals (A, B) as defined in Allen’s temporal logic

Fig. 4

a A followed-by 2-let relation between the events A and B. b An e-let capturing the temporal arrangement between events A and B, while also considering the absence of the C event

To address the aforementioned deficiencies, we introduce the novel concept of e-lets, corresponding to event-interval subsequence-based features, inspired by time series shapelets (Ye and Keogh 2009). We motivate the use of random e-let features by the evidence in Karlsson et al. (2016) and Wistuba et al. (2015) that such features provide both low computational cost and state-of-the-art classification performance, while capturing local event duration and multi-dimensional information.

Definition 4

(e-let) Given an e-sequence \(\mathcal {S} = \{S_1, \ldots , S_n\}\), the e-let \(\mathcal {S}^{k,l}\) of \(\mathcal {S}\) is defined as the e-sequence containing all event-intervals that are active from time point k until time point \(k+l-1\). In other words, \(\mathcal {S}^{k,l}=\{S^{*}_1, \ldots , S^{*}_{n'}\}\), where each \(S^{*}_i\) is the time-truncated counterpart of each such \(S_i\), with \(S^{*}_i.t_{s} = \max \{S_i.t_{s}, k\}\) and \(S^{*}_i.t_{e} = \min \{S_i.t_{e}, k+l-1\}\).
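Definition 4 amounts to clipping every interval that intersects the window \([k, k+l-1]\). A minimal sketch, assuming integer time points (function name ours):

```python
def extract_elet(seq, k, l):
    """Clip an e-sequence of (label, t_s, t_e) triplets to the window
    [k, k + l - 1], keeping only the intervals active inside the window
    (Definition 4)."""
    end = k + l - 1
    return [
        (e, max(t_s, k), min(t_e, end))
        for e, t_s, t_e in seq
        if t_s <= end and t_e >= k
    ]

elet = extract_elet([('A', 0, 4), ('B', 2, 9), ('C', 6, 8)], k=3, l=4)
print(elet)  # [('A', 3, 4), ('B', 3, 6), ('C', 6, 6)]
```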

The feature extraction process for e-lets proceeds as follows. First, we select \(\theta \) e-lets from the training dataset \(\mathcal {X}\) uniformly at random. In Algorithm 2, at each iteration i, an e-let \(\mathcal {S}_{t_i}^{k_i,l_i}\) is extracted by randomly selecting an e-sequence \(\mathcal {S}_{t_i} \in \mathcal {X}\), a starting time point \(k_i\), and a length \(l_i\). Note that the length of e-lets is constrained to be smaller than or equal to the length of the shortest e-sequence in \(\mathcal {X}\), i.e., \(l_i \le \min _{\mathcal {S}_i\in \mathcal {X}} \{|\mathcal {S}_i|\}\). Let us denote this maximum bound on the e-let length as \(\lambda \). Performing this selection procedure \(\theta \) times produces a pool of \(\theta \) candidate e-lets, which we denote as \(\mathcal {E}_{\mathcal {X}, \theta } = \{\mathcal {S}_{t_i}^{k_i,l_i}\}\), \(i\in [1, \theta ]\).

figure b
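The random sampling loop of Algorithm 2 can be sketched as follows. The interval representation, the helper names, and the assumption that every e-sequence starts at time point 1 are ours, not the authors':

```python
import random

def eseq_length(eseq):
    """Number of time points spanned by an e-sequence (inclusive endpoints)."""
    return max(e for _, _, e in eseq) - min(s for _, s, _ in eseq) + 1

def sample_elets(dataset, theta, seed=None):
    """Draw theta candidate e-lets uniformly at random, with each length
    bounded by the shortest e-sequence in the dataset (lambda in the text).
    A sketch of Algorithm 2 under assumed (label, start, end) intervals."""
    rng = random.Random(seed)
    lam = min(eseq_length(s) for s in dataset)   # maximum e-let length
    pool = []
    for _ in range(theta):
        seq = rng.choice(dataset)                 # random e-sequence
        l = rng.randint(1, lam)                   # random length l_i <= lambda
        k = rng.randint(1, eseq_length(seq) - l + 1)  # random start k_i
        end = k + l - 1
        pool.append([(lab, max(s, k), min(e, end))    # truncate to window
                     for lab, s, e in seq if e >= k and s <= end])
    return pool
```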

Next, we compute the distance of each e-let in \(\mathcal {E}_{\mathcal {X}, \theta }\) to each e-sequence in \(\mathcal {X}\). Effectively, this produces a mapping of each e-sequence to a set of \(\theta \) feature values, each corresponding to the distance of the e-sequence to an e-let. As a distance function we employ ABIDE (Kostakis and Gionis 2015), which is an extension of the IBSM distance metric defined for subsequence matching of event-interval sequences. The key structure used by ABIDE is a vector representation of the active event-intervals in an e-sequence at each time point.

Definition 5

(Active event vector) Given e-sequence \(\mathcal {S}\) and time point t, the active event vector of \(\mathcal {S}\) at t is a \(|\varSigma |\)-dimensional binary vector \(V_{\mathcal {S}}^t\), such that \(V_{\mathcal {S}}^t(i) = 1\) if \(e_i \in \varSigma \) is active in \(\mathcal {S}\) at t, and 0 otherwise.

Hence, \(\mathcal {S}\) can be represented as an ordered set of active event vectors \(\mathcal {V}_{\mathcal {S}} = \{V_{\mathcal {S}}^1, \ldots , V_{\mathcal {S}}^n\}\). Moreover, the set of distinct event labels that are contained in \(\mathcal {S}\), or in other words the set of event labels that are active for at least one time point, is denoted as \(\varSigma _{\mathcal {S}}\), where \(\varSigma _{\mathcal {S}}\subseteq \varSigma \).
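Constructing the active event vectors of Definition 5 is straightforward; the following sketch assumes 1-based time points and (label, start, end) intervals with inclusive endpoints:

```python
def active_event_vectors(eseq, alphabet, length):
    """Vector representation V_S of an e-sequence (Definition 5): one
    |Sigma|-dimensional binary vector per time point 1..length, with a 1
    wherever the corresponding event label is active. Illustrative sketch."""
    idx = {label: i for i, label in enumerate(alphabet)}
    vectors = [[0] * len(alphabet) for _ in range(length)]
    for label, s, e in eseq:
        for t in range(s, e + 1):            # inclusive endpoints
            vectors[t - 1][idx[label]] = 1   # time points are 1-based
    return vectors
```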

Given an e-sequence \(\mathcal {S}\) and a query e-let \(\mathcal {Q}\), the ABIDE distance function is then defined simply as the minimum distance of \(\mathcal {Q}\) to any e-let of \(\mathcal {S}\) of length \(|\mathcal {Q}|\). Computing the ABIDE distance is done by employing a sliding window \(W_j\) of length \(|\mathcal {Q}|\) over \(\mathcal {S}\), with j indicating the starting time point of the sliding window. The goal is to find the window position that minimizes the \(L_1\) distance between the vector-based representation \(\mathcal {V}_{\mathcal {Q}}\) of \(\mathcal {Q}\) and the vector-based representation \(\mathcal {V}_{\mathcal {W}_j}\) of the corresponding e-let of \(\mathcal {S}\). More formally:

$$\begin{aligned} D(\mathcal {Q},\mathcal {S}) = \min _{j\in [1,\, |\mathcal {S}|-|\mathcal {Q}|+1]} \sum ^{|\mathcal {Q}|}_{t=1}{\sum ^{|\varSigma _{\mathcal {Q}}|}_{i=1}|\mathcal {V}_{\mathcal {Q}}^t(i) - \mathcal {V}_{\mathcal {W}_j}^t(i) |}. \end{aligned}$$

The full computation of ABIDE applies a skyline lower-bounding approach with early-abandoning and alphabet-reduction heuristics to speed up the computation of the sliding-window vector-based distance. More details about ABIDE can be found in Kostakis and Papapetrou (2017). The resulting set of \(\theta \) distance values constitutes the features that are then passed on to the chosen classification model.
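Stripped of the lower-bounding, early-abandoning, and alphabet-reduction heuristics, the core sliding-window computation reduces to the following naive sketch over the vector representations:

```python
def sliding_l1_distance(q_vecs, s_vecs):
    """Naive form of the sliding-window distance behind ABIDE: slide the
    query's vector representation over the e-sequence's and take the minimum
    total L1 difference. The real ABIDE adds skyline lower bounding, early
    abandoning, and alphabet reduction; this sketch shows only the core step."""
    m, n = len(s_vecs), len(q_vecs)
    best = float("inf")
    for j in range(m - n + 1):               # candidate window start positions
        d = sum(abs(qv - wv)
                for qt, wt in zip(q_vecs, s_vecs[j:j + n])
                for qv, wv in zip(qt, wt))
        best = min(best, d)
    return best
```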

Complexity In the worst case, for each generated e-let and each sliding window of length at most \(\lambda \), ABIDE computes the \(L_1\) distance between \(|\varSigma |\)-dimensional vectors, for a total of \(|\mathcal {S}|-\lambda +1\) window positions. This is performed across all N e-sequence training examples and for each of the \(\theta \) generated random e-lets, resulting in \(\varTheta (\theta \cdot N \cdot |\varSigma | \cdot \lambda (|\mathcal {S}|-\lambda +1)) = \varTheta (\theta \cdot N \cdot |\varSigma | \cdot \lambda \cdot |\mathcal {S}|)\) runtime.

5 Experimental evaluation

5.1 Experimental setup

5.1.1 Data

The data consist of information about diagnoses and prescribed drugs for 1,314,646 patients, gathered from the research infrastructure Swedish Health Record Research Bank (Health Bank) at Stockholm University (Dalianis et al. 2015), an anonymized patient record repository based on the TakeCare EPR records from Karolinska University Hospital in Stockholm, Sweden. Diagnoses are encoded using the International Statistical Classification of Diseases and Related Health Problems, 10th Revision (ICD-10), and drugs are encoded using the Anatomical Therapeutic Chemical Classification System (ATC). The TakeCare electronic health record data consisted of 84,100,593 entries of ICD-10 diagnosis codes, lab test data, and ATC-coded medication data.

The chosen classification task concerns the identification of adverse drug event (ADE) diagnoses, which, according to Nebeker et al. (2004), are injuries resulting from the use of a drug, including harm caused by normal use, drug overdoses, and harm following dose reductions or discontinuation of drug therapy. ADEs are highly clinically relevant, accounting for approximately 3.7% of hospital admissions worldwide according to Howard et al. (2007).

The diagnosis and prescription data were pre-processed into 14 ADE data sets consisting of interval sequences of ADE case groups (i.e., patients experiencing an ADE) and control groups (i.e., patients who do not). The case groups were chosen based on ICD-10 codes of known ADEs. Patients diagnosed with these codes were selected as examples, and interval sequences were constructed from their 90-day diagnosis and medication histories of events occurring before the ADE. For patients who had more than one occurrence of a particular ADE type, only the last ADE window was included. The specific ADE codes for the case groups and the corresponding selected control groups included in the experiments are presented in Table 1.

Table 1 ADE case groups (left) and corresponding control groups (right)

Corresponding control-group codes were chosen as the codes with the greatest medical similarity to the case ADE that did not constitute an actual ADE. These examples were given the alternative class label and extracted in the same manner as the case groups, with a 90-day medical history window extracted per example.
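The windowing step described above can be illustrated with a short sketch; the event representation and function name are hypothetical, and this is not the authors' preprocessing pipeline:

```python
from datetime import date, timedelta

def last_ade_window(events, ade_dates, days=90):
    """For a patient, keep only the last occurrence of the ADE and return
    the events from the preceding `days`-day history window, clipped to it.
    Events are (code, start, end) date triples; a hedged sketch of the
    preprocessing described in the text."""
    ade = max(ade_dates)                     # only the last ADE window is kept
    lo = ade - timedelta(days=days)
    window = []
    for code, s, e in events:
        if e >= lo and s < ade:              # event overlaps the window
            window.append((code, max(s, lo), min(e, ade)))
    return window
```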

The alphabet size of the intervals was reduced by selecting the 200 most frequent ICD-10 and ATC codes for each combined case and control group. Candidate ADEs were excluded from the investigation if, after preprocessing, the combined number of case and control intervals fell below a threshold of 2000. A basic description of these novel ADE data sets is given in Table 2. In addition, experiments were performed on six publicly available single-label benchmark data sets from a variety of domains: Auslan2 (Mörchen and Fradkin 2010); Blocks (Mörchen and Fradkin 2010); Context (Mörchen and Fradkin 2010); Hepatitis (Patel et al. 2008); Pioneer (Mörchen and Fradkin 2010); and Skating (Mörchen and Fradkin 2010). Multi-labeled data sets examined in Bornemann et al. (2016), which permit a sequence to possess multiple class labels, were not included in this study.

Table 2 Summary of ADE data sets

5.1.2 Parameter configurations

Our framework evaluations focused on three configuration approaches, all evaluated under 10-fold cross-validation. The first approach examined which classification model produces the best performance under SMILE; this model type was then used for the remaining configurations. The second approach examines the effect on performance of adding the different feature types sequentially. Finally, when comparing the predictive performance of different numbers of features for STIFE and SMILE, we employ a one-variable-at-a-time design. Specifically, for STIFE we vary one component of the framework, the number of 2-lets, over the set \(\{10, 25, 50, 75, 100, 200\}\). For SMILE we vary the novel e-let component of the framework, keeping the number of 2-lets constant at 75 features while varying the number of e-lets over the set \(\{10, 25, 50, 75, 100, 200\}\). We observe the effects of altering the number of 2-lets and e-lets both on average and across all data sets.

Table 3 Comparison of classification models for SMILE

5.1.3 Evaluation metrics

We examined accuracy, i.e., the fraction of correct predictions produced by our classifiers, alongside the area under the ROC curve (AUC). We focus on AUC as the more meaningful performance measure throughout our evaluation due to the considerable class imbalance in many of the chosen data sets, as seen in Table 2.
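For small examples, AUC can be computed directly as the probability that a randomly chosen positive example is scored above a randomly chosen negative one; a pure-Python sketch (quadratic in the number of examples, shown for clarity only):

```python
def auc(scores, labels):
    """Area under the ROC curve via the probability that a randomly chosen
    positive is scored above a randomly chosen negative; ties count half.
    O(P*N) pairwise form, suitable only for small illustrative examples."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Note that, unlike accuracy, this quantity is unaffected by the ratio of positives to negatives, which is why it is preferred under class imbalance.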

5.2 Empirical investigation

5.2.1 Model comparison

We begin our investigation by examining which classification model type yields the best performance when incorporating all stages of our novel framework, i.e., up to the SMILE phase. For model comparisons, Friedman tests showed a highly significant result for both accuracy (\(\chi _F^2\) = 19.545, df = 3, p = 0.0002109) and AUC (\(\chi _F^2\) = 31.155, df = 3, p = 0.0000007885) across the Random Forest (RF), Logistic Regression (LR), Decision Tree (DT), and Support Vector Machine (SVM) algorithms, each incorporating all four stages of SMILE. A full comparison of model types across all data sets is shown in Table 3.

Shown in Fig. 5 are critical distance plots for the post-hoc Nemenyi tests with \(\alpha \) = 0.05; such plots were first used for visualization purposes by Demšar (2006). We observe that the RF model is significantly better than all other models with respect to the critical distance for both accuracy and AUC, while LR, DT, and SVM do not differ significantly from one another on either metric. Based on this evidence, RF was selected as the best-performing model, and it remains the chosen model for all further analyses.

Fig. 5
figure 5

Critical distance plots of Nemenyi test model comparison for \(\alpha = 0.05\), shown both for Accuracy (left figure), and AUC (right figure). Groups of classifiers which are not significantly different are shown as connected

5.2.2 Comparison of methods

Shown in Table 4 are the paired-difference Wilcoxon signed-rank test results comparing SMILE to STIFE. SMILE yielded a significantly (\(p<0.05\)) different population mean rank over STIFE in terms of AUC, whereas the improvement in accuracy was not significant. Although the improvements are less pronounced in terms of accuracy, we again emphasize AUC as the most valid metric in this study due to the profound class imbalances of our data sets. As such, this finding strongly indicates the benefit of including e-let features, owing to the additional class-predictive information they provide.

Table 4 Wilcoxon signed-rank test between STIFE and SMILE
Fig. 6
figure 6

Critical distance plots of Nemenyi test method comparison for \(\alpha = 0.05\), shown for Accuracy (left), and \(\alpha = 0.01\) shown for AUC (right). Groups of methods which are not significantly different are shown as connected

Table 5 Area under ROC comparison across all methods
Table 6 Accuracy comparison across all methods

Secondly, we employed Friedman tests to allow for multiple comparisons between all competitor methods. In addition to utilizing the various stages of STIFE as competitors, we also include 1-nearest neighbor under the IBSM distance, alongside SPAM as an appropriate competitor from the sequential pattern mining domain. SPAM was initialized with a minsup of 0.1, a minimum pattern length of 2, a maximum pattern length of 8, and a max gap of 2. For method comparisons, Friedman tests showed highly significantly different results (\(p < 0.01\)) for both accuracy (\(\chi _F^2 = 52.964\), \(df = 5\), \(p = 0.0000000003421\)) and AUC (\(\chi _F^2 = 35.45\), \(df = 4\), \(p = 0.0000003755\)) among competitors. Shown in Fig. 6 are critical distance plots for the post-hoc Nemenyi tests, with \(\alpha \) = 0.01 for AUC (right) demonstrating at a highly significant level that SMILE outperforms the MEDOID, SPAM, and STATIC methods, while STIFE outperforms neither SPAM nor MEDOID at that level. For accuracy (left), we provide evidence, given \(\alpha \) = 0.05, that SMILE outperforms the SPAM, 1-nearest neighbor under IBSM, MEDOID, and STATIC methods, while STIFE is unable to outperform the MEDOID method.
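The reported \(\chi _F^2\) values follow the standard Friedman formulation over per-data-set ranks; a minimal pure-Python sketch (average ranks for ties, higher scores ranking better):

```python
def friedman_statistic(results):
    """Friedman chi-square statistic over a results matrix (rows = data
    sets, columns = methods), using the standard formulation
    chi^2_F = 12N/(k(k+1)) * (sum_j R_j^2 - k(k+1)^2/4),
    where R_j is method j's average rank. Illustrative sketch only."""
    N, k = len(results), len(results[0])
    avg_rank = [0.0] * k
    for row in results:
        order = sorted(range(k), key=lambda j: -row[j])  # best score first
        ranks = [0.0] * k
        i = 0
        while i < k:
            j = i
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1                        # extend group of tied scores
            for t in range(i, j + 1):
                ranks[order[t]] = (i + j) / 2 + 1  # average rank over ties
            i = j + 1
        for jx in range(k):
            avg_rank[jx] += ranks[jx] / N
    return 12 * N / (k * (k + 1)) * (
        sum(r * r for r in avg_rank) - k * (k + 1) ** 2 / 4)
```

For instance, if one of three methods is ranked first on all four data sets (average ranks 1, 2, 3), the statistic evaluates to 8.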

Fig. 7
figure 7

Average performance metrics across all data sets for STIFE with varying numbers of 2-let features and SMILE with varying numbers of e-let features, over the values \(\{10, 25, 50, 75, 100, 200\}\), for AUC (left) and accuracy (right). For SMILE the number of 2-lets was kept constant at 75

Fig. 8
figure 8

Performance metrics comparison of STIFE and SMILE shown for variable numbers of 2-let and e-let features across all ADE data sets. For SMILE the number of 2-lets was kept constant at 75

Fig. 9
figure 9

Performance metrics comparison of STIFE and SMILE shown for variable numbers of 2-let and e-let features across all benchmark data sets. For SMILE the number of 2-lets was kept constant at 75

Fig. 10
figure 10

Performance metrics comparison of STIFE and SMILE shown for variable numbers of 2-let and e-let features across all ADE data sets. Baseline refers to the majority vote baseline shown in Table 2. For SMILE the number of 2-lets was kept constant at 75

Fig. 11
figure 11

Performance metrics comparison of STIFE and SMILE shown for variable numbers of 2-let and e-let features across all benchmark data sets. For SMILE the number of 2-lets was kept constant at 75

Examining average AUC and accuracy performance across all datasets in Tables 5 and 6 also highlights the superior performance of SMILE.

5.2.3 Effect of number of 2-lets and e-lets

Figure 7 shows the average performance metrics across all data sets when varying the number of 2-let features for STIFE over the set \(\{10, 25, 50, 75, 100, 200\}\), together with the e-let feature counts for SMILE varied over the same set of values; for SMILE, the number of 2-lets was kept constant at 75. Varying the number of features produced relatively little variation in performance, regardless of the configuration examined. Examining AUC for STIFE, a trend of small performance improvements with a greater number of features can be observed, while for SMILE the best AUC was obtained with 100 e-let features.

In Figs. 8, 9, 10, and 11 we similarly examine the effect of the number of features on performance for each individual data set, observing greater variation in results on the novel ADE data sets, where either the lowest or the highest number of features could yield the best performance depending on the data set. Based on this evidence we regard the use of 75 2-let and e-let features as reasonable, and we encourage future studies to choose a similarly conservative number of both feature types to reduce cost.

5.3 Medical case study on e-let features

To further motivate the utility of including e-let features in SMILE, we examine several e-lets that were ranked in the top 10 of feature importance for a given data set, i.e., contributing the highest average impurity decrease for the RF classifier across all feature types. Figures 12, 13, and 14 depict examples of real e-lets, extracted from patients, that SMILE ranks highly in terms of feature importance. Due to the inherent randomness of these extracted e-lets, we do not suggest that every interval contained in them is important for class discrimination; with this in mind, an e-let comprising only a subset of the intervals in the provided examples could possess equal feature importance. Figures 12 and 14 show two e-lets of highly ranked importance generated from the ADED611 medical data set, which discriminates between drug-induced aplastic anemia and unspecified aplastic anemia. Examining Fig. 12 reveals that the drugs Omeprazole and Furosemide, prescribed over an extended duration, contribute to high importance. This finding is supported by the medical literature, which reports that Furosemide is a diuretic drug prescribed to cardiovascular patients with a known association to drug-induced aplastic anemia (Rao 2014). Omeprazole is a proton-pump inhibitor, with one study reporting a link between such inhibitors and anemia in cardiovascular outpatients (Shikata et al. 2014). Although Omeprazole has not been proven to be linked to drug-induced aplastic anemia in particular, our finding suggests it might contribute to the condition when prescribed alongside Furosemide.

Secondly, we examine Fig. 13, showing an e-let of high importance extracted from a patient with long-term multiple myeloma who also received antineoplastic chemotherapy and immunotherapy for this cancer. This finding might suggest that drug-induced aplastic anemia is more likely for this particular cancer combined with the respective chemotherapy and immunotherapy regimens. Finally, we examine Fig. 14, showing an e-let of high importance extracted from a patient who was initially prescribed six drugs and later diagnosed with lymphocytic leukemia. Of the drugs under examination, the antibiotic trimethoprim-sulfamethoxazole is known to induce aplastic anemia (Menger et al. 2015), while lansoprazole, a proton pump inhibitor, has been linked to drug-induced hemolytic anemia (Rao 2014). Although it is more ambiguous in this example how the combination of drugs and the leukemia diagnosis yields high importance, such a finding may be of medical interest.

Fig. 12
figure 12

Example of e-let extracted for the ADED611 medical data set suggesting high importance relating to Omeprazole and Furosemide drug prescriptions over an extended duration

Fig. 13
figure 13

Example of e-let extracted for the ADED611 medical data set suggesting high importance for multiple myeloma over a long duration containing antineoplastic chemotherapy and immunotherapy over a shorter duration

Fig. 14
figure 14

Example of e-let extracted for the ADED611 medical data set suggesting a combination of the many observed drugs being of high importance

6 Conclusions

The main contribution of this paper is the introduction of the SMILE framework, motivated by the need to capture information regarding event duration across multiple event types within e-sequences. A comprehensive evaluation has been performed, demonstrating that SMILE provides significantly improved AUC performance over the current state-of-the-art, as well as over a selection of competitors utilizing varied combinations of the feature types contained within the SMILE framework itself. This evaluation was performed across a series of benchmark and newly generated ADE data sets. The investigation also revealed that the random forest model, when used with SMILE, achieved significantly better performance than a variety of competitor classifiers. Finally, the investigation demonstrated the effect of utilizing a varied number of SMILE features, with the result that a conservative number of features was often sufficient to achieve the best results. Such findings contribute to a growing knowledge base of informative features that can be employed for sequences of temporal intervals to achieve state-of-the-art performance in a variety of domains, such as ADE detection. Directions for future work include: investigating approaches to reduce feature extraction costs, utilizing alternative similarity measures, more extensive medical validation of important features, examining the applicability of our framework in alternative domains, and introducing methods to aid in the interpretability of our framework.