
1 Introduction

Analyzing sequential data [1] has seen a vast surge in interest during recent years, driven by the growth of typical sources such as DNA databases, text repositories, road analysis [2] and user behavior analysis [3]. Many techniques exist to derive ordered items from temporal databases, focusing on either different techniques for discovery, e.g., using prefix-oriented and constraint-based approaches, or towards different outcomes, e.g., regular expressions or closed sequences. These sequential features can be used for classifying new database entries, a discipline that does not only focus on constructing the most complete set of features, but rather the most discriminating.

In this paper, we propose a new sequence classification technique, called iBCM (interesting Behavioural Constraint Miner), which featurizes sequences according to a predefined set of behavioral constraint templates. As such, a fine-granular view of the temporal relations between items can be achieved and applied towards classification. Furthermore, iBCM allows for easy identification of the differences between classes, and gives insight into what types of relations are typically relevant for classification. In the experimental evaluation, it is shown that iBCM is capable of obtaining high discriminative power while minimizing the number of features needed. In addition, only deriving a certain type of constraint templates can already capture the most discriminating features.

This paper is structured as follows. In Sect. 2, an overview of the state-of-the-art of both sequence mining and classification is discussed. In Sect. 3, the backdrop for mining behavioral sequence patterns is introduced, which leads into the discussion of the inference part of iBCM in Sect. 4. Next, Sect. 5 reports on a benchmark with other state-of-the-art techniques. Finally, Sect. 6 summarizes the contributions and provides suggestions for future work.

2 State-of-the-Art

In this section, an overview of existing sequence mining and classification techniques is discussed.

2.1 Sequence Mining

Sequence mining, also referred to as frequent ordered itemset mining or temporal data mining, has been tackled in numerous ways. The original approach was rooted in frequent itemset discovery [4] and based on apriori concepts. Extensions to this original approach have been proposed to obtain closed sequences [5] and to achieve performance benefits through a prefix representation of the dataset [6]. A constraint-based approach was proposed in [7] in the form of cSPADE, and there has recently been strong interest in extending it along the declarative constraint programming paradigm. More specifically, several studies investigate how to generically build a knowledge base of constraints covering the sequences in a temporal dataset. For example, in [8], a satisfiability-based technique is devised for enumerating all frequent sequences using cardinalities for the constraints retrieved. In [9], a better prefix representation for sequence mining constraints was introduced, which was later extended to gap constraints [10]. A similar approach was devised in [11], in which the authors speed up the retrieval of constraints by precomputing the relations between items in a dataset to avoid reiterating over the sequences. These approaches can also be used to quickly retrieve regular expressions. In [12], a general constraint programming approach that steers away from explicit wildcards is introduced. Finally, a similar vein of research was pursued with Warmer [13], an inductive logic programming pattern discovery algorithm that relies on the Datalog formalization for expressing multi-dimensional patterns. It was elaborated further for sequences in [14]. The proposed work is a special-purpose algorithm that mines for a subset of Datalog patterns.

2.2 Sequence Classification

While many insights from sequence mining carry over into sequence classification, the nature of the objective is different. Rather than eliciting the full set of sequences or constraints supported, it is paramount that the feature set exhibits the following characteristics.

  • Compact: in order to build classifiers in reasonable time, the set of features should be reduced to a minimum,

  • Interesting: features of sequential patterns should be supported in a database, but their usefulness towards classification, i.e., their discriminative power, also depends on other factors such as confidence and interestingness [15]. In general, there is a need for a balance in the feature set that strikes support values in between extremely high and low values [16],

  • Concise: the feature set is small though comprehensive, and explains the sequential behavior in an understandable way.

Many sequence classification techniques have been proposed [17,18,19], each focusing on a different approach, ranging from extensions to sequence pattern mining algorithms to statistical approaches that infer the explanatory power of subsequences. They can be classified as either direct, i.e., the features are extracted according to their strength towards the classifier, or indirect, i.e., all features are generated and later selected by a classifier. The approach in [17] extends the cSPADE algorithm with an interestingness measure that is based on both the support and the window (cohesion) in which the items of the constraint occur. In [18], BIDE-D(C) is introduced, which instead incorporates information gain into BIDE to provide a direct sequence classification approach. In [19], the sequence database is split up into smaller parts to be recreated by a sparse knowledge base that penalizes infrequent behavior by constructing a Bayesian network of posteriors that are able to reconstruct the sequence database. A similar approach is used in [20], where a strong emphasis is placed on finding interesting sequences.

In contrast to the previously mentioned techniques, iBCM draws from insights in constraint programming, but rather than constructing a complete constraint base that is able to elicit the sequence database as a whole, highly diverse and informative behavioral patterns are used that incorporate cardinality, alternation, gaps, as well as negative information. By fixing the pattern base, it becomes easy to write a specific and fast algorithm for retrieving them from large databases. The technique employs only binary constraints; however, other studies such as [17] have already revealed that for sequence classification, the length of the patterns does not have to exceed 3, or even 2.

3 The Framework of Behavioral Templates

In this section, the preliminaries are established and an overview of the behavioral constraint templates/patterns and their characteristics is given.

3.1 Sequences and Sequence Databases

The task of sequence classification relies on the principles of both a sequence and a sequence database, as well as the classes or labels needed to discern their behavior.

Definition 1

A sequence \(\sigma =\langle \sigma _1,\sigma _2,...,\sigma _n\rangle \) is a list of items with length \(|\sigma |=n\) out of the alphabet \(\varSigma _{\sigma }\). We denote:

  • \(occ(a,\sigma )=\{i\mid \sigma _i=a,\, i\in \mathbb {N}\}\) the ordered set of positions of \(a\in \varSigma _\sigma \) in \(\sigma \),

  • \(min(occ(a,\sigma ))\) the first occurrence,

  • \(max(occ(a,\sigma ))\) the last occurrence, and

  • \(|occ(a,\sigma )|\) the number of occurrences.
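As a minimal illustration of this notation, the following Python sketch computes \(occ(a,\sigma )\) and the derived quantities for a sequence given as a string. The function name `occ` mirrors the definition; the 1-based positions and the example sequence are assumptions made for illustration only.

```python
def occ(a, sigma):
    """Return the ordered list of 1-based positions at which item a occurs in sigma."""
    return [i for i, item in enumerate(sigma, start=1) if item == a]


sigma = "aabcab"               # hypothetical sequence over the alphabet {a, b, c}
positions = occ("a", sigma)    # ordered positions of a
first = min(positions)         # first occurrence, min(occ(a, sigma))
last = max(positions)          # last occurrence, max(occ(a, sigma))
count = len(positions)         # number of occurrences, |occ(a, sigma)|
```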

Sequences, or ordered sets of items, are typically bundled in sequence databases, which can be defined as follows.

Definition 2

A sequence database \(\mathcal {SB}\) is a set of sequences with \(L: \mathcal {SB}\rightarrow \mathbb {N}\) a labeling function assigning a class label to a sequence consisting of the items in \(\varSigma _{\mathcal {SB}}\). The number of sequences in the database is \(|\mathcal {SB}|\).

Consider the example sequence database in Table 1, with \(\varSigma _{\mathcal {SB}}=\{a,b,c\}\), \(|\mathcal {SB}|=6\), and \(|img(L)|=2\).

Table 1. Example database.
Table 2. An overview of Declare constraint templates with their corresponding LTL formula and regular expression.

3.2 Declare Pattern Base

The iBCM approach relies on a set of behavioral constraint templates based on the Declare language [21], which itself is inspired by the formal verification patterns of Dwyer [22]. These are widely used for identifying not only sequential, but overall behavioral characteristics of programs and processes. The Declare template base consists of a number of patterns for modeling flexible business processes, which are typically expressed in linear temporal logic (LTL), or regular expressions and finite state machines (FSMs). The template base is extensible, but the most widely-used entries are listed in Table 2. The patterns contain both unary and binary constraints. The unary constraints focus either on the position (first/last), or the cardinality. The choice constraint can be considered an existence constraint over multiple items. The binary constraints exhibit a hierarchy [25]. There are unordered constraints (responded/co-existence), simple ordered (precedence, response, succession), alternating ordered, and chain ordered constraints. Hence, the opportunity exists to express not only the ordering, but also the repeating (alternation) and local (chain) behavior of two items. Furthermore, there are negative constraints, expressing behavior that does not occur. These can prove especially useful in the context of classification, and are typically not generated by sequence classification techniques that only mine for positive patterns.

Definition 3

A sequence constraint \(\pi =(A,t)\) is a tuple with A a set of items and t the type of constraint.

A binary constraint has an antecedent, implying the constraint, and a consequent. Both can consist of a set of items; however, in the rest of the paper we will assume both to be singletons. The type of a constraint corresponds to one of the templates defined in Table 2. For convenience, the constraints are written in an abbreviated fashion, e.g., altPrec(ab). Each of them corresponds to a regular expression which can be converted into an FSM. We denote the corresponding regular expression as \(\S (t)\). We write the FSM \(\mathcal {A}\) corresponding to the regular expression as \(\mathcal {A}=\S (A,t)\) or \(\mathcal {A}=\S (altPrec(a,b))\). An example for altPrec(ab) is depicted in Fig. 1.

Fig. 1. Automaton of alternate precedence(a,b).

Table 3. The behavioral constraints present in the sequence database of Table 1. The constraints that are supported at 100% are left out for 50%.

Definition 4

A sequence \(\sigma \) supports a constraint \(\pi \) iff \(\sigma \in \mathcal {L}(\mathcal {A}(\pi ))\), where \(\mathcal {L}\) denotes the language of the corresponding FSM. The support of the constraint in the database is \(sup(\pi )_{\mathcal {SB}}=|\{\sigma \in \mathcal {SB}\mid \sigma \in \mathcal {L}(\mathcal {A}(\pi ))\}|\).

E.g., in \(\mathcal {SB}=\{aab,abb\}\), \(\sigma _1\in \mathcal {L}(\S (altPrec(a,b)))\), \(\sigma _2\notin \mathcal {L}(\S (altPrec(a,b)))\), and \(sup(\pi )_{\mathcal {SB}}=1\).
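This example can be checked mechanically. The sketch below assumes a commonly used regular expression for alternate precedence(a,b), namely `[^b]*(a[^b]*b[^b]*)*`; since Table 2 is not reproduced here, this expression and the helper names `supports` and `support` are illustrative assumptions rather than the exact formulation used by iBCM.

```python
import re

# Assumed regular expression for alternate precedence(a, b): every b is
# preceded by an a, with no other b in between.
ALT_PREC_AB = re.compile(r"[^b]*(a[^b]*b[^b]*)*")


def supports(sigma, pattern):
    """A sequence supports a constraint iff the whole sequence is in the language."""
    return pattern.fullmatch(sigma) is not None


def support(database, pattern):
    """Number of sequences in the database that support the constraint."""
    return sum(1 for sigma in database if supports(sigma, pattern))


database = ["aab", "abb"]
print([supports(s, ALT_PREC_AB) for s in database])  # [True, False]
print(support(database, ALT_PREC_AB))                # 1
```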

3.3 Comparison with Other Sequence Constraint Representation

The iBCM approach does not intend to be able to reproduce the database. Rather it is able to capture the most discerning sequence-based features. Consider for example the database in Table 1. Table 3 lists the constraints that are present for both labels. Notice that for label 1, a does not always precede b. Also, for label 2 c occurs before b. This can be discerned by only 3 constraints which are marked in bold. Hence, with only 3 features, it is possible to classify the traces correctly. Lowering the support threshold results in more constraints being different, although the number of constraints does not have to drastically increase, as for example response(a,b) will eventually be replaced by alternate response(a,b) because of the hierarchy between the constraints. To achieve the same results with typical sequence-based constraints as used in, e.g., SPADE, it is harder to make such concise distinctions, as non-local information present in, e.g., succession requires either longer or more sequences to approach the behavior that will converge towards the language of the regular expression.

To summarize, iBCM exhibits the following advantages:

  • It employs a rich, varied set of constraints that can be derived in a fast manner,

  • It can be extended to incorporate any regular expression,

  • It includes negative constraints for providing counter evidence, useful towards classification,

  • It includes both unary cardinalities, as well as relational constraints,

  • It enables easy comparison of constraint sets,

  • It enables understanding what type of behavioral relations are present,

  • It can be converted into a global automaton for representing behavior graphically.

4 iBCM: Algorithm Design and Implementation

This section outlines the algorithm for constructing the set of features based on the constraint templates discussed in Sect. 3. iBCM is an indirect sequence classification approach, i.e., the featurization and classification part are separate. Section 5 outlines the performance of the constraints generated by the approach as binary features (present/not present).

4.1 Inferring Constraints

The featurization approach is employed as a 3-step approach and outlined in Algorithm 1.

Step 1: Retain frequent items. First, items that exceed the support threshold are retained in set A (line 2). Only these items will be used for checking unary constraints, and they will be used in pairs for checking binary constraints.

Step 2: Generate constraints. Next, every sequence in the database is checked in the following manner (line 4, and Algorithm 2). The sequence is traversed completely, and for every item in the frequent itemset the positions are stored. This allows for easy verification of the binary constraints. For every item \(a\in A\), \(|occ(a,\sigma )|\) is used for determining the cardinality constraints, i.e., absence/exactly/existence. It is also checked whether it occurred as the first or last item in the sequence. Next, a is paired with every other \(b\in A\setminus \{a\}\) to determine the type of behavioral constraint pattern. If a happens before b, the precedence lineage is reviewed. For every next occurrence of b, it is checked whether there was another a preceding it for alternate precedence. In the meantime, for every occurrence, the exact position is checked for chain precedence. Both checks stop when there is no further evidence. If all occurrences of b fit, the constraints are added to the constraint set. If b happens after a, the response hierarchy is scrutinized. Similar to alternate precedence, every occurrence of a is checked for a subsequent b before the next occurrence of a. If every occurrence of a is immediately followed by b, chain response is stored. After every pairwise check, the respective succession constraints are added if both (alternate/chain) response and precedence are present in the sequence. When b is not present in the sequence, there is evidence for exclusive choice.

Step 3: Retain frequent constraints. Finally, for every constraint it is checked whether it satisfies the minimum support level for the different labels in the sequence database (line 6 of Algorithm 1). This allows for precise measuring of sequential behavior, as some sequences might support both response and precedence, while others do not. Still, they can be merged (i.e., the simultaneous presence of response and precedence forms succession) to reduce the number of features.
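The three steps can be sketched in Python as follows. This is a simplified illustration, not a reproduction of Algorithms 1 and 2: it covers only the precedence and response lineages, and it assumes that item support is measured over the whole database, that a threshold is met when the relative support is at least `min_sup`, and that a constraint is kept when it is frequent within at least one label. All function and constraint names are hypothetical.

```python
from collections import defaultdict
from itertools import permutations


def positions(sigma):
    """Map every item to the ordered list of its 0-based positions."""
    pos = defaultdict(list)
    for i, item in enumerate(sigma):
        pos[item].append(i)
    return pos


def pair_constraints(pos, a, b):
    """Precedence and response lineage for (a, b); both items must occur."""
    pa, pb = pos[a], pos[b]
    found = set()
    if pa[0] < pb[0]:  # first a occurs before first b
        found.add(("precedence", a, b))
        # alternate precedence: an a occurs between consecutive b's
        if all(any(p < x < q for x in pa) for p, q in zip(pb, pb[1:])):
            found.add(("alt_precedence", a, b))
        # chain precedence: every b is immediately preceded by an a
        if all(q - 1 in pa for q in pb):
            found.add(("chain_precedence", a, b))
    if pb[-1] > pa[-1]:  # last b occurs after last a
        found.add(("response", a, b))
        # alternate response: a b occurs between consecutive a's
        if all(any(p < x < q for x in pb) for p, q in zip(pa, pa[1:])):
            found.add(("alt_response", a, b))
        # chain response: every a is immediately followed by a b
        if all(p + 1 in pb for p in pa):
            found.add(("chain_response", a, b))
    return found


def ibcm_features(database, labels, min_sup):
    n = len(database)
    # Step 1: retain items whose relative support reaches the threshold
    item_count = defaultdict(int)
    for sigma in database:
        for item in set(sigma):
            item_count[item] += 1
    frequent = sorted(i for i, c in item_count.items() if c / n >= min_sup)

    # Step 2: derive the constraints of every sequence from item positions
    per_sequence = []
    for sigma in database:
        pos = positions(sigma)
        found = set()
        for a, b in permutations(frequent, 2):
            if a in pos and b in pos:  # skip vacuously satisfied pairs
                found |= pair_constraints(pos, a, b)
        per_sequence.append(found)

    # Step 3: retain constraints that are frequent within at least one label
    kept = set()
    for label in set(labels):
        members = [c for c, l in zip(per_sequence, labels) if l == label]
        counts = defaultdict(int)
        for constraints in members:
            for c in constraints:
                counts[c] += 1
        kept |= {c for c, k in counts.items() if k / len(members) >= min_sup}
    return per_sequence, kept


# Hypothetical toy database with two labels
database = ["aab", "abab", "cbb", "cbab"]
labels = [1, 1, 2, 2]
per_sequence, kept = ibcm_features(database, labels, min_sup=0.5)
```

Storing the positions once per sequence is what allows every pairwise check to be a lookup or a single pass over the occurrence lists, instead of running an automaton per pair.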

As can be seen from Algorithm 2, the binary constraints can be derived very efficiently by boolean and string operations. The approach is inspired by both [25, 26]. However, for classification purposes the sequences need to be labeled right away. The former uses DFAs to check constraints for each frequent pair. Doing this on a sequence level is computationally expensive, as it would require running each string many times. The latter builds a knowledge base of occurrence and precedence relations and calculates the support for constraints. This, however, is done on a log level, rather than at entry/sequence level, which requires extra featurization steps afterwards.

figure a
figure b

4.2 Considerations on Constraint Template Base

Not all Declare constraint templates are fit to be considered for obtaining features from single sequences. First of all, constraints might suffer from being vacuously satisfied, i.e., they are satisfied because no counterevidence is provided. Hence, only binary pairs of which both items are present in a sequence are considered. This automatically satisfies the choice constraint, as well as responded existence and co-existence. Secondly, in a single sequence, absence, exactly, and existence are not distinguishable. It is opted not to generate all of them, but rather to stick with a layered approach of absence for no occurrences, exactly for 1 to 2 occurrences, and existence for 3 or more occurrences. It would be possible to check them separately and merge them afterwards; however, experiments showed that this does not have an impact on the results. Finally, exclusive choice and not chain succession both mine for negative behavior that reflects everything that is not present in the sequences. While absence does the same, the number of non-existing sequence pairs is vastly larger. Although mining for negative information is one distinctive feature of the proposed approach, the gain in accuracy does not outweigh the burden in terms of the number of extra constraints generated. Hence, they are not included in the final constraint set, and not succession is the only negative constraint used. Note that all constraints are mined with a confidence of 100%.
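The layered cardinality scheme can be sketched as a single function. The boundary used here (three or more occurrences map to existence) and the tuple encoding of the constraints are assumptions made for illustration; the function name is hypothetical.

```python
def cardinality_constraint(a, sigma):
    """Return one unary cardinality constraint for item a in sequence sigma."""
    count = sum(1 for item in sigma if item == a)
    if count == 0:
        return ("absence", a)
    if count <= 2:
        # exactly is parameterized by the observed count (1 or 2)
        return ("exactly", a, count)
    # assumed boundary: three or more occurrences
    return ("existence", a)


for sequence in ["bcb", "abc", "aabca", "aaaa"]:
    print(sequence, cardinality_constraint("a", sequence))
```

Because exactly one of the three layers is emitted per item and sequence, the number of unary features stays bounded by the number of frequent items.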

4.3 Scalability

The computational tractability of the technique relies heavily on two components. First of all, the length of the sequences is an important factor, as they are traversed completely. Hence, the performance is bound in the extreme by the length of the longest sequence. Secondly, the minimum support determines the number of activities, and hence the number of pairs and constraint templates that need to be checked. In the worst case, all pairs have to be checked for all binary templates. Most constraints can be checked by simple lookups, but when the templates in the upper part of the hierarchy are checked, the worst-case complexity is the length of the string for checking alternating and chain behavior. This results in O(\(|A|^2\times |\sigma |\)). As will become clear from the experimental evaluation, however, iBCM can achieve good results at high minimum support levels, reducing |A| drastically.
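The bound can be read off as follows, as a sketch under the assumption that the template checks for one ordered pair of items cost at most a constant number of passes over the sequence. There are \(|A|(|A|-1)\) ordered pairs of frequent items, so for a single sequence \(\sigma \):

\[ |A|\,(|A|-1)\cdot O(|\sigma |) = O(|A|^2\times |\sigma |). \]

Assuming further that sequences are processed independently, the cost over the whole database is bounded by the longest sequence:

\[ \sum _{\sigma \in \mathcal {SB}} O(|A|^2\times |\sigma |) \subseteq O\Big (|\mathcal {SB}|\times |A|^2\times \max _{\sigma \in \mathcal {SB}}|\sigma |\Big ). \]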

5 Experimental Evaluation

In this section, the technique will be evaluated on widely-used, realistic datasets and compared with 4 other approaches.

5.1 Setup

Below, an overview of the data, implementation, and other approaches is given.

Data and Classification. The datasets that were used are summarized in Table 4 and are a mix with both a large set of distinct items, as well as a large number of data entries. They are discussed in more detail in [17, 19]. All techniques were first employed to generate interesting sequences, and next to build a predictive model by using the presence of the sequences as a binary feature. Three classifiers were considered, i.e., naive Bayes (NB), decision trees (DT), and random forests (RF), for which the Weka Java implementation was used. All runs were executed using a Java 8 Virtual Machine on an Intel i7-6700HQ CPU with 16 GB DDR4 memory. A 10-fold cross-validation was applied for all the experiments.
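Using the presence of a pattern as a binary feature amounts to building a 0/1 matrix with one row per sequence and one column per retained pattern. The following Python sketch illustrates this encoding only; the experiments themselves relied on Weka, and the function name, the tuple encoding of constraints, and the toy input are hypothetical.

```python
def to_binary_features(per_sequence, feature_set):
    """Encode each sequence as a 0/1 vector over a fixed, sorted feature order."""
    columns = sorted(feature_set)
    rows = [[1 if c in found else 0 for c in columns] for found in per_sequence]
    return columns, rows


# Hypothetical constraint sets found in three sequences
per_sequence = [
    {("precedence", "a", "b"), ("response", "a", "b")},
    {("precedence", "a", "b")},
    {("response", "c", "b")},
]
features = {("precedence", "a", "b"), ("response", "a", "b"), ("response", "c", "b")}
columns, rows = to_binary_features(per_sequence, features)
print(rows)  # [[1, 1, 0], [1, 0, 0], [0, 0, 1]]
```

The resulting rows, together with the class labels, can be handed to any standard classifier.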

Table 4. Characteristics of the datasets used for evaluation.

Approaches. iBCM is benchmarked against 4 other state-of-the-art techniques, being cSPADE [7], Interesting Sequence Miner (ISM) [19], Sequence Classification based on Interesting Sequences (SCIP) [17], and Mining Sequential Classification Rules (MiSeRe) [20], which all have the clear goal of obtaining discriminative, informative sequences for classification and are compared in Table 5. A comparison with other techniques can be found in the respective works as well. For cSPADE, iBCM, and SCIP, the support levels were set at 0.1–1.0 by 0.1 intervals. SCIP was used for a minimum interestingness level of 0.05 and a maximum sequence length of 2 (this length was devised by the authors in [17] and a longer length increased computation time and did not return better results). For MiSeRe, 1, 2, 5, and 10 second run times were considered. Finally ISM was used with a maximum number of iterations of 200, and a maximum number of optimization steps of 10,000. No notable differences were reported when using different settings. The implementation of the benchmark can be found online at https://feb.kuleuven.be/public/u0092789.

Table 5. An overview of the techniques used for benchmarking.

5.2 Results

The results in terms of accuracy and the number of generated constraints along the support spectrum are displayed in Figs. 2 and 3. The results for ISM and MiSeRe are reported separately in Table 6. An overview of the share of each constraint template family in the results of iBCM is given in Fig. 4.

Fig. 2. Overview of the performance of the different algorithms.

Fig. 3. Overview of the performance of the different algorithms.

Overall, iBCM is capable of achieving a high accuracy without inducing a large number of constraints (|C|). Especially for the aslbu and auslan2 datasets, a higher accuracy is obtained than with the other state-of-the-art techniques. Also, iBCM achieves a higher accuracy more rapidly when going down the support spectrum, already reaching high accuracy at 50% to 60% support with a small number of constraints (<100). The differences in terms of the type of sequential behavior present become apparent. In the text-based datasets, such as news, the absence constraint clearly provides a prominent source of information, since the presence of items, rather than relations between them, is needed for classification. This is in line with the findings in [19]. In the other datasets, the whole set of constraint patterns is used, except for the very specific chain constraints. The inclusion of negative constraints might explain the higher accuracy for aslbu and auslan2. The more comprehensive alternating constraints are indeed often present (note that the hierarchy reduction cuts away all simple/alternating ordered constraints when alternating/chain constraints are found).
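The hierarchy reduction mentioned above can be sketched as follows, assuming the usual specialization order in which a chain constraint implies its alternating counterpart, which in turn implies the simple ordered one. The constraint names and the `STRONGER` table are hypothetical and only illustrate the idea of keeping the most specific constraint per item pair.

```python
# Assumed specialization order: chain implies alternate implies simple.
STRONGER = {
    "precedence": ["alt_precedence", "chain_precedence"],
    "alt_precedence": ["chain_precedence"],
    "response": ["alt_response", "chain_response"],
    "alt_response": ["chain_response"],
    "succession": ["alt_succession", "chain_succession"],
    "alt_succession": ["chain_succession"],
}


def reduce_hierarchy(constraints):
    """Drop a constraint when a stronger one holds for the same item pair."""
    return {
        (t, a, b)
        for (t, a, b) in constraints
        if not any((s, a, b) in constraints for s in STRONGER.get(t, []))
    }


found = {
    ("precedence", "a", "b"),
    ("alt_precedence", "a", "b"),
    ("response", "a", "b"),
}
print(sorted(reduce_hierarchy(found)))
# [('alt_precedence', 'a', 'b'), ('response', 'a', 'b')]
```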

Table 6. Accuracy and log(|C|) (between brackets) results for MiSeRe and ISM.

cSPADE was not able to finish executing the context dataset within 60 min for support values lower than 50%. Similarly, ISM was not able to generate interesting sequences for the News dataset. In addition, the algorithms did not always generate constraints for certain higher support values. In terms of performance, Fig. 5 plots the time needed to generate the constraints and label the sequences. All constraints could be derived in less than 1 second, except for the News dataset due to the larger size of |A|. In this case, the technique clearly scales exponentially with the size of |A|. This is probably due to the nature of the data, being plain text. The higher number of items, none of which are particularly frequent beyond a certain threshold, increases the runtime. In the other datasets, infrequent items are truly infrequent and |A| does not necessarily grow. Nevertheless, support settings from 0.6 onwards already guarantee a decent level of accuracy.

Fig. 4. An overview of the share (in %) of the type of constraint templates mined from the databases. The numbers stand for the minimum support, e.g., 2 stands for 20%.

There is no notable difference in accuracy when using different classifiers, except for the aslbu and auslan2 datasets. Especially the constraints generated by cSPADE seem to have a different impact on the classifiers. In general, the classifiers perform more stably on the datasets with either fewer labels or more sequences to learn from. Random forests seem to perform the best overall.

Fig. 5. Time needed to mine constraints.

6 Conclusion and Future Work

This work proposed iBCM, a new technique with the ability to discover features for sequence classification. Based on behavioral constraint templates, iBCM is able to concisely distinguish different sequential behaviors in databases. It is capable of achieving results with high accuracy, while minimizing the number of features needed compared with other approaches. Furthermore, the inference technique devised can also be applied towards descriptively interpreting the nature of the patterns present in a sequence database, offering insights into what types of interplay are present between the items in the data.

In future work, a more in-depth comparison of which types of constraints contribute the most to the classifiers will be made. This establishes the base for building direct sequence classification techniques as well. Furthermore, the data-aware versions of the patterns can be introduced. Most patterns are also described in first-order LTL and can be extended to include non-sequential information [27] to bridge the gap with Datalog [13]. Finally, the target-branched version of Declare [28], i.e., constraints with a consequent being a set rather than a singleton, will be investigated.