1 Introduction

Business processes capture the activities of every profit or non-profit, public or private organisation, coordinating humans and software to collectively deliver value. As organisations evolve, new needs appear, e.g., covering electric scooters for an insurance company or handling a change in the law about reimbursing travel expenses at the university. These needs lead to the emergence of process variants, differing in their control flow or performance while having commonalities with the original processes. Process variants or configurations are specific combinations of the system’s options. We consider process executions stored in event logs, where an event trace (or trace) is an ordered sequence of events. To explore process reengineering opportunities, it is necessary to identify which variant(s) may have produced a given trace. Existing variant analysis techniques Taymouri et al. (2021) do not answer this question but cover the inverse operation, i.e., they focus on the differences between already identified variants. This problem is not restricted to business processes and naturally extends to variability-intensive systems, which change their behaviour in response to the (de)activation of some options. Examples of variability-intensive systems include Software Product Lines (SPLs) Pohl et al. (2005); Apel et al. (2013), operating system kernels She et al. (2010); Mortara and Collet (2021), code generators Boussaa et al. (2016); Temple et al. (2021), or web-based frameworks Halin et al. (2019); Sánchez et al. (2017). Validating these systems is difficult because enumerating all variants, whose number can grow exponentially with the number of options, is generally infeasible Halin et al. (2019). In this context, locating variations is essential for any reengineering endeavour Assunção et al. (2017). Black-box testing techniques can also benefit from this information, e.g., to sample which variants should be tested first Halin et al. (2019).

To support these activities, in this article, we train Recurrent Neural Networks (RNNs) Rumelhart et al. (1986); Schuster et al. (1997) architectures with different hyperparameters (loss and activation functions among others) to predict the candidate variant(s) that could have produced a given event trace. We make the following contributions:

  (i) the first variability-aware approach, which we called VaryMinions, to map execution traces to variants of a system;

  (ii) a detailed account of the usage of Long Short-Term Memory networks (LSTMs) Hochreiter and Schmidhuber (1997) and Gated Recurrent Units (GRUs) Cho et al. (2014), two RNN architectures, on six different datasets describing business processes and course management system variants;

  (iii) four openly available datasets based on Claroline Devroey et al. (2014, 2017); Devroey (2020), containing \(2 \times 10\) and \(2 \times 50\) configurations with 5,000 traces per configuration;

  (iv) a characterisation of the intrinsic learning difficulty for variability-intensive systems.

Methodology

For the first contribution, we show empirically that VaryMinions can distinguish 50 variants from 5,000+ event traces per variant. In our second contribution, we successfully determine the variant(s) responsible for generating an event trace with high accuracy (\(>80\%\)), regardless of whether the GRU or LSTM model is employed. To measure the learning difficulty, we define and compute a metric based on the amount of behaviour shared amongst event traces.

Open Science Policy

We also provide a replication package Fortz et al. (2022) containing an implementation of our approach, built on two common Python frameworks that provide RNN implementations (namely Tensorflow Developers (2021) and Keras Chollet et al. (2015)), as well as all the results of our experiments.

These contributions extend our preliminary research published at the MaLTeSQuE 2021 workshop Fortz et al. (2021). While our previous paper focused solely on business processes, this article adds a new source of datasets issued from the VIS domain: Claroline Devroey et al. (2014, 2017); Devroey (2020), a course management system that was reverse-engineered from an instance in-use at the University of Namur. We derive four new datasets from this newly added system, forming a much more challenging learning problem (up to 50 variants instead of 5), and we assess the effect of sampling (random uniform vs dissimilarity-based) on the outcome. In addition, we reran all our previous experiments and the new ones at one of the Belgian universities’ HPC facilities. We also refactored the VaryMinions source code to ease its reuse and make it more configurable. To summarise, the added value of this extension comes from:

  (i) four new and more complex datasets from the VIS domain;

  (ii) a discussion about the effect of sampling on this classification task;

  (iii) a refactored implementation of VaryMinions.

Section 2 introduces process mining, VIS and RNNs. Section 3 motivates the use of VaryMinions. Section 4 gives an overview of the proposed solution, while Section 5 presents the datasets and the experimental setup with more details. Section 6 gives the results of our evaluation. Section 7 discusses certain factors influencing our experiments, such as hyper-parameter variability and alternate labelling of variants. Section 8 presents related work, and finally, Section 9 wraps up the paper.

2 Background

Our work tackles the problem of tracing back the system variant that produced some event logs. This is an issue common to process variants and variability-intensive systems. We address it by relying on techniques coming from the Deep Learning community. In the following, we introduce these different concepts.

2.1 Process Variants

Nowadays, many organisations work with multiple (business) processes in parallel that can highly depend on environmental and human factors. For instance, a business process can be influenced by regional laws, available resources, the size of the organisation, etc. Most of them share common behaviours meaning that for one general business process, one can define several process variants, each one behaving (slightly) differently from the other variants. Similar process variants gather in process lines or process families and can be modelled using different formalisms Rosa et al. (2017).

Analysing the specificities and commonalities of process variants allows scale economies and helps practitioners to improve the general business process, define new variants or maintain existing ones Taymouri et al. (2021).

For process understanding and reverse engineering purposes, one commonly inspects execution logs. Indeed, they contain valuable information on the process behaviour in production. If the process owns several variants, one must know which variant(s) are involved in a behaviour of interest. Unfortunately, event logs do not usually contain information about a specific variant (or set of variants) which (could have) produced the sequence of events (i.e., the event trace). This can prevent practitioners from understanding why this behaviour occurs for one variant and not another. In this paper, we address the problem of identifying process variants that have (potentially) shown a specific behaviour, based on a given event trace. This mapping information is key in various re-engineering activities such as variant process mining Taymouri et al. (2021). However, these activities are beyond the scope of this paper.

To demonstrate the feasibility of our variant process identification learning approach, we use two datasets that gather execution traces of business processes. They both come from the Business Process Intelligence Challenge, a yearly challenge organised since 2011 to stimulate process mining research on real-life datasets.Footnote 1 We selected two editions, modelling process variants: the first one from the year 2015 and the second from the year 2020.

2.2 Variability-Intensive Systems

Process families belong to the vast and heterogeneous category of Variability-Intensive Systems (VISs). These are software-based systems that exist in many variants to address the diversity of customer needs and usage contexts. Structured approaches, like Software Product Lines (SPLs) Pohl et al. (2005), facilitate the design, development and quality assurance of such systems. They consider a global base of software artefacts for a family of software systems, and allow variants to be produced through the (de)activation of options (also called features in the VIS world).Footnote 2 Reasoning at the family level rather than at the single system level yields significant economies of scale and quality improvements.

Variability Modelling

The variability of a VIS is usually decomposed using a tree-like structure called a (VIS) Feature Model (FM) Kang et al. (1990); Schobbens et al. (2007). An FM represents the common and variable aspects of the system. For instance, Fig. 1 presents the FM of a simple configurable beverage vending machine. The machine sells either soda or tea (or both) in euros or dollars, and may optionally support cancelling purchases and providing free drinks. As the number of possible variants increases exponentially with the number of options available, such compact representation of the variability of a VIS enables various kinds of analysis, including counting the number of possible variants, detecting dead options that can never be selected, etc. For instance, the vending machine of Fig. 1 counts already 24 possible (distinct) configurations. This number is very small compared to real-world VISs. For example, the Claroline case, which we will introduce in Section 5.2.2, has more than 5 million possible configurations for 44 options.
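As a hedged illustration of the counting analysis mentioned above, the following sketch enumerates configurations by brute force. The option names and the two constraints are assumptions derived only from the prose description of the vending machine (an or-group over beverages, an exclusive choice of currency, and two optional features); they are not the actual FM of Fig. 1. Enumeration is only practical for toy examples, and dedicated solvers are normally used for real FMs.

```python
from itertools import product

# Hypothetical option names, taken from the textual description only.
OPTIONS = ("soda", "tea", "euro", "dollar", "cancel", "free")


def is_valid(cfg):
    """Check the two constraints assumed from the prose description."""
    has_beverage = cfg["soda"] or cfg["tea"]     # soda, tea, or both
    one_currency = cfg["euro"] != cfg["dollar"]  # exactly one currency
    return has_beverage and one_currency         # cancel and free stay optional


def valid_configurations():
    """Enumerate all 2^n option assignments and keep the valid ones."""
    for values in product((False, True), repeat=len(OPTIONS)):
        cfg = dict(zip(OPTIONS, values))
        if is_valid(cfg):
            yield cfg


n_valid = sum(1 for _ in valid_configurations())
print(n_valid)  # 24 under this encoding: 3 beverage choices x 2 currencies x 2 x 2
```

Under this assumed encoding, the count agrees with the 24 configurations reported for the vending machine.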

Behavioural Modelling

Complementary to FMs, Featured Transition Systems (FTSs) Classen et al. (2013) are designed to represent compactly the behaviour of a VIS. An FTS is a transition system where each transition is labelled using (VIS) feature expressions (i.e., a Boolean formula referring to its options) to indicate which valid configurations of the VIS can execute the transition. For instance, Fig. 2 presents the FTS of the beverage vending machine of Fig. 1. As can be seen from the feature expressions, only specific configurations can execute some transitions: e.g., only vending machines with the free (f) option enabled can execute the free transition from state 1 to state 3. As for FMs, FTSs provide a way to represent compactly the behaviour of all the different configurations of a VIS.
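To make the role of feature expressions concrete, the sketch below encodes a small FTS and checks which configurations can execute a given trace. Apart from the free transition from state 1 to state 3, which is described in the text, all states, actions, guards and the three example configurations are hypothetical and do not reproduce Fig. 2.

```python
# Each transition: (source state, action, target state, feature expression).
# A feature expression is modelled as a predicate over the set of enabled options.
TOY_FTS = [
    (1, "pay",    2, lambda on: "free" not in on),
    (1, "free",   3, lambda on: "free" in on),
    (2, "change", 3, lambda on: True),
    (2, "cancel", 1, lambda on: "cancel" in on),
    (3, "serve",  1, lambda on: True),
]
INITIAL_STATE = 1


def can_execute(trace, enabled_options):
    """Return True if the configuration can execute the whole trace."""
    states = {INITIAL_STATE}
    for action in trace:
        # Follow every transition that matches the action and whose guard holds.
        states = {
            tgt
            for (src, act, tgt, guard) in TOY_FTS
            if src in states and act == action and guard(enabled_options)
        }
        if not states:
            return False
    return True


def label(trace, configurations):
    """One 0/1 entry per configuration: 1 if it can execute the trace."""
    return [1 if can_execute(trace, cfg) else 0 for cfg in configurations]


CONFIGS = [frozenset({"free"}), frozenset({"cancel"}), frozenset()]
print(label(["free", "serve"], CONFIGS))           # [1, 0, 0]
print(label(["pay", "change", "serve"], CONFIGS))  # [0, 1, 1]
print(label(["pay", "cancel"], CONFIGS))           # [0, 1, 0]
```

The second trace shows that a single trace may be executable by several configurations, which is why the mapping from traces to variants is not one-to-one.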

In this paper, we rely on the FM and FTS of Claroline, a highly configurable course management platform previously used at the University of Namur, to simulate executions of different configurations of a real system. This simulation offers a way to generate event logs in a controllable way without requiring running a large number of variants of the system. The Claroline FM and FTS were defined by Devroey et al. (2014, 2017); Devroey (2020), based on the logs of the implementation used at the University of Namur collected over 9 months.

Fig. 1 VIS Feature Model of a beverage vending machine Devroey (2020); Classen et al. (2013)

Fig. 2 VIS Feature Transition System of a beverage vending machine Devroey (2020); Classen et al. (2013)

2.3 Deep Learning and Recurrent Neural Nets

As explained previously, the number of possible variants grows exponentially with the number of VIS options. Similarly, the number of traces a system can generate is supposed to be infinite. These observations command the use of automatic reasoning instead of manual inspections. In particular, we rely on machine learning and deep learning techniques.

Deep Learning (DL) is a subset of machine learning techniques. DL techniques remain statistical, but they differ from traditional machine learning techniques, which rely on predefined features (or characteristics) compactly representing data. Historically, domain experts defined the relevant features and the procedures to extract them from raw data. In contrast, DL techniques can infer such features automatically while training, at the cost of more computational resources and time. In the last decade, DL techniques have performed well on a range of tasks and applications, such as image processing, driving assistance for autonomous vehicles, board games such as Go, sound processing, text processing, automatic translation, etc.

Different families of machine learning algorithms exist: decision trees or random forests, support vector machines, linear regressors, neural networks, etc. Thanks to their capability to model and handle complex relations, neural networks are at the centre of attention of DL techniques. There are various neural network architectures, each adapted to a specific task. For instance, convolutional neural networks excel at image processing while recurrent neural networks (RNNs) Rumelhart et al. (1986); Schuster et al. (1997) handle data sequences (such as text or speech).

Previous works applied RNNs to execution traces to predict the next event or the final execution state Evermann et al. (2017); Tax et al. (2017). When data sequences are too long, vanilla RNNs may face the so-called vanishing or exploding gradient problem Hochreiter (1998). During training, the back-propagation mechanism re-injects prediction errors backwards through the network, starting from its output layer, so that the network can ultimately provide the right outcome. Because errors are injected from the output, they tend to vanish before reaching the first layers, whose weights are then rarely adjusted. Conversely, the gradient can grow exponentially, yielding intractable computations. Two RNN architectures deal with longer sequences and long-term dependencies: Long Short-Term Memory (LSTM) Hochreiter and Schmidhuber (1997) and Gated Recurrent Unit (GRU) Cho et al. (2014). These architectures alleviate gradient issues Hochreiter (1998); Chung et al. (2014) by using gates to regulate the data flow and keep specific long-term data in memory. RNNs are composed of multiple units (sometimes referred to as cells), which can convey data from one to another. Typically, RNNs start with an embedding layer that transforms input data into multi-dimensional vectors.

Fig. 3 A unit of LSTM versus a unit of GRU

Figure 3 depicts an example of an LSTM unit (left) and a GRU unit (right). Inside one unit, gates regulate the data flow, deciding what data to keep and what to forget. Mathematically, gates are functions (e.g., sigmoid) expressing the amount of data to keep. We can define several types of internal gates for different purposes. An LSTM unit (Fig. 3a) is composed of three different state variables and three different gates. The variables represent respectively the input of the unit (i.e., the matrix computed by the embedding layer, called \(x_t\) in the figure), the output (called \(h_t\)), and the unit state (called \(c_t\)). The latter acts as the long-term memory of the network, registering data from previous units to pass it on to the next ones. The forget gate (on the left of the figure) conveys data from the previous unit directly to the next one. In particular, it may set some values to 0, making the network forget this data. The input gate (in the middle) defines how much data should be treated in the current unit. The final output of a unit travels through the output gate (on the right of the figure). To avoid gradient explosion, LSTM units use a tanh function (above the output gate) to keep data in a small range of values (i.e., between -1 and 1). In a GRU (Fig. 3b), the input and forget gates are merged (on the right of the figure) and there is no output gate. Consequently, the output and unit state variables are also combined into a single variable (named \(h_t\) in the figure). A GRU also offers a new type of gate (in the middle of the figure) expressing how relevant data from the previous unit is for the current unit.
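The gate mechanics can be sketched with the commonly used LSTM and GRU update equations, here reduced to a single unit with scalar input and state. This is an illustrative simplification rather than the implementation used in our experiments: the parameter names and values are arbitrary assumptions, and actual layers operate on vectors and learned weight matrices.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, p):
    """One step of a single-unit LSTM with scalar input and state."""
    f = sigmoid(p["wf"] * x_t + p["uf"] * h_prev + p["bf"])  # forget gate
    i = sigmoid(p["wi"] * x_t + p["ui"] * h_prev + p["bi"])  # input gate
    o = sigmoid(p["wo"] * x_t + p["uo"] * h_prev + p["bo"])  # output gate
    c_tilde = math.tanh(p["wc"] * x_t + p["uc"] * h_prev + p["bc"])  # candidate
    c_t = f * c_prev + i * c_tilde  # unit state: keep part of the past, add new data
    h_t = o * math.tanh(c_t)        # output, kept between -1 and 1 by tanh
    return h_t, c_t


def gru_step(x_t, h_prev, p):
    """One step of a single-unit GRU: no output gate, no separate unit state."""
    z = sigmoid(p["wz"] * x_t + p["uz"] * h_prev + p["bz"])  # merged input/forget gate
    r = sigmoid(p["wr"] * x_t + p["ur"] * h_prev + p["br"])  # relevance of previous data
    h_tilde = math.tanh(p["wh"] * x_t + p["uh"] * (r * h_prev) + p["bh"])
    # Conventions differ on whether z weights the old or the new value.
    return (1.0 - z) * h_prev + z * h_tilde


# Arbitrary illustrative parameters (assumption, not trained weights).
LSTM_PARAMS = {k: 0.5 for k in ("wf", "uf", "bf", "wi", "ui", "bi",
                                "wo", "uo", "bo", "wc", "uc", "bc")}
GRU_PARAMS = {k: 0.5 for k in ("wz", "uz", "bz", "wr", "ur", "br",
                               "wh", "uh", "bh")}

h, c, g = 0.0, 0.0, 0.0
for x in (1.0, -2.0, 0.5):
    h, c = lstm_step(x, h, c, LSTM_PARAMS)
    g = gru_step(x, g, GRU_PARAMS)
print(h, c, g)
```

With a strongly negative forget-gate bias, the forget gate is close to 0 and the new unit state no longer depends on the previous one, which is the "forgetting" behaviour described above.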

LSTMs and GRUs are efficient text classifiers, e.g., Liu et al. (2016); Kowsari et al. (2017). In this work, we want to create a mapping between execution traces, which are successions of events occurring in a specific order, and the configurations of a system that, supposedly, can produce them. In this context, an event does not appear randomly but depends on the previous succession of events: sometimes directly on the few previous ones, sometimes on an event that occurred much earlier in the trace. Thus, using LSTM and GRU architectures seems appropriate. Furthermore, because of this dependency in the sequence of events, we consider traces as text, i.e., an ordered sequence of symbols that follows a given grammar.

While RNNs are usually good fits to work on Natural Language Processing (NLP) tasks, there is little work trying to use RNNs in the context of technical documents or software specifications. Li et al. conducted a systematic literature review on extracting variants from text specifications Li et al. (2017).

None of the reviewed works relied on RNNs but used other classification models (decision trees, association rules, etc.). Recently, Arganese et al. investigated ambiguity in natural requirements as variability points Arganese et al. (2020), but the mapping concerns words rather than complete sequences.

3 Motivation: Behaviour-driven VIS Reverse-engineering via Black-box Learning

Over the past two decades, researchers have been focused on modelling the behaviour of SPLs for design and analysis purposes. Various paradigms for modelling SPL behaviour, such as Featured Transition Systems Classen et al. (2013) and Featured Finite State Machines Hafemann Fragal et al. (2017), have been defined. However, engineers typically manually create these models, which is time-consuming, error-prone, and not suitable for complex VISs. Recent efforts Damasceno et al. (2021, 2019) have attempted to automate this model creation process, but it is still in its early stages.

Most approaches to learning VIS behaviour rely on and extend Dana Angluin’s seminal \(L^*\) algorithm from 1987 Angluin (1987). This algorithm aims to infer a single system’s behaviour in a black-box and active manner, relying solely on execution traces obtained on the fly. In this case, access to the source code is unnecessary, but interaction with the System Under Learning (SUL) is essential. \(L^*\) follows a simple metaphor. The Learner constructs hypothesis models of the system by posing queries to the Teacher, who serves as a middleware between the Learner and the SUL. The Teacher can either validate the hypothesis or provide a counterexample if it is invalid, helping the Learner to update its hypothesis.

To adapt Angluin’s algorithm and accommodate variability, existing approaches Damasceno et al. (2021, 2019) introduce post-processing steps. For instance, learning each product variant and progressively merging them is one approach. However, this approach becomes impractical when dealing with a large number of variants. Another approach, instead of considering variants individually, would be to consider learning the VIS in a family-based fashion Fortz (2021, 2023). In both cases, it is essential to relate Angluin’s queries and counterexamples to configurations. Existing mappings are incomplete, as they rely on partial observations of the system. We assume that the Teacher only possesses knowledge of previously observed SULs, with all new configurations being unknown. Hence, in this scenario, a configuration prediction technique is required.

Existing SPL reverse engineering techniques usually assume the presence of an accurate FM. This is a strong hypothesis. Usually, FMs are built from requirements which are known to be ambiguous and partly implicit. FM reverse engineering approaches also have limitations in terms of completeness and soundness. It is much easier to assume a set of configurations especially in reverse engineering scenarios. Therefore, we aim for a solution that does not require an FM. In our context, we can only rely on the list of features (but without explicitly stating the constraints between them) and a list of the configurations used for classification.

To map configurations to variant products, white-box approaches rely on the source code or use a combination of software implementation artifacts and logs Michelon et al. (2021, 2023). We place ourselves in a strict black-box context in which the source code is not available. This is the case for the business processes we analysed. Therefore, we focus on execution logs only. These logs do not directly relate event traces with variants. Indeed, Cândido et al. Cândido et al. (2021) have pointed out that preemptively logging detailed information would result in enormous log files, reaching several terabytes, which would impede effective analysis.

Since extensive mapping information is not available, we propose to employ supervised machine learning to tackle the following challenge: how can we classify new incoming traces (previously unseen) to multiple variants?

Fig. 4 Description of the VaryMinions architecture

4 VaryMinions Overview

Figure 4 provides an overview of VaryMinions’ architecture. The input data (a) are a set of available execution traces. For training, traces are associated to the set of system variants that can produce them. The inputs first pass through an Embedding layer (b) that transforms the sequences of events into a vector of indexes (to make the representation more compact and to ease their processing). The embedding layer creates a structured space in which indexes that occur in a similar context are close. In this new representation space, indexes become vectors, and initial traces become tensors composed of numerical weights. This homogeneous representation allows performing mathematical operations on those weights through the rest of the network.

The embedding layer is also configurable, e.g., we need to specify the number of dimensions of the representation space and the number of dimensions in the output tensors. We keep the number of dimensions the same in input and output to avoid combining different input dimensions into one output dimension. We then link this layer to the RNN layer (c), which is instantiated with either LSTM or GRU units to learn the relationships between elements of the tensors. Again, this layer is configurable, in particular with the number and usable kinds of units (detailed in Section 5.3).

There exist unidirectional and bidirectional Schuster et al. (1997) units. Unidirectional layers only consider the processing of the sequence in one direction (from start to end). In contrast, bidirectional layers also handle the other direction (from end to start), which can be helpful in language processing. In our case, traces are fully available at training time. Reading them forward and backward can help grasp long-term relations between events. Because of our analogy with text, we use bidirectional units Schuster et al. (1997) only.

Then the network continues with one Dense layer (d) preparing for classification. We set the number of units in this layer to the number of classes (i.e., configurations). The output of the network (e) is a vector of 1s and 0s whose number of elements is equal to the number of configurations of the system. This vector classifies the trace into one or more configuration(s): 1s state that the associated configurations can generate the input trace and 0s that they cannot.

For instance, let us take a simple system with three configurations. The output vector is thus of size three. If our prediction model outputs the vector \(\left[ 1, 1, 1\right] \), it predicts that all the configurations can execute the input trace. If instead the output vector is \(\left[ 0, 1, 0\right] \), only the second configuration is able to produce the input trace. One should note that our models cannot provide the output vector \(\left[ 0, 0, 0\right] \) since the RNN selects at least the configuration with the highest score.
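One possible way to obtain such a vector from per-configuration scores is sketched below. The 0.5 threshold and the fallback to the highest score are assumptions made for illustration; they show how the "at least one configuration" property can be enforced, not necessarily how the networks realise it.

```python
def decode(scores, threshold=0.5):
    """Turn per-configuration scores into a 0/1 vector with at least one 1."""
    vector = [1 if s >= threshold else 0 for s in scores]
    if not any(vector):
        # No score reaches the threshold: keep the highest-scoring configuration.
        best = max(range(len(scores)), key=scores.__getitem__)
        vector[best] = 1
    return vector


print(decode([0.9, 0.8, 0.7]))  # [1, 1, 1]
print(decode([0.1, 0.4, 0.2]))  # [0, 1, 0]
```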

5 Evaluation Protocol

In the following, we describe our evaluation protocol to validate that we can learn which variants may have produced an execution trace. First, we state the research questions that drive this experimentation before describing the creation and annotation of our datasets. Then, we explain how we instantiated VaryMinions regarding our specific context. Finally, we present the running setup and the evaluation metrics.

5.1 Research Questions

We state the following research questions concerning the multi-classification of execution traces among the different VIS variants:

RQ1:

How accurately can we identify process variants based on their traces? This question addresses the effectiveness of our approach. To the best of our knowledge, this is the first attempt to use RNNs to learn such a mapping. Thus, we cannot compare it with the state of the art. Instead, we expect the RNNs to be at least better than random classifiers (accuracy higher than \(50\%\)).

RQ2:

What is the performance of LSTMs versus that of GRUs for process traces classification? We would like to know which model architecture is the most appropriate for this task, if any.

5.2 Datasets Selection and Preprocessing

We use six different datasets that we divide into two groups. The first group contains the 2015 and 2020 editions of the Business Process Intelligence Challenge (BPIC). Each dataset contains event logs, describing different executions of configurable processes:

  • BPIC15 (DS1) represents building permit applications in five municipalities, each one corresponding to a process variant van Dongen (2015); and

  • BPIC20 (DS2) gathers data from the travel reimbursement process at the Eindhoven University of Technology (TU/e), where variants correspond to different kinds of documents to be managed van Dongen (2020).

The second group consists of four datasets containing event logs describing executions of different variants of Claroline Devroey (2020); Devroey et al. (2017), an online course management system used at the University of Namur until 2018. Claroline was the main communication channel between students and lecturers, with approximately 7,000 users. Its architecture is plugin-based. Depending on needs, one can deploy new variants at runtime.

5.2.1 Business Process Intelligence Challenge (BPIC)

The original BPIC datasets (from van Dongen (2015, 2020)) contain only valid and complete traces, together with additional information. We prune the logs to keep only the process variant ID, the trace ID and the sequence of events. To cope with different trace lengths, we apply padding (i.e., filling shorter traces with meaningless events and using a mask to know where the processing should stop). Trace duplicates are removed, and since multiple variants can produce the same trace, we encode the variants into a binary vector (whose size matches the number of variants) that serves as a label. A value of one at the i-th index of the vector denotes that we observed the trace at least once for variant i. Traces associated with all variants thus have a vector full of ones. In the end, each trace is associated with one or more variants (i.e., classes). We expect the RNN models to learn these associations to predict the variant(s) for an unlabelled trace. We wrote this preprocessing procedure in Python as part of VaryMinions’ implementation Fortz et al. (2022).
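The preprocessing steps described above can be sketched as follows. This is a simplified stand-in written for illustration, not the code of the replication package: the function name, the (variant, events) input format, the use of index 0 for padding, and the example log are all assumptions.

```python
def build_dataset(log, n_variants, pad_index=0):
    """Deduplicate traces, build binary labels, index events and pad.

    `log` is an iterable of (variant_id, [event, ...]) pairs, with
    variant_id in 0..n_variants-1 (hypothetical input format).
    """
    # One label per distinct trace; a 1 at index i means variant i produced it.
    labels = {}
    for variant_id, events in log:
        label = labels.setdefault(tuple(events), [0] * n_variants)
        label[variant_id] = 1

    # Map each event name to an integer index; pad_index is reserved for padding.
    vocabulary = {}
    for trace in labels:
        for event in trace:
            vocabulary.setdefault(event, len(vocabulary) + 1)

    max_len = max(len(trace) for trace in labels)
    inputs, masks, targets = [], [], []
    for trace, label in labels.items():
        encoded = [vocabulary[event] for event in trace]
        padding = max_len - len(encoded)
        inputs.append(encoded + [pad_index] * padding)
        masks.append([1] * len(encoded) + [0] * padding)  # 0 marks padded positions
        targets.append(label)
    return inputs, masks, targets, vocabulary


LOG = [
    (0, ["register", "review", "approve"]),
    (1, ["register", "review", "approve"]),  # same trace, other variant
    (1, ["register", "reject"]),
    (0, ["register", "review", "approve"]),  # duplicate, removed
]
inputs, masks, targets, vocabulary = build_dataset(LOG, n_variants=2)
print(inputs)   # [[1, 2, 3], [1, 4, 0]]
print(masks)    # [[1, 1, 1], [1, 1, 0]]
print(targets)  # [[1, 1], [0, 1]]
```

In this toy log, the first trace is observed for both variants and therefore receives a label full of ones, while the second trace is specific to one variant.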

As described in Table 1, DS1 contains 5,542 traces after preprocessing, with a maximum of 154 events per trace. The five process variants are fairly equally represented since they contain 1,108 traces on average, with a minimum of 828 and a maximum of 1,350. Therefore, DS1 is well-balanced. DS2 contains 2,074 traces after preprocessing, with 5 process variants and a maximum of 90 events per trace. The least and most represented process variants contain 89 and 1,478 traces respectively, with an average of 415 traces per variant. Therefore, the dataset is imbalanced, suggesting it is harder to learn from accurately.

Table 1 Overview of the preprocessed datasets used in our experiments

To better characterise the learning complexity, Table 1 shows the number of traces per class (i.e., variant) and the overlap (i.e., percentage of variant-specific and shared behaviour) between classes. The number of traces provides a first indication of the learning difficulty: more traces generally yield a more accurate network once trained. DS1 contains equally represented classes with limited overlap (\(<0.5\%\) in the last column), while DS2 is less balanced in how classes are represented and how they are interleaved, denoting a shared behaviour between multiple variants. In particular, for DS2, there is a big overlap between the International Declaration and the Permit Request variants, and between the Prepaid Travel Cost and the Request For Payment variants, while the Domestic Declaration variant is completely separated.

5.2.2 Claroline

Claroline is a highly configurable web-based system whose behaviour depends on a set of activated options. In total, Claroline contains 44 options leading to more than 5,406,700 unique variants. Handling such a large configurable system is not trivial as it requires deriving different variants and executing them in various ways to trigger different behaviours and collect, format, and process the corresponding event logs. Setting up such pipelines is hard and outside the scope of this paper. For those reasons, we decided, instead of executing the actual system, to simulate executions of different variants using a Featured Transition System (FTS) capturing the behaviours of different configurations of Claroline. The FTS was reverse-engineered by Devroey (2020); Devroey et al. (2017), using a bigram inference method, from a 5.26 GB Apache webserver log containing 45,210,987 entries collected from January 2013 to September 2013. The final FTS consists of 106 states and 2,053 transitions.

Simulations

The simulation of a given Claroline configuration works as follows. First, the FTS is projected on the configuration (i.e., pruned) to keep only the subset of behaviours that can effectively be executed by the configuration. The result of that process is a classical transition system, describing a subset of the behaviours of Claroline. Second, the traces associated with the configuration are produced using random walks in the transition system. We generated 5,000 traces per configuration. To avoid infinite traces (e.g., in case of a loop in the transition system), we also limited the size of a trace to 300 events. We relied on VIBeS Devroey et al. (2015); Devroey (2022), a model-based testing tool for highly-configurable systems, to project the FTS and generate the traces.
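The two simulation steps can be illustrated with the following sketch. It does not use VIBeS and does not reproduce its API; the toy FTS, the option names, and the rule that a walk stops when it reaches a final state are assumptions made for the example. Only the projection step, the random walks, the 5,000 traces per configuration and the 300-event bound come from the description above.

```python
import random

# Hypothetical FTS: (source, action, target, feature expression as a predicate).
TOY_FTS = [
    (0, "login",  1, lambda on: True),
    (1, "forum",  1, lambda on: "forum" in on),
    (1, "wiki",   1, lambda on: "wiki" in on),
    (1, "logout", 2, lambda on: True),
]
INITIAL, FINAL = 0, {2}


def project(fts, enabled_options):
    """Prune the FTS: keep only transitions the configuration can execute."""
    return [(s, a, t) for (s, a, t, guard) in fts if guard(enabled_options)]


def random_walk(ts, rng, max_events=300):
    """Walk the projected transition system, bounded to max_events events."""
    state, trace = INITIAL, []
    while state not in FINAL and len(trace) < max_events:
        outgoing = [(a, t) for (s, a, t) in ts if s == state]
        if not outgoing:
            break  # dead end: no executable transition from this state
        action, state = rng.choice(outgoing)
        trace.append(action)
    return trace


def simulate(enabled_options, n_traces, seed=0, max_events=300):
    """Generate n_traces random traces for one configuration."""
    rng = random.Random(seed)
    ts = project(TOY_FTS, enabled_options)
    return [random_walk(ts, rng, max_events) for _ in range(n_traces)]


traces = simulate({"forum"}, n_traces=5000)
print(len(traces), max(len(t) for t in traces))
```

Because the projection removes the transitions guarded by disabled options, no generated trace for this configuration contains the wiki action.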

We relied on two different strategies to select the different simulated Claroline configurations: random selection and dissimilarity-based selection. The random selection consists in selecting a set of (valid) configurations using a dedicated generator ensuring a random distribution of the selection. In our case, we used CMSGen Golia et al. (2021), a fast uniform-like sampler. CMSGen comes with a default parameterisation, which we reused as is.Footnote 3 Unlike random, dissimilarity-based selection Henard et al. (2014) picks configurations in such a way that they are as dissimilar as possible when considering their selected options. For our evaluation, we used PLEDGE Henard et al. (2013), a search-based dissimilarity-driven configuration selection tool. We selected the default parameterisation of PLEDGE, with one minute per generation. We have set the number of configurations to simulate to 10 and 50. This way, we can go beyond the difficulty provided by the BPIC datasets and check that our method can run when the number of configurations is higher. While 50 is still small compared with the number of possible unique variants of Claroline (i.e., \(> 5,000,000\)), it is closer to a realistic setting.

Event Logs Datasets

We have derived the four different event logs datasets based on the following sets of configurations of Claroline:

Claroline Dissimilar 10 (DS3) regroups execution traces of 10 different configurations of Claroline, selecting the most dissimilar sets of options. This dataset should lead to more discriminated traces and better classifications.

Claroline Random 10 (DS4) gathers traces from 10 different instances of Claroline, randomly chosen to have a more realistic dataset.

Claroline Dissimilar 50 (DS5) is similar to DS3, but with 50 configurations to allow more diversity.

Claroline Random 50 (DS6) is similar to DS4, but with 50 configurations.

For each of these datasets (DS3 to DS6), the output of this generation process is a file containing 5,000 traces per configuration that we can use as an input for VaryMinions. In our case, we thus have either 50,000 traces per file (for 10 configurations) or 250,000 traces per file (for 50 configurations), as shown in Table 1. The last two columns of this table systematically show \(100\%\) of variant-specific behaviour and \(0\%\) of shared behaviour for Claroline datasets, meaning that for each trace, at least one action is specific to one variant of Claroline. This is due to the use of a sampler for selecting the configurations, giving very little control over the overlap between traces. Due to the huge number of possible variants (i.e., \(> 5,000,000\)), the chance of finding any shared behaviour between multiple variants is almost zero.

Table 2 Hyper-parameters settings

5.3 RNN Parameterisations

As we said before, because we use sequences of events, we investigate the use of RNNs to learn to which configuration(s) we can associate a trace. More specifically, we focus on LSTMs and GRUs. As for many DL models, hyperparameters must be defined. Because there are so many, we decided to vary only a few of them to try to understand how much impact they may have on learning. We focused on the functions that are used inside the networks and that may impact the quality of the predictions. We also manually selected a subset of hyperparameters that we fixed to a specific standard value. Hyper-parameters and their values are described in detail hereafter and summed up in Table 2.

Number of Hidden Layers

One specific aspect that impacts the learning capabilities of neural networks is their topology. Since the traces are short compared to text documents, we decided to use networks with only one hidden layer. This choice may avoid the overfitting that can emerge from more complex structures (e.g., auto-encoders) while offering satisfactory prediction performance.

Units

In our previous work Fortz et al. (2021), our experiments used different numbers of units regarding the RNN layer (c). This number affects the topology of the network and may help to grasp more complex concepts if this number increases. Yet, having too many units on a layer may lead to dealing with redundant information that will deteriorate the final prediction performances of the network Geman et al. (1992). On the contrary, a layer with a smaller number of units may not have the capability to grasp interesting information which may also harm the prediction performances Geman et al. (1992). Based on our previous experiences, we decided to set the number of units to 30 which has shown relatively good performances while limiting the training time.

Fig. 5 Sigmoid (blue) and tanh (orange) function responses represented by the Y-axis depending on the input signal (X-axis)

Training Set, Batch Size and Epochs

Other hyperparameters affect the training time and the optimisation of the many different parameters (e.g., weights between layers and units) of the networks. Common hyperparameters to set are the ratio of data used to train the model and those used to evaluate the performance of the model; the size of the batch of data that the model will have to deal with during training, which may mitigate overfitting; and the number of times the model will optimise parameters over the whole training set (i.e., the number of epochs). Each of these hyperparameters was set as follows:

  (i) the percentage of the data used for training is set to \(66\%\) of the whole dataset which is a common value in the ML community, the remaining traces are used in the test set to assess the generalisation performances of the trained models;

  (ii) we set the batch size to 128, which is adapted to the dataset size;

  (iii) we set the number of epochs to 20 to avoid overfitting. In our preliminary evaluations (evaluated between 10 and 50 epochs), a plateau was reached after approximately 15 epochs. We finally set the number of epochs to 20, to allow for small increases in accuracy.
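As a worked example of what these settings imply, the sketch below derives the number of training traces, batches per epoch and weight updates for the two Claroline dataset sizes. It assumes that the split is applied to the total number of traces and that the last, smaller batch of an epoch is kept; the function name `weight_updates` is ours.

```python
import math

TRAIN_PERCENT, BATCH_SIZE, EPOCHS = 66, 128, 20

def weight_updates(n_traces):
    """Return (training traces, batches per epoch, batches over the whole run)."""
    # Integer arithmetic avoids floating-point rounding noise.
    n_train = n_traces * TRAIN_PERCENT // 100
    batches_per_epoch = math.ceil(n_train / BATCH_SIZE)
    return n_train, batches_per_epoch, batches_per_epoch * EPOCHS

for n in (50_000, 250_000):
    print(n, weight_updates(n))
# 50000 (33000, 258, 5160)
# 250000 (165000, 1290, 25800)
```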

Activation Functions

Activation functions are defined at the level of units (i.e., neurons) and respond to an input signal. If the signal is strong enough, the neuron is activated and the output is also high. Though different activation functions can be used for each neuron, it is usual to define an activation function for an entire layer. We have used a Rectified Linear Unit (ReLU) function on the hidden RNN layer (i.e., (c) in Fig. 4) to alleviate the vanishing gradient problem. Regarding the Dense layer (d), we experimented with two common activation functions: sigmoid and hyperbolic tangent (tanh). Both are shown in Fig. 5. The main difference between them is their output range, which affects how they respond to negative input values. The sigmoid function takes its values in \(\left( 0;1\right) \), meaning that as the input values get closer to \(-\infty \) the neuron gets closer to not being activated at all (i.e., the output signal tends to 0), while as the input values get larger the response tends to 1. When the input value is 0, the response is 0.5. On the other hand, tanh takes its values in \(\left( -1;1\right) \). It may be useful to take into account negative correlations, and when the input value is 0, the response is also 0. Using one or the other may affect the “strength” of the signal that will reach the last layer for classification, in turn affecting which class (i.e., configuration) will be recognised.
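The difference between the two responses can be checked numerically with the standard definitions of both functions. The snippet below is only an illustration and is independent of our implementation.

```python
import math

def sigmoid(x):
    """Logistic function: output in (0, 1), with sigmoid(0) = 0.5."""
    return 1.0 / (1.0 + math.exp(-x))

for x in (-5.0, 0.0, 5.0):
    # tanh outputs lie in (-1, 1), with tanh(0) = 0.
    print(f"x={x:+.1f}  sigmoid={sigmoid(x):.4f}  tanh={math.tanh(x):+.4f}")
```

For a strongly negative input, sigmoid returns a value close to 0 whereas tanh returns a value close to -1, which is what allows tanh to carry a negative signal to the classification layer.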

Loss Functions

Loss functions are used during training to optimise the weights of the networks by back-propagating errors. We have used three loss functions already implemented in TensorFlow (Footnote 4), namely Binary Cross-Entropy (with and without logits, respectively named hereafter Bin-CE logits and Bin-CE) and the Mean Squared Error (MSE). The logit is defined as the inverse function of the sigmoid. We also implemented two custom loss functions: a variant of the Jaccard distance Jaccard (1901) (named Weight_Jaccard hereafter), and the Manhattan distance between two vectors. The motivation for these last two functions is that, because a single trace might be assigned to different process variants, the error should be defined by comparing the elements of two vectors rather than from a single value. This difference between two vectors should define a distance score. The Manhattan distance (sometimes called L1 norm) computes the sum of absolute differences between each element of the two vectors (i.e., in this case, the process variants). The Jaccard distance is based on the ratio between the number of equal elements of two vectors and their size. We have implemented a variant of the Jaccard distance to cope with floating-point values generated by the networks. The Jaccard distance was employed to evaluate trace dissimilarity in variability-intensive systems (e.g., Devroey et al. (2016)). Further discussions about the use and characteristics of these loss functions are provided in Section 7.2.
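The following sketch shows how these losses compare a label vector with a network output for a single trace. It uses plain Python rather than TensorFlow, and `soft_jaccard_distance` is only one possible real-valued generalisation of the Jaccard distance: it is an assumption made for illustration and not necessarily the Weight_Jaccard function of our replication package.

```python
import math

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    # Clip probabilities to avoid log(0).
    clipped = [min(max(p, eps), 1 - eps) for p in y_pred]
    return -sum(
        t * math.log(p) + (1 - t) * math.log(1 - p)
        for t, p in zip(y_true, clipped)
    ) / len(y_true)

def manhattan(y_true, y_pred):
    # L1 norm of the difference between the two vectors.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred))

def soft_jaccard_distance(y_true, y_pred):
    # Assumed generalisation for non-negative real values: 1 - sum(min) / sum(max).
    num = sum(min(t, p) for t, p in zip(y_true, y_pred))
    den = sum(max(t, p) for t, p in zip(y_true, y_pred))
    return 1.0 - num / den if den else 0.0

y_true = [1.0, 0.0, 1.0]   # hypothetical trace produced by variants 0 and 2
y_pred = [0.9, 0.2, 0.6]   # hypothetical network outputs
losses = {f.__name__: f(y_true, y_pred)
          for f in (mse, binary_cross_entropy, manhattan, soft_jaccard_distance)}
```

All four functions return 0 (up to clipping for the cross-entropy) when the prediction equals the label vector, but they penalise the same error differently, which is one reason why the choice of loss matters.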

5.4 Model Training

We decided to use only a training set and a test set in our evaluation due to the number of available execution traces. The training and performance evaluation process is done as follows: i) the entire dataset is randomly split into training and test sets. We have used the scikit-learn function train_test_split (Footnote 5), which ensures that the distribution of classes is similar in the two sets. ii) A model is trained using the training set. iii) Its prediction performances are evaluated on the test set. To mitigate biases in our analyses, we decided to train and evaluate the performances of each parameterisation ten times on each dataset. For each run, the whole training and performance evaluation process is started again (i.e., splitting into training and test sets, training the model, and evaluating its performances). Redoing the split each time reduces the chance of training and evaluating a model solely on a favourable pair of sets. It changes not only the data used for training the model but also their order of appearance, which may have an impact on the trained model.
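The repeated splitting can be sketched as follows. This is a plain-Python illustration of a class-preserving random split and not the library routine we used; the function `stratified_split`, the labels and the use of the run index as seed are assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_ratio=0.66, seed=None):
    """Return (train_idx, test_idx), keeping class proportions similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_ratio)
        train += indices[:cut]
        test += indices[cut:]
    # Shuffle again so that traces of the same class are not contiguous.
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test

labels = ["v1"] * 100 + ["v2"] * 50            # hypothetical variant labels
splits = [stratified_split(labels, seed=run) for run in range(10)]
```

Each of the ten runs gets its own split, so both the content and the order of the training set differ from one run to the next.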

5.5 Evaluation Metrics

This work is the first attempt to use RNNs to classify execution traces among variants of a system. One of our goals is to evaluate if such a DL technique is appropriate for this task. We thus computed four different standard metrics that are Accuracy, Precision, Recall, and F1-score.

Accuracy

To evaluate the quality of the models that have been learnt, the usual metric is the Accuracy measure. Accuracy is defined as

$$\begin{aligned} Acc = \frac{Number~of~correct~predictions}{Total~number~of~predictions} \end{aligned}$$
(1)

It is a standard measure in the ML community to assess how well a model performs from a high-level point of view. It has the advantage of being easily computable, and it can also be used to refer to the proportion of wrong predictions (i.e., \(1-Accuracy\)).

However, when classes are not well balanced (i.e., the number of traces is much larger for at least one class than for the others), Accuracy may hide some important information: the number of correct predictions for the classes with more data may outweigh the number of wrong predictions for the others, resulting in a high ratio. To mitigate this aspect in our analysis, we also consider other measures.
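A small, purely hypothetical example makes the issue concrete: a degenerate model that always predicts the majority class reaches a high accuracy while never recognising the minority class.

```python
y_true = ["A"] * 95 + ["B"] * 5
y_pred = ["A"] * 100          # a degenerate model that always answers "A"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall_b = sum(t == p == "B" for t, p in zip(y_true, y_pred)) / y_true.count("B")

print(accuracy)   # 0.95
print(recall_b)   # 0.0
```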

Precision

One usual metric to account for the performances of a prediction model is its precision. It can be calculated for each class as follows:

$$\begin{aligned} Precision = \frac{Number~of~correct~predictions}{Number~of~predictions~for~the~class} \end{aligned}$$
(2)

where Number of predictions for the class is the number of correct predictions plus the number of data that are wrongly predicted to belong to the class (i.e., false positives).

We gathered all these individual precision measures into a global one using a weighted average:

$$\begin{aligned} Prec = \frac{\sum _{i=1}^c{Precision_i * supp_i}}{Number~of~data} \end{aligned}$$
(3)

where c is the number of classes, \(Precision_i\) the precision measure for class i, and \(supp_i\) the number of data with label i.

Recall

Similarly to the precision, the recall is also standard to report on the predictions of a model. It can also be calculated for each class and is defined as follows:

$$\begin{aligned} Recall = \frac{Number~of~correct~predictions}{Number~of~labeled~data~for~the~class} \end{aligned}$$
(4)

where Number of labeled data for the class is the number of data labelled with the class under consideration.

Similarly to the precision, we computed a weighted average to get an overall recall measure for the model:

$$\begin{aligned} Rec = \frac{\sum _{i=1}^c{Recall_i * supp_i}}{Number~of~data} \end{aligned}$$
(5)

where c is the number of classes, \(Recall_i\) the recall measure for class i, and \(supp_i\) the number of data with label i.

F1-score

The F1-score is obtained through the harmonic mean of precision and recall to get an overview of the global performances of the model in one single measure. The F1-score in the case of two classes is defined as:

$$\begin{aligned} F1\text{-}score = 2\,\frac{Precision * Recall}{Precision + Recall} \end{aligned}$$
(6)

Again, we can apply this calculation to each class and average with a weight equal to the proportion of data of each class in the (test) set to get an overall value for the model. The last three metrics were computed by the precision_recall_fscore_support (Footnote 6) function in scikit-learn before being averaged. We also computed confusion matrices (Footnote 7) for each class. They are available in our replication package.
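Equations (2) to (6) and the weighted averages can be reproduced by hand on a toy example. The sketch below is a plain-Python illustration for the single-label case, with hypothetical labels; it is not the library routine we relied on.

```python
from collections import Counter

def weighted_scores(y_true, y_pred):
    """Support-weighted precision, recall and F1-score over all classes."""
    support = Counter(y_true)                    # supp_i: number of data with label i
    predicted = Counter(y_pred)                  # number of predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    prec = rec = f1 = 0.0
    for cls, supp in support.items():
        p = correct[cls] / predicted[cls] if predicted[cls] else 0.0
        r = correct[cls] / supp
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = supp / len(y_true)
        prec, rec, f1 = prec + p * weight, rec + r * weight, f1 + f * weight
    return prec, rec, f1

y_true = ["v1", "v1", "v1", "v2"]
y_pred = ["v1", "v1", "v2", "v2"]
prec, rec, f1 = weighted_scores(y_true, y_pred)
```

Here class v1 has a precision of 1 and a recall of 2/3, class v2 a precision of 0.5 and a recall of 1, so the weighted precision is 0.875 and the weighted recall 0.75.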

5.6 Running Infrastructure

Finally, Table 2 shows: \(2\ models \times 1\ \#units \times 1\ \%training\ set \times 1\ batch\ size \times 1\ \#\ epochs \times 2\ activation\ functions \times 5\ loss\ functions = 20\) different parameterisations of RNNs. We conducted these experiments on three different HPC facilities hosted by the CÉCI (Footnote 8). On the first cluster, called Dragon1, we used 1 CPU with 8 cores per task (Intel Sandy Bridge, E5-2650 processors at 2.00 GHz) with a Tesla Kepler accelerator (K20m, 1.1 Tflops, float64). For runs on Dragon2, we used 1 CPU with 12 cores (Intel SkyLake, Xeon 6126 processors at 2.60 GHz) associated with an NVidia Tesla Volta V100 accelerator (5120 CUDA Cores, 16GB HBM2, 7.5 TFlops, double precision). On the third cluster, Hercules, we had access to 1 CPU with 8 cores per task (Intel Sandy Bridge, Xeon E5-2660 processors at 2.20 GHz) with an NVidia GeForce accelerator (RTX 2080 Ti, 7.5 TFlops, double precision). Each CPU has been allocated 3 GB of RAM. All our scripts are written in Python 3, with the Keras and Tensorflow frameworks for deep learning. Our replication package is openly available Fortz et al. (2022).

We repeated each experiment ten times, randomly defining different training and test sets for each repetition (see Section 5.4). For each repetition, we evaluate the model by computing all the metrics described in Section 5.5. In total, running our 20 different network parameterisations with 10 repetitions on the six different datasets resulted in \(20 \times 10 \times 6 = 1,200\) runs and more than 151 days of execution. The time needed for a single execution varies between 44 seconds and 13 hours depending on the dataset and the GPU type.
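The size of the experimental grid can be checked by enumerating it. The sketch below uses the names introduced in Section 5; the dictionary keys and the short dataset identifiers are our own illustrative choices.

```python
from itertools import product

models = ["LSTM", "GRU"]
activations = ["sigmoid", "tanh"]
losses = ["Bin-CE", "Bin-CE logits", "MSE", "Weight_Jaccard", "Manhattan"]
# Hyperparameters fixed to a single value.
fixed = {"units": 30, "train_ratio": 0.66, "batch_size": 128, "epochs": 20}

parameterisations = [
    {"model": m, "activation": a, "loss": l, **fixed}
    for m, a, l in product(models, activations, losses)
]
datasets = ["BPIC15", "BPIC20", "DS3", "DS4", "DS5", "DS6"]
runs = [(p, d, r) for p in parameterisations for d in datasets for r in range(10)]

print(len(parameterisations), len(runs))   # 20 1200
```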

Fig. 6 Boxplots showing the Accuracy over 10 runs for each parametrisation of each dataset

Fig. 7 Boxplots showing the Precision over 10 runs for each parametrisation of each dataset

Fig. 8 Boxplots showing the Recall over 10 runs for each parametrisation of each dataset

Fig. 9 Boxplots showing the F1-Score over 10 runs for each parametrisation of each dataset

6 Evaluation Results

In this section, we answer our two research questions separately based on:

  • box-plots presented in Figs. 6, 7, 8 and 9, showing accuracy, precision, recall and F1-score for each parameterisation of each dataset;

  • a multi-comparison statistical analysis (see Fig. 10), using Friedman’s test with Nemenyi’s post-hoc analysis;

  • Tables presented in Appendix A, with average and standard deviation for the four computed metrics.

All the results (i.e., for each execution of each parameterisation) are also available in our replication package Fortz et al. (2022), including the code to compute the metrics, box-plots and statistical tests.

6.1 Performance (RQ1)

Table 3 reports the averaged accuracy (over 10 runs) of the 20 considered parameterisations of RNNs, over the 6 datasets (i.e., 120 models). We group the models into LSTM and GRU (columns) and the average accuracies into three categories according to predefined thresholds: i) below \(50\%\), where we consider models as performing worse than a random assignment to system variants and thus useless; ii) between \(50\%\) and \(70\%\), where we consider models as being slightly better than random assignments; iii) over \(70\%\), where we consider the models as performing well. Out of the 120 models, 44 RNN parameterisations (first row) yield an accuracy higher than \(70\%\), 15 are between \(50\%\) and \(70\%\), and the remaining 61 have an accuracy below \(50\%\). It means that nearly half of the considered models perform better than a random guess, a majority of which (i.e., 44 parameterisations out of 59) perform well in our context.

The highest averaged accuracy for datasets BPIC15 and BPIC20 (top of Fig. 6, or Tables 4 and 5 in Appendix) is \(88\%\) and \(87\%\) respectively with high stability (i.e., low standard deviation). On BPIC20, only five parameterisations out of twenty do not reach \(50\%\). Even better, for BPIC15 only five parameterisations are lower than \(70\%\) of accuracy. Top of Figs. 7, 8 and 9 confirm these results by giving similar values for precision, recall and F1-score respectively.

Despite the complexity of Claroline datasets, at least one parameterisation obtains an averaged accuracy of \(80\%\) for each dataset. For Claroline Dissimilar 10 (middle left of Fig. 6 and Table 6 in Appendix), the top parameterisation reaches \(99.6\%\) and 4 different parameterisations are above \(85\%\). Claroline Random 10 and Random 50 (middle and bottom right of Fig. 6, or Tables 7 and 9 in Appendix) also have several parameterisations above \(80\%\), and their top one gets over \(95\%\) of accuracy. Claroline Dissimilar 50 (bottom left of Fig. 6, or Table 8 in Appendix) has only one row with an averaged accuracy of \(80\%\) and only two other rows above \(70\%\). Among the remaining, 15 rows are below \(30\%\).

Note that, for Claroline Dissimilar 50, boxplots are either spread out or centred on low values (Bottom left of Fig. 6). Moreover, the top three rows also report a high standard deviation for the accuracy (i.e., higher than 0.13 and up to 0.38, in Table 8 in Appendix). It highlights that the results lack stability: at least one execution out of ten does not belong to the same value range. Regarding Claroline Random 10 and Claroline Random 50 (middle and bottom right of Fig. 6), the top three parameterisations show very compact boxplots with few outliers. This suggests a more stable accuracy, as confirmed by a standard deviation between 0.03 and 0.11 for the accuracy (Tables 7 and 9 in Appendix). The top two parameterisations of Claroline Dissimilar 10 (Table 6) both show an accuracy higher than 0.99 and a standard deviation lower than 0.001, demonstrating very stable results.

Overall, the number of configurations of the Claroline system (10 or 50) influences neither the averaged accuracy nor the standard deviation. Similarly, how we sample configurations (random-based or dissimilarity-based) does not impact accuracy. As for BPIC15 and BPIC20, the other metrics (precision, recall and F1-score presented respectively in Figs. 7, 8 and 9) only confirm this analysis as they follow the same tendencies.


6.2 LSTM vs. GRU (RQ2)

Our second RQ is about the prevalence of each type of RNN. Can LSTM or GRU be considered better, and should one be preferred in this context? To answer this question, we hypothesize that one kind of RNN prevails over the other and perform a multi-comparison statistical analysis of each of the 20 RNN parameterisations on all 6 datasets. We used Friedman’s non-parametric test García et al. (2009) with a significance level \(\alpha = 0.05\). This test ranks parameterisations over accuracy and then determines if the differences between parameterisations are significant. We further complete this result with Nemenyi’s post-hoc procedure Japkowicz and Shah (2011); Nemenyi (1963) indicating the statistical differences between parameterisations. This procedure can determine equivalence classes, regrouping parameterisations that are statistically similar regarding accuracy.

Fig. 10 Result of Friedman’s statistical test along with Nemenyi’s post-hoc analysis over all datasets and parameterisations

Figure 10 shows the results of Nemenyi’s test. After executing Friedman’s test, we obtain a p-value under 0.001, meaning that there is a statistical difference between the accuracy of some of the parameterisations. Nemenyi’s post-hoc procedure shows that the minimum distance between two statistically different groups of parameterisations (i.e., the critical distance) is 3.828. The bottom of Fig. 10 shows the seven best parameterisations over all the datasets. Statistically, they are equivalent and perform better than the remaining others (belonging to a different group).
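The quantities behind this analysis can be sketched in a few lines of plain Python: the mean rank of each parameterisation over the blocks (a block being one set of comparable measurements), Friedman's statistic, and Nemenyi's critical distance. The toy accuracies are hypothetical, no tie correction is applied to the statistic, and the critical value `q_alpha` must be read from a table of the studentised range statistic divided by \(\sqrt{2}\); 2.343 is the tabulated value for three groups at \(\alpha = 0.05\).

```python
import math

def average_ranks(scores):
    """scores[b][j]: accuracy of parameterisation j in block b (higher is better).
    Returns the mean rank of each parameterisation (rank 1 = best)."""
    k = len(scores[0])
    totals = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: -row[j])
        i = 0
        while i < k:
            # Tied values share the average of the ranks they span.
            j = i
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            shared = (i + j) / 2 + 1
            for pos in range(i, j + 1):
                totals[order[pos]] += shared
            i = j + 1
    return [t / len(scores) for t in totals]

def friedman_statistic(mean_ranks, n_blocks):
    """Friedman's chi-square statistic computed from the mean ranks."""
    k = len(mean_ranks)
    return 12 * n_blocks / (k * (k + 1)) * (
        sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4
    )

def nemenyi_cd(k, n_blocks, q_alpha):
    """Critical distance between the mean ranks of two groups."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n_blocks))

scores = [[0.90, 0.80, 0.30], [0.85, 0.90, 0.20], [0.95, 0.70, 0.40]]
ranks = average_ranks(scores)
chi2 = friedman_statistic(ranks, len(scores))
cd = nemenyi_cd(k=3, n_blocks=len(scores), q_alpha=2.343)
```

Two parameterisations are then considered statistically different when their mean ranks differ by more than the critical distance.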

Four pairs of loss and activation functions out of ten seem to stand out from the test. They are:

  • MSE and sigmoid

  • binary cross-entropy and sigmoid

  • binary cross-entropy with logits values and sigmoid

  • MSE and tanh (with LSTM only)

For most datasets, these parameterisations can predict the right set of variants with an accuracy greater than (or very close to) \(70\%\) (confirmed by Fig. 6 and Appendix A). However, sometimes a combination also gives bad results. It is the case with MSE and sigmoid, both with LSTM and GRU, where accuracy does not exceed 0.25 for Claroline Dissimilar 50 (Table 8).

We can observe that the dedicated loss functions (Manhattan and Jaccard distance) give markedly worse results than the other “classical” loss functions. Nemenyi’s procedure (Fig. 10) assigns them the highest mean ranks. For all Claroline datasets, the accuracy is always under \(30\%\). On BPIC15 and BPIC20, they give better results (up to \(82\%\) and \(58\%\), respectively) but still lower than the other loss functions.

Regarding the activation functions, our statistical analysis shows that 6 out of 7 best parameterisations use sigmoid instead of tanh.

Nemenyi’s procedure shows that LSTMs are present in 4 of the top parameterisations and GRUs in 3 of them. However, these parameterisations are indistinguishable regarding accuracy (i.e., their mean ranks differ by less than the critical distance of 3.828). Appendix A shows that the best parameterisation is an LSTM for BPIC15, BPIC20 and Claroline Dissimilar 50, but it is a GRU for the three other datasets. GRU is also the model giving the best accuracy amongst all datasets with up to \(99.6\%\) for Claroline Dissimilar 10 (Table 6). Moreover, the count of LSTMs and GRUs in each category of Table 3 shows similar numbers and indicates that using GRU or LSTM does not influence the results.

Table 3 Number of RNN parameterisations reaching predefined accuracy thresholds

7 Discussion and Future Work

This section discusses threats to validity that we identified and other aspects driving our future works.

7.1 Threats to Validity

Internal Validity

The datasets we used contain clean and consistent traces (i.e., they omit inconsistent traces when the system crashes or an unexpected event occurs). This property is ensured either by the BPIC community van Dongen (2015, 2020) or by the use of an FTS model and the VIBeS framework Devroey et al. (2015); Devroey (2022) as a trace generator (for Claroline). For a new VIS, a preprocessing step should take care of trace consistency (i.e., a trace should capture a complete user session). It does not entail that the dataset captures the whole system’s behaviour. Indeed, logs and models inferred from them represent a partial view of it.

To assess the difficulty of the learning process (i.e., being able to map logs to variants while sharing parts of the traces), we defined our own metrics (see last two columns of Table 1). This definition is inspired by our experience in analysing VISs where commonalities and variabilities between behaviours are key to the analysis. These metrics come from the analysis of the dataset only and give a better understanding of the intrinsic complexity of the learning problem. While they are fairly simple and high-level, they can be computed quickly but do not provide fine-grained differences (as the Levenshtein distance Levenshtein et al. (1966) would do but at the cost of longer computations). Finding the right trade-off between simplicity to compute and precision is left to future work.

The deep learning community is very active, leading to new types (or combinations of types) of models appearing every few months, especially for image processing tasks, where competition is fierce. It is less so regarding models dedicated to time sequences. We selected LSTMs and GRUs for their ability to deal with temporal sequences and to evade the vanishing or exploding gradient issue.

We evaluated 20 distinct parameterisations of RNNs over six datasets. We designed them regarding our goal, based on our previous work Fortz et al. (2021). However, since exhaustive coverage of the hyperparameter space is impossible, we may have missed some relevant parameterisations. Dealing with the inherent variability of hyperparameters is a research challenge per se.

A way to optimise the parameterisations is to use hyperparameter tuning techniques such as random search or auto-ML Nagarajah and Poravi (2019). We did not use any in this work but tried to scope the parameterisation space with a manual approach similar to a grid search Fortz et al. (2021). One motivation for this choice is that VaryMinions is the first effort to use RNNs to classify execution traces for variants of systems. Thus, we were not interested in finding the best-performing model (i.e., the goal of hyperparameter tuning). Rather, we show that, within a reasonable effort, finding a suitable RNN model parameterisation performing well is possible.

External Validity

Compared to our initial results Fortz et al. (2021), we augmented our experimental setup using Claroline, a VIS. Though our method applies to two different application domains, we cannot ensure that it generalises to all configurable systems. We used six different datasets with different characteristics, which mitigates the risk that our method works only on simple datasets. Among the ones we have used, some were taken from existing competitions (BPIC), and some were generated from scratch (Claroline), allowing us to vary and control the complexity of the learning by modifying the number of traces available and/or the number of configurations to deal with. Let us note that reverse-engineered models from logs necessarily form an incomplete representation of the behaviour of the system. Indeed, logs cannot capture all execution traces, whose number is often infinite for any real-world system. Besides, we do not guarantee that our cases cover the whole spectrum of VISs, given their diversity and widespread use.

A problem when using DL techniques in such a context is imbalanced representations in the training set. The training set may contain fewer occurrences of a configuration of a system or a process (e.g., because of lower popularity or because fewer actions need to be performed), with the risk that the trained model may neglect classification errors involving these configurations since they can be considered as rare events. While the Claroline datasets were generated in such a way that imbalanced representations were limited, we had no control over the BPIC datasets. They exhibit configuration imbalance but our RNN models coped with it (i.e., successfully classifying traces belonging to these configurations). Thus, we took no further actions to mitigate this aspect. Of course, class imbalance impact is case-specific.

Replicability

To prevent potential replicability issues, our implementation of VaryMinions and all the results presented in this paper are publicly available on Zenodo for long-term storage Fortz et al. (2022).

7.2 Hyperparameter Variability

The use of RNNs in this context requires carefully dimensioning the network and considering many parameterisations that can influence classification performances. In what follows, we discuss two elements that may influence them.

Loss Functions

We use the mean squared error (MSE) to evaluate prediction errors while training a network, a loss that is traditionally preferred when tackling a regression problem. However, Hui and Belkin (2021) showed that this assumption lacks solid theoretical foundations and that MSE is suitable for classification, in particular for NLP applications, where MSE usually outperforms cross-entropy.

The choice of the loss function is tricky since we need to take care of multiple aspects: the formalisation of the problem (e.g., single or multi-label, regression or classification) or the way to compute errors. Even when trying to choose the loss function according to these points (e.g., Jaccard distances have been used to solve SPL problems, as in Devroey et al. (2016)), our results indicate that the MSE works surprisingly well. Given the importance of a loss function on the observed performance, experimenting with additional loss functions appears promising. For example, the focal loss Lin et al. (2020), which penalises more misclassified instances than well-classified ones, is a perspective that we aim to follow.

The interplay of Losses and Activations

We deliberately chose to explore custom loss rather than activation functions. Loss functions are easier to adapt to the problem at hand (by quantifying how far we are from the true label) acting on the network output. Yet, activation functions and loss functions have distinct roles in the network, and they should be considered complementary and not independent. Both are important in the learning process. Activation functions come after every layer inside the network and, together with the weights, set the importance of a specific neuron through the propagation of the network. Loss functions are defined at the end of the network and are used to provide the final class(es). Loss functions are also used to back-propagate the classification errors through the network to optimize the weights in the training phase. From this short description, it is clear that activation and loss functions’ interactions affect the model performance. The former may block or lower the importance of discriminative information if incorrectly set while the latter defines the distance from the labels, from which the network optimises itself. Hence, assessing the impact of one type of function alone is not possible. Further investigations on which combinations would be best suited are needed. Defining new custom activation functions for this specific context is a possible option.

Complexity of the Neural Networks

We argued that learning a trace-to-variant mapping was feasible due to the number of traces w.r.t. the limited number of process variants. Generally, the challenge lies in the fact that having temporal sequences forces dependencies between elements that are usually learned separately. We suppose that deeper RNNs (i.e., increasing the number of hidden layers) may have a positive impact. Adding more layers increases the complexity of the model (and requires more resources for training), but may allow for a more accurate mapping between traces and variants. Yet, the risk of overfitting must not be neglected. In the future, we will also consider architectures such as auto-encoders to produce a compact internal representation of traces, which could be more efficient in discriminating them according to the process variants. Similarly to other application domains (e.g., image or speech processing), learning more compact representations could rely on new feature descriptors instead of only considering events of a trace.

7.3 Variant-based vs. Option-based Labelling

Our results indicate that applying classification techniques on a variant-based approach (i.e., identifying the variants producing a specific trace) using RNNs is promising. However, it has a major drawback: being able to predict that a trace is generated by a variant requires seeing at least one (usually many more) trace(s) generated from this variant. Said differently, enumerating all the variants and executing them all at least once is required for further predictions. While the number of variants was limited in our evaluation, the combinatorial explosion problem inherent to VISs may prevent us from applying these techniques to larger configurable processes like, for instance, continuous integration workflows with hundreds of options, leading to an intractable number of possible variants.

One future possibility to address this limitation is to work on data representation. Indeed, a variant is formed by a combination of (Boolean) options, corresponding to a configuration of the system. Even if we cannot enumerate variants, enumerating options is possible. In this case, we need a new representation that can depict the three states of each option: activated, deactivated or undetermined (i.e., the presence of the option is not relevant for the current context). The neural network will learn a partial configuration, allowing for a more fine-grained mapping. This would be useful to precisely locate a combination of options yielding a given anomalous event trace. One can use such learned models in fault localisation and repair techniques Fahland and van der Aalst (2015). Like all labelling approaches, this new option-based approach is costly, but, unlike a variant-based approach, it is feasible. For example, in Claroline Devroey (2020); Devroey et al. (2017) we have more than \(5\) million variants but only \(44\) different features. However, this new approach comes with its own challenges. Predicting the wrong features can potentially lead to a violation of the FM’s constraints, creating an invalid configuration.
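
A three-state, option-based label could be sketched as follows. This is only one possible reading of the idea: we assume that an option is undetermined when the candidate variants of a trace disagree on it, and the option names and the "requires" constraint are hypothetical rather than taken from the Claroline feature model.

```python
from enum import Enum

class OptionState(Enum):
    ACTIVATED = 1
    DEACTIVATED = 0
    UNDETERMINED = -1

def partial_configuration(candidate_variants, options):
    """Summarise the candidate variants of a trace as one state per option."""
    if not candidate_variants:
        raise ValueError("at least one candidate variant is required")
    states = {}
    for option in options:
        observed = {option in variant for variant in candidate_variants}
        if observed == {True}:
            states[option] = OptionState.ACTIVATED
        elif observed == {False}:
            states[option] = OptionState.DEACTIVATED
        else:  # the candidates disagree, so the trace does not settle this option
            states[option] = OptionState.UNDETERMINED
    return states

def violated_requirements(states, requires):
    """Return the hypothetical 'a requires b' constraints broken by `states`."""
    return [
        (a, b) for a, b in requires
        if states[a] is OptionState.ACTIVATED
        and states[b] is OptionState.DEACTIVATED
    ]

# Hypothetical options and candidate variants for one trace.
options = ["forum", "chat", "wiki"]
candidates = [frozenset({"forum", "chat"}), frozenset({"forum"})]
states = partial_configuration(candidates, options)

# Hypothetical constraint: "chat" requires "forum".
requires = [("chat", "forum")]
# A (hypothetical) wrong prediction that breaks the constraint.
predicted = {
    "forum": OptionState.DEACTIVATED,
    "chat": OptionState.ACTIVATED,
    "wiki": OptionState.UNDETERMINED,
}
broken = violated_requirements(predicted, requires)
```

With such an encoding, the output size grows with the number of options rather than with the number of variants, and a constraint check of this kind could flag predictions that do not correspond to a valid configuration.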

7.4 Data Availability

As for any DL technique, the issue of data availability is also present in this work. We managed to train our models with few execution traces (i.e., thousands) compared to the potentially infinite number of traces that the considered systems can produce. However, VaryMinions remains a supervised machine-learning technique and requires a set of execution logs, labelled with the variants of the system that have produced them.
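
The labelling requirement can be illustrated with a small sketch that turns per-variant logs into multi-hot labels, so that a trace produced by several variants carries all of them. The log structure, variant names and events are hypothetical and do not reflect the actual VaryMinions data format.

```python
from collections import defaultdict

def label_traces(logs_by_variant):
    """Map each distinct trace to a multi-hot vector of its producing variants."""
    variants = sorted(logs_by_variant)
    producers = defaultdict(set)
    for variant, traces in logs_by_variant.items():
        for trace in traces:
            producers[tuple(trace)].add(variant)
    labels = {
        trace: [1 if variant in found else 0 for variant in variants]
        for trace, found in producers.items()
    }
    return variants, labels

# Hypothetical logs: the trace (login, pay) is produced by both variants.
logs = {
    "v1": [["login", "pay"], ["login", "browse"]],
    "v2": [["login", "pay"]],
}
variants, labels = label_traces(logs)
```

Here the shared trace receives the label vector with both positions set, which is what makes the task a multi-label rather than a single-label classification problem.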

To reduce the labelling effort, semi-supervised learning techniques Chapelle et al. (2006) seem interesting. Semi-supervised learning applies when some data in the training set are labelled but the majority are not (e.g., because a prohibitive labelling cost prevents labelling more than a few tens of instances). The goal is thus to learn a model while automatically labelling the unlabelled data. In this area, label propagation Lee et al. (2013); Iscen et al. (2019) automatically assigns a label to an unlabelled instance by propagating the labels of similar, already labelled data. We envision using the same technique (or an adapted version) to reduce the labelling effort while taking into account more and more execution logs, which may improve the prediction performance of VaryMinions models.
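
The intuition behind propagating labels to similar data can be sketched with a deliberately simplified nearest-neighbour scheme. This is not the method of Lee et al. (2013) or Iscen et al. (2019): the Jaccard similarity over event sets, the threshold of 0.5 and the example traces are all assumptions made for illustration.

```python
def jaccard(trace_a, trace_b):
    """Similarity between two traces, computed on their sets of events."""
    events_a, events_b = set(trace_a), set(trace_b)
    if not events_a and not events_b:
        return 1.0
    return len(events_a & events_b) / len(events_a | events_b)

def propagate_labels(labelled, unlabelled, threshold=0.5):
    """Give each unlabelled trace the label of its most similar labelled trace.

    Traces whose best similarity is below `threshold` stay unlabelled.
    """
    assigned = {}
    for trace in unlabelled:
        best_label, best_score = None, 0.0
        for known_trace, label in labelled:
            score = jaccard(trace, known_trace)
            if score > best_score:
                best_label, best_score = label, score
        if best_label is not None and best_score >= threshold:
            assigned[tuple(trace)] = best_label
    return assigned

# Hypothetical labelled and unlabelled traces.
labelled = [(["a", "b", "c"], "v1"), (["x", "y"], "v2")]
unlabelled = [["a", "b"], ["x", "y", "z"], ["q"]]
assigned = propagate_labels(labelled, unlabelled)
```

The threshold keeps dissimilar traces unlabelled instead of forcing a guess; a realistic similarity would also need to account for event order, which this set-based measure ignores.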

8 Related Work

This paper focuses on using DL techniques to reverse-engineer configurations. However, it is not the only context where DL has been used in conjunction with business processes or SPLs. This section gives an overview of existing approaches where both DL and variable systems meet.

8.1 Machine Learning for Process Monitoring and Mining

Machine learning, in particular deep learning, has been notably used in business process monitoring. For instance, ML models can use past observations to predict the next event in a process Tax et al. (2017); Di Mauro et al. (2019); Matzner and Eskofier (2021); Tello-Leal et al. (2018); Venugopal et al. (2021), the outcome of a process Kratsch et al. (2020); Wang et al. (2019); Bozorgi et al. (2020), the remaining time Sun et al. (2020); Welsing et al. (2021), vulnerabilities and anomalies Borkowski et al. (2019); Hariyanti et al. (2021); Nguyen et al. (2019); Nolle et al. (2018, 2020) or even performance Park and Song (2020). This vast research area, called predictive business process monitoring, has attracted several literature reviews (e.g., Neu et al. (2021); Harane and Rathi (2020)). ML can also be used to optimise existing processes Fernandes et al. (2019) or to get a compact representation of traces Bui et al. (2019, 2020). Recently, there has been interest in the interpretability of RNN models, specifically in a process mining context Hanga et al. (2020).

Han et al. (2020) use LSTMs to automatically discover business processes from textual documentation. However, their work is focused on single processes and does not highlight variability.

8.2 Engineering Configurable Processes

When trying to (reverse-)engineer configurable processes or even perform maintenance and/or evolution, some of the reported techniques rely on grammar-based or evolutionary algorithms, while others are machine learning (ML) oriented. The latter mostly consider tasks like clustering traces (e.g., Song et al. (2013)). However, few techniques allow retrieving a complete configurable process from event logs. Some approaches use genetic algorithms Buijs et al. (2013); La Rosa and Dumas (2008), but they are limited to a small number of variants. Another option is to use (configurable) process fragments to rebuild the configurable model Assy et al. (2015). Sikal et al. propose a pattern for variability discovery during process mining, but this approach is only methodological at this stage Sikal et al. (2018).

In our case, we focus on the classification task. Bobek et al. (2013) offer recommendations to configure variability-aware business processes at design time with Bayesian Networks. Clustering techniques have also been used Mans et al. (2008); De Weerdt et al. (2012); Varela-Vaca et al. (2019) to perform classification tasks in an unsupervised way, i.e., without knowing the classes to learn. Song et al. use dimensionality reduction techniques to improve trace clustering Song et al. (2013). In our context, we want to specify the variants (i.e., the classes) to learn. Finally, Hinkka et al. (2018) aim at categorising traces into classes, thanks to LSTMs and GRUs. However, their approach differs on several points: (i) they define artificial classes, and (ii) they focus on binary classification.

8.3 Machine Learning for Variability-Intensive Systems

While there is a growing interest in employing ML techniques for VIS engineering Pereira et al. (2020); Ferreira et al. (2021), to the best of our knowledge, classification of variants from behavioural traces using ML techniques has not been studied yet. ML approaches have been used to support performance prediction (e.g., Shu et al. (2020); Ha and Zhang (2019); Valov et al. (2015); Zhang et al. (2015); Kaltenecker et al. (2020); Alves Pereira et al. (2020); Bacciu et al. (2015)), performance optimisation (e.g., Martin et al. (2021); Dorn and Apel (2020); Weckesser et al. (2018); Weber et al. (2021); Velez et al. (2021)), to improve the search for good and acceptable configurations (e.g., Temple et al. (2021); Nair et al. (2017); Temple et al. (2016)) or to predict unwanted feature interactions Khoshmanesh and Lutz (2020); Li et al. (2020). Although some of these works also target classification tasks, they consider configurations as the main entry point of their approaches and do not take into account the behaviour of the studied systems. ML also supports usability prediction Vyas et al. (2019), attack and vulnerability detection Abdelrazek et al. (2019) and defect prediction Strüder et al. (2020); Amand et al. (2019). In particular, Strüder et al. demonstrated that artificial neural networks were suitable for this last task Strüder et al. (2020).

While ML can support VIS engineering, the converse, i.e., applying variability-aware techniques to neural networks, is also possible. For example, Ghofrani et al. (2019a, 2019b) proposed a new approach to reuse modules of deep neural networks without additional training. For their part, Ghamizi et al. developed a framework to explore variability amongst different neural network architectures and automated search-based techniques to find the optimal one for a given task Ghamizi et al. (2019, 2020).

8.4 Variability-Intensive Systems Reverse Engineering

Over the years, several approaches were proposed to reverse engineer VISs, and SPLs in particular. These techniques operate at different levels: variability model, mapping between options and VIS artefacts, and learning VIS design models.

8.4.1 Learning Variability Models

More than a decade of effort has been devoted to extracting options from VIS artefacts. Due to their popularity, most approaches target feature models Acher et al. (203); Lopez-Herrejon et al. (2015); She et al. (2011); Li et al. (2017); Martinez and Parsai (2018). Besides, Ramos-Gutiérrez et al. (2021) use process mining to retrieve the process of configuring an SPL. VaryMinions does not necessarily need a complete feature model but rather a set of variants. They can be sampled from a feature model as we did for the Claroline system or simply known via product descriptions.

8.4.2 Learning VIS Design Models

There also exist model-based approaches to recover an architectural model of a VIS Kerdoudi et al. (2019); Lima et al. (2019); Assunção et al. (2020). This can be useful when the system was not designed with the SPL paradigm in mind (but e.g., by using a clone-and-own approach) and when we want to perform complex maintenance or evolution tasks. Devroey et al. (2014) designed a technique to retrieve a behavioural model of an SPL. This technique, based on usage models inferred from logs, learns a candidate FTS which should be completed manually with annotations (feature expressions). This technique yielded the FTS model we used to generate Claroline datasets. On the other hand, the approach of Damasceno et al. (2019, 2021) is fully automated but limited to a few variants. Their proposal consists of an adaptation of a classical learning algorithm (\(L^*\), by Angluin (1987)) which is instrumented to merge individual models of each variant into a model of the complete SPL. Note that merging has a high (i.e., exponential) complexity with respect to the number of variants to merge.

In contrast, VaryMinions does not aim to learn a behavioural model but to build a mapping between behaviour and variants. It may prove useful to automatically annotate SPL models.

8.4.3 Learning VIS Mappings

Feature location is another task in VIS reverse engineering, whose techniques Cruz et al. (2019) divided into three categories: static (based on source code), dynamic (based on execution traces), and textual (based on NLP). Some techniques mix several approaches; for instance, Michelon et al. (2021) use a hybrid approach based both on static analysis of the source code and dynamic analysis of execution traces. However, the general idea of feature location is slightly different since these are white-box approaches whose purpose is to map features to source code (e.g., by source code annotations). Their goal is usually to help with maintenance and evolution. Moreover, classical feature location techniques (e.g., Michelon et al. (2021); Cruz et al. (2019)) do not use RNNs. In our case, we are more focused on associating behaviours with variants or directly with features (see future work in Section 7.3). VaryMinions is thus a black-box and dynamic approach that could be used to make a first classification of variants of interest before delving into the source code or other white-box artefacts.

9 Conclusion

In this work, we evaluated the relevance of using Recurrent Neural Networks (RNNs) to address the problem of how to multi-classify behavioural traces found in logs into the variant(s) they belong to. This mapping is highly relevant when debugging variability-intensive systems (VIS) as anomalous behaviour may result from the interaction of a few specific options belonging to some variants amongst a myriad. Based on the promising results we obtained for configurable business processes Fortz et al. (2021), we extended our experiments to Claroline, a configurable course management system previously re-engineered at the University of Namur. We assessed two popular RNN types – Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU) – under 20 distinct parameterisations on 6 datasets (2 from configurable processes and 4 generated from Claroline models). Our results show that it is always possible to learn a mapping with an accuracy of at least \(80\%\) Fortz et al. (2022). There is no prevalence of one particular model type (GRU or LSTM) among the best-performing models.

While we demonstrated that VaryMinions easily scales up to at least 50 variants and \(5,000+\) traces per variant, covering huge configuration spaces, e.g., learning a mapping for hundreds or thousands of configurations, may be problematic. This suggests the first item for our future work: offering an option-based encoding for the mapping problem, which would be less prone to variant explosion. We also intend to experiment with other loss functions and design new dedicated ones. Finally, new neural architectures may be considered, such as attention-based ones Vaswani et al. (2017).