VaryMinions: Leveraging RNNs to Identify Variants in Variability-intensive Systems’ Logs

From business processes to course management, variability-intensive software systems (VIS) are now ubiquitous. One can configure these systems’ behaviour by activating options, e.g., to derive variants handling building permits across municipalities or implementing different functionalities (quizzes, forums) for a given course. These customisation facilities allow VIS to support distinct relevant customer requirements while taking advantage of reuse for common parts. Customisation thus allows realising both scope and scale economies. Behavioural differences amongst variants manifest themselves in event logs. To re-engineer this kind of system, one must know which variant(s) have produced which behaviour. Since variant information is barely present in logs, this paper supports this task by employing machine learning techniques to classify behaviours (event sequences) among variants. Specifically, we train Long Short-Term Memory (LSTMs) and Gated Recurrent Units (GRUs) recur-Sophie


Introduction
Business processes capture the activities of every profit or non-profit, public or private organisation, coordinating humans and software to collectively deliver value. As organisations evolve, new needs appear, e.g., covering electric scooters for an insurance company or handling a change in the law about reimbursing travel expenses at the university. These needs lead to the emergence of process variants, differing in their control flow or performance while having commonalities with the original processes. Process variants or configurations are specific combinations of the system's options. We consider process executions stored in event logs, where an event trace (or trace) is an ordered sequence of events. To explore process reengineering opportunities, it is necessary to identify which variant(s) may have produced a given trace. Existing variant analysis techniques [119] do not answer this question but cover the inverse operation, i.e., focusing on the differences between identified variants. This problem is not restricted to business processes and naturally extends to variability-intensive systems, which change their behaviour in response to the (de)activation of some options. Examples of variability-intensive systems include Software Product Lines (SPLs) [6,103], operating systems kernels [93,111], code generators [14,121], or web-based frameworks [57,108]. Validating these systems is difficult because enumerating all variants, whose number can grow exponentially with the number of options, is generally infeasible [57]. In this context, locating variations is essential for any reengineering endeavour [9]. Black-box testing techniques can also benefit from this information, e.g., to sample which variants should be tested first [57].
To support these activities, in this article, we train Recurrent Neural Network (RNN) [107,110] architectures with different hyperparameters (loss and activation functions, among others) to predict the candidate variant(s) that could have produced a given event trace. We make the following contributions: (i) the first variability-aware approach, which we call VaryMinions, to map execution traces to variants of a system; (ii) a detailed account of the usage of Long Short-Term Memory (LSTM) [66] and Gated Recurrent Unit (GRU) [21] networks, two RNN architectures, on six different datasets describing business processes and course management system variants; (iii) four openly available datasets based on Claroline [30,32,33], containing 2 × 10 and 2 × 50 configurations with 5,000 traces per configuration; (iv) a characterisation of the intrinsic learning difficulty for variability-intensive systems.
Methodology. For the first contribution, we showed empirically that VaryMinions can distinguish 50 variants from 5,000+ event traces per variant. In our second contribution, we successfully determine the variant(s) responsible for generating an event trace with high accuracy (> 80%), regardless of whether the GRU or LSTM model is employed. To measure the learning difficulty, we defined and computed a metric based on the amount of behaviour shared amongst event traces.
Open Science Policy. We also provide a replication package [47] with an implementation of our approach using two common Python frameworks that provide RNN implementations (namely TensorFlow [29] and Keras [22]), as well as all the results of our experiments. These contributions extend our preliminary research published at the MaLTeSQuE 2021 workshop [46]. While our previous paper focused solely on business processes, this article adds a new source of datasets from the VIS domain: Claroline [30,32,33], a course management system that was reverse-engineered from an instance in use at the University of Namur. We derive four new datasets from this newly added system, forming a much more challenging learning problem (up to 50 variants instead of 5), and we assess the effect of sampling (random uniform vs dissimilarity-based) on the outcome. In addition, we reran all our previous experiments and the new ones at one of the Belgian universities' HPC facilities. We also refactored the VaryMinions source code to ease its reuse and make it more configurable. To summarise, the added value of this extension comes from: (i) four new and more complex datasets from the VIS domain; (ii) a discussion about the effect of sampling on this classification task; (iii) a refactored implementation of VaryMinions.
Section 2 introduces process mining, VIS and RNNs. Section 3 motivates the use of VaryMinions. Section 4 gives an overview of the proposed solution, while Section 5 presents the datasets and the experimental setup in more detail. Section 6 gives the results of our evaluation. Section 7 discusses certain factors influencing our experiments, such as hyper-parameter variability and alternate labelling of variants. Section 8 presents related work, and finally, Section 9 wraps up the paper.

Background
Our work tackles the problem of tracing back the system variant that produced some event logs. This is an issue common to process variants and variability-intensive systems. We address it by relying on techniques coming from the Deep Learning community. In the following, we introduce these different concepts.

Process Variants
Nowadays, many organisations work with multiple (business) processes in parallel that can highly depend on environmental and human factors. For instance, a business process can be influenced by regional laws, available resources, the size of the organisation, etc. Most of them share common behaviours, meaning that for one general business process, one can define several process variants, each one behaving (slightly) differently from the other variants. Similar process variants gather in process lines or process families and can be modelled using different formalisms [106].
Analysing the specificities and commonalities of process variants allows scale economies and helps practitioners to improve the general business process, define new variants or maintain existing ones [119].
For process understanding and reverse engineering purposes, one commonly inspects execution logs. Indeed, they contain valuable information on the process behaviour in production. If the process has several variants, one must know which variant(s) are involved in a behaviour of interest. Unfortunately, event logs do not usually contain information about the specific variant (or set of variants) which (could have) produced the sequence of events (i.e., the event trace). This can prevent practitioners from understanding why this behaviour occurs for one variant and not another. In this paper, we address the problem of identifying process variants that have (potentially) shown a specific behaviour, based on a given event trace. This mapping information is key in various re-engineering activities such as variant process mining [119]. However, these activities are beyond the scope of this paper.
To demonstrate the feasibility of our variant process identification learning approach, we use two datasets that gather execution traces of business processes. They both come from the Business Process Intelligence Challenge, a yearly challenge organised since 2011 to stimulate process mining research on real-life datasets. We selected two editions, modelling process variants: the first one from the year 2015 and the second from the year 2020.

Variability-Intensive Systems
Process families belong to the vast and heterogeneous category of Variability-Intensive Systems (VISs). These are software-based systems that exist in many variants to address the diversity of customer needs and usage contexts. Structured approaches, like Software Product Lines (SPLs) [104], facilitate the design, development and quality assurance of such systems. They consider a global base of software artefacts for a family of software systems, and allow producing variants through the (de)activation of options (also called features in the VIS world).² Reasoning at the family level rather than at the single system level yields significant economies of scale and quality improvements.
² To avoid any confusion between VIS feature, i.e., a functionality of a software system, and machine learning feature, i.e., a property characterising an entity, we will refer to the former as option (or sometimes VIS feature) and to the latter as feature.
Fig. 1: VIS Feature Model of a beverage vending machine [24,30].

Variability Modelling
The variability of a VIS is usually decomposed using a tree-like structure called a (VIS) Feature Model (FM) [72,109]. An FM represents the common and variable aspects of the system. For instance, Figure 1 presents the FM of a simple configurable beverage vending machine. The machine sells either soda or tea (or both) in euros or dollars, and may optionally support cancelling purchases and providing free drinks. As the number of possible variants increases exponentially with the number of options available, such a compact representation of the variability of a VIS enables various kinds of analysis, including counting the number of possible variants, detecting dead options that can never be selected, etc. For instance, the vending machine of Figure 1 already has 24 possible (distinct) configurations.³ This number is very small compared to real-world VISs. For example, the Claroline case, which we will introduce in Section 5.2.2, has more than 5 million possible configurations for 44 options.
³ To avoid any confusion between VIS configuration, i.e., a combination of software options, and neural network configuration, i.e., a selection of hyperparameters characterising a network, we will refer to the former as configuration or variant and to the latter as parameterisation.

Behavioural Modelling
Complementary to FMs, Featured Transition Systems (FTSs) [24] are designed to represent compactly the behaviour of a VIS. An FTS is a transition system where each transition is labelled with a (VIS) feature expression (i.e., a Boolean formula referring to its options) to indicate which valid configurations of the VIS can execute the transition. For instance, Figure 2 presents the FTS of the beverage vending machine of Figure 1. As can be seen from the feature expressions, only specific configurations can execute some transitions: e.g., only vending machines with the free (f) option enabled can execute the free transition from state 1 to state 3. As for FMs, FTSs provide a way to represent compactly the behaviour of all the different configurations of a VIS.
Fig. 2: VIS Featured Transition System of a beverage vending machine [24,30].
In this paper, we rely on the FM and FTS of Claroline, a highly configurable course management platform previously used at the University of Namur, to simulate executions of different configurations of a real system. This simulation offers a way to generate event logs in a controllable way without requiring running a large number of variants of the system. The Claroline FM and FTS were defined by Devroey et al. [30,32,33], based on the logs of the implementation used at the University of Namur collected over 9 months.

Deep Learning and Recurrent Neural Nets
As explained previously, the number of possible variants grows exponentially with the number of VIS options. Similarly, the number of traces a system can generate is assumed to be infinite. These observations call for automatic reasoning instead of manual inspection. In particular, we rely on machine learning and deep learning techniques.
Deep Learning (DL) is a subset of machine learning techniques. DL techniques remain statistical techniques, but the main difference is that classical machine learning techniques rely on predefined features (or characteristics) compactly representing data. Historically, domain experts defined the relevant features and the procedures to extract them from raw data. In contrast, DL techniques can infer such features automatically while training, at the cost of more computational resources and time. In the last decade, DL techniques have performed well on diverse tasks and new applications such as image processing, driving assistance for autonomous vehicles, board games such as Go, sound processing, text processing, automatic translation, etc.
Different families of machine learning algorithms exist: decision trees or random forests, support vector machines, linear regressors, neural networks, etc. Thanks to their capability to model and handle complex relations, neural networks are at the centre of attention of DL techniques. There are various neural network architectures, each adapted to a specific task. For instance, convolutional neural networks excel at image processing while recurrent neural networks (RNNs) [107,110] handle data sequences (such as text or speech).
Previous works applied RNNs to execution traces to predict the next event or the final execution state [40,118]. When data sequences are too long, vanilla RNNs may face the so-called vanishing or exploding gradient problem [65]. Indeed, weights from the first layers may rarely be adjusted since, during training, the back-propagation mechanism re-injects prediction errors backwards through the network, starting from its output layer, so that it can ultimately provide the right outcome. Because errors are injected from the output, they tend to vanish and never reach the first layers, leaving them unchanged. Conversely, the gradient can grow exponentially, yielding intractable computations. Two RNN architectures deal with longer sequences and long-term dependencies: Long Short-Term Memory (LSTM) [66] and Gated Recurrent Unit (GRU) [21]. These architectures alleviate gradient issues [23,65] by using gates to regulate the data flow and keep specific long-term data in memory.
RNNs are composed of multiple units (sometimes referred to as cells), which can convey data from one to another. Typically, RNNs start with an embedding layer that transforms input data into multi-dimensional vectors. Figure 3 depicts an example of an LSTM unit (left) and a GRU unit (right). Inside one unit, gates regulate the data flow, deciding what data to keep and what to forget. Mathematically, gates are functions (e.g., sigmoid) expressing the amount of data to keep. We can define several types of internal gates for different purposes. An LSTM unit (Figure 3a) is composed of three different state variables and three different gates. The variables represent respectively the input of the unit (i.e., the matrix computed by the embedding layer, called x_t in the figure), the output (called h_t), and the unit state (called c_t). The latter acts as the long-term memory of the network, registering data from previous units to pass through the next ones. Forget gates (on the left of the figure) are used to convey data from the previous unit directly to the next one. In particular, they may set some values to 0, making the network forget this data. The input gate (in the middle) defines how much data should be treated in the current unit. The final output of a unit travels through the output gate (on the right of the figure). To avoid gradient explosion, LSTM units use a tanh function (above the output gate) to keep data in a small range of values (i.e., between -1 and 1). In a GRU (Figure 3b), input and forget gates are merged (on the right of the figure) and there is no output gate. Consequently, the output and unit state variables are also combined into a unique variable (named h_t in the figure). GRU also offers a new type of gate (in the middle of the figure) expressing how relevant data from the previous unit is for the current unit.
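As a complement to the informal description of the gates, the usual textbook formulation of both units is given below. It is a sketch in standard notation, not taken from Figure 3: the weight matrices W and U, the biases b, the candidate states (marked with a tilde) and the gate names f, i, o (LSTM) and z, r (GRU) are assumptions introduced here, while x_t, h_t and c_t are the variables named above. The symbol \sigma is the sigmoid function and \odot the element-wise product.

```latex
% LSTM unit: forget (f), input (i) and output (o) gates
\begin{align*}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{align*}

% GRU unit: merged input/forget gate (z) and relevance gate (r)
\begin{align*}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) \\
\tilde{h}_t &= \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{align*}
```

In this formulation, a forget-gate entry close to 0 erases the corresponding entry of c_{t-1}, which matches the "set some values to 0" behaviour described above. The sign convention for z_t in the last GRU equation varies between references.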
LSTMs and GRUs are efficient text classifiers, e.g., [75,85]. In this work, we want to create a mapping between execution traces, which are successions of events occurring in a specific order, and the configurations of a system that, supposedly, can produce them. In this context, an event does not appear randomly but depends on the previous succession of events: sometimes directly on the few previous ones, sometimes on an event that occurred much earlier in the trace. Thus, using LSTM and GRU architectures seems appropriate.
None of the reviewed works relied on RNNs but used other classification models (decision trees, association rules, etc.). Recently, Arganese et al. investigated ambiguity in natural requirements as variability points [7], but the mapping concerns words rather than complete sequences.

Motivation: Behaviour-driven VIS Reverse-engineering via Black-box Learning
Over the past two decades, researchers have focused on modelling the behaviour of SPLs for design and analysis purposes. Various paradigms for modelling SPL behaviour, such as Featured Transition Systems [24] and Featured Finite State Machines [56], have been defined. However, engineers typically create these models manually, which is time-consuming, error-prone, and not suitable for complex VISs. Recent efforts [26,27] have attempted to automate this model creation process, but this work is still in its early stages.
Most approaches to learning VIS behaviour rely on and extend Dana Angluin's seminal L* algorithm from 1987 [5]. This algorithm aims to infer a single system's behaviour in a black-box and active manner, relying solely on execution traces obtained on the fly. In this case, access to the source code is unnecessary, but interaction with the System Under Learning (SUL) is essential. L* follows a simple metaphor. The Learner constructs hypothesis models of the system by posing queries to the Teacher, who serves as a middleware between the Learner and the SUL. The Teacher can either validate the hypothesis or provide a counterexample if it is invalid, helping the Learner to update its hypothesis.
To adapt Angluin's algorithm and accommodate variability, existing approaches [26,27] introduce post-processing steps. For instance, learning each product variant and progressively merging them is one approach. However, this approach becomes impractical when dealing with a large number of variants. Another approach, instead of considering variants individually, would be to learn the VIS in a family-based fashion [44,45]. In both cases, it is essential to relate Angluin's queries and counterexamples to configurations. Existing mappings are incomplete, as they rely on partial observations of the system. We assume that the Teacher only possesses knowledge of previously observed SULs, with all new configurations being unknown. Hence, in this scenario, a configuration prediction technique is required.
Existing SPL reverse engineering techniques usually assume the presence of an accurate FM. This is a strong hypothesis. Usually, FMs are built from requirements, which are known to be ambiguous and partly implicit. FM reverse engineering approaches also have limitations in terms of completeness and soundness. It is much easier to assume a set of configurations, especially in reverse engineering scenarios. Therefore, we aim for a solution that does not require an FM. In our context, we can only rely on the list of features (but without explicitly stating the constraints between them) and a list of the configurations used for classification.
To map configurations to variant products, white-box approaches rely on the source code or use a combination of software implementation artifacts and logs [91,92]. We place ourselves in a strict black-box context in which the source code is not available. This is the case for the business processes we analysed. Therefore, we focus on execution logs only. These logs do not directly relate event traces with variants. Indeed, Cândido et al. [19] have pointed out that preemptively logging detailed information would result in enormous log files, reaching several terabytes, which would impede effective analysis.
Since extensive mapping information is not available, we propose to employ supervised machine learning to tackle the following challenge: how can we classify new incoming traces (previously unseen) into multiple variants?

VaryMinions Overview
Figure 4 provides an overview of VaryMinions' architecture. The input data (a) are a set of available execution traces. For training, traces are associated with the set of system variants that can produce them. The inputs first pass through an Embedding layer (b) that transforms the sequences of events into a vector of indexes (to make the representation more compact and to ease their processing). The embedding layer creates a structured space in which indexes that occur in a similar context are close. In this new representation space, indexes become vectors, and initial traces become tensors composed of numerical weights. This homogeneous representation allows performing mathematical operations on those weights through the rest of the network.
The embedding layer is also configurable, e.g., we need to specify the number of dimensions of the representation space and the number of dimensions in the output tensors. We keep the number of dimensions the same in input and output to avoid combining different input dimensions into one output dimension. We then link this layer to the RNN layer (c), which is instantiated with either LSTM or GRU units to learn the relationships between elements of the tensors. Again, this layer is configurable, in particular regarding the number and kind of units (detailed in Section 5.3).
There exist unidirectional and bidirectional [110] units. Unidirectional layers only consider the processing of the sequence in one direction (from start to end). In contrast, bidirectional layers also handle the other direction (from end to start), which can be helpful in language processing. In our case, traces are fully available at training time. Reading them forward and backward can help grasp long-term relations between events. Because of our analogy with text, we use bidirectional units [110] only.
Then the network continues with one Dense layer (d) preparing for classification. We made the number of units in this layer the same as the number of classes (i.e., configurations). The output of the network (e) is a vector of 1s and 0s whose number of elements is equal to the number of configurations of the system. This vector classifies the trace into one or more configuration(s).
In this vector, 1s state that associated configurations can generate the input trace and 0s that they cannot.
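The reading of the output vector can be made concrete with a minimal sketch. It assumes, hypothetically, that the Dense layer yields one score in [0, 1] per configuration and that scores are binarised with a 0.5 threshold; neither the threshold nor the function name `decode_prediction` comes from the paper. The fallback to the best-scoring configuration mirrors the property that the all-zero vector is never produced.

```python
from typing import List, Sequence


def decode_prediction(scores: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Turn per-configuration scores into a 0/1 vector containing at least one 1."""
    if not scores:
        raise ValueError("scores must contain one value per configuration")
    # Assumption: a configuration is selected when its score reaches the threshold.
    vector = [1 if s >= threshold else 0 for s in scores]
    if not any(vector):
        # No score passed the threshold: keep the best-scoring configuration.
        best = max(range(len(scores)), key=lambda i: scores[i])
        vector[best] = 1
    return vector


print(decode_prediction([0.91, 0.78, 0.66]))  # every configuration is selected
print(decode_prediction([0.10, 0.85, 0.20]))  # only the second configuration
print(decode_prediction([0.10, 0.30, 0.20]))  # fallback to the highest score
```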
For instance, let us take a simple system with three configurations. The output vector is thus of size three. If our prediction model outputs the vector [1, 1, 1], it predicts that all the configurations can execute the input trace. If, instead, the output vector is [0, 1, 0], only the second configuration is able to produce the input trace. One should note that our models cannot provide the output vector [0, 0, 0] since the RNN selects at least the configuration with the highest score.
In the following, we describe our evaluation protocol to validate that we can learn which variants may have produced an execution trace. First, we state the research questions that drive this experimentation before describing the creation and annotation of our datasets. Then, we explain how we instantiated VaryMinions regarding our specific context. Finally, we present the running setup and the evaluation metrics.

Research Questions
We state the following research questions concerning the multi-classification of execution traces among the different VIS variants:

RQ1 How accurately can we identify process variants based on their traces?
This question addresses the effectiveness of our approach. To the best of our knowledge, this is the first attempt to use RNNs to learn such a mapping. Thus, we cannot compare it with the state of the art. Instead, we expect the RNNs to be at least better than random classifiers (accuracy higher than 50%).

RQ2 What is the performance of LSTMs versus that of GRUs for process trace classification?
We would like to know which model architecture is the most appropriate for this task, if any.

Datasets selection and preprocessing
We use six different datasets that we divide into two groups. The first group contains the 2015 and 2020 editions of the Business Process Intelligence Challenge (BPIC). Each dataset contains event logs describing different executions of configurable processes:
- BPIC15 (DS1) represents building permit applications in five municipalities, each one corresponding to a process variant [38]; and
- BPIC20 (DS2) gathers data from the travel reimbursement process at the Eindhoven University of Technology (TU/e), where variants correspond to different kinds of documents to be managed [37].
The second group consists of four datasets containing event logs describing executions of different variants of Claroline [30,32], an online course management system used at the University of Namur until 2018. Claroline was the main communication channel between students and lecturers, with approximately 7,000 users. Its architecture is plugin-based. Depending on needs, one can deploy new variants at runtime.

Business Process Intelligence Challenge (BPIC)
The original BPIC datasets (from [37,38]) contain only valid and complete traces, along with other information. We prune the logs to keep only the process variant ID, the trace ID and the sequence of events. To cope with different trace lengths, we apply padding (i.e., filling traces with additional meaningless events and using a mask to know where the processing should stop). Trace duplicates are removed, and since multiple variants can produce the same trace, we encode the variants into a binary vector (whose size matches the number of variants) that serves as a label. A value of one at the i-th index of the vector denotes that we observed the trace at least once for variant i. Traces associated with all variants thus have a vector full of ones. In the end, each trace is associated with one or more variants (i.e., classes). We expect the RNN models to learn these associations to predict the variant(s) for an unlabelled trace. We wrote this preprocessing procedure in Python as part of VaryMinions' implementation [47].
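The steps above (deduplication, binary label vector, padding) can be sketched as follows. This is an illustration under stated assumptions and not the code of the replication package [47]: the function `build_examples`, the `<PAD>` token and the toy event names are hypothetical, and the log is assumed to be already pruned to (variant index, event sequence) pairs.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Set, Tuple

PAD = "<PAD>"  # hypothetical padding event, to be masked out during training


def build_examples(
    log: Sequence[Tuple[int, Sequence[str]]], n_variants: int
) -> List[Tuple[List[str], List[int]]]:
    """Deduplicate traces, pad them, and attach a binary variant label vector."""
    # Group by trace content: duplicates collapse, variants accumulate.
    variants_by_trace: Dict[Tuple[str, ...], Set[int]] = defaultdict(set)
    for variant, events in log:
        variants_by_trace[tuple(events)].add(variant)

    max_len = max((len(t) for t in variants_by_trace), default=0)
    examples = []
    for trace, variants in variants_by_trace.items():
        padded = list(trace) + [PAD] * (max_len - len(trace))
        # label[i] == 1 means the trace was observed at least once for variant i.
        label = [1 if v in variants else 0 for v in range(n_variants)]
        examples.append((padded, label))
    return examples


toy_log = [
    (0, ["register", "review", "approve"]),
    (1, ["register", "review", "approve"]),  # same trace, other variant
    (1, ["register", "reject"]),
    (0, ["register", "review", "approve"]),  # duplicate, collapsed
]
for padded, label in build_examples(toy_log, n_variants=2):
    print(padded, label)
```

Grouping by trace content performs the deduplication and the label merge in one pass, so a trace seen under several variants ends up as a single example with several ones in its label.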
As described in Table 1, DS1 contains 5,542 traces after preprocessing, with a maximum of 154 events per trace. The five process variants are fairly equally represented since they contain 1,108 traces on average, with a minimum of 828 and a maximum of 1,350. Therefore, DS1 is well-balanced. DS2 contains 2,074 traces after preprocessing, with 5 process variants and a maximum of 90 events per trace. The least and most represented process variants contain 89 and 1,478 traces respectively, with an average of 415 traces per variant. Therefore, the dataset is imbalanced, suggesting it is harder to learn from accurately.
Table 1: Overview of the preprocessed datasets used in our experiments. Class-specific metrics (cols 3-5) represent (i) the number of traces per class, (ii) the percentage of traces assigned specifically to this variant in the dataset, and (iii) the percentage of traces shared by this variant and at least another one.
To better characterise the learning complexity, Table 1 shows the number of traces per class (i.e., variant) and the overlap (i.e., percentage of variant-specific and shared behaviour) between classes. The number of traces provides a first indication of the learning difficulty: more traces generally yield a more accurate network once trained. DS1 contains equally represented classes with limited overlap (< 0.5% in the last column), while DS2 is less balanced in how classes are represented and how they are interleaved, denoting a shared behaviour between multiple variants. In particular, for DS2, there is a big overlap between the International Declaration and the Permit Request variants, and between the Prepaid Travel Cost and the Request For Payment variants, while the Domestic Declaration variant is completely separated.
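The class-specific columns can be computed from the binary label vectors along the following lines. This is a sketch, not the exact metric definition used for Table 1: it assumes that both percentages are taken relative to the traces of the class under consideration, and the function name `class_overlap` and the toy labels are hypothetical.

```python
from typing import Dict, List, Sequence


def class_overlap(labels: Sequence[Sequence[int]]) -> List[Dict[str, float]]:
    """Per class: number of traces, % variant-specific traces, % shared traces."""
    n_classes = len(labels[0]) if labels else 0
    stats = []
    for c in range(n_classes):
        mine = [lab for lab in labels if lab[c] == 1]
        # A trace is shared when at least one other class is also set.
        shared = sum(1 for lab in mine if sum(lab) > 1)
        total = len(mine)
        stats.append({
            "traces": total,
            "specific_pct": 100.0 * (total - shared) / total if total else 0.0,
            "shared_pct": 100.0 * shared / total if total else 0.0,
        })
    return stats


toy_labels = [[1, 0, 0], [1, 1, 0], [0, 1, 0], [0, 0, 1]]
for index, row in enumerate(class_overlap(toy_labels)):
    print(index, row)
```

With these toy labels, the third class shares no trace with the others, which is the situation reported for the Domestic Declaration variant.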

Claroline
Claroline is a highly configurable web-based system whose behaviour depends on a set of activated options. In total, Claroline contains 44 options leading to more than 5,406,700 unique variants. Handling such a large configurable system is not trivial as it requires deriving different variants and executing them in various ways to trigger different behaviours and collect, format, and process the corresponding event logs. Setting up such pipelines is hard and outside the scope of this paper. For those reasons, we decided, instead of executing the actual system, to simulate executions of different variants using a Featured Transition System (FTS) capturing the behaviours of different configurations of Claroline. The FTS was reverse-engineered by Devroey et al. [30,32], using a bigram inference method, from a 5.26 GB Apache webserver log containing 45,210,987 entries collected from January 2013 to September 2013. The final FTS consists of 106 states and 2,053 transitions.
Simulations. The simulation of a given Claroline configuration works as follows. First, the FTS is projected on the configuration (i.e., pruned) to keep only the subset of behaviours that can effectively be executed by the configuration. The result of that process is a classical transition system, describing a subset of the behaviours of Claroline. Second, the traces associated with the configuration are produced using random walks in the transition system. We generated 5,000 traces per configuration. To avoid infinite traces (e.g., in case of a loop in the transition system), we also limited the size of a trace to 300 events. We relied on VIBeS [31,34], a model-based testing tool for highly-configurable systems, to project the FTS and generate the traces.
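The two simulation steps can be illustrated with a toy example. The sketch below does not use VIBeS and does not reproduce the Claroline model: the small FTS, its option names and the stopping rule (a walk ends when it returns to the initial state, reaches a state without outgoing transitions, or hits the event cap) are assumptions made for illustration.

```python
import random
from typing import Callable, Dict, FrozenSet, List, Tuple

Config = FrozenSet[str]
Transition = Tuple[int, str, int, Callable[[Config], bool]]

# Hypothetical toy FTS: (source, action, target, feature expression).
FTS: List[Transition] = [
    (0, "pay", 1, lambda c: "free" not in c),
    (0, "free", 1, lambda c: "free" in c),
    (1, "soda", 2, lambda c: "soda" in c),
    (1, "tea", 2, lambda c: "tea" in c),
    (1, "cancel", 0, lambda c: "cancel" in c),
    (2, "take", 0, lambda c: True),
]


def project(fts: List[Transition], config: Config) -> Dict[int, List[Tuple[str, int]]]:
    """Keep only the transitions whose feature expression holds for config."""
    ts: Dict[int, List[Tuple[str, int]]] = {}
    for src, action, dst, guard in fts:
        if guard(config):
            ts.setdefault(src, []).append((action, dst))
    return ts


def random_walk(
    ts: Dict[int, List[Tuple[str, int]]],
    rng: random.Random,
    initial: int = 0,
    max_events: int = 300,
) -> List[str]:
    """Follow random transitions until back at the initial state or at the cap."""
    trace: List[str] = []
    state = initial
    while len(trace) < max_events and ts.get(state):
        action, state = rng.choice(ts[state])
        trace.append(action)
        if state == initial:
            break
    return trace


rng = random.Random(42)
variant = project(FTS, frozenset({"soda", "tea", "cancel"}))
for _ in range(3):
    print(random_walk(variant, rng))
```

Projecting once and walking many times keeps the per-trace cost low, since feature expressions are only evaluated during projection.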
We relied on two different strategies to select the simulated Claroline configurations: random selection and dissimilarity-based selection. Random selection consists of selecting a set of (valid) configurations using a dedicated generator ensuring a random distribution of the selection. In our case, we used CMSGen [54], a fast uniform-like sampler. CMSGen comes with a default parameterisation, which we reused as is. Unlike random selection, dissimilarity-based selection [62] picks configurations in such a way that they are as dissimilar as possible when considering their selected options. For our evaluation, we used PLEDGE [63], a search-based dissimilarity-driven configuration selection tool. We selected the default parameterisation of PLEDGE, with one minute per generation.
We have set the number of configurations to simulate to 10 and 50. This way, we can go beyond the difficulty provided by the BPIC datasets and check that our method can run when the number of configurations is higher. While 50 is still small compared with the number of possible unique variants of Claroline (i.e., > 5,000,000), it is closer to a realistic setting. For each of these datasets (DS3 to DS6), the output of this generation process is a file containing 5,000 traces per configuration that we can use as an input for VaryMinions. In our case, we thus have either 50,000 traces per file (for 10 configurations) or 250,000 traces per file (for 50 configurations), as shown in Table 1.
The last two columns of this table systematically show 100% of variant-specific behaviour and 0% of shared behaviour for the Claroline datasets, meaning that for each trace, at least one action is specific to one variant of Claroline. This is due to the use of a sampler for selecting the configurations, giving very little control over the overlap between traces. Due to the huge number of possible variants (i.e., > 5,000,000), the chance of finding any shared behaviour between multiple variants is almost zero.
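To illustrate what dissimilarity-based selection optimises, the sketch below measures the distance between two configurations with the Jaccard distance over their selected options and picks configurations greedily. It is not PLEDGE's search-based algorithm: the greedy farthest-first strategy, the function names and the option names are assumptions chosen for brevity.

```python
from typing import FrozenSet, List, Sequence


def jaccard_distance(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """1 minus the ratio of shared options to all options selected by a or b."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def pick_dissimilar(candidates: Sequence[FrozenSet[str]], k: int) -> List[FrozenSet[str]]:
    """Greedy farthest-first selection of k configurations (illustrative only)."""
    if k <= 0 or not candidates:
        return []
    selected = [candidates[0]]
    remaining = list(candidates[1:])
    while remaining and len(selected) < k:
        # Pick the candidate whose closest already-selected neighbour is farthest.
        best = max(
            remaining,
            key=lambda c: min(jaccard_distance(c, s) for s in selected),
        )
        selected.append(best)
        remaining.remove(best)
    return selected


candidates = [
    frozenset({"forum", "quiz"}),
    frozenset({"forum", "quiz", "wiki"}),
    frozenset({"chat"}),
    frozenset({"forum"}),
]
print(pick_dissimilar(candidates, k=2))
```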

RNN Parameterisations
As stated before, because we use sequences of events, we investigate the use of RNNs to learn to which configuration(s) we can associate a trace. More specifically, we focus on LSTMs and GRUs. As for many DL models, hyperparameters must be defined. Because there are so many, we decided to vary only a few of them to try to understand how much impact they may have on learning. We focused on the functions that are used inside the networks and that may impact the quality of the predictions. We also manually selected a subset of hyperparameters that we fixed to a specific standard value. Hyperparameters and their values are described in detail hereafter and summed up in Table 2.
Number of hidden layers. One specific aspect that impacts the learning capabilities of neural networks is their topology. Since the traces are short compared to text documents, we decided to use networks with only one hidden layer. This may avoid the overfitting that can emerge from more complex structures (e.g., auto-encoders) while offering satisfactory prediction performance.
Units. In our previous work [46], our experiments used different numbers of units for the RNN layer (c). This number affects the topology of the network, and increasing it may help to grasp more complex concepts. Yet, having too many units on a layer may lead to dealing with redundant information that will deteriorate the final prediction performance of the network [49]. On the contrary, a layer with a smaller number of units may not have the capability to grasp interesting information, which may also harm the prediction performance [49]. Based on our previous experience, we decided to set the number of units to 30, which has shown relatively good performance while limiting the training time.
Training set, batch size and epochs. Other hyperparameters affect the training time and the optimisation of the many different parameters (e.g., weights between layers and units) of the networks. Common hyperparameters to set are the ratio of data used to train the model and those used to evaluate the performance of the model; the size of the batch of data that the model will have to deal with during training, which may mitigate overfitting; and the number of times the model will optimise parameters over the whole training set (i.e., the number of epochs). Each of these hyperparameters was set as follows: (i) the percentage of the data used for training is set to 66% of the whole dataset, which is a common value in the ML community; the remaining traces are used in the test set to assess the generalisation performance of the trained models; (ii) we set the batch size to 128, which is adapted to the dataset size; (iii) we set the number of epochs to 20 to avoid overfitting. In our preliminary evaluations (evaluated between 10 and 50 epochs), a plateau was reached after approximately 15 epochs. We finally set the number of epochs to 20, to allow for small increases in accuracy.
Activation functions. Activation functions are defined at the level of units (i.e., neurons) and respond to an input signal. If the signal is strong enough, the neuron is activated and the output is also high. Though different activation functions can be used for each neuron, it is usual to define an activation function for an entire layer. We have used a Rectified Linear Unit (ReLU) function on the hidden RNN layer (i.e., (c) in Figure 4) to alleviate the vanishing gradient problem. Regarding the Dense layer (d), we experimented with two common activation functions: sigmoid and hyperbolic tangent (tanh). Both are shown in Figure 5. The main difference between them is their output range, which affects how they handle negative input values. The sigmoid function takes its values in [0; 1], meaning that as the input values get closer to −∞ the neuron gets closer to not being activated at all (i.e., the output signal is 0), while as the input values get larger the response also gets larger. When the input value is 0, the response is 0.5. On the other hand, tanh takes its values in [−1; 1]. It may be useful to take into account negative correlations, and when the input value is 0, the response is also 0. Using one or the other may affect the "strength" of the signal that will reach the last layer for classification, in turn affecting which class (i.e., configuration) will be recognised.
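The difference between the two Dense-layer activations can be checked numerically. The following is a small standalone sketch using only the standard definitions of sigmoid and tanh; it is independent of the TensorFlow implementations used in the experiments.

```python
import math

def sigmoid(x):
    """Logistic function with outputs in (0, 1); written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def tanh(x):
    """Hyperbolic tangent with outputs in (-1, 1)."""
    return math.tanh(x)

# Responses of both functions for a few input signals.
for signal in (-5.0, 0.0, 5.0):
    print(f"x={signal:+.1f}  sigmoid={sigmoid(signal):.4f}  tanh={tanh(signal):+.4f}")
```

For a strongly negative input, sigmoid approaches 0 whereas tanh approaches −1; at 0 they return 0.5 and 0 respectively.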
Loss functions. Loss functions are used during training to optimise the weights of the networks by back-propagating errors. We have used three loss functions already implemented in TensorFlow, namely Binary Cross-Entropy (with and without logits, respectively named hereafter Bin-CE and Bin-CE logits) and the Mean Squared Error (MSE). The logit is defined as the inverse function of the sigmoid. We also implemented two custom loss functions: a variant of the Jaccard distance [69] (named Weight Jaccard hereafter), and the Manhattan distance between two vectors. The motivation for these last two functions is that, because a single trace might be assigned to different process variants, the error should be defined by comparing the elements of two vectors rather than from a single value. This difference between two vectors should define a distance score. The Manhattan distance (sometimes called L1 norm) computes the sum of absolute differences between each element of the two vectors (i.e., in this case, the process variants). The Jaccard distance assesses the proportion of elements of the two vectors that are equal. We have implemented a variant of the Jaccard distance to cope with floating-point values generated by the networks. The Jaccard distance was employed to evaluate trace dissimilarity in variability-intensive systems (e.g., [35]). Further discussions about the use and characteristics of these loss functions are provided in Section 7.2.
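The vector-based losses can be sketched in plain Python for a single trace, where `y_true` is the multi-hot vector of variants and `y_pred` the network output. MSE, binary cross-entropy and the Manhattan distance follow their standard definitions. The Jaccard variant shown here is an assumption: it is the generalised (min/max) Jaccard distance, one common way to handle floating-point values, and not necessarily the Weight Jaccard loss of the replication package.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error between two vectors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy; predictions are clipped to avoid log(0)."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true)

def manhattan(y_true, y_pred):
    """L1 distance: sum of absolute element-wise differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred))

def soft_jaccard_distance(y_true, y_pred):
    """Generalised Jaccard distance for non-negative real-valued vectors
    (assumed stand-in for the paper's variant)."""
    num = sum(min(t, p) for t, p in zip(y_true, y_pred))
    den = sum(max(t, p) for t, p in zip(y_true, y_pred))
    return 0.0 if den == 0 else 1.0 - num / den

# Hypothetical trace produced by variants 1 and 3 out of three variants.
y_true = [1.0, 0.0, 1.0]
y_pred = [0.9, 0.2, 0.4]
```

On this toy example the Manhattan distance is 0.9 and the generalised Jaccard distance is about 0.41, both computed over the whole vector rather than from a single value.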

Model Training
We decided to use only a training set and a test set in our evaluation due to the number of available execution traces. The training and performance evaluation process is done as follows: i) the entire dataset is randomly split into training and test sets. We have used the scikit-learn function train_test_split, which ensures that the distribution of classes is similar in the two sets. ii) A model is trained using the training set. iii) Its prediction performance is evaluated on the test set. To mitigate biases in our analyses, we decided to train and evaluate the performance of each parameterisation ten times on each dataset. For each run, the whole training and performance evaluation process is started again (i.e., splitting into training and test sets, training the model, and evaluating its performance). Redoing the split each time reduces the chance of training and evaluating a model solely on the most favourable sets. Not only may it change the data used for training the model, but it may also change their order of appearance, which may have an impact on the trained model.
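The repeated split-train-evaluate loop can be sketched without any ML library. The code below is an illustrative stand-in for the library split function: the helper names are hypothetical, the 66% ratio and the ten runs come from the text, and `train_and_score` is a placeholder for training one parameterisation and returning its score.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_ratio=0.66, seed=None):
    """Return (train_idx, test_idx) with similar class proportions in both.

    `labels` must be hashable; for multi-label traces, a tuple of variants
    can serve as the stratification key.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_ratio)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    # Shuffle again so that the order of appearance also changes per run.
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test

def repeated_evaluation(labels, train_and_score, runs=10):
    """Re-split, retrain and re-evaluate `runs` times; return all scores."""
    return [train_and_score(*stratified_split(labels, seed=run)) for run in range(runs)]
```

Reporting the mean and standard deviation of the returned scores then gives a view of how stable a parameterisation is across splits.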

Evaluation Metrics
This work is the first attempt to use RNNs to classify execution traces among variants of a system. One of our goals is to evaluate if such a DL technique is appropriate for this task. We thus computed four different standard metrics: Accuracy, Precision, Recall, and F1-score.
Accuracy. To evaluate the quality of the models that have been learnt, the usual metric is the Accuracy measure. Accuracy is defined as
$$\mathit{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}$$
It is a standard measure in the ML community to assess how well a model performs from a high-level point of view. It has the advantage of being easily computable, and it can also be used to refer to the proportion of wrong predictions (i.e., 1 − Accuracy). However, when classes are not well balanced (i.e., the number of traces is much larger for at least one class than for others), Accuracy may hide some important information, as the number of correct predictions for the classes with more data may outweigh the number of wrong predictions for the others, resulting in a high ratio. To mitigate this aspect in our analysis, we also consider other measures.
Precision. One usual metric to account for the performance of a prediction model is its precision. It can be calculated for each class as follows:
$$\mathit{Precision} = \frac{\text{Number of correct predictions}}{\text{Number of predictions for the class}}$$
where Number of predictions for the class is the number of correct predictions plus the number of additional data that are wrongly predicted to belong to the class (i.e., false positives). We gathered all these individual precision measures into a global one using a weighted average:
$$\mathit{Precision}_{w} = \frac{\sum_{i=1}^{c} \mathit{supp}_i \cdot \mathit{Precision}_i}{\sum_{i=1}^{c} \mathit{supp}_i}$$
where c is the number of classes, Precision_i the precision measure for class i, and supp_i the number of data with label i.
Recall. Similarly to the precision, the recall is also standard to report on the predictions of a model. It can also be calculated for each class and is defined as follows:
$$\mathit{Recall} = \frac{\text{Number of correct predictions}}{\text{Number of labelled data for the class}}$$
where Number of labelled data for the class is the number of data labelled with the class under consideration.
Similarly to the precision, we computed a weighted average to get an overall recall measure for the model:
$$\mathit{Recall}_{w} = \frac{\sum_{i=1}^{c} \mathit{supp}_i \cdot \mathit{Recall}_i}{\sum_{i=1}^{c} \mathit{supp}_i}$$
where c is the number of classes, Recall_i the recall measure for class i, and supp_i the number of data with label i.
F1-score. The F1-score is obtained through the harmonic mean of precision and recall to get an overview of the global performance of the model in one single measure. The F1-score in the case of two classes is defined as:
$$F_1 = 2 \cdot \frac{\mathit{Precision} \cdot \mathit{Recall}}{\mathit{Precision} + \mathit{Recall}}$$
Again, we can apply this calculation on each class and average with a weight equal to the proportion of data of each class in the (test) set to get an overall value for the model. The last three metrics were computed by the precision_recall_fscore_support function in scikit-learn before being averaged. We also computed confusion matrices for each class. They are available in our replication package.
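As an illustration of the support-weighted averaging described above, the following sketch computes the four metrics for a single-label toy example in plain Python. It is a simplification: the experiments rely on the scikit-learn function and on multi-label outputs, and the variant names are hypothetical.

```python
from collections import Counter

def weighted_scores(y_true, y_pred):
    """Return (accuracy, precision, recall, f1), where the last three are
    per-class values averaged with weights proportional to class support."""
    support = Counter(y_true)      # number of data with each label
    predicted = Counter(y_pred)    # number of predictions for each class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = len(y_true)
    precision = recall = f1 = 0.0
    for c in sorted(support):
        p_c = correct[c] / predicted[c] if predicted[c] else 0.0
        r_c = correct[c] / support[c]
        f_c = 2 * p_c * r_c / (p_c + r_c) if p_c + r_c else 0.0
        weight = support[c] / total
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    accuracy = sum(correct.values()) / total
    return accuracy, precision, recall, f1

# Hypothetical test set: three traces of variant v1 and one of variant v2.
Y_TRUE = ["v1", "v1", "v1", "v2"]
Y_PRED = ["v1", "v1", "v2", "v2"]
```

Here v1 has precision 1.0 and recall 2/3, v2 has precision 0.5 and recall 1.0, so the weighted precision is 0.875 while accuracy and weighted recall are both 0.75 (the two coincide in the single-label case).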
We conducted a 10-fold validation, where for each fold we randomly defined different train and test sets. For each fold, we evaluate the model by computing all the metrics described in Section 5.5. In total, running our 20 different network parameterisations with 10 repetitions on the six different datasets resulted in 20 × 10 × 6 = 1,200 runs and more than 151 days of execution. The time needed for a single execution varies between 44 seconds and 13 hours depending on the dataset and the GPU type.

Evaluation Results
In this section, we answer our two research questions separately based on:
- box-plots presented in Figures 6 to 9, showing accuracy, precision, recall and F1-score for each parameterisation of each dataset;
- a multi-comparison statistical analysis (see Figure 10), using Friedman's test with Nemenyi's post-hoc analysis;
- tables presented in Appendix A, with the average and standard deviation for the four computed metrics.
All the results (i.e., for each execution of each parameterisation) are also available in our replication package [47], including the code to compute the metrics, box-plots and statistical tests.

Performance (RQ1)
Table 3 reports the averaged accuracy (over 10 runs) of the 20 considered parameterisations of RNNs, over the 6 datasets (i.e., 120 models). We group the models into LSTM and GRU (columns) and the average accuracies into three categories according to a predefined threshold: i) below 50%, where we consider models as performing worse than a random assignment to system variants and thus useless; ii) between 50% and 70%, where we consider models as being slightly better than random assignments; iii) over 70%, where we consider the models as performing well. Out of the 120 models, 44 RNN parameterisations (first row) yield an accuracy higher than 70%, 15 are between 50% and 70%, and the remaining 61 have an accuracy below 50%. It means that nearly half of the considered models perform better than a random guess, a majority of which (i.e., 44 parameterisations out of 59) perform well in our context. The highest averaged accuracy for datasets BPIC15 and BPIC20 (top of Figure 6, or Tables 4 and 5 in Appendix) is 88% and 87% respectively, with high stability (i.e., low standard deviation). On BPIC20, only five parameterisations out of twenty do not reach 50%. Even better, for BPIC15 only five parameterisations are lower than 70% of accuracy. The top parts of Figures 7, 8 and 9 confirm these results by giving similar values for precision, recall and F1-score respectively.
Despite the complexity of Claroline datasets, at least one parameterisation obtains an averaged accuracy of 80% for each dataset. For Claroline Dissimilar 10 (middle left of Figure 6 and Table 6 in Appendix), the top parameterisation reaches 99.6% and 4 different parameterisations are above 85%. Claroline Random 10 and Random 50 (middle and bottom right of Figure 6, or Tables 7 and 9 in Appendix) also have several parameterisations above 80%, and their top one gets over 95% of accuracy. Claroline Dissimilar 50 (bottom left of Figure 6, or Table 8 in Appendix) has only one row with an averaged accuracy of 80% and only two other rows above 70%. Among the remaining ones, 15 rows are below 30%.
Note that, for Claroline Dissimilar 50, boxplots are either spread out or centred on low values (bottom left of Figure 6). Moreover, the top three rows also report a high standard deviation for the accuracy (i.e., higher than 0.13 and up to 0.38, in Table 8 in Appendix). It highlights that the results lack stability: at least one execution out of ten does not belong to the same value range. Regarding Claroline Random 10 and Claroline Random 50 (middle and bottom right of Figure 6), the top three parameterisations show very compact boxplots with few outliers. This suggests a more stable accuracy, as confirmed by a standard deviation between 0.03 and 0.11 for the accuracy (Table 7 and Table 9 in Appendix). The top two parameterisations of Claroline Dissimilar 10 (Table 6) both show an accuracy higher than 0.99 and a standard deviation lower than 0.001, demonstrating very stable results.
Overall, the number of configurations of the Claroline system (10 or 50) influences neither the averaged accuracy nor the standard deviation. Similarly, how we sample configurations (random-based or dissimilarity-based) does not impact accuracy. As for BPIC15 and BPIC20, the other metrics (precision, recall and F1-score, presented respectively in Figures 7, 8 and 9) only confirm this analysis as they follow the same tendencies.

Answer to RQ1 (performance):
We were able to train RNNs providing an accuracy above 70% (and even above 80%) for each dataset. On Claroline Dissimilar 10, the accuracy can reach 99.6%. The associated standard deviations can be small (i.e., < 0.01) but they are usually higher with the Claroline datasets, regardless of the number of configurations used or the way we select them. Yet, these results suggest there is potential to use RNNs to automatically classify newly generated execution traces among the variants of a system rather than trying to do it manually.

LSTM vs. GRU (RQ2)
Our second RQ is about the prevalence of each type of RNN. Can LSTM or GRU be considered better, and should one be preferred in this context? To answer this question, we hypothesise that one kind of RNN prevails over the other one and perform a multi-comparison statistical analysis of the 20 RNN parameterisations on all 6 datasets. We used Friedman's non-parametric test [48] with a significance level α = 0.05. This test ranks parameterisations over accuracy and then determines if the differences between parameterisations are significant. We further complete this result with Nemenyi's post-hoc procedure [70,96], indicating the statistical differences between parameterisations. This procedure can determine equivalence classes, regrouping parameterisations that are statistically similar regarding accuracy. Figure 10 shows the results of Nemenyi's test. After executing Friedman's test, we obtain a p-value of 0.001, meaning that there is a statistical difference between the accuracy of some of the parameterisations. Nemenyi's post-hoc procedure shows that the minimum distance between two statistically different groups of parameterisations (i.e., the critical distance) is 3.828. The bottom of Figure 10 shows the seven best parameterisations over all the datasets. Statistically, they are equivalent and perform better than the remaining ones (belonging to a different group).
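For readers unfamiliar with the post-hoc procedure, the sketch below shows how mean ranks and the Nemenyi critical distance are commonly computed, using the standard formula CD = q_α · sqrt(k(k+1) / (6N)). Two inputs are assumptions rather than values stated in the text: the critical value q_0.05 ≈ 3.544 for k = 20 treatments (the studentised range quantile divided by √2), and N = 60 measurements (6 datasets × 10 runs), chosen because this setting reproduces the reported distance.

```python
import math

def mean_ranks(score_rows):
    """Average rank of each parameterisation over all measurements.

    `score_rows[i][j]` is the accuracy of parameterisation j on
    measurement i; rank 1 is the best and ties share the average rank.
    """
    k = len(score_rows[0])
    totals = [0.0] * k
    for row in score_rows:
        order = sorted(range(k), key=lambda j: -row[j])
        pos = 0
        while pos < k:
            end = pos
            while end + 1 < k and row[order[end + 1]] == row[order[pos]]:
                end += 1
            avg_rank = (pos + end) / 2.0 + 1.0
            for j in order[pos:end + 1]:
                totals[j] += avg_rank
            pos = end + 1
    return [t / len(score_rows) for t in totals]

def nemenyi_critical_distance(k, n, q_alpha):
    """Critical distance between mean ranks for k treatments, n measurements."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))

# Assumed setting: k = 20 parameterisations, N = 60, q_0.05 = 3.544.
CD = nemenyi_critical_distance(20, 60, 3.544)
```

Under these assumptions, `CD` evaluates to about 3.828: two parameterisations whose mean ranks differ by less than this value are not distinguishable by the test.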
Four pairs of loss and activation functions out of ten seem to stand out from the test. They are:
- MSE and sigmoid
- binary cross-entropy and sigmoid
- binary cross-entropy with logits values and sigmoid
For most datasets, these parameterisations can predict the right set of variants with an accuracy greater than (or very close to) 70% (confirmed by Figure 6 and Appendix A). However, sometimes a combination also gives bad results. It is the case with MSE and sigmoid, both with LSTM and GRU, where accuracy does not exceed 0.25 for Claroline Dissimilar 50 (Table 8).
We can observe that the dedicated loss functions (Manhattan and Jaccard distance) give very poor results compared to the other "classical" loss functions. Nemenyi's procedure (Figure 10) assigns them the highest mean ranks. For all Claroline datasets, the accuracy is always under 30%. On BPIC15 and BPIC20, they give better results (respectively up to 82% for BPIC15 and up to 58% for BPIC20) but still lower than the other loss functions.
Regarding the activation functions, our statistical analysis shows that 6 of the 7 best parameterisations use sigmoid instead of tanh.
Nemenyi's procedure shows that LSTMs are present in 4 of the top parameterisations and GRUs in 3 of them. However, these parameterisations are indistinguishable regarding accuracy (i.e., their mean ranks differ by less than the critical distance of 3.828). Appendix A shows that the best parameterisation is an LSTM for BPIC15, BPIC20 and Claroline Dissimilar 50, but it is a GRU for the three other datasets. GRU is also the model giving the best accuracy amongst all datasets, with up to 99.6% for Claroline Dissimilar 10 (Table 6). Moreover, the count of LSTMs and GRUs in each category of Table 3 shows similar numbers and indicates that using GRU or LSTM does not influence the results.

Answer to RQ2 (classifiers):
In the top combinations of all six datasets, we observed mixed performance of LSTMs and GRUs, with no absolute winner. A statistical comparison showed that 4 of the 7 best parameterisations use LSTMs, without any significant difference from the 3 parameterisations using GRUs. Moreover, GRU gives better results on 3 of the datasets (Claroline Dissimilar 10, Claroline Random 10 and Claroline Random 50). Hence, we cannot conclude that one prevails over the other for these six datasets. Moreover, our results suggest using the sigmoid activation function rather than tanh.

Discussion and Future Work
This section discusses threats to validity that we identified and other aspects driving our future works.

Threats to Validity
Internal validity. The datasets we used contain clean and consistent traces (i.e., they omit inconsistent traces when the system crashes or an unexpected event occurs). This property is ensured either by the BPIC community [37,38] or by the use of an FTS model and the VIBeS framework [31,34] as a trace generator (for Claroline). For a new VIS, a preprocessing step should take care of trace consistency (i.e., a trace should capture a complete user session). It does not entail that the dataset captures the whole system's behaviour. Indeed, logs and models inferred from them represent a partial view of it.
To assess the difficulty of the learning process (i.e., being able to map logs to variants while sharing parts of the traces), we defined our own metrics (see last two columns of Table 1). This definition is inspired by our experience in analysing VISs, where commonalities and variabilities between behaviours are key to the analysis. These metrics come from the analysis of the dataset only and give a better understanding of the intrinsic complexity of the learning problem. Being fairly simple and high-level, they can be computed quickly but do not provide fine-grained differences (as the Levenshtein distance [79] would, at the cost of longer computations). Finding the right trade-off between simplicity of computation and precision is left to future work.
The deep learning community is very active, leading to new types (or combinations of types) of models appearing every few months, especially for image processing tasks, where competition is fierce. It is less so regarding models dedicated to time sequences. We selected LSTMs and GRUs for their ability to deal with temporal sequences and to evade the vanishing or exploding gradient issue.
We evaluated 20 distinct parameterisations of RNNs over six datasets. We designed them regarding our goal, based on our previous work [46]. However, since exhaustive coverage of the hyperparameter space is impossible, we may have missed some relevant parameterisations. Dealing with the inherent variability of hyperparameters is a research challenge per se.
A way to optimise the parameterisations is to use hyperparameter tuning techniques such as random search or auto-ML [94]. We did not use any in this work but tried to scope the parameterisation space with a manual approach similar to a grid search [46]. One motivation for this choice is that VaryMinions is the first effort to use RNNs to classify execution traces for variants of systems. Thus, we were not interested in finding the best-performing model (i.e., the goal of hyperparameter tuning). Rather, we show that, within a reasonable effort, finding a suitable RNN model parameterisation performing well is possible.
External Validity. Compared to our initial results [46], we augmented our experimental setup using Claroline, a VIS. Though our method applies to two different application domains, we cannot ensure that it generalises to all configurable systems. We used six different datasets having different characteristics, which mitigates the risk that our method works only on simple datasets. Among the ones we have used, some were taken from existing competitions (BPIC), and some were generated from scratch (Claroline), allowing us to vary and control the complexity of the learning by modifying the number of traces available and/or the number of configurations to deal with. Let us note that reverse-engineered models from logs necessarily form an incomplete representation of the behaviour of the system. Indeed, logs cannot capture all execution traces, which are often infinite for any real-world system. Besides, we do not guarantee that our cases cover the whole spectrum of VIS, given their diversity and prevalence.
A problem when using DL techniques in such a context is imbalanced representations in the training set. The training set may contain fewer occurrences of a configuration of a system or a process (e.g., because of lower popularity or because fewer actions need to be performed), with the risk that the trained model may neglect classification errors involving these configurations since they can be considered as rare events. While the Claroline datasets were generated in such a way that imbalanced representations were limited, we had no control over the BPIC datasets. They exhibit configuration imbalance but our RNN models coped with it (i.e., successfully classifying traces belonging to these configurations). Thus, we took no further actions to mitigate this aspect. Of course, the impact of class imbalance is case-specific.
Replicability. To prevent potential replicability issues, our implementation of VaryMinions and all the results presented in this paper are publicly available on Zenodo for long-term storage [47].

Hyperparameter Variability
The use of RNNs in this context requires carefully dimensioning the network and considering many parameterisations that can influence classification performance. In what follows, we discuss two elements that may influence it.
Loss functions. We use the mean squared error (MSE) to evaluate prediction errors while training a network, which is traditionally preferred when tackling a regression problem. However, Hui and Belkin [67] showed that this assumption lacks solid theoretical foundations and that MSE is suitable for classification, in particular for NLP applications, where MSE usually outperforms cross-entropy.
The choice of the loss function is tricky since we need to take care of multiple aspects: the formalisation of the problem (e.g., single or multi-label, regression or classification) or the way to compute errors. Even when trying to choose the loss function according to these points (e.g., Jaccard distances have been used to solve SPL problems, as in [35]), our results indicate that the MSE works surprisingly well. Given the importance of a loss function on the observed performance, experimenting with additional loss functions appears promising. For example, the focal loss [84], which penalises misclassified instances more than well-classified ones, is a perspective that we aim to follow.
The interplay of Losses and Activations. We deliberately chose to explore custom loss functions rather than custom activation functions. Loss functions are easier to adapt to the problem at hand (by quantifying how far we are from the true label), acting on the network output. Yet, activation functions and loss functions have distinct roles in the network, and they should be considered complementary and not independent. Both are important in the learning process. Activation functions come after every layer inside the network and, together with the weights, set the importance of a specific neuron throughout the propagation in the network. Loss functions are defined at the end of the network and are used to provide the final class(es). Loss functions are also used to back-propagate the classification errors through the network to optimise the weights in the training phase. From this short description, it is clear that the interactions between activation and loss functions affect the model performance. The former may block or lower the importance of discriminative information if incorrectly set, while the latter defines the distance from the labels, from which the network optimises itself. Hence, assessing the impact of one type of function alone is not possible. Further investigations on which combinations would be best suited are needed.
Defining new custom activation functions for this specific context is a possible option.

Complexity of the neural networks
We argued that learning a trace-to-variant mapping was feasible due to the number of traces w.r.t. the limited number of process variants. Generally, the challenge lies in the fact that having temporal sequences forces dependencies between elements that are usually learned separately. We suppose that deeper RNNs (i.e., increasing the number of hidden layers) may have a positive impact. Adding more layers increases the complexity of the model (and requires more resources for training), but may allow for a more accurate mapping between traces and variants. Yet, the risk of overfitting must not be neglected. In the future, we will also consider architectures such as auto-encoders to produce a compact internal representation of traces, which could be more efficient in discriminating them according to the process variants. Similarly to other application domains (e.g., image or speech processing), learning more compact representations could rely on new feature descriptors instead of only considering events of a trace.

Variant-based vs. Option-based Labelling
Our results indicate that applying classification techniques on a variant-based approach (i.e., identifying the variants producing a specific trace) using RNNs is promising. However, it has a major drawback: being able to predict that a trace is generated by a variant requires seeing at least one (usually many more) trace(s) generated from this variant. Said differently, enumerating all the variants and executing them all at least once is required for further predictions. While the number of variants was limited in our evaluation, the combinatorial explosion problem inherent to VISs may prevent us from applying these techniques to larger configurable processes such as continuous integration workflows with hundreds of options, leading to an intractable number of possible variants.
One future possibility to address this limitation is to work on data representation. Indeed, a variant is formed by a combination of (Boolean) options, corresponding to a configuration of the system. Even if we cannot enumerate variants, enumerating options is possible. In this case, we need a new representation which can depict the three states of each option: activated, deactivated or undetermined (i.e., the presence of the option is not relevant for the current context). The neural network will learn a partial configuration, allowing for a more fine-grained mapping. This would be useful to locate precisely a combination of options yielding a given anomalous event trace. One can use such learned models in fault localisation and repair techniques [41]. As with all labelling approaches, this new option-based approach is a costly task, but unlike a variant-based approach, it is feasible. For example, in Claroline [30,32] we have more than 5 million variants but only 44 different features. However, this new approach comes with its own challenges. Predicting the wrong features can potentially lead to a violation of the FM's constraints, creating an invalid configuration.
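A possible shape for such an option-based label is sketched below. The text only states that three states are needed; the numeric encoding (1, 0, −1), the helper names and the option names are hypothetical choices made for illustration.

```python
ACTIVATED, DEACTIVATED, UNDETERMINED = 1, 0, -1

def encode_partial(options, activated, deactivated):
    """Encode a partial configuration as one state per option."""
    overlap = set(activated) & set(deactivated)
    if overlap:
        raise ValueError(f"options both activated and deactivated: {sorted(overlap)}")
    return [
        ACTIVATED if o in activated
        else DEACTIVATED if o in deactivated
        else UNDETERMINED
        for o in options
    ]

def matches(options, partial, configuration):
    """Check whether a full configuration (set of selected options)
    is compatible with a partial configuration."""
    for option, state in zip(options, partial):
        if state == ACTIVATED and option not in configuration:
            return False
        if state == DEACTIVATED and option in configuration:
            return False
    return True

# Hypothetical options; a trace that requires the forum and excludes the wiki.
OPTIONS = ["forum", "quiz", "wiki"]
LABEL = encode_partial(OPTIONS, activated={"forum"}, deactivated={"wiki"})
```

With this encoding, one label vector of length 44 would cover Claroline, and every variant compatible with the partial configuration is implicitly designated. Checking the prediction against the feature model constraints is not covered by this sketch.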

Data availability
As for any DL technique, the issue of data availability is also present in this work. We managed to train our models with few execution traces (i.e., thousands) compared to the potentially infinite number of traces that the considered systems can produce. However, VaryMinions remains a supervised machine-learning technique and requires a set of execution logs, labelled with the variants of the system that have produced them.
To reduce the labelling effort, semi-supervised learning [20] techniques seem promising. Semi-supervised learning takes place when, in the training set, some data have labels but a majority of them are unlabelled (e.g., because a prohibitive labelling cost does not allow labelling more than a few tens of items). The goal is thus to learn a model while being able to label the unlabelled data automatically. In this area, label propagation [68,78] automatically assigns a new label by propagating the label of already known similar data. We envision using the same technique (or an adapted version) to reduce the labelling effort while being able to take into account more and more execution logs, which may improve the prediction performance of VaryMinions models.

Related Work
This paper focuses on using DL techniques to reverse-engineer configurations. However, it is not the only context where DL has been used in conjunction with business processes or SPLs. This section gives an overview of existing approaches where both DL and variable systems meet.

Machine Learning for Process Monitoring and Mining
Machine learning, in particular deep learning, has been notably used in business process monitoring. For instance, ML models can use past observations to predict the next event in a process [36,90,118,120,128], the outcome of a process [15,76,130], the remaining time [117,133], vulnerabilities and anomalies [13,61,98-100] or even performance [101]. This vast research area, called predictive business process monitoring, has attracted several literature reviews (e.g., [60,97]). ML can also be used to optimise existing processes [42] or to get a compact representation of traces [16,17]. Recently, there has been interest in the interpretability of RNN models, specifically in a process mining context [59].
Han et al. [58] use LSTM to automatically discover business processes from textual documentation. However, their work is focused on single processes and does not highlight variability.

Engineering Configurable Processes
When trying to (reverse-)engineer configurable processes or even perform maintenance and/or evolution, some of the reported techniques rely on grammar-based or evolutionary algorithms, while others are machine learning (ML) oriented. The latter mostly consider tasks like clustering traces (e.g., [115]). However, few techniques allow retrieving a complete configurable process from event logs. Some approaches use genetic algorithms [18,77], but they are limited to a small number of variants. Another option is to use (configurable) process fragments to rebuild the configurable model [10]. Sikal et al. propose a pattern for variability discovery during process mining, but this approach is only methodological at this stage [114].
In our case, we focus on the classification task. Bobek et al. [12] offer recommendations to configure variability-aware business processes at design time with Bayesian Networks. Clustering techniques have also been used [28,87,125] to perform classification tasks in an unsupervised way, i.e., without knowing the classes to learn. Song et al. use dimensionality reduction techniques to improve trace clustering [115]. In our context, we want to specify the variants (i.e., the classes) to learn. Finally, Hinkka et al. [64] aim at categorising traces into classes, thanks to LSTMs and GRUs. However, their approach differs on several points: (i) they define artificial classes, and (ii) they focus on binary classification.

Machine Learning for Variability-Intensive Systems
While there is a growing interest in employing ML techniques for VIS engineering [43,102], to the best of our knowledge, the classification of variants from behavioural traces using ML techniques has not been studied yet. ML approaches have been used to support performance prediction (e.g., [3,11,55,71,113,124,134]) and performance optimisation (e.g., [39,88,127,131,132]), to improve the search for good and acceptable configurations (e.g., [95,122,123]), or to predict unwanted feature interactions [74,82]. Although some of these works also target classification tasks, they consider configurations as the main entry point of their approaches and do not take into account the behaviour of the studied systems. ML also supports usability prediction [129], attack and vulnerability detection [1], and defect prediction [4,116]. In particular, Strüder et al. demonstrated that artificial neural networks are suitable for this last task [116].
While ML can support VIS engineering, the converse, i.e., applying variability-aware techniques to neural networks, is also possible. For example, Ghofrani et al. [52,53] proposed an approach to reuse modules of deep neural networks without additional training. For their part, Ghamizi et al. developed a framework to explore variability amongst different neural network architectures, together with automated search-based techniques to find the optimal one for a given task [50,51].

Variability-Intensive Systems Reverse Engineering
Over the years, several approaches have been proposed to reverse engineer VISs, and SPLs in particular. These techniques operate at different levels: learning variability models, learning mappings between options and VIS artefacts, and learning VIS design models.

Learning Variability Models
More than a decade of effort has been devoted to extracting options from VIS artefacts. Due to their popularity, most approaches target feature models [2,81,86,89,112]. Besides, Ramos-Gutiérrez et al. [105] use process mining to retrieve the process of configuring an SPL. VaryMinions does not necessarily need a complete feature model but rather a set of variants. These can be sampled from a feature model, as we did for the Claroline system, or simply known via product descriptions.

Learning VIS Design Models
There also exist model-based approaches to recover an architectural model of a VIS [8,73,83]. This can be useful when the system was not designed with the SPL paradigm in mind (but, e.g., by using a clone-and-own approach) and when we want to perform complex maintenance or evolution tasks. Devroey et al. [33] designed a technique to retrieve a behavioural model of an SPL. This technique, based on usage models inferred from logs, learns a candidate FTS, which should then be completed manually with annotations (feature expressions). This technique yielded the FTS model we used to generate the Claroline datasets. In contrast, the approach of Damasceno et al. [26,27] is fully automated but limited to a few variants. Their proposal adapts a classical learning algorithm (L*, by Angluin [5]), which is instrumented to merge the individual models of each variant into a model of the complete SPL. Note that merging has a high (i.e., exponential) complexity with respect to the number of variants to merge.
In contrast, VaryMinions does not aim to learn a behavioural model but to build a mapping between behaviour and variants. It may prove useful to automatically annotate SPL models.

Learning VIS Mappings
Feature location is another task in VIS reverse engineering, whose techniques Cruz et al. [25] divide into three categories: static (based on source code), dynamic (based on execution traces), and textual (based on NLP). Some techniques mix several approaches; for instance, Michelon et al. [91] use a hybrid approach based on both static analysis of the source code and dynamic analysis of execution traces. However, the general idea of feature location is slightly different, since these are white-box approaches whose purpose is to map features to source code (e.g., through source code annotations). Their goal is usually to help with maintenance and evolution. Moreover, classical feature location techniques (e.g., [25,91]) do not use RNNs. In our case, we focus on associating behaviours with variants or directly with features (see future work in Section 7.3). VaryMinions is thus a black-box and dynamic approach that could be used to make a first classification of variants of interest before delving into the source code or other white-box artefacts.

Conclusion
In this work, we evaluated the relevance of using Recurrent Neural Networks (RNNs) to address the problem of classifying behavioural traces found in logs among the variant(s) they belong to. This mapping is highly relevant when debugging variability-intensive systems (VIS), as anomalous behaviour may result from the interaction of a few specific options belonging to some variants amongst a myriad. Based on the promising results we obtained for configurable business processes [46], we extended our experiments to Claroline, a configurable course management system previously re-engineered at the University of Namur. We assessed two popular RNN types, Long Short Term Memory (LSTM) and Gated Recurrent Units (GRU), under 20 distinct parameterisations on 6 datasets (2 from configurable processes and 4 generated from Claroline models). Our results show that it is always possible to learn a mapping with an accuracy of at least 80% [47]. Neither model type (GRU or LSTM) prevails among the best-performing models.
While we demonstrated that VaryMinions easily scales up to at least 50 variants and 5,000+ traces per variant, covering huge configuration spaces, e.g., learning mappings for hundreds or thousands of configurations, may be problematic. This suggests the first item for our future work: offering an option-based encoding for the mapping problem, which would be less prone to variant explosion. We also intend to experiment with other loss functions and to design new dedicated ones. Finally, new neural architectures may be considered, such as attention-based ones [126].

A Appendix 1
This appendix contains 6 tables (one per dataset) reporting the average and standard deviation of four metrics computed over 10 runs. Accuracy, precision, recall and F1-score were computed based on the definitions provided in Section 5.5.
The first three columns of the tables show the hyperparameter values for each RNN parameterisation. For conciseness, these tables do not report hyperparameters that were fixed to a single value, such as the batch size or the number of epochs; these are discussed in Section 5. The other columns report the average and standard deviation of accuracy, precision, recall and F1-score.
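
As an illustration of how one row of these tables can be obtained, the following minimal Python sketch aggregates per-run scores for a single parameterisation. The function name `summarise`, the dictionary layout and the example scores are hypothetical, and the use of the sample standard deviation is an assumption (the population variant would be `statistics.pstdev`).

```python
from statistics import mean, stdev

METRICS = ("accuracy", "precision", "recall", "f1")

def summarise(runs):
    """Return {metric: (mean, sample standard deviation)} over a list of runs."""
    summary = {}
    for metric in METRICS:
        values = [run[metric] for run in runs]
        summary[metric] = (mean(values), stdev(values))
    return summary

# Made-up scores for three runs of one parameterisation (illustration only,
# not results from the experiments reported in the tables).
example_runs = [
    {"accuracy": 0.80, "precision": 0.70, "recall": 0.60, "f1": 0.65},
    {"accuracy": 0.90, "precision": 0.80, "recall": 0.70, "f1": 0.75},
    {"accuracy": 1.00, "precision": 0.90, "recall": 0.80, "f1": 0.85},
]

for metric, (avg, std) in summarise(example_runs).items():
    print(f"{metric}: {avg:.3f} ± {std:.3f}")
```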

Fig. 3: A unit of LSTM versus a unit of GRU

Fig. 4: Description of the VaryMinions architecture

Event logs datasets. We derived four event log datasets from the following sets of Claroline configurations. Claroline Dissimilar 10 (DS3) regroups execution traces of 10 different configurations of Claroline, selected as the most dissimilar sets of options. This dataset should lead to traces that are easier to discriminate and thus to better classifications. Claroline Random 10 (DS4) gathers traces from 10 different instances of Claroline, chosen randomly to obtain a more realistic dataset. Claroline Dissimilar 50 (DS5) is similar to DS3, but with 50 configurations to allow more diversity. Claroline Random 50 (DS6) is similar to DS4, but with 50 configurations.
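
The dissimilarity measure and the selection procedure behind the "Dissimilar" datasets are not detailed here. Purely as an illustration, the sketch below assumes that a configuration is a set of activated options, that dissimilarity is the Jaccard distance, and that configurations are picked greedily so as to maximise the distance to those already chosen. The names `jaccard_distance` and `pick_dissimilar` and the option names are hypothetical.

```python
def jaccard_distance(a, b):
    """1 - |a ∩ b| / |a ∪ b| for two sets of activated options."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def pick_dissimilar(configs, n):
    """Greedy max-min selection of n configurations (assumed strategy).

    Starts from the first configuration, then repeatedly adds the candidate
    whose smallest distance to the already chosen ones is the largest.
    """
    chosen = [configs[0]]
    remaining = list(configs[1:])
    while remaining and len(chosen) < n:
        best = max(
            remaining,
            key=lambda c: min(jaccard_distance(c, s) for s in chosen),
        )
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Hypothetical configurations, each given as its set of activated options.
configs = [
    frozenset({"forum"}),
    frozenset({"forum", "quiz"}),
    frozenset({"wiki", "chat"}),
    frozenset({"forum", "quiz", "wiki"}),
]

for config in pick_dissimilar(configs, 3):
    print(sorted(config))
```

A random dataset in the sense of DS4 and DS6 would instead draw configurations uniformly, e.g. with `random.sample(configs, n)`.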

Fig. 5: Sigmoid (blue) and tanh (orange) function responses represented by the Y-axis depending on the input signal (X-axis).
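
For readers without access to the figure, the two responses can be tabulated with the short Python sketch below, which relies only on the standard definitions sigmoid(x) = 1 / (1 + e^(-x)) and tanh(x). The sample inputs are arbitrary.

```python
import math

def sigmoid(x):
    """Logistic function; its output lies strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

# tanh is available in the standard library; its output lies between -1 and 1.
for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"x={x:+.1f}  sigmoid={sigmoid(x):.3f}  tanh={math.tanh(x):+.3f}")
```

The two functions are related by tanh(x) = 2 · sigmoid(2x) − 1, so tanh is a rescaled sigmoid centred on zero.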

Fig. 6: Boxplots showing the Accuracy over 10 runs for each parameterisation of each dataset.

Fig. 7: Boxplots showing the Precision over 10 runs for each parameterisation of each dataset.

Fig. 8: Boxplots showing the Recall over 10 runs for each parameterisation of each dataset.

Fig. 10: Result of Friedman's statistical test along with Nemenyi's post-hoc analysis over all datasets and parameterisations
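
As a reminder of what the Friedman test computes, the sketch below ranks the parameterisations within each dataset and derives the Friedman chi-square statistic from the rank sums. It is a simplified illustration: no tie correction is applied, no p-value is computed, and the accuracies in `example_table` are made up. The function names are hypothetical. In practice, a library routine such as `scipy.stats.friedmanchisquare` would typically be used, and Nemenyi's post-hoc analysis then compares the mean ranks pairwise.

```python
def average_ranks(scores):
    """Rank scores from best (rank 1) to worst; tied scores share the mean rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1..end+1
        for i in order[pos:end + 1]:
            ranks[i] = shared
        pos = end + 1
    return ranks

def friedman_statistic(table):
    """table[i][j] is the score of parameterisation j on dataset i.

    Returns the Friedman chi-square statistic (without tie correction)
    and the mean rank of each parameterisation.
    """
    n, k = len(table), len(table[0])
    rank_sums = [0.0] * k
    for row in table:
        for j, rank in enumerate(average_ranks(row)):
            rank_sums[j] += rank
    chi2 = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    return chi2, [r / n for r in rank_sums]

# Made-up accuracies: 4 datasets (rows) x 3 parameterisations (columns).
example_table = [
    [0.90, 0.80, 0.70],
    [0.95, 0.85, 0.75],
    [0.92, 0.82, 0.72],
    [0.91, 0.81, 0.71],
]

chi2, mean_ranks = friedman_statistic(example_table)
print(f"chi2={chi2:.2f}, mean ranks={mean_ranks}")
```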

Table 8: Results for dataset Claroline Dissimilar 50: averages and standard deviations of different metrics over 10 runs. Each line corresponds to a parameterisation of an RNN.

Table 9: Results for dataset Claroline Random 50: averages and standard deviations of different metrics over 10 runs. Each line corresponds to a parameterisation of an RNN.

Table 3: Number of RNN parameterisations reaching predefined accuracy thresholds. We take into account 120 parameterisations. Accuracies are averaged over 10 runs on each dataset. Each cell indicates the number of times a given RNN model type (column) reaches the threshold (row). The last column gives the total (LSTM+GRU) per accuracy range.
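
The counting behind such a table can be sketched as follows. This is an illustration under assumptions: each result is a pair (model type, accuracy averaged over the runs), a threshold is "reached" when the averaged accuracy is greater than or equal to it, and the thresholds and accuracies shown are made up. The function name `count_reaching` is hypothetical.

```python
from collections import Counter

def count_reaching(results, thresholds):
    """Count, per threshold, how many parameterisations of each type reach it.

    results: list of (model_type, mean_accuracy) pairs, model_type in {"LSTM", "GRU"}.
    Returns {threshold: {"LSTM": n, "GRU": n, "total": n}}.
    """
    table = {}
    for threshold in thresholds:
        counts = Counter(model for model, acc in results if acc >= threshold)
        table[threshold] = {
            "LSTM": counts["LSTM"],
            "GRU": counts["GRU"],
            "total": counts["LSTM"] + counts["GRU"],
        }
    return table

# Made-up averaged accuracies for four parameterisations (illustration only).
example_results = [("LSTM", 0.95), ("GRU", 0.92), ("LSTM", 0.85), ("GRU", 0.78)]

for threshold, row in count_reaching(example_results, (0.8, 0.9)).items():
    print(threshold, row)
```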

Table 4: Results for dataset BPIC15: averages and standard deviations of different metrics over 10 runs. Each line corresponds to a parameterisation of an RNN.

Table 5: Results for dataset BPIC20: averages and standard deviations of different metrics over 10 runs. Each line corresponds to a parameterisation of an RNN.

Table 6: Results for dataset Claroline Dissimilar 10: averages and standard deviations of different metrics over 10 runs. Each line corresponds to a parameterisation of an RNN.

Table 7: Results for dataset Claroline Random 10: averages and standard deviations of different metrics over 10 runs. Each line corresponds to a parameterisation of an RNN.