1 Introduction

Modern Cyber-Physical Systems (CPS) are often highly complex, because they consist of multiple subcomponents including, e.g., digital embedded systems, sensors, and software systems. Additionally, CPS contain interfaces to their physical environment, making their functionality dependent on complex physical behavior. These characteristics make modeling of CPS very difficult. Still, models of CPS are needed, e.g., for testing, verification, or anomaly detection. Machine learning helps in modeling complex CPS by automatically learning a model of a system from data. This not only allows generating knowledge about the system, but also speeds up the analysis and design of systems, where models are often derived in a time-consuming, manual process. One approach to model learning is automata learning, which derives an automaton model from observations of a system. Automata learning has a strong theoretical base in the field of language inference, discussed, e.g., in the seminal work of [6]. Techniques implementing automata learning are divided into active [3, 12, 18] and passive [11, 14] automata learning. Active learning adaptively updates a model hypothesis by querying the system for new information. This requires performing actions directly on the original system. Passive automata learning is mostly used if direct interaction with the system is not possible. Instead, an automaton is learned from a pre-collected learning set. The quality of models learned in this scenario is limited by the amount of data provided for learning. There are several approaches to improving passive automata learning, e.g., to speed up execution [1] or to handle specific learning data [7, 10, 13].

In contrast to these approaches, our strategy aims at improving the accuracy of learned automaton models and—as will be discussed within the paper—especially supports model accuracy in case of limited learning data sets. The work of Gold [6] proves that the established algorithms for passive automata learning yield exact, deterministic automaton models on a characteristic learning set, i.e., training data that represents the whole target automaton. For a learning data set collected during operation of the system, a characteristic learning set cannot be guaranteed in general.

Our approach combines multiple learned models to improve the overall accuracy. This idea is well known in machine learning, e.g., from random forests, which combine multiple learned decision trees and were first discussed in [4]. Since then, random forests have been used in multiple applications, e.g., in image processing [5], navigation [2], or energy systems [17]. The performance of random forests illustrates that machine learning methods benefit from combining multiple models produced by randomized learning algorithms. An improved accuracy of learned models is important, e.g., if the model is used for monitoring, where false alarms or undetected errors should be avoided.

We propose automata forests, applying this idea to passive automata learning. The main contribution of our paper is a new concept for passive automata learning. This concept combines the internal randomization of automata learning algorithms with the idea of random forests to consider multiple learned automata for modeling. We build automata forests on top of the Regular Positive and Negative Inference (RPNI) algorithm, one of the most commonly used passive learning algorithms. We present two different strategies to create automata forests. Furthermore, both of these approaches are applied to learn Deterministic Finite Automata (DFA) and Mealy machines, respectively. Our results imply that this concept is especially helpful when a characteristic learning set is not available. In an experimental evaluation, we observe up to \(43\%\) better accuracy of the forest models compared to automata learned with the ordinary RPNI algorithm when applied for prediction on an evaluation set. Section 2 provides some notation and basic information on automata and automata learning. In Sect. 3 the two forest algorithms and their implementation for DFAs and Mealy machines are described. Section 4 gives insight into the performance of the forest algorithms in an experimental evaluation. Finally, Sect. 5 summarizes and discusses the results, identifying a superior performance of automata forests compared to the common RPNI algorithm.

2 Preliminaries

Our approach learns DFAs and Mealy machines. A DFA M represents a language \(\mathcal {L}\) in the set of regular languages. The automaton decides if a given input word belongs to the language of the automaton M. A Mealy machine \(\mathcal {A}\) translates an input word to an output word.

There exist many approaches for learning automata. For this paper, we use the RPNI algorithm [8]. The algorithm uses positive and negative samples, where positive words are elements of the language \(\mathcal {L}\) of DFA M and negative words do not belong to \(\mathcal {L}\). For DFAs, the learner uses the positive and negative sampled words as inputs to construct a tree spanning all given positive samples. Afterwards, the algorithm tries to merge states in the tree, while continuously checking if each merging operation is valid using the negative samples. The validation check is done by verifying that the automaton would still reject all given negative input samples after a merge.
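The first phase described above can be sketched compactly. The following is a minimal illustration (not the full RPNI merge-and-fold procedure, which is omitted for brevity): it builds a prefix tree acceptor spanning all positive samples, plus the membership check used to validate that negative samples are still rejected.

```python
def build_pta(positives):
    """Build a prefix tree acceptor (PTA) spanning all positive sample words."""
    trans = {0: {}}                      # state -> {symbol: successor state}
    accepting = set()
    for word in positives:
        state = 0
        for symbol in word:
            if symbol not in trans[state]:
                nxt = len(trans)         # allocate a fresh state
                trans[state][symbol] = nxt
                trans[nxt] = {}
            state = trans[state][symbol]
        accepting.add(state)             # every positive word ends in an accepting state
    return trans, accepting

def accepts(trans, accepting, word):
    """Membership check used to validate merges against negative samples."""
    state = 0
    for symbol in word:
        if symbol not in trans[state]:
            return False                 # undefined transition: reject
        state = trans[state][symbol]
    return state in accepting

trans, acc = build_pta(["ab", "abb", "b"])
assert all(accepts(trans, acc, w) for w in ["ab", "abb", "b"])
assert not accepts(trans, acc, "aa")     # a negative sample stays rejected
```

In the actual algorithm, a tentative state merge is accepted only if `accepts` still returns `False` for every negative sample afterwards.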

For Mealy machines, the learner gets tuples of input and output words sampled from the system to be learned and builds a tree again. Merging operations are now valid if, after a merging step, the output words of the Mealy machine \(\mathcal {A}\) given the sampled input words are still equivalent to all corresponding sampled output words.
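The merge-validity condition for the Mealy case can be stated as a one-line predicate. In this sketch, `translate` stands for an assumed function representing the candidate machine after a tentative merge; the name is illustrative.

```python
def merge_is_valid(translate, samples):
    """A merge is kept only if the merged machine still reproduces
    every sampled output word for its sampled input word."""
    # samples: list of (input_word, output_word) pairs from the target system
    return all(translate(inp) == out for inp, out in samples)

identity_machine = lambda w: w           # toy machine: echoes its input
assert merge_is_valid(identity_machine, [("ab", "ab")])
assert not merge_is_valid(identity_machine, [("ab", "ba")])
```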

3 Algorithms for Learning of Automata Forests

3.1 Forest Structure

The general idea behind the algorithms is that multiple automata are built forming the forest structure. Thus, a forest structure consists of several instances of automaton learners. Here, each learner implements the RPNI algorithm. Each of the instances inside the forest gets assigned a random fraction of the learning dataset. Afterwards, every instance of the forest builds an automaton. The automata corresponding to each instance are evaluated to form an output. The parameters to design the automata forests in this paper are:

  • \(m \in \mathbb {N}\) is the number of instances inside the forest,

  • \(\alpha \in (0,1)\) is the percentage of the original dataset given to each instance.
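The data assignment described above can be sketched as follows, under the assumption that each instance independently draws a random fraction \(\alpha\) of the learning set and keeps the remainder as a held-out set (the held-out part is used later by ForestCV). Function and parameter names are illustrative.

```python
import random

def split_learning_set(samples, m, alpha, seed=0):
    """For each of the m forest instances, draw a random fraction alpha of
    the learning set as training data and keep the rest as held-out data."""
    rng = random.Random(seed)
    instances = []
    k = int(alpha * len(samples))
    for _ in range(m):
        shuffled = samples[:]
        rng.shuffle(shuffled)
        instances.append((shuffled[:k], shuffled[k:]))  # (train, held-out)
    return instances

parts = split_learning_set(list(range(10)), m=3, alpha=0.6)
assert all(len(train) == 6 and len(rest) == 4 for train, rest in parts)
```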

In the following, we present two different approaches for evaluating these forest structures to form an output.

3.2 Forest with Cross Validation (ForestCV)

ForestCV is inspired by the cross-validation method [16]. Every instance of the forest stores the part of the original learning set that was not used to train its automaton as its test set (with the information whether a word is accepted by the target automaton or not). Afterwards, each DFA inside the forest is scored using the corresponding test set. The scoring gives an automaton a point for every correctly validated word (accepting words from the positive samples and rejecting negative words). ForestCV outputs the DFA which achieved the highest score using this cross-validation method.
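The scoring and selection step above can be sketched as follows, assuming each forest instance is given as an `accepts(word)` predicate together with its held-out test set of `(word, label)` pairs (label `True` for positive samples). All names are illustrative.

```python
def score(dfa_accepts, test_set):
    """One point per correctly validated word: positives accepted,
    negatives rejected."""
    return sum(1 for word, label in test_set if dfa_accepts(word) == label)

def forest_cv(instances):
    """Return the DFA (as its accepts-predicate) with the highest
    cross-validation score. instances: list of (accepts_fn, test_set)."""
    return max(instances, key=lambda inst: score(inst[0], inst[1]))[0]

# Toy example: the second DFA classifies both held-out words correctly.
instances = [
    (lambda w: True,     [("a", True), ("b", False)]),   # score 1
    (lambda w: w == "a", [("a", True), ("b", False)]),   # score 2
]
best = forest_cv(instances)
assert best("a") and not best("b")
```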

When dealing with Mealy machines, the output of an automaton changes to translating an input word to an output word. The basic principle of ForestCV for Mealy machines is the same as for DFAs: the evaluation is covered by the cross-validation, and the output Mealy machine is the automaton inside the forest which performed best given a chosen metric. We analyze the following metrics when translating words: (1) outputting the Mealy machine which most often translated input words correctly (ForestCV\(_{W}\)); (2) outputting the Mealy machine which had, on average, the lowest Hamming distance (counting character substitutions when comparing words) to the sampled output words (ForestCV\(_{ED}\)). The choice of metric when constructing ForestCV depends on what the user wants to achieve, i.e., which metric is to be minimized or maximized.
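The Hamming distance used by ForestCV\(_{ED}\) counts character substitutions between two words of equal length; since a Mealy machine emits exactly one output character per input character, the machine's output and the sampled output always have matching lengths.

```python
def hamming(word_a, word_b):
    """Number of positions at which two equal-length words differ."""
    assert len(word_a) == len(word_b)
    return sum(1 for a, b in zip(word_a, word_b) if a != b)

assert hamming("abc", "abc") == 0   # identical translation
assert hamming("abc", "axc") == 1   # one substituted character
```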

3.3 Forest with Majority Voting (ForestMV)

ForestMV does not output a DFA, but the input and output of the forest are equal to those of a DFA, i.e., the forest decides whether a word is an element of \(\mathcal {L}\). For an input word, all DFAs decide if they accept or reject the word. The answer given by the majority of the DFAs is the output of the forest. For Mealy machines, given an input word and the output alphabet of the target automaton, ForestMV triggers all Mealy machines inside the forest and forms an output word. This is done by iterating character-wise over the input word: at every position, a majority vote picks the output character most often chosen by the Mealy machines as the output character of the forest at that position.
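Both voting schemes can be sketched compactly, assuming each DFA is given as an `accepts(word)` predicate and each Mealy machine as a function mapping an input word to an output word of the same length. Names are illustrative.

```python
from collections import Counter

def forest_mv_accepts(dfas, word):
    """DFA case: the forest accepts iff the majority of DFAs accept."""
    return sum(1 for accepts in dfas if accepts(word)) > len(dfas) / 2

def forest_mv_translate(machines, input_word):
    """Mealy case: character-wise majority vote over all machines' outputs."""
    outputs = [machine(input_word) for machine in machines]
    result = []
    for position in range(len(input_word)):
        votes = Counter(out[position] for out in outputs)
        result.append(votes.most_common(1)[0][0])   # majority character
    return "".join(result)

# Toy example: two machines agree on "x", one outputs "y" -> forest says "x".
machines = [lambda w: "x" * len(w), lambda w: "x" * len(w), lambda w: "y" * len(w)]
assert forest_mv_translate(machines, "ab") == "xx"
```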

4 Experimental Evaluation

Our empirical evaluation covers two scenarios: (1) learning regular expressions. Since regular expressions are equivalent to DFAs, we sample positive words from a regular expression instead of a real target automaton. Negative samples are randomly generated words that do not belong to the considered language \(\mathcal {L}\). The two regular languages examined in our paper are \(\mathcal {L}_{1}\) with an alphabet size of 36 and 12074 words, and \(\mathcal {L}_{2}\) with an alphabet size of 26 and 7938 words. (2) Learning Mealy machines given random target Mealy machines. Many CPS can be abstracted to Mealy machines by identifying discrete inputs and outputs. In the experiment, we randomly generate input words over the Mealy machine's input alphabet and feed them to the target Mealy machine, resulting in sampled output words. The three Mealy machines in our paper are: \(\mathcal {A}_{1}\) with an input alphabet of size 10, an output alphabet of size 10, and 20 states; \(\mathcal {A}_{2}\) with an input alphabet of size 10, an output alphabet of size 10, and 5 states; \(\mathcal {A}_{3}\) with an input alphabet of size 15, an output alphabet of size 8, and 15 states.

The words obtained from the target language (1) and Mealy machine (2) form the training and test sets, respectively, to evaluate the different forests. The nominal RPNI algorithm is used as the reference in the evaluation. When building the forests, each instance gets the same amount of words for training. ForestCV and ForestMV use the fraction \(\alpha \in (0,1)\) of the input data to learn their inner m automata, whereas RPNI uses \(100\%\) of the input data. The training data used in Sect. 4 are 1000 positive words for \(\mathcal {L}_{1}\) and 80 for \(\mathcal {L}_{2}\), each with 500 negative words, and 500 pairs of input and output words for \(\mathcal {A}_{1,2,3}\). The experiments as well as the implementation of the algorithms are based on the LearnLib framework [9].

4.1 Hyperparameter Tuning

Each test set contains 1000 words and is continuously tested on the algorithms while increasing \(\alpha \). The y-axis shows the average number of successes of the algorithms (classifying words in \(\mathcal {L}_{1}\), translating words for \(\mathcal {A}_{1},\mathcal {A}_{2}\)) and the x-axis depicts \(\alpha \). The plots in Fig. 1 show different results on how to choose \(\alpha \). Figures 1a and 1c show that, when steadily increasing \(\alpha \), the curve has a global maximum at roughly \(\alpha =60\%\) for DFAs and \(\alpha =80\%\) for Mealy machines. However, in the case of \(\mathcal {A}_{1}\), the curve increases until \(\alpha =100\%\) (Fig. 1b). For the tests in our paper, we chose \(\alpha =60\%\) for DFAs and \(\alpha =80\%\) for Mealy machines.

Fig. 1 Performance analysis with respect to \(\alpha \). Three line plots: (a) average correctly classified words vs. \(\alpha \) for \(\mathcal {L}_{1}\); (b), (c) average correctly translated words vs. \(\alpha \) for \(\mathcal {A}_{1}\) and \(\mathcal {A}_{2}\). RPNI is constant in (a), decreases slightly in (b), and fluctuates in (c), while the forest curves increase with \(\alpha \)

Fig. 2 Performance analysis with respect to m. Two line plots: (a) average correctly classified words vs. m for \(\mathcal {L}_{1}\), where RPNI fluctuates around 960 and ForestMV and ForestCV increase logarithmically; (b) average correctly translated words vs. m for \(\mathcal {A}_{1}\), where RPNI stays constant around 79, ForestCV\(_{W}\) and ForestCV\(_{ED}\) lie below it, and ForestMV increases to about 90

Figure 2 shows an analysis of the parameter m. Each test set contains 1000 words and is continuously tested on the algorithms while increasing m. The y-axis shows the average success rate of the algorithms (classifying words in \(\mathcal {L}_{1}\), translating words for \(\mathcal {A}_{1}\)) and the x-axis depicts m. In our experiments the number of successes converges with increasing m. Figures 2a and 2b show that, after increasing m above 100, the result does not change much. Because of this, \(m=100\) is chosen in our experiments.

4.2 Analyzing DFAs

In the following, the algorithms are executed with \(\mathcal {L}_{1}\) (5 different training sets, each with 5 different test sets, over 100 iterations, resulting in 2500 observations) and \(\mathcal {L}_{2}\) (10 different training sets, each with 10 different test sets, over 100 iterations, resulting in 10000 observations). Figure 3 shows the results of the DFA examples. The x-axis depicts the number of correctly classified words with a maximum of 1000 and the y-axis depicts the density distribution of the data using the histogram function from R [15], i.e., the probability that a data point achieves a certain number of correctly classified words. In our experiments, the plots show that both forest algorithms outperform the RPNI.

Fig. 3 Number of correctly classified words for DFAs on 1000 tries. Two density histograms over the number of correctly classified words for \(\mathcal {L}_{1}\) (a) and \(\mathcal {L}_{2}\) (b). The RPNI bars peak at around (995, 0.17) in (a) and (995, 0.009) in (b)

Table 1 Metrics for DFAs

Table 1 analyzes the metrics of the plots from Fig. 3: average (avg), variance (var), false positives (false pos.), false negatives (false neg.), and the execution time in seconds (execTime) of each algorithm, comparing them to RPNI. The percentage in parentheses shows the approximate increase or decrease of the forests in the respective metric (green cell markings denote the highest improvement; red cell markings depict the slowest execution time). In the presented results, forest structures show better results with respect to all presented metrics in Table 1, while their execution time increases by a factor of \(\approx 72.5\) compared to the basic RPNI. Experiments in Sect. 4.2 are executed on an Intel® Core™ i5-7300HQ CPU @ 2.50GHz processor with 8GB RAM.

4.3 Analyzing Mealy Machines

For learning Mealy machines, the test cases are produced as follows: \(\mathcal {A}_{1}\) is built at random 5 different times, each with 3 different training sets on each 3 different test sets, executed 100 times, resulting in 4500 observations; \(\mathcal {A}_{2}\) is built at random 10 times, each with 3 different training sets on each 3 different test sets, executed 100 times at each step, which results in 9000 observations; \(\mathcal {A}_{3}\) is built at random 25 times, each with 2 different training sets and 2 different test sets, with 100 iterations, for a total of 10000 observations. The x-axis depicts the number of correctly translated words with a maximum of 1000 and the y-axis depicts the density distribution of the data plotted in R using the histogram function. Figure 4 shows that although both ForestCV algorithms can at times perform worse than RPNI, ForestMV is able to consistently outperform the RPNI. Furthermore, Fig. 4a and 4c both follow the curve for \(\alpha \) from Fig. 1b, i.e., \(\alpha \) is not chosen optimally. Thus, it can be assumed that further increasing \(\alpha \) may improve the results of the forests in Fig. 4a and 4c. Table 2 analyzes the plots for Mealy machines in a similar way as Table 1 for DFAs. The observed metrics are: average (avg), variance (var), and the execution time in seconds (execTime) of each algorithm, comparing them to the RPNI. In Fig. 4c, ForestMV achieves an average increase in correctly translated words of \(43.6\%\), which is the highest increase of our algorithms when only considering the correct classification or translation of words. Contrary to the results for the DFAs, the variance of the forest algorithms is not always lower than that of RPNI. The row execTime shows that the forest algorithms have an average increase in execution time by a factor of \(\approx 78.3\).
For both analyses, DFAs and Mealy machines, our experiments show that the execution time of ForestCV and ForestMV correlates with \(\alpha \cdot m \cdot N\), where N is the execution time of the RPNI algorithm.

Fig. 4 Number of correctly translated words for Mealy machines on 1000 tries. Three density histograms for \(\mathcal {A}_{1}\) (a), \(\mathcal {A}_{2}\) (b), and \(\mathcal {A}_{3}\) (c). In (a), RPNI, ForestCV\(_{W}\), and ForestCV\(_{ED}\) follow a fluctuating bell-shaped trend up to about 250 words, followed by ForestMV; in (b), the bars increase up to around (1000, 0.03); in (c), ForestMV shows a bell-shaped trend beyond 200 words

Table 2 Metrics for Mealy machines

The experiments are executed on:

  • \(\mathcal {A}_{1}\): Intel®Core™i7-9700K CPU @ 3.60GHz processor with 32GB RAM;

  • \(\mathcal {A}_{2}\): Intel®Core™i5-7300HQ CPU @ 2.50GHz processor with 8GB RAM;

  • \(\mathcal {A}_{3}\): AMD Ryzen 7 3800X 8-Core CPU @ 3.90 GHz processor with 32GB RAM.

5 Conclusion

Classic automata learning aims at exact results, which is guaranteed when a characteristic set of observations is available. However, this is often not the case, especially for practical and complex CPS. For this reason, the concepts of ForestCV and ForestMV are introduced. Both approaches show that passive learning of automata can be improved using forest structures when the characteristic set is not sampled. ForestCV does this by picking the automaton which performs best given a chosen metric. The output automaton is therefore dependent on a single automaton in the forest. By optimizing the chosen metric, ForestCV performs comparably to RPNI in our experiments, but with a lower variance. With ForestMV's approach the output does not depend on a single Mealy machine inside the forest, and hence ForestMV exploits its forest structure more than ForestCV. ForestMV shows improvements for passive learning in our experiments, because it outperforms the RPNI consistently. Using ForestMV yields up to \(43\%\) more accurate results when compared to the RPNI algorithm. This is a success in the field of passive learning whenever a more accurate result is of higher significance than the longer computation time needed to build the forests. This might be the case for the implementation of a monitoring tool for CPS, where a pre-learned model with high accuracy is required. Future directions in research for automata forests include an analysis of the influence of noise on the training set.