1 Introduction

This paper describes the PAutomaC probabilistic automaton learning competition and provides an overview of the relevant literature on this topic. PAutomaC was an on-line challenge that took place in 2012 at http://ai.cs.umbc.edu/icgi2012/challenge/Pautomac/. The goal of PAutomaC was to provide an overview of which probabilistic automaton learning techniques work best in which setting and to stimulate the development of new techniques for learning distributions over strings. Many probabilistic automaton learning methods have been developed in the past (see Sect. 2 for an overview). Most of these focus on deterministic probabilistic automata (Dpfa), where only the symbols are drawn from probability distributions but the transitions are uniquely determined given the generated symbol. There exist some exceptions, however, which aim to learn hidden Markov models (Baum 1972), probabilistic residual automata (Esposito et al. 2002), and multiplicity automata (Denis et al. 2006). Another important approach is to learn Markov chains or n-grams by simply counting the occurrences of sub-strings (Saul and Pereira 1997; Ney et al. 1997; Jelinek 1998). These simple counting methods have been very successful in practice (Brill et al. 1998).

Although many methods have been proposed, there has been so far no thorough investigation of which model/algorithm is likely to perform best, why and when. Knowledge about this would be very helpful to scientists/practitioners faced with a data set made of strings and the problem of finding a likely distribution over these strings. PAutomaC aimed to fill this knowledge gap by providing the first elaborate test-suite for learning string distributions.

In addition to being very helpful for applications of automata learning methods, PAutomaC was designed in such a way that it provided directions to future theoretical work and algorithm development. For instance, unlike previous automata learning competitions (see Sect. 2.4 for details), in PAutomaC the type of automaton device was not fixed: learning problems were generated using automaton models of increasing complexity. This is not only very useful for practical applications (where many different types of distributions can be encountered), but also aims to answer the interesting question of whether it is best to learn a non-deterministic model (e.g. Hmm) or a deterministic model (e.g. Dpfa) when the data is drawn from a (non-)deterministic distribution, as described for instance in the work of Gavaldà et al. (2006). PAutomaC also encouraged the development and use of new techniques from machine learning that do not build an automaton structure, but do result in a string distribution. Therefore, the actual structures of the learned automata were not evaluated in PAutomaC. Instead, the performance of the different algorithms was tested only on the quality of the resulting string distribution. Like previous automaton learning competitions, this evaluation was performed on-line using a test set and an evaluation oracle running on the competition server. Consequently, the participants could use the observed performance (and that of the competition) to update their algorithms.

The competition setup in PAutomaC contained some novel elements that may also be of interest for competitions of other (string) distribution learning algorithms. Above all, in PAutomaC the performance was evaluated using the actual probabilities assigned by a learned distribution, instead of the more traditional method of evaluating its predictive performance. This has the advantage of not only testing whether the high probability events are assigned the largest probabilities, but also whether the low probability events are assigned the correct low probabilities. Furthermore, the actual strings that were being used for this evaluation were given to the participants beforehand.

The traditional approach to compare language models, which had also been considered for PAutomaC, is to test the learned model on some unseen data. Perplexity (Cover and Thomas 1991) is the usual measure and, in order to perform well on such a metric, it is necessary to learn a smoothed model, in which a non-null probability is assigned to all strings (the penalty is infinite otherwise). Experience shows that in that case the smoothing method may become preponderant: the quality of the model then depends mainly on the quality of the smoothing. Another issue with such an evaluation task is that the model has to be checked somehow for consistency, since the probabilities of all possible strings must sum up to one.

The goal of PAutomaC being to compare learning algorithms (and not smoothing algorithms), a different protocol was chosen: the teams knew the test set in advance, and part of the problem for them consisted in reassigning the mass of probabilities the learned model used for the strings absent from the test set to those strings inside this set. In this way, a perplexity-like evaluation measure could be used to evaluate the differences in the probabilities assigned to different strings from the test set. A couple of possible dangers of this protocol were identified by the PAutomaC Scientific Committee and, later, by the participants. A first one was that the extra information in the test set (which was also randomly drawn from the unknown target distribution) could be used to learn. A second danger came from the fact that the teams could submit various solutions to the same problem (with no feedback about their score, but knowing their overall standing): this could have allowed a hill-climbing strategy. Both the Scientific Committee's analysis and the attempts by some participants showed that the PAutomaC evaluation process was resistant: the winning team was actually the one that submitted the fewest times. We detail in this paper the choices that were made to handle these dangers.

As main contributions of this paper we provide an overview of the literature on probabilistic automaton learning, and describe PAutomaC including its design issues and solutions. The results of the competition and the approaches followed by the main participants are also provided. There is a clear winner of PAutomaC: a novel collapsed Gibbs sampling method for Pfa developed by team Shibata-Yoshinaka. As it is not common to use such a method when learning distributions over strings, we hope and expect this result will influence what people will use in practice. In addition to having an appealing winner, we can draw several interesting conclusions by analyzing the results. In particular, it can be observed that the Alergia-based method developed by team Llorens outperforms the winning team on the deterministic instances. This provides some additional insight into the important question of whether it is better to learn deterministic or non-deterministic models and can serve as an important starting point for further research on this topic. Furthermore, we analyze the PAutomaC results with the goal of determining which method works best in which setting and why. Our analysis indicates the problem areas for each of the used methods, which forms a basis for future studies and hopefully further improvements of these methods. Last but not least, all methods developed by the participating teams significantly outperform the provided baseline algorithms, clearly demonstrating the need for developing and evaluating (new) methods for learning string distributions.

This paper is organized in six sections: introduction (Sect. 1), motivations and history (Sect. 2), an overview of PAutomaC (Sect. 3), final results (Sect. 4), a brief description and analysis of the approaches used by main participants (Sect. 5), and a conclusion (Sect. 6).

2 Motivations and history

We assume the reader to be familiar with the theory of languages and automata (Sudkamp 2006), their probabilistic counterparts such as hidden Markov models (Rabiner 1989), and basic concepts from computational complexity (Sanjeev and Boaz 2009), computational learning theory (Kearns and Vazirani 1994), and information theory (Cover and Thomas 1991). For more information on these topics the reader is referred to the corresponding references.

2.1 Why learn a probabilistic automaton?

Finite state automata (or machines) are well-known models for characterizing the behavior of systems or processes. They have been used for several decades in computer and software engineering to model the complex behaviors of electronic circuits and software such as communication protocols (Lee and Yannakakis 1996). A nice feature of an automaton model is that it is easy to interpret, allowing one to gain insight into the inner workings of a system. In many applications, unfortunately, the original design of a system is unknown. This is the case for instance when one wants to:

  • model Dna or protein sequences in bioinformatics (Sakakibara 2005),

  • find patterns underlying different sounds for speech processing (Tzay 1994),

  • infer morphological or phonological rules for natural language processing (Gildea and Jurafsky 1996),

  • model unknown mechanical processes in physics (Shalizi and Crutchfield 2001),

  • discover the exact environment of robots (Rivest and Schapire 1993),

  • detect anomalies for intrusion detection in computer security (Milani Comparetti et al. 2009),

  • do behavioral modeling of users in applications ranging from web systems (Borges and Levene 2000) to the automotive sector (Verwer et al. 2011),

  • discover the structure of music styles for music classification/generation (Cruz-Alcázar and Vidal 2008).

In all such cases, an automaton model is learned from observations of the system, i.e., a finite set of strings. Usually, the data gathered from observations is unlabeled: it is often possible to observe only strings that can be generated by the system, while strings that cannot be generated are unavailable. The standard method of dealing with this situation is to assume a probabilistic automaton model, i.e., a distribution over strings. In such a model, different states can generate different symbols with different probabilities. The goal of automata learning is then one of model selection (Grünwald 2007): find the probabilistic automaton model that gives the best fit to the observed strings, i.e., that is most likely to have generated the data. In addition to the data probability, this implies that the model size has to be taken into account in order to avoid over-fitting. Otherwise, the model that generates only the seen strings, with probabilities equal to their observed frequencies, would perfectly achieve the goal. But this naive model is of little use: it assigns null probability to all unseen strings and therefore does not generalize.

2.2 Which probabilistic automata to learn?

Several variants of probabilistic automata have been proposed in the past. An important and recurring trade-off among these variants is that the better a machine is at modeling string distributions, the harder it is to learn. The best known variants are probabilistic finite state automata (Pfa) and hidden Markov models (Hmm) (see Fig. 1):

  • Pfa (Paz 1971) are non-deterministic automata in which every state is assigned an initial and a halting probability, and every transition is assigned a transition probability (weight). The sum of all initial probabilities equals 1, and for each state, the sum of its halting probability and all outgoing transition probabilities equals 1. A Pfa generates a string probabilistically: it starts in a state drawn at random from the initial state distribution, then repeatedly either halts or executes an outgoing transition drawn at random according to these probabilities, emitting the transition symbol each time it does not halt (a minimal generation sketch is given after this list). A study of these automata can be found in Vidal et al. (2005a, 2005b).

  • Hidden Markov models (Hmms) (Rabiner 1989; Jelinek 1998) are Pfa (as described in the previous paragraph) where the symbols are emitted at the states instead of at the transitions, which are then only used to move between states. Initial probabilities are assigned to each state but there are no final probabilities, which defines a distribution over Σ^n for each value of n. In order to obtain a distribution over Σ^*, a special halting symbol or state can be introduced. With such an addition, an Hmm generates strings like a Pfa.
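To make the generation procedure described for Pfa above concrete, here is a minimal sketch of how a single string could be generated; the dictionary-based encoding (keys 'init', 'halt', and 'trans') is a hypothetical one chosen purely for illustration.

```python
import random

def generate_string(pfa, rng=None):
    """Sketch of string generation by a Pfa (hypothetical dict encoding):
    pfa['init'][q]  : initial probability of state q,
    pfa['halt'][q]  : halting probability of state q,
    pfa['trans'][q] : list of (symbol, next_state, probability) triples,
    where halt[q] plus the outgoing transition probabilities sum to 1."""
    rng = rng or random.Random()
    states = list(pfa['init'])
    # draw the starting state from the initial distribution
    q = rng.choices(states, weights=[pfa['init'][s] for s in states])[0]
    string = []
    while True:
        outgoing = pfa['trans'][q]
        options = [(None, None)] + [(a, j) for a, j, _ in outgoing]  # (None, None) means "halt"
        weights = [pfa['halt'][q]] + [p for _, _, p in outgoing]
        symbol, nxt = rng.choices(options, weights=weights)[0]
        if symbol is None:
            return string            # the machine halted
        string.append(symbol)        # emit the transition symbol and move on
        q = nxt
```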

Fig. 1  An Hmm (a) and a Pfa (b) that are equivalent: they correspond to the same probability distribution—this example is taken from Dupont et al. (2005)

Interestingly, although Hmms and Pfa are commonly used in distinct areas of research, they are equivalent with respect to the distributions that can be modeled: an Hmm can be converted into a Pfa and vice-versa (Vidal et al. 2005a; Dupont et al. 2005). Though it is easy to randomly generate strings from these models, determining the probability of a given string is more complicated because different executions can result in the same string. For both models, computing this probability can be done optimally by dynamic programming using variations of the Forward (or Backward) algorithm (Baum et al. 1970). However, estimating the most likely parameter values (probabilities) for a given set of strings and a given model (maximizing the likelihood of the model given the data) cannot be solved optimally unless RP equals NP (Abe and Warmuth 1992). The traditional method of dealing with this problem is the greedy Baum-Welch algorithm (Baum et al. 1970).
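As an illustration of the dynamic-programming computation mentioned above, the following minimal sketch evaluates the probability of a string under a Pfa; the matrix encoding (an initial vector, one transition matrix per symbol, and a halting vector) is an assumption made only for this example.

```python
import numpy as np

def string_probability(pi, T, halt, x):
    """Forward-style computation for a Pfa with N states.
    pi[i]     : initial probability of state i,
    halt[i]   : halting probability of state i,
    T[a][i, j]: probability of emitting symbol a while moving from state i to j,
    so that halt[i] + sum over a, j of T[a][i, j] equals 1 for every state i.
    Returns the probability of string x, summed over all state paths."""
    alpha = np.asarray(pi, dtype=float)  # alpha[i]: prob. of the prefix so far, ending in state i
    for a in x:
        alpha = alpha @ np.asarray(T[a], dtype=float)   # one dynamic-programming step per symbol
    return float(alpha @ np.asarray(halt, dtype=float)) # halt after the last symbol
```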

The deterministic counterpart of a Pfa is a deterministic probabilistic finite automaton (Dpfa) (Carrasco and Oncina 1994). These have been introduced essentially for efficiency reasons: in the non-probabilistic case, learning a Dfa is provably easier than learning a Nfa (de la Higuera 2010). However, although non-probabilistic deterministic automata are equivalent to non-probabilistic non-deterministic automata in terms of the languages they can generate, it is shown in Vidal et al. (2005a, 2005b) and Dupont et al. (2005) that Dpfa are strictly less powerful than Pfa. Furthermore, distributions generated by Pfa cannot be approximated by Dpfa unless the size of the Dpfa is allowed to be exponentially larger than that of the corresponding Pfa (Guttman et al. 2005, 2006). There is a positive side to this loss in power: estimating the parameter values of a Dpfa is easy, and there exist algorithms that learn a Dpfa structure in a probably approximately correct (Pac) like setting (Clark and Thollard 2004). This is not known to be possible for Pfa or Hmms. For Pfa it has only been shown that they are strongly learnable in the limit (Denis and Esposito 2004), or Pac-learnable (under some restrictions) using a (possibly exponentially larger) Dpfa structure (Gavaldà et al. 2006).

In addition to Pfa, Hmms, and Dpfa, other probabilistic finite state automata have been proposed, such as: Markov chains (Saul and Pereira 1997), n-grams (Ney et al. 1997; Jelinek 1998), probabilistic suffix trees (Pst) (Ron et al. 1994), probabilistic residual finite state automata (Prfa) (Esposito et al. 2002), and multiplicity automata (Ma) (Bergadano and Varricchio 1996; Beimel et al. 2000) (or weighted automata—Mohri 1997). Probabilistic accepting automata use weights to assign to each individual string a probability of being accepted (versus not accepted); such automata are also sometimes called fuzzy automata. Although Markov chains and n-grams are a lot less powerful than Dpfa (both the structure and parameters are easy to compute given the data), they are very popular and often effective in practice. In fact, to the best of our knowledge, it is an open problem whether Pfa, Hmm, or Dpfa learners are able to consistently outperform n-gram models on prediction tasks. Probabilistic suffix trees are acyclic Dpfa that have a Pac-like learning algorithm (Ron et al. 1994). Probabilistic residual finite state automata are more powerful than Dpfa, but less powerful than Pfa. Although multiplicity automata are more powerful than Pfa, they have also been shown to be strongly learnable in the limit (Denis et al. 2006). The expressive power of the different types of probabilistic automata is summarized in Fig. 2.

Fig. 2  The hierarchy of the different finite state machines. Multiplicity automata (Ma) can model the most distributions (but can also model other functions); n-grams are the least expressive

2.3 How to learn a probabilistic automaton?

Early work concerning the learning of distributions over strings can be found in Horning (1969) and Angluin (1988). In the first case, the goal was to learn probabilistic context-free grammars; in the second, convergence issues concerning identification in the limit with probability 1 are studied. Although these initial studies were done decades ago, only three techniques have become mainstream for learning Pfa, Hmms, and Dpfa.

Parameter estimation

The first family of techniques takes a standard structure or architecture for the machine, typically a complete graph, and then tries to find parameter settings that maximize the likelihood of the model given the data. If the structure is deterministic, the optimization problem is quite simple: transition probabilities can be estimated by maximum likelihood, i.e., by relative frequencies (Wetherell 1980). If not, the standard method is the Baum-Welch algorithm (Baum et al. 1970; Baum 1972), which iteratively computes a new estimate for the transition probabilities using the probabilities assigned to the input data. Although this technique is known to be sensitive to the initial probabilities and may get stuck in a local optimum, it has frequently been applied successfully in practice.
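For a deterministic structure, these maximum likelihood estimates are simply relative frequencies of the events observed while parsing the training strings. The sketch below illustrates this counting step; the encoding of the structure (a transition function delta and an initial state q0) is an assumption introduced for the example.

```python
from collections import defaultdict

def estimate_dpfa_parameters(delta, q0, strings):
    """Maximum likelihood estimation for a fixed deterministic structure:
    delta[q][a] gives the unique successor of state q on symbol a (hypothetical
    encoding), q0 is the initial state. Returns the relative frequencies
    P(a | q) and P(halt | q) observed while parsing the strings."""
    symbol_count = defaultdict(lambda: defaultdict(int))
    halt_count = defaultdict(int)
    total_count = defaultdict(int)
    for x in strings:
        q = q0
        for a in x:
            symbol_count[q][a] += 1
            total_count[q] += 1
            q = delta[q][a]           # deterministic: the next state is unique
        halt_count[q] += 1            # the string ends in state q
        total_count[q] += 1
    symbol_prob = {q: {a: c / total_count[q] for a, c in counts.items()}
                   for q, counts in symbol_count.items()}
    halt_prob = {q: halt_count[q] / total_count[q] for q in total_count}
    return symbol_prob, halt_prob
```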

Bayesian inference

The second family of techniques corresponds to Bayesian methods such as Gibbs sampling (Gelfand and Smith 1990), see, e.g., Neal (2000), Gao and Johnson (2008). Instead of learning a single model (a point estimate), these methods aim to make predictions using the joint distribution formed by all possible models. This joint distribution is hard to compute, and an Hmm Gibbs sampler estimates it by iteratively sampling the visited hidden states conditioned on earlier samples of all other state visits. The stationary distribution of the Markov chain thus formed is exactly this joint distribution. Although these methods are not yet commonplace for Pfa, we believe this is likely to change after this competition.

State-merging

Learning Dpfa typically relies on the technique of state-merging (see, e.g., de la Higuera 2010): the idea is to start with a very large automaton with enough states to describe the learning sample exactly, and then to iteratively combine the states of this automaton in order to refine it into a more compact model. The three state-merging algorithms for probabilistic automata that have had the largest impact were proposed in the mid-nineties:

  • Alergia by Carrasco and Oncina (1994),

  • Bayesian model merging by Stolcke (1994), and

  • Learn-PSA by Ron et al. (1995).

The first deals with learning a Dpfa, while the second tries to learn both the parameters and the structure of an Hmm. The third learns probabilistic suffix trees. Like the first technique, these are greedy algorithms that can get stuck in local optima. However, they do come with theoretical guarantees: probabilistic suffix trees can be Pac-learned (Ron et al. 1994), Dpfa have been proved to be learnable in the limit with probability 1 (Carrasco and Oncina 1994), and more recently it has been shown that they can also be learned in a Pac-like setting (Clark and Thollard 2004). Based on these three basic algorithms, a number of refinements of state-merging learning algorithms have been proposed:

  • There have been several extensions of Alergia (de la Higuera and Thollard 2000; Carrasco et al. 2001; de la Higuera and Oncina 2003, 2004; Young-Lai and Tompa 2000; Goan et al. 1996).

  • Improvements of Ron et al. (1995) based on the concept of distinguishable states have been developed (Thollard and Clark 2004; Palmer and Goldberg 2005; Guttman 2006; Castro and Gavaldá 2008). An incremental version also exists (Gavaldà et al. 2006).

  • Algorithm Mdi was introduced by Thollard and Dupont (1999), Thollard et al. (2000), Thollard (2001). This algorithm also uses state merging.

  • Recently, state-merging algorithms have been extended to learn not only distributions over strings of events/symbols but also over their timing behaviors (Verwer et al. 2010), and to learn from a continuous stream of data instead of a fixed data set (Balle et al. 2012).

Other methods

Several other methods have been proposed that have not yet become mainstream, most notably:

  • Esposito et al.'s (2002) approach consisted of learning probabilistic residual finite state automata based on identifying the residuals of a rational language. These are the probabilistic counterparts of the residual finite state automata introduced by Denis et al. (2000, 2001).

  • Denis et al. (2006) and Habrard et al. (2006) introduced the innovative algorithm Dees that learns a multiplicity automaton (the weights can be negative but in each state the weights sum to one) by iteratively solving equations on the residuals.

  • Other algorithms learning multiplicity automata have been developed, using common approaches in machine learning such as recurrent neural networks (Carrasco et al. 1996), Principal Component Analysis (Bailly et al. 2009) or a spectral approach (Bailly 2011).

Most of these methods estimate the model parameters based on maximum likelihood. This can cause problems when computing probabilities, especially for strings of low frequency. For some of these methods, therefore, smoothing techniques have been developed that adjust the maximum likelihood estimate in order to overcome these difficulties (Chen and Goodman 1996). Typically, these smoothing methods assign larger probabilities to infrequent strings, and consequently smaller probabilities to more frequent ones. For n-gram learning, smoothing is very often used and sophisticated methods such as back-off smoothing exist (Zhai and Lafferty 2004). For Dpfa learning, smoothing techniques can be found in Dupont and Amengual (2000), Thollard (2001), Habrard et al. (2003). Smoothing Pfa and Hmms is still a question requiring further research.
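As a minimal illustration of the idea behind smoothing, the sketch below applies simple additive smoothing to the next-symbol counts of a single context; the back-off and Witten-Bell methods cited above are refinements of this basic scheme, and the pseudo-count value used here is arbitrary.

```python
def additive_smoothing(counts, alphabet, delta=0.5):
    """Additive smoothing of next-symbol counts for one context: every symbol
    receives a pseudo-count delta, so unseen symbols get a small non-zero
    probability while frequently seen symbols lose a little probability mass."""
    total = sum(counts.get(a, 0) for a in alphabet) + delta * len(alphabet)
    return {a: (counts.get(a, 0) + delta) / total for a in alphabet}
```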

In conclusion, many algorithms for learning probabilistic automata have been developed. Due to the difficulty of the learning problem, most of them focus on some form of Dpfa. Another important approach is to learn Markov chains or n-grams by simply counting the occurrences of sub-strings. As already stated, these simple methods have been very successful in practice (Brill et al. 1998). When one is faced with a data set made of strings and one needs to find a likely distribution over these strings for tasks such as prediction, anomaly detection, or modeling, it would be very helpful to know which model is likely to perform best and why. Due to the lack of a thorough test of all of these techniques, this is currently an open question. Furthermore, the fact that all known algorithms are greedy, together with the recent successes of search-based approaches for non-probabilistic automaton learning (Heule and Verwer 2010; Hasan Ibne et al. 2010), makes one wonder whether search-based strategies are also beneficial for probabilistic automaton learning. The Probabilistic Automaton learning Competition (PAutomaC) aimed to answer these questions by providing an elaborate test-suite for learning string distributions.

2.4 About previous competitions

There have been several competitions in the past related to learning finite state machines or grammars.

  • The first grammatical inference competition was organized in 1999. The participants of Abbadingo (http://abbadingo.cs.nuim.ie) had to learn Dfa with sizes ranging from 64 to 512 states from positive and negative data consisting of strings over a two-letter alphabet.

  • A follow-up was the Gowachin system (http://www.irisa.fr/Gowachin/), developed to generate new automata for classification tasks; it introduced the possibility of having a certain level of noise in the data.

  • The Omphalos competition (http://www.irisa.fr/Omphalos/) involved learning context-free grammars, given samples which in certain cases contained both positive and negative strings, and in others, just text.

  • In the Tenjinno competition, the contestants had to learn finite state transducers (http://web.science.mq.edu.au/tenjinno/).

  • The Gecco conference organized a competition involving learning Dfa from noisy samples (http://cswww.essex.ac.uk/staff/sml/gecco/NoisyDFA.html).

  • The Stamina competition (http://stamina.chefbe.net/), organized in 2010, also involved learning Dfa; new methods were used that made it possible to solve even harder problems.

  • The Zulu competition (http://labh-curien.univ-st-etienne.fr/zulu/) concerned the task of actively learning Dfa through requests to an oracle.

  • The Rers Grey Box Challenge (http://leo.cs.tu-dortmund.de:8100/isola2012) aimed to discover the complementary values of white-box and black-box software system analysis techniques, including tools for learning finite state machines.

More generally, a number of other machine learning competitions have been organized during the past years. A specific effort has been made by the Pascal network (http://pascallin2.ecs.soton.ac.uk/Challenges/).

3 An overview of PAutomaC

The goal of PAutomaC was to provide an overview of which probabilistic automaton learning techniques work best in which setting and to stimulate the development of new techniques for learning distributions over strings. In order to stimulate this development, PAutomaC was set up using an oracle server that was able to evaluate the submissions by participants on-line. Furthermore, in contrast to the traditional methods used to evaluate predictive machine learning algorithms, the performance in PAutomaC was evaluated using the actual probabilities assigned by a learned distribution.

Two types of data were available: artificial data and real-world data donated by researchers and industries. The latter, however, turned out to be of little interest in the context of the competition. The problem is that the target probabilities of real-world data are unknown, so any way of fixing them for evaluation purposes introduces a bias. We chose to fix these probabilities using 3-grams trained on the complete data sets, hoping that the induced bias would be drastically reduced since the competition sets consisted of less than 10 % of these data. Unfortunately, this goal was not achieved: the participants who scored best on these data sets used n-grams (even when they were using more complex approaches on the artificial data sets). We will thus not discuss the real-world data sets in the rest of this paper (detailed information is available on the website).

In this section, we first describe the way the target automata were generated. We then turn our attention to how the submissions of the participants were evaluated. Finally, we discuss the choices made throughout this process.

3.1 Generating artificial data

Artificial data was generated by building random probabilistic machines with 5 to 75 states and with an alphabet of 4 to 24 symbols (both bounds inclusive, both values decided uniformly at random). These machines were subsequently used to generate data sets. Of all possible state-symbol pairs that could occur in transitions, between 20 and 80 percent (the symbol sparsity) were generated. These pairs were selected by first choosing a state at random, and subsequently choosing a symbol from the set of symbols that had not yet been selected for that state. This created a selection without replacement from the set of all possible state-symbol pairs that was modified to remain uniform over the states. This modification made it less likely that the resulting symbols were evenly distributed over the states. For every generated state-symbol pair, one transition was generated to a randomly chosen target state. Between 0 and 20 percent (the transition sparsity) additional transitions were generated, selected without replacement from the set of possible transitions, again modified to remain uniform over the source states and transition labels.

Initial and final states were selected without replacement until the percentages of selected states exceeded the transition and symbol sparsities, respectively. All initial, symbol, and transition probabilities were drawn from a Dirichlet distribution with concentration parameters set to 1 (making every probability distribution equally likely). The final probabilities were drawn together with the symbol probabilities.
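The following minimal sketch shows this probability assignment for a single state: the halting probability and the probabilities of the selected outgoing transitions are drawn jointly from a flat Dirichlet distribution, as described above. The function name and interface are chosen only for illustration.

```python
import numpy as np

def assign_probabilities(n_outgoing, rng=None):
    """Draw the halting probability and the probabilities of the n_outgoing
    selected transitions of one state jointly from a Dirichlet distribution
    with all concentration parameters set to 1 (a flat prior)."""
    rng = rng or np.random.default_rng()
    probs = rng.dirichlet(np.ones(n_outgoing + 1))  # one extra slot for halting
    return probs[0], probs[1:]                      # (halting prob., transition probs.)
```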

One training set (with repetitions) and one test set (without repetitions) were generated from every target. With probability one in four, the generated training set was of size 100 000; otherwise it was of size 20 000. New test strings were generated using the target machine until 1 000 unique strings had been obtained. The test strings were allowed to overlap with the strings used for training. If the average length of the generated strings was less than 5 or greater than 50, a new automaton and new data sets were generated using the same construction parameters. In total, 150 models and corresponding training and test sets were generated in this way. We evaluated the difficulty of the generated sets using a 3-gram baseline algorithm: a problem was considered easy if the baseline output was close to the target (a perplexity difference of less than 1.0), and difficult otherwise. We then selected 16 of them, aiming to obtain a range of values for the number of states, the size of the alphabet, the sparsity values, and the difficulty. We applied the same procedure for Dpfa but without generating additional transitions; and for Hmms, we generated state-state pairs instead of state-symbol-state triples.

In total, this resulted in 48 (16 for each type) artificially generated problems for use in the competition. The participants were given no information about the target other than the two files of strings (one for the training set and one for the test set). The format of these files is given in Fig. 3.

Fig. 3  Format of the files made available to the participants of the PAutomaC challenge

3.2 Evaluation

The evaluation measure was based on perplexity. Given a test set S, it was defined by the formula:

$$\mathrm{Score}(C, S)=2^{- \sum_{x \in S} P_{\mathrm{T}}(x) \log_2 (P_{\mathrm{C}}(x)) } $$

where \(P_{\mathrm{T}}(x)\) is the normalized probability of x in the target and \(P_{\mathrm{C}}(x)\) is the normalized candidate probability for x submitted by the participant. The normalization is the usual one when perplexity is considered: the probabilities are modified so that they sum to 1 on the set S. A consequence of this normalization was that adding probability to one of the test strings removed probability from the others. The perplexity score therefore measured how well the differences in the assigned probabilities matched the differences in the target probabilities.
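For concreteness, a minimal sketch of this score computation, including the normalization over the test set, could look as follows; the vector ordering (one entry per test string, in the same order for target and candidate) is an assumption of the sketch.

```python
import numpy as np

def pautomac_score(target_probs, candidate_probs):
    """PAutomaC evaluation score: both probability vectors are first
    normalized to sum to 1 over the test set, then the perplexity is computed."""
    p_t = np.asarray(target_probs, dtype=float)
    p_c = np.asarray(candidate_probs, dtype=float)
    p_t = p_t / p_t.sum()
    p_c = p_c / p_c.sum()
    return 2.0 ** (-np.sum(p_t * np.log2(p_c)))
```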

Notice that this measure is equivalent to the well-known Kullback-Leibler (KL) divergence (Kullback and Leibler 1951). Indeed, given two distributions P and Q, the KL divergence is defined as \(KL(P,Q)=\sum_{x} P(x)\log_2(P(x)/Q(x))\), which can be rewritten as \(KL(P,Q)=(-\sum_{x} P(x)\log_2 Q(x))-H(P)\), where H(P) is the entropy of the target distribution. H(P) is constant in our case since the aim is to compare various candidate distributions Q. As we were only interested in the divergence on a given test set S, the only varying element of the KL divergence is \(-\sum_{x \in S} P(x)\log_2 Q(x)\), which is equivalent to our measure up to a monotonic transformation.

To decide the final overall rank of each participant, points were attributed for each data set: the leader of a problem at the end of the competition scored 5 points, the second 3, the third 2, and the fourth 1. In case of a tie on a problem (based on the first 10 digits of the perplexity score), the earliest submission won. The winner was the participant whose overall score was the highest. There was no restriction on the number of submissions a given participant could provide, but no feedback was given about the resulting perplexity. To compute the final score of a participant, only the best submission to each problem was considered.

3.3 Discussion on the design of the competition

When organizing an on-line competition, one has to make various choices about the generation of data and the evaluation of the participant submissions. We described above what was done for PAutomaC but we feel that the choices that were made have to be discussed. What follows thus contains arguments about the validity of our approach and therefore of the results of the competition.

Target generation

As already stated, we used a Dirichlet distribution for sampling the output probabilities. The main advantage of this method is that every possible distribution is equally likely when sampled using a Dirichlet distribution (with concentration parameters set to 1). Notice that this does not happen when every output probability is iteratively sampled uniformly at random. Since we did not intend to bias the distribution in PAutomaC towards certain types of distributions, using the Dirichlet distribution seemed the logical choice.

If we were to sample all output probabilities from a Dirichlet distribution unconstrained, however, we would obtain a very densely connected Pfa with high probability. Such densely connected automata are uninteresting from a learning perspective: a simple one-gram will already reach a close to optimal perplexity. We therefore constrained this sampling using symbol and transition sparsity values. These two values were preselected and the generated Pfa was then forced to match these sparsity values. Afterwards, we sampled the transition probabilities for every state using a Dirichlet distribution.

The Pfa structure generator worked by iteratively adding new transitions until the preselected sparsity values were reached. This selection remained uniform over all states, lowering the probability that every state gets assigned the same number of symbols and transitions. The generator was initialized by adding to every state one random symbol and one random transition for that symbol. This avoided the generation of states with a final probability of 1.0, i.e., sink states. This was done because we aimed for the final probability generation to be independent of the structure generation.

The final probability of each state was handled as the emission of a special symbol: this allowed a simple normalization process and did not influence the bias over distributions, since these values were sampled together with the output probabilities using a Dirichlet distribution. Together with the consistency test (see below), this ensured that the generated machines corresponded to a proper distribution (the probabilities of all possible strings summed to 1). The selection of which states had final probabilities, however, was performed independently of the process used to select output transitions. This ensured that having more output symbols did not lead to lower final probabilities.

An important step took place directly after the generation of a target. It consisted in checking that all states were reachable from an initial state and that they were all co-accessible. Verifying the consistency of the machine in this way ensured that there was no path (and thus no probability mass) reaching a part of the machine that never led to an accepting state. In addition, we checked that the generated probability distribution did not give too much weight to very long or very short strings. Although this created some bias in the generation procedure, it was unavoidable: testing the different methods on instances that are too difficult or uninteresting makes no sense.

Evaluation

As already stated, the choice of an evaluation function that does not rely on a particular type of machine was a fundamental requirement of PAutomaC. Using a perplexity measure had the advantage of being a widely accepted way to compare distributions, and its link with the KL divergence was clearly a plus. Though we did not inform the participants about it, we also computed two other evaluation functions for each submission: the max-norm (the maximal difference between the submitted probabilities and the target ones) and the sum-norm (the sum of the differences between the submitted and the target probabilities). While on a few problems the ranking of the participants was slightly different from the one obtained with the official perplexity measure, the overall ranking of the teams was the same.

A common issue when dealing with string distributions is smoothing. When using perplexity as a measure, smoothing becomes necessary because strings with zero probability obtain an infinite KL divergence when compared to the target (or any other non-zero assigning distribution) and thus an infinite perplexity. Although smoothing can be very beneficial in practice, we feel that the standard perplexity measure is too dependent on smoothing (compared to the max-norm, for instance) and therefore that a perplexity evaluation based on an unseen test set does not properly measure the quality of the string distribution. In PAutomaC we therefore decided to provide the participating teams with knowledge of the actual strings used to compute the perplexity measure. This removed the need for specialized smoothing methods since the participants could simply use a minimum value for the probability assigned to any string.

Collusion

A usual problem with on-line competitions is the possibility of collusion. Indeed, a set of test data has to be made available to participants in order to evaluate the performance of their algorithm with respect to a given target. But if this set contains information about the target, then it can be used during the learning phase and may bias the results. In a competition where the targets are not stochastic devices, this problem is usually tackled by requiring that elements of the test set do not occur in the training set (though they are generated by the same process). This cannot be ensured when the aim is to learn a distribution, as both sets have to be generated using the target: erasing elements of the test set that occur in the training set would introduce an important bias in the distribution of the test set. We therefore chose to keep these elements, expecting that the difference in size between the training and test sets would suffice to make the information contained in the test set useless.

But collusion can also result from the fact that the test set by itself contains information about the target distribution: duplicate strings are likely to be frequent in the target distribution. This is why we decided to remove redundant elements of the test sets, creating a small bias in the distribution of these sets. However, since the actual target distribution was used during the evaluation, and thanks to the choices made for this phase, this did not result in a bias or other problems during evaluation.

4 Results

4.1 Competition activity

In total, 38 participants registered to have access to the problem sets and 16 of them submitted at least one solution to a problem. There was a total of 2 787 submissions during the competition. Five participants managed to score points, and four of them were ranked first at least once (see Fig. 4).

Fig. 4  Overall evolution of the score of the 5 leading teams (artificial data sets). For each problem, 5 points were given to the team whose best submission had the smallest perplexity, 3 points to the second best team, 2 for the third and 1 for the fourth

During the competition phase, the website received 724 visits (with a maximum of 54 on the last day of the competition) from 196 unique visitors, with an average visit duration of a bit more than 5 minutes. IPs from 37 countries were detected, 14 of which corresponded to 5 or more visitors.

4.2 Overall results

The final scores can be seen in Fig. 4 and detailed results are presented in Table 1 (available in the Appendix). There is a clear winner of PAutomaC: team Shibata-Yoshinaka. Of all participants, they obtained the best perplexity values on most instances and performed well on all others. This result is validated by the computation of other performance indicators (the max-norm and sum-norm). From Table 1 it can be observed that the method implemented by team Shibata-Yoshinaka really works well for all of the competition problems: the difference between the perplexity values of the solutions and their submissions was never greater than 0.1. Furthermore, this difference was even smaller on the instances with 100 000 strings, indicating that they made good use of additional data.

4.3 Analysis of the results

In PAutomaC, the different approaches were tested on problem instances with a broad range of parameter values and coming from different probabilistic automaton models (see Table 2 in the Appendix). This makes it possible to perform some additional analysis of the results with the goal of discovering when each method works best and trying to understand why. Tables 1 and 2 (both in the Appendix) clearly show that team Shibata-Yoshinaka is only outperformed on the (nearly) deterministic instances (Dpfa, or Pfa/Hmm with a small transition sparsity). On these instances team Llorens performs slightly better. Team Hulden's method also manages to obtain the best perplexity values on two instances, and actually beats team Llorens's overall performance by just 2 points (rightmost points in Fig. 4). Their method seems to perform best on dense instances with few states. The methods used by team Bailly and team Kepler have some difficulties with very sparse instances (and thus also with Dpfa), and perform well but not best on the other instances.

We further analyzed the results using a standard decision tree learner for two prediction tasks:

  1. Predicting the winner given the problem instance parameter values.

  2. Predicting whether a deterministic distribution was used to generate the problem instance given the winner.

The resulting decision trees are depicted in Fig. 5. Interestingly, although team Shibata-Yoshinaka performs well on all problem instances, they are outperformed by team Llorens on sparse problem instances. Sparse instances are generated using an automaton that contains only a tiny fraction of all possible transitions given the number of states and the alphabet size (see Sect. 3.1). Since deterministic automata are fixed to use such a fraction, most of these automata are deterministic. This is confirmed by the second prediction task, which indicates that when team Llorens performs best there is an 80 % chance that the generator is deterministic. This result is very interesting since team Shibata-Yoshinaka and team Hulden used methods based on non-deterministic automaton models, while team Llorens used deterministic models (see Sect. 5). Of course, we cannot be sure that the used model or the used method is important when predicting the type of generator, but it seems to indicate that it is best to learn a non-deterministic model when the data is drawn from a non-deterministic distribution, and that it is best to learn a deterministic model when the data is drawn from a deterministic distribution. In fact, this result also shows that it is possible to detect whether a given set of strings is drawn from a deterministic or non-deterministic generator (the second tree in Fig. 5): use team Shibata-Yoshinaka's, team Hulden's, and team Llorens's methods to learn a predictor, test their performance on a validation set, and return the type of model used by the best performing method. Such a method has several interesting applications, such as evaluating a possible discretization of values coming from an abstract deterministic generator. In the next section, we provide detailed descriptions and individual analyses for each of the methods.

Fig. 5  Decision trees predicting the winner given the parameters of a problem instance (left), and whether a deterministic or non-deterministic generator was used given who won (right)

5 The different approaches and individual results

A wide spectrum of learning approaches was used during the competition. We describe in this section the approaches of the main participants—those who scored at least one point—and provide a brief analysis of their performance in PAutomaC. This section is the result of in-depth discussions and electronic exchanges the authors had with the different teams. The overview presented here is nevertheless brief, and the reader is referred to the original papers describing each team's work for more detail.

5.1 Team Shibata-Yoshinaka

Shibata and Yoshinaka (2012) used a Gibbs sampling method to estimate the probability \(\mathrm{Pr}(\mathbb{b}|\mathbb{a})\) of a future sentence \(\mathbb{b}\) given training data \(\mathbb{a}\) generated by an unknown Pfa. The probability that a Pfa generates a sentence \(\mathbb{a} = a_{1} \cdots a_{T}\) by passing states \(\mathbb{z} = z_{0} \cdots z_{T}\) in this order is given as

$$ \mathrm{Pr}(\mathbb{a},\mathbb{z} \mid\xi) = \prod _{1 \le t \le T} \xi _{z_{t-1} a_t z_{t}} = \prod_{i,a,j} \xi_{i a j}^{C_{iaj}}, $$
(1)

where \(\xi_{iaj}\) is the probability of the state change from i to j with a letter a, and \(C_{iaj}\) counts the number of times that transition occurs. Applying Gibbs sampling directly to ξ is somewhat tricky: for instance, it requires one to continuously compute new state sequences, see, e.g., Gao and Johnson (2008). Therefore, they first marginalize ξ out of Eq. (1) under the assumption that the prior of ξ is a Dirichlet distribution. Intuitively, this computes the sum of Eq. (1) over all possible values of ξ, each weighted by the probability of that ξ. Although this is a very large sum to compute, under the assumption that ξ is Dirichlet distributed many terms cancel out, making the resulting computation easy. This technique is called Collapsed Gibbs Sampling, see, e.g., Blei and Jordan (2006).

Shibata and Yoshinaka sample different values \(\mathbb{z}^{(1)},\ldots,\mathbb{z}^{(S)}\) for \(\mathbb{z}\) independently from the resulting distribution by Gibbs sampling, i.e., by iteratively sampling from \(\mathrm{Pr}(z_{t} \mid\mathbb{a}, z_{0} \cdots z_{t-1} z_{t+1} \cdots z_{T})\). The exact values of \(\widetilde{\xi}^{(1)},\ldots,\widetilde{\xi}^{(S)}\) are then simply the expectation based on the state transition history:

$$\widetilde{\xi}^{(s)}_{iaj} = \mathrm{E} \bigl[\xi_{iaj} \, |\, \mathbb{a},\mathbb{z}^{(s)} \bigr] = \frac{C_{iaj}+\beta }{C_{i}+AN\beta}, $$

where N is the (maximum) number of states of the target Pfa, A is the size of the alphabet, and β is the smoothing parameter (the prior).

In the actual implementation, they fixed the number of CGS iterations and the number of sampling points a priori. The values of N and β were determined by 10-fold cross validation among a finite number of candidates. Finding good settings for these values required considerable computational resources.
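The following is a heavily simplified sketch of the kind of collapsed Gibbs sampler described above; it is not the team's implementation. It fixes a single initial state, ignores halting probabilities, uses only the final sample instead of averaging over several, and omits the small correction terms of the collapsed conditional that arise when neighbouring transitions share counts.

```python
import numpy as np

def collapsed_gibbs_pfa(strings, N, A, beta, n_sweeps=50, seed=0):
    """Collapsed Gibbs sampling over the hidden state sequences of a Pfa.
    strings: list of sequences of symbol indices in range(A),
    N: number of states, A: alphabet size, beta: Dirichlet prior parameter.
    Returns smoothed transition probability estimates xi[i, a, j]."""
    rng = np.random.default_rng(seed)
    # z[s][t] is the state visited before emitting the t-th symbol of string s
    z = [np.concatenate(([0], rng.integers(0, N, size=len(x)))) for x in strings]
    C = np.zeros((N, A, N))                      # C[i, a, j]: transition usage counts
    for x, zs in zip(strings, z):
        for t, a in enumerate(x):
            C[zs[t], a, zs[t + 1]] += 1

    for _ in range(n_sweeps):
        for x, zs in zip(strings, z):
            for t in range(1, len(x)):           # resample each interior hidden state
                i_prev, a_in = zs[t - 1], x[t - 1]
                a_out, j_next = x[t], zs[t + 1]
                C[i_prev, a_in, zs[t]] -= 1      # remove the current assignment
                C[zs[t], a_out, j_next] -= 1
                # collapsed conditional (correction terms for shared counts omitted)
                p = ((C[i_prev, a_in, :] + beta)
                     * (C[:, a_out, j_next] + beta)
                     / (C.sum(axis=(1, 2)) + A * N * beta))
                zs[t] = rng.choice(N, p=p / p.sum())
                C[i_prev, a_in, zs[t]] += 1      # record the new assignment
                C[zs[t], a_out, j_next] += 1

    # expectation of xi given the final assignment, as in the formula above
    return (C + beta) / (C.sum(axis=(1, 2), keepdims=True) + A * N * beta)
```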

Analysis

The result of learning a decision tree that aims to predict the performance of team Shibata-Yoshinaka given the problem parameters (unknown to the participants during the competition) is shown in Fig. 6. In the learned tree, we can clearly observe that the collapsed Gibbs sampling approach of team Shibata-Yoshinaka performs best when there are many (100k) strings available for training, or when the target contains few (<21) states. Moreover, in the other cases, it still finds distributions close to the optimal one (with an average perplexity difference of 0.0467).

Fig. 6  Decision tree predicting the performance (perplexity difference with the solution) of team Shibata-Yoshinaka given the parameters of a problem instance

5.2 Team Hulden

The inference approach of Hulden (2012) used three strategies:

  1. A basic “baseline” n-gram strategy with smoothing.

  2. Another “baseline” n-gram strategy without smoothing, but using interpolated test data.

  3. The construction of fully connected Pfa inferred with Baum-Welch (EM), each between 5 and 40 states in size. Training was done using only the original training data, and separately also using reconstructed training data, as in (2).

In the first strategy, the n-gram counts were extracted from the training data for various values of n (between 2 and 9). Then, the log-likelihood of the training data was calculated, and the n yielding the highest log-likelihood was used to assign the probabilities to the test strings for submission. Witten-Bell smoothing (see, e.g., Chen and Goodman 1996) was used in all cases.
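A minimal sketch of this order-selection step is shown below; for brevity it uses simple additive smoothing where team Hulden used Witten-Bell smoothing, and the BOS/EOS padding symbols are assumptions of the sketch.

```python
import math
from collections import Counter

def best_ngram_order(strings, alphabet_size, orders=range(2, 10), delta=0.5):
    """Pick the n-gram order whose smoothed model yields the highest
    log-likelihood on the training data itself."""
    def loglik(n):
        ngrams, contexts = Counter(), Counter()
        padded = [('BOS',) * (n - 1) + tuple(x) + ('EOS',) for x in strings]
        for x in padded:                 # collect n-gram and context counts
            for t in range(n - 1, len(x)):
                ngrams[x[t - n + 1:t + 1]] += 1
                contexts[x[t - n + 1:t]] += 1
        ll = 0.0
        for x in padded:                 # score the same data with additive smoothing
            for t in range(n - 1, len(x)):
                num = ngrams[x[t - n + 1:t + 1]] + delta
                den = contexts[x[t - n + 1:t]] + delta * (alphabet_size + 1)
                ll += math.log(num / den)
        return ll
    return max(orders, key=loglik)
```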

For the second approach, the n chosen in the first one was used as the window size for the n-grams. In this strategy, the test data was used for training as well, and was augmented in an iterative fashion. This was because the original test data represented a skewed distribution, as duplicates had been removed. First, the expected number of occurrences of each string in the test set was calculated based on the total number of occurrences of that string in the training and test sets. Based on this expected number, a fractional count of strings was “added” to the test data, reflecting a guess that the original test data had contained these duplicates. This process was repeated until convergence (when the expected string count in the test data no longer changed). These counts were then used for calculating the probabilities of each string in the test data.

For the third strategy, three randomly initialized Pfa were trained with Baum-Welch for each problem and each size (5, 10, 20, and 40 states), after which the one with the highest log-likelihood was submitted (several results in case of approximate ties). As in the n-gram case, another three runs for each state size were made using both the training and the reconstructed test data. However, contrary to the n-gram strategy, using reconstructed test data for training never improved on the basic Baum-Welch that used only the PAutomaC training data.

The n-gram solutions were submitted early and the EM solutions later. This allowed the observation, based on the server feedback, that EM outperformed the n-grams in most cases (roughly 85 % of the problems). A notable exception was the two real-data problems, where the interpolated n-grams performed best in each case. As mentioned, using reconstructed test data for training helped in the n-gram strategy, but not with Baum-Welch, probably because of severe over-fitting.

Analysis

The tree predicting the performance of team Hulden's Baum-Welch/EM approach is depicted in Fig. 7. Their method performs best on dense problems (transition sparsity >0.0215), and excels when the target does not contain too many states (<35). Overall, their performance is close to that of the winning team. From personal communication, we learned that the amount of computing power used by team Shibata-Yoshinaka's method is much larger than that used by team Hulden's. Unfortunately, the influence of the computational resources on the results could not be measured, nor was it a criterion for the competition itself.

Fig. 7  Decision tree predicting the performance (perplexity difference with the solution) of team Hulden given the parameters of a problem instance

5.3 Team Llorens

The approach followed by the Llorens team was two-fold. On the one hand, they upgraded the Alergia algorithm (Carrasco and Oncina 1994) with ideas from evidence-driven approaches to state merging: they computed all possible merges in a red-blue framework (see Lang et al. 1998) and performed the one that passed the most statistical tests, which are computed using Hoeffding's bound as in Alergia. On the other hand, they exploited the fact that the test data was known, looking for a better strategy than the simple normalization to make the probabilities sum to 1 on the test set.
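The sketch below illustrates the kind of Hoeffding-bound compatibility test used in Alergia-style algorithms; it is a generic textbook version rather than team Llorens's implementation, and the recursion over successor states during a merge is omitted.

```python
import math

def hoeffding_compatible(f1, n1, f2, n2, alpha=0.05):
    """Two observed relative frequencies f1/n1 and f2/n2 are considered
    compatible when their difference stays below the Hoeffding bound."""
    bound = (math.sqrt(0.5 * math.log(2.0 / alpha))
             * (1.0 / math.sqrt(n1) + 1.0 / math.sqrt(n2)))
    return abs(f1 / n1 - f2 / n2) < bound

def states_compatible(counts1, total1, counts2, total2, alphabet, alpha=0.05):
    """Two states pass if the test holds for the halting event and for every
    symbol; a full merge would also recurse on the successor states."""
    return all(hoeffding_compatible(counts1.get(a, 0), total1,
                                    counts2.get(a, 0), total2, alpha)
               for a in list(alphabet) + ['HALT'])
```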

Analysis

The tree predicting the performance of team Llorens’s Alergia-based approach is depicted in Fig. 8. This tree is quite interesting because the state merging approach adopted by team Llorens is very different from the first two approaches. First of all, the root decision shows that their method performs best on target distributions with a small alphabet (<18). An interesting question is whether this can be linked to the known problems of state merging methods for non-probabilistic automata on large alphabets (Walkinshaw et al. 2012). Secondly, from this tree it is very clear that the type of generating distribution has a large effect on the performance. In particular, it confirms that learning a Dpfa works best when the generating distribution is a Dpfa. Interestingly, in the non-Dpfa case, it performs better on dense problems (transition sparsity >0.0795). This seems to indicate that learning dense non-deterministic distributions is easier (in terms of perplexity) than learning sparse ones, even when a deterministic model is learned.

Fig. 8  Decision tree predicting the performance (perplexity difference with the solution) of team Llorens given the parameters of a problem instance

5.4 Team Bailly

Team Bailly tackled the competition by using a spectral approach (see Bailly 2011). The main object that is manipulated is the Hankel matrix (Partington 1988), which represents the counts of every possible prefix-suffix pair. The core of the spectral technique is the factorization of the Hankel matrix, from which the parameters of a probabilistic model can be directly deduced.
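As an illustration of the first step of such a spectral approach, the sketch below builds an empirical Hankel matrix from prefix-suffix frequencies and inspects its singular values, whose decay suggests the number of states; the choice of prefix and suffix basis sets is left open, and the derivation of the model parameters from the factorization is omitted.

```python
import numpy as np
from collections import Counter

def hankel_singular_values(strings, prefixes, suffixes):
    """Build the empirical Hankel matrix H[u, v] = empirical probability of the
    concatenation u+v for the chosen prefix/suffix basis sets, and return its
    singular values (the numerical rank hints at the number of states)."""
    counts = Counter(tuple(x) for x in strings)
    total = sum(counts.values())
    H = np.zeros((len(prefixes), len(suffixes)))
    for i, u in enumerate(prefixes):
        for j, v in enumerate(suffixes):
            H[i, j] = counts[tuple(u) + tuple(v)] / total
    return np.linalg.svd(H, compute_uv=False)
```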

Analysis

Team Bailly approached the competition using a new and promising method for learning probability distributions over strings. This makes it all the more interesting to determine when their algorithm performs well. The PAutomaC data clearly shows when this is the case (see Fig. 9). Although their method performed well on many instances (32), and was leading the competition for a long time (see Sect. 4.2), its performance drops sharply on sparse problem instances (transition sparsity <0.0428). All methods have some trouble with sparse problems, but significantly less so than team Bailly's spectral approach. Future research is needed to determine exactly why the spectral approach has so much trouble with these instances; the PAutomaC data and generator remain available for this purpose.

Fig. 9  Decision tree predicting the performance (perplexity difference with the solution) of team Bailly given the parameters of a problem instance

5.5 Team Kepler

The approach applied by Kepler et al. (2012) uses n-gram models with variable length. n-grams are represented as a context tree that maps the probabilities of sequences of symbols. To shrink the state space while working with large n-grams, the context tree is pruned based on the Kullback-Leibler divergence. Experiments showed that this approach almost always achieves lower perplexity than the fixed 3-gram model on the PAutomaC training data. However, it is not clear how to define the maximum size of the n-gram or the pruning threshold value.
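A minimal sketch of this kind of pruning decision is shown below; the weighting by the context count and the threshold value are illustrative assumptions, not the exact criterion of Kepler et al.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL divergence between two next-symbol distributions given as dicts."""
    return sum(p[a] * math.log2(p[a] / max(q.get(a, 0.0), eps))
               for a in p if p[a] > 0)

def prune_context(child_dist, parent_dist, child_count, threshold):
    """Drop the longer context when its next-symbol distribution is close, in
    count-weighted KL divergence, to that of its parent (shorter) context."""
    return child_count * kl_divergence(child_dist, parent_dist) < threshold
```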

Analysis

Team Kepler's method shows the same behavior on sparse instances as team Bailly's, see Fig. 10. In addition, however, it has trouble learning distributions coming from generators with a small alphabet. Again, future research will need to point out why this happens. It is surprising that, although the relatively simple n-gram based approach adopted by team Kepler is very popular in practice, it did not perform as well as the other, more complex approaches to learning string distributions on the PAutomaC data.

Fig. 10  Decision tree predicting the performance (perplexity difference with the solution) of team Kepler given the parameters of a problem instance

6 Conclusion

We presented an overview of PAutomaC, the relevant literature on learning probabilistic automata, a brief explanation of the methods used during the competition, and an analysis of their results. The results of PAutomaC presented in this paper indicate that the competition was fruitful:

  • There were 5 active participating teams from around the world.

  • All participants used different (both old and new) methods and were stimulated to improve these. All methods performed much better than the provided baseline algorithms.

  • The PAutomaC data set provides a detailed comparison of the performance of each of these methods.

  • There is a clear winner, and interestingly, they used a method that is in practice not (yet) commonly applied when learning Pfa.

  • The results remain valid using different evaluation criteria.

  • Interesting conclusions can be drawn by analyzing the results.

In particular, the observation that team Llorens outperforms the winning team on the deterministic instances is very interesting for future research, as it could provide a method for deciding whether a given data sample is drawn from a deterministic distribution or from a non-deterministic one. This could be very useful during the discretization of data, for instance. Moreover, it would be very interesting to further investigate and hopefully improve the performance of the spectral and n-gram based approaches developed by team Bailly and team Kepler on sparse problem instances. Last but not least, new Gibbs sampling and EM/Baum-Welch methods have been developed for Pfa by team Shibata-Yoshinaka and team Hulden. Based on their excellent performance in PAutomaC, we encourage anyone interested in learning probability distributions over strings to use one of these methods. The developed Gibbs sampler performed consistently better in PAutomaC, but required much more computational resources. When the generating distribution is known to be deterministic, we advise a state-merging approach such as the one developed by team Llorens.