PAutomaC: a probabilistic automata and hidden Markov models learning competition
Approximating distributions over strings is a hard learning problem. Typical techniques involve using finite state machines as models and attempting to learn these; these machines can either be hand built and then have their weights estimated, or built by grammatical inference techniques: the structure and the weights are then learned simultaneously. The Probabilistic Automata learning Competition (PAutomaC), run in 2012, was the first grammatical inference challenge that allowed the comparison between these methods and algorithms. Its main goal was to provide an overview of the state-of-the-art techniques for this hard learning problem. Both artificial data and real data were presented and contestants were to try to estimate the probabilities of strings. The purpose of this paper is to describe some of the technical and intrinsic challenges such a competition has to face, to give a broad state of the art concerning both the problems dealing with learning grammars and finite state machines and the relevant literature. This paper also provides the results of the competition and a brief description and analysis of the different approaches the main participants used.
Keywords: Grammatical inference, Probabilistic automata, Hidden Markov models, Programming competition
This paper describes the PAutomaC probabilistic automaton learning competition and provides an overview of the relevant literature on this topic. PAutomaC was an on-line challenge that took place in 2012 at http://ai.cs.umbc.edu/icgi2012/challenge/Pautomac/. The goal of PAutomaC was to provide an overview of which probabilistic automaton learning techniques work best in which setting and to stimulate the development of new techniques for learning distributions over strings. Many probabilistic automata learning methods have been produced in the past (see Sect. 2 for an overview). Most of these focus on deterministic probabilistic automata (Dpfa), where only the symbols are drawn from probability distributions but the transitions are uniquely determined given the generated symbol. There exist some exceptions, however, which aim to learn hidden Markov models (Baum 1972), probabilistic residual automata (Esposito et al. 2002), and multiplicity automata (Denis et al. 2006). Another important approach is to learn Markov chains or n-grams by simply counting the occurrences of sub-strings (Saul and Pereira 1997; Ney et al. 1997; Jelinek 1998). These simple counting methods have been very successful in practice (Brill et al. 1998).
Although many methods have been proposed, there has been so far no thorough investigation of which model/algorithm is likely to perform best, why and when. Knowledge about this would be very helpful to scientists/practitioners faced with a data set made of strings and the problem of finding a likely distribution over these strings. PAutomaC aimed to fill this knowledge gap by providing the first elaborate test-suite for learning string distributions.
In addition to being very helpful for applications of automata learning methods, PAutomaC was designed in such a way that it provided directions for future theoretical work and algorithm development. For instance, unlike previous automata learning competitions (see Sect. 2.4 for details), in PAutomaC, the type of automaton device was not fixed: learning problems were generated using automaton models of increasing complexity. This is not only very useful for practical applications (where many different types of distributions can be encountered), but also aims to answer the interesting question of whether it is best to learn a non-deterministic model (e.g. Hmm) or a deterministic model (e.g. Dpfa) when the data is drawn from a (non-)deterministic distribution, as described for instance in the work of Gavaldà et al. (2006). PAutomaC also encouraged the development and use of new techniques from machine learning that do not build an automaton structure, but do result in a string distribution. Therefore, the actual structures of the learned automata were not evaluated in PAutomaC. Instead, the performance of the different algorithms was assessed only on the quality of the resulting string distribution. Like previous automaton learning competitions, this evaluation was performed on-line using a test set and an evaluation oracle running on the competition server. Consequently, the participants could use the observed performance (and that of the competition) to update their algorithms.
The competition setup in PAutomaC contained some novel elements that may also be of interest for competitions of other (string) distribution learning algorithms. Above all, in PAutomaC the performance was evaluated using the actual probabilities assigned by a learned distribution, instead of the more traditional method of evaluating its predictive performance. This has the advantage of not only testing whether the high probability events are assigned the largest probabilities, but also whether the low probability events are assigned the correct low probabilities. Furthermore, the actual strings that were being used for this evaluation were given to the participants beforehand.
The traditional approach to compare language models, which had also been considered for PAutomaC, is to test the learned model over some unseen data. Perplexity (Cover and Thomas 1991) is the usual measure and, in order to perform well on such a metric, it is necessary to learn a smoothed model, in which a non-null probability is assigned to all strings (the penalty is infinite otherwise). Experience shows that in that case, the smoothing method may become preponderant: the quality of the model can rely mainly on the smoothing. Another issue with such an evaluation task is that the model has to be checked somehow for consistency, since the probabilities of all possible strings must sum up to one.
Since the goal of PAutomaC was to compare learning algorithms (and not smoothing algorithms), a different protocol was chosen: the teams knew the test set in advance, and part of the problem for them consisted in reassigning the probability mass that the learned model gave to strings absent from the test set to the strings inside this set. In this way, a perplexity-like evaluation measure could be used to evaluate the differences in the probabilities assigned to different strings from the test set. A couple of possible dangers of this protocol were identified by the PAutomaC Scientific Committee and, later, by the participants. A first one was that the extra information in the test set (which was also randomly drawn from the unknown target distribution) could be used to learn. A second danger came from the fact that the teams could submit various solutions to the same problem (with no feedback about their score, but knowing their overall standing): this could have allowed some hill-climbing strategy. Both the Scientific Committee’s analysis and the attempts by some participants showed that the PAutomaC evaluation process was resistant to these strategies: the winning team was in fact the one that submitted the fewest times. We detail in this paper the choices that were made to handle these dangers.
As main contributions of this paper we provide an overview of the literature on probabilistic automaton learning, and describe PAutomaC including its design issues and solutions. The results of the competition and the approach followed by the main participants are also provided. There is a clear winner to PAutomaC: a novel collapsed Gibbs sampling method for Pfa developed by team Shibata-Yoshinaka. As it is not common to use such a method when learning distributions over strings, we hope and expect this result will influence what people will use in practice. In addition to having an appealing winner, we can draw several interesting conclusions by analyzing the results. In particular, it can be observed that the Alergia-based method developed by team Llorens outperforms the winning team on the deterministic instances. This provides some additional insights into the important question of whether it is better to learn deterministic or non-deterministic models and can serve as an important starting point for further research on this topic. Furthermore, we analyze the PAutomaC results with the goal of determining which method works best, when, and why. Our analysis indicates the problem areas for each of the methods used, which forms a basis for future studies and hopefully further improvements of these methods. Last but not least, all methods developed by the participating teams significantly outperform the provided baseline algorithms, clearly demonstrating the need for developing and evaluating (new) methods for learning string distributions.
This paper is organized in six sections: introduction (Sect. 1), motivations and history (Sect. 2), an overview of PAutomaC (Sect. 3), final results (Sect. 4), a brief description and analysis of the approaches used by main participants (Sect. 5), and a conclusion (Sect. 6).
2 Motivations and history
We assume the reader to be familiar with the theory of languages and automata (Sudkamp 2006), their probabilistic counterparts such as hidden Markov models (Rabiner 1989), and basic concepts from computational complexity (Sanjeev and Boaz 2009), computational learning theory (Kearns and Vazirani 1994), and information theory (Cover and Thomas 1991). For more information on these topics the reader is referred to the corresponding references.
2.1 Why learn a probabilistic automaton?
model Dna or protein sequences in bioinformatics (Sakakibara 2005),
find patterns underlying different sounds for speech processing (Tzay 1994),
infer morphological or phonological rules for natural language processing (Gildea and Jurafsky 1996),
model unknown mechanical processes in physics (Shalizi and Crutchfield 2001),
discover the exact environment of robots (Rivest and Schapire 1993),
detect anomalies for intrusion detection in computer security (Milani Comparetti et al. 2009),
discover the structure of music styles for music classification/generation (Cruz-Alcázar and Vidal 2008).
In all such cases, an automaton model is learned from observations of the system, i.e., a finite set of strings. Usually, the data gathered from observations is unlabeled, that is to say that it is often possible to observe only strings that can be generated by the system, and strings that cannot be generated are thus unavailable. The standard method of dealing with this situation is to assume a probabilistic automaton model, i.e., a distribution over strings. In such a model, different states can generate different symbols with different probabilities. The goal of automata learning is then one of model selection (Grünwald 2007): find the probabilistic automaton model that gives the best fit to the observed strings, i.e., that is most likely to have generated the data. In addition to the data probability, this implies that the model size has to be taken into account in order to avoid over-fitting. Otherwise, the model that generates only the seen strings and whose probabilities correspond to the observed frequency perfectly achieves the goal. But this naive model is of little use: it assigns null probability to all unseen strings and therefore makes no generalization.
2.2 Which probabilistic automata to learn?
Pfa (Paz 1971) are non-deterministic automata in which every state is assigned an initial and a halting probability, and every transition is assigned a transition probability (weight). The sum of all initial probabilities equals 1, and for each state, the sum of the halting and all outgoing transition probabilities equals 1. A Pfa generates a string by first drawing a start state at random from the initial state distribution. In the current state it then either halts or executes an outgoing transition, chosen at random according to the halting and transition probabilities; if it executes a transition, it generates the transition symbol, moves to the target state, and repeats. A study of these automata can be found in Vidal et al. (2005, 2005b).
Hidden Markov models (Hmms) (Rabiner 1989; Jelinek 1998) are Pfa (as described in the previous paragraph) where the symbols are emitted at the states instead of at the transitions, which are only used to move between states. Initial probabilities are assigned to each state but there are no final probabilities; an Hmm therefore defines a distribution over $\Sigma^n$ for each value of $n$. In order to obtain a distribution over $\Sigma^*$, a special halting symbol or state can be introduced. With such an addition an Hmm generates strings like a Pfa.
Interestingly, although Hmms and Pfa are commonly used in distinct areas of research, they are equivalent with respect to the distributions that can be modeled: an Hmm can be converted into a Pfa and vice-versa (Vidal et al. 2005; Dupont et al. 2005). Though it is easy to randomly generate strings from these models, determining the probability of a given string is more complicated because different executions can result in the same string. For both models, this probability can be computed exactly by dynamic programming using variations of the Forward (or Backward) algorithm (Baum et al. 1970). However, estimating the most likely parameter values (probabilities) for a given set of strings and a given model (maximizing the likelihood of the model given the data) cannot be solved optimally unless RP equals NP (Abe and Warmuth 1992). The traditional method of dealing with this problem is the greedy Baum-Welch algorithm (Baum et al. 1970).
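The Forward computation mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not code from the competition: the dictionary-based encoding of the Pfa and the two-state example machine are assumptions made here for the sake of the example.

```python
from collections import defaultdict


def string_probability(string, initial, final, transitions):
    """Forward algorithm: probability that the Pfa generates `string` and halts.

    initial:     dict state -> initial probability
    final:       dict state -> halting probability
    transitions: dict (state, symbol) -> list of (next_state, probability)
    (This encoding is an assumption made for the sketch.)
    """
    # forward[q] = probability of having generated the prefix read so far
    # and currently being in state q, summed over all executions.
    forward = dict(initial)
    for symbol in string:
        updated = defaultdict(float)
        for state, mass in forward.items():
            for target, prob in transitions.get((state, symbol), []):
                updated[target] += mass * prob
        forward = updated
    # The string is only produced if the machine halts after the last symbol.
    return sum(mass * final.get(state, 0.0) for state, mass in forward.items())


# Hypothetical two-state Pfa over the alphabet {a, b}.
# In each state, halting plus outgoing transition probabilities sum to 1.
initial = {0: 1.0}
final = {0: 0.2, 1: 0.5}
transitions = {
    (0, "a"): [(0, 0.3), (1, 0.5)],  # non-deterministic on "a"
    (1, "b"): [(0, 0.5)],
}
```

For the string "a" the two executions (staying in state 0, or moving to state 1) are summed, which is exactly the situation that makes a naive path-by-path computation expensive for longer strings.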
The deterministic counterpart of a Pfa is a deterministic probabilistic finite automaton (Dpfa) (Carrasco and Oncina 1994). These have been introduced essentially for efficiency reasons: in the non-probabilistic case, learning a Dfa is provably easier than learning a Nfa (de la Higuera 2010). However, although non-probabilistic deterministic automata are equivalent to non-probabilistic non-deterministic automata in terms of the languages they can generate, it is shown in Vidal et al. (2005, 2005b), Dupont et al. (2005) that Dpfa are strictly less powerful than Pfa. Furthermore, distributions generated by Pfa cannot be approximated by Dpfa unless the size of the Dpfa is allowed to be exponentially larger than that of the corresponding Pfa (Guttman et al. 2005, 2006). There is a positive side to this loss in power: estimating the parameter values of a Dpfa is easy, and there exist algorithms that learn a Dpfa structure in a probably approximately correct (Pac) like setting (Clark and Thollard 2004). This is not known to be possible for Pfa or Hmms. For Pfa it has only been shown that they are strongly learnable in the limit (Denis and Esposito 2004), or Pac-learnable (under some restrictions) using a (possibly exponentially larger) Dpfa structure (Gavaldà et al. 2006).
2.3 How to learn a probabilistic automaton?
Early work concerning the learning of distributions over strings can be found in Horning (1969) and Angluin (1988). In the first case, the goal was to learn probabilistic context-free grammars; in the second, convergence issues concerning identification in the limit with probability 1 are studied. Although these initial studies were done decades ago, only three techniques have become mainstream for learning Pfa, Hmms, and Dpfa.
The first family of techniques takes a standard structure or architecture for the machine, typically a complete graph, and then tries to find parameter settings that maximize the likelihood of the model given the data. If the structure is deterministic, the optimization problem is quite simple: transition probabilities can be estimated by maximum likelihood (Wetherell 1980). If not, the standard method is the Baum-Welch algorithm (Baum et al. 1970; Baum 1972), which iteratively computes a new estimate for the transition probabilities using the probabilities assigned to the input data. Although this technique is known to be sensitive to initial probabilities and may get stuck in a local optimum, it has frequently been applied successfully in practice.
The second family of techniques corresponds to Bayesian methods such as Gibbs sampling (Gelfand and Smith 1990), see, e.g., Neal (2000), Gao and Johnson (2008). Instead of learning a single model (a point estimate), these methods aim to make predictions using the joint distribution formed by all possible models. This joint distribution is hard to compute and an Hmm Gibbs sampler estimates it by iteratively sampling the visited hidden states conditioned on earlier samples of all other state visits. The stationary distribution of the thus formed Markov chain is exactly this joint distribution. Although these methods are not yet commonplace for Pfa, we believe this is likely to change after this competition.
Improvements of Ron et al. (1995) based on the concept of distinguishable states have been developed (Thollard and Clark 2004; Palmer and Goldberg 2005; Guttman 2006; Castro and Gavaldà 2008). An incremental version also exists (Gavaldà et al. 2006).
Recently, they have been extended to learn not only the distribution over strings of events/symbols but also over their timing behaviors (Verwer et al. 2010) and from a continuous stream of data instead of a data set (Balle et al. 2012).
Esposito et al.’s (2002) approach has consisted in learning probabilistic residual finite state automata based on the identification of the residuals of a rational language. These are the probabilistic counterparts to the residual finite state automata introduced by Denis et al. (2000, 2001).
Denis et al. (2006) and Habrard et al. (2006) introduced the innovative algorithm Dees that learns a multiplicity automaton (the weights can be negative but in each state the weights sum to one) by iteratively solving equations on the residuals.
Other algorithms learning multiplicity automata have been developed, using common approaches in machine learning such as recurrent neural networks (Carrasco et al. 1996), Principal Component Analysis (Bailly et al. 2009) or a spectral approach (Bailly 2011).
Most of these methods estimate the model parameters based on maximum likelihood. This can cause problems when computing probabilities, especially for strings of low frequency. For some of these methods, therefore, smoothing methods have been developed that adjust the maximum likelihood estimate in order to hopefully overcome these difficulties (Chen and Goodman 1996). Typically, these smoothing methods assign larger probabilities to infrequent strings, and consequently, less to more frequent ones. For n-gram learning, smoothing is very often used and sophisticated methods such as back-off smoothing exist (Zhai and Lafferty 2004). For Dpfa learning, smoothing techniques can be found in Dupont and Amengual (2000), Thollard (2001), Habrard et al. (2003). Smoothing Pfa and Hmms is still a question requiring further research.
In conclusion, many algorithms for learning probabilistic automata have been produced. Due to the difficulty of the learning problem, most of them focus on some form of Dpfa. Another important approach is to learn Markov chains or n-grams by simply counting the occurrences of sub-strings. As already stated, these simple methods have been very successful in practice (Brill et al. 1998). When one is faced with a data set made of strings and one needs to find a likely distribution over these strings for tasks such as prediction, anomaly detection, or modeling, it would be very helpful to know which model is likely to perform best and why. Due to the lack of a thorough test of all of these techniques, this is currently an open question. Furthermore, the fact that all known algorithms are of the greedy type, together with the recent successes of search-based approaches for non-probabilistic automaton learning (Heule and Verwer 2010; Hasan Ibne et al. 2010), makes one wonder whether search-based strategies are also beneficial for probabilistic automaton learning. The Probabilistic Automaton learning Competition (PAutomaC) aims to answer these questions by providing an elaborate test-suite for learning string distributions.
2.4 About previous competitions
The first grammatical inference competition was organized in 1999. The participants of Abbadingo (http://abbadingo.cs.nuim.ie) had to learn Dfa of sizes ranging from 64 to 512 states from positive and negative data, strings over a two letter alphabet.
A follow-up was the Gowachin system (http://www.irisa.fr/Gowachin/), developed to generate new automata for classification tasks; it introduced the possibility of having a certain level of noise.
The Omphalos competition (http://www.irisa.fr/Omphalos/) involved learning context-free grammars, given samples which in certain cases contained both positive and negative strings, and in others, just text.
In the Tenjinno competition, the contestants had to learn finite state transducers (http://web.science.mq.edu.au/tenjinno/).
The Gecco conference organized a competition involving learning Dfa from noisy samples (http://cswww.essex.ac.uk/staff/sml/gecco/NoisyDFA.html).
The Stamina competition (http://stamina.chefbe.net/), organized in 2010, also involved learning Dfa, but new methods were used that made it possible to solve even harder problems.
The Zulu competition (http://labh-curien.univ-st-etienne.fr/zulu/) concerned the task of actively learning Dfa through requests to an oracle.
The Rers Grey Box Challenge (http://leo.cs.tu-dortmund.de:8100/isola2012) aimed to discover the complementary values of white-box and black-box software system analysis techniques, including tools for learning finite state machines.
More generally, a number of other machine learning competitions have been organized during the past years. A specific effort has been made by the Pascal network (http://pascallin2.ecs.soton.ac.uk/Challenges/).
3 An overview of PAutomaC
The goal of PAutomaC was to provide an overview of which probabilistic automaton learning techniques work best in which setting and to stimulate the development of new techniques for learning distributions over strings. In order to stimulate this development, PAutomaC was set up using an oracle server that was able to evaluate the submissions by participants on-line. Furthermore, in contrast to the traditional methods used to evaluate predictive machine learning algorithms, the performance in PAutomaC was evaluated using the actual probabilities assigned by a learned distribution.
Two types of data were available: artificial data, and real-world data donated by researchers and industries. We have to admit, however, that the latter turned out to be of little interest in the context of the competition. The problem came from the fact that not knowing the target probabilities implies a biased way to evaluate them. We chose to use 3-grams trained on the complete data sets to fix these probabilities, hoping that the induced bias would be drastically reduced since the competition sets consisted of less than 10% of these data. Unfortunately, this goal was not achieved since the participants who scored the best on these data sets used n-grams (even when they were using more complex approaches on the artificial data sets). We will thus not discuss the real-world data sets in the rest of this paper (detailed information is available on the website).
In this section, we first describe the way the target automata were generated. We then turn our attention to how the submissions of the participants were evaluated. Finally, we discuss the choices made throughout this process.
3.1 Generating artificial data
Artificial data was generated by building random probabilistic machines with 5 to 75 states and with an alphabet consisting of 4 to 24 symbols (both inclusive, and decided uniformly at random). These machines were subsequently used to generate data sets. Of all possible state-symbol pairs that could occur in transitions, between 20 and 80 percent (the symbol sparsity) of them were generated. These pairs were selected by first choosing a state at random, and subsequently choosing a symbol from the set of symbols that had not yet been selected for that state. This created a selection without replacement from the set of all possible state-symbol pairs that was modified to remain uniform over the states. This modification made it less likely that the resulting symbols were evenly distributed over the states. For every generated state-symbol pair, one transition was generated to a randomly chosen target state. Between 0 and 20 percent (the transition sparsity) transitions were generated in addition to these, selected without replacement from the set of possible transitions, modified to remain uniform over the source states and transition labels.
Initial and final states were selected without replacement until the percentages of selected states exceeded the transition and symbol sparsities, respectively. All initial, symbol, and transition probabilities were drawn from a Dirichlet distribution with concentration parameters set to 1 (making every probability distribution equally likely). The final probabilities were drawn together with the symbol probabilities.
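The core of this generation procedure can be sketched as follows. This is a deliberately simplified illustration under stated assumptions, not the generator used for the competition: it omits the additional transitions controlled by the transition sparsity, gives every state an initial and a halting probability instead of selecting initial and final states by sparsity, and performs no consistency check. The function names and the data layout are hypothetical.

```python
import random


def dirichlet_uniform(k, rng):
    """Sample from a Dirichlet distribution with all k concentration parameters at 1.

    Normalizing k independent Exp(1) draws gives a point that is uniformly
    distributed over the probability simplex.
    """
    draws = [rng.expovariate(1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def random_pfa(n_states, n_symbols, symbol_sparsity, rng):
    """Simplified sketch of the structure generator; returns (initial, final, transitions)."""
    n_pairs = max(n_states, round(symbol_sparsity * n_states * n_symbols))
    unused = {q: list(range(n_symbols)) for q in range(n_states)}
    symbols_of = {q: [] for q in range(n_states)}

    def add_symbol(q):
        symbols_of[q].append(unused[q].pop(rng.randrange(len(unused[q]))))

    # Every state first receives one symbol, so that no state is a sink.
    for q in range(n_states):
        add_symbol(q)
    # Then pick a state uniformly at random and give it a not-yet-used symbol.
    while sum(len(s) for s in symbols_of.values()) < n_pairs:
        q = rng.randrange(n_states)
        if unused[q]:
            add_symbol(q)

    final, transitions = {}, {}
    for q, symbols in symbols_of.items():
        # Halting is treated as one extra "symbol" drawn together with the others.
        probs = dirichlet_uniform(len(symbols) + 1, rng)
        final[q] = probs[0]
        for symbol, prob in zip(symbols, probs[1:]):
            # One transition per state-symbol pair, to a random target state.
            transitions[(q, symbol)] = (rng.randrange(n_states), prob)
    initial = dict(enumerate(dirichlet_uniform(n_states, rng)))
    return initial, final, transitions


initial, final, transitions = random_pfa(5, 4, 0.5, random.Random(0))
```

With 5 states, 4 symbols and a symbol sparsity of 0.5, the sketch produces 10 of the 20 possible state-symbol pairs, and in every state the halting and transition probabilities sum to 1.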
From such a structure, one training set (with repetitions) and one test set (without repetitions) were generated from every target. With probability 1/4 the generated training set was of size 100 000; otherwise it was of size 20 000. New test strings were generated using the target machine until 1 000 unique strings had been generated. The test strings were allowed to overlap with the strings used for training. If the average length of the generated strings was less than 5 or greater than 50, a new automaton and new data sets were generated using the same construction parameters. In total, 150 models and corresponding training and test sets were generated in this way. We evaluated the difficulty of the generated sets using a 3-gram baseline algorithm: the problem was considered easy if the baseline output was close to the target (a perplexity difference of less than 1.0), and difficult otherwise. We then selected 16 of them, aiming to cover a range of values for the number of states, the size of the alphabet, the sparsity values, and the difficulty. We applied the same procedure for Dpfa but without generating additional transitions; and for Hmms, we generated state-state pairs instead of state-symbol-state triples.
Notice that this measure is equivalent to the well-known Kullback-Leibler (KL) divergence (Kullback and Leibler 1951). Indeed, given two distributions $P$ and $Q$, the KL divergence is defined as $KL(P,Q)=\sum_x P(x)\log_2\frac{P(x)}{Q(x)}$, which can be rewritten as $KL(P,Q)=\left(-\sum_x P(x)\log_2 Q(x)\right)-H(P)$, where $H(P)$ is the entropy of the target distribution. $H(P)$ is constant in our case since the aim is to compare various candidate distributions $Q$. As we were only interested in the divergence on a given test set $S$, the only varying element of the KL divergence is $-\sum_{x\in S} P(x)\log_2 Q(x)$, which is equivalent to our measure up to a monotonic transformation.
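A perplexity-like score built on this cross-entropy term could be computed as sketched below. The exact formula used by the competition server is not reproduced in this section, so the form $2^{-\sum_{x\in S} P(x)\log_2 Q(x)}$, the normalization of both distributions over the test set, and the function name are assumptions made for illustration.

```python
import math


def perplexity_score(target_probs, candidate_probs):
    """Assumed score: 2 ** (-sum_x P(x) * log2 Q(x)) over the test set.

    Both lists give one probability per test string, in the same order, and
    are normalized here so that each sums to 1 over the test set.
    """
    p_total = sum(target_probs)
    q_total = sum(candidate_probs)
    cross_entropy = 0.0
    for p, q in zip(target_probs, candidate_probs):
        if q <= 0.0:
            return math.inf  # a zero probability yields an infinite penalty
        cross_entropy -= (p / p_total) * math.log2(q / q_total)
    return 2.0 ** cross_entropy


# Hypothetical target probabilities for a three-string test set.
target = [0.5, 0.25, 0.25]
uniform = [1.0, 1.0, 1.0]  # unnormalized; normalization is applied inside
```

Under these assumptions the score is minimal when the candidate equals the target, where it reduces to $2^{H(P)}$, and any candidate that assigns zero probability to a test string receives an infinite score.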
To decide the final overall rank of each participant, points were attributed for each data set: the leader of a problem at the end of the competition scored 5 points, the second 3, the third 2 and the fourth 1. In case of a tie on a problem (based on the first 10 digits of the perplexity score), the earliest submission won. The winner was the participant whose overall score was the highest. There was no restriction on the number of submissions a given participant could provide, but no feedback was given about the resulting perplexity. To compute the final score of a participant, only the best submission to each problem was considered.
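The point scheme can be illustrated with a short sketch. The data layout, the team names, and the use of rounding to 10 decimal places as a stand-in for "the first 10 digits" are assumptions made here, not details taken from the competition server.

```python
def overall_scores(best_results):
    """Total points per team.

    best_results: dict problem -> dict team -> (perplexity, submission_time),
    holding only the best submission of each team on each problem.
    """
    points = [5, 3, 2, 1]
    totals = {}
    for results in best_results.values():
        # Lower perplexity is better; ties go to the earliest submission.
        ranking = sorted(
            results,
            key=lambda team: (round(results[team][0], 10), results[team][1]),
        )
        # zip stops after four teams, so only the top four score points.
        for team, pts in zip(ranking, points):
            totals[team] = totals.get(team, 0) + pts
    return totals


# Hypothetical results on two problems.
example = {
    "problem_1": {"A": (30.1, 5), "B": (30.1, 2), "C": (31.0, 1)},
    "problem_2": {"A": (10.0, 1), "B": (12.0, 1)},
}
totals = overall_scores(example)
```

In the example, teams A and B tie on the first problem and B takes the 5 points because it submitted earlier.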
3.3 Discussion on the design of the competition
When organizing an on-line competition, one has to make various choices about the generation of data and the evaluation of the participant submissions. We described above what was done for PAutomaC but we feel that the choices that were made have to be discussed. What follows thus contains arguments about the validity of our approach and therefore of the results of the competition.
As already stated, we used a Dirichlet distribution for sampling the output probabilities. The main advantage of this method is that every possible distribution is equally likely when sampled using a Dirichlet distribution (with concentration parameters set to 1). Notice that this does not happen when every output probability is iteratively sampled uniformly at random. Since we did not intend to bias the distribution in PAutomaC towards certain types of distributions, using the Dirichlet distribution seemed the logical choice.
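The difference between the two sampling schemes can be checked empirically. The sketch below is an illustration under an assumption about what "iteratively sampled uniformly at random" means: each probability is drawn uniformly from the mass that remains, and the last one takes whatever is left. The function names are hypothetical.

```python
import random


def dirichlet_uniform(k, rng):
    """Dirichlet with all concentration parameters 1, via normalized Exp(1) draws."""
    draws = [rng.expovariate(1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def sequential_uniform(k, rng):
    """Assumed naive scheme: draw each probability uniformly from the remaining mass."""
    remaining, probs = 1.0, []
    for _ in range(k - 1):
        p = rng.uniform(0.0, remaining)
        probs.append(p)
        remaining -= p
    probs.append(remaining)
    return probs


def mean_first_component(sampler, k, n_samples, seed=0):
    """Average value of the first probability over n_samples draws."""
    rng = random.Random(seed)
    return sum(sampler(k, rng)[0] for _ in range(n_samples)) / n_samples
```

For three outcomes, the Dirichlet sampler gives every component an average of about 1/3, whereas the sequential scheme gives the first component an average of about 1/2, which shows that it favors some distributions over others.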
If we were to sample all output probabilities from a Dirichlet distribution unconstrained, however, we would obtain a very densely connected Pfa with high probability. Such densely connected automata are uninteresting from a learning perspective: a simple one-gram will already reach a close to optimal perplexity. We therefore constrained this sampling using symbol and transition sparsity values. These two values were preselected and the generated Pfa was then forced to match these sparsity values. Afterwards, we sampled the transition probabilities for every state using a Dirichlet distribution.
The Pfa structure generator worked by iteratively adding new transitions until the preselected sparsity values were reached. This selection remained uniform over all states, lowering the probability that every state gets assigned the same number of symbols and transitions. The generator was initialized by adding to every state one random symbol and one random transition for that symbol. This avoided the generation of states with a final probability of 1.0, i.e., sink states. This was done because we aimed for the final probability generation to be independent of the structure generation.
The final probability of each state was handled as the emission of a special symbol: this allowed a simple normalization process and did not influence the bias over distributions since their values were sampled together with the output probabilities using a Dirichlet distribution. Together with the consistency test (see below), this ensured that the generated machines corresponded to a proper distribution (probabilities over all possible strings summed to 1). The selection of which states had final probabilities, however, was performed independently of the process used to select output transitions. This ensured that having more output symbols does not lead to lower final probabilities.
An important step took place directly after the generation of a target. It consisted in checking that all states were reachable from an initial state and that they were all co-accessible, i.e., that a final state could be reached from each of them. Indeed, verifying the consistency of the machine ensured that we did not have a path (and thus a probability mass) that reached a part of the machine that never led to an accepting state. In addition, we checked that the generated probability distribution did not give too much weight to long or short strings. Although this created some bias in the generation procedure, it was unavoidable because testing the different methods on instances that are too difficult or uninteresting makes no sense.
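The reachability and co-accessibility test amounts to two graph searches, one forward from the initial states and one backward from the final states. The following sketch assumes the machine is given as a successor map that contains only transitions with non-zero probability; the names are hypothetical and this is not the competition's own code.

```python
def reachable(starts, edges):
    """States reachable from `starts` along `edges` (dict state -> set of successors)."""
    seen, stack = set(starts), list(starts)
    while stack:
        state = stack.pop()
        for successor in edges.get(state, ()):
            if successor not in seen:
                seen.add(successor)
                stack.append(successor)
    return seen


def is_consistent(states, initial_states, final_states, edges):
    """True if every state is reachable from an initial state and co-accessible."""
    # Co-accessibility is reachability from the final states in the reversed graph.
    reversed_edges = {}
    for source, targets in edges.items():
        for target in targets:
            reversed_edges.setdefault(target, set()).add(source)
    accessible = reachable(initial_states, edges)
    co_accessible = reachable(final_states, reversed_edges)
    return set(states) <= accessible and set(states) <= co_accessible


# State 2 only loops on itself, so it can never reach the final state 1.
bad_edges = {0: {1, 2}, 2: {2}}
good_edges = {0: {1, 2}, 2: {1}}
```

In the first example machine, any probability mass entering state 2 would be lost, which is exactly the situation the check was designed to rule out.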
As already stated, the choice of an evaluation function that does not rely on a particular type of machine was a fundamental requirement of PAutomaC. A perplexity measure had the advantage of being a widely accepted way to compare distributions, and its link with the KL divergence was clearly a plus. Though we did not inform the participants about it, we also computed two other evaluation functions for each submission: the max-norm (maximal difference between the submitted probabilities and the target ones) and the sum-norm (the sum of the differences between the submitted and the target probabilities). While on a few problems the ranking of the participants differed slightly from the one obtained with the official perplexity measure, the overall ranking of the teams was the same.
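Both auxiliary measures are straightforward to compute. The sketch below assumes that "difference" means absolute difference and that both lists give the probabilities of the test strings in the same order; neither detail is stated explicitly above.

```python
def max_norm(target_probs, submitted_probs):
    """Largest absolute difference between target and submitted probabilities."""
    return max(abs(p - q) for p, q in zip(target_probs, submitted_probs))


def sum_norm(target_probs, submitted_probs):
    """Sum of the absolute differences between target and submitted probabilities."""
    return sum(abs(p - q) for p, q in zip(target_probs, submitted_probs))


# Hypothetical probabilities for a three-string test set.
target = [0.5, 0.3, 0.2]
submitted = [0.4, 0.4, 0.2]
```

Unlike perplexity, both norms stay finite when a submission assigns zero probability to a test string.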
A common issue when dealing with string distributions is smoothing. When using perplexity as a measure, smoothing becomes necessary because strings with zero probability obtain an infinite KL divergence when compared to the target (or any other non-zero assigning distribution) and thus an infinite perplexity. Although smoothing can be very beneficial in practice, we feel that the standard perplexity measure is too dependent on smoothing (compared to the max-norm, for instance) and therefore that a perplexity evaluation based on an unseen test set does not properly measure the quality of the string distribution. In PAutomaC we therefore decided to provide the participating teams with knowledge of the actual strings used to compute the perplexity measure. This removed the need for specialized smoothing methods since the participants could simply use a minimum value for the probability assigned to any string.
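Since the test strings are known, a minimum value can stand in for a dedicated smoothing method. The sketch below shows one way to do this; the floor value `1e-10` and the function name are arbitrary assumptions for illustration.

```python
def floor_and_normalize(probs, min_prob=1e-10):
    """Replace probabilities below `min_prob` and renormalise over the test set."""
    floored = [max(p, min_prob) for p in probs]
    total = sum(floored)
    return [p / total for p in floored]

# A model that assigns zero probability to the third test string.
raw = [0.6, 0.4, 0.0]
safe = floor_and_normalize(raw)
```

After flooring, no test string has zero probability, so the perplexity of the submission is finite.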
A usual problem with on-line competitions is the possibility of collusion. Indeed, a set of test data has to be made available to participants in order to evaluate the performance of their algorithm with respect to a given target. But if this set contains information about the target, then it can be used during the learning phase and may bias the results. In a competition where the targets are not stochastic devices, this problem is usually tackled by requiring that elements of the test set do not occur in the train set (though they are generated by the same process). But this cannot be ensured when the aim is to learn a distribution, as both sets have to be generated using the target: erasing elements of the test set that occur in the train set introduces a significant bias in the distribution of the test set. We therefore chose to keep these elements, expecting that the difference in size between the train and test sets sufficed to make the information contained in the test set useless.
But collusion can also result from the fact that the test set by itself contains information about the target distribution: duplicate strings are likely to be frequent in the target distribution. This is why we decided to remove redundant elements of the test sets, creating a small bias in the distribution of these sets. However, since the actual target distribution was used during the evaluation, and thanks to the choices made for this phase, this did not result in a bias or other problems during evaluation.
4.1 Competition activity
During the competition phase, the website received 724 visits (with a maximum of 54 on the last day of the competition) from 196 unique visitors, with an average visit duration of a bit more than 5 minutes. IP addresses from 37 countries were detected, of which 14 countries accounted for 5 or more visitors each.
4.2 Overall results
The final scores can be seen in Fig. 4 and detailed results are presented in Table 1 (available in the Appendix). There is a clear winner of PAutomaC: team Shibata-Yoshinaka. Of all participants, they obtained the best perplexity values on most instances and performed well on all others. This result is validated by the computation of other performance indicators (the max-norm and sum-norm). From Table 1 it can be observed that the method implemented by team Shibata-Yoshinaka really works well for all of the competition problems: the difference between the perplexity values of the solutions and their submissions was never greater than 0.1. Furthermore, this difference was even smaller on the instances with 100 000 strings, indicating that they made good use of additional data.
4.3 Analysis of the results
In PAutomaC, the different approaches were tested on problem instances with a broad range of parameter values and coming from different probabilistic automaton models (see Table 2 in the Appendix). This makes it possible to perform some additional analysis of the results with the goal of discovering when each method works best and trying to understand why. Tables 1 and 2 (both in the Appendix) clearly show that team Shibata-Yoshinaka is only outperformed on the (nearly) deterministic instances (Dpfa, or Pfa/Hmm with a small transition sparsity). On these instances team Llorens performs slightly better. Team Hulden’s method also obtains the best perplexity values on two instances, and actually beats team Llorens’s overall performance by just 2 points (rightmost points in Fig. 4). Their method seems to perform best on dense instances with few states. The methods used by team Bailly and team Kepler have some difficulties with very sparse instances (and thus also with Dpfa), and perform well but not best on the other instances.
Predicting the winner given the problem instance parameter values.
Predicting whether a deterministic distribution was used to generate the problem instance given the winner.
5 The different approaches and individual results
A wide spectrum of learning approaches was used during the competition. In this section we describe those of the main participants (those who scored at least one point) and provide a short analysis of their performance in PAutomaC. This section is the result of in-depth discussions and electronic exchanges the authors had with the different teams. However, the overview presented here is superficial and the reader is therefore referred to the original paper describing each team’s work.
5.1 Team Shibata-Yoshinaka
In the actual implementation, they fixed the number of iterations of CGS and sampling points a priori. The values of N and β were determined by 10-fold cross validation amongst a finite number of candidates. Finding good settings for these values required quite some computational resources.
5.2 Team Hulden
(1) A basic “baseline” n-gram strategy with smoothing.
(2) Another “baseline” n-gram strategy without smoothing, but using interpolated test data.
(3) The construction of a fully connected Pfa inferred with Baum-Welch (EM), each between 5 and 40 states in size. Training was done using only the original training data, and separately also using reconstructed training data, as in (2).
In the first strategy, the n-gram counts were extracted from the training data for various values of n (between 2 and 9). Then, the log likelihood of the training data was calculated and the n yielding the highest log likelihood was used to issue the probabilities to the test strings for submission. Witten-Bell smoothing (see, e.g., Chen and Goodman 1996) was used in all cases.
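The first strategy can be sketched as an interpolated Witten-Bell n-gram model together with a loop over candidate orders. This is a generic textbook formulation, not the team's code: the class name, the padding symbols, the uniform distribution used below the unigram level, and the recursion over shortened histories are all assumptions of the example.

```python
import math
from collections import defaultdict

BOS, EOS = "<s>", "</s>"  # hypothetical padding and end-of-string markers

class WittenBellNGram:
    def __init__(self, n, vocab):
        self.n = n
        self.vocab_size = len(vocab) + 1  # +1 because EOS is also predicted
        # counts[history][symbol] for histories of length 0 .. n-1
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, strings):
        for s in strings:
            padded = [BOS] * (self.n - 1) + list(s) + [EOS]
            for i in range(self.n - 1, len(padded)):
                for k in range(self.n):
                    self.counts[tuple(padded[i - k:i])][padded[i]] += 1

    def prob(self, hist, w):
        """P(w | hist) = (c(hist, w) + T(hist) * P(w | shorter)) / (c(hist) + T(hist))."""
        lower = 1.0 / self.vocab_size if not hist else self.prob(hist[1:], w)
        followers = self.counts.get(hist)
        if not followers:          # unseen history: rely on the shorter one
            return lower
        total = sum(followers.values())
        types = len(followers)     # number of distinct symbols seen after hist
        return (followers.get(w, 0) + types * lower) / (total + types)

    def logprob(self, s):
        """Base-2 log probability of a whole string, including its end marker."""
        padded = [BOS] * (self.n - 1) + list(s) + [EOS]
        return sum(
            math.log2(self.prob(tuple(padded[i - self.n + 1:i]), padded[i]))
            for i in range(self.n - 1, len(padded))
        )

def select_order(strings, vocab, orders=range(2, 10)):
    """Return (n, model) with the highest log likelihood on `strings`."""
    best = None
    for n in orders:
        model = WittenBellNGram(n, vocab)
        model.train(strings)
        ll = sum(model.logprob(s) for s in strings)
        if best is None or ll > best[0]:
            best = (ll, n, model)
    return best[1], best[2]
```

For example, `select_order(train_strings, alphabet)` with the default range tries every n between 2 and 9, matching the range mentioned above.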
For the second approach, the n chosen in the first one was used to decide the optimal window size to use for n-grams. In this strategy, the test data was used for training as well, and was augmented in an iterative fashion. This was done because the original test data represented a skewed distribution, as duplicates had been removed. First, the expected number of occurrences of each string in the test set was calculated based on the total number of occurrences of that string in the training and test sets. Based on this expected number, a fractional count of strings was “added” to the test data, reflecting a guess that the original test data had contained these duplicates. This process was repeated until convergence (when the expected string count in the test data no longer changed). These counts were then used for calculating the probabilities of each string in the test data.
For the third strategy, three randomly initialized Pfa for each of the sizes 5, 10, 20, and 40 states were trained with Baum-Welch for each problem, after which the one with the highest log likelihood was submitted (several results in case of approximate ties). Similarly to the n-gram case, another three runs for each state size were made using both training and reconstructed test data. However, contrary to the n-gram strategy, using reconstructed test data for training never improved on the basic Baum-Welch that used only the PAutomaC training data for training.
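The selection step at the end of this strategy only needs the log likelihood of each trained model, which the forward algorithm provides. The sketch below shows that computation and the selection; it does not implement Baum-Welch training itself, and the dictionary layout used for a Pfa (`initial`, `final`, `trans`) is an assumption of this example.

```python
import math

def string_prob(pfa, s):
    """Probability of string `s` under a Pfa, computed with the forward algorithm.

    pfa["initial"][q]      : probability of starting in state q
    pfa["final"][q]        : probability of stopping in state q
    pfa["trans"][q][a][q2] : probability of emitting a and moving from q to q2
    """
    alpha = dict(pfa["initial"])
    for a in s:
        nxt = {}
        for q, mass in alpha.items():
            for q2, p in pfa["trans"].get(q, {}).get(a, {}).items():
                nxt[q2] = nxt.get(q2, 0.0) + mass * p
        alpha = nxt
    return sum(mass * pfa["final"].get(q, 0.0) for q, mass in alpha.items())

def log_likelihood(pfa, strings):
    """Sum of log probabilities; raises ValueError if some string has probability 0."""
    return sum(math.log(string_prob(pfa, s)) for s in strings)

def pick_best(candidates, strings):
    """Return the candidate Pfa with the highest log likelihood on `strings`."""
    return max(candidates, key=lambda m: log_likelihood(m, strings))

# Two hand-written one-state machines over the alphabet {a}.
m1 = {"initial": {0: 1.0}, "final": {0: 0.5}, "trans": {0: {"a": {0: 0.5}}}}
m2 = {"initial": {0: 1.0}, "final": {0: 0.2}, "trans": {0: {"a": {0: 0.8}}}}
best = pick_best([m1, m2], ["", "a"])
```

In practice the candidates would be the models returned by several randomly initialized EM runs.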
The n-gram solutions were submitted early and the EM solutions later. This allowed the observation, based on the server feedback, that EM outperformed the n-grams in most cases (roughly 85 % of problems). Notable exceptions are the two real data problems, where the interpolated n-grams performed best in each case. As mentioned, using reconstructed test data for training helped in the n-gram strategy, but not with Baum-Welch, probably because of severe over-fitting.
5.3 Team Llorens
The approach followed by the Llorens team was two-fold: on one hand, they upgraded the Alergia algorithm (Carrasco and Oncina 1994) by using ideas from evidence-driven approaches to state merging. Specifically, they computed all possible merges in a red-blue framework (see Lang et al. 1998), and performed the one that passed the most statistical tests, which are computed using Hoeffding’s bound as in Alergia. The second line they followed was to work on the fact that the test data was known and that there could be a better strategy than the simple normalization to make probabilities sum to 1 on the test set.
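The statistical test based on Hoeffding’s bound, as commonly stated for Alergia, can be sketched as follows. The scoring function is a simplified, non-recursive illustration of counting passed tests for one candidate merge; it is not the team's implementation, and the state layout (`total`, `final`, `out`) is an assumption of this example.

```python
import math

def hoeffding_different(f1, n1, f2, n2, alpha=0.05):
    """True if the frequencies f1/n1 and f2/n2 differ significantly."""
    if n1 == 0 or n2 == 0:
        return False  # no evidence against compatibility
    bound = math.sqrt(0.5 * math.log(2.0 / alpha)) * (
        1.0 / math.sqrt(n1) + 1.0 / math.sqrt(n2)
    )
    return abs(f1 / n1 - f2 / n2) > bound

def merge_score(state_a, state_b, alpha=0.05):
    """Number of passed tests for merging two states, or None if any test fails.

    Each state is a dict with 'total' (number of visits), 'final' (number of
    strings ending there) and 'out' (symbol -> count). A full state merging
    algorithm would also recurse into the successor states; that is omitted.
    """
    tests = [(state_a["final"], state_b["final"])]
    for sym in set(state_a["out"]) | set(state_b["out"]):
        tests.append((state_a["out"].get(sym, 0), state_b["out"].get(sym, 0)))
    passed = 0
    for fa, fb in tests:
        if hoeffding_different(fa, state_a["total"], fb, state_b["total"], alpha):
            return None
        passed += 1
    return passed
```

An evidence-driven learner would compute such a score for every red-blue pair and perform the merge with the highest value.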
5.4 Team Bailly
Team Bailly tackled the competition by using a spectral approach (see Bailly 2011). The main component which is manipulated is the Hankel matrix (Partington 1988), representing the counts for every possible prefix-suffix pair. The core of the spectral technique is the Hankel matrix factorization, from which the parameters of a probabilistic model can be directly deduced.
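To make the object concrete, the sketch below builds a finite block of the empirical Hankel matrix from a sample, with one row per prefix and one column per suffix. The choice of prefix and suffix sets and the function name are assumptions of the example, and the factorization step (typically a truncated singular value decomposition) is not shown.

```python
from collections import Counter

def hankel_counts(sample, prefixes, suffixes):
    """Return H with H[i][j] = number of sample strings equal to prefixes[i] + suffixes[j]."""
    counts = Counter(sample)
    return [[counts[p + s] for s in suffixes] for p in prefixes]

sample = ["ab", "ab", "a", ""]
H = hankel_counts(sample, prefixes=["", "a"], suffixes=["", "b"])
```

Dividing every entry by `len(sample)` turns the counts into empirical string probabilities, which is the form usually passed to the factorization.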
5.5 Team Kepler
The approach applied by Kepler et al. (2012) uses n-gram models with variable length. n-grams are represented as a context tree that maps the probabilities of sequences of symbols. To shrink the state space while working with large n-grams, the context tree is pruned based on the Kullback-Leibler divergence. Experiments showed that this approach almost always achieves lower perplexity than the fixed 3-gram model on the PAutomaC training data. However, it is not clear how to define the maximum size of the n-gram or the pruning threshold value.
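A generic version of divergence-based pruning can be sketched as follows: a leaf context is dropped when its next-symbol distribution is close, in Kullback-Leibler divergence, to that of its parent context. The exact criterion used by the team may differ (for instance by weighting with context frequency); the data layout, the unweighted test, and the function names are assumptions of this example.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in bits; assumes q[s] > 0 wherever p[s] > 0."""
    return sum(pv * math.log2(pv / q[s]) for s, pv in p.items() if pv > 0)

def prune_contexts(dists, threshold):
    """Prune leaf contexts whose distribution is close to their parent's.

    `dists` maps a context string (oldest symbol first) to its next-symbol
    distribution. The parent of a context is the context without its oldest
    symbol; the empty context is the root and is never pruned.
    """
    kept = dict(dists)
    # Longest contexts first, so leaves are examined before their parents.
    for ctx in sorted(dists, key=len, reverse=True):
        if ctx == "":
            continue
        has_child = any(
            len(c) == len(ctx) + 1 and c.endswith(ctx) for c in kept
        )
        parent = ctx[1:]
        if (not has_child and parent in kept
                and kl_divergence(kept[ctx], kept[parent]) < threshold):
            del kept[ctx]
    return kept

dists = {
    "": {"a": 0.5, "b": 0.5},
    "a": {"a": 0.5, "b": 0.5},
    "b": {"a": 0.9, "b": 0.1},
    "ab": {"a": 0.9, "b": 0.1},
}
pruned = prune_contexts(dists, threshold=0.01)
```

The threshold plays the role of the pruning parameter whose choice is described above as an open question.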
There were 5 active participating teams from around the world.
All participants used different (both old and new) methods and were stimulated to improve these. All methods performed much better than the provided baseline algorithms.
The PAutomaC data set provides a detailed comparison of the performance of each of these methods.
There is a clear winner, and interestingly, they used a method that is in practice not (yet) commonly applied when learning Pfa.
The results remain valid using different evaluation criteria.
Interesting conclusions can be drawn by analyzing the results.
In particular, the observation that team Llorens outperforms the winning team on the deterministic instances is very interesting for future research, as it could provide a method for deciding whether a given data sample is drawn from a deterministic distribution or from a non-deterministic one. This could be very useful during the discretization of data, for instance. Moreover, it would be very interesting to further investigate and hopefully improve the performance of the spectral and n-gram based approaches developed by team Bailly and team Kepler on sparse problem instances. Last but not least, new Gibbs sampling and EM/Baum-Welch methods have been developed for Pfa by team Shibata-Yoshinaka and team Hulden. Based on their excellent performance in PAutomaC, we can encourage anyone interested in learning probability distributions over strings to use one of these methods. The developed Gibbs sampler performed consistently better in PAutomaC, but required much more computational resources. When the generating distribution is known to be deterministic, we advise a state merging approach such as the one developed by team Llorens.
We only consider discrete Hmms.
The hardness of Pac-learning the structure of a Dpfa is shown in Kearns et al. (1994).
One instance of a model is a point in the space of all possible models.
rpart, implemented in R.
A version of their algorithm is available at http://www.iip.ist.i.kyoto-u.ac.jp/member/ry/pfai/.
We are very thankful to the members of the scientific committee for their help in designing this competition. We want to thank all participants and in particular Raphael Bailly, Cleo Billa, Mans Hulden, Fabio Kepler, David Llorens, Sergio Mergen, Chihiro Shibata, and Ryo Yoshinaka for their help during the writing of this paper.
- Angluin, D. (1988). Identifying languages from stochastic examples (Technical Report Yaleu/Dcs/RR-614). Yale University.
- Bailly, R. (2011). QWA: spectral algorithm. In JMLR: workshop and conference proceedings: Vol. 20. Proceedings of the Asian conference on machine learning, ACML’11 (pp. 147–163). Cambridge: JMLR.
- Bailly, R., Denis, F., & Ralaivola, L. (2009). Grammatical inference as a principal component analysis problem. In Proceedings of the international conference on machine learning ICML’09 (pp. 33–40). Omnipress.
- Balle, B., Castro, J., & Gavaldà, R. (2012). Bootstrapping and learning PDFA in data streams. In JMLR: workshop and conference proceedings: Vol. 21. Proceedings of the international conference on grammatical inference ICGI’12 (pp. 34–48). Cambridge: JMLR.
- Baum, L. E. (1972). An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. Inequalities, 3, 1–8.
- Brill, E., Florian, R., Henderson, J. C., & Mangu, L. (1998). Beyond n-grams: can linguistic sophistication improve language modeling. In Proceedings of the joint conference of the international committee on computational linguistics and the association for computational linguistics COLING-ACL’98 (pp. 186–190). Los Altos: Kaufmann/ACL.
- Carrasco, R. C., & Oncina, J. (1994). Learning stochastic regular grammars by means of a state merging method. In LNAI: Vol. 862. Proceedings of the international colloquium on grammatical inference ICGI’94 (pp. 139–150). Berlin: Springer.
- Carrasco, R. C., Forcada, M., & Santamaria, L. (1996). Inferring stochastic regular grammars with recurrent neural networks. In LNAI: Vol. 1147. Proceedings of the international colloquium on grammatical inference ICGI’96 (pp. 274–281). Berlin: Springer.
- Castro, J., & Gavaldá, R. (2008). Towards feasible PAC-learning of probabilistic deterministic finite automata. In LNCS: Vol. 5278. Proceedings of the international colloquium on grammatical inference ICGI’08 (pp. 163–174). Berlin: Springer.
- Chen, S. F., & Goodman, J. (1996). An empirical study of smoothing techniques for language modeling. In Proceedings of the meeting of the association for computational linguistics ACL’96 (pp. 310–318). Stroudsburg: Association for Computational Linguistics.
- de la Higuera, C. (2010). Grammatical inference: learning automata and grammars. Cambridge: Cambridge University Press.
- de la Higuera, C., & Oncina, J. (2003). Identification with probability one of stochastic deterministic linear languages. In LNCS: Vol. 2842. Proceedings of the international conference on algorithmic learning theory ALT’03 (pp. 134–148). Berlin: Springer.
- de la Higuera, C., & Oncina, J. (2004). Learning probabilistic finite automata. In LNAI: Vol. 3264. Proceedings of the international colloquium on grammatical inference ICGI’04 (pp. 175–186). Berlin: Springer.
- de la Higuera, C., & Thollard, F. (2000). Identification in the limit with probability one of stochastic deterministic finite automata. In LNAI: Vol. 1891. Proceedings of the international colloquium on grammatical inference ICGI’00 (pp. 15–24). Berlin: Springer.
- Denis, F., & Esposito, Y. (2004). Learning classes of probabilistic automata. In LNCS: Vol. 3120. Proceedings of the conference on learning theory COLT’04 (pp. 124–139). Berlin: Springer.
- Denis, F., Lemay, A., & Terlutte, A. (2000). Learning regular languages using non-deterministic finite automata. In LNAI: Vol. 1891. Proceedings of the international colloquium on grammatical inference ICGI’00 (pp. 39–50). Berlin: Springer.
- Denis, F., Esposito, Y., & Habrard, A. (2006). Learning rational stochastic languages. In LNCS: Vol. 4005. Proceedings of the conference on learning theory COLT’06 (pp. 274–288). Berlin: Springer.
- Dupont, P., & Amengual, J.-C. (2000). Smoothing probabilistic automata: an error-correcting approach. In LNAI: Vol. 1891. Proceedings of the international colloquium on grammatical inference ICGI’00 (pp. 51–62). Berlin: Springer.
- Esposito, Y., Lemay, A., Denis, F., & Dupont, P. (2002). Learning probabilistic residual finite state automata. In LNAI: Vol. 2484. Proceedings of the international colloquium on grammatical inference ICGI’02 (pp. 77–91). Berlin: Springer.
- Gao, J., & Johnson, M. (2008). A comparison of Bayesian estimators for unsupervised hidden Markov model POS taggers. In Proceedings of the conference on empirical methods in natural language processing EMNLP’08 (pp. 344–352). Stroudsburg: Association for Computational Linguistics.
- Gavaldà, R., Keller, P. W., Pineau, J., & Precup, D. (2006). PAC-learning of Markov models with hidden state. In LNCS: Vol. 4212. Proceedings of the European conference on machine learning ECML’06 (pp. 150–161). Berlin: Springer.
- Gildea, D., & Jurafsky, D. (1996). Learning bias and phonological-rule induction. Computational Linguistics, 22, 497–530.
- Goan, T., Benson, N., & Etzioni, O. (1996). A grammar inference algorithm for the world wide web. In Proceedings of AAAI spring symposium on machine learning in information access, Stanford, CA. Menlo Park: AAAI Press.
- Grünwald, P. (2007). The minimum description length principle. Cambridge: MIT Press.
- Guttman, O. (2006). Probabilistic automata and distributions over sequences. PhD thesis, The Australian National University.
- Habrard, A., Bernard, M., & Sebban, M. (2003). Improvement of the state merging rule on noisy data in probabilistic grammatical inference. In LNAI: Vol. 2837. Proceedings of the European conference on machine learning ECML’03 (pp. 169–180). Berlin: Springer.
- Habrard, A., Denis, F., & Esposito, Y. (2006). Using pseudo-stochastic rational languages in probabilistic grammatical inference. In LNAI: Vol. 4201. Proceedings of the international colloquium on grammatical inference ICGI’06 (pp. 112–124). Berlin: Springer.
- Hasan Ibne, A., Batard, A., de la Higuera, C., & Eckert, C. (2010). PMSA: a parallel algorithm for learning regular languages. In NIPS workshop on learning on cores, clusters and clouds.
- Heule, M., & Verwer, S. (2010). Exact DFA identification using SAT solvers. In LNCS: Vol. 6339. Proceedings of international colloquium on grammatical inference ICGI’10 (pp. 66–79).
- Horning, J. J. (1969). A study of grammatical inference. PhD thesis, Stanford University.
- Hulden, M. (2012). Treba: efficient numerically stable EM for PFA. In JMLR: workshop and conference proceedings: Vol. 21. Proceedings of the international conference on grammatical inference ICGI’12 (pp. 249–253). Cambridge: JMLR.
- Jelinek, F. (1998). Statistical methods for speech recognition. Cambridge: MIT Press.
- Kearns, M. J., & Vazirani, U. (1994). An introduction to computational learning theory. Cambridge: MIT Press.
- Kepler, F., Mergen, S., & Billa, C. (2012). Simple variable length n-grams for probabilistic automata learning. In JMLR: workshop and conference proceedings: Vol. 21. Proceedings of the international conference on grammatical inference ICGI’12 (pp. 254–258). Cambridge: JMLR.
- Lang, K. J., Pearlmutter, B. A., & Price, R. A. (1998). Results of the abbadingo one DFA learning competition and a new evidence-driven state merging algorithm. In LNAI: Vol. 1433. Proceedings of the international colloquium on grammatical inference ICGI’98 (pp. 1–12). Berlin: Springer.
- Milani Comparetti, P., Wondracek, G., Kruegel, C., & Kirda, E. (2009). Prospex: protocol specification extraction. In Proceedings of the IEEE symposium on security and privacy (pp. 110–125). Los Alamitos: IEEE Computer Society.
- Ron, D., Singer, Y., & Tishby, N. (1994). Learning probabilistic automata with variable memory length. In Proceedings of the conference on learning theory COLT’94 (pp. 35–46). New York: ACM.
- Ron, D., Singer, Y., & Tishby, N. (1995). On the learnability and usage of acyclic probabilistic finite automata. In Proceedings of the conference on learning theory COLT’95 (pp. 31–40). New York: ACM.
- Sanjeev, A., & Boaz, B. (2009). Computational complexity: a modern approach (1st edn.). Cambridge: Cambridge University Press.
- Saul, L., & Pereira, F. (1997). Aggregate and mixed-order Markov models for statistical language processing. In Proceedings of the second conference on empirical methods in natural language processing EMNLP’97 (pp. 81–89). Stroudsburg: Association for Computational Linguistics.
- Shibata, C., & Yoshinaka, R. (2012). Marginalizing out transition probabilities for several subclasses of PFAs. In JMLR: workshop and conference proceedings: Vol. 21. Proceedings of the international conference on grammatical inference ICGI’12 (pp. 259–263).
- Stolcke, A. (1994). Bayesian learning of probabilistic language models. Ph.D. dissertation, University of California.
- Sudkamp, A. (2006). Languages and machines: an introduction to the theory of computer science (third edn.). Reading: Addison-Wesley.
- Thollard, F. (2001). Improving probabilistic grammatical inference core algorithms with post-processing techniques. In Proceedings of the international conference on machine learning ICML’01 (pp. 561–568). Los Altos: Kauffman.
- Thollard, F., & Dupont, P. (1999). Entropie relative et algorithmes d’inférence grammaticale probabiliste. In Actes de la conférence CAP’99 (pp. 115–122).
- Thollard, F., Dupont, P., & de la Higuera, C. (2000). Probabilistic DFA inference using Kullback-Leibler divergence and minimality. In Proceedings of the international conference on machine learning ICML’00 (pp. 975–982). Los Altos: Kaufmann.
- Verwer, S., Weerdt, M., & Witteveen, C. (2010). A likelihood-ratio test for identifying probabilistic deterministic real-time automata from positive data. In LNCS: Vol. 6339. Proceedings of the international colloquium on grammatical inference ICGI’10 (pp. 203–216). Berlin: Springer.
- Verwer, S., de Weerdt, M., & Witteveen, C. (2011). Learning driving behavior by timed syntactic pattern recognition. In Proceedings of the international joint conference on artificial intelligence IJCAI’11 (pp. 1529–1534).
- Walkinshaw, N., Lambeau, B., Damas, C., Bogdanov, K., & Dupont, P. (2012). Stamina: a competition to encourage the development and assessment of software model inference techniques. In Empirical software engineering (pp. 1–34).