PAutomaC: a probabilistic automata and hidden Markov models learning competition
Abstract
Approximating distributions over strings is a hard learning problem. Typical techniques involve using finite state machines as models and attempting to learn these; these machines can either be hand built and then have their weights estimated, or built by grammatical inference techniques: the structure and the weights are then learned simultaneously. The Probabilistic Automata learning Competition (PAutomaC), run in 2012, was the first grammatical inference challenge that allowed the comparison between these methods and algorithms. Its main goal was to provide an overview of the state-of-the-art techniques for this hard learning problem. Both artificial data and real data were presented and contestants were to try to estimate the probabilities of strings. The purpose of this paper is to describe some of the technical and intrinsic challenges such a competition has to face, to give a broad state of the art concerning both the problems dealing with learning grammars and finite state machines and the relevant literature. This paper also provides the results of the competition and a brief description and analysis of the different approaches the main participants used.
Keywords
Grammatical inference, Probabilistic automata, Hidden Markov models, Programming competition
1 Introduction
This paper describes the PAutomaC probabilistic automaton learning competition and provides an overview of the relevant literature on this topic. PAutomaC was an online challenge that took place in 2012 at http://ai.cs.umbc.edu/icgi2012/challenge/Pautomac/. The goal of PAutomaC was to provide an overview of which probabilistic automaton learning techniques work best in which setting and to stimulate the development of new techniques for learning distributions over strings. Many probabilistic automata learning methods have been produced in the past (see Sect. 2 for an overview). Most of these focus on deterministic probabilistic automata (Dpfa), where only the symbols are drawn from probability distributions but the transitions are uniquely determined given the generated symbol. There exist some exceptions, however, which aim to learn hidden Markov models (Baum 1972), probabilistic residual automata (Esposito et al. 2002), and multiplicity automata (Denis et al. 2006). Another important approach is to learn Markov chains or n-grams by simply counting the occurrences of substrings (Saul and Pereira 1997; Ney et al. 1997; Jelinek 1998). These simple counting methods have been very successful in practice (Brill et al. 1998).
Although many methods have been proposed, there has been so far no thorough investigation of which model/algorithm is likely to perform best, why and when. Knowledge about this would be very helpful to scientists/practitioners faced with a data set made of strings and the problem of finding a likely distribution over these strings. PAutomaC aimed to fill this knowledge gap by providing the first elaborate test suite for learning string distributions.
In addition to being very helpful for applications of automata learning methods, PAutomaC was designed in such a way that it provided directions to future theoretical work and algorithm development. For instance, unlike previous automata learning competitions (see Sect. 2.4 for details), in PAutomaC, the type of automaton device was not fixed: learning problems were generated using automaton models of increasing complexity. This is not only very useful for practical applications (where many different types of distributions can be encountered), but also aims to answer the interesting question of whether it is best to learn a nondeterministic model (e.g. Hmm) or a deterministic model (e.g. Dpfa) when the data is drawn from a (non)deterministic distribution, as described for instance in the work of Gavaldà et al. (2006). PAutomaC also encouraged the development and use of new techniques from machine learning that do not build an automaton structure, but do result in a string distribution. Therefore, the actual structures of the learned automata were not evaluated in PAutomaC. Instead, the performance of the different algorithms was tested only on the quality of the resulting string distribution. Like previous automaton learning competitions, this evaluation was performed online using a test set and an evaluation oracle running on the competition server. Consequently, the participants could use the observed performance (and that of the competition) to update their algorithms.
The competition setup in PAutomaC contained some novel elements that may also be of interest for competitions of other (string) distribution learning algorithms. Above all, in PAutomaC the performance was evaluated using the actual probabilities assigned by a learned distribution, instead of the more traditional method of evaluating its predictive performance. This has the advantage of not only testing whether the high probability events are assigned the largest probabilities, but also whether the low probability events are assigned the correct low probabilities. Furthermore, the actual strings that were being used for this evaluation were given to the participants beforehand.
The traditional approach to comparing language models, which had also been considered for PAutomaC, is to test the learned model over some unseen data. Perplexity (Cover and Thomas 1991) is the usual measure and, in order to perform well on such a metric, it is necessary to learn a smoothed model, in which a non-null probability is assigned to all strings (the penalty is infinite otherwise). Experience shows that in that case, the smoothing method may become dominant: the quality of the model can rely mainly on the smoothing. Another issue with such an evaluation task is that the model has to be checked somehow for consistency, since the probabilities of all possible strings must sum up to one.
The goal of PAutomaC being to compare learning algorithms (and not smoothing algorithms), a different protocol was chosen: the teams knew the test set in advance, and part of the problem for them consisted in reassigning the mass of probabilities the learned model used for the strings absent from the test set to those strings inside this set. In this way, a perplexity-like evaluation measure could be used to evaluate the differences in the probabilities assigned to different strings from the test set. A couple of possible dangers of this protocol were identified by the PAutomaC Scientific Committee^{1} and, later, by the participants. A first one was that the extra information in the test set (which was also randomly drawn from the unknown target distribution) could be used to learn. A second danger came from the fact that the teams could submit various solutions to the same problem (with no feedback about their score, but knowing their overall standing): this could have allowed some hill-climbing strategy. Both the Scientific Committee’s analysis and the attempts by some participants showed that the PAutomaC evaluation process was resistant: the winning team is actually the one that submitted the fewest times. We detail in this paper the choices that were made to handle these dangers.
As main contributions of this paper we provide an overview of the literature on probabilistic automaton learning, and describe PAutomaC including its design issues and solutions. The results of the competition and the approach followed by the main participants are also provided. There is a clear winner to PAutomaC: a novel collapsed Gibbs sampling method for Pfa developed by team ShibataYoshinaka. As it is not common to use such a method when learning distributions over strings, we hope and expect this result will influence what people will use in practice. In addition to having an appealing winner, we can draw several interesting conclusions by analyzing the results. In particular, it can be observed that the Alergia-based method developed by team Llorens outperforms the winning team on the deterministic instances. This provides some additional insights into the important question of whether it is better to learn deterministic or nondeterministic models and can serve as an important starting point for further research on this topic. Furthermore, we analyze the PAutomaC results with the goal of determining when which method works best and why. Our analysis indicates the problem areas for each of the used methods, which forms a basis for future studies and hopefully further improvements of the used methods. Last but not least, all methods developed by the participating teams significantly outperform the provided baseline algorithms, clearly demonstrating the need for developing and evaluating (new) methods for learning string distributions.
This paper is organized in six sections: introduction (Sect. 1), motivations and history (Sect. 2), an overview of PAutomaC (Sect. 3), final results (Sect. 4), a brief description and analysis of the approaches used by main participants (Sect. 5), and a conclusion (Sect. 6).
2 Motivations and history
We assume the reader to be familiar with the theory of languages and automata (Sudkamp 2006), their probabilistic counterparts such as hidden Markov models (Rabiner 1989), and basic concepts from computational complexity (Sanjeev and Boaz 2009), computational learning theory (Kearns and Vazirani 1994), and information theory (Cover and Thomas 1991). For more information on these topics the reader is referred to the corresponding references.
2.1 Why learn a probabilistic automaton?

model Dna or protein sequences in bioinformatics (Sakakibara 2005),

find patterns underlying different sounds for speech processing (Tzay 1994),

infer morphological or phonological rules for natural language processing (Gildea and Jurafsky 1996),

model unknown mechanical processes in physics (Shalizi and Crutchfield 2001),

discover the exact environment of robots (Rivest and Schapire 1993),

detect anomalies for intrusion detection in computer security (Milani Comparetti et al. 2009),

do behavioral modeling of users in applications ranging from web systems (Borges and Levene 2000) to the automotive sector (Verwer et al. 2011),

discover the structure of music styles for music classification/generation (Cruz-Alcázar and Vidal 2008).
In all such cases, an automaton model is learned from observations of the system, i.e., a finite set of strings. Usually, the data gathered from observations is unlabeled, that is to say that it is often possible to observe only strings that can be generated by the system, and strings that cannot be generated are thus unavailable. The standard method of dealing with this situation is to assume a probabilistic automaton model, i.e., a distribution over strings. In such a model, different states can generate different symbols with different probabilities. The goal of automata learning is then one of model selection (Grünwald 2007): find the probabilistic automaton model that gives the best fit to the observed strings, i.e., that is most likely to have generated the data. In addition to the data probability, this implies that the model size has to be taken into account in order to avoid overfitting. Otherwise, the model that generates only the seen strings and whose probabilities correspond to the observed frequency perfectly achieves the goal. But this naive model is of little use: it assigns null probability to all unseen strings and therefore makes no generalization.
2.2 Which probabilistic automata to learn?

Pfa (Paz 1971) are nondeterministic automata in which every state is assigned an initial and a halting probability, and every transition is assigned a transition probability (weight). The sum of all initial probabilities equals 1, and for each state, the sum of the halting and all outgoing transition probabilities equals 1. A Pfa generates strings probabilistically by starting in a state determined at random using the initial state distribution, either halting or executing a transition randomly determined using their probabilities, and iterating and generating the transition symbol in case it has not halted. A study of these automata can be found in Vidal et al. (2005, 2005b).

Hidden Markov models (Hmms)^{2} (Rabiner 1989; Jelinek 1998) are Pfa (as described in the previous paragraph) where the symbols are emitted at the states instead of at the transitions, which are only used to move. Initial probabilities are assigned to each state but there are no final probabilities, defining therefore a distribution over $\Sigma^{n}$ for each value of $n$. In order to obtain a distribution over $\Sigma^{*}$ a special halting symbol or state can be introduced. With such an addition an Hmm generates strings like a Pfa.
Interestingly, although Hmms and Pfa are commonly used in distinct areas of research, they are equivalent with respect to the distributions that can be modeled: an Hmm can be converted into a Pfa and vice versa (Vidal et al. 2005; Dupont et al. 2005). Though it is easy to randomly generate strings from these models, determining the probability of a given string is more complicated because different executions can result in the same string. For both models, this probability can be computed exactly by dynamic programming using variations of the Forward (or Backward) algorithm (Baum et al. 1970). However, estimating the most likely parameter values (probabilities) for a given set of strings and a given model (maximizing the likelihood of the model given the data) cannot be solved optimally unless RP equals NP (Abe and Warmuth 1992). The traditional method of dealing with this problem is the greedy Baum-Welch algorithm (Baum et al. 1970).
The deterministic counterpart of a Pfa is a deterministic probabilistic finite automaton (Dpfa) (Carrasco and Oncina 1994). These have been introduced essentially for efficiency reasons: in the nonprobabilistic case, learning a Dfa is provably easier than learning an Nfa (de la Higuera 2010). However, although nonprobabilistic deterministic automata are equivalent to nonprobabilistic nondeterministic automata in terms of the languages they can generate, it is shown in Vidal et al. (2005, 2005b), Dupont et al. (2005) that Dpfa are strictly less powerful than Pfa. Furthermore, distributions generated by Pfa cannot be approximated by Dpfa unless the size of the Dpfa is allowed to be exponentially larger than that of the corresponding Pfa (Guttman et al. 2005, 2006). There is a positive side to this loss in power: estimating the parameter values of a Dpfa is easy, and there exist algorithms that learn a Dpfa structure in a probably approximately correct (Pac) like setting (Clark and Thollard 2004).^{3} This is not known to be possible for Pfa or Hmms. For Pfa it has only been shown that they are strongly learnable in the limit (Denis and Esposito 2004), or Pac-learnable (under some restrictions) using a (possibly exponentially larger) Dpfa structure (Gavaldà et al. 2006).
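The Forward computation mentioned above can be sketched as follows. This is a minimal illustration, not the competition's code: the dictionary-based representation (`initial`, `final`, `transitions`) and the toy two-state Pfa are assumptions made for the example.

```python
from collections import defaultdict


def string_probability(string, initial, final, transitions):
    """Probability that a Pfa generates `string`, using the Forward recursion.

    Hypothetical representation chosen for this sketch:
      initial[q]          -> initial probability of state q
      final[q]            -> halting probability of state q
      transitions[(q, a)] -> list of (next_state, probability) pairs
    """
    # forward[q]: probability of having emitted the prefix read so far and being in q
    forward = dict(initial)
    for symbol in string:
        nxt = defaultdict(float)
        for state, mass in forward.items():
            for target, prob in transitions.get((state, symbol), []):
                nxt[target] += mass * prob
        forward = nxt
    # Sum over all runs: each run must halt in the state it ends in
    return sum(mass * final.get(state, 0.0) for state, mass in forward.items())


# Toy nondeterministic Pfa over {a, b}; in each state the halting and
# outgoing transition probabilities sum to 1.
initial = {0: 1.0}
final = {0: 0.0, 1: 0.5}
transitions = {
    (0, "a"): [(0, 0.5), (1, 0.5)],  # two possible targets for the same symbol
    (1, "b"): [(1, 0.5)],
}

p_a = string_probability("a", initial, final, transitions)
p_ab = string_probability("ab", initial, final, transitions)
```

The recursion keeps one value per state rather than enumerating executions, which is why the cost grows linearly with the string length even though the number of executions can grow exponentially.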
2.3 How to learn a probabilistic automaton?
Early work concerning the learning of distributions over strings can be found in Horning (1969) and Angluin (1988). In the first case, the goal was to learn probabilistic context-free grammars; in the second, convergence issues concerning identification in the limit with probability 1 are studied. Although these initial studies were done decades ago, only three techniques have become mainstream for learning Pfa, Hmms, and Dpfa.
Parameter estimation
The first family of techniques takes a standard structure or architecture for the machine, typically a complete graph, and then tries to find parameter settings that maximize the likelihood of the model given the data. If the structure is deterministic, the optimization problem is quite simple: transition probabilities can be estimated by maximum likelihood (Wetherell 1980). If not, the standard method is the Baum-Welch algorithm (Baum et al. 1970; Baum 1972), which iteratively computes a new estimate for the transition probabilities using the probabilities assigned to the input data. Although this technique is known to be sensitive to initial probabilities and may get stuck in a local optimum, it has frequently been applied successfully in practice.
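For the deterministic case, the maximum likelihood estimate reduces to relative frequencies, since every training string follows exactly one path. The sketch below assumes a hypothetical representation (`delta` mapping state-symbol pairs to the next state) and a structure that can parse every training string; it is an illustration rather than any participant's implementation.

```python
from collections import Counter


def estimate_dpfa(strings, start, delta):
    """Relative-frequency (maximum likelihood) estimates for a fixed Dpfa structure.

    delta[(state, symbol)] -> next state; assumed to cover all training strings.
    Returns (symbol_prob, final_prob).
    """
    visits = Counter()         # how often each state is visited
    symbol_counts = Counter()  # how often (state, symbol) is used
    halts = Counter()          # how often a string ends in each state
    for string in strings:
        state = start
        for symbol in string:
            visits[state] += 1
            symbol_counts[(state, symbol)] += 1
            state = delta[(state, symbol)]
        visits[state] += 1
        halts[state] += 1
    symbol_prob = {key: count / visits[key[0]] for key, count in symbol_counts.items()}
    final_prob = {state: halts[state] / visits[state] for state in visits}
    return symbol_prob, final_prob


# Toy structure: state 0 reads "a" and moves to state 1, which loops on "b".
delta = {(0, "a"): 1, (1, "b"): 1}
symbol_prob, final_prob = estimate_dpfa(["a", "ab", "ab", "abb"], 0, delta)
```

No such closed form exists for a nondeterministic structure, because the path taken by each string is hidden; that is the situation Baum-Welch addresses.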
Bayesian inference
The second family of techniques corresponds to Bayesian methods such as Gibbs sampling (Gelfand and Smith 1990), see, e.g., Neal (2000), Gao and Johnson (2008). Instead of learning a single model (a point^{4} estimate), these methods aim to make predictions using the joint distribution formed by all possible models. This joint distribution is hard to compute and an Hmm Gibbs sampler estimates it by iteratively sampling the visited hidden states conditioned on earlier samples of all other state visits. The stationary distribution of the thus formed Markov chain is exactly this joint distribution. Although these methods are not yet commonplace for Pfa, we believe this is likely to change after this competition.
State-merging

There have been several extensions of Alergia (de la Higuera and Thollard 2000; Carrasco et al. 2001; de la Higuera and Oncina 2003, 2004; Young-Lai and Tompa 2000; Goan et al. 1996).

Improvements of Ron et al. (1995) based on the concept of distinguishable states have been developed (Thollard and Clark 2004; Palmer and Goldberg 2005; Guttman 2006; Castro and Gavaldà 2008). An incremental version also exists (Gavaldà et al. 2006).

Algorithm Mdi was introduced by Thollard and Dupont (1999), Thollard et al. (2000), Thollard (2001). This algorithm also uses state merging.

Recently, they have been extended to learn not only the distribution over strings of events/symbols but also over their timing behaviors (Verwer et al. 2010) and from a continuous stream of data instead of a data set (Balle et al. 2012).
Other methods

Esposito et al.’s (2002) approach has consisted in learning probabilistic residual finite state automata based on the identification of the residuals of a rational language. These are the probabilistic counterparts to the residual finite state automata introduced by Denis et al. (2000, 2001).

Denis et al. (2006) and Habrard et al. (2006) introduced the innovative algorithm Dees that learns a multiplicity automaton (the weights can be negative but in each state the weights sum to one) by iteratively solving equations on the residuals.

Other algorithms learning multiplicity automata have been developed, using common approaches in machine learning such as recurrent neural networks (Carrasco et al. 1996), Principal Component Analysis (Bailly et al. 2009) or a spectral approach (Bailly 2011).
Most of these methods estimate the model parameters based on maximum likelihood. This can cause problems when computing probabilities, especially for strings of low frequency. For some of these methods, therefore, smoothing methods have been developed that adjust the maximum likelihood estimate in order to hopefully overcome these difficulties (Chen and Goodman 1996). Typically, these smoothing methods assign larger probabilities to infrequent strings, and consequently, less to more frequent ones. For n-gram learning, smoothing is very often used and sophisticated methods such as backoff smoothing exist (Zhai and Lafferty 2004). For Dpfa learning, smoothing techniques can be found in Dupont and Amengual (2000), Thollard (2001), Habrard et al. (2003). Smoothing Pfa and Hmms is still a question requiring further research.
In conclusion, many algorithms for learning probabilistic automata have been produced. Due to the difficulty of the learning problem, most of them focus on some form of Dpfa. Another important approach is to learn Markov chains or n-grams by simply counting the occurrences of substrings. As already stated, these simple methods have been very successful in practice (Brill et al. 1998). When one is faced with a data set made of strings and one needs to find a likely distribution over these strings for tasks such as prediction, anomaly detection, or modeling, it would be very helpful to know which model is likely to perform best and why. Due to the lack of a thorough test of all of these techniques, this is currently an open question. Furthermore, the fact that all known algorithms are of the greedy type and the recent successes of search-based approaches for nonprobabilistic automaton learning (Heule and Verwer 2010; Hasan Ibne et al. 2010) make one wonder whether search-based strategies are also beneficial for probabilistic automaton learning. The Probabilistic Automaton learning Competition (PAutomaC) aims to answer these questions by providing an elaborate test suite for learning string distributions.
2.4 About previous competitions

The first grammatical inference competition was organized in 1999. The participants of Abbadingo (http://abbadingo.cs.nuim.ie) had to learn Dfa of sizes ranging from 64 to 512 states from positive and negative data, consisting of strings over a two-letter alphabet.

A follow-up was the Gowachin system (http://www.irisa.fr/Gowachin/), developed to generate new automata for classification tasks: the possibility of having a certain level of noise was introduced.

The Omphalos competition (http://www.irisa.fr/Omphalos/) involved learning context-free grammars, given samples which in certain cases contained both positive and negative strings, and in others, just text.

In the Tenjinno competition, the contestants had to learn finite state transducers (http://web.science.mq.edu.au/tenjinno/).

The Gecco conference organized a competition involving learning Dfa from noisy samples (http://cswww.essex.ac.uk/staff/sml/gecco/NoisyDFA.html).

The Stamina competition (http://stamina.chefbe.net/), organized in 2010, also involved learning Dfa, but new methods were used that made it possible to solve even harder problems.

The Zulu competition (http://labhcurien.univstetienne.fr/zulu/) concerned the task of actively learning Dfa through requests to an oracle.

The Rers Grey Box Challenge (http://leo.cs.tudortmund.de:8100/isola2012) aimed to discover the complementary values of white-box and black-box software system analysis techniques, including tools for learning finite state machines.
More generally, a number of other machine learning competitions have been organized during the past years. A specific effort has been made by the Pascal network (http://pascallin2.ecs.soton.ac.uk/Challenges/).
3 An overview of PAutomaC
The goal of PAutomaC was to provide an overview of which probabilistic automaton learning techniques work best in which setting and to stimulate the development of new techniques for learning distributions over strings. In order to stimulate this development, PAutomaC was set up using an oracle server that was able to evaluate the submissions by participants online. Furthermore, in contrast to the traditional methods used to evaluate predictive machine learning algorithms, the performance in PAutomaC was evaluated using the actual probabilities assigned by a learned distribution.
Two types of data were available: artificial and real-world data donated by researchers and industries. But we have to admit that the latter were after all of little interest in the context of the competition. The problem came from the fact that not knowing the targeted probabilities implies a biased way to evaluate them. We chose to use 3-grams trained on the complete data sets to fix these probabilities, hoping that the induced bias would be drastically reduced since the competition sets consisted of less than 10% of these data. Unfortunately, this goal was not achieved since the participants who scored the best on these data sets used n-grams (even when they were using more complex approaches on the artificial data sets). We will thus not discuss the real-world data sets in the rest of this paper (detailed information is available on the website).
In this section, we first describe the way the target automata were generated. We then turn our attention to how the submissions of the participants were evaluated. Finally we discuss the choices made throughout this process.
3.1 Generating artificial data
Artificial data was generated by building random probabilistic machines with 5 to 75 states and with an alphabet consisting of 4 to 24 symbols (both inclusive, and decided uniformly at random). These machines were subsequently used to generate data sets. Of all possible state-symbol pairs that could occur in transitions, between 20 and 80 percent (the symbol sparsity) of them were generated. These pairs were selected by first choosing a state at random, and subsequently choosing a symbol from the set of symbols that had not yet been selected for that state. This created a selection without replacement from the set of all possible state-symbol pairs that was modified to remain uniform over the states. This modification made it less likely that the resulting symbols were evenly distributed over the states. For every generated state-symbol pair, one transition was generated to a randomly chosen target state. Between 0 and 20 percent (the transition sparsity) transitions were generated in addition to these, selected without replacement from the set of possible transitions, modified to remain uniform over the source states and transition labels.
Initial and final states were selected without replacement until the percentages of selected states exceeded the transition and symbol sparsities, respectively. All initial, symbol, and transition probabilities were drawn from a Dirichlet distribution with concentration parameters set to 1 (making every probability distribution equally likely). The final probabilities were drawn together with the symbol probabilities.
From such a structure, one training set (with repetitions) and one test set (without repetitions) were generated from every target. With probability one out of four, the generated training set was of size 100 000; it was of size 20 000 otherwise. New test strings were generated using the target machine until 1 000 unique strings had been generated. The test strings were allowed to overlap with the strings used for training. If the average length of the generated strings was less than 5 or greater than 50, a new automaton and new data sets were generated using the same construction parameters. In total, 150 models and corresponding training and test sets were generated in this way. We evaluated the difficulty of the generated sets using a 3-gram baseline algorithm: the problem was considered easy if the baseline output was close to the target (a perplexity difference of less than 1.0), and difficult otherwise. We then selected 16 of them, aiming to obtain a range of values for the number of states, the size of the alphabet, sparsity values, and difficulty. We applied the same procedure for Dpfa but without generating additional transitions; and for Hmms, we generated state-state pairs instead of state-symbol-state triples.
3.2 Evaluation
Notice that this measure is equivalent to the well-known Kullback-Leibler (KL) divergence (Kullback and Leibler 1951). Indeed, given two distributions $P$ and $Q$, the KL divergence is defined as $KL(P,Q)=\sum_{x} P(x)\log_{2}(P(x)/Q(x))$, which can be rewritten as $KL(P,Q)=\left(-\sum_{x} P(x)\log_{2} Q(x)\right)-H(P)$, where $H(P)$ is the entropy of the target distribution. $H(P)$ is constant in our case since the aim is to compare various candidate distributions $Q$. As we were only interested in the divergence on a given test set $S$, the only varying element of the KL divergence is $-\sum_{x\in S} P(x)\log_{2} Q(x)$, which is equivalent to our measure, up to a monotonic transformation.
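The decomposition above can be checked numerically. The sketch below uses made-up probabilities for four test strings, normalized over the test set. The final line, which raises 2 to the cross-entropy, is only one possible monotonic transformation and is an assumption of this example, not a statement of the official formula.

```python
import math


def normalize(probs):
    """Rescale probabilities so that they sum to 1 over the test set."""
    total = sum(probs)
    return [p / total for p in probs]


def cross_entropy(p, q):
    """-sum_x P(x) log2 Q(x): the only term that depends on the candidate Q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)


def entropy(p):
    """H(P), constant across candidates."""
    return cross_entropy(p, p)


def kl_divergence(p, q):
    """sum_x P(x) log2(P(x) / Q(x))."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical target and candidate probabilities on a 4-string test set
target = normalize([0.4, 0.3, 0.2, 0.1])
candidate = normalize([0.25, 0.25, 0.25, 0.25])

gap = kl_divergence(target, candidate) - (cross_entropy(target, candidate) - entropy(target))
# Assumption: a perplexity-like score obtained by a monotonic transformation
perplexity_like = 2 ** cross_entropy(target, candidate)
```

Because $H(P)$ does not depend on the candidate, ranking candidates by cross-entropy, by KL divergence, or by any increasing function of either gives the same order.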
To decide the final overall rank of each participant, points were attributed for each data set: the leader of a problem at the end of the competition scored 5 points, the second 3, the third 2 and the fourth 1. In case of equality on a problem (based on the first 10 digits of the perplexity score), the earliest submission won. The winner is the participant whose overall score was the highest. There was no restriction on the number of submissions a given participant could provide, but no feedback was given about the resulting perplexity. To compute the final score of a participant, only the best submission to each problem was considered.
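The point attribution can be sketched as below. The data layout (`best[problem][team]` holding the best perplexity and its submission time) and the example values are hypothetical, and comparing scores on 10 significant digits via string formatting is an assumption about how "the first 10 digits" were compared.

```python
POINTS = (5, 3, 2, 1)  # points for ranks 1 to 4, as stated in the text


def overall_scores(best):
    """best[problem][team] -> (perplexity, submission_time) of the team's best submission."""
    totals = {}
    for entries in best.values():
        # Lower perplexity wins; ties on the compared digits go to the earlier submission.
        ranked = sorted(
            entries,
            key=lambda team: (float(f"{entries[team][0]:.10g}"), entries[team][1]),
        )
        for rank, team in enumerate(ranked):
            gained = POINTS[rank] if rank < len(POINTS) else 0
            totals[team] = totals.get(team, 0) + gained
    return totals


# Hypothetical results on two problems
best = {
    "problem_1": {"A": (30.0, 5), "B": (30.0, 2), "C": (31.5, 1), "D": (40.0, 3), "E": (50.0, 4)},
    "problem_2": {"A": (10.0, 1), "B": (12.0, 1), "C": (11.0, 1)},
}
totals = overall_scores(best)
```

On `problem_1`, teams A and B tie on perplexity, so B ranks first because its submission came earlier.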
3.3 Discussion on the design of the competition
When organizing an online competition, one has to make various choices about the generation of data and the evaluation of the participant submissions. We described above what was done for PAutomaC but we feel that the choices that were made have to be discussed. What follows thus contains arguments about the validity of our approach and therefore of the results of the competition.
Target generation
As already stated, we used a Dirichlet distribution for sampling the output probabilities. The main advantage of this method is that every possible distribution is equally likely when sampled using a Dirichlet distribution (with concentration parameters set to 1). Notice that this does not happen when every output probability is iteratively sampled uniformly at random. Since we did not intend to bias the distribution in PAutomaC towards certain types of distributions, using the Dirichlet distribution seemed the logical choice.
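The difference between the two sampling schemes can be illustrated with a small simulation. This is a sketch under assumptions: "iteratively sampled uniformly at random" is interpreted here as drawing each probability uniformly from the mass that remains, and the Dirichlet with all concentration parameters equal to 1 is sampled by normalizing independent exponential draws.

```python
import random


def flat_dirichlet(k, rng):
    """Dirichlet(1, ..., 1) sample: normalized independent Exp(1) draws."""
    draws = [rng.expovariate(1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def iterative_uniform(k, rng):
    """Assumed naive scheme: each probability is uniform on the remaining mass."""
    remaining = 1.0
    probs = []
    for _ in range(k - 1):
        p = rng.uniform(0.0, remaining)
        probs.append(p)
        remaining -= p
    probs.append(remaining)  # the last outcome receives whatever is left
    return probs


def component_means(sampler, k, n, seed=0):
    """Average value of each component over n samples."""
    rng = random.Random(seed)
    sums = [0.0] * k
    for _ in range(n):
        for i, p in enumerate(sampler(k, rng)):
            sums[i] += p
    return [s / n for s in sums]


dirichlet_means = component_means(flat_dirichlet, 3, 20000)
iterative_means = component_means(iterative_uniform, 3, 20000)
```

With three outcomes, the Dirichlet components each average about 1/3, whereas the iterative scheme favors the first outcome (about 1/2 on average), which is the kind of bias the Dirichlet choice avoids.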
If we were to sample all output probabilities from a Dirichlet distribution unconstrained, however, we would obtain a very densely connected Pfa with high probability. Such densely connected automata are uninteresting from a learning perspective: a simple one-gram will already reach a close to optimal perplexity. We therefore constrained this sampling using symbol and transition sparsity values. These two values were preselected and the generated Pfa was then forced to match these sparsity values. Afterwards, we sampled the transition probabilities for every state using a Dirichlet distribution.
The Pfa structure generator worked by iteratively adding new transitions until the preselected sparsity values were reached. This selection remained uniform over all states, lowering the probability that every state gets assigned the same number of symbols and transitions. The generator initialized by adding to every state one random symbol and one random transition for that symbol. This avoided the generation of states with a final probability of 1.0, i.e., sink states. This was done because we aimed for the final probability generation to be independent of the structure generation.
The final probability of each state was handled as the emission of a special symbol: this allowed a simple normalization process and did not influence the bias over distributions since their values were sampled together with the output probabilities using a Dirichlet distribution. Together with the consistency test (see below), this ensured that the generated machines corresponded to a proper distribution (probabilities over all possible strings summed to 1). The selection of which states had final probabilities, however, was performed independently of the process used to select output transitions. This ensured that having more output symbols does not lead to lower final probabilities.
An important step took place directly after the generation of a target. It consisted in checking that all states were reachable from an initial state and that they were all coaccessible. Indeed, verifying the consistency of the machine ensured that we did not have a path (and thus a probability mass) that reached a part of the machine that never led to an accepting state. In addition, we tested whether the generated probability distribution did not result in giving too much weight to long or short strings. Although this created some bias in the generation procedure, it was unavoidable because testing the different methods on instances that are too difficult or uninteresting makes no sense.
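The reachability and coaccessibility test amounts to two graph searches, one forward from the initial states and one backward from the states with non-zero final probability. The sketch below assumes a hypothetical representation of the machine as a list of (source, symbol, target) triples with non-zero probability.

```python
from collections import deque


def _closure(sources, edges):
    """All states that can be reached from `sources` by following `edges`."""
    seen = set(sources)
    queue = deque(seen)
    while queue:
        state = queue.popleft()
        for nxt in edges.get(state, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def is_consistent(states, initial_states, final_states, transitions):
    """True if every state is reachable from an initial state and coaccessible.

    transitions: iterable of (source, symbol, target) with non-zero probability.
    """
    forward, backward = {}, {}
    for source, _symbol, target in transitions:
        forward.setdefault(source, set()).add(target)
        backward.setdefault(target, set()).add(source)
    reachable = _closure(initial_states, forward)
    coaccessible = _closure(final_states, backward)  # states that can reach a final state
    return set(states) <= reachable and set(states) <= coaccessible


# State 2 is reachable but can never lead to the final state 1
leaky = [(0, "a", 1), (1, "b", 1), (0, "a", 2)]
repaired = leaky + [(2, "b", 1)]
leaky_ok = is_consistent({0, 1, 2}, {0}, {1}, leaky)
repaired_ok = is_consistent({0, 1, 2}, {0}, {1}, repaired)
```

In the first machine, the probability mass sent to state 2 is lost, so the string probabilities would sum to less than 1; adding a transition back to state 1 repairs it.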
Evaluation
As already stated, the choice of an evaluation function that does not rely on a particular type of machine was a fundamental requirement of PAutomaC. Using a perplexity measure had the interest to be a widely accepted way to compare distributions and its link with the KL divergence was clearly a plus. Though we did not inform the participants about it, we also computed two other evaluation functions for each submission: the maxnorm (maximal difference between the submitted probabilities and the target ones) and the sumnorm (the sum of the differences between the submitted and the target probabilities). While on a few problems the ranking of the participants was a bit different than the one obtained with the official perplexity measure, the overall ranking of the teams was the same.
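The two additional evaluation functions can be written in a few lines. The sketch assumes that "difference" means absolute difference and uses made-up probability vectors.

```python
def max_norm(submitted, target):
    """Largest absolute difference between submitted and target probabilities."""
    return max(abs(s - t) for s, t in zip(submitted, target))


def sum_norm(submitted, target):
    """Sum of absolute differences between submitted and target probabilities."""
    return sum(abs(s - t) for s, t in zip(submitted, target))


# Hypothetical probabilities for three test strings
target_probs = [0.5, 0.3, 0.2]
submitted_probs = [0.4, 0.4, 0.2]
worst_gap = max_norm(submitted_probs, target_probs)
total_gap = sum_norm(submitted_probs, target_probs)
```

Unlike perplexity, neither quantity becomes infinite when a submitted probability is zero.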
A common issue when dealing with string distributions is smoothing. When using perplexity as a measure, smoothing becomes necessary because strings with zero probability obtain an infinite KL divergence when compared to the target (or to any other distribution assigning them nonzero probability) and thus an infinite perplexity. Although smoothing can be very beneficial in practice, we feel that the standard perplexity measure is too dependent on smoothing (compared to the max-norm, for instance) and therefore that a perplexity evaluation based on an unseen test set does not properly measure the quality of the string distribution. In PAutomaC we therefore decided to provide the participating teams with knowledge of the actual strings used to compute the perplexity measure. This removed the need for specialized smoothing methods since the participants could simply use a minimum value for the probability assigned to any string.
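The "minimum value" strategy can be illustrated with a few lines of code. This is a sketch only: the floor value `1e-12` and the function name are assumptions chosen for illustration, not values used by any participant.

```python
def floor_and_normalize(probs, floor=1e-12):
    """Replace probabilities below `floor` by `floor`, then renormalize on the test set."""
    floored = [max(p, floor) for p in probs]
    total = sum(floored)
    return [p / total for p in floored]
```

After this step no test string has zero probability, so the perplexity is always finite, while strings that already had substantial probability are almost unchanged.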
Collusion
A common problem with online competitions is the possibility of collusion. Indeed, a set of test data has to be made available to participants in order to evaluate the performance of their algorithm with respect to a given target. But if this set contains information about the target, then it can be used during the learning phase and may bias the results. In a competition where the targets are not stochastic devices, this problem is usually tackled by requiring that elements of the test set do not occur in the train set (though they are generated by the same process). But this cannot be ensured when the aim is to learn a distribution, as both sets have to be generated using the target: erasing elements of the test set that occur in the train set introduces a significant bias in the distribution of the test set. We therefore chose to keep these elements, expecting that the difference in size between the train and test sets sufficed to make the information contained in the test set useless.
But collusion can also result from the fact that the test set by itself contains information about the target distribution: duplicate strings are likely to be frequent in the target distribution. This is why we decided to remove redundant elements of the test sets, creating a small bias in the distribution of these sets. However, since the actual target distribution was used during the evaluation, and thanks to the choices made for this phase, this did not result in a bias or other problems during evaluation.
4 Results
4.1 Competition activity
During the competition phase, the website received 724 visits (with a maximum of 54 on the last day of the competition) from 196 unique visitors, with an average visit duration of a bit more than 5 minutes. IPs from 37 countries were detected, of which 14 countries accounted for 5 or more visitors each.
4.2 Overall results
The final scores can be seen in Fig. 4 and detailed results are presented in Table 1 (available in the Appendix). There is a clear winner of PAutomaC: team Shibata-Yoshinaka. Of all participants, they obtained the best perplexity values on most instances and performed well on all others. This result is validated by the computation of other performance indicators (the max-norm and sum-norm). From Table 1 it can be observed that the method implemented by team Shibata-Yoshinaka works well for all of the competition problems: the difference between the perplexity values of the solutions and their submissions was never greater than 0.1. Furthermore, this difference was even smaller on the instances with 100 000 strings, indicating that they made good use of additional data.
4.3 Analysis of the results
In PAutomaC, the different approaches were tested on problem instances with a broad range of parameter values and coming from different probabilistic automaton models (see Table 2 in the Appendix). This makes it possible to perform some additional analysis of the results with the goal of discovering when each method works best and trying to understand why. Tables 1 and 2 (both in the Appendix) clearly show that team Shibata-Yoshinaka is only outperformed on the (nearly) deterministic instances (Dpfa, or Pfa/Hmm with a small transition sparsity). On these instances team Llorens performs slightly better. Team Hulden's method also manages to obtain the best perplexity values on two instances, and actually beats team Llorens's overall performance by just 2 points (rightmost points in Fig. 4). Their method seems to perform best on dense instances with few states. The methods used by team Bailly and team Kepler have some difficulties with very sparse instances (and thus also with Dpfa), and perform well but not best on the other instances.
 (1) Predicting the winner given the problem instance parameter values.
 (2) Predicting whether a deterministic distribution was used to generate the problem instance given the winner.
5 The different approaches and individual results
A wide spectrum of learning approaches was used during the competition. In this section we describe those of the main participants (those who scored at least a point) and provide a short analysis of their performance in PAutomaC. This section is the result of in-depth discussions and electronic exchanges the authors had with the different teams. However, the overview presented here is superficial and the reader is therefore referred to the original papers describing each team's work.
5.1 Team Shibata-Yoshinaka
In the actual implementation,^{6} they fixed the number of iterations of collapsed Gibbs sampling (CGS) and the number of sampling points a priori. The values of N and β were determined by 10-fold cross-validation among a finite number of candidates. Finding good settings for these values required considerable computational resources.
Analysis
5.2 Team Hulden
 (1) A basic "baseline" n-gram strategy with smoothing.
 (2) Another "baseline" n-gram strategy without smoothing, but using interpolated test data.
 (3) The construction of a fully connected Pfa inferred with Baum-Welch (EM), each between 5 and 40 states in size. Training was done using only the original training data, and separately also using reconstructed training data, as in (2).
In the first strategy, the n-gram counts were extracted from the training data for various values of n (between 2 and 9). Then, the log likelihood of the training data was calculated and the n yielding the highest log likelihood was used to issue the probabilities to the test strings for submission. Witten-Bell smoothing (see, e.g., Chen and Goodman 1996) was used in all cases.
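As an illustration of this kind of baseline, the sketch below estimates an n-gram model with interpolated Witten-Bell smoothing, in which the weight given to the lower-order estimate grows with the number of distinct symbols seen after a history. It is not team Hulden's code: the padding markers `START` and `END`, the function names, and the uniform base distribution are assumptions made for the example.

```python
import math
from collections import defaultdict

START, END = "<s>", "</s>"  # hypothetical padding and end-of-string markers

def train_counts(strings, n):
    """Count symbol occurrences after every history of length 0..n-1."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in strings:
        seq = [START] * (n - 1) + list(s) + [END]
        for i in range(n - 1, len(seq)):
            for k in range(n):
                counts[tuple(seq[i - k:i])][seq[i]] += 1
    return counts

def witten_bell(sym, hist, counts, vocab_size):
    """Interpolated Witten-Bell estimate of P(sym | hist).
    vocab_size counts the alphabet symbols plus END."""
    if len(hist) == 0:
        lower = 1.0 / vocab_size                  # uniform base distribution
    else:
        lower = witten_bell(sym, hist[1:], counts, vocab_size)
    followers = counts.get(hist)
    if not followers:
        return lower                              # unseen history: back off entirely
    total = sum(followers.values())               # tokens seen after hist
    types = len(followers)                        # distinct symbols seen after hist
    return (followers.get(sym, 0) + types * lower) / (total + types)

def log2_prob(s, n, counts, vocab_size):
    """Log (base 2) probability of the whole string s under the n-gram model."""
    seq = [START] * (n - 1) + list(s) + [END]
    return sum(
        math.log2(witten_bell(seq[i], tuple(seq[i - n + 1:i]), counts, vocab_size))
        for i in range(n - 1, len(seq))
    )
```

Repeating the training for each n between 2 and 9 and keeping the n with the highest training log likelihood would mirror the selection step described above.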
For the second approach, the n selected in the first strategy was reused as the n-gram window size. In this strategy, the test data was used for training as well, and was augmented in an iterative fashion. This was done because the original test data represented a skewed distribution, as duplicates had been removed. First, the expected number of occurrences of each string in the test set was calculated based on the total number of occurrences of that string in the training and test sets. Based on this expected number, a fractional count of strings was "added" to the test data, reflecting a guess that the original test data had contained these duplicates. This process was repeated until convergence (when the expected string count in the test data no longer changed). These counts were then used for calculating the probabilities of each string in the test data.
For the third strategy, three randomly initialized Pfa for each size of 5, 10, 20, and 40 states were trained with Baum-Welch for each problem, after which the one with the highest log likelihood was submitted (several results in case of approximate ties). Similarly to the n-gram case, another three runs for each state size were made using both training and reconstructed test data. However, contrary to the n-gram strategy, using reconstructed test data for training never improved on the basic Baum-Welch that used only the PAutomaC training data.
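The selection step of this strategy (keeping the trained model with the highest log likelihood) can be sketched with the forward algorithm. The Baum-Welch training itself is omitted, and the Pfa encoding used here, a triple of initial probabilities, final probabilities, and per-state transition tables, is a hypothetical representation chosen for the example.

```python
import math

def string_prob(pfa, s):
    """Forward algorithm: probability that the Pfa generates exactly the string s.
    pfa = (initial, final, trans) with trans[i][a][j] = P(emit a, go to j | state i)."""
    initial, final, trans = pfa
    n = len(initial)
    alpha = list(initial)
    for a in s:
        alpha = [
            sum(alpha[i] * trans[i].get(a, [0.0] * n)[j] for i in range(n))
            for j in range(n)
        ]
    return sum(x * f for x, f in zip(alpha, final))

def log_likelihood(pfa, sample):
    """Sum of log probabilities of the sample; -inf if some string is impossible."""
    total = 0.0
    for s in sample:
        p = string_prob(pfa, s)
        if p == 0.0:
            return -math.inf
        total += math.log(p)
    return total

def pick_best(candidates, sample):
    """Return the candidate Pfa with the highest log likelihood on the sample."""
    return max(candidates, key=lambda pfa: log_likelihood(pfa, sample))
```

With a one-state machine `([1.0], [0.5], [{"a": [0.5]}])`, the string `"aa"` has probability 0.5 * 0.5 * 0.5 = 0.125, which is a quick sanity check of the recursion.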
The n-gram solutions were submitted early and the EM solutions later. This allowed the observation, based on the server feedback, that EM outperformed the n-grams in most cases (roughly 85% of problems). Notable exceptions are the two real-data problems, where the interpolated n-grams performed best in each case. As mentioned, using reconstructed test data for training helped in the n-gram strategy, but not with Baum-Welch, probably because of severe overfitting.
Analysis
5.3 Team Llorens
The approach followed by the Llorens team was twofold. On the one hand, they upgraded the Alergia algorithm (Carrasco and Oncina 1994) by using ideas from evidence-driven approaches to state merging. Specifically, they computed all possible merges in a red-blue framework (see Lang et al. 1998), and performed the one that passed the most statistical tests, which are computed using Hoeffding's bound as in Alergia. On the other hand, they exploited the fact that the test data was known and that there could be a better strategy than the simple normalization to make probabilities sum to 1 on the test set.
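The Hoeffding-based compatibility test of Alergia, and the idea of scoring a candidate merge by the number of tests it passes, can be sketched as follows. This is a simplified illustration: the recursive comparison of successor states is omitted, and the function names, the default `alpha`, and the use of -1 to signal a rejected merge are assumptions of this example rather than details of team Llorens's implementation.

```python
import math

def hoeffding_different(f1, n1, f2, n2, alpha=0.05):
    """Alergia-style test: True if the frequencies f1/n1 and f2/n2 differ by more
    than the Hoeffding bound at confidence parameter alpha."""
    if n1 == 0 or n2 == 0:
        return False  # no evidence against compatibility
    bound = math.sqrt(0.5 * math.log(2.0 / alpha)) * (
        1.0 / math.sqrt(n1) + 1.0 / math.sqrt(n2)
    )
    return abs(f1 / n1 - f2 / n2) > bound

def merge_score(red, blue, alpha=0.05):
    """red, blue: dicts mapping each event (symbol or end-of-string) to its count
    in a red and a blue state. Returns the number of tests passed, or -1 if any
    test fails (non-recursive sketch)."""
    n_red, n_blue = sum(red.values()), sum(blue.values())
    passed = 0
    for event in set(red) | set(blue):
        if hoeffding_different(red.get(event, 0), n_red, blue.get(event, 0), n_blue, alpha):
            return -1
        passed += 1
    return passed
```

In an evidence-driven loop, one would compute such a score for every red-blue pair and perform the merge with the highest nonnegative score.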
Analysis
5.4 Team Bailly
Team Bailly tackled the competition by using a spectral approach (see Bailly 2011). The main component which is manipulated is the Hankel matrix (Partington 1988), representing the counts for every possible prefix-suffix pair. The core of the spectral technique is the Hankel matrix factorization, from which the parameters of a probabilistic model can be directly deduced.
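To make the central object concrete, the following sketch builds an empirical Hankel matrix over finite, user-chosen sets of prefixes and suffixes. The function name and the choice of prefix and suffix sets are assumptions for illustration; the factorization step (typically a truncated singular value decomposition) is not shown.

```python
from collections import Counter

def hankel_matrix(sample, prefixes, suffixes):
    """H[i][j] = number of strings in the sample equal to prefixes[i] + suffixes[j]."""
    counts = Counter(sample)
    return [[counts[p + s] for s in suffixes] for p in prefixes]
```

For the sample `["ab", "ab", "a", ""]` with prefixes `["", "a"]` and suffixes `["", "b"]`, the rows are indexed by prefixes and the columns by suffixes, so the entry for the pair ("a", "b") is the count of the string "ab".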
Analysis
5.5 Team Kepler
The approach applied by Kepler et al. (2012) uses n-gram models with variable length. n-grams are represented as a context tree that maps the probabilities of sequences of symbols. To shrink the state space while working with large n-grams, the context tree is pruned based on the Kullback-Leibler divergence. Experiments showed that this approach almost always achieves lower perplexity than the fixed 3-gram model on the PAutomaC training data. However, it is not clear how to define the maximum size of the n-gram or the pruning threshold value.
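One possible reading of KL-based pruning is sketched below: a context is dropped when its next-symbol distribution is close, in Kullback-Leibler divergence, to that of its shorter parent context. The exact criterion used by Kepler et al. is not given here, so the rule, the `END` marker, the function names, and the leaf-first pruning order are all assumptions of this example.

```python
import math
from collections import defaultdict

END = "</s>"  # hypothetical end-of-string marker

def context_counts(strings, max_order):
    """counts[context][symbol] for every context of length 0..max_order."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in strings:
        seq = list(s) + [END]
        for i, sym in enumerate(seq):
            for k in range(min(i, max_order) + 1):
                counts[tuple(seq[i - k:i])][sym] += 1
    return counts

def distribution(followers):
    """Turn next-symbol counts into relative frequencies."""
    total = sum(followers.values())
    return {sym: c / total for sym, c in followers.items()}

def kl_divergence(p, q):
    """KL(p || q) in bits; q must be nonzero wherever p is nonzero."""
    return sum(pv * math.log2(pv / q[x]) for x, pv in p.items())

def prune(counts, threshold):
    """Return the contexts kept after removing leaves whose distribution is within
    `threshold` bits of their parent context (the context minus its oldest symbol)."""
    kept = set(counts)
    for ctx in sorted(counts, key=len, reverse=True):   # longest contexts first
        if not ctx:
            continue                                    # always keep the root
        has_child = any(len(c) == len(ctx) + 1 and c[1:] == ctx for c in kept)
        if has_child:
            continue                                    # only leaves may be pruned
        child = distribution(counts[ctx])
        parent = distribution(counts[ctx[1:]])
        if kl_divergence(child, parent) < threshold:
            kept.discard(ctx)
    return kept
```

A larger threshold yields a smaller tree; choosing it, like choosing `max_order`, remains the open question mentioned above.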
Analysis
6 Conclusion

There were 5 active participating teams from around the world.

All participants used different (both old and new) methods and were stimulated to improve these. All methods performed much better than the provided baseline algorithms.

The PAutomaC data set provides a detailed comparison of the performance of each of these methods.

There is a clear winner, and interestingly, they used a method that is in practice not (yet) commonly applied when learning Pfa.

The results remain valid using different evaluation criteria.

Interesting conclusions can be drawn by analyzing the results.
In particular, the observation that team Llorens outperforms the winning team on the deterministic instances is very interesting for future research, as it could provide a method for deciding whether a given data sample is drawn from a deterministic distribution or from a nondeterministic one. This could be very useful during the discretization of data, for instance. Moreover, it would be very interesting to further investigate and hopefully improve the performance of the spectral and n-gram based approaches developed by team Bailly and team Kepler on sparse problem instances. Last but not least, new Gibbs sampling and EM/Baum-Welch methods have been developed for Pfa by team Shibata-Yoshinaka and team Hulden. Based on their excellent performance in PAutomaC, we encourage anyone interested in learning probability distributions over strings to use one of these methods. The developed Gibbs sampler performed consistently better in PAutomaC, but required much more computational resources. When the generating distribution is known to be deterministic, we advise a state merging approach such as the one developed by team Llorens.
Footnotes
 1.
 2. We only consider discrete Hmms.
 3. The hardness of Pac-learning the structure of a Dpfa is shown in Kearns et al. (1994).
 4. One instance of a model is a point in the space of all possible models.
 5. rpart, implemented in R.
 6. A version of their algorithm is available at http://www.iip.ist.i.kyotou.ac.jp/member/ry/pfai/.
Notes
Acknowledgements
We are very thankful to the members of the scientific committee for their help in designing this competition. We want to thank all participants and in particular Raphael Bailly, Cleo Billa, Mans Hulden, Fabio Kepler, David Llorens, Sergio Mergen, Shihiro Shibata, and Ryo Yoshinaka for their help during the writing of this paper.
References
 Abe, N., & Warmuth, M. (1992). On the computational complexity of approximating distributions by probabilistic automata. Machine Learning Journal, 9, 205–260.
 Angluin, D. (1988). Identifying languages from stochastic examples (Technical Report Yaleu/Dcs/RR614). Yale University.
 Bailly, R. (2011). QWA: spectral algorithm. In JMLR—workshop and conference proceedings: Vol. 20. Proceedings of the Asian conference on machine learning, ACML’11 (pp. 147–163). Cambridge: JMLR.
 Bailly, R., Denis, F., & Ralaivola, L. (2009). Grammatical inference as a principal component analysis problem. In Proceedings of the international conference on machine learning ICML’09 (pp. 33–40). Omnipress.
 Balle, B., Castro, J., & Gavaldà, R. (2012). Bootstrapping and learning PDFA in data streams. In JMLR—workshop and conference proceedings: Vol. 21. Proceedings of the international conference on grammatical inference ICGI’12 (pp. 34–48). Cambridge: JMLR.
 Baum, L. E. (1972). An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. Inequalities, 3, 1–8.
 Baum, L. E., Petrie, T., Soules, G., & Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics, 41, 164–171.
 Beimel, A., Bergadano, F., Bshouty, N. H., Kushilevitz, E., & Varricchio, S. (2000). Learning functions represented as multiplicity automata. Journal of the ACM, 47(3), 506–530.
 Bergadano, F., & Varricchio, S. (1996). Learning behaviors of automata from multiplicity and equivalence queries. SIAM Journal on Computing, 25(6), 1268–1280.
 Blei, D. M., & Jordan, M. (2006). Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1(1), 121–143.
 Borges, J., & Levene, M. (2000). Data mining of user navigation patterns. In LNCS: Vol. 1836. Web usage mining and user profiling—WEBKDD’99 workshop (pp. 92–111). Berlin: Springer.
 Brill, E., Florian, R., Henderson, J. C., & Mangu, L. (1998). Beyond n-grams: can linguistic sophistication improve language modeling. In Proceedings of the joint conference of the international committee on computational linguistics and the association for computational linguistics COLING-ACL’98 (pp. 186–190). Los Altos: Kaufmann/ACL.
 Carrasco, R. C., & Oncina, J. (1994). Learning stochastic regular grammars by means of a state merging method. In LNAI: Vol. 862. Proceedings of the international colloquium on grammatical inference ICGI’94 (pp. 139–150). Berlin: Springer.
 Carrasco, R. C., Forcada, M., & Santamaria, L. (1996). Inferring stochastic regular grammars with recurrent neural networks. In LNAI: Vol. 1147. Proceedings of the international colloquium on grammatical inference ICGI’96 (pp. 274–281). Berlin: Springer.
 Carrasco, R. C., Oncina, J., & Calera-Rubio, J. (2001). Stochastic inference of regular tree languages. Machine Learning Journal, 44(1), 185–197.
 Castro, J., & Gavaldà, R. (2008). Towards feasible PAC-learning of probabilistic deterministic finite automata. In LNCS: Vol. 5278. Proceedings of the international colloquium on grammatical inference ICGI’08 (pp. 163–174). Berlin: Springer.
 Chen, S. F., & Goodman, J. (1996). An empirical study of smoothing techniques for language modeling. In Proceedings of the meeting of the association for computational linguistics ACL’96 (pp. 310–318). Stroudsburg: Association for Computational Linguistics.
 Clark, A., & Thollard, F. (2004). PAC-learnability of probabilistic deterministic finite state automata. Journal of Machine Learning Research, 5, 473–497.
 Cover, T., & Thomas, J. (1991). Elements of information theory. New York: Wiley.
 Cruz-Alcázar, P., & Vidal, E. (2008). Two grammatical inference applications in music processing. Applied Artificial Intelligence, 22(1–2), 53–76.
 de la Higuera, C. (2010). Grammatical inference: learning automata and grammars. Cambridge: Cambridge University Press.
 de la Higuera, C., & Oncina, J. (2003). Identification with probability one of stochastic deterministic linear languages. In LNCS: Vol. 2842. Proceedings of the international conference on algorithmic learning theory ALT’03 (pp. 134–148). Berlin: Springer.
 de la Higuera, C., & Oncina, J. (2004). Learning probabilistic finite automata. In LNAI: Vol. 3264. Proceedings of the international colloquium on grammatical inference ICGI’04 (pp. 175–186). Berlin: Springer.
 de la Higuera, C., & Thollard, F. (2000). Identification in the limit with probability one of stochastic deterministic finite automata. In LNAI: Vol. 1891. Proceedings of the international colloquium on grammatical inference ICGI’00 (pp. 15–24). Berlin: Springer.
 Denis, F., & Esposito, Y. (2004). Learning classes of probabilistic automata. In LNCS: Vol. 3120. Proceedings of the conference on learning theory COLT’04 (pp. 124–139). Berlin: Springer.
 Denis, F., Lemay, A., & Terlutte, A. (2000). Learning regular languages using nondeterministic finite automata. In LNAI: Vol. 1891. Proceedings of the international colloquium on grammatical inference ICGI’00 (pp. 39–50). Berlin: Springer.
 Denis, F., Lemay, A., & Terlutte, A. (2001). Learning regular languages using RFSA. In LNCS: Vol. 2225. Proceedings of the international conference on algorithmic learning theory ALT’01 (pp. 348–363). Berlin: Springer.
 Denis, F., Esposito, Y., & Habrard, A. (2006). Learning rational stochastic languages. In LNCS: Vol. 4005. Proceedings of the conference on learning theory COLT’06 (pp. 274–288). Berlin: Springer.
 Dupont, P., & Amengual, J.C. (2000). Smoothing probabilistic automata: an error-correcting approach. In LNAI: Vol. 1891. Proceedings of the international colloquium on grammatical inference ICGI’00 (pp. 51–62). Berlin: Springer.
 Dupont, P., Denis, F., & Esposito, Y. (2005). Links between probabilistic automata and hidden Markov models: probability distributions, learning models and induction algorithms. Pattern Recognition, 38(9), 1349–1371.
 Esposito, Y., Lemay, A., Denis, F., & Dupont, P. (2002). Learning probabilistic residual finite state automata. In LNAI: Vol. 2484. Proceedings of the international colloquium on grammatical inference ICGI’02 (pp. 77–91). Berlin: Springer.
 Gao, J., & Johnson, M. (2008). A comparison of Bayesian estimators for unsupervised hidden Markov model POS taggers. In Proceedings of the conference on empirical methods in natural language processing EMNLP’08 (pp. 344–352). Stroudsburg: Association for Computational Linguistics.
 Gavaldà, R., Keller, P. W., Pineau, J., & Precup, D. (2006). PAC-learning of Markov models with hidden state. In LNCS: Vol. 4212. Proceedings of the European conference on machine learning ECML’06 (pp. 150–161). Berlin: Springer.
 Gelfand, A., & Smith, A. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85(410), 398–409.
 Gildea, D., & Jurafsky, D. (1996). Learning bias and phonological-rule induction. Computational Linguistics, 22, 497–530.
 Goan, T., Benson, N., & Etzioni, O. (1996). A grammar inference algorithm for the world wide web. In Proceedings of AAAI spring symposium on machine learning in information access, Stanford, CA. Menlo Park: AAAI Press.
 Grünwald, P. (2007). The minimum description length principle. Cambridge: MIT Press.
 Guttman, O. (2006). Probabilistic automata and distributions over sequences. PhD thesis, The Australian National University.
 Guttman, O., Vishwanathan, S. V. N., & Williamson, R. C. (2005). Learnability of probabilistic automata via oracles. In LNCS: Vol. 3734. Proceedings of the international conference on algorithmic learning theory ALT’05 (pp. 171–182). Berlin: Springer.
 Habrard, A., Bernard, M., & Sebban, M. (2003). Improvement of the state merging rule on noisy data in probabilistic grammatical inference. In LNAI: Vol. 2837. Proceedings of the European conference on machine learning ECML’03 (pp. 169–180). Berlin: Springer.
 Habrard, A., Denis, F., & Esposito, Y. (2006). Using pseudo-stochastic rational languages in probabilistic grammatical inference. In LNAI: Vol. 4201. Proceedings of the international colloquium on grammatical inference ICGI’06 (pp. 112–124). Berlin: Springer.
 Hasan Ibne, A., Batard, A., de la Higuera, C., & Eckert, C. (2010). PMSA: a parallel algorithm for learning regular languages. In NIPS workshop on learning on cores, clusters and clouds.
 Heule, M., & Verwer, S. (2010). Exact DFA identification using SAT solvers. In LNCS: Vol. 6339. Proceedings of international colloquium on grammatical inference ICGI’10 (pp. 66–79).
 Horning, J. J. (1969). A study of grammatical inference. PhD thesis, Stanford University.
 Hulden, M. (2012). Treba: efficient numerically stable EM for PFA. In JMLR—workshop and conference proceedings: Vol. 21. Proceedings of the international conference on grammatical inference ICGI’12 (pp. 249–253). Cambridge: JMLR.
 Jelinek, F. (1998). Statistical methods for speech recognition. Cambridge: MIT Press.
 Kearns, M. J., & Vazirani, U. (1994). An introduction to computational learning theory. Cambridge: MIT Press.
 Kearns, M. J., Mansour, Y., Ron, D., Rubinfeld, R., Schapire, R. E., & Sellie, L. (1994). On the learnability of discrete distributions. In Proceedings of the twenty-sixth annual ACM symposium on theory of computing STOC’94 (pp. 273–282). New York: ACM.
 Kepler, F., Mergen, S., & Billa, C. (2012). Simple variable length n-grams for probabilistic automata learning. In JMLR—workshop and conference proceedings: Vol. 21. Proceedings of the international conference on grammatical inference ICGI’12 (pp. 254–258). Cambridge: JMLR.
 Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1), 79–86.
 Lang, K. J., Pearlmutter, B. A., & Price, R. A. (1998). Results of the Abbadingo one DFA learning competition and a new evidence-driven state merging algorithm. In LNAI: Vol. 1433. Proceedings of the international colloquium on grammatical inference ICGI’98 (pp. 1–12). Berlin: Springer.
 Lee, D., & Yannakakis, M. (1996). Principles and methods of testing finite state machines—a survey. Proceedings of the IEEE, 84(8), 1090–1123.
 Milani Comparetti, P., Wondracek, G., Kruegel, C., & Kirda, E. (2009). Prospex: protocol specification extraction. In Proceedings of the IEEE symposium on security and privacy (pp. 110–125). Los Alamitos: IEEE Computer Society.
 Mohri, M. (1997). Finite-state transducers in language and speech processing. Computational Linguistics, 23(3), 269–311.
 Neal, R. (2000). Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2), 249–265.
 Ney, H., Martin, S., & Wessel, F. (1997). Statistical language modeling using leaving-one-out. In Corpus-based statistical methods in speech and language processing (pp. 174–207). Norwell: Kluwer Academic.
 Palmer, N., & Goldberg, P. W. (2005). PAC-learnability of probabilistic deterministic finite state automata in terms of variation distance. In LNCS: Vol. 3734. Proceedings of the international conference on algorithmic learning theory ALT’05 (pp. 157–170). Berlin: Springer.
 Partington, J. R. (1988). An introduction to Hankel operators. London mathematical society student texts. Cambridge: Cambridge University Press.
 Paz, A. (1971). Introduction to probabilistic automata. San Diego: Academic Press.
 Rabiner, L. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77, 257–286.
 Rivest, R. L., & Schapire, R. E. (1993). Inference of finite automata using homing sequences. Information and Computation, 103, 299–347.
 Ron, D., Singer, Y., & Tishby, N. (1994). Learning probabilistic automata with variable memory length. In Proceedings of the conference on learning theory COLT’94 (pp. 35–46). New York: ACM.
 Ron, D., Singer, Y., & Tishby, N. (1995). On the learnability and usage of acyclic probabilistic finite automata. In Proceedings of the conference on learning theory COLT’95 (pp. 31–40). New York: ACM.
 Sakakibara, Y. (2005). Grammatical inference in bioinformatics. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(7), 1051–1062.
 Sanjeev, A., & Boaz, B. (2009). Computational complexity: a modern approach (1st edn.). Cambridge: Cambridge University Press.
 Saul, L., & Pereira, F. (1997). Aggregate and mixed-order Markov models for statistical language processing. In Proceedings of the second conference on empirical methods in natural language processing EMNLP’97 (pp. 81–89). Stroudsburg: Association for Computational Linguistics.
 Shalizi, C. R., & Crutchfield, J. P. (2001). Computational mechanics: pattern and prediction, structure and simplicity. Journal of Statistical Physics, 104, 817–879.
 Shibata, C., & Yoshinaka, R. (2012). Marginalizing out transition probabilities for several subclasses of PFAs. In JMLR—workshop and conference proceedings: Vol. 21. Proceedings of the international conference on grammatical inference ICGI’12 (pp. 259–263).
 Stolcke, A. (1994). Bayesian learning of probabilistic language models. Ph.D. dissertation, University of California.
 Sudkamp, A. (2006). Languages and machines: an introduction to the theory of computer science (third edn.). Reading: Addison-Wesley.
 Thollard, F. (2001). Improving probabilistic grammatical inference core algorithms with postprocessing techniques. In Proceedings of the international conference on machine learning ICML’01 (pp. 561–568). Los Altos: Kaufmann.
 Thollard, F., & Clark, A. (2004). PAC-learnability of probabilistic deterministic finite state automata. Journal of Machine Learning Research, 5, 473–497.
 Thollard, F., & Dupont, P. (1999). Entropie relative et algorithmes d’inférence grammaticale probabiliste. In Actes de la conférence CAP’99 (pp. 115–122).
 Thollard, F., Dupont, P., & de la Higuera, C. (2000). Probabilistic DFA inference using Kullback-Leibler divergence and minimality. In Proceedings of the international conference on machine learning ICML’00 (pp. 975–982). Los Altos: Kaufmann.
 Verwer, S., de Weerdt, M., & Witteveen, C. (2010). A likelihood-ratio test for identifying probabilistic deterministic real-time automata from positive data. In LNCS: Vol. 6339. Proceedings of the international colloquium on grammatical inference ICGI’10 (pp. 203–216). Berlin: Springer.
 Verwer, S., de Weerdt, M., & Witteveen, C. (2011). Learning driving behavior by timed syntactic pattern recognition. In Proceedings of the international joint conference on artificial intelligence IJCAI’11 (pp. 1529–1534).
 Vidal, E., Thollard, F., de la Higuera, C., Casacuberta, F., & Carrasco, R. C. (2005a). Probabilistic finite state automata—part I. Pattern Analysis and Machine Intelligence, 27(7), 1013–1025.
 Vidal, E., Thollard, F., de la Higuera, C., Casacuberta, F., & Carrasco, R. C. (2005b). Probabilistic finite state automata—part II. Pattern Analysis and Machine Intelligence, 27(7), 1026–1039.
 Walkinshaw, N., Lambeau, B., Damas, C., Bogdanov, K., & Dupont, P. (2012). Stamina: a competition to encourage the development and assessment of software model inference techniques. In Empirical software engineering (pp. 1–34).
 Wetherell, C. S. (1980). Probabilistic languages: a review and some open questions. Computing Surveys, 12(4), 361–379.
 Young, T. Y. (1994). Handbook of pattern recognition and image processing: computer vision (Vol. 2). San Diego: Academic Press.
 Young-Lai, M., & Tompa, F. W. (2000). Stochastic grammatical inference of text database structure. Machine Learning Journal, 40(2), 111–137.
 Zhai, C., & Lafferty, J. (2004). A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems, 22, 179–214.