A comparison of collapsed Bayesian methods for probabilistic finite automata
Abstract
This paper describes several collapsed Bayesian methods, which work by first marginalizing out transition probabilities, for inferring several kinds of probabilistic finite automata. The methods include collapsed Gibbs sampling (CGS) and collapsed variational Bayes, as well as two new methods. Their targets range over general probabilistic finite automata, hidden Markov models, probabilistic deterministic finite automata, and variable-length grams. We implement and compare these algorithms over the data sets from the Probabilistic Automata Learning Competition (PAutomaC), which are generated by various types of automata. We report that the CGS-based algorithm designed to target general probabilistic finite automata performed the best for all types of data.
Keywords
Collapsed Gibbs sampling, Variational Bayesian methods, State-merging algorithms
1 Introduction
Since Hidden Markov Models (HMMs) are used in many applications, many inference methods for them have thus far been proposed and refined. It is difficult to find transition probabilities that maximize the generation probability of training samples. It is also intractable to marginalize out state transition probabilities and simultaneously sum them with respect to hidden variables. Therefore, some approximation and/or local-optimum search technique is required. The Expectation Maximization (EM) algorithm for HMMs, known as the Baum-Welch algorithm, is the best-known classical statistical method for HMM learning. Recently, beyond EM, many statistical approaches have been developed and applied for inferring HMMs, such as Collapsed Gibbs Sampling (CGS) (Goldwater and Griffiths 2007), Variational Bayes (VB) (Beal 2003), and spectral methods (Hsu et al. 2009). CGS is a special form of Gibbs sampling (Bishop 2006), where only hidden variables are sampled after transition probabilities are marginalized. VB approximates all parameters, namely, transition probabilities and probabilities of hidden states, to be independent. CGS is considered to be one of the best choices for HMM inference as compared empirically to EM and VB (Gao and Johnson 2008).
The recent Bayesian methods have also been applied to various probabilistic models. For instance, Johnson et al. (2007) applied CGS to PCFGs in the Chomsky Normal Form. Liang et al. (2007) applied VB inference for infinite PCFGs, where an infinite number of nonterminal symbols and rules can be modeled by assuming that their priors are represented by a Hierarchical Dirichlet Process (HDP) (Teh et al. 2006a). Pfau et al. (2010) applied the Metropolis-Hastings algorithm (Bishop 2006) for Probabilistic Deterministic Infinite Automata (PDIAs), where graph structures of PDFAs are generated from a variant of HDP (Teh 2006). Their algorithm can be thought of as a method that samples PDFAs by iterating merging and splitting states randomly in the Bayesian manner.
For a probabilistic topic model called Latent Dirichlet Allocation (LDA), which is used in natural language processing, Teh et al. (2006b) proposed a method called Collapsed Variational Bayes (CVB). They used a VB approximation after integrating out transition probabilities, and showed that their method yielded a more accurate result than the standard VB method. CVB has variables, each of which represents a probability that the automaton is in a particular state at a certain time. These variables are assumed to be independent of each other. We update these variables so as to minimize the KL-divergence between the approximated and the true marginal probability. However, since it is still difficult to update variables so as to minimize the KL-divergence exactly, a further approximation is applied to each update. While Teh et al. (2006b) used the second-order Taylor approximation when updating the independent approximation of the probability of each hidden variable, Asuncion et al. (2009) found that the zeroth-order Taylor approximation (called CVB0) is empirically sufficient to achieve good accuracy for LDA.
In this paper, our targets for learning are the class of probabilistic finite automata (PFAs) and their special cases. We call an inference method collapsed if the transition probabilities of the PFAs are integrated out before some approximations or sampling methods are applied. This paper introduces and describes several collapsed inference methods for PFAs and evaluates them. Moreover, we compare collapsed methods that target other subclasses of PFAs, such as HMMs, PDFAs, and variable-length grams (VGrams). We say that a PFA is fully connected if one can move from every state to every state using any symbol. In Sects. 2.1–2.4, we discuss how existing techniques of CGS and CVB0 can be applied to the inference of PFAs. We describe and compare the computational cost for CGS, CVB0, and CVB2.
In Sect. 2.5 and Sect. 3, we propose two different approaches, which are modifications of CVB0 and CGS. In Sect. 2.5, we propose a variant of CVB0, which we call GCVB0, for which a convergence property is guaranteed. The standard CVB0 does not have this nice property, since it uses Taylor approximations when updating variables. We modify CVB0 to have the convergence property by defining a global function that approximates the KL-divergence. Variables are updated using existing techniques, such as quasi-Newton methods. In Sect. 3, we introduce a simple generative model for PFAs that are not fully connected, for which a CGS algorithm is presented. In addition to the sequence of hidden states, graph structures of PFAs are also sampled.
Abe and Warmuth (1992) showed that PFAs are KL-PAC learnable from samples of size polynomial in the number of the states and letters and the sample size using maximum-likelihood estimation. They also showed that the actual computational cost must be prohibitively expensive unless RP=NP. Kearns et al. (1994) showed that learning PDFAs even over 2-letter alphabets is as hard as a problem for which no polynomial algorithm is known. On the other hand, Clark and Thollard (2004) proposed an algorithm that PAC learns PDFAs that satisfy μ-distinguishability in polynomial time from a polynomial amount of data. Some elaborations of their algorithm have also been proposed (Castro and Gavaldà 2008; Balle et al. 2013). However, no solid work has been done on the computational cost of techniques generically called MCMC, including Gibbs sampling, which infer the correct posterior distribution in the limit.
Our experimental results are presented in Sect. 4. We compare the inference methods described in the preceding sections as well as other collapsed Bayesian methods for special kinds of PFAs, including HMMs, PDFAs, and VGrams. Experimental results for PAutomaC data sets^{1} showed that CGS performed better than other methods in terms of accuracy. Although GCVB0 is guaranteed to converge to some local optimal point, and thus it is clear at which point its iterations should be stopped, GCVB0 yielded results worse than those of CVB0 and CGS.
PAutomaC data sets were generated by different types of PFAs, including HMMs. CGSHMM is a modification of our CGS algorithm for PFAs such that it targets HMMs. From the comparison of CGSPFA and CGSHMM, it appears that CGSPFA yields better scores than does CGSHMM, since CGSHMM often fails to find appropriate emission probabilities η and state transition probabilities θ that can factorize the transition probability ξ. We also compared CGSPFA with other collapsed methods for other models, which are actually special cases of PFAs. However, CGSPFA yields better scores than any of these methods. Therefore, we conclude that CGSPFA is empirically the best choice among the collapsed methods described in this paper.
A drawback of CGS is its rather high computational cost. In Sect. 4.6, we measure empirically the relation between the computational cost and accuracy of CGS and other classic methods, including a state-merging method based on marginal probability. The computational costs of CGS and the state-merging method have a gap of one to three orders of magnitude. The actual computational cost of CGS depends on the number of iterations where variables are resampled. The sampling process should be repeated until the sampled distribution converges. Our implementation set the iteration number to 20,000 for every problem, which seems unnecessarily large for many problems. However, this number is in fact not too large; we observed that 200 iterations, for example, are too few to make the empirical distribution converge.
2 Collapsed Bayesian approaches for fully connected PFA
2.1 Probabilistic model for fully connected NFAs
A probabilistic finite automaton (PFA) is a nondeterministic finite automaton G in which transition probabilities ξ are assigned. We call G the underlying automaton and the strings accepted by G sentences. A PFA assigns a probability to each sentence according to ξ. A PFA is seen as a machine that generates strings according to these transition probabilities.
Without loss of generality, in this section, we assume the underlying automaton G to be a fully connected NFA. In a fully connected NFA, every letter a may induce a transition from every state i to every state j, except when a=0 or j=0. That is, ξ _{ iaj } may have a nonzero value for every i, a(≠0) and j(≠0). Any PFA whose underlying automaton has sparse edges, including DFAs as a special case, can be represented or at least well approximated by a fully connected NFA by assigning 0 or a very small probability to some edges. We will discuss another approach, which infers sparse PFAs, in Sect. 3. We fix the number of states of G to be N+1 and denote the states by natural numbers 0,1,…,N. Let A be the size of the letter alphabet including 0. In the sequel, we suppress G.
2.2 Feasible and unfeasible marginalization
It is known to be unfeasible to calculate Open image in new window based on the definition of Open image in new window and Pr(ξ) given as Eqs. (1) and (2). The following table shows which combination of parameters Open image in new window , Open image in new window , and ξ makes it feasible or unfeasible to compute the joint and/or conditional probability.
Open image in new window and Open image in new window can be calculated in a dynamic programming manner since Eq. (1) has the form ∏_{ t } f _{ t }(z _{ t },a _{ t },z _{ t+1}). Open image in new window is calculated as
Definition 1 (delta function)
Definition 2 (counting functions)
It is often the case that inferring a PFA from Open image in new window is a means of obtaining a probability prediction of future sentences, although inferring a specific PFA is not the only way to fulfill the latter purpose. Since we have fixed the underlying machine to be a fully connected NFA with N states, the inference of a PFA is reduced to that of ξ. According to a Bayesian approach, which this study uses, this amounts to estimating Open image in new window . The probability of a future sentence Open image in new window is represented by Open image in new window , on the other hand. Thus, the computation of Open image in new window and Open image in new window is our central concern, which is, however, unfeasible. The difficulty of computing Open image in new window and Open image in new window can be reduced to the infeasibility of the calculation of Open image in new window . Computing Open image in new window in general is obviously harder than computing Open image in new window , where ϵ denotes the empty sequence. One can compute Open image in new window using Open image in new window by Open image in new window , where Open image in new window can easily be obtained by dynamic programming.
Therefore, we necessarily have to use some approximation to achieve the above two purposes. In the following sections, we use two approximations obtained by random algorithms, which are known as collapsed Gibbs sampling (CGS) and collapsed variational Bayesian method (CVB). We also give a simple variant method of CVB that converges to a local optimal point, whereas, in general, CVB has no guarantee of convergence to a local optimal point.
2.3 Collapsed Gibbs sampling for the inference of PFAs—CGSPFA

For t=1,…,T in this order, we sequentially sample \(x_{t}^{(k)}\) according to the probability \(\Pr(x_{t} \mid x_{1}^{(k)},\dots,x_{t-1}^{(k)},x_{t+1}^{(k-1)},\dots,x_{T}^{(k-1)})\).

We have obtained a sample Open image in new window .
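The sequential resampling step above can be illustrated with a minimal, generic sketch of one systematic-scan Gibbs sweep over discrete variables. It is not the CGSPFA sampler itself: the function names `gibbs_sweep` and `toy_joint`, the toy unnormalized joint weight, and the variable domain are all assumptions made for illustration. Each variable is resampled in order, conditioned on the already-updated values to its left and the previous-sweep values to its right.

```python
import random

def gibbs_sweep(x, num_values, joint, rng):
    """One sweep: resample x[0], x[1], ... in order from their full conditionals.

    `joint` returns an unnormalized probability of a full assignment; the
    conditional of x[t] is obtained by evaluating it for every candidate value.
    """
    x = list(x)  # copy, so the previous sample is left untouched
    for t in range(len(x)):
        weights = []
        for v in range(num_values):
            x[t] = v
            weights.append(joint(x))
        # Draw the new value of x[t] proportionally to the conditional weights.
        x[t] = rng.choices(range(num_values), weights=weights)[0]
    return x

def toy_joint(x):
    """Hypothetical toy model: neighbouring variables prefer to be equal."""
    w = 1.0
    for left, right in zip(x, x[1:]):
        w *= 3.0 if left == right else 1.0
    return w

rng = random.Random(0)
start = [0, 1, 2, 0, 1]
sample = start
for _ in range(100):  # repeated sweeps yield the sequence of samples
    sample = gibbs_sweep(sample, 3, toy_joint, rng)
```

Evaluating the full joint for every candidate value is wasteful; a collapsed sampler would instead use local count updates, which is why the per-variable cost can be kept low.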
 If (i,a,j)=(k,a _{ t },z _{ t+1})≠(z _{ t−1},a _{ t−1},k), $$\mathit{C}^{z_t=k}_{iaj} = \mathit{C}^{\neg t}_{iaj} + \delta\begin{pmatrix} k, a_t, z_{t+1}\\ i, a, j \end{pmatrix} + \delta\begin{pmatrix} z_{t-1}, a_{t-1}, k\\ i, a, j \end{pmatrix} = \mathit{C}^{\neg t}_{iaj} + 1 = \mathit{C}^{\neg t}_{k a_t z_{t+1}} + 1. $$
 If (i,a,j)=(z _{ t−1},a _{ t−1},k)≠(k,a _{ t },z _{ t+1}), $$\mathit{C}^{z_t=k}_{iaj} = \mathit{C}^{\neg t}_{iaj} + \delta\begin{pmatrix} k, a_t, z_{t+1}\\ i, a, j \end{pmatrix} + \delta\begin{pmatrix} z_{t-1}, a_{t-1}, k\\ i, a, j \end{pmatrix} = \mathit{C}^{\neg t}_{iaj} + 1 = \mathit{C}^{\neg t}_{z_{t-1} a_{t-1} k} + 1. $$
 If (i,a,j)=(z _{ t−1},a _{ t−1},k)=(k,a _{ t },z _{ t+1}), $$\mathit{C}^{z_t=k}_{iaj} = \mathit{C}^{\neg t}_{iaj} + \delta\begin{pmatrix} k, a_t, z_{t+1}\\ i, a, j \end{pmatrix} + \delta\begin{pmatrix} z_{t-1}, a_{t-1}, k\\ i, a, j \end{pmatrix} = \mathit{C}^{\neg t}_{iaj} + 2 = \mathit{C}^{\neg t}_{z_{t-1} a_{t-1} k} + 2. $$
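The three cases above can be checked with a small sketch. It assumes a single sequence stored as Python lists with `z[s]` followed by letter `a[s]` leading to `z[s + 1]`, an interior position `t`, and a delta function that returns 1 when its two triples coincide and 0 otherwise. The helper names `delta`, `counts_without_t`, and `count_if_state` are hypothetical and not taken from the paper.

```python
from collections import Counter

def delta(u, v):
    """Assumed delta function: 1 if the two triples coincide, else 0."""
    return 1 if u == v else 0

def counts_without_t(z, a, t):
    """C^{not t}: transition counts with the two transitions touching z[t] removed."""
    c = Counter()
    for s in range(len(a)):
        if s in (t - 1, t):  # incoming and outgoing transitions of z[t]
            continue
        c[(z[s], a[s], z[s + 1])] += 1
    return c

def count_if_state(c_not_t, z, a, t, k, triple):
    """C^{z_t = k}_{iaj}: count of `triple` if z[t] were set to state k."""
    out_edge = (k, a[t], z[t + 1])
    in_edge = (z[t - 1], a[t - 1], k)
    return c_not_t[triple] + delta(out_edge, triple) + delta(in_edge, triple)

# Toy sequence: states z and letters a (hypothetical values).
z = [1, 1, 1, 2]
a = [5, 5, 7]
c = counts_without_t(z, a, 1)
```

With `k = 1` the incoming and outgoing triples both equal `(1, 5, 1)`, so the count grows by 2 (third case); with `k = 2` they differ, so each grows by 1 (first and second cases).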
2.4 Collapsed variational Bayes approximations
Teh et al. (2006b) proposed an approximation method called Collapsed Variational Bayes (CVB) for inferring a probabilistic topic model called Latent Dirichlet Allocation (LDA). In this subsection, we explain how their approach can be applied to the inference of PFAs. In a similar way to that described in Sect. 2.3, we first marginalize ξ out, so that the standard technique of Variational Bayes is applicable to computing Open image in new window . Thus, this approach is called collapsed.
The remaining task is to calculate the three terms on the right hand side of Eq. (14) for PFAs. The existing methods cannot be applied to this task, and therefore, we introduce our own approach.
2.4.1 Treatments of expectations over q
Definition 3
That is, if (x _{1},y _{1}) and (x _{2},y _{2}) have a common element, which is z _{ k } for some k, we have (x _{1},y _{1})∼(x _{2},y _{2}). The relation ∼ is obviously symmetric.

Z(R)={x _{1},…,x _{ n },y _{1},…,y _{ n }}∩Z,

H(R)={x _{1},…,x _{ n },y _{1},…,y _{ n }}∩H.
Lemma 1
Lemma 1 can easily be generalized to the case where R contains pairs of letters. In that case, respective pairs of letters (a,b) form a singleton equivalence class R _{ a,b }, where Open image in new window .
2.5 An approximation for the objective function D(q)
In this section, a new approximation approach for minimizing D(q) is discussed. The approach presented in the previous section is based on the updating formula (Eq. (12)), which should lead q to a local minimum convergence point, provided that it is calculated precisely. However, instead, we have used an approximation formula where we have no guarantee of monotonicity or convergence. Moreover, the intractability of the calculation of D(q), even from the approximated q, prevents us from determining a point where we should stop iterating.
Instead, this subsection proposes an approximation D _{0}(q) of D(q) as an objective function, to which we apply the CVB0 technique. Unlike for the approximation presented in the previous section, it is ensured that the values of q will converge to a local optimal point, and thus one can easily decide when the updating of the values of q should be terminated.
3 Using graph structures for inferring sparse PFAs
This type of distribution of G is different from distributions where each edge is independently added. Whereas, in the latter case, ν has a binomial distribution and is centralized at some point, in the former case, the distribution of ν is free to be given as a prior Pr(ν).
We can locate the update of G (Line 1–7) inside the loop of updating z _{ t } for each t (after Line 12) in order to sample G more frequently. Since the computational cost of updating G is O(N ^{2} A), while that of updating z _{ t } for each t is O(N), the interval of updating G should be more than O(NA).
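The bound on the update interval can be spelled out as a short amortization argument. The constants \(c_G\) and \(c_z\) and the interval \(m\) (number of updates of z _{ t } between two updates of G) are symbols introduced here for illustration only.

```latex
\underbrace{\frac{c_G N^{2} A}{m}}_{\text{amortized cost of updating } G \text{ per update of } z_t}
\;\le\;
\underbrace{c_z N}_{\text{cost of one update of } z_t}
\quad\Longleftrightarrow\quad
m \;\ge\; \frac{c_G}{c_z}\, N A .
```

Under these assumptions, choosing an interval of at least order NA keeps the resampling of G from dominating the total running time.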
We finally predict the generating probability of Open image in new window by Eqs. (5) and (6), similarly to CGSPFA of Sect. 2.3.
4 Experiments
Each experiment in this section was run using computation nodes in a grid environment called InTrigger.^{5} Each node contained 2 CPUs (Xeon E5410/E5330, 2.33/2.3 GHz and 4 cores+HT) with a memory of 32/24 GB. Each execution was done in a single thread, and therefore, essentially, we did not use parallel computation.
4.1 Experimental details of CGS
For the respective problems in PAutomaC I, one iteration took approximately 0.2 to 2.0 s, and thus, 400 to 40,000 s for 20,000 iterations. To determine the values of N and β among nine and six candidates by 10-fold cross-validation, respectively, one must run CGS 150 times in total for every problem. Using these determined values, we ran CGS 10 times to obtain the final answer.
4.2 Comparison of CVB0, GCVB0, and CGS for PFAs
In this section, we compare CVB0 and GCVB0 for the PFAs described in Sects. 2.4 and 2.5. The numerical optimizer that we used for GCVB0 was a limited-memory quasi-Newton method called L-BFGS-B (Zhu et al. 1997).^{7} L-BFGS-B is able to take lower and upper bounds for each variable and minimizes an objective function within the given bounds.
In order to handle the constraints that ∑_{ k } q _{ t }(k)=1 for all t, we use variables x _{ t }(k) such that q _{ t }(k)=x _{ t }(k)^{2}/∑_{ i } x _{ t }(i)^{2}, instead of q _{ t }(k) themselves. The objective function of GCVB0 is also modified as D _{0}(q)+∑_{ t }(1−∑_{ k } q _{ t }(k))^{2}. By this transformation, each x _{ t }(k) has only one constraint, x _{ t }(k)≥0.^{8} The computational cost for calculating D _{0}(q) and its derivatives is O(TN ^{2}), which is the same as for CVB0. The factor of L-BFGS-B for GCVB0 for convergence^{9} is set to 10^{7}.
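The change of variables described above can be sketched in a few lines. This is only an illustration of the mapping from x _{ t }(k) to q _{ t }(k) and of the added penalty term; the function names `to_simplex` and `normalisation_penalty` and the toy values are assumptions, and the objective D _{0}(q) itself is not reproduced here.

```python
def to_simplex(x_t):
    """Map non-negative x_t(k) to q_t(k) = x_t(k)^2 / sum_i x_t(i)^2."""
    squares = [v * v for v in x_t]
    total = sum(squares)  # positive as long as some x_t(k) is nonzero
    return [s / total for s in squares]

def normalisation_penalty(q):
    """Sum over t of (1 - sum_k q_t(k))^2, the term added to the objective."""
    return sum((1.0 - sum(q_t)) ** 2 for q_t in q)

# Toy values of x_t(k) for two time steps and three states (hypothetical).
x = [[1.0, 2.0, 2.0], [0.5, 0.5, 1e-7]]
q = [to_simplex(x_t) for x_t in x]
```

Because each q _{ t } produced this way already sums to one, only the bound constraint on each x _{ t }(k) remains for a box-constrained optimizer to enforce.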
Figure 8(b) shows the scores obtained by estimated transition probabilities ξ according to Eq. (10). The final score of CVB0 (32.609 after 2,000 iterations) was better than that of GCVB0 (32.651). Moreover, the score of CVB0 improved more quickly than that of GCVB0. Both CVB0 and GCVB0 yielded worse scores than CGS (32.569) for the above data and settings of N and β.
Figure 8(c) shows the number of edges of the obtained PFA. That is, the number of triples (i,a,j) such that E_{ q }[C _{ i,a,j }]>1, which indicates the density of the network. As the value of GCVB0 (117) is much smaller than that of CVB0 (1356), GCVB0 tends to give a more compact PFA than does CVB0. We represent the value of Open image in new window by thinner lines in Fig. 8(a), where Open image in new window shows the entropy. The thick and thin red lines (GCVB0) come quite close to each other, which means that the approximation of D(q) with D _{0}(q) tends to bias q _{ t }(k) toward 0 or 1 as compared to the approximation technique of CVB0.
4.3 Effects of sampling underlying NFAs for sparse PFAs
An experimental result for CGSSG, which was introduced in Sect. 3, is shown in this section. In the experiment, we assumed that Pr(ν) was uniform, and G was periodically resampled in the loop of updating z _{ t } in Algorithm 2. We implemented and executed CGSPFA and CGSSG for Prob. II26, which was generated by a sparse PDFA. The number of states was set to 90 and the number of iterations to 40,000, where 201 points were taken as samples in the latter half of all iterations.
Table 1 Comparison of CGS for fully connected PFAs and Algorithm 2 for a sparse PDFA (PAutomaC II26)
Method  CGSPFA  CGSPFA  CGSPFA  CGSPFA  CGSSG  CGSSG  CGSSG  CGSSG
β  0.5  0.1  0.05  0.01  0.5  0.1  0.05  0.01
score − min. score  14.442  0.942  0.108  0.092  1.800  0.251  0.258  0.391
# of valid states i  91  91  88  82  91  91  91  91
# of valid pairs (i,a)  628  355  334  302  440  392  370  322
# of valid edges (i,a,j)  4775  728  511  346  1087  766  708  525
# of possible edges in G  (49231)  (49231)  (49231)  (49231)  1242  1535  2229  5733
inferred determinacy  7.60  2.05  1.53  1.15  2.47  1.95  1.91  1.63
We call a state i valid if C _{ i }>0, and similarly a pair (i,a) and an edge (i,a,j) are said to be valid if C _{ ia }>0 and C _{ iaj }>0, respectively. It should be noted that the minimal NFA defined in Sect. 3 has all and only valid edges. CGSPFA with β=0.01 had the smallest number, 346, of valid edges after the last iteration. The number 346 is reasonably close to the true number, 299, of the edges of the generating PDFA^{11} of Prob. II26. We define the inferred determinacy to be the ratio of the number of valid edges to the number of valid pairs, that is, the average number of valid destination states per valid pair (i,a). The value should be 1 if the generating PDFA is correctly inferred. The inferred determinacy of CGSPFA with β=0.01 is 346/302≈1.15, which suggests that CGSPFA with sufficiently small β can be effective for inferring sparse PDFAs.
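As a rough illustration of these quantities, the following sketch counts valid pairs and valid edges from a table of edge counts and divides the latter by the former, which is the direction that reproduces the determinacy values in the table (for example 346/302≈1.15). The function name `determinacy_stats` and the toy counts are hypothetical, and the sketch assumes that C _{ ia } is the sum of C _{ iaj } over j, so that a pair is valid exactly when it has at least one valid edge.

```python
from collections import Counter

def determinacy_stats(edge_counts):
    """Return (#valid pairs, #valid edges, inferred determinacy)."""
    valid_edges = {e for e, c in edge_counts.items() if c > 0}
    # Assumption: C_ia = sum_j C_iaj, so (i, a) is valid iff some edge (i, a, j) is.
    valid_pairs = {(i, a) for (i, a, j) in valid_edges}
    return len(valid_pairs), len(valid_edges), len(valid_edges) / len(valid_pairs)

# Toy counts C_iaj keyed by (i, a, j); the last edge is never used.
toy_counts = Counter({(1, 1, 2): 3, (1, 1, 3): 1, (2, 1, 1): 4, (3, 2, 1): 0})
pairs, edges, det = determinacy_stats(toy_counts)
```

In the toy example the pair (1, 1) leads to two different states, so the determinacy is 3/2; a deterministic automaton would give exactly 1.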
CGSPFA outperformed the other methods when the hyperparameter β was set correctly; nevertheless, with larger values for β, the performance of CGSPFA decreased quite sharply in terms of both the score and determinacy. On the other hand, CGSSG worked stably with different values of β.
4.4 Comparison of CGS for PFAs and CGS for HMMs
HMMs are a special type of PFAs. In HMMs, ξ is factorized as ξ _{ iaj }=η _{ ia } θ _{ ij }, where η _{ ia } is the emission probability that letter a is emitted from state i, and θ _{ ij } represents the state transition probability from state i to state j. It is known that every PFA has an equivalent HMM, but in general the transformation from a PFA to an equivalent HMM squares the number of states.
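The factorization can be made concrete with a small sketch that builds ξ from given η and θ. The function name `hmm_to_pfa`, the nested-list layout `xi[i][a][j]`, and the two-state, two-letter toy values are assumptions for illustration; the special role of the index 0 in Sect. 2.1 is ignored here.

```python
def hmm_to_pfa(eta, theta):
    """Build xi[i][a][j] = eta[i][a] * theta[i][j] from HMM parameters."""
    return [
        [[e * th for th in theta_i] for e in eta_i]
        for eta_i, theta_i in zip(eta, theta)
    ]

# Toy emission probabilities eta[i][a] and state transition probabilities theta[i][j].
eta = [[0.9, 0.1], [0.2, 0.8]]
theta = [[0.7, 0.3], [0.4, 0.6]]
xi = hmm_to_pfa(eta, theta)

# For each state i, the probabilities over all (a, j) should sum to one.
row_sums = [sum(sum(row) for row in xi_i) for xi_i in xi]
```

The converse direction is the hard one: a general ξ need not factorize into such a product.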
We ran CGSPFA and CGSHMM together with CGSHMM(*) for PAutomaC I data sets, which are classified into three types of data according to the generative model, namely, PFAs, PDFAs, and HMMs. Both hyperparameters α and β in CGSPFA and CGSHMM were always set to 0.1. The number of states was searched among {10,15,20,30,40,50,70,90}. For each problem, both CGSPFA and CGSHMM were run 10 times, where each execution consisted of 10,000 iterations.
Table 2 Differences of scores among CGSHMM, CGSHMM(*) and CGSPFA for the subset of PAutomaC I data sets generated by HMMs
No.  HMM−HMM(∗)  min{HMM,HMM(∗)}−PFA  N (HMM)  N (PFA)
2  0.6062386  0.0484486  90  40
3  10.6874226  0.2430831  70  15
4  3.6180065  0.3853828  90  15
5  2.0407411  0.1371957  90  50
22  0.2159087  0.0474664  90  30
23  4.0745021  1.0769991  90  15
24  0.0216522  0.0243302  90  50
25  0.4575560  0.3295595  90  70
26  0.1223562  0.1214686  90  50
37  −0.0039283  −0.0006234  50  50
38  −0.0031304  0.0040898  30  40
39  −0.0013565  −0.0080017  50  50
40  −0.0000129  −0.0002220  30  70
41  0.0221770  0.0005653  70  90
The numbers of states that give the best results for CGSHMM are shown in the fourth column of Table 2. As the figure shows, CGSHMM tends to have a small number of states when it gives a good result. Although the number of states may not be sufficient for CGSHMM, since HMMs with N ^{2} states can represent any PFA with N states, it is preferable to choose CGSPFA for the following reasons: the computational cost of CGSPFA is lower than that of CGSHMM with a larger number of states, and CGSHMM has the difficulty discussed in the previous paragraph.
4.5 Other collapsed methods for subclasses of PFAs
This section compares CGSPFA with two other methods: (1) a state-merging algorithm for PDFAs; and (2) an algorithm based on variable-length grams (VGram). As described below, they greedily maximize a probability that is obtained by collapsing transition probabilities.
4.5.1 Evidence-based state-merging algorithm for PDFAs (EStateMerge)
From the viewpoint of computational cost, one of the advantages of EStateMerge, as well as of other state-merging algorithms, is that both C _{ i } and C _{ i,a } are changed only when the state i is merged with other states. Thus, it is enough to recalculate these local parts to update Open image in new window at each merging step. Since states are aggressively merged whenever Open image in new window increases, EStateMerge does not have the PAC learnability property for PDFAs that is shown (Clark and Thollard 2004) for state-merging frequency-based algorithms, such as ALERGIA.
4.5.2 Evidence-based VGram (EVGram)
4.5.3 Comparison of evidence-based VGram and cross-validation-based VGram
We also implemented a variant of Niesler and Woodland’s (1999) VGram algorithm, the criterion of which is based on cross-validation. We call it CVGram and compared it with EVGram.
We used 10-fold CVGram, since it empirically gives a better result than the leave-one-out CVGram, which Niesler and Woodland (1999) used. Generally, the leave-one-out CV decreases the bias of estimated predictive probabilities, although the 10-fold CV is considered sufficient in many cases, as discussed in Kohavi (1995). If the number of blocks of CV is too small, it underestimates the accuracy of a learning method. On the other hand, CV is known to cause overfitting for model selection due to variance in the estimated predictive probabilities (Cawley and Talbot 2010; Bengio and Grandvalet 2004). Thus, there is a trade-off between underestimating bias and overfitting due to variance, and such a trade-off is one of the reasons why the results of the 10-fold CVGram are better than those of the leave-one-out CVGram.
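To make the difference between the two schemes concrete, the following sketch partitions sample indices into k blocks and yields the training and held-out indices for each block; leave-one-out is the special case in which k equals the number of samples. The function name `k_fold_splits`, the round-robin assignment of samples to blocks, and the sample count 25 are assumptions for illustration and say nothing about how CVGram itself forms its blocks.

```python
def k_fold_splits(n_items, k):
    """Yield (train_indices, held_out_indices) for each of k blocks."""
    # Round-robin assignment: item i goes to block i mod k.
    folds = [list(range(f, n_items, k)) for f in range(k)]
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(n_items) if i not in held]
        yield train, held_out

ten_fold = list(k_fold_splits(25, 10))       # 10 blocks of 2 or 3 items
leave_one_out = list(k_fold_splits(25, 25))  # 25 blocks of exactly 1 item
```

With 10 blocks the model is refitted 10 times on roughly 90% of the data, whereas leave-one-out refits once per sample on all but one item.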
4.5.4 Comparison of CGSPFA, EStateMerge, and EVGram
Figure 11(a) shows the scores achieved by the above methods for all the problems of PAutomaC I. For normalization, they are divided by the minimum scores, which are given by substituting the true probabilities of the test set in Eq. (19). As a result, CGSPFA performed the best. The average ratio of its scores to the theoretical minimum scores exceeded 1 by 0.00129. For CVGram, EVGram, and EStateMerge, the corresponding excesses were 0.00642, 0.00992, and 0.0185, respectively. Figure 11(b) summarizes the scores of these methods for three types of generating models, PDFAs, PFAs, and HMMs. For any type of generating model, CGSPFA yields higher accuracy than the other methods. While the scores of EVGram and CVGram do not show significant differences among different target generating models, CGSPFA and EStateMerge obtained significantly better scores on the problems generated by PDFAs than on those generated by other models.
4.6 Relation between computational costs and scores for different methods
Comparison of computational time of different methods
4.6.1 How many iterations should suffice in CGS?
We further investigated the relation between the number of iterations and the score on Prob. II5 and 43 by running CGS 10 times with different initial values. Figure 12(b) and (c) shows the results for Prob. II5 and 43, respectively. The shapes of the score curves largely depend on the initial values, particularly for Prob. II5, and, unlike in Fig. 12(a), few curves seem to have converged before the 10,000th iteration for this problem. On the other hand, in Fig. 12(c), all the curves gather and are tangled. For Prob. II43, the choice of the initial value does not seem to be very important.
We conclude that the number of iterations that suffices for convergence depends on the initial value and the problem. At least for the problems of PAutomaC, the number L=20,000 seems sufficient, in general, although more iterations might improve the scores for some limited number of problems.
4.6.2 Comparison of computational cost of different methods
The scores of CGS in Table 3 for Prob. II5 are those of the trial corresponding to the green line in Fig. 12(b).
Among these methods, StateMerge often achieves scores significantly worse than do the other methods, while it often succeeded in finding a concise automaton as compared to C/EVGram. The number of the states of a VGram tree constructed by EVGram sometimes becomes even bigger than the sample size.
5 Conclusions and future work
In this paper, we compared various collapsed Bayesian methods for PFAs and their variants, including HMMs, PDFAs, and VGrams. For fully connected PFAs, we discussed how existing techniques of Collapsed Gibbs Sampling (CGS) and Collapsed Variational Bayes with the 0th-order Taylor approximation (CVB0) can be applied, and in addition, we proposed a new method called GCVB0 for which the convergence is ensured. While CVB2, the CVB with the second-order approximation, may appear to yield a more accurate probability prediction, its computational cost per iteration is evaluated as O(N ^{3} T), which is not sufficiently efficient unless the size of target automata is restricted to be very small. In contrast, the costs for CGS, CVB0, and GCVB0 are only O(NT), O(N ^{2} T), and O(N ^{2} T), respectively. Hence, these methods can be applied to bigger PFAs. According to the experimental results for PAutomaC data sets, CGS performed better than CVB0 and GCVB0. Although GCVB0 is guaranteed to converge to some local optimal point and thus it is clear when its iterations should be stopped, the results of GCVB0 were worse than those of CVB0 and CGS. For sparse PFAs, by using a simple generative model, an algorithm for sampling graph structures of PFAs is proposed.
We also empirically compared algorithms that targeted different types of PFAs using PAutomaC data sets generated by different types of PFAs. In the comparison of CGSPFA and CGSHMM, it appeared that CGSPFA achieves better scores than CGSHMM, since CGSHMM often failed to find appropriate emission probabilities η and state transition probabilities θ that can factorize the transition probability ξ. CGSPFA gives better scores than EStateMerge, CVGram, and EVGram for every generating model, whereas EStateMerge and EVGram run much faster than CGSPFA, since they change the graph structures in order to maximize marginal probabilities greedily. Graph structures for PDFAs and VGram should have been sampled in the Bayesian manner. Our conclusion is that, empirically, CGSPFA is the best choice among the collapsed methods described in this paper.
Many methods for inferring PFAs still remain to be compared with the methods we described in this paper. For instance, although we fixed the numbers of states based on cross-validation in this study, the numbers can be sampled simultaneously with values of Open image in new window in nonparametric methods. The comparison of our methods with nonparametric methods, such as HDP (Teh et al. 2006a), on PAutomaC data constitutes future work. In our experiments, EStateMerge did not perform better than CGS in terms of accuracy, even on sample sets generated by PDFAs. There is no guarantee that EStateMerge will PAC-learn PDFAs, since it merges states greedily according to marginal probabilities. Since other state-merging techniques for which PAC learnability is proven might yield more accurate results, we should compare them with the methods examined in this paper using data sets generated from PDFAs.
Footnotes
 1.
 2.
It would be possible to collect values \(\widetilde {\xi }^{(s,1)},\dots,\widetilde {\xi }^{(s,S_{s})}\) again by Gibbs sampling for each Open image in new window instead of taking the expectation \(\widetilde {\xi }^{(s)}\); however, this alternative is too computationally expensive.
 3.
Whereas Teh et al.’s original work uses the second-order approximation (let us call it CVB2), we use CVB0, since the computational cost of CVB2 for PFA is N times more than that of CVB0 (see Appendix), and it has been reported that CVB0 often outperforms CVB2 with respect to accuracy for Latent Dirichlet Allocation (LDA) (Asuncion et al. 2009).
 4.
 5.
 6.
These independent trials were executed in parallel using a parallel processing system, GXP Make (http://www.logos.ic.i.u-tokyo.ac.jp/gxp/).
 7.
 8.
We used x _{ t }(k)>10^{−7} as the lower bound for L-BFGS-B in practice.
 9.
The objective function f is considered converged when Open image in new window , where EPSMCH=2.220⋅10^{−16} in our environment.
 10.
We define a single iteration for GCVB0 as one computation of \(D_{0}(q)\) and its derivative at a single point.
 11.
The number is obtained after transforming the PDFA to satisfy our postulation on the forms of PFAs described in Sect. 2.1.
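The lower bound and the stopping rule mentioned in the footnotes on L-BFGS-B can be illustrated with a minimal sketch. It assumes that the stopping rule is the standard L-BFGS-B relative-decrease test, \((f_{k} - f_{k+1}) / \max\{|f_{k}|, |f_{k+1}|, 1\} \le \mathrm{factr} \cdot \mathrm{EPSMCH}\); the value of `factr` and the names `has_converged` and `project` are hypothetical and do not come from the paper.

```python
import sys

# Machine epsilon for IEEE double precision (about 2.220e-16),
# which agrees with the EPSMCH value quoted in the footnote.
EPSMCH = sys.float_info.epsilon

# Lower bound on every variable, taken from the footnote on L-BFGS-B.
# The sketch treats the bound as inclusive.
LOWER_BOUND = 1e-7


def has_converged(f_prev, f_next, factr=1e7):
    """Relative-decrease test in the style of L-BFGS-B.

    factr is an assumed tolerance factor, not a value from the paper.
    """
    scale = max(abs(f_prev), abs(f_next), 1.0)
    return (f_prev - f_next) / scale <= factr * EPSMCH


def project(x):
    """Clip each coordinate to the lower bound (there is no upper bound)."""
    return [max(v, LOWER_BOUND) for v in x]
```

With these assumptions, `has_converged(1.0, 1.0 - 1e-12)` holds because the relative decrease is far below `factr * EPSMCH`, while a decrease from 1.0 to 0.5 does not satisfy the test.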
Notes
Acknowledgements
We are grateful to the committee of PAutomaC for offering various useful data sets and detailed information on them. We deeply appreciate the valuable comments and suggestions of the anonymous reviewers, which improved the quality of this paper.
References
 Abe, N., & Warmuth, M. K. (1992). On the computational complexity of approximating distributions by probabilistic automata. Machine Learning, 9, 205–260.
 Asuncion, A. U., Welling, M., Smyth, P., & Teh, Y. W. (2009). On smoothing and inference for topic models. In Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence (pp. 27–34).
 Balle, B., Castro, J., & Gavaldà, R. (2013). Learning probabilistic automata: a study in state distinguishability. Theoretical Computer Science, 473, 46–60.
 Beal, M. J. (2003). Variational algorithms for approximate Bayesian inference. PhD thesis, Gatsby Computational Neuroscience Unit, University College London.
 Bengio, Y., & Grandvalet, Y. (2004). No unbiased estimator of the variance of K-fold cross-validation. Journal of Machine Learning Research, 5, 1089–1105.
 Bishop, C. M. (2006). Pattern recognition and machine learning. Berlin: Springer. Chap. 11.
 Carrasco, R. C., & Oncina, J. (1994). Learning stochastic regular grammars by means of a state merging method. In Proceedings of the second international colloquium of grammatical inference (pp. 139–152).
 Castro, J., & Gavaldà, R. (2008). Towards feasible PAC-learning of probabilistic deterministic finite automata. In Proceedings of the 9th international colloquium on grammatical inference (pp. 163–174).
 Cawley, G. C., & Talbot, N. L. C. (2010). On overfitting in model selection and subsequent selection bias in performance evaluation. Journal of Machine Learning Research, 11, 2079–2107.
 Clark, A., & Thollard, F. (2004). PAC-learnability of probabilistic deterministic finite state automata. Journal of Machine Learning Research, 5, 473–497.
 Gao, J., & Johnson, M. (2008). A comparison of Bayesian estimators for unsupervised hidden Markov model POS taggers. In Proceedings of the 2008 conference on empirical methods in natural language (pp. 344–352).
 Goldwater, S., & Griffiths, T. (2007). A fully Bayesian approach to unsupervised part-of-speech tagging. In Proceedings of the 45th annual meeting of the association of computational linguistics (pp. 744–751).
 Hsu, D., Kakade, S. M., & Zhang, T. (2009). A spectral algorithm for learning hidden Markov models. In Proceedings of the 22nd conference on learning theory.
 Johnson, M., Griffiths, T. L., & Goldwater, S. (2007). Bayesian inference for PCFGs via Markov chain Monte Carlo. In Proceedings of human language technology conference of the North American chapter of the association of computational linguistics (pp. 139–146).
 Kearns, M. J., Mansour, Y., Ron, D., Rubinfeld, R., Schapire, R. E., & Sellie, L. (1994). On the learnability of discrete distributions. In Proceedings of the 26th annual ACM symposium on theory of computing (pp. 273–282).
 Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the fourteenth international joint conference on artificial intelligence (pp. 1137–1145).
 Liang, P., Petrov, S., Jordan, M. I., & Klein, D. (2007). The infinite PCFG using hierarchical Dirichlet processes. In Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning (pp. 688–697).
 Niesler, T., & Woodland, P. C. (1999). Variable-length category n-gram language models. Computer Speech & Language, 13(1), 99–124.
 Pfau, D., Bartlett, N., & Wood, F. (2010). Probabilistic deterministic infinite automata. In Advances in neural information processing systems 23 (NIPS) (pp. 1930–1938).
 Teh, Y. W. (2006). A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of the 44th annual meeting of the association of computational linguistics (pp. 985–992).
 Teh, Y. W., Jordan, M. I., Beal, M. J., & Blei, D. M. (2006a). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476), 1566–1581.
 Teh, Y. W., Newman, D., & Welling, M. (2006b). A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In Advances in neural information processing systems 19 (NIPS) (pp. 1353–1360).
 Thollard, F. (2001). Improving probabilistic grammatical inference core algorithms with postprocessing techniques. In Proceedings of the eighteenth international conference on machine learning (pp. 561–568).
 Thollard, F., Dupont, P., & de la Higuera, C. (2000). Probabilistic DFA inference using Kullback-Leibler divergence and minimality. In Proceedings of the seventeenth international conference on machine learning (pp. 975–982).
 Verwer, S., Eyraud, R., & de la Higuera, C. (2012). Results of the PAutomaC probabilistic automaton learning competition. In Proceedings of the 11th international conference on grammatical inference (Vol. 21, pp. 243–248).
 Zhu, C., Byrd, R. H., Lu, P., & Nocedal, J. (1997). Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on Mathematical Software, 23(4), 550–560.