Online Bayesian inference for the parameters of PRISM programs
DOI: 10.1007/s10994-012-5305-8
- Cite this article as:
- Cussens, J. Mach Learn (2012) 89: 279. doi:10.1007/s10994-012-5305-8
Abstract
This paper presents a method for approximating posterior distributions over the parameters of a given PRISM program. A sequential approach is taken where the distribution is updated one datapoint at a time. This makes it applicable to online learning situations where data arrives over time. The method is applicable whenever the prior is a mixture of products of Dirichlet distributions. In this case the true posterior will be a mixture of very many such products. An approximation is effected by merging products of Dirichlet distributions. An analysis of the quality of the approximation is presented. Due to the heavy computational burden of this approach, the method has been implemented in the Mercury logic programming language. Initial results using a hidden Markov model and a probabilistic graph are presented.
Keywords
Inductive logic programming, Bayesian statistics, Statistical relational learning, PRISM, Mixture models, Missing data
1 Introduction
In the Bayesian approach to ‘parameter estimation’ the goal is to return the joint posterior distribution over all parameters, rather than return the single ‘best estimate’ of the parameters. The motivation for attempting this complex task is that the posterior captures the combined information given by observed data and prior knowledge, and so provides a much fuller picture of the state of our knowledge about the parameters than can a point estimate.
Unfortunately, many posterior distributions are hard even to represent, let alone compute efficiently. This is generally the case for posterior distributions over the parameters of PRISM programs. PRISM programs define distributions over finite or countably infinite sample spaces using potentially complex generative processes. Generally the steps taken in the generative process are not discernible from any observed output (a hidden data situation), which leads to a posterior distribution with many local modes.
Fortunately, if the prior over PRISM parameters is a mixture of products of Dirichlet distributions, then at least the form of the posterior will be known: it will also be a mixture of products of Dirichlet distributions. However, the number of mixture components will usually be large. This paper presents an exact technique for finding all these mixture components for small scale problems and considers an approximate method for cases where the exact approach is infeasible.
The paper is organised as follows. The statistical aspects of the PRISM formalism are explained in Sect. 2. Section 3 derives an expression for the posterior distribution when the prior is a mixture of products of Dirichlet distributions. Section 4 describes the approximation method and presents results on the quality of the approximation. Initial results are presented in Sect. 5 followed by conclusions and pointers to further work in Sect. 6.
2 PRISM
In this section a description of PRISM will be given which includes an explanation of the basics of the Bayesian approach to parameter estimation in PRISM. All of this material has been previously presented elsewhere and is included here merely as a convenience for the reader. In particular, the following is closely based on work by Cussens (2005, 2007) and Sato et al. (2008).
2.1 Defining distributions in PRISM
Considering now the general case, a ground fact such as msw('X1',x) is actually an abbreviation for a fact msw('X1',j,x) which states that the random variable X_{1,j} is instantiated to have a value x, where j∈ℕ. For any j,j′∈ℕ, where j≠j′, X_{1,j} and X_{1,j′} must be independent and identically distributed (which motivates the abbreviation just mentioned and explains why the index j is not represented in actual PRISM programs). The common distribution of the X_{1,j} is an arbitrary discrete distribution defined by the parameter vector θ_{1}=(θ_{1,1},…,θ_{1,v},…,θ_{1,n(1)}) where θ_{1,v} is the probability that X_{1,j} takes value v. A family of iid random variables such as {X_{1,j}}_{j∈ℕ} is known as a switch.
Typically a PRISM program has more than one switch, each switch defining a different family of iid variables. hmm.psm has 5 switches. These are init, tr(s0), out(s0), tr(s1) and out(s1) defining: the initial state distribution, state transitions from state s0, emissions from state s0, state transitions from state s1 and emissions from state s1, respectively. This collection of discrete probability distributions θ_{i}, one for each switch i, makes up the parameter set θ=(θ_{1},…,θ_{i},…,θ_{n}) for a given PRISM program. Crucially, any two distinct switches are mutually independent so that ∀i,i′,j,j′ X_{i,j} is independent of X_{i′,j′} whenever i≠i′. Given any finite subset of the {X_{i,j}}_{i,j} a product distribution can be defined on its joint instantiations in the obvious way. As noted by Sato and Kameya (2001) it then follows that there is a probability distribution which assigns a probability to any (measurable) set of infinite joint instantiations of the {X_{i,j}}_{i,j}. This is known as the basic distribution and is consistent with all the finite product distributions.
Any instantiation of all the (infinitely many) X_{i,j}, together with the rest of the clauses of the PRISM program, defines a least Herbrand model: a possible world. The probability of any possible world is the probability of the set of infinite instantiations of the X_{i,j} which entail it. Thus the parameters of the PRISM program define a distribution over a set of possible worlds. This distribution determines a (marginal) probability for each ground atomic formula in the first-order language defined by the PRISM program; it is the sum of the probabilities of the worlds which satisfy the formula.
Usually a particular target predicate is distinguished to represent an ‘output’ distribution defined by a PRISM program. The target predicate for hmm.psm is hmm/1. Typically a target predicate is defined so that exactly one ground atomic formula with it as predicate symbol is true in each possible world, thereby defining a distribution over ground instances of the target predicate. (In Sect. 5.3 we will see that it is possible to relax this condition and still do Bayesian parameter estimation.) This is certainly the case for hmm.psm. It is not too difficult to see that however the switches are instantiated (i.e. whichever possible world is chosen) exactly one ground instance of hmm(L) is logically entailed by the program. Such ground instances are viewed as outputs of the PRISM program and will be generically denoted by y.
In PRISM programs ‘with failure’ the condition on the target predicate is weakened so that at most one output is true in each possible world; sometimes the program may ‘fail’ to generate an output. Adding, for example, the literal L = [A,B,A,C,D] to the clause defining hmm/1, which effects the constraint that the 1st and 3rd symbols of the HMM output are the same, would be enough to make hmm.psm a program ‘with failure’, since then some instantiations of the switches would not logically entail any ground instance of hmm(L). Unless the probability of failure is zero the probabilities of ground instances of the target predicate will sum to less than one, so that to define a distribution over them it is necessary to normalise. All PRISM programs discussed in this paper will be failure-free, i.e. not ‘with failure’.
In the normal non-failure case any instantiation of the infinitely many X_{i,j} determines exactly one output. However, it is a requirement of PRISM programs that a finite subset of any such infinite instantiation is enough to determine which output this is. A finite instantiation which is minimal with respect to determining output is called an explanation. Explanations will be generically denoted by x. An explanation x must entail an output y=f(x) and be such that none of its proper subsets does. Here f is the function mapping explanations to outputs which is encoded by the structure of the PRISM program. A further restriction on PRISM programs is that any output has only a finite number of associated explanations: ∀y:|f^{−1}(y)|<∞. This is known as the finite support condition and f^{−1}(y) is known as the support set for y.
2.2 Existing approaches to parameter estimation in PRISM
The simplest way to fit the parameters of a PRISM program is to do maximum likelihood estimation (MLE) using the EM algorithm. The key to doing this efficiently is to find the explanations f^{−1}(y) of an observed data point y just once, rather than on each iteration of the EM algorithm. In addition, it is highly advantageous to use a compact representation of explanations called explanation graphs. All of this is done in the graphical EM algorithm which is explained by Sato and Kameya (2001). This algorithm is provided via builtin predicates in the current PRISM distribution. The EM approach to parameter estimation has been extended to PRISM programs ‘with failure’ (Sato et al. 2005) by incorporating the failure-adjusted maximisation approach, originally devised by the current author (Cussens 2001) for stochastic logic programs (Muggleton 1996).
More recently Sato and colleagues have looked at estimating PRISM parameters using a Bayesian approach, arguing (as here) that MLE can result in overfitting. In work (Sato et al. 2008) closely related to the current paper they present a variational approach to Bayesian parameter estimation (and approximation of the marginal likelihood). A Dirichlet prior distribution is defined for the probabilities of each switch in the PRISM program. A variational Bayes approach (the VB-EM algorithm) is then used which produces an approximate posterior distribution which is also a Dirichlet distribution for each switch. Rather than perform the necessary calculations naïvely (as is done in this paper), Sato et al. use the graphical VB-EM algorithm which exploits explanation graphs similarly to the graphical EM algorithm.
Approximate Bayesian inference is often done via sampling, and PRISM is no exception. Sato (2011) uses MCMC to sample explanations of observed data (rather than find all such explanations as is done in this paper). These samples can then be used to estimate the marginal likelihood and also to estimate the most likely explanation of some new datapoint. As an alternative to MCMC the current author has used Approximate Bayesian Computation (ABC) to estimate PRISM parameters (Cussens 2011). ABC methods are often called ‘likelihood-free’ since they avoid explicitly considering the likelihood function: instead data is generated from candidate parameter values and this synthetic data is compared to the actually observed data. In Cussens (2011) the current author applies the ABC sequential Monte Carlo (ABC-SMC) algorithm of Toni et al. (2009) to PRISM, implemented using the current PRISM distribution. (In later unpublished work a Mercury implementation was used.)
Evidently the current paper is most closely related to previous work which took a Bayesian approach. There is an obvious difference with the sample-based approaches: here we update a closed form approximation to the posterior as each datapoint is considered. This sort of online updating is more difficult with sampling approaches.
A key difference to all the existing work in this area is that the current approach is informed by the knowledge that the true posterior distribution for PRISM parameters is guaranteed to be a mixture of products of Dirichlet distributions whenever the prior is also of this form. A special case of such a prior is a single product of Dirichlet distributions which is the prior used in all previous work on Bayesian inference for PRISM parameters. The hypothesis here is that the true mixture distribution is best approximated by a distribution which is also a mixture (albeit with fewer components). This will at least be a distribution with many local modes like the true one. Approximating with a single product of Dirichlet distributions (as in Sato et al. 2008) cannot do this. On the negative side, aspects of the algorithm presented here are naïve when compared to, say, Sato et al. (2008), since explanations are collected and stored without exploiting explanation graphs. This is an obvious direction for future improvement.
2.3 Bayesian inference for PRISM parameters
When using PRISM programs for statistical inference, observed data will be a sequence of outputs y=y_{1},…,y_{T}. For example, when considering hmm.psm the data might be the 3 datapoints hmm([a,b,a,a,a]), hmm([b,b,a,a,b]) and hmm([b,a,a,b,b]). Each such datapoint is viewed as having been independently sampled from an unknown PRISM program. This paper concerns only parameter estimation and so the structure of the program will be known; only the parameters defining switch distributions will be unknown.
The key point about statistical inference for PRISM programs is that, barring exceptional cases, the values of the switches—the explanation—associated with any observed datapoint will be unobserved. For example, there are 2^{5} distinct explanations for the observed datapoint hmm([a,b,a,a,a]) corresponding to the 2^{5} different state trajectories associated with this sequence of symbols. We are thus faced with latent variables which can be considered as ‘missing data’.
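As a rough illustration of the count just given, the following Python sketch enumerates the explanations of a 5-symbol HMM output and records only the switch counts of each. The listing of hmm.psm is not reproduced in this text, so the switch layout used here (one init draw, one emission per position and one transition between consecutive positions) is an assumption chosen to agree with the 2^{5} figure. The names `explanations` and `likelihood` are hypothetical.

```python
from collections import Counter
from itertools import product

STATES = ("s0", "s1")  # the two hidden states of the example HMM


def explanations(symbols):
    """Yield one Counter of (switch, value) counts per hidden state trajectory.

    Assumed layout: init picks the first state, out(S) emits a symbol in
    state S, and tr(S) picks the next state (no transition after the last
    emission).
    """
    for traj in product(STATES, repeat=len(symbols)):
        counts = Counter()
        counts[("init", traj[0])] += 1
        for t, (state, sym) in enumerate(zip(traj, symbols)):
            counts[(f"out({state})", sym)] += 1
            if t + 1 < len(traj):
                counts[(f"tr({state})", traj[t + 1])] += 1
        yield counts


def likelihood(symbols, theta):
    """P(y | theta): sum over explanations of the product of switch probabilities."""
    total = 0.0
    for counts in explanations(symbols):
        p = 1.0
        for (switch, value), n in counts.items():
            p *= theta[switch][value] ** n
        total += p
    return total
```

Under these assumptions `len(list(explanations("abaaa")))` is 32, and because each explanation multiplies together several parameters, P(y|θ) is a polynomial in θ with one term per explanation.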
As indicated by (2), P(y_{t}|θ) is generally a complex polynomial function of θ which means that the likelihood function \(\prod_{t=1}^{T} P({y}_{t}|{\boldsymbol {\theta}})\) is an even more complex polynomial. This rules out computing a compact exact closed form representation for P(θ|y_{1},…,y_{T}) irrespective of the chosen prior P(θ). However, as mentioned in the introduction, P(θ|y_{1},…,y_{T}) will be a mixture of products of Dirichlet distributions whenever the prior P(θ) is a product, or more generally a mixture of products, of Dirichlet distributions. So mixtures of products of Dirichlet distributions are ‘closed’ under updating. This is an example of the well-known phenomenon of the ‘weak’ conjugacy of mixtures (Bernardo and Girón 1988). (Note also that using a mixture of distributions from a particular class as a prior is always more flexible than using a single distribution from that class.) The next section has the necessary derivation.
3 The posterior distribution when the prior is a mixture of products of Dirichlet distributions
4 Sequential approximate computation of posterior distributions for PRISM parameters
Using (12) the posterior P(θ|y)=P(θ|y_{1},…,y_{T}) could be sequentially computed by conditioning on each of the y_{t} in turn. However, since the number of mixture components increases by a factor of |{x:f(x)=y_{t}}| for each y_{t}, this is clearly impractical for all but the smallest values of T. It is true that if ∏_{i}Dir(θ_{i}|α_{ℓ,i}+C_{i}(x))=∏_{i}Dir(θ_{i}|α_{ℓ′,i}+C_{i}(x′)) for two pairs (ℓ,x) and (ℓ′,x′) then the components are identical and can be merged, but it is unrealistic to depend upon such coincidences to keep the number of mixture components manageable.
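The growth in the number of components can be made concrete with a small Python sketch of one exact conditioning step. Equation (12) is not reproduced in this text, so the weight formula below is an assumption: it is the standard Dirichlet conjugate update, in which the component for the pair (ℓ,x) receives unnormalised weight w_{ℓ}∏_{i}B(α_{ℓ,i}+C_{i}(x))/B(α_{ℓ,i}), with B the multivariate beta function. The function names and the tuple-based data layout are hypothetical.

```python
from math import exp, lgamma


def log_beta(alpha):
    """Log of the multivariate beta function B(alpha)."""
    return sum(lgamma(a) for a in alpha) - lgamma(sum(alpha))


def exact_update(mixture, explanation_counts):
    """Condition a mixture of Dirichlet products on one datapoint.

    mixture: list of (weight, alphas); alphas is a tuple with one tuple of
        Dirichlet parameters per switch.
    explanation_counts: one count structure of the same shape per
        explanation x of the datapoint, i.e. the C_i(x).
    Returns (posterior mixture, marginal likelihood of the datapoint).
    """
    unnormalised = {}
    for weight, alphas in mixture:
        for counts in explanation_counts:
            new_alphas = tuple(
                tuple(a + c for a, c in zip(alpha_i, c_i))
                for alpha_i, c_i in zip(alphas, counts)
            )
            log_ratio = sum(
                log_beta(new) - log_beta(old)
                for new, old in zip(new_alphas, alphas)
            )
            # Components with identical parameters coincide, so pool them.
            unnormalised[new_alphas] = (
                unnormalised.get(new_alphas, 0.0) + weight * exp(log_ratio)
            )
    evidence = sum(unnormalised.values())
    posterior = [(u / evidence, a) for a, u in unnormalised.items()]
    return posterior, evidence
```

In this sketch the normalising constant returned as `evidence` is the probability of the datapoint given the current mixture, and a mixture with L components and a datapoint with n explanations yields up to L·n components after the update.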
This approach to mixture reduction was used by Cowell et al. (1995) and is also discussed by Cowell et al. (1999). It is an instance of assumed density filtering where a complex posterior is approximated with something more convenient at each observation. The method presented here is also related to the clustering method of West (1992). Note that for each y_{t}, all explanations x are searched for, although only the associated count vectors C_{i}(x) are recorded. It follows that for this method to be practical the number of explanations for any single datapoint cannot be too great.
4.1 Merging components
As previously mentioned, the number of components is reduced by successively merging the lowest-weighted component with the nearest other component. Following Cowell et al. (1995), distance between components is simply the Euclidean distance between mean vectors. Other, perhaps better justified, metrics are possible: Euclidean distance was chosen since it is computationally cheap and has been used previously by Cowell et al. (1995) with some success. Once a nearest neighbour has been found according to this metric, merging is done using (essentially) the moment-matching method of Cowell et al. (1995).
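A minimal Python sketch of the selection step described above follows. It assumes that the mean vector of a component is the concatenation of its per-switch Dirichlet means, which is one natural reading of the text rather than something stated explicitly; the names `mean_vector` and `pick_merge_pair` are hypothetical.

```python
from math import dist


def mean_vector(alphas):
    """Concatenate the per-switch Dirichlet means of one component."""
    return [a / sum(alpha_i) for alpha_i in alphas for a in alpha_i]


def pick_merge_pair(mixture):
    """Return (index of lowest-weight component, index of its nearest neighbour).

    mixture: list of (weight, alphas) with at least two components; alphas
        holds one tuple of Dirichlet parameters per switch.
    Distance is Euclidean distance between mean vectors.
    """
    lowest = min(range(len(mixture)), key=lambda k: mixture[k][0])
    lowest_mean = mean_vector(mixture[lowest][1])
    nearest = min(
        (k for k in range(len(mixture)) if k != lowest),
        key=lambda k: dist(lowest_mean, mean_vector(mixture[k][1])),
    )
    return lowest, nearest
```

Repeating this selection, followed by a merge of the chosen pair, until at most K components remain gives the reduction loop in outline.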
4.2 Accuracy of the sequential approximation
Putting together inequality (21) with (22) and (23) provides an upper bound on the KL-divergence between any given mixture of products of Dirichlet distributions and any given approximation to it produced by merging components. However, it is difficult to see how to use this result to formulate a good merging strategy.
Given λ and Dirichlet distributions p_{1} and p_{2}, there seems no simple way of finding a distribution q which satisfies (29). However, it is possible to minimise (26) numerically. Some experiments have been conducted in this direction using simple Beta distributions (Dirichlet distributions where there are only two (switch) values). For example, setting p_{1}=Dir(θ_{1},θ_{2}|1,4), p_{2}=Dir(θ_{1},θ_{2}|3,5) and λ=0.5, the R (R Development Core Team 2011) optim function was used to find that α_{q}=(1.278,3.242) is the minimising choice, giving a KL-divergence of −H(r)−0.3651454. Moment-matching, in contrast, produces a value of α_{q}=(1.444,3.579) which gives a KL-divergence of −H(r)−0.3604162. Changing λ to 0.1 leads to a minimising choice of α_{q}=(2.279,4.146) with a KL-divergence of −H(r)−0.3629558, whereas moment-matching here produces α_{q}=(2.488,4.471) with a KL-divergence of −H(r)−0.3607265. In both cases the minimising solution did indeed satisfy (29) as expected.
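The moment-matching values quoted above can be checked with a short Python sketch for the Beta case. It assumes that λ is the weight on p_{1} and that moment-matching here means choosing q to share the mean and the second moment of θ_{1} with the mixture λp_{1}+(1−λ)p_{2}. This reading reproduces the quoted α_{q} values, but the general multi-switch procedure of Cowell et al. (1995) may differ in detail. The function names are hypothetical.

```python
def beta_moments(a, b):
    """Mean and second moment of theta_1 under Dir(theta_1, theta_2 | a, b)."""
    mean = a / (a + b)
    second = a * (a + 1) / ((a + b) * (a + b + 1))
    return mean, second


def moment_match(lam, p1, p2):
    """Beta parameters matching two moments of lam*p1 + (1 - lam)*p2.

    p1, p2: (a, b) parameter pairs of two different Beta distributions.
    """
    m1, s1 = beta_moments(*p1)
    m2, s2 = beta_moments(*p2)
    mean = lam * m1 + (1 - lam) * m2
    second = lam * s1 + (1 - lam) * s2
    # For a Beta with mean m and parameter sum s, E[theta_1^2] = m(ms+1)/(s+1);
    # solving for s gives the expression below.
    total = (mean - second) / (second - mean ** 2)
    return mean * total, (1 - mean) * total
```

For instance, `moment_match(0.5, (1, 4), (3, 5))` and `moment_match(0.1, (1, 4), (3, 5))` agree with the two moment-matching results above to three decimal places.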
These two comparisons are typical of results that have been obtained by preliminary numerical experimentation: moment-matching does not minimise KL-divergence, but it approximates the minimising choice quite well. Given the computational savings of the moment-matching approach it has been decided to use this as the merging method in this paper. Another less pragmatic argument in favour of moment-matching is given by Cowell (1998) who argues that the KL-divergence between distributions is the wrong score to minimise. Instead the KL-divergence between the predictive performance of the distributions is what should be minimised. PRISM programs can be used to make predictions about likely explanations x or likely observations y. The predictive performance of approximating q when r is the true distribution is \(\sum_{x} r(x) \log\frac{r(x)}{q(x)}\) for explanations and \(\sum_{y} r(y) \log\frac{r(y)}{q(y)}\) for observations.
4.3 Approximating the marginal likelihood
Given its importance for model choice it is not surprising that there is existing work on computing marginal likelihood for PRISM programs. Sato et al. (2008) have used a variational approach to approximating P(y), which is implemented in the current version of PRISM. In this approach the joint distribution P(θ,x|y) over parameters and hidden data is approximated by a product q_{θ}(θ)q_{x}(x) and an EM-like algorithm is used to find choices for q_{θ}(θ) and q_{x}(x) which optimise this approximation. More recently, Sato (2011) has used MCMC to approximate P(y).
4.4 Implementation of the sequential algorithm
Due to the computationally demanding nature of the task, the fastest logic programming language available was used, namely Mercury (Somogyi et al. 1996). In all cases Mercury was used in a standard fashion: the Mercury compiler generated low-level C which was compiled to native code by gcc. The GNU scientific library was linked in to provide log beta functions. A collection of 6 Mercury modules was developed: model.m, prior.m, data.m, sequential.m, params.m and vectors.m. The first three of these are problem-specific and define, respectively, the PRISM model, prior distribution and observed data for a particular Bayesian inference problem. The other modules implement the sequential approximation algorithm. These modules ‘know’ the bare minimum about the PRISM model: just the types of observed datapoints, the number and type of switches and (via a predicate model/2 exported from model.m) which C_{i}(x) are associated with a particular datapoint. This modularity ensures the algorithm can be applied to any (failure-free) PRISM program. An additional advantage of using Mercury is that it is possible to use the Mercury compiler to check that a PRISM program is indeed failure-free. To do this one writes the model/2 predicate to be a relation between data and explanations and declares that if an explanation is given as input, exactly one datum can result as output (a det mode declaration). Note that in this paper, model.m has been written as a relation between data and explanation counts since this is slightly more efficient when there is no need to check for the absence of failure.
5 Results for the approximate sequential approach
This section reports on some initial empirical results using the approximate sequential approach. Most results were produced using Mercury version rotd-2010-04-17, configured for i686-pc-linux-gnu, on a dual-core 3 GHz machine running Linux. In all cases only a single core was used. For some of the bigger experiments it was necessary to increase the default size of the Mercury det stack using the runtime --detstack-size-kwords option.
5.1 An initial experiment
5.2 Comparing approximate and exact solutions
Mean values of HMM probabilities according to the posterior distribution conditional on data [a,b,a,b,b],[a,b,a,a,b],[a,b,a,a,a],[a,a,a,a,a] and three approximations to it. Table headings have been abbreviated, so that, for example, init=s0 is short for E[P(init=s0)]
Distribution | init=s0 | tr(s0)=s0 | tr(s1)=s0 | out(s0)=a | out(s1)=a |
---|---|---|---|---|---|
Exact | 0.5000 | 0.4660 | 0.5340 | 0.6487 | 0.6487 |
K=100 | 0.5001 | 0.4662 | 0.5334 | 0.6410 | 0.6402 |
K=100 | 0.5003 | 0.4676 | 0.5327 | 0.6593 | 0.6589 |
K=10 | 0.4924 | 0.4643 | 0.5342 | 0.6857 | 0.6885 |
Mean values of HMM probabilities according to various distributions. Table headings have been abbreviated, so that, for example, init=s0 is short for E[P(init=s0)]
K | | w | init=s0 | tr(s0)=s0 | tr(s1)=s0 | out(s0)=a | out(s1)=a |
---|---|---|---|---|---|---|---|
100 | < | 0.528 | 0.074 | 0.391 | 0.812 | 0.618 | 0.152 |
100 | > | 0.471 | 0.921 | 0.196 | 0.610 | 0.146 | 0.627 |
200 | < | 0.490 | 0.081 | 0.410 | 0.821 | 0.611 | 0.147 |
200 | > | 0.510 | 0.923 | 0.240 | 0.620 | 0.144 | 0.653 |
It is clear that in both the K=100 and K=200 case two (near) equally weighted (near) ‘symmetrical’ distributions have been returned, with the hidden states being swapped between them. Note, for example, that the mean value for P(out(s0)=a) in the 1st row is close to that of P(out(s1)=a) in the second. The > components are the ‘correct’ ones since the true value of P(init=s0) is indeed above 0.5. Note that in these components the mean value of P(init=s0) is very close to the true value of 0.9 and the means for the emission probabilities are also good estimates. The transition probabilities are less well estimated by these mean values though.
5.3 Probabilistic graph model
The sequential approximation algorithm was run twice, with the order of the datapoints being reversed the second time. A single product of Dirichlet distributions was used as the prior, with each Dirichlet having both parameters set to 1. The component limit was 200. In contrast to earlier experiments, Mercury 11.07 was used on a 2.5 GHz laptop. The two runs took 370 s and 342 s.
Mean values for edge probabilities as estimated by two runs of the sequential algorithm with the order of the data reversed between runs
a→c | b→c | a→b | c→d | c→e | e→d |
---|---|---|---|---|---|
0.872 | 0.512 | 0.536 | 0.628 | 0.830 | 0.799 |
0.835 | 0.469 | 0.515 | 0.611 | 0.782 | 0.768 |
6 Conclusions and future work
The principal contributions of this paper are theoretical. Starting from the easy result that priors which are mixtures of Dirichlet products lead to posteriors of the same form, we move on to an analysis of the quality of the approximation effected by component merging. An upper bound on the KL-divergence for any such approximation is found, although unfortunately this result has not been exploited to optimise the method. However, in the case of optimising the merger of any two given Dirichlet distributions an expression has been derived which can be solved numerically. This has allowed a comparison with the computationally convenient moment-matching method. In addition, it has been shown that marginal likelihood can be computed almost as a by-product of the presented sequential approximation approach. Some initial results have been produced using a Mercury implementation. Unfortunately space constraints prevent an account of the many interesting issues that arise when implementing a PRISM program as just a particular sort of Mercury program.
However, it is clear that much remains to be done. Most obviously extensive empirical evaluation would be useful, not least to compare marginal likelihood estimates with those produced using variational (Sato et al. 2008) and MCMC (Sato 2011) methods. It would also be useful to remove the restriction that each single datapoint does not produce very many explanations. A possible solution to this would be to search for only high weight explanations rather than just find all of them. Finally, although it generalises PRISM in an important respect, the ProbLog system (Gutmann et al. 2010) still satisfies (2) and so it would be interesting to do further investigation into using the method presented here for parameter estimation of ProbLog programs.
Acknowledgements
Thanks to the anonymous reviewers for their comments and suggestions.