Goal and Plan Recognition via Parse Trees Using Prefix and Infix Probability Computation

Conference paper

DOI: 10.1007/978-3-319-23708-4_6

Part of the Lecture Notes in Computer Science book series (LNCS, volume 9046)
Cite this paper as:
Kojima R., Sato T. (2015) Goal and Plan Recognition via Parse Trees Using Prefix and Infix Probability Computation. In: Davis J., Ramon J. (eds) Inductive Logic Programming. Lecture Notes in Computer Science, vol 9046. Springer, Cham


We propose new methods for goal and plan recognition based on prefix and infix probability computation in a probabilistic context-free grammar (PCFG), both of which are applicable to incomplete data. We define goal recognition as the task of identifying a goal from an action sequence, and plan recognition as that of discovering a plan for the goal, i.e., its goal-subgoal structure. Incomplete data often occur in applications as incomplete sentences in a PCFG. To handle them, we introduce prefix and infix probability computation via parse trees in PCFGs and compute the most likely goal and plan by treating the incomplete data as prefixes and infixes.

We applied our approach to web session logs taken from the Internet Traffic Archive, for which goal and plan recognition is important for improving websites. We tackled the problem of goal recognition from incomplete logs and empirically demonstrated the superiority of our approach over other approaches which do not use parsing. We also showed that it is possible to estimate the most likely plans from incomplete logs. All prefix and infix probability computation in this paper, together with the computation of the most likely goal and plan, is carried out using the logic-based modeling language PRISM.


Keywords: Prefix probability, PCFG, Plan recognition, Session log

1 Introduction

Goal and plan recognition have been studied in artificial intelligence as an inverse problem of planning and applied to services that require the identification of users’ intentions and plans from their actions, such as human-computer collaboration [9], intelligent interfaces [4] and intelligent help systems [5].

This paper addresses goal and plan recognition using probabilistic context-free grammars (PCFGs), which are widely used as models not only in natural language processing but also for analyzing symbolic sequences in general. In a simple PCFG model for goal and plan recognition, a symbolized action sequence made by a user is regarded as a sentence generated by a mixture of PCFGs where each component PCFG describes a plan for a specific goal. We call this model a mixture of PCFGs. In this setting, the task of goal recognition is to infer the goal from a sentence as the most likely start symbol of some component PCFG in the mixture, whereas plan recognition computes the most likely parse tree representing a plan for the goal.

The problem with this simple model is that only sentences generated by the grammar receive non-zero probability. The probability of non-sentences is therefore always zero, and it is impossible to extract the information contained in incomplete sentences by considering their probabilities and (incomplete) parse trees, even though such sentences often occur in real data. To overcome this problem, we propose to generalize the probability and parse trees of sentences to those of incomplete sentences. Consider for example a prefix of a sentence, i.e., an initial substring of the sentence. It is one type of incomplete sentence and is often observed as data when the observation has started but is not yet complete, such as the medical record of a patient who is still receiving treatment.

In plan recognition, our proposal enables us to extract the most likely plan from incomplete data, bringing us important information about the user’s intention. When applied to a website, for example, where we identify the visitor’s plan from his/her actions such as clicking links, the discovered plan would reveal whether the website structure matches the visitor’s intention. However, to our knowledge, no grammatical approach so far has extracted a plan from incomplete sentences, much less the most likely plan. In this paper, we show that it is possible to extract the most likely plan from incomplete sentences.

Turning to goal recognition, it is surely possible to directly estimate the goal from actions by feature-based methods such as logistic regression, support vector machines and so on but they are unable to make use of structural information represented by a plan behind the action sequence. We experimentally demonstrate the importance of structural information contained in a plan and compare our method to these feature-based methods in web session log analysis we describe next.

In our web session log analysis, action sequences recorded in session logs are basically regarded as complete sentences in a mixture of PCFGs. We however consider three types of situation that give rise to incomplete data. The first is the online situation, where the aim is to navigate visitors to a target web page during their web surfing, say, by displaying affiliate links appropriate to their goals. The second concerns visitors who quit the website for some reason before their purpose is achieved. In these two situations the visitors’ goal is not achieved, and hence their action sequences should be treated as incomplete sentences. The last type is the cross-site situation where a user visits several websites, which is a very likely situation in real web surfing. In this situation an action sequence observed at one website is only part of the whole action sequence. Consequently an action sequence recorded at one website should be considered as an incomplete sentence.

Notice that the first and second situations yield prefixes, whereas in the last one the beginning part of a sentence is missing in addition to the ending part. This type of incomplete sentence is called an infix. In this paper, we primarily deal with the first and second situations and touch on the last one. Our analysis is uniformly applicable to all of these situations and can make appropriate advertisements pop up on a display at the right time during web surfing by detecting visitors’ purposes and plans from action sequences recorded in web session logs.

To implement our approach, we use the logic-based modeling language PRISM [12, 13] which is a probabilistic extension of Prolog for probabilistic modeling. It supports general and highly efficient probability computation and parameter learning for a wide variety of probabilistic models including PCFGs. For example a PRISM program for PCFGs can perform probability computation in the same time complexity as the Inside-Outside algorithm, a specialized probability computation algorithm for PCFGs. In PRISM, probability is computed by dynamic programming applied to acyclic explanation graphs which are an internal data structure encoding all (but finitely many) proof trees. In our previous work [14], we extended them and introduced an efficient mechanism of prefix probability computation based on cyclic explanation graphs encoding infinitely many parse trees. In this paper, we further add to PRISM infix probability computation and the associated Viterbi inference which computes the most likely partial parse tree using cyclic explanation graphs obtained from parsing.

In the following, we first review PRISM focusing on prefix probability computation. We next apply prefix probability computation via parse trees to web session log analysis and conduct an experiment on visitors’ goals and plans using real data sets. Then infix probability computation via parse trees is briefly described and followed by conclusion.

2 Reviewing PRISM

The prefix and infix probability computation via parse trees proposed in this paper is carried out in PRISM using explanation graphs. So in this section, we quickly review probability computation in PRISM while focusing on prefix probability computation by cyclic explanation graphs.

PRISM is a probabilistic extension of Prolog and provides a basic built-in predicate of the form \(\mathtt{msw}(i,v)\) for representing a probabilistic choice used in probabilistic modeling. It is called “multi-valued random switch” (\(\mathtt{msw}\) for short) and used to denote a simple probabilistic event \(X_i=v\) where \(X_i\) is a discrete random variable and v is its realized value. Let \(V_i = \{ v_1, \cdots , v_{|V_i|} \}\) be the set of possible outcomes of \(X_i\). i is called the switch name of \(\mathtt{msw}(i,v)\).

To represent the distribution \(P(X_i=v)\) (\(v \in V_i\)), we introduce a set \(\{ \mathtt{msw}(i,v) \mid v \in V_i \}\) of mutually exclusive \(\mathtt{msw}\) atoms and give them a joint distribution such that \(P(\mathtt{msw}(i,v)) = \theta _{i,v} = P(X_i=v)\) (\( v \in V_i\)) where \(\sum _{v \in V_i}\theta _{i,v}=1\). \(\{ \theta _{i,v} \}\) are called parameters and the set of all parameters \(\mathbf {\Theta }\) appearing in a program is manually specified by the user or automatically learned from data.

Suppose a positive program \( DB \) is given which is a Prolog program containing \(\mathtt{msw}\) atoms. We define a basic distribution (probability measure) \(P_{\mathtt{msw}}(\cdot \mid \mathbf {\Theta })\) as the product of distributions for the \(\mathtt{msw}\)s appearing in \( DB \). Then it is proved that the basic distribution can uniquely be extended by way of the least model semantics in logic programming to a \(\sigma \)-additive probability measure \(P_{ DB }(\cdot \mid \mathbf {\Theta })\) over possible Herbrand interpretations of \( DB \). It is the denotation of \( DB \) in the distribution semantics [11, 12] that is a standard semantics for probabilistic logic programming. In the following, we omit \(\mathbf {\Theta }\) when the context is clear.

Semantically PRISM is just one of many possible implementations of the distribution semantics that realizes efficient probability computation by adding two assumptions: independence and exclusiveness. Let G be a non-msw atom which is ground. \(P_ DB (G)\), the probability of G defined by the program \( DB \), can be naively computed as follows. First reduce the top-goal G using Prolog’s exhaustive top-down proof search to a propositional DNF (disjunctive normal form) formula \(\mathrm{expl}_0(G)=e_1 \vee e_2 \vee \cdots \vee e_k\) where \(e_i (1 \le i \le k)\) is a conjunction of atoms \(\mathtt{msw}_1 \wedge \cdots \wedge \mathtt{msw}_n\) such that \(e_i, DB \vdash G\). Each \(e_i\) is called an explanation for G. Then assuming the
  • Independence condition (\(\mathtt{msw}\) atoms in an explanation are independent):
    $$\begin{aligned} P_ DB (\mathtt{msw}\wedge \mathtt{msw}')=P_ DB (\mathtt{msw}) P_ DB (\mathtt{msw}') \end{aligned}$$
  • Exclusive condition (explanations are exclusive):
    $$\begin{aligned} P_ DB (e_i \wedge e_j)=0 \quad \mathrm{if}\; i\ne j \end{aligned}$$
we compute \(P_ DB (G)\) as
$$\begin{aligned} P_ DB (G)= & {} P_ DB (e_1) + \cdots + P_ DB (e_k) \\ P_ DB (e_i)= & {} P_ DB (\mathtt{msw}_1) \cdots P_ DB (\mathtt{msw}_n) \quad \mathrm{for}\; e_i = \mathtt{msw}_1 \wedge \cdots \wedge \mathtt{msw}_n \end{aligned}$$
Recall that \(\mathtt{msw}\)s with different switch names are independent by construction of \(P_{\mathtt{msw}}(\cdot \mid \mathbf {\Theta })\). We further assume that \(\mathtt{msw}\) atoms with the same switch name are iid (independent and identically distributed). Fortunately this assumption can be automatically satisfied.

Contrastingly the exclusiveness condition cannot be automatically satisfied. It needs to be satisfied by the user, for example, by writing a program so that it generates an output solely as a sequence of probabilistic choices made by \(\mathtt{msw}\) atoms (modulo auxiliary non-probabilistic computation). Although most generative models including BNs, HMMs and PCFGs are naturally written in this style, there are models which are not [3]. Relating to this, observe that Viterbi explanation, i.e., the most likely explanation \(e^*\) for G, is computed similarly to \(P_ DB (G)\) just by replacing sum with argmax: \(e^* \mathop {=}\limits ^{\mathrm {def}}\mathop {\mathrm{argmax}}\limits _{e \in \mathrm{expl}_0(G)} P_ DB (e).\)
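
The naive computation above can be illustrated with a minimal Python sketch. It is not PRISM code: the switch names, parameter values and the two explanations are hypothetical and chosen only so that the independence and exclusiveness conditions hold. The probability of the goal is the sum over explanations of the product of their parameters, and the Viterbi explanation replaces the sum with argmax.

```python
from math import prod

# Hypothetical parameters theta[(switch name, value)]; illustrative values only.
theta = {
    ("coin1", "h"): 0.6, ("coin1", "t"): 0.4,
    ("coin2", "h"): 0.5, ("coin2", "t"): 0.5,
}

# Explanations for a hypothetical goal "both coins agree".
# They are mutually exclusive, and the msw atoms inside each one are independent.
explanations = [
    [("coin1", "h"), ("coin2", "h")],
    [("coin1", "t"), ("coin2", "t")],
]

def explanation_prob(expl):
    """Product of the parameters of the msw atoms in one explanation."""
    return prod(theta[m] for m in expl)

def goal_prob(expls):
    """Sum over mutually exclusive explanations."""
    return sum(explanation_prob(e) for e in expls)

def viterbi_explanation(expls):
    """Most likely explanation: the same computation with argmax instead of sum."""
    return max(expls, key=explanation_prob)

p_goal = goal_prob(explanations)
best = viterbi_explanation(explanations)
```

Enumerating explanations one by one like this is exactly the step that becomes exponential in general.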

So far our computation is naive. Since there can be exponentially many explanations, naive computation would lead to exponential time computation. PRISM avoids this by adopting tabled search in the exhaustive search for all explanations for the top-goal G and applying dynamic programming to probability computation. By tabling, a goal which is once called and proved is stored (tabled) in memory with its answer substitutions and later calls to the same goal return with a stored answer substitution without processing further. Tabling is important to probability computation because tabled goals factor out common sub-conjunctions in \(\mathrm{expl}_0(G)\), which results in sharing probability computation for common sub-conjunctions, thereby realizing dynamic programming which gives exponentially faster probability computation compared to naive computation.

As a result of exhaustive tabled search for all explanations for G, PRISM yields a set of propositional formulas called defining formulas of the form \(H \Leftrightarrow B_1 \vee \cdots \vee B_h\) for every tabled goal H that directly or indirectly calls \(\mathtt{msw}\)s. We call the heads of defining formulas defined goals. Each \(B_i (1 \le i \le h) \) is recursively composed of a conjunction \(C_1 \wedge \cdots \wedge C_m \wedge \mathtt{msw}_1\wedge \cdots \wedge \mathtt{msw}_n (0 \le m,n)\) of defined goals \(\{C_1,\cdots , C_m\}\) and \(\mathtt{msw}\) atoms \(\{\mathtt{msw}_1, \cdots , \mathtt{msw}_n\}\). We introduce a binary relation \(H \succ C\) over defined goals such that \(H \succ C\) holds if H is the head of some defining formula and C occurs in the body. We denote by \(\mathrm{expl}(G)\) the whole set of defining formulas and call \(\mathrm{expl}(G)\) the explanation graph for G as in the non-tabled case. When “\(\succ \)” is acyclic, we call an explanation graph acyclic and extend “\(\succ \)” to a partial ordering over the defined goals.

Once \(\mathrm{expl}(G)\) is obtained as an acyclic explanation graph, the defined goals are layered by the “\(\succ \)” relation, and the defining formulas in the bottom layer (minimal elements) have only \(\mathtt{msw}\)s in their bodies, whose probabilities are known (declared in the program). We can therefore compute probabilities by a sum-product operation for all defined goals from the bottom layer upward in a dynamic programming manner, in time linear in the number of atoms appearing in \(\mathrm{expl}(G)\).
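
As a rough illustration of this bottom-up sum-product computation, the following Python sketch evaluates a small acyclic explanation graph. The graph, the abstract switch labels `m1` to `m3` and their probabilities are all hypothetical, and memoization via `lru_cache` merely stands in for the sharing that tabling provides in PRISM.

```python
from functools import lru_cache
from math import prod

# Hypothetical probabilities of msw atoms, identified by abstract labels.
THETA = {"m1": 0.2, "m2": 0.8, "m3": 0.5}

# Hypothetical acyclic explanation graph: head -> list of bodies.
# Each body is (defined goals, msw atoms), read as a conjunction;
# the bodies of one head are read as an exclusive disjunction.
GRAPH = {
    "G": [(("A",), ("m1",)), (("A",), ("m2", "m3"))],
    "A": [((), ("m3",)), ((), ("m1", "m2"))],
}

@lru_cache(maxsize=None)
def prob(goal):
    """Sum over bodies of the product of sub-goal and msw probabilities.

    Each defined goal is evaluated once and then reused, so the cost is
    linear in the size of the graph rather than in the number of explanations.
    """
    return sum(
        prod(prob(c) for c in goals) * prod(THETA[m] for m in msws)
        for goals, msws in GRAPH[goal]
    )

p_top = prob("G")
```

Here the sub-goal `A` appears in both bodies of `G` but its probability is computed only once.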

Compared to naive computation, the use of dynamic programming on \(\mathrm{expl}(G)\) can reduce time complexity for probability computation from exponential time to polynomial time. For example PRISM’s probability computation for HMMs takes O(L) time for a given sequence with length L and coincides with the standard forward-backward algorithm for HMMs. Likewise PRISM’s sentence probability computation for PCFGs takes \(O(L^3)\) time for a given sentence with length L and coincides with inside probability computation for PCFGs.

Viterbi inference that computes the Viterbi explanation and its probability is similarly performed on \(\mathrm{expl}(G)\) in a bottom-up manner like probability computation stated above. The only difference is that we use argmax instead of sum. In what follows, we look into the detail of how the Viterbi explanation is computed.

Let H be a defined goal and \(H \Leftrightarrow B_1 \vee \cdots \vee B_h\) the defining formula for H in \(\mathrm{expl}(G)\). Write \(B_i=C_1 \wedge \cdots \wedge C_m \wedge \mathtt{msw}_1\wedge \cdots \wedge \mathtt{msw}_n (0 \le m,n)\) for \(1 \le i \le h\) and suppose recursively that the Viterbi explanation \(e_{C_j}^{*} (1 \le j \le m)\) has already been calculated for each defined goal \(C_j\) in \(B_i\). Then the Viterbi explanation \(e_{B_i}^{*} \) for \(B_i\) and the Viterbi explanation \(e_{H}^{*} \) for H are respectively computed by
$$\begin{aligned} e_{B_i}^{*}= & {} e_{C_1}^{*} \wedge \cdots \wedge e_{C_m}^{*} \wedge \mathtt{msw}_1\wedge \cdots \wedge \mathtt{msw}_n \\ e_{H}^{*}= & {} \mathop {\mathrm{argmax}}\limits _{B_i} P_ DB (e_{B_i}^{*} \mid \mathbf {\Theta }) \\&\quad \mathrm{where}\; P_ DB (e_{B_i}^{*} \mid \mathbf {\Theta }) = P_ DB (e_{C_1}^{*}) \cdots P_ DB (e_{C_m}^{*}) \theta _{i_1,v_1} \cdots \theta _{i_n,v_n} \end{aligned} \tag{1}$$
Here \(\theta _{i_1,v_1}\) is a parameter associated with \(\mathtt{msw}_1\) and so on. In this way, the Viterbi explanation for the top-goal G is computed in a bottom-up manner by scanning \(\mathrm{expl}(G)\) once in time linear in the size of \(\mathrm{expl}(G)\) in an acyclic explanation graph.
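
The recursion can be sketched in Python as follows. This is only an illustration under assumed inputs: the explanation graph and the abstract switch labels are hypothetical, and the function returns the probability together with the list of msw atoms forming the Viterbi explanation.

```python
from functools import lru_cache

# Hypothetical probabilities of msw atoms, identified by abstract labels.
THETA = {"m1": 0.2, "m2": 0.8, "m3": 0.5}

# Hypothetical acyclic explanation graph: head -> list of bodies,
# each body being (defined goals, msw atoms).
GRAPH = {
    "G": [(("A",), ("m1",)), (("A",), ("m2", "m3"))],
    "A": [((), ("m3",)), ((), ("m1", "m2"))],
}

@lru_cache(maxsize=None)
def viterbi(goal):
    """Return (probability, explanation) of the most likely explanation."""
    best_p, best_e = 0.0, ()
    for goals, msws in GRAPH[goal]:
        p, e = 1.0, []
        for c in goals:                 # Viterbi explanations of the sub-goals
            pc, ec = viterbi(c)
            p *= pc
            e.extend(ec)
        for m in msws:                  # parameters of the msw atoms in the body
            p *= THETA[m]
            e.append(m)
        if p > best_p:                  # argmax over the bodies B_i
            best_p, best_e = p, tuple(e)
    return best_p, best_e

p_star, e_star = viterbi("G")
```

Each defined goal is again visited once, which matches the linear-time scan described above.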

3 Prefix Probability Computation

In this section, we examine prefix computation for PCFGs in PRISM. A PCFG \(\mathsf G_{\mathbf {\Phi }}\) is a CFG \(\mathsf G\) augmented with a parameter set \(\mathbf {\Phi }= \bigcup _{N^i \in \mathbf {N}} \{ \phi _r \}_{N^i}\) where \(\mathbf {N}\) is a set of nonterminals, \(N^1\) a start symbol and \(\{\phi _r \}_{N^i}\) the set of parameters associated with rules \(\{ r \mid r = N^i \rightarrow \zeta \}\) for a nonterminal \(N^i\) where \(\zeta \) is a sequence of nonterminal and terminal symbols. We assume that the \(\phi _r\)’s satisfy \(0<\phi _r<1\) and \(\sum _{\zeta :N^i \rightarrow \zeta } \phi _{N^i\rightarrow \zeta } = 1\).

There are already algorithms to compute prefix probabilities in PCFGs [8, 15]. We here briefly describe prefix probability computation based on explanation graphs in PRISM [14]. As previously stated, a prefix \(\mathbf {v}\) is an initial substring of a sentence and the prefix probability \(P_\mathrm{pre}^{N^1}(\mathbf {v})\) of \(\mathbf {v}\) is an infinite sum of probabilities of sentences extending \(\mathbf {v}\):
$$\begin{aligned} P_\mathrm{pre}^{N^1}(\mathbf {v}) = \sum _{\mathbf {w}} P_\mathsf{G}(\mathbf {vw}) \end{aligned}$$
where \(\mathbf {w}\) ranges over strings such that \(\mathbf {vw}\) is a sentence in \(\mathsf G\). Prefix probabilities are computed in PRISM by way of cyclic explanation graphs. We sketch our prefix probability computation following [14]. We use a PCFG \(\mathsf G_0\) = { s\(\rightarrow \)s s : 0.4, s\(\rightarrow \)a : 0.3, s\(\rightarrow \)b : 0.3 } where “s” is the start symbol and “a” and “b” are terminals, and consider the computation of the prefix probability \(P_\mathrm{pre}^{\mathtt{s}}(\mathtt{a})\) of the prefix “a”. To compute \(P_\mathrm{pre}^\mathtt{s}(\mathtt{a})\), we first parse “a” as a prefix by the PRISM program \( DB _0\) in Fig. 1. As can be seen from the comments, it runs exactly like a standard top-down CFG parser except for the pseudo success at line (6). Pseudo success means an immediate return with success upon consumption of the input prefix L1, ignoring the remaining nonterminals in R at line (2).
Fig. 1. Prefix parser \( DB _0\)

Fig. 2. Explanation graph for prefix “a” (left) and associated probability equations (right)

By running the command ?-probf(pre_pcfg([a])) in PRISM, we obtain the explanation graph in Fig. 2 (left) for pre_pcfg([a]). Note that a cycle exists in the explanation graph; pre_pcfg([s,s],[a],[]) calls itself in the third defining formula. Since this is a small example, its explanation graph has only self-loops. In general however, an explanation graph for prefix parsing has larger cycles as well as self-loops, and we call this type of explanation graph a cyclic explanation graph [14].

Then we convert the defining formulas to a set of probability equations about X, Y, Z and W as shown in Fig. 2 (right). We use the assumptions in PRISM that goals are independent (\(P(A\wedge B) = P(A) P(B)\)) and disjunctions are exclusive (\(P(A\vee B) =P(A)+P(B)\)). By solving them using parameter values \(\theta _{\mathtt{s}\rightarrow \mathtt{s s}}\) = 0.4 and \(\theta _{\mathtt{s}\rightarrow \mathtt{a}}\) = 0.3 set by :-set_sw(s,[0.4,0.3,0.3]) in the program \( DB _0\), we finally obtain \(\mathtt{X} = \mathtt{Y} = \mathtt{Z} = 0.5\). So we have \(P_\mathrm{pre}^\mathtt{s}(\mathtt{a}) = \mathtt{X} = 0.5\). In general, a set of probability equations generated from a prefix in a PCFG using \( DB _0\) is always linear and solvable by matrix operation [14].
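
The value 0.5 can be checked with a small Python sketch. It does not reproduce the equations of Fig. 2. Instead it assumes a single reconstructed linear equation for \(p = P_\mathrm{pre}^\mathtt{s}(\mathtt{a})\): a sentence starts with “a” either because s\(\rightarrow \)a is applied, or because s\(\rightarrow \)s s is applied and the left “s” yields a string starting with “a”, while the right “s” is ignored (probability 1) as in pseudo success. This gives \(p = \theta _{\mathtt{s}\rightarrow \mathtt{a}} + \theta _{\mathtt{s}\rightarrow \mathtt{s s}}\, p\).

```python
# Parameters of the example grammar G_0.
THETA_SS = 0.4   # s -> s s
THETA_A = 0.3    # s -> a

def prefix_prob_closed_form():
    """Solve the assumed linear equation p = THETA_A + THETA_SS * p directly."""
    return THETA_A / (1.0 - THETA_SS)

def prefix_prob_fixed_point(n_iter=200):
    """Solve the same equation by fixed-point iteration starting from 0.

    Iteration t sums the contributions of derivations that use the
    rule s -> s s fewer than t times on the leftmost path.
    """
    p = 0.0
    for _ in range(n_iter):
        p = THETA_A + THETA_SS * p
    return p

p_closed = prefix_prob_closed_form()
p_iter = prefix_prob_fixed_point()
```

Both routes agree with the value reported above. For larger prefixes the system has several unknowns and would be solved as a matrix equation instead.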

We next describe an extension of the Viterbi inference of PRISM to cyclic explanation graphs. The most likely explanation and its probability for cyclic explanation graphs is defined as usual as \(e^* \mathop {=}\limits ^{\mathrm {def}}\mathop {\mathrm{argmax}}\limits _{e \in \mathrm{expl}_0(G)} P_ DB (e)\) where \(\mathrm{expl}_0(G)\) is possibly an infinite set of explanations represented by a cyclic explanation graph. For example, the set of explanations represented by Fig. 2 (left) is \(\mathrm{expl}_0(\)pre_pcfg([s,s],[a],[])\()=\)\(\{\)msw(s,[a]), msw(s,[s,s])\(\wedge \)msw(s,[a]), msw(s,[s,s])\(\wedge \)msw(s,[s,s])\(\wedge \)msw(s,[a]), \(\cdots \}\) where the repetition of msw(s,[s,s]) is produced by the cycle. Note that although there are infinitely many explanations, the most likely explanation is msw(s,[a]) since the product of probabilities is monotonically decreasing w.r.t. the number of occurrences of msw(s,[s,s]) (\(0 < P_ DB (\mathtt{msw(s,[s,s])}) < 1\)) in an explanation.

The Viterbi algorithm in Eq. (1) for acyclic explanation graphs is no longer applicable to cyclic graphs, as it would not terminate if applied to them. So we generalize it for cyclic explanation graphs using a shortest path algorithm such as Dijkstra’s algorithm or the Bellman-Ford algorithm [6]. In our implementation, we adopted the Bellman-Ford algorithm since it requires neither additional data structures nor additional memory: it reuses the space allocated for the Viterbi algorithm.
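
A Bellman-Ford-style relaxation on a cyclic explanation graph might look like the following Python sketch. It is not the PRISM implementation: the graph encoding, the goal names `top` and `loop`, and the cap of one relaxation round per defined goal are assumptions made for illustration. Maximizing a product of probabilities here plays the role of minimizing a sum of negative log probabilities in a shortest path problem.

```python
# Switch parameters of the example grammar G_0 (only those needed here).
THETA = {("s", ("s", "s")): 0.4, ("s", ("a",)): 0.3}

# Hypothetical cyclic explanation graph: head -> list of (defined goals, msw atoms).
# "loop" has a self-loop, mirroring the repetition of msw(s,[s,s]).
GRAPH = {
    "top": [(("loop",), ())],
    "loop": [((), (("s", ("a",)),)), (("loop",), (("s", ("s", "s")),))],
}

def cyclic_viterbi(graph, theta):
    """Relax all defining formulas repeatedly until no value improves."""
    value = {h: 0.0 for h in graph}      # best probability found so far
    choice = {h: None for h in graph}    # index of the best body per head
    for _ in range(len(graph)):          # assumed bound: one round per defined goal
        changed = False
        for head, bodies in graph.items():
            for idx, (goals, msws) in enumerate(bodies):
                p = 1.0
                for g in goals:
                    p *= value[g]
                for m in msws:
                    p *= theta[m]
                if p > value[head]:      # strict improvement only
                    value[head], choice[head] = p, idx
                    changed = True
        if not changed:
            break
    return value, choice

def read_explanation(graph, choice, head):
    """Follow the recorded best bodies to collect the msw atoms."""
    goals, msws = graph[head][choice[head]]
    expl = []
    for g in goals:
        expl.extend(read_explanation(graph, choice, g))
    return expl + list(msws)

value, choice = cyclic_viterbi(GRAPH, THETA)
explanation = read_explanation(GRAPH, choice, "top")
```

Because every pass through the cycle multiplies in a factor below 1, the relaxation never selects the looping body, and the result is the single atom msw(s,[a]).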

4 Action Sequences as Incomplete Sentences in a PCFG

From here on, we tackle the problem of identifying the purposes or goals of visitors who visit a website from their session logs. We first abstract a visitor’s session log into a sequence of five basic actions: up, down, sibling, reload and move. The first two, up and down, state that the visitor moves respectively to a page in the parent directory or a subdirectory in the site’s directory structure. An action sibling says that the visitor moves to a page in a subdirectory of the parent directory. An action reload means that the visitor requests the same page. An action move categorizes remaining miscellaneous actions. Moving between web pages is expressed by a sequence of basic actions. For example moving from /top/index.html to /top/child/a.html is a down action.
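
One possible reading of this abstraction is sketched below in Python. The function name `classify_action` and the exact directory tests are assumptions: the text does not say how two different pages in the same directory are labeled, so the sketch assigns them to move.

```python
import posixpath

def classify_action(src, dst):
    """Map a transition between two URL paths to one of the five basic actions."""
    if src == dst:
        return "reload"                      # the same page is requested again
    d_src = posixpath.dirname(src)
    d_dst = posixpath.dirname(dst)
    if d_src == d_dst:
        return "move"                        # assumption: same-directory jumps are miscellaneous
    if d_dst == posixpath.dirname(d_src):
        return "up"                          # page in the parent directory
    if posixpath.dirname(d_dst) == d_src:
        return "down"                        # page in a subdirectory
    if posixpath.dirname(d_dst) == posixpath.dirname(d_src):
        return "sibling"                     # page in another subdirectory of the parent
    return "move"                            # remaining miscellaneous actions

def abstract_session(pages):
    """Turn a list of visited pages into a sequence of basic actions."""
    return [classify_action(a, b) for a, b in zip(pages, pages[1:])]

example = abstract_session(
    ["/top/index.html", "/top/child/a.html", "/top/child/a.html", "/top/index.html"]
)
```

The first transition in `example` is the down action mentioned in the text.
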
Fig. 3. Example of CFG rules (left) and a parse tree using them (right)

We consider an action sequence generated by a visitor who has achieved the intended goal as a complete sentence in a PCFG. We parse it using rules as in Fig. 3 (left) and obtain a parse tree as illustrated there (right). The CFG rules (left) describe possible goal-subgoal structures behind visitors’ action sequences.

Since diverse visitors visit a website with diverse goals in mind, we capture their action sequences \(\mathbf {w}\) in terms of a mixture of PCFGs \(P(\mathbf {w} \mid N^1) = \sum _A P^A(\mathbf {w} \mid A)P(A \mid N^1)\) where \(P^A(\mathbf {w} \mid A)\) is the probability of \(\mathbf {w}\) being generated by a visitor whose goal is represented by a nonterminal A and \(P(A \mid N^1)\) is the probability of A being derived from the start symbol \(N^1\) respectively. We call such A a goal-nonterminal and assume that there is a unique rule \(N^1\rightarrow A\) for each goal-nonterminal A with a parameter \(\theta _{N^1\rightarrow A}= P(A \mid N^1)\).

Finally to make it possible to estimate visitor goals from incomplete sequences, we replace a sentence probability \(P^A(\mathbf {w} \mid A)\) in a mixture of PCFGs \(P(\mathbf {w} \mid N^1) = \sum _A P^A(\mathbf {w} \mid A)P(A \mid N^1)\) by a prefix probability \(P_\mathrm{pre}^A(\mathbf {w} \mid A)\). We call this method the prefix method.

Suppose a prefix \(\mathbf {w}_k\) with length k is given as an action sequence. We estimate the most likely goal-nonterminal \(A^*\) for \(\mathbf {w}_k\) by
$$\begin{aligned} A^* = \mathop {\mathrm{argmax}}\limits _{A} P_\mathrm{pre}^{A}(\mathbf {w}_k)P(A \mid N^1) \end{aligned} \tag{2}$$
where A ranges over possible goal-nonterminals. \(P_\mathrm{pre}^{A}(\mathbf {w}_k)\) is computed just like \(P_\mathrm{pre}^{N^1}(\mathbf {w})\) in the previous section.
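
Given prefix probabilities and mixture weights, the argmax itself is straightforward, as the following Python sketch shows. The goal names and all numbers are hypothetical, and the prefix probabilities are assumed to have been computed beforehand, for example by solving the linear equations of the previous section.

```python
def most_likely_goal(prefix_probs, priors):
    """Return the goal-nonterminal A maximizing P_pre^A(w_k) * P(A | N^1)."""
    return max(priors, key=lambda a: prefix_probs[a] * priors[a])

# Hypothetical prefix probabilities P_pre^A(w_k) for one observed prefix.
prefix_probs = {"goal_a": 0.02, "goal_b": 0.05}
# Hypothetical mixture weights P(A | N^1), i.e., the parameters of rules N^1 -> A.
priors = {"goal_a": 0.7, "goal_b": 0.3}

a_star = most_likely_goal(prefix_probs, priors)
```

In this example `goal_b` wins although its prior is smaller, because its prefix probability is sufficiently larger (0.015 against 0.014).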

5 Comparative Experiment

In this section, we empirically evaluate our prefix method and compare it to existing methods: the PCFG method and logistic regression. The PCFG method naively uses a PCFG. It applies a mixture of PCFGs to action sequences \(\mathbf {w}_k\) by assuming that every sequence is a sentence. The most likely goal-nonterminal is estimated by replacing \(P_\mathrm{pre}^{A}(\mathbf {w}_k)\) in Eq. (2) with \(P^{A}(\mathbf {w}_k)\).

We also compare the prefix method with logistic regression which is a popular discriminative model that does not assume any structure behind data unlike the prefix and PCFG methods. For a fixed length k, the most likely visitor goal is estimated from \(\mathbf {w}_k\) considered as a feature vector where features are the five basic visitor actions introduced in Sect. 4.

5.1 Data Sets and the Universal Session Grammar

We first prepare three data sets of action sequences by preprocessing the web server logs of U of S (University of Saskatchewan), ClarkNet and NASA [1] in the Internet Traffic Archive [7]. Solely for convenience, we regard action sequences with length greater than 20 as sentences and exclude those with length greater than 30, as computation on the latter is too costly. In this way we prepared three data sets of action sequences, referred to here as U of S, ClarkNet and NASA, containing 652, 4523 and 2014 action sequences respectively.

We next specify a CFG to build a mixture of PCFGs which is applied to the data sets. Doing so in turn requires determining the number of goal-nonterminals. In other words, we have to decide how many goals or intentions visitors have when visiting a website. So we performed clustering on action sequences assuming that one cluster corresponds to one goal, i.e., the number of clusters gives the number of goal-nonterminals. We used a mixture of PCFGs again for clustering. As a result of clustering, we got five clusters which are listed in Table 1.
Table 1. Result of clustering

Cluster (goal-nonterminal) | Features and major action
 | up/down moves in the hierarchy of a website
 | up/down moves in the hierarchy of a website + reload the same page
 | access to the same layer
 | access to the same layer + reload the same page

Finally we manually expand the small CFG used for clustering into a large CFG called the universal session grammar that has five goal-nonterminals corresponding to five visitor clusters in Table 1. Some of the rules concerning Survey are listed in Table 2. The universal session grammar contains 102 rules and 32 nonterminals and reflects our observation that visitors have different action patterns between initial, middle and final parts of a session.
Table 2. Part of the universal session grammar

5.2 Evaluation of the Prefix Method

We apply the prefix method to the task of estimating the visitors’ goals from prefixes of action sequences and record the estimation accuracy while varying prefix length. We also apply a mixture of hidden Markov models (HMMs) as a reference method.

To prepare a teacher data set to measure accuracy, we need to label each action sequence with the visitor’s true intention or goal, which is however practically impossible. As a substitute, we define the correct top-goal for an action sequence in a data set to be the most likely goal-nonterminal for the sequence estimated by a mixture of PCFGs with the universal session grammar whose parameters are learned by the EM algorithm from the data set. This strategy seems to work as long as the universal session grammar is reasonably constructed.

In the experiment, accuracy is measured by five-fold cross-validation for each prefix length k \((2 \le k \le 20)\). After parameter learning on a training data set, prefixes with length k are cut out from action sequences in the test set, and their most likely goal-nonterminals are estimated and compared against the correct top-goals labeling them. Figure 4 shows accuracy for each k with standard deviation.
Fig. 4. Accuracy for U of S, ClarkNet and NASA

Here Prefix denotes the prefix method, PCFG the PCFG method and Log-Reg logistic regression. We also add HMM for comparison, which uses a mixture of HMMs instead of a mixture of PCFGs.

Figure 4 clearly demonstrates that the prefix and PCFG methods outperform logistic regression and HMM when the prefix is long. In fact, all differences at prefix length \(k = 20\) in the graph are statistically significant, as confirmed by a t-test at the 0.05 significance level. We can also observe that, as the prefix gets shorter, the PCFG method rapidly deteriorates whereas the prefix method maintains fairly good performance comparable to logistic regression and HMM.
Fig. 5. A prefix parse tree for an action sequence in the NASA data

At this point we would like to emphasize that our approach can produce a most-likely plan for the estimated goal by the Viterbi algorithm which runs on cyclic explanation graphs. We show an example of estimated plan in Fig. 5. In this figure, the purple node is the estimated goal, the green internal nodes are subgoals and the red leaf nodes stand for actions and web pages accessed by the visitor. Parse trees like this visualize the visitor’s plan and help a web manager improve the website. For example in this case, the actions taken from No. 4 to No. 6 are recognized as “Search” and hence we might be able to help the visitor’s search by adding a link from the page of No. 3 to that of No. 7.

5.3 Discussion

In the previous section, we experimentally compared the prefix method to existing methods using three probabilistic models: PCFG, logistic regression and HMM. Here we have a closer look at the results of the experiment.

The first observation is that the PCFG method shows poor accuracy when the prefix length is less than 10. This is thought to be caused by a mismatch between the model that assumes the observed data is complete and the incomplete data given as input.

The second one is that the accuracy of the prefix method is higher than or equal to that of logistic regression, a standard discriminative model, when the prefix is long. Nonetheless, when the prefix is short, logistic regression outperforms the grammatical methods. We interpret this as follows. Our grammatical methods need to correctly identify the most likely parse tree for the top-goal, but this identification becomes quite difficult for short sequences as they give little information on correct parse trees. Misidentification therefore occurs easily, which leads to poor performance.

The third one is a notable difference in accuracy between our method and the HMM method. It might be ascribed to the fact that we used the universal session grammar to decide correct answers in the experiment. The use of it as a criterion for accuracy causes a substantial disadvantage to HMM which is a special case of PCFG and not as expressive as the universal session grammar.

The last observation is that the degree of difference in accuracy depends on the data set. For example, the difference between the accuracy of the prefix and PCFG methods and that of the other methods is small in the U of S data set whereas it becomes larger in the NASA data set, especially when the prefix is long, as seen in Fig. 4. To understand this phenomenon, we computed the entropy of each PCFG model. The entropies for U of S, ClarkNet and NASA are \(5.14 \times 10^4\), \(2.77 \times 10^5\) and \(3.14 \times 10^6\) respectively. What this entropy computation reveals is that, in the experiment, higher accuracy of the prefix and PCFG methods relative to the other methods co-occurs with higher entropy of the data model. We do not think this co-occurrence is accidental. Entropy is an indicator of the uncertainty of a probability distribution, and in PCFGs it represents the uncertainty of parse trees. Hence when the data model is simple and the entropy is low, the identification of a correct goal is easy, and simple methods such as HMM and logistic regression are comparable to the prefix and PCFG methods. However when the entropy is high, as in the case of the NASA data set, these simple approaches fail to disambiguate the correct parse tree, and the prefix and PCFG methods, which exploit structural information in the input data, outperform them, particularly when the data is long.

6 Infix Probability Computation: Beyond Prefix Probability Computation

Up until now we have only considered prefixes, which describe session logs in the online situation or those of unachieved visitors. In the cross-site situation, however, infixes need to be introduced. Compared to prefix probability computation, infix probability computation is much harder, and early attempts put some restrictions on it. However, Nederhof and Satta recently proposed a completely general method to compute infix probability that solves a set of non-linear equations [10]. One thing to note is that their method is purely numerical and yields no parse trees for prefixes or infixes, and hence cannot be used for Viterbi inference to infer the most likely parse tree for a given infix. In contrast, our approach can yield parse trees for infixes as well as for prefixes.

6.1 Nederhof and Satta’s Algorithm

An infix \(\mathbf {v}\) in a PCFG \(\mathsf G\) is a substring of a sentence written as \(\mathbf {uvw}\) for some terminal sequences \(\mathbf {u}\) and \(\mathbf {w}\). The infix probability \(P_\mathrm{in}^{N^1}(\mathbf {v})\) is defined as
$$\begin{aligned} P_\mathrm{in}^{N^1}(\mathbf {v}) = \sum _{\mathbf {u},\mathbf {w}} P_\mathsf{G}(\mathbf {uvw}) \end{aligned}$$
where \(\mathbf {u}\) and \(\mathbf {w}\) range over strings such that \(\mathbf {uvw}\) is a sentence. According to Nederhof and Satta [10], \(P_\mathrm{in}^{N^1}(\mathbf {v})\) is computed by first constructing an intersection PCFG \(\mathsf G' = \mathsf G\cap \mathsf{FA}\) of \(\mathsf G\) and a finite automaton \(\mathsf{FA}\) which accepts every string containing \(\mathbf {v}\), and second by computing the sum of probabilities of all sentences derived from \(\mathsf G'\). The second computation is reduced to solving a set of multi-variate polynomial equations (details omitted).
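
As an illustration of the quantity being computed, the sketch below approximates an infix probability by brute force: it enumerates the sentences of a toy PCFG up to a length bound and sums the probabilities of those containing the infix, counting each sentence once as the automaton-based construction does. The grammar S -> a S (0.5) | b (0.5), the helper names and the length bound are assumptions for illustration only; this is not Nederhof and Satta's algorithm, and the enumeration terminates only for grammars in which every recursive rule adds at least one terminal.

```python
from collections import deque

# Hypothetical toy PCFG (not from the paper): S -> a S (0.5) | b (0.5).
RULES = {"S": [(("a", "S"), 0.5), (("b",), 0.5)]}

def sentences(start="S", max_len=20):
    """Yield (sentence, probability) pairs by leftmost derivation,
    keeping only sentential forms with at most max_len terminals."""
    queue = deque([((start,), 1.0)])
    while queue:
        form, prob = queue.popleft()
        # Position of the leftmost nonterminal, or None if form is a sentence.
        idx = next((i for i, sym in enumerate(form) if sym in RULES), None)
        if idx is None:
            yield "".join(form), prob
            continue
        for rhs, p in RULES[form[idx]]:
            new_form = form[:idx] + rhs + form[idx + 1:]
            if sum(sym not in RULES for sym in new_form) <= max_len:
                queue.append((new_form, prob * p))

def infix_probability(v, max_len=20):
    """Truncated sum of P(sentence) over sentences that contain v."""
    return sum(prob for sent, prob in sentences(max_len=max_len) if v in sent)

approx = infix_probability("aa")
```

The sentences of this grammar are \(a^k b\) with probability \(0.5^{k+1}\), so the exact infix probability of "aa" is \(\sum _{k \ge 2} 0.5^{k+1} = 0.25\), which the truncated sum approaches as the length bound grows.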
The problem here is that while their algorithm is completely general, building the intersection PCFG \(\mathsf G'\) contains redundancy. Let \(A \rightarrow B C\) be a CFG rule in \(\mathsf G\) and \(\{s_0,\ldots ,s_n \}\) a set of states in \(\mathsf{FA}\). To create \(\mathsf G'\), rules of the form \(\langle s_iAs_k\rangle \rightarrow \langle s_iBs_j\rangle \langle s_jCs_k\rangle \) are constructed for every possible combination of states \(s_i,s_j,s_k\) \((0 \le i, j, k \le n)\)\(^{13}\), but many of these rules are not used to derive a sentence and need to be removed as useless rules.
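
The redundancy can be seen on a very small case. The sketch below builds the naive intersection rules for a hypothetical grammar S -> S S | a and a two-state automaton accepting strings over {a} that contain "a", then keeps only the binary rules whose nonterminals can derive some terminal string. The grammar, the automaton, and the function names are illustrative assumptions; only the productivity check is shown, not the full removal of useless rules.

```python
from itertools import product

STATES = [0, 1]
TRANS = {(0, "a", 1), (1, "a", 1)}  # hypothetical FA: strings over {a} containing "a"
BINARY = [("S", "S", "S")]          # S -> S S
LEXICAL = [("S", "a")]              # S -> a

def naive_intersection():
    """Build <i A k> -> <i B j> <j C k> for every state triple (i, j, k)."""
    binary = [((i, A, k), (i, B, j), (j, C, k))
              for (A, B, C) in BINARY
              for i, j, k in product(STATES, repeat=3)]
    lexical = [((i, A, j), w)
               for (A, w) in LEXICAL
               for (i, x, j) in TRANS if x == w]
    return binary, lexical

def productive_rules(binary, lexical):
    """Keep binary rules whose head and both children derive a terminal string."""
    productive = {head for head, _ in lexical}
    changed = True
    while changed:
        changed = False
        for head, left, right in binary:
            if head not in productive and left in productive and right in productive:
                productive.add(head)
                changed = True
    return [rule for rule in binary if all(nt in productive for nt in rule)]

binary, lexical = naive_intersection()
kept = productive_rules(binary, lexical)
```

Here one binary rule and two states already give \(2^3 = 8\) candidate rules, of which only two survive the check; with \(n+1\) states each binary rule yields \((n+1)^3\) candidates.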
Fig. 6. Infix parser \(DB_2\)

6.2 Infix Parsing and Cyclic Explanation Graphs

To avoid building redundant rules by blindly combining states and removing them later, we propose to introduce parsing into Nederhof and Satta’s algorithm. More concretely, we parse an infix L by the PRISM program in Fig. 6. It is a modification of the prefix parser in Fig. 1 that faithfully simulates the parsing action of the intersection PCFG \(\mathsf G'\).

This program differs from the prefix parser in that an input infix \(\mathbf {w} = w_1\cdots w_n\) is asserted in the memory as a sequence of state transitions \(\mathtt{tr(}0,w_1,\mathtt{1)}, \ldots , \mathtt{tr(}n-1,w_n,n\mathtt{)}\), together with other transitions constituting the finite automaton \(\mathsf{FA}\). In the program, tr(S0,A,S1) represents a state transition from S0 to S1 by a word A in the infix. infix_pcfg(S0,S2,\(\alpha \)) states that \(\alpha \), a sequence of terminals and nonterminals, spans a terminal sequence which causes a state transition of \(\mathsf{FA}\) from S0 to S2. Parsing an infix by the infix parser in Fig. 6 yields an explanation graph which is (mostly) cyclic and is converted to a set of probability equations, just as in the case of prefix probability computation. Unlike prefix probability, though, the probability equations for an infix are (usually) non-linear, and we solve them by Broyden’s method, a quasi-Newton method. In this way, we can compute infix probability by way of cyclic explanation graphs. In addition, this program produces the most likely infix parse trees by the Viterbi algorithm on cyclic explanation graphs as explained in Sect. 3.
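
The transitions described above can be sketched as follows. The text does not spell out the "other transitions", so this example assumes one possible choice: a deterministic, KMP-style automaton whose final state is absorbing. The function names `infix_automaton` and `accepts` are hypothetical, and the triples mirror tr(S0,A,S1) only in shape.

```python
def infix_automaton(infix, alphabet):
    """Return transitions (s0, symbol, s1) of a DFA accepting every string
    that contains `infix` (a list of words). State n = len(infix) is final."""
    n = len(infix)
    trans = []
    for state in range(n + 1):
        for sym in alphabet:
            if state == n:
                nxt = n  # once the infix has been seen, stay in the final state
            else:
                # Longest prefix of the infix that is a suffix of what was read.
                seen = infix[:state] + [sym]
                nxt = min(n, len(seen))
                while nxt > 0 and seen[-nxt:] != infix[:nxt]:
                    nxt -= 1
            trans.append((state, sym, nxt))
    return trans

def accepts(trans, n, words):
    """Run the DFA from state 0 and report whether it ends in state n."""
    table = {(s0, sym): s1 for s0, sym, s1 in trans}
    state = 0
    for word in words:
        state = table[(state, word)]
    return state == n

trans = infix_automaton(["a", "b"], ["a", "b"])
```

For the infix a b, the result contains the spine (0, a, 1), (1, b, 2) that corresponds to the asserted transitions for the infix itself, plus the fallback and absorbing transitions needed to accept any surrounding context.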

We experimentally applied the infix method to web session log data and obtained results similar to those in Fig. 4 (details omitted). However, a set of non-linear equations for infix probability computation has multiple roots, and the solution given by Broyden’s method depends critically on the initial value. Moreover, since Broyden’s method is a general solver for non-linear equations, its solution is not necessarily constrained to be between 0 and 1 and is in fact often invalid. How to obtain a valid and stable solution of non-linear equations for infix probability computation remains an open problem.
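
The dependence on the initial value can be reproduced on a single equation. The sketch below uses the secant method, to which Broyden’s method reduces in one dimension, on the hypothetical equation \(x = 0.4x^2 + 0.6\), the kind of equation that arises for a rule pair S -> S S (0.4) | a (0.6). The equation, the starting points and the function names are assumptions for illustration and are not taken from the experiments.

```python
def f(x, p=0.4):
    """Residual of the toy probability equation x = p*x**2 + (1 - p)."""
    return p * x * x - x + (1 - p)

def secant(x0, x1, tol=1e-12, max_iter=100):
    """One-dimensional secant iteration started from x0 and x1."""
    f0, f1 = f(x0), f(x1)
    for _ in range(max_iter):
        if abs(f1) < tol or f1 == f0:
            break
        x0, x1 = x1, x1 - f1 * (x1 - x0) / (f1 - f0)
        f0, f1 = f1, f(x1)
    return x1

valid_root = secant(0.0, 0.1)    # converges to the root at 1.0
invalid_root = secant(2.0, 1.9)  # converges to the root at 1.5
```

Both values solve the equation, but only the first lies in \([0, 1]\) and can be read as a probability; which one is returned is decided purely by the starting point.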

7 Conclusion

We have proposed new goal and plan recognition methods based on prefix and infix probability computation via parse trees in PCFGs. They can identify users’ goals and plans behind their incomplete action sequences. We conducted a comparative experiment on identifying visitors’ goals at a website using three real data sets, comparing the prefix and infix methods introduced in this paper, the PCFG method that always treats action sequences as complete sentences, the HMM method that uses a mixture of HMMs instead of a mixture of PCFGs, and logistic regression. The result empirically demonstrates the superiority of our approach for long (but incomplete) action sequences.

Another contribution is that our approach removes computational redundancy in Nederhof and Satta’s method [10] and also gives infix and prefix parse trees as a side effect. We implemented our approach on top of PRISM2.2\(^{14}\), which supports prefix and infix parsing and the subsequent probability computation, in particular by automatically solving a set of linear and non-linear probability equations for prefix and infix probability computation respectively.


In this paper, we distinguish goal recognition and plan recognition; the former is a task of identifying a goal from actions but the latter means to discover a plan consisting of goal-subgoal structure to achieve the goal.


The probability of a prefix in a PCFG is defined to be the sum of probabilities of infinitely many sentences extending it and computed by solving a set of linear equations derived from the CFG [8]. Also there is prefix probability computation based on probabilistic Earley parsing [15].


\(\mathrm{expl}_0(G)\) is equivalent to G in view of the distribution semantics. When convenient, we treat \(\mathrm{expl}_0(G)\) as a bag \(\{e_1, e_2, \cdots , e_k\}\) of explanations.


This is justified because we assume the consistency of PCFGs [16] that implies the probability of remaining nonterminals in R yielding some terminal sequences is 1.


probf/1 is a PRISM built-in predicate that displays an explanation graph.


\(\mathtt{W} = 1\) because pre_pcfg([a],[a],[]) is logically proved without involving msws.


Clustering was done by PRISM. We used a small CFG for clustering, containing 30 rules and 12 nonterminals, because clustering by a mixture of large PCFGs tends to suffer from very high memory usage. To build this grammar, we merged similar symbols such as InternalSearch and Search in the universal session grammar shown in Table 2.


It is conducted on a PC with Core i7 Quad 2.67 GHz, OpenSUSE 11.4 and 72 GB main memory.


We applied a PCFG to prefixes by treating them as if they were sentences. In this experiment, we found that the universal session grammar fails to parse at most two sequences for each data set, so we can ignore these sequences.


We used a left-to-right HMM where the number of states is varied from 2 to 8. In Fig. 4, only the highest accuracy is plotted for each k. Since logistic regression only accepts fixed-length data, we prepared 19 logistic regression models, one for each length k \((2 \le k \le 20)\).


We used PRISM to implement a mixture of HMMs and a mixture of PCFGs, and also to compute prefix probability. For the implementation of logistic regression we used the ‘nnet’ package of R.


The entropy is defined as \(- \sum _{\tau } P(\tau )\log P(\tau )\) where \(\tau \) is a possible parse tree [2]. In our setting, a common grammar, the universal session grammar, is used for all data sets. So the entropy only depends on the parameters of a PCFG learned from the data set.


This simulates a state transition of \(\mathsf{FA}\) made by a string derived from the nonterminal A using \(A \rightarrow B C\).


Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Graduate School of Information Science and Engineering, Tokyo Institute of Technology, Meguro-ku, Tokyo, Japan
  2. AI research center, AIST, Koto-ku, Tokyo, Japan
