Algorithms for learning parsimonious context trees
Abstract
Parsimonious context trees, PCTs, provide a sparse parameterization of conditional probability distributions. They are particularly powerful for modeling context-specific independencies in sequential discrete data. Learning PCTs from data is computationally hard due to the combinatorial explosion of the space of model structures as the number of predictor variables grows. Under the score-and-search paradigm, the fastest algorithm for finding an optimal PCT, prior to the present work, is based on dynamic programming. While the algorithm can handle small instances fast, it becomes infeasible already when there are half a dozen four-state predictor variables. Here, we show that common scoring functions enable the use of new algorithmic ideas, which can significantly expedite the dynamic programming algorithm on typical data. Specifically, we introduce a memoization technique, which exploits regularities within the predictor variables by equating different contexts associated with the same data subset, and a bound-and-prune technique, which exploits regularities within the response variable by pruning parts of the search space based on score upper bounds. On real-world data from recent applications of PCTs within computational biology the ideas are shown to reduce the traversed search space and the computation time by several orders of magnitude in typical cases.
Keywords
Exact algorithms, Structure learning, Context-specific independence, Branch and bound, Sequential data
1 Introduction
Univariate conditional distributions play a central role in various multivariate probabilistic models, such as Markov models, hidden Markov models, Bayesian networks, and general hierarchical graphical models. Ideally, each conditional distribution either involves only a few conditioning variables or one can assume the conditional distribution to take some simple form, for example, a linear model. In practice, neither case may apply, and we encounter the curse of dimensionality: the representation size of the conditional distribution, which usually is proportional to the number of free parameters, grows exponentially in the number of variables.
The concept of context-specific independence (Boutilier et al. 1996) provides an appealing approach to deal with the curse of dimensionality. Context-specific independence takes place when, after fixing some of the conditioning variables to certain states, called a context, the remaining variables provide no additional information about the response variable, that is, the response variable is independent of the rest given the context. Examples of general-purpose model classes that are based on the notion of context-specific independence include decision trees (Breiman et al. 1984; Quinlan 1986; Buntine 1992; Chipman et al. 1998), decision graphs (Oliver 1993; Chickering et al. 1997; Jaeger et al. 2006), chain event graphs (Smith and Anderson 2008), multilinear functions (Chavira and Darwiche 2005), conditional independence trees (Su and Zhang 2005), and conditional probabilistic sentential decision diagrams (Shen et al. 2018).
To address the shortcoming of context trees (CTs), Bourguignon and Robelin (2004) proposed parsimonious context trees (PCTs). PCTs generalize CTs by identifying a context with a selection of state subsets for the explanatory variables, which yields a much wider range of admissible tree structures in relation to CTs (Fig. 1). This includes the important capability of (context-specifically) “skipping” a position partially or entirely, allowing in effect for a compact representation and statistically efficient learning even in the presence of long-range dependencies.
PCTs have found recent applications particularly within computational biology, where modeling sequential data over discrete alphabets constitutes a recurring challenge. Seifert et al. (2012) used PCTs for augmenting higher-order hidden Markov models to improve Array-CGH analysis. Another well-studied application models DNA sequence patterns that are of importance for gene regulation (Eggeling et al. 2014a, 2015b). Here, PCTs augment an inhomogeneous Markov model that can be viewed as a Bayesian network of fixed structure where the parents of each variable are the direct predecessors in the sequence.
Such an inhomogeneous parsimonious Markov model has several advantages for the given application domain. First, it yields favorable predictive performance in relation to alternative models such as Bayesian networks (Barash et al. 2003); see the study in Eggeling et al. (2014a) and Sect. 7.8 in this article. Second, it can be used for unsupervised learning tasks, such as de novo motif discovery (Eggeling et al. 2014a, 2015b, 2017) or as a component of a mixture model (Eggeling et al. 2017; Eggeling 2018), where learning is possible only through an iterative approach such as the EM algorithm (Dempster et al. 1977) or variants thereof (Nielsen 2000; Fujimaki and Morinaga 2012). Third, it allows for an intuitive model visualization through a conditional sequence logo (Eggeling et al. 2017) that is a direct generalization of the popular sequence logo (Schneider and Stephens 1990). Finally, it can be easily generalized to capture distal dependencies by relaxing the assumption of a fixed Bayesian network backbone structure (Eggeling 2018).
Irrespective of the concrete application, however, structure learning of PCTs is very challenging from a computational point of view, which has been considered a drawback of the model (Leonardi 2006). The reason is the relaxed structural constraints: even if a node is labeled by the full alphabet, which stands for context-specific independence, the node can be succeeded by a nontrivial subtree; hence the structure search cannot be stopped once some context-specific independencies have been found, but the whole space of possible structures has to be considered. Bourguignon and Robelin (2004) proposed a dynamic programming (DP) algorithm that is capable of finding a PCT of a given maximum depth \(d\) so as to maximize a given decomposable scoring function without explicitly enumerating all PCTs; a score is decomposable if it is the sum of so-called leaf scores. However, this algorithm still has to consider each potential leaf node that could occur in a valid PCT, the number of which grows exponentially in \(d\).
In this article, we present techniques for enhancing the basic DP algorithm of Bourguignon and Robelin (2004), with the aim of significantly speeding up the structure learning of PCTs. Our central observation is that the basic DP algorithm makes essentially no assumptions about the structure of the scoring function. Put otherwise, for the common scoring functions used in practice, we should be able to enhance the algorithm by exploiting the particular form of the scoring function. Indeed, we will show that we can exploit regularities in the data to reduce the computational burden of finding an optimal PCT. There are two types of regularities, which can be capitalized upon by two different ideas respectively.
On the one hand, there are regularities among the realizations of the conditioning variables, which we can utilize: we store entire optimized subtrees (actually only their scores) in memory for possible later reuse. This idea, which we call memoization, has the drawback of being memory intensive. For that reason, we also investigate a parameterized extension of the idea that allows us to trade time for space.
On the other hand, there are regularities within the response variable. We exploit the regularities by devising two pruning rules, a stopping rule and a deletion rule, which allow us to ignore subproblems that are guaranteed to not contribute to an optimal PCT. The deletion rule resembles a simple pruning rule (Teyssier and Koller 2005) that is nowadays standard in structure learning in Bayesian networks: while that rule concerns the “is subset of” relation on candidate parent sets, our deletion rule concerns the “refines” relation on set partitions of the alphabet. To effectively apply the pruning rules in practice, we derive score upper bounds based on the properties of the considered concrete scoring functions, similar in spirit to the bounds of Tian (2000) and de Campos and Ji (2011) for structure learning in Bayesian networks.
We evaluate the performance of the individual techniques alone and in concert on real-world data sets from the domain of computational biology. We use two data sets as running examples for demonstrating the detailed effects of parameter settings that control the algorithmic complexity. We further present an exhaustive study on a large variety of data sets that show different degrees of regularity among the input variables. These studies show that the proposed ideas can be highly effective in many cases, yielding speedups of up to two orders of magnitude for typical data sets.
This article is based on and considerably extends our preliminary work published in two conference papers (Eggeling et al. 2015a; Eggeling and Koivisto 2016). The first paper introduced the memoization idea, but it did not consider the parameterized extension that allows for trading time for space. The second paper introduced pruning techniques; however, the study was restricted to the BIC score (Schwarz 1978) and only derived a relatively simple bound that we will refer to as the coarse bound. The present work extends this path of research by deriving a substantially tighter bound, which we will call the fine bound, and by making the bounds applicable also for other related scoring functions such as the AIC score (Akaike 1974). Due to these major methodological developments, the experimental studies are completely new, covering a larger number of data sets and instantiations of the proposed algorithms.
The remainder of this article is organized as follows. Section 2 contains a technical recap of PCTs, including a formal definition of the model and the structure learning problem, a description of the basic DP algorithm of Bourguignon and Robelin (2004), and a visual interpretation. In Sect. 3, we present the memoization technique in its plain variant as well as a parameterized version for limited-memory usage. We then describe the pruning ideas: Sect. 4 gives the upper bounds on the scoring function; the pruning rules that rely on these bounds are given in Sect. 5. Next, we describe the interplay of all different algorithmic ingredients in a final algorithm in Sect. 6. We report on the case studies in Sect. 7 and conclude the article with some discussions and final remarks in Sect. 8.
2 Parsimonious context trees
In this section, we revisit the definition of a parsimonious context tree (PCT) and a score-and-search approach to structure learning of PCTs. We also describe the basic dynamic programming algorithm of Bourguignon and Robelin (2004) and its interpretation as subtree selection in a so-called extended PCT.
2.1 Basic definitions
Let \(\varOmega \) be a finite set. A rooted, balanced, node-labeled tree of depth \(d\) is called a parsimonious context tree (PCT) over \(\varOmega \) if the node labels satisfy the following property: for each node at depth \(\ell < d\) the labels of the node’s children form a set partition of \(\varOmega \), that is, the labels of the children are pairwise disjoint nonempty subsets of \(\varOmega \) whose union is \(\varOmega \). We call the set \(\varOmega \) the alphabet and its members symbols.
We identify each node of a PCT with the sequence of labels \({\mathbf {V}}= V_\ell \cdots V_1\) of the nodes on the unique path from the node up to, but excluding, the root; here and henceforth we write the labels in the reversed order. We may interpret the node \({\mathbf {V}}\) as the set \(\bigcup _{j \ge 0}\big (\varOmega ^j \times V_\ell \times \cdots \times V_1\big )\), which consists of the sequences over \(\varOmega \) whose length is at least \(\ell \) and whose \(i\)th symbol, counted from the end of the sequence, belongs to \(V_i\) for \(i = \ell ,\ldots ,1\). Following this interpretation, we say that a sequence \({\mathbf {x}}\) matches \({\mathbf {V}}\) if \({\mathbf {x}}\in {\mathbf {V}}\). It follows that the leaves of a PCT of depth \(d\) correspond to a set partition of the set of all sequences over \(\varOmega \) whose length is at least \(d\). Furthermore, each PCT corresponds to a distinct partition.
Given a PCT \({\mathcal {T}}\) and its node \({\mathbf {V}}\), we denote by \({\mathcal {T}}({\mathbf {V}})\) the subtree of \({\mathcal {T}}\) rooted at \({\mathbf {V}}\). We say that the subtree is minimal if it consists of a single chain of nodes down to a single leaf, thus all nodes labeled by \(\varOmega \); we say that the subtree is maximal if it consists only of nodes labeled by singletons \(\{a\}\subseteq \varOmega \), thus having \(|\varOmega |^{d-\ell ({\mathbf {V}})}\) leaves; here and henceforth \(\ell ({\mathbf {V}})\) denotes the depth of node \({\mathbf {V}}\).
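The matching relation defined above can be illustrated with a few lines of Python. This is a minimal sketch: the function name `matches` and the representation of a node as a tuple of symbol sets in the order \(V_\ell , \ldots , V_1\) are assumptions made for illustration and are not taken from the original implementation.

```python
def matches(x, node):
    """Return True if the sequence x matches node = (V_l, ..., V_1).

    The labels are given in the reversed order used in the text: the last
    label V_1 constrains the last symbol of x, V_2 the one before it, etc.
    """
    depth = len(node)
    if len(x) < depth:
        return False  # sequences shorter than the node depth never match
    suffix = x[len(x) - depth:]  # the last `depth` symbols of x
    return all(symbol in label for symbol, label in zip(suffix, node))


# The node {C,G}{T} is matched by every sequence ending in CT or GT.
example_node = ({"C", "G"}, {"T"})
example_result = matches("ACGT", example_node)
```

The root corresponds to the empty tuple and is matched by every sequence, in agreement with the set interpretation for \(\ell = 0\).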
2.2 The structure learning problem
Table 1 Complexity of PCT learning
(a) Number of valid PCTs
\(d\)  \(|\varOmega |=3\)  \(|\varOmega |=4\)
1  5  15
2  205  72,465
3  \(8.74 \times 10^{6}\)  \(2.75 \times 10^{19}\)
4  \(6.68 \times 10^{20}\)  \(5.78 \times 10^{77}\)
5  \(2.98 \times 10^{62}\)  \(1.12 \times 10^{311}\)
(b) Size of extended PCT
\(d\)  \(|\varOmega |=3\)  \(|\varOmega |=4\)
1  7  16
2  57  241
3  400  3616
4  2801  54,241
5  19,608  813,616
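The number of valid PCTs can be recomputed from the definition in Sect. 2.1: the children of the root form a set partition of \(\varOmega \) into \(k\) blocks, and each child roots an arbitrary PCT of depth one less. The following sketch counts PCTs this way, using Stirling numbers of the second kind for the number of partitions with \(k\) blocks. The function names are hypothetical and chosen for illustration only.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling2(n, k):
    """Number of set partitions of an n-element set into k nonempty blocks."""
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)


def count_pcts(depth, alphabet_size):
    """Number of distinct PCTs of the given depth over the given alphabet size."""
    count = 1  # depth 0: the tree consists of the root only
    for _ in range(depth):
        # A partition with k blocks gives k children, and each child roots
        # any of the `count` PCTs of the previous (smaller) depth.
        count = sum(stirling2(alphabet_size, k) * count ** k
                    for k in range(1, alphabet_size + 1))
    return count


pcts_depth2_alphabet4 = count_pcts(2, 4)
```

For depth 1 the count reduces to the number of set partitions of the alphabet, and for depth 2 over a four-symbol alphabet it yields 72,465.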
2.3 Basic dynamic programming
Bourguignon and Robelin (2004) presented a dynamic programming (DP) algorithm that finds the maximum score \(S_{{\mathcal {T}}_*}\) without enumerating all distinct PCTs. The algorithm, which we shall call basic DP, relies on the decomposability of the scoring function:
Assumption 1
Proposition 1
2.4 Dynamic programming as search on extended PCT
The inner workings of the algorithm of Bourguignon and Robelin (2004) and the construction of the optimal PCT itself can be viewed as bottom-up reduction of a data structure called extended PCT, as illustrated in Fig. 2. In contrast to a PCT, the sibling nodes in an extended PCT do not partition their parent node, but are labeled by all non-empty subsets of \(\varOmega \). An extended PCT thus contains all possible PCTs as subtrees.
The size of the extended PCT, which essentially determines the complexity of the DP algorithm, grows substantially slower than the number of possible PCTs (Table 1). Yet, the complexity of the algorithm is exponential in the maximum depth of the PCT, and over-exponential in the alphabet size: the algorithm computes the leaf scores of \((2^{|\varOmega |}-1)^d\) leaves of the extended PCT; in addition, it computes an optimal selection of children for \(\sum _{\ell =0}^{d-1}(2^{|\varOmega |}-1)^{\ell }\) inner nodes, each of which takes \(O(3^{|\varOmega |})\) time using a routine we describe in Sect. 5.2.
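To make the \(O(3^{|\varOmega |})\) selection step concrete, the following sketch shows one standard way of finding the best set partition of the alphabet by dynamic programming over subsets, with subsets encoded as bitmasks. Whether this coincides in detail with the routine of Sect. 5.2 is an assumption; the function name `best_partition` and the list-based input format are chosen for illustration.

```python
def best_partition(score, n):
    """Select the children of an inner node.

    score[mask] is the optimal score of the child labeled by the symbol
    subset encoded by the bitmask `mask` (index 0 is unused); n is the
    alphabet size. Returns the best total score and the chosen blocks.
    """
    full = (1 << n) - 1
    best = [0.0] + [float("-inf")] * full
    choice = [0] * (full + 1)
    for mask in range(1, full + 1):
        low = mask & -mask   # the block containing the lowest symbol of mask
        rest = mask ^ low    # symbols that may or may not join that block
        sub = rest
        while True:          # enumerate all submasks of `rest`
            block = sub | low
            candidate = score[block] + best[mask ^ block]
            if candidate > best[mask]:
                best[mask], choice[mask] = candidate, block
            if sub == 0:
                break
            sub = (sub - 1) & rest
    blocks, mask = [], full
    while mask:              # backtrack to recover the chosen partition
        blocks.append(choice[mask])
        mask ^= choice[mask]
    return best[full], blocks


# Two symbols: keeping them apart (-1.0 each) beats merging them (-3.0).
example_total, example_blocks = best_partition([0.0, -1.0, -1.0, -3.0], 2)
```

Fixing the block that contains the lowest remaining symbol avoids considering the same partition in several orders. Summed over all subsets, the inner loop runs \((3^{|\varOmega |}-1)/2\) times.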
2.5 Learning with weighted data
Suppose each data point \(({\mathbf {x}}^{t},y^{t})\), with \(t = 1, \ldots , N\), is associated with a real-valued weight \(w^{t}\ge 0\). The weights may arise from different origins: First, scientific experiments, such as modern high-throughput technologies in DNA sequence analysis (Orenstein and Shamir 2014), may directly produce weighted data. Second, it can be more efficient to store an original data set with many duplicates as weighted data consisting of unique data points where the weight equals the number of occurrences in the original data set. Third, learning with weighted data is needed when the model is a component of a mixture model that is learned using the EM algorithm or variants thereof (Fujimaki and Morinaga 2012); see Eggeling (2018) for a recent application of PCTs in such a scenario.
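The second origin of weights can be sketched as follows: duplicates are collapsed into unique data points whose weight is the number of occurrences. The sketch further assumes that weights enter the computations through weighted symbol counts, which replace plain occurrence counts; this is an illustrative assumption, and both function names are hypothetical.

```python
from collections import Counter


def collapse_duplicates(data):
    """Turn a list of (x, y) pairs into unique (x, y, weight) triples."""
    counts = Counter(data)
    return [(x, y, float(w)) for (x, y), w in counts.items()]


def weighted_counts(weighted_data, alphabet):
    """Weighted count of each response symbol (assumed use of the weights)."""
    totals = {a: 0.0 for a in alphabet}
    for _, y, w in weighted_data:
        totals[y] += w
    return totals


raw = [("AC", "G"), ("AC", "G"), ("TT", "A")]
weighted = collapse_duplicates(raw)
totals = weighted_counts(weighted, "ACGT")
```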
3 Memoization
The basic dynamic programming algorithm of Bourguignon and Robelin (2004), described in the previous section, only exploits the decomposability of the scoring function (Assumption 1). Fortunately, the commonly used scoring functions also share other properties that enable further computational savings. In this section, we formalize a sufficient condition under which two different subproblems (i.e. nodes of the extended PCT) must have equal solutions and thus need to be solved only once; we call the resulting technique memoization.
3.1 Storing and reusing solved subproblems
For a node \({\mathbf {V}}\) of an extended PCT, write \(I({\mathbf {V}}):=\{ t : {\mathbf {x}}^{t} \in {\mathbf {V}}\}\) for the set of indices of data points that match the node. We will make use of the following data locality property, which we, again, formalize as an assumption.
Assumption 2
(Data locality) If two leaves \({\mathbf {V}}\) and \({\mathbf {V}}'\) of an extended PCT are matched by the same data points, \(I({\mathbf {V}})=I({\mathbf {V}}')\), then their scores are equal, \(S({{\mathbf {V}}}) = S({\mathbf {V}}')\).
In essence, data locality means that the leaf score \(S({\mathbf {V}})\) depends on the leaf \({\mathbf {V}}\) only through the data subset indicated by \(I({\mathbf {V}})\). If the property holds for the leaf scores, then, due to Eq. 7, it also holds for the optimal scores of the inner nodes of any fixed depth:
Proposition 2
(Memoization) Let \({\mathbf {V}}\) and \({\mathbf {V}}'\) be two nodes such that \(I({\mathbf {V}})=I({\mathbf {V}}')\) and \(\ell ({\mathbf {V}})=\ell ({\mathbf {V}}')\). Then \(S_*({\mathbf {V}}) = S_*({\mathbf {V}}')\).
Assumption 2 is fulfilled, for instance, by the BIC score, and more generally by most practically relevant scoring functions that can be written in terms of a penalized maximum-likelihood score. A notable exception, however, is the Bayes score with context-dependent hyperparameters (Eggeling et al. 2014b), comparable to the family of BDeu scores for Bayesian networks (Heckerman et al. 1995).
The effectiveness of the memoization rule is data dependent. For small data sets but deep trees, the rule is expected to apply often, for then the number of nodes gets large while the number of distinct data subsets that match high-depth nodes gets small. Likewise, the relative gain is expected to be higher the larger the alphabet is. Also, the memoization rule is likely to apply more frequently on highly structured data than on random data.
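The interplay of the recursion over the extended PCT and the memoization rule can be sketched as follows. The sketch assumes a BIC-type leaf score \(L({\mathbf {V}})-K\) and unweighted data, and it uses the pair of index set and depth as the key of the memo table. All function and variable names are hypothetical, and the code is not meant to mirror the original Java implementation.

```python
import math
from collections import Counter


def leaf_score(idx, ys, K):
    """Penalized maximum log-likelihood L(V) - K of a leaf matched by idx."""
    counts = Counter(ys[t] for t in idx)
    n = len(idx)
    return sum(c * math.log(c / n) for c in counts.values()) - K


def best_partition_score(child_score, n):
    """Best sum of child scores over all set partitions of the alphabet."""
    full = (1 << n) - 1
    best = [0.0] + [float("-inf")] * full
    for mask in range(1, full + 1):
        low = mask & -mask
        rest = mask ^ low
        sub = rest
        while True:
            block = sub | low
            best[mask] = max(best[mask], child_score[block] + best[mask ^ block])
            if sub == 0:
                break
            sub = (sub - 1) & rest
    return best[full]


def max_pct(idx, depth, d, xs, ys, alphabet, K, memo, stats):
    """Optimal score of a node at the given depth matched by the index set idx."""
    key = (idx, depth)
    if key in memo:          # same data subset at the same depth: reuse
        stats["reused"] += 1
        return memo[key]
    stats["solved"] += 1
    if depth == d:
        score = leaf_score(idx, ys, K)
    else:
        n = len(alphabet)
        child_score = [0.0] * (1 << n)
        for mask in range(1, 1 << n):
            label = {a for i, a in enumerate(alphabet) if mask >> i & 1}
            # A child at depth+1 constrains the (depth+1)-th symbol from the end.
            sub = frozenset(t for t in idx if xs[t][-(depth + 1)] in label)
            child_score[mask] = max_pct(sub, depth + 1, d, xs, ys,
                                        alphabet, K, memo, stats)
        score = best_partition_score(child_score, n)
    memo[key] = score
    return score


# Toy data: the response equals the last context symbol; the first is constant.
xs = ["AA", "AC", "AA", "AC", "AA", "AC"]
ys = ["A", "C", "A", "C", "A", "C"]
K = 0.5 * (2 - 1) * math.log(len(xs))  # BIC penalty for a two-symbol alphabet
stats = {"solved": 0, "reused": 0}
score = max_pct(frozenset(range(len(xs))), 0, 2, xs, ys, "AC", K, {}, stats)
```

On this toy data set, the extended PCT has 13 nodes, of which only 8 are solved explicitly; the remaining 5 are answered from the memo table because the constant first context position makes several nodes share the same data subset.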
3.2 Trading time for space
A downside of the memoization rule is an increased memory consumption due to the necessity to store the computed optimal scores of the visited nodes in the extended PCT in memory for potential future reuse. In order to find an appropriate trade-off between memory and time consumption, it is reasonable to control the degree of memoization employed by the algorithm.
The key idea is to store the optimal score of node \({\mathbf {V}}\) only if it is likely to be reused and holds a promise for significant savings in running time. While there are several possibilities to make such a decision, for instance, based on the number of (distinct) data points matching \({\mathbf {V}}\), we have found that the depth of \({\mathbf {V}}\) is the decisive factor: there are only a few shallow nodes in the extended PCT and the potential savings in running time are immense when reusing applies, since a shallow node is the root of a large subtree that needs to be traversed. On the other hand, there is a vast number of deep nodes, for which the potential savings are comparatively small as they are parents of only very small subtrees. Hence, it is reasonable to limit the maximum depth at which scores are stored in memory by an external memoization depth parameter, denoted by \(m\), which we can use to trade time against space. We empirically investigate the effect of varying \(m\) in Sect. 7.5.
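A depth-limited memo table can be sketched in a few lines. The class name `DepthLimitedMemo` and its interface are hypothetical; the only behavior taken from the text is that scores are stored solely for nodes whose depth does not exceed \(m\).

```python
class DepthLimitedMemo:
    """Score table that stores entries only for nodes of depth at most m."""

    def __init__(self, m):
        self.m = m
        self.table = {}

    def lookup(self, idx, depth):
        """Return the stored score for (index set, depth), or None."""
        return self.table.get((idx, depth))

    def store(self, idx, depth, score):
        """Store the score unless the node is deeper than m."""
        if depth <= self.m:
            self.table[(idx, depth)] = score


memo = DepthLimitedMemo(m=1)
memo.store(frozenset({0, 2}), 1, -2.0)  # shallow node: kept
memo.store(frozenset({0, 2}), 2, -3.0)  # deeper than m: silently dropped
```

With \(m = 0\) only the root could be stored, which effectively disables memoization, whereas \(m = d\) corresponds to the plain variant of Sect. 3.1.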
4 Score upper bounds
In this section, we present techniques to prune parts of the search space based on fast-to-compute upper bounds on the optimal scores of subproblems (i.e. nodes of an extended PCT). To this end, we will make yet another assumption regarding the scoring function, in addition to Assumptions 1 and 2.
Assumption 3
Note that the constant \(K\) is allowed to depend on the data size and the size of the alphabet; we only require that \(K\) is the same number for different choices of \({\mathbf {V}}\). The assumption of constant leaf penalty is fulfilled, for instance, by the BIC score, with \(K=\frac{1}{2}\big (|\varOmega |-1\big )\ln N\), and by the AIC score, with \(K=|\varOmega |-1\). An example of a scoring function that fulfills Assumptions 1 and 2 but not Assumption 3 is the fNML score (Silander et al. 2010): while the fNML score takes the form of a penalized maximum log-likelihood, the penalty term is the multinomial regret function and thus depends on the count \(N_{\mathbf {V}}\).
For the remainder of the paper, we implicitly assume that Assumption 3 is fulfilled for every score \(S\) mentioned, which can thus be BIC, AIC, or any other scoring criterion arising from a different choice of the constant leaf penalty \(K\).
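The two penalty constants named above are easily evaluated. The following sketch uses hypothetical function names and computes the BIC constant for a four-symbol alphabet and \(N=10\) data points.

```python
import math


def bic_penalty(alphabet_size, n_data):
    """Constant leaf penalty K of the BIC score."""
    return 0.5 * (alphabet_size - 1) * math.log(n_data)


def aic_penalty(alphabet_size):
    """Constant leaf penalty K of the AIC score."""
    return float(alphabet_size - 1)


k_bic = bic_penalty(4, 10)  # 1.5 * ln(10)
k_aic = aic_penalty(4)
```

Note that the BIC penalty grows with the data size, whereas the AIC penalty depends on the alphabet size only.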
4.1 A simple bound
First, we introduce a simple, coarse upper bound that requires no substantial precomputations and can thus always be used at no additional cost. Consider an inner node \({\mathbf {V}}\). To upper bound the maximum score over all possible subtrees rooted at \({\mathbf {V}}\), we upper bound the largest possible gain in the maximum-likelihood term on one hand, and lower bound the inevitable penalty due to increased model complexity on the other hand.
Case 1. The minimal subtree is optimal. In this case, we can compute the exact score directly: \(S_{*}({\mathbf {V}}) = L({\mathbf {V}})-K\).
Case 2. The minimal subtree is not optimal. Thus an optimal subtree makes at least one split, and therefore has at least two leaves. In this case, the likelihood term is upper bounded by \(L^{ \text{ UB }}({\mathbf {V}})\), while the penalty term is at least \(2K\).
Proposition 3
While this upper bound is coarse and sometimes tight, we next consider two possibilities to further tighten it significantly by investing more effort in precomputations.
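Combining the two cases suggests bounding the optimal score by the larger of \(L({\mathbf {V}})-K\) and \(L^{ \text{ UB }}({\mathbf {V}})-2K\). The sketch below implements this reading under two assumptions that are not spelled out in the text above: the likelihood terms are maximum log-likelihoods of unweighted data, and \(L^{ \text{ UB }}({\mathbf {V}})\) is obtained by splitting the data at \({\mathbf {V}}\) according to all remaining explanatory variables. The function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def loglik(responses):
    """Maximum log-likelihood of a multinomial fitted to the responses."""
    counts = Counter(responses)
    n = len(responses)
    return sum(c * math.log(c / n) for c in counts.values())


def coarse_bound(contexts, responses, K):
    """Coarse upper bound on the optimal score of an inner node.

    contexts[t] holds the explanatory symbols of data point t that are not
    yet fixed by the node; responses[t] is the corresponding response symbol.
    """
    L = loglik(responses)
    groups = defaultdict(list)
    for x, y in zip(contexts, responses):
        groups[x].append(y)   # finest possible split of the data
    L_ub = sum(loglik(g) for g in groups.values())
    return max(L - K, L_ub - 2 * K)


# The response is fully determined by the context, so splitting pays off.
bound_split = coarse_bound(["A", "A", "C", "C"], ["A", "A", "C", "C"], 1.0)
# The response is constant, so the minimal subtree already attains the bound.
bound_const = coarse_bound(["A", "A", "C", "C"], ["A", "A", "A", "A"], 1.0)
```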
4.2 Refining the bound
A PCT with \(n\) leaves can split according to at most \(n-1\) explanatory variables. For example, if a PCT has 2 leaves (implying a penalty term \(2 K\)), then the likelihood upper bound of Eq. 9, which allows splits according to all explanatory variables, can be overly loose, for we can actually split according to at most one variable. To formalize this idea, for every \(\ell = 1, 2, \ldots , d\) and index subset \(J \subseteq \{1, 2, \ldots , d-\ell \}\) let \({\mathcal {A}}(\ell , J)\) denote the family of all subsets of sequences \({\mathbf {A}} \subseteq \varOmega ^{d-\ell }\) such that \(A_i\) is a singleton if \(i \in J\) and \(A_i = \varOmega \) otherwise. In other words, the family forms a set partition of \(\varOmega ^{d-\ell }\) into \(|\varOmega |^{|J|}\) disjoint sets according to the explanatory variables indexed in \(J\) (shifted by \(\ell \)). For any inner node \({\mathbf {V}}\), each index set \(J\) thus gives rise to an upper bound.
By maximizing the bound over the sets J we obtain the following bound for the score:
Proposition 4
The number of likelihood terms that have to be computed is thus \(2^{d-\ell ({\mathbf {V}})}\) instead of two for the coarse upper bound, which may appear as a substantial additional investment. However, the greatest additional computational effort, \(2^d\) likelihood computations, occurs solely at the root of the extended PCT, whereas closer to the leaves the computational effort decreases very rapidly as \(\ell ({\mathbf {V}})\) increases. Since the number of nodes in the extended PCT grows faster than \(2^d\), we may expect the amortized additional effort due to the fine upper bound to amount only to a small overhead, which is a low price for potentially much tighter bounds. We empirically study the practical effect of the fine bound on running times in Sect. 7.7.
We show an example that compares the fine and coarse upper bound for a small data set of \(N=10\) over the alphabet \(\varOmega =\{\textsf {\small A},\textsf {\small B},\textsf {\small C},\textsf {\small D}\}\) in Fig. 4. Using the BIC score, the value of the constant leaf penalty is \(K= \frac{3}{2}\ln 10 \simeq 3.45\). We consider bounding the score of the root node and thus omit explicit reference to the argument \({\mathbf {V}}\) of the upper bound. Since there are two explanatory variables \(x_{1}\) and \(x_{2}\), we distinguish four cases, namely (1) full independence of y from the explanatory variables, (2) independence from \(x_{2}\) (but not \(x_{1}\)), (3) independence from \(x_{1}\) (but not \(x_{2}\)), (4) dependence on both variables. For each of these four cases, we show the smallest possible PCT topology and the maximal possible likelihood for the data set. Note that each \(L^{ \text{ UB }}\) stems from splitting the data into a larger number of groups than the number of leaves in the smallest possible PCT of the same case; however, for both the splits concern exactly the same explanatory variables. We find that in this example the coarse upper bound is too optimistic: while there is a certain dependence in the data, it requires both explanatory variables to fully utilize it. The fine bound shows that doing so does not give a better score than the independence model, knowledge that can be utilized for pruning the search space (Sect. 5).
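The idea behind the fine bound can be sketched in code. The sketch assumes that the bound takes the form \(\max _J \big [L_J({\mathbf {V}}) - (|J|+1)K\big ]\), where \(L_J({\mathbf {V}})\) is the maximum log-likelihood after splitting the data at \({\mathbf {V}}\) according to the explanatory variables indexed by \(J\); the factor \(|J|+1\) reflects that splitting according to \(|J|\) variables requires at least \(|J|+1\) leaves. This form is a reading of the idea stated in Sect. 4.2, not a quotation of Proposition 4, and all names are hypothetical.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def loglik(responses):
    """Maximum log-likelihood of a multinomial fitted to the responses."""
    counts = Counter(responses)
    n = len(responses)
    return sum(c * math.log(c / n) for c in counts.values())


def split_loglik(contexts, responses, J):
    """Log-likelihood after splitting the data by the context positions in J."""
    groups = defaultdict(list)
    for x, y in zip(contexts, responses):
        groups[tuple(x[i] for i in J)].append(y)
    return sum(loglik(g) for g in groups.values())


def fine_bound(contexts, responses, K):
    """Assumed fine bound: maximize L_J - (|J| + 1) * K over all index sets J."""
    positions = range(len(contexts[0]))
    best = float("-inf")
    for size in range(len(contexts[0]) + 1):
        for J in combinations(positions, size):
            value = split_loglik(contexts, responses, J) - (size + 1) * K
            best = max(best, value)
    return best


# The response depends on both variables jointly but on neither one alone.
contexts = ["AA", "AB", "BA", "BB"]
responses = ["A", "B", "B", "A"]
fine = fine_bound(contexts, responses, 2.0)
independence = loglik(responses) - 2.0
```

On this toy data set with \(K=2\), using both variables costs at least three leaves and gives \(0-3K=-6\), so the fine bound equals the score \(L-K\approx -4.77\) of the independence model. A bound that charges only two leaves for the full split would give \(0-2K=-4\) and hence could not rule out a better subtree.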
4.3 Lookahead
Proposition 5
Using the lookahead bound with a large \(q\) constitutes a substantial computational effort. Specifically, if \(q=d-\ell ({\mathbf {V}})\), the bound equals the optimal score and is, in essence, obtained by traversing through all possible PCT subtrees. Hence, the choice of \(q\) could be critical for obtaining a good trade-off between gained savings and additional invested effort in relation to the flat bound. We will investigate this issue empirically in Sect. 7.3.
5 Pruning rules
Armed with the score upper bounds derived in the previous section, we next present pruning rules that aim at deciding at each visited node in the extended PCT, whether the upper bounds allow us to avoid exact solving of the corresponding subproblem.
5.1 Stopping rule
Stop the search at a node when context-specific independence can be declared already without explicitly considering the possible subtrees. From this basic idea it follows that using a lookahead bound within the stopping rule is pointless. While it may yield a tighter bound, it has to consider subtrees explicitly and solve the alphabet partition problem at least once, which are the tasks that are to be avoided. Formally, we have the following.
Proposition 6
The correctness can be shown by contradiction. Assume \({\mathcal {T}}'\) does not yield the optimal score. Then \( L({\mathbf {V}})-K<S_{*}({\mathbf {V}})\). But since \( S_{*}({\mathbf {V}})\le S_{0}({\mathbf {V}})\), this violates the premise.
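Read together with the proof, the stopping rule amounts to a single comparison: the search stops at \({\mathbf {V}}\) if the upper bound \(S_{0}({\mathbf {V}})\) does not exceed the score \(L({\mathbf {V}})-K\) of the minimal subtree. The premise is inferred here from the proof above, and the function name is hypothetical.

```python
def can_stop(leaf_loglik, upper_bound, K):
    """True if the minimal subtree is guaranteed to be optimal at this node.

    leaf_loglik is L(V), upper_bound is S_0(V), and K is the constant
    leaf penalty.
    """
    return upper_bound <= leaf_loglik - K


stop_here = can_stop(leaf_loglik=-3.0, upper_bound=-4.0, K=1.0)
keep_searching = not can_stop(leaf_loglik=-3.0, upper_bound=-3.5, K=1.0)
```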
5.2 Deletion rule
Delete a child node if the best set of children it belongs to is worse than some other set of children (to which the node does not belong). As we wish to delete as many potential child nodes as possible and not compute their optimal scores, we cannot assume the optimal scores of the sibling nodes are available. Thus, to make the rule concrete, we resort to upper bounds on the scores. Likewise, we need to lower bound the optimal score among the valid sets of children. While, in principle, various lower bounding schemes would be possible, we have chosen to use a particularly simple lower bound: the optimal score of the \(\varOmega \)-labeled child. We next describe the bounds and the rule more formally.
Proposition 7
The correctness can be shown by contradiction. Suppose \(C{\mathbf {V}}\) did belong to an optimal PCT even though the premise was fulfilled. Then \(S_*(C{\mathbf {V}}) + f^*(\varOmega {\setminus } C) \,\ge \, S_{*}(\varOmega {\mathbf {V}}) ,\) leading to \(S_q(C{\mathbf {V}})<S_*(C{\mathbf {V}})\), which violates the property of \(S_q\) being an upper bound of \(S_*\).
The deletion rule invests a certain amount of effort for which the obtained savings need to compensate before the rule becomes effective: in the worst case, we need to compute the optimal partition of children twice for each inner node, once with the upper bounded scores for excluding subtrees from further optimization, and once with the exact scores. As a positive side note, we observe that, while we focus on upper bounds based on Assumption 3 in this work, the deletion rule is in principle independent of the used scoring function, as long as an effective upper bound \(S_{q}\ge S_{*}\) can be specified.
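The deletion rule can be sketched as follows. The sketch assumes that the premise reads \(S_q(C{\mathbf {V}}) + f^*(\varOmega {\setminus } C) < S_{*}(\varOmega {\mathbf {V}})\), as suggested by the proof above, and that \(f^*(B)\) is the best sum of upper bounds over all set partitions of \(B\); the latter is an assumption made for illustration. Subsets are encoded as bitmasks, and the function name is hypothetical.

```python
def deletable_children(upper, omega_score, n):
    """Return the bitmasks of child labels that can be deleted.

    upper[mask] is an upper bound S_q on the optimal score of the child
    labeled by `mask` (index 0 is unused); omega_score is the exact optimal
    score of the child labeled by the full alphabet; n is the alphabet size.
    """
    full = (1 << n) - 1
    # f[mask]: best sum of upper bounds over all set partitions of `mask`.
    f = [0.0] + [float("-inf")] * full
    for mask in range(1, full + 1):
        low = mask & -mask
        rest = mask ^ low
        sub = rest
        while True:
            block = sub | low
            f[mask] = max(f[mask], upper[block] + f[mask ^ block])
            if sub == 0:
                break
            sub = (sub - 1) & rest
    # A proper subset C is deleted if even its most optimistic completion
    # cannot reach the score of the single child labeled by the alphabet.
    return {c for c in range(1, full) if upper[c] + f[full ^ c] < omega_score}


# Two symbols: neither singleton child can be part of an optimal selection.
deleted = deletable_children([0.0, -1.0, -5.0, -3.0], omega_score=-3.0, n=2)
```

The child labeled by the full alphabet is never deleted, because its exact score serves as the lower bound against which all other children are compared.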
6 The final algorithm
While also omitted in the pseudocode for brevity, incorporating the memoization into the proposed algorithm is straightforward. We can add a test that checks whether the index set \(I({\mathbf {V}})\) has already occurred with some other node at the same depth directly when entering the function \(\textsc {MaxPCT}({\mathbf {V}})\). If the test is positive, the score is reused and the rest of the function is skipped. If the test is negative and the depth of \({\mathbf {V}}\) is not larger than \(m\), the score is stored in a hash data structure at the end of the function with the current data subset (index set) and the depth of \({\mathbf {V}}\) as the key.
7 Case studies
In the empirical part of this work, we evaluate the effects of the proposed techniques for expediting PCT learning using a Java implementation based on the Jstacs library (Grau et al. 2012). The software is available at http://www.jstacs.de/index.php/PCTLearn.
7.1 Data
We consider the problem of modeling DNA binding sites of regulatory proteins such as transcription factors, which constitutes one established application of PCTs. A data set of DNA binding sites consists of short sequences of the same length over the alphabet \(\varOmega =\{\textsf {\small A},\textsf {\small C},\textsf {\small G},\textsf {\small T}\}\) that are considered to be recognized by the same DNA-binding protein. In this application, the task is to model the conditional probability of observing a particular symbol at a certain position in the sequence given its direct predecessors, a task that directly fits the setting outlined in Sect. 1. The probability of the full sequence is, by the chain rule, simply the product over all conditionals. Due to the nature of protein–DNA interaction, the conditional distribution at a particular position is strictly position-specific, so we need to learn a separate PCT for every sequence position in a data set.
We use data from the publicly available database JASPAR (Sandelin et al. 2004), which contains a large number of DNA binding site data sets for various organisms. For the majority of this section, we focus on two exemplary data sets, which contain binding sites of human DNA-binding proteins called CTCF and REST. The sequences in both data sets are rather long (19 and 21 nucleotides), so there are quite a few PCTs of large depth to be learned. For conveniently referring to a particular learned PCT, we introduce the abbreviations CTCFj for the PCT learned at the \(j\)th position of the CTCF data set, and RESTj likewise. Both proteins are known to recognize a rather complex sequence pattern (Eggeling et al. 2015b), which makes the structure learning problem challenging.
We observe that the complexities of the optimal PCTs differ. In both data sets, there are sequence positions where a PCT that represents full statistical independence of the variable given its predecessors is optimal according to the BIC score, which typically, though not always, occurs at highly informative positions. For CTCF, all optimal PCTs have splits until at most depth three, whereas for REST the allowed maximum depth of 6 is used to full capacity by REST11 and REST20; in addition, one final split occurs at depth 5 and three final splits at depth 4. The preference of REST for deeper trees, in comparison to CTCF, may be caused by a combination of a larger sample size, which allows a somewhat higher model complexity, and the location of the highly informative positions in clusters, which spatially separates low-informative positions among which dependencies are likely to occur.
The height and shape of the optimal PCT structures suggest that the PCT optimization for REST is generally computationally harder than for CTCF. In the following sections, we utilize both data sets for evaluating the effectiveness of the proposed memoization and pruning techniques.
7.2 Pruning versus memoization
Memoization reduces the search space by approximately one order of magnitude on average, and the savings vary only little from position to position. This can be explained by the structure of the data sets, where most positions have both high- and low-informative positions as predecessors, so the potential for exploiting regularities in the explanatory variables is in a similar range.
The effect of pruning, however, varies to a large degree. As a rule of thumb, at high-information positions pruning yields a tremendous reduction of the search space. In one exceptional case, CTCF13, it is possible to prune already at the root, which we cannot always expect to happen: other positions with a minimal optimal tree, displayed in Fig. 7 (top), require more effort to declare statistical independence. The savings at low-information positions are not as pronounced, but for all 28 cases under consideration, pruning yields higher savings than memoization.
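To illustrate why pruning can succeed directly at the root of a highly informative position, consider a deliberately crude stopping test. This is a sketch and not the coarse or fine upper bound evaluated in this section: it assumes a score of the form "maximized log-likelihood minus a constant penalty per leaf" (as for BIC and AIC) and uses only the fact that a log-likelihood never exceeds zero. Any tree that partitions the data into two or more leaves then scores at most minus twice the leaf penalty, so a single leaf that already reaches this value cannot be beaten. All function names are hypothetical.

```python
from math import log


def leaf_loglik(responses):
    """Maximized multinomial log-likelihood of the response symbols in one leaf."""
    counts = {}
    for y in responses:
        counts[y] = counts.get(y, 0) + 1
    n = len(responses)
    return sum(c * log(c / n) for c in counts.values())


def bic_leaf_penalty(n_total, alphabet_size):
    """BIC penalty per leaf: half the free parameters of a leaf times log N."""
    return 0.5 * (alphabet_size - 1) * log(n_total)


def can_stop(responses, penalty):
    """Crude stopping test (an illustration, not the bounds of the paper).

    A single leaf scores leaf_loglik - penalty. Any tree with at least two
    leaves scores at most 0 - 2 * penalty. Hence the leaf is optimal whenever
    leaf_loglik >= -penalty.
    """
    return leaf_loglik(responses) >= -penalty


penalty = bic_leaf_penalty(n_total=100, alphabet_size=4)
conserved = ["A"] * 99 + ["C"]          # highly informative position
uniform = ["A", "C", "G", "T"] * 25     # uninformative position
print(can_stop(conserved, penalty), can_stop(uniform, penalty))
```

The test fires only when the response is almost deterministic, which matches the observation that pruning at the root is an exceptional case; tighter bounds are needed everywhere else.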
It is thus no surprise that the combination of both is dominated by the effect of pruning: Memoization contributes only small additional savings for positions where pruning is not overly effective, such as CTCF8 or REST15.
Comparing the two data sets to each other, we find that the aggregated savings for CTCF are higher than for REST, which confirms the speculation from the previous section. In particular, for REST11 and REST15 finding optimal PCTs is relatively demanding. However, the optimal tree structure implies only a tendency, and the correlation is not perfect: REST7 and REST20 seem equally challenging instances, yet the former yields a minimal optimal tree, whereas the latter yields an optimal tree with five leaves that reaches up to depth 6.
7.3 Pruning variants in detail
We observe that the biggest difference among methods is achieved at seemingly “easy” positions: the most striking example is again CTCF13, where the difference between the best and the worst pruning technique amounts to four orders of magnitude. Moreover, the switch between the coarse and fine upper bound has a higher impact than changing the number of lookahead steps. Except for a few difficult cases (CTCF18, REST11, REST15), using the fine bound always has a clearly positive effect on the reduction of the search space, and it never increases the workload in terms of the number of visited nodes.
Lookahead, however, can have a negative effect, as it potentially increases the search space in cases where it contributes little to tightening the bounds. With the coarse upper bound, lookahead clearly pays off: \(q=1\) and \(q=2\) are almost equally good and in some cases (CTCF13, REST17) substantially better than \(q=0\). With the fine upper bound, \(q=1\) performs best. For a few positions (CTCF14, REST9, REST16), the one-step lookahead improves on the shallow fine upper bound by more than one order of magnitude. Furthermore, for the majority of positions \(q=1\) is slightly superior to \(q=2\), but there are a few instances where further lookahead pays off, such as REST9 or REST10. The cases \(q>2\), which we omit from the plots for clarity, follow the trend from \(q=1\) to \(q=2\) and yield inferior performance.
We conclude that the fine upper bound in combination with one-step lookahead is a competitive choice. Two-step lookahead is not substantially worse for these data sets, as the additional number of visited lookahead nodes is compensated by the tighter bound, so the choice of this parameter is robust.
7.4 The AIC score
We again compute optimal PCTs of depth \(d=6\) for all algorithm variants. The results are shown in Fig. 10. The savings for memoization are exactly the same as in the case of the BIC score, which serves as a sanity check: the memoization technique does not distinguish between BIC and AIC, and so the results must be identical.
The results for pruning, however, change dramatically. Due to the less harsh penalty term, total statistical independence never occurs, that is, the minimal tree is never optimal. Moreover, context-specific independence can be declared in far fewer cases than for BIC, and so the pruning rules are less effective. The largest saving occurs for CTCF10, where the AIC-optimal PCT has only four leaves, the saving being a little more than three orders of magnitude, which is comparable to the worst cases for BIC on the same data set. There are even instances where the reduction of the search space is smaller than one order of magnitude.
The comparatively poor effect of the pruning rules, however, changes the game when pruning is combined with memoization. While in some cases like CTCF10 or REST8 pruning alone could still suffice, and in a few other cases like CTCF14, CTCF18, or REST10 memoization alone already yields the best possible result, combining the two ideas clearly pays off for the majority of positions. This demonstrates that the memoization idea can in principle be as valuable as pruning, or even more effective, depending heavily on the scoring function and the complexity of the optimal model structures.
7.5 Memoization revisited
We thus investigate the impact of the memoization depth \(m\), which indicates the deepest layer of the extended PCT for which subproblems are stored for potential reuse later on. For measuring time consumption, we count the number of visited nodes in the extended PCT. Since the total running time for a data set is the main factor of interest, we here take the mean value over all positions. For measuring space consumption, we count the number of stored nodes. Here, however, we take the maximum over all positions, since it typically is the quantity of interest to decide whether a problem can be solved on a given machine or not. Figure 11 displays the results.
We observe that the pattern is similar for all six cases, and \(m=4\) gives the overall best trade-off between time and space complexity. For cases where pruning is rather effective, such as BIC, space complexity may not become a critical bottleneck, so even \(m=5\) could be justified. In the other cases, it might be a good idea to stop storing subproblems one layer earlier by setting \(m=d-2\) and to compute, if needed, the optimal partition of the leaf nodes of the extended PCT explicitly.
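The time and space trade-off governed by the memoization depth can be reproduced on a toy problem. The following is a minimal sketch under stated assumptions, not the implementation used in the experiments: it scores leaves by maximized log-likelihood minus a constant penalty, enumerates set partitions naively instead of using an efficient partition step, and omits pruning. Subproblems are keyed by the data subset and the depth, and are stored only up to depth `m`. All names (`optimal`, `run`, `DATA`) are hypothetical.

```python
from itertools import combinations
from math import log

ALPHABET = ("A", "C", "G", "T")


def blocks(symbols):
    """All non-empty symbol subsets, i.e. the children of an extended-tree node."""
    return [frozenset(c) for k in range(1, len(symbols) + 1)
            for c in combinations(symbols, k)]


def partitions(symbols):
    """Yield every set partition of `symbols` as a list of frozensets."""
    if not symbols:
        yield []
        return
    first, rest = symbols[0], symbols[1:]
    for k in range(len(rest) + 1):
        for others in combinations(rest, k):
            remaining = tuple(s for s in rest if s not in others)
            for tail in partitions(remaining):
                yield [frozenset((first,) + others)] + tail


def leaf_score(rows, data, penalty):
    """Maximized log-likelihood of the responses in `rows` minus a leaf penalty."""
    counts = {}
    for i in rows:
        counts[data[i][1]] = counts.get(data[i][1], 0) + 1
    n = len(rows)
    return sum(c * log(c / n) for c in counts.values()) - penalty


def optimal(rows, depth, data, d, m, penalty, memo, stats):
    """Best score of a subtree for the data subset `rows` at the given depth."""
    key = (rows, depth)            # contexts with equal data subsets share a key
    if key in memo:
        return memo[key]
    stats["visited"] += 1
    best = leaf_score(rows, data, penalty)
    if depth < d:
        child = {}
        for b in blocks(ALPHABET):
            sub = frozenset(i for i in rows if data[i][0][depth] in b)
            child[b] = optimal(sub, depth + 1, data, d, m, penalty, memo, stats)
        for part in partitions(ALPHABET):
            best = max(best, sum(child[b] for b in part))
    if depth <= m:                 # store only down to the memoization depth
        memo[key] = best
    return best


def run(data, d, m, penalty=1.0):
    """Return (optimal score, visited nodes, stored nodes) for memoization depth m."""
    memo, stats = {}, {"visited": 0}
    rows = frozenset(range(len(data)))
    score = optimal(rows, 0, data, d, m, penalty, memo, stats)
    return score, stats["visited"], len(memo)


# Each row: (context of two predecessors, response symbol); made-up toy data.
DATA = [(("A", "A"), "A"), (("A", "G"), "A"), (("G", "A"), "C"), (("G", "G"), "C"),
        (("A", "A"), "A"), (("G", "A"), "C"), (("A", "G"), "T"), (("G", "G"), "C")]

for m in (-1, 1, 2):               # m = -1 disables memoization entirely
    print(m, run(DATA, d=2, m=m))
```

Lowering `m` shrinks the table of stored nodes at the price of revisiting nodes below the cut-off, while the returned score stays the same.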
7.6 Broad study
In the previous sections, we investigated two data sets in detail and used the number of visited nodes in the extended PCT as an evaluation metric. Two open questions remain: How do the numbers of visited nodes translate to running times? How do the algorithmic variants perform on a larger variety of data sets, in particular with respect to the sample size?
The results generally confirm the observation from the previous sections: pruning gives larger savings than memoization, even though the difference in running times is not as large as the difference in the number of visited nodes (Sect. 7.2). One explanation is that the computation of the fine upper bound does have a certain computational cost, whereas memoization incurs a memory overhead rather than a computational overhead. In addition, memoization can also give improvements in cases where pruning itself is ineffective. As a consequence, the combination of pruning and memoization is clearly the best choice for speeding up PCT optimization, reducing the median running time by almost two orders of magnitude.
In Fig. 12(right), we plot for this best variant the running time against the number of visited nodes in the extended PCT, for each of the 767 problem instances. We color each point in the scatter plot by the size of the data set, distinguishing three size groups, roughly on a log scale: small with \(N<500\), typical with \(500<N<3000\), and large with \(N>3000\), consisting of 23, 52, and 20 data sets respectively (each amounting to several instances). We observe that the running time correlates well with the number of visited nodes (Pearson correlation coefficient \(\rho =0.90\)).
7.7 Running times for different parameter values
The previous section discussed the running times for concrete selections of the algorithms’ parameters. We now set these parameters, one at a time, to possible alternative values and study the effects on the running time (Fig. 13). We observe that for every parameter there exist some problem instances that benefit from a change of the parameter value, but nevertheless we do observe a general trend.
When using the coarse bound instead of the fine bound (top, right), we find that for the majority of problem instances the running time increases, and in many cases by more than one order of magnitude. These are often cases where the minimal PCT is optimal (red): whereas the fine bound enables pruning very early in the optimization, the coarse bound does not. Keeping the fine bound but disabling the lookahead instead (top, center) also leads to an increased running time for the majority of instances. Increasing the lookahead from \(q=1\) to \(q=2\) has relatively little effect, which confirms the expectations gained from analyzing the number of visited nodes (cf. Fig. 9).
When varying the memoization parameter m, we observe that for the majority of problem instances the running times remain largely identical, especially for those where the optimal PCT has only one leaf. However, for many instances where the optimal PCT has more than one leaf, gradually disabling memoization by reducing m increases the running time. These results also confirm the expectations from the earlier analysis that pruning and memoization complement each other: whereas the former technique attempts to quickly identify context-specific independencies (including complete independence), the latter allows savings also in cases where the optimal PCT is relatively complex.
7.8 Predictive performance
Armed with the algorithmic tricks described in this paper, we are now able to study the predictive performance of a PCT-based model, dubbed iPMM (inhomogeneous parsimonious Markov model), on a large scale. We also investigate the performance of Bayesian networks (BNs), which have been previously proposed for modeling the complexity of transcription factor binding sites (Barash et al. 2003). This comparison is particularly relevant as the two model classes take into account different features in the data: iPMMs allow dependencies only among nucleotides in close proximity, but they model such dependencies in a very sparse and efficient way. BNs also allow long-range dependencies among distant positions in the sequence, but they are potentially less effective for short-range dependencies due to their use of conditional probability tables.
To allow for a fair comparison of the structural features of both model classes, we learn globally optimal iPMMs and BNs with the same structure score (BIC) and the same parameter estimator given the structure (posterior mean with pseudocount \(1/2\)). For BN structure learning, we use an implementation of the dynamic programming algorithm of Silander and Myllymäki (2006), which is sufficient for finding a globally optimal DAG for the problem sizes within this application domain. For evaluating the predictive performance of both models we employ a repeated holdout approach with 90% training data and 100 repetitions. For each data set, we compute the mean log predictive probabilities and test whether the difference between the two models is significant using the signed-rank test of Wilcoxon (1945). The individual results for all 95 data sets under consideration are shown in the Appendix.
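The two ingredients of this protocol that do not depend on the model class can be sketched briefly. The snippet below is an illustration under assumptions, not the evaluation code behind the reported results: it assumes that the pseudocount of 1/2 is added to every symbol count of a leaf, and that one holdout repetition is a seeded random 90/10 split. The function names are hypothetical.

```python
import random


def posterior_mean(counts, alphabet, pseudocount=0.5):
    """Posterior-mean estimate of a leaf distribution.

    Assumes the same pseudocount is added to the count of every symbol.
    """
    total = sum(counts.get(a, 0) for a in alphabet) + pseudocount * len(alphabet)
    return {a: (counts.get(a, 0) + pseudocount) / total for a in alphabet}


def holdout_split(items, train_fraction=0.9, seed=0):
    """One repetition of a holdout split into training and test data."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


theta = posterior_mean({"A": 3, "C": 1}, alphabet="ACGT")
print(theta)

# 100 repetitions are obtained by varying the seed.
splits = [holdout_split(range(100), seed=r) for r in range(100)]
print(len(splits[0][0]), len(splits[0][1]))
```

The pseudocount keeps unseen symbols at a non-zero probability, so the log predictive probability of held-out sequences stays finite.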
Number of instances for which an iPMM predicts better/worse than a BN
Result   \(\alpha =0.05\)   \(\alpha =0.005\)   \(\alpha =0.0005\)
Better   75                 70                  66
Tie      11                 16                  22
Worse    9                  9                   7
While the absolute difference between the predictive probabilities may seem small, the practical relevance depends on the concrete application. For scanning an entire genome with a threshold-based approach, for instance, even small differences in the predictive probability may have a substantial impact on the number of false positives. In addition to general advantages such as the easy visualization discussed in Sect. 1, iPMMs also have the conceptual advantage over BNs that the running time grows only linearly with the sequence length. Hence, they could be used to model longer sequence patterns, while still retaining optimality with respect to the chosen objective function.
7.9 Other types of data
Since DNA binding site data (1) concerns only \(|\varOmega |=4\) and (2) entails some highly informative response variables due to the inhomogeneity of the used iPMM, it may not be fully representative of other types of data. We thus additionally investigate our algorithmic techniques on learning PCTs from protein sequences, which are typically described using the 20-letter amino acid alphabet. However, for many applications it is common to reduce this alphabet to smaller sizes based on, e.g., similar biochemical properties of certain amino acids (Li et al. 2003; Peterson et al. 2009; Bacardit et al. 2009). In this study we use the alphabet reduction method of Li et al. (2003), since it offers for each possible reduced alphabet size an optimal clustering of amino acids into groups.
For each of these sequences, we learn a PCT (thus implicitly assuming a homogeneous model) with the basic DP algorithm and with our full algorithm with all improvements enabled, and plot the required running time for three combinations of alphabet size and maximal PCT depth in Fig. 14. We find that our algorithm speeds up structure learning also for this type of data and model. Compared to the results on DNA binding sites, the savings are smaller on average, but the variance in savings is also decreased. This can be explained by the observation that in homogeneous sequences the response variables rarely have an extreme marginal distribution, so pruning at or close to the root almost never occurs, even if the independence model were optimal.
Algorithm comparison for learning PCTs of depth \(d=4\) on ADL data (VN: number of visited nodes; RT: running time)
Data set   N     Metric   Basic         Full         Saving factor
A          248   VN       262,209,281   15,007,453   17.47
A          248   RT       6451          598          10.78
B          493   VN       262,209,281   24,514,291   10.69
B          493   RT       6955          1129         6.16
8 Discussion
We have investigated the problem of learning parsimonious context trees (Bourguignon and Robelin 2004), which are a powerful model class for sequential discrete data, but entail the challenge of requiring high computational effort for exact structure learning (Leonardi 2006). Specifically, we proposed two orthogonal ideas to expedite the basic dynamic programming algorithm of Bourguignon and Robelin (2004) for finding a highest-scoring parsimonious context tree.
The first idea, memoization, exploits regularities among the explanatory variables by storing and reusing previously optimized subtrees. Empirical results on real-world DNA binding site data suggest that memoization reduces the search space by about one order of magnitude in typical cases. The variance in the savings factor is generally moderate since regularities among multiple explanatory variables need to coincide for the memoization rule to apply; extreme cases with extraordinarily high regularity are rare unless the dependence among variables is actually deterministic. While memoization is rather memory-intensive in its maximal instantiation, we have seen that the memory burden can be significantly reduced by putting a limit on the number of stored subproblem solutions, losing only little in search space reduction. We observed that a simple implementation of this idea works: storing solutions only up to a certain user-specified depth provides an interpolation between the minimal and maximal memory requirements. Let us note that we also investigated several alternative criteria for deciding whether the solution to a particular subproblem should be stored for later reuse or not, such as the number of (distinct) data points matching the corresponding node. However, no other criterion could compete with the simple depth-based criterion.
The second idea, pruning, exploits regularities within the response variable through upper bounds on the scoring function. Specifically, we derived local score upper bounds for scores with a constant leaf penalty such as the BIC score or the AIC score, with an option for a few-step lookahead. We presented two pruning rules that utilize these upper bounds: a stopping rule and a deletion rule. Empirical results showed that pruning can be extremely effective when the entropy in the response variable is low and the scoring function favors sparse trees, which is typically the case for BIC. Here, pruning substantially outperforms memoization and, when employing both, the latter appears to offer only a negligible additional contribution. However, the reduction of the traversed search space is less pronounced when the distribution of the response variable has a high entropy and when the found optimal tree is large. In this situation, the combination of pruning and memoization pays off.
The effectiveness of pruning for a given scoring function depends partially on the quality of the score upper bounds. We may control and influence this aspect by algorithmic decisions concerning the amount of effort we are willing to invest for getting tighter bounds. However, our case studies demonstrate that it additionally depends on the size of the tree structures favored by the scoring function: if the optimal tree is sparse, then there is more potential, albeit no guarantee, for pruning large parts of the search space. This second aspect is beyond our direct control in algorithm design once data set and scoring function are fixed. As a consequence, the choice of the scoring function, a pure modeling decision, has a direct and in fact rather predictable impact on the speed of the algorithm.
It might be noteworthy that using the BIC score for learning parsimonious context trees was previously motivated from the perspective of predictive performance under limited data (Eggeling et al. 2014b). The empirical results from this work now also suggest an algorithmic justification for this scoring function choice, and it can be considered as a fortunate coincidence that these two different objectives lead to the same conclusion.
While our case studies involved only two concrete scoring functions and one type of benchmark data, we believe that the lessons learned can be somewhat generalized. Since our score upper bounds for BIC and AIC share the same functional form, we may assume that the upper bounds alone are roughly equally effective in both cases. Hence, a larger leaf penalty of a scoring function implies a larger pruning potential on average. This conclusion should generalize also to other scoring functions, such as Bayes scores arising from different prior choices, although deriving effective upper bounds could be technically more challenging in these cases. In contrast, memoization makes only weak assumptions about the scoring function and is thus always equally effective no matter whether model complexity is penalized heavily or not. This interplay of pruning and memoization techniques is likely to generalize also to other classes of tree-structured probabilistic models.
One obvious limitation is that the effectiveness of the proposed methods wanes with growing sample size, as memoization becomes less likely to apply and the optimal PCTs become larger. However, we do not find this limitation very significant, as the very purpose of PCTs is to provide a sparse representation of a conditional distribution in small data scenarios where a full Markov model would have excessively many parameters. Thus, it may not be critical if learning the model becomes computationally infeasible in situations where the model does not have a clear advantage over computationally simpler models in the first place.
Another limitation is that the presented methods, as such, are insufficient for handling large alphabets. The reason is that increasing the alphabet size increases not only the size of the extended PCT (which could be dealt with by pruning and memoization), but also the time needed for computing a single optimal partition for each of its inner nodes. Since the time complexity for the latter task is already \({\mathcal {O}}(3^{|\varOmega |})\), it does not seem likely that exact learning of PCTs on more than a dozen symbols becomes tractable for practically relevant instances. So if handling a large alphabet is critical for the specific application and cannot be circumvented by some alphabet reduction method, one should resort either to heuristic algorithms or to simpler and potentially less powerful models.
The present work also opens avenues for future research. For instance, it might be worthwhile to apply the pruning ideas for finding optimal classification and regression trees (Breiman et al. 1984; Buntine 1992; Chipman et al. 1998) with many categorical explanatory variables. The published exact algorithms for learning decision trees (Blanchard et al. 2007; Hush and Porter 2010; Bertsimas and Dunn 2017) do not employ pruning based on score upper bounds; the pruning strategies explored in the literature (see, e.g., Frank 2000, Lomax and Vadera 2013, and references therein) are limited to postprocessing of decision trees found by greedy, inexact algorithms, and are thus not comparable to the methods presented in the present paper. It also remains to be investigated whether the bound-and-prune approach could succeed in expediting other algorithms that are based on recursive set partitioning. An example is the DP algorithm by Kangas et al. (2014) for learning chordal Markov networks, for which Rantanen et al. (2017) recently presented a related bound-and-prune variant; however, the variant appears not to take full advantage of the underlying DP algorithm and yields speedups only occasionally. A different line of research is to design heuristic, approximate algorithms for learning parsimonious context trees that scale to large alphabets and thereby significantly widen the applicability of the model class. We believe the techniques presented and the insight obtained in this work constitute a fruitful starting point for devising effective greedy and local search algorithms.
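The exponential cost of the partition step can be made concrete with a standard dynamic program over symbol subsets. This is a generic sketch, assumed here to resemble the kind of computation meant above rather than taken from the implementation: given a score for every non-empty block of symbols (encoded as a bitmask), it finds the best-scoring partition of the full alphabet by enumerating, for every subset, all blocks that contain its lowest symbol. The total number of (subset, block) pairs grows as \(3^{n}\) for \(n\) symbols. The function name and the example scores are hypothetical.

```python
def best_partition_score(block_score, n):
    """Maximal total score over all partitions of n symbols into blocks.

    block_score[mask] is the score of the block whose members are the set
    bits of mask (index 0 is unused). Runs in O(3^n) time, O(2^n) space.
    """
    full = (1 << n) - 1
    best = [0.0] * (full + 1)            # best[0] = 0: empty set, no blocks
    for mask in range(1, full + 1):
        low = mask & -mask               # lowest symbol; fixes the first block
        rest = mask ^ low
        value = float("-inf")
        sub = rest
        while True:                      # enumerate all submasks of `rest`
            block = sub | low
            value = max(value, block_score[block] + best[mask ^ block])
            if sub == 0:
                break
            sub = (sub - 1) & rest
        best[mask] = value
    return best[full]


# Three symbols: singletons score 1.0, pairs 2.5, the full block 3.2.
scores = [0.0, 1.0, 1.0, 2.5, 1.0, 2.5, 2.5, 3.2]
print(best_partition_score(scores, 3))
```

For a 4-letter alphabet this amounts to a handful of operations per inner node, whereas for 20 symbols it is on the order of \(3^{20}\), roughly 3.5 billion, per inner node, which is why alphabet reduction matters in practice.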
Notes
Acknowledgements
Open access funding provided by University of Helsinki including Helsinki University Central Hospital.
Funding
Funding was provided by Academy of Finland (Grant No. 276864).
References
 Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716–723.MathSciNetCrossRefzbMATHGoogle Scholar
 Bacardit, J., Stout, M., Hirst, J., Valencia, A., Smith, R., & Krasnogor, N. (2009). Automated alphabet reduction for protein datasets. BMC Bioinformatics, 10, 6.CrossRefGoogle Scholar
 Barash, Y., Elidan, G., Friedman, N., & Kaplan, T. (2003). Modeling dependencies in proteinDNA binding sites. In Proceedings of the seventh annual international conference on research in computational molecular biology (RECOMB) (pp 28–37).Google Scholar
 Begleiter, R., ElYaniv, R., & Yona, G. (2004). On prediction using variable order Markov models. Journal of Artificial Intelligence Research, 22, 385–421.MathSciNetCrossRefzbMATHGoogle Scholar
 BenGal, I., Shani, A., Gohr, A., Grau, J., Arviv, S., Shmilovici, A., et al. (2005). Identification of transcription factor binding sites with variableorder Bayesian networks. Bioinformatics, 21, 2657–2666.CrossRefGoogle Scholar
 Bertsimas, D., & Dunn, J. (2017). Optimal classification trees. Machine Learning, 106(7), 1039–1082.MathSciNetCrossRefzbMATHGoogle Scholar
 Blanchard, G., Schäfer, C., Rozenholc, Y., & Müller, K. (2007). Optimal dyadic decision trees. Machine Learning, 66(2–3), 209–241.CrossRefGoogle Scholar
 Bourguignon, P. Y., & Robelin, D. (2004). Modèles de Markov parcimonieux: sélection de modele et estimation. In Proceedings of the 5e édition des Journées Ouvertes en Biologie, Informatique et Mathématiques (JOBIM).Google Scholar
 Boutilier, C., Friedman, N., Goldszmidt, M., & Koller, D. (1996). Contextspecific independence in Bayesian networks. In Proceedings of the 12th conference on uncertainty in artificial intelligence (UAI) (pp. 115–123).Google Scholar
 Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and regression trees. Belmont: Wadsworth.zbMATHGoogle Scholar
 Brocchieri, L., & Karlin, S. (2005). Protein length in eukaryotic and prokaryotic proteomes. Nucleic Acids Research, 33(10), 3390–3400.CrossRefGoogle Scholar
 Bühlmann, P., & Wyner, A. (1999). Variable length Markov chains. Annals of Statistics, 27, 480–513.MathSciNetCrossRefzbMATHGoogle Scholar
 Buntine, W. (1992). Learning classification trees. Statistics and Computing, 2(2), 63–73.CrossRefGoogle Scholar
 Chavira, M., & Darwiche, A. (2005). Compiling Bayesian networks with local structure. In Proceedings of the 19th international joint conference on artificial intelligence (IJCAI) (pp. 1306–1312).Google Scholar
 Chickering, D., Heckerman, D., & Meek, C. (1997). A Bayesian approach to learning Bayesian networks with local structure. In Proceedings of the 13th conference on uncertainty in artificial intelligence (UAI) (pp. 80–89).Google Scholar
 Chipman, H., George, E., & McCulloch, R. (1998). Bayesian CART model search. Journal of the American Statistical Association, 93(443), 935–948.CrossRefGoogle Scholar
 de Campos, C., & Ji, Q. (2011). Efficient structure learning of Bayesian networks using constraints. Journal of Machine Learning Research, 12, 663–689.MathSciNetzbMATHGoogle Scholar
 Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1), 1–38.MathSciNetzbMATHGoogle Scholar
 Eggeling, R. (2018). Disentangling transcription factor binding site complexity. Nucleic Acids Research. https://doi.org/10.1093/nar/gky683. (epub ahead of print).
 Eggeling, R., Gohr, A., Keilwagen, J., Mohr, M., Posch, S., Smith, A., et al. (2014a). On the value of intramotif dependencies of human insulator protein CTCF. PLoS ONE, 9(1), e85–629.CrossRefGoogle Scholar
 Eggeling, R., Grosse, I., & Grau, J. (2017). InMoDe: Tools for learning and visualizing intramotif dependencies of DNA binding sites. Bioinformatics, 33(4), 580–582.Google Scholar
 Eggeling, R., & Koivisto, M. (2016). Pruning rules for learning parsimonious context trees. In Proceedings of the 32nd conference on uncertainty in artificial intelligence (UAI) (pp. 152–161).Google Scholar
 Eggeling, R., Koivisto, M., & Grosse, I. (2015a). Dealing with small data: On the generalization of context trees. In Proceedings of the 32nd international conference on machine learning (ICML) (pp. 1245–1253).Google Scholar
 Eggeling, R., Roos, T., Myllymäki, P., & Grosse, I. (2014b). Robust learning of inhomogeneous PMMs. In Proceedings of the 17th international conference on artificial intelligence and statistics (AISTATS) (pp. 229–237).Google Scholar
 Eggeling, R., Roos, T., Myllymäki, P., & Grosse, I. (2015b). Inferring intramotif dependencies of DNA binding sites from ChIPseq data. BMC Bioinformatics, 16, 375.CrossRefGoogle Scholar
 Frank, E. (2000). Pruning decision trees and lists. Ph.D. Thesis, University of Waikato, Department of Computer Science, Hamilton, New Zealand.Google Scholar
 Fujimaki, R., & Morinaga, S. (2012). Factorized asymptotic Bayesian inference for mixture modeling. In Proceedings of the 15th international conference on artificial intelligence and statistics (AISTATS).Google Scholar
 Grau, J., Keilwagen, J., Gohr, A., Haldemann, B., Posch, S., & Grosse, I. (2012). Jstacs: A Java framework for statistical analysis and classification of biological sequences. Journal of Machine Learning Research, 13, 1967–1971.zbMATHGoogle Scholar
 Heckerman, D., Geiger, D., & Chickering, D. (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20, 197–243.zbMATHGoogle Scholar
 Hush, D., & Porter, R. (2010). Algorithms for optimal dyadic decision trees. Machine Learning, 80(1), 85–107.MathSciNetCrossRefGoogle Scholar
 Jaeger, M., Nielsen, J., & Silander, T. (2006). Learning probabilistic decision graphs. International Journal of Approximate Reasoning, 42(1–2), 84–100.MathSciNetCrossRefzbMATHGoogle Scholar
 Kangas, K., Koivisto, M., & Niinimäki, T. (2014). Learning chordal Markov networks by dynamic programming. In Advances in neural information processing systems (NIPS) (Vol. 27, pp. 2357–2365).Google Scholar
 Leonardi, F. (2006). A generalization of the PST algorithm: Modeling the sparse nature of protein sequences. Bioinformatics, 22(11), 1302–1307.CrossRefGoogle Scholar
 Li, T., Fan, K., Wang, J., & Wang, W. (2003). Reduction of protein sequence complexity by residue grouping. Protein Engineering, 16, 323–330.CrossRefGoogle Scholar
 Lichman, M. (2013). UCI machine learning repository. Irvine, CA: University of California, School of Information and Computer Science. http://archive.ics.uci.edu/ml. Accessed 8 Oct 2018.
 Lomax, S., & Vadera, S. (2013). A survey of costsensitive decision tree induction algorithms. ACM Computing Surveys, 45(2), 16:1–16:35.CrossRefzbMATHGoogle Scholar
 Nielsen, S. (2000). The stochastic EM algorithm: Estimation and asymptotic results. Bernoulli, 6(3), 457–489.MathSciNetCrossRefzbMATHGoogle Scholar
 Oliver, J. (1993). Decision graphs—an extension of decision trees. In Proceedings of the 4th international workshop on artificial intelligence and statistics (AISTATS) (pp. 343–350).Google Scholar
 Ordonéz, F., de Toledo, P., & Sanchis, A. (2013). Activity recognition using hybrid generative/discriminative models on home environments using binary sensors. Sensors, 13(5), 5460–5477.CrossRefGoogle Scholar
 Orenstein, Y., & Shamir, R. (2014). A comparative analysis of transcription factor binding models learned from PBM, HTSELEX and ChIP data. Nucleic Acids Research, 42(8), e63.CrossRefGoogle Scholar
 Peterson, E., Kondev, J., Theriot, J., & Phillips, R. (2009). Reduced amino acid alphabets exhibit an improved sensitivity and selectivity in fold assignment. Bioinformatics, 25, 1356–1362.CrossRefGoogle Scholar
 Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1, 81–106.Google Scholar
 Rantanen, K., Hyttinen, A., & Järvisalo, M. (2017). Learning chordal Markov networks via branch and bound. In Advances in neural information processing systems (NIPS), (Vol. 30, pp. 1845–1855).Google Scholar
 Rissanen, J. (1983). A universal data compression system. IEEE Transactions on Information Theory, 29(5), 656–664.MathSciNetCrossRefzbMATHGoogle Scholar
 Sandelin, A., Alkema, W., Engström, P., Wasserman, W., & Lenhard, B. (2004). JASPAR: An openaccess database for eukaryotic transcription factor binding profiles. Nucleic Acids Research, 32, D91–D94.CrossRefGoogle Scholar
 Schneider, T., & Stephens, R. (1990). Sequence logos: A new way to display consensus sequences. Nucleic Acids Research, 18(20), 6097–6100.CrossRefGoogle Scholar
 Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 2, 461–464.MathSciNetCrossRefzbMATHGoogle Scholar
 Seifert, M., Gohr, A., Strickert, M., & Grosse, I. (2012). Parsimonious higherorder hidden Markov models for improved arrayCGH analysis with applications to Arabidopsis thaliana. PLOS Computational Biology, 8(1), e1002–286.CrossRefGoogle Scholar
 Shen, Y., Choi, A., & Darwiche, A. (2018). Conditional PSDDs: Modeling and learning with modular knowledge. In Proceedings of the 33rd national conference on artificial intelligence (AAAI) (pp. 6433–6442).
 Silander, T., & Myllymäki, P. (2006). A simple approach for finding the globally optimal Bayesian network structure. In Proceedings of the 22nd annual conference on uncertainty in artificial intelligence (UAI).
 Silander, T., Roos, T., & Myllymäki, P. (2010). Learning locally minimax optimal Bayesian networks. International Journal of Approximate Reasoning, 51, 544–557.
 Smith, J., & Anderson, P. (2008). Conditional independence and chain event graphs. Artificial Intelligence, 172(1), 42–68.
 Su, J., & Zhang, H. (2005). Representing conditional independence using decision trees. In Proceedings of the 20th national conference on artificial intelligence (AAAI) (pp. 874–879).
 Teyssier, M., & Koller, D. (2005). Ordering-based search: A simple and effective algorithm for learning Bayesian networks. In Proceedings of the 21st conference on uncertainty in artificial intelligence (UAI) (pp. 584–590).
 The UniProt Consortium. (2017). UniProt: The universal protein knowledgebase. Nucleic Acids Research, 45(D1), D158–D169.
 Tian, J. (2000). A branch-and-bound algorithm for MDL learning in Bayesian networks. In Proceedings of the 16th conference on uncertainty in artificial intelligence (UAI) (pp. 580–588).
 Volf, P., & Willems, F. (1994). Context maximizing: Finding MDL decision trees. In Proceedings of 15th symposium on information theory, Benelux (pp. 192–200).
 Wilcoxon, F. (1945). Individual comparisons by ranking methods. Biometrics Bulletin, 1(6), 80–83.
 Zhao, X., Huang, H., & Speed, T. (2005). Finding short DNA motifs using permuted Markov models. Journal of Computational Biology, 12, 894–906.
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.