In the empirical part of this work, we evaluate the effects of the proposed techniques for expediting PCT learning using a Java implementation based on the Jstacs library (Grau et al. 2012). The software is available at http://www.jstacs.de/index.php/PCTLearn.
Data
We consider the problem of modeling DNA binding sites of regulatory proteins such as transcription factors, which constitutes one established application of PCTs. A data set of DNA binding sites consists of short sequences of the same length over the alphabet \(\varOmega =\{\textsf {\small A},\textsf {\small C},\textsf {\small G},\textsf {\small T}\}\) that are considered to be recognized by the same DNA-binding protein. In this application, the task is to model the conditional probability of observing a particular symbol at a certain position in the sequence given its direct predecessors—a task that directly fits the setting outlined in Sect. 1. The probability of the full sequence is, by the chain rule, simply the product over all conditionals. Due to the nature of protein–DNA interaction, the conditional distribution at a particular position is strictly position-specific, so we need to learn a separate PCT for every sequence position in a data set.
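Concretely, writing \(X_1,\dots ,X_L\) for the symbols of a sequence of length \(L\) and letting \(d\) denote the maximum number of direct predecessors a conditional may depend on, this factorization reads (using a generic shorthand that may deviate from the notation of Sect. 1):
\[
P(X_1,\dots ,X_L) \;=\; \prod _{j=1}^{L} P_j\bigl (X_j \mid X_{\max (1,\,j-d)},\dots ,X_{j-1}\bigr ),
\]
where each position-specific conditional \(P_j\) is represented by its own PCT.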
We use data from the publicly available database JASPAR (Sandelin et al. 2004), which contains a large number of DNA binding site data sets for various organisms. For the majority of this section, we focus on two exemplary data sets, which contain binding sites of the human DNA-binding proteins CTCF and REST. The sequences in both data sets are rather long (19 and 21 nucleotides), so there are quite a few PCTs of large depth to be learned. For conveniently referring to a particular learned PCT, we introduce the abbreviations CTCF-j for the PCT learned at the jth position of the CTCF data set, and REST-j likewise. Both proteins are known to recognize a rather complex sequence pattern (Eggeling et al. 2015b), which makes the structure learning problem challenging.
Figure 6 displays the position-specific marginal frequencies of both exemplary data sets in the sequence logo representation of Schneider and Stephens (1990). They differ slightly in sequence length; otherwise their properties are rather similar: both contain several highly informative positions, where the marginal distribution clearly favors a single symbol, but also at least an equally large number of positions where the marginal distribution contains only little information. The biggest difference between the two data sets is the sample size \(N\), that is, the number of sequences available for estimating the distributions: for CTCF we have \(N=908\), for REST we have \(N=1575\).
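As a reminder of how the logos are to be read, the height of the symbol stack at position \(j\) is, neglecting the small-sample correction of Schneider and Stephens (1990), the information content
\[
R_j \;=\; \log _2 |\varOmega | \;+\; \sum _{a\in \varOmega } p_j(a)\,\log _2 p_j(a) \;=\; 2 - H_j \ \text{bits},
\]
where \(p_j(a)\) is the marginal frequency of symbol \(a\) at position \(j\) and \(H_j\) the corresponding entropy; highly informative positions thus approach 2 bits, uninformative ones 0 bits.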
For both data sets, we now learn optimal PCTs according to the BIC score, setting the maximum depth to \(d=6\), except for the first six sequence positions, where the maximum depth is limited by the number of available explanatory variables. We show the resulting PCT structures in Fig. 7, omitting the node labels in order to obtain a compact representation. Each node is still colored according to the size of its label, that is, whether it represents a singleton, the full alphabet, or a case in between.
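For reference, with \(\hat {L}(\mathcal {T})\) denoting the maximized likelihood of a PCT \(\mathcal {T}\) and \(\mathcal {L}(\mathcal {T})\) its set of leaves (symbols introduced only for this display, which may differ from the notation of the earlier sections), the BIC score in its standard maximization form is
\[
\mathrm {BIC}(\mathcal {T}) \;=\; \log \hat {L}(\mathcal {T}) \;-\; \frac {(|\varOmega |-1)\,|\mathcal {L}(\mathcal {T})|}{2}\,\log N,
\]
since each leaf carries a multinomial distribution with \(|\varOmega |-1\) free parameters.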
We observe that the complexities of the optimal PCTs differ. In both data sets, there are sequence positions where a PCT representing full statistical independence of the variable given its predecessors is optimal according to the BIC score, which typically, though not always, occurs at highly informative positions. For CTCF, all optimal PCTs have splits down to at most depth three, whereas for REST the allowed maximum depth of 6 is actually used to full capacity in the case of REST-11 and REST-20; one final split occurs at depth 5, and three final splits at depth 4. The preference of REST for deeper trees, in comparison to CTCF, may be caused by a combination of a larger sample size, which allows a somewhat higher model complexity, and the location of the highly informative positions in clusters, which spatially separates the low-informative positions among which dependencies are likely to occur.
The height and shape of the optimal PCT structures suggest that the PCT optimization for REST is generally computationally harder than for CTCF. In the following sections, we utilize both data sets for evaluating the effectiveness of the proposed memoization and pruning techniques.
Pruning versus memoization
In a first study, we compare the effect of memoization in its maximal variant, pruning with the fine upper bound and one-step lookahead (\(q=1\)), and the combination of both techniques for finding optimal PCTs of maximum depth \(d=6\). For each position \(j>6\) in both data sets, we count the number of visited nodes, which are nodes in the extended PCT that are explicitly created (including lookahead nodes), and plot the savings achieved by each algorithmic variant in relation to the basic DP algorithm of Sect. 2.3 in Fig. 8. We observe that the general pattern is similar for both data sets.
Memoization reduces the search space by approximately one order of magnitude on average, and the savings vary only slightly from position to position. This can be explained by the structure of the data sets, where most positions have both high- and low-informative positions as predecessors, so the potential for exploiting regularities in the explanatory variables is in a similar range.
The effect of pruning, however, varies to a large degree. As a rule of thumb, pruning yields a tremendous reduction of the search space at high-information positions. In one exceptional case, CTCF-13, it is possible to prune already at the root, which we cannot always expect to happen: other positions with a minimal optimal tree displayed in Fig. 7 (top) require more effort to declare statistical independence. The savings at low-information positions are not as pronounced, but for all 28 cases under consideration, pruning yields higher savings than memoization.
It is thus no surprise that the combination of both is dominated by the effect of pruning: Memoization contributes only small additional savings for positions where pruning is not overly effective, such as CTCF-8 or REST-15.
Comparing the two data sets to each other, we find that the aggregated savings for CTCF are higher than for REST, which confirms the speculation from the previous section. In particular, for REST-11 and REST-15, finding optimal PCTs is relatively demanding. However, the optimal tree structure only implies a tendency; the correlation is not perfect: REST-7 and REST-20 appear to be equally challenging instances, yet the former yields a minimal optimal tree, whereas the latter yields an optimal tree with five leaves that reaches up to depth 6.
Pruning variants in detail
In the last section, we saw that pruning with the fine upper bound and one-step lookahead is very competitive and that adding memoization on top of it yields only marginal additional savings. Now, we take a closer look at pruning itself in order to evaluate how large the impact of the different variants is. We compare the cross-combinations of (1) the coarse and fine upper bounds and (2) q-step lookahead with \(q\in \{0,1,2\}\). The results are shown in Fig. 9.
We observe that the biggest differences among the methods occur at seemingly “easy” positions: the most striking example is again CTCF-13, where the difference between the best and the worst pruning technique amounts to four orders of magnitude. Moreover, switching from the coarse to the fine upper bound has a higher impact than changing the number of lookahead steps. Except for a few difficult cases (CTCF-18, REST-11, REST-15), using the fine bound always has a clearly positive effect on the reduction of the search space, and it never increases the workload in terms of the number of visited nodes.
Lookahead, however, can have a negative effect, as it potentially increases the search space in cases where it contributes little to tightening the bounds. With the coarse upper bound, lookahead clearly pays off: \(q={1}\) and \(q={2}\) are almost equally good and in some cases (CTCF-13, REST-17) substantially better than \(q=0\). With the fine upper bound, \(q={1}\) performs best. For a few positions (CTCF-14, REST-9, REST-16), the one-step lookahead improves on the fine upper bound without lookahead by more than one order of magnitude. Furthermore, for the majority of positions \(q={1}\) is slightly superior to \(q={2}\), but there are a few instances where further lookahead pays off, such as REST-9 or REST-10. The cases \(q>2\), which we omit from the plots for clarity, continue the trend from \(q={1}\) to \(q={2}\) and yield inferior performance.
We conclude that the fine upper bound in combination with one-step lookahead is a competitive choice. Two-step lookahead is not substantially worse on these data sets, as the additional number of visited lookahead nodes is compensated for by the tighter bound, so the choice of this parameter is robust.
The AIC score
In the previous two sections, we used the BIC score as the objective function to be optimized. It is a reasonable choice in the domain of DNA binding site modeling due to its harsh penalty term (Eggeling et al. 2014b), which yields sparse trees as shown in Fig. 7. Now, we repeat the study from Sect. 7.2, but replace BIC by AIC, which is known to penalize complex models less heavily. While we refrain from showing the optimal PCTs for brevity, they are indeed substantially more complex in terms of the number of leaves: for CTCF the mean over all sequence positions is 11.4 and the median is 8, for REST the mean is 12.1 and the median is 13.
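For comparison, in the same maximization convention as the BIC display given earlier (again a notational shorthand), AIC replaces the sample-size-dependent penalty with a constant penalty per parameter:
\[
\mathrm {AIC}(\mathcal {T}) \;=\; \log \hat {L}(\mathcal {T}) \;-\; (|\varOmega |-1)\,|\mathcal {L}(\mathcal {T})|.
\]
Since \(\tfrac {1}{2}\log N > 1\) for the sample sizes considered in this study, each additional leaf is penalized less under AIC than under BIC, which explains the considerably more complex AIC-optimal trees.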
We again compute optimal PCTs of depth \(d=6\) for all algorithm variants. The results are shown in Fig. 10. The savings for memoization are exactly the same as in the case of the BIC score, which serves as a sanity check: the memoization technique does not distinguish between BIC and AIC, and so the results must be identical.
The results for pruning, however, change dramatically. Due to the less harsh penalty term, total statistical independence never occurs, that is, the minimal tree is never optimal. Moreover, context-specific independence can be declared in far fewer cases than for BIC, and so the pruning rules are less effective. The largest saving occurs for CTCF-10, where the AIC-optimal PCT has only four leaves; the saving is a little more than three orders of magnitude, which is comparable to the worst cases for BIC on the same data set. There are even instances where the reduction of the search space is smaller than one order of magnitude.
The comparatively poor effect of the pruning rules, however, changes the game when pruning is combined with memoization. While in some cases like CTCF-10 or REST-8 pruning alone could still suffice, and in a few other cases like CTCF-14, CTCF-18, or REST-10 memoization alone already yields the best possible result, combining the two ideas clearly pays off for the majority of positions. This demonstrates that memoization can in principle be as valuable as pruning, or even more effective, depending heavily on the scoring function and the complexity of the optimal model structures.
Memoization revisited
As demonstrated in the last section, the memoization technique has the merit of yielding a certain reduction of the search space, no matter whether the scoring function favors sparse or complex models. However, memoization has the downside that storing solutions to previously computed subproblems—either scores associated with data subsets or even entire subtrees—can substantially increase the memory consumption.
We thus investigate the impact of the memoization depth \(m\), which indicates the deepest layer of the extended PCT for which subproblems are stored for potential re-use later on. For measuring time consumption, we count the number of visited nodes in the extended PCT. Since the total running time for a data set is the main factor of interest, we here take the mean value over all positions. For measuring space consumption, we count the number of stored nodes. Here, however, we take the maximum over all positions, since it is typically the quantity of interest for deciding whether a problem can be solved on a given machine. Figure 11 displays the results.
We observe that the pattern is similar for all six cases, and \(m=4\) gives the overall best tradeoff between time and space complexity. For cases where pruning is rather effective, such as BIC, space complexity may not become a critical bottleneck, so even \(m=5\) could be justified. In the other cases, it might be a good idea to stop storing subproblems one layer earlier by setting \(m=d-2\) and to compute, if needed, the optimal partition of the leaf nodes of the extended PCT explicitly.
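To make the role of the memoization depth concrete, the following minimal Java sketch (not the actual Jstacs-based implementation; the subproblem key and the scoring routine are hypothetical placeholders) caches optimal subtree scores only for nodes up to depth \(m\) of the extended PCT, so that deeper subproblems are recomputed on demand:

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Minimal sketch of depth-limited memoization: optimal subtree scores are
 * cached only for nodes up to depth m of the extended PCT, trading memory
 * for recomputation below that layer.
 */
class DepthLimitedMemo {
    private final int memoDepth;                          // parameter m
    private final Map<String, Double> cache = new HashMap<>();

    DepthLimitedMemo(int memoDepth) { this.memoDepth = memoDepth; }

    double optimalSubtreeScore(String subproblemKey, int depth) {
        if (depth <= memoDepth) {
            Double cached = cache.get(subproblemKey);
            if (cached != null) return cached;            // re-use stored solution
        }
        double score = solveSubproblem(subproblemKey, depth); // expensive recursion
        if (depth <= memoDepth) cache.put(subproblemKey, score);
        return score;
    }

    int storedNodes() { return cache.size(); }            // proxy for space consumption

    private double solveSubproblem(String key, int depth) {
        // placeholder for the recursive optimization of the subtree below this node
        return 0.0;
    }
}
```

Setting the memoization depth to \(m=d-2\), as discussed above, corresponds to excluding the two deepest layers of the extended PCT from the cache.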
Broad study
In the previous sections, we investigated two data sets in detail and used the number of visited nodes in the extended PCT as an evaluation metric. Two open questions remain: How do the numbers of visited nodes translate to running times? How do the algorithmic variants perform on a larger variety of data sets, in particular with respect to the sample size?
In order to shed light on these issues, we now investigate 95 data sets with varying sample size, from \(N=102\) to \(N=8,\!734\) (see Appendix for the full list). We use the BIC score as objective function, the fine upper bound with the lookahead parameter \(q=1\) as pruning method, and full memoization. The sequence length, which determines the number of PCTs to be optimized, differs among data sets. Using \(d=6\), we learn all 767 PCTs and plot the running times required for each of the four algorithmic variants in Fig. 12(left). Performing a signed-rank test of Wilcoxon (1945) among these variants, we find that all pairwise differences are highly significant with p values below \(10^{-10}\).
The results generally confirm the observations from the previous sections: pruning gives larger savings than memoization, even though the difference in running times is not as large as the difference in the number of visited nodes (Sect. 7.2). One explanation is that computing the fine upper bound has a certain computational cost, whereas memoization incurs a memory overhead rather than a computation overhead. In addition, memoization can also give improvements in cases where pruning itself is ineffective. As a consequence, the combination of pruning and memoization is significantly the best choice for speeding up PCT optimization, reducing the median running time by almost two orders of magnitude.
In Fig. 12(right), we plot for this best variant the running time against the number of visited nodes in the extended PCT, for each of the 767 problem instances. We color each point in the scatter plot by the size of the data set, distinguishing three size groups, roughly on a log scale: small with \(N<500\), typical with \(500<N<3000\), and large with \(N>3000\), consisting of 23, 52, and 20 data sets respectively (each amounting to several instances). We observe that the running time correlates well with the number of visited nodes (Pearson correlation coefficient \(\rho =0.90\)).
One factor that prohibits a perfect linear correlation between running times and visited nodes is the sample size \(N\), which itself has a roughly linear effect on the running time. This is because all data points need to be read and distributed among the nodes in the extended PCT. It becomes most evident in the cases where pruning applies directly at the root (one visited extended PCT node), for which the correlation between sample size and running time is almost perfect (\(\rho =0.99\)). For the remaining cases, the relationship is less clear-cut, but the general trend remains the same. For the four-symbol alphabet, data management, as opposed to alphabet partitioning, dominates the workload in each node of the extended PCT.
Running times for different parameter values
The previous section discussed the running times for concrete selections of the algorithms’ parameters. We now set these parameters, one at a time, to possible alternative values and study the effects on the running time (Fig. 13). We observe that for every parameter there exist some problem instances that benefit from a change of the parameter value, but we nevertheless observe a general trend.
When using the coarse bound instead of the fine bound (top, right), we find that for the majority of problem instances the running time increases, in many cases by more than one order of magnitude. Keeping the fine bound but disabling the lookahead instead (top, center) also leads to an increased running time for the majority of instances. These are often cases where the minimal PCT is optimal (red), and whereas the fine bound enables pruning very early in the optimization, the coarse bound does not. Increasing the lookahead from \(q=1\) to \(q=2\) has relatively little effect, and thus confirms the expectations gained from analyzing the number of visited nodes (cf. Fig. 9).
When varying the memoization parameter \(m\), we observe that for the majority of problem instances the running times remain largely identical, especially those where the optimal PCT has only one leaf. However, for many instances where the optimal PCT has more than one leaf, gradually disabling memoization by reducing \(m\) increases the running time. These results also confirm the expectation from the earlier analysis that pruning and memoization complement each other: whereas the former technique attempts to quickly identify context-specific independencies (including complete independence), the latter allows savings also in cases where the optimal PCT is relatively complex.
Predictive performance
Armed with the algorithmic tricks described in this paper, we are now able to study the predictive performance of a PCT-based model, dubbed iPMM (inhomogeneous parsimonious Markov model), on a large scale. We also investigate the performance of Bayesian networks (BNs), which have previously been proposed for modeling the complexity of transcription factor binding sites (Barash et al. 2003). This comparison is particularly relevant as the two model classes take into account different features in the data: iPMMs allow dependencies only among nucleotides in close proximity, but they model such dependencies in a very sparse and efficient way. BNs also allow long-range dependencies among distant positions in the sequence, but they are potentially less effective for short-range dependencies due to their use of conditional probability tables.
To allow a fair comparison of the structural features of both model classes, we learn globally optimal iPMMs and BNs with the same structure score (BIC) and the same parameter estimator given the structure (posterior mean with pseudo count 1/2). For BN structure learning, we use an implementation of the dynamic programming algorithm of Silander and Myllymäki (2006), which is sufficient for finding a globally optimal DAG for the problem sizes within this application domain. For evaluating the predictive performance of both models, we employ a repeated holdout approach with 90% training data and 100 repetitions. For each data set, we compute the mean log predictive probabilities and test whether the difference between the two models is significant using the signed-rank test of Wilcoxon (1945). The individual results for all 95 data sets under consideration are shown in the Appendix.
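The evaluation protocol itself is simple to restate; the following Java sketch is a schematic of the repeated holdout with 90% training data and 100 repetitions, in which trainModel and logPredictiveProbability are hypothetical placeholders for fitting either model class and scoring held-out sequences:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Schematic repeated holdout: 90% training data, 100 repetitions, mean log predictive probability. */
class RepeatedHoldout {
    static double meanLogPredictiveProbability(List<String> sequences, long seed) {
        Random rng = new Random(seed);
        int repetitions = 100;
        double sum = 0.0;
        for (int r = 0; r < repetitions; r++) {
            List<String> shuffled = new ArrayList<>(sequences);
            Collections.shuffle(shuffled, rng);
            int split = (int) Math.round(0.9 * shuffled.size());   // 90% for training
            List<String> train = shuffled.subList(0, split);
            List<String> test = shuffled.subList(split, shuffled.size());
            Object model = trainModel(train);                      // fit iPMM or BN on training data
            double logP = 0.0;
            for (String s : test) logP += logPredictiveProbability(model, s);
            sum += logP / test.size();                             // mean over held-out sequences
        }
        return sum / repetitions;                                  // mean over repetitions
    }

    static Object trainModel(List<String> train) { return new Object(); }              // placeholder
    static double logPredictiveProbability(Object model, String s) { return 0.0; }     // placeholder
}
```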
Table 2 summarizes these results for a few different significance levels. We find that iPMMs describe the majority of data sets significantly better than BNs, which justifies the use of a PCT-based model. Data sets with long-range dependencies among nucleotides, which cannot be taken into account by iPMMs, exist, but they are the exception rather than the rule.
Table 2 Number of instances for which an iPMM predicts better/worse than a BN

While the absolute difference between the predictive probabilities may seem small, the practical relevance depends on the concrete application. For scanning an entire genome with a threshold-based approach, for instance, even small differences in the predictive probability may have a substantial impact on the number of false positives. In addition to the general advantages such as easy visualization discussed in Sect. 1, iPMMs also have the conceptual advantage over BNs that the running time grows only linearly with the sequence length. Hence, they could be used to model longer sequence patterns, while still retaining optimality with respect to the chosen objective function.
Other types of data
Since DNA binding site data (1) concern only \(|\varOmega |=4\) and (2) entail some highly informative response variables due to the inhomogeneity of the iPMM used, they may not be fully representative of other types of data. We thus additionally investigate our algorithmic techniques on learning PCTs from protein sequences, which are typically described using the 20-letter amino acid alphabet. However, for many applications it is common to reduce this alphabet to smaller sizes based on, e.g., similar biochemical properties of certain amino acids (Li et al. 2003; Peterson et al. 2009; Bacardit et al. 2009). In this study we use the alphabet reduction method of Li et al. (2003), since it offers, for each possible reduced alphabet size, an optimal clustering of amino acids into groups.
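As an illustration of what such a reduction does to the data, the following Java sketch maps each residue to a group symbol before PCT learning; the three-group assignment shown is a deliberately simple, hypothetical example and not the optimal clustering of Li et al. (2003):

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of alphabet reduction: each amino acid is mapped to a group symbol of a
 * smaller alphabet. The grouping below (hydrophobic/polar/charged) is purely
 * illustrative and does not reproduce the clustering of Li et al. (2003).
 */
class AlphabetReduction {
    private static final Map<Character, Character> GROUP = new HashMap<>();
    static {
        String hydrophobic = "AVLIMFWPGC", polar = "STYNQH", charged = "DEKR";
        for (char a : hydrophobic.toCharArray()) GROUP.put(a, '0');
        for (char a : polar.toCharArray())       GROUP.put(a, '1');
        for (char a : charged.toCharArray())     GROUP.put(a, '2');
    }

    static String reduce(String proteinSequence) {
        StringBuilder sb = new StringBuilder(proteinSequence.length());
        for (char a : proteinSequence.toCharArray()) {
            Character g = GROUP.get(Character.toUpperCase(a));
            if (g == null) throw new IllegalArgumentException("Unknown residue: " + a);
            sb.append(g);
        }
        return sb.toString();
    }
}
```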
We study protein sequences from the UniProt database (The UniProt Consortium 2017). In order to somewhat limit the number of data sets, we consider only human proteins with catalytic activity. In addition, we restrict the data sets to those with a protein length between 250 and 500 residues, which is motivated by the median human protein length of 375 (Brocchieri and Karlin 2005). We further exclude three selenoproteins and finally retain 1191 sequences.
For each of these sequences, we learn a PCT (thus implicitly assuming a homogeneous model) with the basic DP algorithm and with our full algorithm with all improvements enabled, and plot the required running times for three combinations of alphabet size and maximal PCT depth in Fig. 14. We find that our algorithm speeds up structure learning also for this type of data and model. Compared to the results on DNA binding sites, the savings are smaller on average, but the variance in the savings is also decreased. This can be explained by the observation that in homogeneous sequences the response variables rarely have an extreme marginal distribution, so pruning at or close to the root almost never occurs, even if the independence model were optimal.
In order to also consider an example from a non-biological domain, we evaluate the PCT learning algorithms on the activities of daily living (ADL) data of Ordonéz et al. (2013), obtained from the UCI machine learning repository (Lichman 2013). We extract the sequences of daily activities of both users, which contain nine and ten possible states, respectively. We further combine the states “Breakfast”, “Lunch”, “Snack”, and “Dinner” (the latter occurs only for user B) into a single state “Meal”, thus obtaining two sequences over an alphabet of size seven. We use these data for learning PCTs of depth four for both ADL data sets with our algorithm and with the basic variant, and display the results in Table 3. We again obtain a substantial reduction of search space and running time, comparable to the previous results on protein sequences.
Table 3 Algorithm comparison for learning PCTs of depth \(d=4\) on ADL data
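The merging of meal-related activities described above amounts to a simple relabeling step; a minimal Java sketch (the state names are those of the ADL data set, the helper itself is ours):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

/** Relabels meal-related activities into a single "Meal" state, yielding an alphabet of size seven. */
class AdlPreprocessing {
    private static final Set<String> MEALS =
            new HashSet<>(Arrays.asList("Breakfast", "Lunch", "Snack", "Dinner"));

    static List<String> mergeMealStates(List<String> activities) {
        return activities.stream()
                .map(a -> MEALS.contains(a) ? "Meal" : a)   // collapse all meal states
                .collect(Collectors.toList());
    }
}
```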