Background

Introduction

RNA secondary structure prediction (i.e., RNA folding) [1] “is the process by which a linear ribonucleic acid (RNA) molecule acquires secondary structure through intra-molecular interactions. The folded domains of RNA molecules are often the sites of specific interactions with proteins in forming RNA–protein (ribonucleoprotein) complexes.” Unlike double-stranded DNA, the primary structure of RNA is a single strand, which may be viewed as a chain (i.e., a sequence) of nucleotides over the alphabet {A (adenine), U (uracil), G (guanine), C (cytosine)}. This single strand can fold onto itself so that (A, U), (C, G), and (G, U) form complementary base pairs. The secondary structure of an RNA is the two-dimensional structure defined by the list of complementary base pairs that gives the minimum energy. An RNA folding algorithm predicts this secondary structure. In other words, we are given the primary structure of an RNA, a sequence \(A[1:n] = a_{1} a_{2} \cdots a_{n}\), where \(a_{i} \in \{A, U, G, C\}\), and we are required to determine the non-nested (pseudoknot-free) structure P with minimum energy, i.e., the structure P in which the number of complementary base pairs is maximum. (A pseudoknot [2] “is a nucleic acid secondary structure containing at least two stem-loop structures in which half of one stem is intercalated between the two halves of another stem.”)

Smith and Waterman (SW) [3] and Nussinov et al. [4] proposed dynamic programming algorithms for RNA folding in 1978. Zuker et al. [5] modified Nussinov’s algorithm using thermodynamic and auxiliary information. The asymptotic complexity of the SW, Nussinov, and Zuker algorithms is \(O(n^{3})\) time and \(O(n^{2})\) space, where n is the length of the RNA sequence. Li et al. [6] proposed a cache-aware version of Nussinov’s algorithm, called Transpose, that takes twice the memory but reduces run time significantly. Many parallel algorithms for RNA folding have also been proposed (see, e.g., [6–15]).

In this paper, we focus on reducing the number of cache misses that occur in the computation of Nussinov’s method without increasing the memory requirement. Our interest in cache misses stems from two observations: (1) the time required to service a lowest-level-cache (LLC) miss is typically 2 to 3 orders of magnitude more than the time for an arithmetic operation, and (2) the energy required to fetch data from main memory is typically 60 to 600 times that needed when the data is on the chip. As a result of observation (1), cache misses dominate the overall run time of applications for which the hardware/software cache prefetch modules on the target computer are ineffective in predicting future cache misses. The effectiveness of hardware/software cache prefetch mechanisms varies with the application, computer architecture, compiler, and compiler options used. So, if we are writing code that is to be used on a variety of computer platforms, it is desirable to write cache-efficient code rather than to rely exclusively on the cache prefetching of the target platform. Even when the hardware/software prefetch mechanism of the target platform is very effective in hiding memory latency, observation (2) implies excessive energy use when there are many cache misses.

We develop three algorithms that meet our objective of cache efficiency without a memory increase: ByRow, ByRowSegment, and ByBox. Since these take the same amount of memory as Classical while Transpose takes twice as much, the maximum problem size (n) that can be solved in any fixed amount of memory by the algorithms Classical, ByRow, ByRowSegment, and ByBox is 40% more than what can be handled by Transpose. On practical but large instances, ByRow and ByRowSegment have the same run time performance. Our experiments indicate that, depending on the computational platform and programming language, either ByRow or ByBox gives the best run time and energy performance. In fact, the C versions of our proposed algorithms reduce run time by as much as 97.2% and energy consumption by as much as 88.8% relative to Classical, and by as much as 56.3% and 57.8%, respectively, relative to Transpose. The Java versions reduce run time by as much as 98.3% relative to Classical and by as much as 75.2% relative to Transpose.

The rest of the paper is organized as follows. We first introduce the simple cache model used in our cache-efficiency analysis. Then we propose three cache- and memory-efficient RNA folding algorithms and analyze them theoretically using this cache model. Finally, we present our experimental and benchmark results.

Cache model

We use a simple cache model so that the cache miss analysis is of manageable complexity. In this model, there is a single cache whose capacity is sw words, where s is the number of cache lines and w is the number of words in a cache line. Each data item is assumed to have the same size as a word. The main memory is assumed to be partitioned into blocks of w words each. Data transfer between the cache and memory takes place in units of a block (equivalently, a cache line). A read miss occurs whenever the program attempts to read a word that is not in cache. To service this cache miss, the block of main memory that includes the needed word is fetched and copied into a cache line, which is selected using the LRU (least recently used) rule. Until this block of main memory is evicted from this cache line, its words may be read without additional cache misses. We assume a write-back cache with write allocate. That is, when the program needs to write a word of data, a write miss occurs if the corresponding block of main memory is not currently in cache. To service the write miss, the corresponding block of main memory is fetched and copied into a cache line. Write back means that the word is written to the appropriate cache line only. A cache line with changed content is written back to main memory when it is about to be overwritten by a new block from main memory.
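
To make this model concrete, the following minimal C sketch simulates it by counting the misses incurred by a stream of word addresses. It is illustrative only: the cache is fully associative with S lines of W words each, LRU replacement is explicit, and all identifiers (touch_word, S, W) are ours rather than from any code in this paper.

```c
#include <stdio.h>

#define S 4                 /* number of cache lines (illustrative) */
#define W 8                 /* words per cache line (illustrative)  */

static long line[S];        /* block number held by each line; -1 = empty */
static long stamp[S];       /* last-use time of each line, for LRU        */
static long now, misses;

/* Touch the word at address addr.  Reads and writes are treated alike
   because a write miss also fetches the block (write allocate); a full
   write-back simulation would additionally flush the evicted dirty line. */
static void touch_word(long addr) {
    long block = addr / W;
    int lru = 0;
    for (int i = 0; i < S; i++)
        if (line[i] == block) { stamp[i] = ++now; return; }    /* hit  */
    misses++;                                                  /* miss */
    for (int i = 1; i < S; i++)        /* choose the LRU victim line   */
        if (stamp[i] < stamp[lru]) lru = i;
    line[lru] = block;
    stamp[lru] = ++now;
}

int main(void) {
    for (int i = 0; i < S; i++) line[i] = -1;
    for (long a = 0; a < 1024; a++) touch_word(a);   /* sequential scan */
    printf("%ld misses for 1024 sequential words (= 1024/W)\n", misses);
    return 0;
}
```

As expected under the model, a sequential scan of 1024 words incurs 1024/W = 128 misses, one per block.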

In practice, modern computers commonly have two or three levels of cache and employ sophisticated adaptive cache replacement strategies rather than the LRU strategy described above. Further, hardware and software cache prefetch mechanisms, as well as out-of-order execution, are often deployed to hide the latency involved in servicing a cache miss. These mechanisms may, for example, attempt to learn the memory access pattern of the current application and then predict the future need for blocks of main memory. The predicted blocks are brought into cache before the program actually tries to read/write from/into those blocks, thereby avoiding (or reducing) the delay involved in servicing a cache miss. Actual performance is also influenced by the compiler used and the compiler options in effect at the time of compilation.

As a result, actual performance may bear little relationship to the analytical results obtained for our simple cache model. Despite this, we believe the simple cache model serves a useful purpose in directing the quest for cache-efficient algorithms that eventually need to be validated experimentally. We believe this because our simple model favors algorithms that exhibit good spatial locality in their data access pattern over those that do not and all cache architectures favor algorithms with good spatial locality. The experimental results reported in this paper strengthen our belief in the usefulness of our simple model. These results indicate that algorithms with a smaller number of cache misses on our simple model actually have a smaller number of (lowest level) cache misses on a variety of modern computers that employ potentially different cache replacement strategies (vendors often use proprietary cache replacement strategies). Further, a reduction in cache misses on our simple model often translates into a reduction in run time.

Methods

Classical RNA folding algorithm (Nussinov’s algorithm)

Let \(A[1:n] = a_{1} a_{2} \cdots a_{n}\) be an RNA sequence and let \(H_{i,j}\) be the maximum number of complementary pairs in a folding of the subsequence \(A[i:j]\), \(1 \le i \le j \le n\). So, \(H_{1,n}\) is the score of the best folding of the entire sequence \(A[1:n]\). The following dynamic programming equations to compute \(H_{1,n}\) are due to Nussinov [4].

$$ H_{i,i-1} = 0, \ 2 \le i \le n $$
(1)
$$ H_{i,i} = 0, \ 1 \le i \le n $$
(2)
$$ H_{i,j} = \max\left\{ \begin{array}{l} H_{i+1,j} \\ H_{i,j-1} \\ H_{i+1,j-1} + c(a_{i}, a_{j}) \\ \max_{i<k<j}\{H_{i,k}+ H_{k+1,j}\} \end{array} \right. $$
(3)

where \(c(a_{i}, a_{j})\) is the match score between characters \(a_{i}\) and \(a_{j}\): if \(a_{i}\) and \(a_{j}\) form a complementary pair (AU, GC, or GU), then \(c(a_{i}, a_{j})\) is 1; otherwise, it is 0. The different cases of the recurrence in Nussinov’s algorithm are illustrated in Fig. 1: Fig. 1a shows the case when \(a_{i}\) is added to the best RNA folding of the subsequence \(A[i+1:j]\), Fig. 1b shows the case when \(a_{j}\) is added to the best RNA folding of \(A[i:j-1]\), Fig. 1c shows the case when the pair \((a_{i}, a_{j})\) is added to the best RNA folding of \(A[i+1:j-1]\), and Fig. 1d shows the combining of two subsequences \(A[i:k]\) and \(A[k+1:j]\) into one.

Fig. 1

Four cases for Nussinov’s equations [21]

The cases of Fig. 1a and b can be considered special cases of combining two subsequences in which one of the subsequences consists of a single node. Consequently, several authors ([15], for example) have observed that Nussinov’s equations may be simplified to

$$ H_{i,i} = 0, \ 1 \le i \le n $$
(4)
$$ H_{i,i+1} = 0, \ 1 \le i \le n-1 $$
(5)
$$ H_{i,j} = \max\left\{ \begin{array}{l} H_{i+1,j-1} + c(a_{i}, a_{j}) \\ \max_{i\le k<j}\{H_{i,k}+ H_{k+1,j}\} \end{array} \right. $$
(6)

Once the best RNA folding score, \(H_{1,n}\), has been computed, a standard dynamic programming traceback procedure, which takes \(O(n)\) time, may be performed to find the path leading to the maximum score. This path defines the actual RNA secondary structure.

Algorithm 1 gives the Classical algorithm to compute \(H_{1,n}\) using the simplified Nussinov equations. This algorithm computes H by diagonals and, within a diagonal, from top to bottom. Its run time is \(O(n^{3})\). Although the algorithm is written using two-dimensional array notation for H, we need only the upper triangle of H. Hence, a memory-efficient implementation would either map the upper triangle into a 1D array or employ a dynamically allocated 2D array with variable-size rows. In either case, we would need memory for \(n(n+1)/2\) elements of H rather than for \(n^{2}\) elements.
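
The following C sketch is our illustrative rendering of this computation (the paper’s Algorithm 1 may differ in details); for readability it uses a full 2D array rather than one of the memory-efficient layouts described above.

```c
/* Illustrative rendering of algorithm Classical: H is computed by
   diagonals and, within a diagonal, from top to bottom, using the
   simplified Eqs. (4)-(6).  Indices here are 0-based. */
static int match(char x, char y) {           /* c(a_i, a_j) of Eq. (6) */
    return (x == 'A' && y == 'U') || (x == 'U' && y == 'A') ||
           (x == 'G' && y == 'C') || (x == 'C' && y == 'G') ||
           (x == 'G' && y == 'U') || (x == 'U' && y == 'G');
}

void classical(int n, int **H, const char *a) {
    for (int i = 0; i < n; i++) {            /* Eqs. (4) and (5)       */
        H[i][i] = 0;
        if (i + 1 < n) H[i][i + 1] = 0;
    }
    for (int d = 2; d < n; d++)              /* diagonal: j - i = d    */
        for (int i = 0; i + d < n; i++) {
            int j = i + d;
            int best = H[i + 1][j - 1] + match(a[i], a[j]);
            for (int k = i; k < j; k++) {    /* max over split points  */
                int t = H[i][k] + H[k + 1][j];
                if (t > best) best = t;
            }
            H[i][j] = best;
        }
}
```

The column accesses H[k + 1][j] in the innermost loop are the source of the cache misses analyzed next.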

For the (data) cache miss analysis, we focus on the read and write misses of the array H and ignore misses due to reads of the sequence A as well as of the scoring matrix c (notice that there are no write misses for A and c). Figure 2 shows the memory access pattern for H. Figure 2a shows the order (by diagonals and, within a diagonal, from top to bottom) in which the elements of H are computed. In this figure, three diagonals have been computed, as have 2 elements of the fourth; we are presently computing the third element (\(H_{i,j}\)) of the fourth diagonal. Figure 2b shows the elements of H in row i and column j that are needed for the computation of \(H_{i,j}\) (i.e., in the computation of \(\max\{H_{i,k}+H_{k+1,j}\}\)). The elements in row i are accessed from left to right while those in column j are accessed from top to bottom. So, w row elements are brought into cache with a single miss, while a miss takes place for each element of column j that is accessed. Note that the cache lines for column j also contain the column j+1 data needed in the computation of \(H_{i+1,j+1}\). However, when n is sufficiently large, this data is overwritten by new data under the LRU policy before it can be used in the computation of \(H_{i+1,j+1}\). So, for each of the \(j-i\) sums of \(\max\{H_{i,k}+H_{k+1,j}\}\), we incur \(1/w\) read misses on average for \(H_{i,k}\) and 1 read miss for \(H_{k+1,j}\). Over the entire computation we compute \(n^{3}/6\) (plus low order terms) such sums, incurring a total of \((n^{3}/6)(1+1/w)\) read misses. Although completing the computation of \(H_{i,j}\) also requires \(H_{i+1,j-1}\), accessing these values of H incurs only \(O(n^{2})\) read misses. The number of write misses for H is also \(O(n^{2})\). So, for our simplified cache model, the number of cache misses incurred when computing H using algorithm Classical is \((n^{3}/6)(1+1/w)\) (plus low order terms).

Fig. 2

Memory access pattern for algorithm Classical (Algorithm 1)

Transpose RNA folding algorithm

Li et al. [6] have proposed a cache-efficient computation of Nussinov’s simplified equations. Their algorithm, which we refer to as Transpose, uses an \(n \times n\) array H in which the upper triangle stores the \(H_{i,j}\), \(j \ge i\), values defined by Nussinov’s equations and the lower triangle stores the transpose of the upper triangle. That is, \(H_{i,j} = H_{j,i}\) for all i and j. As new \(H_{i,j}\)s are computed, they are stored in both \(H_{i,j}\) and \(H_{j,i}\). The sum \(H_{i,k} + H_{k+1,j}\) is computed as \(H_{i,k} + H_{j,k+1}\), with the result that a sum now requires only \(2/w\) cache misses on average. So, the total number of read misses is \((n^{3}/6)(2/w)\) plus low order terms. The number of write misses is \(O(n^{2})\). The ratio of the cache misses of Classical to those of Transpose is approximately \((1+1/w)/(2/w) = (w+1)/2\). The run time remains \(O(n^{3})\).
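
Relative to Classical, only the innermost maximization and the store change; the following fragment is our illustrative rendering (not the actual code of Li et al. [6]), with match() as defined earlier.

```c
/* Illustrative inner loop of Transpose.  Because H[j][k+1] mirrors
   H[k+1][j], both operands of each sum are read with stride-1,
   row-major accesses. */
int best = H[i + 1][j - 1] + match(a[i], a[j]);
for (int k = i; k < j; k++) {
    int t = H[i][k] + H[j][k + 1];   /* H[j][k+1] == H[k+1][j]      */
    if (t > best) best = t;
}
H[i][j] = best;                      /* store the new value...      */
H[j][i] = best;                      /* ...and maintain its mirror  */
```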

ByRow RNA folding algorithm

Although Transpose reduces the number of cache misses (in our model) by an impressive factor of \((w+1)/2\) relative to Classical, it does so at the cost of doubling the memory requirement. Because of this increased memory requirement, Classical can be used to solve problems up to 40% bigger than those that can be solved by Transpose on any computer with a fixed memory size. For smaller instances that can be solved by both algorithms, we expect Transpose to take less time. In this section, we propose an alternative cache-efficient algorithm, ByRow, that carries no memory penalty. In our cache model, ByRow incurs the same number of cache misses as Transpose.

The algorithm ByRow computes the \(H_{i,j}\)s by rows bottom-to-top and, within a row, left-to-right. This is illustrated in Fig. 3. Figure 3a shows the situation after the 4 bottommost rows of H have been computed. The computation of the next row (i.e., row 5 from the bottom in our example) is done in two stages. Note that the first two elements of each row are 0 by definition, so only elements 3 onward need to be computed. In the first stage, every \(H_{i,j}\), \(j > i+1\), on the row being computed is initialized from \(H_{i+1,j-1}\). The memory access pattern for this stage is shown in Fig. 3b. The second stage comprises many sub-stages. In a sub-stage, all \(H_{i,j}\)s in row i are updated using the sums \(H_{i,k} + H_{k+1,j}\) for a single k. In the first sub-stage, we use \(H_{i,i}\) and \(H_{i+1,j}\) to update \(H_{i,j}\), \(j > i+1\) (see Fig. 3c). In the next sub-stage, we use \(H_{i,i+1}\) and \(H_{i+2,j}\) to update \(H_{i,j}\), \(j > i+1\), and so on. Algorithm 2 gives the details; a code sketch follows Fig. 3.

Fig. 3

Memory access pattern for ByRow algorithm (Algorithm 2)
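
The following C sketch is our illustrative reading of Algorithm 2 (the published code may differ). It assumes \(H_{i,i}\) and \(H_{i,i+1}\) have been preset to 0 for all i, uses match() for \(c(a_{i}, a_{j})\), and folds the \(c(a_{i}, a_{j})\) term of Eq. (6) into stage 1 so that the sub-stages apply only split-point updates.

```c
/* Illustrative sketch of ByRow: rows are computed bottom-to-top and,
   within a row, every array access moves left to right with stride 1.
   Assumes H[i][i] = H[i][i+1] = 0 have been preset for all i. */
void by_row(int n, int **H, const char *a) {
    for (int i = n - 3; i >= 0; i--) {
        /* stage 1: initialize row i with the diagonal term of Eq. (6) */
        for (int j = i + 2; j < n; j++)
            H[i][j] = H[i + 1][j - 1] + match(a[i], a[j]);
        /* stage 2: one sub-stage per split point k; increasing k
           guarantees H[i][k] is final by the time it is read */
        for (int k = i; k < n - 1; k++) {
            int hik = H[i][k];                /* constant in sub-stage */
            int j0 = (k + 1 > i + 2) ? k + 1 : i + 2;
            for (int j = j0; j < n; j++) {
                int t = hik + H[k + 1][j];
                if (t > H[i][j]) H[i][j] = t;
            }
        }
    }
}
```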

It is easy to see that ByRow takes \(O(n^{3})\) time and that its memory requirement is the same as that of Classical and about half that of Transpose. For the cache miss analysis, we see that each element initialized in stage 1 incurs an average of \(1/w\) read misses and \(1/w\) write misses. So, this stage contributes \(O(n^{2})\) to the overall cache miss count. For the second stage, the total number of read misses for the first term of \(H_{i,k} + H_{k+1,j}\) over all sub-stages is \(O(n^{2}/w)\) and that for the second term is \((n^{3}/6)(1/w)\) (plus low order terms). Additionally, there are \((n^{3}/6)(1/w)\) (plus low order terms) read misses for \(H_{i,j}\). So, the total number of misses is \((n^{3}/6)(2/w)\) (plus low order terms).

The algorithm ByRowSegment reduces this count by computing the elements in each row of H in segments of size no larger than the capacity of our cache. The segments in a row are computed from left to right (a sketch is given below). When the segment size is s, the number of read misses for \(H_{i,k}\) becomes \((n^{3}/6)(1/s)\). The misses for \(H_{k+1,j}\) remain \((n^{3}/6)(1/w)\). So, the total number of misses is further reduced to \((n^{3}/6)(1/s + 1/w)\).
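
In code, only stage 2 of ByRow changes; the fragment below is an illustrative sketch in which SEG is a tuning parameter we introduce, chosen so that a segment of row i stays cache-resident across all split points k.

```c
/* Illustrative segmented stage 2 of ByRowSegment: row i is finished
   one segment [lo, hi) at a time, left to right.  When we read
   H[i][k], it is already final: k either lies in an earlier (finished)
   segment or was completed by earlier sub-stages of this segment. */
#define SEG 4096                      /* illustrative segment size */
for (int lo = i + 2; lo < n; lo += SEG) {
    int hi = (lo + SEG < n) ? lo + SEG : n;
    for (int k = i; k < hi - 1; k++) {
        int hik = H[i][k];
        int j0 = (k + 1 > lo) ? k + 1 : lo;
        for (int j = j0; j < hi; j++) {
            int t = hik + H[k + 1][j];
            if (t > H[i][j]) H[i][j] = t;
        }
    }
}
```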

ByBox RNA folding algorithm

In the ByBox algorithm, we partition H into boxes and compute these boxes in an appropriate order. For the partitioning, we first divide the rows of H into strips of p rows each, from bottom to top (Fig. 4a). Note that the topmost strip may have fewer than p rows. Next, each strip is partitioned into a triangular box and multiple rectangular boxes (Fig. 4b). The width of the first box is p, that of all but the last of the remaining boxes is q, and that of the last is ≤ q. Observe that the first box in a strip is a \(p \times p\) triangle (the height of the triangle in the topmost strip may be less than p), the last box in a strip is a rectangle of width at most q (again, the height in the top strip may be less than p), and the remaining boxes are \(p \times q\) boxes (again, the height may be less in the top strip).

Fig. 4

Partitioning H into boxes

The elements in triangular boxes are computed using ByRow. These triangular boxes may be computed in any order. The rectangular boxes are computed by strips bottom-to-top and, within a strip, from left to right. Let T denote the rectangular box to be computed next (Fig. 5a). Because of the order in which rectangular boxes are computed, all H values to the left of T and below it have already been computed. Let \(L_{0}, L_{1}, \cdots, L_{k-1}\) be the boxes to the left of T. Note that \(L_{0}\) is a triangular box. Partition the Hs below T into \(q \times q\) boxes \(B_{1}, B_{2}, \cdots, B_{k-1}\) plus a last triangular box \(B_{k}\) whose width is q (Fig. 5b).

Fig. 5

Boxes in the computation of the rectangular box T

To compute T, we first consider the pairs of rectangular boxes \((L_{i}, B_{i})\), \(1 \le i < k\). When a pair \((L_{i}, B_{i})\) is considered, we update all Hs in the box T that depend on values in this pair of boxes. To complete the computation of the Hs in box T, we read in the triangular boxes \(L_{0}\) and \(B_{k}\) and update all Hs in T, moving up the rows of T and, within a row of T, from left to right (Algorithm 3). A sketch of the update for one pair of boxes is given below.
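
The following C sketch shows the update of T for one pair \((L_{i}, B_{i})\); it is our illustrative reading of this step of Algorithm 3, and the coordinate parameters (r0, c0, kL) are names we introduce. T occupies rows r0..r0+p−1 and columns c0..c0+q−1 of H, and \(L_{i}\) covers columns kL..kL+q−1 of the same rows, so the accesses H[k + 1][j] fall inside the \(q \times q\) box \(B_{i}\).

```c
/* Illustrative update of box T for one pair (L_i, B_i): each element
   of T receives q candidate sums while L_i and B_i remain cached.
   T = H[r0..r0+p-1][c0..c0+q-1]; B_i = H[kL+1..kL+q][c0..c0+q-1]. */
void update_box_T(int **H, int r0, int c0, int p, int q, int kL) {
    for (int r = r0; r < r0 + p; r++)             /* rows of T, L_i   */
        for (int k = kL; k < kL + q; k++) {       /* splits in L_i    */
            int hrk = H[r][k];                    /* element of L_i   */
            for (int j = c0; j < c0 + q; j++) {   /* row of B_i and T */
                int t = hrk + H[k + 1][j];
                if (t > H[r][j]) H[r][j] = t;
            }
        }
}
```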

The time and memory required by algorithm ByBox are the same as for Classical and ByRow. For the cache miss analysis, assume that we have enough cache to hold one pair \((L_{i}, B_{i})\) as well as the box T. Loading \(L_{i}\) and \(B_{i}\) into cache incurs \(pq/w\) misses for \(L_{i}\) and \(q^{2}/w\) for \(B_{i}\). The number of \(H_{i,k} + H_{k+1,j}\) computations we can then do for each H in T without additional misses is q. So, with \((p+q)q/w\) cache misses we can do \(pq^{2}\) sum computations, or an average of \((p+q)q/(wpq^{2}) = (p+q)/(wpq)\) misses per computation. Therefore, to do all \(n^{3}/6\) required computations we incur \((n^{3}/6)(p+q)/(wpq)\) cache misses. The misses attributable to the remaining terms in Nussinov’s equations as well as to writes of H are \(O(n^{2})\) and may be ignored.

When q = w, the cache miss count for ByBox becomes \((n^{3}/6)(1/w^{2} + 1/(wp))\), which is quite a bit less than that for our other algorithms.
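
For a rough sense of scale (with illustrative parameter values, not measured ones), take w = 16 and p = 1024 with q = w. The per-sum miss counts then compare as

$$ \frac{1 + 1/w}{1/w^{2} + 1/(wp)} = \frac{17/16}{65/16384} \approx 268, \qquad \frac{2/w}{1/w^{2} + 1/(wp)} = \frac{2048}{65} \approx 31.5, $$

so, in our model, ByBox incurs roughly 268 times fewer misses than Classical and roughly 31 times fewer than Transpose.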

When p = 1, ByBox has much similarity to ByRowSegment. However, ByBox needs sufficient cache for a \(q \times q\) box \(B_{i}\), so \(q \le \sqrt{s}\), where s is the largest segment size that can be accommodated in cache. The miss count for ByBox is then \((n^{3}/6)(p+q)/(wpq) = (n^{3}/6)(1 + 1/\sqrt{s})(1/w)\), which is more than that for ByRowSegment when \(w < \sqrt{s}\).

Practical considerations

We make the following observations regarding our expectations for the performance of the various Nussinov’s algorithms described in this section:

  1.

    We have used a very simple one-level cache model for our analyses and also assumed an LRU replacement strategy. Modern computers have two or three levels of cache and employ more sophisticated cache replacement strategies. So, our analyses are at best a crude approximation of actual cache misses.

  2.

    Modern computers employ sophisticated hardware and software methods for cache miss prediction and prefetch data based on this prediction. To the extent these methods are successful in accurately predicting the need for data sufficiently in advance, the latency due to cache misses can be masked. As a result, observed run times may not be indicative of cache misses.

  3.

    In practice, the maximum n will be small enough that many of the cache misses counted in our analyses will not actually occur. For example, in the ByRow algorithm, the lowest level cache will usually be large enough to hold a row of H. This expectation comes from the observation that when n = 100,000 (say), we would need more than \(2 \times 10^{10}\) bytes of main memory to hold the upper triangle of H (assuming 4 bytes per element) but only 400,000 bytes of cache to hold a row of H. As a result, the cache misses for \(H_{i,j}\) will be \(O(n^{2})\) rather than \(O(n^{3})\). Similarly, for ByRowSegment, s = n. So, in practice, we expect ByRow and ByRowSegment to have the same performance.

  4.

    In ByBox, using a q as small as w is not expected to result in a speedup because of the overheads involved in this algorithm. In practice, we wish to use large, nearly square boxes such that \(L_{i}\), \(B_{i}\), and T fit in cache. When the size of the lowest level cache is sufficient for \(3 \times 2^{20}\) elements (say), we could set p = q = 1024, as the arithmetic following this list shows.
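
For this illustrative setting, the three boxes together occupy exactly the assumed cache capacity:

$$ |L_{i}| + |B_{i}| + |T| = pq + q^{2} + pq = 3 \times 1024^{2} = 3 \times 2^{20} \ \text{elements}. $$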

Results

Experimental platform and test data

We implemented the Classical, Transpose, ByRow, and ByBox RNA folding algorithms in two programming languages – C and Java. For the data set sizes used in our experiments, ByRow and ByRowSegment are identical, as a row fits into cache and the segment size equals the row size. Consequently, we did not experiment with ByRowSegment. For all but Transpose, we conducted preliminary tests benchmarking 3 different implementations, as below:

  1.

    H is a classical n×n array.

  2.

    The upper triangle of H is mapped into a 1D array of size n(n+1)/2 in row-major order [16].

  3.

    H is a 2D array with variable-size rows. The first row has n entries, the next has n−1, the next has n−2, ⋯, and the last has 1 entry. Such an array may be dynamically allocated as in [16]; a sketch follows this list.
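
A minimal C sketch of this third layout follows (illustrative; the identifiers are ours and error handling is omitted). Entry \(H_{i,j}\) of the upper triangle is accessed as H[i][j - i].

```c
#include <stdlib.h>

/* Allocate a 2D array with variable-size rows holding only the upper
   triangle of H: row i has n - i entries, n(n+1)/2 elements in all. */
int **alloc_triangle(int n) {
    int **H = malloc(n * sizeof *H);
    for (int i = 0; i < n; i++)
        H[i] = calloc(n - i, sizeof **H);   /* row i: n - i entries */
    return H;
}

void free_triangle(int **H, int n) {
    for (int i = 0; i < n; i++) free(H[i]);
    free(H);
}
```

With this layout, the loops shown earlier index H[i][j - i] wherever they would index H[i][j] in the full-array version.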

The last two of these implementations take about half the memory taken by Transpose and the first implementation. Our preliminary benchmarking showed that, in C, the last implementation is faster than the other two, while in Java the first implementation is the fastest and the third the next fastest. More specifically, the third implementation takes between 1% and 4% less time than the first in C and approximately 10% more time than the first in Java. The performance results reported in this section are for the third implementation, except in the case of the smaller Java tests, for which we had sufficient memory to use implementation 1. In other words, the reported performance results are for the fastest of the three possible implementations of Classical, ByRow, and ByBox. For Transpose, the standard 2D array implementation is used, as this algorithm uses the entire \(n \times n\) array.

The following platforms were used to compile and execute the codes.

  1.

    Xeon E5-2603 v2 Quad Core processor, 1.8 GHz, with 10 MB cache. On this platform, the C codes were compiled using gcc version 5.2.1 with the O2 option and the Java codes were compiled using javac version 1.8.0_72.

  2.

    AMD Athlon 64 X2 5600+ 2.9 GHz with 512 KB LLC cache. The C codes were compiled using gcc version 4.9.2 with the O2 option and the Java codes were compiled using javac version 1.8.0_73.

  3.

    Intel I7-x980 3.33 GHz CPU with 12 MB LLC cache. The C codes were compiled using gcc 4.8.4 with the O2 option and the Java codes were compiled using javac 1.8.0_77.

  4.

    PowerPC A2 processor (IBM Blue Gene Q), 1.33 GHz, 64-bit, with 32 MB LLC cache. On this platform, the C codes were compiled using mpixlc (IBM XL C/C++ for Blue Gene, Version 12.01). The Java codes were not run on this platform.

Our Xeon platform had tools to measure cache misses and energy consumption. So, for this platform, we report cache misses and energy consumption as well as run time. On this platform, we used the “perf” [17] software to measure energy usage through the RAPL interface. For the PowerPC A2 (Blue Gene Q) platform, the MonEQ software [18, 19] was used to measure the power usage every half second and compute the actual energy consumption. For the remaining 2 platforms (AMD and Intel I7), we were able to determine only the run time, as we did not have the tools available to measure cache misses and energy.

For test data, we used randomly generated RNA sequences as well as real RNA sequences obtained from the National Center for Biotechnology Information (NCBI) database [20].

C Implementations

Xeon E5-2603

Figure 6 and Table 1 give the run times of our various algorithms for our random data sets on our Xeon platform for sequence sizes between 4000 and 40,000. Figure 7 and Table 2 do this for sample real RNA sequences from [20]. In both figures, the time is in seconds, while in both tables, the time is given in the format hh:mm:ss. We did not measure the time required by Classical for n > 28,000, as this algorithm took almost 6 hours for n = 28,000. The column labeled RvsC (BvsC) in Tables 1 and 2 gives the run time reduction achieved by ByRow (ByBox) relative to Classical. Similarly, RvsT and BvsT give the reductions relative to Transpose. As can be seen, on our Xeon platform, ByRow performs better than the Classical and Transpose algorithms, and ByBox outperforms the other three algorithms. On the randomly generated data set, the ByRow algorithm reduces run time by up to 89.13% compared to the original Classical Nussinov algorithm and by up to 35.18% compared to the cache-efficient Transpose algorithm of Li et al. [6]. The corresponding reductions for ByBox are up to 91.26% and 56.31%. On the real RNA sequences, the ByRow algorithm reduces run time by up to 90.38% and 35.19% compared to Classical and Transpose, respectively. The corresponding reductions for ByBox are up to 91.93% and 56.58%.

Fig. 6

Run time, in seconds, for random sequences on Xeon E5 platform

Fig. 7

Run time, in seconds, for RNA sequences from [20] on Xeon E5 platform

Table 1 Run time (HH:mm:ss) for random sequences on Xeon E5 platform
Table 2 Run time (HH:mm:ss) for real RNA sequences of [20] on Xeon E5 platform

Since the results for randomly generated RNA sequences are comparable to those for similarly sized sequences from the NCBI database [20], in the rest of the paper we present results only for randomly generated sequences.

Figure 8 and Table 3 give the number of cache misses on our Xeon platform. ByBox reduces cache misses by up to 99.8% relative to Classical and by up to 99.3% relative to Transpose. The corresponding reductions for ByRow are 96.6% and 85.9%. The very significant reduction in cache misses is expected, given the cache miss analysis done using our simple cache model. The reduction in run time, while significant, is not as large as the reduction in cache misses, possibly due to the effect of cache prefetching, which reduces cache-induced computational delays.

Fig. 8

Cache Misses, in billions, for random sequences on Xeon E5 platform

Table 3 Cache misses, in millions, for random sequences on Xeon E5 server

Figure 9 and Table 4 give the CPU and cache energy consumption, in joules, on our Xeon platform. On our datasets, ByBox required up to 88.77% less CPU and cache energy than Classical and up to 57.76% less than Transpose. It is interesting to note that the energy reduction is comparable to the reduction in run time, suggesting a close relationship between run time and energy consumption for this application.

Fig. 9

CPU and cache energy consumption, in thousands of joules, for random sequences on Xeon E5 platform

Table 4 CPU and cache energy consumption, in joules, for random sequences on Xeon E5 server

AMD Athlon 64

Figure 10 and Table 5 give the run times on our AMD platform. The Classical algorithm took over 9 hours for n=16,000. As a result, we did not measure the run time of this algorithm for larger values of n. ByBox is faster than ByRow and both are substantially faster than Classical and Transpose. ByBox reduced run time by up to 97.16% compared to Classical and by up to 39.55% compared to Transpose. The reductions achieved by ByRow relative to Classical and Transpose were up to 96.08% and up to 18.33%, respectively.

Fig. 10

Run time, in seconds, for random sequences on AMD Athlon 64 server

Table 5 Run time, in HH:mm:ss, for random sequences on AMD Athlon 64 platform

Intel I7

Figure 11 and Table 6 give the run times on our Intel I7 platform. Once again, we were unable to run Classical on our larger data sets (this time, n>28,000) because of the excessive time required by this algorithm on these larger data sets. As was the case for our Xeon and AMD platforms, the algorithms are ranked ByBox, ByRow, Transpose, Classical, fastest to slowest. The run time reduction achieved by ByBox is up to 93.70% relative to Classical and up to 51.92% relative to Transpose. ByRow is up to 89.19% faster than Classical and up to 15.62% faster than Transpose.

Fig. 11

Run time, in seconds, for random sequences on Intel I7 platform

Table 6 Run time, in HH:mm:ss, for random sequences on Intel I7 platform

Power PC A2

Figure 12 and Table 7 give the run times on our Power PC A2 platform. On this platform, we were able to run Classical only for n≤8000 and the remaining algorithms only for n≤15,000 because of the excessive time required by our algorithms on larger instances. The speed ranking of our algorithms is consistent with that on our other 3 platforms: fastest to slowest, the ranking is ByBox, ByRow, Transpose, Classical. ByBox is up to 87.74% faster than Classical and up to 33.43% faster than Transpose, while ByRow is up to 84.18% faster than Classical and up to 14.68% faster than Transpose.

Fig. 12

Run time, in seconds, for random sequences on the Power PC A2 platform

Table 7 Run time, in HH:mm:ss, for random sequences on the Power PC A2 platform

Table 8 gives the energy consumption, in joules, on our Power PC platform. As on the other platforms, the energy reduction achieved by our cache-efficient algorithms tracked run time quite closely. For example, ByBox, which was almost always faster than Transpose, also almost always used less energy. ByBox reduced energy consumption by up to 87.59% relative to Classical and by up to 40.31% relative to Transpose, and ByRow by up to 82.6% and 16.7% relative to Classical and Transpose, respectively.

Table 8 Energy consumption, in joules, for random sequences on the Power PC A2

Java implementations

Figures 13, 14 and 15 and Tables 9, 10 and 11 give the run times of our Java implementations on our Xeon, AMD, and Intel platforms.

Fig. 13

Run time, in seconds, for random sequences using our Java implementations on our Xeon E5 platform

Fig. 14

Run time, in seconds, for random sequences using our Java implementations on our AMD platform

Fig. 15

Run time, in seconds, for random sequences using our Java implementations on our Intel I7 platform

Table 9 Run time, in HH:mm:ss, for random sequences using our Java implementations on our Xeon E5 platform
Table 10 Run time, in HH:mm:ss, for random sequences using our Java implementations on our AMD platform
Table 11 Run time, in HH:mm:ss, for random sequences using our Java implementations on our Intel I7 platform

The Java implementations take substantially more time and memory than do the C implementations. Because of memory limitations, Transpose could not be run on our AMD and Intel platforms for n≥16,000 and n≥24,000, respectively. Because of time requirements, we did not experiment with n>28,000 for any algorithm on any platform. The speed ranking, fastest to slowest, of the Java implementations is ByBox, Transpose, ByRow, Classical. The Java implementation of ByBox was up to 88.9% faster than the Java implementation of Classical on our Xeon platform, up to 98.3% faster on the AMD, and up to 88.5% faster on the Intel I7. The corresponding speedups relative to the Java implementation of Transpose were 75.2%, 64.6%, and 69.7%.

We observe that the run time of ByRow was generally more than that of Transpose on all of our platforms. We suspect this is because our Java code for ByRow makes more accesses to array elements than does our Java code for Transpose. Array accesses are expensive in Java, as array indexes are checked for validity whenever an attempt is made to access an array element (we note that C does not perform such a check). Although some Java compilers eliminate this check when they can assert that there will be no violation of array bounds, their ability to make this assertion is both variable and limited. In the case of Transpose, our code reduces the number of array accesses significantly by copying an array element that is to be used many times into a simple variable and then referring to this simple variable in reuses of the element. This reduction strategy could not be employed in the code for ByRow. As a result of the increased array bounds checking done in our Java code for ByRow relative to that done in our Java code for Transpose, the former is often slower.

Discussion and conclusions

We have proposed three cache-efficient algorithms, ByRow, ByRowSegment, and ByBox, for RNA folding using Nussinov’s dynamic programming equations. Their cache miss efficiency was analyzed using a simple cache model. Although this simple cache model does not accurately reflect the cache architecture of modern computers, it is useful for an initial assessment of cache performance, as the model encourages the design of algorithms with good spatial locality, and good spatial locality results in better cache performance on virtually all cache architectures.

Our algorithms were benchmarked against the classical implementation, Classical, of Nussinov’s equations as well as the cache-efficient implementation Transpose proposed by Li et al. [6]. The benchmarking was done using four different computational platforms (Xeon E5, AMD Athlon 64, Intel I7, Power PC A2) and two programming languages (C and Java). For the benchmarking, we excluded ByRowSegment, as, for the dataset sizes we could handle on our test platforms, ByRow and ByRowSegment are identical. Our benchmarking shows that, depending on the computational platform and programming language, either ByRow or ByBox gives the best run time and energy performance. In fact, the C versions of these algorithms reduce run time by as much as 97.2% and energy consumption by as much as 88.8% relative to Classical, and by as much as 56.3% and 57.8%, respectively, relative to Transpose. The Java versions reduce run time by as much as 98.3% relative to Classical and by as much as 75.2% relative to Transpose.

The algorithms ByRow, ByRowSegment, ByBox, and Classical require about half as much memory as does Transpose. While run time becomes a limiting factor more often than memory, in our Java experiments, we were unable to run Transpose on our larger data sets on our AMD and Intel I7 platforms because of insufficient memory.