BEAGLE is a high-performance likelihood-calculation platform for phylogenetic applications. BEAGLE defines a uniform application programming interface (API) and includes a collection of efficient implementations for evaluating likelihoods under a wide range of evolutionary models, on graphics processing units (GPUs) as well as on multicore central processing units (CPUs). The BEAGLE library can be installed as a shared resource that can be used by any phylogenetic reconstruction software that supports it. This approach allows developers of phylogenetic software to share optimizations of the core calculations, and any program that uses BEAGLE automatically benefits from improvements to the library. For researchers, this centralization provides a single installation through which to take advantage of new hardware and parallelization techniques.
The BEAGLE project has been very successful in bringing hardware acceleration to phylogenetics. The library has been integrated into popular phylogenetics software including BEAST, MrBayes, PhyML, and GARLI, and has been widely used across a diverse range of evolutionary studies. The BEAGLE library is free, open-source software licensed under the Lesser GPL and available at https://beagle-dev.github.io.
2.1.1 Computing Observed Data Likelihoods
The most effective methods for phylogenetic inference involve computing the probability of observed character data for a set of taxa given an evolutionary model and phylogenetic tree, which is often referred to as the (observed data) likelihood of that tree. Felsenstein described an algorithm to calculate this probability, which recursively computes partial likelihoods via simple sums and products. Each partial likelihood gives the probability of the observed data at the tips descended from an internal node, conditional on a particular state at that node.
The partial likelihood calculation applies to a subtree comprising a parent node, two child nodes, and the connecting branches. It is repeated for each unique site pattern in the data (in the form of a multiple sequence alignment), for each possible character of the state space (e.g., nucleotide, amino acid, or codon), and for each internal node in the proposed tree. The computational complexity of the likelihood calculation for a given tree is O(p × s² × n), where p is the number of unique site patterns in the alignment (typically on the order of 10²–10⁶), s is the number of states each character in the sequence can assume (typically 4 for a nucleotide model, 20 for an amino-acid model, or 61 for a codon model), and n is the number of operational taxonomic units (e.g., species and alleles).
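To make the core recursion concrete, the following minimal serial sketch computes the partial likelihood array of a parent node from its two children for a 4-state (nucleotide) model. The array layout and function names are ours, for illustration only; they do not reflect BEAGLE's internal format.

    /* One pruning step: fill the parent's partial likelihood array from its
     * two children.  Partials are stored as array[pattern * STATES + state];
     * P1 and P2 are the row-major transition probability matrices for the
     * branches leading to child 1 and child 2. */
    #define STATES 4                                   /* nucleotide model */

    void update_partials(const double *child1, const double *child2,
                         const double *P1, const double *P2,
                         double *parent, int patterns)
    {
        for (int k = 0; k < patterns; ++k) {           /* each site pattern */
            for (int i = 0; i < STATES; ++i) {         /* each parent state */
                double sum1 = 0.0, sum2 = 0.0;
                for (int j = 0; j < STATES; ++j) {     /* each child state  */
                    sum1 += P1[i * STATES + j] * child1[k * STATES + j];
                    sum2 += P2[i * STATES + j] * child2[k * STATES + j];
                }
                /* probability of the observed data below the parent,
                 * conditional on the parent being in state i */
                parent[k * STATES + i] = sum1 * sum2;
            }
        }
    }

Each such call performs O(p × s²) operations, and repeating it for every internal node of the tree yields the O(p × s² × n) total given above.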
Additionally, the tree space is very large; the number of possible unrooted topologies for n operational taxonomic units is given by the double factorial (2n − 5)!!. Thus, exploring even a fraction of the tree space requires evaluating a very large number of topologies, and hence performing an enormous number of likelihood calculations. This leads to analyses that can take days, weeks, or even months to run. Compounding the issue, rapid advances in the collection of DNA sequence data have made computation, rather than data acquisition, the increasingly limiting factor in extracting biological understanding from these data. For phylogenetic inference, the computational bottleneck is most often the calculation of likelihoods on a tree; hence, speeding up the likelihood function is key to improving the performance of these analyses.
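The growth of (2n − 5)!! is easy to underestimate; the short program below (our own illustration) tabulates the number of unrooted topologies for modest numbers of taxa.

    /* Number of unrooted, bifurcating topologies for n taxa: (2n - 5)!!.
     * The count grows so quickly that even a double-precision value
     * overflows once n reaches the low hundreds. */
    #include <stdio.h>

    int main(void)
    {
        double topologies = 1.0;                /* n = 3: a single topology */
        for (int n = 3; n <= 55; ++n) {
            if (n > 3)
                topologies *= 2 * n - 5;        /* next factor of the double factorial */
            printf("n = %2d  unrooted topologies = %.3e\n", n, topologies);
        }
        return 0;
    }

Already at n = 55 the count exceeds 10⁸⁰, so exhaustive evaluation is out of the question and heuristic search or sampling over tree space is required.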
2.1.2 Parallel Computation
Advances in computer hardware, specifically in parallel architectures such as many-core GPUs, multicore CPUs, and CPU vector instruction sets (e.g., SSE and AVX), have created opportunities for new approaches to computationally intensive methods. The structure of the likelihood calculation, which involves large numbers of site patterns and multiple states, among other characteristics, makes it a very appealing computational fit for these modern parallel processors, especially GPUs.
BEAGLE exploits GPUs via fine-grained parallelization of functions necessary for computing the likelihood on a (phylogenetic) tree. Phylogenetic inference programs typically explore tree space in a sequential manner (Fig. 1, tree space) or with only a small number of sampling chains, offering limited opportunity for task-level parallelization. In contrast, the crucial computation of partial likelihood arrays at each node of a proposed tree presents an excellent opportunity for fine-grained data parallelism, which GPUs are especially suited for. The use of many lightweight execution threads incurs very low overhead on GPUs, enabling efficient parallelism at this level.
In order to calculate the overall likelihood of a proposed tree, phylogenetic inference programs perform a post-order traversal, evaluating a partial likelihood array at each node. When using BEAGLE, the evaluation of these multidimensional arrays is offloaded to the library. While each partial likelihood array is still evaluated in sequence, BEAGLE assigns the calculation of the array entries to separate GPU threads, for computation in parallel (Fig. 1, partial likelihood). Further, BEAGLE uses GPUs to parallelize other functions necessary for computing the overall tree likelihood, thus minimizing data transfers between the CPU and GPU. These additional functions include those necessary for computing branch transition probabilities, for integrating root and edge likelihoods, and for summing site likelihoods.
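The thread mapping can be illustrated with the simplified CUDA kernel below, in which one GPU thread computes one (pattern, parent-state) entry of a parent's partial likelihood array for a single pruning operation. This is our own sketch of the strategy, not BEAGLE's actual kernel code.

    /* One pruning operation on the GPU: one thread per entry of the parent's
     * partial likelihood array.  All pointers refer to device memory, so no
     * host-device transfer is needed between successive node operations. */
    #define STATES 4

    __global__ void update_partials_kernel(const double *child1,
                                           const double *child2,
                                           const double *P1, const double *P2,
                                           double *parent, int patterns)
    {
        int entry = blockIdx.x * blockDim.x + threadIdx.x;  /* global entry index */
        if (entry >= patterns * STATES) return;

        int k = entry / STATES;       /* site pattern handled by this thread */
        int i = entry % STATES;       /* parent state handled by this thread */

        double sum1 = 0.0, sum2 = 0.0;
        for (int j = 0; j < STATES; ++j) {
            sum1 += P1[i * STATES + j] * child1[k * STATES + j];
            sum2 += P2[i * STATES + j] * child2[k * STATES + j];
        }
        parent[entry] = sum1 * sum2;
    }

    /* Host-side launch for one node of the post-order traversal:
     *     int threads = 128;
     *     int blocks  = (patterns * STATES + threads - 1) / threads;
     *     update_partials_kernel<<<blocks, threads>>>(d_child1, d_child2,
     *                                                 d_P1, d_P2,
     *                                                 d_parent, patterns);
     */

With, say, 100,000 nucleotide patterns this exposes 400,000 independent entries per node operation, ample fine-grained work for the many lightweight threads a GPU keeps in flight.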
Multicore CPU parallelization through BEAGLE is currently achieved only by running multiple instances of the library, with each instance computing a different data partition. Multiple CPU threads can be used (e.g., one per partition) when the application program (BEAST, for the remainder of this chapter) creates its BEAGLE instances in separate computation threads, as BEAST does. This approach suits the trend toward increasingly large molecular sequence data sets, which are often heavily partitioned in order to better model the underlying evolutionary processes. BEAGLE itself does not perform any load balancing, nor does it distribute site columns across threads; on CPUs, each BEAGLE instance parallelizes computation only via SSE vectorization. A schematic of this one-instance-per-partition pattern is shown below.
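In the sketch that follows, compute_partition_log_likelihood() is a hypothetical stand-in for the work a client program drives through a single BEAGLE instance; it is not a BEAGLE API call. The point is simply that the threading is owned by the client, with one library instance per data partition.

    /* Client-side pattern: one worker thread per data partition, each driving
     * its own library instance.  The per-partition computation is stubbed out. */
    #include <pthread.h>
    #include <stdio.h>

    #define PARTITIONS 4

    typedef struct {
        int    partition;        /* index of the data subset / library instance */
        double log_likelihood;   /* result produced for this partition          */
    } PartitionTask;

    /* Hypothetical stand-in for per-instance work (not a BEAGLE function). */
    static double compute_partition_log_likelihood(int partition)
    {
        return -1000.0 * (partition + 1);           /* placeholder value */
    }

    static void *worker(void *arg)
    {
        PartitionTask *task = (PartitionTask *)arg;
        task->log_likelihood = compute_partition_log_likelihood(task->partition);
        return NULL;
    }

    int main(void)
    {
        pthread_t     threads[PARTITIONS];
        PartitionTask tasks[PARTITIONS];
        double        total = 0.0;

        for (int p = 0; p < PARTITIONS; ++p) {
            tasks[p].partition = p;
            pthread_create(&threads[p], NULL, worker, &tasks[p]);
        }
        for (int p = 0; p < PARTITIONS; ++p) {
            pthread_join(threads[p], NULL);
            total += tasks[p].log_likelihood;       /* partition results sum */
        }
        printf("total log likelihood = %f\n", total);
        return 0;
    }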
BEAGLE can also use GPUs to perform partitioned analyses; however, when individual data subsets are too small to saturate the capacity of one device, efficient computation requires multiple GPUs. Recent progress has been made in parallelizing the computation of multiple data subsets on one GPU, and future releases of BEAGLE will include this capability.
The general structure of the BEAGLE library can be conceptualized as layers (Fig. 2, library), the uppermost of which is the application programming interface. Underlying this API is an implementation management layer, which loads the available implementations, makes them available to the client program, and passes API commands to the selected implementation.
The design of BEAGLE allows for new implementations to be developed without the need to alter the core library code or how client programs interface with the library. This architecture also includes a plugin system, which allows implementation-specific code (via shared libraries) to be loaded at runtime when the required dependencies are present. Consequently, new frameworks and hardware platforms can more easily be made available to programs that use the library, and ultimately to users performing phylogenetic analyses.
Currently, the implementations in BEAGLE derive from two general models. One is a serial CPU implementation model, which does not directly use external frameworks. Under this model, there is a standard CPU implementation and one with added SSE intrinsics, which uses vector processing extensions present in many CPUs to parallelize computation across character state values (see the sketch below). The other is an explicit parallel accelerator programming model, which uses the CUDA computing framework to exploit NVIDIA GPUs. It implements fine-grained parallelism for evaluating likelihoods under arbitrary molecular evolutionary models, harnessing the large number of processing cores to perform these calculations efficiently [3, 32].
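As an illustration of vectorizing across state values, the fragment below uses SSE2 intrinsics to compute the four state-conditional sums of a nucleotide model two at a time. It conveys the idea only; it is not BEAGLE's actual SSE code path.

    /* SSE2 sketch: dest[i] = sum_j P[i][j] * child[j] for a 4-state model,
     * processing two double-precision values per instruction. */
    #include <emmintrin.h>

    static void inner_products_sse(const double *P,      /* 4 x 4, row-major  */
                                   const double *child,  /* 4 child partials  */
                                   double *dest)         /* 4 results         */
    {
        __m128d c01 = _mm_loadu_pd(child);       /* child[0], child[1] */
        __m128d c23 = _mm_loadu_pd(child + 2);   /* child[2], child[3] */

        for (int i = 0; i < 4; ++i) {
            __m128d row01 = _mm_loadu_pd(P + 4 * i);
            __m128d row23 = _mm_loadu_pd(P + 4 * i + 2);
            __m128d prod  = _mm_add_pd(_mm_mul_pd(row01, c01),
                                       _mm_mul_pd(row23, c23));
            /* horizontal add of the two lanes completes the dot product */
            __m128d high  = _mm_unpackhi_pd(prod, prod);
            dest[i] = _mm_cvtsd_f64(_mm_add_pd(prod, high));
        }
    }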
Recent progress has been made in developing new implementations for BEAGLE, beyond those described here, thus expanding the range of hardware that can be used. Upcoming releases of the library will include additional support for CPU parallelism via a multi-threaded implementation and will support the OpenCL standard, enabling the use of AMD GPUs.
2.2.2 Application Programming Interface
The BEAGLE API was designed to increase performance via fine-scale parallelization while reducing data transfer and memory copy overhead to an external hardware accelerator device (e.g., GPU). Client programs, such as BEAST, use the API to offload the evaluation of tree likelihoods to the BEAGLE library (Fig. 2, API). API functions can be subdivided into two categories: those that are executed only once per inference run and those that are repeatedly called as part of the iterative sampling process. As part of the one-time initialization process, client programs use the API to indicate analysis parameters such as tree size and sequence length, as well as to specify the type of evolutionary model and the hardware resource(s) to be used. This allows BEAGLE to allocate the appropriate number and size of data buffers in device memory. Additionally, at this initialization stage, the sequence data are specified and transferred to device memory. This costly memory operation is performed only once, thus minimizing its impact.
During the iterative tree sampling procedure, client programs use the API to specify changes to the evolutionary model and to issue a series of partial likelihood operations that traverse the proposed tree in order to compute its overall likelihood. BEAGLE efficiently computes these operations and makes the overall tree likelihood, as well as per-site likelihoods, available via another API call.
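The separation between the two categories of calls can be illustrated with the self-contained toy below, which evaluates a two-taxon Jukes-Cantor likelihood. The model, data, and function names are our own and do not correspond to BEAGLE calls; the point is the structure: tip data are set once, while transition probabilities and likelihoods are recomputed for every proposed branch length.

    /* Toy analogue of the setup-once / iterate-many structure of the API. */
    #include <math.h>
    #include <stdio.h>

    #define STATES   4
    #define PATTERNS 3

    /* Jukes-Cantor transition probability for branch length t. */
    static double jc_prob(int i, int j, double t)
    {
        double e = exp(-4.0 * t / 3.0);
        return (i == j) ? 0.25 + 0.75 * e : 0.25 - 0.25 * e;
    }

    int main(void)
    {
        /* One-time setup: observed tip states, fixed for the whole run. */
        const int tip1[PATTERNS] = { 0, 2, 3 };   /* A, G, T */
        const int tip2[PATTERNS] = { 0, 1, 3 };   /* A, C, T */

        /* Iterative phase: re-evaluate under each proposed branch length. */
        for (int step = 1; step <= 4; ++step) {
            double t = 0.05 * step;
            double log_likelihood = 0.0;
            for (int k = 0; k < PATTERNS; ++k) {
                double site = 0.0;
                for (int i = 0; i < STATES; ++i)  /* root state, frequency 1/4 */
                    site += 0.25 * jc_prob(i, tip1[k], t) * jc_prob(i, tip2[k], t);
                log_likelihood += log(site);
            }
            printf("branch length %.2f  log likelihood %.4f\n", t, log_likelihood);
        }
        return 0;
    }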
Peak performance with BEAGLE is achieved when using a high-end GPU; however, the relative gain over using a CPU depends on the model type and problem size, as more demanding analyses allow for better utilization of GPU cores. Figure 3 shows speedups relative to serial CPU code when using BEAGLE with an NVIDIA P100 GPU for the critical partial likelihood function, with increasing unique site pattern counts and for two model types. Computing these likelihoods typically accounts for over 90% of the total execution time of phylogenetic inference programs, so the relationship between speedup and problem size observed here largely matches what would be observed for a full analysis.
Figure 3 includes performance results for computing partial likelihoods under both nucleotide and codon models. The vertical axis shows the speedup relative to the average performance of a baseline serial (single-threaded, non-vectorized) CPU implementation. This nonparallel CPU implementation provides a consistent performance level across different problem sizes and is a relevant point of comparison, as most phylogenetic inference software packages use serial code as their standard.
Under a nucleotide model, relative GPU performance over the CPU scales strongly with the number of site patterns. For very small numbers of patterns, the GPU exhibits poor performance because execution overhead is large relative to the overall problem size. GPU performance improves quickly as the number of unique site patterns increases, and by 10,000 patterns it approaches a saturation point, continuing to increase but more slowly. At 100,000 nucleotide patterns, the GPU is approximately 64 times faster than the serial CPU implementation.
For codon-based models, GPU performance is less sensitive to the number of unique site patterns. This is due to the better parallelization opportunity afforded by the 61 biologically meaningful states that a codon can encode: because the per-pattern work grows with the square of the state count, each codon pattern supplies far more parallel arithmetic than a nucleotide pattern, so fewer patterns are needed to keep the GPU fully occupied. The higher state count of codon data compared to nucleotide data also increases the ratio of computation to data transfer, further improving GPU performance for codon-based analyses. For a problem size with 10,000 codon patterns, the GPU is over 256 times faster than the serial CPU implementation.
2.4 Memory Usage
When assessing the suitability of a phylogenetic analysis for GPU acceleration via BEAGLE, it is also important to consider whether the GPU has sufficient on-board memory for the analysis to be performed. GPUs typically have less memory than is available to CPUs, and the high cost of moving data between CPU and GPU memory prevents direct use of CPU memory for GPU acceleration.
Figure 4 shows how much memory is required for problems of different sizes when running nucleotide and codon-model analyses in BEAST with BEAGLE GPU acceleration. Note that when multiple GPUs are available, BEAST can partition a data set into separate BEAGLE instances, one for each GPU. Thus, each GPU will only require as much memory as necessary for the data subset assigned to it. Typical PC-gaming GPUs have 8 GB of memory or less, while GPUs dedicated to high-performance computing, such as the NVIDIA Tesla series, may have as much as 24 GB of memory.
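A rough sense of the dominant memory term, the partial likelihood buffers, can be obtained with the back-of-the-envelope sketch below. The constants used here (four rate categories, one buffer per internal node, double precision) are assumptions for illustration; real BEAGLE instances also allocate buffers for transition matrices, scaling factors, and tip data, so this estimate is a lower bound on the totals shown in Fig. 4.

    /* Rough lower-bound estimate of GPU memory needed for partial likelihood
     * buffers: (internal nodes) x (patterns) x (states) x (rate categories)
     * double-precision values. */
    #include <stdio.h>

    static double partials_gib(long taxa, long patterns, long states, long categories)
    {
        long internal_nodes = taxa - 1;                     /* rooted binary tree */
        double values = (double)internal_nodes * patterns * states * categories;
        return values * sizeof(double) / (1024.0 * 1024.0 * 1024.0);
    }

    int main(void)
    {
        printf("nucleotide, 100 taxa, 1,000,000 patterns: %.1f GiB\n",
               partials_gib(100, 1000000, 4, 4));
        printf("codon,      100 taxa,   100,000 patterns: %.1f GiB\n",
               partials_gib(100, 100000, 61, 4));
        return 0;
    }

Even under these optimistic assumptions, large nucleotide and codon analyses quickly exceed the memory of gaming-class GPUs, which motivates distributing partitions across multiple devices as described above.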
Highly parallel computing technologies such as GPUs have overtaken traditional CPUs in peak performance potential and continue to advance at a faster pace. Additionally, the memory bandwidth available to the processor is especially relevant to data-intensive computations, such as the evaluation of nucleotide model likelihoods. In this measure as well, high-end GPUs significantly outperform equivalently positioned CPUs.
BEAGLE was designed to take advantage of this trend of increasingly advanced GPUs and uses runtime compilation methods to optimize code for whichever generation of hardware is being used. Table 1 lists hardware specifications for the processors used in this chapter. We note that further advancements in the GPU market for scientific computing are on the way, with NVIDIA preparing the launch (at the time of writing) of the Tesla V100 in Q3 of 2017. The new NVIDIA Tesla V100 features a total of 5120 CUDA cores and comes equipped with 32 GB of on-board memory with 900 GB/s of bandwidth. As such, it has the potential to reach 7.5 TFLOPS of double-precision peak performance (DP PP), a roughly 50% increase over the current flagship, the Tesla P100.