Large-scale protein structure comparison is becoming a more and more important approach to providing a better picture for understanding biological systems [1, 2]. Given a database of protein structures, the main goal is either to find proteins that are structurally similar to a given protein (i.e., one-against-all comparison) or to build various connectivity among proteins by performing exhaustive comparisons on the whole database (i.e., all-against-all comparison). The results of structural comparison are useful in discovering potential structural, evolutionary, and functional relationships among these proteins and have significant impact on structure-based drug design [3], protein-protein docking [4], and other biological findings [5]. Recently, the dramatic increase in protein structural data [6] has led to an ever increasing demand for structure alignment tools that can not only find accurate alignments at residue level but also complete large-scale structure comparisons in a reasonable time.

Several approaches have been developed to address the limitations of traditional alignment methods and tackle the computational issues. The traditional alignment methods [1, 2, 7], such as DALI [8], CE [9], TM-align [10], Fr-TM-align [11], and MAMMOTH [12], are based on the comparison of residues or fragments to build initial alignments which are optimized by various procedures, such as Monte-Carlo, combinational search, and dynamic programming. These methods can provide accurate alignments at the residue level but are usually computationally expensive, which makes them infeasible in coping with very large datasets. To accelerate this process, one approach is to map the protein structures into 1D sequences and then use various sequence alignment methods to align two structures [13, 14]. Another approach [15] utilizes a "bag of words" method, which depends on frequency of specific structural patterns, to provide speedy structure match and filtering. These approaches significantly improve efficiency for large datasets; however, this is often achieved at the cost of loss of topological details, which could lead to lower accuracy than the traditional structural comparison methods or could be unsuitable to perform residue-level alignment. Another approach is to parallelize traditional algorithms using a cluster or grid environment consisting of thousands of computing nodes [16, 17]. These approaches can fulfill the desires of efficiency and accuracy but require high-performance computing environments which are energy-consuming and may not be accessible to the biologists.

With the increase in performance and programmability of many-core Graphic Processing Units (GPUs), more and more bioinformatics applications have been deployed on GPUs and have shown promising results in terms of speedup over their conventional CPU implementations. Liu et al. [18] implemented a GPU-based Smith-Waterman algorithm [19] for pair-wise DNA sequence alignment. Later, the efficiency of sequence alignments has been continuously improved in [2023]. Vouzis and Sahinidis developed GPU-BLAST (Basic Local Alignment Search Tool) [24] to accelerate NCBI-BLAST [25]. Hung et al. developed a method for calculating RMSD (Root Mean Square Deviation) after superposition for ATI GPU card [26]. Stivala et al. utilized simulated annealing (SA) to develop a protein substructure searching algorithm, SA Tableau Search, to find structural motif at level of secondary structure element (SSE) [27]. It is worth mentioning that from the literature the SA Tableau Search is the first attempt to apply GPU in protein structure comparison at the SSE level. Other applications include protein-protein docking [28] and statistical phylogenetics [29].

In this paper, we present ppsAlign, a parallel protein structure Alignment framework which is designed and optimized to exhaustively exploit the parallelism of the GPU architecture for residue-level structure comparisons. Our experimental results (reported on a NVIDIA Tesla C2050 GPU card) show that ppsAlign significantly outperforms existing structural alignment tools in computational efficiency.

We believe that GPU's massive parallel computing power can unlock the door to a cost-effective and high-performance computing environment that can be beneficial to the structural biology community.



The framework of ppsAlign is shown in Figure 1. The inputs include a target protein and a protein database Λ = {P1, P2, ...., P n }. The outputs are structure alignments between the target protein and each database protein. The online alignment starts with a generation of some initial sets of matched fragments and corresponding alignments. Then, the initial alignments are extended and refined using Dynamic Programming to obtain the final results. Specifically, the ppsAlign algorithm consists of 5 steps: 1) Index-based matched fragment set (MFS) search is utilized to find the maximal N seed seed MFS' between the target protein and each database protein; 2) Fragment-level alignment is used to assemble the MFS' and generate initial alignments; 3) Residue-level alignment is used to refine the initial alignments to residue alignments; 4) Maximal alignment search is used to find a transformation that can best superimpose the entire target protein over each database protein based on the obtained residue alignments; 5) Final assessment is performed to calculate z-Score and evaluate statistical significance of alignments. Steps 1) and 5) are executed on the CPU core, while steps 2) ~ 4), the most time-consuming parts of ppsAlign, are implemented as GPU kernels and iteratively executed on GPU for N iter times. The GPU kernels are developed using CUDA (Compute Unified Device Architecture) programming model [30]. During the alignment, the protein structures and intermediate results from each GPU kernel are stored in GPU's on-board memory, such as read-only constant memory, read-only texture memory, and read-write global memory. Generally, the constant and texture memory have limited capacity but high access rate compared to the global memory. For an overview of GPU architecture and CUDA model, readers are referred to [30, 31]. To facilitate the search of structurally similar fragments from the protein database, ppsAlign has an off-line component that pre-processes substructures from the entire protein database and builds an indexing tree to allow fast retrievals.

Figure 1
figure 1

Framework of ppsAlign. The framework consists of both GPU- and CPU- based processes. The input includes a target protein and database proteins. The output contains all the structural alignment results between the target protein and each database protein.

Index-based matched fragment set search

The purpose of this CPU-based step is to quickly find all possible matched fragment sets (MFS') between the target protein and each database protein for further refinement based on an information retrieval (IR) approach which goes beyond the capability of the traditional "bag of words" concept by introducing spatial relationships among these fragments. Let Q = q 1 , q 2 , , q L Q and P = p 1 , p 2 , . . . , p L p be a target protein with L Q residues and a database protein with L P residues, respectively. Here, q and p represent 3D coordinates of the C α atoms. A fragment f is a set of L f ( = 8) continuous residues with the direction from N terminal to C terminal along the protein backbone. A MFS includes two non-empty subsets, F Q and F P , which contain an order of fragments that conforms to some criteria of structural similarity between Q and P, respectively. The fragments in a MFS will then be used to generate a rough alignment between Q and P in the fragment-level alignment.

The MFS search utilizes the substructure mapping method of the Index-based Substructure Alignment algorithm [32], developed by the authors, to retrieve similar fragments from the database proteins. In this method, substructures of the database proteins, extracted by a large set of pairs of windows along the backbones, are indexed off-line by an indexing tree in which similar substructures are clustered into same leaf node, denoted by t i Λ , and one substructure is selected as representative for each leaf node. Such representative structures preserve certain topological information, both locally and globally, from two disjoint substructures with various ranges of distances. Similarly, substructures in the target protein Q are indexed by an indexing tree in which each leaf node is denoted by t i Q . The representative substructure of each t i Q is used to search the indexing tree of database and a list of best matched tΛ is returned. For simplicity, we use t to denote t i Q and tΛ. The database proteins that have substructures in tΛ can be found by an inverted index. Such a database protein, P, can be represented by an order of substructures, denoted by Ω t , occurring in t. Likewise, the protein Q can be represented by an order of substructures, denoted by Ω t Q , occurring in t. As substructures identified by the same t are similar, they can be used as "anchors" for rough alignments. For detailed explanation of the substructure mapping method, readers are referred to [32].

In ppsAlign, substructures are further projected into fragments as follows: if any residue of a substructure from Ω t P (or Ω t Q ) is located in a fragment, the fragment is selected and added to F P (or F Q ). The fragment subsets F P and F Q are used to construct a MFS between the protein Q and P. After searching all tQ, we can obtain all possible MFS' between Q and database proteins, if any. In this step, if the algorithm cannot find any MFS for a database protein, all the fragments from Q and the database protein are selected to form a MFS. An example of MFS searching and construction is illustrated in Additional file 1: Figure S1.

After searching MFS, a filtering process is called to remove redundant MFS'. Then, the non-redundant MFS' between Q and each database protein are ranked according to scoring function S MFS and the top N seed sets are selected. The scoring function is defined as follows:

S MFS = w 1 N Q N f Q + w 2 N P N f P + w 3 min N Q, N P max N Q, N P

where N Q and N P denote the cardinality of F Q and F P in a MFS, respectively.

N f Q = L Q / L f a n d N f P = L P / L f are the numbers of fragments in the target protein and a database protein, respectively. The third term of the above scoring function is used to favor MFS' which have comparable N Q and N P . The values w 1 , w 2 , and w 3 are used to weight the contributions from the three terms.

The data needed by ppsAlign in order to compute the alignments on GPU are: structures of the protein Q and of the database proteins, and MFS'. To allow efficient processing, those data must be judiciously laid out on the GPU memories. Specifically, the database structures are transferred to the texture memory before execution. The MFS' are transferred from CPU memory to GPU global memory as inputs to the fragment-level alignment (see Figure 1). Finally, the structure of protein Q is stored in the constant memory, which has smaller capacity but lower access latency compared to the texture memory.

Fragment-level alignment

In this step, the fragments in each MFS are assembled to obtain initial alignments using Dynamic Programming (DP). For a given MFS, the DP algorithm first sorts the fragments from F Q and F P according to their locations in Q and P. Then, it computes the similarity score S f (i, j) of each fragment pair for 1 ≤ iN Q and 1 ≤ jN P using the following recurrence:

s f i , j = max S f i - 1 , j - 1 + S f i , j S f i , j - 1 + G f S f i - 1 , j + G f ,

where G f is gap penalty and S f is based on the inverse cosine distance of fragment's feature vector. Given a fragment pair, A and B, and their corresponding feature vectors D A and D B , S f is calculated as follows:

s f = 1 - cos - 1 D A , D B D A D B

where < D A , D B > is the inner product of D A and D B , ||D A || and ||D B || are the norm of D A and D B , respectively. In the current implementation, features only use Euclidean distance of each residue pair for fast calculation. The main reason for using feature distance as an approximate measure of fragment similarity is the need for simple control paths due to the SIMT (Single Instruction, Multiple Thread) computing mode of the GPU [30]. Traditional methods usually calculate RMSD and find an optimal transformation using the Kabsh algorithm [33], which contains complex control flows and is therefore not suitable for the SIMT mode. This step provides a rough alignment result which will be refined by the residue-level alignment.

GPU computation for fragment-level alignment

The pseudo-code in Figure 2 describes the fragment-level alignment. The algorithm splits the computation into three GPU kernels. The first kernel performs the computation of the fragment scores S f by assigning a database protein to each thread. This kernel performs all-against-all fragment comparisons and writes similarity scores into the GPU global memory.

Figure 2
figure 2

Algorithm of fragment-level alignment. Fragment-level alignment consists of three GPU kernels. The first kernel performs the computation of the fragment scores. The second kernel implements the Dynamic Programming algorithm, and the third one performs the back tracing.

The second GPU kernel implements the DP algorithm, whereas the third one performs back tracing. The total number of threads N T that can run concurrently on the GPU is mainly limited by the global memory capacity of the GPU (in this phase each thread requires approximately 10 kB of memory). Suppose that the total number of MFS between Q and all database proteins is N F . If N F > N T , the overall MFS' will be divided into N batch = ⌈N F /N T ⌉ batches. ppsAlign sequentially schedules each batch to run on GPU. In each batch, the DP is first executed as a GPU kernel and each thread corresponds to a MFS. Then, the GPU kernel for the back tracing is called to obtain alignment paths for each MFS. When a batch terminates, ppsAlign transfers the output (i.e., alignment path for each MFS) from the GPU memory to CPU memory. After aggregating the outputs from all batches, ppsAlign first performs filtering to remove redundant alignments, and then assembles all the fragments along the alignment paths to form residue alignments which will be further refined by the residue-level alignment.

It is critically important to effectively utilize the limited memory resources of the GPU. Our GPU memory allocation scheme is exemplified in Figure 3. The MFS' are stored in a 2D block of size (N T × N S ) where N S is the maximal size of all MFS'. Each thread of the DP kernel fetches a MFS to initialize its setting. The score and direction matrices are stored in a separate 3D memory block of size (N Q × N P × N T ), where N Q and N P represent the maximal number of fragments from the target protein and all the database proteins, respectively. The alignment paths are then stored in a 2D block of size (N P × N T ). In ppsAlign, multiple GPU memory accesses are coalesced into a single transaction whenever possible. This fragment-level alignment process provides a selection of seed fragments which are likely to be successful in accurate alignment. Only approximately 1.6% of the total execution time is spent in this phase.

Figure 3
figure 3

GPU global memory layout of fragment-level alignment. The Matched Fragment Sets (MFS') are stored in a 2D block. The score and direction matrices are stored in a separate 3D memory block. The alignment paths are stored in a 2D block.

Residue-level alignment

The results of fragment-level alignment are then refined by a residue-level alignment process. Such a refined alignment result is an ordered set R = {(q i , p i ) | q i Q', p i P'}, where Q'Q (target protein) and P'P (database protein).

In this step, a rigid-body transformation (rotation and translation) T that minimizes the RMSD of R is first calculated. Then, the transformation T is used to superimpose all the residues from Q over P. Finally, the DP algorithm is used to find an alignment path between Q and P similar to the fragment-level alignment. In the DP, the gap penalty G r is set to 0 and the residue similarity score S r uses the scoring function from TM-align [10]. However, our framework can be configured to use any suitable residue-level scoring function [1].

As we mentioned previously, the complex control flows present in the traditional method for computing T (e.g., Kabsch algorithm [33]) make it unsuitable for the SIMT computing model of GPU. To address this issue, we implement and optimize a fast algorithm using quaternion-based characteristic polynomial (QCP) [34], gRMSD-QCP, to determine the transformation T on GPU. In the gRMSD-QCP kernel, coordinates of residues from two protein structures are first written into the GPU global memory and origin of coordinate is moved to the center of coordinates for each protein. Then, the inner-product of two coordinate matrices is calculated, which is used by QCP for RMSD calculation. The work flow of gRMSD-QCP is relatively simple, and therefore amenable of efficient GPU implementation.

GPU computation for residue-level alignment

The GPU implementation of residue-level alignment starts with loading coordinates of residues from R to the GPU global memory. Next, the gRMSD-QCP kernel is invoked to calculate the transformation T which is also written into the GPU global memory. Finally, a DP kernel is called to find residue alignments which are transferred into the CPU memory after the kernel terminates.

As in the fragment-level alignment phase, the residue-level alignments are divided into batches according to the memory requirement of the threads. After all the batches are executed, ppsAlign aggregates the outputs of residue alignment R, which are used in the next step for searching the maximal alignment.

Maximal alignment search

The maximal alignment search is used to find the largest subset MR such that the score of the residue alignment R, denoted by S a , is maximized. Because finding the largest subset M is extremely time-consuming, a heuristic and approximate algorithm, MaxSub [35], has been developed to solve this problem. In ppsAlign, a variant of MaxSub, gMaxSub, is designed to parallelize the search process on the GPU. In the current implementation of ppsAlign, S a is defined using the TM-score [10].

GPU computation for maximal alignment search

The input of this step is the alignment R from the residue-level alignment which has L R aligned residue pairs. The original MaxSub algorithm on CPU searches the largest subset M by shifting a window W of size L W along R (see Figure 4a). This results into (L R -L W + 1) shift operations which are candidates for parallelization. Then, gMaxSub searches the maximal alignment by concurrently dispatching each calculation of W to different GPU threads (see Figure 4b).

Figure 4
figure 4

Comparison of MaxSub and gMaxSub. The original MaxSub algorithm on CPU searches the largest subset by shifting a window W along the residue alignment R. The gMaxSub searches the maximal alignment by concurrently dispatching each calculation of W to different GPU threads.

Figure 5 describes a pseudo-code of gMaxSub. First, for each residue alignment R between Q and P, (L R -L W + 1) windows are generated. Second, the gRMSD-QCP kernel is invoked to calculate the transformation T for the residue pairs within each W and then T is used to superimpose residues from Q over P in R. Third, residue pair (q i , p i )∊R is added into W if its distance is below a cutoff (4.0 Å) after the superimposition. The above two steps (i.e., gRMSD-QCP and window extension) are iteratively executed for N MS times. Forth, the last W is assigned to M and S a is calculated.

Figure 5
figure 5

Algorithm of maximal alignment search. For each search, the gRMSD-QCP kernel is invoked to calculate the transformation superimpose residues with each window. The residue pair is added into the window if its distance is below a cutoff.

As in previous phases, the maximal alignment searches are divided into batches. After all the batches are executed, ppsAlign aggregates the outputs of subset and selects the one with the largest S a as the largest subset M. The transformation T associated with the largest subset M is used to superimpose all the residues from Q over P and the residue pair whose distance is below a cutoff (4.0Ǻ) is selected to form a new residue alignment R.

After gMaxSub terminates, if the current iteration number < N iter , the residue alignment R will be first filtered to remove redundant alignments from the same database protein and then sent to the residue-level alignment for further refinement; otherwise, R will be used as input for the next step of final assessment.

Final assessment of alignment quality

After structure alignments are computed on GPU, the residue alignments R are transferred from the GPU memory to CPU memory. We use PSI (percentage of structural similarity), defined as the percentage of residue pairs from R with distance below 4.0 Å, to score the alignment quality. We also assess the statistical significance of the alignments through z-Score of the PSI, which is given as follows:

z - Score =  PSI -  μ PSI σ PSI

where μ PSI and σ PSI denote mean and standard deviation of PSI for a given protein chain length, respectively. The parameters μ PSI and σ PSI are obtained using a method similar to [12], leading to the following settings: μ PSI = 375.64·k-0.5295 and σ PSI = 99.67·k-0.5885. Here, k is the minimum chain length between target and database proteins.


In this section, we compare ppsAlign's performance to concurrent methods in terms of alignment quality and computational efficiency. We evaluate ppsAlign using an NVIDIA Tesla C2050 GPU card equipped with 448 cores at 1.15 GHz and 3 GB global memory. The concurrent methods include TM-align [10], Fr-TM-align [11], and MAMMOTH [12], which share similar computational framework as ppsAlign. As DALI [8] and CE [9] have been exhaustively evaluated elsewhere [10], we do not include these approaches in our experiments. We download software packages of these methods from their official websites and evaluate the performance on a Linux personal computer with AMD Opetron dual-core 1000 series processor at 1.8 GHz and 8 GB RAM.

The main purpose of structure alignment is to maximize the number of aligned residues (N e ) while minimizing the RMSD of the aligned residues, denoted by cRMSD. To eliminate the size dependence of cRMSD on N e , in this paper we use a normalized measure of cRMSD, RMSD100, to evaluate the alignment quality. RMSD100 is calculated as follows [36]:

R S M D 100 = c R M S D 1 + ln N e 100 ,

which corresponds to the cRMSD value expected if the two protein structures were 100 residues long.

To evaluate efficiency, we measure the execution time on a dataset in which the protein's chain length is in a range from 80 to 500 residues extracted from ASTRAL 1.75 database [37] with sequence identity < 40% (ASREAL40). The database protein chain length is determined by the global memory capacity on the GPU card. However, this limitation is not severe as 98.5% ASTRAL40 protein chains have less than 500 residues. We expect that the advancement of GPU technology will solve this memory limitation issue in the near future so that the ppsAlign algorithm can handle protein chains longer than 500 residues. Currently we can handle structures larger than 500 residues in one of the following two ways: 1) by sending the alignment tasks to our CPU-based algorithm and 2) if resource allows, by using another GPU card to align the remaining 1.5% of large structures. Although the algorithm can also handle small protein chains below 80 residues (~16% of ASTRAL40), we do not use them for our testing because they have relatively simple topologies [38].

To efficiently utilize global memory of GPU card, the entire database proteins are sorted according to the chain length and then divided into two small datasets: 1) D 1 , which includes 6, 569 proteins in the range [80, 250) residues selected from ASTRAL40 according to the length distribution of proteins, and 2) D 2 , which includes 1, 912 proteins in the range [251, 500) residues. The target dataset includes 100 proteins which are randomly selected in the range [80, 250) from ASTRAL40. For each target protein, a one-against-all alignment is performed with all database proteins and totally 100 × (6, 569 + 1, 912) = 848, 100 non-homologous protein pairs are compared during the experiment.

Scalability of ppsAlign

There are two critical parameters for ppsAlign, namely the maximal number of iteration (N iter ) and the maximal number of MFS (N seed ). Intuitively, when increasing N iter or N seed , ppsAlign will often obtain better alignment quality but the execution time will be significantly lengthened. To verify this, we preliminarily investigate the performance of different settings using a small target dataset of 17 proteins and the dataset D 1 in terms of RMSD100. The experimental results of RMSD100 with N iter = {3, 5, 7} and N seed = {10, 30, 50, 70} are shown in Figure 6, which illustrates that ppsAlign has decreased RMSD100 when N iter and/or N seed is increasing. This figure can be used as a guideline for parameter selection of ppsAlign. For a fair comparison of efficiency improvement from ppsAlign to a concurrent method, we select a combination of N iter and N seed that achieves comparable alignment quality.

Figure 6
figure 6

Performance comparison of ppsAlign with different settings of N seed and N iter . ppsAlign is running on NVIDIA Tesla C2050 GPU card with a small target dataset of 17 proteins. The parameter settings of ppsAlign are N iter = {3, 5, 7} and N seed = {10, 30, 50, 70}. (A) N iter = 3. (B) N iter = 5. (C) N iter = 7.

Speedup over TM-align and CPU-based ppsAlign

In this experiment, ppsAlign is executed with a parameter setting of N iter = 3 and N seed = 20 which results in a comparable RMSD100 to TM-align and the CPU version of ppsAlign. Table 1 summarizes the alignment quality, average execution time, and corresponding speedup. ppsAlign achieves speedups of 23.8 and 35.9 compared to CPU-based ppsAlign and TM-align, respectively. The detailed comparison of alignment quality of ppsAlign and TM-align can be found in Additional file 1: Table S1.

Table 1 Average execution time of TM-align, CPU-based ppsAlign, and ppsAlign with parameter settings (N iter = 3 and N seed = 20).

Speedup over Fr-TM-align

Since Fr-TM-align performs more iterations to improve its alignment quality over TM-align, we increase both iteration and seed numbers of ppsAlign algorithm to achieve a comparable alignment quality with Fr-TM-align. The experimental results of RMSD100, average execution time, and corresponding speedup with N iter = 6 and N seed = 30 are shown in Table 2. ppsAlign achieves speedup 64.7 compared to Fr-TM-align with the same alignment quality. The detailed comparison of alignment quality of ppsAlign and Fr-TM-align can be found in Additional file 1: Table S2.

Table 2 Average execution time of Fr-TM-align and ppsAlign with parameter settings (N iter = 6 and N seed = 30).

Speedup over MAMMOTH

In the last experiment, we use the same dataset to compare the performance of ppsAlign and MAMMOTH. Different from TM-align and Fr-TM-align, MAMMOTH is originally developed for the purpose of large-scale comparisons with high efficiency at the cost of the reduction of alignment quality. Because of its high speed, MAMMOTH is used as a benchmark for maximal speed on the CPU platform in [39]. The experimental results of RMSD100, average execution time, and corresponding speedup with N iter = 1 and N seed = 8 are shown in Table 3. ppsAlign achieves speedup 40.3 compared to MAMMOTH and higher alignment quality. The detailed comparison of alignment quality of ppsAlign and MAMMOTH can be found in Additional file 1: Table S3.

Table 3 Average execution time of MAMMOTH and ppsAlign with parameter settings (N iter = 1 and N seed = 8).


The framework of ppsAlign is a general-purpose GPU platform for protein structure alignment which could take many concurrent methods, such as TM-align [10] and Fr-TM-align [11], into the parallelized algorithm design. An important novelty in our approach is to create a unique design to manage resources of the GPU architecture. First, an intelligent decomposition of the application in kernels characterized by different parallelization strategies is provided. In the existing methods for GPU-based sequence alignment mentioned previously, a pair-wise comparison is either assigned to a thread (i.e., inter-task parallelization) or corporately performed by a block of threads (i.e., intra-task parallelization) [18, 20]. However, as the workflow of structure alignment is more complicated than that of sequence alignment, neither the inter- nor the intra- task parallelization can efficiently exploit the GPU computing power. Therefore, ppsAlign utilizes a hybrid inter- and intra- task parallel model. In particular, each task (i.e., pair-wise structural comparison) is divided into several independent seed alignments. Each seed alignment is assigned to a different thread (inter-task parallelization), whereas each block executes one or more pair-wise comparisons (intra-task parallelization). Second, a smart design of memory layout and memory access patterns are developed, the former allowing an effective use of the memory capacity at the different levels of the GPU memory hierarchy, and the latter minimizing the memory bandwidth requirement of the application. Third, several efficient algorithms for avoiding complex control flow on GPU are proposed to take advantage of the SIMT nature of the GPU. For instance, a feature-based measure is used to compute similarity of fragment at the fragment-level alignment which can avoid time-consuming RMSD calculation at the initial stage of structure alignment.

One of the major ways in which ppsAlign differs to other methods is implementing protein structure alignment at the residue level on GPU. Recently, the GPU-enhanced algorithms are gaining an increasing attention in bioinformatics. One of the major steps was a GPU implementation of a one-against-all sequence comparison using Smith-Waterman algorithm [20, 21]. With these methods, a sequence database search can be performed resulting in a list of similarity scores, while these methods do not provide the detailed alignment information of the best hits [23]. To provide detailed residue-residue correspondence, GPU-BLAST [24] was developed, that allowed to accelerate the NCBI-BLAST search, achieving the speedup between 3 and 4 on an NVIDIA Tesla C2050 GPU card. In addition, another approach to protein sequence that uses backtracking on GPU to construct alignment of residues has been proposed [23]. Compared to the sequence alignments, the implementation of structure alignment on GPU is a more challenging task, because some routines (e.g., RMSD calculation) can cause severe divergence among GPU threads and decrease performance of GPU. One of the first structure comparison methods implemented on GPU, SA Tableau Search [27], aligns protein substructure at the secondary structure level, that is by aligning secondary structure elements, while not aligning structures at the residue level. To the best of our knowledge, ppsAlign is the first protein structure comparison platform for GPU that provides the residue level structural alignment.

The substantial contribution of ppsAlign is to provide a high-performance computing platform for the research community. An alternative solution to accelerate the protein structure alignment is to install more CPU computing cores in a single machine. However, using more CPU cores in a single machine need to upgrade main board and memory accordingly, which could decrease price/performance ratio. In contrast, installing a GPU card into a PCIe (Peripheral Component Interconnect Express) slot does not require extra cost and more GPU cards can be installed into one PCIe slot by a switch. In this paper, an NVIDIA Tesla C050 GPU card is utilized to evaluate performance, which has also been used in GPU-BLAST [24]. Though it is a high end product of NVIDIA, we expect its price will drop in the near future due to market demand in gaming industry.


This paper presents ppsAlign for large-scale protein structure alignment using GPUs. ppsAlign employs an index-based search procedure to find seeds of matched fragment sets, and then iteratively refines the seeds with fragment- and residue- level alignments. We provide an in-depth comparison of ppsAlign against several concurrent CPU-based methods. Our experimental results show that ppsAlign can achieve significant speedup over its CPU implementation, TM-align, Fr-TM-align, and MAMMOTH on a single NVIDIA Tesla C2050 GPU.

We emphasize that the framework of ppsAlign is not designed as a replacement for the existing structural alignment tools, but as a general-purpose platform for protein structure alignments on GPU. With this platform, we can parallelize the existing algorithms (e.g., TM-align and Fr-TM-align) on GPU and utilize the massive parallel computing power of GPU to achieve high-throughput structural comparisons without sacrificing alignment quality.

Availability and requirements

• Project name: ppsAlign

• Project home page:

• Operating system(s): Linux

• Programming language: CUDA, JAVA, and PHP

• License: none

Availability of supporting data

The data sets supporting the results of this article are available in the Worldwide Protein Data Bank repository,