Background

Next generation sequencing technologies hold great promise for biology in general [1]. They may be used to identify SNPs, pursue metagenomics, analyse DNA-protein interactions, and to discover non-coding RNA [2]. Furthermore, they may also be used for the analysis of the transcriptome [3, 4] supplementing the microarray technology. Compared to microarrays, sequencing based analysis of the transcriptome allows to tackle new biological problems such as the identification of allele specific gene expression, absolute measurement of gene expression, identification of structural variation, identification of alternative splicing sites and cross species comparison of gene expression.

We developed PanGEA – The Comprehensive (ancient greek: pan) G ene E xpression A nalyzer – to enable a fast and user-friendly analysis of allele specific gene expression using the 454 technology. PanGEA can be used for quantification of gene expression, the identification of SNPs and the quantification of allele specific gene expression. Additionally, PanGEA implements a modification of the Smith-Waterman algorithm which deals with incorrect estimates of homopolymer length as occuring in the 454 technology.

PanGEA and the accompanying console applications have been mainly developed for Windows but also work in Linux and Mac OsX. PanGEA is distributed under the Mozilla Public Licence and can be obtained from http://www.kofler.or.at/bioinformatics/PanGEA [see Additional file 1 for the executable and Additional file 2 for the source code of PanGEA].

Implementation

PanGEA-BlastN

To map ESTs to genes or whole genomes we developed PanGEA-BlastN. Similarly to Blast [5], PanGEA-BlastN uses an heuristic algorithm to find approximate hits between the database and the query sequence and then extends these hits with dynamic programming. PanGEA-BlastN is well-suited for mapping of EST reads obtained from next-generation sequencing technologies for the following reasons:

  • the seeding (heuristic search for approximate hits) has been optimized. Pairwise alignments will only be created for the best seeds, which reduces the number of dynamic programming steps and thus computation time

  • the necessity to map ESTs unambiguously is explicitly addressed

  • the dynamic programming algorithm has been modified to deal with uncertainty of homopolymer length as occurring in the 454-technology or in the Helicos system [6, 7]

  • several modifications have been implemented which allow for introns in the EST sequences

The mapping algorithm of PanGEA-BlastN, initially builds a hash-table of the database sequence and subsequently scans for approximate hits between the query and the database sequence (seeds). Computation time is reduced by the identification of the best candidates for the highest scoring hit from the longest diagonals, i.e. longest succession of shared words between the query and the database sequence. Only the longest diagonals will be subjected to dynamic programming. In addition to the classic Smith-Waterman algorithm PanGEA-BlastN provides a modified Smith-Waterman algorithm which is especially adapted to uncertainty of homopolymer length estimates occurring in several next-generation sequencing technologies [6, 7]. We also implemented improvements in the dynamic programming algorithm to increase computation efficiency Gotoh [8]. Unambiguously mapped ESTs are identified by comparing the scores of pairwise alignments. If the score difference between the best and the second best hit exceeds a minimum threshold, a mapping result is considered unambiguous. Ambiguous results are reported into a separate output file. PanGEA-BlastN also offers an intron-mode in which introns are already considered during seeding. Putative exons, separated by an intron, are individually aligned by dynamic programming (partial alignments) and subsequently aggregated into a composite alignment. Partial alignments, representing putative exons, are frequently overlapping with respect to the query sequence. For example, 'exon a' covering the bases 5 – 125 of a query sequence overlaps with 'exon b' which covers the bases 115 – 220. These overlaps are biologically not meaningful and have to be resolved. Therefore, PanGEA-BlastN calculates the alignment scores for each overlap individually and removes the overlap with the lowest score.

In contrast to other Blast-like approaches, insignificant hits cannot be filtered by specification of a minimum alignment score. Rather, spurious hits can be filtered after a PanGEA-BlastN search with the option 'Manage Pairwise alignments', by specifying a minimum similarity, alignment length and read coverage (see below). This has the advantage that performing a separate PanGEA-BlastN search for each different setting is not necessary. Instead, a PanGEA-BlastN search is conducted only once and the optimal parameters can subsequently be quickly estimated. The total length of the database sequences is only limited by the amount of available RAM, an analysis using the D. melanogaster genome as database sequence (120 Mbp) typically requires about 700 MB of RAM. No upper limit exists for the number of query sequences as PanGEA-BlastN operates in batch mode. PanGEA-BlastN is available as a stand-alone console application and embedded into a user-friendly GUI in the software PanGEA.

Seeding

Identification of approximate hits between the database and the query sequence, i.e seeding, provides the starting point for subsequent dynamic programming steps. Since the most time consuming processes during mapping of ESTs is dynamic programming, minimizing the number of dynamic programming steps could considerably improve computational efficiency. For EST mapping to genes or genomes the primary interest is the identification of the corresponding genes, thus only a single best hit is expected for each EST. This particular requirement can be used to design an efficient EST-mapping-algorithm by searching, already during seeding, for best-hit-candidates and subsequently aligning only those with dynamic programming. In contrast to Blast which aligns each approximate hit between a database and a query sequence [5], PanGEA-BlastN only aligns the best-hit-candidates. Best-hit-candidates are identified by searching for the longest diagonals between a database and a query sequence [9, 10]. Briefly, a hash table is built, containing each non-overlapping word of length k in the database sequences. Each word holds information about the index of the database sequence (i) and the position within the database sequence (j). Words having a low information content, i.e. occur several-fold more often than expected by chance (n > n max ), are removed from the hash table. The maximum number of occurrences n max for words of length k in database sequences having the total length l d can be calculated as

Where c denotes the low complexity cutoff specified by the user. After building a hash table, the query sequences are scanned. For each overlapping word of length k in the query sequence the corresponding matches in the hash table are identified. For these words, a shift (s) is calculated s = j - t where j is the position of the word in the database sequence and t the position in the query sequence. Subsequently, these words are sorted and parsed by searching for consecutive words with identical index (i) and identical (or similar) shift (s) [10]. A consecutive series of n identical indexes and shifts form a diagonal with length n. The algorithm searches for the longest diagonal, having the length n longest , and passes all diagonals with a length nn longest - 1 as seeds to the dynamic programming algorithm (Fig. 1). The main difference to the algorithm of Ning et al. [10] is that PanGEA-BlastN uses the diagonals merely as seeds for dynamic programming. In addition to this, PanGEA-BlastN provides an optional modification to account for the presence of introns in the reads being mapped against genomic sequences. Consecutive diagonals of length n ≥ 2 may be concatenated thus forming cumulative diagonals (Fig. 1). These cumulative diagonals allow for introns in the ESTs already during seeding. A maximum distance between the individual diagonals may be specified by the user.

Figure 1
figure 1

Seeding during the two PanGEA-BlastN search modes. Individual word positions are marked with an x. Length of each diagonal (n) is shown above whereas the longest diagonal is indicated by a star. Diagonals being passed as seeds to the dynamic programming algorithm are shown shaded (nn longest - 1).

Homopolymer adapted dynamic programming

Several next-generation sequencing technologies, for example the 454-platform or the Helicos system introduce new types of sequencing errors [1113]. Most notably, the length of homopolymers is often estimated incorrectly [1113, 7]. These sequencing errors frequently cause the alignments of mismatching bases (Fig. 2), which can lead to wrong estimates of the evolutionary distance between two sequences or may complicate the identification of SNPs in downstream applications. We developed a novel Smith-Waterman algorithm, which accounts for this uncertainty of homopolymer length by allowing for gaps preferentially in homopolymers.

Figure 2
figure 2

Pairwise alignments created with the homopolymer Smith-Waterman algorithm compared to the classic Smith-Waterman algorithm [for the whole alignments see Additional file 3].

The basic idea of the algorithm is to adjust the gap-introduction penalty (gap-opening penalty) dynamically to the "homopolymer-terrain" of a nucleotide sequence, i.e to use a position specific gap-introduction penalty, which decreases linearly within hompolymers. Additionally, a reduced gap-introduction penalty should only be valid within the tract of a homopolymer, if a gap is to be extended beyond, the default gap-introduction penalty should be used.

Let the two DNA sequences be D = d1d2...d n (database) and Q = q1q2...q m (query). Let I max further be the default (maximum) gap-introduction penalty, E the gap-extension penalty, S the hit score and P mm the mismatch penalty, then the minimum gap-introduction penalty I min can be calculated.

Gap introduction penalties I <I min cause inconsistent alignments. Now two position specific gap introduction matrices can be constructed I D = Id 1Id 2...I dn and I Q = Iq 1Iq 2...I qm where each entry I di , I qk relates to an corresponding entry d i , q k in D and Q respectively, where 1 ≤ in and 1 ≤ km. The two matrices I D and I Q are instantiated with values for I di , I qk where I min = I di , I qk = I max . In the absence of homopolymers in sequences D and Q, the corresponding values I di and I qk respectively, are set to I max whereas these values decrease linearly to I min within homopolymers.

For gaps of length t the affine gap penalty P gt can be calculated [14]:

The homopolymer Smith-Waterman algorithm described here, additionally uses the homopolymer gap penalty P ht for gaps of length t.

To restrict the introduced low-penalty-gaps to homopolymers, we introduced the homopolymer-transgression-penalty T, where x denotes the number of homopolymer transgressions. A homopolymer is transgressed each time q i qi+1for insertions and d i di+1for deletions. A high value of T restricts low-penalty-gaps exclusively to homopolymer tracts, whereas T = 0 allows an extension of these gaps without imposing any restrictions. Introduction of the homopolymer transgression penalty additionally has the advantage that this facilitates the implementation of the homopolymer Smith-Waterman algorithm in the important modification described by Gotoh [8].

Let s(d i , q k ) be the similarity between the two bases d i and q k then a two dimensional matrix H can be constructed, similar as described by Smith and Waterman [14].

Fig 2 shows some pairwise alignments generated with the homopolymer Smith-Waterman algorithm compared to alignments generated by the classical Smith-Waterman implementation [for the whole alignments see Additional file 3].

We implemented this homopolymer Smith-Waterman algorithm together with the modification described by Gotoh [8], which reduces the required computation time from O(m2n) to O(mn) where m and n is the length of the database and the query sequence respectively [8]. We simply used four one-dimensional arrays which keep track of the highest possible gap score (normal gaps and homopolymer gaps, each in the database and the query sequence) instead of the two originally described. An implementation of this homopolymer Smith-Waterman algorithm is available as the stand-alone application 'PanGEA-SW'.

Mapping statistics and management of pairwise alignments

The mapped cDNA sequence reads can be managed using the user-friendly GUI of PanGEA. Summary statistics for all ESTs mapping to the same gene are provided, such as the number of sense-ESTs mapping to the gene or the number of ESTs containing large gaps (putative introns). Subsets of the mapped reads can be displayed and exported by providing several quality criteria, such as ambiguity, minimum length of the alignment, minimum similarity, minimum coverage of the EST, presence or absence of large gaps (putative introns) or transcript orientation (sense, anti-sense). The subsets may be exported and used for a subsequent analysis, for example SNP identification.

SNP identification

SNPs are identified from the pairwise alignments. If a list of validated SNPs is available, PanGEA provides the option to use only these SNPs for frequency estimates from the sequence reads. If no validated SNPs are available, PanGEA identifies SNPs from the sequence reads and provides several options to minimize the number of miscalled SNPs. PanGEA can account for the quality scores of the sequences, determining the sequence quality at the SNP-site and its neighborhood.

The strategy for SNP-identification in PanGEA is to first identify SNPs using not-stringent parameters and to subsequently select a subset of these SNPs with the option 'Manage SNPs' using stringent parameters. This has the main advantage that a separate SNP-identification for each different parameters is not necessary, rather SNPs are identified only once and subsets can be flexible selected. This approach allows for an interactive fine-tuning of the selected SNPs and SNP-alleles. To test the SNP identification module we created extensive unit tests using NUnit [see Additional file 4].

The SNP identification module is available as stand-alone console application 'PanGEA-SNP' and has been integrated into the software PanGEA.

Identification of allele specific gene expression and visualisation of SNPs

PanGEA provides two options to display the identified SNPs. Either summary results are displayed for each SNP-site (Fig. 3) or for each database sequence (typically corresponding to a gene or transcript). The summary statistics for each SNP-site furthermore provide a convenient way to quantify allele specific gene expression by displaying the SNP-allele frequencies at each SNP-site (Fig. 3). Optionally, subsets of the SNP-alleles can be displayed according to quality, direction of transcription (sense and anti-sense) and minimum frequency. The quality of SNP-alleles can be assessed by several criteria such as the minimum sequence quality of the SNP, the minimum sequence quality in the neighborhood of the SNP and the minimum distance from the alignment ends.

Figure 3
figure 3

Summary statistics for each SNP-site and quantification of allele specific gene expression by the frequencies of individual SNP-alleles.

Methods for benchmarking

We obtained the Drosophila melanogaster genome (release 5.5), gene sequences (release 5.5) and the transcripts (release 5.5) from Flybase http://www.flybase.org/. All benchmarks were carried out on a standard desktop computer with 2 GB of RAM and an Intel™Core Duo®2 × 2.4 GHz processor. For benchmarking, a set of 26 040 454-ESTs, with an average length of 106 bp, derived from the 3'-end of D. melanogaster transcripts, were downloaded from GenBank [accession numbers: EV574767 – EV600806; [4]]. These 454-ESTs were mapped to the genes of D. melanogaster using stand-alone BLAST 2.2.13 and PanGEA-BlastN. Both programs used only one of the two available processors. The following PanGEA-BlastN settings were used: word length 11; minimum diagonal 2; low complexity threshold 10; hit score 3; mismatch penalty 5; gap introduction penalty 11; gap extension penalty 2; homopolymer transgression penalty 3; ambiguity threshold 10; homopolymer Smith-Waterman; intron mode was off; The defaults settings were used for NCBI-BlastN, except the e-value was set to 10-10 and the tabular output format (-m 8) was used. The pairwise alignments, resulting from the mapping of these 26 000 454-ESTs to the genes of D. melanogaster, were used for the subsequent identification of SNPs.

To test the performance of PanGEA-BlastN with the 454-platform in detail, we developed a console application which randomly excises 1000 ESTs from the transcripts of D. melanogaster, randomly introduces pseudo-sequencing-errors (0%, 5% and 10%) into these ESTs and maps them either to the genes or the whole genome of D. melanogaster using PanGEA-BlastN. An EST was considered correctly mapped to the genes, if the gene-ID (specified in header of transcript) matched the mapping result, whereas an EST was considered correctly mapped to the whole genome, if the chromosome-ID as well as the position within the chromosome (specified in header of transcript) matched the mapping result.

Results

Influence of the mapping parameters used by PanGEA-BlastN

We evaluated the influence of the PanGEA-BlastN parameters on the mapping accuracy and computation time by mapping 1000 randomly generated 250 bp fragments from D. melanogaster transcripts (release 5.5) to the corresponding genes.

First, we determined the influence of the low complexity cutoff (c), which reflects the maximum frequency of a word in a hash-table. Words occurring c times more frequent than expected by chance were not considered. As expected the mapping accuracy increased with 'c' on the expense of computation time (Fig. 4a). Nevertheless, the number of inaccurately mapped 454-ESTs was low (< 1.20%) irrespective of the low complexity cutoff.

Figure 4
figure 4

Effect of the most important parameters on the performance of PanGEA-BlastN. Values were calculated for mapping of 1000 randomly generated 250 bp fragments from D. melanogaster transcripts to the corresponding genes. Benchmarks were calculated in triplicate for five (or more) datapoints [see Additional file 5]. If not stated otherwise the following parameters were used: word length 11; minimum diagonal 3; low complexity threshold 20; ambiguity 12.

Next we calculated the influence of the ambiguity threshold, which measures the difference between the best and the second best hit. Increasing the ambiguity threshold resulted in a moderate reduction for incorrectly mapped 454-ESTs. While < 1.0% were mapped incorrectly when only the best hit (ambiguity threshold = 0) was considered, an ambiguity threshold of 100 had < 0.5% incorrectly mapped 454-ESTs. The trade off of this increase in mapping accuracy was an increase of ambiguously mapped 454-ESTs. Rather than 9% for the best hit, an ambiguity threshold of 100 resulted in 13% ambiguous hits (Fig. 4b). The ambiguity threshold only has a minor influence on computation time [see Additional file 5]. On the other hand, increasing the word size dramatically reduces the computation time on the expense of the mapping accuracy (Fig. 4c). The last parameter evaluated was the 'minimum diagonal length'. Similar to word size an increase in minimum diagonal length reduced the computational time on the expense of mapping accuracy (Fig. 4c).

These results illustrate that optimal parameters represent a compromise between computation time, specificity and sensitivity.

Mapping performance of PanGEA-BlastN

To assess the performance PanGEA-BlastN we compared PanGEA-BlastN with NCBI-BlastN. A set of more than 25,000 454 ESTs [4], with an average length of 106 bp were mapped to their gene sequences using PanGEA-BlastN and NCBI-BlastN. Despite a considerable reduced computation time, PanGEA-BlastN generated very similar results as NCBI-BlastN (Table 1), suggesting that the simplified search did not compromise the mapping efficiency.

Table 1 Comparision of the performance of PanGEA-BlastN with NCBI-BlastN [5].

Nevertheless, we noted some differences between PanGEA-BlastN and NCBI-BlastN. To evaluate the mapping efficiency of PanGEA-BlastN, we computationally generated 1000 454-EST-like sequences from D. melanogaster transcripts and mapped them either to gene sequences (including intronic sequences) or to the entire genome. To account for sequencing errors, we also introduced 5% and 10% mutations prior to mapping.

The performance of PanGEA-BlastN was assessed using the following criteria: (i) the number of ambiguous hits (ii) the number of correct hits, including ambiguous hits containing the correct hit, (iii) the number of wrong hits, including ambiguous hits not containing the correct hit (iv) the number of not-mapped ESTs, (v) the number of identified large gaps (> 50 bp; putative introns) and finally (vi) the required computation time.

A very high proportion (> 99.5%) of the ESTs was correctly mapped irrespectively of the sequence divergence (Table 2). This mapping accuracy could be further improved by changing some of the parameters, such as word size (see previous section). We noted a substantial discrepancy of unambiguously mapped reads for the gene sequences and genomic sequences. Despite a higher complexity, fewer reads (2.5%) were ambiguously mapped to the genome than to the gene sequences (10%). The reason for this discrepancy are ovelapping/nested genes (data not shown). Most importantly, the mapping accuracy was not effected when the intron discovery mode was switched on. However, several large gaps (i.e.: introns) were discovered, emphasizing the need for the intron discovery mode. Increasing the length of the 454-ESTs beyond 100 bp did not result in an increased mapping efficiency, suggesting that this length is sufficient for reliable mapping.

Table 2 Performance of PanGEA-BlastN with the 454-platform using the recommended settings.

However, considering the benchmarks of Table 2 we recommend the following settings for mapping of 454-ESTs, which are an attempt to optimize the antagonistic demands for efficiency, sensitivity and specificity: word length 11 (10–12), minimum diagonal length 3 (2–3), low complexity cutoff 10 (10–50); intron mode on. These settings are used as defaults by PanGEA-BlastN.

Discussion and conclusion

PanGEA provides an important step towards the use of massively parallel sequencing for gene expression analysis. While it is currently not apparent which of the new sequencing technologies will provide the most appropriate tool for gene expression analysis, the software tool PanGEA allows an accurate quantification of allele specific gene expression.

Availability and requirements

Project name: PanGEA

Project home page: http://www.kofler.or.at/bioinformatics/PanGEA

Operating system(s): Windows, Linux and Mac Os X

Programming language: C#

Other requirements: .Net Framework 2.0 for Windows; Mono 2.0 for Mac Os X and Linux

License: Mozilla Public License

Any restrictions to use by non-academics: none