Background

Next-generation sequencing technologies, based on sequencing by synthesis (SBS), are starting to deliver a large number of DNA sequences at a relatively low cost, thus opening new areas of genomic research. One of the ambitious goals for these technologies is to sequence a complete human genome in several days at a price of $100,000, and eventually $1,000. To this end, sequencing throughput must be increased dramatically. This may be achieved by carrying out many parallel reactions. Despite the fact that the read-length is short (down to 25-50 bp), the overall throughput is enormous, each run producing up to several million reads and billions of base-pairs of sequence data.

While the promise of next-generation sequencing technologies has become a reality, they also present substantial challenges such as in the mapping of short sequence reads to the genome, polymorphism detection, characterization of allele frequencies from population samples and data management. So far, most efforts have been spent in developing methods for unspliced mapping of short sequence reads, and several software tools have been developed. An alignment integrated tool in the Illumina-Solexa data processing package, ELAND, optimizes mapping of very short reads and allows at most only two mismatches between the read and the genomic sequence. MAQ is another program for ungapped alignment with probability models to measure alignment quality [1]. SOAP and SeqMap can conduct both ungapped and gapped alignment for Solexa reads [2, 3]. SHRiMP is a package for mapping short reads to highly polymorphic genomes with a statistical method for scoring the alignment [4]. It is required to quickly and accurately align the sequence reads over exon/intron boundaries in transcriptome sequencing, but these programs are not designed for spliced alignments. Most existing programs for spliced alignment such as BLAT and GMAP were designed for mapping Sanger capillary mRNA or EST sequence to genome [5, 6] and do not yet have efficient support for short sequence reads. QPalma is specifically designed to align the short sequence reads over intron boundaries [7], but it requires training sets of spliced reads with sequencing quality and known alignments information. Moreover, the existing programs usually detect SNPs in a single individual, and cannot be applied to population samples where high-quality SNPs and allele frequencies require characterization.

Here we provide a freely available software tool, MapNext, for both spliced and unspliced alignments of the short reads and automated SNP detection from population sequences. For spliced alignments, a training process is not needed. Additionly, MapNext is capable of facilitating the conversion of text based sequences and quality output, mapping results and SNP results into a more flexible SQL database.

Implementation

The main program of MapNext was developed in C++. Some Perl scripts were written to format data, and an SQL script was used to create tables and load the results into the tables in the database. FASTA format for the reference sequences, and either FASTA or FASTQ format for query reads are allowed in the input of MapNext. The current version is designed for Linux/Unix systems. Compiling the program requires a recent version of g++. The implementation of MapNext1.0, full documentation and a downloadable test dataset are freely available [8].

Results and discussion

Unspliced alignment

Many of the short sequence mapping programs use k-mer and hash index table algorithm to accelerate alignment. SeqMap and MAQ build k-mer index table for reads, and SOAP builds index table for reference genome. To admit two mismatches, these programs split every read into four fragments and use all six combinations of the two fragments as seed. SHRiMP starts with a rapid k-mer hashing step and executes a vectorized Smith-Waterman step to score and validate the alignment. MapNext also uses a fast k-mer scan to locate the regions of potential homology. But it uses the filtration method described by [9]. For reads of length m, and mapping with up to n mismatches, each read is partitioned n+1 contiguous seeds. So there is at least one seed with no mismatches. MapNext processes the set of query reads to build a hash table indexed by seed. For every seed, its location in the query reads is listed where it occurs.

MapNext scans the reference sequence using a sliding window of size equals to k-mer size at steps of size 1. If the k-mer is a key in the query hash table, the corresponding reads with full length must be compared with adjacent regions of the reference surrounding the k-mer. MapNext counts mismatches between the reads and the reference sequence and stores the corresponding read name, the strand of the read, the number of mismatches and the position of the reference sequence if the number of mismatches does not exceed the maximum number of mismatches assigned by the user. Once the reads have been mapped onto the reference sequences, their locations can be used to precisely map reads to clusters based on the overlaps in the reference. For multiple hits of one read, the program randomly reports one of the hits. The output is given in a tab-separated format containing the reference and query sequence names, the start and end position of the alignment in the reference sequences, as well as the strand and the number of mismatches per read.

Spliced alignment

Most existing programs were designed for mapping mRNA or EST sequence to genome. QPalma is specifically designed to align the short sequence reads over intron boundaries. Qpalma first needs a dataset of known splice site and a splice site prediction model. It extends Smith-Waterman alignment to take qulity score, splice site and intron length into account. The prediction and training process is time-consuming, and the running steps are complicated.

MapNext first conducts the unspliced alignment of all query reads, then conducts the spliced alignment of reads that cannot be aligned at the first stage. It needs no prediction and traning process. For spliced alignment, MapNext uses two different strategies. One strategy is to generate all potential genomic sequences overlapping with two consecutive exon boundaries (according to annotation of the exons). Since all the introns have been spliced out, the reads left in the first stage can be aligned against these sequences to find the alignment over intron/exon boundaries. It is very efficient to map in this way. The perl script was written to build a file that contains intron-spanning sequences and their position relative to the genome sequences.

Unfortunately, this strategy cannot find un-annotated and novel transcripts such as those from alternative splicing. The other strategy (de novo), which is designed to find potential spliced alignments without annotation information, proceeds through the following steps: (1) building a hash table of k-mer on the head and tail of the query reads; (2) fast k-mer scanning to find all genomic locations from which the reads may derive; (3) searching the paired k-mer locations within 10 kbp and with the same strand; (4) extending the alignments in both 5' and 3' directions; and (5) verifying the location of the four boundaries (start and end of two consecutive exons) by the presence of the consensus (forward:GT/AG reverse: CT/AC) splice sites on the genomic locations.

SNP detection and minor allele frequency estimation

Most existing programs usually detect SNPs in a single individual, and cannot be applied to population samples where high-quality SNPs and allele frequencies require characterization. MapNext provides useful information about the SNP detection from population sequences. Methods of SNP detection can be categorized as two kinds. One is to develop an explicit statistical framework to model variants. The other is based on some heuristic rules. Quality control and significance threshold are the most adopted rules. MapNext uses these rules to detect SNPs from population sequences.

MapNext looks at each position in the cluster to determine if there is an SNP at that position. To make a qualified assessment, it also considers the general quality of the bases around this position [10, 11]. Potential nucleotide variants should be greater than the sequence quality value (QV), neighbour quality value (NQV), minimum coverage (MC), and minimum minor allele frequency (MMAF). The default QV value is >25 for the polymorphic base and >20 for the minimal NQV (4 bp). If SNPs were called in areas of low coverage, a higher amount of false positives would occur. Therefore, MC for the SNP to be called should be set. The coverage is counted as the number of valid reads at the current position. For population sequences, the call of minor frequency alleles is very likely affected by sequencing errors. MMAF can be used for determining the minimum frequency of the minor allele of the site to be called an SNP. MapNext calculates MAF from the major and minor allele counts at the SNP site.

To summarize, based on the user's specifications on what will be considered a valid SNP, SNP detection scans through the entire clustering results and report all SNPs that meet the requirements. The output contains the nucleotide bases of major and minor alleles, minor allele frequencies, coverage at the SNP site and the SNP position on the reference sequence.

Storage of the results

The last role of MapNext is to facilitate the conversion of text-based sequences, quality scores, alignment, and SNP results into a SQL database. The database created with MySQL records the sequences and results in relational tables. An SQL script is written to help users create the database and load data. Users can access the data through a command line client within the Linux server. A web interface is also being implemented to query and visualize the results.

Accuracy measure

To measureme the accuracy of the alignment algorithm, 1893118 reads(35 bp length), with 0.5% sequencing error rate, from 5796 coding DNA sequences of the chromosome I of Arabidopsis thaliana were simulated, in which about 7% of the total reads were spliced (134274 reads). Table 1 shows a comparison of accuracy and running time among the short reads alignment programs, i.e., SHRiMP, SeqMap, SOAP and MAQ for the alignment of the unspliced reads only, and QPalma for the alignment of the spliced reads. MapNext which incorporates both unspliced and spliced alignment of short transcript sequences can map more reads than other programs. Moreover, MapNext can achieve nearly the same sensitivity and accuracy for spliced alignment without a training process. For unspliced alignment, MapNext is faster than SHRiMP and SeqMap, but slower than SOAP and MAQ. For spliced alignment, MapNext is faster than Qpalma.

Table 1 Benchmark results of short reads alignment programs.

To measureme the accuracy of SNP detection in population sequences, a set of short reads data from population sequences known as true SNPs and their frequencies was simulated. Population sequences (sequence length = 100000, sample size = 50 and Theta = 0.005) were generated under a neutral evolution model using ms [12] and Seq-Gen [13] software. The population sequences were then randomly split into different numbers of 35 bp short reads and the empirical quality scores were added for each site. The empirical quality scores were obtained from sequence reads generated by Illumina-Solexa Genome Analyzer. Sequencing errors were added to these sites according to the probability of sequencing errors (Q = -10 * log(e/(1 - e)), where Q is the quality score, and e is the probability of sequencing error). One sequence was randomly chosen from the population sequences as the reference sequence for alignment. Mapping and SNP detection from those short reads was performed with MapNext. The candidate SNPs were compared with real SNPs to evaluate the accuracy of SNP calling. Table 2 shows true positive and false positive rates at various levels of read coverage for SNP detection. At a coverage of 8X, for example, the SNP true positive rate can be as high as 93.20%, while the false positive rate is less than 0.5%. The accuracy of estimation of allele frequencies was also evaluated from the estimated data by comparison to real frequencies data (Figure 1). The SD of the frequency esetimation is smaller at high coverage.

Figure 1
figure 1

The accuracy of minor allele frequency estimation produced by MapNext. The mean (SD) of minor allele frequencies are given. QV, NQV, MMAF and MC are set 25, 20, 0.01 and 50 respectively. (a) The sequencing coverage is 2× per individual. (b) The sequencing coverage is 12× per individual.

Table 2 Accuracy of SNP detection produced by MapNext.

A real dataset from Sonneratia alba, a mangrove species widely distributed in the western Indo-Pacific region was used to test our program. Seventy-five nuclear genes were sequenced from a sample that was a mixture of 100 individuals of S. alba. The query dataset contained 8161413 single-end reads (length 35 bp) generated by Illumina-Solexa Genome Analyzer. Out of these reads, 5695624 were mapped onto the reference sequences. Finally, a total of 30 putative SNPs were found and their allele frequencies were predicted.

Conclusion

MapNext is a comprehensive and powerful tool for both spliced and unspliced alignment of the short reads onto reference sequences and SNP detection from population sequences. It can facilitate the conversion of text based output into a SQL database, allowing users to query for results easily. MapNext enables researchers to efficiently analyze data in the study of transcriptomics and population genomics.

Availability and requirements

Project name: MapNext 1.0

Project home page: http://evolution.sysu.edu.cn/english/software/mapnext.htm

Operating system(s): Linux

Programming language: C++, Perl and MySQL

Other requirements: MySQL 5.0

License: Academic software license must be agreed upon during installation

Any restrictions to use by non-academics: Yes

Note

Other papers from the meeting have been published as part of BMC Bioinformatics Volume 10 Supplement 15, 2009: Eighth International Conference on Bioinformatics (InCoB2009): Bioinformatics, available online at http://www.biomedcentral.com/1471-2105/10?issue=S15.