Background

Single nucleotide polymorphisms (SNPs) are the most common form of genetic variation within species. As a result, SNPs are now becoming the most popular type of marker in genetic association and mapping studies. SNPs are also most likely to be the molecular basis for the majority of phenotypic variation in (outbred) populations. In particular, SNPs in regulatory and protein-coding regions can have an effect on gene expression levels and protein activity, respectively. The phenotypic differences observed between selected (sub) strains in model organisms may be the result of specific (combinations of) natural occurring polymorphisms. Hence, a comprehensive inventory of SNPs, including extensive annotation will be extremely valuable in the search for functional polymorphisms.

There is often a vast unexplored potential in large sequence datasets that have been collected for other purposes, for example, EST and whole genome sequencing (WGS) projects. In an effort to address these two issues, we have developed an in silico candidate SNP mining pipeline that uses all publicly available sequence data for a specific organism, and designed a database, CASCAD (CAscad SNP CAndidates Database), that allows storage of a wide variety of primary source data, cross-annotation to other databases, and analysis parameters for SNPs associated with expressed sequences.

Construction and content

We applied the SNP discovery pipeline to both rat [1] and zebrafish (unpublished results) and identified about 33,000 and 52,000 high-quality candidates, respectively, that were extensively annotated and stored in the CASCAD database (Table 1). The database includes detailed primary information on which the discovery of the candidate SNPs was based. This information, including sequence quality information (Phred score), the number of supporting reads for every nucleotide observed at the SNP position, and expected alignment lengths, was found to be very valuable for filtering for predicted variants that have the highest likelihood to be experimentally confirmed. Two verification experiments for the rat resulted in confirmation rate estimates of 59% (68 candidate SNPs in 10 different laboratory rat strains) and 50.3 % (340 candidates in 5 laboratory and 2 wild rat isolates) [1]. A set of 139 CASCAD entries tested in 7 zebrafish isolates confirmed 67.6% of them as true polymorphisms (unpublished data). The success rate values obtained are likely to be underestimates since only limited number of isolates/samples were used and we were unable to include exactly the same isolates that were used for generating the primary data (e.g. EST sequencing).

Table 1 Input data (number of sequence reads) for the CASCAD pipeline and number of predicted candidate SNPs.

We designed a web-based interface with Perl scripts communicating to a MySQL database, and displaying HTML pages through Apache server running on SuSE Linux. The interface provides simple, advanced (Figure 1), and sequence-based search forms. Parameters that can be used in a search include strain information, different formats of sequence identifiers (e.g. GenBank, UniGene, LocusLink accessions or gene symbol), map positions (genetic and physical), and a wide variety of SNP characteristics, such as experimental evidence, a likelihood score for verification as deduced from extensive verification experiments [1], and information regarding restriction sites that have been affected, facilitating the design of RFLP based assays. To this end CASCAD is tightly linked to primer design [2] and local genome assembly [3] interfaces, enabling a fast, reliable, and universal primer design for a chosen SNP candidate even when there is no assembled genome sequence available. To facilitate the retrieval of candidate SNPs with higher verification rates, represented by more reliable and common SNP candidates, we implemented the possibility to restrict searches to entries characterized by location at hypervariable CpG site, by minimal basecalling quality (Phred score), sequence match size, and/or minimal number of supporting reads for either allele.

Figure 1
figure 1

CASCAD advanced search form.

In addition to primary sequence data analysis, the effect of all SNPs on protein coding capacity was evaluated and non-synonymous SNPs were categorized in classes reflecting the severity of the polymorphism using a BLOSUM-based score. The predicted missense SNPs were analyzed by SIFT [4] and Polyphen [5] programs that utilize not only substitution information but also phylogenetic conservation and structural protein information to predict a potential effect of the polymorphism on protein function.

Query results are summarized on the SNP details page (Figure 2), listing the SNP characteristics and including active links to other databases and resources, such as dbSNP, Ensembl, UniGene, and LocusLink. More detailed information regarding raw data underlying the candidate SNP (links to the original sequence files and a full nucleotide and protein alignment) can be obtained by clicking on the observed nearly exact hits between nucleotide sequences. Moreover, statistics on the data that support the SNP (number of occurrences for every nucleotide at SNP position, range of Phred basecalling quality scores) are provided.

Figure 2
figure 2

SNP details page

Utility and discussion

For many applications, it is important to be able to distinguish between SNP candidates by their characteristics, as they may be predictive for verification success rate or carry biologically relevant information. Non-confirmed candidate polymorphisms may represent variants uncommon for a given population, but also sequencing errors (all types of sequences), RNA editing events and reverse transcriptase errors (EST reads). In order to minimize the contribution of false positives, one can exclude polymorphisms based on a single read for either allele, as is common for many in silico discovery pipelines [6]. Although this is a valid approach when selecting SNPs for population or association genetics, one could inadvertently discard many rare variants that may be associated with phenotypic variation, for example by affecting protein structure or function. Information on such polymorphisms can be very useful when mapping disease or QTL alleles.

We have developed our database to fulfill the needs of any particular SNP application by providing control over every parameter we used in the polymorphism discovery step.

Applications of the CASCAD database include queries for potentially deleterious SNPs in a specific genomic region of interest, for example a QTL interval, design of SNP-based mapping panels using either RFLP or any other technology, and identification of informative SNPs for fine-mapping. Custom sequences can be provided to search for known SNPs in any sequence of interest. In addition, the CASCAD pipeline [1] can be used to build a candidate SNP database for any model organism of interest for which sufficient sequencing data is available.

Conclusions

The main purpose of CASCAD database is to provide flexible access to candidate single nucleotide polymorphisms, which were predicted using a computational approach from publicly available sequence data of the rat and zebrafish. The resulting database is crosslinked to most common public databases and can be queried for SNPs using accession numbers, sequence context, SNP characteristics, but also using parameters specific to the SNP discovery process, allowing stringent or relaxed conditions suitable for different types of applications.

Availability and requirements

The database is freely accessible through the website http://cascad.niob.knaw.nl. Programs, scripts, MySQL database dumps, and instructions for setting up a species-specific SNP database can be obtained from the authors upon request.