Background

Single nucleotide polymorphisms (SNPs) are the most abundant variants in many genomes and are important in many fields of genomics. Many high-throughput SNP genotyping methods have been developed, including SNP microarrays [1], MALDI-TOF (Matrix Assisted Laser Desorption/Ionization-Time of Flight) [2], TaqMan probes [3], PCR resequencing [4], and others [5–7]; however, most of them require expensive machinery and are unsuitable for the kind of small-scale genotyping that is routinely undertaken in most regular laboratories. TaqMan real-time PCR, for instance, has been applied to SNP genotyping in many association studies [3, 8, 9], but some SNPs cannot be genotyped using TaqMan probes [10–12]. As an alternative, PCR-restriction fragment length polymorphism (RFLP) analysis provides cost-effective SNP genotyping and mutation detection [12–14]. However, mining a genome for SNPs that are amenable to RFLP analysis is a complex task when performed manually.

Many SNP- or restriction enzyme-related software tools, such as SNPicker [15], NEBcutter [16], SRP Opt [17], PIRA-PCR Designer [18], SNP cutter [19], SNPselector [20], SNP2CAPS [21], and software for "Restriction of DNA sequences" [22] have been reviewed and compared to SNP-RFLPing (ver. 1) [23]. Other more recent SNP-related software tools have also been developed, such as SNP500Cancer [24], which provides gene-centric SNP retrieval with TaqMan probe information but only for the SNPs associated with human cancer-related genes. Another is Seq4SNPs [25], which provides retrieval of multiple accurately annotated DNA sequences that have already been formatted for SNP assay design. SNP-Flankplus [26] provides SNP ID-centric retrieval functions for SNP flanking sequences. SNPLogic [27] provides comprehensive SNP selection, annotation and a prioritization system for genotyping. SNP ID-info [28] as well as a visual SNP ID platform for multiple inputs can be used to improve systematic SNP association studies. However, none of these software platforms provide RFLP information that can be used as part of SNP genotyping.

In 2006, we developed SNP-RFLPing (ver. 1) [23], an efficient and informative PCR-RFLP mining tool for SNPs, which provided user-friendly multiple sequence input formats, gene-centric RFLP assays for SNPs, and detailed output of the SNP information. To the best of our knowledge, SNP-RFLPing (ver. 1) was the first software to link the gene name to its SNP-RFLP restriction enzyme information. In the past three years, the numbers of known SNPs [29] and their RFLP restriction enzymes [30] have accumulated rapidly for the genomes of many species. Many new resources, such as tagSNP [31], miRNA [32], and gene ontology (GO) [33], are related to the study of SNPs but provide relatively little of the information needed for SNP genotyping. This paper presents SNP-RFLPing 2, an update of SNP-RFLPing (ver. 1) that introduces and integrates a PCR-RFLP database for SNP genotyping.

Implementation

Web interface

The flow chart for the eight input functions, namely: (1) SNP IDs, (2) SNP fasta sequences, (3) multiple SNPs, (4) accession key, (5) tag SNP, (6) transcript ID/miRNA, (7) CGAP GO, and (8) file upload, is shown in Figure 1. Their corresponding processes are performed as indicated by arrow lines. For inputs of SNP IDs, accession key and tag SNP, the "Data Retrieve Module" is designed to retrieve data from remote databases, such as NCBI nucleotide, NCBI dbSNP, and HapMap, into the "Remote Database Module". For inputs of SNP fasta sequences, multiple SNPs, and file upload, the "Sequence Process Module" is designed to filter non-nucleotide symbols, including blanks, commas, and brackets. For inputs of transcript ID/miRNA and CGAP GO (gene ontology in the Cancer Genome Anatomy Project), the "Data Query Module" is designed to query available information from the "SNP-RFLP database Module". After the "Remote Database Module", "Sequence Process Module", or "Data Query Module" processes are performed, the remote data (NCBI nucleotide, NCBI dbSNP, and HapMap), the filtered sequence data (without non-nucleotide symbols), and query data from SNP-RFLP database (REBASE, CGAP GO, miRSNP, SNP-RFLP, and SNP500Cancer Assays_SNPs) are fed into the "SNP-RFLP Module", which then mines the data for the available restriction enzyme sites as appropriate. Subsequently, PCR-RFLP primers can be designed using the "Primer Design Module". Finally, the results for the available restriction enzymes and PCR-RFLP primers are output by the "Output Module". Detailed information on the databases used by SNP-RFLPing is outlined below.
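The site-mining step performed by the "SNP-RFLP Module" can be illustrated with a minimal sketch. It is not the SNP-RFLPing 2 source code: the class and method names are hypothetical, the flanking sequences are invented, only the sense strand is scanned (which is sufficient for the palindromic sites used here), and the enzyme list is a tiny hand-picked sample rather than REBASE. An enzyme is reported as informative when the number of recognition sites differs between alleles.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative sketch only; this is not the SNP-RFLPing 2 source code. */
final class RflpMiningSketch {

    /** True if the IUPAC code in a recognition sequence permits the base (A, C, G or T). */
    static boolean matches(char code, char base) {
        switch (code) {
            case 'N': return true;
            case 'R': return base == 'A' || base == 'G';
            case 'Y': return base == 'C' || base == 'T';
            case 'W': return base == 'A' || base == 'T';
            case 'S': return base == 'C' || base == 'G';
            case 'M': return base == 'A' || base == 'C';
            case 'K': return base == 'G' || base == 'T';
            case 'B': return base != 'A';
            case 'D': return base != 'C';
            case 'H': return base != 'G';
            case 'V': return base != 'T';
            default:  return code == base;
        }
    }

    /** Counts (possibly overlapping) occurrences of a recognition site on the sense strand. */
    static int countSites(String seq, String site) {
        int count = 0;
        for (int i = 0; i + site.length() <= seq.length(); i++) {
            boolean hit = true;
            for (int j = 0; j < site.length() && hit; j++) {
                hit = matches(site.charAt(j), seq.charAt(i + j));
            }
            if (hit) count++;
        }
        return count;
    }

    /**
     * Maps each informative enzyme to the alleles that carry an extra site.
     * Enzymes with the same site count for every allele are skipped.
     */
    static Map<String, String> informativeEnzymes(String left, String alleles, String right,
                                                  Map<String, String> enzymes) {
        Map<String, String> result = new LinkedHashMap<String, String>();
        for (Map.Entry<String, String> enzyme : enzymes.entrySet()) {
            int[] counts = new int[alleles.length()];
            int min = Integer.MAX_VALUE;
            for (int i = 0; i < alleles.length(); i++) {
                counts[i] = countSites(left + alleles.charAt(i) + right, enzyme.getValue());
                min = Math.min(min, counts[i]);
            }
            StringBuilder cut = new StringBuilder();
            for (int i = 0; i < alleles.length(); i++) {
                if (counts[i] > min) cut.append(alleles.charAt(i));
            }
            if (cut.length() > 0) result.put(enzyme.getKey(), cut.toString());
        }
        return result;
    }

    /** A small sample of well-known palindromic recognition sequences (not the REBASE list). */
    static Map<String, String> demoEnzymes() {
        Map<String, String> enzymes = new LinkedHashMap<String, String>();
        enzymes.put("EcoRI", "GAATTC");
        enzymes.put("BamHI", "GGATCC");
        enzymes.put("HinfI", "GANTC");
        enzymes.put("TaqI", "TCGA");
        return enzymes;
    }

    public static void main(String[] args) {
        // Invented example: 5' flank, alleles A/G, 3' flank.
        System.out.println(informativeEnzymes("ACGTG", "AG", "ATTCAG", demoEnzymes()));
    }
}
```

For this invented A/G SNP, EcoRI recognizes only the A allele and HinfI only the G allele, so either enzyme could in principle distinguish the two genotypes; BamHI and TaqI find no site in either allele and are not reported.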

Figure 1. System structure and flowchart for SNP-RFLPing 2.

Database description

All the PCR-RFLP functions in SNP-RFLPing 2 provide both the RFLP restriction enzymes and the primer set for PCR-RFLP, whereas SNP-RFLPing (ver. 1) provided only the RFLP restriction enzyme information. The primer design function is partly based on Prim-SNPing [34]; that is, both regular and mutagenic (degenerate) primer designs for PCR-RFLP are provided. Moreover, the prices of the candidate RFLP enzymes are a newly added feature, and RFLP enzymes are now classified into IUPAC and non-IUPAC types (the latter having recognition sequences that contain only the nucleotides A, T, C, and G). Some SNP IDs that have pre-designed TaqMan probes from SNP500Cancer [24] and from dbSNP in NCBI (ABI) are also integrated into SNP-RFLPing 2.
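The IUPAC/non-IUPAC distinction follows directly from the definition given above: a recognition sequence made up only of A, T, C, and G is non-IUPAC, and any sequence containing an ambiguity code is IUPAC. A minimal sketch, with hypothetical class and method names, might look as follows.

```java
/** Illustrative sketch only; names are hypothetical. */
final class EnzymeTypeSketch {

    /** True if the recognition sequence contains only the unambiguous nucleotides A, C, G and T. */
    static boolean isNonIupac(String recognitionSequence) {
        return recognitionSequence.toUpperCase().matches("[ACGT]+");
    }

    /** Returns the label used in the text for the enzyme type. */
    static String classify(String recognitionSequence) {
        return isNonIupac(recognitionSequence) ? "non-IUPAC" : "IUPAC";
    }

    public static void main(String[] args) {
        System.out.println("GAATTC -> " + classify("GAATTC")); // EcoRI site
        System.out.println("GANTC  -> " + classify("GANTC"));  // HinfI site, contains N
    }
}
```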

The flanking sequences for SNP ID (rs# and ss#) inputs are extendable for all available species. They are retrieved on-line from SNP-Flankplus [26], whose retrieval source code has been modified to match the current format of dbSNP [29] build 129 from NCBI. The International HapMap Project provides the tagSNPs in the human genome for several HapMap populations such as YRI (Yoruba in Ibadan, Nigeria), JPT (Japanese in Tokyo, Japan), CHB (Han Chinese in Beijing, China), and CEU (CEPH; Utah residents with ancestry from northern and western Europe) [31]. The current on-line linked tagSNP database is HapMap Data Rel 23a/phaseII Mar08, on the NCBI B36 assembly, dbSNP b126 [31]. The miRNA SNPs are downloaded from the Polymorphism in the microRNA Target Site (PolymiRTS) database [32], which includes naturally occurring DNA variations in putative microRNA target sites.

Availability and update frequency

The SNP-RFLPing 2 web site and its user manual are freely accessible at http://bio.kuas.edu.tw/snp-rflping2 (or Additional file 1) and http://bio.kuas.edu.tw/snp-rflping2/userManual.jsp, respectively. The RFLP restriction enzymes and the corresponding primers for all SNPs are analyzed simultaneously on-line. REBASE version 906 [30], the SNPs for miRNA (downloaded from the PolymiRTS database) [32], and the GO Browser (CGAP GO) are updated annually and are built into a local database (the SNP-RFLP database module). All other databases, such as dbSNP, HapMap, and GenBank, are constantly updated and are retrieved on-line by an automated procedure. To improve the speed of access, our application encodes the user request information and sends it to the appropriate remote application server. The process is almost the same as when these remote applications are run on their own web sites. During data processing, the system retrieves the query or analysis result page from the remote application servers by HTTP transmission. This adds a slight time lag, usually less than one second, because the application needs to parse the result page and extract the useful information before transforming it into the appropriate format for further processing. All the aforementioned steps are cached and performed on the server side. The retrieval formats for these on-line databases will be checked monthly in order to maintain the correct on-line extraction formats.
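As an illustration of the request-encoding step, the sketch below builds a URL-encoded query string from user parameters. The endpoint (https://example.org/query) and the parameter names are placeholders invented for this example; they are not the actual interfaces of dbSNP, HapMap, or GenBank, and the class is not taken from SNP-RFLPing 2.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative sketch only; endpoint and parameter names are hypothetical. */
final class RemoteQuerySketch {

    /** Appends the parameters to the base URL as a percent-encoded query string. */
    static String buildUrl(String base, Map<String, String> params) {
        StringBuilder url = new StringBuilder(base);
        char separator = '?';
        for (Map.Entry<String, String> p : params.entrySet()) {
            url.append(separator)
               .append(encode(p.getKey()))
               .append('=')
               .append(encode(p.getValue()));
            separator = '&';
        }
        return url.toString();
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be supported by every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<String, String>();
        params.put("id", "rs12947788");
        params.put("population", "CEU JPT");
        System.out.println(buildUrl("https://example.org/query", params));
    }
}
```

The resulting URL could then be fetched over HTTP and the returned page parsed on the server side, as described above; the fetching and parsing are omitted here because they depend on each remote site's page format.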

Results and discussion

Summary of the database updates

In parallel to the rapid expansion of the use of SNPs in the last few years, SNP-RFLPing 2 demonstrates significant advances over SNP-RFLPing (ver. 1). These include: 1) rewritten source codes to improve the functionality, efficiency and stability of SNP-RFLP analysis; 2) SNPs for sixteen different species can be retrieved on-line; 3) all kinds of SNPs are acceptable including di-, tri-, tetra-allelic and indel formats; 4) the functional class (function codes for reference SNP clusters or refSNPs in gene features used for options to limit retrieval) for dbSNP in NCBI has been added (details on functional class are discussed later); 5) a multiple SNPs-containing sequence for PCR-RFLP mining is acceptable; 6) HapMap tagSNPs for multiple inputs can be retrieved on-line; 7) gene ontology (GO)-based RFLP enzyme mining is provided; 8) miRNA SNPs for the human and mouse genomes are included; 9) regular and degenerate primer designs for PCR-RFLP are provided; 10) REBASE databases are updated and the prices of RFLP restriction enzymes are available; 11) RFLP enzymes are classified into IUPAC and non-IUPAC types; and 12) TaqMan probes for SNP genotyping from SNP500Cancer in CGAP and from dbSNP in NCBI (ABI) are supplied if available.

I. Original functions in SNP-RFLPing (ver. 1) and their improvements

SNP ID input, SNP in fasta sequence input, and file upload

SNP ID (rs# and ss#) and SNP in fasta sequence formats are acceptable to query the SNP-RFLP information (Figure 2A and Figure 2B, respectively). Two types of files are acceptable for file upload, namely SNP IDs (rs#, ss#, or mixture of rs#/ss#) and SNP fasta sequences (multiple sequences with SNPs in [dNTP1/dNTP2] or IUPAC formats) (Figure 2H). These functions were also originally included in SNP-RFLPing (ver. 1).
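The two SNP notations accepted in fasta input, [dNTP1/dNTP2] and IUPAC codes, can be reduced to the same list of alleles. The following sketch shows one way to do this; the class and method names are hypothetical, it handles a single fasta record, and it returns only the first SNP found. The character filtering mirrors the removal of blanks, commas, and other non-nucleotide symbols mentioned for the "Sequence Process Module", but it is an assumption about how such filtering could be done rather than a description of the actual module.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch only; names are hypothetical. */
final class SnpFastaSketch {

    // IUPAC ambiguity codes and the nucleotides each one stands for (same order).
    private static final String CODES = "RYSWKMBDHVN";
    private static final String[] EXPANSIONS =
        {"AG", "CT", "CG", "AT", "GT", "AC", "CGT", "AGT", "ACT", "ACG", "ACGT"};

    // Bracket notation such as [A/G], [A/C/T] or [-/T] (indel).
    private static final Pattern BRACKET =
        Pattern.compile("\\[([ACGT-]+(?:/[ACGT-]+)+)\\]");

    /** Joins the sequence lines of one fasta record and drops characters that are not sequence or SNP notation. */
    static String sequenceOf(String fasta) {
        StringBuilder seq = new StringBuilder();
        for (String line : fasta.split("\\r?\\n")) {
            if (line.startsWith(">")) continue; // header line
            seq.append(line.toUpperCase().replaceAll("[^ACGTRYSWKMBDHVN\\[\\]/-]", ""));
        }
        return seq.toString();
    }

    /** Returns the alleles of the first SNP in the sequence, or an empty list if none is found. */
    static List<String> alleles(String seq) {
        Matcher m = BRACKET.matcher(seq);
        if (m.find()) {
            return Arrays.asList(m.group(1).split("/"));
        }
        for (int i = 0; i < seq.length(); i++) {
            int k = CODES.indexOf(seq.charAt(i));
            if (k >= 0) {
                List<String> out = new ArrayList<String>();
                for (char c : EXPANSIONS[k].toCharArray()) out.add(String.valueOf(c));
                return out;
            }
        }
        return new ArrayList<String>();
    }

    public static void main(String[] args) {
        String seq = sequenceOf(">demo\nacgt g[A/G]at, tc 12\nag");
        System.out.println(seq + " " + alleles(seq));
        System.out.println(alleles("ACGTGRATTCAG"));
    }
}
```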

Figure 2. Overview of the SNP-RFLPing 2 web interface. (A) SNP ID input. (B) SNP in fasta sequence input. (C) Multiple SNPs within one sequence input. (D) GenBank accession input. (E) TagSNP from HapMap input. (F) Transcript ID/miRNA input. (G) Gene Ontology-based annotation for SNPs input. (H) File upload input.

In SNP-RFLPing 2, we have added RFLP analysis and primer design for tri-allelic, tetra-allelic, and indel (insertion and deletion) SNPs, for example rs2243244, rs13631133, and rs68134313, respectively. Furthermore, SNP ID information for all available species is retrieved on-line in SNP-RFLPing 2 rather than from the local database that was built into SNP-RFLPing (ver. 1); that is, sixteen genomes (current data) vs. three genomes (human, mouse, and rat), respectively.

GenBank Accession

The inputs for the GenBank accession no. [35], such as reference SNP ID (rs#), submitter SNP ID (ss#), HUGO gene name, and local link ID (gene ID), which can be used to retrieve the SNP sequence information for RFLP analysis, are the same as the original functions in SNP-RFLPing (ver. 1).

In SNP-RFLPing 2 (Figure 2D), additional input formats for accession version and local SNP ID have been added. The dbSNP classifications in NCBI for functional class (coding nonsynonymous, reference, intron, coding synonymous, locus region, mRNA UTR, and splice site), SNP class (heterozygous, indel, mixed, multinucleotide polymorphism, named locus, no variation, and snp), and heterozygosity are selectable in the GenBank accession input. Furthermore, all of the information contained in GenBank can be retrieved on-line for all available species, and this is integrated into SNP-RFLPing 2.

II. Added improvements to SNP-RFLPing 2

Multiple SNPs within one sequence

Up to 50 SNPs represented in the [dNTP1/dNTP2] or IUPAC formats within an input sequence are acceptable for analysis in SNP-RFLPing 2 (Figure 2C). The flanking sequences for two nearby SNPs should not overlap within a range of 6 nucleotides. A pre-aligned reference sequence, such as that generated by the multiple sequence alignment function in Seq-SNPing [36], may also be input into SNP-RFLPing 2 for RFLP mining of multiple SNPs.
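The input constraints for multiple SNPs can be checked before any RFLP mining is attempted. The sketch below locates every SNP in a sequence and validates the count and spacing. It rests on two assumptions that are not stated in the text: each SNP is counted as a single position in the bracket-free sequence, and the 6-nucleotide rule is read as requiring at least 6 nucleotides between adjacent SNPs. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch only; names and the reading of the spacing rule are assumptions. */
final class MultiSnpSketch {

    static final int MAX_SNPS = 50; // limit stated in the text
    static final int MIN_GAP = 6;   // assumed reading of the 6-nucleotide rule

    // Either bracket notation such as [A/G] or a single IUPAC ambiguity code.
    private static final Pattern SNP =
        Pattern.compile("\\[[ACGT-]+(?:/[ACGT-]+)+\\]|[RYSWKMBDHVN]");

    /** 0-based SNP positions, counting each SNP as one position in the bracket-free sequence. */
    static List<Integer> snpPositions(String seq) {
        List<Integer> positions = new ArrayList<Integer>();
        Matcher m = SNP.matcher(seq);
        int extra = 0; // notation characters beyond the single SNP position seen so far
        while (m.find()) {
            positions.add(m.start() - extra);
            extra += m.group().length() - 1;
        }
        return positions;
    }

    /** True if the sequence has 1 to MAX_SNPS SNPs and adjacent SNPs are at least MIN_GAP nucleotides apart. */
    static boolean isAcceptable(String seq) {
        List<Integer> p = snpPositions(seq);
        if (p.isEmpty() || p.size() > MAX_SNPS) return false;
        for (int i = 1; i < p.size(); i++) {
            int between = p.get(i) - p.get(i - 1) - 1;
            if (between < MIN_GAP) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String ok = "ACGT[A/G]CCGGTTAARGT";
        String tooClose = "ACGT[A/G]CCGYTT";
        System.out.println(snpPositions(ok) + " " + isAcceptable(ok));
        System.out.println(snpPositions(tooClose) + " " + isAcceptable(tooClose));
    }
}
```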

TagSNP from HapMap

To reduce the number of SNPs that need to be genotyped, a subset of the SNPs in a region (tagSNPs) can be chosen to represent most of the remaining SNP variants [37]. As shown in Figure 2E, the HapMap database version, population, pairwise method (tagger pairwise or tagger multimarkers), R square cutoff, and MAF (minor allele frequency) cutoff can all be user-adjusted in SNP-RFLPing 2. The position within the chromosome, accession number, gene name, cytoband position, and ENCODE (ENCyclopedia Of DNA Elements) [38] region can also be queried. The tagSNP information from HapMap is retrieved on-line, and the mining function of RFLP restriction enzymes for tagSNPs is implemented in SNP-RFLPing 2.
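For readers unfamiliar with the two cutoffs, the sketch below computes the minor allele frequency of a biallelic SNP and the pairwise r² between two biallelic SNPs from phased haplotypes, using the standard definitions D = p(AB) - p(A)p(B) and r² = D² / [p(A)(1 - p(A)) p(B)(1 - p(B))]. It does not reproduce the HapMap Tagger selection algorithm; the class name, method names, and toy haplotypes are invented for illustration.

```java
/** Illustrative sketch only; standard population-genetics formulas, not the Tagger algorithm. */
final class TagSnpStatsSketch {

    /** Minor allele frequency of a biallelic SNP, given the count of one allele among all chromosomes. */
    static double maf(int alleleCount, int chromosomes) {
        double p = (double) alleleCount / chromosomes;
        return Math.min(p, 1.0 - p);
    }

    /**
     * Pairwise r^2 from phased two-SNP haplotypes such as "AG".
     * refA and refB name the allele that is counted at the first and second SNP.
     */
    static double rSquared(String[] haplotypes, char refA, char refB) {
        double n = haplotypes.length;
        double a = 0, b = 0, ab = 0;
        for (String h : haplotypes) {
            boolean hasA = h.charAt(0) == refA;
            boolean hasB = h.charAt(1) == refB;
            if (hasA) a++;
            if (hasB) b++;
            if (hasA && hasB) ab++;
        }
        double pA = a / n, pB = b / n;
        double d = ab / n - pA * pB;                    // linkage disequilibrium coefficient D
        double denom = pA * (1 - pA) * pB * (1 - pB);
        return denom == 0 ? 0.0 : d * d / denom;        // monomorphic SNPs carry no LD information
    }

    public static void main(String[] args) {
        System.out.println(maf(90, 120));
        System.out.println(rSquared(new String[] {"AG", "AG", "CT", "CT"}, 'A', 'G'));
        System.out.println(rSquared(new String[] {"AG", "AT", "CG", "CT"}, 'A', 'G'));
    }
}
```

In the first toy haplotype set the two SNPs are perfectly correlated, so one could serve as a tag for the other at any r² cutoff; in the second set they are independent, so neither can stand in for the other.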

Transcript ID/miRNA

MicroRNAs (miRNAs) are a group of small RNAs that bind to the RNA transcripts of protein-coding genes and thereby repress translation or decrease mRNA stability [39, 40]. Dysfunction of miRNAs influences cell biology and cancer progression [41]. Polymorphisms in miRNA pathways may affect gene expression, which may lead to a change in complex phenotypes, and such polymorphisms have the potential to be disease markers for personalized medicine [42]. Transcript IDs and miRNA numbers from the human and mouse datasets are acceptable for PCR-RFLP analysis (Figure 2F). The RFLP enzyme mining and primer design for transcript IDs and miRNAs are newly developed and have been integrated into SNP-RFLPing 2.

Gene Ontology-based annotation for SNPs

The GeneOntology Browser (GO Browser; http://cgap.nci.nih.gov/Genes/GOBrowser), which provides annotations for human and mouse genes based on molecular function, biological process, and cellular component, has been integrated into SNP-RFLPing 2. GO IDs and vocabulary terms may be input to find specific genes with an interesting function as well as their corresponding SNPs (Figure 2G).

III. Common output examples for SNP-RFLPing 2

Two types of SNP genotyping information, namely primers (natural and mutagenic) with restriction enzymes for PCR-RFLP and TaqMan probes for real-time PCR (Figure 3), are provided for all the inputs in SNP-RFLPing 2 (Figure 2). Except for the restriction enzymes, these functions are novel improvements found only in SNP-RFLPing 2. TaqMan probes are shown as available and unavailable in the examples given for rs12947788 and rs6503048, which are shown in Figures 3A1 and 3B1, respectively. The TaqMan probes are retrieved on-line from SNP500Cancer [24] and dbSNP in NCBI. For PCR-RFLP primers, the natural and mutagenic primers that are available for rs12947788 and rs6503048 are also shown (Figures 3A2 and 3B2, respectively). For the mutagenic primer, the mutagenic (artificial) nucleotide is marked in red in the forward (F) primer sequence. Currently, TaqMan probes are not always available as public resources, and for some SNPs no suitable TaqMan probe exists. In such cases, PCR-RFLP using natural or mutagenic primers coupled with their corresponding restriction enzymes can solve this problem. Moreover, the PCR-RFLP method is more cost-effective than TaqMan probes using real-time PCR.
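The idea behind a mutagenic primer is that a single artificial mismatch placed a few bases upstream of the SNP can create a restriction site in one allele only. The sketch below searches for such a substitution by brute force. It is a simplified illustration under stated assumptions, not the primer design algorithm of SNP-RFLPing 2 or Prim-SNPing: the forward primer is assumed to end immediately 5' of the SNP, only exact (non-IUPAC) recognition sequences are matched, primer quality is ignored, and all names and sequences are invented.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative sketch only; a brute-force search, not the published primer design method. */
final class MutagenicPrimerSketch {

    /** Counts (possibly overlapping) exact occurrences of a site in a sequence. */
    static int count(String seq, String site) {
        int n = 0;
        for (int i = seq.indexOf(site); i >= 0; i = seq.indexOf(site, i + 1)) n++;
        return n;
    }

    /** Alleles that carry more sites than the others; empty if the enzyme cannot tell the alleles apart. */
    static String cutAlleles(String left, String alleles, String right, String site) {
        int[] counts = new int[alleles.length()];
        int min = Integer.MAX_VALUE;
        for (int i = 0; i < alleles.length(); i++) {
            counts[i] = count(left + alleles.charAt(i) + right, site);
            min = Math.min(min, counts[i]);
        }
        StringBuilder cut = new StringBuilder();
        for (int i = 0; i < alleles.length(); i++) {
            if (counts[i] > min) cut.append(alleles.charAt(i));
        }
        return cut.toString();
    }

    /**
     * Tries single-base substitutions in the 5' flank, starting next to the SNP and moving
     * upstream up to maxDistance bases, and returns the first one that makes an enzyme informative.
     * Returns null if no substitution works.
     */
    static String suggest(String left, String alleles, String right,
                          Map<String, String> enzymes, int maxDistance) {
        for (int d = 1; d <= maxDistance && d <= left.length(); d++) {
            int pos = left.length() - d;
            for (char base : "ACGT".toCharArray()) {
                if (base == left.charAt(pos)) continue;
                String mutated = left.substring(0, pos) + base + left.substring(pos + 1);
                for (Map.Entry<String, String> e : enzymes.entrySet()) {
                    String cut = cutAlleles(mutated, alleles, right, e.getValue());
                    if (!cut.isEmpty()) {
                        return e.getKey() + " " + left.charAt(pos) + "->" + base
                             + " at -" + d + " cuts " + cut;
                    }
                }
            }
        }
        return null;
    }

    /** A small sample of well-known recognition sequences (not the REBASE list). */
    static Map<String, String> demoEnzymes() {
        Map<String, String> enzymes = new LinkedHashMap<String, String>();
        enzymes.put("EcoRI", "GAATTC");
        enzymes.put("BamHI", "GGATCC");
        enzymes.put("HindIII", "AAGCTT");
        return enzymes;
    }

    public static void main(String[] args) {
        System.out.println(suggest("ACCTGTA", "AG", "TTCAGG", demoEnzymes(), 3));
    }
}
```

In this invented example no sample enzyme distinguishes the natural alleles, but changing the T two bases upstream of the SNP to G creates an EcoRI site (GAATTC) in the A allele only.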

Figure 3. Representative outputs of SNP-RFLPing 2. (A) Natural primers for rs12947788. (B) Mutagenic primers for rs6503048. SNP genotyping information for (A1/B1) TaqMan probe, (A2/B2) natural primers, (A3/B3) restriction enzymes.

The primer sequences, position, length, GC no., GC%, Tm value, Tm-difference, and product length after PCR-RFLP genotyping are provided. The restriction enzyme lists for both the sense and antisense strands are shown in Figure 3B2 but omitted in Figure 3A2. The flanking sequences, suppliers, and NEB prices of the restriction enzymes for the target SNPs are provided as well (shown in Figures 3A2 and 3A3 but omitted in Figures 3B2 and 3B3).
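The primer properties listed above are simple functions of the primer sequence. The sketch below computes the GC number, GC%, an approximate Tm, and the Tm difference of a primer pair. The text does not state which Tm formula SNP-RFLPing 2 uses, so the Wallace rule, Tm = 2(A+T) + 4(G+C), is used here purely as an assumption for illustration; it is a rough estimate intended for short oligonucleotides. The class name, method names, and primer sequences are invented.

```java
/** Illustrative sketch only; the Tm formula (Wallace rule) is an assumption, not the tool's method. */
final class PrimerStatsSketch {

    /** Number of G and C bases in the primer. */
    static int gcCount(String primer) {
        int gc = 0;
        for (char c : primer.toUpperCase().toCharArray()) {
            if (c == 'G' || c == 'C') gc++;
        }
        return gc;
    }

    /** GC content as a percentage of primer length. */
    static double gcPercent(String primer) {
        return 100.0 * gcCount(primer) / primer.length();
    }

    /** Wallace rule: 4 degrees C per G/C plus 2 degrees C per A/T (assumes only A, C, G, T). */
    static int wallaceTm(String primer) {
        int gc = gcCount(primer);
        int at = primer.length() - gc;
        return 4 * gc + 2 * at;
    }

    /** Absolute Tm difference between the forward and reverse primers. */
    static int tmDifference(String forward, String reverse) {
        return Math.abs(wallaceTm(forward) - wallaceTm(reverse));
    }

    public static void main(String[] args) {
        String forward = "ACGTGAATTCAGGCTAGCTA"; // invented primer sequences
        String reverse = "GGCCTTAAGGCC";
        System.out.println("GC no. = " + gcCount(forward));
        System.out.println("GC%    = " + gcPercent(forward));
        System.out.println("Tm     = " + wallaceTm(forward));
        System.out.println("Tm diff = " + tmDifference(forward, reverse));
    }
}
```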

Conclusions

SNP-RFLPing 2 has significant advantages over SNP-RFLPing (ver. 1), because the new features complement many of the new fields associated with modern SNP-related research and because the on-line retrieval systems avoid the need for updates from most databases. In this paper, we describe an updated web-based interface and a Java-based program, SNP-RFLPing 2, which provides comprehensive PCR-RFLP information, including RFLP enzymes and their appropriate primer sets, so that SNP genotyping can be carried out. SNP-RFLPing 2 can also be applied to many PCR-RFLP-based fields, such as the characterization of microorganisms [43, 44], food authentication [45, 46], and avian gender determination [47]. On-line example inputs are used to demonstrate each of the main functions of SNP-RFLPing 2 described in the user manual, which is downloadable from the homepage URL.

Availability and requirements

Project home page: http://bio.kuas.edu.tw/snp-rflping2/

Operating system(s): Operating systems with web browser.

Programming language: Java

Other requirements: Java 1.5.0 (or later)

License: none for academic users. For any restrictions regarding the use by non-academics please contact the corresponding author.