Background

With the ever-increasing volume of single nucleotide polymorphisms (SNPs) deposited in publicly available databases such as the National Center for Biotechnology Information (NCBI) dbSNP, laboratory geneticists routinely need to select an appropriate set of SNPs for both gene mapping and molecular evolution studies. The major bottleneck in the workflow for SNP-based studies has shifted away from SNP discovery toward SNP selection. Several web-based applications and stand-alone software packages are available for handling SNP data, including viewGene [1], Genotools SNP manager [2], SNPbox [3], SNPicker [4], and SNPper [5]. However, none of them is designed for selecting the best SNP set: they focus instead on primer design (e.g. SNPbox and SNPicker), SNP visualization (e.g. viewGene), specific platform applications such as MassARRAY technology (e.g. Genotools SNP manager), or SNP search (e.g. SNPper). Given the surging interest in haplotype inference [6, 7] and haplotype-based association studies, it is important to recognize that the power of a linkage disequilibrium (LD) study is determined not only by the number of SNPs used, but also by their quality. Contemporary geneticists aim to maximize the statistical power to detect a disease-susceptibility locus by selecting a "best set" of closely linked SNPs given a limited (and often fixed) genotyping budget [8].

When a large number of SNPs are available for a susceptibility gene of interest, genotyping all SNPs on all samples is an inefficient use of resources. Recently, a cost-effective two-stage method has been proposed to identify disease-susceptibility markers [9]. In stage I, a set of SNPs (S1), spaced at a predefined interval, is selected (e.g., evenly spaced every 3 to 5 Kb in and surrounding a candidate gene [10]). The genotypes of markers in S1 are then used to define LD blocks and to reconstruct haplotypes within blocks across the candidate gene locus in a representative random sample, C1, of the original source population (e.g., a multiethnic cohort of men and women [11]). In stage II, a representative set of haplotype-tagging SNPs (htSNPs), S2 (S2 ⊂ S1), is selected on the basis of the LD characterization in the random sample C1. S2 is then genotyped in a much larger case-control set C2 (C1 ⊂ C2), and haplotype-based association tests are performed in C2, which is nested in the original source population.

Both stages are critical to the success of an association study. However, selecting S1 (i.e. a set of evenly spaced SNPs in and surrounding a candidate gene) is not a trivial task, because the number of available SNPs for each human gene varies dramatically (from <10 to >200) with gene size and SNP density. Furthermore, S1 should not be restricted to common SNPs [i.e., minor allele frequency (MAF) ≥ 5%]; missense and regulatory SNPs should still be considered for inclusion in S1 even if their MAFs fall below 5% [11]. Hand-picking S1 by "eyeballing" is extremely labor-intensive, time-consuming, and error-prone for candidate genomic regions with hundreds or thousands of SNPs. In addition, obtaining a SNP flanking sequence that is long enough (~200 bp) and annotated with nearby potential SNPs is essential for the successful design of a SNP genotyping assay. Unfortunately, the flanking sequences of many SNPs recorded in dbSNP are short (<100 bp) and lack any annotation of nearby SNPs.

NCBI's dbSNP offers a comprehensive SNP searching tool [12]. However, tools are still needed to easily and efficiently locate the desired SNPs, to evaluate their annotations, and to export them in suitable formats for downstream analyses. To meet these needs, we have developed SNPHunter, a program with a friendly graphical user interface (GUI) that works as a portal between the user and NCBI dbSNP [13]. The program can extract and export SNP data retrieved from dbSNP, import saved SNP data, and perform flexible SNP selection with a graphic illustration of SNP position, function, and heterozygosity. Furthermore, it retrieves SNP flanking sequences of any user-specified length, with annotation of all nearby SNPs.

Architectural structure

SNPHunter was written in Microsoft Visual Basic .NET. A schematic diagram of its architectural framework is shown in Figure 1. The tool relies on an HTTP parser that delegates the user's query to NCBI databases including dbSNP [13], MapViewer, LocusLink [14], and AceView (Figure 1, left), and parses the retrieved data. It consists of three modular components: SNP Search, SNP Management, and LocusLink SNP. In the SNP Search module, the user inputs the gene symbol of interest and chooses SNPs based on heterozygosity (HET), chromosomal position, and functional class (Figure 1, upper right). The user can also specify whether sequences upstream and downstream of the gene should be included in the search. In the SNP Management module, the user fetches and manages detailed information for SNPs retrieved in the SNP Search module or taken from the user's own SNP list. In the LocusLink SNP module, SNPHunter reads in a list of LocusLink gene IDs (i.e. Entrez Gene IDs) and performs a batch-mode SNP search via LocusLink (although NCBI LocusLink has been superseded by NCBI Entrez Gene, this SNP search mode is still fully functional). This module is very useful for obtaining SNP data for a large number of genes. SNPHunter creates a SNP summary and opens a new "Filter SNP" panel (Figure 1, lower right). SNP filtering can be performed on all or selected genes according to user-specified filtering criteria. One advantage of SNPHunter's local filter is that it does not rely on any Web server to perform the filtering. Once the user has downloaded all the SNP information and exported it to a local file, SNPHunter can perform the filtering automatically, and the user can then modify the selection manually. The selected SNP list can be exported to local directories for storage or further analyses. The batch-mode search is fast: in the example shown in Figure 1, all the SNPs on the six genes were retrieved and downloaded in 10 sec, and automatic filtering on a regular personal computer with one Intel Pentium 4 2.8 GHz processor took another 7 sec.
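The local filtering idea can be illustrated with a minimal Python sketch. It is not SNPHunter's code (which is written in Visual Basic .NET), and the tab-delimited layout with columns `rs`, `pos`, `het`, and `func` is a hypothetical stand-in for an exported SNP file. The sketch keeps a SNP if its functional class is on a must-keep list or its HET reaches a user-chosen minimum, then reports the gaps between the retained SNPs.

```python
import csv
import io

def filter_snps(tsv_text, min_het, keep_classes):
    """Filter SNP records held in tab-delimited text (hypothetical column names).

    A record is kept if its functional class is in keep_classes
    or its heterozygosity is at least min_het.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    kept = [row for row in reader
            if row["func"] in keep_classes or float(row["het"]) >= min_het]
    # Sort by chromosomal position so that neighbouring SNPs are adjacent.
    kept.sort(key=lambda row: int(row["pos"]))
    return kept

def gaps(rows):
    """Distances (in bp) between consecutive retained SNPs."""
    positions = [int(row["pos"]) for row in rows]
    return [b - a for a, b in zip(positions, positions[1:])]
```

Because everything runs on text already saved to disk, a filter of this kind needs no network access once the SNP data have been downloaded.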

Figure 1

A depiction of the architecture of SNPHunter. SNPHunter allows the user to perform (1) an ad hoc search by gene symbol [an example for "Tumor Necrosis Factor" (Gene Symbol, TNF) is shown]; and (2) a batch-mode search. In this example, a total of 722 SNPs were found on 6 user-specified genes (with LocusLink IDs listed on the right). With automatic filtering, 405 SNPs were picked and the gaps between them were also calculated.

Implementation

A detailed description of the implementation of the three modules is given in the User's Manual [15]. In brief, retrieving the flanking sequences of a desired SNP relies on knowledge of its genomic coordinate. In ad hoc mode, SNPHunter therefore first pinpoints the SNP's genomic coordinate using its dbSNP reference SNP (refSNP) record, strand orientation, and corresponding contig number. It then communicates with the NCBI MapViewer database and retrieves the sequence centered on the desired SNP, with the sequence length specified by the user. Once the SNP's genomic coordinate and contig data are retrieved, SNPHunter also detects all neighboring SNPs by querying dbSNP for all available SNPs that lie within a user-defined radius around the SNP of interest. Once the starting and ending coordinates of a particular gene are determined through NCBI's AceView, the 5' upstream and 3' downstream regions of the gene can be retrieved according to user-defined lengths. In batch mode, SNPHunter communicates with NCBI's LocusLink to fetch the SNPs that reside within each LocusLink gene. Since LocusLink has a curated SNP list for each gene in its database, the batch-mode search offers a reliable, efficient way to conduct a systematic SNP search for a large set of candidate genes (e.g. genes belonging to the same biological pathway/network). Furthermore, SNP data can be stored in the user's local directories, and SNP filtering can be performed automatically according to user-defined criteria.
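The flanking-sequence step can be pictured with the following Python sketch. It is an illustration only: SNPHunter obtains sequences and SNP positions remotely from NCBI, whereas this sketch assumes the contig sequence and a position-to-identifier map are already in memory. The function name, its parameters, and the use of 0-based coordinates are all assumptions made for the example.

```python
from typing import Dict, Tuple

def flank_with_neighbors(contig_seq: str, snp_pos: int, radius: int,
                         known_snps: Dict[int, str]) -> Tuple[str, Dict[int, str]]:
    """Return the sequence window around a SNP and the other SNPs inside it.

    contig_seq : contig sequence (assumed to be available locally)
    snp_pos    : 0-based index of the SNP of interest within contig_seq
    radius     : number of flanking bases requested on each side
    known_snps : 0-based contig position -> SNP identifier
    """
    if not 0 <= snp_pos < len(contig_seq):
        raise ValueError("snp_pos lies outside the contig")
    # Clip the window at the contig ends instead of failing.
    start = max(0, snp_pos - radius)
    end = min(len(contig_seq), snp_pos + radius + 1)
    window = contig_seq[start:end]
    # Report neighbouring SNPs by their offset within the window.
    neighbors = {pos - start: rs for pos, rs in known_snps.items()
                 if start <= pos < end and pos != snp_pos}
    return window, neighbors
```

Reporting neighbors as window offsets makes it straightforward to mask or flag them when a genotyping assay is designed against the returned sequence.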

Application example

To demonstrate the SNP selection process from dbSNP using SNPHunter, we applied SNPHunter for S1 selection for 10 biological candidate genes (Table 1) for a type 2 diabetes mellitus (DM) case-control study. These 10 candidate genes were chosen on the basis of their biochemical and physiological functions.

Table 1 Size, location, and the estimated number of SNPs for each of the 10 candidate genes for type 2 diabetes mellitus.

We used the following four SNP selection criteria:

  (1) Genome coverage: SNPs should cover the gene region as well as its 30 Kb 5' upstream and 30 Kb 3' downstream regions (the gene sizes are shown in Table 1).

  (2) Functionality priority: coding SNPs (cSNPs; including both synonymous and nonsynonymous SNPs) and splice site SNPs (ssSNPs) must be kept; for SNPs located in the 5' upstream and 3' downstream regions, the function is defined according to existing in vivo/in vitro experimental data. The priority of SNP selection is nonsynonymous SNPs > synonymous SNPs > ssSNPs > 5' upstream SNPs > 3' downstream SNPs > intronic SNPs.

  (3) Priority based on HET: For cSNPs and ssSNPs, no HET threshold is set (HET can be calculated using the POLYMORPHISM software [16]); for intronic, 5' upstream, and 3' downstream SNPs, those with HET values at or above the threshold of 0.095 (which corresponds to MAF ≥ 5%) have higher priority.

  (4) SNP density: The SNPs should be relatively evenly distributed across the gene region (as well as the 30 Kb 5' upstream and 30 Kb 3' downstream regions) with a density of 5–50 SNPs/Kb depending on the gene size (see Table 1). Specifically, for gene sizes < 10 Kb, we use a density of 50 SNPs/Kb; for gene sizes 10–100 Kb, we use a density of 10 SNPs/Kb; for gene sizes > 100 Kb, we use a density of 5 SNPs/Kb.
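The HET threshold in criterion (3) follows from the expected heterozygosity of a biallelic SNP under Hardy-Weinberg proportions, HET = 2p(1 - p), which gives approximately 0.095 at MAF p = 0.05. The Python sketch below shows that conversion and one possible way of combining criteria (2) and (3) into a sort key. The class labels, the dictionary fields, and the tie-breaking rule (functional class first, then threshold status, then descending HET) are assumptions made for illustration and are not taken from SNPHunter.

```python
HET_THRESHOLD = 0.095  # threshold quoted in criterion (3)

# Functional classes in the priority order of criterion (2); labels are illustrative.
PRIORITY = ["nonsynonymous", "synonymous", "splice-site",
            "5' upstream", "3' downstream", "intronic"]
# cSNPs and ssSNPs are exempt from the HET threshold.
NO_THRESHOLD = {"nonsynonymous", "synonymous", "splice-site"}

def het_from_maf(maf):
    """Expected heterozygosity 2p(1-p) of a biallelic SNP (Hardy-Weinberg assumption)."""
    return 2.0 * maf * (1.0 - maf)

def priority_key(snp):
    """Sort key: functional class, then below-threshold flag, then descending HET."""
    below = snp["func"] not in NO_THRESHOLD and snp["het"] < HET_THRESHOLD
    return (PRIORITY.index(snp["func"]), below, -snp["het"])

def rank_snps(snps):
    """Return SNP records ordered from highest to lowest selection priority."""
    return sorted(snps, key=priority_key)
```

A ranking of this kind only orders the candidates; how many are finally kept depends on the density target in criterion (4).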

To date, there are no turn-key solutions that can select the best SNP set automatically. Our SNP selection procedure is an iterative process consisting of the following four major steps:

  (a) Retrieve all SNPs regardless of HET values according to SNP selection criterion (1).

  (b) Select all cSNPs and ssSNPs; in addition, select 5' upstream, 3' downstream, and intronic SNPs with HET ≥ 0.095, according to SNP selection criteria (2) and (3).

  (c) Enforce a relatively even SNP density according to SNP selection criterion (4). We implement this by setting a maximum inter-marker distance d: for a given set of selected SNPs S, if there exists a pair of neighboring SNPs (SNP_i, SNP_j) whose physical distance is > d, the program picks a random available SNP, say SNP_k, that lies between SNP_i and SNP_j, adds it to S, and repeats the procedure recursively on the resulting gaps. By mathematical induction, this process guarantees that S will eventually become a saturated set, S', at a resolution level of d. We then re-adjust the marker density by iteratively adding available SNPs in the priority order set by SNP selection criteria (2) and (3) until we reach the target number of SNPs with the desired density, according to SNP selection criterion (4).

  (d) Include any non-redundant SNPs from sources other than dbSNP, such as those identified by literature review.
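Step (c) can be sketched in Python as follows, reading it as a procedure that fills any gap between neighboring selected SNPs that exceeds the maximum inter-marker distance d. This is a simplified illustration, not SNPHunter's implementation. To keep the output reproducible, it picks the available SNP closest to the midpoint of the gap instead of a random one, and the function and parameter names are hypothetical.

```python
from bisect import insort

def fill_gaps(selected, candidates, d):
    """Add candidate SNP positions until no fillable gap between neighbours exceeds d.

    selected   : positions (bp) of SNPs already chosen
    candidates : positions (bp) of SNPs available but not yet chosen
    d          : maximum allowed distance between neighbouring selected SNPs
    """
    chosen = sorted(set(selected))
    pool = sorted(set(candidates) - set(chosen))
    changed = True
    while changed:
        changed = False
        for left, right in zip(chosen, chosen[1:]):
            if right - left <= d:
                continue  # this gap already satisfies the density target
            inside = [p for p in pool if left < p < right]
            if not inside:
                continue  # no available SNP can fill this gap
            mid = (left + right) / 2
            pick = min(inside, key=lambda p: abs(p - mid))
            pool.remove(pick)
            insort(chosen, pick)
            changed = True
            break  # neighbour pairs changed, so rescan from the start
    return chosen
```

The loop stops when every remaining gap wider than d contains no available SNP, so the result is saturated only as far as the candidate pool allows.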

Using these criteria and selection procedures, we selected a total of 670 SNPs for the 10 genes listed in Table 1.

Besides SNP selection, SNPHunter retrieves genomic coordinates and flanking sequences for specific SNPs and provides a graphic illustration of all the SNPs within the gene of interest. Figure 2 illustrates the 28 SNPs found in NCBI dbSNP in a 2.7 Kb region spanning the tumor necrosis factor (TNF) gene.

Figure 2

The Gene View panel, which gives a graphic illustration of all the SNPs within the TNF gene (LocusLink ID: 7124). The height of a SNP bar indicates the heterozygosity of that SNP, and the four dotted horizontal lines mark heterozygosity values of 0, 0.25, 0.5, and 0.75, respectively. Of the 28 available SNPs, two have been dropped and are plotted in gray. For the 26 selected SNPs, green means "coding: synonymy unknown" or "synonymous"; red means "non synonymous"; orange means all other classes. For convenience during SNP selection, a red triangle indicates the currently focused SNP, which here is rs3093663.

Discussion

The motivation for developing SNPHunter is to allow the efficient and accurate selection of S1 (see Background) because of its intrinsic value in LD studies, particularly in a case-control setting. A few Web resources, such as NCBI's Entrez, Ensembl's EnsMart [17], and SNPper [5], provide SNP database searching and SNP information downloading according to user-specified criteria. These tools, each with its own capabilities and focus, have benefited the work of geneticists. However, few of them are dedicated solely to SNP searching and the management of SNP data. Although SNPper [5] offers a very helpful function for filtering SNP sets, it is a locally stored SNP-centric database resource maintained by the Children's Hospital Informatics Program, Harvard Medical School, and requires regular data downloads from NCBI dbSNP. By contrast, SNPHunter is designed to work as a stand-alone application that retrieves the most up-to-date SNP and sequence data without the need for complicated local database support. Thus, the user is relieved from maintaining a local database and updating the data frequently. The ability to export and save every dataset locally in plain text format gives the user the freedom to reuse the data later or run any other customized analysis without website support. In addition, SNPHunter offers a friendly GUI, allowing researchers without much computing background to perform SNP searches easily and efficiently. Moreover, its batch search and automatic SNP selection proved very efficient in a large-scale candidate gene study. Table 2 compares the features of SNPHunter with those of other major SNP-related software and web tools.

Table 2 Comparison between SNPHunter and other publicly available software/tools.

It is worth noting that SNPHunter relies on dbSNP for data retrieval and thus lacks the independence that applications with local database support usually have. Moreover, SNP selection should not be limited to NCBI dbSNP, although dbSNP is the largest publicly available SNP database that can be accessed via the Internet worldwide. Some SNPs reported in the earlier literature have not yet been incorporated into dbSNP. Furthermore, there are several ongoing gene re-sequencing projects for selected human genes, such as SeattleSNPs and SNP500Cancer [18]. Therefore, SNPs from these other sources, if not yet included in dbSNP, should also be considered in SNP selection. Nevertheless, NCBI dbSNP has been steadily updated and has gradually emerged as one of the most comprehensive SNP repositories.

Conclusion

In summary, SNPHunter allows for customized SNP searches (in both ad hoc and batch modes) by directly retrieving and managing SNP information from the NCBI dbSNP database, eliminating tedious and costly local database maintenance on the user's side. To date, SNPHunter has been downloaded more than 1000 times worldwide. We hope this simple program can serve as an efficient and reliable tool for researchers everywhere to facilitate their genetic studies.

Availability and requirements

Project name: SNPHunter

Project home page: http://www.hsph.harvard.edu/ppg/software.htm

Operating system(s): Microsoft Windows

Programming language: Visual Basic .NET

Other requirements: Microsoft .NET Framework 1.0 or above.

License: None

Any restrictions to use by non-academics: Contact authors