Background

Many biologists focus their research on one or a few specific protein families. While comprehensive sequence databases, such as GENBANK [1], and generally annotated databases, such as SWISSPROT [2], are freely available, intensive studies on protein families are better supported by relatively small, focused databases that are developed for specific research needs. Our experience with the voltage-gated potassium channel database (VKCDB) provides an example of such a small, targeted protein family database. The approaches that we used to create this database are generally applicable to building databases for functional studies of other protein families.

Voltage-gated potassium channels (VKCs) are intrinsic membrane proteins that respond to changes in the transmembrane electrical field by altering conformation and selectively allowing potassium ions to pass through the membrane [3]. This property is the basis for VKCs' roles in shaping action potentials in neurons and modulating the electrical activity of excitable membranes. Mutations in VKC genes can lead to severe diseases, such as long QT syndrome and epilepsy [4, 5]. Thus, VKCs have been considered as possible targets for drug design [6].

VKCs constitute a structurally and functionally diverse protein family. At this writing, there are over two hundred described members of this family from more than 35 organisms. VKC-related structural and functional data, particularly electrophysiological and pharmacological parameters, are distributed in dozens of databases and hundreds of journal articles. No single database contains structural and functional data for the various members of this large protein family. The application of comparative methods to the study of VKCs and other protein families depends on ready availability of both structural and functional data in an easily accessed database. Here we report a customized database of VKC-related data that was created using semi-automated collection and management. This relational database currently holds 346 VKC entries. Each entry contains sequences, motifs, references, hyperlinks to other databases, and other available structural information. We have also collected available electrophysiological and pharmacological parameters for VKCs from several hundred published articles. These types of data are not properties of most proteins, and are not contained in general protein databases.

Construction and Content

VKCDB was initially populated by performing a redundant set of searches of GENBANK for family members (Figure 1). GENBANK was first searched for protein sequences similar to the human Kv1.2 protein sequence [7] using BLASTP [8]. The top 200 hits were used to perform BLASTP searches against GENBANK and SWISSPROT, yielding a comprehensive collection of VKCs. After collapsing all redundant BLASTP results, the top 319 non-redundant hits were collected; sequences with lower scores were not VKCs.
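The collapsing step can be illustrated with a minimal Python sketch. It assumes that two hits are redundant when their sequences are identical; the text does not state the exact criterion. The `Hit` record, its field names, and the `collapse_redundant` function are hypothetical and are not taken from the VKCDB scripts.

```python
from typing import NamedTuple, Optional


class Hit(NamedTuple):
    """One BLASTP hit (hypothetical record layout, for illustration only)."""
    accession: str
    sequence: str
    bit_score: float


def collapse_redundant(hits, limit: Optional[int] = None):
    """Keep the best-scoring hit for each distinct sequence, ranked by score.

    Assumption: hits are redundant when their sequences are identical
    (compared case-insensitively).
    """
    best = {}
    for hit in hits:
        key = hit.sequence.upper()
        if key not in best or hit.bit_score > best[key].bit_score:
            best[key] = hit
    ranked = sorted(best.values(), key=lambda h: h.bit_score, reverse=True)
    return ranked if limit is None else ranked[:limit]
```

Calling `collapse_redundant(hits, limit=319)` would correspond to keeping the top 319 non-redundant hits; the decision that lower-scoring sequences are not VKCs still requires inspection.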

Figure 1

Populating VKCDB. VKCDB was populated by searching against GENBANK and SWISSPROT databases with 200 seeding VKC protein sequences. The redundancies in these results were collapsed, then structural and functional information was extracted from different databases and published articles using a combination of automated scripts and manual selection.

A Perl script was used to retrieve information on the 319 VKCs from GENBANK and SWISSPROT and store it in a MySQL relational database. A schematic diagram of the ER model of VKCDB can be found at the VKCDB website http://vkcdb.biology.ualberta.ca/images/ermodel.gif. Data from redundant records in GENBANK and SWISSPROT were combined into a single entry. VKCDB entries with very similar sequences were manually checked, and their sequences were compared and annotated as "possible isoforms" or "sequence conflicts". Records for splice variants were cross-referenced. Conflicting sequences that were submitted by different authors were cross-indexed as sequence conflicts, unless sequence errors were indicated in the literature, in which case the most recently updated sequence was kept. Entries labelled as "unknown products" from large sequencing projects that had the characteristic sequence pattern of the voltage sensor (a lysine or arginine residue at every third position of the fourth transmembrane domain) were used as BLASTP queries and annotated as members of a specific family of VKC based on the annotation of the most similar BLAST hits.
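The voltage-sensor pattern described above (a lysine or arginine at every third position) lends itself to a regular expression. The following Python sketch is one possible screen, not the script used for VKCDB. The minimum of four consecutive repeats (`MIN_REPEATS`) is an assumption, since the text does not give a threshold, and the function name is hypothetical.

```python
import re

# Assumption: at least four consecutive "K/R followed by two residues" repeats
# are required; the text does not specify a threshold.
MIN_REPEATS = 4
S4_PATTERN = re.compile(r"(?:[KR][A-Z]{2}){%d,}" % MIN_REPEATS)


def find_voltage_sensor(sequence):
    """Return (start, end) of the first S4-like stretch, or None if absent."""
    match = S4_PATTERN.search(sequence.upper())
    if match is None:
        return None
    return match.start(), match.end()
```

A screen of this kind only flags candidates. As described above, flagged entries were then used as BLASTP queries before being assigned to a family.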

Using literature citations from GENBANK and SWISSPROT, we manually collected available electrophysiological and pharmacological data from published articles for each VKC entry. Conflicting data were all kept and hyperlinked to the references in PUBMED [9].

All sequences were submitted to the TMHMM [10] and PHD [11] servers for secondary structure prediction. Results from both analyses were parsed and combined into a single annotated sequence figure. This information is currently stored as a graphic because neither prediction program is sufficiently accurate for its output to be taken as a definitive result.

Sequences belonging to each of the four Kv families and the KCNQ family [12] were extracted and a multiple alignment for each family was generated with ClustalW [13]. Alignments of the highly conserved regions (the T1 domain and the six transmembrane domains) were manually adjusted and included in VKCDB. Subsets of the aligned sequences can be selected and exported in FASTA format.
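Exporting a user-selected subset of an alignment in FASTA format, with gap characters preserved, might look like the following Python sketch. The function name, the `(identifier, aligned_sequence)` record layout, and the 60-column line width are assumptions made for illustration; the VKCDB implementation is not described in this detail.

```python
def to_fasta(records, selected_ids, width=60):
    """Format the selected aligned sequences as FASTA text, keeping gaps.

    records: iterable of (identifier, aligned_sequence) pairs (hypothetical layout).
    selected_ids: identifiers chosen by the user.
    width: residues per output line (an assumed default).
    """
    wanted = set(selected_ids)
    lines = []
    for seq_id, aligned in records:
        if seq_id not in wanted:
            continue
        lines.append(">" + seq_id)
        # Wrap the gapped sequence without removing the "-" characters.
        for i in range(0, len(aligned), width):
            lines.append(aligned[i:i + width])
    return "".join(line + "\n" for line in lines)
```

Because the gaps are written out unchanged, the exported subset remains aligned and can be passed directly to downstream alignment-aware tools.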

VKCDB is updated monthly. The "last modified" date of each VKCDB entry is compared to the corresponding field in the cognate GENBANK and SWISSPROT entries. Information from any of the archival entries that has been changed since the last VKCDB update is then parsed and used to update VKCDB. New VKC entries are collected by performing a BLAST search of GENBANK and SWISSPROT with all entries in VKCDB. The hits are combined into a non-redundant list for each subfamily and the top twenty scores on that non-redundant list that are not already entered in VKCDB are manually checked to confirm that they are indeed VKCs before adding them to VKCDB. If all twenty hits are VKCs, the next twenty hits are also manually evaluated; this process is repeated until non-VKC entries are found. Current entries in VKCDB were updated with SWISSPROT Release 42.1 (October 2003) and GENBANK entries as of October 2003.
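The batch-wise review of new hits can be sketched as a simple loop. In this Python sketch, `is_vkc` is a hypothetical callback standing in for the manual check, and hits are represented by plain identifiers; both are assumptions made for illustration.

```python
def review_new_hits(ranked_hits, known_ids, is_vkc, batch_size=20):
    """Review hits not yet in the database, best score first, in batches.

    ranked_hits: non-redundant hit identifiers, sorted by decreasing score.
    known_ids: identifiers already present in the database.
    is_vkc: callback standing in for manual curation (hypothetical).
    Review continues to the next batch only while every hit in the
    current batch is confirmed as a VKC.
    """
    known = set(known_ids)
    candidates = [hit for hit in ranked_hits if hit not in known]
    accepted = []
    for start in range(0, len(candidates), batch_size):
        batch = candidates[start:start + batch_size]
        confirmed = [hit for hit in batch if is_vkc(hit)]
        accepted.extend(confirmed)
        if len(confirmed) < len(batch):
            break  # a non-VKC was found, so lower-scoring hits are not reviewed
    return accepted
```

The stopping rule relies on the hits being ranked by score: once a batch contains a non-VKC, lower-ranked hits are assumed unlikely to be family members.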

Utility and Discussion

At present, VKCDB contains 346 VKC entries from 35 organisms. Thirty-nine VKCs are annotated as having between two and nine different isoforms, although some of these might be due to cloning or annotation artefacts. VKCDB can be browsed and searched through a web interface by several criteria, including VKC Kv subfamilies [12], organism names, GENBANK Protein ID, protein description, reference information, and electrophysiological parameters. The VKC entries in all search and browse results can be individually selected on the web page to produce a batch sequence file in FASTA format for use in other applications. This site also implements a local BLAST server for similarity searches against VKCDB entries.

Each VKC entry page contains information such as protein accession number, protein name, protein sequence, coding gene name and accession number, SWISSPROT function description, references and hyperlinks to other biological databases. On each entry page, a button labelled "Electrophysiology" opens a pop-up window containing electrophysiological parameters, pharmacological data, and related references hyperlinked to PUBMED (Figure 2). We will be adding data on synthetic VKC mutants to VKCDB in the near future, including links to the cognate wild-type protein and the associated electrophysiological and pharmacological data. There is also a link to transmembrane helix predictions by TMHMM [10] and PHD [11] on each entry page.

Figure 2

Screenshot of the entry page of a VKCDB entry. The pop-up window contains the electrophysiological parameters of this entry. The transmembrane helix predictions by TMHMM and PHD can also be displayed for each VKCDB entry. The content of the entry page is extensively hyperlinked to various databases.

Multiple alignments of conserved regions of the four Kv families and of the KCNQ family are available on the VKCDB web site [12], on the tools page http://vkcdb.biology.ualberta.ca/alignment.html. The individual sequences can be selected and downloaded, with gaps in place, in FASTA format for use in other applications.

The VKCDB web site includes a "submit" page that allows us to communicate with users on annotation errors, missing entries, and other information so that we can maintain accurate and updated VKC information in our database.

Conclusions

VKCDB contains structural and functional data and related multiple alignments for voltage-gated potassium channels in a single database. The VKCDB web page is designed to provide easy access and searching through a user-friendly interface. It is also designed to interact easily with tools that we are developing to study structure-function relationships in VKCs using machine-learning approaches. The database information is also available as an XML file for users who wish to implement customized configurations.

Similar approaches can be taken to construct specific, small-to-medium-sized protein family databases with minimal knowledge of Perl and MySQL database management. As a small, customized protein family database, VKCDB is a useful and convenient resource for research on VKCs. As our understanding of VKCs increases, more annotations and applications will be added to enrich VKCDB so that it can continue to serve as a main resource for structural and functional studies of VKCs.

Availability and Requirements

VKCDB is freely accessible at http://vkcdb.biology.ualberta.ca. A snapshot of VKCDB in XML format (dated November 2, 2003 at this writing) can be freely downloaded from the VKCDB website.