Background

Protein kinases represent one of the largest protein domain families in most higher eukaryotes. Members of the protein kinase family are involved in innumerable signal transduction pathways that contribute to decision making in a variety of cellular processes such as cell growth, development, differentiation, metabolism, cell communication and apoptosis. The availability of the first draft of the human genomic data in 2001 [1] provided an opportunity to recognize the repertoire of protein kinases in human [2-4]. These studies employed bioinformatics approaches to recognize kinases starting from the genomic data and provided a classification of kinases that gives hints about the signal transduction pathways in which the kinases participate. Owing to differences in the human genome data sets and in the bioinformatics approaches employed by different groups, the number of protein kinases reported in these early publications differed slightly, although the vast majority of kinases formed a common set [5].

The quality of the human genomic data has been continuously improved since its first release in 2001 [1]. In the early draft data set, the order of the base pairs in each chromosomal area was determined 4 or 5 times (4× or 5×) or even more [1]. In April 2003, human genome data of better quality were compiled, in which additional sequencing improved the accuracy significantly http://www.ornl.gov/TechResources/Human_Genome/project/50yr.html. With these improvements in the genome sequencing projects, the order of the base pairs in each chromosomal area has been determined 8 or 9 times (8× or 9×). The present version of the human genomic data and its annotation are therefore of superior quality compared to the first release about 8 years ago. This accurately annotated genomic sequence data set has an impact on the quality of other kinds of genome-wide data sets, such as transcript sequences, expression data, disease-influenced gene regulation and functional information.

Alternative splicing is an important mechanism for increasing proteomic diversity [6]: it produces diverse transcripts, and hence potentially diverse protein products, from the same gene locus, which eventually adds significant complexity to the genome. In a typical alternative splicing process, different combinations of exons within a gene are spliced from the RNA precursor and reassembled in the mature mRNA. The nature of a splice variant can further depend on the tissue type, the developmental stage and the disease versus normal condition of the cell. Thus splicing can result in a multitude of proteins originating from a single gene, and these protein products often show interesting diversity in functional properties. A striking example is the Drosophila Dscam gene, an axon guidance receptor gene, which produces more than a thousand gene products [7]. The alternative splicing mechanism has been studied for various mammalian systems as well as for plants [[8-11]; http://www.itb.cnr.it/kinweb]. Some of the human kinase splice variants have been analysed in the past [[11]; http://www.itb.cnr.it/kinweb], and that analysis also gives information about their domain combinations. Although human and mouse are very close mammalian species, the conservation of alternative splicing patterns between them has been observed to be low [12-14], which might be the reason for the generation of species-specific gene products. Although the human kinome analyses performed in 2001 [2-4] identified over 500 kinases, it had earlier been anticipated that the number of human protein kinases might be around 1000 [15]. We suggest that the higher estimate still holds good, as alternative splicing is very common in human. Bioinformatics analyses indicate that approximately 40-60% of human genes are alternatively spliced [16-19], with the majority of splicing events occurring in 5' untranslated regions. 
A splicing event allows the production of more than one protein isoform from a single gene; the isoforms may have altered substrate affinities and sub-cellular localizations, and may exhibit different, sometimes antagonistic, functional and structural properties [20-22]. In fact, various protein kinase splice variants have been experimentally characterized and shown to have different expression patterns, differential localization and hence different functional properties [23-26]. It is known from a previous study that alternative transcripts of protein kinases encode different domain structures and that these variants are likely to play important roles in phosphorylation-dependent signaling pathways [8]. Splice variants of the human centrosome kinase Nek2 are known to exhibit different patterns of expression in mitosis [24]. Two splice forms of human protein kinase B have been shown to have distinct regulatory capacities depending on the presence or absence of a phosphorylation site in the carboxy-terminal hydrophobic domain [27]. Two splice forms of mitogen-activated protein (MAP) kinase-interacting kinase (Mnk) derived from the same gene differ only at their C termini. Whereas Mnk2a contains a MAP kinase-binding site in this region, Mnk2b lacks such a sequence and is much less readily activated by MAP kinases in vitro. Mnk2a is cytoplasmic, whereas a substantial amount of Mnk2b is found in the nucleus [25]. Splice variants of Serum and Glucocorticoid-Inducible Kinase 1 (SGK1), which belongs to the AGC group of protein kinases, are known to be regulated differentially following differentiation and have been shown to be markedly upregulated in tumor tissues [28]. TrkC, a receptor tyrosine kinase, has a number of naturally occurring splice variants, including one with a 14-residue insertion between subdomains VII and VIII in the C-terminal lobe of the kinase domain. 
This insert in the catalytic domain of TrkC results in defective MAPK activation, which may result predominantly from an inhibition of high-affinity Shc binding [29]. The splicing mechanism is very carefully regulated, and aberrant regulation of alternative splicing in human has been implicated in various diseases [30, 31].

Most efforts to identify and analyse a comprehensive collection of human gene splice products were made before or around the time when the first draft of the human genome sequence data was released. These analyses were therefore performed using an incomplete set of human gene products. Moreover, although complete human genomic data became available in subsequent years, the improvement in the quality of the genomic data has been gradual. This has prompted us to identify and analyse the repertoire of protein kinase gene products using the present information on splice variants of human kinases. Our objective in this analysis is to extend our understanding of the repertoire of functions of human protein kinases and of the extent to which functional diversity can occur among splice variants of human kinases.

Results and discussion

Identification and analysis of a comprehensive collection of human protein kinases

Currently, two online sites regularly update the information on the repertoire of human protein kinases gleaned from successive releases of the human genome data. These data sets, KinG [Kinases in Genome http://hodgkin.mbu.iisc.ernet.in/~king] [32] and kinase.com http://kinase.com/kinbase/FastaFiles/, are quite similar in terms of the list of human kinases and their classification into subfamilies. Both data sets of the human kinome were obtained through careful use of sensitive sequence analysis tools. For example, in the recent releases of the KinG database http://hodgkin.mbu.iisc.ernet.in/~king, we have employed a highly sensitive multiple profile search approach [33, 34]. During this update of the database and the accompanying analysis, many protein kinases came to light that are splice variants of previously identified protein kinases listed in various kinome data sets such as the one at http://www.itb.cnr.it/kinweb [11]. This prompted us to extend and analyse the repertoire of protein kinase gene products by including protein kinase splice variants in the data set. The improved human genome data, enriched with the information on splice variants available in the ENSEMBL database http://www.ensembl.org/Homo_sapiens/index.html, has helped us to analyze protein kinase splice variants that were not studied in the previous genome-wide studies on kinases, which used the draft version of the human genomic data [2-4].

In the current analysis, using various sensitive profile search methods (discussed in the Methods section), we report 918 putative human protein kinase gene products (Additional file 1). In addition, this list, integrated with user-friendly search options, is available as part of the KinG database at http://hodgkin.mbu.iisc.ernet.in/king/cgi/search. All 918 kinase gene products have the catalytic aspartate, which acts as the catalytic base, suggesting that these gene products are likely to possess phosphotransferase activity. Sequences lacking the catalytic aspartate (protein kinase-like non-kinases) have not been included in the current list. The 918 protein kinase products are encoded by 445 genes. Of these, 209 genes encode only one protein kinase each, whereas the remaining 236 kinase genes each give rise to more than one gene product, i.e. splice variants. The maximum number of splice variants generated from a single protein kinase gene is 12.
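The gene-level tallies above follow from grouping protein products by their parent gene. The following is a minimal sketch of that bookkeeping, assuming a hypothetical list of (gene ID, protein ID) pairs; the function name and the toy identifiers are illustrative placeholders and are not taken from Additional file 1.

```python
from collections import defaultdict


def summarize_products(pairs):
    """Group protein products by gene and tally single- vs multi-product genes."""
    products = defaultdict(set)
    for gene_id, protein_id in pairs:
        products[gene_id].add(protein_id)
    counts = {gene: len(prots) for gene, prots in products.items()}
    return {
        "genes": len(counts),
        "products": sum(counts.values()),
        "single_product_genes": sum(1 for n in counts.values() if n == 1),
        "multi_product_genes": sum(1 for n in counts.values() if n > 1),
        "max_products_per_gene": max(counts.values(), default=0),
    }


# Toy, made-up identifiers (not real ENSEMBL entries).
toy_pairs = [
    ("GENE_A", "PROT_1"),
    ("GENE_B", "PROT_2"), ("GENE_B", "PROT_3"),
    ("GENE_C", "PROT_4"), ("GENE_C", "PROT_5"), ("GENE_C", "PROT_6"),
]
summary = summarize_products(toy_pairs)
```

Applied to the full data set, a tally of this kind would be expected to return the figures reported above: 445 genes in total, split into 209 single-product and 236 multi-product genes.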

We checked for expression of the human protein kinase splice variants using the publicly available cDNA data ftp://ftp.ensembl.org/pub/current_fasta/homo_sapiens/cdna/ and found that 894 of the 918 have known cDNA data. Hence an overwhelming majority of the recognized protein kinase gene products are probably expressed and represent functional proteins. We have compared our kinome data set with that of kinase.com http://www.kinase.com, which so far lists 516 protein kinase gene products, of which 25 are atypical protein kinases that differ from the typical eukaryotic protein kinase superfamily discussed in the current analysis.

Classification of protein kinase gene products

These putative protein kinases have been further classified into various subfamilies as proposed by Hanks and Hunter [35]. The most highly represented group in human is CAMK (Ca2+/calmodulin-dependent protein kinases), which is regulated by calcium/calmodulin and has 161 members (Additional file 1 and http://hodgkin.mbu.iisc.ernet.in/king/cgi/search), followed by the CMGC (CDK, MAPK, GSK3, CK2) group with 149 members. A further 130 protein products belong to the AGC group of kinases, which are mainly second messenger-regulated kinases. Several members closely related to the protein tyrosine kinase family and the tyrosine kinase-like (TKL) subfamily have also been identified. The TKL family is quite diverse, with members (e.g. Raf, Mixed-lineage kinase, Activin and beta-receptor kinase, Interleukin 1 receptor associated kinase etc.) resembling both tyrosine and Serine/Threonine kinase families.

Human kinase splice variants with potentially diverse functions, as revealed by sequence analysis

Interestingly, there are examples of newly recognized splice variants encoded by a single gene that differ markedly in length and have different domain organization. Such prominent differences indicate that these splice variants might have different modes of regulation and different functional properties. Additional file 2 lists a set of such protein kinase splice variants with different domain organisation. For example, two protein kinases (ENSP00000300843 and ENSP00000262893) with catalytic domains closely resembling the MARK subfamily have a UBA (ubiquitin-associated) domain following the kinase domain, whereas a third splice variant (ENSP00000262891) has an extra domain, KA1 (kinase-associated domain), in addition to the kinase and UBA domains (Figure 1A). The KA1 domain tethered to the protein kinase catalytic domain indicates a role for this kinase in the ubiquitin pathway. Similarly, the regulatory CAMKII_AD domain, which has a role in multimerization, is usually found tethered to the calcium/calmodulin-dependent protein kinase II (CAMKII). The CAMKII_AD domain is absent in one of the splice variants with a kinase domain closely related to the CAMKII subfamily (ENSP00000369096), while this regulatory domain is present in three other splice variants (ENSP00000339740, ENSP00000378032, ENSP00000378034) (Figure 1B). In the same way, one of the four splice variants (ENSP00000346671) encoded by the gene ENSG00000010810 lacks the SH3 domain that is found N-terminal to the kinase domain in Fyn tyrosine kinase (Figure 1C); the SH3 domain has a role in increasing the local concentration of proteins, altering their subcellular location and mediating the assembly of large multiprotein complexes. Interestingly, two splice variants (ENSP00000339291, ENSP00000339299) encoded by the gene ENSG00000038382, which belong to the Trio subfamily of the calmodulin-regulated CAMK group and may mediate apoptosis induced by interferon-gamma [36], also have different domain organizations. 
In this example, one of the variants (ENSP00000339299) has multiple domains, including spectrin repeats, two copies of the Rho GEF and PH domains, and SH3 and I-set (immunoglobulin) domains, whereas the other variant (ENSP00000339291) lacks all of these domains except the I-set domain, which lies N-terminal to the kinase domain (Figure 1D). Many more splice variants that are encoded by the same gene but lack one or more regulatory accessory domains are listed in Additional file 2.
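The comparison of domain organization among splice variants of one gene can be expressed as a simple set operation. The sketch below is illustrative only: it assumes a hypothetical mapping from protein ID to an ordered list of domain names, uses the MARK-like variants described above as input, and the domain labels and their N-to-C order are assumptions rather than values read from Additional file 2.

```python
def domain_differences(architectures):
    """For each variant, list domains found in sibling variants but absent from it.

    `architectures` maps a protein ID to its ordered list of domain names.
    """
    all_domains = set()
    for domains in architectures.values():
        all_domains.update(domains)
    return {
        protein: sorted(all_domains - set(domains))
        for protein, domains in architectures.items()
    }


# MARK-like example from the text; labels and domain order are assumptions.
mark_like = {
    "ENSP00000300843": ["kinase", "UBA"],
    "ENSP00000262893": ["kinase", "UBA"],
    "ENSP00000262891": ["kinase", "UBA", "KA1"],
}
missing = domain_differences(mark_like)
```

Here the two shorter variants are reported as lacking KA1, while the third variant lacks nothing relative to its siblings. A fuller comparison would also take domain order and copy number into account, which this sketch ignores.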

Figure 1

Cartoon representation of the domain organization of newly recognized splice variants that are encoded by a single gene but have different domain organizations. The splice variants shown in the figure are closely related to a) MARK kinase; b) CAMK II subfamily; c) Fyn tyrosine kinase; d) Trio kinase. Abbreviations used in the figure: UBA, Ubiquitin Associated Domain; KA1, Kinase Associated Domain 1; CAMKII_AD, Calcium/calmodulin dependent protein kinase II Association Domain; SH3, Src homology 3; SH2, Src homology 2; I-set, Immunoglobulin domain; RhoGEF, Guanine nucleotide exchange factor for Rho/Rac/Cdc42-like GTPases; PH, Pleckstrin Homology domain.

Additional file 2 also lists a few examples of splice variants that are encoded by the same gene but whose catalytic kinase domains belong to different protein kinase subfamilies. For example, of the six splice variants (ENSP00000320622, ENSP00000346846, ENSP00000352088, ENSP00000353452, ENSP00000353530, ENSP00000355024) encoded by the gene ENSG00000065534, five belong to the MLCK (Myosin Light Chain Kinase) subfamily, whereas one belongs to the DRAK (death-associated protein kinase [DAP-kinase]-related apoptosis-inducing kinase) subfamily, which plays important roles in apoptotic signal transduction [37]. In another example, one of the four splice variants (ENSP00000354170) encoded by the gene ENSG00000068078 belongs to the Vascular Endothelial Growth Factor protein tyrosine kinase subfamily, whereas the other three (ENSP00000231803, ENSP00000260795, ENSP00000339824) belong to the Fibroblast growth factor receptor subfamily (Additional file 2).

Expression of human kinase splice variants in different cell types

Additional file 3 provides information on the expression of many human kinase splice variants in different tissues or cell types. This information was obtained from the ENSEMBL database. From this table it can be observed that, for each of 44 genes, at least one splice variant product is expressed in two or more different tissues compared to the other splice variants of the same gene. This corresponds to ~49% of the genes implicated in variant splicing. In all such cases, the splice variants of a gene belong to the same kinase subfamily. For example, the gene ENSG00000141068 produces two kinases belonging to the same subfamily, one expressed in 4 tissues (ENSP00000323178) and the other in 17 tissues (ENSP00000268763).
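One way to compare the tissue distribution of splice variants of a single gene is sketched below. It reflects one possible reading of the comparison described above, namely counting the tissues per variant and listing those not shared with sibling variants. The input mapping, the function name and the tissue names are hypothetical and are not taken from Additional file 3.

```python
def tissue_breadth(expression):
    """Return, per variant, the tissue count and tissues not shared with siblings.

    `expression` maps a protein ID to the set of tissues where it is reported.
    """
    report = {}
    for protein, tissues in expression.items():
        others = set()
        for other, other_tissues in expression.items():
            if other != protein:
                others |= set(other_tissues)
        report[protein] = {
            "n_tissues": len(set(tissues)),
            "unique_tissues": sorted(set(tissues) - others),
        }
    return report


# Toy input with made-up variant names and tissues.
toy_gene = {
    "VARIANT_1": {"brain", "liver"},
    "VARIANT_2": {"brain", "kidney", "lung", "testis"},
}
report = tissue_breadth(toy_gene)
```

Under this reading, a gene would count towards the 44 if at least one of its variants had two or more entries in `unique_tissues`.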

Conclusions

The remarkable improvement in the quality of human genomic data has enabled us to generate a more comprehensive repertoire of human kinases, with the identification of 918 protein kinase gene products, many of which represent splice variants with altered properties. Previous analyses of the human kinome [13] presented a much smaller set of kinases, as splice variants were not analyzed in these early papers. Milanesi et al. [11] extended the early analysis of the human kinome by considering splice variation and, in particular, reported 5 further kinase genes and a pseudogene. However, the present study indicates that alternative splice forms of protein kinase genes are far more abundant than previously thought and seem to provide interesting variety with implications for biological pathways and processes. The present analysis has unearthed newly recognized protein kinase splice variants, almost all of which have known cDNA data and many of which are known to be expressed in various tissues or cell types. Our survey enabled us to recognize many important kinase splice variants that might be involved in diverse signal transduction pathways and have different modes of regulation. The list of newly recognized human kinase splice variants includes several examples that have not yet been studied experimentally. Future work could aim at studying the sub-cellular localization of the kinase splice variants in different cell types, their roles in biological processes, their modes of regulation and their substrate specificity, which will give insight into human signal transduction pathways.

Methods

The complete set of predicted protein sequences from the ORFs of the human genome was obtained from ENSEMBL [release 52] http://www.ensembl.org. We surveyed the genome for Ser/Thr and Tyr protein kinases using sensitive sequence profile matching algorithms. The sequence search tools used in this study for domain identification and sub-family classification were PSI-BLAST [38], Reverse PSI-BLAST (RPS-BLAST) [39] and HMMer [40], which matches Hidden Markov Models (HMMs) to sequences; together they were used to identify the protein kinase catalytic domain and its co-occurring domains. PSI-BLAST and RPS-BLAST were used at a stringent E-value cut-off of 0.0001. Searches performed using HMMer with an E-value cut-off of 0.01 or below have been shown to nearly eliminate spurious hits. Hits lacking significant sequence similarity with the query were further examined manually. The final data set of putative human protein kinases was obtained by compiling the hits from the various procedures. The program CD-HIT [41] was used to eliminate redundant sequences. The sequence identity shared between any two putative protein kinase sequences in the data set is less than 100%, indicating that the data set is devoid of redundant sequences. Multiple sequence alignment of the kinase catalytic domains of the putative human protein kinases was carried out using ClustalW [42]. Information on the cell/tissue types in which various kinases are expressed was retrieved from ENSEMBL http://www.ensembl.org; 498 protein kinases were found to be expressed in various tissue types (Additional file 3).
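The compilation of hits under the E-value cut-offs stated above can be illustrated as follows. This is a sketch under assumptions: the tuple layout, the method labels and the sequence IDs are hypothetical, and taking the union of hits across methods is one plausible reading of "compilation", not a confirmed description of the pipeline. Parsing of the actual PSI-BLAST, RPS-BLAST and HMMer output is not shown.

```python
# E-value cut-offs as stated in the text; the dictionary keys are illustrative labels.
CUTOFFS = {"psiblast": 1e-4, "rpsblast": 1e-4, "hmmer": 1e-2}


def compile_hits(hits):
    """Return sorted sequence IDs that pass the cut-off of at least one method.

    `hits` is an iterable of (method, sequence_id, e_value) tuples.
    """
    accepted = set()
    for method, seq_id, e_value in hits:
        if e_value <= CUTOFFS[method]:
            accepted.add(seq_id)
    return sorted(accepted)


# Toy hits with made-up sequence IDs and E-values.
toy_hits = [
    ("psiblast", "SEQ_1", 1e-30),
    ("hmmer", "SEQ_2", 5e-3),
    ("rpsblast", "SEQ_3", 1e-3),   # fails the 0.0001 cut-off
    ("hmmer", "SEQ_1", 1e-25),     # duplicate ID, kept once
]
kinase_candidates = compile_hits(toy_hits)
```

Candidates rejected by every method, such as `SEQ_3` here, would correspond to the hits that the text describes as being examined manually.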

Classification of human protein kinases into subfamilies

The protein kinases discussed in the current analysis have been classified into various protein kinase subfamilies following the Hanks and Hunter classification scheme. RPS-BLAST was used to search each of the putative protein kinases as a query against a database containing 55 Multiple Position Specific Scoring Matrices (MulPSSMs) created for the various subgroups of protein kinases in each of the subfamilies. Stringent criteria were used to identify kinase subfamilies: alignment coverage greater than 70% over the kinase domain region and a sequence identity cut-off of 30%, in addition to the stringent E-value cut-off.
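The three criteria above can be combined into a single filter, sketched below. Several details are assumptions made for illustration: coverage is taken as aligned length divided by kinase domain length, the E-value threshold is set to the 0.0001 used for RPS-BLAST elsewhere in the Methods, and the choice of the lowest E-value among passing profiles is a hypothetical tie-breaking rule. The field names and subfamily labels are placeholders.

```python
def assign_subfamily(hits, min_coverage=70.0, min_identity=30.0, max_e_value=1e-4):
    """Pick the best subfamily profile hit that satisfies all three criteria.

    `hits` is a list of dicts with keys: subfamily, aligned_len, domain_len,
    identity_pct, e_value. Returns a subfamily name, or None if nothing passes.
    """
    passing = []
    for hit in hits:
        coverage = 100.0 * hit["aligned_len"] / hit["domain_len"]
        if (coverage > min_coverage
                and hit["identity_pct"] >= min_identity
                and hit["e_value"] <= max_e_value):
            passing.append(hit)
    if not passing:
        return None
    # Assumption: among passing profiles, prefer the one with the lowest E-value.
    return min(passing, key=lambda h: h["e_value"])["subfamily"]


# Toy hits with made-up subfamily labels and scores.
toy_hits = [
    {"subfamily": "SUBFAMILY_X", "aligned_len": 240, "domain_len": 260,
     "identity_pct": 42.0, "e_value": 1e-40},
    {"subfamily": "SUBFAMILY_Y", "aligned_len": 150, "domain_len": 260,
     "identity_pct": 35.0, "e_value": 1e-50},  # coverage below 70%
]
best = assign_subfamily(toy_hits)
```

In the toy input, the second profile has the better E-value but is rejected on coverage, which shows why the criteria are applied jointly rather than ranking on E-value alone.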

Domains were assigned to the non-catalytic regions of the kinase-containing gene products using HMM search methods, by querying each of the kinase domain-containing sequences against the 9318 protein family HMMs available in the Pfam database release 23 [43].

Information on splice variants for these putative protein kinases was taken from the BioMart data mining tool available in the ENSEMBL database, which has entries for genes, their transcripts and their protein products. In ENSEMBL, an initial alignment of mRNA and protein is performed against the genome. All overlapping transcripts are clustered under one gene name and deemed 'splice variants'. Thus, gene products that map to a single gene are defined as splice variants by ENSEMBL http://www.ensembl.org.
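The idea of clustering overlapping transcripts under one gene can be illustrated with a simple interval-merging sketch. This is a deliberate simplification and not the ENSEMBL procedure itself: it assumes that overlap of whole transcript spans on the same chromosome and strand is sufficient, and the tuple layout, function name and coordinates are hypothetical.

```python
def cluster_transcripts(transcripts):
    """Group transcripts whose genomic spans overlap on the same chromosome and strand.

    `transcripts` is a list of (transcript_id, chrom, strand, start, end) tuples
    with 1-based inclusive coordinates. Returns a list of clusters (lists of IDs).
    """
    clusters = []
    ordered = sorted(transcripts, key=lambda t: (t[1], t[2], t[3]))
    current, current_key, current_end = [], None, None
    for tid, chrom, strand, start, end in ordered:
        key = (chrom, strand)
        if current and key == current_key and start <= current_end:
            # Overlaps the running cluster: extend it.
            current.append(tid)
            current_end = max(current_end, end)
        else:
            if current:
                clusters.append(current)
            current, current_key, current_end = [tid], key, end
    if current:
        clusters.append(current)
    return clusters


# Toy transcripts with made-up IDs and coordinates.
toy_transcripts = [
    ("T1", "1", "+", 100, 500),
    ("T2", "1", "+", 400, 900),
    ("T3", "1", "+", 1000, 1200),
    ("T4", "1", "-", 450, 800),
]
clusters = cluster_transcripts(toy_transcripts)
```

Each resulting cluster plays the role of one gene, and clusters with more than one member correspond to genes with splice variants. Transcript `T4` stays separate because it lies on the opposite strand.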