Background

Since the completion of the human genome project, thousands of species have been fully sequenced [1], providing a broader coverage of species diversity across the tree of life. “Moonshot” approaches, such as the Earth BioGenome Project (EBP), aim to catalog, and then characterize, genomes across eukaryotic biodiversity during the next decade [2]. The acquisition of genomic (and their associated proteomic) datasets enables the accurate prediction of protein functions through ever-deeper evolutionary analysis of related sequences [3]. Protein kinases transfer the gamma phosphate group from ATP to an expanding subset of amino acids in their regulatory targets [4]. They can be distinguished from other mechanistically related enzymes, such as metabolic and glycan-modifying kinases [5]. Protein kinases represent one of the largest druggable families of signaling proteins abnormally regulated in various human diseases, including most human cancers [6,7,8]. The human genome encodes nearly 550 protein kinase-related genes (collectively referred to as the human kinome) that have been broadly classified into major groups and families [9, 10]. A majority of the human kinome members have been functionally characterized in multiple model organisms; however, nearly 30% of human kinases remain understudied, despite multi-organism knowledge of their primary sequence [11,12,13]. These are collectively referred to as “dark” kinases based on a subset of metrics such as the number of published papers (Jensen PubMed score [14] < 50 and PubTator score [15] < 150) and grant funding (no R01). Many of the understudied kinases, such as RIO and NEK families, contain clear orthologs in a majority of eukaryotic genomes, suggesting essential (rate-limiting) biological functions across life [16,17,18]. A major focus of the Illuminating the Druggable Genome (IDG; https://commonfund.nih.gov/idg/index) consortium is to characterize the functions of these understudied proteins as a conceptual starting point for developing new drugs for a wide range of diseases such as cancer, neurodegenerative and autoimmune disorders that are associated with abnormal kinome signaling [19].

Comparative genomics is a powerful approach for inferring gene functions and is based on the assumption that genes descended from the same ancestor are likely to retain commonly shared functions [20, 21]. These gene descendants are called orthologs and paralogs, two major types of homologs related to speciation and duplication events, respectively [22]. Paralogs can be further categorized as in-paralogs and out-paralogs: the former arises from duplication after speciation, while the latter arises from duplication before speciation [23]. The concept of “one-to-one” orthologous relationships (one protein in one species versus one protein in the other species) has been extended to “one-to-many” or “many-to-many” relationships and are collectively termed orthologous groups [24]. Co-orthologs are defined as a pair of genes from the same orthologous group but different species [25]. Given the importance of these relationships for functional analysis, several orthology inference methods have been developed. We have previously used these approaches to analyze canonical protein kinases and pseudokinases, including a broad survey of pseudoenzymes [26], pseudokinases [27], and a variety of understudied kinases whose biological function remains unknown, despite conservation in various eukaryotic lineages [28,29,30].

Current orthology inference methods can be broadly classified into two major categories: tree-based methods [31,32,33,34] and graph-based methods [35,36,37,38,39,40,41,42,43]. Tree-based methods, such as EnsemblCompara [31], construct reconciled trees based on gene trees and corresponding species trees and then infer the type of evolutionary event (speciation/duplication) that represents each internal node of the tree. By contrast, graph-based methods, such as OrthoMCL [35], avoid building trees and identify hypothetical orthologous relationships (orthologs/paralogs) typically by two main steps: graph construction and clustering. Graph-based methods represent proteins as nodes and the relationships connecting the nodes (sequence similarity, for example) as edges. The nodes in the graph are then clustered into orthologous groups by different strategies [44,45,46]. Tree-based methods are generally more accurate than graph-based methods, depending on the accuracy of species trees [46, 47]. However, tree-based methods are computationally expensive, limiting the exploration of thousands of species across the tree of life [47]. In contrast, graph-based methods are faster, but the increased speed is generally achieved at the cost of reduced sensitivity. When applied to large datasets, the performance of graph-based methods is comparable to tree-based methods and, in some cases, even better than tree-based methods [48].

Most orthology inference methods rely on time-consuming all-vs-all sequence similarity searches across full-length gene or protein sequences across entire genomes. As such, these methods are not designed for focused analysis on individual gene families. Within large protein families, such as the protein kinase superfamily, traditional orthology inference methods display high false-positive rates since they do not consider the conservation of known functional domains. Some existing methods identify sequences as putative orthologs that almost certainly lack the classical bilobal kinase domain. In contrast, domain-based methods, such as Hierarchical grouping of Orthologous and Paralogous Sequences (HOPS) [49], FlowerPower [50], Domain based Detection of Orthologs (DODO) [51], Microbial Genome Database (MBGD) [52], and Domainoid [53], are tailored to identify evolutionary relationships based on functionally relevant regions, notably domains, of a protein. However, the performances of these methods are reliant on the annotation of domains based on prior knowledge, thereby making it challenging to identify novel domains and relationships. A good example of this is the discovery of atypical kinases with very low sequence identity compared to search sequences, such as the atypical SelO kinase [54].

To address the above challenges in orthology prediction, we developed KinOrtho as a complementary approach for efficient and accurate identification of human kinase orthologs across ~ 17,000 species, extending well beyond the 15 model organisms defined in the seminal study of Manning and colleagues [9] and a recently updated kinase-centric database with kinases from 2000 species [55]. KinOrtho is query-based and achieves increased sensitivity by combining similarities in the commonly conserved protein kinase domain and flanking regulatory domains. This enables us to develop one-to-one orthology relationships that provide a finer resolution of orthologs across species than previous efforts. We apply KinOrtho towards identifying putative proteins involving fusion or fission events, or so-called “Rosetta Stone protein” [56]. By integrating evolutionary information with gene expression patterns, we identify a potential functional association between an understudied kinase “Serine/threonine-protein kinase TAO1” (TAOK1) and a metabolic kinase “Phosphatidylinositol-4-phosphate 3-kinase” (PI3KC2A) in autophagy. Using KinOrtho-defined orthologs, Gene Ontology (GO) annotations, and machine learning models, we prioritize understudied kinases for functional studies by developing a Novel Inferred Annotation Score (NIAS). The KinOrtho pipeline and ortholog datasets are available at the GitHub repository (https://github.com/esbgkannan/KinOrtho), and the patterns of conservation in aligned orthologs sequences are visualized in both KinView [57] (https://prokino.uga.edu/kinview/) and the IDG resource Pharos [58] (https://pharos.nih.gov/).

Results

Overview of KinOrtho algorithm

KinOrtho is a query- and graph-based orthology inference method that combines full-length and domain-based orthology inference approaches. It consists of two pipelines (full-length and domain-based) and six main steps: (i) homology search, (ii) building kinome databases, (iii) all-vs-all homology search, (iv) orthology inference, (v) cluster analysis, and (vi) combining the results from two pipelines (Fig. 1).

Fig. 1
figure 1

Overview of the KinOrtho algorithm. KinOrtho is an orthology inference method combining full-length and domain-based approaches. It consists of six main steps: (i) homology search against reference proteomes, (ii) building BLAST databases, (iii) all-vs-all homology search, (iv) orthology inference, (v) cluster analysis, and (vi) combining the results of full-length and domain-based approaches

Because KinOrtho is query-based, it omits a large portion of unnecessary sequence comparisons unrelated to the query sequence(s). This characteristic makes KinOrtho a more efficient tool to identify orthologs of interest across the tree of life. We applied KinOrtho to identify the orthologs of 545 human kinases across some 17,000 species found in UniProt reference proteomes [59]. Without target genes, traditional orthology inference methods start from an all-vs-all homology search, which would require orders of magnitude (more than two quadrillion) pairwise sequence comparisons for this sample of reference proteomes. Because our query sequences were human kinases, only about eight thousand species were found to have human kinase homologs, which resulted in a nearly 2000-fold reduction in the number of comparisons to be made (Additional file 1: Table S1). This makes KinOrtho one of the most computationally efficient orthology inference methods currently available for the identification of kinases.

When performing orthology inference, we adopted the definition of orthologous relationship used by OrthoMCL [35] (see Methods), which resulted in twice as many orthologous relationships using similarity in full-length sequences compared to similarities within the kinase domain alone (302 million for full-length vs. 148 million for kinase domain; Additional file 1: Table S1). However, the application of graph-based clustering and further refinement of the clusters resulted in a comparable number of orthologous relationships in the full-length (97 million) and domain-based (100 million) pipelines (referred to as full-length set and domain-based set, respectively). Finally, the combination of both pipelines resulted in 75 million overlapping orthologous relationships (termed “overlapping set” from here on), including \(\sim\)133,000 relationships between human kinases and kinases from other species (Additional file 1: Table S1). Since this is the most refined set of relationships obtained from KinOrtho, this overlapping set will be referred to as KinOrtho throughout this manuscript. In contrast, the full-length and domain-based results will be explicitly stated when mentioned.

Benchmarking and comparison of KinOrtho with other orthology inference methods

To evaluate and compare the performance of KinOrtho with other orthology inference methods in identifying kinase orthologs, we applied KinOrtho to the well-curated Quest for Orthologs (QfO) reference proteomes 2018 [60]. As shown in Fig. 2a, the overall comparison metrics for KinOrtho are better (in terms of the overall precision and recall) than existing methods in the benchmarking datasets based on the enzyme classification conservation test, agreement with reference gene phylogenies, and species tree discordance benchmarks. The remaining metrics are shown in Additional file 1: Figure S1. It is also important to note that the selection of orthologs from KinOrtho’s full-length pipeline, domain-based pipeline, and the overlapping results all yielded similar performance (compared to other methods), suggesting robustness and agreement across these methods.

Fig. 2
figure 2

Benchmarking and comparison of KinOrtho with other methods. a Evaluations of the kinase orthologs identified by KinOrtho (marked in red) and 21 other methods. The title of each plot represents the evaluation metric. The dotted line represents the Pareto frontier, which runs over the participants with the best efficiency (except KinOrtho). The arrow in the plot shows the optimal corner. Red square: KinOrtho full-length set; red diamond: KinOrtho domain-based set; red triangle: KinOrtho overlapping set. b The 100% stacked bar chart shows the overlap in kinase orthologs identified by KinOrtho versus other orthology inference methods (blue region); a dashed line indicates the average percentage of the overlaps (KinOrtho Full-length and KinOrtho Domain-based do not count). The orange region represents the percentage of orthologs only identified by KinOrtho; a dotted line indicates the average percentage of the blue and orange regions. The gray region shows the percentage of orthologs unique to the compared method. c The heat map represents the Jaccard similarity matrix among orthology inference methods. Method indices are shown on the left and top of the matrix

Next, a direct comparison between the pairs of orthologs identified by KinOrtho was performed alongside other methods to gain insights into overlapping predictions. In general, 35–60% of the orthologs identified by KinOrtho were also identified by other methods (blue bars in Fig. 2b). Besides, KinOrtho (overlapping set) consistently found orthologs not identified by other methods (orange bars in Fig. 2b). Several unique KinOrtho-defined human kinase orthologs, such as cyclin-dependent protein kinase (CDK) orthologs, are described in Supplementary Results and shown in Additional file 1: Figure S2–S4. On the other hand, KinOrtho consistently omitted at least 10% of the orthologs (average: 23.2%; gray bars in Fig. 2b) identified by other methods. This number is significantly reduced when considering KinOrtho full-length or domain-based sets alone (average: 14.3% and 17.2%, respectively; gray bars in Additional file 1: Figure S5), suggesting that KinOrtho eliminates putative ortholog sequences that lack the well-defined bilobal kinase domain associated with protein kinases. The ability of KinOrtho to delineate the orthologs based on the protein kinase domain against the orthologs based on other conserved domains is described in Supplementary Results and shown in Additional file 1: Figure S6. Additional details about the utility and benchmarking of the domain-based approach are discussed below. Finally, we generated a similarity heat map to quantify orthology predictions by KinOrtho and other methods (Fig. 2c). Similarities measured by the Jaccard similarity coefficient between two ortholog sets ranged from 25.3% (Reciprocal Smallest Distance (RSD) [61] vs. Orthologous Matrix (OMA) [39]) to 81.5% (SonicParanoid [40] vs. OrthoInspector [41]. The average similarity among all methods was 50.4%. Orthology results from Bidirectional Best Hits (BBH) [62] and two BBH- and graph-based methods, OrthoInspector and SonicParanoid, were found to have the most agreement with KinOrtho results (average similarities: 59.9%, 59.8%, and 58.7%, respectively).

Inferring functional associations using KinOrtho-based identification of kinase domain fusion and fission events

KinOrtho’s ability to find orthologs for individual domains allows identifying domain fusion and fission events for kinases with multiple kinase domains. In the human kinome, there are 13 kinases with two tandem kinase domains within the same polypeptide, many of which are functionally annotated phosphorylation targets of Mitogen Activated Protein Kinase (MAPK) signaling pathways. Figure 3a illustrates the four scenarios of finding domain-based orthologs for these 13 kinases: (1) tandem domains in one kinase match tandem domains in another kinase, (2) tandem domains in one kinase match tandem domains in another kinase in reverse order, (3) two domains from two human kinases match tandem domains in one kinase from another species, and (4) tandem domains in one human kinase match two domains from two kinases in different species. Traditional full-length BBH-based orthology inference methods do not have the resolution to distinguish between these scenarios. However, KinOrtho’s domain-based approach allows the definition of orthologs from all scenarios, thus identifying fusion and fission events in orthologous sequences.

Fig. 3
figure 3

Scenarios of a single protein with tandem kinase domains and examples of potential gene fusion and fission events. a Four scenarios of a single protein with tandem kinase domains and their domain-based orthologs. Arrows represent orthologous pairs. The number of cases for each scenario is shown in parentheses. b Example of Scenario 3 and phylogenetic tree analysis on the fusion event of TAOK1 orthologs and PIK3C2A orthologs. Human TAOK1’s and PIK3C2A’s domain-based orthologs in 245 species were aligned, concatenated (represented by a dotted line if these two domains are from different species), and used to build a phylogenetic tree. Species names are labeled at the leaves of the circular-mode phylogenetic tree. The leaves are colored according to the clade of the species (refer to legend). Black stars mark the species with a TAOK1-PIK3C2A fused gene. The time when potential fusion events occurred is indicated by red stars on the tree. c Example of Scenario 4 and analysis of the fission event of OBSCN orthologs. The domain-based orthologs of human OBSCN’s two kinase domains in 80 species were used to build a phylogenetic tree. Black triangles mark the species with fission events. Red triangles indicate the time when potential fission events occurred

Scenario 3 in Fig. 3a reflects potential domain fusion events. Although we identified 113 potential fusion events, we use TAOK1 and PIK3C2A as an example to illustrate how integrating evolutionary data with other contextual data (protein-protein interaction, co-expressions, and co-occurrence) can reveal potential functions for understudied kinases (Additional file 2). TAOK1, an understudied kinase, belongs to the STE20 family, while PIK3C2A belongs to the PI3K family. We found nine instances of potential domain fusion events between these two kinase domains (Fig. 3b). In addition, we identified 236 species with both TAOK1 and PIK3C2A domain-based orthologs in different kinases. We concatenated the sequences of TAOK1 orthologs and PIK3C2A orthologs for each species and then built a phylogenetic tree. We found that eight kinases with TAOK1 and PIK3C2A domains reside in the same clade of Nematodes. Based on this, we postulate two potential fusion events (indicated by red stars on the phylogenetic tree, Fig. 3b). Proteins involved in a fusion event usually belong to the same functional category [63]. As an example, TAOK1, an understudied kinase, shows high co-expression with PIK3C2A in 17,382 normal samples and 1376 cancer samples in the Genotype-Tissue Expression project [64] (GTEx, version 8) and the Cancer Dependency Portal [65] (DepMap, 20Q4), respectively. The correlations (Pearson correlation coefficient = 0.856 in normal samples and 0.612 in cancer samples) are among the top 0.15% of all kinase pairs (Additional file 1: Figure S7). The co-expressed patterns are conserved in A.aegypti, B.taurus, D.melanogaster, and S.mansoni (STRING [66], version 11.0), suggesting a possible physical interaction. Moreover, TAOK1 and PIK3C2A have been reported to be involved in the autophagy response [67]. Based on these observations, we predict a functional association between the understudied kinase TAOK1 and PIK3C2A in human cellular biology, perhaps involving communication between the membrane, where phospholipids are sensed, and the cytosol, where TAOK1 has known functions in relaying information to MAPK pathways.

We also analyzed cases in Scenario 4 for potential domain fission events. We found ten kinases in five species (ferrets, turkeys, Atlantic salmon, rainbow trout, and huchen) matching the tandem kinase domains in human Obscurin (OBSCN) kinase (Fig. 3c). The tandem kinase domain arrangement in OBSCN is conserved in 75 species (Scenario 1). In species where the tandem domains are encoded in two different proteins, we concatenated the domains and performed phylogenetic comparisons with species where the two domains are naturally fused. The concatenated sequences of Atlantic salmon, rainbow trout, and huchen occur in the same clade. Based on the phylogenetic tree, we estimate three kinase domain fission events (marked by red triangles in Fig. 3c). Although the functional significance of these fission events is unclear, the established role of OBSCN in eye development [68, 69] suggests a role for these events in the evolution of vision in these species [70,71,72,73].

Phylogenetic profile analysis reveals the evolutionary depth of human protein kinase conservation and enriched molecular functions across species

We next sought to classify human kinases based on conservation depth across species by building a phylogenetic profile of KinOrtho-defined orthologs. Figure 4a highlights a human kinase phylogenetic profile consisting of 558 human kinase domains and their orthologs across 561 clades. As expected, human kinase orthologs are barely present in bacteria, archaea, and viruses, except for the orthologs of eukaryotic-like protein kinases. Consistent with previous findings, four eukaryotic-like kinases, Protein adenylyltransferase SelO, mitochondrial (SELENOO; 3936 orthologs), AarF domain-containing protein kinase 1 (ADCK1; 3258 orthologs), Ketosamine-3-kinase (FN3KRP; 2234 orthologs), and Serine/threonine-protein kinase RIO1 (RIOK1; 1849 orthologs) have the most orthologs. In contrast, Casein kinase II subunit alpha 3 (CSNK2A3), Rhodopsin kinase GRK1 (GRK1), Putative serine/threonine-protein kinase PR-KY (PRKY), and Probable serine/threonine-protein kinase SIK1B (SIK1B) are “orphan” kinases with no orthologs based on KinOrtho’s stringent criteria. The phylogenetic profile also shows that the kinases in tyrosine kinase (TK), tyrosine kinase-like (TKL), and receptor guanylate cyclase (RGC) groups are mainly conserved in Metazoa (including mammals, reptiles, birds, fish, and protostomes), which is consistent with the findings of a previous study [74].

Fig. 4
figure 4

Cluster analysis of the phylogenetic profile of human kinases. a Phylogenetic profile of human kinases. X-axis: 558 human kinase domains, ordered by the kinase group and the number of orthologs identified by KinOrtho. Y-axis: 561 clades, ordered by classification. Each dot’s color represents the coverage of the human kinase ortholog in each clade. b Phylogenetic clusters of human kinases. The human kinase domains in the x-axis are in the order of clusters and the number of orthologs. The boundaries of each cluster and classification are highlighted in red. c Top 5 understudied kinases with the most orthologs in each cluster. The number of species in each classification is shown in parentheses

Many understudied kinases have escaped analysis due to weak conservation in model organisms. Based on the distribution of orthologs across different species, we organized human kinases into 10 clusters (Fig. 4b). The top 5 understudied kinases with the most orthologs in each cluster are highlighted in Fig. 4c. Kinases within each cluster are closely related (small Euclidean distance) with high co-occurrence with each other. Using this clustered phylogenetic profile, we sought to identify potentially conserved kinase-regulated biological functions across species. For example, because the kinases in Cluster 5 are highly conserved in Metazoa, we can hypothesize a role for these kinases in metazoan-specific biological functions. To this end, we performed Gene Ontology (GO) enrichment analyses using the GO annotations of all kinases as background. We identified 802 significantly enriched GO terms (false discovery rate (FDR) < 0.05), and the GO term is annotated for at least five human kinases in the cluster. The top three enriched GO terms for each cluster are shown in Table 1, and the entire list is shown in Additional file 3.

Table 1 GO term enrichment analysis on the phylogenetic clusters of human kinases

The orthologs of human kinases in Cluster 2 are present in most eukaryotic species. The most enriched GO term in Cluster 2 is a biological process term “cell cycle checkpoint” (GO:0000075), which encompasses a variety of DNA and spindle-assembly checkpoints, well-established control mechanisms that control progression through the eukaryotic cell cycle [75]. EKC/KEOPS complex subunit TP53RK (TP53RK), an understudied kinase with 1367 orthologs, plays a vital role in the cell cycle and G1 checkpoint control [76, 77]. However, this GO term is currently absent in both human TP53RK annotation and TP53RK ortholog annotation. Kinases in Cluster 9 are mostly present in mammals. Consistently, the cellular component term “acrosomal vesicle” (GO:0001669) is the most enriched GO term in Cluster 9. Acrosomal vesicles, components in the sperm’s head, contain enzymes essential for fertilization [78]. All members of testis-specific serine/threonine-kinases (TSSK) belong to Cluster 9, and they are all understudied kinases: TSSK1B, TSSK2, TSSK3, TSSK4, and TSSK6. Currently, TSSK1B, TSSK2, and TSSK4 are annotated with this GO term. Although TSSK3, TSSK6, and their orthologs lack this annotation, both TSSK3 and TSSK6 are highly expressed in testis (median Transcripts Per Million = 60.86 and 424.6, respectively) [64]. TSSK6 is also reported to be involved in the acrosome reaction and egg fertilization [79]. Therefore, based on the cluster analysis of the phylogenetic profile, we predict TSSK3 and TSSK6 function in acrosomal biology and vesicle localization.

Machine learning model to prioritize understudied kinases using KinOrtho and GO annotations

The human kinome contains several understudied kinases of unknown function. We next wanted to investigate if KinOrtho-defined orthologs, along with sequence similarities and GO annotations from different species, can be used to infer the functions of understudied kinases using “guilt-by-association” [80] and machine learning methods. To this end, we trained machine learning classifiers using orthology and functional annotations of well-studied human kinases to predict whether functional annotation could be transferred from orthologs in other species to human kinases. In brief, we built training sets using the manually curated GO annotations of well-studied kinases (Fig. 5). The input features of each training instance represent the GO annotation status of human kinase orthologs; the output shows whether the human kinase has this GO term annotation or not. Input features were weighted based on the sequence similarities between the human kinase and orthologs. The training sets contained 0.3 million instances with 730 GO terms for 393 well-studied kinases and their orthologs across 176 species. After class balancing and 10-fold cross-validation, random forest displayed high prediction accuracies among various machine learning methods attempted (90.9%, 92.1%, and 95.5% for the training sets of biological process, cellular component, and molecular function, respectively; Table 2).

Fig. 5
figure 5

Calculating the Novel Inferred Annotation Score (NIAS) for understudied kinases using the phylogenetic profile, GO annotations, and machine learning methods. The weight matrices represent the normalized sequence similarities between the human kinase and orthologs. The training sets and test sets show an example of a GO term (GO:0050896, response to stimulus) annotation status across well-studied human kinases, understudied human kinases, and orthologs. CV: cross-validation; IAS: Inferred Annotation Score; UAS: Unreviewed Annotation Score; NIAS: Novel Inferred Annotation Score

Next, we used the trained models (random forests) to predict the functions of understudied kinases. To this end, we constructed test sets in which understudied human kinases could be annotated based on the GO terms available for one or more of their orthologs. This resulted in 16 thousand instances with 2642 GO terms for 144 understudied kinases. Application of the pre-trained random forest models on the test sets resulted in 11,573 predictions of kinase-GO term pairs as present (Additional file 1: Table S2). Among these predicted annotations, 8933 predictions (77.2%) already existed in the UniProt as manually reviewed annotations, while 2640 predictions (22.8%) did not have UniProt annotations. Instead of referring to these 2640 predictions as false positives, we considered them as missing annotations. In fact, 1452 of them (55%) were found to be unreviewed electronic annotations from Ensembl [81], InterPro [82], the UniProt Consortium, or the GO Consortium. The remaining 1188 annotations, including 236 lowest-level GO term annotations, were novel inferred annotations (available in Additional file 4). By aggregating the prediction score of each novel inferred annotation, we calculated a Novel Inferred Annotation Score (NIAS) for each understudied human kinase (the last step in Fig. 5).

Our analysis reveals that Serine/threonine-protein kinase ULK4 (ULK4) has the highest NIAS among all understudied human kinases. It has 22 novel inferred annotations. Fifteen of them with prediction scores higher than 0.9 are inferred from mouse/rat Ulk4. Twelve of these inferred annotations with a high score are associated with neuronal function and brain development, such as “ventricular system development” (GO:0021591), “corpus callosum development” (GO:0022038), “neuronal stem cell division” (GO:0036445), and “GABAergic neuron differentiation” (GO:0097154). A role for human ULK4 in neuronal function and brain development has been suggested [83,84,85], and it is an unusual pseudokinase that binds to nucleotides in the absence of cations [86, 87]. Serine/threonine-protein kinase PAK 5 (PAK5), an understudied kinase with the second highest NIAS, has 12 novel inferred annotations. The association between PAK5 and “activation of MAPK activity” (GO:0000187), the GO term with the highest prediction score for PAK5, is also known from the literature, where these kinases act as upstream regulators of MAPK modules [88]. We also identified 80 understudied kinases with a NIAS of 0. We can still prioritize these proteins by Inferred Annotation Score (IAS, which aggregates all prediction scores regardless of existing unreviewed annotations; see Methods) for further manual curation or experimental validation. For example, the NIAS of Eukaryotic elongation factor 2 kinase (EEF2K) is 0. Kinases in this list include pseudokinases such as Serine/threonine-protein kinase H2 (PSKH2), which represent the “darkest” of kinases with little or no information across species and no functional biology currently reported [28].

Table 2 Performance of GO annotation prediction in each training set

Discussion

Here we map human kinase orthologs across diverse species by developing a kinase orthology inference method called KinOrtho. We demonstrate that KinOrtho performs better than existing orthology inference methods based on comparisons across standard benchmarking datasets and metrics. KinOrtho utilizes domain-based orthology inference to eliminate orthologs with no kinase domains, allowing researchers to focus on the functional domains of interest. KinOrtho’s query-based characteristic enables users to identify orthologs of specific kinases across thousands of species within a reasonable time. In contrast to orthologous groups provided by other methods, this approach provides one-to-one ortholog, in-paralog, and co-ortholog relationships, thereby revealing functional relationships and separating even the most closely related paralogous sequences.

While KinOrtho’s performance is better than existing methods based on metrics in the benchmarking dataset (Fig. 2a and Additional file 1: Figure S1), overlap in orthologous relationships defined by various methods in the benchmarking dataset is significantly low (only 29.9% similarity; Additional file 1: Figure S8), presumably because of the variability in orthology definition, methods used, or even potential genome assembly errors in the UniProt reference database. Thus, the interpretation of ortholog sets should be made with some caution. The Alliance of Genome Resources (AGR) has recently established orthologous relationships among humans and six model organisms: Caenorhabditis elegans, Drosophila melanogaster, Danio rerio, Mus musculus, Rattus norvegicus, and Saccharomyces cerevisiae [89]. The orthologous relationships in AGR are based on the consensus of seven orthology inference methods [38, 39, 41, 42, 81, 90, 91] and five databases [92,93,94,95,96]. Comparison of KinOrtho-defined human kinase orthologs with AGR-defined orthologs reveals nearly 70.5% similarity. The greatest difference in kinase orthology sets occurs in the CMGC and CAMK groups, presumably because of the deeper conservation of these kinases across taxa (Additional file 1: Figure S2).

In our previous study [97], we developed an annotation score (AS) for prioritizing understudied kinases based on existing knowledge stored in curated databases, such as mutations, pathways, expressions, and post-translational modifications (updated AS are available in Additional file 4). In this study, we propose a complementary NIAS for prioritizing understudied kinases based on missing knowledge (mainly their biological functions) inferred by machine learning methods. Because these two scores reflect different aspects of kinase annotation status, they should be used in conjunction when prioritizing understudied kinases. For example, the NIAS score can be informative when prioritizing understudied kinases based on information available from other organisms, while the AS can be helpful when prioritizing kinases based on curation status. Although we attempted to generate an aggregate score (AS’) by introducing NIAS into the original AS calculation (Additional file 4), the difference between AS and AS’ was not significant. Therefore, we recommend using AS and NIAS independently when prioritizing understudied kinases for experimental studies.

For illuminating understudied kinases, ion channels, G-protein-coupled receptors, or other protein families, a broader collection of manually curated biological functions from various species would be immensely helpful. Although we propose KinOrtho as a tool that can be generalized to a broad range of protein families, its query- and domain-based characteristics may result in lower sensitivity of small protein families with fewer orthologs. Moreover, a fair comparison between KinOrtho and other orthology inference methods cannot be made if the results are evaluated based on a tiny subset of benchmarking datasets. Therefore, to identify the orthologs of those protein families with few members, users are recommended to use KinOrtho in conjunction with other orthology inference methods.

Conclusions

In this study, we have developed an efficient query-based orthology inference method that combines full-length and domain-based orthology inference methods to comprehensively map human kinase orthologs across the tree of life. KinOrtho performed better than existing methods in a benchmarking dataset and identified putative domain fusion and fission events. We confirmed kinase-associated molecular functions enriched across species using phylogenetic profiles after identifying overlapping orthologous relationships from full-length and domain-based pipelines. Finally, we prioritized and inferred functions of understudied human kinases using KinOrtho-defined orthology and GO annotations as features in machine learning. Our studies serve as a conceptual starting point for investigating understudied human kinase biology by leveraging evolutionary information. This is exemplified, but by no means limited to, pharmacologically tractable protein families such as the protein kinases.

Methods

KinOrtho workflow

KinOrtho is a query-based, graph-based, and combinatorial orthology inference method. It consists of six main steps (Fig. 1):

  1. 1

    Homology search for the query sequences of interest against reference proteomes

  2. 2

    Building Basic Local Alignment Search Tool [98] (BLAST) databases, containing full-length and domain-based databases

  3. 3

    All-vs-all homology search for the rebuilt databases

  4. 4

    Orthology inference and determining orthologs, paralogs, and co-orthologs

  5. 5

    Cluster analysis and filtering out the orthologous relationships between two proteins in different clusters or the clusters without query sequences

  6. 6

    Combining the results of full-length and domain-based methods

Query sequences

The query sequences used in this study were based on a broader mapping of human kinome composition performed recently [19]. We collected 545 human kinases, containing 483 eukaryotic protein kinases (ePKs), 19 eukaryotic-like protein kinases (PKLs), and 43 atypical protein kinases (aPKs). Based on a manually curated eukaryotic protein kinase sequence profile [27], Pfam [99], and Conserved Domain Database [100], we manually annotated and collected 558 kinase domain sequences from the 545 human kinases. More information about the domain name, domain boundary, and kinase group are available in Additional file 5.

Reference proteomes

We applied KinOrtho to the UniProt reference proteomes (release 2019_11), which are chosen to broadly represent the taxonomic diversity [59]. It is also the most well-curated and extensive collection of entire proteomes across the tree of life. The reference proteomes contain 18,870,318 protein sequences spanning the tree of life (Additional file 1: Table S1). To benchmark the performance of KinOrtho, we applied KinOrtho to the Quest for Orthologs (QfO) reference proteomes 2018 [60], which contains 885,338 protein sequences from 48 eukaryotic species, 82,507 sequences from 23 bacterial species, and 17,317 sequences from 7 archaea species (Additional file 1: Table S1). To benchmark KinOrtho based on a domain-based kinase classification, we also applied KinOrtho to the model organisms in KinBase [101], which includes 15 species and 7597 kinase sequences (Additional file 1: Table S3).

Homology search and building BLAST databases

Before performing a time-consuming all-vs-all homology search for all reference proteomes, KinOrtho looks for potential homologs of query sequences by screening the reference proteomes using NCBI BLAST+ [102] (version 2.7.1) with default settings, except for the E-value threshold. Referring to other orthology inference methods, such as OrthoMCL-DB [103] and PANTHER [90], KinOrtho uses 10\(^{-5}\) as a default E-value threshold for BLAST search. This threshold has been demonstrated to balance between false-positive and false-negative rates [104]. An additional experiment showed that choosing the default E-value threshold of BLAST+ (10\(^{1}\)) yielded similar sequence comparisons with choosing 10\(^{-5}\) in the benchmarking dataset (Additional file 1: Table S4) but not significantly increased the performance based on the six benchmarking metrics shown in Additional file 1: Figure S9. Then, KinOrtho builds two sets of BLAST databases as new reference proteomes (“kinomes” hereafter) based on full-length and domain-based query sequences. To build a full-length kinome for each proteome, KinOrtho keeps the sequences in the BLAST result, generates a new sequence file, and then applies the “makeblastdb” function provided by NCBI BLAST+. To build a domain-based kinome for each proteome, KinOrtho generates a new sequence file and builds a BLAST database based on the BLAST hit region (between “sstart” and “send”) of the sequences in the BLAST result. After building a set of full-length kinomes and a set of domain-based kinomes, KinOrtho performs an all-vs-all homology search for each set using the E-value threshold (10\(^{-5}\)) mentioned above.

Orthology inference

The orthologous relationships identified by KinOrtho include orthologs, in-paralogs, and co-orthologs. KinOrtho defines a pair of one-to-one orthologs using the Bidirectional Best Hits (BBH) method [62]. A pair of in-paralogs is defined as two protein sequences with a higher similarity score (BLAST bit score) in the same species than the homologous sequences in other species. A pair of co-orthologs is defined based on the following two criteria: (i) ortholog of one sequence is the in-paralog of the other, or (ii) in-paralog of each sequence are a pair of orthologs. Using all orthologous relationships as edges, KinOrtho builds two graphs by connecting the kinases in the full-length and domain-based kinomes, respectively.

Cluster analysis

To identify orthologous groups, KinOrtho performs the Markov Cluster (MCL) Algorithm [44] (version 14.137) for the two graphs. MCL is a fast, unsupervised clustering method using a simulation of flow in graphs. It has been utilized in other graph-based orthology inference methods [35, 105] and detecting protein families [106]. In the orthologous relationship graphs, KinOrtho assigns the negative logarithm of the E-value as a weight for each edge. If an E-value is reported 0 by the BLAST program, KinOrtho assigns an arbitrary E-value of 10-200. Considering the systematic differences among species, such as nucleotide composition bias, KinOrtho normalizes the weights based on the method used by OrthoMCL [35]. For the orthologs or co-orthologs between any two species, KinOrtho normalizes the weights by dividing them by the average weight of all the orthologs or co-orthologs between the two species. For in-paralogs, the weights are divided by the average weight of all in-paralogs in each kinome. When performing MCL after setting a normalized weight for each edge, KinOrtho chooses 1.5 as a default inflation value to control the cluster tightness. This value is the best inflation value to balance the sensitivity and selectivity for functional classification [35]. Each protein is assigned to a cluster, after which KinOrtho refines orthologous relationships by filtering out the relationships between two proteins in different clusters or the clusters without query sequences.

Combining results

In the last step, KinOrtho combines the orthologous relationships from full-length and domain-based results. We define an “overlapping orthologous relationship” as a relationship present in both full-length and domain-based results. For example, in Scenario 1 of Fig. 3a, if A1B1 (meaning the pair of A’s 1st kinase domain and B’s 1st kinase domain) and A2B2 are domain-based orthologs and A–B is a full-length ortholog pair, then both A1B1 and A2B2 are defined as overlapping orthologous relationships. However, in Scenario 4 of Fig. 3a, if A1B1 and A2C1 are domain-based orthologs and A-B is a full-length ortholog pair, only A1B1 is an overlapping orthologous relationship, but A2C1 is not. Because non-overlapping relationships are also informative in domain-based orthology analyses, KinOrtho keeps all the results from full-length and domain-based methods.

Comparison of orthology inference methods

There are 21 public orthology inference results available at Ortholog Benchmarking Webservice [60] (Additional file 1: Table S5; similarity matrices are shown in Additional file 1: Figure S8). These datasets generated by full-length orthology inference methods contain the kinase relationships and all other proteins’ relationships in the QfO reference proteomes 2018. To make the orthologs identified by KinOrtho and those identified by the 21 methods comparable, we performed the following preprocessing for the compared datasets. First, because KinOrtho defines orthologs based on the BBH method, we only kept one-to-one relationships in the compared datasets. Second, to identify the kinase orthologs in the compared datasets, we only kept the relationships with at least one protein found in the ortholog relationships identified by KinOrtho (either full-length or domain-based approach). Finally, to identify human kinase orthologs in the compared datasets, we only kept the relationships involving human kinases. The numbers of remaining proteins and ortholog relationships are shown in Additional file 1: Table S5. We submitted these 21 preprocessed one-to-one kinase ortholog datasets to Ortholog Benchmarking Webservice for performance evaluation.

Protein domain annotation

This study employed the annotations in Pfam [99] (version 32.0) as known protein domain annotations. There are 305,472 proteins with at least one orthologous relationship identified by KinOrtho from the UniProt reference proteomes. In these proteins, 197,327 of them have 398,313 domain annotations, and 149,080 have at least one of the two major protein kinase domains: “Pkinase” and “Pkinase_Tyr”.

Phylogenetic analysis

The phylogenetic tree analyses in this study were utilized to investigate the gene fusion and fission events of proteins with tandem kinase domains (Fig. 3). First, we obtained the domain-based orthologs of the kinase domains of interest. To identify gene fusion events, we used TAOK1 and PIK3C2A as an example. There are 245 species (including humans) having both TAOK1 and PIK3C2A orthologs. To identify gene fission events, we used OBSCN as an example. There are 80 species with orthologs for each of the (tandem) kinase domains in OBSCN. Second, we aligned those two sets of kinase domain orthologs separately by Multiple Alignment using Fast Fourier Transform [107] (MAFFT, version 7.407). We used options “L-INS-i”, “–localpair”, and “–maxiterate 10000” to generate more accurate alignments. Third, the two kinase domains’ alignments were concatenated as a single alignment file: the first kinase domain’s orthologs followed by the second kinase domain’s orthologs for each species. Fourth, we used IQ-TREE [108] with options “-m TEST” (standard model), “-bb 1000” (bootstrap replicates), and “-alrt 1000” (approximate likelihood ratio test) to build consensus trees. Finally, phylogenetic trees were visualized using Interactive Tree Of Life [109] (iTOL, version 4).

Cluster and enrichment analyses on phylogenetic profile

The phylogenetic profile of human kinases in this study was built upon 558 human kinase domains and their orthologs identified by KinOrtho’s both full-length and domain-based approaches across the 17,134 species in the UniProt reference proteomes (Fig. 4). We manually grouped these species into 561 clades based on the NCBI Taxonomy database [110]. Each clade contains at least five species; each clade in eukaryotes, bacteria, archaea, or viruses contains at most 41, 240, 35, or 2287 species, respectively. Then we calculated an ortholog coverage for each kinase-clade pair by dividing the number of orthologs by the total number of species in each clade. Based on this phylogenetic profile (a kinase-clade matrix), in addition to ordering the human kinase domains by their groups defined by KinBase (Fig. 4a), we clustered them using k-means clustering [111] (Fig. 4b). We used an R package “factoextra” [112] (version 1.0.7) with options “kmeans” (clustering function), “nstart = 50” (initial random centroids), “nboot = 500” (number of bootstrap samples), and “gap_stat” (compute gap statistic [113]) to determine the optimal number of clusters. We found that the optimal number of clusters was 10 (Additional file 1: Figure S10).

We performed Gene Ontology (GO) enrichment analyses using the GO annotations of all human kinases and their orthologs in each cluster. We extracted all three GO domains (biological process, cellular component, and molecular function) annotations from UniProt [114] (release 2019_11). We then expanded the lowest-level GO terms to all-levels GO terms for every kinase based on the hierarchical controlled vocabulary defined by the GO Consortium [115, 116]. Because the GO terms are annotated at the protein level instead of the domain level, we removed duplicate annotations if a protein’s tandem kinase domains are in the same cluster. When performing enrichment analyses, we chose all kinase orthologs’ annotations as background, used Fisher’s exact test, and then controlled the FDR by the Benjamini-Hochberg procedure [117]. A significantly enriched GO term is defined based on whether the FDR < 0.05, and at least five human kinases in the cluster have an annotation. If multiple GO terms in the same lineage are enriched in a cluster, we only keep the lowest-level term.

Novel Inferred Annotation Score

The Novel Inferred Annotation Score (NIAS) proposed in this study is used to estimate the number of potential annotations we can infer from orthologous relationships to annotate understudied human kinases, which are defined by the NIH Illuminating the Druggable Genome program (IDG) [13] (Additional file 5, last updated on June 11, 2019). The scoring system was built upon machine learning-based annotation inference models using overlapping orthologous relationships and GO annotations (Fig. 5).

First, to prevent prediction models from being biased by unreviewed data, we only used the manually reviewed (non-electronic) annotations of all well-studied kinases and their orthologs to build training sets for the three GO domains. The GO terms annotated for less than 100 kinases were excluded from the training sets. Each instance of a kinase-GO term pair in the training sets showed the values of output and input features based on the GO annotation status (1 for present and 0 for absent) of a well-studied human kinase and its orthologs, respectively. For example, in Fig. 5, an instance shows that mouse’s and rat’s Egfr genes have a GO term “response to stimulus” (GO:0050896) annotation, but human EGFR has not. The training sets consisted of 0.3 million instances with 730 GO terms for 393 well-studied kinases and their orthologs across 176 species. To prioritize the annotations from different species, we further introduced a sequence similarity for each ortholog. Sequence similarities are defined by the average of the normalized weights generated when we built orthologous relationship graphs.

After building the three training sets, we built annotation inference models for each training set using logistic regression, support vector machine (SVM), and random forest implemented by WEKA [118]. All models were trained with class balancing (using instance reweighting) and 10-fold cross-validation to prevent overfitting. The three machine learning methods’ prediction performances for each training set are shown in Table 2. Because random forest showed the best performance among the three training sets, we used the annotation inference models built by random forest to predict missing GO annotations for understudied human kinases. When building test sets, we used the GO terms annotated for at least one understudied kinase ortholog. There were 11,503 instances in the biological process test set, 1810 instances in the cellular component test set, and 2507 instances in the molecular function test set. After applying the random forest models, the confusion matrices built upon the annotation inference result and existing annotations are shown in Additional file 1: Table S2. We collected those GO annotations currently absent but predicted as present for each understudied kinase and then only kept the lowest-level term in each GO term lineage. The summation of each prediction score calculated by random forest is defined as an Inferred Annotation Score (IAS):

$$\begin{aligned} S(k_g)& = {} {\left\{ \begin{array}{ll} P(k_g) &{} \text {if }P(k_g) > 0.5\text { and} \\ &{} \hbox { for all}\ P(k_{g'}) <= 0.5 \\ 0 &{} \text {otherwise} \end{array}\right. } \\ IAS(k)& = {} \sum _{g=1}^{G}S(k_g), \end{aligned}$$

where \(P(k_g)\) is the prediction score of gth GO term annotation for understudied kinase k, \(g=\{1,2,\ldots ,G\}\), and \(g'\) represents any descendant of gth GO term. Because the predicted annotations may include existing unreviewed electronic annotations, we defined the NIAS of an understudied kinase by subtracting the unreviewed annotation score (UAS) from IAS:

$$\begin{aligned} U(k_g)& = {} {\left\{ \begin{array}{ll} 1 &{} \text {if }k_g\text { is an unreviewed annotation} \\ 0 &{} \text {otherwise} \end{array}\right. }\\ UAS(k)& = {} \sum _{g=1}^{G}S(k_g)U(k_g)\\ NIAS(k)& = {} IAS(k)-UAS(k) \end{aligned}$$