Background

Plants have evolved a surveillance system that is continuously monitoring a broad range of stimuli, including tissue damage or altered developmental processes, or establishing a symbiotic interaction. They commonly use pattern recognition receptors (PRR) to perceive 1) microbe-, pathogen-, or damage-associated molecular patterns (MAMP/PAMP/DAMP); 2) virulence factors; 3) secreted proteins; and 4) processed peptides directly or indirectly with specific molecular signatures [1]. These membrane-bound PRR are receptor-like kinases (RLK) or receptor-like proteins (RLP). The two receptor classes are located on the plant plasma membrane and are known as modular transmembrane proteins [2]. In contrast, the intracellular resistance proteins such as the nucleotide binding site-leucine-rich repeat proteins (NB-LRR or NBS-LRR) are encoded by the so-called resistance genes (R genes) and have been targeted to elicit a resistance response to pathogens [3]. These intracellular resistance genes are out of the scope of this study.

R genes are broadly categorized into eight classes based on their motif organization and membrane domains [4]. Following this classification system and depending on their protein structure, three classes belong to the RLK and RLP categories: the genes for resistance to Cladosporium fulvum, Cf-9, Cf-4, and Cf-2 (class III); the gene for resistance to Xanthomonas oryzae race 6, Xa21 (class IV); and the Verticillium wilt resistance genes Ve1 and Ve2 (class V) [4]. Proteins such as the polygalacturonase-inhibiting protein (PGIP) also play an important role for certain defense proteins even though they are not directly involved in pathogen recognition or activation of any defense genes [4]. In contrast, the PRRs confer a broad-spectrum resistance and are modular transmembrane RLK or RLP proteins, and their recognition is based on a set of conserved molecules [5]. Most characterized RLK/RLP are involved in defense/resistance processes in plants (Additional file 1: Table S1) or are actively involved in cell growth and development, such as floral organ abscission (A. thaliana – HAESA) [6], meristem development (A. thaliana – CLAVATA) [7], self-incompatibility (MPLK) [8], abscission (CST) [9], stomatal patterning (TMM) [10], and embryonic patterning (SSP) [11].

RLK and RLP are structurally identified by the presence of motifs involved in the protein transport system, such as the signal peptide. The transmembrane helices anchor the RLK/RLP to the plasma membrane [12]. The extracellular domains, or ectodomains, are functional regions located outside of the cell that initiate contact with other molecules or surfaces and lead to signal transduction [2, 3, 5, 13, 14, 15, 16, 17]. Among the ectodomains, the LRR are a component of N-glycosylated plant proteins, and many N-glycosylation acceptor sequences are present in all ectodomains [18]. The C (Carbohydrate-binding protein domain)/G (S-receptor-like or S-locus)/L (L-like lectin domain), LysM (Lysin Motif), and malectin classes of lectins are key players in plant immunity [19]. The C/G/L lectins are omnipresent in plants [20]. LysM receptors are the most studied lectins, and 15 RLK-LysM and five RLP-LysM have been functionally characterized [21]. These proteins are known to play an essential role in plant defense signaling and in inducing symbiosis. Among these proteins are NFR1 (Nod factor receptor 1) [22], NFR5 (Nod factor receptor 5) [22], LYK3 (putative Medicago ortholog of NFR1) [23], and NFP (LysM protein controlling Nod factor perception) [24], which recognize lipochitooligosaccharide Nod factors [25]. Malectin-like domain-containing and FERONIA protein (FER or protein Sirene) receptors are recognized as critical regulators of cell growth and appear to function as surveyors of cell-wall status [26].

Other ectodomain families include the PR-5 family (Pathogenesis-related protein 5), composed of thaumatin-like proteins (TLPs) that are responsive to biotic and abiotic stress and are widely studied in plants [27]. Knowledge of cell-wall-associated kinases (the “WAK” family) and their roles in signal transduction and pathogen stress responses arose from studies of the model plant species A. thaliana [28, 29]. The hallmark of a WAK is the presence of epidermal growth factor-like repeats (“EGF”) in the extracellular domain [2, 3]. In contrast to the WAK, the evolution of the tumor necrosis factor/tumor necrosis factor receptor superfamily (“TNF/TNFR”) is complicated and not well understood [30], and even though the TNFR domain is conserved in dicots and monocots, this domain family has distinctive characteristics among taxonomic families [31]. The stress-antifung domain family (known as DUF26 – Domain of Unknown Function) belongs to the cysteine-rich receptor-like protein kinases that form one of the largest groups of RLK in plants [32]. The structural details of RLK and RLP have been reviewed by different authors [3, 13, 14, 33, 34].

RLKs and RLPs typically display high target specificity and selectivity [3, 35]. This provides an opportunity to understand how plants differentiate and distinguish favorable and harmful stimuli, as well as how various receptors coordinate their roles under variable environmental conditions [3]. The RLK family belongs to the protein kinase superfamily that has expanded in the flowering plant lineage, in part through recent duplications. In particular, the flowering plant protein kinase repertoire, or “kinome” (a term coined by Manning et al., 2002 [36] to describe the catalog of protein kinases in a genome), is significantly larger (600 to 2500 members) than the kinome in other eukaryotes. This large variation among organisms is principally due to the expansion and contraction of a few families; more than 60% of the kinome belongs to the receptor-like kinase/Pelle flowering plants family [37, 38]. The kinase domains can be divided into RD and non-RD families based on the presence or absence of an arginine (R) located before a catalytic aspartate (D) residue [39]. Non-RD kinases lack the strong autophosphorylation activities of RD kinases and display lower enzymatic activities [40]. Non-RD kinases are associated with innate immune receptors that recognize conserved microbial signatures [39]. Computational and comprehensive tools related to the prediction and analysis of resistance genes, such as RLKs or RLPs, could help plant breeders/geneticists identify candidate resistance genes and facilitate the understanding of new resistance sources and mechanisms, which may be useful for crop improvement [41].

The RLPs function with RLKs to regulate development and defense responses. The similarities between the structure of RLPs and RLKs and their functional relationships suggest that RLKs with novel domain configurations may have evolved through fusions of an RLP and RLK [35, 42]. Since most RLP are membrane-spanning proteins, they most likely are integral components of extracellular signaling networks. Fusions between ancestral RLP and RLK/Pelle kinases could, therefore, have led to novel signal transduction pathways by linking ligand perception to different downstream kinase mediated signaling pathways. Alternatively, fusions may simply have occurred between RLP and RLK/Pelle that were already components of the same signaling networks [35].

In recent years, more than 20 studies to computationally identify cytoplasmic resistant proteins (mostly NBS-LRR) from different plant species have been published [43, 44]. Due to the diversity of extracellular receptor domains, which makes them harder to characterize compared to cytoplasmic resistant proteins, efforts to identify and characterize RLKs/RLPs computationally have been limited (see review by Sekhwal and colleagues [43]). These genomic studies targeted many plant species [45], including Arabidopsis [46], Arabidopsis and rice (Oryza sativa L.) [47], grape (Vitis vinifera L.) [48], and tomato (Solanum lycopersicum (L.) H. Karst) [49], among others. To date, the strategies used similar computational approaches, but no standardized computational tools or annotation criteria were followed. Thus, the results from different studies are not necessarily comparable [43]. Furthermore, the establishment of robust, independent, and highly diverse data with multiple examples is required to evaluate the performance of the strategies and tools published [50, 51].

Recently, legume genomics tools have expanded because of advancements in high-throughput sequencing and genotyping technologies, resulting in reference genome sequences for many legume crops. This allowed the identification of structural variations and enhanced the efficiency and resolution of large-scale genetic mapping and marker-trait association studies for legumes [52, 53]. Legumes are considered the second most important family of crop plants after the grass family based on their economic relevance. Approximately 27% of world crop production is composed of grain legumes, providing 33% of human dietary protein, while pasture and forage legumes are fundamental for animal feed [54]. To date, no RLK and RLP comparative genomic analyses have been published that explore the genomes of soybean (Glycine max (L.) Merrill; GM) [55], common bean (Phaseolus vulgaris L.; PV) [56], barrel medic (Medicago truncatula L.; MT) [57], mungbean (Vigna radiata (L.) R. Wilczek; VR) [58], cowpea (Vigna unguiculata L. Walp; VU) [59], Adzuki bean (Vigna angularis var. Angularis; VA) [60], and pigeonpea (Cajanus cajan L.; CC) [61].

This study describes the computational identification of receptor-like proteins, receptor-like kinase proteins, and probable resistance RLK-nonRD proteins in legumes using probabilistic methods [62,63,64]. The computational identification of these plasma membrane receptors is based on the prediction of the presence/absence of a signal peptide, transmembrane helix motif/s, and extracellular and intracellular domains. A domain combination was defined as the presence of two or more domains in a protein, and these combinations were evaluated to illustrate the domain mixture. The performance of the proposed strategy was evaluated with experimentally-validated RLK (n = 63) and RLP (n = 27) proteins (Additional file 1: Table S1), and the RLK/RLP identification was applied to protein datasets that belong to the seven legume genomes mentioned above. Also, three non-legume model plant species were included to enrich the analysis due to the high quality of their genomic annotation. These species are Arabidopsis thaliana (L.) Heynh. (AT) [65]; tomato (S. lycopersicum; SL) [49]; and common grape (V. vinifera; VV) [66], which represents the basal rosid lineage and has ancestral karyotypes that facilitate comparisons across major eurosids [66, 67].

Results

Performance prediction of RLK and RLP

The independent performance evaluation of the computational strategy identified 56 out of a total 63 RLK proteins as true RLK, and the remaining proteins were not detected and considered as false negatives. In contrast, 23 out of the total 27 RLP proteins were classified as true RLP, and the remaining proteins were not detected and classified as false negatives. Lastly, none of the 96 proteins belonging to the cytoplasmic R gene classes were classified as RLKs or RLPs (Additional file 2: Table S2). Based on these results, the performance predictive measures were calculated (Table 1).
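The counts above can be turned into the performance measures with a few lines of Python. This is a minimal sketch, not the authors' script: the function name `performance` is hypothetical, and the false-positive counts are set to zero purely for illustration, since the text states only that none of the 96 cytoplasmic proteins were called and does not give cross-class misclassifications. The negative totals assume the one-vs-rest design described in Methods (27 RLP + 96 cytoplasmic for the RLK evaluation; 63 RLK + 96 cytoplasmic for the RLP evaluation).

```python
from math import sqrt


def performance(tp, fn, tn, fp):
    """Return (sensitivity, specificity, MCC) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when any marginal is zero; report 0.0 in that case.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, mcc


# RLK evaluation: 56 of 63 detected, so 7 false negatives.
# fp=0 is an illustrative assumption, not a value taken from Table 1.
rlk = performance(tp=56, fn=7, tn=27 + 96, fp=0)

# RLP evaluation: 23 of 27 detected, so 4 false negatives (fp=0 assumed).
rlp = performance(tp=23, fn=4, tn=63 + 96, fp=0)

print("RLK sensitivity/specificity/MCC:", rlk)
print("RLP sensitivity/specificity/MCC:", rlp)
```

The values actually reported by the study are those in Table 1; the sketch only shows how such values follow from the four counts.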

Table 1 Performance evaluation

This evaluation established a minimum set of conditions to classify the RLK or RLP protein classes. RLK- and RLP-predicted proteins must have at least one transmembrane helix and at least one extracellular domain (LRR, L/C/G-Lectin, LysM, PR5K, thaumatin, WAK, malectin, EGF, or stress-Antifung). Additionally, for RLK, the presence of an intracellular Pkinase domain is required, and for RLP, the absence of Pkinase and NB-ARC domains is required; these logical conditions are stated in Fig. 1.

Fig. 1 Computational strategy followed to identify RLK and RLP
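The minimum conditions stated in the text can be written as a small decision function. The sketch below is an assumption-laden simplification: the `Protein` record, the domain labels, and the `classify` function are hypothetical names, and the sketch encodes only the conditions given in the paragraph above, not the full workflow of Fig. 1 (signal-peptide handling and the RLCK branch are not modeled).

```python
from dataclasses import dataclass, field

# Ectodomain classes listed in the text; the label spellings are illustrative.
ECTODOMAINS = {
    "LRR", "L-Lectin", "C-Lectin", "G-Lectin", "LysM", "PR5K",
    "Thaumatin", "WAK", "Malectin", "EGF", "Stress-antifung",
}


@dataclass
class Protein:
    tm_helices: int                      # number of predicted transmembrane helices
    domains: set = field(default_factory=set)  # predicted domain labels


def classify(protein):
    """Apply the stated minimum conditions and return 'RLK', 'RLP' or 'other'."""
    has_tm = protein.tm_helices >= 1
    has_ectodomain = bool(protein.domains & ECTODOMAINS)
    if not (has_tm and has_ectodomain):
        return "other"
    if "Pkinase" in protein.domains:
        return "RLK"                     # kinase present: receptor-like kinase
    if "NB-ARC" not in protein.domains:
        return "RLP"                     # no kinase and no NB-ARC: receptor-like protein
    return "other"                       # NB-ARC present: excluded from the RLP class


print(classify(Protein(1, {"LRR", "Pkinase"})))   # RLK
print(classify(Protein(1, {"LysM"})))             # RLP
print(classify(Protein(1, {"LRR", "NB-ARC"})))    # other
```

Keeping the rule as a pure function of predicted features makes it easy to rerun the classification when a predictor threshold changes.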

Summary of predicted RLK and RLP

Based on the number of RLKs and RLPs identified among all species, about 3% or less of the total proteins per species belong to these classes of membrane-bound receptor-like proteins. Specifically, the RLK percentage ranged from 0.9 to 2.3% for legumes and from 1.4 to 1.7% for non-legumes. The RLP percentage ranged from 0.3 to 0.7% for legumes and from 0.5 to 0.6% for non-legume species. The species analysis evaluated 447,948 proteins, with 351,491 from legumes and 96,457 from non-legumes. Almost 9.4% of the legume and 9.7% of the non-legume predicted proteins had a predicted signal peptide, and 4.3% of legumes and 4.4% of non-legumes had at least one transmembrane helix above the threshold. For the subset of proteins without a predicted signal peptide, 16.6% of legumes and 17.9% of non-legumes reached the TMHMM cut-off. Among the total number of proteins evaluated, 1.9% of legumes and 1.5% of non-legumes belong to the RLK class of proteins, and 0.5% of legumes and 0.5% of non-legumes belong to the RLP class (Table 2). Also, the number of RLK proteins identified as non-RD, which are potentially kinases associated with innate immune receptors, is reported in the Table 2 footnote (Additional file 3: Table S3); the proteins identified by species are listed in Additional file 4: Table S4 for RLK and in Additional file 5: Table S5 for RLP.

Table 2 Summary of total number of RLK and RLP identified across legumes/non-legumes

Based on the Pfam clans and families of domains of known function used to filter the identified RLKs and RLPs, the computational strategy allowed for the identification of extra domains present in the predicted proteins (Additional file 6: Table S6). For the RLK proteins reported in Table 3, the approach identified, besides a Pkinase domain, up to four combinations of functional domains (located extra- or intracellularly). Almost all the classical domains reported by different authors [3, 13, 14, 33, 34] for RLKs and RLPs were identified. The exception was the TNFR domain, whose presence the in-house scripts (https://github.com/drestmont/plant_rlk_rlp/) did not identify in any of the datasets; a review of the approach showed that the TNFR domains predicted by Pfam 31, HMMER, and PfamScan did not reach the minimum cut-offs in the prediction process followed. All species evaluated had proteins with at least one extra domain (Additional file 7: Table S7).

Table 3 Receptor-like kinases identified by extracellular domains across the species

The G-lectin class of proteins reported in Table 3 is typically composed of three domains (B-lectin/S-locus/PAN); however, different combinations of these three domains were identified. C-lectin is a rare domain, and only soybean showed more than one C-lectin protein. The WAK class is typically composed of two domain classes (WAK/EGF), and such proteins possessed one or the other domain. The dual domain combination LRR/Malectin is the most frequent among the atypical dual combinations. Also, atypical domain combinations with a low frequency among the species were identified. Among the legumes, these were the B-lectin/PR5K combination in GM, MT, VA, and VU and a three-domain combination of B-lectin/S-locus/WAK only in CC, MT, PV, VA, and VU. Among non-legumes, the uncommon combinations PAN/WAK and PAN/S-locus/WAK were only found in VV. The only uncommon domain combination found in both legumes/non-legumes was S-locus/WAK in VV and VR.
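Tallying domain combinations of this kind is straightforward once each protein has a list of predicted domains. The following is a minimal sketch under stated assumptions: the function `combination_counts`, the input layout (protein ID mapped to domain labels), and the toy proteins are all hypothetical and do not reproduce the in-house scripts or any value in Table 3.

```python
from collections import Counter


def combination_counts(proteins):
    """Count combinations of two or more non-kinase domains per protein.

    `proteins` maps a protein ID to an iterable of predicted domain labels.
    """
    counts = Counter()
    for domains in proteins.values():
        # Ignore the kinase domain and duplicates; sort so that the
        # same combination always yields the same key.
        others = sorted(set(domains) - {"Pkinase"})
        if len(others) >= 2:
            counts["/".join(others)] += 1
    return counts


# Toy input, invented for illustration only.
toy = {
    "p1": ["LRR", "Malectin", "Pkinase"],
    "p2": ["Malectin", "LRR", "Pkinase"],
    "p3": ["B-lectin", "S-locus", "WAK", "Pkinase"],
    "p4": ["LRR", "Pkinase"],
}
counts = combination_counts(toy)
print(counts.most_common())
```

Running the same tally per species would give the kind of species-by-combination summary discussed in the text.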

A four-domain combination, consisting of B-lectin/S-locus/PAN/WAK domains, was present in the GM, MT, PV, SL, VA, VR, VU, and VV species. Across all legume/non-legume species, the LRR ectodomain class was the most frequent domain per species. The computational classification strategy also discovered RLK proteins with no other domains and some proteins with additional domains beyond the signal peptide, transmembrane helix, and Pkinase domains. The RLCK class comprises kinases without a signal peptide but with a transmembrane helix. RLCKs without another plasma membrane attachment domain were not predicted (Table 3).

For the RLP extracellular domain identification and domain combinations reported in Table 4, the computational approach allowed the identification of up to three possible combinations of additional functional domains (which could be located extra- or intracellularly) in the proteins evaluated. All of these correspond to the typical combinations reported in Additional file 7: Table S7, such as the G-lectin (B-lectin/S-locus/PAN) present in legumes/non-legumes, the classic WAK/EGF only present in CC and VV (legume/non-legume), and the LRR/Malectin present in all species evaluated. However, the three cases mentioned were of a low frequency compared with other domains, such as LRR or Stress-antifung. As in RLK, the most abundant ectodomain in RLP for all species was the LRR, and no RLP proteins contained a C-lectin or TNFR domain.

Table 4 Receptor-like proteins identified by extracellular domains across the species

Summary of the presence and prevalence of functional domains

The results of the identification process for RLK and RLP are summarized in Fig. 2, and the specific domains that belong to the clans and families (Additional file 6: Table S6, Additional file 7: Table S7, and Additional file 8: Table S8) are reported in Tables 5 and 6. Table 7 shows the domains identified in the RLK and RLP proteins (Additional file 1: Table S1) used to evaluate the performance of the plasma membrane identification process.

Fig. 2 Summary of the extracellular domains identified in RLK/RLP. The domains in this figure summarize the domains and the combinations identified. A. Classical RLK/RLP protein structure. B. Ectodomains identified that are also reported by the scientific community (Tables 1 and 2). C. Ectodomain combinations identified in RLK/RLP. In B and C, only the ectodomains are represented; in the RLK cases, all proteins must have an intracellular Pkinase.

Table 5 Summary of domains present on the RLK proteins predicted
Table 6 Summary of domains present on the RLP proteins predicted
Table 7 Summary of domains identified in the validation dataset

The domains in this figure summarize the domains and the combinations identified. A. Classical RLK/RLP protein structure. B. Ectodomains identified that are also reported by the scientific community (Additional file 7: Table S7 and Additional file 8: Table S8). C. Ectodomain combinations identified in RLK/RLP. In B and C, only the ectodomains are represented; in the RLK cases, all proteins must have an intracellular Pkinase.

Table 5 shows the domains identified in the predicted RLK, and Table 6 shows the domains identified in the predicted RLP. In the target domains (domains classically reported as present in RLK and RLP proteins) identified on the experimentally-validated RLK and RLP proteins (Additional file 1: Table S1), almost all of the domains were identified for the RLKs with the exception of the C-Lectin and TNFR domains. Also, two additional domains (DUF3403 and CL0384) were found in the sequences of the proteins evaluated. For the evaluated RLPs, only domains belonging to LRR and LysM were identified. Regarding the ectodomain classes reported for RLKs and RLPs (Table A1), the expected domains were identified using the strategy implemented in this study (Table 7).

Among the predicted RLKs, 125 Pfam domains (Table 5 and Additional file 9: Table S9) were classified, with 35 domains (Table 5) belonging to the “target domains” (Additional file 6: Table S6 and Additional file 7: Table S7). The remaining domains are included in Additional file 9: Table S9. Apart from the Pkinase domains, which are cytoplasmically located, the other domains could be present either extra- or intracellularly. Comparing the domains identified in the predicted RLKs and RLPs against the target Pfam domains (Additional file 6: Table S6) for the identification of extra/intracellular domains, 10 out of 35 Pkinase domains, 7 out of 12 LRR domains, 1 out of 43 L-Lectin domains, 1 out of 1 C-Lectin domains, 5 out of 8 G-Lectin domains, 1 out of 3 LysM domains, 1 out of 1 PR5K domain, 3 out of 3 WAK domains, 2 out of 2 Malectin domains, 3 out of 18 EGF domains, and 1 out of 1 Stress-antifung domain were identified. Also, with the exception of the TNFR, all families and domains reported in Table 1 were identified in all 10 species. A total of 90 non-target domains, which are additional domains that differ from those classically reported in RLK and RLP proteins, were identified (Additional file 9: Table S9). The most prevalent were RCC1_2, DUF3403, Ribonuc_2-5A, NAF, DUF3660, and Glyco_hydro_18, all of which were present in at least eight species (legumes/non-legumes); the remaining domains (84 in total) were present in two or fewer species.

For the entire set of domains identified in the RLPs, 71 domains (Table 6 and Additional file 10: Table S10) were identified; 33 (Table 6) belong to the “target domains” (Additional file 6: Table S6 and Additional file 7: Table S7), and the remaining domains are reported in Additional file 10: Table S10. All domains present in this dataset are extracellularly located. Comparing the domains identified with the total set of Pfam (version 31) clans and families evaluated (Additional file 6: Table S6) to identify extra/intracellular domains (Fig. 1), the RLK and RLP predicted for the 10 species evaluated contained 8 out of 12 LRR domains, 8 out of 43 L-Lectin domains, 5 out of 8 G-Lectin domains, 1 out of 3 LysM domains, 1 out of 1 PR5K domain, 3 out of 3 WAK domains, 2 out of 2 Malectin domains, 4 out of 18 EGF domains, and 1 out of 1 Stress-antifung domain. Also, with the exception of C-Lectin and the TNFR family, all families and domains reported in Additional file 7: Table S7 were identified. Of the non-target domains (38 in total; Additional file 10: Table S10), the most prevalent were DUF2854, Glyco_hydro_32N, DUF3357, Alliinase_C, Galactosyl_T, zf-RING_2, PA, Peptidase_M8, and Exostosin, all of which were present in at least six species; the remaining domains (29 in total) were present in three or fewer legume/non-legume species.

Discussion

The performance of the computational approach was evaluated using RLK and RLP proteins previously shown to be associated with biotic resistance. The validation dataset (Additional file 1: Table S1) is well suited to this purpose because the data come from diverse species and are independent, experimentally-validated, and non-redundant. Based on the legume/non-legume results, the RLK proteins are more diverse in terms of domains compared to RLP proteins (Table 7). The sensitivity measure suggests that the process was able to classify a protein as RLK/RLP with only a few false negatives. The specificity measure evaluated the ability of the approach to correctly classify a protein as non-RLK/RLP. The combined results indicate that the approach produces few false positives. Based on the Matthews correlation coefficient, the performance evaluation reports a very strong positive value (0.91), which suggests the approach is well suited for RLK/RLP identification [50].

As for the RLK/RLP prediction requirements described in Fig. 1, the prediction and identification of RLK using the logical sum of conditions was a relatively simple workflow. The Pkinase domain is required for RLK proteins, in contrast with the logical sum of conditions that a protein needs to meet to be classified as an RLP. Interestingly, for the RLP class, apart from the conditions that proteins must meet to belong to it, one factor that improves the confidence of the prediction and reduces false positives is the exclusion of cytoplasmic resistance genes, which could be confounded with RLP. This is accomplished by excluding proteins with an NB-ARC domain.

Of the total plasma membrane proteins reported in Table 2, G. max had the largest set of RLKs and RLPs compared with all other species, a result most probably due to its recent whole genome duplication about 13 MYA [68, 69]. Such duplications are the main mechanism for the expansion of the protein kinase superfamily in plants [37]. Regarding the RLK-nonRD class, with the exception of the non-legume AT (8.6%), the other legume/non-legume species (CC (13.6%), GM (12.0%), MT (18.3%), PV (14.7%), SL (17.4%), VA (14.6%), VR (14.7%), VU (15.9%), and VV (13.3%)) have 12% or more of their RLKs with this kinase domain modification. This RLK subset is interesting because it has been previously found that most PRR kinases or PRR-associated kinases have a change in a conserved arginine (R) located adjacent to the key catalytic aspartate (D) (the so-called RD motif) that facilitates phosphotransfer [39, 70].
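The RD/non-RD distinction described above comes down to inspecting the residue immediately before the catalytic aspartate. The sketch below illustrates the idea only; it is not the procedure used in this study. The regular expression is a deliberately simplified stand-in for the kinase catalytic-loop signature (an assumption made for illustration), the function name `rd_class` is hypothetical, and the example sequences are invented fragments.

```python
import re

# Simplified catalytic-loop pattern (illustrative assumption): one captured
# residue, the catalytic Asp, a hydrophobic residue, Lys, two residues, Asn.
CATALYTIC_LOOP = re.compile(r"(.)D[LIVMFY]K..N")


def rd_class(kinase_seq):
    """Return 'RD', 'non-RD', or None when no catalytic-loop match is found."""
    match = CATALYTIC_LOOP.search(kinase_seq.upper())
    if match is None:
        return None
    # An arginine directly before the catalytic aspartate defines an RD kinase.
    return "RD" if match.group(1) == "R" else "non-RD"


print(rd_class("GVIHRDLKPENILL"))  # arginine precedes the catalytic Asp
print(rd_class("GIIHCDIKSSNVLL"))  # cysteine precedes the catalytic Asp
print(rd_class("MAAAA"))           # no catalytic-loop match
```

In practice the catalytic loop would be located from a kinase-domain alignment rather than a bare pattern search, which avoids spurious matches elsewhere in the sequence.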

Compared with RLKs, the majority of RLCKs reported in Table 3 only contain a Ser/Thr-specific cytoplasmic kinase domain, corresponding to previously reported results [71]. However, non-target domains were identified, contrary to the additional domains previously reported, which suggests that apart from the Pkinase, the RLCK could have similar intracellular domains as the ectodomains present in the RLKs, such as leucine rich repeat (LRR), lectin, epidermal growth factor (EGF), a domain of unknown function (DUF), U-BOX, and WD40 [71]. With the exception of the non-legume VV (4.7%), all other species (AT (16.4%), CC (22%), GM (18.9%), MT (15.9%), PV (15.9%), SL (16.6%), VA (18.4%), VR (17.7%), and VU (14.82%)) had about 15% or more of the RLKs classified as RLCKs. This is important because a number of RLCKs have emerged as central components linking PRR to downstream defenses. These PRRs are involved in transducing signals from extracellular ligands by phospho-relay [72]; several Arabidopsis RLCKs are associated with PRRs and play important roles in PTI [73].

The number of RLKs per species reported is proportionally similar to the 1 to 2% of total gene models per species reported in previous studies, where RLKs normally represented about 60% or more of protein kinases [37, 38]. The range of RLK proteins identified in this study was 450–1867 for legume proteins and 444–556 for non-legume proteins. The legumes GM (1867 proteins) and MT (1062 proteins) showed the highest number of RLKs. In contrast, the range for legume RLP proteins was 141–466 proteins and 160–170 for non-legume proteins. As with RLKs, the legumes GM (466 proteins) and MT (363 proteins) showed the highest number of RLPs.

Given that the RLK receptor configuration arises from a fusion between an RLP and an RLCK [74], it could be expected that RLPs have similar ectodomains beyond the LRR and LysM domains that are experimentally reported for RLPs. The presence of other extracellular domains, which are mainly associated with RLKs, was explored to identify probable RLPs containing an L/C/G-lectin, TNFR, thaumatin, WAK, malectin, EGF, or stress-antifung domain. This approach was based on the similarities reported between the two plasma membrane receptor classes, which suggest a consistent functional relationship and the possibility of novel domain configurations created by their fusion [35]. This approach discovered that for legumes (0.29 to 0.69%)/non-legumes (0.46 to 0.64%), less than 1% of the proteins present in the genomes belong to the RLP class.

Even though the TNFR domains belonging to both plasma membrane classes were not identified, a detailed evaluation showed that in the prediction process step (Pfam31, HMMER3.1, and PfamScan.pl), the domain match was considered insignificant because the bit score fell below the software threshold. However, RLK proteins have been predicted as RLKs with a TNFR extracellular domain and reported in the SMART database in an earlier study [75] for AT (2 proteins), GM (4 proteins), SL (2 proteins), and VV (3 proteins). Interestingly, with the exception of the VV proteins, the eight other proteins were identified as RLKs either with non-target domains or only the Pkinase domain. Other missed domains could include L-Lectin and TNFR for RLPs. This exploration of missing domains suggests that including tools such as SMART could add precision to the predictions in some instances.

Regarding the diverse domain combinations identified for RLK and RLP, RLKs in particular vary greatly in their extracellular domain organization. A variety of extracellular domain combinations are present in RLKs [16], such as LRR/Malectin; the S-locus/WAK present only in the legume VA and the non-legume VV; the B-lectin/PR5K present only in the legumes GM, MT, VA, and VU; the B-lectin/S-locus/WAK present only in the legumes CC, MT, PV, VA, and VU; and the B-lectin/S-locus/PAN/WAK shared among the legumes GM, MT, PV, VA, and VU, and non-legumes VV and SL. The only uncommon ectodomain combination identified in RLP was LRR/Malectin, which was present in all species evaluated. This suggests the RLK domain combinations are more diverse compared with RLP combinations. Some RLK domain combinations were only reported for legumes, while RLP combinations were present among legumes and non-legumes.

The diversity of the Pfam domains to characterize various RLK and RLP as input criteria for classification is an advantage over using only target specific motifs [76]. Diversity of the Pfam domains was most evident in the RLK class for the Pkinases which possessed 10 domains/families. Among the 10 Pkinase domains/families, WaaY in MT; APH in PV, VA, VR and VU; and Pkinase_C in CC and GM were exclusively present in the legumes. For the 7 RLK-LRR, the LRR_2 was exclusively present in the legumes GM, VA, and VU. For other family domains, the EGF domain was only present in the legumes GM and MT. In contrast, for the ectodomains present in RLPs, the LRR_9 from the LRR clan was only present in CC; the L-lectin clan with the LPRY domain and the PAN clan with the PAN_4 domain were exclusive to all the legumes. Interestingly, those clans are collectively judged likely to be homologous and are valuable because they are built manually and integrate a diverse variety of information sources that allow the transfer of structural and functional information between families and improving the prediction of structure and function of unknown families [77]. The classification of non-target domains present for RLK and RLP among the species demonstrated that none of the most prevalent domains identified (present in 10-species) in both plasma membrane classes was common, suggesting a bias related to the kind of plasma membrane relation. This suggests that further analysis could be done to explore probable correlations among the domains evaluated.

Conclusions

The identification of RLK and RLP, based on different publicly available machine-learning tools for the prediction of different biological features, allowed this study to propose a simple, logical, and effective set of conditions. The validation demonstrated that the approach is highly effective in identifying RLK/RLP proteins. The domain organization of RLK was more diverse compared with the domain organization of RLP. More L-lectin domain diversity exists in RLP (8 domains) compared with RLK (1 domain). Specifically, for the RLK, the non-RD represented 8 to 18%, and the RLCK represented about 15% of this class of plasma membrane proteins per species evaluated. Regarding the legume/non-legume comparison, G. max contains a larger set of RLK (1867 proteins) and RLP (466 proteins) compared with the other legume/non-legume species. Across all species, the LRR ectodomain class was the most frequent domain per species. C-lectin is a rare domain commonly reported only once per genome, and only GM showed more than one such protein, which could be related to the recent whole genome duplication. For RLKs/RLPs among legumes/non-legumes, the LRR/Malectin domain combination is the most frequent among the dual combinations.

Methods

Independent evaluation of predictive performance

To evaluate the RLK and RLP prediction strategy, we tested its ability to correctly classify or reject RLK, RLP, and non-RLK/RLP proteins. Prediction performance was assessed on three evaluation sets with known outcomes supported by experimental evidence (Additional file 1: Table S1). Sensitivity (range: 0 to 1), specificity (range: 0 to 1), and the Matthews correlation coefficient, “MCC” (range: −1 to 1), were selected as performance measures [50]. In the evaluation datasets, the experimentally validated proteins of each class served as the true positive data (RLP and RLK) and the true negative data (cytoplasmic resistance genes) [50]. The cytoplasmic resistance genes can have ectodomains similar to those of RLK/RLP, but they alone carry an NB-ARC domain [78]. Each dataset was processed independently with CD-HIT [79] at a 90% identity threshold to obtain a non-redundant version without identical or highly similar entries [50]. The predictive analysis of RLK/RLP was applied to the non-redundant sets: for the RLK evaluation, the RLP and “cytoplasmic resistance genes” sets were used as true negative proteins; for the RLP evaluation, the RLK and “cytoplasmic resistance genes” sets were used as true negative proteins.
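The three performance measures follow directly from the confusion-matrix counts. The sketch below is illustrative only: the function name and the example counts are hypothetical and are not taken from the study's scripts or results.

```python
import math


def confusion_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, MCC) from confusion-matrix counts.

    tp/fn: true-class proteins predicted as the class / missed.
    tn/fp: other-class proteins correctly rejected / wrongly accepted.
    """
    # Sensitivity: fraction of true members of the class that were recovered.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: fraction of non-members that were correctly rejected.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # MCC ranges from -1 (total disagreement) to 1 (perfect prediction).
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, mcc


# Hypothetical counts for an RLK evaluation, for illustration only.
sens, spec, mcc = confusion_metrics(tp=60, fp=4, tn=110, fn=6)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} MCC={mcc:.3f}")
```

For the RLK evaluation, tn and fp would be counted over the RLP and cytoplasmic resistance gene sets; for the RLP evaluation, over the RLK and cytoplasmic resistance gene sets.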

Genome dataset

To evaluate the proposed RLK/RLP identification strategy, three datasets were used (RLK, RLP, and cytoplasmic resistance genes). All the datasets contain experimentally validated proteins from 34 plant species (Additional file 1: Table S1) and were extracted from the UniProt database [80]. The RLK set contained 66 proteins, the RLP set contained 28 proteins, and the set of cytoplasmic resistance genes (non-RLK/RLP) contained 96 proteins (Additional file 1: Table S1) [3, 43, 72, 73, 81]. To identify probable RLK and RLP, the analysis focused on seven legumes and three non-legumes (outgroup set). V. vinifera was included because it represents the basal rosid lineage and has a close-to-ancestral karyotype that facilitates comparisons across major eurosids [66, 67]. The non-legumes Arabidopsis and S. lycopersicum were included because they are model plants that allow conservation and divergence to be evaluated. The protein information of the legumes/non-legumes is reported in Table 8.

Table 8 Summary of genomes

Computational identification of RLK and RLP

The computational strategy for RLK and RLP discovery is described in Fig. 1. The presence/absence of a signal peptide and of transmembrane helices was predicted with SignalP 4.0 [62] and TMHMM 2 [63], respectively. The SignalP cut-offs for eukaryotes (euk) were 0.45 for the SignalP-noTM networks and 0.50 for the SignalP-TM networks [62]. The selection criteria for TMHMM 2 were the identification of one or more transmembrane helices together with an expected number of amino acids in transmembrane helices (ExpAA) above the threshold; if this value is larger than 18, the protein is very likely to be a transmembrane protein or to have a signal peptide [63]. In both prediction steps, the cut-off values reported are the program defaults.
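The TMHMM criterion above can be sketched as a small filter. This is a minimal illustration, not the study's in-house script: it assumes the tab-separated one-line-per-protein ("short") TMHMM output with key=value fields such as ExpAA and PredHel, and the function names and example line are hypothetical.

```python
def parse_tmhmm_short(line):
    """Parse one line of TMHMM short-format output (assumed layout).

    Assumed layout: <id>\tlen=..\tExpAA=..\tFirst60=..\tPredHel=..\tTopology=..
    Returns (sequence id, ExpAA, number of predicted helices).
    """
    fields = line.rstrip("\n").split("\t")
    seq_id = fields[0]
    # Turn "key=value" fields into a dictionary for lookup by name.
    values = dict(field.split("=", 1) for field in fields[1:])
    return seq_id, float(values["ExpAA"]), int(values["PredHel"])


def is_transmembrane(exp_aa, pred_hel, exp_aa_cutoff=18.0):
    """At least one predicted helix and ExpAA above the cut-off of 18."""
    return pred_hel >= 1 and exp_aa > exp_aa_cutoff


# Made-up example line, for illustration only.
example = "prot1\tlen=980\tExpAA=23.41\tFirst60=0.02\tPredHel=1\tTopology=o620-642i"
seq_id, exp_aa, pred_hel = parse_tmhmm_short(example)
print(seq_id, is_transmembrane(exp_aa, pred_hel))
```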

The PfamScan (pfamscan.pl) script [82] was used to annotate the protein sequences against the Pfam 31.0 library using HMMER 3.1b1 [64]. The criteria for assigning a protein to each modular organization class followed PfamScan: when overlapping matches within a clan are detected, only the most significant match, that is, the one with the lowest E-value within the clan, is reported [83]. In some cases, proteins belonged to two domain classes, and this redundant information was removed in the counting process. For the Pfam-A searches, the default domain cut-offs were used; these trusted cut-offs are defined by the Pfam curators and vary for each domain or family [64].
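The clan-overlap rule can be illustrated with a simple greedy filter. This is only a sketch of the rule as described above, not PfamScan's actual implementation; the Hit class, the function names, and the example hits (including their clan labels and coordinates) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    family: str           # Pfam family name, e.g. "LRR_8"
    clan: Optional[str]   # clan label, or None if the family has no clan
    start: int            # envelope start on the protein
    end: int              # envelope end on the protein
    evalue: float


def overlaps(a: Hit, b: Hit) -> bool:
    """True if the two hits share at least one residue position."""
    return a.start <= b.end and b.start <= a.end


def resolve_clan_overlaps(hits: List[Hit]) -> List[Hit]:
    """Keep, among overlapping hits of the same clan, the lowest E-value hit."""
    kept: List[Hit] = []
    # Visit hits from most to least significant; drop a hit if it overlaps
    # an already-kept hit that belongs to the same clan.
    for hit in sorted(hits, key=lambda h: h.evalue):
        clash = any(
            hit.clan is not None and hit.clan == k.clan and overlaps(hit, k)
            for k in kept
        )
        if not clash:
            kept.append(hit)
    return sorted(kept, key=lambda h: h.start)


# Illustrative hits only; clan labels are placeholders.
hits = [
    Hit("LRR_8", "CL_LRR", 100, 160, 1e-12),
    Hit("LRR_1", "CL_LRR", 120, 142, 1e-3),
    Hit("Pkinase", "CL_PKinase", 600, 870, 1e-50),
]
print([h.family for h in resolve_clan_overlaps(hits)])
```

Hits from different clans, or from families without a clan, are never removed by this filter, so a protein can still carry both an ectodomain and a Pkinase domain.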

The PfamScan output was filtered using in-house scripts (https://github.com/drestmont/plant_rlk_rlp/) to identify RLK/RLP and their structural domains. The modular organization domains (Additional file 7: Table S7) are defined in the Pfam database [84] as profiles and clans (labelled: CL); a clan is a group of profiles that share a common evolutionary ancestor [82]. The in-house script includes 134 Pfam domains representing the extra domains and the Pkinase reported in Additional file 7: Table S7. They are considered “target domains” for this research and are reported in Additional file 6: Table S6 and Additional file 8: Table S8. The target clan or domain Pfam IDs are reported in Table 9.

Table 9 Target domains for the classification of RLK/RLP

The identification approach follows this logic (logical operators: and, or, and not) (Fig. 1). For RLK: “presence/absence Signal peptide” and “transmembrane helix (at least one)” and “Pkinase domain/s” and “Extracellular domain/s: LRR or L-Lectin or C-Lectin or G-Lectin or LysM or PR5K or TNFR or WAK or Malectin or EGF or Stress-Antifung” and not “NB-ARC domains”. For RLP: “presence/absence Signal peptide” and “transmembrane helix (at least one)” and “Extracellular domain/s: LRR or L-Lectin or C-Lectin or G-Lectin or LysM or PR5K or TNFR or WAK or Malectin or EGF or Stress-Antifung” and not “Pkinase domain/s” and not “NB-ARC domains”.

A summary of the domain and family prevalence among species was obtained based on the RLK/RLP identified in the evaluation set and in the species explored. The frequency analysis covered both the “experimentally-validated protein datasets” (Additional file 1: Table S1) and the proteins identified in the species evaluated.

After the RLK proteins of each species were classified, potential non-RD proteins were identified. The entire set of Pkinase domain sequences was broken into subsets using the start and end domain coordinates reported by PfamScan. The MEME command-line tool [86] was used to identify the RD and non-RD motif sites with the following parameters: -mod oops -maxw 10 -nmotifs 4 -maxsize 6,000,000. The reported motif sites were then classified as RD ([H][R][D]) or non-RD ([H][^R][D]) motifs (regex notation). The kinome was identified by annotating the whole set of proteins per species using pfamscan.pl, and the proteins with Pkinase domains were filtered (Table 2 – footnote and Additional file 3: Table S3).
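The Boolean rules and the RD/non-RD regular expressions described in this section can be sketched as follows. This is an illustrative reading of the logic, not the in-house script: the domain-class labels stand in for the Pfam IDs of Table 9, and the function names and example motif sites are hypothetical.

```python
import re

# Ectodomain classes named in the identification logic (labels, not Pfam IDs).
ECTODOMAINS = {
    "LRR", "L-Lectin", "C-Lectin", "G-Lectin", "LysM", "PR5K",
    "TNFR", "WAK", "Malectin", "EGF", "Stress-Antifung",
}


def classify(domain_classes, has_tm_helix):
    """Return "RLK", "RLP", or "other" for one protein.

    The signal peptide may be present or absent, so it does not constrain
    the decision and is not an argument here.
    """
    domains = set(domain_classes)
    if not has_tm_helix:               # at least one transmembrane helix
        return "other"
    if "NB-ARC" in domains:            # and not NB-ARC
        return "other"
    if not domains & ECTODOMAINS:      # at least one extracellular domain
        return "other"
    return "RLK" if "Pkinase" in domains else "RLP"


# Regex patterns exactly as given in the text.
RD_PATTERN = re.compile(r"[H][R][D]")
NON_RD_PATTERN = re.compile(r"[H][^R][D]")


def rd_class(motif_site):
    """Label a motif site reported by MEME as RD, non-RD, or unassigned."""
    if RD_PATTERN.search(motif_site):
        return "RD"
    if NON_RD_PATTERN.search(motif_site):
        return "non-RD"
    return "unassigned"


print(classify({"LRR", "Pkinase"}, has_tm_helix=True))
print(rd_class("LVHRDLKPEN"), rd_class("IIHCDIKPEN"))
```

Under this reading, a protein with a Pkinase domain but no listed ectodomain falls into "other" and would be handled separately from the RLK rule.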