Background

Glycoside hydrolase 9 (GH9) endoglucanases utilize water (EC 3.x.y.z) to cleave the glycosidic (1 → 4) or (1 → 3) bonds between repeated monomeric β(D)-glucopyranose units of cellulose, and comprise sequences from all major kingdoms of life [1, 2]. GH9 endoglucanases in land plants were previously clustered into classes A and B on the basis of the presence/absence of transmembrane (TM) and/or signal peptide (SP) subregions [1, 2]. The abundantly present amorphous cellulose is amenable to enzymatic digestion, and is the de facto substrate for these enzymes. However, an editing/modifying function for crystalline cellulose has been ascribed to class A endoglucanases, either exclusively or in association with the cellulosome [3,4,5]. The discovery and further characterization of a carbohydrate binding module (CBM49) at the C-termini of previously annotated GH9 endoglucanases (classes A and B) in Solanum lycopersicum, Oryza sativa, Arabidopsis thaliana, and Nicotiana tabacum conferred, on this family, catalytic competency for crystalline cellulose [6,7,8]. The hydrogen-bond stabilized crystalline cellulose is the preferred substrate for bacteria, fungi, archaea, and protists, organisms which predate the emergence of green land plants by several millions of years [9,10,11,12,13,14]. The discovery, therefore, that a subset of plant GH9 endoglucanases could utilize crystalline cellulose as its cognate substrate raises fundamental questions not only on the evolution and ancestry of plant GH9 endoglucanases, but also on the functional relevance of an additional hydrolase with a hitherto novel spectrum of catalytic activity.

Cellulose is a straight chain polymer of repeating units of β(1 → 4) linked D-glucopyranose residues and consists of microcrystalline (I α , I β ) and amorphous (I α am, I β am) regions (Fig. 1a and b). This heterogeneous distribution is dictated by the presence of a rich inter- and intra-fibrillar hydrogen bond network. Whilst the paucity of hydrogen bonds in the amorphous regions facilitates enzymatic cleavage, the ordered structure of the microcrystalline regions imposes constraints on the activity profile of plant GH9 endoglucanases. Natural cellulose is rarely pure (Gossypium spp., 90%), and is frequently found in association with other carbohydrates (hemicellulose) and/or other macromolecules (lipids, proteins). The presence of these complexes would also imply, reciprocally, the existence of mixed function endo- and exo-glucanases acting in tandem with biosynthetic catalysts to modulate the composition of the encompassing cell wall matrix/capsule/coat [15,16,17]. Observations by several investigators suggest a correlation between exhibited function and the occurrence of sequence homology or manifested enzymatic activity. Thus, despite the proximity of divergence between multicellular green algae and primitive land plants 470 − 480 Million years ago (Mya), homologous GH9 endoglucanase sequences are either completely absent or at best partial and fragmented in unicellular members (Chlamydomonas reinhardtii, Volvox carteri) [16, 17]. In contrast, bacteria (≅3200 − 3950 Mya), archaea (≅390 − 1350 Mya), protists (≅2000 − 3000 Mya), fungi (≅1000 − 1500 Mya), and some animals (180 − 670 Mya) possess not just sequences with ascribable GH9 endoglucanase activity on crystalline cellulose, but also a demonstrable and relevant function (Table 1) [18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40]. 
These include modulation of sporulation (Dictyostelium spp., clostridiales, bacillales), host-pathogen interactions (fungi, nematodes, protists, plants), repair and survival (Euryarchaea), and prevention of desiccation (bacteria, Dictyostelium spp.) [15, 41,42,43,44,45,46,47,48,49]. The presence of GH9 endoglucanase genes in some animals (marine invertebrates, termites, arthropods, parasitic and saprophytic nematodes), in the absence of demonstrable function, was postulated to have arisen during phases of co-infection with gastrointestinal and oral microbiota [15, 42, 44, 45, 50,51,52,53,54]. However, the confirmed presence in numerous other animals, similarity in substrate and reaction chemistry, and sequence conservation, along with supporting laboratory data, have refuted much of this horizontal mode of gene transfer [15, 41, 42, 44, 45, 55,56,57]. Davison and Blaxter suggested a single origin of GH9 genes based on monophyly in the phylogenetic tree and conserved intron positions [55].

Fig. 1

Taxonomic distribution and analysis of the GH9 domain in putative endoglucanase sequences. a Molecular structure of cellulose with repeating units of D-glucopyranose linked by a β(1 → 4) glycosidic bond. The liberated mono- or oligosaccharides either retain the β-hydroxyl group (retaining), or are inverted (α-hydroxyl) after transformation, b Generic reaction mechanism of hydrolytic GH9 endoglucanase (EC 3.2.1.4) mediated transformation of cellulose into simpler oligo- and/or mono-saccharides, c Alignment compatible sequences of GH9 domains from putative GH9 endoglucanases across all taxa (n1 = 607). Abbreviations: GH9, glycoside hydrolase 9; EC, enzyme commission

Table 1 Literature based divergence rates of taxa utilized for calibrating the time trees

In land plants (Viridiplantae), the activity profile of GH9 endoglucanases on cellulose correlates, in part, with their distribution as well as their purported roles in growth, development, flowering, and seed germination [16]. The carbohydrate binding modules/domains (n = 64) are sequences 40 − 200 aa in length and, despite being intrinsically non-catalytic, can facilitate the hydrolytic cleavage of the glycosidic linkage [47]. Unlike the C-terminally localized CBM49 of plant GH9 endoglucanases, different CBMs favouring the activity on crystalline cellulose in bacteria, fungi, protists, animals, and possibly archaea and green algae are distributed throughout the length of the sequence [16]. The presence of one or more TM regions also suggests that, at least in plants, cellulose metabolism may occur in clusters of enzymes (biosynthetic, degrading) and be localized at the membrane itself [4, 5]. The presence of signal peptide regions, in contrast, suggests that these enzymes may be secreted and digest cellulose extracellularly. Such a mechanism might benefit fungal pathogens of plants, may be deployed by termites, and may participate in glucose extraction in ruminants as well [15, 42, 44, 48]. The proportion of sequences that exhibit class B and C activity is subject to much debate. Whilst simple sequence similarity suggests a preponderance of class B members, complex classification schema using hidden Markov models (HMM) and artificial neural networks (ANN) indicate a marginally greater number of putative class C GH9 endoglucanases in primary transcript data from sequenced land plants [16, 58,59,60].

The potential importance of class C enzymes in biomass conversion notwithstanding, a paradigm shift in the chemical nature of cellulose, the inconsistencies between the numbers of predicted and observed members, and a conserved reaction chemistry in extant non-plant taxa suggest that plant class C GH9 endoglucanases may predate classes A and B enzymes [16, 58,59,60,61]. Here, we attempt to resolve some of these queries by investigating the origins, evolution, and subsequent divergence of the GH9 domain in putative plant endoglucanase sequences, with particular emphasis on the contribution of class C members. The role of the aromatic (W/Y/F) and polar uncharged (S/T/N/Q) residues is critical to the functioning of endoglucanases in the presence and absence of well-defined CBMs and, in the presence of low complexity regions, their incorporation into the GH9 domain might constitute the only measure of approximating the CBM49 [62,63,64]. These residues, despite being non-catalytic themselves, have been shown to confer on the encompassing enzymes the capacity to discriminate between related ligands (cellulose/X, X = {xylose, lignin, chitin}; β-1,3/β-1,4), to alter, in some cases, the binding affinity for a cognate substrate, to contribute to processivity and thermal stability, and, interestingly, to introduce catalytic competency [62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78]. We utilize a combination of phylogenetic analysis, pattern approximation, identification, distribution analysis, and residue mapping of the CBM49 to investigate the emergence of crystalline cellulose digesting activity in land plants. Finally, we complement these analyses by examining the presence and distribution of transmembrane and signal peptide regions in vascular land plants, and the possible routes by which endoglucanase sequences with putative class C activity could contribute to the emergence of sequences with novel functionality.

Methods

Collation, annotation, and domain extraction of GH9 endoglucanases

Sequences of putative GH9 endoglucanases were downloaded from the publicly available databases National Center for Biotechnology Information (NCBI; http://www.ncbi.nlm.nih.gov) and Carbohydrate-active enzymes (CAZy; http://www.cazy.org/) [16, 79, 80]. Sequences of green land plants (Viridiplantae) utilized for this analysis were downloaded from Phytozome (https://phytozome.jgi.doe.gov/pz/portal.html), extensively curated, and classified into classes A, B, and C as described previously [16]. Annotation for non-plant GH9 endoglucanases was in accordance with the schema adopted by dbCAN (Carbohydrate enzyme annotation; http://csbl.bmb.uga.edu/dbCAN) [81]. The pooled sequences were filtered on the basis of their contribution to a compatible multiple sequence alignment (MSA) and the presence of a single GH9 domain as determined by MEGA7.0 (Molecular evolutionary genetic analysis, local installation) and the SMART (Simple modular architecture research tool) server [82,83,84]. Exclusion criteria for this preliminary data were: a) an indeterminable MSA, b) the complete absence of a demonstrable GH9 domain, c) more than one GH9 domain ((GH9)x : x > 1) in the same sequence, and d) presence of a concomitant GH domain other than GH9 ((GH9 ∧ GHx) : x ∈ [1, 8] ∪ [10, 130]). Amino acids at the start and end positions of the GH9 domains were noted and the domains extracted (n1) using in-house developed PERL scripts (Additional file 1: Text S1, Additional file 2: Text S3, and Additional file 3: Text S4). The final set of compatible sequences of the GH9 domains (n1A), pattern selected GH9 domains (n1B, n1C), pattern selected and GH9 encompassing full length sequences (n2), and CBM49/CBM49-like sequences of land plants (n3X = n LPSX ; X = {A, B, C}) comprised the datasets utilized in this study. The distinct and delineable CBM49 from putative class C GH9 endoglucanases was similarly isolated and comprised the dataset (n3C = n LPSC ) (Additional file 4: Text S2). 
The amino acid content of the extricated GH9 and CBM49 domains was assessed using PIR (Protein Information Resource, http://pir.georgetown.edu) and categorized on the basis of side chain content into those with hydrophobic side chains (HSC), aromatic amino acids (AAA), polar uncharged (PUC), polar charged acidic (PCA), and polar charged basic (PCB). The GH9 domains were used for phylogenetic analysis and time tree estimation (n1A), CBM49 was utilized for pattern analysis and motif approximation (n3C), and CBM49-like full length sequences from plant and non-plant taxa were utilized for assessing relevant bioinformatics indices (n1B, n1C) (Additional file 1: Text S1, Additional file 4: Text S2, Additional file 2: Text S3 and Additional file 3: Text S4).
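The side-chain categorization described above can be illustrated with a minimal Python sketch. It is not the PIR tool used in the study. The PUC, PCA, and AAA residue sets follow the residue lists given elsewhere in the text, whereas the HSC and PCB sets, along with the function name `side_chain_profile`, are assumptions made purely for illustration.

```python
from collections import Counter

# Side-chain classes. AAA, PUC and PCA follow residue lists given in the text;
# the HSC and PCB memberships are ASSUMED for this illustration only.
SIDE_CHAIN_CLASSES = {
    "HSC": set("AVLIMPGFW"),  # hydrophobic side chains (assumed membership)
    "AAA": set("WYF"),        # aromatic amino acids
    "PUC": set("STCYNQ"),     # polar uncharged
    "PCA": set("DE"),         # polar charged acidic
    "PCB": set("KRH"),        # polar charged basic (assumed membership)
}
STANDARD_RESIDUES = "ACDEFGHIKLMNPQRSTVWY"


def side_chain_profile(sequence):
    """Return the percentage of residues of a sequence in each side-chain class.

    Non-standard characters (gaps, X, *) are ignored. Classes overlap
    (e.g. Y is both aromatic and polar uncharged), so the percentages
    need not sum to 100.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in STANDARD_RESIDUES)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {
        name: 100.0 * sum(counts[aa] for aa in members) / total
        for name, members in SIDE_CHAIN_CLASSES.items()
    }
```

Because the classes are allowed to overlap in this sketch, a residue such as tyrosine contributes to more than one percentage.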

Model selection, phylogenetic analysis, and time tree estimation

Multiple sequence alignments (MSA) of the extracted GH9 domains and the CBM49/CBM49-like regions in land plants were generated using the default parameters (gap opening penalty = 10), with gap extension penalties of 0.1 (pairwise alignment) and 0.2 (MSA), a divergence cut off of 20%, and the BLOSUM62 set of matrices (Additional file 5: Table S1, Additional file 1: Text S1 and Additional file 3: Text S4) [85, 86]. These settings were chosen to account for the purported domain distribution of classes A, B, and C among the various taxa. Sequences were deemed compatible if and only if their pairwise alignments were free from errors as determined by the distance matrix computed by MEGA7.0. The top scoring amino acid substitution model for each of the aforementioned MSAs was selected from amongst all models (n = 56) using the corrected Akaike information criterion (min(AICc)) and the Bayesian information criterion (min(BIC)) as indices (Additional file 6: Table S3). BEAST v2.4.7 (Bayesian evolutionary analysis by sampling trees) and the accompanying software suite (FigTree v1.4.3, DensiTree, Tracer v1.6, TreeAnnotator) were utilized to infer the dates, visualize a maximum clade credibility tree with median heights, and tabulate descriptive statistics after the posterior probabilities converged (Tables 2 and 3; Additional file 7: Table S4) [87,88,89]. Whilst the age of the node and the branch times of the clades were inferred directly (Mya), support was denoted as the posterior probabilities (PP%) and bootstrap values (n = 1000) by maximum likelihood (ML%), i.e., support = PP%, ML% (FigTree v1.4.3). 
The root for evaluating the evolution of the GH9 domain (parent of the bacterial clade) was selected on the basis of fossil records that suggested that bacteria were amongst the earliest forms of life (≈3170 − 4180 Mya). The root for the CBM49/CBM49-like sequences of land plants was selected on the basis of the presence of a distinct and delineable CBM49 in the ancestral bryophytes and tracheophytes, coupled with the assumption that the parent of class C vascular land plants (≈201 − 241 Mya) was likely to possess the same architecture (Table 2; Additional file 5: Table S1 and Additional file 8: Table S2) [18, 19].
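The model-selection criterion described above (minimum AICc and minimum BIC across candidate substitution models) can be sketched in Python as follows. This is a generic illustration of the two standard formulas and not the MEGA7.0 implementation. The function names, the example log-likelihoods, and the parameter counts are hypothetical, and the choice of what constitutes the sample size n is an assumption.

```python
import math


def aicc(log_likelihood, k, n):
    """Corrected Akaike information criterion for k free parameters and sample size n."""
    if n - k - 1 <= 0:
        raise ValueError("AICc requires n > k + 1")
    aic = 2 * k - 2 * log_likelihood
    return aic + (2 * k * (k + 1)) / (n - k - 1)


def bic(log_likelihood, k, n):
    """Bayesian information criterion for k free parameters and sample size n."""
    return k * math.log(n) - 2 * log_likelihood


def best_models(candidates, n):
    """Return the (min-AICc model, min-BIC model) pair.

    candidates maps a model name to (log_likelihood, number_of_parameters).
    """
    by_aicc = min(candidates, key=lambda m: aicc(candidates[m][0], candidates[m][1], n))
    by_bic = min(candidates, key=lambda m: bic(candidates[m][0], candidates[m][1], n))
    return by_aicc, by_bic


# Hypothetical values, for illustration only (not taken from the study).
example = {"WAG": (-1000.0, 10), "JTT": (-1010.0, 10), "LG+G": (-999.5, 11)}
print(best_models(example, n=500))
```

Both criteria penalize additional parameters, so a model with a slightly better likelihood but more parameters is not necessarily preferred.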

Table 2 Parameters utilized for Bayesian inference of evolution of the GH9 and CBM49 domains
Table 3 Taxonomic distribution of bacteria in datasets

Pattern analysis and motif approximation of CBM49 in putative GH9 endoglucanases

The boundaries of CBM49 were defined in characterized and putative class C GH9 endoglucanase sequences with single and multiple copies of the GH9 domain (n = 116) (Additional file 5: Table S1C and Additional file 8: Table S2B) [6,7,8, 83, 84]. These were then clustered, realigned, and represented using the Clustal Omega and WebLogo servers (https://www.ebi.ac.uk/Tools/msa/clustalo; http://weblogo.berkeley.edu/logo.cgi) with default parameters [90,91,92]. The refined list of CBM49 sequences from class C GH9 endoglucanases (n = 100) was then submitted to the PRATT v 2.1 server (http://web.expasy.org/pratt), and utilized to identify and score suitable domain spanning patterns [93]. A profile of these patterns (n = 20) was generated based on the numbers of putative class C enzymes that they were found in, i.e., 5 → 100 (Table 4). This profile was used to search for CBM49-like motifs in full length GH9 endoglucanase sequences without a delineable CBM49 region, and in the GH9 domain itself, using the ScanProsite server (http://prosite.expasy.org/scanprosite) (Additional file 9: Table S5). These datasets (n1B, n1C, n2, n3) along with the subset of was used for all further analyses (Tables 4, 5 and 6; Additional file 9: Table S5, Additional file 10: Table S6, Additional file 11: Table S7 and Additional file 12: Table S8, Additional file 13: Text S9, Additional file 16: Text S10, Additional file 14: Text S11 and Additional file 15: Text S12). Alternatively, a hidden Markov model or support vector machine (SVM) may have been utilized for this part of the analysis. SVMs are binary classifiers and incorporate several features of the training sequences to determine presence/absence in an unknown sequence of interest. Whilst an SVM for the CBM49 could have been easily constructed, its utility in identifying the same in a distantly related sequence is likely to be limited. 
The HMM for this specific module, on the other hand, would simply indicate the existence of a similar region above a certain threshold. Since our requirement mandated features of both, i.e., presence/absence of CBM49-like regions in GH9 domain containing endoglucanases across taxa, these predictors of the extrema would not have sufficed.
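The pattern search described above can be approximated locally by translating a PROSITE-style pattern into a regular expression. The following is a minimal Python sketch and not the ScanProsite server itself. The helper names `prosite_to_regex` and `find_hits` are hypothetical, only a small subset of the PROSITE syntax is handled, and the decision to report overlapping matches is an assumption that may differ from the server's behaviour.

```python
import re


def prosite_to_regex(pattern):
    """Translate a compact PROSITE-style pattern, e.g. 'Gx(3)G[LV]', into a regex.

    Supported subset: literal residues, 'x' (any residue), '[...]' residue
    sets, '(n)' or '(n,m)' repeat counts, and optional '-' separators.
    """
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "-":                      # separator, carries no meaning
            i += 1
        elif ch == "x":                    # wildcard position
            out.append(".")
            i += 1
        elif ch == "[":                    # set of permitted residues
            j = pattern.index("]", i)
            out.append(pattern[i:j + 1])
            i = j + 1
        elif ch == "(":                    # repeat count for previous element
            j = pattern.index(")", i)
            out.append("{" + pattern[i + 1:j] + "}")
            i = j + 1
        elif ch.isalpha() and ch.isupper():
            out.append(ch)                 # literal residue
            i += 1
        else:
            raise ValueError("unsupported pattern character: " + repr(ch))
    return "".join(out)


def find_hits(pattern, sequence):
    """Return (start, matched_segment) for every, possibly overlapping, hit."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(sequence)]
```

For instance, `prosite_to_regex("Gx(3)G[LV]")` yields the expression `G.{3}G[LV]`; a sequence is counted as positive for a pattern when `find_hits` returns at least one hit.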

Table 4 Alignment based pattern analysis of CBM49 in putative and characterized class C GH9 endoglucanases
Table 5 Distribution of sequence segments in classes A, B, and C plant GH9 endoglucanases
Table 6 Salient features of putative GH9 endoglucanase sequences with multiple delineable domains

Domain analysis of plant GH9 endoglucanases

The above compiled datasets (n1 − n3) were meant to offer an insight into the origin and evolution of the GH9-CBM49-like domain across all taxa, the end point being the emergence of plant GH9 endoglucanases (classes A, B, and C) (Additional file 6: Table S3, Additional file 7: Table S4, Additional file 9: Table S5, Additional file 10: Table S6 and Additional file 11: Table S7, Additional file 1: Text S1, Additional file 4: Text S2, Additional file 2: Text S3 and Additional file 3: Text S4). Since the methods discussed afford compelling evidence of the ancestral nature of class C GH9 endoglucanase sequences, our subsequent analysis (domain frequency) was focussed on establishing potential divergence of class C members and/or the emergence of classes A and B. Plant GH9 endoglucanase sequences possess a differential distribution of TM, SP, and CBM49 regions, and the frequency of occurrence of these was analysed by directly comparing CBM49 positive class C members (n3C = n LPSC  = 97) with pattern 20 selected sequences of putative classes A (n3A = n LPSA  = 22) and B (n3B = n LPSB  = 75) (Additional file 10: Table S6, Additional file 3: Text S4). Since the hydrophobic profiles of these regions overlap, we utilized data from three algorithms that predict both TM and SP regions to arrive at a consensus. The servers consulted were: MEMSAT-SVM, DAS-TMfilter, and PHOBIUS [94,95,96,97,98,99,100,101] (Additional file 11: Table S7, Additional file 13: Text S9, Additional file 16: Text S10, Additional file 14: Text S11 and Additional file 15: Text S12). MEMSAT-SVM classifies membrane spanning helical regions in a sequence as strong (TM), weak pore-lining (PH), or re-entrant (RH), i.e., (TM ∨ PH ∨ RH) [94, 100]. The dense alignment surface method (DAS-TMfilter) differs from other predictors of transmembrane regions in considering the hydrophobic region(s) of a query protein and mapping the results to known transmembrane regions [95, 96]. 
PHOBIUS is a hidden Markov model based delineator of signal peptide regions and uses sub models of the sequences that comprise these regions, along with topology information, to make predictions [101].

Algorithm to assess contribution of prediction method to each sub segment

Full length sequences of land plants encompassing the CBM49-pattern 20, i.e., classes A, B, and C (n3 = (n3LPSA = n3A) + (n3LPSB = n3B) + (n3LPSC = n3C) = 187) were searched for well defined amino acid segments using the aforementioned servers (MEMSAT-SVM, DAS-TMfilter, PHOBIUS). The subset (NN) was used to define sequences without delineable TM and SP regions (NN = {C0, B0, A0}). The method of choice was determined by rendering the resultant data equivalent and, therefore, comparable. The definitions utilized are as follows:

$$ \begin{array}{lcl} TM & := & \text{Sequences with one or more predicted transmembrane domains} \\ SP & := & \text{Sequences with one or more predicted signal peptide regions} \\ PH & := & \text{Sequences with one or more predicted pore-lining helices} \\ RH & := & \text{Sequences with one or more predicted re-entrant helices} \\ \boldsymbol{NN} & := & \left({SP}^{-}\right) \wedge {\left(TM \vee PH \vee RH\right)}^{-} \\ \boldsymbol{NY} & := & \left({SP}^{-}\right) \wedge {\left(TM \vee PH \vee RH\right)}^{+} \\ \boldsymbol{YY} & := & \left({SP}^{+}\right) \wedge {\left(TM \vee PH \vee RH\right)}^{+} \\ \boldsymbol{Y} & := & {\left(TM \vee PH \vee RH\right)}^{+} \end{array} $$
  • Step 1: Sequences with negative predictions for both SP and TM regions (f(NN)), i.e., \( \{x_i \in \boldsymbol{NN} \subset n3 \mid ({SP}^{-}) \wedge {(TM \vee PH \vee RH)}^{-}\} \), were removed from the computations.

  • Step 2: The remaining sequences were assessed for the presence of the transmembrane subregions (f(Y)), i.e., \( \{x_i \in \boldsymbol{Y} \subset n3 \mid {(TM \vee PH \vee RH)}^{+}\} \).

  • Step 3: The data computed in Step 2 were then used to calculate the number of sequences without (f(NY)) or with (f(YY)) an associated signal peptide region, i.e., \( \{x_i \in \boldsymbol{NY} \subset n3 \mid ({SP}^{-}) \wedge {(TM \vee PH \vee RH)}^{+}\} \) and \( \{x_i \in \boldsymbol{YY} \subset n3 \mid ({SP}^{+}) \wedge {(TM \vee PH \vee RH)}^{+}\} \).

  • Step 4: The data from the above steps were used to compute ratios that establish equivalence between the predictions and, thereby, a rationale for their subsequent inclusion/exclusion \( \left(\frac{\mid \boldsymbol{NY}\mid}{\mid \boldsymbol{Y}\mid},\ \frac{\mid \boldsymbol{YY}\mid}{\mid \boldsymbol{Y}\mid}\right) \).
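
The four steps above can be sketched in Python as follows. This is a minimal illustration of the set definitions and ratios, not the scripts used in the study. The `Prediction` record, its field names, and the label "other" for the combination (SP+ with no helix), which the definitions above do not cover, are assumptions introduced for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    """Per-sequence predictions from one server (hypothetical record layout)."""
    seq_id: str
    sp: bool  # signal peptide predicted
    tm: bool  # transmembrane helix predicted
    ph: bool  # pore-lining helix predicted
    rh: bool  # re-entrant helix predicted

    @property
    def helix(self):
        # (TM v PH v RH)+ in the notation of the definitions above
        return self.tm or self.ph or self.rh


def classify(p):
    """Assign a sequence to NN, NY or YY; SP+ without a helix is left as 'other'."""
    if p.helix:
        return "YY" if p.sp else "NY"
    return "other" if p.sp else "NN"


def equivalence_ratios(predictions):
    """Return (|NY|/|Y|, |YY|/|Y|), where Y is the union of NY and YY (Steps 1-4)."""
    labels = [classify(p) for p in predictions]
    ny = labels.count("NY")
    yy = labels.count("YY")
    y = ny + yy  # Step 1 drops NN; Step 2 retains helix-positive sequences
    if y == 0:
        raise ValueError("no sequences with predicted TM, PH or RH regions")
    return ny / y, yy / y
```

Computing the same pair of ratios for each server puts their outputs on a common scale, which is what allows the predictions to be compared.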

Results

Taxonomic distribution of the GH9 domain

The GH9 domain averages ≈448 aa, and is present as a single copy in the sequences investigated (n1 = 607), i.e., bacteria (BAC), land plants (LPS), animals (ALS), fungi (FGI), green algae (GAL), protists (PRS), and archaea (ARC) (Fig. 1c; Additional file 5: Table S1A). Although the vast majority of sequences selected for this study were putative GH9 endoglucanases, empirical data (kinetic, transcript data, 3D structure) were available for many of these taxa and were included (n LPS  = 26; n ALS  = 1; n BAC  = 11). Whilst most sequences possessed alignment compatible GH9 domains (n1A = 601), a few sequences (n = 6) could not be aligned and were not utilized in the estimation of divergence of GH9 domains across taxa (Additional file 5: Table S1A, Additional file 1: Text S1). The source of error was most likely the archaeal sequence (Methanohalobium evestigatum; tr|D7E938). This sequence has a predicted GH9 domain length of 222 aa (Eval = 1.2E − 08), and sub optimally aligned sequences are likely to have inflated scores in excess of the threshold for inclusion. On the other hand, despite possessing GH9 domains of suitable length, the lower confidence levels of the HMM predictor for α-proteobacteria (Asticcacaulis biprosthecum; gi|328841530, gi|328840708; Evals = 2.20E − 17, 8.40E − 15), a member of the Chlorobi-Fibrobacteres-Bacteroidetes (CFB) ancestral phylum (Bacteroides fluxus YIT 12057; gi|328530713, gi|328531610; Evals = 2.80E − 25, 2.40E − 23), and a member of the subgroup Bacillales of the Firmicutes (Listeria innocua; gi|313621564; Eval = 1.50E − 20) were probable confounders for the alignment mismatch (Additional file 5: Table S1A). The bacterial subgroup comprised Gram negative (proteobacteria) and Gram positive organisms (members of CFB phylum, cyanobacteria, firmicutes, and bacillales) (Table 3, Fig. 1c). 
However, multiple distinct representations of the GH9 domain in one protein are not uncommon, and are present as two or four (Saccoglossus kowalevskii; gi|291236258) copies (n = 16; n ALS  = 7, n BAC  = 2, n LPS  = 7) (Additional file 5: Table S1B). Additionally, we observed the concomitant presence of heterogeneous glycoside hydrolase domains in some bacterial species (n BAC  = 4), which included Caldocellum saccharolyticum (gi| 1708078; GH9, GH48), Ruminococcus champanellensis (gi| 291543673; GH9, GH16), Ruminiclostridium thermocellum (gi| 1663519; GH9, GH44), and Caldicellulosiruptor spp. (gi| 12743885; GH9, GH44) (Additional file 5: Table S1C). Interestingly, despite being classified as GH9 members, only the anaerobic methanogen (Methanohalobium evestigatum; tr|D7E938) of the archaea subgroup Euryarchaeota possessed the requisite GH9 domain (Additional file 5: Table S1D).

Evolution and emergence of the GH9 and CBMs in plant and non plant taxa

The data suggest that the GH9 domain is conserved across all taxa and that a catalytically functional copy may have been present in bacteria (≈3000 Mya; support = 100%, 96%) (Fig. 2; Additional file 7: Table S4A, Additional file 17: Text S5 and Additional file 18: Text S6). Interestingly, the clades of the land plants and green algae appear to have diverged relatively early and independently of the animals, fungi, and the protists (≈1961 Mya; support = 100%). The GH9 domains of the land plants and green algae continued to evolve for another ≈1750 million years before diverging from each other relatively recently (≈211 Mya; support = 97%). In contrast, the protists diverged from animals and fungi (≈817 Mya; support = 97%), whilst the GH9 domains of animals and fungi diverged from each other more recently (≈11 Mya; support = 97%). A generic timeline for the evolution of the GH9 domain, i.e., BAC > PRS > {FGI, GAL, ALS, LPS}, is therefore plausible (Fig. 2). We also posited, and thence investigated, the contribution of non-GH9 regions (CBM49, linker(s)) to substrate dichotomy (crystalline, amorphous) in plant GH9 endoglucanases. We observed distinct and delineable CBM49s (79 − 84 aa; median = 81 aa) in putative class C GH9 endoglucanase sequences of flowering land plants (n = 102) after outlier exclusion (n = 2; Zea mays, GRMZM2G143747_P01; Selaginella moellendorffii, 109529) (Additional file 8: Table S2A and B). The only exception outside the land plants was the presence of a single CBM49 (82 aa) in the protist, Polysphondylium pallidum PN500 (gi|281207043, gi|281207029) (Additional file 5: Table S1A). Remarkably, our results indicate a unique copy of CBM49 in bryophytes (n = 4; Physcomitrella patens) and tracheophytes (n = 3; S. moellendorffii) (Additional file 8: Table S2). Analysis of the primary sequences also indicates the presence of one or more linker sequences connecting the GH9 to the CBMs. 
In CBM49 class C sequences this linker spans 7–77 aa (Prunus persica, ppa022524m; Phaseolus vulgaris, Phvul.011G030300.1) (Additional file 5: Table S1 and Additional file 8: Table S2).

Fig. 2

Evolution of GH9 domain. A Bayesian inference (BI) dated tree was estimated (maximum clade credibility) from the computed tree population (n = 4476; burn-in = 70%) using the WAG amino acid substitution model and the parent of the clade of bacteria as the root. Whilst node ages (= node height = branch time of the longest diverging taxa) and branch times are in Mya, support for branch points is indicated by the posterior probabilities (PP%) and bootstrap values (n = 1000; ML%), i.e., support = PP%, ML%. The root for this tree was the parent of bacteria (3170 − 4180 Mya). The log likelihood for this tree was (≈ − 0.0838233). Abbreviations: BI, Bayesian inference; GH9, glycoside hydrolase 9; Mya, millions of years ago; WAG, Whelan and Goldman

Characterization, analysis, and assessment of relevance of CBM49-spanning patterns in non-plant taxa

The amino acid profile (HSC ≅ 46.2%; AAA ≅ 11%; PUC ≅ 36%; PCA ≅ 4.7%; PCB ≅ 13%) of the truncated CBM49 sequences (n = 102) suggests a high percentage of amino acids whose side chain functional groups (PUC = {−OH, −SH, −NH2}), i.e., Serine (S), Threonine (T), Cysteine (C), Tyrosine (Y), Asparagine (N), and Glutamine (Q), could potentially contribute to the catalytic machinery of these putative enzymes (Additional file 8: Table S2C). Interestingly, there was a paucity of the catalytic permissive (PCA = {−COO}) amino acids (D/E) in the sequences analysed (Fig. 3a and b; Additional file 8: Table S2, Additional file 4: Text S2). Clearly, the restricted taxonomic distribution of CBM49 precludes a direct comparison, thereby justifying our search for patterns that could approximate CBM49 (Fig. 3; Additional file 9: Table S5). These patterns were partitioned into those with low/high fitness strengths, which correlated with their compositional complexity (Table 4, Fig. 3c). Patterns of reduced complexity are likely to be present in a greater number of sequences, and also possess low fitness (Fs) scores (Table 4, Fig. 4c). The Rm-value is the expected number of random matches in 100,000 unrelated sequences [102]. For instance, the pattern with the lowest fitness score (p20), i.e., Gx(3)G[LV], has the value Rm = 33184 (n = 100), whilst the value for the high scoring pattern 1 (p1) was Rm = 2.47E − 35 (n = 5) (Table 5, Fig. 3c). The presence of these patterns in CBM49-containing characterized class C sequences was confirmed initially, following which their occurrence in non-class C members was evaluated (Fig. 4a and b).

Fig. 3

Characterizing the carbohydrate binding module (CBM49). a Multiple sequence alignment of the CBM49 in class C GH9 endoglucanases. This region has been highlighted in the presented alignment, and suggests a conservation, in not just the overall structure, but also several key residues (W| F| Y; K| R| N| H| Q). Additionally, the highest (p1) and the lowest (p20) scoring patterns that approximate CBM49 have been illustrated. The rudimentary p20 derived from class C sequences was found in several organisms (n = 194), including classes A and B of plant GH9 members, b WebLogo of the carbohydrate binding domain 49 of putative class C plant GH9 endoglucanases. Truncated sequences with a well defined 81 AA region corresponding to the CBM49 were utilized to construct this, and c Analysis of 20 patterns spanning the CBM49 with number of matched sequences (Sm), fitness (Fs), and randomly matched sequences (log(Rm=R)) as indices. Abbreviations: AA, amino acids; CBM49, carbohydrate binding module; Fs, fitness score; GH9, glycoside hydrolase; Sm, number of sequences with matches; Rm, number of randomly matched sequences

Fig. 4

Pattern analysis and major findings in selected plant GH9 endoglucanases. a Distribution and presence of high- and low-fitness strength CBM49-spanning patterns (53 'hits' on 4 sequences) in characterized class C enzymes, b Taxonomic distribution of the low strength p20 (n2 = 291), and c Analysis of the presence of all 20 CBM49-spanning patterns in selected sequences of classes A, B, and C (n = 187). Clearly, the ubiquitous presence of p20 favours its use as an index of the presence of CBM49 in non class C taxa. The higher strength patterns (p1-p17) are limited to putative class C GH9 endoglucanases. Abbreviations: CBM49, carbohydrate binding module; GH9, glycoside hydrolase; p20, pattern 20

These data, for full length sequences of putative GH9 endoglucanases without a delineable CBM49 in terms of number of hits and sequences corresponds to: p1 − p17 (hits = 0), p18 (hits = 93; sequences = 81), p19 (hits = 2; sequences = 2), and p20 (hits = 233; sequences = 194) (Additional file 9: Table S5A). The results for all taxa with the GH9 domain: 18 (hits = 98; sequences = 89), p19 (hits = 1; sequences = 1), and p20 (hits = 315; sequences = 265) (Additional file 9: Table S5B). The low scoring p18 (Gx[DENQPST]x(2)G[LV]) and p20 (Gx(3)G[LV]) are the only patterns equivalent to the CBM49 which are found in classes A and B along with other taxa, in both full length and GH9 domain sequences (Table 4). The maximal population (≈44 − 48%) and taxa-specific coverage (ALS, BAC, FGI, PRS, LPS), then justifies the utilization of p20 in defining a dataset that could be used to develop an evolutionary trace of putative class C specific endoglucanase activity (Additional file 7: Table S4 and Additional file 9: Table S5, Additional file 13: Text S9, Additional file 16: Text S10, Additional file 14: Text S11 and Additional file 15: Text S12). This combined, i.e., inclusive of class C sequences, dataset (n2 = 291) of full length putative GH9 endoglucanase sequences then possessed GH9 (n = 1) and CBM49-p20 (n ≥ 1) occurrences, and includes bacteria (n = 64), animals (n = 18), fungi (n = 5), protists (n = 8), and green algae (n = 2) (Fig. 4b; Additional file 2: Text S3). The distribution of bacteria between the datasets (n1, n2) was similar firmicutes (≈56%, ≈69%), actinobacteria (≈17.2%, ≈15.6%), and proteobacteria (≈17.2%, ≈8%) (Table 3). However, the sole archaeal sequence (tr|D7E938) was conspicuous in the absence of the same (Figs. 1c and 4b; Additional file 5: Table S1A). 
We also observed that while several sequences of land plants, bacteria, and fungi included more than one occurrence of this pattern, green algae, protists, and animals only contained one occurrence of Gx(3)G[LV] (Additional file 9: Table S5A). A search for sequences with pattern 18 (G[DENPQST]x(2)G[LV]), with a marginal increase in fitness strength (| δp20, p18| ≅1.4) eliminated green algae altogether (Table 4; Additional file 9: Table S5). The taxonomic spread for matched occurrences on the GH9 domain (n1 = 607) with p18 (n1B; n ALS  = 3, n BAC  = 34, n FGI  = 3, n LPSA  = 7, n LPSB  = 28, n LPSC  = 14) and p20 (n1C; n ALS  = 14, n BAC  = 53, n FGI  = 4, n PRS  = 6, n LPSA  = 14, n LPSB  = 70, n LPSC  = 108), reiterates the generic nature of these patterns (Additional file 9: Table S5B). Interestingly, and in complete contrast is the profile of occurrences of p19, which despite its low fitness registers a single hit (class C, S. moellendorffii, 109529).
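The hit and sequence counts reported above can be illustrated with a minimal sketch, assuming the patterns follow PROSITE-style syntax (x for any residue, x(n) for n residues, brackets for alternatives). Only p19 and p20 are taken from the text; the function names and the three toy sequences are hypothetical and are not drawn from the datasets analysed here.

```python
import re

# Patterns p19 and p20 as quoted in the text (PROSITE-style syntax assumed).
PATTERNS = {
    "p19": "Gx[ILV][WY]G[LV]",
    "p20": "Gx(3)G[LV]",
}

# Hypothetical toy sequences, for illustration only.
TOY_SEQUENCES = {
    "toy1": "MKGAAAGLTTGSVWGVQ",
    "toy2": "MTTGQDPGLAA",
    "toy3": "MAAAKKK",
}


def prosite_to_regex(pattern):
    """Translate the small PROSITE subset used here into a Python regex."""
    regex = re.sub(r"x\((\d+)\)", r".{\1}", pattern)  # x(n) -> .{n}
    return regex.replace("x", ".")                    # x    -> .


def count_hits(sequences, pattern):
    """Return (total hits, sequences with at least one hit).

    A lookahead is used so that overlapping occurrences are counted.
    """
    scanner = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    per_sequence = [len(scanner.findall(seq)) for seq in sequences.values()]
    return sum(per_sequence), sum(1 for n in per_sequence if n > 0)


if __name__ == "__main__":
    for name, pattern in PATTERNS.items():
        hits, n_seqs = count_hits(TOY_SEQUENCES, pattern)
        print(name, "hits =", hits, "sequences =", n_seqs)
```

Because every p19 match also satisfies p20, the more specific pattern can never register more hits than the generic one on the same sequences, which is the behaviour described for the full datasets.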

Analysis of CBM49 and CBM49-like GH9 endoglucanases of vascular land plants

In addition to establishing the origins of CBM49, we examined the divergence of putative class C GH9 endoglucanase sequences and the emergence of classes A and B in vascular land plants. To accomplish this, a subset of pattern 20-selected GH9 endoglucanase sequences in land plants (n = 186; n LPSA = 22, n LPSB = 75, n LPSC = 89) was collated and compared. The node ages and branch times suggest that vascular class C (≈222 Mya; support = 100%, 99%) GH9 endoglucanases predate members of classes A and B (≈114 Mya; support = 87%, 99%) (Fig. 5; Additional file 19: Text S7 and Additional file 20: Text S8). The molecular basis of these findings was ascertained by examining CBM49 (class C) and CBM49-like (classes A and B) sequences of vascular land plants for the presence of concomitant transmembrane and signal peptide regions (Table 5, Fig. 4a; Additional file 11: Table S7, Additional file 16: Text S10, Additional file 14: Text S11 and Additional file 15: Text S12). The MEMSAT-SVM data clearly suggest that all classes of GH9 endoglucanase sequences possess distinct high-scoring (transmembrane; n LPSA ≈ 96%, n LPSB ≈ 83%, n LPSC ≈ 80%) or low-scoring (pore-lining; n LPSA ≈ 4%, n LPSB ≈ 19%, n LPSC ≈ 20%) helical regions, with the exception of the class B sequence (MDP0000199273), which possessed both classes of helices. Interestingly, a third class (re-entrant helical) was computed in class A members (n LPSA = 3). When these data were combined, i.e., TM ∨ PH ∨ RH, all classes A, B, and C were shown to possess one or more TM subregions (n LPSA = n LPSB = n LPSC = 100%) (Table 5). The corresponding values were n LPSA = 95%, n LPSB = 98%, n LPSC = 90% for the DAS-TMfilter, and n LPSA = 91%, n LPSB = 4%, n LPSC = 2% for PHOBIUS (Table 5). The computations also suggest a bimodal distribution of signal peptide regions (SP− ∧ (TM ∨ PH ∨ RH)+ := NY, SP+ ∧ (TM ∨ PH ∨ RH)+ := YY). 
While the MEMSAT-SVM data gave n LPSB ≅ 80% and n LPSC = 75% for YY, the DAS-TMfilter gave n LPSB ≅ 56% and n LPSC = 86%. In contrast, the data from PHOBIUS differed considerably (n LPSB ≅ 33.3%, n LPSC = 0; YY), and were applicable to only 3 sequences. This was primarily due to the almost complete absence of TM regions (n LPSB = 4%, n LPSC = 2%), or conversely the overwhelming presence of signal peptide regions in class B and C enzymes (n LPSB ≅ 96%, n LPSC ≅ 98%) (Table 5; Additional file 11: Table S7, Additional file 16: Text S10 and Additional file 15: Text S12). However, as discussed vide supra, the corresponding results for the presence of the TM ∨ PH ∨ RH regions in class A GH9 endoglucanases predicted by MEMSAT-SVM (n LPSA = 100%), DAS-TMfilter (n LPSA = 95%), and PHOBIUS (n LPSA = 91%) were almost identical (Table 5). Additionally, whilst the results from DAS-TMfilter were similar to those from MEMSAT-SVM, its coverage of classes B (n LPSB = 67%) and C (n LPSC = 51%) was suboptimal. The MEMSAT-SVM data were therefore deemed most appropriate for predicting the molecular events that may have occurred during the evolution of plant GH9 endoglucanases (Table 5; Additional file 11: Table S7, Additional file 15: Text S12).
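The NY/YY labelling used above can be sketched as follows, assuming each predictor's output has been reduced to one boolean per feature for every sequence. The record layout, field names, and toy records are hypothetical; they do not reproduce the output formats of MEMSAT-SVM, DAS-TMfilter, or PHOBIUS.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Prediction:
    """Per-sequence summary of predictor output (illustrative layout)."""
    seq_id: str
    cls: str   # "A", "B" or "C"
    tm: bool   # transmembrane helix predicted
    ph: bool   # pore-lining helix predicted
    rh: bool   # re-entrant helix predicted
    sp: bool   # signal peptide predicted


def label(p):
    """Return 'YY' or 'NY' when a helical region (TM or PH or RH) is present.

    The first letter records the signal peptide, the second the helical
    region. Sequences without any helical region fall outside both labels.
    """
    if not (p.tm or p.ph or p.rh):
        return None
    return "YY" if p.sp else "NY"


def yy_fraction(predictions, cls):
    """Fraction of labelled sequences of one class that are YY."""
    labels = [label(p) for p in predictions if p.cls == cls]
    labels = [x for x in labels if x is not None]
    if not labels:
        return 0.0
    return Counter(labels)["YY"] / len(labels)


# Hypothetical toy records, not taken from Table 5.
TOY_PREDICTIONS = [
    Prediction("b1", "B", True, False, False, True),
    Prediction("b2", "B", False, True, False, True),
    Prediction("b3", "B", True, False, False, True),
    Prediction("b4", "B", True, False, False, False),
    Prediction("c1", "C", False, False, False, True),
    Prediction("c2", "C", True, False, False, True),
]

if __name__ == "__main__":
    for cls in ("B", "C"):
        print(cls, yy_fraction(TOY_PREDICTIONS, cls))
```

Computing the fraction only over labelled sequences mirrors the remark that a predictor's YY percentage may rest on very few sequences when it rarely reports a helical region.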

Fig. 5

Insights into divergence of plant class C GH9 endoglucanases. A Bayesian inference (BI) dated tree was estimated (maximum clade credibility) from the computed tree population (n = 4837; burn-in = 70%) using the JTT + I + G amino acid substitution model. Whilst node ages (= node height = branch time of the longest diverging taxa) and branch times are in Mya, support for branch points is indicated by the posterior probabilities (PP%) and bootstrap values (n = 1000; ML%), i.e., support = PP%, ML%. The root for this tree was the parent of vascular class C land plants (201 − 249 Mya). The log likelihood for this tree was (≈ − 0.1350387). Abbreviations: BI, Bayesian inference; GH9, glycoside hydrolase; I, proportion of invariant sites; G, gamma parameter; Mya, millions of years; JTT, Jones, Taylor, and Thornton

Discussion

Evolutionary significance of crystalline cellulose digesting non plant GH9 endoglucanases

Our results, on the evolution of the GH9 and CBM49 regions suggest a pyramidal model with vertical gene transfer and progressive evolution (loss or modification of function) as a plausible explanation for the emergence, occurrence, and divergence of GH9 endoglucanase activity (≈3000 Mya) (Figs. 2 and 5) [15,16,17,18,19,20,21,22,23,24,25,26,27,28, 32,33,34]. Conversely, since crystalline cellulose is the preferred substrate, this also implies a conserved active site architecture of the encoded protein and a correspondingly similar reaction chemistry in non-plant taxa and land plants with putative class C GH9 endoglucanase activity (Tables 1 and 4, Fig. 4a; Additional file 5: Tables S1 and Additional file 8: Table S2) [7, 46, 47].

The structure of crystalline cellulose renders it resistant to alterations in the temperature, salt concentration, and pH of the surrounding environment, clearly a desirable trait in archaea (methanogens) and bacteria (halophiles, thermophiles) which inhabit extreme environments such as hot springs and the oral and gastrointestinal microbiomes of several animals. Here, perhaps, the role of GH9 endoglucanases could be critical in remodelling the cell membranes, thereby maintaining intracellular homeostasis [47, 49]. Additionally, crystalline cellulose is inert, compact, and insoluble in aqueous and several organic solvents. These physicochemical properties would imply that spores and seeds made predominantly of this polymer would be resistant to desiccation and stressors such as weather fluctuations [14, 41, 43]. Protists (Dictyostelium and Polysphondylium spp.) and gram positive bacteria may therefore have utilized GH9 endoglucanases to regulate the processes of sporulation, dissemination, and effective germination [14, 41, 43]. The lipopolysaccharides (complexes of crystalline cellulose with lipids) synthesized by gram negative bacteria (proteobacteria, actinobacteria) and fungi, too, could aid protection of the organism from host immune systems (phagocytosis) while concomitantly establishing an infection (Cryptococcus neoformans, Pseudomonas spp., Vibrio spp.) or infestation in developing protists and marine invertebrates [9,10,11,12,13,14, 41, 43, 48, 102,103,104]. Reciprocally, an interesting utility of GH9 endoglucanases is to facilitate the symbiotic/ parasitic association between some fungi and bacteria and their animal and plant hosts (macrophages, leguminous nodules of the rhizomes) by digesting the crystalline cellulose of the host. 
Thus, bacteria/ fungi could secrete these enzymes and, alone or in association with the cellulosome, digest the cellulose and hemicellulose in root hairs and wood to extract/ exchange nutrients (Laccaria bicolor, Sporisorium reilianum, Phanerochaete chrysosporium) [42, 44, 53, 54, 105,107,108,109]. Although cellulose is generally regarded as inert, reports of its potential to stimulate an immune response in the host are not unknown. In fact, specialized cells in the tunics of marine invertebrates (O. dioica, S. kowalevskii, and C. intestinalis) might function as primitive phagocytes that could detect the presence of crystalline cellulose (potential pathogen, index of nutritional status) and could moderate a suitable response (adhesion to the substratum, infection by marine microbes). The ability to utilize the nutritionally superior crystalline cellulose may be an important, albeit indirect, factor in the dominant global presence of arthropods, including insects (Apis mellifera, Camponotus floridanus, Nasonia vitripennis, Nasutitermes takasagoensis) and crustaceans (Daphnia pulex), and of segmented worms (Additional file 5: Table S1A, Additional file 1: Text S1) [15, 50, 51, 108,109,110,111,112,113,114]. Since GH9 endoglucanase producing bacteria populate the microbiomes of these animals, they are able to extract glucose from diverse substrates (wood, chitooligosaccharides) and can subsist in several seemingly inhospitable environments. Additionally, and in comparison to the kingdom specific analysis (bacteria, fungi, land plants, animals) with corresponding multiple trees by previous investigators, we were able to generate a unified time tree of over 600 GH9 domain sequences spread over every major taxon (n BAC ≈ 6.5X, n ALS ≈ 3.4X, n FGI ≈ 1.6X, n LPS ≈ 4.8X), including green algae and protists [55].

Rationale and relevance of a multimodal approach to approximating the CBM49

As discussed vide supra, the carbohydrate binding module CBM49 is unique to class C members of land plants (Fig. 3; Additional file 8: Table S2 and Additional file 4: Text S2). Our data suggest that homologous CBMs (GH9 ∧ (CBMx) y | x ∈ {2, 3, 4, 10, 49, X}, y ∈ {1, 2}) distributed across the length of the protein might contribute to catalysis of crystalline cellulose in bacteria (n = 37), animals (n = 18), and protists (n = 2) (Table 6; Additional file 12: Table S8). The data from the SMART server also indicated the presence of several low complexity regions in both full length and truncated (GH9 domain) sequences. This, coupled with the sparse CBM data (<10%), prompted us to search for CBM49-spanning patterns amongst putative non class C GH9 endoglucanase sequences, reasoning that patterns with low fitness scores might constitute a superior index for approximating the CBM49. In our analysis, the CBM49-approximating and low scoring p18 (Gx[DENQPST]x(2)G[LV]), p19 (Gx[ILV][WY]G[LV]), and p20 (Gx(3)G[LV]) possessed amino acids that may be catalytic and/ or facilitatory. Whilst the bulky side chains of the aromatic amino acids can physically stretch the glycosidic linkage between adjacent β(D)-glucopyranose residues and weaken it several fold, amino acids with side chain functional groups (−OH, −NH2, −SH) can effect electron-proton transfers and are critical components of the catalytic machinery of any enzyme [62,63,64, 115]. The concomitant occurrence of these residues with the GH9, i.e., (GH9 ∧ p18) ∨ (GH9 ∧ p20), could function as an index of CBM49-presence on the GH9 domain in sequences of non class C taxa and can then be utilized to trace the origins of CBM49. 
The biological relevance of this approach may be gleaned by examining the correlation between the presence of aromatic amino acids, which are known to influence catalysis of crystalline cellulose, and the 'hits' or 'occurrences' of low strength patterns in non class C enzymes (Table 4; Additional file 9: Table S5) [62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78]. Whilst the complete absence of aromatic amino acids could be responsible for the generic distribution of p18 and p20 (93 ≤ n Hits ≤ 230, full length; 98 ≤ n Hits ≤ 315, GH9 domain), the incorporation of a single residue W/Y into p19 results in a significant reduction in its occurrence in non class C members (n Hits = 2, full length; n Hits = 1, GH9 domain) (Table 4, Fig. 4b; Additional file 9: Table S5).

Evolution of the CBM49 encompassing class C GH9 endoglucanases

The identification of the CBM49 as the facilitator of crystalline cellulose digestion (class C activity) in a select population of previously annotated GH9 endoglucanases in land plants raises intriguing queries with regards to the origin, subsequent divergence, and physiological relevance of substrate shuffling (amorphous, crystalline) in plant GH9 endoglucanases [6,7,8, 33, 34]. In the absence of an identifiable CBM49, the analysis of full length putative GH9 endoglucanase sequences with occurrences of p20 (low strength generic approximator of CBM49) might constitute a viable approach, and provide insights into the origins and subsequent divergence of CBM49 containing enzymes.

Emergence and origin of the CBM49

The influence of non-GH9 regions of the primary sequence on the catalytic spectrum of plant GH9 endoglucanases suggests that these, like the GH9 domain, may have originated in non-plant taxa. These could include the presence of: a) homologous CBMs throughout the length of the protein sequence, and b) delocalized residue-specific activity of the GH9 domain itself. Extensive sequence analysis of full length and GH9 domain sequences of non-plant taxa reveals the presence of several regions of low complexity, along with sparsely present pre-defined CBMs (n = 57; ≅9.3%) (Table 6; Additional file 12: Table S8). The numbers notwithstanding, distinct copies of CBM2 (animals, bacteria), CBM3 (bacteria), CBM4 (animals, bacteria), CBM10 (bacteria), CBMX (bacteria), and the CBM49 (protists) itself (GH9 ∧ (CBMx) y ), have been characterized in the literature, with the encompassing GH9 endoglucanases exhibiting a clear preference for crystalline cellulose [39,40,41,42,43,44,45,46,47,48,49,50,51,52,53] (Table 6; Additional file 12: Table S8). Interestingly, the CBMs 2 and 4 of animals and bacteria were present at opposite termini of the GH9 domain. Thus, while CBM4_9 is C-terminal in animals, its position in bacteria is distinctly N-terminal, with the reverse being true for CBM2 (Additional file 12: Table S8). This mobility of CBMs across taxa suggests that either N- or C-terminal positioned CBMs could have functioned as precursors of CBM49. The length of the linker sequences exhibited considerably greater variation in non-plant taxa (27 − 230 aa) as compared to land plants (7 − 77 aa) (Additional file 5: Table S1A, Additional file 8: Table S2A and Additional file 12: Table S8). In contrast, the low strength CBM49-approximator, i.e., pattern 20, could be mapped directly onto the full length and GH9 domains (≅50%). In the presence of key aromatic and/ or polar uncharged amino acids this mapping could also confer competency to digest crystalline cellulose. 
Whilst, the exact origin of the CBM49 remains speculative, our results when combined indicate a distinct probability (>0.00) that a double ((GH9 ∧ (CBMx) y ) = {0.093} ∨ (GH9 ∧ p20) = {0.44,0.48}) or triple event ((GH9 ∧ (CBMx) y  ∧ p20) = {0.041,0.046}) may have resulted in the emergence of CBM49 in early land plants (Table 6; Additional file 12: Table S8).
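One way to read the triple-event figures, assuming that the occurrence of a pre-defined CBM and of p20 on a GH9 sequence are independent so that the joint frequency is the product of the two marginal frequencies (an assumption made here for illustration only), is the following worked arithmetic:

```latex
\begin{align*}
P(\mathrm{GH9} \wedge (\mathrm{CBM}x)_y \wedge \mathrm{p20})
  &\approx P(\mathrm{GH9} \wedge (\mathrm{CBM}x)_y)\; P(\mathrm{GH9} \wedge \mathrm{p20}) \\
0.093 \times 0.44 &\approx 0.041 \\
0.093 \times 0.48 &\approx 0.045
\end{align*}
```

Under this reading the lower bound matches the quoted 0.041, and the upper bound lies close to the quoted 0.046; the small difference may simply reflect rounding of the marginal frequencies.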

Divergence of class C GH9 endoglucanases

The interdomain linker, a common feature between the GH9 domain and CBMs, is surprisingly stable and seems to have remained so for ≅450 − 480 Mya. Whilst the evidence for the ancestral role of class C members of vascular land plant GH9 endoglucanases is fairly unequivocal, a clear insight into the downstream molecular events that may have occurred in their transformation to classes A and B is debatable (Figs. 5 and 6). Here too, we posited that vertical gene loss of class C GH9 endoglucanase sequences was operative and could result in the emergence of classes A (A1) and B (B1, B2) (Table 5, Fig. 6; Additional file 11: Table S7, Additional file 13: Text S9, Additional file 16: Text S10, Additional file 14: Text S11 and Additional file 15: Text S12). The extensive computational analysis conducted in this work suggests that classes B (B1, B2) and C (C1, C2) could each be considered a union of two distinct groups, a partitioning that is based on the presence or absence of a signal peptide region (Table 5, Fig. 6; Additional file 11: Table S7, Additional file 13: Text S9, Additional file 16: Text S10, Additional file 14: Text S11 and Additional file 15: Text S12). The first model purports that the last common ancestor (LCA) of vascular plant GH9 endoglucanases was a class C-like enzyme in bryophytes and early tracheophytes. Subsequent parallel losses of the CBM49 could have resulted in the appearance of modern vascular equivalents (Figs. 5 and 6). This model also offers an explanation for the fewer numbers of class C members frequently observed by investigators, despite contrasting bioinformatics evidence [14, 58,59,60]. Indeed, this may be the route of choice for the emergence of class C (≈222 Mya; support = 100%, 99%) and classes A and B (≈114 Mya; support = 87%, 99%) (Figs. 5 and 6). This model would mandate the presence of distinct subpopulations of the LCA, i.e., CBM49 with either TM or SP regions. 
Alternatively, class C GH9 endoglucanases of land plants may have been the first to emerge after the tracheophytes, whilst classes A and B evolved from them by the progressive loss of the signal peptide. This route, too, seems perfectly plausible given the presence of two distinct subpopulations of class C GH9 endoglucanases (C1, C2), with each diverging secondary to the loss of the CBM49 subregion (class C2 → class A1 ≈ class B2; n1A) and the considerably earlier divergence of class C vascular plants (Table 5, Figs. 5 and 6). Since classes A and B in vascular land plants could originate in parallel and directly from their class C counterparts, the fewer numbers observed could simply mean fewer original class C members left as compared to class B GH9 endoglucanases. A third scenario could be the sequential origin of later members, i.e., class C → class A → class B or class C → class B → class A (Fig. 5). Phylogenetic and sequence analysis of this dataset (n3) suggests that the most probable routes were class C1 → class B1 → class A1 and/ or (class C2 → class A1 ≈ class B2; n1A) (Fig. 6).

Fig. 6

Evolution, divergence, and emergence of plant class C GH9 endoglucanases. a Evolutionary theories for the emergence and divergence of classes A, B, and C plant GH9 endoglucanases. The major considerations in proposing these were data gleaned from the time trees, and analysis of the sequences for the presence and/ or absence of the transmembrane region, signal peptides, and the CBM49 itself. b Phylogenetic and bioinformatics analysis of full length sequences of classes A, B, and C plant endoglucanases with one or more occurrences of p20. CBM49 attributable class C activity along with a GH9 domain was present in the early land plants, bryophytes (avascular) and tracheophytes (vascular), and suggests the presence of two class C populations which may have diverged so as to result in the newer classes A and B. Abbreviations: CBM49, carbohydrate binding module; GH9, glycoside hydrolase; p20, pattern 20

Class C GH9 enzymes, last common ancestor of plant GH9 endoglucanases

Physiologically, the development of an intact vascular system could have brought about a paradigm shift in not just the utilization of extant endoglucanase activity, but also in the nature of cellulose itself. The introduction and persistence of water molecules between the microfibrils of cellulose could have resulted in competition for hydrogen bonds with water rather than other fibrils of cellulose. These events could have been complemented by the late emergence of the crystalline cellulose (I α , I β ) editing subclass A GH9 endoglucanases, and could have shifted the reaction equilibria towards the right, i.e., synthesis of amorphous cellulose (I α am, I β am) [10]. These reactions can be depicted as:
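As a schematic sketch of the equilibrium described in this paragraph (the single-step form and the explicit water term are assumptions for illustration; only the cellulose species labels are taken from the text):

```latex
\[
\text{cellulose}\,(I_{\alpha}, I_{\beta}) \;+\; n\,\mathrm{H_2O}
\;\rightleftharpoons\;
\text{cellulose}\,(I_{\alpha}^{\mathrm{am}}, I_{\beta}^{\mathrm{am}})
\]
```

In this reading, the intercalation of water and the editing activity of class A enzymes both favour the right-hand side, i.e., the amorphous forms.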

The proliferation of amorphous regions would have rendered cellulose accessible and amenable to enzymatic conversion with lesser stringency. Evolutionarily, this means that the CBM49 in land plants (avascular and early vascular) despite its ancestral origins may no longer be necessary for cellulose metabolism. This in turn may have initiated a series of molecular events in extant class C endoglucanase sequences of late tracheophytes such as S. moellendorffii, and may have culminated in the divergence and subsequent appearance of late vascular GH9 endoglucanases of class C (Table 5, Fig. 6) [62,63,64]. The presence of the linker region too, may have facilitated the progressive loss of CBM49 and its progressive transformation into classes A and B over ≅114 Mya (Fig. 5). Since, the modified chemistry and quantity of cellulose made it amenable to rapid digestion, enzymes of classes A and B were more suited to digesting the now abundant amorphous regions of cellulose, and could utilize it as a source of carbon, as well as remodel it to effect growth, development, flowering, and germination [16, 58]. Whilst the presence of crystalline cellulose in the stems of cereal crops (Hordeum vulgare, Brachypodium distachyon, O. sativa) facilitates growth and cultivation, its secretion in the mucilage from the epidermal cells of differentiating eudicot seeds is a critical event in germination [58, 60, 116,117,118]. The recent divergence of land plant GH9 endoglucanases into monocots such as the cereals (O. sativa, B. distachyon, Panicum virgatum) and the asterid subdivision of the eudicots (S. tuberosum, S. lycopersicum and N. tabacum) is consistent in all classes and in both datasets (n1A, n3) (Table 1). These could reflect a modification of the culinary habits of a developing civilization with a desire for bulk and storage foods (Table 4). 
Here too, the in situ digestion of crystalline cellulose by class C enzymes, or its conversion to amorphous forms, could proceed unhindered. The continuing molecular evolution of classes A and B enzymes also suggests a versatile and adaptive mechanism of action, perhaps in tandem with the emergence of novel pathophysiological stimuli. The high levels of mRNA of putative class C members observed in the internode regions (high cellulose content) of the developing stems of O. sativa and A. thaliana suggest that these enzymes could still be of benefit to modern land plants, as they could direct the higher affinity classes A and B enzymes to regions of growth and development, where the concentrations of cellulose would be much lower [16, 58, 60, 116,117,118]. The CBM49 of class C plant GH9 endoglucanases could also function as a gene/ protein repository for newly emerging functions, thus justifying their title as living fossils of the plant world.

Conclusions

Our work when coupled with extant data on class C plant GH9 endoglucanases suggests that these enzymes are ancestral to classes A and B of this family. Plant GH9 endoglucanases are able to digest crystalline cellulose (class C activity) in a manner reminiscent of catalysis by bacteria, animals, protists, fungi, and archaea. Our work here suggests that the GH9 domain is relatively well conserved across taxa. We also present plausible phylogenetic time lines coupled with bioinformatics evidence that favour a vertical mode of gene evolution that may have contributed to the origin and emergence of the CBM49 between the GH9 endoglucanases of plants and non plant taxa, as well as its subsequent divergence (tracheophytes and the vascular land plants of classes A, B, and C). Finally, we review the computational evidence in context of likely physiological events that may have occurred during their divergence and evolution.