Insights into the mechanism(s) of digestion of crystalline cellulose by plant class C GH9 endoglucanases

Biofuels such as γ-valerolactone, bioethanol, and biodiesel are derived from potentially fermentable cellulose and vegetable oils. Plant class C GH9 endoglucanases are CBM49-encompassing hydrolases that cleave the β (1 → 4) glycosidic linkage of contiguous D-glucopyranose residues of crystalline cellulose. Here, I analyse 3D-homology models of characterised and putative class C enzymes to glean insights into the contribution of the GH9, linker, and CBM49 to the mechanism(s) of crystalline cellulose digestion. Crystalline cellulose may be accommodated in a surface groove which is imperfectly bounded by the GH9_CBM49, GH9_linker, and linker_CBM49 surfaces and thence digested in a solvent accessible subsurface cavity. The physical dimensions of the groove, and distortions thereof, are mediated in part by the bulky side chains of the aromatic amino acids that comprise it and may also result in a strained geometry of the bound cellulose polymer. These data, together with an almost complete absence of measurable cavities, a poorly conserved, hydrophobic, and heterogeneous amino acid composition, increased atomic motion of the CBM49_linker junction, and docking experiments with ligands of lower degrees of polymerization, suggest a modulatory rather than direct role for CBM49 in catalysis. Crystalline cellulose is the de facto substrate for CBM-containing plant and non-plant GH9 enzymes, a finding supported by exceptional sequence- and structural-homology. However, despite the implied similarity in general acid-base catalysis of crystalline cellulose, this study also highlights qualitative differences in substrate binding and glycosidic bond cleavage amongst class C members. The results presented may aid the development of novel plant-based GH9 endoglucanases that could extract and utilise potentially fermentable carbohydrates from biomass.
Graphical Abstract: Crystalline cellulose digestion by plant class C GH9 endoglucanases, an in silico assessment of function.
Electronic supplementary material The online version of this article (10.1007/s00894-019-4133-1) contains supplementary material, which is available to authorized users.

Extant structures of non-plant GH9 enzymes suggest that crystalline cellulose may be digested in subtle fully enclosed tunnels (processive), or in larger, open solvent accessible grooves/clefts (non-processive), although a mixed mode is likely to prevail in most enzymes [51][52][53][54][55][56][57][58][59][60]. The binding site(s) are labelled as plus (substrate, entrance) and minus (product, exit) sites with hydrolytic cleavage occurring between the +1 and −1 sites [51][52][53][54][55][56][57]. The length of the tunnel itself (≈50 Ang) is consistent amongst other GH9 enzymes and consists of about ten subsites (−7 to +2), where amino acids make contact with the glucan chain [51][52][53][54][55][56][57]. Further insights into the mechanistic contributions of GH9, linker, and/or the CBMs may be gleaned from the X-ray structures of enzymes in complex with simple (DP < 9; DP = {2, 3, 5}) or complex (DP = 10; −SH) oligosaccharides [58][59][60]. For example, GH9 and CBM3 are distinct spatial entities (Cel9G, Clostridium cellulolyticum; CelE4, Thermomonospora fusca) with an interaction surface that comprises a network of hydrogen-bonded residues [59,60]. However, in the absence of an active enzyme substrate (ES) complex (DP ≥ 6), the manner in which polymeric crystalline cellulose is processed by GH9 enzymes is not known [59]. Interestingly, the authors also report an inter-dependence or quasi-allostericity of the GH9 and CBMs in binding crystalline cellulose, a substrate-binding groove that is lined with polar and aromatic amino acid residues, and the possibility of a polyfunctional CelE4 with exo- and endoglucanase activities [59,60]. Crystalline cellulose is the cognate substrate for GH9 endoglucanases in non-plant taxa such as bacteria, archaea, fungi, protists, and arthropods, and may predate plant GH9 enzymes by several million years [8].
This, when combined with the similarity between the GH9 domains, suggests that the active site architecture of plant class C enzymes and subsequent reaction chemistry may be similar [8,51,52]. Whilst the data described vide supra offer insights into the origin and evolution of plant class C enzymes, mechanistic details of the same are fundamental to comprehending the precise manner in which catalysis of crystalline cellulose may proceed. Here, I analyse homology models of putative and characterised plant class C sequences, i.e. with a single well-defined CBM49 subsequence, to classify and infer the contribution(s) of the GH9, CBM49, and linker to the catalysis of crystalline cellulose.
The LeAP module of AMBERTOOLS v17.0 was used to explicitly add water molecules (TIP3P) to the 3D models of characterised class C enzymes (n = 4; $x^{FL}$, $x^{T}$) and render the modelled structures electrically neutral ({Na+, Cl−} ≥ 1) (Fig. 1) [62]. The models were optimised by minimizing their computed energies in a bi-phasic ($n_{min1} = n_{min2} = 5000$) implementation of the steepest descent algorithm with (100 Kcal mol−1 Ang2) and without positional restraints for the amino acids (Fig. 1, Table 1). The minimised models ($x^{FL}_{min}$, $x^{T}_{min}$) were utilised for comparative analyses to ascertain the significance and relevance of CBM49 to the structural integrity of the protein.
Fig. 1 Schema for biophysical characterization of class C GH9 enzymes. Generic protocol to assess contribution of GH9, CBM49, and the linker to catalysis of crystalline cellulose by plant class C enzymes. These steps consisted of fold identification, 3D protein and ligand geometry optimization, invariant core determination and normal mode analysis, surface analysis, cavity and groove delineation, and docking. Folds of characterised (full length, truncated) class C enzymes and putative class C sequences were initially identified. 3D models of class C enzymes with the top scoring templates (non-plant) were used for all further analysis; energy minimization ($E_{min}$) of the 3D models was used to compare the effects of truncation on the structural integrity of the protein. Equilibrium structures (40.1 ns) were used subsequently to delineate the active site architecture of plant class C GH9 endoglucanases as well as conduct detailed docking studies with cellulose based ligands. Abbreviations: GH, glycoside hydrolase; CBM, carbohydrate binding module; Phyre2, protein homology/analogy recognition engine.
Full length minimised structures were perturbed (Temp: 0.0 K → 300.0 K; constant volume; 20 ps) with low energy (10.0 Kcal mol−1 Ang2) positional restraints for the amino acids, followed by an unrestrained run (Temp = 300.0 K; constant pressure; 100 ps) and a production grade MD run (40.1 ns) with NAMD v2.13 (nanoscale molecular dynamics) and VMD v1.9.3 (visual molecular dynamics; configuration files) (Fig. 1, Table 1) [63,64]. These models, i.e. $x^{FL}_{40.1ns}$, x ∈ {Q5NAT0, Q8LJP6, Q93WY9, Q9ZSP9}, were used to infer active site architecture, perform docking experiments, and identify structural homologues of selected characterised class C enzymes (Fig. 1, Table 1).

Invariant core analysis of characterised and putative class C enzymes
The invariant core is a measure of structural variation inferred from the xyz coordinates of aligned atoms of amino acids at specific site(s) and was utilised to assess the conservation of GH9, linker, and CBM49. This was accomplished by generating multiple sequence alignments (MSA) with a standalone version of multiple sequence alignment by computing log-expectation (MUSCLE; http://drive5.com/muscle) in association with the R-package Bio3D (http://thegrantlab.org/bio3d) and with scripts developed in house (Fig. 1) [65][66][67]. The volume of the invariant core was then iteratively computed and is defined as the least volume (V < 1.0 Ang3) from all volumes of arbitrary ellipsoids (V ≥ 1.00 Ang3). Here, an ellipsoid comprises the variance of eigenvalues along its three principal axes of the atomic xyz coordinates of amino acid(s) at every aligned position of the combined and ungapped MSA, whilst its volume represents the structural variation at the given position(s) [67][68][69][70]. Although alanine is not the most hydrophobic amino acid (kdH Ala < kdH Met < kdH Cys < kdH Phe < kdH Leu < kdH Val < kdH Ile; kdH ≔ Kyte Doolittle Hydrophobicity index), its non-bulky and unbranched side chain renders it an excellent index of invariance of a given structure. Since truncating the proteins might be expected to dramatically alter the behaviour of the GH9 of the 3D models, a corrected subset (O. sativa, #AA = 456; N. tabacum, #AA = 466; G. hirsutum, #AA = 464; S. lycopersicum, #AA = 476) that comprised matched residues of full length proteins ($x^{cFL}_{min}$) was used for comparative analyses, i.e. $x^{cFL}_{min}$ vs $x^{T}_{min}$, where x ∈ {Q5NAT0, Q8LJP6, Q93WY9, Q9ZSP9}. Since the number of characterised class C enzymes was small (n = 4), a larger MSA, which included 3D models of putative class C enzymes (n = 92), was generated. The eigenvalues of the lowest invariant core (0 < V ≤ 1.0 Ang3) were then investigated with principal component analysis (PCA), which in turn was used to cluster and identify structural homologues of characterised class C enzymes. The aligned models were thence utilised to infer plausible active-site architecture(s) of plant class C enzymes.
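The ellipsoid-volume idea behind the invariant core can be illustrated with a minimal sketch. It assumes that the per-position volume is that of the ellipsoid whose semi-axes are the square roots of the eigenvalues of the 3 × 3 covariance matrix of the aligned coordinates (so that the product of the semi-axes equals the square root of the determinant). The function names and the single-pass selection are hypothetical simplifications; they are not the Bio3D interface, and the iterative re-superposition described above is omitted.

```python
import math

def covariance_3x3(points):
    """Sample covariance matrix of a list of (x, y, z) points."""
    n = len(points)
    mean = [sum(p[k] for p in points) / n for k in range(3)]
    cov = [[0.0] * 3 for _ in range(3)]
    for p in points:
        d = [p[k] - mean[k] for k in range(3)]
        for i in range(3):
            for j in range(3):
                cov[i][j] += d[i] * d[j] / (n - 1)
    return cov

def det3(m):
    """Determinant of a 3x3 matrix (equals the product of its eigenvalues)."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def ellipsoid_volume(points):
    """Assumed definition: 4/3 * pi * product of sqrt(eigenvalues)."""
    det = max(det3(covariance_3x3(points)), 0.0)  # guard against round-off
    return 4.0 / 3.0 * math.pi * math.sqrt(det)

def invariant_core(coords, cutoff=1.0):
    """coords[s][a] is the (x, y, z) of aligned atom a in superposed model s.

    Returns the indices of positions whose ellipsoid volume (Ang^3) does not
    exceed the cutoff, together with all per-position volumes.
    """
    n_atoms = len(coords[0])
    volumes = [ellipsoid_volume([model[a] for model in coords])
               for a in range(n_atoms)]
    core = [a for a, v in enumerate(volumes) if v <= cutoff]
    return core, volumes
```

In this simplified form a position enters the core only when its coordinates barely move across the superposed models, which is the sense in which the text uses the core volume as a proxy for structural conservation.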
Structural analysis of 3D models of plant class C GH9 enzymes
Low frequency (ω) and non-trivial normal modes (NM) (ω(NM) > 0, NM > 6; ω ∈ ℝ, NM ∈ ℕ) of the superposed 3D models as well as individual protein sequences of the minimised models were computed. These, in tandem with the standard deviation of the root mean square fluctuations, $\sigma(rmsf(x^{cFL}_{min}, x^{T}_{min}))$, were used to assess and compare the influence of atomic motion on the structural organization of characterised class C proteins. The presence of correlated displacements of residues for each full length protein after the MD run ($x^{FL}_{40.1ns}$) was also examined by the dynamic cross correlation map (DCCM), i.e. the covariance matrix of the root mean square fluctuations of every Cα atom of each class C protein, $cov(rmsf(x^{FL}_{40.1ns}), rmsf(x^{FL}_{40.1ns}))$ ∀ x ∈ {Q5NAT0, Q8LJP6, Q93WY9, Q9ZSP9} (Fig. 1). These investigations were complemented by computing the surfaces, cavities, and grooves present in the GH9, linker, and CBM49 regions or at their interfaces using the SPDBV (Swiss protein data bank viewer) suite of programs (https://spdbv.vital-it.ch) (Fig. 1) [73]. A cylinder of minimum area and volume was used to model and thence approximate the dimensions (radius ≔ r, height ≔ h, length ≔ l; r, h, l ∈ ℝ+) of the predicted substrate binding and cleaving groove(s) necessary to accommodate and digest crystalline cellulose. The formulae for these dimensions were derived by differentiating with respect to h and solving for r and h.
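As an illustration only, the classical version of this optimisation can be sketched as follows. It assumes a closed cylinder of fixed volume V whose total surface area A = 2πr² + 2πrh is minimised; substituting r = √(V/(πh)) gives A(h) = 2V/h + 2√(πVh), and setting dA/dh = 0 yields h = (4V/π)^(1/3) and r = h/2. The author's own formulae, and the role of the length l, may differ from this textbook case; the function names are hypothetical.

```python
import math

def min_area_cylinder(volume):
    """Radius and height of the closed cylinder of least surface area.

    Assumption: A(h) = 2V/h + 2*sqrt(pi*V*h); dA/dh = 0 gives
    h = (4V/pi)**(1/3) and r = h/2.
    """
    h = (4.0 * volume / math.pi) ** (1.0 / 3.0)
    r = h / 2.0
    return r, h

def surface_area(r, h):
    """Total surface area of a closed cylinder (two caps plus the wall)."""
    return 2.0 * math.pi * r * (r + h)

if __name__ == "__main__":
    # Hypothetical groove volume in Ang^3, chosen only for demonstration.
    r, h = min_area_cylinder(1000.0)
    print(f"r = {r:.2f} Ang, h = {h:.2f} Ang, A = {surface_area(r, h):.1f} Ang^2")
```

Under these assumptions the optimal cylinder always has a height equal to its diameter, so a measured groove that is much longer than it is wide signals a departure from the minimum-area shape.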

Ligand preparation and utilization
The degree of polymerization (DP) was utilised to shortlist potential candidates of cellulose oligomers (2 ≤ DP ≤ 8) and their stereoisomers from the ZINC12 and PubChem databases (http://www.ncbi.nlm.nih.gov/pubchem; http://zinc.docking.org) [74,75]. Briefly, three ligands for each of 2 ≤ DP ≤ 4 and one ligand for each of 5 ≤ DP ≤ 8 were utilised (n = 13 = 3 × 3 + 4) for this analysis (Fig. 1, Table 2). The ligands were downloaded in the isomeric SMILES format and built with ChemSketch installed locally. Geometry optimization was initially performed with ChemSketch itself, followed by a further 500 − 2000 cycles of optimization with the steepest descent and the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithms [76]. These were implemented with a local installation of ArgusLab using the universal force field (UFF) parameter of the molecular mechanics component (http://www.arguslab.com/arguslab.com) [77]. Additional relevant parameters for this step were the cutoff for non-bonded interactions (8.0 Ang) and data updates after every 20 steps. The optimization converged for all the ligands tested with a net energy of < −8 Kcal mol−1 Ang2. The xyz coordinates along with other relevant information were encoded as a pdb file and uploaded to the DockingServer (https://www.dockingserver.com/web) [78]. The geometry of all the uploaded ligands (n = 13) was finally optimised using the semi-empirical (PM6) method of partial charge addition and the Merck molecular force field (MMFF94), with all rotatable bonds delineated and non-polar hydrogen atoms merged [79,80].
Docking experiments of characterised plant class C GH9 endoglucanases
3D models of characterised plant class C GH9 endoglucanases ($x^{FL}_{40.1ns}$, x ∈ {Q5NAT0, Q8LJP6, Q93WY9, Q9ZSP9}) were uploaded to the DockingServer (https://www.dockingserver.com/web) [78]. The server, with the aid of AutoDock, added the necessary hydrogens and atomic charges, and utilised a grid of 100 × 100 × 100 points with a spacing of 0.375 Ang [81]. The final positions of the coordinates on this grid were modified to include the previously delineated interaction surfaces of GH9, linker, and CBM49, for all the proteins. Computation of the non-covalent bonds (van der Waals, electrostatics) was accomplished using the parameter set from AutoDock. Docking was performed using the Lamarckian genetic algorithm and a local search method after the initial position, orientation, and torsion angles of the ligand molecules were set randomly [81,82]. Data for a single experiment were derived from 100 different runs (Δtranslation = 0.2 Ang; Δtorsion = Δquaternion = 5). These were set to terminate after a previously set limit of energy evaluations (E_evals = 2500000, population = 150). The contribution of these residues to the catalysis of crystalline cellulose was inferred from the free energy of binding for the ligands {[…], C22, C23, C31, C32, C33, C41, C5, C6, C7, C8}.
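The search space implied by these docking parameters can be checked with simple arithmetic. The sketch below assumes that the 100 grid points per axis are equally spaced, so that the edge of the box is (points − 1) × spacing, and that every run uses its full allowance of energy evaluations; both helper names are hypothetical and are not part of AutoDock or the DockingServer.

```python
def grid_extent(points_per_axis, spacing):
    """Edge length (Ang) spanned by equally spaced grid points on one axis."""
    return (points_per_axis - 1) * spacing

def max_energy_evaluations(runs, evals_per_run):
    """Upper bound on energy evaluations for one docking experiment."""
    return runs * evals_per_run

if __name__ == "__main__":
    # Values taken from the protocol above: 100 points, 0.375 Ang spacing,
    # 100 runs, 2,500,000 evaluations per run.
    print(grid_extent(100, 0.375))                 # edge of the cubic grid box
    print(max_energy_evaluations(100, 2_500_000))  # evaluations per experiment
```

A box with an edge of roughly 37 Ang is smaller than the probable groove lengths reported later, which is consistent with repositioning the grid over each delineated interaction surface rather than over the whole protein.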

Data organization and arrangement
The pipeline and the relevant data generated are presented under the following steps:
Step 0: Parameters were defined for protocols to minimise, equilibrate, and preliminarily characterise 3D models of plant class C GH9 endoglucanases and ligands of cellulose (Fig. 1, Tables 1 and 2).
Step 1: The 3D fold of sequences of characterised (full length, truncated) and putative plant class C GH9 endoglucanases was determined (Figs. 1 and 2, Table 3; Supplementary Text 1).
Step 2: The 3D models of characterised class C enzymes were minimised and used to assess contributions of the linker and CBM49 to the structural integrity of the protein (potential energy calculations, rms deviation, normal mode analysis, root mean square fluctuations) (Fig. 3, Table 4; Supplementary Texts 2-5).
Step 5: Structural homologues of selected characterised and putative class C enzymes were identified with a PCA-based clustering schema and analysed to derive insights into the mechanism(s) of digesting crystalline cellulose by plant class C GH9 endoglucanases (Figs. 8 and 9, Table 9; Supplementary Text 10).
Homology modelling and assessment of characterised class C GH9 endoglucanases
An intersequence pairwise alignment suggests that despite a high degree of identity (≈75 − 83%) between the class C enzymes of S. lycopersicum, G. hirsutum, and N. tabacum, the preferred template for G. hirsutum was from T. fusca (PDBID: 1JS4). Conversely, the sequence identity for O. sativa was marginally lower (≈62% identity), yet it shared the same top ranked template, i.e. C. cellulolyticum (PDBID: 1GA2), with S. lycopersicum and N. tabacum (Table 3; Supplementary Text 1). However, the average sequence identity with the templates (≈32 − 40%) was similar for all class C enzymes investigated (Table 3). The superposed ungapped MSA of the truncated ($x^{T}$) class C proteins additionally resulted in the exclusion of the linker, i.e. CBM49 ≡ CBM49 ∪ L, from the MSA (Fig. 2).
The results (rmsd (template, x) < 2 Ang) suggest that the catalytic machinery for digesting crystalline cellulose may be conserved in plants and other non-plant taxa, most notably bacteria (Table 3; Supplementary Text 1) [8, 17, 20-34, 59, 60]. The 3D models that represented the best approximation to the template X-ray structures of Thermomonospora fusca (PDB: 1JS4; UID: Q8LJP6) and Clostridium cellulolyticum (PDB: 1GA2; UIDs: Q5NAT0, Q9ZSP9, Q93WY9) were used for all further investigations. The parameters used to evaluate these were sequence identity, presence of a homologous structure (confidence), and the percentage of the protein that could be modelled (coverage).
Interestingly, the rms deviations of the minimised full length […] models (Fig. 3, Tables 1 and 3; Supplementary Text 3). […] were consistent for N. tabacum and G. hirsutum, there was a complete reversal of the same for O. sativa and S. lycopersicum (Fig. 3; Supplementary Text 3). These data suggest that full length class C enzymes may adopt a stable conformation earlier than their truncated counterparts.
Fig. 3 Comparative analyses of full length and truncated 3D models of class C GH9 endoglucanases. a, b Energy minimization ($E_{min}$ < 0.0) of 3D models of full length and truncated ($x^{FL}_{min}$, $x^{T}_{min}$; x ∈ {Q5NAT0, Q8LJP6, Q93WY9, Q9ZSP9}) characterised class C GH9 endoglucanases was carried out and monitored by the root mean squared deviation of the intermediate structures. The absence of significant variation of the total (ETOT) energy for the models studied suggested that these were stable and could be examined further. c Normal mode analysis of these minimised models suggested that the carboxy-terminal end of the linker region and the CBM49 regions experienced increased oscillatory motion, an observation which is mitigated when these were truncated. The frequencies for O. sativa ($Q5NAT0^{T}_{min} \gg Q5NAT0^{FL}_{min}$) […] and was more pronounced for the lower non-trivial modes as opposed to the proteins from S. lycopersicum and N. tabacum ($x^{T}_{min} < x^{FL}_{min}$; x ∈ {Q8LJP6, Q93WY9, […]})

Assessing the contribution of CBM49 to the structural integrity of class C enzymes
The core data for the 3D models of all full length characterised plant class C enzymes suggest that while GH9 is well conserved ($\#C\alpha_{0.0 < V \le 100.0}(GH9) > 0$), CBM49 is not ($\#C\alpha_{0.0 < V \le 100.0}(CBM49) = 0$). The N- and C-terminal regions of the linker do, however, exhibit partial conservation ($\#C\alpha_{8.0 < V \le 100.0}(Linker) = \{1, 3\}$), a trend which is unlikely to be sustained for larger datasets (Supplementary Text 2). Low frequency non-trivial modes, i.e. $NM_{x^{FL}_{min}} = NM_{x^{T}_{min}} = 7-18$, were also assessed to garner additional information about the possible role(s) of CBM49 and the linker in influencing the structure of the GH9 […] (Table 4; Supplementary Texts 4 and 5). A position-specific analysis of these data clearly demonstrates that this heightened oscillatory motion involves the residues of the linker and CBM49 (Fig. 3, Table 4; Supplementary Texts 4 and 5). These data, when combined, suggest that CBM49 and the linker, despite being poorly conserved even amongst class C members, may deploy corrective hypermobility to rapidly restore equilibrial status secondary to perturbation events such […]
Delineating the active site architecture of characterised plant class C enzymes
A multi-modal approach (surface contact analysis, docking, cavity and groove delineation) was adopted to ascertain the residues and their relevance to crystalline cellulose digestion by plant class C enzymes.
Analysing the DCCM to assess and characterise intra-protein residue interactions
The NMA and DCCM data of mature folded (40.1 ns) class C enzymes suggest that several residues that comprise the non-contiguous segments between the GH9, linker, and CBM49 exhibit positively correlated atomic displacements (r ≅ 1.00) (Fig. 4, Supplementary Texts 6-9). These data imply that plant class C enzymes, like their bacterial counterparts, may also possess well-defined interaction surface(s) ($IS = \{IS^{GC}_{x}, IS^{CL}_{x}, IS^{GL}_{x}\}$) between GH9, linker, and CBM49 (Fig. 5) [59,60]. The surface area of interacting residues was variable and ranged from 375 − 517 Ang2 (CBM49_linker ≡ $IS^{CL}_{x}$), 283 − 481 Ang2 (GH9_linker ≡ $IS^{GL}_{x}$), and 96 − 208 Ang2 (GH9_CBM49 ≡ $IS^{GC}_{x}$) where x ∈ {Q5NAT0, Q8LJP6, Q93WY9, Q9ZSP9}. The surfaces themselves may be further decomposed into non-contiguous subsegments, i.e. G = G1 ∪ G2 ∪ G3 and C = C1 ∪ C2 ∪ C3. Thus, $IS^{GC}_{x}$ = G2 ∪ G3 ∪ C2 ∪ C3, $IS^{GL}_{x}$ = G1 ∪ G2 ∪ G3 ∪ L, and $IS^{CL}_{x}$ = C1 ∪ C2 ∪ C3 ∪ L (Fig. 5). In general, while the contact surface formed between GH9 and CBM49 was the least, the same for CBM49 and the linker was maximal. The only exception was the class C enzyme from O. sativa ($IS^{CL}_{Q5NAT0} > IS^{GL}_{Q5NAT0}$), which can be explained by a large interaction surface spanning GH9, CBM49, and the linker. The bonds between the residues that comprised these protein-protein interaction surfaces ($AA_{IS_{x}}$; x ∈ {Q5NAT0, Q8LJP6, Q93WY9, Q9ZSP9}) were non-covalent (hydrophobic, hydrogen, van der Waals) for N. tabacum, G. hirsutum, and S. lycopersicum. Here, too, the contact surface for the class C enzyme from O. sativa was exceptional and included the possibility of a covalent and oxygen-sensitive (−SS−) linkage between C124 and M5/M132 (Fig. 5).
Fig. 4 Structural analyses of 3D models of class C GH9 endoglucanases. a, b 3D models of full length characterised class C GH9 endoglucanases, at equilibrium, were monitored by the root mean squared deviation of the intermediate structures, and the absence of significant variation of the kinetic (EKTOT), potential (EPTOT), and thence the total (ETOT) energies. c Normal mode analysis and root mean square fluctuations (rmsf) suggested that the carboxy-terminal end of the linker region and/or CBM49 regions are flexible and may contribute to an adaptable active site geometry and d dynamic cross-correlation map of residues of characterised class C GH9 enzymes. The covariance matrix of the rmsf values of each residue per protein was computed. The dynamic cross correlation map ($cov(rmsf(x^{FL}_{40.1ns}), rmsf(x^{FL}_{40.1ns}))$ ∀ x ∈ {Q5NAT0, Q8LJP6, Q93WY9, Q9ZSP9}) was examined for areas of positive correlation (red, r → 1.00) across all modes of vibrational motion. The off-diagonal data suggest that, like the bacterial templates, class C enzymes may also possess different non-contiguous segments (G, L, C) whose atomic displacements might be correlated. Evaluating the positive r-coefficients suggests the existence of multiple interaction surfaces between these. Abbreviations: C, carbohydrate binding module 49; cov, covariance matrix; G, glycoside hydrolase 9; L, linker; r, correlation coefficient; GH, glycoside hydrolase; CBM, carbohydrate binding module; FL, full length; UID: Q5NAT0, O. sativa; UID: Q8LJP6, G. hirsutum; UID: Q93WY9, N. tabacum; UID: Q9ZSP9, S. lycopersicum.
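The dynamic cross-correlation map referred to in this section can be sketched in a few lines. The sketch assumes the commonly used normalised form, C_ij = ⟨Δr_i · Δr_j⟩ / √(⟨Δr_i²⟩⟨Δr_j²⟩), computed over superposed trajectory frames; the text itself describes the map more loosely as a covariance matrix of rmsf values, so this is one possible reading rather than the exact procedure used. The function name is hypothetical and every atom is assumed to move in at least one frame.

```python
import math

def dccm(traj):
    """Dynamic cross-correlation matrix of atomic displacements.

    traj[t][i] is the (x, y, z) position of atom i (e.g. a C-alpha) in
    frame t, with all frames already superposed. Entries range from -1
    (anti-correlated) to +1 (correlated displacements).
    """
    n_frames, n_atoms = len(traj), len(traj[0])
    mean = [[sum(frame[i][k] for frame in traj) / n_frames for k in range(3)]
            for i in range(n_atoms)]
    disp = [[[frame[i][k] - mean[i][k] for k in range(3)]
             for i in range(n_atoms)] for frame in traj]

    def avg_dot(i, j):
        # Time average of the dot product of the two displacement vectors.
        return sum(sum(d[i][k] * d[j][k] for k in range(3))
                   for d in disp) / n_frames

    msf = [avg_dot(i, i) for i in range(n_atoms)]  # mean square fluctuation
    return [[avg_dot(i, j) / math.sqrt(msf[i] * msf[j])
             for j in range(n_atoms)] for i in range(n_atoms)]
```

Off-diagonal entries close to +1 between residues that are far apart in sequence are the kind of signal interpreted above as evidence for interaction surfaces between GH9, linker, and CBM49.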
Docking data suggest qualitative differences between individual class C enzymes
The binding energy of the ligands was lower for the higher molecular weight ligands ($x_{\Delta G_{C8}}$), while interestingly, the free energy of binding for C7 ($x_{\Delta G_{C7}}$ ≅ −2.67 kcal; x ∈ {Q5NAT0, Q8LJP6, Q93WY9, Q9ZSP9}) […] suggests a preponderance of residues with small hydrophobic, aromatic, and basic side chains along with serine and threonine. Exceptionally, the catalytic amino acids aspartic (D) and glutamic (E) acids were almost completely excluded (D465, O. sativa; D139, D451, S. lycopersicum) from these calculations, as were other amino acids with known proclivity to partake in catalysis, i.e. cysteine (C) and histidine (H) (Table 7).
Delineating the cavities and grooves for crystalline cellulose catalysis and modification by plant class C GH9 endoglucanases
Since solvent accessibility is a pre-requisite for hydrolytic catalysis of the glycosidic linkage by GH9 endoglucanases, the presence of amino acids identified previously by docking was examined in cavities and grooves of the […] (x ∈ {Q5NAT0, Q8LJP6, Q93WY9, Q9ZSP9}) and analysed […], was completely devoid of aspartic (D) and glutamic (E) acids, cysteine (C), and histidine (H), amino acids with known propensity for catalysis.
The absence of a single continuous groove/cavity and the distribution of amino acids suggest a dual/discontinuous mode, wherein the +1 and −1 sites are present in a subsurface cavity, while crystalline cellulose itself may interact with and be modified by residues at the surface before entering the catalytic site. Despite these variations, the probable length (l ≥ 100 − 200 Ang) of the relevant cavities and grooves suggests a well-adapted mechanism for the intact cellulose polymer. Colour codes for GH9 (blue), linker (green), and CBM49 (red), and relevant cavity and grooves (black). Abbreviations: AA, amino acids; CvG, cavities and grooves; Dock, docking experiment; r, h, l, radius, height, and length of groove-approximating cylinder; GH9, glycoside hydrolase 9; IS, interaction surface.
[…] Table 1 and Supplementary Text 10). The presence of these ancestral class C members, i.e. tracheophytes, further strengthened the rationale of selecting this group since it represents organisms that may have evolved over 400 million years ago and therefore any mechanism postulated to digest crystalline cellulose would also likely have remained unchanged for that duration [8]. The quadrants (−, +), whose members (n = 19) included the characterised class C enzyme from N. tabacum, and (+, +), with n = 13 members, possessed a similar distribution of plant members as with group 1 (−, −) (Fig. 8; Supplementary Table 1 and Supplementary Text 10).

Discussion
Contribution of the GH9, linker, and CBM49 to the architecture of the active site of plant class C enzymes
Plant class C enzymes share considerable structural homology with gram-positive and -negative bacterial GH9 members (Tables 1, 2, and 3; Supplementary Texts 1 and 2). Although these results for GH9 are not entirely unexpected, data from this study also support the involvement of the linker and CBM49 in the catalysis of crystalline cellulose by plant class C enzymes (Table 3; Supplementary Texts 1 and 2) [8, 17, 20-34, 59, 60]. The inclusion of the N- and C-terminal linker, albeit at higher volumes (V ∈ (8.0, 100.0]), and the complete exclusion of CBM49 even amongst this small subset of class C enzymes suggest poor conservation of these segments (Figs. 5 and 8a; Supplementary Text 2) [8,81]. These data raise the possibility that the linker and CBM49 may have an indirect or modulatory role in catalysing glycosidic cleavage and may […]
The digestion of crystalline cellulose in non-plant taxa may occur in a continuous groove that spans the GH9, linker, and the associated CBMs [51][52][53][54][55][56][57][58][59][60]. Plant class C enzymes may also do so in a surface groove that is initially bounded by the GH9_linker ($IS^{GL}_{x}$) at the posterior basolateral surface and continues laterally, being bounded in turn by the GH9_linker surfaces where x ∈ {Q5NAT0, Q8LJP6, Q93WY9, Q9ZSP9}, finally terminating anteriorly in a solvent accessible cavity that might constitute the principal active site (Fig. 5).
Physically, although the IS-bounded grooves appear discontinuous at the surface, a thorough analysis suggests the presence of several subsurface cavities that could maintain continuity (Figs. 5, 6, and 7). Further, an almost complete absence of measurable cavities in CBM49/linker could also ensure that the substrate-facing surface through which crystalline cellulose traverses is chemically inert. The model precludes the existence of disparate active sites whilst concomitantly asserting a preparatory/modulatory effect by CBM49/linker, which may then be followed by the hydrolytic cleavage of the glycosidic bond at the active site (Fig. 7). The rmsf and DCCM data, in concert with the invariant core volumes, further suggest that the IS that bounds the linker and CBM49 ($IS^{CL}_{x}$; x ∈ {Q5NAT0, Q8LJP6, Q93WY9, Q9ZSP9}) surface may exhibit heightened low frequency motion, a factor that could confer upon class C enzymes the propensity to accommodate varying lengths of crystalline cellulose (47, Tables 6, 7, and 8; Supplementary Text 2, 6-9). […] (Table 7).
Despite the higher contribution of PC2 to the variance (≅6.42%) as compared with PC3 (≅5%), there was greater resolution of the sequences and a greater number of sequences (n = 39 vs n = 22) with the latter. Further, three characterised members (O. sativa, G. hirsutum, S. lycopersicum) also clustered into this quadrant (−, −) as opposed to only N. tabacum (−, +). These data suggest that the x- and z-axes (PC1, PC3) might represent the principal axes of class C enzymes. Abbreviations: PC1-6, principal components 1-6; GH9, glycoside hydrolase 9; CBM49, carbohydrate binding module; V, invariant core volume.
[…] (processive, non-processive) of action wherein crystalline cellulose is initially acted upon and thereby modified by the indenting side chains of aromatic amino acids in a quasi-continuous surface groove at the interface(s) of GH9, linker, and CBM49, which is inert and stable. Once modified (induced strain on the glycosidic linkage), crystalline cellulose is driven towards a solvent accessible subsurface cavity. Here, the GH9 conserved catalytic residues of aspartic (D) and/or glutamic (E) acids utilise an acid-base catalytic mechanism to cleave the β (1 → 4) linkage between glucopyranose units. These may then be acted upon by exoglucanases to release oligosaccharides (C2 − C4). This mechanism not only corroborates extant kinetic data such as CBM-mediated modulatory catalysis, but also offers a molecular explanation for substrate promiscuity observed for this group of enzymes, whilst conforming to available structural data from non-plant taxa (Figs. 4, 5, 6, 7, 8, and 9, Tables 4, 5 […]
Mechanistic insights into digestion of crystalline cellulose by plant class C endoglucanases. The data presented suggest that CBM49 along with the linker is poorly conserved and exhibits considerable heterogeneity, even amongst plant class C enzymes.
Since the effects of similar CBMs on catalysis are well characterised, at least in non-plant taxa, any model would have to consider modulation by CBM49 of the catalytic residues which are present on GH9. This would imply that while catalysis may occur in a solvent accessible subsurface cavity, the surface groove(s) leading to it must involve CBM49 and the linker. The multimodal approach adopted here (interaction surface definition and amino acid enumeration, docking, cavity and surface analysis) suggests that the extended side chains of aromatic amino acids could interact with, and thereby render, crystalline cellulose amenable to subsequent cleavage. Residues such as proline and stabilizing electrostatic interactions involving arginine, lysine, asparagine, glutamine, serine, and threonine, along with several smaller hydrophobic residues along the interaction surfaces of the linker and CBM49, and heightened oscillatory motion, could result in physical alteration of the groove itself, whilst concomitantly influencing the reactions that cleave crystalline cellulose. Additionally, the selection of substrates/polymer may also be determined by these residues. In support of these analyses, 3D models of several homologues were analysed. Clearly, a large and extended groove that is formed by GH9, linker, and CBM49 and could lead to the catalytic site is observed with C. sinensis, C. rubella, A. coerulea, P. persica, and M. domestica. Although segment spanning and overlapping grooves (B. distachyon, M. truncatula) are also present, it is unlikely that these contribute to catalysis. However, the clear presence of large disjoint grooves along the interaction surfaces, along with the complete absence of catalytically competent residues, corroborates a dual mode of interaction/modification and catalysis by plant class C enzymes.
Abbreviations: CBM49, carbohydrate binding module; GH9, glycoside hydrolase.
Evolutionary significance for CBM49-mediated digestion of crystalline cellulose
The ability of plant class C members to cleave crystalline cellulose is dependent on the presence of CBM49 and may have evolved directly from non-plant taxa (≈500 Mya) [8, 17, 20-34, 59, 60]. An additional premise explored previously was that plant class C enzymes may not just predate but could potentially diverge into classes A and B after CBM49 was excised during processing of the mature mRNA transcript [8,18,46,[83][84][85]. A mechanistic understanding of these processes is clearly desirable, with much of the aforementioned Zea mays generated data involving kinetic parameters, mRNA expression levels, and sequence information. The present study highlights variations in the CBM49/linker even amongst class C enzymes, provides insights into the architecture, position, plasticity, and composition of the IS-enclosed surface grooves, delineates the position and composition of a contiguous subsurface cavity for catalytic cleavage of the glycosidic linkage, enumerates functionally relevant amino acids that participate in substrate selection/modification, and offers a mechanistic explanation of CBM49-mediated reaction chemistry (Figs. 1, 2, 3 […] [63,86,87]. This would imply that proteins with the CBM49_linker may be at an evolutionary disadvantage compared with those without. Alternatively, these might be encoded by nucleotides with a tendency to form higher order substructures in mRNA such as stem loops, bulges, and bends. These in turn could delay or irreversibly interrupt the ribosomal apparatus and prevent effective translation of the mRNA, and thereby contribute to decreased expression of class C enzymes. Since CBM49 is central to the ability of plant class C enzymes to digest crystalline cellulose, it would follow that this loss could lead to a decrease in class C enzymes or conversely an increase in classes A and B [8].

Conclusions
A detailed biophysical analysis of homology models of characterised and putative class C endoglucanases was carried out to assess the contribution(s) of the GH9, linker, and CBM49 to catalysis/modification of crystalline cellulose. The work presented in this manuscript corroborates the notion that the linker and CBM49 may complement generic acid-base catalysis by aspartic/glutamic residues of GH9, and may do so in a multitude of ways. These include influencing the structural organization of the protein, participating in critical intra-protein interactions, facilitating the formation of inert and structurally plastic surface grooves, and rendering crystalline cellulose amenable to hydrolytic cleavage. Despite being entirely computational, the findings presented here offer insights into the active site geometry of plant class C GH9 endoglucanases as well as clues to their evolutionary divergence. Whilst most of these findings await experimental validation, the analyses conducted suggest that plant-based conversion of biomass is feasible and may constitute a viable alternative to bacterial-, fungal-, and algal-based protocols.