Background

The family of TMTC (transmembrane (TM) and tetratricopeptide (TPR) repeat-containing) proteins in human is represented by four paralogues: TMTC1 (isoform X3 with accession XP_016875493, 875 residues (AA); see comment below why sequence Q8IUR5 (882 AA) appears doubtful), TMTC2 (Q8N394, 830 AA), TMTC3 (Q6ZXV5, 915 AA) and TMTC4 (Q5T4D3, 741 AA). Their common sequence architecture consists of an N-terminal segment with transmembrane regions and intermittent loops and a C-terminal stretch of multiple, in the order of 10 TPR repeats.

After the TMTCs had long been genes of unknown function, the first functional information trickled in from genome-wide association (GWAS) and family (FS) studies that linked TMTCs with neurological/psychiatric diseases and sensory organ disorders but also with other conditions. Although an original, GWAS-based claim for TMTC2 in primary open-angle glaucoma in a Japanese cohort [1] could not be confirmed in several follow-up studies (for Afro-Caribbean [2], Chinese [3], Japanese [4], Korean [5], Saudi Arabian [6] and South Indian [7] cohorts), new GWAS evidence for another set of single nucleotide polymorphisms in cohorts of mixed ethnic origin reemphasizes the link [8]. Optic disc area is influenced by TMTC2 in cohorts of European and Asian ancestry [9].

TMTC1 has been related to schizophrenia (via GWAS [10]) and it is differentially expressed in inflammatory bowel disease linked arthritis [11]. The circular RNA circTMTC1 inhibits skeletal muscle satellite cell differentiation in chicken [12]. TMTC2 is associated with non-syndromic sensorineural hearing loss (SNHL; via both GWAS and FS [13, 14]). TMTC2 interactions with certain miRNAs hint towards a role in Parkinson’s disease [15]. GWAS associates TMTC2 with obesity in Caribbean Hispanics [16] and Han Chinese [17], left ventricular mass increase [16] as well as with immune conditions such as eczema, asthma and ‘atopic march’ [18]. Family studies show TMTC3 mutations being causative for cobblestone lissencephaly [19] and periventricular nodular heterotopia with intellectual disability and epilepsy [20]. Genetic inactivation of TMTC4 in mice causes rapid, early postnatal cochlear hair cell death, leading to hearing loss [21]. TMTC4’s role in influencing bone mineral density is known from a transcriptome-wide association study [22].

Hence, the diversity of clinical effects hints towards human TMTCs having, most likely, very basic molecular and cellular functions with pleiotropic, context-specific effects. TMTC1 [23], TMTC2 [23], TMTC3 [24, 25] and TMTC4 [21, 25] were found to be located in the endoplasmic reticulum (ER). For all TMTCs, the TPR-containing C-terminal segment was shown to be located in the ER lumen (TMTC1/2 [23], TMTC3/4 [25]). TMTC1/2 were associated with intracellular calcium homeostasis [21, 23]. TMTC3 was reported to have a potential role in ER stress response [24], TMTC4 was linked with unfolded protein response [21].

Dramatic progress in understanding TMTC function was recently achieved by Danish researchers collaborating with several American groups [25, 26]. Knockout of all four TMTCs in HEK293 cells abolished O-mannosylation of a variety of cadherin and proto-cadherin proteins; thus, the TMTCs are members of a new O-mannosylation pathway that selectively processes cadherin-like targets [26]. Apparently, the presence of various TMTCs affects the spectrum of modified cadherins since the selective TMTC1/3 knockout (with TMTC2/4 remaining functional) produces a larger set of O-mannosyl glycopeptides in the mass-spectrometric analysis [26]. Further, TMTC3 complementation against the background of a combined four TMTC knockout in HEK293 cells rescues the O-mannosylation of E-cadherin and enhances cellular adherence [25]. TMTC3/4 knockdowns were demonstrated to delay gastrulation in frog [25]. Three known TMTC3 disease mutations in the N-terminal protein half (H67D, R71H, G384E) were shown to exhibit reduced protein half-life despite native ER localization.

Having followed the TMTC story since 2012, we were puzzled by the difficulty of consistently interpreting the sequence-analytic findings in terms of biological function, a problem so nicely summarized by Larsen, Graham et al. [25,26,27,28]. It starts with something apparently simple such as the largely varying predicted transmembrane region (TM) numbers for various TMTCs due to evolutionary sequence divergence within their membrane-embedded N-terminal region and it does not end with the diversity of enzymatic activities and substrates of homologous proteins (largely sugar transferases), sometimes even with known 3D structure. In this work, we explore:

  (i) To which extent can the sequence architecture of TMTCs be unified, especially with regard to their number of TMs?

  (ii) What is the nature of the sequence segment homologous to Pfam model DUF1736?

  (iii) Can the conservation of sequence motifs among TMTCs and known homologous sugar transferases (including those with known 3D structure) be rationalized in terms of catalysis and ligand/substrate binding?

Methods

If not otherwise mentioned, all sequence-analytic operations were carried out with the ANNOTATOR software suite [29, 30], an in-house tool developed over ca. 20 years that integrates more than 40 academic tools (either self-programmed or used with permission of the original authors) for the prediction of protein structural and functional features. In the context of this work, the battery of programs for prediction of transmembrane regions, cellular export signals and for sequence similarity searches were especially important. In cases where completeness and recent updates of sequence and domain databases were critical, selected locally executed similarity searches were repeated on the respective websites supported by the original authors (BLAST [31, 32], HHpred [33, 34]) to make sure that no important hit from recent database additions was omitted.

Structural modelling of TMTCs by homology was carried out with Modeller (version 9.4) [35]. As it became clear during the subsequent analyses that the TMTCs harbor a binding site for a lipid-linked sugar, we used the Schrödinger suite [36] for the placement of this ligand. Subsequent induced fit relaxation and energy optimization of the complex followed published procedures [36,37,38,39,40,41,42].

Results

Collection and sequence architecture of the TMTC1/2/3/4 superfamily

Pairwise similarity searches using the BLAST tool [31, 32] and starting with any of the full-length human TMTC1, TMTC2, TMTC3 and TMTC4 sequences conveniently gather the superfamily of true TMTC orthologues in higher eukaryotes and of TMTC-like proteins in other organisms, including many hypothetical proteins that are annotated, if at all, only automatically by sequence similarity.

The sequence architecture of human TMTCs is two-partite with an N-terminal segment consisting of transmembrane regions and intermittent loops (456 AA for N-TMTC1, 475 AA for N-TMTC2, 426 AA for N-TMTC3 and 462 AA for N-TMTC4) and a remaining C-terminal part comprising TPR repeats. This result was obtained by analysing human TMTC1/2/3/4 within the ANNOTATOR environment [29, 30]. We applied the suite of transmembrane prediction tools (DAS-tmfilter [43, 44], HMMTOP [45, 46], PHOBIUS [47, 48], TMHMM [49, 50] and TOPPRED2 [51, 52]) as well as comparisons with protein domain and protein repeat databases (PFAM [53], SMART [54], Miguel Andrade’s repeats [55]) via HMM searches [56, 57].

When we repeat the simple BLAST searches with just these N-terminal segments of TMTC1/2/3/4, apparently the same superfamily of TMTCs is collected (in the order of ~ 10,000 hits with E-value < 3.e-4 and above 60% query sequence coverage; details not shown). Phylogenetically, true TMTC orthologues and TMTC-like proteins are found throughout the eukaryote kingdom with homologues even among prokaryotes but the set of four paralogues per organism with full coverage of the N-terminal domain can be systematically detected only from vertebrates down to the insect level. Already in the complete genome of the worm Caenorhabditis elegans, just two TMTCs are known (TMTC1: Q20144/NP_509123, TMTC2: NP_504200).
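The hit-selection criteria quoted above (E-value < 3.e-4 and more than 60% query sequence coverage) can be illustrated with a short filter. The following is only a minimal sketch: the record layout, the function names and the three toy hits are hypothetical and do not reflect the ANNOTATOR implementation; coverage is computed here from a single aligned query segment, which is a simplifying assumption. The query length of 456 residues is the N-TMTC1 length given above.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    subject: str    # hypothetical subject identifier
    q_start: int    # first aligned query residue (1-based)
    q_end: int      # last aligned query residue (inclusive)
    evalue: float   # E-value reported for the hit


def query_coverage(hit: Hit, query_len: int) -> float:
    """Fraction of the query covered by the aligned segment."""
    return (hit.q_end - hit.q_start + 1) / query_len


def select_hits(hits: List[Hit], query_len: int,
                max_evalue: float = 3e-4,
                min_coverage: float = 0.60) -> List[Hit]:
    """Keep hits with E-value below and coverage above the cut-offs."""
    return [h for h in hits
            if h.evalue < max_evalue
            and query_coverage(h, query_len) > min_coverage]


QUERY_LEN = 456  # length of N-TMTC1 as stated in the text

# Toy hits, invented for illustration only
toy_hits = [
    Hit("hypothetical_A", 1, 450, 1e-80),    # passes both criteria
    Hit("hypothetical_B", 200, 400, 1e-20),  # coverage too low
    Hit("hypothetical_C", 10, 440, 1e-2),    # E-value too high
]

selected = select_hits(toy_hits, QUERY_LEN)
print([h.subject for h in selected])
```

A hit that aligns in several separate segments would require merging the covered query intervals before computing coverage; this is omitted here for brevity.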

We created a grand alignment of the full set of the N-terminal segments of TMTCs from six animal organisms (Homo sapiens, Bos taurus, Gallus gallus, Xenopus laevis, Danio rerio, Drosophila melanogaster; see Fig. 1 and Additional file 1) to study family-specific and superfamily-wide sequence conservation patterns.

Fig. 1

Grand alignment of N-terminal segments of TMTCs together with sequences of selected sugar transferases with known 3D structure. We show the grand alignment of the full set of the N-terminal segments of TMTCs from six organisms (Homo sapiens (Hs), Bos taurus (Bt), Gallus gallus (Gg), Xenopus laevis (Xl), Danio rerio (Dr), Drosophila melanogaster (Dm)) together with the sequences taken from 5ezm chain A [58], 6s7t chain A [59], 5ogl chain A [60] and 6p25 chain A [59]. For supporting navigation in the alignment, the location of the TMs in human TMTC1 and in 5ezm are shown. The figure was generated with Jalview [61] using an externally created and manually edited multiple alignment (in the SEAVIEW environment [62, 63]). The location of TMs in TMTC1 follows the observations from the 3D structural model created in the course of this work and, at some places, does differ slightly from the sequence-analytic predictions provided in Additional File 2. The following sequence segments have been excluded from the alignment and replaced by “XX”: in TMTC1_Bt, 244–304 after TM6; in TMTC1_Gg, 251–310 after TM6; in TMTC1_Dm, 358–417 after TM8; in TMTC2_Hs, 337–393 after TM8; in TMTC2_Bt (G3MY32_BOVIN), 334–393 after TM8; in TMTC2_Gg (F1NPM4_CHICK), 324–380 after TM8; in TMTC2_Xl, 337–393 after TM8; in TMTC2_Dr (F1R0Y9_DANRE), 346–401 after TM8; in TMTC2_Dm, 360–504 after TM8; in 6S7T, 288–348 after TM6 and 486–535 after TM10; in 6P25, 219–261 after TM6, 312–531 after TM7 and 560–585. Please note that, as a result of the excluded sequence stretches in some sequences, the residue numbering in the figure might deviate from the residue numbering in the respective entry of the sequence database. Additional information for this figure is provided in Additional Files 1 and 2 available with this article. For locating specific residues in the alignment, we recommend first finding the nearby TMs and then looking for conserved motifs next to them.

As a first goal during the alignment creation, we wanted to understand the number and sequence localization of TM regions in the human TMTCs. In the literature, the number of TM regions in the N-terminal segment of various human TMTCs is reported to be different for various TMTCs and between 8 and 12 [25,26,27,28]. The confusion is not surprising as TM region predictors behave erratically in the twilight range of their scoring function [43]. Just one additional polar residue can bring the hydrophobicity of the candidate sequence segment below the threshold. And the boundaries of TM regions are typically heuristically determined bringing the length near 20 residues.

This variation of TM region number among TMTCs is potentially conflicting with evolutionarily conserved function as the latter requires homologous loop segments being located in the same subcellular space (in the ER or in the cytoplasm). Thus, membrane topology needs to be conserved among species within a given TMTC family and, to a large extent, also among various TMTC paralogues. As a further constraint, the C-terminal, TPR-comprising region is shown to be located in the ER for all TMTCs [23, 25].

For all 24 sequences in Fig. 1, locations of potential TM regions were identified with the full suite of the five TM predictors in the ANNOTATOR [29, 30]. In total, we find 12 regions with hydrophobic motifs that are predicted as TM regions in at least some sequences for three out of four families TMTC1, TMTC2, TMTC3 and TMTC4 (see Additional File 2). Four major discrepancies and issues are observed:

  1) The most N-terminal TM region might actually be a signal peptide.

  2) In the human TMTC1 sequence as in Q8IUR5, there is no hit for TM7. But it does exist in the sequence version of TMTC1 with accession XP_016875493 (isoform X3).

  3) In human TMTC3, TM3 is only weakly recognized.

  4) All TMTC sequences have a segment with significant sequence similarity to the Pfam domain DUF1736 (E-value < 1.e-30 for any of the human TMTCs in an HMMER search against Pfam-A [53]). The TM segment predictors suggest a TM region inside this segment for all human TMTCs except for TMTC2.

First, the most N-terminal hydrophobic region in all human TMTCs seems to be a true TM segment, possibly a signal anchor but not a signal peptide, as the sequence assessments with SIGNALP version 5 [64] show. The following loop contains the strongly conserved DD motif that, if having an enzymatic function, needs to be localized in the ER. Consequently, the N-terminus of TMTCs appears cytoplasmic. With the C-terminus in the ER, TMTCs need to have an odd number of TM regions so that the TPR segment can reside inside the ER lumen [23, 25].

Second, we encountered serious difficulties when attempting to include the canonical TMTC1 sequence Q8IUR5 into the grand alignment, especially in the region that includes TM7 and the DUF1736 hit (which is much worse in Q8IUR5 with E-value=3.e-19 compared with other TMTCs). This would not have surprised anyone if the sequence were from a more obscure insect or fish genome but Q8IUR5 is a human protein. Searching human sequences with TMTC1 from Bos taurus or Gallus gallus delivers XP_016875493 (TMTC1 isoform X3) as the sequence that can be aligned much more easily with TMTC1s from other species as well as with other TMTCs. At the same time, searching the Bos taurus or Gallus gallus proteomes with human Q8IUR5 does not deliver a better, more similar isoform than the best homologue found with XP_016875493. Thus, it cannot be excluded that Q8IUR5 has sequence errors in the region 245–312 (with the corresponding region 245–305 in XP_016875493 being the correct version). While none of the five TM region predictors finds a trace of a hit for TM7 in Q8IUR5, it is confidently predicted by the majority of them in XP_016875493.

Third, the evolutionary argument (see Fig. 1) strongly suggests that the respective regions for TM3 in human TMTC3 are just subthreshold for the TM predictors (compared with other human TMTCs, there are additional polar residues (Ser119, Ser120 and Ser124) in the respective sequence KSSVIASLLFAVHPIHT (residues 118–134) of human TMTC3).
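The effect of a few polar residues on a hydrophobicity-based score can be made concrete with a small calculation. The sketch below uses the Kyte-Doolittle hydropathy scale purely as an illustration; it is not the scoring function of any of the five TM predictors used in this work, and the serine-to-alanine variant is a hypothetical comparison sequence, not an observed one. Only the TMTC3 segment itself (residues 118–134) is taken from the text.

```python
# Kyte-Doolittle hydropathy values (illustrative scale; an assumption,
# not the scoring function of the TM predictors cited in the text)
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydropathy(seq: str) -> float:
    """Average Kyte-Doolittle hydropathy over a sequence segment."""
    return sum(KD[aa] for aa in seq) / len(seq)


# Residues 118-134 of human TMTC3, as given in the text
tm3_tmtc3 = "KSSVIASLLFAVHPIHT"

# Hypothetical variant: Ser119, Ser120 and Ser124 replaced by alanine,
# used only to show how three polar residues shift the average
variant = tm3_tmtc3.replace("S", "A")

print(f"TMTC3 118-134:      {mean_hydropathy(tm3_tmtc3):.2f}")
print(f"Ser->Ala (hypoth.): {mean_hydropathy(variant):.2f}")
```

Under these assumptions, the three serines lower the segment average by 7.8/17, i.e. about 0.46 hydropathy units, which is the kind of shift that can move a borderline segment across a fixed cut-off.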

Fourth, the sequence segment predicted to be a TM region as part of the DUF1736 hit is actually not membrane-embedded. When checking the TMTCs against sequences with known 3D structures via HHpred [33, 34] as implemented in the ANNOTATOR environment [29, 30], we find convincing statistically significant similarity of the N-terminal portions of TMTCs to structures such as 5ezm [58]. For example in the case of N-TMTC1, the E-value is 1.9e-22. Comparison with the alignment delivered by HHpred reveals that the segment FPNFFFI (261–267 in 5ezm), a small, quite hydrophobic helix at the ER side and with its axis parallel to the membrane, aligns with the segment 318–324 in human TMTC1. Notably, the segment 311–324 is the common core from TM predictions by four different TM predictors (TMHMM, PHOBIUS, DAS-tmfilter, and HMMTOP). Similar observations are available in other homologous structures. TMTC1’s segment 311–324 hits the same type of small, hydrophobic helix in the ER lumen parallel to the membrane in 5ogl (found with E-value 2.7e-15 by HHpred; segment 325–333 with sequence PEVFMQRIS [60]) or in 6s7t (found with E-value 2.4e-17 by HHpred; segment 382–389 with sequence GRFYSLWD [65]).

Thus, we can convincingly conclude (i) that the DUF1736-similar region in TMTCs, actually just a loop between TM7 and TM8 located in the endoplasmic reticulum lumen, does not contain a TM region, (ii) that all human TMTCs comprise 11 TM regions in their N-terminal sequence portion and (iii) that the N-terminus is located in the cytoplasm and the C-terminal TPR domain is in the ER lumen (see also Fig. 2).

Fig. 2

Cartoon of the membrane topology of the N-terminal domain of TMTCs and localization of important substructures and residues. The figure shows a schematic representation of the overall structural elements and the connectivity of TMTCs. The TM helices are shown in yellow cylinders and marked as I to XI while the helical regions in the lumen are shown in green cylinders and are marked as JM1, JM2 and JM3. The lumenal loops are numbered from EL1 to EL5. The whole TPR region is shown as a single block colored in cyan. The figure also highlights important residues which are (i) the strictly conserved DD motif (M1, Table 4) in EL1 (loop between TM1 and TM2), (ii) conserved SHKSYRP motif (M2, Table 4) also present in EL1, (iii) conserved lysine residue of KET(Q)xxT motif (M4, Table 4) that forms a salt bridge with the phosphate group of DPM, (iv) glutamate residue from conserved KET(Q)xxT motif (M4, Table 4) in EL3 and aspartate residue of the conserved DW motif (M4, Table 4) in EL4, (v) strictly conserved arginine residue from conserved ERxxY motif (M7, Table 4) in loop EL5 between TM9 and TM10. All the important residues are colored in yellow except the metal binding residues which are highlighted in pink. The sequence position numbering corresponds to TMTC1. The location of TMs in TMTC1 follows the observations from the 3D structural model created in the course of this work and, at some places, does differ slightly from the sequence-analytic predictions provided in Additional File 2.

Further, we wish to emphasize that the TM regions in TMTCs are largely of the complex type (the only consistently simple TMs are TM7 in TMTC3 from various species (data not shown)) [66, 67]; thus, their sequences contain evolutionary information beyond the generally not informative hydrophobic background (sprinkled-in polar residues, glycine and proline are typically rare in TMs [68, 69]) useful for sequence comparison in homology searches [70,71,72].

As mentioned by a reviewer, membrane topology prediction for proteins with TM regions has been attempted directly from sequence, typically following the TM segment prediction part [45, 46, 73]. As a trend, these prediction tools support the topology conclusions for the TMTCs but not always. For example, the probability for the N-terminus to be cytoplasmic was predicted by TMHMM [49, 50] as follows: TMTC1 0.61, TMTC2 0.64, TMTC3 0.89, TMTC4 0.30. We think that the predicted number of TM regions (especially whether it is odd or even) critically influences the correctness of the topology prediction. For TMTC1/2/3, nine TM regions were found by TMHMM (an odd number, as in the case of the actual 11 TM regions) but ten were predicted for TMTC4.
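The parity argument behind this reasoning is simple enough to write down explicitly: each TM region crosses the membrane once, so the side of the C-terminus follows from the side of the N-terminus and the number of TM regions. The sketch below is an illustration of that logic only; the function name and the two side labels are our own choices, and re-entrant or membrane-parallel helices are assumed not to be counted as TM regions.

```python
SIDES = ("cytoplasm", "ER lumen")


def c_terminal_side(n_terminal_side: str, n_tm: int) -> str:
    """Side of the membrane on which the C-terminus ends up.

    Each TM region crosses the membrane exactly once, so the chain
    switches sides n_tm times (assumption: every counted region is a
    true membrane-spanning segment).
    """
    if n_terminal_side not in SIDES:
        raise ValueError(f"unknown side: {n_terminal_side!r}")
    if n_tm < 0:
        raise ValueError("number of TM regions must be non-negative")
    start = SIDES.index(n_terminal_side)
    return SIDES[(start + n_tm) % 2]


# TM counts discussed in the text: 11 (proposed), 9 and 10 (TMHMM)
for n_tm in (11, 9, 10):
    print(n_tm, "->", c_terminal_side("cytoplasm", n_tm))
```

With a cytoplasmic N-terminus, any odd count (9 or 11) places the TPR region in the ER lumen, whereas an even count such as 10 is only compatible with a lumenal C-terminus if the N-terminus is also lumenal, which is consistent with the low cytoplasmic probability reported for TMTC4.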

TMTCs are homologous to membrane-bound sugar transferases with known 3D structures

We summarized the findings related to the top hits of the HHpred searches with the N-TMTC1, N-TMTC2, N-TMTC3 and N-TMTC4 sequence segments in Table 1. The original HHpred outputs are available as supplementary material (Additional File 3). All the hits have excellent E-values (<< 1.e-10) despite low sequence identities of the respective sequence alignments (all values between 8 and 13%; e.g., TMTC1/2/3/4 align with 5ezm with sequence identities 8, 13, 10 and 12% in the HHpred-generated alignments respectively); thus, the match of the physico-chemical property pattern between the respective sequences is excellent, especially for the TM segments and some loop regions next to them.
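For readers unfamiliar with how a sequence identity value is derived from a pairwise alignment, the following minimal sketch counts identical residues over the aligned (non-gap) columns. The choice of denominator is an assumption made here for illustration, and HHpred may define identity differently; the two aligned strings are invented toy sequences, not TMTC or 5ezm sequences.

```python
def percent_identity(aln_a: str, aln_b: str, gap: str = "-") -> float:
    """Percent identical residues over columns without a gap.

    Assumption: the denominator is the number of columns in which
    both sequences carry a residue; other conventions exist.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    aligned = [(a, b) for a, b in zip(aln_a, aln_b)
               if a != gap and b != gap]
    if not aligned:
        return 0.0
    identical = sum(1 for a, b in aligned if a == b)
    return 100.0 * identical / len(aligned)


# Toy alignment, invented for illustration only
toy_a = "MKT-LLVAG"
toy_b = "MRTALI-AG"

print(f"{percent_identity(toy_a, toy_b):.1f}% identity")
```

Identity values of 8 to 13% as reported above mean that roughly one aligned position in ten is identical, so the statistical significance of the hits rests on the profile-level match of residue properties rather than on residue identity.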

Table 1 HHPred search with the N-terminal part of the four human TMTCs against PDB (PDB_mmCIF70_29_May, version 29/05/2020)

Proteins with known structure discovered in these searches belong to the group of well-studied membrane-standing arabinosyl-, oligosaccharyl- or mannosyltransferases. Their annotated enzymatic domain is fully part of the alignment. Given the full-length coverage of the N-TMTCs’ sequences queried against the PDB, there is no doubt that N-TMTCs and the annotated enzymatic domains of sugar transferases detected share a common fold and have a similar 3D structure.

For all N-TMTCs, the sequence of the bacterial aminoarabinose transferase ArnT corresponding to structures 5ezm/5f15 [58] is the most similar homologue with an almost gapless alignment (with some exception for the N-terminal region of the loop between TM7 and TM8). The alignments of N-TMTCs generated by HHpred cover the first 11 of the 13 N-terminal TMs in 5ezm/5f15, nicely supporting the membrane topology consideration in the previous section (to note, TM region TM4 is missing and TM5/6 are annotated as a single large TM both in the PDB entry 5ezm and in the Uniprot entry Q1LDT6). As a result of the structural similarity, we can conclude that there are five loops between TM regions that form the structure in the ER lumen (see Fig. 2): (i) two long loops EL1 (between TM1/TM2) and EL4 (between TM7/TM8; both loops contain helical segments) as well as (ii) three short loops EL2 (between TM3/TM4), EL3 (between TM5/TM6) and EL5 (between TM9/TM10). In 5ezm/5f15 (as in other sugar transferases of this type), there are two substrate binding cavities that communicate via a channel limited, on one side, by the TMs in the membrane and, at the other side, by the long loop connecting TM7 and TM8 (i.e., EL4 in the case of TMTCs). One binding region is formed by the segments homologous to EL1, EL2 and EL4 and accommodates the sugar acceptor substrate. The other site (built by EL1 and mainly by EL4) provides for interaction with a lipid-linked carbohydrate (LLC; the sugar donor, e.g., a dolichyl phosphate or pyrophosphate with attached sugar/oligosaccharide moiety). In the zone of contact of the two substrates, a divalent metal ion important for catalysis is coordinated by amino acid residues of the transferase. Despite the vast differences in sequences and possible ligands, homology considerations suggest that the TMTCs are constructed following the same general architecture.

Most importantly, we see at the level of sequence comparison (even without any structural modelling) that some critical motifs strongly conserved among the TMTCs have a structural and/or functional equivalent (e.g., in ligand binding) in the 3D structures of enzymes found. The strictly conserved DD motif in the loop between TM1 and TM2 (e.g., D52/D53 in N-TMTC1) aligns with the known active site in several sugar transferases (e.g., D55/E56 in 5ezm_A, D77/E78 in 6p25_A or D281/D282 in 7bvf_A). All the sugar transferases found in our HHpred homology search have at least an aspartate that coincides with the first aspartate in this motif. This residue is described as binding to the polar group of the sugar acceptor and/or a divalent metal ion (e.g., for 5ezm/5f15 [58], 5ogl [60], 6s7t/6s7o [65] or 6sni/6snh [77]). Thus, these positions are absolutely critical for enzymatic catalysis since any residue substitution leads to loss of function. For example in 6p25/6p2r [59], E78 forms a salt bridge with R138 so that D77 sticks out towards the cavity where it binds to the sugar acceptor substrate. Any replacement of D77/E78 abolishes enzyme function [59, 78].

In 5ezm/5f15, D158 (in EL2, N-terminal to TM4) interacts with the acceptor substrate and also forms a salt bridge with K203 (in EL3, C-terminal to TM5). The homologous residues are conserved in TMTCs (e.g., D169 and K219 in N-TMTC1) and, thus, are predicted to also play a role in ligand binding.

An arginine in the loop EL5 between TM9 and TM10 close to the N-terminus of TM10 and strictly conserved among TMTCs (e.g., R404 in TMTC1 as part of the conserved sequence AERV) followed by a hydrophobic stretch of residues (from TM10) is also seen in sugar transferase structures (R459 in 6s7t [65], R405 in 6s7o [65], R404 in 6ezn [74], R426 in 3waj [75, 79], and R375 in 5ogl [60]). In all these known structures, this arginine is described as an interaction partner of the LLC’s phosphate group whereas the lipid part of the LLC is accommodated within a hydrophobic groove formed mainly by TM6 and TM7.

The sequence SHKSYRP (with H89/K90 in TMTC1) in EL1 is well conserved among TMTCs (close to the N-terminal end of the second helix in EL1). At the same time, K85 in the 5ezm/5f15 sequence at a homologous position is known to interact with the LLC’s phosphate. Thus, it is reasonable to assume that one of the positively charged residues in TMTCs (e.g., H89 or K90 in TMTC1) has a similar role. This suggestion is supported by the known mutant phenotype in human TMTC3 (the mutation His67Asp introduces a charge swap and leads to cobblestone lissencephaly [19]; H67 is the position in TMTC3 homologous to H89 in TMTC1).

The limits of a purely sequence-analytic approach can be illustrated with the case of the DW motif conserved among all TMTCs in EL4 (e.g., D330/W331 in N-TMTC1) at the C-terminal end of the helix parallel to the ER membrane. It is problematic to identify the function of an equivalent motif in homologous 3D structures, even in those with a hit to DUF1736. For example, the apparently homologous sequence positions R270/Y271 in 5ezm/5f15 are at the edge of a structurally unresolved loop region. In 6s7t, residues E405/H406 seem the closest to positions homologous to the TMTCs’ DW motif. E405 is directed towards R214 (a residue in the loop homologous to EL2) [65]. Thus, the function of the conserved DW motif in TMTCs (as well as of several others) cannot be unambiguously understood from such comparisons. Interestingly, a DW motif has been described as critical for subunit interaction in pyruvate dehydrogenase kinase 2 [80].

Thus, this sequence-analytic comparison of TMTCs with known homologous 3D structures shows that a number of conserved sequence motifs can be understood in the context of ligand binding. TMTCs appear to incorporate divalent metal ions for catalysis and LLCs as donors for a sugar moiety. Given the experimental finding of TMTCs being part of a new O-mannosylation pathway [26], the LLC applicable here is dolichyl-phospho-mannose (DPM), the universal donor of mannosyl-residues in higher eukaryotes.
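Locating the conserved motifs discussed in this section within a given sequence can be sketched with regular expressions. In the example below, the motif names follow the text and Fig. 2, but the translation of KET(Q)xxT and ERxxY into patterns (T or Q at the third position, x as any residue) is our reading of the notation and thus an assumption. The toy sequence is invented so that each motif occurs once; it is not a TMTC sequence, and the reported positions have no biological meaning.

```python
import re

# Motif names from the text; regex translations are assumptions
MOTIFS = {
    "DD": r"DD",
    "SHKSYRP": r"SHKSYRP",
    "KET(Q)xxT": r"KE[TQ]..T",
    "DW": r"DW",
    "ERxxY": r"ER..Y",
}


def find_motifs(seq: str) -> dict:
    """Return 1-based start positions of each motif in the sequence."""
    return {
        name: [m.start() + 1 for m in re.finditer(pattern, seq)]
        for name, pattern in MOTIFS.items()
    }


# Invented toy sequence containing each motif exactly once
toy_seq = "MAADDLLSHKSYRPGGKETAATVVDWLLAERVAYGG"

for name, positions in find_motifs(toy_seq).items():
    print(f"{name:10s} {positions}")
```

Short patterns such as DD or DW occur frequently by chance in real proteins, so hits of this kind are only meaningful in the context of a multiple alignment and the surrounding TM regions, as used throughout this work.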

TMTCs are homologous to a variety of sequence families of membrane-bound sugar transferases

When applying HHPred with N-TMTCs as input against the Pfam library of sequence domain family models, a large variety of annotated entries, besides many domains of unknown function, are hit with E-values that are statistically significant beyond doubt (E-value < 1.e-5, see Table 2 and Additional file 3).

Table 2 HHPred search with N-terminal part of four human TMTCs against Pfam-A_v33.1

Most of the domains found belong to the GT-C clan (CL0111) of glycosyltransferases (out of 19 known GT-C members, nine were detected: Glyco_transf_22, STT3, PTPS_related, PMT, Mannosyl_trans2, PMT_2, Arabinose_trans, PIG-U, GT87). Most informative are the sequence homologies with Glyco_transf_22 (PF03901) and STT3 (PF02516) because the E-value is < 1.e-18 and alignment of the Pfam domains and the N-TMTCs cover both query and template almost completely (coverage > 95%). Certain super-conserved residues in the sequence family alignments of both Pfam families are also conserved among the TMTCs. This includes the active site DD motif in EL1 (e.g., D52/D53 in N-TMTC1) and the arginine in front of TM10 (e.g., R404 in TMTC1) that are characteristic for both Pfam domains.

The homology with other groups of dolichyl-phosphate-mannose-dependent mannosyltransferases (Mannosyl_trans4, PF15971), glucosyl transferases GtrII (Glucos_trans_II, PF14264) and arabinofuranosyltransferase N-terminal domain (AftA_N, PF12250) not directly linked to the GT-C clan fits into the same general functional prediction for TMTCs as sugar transferases with a similar 3D structure.

The HHPRED search results are confirmed by iterative PSI-BLAST [32] runs with standard parametrization and human TMTC sequences as input. They deliver plentiful hits within the GT-C clan and beyond (results not shown). The diversity of significant homology hits constitutes a problem for function assignment of TMTCs beyond the general prediction as GT-C/PMT-like sugar transferases. It needs to be emphasized that the GT-C clan is a very diverse sequence superfamily comprising membrane-bound sugar transferases with a large variety of different specific activities and substrate types (including the transfer of arabinose, mannose, glucose or oligosaccharides among others).

We also find other proteins including even enzymatically completely inactive ones such as PIG-U (see reference [81] for discussion of PIG-U’s function). Interestingly, the profile built on the basis of our grand alignment of TMTCs is linked by HHPred to the domain BindGPILA [81] with E-value ~ 0.03 (calculated against the background of all Pfam models). To note, this domain model is derived from homologous sequence segments with 10 TMs and intermittent loops extracted from proteins in the glycosylphosphatidylinositol (GPI) lipid anchor pathway PIG-B, PIG-M, PIG-U, PIG-V, PIG-W and PIG-Z [81]. PIG-W is an acetyltransferase for the GPI lipid anchor, PIG-U is not an enzyme at all but the remaining four (PIG-B, PIG-M, PIG-V and PIG-Z) are mannosyltransferases. All of them are united by the ability to bind phospho-lipid linked sugar/carbohydrate moieties.

Thus, the mere homology of TMTCs to the GT-C group of sequences by itself is only informative with regard to fold coincidence, to structural similarity and to a general level of functional classification. Yet, the conservation of residues known to be important for catalysis and substrate binding as detailed in the sequence analysis above indicates that TMTCs are actually enzymatically active. As we see in the 3D structure modelling exercise below, many additional conserved sequence motifs can be rationalized due to interactions with ligands and substrate molecules.

Insights from the structural modelling of human TMTCs by homology to membrane-bound sugar transferases with known 3D structural arrangements

We attempted to create 3D structural models of all four TMTCs together with a divalent metal ion and DPM with the goal to explore whether observed sequence motifs that are conserved between TMTCs and sugar-transferases of known 3D structure come spatially together for interaction with the ligands.

HHpred scored the aminoarabinose transferase structures ArnTCm (PDB IDs: 5ezm and 5f15, chain A [58]) as by far the best hit for all human TMTCs (see Table 1) and also for five other organisms including Bos taurus, Gallus gallus, Danio rerio, Xenopus laevis and Drosophila melanogaster (results not shown). Therefore, this X-ray crystal structure was used as a template to build 3D models of TMTC1 (XP_016875493.1), TMTC2 (Q8N394), TMTC3 (Q6ZXV5) and TMTC4 (Q5T4D3) using the functions automodel and loop refine in Modeller (version 9.4) [35]. The overall structure of 5ezm (apo ArnTCm, resolution 2.70 Å) / 5f15 (UndP-bound ArnTCm, resolution 3.20 Å) [58] consists of (i) an N-terminal membrane-embedded region and (ii) a periplasmic domain (PD). For this work, only the first segment is of interest. It involves 13 TM helices and interconnecting loops including three juxtamembrane helices (JM1, JM2 and JM3). JM1 and JM2 form the first periplasmic loop between TM1 and TM2 while JM3 leads into a partially disordered flexible periplasmic loop (PL4 being homologous to EL4 in TMTCs) between TM7 and TM8.

In this study, only the membrane-embedded domain of TMTCs including the juxtamembrane helices was modelled using the most N-terminal regions of the templates 5ezm and 5f15 (the 11 TM segments together with JM1 and JM2 following 5ezm while JM3 was modelled after 5f15). The major hurdles to generating the 3D structure of TMTCs by homology modelling are (i) the low percent identity (< 15%) with sequences of the template crystal structures (Table 3) and (ii) several overly long loops between TM regions without equivalent in the structure templates. As we want to understand structural detail at the lumenal side, cytoplasmic loops are not that critical but the lumenal ones are. The loop sequence segments include (i) the cytoplasmic loop between TM2-TM3 (residues 136–146) in TMTC4, (ii) the cytoplasmic loop between TM6-TM7 in all TMTCs and (iii) the lumenal loop TM9-TM10 in all TMTCs. Furthermore, the template 5ezm/5f15 does not account for a loop extension at the N-terminal side of the domain of unknown function, DUF1736 (PF08409), between TM7-TM8 for all TMTCs. Moreover, we note that TMTC2 has another unusually long cytoplasmic loop between TM8-TM9 (residues 337–392) and, therefore, in the absence of any template, residues 337–392 were not modelled. We describe the alignment with the 5ezm/5f15 template, the regions modelled for each TMTC protein and issues with the overly long loops in Table 3 and in the annotated alignment in Additional File 4 – Supplementary Figure 1.

Table 3 Modelling the 3D structures of TMTCs

As we expect that certain long loops, especially those that have no equivalent in the 5ezm/5f15 structure, will not get reconstructed well, the DOPE model scoring system provided by Modeller might not be such a good choice for selecting among various model instances. We have validated our model instances based on the TM-align scores [82]. A TM-score between 0 and 0.3 suggests random structural similarity while a TM-score greater than 0.5 and less than 1.0 suggests two structures having the same fold. The TM-align scores for TMTC1, TMTC2, TMTC3 and TMTC4 (when compared with 5ezm) are 0.93441, 0.72261, 0.91499, and 0.92104 respectively.
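
The interpretation of these values can be written down as a small sketch. The thresholds are the ones quoted above; the label for the range between 0.3 and 0.5, which the text does not characterize, and the function name are assumptions made for illustration.

```python
def classify_tm_score(score: float) -> str:
    """Map a TM-score to the qualitative categories quoted in the text."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("TM-score must lie in [0, 1]")
    if score > 0.5:
        return "same fold"
    if score <= 0.3:
        return "random similarity"
    # Assumption: the text does not interpret the 0.3-0.5 range.
    return "unclassified"


# TM-align scores of the four models against 5ezm, as reported in the text.
TM_SCORES = {
    "TMTC1": 0.93441,
    "TMTC2": 0.72261,
    "TMTC3": 0.91499,
    "TMTC4": 0.92104,
}

if __name__ == "__main__":
    for model, score in TM_SCORES.items():
        print(f"{model}: {score:.5f} -> {classify_tm_score(score)}")
```

All four reported scores fall into the same-fold category, with TMTC2 being the lowest.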

The resulting 3D structure models (see Fig. 3) were used to place a divalent metal ion (following 5ezm for initial positioning) and a DPM moiety (using the crystal-bound ligand UndP in 5f15 as the reference position for initial posing). We applied a Zn2+ parametrization for the ion in this study, although experiment provides no clarity about the exact nature of the divalent metal ion. The crystallographic evidence speaks for zinc in 5ezm [58], whereas Mn2+ is the likely ion in the case of 5ogl [60]; several other reports, such as the one for 6s7t [65], remain silent about the nature of the ion other than emphasizing an electron density consistent with a divalent metal ion. To emphasize, we do not think that the exact parametrization of the ion (beyond carrying two positive charges) is critical for the outcome of this modelling study.

Fig. 3

Structure models of TMTC1/2/3/4 with ligands. The cartoon representation of model TMTC1/2/3/4 (from top to bottom) with docked DPM is shown in side- (left column) and top-view (middle column). Close-up (right column) of the binding pocket of TMTCs with docked DPM (cyan color sticks) and with important residues (HKSY residues of the conserved SHKSYRP motif M2 in EL1; K and E from motif M4 in EL3) presented in yellow color sticks; the divalent metal ion (modelled as zinc) is shown in gray color

3D structure modelling operations including ligands were carried out with the Schrödinger suite [36]. An induced fit procedure following established protocols [36,37,38,39,40,41,42] was applied. In brief, the Schrödinger programs “Protein Preparation Wizard” and “LigPrep” were used to prepare the TMTC models and the DPM. With “Glide-SP” and “Prime”, multiple poses of DPM were generated and optimized in multi-step energy minimizations (with the OPLS parameter set and a surface Generalized Born implicit solvent model) that included some stages with softened potentials and side chains mutated to alanine. The procedure was completed with a minimization in which all residues within 5 Å of DPM (including their backbone and side chains) and the ligand DPM itself were allowed to relax. The complexes were ranked by Prime energy (molecular mechanics energy plus solvation), and those within 30 kcal/mol of the minimum energy structure were passed on to a final round of Glide docking and scoring with GlideScore. The final structures for each of the TMTCs together with the ligands are provided with their atomic coordinates (Additional File 5).
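
The energy-window selection step can be illustrated with a short sketch. This is not the Schrödinger implementation; it only mirrors the ranking rule described above on a plain list of energies. The `Pose` type, the function name and the energy values are hypothetical.

```python
from typing import List, NamedTuple


class Pose(NamedTuple):
    name: str
    prime_energy: float  # kcal/mol (molecular mechanics energy plus solvation)


def within_energy_window(poses: List[Pose], window: float = 30.0) -> List[Pose]:
    """Keep poses within `window` kcal/mol of the lowest energy, best first."""
    if not poses:
        return []
    e_min = min(p.prime_energy for p in poses)
    kept = [p for p in poses if p.prime_energy - e_min <= window]
    return sorted(kept, key=lambda p: p.prime_energy)


# Invented energies for illustration only; not results from this study.
toy_poses = [
    Pose("pose_1", -9510.2),
    Pose("pose_2", -9498.7),
    Pose("pose_3", -9475.0),
    Pose("pose_4", -9530.0),
]

if __name__ == "__main__":
    for pose in within_energy_window(toy_poses):
        print(pose.name, pose.prime_energy)
```

Only the poses surviving this filter would enter the final docking and scoring round.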

As the most important outcome of the modelling effort, visual inspection of the four model structures shows consistently that, for all TMTCs, the seven conserved sequence motifs M1-M7 listed in Table 4 come spatially together at the lumenal side of the TMTCs and form part of the protein surface that is homologous to the two substrate/ligand binding sites in 5ezm/5f15. They group closely around the DPM moiety and the divalent ion, creating a dome region (see Fig. 4 for the case of TMTC1). Residues in motifs M4 and M5 coordinate the divalent metal ion. M2 and M3 are largely engaged in mannose interactions, while M6 tends to contact the dolichyl tail. Motifs M4, M5 and M7 are important for interaction with the phosphate in DPM. Thus, the observed sequence conservation can be rationalized in terms of evolutionarily conserved function.

Table 4 Several conserved sequence motifs in TMTCs are related to DPM binding and divalent metal ion coordination
Fig. 4

Sequence motifs M1-M7 come spatially together in model structures of TMTCs. We illustrate the spatial localization of sequence motifs M1 (red), M2 (orange), M3 (yellow), M4 (green), M5 (blue), M6 (violet) and M7 (pink, all shown in ball mode) at the background of the structural cartoon of the whole protein. DPM is presented as blackish sticks, the divalent metal ion is represented as reddish sphere. We show the case of TMTC1; the figures for the other TMTCs look very similar. To note, motif M2 in this figure is extended to the conserved region represented by SHKSYRPLCVTLTSFRLN in TMTC1 (88–103 in EL1)

Further, several close contacts between the DPM ligand, the metal ion and TMTC residues were observed (to note, we did not enforce any specific residue contacts during the induced fit docking procedure). Given some sequence diversity among TMTCs and also the large number of degrees of freedom in the modelling process, it is not surprising that not all contacts are found in all models. Yet, a common subset of those was detected in each of the TMTC1, TMTC2, TMTC3, and TMTC4 model structures (see Table 4) and some contacts repeat patterns seen in homologous crystal structures:

  (i) The phosphate functional group of DPM interacts with the divalent metal ion. In addition, the metal binds to the glutamate residue in the conserved KET(Q)xxT motif in EL3 (e.g., E220 of TMTC1) and to the aspartate residue of the conserved DW motif (e.g., D330 in TMTC1) in EL4. To note, H267 (in the motif H265-E266-H267 where the glutamate is homologous to D330 in TMTC1) interacts with the divalent metal ion held between JM1 and EL4 in 5ezm [58].

  (ii) The phosphate group of DPM also forms a salt bridge with the lysine residue of the conserved KET(Q)xxT motif in EL3 (e.g., K219 in TMTC1).

  (iii) The mannose moiety interacts with residues H-K-S-Y within the conserved SHKSYRP motif M2 in EL1 (e.g., S80, H89, K90 and S91 residues in TMTC1, Fig. 3).

  (iv) The conserved stretch in EL1 represented by SHKSYRPLCVLTSFRLN in TMTC1 (it includes motif M2) forms the dome region of the DPM binding pocket in all 4 TMTCs. The dolichyl lipid chain of DPM occupies the cavity that is provided by hydrophobic residues of TM6, TM7 and TM9.
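
Close contacts of the kind listed above are commonly identified with a distance cutoff between residue atoms and ligand atoms. The sketch below shows this idea on plain coordinate tuples. The 4 Å cutoff, the function name and all coordinates are assumptions made for illustration; the cutoff actually used for the contact analysis is not stated in the text.

```python
import math
from typing import Dict, List, Sequence, Tuple

Coord = Tuple[float, float, float]


def residues_in_contact(
    residue_atoms: Dict[str, Sequence[Coord]],
    ligand_atoms: Sequence[Coord],
    cutoff: float = 4.0,  # assumed cutoff in Angstrom, not taken from the paper
) -> List[Tuple[str, float]]:
    """Return (residue label, minimum distance) pairs within the cutoff,
    sorted by increasing distance to the ligand."""
    if not ligand_atoms:
        return []
    hits = []
    for label, atoms in residue_atoms.items():
        if not atoms:
            continue  # residue without coordinates cannot be evaluated
        d_min = min(math.dist(a, l) for a in atoms for l in ligand_atoms)
        if d_min <= cutoff:
            hits.append((label, round(d_min, 2)))
    return sorted(hits, key=lambda hit: hit[1])


# Invented coordinates (in Angstrom) for illustration only.
toy_ligand = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
toy_residues = {
    "res_A": [(0.0, 3.0, 0.0)],
    "res_B": [(1.5, 0.0, 3.5), (10.0, 10.0, 10.0)],
    "res_C": [(8.0, 0.0, 0.0)],
}

if __name__ == "__main__":
    for label, distance in residues_in_contact(toy_residues, toy_ligand):
        print(label, distance)
```

Running the same selection on each of the four model structures and intersecting the resulting residue sets (after mapping to alignment positions) would yield the common subset of contacts referred to above.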

The structural models of human TMTCs can only be considered preliminary in many details at this stage since

  • important ingredients such as the protein substrate and possibly important interacting partners are missing,

  • sequence identity with the template structure is low (~ 10% in the manually edited alignments used for modelling, Table 3),

  • there are loop extensions not found in the structural template, and

  • the TMTCs are modelled without the C-terminal TPR domain.

The average accuracy of C-alpha atom positioning in homology modelling above 30% sequence identity is estimated at about 2 Å [83, 84]; hence, the error is expected to be higher for certain regions in our model structures, especially in loop regions without equivalent in the template. On the other hand, the known crystal structures (having very moderate crystallographic resolutions around 3 Å) do not represent the complete protein complex either, for example with regard to the correct placement of certain groups of amino acid side chains, some inter-TM loops, and the substrates and ligands needed for catalysis.

Despite these restrictions, we see consistent features emerging from the modelling of various TMTCs, namely the arrangement of TM regions in the membrane as well as of the loops and segments that form the binding site for the lipid-linked sugar and the divalent metal ion; essentially, the major part of the structure located in the ER lumen appears functionally plausible once the conserved sequence segments are brought together spatially by the 3D reconstruction.

Thus, it also makes sense to analyze contacts between the DPM moiety, the metal ion and TMTC residues that are seen in only a few of the TMTC models. In this way, we get a more complete picture of the binding cavity and can enlarge the list of residues potentially relevant for interaction with the ligands:

  (i) We found the aspartate from motif M3 in the vicinity of the mannose in TMTC2 (D141) and TMTC3 (E145). The homologous residue D158 in 5f15 [58] is also seen to interact with the arabinose moiety.

  (ii) K203 in 5f15 [58] forms a salt bridge to the arabinose moiety. A similar close contact to the sugar is seen for the homologous lysine residues in motif M4 of TMTC2 (K186), TMTC3 (K188) and TMTC4 (K221).

  (iii) The motif M7 arginine in TMTC2 (R422) forms a hydrogen bond with the phosphate. This interaction resembles the contact between several homologous arginine residues (R459 in 6s7t [65], R405 in 6s7o [65], R404 in 6ezn [74], R426 in 3waj [75, 79], and R375 in 5ogl [60]) and the phosphates from the respective LLCs in those X-ray 3D structures. Similarly, the M7 tyrosine is observed close to the phosphate in TMTC2 (Y425) and TMTC4 (Y415), as is Y345 in 5f15 [58].

  (iv) Residues E84/K85 in 5ezm [58] interact with the metal ion in the absence of an LLC molecule. We see the homologous residues HK in motif M2 also interacting with a ligand (but with the sugar moiety) in our TMTC models.
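
The partially observed contacts in items (i) to (iii) can be tabulated per model, which makes it easy to see how often each contact recurs. The sketch below uses only the residue labels given in the list above; the dictionary layout, the contact descriptions and the function names are illustrative assumptions. Item (iv) is left out because no model-specific residues are named for it.

```python
MODELS = ("TMTC1", "TMTC2", "TMTC3", "TMTC4")

# Contacts seen in only some models, with residue labels as given in the text.
PARTIAL_CONTACTS = {
    "M3 acidic residue near mannose": {"TMTC2": "D141", "TMTC3": "E145"},
    "M4 lysine near sugar": {"TMTC2": "K186", "TMTC3": "K188", "TMTC4": "K221"},
    "M7 arginine to phosphate": {"TMTC2": "R422"},
    "M7 tyrosine near phosphate": {"TMTC2": "Y425", "TMTC4": "Y415"},
}


def models_per_contact(contacts, models=MODELS):
    """For each contact, list the models in which it was observed."""
    return {name: [m for m in models if m in seen] for name, seen in contacts.items()}


def contacts_per_model(contacts, models=MODELS):
    """For each model, count how many of the partial contacts it shows."""
    return {m: sum(m in seen for seen in contacts.values()) for m in models}


if __name__ == "__main__":
    for name, observed in models_per_contact(PARTIAL_CONTACTS).items():
        print(f"{name}: {', '.join(observed)}")
    print(contacts_per_model(PARTIAL_CONTACTS))
```

Such a tally is only a bookkeeping aid: a contact missing from one model may reflect modelling noise rather than a genuine difference between paralogues.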

Discussion

Despite the wealth of sequence-analytic findings available for TMTCs, the systematic analysis of their sequences and of related biomolecular data for the purpose of assigning the biological function of TMTCs has never been performed before. Several roadblocks had to be overcome. First, there are issues with sequence accuracy as, for some TMTCs, several versions of protein sequences are available in databases, some of which lack sequence pieces essential for TMTC function as this study has revealed. Second, the complex nature [66] of the TM regions sprinkled with polar residues/prolines/glycines makes their accurate prediction in the TMTC sequences difficult. This seriously hampers function discovery since localizing certain loops at the correct side of the membrane might be impossible with errors in membrane topology. Third, just the fact of finding sequence similarity with a large number of sugar transferases is helpful to establish the homology relationship but provides little guidance for biological follow-up work aimed at zooming into the exact molecular and cellular functions of TMTCs, for example with regard to actual catalytic capacity, substrate specificity and ligands bound.

This work has made significant steps forward in understanding the 3D structure and biological function of the membrane-embedded domains covering the N-terminal halves of the TMTC1, TMTC2, TMTC3 and TMTC4 sequences. First, we determined the membrane topology using sequence-analytic, phylogenetic and available experimental data. The assumption of conserved membrane topology for evolutionarily conserved molecular function was key to interpreting TM prediction results for N-TMTCs in a unified manner. The final membrane topology with 11 TMs complies with all known constraints. The C-terminal globular TPR domain is located in the ER lumen together with the conserved sequence motifs in the loops between TM regions that are critical for function. The homologous sequence segments in the known 3D structures 5ezm/5f15 corresponding to the lumenal loops in TMTCs have the same membrane topology. We can further conclude that TMTC sequences in the database that do not fit this topology are most likely erroneous.

Whereas the complex nature of TM regions in TMTCs makes TM prediction difficult, it supports establishing gene homology via searches for significant sequence similarity [66, 70]. The evidence certifying the homology of N-TMTCs with GT-C/PMT-class and other related sugar transferases is overwhelming; thus, TMTCs must have the same overall fold and a similar tertiary structure. Despite the huge evolutionary distance from bacteria to human representatives in this homology group, higher eukaryote TMTCs share strongly conserved sequence motifs with GT-C/PMT-class enzyme sequences. Even at the pure sequence-analytic level, we can explain a few of these conserved sites as required for catalysis or for ligand binding. Given the close relationship with ArnT from Cupriavidus metallidurans (the structure of which is known: 5ezm/5f15), we suggest that these ligands include a divalent metal ion and an LLC molecule. Since TMTCs are part of an O-mannosylation pathway, we conclude that this LLC is DPM.

3D-structural modelling of N-TMTCs further enhances the association of conserved sequence motifs with ligand binding. Seven conserved sequence motifs from various parts of the protein sequence (including those seen already at the level of just sequence comparison) come spatially together to form the surface of binding sites for the mannosyl residue, the phosphate group and the dolichyl tail of DPM as well as the divalent metal ion; thus, their evolutionary conservation can be rationalized as maintaining the ability to position these two ligands for catalysis. Notably, this spatial co-localization of peptide stretches corresponding to the conserved motifs is sufficiently macroscopic to be a reliable result not affected by the accuracy of the homology procedure applied here.

In addition, we derive, as a result of this homology-supported structural modelling, a further expanded list of residues taken from the set of conserved motifs that are potentially interacting with the divalent metal ion and the DPM ligand. This list comprises, as a subset, those critical residues previously found with combined phylogenetic arguments (sequence conservation among TMTCs and similarity with sequences of structurally and functionally characterized sugar transferases). Thus, we can relate certain residues strictly conserved among the TMTC sequences with functions in catalysis and ligand binding. This work also clarified the nature of the DUF1736 sequence segment in TMTCs: it is actually a loop between TM7 and TM8, and the accurate positioning of several of its functional residues is critical for catalysis and binding of ligands, especially the lipid-linked sugar moiety.

Notably, we had already established the homology of TMTCs with GT-C/PMT-class sugar transferases when we first analysed their sequences in 2012; yet, a substrate and biological context assignment as well as 3D structural modelling were not possible then. With HHpred [33], significant sequence similarity with DPM-dependent mannosyltransferases (PMTs, PF02366) was detected. With RPS-BLAST [85, 86], we found the link to ArnT-like arabinose transferases (COG1807). Their respective 3D structures were not known at that time [58].

The density of hints derived from sequence analysis, phylogenetic comparisons, homology studies and structural modelling leaves no doubt that the TMTCs have enzymatic activity and perform sugar moiety transferase functions in their biological context. Thus, the O-mannosyltransferases sought in the recently discovered O-mannosylation pathway, which selectively processes cadherin-like targets and of which the TMTCs were shown to be members via combinations of TMTC knock-outs [26], are actually the TMTCs themselves.

Finding the real substrates of the various human TMTCs and rationalising the function of their glycosylation are important questions from the viewpoint of biological science. Additionally, this topic has a critical medical dimension as several mutations of TMTCs are compatible with survival but severely disable the affected patients in various ways due to the pleiotropic nature of their molecular and cellular functions. Laudably, first steps in this direction have been taken. It can be concluded that various cadherins/proto-cadherins found as substrates for the new O-mannosylation pathway are protein substrates for O-mannosylation by TMTCs [25, 26].

BLAST/PSIBLAST [32] searches reveal that TMTC proteins are present in a wide range of animals but apparently not in fungi and plants (details not shown). Interestingly, essentially full-length homologous sequences (including the sugar transferase followed by TPR segments) are also found in many, typically not yet well characterized prokaryotes, besides hits in lower eukaryotes such as oomycetes and choanoflagellates. One example is protein AMJ42_05695 (from Deltaproteobacteria bacterium DG_8) that is found by a BLAST search with human TMTC3 (24% sequence identity, E-value = 3e-47, alignment of query positions 12–698 against positions 46–774 of the target). Human curiosity will not be satisfied until the diversity of their organic chemistry, the related biomolecular mechanisms and the cellular phenotypes are understood.