Metaproteomic analysis was carried out with the MaxQuant (MQ) software by searching three different databases: Proboscidea, Viridiplantae, and Bacteria/Nematoda. The results were analyzed at both the protein and the peptide level (in the latter case, without considering the proteins from which the peptides derive). In particular, all the “original peptides” identified by MaxQuant were subjected to an additional analysis with the Unipept search engine, which uses the UniProt database, a version of the NCBI taxonomy, and a lowest common ancestor (LCA) algorithm (see Material and Methods section), to obtain a global view of the taxonomic distribution of all peptides.
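The LCA step can be illustrated with a minimal Python sketch. It is not Unipept's implementation: the lineages are simplified, hand-written placeholders rather than NCBI taxonomy records, and the function name `lowest_common_ancestor` is hypothetical. The idea is that a peptide found in several organisms is assigned to the deepest taxon shared by all of their lineages.

```python
def lowest_common_ancestor(lineages):
    """Return the deepest taxon shared by all lineages (each a root-to-leaf list)."""
    if not lineages:
        return None
    lca = None
    # Walk down the ranks in parallel and stop at the first disagreement.
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            lca = ranks[0]
        else:
            break
    return lca


# Simplified, illustrative lineages (assumption: not real NCBI records).
AFRICAN_ELEPHANT = ["Eukaryota", "Mammalia", "Afrotheria", "Elephantidae", "Loxodonta africana"]
ASIAN_ELEPHANT = ["Eukaryota", "Mammalia", "Afrotheria", "Elephantidae", "Elephas maximus"]
HUMAN = ["Eukaryota", "Mammalia", "Euarchontoglires", "Hominidae", "Homo sapiens"]

print(lowest_common_ancestor([AFRICAN_ELEPHANT, ASIAN_ELEPHANT]))
print(lowest_common_ancestor([AFRICAN_ELEPHANT, HUMAN]))
```

Under these placeholder lineages, a peptide shared by the two elephants resolves to the family level, whereas a peptide also found in human can only be assigned to Mammalia.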
Proteins related to Mammoth
By searching the Proboscidea database, 35 and 14 proteins with at least two peptides were identified in the “trunk” and “trunk tip” samples, respectively. Four proteins were identified in both samples. To validate the species identifications, all the peptides that supported the assignment of these proteins were subjected to a BLASTp sequence search. On this basis, the identified proteins were classified into three groups: (i) 27 proteins that may be specifically related to mammoth; these proteins were authenticated by at least one peptide (peptide marker) related only to Elephantidae species (Loxodonta africana, Mammut americanum or, more generally, Elephantidae); (ii) 14 proteins with peptides related to several taxa (Mammalia, Eutheria, Afrotheria, and not specific), but not to Homo sapiens; and finally (iii) 8 proteins related to Mammalia or not specific, but also matching human. The list of proteins is reported in Table 1 (the complete list of proteins and peptides is reported in Supplementary Tables S1 and S2). As expected, most of these proteins, such as collagens, plakophilin, desmoglein, and junction plakoglobin, are skin-related proteins. We also identified three keratins (i.e., keratin 32, keratin 35, and the IF rod domain-containing protein; see Supplementary Table S7) that were characterized by peptide sequence stretches shared between the Loxodonta africana and human keratins. These proteins might therefore also be related to the mammoth and be considered potentially endogenous components of the sample. Nevertheless, no L. africana-specific peptide belonging to these keratins was detected. Moreover, the deamidation level of the keratin-related peptides showed a profile similar to that of the other peptides derived from protein contaminants. Therefore, since it was not possible to establish the exact origin of these peptides, the above-mentioned keratins were, as a precaution, considered contaminants.
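The three-group classification can be sketched as a small decision rule. This is one reading of the criteria given in the text, not the authors' actual procedure: the function `classify_protein`, the input format (one taxon label and one human-match flag per peptide), and the order in which the rules are applied are all assumptions made for illustration.

```python
# Taxon labels treated as marker-level assignments in group (i), as listed in the text.
MARKER_TAXA = {"Loxodonta africana", "Mammut americanum", "Elephantidae"}


def classify_protein(peptide_hits):
    """Assign a protein to group 'i', 'ii' or 'iii'.

    peptide_hits: list of (taxon_label, matches_human) tuples, one per peptide,
    as they might be summarised from a BLASTp search (hypothetical format).
    """
    # Group (i): at least one marker peptide restricted to the listed taxa.
    if any(taxon in MARKER_TAXA for taxon, _ in peptide_hits):
        return "i"
    # Group (iii): only broad assignments, and at least one peptide also matches human.
    if any(matches_human for _, matches_human in peptide_hits):
        return "iii"
    # Group (ii): broad assignments that do not include Homo sapiens.
    return "ii"


print(classify_protein([("Loxodonta africana", False), ("Mammalia", True)]))
print(classify_protein([("Afrotheria", False), ("Eutheria", False)]))
print(classify_protein([("Mammalia", True)]))
```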
Proteins related to Viridiplantae, Bacteria, and Nematoda
MS data were also searched separately against the Viridiplantae and the Bacteria/Nematoda databases. Searching the Viridiplantae database identified a total of eight proteins with at least two peptides (Table 2; the complete lists of proteins and peptides are reported in Supplementary Tables S3 and S4). Four proteins were identified in the trunk sample, whereas the others were identified in the trunk tip sample. Species validation, carried out as reported above, showed that five proteins presented diagnostic peptides of Mesangiospermae, whereas the other three showed marker peptides of Brassicaceae, Arabidopsis thaliana, and Amborella trichopoda, respectively.
By the same approach, the search of the database including only Bacteria and Nematoda allowed the identification of only one protein with at least two peptides (i.e., cytidylate kinase), which was specific to Mesoplasma florum, a bacterium isolated from plants and insects (Table 2; the complete lists of proteins and peptides are reported in Supplementary Tables S5 and S6).
Peptides related to mammoth
To obtain a comprehensive view of the taxonomic distribution of all peptides identified by MaxQuant, a Unipept analysis was performed. In the “trunk” sample, Unipept analysis of the 774 peptides identified by searching the Proboscidea database allowed the classification of 651 sequences that were specific to the domain Eukaryota. Among these sequences, 190 were specific to the family Elephantidae, in particular to Loxodonta africana, and only one sequence was specific to Mammut americanum (Fig. 3a).
In the “trunk tip” sample, Unipept analysis of the 479 peptides allowed the classification of 409 sequences that were specific to the domain Eukaryota. Of these, 158 sequences were specific to the superorder Afrotheria, including 135 sequences that were specific to the family Elephantidae (Fig. 3b), and in particular to Loxodonta africana.
Peptides related to Viridiplantae
In the “trunk” sample, among the 496 peptides identified by searching the Viridiplantae database, 396 sequences were classified by Unipept analysis as specific to the clade Viridiplantae. Figure 4a shows the tree-graph results of the Unipept analysis. Most of the peptides (386 sequences, 97%) were related to the clade Streptophyta, whereas about 2% (9 peptides) were related to the clade Chlorophyta. Moreover, this approach revealed that, among the Streptophyta-related sequences, 93% (360 sequences) belong to the class Magnoliopsida, and that about half of these (185 sequences, 51%) were specific to the Brassicaceae family. Among Magnoliopsida, the clade Petrosaviidae was well represented (64 sequences), and the orders Fabales and Solanales (11 sequences) were also identified.
In the “trunk tip” sample, among the 361 peptides identified by searching the Viridiplantae database, 295 sequences were classified by Unipept analysis as specific to the clade Viridiplantae. The tree-graph results of the Unipept analysis, reported in Fig. 4b, show the same distribution as that of the peptides classified in the “trunk” sample.
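Because the percentages quoted for the “trunk” sample refer to different denominators (Viridiplantae, then Streptophyta, then Magnoliopsida), a short Python check makes the nesting explicit. The counts are those reported in the text; the dictionary layout and the helper `percent` are illustrative only.

```python
# Peptide counts reported for the "trunk" sample (Viridiplantae search).
counts = {
    "Viridiplantae": 396,
    "Streptophyta": 386,
    "Chlorophyta": 9,
    "Magnoliopsida": 360,
    "Brassicaceae": 185,
}


def percent(part, whole):
    """Share of `part` within `whole`, rounded to the nearest integer percent."""
    return round(100 * counts[part] / counts[whole])


# Each pair is (taxon, parent taxon used as the denominator).
for part, whole in [
    ("Streptophyta", "Viridiplantae"),
    ("Chlorophyta", "Viridiplantae"),
    ("Magnoliopsida", "Streptophyta"),
    ("Brassicaceae", "Magnoliopsida"),
]:
    print(f"{part} within {whole}: {percent(part, whole)}%")
```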
Peptides related to Bacteria and Nematoda
In the “trunk” sample, among the peptides identified by searching the Bacteria/Nematoda database, 359 were specific to Bacteria, whereas 26 sequences were specific to the phylum Nematoda. All the peptides related to Nematoda were specific to Rhabditida (Fig. 5a), an order of phytoparasitic and zooparasitic microbivorous nematodes living in soil, and most of them were specific to the genus Caenorhabditis. Only one sequence was assigned to the infraorder Spirurina. In Bacteria (Fig. 6a), two main phyla were identified: Proteobacteria (42%, corresponding to 151 peptides) and Firmicutes (23%, corresponding to 84 peptides). The remaining peptides were related to other phyla, including Actinobacteria (5%; 17 peptides), Bacteroidetes (4%; 13 peptides), Tenericutes (3%; 12 peptides), Cyanobacteria (3%; 11 peptides), and other bacterial phyla represented by a lower number of peptides.
In the “trunk tip” sample, 319 peptides were specific to Bacteria and 25 were specific to the phylum Nematoda. The tree-graph results of the Unipept analysis, reported in Fig. 5b (for Nematoda) and Fig. 6b (for Bacteria), show a distribution similar to that of the peptides classified in the “trunk” sample.
Level of deamidation and other chemical modifications
To discriminate the original endogenous components present in the investigated samples from components that are instead probably contaminants related to the post-excavation history of the sample, we calculated the deamidation level. It is known that asparagine and glutamine residues naturally deamidate over time (Robinson et al. 2001, 2002; Schroeter et al. 2016). Although different factors, such as temperature, pH, and the inherent properties of proteins, may affect the extent of deamidation, it has been observed that the level of this modification is generally higher in ancient molecules than in modern ones. Therefore, the deamidation levels of asparagine and, mainly, of glutamine residues may be used as biomolecular indicators of the deterioration and natural aging of proteins in archaeological and paleontological materials.
The deamidation levels of asparagine and glutamine residues were calculated for all three types of peptides classified as “original”, i.e., peptides related to Mammoth, Viridiplantae, and Bacteria/Nematoda. These results were compared with the deamidation level of the peptides classified as “contaminants” because they belong to proteins of the c-RAP database (Fig. 7). Figure 7 shows that the deamidation level of the “original peptides” ranges from 35 to 63% in both samples investigated. In contrast, “contaminant peptides” present a deamidation level that is always below 10%. Moreover, since other spontaneous, non-enzymatic modifications may also indicate protein damage due to exposure to light or oxidative environmental factors (Pattison 2012; Davies 2016), the level of oxidation products at tryptophan, tyrosine, cysteine, and methionine residues was calculated for both the “original” and “contaminant” peptides (Stadtman et al. 2005; Pattison et al. 2012; Mikšík et al. 2016; Cannizzo et al. 2012). Interestingly, the comparison of the level of oxidation products in original and contaminant peptides confirms the trend already observed for deamidation (see Supplementary Figure S1).
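As an illustration of what a deamidation level expresses, the following Python sketch computes the percentage of asparagine or glutamine sites observed in the deamidated form. It assumes a simple site-counting definition; the formula actually used for Fig. 7 is given in the Material and Methods section and may differ (for example, by weighting sites by signal intensity). The function name, the input format, and the toy peptides are all hypothetical and are not data from this study.

```python
def deamidation_level(peptides, residue):
    """Percentage of `residue` sites ('N' or 'Q') that carry a deamidation.

    peptides: list of (sequence, modified_positions) tuples, where
    modified_positions is a set of 0-based indices of deamidated residues.
    """
    total_sites = 0
    deamidated_sites = 0
    for sequence, modified_positions in peptides:
        sites = [i for i, aa in enumerate(sequence) if aa == residue]
        total_sites += len(sites)
        deamidated_sites += sum(1 for i in sites if i in modified_positions)
    if total_sites == 0:
        return float("nan")  # no site of this type was observed
    return 100.0 * deamidated_sites / total_sites


# Toy input, invented for illustration only.
toy_peptides = [
    ("GNDGAK", {1}),     # N at index 1 is deamidated
    ("GQNGPR", set()),   # one Q and one N, both unmodified
    ("AQGNK", {1}),      # Q at index 1 is deamidated, N is not
]

print(deamidation_level(toy_peptides, "N"))
print(deamidation_level(toy_peptides, "Q"))
```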
Characterization of the alpha-1 type I collagen sequence
The dedicated search of the MS data against all the publicly available UniProt entries of col1a1, performed with the PEAKS X and MaxQuant software, allowed the identification of a total of 40 tryptic fragments (see Supplementary Table S8). These tryptic sequences are mainly related to the unreviewed col1a1 entry (UniProt Accession No. G3SSE0; length: 1058 amino acid residues in the mature protein (Exposito et al. 2002; Buckley et al. 2011)) of the modern African elephant (Loxodonta africana), which was therefore chosen as the reference sequence. Overall, the MS data allowed the confident characterization of about 65% of the Mammuthus primigenius col1a1 primary structure (Fig. 8). Most identified sequences were identical to the reference from L. africana, although some differences were also observed. In particular, de novo interpretation of the MS/MS spectrum of the triply charged ion at m/z 850.4013 (Fig. 9) allowed the deduction of the sequence GNDGATGAAGPPGPTGPAGPPGFPGAVGAK, which corresponds to the tryptic fragment G162NDGATGAAGPPV174SPTGPAGPPGFPGAVGAK192 of the unreviewed col1a1 entry (UniProt Accession No. G3SSE0) of L. africana, but with the deletion of the valine residue at position 174 and the substitution of the serine at position 175 with a glycine residue.
This result was supported by many MS/MS scans (see Supplementary Table S8). It is important to highlight that the differences observed at positions 174 and 175 are most likely due to an error in the unreviewed entry G3SSE0_LOXAF. Indeed, it is well known that collagen molecules are made up of three alpha chains that form a triple-helical structure (Exposito et al. 2010). At the primary structure level, the sequence of the alpha chains consists of repeating Gly-Xaa-Yaa triplets, called the collagenous domain or triple helix motif. The triple helix is stabilized by the presence of glycine as every third residue, a high content of proline and hydroxyproline, interchain hydrogen bonds, and electrostatic interactions (Persikov et al. 2005) involving lysine and aspartate (Fallas et al. 2009). The presence of the VS stretch in the G3SSE0_LOXAF entry would interrupt the repetition of Gly-Xaa-Yaa triplets, impairing the collagen triple helix; it therefore represents a very unlikely mutation in collagen. Moreover, a glycine residue is reported at the same position in another annotated version of the sequence of the collagen alpha-1(I) chain of Loxodonta africana (UPI0005406912 entry; NCBI Reference Sequence: XP_010592644.1).
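The triplet argument can be checked mechanically. The Python sketch below lists every position at which a glycine would be expected, given the Gly-Xaa-Yaa repeat, but another residue is found. It assumes that the fragment starts on the glycine of a triplet, which holds for the two sequences quoted above; the helper name `glycine_frame_breaks` is hypothetical.

```python
def glycine_frame_breaks(fragment, first_position):
    """Return (position, residue) wherever a Gly is expected but not found.

    Assumes the first residue of `fragment` is the Gly of a Gly-Xaa-Yaa triplet,
    so a glycine is expected at every third residue from there on.
    """
    return [
        (first_position + i, aa)
        for i, aa in enumerate(fragment)
        if i % 3 == 0 and aa != "G"
    ]


# Residues 162-192 as quoted from the unreviewed entry G3SSE0 (L. africana).
REFERENCE = "GNDGATGAAGPPVSPTGPAGPPGFPGAVGAK"
# Sequence deduced de novo from the triply charged ion at m/z 850.4013.
OBSERVED = "GNDGATGAAGPPGPTGPAGPPGFPGAVGAK"

print(glycine_frame_breaks(REFERENCE, 162))
print(glycine_frame_breaks(OBSERVED, 162))
```

For the database entry, the periodicity is lost from position 174 onward, whereas the experimentally deduced sequence keeps a glycine at every third residue.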
Moreover, a glycine residue at position 175 was previously detected by Buckley et al. (2011; see Fig. 8), who characterized the col1a1 sequences extracted from the bone powder of a woolly mammoth dredged from the North Sea, a modern African elephant (Loxodonta africana), and an Asian elephant (Elephas maximus); this residue is also typical of the col1a1 sequence of M. americanum (SwissProt Accession No. P0C2W8). An additional error in the unreviewed entry G3SSE0_LOXAF was detected. The tryptic fragment T393GPPGPAG400QDGRPGPPGPPGAR414 (Table S8, Fig. 8) shows, as expected, a glycine residue at position 400, which is instead lacking in the G3SSE0_LOXAF entry. Again, the absence of the glycine residue at this position would interrupt the repetition of Gly-Xaa-Yaa triplets, impairing the collagen triple helix; it therefore most likely represents a mistake in the entry G3SSE0_LOXAF. De novo interpretation of the MS/MS spectra of the triply charged ions at m/z 738.6978 and 828.8729 allowed the identification of two amino acid substitutions in the M. primigenius col1a1 investigated here with respect to its L. africana counterpart. In detail, our experimental MS/MS data allowed the deduction of the sequences VGPPGPSGNAGPPGPPGPAGKEGAK and GPPGSAGTPGKDGLNGLPGPIGPPGPR, respectively (Figs. 10 and 11).
The first one corresponds to the sequence V723GPPGPSGNAGPPGPPGPAGKEGGK747, but shows the substitution of the glycine at position 746 with an alanine residue. The second one instead corresponds to the tryptic fragment G982PPGSAGAPGKDGLNGLPGPIGPPGPR1008, carrying the amino acid substitution Ala989 → Thr. It is important to highlight that the comparison of the col1a1 sequences from different Proboscidea species (Fig. 8) reveals that the presence of an alanine residue at position 746 and of a threonine at position 989 appears unique to the col1a1 sequence of the M. primigenius investigated here and, therefore, could be used to reliably distinguish Mammuthus primigenius from the other two genera of elephantids (i.e., Elephas and Loxodonta) and from the extinct American mastodon (i.e., Mammut americanum).
Furthermore, the MS data obtained here allowed the detection of a proline residue at position 701, which appears characteristic of Elephantidae and could be used as a marker to distinguish them from M. americanum, which instead has an alanine (Fig. 8). Finally, our MS data show that the primary structure of the M. primigenius col1a1 sequence investigated here has a glycine and an alanine at positions 795 and 857, respectively (Fig. 8). These two residues are shared with the unreviewed col1a1 entry of L. africana and with the reviewed entry of M. americanum. In contrast, the Elephantidae (i.e., M. primigenius, L. africana and E. maximus) col1a1 sequences reported by Buckley et al. (2011) show an alanine at position 795 and a serine at position 857.
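The two point substitutions can be read directly from a position-by-position comparison of each reference fragment with the sequence deduced de novo. The sketch below is a minimal illustration for fragments of equal length; the helper name `substitutions` and the `G746A`-style notation are choices made here, not taken from the text.

```python
def substitutions(reference, observed, first_position):
    """List point substitutions as '<ref><position><obs>' for equal-length fragments."""
    if len(reference) != len(observed):
        raise ValueError("fragments must have the same length (no insertions/deletions)")
    return [
        f"{ref}{first_position + i}{obs}"
        for i, (ref, obs) in enumerate(zip(reference, observed))
        if ref != obs
    ]


# L. africana reference fragments (G3SSE0 numbering) and the M. primigenius
# sequences deduced from the MS/MS spectra, as quoted in the text.
print(substitutions("VGPPGPSGNAGPPGPPGPAGKEGGK", "VGPPGPSGNAGPPGPPGPAGKEGAK", 723))
print(substitutions("GPPGSAGAPGKDGLNGLPGPIGPPGPR", "GPPGSAGTPGKDGLNGLPGPIGPPGPR", 982))
```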
Characterization of the alpha-2 type I collagen sequence
The dedicated search of the MS data against all the publicly available UniProt entries of alpha-2 type I collagen (col1a2), performed with the PEAKS X and MaxQuant software, allowed the identification of a total of 31 tryptic fragments (see Supplementary Table S9). These tryptic sequences are mainly related to the unreviewed col1a2 entry (UniProt Accession No. G3TIC0; length: 1040 amino acid residues in the mature protein) of the modern African elephant (Loxodonta africana), which was thus selected as the reference sequence. Overall, the MS data allowed the confident characterization of about 50% of the Mammuthus primigenius col1a2 primary structure (Fig. 12). The characterized sequence was identical to the reference from L. africana, except for a single amino acid difference. In particular, de novo interpretation of the MS/MS spectrum of the doubly charged ion at m/z 784.8686 (Fig. 13) allowed the deduction of the sequence GSDGEAGSAGPAGPPGLR, which corresponds to the tryptic fragment G303SS305GEAGSAGPAGPPGLR320 of L. africana, but with the substitution of the serine residue at position 305 with an aspartic acid. This substitution is supported by the MS/MS spectrum of the triply charged ion at m/z 575.6152, corresponding to the sequence RGSDGEAGSAGPAGPPGLR (see Supplementary Table S9). However, although the corresponding sequence stretch with an asparagine at position 305 was not detected, the presence of a deamidated asparagine instead of the aspartic acid at this position cannot be excluded. It is interesting to note that the col1a2 sequences of Elephantidae (i.e., M. primigenius, L. africana, and E. maximus) and of M. americanum, as reported by Buckley et al. (2011), show a serine residue at position 305. Thus, the presence of a different amino acid residue at this position could be used as a marker to distinguish the Siberian Mammuthus primigenius from the other Proboscidea species.
Furthermore, as previously suggested by Buckley et al. (2011), the MS data obtained here allowed the detection of a threonine residue at position 781, which seems peculiar to Elephantidae and could be used as a marker to distinguish them from M. americanum, which instead has a serine (Fig. 12).
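The m/z values quoted for the col1a2 peptide carrying the Ser305 → Asp difference can be related to its sequence with a short monoisotopic mass calculation. The Python sketch below uses standard monoisotopic residue masses. The number of hydroxylations (+15.9949 Da each, for example on proline) is an assumption of this sketch, since the modification state of the peptide is not stated in this section; the function name `peptide_mz` is likewise hypothetical.

```python
# Standard monoisotopic residue masses (Da).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276
HYDROXYLATION = 15.994915  # one added oxygen atom


def peptide_mz(sequence, charge, n_hydroxylations=0):
    """Monoisotopic m/z of a protonated peptide with optional hydroxylations."""
    neutral_mass = sum(RESIDUE_MASS[aa] for aa in sequence) + WATER
    neutral_mass += n_hydroxylations * HYDROXYLATION
    return (neutral_mass + charge * PROTON) / charge


# Assumption: one hydroxylation on each peptide (not stated in the text).
print(round(peptide_mz("GSDGEAGSAGPAGPPGLR", 2, n_hydroxylations=1), 4))
print(round(peptide_mz("RGSDGEAGSAGPAGPPGLR", 3, n_hydroxylations=1), 4))
```

Under the assumption of a single hydroxylation, the calculated values fall within 0.01 of the reported m/z 784.8686 and 575.6152; without it, they are lower by about 8.00 and 5.33 m/z units, respectively.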