Background

The Sterile Alpha Motif Domain-containing 9 (SAMD9) gene is located on chromosome 7q21.2 of the human genome, adjacent to its close paralogue, SAMD9-like (SAMD9L), in a head-to-tail orientation [1, 2] and separated from it by approximately 12 kb. The physiological functions of both SAMD9 and SAMD9L remain poorly understood, but the importance of human SAMD9 was recently emphasized by the discovery of the genetic cause of a rare life-threatening human disease, normophosphatemic familial tumoral calcinosis (NFTC) [3, 4]. Patients with NFTC exhibit normal calcium and phosphate metabolism while developing calcified tumorous nodules at their extremities, accompanied by severe gingivitis. Two independent founder genetic events leading to deleterious mutations in SAMD9 are responsible for this autosomal recessive disease [3, 4]. Interestingly, these patients and their kindred are from a culturally isolated ethnic group, namely Jewish-Yemenite, suggesting a potential selection pressure associated with this population [3, 4]. In addition to NFTC, misregulated human SAMD9 expression has also been associated with aggressive fibromatosis and with breast and colon cancers [1].

Human SAMD9 expression can be upregulated by tumor necrosis factor (TNF) [4] or by type I [5] and type II interferons (IFNs) [6], and it is classified as an interferon-stimulated gene (ISG). Recently, an interferon regulatory factor (IRF-1) binding element was identified in the promoter region of the SAMD9 gene in humans [6], and overexpression of IRF-1 can lead to elevated SAMD9 gene expression [7]. All these observations suggest a key role of SAMD9 as a signalling hub in response to innate immune stimulations. Most importantly, human SAMD9 also has very recently been shown to possess anti-viral properties in cultured cells [8, 9] emphasizing its crucial role in host defence against viral pathogens.

On the other hand, the human SAMD9L gene was shown to exhibit lower expression levels in breast cancer tissue than in normal breast tissue from the same patient [1]. It was also identified to be an inducible gene for type I IFNs (IFNα and β), and in activated human T cells the function of SAMD9L is correlated with its IFN-induced inhibitory effects on cell migration [10]. The murine SAMD9L gene expression was also found to be upregulated by calcitonin [11], suggesting a potential involvement in calcium homeostasis as well.

Lastly, the human SAMD9 and SAMD9L genes were both classified as myeloid tumor suppressors, as they are localized within a microdeletion cluster associated with myeloid disorders, such as juvenile myelomonocytic leukemia (JMML), acute myeloid leukemia (AML), and myelodysplastic syndrome (MDS) [2]. In another study investigating altered immune responses in patients with metastatic melanoma, both SAMD9 and SAMD9L expression were shown to be significantly reduced in T and B cell populations when compared with those from healthy control individuals [12]. It has been suggested that since these two proteins exhibit considerable sequence similarity, they may function redundantly or in related pathways, but it should be noted that patients with NFTC possess mutations only in SAMD9 and thus it is likely that the two proteins perform non-identical tasks in humans.

Evolutionarily, the orthologous genes for both SAMD9 and SAMD9L are highly conserved in many mammalian genomes, such as rat, primates and rabbit, but not in chicken, frog and fish species, or insects [1]. This suggests that the origin of these two related genes, possibly from an ancestral duplication event, occurred at some point after branching of the mammalian species. In addition, one intriguing fact is that the house mouse genome (Mus musculus, Mumu) has lost the SAMD9 gene while maintaining SAMD9L, after an evolutionary chromosome breakage event [1].

The absence of SAMD9 from the house mouse (Mumu) genome led us to question if it was a unique event restricted to this taxon and stimulated the study of SAMD9 and SAMD9L evolution and divergence in different mammalian genomes. We have examined the evolutionary history and phylogeny of SAMD9 and SAMD9L, using all the available and complete mammalian genomic sequences of both genes in NCBI and Ensembl databases, in order to obtain a broader understanding of the origin of these two genes. Our deduced phylogenetic tree suggests that SAMD9 and SAMD9L indeed resulted from an ancestral gene duplication event that occurred after the divergence of Marsupialia from Placentalia. At the same time, we applied six different Maximum Likelihood (ML) methods to test for potential positive selective pressures exerted at the gene level, and we also looked for evidence of positive selection at the deduced protein level. The analyses revealed that SAMD9 and SAMD9L, at both the genome and deduced protein sequence levels, were under the effects of what appears to be sustained positive selective pressures. Our results suggest that these two proteins have been selected by long term environmental pressures, such as those exerted by pathogen responses that are under the control of innate immune regulators like the type I interferons.

Results

SAMD9 and SAMD9L genes prevalence in mammals

All the available and complete mammalian SAMD9 and SAMD9L genes coding sequences in the NCBI and Ensembl databases were collected, resulting in a total of fifteen SAMD9 and nineteen SAMD9L genomic sequences of different species indicated in Table 1. The species collected for SAMD9 genes fit into seven Eutheria orders, commonly designated as placental mammals, while the taxa collected for SAMD9L genes fit into eight placental orders. The grey short-tailed opossum, a representative of the order Didelphimorphia traditionally included in Marsupialia (pouch mammals), was the only marsupial genome to possess a complete SAMD9L sequence.

Table 1 Mammalian SAMD9 and SAMD9L genes accession numbers from species used in phylogenetic and selection analyses

Besides the complete SAMD9 and SAMD9L coding sequences, several other non-complete SAMD9 and SAMD9L mammalian genes, including full length mRNA-derived transcripts with many still-undetermined nucleotides (for example, the large flying fox or the west European hedgehog SAMD9 coding sequences, or the American pika SAMD9L sequence) or partial gene sequences (for example, the Ord’s kangaroo rat SAMD9L or the Hoffmann’s two-toed sloth SAMD9 genes), have been already identified and annotated in Ensembl database. However, these incomplete sequences were not used in the phylogenetic and selection analyses performed in this study. Both the complete and the non-complete SAMD9 and SAMD9L genes annotated in Ensembl are represented in Figure 1, allowing a broader view into this gene family distribution within the mammalian context.

Figure 1

Ensembl annotation of mammalian SAMD9 and SAMD9L genes, and neighboring genes. Ensembl annotations of the available mammalian SAMD9 and SAMD9L genes, both complete and incomplete coding region sequences, are represented. Complete sequences for both genes are highlighted (dark yellow for SAMD9 and light yellow for SAMD9L). Other SAMD9 and SAMD9L genes are already annotated in Ensembl but were still incomplete at the time of these analyses; they correspond to the non-highlighted locations in the figure and were excluded from subsequent analyses. Ensembl species by order of appearance: Modo - Opossum; Ecte - Lesser hedgehog tenrec; Bota - Cow; Susc - Pig; Vipa - Alpaca; Aime - Panda; Calu - Dog; Feca - Cat; Dano - Armadillo; Mylu - Microbat; Ptva - Megabat; Ereu - Hedgehog; Prca - Hyrax; Ocpr - Pika; Orcu - Rabbit; Eqca - Horse; Chho - Sloth; Caja - Marmoset; Gogo - Gorilla; Hosa - Human; Mamu - Macaque; Nole - Gibbon; Otga - Bushbaby; Patr - Chimpanzee; Poab - Orangutan; Tasy - Tarsier; Loaf - Elephant; Capo - Guinea pig; Dior - Kangaroo rat; Ictr - Squirrel; Mumu - Mouse; Rano - Rat; Soar - Shrew. To access the complete species name, the list of abbreviations should be consulted. Based on human chromosome 7 mapping, SAMD9 and SAMD9L neighboring genes were identified and represented (CDK6, HEPACAM2, CCDC132 and CALCR). Using the Ensembl database, the same search was performed for the remaining species and the identified genes are represented under the forward strand arrow. The represented genes are, in most cases, the immediate neighboring genes, while for a small number of species some other genes are located in the same region but were excluded for this purpose. “<” symbol: gene located on the reverse strand; “>” symbol: gene located on the forward strand.

Special reference has to be made to two particular complete sequences that were included in our evolutionary analyses: the northern white-cheeked gibbon (Nole) SAMD9 and the domestic dog (Calu) SAMD9L. The northern white-cheeked gibbon has no SAMD9 gene currently annotated in Ensembl. However, by comparing SAMD9 sequences of other primates to the gibbon genome in Ensembl using BLAST analysis, we obtained a perfect match with a neighboring designated pseudogene of SAMD9L. Despite this biotype classification, we could not exclude this SAMD9 sequence from being considered as a bona fide gibbon SAMD9 gene. Regarding the domestic dog SAMD9L, this gene is present in NCBI and is annotated in Ensembl, but in this latter database the sequence was missing seventy-four nucleotides when compared to the sequence in NCBI. Thus, for the subsequent analyses we used only the sequence from NCBI. It should also be noted that, despite not being annotated in Ensembl, an incomplete SAMD9 sequence for the domestic dog is available in NCBI. However, when the NCBI sequence (XM_003639470.1) was analyzed by BLAST, it possessed 99 to 100% identity with a non-annotated region of chromosome 14. Since it is a non-complete nucleotide sequence, it was not used further for the study reported here.

When SAMD9 and SAMD9L were mapped on human chromosome 7, orthologous counterparts of both genes were identified in the chimpanzee (Patr), dog (Calu) and rat (Rano), but in the house mouse (Mumu) genome there was only a single genetic correspondence, to the SAMD9L open reading frame on chromosome 6 [1]. What is currently available in the Ensembl database confirms the absence of SAMD9 in the house mouse (Mumu). We checked the other available rodents to confirm the presence or absence of SAMD9 in this specific lineage. In Ensembl there is a single SAMD9 annotation for the thirteen-lined ground squirrel (Ictr). In addition, what appear to be intact SAMD9 genes have been deposited in the NCBI database for the brown rat (Rano), the Chinese hamster (Crgr) and the domestic Guinea pig (Capo). On the other hand, like the house mouse (Mumu), the Ord’s kangaroo rat (Dior) does not have a SAMD9 gene annotated in the Ensembl database.

Complete mammalian SAMD9 and SAMD9L gene sequences: recombination and phylogenetic analyses

The complete nucleotide coding sequences from SAMD9 and SAMD9L were aligned together (SAMD9 + SAMD9L) and translated into deduced protein sequences (Additional file 1: Figure S1). Before further phylogenetic analyses, we used the software GARD [13, 14] to look for any evidence of recombination in the alignment. Three breakpoints were identified, but only one was strongly supported by the Kishino-Hasegawa (KH) test (Additional file 2: Table S1), which would ordinarily call for the estimation of a separate phylogenetic tree for each segment. However, since the breakpoint was located at nucleotide 4755, the segment to the right of the breakpoint was composed of only 150 nucleotides.

A Maximum Likelihood (ML) tree was estimated for the smaller segment (not shown), but its nodes were weakly supported, with low bootstrap values. Therefore, only the large segment with 4755 nucleotides was used to reconstruct a ML phylogenetic tree under the GTR+I+G nucleotide substitution model. The resulting tree is represented in Figure 2. Another ML phylogenetic tree was estimated from the full alignment, without accounting for recombination, to compare the tree topologies. The model used was again GTR+I+G, and the resulting tree (Additional file 3: Figure S2) had an overall topology similar to that obtained for the segment containing 4755 nucleotides.
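The partitioning step described above can be sketched in a few lines. This is a minimal illustration, not the GARD procedure itself: the function name `split_at_breakpoint` and the toy sequences are hypothetical, and it assumes the breakpoint position counts the number of alignment columns assigned to the left segment (consistent with the left segment having 4755 nucleotides).

```python
def split_at_breakpoint(alignment, breakpoint):
    """Split every aligned sequence into a left and a right segment.

    `breakpoint` is the number of alignment columns kept in the left
    segment (assumption: columns 1..breakpoint go left, the rest right).
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences in an alignment must have equal length")
    left = {name: seq[:breakpoint] for name, seq in alignment.items()}
    right = {name: seq[breakpoint:] for name, seq in alignment.items()}
    return left, right


# Toy alignment (hypothetical sequences, not the study data).
toy = {
    "Hosa": "ATGGCCAAATGA",
    "Patr": "ATGGCTAAATGA",
    "Modo": "ATGTCCAAGTGA",
}
left, right = split_at_breakpoint(toy, 9)
```

Applied to an alignment of 4905 columns with a breakpoint of 4755, the same function would return a right segment of 150 columns, which is why that segment carried too little signal to resolve the tree.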

Figure 2

Mammalian SAMD9 and SAMD9L genes estimated Maximum Likelihood tree. For the mammalian SAMD9 and SAMD9L genes alignment and after GARD analysis [13, 14], a significant recombination breakpoint was detected (nucleotide position 4755) defining a left and a right segment. A phylogenetic tree was estimated for each segment using the Maximum Likelihood (ML) method. However, the resulting tree from the right segment presented weakly supported nodes and was discarded. On the other hand, the left segment with 4755 nucleotides was used to reconstruct a ML phylogenetic tree under the GTR+I+G nucleotide substitution model. The analyses were performed with 1,000,000 generations and 1,000 bootstrap searches. The bootstrap values are indicated on the branches. The abbreviations correspond to the following species common names: Aime - Giant panda; Bota - Cow; Caja - Common marmoset; Calu - Domestic dog; Capo - Domestic Guinea pig; Crgr - Chinese hamster; Eqca - Horse; Ereu - West European hedgehog; Gogo - Western gorilla; Hosa - Human; Loaf - African bush elephant; Mamu - Rhesus monkey; Modo - Grey short-tailed opossum; Mumu - House mouse; Mylu - Little brown myotis; Nole - Northern white-cheeked gibbon; Orcu - European rabbit; Patr - Common chimpanzee; Poab - Sumatran orangutan; Rano - Brown rat; Soar - Common shrew; Susc - Pig. To access the species scientific names, the list of abbreviations should be consulted.

In the estimated ML phylogenetic tree (Figure 2), SAMD9 and SAMD9L formed two well defined monophyletic groups, and within each clade we observed a concordant topology with the accepted evolutionary relationships of eutherian mammals [15] (Additional file 4: Figure S3). Interestingly, the marsupial grey short-tailed opossum (Modo) SAMD9L represented a highly divergent outgroup, even from the remaining SAMD9L species.

A gene duplication event after the split of marsupial and placental mammals originated the SAMD9/SAMD9L gene family

It has been previously suggested that SAMD9 and its paralogue SAMD9L may have originated from a common ancestor by a gene duplication event [1]. In our study, the topology of the ML tree (Figure 2) supports this view. However, the opossum (Modo) gene annotated as SAMD9L in the NCBI database (XM_001378475.1) does not cluster within the placental mammal SAMD9L group. In fact, the opossum sequence occupies a basal position. Two highly supported eutherian monophyletic clades were observed in the ML tree, one corresponding to all SAMD9 genes and the other to all SAMD9L genes. The most likely evolutionary scenario can be described as follows: an ancestral gene was present in the common ancestor before the separation of marsupial from placental mammals, and it gave rise both to the extant SAMD9L gene in the marsupial opossum (Modo) and to the ancestral gene of the placental SAMD9/SAMD9L gene family. Later, in placental mammals, this ancestral gene underwent a gene duplication event, resulting in the contemporary SAMD9 and SAMD9L genes.

The conservation of similar arrangement of genes in the same relative locations on the chromosomes of different species, denominated as shared synteny, can indicate the existence of a common ancestor. In Ensembl, among the mammalian species where the presence of SAMD9 and/or SAMD9L has been annotated, shared synteny can be readily observed in chromosomes and ‘gene-scaffolds’. The consistent presence of the same common flanking genes (CALCR, CCDC132, CDK6 and HEPACAM2) in different species supports the idea that SAMD9 and SAMD9L are located in highly conserved regions throughout placental mammals’ divergence and diversification (Figure 1).
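A shared-synteny check of this kind amounts to asking which of the reference flanking genes reappear next to the locus in each species. The sketch below is purely illustrative: the species labels and neighbour lists are hypothetical placeholders rather than Ensembl annotations, and the helper names are assumptions introduced here.

```python
# Flanking genes named in the text, taken from the human chromosome 7 map.
REFERENCE_FLANKING = {"CALCR", "CCDC132", "CDK6", "HEPACAM2"}


def shared_flanking(neighbours_by_species, reference=REFERENCE_FLANKING):
    """For each species, return the reference flanking genes found near the locus."""
    return {
        species: reference & set(genes)
        for species, genes in neighbours_by_species.items()
    }


def conserved_in_all(neighbours_by_species, reference=REFERENCE_FLANKING):
    """Return the reference flanking genes present in every species listed."""
    per_species = shared_flanking(neighbours_by_species, reference).values()
    return set.intersection(*per_species) if per_species else set()


# Hypothetical neighbour lists (placeholders, not real annotation data).
toy_neighbours = {
    "species_A": ["CDK6", "HEPACAM2", "CCDC132", "CALCR"],
    "species_B": ["CDK6", "HEPACAM2", "CALCR", "OTHER1"],
    "species_C": ["HEPACAM2", "CALCR"],
}
conserved = conserved_in_all(toy_neighbours)
```

With these placeholder lists, `conserved` contains `HEPACAM2` and `CALCR`; real input would come from the gene neighbourhoods annotated in each genome.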

Inference of positive selection at SAMD9 and SAMD9L genes level

Placental SAMD9 and SAMD9L deduced protein sequences were aligned independently (Additional file 5: Figure S4; Additional file 6: Figure S5) and ML trees were estimated for each gene (Additional file 7: Figure S6; Additional file 8: Figure S7). Afterwards, we determined whether the SAMD9 and SAMD9L genes might have been subject to positive selection pressures by comparing PAML codon-based nested models with and without positive selection using likelihood ratio tests (LRTs) [16, 17]. Both comparisons, M1 (nearly neutral) versus M2 (positive selection) and M7 (beta) versus M8 (beta and ω > 1), resulted in the rejection of the null hypothesis, strongly supporting positive selection for both SAMD9 and SAMD9L (p < 0.001; Table 2). We also used the PARRIS [18] method to test whether a proportion of sites in each gene alignment evolved under positive selection after accounting for the potentially confounding effects of recombination and synonymous site variation. Interestingly, only SAMD9L was found to be under selection when using this method (p < 0.05; Additional file 9: Table S2).
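The arithmetic behind such a likelihood ratio test is short enough to show directly. The sketch below assumes the conventional two degrees of freedom for the M1 versus M2 and M7 versus M8 comparisons, for which the chi-square tail probability has the closed form exp(-x/2). The log-likelihood values are hypothetical, not those reported in Table 2, and the function name is introduced here for illustration.

```python
import math


def lrt_two_df(lnl_null, lnl_alt):
    """Likelihood ratio test for nested models differing by two parameters.

    Returns the statistic 2 * (lnL_alt - lnL_null) and its p-value under a
    chi-square distribution with 2 degrees of freedom, whose survival
    function is exactly exp(-x / 2).
    """
    stat = max(2.0 * (lnl_alt - lnl_null), 0.0)  # guard against tiny negatives
    p_value = math.exp(-stat / 2.0)
    return stat, p_value


# Hypothetical log-likelihoods for a null (e.g. M7) and alternative (e.g. M8) model.
stat, p_value = lrt_two_df(lnl_null=-25010.0, lnl_alt=-24990.0)
reject_null = p_value < 0.001
```

For these made-up values the statistic is 40 and the p-value is roughly 2e-9, so the null model without positive selection would be rejected at the 0.001 level.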

Table 2 SAMD9 and SAMD9L likelihood ratio test (LRT) for four site models from PAML software

Six different methods were used to detect sites under selection for SAMD9 and SAMD9L (Additional file 10: Table S3). In PAML, we used the M8 model to detect sites under selection on the SAMD9 and SAMD9L phylogenetic trees, and the BEB approach was used to identify codons with a posterior probability >90%. The other five methods applied to detect sites under positive selection are available on the Datamonkey web server. In this study, we only considered a codon to show evidence of selection when it was identified by at least three of the six methods used [19, 20] (Additional file 10: Table S3). Seventeen sites for SAMD9 and nineteen sites for SAMD9L were identified as candidates for sites under positive selection (Figures 3 and 4; Additional file 10: Table S3).
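The "at least three of six methods" rule is a simple vote count per codon. A minimal sketch follows; the method labels and codon positions are hypothetical placeholders (the individual Datamonkey methods are not named here), and `consensus_sites` is a helper introduced for illustration.

```python
from collections import Counter


def consensus_sites(calls, min_methods=3):
    """Return codons reported as selected by at least `min_methods` methods.

    `calls` maps a method label to the collection of codon positions that
    the method flagged; each method contributes at most one vote per codon.
    """
    votes = Counter(site for sites in calls.values() for site in set(sites))
    return sorted(site for site, n in votes.items() if n >= min_methods)


# Hypothetical calls from six methods (placeholder labels and positions).
toy_calls = {
    "paml_m8_beb": {10, 20, 30},
    "method_b": {20, 30, 40},
    "method_c": {20, 30},
    "method_d": {10, 50},
    "method_e": {30},
    "method_f": set(),
}
selected = consensus_sites(toy_calls)  # [20, 30]
```

Requiring agreement between several methods trades sensitivity for specificity: a codon flagged by one method alone, such as position 40 in the toy data, is not retained.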

Figure 3

Positively-selected SAMD9 codons and respective physicochemical properties for each mammalian species. SAMD9 sites under positive selection identified by at least three of the six used Maximum Likelihood methods. Codons are numbered according to the SAMD9 deduced proteins alignment (Additional file 5: Figure S4). The abbreviations correspond to the following species common names: Hosa - Human; Patr - Common chimpanzee; Gogo - Western gorilla; Poab - Sumatran orangutan; Nole - Northern white-cheeked gibbon; Mamu - Rhesus monkey; Bota - Cow; Susc - Pig; Eqca - Horse; Mylu - Little brown myotis; Orcu - European rabbit; Rano - Brown rat; Crgr - Chinese hamster; Capo - Domestic Guinea pig; Soar - Common shrew. To access the species scientific names, the list of abbreviations should be consulted. The background colors represent amino acid properties: polar positive (yellow), polar negative (orange), polar neutral (green), non-polar neutral (purple), non-polar aliphatic (blue) and non-polar aromatic (pink).

Figure 4

Positively-selected SAMD9L codons and respective physicochemical properties for each mammalian species. SAMD9L sites under positive selection identified by at least three of the six used Maximum Likelihood methods. Codons are numbered according to the SAMD9L deduced proteins alignment (Additional file 6: Figure S5). The abbreviations correspond to the following species common names: Hosa - Human; Patr - Common chimpanzee; Gogo - Western gorilla; Poab - Sumatran orangutan; Nole - Northern white-cheeked gibbon; Caja - Common marmoset; Mamu - Rhesus monkey; Loaf - African bush elephant; Eqca - Horse; Calu - Domestic dog; Aime - Giant panda; Ereu - West European hedgehog; Orcu - European rabbit; Mumu - House mouse; Crgr - Chinese hamster; Rano - Brown rat; Capo - Domestic Guinea pig; Soar - Common shrew. To access the species scientific names, the list of abbreviations should be consulted. The background colors represent amino acid properties: polar positive (yellow), polar negative (orange), polar neutral (green), non-polar neutral (purple), non-polar aliphatic (blue) and non-polar aromatic (pink).

Amino acid substitutions can be either conservative or radical, depending on whether they lead to a change in a certain physicochemical property [21]. For the codons identified as being under selection, we investigated the alterations of charge and polarity between mammalian taxa. For SAMD9 all the detected codons (Figure 3) exhibited at least one physicochemical alteration across species and a maximum of five different combinations of properties were identified for codon 331. Primate species SAMD9 amino acid changes were quite conservative, since eleven codons exhibited the same amino acid. Despite the low number of species available for Artiodactyla and Rodentia, we verified in each order a great number of amino acid physicochemical alterations per codon in the SAMD9 genes. In addition, all SAMD9L codons under presumptive selection (Figure 4) exhibited physicochemical alterations across taxa and at least three properties were represented in each codon. A maximum of five different physicochemical properties were identified for codon position 452. In Primates, amino acid substitutions in SAMD9L were once again quite conservative, given that thirteen positions kept the same physicochemical properties even when amino acid substitutions happened. On the contrary, among the four Rodentia species, only three positions in SAMD9L presented the same physicochemical properties, but just one was in fact the same amino acid.
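The conservative versus radical distinction can be made concrete by mapping each residue to one of the six property classes used for the colour coding in Figures 3 and 4. The assignment of individual amino acids to classes below is an assumption made for illustration (the text does not list it, and groupings differ between sources, particularly for tyrosine, cysteine and histidine); the helper names are likewise hypothetical.

```python
# Assumed residue-to-class mapping (illustrative; not taken from the study).
PROPERTY_CLASS = {
    **dict.fromkeys("KRH", "polar positive"),
    **dict.fromkeys("DE", "polar negative"),
    **dict.fromkeys("STNQ", "polar neutral"),
    **dict.fromkeys("CGPM", "non-polar neutral"),
    **dict.fromkeys("AVLI", "non-polar aliphatic"),
    **dict.fromkeys("FWY", "non-polar aromatic"),
}


def is_radical(residue_a, residue_b):
    """A substitution is radical when the two residues fall in different classes."""
    return PROPERTY_CLASS[residue_a] != PROPERTY_CLASS[residue_b]


def classes_at_codon(residues):
    """Return the set of property classes seen at one alignment column.

    Gaps and unknown symbols (anything outside the 20 standard residues)
    are ignored.
    """
    return {PROPERTY_CLASS[r] for r in residues if r in PROPERTY_CLASS}


# Hypothetical alignment column: one residue per species.
column = "KKRE-SAK"
n_classes = len(classes_at_codon(column))  # 4 distinct classes in this toy column
```

Counting the distinct classes per selected codon, order by order, gives the kind of summary reported above, for example a codon at which five different property classes occur across species.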

To detect whether some sites along particular SAMD9 and SAMD9L lineages were under positive selection, we employed branch-site Model A (Table 3). On the SAMD9 phylogenetic tree we identified six branches (foreground branches) with an ω ratio greater than 1, but only the common shrew (Soar) branch had a statistically significant LRT (p < 0.01). SAMD9L branch-site analysis revealed a total of twelve branches with an ω ratio greater than 1, yet only four of those branches presented a statistically significant LRT. Both the Sumatran orangutan (Poab) and the domestic Guinea pig (Capo) branches were significant at p < 0.05, while the west European hedgehog (Ereu) and the common shrew (Soar) branches were significant at p < 0.01.
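Screening many foreground branches comes down to converting each branch's LRT statistic into a p-value and keeping those below the chosen threshold. The sketch below assumes the statistic is compared with a chi-square distribution with one degree of freedom, a common conservative choice for the branch-site test (the text does not state which null distribution was used). Branch labels and statistics are hypothetical.

```python
import math


def chi2_sf_one_df(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom.

    Uses the identity P(X > x) = erfc(sqrt(x / 2)) for X ~ chi-square(1).
    """
    return math.erfc(math.sqrt(max(stat, 0.0) / 2.0))


def significant_branches(lrt_stats, alpha=0.05):
    """Return {branch: p_value} for branches whose LRT is significant at `alpha`."""
    p_values = {branch: chi2_sf_one_df(s) for branch, s in lrt_stats.items()}
    return {branch: p for branch, p in p_values.items() if p < alpha}


# Hypothetical 2 * delta-lnL values for three foreground branches.
toy_stats = {"branch_1": 7.2, "branch_2": 1.1, "branch_3": 4.0}
hits_05 = significant_branches(toy_stats, alpha=0.05)
hits_01 = significant_branches(toy_stats, alpha=0.01)
```

In this toy input, `branch_1` and `branch_3` pass at the 0.05 level but only `branch_1` passes at 0.01, mirroring the two significance tiers reported for the SAMD9L branches.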

Table 3 SAMD9 and SAMD9L parameter estimates and likelihood ratio test (LRT) for branch-site model A (PAML)

Inference of positive selection at SAMD9 and SAMD9L deduced proteins level

The evaluation of destabilizing radical changes that may occur in specific regions of proteins should complement the information obtained from positive selection analyses at the gene level. Using TreeSAAP software, it is possible to estimate, from a phylogenetic tree, the amino acid properties under selection from the thirty-one available in the software [22] (see Methods section for full list of the thirty-one properties).

For both the SAMD9 and SAMD9L phylogenetic trees, the two amino acid properties with the most radical value (category 8) denoting positive destabilizing selection were the isoelectric point (pI) and the equilibrium constant (ionization of COOH) (Additional file 11: Table S4). When comparing the pI values among species for each protein, we observed high variability, especially among SAMD9L taxa (Figure 5). For SAMD9 proteins, both the cow (Bota) and the domestic Guinea pig (Capo) exhibited the lowest pI (7.60), while the pI of 8.11 for the northern white-cheeked gibbon was the highest observed. SAMD9L proteins from placental mammals exhibited a larger range of pI values, with the giant panda (Aime) presenting the lowest pI (6.85) and the horse (Eqca) the highest (8.22). Interestingly, the marsupial grey short-tailed opossum SAMD9L deduced protein presented the lowest pI (6.74) of all. The differences in pI, especially in SAMD9L proteins, may have dramatic effects on protein folding, since they reflect significant differences in the charge and polarity of the amino acids that compose the proteins. Besides the pI and equilibrium constant, SAMD9 presented two other properties under strong positive destabilizing selection, while five more properties were identified as being under positive destabilizing selection for the SAMD9L alignment (Additional file 11: Table S4).
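A theoretical pI can be estimated from sequence alone by finding the pH at which the summed Henderson-Hasselbalch charges of the ionizable groups cancel. The sketch below is one simple way to do this and is not necessarily how the values in Figure 5 were obtained: the pKa table is an illustrative assumption (published tables differ, so absolute values will vary with the table chosen), and the function names are introduced here.

```python
# Assumed side-chain and terminal pKa values (illustrative only).
PKA_POSITIVE = {"K": 10.8, "R": 12.5, "H": 6.5}
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
PKA_N_TERM, PKA_C_TERM = 8.6, 3.6


def net_charge(sequence, ph):
    """Net charge of a protein at a given pH (Henderson-Hasselbalch)."""
    positive = 1.0 / (1.0 + 10 ** (ph - PKA_N_TERM))  # protonated N-terminus
    negative = 1.0 / (1.0 + 10 ** (PKA_C_TERM - ph))  # deprotonated C-terminus
    for residue in sequence:
        if residue in PKA_POSITIVE:
            positive += 1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE[residue]))
        elif residue in PKA_NEGATIVE:
            negative += 1.0 / (1.0 + 10 ** (PKA_NEGATIVE[residue] - ph))
    return positive - negative


def isoelectric_point(sequence, tolerance=1e-4):
    """Find the pH of zero net charge by bisection on the interval [0, 14].

    Net charge decreases monotonically with pH, so bisection converges.
    """
    low, high = 0.0, 14.0
    while high - low > tolerance:
        mid = (low + high) / 2.0
        if net_charge(sequence, mid) > 0:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0


# Hypothetical short peptides: one lysine-rich, one aspartate-rich.
pi_basic = isoelectric_point("KKKKAG")
pi_acidic = isoelectric_point("DDDDAG")
```

Because every lysine, arginine, aspartate or glutamate substitution shifts the balance of charges, a handful of radical replacements at selected sites is enough to move the pI, which is consistent with the spread seen between orthologues.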

Figure 5

Mammalian SAMD9 and SAMD9L deduced proteins isoelectric points (pI). The grey bars correspond to the SAMD9 deduced proteins pI, the blue bars to the SAMD9L deduced proteins pI and the yellow bar to the opossum (Modo) SAMD9L deduced protein pI. The abbreviations correspond to the following species common names: Hosa - Human; Patr - Common chimpanzee; Gogo - Western gorilla; Poab - Sumatran orangutan; Nole - Northern white-cheeked gibbon; Mamu - Rhesus monkey; Bota - Cow; Susc - Pig; Eqca - Horse; Mylu - Little brown myotis; Orcu - European rabbit; Rano - Brown rat; Crgr - Chinese hamster; Capo - Domestic Guinea pig; Soar - Common shrew; Caja - Common marmoset; Loaf - African bush elephant; Calu - Domestic dog; Aime - Giant panda; Ereu - West European hedgehog; Mumu - House mouse; Modo - Grey short-tailed opossum. To access the species scientific names, the list of abbreviations should be consulted.

Regarding the SAMD9 sliding window, the four amino acid properties with significant z-Score values (>3.09) were evenly distributed along the SAMD9 protein alignment (Figure 6). However, a greater concentration of high z-Score values was observed in the region between amino acids 660 and 910, specifically for the pI. The SAMD9L sliding window showed a dense pattern for the seven amino acid properties under destabilizing selection (Figure 7). Two regions of the SAMD9L protein alignment presented an even greater density of properties and the highest z-Score values for some of those properties: the amino acid ranges 208–431 and 863–1430.
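Reading regions off a sliding-window profile means collecting the contiguous stretches where the z-score exceeds the cutoff. The sketch below assumes one z-score per alignment position for a single property; it does not reproduce how TreeSAAP computes the scores, and the input values are hypothetical. A z-score of 3.09 corresponds to a one-tailed probability of about 0.001 under a standard normal distribution.

```python
import math

Z_THRESHOLD = 3.09  # significance cutoff quoted in the text


def one_tailed_p(z):
    """Upper-tail probability of a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def significant_regions(z_scores, threshold=Z_THRESHOLD):
    """Return 1-based inclusive (start, end) ranges where z exceeds the threshold."""
    regions, start = [], None
    for position, z in enumerate(z_scores, start=1):
        if z > threshold and start is None:
            start = position                       # a region opens here
        elif z <= threshold and start is not None:
            regions.append((start, position - 1))  # the region closed one step back
            start = None
    if start is not None:                          # region runs to the end
        regions.append((start, len(z_scores)))
    return regions


# Hypothetical z-score profile for one property along a short alignment.
toy_profile = [0.5, 3.2, 3.5, 1.0, 4.0]
regions = significant_regions(toy_profile)  # [(2, 3), (5, 5)]
```

Running this per property and overlaying the resulting ranges would give the density pattern described above, where several properties exceed the cutoff in the same stretch of the alignment.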

Figure 6

Sliding window for SAMD9 amino acid properties under positive selection. SAMD9 amino acid properties under destabilizing selection with significant z-Score values (>3.09).

Figure 7

Sliding window for SAMD9L amino acid properties under positive selection. SAMD9L amino acid properties under destabilizing selection with significant z-Score values (>3.09).

Discussion

In a previous study, SAMD9 and its paralogue SAMD9L were identified in a variety of species, namely human, chimpanzee, dog and rat. However, in the house mouse (Mus musculus, Mumu) genome, SAMD9 was uniquely lost [1]. The same study indicated the absence of both genes in chicken, frog and all fish species sequenced at the time, suggesting that the event originating the SAMD9/SAMD9L genes occurred after the mammalian radiation. One of our goals was to extend the identification of SAMD9 and SAMD9L to additional mammalian genomes and to verify whether the loss of mouse SAMD9 was a unique event restricted to this taxon.

Despite the great number of morphological, molecular and phylogenetic studies for the order Rodentia, controversies relating to the divergence times between its major suborders still persist [23]. In a recent study on rodent evolution [24] some internal rodent branches have been resolved, where three main groups in the phylogenetic tree were supported: the Mouse-related clade, Ctenohystrica clade and the Squirrel-related clade. A scenario has been proposed where the pre-Squirrel-related clade diverged early from the common ancestor followed by a later separation of the pre-Mouse-related and pre-Ctenohystrica clade [24]. We gathered sequences for one or both SAMD9 and SAMD9L genes for species representative of the three clades. The two genes were present in the thirteen-lined ground squirrel (Squirrel-related clade), the domestic Guinea pig (Ctenohystrica clade), the Chinese hamster and the brown rat (Mouse-related clade). Together with the absence of SAMD9 in the house mouse genome, the Ord’s kangaroo rat (Mouse-related clade) also did not have this gene annotated in Ensembl. With the apparent region synteny for the Ord’s kangaroo rat when compared to the other mammals, this absence might just be the case of a genome still to be completely annotated, leaving the house mouse as the only rodent taxon that has lost SAMD9, at least from the currently available genomic sequence database.

Many of the available mammalian genomes are still not completely annotated. Therefore, we made no assumptions regarding SAMD9 and SAMD9L for those species. Nevertheless, we observed that the fairly well annotated cow and pig genomes (Order Artiodactyla) had no matches or annotations for SAMD9L. This information, together with the absence of SAMD9 in the house mouse and the already suggested origin of both genes from a common ancestor by ancient gene duplication [1], led us to the following hypothesis: in some lineages the presence of both genes might be costly for the genome, resulting in the loss of one of the genes, whose function would be compensated by the remaining paralogue. Although these observations support the potential existence of some gene redundancy between SAMD9 and SAMD9L, we also note the almost nonexistent recombination between them, despite the proximity of these two genes in the genomes of all the annotated mammalian species. This genetic isolation of the two paralogues does not support the existence of functional redundancy between SAMD9 and SAMD9L. These apparently contradictory hypotheses need to be tested by functional studies in different species.

With all the available mammalian sequences collected for both SAMD9 and SAMD9L genes, the performed phylogenetic study resulted in a tree with a well-defined monophyletic group per gene gathering solely placental mammals and a single outgroup, the marsupial grey short-tailed opossum. This supported the speculative hypothesis of SAMD9 and SAMD9L resulting from a gene duplication event, more precisely, after the divergence of Marsupialia from Placentalia 147.7 Mya [25]. Despite the common ancestor, when testing for the occurrence of potential positive selection acting at the gene and protein levels, we concluded that SAMD9L is under stronger selection than SAMD9. This is supported by the fact that a higher number of sites at the gene level and of specific lineages were positively selected in SAMD9L than SAMD9. Besides, a greater number of amino acid properties were under selection at the deduced protein level of SAMD9L than SAMD9.

When we examined the amino acid substitutions and changes in physicochemical properties for sites under selection, it was clear, for both proteins, that members of the order Rodentia presented the highest number of divergent alterations for the same codons compared to other mammalian orders. Since it is known that in many proteins the amino acid substitutions caused by positive selection are not random [21, 26], for instance the primate APOBEC3G residues involved in HIV-1 Vif interaction [27], we hypothesize that the alterations occurring in rodents, or even in other lineages, may be the result of a sustained arms race between the host and a pathogen stressor. This could be a significant observation, given that anti-viral properties have already been assigned to human SAMD9 in cultured human cells. Specifically, a unique viral gene product, M062 of myxoma virus, was found to antagonize the anti-viral properties of the SAMD9 protein in order to permit the replication of this virus in cultured human cells [8].

For the mammalian species included in this study, selection analyses performed on the SAMD9 and/or SAMD9L genes of each species individually may yield results that differ from those obtained in our work, since recombination rates and effective population sizes are expected to differ among species. Such species- and population-specific selection analyses should identify sites under selection in SAMD9 and/or SAMD9L that can be used in population genetic studies by determining parameters such as allele and genotype frequencies, FST and nucleotide diversity values. This would contribute to the definition of genotypes that might or might not be favorable, for example, for defence against certain pathogens.

Human SAMD9 and SAMD9L have only one defined domain, the Sterile Alpha Motif (SAM), a module about 70 amino acid residues long [28], specifically 65 amino acids in SAMD9 and 66 in SAMD9L. SAM domains, among the most common protein domains found in eukaryotic cells, are protein-protein interaction modules that perform a large number of different functions [29, 30] and are not easily categorized. Indeed, different SAM domains can self-associate, bind to other SAM domains and/or to non-SAM proteins, and even interact with RNA, DNA or lipids [30]. Because of this great variety of known functions, the presence of a SAM domain does not necessarily imply a specific function or pathway, but rather an array of possible functions. For both human SAMD9 and SAMD9L, no function has yet been assigned to their SAM domains, although the ability to form SAM polymers has been suggested for SAMD9 [31]. In our evolutionary study of both proteins, none of the identified sites or amino acid properties under positive selection overlapped with the deduced SAM domains, indicating a high level of conservation of this domain among the mammalian species.

Conclusions

Since the origin and evolution of the SAMD9 and SAMD9L genes were first reported, a great number of mammalian genomes have been sequenced, now allowing a more detailed view of the evolutionary history of both genes. Our study supports the previously suggested origin of SAMD9 and SAMD9L from an ancestral mammalian duplication event. Specifically, according to our results, this event occurred after the divergence of Marsupialia from Placentalia. Considering the mostly complete mammalian genomes collected for this study, the apparent loss of SAMD9 or SAMD9L in some species led us to propose that some functional redundancy exists between the two proteins, despite the almost nonexistent recombination between the two closely located genes in other species. From the positive selection analyses performed at both the gene and protein levels, we show that SAMD9 and SAMD9L continue to be under long-term selective pressure, with even stronger evidence for positive selection in SAMD9L.

Both the SAMD9 and SAMD9L genes are upregulated by type I interferon, a classic feature of the many innate pathogen-response genes known as interferon-stimulated genes (ISGs). Indeed, human SAMD9 has already been shown to be a functional inhibitor of at least one viral pathogen, a poxvirus called myxoma virus, which expresses a specific viral inhibitor (M062) that counteracts the anti-viral properties of SAMD9 [8]. Our results suggest that at least the SAMD9 genes may have been under sustained selection pressure exerted by viral pathogens.

Our work is the first complete study to investigate the evolutionary history of mammalian SAMD9 and SAMD9L.

Methods

SAMD9 and SAMD9L nucleotide and protein sequences

All available mammalian SAMD9 and SAMD9L gene coding sequences used in the phylogenetic and positive selection analyses were retrieved from the NCBI (http://www.ncbi.nlm.nih.gov) and Ensembl (http://www.ensembl.org/index.html) databases. Sequences were then aligned with ClustalW [32] as implemented in BioEdit v7.0.9 [33], followed by visual inspection. Translation of nucleotide sequences into protein sequences was also performed using BioEdit.

SAMD9 and SAMD9L gene coding sequences were collected for fifteen and nineteen species, respectively. Based on the Mammal Species of the World database classification (http://www.bucknell.edu/msw3/), representative species of the mammalian infraclasses Metatheria (Order Didelphimorphia) and Eutheria (Orders Artiodactyla, Carnivora, Chiroptera, Erinaceomorpha, Lagomorpha, Perissodactyla, Primates, Proboscidea, Rodentia and Soricomorpha) were included in this study. Table 1 summarizes the species collected for each gene and their respective accession numbers.

The isoelectric point (pI) of the deduced SAMD9 and SAMD9L proteins from different species was estimated using DAMBE (Data Analysis and Molecular Biology and Evolution) [34].

Recombination and phylogenetic analyses

Recombination can mislead phylogenetic and positive selection analyses [35], and for SAMD9 and SAMD9L in particular, the close proximity of the genes (~12 kb in the human genome, for example) might increase the probability of recombination. Therefore, we first tested for recombination in the placental SAMD9 and SAMD9L nucleotide sequence alignments, and also in the alignment of both genes together (SAMD9 + SAMD9L). The software GARD (Genetic Algorithm for Recombination Detection) [13, 14], implemented in the Datamonkey web server [36], was used to detect possible recombination breakpoints.

For the SAMD9 and SAMD9L gene alignments, no significant breakpoints were detected by GARD; thus, the complete alignments were used to establish the phylogeny of each gene. As indicated by the Akaike Information Criterion (AIC) implemented in jModelTest v0.1.1 [37], the nucleotide substitution model TVM+G was used for SAMD9 tree estimation, while GTR+G was the model selected for SAMD9L phylogenetic tree construction. In contrast, a significant breakpoint was detected when running GARD on the SAMD9+SAMD9L alignment, and a phylogenetic tree was estimated for each segment. For the left segment, the AIC in jModelTest indicated GTR+I+G as the best-fit nucleotide substitution model, whereas for the right segment the TPM2uf+G model was indicated as the best for tree estimation. A phylogenetic tree was also estimated for the SAMD9+SAMD9L alignment without taking recombination into account. In this case, the AIC in jModelTest selected GTR+I+G as the best-fit nucleotide substitution model.
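The AIC-based model choice described above can be illustrated with a minimal sketch. It is not jModelTest itself; it only applies the standard definition AIC = 2k - 2 lnL and picks the lowest score. The model labels, log-likelihoods and parameter counts below are placeholders invented for illustration, not values from this study.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike Information Criterion: 2k - 2 lnL (lower is better)."""
    return 2.0 * n_params - 2.0 * log_likelihood


def best_model_by_aic(fits: dict) -> str:
    """Return the label of the model with the lowest AIC.

    `fits` maps a model label to a (log_likelihood, n_params) tuple.
    """
    return min(fits, key=lambda label: aic(*fits[label]))


# Placeholder fits (hypothetical labels and numbers, for illustration only).
example_fits = {
    "model_A": (-1000.0, 5),   # AIC = 2010.0
    "model_B": (-995.0, 9),    # AIC = 2008.0
    "model_C": (-994.5, 10),   # AIC = 2009.0
}

if __name__ == "__main__":
    print(best_model_by_aic(example_fits))
```

In this toy case the richest model is not selected, because its small gain in likelihood does not offset the penalty for the extra parameter.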

To establish the mammalian phylogeny for SAMD9, SAMD9L and SAMD9+SAMD9L based on nucleotide sequences, the Maximum Likelihood (ML) method implemented in GARLI v2.0 (Genetic Algorithm for Rapid Likelihood Inference) was used [38]. The analyses were performed with 1,000,000 generations and 1,000 bootstrap searches. ML trees were displayed using FigTree v1.3.1 (http://tree.bio.ed.ac.uk/).

Codon-based analyses of positive selection

A useful measure for identifying adaptive protein evolution is the ratio of the nonsynonymous (dN) to the synonymous (dS) substitution rate (ω = dN/dS), where values of ω = 1, ω < 1 and ω > 1 indicate neutral evolution, negative selection and positive selection, respectively [39, 40]. Because of structural and functional constraints on proteins, ω is expected to be close to 0, and analyses averaged over the full protein rarely detect positive selection [41]. As a result, several methods based on models of codon substitution have been developed to detect adaptive evolution (positive selection) at individual sites against a background of negative selection [42, 43]. We employed six different methods to detect sites under selection and, following the methodology adopted by several authors [19, 20], only codons identified by at least three of the six methods were considered to be under positive selection.
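The "at least three of six methods" rule can be sketched as a simple vote count. This is an illustration of the rule only, not the authors' script. The method labels are assumed to correspond to the six site-level methods described in this section, and the codon positions are made-up placeholders.

```python
from collections import Counter


def consensus_sites(calls: dict, min_methods: int = 3) -> list:
    """Return codon positions reported by at least `min_methods` methods.

    `calls` maps a method label to an iterable of codon positions that the
    method flagged as positively selected.
    """
    # set() ensures each method votes at most once per codon.
    votes = Counter(site for sites in calls.values() for site in set(sites))
    return sorted(site for site, n in votes.items() if n >= min_methods)


# Placeholder calls (hypothetical codon positions, for illustration only).
example_calls = {
    "PAML_M8": [10, 25, 40],
    "SLAC": [25],
    "FEL": [10, 25, 77],
    "REL": [10, 40, 77],
    "MEME": [25, 40, 90],
    "FUBAR": [25, 90],
}

if __name__ == "__main__":
    print(consensus_sites(example_calls))
```

Codons supported by only one or two methods are dropped, which is what makes the combined call more conservative than any single method.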

To detect selection at the gene level based on the ratio ω, PAML v4.4 (Phylogenetic Analysis by Maximum Likelihood) [16, 17] was used for both SAMD9 and SAMD9L, and the codon frequency model F3x4 was fitted to both alignments. For the site-specific models, which allow the ratio ω to vary among codons, we performed Likelihood Ratio Tests (LRTs) with 2 degrees of freedom to compare the following models (NS sites): M1 (nearly neutral) with M2 (selection), and M7 (neutral, β distribution with 0 < ω < 1) with M8 (selection, β distribution plus an additional class with ω > 1). A significant LRT indicates that the selection model fits the data better than the neutral model [42, 43]. For model M8, a Bayes empirical Bayes (BEB) approach was employed to detect codons with a posterior probability >90% of being under selection [44]. The branch-site model A was also applied to test for positive selection at individual sites along a specific lineage, called the foreground branch, with all other lineages treated as background branches. In branch-site model A, three ω ratios are assumed for the foreground (0 < ω0 < 1, ω1 = 1, ω2 > 1) and two ω ratios for the background (0 < ω0 < 1, ω1 = 1). The null model is the same as model A, but with ω2 fixed at 1. We again used the BEB approach to calculate the posterior probability for each codon site and to identify those most likely to be under positive selection (posterior probability >90%) [44].
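For the site-model comparisons (M1 vs. M2 and M7 vs. M8), the LRT statistic is twice the difference in log-likelihood between the selection and neutral models, compared against a χ² distribution with 2 degrees of freedom. The sketch below is a minimal illustration using only the Python standard library; it relies on the fact that, for exactly 2 degrees of freedom, the χ² upper-tail probability is exp(-x/2). The log-likelihood values are placeholders, not results from this study.

```python
import math


def lrt_df2(lnl_null: float, lnl_alt: float) -> tuple:
    """Likelihood ratio test against a chi-square with 2 degrees of freedom.

    Returns (statistic, p_value). The closed form exp(-x / 2) for the
    upper-tail probability is valid only for df = 2.
    """
    # Nested models should give lnl_alt >= lnl_null; clamp small negative
    # differences caused by imperfect optimisation.
    statistic = max(2.0 * (lnl_alt - lnl_null), 0.0)
    p_value = math.exp(-statistic / 2.0)
    return statistic, p_value


if __name__ == "__main__":
    # Placeholder log-likelihoods (hypothetical, for illustration only).
    stat, p = lrt_df2(lnl_null=-1000.0, lnl_alt=-995.0)
    print(f"2*delta(lnL) = {stat:.2f}, p = {p:.4g}")
```

A different number of degrees of freedom would require the general χ² survival function rather than this closed form.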

Both the SAMD9 and SAMD9L genes were also analyzed using the HyPhy software implemented in the Datamonkey web server [36]. Datamonkey includes three classic ML methods to detect sites under selection: the Single Likelihood Ancestor Counting (SLAC) model, the Fixed Effect Likelihood (FEL) model and the Random Effect Likelihood (REL) model [45]. In addition to these three methods, two more recently developed methods available in the Datamonkey web server were applied to our dataset. The first is the Mixed Effects Model of Evolution (MEME), which allows the distribution of ω to vary from site to site and also from branch to branch at a site, and is therefore capable of identifying both episodic and pervasive positive selection [46]. The second is the Fast Unbiased Bayesian AppRoximation (FUBAR) method, which detects positive selection faster than the existing fixed effects likelihood models through an ultra-fast Markov chain Monte Carlo (MCMC) routine and allows visualization of the Bayesian inference for each site [47]. All these methods were run using the best model chosen by AIC on a Neighbor-Joining (NJ) phylogenetic tree, after running GARD to detect recombination. To avoid a high false-positive rate due to the reduced number of sequences [45], sites with p-values <0.1 for the SLAC, FEL and MEME models, a Bayes Factor >50 for the REL model and a posterior probability >0.90 for FUBAR were accepted as candidates for selection.
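The per-method acceptance thresholds stated above can be written as a small lookup. This is a sketch of the cut-offs only; it does not parse Datamonkey output, and the function and table names are hypothetical.

```python
# Cut-offs as stated in the text: (kind of statistic, threshold).
THRESHOLDS = {
    "SLAC": ("p_value", 0.1),
    "FEL": ("p_value", 0.1),
    "MEME": ("p_value", 0.1),
    "REL": ("bayes_factor", 50.0),
    "FUBAR": ("posterior", 0.90),
}


def is_candidate(method: str, value: float) -> bool:
    """Return True if `value` passes the cut-off for `method`.

    p-values must fall strictly below the threshold; Bayes factors and
    posterior probabilities must be strictly above it.
    """
    kind, cutoff = THRESHOLDS[method]
    if kind == "p_value":
        return value < cutoff
    return value > cutoff


if __name__ == "__main__":
    print(is_candidate("FEL", 0.05))    # p-value below 0.1
    print(is_candidate("REL", 120.0))   # Bayes factor above 50
    print(is_candidate("FUBAR", 0.85))  # posterior not above 0.90
```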

Using the HyPhy software available on the Datamonkey web server, we also ran the PARRIS method, which tests whether a proportion of sites in the alignment evolve with dN/dS > 1 while accounting for synonymous rate variation and recombination [18].

Amino acid-based analyses of positive selection

Using TreeSAAP v3.2 (Selection of Amino Acid Properties based on Phylogenetic Trees) [22], it was possible to detect selection signatures at the amino acid level, more specifically, positively selected amino acid properties that result in radical structural and functional changes in local regions of the protein (destabilization). Properties that fell into categories 6 through 8 (the most radical values, denoting positive destabilizing selection) and had z-scores of 3.09 or higher, with a probability value of 0.001, were plotted in a sliding window (length = 20).
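A sliding-window summary of per-site z-scores can be sketched as follows. This is not TreeSAAP's internal procedure, which is not described here. It is an assumed, generic illustration that counts, in each window of 20 consecutive positions, how many sites reach the z-score cut-off of 3.09. The function name and the input values are hypothetical.

```python
def window_hit_counts(z_scores, window: int = 20, threshold: float = 3.09):
    """Count sites with z >= threshold in each window of consecutive positions.

    Returns one count per window start position (an empty list if the
    sequence is shorter than the window).
    """
    if window <= 0:
        raise ValueError("window must be positive")
    flags = [1 if z >= threshold else 0 for z in z_scores]
    if len(flags) < window:
        return []
    count = sum(flags[:window])
    counts = [count]
    # Slide by one position: add the entering site, drop the leaving one.
    for start in range(1, len(flags) - window + 1):
        count += flags[start + window - 1] - flags[start - 1]
        counts.append(count)
    return counts


if __name__ == "__main__":
    # Placeholder z-scores (hypothetical); a short window keeps the demo small.
    demo = [0.0, 3.5, 1.0, 4.0, 0.0]
    print(window_hit_counts(demo, window=3))
```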

Thirty-one amino acid properties were evaluated across SAMD9 and SAMD9L phylogenetic trees to identify protein regions that presented evidence of positive destabilization for each property. The thirty-one amino acid properties are the following: alpha-helical tendencies, average number of surrounding residues, beta-structure tendencies, bulkiness, buriedness, chromatographic index, coil tendencies, composition, compressibility, equilibrium constant (ionization of COOH), helical contact area, hydropathy, isoelectric point, long-range non-bonded energy, mean r.m.s. fluctuation displacement, molecular volume, molecular weight, normalized consensus hydrophobicity, partial specific volume, polar requirement, polarity, power to be at the C-terminal, power to be at the middle of alpha-helix, power to be at the N-terminal, refractive index, short and medium range non-bonded energy, solvent accessible reduction ratio, surrounding hydrophobicity, thermodynamic transfer hydrophobicity, total non-bonded energy and turn tendencies.