Screening for mouse genes lost in mammals with long lifespans
- 35 Downloads
Gerontogenes include those that modulate life expectancy in various species and may be the actual longevity genes. We believe that a long (relative to body weight) lifespan in individual rodent and primate species can be due, among other things, to the loss of particular genes that are present in short-lived species of the same orders. These genes can also explain the widely different rates of aging among diverse species as well as why similarly sized rodents or primates sometimes have anomalous life expectancies (e.g., naked mole-rats and humans). Here, we consider the gene loss in the context of the prediction of Williams’ theory that concerns the reallocation of physiological resources of an organism between active reproduction (r-strategy) and self-maintenance (K-strategy). We have identified such lost genes using an original computer-aided approach; the software considers the loss of a gene as disruptions in gene orthology, local gene synteny or both.
A method and software identifying the genes that are absent from a predefined set of species but present in another predefined set of species are suggested. Examples of such pairs of sets include long-lived vs short-lived, homeothermic vs poikilothermic, amniotic vs anamniotic, aquatic vs terrestrial, and neotenic vs nonneotenic species, among others. Species are included in one of two sets according to the property of interest, such as longevity or homeothermy. The program is universal towards these pairs, i.e., towards the underlying property, although the sets should include species with quality genome assemblies. Here, the proposed method was applied to study the longevity of Euarchontoglires species. It largely predicted genes that are highly expressed in the testis, epididymis, uterus, mammary glands, and the vomeronasal and other reproduction-related organs. This agrees with Williams’ theory that hypothesizes a species transition from r-strategy to K-strategy. For instance, the method predicts the mouse gene Smpd5, which has an expression level 20 times greater in the testis than in organs unrelated to reproduction as experimentally demonstrated elsewhere. At the same time, its paralog Smpd3 is not predicted by the program and is widely expressed in many organs not specifically related to reproduction.
The method and program, which were applied here to screen for gene losses that can accompany increased lifespan, were also applied to study reduced regenerative capacity and development of the telencephalon, neoteny, etc. Some of these results have been carefully tested experimentally. Therefore, we assume that the method is widely applicable.
KeywordsLifespan Longevity Aging Gerontology Gene loss Synteny In silico analysis
Upper Galilee mountains blind mole rat
maximum reported lifespan
reads per kilobase per million mapped reads
topologically associated domain
transcripts per million
The genetic propensity of certain species for longevity and anti-aging is a challenging problem in vertebrate biology. Of particular interest are the genes that influence life expectancy differences among species. These genes are expected to be the real longevity genes of interest and should explain the wide differences in the rate of aging among diverse species and why similarly sized rodents or primates such as naked mole-rats (NMR) or humans sometimes have anomalous life expectancies (see, e.g., ). According to , no such genes have been unequivocally identified. We performed a computer-aided analysis of data relevant to lifespan and made a bioinformatic search for the genes, the loss of which might modulate lifespan. This search is based on the general idea that such genes are lost in a predefined set of species but are present in another predefined set of species. Examples of such pairs of sets include long-lived vs short-lived, homeothermic vs poikilothermic, amniotic vs anamniotic, aquatic vs terrestrial, and neotenic vs nonneotenic species, among others. Species are included in one of two sets depending on the property of interest, such as longevity or homeothermy. A bioinformatics method and software relevant to the idea are universal towards these sets and the property that defines them. Earlier, we used the method to predict that the loss of certain genes in the evolution of vertebrates could increase the size of the telencephalon in hot-blooded vertebrates relative to cold-blooded vertebrates at the expense of a considerably decreased regenerative capacity, and later it was experimentally demonstrated . Accordingly, we considered homeotherms vs poikilotherms among vertebrates there. Here, we studied Euarchontoglires in regard to their lifespan with due account to the body weight [4, 5]. The main concepts of the approach have been implemented in the lossgainRSL software . The results of this work were reported at the conference .
Thus, the goal of this paper was to describe a method and software for lost gene identification and to illustrate their application for studying the longevity of Euarchontoglires species. The program identifies the genes that are present in a predefined set of species but absent from another predefined set of species. Here, these sets are defined by the longevity property of Euarchontoglires. The program is universal in regard to its applicability to any property, although the corresponding sets should include species with quality genome assemblies, which do not always match the best biological choice. The program has not been described previously and implements the very part of our method that is required for the current property of longevity. Thus, we have obtained a list of mouse genes lost by long-lived Euarchontoglires, which can be linked to their longevity. Supraprimates were chosen since this is the natural clade combining primates and rodents with a wide variance in lifespans. In addition, this taxon is well represented in databases. Further data on the considered species and individual examples of lost genes are given in Additional file 1.
If the complex set of conditions is tested against a multitude of species, data mining reveals patterns that are hardly discernible when few genomes are analyzed. Data mining raises the requirements for computation performance and software flexibility. Here, this is achieved by supercomputer-based parallel processing and the separation of the system of conditions from the programming code, which is realized by describing the gene selection procedure in the specifically designed language. This language allows the easy variation of the conditions. The prediction accuracy is improved by the combined application of several orthology inference methods and variation of all significant parameters, which also increases the computational load.
The method of in silico search for lost genes
This section presents an original approach to the identification of lost (or, similarly, acquired) genes, which is the major achievement of this study; its computer implementation (with technical details relevant to the corresponding software) can be found in the Methods section. We are given d groups of species (each group is a taxon or a union of taxa) and a species-specific property T. Two fractions of each group are distinguished by the property, the upper fraction where the property is present and the lower one where it is missing. The species that belong to the upper and lower fractions will be referred to as upper and lower, respectively. Let us specify one of the lower species as the reference species, assuming that its genome is well assembled and convenient for further wet experiments. The genes of the reference species present in the genomes of the lower species and missing (lost) in the genomes of the upper species are identified in silico. For example, T can indicate longevity or homeothermy, etc. It is unlikely that each of the identified genes is directly linked to a corresponding property that is inherent in the respective upper species, due to incomplete genomic assemblies as well as to the complexity of the evolution process. However, we believe that the property T that defines the set of upper species was formed in evolution, in particular, due to the loss of some genes from those predicted. In other words, such a link with the property T can be observed for some of the identified genes by the proposed approach. Our approach and software can be of general interest.
Several orthology inference methods are independently used, and the gene “presence” or “absence” is confirmed by the presence or absence, respectively, of local synteny (see the Methods section for details).
The decision of gene presence or loss is made by consensus or voting for the results of (А).
If the absence of an ortholog favors the loss of gene X in the reference species, but the genomic alignment contains a gene X’ in the tested species such that the exon/intron structures of X and X’ have significant similarity, including a satisfactory local alignment of certain exons, X’ is considered as substantially varied version of gene X.
In the above case, similar tissue expression patterns of genes X and X’ or their similar biochemical roles serve as an indication of the presence of gene X in the tested species. For instance, the predominant expression of both genes in the same organs favors the presence of the gene in the tested species.
Naturally, the result depends on the parameters. For each pair of species, the choice that should be made is the number of orthologous gene pairs in the vicinity that is sufficient to acknowledge the presence of the gene (in Fig. 1, this number equals 2). Each vicinity is described by a size r, which depends on the upper or lower fraction as well as on the group of species. Here, the gene is finally considered lost if it was lost within r’ running from some lower boundary to the upper boundary r0. The reason for choosing r0 is based on the following considerations. The 3D structure of DNA includes topologically associated domains (TADs); the impact of enhancers is usually limited to the TAD size so that transcription control largely occurs within the TAD. Local synteny conservation is usually considered as its conservation within the same TAD. In mammals, the TAD size ranges from one to several Mbp [14, 15, 16, 17]. While choosing r0 can be problematic in general, it can be justified here.
Our fast and flexible program lossgainRSL  identifies lost and acquired genes between several groups of species. Currently, it implements step (A) separately for each orthology inference method, while steps (В), (C), and (D) are performed during computer-aided postprocessing. Their validation can be automated as well, but this was not implemented in the current program since the automatic validation is less accurate and requires its own thresholds that are hard to clearly define. The details of this implementation can be found in the Methods section.
Mouse genes lost in long-lived Euarchontoglires
Species of rodents, lagomorphs, and primates. Maximum reported lifespan (MRLS, years) and longevity quotient (LQ,%) were obtained from the AnAge database 
Naked mole-rat (NMR)
Nannospalax galili (Spalax galili)
Upper Galilee mountains blind mole rat (BMR)
Fukomys damarensis (Cryptomys damarensis)
Damaraland mole-rat (DMR)
Peromyscus maniculatus bairdii
Northern American deer mouse
Lesser Egyptian jerboa
Thirteen-lined ground squirrel
Oryctolagus cuniculus (Lepus cuniculus)
Brazilian guinea pig
Cebus capucinus imitator
Northern white-cheeked gibbon
Bolivian squirrel monkey
Nancy Ma’s night monkey
Gorilla gorilla gorilla
Western lowland gorilla
Colobus angolensis palliatus
Golden snub-nosed monkey
Black snub-nosed monkey
Meaningful mouse genes lost in long-lived mammals
RIKEN cDNA 4931406B18 gene
RIKEN cDNA 1810065E05 gene
transmembrane BAX inhibitor motif containing 7
RIKEN cDNA 4,921,536 K21 gene
cytochrome P450, family 2, subfamily ab, polypeptide 1
heart, spleen, placenta
selection and upkeep of intraepithelial T cells 2
RIKEN cDNA 2210407C18 gene
diazepam binding inhibitor-like 5
histone H1-like protein in spermatids 1
RIKEN cDNA F830045P16 gene
tetratricopeptide repeat domain 41
tetratricopeptide repeat domain 39D
MAS-related GPR, member B2
skin, ovary, forelimb
MAS-related GPR, member B8
RIKEN cDNA 4930504O13 gene
selection and upkeep of intraepithelial T cells 4
skin, testis, thymus, brain
cytochrome c, testis
predicted gene 12,253
selection and upkeep of intraepithelial T cells 3
skin, spleen, lung
sphingomyelin phosphodiesterase 5
immunoglobulin lambda variable 1
spleen, colon, thymus
immunoglobulin lambda variable 2
spleen, thymus, colon
predicted gene 12,394
RIKEN cDNA 9230113P08 gene
predicted gene 595
selection and upkeep of intraepithelial T cells 1
skin, whole organism, thymus
predicted gene 2163
A considerable part of the predicted genes are still detected as lost when the algorithm parameters are varied. For instance, only 3 out of 27 genes in Table 2 are missing at m = 2: Tmbim7, 9230113P08Rik, and Ttc39d; similarly, three genes are missing at p = 3: Iglv1, Iglv2, and Mrgprb8; and a single gene, Hils1, is missing at n = 3. Many of the genes in Table 2 were predicted irrespective of the orthology inference method. Complete data related to this special problem are not presented here. However, we compared the screening results for the abovementioned 40 Euarchontoglires species produced by three alternative orthology inference methods: Ensembl v92 employing a distinct approach, our original method of protein clustering , and blastp identification of reciprocal best hits with 0.001 cutoff including different voting schemes. These alternative methods predict at least genes most reasonably linked to longevity. Specifically, using the “2 of 3” voting scheme identified, apart from the olfactory receptors, 13 genes: Tmbim7, 4921536K21Rik, Cyp2ab1, Skint2, Dbil5, Hils1, Ttc41, Ttc39d, 4930504O13Rik, Cyct, Gm12253, Smpd5, and 9230113P08Rik. The consensus of the three orthology inference methods excludes the three underlined genes. A comparison of numerous thusly generated lists of lost genes is hardly practicable since it is hard to interpret genes that are present in one list and absent from another. The elucidation of the role of a lost gene depends on a component of longevity under consideration and requires analysis of individual gene properties.
There are numerous theories of aging based on diverse concepts of this extremely complex process. These include Williams’ antagonistic pleiotropy theory  relating the active reproduction vs longevity opposition, thus opposing the r- and K-strategies for the maintenance of the physiological functions of the body. Indeed, the predicted lost genes largely demonstrate a specific relationship to reproductive-related organs, which agrees with Williams’ hypothesis  concerning the reallocation of the physiological resources of the body between self-maintenance and reproduction (transition to К-strategy in the species evolution). We are unaware of software for large-scale screening for the genes, the loss of which correlates with species-specific lifespans in the context of Williams’ theory . Published data indicate that many genes are actively expressed in mouse testes and correspond to human pseudogenes, although such genes were identified with no account for synteny . We searched mouse genes whose orthology and local genomic synteny were present in short-lived species, but either of these relations was disrupted in long-lived species. Our program can simultaneously process a vast set of available species and all their known genes. Many of the genes identified by the program are essential for reproduction. Accordingly, their loss can indicate the transition to Williams’ K-strategy in the species evolution .
The global problem here is whether gene loss can cause a significant evolutionary change. We also asked this question until our extensive three-year experiments demonstrated that the tropical clawed frog’s gene, ENSXETG00000033176, identified by the same program was, to say the least, associated with a sharp drop in regenerative potential as well as with forebrain development .
The list of 134 lost genes in Additional file 2 was obtained for the following long-lived species: 3 rodents with LQ > 130% and 7 primates with LQ > 220%, assuming that the genes should necessarily be lost in the NMR, human and white-headed capuchin. The lower set included 11 rodents and lagomorphs with LQ < 110% and 10 primates with LQ < 190%. The mouse was the reference species. We validated these genes with an account of the genomic and nucleotide alignments, exon-intron structure, the similarity of organ-specific expression, and gene function using well-known bioinformatics tools such as Ensembl Genome Browser  and Expression Atlas . This was done for pairs of genes suspected as orthologs in mouse and a long-lived species that were not specified as orthologs or paralogs in Ensembl. For instance, synteny conservation can be evident from the genomic alignments and thus favor the orthology of the gene pair, although the genes are not considered orthologous. Approximately half of these 134 genes lacked such signs of orthology and synteny, which we considered validating the prediction quality. These genes include 43 olfactory genes and 27 other genes that were provisionally designated as the meaningful. More than one-third of olfactory genes are significantly (data not shown) expressed in the testis and vomeronasal organ, both of which are associated with reproduction. Similarly, the bulk of the meaningful genes demonstrate significant expression in organs associated with reproduction. These genes encode regulatory factors or enzymes linked to biochemical pathways critical for reproduction; this particular statement is not discussed further here. The loss of some predicted vomeronasal and olfactory receptor genes in human and naked mole-rat conforms to their specific anatomical features. We consider these arguments to be relevant to the testing of our program on biological data.
Let us consider some of the 27 predicted meaningful genes. Certain mouse genes were lost completely (i.e., have no orthologs) or became pseudogenes in all long-lived species; however, their close paralogs were preserved in these species. For example, the Smpd5 gene (ENSMUSG00000071724) encodes neutral sphingomyelinase 5; it localizes to both mitochondria and the endoplasmic reticulum. Its expression in the adult testis is approximately 20 times higher than in any other organ. According to Mouse ENCODE , its expression in adult testis is 60 RPKM, while the expression in other organs does not exceed 3 RPKM. According to the Expression Atlas , the Smpd5 expression level in the testis varied from 110 to 185 TPM in six experiments but never exceeded 8 TPM in any other tissue. Its paralog Smpd3 (ENSMUSG00000031906), which encodes another neutral sphingomyelinase, is actively expressed in many mouse organs, including the brain, intestine, and testis. It does not stand out in terms of expression in reproduction-related organs and is not identified by our program. Sphingomyelinases hydrolyze the membrane phospholipid sphingomyelin into phosphocholine and ceramide, a component of transmembrane pores [28, 29]; Smpd3 is likely involved in the release of neuromediators in the synaptic cleft . In addition, sphingomyelinases can trigger apoptosis through the formation of ceramide pores in the mitochondrial membrane [31, 32, 33] and phospholipid transformation . Sphingomyelin participates in the formation of the intracellular mitochondrial network in skeletal muscle, which is significantly decelerated in NMR relative to the mouse. Hence, Smpd5 loss can serve as a prerequisite for neoteny. This and other human and NMR neotenic properties have been summarized and discussed in detail by Skulachev et al. [35, 36]. The relationship between the lifespan and certain sphingomyelinases in different animal species has been considered in [37, 38, 39].
The mouse gene Dbil5 (ENSMUSG00000038057), which encodes diazepam binding inhibitor-like 5, is massively expressed in the testis (a 100-fold expression difference is shown in six experiments in  compared to other organs). This protein, located in the cytoplasm, binds fatty-acyl-CoA and mediates spermatogenesis . In humans, it corresponds to the pseudogene DBIL5P; the neighboring genes were also preserved. In the Damaraland mole-rat (DMR, Fukomys damarensis), the neighboring genes were preserved, but there is no ortholog of mouse Dbil5. Genes orthologous to Dbil5 are found in the African bush elephant, most Laurasiatheria species including the little brown bat, the rabbit, most rodents including the BMR, and many primates including the bushbaby, mouse lemur, Coquerel’s sifaka, macaque, and gibbon. The alignment of amino acid sequences suggests that the proteins in the degu, chinchilla, macaque, and gibbon substantially differ from those in the bushbaby, mouse lemur, Coquerel’s sifaka, rabbit, and some rodents. In other words, the proteins in the lemurs (bushbaby and mouse lemur) and some rodents form a cluster of similar proteins, while these proteins in the degu, chinchilla, macaque, and gibbon substantially differ from those in the cluster as well as from each other. Thus, a second protein cluster is not formed.
The mouse gene Tmbim7 (ENSMUSG00000014529) encodes a transmembrane protein, a putative apoptosis inhibitor; it is actively expressed in the mouse testis (a 100-fold difference in eight experiments in  compared to other organs) and corresponds to the human pseudogene TMBIM7P. The gene corresponds to the pseudogene ENSGL00000028944 in the female NMR; however, the corresponding gene was preserved in the DMR. Without regard to synteny, the gene aligns with the gene TMBIM6, which is overexpressed in certain cancer types and suppresses Bax-induced apoptosis.
The loss of the flehmen response and olfactory receptors (abundantly represented in Additional file 2) in primates relative to other mammals is thought to be associated with reproduction but also with the acquisition of trichromatic vision .
The following two genes are identified by the program using different parameter q values than those considered above, q = 1 and q = 0, respectively.
The mouse gene Wap (ENSMUSG00000000381) has been lost completely or has become a pseudogene in the NMR, DMR, human, bonobo, chimpanzee, western gorilla, white-cheeked gibbon, and white-headed capuchin. It has an ortholog in a single short-lived primate (vervet monkey) and exists in the Bolivian squirrel monkey. The Wap gene encodes the whey acidic protein (WAP). WAP was described in bovine colostrum , its minor quantities are detectable in mouse milk , and it constitutes a considerable fraction of the total protein in rabbit milk [44, 45]. This protein inhibits Staphylococcus aureus growth and regulates the proliferation of mammary epithelial cells by preventing serine protease from degrading laminin, thus suppressing the MAP kinase pathway [46, 47]. WAP expression is upregulated during lactation . The absence of WAP in human and the NMR agrees with the low levels of total protein in their milk relative to mice, rats, and rabbits [44, 49]. Its expression levels of 18 and 53 TPM were demonstrated in the mammary gland in experiments, while it never exceeded 3 TPM in other tissues .
The mouse gene Wfdc16 (ENSMUSG00000070530) has orthologs only in murine rodents and some Laurasiatheria species. In particular, it was lost in the NMR, BMR, DMR, chinchilla, and all primates. Wfdc16 is most actively expressed in the epididymis and pregnant uterus. The protein product is thought to be an inhibitor of peptidases. The rat has at least four WFDC-containing proteins; three of them, Wfdc8, Wfdc11, and Wfdc16, are expressed only in the epididymis . Their expression depends on the levels of sex hormones . The action of at least one of these proteins, Wfdc8, pertains to antimicrobial defense . According to , its expression in the genital fat pad (120 RPKM) is 30 times higher than in organs not directly related to reproduction (4 RPKM in kidneys), and there is also significant expression (7 RPKM) in the testis. According to , one experiment revealed that its expression in the epididymis was 330 TPM, which is 20 times that in the kidney and other tissues not directly related to reproduction (but 21 TPM in the testis and 20 TPM in the uterus). The biological properties, primarily the tissue-specific expression of the identified genes, are detailed in Additional file 3. These results indicate that the genes identified by our program directly relate to reproduction and, according to Williams’ theory , can influence the lifespan.
Let us return to the discussion of predicted genes in general. Low levels of their expression are commonly observed in many organs, which can be potentially harmful. Certain age-related (e.g., oncological and neurodegenerative) diseases can progress slowly so as not to affect young organisms . Accordingly, individuals of short-lived species largely do not survive to the age when the disease becomes a problem. This exemplifies the antagonistic pleiotropy, i.e., the opposite effects of the same gene during ontogeny .
On the other hand, the loss of certain genes might not be a factor of increased lifespan but rather may result from an increased lactation period and reproductive age. This indicates that a gene loss can accompany longevity without contributing to its cause.
According to the common point of view, rearrangements in the network of gene cis-regulatory elements are considered to be the main route for the evolutionary changes that lead to the phenotypic and physiological differences between species, particularly in vertebrates . Unfortunately, this kind of genomic rearrangement is difficult to reveal by systematic screening using bioinformatic methods. The identification of cis-regulatory elements is a complex task, and mutations are not necessarily linked to the functioning of the corresponding gene. Clearly, gene losses do not exclude such evolutionary events as well.
The limitations of the method include a certain dependence on the results of the orthology inference method and some sensitive parameters, such as the vicinity size or the number of witnesses in the upper and lower species. Accordingly, we selected only genes of the reference species that were lost for all or most of these parameters.
The developed method and its software allowed us to identify a short list of presumably lost genes (i.e., genes with disrupted orthology or local synteny) associated with a long lifespan in Euarchontoglires. The predicted lost genes largely demonstrate specific expressions in reproductive organs, which agrees with Williams’ hypothesis  concerning the reallocation of the physiological resources of the body between self-maintenance and reproduction (transition to К-strategy in the species evolution). The loss of some predicted vomeronasal and olfactory receptor genes in human and naked mole-rat conforms to their specific anatomical features. We suggest that the loss of certain genes in evolution is one of the essential determinants of lifespan. Overall, it is a likely driving force for many aspects of species evolution in vertebrates.
Thus, gene X is present in the set S if it is present in at least k species from S (k is a set-specific parameter). The presence of gene X from the reference species in another species means that (i) the latter species includes the gene X’ which is an ortholog of X and (ii) there are several other genes (“witnesses”) near gene X that have respective orthologs located near X’ in that species (see Fig. 1). If either condition is not satisfied, the gene is considered lost. Here, the orthologs may be of any type (1-to-1, 1-to-many, many-to-many), and the program can optionally accept a gene X’ that is not orthologous to Х but is a paralog of an ortholog of gene Х. Thus, synteny is taken into account in addition to orthology and paralogy; the consideration of synteny significantly advances the instrumentation for the identification of genes lost (or gained) in evolution. The program lossgainRSL does not rely on a certain orthology inference method and uses orthology data at the input.
The abovementioned predicate is a Boolean function composed of such conditions using connectives AND, OR, and NOT; it is described below.
The sets of species, predicate, number of witnesses, other synteny details such as the values of r and r’ for each elementary condition, and orthology type are the program parameters. In particular, the gene order and direction in a synteny block can be the same (as shown in Fig. 1) or differ within the specified conditions. Here, we varied the r and r’ values from 1 to 5 Mbp.
All protein-coding genes from the reference species are tried one by one. For each gene X, the algorithm verifies the predicate until the first match; then, the gene is added to the general list of identified genes; otherwise, it is skipped. The resulting list includes genes from the reference species that are present in some species and are missing in others following the predicate. Proper selection of species groups and elementary conditions included in the predicate allows the program to identify both lost and acquired genes. The program is somewhat sensitive to the quality of the genome assembly: short contigs containing one or a few genes as well as genes located at the contig boundary can be difficult to process. The method does not require chromosome-level assembly; it can process contigs of arbitrary length.
The current version 6.20 of lossgainRSL is primarily targeted at species and genomic data included in the Ensembl database , but it also provides for other orthology data as described in the program manual. The program uses three types of input data, that is, three tables, one of genes, one of orthologs, and one of paralogs for each relevant species. The table of paralogs is necessary if the selection of orthologous genes should be performed with the paralogs considered. A species from GenBank can also be added using the utility ‘addspecies’, which is also available at the program site. The lossgainRSL output is tab delimited and can be imported into Excel. This file contains all identified genes of the reference species that satisfy the predicate with the given parameters. The output can optionally contain the entire synteny block (i.e., including the witnesses) for each gene as well as their orthologs in other species.
Thus, this algorithm takes into account the two groups of Euarchontoglires species mentioned in the Results section. Each group includes the upper and lower fractions, which are tentative long- and short-lived species, respectively. These groups and fractions are detailed below. In addition, the NMR, human, and capuchin are considered compulsory species where gene X must be absent.
The corresponding configuration file is given in Additional file 4. It illustrates the potential of the language describing the predicates and was designed to reflect how the predicate is satisfied in detail. Specifically, all but the compulsory species will be shown in the output file. Only the Ensembl ID is given for orthologous genes in those species, but the configuration file can be easily changed to return more information, such as the names, coordinates, and annotations of these genes (as was done for the mouse genes). In addition, the identified gene and its orthologs can be supplemented by the whole syntenic block containing neighboring witnesses in each species. All 134 genes identified using this configuration file are listed in Additional file 2.
The program has been extensively parallelized and can operate on a supercomputer, which is essential if many complete genomes are considered or if synteny blocks consist of many neighboring genes. However, in the case of Ensembl v95 data and the program parameters listed in the Results section, the execution time did not exceed 10 min on a PC with a 6-core Xeon processor.
Here, the method for the identification of lost genes described in the first section of the Results and in this section was applied in the context of the species lifespan (longevity) as follows. Forty Euarchontoglires species falling into two groups were considered: the first included 16 rodents and lagomorphs, and the second, 24 primates (see Table 1). The upper and lower fractions of species were defined by the quantitative longevity index. Several indices are known: the longevity quotient (LQ), the maximum longevity residual (MLR), the maximum reported lifespan (MRLS), etc. Here, we used the AnAge build 14 database  and its index, referred to as LQ below. It is calculated using the formula MRLS/(4.88∙m0.153), where m is the adult body weight in grams as specified in the database. We have selected the species listed in Table 1 considering their representation in Ensembl and the quality of the genomes addressed in Additional file 5: Table S2, which does not always match a biologically justified selection. Among numerous publications on the relationship between MRLS and body weight and related problems, one should mention [21, 55, 56].
While compiling the two sets, that is, long- and short-lived species, we arranged species in Table 1 as follows: species were sorted by descending LQ separately for glires and primates, after which their upper fractions were joined and opposed to the joined lower fractions. Specifically, we included in the upper set three rodents with LQ > 130% and seven primates with LQ > 220% and accepted that the NMR, human and white-headed capuchin are compulsory species where the gene must be absent. The lower set consisted of eleven rodents and lagomorphs with LQ < 110% and ten primates with LQ < 190%. The mouse was a reference species.
We do not discuss the global biological problem of longevity because it combines many complex phenomena including resistance to cancer. The following opinion  illustrates this problem well: “Aging is a complex process affecting different species and individuals in different ways. Comparing genetic variation across species with their aging phenotypes will help to understand the molecular basis of aging and longevity. Although most studies on aging have so far focused on short-lived model organisms, recent comparisons of genomic, transcriptomic, and metabolomic data across lineages with different lifespans are unveiling molecular signatures associated with longevity.” We believe that the quote applies well to the proposed program.
The research was carried out using supercomputers at the Joint Supercomputer Center of the Russian Academy of Sciences (JSCC RAS).
LIR, GAS, and VAL conceived and designed the work. AGZ participated in the elaboration of the biological concept concerning the importance of local synteny. GAS and OAZ gathered and analyzed the input data. LIR implemented the programs and carried out the computations. AVS and GAS analyzed the results and wrote the first draft of the manuscript. VAL and LIR wrote the final text. All authors reviewed and approved the final manuscript.
The reported study was funded by the RFBR according to the research project No. 18–29-13037. The idea of taking the coherent account of synteny and orthology of genes was partially supported by the RFBR project No. 18-29-07014 MK (AGZ).
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
- 3.Korotkova DD, Lyubetsky VA, Ivanova AS, Rubanov LI, Seliverstov AV, Zverkov OA, Martynova NYu, Nesterenko AM, Tereshina MB, Peshkin L, Zaraisky AG. Bioinformatics screening of genes specific for well-regenerating vertebrates reveals c-answer, a regulator of brain development and regeneration. Cell Reports. 2019;29(4):1027–40.CrossRefGoogle Scholar
- 6.Rubanov L, Seliverstov A, Lyubetsky V. lossgainRSL: a program for prediction of gene losses and gains between several groups of species [Internet]. figshare; 2019 [Cited 2019Aug2]. Available from: https://doi.org/10.6084/m9.figshare.9173243.v2.
- 14.Lieberman-Aiden E, van Berkum NL, Williams L, Imakaev M, Ragoczy T, Telling A, Amit I, Lajoie BR, Sabo PJ, Dorschner MO, Sandstrom R, Bernstein B, Bender MA, Groudine M, Gnirke A, Stamatoyannopoulos J, Mirny LA, Lander ES, Dekker J. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science. 2009;326:289–93. https://doi.org/10.1126/science.1181369.CrossRefPubMedPubMedCentralGoogle Scholar
- 19.Weigl R. Longevity of Mammals in Captivity; from the Living Collections of the World. Kleine Senckenberg-Reihe vol. 48. Stuttgart: 2005.Google Scholar
- 21.Muntané G, Farré X, Rodríguez JA, Pegueroles C, Hughes DA, de Magalhães JP, Gabaldón T, Navarro A. Biological processes modulating longevity across primates: a phylogenetic genome-phenome analysis. Mol Biol Evol. 2018;35(8):1990–2004. https://doi.org/10.1093/molbev/msy105.CrossRefPubMedPubMedCentralGoogle Scholar
- 25.Zerbino DR, Achuthan P, Akanni W, Amode MR, Barrell D, Bhai J, Billis K, Cummins C, Gall A, Girón CG, Gil L, Gordon L, Haggerty L, Haskell E, Hourlier T, Izuogu OG, Janacek SH, Juettemann T, To JK, Laird MR, Lavidas I, Liu Z, Loveland JE, Maurel T, McLaren W, Moore B, Mudge J, Murphy DN, Newman V, Nuhn M, Ogeh D, Ong CK, Parker A, Patricio M, Riat HS, Schuilenburg H, Sheppard D, Sparrow H, Taylor K, Thormann A, Vullo A, Walts B, Zadissa A, Frankish A, Hunt SE, Kostadima M, Langridge N, Martin FJ, Muffato M, Perry E, Ruffier M, Staines DM, Trevanion SJ, Aken BL, Cunningham F, Yates A, Flicek P. Ensembl 2018. Nucleic Acids Res 2018;46:D754–D761. doi: https://doi.org/10.1093/nar/gkx1098.CrossRefGoogle Scholar
- 26.Papatheodorou I, Fonseca NA, Keays M, Tang YA, Barrera E, Bazant W, Burke M, Füllgrabe A, Fuentes AM, George N, Huerta L, Koskinen S, Mohammed S, Geniza M, Preece J, Jaiswal P, Jarnuczak AF, Huber W, Stegle O, Vizcaino JA, Brazma A, Petryszak R. Expression atlas: gene and protein expression across multiple studies and organisms. Nucleic Acids Res. 2018;46:D246–51. https://doi.org/10.1093/nar/gkx1158.CrossRefPubMedGoogle Scholar
- 27.Yue F, Cheng Y, Breschi A, Vierstra J, Wu W, Ryba T, Sandstrom R, Ma Z, Davis C, Pope BD, Shen Y, Pervouchine DD, Djebali S, Thurman B, Kaul R, Rynes E, Kirilusha A, Marinov GK, Williams BA, Trout D, Amrhein H, Fisher-Aylor K, Antoshechkin I, DeSalvo G, See LH, Fastuca M, Drenkow J, Zaleski C, Dobin A, Prieto P, Lagarde J, Bussotti G, Tanzer A, Denas O, Li K, Bender MA, Zhang M, Byron R, Groudine MT, McCleary D, Pham L, Ye Z, Kuan S, Edsall L, Wu YC, Rasmussen MD, Bansal MS, Keller CA, Morrissey CS, Mishra T, Jain D, Dogan N, Harris RS, Cayting P, Kawli T, Boyle AP, Euskirchen G, Kundaje A, Lin S, Lin Y, Jansen C, Malladi VS, Cline MS, Erickson DT, Kirkup VM, Learned K, Sloan CA, Rosenbloom KR, de Sousa BL, Beal K, Pignatelli M, Flicek P, Lian J, Kahveci T, Lee D, Kent WJ, Santos MR, Herrero J, Notredame C, Johnson A, Vong S, Lee K, Bates D, Neri F, Diege M, Canfield T, Sabo PJ, Wilken MS, Reh TA, Giste E, Shafer A, Kutyavin T, Haugen E, Dunn D, Reynolds AP, Neph S, Humbert R, Hansen RS, De Bruijn M, Selleri L, Rudensky A, Josefowicz S, Samstein R, Eichler EE, Orkin SH, Levasseur D, Papayannopoulou T, Chang KH, Skoultchi A, Gosh S, Disteche C, Treuting P, Wang Y, Weiss MJ, Blobel GA, Good PJ, Lowdon RF, Adams LB, Zhou XQ, Pazin MJ, Feingold EA, Wold B, Taylor J, Kellis M, Mortazavi A, Weissman SM, Stamatoyannopoulos J, Snyder MP, Guigo R, Gingeras TR, Gilbert DM, Hardison RC, Beer MA, Ren B, The mouse ENCODE Consortium. A comparative encyclopedia of DNA elements in the mouse genome. Nature. 2014. 515(7527):355–64. doi: https://doi.org/10.1038/nature13992.CrossRefGoogle Scholar
- 34.Wu BX, Rajagopalan V, Roddy PL, Clarke CJ, Hannun YA. Identification and characterization of murine mitochondria-associated neutral sphingomyelinase (MA-nSMase), the mammalian sphingomyelin phosphodiesterase 5. J Biol Chem. 2010;285(23):17993–8002. https://doi.org/10.1074/jbc.M110.102988.CrossRefPubMedPubMedCentralGoogle Scholar
- 51.Yamazaki K, Adachi T, Sato K, Yanagisawa Y, Fukata H, Seki N, Mori C, Komiyama M. Identification and characterization of novel and unknown mouse epididymis-specific genes by complementary DNA microarray technology. Biol Reprod. 2006;75(3):462–8. https://doi.org/10.1095/biolreprod.105.048058.CrossRefPubMedGoogle Scholar
- 55.Harper JM, Salmon AB, Leiser SF, Galecki AT, Miller RA. Skin-derived fibroblasts from long-lived species are resistant to some, but not all, lethal stresses and to the mitochondrial inhibitor rotenone. Aging Cell. 2007;6(1):1–13. https://doi.org/10.1111/j.1474-9726.2006.00255.x.CrossRefPubMedGoogle Scholar
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.