Genome-wide prediction and prioritization of human aging genes by data fusion: a machine learning approach
Machine learning can effectively nominate novel genes for various research purposes in the laboratory. On a genome-wide scale, we integrated multiple databases and algorithms to predict and prioritize the human aging genes (PPHAGE).
We fused data from 11 databases and used a Naïve Bayes (NB) classifier together with three positive unlabeled learning (PUL) methods, NB, Spy, and Rocchio-SVM, to rank human genes with respect to their implication in aging. The PUL methods enabled us to identify a list of negative (non-aging) genes to use alongside the seed (known age-related) genes in the ranking process. Comparison of the PUL algorithms revealed that no single method for identifying negative samples was advantageous over the others, and that their simultaneous use in the form of a fusion was critical for obtaining optimal results (PPHAGE is publicly available at https://cbb.ut.ac.ir/pphage).
We predict and prioritize over 3,000 candidate age-related genes in humans, based on significant ranking scores. The identified candidate genes are associated with pathways, ontologies, and diseases that are linked to aging, such as cancer and diabetes. Our data offer a platform for future experimental research on the genetic and biological aspects of aging. Additionally, we demonstrate that fusion of PUL methods and data sources can be successfully used for aging and disease candidate gene prioritization.
Keywords: Genome-wide; Prioritization; Human aging genes; Positive unlabeled learning; Machine learning
GWAS: Genome-Wide Association Study
Percentage of Variance
PPHAGE: Prediction and Prioritization of Human Aging Genes
ROC: Receiver Operating Characteristic
Prior understanding of the genetic basis of a disease is a crucial step toward better diagnosis and treatment of that disease. Machine learning methods help specialists and biologists use functional or inherent properties of genes to select candidate genes. A question that researchers may face is why research aims at identifying pathogenic rather than non-pathogenic genes. The answer may lie in the fact that genes introduced as non-pathogenic may later be documented as disease genes.
Biologists apply computational and mathematical methods and algorithms to develop machine learning methods for identifying novel candidate disease genes. Based on the principle of “guilt by association”, similar or identical diseases share genes that are very similar in function or intrinsic properties, or that have direct physical protein-protein interactions. Most methods of predicting candidate genes employ various biological data, such as protein sequence, functional annotation, gene expression, protein-protein interaction networks, regulatory data, and even orthology and conservation data, to identify similarities in line with this principle. These methods are categorized as unsupervised, supervised, and semi-supervised. Unsupervised methods cluster the genes based on their proximity and similarity to the known disease genes, and rank them by various methods. Supervised methods create a boundary between disease genes and non-disease genes, and use this boundary to select candidate genes. Several studies have addressed different aspects of the methodology and have expanded the use of various methods and tools [3, 7, 8, 9, 10, 11, 12].
The tools that are available for candidate gene prioritization can be classified with respect to efficiency, computational algorithms, data sources, and availability [13, 14, 15]. Available prioritization tools can be categorized into specific and general tools. Specific tools are used to prioritize candidate genes associated with a specific disease. These tools employ information related to a specific tissue involved in the disease, or other disease-related information. General tools can be applied to most diseases, and often use various data sources. Gene prioritization tools can also be divided into two types: single-species and multi-species. Single-species tools are only usable for a specific species, such as human or mouse. Multi-species tools can prioritize candidate genes in several different species. For example, the ENDEAVOUR software can prioritize candidate genes in six different species. With respect to computational algorithms, candidate prioritization tools are primarily divided into two groups: complex network-based methods and similarity-based methods. The inevitable incompleteness of biological data sources, and the errors they contain, necessitate fusion of multiple data sources. Most gene prioritization methods, therefore, use multiple data sources to improve performance.
The purpose of this study was to design a machine learning model to identify and prioritize novel candidate aging genes in humans. We examined the existing machine learning methods for identifying human non-aging (negative) genes, and then built a binary classifier for predicting novel candidate genes, based on the positive and negative training genes. Gene ranking was based on the principle of similarity among positive genes through “guilt by association”. Thus, among the unlabeled genes, those that were least similar to the known genes were used as the negative sample.
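The idea of taking the least similar unlabeled genes as negatives can be illustrated with a minimal sketch. It is a simplified, Rocchio-style illustration and not the authors' implementation: the cosine similarity to the positive centroid, the function names, and the toy gene vectors are all assumptions made for this example.

```python
from math import sqrt

def centroid(vectors):
    """Component-wise mean of equal-length feature vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity; defined here as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def least_similar(positives, unlabeled, k):
    """Return the k unlabeled gene names least similar to the positive centroid."""
    c = centroid(list(positives.values()))
    ranked = sorted(unlabeled, key=lambda g: cosine(unlabeled[g], c))
    return ranked[:k]

# Toy binary feature vectors (hypothetical genes, invented for illustration).
positives = {"seedA": [1, 1, 0, 0], "seedB": [1, 0, 1, 0]}
unlabeled = {"g1": [1, 1, 1, 0], "g2": [0, 0, 0, 1], "g3": [0, 1, 0, 1]}

reliable_negatives = least_similar(positives, unlabeled, k=2)
print(reliable_negatives)
```

In this toy case the genes sharing no or few features with the seeds are selected, while `g1`, which resembles both seeds, stays unlabeled.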
Datasets used to evaluate reliable negative sample extraction algorithms (columns: data set name, number of instances, number of attributes)
Parkinson’s Disease Classification Data Set 
Liver Disorders Data Set 
Cloud Data Set 
Ionosphere Data Set 
MAGIC Gamma Telescope Data Set 
Mammographic Mass Data Set 
Breast Cancer Wisconsin (Diagnostic) Data Set 
Connectionist Bench (Sonar, Mines vs. Rocks) Data Set 
We also randomly selected 70% of the positive samples as the training set, and the remainder as the test set. To train the classifier, equal numbers of positive and negative samples were selected, to ensure that the classifier was not biased at the training step. We then compared the three algorithms on eight data sets extracted from the UCI database (Additional file 1).
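The split described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `split_and_balance`, the fixed random seed, and the toy sample identifiers are hypothetical and do not come from the original study.

```python
import random

def split_and_balance(positives, negatives, train_fraction=0.7, seed=0):
    """Split positives into train/test and draw a matching number of negatives."""
    rng = random.Random(seed)
    pos = list(positives)
    rng.shuffle(pos)
    cut = int(round(train_fraction * len(pos)))
    train_pos, test_pos = pos[:cut], pos[cut:]
    # Draw as many negatives as training positives
    # (or all negatives if fewer are available).
    n_neg = min(len(train_pos), len(negatives))
    train_neg = rng.sample(list(negatives), n_neg)
    return train_pos, train_neg, test_pos

# Toy identifiers, invented for illustration.
positives = [f"pos{i}" for i in range(10)]
negatives = [f"neg{i}" for i in range(20)]

train_pos, train_neg, test_pos = split_and_balance(positives, negatives)
print(len(train_pos), len(train_neg), len(test_pos))
```

Keeping the two classes the same size in training is one simple way to avoid a classifier that favors the majority class.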
Performance evaluation of the reliable negative sample extraction algorithms
MAGIC Gamma Telescope
Breast Cancer Wisconsin
Connectionist Bench (Sonar, Mines vs. Rocks)
Model performance evaluation by Naïve Bayes on the aging data (column: F-measure, %)
Performance evaluation comparison of multiple binary classifiers on the aging data (columns: TP rate, %; F-measure, %)
Performance evaluation comparison of multiple binary classifiers on the aging data after feature selection (columns: TP rate, %; F-measure, %)
Assessing model accuracy is very difficult when the separate test set to which the model is applied contains only positive and unlabeled samples. This challenge is critical in cases that lack negative samples. We therefore compared the evaluation metric across the data sources. We applied all 10 models from the training step to predict the remaining genes, and extracted the genes that all 10 models identified as positive, yielding a total of 3531 final candidate genes.
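The consensus step, keeping only genes that every model labels positive, amounts to a set intersection. The sketch below uses three toy models and invented gene names purely for illustration; the function name `consensus_positives` is an assumption.

```python
def consensus_positives(predictions_per_model):
    """Return the genes labeled positive by every model."""
    sets = [set(p) for p in predictions_per_model]
    if not sets:
        return set()
    return set.intersection(*sets)

# Toy positive predictions from three hypothetical models.
model_predictions = [
    {"g1", "g2", "g3"},
    {"g2", "g3", "g4"},
    {"g2", "g3"},
]

candidates = consensus_positives(model_predictions)
print(sorted(candidates))
```

Requiring unanimous agreement is a conservative choice: it trades recall for fewer false positives in the final candidate list.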
(the list of seed genes in the form of K-fold with K = 3 was used for the mentioned tools). Two metrics were considered for comparing the tools with the proposed model. The first metric was the average rank of the seed genes, and the second was the number of seed genes found among the top 10, 50, 100, 500, and 1000 entries of each list.
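The two comparison metrics can be sketched as below. The sketch assumes that each tool outputs a ranked list of unique gene names, best first; the function names and the toy ranking are hypothetical.

```python
def average_rank(ranked_genes, seed_genes):
    """Mean 1-based rank of the seed genes that appear in the ranked list."""
    position = {g: i for i, g in enumerate(ranked_genes, start=1)}
    ranks = [position[g] for g in seed_genes if g in position]
    return sum(ranks) / len(ranks) if ranks else None

def seeds_in_top_k(ranked_genes, seed_genes, cutoffs=(10, 50, 100, 500, 1000)):
    """Count how many seed genes fall within the top k entries, for each cutoff k."""
    seeds = set(seed_genes)
    return {k: sum(1 for g in ranked_genes[:k] if g in seeds) for k in cutoffs}

# Toy ranking of 20 hypothetical genes, best first.
ranking = [f"g{i}" for i in range(1, 21)]
seeds = ["g1", "g5", "g12"]

mean_rank = average_rank(ranking, seeds)
top_counts = seeds_in_top_k(ranking, seeds, cutoffs=(5, 10, 20))
print(mean_rank, top_counts)
```

A lower average rank and higher top-k counts both indicate that a tool places known aging genes nearer the top of its list.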
Number of detected seed genes in comparison to the output of tools
Average rank of the seed genes in comparison to the output of tools
The top 25 human candidate aging genes
Diabetes Mellitus, Non-Insulin-Dependent
ATPase Phospholipid Transporting
Serine And Arginine Rich Splicing Factor
Diabetes Mellitus, Non-Insulin-Dependent
Cataract, autosomal recessive congenital 2
Diabetes Mellitus, Non-Insulin-Dependent
Coronary heart disease
Hereditary Diffuse Gastric Cancer
Coronary heart disease
Increased gastric cancer
On a genome-wide scale, we used three PUL methods to develop an approach for separating human aging genes from other genes. Combining several methods by fusing their outputs was advantageous over using any single method.
The following are examples of the identified genes and the experimental or GWAS links between these genes and aging. On the list of the top 25 genes, NAP1L4 encodes a member of the nucleosome assembly protein (NAP) family, which interacts with both core and linker histones, and shuttles between the cytoplasm and nucleus, suggesting a role as a histone chaperone. Histone protein levels decline during aging, and dramatically affect chromatin structure. Remarkably, lifespan can be extended by manipulations that reverse the age-dependent changes to chromatin structure, indicating the pivotal role of chromatin structure in aging. In another example, gene expression of NAP1L4 increases with age in skin tissue. GWAS findings link a number of the identified genes to age-related disorders, such as GAB2 to late-onset Alzheimer’s disease, and QKI to coronary heart disease/myocardial infarction. Interestingly, GWAS reports also link QKI to successful aging.
RPL3 encodes a ribosomal protein that is a component of the 60S subunit. The encoded protein belongs to the L3P family of ribosomal proteins, and its gene expression increases during aging of skeletal muscle. In another example, FZD5 is involved in prostate cancer, which is the most common malignancy in older men. ATP8A2 is another gene subject to deterioration and loss of function over time. RYR2 (Additional file 3) encodes a ryanodine receptor found in cardiac muscle sarcoplasmic reticulum. Mutations in this gene are associated with stress-induced polymorphic ventricular tachycardia and arrhythmogenic right ventricular dysplasia. In addition, methylation analysis of CpG sites in DNA from blood cells showed a positive correlation between RYR2 and age. In additional examples, differential expression with age was identified for BCAS3, TUFM, and DST in the skin. Gene expression analysis revealed a significant increase in the expression of hippocampal TLR3 in cells from elderly individuals (aged 69–99 years) compared with cells from younger individuals (aged 20–52 years). Similarly, differential expression with age was identified for RORA in adipose tissue.
To investigate the implication of the identified candidate genes in aging, we conducted a comprehensive analysis of 330 human pathways in KEGG. Each pathway was examined for the seed and candidate genes, and direct associations were detected in a number of instances. For example, IL10 activates STAT3 in the FOXO signaling pathway. In another example, GAB2 has a regulatory role for PLCG2 in the osteoclast differentiation pathway, as well as an activating role in the chronic myeloid leukemia pathway. Likewise, FOS is an expression target for IL10 in the T cell receptor signaling pathway.
Indicative diseases associated with the candidate aging genes
Indicative biological pathways associated with the candidate aging genes
Pathways in cancer (Homo sapiens, hsa05200)
Proteoglycans in cancer (Homo sapiens, hsa05205)
Epstein-Barr virus infection (Homo sapiens, hsa05169)
Regulation of actin cytoskeleton (Homo sapiens, hsa04810)
HTLV-I infection (Homo sapiens, hsa05166)
Protein processing in endoplasmic reticulum (Homo sapiens, hsa04141)
Herpes simplex infection (Homo sapiens, hsa05168)
PI3K-Akt signaling pathway (Homo sapiens, hsa04151)
Focal adhesion (Homo sapiens, hsa04510)
Indicative diseases associated with the reliable negative genes
Based on the principle that similar disease genes are likely to have similar characteristics, some machine learning methods have been employed to predict new disease genes from known disease genes. Previous approaches developed a binary classification model that used known disease genes as a positive training set and unknown genes as a negative training set. However, the negative sets were often noisy, because the unknown genes could include both healthy genes and as yet unidentified positive genes. Therefore, the results presented by these methods may not be reliable. Using computational machine learning methods and similarity metrics, we identified reliable negative samples, and then used these samples with a two-class classifier to identify novel positive aging genes in humans.
We integrated 11 databases and several machine learning methods to rank all human genes, and predicted and prioritized over 3,000 novel candidate age-related genes based on significant ranking scores. These genes were supported by biological, ontology, and disease enrichment analyses. Future experimental research is warranted to verify the significance of the identified genes in human aging.
Bayesian classifiers, which work explicitly with the probabilities of different hypotheses, have provided useful practical solutions. Among them, the NB classifier is one of the most efficient and effective algorithms available for certain learning problems.
The NB classifier can compete with other algorithms, and in some cases it outperforms them. An NB classifier can be viewed as a simple Bayesian network that assumes the features are conditionally independent of one another given the class. We chose NB based on the structure and nature of the data, the independent nature of each data source, the high volume of the data, and the binary features.
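To make the independence assumption concrete, the following is a minimal sketch of a Bernoulli Naïve Bayes classifier for binary features. It is not the authors' implementation: the Laplace smoothing parameter `alpha`, the use of log posterior odds as a ranking score, and the toy training data are assumptions chosen for illustration.

```python
from math import log

def train_bernoulli_nb(X, y, alpha=1.0):
    """Estimate log-priors and smoothed P(feature = 1 | class) for classes 0 and 1.

    Assumes both classes occur at least once in y.
    """
    model = {}
    n_features = len(X[0])
    for c in (0, 1):
        rows = [x for x, label in zip(X, y) if label == c]
        prior = len(rows) / len(X)
        # Laplace-smoothed probability that feature j equals 1 in class c.
        theta = [(sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
                 for j in range(n_features)]
        model[c] = (log(prior), theta)
    return model

def log_odds_positive(model, x):
    """Log posterior odds of class 1 versus class 0 for a binary vector x."""
    scores = {}
    for c, (log_prior, theta) in model.items():
        s = log_prior
        # Independence given the class: per-feature log-likelihoods simply add up.
        for xj, t in zip(x, theta):
            s += log(t) if xj else log(1 - t)
        scores[c] = s
    return scores[1] - scores[0]

# Toy training data: 1 = positive (seed) gene, 0 = reliable negative gene.
X = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]]
y = [1, 1, 0, 0]

model = train_bernoulli_nb(X, y)
score_like_positive = log_odds_positive(model, [1, 1, 0])
score_like_negative = log_odds_positive(model, [0, 0, 1])
print(score_like_positive, score_like_negative)
```

Sorting unlabeled genes by such a score, highest first, would give one possible ranking; the exact scoring used in PPHAGE may differ.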
Comparison of the evaluation metric across data sources
Since our main data did not contain any negative samples, training a model to identify and prioritize new positive genes was based on the three PUL algorithms. An NB classifier was then trained on the extracted reliable negative samples and the positive genes. Genes were assigned positive labels for the final ranking, using the weighting method according to the available data.
Data sources used in Naïve Bayes classifier for candidate aging genes
Data source name
Ageing-related information obtained by both manual and automatic extraction from the scientific literature.
The list of all functional annotations.
The list of biological pathways.
The Biological Process, Molecular Function, and Cellular Component vocabularies.
The list of all ageing-related phenotypes and associated genes.
The chromosome number, location, gene segment, gene type, etc.
The list of all known active sites, binding sites, chains, etc.
The list of genes that had a physical interaction with each of the positive genes.
Ageing-related expression data, including tissue type, overexpression and underexpression, etc.
The list of all regulatory relationships, such as miRNAs, transcription factors, etc.
The catalog of orthologous protein-coding genes across vertebrates and known conserved domains.
The vector of binary features consisted of 11 main parts, each corresponding to one of the data sources. Each attribute within a data source was a Boolean value: if a gene had the attribute, it scored 1, and otherwise it scored 0 (Table 2). For example, the biological pathway part contained 330 attributes, each corresponding to a human pathway in KEGG. If the gene of interest was located in a given pathway, it scored 1 for that attribute, and otherwise it scored 0. Likewise, for the interaction network data, a gene scored 1 for each positive gene with which it had a physical interaction, and 0 otherwise. These data were extracted from the STRING and HPRD databases.
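The encoding can be illustrated with a short sketch that concatenates one block of 0/1 values per data source. Only two toy sources are shown instead of 11, and the pathway names, gene names, and the `encode_gene` helper are placeholders invented for this example.

```python
def encode_gene(gene, sources):
    """Concatenate one 0/1 block per data source into a single feature vector.

    sources: list of (attribute_names, membership) pairs, where membership
    maps each attribute name to the set of genes that have that attribute.
    """
    vector = []
    for attributes, membership in sources:
        vector.extend(1 if gene in membership.get(a, set()) else 0
                      for a in attributes)
    return vector

# Toy pathway source: one attribute per pathway (placeholder names).
pathway_attrs = ["pathway_A", "pathway_B"]
pathway_members = {"pathway_A": {"GENE1", "GENE2"}, "pathway_B": {"GENE3"}}

# Toy interaction source: one attribute per positive (seed) gene.
interaction_attrs = ["SEED1", "SEED2"]
interaction_partners = {"SEED1": {"GENE3"}, "SEED2": {"GENE1", "GENE3"}}

sources = [(pathway_attrs, pathway_members),
           (interaction_attrs, interaction_partners)]

vec_gene1 = encode_gene("GENE1", sources)
vec_gene3 = encode_gene("GENE3", sources)
print(vec_gene1, vec_gene3)
```

Because every source contributes a fixed-length block, all genes end up with vectors of the same length, which is what a binary-feature classifier requires.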
In addition, eight validated data sets from the UCI database (https://archive.ics.uci.edu/ml/index.php) were used to evaluate the efficiency of the algorithms. In each data set, one of the classes with a high sample frequency was treated as unlabeled data. Using the algorithms, we identified negative samples and compared them to the original data (Table 3).
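One way to compare extracted negatives with the original labels is to measure what fraction of them are truly negative. The sketch below is an assumption about how such a comparison could be scored, not the metric reported in the study; the function name and the toy labels are hypothetical.

```python
def extraction_precision(extracted_ids, true_labels, negative_label=0):
    """Fraction of extracted 'reliable negatives' that are truly negative."""
    if not extracted_ids:
        return 0.0
    hits = sum(1 for i in extracted_ids if true_labels[i] == negative_label)
    return hits / len(extracted_ids)

# Toy ground truth from a fully labeled data set (1 = positive, 0 = negative).
true_labels = {"s1": 1, "s2": 0, "s3": 0, "s4": 1, "s5": 0}

# Samples that a hypothetical extraction algorithm flagged as reliable negatives.
extracted = ["s2", "s3", "s4"]

precision = extraction_precision(extracted, true_labels)
print(round(precision, 3))
```

Because the benchmark data sets are fully labeled, this kind of check is possible there even though it is not possible on the aging data itself.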
This research was jointly performed by the University of Social Welfare and Rehabilitation Sciences, Tehran, Iran, and University of Tehran, Iran.
MA designed the project, carried out the bioinformatics studies, and performed data analysis. MO contributed to supervision and data analysis, and wrote the manuscript. VRT participated in the statistical studies. AD helped in coordination. KK participated in designing the project and methodology, supervision, and coordination. All authors read and approved the final manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
- 1. Korf B, Rimoin D, O’Connor J, Pyeritz R. Nature and frequency of genetic disease. In: Rimoin D, O’Connor J, Pyeritz R, Korf B, editors. Principles and Practice of Medical Genetics. Amsterdam: Elsevier; 2008. pp. 49–51.
- 3. Al-Turaiki IM, et al. Computational approaches for gene prediction: a comparative survey. Berlin: Springer; 2011.
- 8. Lachmann R, Schulze S, Nieke M, Seidl C, Schaefer I. System-level test case prioritization using machine learning. In: 2016 15th IEEE International Conference on Machine Learning and Applications (ICMLA). IEEE; 2016. pp. 361–368.
- 12. Oneto L, Bunte K, Schleif FM. Advances in artificial neural networks, machine learning and computational intelligence: Selected papers from the 26th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN 2018). Neurocomputing. 2019;342:1–5.
- 22. Sigillito VG, et al. Classification of radar returns from the ionosphere using neural networks. Johns Hopkins APL Tech Dig. 1989;10(3):262–6.
- 25. Street WN, Wolberg WH, Mangasarian OL. Nuclear feature extraction for breast tumor diagnosis. In: Biomedical image processing and biomedical visualization. Int Soc Optics and Photonics. 1993;1905:861–870.
- 58. Lorenzo N, et al. APL-2, an altered peptide ligand derived from heat-shock protein 60, induces interleukin-10 in peripheral blood mononuclear cell derived from juvenile idiopathic arthritis patients and downregulates the inflammatory response in collagen-induced arthritis model. Clin Exp Med. 2015;15(1):31–9.
- 94. Zhang B, Zuo W. Learning from positive and unlabeled examples: a survey. In: 2008 International Symposiums on Information Processing; 2008.
- 95. Li X, Liu B. Learning to classify texts using positive and unlabeled data. In: Proceedings of the 18th international joint conference on Artificial intelligence. Acapulco: Morgan Kaufmann Publishers Inc.; 2003. p. 587–92.
- 96. Liu B, et al. Partially supervised classification of text documents. In: Proceedings of the Nineteenth International Conference on Machine Learning. Morgan Kaufmann Publishers Inc.; 2002. p. 387–94.
Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.