Abstract
Background
Inherent correlations among gene expression levels have received attention. Recently, it was reported that a set of approximately 1000 landmark genes can be used to predict the expression of other genes (target genes).
Objective
The objective of this study is to predict the expression values of target genes from the expression values of landmark genes.
Methods
A cluster-based regression method is proposed. In the proposed method, clusters are obtained from a set of training instances of a gene, and an estimator is trained per cluster. A test instance is assigned to one of the clusters, and the regression model corresponding to that cluster then predicts its expression value.
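The two-stage procedure described above can be sketched as follows. This is a minimal illustration, not the author's implementation: the abstract does not name the clustering algorithm or the per-cluster estimator, so k-means and ridge regression are assumed here, and all function names, parameters, and the synthetic data are hypothetical.

```python
import numpy as np


def assign(X, centroids):
    """Index of the nearest centroid (squared Euclidean distance) per row of X."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means (assumed clustering step); returns (k, d) centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = assign(X, centroids)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids


def fit_ridge(X, y, alpha=1.0):
    """Ridge regression (assumed estimator); for brevity the bias is penalized too."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
    A = Xb.T @ Xb + alpha * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ y)


def fit_cluster_regression(X, y, k=2, alpha=1.0):
    """Cluster the training instances, then fit one estimator per cluster."""
    centroids = kmeans(X, k)
    labels = assign(X, centroids)
    weights = [fit_ridge(X[labels == j], y[labels == j], alpha) for j in range(k)]
    return centroids, weights


def predict_cluster_regression(X, centroids, weights):
    """Route each test instance to its cluster and apply that cluster's model."""
    labels = assign(X, centroids)
    Xb = np.hstack([X, np.ones((len(X), 1))])
    W = np.stack(weights)  # shape (k, d + 1)
    return (Xb * W[labels]).sum(axis=1)


# Synthetic stand-in for landmark-gene features (X) and one target gene (y):
# two groups of samples that follow different linear relationships.
rng = np.random.default_rng(0)
X_a = rng.normal(-5.0, 1.0, size=(100, 3))
X_b = rng.normal(5.0, 1.0, size=(100, 3))
X = np.vstack([X_a, X_b])
y = np.concatenate([X_a @ np.array([1.0, 0.0, -1.0]),
                    X_b @ np.array([-2.0, 1.0, 0.5])])

centroids, weights = fit_cluster_regression(X, y, k=2)
pred = predict_cluster_regression(X, centroids, weights)
print("training MAE:", np.abs(pred - y).mean())
```

In this sketch one such model would be fitted per target gene, which matches the abstract's wording "training instances of a gene"; the number of clusters `k` and the regularization strength `alpha` are free parameters chosen only for illustration.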
Results
The proposed method is evaluated on GEO (Gene Expression Omnibus) and GTEx (Genotype-Tissue Expression) expression data. In terms of mean absolute error averaged across target genes, it significantly outperforms previous approaches on the GEO expression data.
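The evaluation metric named above, mean absolute error averaged across target genes, could be computed as in the following sketch. The array layout (rows are samples, columns are target genes) and the function name are assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np


def mae_across_genes(y_true, y_pred):
    """Return (overall, per_gene): MAE per target gene and its mean over genes.

    Assumes arrays of shape (n_samples, n_target_genes).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    per_gene = np.abs(y_true - y_pred).mean(axis=0)  # one MAE per column (gene)
    return per_gene.mean(), per_gene


# Toy example: two samples, two target genes.
overall, per_gene = mae_across_genes([[1.0, 2.0], [3.0, 4.0]],
                                     [[1.0, 4.0], [2.0, 4.0]])
print(overall, per_gene)
```

Averaging per gene first and then across genes gives every target gene equal weight, regardless of how its expression values are scaled relative to other genes.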
Conclusions
The experimental results show that combining clustering and regression can outperform state-of-the-art methods such as generative adversarial networks and a gradient boosting-based method.

Acknowledgements
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (2018R1D1A1B07047156).
Author information
Contributions
H.S.S. conceived the study, implemented the algorithm, analyzed the experimental results, and drafted the manuscript.
Ethics declarations
Competing interests
The author declares that he has no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Seok, HS. Enhancing performance of gene expression value prediction with cluster-based regression. Genes Genom 43, 1059–1064 (2021). https://doi.org/10.1007/s13258-021-01128-6
DOI: https://doi.org/10.1007/s13258-021-01128-6

