Abstract
Statistical packages such as edgeR and DESeq are intended to detect genes that are relevant to phenotypic traits and diseases. A few studies have also modeled the relationships between gene expressions and traits. In the presence of multicollinearity and outliers, which are unavoidable in genetic data, the robust ridge regression estimator can be applied with the trait value as the response variable and the gene expressions as explanatory variables. In some simulation scenarios, the robust ridge estimator is resistant to outliers and less susceptible to multicollinearity than the ordinary least-squares (OLS) estimator. This study investigated the reliability of the robust ridge estimator when the explanatory variables are tail-dependent and follow negative binomial distributions, comparing its performance with that of OLS and using a vine copula to model the tail dependence among gene expressions. The robust ridge estimator and OLS were then both applied to an ecological dataset. First, statistical analysis was used to compare RNA sequencing data between two treatments, and 15 differentially expressed genes were selected. Next, the robust ridge and OLS estimates of the regression parameters for the effects of the 15 contigs (explanatory variables) on trait values (response variables) were compared. Robust ridge regression was found to detect fewer positive and negative slopes than OLS regression. These results indicate that robust ridge regression can be successfully applied to RNA sequencing analysis to estimate the effects of trait-associated genes in real data, and that it holds great promise as a tool for modeling the association between RNA expression and phenotypic traits.
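To make the comparison between OLS and a robust ridge fit concrete, the following is a minimal sketch and not the estimator or simulation design used in the study. It assumes a Huber-weighted ridge fitted by iteratively reweighted least squares, a fixed penalty `lam`, the conventional Huber tuning constant `c = 1.345`, and synthetic Gaussian predictors sharing a common factor (rather than negative binomial counts linked by a vine copula). The function names `ols` and `huber_ridge` are hypothetical.

```python
import numpy as np

def ols(X, y):
    # Ordinary least squares; no intercept term is fitted in this sketch.
    return np.linalg.lstsq(X, y, rcond=None)[0]

def huber_ridge(X, y, lam=1.0, c=1.345, n_iter=100, tol=1e-8):
    """Huber-weighted ridge fit by iteratively reweighted least squares.

    Illustrative assumption only: lam and c are fixed by hand, not tuned.
    """
    n, p = X.shape
    penalty = lam * np.eye(p)
    # Start from the ordinary ridge solution.
    beta = np.linalg.solve(X.T @ X + penalty, X.T @ y)
    for _ in range(n_iter):
        r = y - X @ beta
        # Robust residual scale: normalized median absolute deviation.
        scale = max(np.median(np.abs(r - np.median(r))) / 0.6745, 1e-12)
        u = np.abs(r) / scale
        # Huber weights: 1 for small residuals, c/|u| for large ones.
        w = np.where(u <= c, 1.0, c / np.maximum(u, 1e-12))
        XtW = X.T * w  # scales observation i by its weight w[i]
        beta_new = np.linalg.solve(XtW @ X + penalty, XtW @ y)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta

# Synthetic data (assumed for illustration): five correlated predictors
# and a response contaminated by a few large outliers.
rng = np.random.default_rng(0)
n, p = 60, 5
common = rng.normal(size=(n, 1))
X = common + 0.3 * rng.normal(size=(n, p))   # shared factor induces collinearity
beta_true = np.ones(p)
y = X @ beta_true + rng.normal(scale=0.5, size=n)
y[:5] += 15.0                                # outlying trait values

beta_ols = ols(X, y)
beta_rr = huber_ridge(X, y, lam=1.0)
print("OLS estimate:         ", np.round(beta_ols, 3))
print("Robust ridge estimate:", np.round(beta_rr, 3))
```

The ridge penalty stabilizes the estimate when predictors are strongly correlated, while the Huber weights reduce the influence of observations with large residuals. With `lam=0` and `c=np.inf` every weight equals 1 and the fit reduces to OLS, which gives a simple sanity check on the implementation.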
Acknowledgements
The authors sincerely thank the two anonymous referees for their invaluable suggestions that helped to improve this paper.
Additional information
Handling Editor: Bryan F. J. Manly.
About this article
Cite this article
Michimae, H., Matsunami, M. & Emura, T. Robust ridge regression for estimating the effects of correlated gene expressions on phenotypic traits. Environ Ecol Stat 27, 41–72 (2020). https://doi.org/10.1007/s10651-019-00434-3
DOI: https://doi.org/10.1007/s10651-019-00434-3