Abstract
Factor analysis is a powerful tool for dimensionality reduction in multivariate studies. This study extends the factor model with non-linear interactions. Our main contribution is to present two approaches for clustering the non-linear interactions, leading to new models that are not restricted to the extreme scenarios in which all non-null interactions are different or all are the same. The first strategy handles the clusters through a finite mixture of degenerate components. The second is specified via the Dirichlet process. A comprehensive simulation study explores the performance of the proposals. A sensitivity analysis evaluates the advantages of estimating a smoothness parameter in the covariance function of the Gaussian process that establishes the non-linearity of the interactions. As an application, the methodology is illustrated with the analysis of gene expression levels from four breast cancer data sets. Genes belonging to disjoint genome regions with copy number alteration are connected to the main factors, and their non-linear interactions are estimated and clustered. A joint investigation and comparison of these four breast cancer data sets is rarely found in the literature.
References
Affymetrix: Statistical algorithms reference guide. Affymetrix Technical Report (2001). http://tools.thermofisher.com/content/sfs/brochures/statistical_reference_guide.pdf. Accessed 3 July 2020
Carvalho, C.M., Chang, J., Lucas, J.E., Nevins, J.R., Wang, Q., West, M.: High-dimensional sparse factor modelling: applications in gene expression genomics. J. Am. Stat. Assoc. 103, 1438–1456 (2008)
Chin, K., DeVries, S., Fridlyand, J., Spellman, P.T., Roydasgupta, R., Kuo, W.L., Lapuk, A., Neve, R.M., Qian, Z., Ryder, T., Chen, F., Feiler, H., Tokuyasu, T., Esserman, L., Albertson, D.G., Waldman, F.M., Gray, J.W.: Genomic and transcriptional aberrations linked to breast cancer pathophysiologies. Cancer Cell 10, 529–541 (2006)
Eddelbuettel, D.: Seamless R and C++ integration with Rcpp, vol. 64. Springer, New York (2013). https://doi.org/10.1007/978-1-4614-6868-4
Eddelbuettel, D., Francois, R.: Rcpp: seamless R and C++ integration. J. Stat. Softw. 40(8), 1–18 (2011). http://www.jstatsoft.org/v40/i08/. Accessed 3 July 2020
Eddelbuettel, D., Sanderson, C.: RcppArmadillo: accelerating R with high-performance C++ linear algebra. Comput. Stat. Data Anal. 71, 1054–1063 (2014)
Gamerman, D., Lopes, H.F.: Markov chain Monte Carlo: stochastic simulation for Bayesian inference, vol. 68, 2nd edn. Chapman and Hall/CRC, Boca Raton (2006)
Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A., Rubin, D.B.: Bayesian Data Analysis. Texts in Statistical Science, 3rd edn. Chapman and Hall/CRC, Boca Raton (2013)
Geman, S., Geman, D.: Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 6, 721–741 (1984)
Hastings, W.K.: Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 97–109 (1970)
Irizarry, R.A., Hobbs, B., Collin, F., Beazer-Barclay, Y.D., Antonellis, K.J., Scherf, U., Speed, T.P.: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4, 249–264 (2003)
Ishwaran, H., James, L.: Gibbs sampling methods for stick-breaking priors. J. Am. Stat. Assoc. 96, 161–173 (2001)
Johnson, R.A., Wichern, D.W.: Applied Multivariate Statistical Analysis, 6th edn. Pearson/Prentice Hall, Upper Saddle River (2007)
Lucas, J.E., Carvalho, C., Wang, Q., Bild, A., Nevins, J.R., West, M.: Sparse statistical modelling in gene expression genomics. In: Do, K.A., Müller, P., Vannucci, M. (eds.) Bayesian Inference for Gene Expression and Proteomics, pp. 155–176. Cambridge University Press, Cambridge (2006)
Lucas, J.E., Kung, H.N., Chin, J.T.: Cross-study projections of genomics biomarkers: an evaluation in cancer genomics. PLoS Comput. Biol. 6, e1000920 (2010). https://doi.org/10.1371/journal.pcbi.1000920
Mayrink, V.D., Lucas, J.E.: Sparse latent factor model with interactions: analysis of gene expression. Ann. Appl. Stat. 7(2), 799–822 (2013)
Mayrink, V.D., Lucas, J.E.: Supplement to sparse latent factor model with interactions: analysis of gene expression. Ann. Appl. Stat. (2013). https://doi.org/10.1214/12-AOAS607SUPP
Mayrink, V.D., Lucas, J.E.: Bayesian factor model for the detection of coherent patterns in gene expression data. Braz. J. Probab. Stat. 29(1), 1–33 (2015)
Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., Teller, E.: Equation of state calculations by fast computing machines. J. Chem. Phys. 21, 1087–1092 (1953)
Miller, D.L., Smeds, J., George, J., Vega, V.B., Vergara, L., Ploner, A., Pawitan, Y., Hall, P., Klaar, S., Liu, E.T., Bergh, J.: An expression signature for p53 status in human breast cancer predicts mutation status, transcriptional effects, and patient survival. Proc. Natl. Acad. Sci. USA 102, 13550–13555 (2005)
Pollack, J.R., Sørlie, T., Perou, C.M., Rees, C.A., Jeffrey, S.S., Lønning, P.E., Tibshirani, R., Botstein, D., Børresen-Dale, A.L., Brown, P.O.: Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors. Proc. Natl. Acad. Sci. USA 99(20), 12963–12968 (2002)
R Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna (2020). http://www.r-project.org/. Accessed 3 July 2020
Roberts, G.O., Gelman, A., Gilks, W.R.: Weak convergence and optimal scaling of random walk Metropolis algorithms. Ann. Appl. Probab. 7(1), 110–120 (1997)
Rueda, O.M., Díaz-Uriarte, R.: Flexible and accurate detection of genomic copy number changes from aCGH. PLoS Comput. Biol. 3(6), e122 (2007)
Sethuraman, J.: A constructive definition of the Dirichlet process prior. Stat. Sin. 2, 639–650 (1994)
Sotiriou, C., Wirapati, P., Loi, S., Harris, A., Fox, S., Smeds, J., Nordgren, H., Farmer, P., Praz, V., Haibe-Kains, B., Desmedt, C., Larsimont, D., Cardoso, F., Peterse, H., Nuyten, D., Buyse, M., van de Vijver, M.J., Bergh, J., Piccart, M., Delorenzi, M.: Gene expression profiling in breast cancer: understanding the molecular basis of histologic grade to improve prognosis. J. Natl. Cancer Inst. 98, 262–272 (2006)
Spiegelhalter, D.J., Best, N.G., Carlin, B.P., van der Linde, A.: Bayesian measures of model complexity and fit. J. R. Stat. Soc. Ser. B 64, 583–639 (2002)
Wang, Y., Klijn, J.G.M., Zhang, Y., Sieuwerts, A.M., Look, M.P., Yang, F., Talantov, D., Timmermans, M., Meijer-van Gelder, M.E., Jatkoe, T., Berns, E.M.J.J., Atkins, D., Foekens, J.A.: Gene expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet 365, 671–679 (2005)
Watanabe, S.: Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. J. Mach. Learn. Res. 11, 3571–3594 (2010)
West, M.: Bayesian factor regression models in the large p, small n paradigm. In: Bernardo, J., Bayarri, M., Berger, J., Dawid, A., Heckerman, D., Smith, A., West, M. (eds.) Bayesian Statistics, vol. 7, pp. 723–732. Oxford University Press, Oxford (2003)
Wu, Z., Irizarry, R.A., Gentleman, R., Murillo, F.M., Spencer, F.: A model based background adjustment for oligonucleotide expression arrays. J. Am. Stat. Assoc. 99, 909–917 (2004)
Acknowledgements
The authors would like to thank two anonymous referees for their constructive comments leading to an improved version of this paper. The first author is also grateful to Fundação de Amparo à Pesquisa de Minas Gerais (FAPEMIG) and Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) for supporting this research.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Appendices
Appendix A: Full posterior conditional distributions
\((F^{*}_{r}\mid \alpha , \lambda , \sigma ^{2}, z, X) \sim N_{n}(M_{F^{*}_{r}}, V_{F^{*}_{r}})\) where
For the probability \(\rho ^{*}_{ir}\) in (10), consider \(Q_{0}=\mathbf {0}\) and \(Q_{r} = \displaystyle {-\frac{1}{2\sigma ^{2}_{i}} \left[ F^{*}_{r} F^{*\top }_{r} -2F^{*}_{r} \left( X^{\top }_{i\bullet } -\lambda ^{\top } \alpha ^{\top }_{i\bullet }\right) \right] }\), with \(r = 0, 1, \ldots , R\).
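The quantities \(Q_{r}\) can be evaluated directly from the definition above. The sketch below is only illustrative: Eq. (10) is not reproduced in this appendix, so the normalisation \(\rho ^{*}_{ir} \propto \rho _{ir}\exp (Q_{r})\) is an assumption, as are the function name, the argument names, and the array layout (`lam` is \(L \times n\), `F_star` stacks \(F^{*}_{1}, \ldots , F^{*}_{R}\) as rows).

```python
import numpy as np

def cluster_probabilities(x_i, alpha_i, lam, F_star, sigma2_i, rho_i):
    """Evaluate Q_0, ..., Q_R for row i and normalise rho_ir * exp(Q_r).

    x_i: (n,) data row X_i.        alpha_i: (L,) loadings alpha_i.
    lam: (L, n) factor scores      F_star: (R, n) candidate interaction rows
    sigma2_i: scalar variance      rho_i: (R + 1,) positive prior weights
    The proportionality rho* ~ rho * exp(Q) is an assumption (Eq. (10) is not shown here).
    """
    resid = x_i - alpha_i @ lam                      # X_i. - alpha_i. lambda
    Q = np.zeros(F_star.shape[0] + 1)                # Q_0 = 0 (null interaction)
    Q[1:] = -(np.sum(F_star ** 2, axis=1) - 2.0 * (F_star @ resid)) / (2.0 * sigma2_i)
    log_w = np.log(rho_i) + Q
    log_w -= log_w.max()                             # stabilise before exponentiating
    w = np.exp(log_w)
    return Q, w / w.sum()
```

Working on the log scale and subtracting the maximum avoids overflow when some \(Q_{r}\) is large, which is likely when \(n\) is large or \(\sigma ^{2}_{i}\) is small.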
\((\sigma ^{2}_{i} \mid \alpha _{i \bullet }, \lambda , F_{i \bullet }, X_{i \bullet }) \sim \text{ IG }(A,B)\) where \(A = a + n/2\) and
To update \(q_{il}^{*}\) and \(\alpha _{il}\), consider the \(N(M_{\alpha _{il}},V_{\alpha _{il}})\) distribution with \(V_{\alpha _{il}} = \left[ \displaystyle {\frac{1}{w} } +\displaystyle { \frac{1}{\sigma ^{2}_{i}} }\sum _{j=1}^{n}\lambda ^{2}_{lj}\right] ^{-1}\) and \(M_{\alpha _{il}} = V_{\alpha _{il}}\left[ \displaystyle {\frac{1}{\sigma ^{2}_{i}} }\sum _{j=1}^{n}\lambda _{lj}\left( X_{ij}-F_{ij}-\sum _{l^{*}\ne l}\alpha _{il^{*}}\lambda _{l^{*}j} \right) \right] \).
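As a minimal sketch, the two moments \(M_{\alpha _{il}}\) and \(V_{\alpha _{il}}\) can be computed as follows. The function name and array layout are assumptions made for illustration (`X` and `F` are \(m \times n\), `alpha` is \(m \times L\), `lam` is \(L \times n\)); the code covers only the normal component and does not implement the update of \(q_{il}^{*}\).

```python
import numpy as np

def alpha_normal_component(i, l, X, F, alpha, lam, sigma2, w):
    """Return (M, V) of the normal component used to update alpha_il.

    w is the constant appearing as 1/w in V; sigma2 holds sigma^2_1..sigma^2_m.
    """
    # X_ij - F_ij - sum_{l* != l} alpha_il* lambda_l*j, for j = 1..n:
    # subtract the full product, then add back the l-th term.
    resid = X[i] - F[i] - alpha[i] @ lam + alpha[i, l] * lam[l]
    V = 1.0 / (1.0 / w + np.sum(lam[l] ** 2) / sigma2[i])
    M = V * np.sum(lam[l] * resid) / sigma2[i]
    return M, V
```

Note that the current value of \(\alpha _{il}\) does not affect the result, since its own contribution is removed from the residual.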
The posterior weights are constructed via the stick-breaking process, using \((\mathcal {V}_{ir} \mid F, z, \rho , \lambda , \phi ) \sim \text{ Beta }(z_{ir} + 1, \sum _{s=r+1}^{R}z_{is} + \tau )\).
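The Beta draws above can be turned into weights with the usual stick-breaking product \(\rho _{ir} = \mathcal {V}_{ir}\prod _{s<r}(1-\mathcal {V}_{is})\). The sketch below assumes that product form and a truncation that sets the last stick to 1 so the weights sum to one; both follow the common blocked Gibbs convention and are assumptions here, not statements taken from this appendix. The function name is hypothetical.

```python
import numpy as np

def sample_stick_weights(z_i, tau, rng):
    """Draw V_ir ~ Beta(z_ir + 1, sum_{s>r} z_is + tau) and build weights.

    z_i: length-R array (z_i1, ..., z_iR); tau > 0; rng: numpy Generator.
    """
    z_i = np.asarray(z_i, dtype=float)
    # tail[r] = sum_{s > r} z_is (zero for the last component)
    tail = np.concatenate((np.cumsum(z_i[::-1])[::-1][1:], [0.0]))
    v = rng.beta(z_i + 1.0, tail + tau)
    v[-1] = 1.0                                   # truncation (assumption)
    # rho_r = v_r * prod_{s < r} (1 - v_s)
    rho = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v, rho
```

A typical call would be `sample_stick_weights(z_i, tau, np.random.default_rng(1))`, with `z_i` taken from the current state of the chain.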
The full conditional distribution of \(\phi \) is:
with \(p(\phi )\) being the density of the U(0.1, 0.5). We have \(M_{F_{i\bullet }^\top } = M_{F_r^*}\) and \(V_{F^{\top }_{i\bullet }} = V_{F^*_r}\) when \(F_{i\bullet }^\top = F_{r}^*\).
The full conditional distribution of \(\lambda _{\bullet j}\) is given by:
where \(V_{\lambda _j}=\left[ \alpha ^{\top }D^{-1}\alpha + I_{L}\right] ^{-1}\) and \(M_{\lambda _j}=V_{\lambda _j}\left[ \alpha ^{\top }D^{-1}(X_{\bullet j}-F_{\bullet j})\right] \). The term \(\lambda _{-\left\{ \bullet j\right\} }\) indicates the matrix \(\lambda \) without the j-th column.
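A short sketch of this update is given below. It assumes that the full conditional is the \(L\)-variate normal with mean \(M_{\lambda _j}\) and covariance \(V_{\lambda _j}\), and that \(D = \text{diag}(\sigma ^{2}_{1}, \ldots , \sigma ^{2}_{m})\); neither is stated explicitly in the text shown here. Function names and the array layout (`X`, `F` are \(m \times n\), `alpha` is \(m \times L\)) are illustrative.

```python
import numpy as np

def lambda_column_moments(j, X, F, alpha, sigma2):
    """Return (M, V) for column j of lambda.

    V = [alpha' D^{-1} alpha + I_L]^{-1},  M = V alpha' D^{-1} (X_.j - F_.j),
    with D = diag(sigma2) (assumption).
    """
    L = alpha.shape[1]
    Dinv_alpha = alpha / sigma2[:, None]          # D^{-1} alpha, row-wise scaling
    V = np.linalg.inv(alpha.T @ Dinv_alpha + np.eye(L))
    M = V @ (Dinv_alpha.T @ (X[:, j] - F[:, j]))
    return M, V

def sample_lambda_column(j, X, F, alpha, sigma2, rng):
    """Draw lambda_.j from N_L(M, V) (assumed form of the full conditional)."""
    M, V = lambda_column_moments(j, X, F, alpha, sigma2)
    return rng.multivariate_normal(M, V)
```

Because \(V_{\lambda _j}\) does not depend on \(j\), it could be computed once per iteration and reused for all \(n\) columns.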
Appendix B: Short description of some goodness-of-fit measurements
Let \(\theta \) be a generic vector of unknown parameters associated with the model with likelihood \(p(Y|\theta )\). Here, \(Y = \{Y_1, Y_2, \ldots , Y_n\}\) represents the set of observed data and n is the sample size. Suppose that an MCMC algorithm was applied to sample from the target posterior distribution \(p(\theta |Y)\). As a result, \(\theta ^{(s)}\) is the value generated in the s-th MCMC iteration after the burn-in period, for \(s = 1, \ldots , S\). Assume that \(\bar{\theta }\) is the posterior mean of \(\theta \). The three measurements considered in this paper to compare models in terms of goodness-of-fit are summarized as follows:
- The deviance information criterion (DIC) is widely used for model selection in the Bayesian context. According to [27], this quantity is calculated as \(2\bar{D}(\theta )-D(\bar{\theta })\), where \(\bar{D}(\theta ) = -2 \sum _{s=1}^{S} \ln [p(Y|\theta ^{(s)})]/S\) and \(D(\bar{\theta }) = -2\ln [p(Y|\bar{\theta })]\).
- The widely applicable information criterion (WAIC) is obtained as the difference \(\widehat{\text{lppd}} - \hat{p}_{\text{WAIC}}\). The first term is the estimated log pointwise predictive density, \(\widehat{\text{lppd}} = \sum _{i=1}^{n} \ln [\sum _{s=1}^{S} p(Y_i|\theta ^{(s)})/S]\). The second term is the estimated effective number of parameters, \(\hat{p}_{\text{WAIC}} = \sum _{i=1}^{n} V_{s=1}^{S}[\ln p(Y_i|\theta ^{(s)})]\), where \(V_{s=1}^{S}[a^{(s)}] = \sum _{s=1}^{S} (a^{(s)}-\bar{a})^2/(S-1)\) and \(\bar{a} = \sum _{s=1}^{S} a^{(s)}/S\). See [29] for more details.
- The log pseudo marginal likelihood (LPML) is a model selection criterion based on the so-called conditional predictive ordinate (CPO). For the i-th observation, we calculate \(\widehat{\text{CPO}}_i = S [\sum _{s=1}^{S} 1/ p(Y_i|\theta ^{(s)})]^{-1}\). The target result is given by \(\sum _{i=1}^{n} \ln \widehat{\text{CPO}}_i\). See [8] for additional details.
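The three measurements can be computed from a matrix of pointwise log-likelihood values saved during the MCMC run. The sketch below is illustrative only: the function names are hypothetical, and it assumes that the observations are conditionally independent given \(\theta \), so that \(\ln p(Y|\theta ) = \sum _{i=1}^{n} \ln p(Y_i|\theta )\).

```python
import numpy as np

def _log_mean_exp(a, axis=0):
    """Numerically stable ln(mean(exp(a))) along the given axis."""
    m = a.max(axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.mean(np.exp(a - m), axis=axis))

def fit_measures(loglik, loglik_at_mean):
    """DIC, WAIC and LPML as defined in Appendix B.

    loglik: (S, n) array with entries ln p(Y_i | theta^(s))
    loglik_at_mean: (n,) array with entries ln p(Y_i | theta_bar)
    Assumes ln p(Y | theta) is the sum of the pointwise terms.
    """
    D_bar = -2.0 * loglik.sum(axis=1).mean()      # posterior mean deviance
    D_hat = -2.0 * loglik_at_mean.sum()           # deviance at the posterior mean
    dic = 2.0 * D_bar - D_hat

    lppd = _log_mean_exp(loglik, axis=0).sum()
    p_waic = loglik.var(axis=0, ddof=1).sum()     # sample variance over s, per i
    waic = lppd - p_waic

    # ln CPO_i = -ln( (1/S) * sum_s 1 / p(Y_i | theta^(s)) )
    lpml = -_log_mean_exp(-loglik, axis=0).sum()
    return {"DIC": dic, "WAIC": waic, "LPML": lpml}
```

With the definitions used here, smaller DIC values and larger WAIC and LPML values indicate a better fit. Working with log-likelihoods rather than likelihoods avoids underflow in the averages over \(s\).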
Cite this article
Amorim, E.d.C., Mayrink, V.D. Clustering non-linear interactions in factor analysis. METRON 78, 329–352 (2020). https://doi.org/10.1007/s40300-020-00186-2
DOI: https://doi.org/10.1007/s40300-020-00186-2