Clustering non-linear interactions in factor analysis

Amorim, Erick da Conceição; Mayrink, Vinícius Diniz

doi:10.1007/s40300-020-00186-2

Published: 17 September 2020

METRON, volume 78, pages 329–352 (2020)

Abstract

Factor analysis is a powerful tool for dimensionality reduction in multivariate studies. This study extends the factor model with non-linear interactions. The main contribution of our work is to present two approaches to cluster the non-linear interactions and thus develop new models that are not restricted to the extreme scenarios where all non-null interactions are different or all are the same. The first strategy to handle the clusters involves a finite mixture of degenerate components. The second option is specified via the Dirichlet process. A comprehensive simulation study is developed to explore the performance of the proposals. A sensitivity analysis is carried out to evaluate the advantages of estimating a smoothness parameter defined in the covariance function of the Gaussian process that establishes the non-linearity of the interactions. In terms of application, the methodology is illustrated with the analysis of gene expression levels related to four breast cancer data sets. The genes belonging to disjoint genome regions, with copy number alteration, are connected to the main factors and their non-linear interactions are estimated and clustered. The joint investigation and comparison of these four breast cancer data sets is rarely found in the literature.

References

1. Affymetrix: Statistical algorithms reference guide. Affymetrix Technical Report (2001). http://tools.thermofisher.com/content/sfs/brochures/statistical_reference_guide.pdf. Accessed 3 July 2020
2. Carvalho, M.C., Chang, J., Lucas, J.E., Nevins, J.R., Wang, Q., West, M.: High-dimensional sparse factor modelling: applications in gene expression genomics. J. Am. Stat. Assoc. 103, 1438–1456 (2008)
3. Chin, K., De Vriers, S., Fridlyand, J., Spellman, P.T., Roydasgupta, R., Kuo, W.L., Lapuk, A., Neve, R.M., Qian, Z., Ryder, T., Chen, F., Feiler, H., Tokuyasu, T., Esserman, L., Albertson, D.G., Waldman, F.M., Gray, J.W.: Genomic and transcriptional aberrations linked to breast cancer pathophysiologies. Cancer Cell 10, 529–541 (2006)
4. Eddelbuettel, D.: Seamless R and C++ integration with Rcpp, vol. 64. Springer, New York (2013). https://doi.org/10.1007/978-1-4614-6868-4
5. Eddelbuettel, D., Francois, R.: Rcpp: seamless R and C++ integration. J. Stat. Softw. 40(8), 1–18 (2011). http://www.jstatsoft.org/v40/i08/. Accessed 3 July 2020
6. Eddelbuettel, D., Sanderson, C.: RcppArmadillo: accelerating R with high-performance C++ linear algebra. Comput. Stat. Data Anal. 71, 1054–1063 (2014)
7. Gamerman, D., Lopes, H.F.: Markov chain Monte Carlo: stochastic simulation for Bayesian inference, vol. 68, 2nd edn. Chapman and Hall/CRC, Boca Raton (2006)
8. Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A., Rubin, D.B.: Bayesian Data Analysis. Texts in Statistical Science, 3rd edn. Chapman and Hall/CRC, Boca Raton (2013)
9. Geman, S., Geman, D.: Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 6, 721–741 (1984)
10. Hastings, W.K.: Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 97–109 (1970)
11. Irizarry, R.A., Hobbs, B., Collin, F., Beazer-Barclay, Y.D., Antonellis, K.J., Scherf, U., Speed, T.P.: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4, 249–264 (2003)
12. Ishwaran, H., James, L.: Gibbs sampling methods for stick-breaking priors. J. Am. Stat. Assoc. 96, 161–173 (2001)
13. Johnson, R.A., Wichern, D.W.: Applied Multivariate Statistical Analysis, 6th edn. Pearson/Prentice Hall, Upper Saddle River (2007)
14. Lucas, J.E., Carvalho, C., Wang, Q., Bild, A., Nevins, J.R., West, M.: Sparse statistical modelling in gene expression genomics. In: Do, K.A., Müller, P., Vannucci, M. (eds.) Bayesian Inference for Gene Expression and Proteomics, pp. 155–176. Cambridge University Press, Cambridge (2006)
15. Lucas, J.E., Kung, H.N., Chin, J.T.: Cross-study projections of genomics biomarkers: an evaluation in cancer genomics. PLoS Comput. Biol. 6, e1000920 (2010). https://doi.org/10.1371/journal.pcbi.1000920
16. Mayrink, V.D., Lucas, J.E.: Sparse latent factor model with interactions: analysis of gene expression. Ann. Appl. Stat. 7(2), 799–822 (2013)
17. Mayrink, V.D., Lucas, J.E.: Supplement to sparse latent factor model with interactions: analysis of gene expression. Ann. Appl. Stat. (2013). https://doi.org/10.1214/12-AOAS607SUPP
18. Mayrink, V.D., Lucas, J.E.: Bayesian factor model for the detection of coherent patterns in gene expression data. Braz. J. Probab. Stat. 29(1), 1–33 (2015)
19. Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., Teller, E.: Equation of state calculations by fast computing machines. J. Chem. Phys. 21, 1087–1092 (1953)
20. Miller, D.L., Smeds, J., George, J., Vega, V.B., Vergara, L., Ploner, A., Pawitan, Y., Hall, P., Klaar, S., Liu, E.T., Bergh, J.: An expression signature for p53 status in human breast cancer predicts mutation status, transcriptional effects, and patient survival. Proc. Natl. Acad. Sci. USA 102, 13550–13555 (2005)
21. Pollack, J.R., Sorlie, T., Perou, C.M., Rees, C.A., Jeffrey, S.S., Lonning, P.E., Tibshirani, R., Botstein, D., Dale, A.L.B., Brown, P.O.: Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors. Proc. Natl. Acad. Sci. USA 99(20), 12963–12968 (2002)
22. R Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna (2020). http://www.r-project.org/. Accessed 3 July 2020
23. Roberts, G.O., Gelman, A., Gilks, W.R.: Weak convergence and optimal scaling of random walk Metropolis algorithms. Ann. Appl. Probab. 7(1), 110–120 (1997)
24. Rueda, O.M., Uriarte, R.D.: Flexible and accurate detection of genomic copy number changes from aCGH. PLoS Comput. Biol. 3(6), e122 (2007)
25. Sethuraman, J.: A constructive definition of the Dirichlet process prior. Stat. Sin. 2, 639–650 (1994)
26. Sotiriou, C., Wirapati, P., Loi, S., Harris, A., Fox, S., Smeds, J., Nordgren, H., Farmer, P., Praz, V., Kains, B.H., Desmedt, C., Larsimont, D., Cardoso, F., Peterse, H., Nuyten, D., Buyse, M., Vijver, M.J.V.D., Bergh, J., Piccart, M., Delorenzi, M.: Gene expression profiling in breast cancer: understanding the molecular basis of histologic grade to improve prognosis. J. Natl. Cancer Inst. 98, 262–272 (2006)
27. Spiegelhalter, D.J., Best, N.G., Carlin, B.P., van der Linde, A.: Bayesian measures of model complexity and fit. J. R. Stat. Soc. Ser. B 64, 583–639 (2002)
28. Wang, Y., Klijn, J.G.M., Zhang, Y., Sieuwert, A.M., Look, M.P., Yang, F., Talantov, D., Timmermans, M., Gelder, M.E.M.V., Jatkoe, T., Berns, E.M.J.J., Atkins, D., Foekens, J.A.: Gene expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet 365, 671–679 (2005)
29. Watanabe, S.: Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. J. Mach. Learn. Res. 11, 3571–3594 (2010)
30. West, M.: Bayesian factor regression models in the large p, small n paradigm. In: Bernardo, J., Bayarri, M., Berger, J., Dawid, A., Heckerman, D., Smith, A., West, M. (eds.) Bayesian Statistics, vol. 7, pp. 723–732. Oxford University Press, Oxford (2003)
31. Wu, Z., Irizarry, R.A., Gentleman, R., Murillo, F.M., Spencer, F.: A model based background adjustment for oligonucleotide expression arrays. J. Am. Stat. Assoc. 99, 909–917 (2004)

Acknowledgements

The authors would like to thank two anonymous referees for their constructive comments leading to an improved version of this paper. The first author is also grateful to Fundação de Amparo à Pesquisa de Minas Gerais (FAPEMIG) and Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) for supporting this research.

Author information

Authors and Affiliations

Departamento de Estatistica, ICEx, Universidade Federal de Minas Gerais, Av. Antonio Carlos, 6627, Belo Horizonte, MG, 31270-901, Brazil
Erick da Conceição Amorim & Vinícius Diniz Mayrink

Corresponding author

Correspondence to Vinícius Diniz Mayrink.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Full posterior conditional distributions

$(F^{*}_{r}\mid \alpha , \lambda , \sigma ^{2}, z, X) \sim N_{n}(M_{F^{*}_{r}}, V_{F^{*}_{r}})$ where

$$\begin{aligned} V_{F^{*}_{r}}= & {} \left[ \displaystyle {\left( \sum _{i=1}^{m} \frac{z_{ir}}{\sigma ^{2}_{i}} \right) } I_{n} + K^{-1}(\lambda ,\phi ) \right] ^{-1} \; \hbox {and}\\ M_{F^{*}_{r}}= & {} V_{F^{*}_{r}} \left[ \displaystyle { \sum _{i=1}^{m} \frac{z_{ir}}{\sigma ^{2}_{i}}} \left( X^{\top }_{i\bullet } -\lambda ^{\top } \alpha ^{\top }_{i\bullet } \right) \right] . \end{aligned}$$
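The normal draw above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name, the argument layout ($X$ as an $m \times n$ array, $\alpha$ as $m \times L$, $\lambda$ as $L \times n$) and the choice to pass a precomputed covariance matrix $K(\lambda ,\phi )$ are assumptions, since the form of the covariance function is not given in this appendix.

```python
import numpy as np

def sample_cluster_curve(X, alpha, lam, sigma2, z_r, K, rng):
    """Draw F*_r from N_n(M, V) using the formulas for V and M above.

    X: (m, n) data, alpha: (m, L), lam: (L, n), sigma2: (m,) variances,
    z_r: (m,) 0/1 indicators of membership in cluster r,
    K: (n, n) Gaussian-process covariance, assumed to be precomputed.
    """
    n = X.shape[1]
    w = z_r / sigma2                        # z_ir / sigma_i^2 for each row i
    resid = X - alpha @ lam                 # row i holds X_i. - alpha_i. lambda
    precision = w.sum() * np.eye(n) + np.linalg.inv(K)
    V = np.linalg.inv(precision)
    V = (V + V.T) / 2.0                     # symmetrize against round-off
    M = V @ (resid.T @ w)                   # weighted sum of residual rows
    draw = M + np.linalg.cholesky(V) @ rng.standard_normal(n)
    return draw, M, V
```

Rows with $z_{ir}=0$ drop out of both sums, so a cluster with no members is drawn from the Gaussian-process prior $N_n(0, K)$.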

For the probability $\rho ^{*}_{ir}$ in (10), consider $Q_{0}=\mathbf {0}$ and $Q_{r} = \displaystyle {-\frac{1}{2\sigma ^{2}_{i}} \left[ F^{*}_{r} F^{*\top }_{r} -2F^{*}_{r} \left( X^{\top }_{i\bullet } -\lambda ^{\top } \alpha ^{\top }_{i\bullet }\right) \right] }$, with $r = 0, 1, \ldots , R$.

$(\sigma ^{2}_{i} \mid \alpha _{i \bullet }, \lambda , F_{i \bullet }, X_{i \bullet }) \sim \text{ IG }(A,B)$ where $A = a + n/2$ and

$$\begin{aligned} B=\displaystyle {\frac{1}{2}\left[ X_{i\bullet }X^{\top }_{i\bullet }-2\alpha _{i\bullet }\lambda (X^{\top }_{i\bullet }-F^{\top }_{i\bullet }) - 2F_{i\bullet }X^{\top }_{i\bullet } + F_{i\bullet }F^{\top }_{i\bullet } + \alpha _{i\bullet }\lambda \lambda ^{\top }\alpha ^{\top }_{i\bullet }\right] + b}. \end{aligned}$$
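Expanding the square shows that the bracket in $B$ equals $\Vert X_{i\bullet } - \alpha _{i\bullet }\lambda - F_{i\bullet }\Vert ^{2}$, so $B$ is half the residual sum of squares plus $b$. A minimal sketch of the draw follows, assuming IG$(A,B)$ denotes the shape/scale parametrization (density proportional to $x^{-A-1}e^{-B/x}$); the function name and argument layout are illustrative.

```python
import numpy as np

def sample_sigma2(x_i, alpha_i, lam, f_i, a, b, rng):
    """Draw sigma_i^2 from IG(A, B) with A = a + n/2 and B = b + RSS/2."""
    resid = x_i - alpha_i @ lam - f_i       # X_i. - alpha_i. lambda - F_i.
    A = a + x_i.size / 2.0
    B = b + 0.5 * (resid @ resid)
    # If G ~ Gamma(shape=A, rate=B) then 1/G ~ IG(A, B); NumPy takes scale = 1/rate.
    draw = 1.0 / rng.gamma(shape=A, scale=1.0 / B)
    return draw, A, B
```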

To update $q_{il}^{*}$ and $\alpha _{il}$, consider the $N(M_{\alpha _{il}},V_{\alpha _{il}})$ distribution with $V_{\alpha _{il}} = \left[ \frac{1}{w} + \frac{1}{\sigma ^{2}_{i}} \sum _{j=1}^{n}\lambda ^{2}_{lj}\right] ^{-1}$ and $M_{\alpha _{il}} = V_{\alpha _{il}}\left[ \frac{1}{\sigma ^{2}_{i}} \sum _{j=1}^{n}\lambda _{lj}\left( X_{ij}-F_{ij}-\sum _{l^{*}\ne l}\alpha _{il^{*}}\lambda _{l^{*}j} \right) \right] $.
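The two moments for $\alpha _{il}$ can be computed without explicit loops. The sketch below is illustrative only: the function name is hypothetical, $w$ is taken to be the scalar appearing as $1/w$ in the variance formula, and the update of $q_{il}^{*}$ is omitted because its expression is not part of this appendix.

```python
import numpy as np

def alpha_conditional_moments(x_i, f_i, alpha_i, lam, l, sigma2_i, w):
    """Return (M, V) of the normal distribution used to update alpha_il."""
    # sum over l* != l of alpha_il* lambda_l*j, for every column j
    others = alpha_i @ lam - alpha_i[l] * lam[l]
    partial_resid = x_i - f_i - others
    V = 1.0 / (1.0 / w + (lam[l] @ lam[l]) / sigma2_i)
    M = V * (lam[l] @ partial_resid) / sigma2_i
    return M, V
```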

The posterior weights are constructed via the stick-breaking process using $(\mathcal {V}_{ir} \mid F, z, \rho , \lambda , \phi ) \sim \text{ Beta }(z_{ir} + 1, \sum _{s=r+1}^{R}z_{is} + \tau )$.
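A minimal sketch of this step is given below. It draws the stick fractions from the Beta distribution above and converts them to weights with the usual stick-breaking product $\mathcal {V}_{ir}\prod _{s<r}(1-\mathcal {V}_{is})$. Two points are assumptions not stated in this appendix: the last fraction is set to 1 so that the truncated weights sum to one, and the indicator vector is stored as a plain array ordered by component index.

```python
import numpy as np

def sample_stick_weights(z_i, tau, rng):
    """Draw stick fractions for one row i and convert them to mixture weights."""
    z_i = np.asarray(z_i, dtype=float)
    # tail[r] = sum of z_is over s > r
    tail = np.concatenate([np.cumsum(z_i[::-1])[::-1][1:], [0.0]])
    v = rng.beta(z_i + 1.0, tail + tau)
    v[-1] = 1.0                             # assumption: close the stick at the last component
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - v[:-1])])
    return v * remaining
```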

The full conditional distribution of $\phi $ is:

$$\begin{aligned} p(\phi \mid \alpha , \lambda , F, \sigma ^2, z, X)&\propto p(X \mid \alpha ,\lambda , F, \sigma ^{2})~ p(F \mid \lambda , z)~ p(\phi ) \\&\propto \left\{ \prod _{r=0}^{R}\prod _{i=1}^{m}\left[ N_{n}(F^{\top }_{i\bullet }\mid M_{F^{\top }_{i\bullet }}, V_{F^{\top }_{i\bullet }}) \right] ^{z_{ir}}\right\} p(\phi ), \end{aligned}$$

where $p(\phi )$ is the density of the $U(0.1, 0.5)$ distribution. We have $M_{F_{i\bullet }^\top } = M_{F_r^*}$ and $V_{F^{\top }_{i\bullet }} = V_{F^*_r}$ when $F_{i\bullet }^\top = F_{r}^*$.
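This density has no standard form, so $\phi $ needs a generic sampling step. This appendix does not say which one is used; the sketch below assumes a random-walk Metropolis update, with the product of normal densities supplied by the caller as a hypothetical function `log_target`. Because the prior is uniform, it only contributes the bounds check.

```python
import numpy as np

def rw_metropolis_phi(phi, log_target, step, rng, lower=0.1, upper=0.5):
    """One random-walk Metropolis update of phi under a U(lower, upper) prior.

    log_target(phi) should return the log of the product of normal densities
    in the full conditional; it is supplied by the caller (assumption).
    """
    proposal = phi + step * rng.standard_normal()
    if not (lower < proposal < upper):
        return phi                          # prior density is zero: reject
    log_ratio = log_target(proposal) - log_target(phi)
    if np.log(rng.uniform()) < log_ratio:
        return proposal
    return phi
```

The step size `step` is a tuning constant; it is usually adjusted until a reasonable share of proposals is accepted.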

The full conditional distribution of $\lambda _{\bullet j}$ is given by:

$$\begin{aligned} p(\lambda _{\bullet j} \mid \alpha , \lambda _{-\left\{ \bullet j\right\} }, F, \sigma ^2_{i}, X)&\propto p(X \mid \alpha ,\lambda , F, \sigma ^{2})~ p(F^{*}_{1},F^{*}_{2}, \ldots , F^{*}_{R} \mid \lambda , z_{i})~ p(\lambda _{\bullet j}) \\&\propto N_{L}(\lambda _{\bullet j} \mid M_{\lambda _j},V_{\lambda _j})\left| K(\lambda , \phi )\right| ^{-\sum ^{R}_{r=1} z_{ir}/2} \\&\quad \times \exp \left\{ -\frac{1}{2}\sum ^{R}_{r=1}z_{ir}F^{*}_{r}K(\lambda ,\phi )^{-1}F^{*\top }_{r}\right\} , \end{aligned}$$

where $V_{\lambda _j}=\left[ \alpha ^{\top }D^{-1}\alpha + I_{L}\right] ^{-1}$ and $M_{\lambda _j}=V_{\lambda _j}\left[ \alpha ^{\top }D^{-1}(X_{\bullet j}-F_{\bullet j})\right] $. The term $\lambda _{-\left\{ \bullet j\right\} }$ indicates the matrix $\lambda $ without the j-th column.
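The moments of the $N_{L}$ factor can be computed as sketched below. The matrix $D$ is not defined in this appendix; the code assumes $D = \text{diag}(\sigma ^{2}_{1}, \ldots , \sigma ^{2}_{m})$, and the function name is illustrative. The remaining factors of the full conditional depend on $\lambda $ through $K(\lambda ,\phi )$ and are not handled here.

```python
import numpy as np

def lambda_column_moments(x_j, f_j, alpha, sigma2):
    """Return (M, V) of the N_L factor in the full conditional of lambda_.j.

    x_j, f_j: (m,) j-th columns of X and F; alpha: (m, L); sigma2: (m,).
    Assumes D = diag(sigma2), which is not stated in this appendix.
    """
    d_inv = 1.0 / sigma2                    # diagonal of D^{-1}
    precision = alpha.T @ (d_inv[:, None] * alpha) + np.eye(alpha.shape[1])
    V = np.linalg.inv(precision)
    M = V @ (alpha.T @ (d_inv * (x_j - f_j)))
    return M, V
```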

Appendix B: Short description of some goodness-of-fit measurements

Let $\theta $ be a generic vector of unknown parameters associated with the model with likelihood $p(Y|\theta )$. In this case, $Y = \{Y_1, Y_2, \ldots , Y_n\}$ represents the set of observed data and n is the sample size. Suppose that an MCMC algorithm was applied to sample from the target posterior distribution $p(\theta |Y)$. As a result, $\theta ^{(s)}$ is the value generated in the s-th MCMC iteration after the burn-in period, for $s = 1, \ldots , S$. Assume that $\bar{\theta }$ is the posterior mean of $\theta $. The three measurements considered in this paper to compare models in terms of goodness-of-fit are summarized as follows:

The DIC is a widely used criterion for model selection in the Bayesian context. According to [27], this quantity is calculated as $2\bar{D}(\theta )-D(\bar{\theta })$, where $\bar{D}(\theta ) = -2 \sum _{s=1}^{S} \ln [p(Y|\theta ^{(s)})]/S$ and $D(\bar{\theta }) = -2\ln [p(Y|\bar{\theta })]$.
The WAIC criterion is obtained through the difference $\hat{\text{lppd}} - \hat{p}_{\text{WAIC}}$. The first term is the estimated log pointwise predictive density, given by $\hat{\text{lppd}} = \sum _{i=1}^{n} \ln [\sum _{s=1}^{S} p(Y_i|\theta ^{(s)})/S]$. The second term is the estimated effective number of parameters, given by $\hat{p}_{\text{WAIC}} = \sum _{i=1}^{n} V_{s=1}^{S}[\ln p(Y_i|\theta ^{(s)})]$, where $V_{s=1}^{S}[a^{(s)}] = \sum _{s=1}^{S} (a^{(s)}-\bar{a})^2/(S-1)$ and $\bar{a} = \sum _{s=1}^{S} a^{(s)}/S$. See [29] for more details.
The LPML is a model selection criterion based on the so-called conditional predictive ordinate (CPO). For the i-th observation, we calculate $\hat{\text{CPO}}_i = S [\sum _{s=1}^{S} 1/ p(Y_i|\theta ^{(s)})]^{-1}$. The target result is given by $\sum _{i=1}^{n} \ln \hat{\text{CPO}}_i$. See [8] for additional details.
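The three criteria can be computed from an $S \times n$ array of pointwise log-likelihood values $\ln p(Y_i|\theta ^{(s)})$. The sketch below follows the formulas above and works on the log scale for numerical stability. It assumes that the likelihood factorizes over observations, so that $\ln p(Y|\theta ^{(s)})$ is the row sum, and that $\ln p(Y|\bar{\theta })$ is supplied separately; the function names are illustrative.

```python
import numpy as np

def log_mean_exp(a, axis=0):
    """Numerically stable log of the mean of exp(a) along an axis."""
    a_max = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(a_max, axis=axis) + np.log(np.mean(np.exp(a - a_max), axis=axis))

def fit_criteria(loglik, loglik_at_mean):
    """loglik: (S, n) array of ln p(Y_i | theta^(s)); loglik_at_mean: ln p(Y | theta_bar)."""
    # Assumes ln p(Y | theta) is the sum over i of ln p(Y_i | theta).
    d_bar = -2.0 * loglik.sum(axis=1).mean()
    dic = 2.0 * d_bar - (-2.0 * loglik_at_mean)
    lppd = log_mean_exp(loglik, axis=0).sum()
    p_waic = loglik.var(axis=0, ddof=1).sum()
    waic = lppd - p_waic
    # ln CPO_i = -ln( mean over s of 1 / p(Y_i | theta^(s)) )
    lpml = (-log_mean_exp(-loglik, axis=0)).sum()
    return {"DIC": dic, "WAIC": waic, "LPML": lpml}
```

With these definitions, smaller DIC indicates better fit, while larger WAIC and LPML are preferred because both are on the log predictive density scale.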

About this article

Cite this article

Amorim, E.d.C., Mayrink, V.D. Clustering non-linear interactions in factor analysis. METRON 78, 329–352 (2020). https://doi.org/10.1007/s40300-020-00186-2

Received: 28 March 2020
Accepted: 29 August 2020
Published: 17 September 2020
Issue Date: December 2020
DOI: https://doi.org/10.1007/s40300-020-00186-2
