Abstract
Multiblock component methods are applied to data sets in which several blocks of variables are measured on the same set of observations, with the goal of analyzing the relationships between these blocks of variables. In this article, we focus on multiblock component methods that integrate the information found in several blocks of explanatory variables in order to describe and explain one set of dependent variables. Specifically, multiblock PLS and multiblock redundancy analysis are chosen as particular cases of multiblock component methods in which one set of variables is explained by a set of predictor variables organized into blocks. Because these multiblock techniques assume that the observations come from a homogeneous population, they provide suboptimal results when the observations actually come from different populations. A strategy to alleviate this problem, presented in this article, is to use a technique such as clusterwise regression to identify homogeneous clusters of observations. This approach creates two new methods that provide clusters, each with its own set of regression coefficients. This combination of clustering and regression improves the overall quality of the prediction and facilitates the interpretation. In addition, the minimization of a well-defined criterion, by means of a sequential algorithm, ensures that the algorithm converges monotonically. Finally, the proposed methods are distribution-free and can be used when the explanatory variables outnumber the observations within clusters. The proposed clusterwise multiblock methods are illustrated with a simulation study and a (simulated) example from marketing.
References
Abdi H, Williams L (2012) Partial least squares methods: partial least squares correlation and partial least square regression. In: Reisfeld B, Mayeno A (eds) Methods in molecular biology: computational toxicology. Springer, New York, pp 549–579
Bock H (1969) The equivalence of two extremal problems and its application to the iterative classification of multivariate data. In: Vortragsausarbeitung, Tagung. Mathematisches Forschungsinstitut Oberwolfach
Bougeard S, Cardinal M (2014) Multiblock modeling for complex preference study. Application to European preferences for smoked salmon. Food Qual Prefer 32:56–64
Bougeard S, Hanafi M, Qannari E (2007) ACPVI multibloc. Application à des données d’épidémiologie animale. Journal de la Société Française de Statistique 148:77–94
Bougeard S, Qannari E, Lupo C, Hanafi M (2011a) From multiblock partial least squares to multiblock redundancy analysis. A continuum approach. Informatica 22:11–26
Bougeard S, Qannari E, Rose N (2011b) Multiblock redundancy analysis: interpretation tools and application in epidemiology. J Chemom 25:467–475
Bry X, Verron T, Redont P, Cazes P (2012) THEME-SEER: a multidimensional exploratory technique to analyze a structural model using an extended covariance criterion. J Chemom 26:158–169
Charles C (1977) Régression typologique et reconnaissance des formes. PhD thesis, University of Paris IX, France
De Roover K, Ceulemans C, Timmerman M (2012) Clusterwise simultaneous component analysis for analyzing structural differences in multivariate multiblock data. Psychol Methods 17:100–119
DeSarbo W, Cron W (1988) A maximum likelihood methodology for clusterwise linear regression. J Classif 5:249–282
Diday E (1976) Classification et sélection de paramètres sous contraintes. Technical report, IRIA-LABORIA
Dolce P, Esposito Vinzi V, Lauro C (2016) Path directions incoherence in PLS path modeling: a prediction-oriented solution. In: Abdi H, Esposito Vinzi V, Russolillo G, Saporta G, Trinchera L (eds) The multiple facets of partial least squares and related methods. Springer proceedings in mathematics & statistics. Springer, Berlin, pp 59–59
Hahn C, Johnson M, Hermann AFA (2002) Capturing customer heterogeneity using a finite mixture PLS approach. Schmalenbach Bus Rev 54:243–269
Hubert L, Arabie P (1985) Comparing partitions. J Classif 2:193–218
Hwang H, Takane Y (2004) Generalized structured component analysis. Psychometrika 69:81–99
Hwang H, DeSarbo S, Takane Y (2007) Fuzzy clusterwise generalized structured component analysis. Psychometrika 72:181–198
Kissita G (2003) Les analyses canoniques généralisées avec tableau de référence généralisé : éléments théoriques et appliqués. PhD thesis, University of Paris Dauphine, France
Lohmöller J (1989) Latent variable path modeling with partial least squares. Physica-Verlag, Heidelberg
Martella F, Vicari D, Vichi M (2015) Partitioning predictors in multivariate regression models. Stat Comput 25:261–272
Preda C, Saporta G (2005) Clusterwise PLS regression on a stochastic process. Comput Stat Data Anal 49:99–108
Qin S, Valle S, Piovoso M (2001) On unifying multiblock analysis with application to decentralized process monitoring. J Chemom 15:715–742
Sarstedt M (2008) A review of recent approaches for capturing heterogeneity in partial least squares path modelling. J Model Manage 3:140–161
Schlittgen R, Ringle C, Sarstedt M, Becker JM (2016) Segmentation of PLS path models by iterative reweighted regressions. J Bus Res 69:4583–4592
Shao Q, Wu Y (2005) Consistent procedure for determining the number of clusters in regression clustering. J Stat Plan Inference 135:461–476
Spath H (1979) Clusterwise linear regression. Computing 22:367–373
R Core Team (2015) R: a language and environment for statistical computing. http://cran.r-project.org/
Tenenhaus A, Tenenhaus M (2011) Regularized generalized canonical correlation analysis. Psychometrika 76:257–284
Tenenhaus M (1998) La régression PLS. Technip, Paris
Trinchera L (2007) Unobserved heterogeneity in structural equation models: a new approach to latent class detection in PLS path modeling. PhD thesis, University of Naples Federico II
Vicari D, Vichi M (2013) Multivariate linear regression for heterogeneous data. J Appl Stat 40:1209–1230
Vinzi V, Lauro C, Amato S (2005) PLS typological regression. In: Vichi M, Monari P, Mignani S, Montanari A (eds) New developments in classification and data analysis. Springer, Berlin, pp 133–140
Vinzi V, Ringle C, Squillacciotti S, Trinchera L (2007) Capturing and treating unobserved heterogeneity by response-based segmentation in PLS path modeling. A comparison of alternative methods by computational experiments. Technical report, ESSEC Business School, https://www.academia.edu/168969/Capturing_and_Treating_Unobserved_Heterogeneity_by_Response_Based_Segmentation_in_PLS_Path_Modeling._A_Comparison_of_Alternative_Methods_by_Computational_Experiments
Vinzi V, Trinchera L, Squillacciotti S, Tenenhaus M (2009) REBUS-PLS: a response-based procedure for detecting unit segments in PLS path modeling. Appl Stochastic Models Bus Ind 24:439–458
Vivien M (2002) Approches PLS linéaires et non-linéaires pour la modélisation de multi-tableaux : théorie et applications. PhD thesis, University of Montpellier 1, France
Westerhuis J, Coenegracht P (1997) Multivariate modelling of the pharmaceutical two-step process of wet granulation and tableting with multiblock partial least squares. J Chemom 11:379–392
Westerhuis J, Smilde A (2001) Deflation in multiblock PLS. J Chemom 15:485–493
Westerhuis J, Kourti T, MacGregor J (1998) Analysis of multiblock and hierarchical PCA and PLS models. J Chemom 12:301–321
Wold H (1985) Partial least squares. In: Kotz S, Johnson N (eds) Encyclopedia of statistical sciences. Wiley, New York, pp 581–591
Wold S (1984) Three PLS algorithms according to SW. Technical report, Umeå University, Sweden
Wold S, Martens H, Wold H (1983) The multivariate calibration problem in chemistry solved by the PLS method. Matrix Pencils pp 286–293
Acknowledgements
The authors are grateful to two anonymous reviewers for their valuable suggestions that greatly improved the clarity and the relevance of this article.
Appendices
Appendix 1: multiblock PLS
In standard multiblock PLS for the case of a single dataset \(\mathbf {Y}\) to explain, the relationship between \(\mathbf {Y}\) and the K matrices \(\mathbf {X}^{k}\) (stored in \(\mathbf {X}\)) is first modeled by computing a pair of linear combinations \(\mathbf {u}\) and \(\mathbf {t}\)—called components—of the columns of, respectively, \(\mathbf {Y}\) and \(\mathbf {X}\) such that these components have maximal covariance (see, e.g., Qin et al. 2001; Abdi and Williams 2012). After this first step—equivalent to a standard PLS model (Qin et al. 2001)—specific components are computed to relate each \(\mathbf {X}^{k}\) to \(\mathbf {Y}\). Formally, MBPLS first implements the following optimization problem
The solution of this problem is obtained by taking \(\mathbf {v}\) and \(\mathbf {w}\) (called, respectively, the \(\mathbf {Y}\)- and \(\mathbf {X}\)-loadings) as (respectively) the first left and right singular vectors of the matrix \(\mathbf {Y}^\textsf {T} \mathbf {X}\) (and the first singular value \(\delta \) gives the sought-after maximum of Expression (10)). In a second step, the dependent dataset \(\mathbf {Y}\) is predicted (with a standard linear regression) from the component \(\mathbf {t}\) as
The matrix \(\widehat{\mathbf {Y}}\) therefore corresponds to the orthogonal projection of \(\mathbf {Y}\) onto the component \(\mathbf {t}\). In MBPLS, after the components and loadings have been found, block loadings (also called partial loadings) are computed (see, e.g., Qin et al. 2001, and Equations (3) and (4) for details) as
This way of computing the block loading vectors ensures that the global component vector can be obtained as a weighted average of the block vectors, namely
The partial loadings \(\mathbf {w}^{k}\) can be seen as normalized sub-vectors of \(\mathbf {w}\), and this implies that MBPLS can naturally cope with multicollinearity in \(\mathbf {X}^{k}\) or \(\mathbf {Y}\) and will, therefore, provide stable solutions.
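The first MBPLS step described above—global loadings from the singular value decomposition of \(\mathbf {Y}^\textsf {T} \mathbf {X}\), then partial loadings as normalized sub-vectors of \(\mathbf {w}\)—can be sketched numerically as follows (a minimal sketch with NumPy; the function name and variable names are ours, not from the article):

```python
import numpy as np

def mbpls_first_component(Y, X_blocks):
    """One-component MBPLS sketch: global loadings v, w from the SVD of
    Y^T X, block loadings as normalized sub-vectors of w."""
    X = np.hstack(X_blocks)
    # v, w: first left/right singular vectors of Y^T X; delta: the first
    # singular value, i.e., the maximum of the covariance criterion
    U, s, Vt = np.linalg.svd(Y.T @ X, full_matrices=False)
    v, w, delta = U[:, 0], Vt[0, :], s[0]
    t = X @ w          # global X-component
    u = Y @ v          # Y-component
    # split w block-wise and normalize to obtain the partial loadings w^k
    sizes = [Xk.shape[1] for Xk in X_blocks]
    w_blocks = np.split(w, np.cumsum(sizes)[:-1])
    w_blocks = [wk / np.linalg.norm(wk) for wk in w_blocks]
    # block components t^k = X^k w^k
    t_blocks = [Xk @ wk for Xk, wk in zip(X_blocks, w_blocks)]
    return v, w, delta, t, u, w_blocks, t_blocks
```

With this normalization, the global component is recovered as the sum of the block components weighted by the norms of the corresponding sub-vectors of \(\mathbf {w}\), which is consistent with the weighted-average property stated above.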
Because our regression problem is to obtain a good prediction of \(\mathbf {Y}\), this dataset is explained with all the variables in \((\mathbf {X}^{1}, \dots , \mathbf {X}^{K})\) (Westerhuis and Smilde 2001). As a consequence, the component-based regression is derived from the global component \(\mathbf {t}\) rather than from the block components \(\mathbf {t}^{k}\). Thereafter, plugging Eq. (13) into Eq. (11) shows that matrix \(\mathbf {Y}\) can also be predicted from the partial components as
Note that when—as is the case here—the matrix \(\mathbf {Y}\) does not include blocks, the \(\mathbf {X}^{k}\) block loadings are computed after the global loadings have been estimated, and so the block loadings do not depend upon the partition of the explanatory variables into blocks; therefore MBPLS, for the case of a single dependent block \(\mathbf {Y}\), is not a true multiblock method (Westerhuis et al. 1998; Qin et al. 2001; Vivien 2002).
Because a single component rarely explains the dependent variables completely, higher order components are often needed. These higher order components are obtained by first removing from the raw data the previous-order solution (a procedure called “deflation”) and then re-iterating the optimization procedure on the deflated data. Because this procedure ensures orthogonality of the components subsequently used in the component-based regression, we choose to deflate the raw data from the global component \(\mathbf {t}\) rather than from the block components \(\mathbf {t}^{k}\). Also, as deflating \(\mathbf {X}\) or \(\mathbf {Y}\) leads to the same prediction (Westerhuis and Smilde 2001), we choose to regress out the effect of the first-order global component from \(\mathbf {X}\). Formally, in our deflation step, \(\mathbf {X}\) is replaced by \(\mathbf {X}^{(2)}\) computed as
To improve the prediction, \(\mathbf {X}\) is replaced in Eq. (10) by its residual defined in Eq. (15). The process can then be re-iterated to obtain subsequent components. We denote by O the optimal number of components to keep in the model (with \(O \le J\))—O is in general estimated by a cross-validation approach. This deflation step ensures that components (i.e., the vectors \(\mathbf {t}\)) obtained at different steps are orthogonal to each other. Therefore, the predicted dependent dataset can be written according to the global components or according to the block ones
with
being the vector of the regression coefficients of \(\mathbf {Y}\) on \(\mathbf {t}^{(h)}\). This last regression step corresponds to the following optimization problem
where \(\mathbf {X}^{(h-1)}\) is the residual of the prediction of \(\mathbf {X}\) from the \(h-1\) previous components \((\mathbf {t}^{(1)}, \dots , \mathbf {t}^{(h-1)})\). Because these components are orthogonal, Expression
is equivalent to
with \({\mathbf {w}^{(h)}}^*\) defined as
(for proofs see, e.g., Tenenhaus 1998; Wold et al. 1983).
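The deflation step of Eq. (15)—replacing \(\mathbf {X}\) by the residual of its orthogonal projection onto the global component—can be written as a short routine (a sketch; the function name is ours):

```python
import numpy as np

def deflate(X, t):
    """Deflation step: X^(2) = X - t (t^T t)^{-1} t^T X, i.e., the
    residual of the orthogonal projection of X onto component t."""
    t = t.reshape(-1, 1)
    return X - t @ (t.T @ X) / float(t.T @ t)
```

Applying the routine repeatedly (re-estimating a component on each residual matrix) yields the sequence of mutually orthogonal global components used in the regression step.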
If we define
the optimal prediction of \(\mathbf {Y}\), denoted \(\widehat{\mathbf {Y}}^{(O)}\), can be obtained, in a way analogous to standard multiple linear regression, as
Interestingly, rewriting Eq. (21) shows that it can also be obtained as the solution of the following minimization problem
This expression corresponds to a standard least square estimation problem and this indicates, therefore, that the quality of the PLS model can be evaluated like a standard linear regression model using the well-known Root Mean Square Error
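The final regression of \(\mathbf {Y}\) on the orthogonal components and the RMSE evaluation described above can be sketched as a standard least squares step (function name ours; `T` stands for the matrix collecting the O retained global components):

```python
import numpy as np

def predict_and_rmse(T, Y):
    """Regress Y on the component matrix T by ordinary least squares,
    Y_hat = T (T^T T)^{-1} T^T Y, and evaluate the fit by the RMSE."""
    coef, *_ = np.linalg.lstsq(T, Y, rcond=None)  # (T^T T)^{-1} T^T Y
    Y_hat = T @ coef
    rmse = np.sqrt(np.mean((Y - Y_hat) ** 2))
    return Y_hat, rmse
```

In practice, the RMSE would be computed on left-out observations within the cross-validation loop used to select the number of components O.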
Appendix 2: multiblock redundancy analysis
MBRA can be expressed as the solution of the following optimization problem (24)
under the constraints that
It can be shown that the solution of this problem is obtained by taking \(\mathbf {v}\) as the first eigenvector of the matrix
(see, e.g., Bougeard et al. 2007, 2011a for proofs and details).
In MBRA, block components come from the normalized projections of \(\mathbf {u}\) onto each subspace spanned by the variables of \(\mathbf {X}^{k}\) and are computed as
In MBRA, the global component is obtained as the weighted sum of the block components, namely
Note that both the global and the block components of MBRA take into account the partition of the explanatory variables into blocks. Furthermore, compared to MBPLS, MBRA is more oriented towards the explanation of \(\mathbf {Y}\), but it will be less stable in the case of multicollinearity within explanatory blocks because it requires matrix inversions \(\left( \text {i.e., } \left( (\mathbf {X}^{k})^\textsf {T} \mathbf {X}^{k}\right) ^{-1} \right) \), as indicated in Eqs. (25) and (26) (see Bougeard et al. 2011a for details).
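The MBRA block components—normalized projections of \(\mathbf {u}\) onto the subspaces spanned by the \(\mathbf {X}^{k}\)—can be sketched as follows (function name ours; for numerical stability the sketch computes the projection via least squares rather than forming the explicit inverse \(\left( (\mathbf {X}^{k})^\textsf {T} \mathbf {X}^{k}\right) ^{-1}\), which is where the sensitivity to within-block multicollinearity arises):

```python
import numpy as np

def mbra_block_components(u, X_blocks):
    """Normalized projection of u onto the subspace spanned by the
    columns of each block X^k: t^k = P_k u / ||P_k u||, with
    P_k = X^k (X^k^T X^k)^{-1} X^k^T."""
    t_blocks = []
    for Xk in X_blocks:
        # least squares coefficients give (Xk^T Xk)^{-1} Xk^T u
        coef, *_ = np.linalg.lstsq(Xk, u, rcond=None)
        proj = Xk @ coef             # projection of u onto span(Xk)
        t_blocks.append(proj / np.linalg.norm(proj))
    return t_blocks
```

The global MBRA component is then obtained as a weighted sum of these block components, as stated above.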
As for MBPLS, the effect of the component \(\mathbf {t}\) is regressed out of \(\mathbf {X}\) through the deflation of \(\mathbf {X}\) upon this global component following Eq. (15). Subsequent components are then obtained by replacing matrix \(\mathbf {X}\) in Eq. (24) by its successive residual matrices.
In a second step, the dependent dataset \(\mathbf {Y}\) is predicted using the successive components \((\mathbf {t}^{(1)}, \dots , \mathbf {t}^{(O)})\) and Eqs. (16) and (17), where O, the optimal number of components in the model, is in general obtained through a cross-validation procedure.
As for MBPLS, the regression step of MBRA can be interpreted as the solution to the following optimization problem
where \(\mathbf {X}^{k(h-1)}\) is the residual of the prediction of \(\mathbf {X}^{k}\) from the \(h-1\) previous components \((\mathbf {t}^{(1)}, \dots , \mathbf {t}^{(h-1)})\).
Appendix 3: computation times for some representative case studies
See Table 8.
Bougeard, S., Abdi, H., Saporta, G. et al. Clusterwise analysis for multiblock component methods. Adv Data Anal Classif 12, 285–313 (2018). https://doi.org/10.1007/s11634-017-0296-8
Keywords
- Multiblock component method
- Clusterwise regression
- Typological regression
- Cluster analysis
- Dimension reduction