Introduction

Nowadays, more and more studies result in multivariate two-level data. In sensometric research, for instance, such data are obtained when multiple panelists each rate different sets of food samples on a number of descriptor variables (e.g., Bro, Qannari, Kiers, Naes, & Frøst, 2008). Take as a second example an emotion psychologist who studies how the experience of multiple emotions evolves across time through experience sampling, and does this for a number of subjects (e.g., Erbas, Ceulemans, Pe, Koval, & Kuppens, 2014). Such data have a two-level structure, in that the observation units (i.e., food samples, measurement occasions) can be considered level one units that are nested within level two units (i.e., panelists, subjects). Moreover, the data are multivariate in that all level one units are measured on multiple variables. In the remainder, following De Roover, Ceulemans and Timmerman (2012), we will denote the level two units by “data blocks” and the level one units by “observations.”

Many research questions regarding two-level multivariate data pertain to the associations between the variables, which need not be the same across all data levels. When the variables can be divided into predictor (i.e., independent) and criterion (i.e., dependent) variables, multilevel regression models (Snijders & Bosker, 1999) may be adopted to properly model these different associations. When no such distinction is made and the number of variables is somewhat larger (say, more than five), researchers often turn to multilevel factor analysis. The key principle behind this method is that the observed variables are considered to be indicators of latent variables, which account for the associations between the variables. Two types of multilevel factor analysis techniques exist. The first type is multilevel confirmatory factor analysis (multilevel CFA), which aims at testing whether specific hypotheses regarding the factor structure are consistent with the data (e.g., Mehta & Neale, 2005). The second type is exploratory in nature (multilevel EFA) and is used to find the latent factors that best fit the data at hand (e.g., Goldstein & Browne, 2005).

In multilevel EFA, the variance in the data is split into a between-part, containing the means of each data block, and a within-part, consisting of the deviations from the block means, and separate exploratory factor models are fitted to each part. Hence, between-block differences in variable means and in covariance structure can be studied separately. Multilevel EFA, however, has the disadvantage that it requires a large number of data blocks.

An interesting alternative to multilevel EFA was provided by Timmerman (2006), who proposed a class of multilevel simultaneous component models (MLSCA). Like a multilevel EFA model, a MLSCA model sheds light on the associations between the variables at the different levels, by specifying separate submodels for the level two units and the level one units. Each submodel consists of a component model. MLSCA has already been successfully applied to study individual differences in development (Timmerman, Ceulemans, Lichtwarck-Aschoff, & Vansteelandt, 2009) and attachment (Bosmans, Van de Walle, Goossens, & Ceulemans, 2014), cross-cultural differences (Kuppens, Ceulemans, Timmerman, Diener, & Kim-Prieto, 2006), differences between experimental conditions in a social dilemma game (Stouten, Ceulemans, Timmerman, & Van Hiel, 2011), and genomics data (Jansen, Hoefsloot, van der Greef, Timmerman, & Smilde, 2005).

However, the use of MLSCA is hampered by two issues. First, as MLSCA solutions are obtained through iterative algorithms, analyzing large data sets (i.e., data sets with many observations) may take a lot of computation time. Second, whereas for multilevel regression, multilevel CFA, and multilevel EFA easily accessible software exists (e.g., MPLUS, SAS, and R-packages), up to now, such software for estimating MLSCA models has been lacking.

In this paper, we address both issues. Specifically, we discuss a computational shortcut for MLSCA fitting that considerably decreases computation time in cases where the number of level one units is much larger than the number of variables. Moreover, we present the MLSCA package, which was built in MATLAB, but is also available in a version that can be used on any Windows computer, without having MATLAB installed. This package may be freely downloaded from http://ppw.kuleuven.be/okp/software/MLSCA. The MLSCA package allows the user to interact with the program through a graphic user interface, in which different analysis options can be specified (e.g., different numbers of between- and within-components and different MLSCA variants). Through built-in model selection heuristics, the package assists the user in selecting a model that fits the data well and is parsimonious at the same time. The analysis results are automatically saved in formats that can be easily loaded into popular software packages (e.g., SAS and SPSS) for further processing (e.g., plotting the component scores/loadings, correlating the obtained estimates with other variables).

The remainder of this paper is organized into four main sections. In the second section, MLSCA theory is recapitulated. In the third section, the different steps of an MLSCA analysis are discussed and demonstrated making use of sensory profiling data. In this section, we also present the computational shortcut for large data sets. The fourth section introduces the MLSCA software package. Finally, the fifth section contains some concluding remarks.

MultiLevel simultaneous component analysis (MLSCA)

Data structure and preprocessing

MLSCA models a data matrix \( \mathbf{X} \) that consists of I columnwise concatenated data blocks \( \mathbf{X}_i \) (\( i = 1, \ldots, I \)), each holding the scores of \( K_i \) observations on J variables. \( \mathbf{X} \) can be split uniquely into three parts – an offset term, a between-part, and a within-part – as:

$$ \mathbf{X}={\mathbf{X}}^{offset}+{\mathbf{X}}^{between}+{\mathbf{X}}^{within}. $$
(1)

The entries of these three matrices are obtained by decomposing the score \( x_{ijk_i} \) of observation \( k_i \) within data block i on variable j:

$$ \begin{array}{rcl}{x}_{ij{k}_i}& =& {x}_{ij{k}_i}^{offset}+{x}_{ij{k}_i}^{between}+{x}_{ij{k}_i}^{within}\\ {}& =& {x}_{.j.}+\left({x}_{ij.}-{x}_{.j.}\right)+\left({x}_{ij{k}_i}-{x}_{ij.}\right),\end{array} $$
(2)

with \( x_{.j.} \) denoting the mean score on variable j computed across all data blocks, and \( x_{ij.} \) indicating the mean score on variable j computed within data block i.

Since component analysis identifies components that capture as much of the variance in the data as possible, variance differences among the variables may hugely affect the obtained results. Therefore, researchers should decide whether or not such variance differences are meaningful. For instance, when analyzing physiological measures such as heart rate, respiratory volume, and blood pressure, variance differences are at least partly arbitrary because the variables are measured on a different scale (see e.g., De Roover, Timmerman, Van Diest, Onghena, & Ceulemans, 2014). Thus, it makes sense to give each variable the same weight in the analysis by scaling each variable to a variance of one across all data blocks. However, as a counterexample, when studying emotions in daily life, the variances of negative emotions are often much smaller than those of positive emotions, even though they are rated using the same scale. Such differences are meaningful because negative emotions are experienced less often than positive ones. Discarding these variance differences may yield misleading results (see e.g., Brose, De Roover, Ceulemans, & Kuppens, 2015); therefore, one should not scale the variables to equal variance in this case.
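
To make the preprocessing decision and the decomposition in Eq. 2 concrete, the following MATLAB sketch (optionally) scales the variables and splits a stacked data matrix into its offset, between-, and within-parts. It is a minimal illustration under our own naming conventions; the function split_levels, the block-size vector rows, and the scale flag are not part of the MLSCA package.

```matlab
% Minimal sketch (not the package's own code): optionally scale each variable to
% unit variance across all data blocks and split the stacked data matrix X
% (sum(Ki) x J) into offset, between-, and within-parts as in Eq. 2.
function [Xoffset, Xbetween, Xwithin] = split_levels(X, rows, scale)
    [N, J] = size(X);
    if scale
        X = X ./ repmat(std(X, 1), N, 1);   % variance 1 per variable, computed overall
    end
    grandmean = mean(X, 1);                 % grand means x_.j.
    Xoffset   = repmat(grandmean, N, 1);
    Xbetween  = zeros(N, J);
    first = 1;
    for i = 1:numel(rows)
        idx = first:(first + rows(i) - 1);
        blockmean = mean(X(idx, :), 1);     % block means x_ij.
        Xbetween(idx, :) = repmat(blockmean - grandmean, rows(i), 1);
        first = first + rows(i);
    end
    Xwithin = X - Xoffset - Xbetween;       % deviations from the block means
end
```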

Model

The full MLSCA model for the observed data is obtained by summing the offset term and the between- and within-submodels (see Timmerman, 2006), which we discuss below.

Between-submodel

The between-submodel captures the differences between data blocks in variable means. It boils down to an ordinary principal component analysis (PCA) of \( \mathbf{X}^{between} \). Formally, each of the I between-parts \( \mathbf{X}_i^{between} \) is approximated by \( {\widehat{\mathbf{X}}}_i^{between} \), which is decomposed as follows:

$$ {\widehat{\mathbf{X}}}_i^{between}={\mathbf{1}}_{K_i}\,{\mathbf{f}}_i^b\,{\mathbf{B}}^{b\prime}, $$
(3)

with \( {\mathbf{1}}_{K_i} \) being a \( K_i \times 1 \) vector of ones, \( \mathbf{f}_i^b \) (\( 1 \times Q^b \)) containing the between-component scores of data block i on the \( Q^b \) between-components, and \( \mathbf{B}^b \) (\( J \times Q^b \)) being the between-component loading matrix. Per component, the overall mean (i.e., across all blocks) of the between-component scores equals zero, and its overall variance amounts to one.

Two remarks regarding the between-loadings are in order. First, these loadings may be orthogonally or obliquely rotated, provided that the between-component scores are counterrotated. This rotational freedom is often exploited to facilitate the interpretation of the obtained components, by rotating the loadings towards simple structure. Second, a raw between-loading equals the covariance between the between-component and the between-part of the observed variable. To ease interpretation, one can compute normalized loadings, which can be read as correlations rather than covariances, by dividing each raw loading by the standard deviation of the corresponding column of \( \mathbf{X}^{between} \).
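
As a small illustration of this normalization, the snippet below divides each raw between-loading by the standard deviation of the corresponding column of \( \mathbf{X}^{between} \). It is a sketch only; the variable names Bb_raw and Xbetween are ours and are assumed to result from an earlier analysis.

```matlab
% Minimal sketch: turning raw between-loadings (covariances) into normalized,
% correlation-like loadings. Bb_raw is J x Qb; Xbetween is the stacked
% (sum(Ki) x J) between-part.
sd_b    = std(Xbetween, 1)';                          % J x 1 standard deviations
Bb_norm = Bb_raw ./ repmat(sd_b, 1, size(Bb_raw, 2));
```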

Within-submodel

The within-submodel accounts for the covariance structure of the variables within the data blocks. More specifically, each \( \mathbf{X}_i^{within} \) is approximated by \( {\widehat{\mathbf{X}}}_i^{within} \), which is decomposed as:

$$ {\widehat{\mathbf{X}}}_i^{within}={\mathbf{F}}_i^w{\mathbf{B}}_i^{w\prime }, $$
(4)

with \( Q^w \) denoting the number of within-components for data block i, \( \mathbf{F}_i^w \) (\( K_i \times Q^w \)) being its within-component score matrix, and \( \mathbf{B}_i^w \) (\( J \times Q^w \)) its within-component loading matrix. Per block and per component, the mean of the within-component scores equals zero.

The most general variant, called MLCA, boils down to a separate PCA per \( \mathbf{X}_i^{within} \), with the variance of the within-component scores being restricted to one for each of the block-specific components. Thus, each data block has its own component loading matrix \( \mathbf{B}_i^w \), implying that the within-component scores of the different data blocks cannot be compared. Each loading matrix \( \mathbf{B}_i^w \) can be rotated separately. MLCA should be used when equal within-loadings for the different data blocks are implausible. For instance, when assessing the structure of values in different cultures, it is reasonable that this structure differs depending on, amongst others, the industrialization rate of the cultures (see e.g., De Roover, Timmerman, Mesquita, & Ceulemans, 2013).

However, for many data sets it makes sense to expect that the within-loadings of the data blocks are equal or at least very similar. Then, the structure is (about) the same, but the variances and/or correlations of the component scores may still differ across data blocks. As an example, Erbas et al. (2013) studied the structure of 20 emotions in typically developing adolescents and in adolescents with autism spectrum disorder (ASD). They found that the emotion loadings are very similar in both groups, but that the emotion components are more strongly correlated in the autism group. As another example, Brose et al. (2015) summarize multiple studies on daily affect in different age groups by stating that the level of negative affect varies less across days in older adults. To model such data, Timmerman (2006), building on the simultaneous component analysis (SCA) framework of Timmerman and Kiers (2003), distinguished four different model variants (see Table 1), which imply different sets of restrictions on the \( \mathbf{F}_i^w \) matrices. Note that all four variants constrain \( \mathbf{B}_i^w \) to be the same for all data blocks.

Table 1 Overview of the restrictions imposed on the variances (var) and correlations (corr) of the within-component scores of the separate data blocks by the different MultiLevel Simultaneous Component Analysis (MLSCA) variants, and of the associated complexity values, with \( K = \sum_{i=1}^{I} \min\left(K_i, \ln(K_i)\,J\right) \)

The four variants can be ordered in terms of restrictiveness. MLSCA-P is the least restrictive variant in that no further constraints are imposed on the variances and correlations of the within-component scores. When these correlations are constrained to have the same absolute values in all data blocks, but the variances are left untouched, MLSCA-PF2 is obtained. Importantly, this constraint on the correlations implies that the signs of these correlations may differ across the blocks (see Helwig, 2013), which has not always been acknowledged in previous papers on MLSCA. If such sign differences are unreasonable and occur, a constrained MLSCA-PF2 version can be considered, which restricts the correlations to be exactly the same. This model can be called MLSCA-PF2-NN, as it imposes non-negativity constraints on a further decomposition of the \( \mathbf{F}_i^w \) matrices (see Harshman, 1972; Kiers, ten Berge & Bro, 1999). MLSCA-IND goes one step further in this direction in that the correlations of the within-component scores are set to zero for each data block. Finally, in MLSCA-ECP, the most restrictive variant, both the variances and the correlations of the within-component scores are restricted to be the same for each data block. Note that in all four variants, the overall variances of the within-component scores equal one per component.

Two of these four variants, MLSCA-P and MLSCA-ECP, have rotational freedom. Specifically, the within-loadings can be rotated across data blocks, as long as these rotations are compensated for in the within-component scores. Note that a raw within-loading equals the covariance between the corresponding within-component and the within-part of the associated observed variable. Again, when the correlation, rather than the covariance, is of interest, the within-loadings can be normalized into correlations by dividing them by the standard deviation of the associated column of \( \mathbf{X}^{within} \).

Steps in an MLSCA analysis

In this section, we discuss the three main steps of an MLSCA analysis: (1) fitting the different MLSCA variants, (2) model selection, that is, determining the optimal number of between- and within-components and the most adequate model variant for the within-part, and (3) interpreting the component matrices of the retained solution. These steps will be illustrated by analyzing a sensory profiling data set.

An important goal in sensory profiling research is to reveal whether panelists have systematically higher or lower ratings than other panelists; whether panelists, implicitly or explicitly, take different product features into account when judging food samples; whether panelists attach different weights to these product features; and whether these product features are differently associated across panelists. These four questions pertain to between-panelist differences in mean levels, within-component loadings, and the variances and correlations of the within-component scores, and can thus be answered using MLSCA.

To demonstrate this, we will analyze sensory profiling data concerning cream cheeses. Specifically, eight panelists were asked to rate three samples of ten different cream cheeses (i.e., 30 samples) with respect to 23 attributes, such as sweet, grainy, and chalky (for a detailed description of the data set, see Bro et al., 2008). We consider the samples to be the level one units, which are nested within the panelists. In other words, the data blocks pertain to the sample by attribute data matrices of the different panelists. In accordance with previous analyses of these data (Bro et al., 2008; Ceulemans, Timmerman, & Kiers, 2011; De Roover, Timmerman, Van Mechelen, & Ceulemans, 2013; Wilderjans & Ceulemans, 2013), we opted to discard all variance differences between the attributes by scaling each of them to a variance of one across all data blocks (a preprocessing option of the software). This implies that possible differences in variances between panelists are retained. The software package includes an Excel file that shows the raw data, the scaled version, and the splitting into offset term, between-part, and within-part.

Fitting the different MLSCA variants

Loss function

When fitting the different MLSCA variants for specific numbers of between-components and within-components, we look for estimates of \( \mathbf{f}_i^b \), \( \mathbf{B}^b \), \( \mathbf{F}_i^w \), and \( \mathbf{B}_i^w \) that satisfy the imposed constraints and minimize the following loss function:

$$ f = \sum_{i=1}^{I}{\left\Vert \left(\mathbf{X}_i - \mathbf{X}_i^{offset}\right) - \left(\mathbf{1}_{K_i}\mathbf{f}_i^b{\mathbf{B}}^{b\prime} + \mathbf{F}_i^w{\mathbf{B}}_i^{w\prime}\right)\right\Vert}^2, $$
(5)

where \( \mathbf{X}_i - \mathbf{X}_i^{offset} \) implies that we analyze the grand-mean centered data. Timmerman (2006) demonstrated that minimizing this loss function (Eq. 5) is equivalent to maximizing the percentage of variance in the data that the between- and within-components account for:

$$ VAF\% = 1 - \frac{f}{\sum_{i=1}^{I}\left\Vert \mathbf{X}_i - \mathbf{X}_i^{offset}\right\Vert^2} $$
(6)

Furthermore, the VAF% may be computed for each submodel separately, because the offset, between-submodel, and within-submodel are mutually orthogonal. Specifically, the percentages of between- and within-variance accounted for amount to:

$$ \begin{array}{l} VAF{\%}^{between} = \frac{\sum_{i=1}^{I}\left\Vert \mathbf{1}_{K_i}\mathbf{f}_i^b{\mathbf{B}}^{b\prime}\right\Vert^2}{\sum_{i=1}^{I}\left\Vert \mathbf{X}_i^{between}\right\Vert^2}\\[2ex] VAF{\%}^{within} = \frac{\sum_{i=1}^{I}\left\Vert \mathbf{F}_i^w{\mathbf{B}}_i^{w\prime}\right\Vert^2}{\sum_{i=1}^{I}\left\Vert \mathbf{X}_i^{within}\right\Vert^2}. \end{array} $$
(7)

Note that it is instructive to examine and compare the values of \( \sum_{i=1}^{I}\left\Vert \mathbf{X}_i^{between}\right\Vert^2 \) and \( \sum_{i=1}^{I}\left\Vert \mathbf{X}_i^{within}\right\Vert^2 \), as they indicate how much of the variance is situated at the between-level and how much at the within-level. For instance, these values equal 1,902.05 and 3,617.95 for the cheese data, implying that 34.46 % of the total variance is between-variance. Moreover, it is also worthwhile to inspect the amount of between- and within-variance for each variable separately.
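
The following MATLAB sketch shows how these fit measures can be computed once the fitted parts are available. It assumes (under our own naming) the stacked matrices Xbetween and Xwithin and their fitted counterparts Xbetween_hat and Xwithin_hat, all of size \( \sum_i K_i \times J \); note that \( \mathbf{X}_i - \mathbf{X}_i^{offset} = \mathbf{X}_i^{between} + \mathbf{X}_i^{within} \).

```matlab
% Minimal sketch of the fit measures in Eqs. 5-7 (names are illustrative).
ssq         = @(A) sum(A(:).^2);                       % squared Frobenius norm
f           = ssq((Xbetween + Xwithin) - (Xbetween_hat + Xwithin_hat));  % Eq. 5
VAF         = 1 - f / ssq(Xbetween + Xwithin);         % Eq. 6 (as a proportion)
VAF_between = ssq(Xbetween_hat) / ssq(Xbetween);       % Eq. 7
VAF_within  = ssq(Xwithin_hat)  / ssq(Xwithin);
% ssq(Xbetween) and ssq(Xwithin) themselves show how much variance is situated
% at the between- and within-level (1,902.05 and 3,617.95 for the cheese data).
```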

Algorithm

Because the \( \mathbf{X}_i^{between} \) and \( \mathbf{X}_i^{within} \) matrices are mutually orthogonal, the parameters of the between- and within-submodels can be estimated separately. In particular, \( \mathbf{f}_i^b \) and \( \mathbf{B}^b \) are obtained by conducting a singular value decomposition (SVD) on the vertically concatenated \( \mathbf{X}_i^{between} \) matrices (for details, see Timmerman, 2006). The within-component scores and loadings of the MLCA and MLSCA-P models are estimated by an SVD of each separate \( \mathbf{X}_i^{within} \) matrix and of their vertical concatenation, respectively (Kiers & ten Berge, 1994a; Timmerman & Kiers, 2003). Because of the additional restrictions in the MLSCA-PF2, MLSCA-IND, and MLSCA-ECP models, an alternating least squares (ALS) algorithm is required to estimate their associated within-component scores and loadings. These algorithms alternate between estimating \( \mathbf{B}^w \) and the separate \( \mathbf{F}_i^w \) matrices, starting from an initial configuration for \( \mathbf{B}^w \) (see Timmerman & Kiers, 2003). In the case of MLSCA-PF2 analyses, it may happen that the obtained solution is degenerate, in that the within-component scores are extremely strongly correlated in each data block. This degeneracy problem is well known in PARAFAC analysis, where one possible solution is to impose orthogonality constraints (Harshman & De Sarbo, 1984). Following this line of reasoning, in the case of an MLSCA-PF2 degeneracy, we recommend considering orthogonality restrictions and thus using MLSCA-IND. Moreover, to obtain MLSCA-PF2-NN solutions, non-negativity constraints can be imposed in a particular step of the alternating procedure (see Kiers, ten Berge, & Bro, 1999).

Given that the results of an ALS algorithm may depend on the starting configuration and thus may pertain to a local optimum only, we recommend using a multi-start procedure. Specifically, we advise running the algorithm from a rational start, consisting of the \( \mathbf{B}^w \) that is obtained through an MLSCA-P analysis, as well as from a number of randomly generated \( \mathbf{B}^w \) matrices, and retaining the best solution encountered across the different runs. When analyzing the cheese data, we used 100 random starts.
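
For the closed-form parts of the estimation, a minimal MATLAB sketch is given below. It covers the between-submodel and the MLSCA-P within-submodel only (the constrained variants additionally require the ALS iterations described above); all variable names are ours, and the package's own implementation may differ in its details.

```matlab
% Minimal sketch of the SVD-based estimation of the between-submodel and of the
% MLSCA-P within-submodel; Xbetween and Xwithin are the stacked (sum(Ki) x J)
% between- and within-parts, rows holds the Ki values, and Qb and Qw are the
% numbers of between- and within-components.
N = sum(rows);                                       % total number of observations

% Between-submodel: truncated SVD of the stacked between-part.
[U, S, V]  = svd(Xbetween, 'econ');
Fb_stacked = sqrt(N) * U(:, 1:Qb);                   % scores: overall mean 0, variance 1
Bb         = V(:, 1:Qb) * S(1:Qb, 1:Qb) / sqrt(N);   % raw between-loadings
% Within each block the rows of Fb_stacked are identical; taking one row per
% block gives the fi^b of Eq. 3.

% MLSCA-P within-submodel: truncated SVD of the stacked within-part.
[U, S, V]  = svd(Xwithin, 'econ');
Fw_stacked = sqrt(N) * U(:, 1:Qw);                   % within-component scores
Bw         = V(:, 1:Qw) * S(1:Qw, 1:Qw) / sqrt(N);   % common within-loadings
```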

Computational shortcut

For large data sets, computation time can be considerable due to, amongst others, the SVDs that are performed when estimating the separate \( \mathbf{F}_i^w \) matrices. If one or more of the data blocks \( \mathbf{X}_i^{within} \) contain more observations than variables – as is typically the case – computational costs can be decreased by using a speedup that was proposed by Kiers and Harshman (1997) in the context of three-way component analysis. More specifically, one computes the QR decomposition of the \( \mathbf{X}_i^{within} \) matrices for which \( J < K_i \), \( \mathbf{X}_i^{within} = \mathbf{Q}_i\mathbf{R}_i \), with \( \mathbf{Q}_i \) a \( K_i \times J \) columnwise orthonormal basis matrix and \( \mathbf{R}_i \) (\( J \times J \)) an upper triangular matrix, and replaces \( \mathbf{X}_i^{within} \) by \( \mathbf{R}_i \). Analogous to what is explained by Kiers and Harshman (1997), the loss function does not change when replacing \( \mathbf{X}_i^{within} \) by \( \mathbf{R}_i \), so minimizing the loss function involving \( \mathbf{R}_i \) solves the same problem, but on (much) smaller matrices. Conducting the MLSCA analysis on these smaller matrices yields within-component score matrices \( \mathbf{F}_i^{w,red} \) that are reduced in size as well. Afterwards, the within-component score matrices \( \mathbf{F}_i^w \) for the full data set are computed from the \( \mathbf{F}_i^{w,red} \) and \( \mathbf{Q}_i \) matrices: \( \mathbf{F}_i^w = \mathbf{Q}_i\mathbf{F}_i^{w,red} \).
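
To illustrate the shortcut, the MATLAB sketch below applies it to a single data block, with a plain truncated SVD standing in for whatever within-analysis would otherwise be run on \( \mathbf{X}_i^{within} \) itself; all variable names are ours. In the actual analyses, the reduced matrices of all blocks jointly replace the corresponding within-parts.

```matlab
% Minimal sketch of the QR-based speedup for a single within-block with Ki > J
% (names are illustrative, not taken from the package).
[Qi, Ri]  = qr(Xi_within, 0);            % economy-size QR: Qi is Ki x J, Ri is J x J
[U, S, V] = svd(Ri, 'econ');             % analyze the small J x J matrix instead
Fw_red    = U(:, 1:Qw) * S(1:Qw, 1:Qw);  % reduced (J x Qw) component score matrix
Bw_i      = V(:, 1:Qw);                  % loadings: identical to those of Xi_within
Fw_full   = Qi * Fw_red;                 % expand back to the full Ki x Qw score matrix
```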

How much computation time can be gained depends of course on the difference between the number of observations per block and the number of variables. To illustrate this, Table 2 first shows the computation time of the within-part of the MLSCA analyses of the cheese data, run on the same laptop using five within-components, one rational start, and 100 random starts. As the number of observations per block is only slightly larger than the number of variables, there is hardly any gain in computation time. For comparison, Table 2 also displays the computation times for the MLSCA analyses of the cross-cultural data reported by Kuppens et al. (2006), using five within-components and a rational start only. This data set consists of ratings of 14 emotions by inhabitants (observations) of 48 different countries (data blocks). Since the number of inhabitants per block varies between 27 and 549 (M = 193.75; SD = 117.63), it is no surprise that the analyses are executed much faster when using the computational shortcut.

Table 2 Computation times (in seconds) of the within-part of MultiLevel Simultaneous Component Analysis (MLSCA) analyses of the cheese data and a cross-cultural data set

Model selection

The number of between-components (i.e., \( Q^b \)) and within-components (i.e., \( Q^w \)) and the model variant that are needed to adequately and parsimoniously summarize the information in a two-level multivariate data set are usually unknown, although prior studies or theory can sometimes yield useful clues. To resolve this model selection problem, one may fit the different MLSCA variants with increasing \( Q^b \) and \( Q^w \) and use a formal model selection heuristic to assess the optimal complexity. As the between- and within-parts of the data are analyzed independently (see Model subsection above), the appropriate between- and within-submodels can be determined separately.

Between-part

To select the optimal number of between-components, we recommend using the CHull test (Ceulemans & Kiers, 2006; Wilderjans, Ceulemans, & Meers, 2013), which is an extension of the well-known scree test (Cattell, 1966) and has shown good behavior for MLSCA (Ceulemans, Timmerman, & Kiers, 2011) as well as for a variety of other model selection problems (Bulteel, Wilderjans, Tuerlinckx, & Ceulemans, 2013; Ceulemans & Kiers, 2009; Ceulemans & Van Mechelen, 2005; Schepers, Ceulemans, & Van Mechelen, 2008). To conduct this test, \( VAF\%^{between} \) is first plotted against a complexity measure \( c_{Q^b} \) (i.e., the number of free parameters corrected for the number of observations), which can be computed as \( c_{Q^b} = \min\left(I, \ln(I)\,J\right)Q^b + JQ^b - (Q^b)^2 - Q^b \) (for a detailed explanation, see Ceulemans, Timmerman, & Kiers, 2011). Next, the convex hull of this plot is obtained and the solutions that are located on the higher boundary of this convex hull – denoted as the hull solutions – are retained, as they have the best fit versus complexity balance. Finally, the resulting H hull solutions are scanned to detect the model complexity at which the increase in fit maximally levels off. To determine this point in a more objective way, scree ratios

$$ s{r}_h=\frac{\frac{VAF{\%}_h^{between}-VAF{\%}_{h-1}^{between}}{c_h-{c}_{h-1}}}{\frac{VAF{\%}_{h+1}^{between}-VAF{\%}_h^{between}}{c_{h+1}-{c}_h}} $$
(8)

are computed, with h indicating the hth hull solution (\( h = 1, \ldots, H \); note that the ratio can only be computed for hull solutions that have both a less and a more complex neighbor), and the solution for which the resulting ratio is highest is retained (see Ceulemans & Kiers, 2006; Wilderjans, Ceulemans, & Meers, 2013).
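
A minimal MATLAB sketch of this scree-ratio computation is given below, assuming (under our own naming) that vaf and c are vectors holding the \( VAF\%^{between} \) values and complexities of the H hull solutions, ordered from least to most complex.

```matlab
% Minimal sketch of Eq. 8; the ratio is only defined for hull solutions that
% have both a less complex and a more complex neighbor.
H  = numel(vaf);
sr = zeros(H, 1);
for h = 2:(H - 1)
    sr(h) = ((vaf(h)   - vaf(h-1)) / (c(h)   - c(h-1))) / ...
            ((vaf(h+1) - vaf(h))   / (c(h+1) - c(h)));
end
[~, best] = max(sr);              % retain the hull solution with the highest ratio
```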

When analyzing the cheese data, we fitted between-models with one to six between-components. Applying the CHull procedure reveals that all between-models except for the one with five between-components lie on the higher boundary of the convex hull (see Fig. 1). Based on the scree ratios (see Table 3), a between-model with four between-components is selected.

Fig. 1

Hull plot for the between-solutions for the cheese data. The numbers in the plot indicate the complexity of the solutions (i.e., the number of between-components)

Table 3 Fit and complexity values and \( sr_h \) ratios for all between-solutions for the cheese data that are located on the upper boundary of the convex hull

Within-part

To select the optimal model variant and the number of within-components \( Q^w \), one can also apply a CHull test. Specifically, one can compute scree ratios as in Eq. 8, but using \( VAF\%^{within} \) and \( c_{Q^w} \), with the latter value expressing the number of free parameters in the within-submodel, corrected for the number of observations. Table 1 shows the formulas to determine \( c_{Q^w} \) (for details, see Ceulemans, Timmerman, & Kiers, 2011). Note, however, that it is not readily clear how the \( c_{Q^w} \) values should be adapted to distinguish between the MLSCA-PF2 and MLSCA-PF2-NN variants.

To determine the optimal within-model for the cheese data, analyses with all five model variants (see Model subsection above) were performed with the number of within-components varying from one to six, resulting in 30 within-solutions. Figure 2 shows the resulting hull plot. Applying the numerical convex hull procedure to this plot yielded seven hull solutions. Based on the associated \( sr_h \) values (see Table 4), an MLSCA-PF2 model with two within-components seems indicated.

Fig. 2

Hull plot for the within-solutions for the cheese data (left panel); the right panel zooms in on the part of the plot that contains the selected solution. The solutions are ordered from least to most complex

Table 4 Fit and complexity values and \( sr_h \) ratios for all within-solutions for the cheese data that are located on the upper boundary of the convex hull

Interpreting the parameter estimates of the retained model

Interpretation of the between-submodel

Table 5 displays the raw, VARIMAX-rotated between-component loadings (i.e., the covariances between the between-components and the between-parts of the observed variables; see above). Because the between-variance is considerably smaller than the within-variance, the normalized loadings (i.e., correlations rather than covariances; not shown) are much larger, but they are proportionally quite similar to their raw counterparts. From Table 5 it can be concluded that the differences in the mean descriptor profiles of the eight panelists can be summarized by means of four dimensions: the first dimension pertains to the chalky (i.e., astringency) and grainy taste of the cheese samples, whereas the second groups fat-related flavor properties (i.e., butter, fat, and creaminess). The other two dimensions pertain to the visual appearance and the sweetness of the cheese samples, respectively. The between-component scores, which reveal the positions that the eight panelists take on these four dimensions, can be read from Table 6. It can, for instance, be derived that the sixth panelist has higher sweetness ratings than the other panelists.

Table 5 Raw between- (VARIMAX rotated) and within-loadings of the MultiLevel Simultaneous Component Analysis (MLSCA)-PF2 model with four between-components and two within-components for the cheese data. Loadings for which the normalized value is larger than or equal to .40 are underlined
Table 6 Between-component scores and standard deviations of the within-component scores of the MultiLevel Simultaneous Component Analysis (MLSCA)-PF2 model with four between-components and two within-components for the cheese data

Interpretation of the within-submodel

Given that the number of observations, and, thus, the number of within-component scores, will often be quite large, the interpretation of the within-model focuses mostly on the within-loadings, which determine the labeling of the within-components. Moreover, depending on the variant, inspecting the block-specific variances and correlations of the within-component scores can also be very informative.

For the cheese example, the within-components capture the main features that each panelist (implicitly or explicitly) uses when judging the separate cheese samples. These components can be labeled by inspecting the raw loadings in Table 5. The MLSCA-PF2 model has no rotational freedom, which implies that the loadings should be interpreted as given in Table 5. The first within-component pertains to texture properties of the cheese (i.e., going from cheese with a firm texture that does not easily melt down in the mouth to soft cheese that breaks down faster), and the second to fat- and color-related properties (i.e., going from fat and creamy yellow cheese to chalky white cheese).

The panelist-specific variances of the within-components, which are shown in Table 6, reveal how salient these components are for each panelist. For instance, it can be concluded that panelists 4, 6, and 7 are relatively variable in their judgments of the texture of the cheeses. The correlation between the within-components for each single panelist amounts to .11 for three panelists (1, 4, 7) and −.11 for the other five, which implies that the two product features that the judges use are only slightly dependent, with “firm texture” being related to a “yellow color” for panelists 1, 4, and 7 and to “a white color” for the others. The fact that this correlation is weak implies that the MLSCA-IND model fits the data almost equally well and yields very similar parameter values.

Note that in this example the between-components and within-components pertain to different features. This suggests that features relevant for describing structural differences between raters (as captured in the between-loadings) differ from those relevant for describing structural differences between the cheeses (as captured in the within-loadings). In other applications, the between- and within-features may or may not differ substantially.

The MLSCA software package

The MLSCA software can be downloaded from http://ppw.kuleuven.be/okp/software/MLSCA. The package includes all the files that are necessary to replicate the analysis of the cheese data, so that users can get a sense of how the program works, what the output looks like, and how it can be interpreted.

Users interact with the program through a graphic user interface (GUI), which is built around a set of MATLAB m-files. Two versions of the software are available: a standalone application, which runs on any Windows computer and does not require MATLAB to be installed, and a MATLAB application. Whereas the standalone version can be launched from the Windows start menu (icon MLSCA_gui.exe; see also the installation manual at the website mentioned above), the MATLAB version can be started by setting the current MATLAB directory to the folder in which all the software files are stored and typing MLSCA_gui at the command prompt. The GUI that appears (see Fig. 3) consists of three compartments: Data description and data files, Analysis options, and Output files and options. To perform an ML(S)CA analysis, the user specifies the necessary information in the different compartments and clicks the Run analysis-button. In the following paragraphs, the three compartments are outlined; we close with a note on error handling.

Fig. 3

Screenshot of the MultiLevel Simultaneous Component Analysis (MLSCA) graphic user interface

Data description and data files

Data file

First, the user specifies the data file by means of the “Data file” Browse-button. For instance, from Fig. 3, it can be read that the cheese data are stored in Data.txt; an excerpt of this file is displayed in the left-hand panel of Fig. 4. The data file should be an ASCII file (i.e., .txt file) that contains a row for each observation and is sorted according to the blocks. No empty lines are allowed between the rows of a single data block. However, rows belonging to different data blocks may be separated by one or more empty lines. Per row, the scores on the variables are separated by one or more spaces, commas, semicolons, or tabs, or any combination of these. Missing data are not allowed. Each observed score should be an integer or a real number, with decimal separators being denoted by a period and not by a comma!

Fig. 4

Input files for the MultiLevel Simultaneous Component Analysis (MLSCA) analysis of the cheese data: (Left) Excerpt of the data file Data.txt (i.e., first two panelists, only first five attributes), (Middle) the label file Labels.txt, and (Right) the file that indicates the number of observations within each block Rows.txt

Data description

To provide information regarding the number of observations in each data block, the user clicks the File that contains the number of rows per data block Browse-button and selects the appropriate file (for our example: Rows.txt, see right-hand panel of Fig. 4). This file should again be of the ASCII format, containing as many rows as there are data blocks; each row consists of a single number indicating the number of observations within the corresponding data block.

Label file

Optionally, the user may provide labels for the observations, variables, and data blocks. To this end, the user checks the option yes (specify the labels) and, subsequently, browses for the ASCII file containing the labels (for our example: Labels.txt, see Fig. 3; the content of this file is shown in the middle panel of Fig. 4). The label file contains three sets of labels, for the data blocks, block-observation combinations, and variables, respectively. Each label should be placed on a separate line. The different sets may be separated by one or more empty lines; within a set, however, the labels should appear on consecutive lines without empty lines in between. Obviously, the number and ordering of the labels within each set has to correspond to the number and ordering of the entities in the data file. The labels should be character strings that may contain any kind of symbol. If the user does not want to provide labels, no (no labels) should be checked. As a consequence, the program will generate default labels. Note that when the user wants to include labels, (s)he should provide labels for each of the three sets.

Analysis options

Complexity of the between/within analysis

In the Complexity of the between-analysis and Complexity of the within-analysis panels, the user specifies how many between- and within-components, respectively, should be extracted. The number of between-components should be an integer between 1 and min(I, J), whereas the number of within-components should be an integer between 1 and \( \min(K_1, K_2, \ldots, K_I, J) \). Next, the user indicates, for the between- and within-submodels separately, whether only an analysis with the specified complexity is needed, or whether analyses with the number of components going from one up to the specified number should be run. The former can be achieved by checking the Analysis with the specified number of components only option, while the latter is obtained by selecting the Analysis with 1 up to the specified number of components option.

Type of within analysis

The user may select one or more within-submodel variants (see Model subsection above): (1) MLCA, (2) MLSCA-P, (3) MLSCA-IND, (4) MLSCA-PF2, and (5) MLSCA-ECP.

Analysis settings

When fitting MLSCA-IND, MLSCA-PF2, and/or MLSCA-ECP, an ALS algorithm is adopted. In order to minimize the risk of obtaining a local minimum only, a multi-start procedure is used consisting of one rational and a number of random starts (see Algorithm subsection above). The user may alter the number of random starts (five by default) that needs to be used by entering a number (integer) in the box next to the Number of random starts (1 rational start has been provided) field. Further, the user may indicate the maximal number of ALS iterations (1,000 by default) that will be performed by means of the Maximal number of iterations box. Note, however, that if this maximum number of iterations is reached, this indicates that the obtained solution should be interpreted with much caution as the algorithm did not converge properly. Finally, the user has to decide about the preprocessing strategy by selecting whether (center and scale data overall option) or not (center data overall option) the variables should be scaled to a variance of one across blocks. Hence, the software always grand-mean centers the data.

Output files and options

In the Output files and options compartment (see Fig. 3), the user selects the directory (for the cheese example “C:\MLSCA\example\output”) where all the output files have to be stored, by means of the corresponding Browse button. Furthermore, the user specifies a string (Name to label the output files) which will constitute the first part of the name of all output files (i.e., for the cheese example “Results”). Finally, the user checks the appropriate boxes in the Required parameters in output panel, indicating which parameter estimates need to be presented in the output: (a) unrotated loadings, (b) orthogonally rotated loadings, using the VARIMAX criterion, (c) obliquely rotated loadings, on the basis of the HKIC criterion (Kiers & ten Berge, 1994b), and/or (d) the associated component scores.

When the analysis is finished, multiple output files with different extensions (i.e., .mht, .txt, and .mat) are stored in the selected output folder. The information in the .txt files enables the user to load the results easily into any popular software package, like SAS and SPSS, in order to further process the results. In the remainder, the content of the different output files will be described.

Output file with .mht extension

The .mht file (in our example “Results_overview.mht”) contains a summary of the analysis. Specifically, first, information regarding the amount of between-block and within-block variation in the data is displayed. Second, the percentage of between- and within-variation that is explained by the fitted between- and within-submodels is shown, along with a between- and within-Variance Accounted For (VAF) plot and model selection advice. Finally, for each considered within-variant, the total percentage of within-variance explained by the different solutions is displayed for each data block separately.

Output files with .txt extension

When the user has checked loadings and/or component scores in the Required parameters in output panel, the between- and within-solutions are stored in separate files, whose names indicate the model variant (i.e., between, PCA, SCAP, SCAIND, SCAPF2, or SCAECP) and the rotation used (i.e., _unrotated, _rotated, or _oblique). If unrestricted, the component variances and correlations per data block are also displayed.

Output files with .mat extension (for MATLAB version only)

These files contain a MATLAB object that holds the results of the different analyses.

Status of the analysis and error handling

Once the user has specified the necessary input and output files and analysis options, the analysis can be started by clicking the Run analysis-button. During the analysis, the MLSCA program displays information concerning the status of the analysis in the box at the bottom of the GUI screen (see Fig. 3). When the analysis has finished, a screen will appear notifying the user that The analysis has been finished!. When the input files do not comply with the requirements mentioned above, the analysis will not be executed and the MLSCA program will produce one or more error screens, which will contain information regarding the encountered problem(s). To aid the user in dealing with the encountered problem(s), the content of the error message(s) will also be displayed in the box at the bottom of the GUI screen. Once all specifications are corrected, the user may click the Run analysis-button to restart the analysis.

Concluding remarks

In this paper, we discussed a computational shortcut for MLSCA. Further, we presented a software package, MLSCA, to perform MLSCA analyses. Three comments need to be made.

First, as MLSCA is a deterministic approach, no statistical tests or estimates of the standard errors of the parameters are provided. Yet, to gain insight into the reliability of the estimates, one may adopt a bootstrap procedure to obtain confidence intervals for the ML(S)CA parameters (see Timmerman, Kiers, Smilde, Ceulemans, & Stouten, 2009). Because many options have to be specified (e.g., which data levels are considered random), leading to rather different bootstrap procedures and analysis output, we opted not to include such a procedure in the MLSCA program. However, the MATLAB code for performing such a bootstrap analysis is available upon request. Moreover, we will explore how this option can be incorporated efficiently and in a user-friendly manner in a future version of the software.

Second, as for regular component analysis (Hubert, Rousseeuw, & Vanden Branden, 2005), MLSCA results can be affected by bad leverage outliers. To trace such outliers and obtain robust estimates, Ceulemans, Hubert and Rousseeuw (2013) presented a robust version of MLSCA-P. As the associated estimation procedure relies heavily on the Libra toolbox and is only available for MLSCA-P, the robust approach cannot be incorporated in the software package, but m-files can be provided upon request.

Finally, two-level multivariate data often contain missing values. Up to now, this problem has in most cases been dealt with by list-wise deletion of the corresponding observations (see e.g., Kuppens et al., 2006), which may yield biased results. Recently, Josse, Timmerman, and Kiers (2013) proposed an imputation approach for the least restrictive MLSCA variant. However, for the moment it is unclear how missing data should be treated when using the other MLSCA variants. Therefore, we did not include this imputation approach in the current version of the package.