
The SIMCLAS Model: Simultaneous Analysis of Coupled Binary Data Matrices with Noise Heterogeneity Between and Within Data Blocks

Psychometrika

Abstract

In many research domains different pieces of information are collected regarding the same set of objects. Each piece of information constitutes a data block, and all these (coupled) blocks have the object mode in common. When analyzing such data, an important aim is to obtain an overall picture of the structure underlying the whole set of coupled data blocks. A further challenge consists of accounting for the differences in information value that exist between data blocks and within them (i.e., between the objects of a single block). To tackle these issues, analysis techniques may be useful in which all available pieces of information are integrated and in which, at the same time, noise heterogeneity is taken into account. For binary coupled data, however, the existing methods analyze all data blocks simultaneously but do not account for noise heterogeneity. Therefore, in this paper, the SIMCLAS model, a Hierarchical Classes model for the simultaneous analysis of coupled binary two-way matrices, is presented. In this model, noise heterogeneity between and within the data blocks is accounted for by downweighting entries from noisy blocks or from noisy objects within a block. In a simulation study it is shown that (1) the SIMCLAS technique recovers the underlying structure of coupled data to a very large extent, and (2) the SIMCLAS technique outperforms a Hierarchical Classes technique in which all entries contribute equally to the analysis (i.e., noise homogeneity within and between blocks). The latter is also demonstrated in an application of both techniques to empirical data on categorization of semantic concepts.


Figure 1.


Notes

  1. In the remainder of this paper, the terms ‘matrix’ and ‘block’ are used interchangeably.

  2. A value of \(\pi_{n}\) larger than 0.50 is not realistic, because this implies that the model entries may differ from their corresponding data entries with a probability that is above chance.

  3. Note that allowing π to differ among the columns of each data block would yield a special case of block-homogeneous SIMCLAS (i.e., the case in which each data block consists of one variable only). Note further that it may not be a good idea to allow π to be different for each data element, because it would be impossible to estimate such a model (i.e., there are more parameters to fit than there are data points).

  4. When \(\pi_{1} > \pi_{2}\), then \(c_{1} = \log (\frac{\pi_{1}}{1-\pi_{1}}) > c_{2} = \log (\frac{\pi_{2}}{1-\pi_{2}})\), with \(c_{1}\) and \(c_{2}\) representing, respectively, the contribution of entries from \(D_{1}\) and \(D_{2}\) to the likelihood, which has to be maximized. Note that \(c_{1}\) and \(c_{2}\) are negative when \(\pi_{n} < 0.50\) (and zero when \(\pi_{n} = 0.50\)), which implies that \(|c_{1}| < |c_{2}|\). As such, entries from more noisy blocks (i.e., larger \(\pi_{n}\) and smaller \(|c_{n}|\)) imply a smaller decrease in the likelihood than entries from less noisy blocks (i.e., smaller \(\pi_{n}\) and larger \(|c_{n}|\)), resulting in entries from more noisy blocks being downweighted.

  5. The kappa coefficient κ between two dichotomous variables can be computed as follows:

    $$ \kappa= \frac{(p_{00} + p_{11}) - (p_{0.}p_{.0} + p_{1.}p_{.1}) }{1 - (p_{0.}p_{.0} + p_{1.}p_{.1})}, $$
    (7)

    with \(p_{00}\) (\(p_{11}\)) the proportion of zero-agreements (one-agreements) and \(p_{0.}\) and \(p_{1.}\) (\(p_{.0}\) and \(p_{.1}\)) the marginal proportion of zeros and ones for the first (second) variable.
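Notes 4 and 5 can be checked numerically. The following Python sketch is an illustration only: the function names and the example values are assumptions and do not come from the paper. It computes the log-odds contribution of Note 4 for two noise levels and the kappa coefficient of Equation (7) for two short binary vectors.

```python
import math

def log_odds_weight(pi):
    """Contribution c = log(pi / (1 - pi)) for a noise level pi in (0, 1)."""
    if not 0.0 < pi < 1.0:
        raise ValueError("pi must lie strictly between 0 and 1")
    return math.log(pi / (1.0 - pi))

def kappa_binary(x, y):
    """Kappa between two equally long 0/1 sequences, following Equation (7)."""
    if len(x) != len(y) or not x:
        raise ValueError("x and y must be non-empty and of equal length")
    n = len(x)
    p00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0) / n
    p11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1) / n
    p0_dot = sum(1 for a in x if a == 0) / n   # marginal zeros, first variable
    p_dot0 = sum(1 for b in y if b == 0) / n   # marginal zeros, second variable
    expected = p0_dot * p_dot0 + (1.0 - p0_dot) * (1.0 - p_dot0)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return ((p00 + p11) - expected) / (1.0 - expected)

# Note 4: a noisier block (pi = 0.30) versus a cleaner block (pi = 0.05).
c_noisy = log_odds_weight(0.30)
c_clean = log_odds_weight(0.05)

# Note 5: two illustrative binary variables observed on eight objects.
x = [1, 1, 0, 0, 1, 0, 1, 0]
y = [1, 0, 0, 0, 1, 0, 1, 1]
kappa = kappa_binary(x, y)
```

With these inputs, the noisier block has the smaller absolute contribution (about 0.85 against about 2.94), so each of its entries lowers the log-likelihood less. The two example vectors agree on 6 of 8 entries against a chance agreement of 0.5, which gives a kappa of 0.5.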

References

  • Aarts, E.H.L., Korst, J.H.M., & van Laarhoven, P.J.M. (1997). Simulated annealing. In E.H.L. Aarts & J.K. Lenstra (Eds.), Local search in combinatorial optimization (pp. 91–120). Chichester: Wiley.

  • Barbut, M., & Monjardet, B. (1970). Ordre et classification: Algèbre et combinatoire. Paris: Hachette.

  • Birkhoff, G. (1940). Lattice theory. Providence: Am. Math. Soc.

  • Ceulemans, E., & Storms, G. (2010). Detecting intra- and inter-categorical structure in semantic concepts using HICLAS. Acta Psychologica, 133, 296–304.

  • Ceulemans, E., & Van Mechelen, I. (2003). Uniqueness of n-way n-mode hierarchical classes models. Journal of Mathematical Psychology, 47, 259–264.

  • Ceulemans, E., & Van Mechelen, I. (2004). Tucker2 hierarchical classes analysis. Psychometrika, 69, 375–399.

  • Ceulemans, E., & Van Mechelen, I. (2005). Hierarchical classes models for three-way three-mode binary data: Interrelations and model selection. Psychometrika, 70, 461–480.

  • Ceulemans, E., Van Mechelen, I., & Leenen, I. (2003). Tucker3 hierarchical classes analysis. Psychometrika, 68, 413–433.

  • Ceulemans, E., Van Mechelen, I., & Leenen, I. (2007). The local minima problem in hierarchical classes analysis: An evaluation of a simulated annealing algorithm and various multistart procedures. Psychometrika, 72, 377–391.

  • Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46.

  • De Boeck, P., & Rosenberg, S. (1988). Hierarchical classes: Model and data analysis. Psychometrika, 53, 361–381.

  • De Deyne, S., Verheyen, S., Ameel, E., Vanpaemel, W., Dry, M., Voorspoels, W., & Storms, G. (2008). Exemplar by feature applicability matrices and other Dutch normative data for semantic concepts. Behavior Research Methods, 40, 1030–1048.

  • Haggard, E.A. (1958). Intraclass correlation and the analysis of variance. New York: Dryden.

  • Kiers, H.A.L. (2000). Towards a standardized notation and terminology in multiway analysis. Journal of Chemometrics, 14, 105–122.

  • Kiers, H.A.L., & ten Berge, J.M.F. (1989). Alternating least squares algorithms for simultaneous components analysis with equal component weight matrices for all populations. Psychometrika, 54, 467–473.

  • Kiers, H.A.L., & ten Berge, J.M.F. (1994). Hierarchical relations between methods for simultaneous component analysis and a technique for rotation to a simple simultaneous structure. British Journal of Mathematical & Statistical Psychology, 47, 109–126.

  • Kirk, R.E. (1982). Experimental design: Procedures for the behavioral sciences (2nd ed.). Belmont: Brooks/Cole.

  • Kirkpatrick, S., Gelatt, C.D.J., & Vecchi, M.P. (1983). Optimization by simulated annealing. Science, 220, 671–680.

  • Leenen, I., & Van Mechelen, I. (2001). An evaluation of two algorithms for hierarchical classes analysis. Journal of Classification, 18, 57–80.

  • Leenen, I., Van Mechelen, I., De Boeck, P., & Rosenberg, S. (1999). INDCLAS: A three-way hierarchical classes model. Psychometrika, 64, 9–24.

  • Leenen, I., Van Mechelen, I., Gelman, A., & De Knop, S. (2008). Bayesian hierarchical classes analysis. Psychometrika, 73, 39–64.

  • Millsap, R.E., & Meredith, W. (1988). Component analysis in cross-sectional and longitudinal data. Psychometrika, 53, 123–134.

  • ten Berge, J.M.F., Kiers, H.A.L., & van der Stel, V. (1992). Simultaneous components analysis. Statistica Applicata, 4, 377–392.

  • Timmerman, M.E., & Kiers, H.A.L. (2003). Four simultaneous component models for the analysis of multivariate time series from more than one subject to model intraindividual and interindividual differences. Psychometrika, 68, 105–121.

  • Van Deun, K., Smilde, A.K., van der Werf, M.J., Kiers, H.A.L., & Van Mechelen, I. (2009). A structured overview of simultaneous component based data integration. BMC Bioinformatics, 10, 246.

  • Van Mechelen, I., De Boeck, P., & Rosenberg, S. (1995). The conjunctive model of hierarchical classes. Psychometrika, 60, 505–521.

  • Van Mechelen, I., & Smilde, A.K. (2009). A generic model for data fusion. Paper presented at the 6th meeting of TRICAP (Three-way methods in chemistry and psychology), June 14–19, Vall de Núria, Spain.

  • Van Mechelen, I., & Smilde, A.K. (2010). A generic linked-mode decomposition model for data fusion. Chemometrics and Intelligent Laboratory Systems, 104, 83–94.

  • Wilderjans, T.F., Ceulemans, E., & Van Mechelen, I. (2008). The CHIC model: A global model for coupled binary data. Psychometrika, 73, 729–751.

  • Wilderjans, T.F., Ceulemans, E., Van Mechelen, I., & van den Berg, R.A. (2011). Simultaneous analysis of coupled data matrices subject to different amounts of noise. British Journal of Mathematical & Statistical Psychology, 64, 277–290.

Author information

Corresponding author

Correspondence to Tom F. Wilderjans.

Additional information

The first author is a Research Assistant of the Fund for Scientific Research (FWO)—Flanders (Belgium). The research reported in this paper was partially supported by the Research Council of K.U. Leuven (GOA/2005/04 and EF/2005/07, ‘SymBioSys’) and by IWT-Flanders (SBO 60045, ‘Bioframe’). We would like to thank Gert Storms and his collaborators for providing us with an interesting data set.

Appendix: Simulated Annealing to Estimate the Bundle Matrices, Conditional on the Noise Parameters


In Step 2 of the SIMCLAS algorithm (see section The SIMCLAS algorithm), a simulated annealing (SA) procedure is adopted to estimate, conditionally upon the noise parameters, the binary bundle matrices A and \(B_{n}\) that maximize the loss function (i.e., the likelihood). Simulated annealing is a local search technique that performs a walk through the solution space. In particular, a chain of solutions, consisting of several subchains, is generated by repeatedly deriving a candidate solution from the current solution. Next, the loss function values of the current and the candidate solution are compared. If the candidate solution has a better loss function value f, it is accepted, which implies that the current solution is replaced by the candidate solution. If the candidate solution has a worse loss function value, however, it is accepted with a probability that depends on its relative quality (i.e., the difference in loss function value f between the current solution and the candidate one) and on the current temperature, a quantity that controls the acceptance probability. At the end of each subchain the temperature is decreased. Subchains are generated until a prespecified stop criterion is met. Finally, the best solution encountered in the chain is retained.

Based on the results of a pilot study and on the SA implementations that have been used for other Hierarchical Classes models (see Ceulemans et al. 2007), we implemented the procedure for generating a single SA chain (see Algorithm A1 for pseudo-code) in the SIMCLAS algorithm as follows:

  1. An initial solution \(S_{\mathrm{current}}\) and associated initial loss value \(L_{\mathrm{current}}\) are obtained by replacing the P columns of each bundle matrix by P data vectors sampled at random (i.e., for A, column vectors are drawn from the different \(D_{n}\), whereas for each \(B_{n}\), row vectors are chosen from the corresponding \(D_{n}\)).

  2. The initial temperature \(T_{\mathrm{initial}}\) is obtained by running a subchain of solutions and accepting all solutions; subsequently, the average increase in the likelihood function across those links in which worse solutions are accepted is divided by ln(0.8); as such, during the first subchains, in which the algorithm is still far from the optimal solution, worse solutions are accepted with a high probability (see Kirkpatrick et al. 1983; Aarts et al. 1997; Ceulemans et al. 2007).

  3. A candidate solution \(S_{\mathrm{trial}}\), and associated loss value \(L_{\mathrm{trial}}\), is obtained from the current solution \(S_{\mathrm{current}}\) by altering the value of a randomly chosen cell of a randomly chosen bundle matrix, with each cell of each bundle matrix having the same probability of being changed.

  4. A worse candidate solution is accepted if \(p < \exp((L_{\mathrm{trial}} - L_{\mathrm{current}})/T_{\mathrm{current}})\), with p being a number generated from a uniform (0,1) distribution.

  5. A subchain stops (1) if the number of generated solutions \(i_{\mathrm{gen}}\) equals the maximum number of solutions \(\mathit{CL} = ((I+\sum_{n=1}^{N} J_{n})\times2^{P})\times5\), or (2) if the number of accepted solutions \(i_{\mathrm{acc}}\) equals \(\mathit{CL} \times 0.10\).

  6. At the end of each subchain, the temperature is decreased by a factor \(\alpha = 0.90\), implying a smaller acceptance probability for worse solutions: \(T_{\mathrm{current}} = 0.9 \times T_{\mathrm{current}}\).

  7. An SA chain stops when (1) the current temperature becomes smaller than \(T_{\mathrm{stop}} = 0.000001\), or (2) the number of subsequent subchains \(i_{\mathrm{id}}\) with an identical loss value \(L_{\mathrm{current}}\) for the last accepted solution in each subchain (i.e., \(L_{\mathrm{current}} = L_{\mathrm{previous}}\)) equals \(\mathit {max}_{i_{\mathrm{id}}}\), which is set to five.

  8. The retained solution is the best encountered solution \(S_{\mathrm{best}}\) across all subchains.
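The numerical settings in Steps 2, 4, 5 and 6 can be made concrete with a few lines of Python. This is a minimal sketch under stated assumptions: the helper names, the block sizes and the likelihood changes are invented for illustration, and the likelihood changes for worse moves are taken to be negative numbers.

```python
import math
import random

def chain_length(n_objects, n_variables_per_block, n_bundles):
    """Step 5: CL = ((I + sum_n J_n) * 2^P) * 5."""
    return (n_objects + sum(n_variables_per_block)) * 2 ** n_bundles * 5

def initial_temperature(worse_deltas, target_prob=0.8):
    """Step 2: average likelihood change over worse moves, divided by ln(0.8)."""
    if not worse_deltas or any(d >= 0 for d in worse_deltas):
        raise ValueError("need at least one move, all with a negative change")
    return (sum(worse_deltas) / len(worse_deltas)) / math.log(target_prob)

def accept_worse(l_trial, l_current, temperature, rng):
    """Step 4: accept a worse candidate if p < exp((L_trial - L_current) / T)."""
    return rng.random() < math.exp((l_trial - l_current) / temperature)

# Hypothetical sizes: I = 20 objects, two blocks with 10 and 15 variables, P = 2.
cl = chain_length(20, [10, 15], 2)      # (20 + 25) * 4 * 5
max_accepted = 0.10 * cl                # second stop criterion of Step 5

# Hypothetical warm-up result: two worse moves with changes -2 and -4.
t_initial = initial_temperature([-2.0, -4.0])
p_accept_initial = math.exp(-3.0 / t_initial)   # average worse move at T_initial

# Step 6 applied ten times, then the empirical acceptance rate of the same move.
t_after_10 = t_initial * 0.90 ** 10
rng = random.Random(1)
accept_rate = sum(accept_worse(-3.0, 0.0, t_after_10, rng) for _ in range(10000)) / 10000
```

By construction, a move that is as bad as the warm-up average is accepted with probability 0.8 at the initial temperature. After ten cooling steps the same move is accepted only about half of the time, which shows how the schedule gradually turns the walk into a greedy search.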

Algorithm A1. Pseudo-code for generating a single SA chain in Step 2 of the SIMCLAS algorithm.
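As a rough companion to Steps 1 to 8, a single SA chain might be sketched in Python as follows. This is not the authors' pseudo-code. It assumes one noise parameter per block and the Bernoulli log-likelihood implied by Note 4, where a data entry deviates from the Boolean product of the bundle matrices with probability \(\pi_{n}\). It also assumes that the chain continues from wherever the warm-up of Step 2 ends, and it falls back on a temperature of 1.0 if the warm-up meets no worse move. All names and the toy data are hypothetical.

```python
import copy
import math
import random

def loglik(A, Bs, Ds, pis):
    """Log-likelihood when entries of block n deviate from the model with probability pis[n]."""
    total = 0.0
    for B, D, pi in zip(Bs, Ds, pis):
        for i, row in enumerate(D):
            for j, d in enumerate(row):
                m = int(any(a and b for a, b in zip(A[i], B[j])))  # Boolean product
                total += math.log(pi) if m != d else math.log(1.0 - pi)
    return total

def sa_chain(Ds, pis, P, seed=0, alpha=0.90, T_stop=1e-6, max_identical=5):
    rng = random.Random(seed)
    I = len(Ds[0])
    # Step 1: fill every bundle column with a randomly sampled data vector.
    A = [[0] * P for _ in range(I)]
    Bs = [[[0] * P for _ in D[0]] for D in Ds]
    for p in range(P):
        D = rng.choice(Ds)
        j = rng.randrange(len(D[0]))
        for i in range(I):
            A[i][p] = D[i][j]                  # a data column becomes a column of A
        for B, Dn in zip(Bs, Ds):
            for jj, v in enumerate(rng.choice(Dn)):
                B[jj][p] = v                   # a data row becomes a column of B_n
    cells = [(M, r, c) for M in [A] + Bs for r in range(len(M)) for c in range(P)]
    CL = (I + sum(len(D[0]) for D in Ds)) * 2 ** P * 5
    L_cur = loglik(A, Bs, Ds, pis)
    L_best, best = L_cur, copy.deepcopy((A, Bs))

    def step(T):
        nonlocal L_cur, L_best, best
        M, r, c = rng.choice(cells)
        M[r][c] ^= 1                           # Step 3: flip one uniformly chosen cell
        L_trial = loglik(A, Bs, Ds, pis)
        delta = L_trial - L_cur
        # Step 4; T=None accepts every candidate, as the warm-up of Step 2 requires.
        if T is None or delta >= 0 or rng.random() < math.exp(delta / T):
            L_cur = L_trial
            if L_cur > L_best:
                L_best, best = L_cur, copy.deepcopy((A, Bs))
            return True, delta
        M[r][c] ^= 1                           # rejected: undo the flip
        return False, delta

    # Step 2: warm-up subchain; average change over worse moves divided by ln(0.8).
    worse = [d for _, d in (step(None) for _ in range(CL)) if d < 0]
    T = (sum(worse) / len(worse)) / math.log(0.8) if worse else 1.0
    identical, L_prev = 0, None
    while T >= T_stop and identical < max_identical:          # Step 7
        n_gen = n_acc = 0
        while n_gen < CL and n_acc < 0.10 * CL:               # Step 5
            accepted, _ = step(T)
            n_gen += 1
            n_acc += accepted
        same = L_prev is not None and math.isclose(L_cur, L_prev)
        identical = identical + 1 if same else 0
        L_prev = L_cur
        T *= alpha                                            # Step 6
    return L_best, best                                       # Step 8

# Toy example: two noise-free blocks built from known bundles (illustrative values).
A_true = [[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [0, 0]]
B_true = [[[1, 0], [0, 1], [1, 1]], [[0, 1], [1, 0]]]
Ds = [[[int(any(a and b for a, b in zip(ar, br))) for br in B] for ar in A_true]
      for B in B_true]
pis = [0.10, 0.20]
L_best, (A_hat, Bs_hat) = sa_chain(Ds, pis, P=2)
```

The sketch recomputes the full log-likelihood after every flip, which keeps the code short but is wasteful; an implementation meant for real data would update only the entries affected by the flipped cell.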

To lower the risk of ending in a suboptimal solution (i.e., a local optimum), a multi-start procedure is advisable: 100 SA chains are run, each with a different initial solution and initial temperature (see Steps 1 and 2), and the best solution encountered across all chains is retained (see Ceulemans et al. 2007).
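The multi-start idea amounts to a thin wrapper around whatever routine produces one chain. In the Python sketch below, `multistart` and `toy_chain` are hypothetical names; `toy_chain` is only a stand-in that scores a random binary vector, not an SA chain, and is there to keep the example self-contained.

```python
import random

def multistart(run_chain, n_starts=100, base_seed=0):
    """Run n_starts independent chains and keep the best (value, solution) pair."""
    best_value, best_solution = float("-inf"), None
    for s in range(n_starts):
        # A different seed gives each chain its own initial solution.
        value, solution = run_chain(base_seed + s)
        if value > best_value:
            best_value, best_solution = value, solution
    return best_value, best_solution

def toy_chain(seed):
    """Hypothetical stand-in for one chain: a random 0/1 vector scored by its sum."""
    rng = random.Random(seed)
    solution = [rng.randint(0, 1) for _ in range(8)]
    return sum(solution), solution

best_value, best_solution = multistart(toy_chain)
```

Any function that maps a seed to a pair of objective value and solution can be passed in place of `toy_chain`, with larger values counting as better, in line with the likelihood being maximized.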

About this article

Cite this article

Wilderjans, T.F., Ceulemans, E. & Van Mechelen, I. The SIMCLAS Model: Simultaneous Analysis of Coupled Binary Data Matrices with Noise Heterogeneity Between and Within Data Blocks. Psychometrika 77, 724–740 (2012). https://doi.org/10.1007/s11336-012-9275-3


  • DOI: https://doi.org/10.1007/s11336-012-9275-3
