Abstract
In many research domains, different pieces of information are collected regarding the same set of objects. Each piece of information constitutes a data block, and all these (coupled) blocks have the object mode in common. When analyzing such data, an important aim is to obtain an overall picture of the structure underlying the whole set of coupled data blocks. A further challenge consists of accounting for the differences in information value that exist between and within (i.e., between the objects of a single block) data blocks. To tackle these issues, analysis techniques may be useful in which all available pieces of information are integrated and in which, at the same time, noise heterogeneity is taken into account. For the case of binary coupled data, however, the only existing methods perform a simultaneous analysis of all data blocks but do not account for noise heterogeneity. Therefore, in this paper, the SIMCLAS model, a Hierarchical Classes model for the simultaneous analysis of coupled binary two-way matrices, is presented. In this model, noise heterogeneity between and within the data blocks is accounted for by downweighting entries from noisy blocks/objects within a block. A simulation study shows that (1) the SIMCLAS technique recovers the underlying structure of coupled data to a very large extent, and (2) the SIMCLAS technique outperforms a Hierarchical Classes technique in which all entries contribute equally to the analysis (i.e., noise homogeneity within and between blocks). The latter is also demonstrated in an application of both techniques to empirical data on the categorization of semantic concepts.
Notes
In the remainder of this paper, the terms ‘matrix’ and ‘block’ are used interchangeably.
A value of \(\pi_n\) larger than 0.50 is not realistic, because this implies that the model entries may differ from their corresponding data entries with a probability that is above chance.
Note that allowing π to differ among the columns of each data block would yield a special case of block-homogeneous SIMCLAS (i.e., the case in which each data block consists of one variable only). Note further that it may not be a good idea to allow π to be different for each data element, because it would be impossible to estimate such a model (i.e., there are more parameters to fit than there are data points).
When \(\pi_1 > \pi_2\), then \(c_{1} = \log (\frac{\pi_{1}}{1-\pi_{1}}) > c_{2} = \log (\frac{\pi_{2}}{1-\pi_{2}})\), with \(c_1\) and \(c_2\) representing, respectively, the contribution of entries from \(\mathbf{D}_1\) and \(\mathbf{D}_2\) to the likelihood, which has to be maximized. Note that \(c_1\) and \(c_2\) are negative when \(\pi_n \leq 0.50\), which implies that \(|c_1| < |c_2|\). As such, entries from more noisy blocks (i.e., larger \(\pi_n\) and smaller \(|c_n|\)) imply a smaller decrease in the likelihood than entries from less noisy blocks (i.e., smaller \(\pi_n\) and larger \(|c_n|\)), resulting in entries from more noisy blocks being downweighted.
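To see this downweighting numerically, one can evaluate these contributions for two illustrative noise levels (the π values below are ours, chosen for illustration; they are not from the paper):

```python
import math

def contribution(pi_n):
    """Per-entry log-odds contribution c_n = log(pi_n / (1 - pi_n))."""
    return math.log(pi_n / (1 - pi_n))

c1 = contribution(0.40)  # noisier block (pi_1 = 0.40)
c2 = contribution(0.20)  # less noisy block (pi_2 = 0.20)
# Both contributions are negative, c1 > c2, and |c1| < |c2|: a mismatch
# in the noisier block lowers the likelihood less, i.e., its entries
# are downweighted relative to those of the less noisy block.
```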
The kappa coefficient κ between two dichotomous variables can be computed as follows:
$$ \kappa = \frac{(p_{00} + p_{11}) - (p_{0\cdot}p_{\cdot 0} + p_{1\cdot}p_{\cdot 1})}{1 - (p_{0\cdot}p_{\cdot 0} + p_{1\cdot}p_{\cdot 1})}, \tag{7} $$
with \(p_{00}\) (\(p_{11}\)) the proportion of zero-agreements (one-agreements), and \(p_{0\cdot}\) and \(p_{1\cdot}\) (\(p_{\cdot 0}\) and \(p_{\cdot 1}\)) the marginal proportions of zeros and ones for the first (second) variable.
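As a quick illustration, Eq. (7) can be computed directly from two binary vectors. The following Python sketch (function name and interface are ours, not from the paper) implements the formula term by term:

```python
def kappa(x, y):
    """Cohen's kappa for two equal-length binary (0/1) lists, per Eq. (7)."""
    n = len(x)
    # joint proportions of zero-agreements and one-agreements
    p00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0) / n
    p11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1) / n
    # marginal proportions of zeros and ones for each variable
    p0_, p1_ = x.count(0) / n, x.count(1) / n
    p_0, p_1 = y.count(0) / n, y.count(1) / n
    po = p00 + p11                # observed agreement
    pe = p0_ * p_0 + p1_ * p_1    # agreement expected by chance
    return (po - pe) / (1 - pe)
```

For two identical nontrivial vectors this yields 1 (perfect agreement), and for perfectly inverted vectors it yields −1.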
References
Aarts, E.H.L., Korst, J.H.M., & van Laarhoven, P.J.M. (1997). Simulated annealing. In E.H.L. Aarts & J.K. Lenstra (Eds.), Local search in combinatorial optimization (pp. 91–120). Chichester: Wiley.
Barbut, M., & Monjardet, B. (1970). Ordre et classification: Algèbre et combinatoire. Paris: Hachette.
Birkhoff, G. (1940). Lattice theory. Providence: American Mathematical Society.
Ceulemans, E., & Storms, G. (2010). Detecting intra- and inter-categorical structure in semantic concepts using HICLAS. Acta Psychologica, 133, 296–304.
Ceulemans, E., & Van Mechelen, I. (2003). Uniqueness of n-way n-mode hierarchical classes models. Journal of Mathematical Psychology, 47, 259–264.
Ceulemans, E., & Van Mechelen, I. (2004). Tucker2 hierarchical classes analysis. Psychometrika, 69, 375–399.
Ceulemans, E., & Van Mechelen, I. (2005). Hierarchical classes models for three-way three-mode binary data: Interrelations and model selection. Psychometrika, 70, 461–480.
Ceulemans, E., Van Mechelen, I., & Leenen, I. (2003). Tucker3 hierarchical classes analysis. Psychometrika, 68, 413–433.
Ceulemans, E., Van Mechelen, I., & Leenen, I. (2007). The local minima problem in hierarchical classes analysis: An evaluation of a simulated annealing algorithm and various multistart procedures. Psychometrika, 72, 377–391.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46.
De Boeck, P., & Rosenberg, S. (1988). Hierarchical classes: Model and data analysis. Psychometrika, 53, 361–381.
De Deyne, S., Verheyen, S., Ameel, E., Vanpaemel, W., Dry, M., Voorspoels, W., & Storms, G. (2008). Exemplar by feature applicability matrices and other Dutch normative data for semantic concepts. Behavior Research Methods, 40, 1030–1048.
Haggard, E.A. (1958). Intraclass correlation and the analysis of variance. New York: Dryden.
Kiers, H.A.L. (2000). Towards a standardized notation and terminology in multiway analysis. Journal of Chemometrics, 14, 105–122.
Kiers, H.A.L., & ten Berge, J.M.F. (1989). Alternating least squares algorithms for simultaneous components analysis with equal component weight matrices for all populations. Psychometrika, 54, 467–473.
Kiers, H.A.L., & ten Berge, J.M.F. (1994). Hierarchical relations between methods for simultaneous component analysis and a technique for rotation to a simple simultaneous structure. British Journal of Mathematical & Statistical Psychology, 47, 109–126.
Kirk, R.E. (1982). Experimental design: Procedures for the behavioral sciences (2nd ed.). Belmont: Brooks/Cole.
Kirkpatrick, S., Gelatt, C.D. Jr., & Vecchi, M.P. (1983). Optimization by simulated annealing. Science, 220, 671–680.
Leenen, I., & Van Mechelen, I. (2001). An evaluation of two algorithms for hierarchical classes analysis. Journal of Classification, 18, 57–80.
Leenen, I., Van Mechelen, I., De Boeck, P., & Rosenberg, S. (1999). INDCLAS: A three-way hierarchical classes model. Psychometrika, 64, 9–24.
Leenen, I., Van Mechelen, I., Gelman, A., & De Knop, S. (2008). Bayesian hierarchical classes analysis. Psychometrika, 73, 39–64.
Millsap, R.E., & Meredith, W. (1988). Component analysis in cross-sectional and longitudinal data. Psychometrika, 53, 123–134.
ten Berge, J.M.F., Kiers, H.A.L., & van der Stel, V. (1992). Simultaneous components analysis. Statistica Applicata, 4, 377–392.
Timmerman, M.E., & Kiers, H.A.L. (2003). Four simultaneous component models for the analysis of multivariate time series from more than one subject to model intraindividual and interindividual differences. Psychometrika, 68, 105–121.
Van Deun, K., Smilde, A.K., van der Werf, M.J., Kiers, H.A.L., & Van Mechelen, I. (2009). A structured overview of simultaneous component based data integration. BMC Bioinformatics, 10, 246.
Van Mechelen, I., De Boeck, P., & Rosenberg, S. (1995). The conjunctive model of hierarchical classes. Psychometrika, 60, 505–521.
Van Mechelen, I., & Smilde, A.K. (2009). A generic model for data fusion. Paper presented at the 6th meeting of TRICAP (Three-way methods in chemistry and psychology), June 14–19, Vall de Núria, Spain.
Van Mechelen, I., & Smilde, A.K. (2010). A generic linked-mode decomposition model for data fusion. Chemometrics and Intelligent Laboratory Systems, 104, 83–94.
Wilderjans, T.F., Ceulemans, E., & Van Mechelen, I. (2008). The CHIC model: A global model for coupled binary data. Psychometrika, 73, 729–751.
Wilderjans, T.F., Ceulemans, E., Van Mechelen, I., & van den Berg, R.A. (2011). Simultaneous analysis of coupled data matrices subject to different amounts of noise. British Journal of Mathematical & Statistical Psychology, 64, 277–290.
Additional information
The first author is a Research Assistant of the Fund for Scientific Research (FWO)—Flanders (Belgium). The research reported in this paper was partially supported by the Research Council of K.U. Leuven (GOA/2005/04 and EF/2005/07, ‘SymBioSys’) and by IWT-Flanders (SBO 60045, ‘Bioframe’). We would like to thank Gert Storms and his collaborators for providing us with an interesting data set.
Appendix: Simulated Annealing to Estimate the Bundle Matrices, Conditional on the Noise Parameters
To estimate, in Step 2 of the SIMCLAS algorithm (see the section The SIMCLAS Algorithm), the binary bundle matrices A and B n that maximize the loss function, conditionally upon the noise parameters, a simulated annealing (SA) procedure is adopted. Simulated annealing is a local search technique that performs a walk through the solution space. In particular, a chain of solutions, consisting of several subchains, is generated by repeatedly creating a candidate solution from the current solution. Next, the loss function values of the current and candidate solutions are compared. If the candidate solution has a better loss function value f, it is accepted, meaning that the current solution is replaced by the candidate solution. If the candidate solution has a worse loss function value, however, it is accepted with a probability that depends on its relative quality (i.e., the difference in loss function value f between the current and candidate solutions) and on the current temperature, a quantity that controls the acceptance probability. At the end of each subchain the temperature is decreased. Subchains are generated until a prespecified stop criterion is met. Finally, the best solution encountered in the chain is retained.
Based on the results of a pilot study and on the SA implementations that have been used for other Hierarchical Classes models (see Ceulemans et al. 2007), we implemented the procedure for generating a single SA chain (see Algorithm A1 for pseudo-code) in the SIMCLAS algorithm as follows:
1. An initial solution S current, with associated initial loss value L current, is obtained by replacing the P columns of each bundle matrix by P data vectors sampled at random (i.e., for A, column vectors are drawn from the different D n, whereas for each B n, row vectors are chosen from the corresponding D n).
2. The initial temperature T initial is obtained by running a subchain of solutions and accepting all solutions; subsequently, the average increase in the likelihood function across those links in which worse solutions are accepted is divided by ln(0.8). As such, during the first subchains, in which the algorithm is still far from the optimal solution, worse solutions are accepted with a high probability (see Kirkpatrick et al. 1983; Aarts et al. 1997; Ceulemans et al. 2007).
3. A candidate solution S trial, with associated loss value L trial, is obtained from the current solution S current by altering the value of a randomly chosen cell of a randomly chosen bundle matrix, with each cell of each bundle matrix having the same probability of being changed.
4. A worse candidate solution is accepted if p<exp((L trial−L current)/T current), with p a number drawn from a uniform (0, 1) distribution.
5. A subchain stops (1) if the number of generated solutions i gen equals the maximum number of solutions \(\mathit{CL} = ((I+\sum_{n=1}^{N} J_{n})\times2^{P})\times5\), or (2) if the number of accepted solutions i acc equals CL×0.10.
6. At the end of each subchain, the temperature is decreased by a factor α=0.90, implying a smaller acceptance probability for worse solutions: T current=0.90×T current.
7. An SA chain stops when (1) the current temperature becomes smaller than T stop=0.000001, or (2) the number of subsequent subchains i id with an identical loss value L current for the last accepted solution in each subchain (i.e., L current=L previous) equals \(\mathit {max}_{i_{\mathrm{id}}}\), which is set to five.
8. The retained solution is the best solution S best encountered across all subchains.
To lower the risk of ending in a suboptimal solution (i.e., local optimum), a multi-start procedure may be advised, which consists of running 100 SA chains, each time with a different initial solution and initial temperature (see Steps 1 and 2), and retaining the best encountered solution across all chains (see Ceulemans et al. 2007).
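The chain logic of the steps above can be sketched in Python as follows. This is a minimal illustration on a single binary matrix with an arbitrary objective: the actual SIMCLAS likelihood and bundle-matrix structure are not reproduced, and the function names, the toy subchain length (an analogue of CL with P=1), and the example objective are ours:

```python
import math
import random

def sa_chain(matrix, loss, t_initial, alpha=0.90, t_stop=1e-6, max_id=5, rng=None):
    """One simulated-annealing chain over a binary matrix: single-cell flips,
    geometric cooling, and the two stop rules from steps 5 and 7 above.
    `loss` is any function to be maximized (a stand-in for the likelihood)."""
    if rng is None:
        rng = random.Random(0)
    rows, cols = len(matrix), len(matrix[0])
    cl = (rows + cols) * 2 * 5                    # toy analogue of CL with P = 1
    current = [row[:] for row in matrix]
    l_current = loss(current)
    best, l_best = [row[:] for row in current], l_current
    t, l_previous, i_id = t_initial, None, 0
    while t > t_stop and i_id < max_id:           # step 7: chain stop rules
        i_gen = i_acc = 0
        while i_gen < cl and i_acc < cl * 0.10:   # step 5: subchain stop rules
            i_gen += 1
            r, c = rng.randrange(rows), rng.randrange(cols)
            trial = [row[:] for row in current]
            trial[r][c] = 1 - trial[r][c]         # step 3: flip one random cell
            l_trial = loss(trial)
            # step 4: always accept improvements; accept worse solutions
            # with probability exp((l_trial - l_current) / t)
            if l_trial > l_current or rng.random() < math.exp((l_trial - l_current) / t):
                current, l_current = trial, l_trial
                i_acc += 1
                if l_current > l_best:
                    best, l_best = [row[:] for row in current], l_current
        i_id = i_id + 1 if l_current == l_previous else 0
        l_previous = l_current
        t *= alpha                                # step 6: cooling
    return best, l_best
```

A multi-start procedure then simply calls `sa_chain` repeatedly with different initial solutions and temperatures and keeps the best result across chains.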
Cite this article
Wilderjans, T.F., Ceulemans, E. & Van Mechelen, I. The SIMCLAS Model: Simultaneous Analysis of Coupled Binary Data Matrices with Noise Heterogeneity Between and Within Data Blocks. Psychometrika 77, 724–740 (2012). https://doi.org/10.1007/s11336-012-9275-3