Clustering Boolean tensors

Metzler, Saskia; Miettinen, Pauli

doi:10.1007/s10618-015-0420-3

Published: 16 June 2015

Volume 29, pages 1343–1373, (2015)

Data Mining and Knowledge Discovery

Abstract

Graphs that evolve over time, such as friendship networks, are an example of data that are naturally represented as binary tensors. Just as we can analyse the adjacency matrix of a graph using a matrix factorization, we can analyse the tensor by factorizing it. Unfortunately, tensor factorizations are computationally hard problems, and in particular, are often significantly harder than their matrix counterparts. In the case of Boolean tensor factorizations, where the input tensor and all the factors are required to be binary and we use Boolean algebra, much of that hardness comes from the possibility of overlapping components. Yet, in many applications we are perfectly happy to partition at least one of the modes. For instance, in the aforementioned time-evolving friendship networks, groups of friends might be overlapping, but the time points at which the network was captured are always distinct. In this paper we investigate what consequences this partitioning has on the computational complexity of Boolean tensor factorizations and present a new algorithm for the resulting clustering problem. This algorithm can alternatively be seen as a particularly regularized clustering algorithm that can handle extremely high-dimensional observations. We analyse our algorithm with the goal of maximizing the similarity and argue that this is more meaningful than minimizing the dissimilarity. As a by-product we obtain a PTAS and an efficient 0.828-approximation algorithm for rank-1 binary factorizations. Our algorithm for Boolean tensor clustering achieves high scalability, high similarity, and good generalization to unseen data with both synthetic and real-world data sets.
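
The setting in the abstract can be made concrete with a small sketch. It is not the authors' implementation: the function names (`boolean_cp`, `is_cluster_assignment`, `similarity`) and the toy factor matrices are assumptions chosen for illustration. It builds a 3-way binary tensor as a Boolean sum (logical OR) of rank-1 components, requires the third factor to be a cluster assignment matrix so that the third mode is partitioned, and counts similarity as the number of cells on which two tensors agree.

```python
import numpy as np


def boolean_cp(A, B, C):
    """Boolean CP product: X[i, j, l] = OR over r of (A[i, r] AND B[j, r] AND C[l, r])."""
    # Count how many components cover each cell, then threshold (1 + 1 = 1 in Boolean algebra).
    counts = np.einsum("ir,jr,lr->ijl", A.astype(int), B.astype(int), C.astype(int))
    return counts > 0


def is_cluster_assignment(C):
    """True if every row of C has exactly one 1, i.e. C partitions that mode."""
    return bool(np.all(C.sum(axis=1) == 1))


def similarity(X, Y):
    """Number of cells where the two binary tensors agree (ones and zeros alike)."""
    return int(np.sum(X == Y))


# Hypothetical toy data: two overlapping groups in modes 1 and 2 ...
A = np.array([[1, 0], [1, 1], [0, 1]])
B = np.array([[1, 0], [1, 1], [0, 1]])
# ... and four "time points" in mode 3, each assigned to exactly one cluster.
C = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])

X = boolean_cp(A, B, C)

# Flip one cell to imitate noise and compare against the clean reconstruction.
noisy = X.copy()
noisy[0, 0, 0] = not noisy[0, 0, 0]
sim = similarity(X, noisy)
dissim = X.size - sim
```

Because `C` is a partition in this sketch, every frontal slice `X[:, :, l]` is a single rank-1 binary matrix, namely the outer product of the columns of `A` and `B` for that slice's cluster. Similarity and dissimilarity always sum to the number of cells, so they share the same optimum, but an approximation guarantee for one does not automatically carry over to the other.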

Notes

The code is available from http://www.mpi-inf.mpg.de/~pmiettin/btc/.
http://www.cs.cmu.edu/~epapalex/.
http://www.sandia.gov/~tgkolda/TensorToolbox/.
The noise levels are reported w.r.t. number of non-zeros.
http://grouplens.org/datasets/hetrec-2011.
http://www.delicious.com.
http://www.cs.cmu.edu/~enron/.
http://socialnetworks.mpi-sws.org/datasets.html.
http://www.last.fm.
http://www.cis.temple.edu/~yates/papers/jair-resolver.html.
http://www.caida.org/data/passive/passive_2009_dataset.xml.
http://www.mpi-inf.mpg.de/yago-naga/yago.

Author information

Authors and Affiliations

Max-Planck-Institut für Informatik, Saarbrücken, Germany
Saskia Metzler & Pauli Miettinen

Corresponding author

Correspondence to Saskia Metzler.

Additional information

Responsible editors: Joao Gama, Indre Zliobaite, Alipio Jorge, Concha Bielza.

About this article

Cite this article

Metzler, S., Miettinen, P. Clustering Boolean tensors. Data Min Knowl Disc 29, 1343–1373 (2015). https://doi.org/10.1007/s10618-015-0420-3

Received: 04 January 2015
Accepted: 25 May 2015
Published: 16 June 2015
Issue Date: September 2015
DOI: https://doi.org/10.1007/s10618-015-0420-3
