Abstract
The analysis of n-ary relations receives attention in many different fields, for instance biology, web mining, and social studies. In the basic setting, there are n sets of instances, and each observation associates n instances, one from each set. A common approach to exploring such n-way data is the search for n-set patterns, the n-way equivalent of itemsets. More precisely, an n-set pattern consists of specific subsets of the n instance sets such that all possible associations between the corresponding instances are observed in the data. In contrast, traditional itemset mining approaches consider only two-way data, namely items versus transactions. The n-set patterns provide a higher-level view of the data, revealing associative relationships between groups of instances. Here, we generalize this approach in two respects. First, we tolerate missing observations to a certain degree; that is, we are also interested in n-sets where most (although not all) of the possible associations have been recorded in the data. Second, we take association weights into account. Specifically, we propose a method to enumerate all n-sets whose average association weight satisfies a minimum threshold. Technically, we solve the enumeration task using a reverse search strategy, which allows for effective pruning of the search space. In addition, our algorithm provides a ranking of the solutions and can take further constraints into account. We show experimental results on artificial and real-world datasets from different domains.
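To make the notion of an n-set with a minimum average association weight concrete, the following is a minimal illustrative sketch in Python. It is not the reverse search algorithm described above: it simply tries every combination of non-empty subsets of a tiny tensor, which is only feasible for toy inputs. The function names, the dictionary representation of the tensor, and the choice to count missing observations as weight 0 are all assumptions made for this illustration.

```python
from itertools import combinations, product


def average_weight(tensor, subsets):
    """Mean weight over all cells selected by one index subset per mode.

    `tensor` maps an n-tuple of indices to a weight. Cells that are absent
    from the dictionary count as weight 0 (an assumption of this sketch).
    """
    cells = list(product(*subsets))
    return sum(tensor.get(cell, 0.0) for cell in cells) / len(cells)


def nonempty_subsets(size):
    """Yield every non-empty subset of range(size) as a sorted tuple."""
    for k in range(1, size + 1):
        yield from combinations(range(size), k)


def enumerate_dense_nsets(tensor, shape, theta):
    """Brute force: return all n-sets with average weight >= theta.

    The result is ranked by decreasing average weight. The running time is
    exponential in the mode sizes, so this is for illustration only.
    """
    results = []
    per_mode = [list(nonempty_subsets(size)) for size in shape]
    for subsets in product(*per_mode):
        avg = average_weight(tensor, subsets)
        if avg >= theta:
            results.append((avg, subsets))
    results.sort(key=lambda item: (-item[0], item[1]))
    return results


# Hypothetical toy data: a 2 x 2 x 2 tensor with four recorded associations.
toy_tensor = {
    (0, 0, 0): 1.0,
    (0, 0, 1): 1.0,
    (0, 1, 0): 1.0,
    (1, 0, 0): 1.0,
}
toy_shape = (2, 2, 2)

# Three of the four cells of this 3-set are recorded, so its average is 0.75.
candidate = ((0,), (0, 1), (0, 1))
candidate_avg = average_weight(toy_tensor, candidate)

dense = enumerate_dense_nsets(toy_tensor, toy_shape, theta=0.75)
for avg, subsets in dense:
    print(avg, subsets)
```

In this toy example the 3-set `((0,), (0, 1), (0, 1))` is reported although one of its four possible associations is missing, which is the kind of tolerance the threshold on the average weight is meant to express.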
Additional information
Editors: S.V.N. Vishwanathan, Samuel Kaski, Jennifer Neville, and Stefan Wrobel.
Rights and permissions
Open Access This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License (https://creativecommons.org/licenses/by-nc/2.0), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
About this article
Cite this article
Georgii, E., Tsuda, K. & Schölkopf, B. Multi-way set enumeration in weight tensors. Mach Learn 82, 123–155 (2011). https://doi.org/10.1007/s10994-010-5210-y
DOI: https://doi.org/10.1007/s10994-010-5210-y
Keywords
- Tensor
- Multi-way set
- Dense pattern enumeration
- Quasi-hyper-clique
- N-ary relation
- Graph mining