Machine Learning, Volume 82, Issue 2, pp 123–155

Multi-way set enumeration in weight tensors

  • Elisabeth Georgii
  • Koji Tsuda
  • Bernhard Schölkopf
Open Access
Article

Abstract

The analysis of n-ary relations receives attention in many different fields, for instance biology, web mining, and social studies. In the basic setting, there are n sets of instances, and each observation associates n instances, one from each set. A common approach to exploring such n-way data is the search for n-set patterns, the n-way equivalent of itemsets. More precisely, an n-set pattern consists of specific subsets of the n instance sets such that all possible associations between the corresponding instances are observed in the data. In contrast, traditional itemset mining approaches consider only two-way data, namely items versus transactions. The n-set patterns provide a higher-level view of the data, revealing associative relationships between groups of instances. Here, we generalize this approach in two respects. First, we tolerate missing observations to a certain degree; that is, we are also interested in n-sets where most (although not all) of the possible associations have been recorded in the data. Second, we take association weights into account. Specifically, we propose a method to enumerate all n-sets whose average association weight satisfies a minimum threshold. Technically, we solve the enumeration task using a reverse search strategy, which allows for effective pruning of the search space. In addition, our algorithm provides a ranking of the solutions and can consider further constraints. We show experimental results on artificial and real-world datasets from different domains.
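As a rough illustration of the density criterion described above, the following sketch computes the average association weight of a candidate n-set and compares it with a minimum threshold. It is not the authors' reverse search algorithm; it only checks one given candidate. The names `average_weight`, `is_dense`, and `theta`, the dictionary-based tensor, the toy identifiers, and the choice to count unobserved associations as weight 0 are all assumptions made for this example.

```python
from itertools import product


def average_weight(tensor, n_set):
    """Mean weight over all cells spanned by the n-set.

    tensor: dict mapping n-tuples of instance ids to weights.
            Cells absent from the dict count as 0 (an assumption
            of this sketch, modelling missing observations).
    n_set:  sequence of n non-empty collections, one per dimension.
    """
    cells = list(product(*n_set))
    if not cells:
        raise ValueError("every dimension needs at least one instance")
    return sum(tensor.get(cell, 0.0) for cell in cells) / len(cells)


def is_dense(tensor, n_set, theta):
    """True if the average weight reaches the minimum threshold theta."""
    return average_weight(tensor, n_set) >= theta


# Hypothetical 3-way toy data: (gene, condition, time point) -> weight.
observed = {
    ("g1", "c1", "t1"): 1.0,
    ("g1", "c1", "t2"): 1.0,
    ("g2", "c1", "t1"): 1.0,
    ("g2", "c1", "t2"): 0.5,
    ("g1", "c2", "t1"): 1.0,
}

# Candidate 3-set spanning 2 * 1 * 2 = 4 cells, all of them observed.
pattern = [("g1", "g2"), ("c1",), ("t1", "t2")]

# Adding c2 spans 8 cells, of which only 5 carry any weight.
extended = [("g1", "g2"), ("c1", "c2"), ("t1", "t2")]
```

With these toy values, `pattern` has an average weight of 3.5 / 4 = 0.875 and `extended` has 4.5 / 8 = 0.5625, so at a threshold of 0.8 only the smaller 3-set would qualify. A full enumeration would still need a search strategy over all candidate n-sets, which this sketch does not attempt.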

Keywords

Tensor, Multi-way set, Dense pattern enumeration, Quasi-hyper-clique, N-ary relation, Graph mining

Copyright information

© The Author(s) 2010

Authors and Affiliations

  • Elisabeth Georgii (1, 2, 3)
  • Koji Tsuda (4, 5)
  • Bernhard Schölkopf (6)

  1. Department of Empirical Inference, Max Planck Institute for Biological Cybernetics, Tübingen, Germany
  2. Friedrich Miescher Laboratory of the Max Planck Society, Tübingen, Germany
  3. Department of Information and Computer Science, Helsinki Institute for Information Technology, HIIT, Aalto University School of Science and Technology, Aalto, Finland
  4. Computational Biology Research Center, National Institute of Advanced Industrial Science and Technology, AIST, Tokyo, Japan
  5. ERATO Minato Project, Japan Science and Technology Agency, Tokyo, Japan
  6. Department of Empirical Inference, Max Planck Institute for Biological Cybernetics, Tübingen, Germany
