Data Mining and Knowledge Discovery, Volume 28, Issue 5–6, pp 1398–1428

Uncovering the plot: detecting surprising coalitions of entities in multi-relational schemas

  • Hao Wu
  • Jilles Vreeken
  • Nikolaj Tatti
  • Naren Ramakrishnan

Abstract

Many application domains, such as intelligence analysis and cybersecurity, require tools for the unsupervised identification of suspicious entities in multi-relational/network data. In particular, there is a need for automated or semi-automated approaches to ‘uncover the plot’, i.e., to detect non-obvious coalitions of entities bridging many types of relations. We cast the problem of detecting such suspicious coalitions and their connections as one of mining surprisingly dense and well-connected chains of biclusters over multi-relational data. With this as our goal, we model data by the Maximum Entropy principle, so that we can gauge, in a statistically well-founded way, the surprisingness of a discovered bicluster chain with respect to what we already know. We design an algorithm for approximating the most informative multi-relational patterns, and provide strategies to incrementally organize discovered patterns into the background model. We illustrate how our method is adept at discovering the hidden plot in multiple synthetic and real-world intelligence analysis datasets. Our approach naturally generalizes traditional attribute-based maximum entropy models for single relations, and further supports iterative, human-in-the-loop knowledge discovery.

Keywords

  • Multi-relational data
  • Maximum entropy modeling
  • Subjective interestingness
  • Pattern mining
  • Biclusters


Copyright information

© The Author(s) 2014

Authors and Affiliations

  • Hao Wu (1, 2)
  • Jilles Vreeken (3, 4)
  • Nikolaj Tatti (5, 6)
  • Naren Ramakrishnan (2, 7)

  1. Department of Electrical and Computer Engineering, Virginia Tech, Arlington, USA
  2. Discovery Analytics Center, Virginia Tech, Arlington, USA
  3. Max Planck Institute for Informatics, Saarbrücken, Germany
  4. Cluster of Excellence MMCI, Saarland University, Saarbrücken, Germany
  5. HIIT, Department of Information and Computer Science, Aalto University, Aalto, Finland
  6. Department of Computer Science, KU Leuven, Leuven, Belgium
  7. Department of Computer Science, Virginia Tech, Arlington, USA