Contrast Mining from Interesting Subgroups

  • Laura Langohr
  • Vid Podpečan
  • Marko Petek
  • Igor Mozetič
  • Kristina Gruden
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7250)

Abstract

Subgroup discovery methods find interesting subsets of objects of a given class. We propose to extend subgroup discovery with a second subgroup discovery step that finds interesting subgroups of objects specific to a class in one or more contrast classes. First, a subgroup discovery method is applied. Then, contrast classes of objects are defined by applying set-theoretic functions to the discovered subgroups of objects. Finally, subgroup discovery is performed again to find interesting subgroups within the two contrast classes, pointing out differences between the characteristics of the two. The approach has various application areas, one being biology, where finding interesting subgroups has been widely addressed for gene-expression data. There, our method finds enriched gene sets that are common to samples in a class (e.g., differential expression in virus-infected versus non-infected samples) and at the same time specific to one or more class attributes (e.g., time points or genotypes). We report experimental results on a time-series data set for virus-infected potato plants. The results present a comprehensive overview of the potato's response to virus infection and reveal new research hypotheses for plant biologists.
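
The three steps described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the chapter: the gene identifiers and annotation terms are invented toy data, weighted relative accuracy (WRAcc) stands in for whatever quality measure the authors use, each subgroup is a single annotation term, and set difference is only one of the possible set-theoretic functions for defining a contrast class.

```python
from typing import Dict, Set


def wracc(cover: Set[int], positives: Set[int], n: int) -> float:
    """Weighted relative accuracy: coverage times the gain in class share."""
    if not cover:
        return 0.0
    coverage = len(cover) / n
    share_in_cover = len(cover & positives) / len(cover)
    share_overall = len(positives) / n
    return coverage * (share_in_cover - share_overall)


def discover(annotations: Dict[int, Set[str]], positives: Set[int],
             min_quality: float = 0.0) -> Dict[str, Set[int]]:
    """Toy subgroup discovery: keep each term whose cover scores above min_quality."""
    n = len(annotations)
    terms = sorted({t for ts in annotations.values() for t in ts})
    found = {}
    for term in terms:
        cover = {g for g, ts in annotations.items() if term in ts}
        if wracc(cover, positives, n) > min_quality:
            found[term] = cover
    return found


def covered(subgroups: Dict[str, Set[int]]) -> Set[int]:
    """All objects covered by at least one discovered subgroup."""
    return set().union(*subgroups.values())


# Hypothetical toy data: gene id -> annotation terms.
annotations = {
    0: {"defense"}, 1: {"defense"}, 2: {"defense", "photosynthesis"},
    3: {"photosynthesis"}, 4: {"photosynthesis"},
    5: {"transport"}, 6: {"transport"}, 7: {"transport"},
}
# Hypothetical differentially expressed genes at two time points.
expressed_t1 = {0, 1, 2}
expressed_t2 = {2, 3, 4}

# Step 1: subgroup discovery for each time point.
subgroups_t1 = discover(annotations, expressed_t1)
subgroups_t2 = discover(annotations, expressed_t2)

# Step 2: contrast class = genes in subgroups found at t1 but not at t2.
contrast_t1 = covered(subgroups_t1) - covered(subgroups_t2)

# Step 3: second subgroup discovery with the contrast class as target.
contrast_subgroups = discover(annotations, contrast_t1)
print(sorted(contrast_subgroups))
```

On this toy data the first step yields one subgroup per time point, the set difference leaves genes 0 and 1 as the contrast class, and the second step keeps only the "defense" term. A real analysis would use ontology-based gene sets and a statistically grounded enrichment test rather than a bare threshold on WRAcc.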

Copyright information

© The Author(s) 2012

Authors and Affiliations

  • Laura Langohr (1)
  • Vid Podpečan (2)
  • Marko Petek (3)
  • Igor Mozetič (2)
  • Kristina Gruden (3)
  1. Department of Computer Science and Helsinki Institute for Information Technology (HIIT), University of Helsinki, Finland
  2. Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia
  3. Department of Biotechnology and Systems Biology, National Institute of Biology, Ljubljana, Slovenia
