Journal of Intelligent Information Systems, Volume 42, Issue 2, pp 233–254

Semantic subgroup explanations

Abstract

Subgroup discovery (SD) methods can be used to find interesting subsets of objects of a given class. While the rules that describe subgroups are themselves good explanations of those subgroups, domain ontologies can provide additional descriptions of the data and alternative explanations of the constructed rules. Such explanations, expressed in terms of higher-level ontology concepts, have the potential to provide new insights into the domain of investigation. We show that this additional explanatory power can be ensured by using recently developed semantic SD methods. We present a new approach to explaining subgroups through ontologies and demonstrate its utility on a motivational use case and on a gene expression profiling use case in which groups of patients, identified through SD in terms of gene expression, are further explained through concepts from the Gene Ontology and KEGG Orthology. We qualitatively compare the methodology with the supporting factors technique for characterizing subgroups. The developed tools are implemented within ClowdFlows, a new browser-based data mining platform.
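
To make the idea of explaining a subgroup through higher-level ontology concepts more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a toy dataset and a toy child-to-parent ontology (both hypothetical), and it scores each candidate concept with weighted relative accuracy (WRAcc), a quality measure commonly used in subgroup discovery; the abstract itself does not state which measure the approach uses.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy data: (annotation terms of an example, is it in the target subgroup?)
EXAMPLES: List[Tuple[Set[str], bool]] = [
    ({"apple"}, True),
    ({"pear"}, True),
    ({"carrot"}, False),
    ({"apple", "carrot"}, True),
    ({"potato"}, False),
    ({"pear", "potato"}, True),
]

# Hypothetical ontology given as child -> parent links.
PARENT: Dict[str, str] = {
    "apple": "fruit",
    "pear": "fruit",
    "carrot": "vegetable",
    "potato": "vegetable",
    "fruit": "food",
    "vegetable": "food",
}


def ancestors(term: str) -> Set[str]:
    """Return the term together with all of its ancestors in the ontology."""
    result = {term}
    while term in PARENT:
        term = PARENT[term]
        result.add(term)
    return result


def generalize(terms: Set[str]) -> Set[str]:
    """Expand a set of annotation terms with every higher-level concept."""
    expanded: Set[str] = set()
    for term in terms:
        expanded |= ancestors(term)
    return expanded


def wracc(concept: str, examples: List[Tuple[Set[str], bool]]) -> float:
    """Weighted relative accuracy of the rule 'concept -> target subgroup'."""
    n = len(examples)
    covered = [label for terms, label in examples if concept in generalize(terms)]
    if not covered:
        return 0.0
    p_cond = len(covered) / n                        # coverage of the rule
    p_class = sum(label for _, label in examples) / n  # prior of the target
    p_class_given_cond = sum(covered) / len(covered)   # precision of the rule
    return p_cond * (p_class_given_cond - p_class)


def rank_concepts(examples: List[Tuple[Set[str], bool]]) -> List[Tuple[float, str]]:
    """Score every concept reachable from the data, best first."""
    concepts: Set[str] = set()
    for terms, _ in examples:
        concepts |= generalize(terms)
    return sorted(((wracc(c, examples), c) for c in concepts), reverse=True)


if __name__ == "__main__":
    for score, concept in rank_concepts(EXAMPLES):
        print(f"{concept:10s} WRAcc = {score:+.3f}")
```

On this toy data the higher-level concept `fruit` scores better than either of the specific terms `apple` or `pear`, which illustrates why a description at a more general ontology level can be a more compact explanation of a subgroup.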

Keywords

Data mining; Semantic data mining; Subgroup discovery; Ontologies; Microarray data

Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  • Anže Vavpetič (1, 2)
  • Vid Podpečan (1)
  • Nada Lavrač (1, 2, 3)

  1. Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia
  2. Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
  3. University of Nova Gorica, Nova Gorica, Slovenia
