Journal of Intelligent Information Systems, Volume 42, Issue 1, pp 47–73

Bayesian analysis of GUHA hypotheses

  • Robert Piché
  • Marko Järvenpää
  • Esko Turunen
  • Milan Šimůnek
Article

Abstract

The LISp-Miner system for data mining and knowledge discovery uses the GUHA method to comb through a large database and find 2 × 2 contingency tables that satisfy a condition given by generalised quantifiers and thereby suggest possible relations between attributes. In this paper, we show how Bayesian statistical methods can give a more detailed interpretation of the data in the tables found by GUHA. Using a multinomial sampling model and a Dirichlet prior, we derive posterior distributions for parameters that correspond to GUHA generalised quantifiers. Examples illustrate the new Bayesian post-processing tools implemented in LISp-Miner. A statistical model for the analysis of contingency tables for data from two subpopulations is also presented.
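
The multinomial-Dirichlet approach named in the abstract can be illustrated with a minimal sketch. It is not the LISp-Miner implementation. It assumes a uniform Dirichlet(1, 1, 1, 1) prior over the four cell probabilities, a made-up table (a, b, c, d), and one quantifier-like parameter, the "confidence" theta_a / (theta_a + theta_b). The abstract does not specify the paper's exact parametrisation, so this choice and the helper names `sample_dirichlet` and `posterior_prob_confidence_exceeds` are hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalising independent gamma variates."""
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]


def posterior_prob_confidence_exceeds(table, p, prior=(1.0, 1.0, 1.0, 1.0),
                                      n_samples=20000, seed=0):
    """Monte Carlo estimate of P(theta_a / (theta_a + theta_b) > p | table).

    table: cell counts (a, b, c, d) of a 2 x 2 contingency table.
    prior: Dirichlet prior parameters (uniform by default; an assumption).
    """
    rng = random.Random(seed)
    # Conjugacy: multinomial counts plus Dirichlet prior give a Dirichlet posterior.
    alpha_post = [count + a0 for count, a0 in zip(table, prior)]
    hits = 0
    for _ in range(n_samples):
        t_a, t_b, _t_c, _t_d = sample_dirichlet(alpha_post, rng)
        if t_a / (t_a + t_b) > p:
            hits += 1
    return hits / n_samples


# Hypothetical counts, invented purely for illustration.
table = (40, 10, 20, 30)
prob = posterior_prob_confidence_exceeds(table, p=0.7)
print(f"Estimated P(confidence > 0.7 | data): {prob:.3f}")
```

The output is a posterior probability that the parameter exceeds the threshold, rather than a yes/no verdict on whether the observed ratio a / (a + b) passes it. That kind of graded statement is what Bayesian post-processing adds to a table that has already been found.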

Keywords

Data mining · GUHA · Contingency table · Bayesian statistics

Mathematics Subject Classifications (2010)

62F15 · 62H17 · 62-07

Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  • Robert Piché (1)
  • Marko Järvenpää (1)
  • Esko Turunen (2)
  • Milan Šimůnek (3)

  1. Tampere University of Technology, Tampere, Finland
  2. Center for Machine Perception, Department of Cybernetics, Faculty of Electrical Engineering, Czech Technical University, Prague, Czech Republic
  3. University of Economics Prague, Prague, Czech Republic