Semi-supervised Learning for Mixed-Type Data via Formal Concept Analysis

  • Mahito Sugiyama
  • Akihiro Yamamoto
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6828)

Abstract

Only a few machine learning methods, e.g., decision tree-based classification, can handle mixed-type data sets containing both discrete (binary and nominal) and continuous (real-valued) variables, and no semi-supervised learning method can treat such data sets directly. Here we propose a novel semi-supervised learning method, called SELF (SEmi-supervised Learning via FCA), for mixed-type data sets using Formal Concept Analysis (FCA). SELF extracts a lattice structure via FCA while discretizing continuous variables, and learns classification rules effectively using that structure. Incomplete data sets containing missing values can be handled directly by our method. We experimentally demonstrate the competitive performance of SELF compared to other supervised and semi-supervised learning methods. Our contribution is not only a novel semi-supervised learning method, but also a bridge between the two fields of conceptual analysis and knowledge discovery.
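To make the FCA step in the abstract concrete, the following is a minimal, illustrative sketch (not the authors' SELF implementation): a tiny mixed-type table is turned into a formal context by binarizing a nominal variable and discretizing a continuous one with a hand-chosen cut point, and all formal concepts — the nodes of the concept lattice — are then enumerated by closing subsets of objects. The data, attribute names, and the cut point 2.0 are all invented for illustration.

```python
from itertools import combinations

# Four objects with one nominal and one continuous variable (invented data).
data = {
    "o1": {"color": "red", "size": 1.2},
    "o2": {"color": "red", "size": 3.5},
    "o3": {"color": "blue", "size": 3.7},
    "o4": {"color": "blue", "size": 0.9},
}

# Build a formal context: binarize 'color' and discretize 'size' at 2.0.
# (The cut point is hand-chosen here; SELF derives discretizations from data.)
context = {
    obj: {f"color={v['color']}",
          "size<2.0" if v["size"] < 2.0 else "size>=2.0"}
    for obj, v in data.items()
}

objects = list(context)

def intent(objs):
    """Attributes shared by every object in objs (the derivation A')."""
    if not objs:
        return set.union(*context.values())  # empty extent -> all attributes
    return set.intersection(*(context[o] for o in objs))

def extent(attrs):
    """Objects possessing every attribute in attrs (the derivation B')."""
    return {o for o in objects if attrs <= context[o]}

# A formal concept is a pair (A, B) with A' = B and B' = A.
# Closing every subset of objects yields each concept at least once.
concepts = set()
for r in range(len(objects) + 1):
    for subset in combinations(objects, r):
        b = intent(set(subset))
        a = frozenset(extent(b))
        concepts.add((a, frozenset(b)))

for a, b in sorted(concepts, key=lambda c: (len(c[0]), sorted(c[0]))):
    print(sorted(a), "->", sorted(b))
```

Ordering the resulting (extent, intent) pairs by extent inclusion gives the concept lattice that SELF traverses when learning classification rules; the brute-force closure loop here is exponential and only suitable for toy contexts.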

Keywords

Semi-supervised learning · Classification · Mixed-type data · Formal Concept Analysis · Discretization · Concept lattice



Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Mahito Sugiyama (1, 2)
  • Akihiro Yamamoto (1)
  1. Graduate School of Informatics, Kyoto University, Kyoto, Japan
  2. Research Fellow of the Japan Society for the Promotion of Science, Japan
