Discretization Numbers for Multiple-Instances Problem in Relational Database

  • Rayner Alfred
  • Dimitar Kazakov
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4690)


Handling numerical data stored in a relational database differs from handling numerical data stored in a single table, because an individual record can occur multiple times in a non-target table and because relations between tables can be non-determinate. Most traditional data mining methods deal with a single table only and discretize columns containing continuous numbers into nominal values. In a relational database, multiple records with numerical attributes are stored separately from the target table, and these records are usually associated with a single structured individual stored in the target table. In multi-relational data mining (MRDM), numbers are often discretized, after considering the schema of the relational database, to reduce continuous domains to more manageable symbolic domains of low cardinality; the resulting loss of precision is assumed to be acceptable. In this paper, we consider different alternatives for dealing with continuous attributes in MRDM. The discretization procedures considered include algorithms that do not depend on the multi-relational structure of the data as well as algorithms that are sensitive to this structure. We study the effect of taking the one-to-many association into account when discretizing continuous numbers. We implement a new discretization method, called the entropy-instance-based discretization method, and evaluate it with respect to C4.5 on three varieties of a well-known multi-relational database (Mutagenesis), in which numeric attributes play an important role. Our empirical results demonstrate that entropy-based discretization can be improved by taking the multiple-instance problem into consideration.
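
As a rough illustration of the contrast drawn above, the following Python sketch selects a single entropy-minimizing cut point for a numeric attribute stored in a non-target table. It is not the entropy-instance-based method evaluated in the paper, whose details are not given here. The function names `entropy` and `best_cut`, the `(individual_id, value, class_label)` row layout, and the 1/n weighting used to account for the one-to-many association are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def entropy(weighted_labels):
    """Shannon entropy in bits of a {class_label: weight} mapping."""
    total = sum(weighted_labels.values())
    if total == 0:
        return 0.0
    return -sum((w / total) * math.log2(w / total)
                for w in weighted_labels.values() if w > 0)


def best_cut(rows, per_individual=False):
    """Return (cut_point, weighted_entropy) for the best binary split.

    rows: list of (individual_id, value, class_label) tuples, one per
          record in the non-target table (hypothetical layout).
    per_individual: if True, each record is weighted by 1/n, where n is
          the number of records of its individual, so that every
          individual contributes the same total weight. This weighting
          is an illustrative assumption, not the paper's method.
    """
    counts = Counter(ind for ind, _, _ in rows)
    weighted = [
        (value, label, 1.0 / counts[ind] if per_individual else 1.0)
        for ind, value, label in rows
    ]
    values = sorted({v for v, _, _ in weighted})
    best = (None, float("inf"))
    # Candidate cut points are midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        cut = (lo + hi) / 2
        left, right = defaultdict(float), defaultdict(float)
        for v, label, w in weighted:
            (left if v <= cut else right)[label] += w
        wl, wr = sum(left.values()), sum(right.values())
        score = (wl * entropy(left) + wr * entropy(right)) / (wl + wr)
        if score < best[1]:
            best = (cut, score)
    return best
```

Calling `best_cut(rows)` treats every record independently, as a single-table method would, whereas `best_cut(rows, per_individual=True)` prevents individuals with many associated records from dominating the choice of cut point.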


Discretization, Entropy-based, Semi-supervised clustering, Genetic Algorithm, Multiple Instance





Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Rayner Alfred (1, 2)
  • Dimitar Kazakov (1)
  1. University of York, Computer Science Department, Heslington, YO10 5DD York, United Kingdom
  2. On Study Leave from Universiti Malaysia Sabah, School of Engineering and Information Technology, 88999, Kota Kinabalu, Sabah, Malaysia
