CrossMine: Efficient Classification Across Multiple Database Relations

Yin, Xiaoxin; Han, Jiawei; Yang, Jiong; Yu, Philip S.

doi:10.1007/11615576_9


Conference paper


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3848)

Abstract

Most of today’s structured data is stored in relational databases. Such a database consists of multiple relations that are linked together conceptually via entity-relationship links in the design of relational database schemas. Multi-relational classification can be widely used in many disciplines including financial decision making and medical research. However, most classification approaches only work on single “flat” data relations. It is usually difficult to convert multiple relations into a single flat relation without either introducing a huge “universal relation” or losing essential information. Previous works using Inductive Logic Programming approaches (recently also known as Relational Mining) have proven effective with high accuracy in multi-relational classification. Unfortunately, they fail to achieve high scalability w.r.t. the number of relations in databases because they repeatedly join different relations to search for good literals.

In this paper we propose CrossMine, an efficient and scalable approach for multi-relational classification. CrossMine employs tuple ID propagation, a novel method for virtually joining relations, which enables flexible and efficient search among multiple relations. CrossMine also uses aggregated information to provide essential statistics for classification. A selective sampling method is used to achieve high scalability w.r.t. the number of tuples in the databases. Our comprehensive experiments on both real and synthetic databases demonstrate the high scalability and accuracy of CrossMine.
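The idea of virtually joining relations by propagating tuple IDs can be illustrated with a minimal sketch. It is not the authors' implementation: the `loans` and `accounts` relations, their attributes, and the helper names `propagate_ids` and `count_by_label` are all hypothetical, chosen only to show how the IDs and class labels of target tuples might be attached to a non-target relation so that a candidate literal can be scored without materializing the join.

```python
from collections import defaultdict

# Hypothetical target relation: loan ID -> foreign key and class label.
loans = {
    1: {"account_id": 124, "label": True},
    2: {"account_id": 124, "label": True},
    3: {"account_id": 108, "label": False},
    4: {"account_id": 45, "label": False},
    5: {"account_id": 45, "label": True},
}

# Hypothetical non-target relation: account ID -> attributes.
accounts = {
    124: {"frequency": "monthly"},
    108: {"frequency": "weekly"},
    45: {"frequency": "monthly"},
    67: {"frequency": "weekly"},
}


def propagate_ids(target, fk):
    """Map each key of the non-target relation to the set of target
    tuple IDs that would join with it (assumed single foreign key)."""
    propagated = defaultdict(set)
    for tuple_id, row in target.items():
        propagated[row[fk]].add(tuple_id)
    return propagated


def count_by_label(target, other, propagated, attr, value):
    """Count positive and negative target tuples that satisfy the
    literal other.attr == value, using only the propagated IDs."""
    covered = set()
    for key, row in other.items():
        if row[attr] == value:
            covered |= propagated.get(key, set())
    positives = sum(1 for tuple_id in covered if target[tuple_id]["label"])
    return positives, len(covered) - positives


ids_on_accounts = propagate_ids(loans, "account_id")
monthly = count_by_label(loans, accounts, ids_on_accounts, "frequency", "monthly")
weekly = count_by_label(loans, accounts, ids_on_accounts, "frequency", "weekly")
print(monthly, weekly)
```

In this sketch the propagation is done once per join path, and every candidate literal on the non-target relation is then evaluated against the stored ID sets. How CrossMine stores the IDs, chooses join paths, and scores literals is described in the paper itself and is not reproduced here.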

The work was supported in part by the National Science Foundation under Grants IIS-02-09199/IIS-03-08215, and an IBM Faculty Award. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the funding agencies.



Author information

Authors and Affiliations

University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
Xiaoxin Yin, Jiawei Han & Jiong Yang
IBM T.J. Watson Research Center, Yorktown Heights, N.Y., 10598, USA
Philip S. Yu


Editor information

Editors and Affiliations

INSA-Lyon, LIRIS CNRS UMR5205, F-69621, Villeurbanne, France
Jean-François Boulicaut
Department of Computer Science, Katholieke Universiteit Leuven, Celestijnenlaan 200A, 3001, Heverlee, Belgium
Luc De Raedt
HIIT, Helsinki University of Technology and University of Helsinki, Finland
Heikki Mannila

About this paper

Cite this paper

Yin, X., Han, J., Yang, J., Yu, P.S. (2006). CrossMine: Efficient Classification Across Multiple Database Relations. In: Boulicaut, JF., De Raedt, L., Mannila, H. (eds) Constraint-Based Mining and Inductive Databases. Lecture Notes in Computer Science, vol 3848. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11615576_9


DOI: https://doi.org/10.1007/11615576_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-31331-1
Online ISBN: 978-3-540-31351-9
eBook Packages: Computer Science, Computer Science (R0)
