Reducing the size of databases for multirelational classification: a subgraph-based approach

Journal of Intelligent Information Systems

Abstract

Multirelational classification aims to discover patterns across multiple interlinked tables (relations) in a relational database. In many large organizations, such a database spans numerous departments and subdivisions that are involved in different aspects of the enterprise, such as customer profiling, fraud detection, inventory management, and financial management. When considering classification, different phases of the knowledge discovery process are affected by economic utility. For instance, during data preprocessing, one must consider the cost of acquiring, cleaning, and transforming large volumes of data. When training and testing data mining models, one has to consider the impact of the data size on the running time of the learning algorithm. To address these utility-based issues, the paper presents an approach that creates a pruned database for multirelational classification while minimizing the loss of predictive performance in the final model. Our method identifies a set of strongly uncorrelated subgraphs from the original database schema to use for training, and discards all others. The experiments show that our strategy significantly reduces the size of the databases, in terms of the number of relations, tuples, and attributes, without sacrificing predictive accuracy. The approach prunes the sizes of databases by as much as 94%. This reduction also decreases the computational cost of the learning process: the method improves the execution time of multirelational learning algorithms by as much as 80%. In particular, our results demonstrate that one may build an accurate model with only a small subset of the provided database.
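The idea of keeping only a few weakly correlated subgraphs of the schema can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes that candidate subgraphs are simple join paths starting at the target relation, that a correlation score with the class and between each pair of candidates has already been estimated, and that a simple greedy threshold filter is used for selection. The toy schema, the scores, the threshold, and the function names `enumerate_join_paths` and `select_subgraphs` are all hypothetical.

```python
def enumerate_join_paths(schema, target, max_len=3):
    """Enumerate simple paths in the schema graph that start at the target relation.

    `schema` maps each relation name to the set of relations it joins with
    (assumption: foreign-key links are treated as undirected edges).
    """
    paths = []

    def walk(path):
        paths.append(tuple(path))
        if len(path) == max_len:
            return
        for neighbour in sorted(schema[path[-1]]):
            if neighbour not in path:  # keep paths simple (no repeated relation)
                walk(path + [neighbour])

    walk([target])
    return paths


def select_subgraphs(class_corr, pair_corr, threshold=0.5):
    """Greedy filter (an assumed heuristic, not the paper's procedure).

    Visit candidates in descending order of correlation with the class and
    keep a candidate only if its correlation with every already kept
    candidate is below `threshold`.
    """
    kept = []
    for cand in sorted(class_corr, key=class_corr.get, reverse=True):
        if all(pair_corr.get(frozenset((cand, k)), 0.0) < threshold for k in kept):
            kept.append(cand)
    return kept


# Hypothetical toy schema with "loan" as the target relation.
schema = {
    "loan": {"account"},
    "account": {"loan", "order", "transaction"},
    "order": {"account"},
    "transaction": {"account"},
}
paths = enumerate_join_paths(schema, "loan")
p0, p1, p2, p3 = paths

# Made-up scores, for illustration only.
class_corr = {p0: 0.40, p1: 0.35, p2: 0.30, p3: 0.10}
pair_corr = {
    frozenset((p0, p1)): 0.8,
    frozenset((p0, p2)): 0.2,
    frozenset((p0, p3)): 0.1,
    frozenset((p1, p2)): 0.6,
    frozenset((p1, p3)): 0.3,
    frozenset((p2, p3)): 0.7,
}
kept = select_subgraphs(class_corr, pair_corr)
pruned_relations = sorted({rel for path in kept for rel in path})
```

With these made-up scores, two of the four candidate paths survive, and only the relations on the surviving paths (`pruned_relations`) would be retained for training; the `transaction` relation is discarded.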

Notes

  1. Further discussion of all the resulting join paths for this database is presented in Example 1 in this section.

Author information

Correspondence to Hongyu Guo.

About this article

Cite this article

Guo, H., Viktor, H.L. & Paquet, E. Reducing the size of databases for multirelational classification: a subgraph-based approach. J Intell Inf Syst 40, 349–374 (2013). https://doi.org/10.1007/s10844-012-0229-0

  • DOI: https://doi.org/10.1007/s10844-012-0229-0
