Speeding up knowledge discovery in large relational databases by means of a new discretization algorithm

  • Alex Alves Freitas
  • Simon H. Lavington
Technical Papers Optimisation/Performance Issues
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1094)

Abstract

Most of the KDD (Knowledge Discovery in Databases) algorithms proposed in the literature have been applied to relatively small datasets and do not permit any integration with a DBMS. Hence, the application of these algorithms to the huge amounts of data found in current databases and data warehouses faces serious scalability problems, particularly the problem of excessive learning time. This paper investigates a way of improving the scalability of KDD algorithms, via discretization of ordinal or continuous attributes. This work has two novel aspects. First, we map a generic discretization primitive into an SQL query. Second, we propose a new discretization algorithm for classification tasks. We show how the new discretization algorithm can be implemented with good effect via the SQL primitive.
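
The abstract does not reproduce the SQL query itself, so the following is only a minimal sketch of how a discretization primitive might be pushed into the DBMS. It assumes the primitive amounts to a grouped count of class labels per attribute value, computed by a single `GROUP BY` query. The table `customers`, the columns `age` and `risk`, and the helpers `class_counts` and `merge_by_majority` are hypothetical names introduced for illustration. The merging rule (joining adjacent values that share a majority class) is a simple stand-in, not the discretization algorithm proposed in the paper.

```python
import sqlite3


def class_counts(conn, table, attr, class_col):
    """Return {attr_value: {class_label: count}} using one GROUP BY query.

    Assumption: identifiers come from trusted code, since they are
    interpolated directly into the SQL text.
    """
    query = (
        f"SELECT {attr}, {class_col}, COUNT(*) "
        f"FROM {table} GROUP BY {attr}, {class_col} ORDER BY {attr}"
    )
    counts = {}
    for value, label, n in conn.execute(query):
        counts.setdefault(value, {})[label] = n
    return counts


def merge_by_majority(counts):
    """Merge adjacent attribute values that share the same majority class.

    Illustrative stand-in only; returns a list of (low, high, class) intervals.
    """
    intervals = []
    for value in sorted(counts):
        dist = counts[value]
        # Sorting the labels first makes tie-breaking deterministic.
        majority = max(sorted(dist), key=dist.get)
        if intervals and intervals[-1][2] == majority:
            low, _, _ = intervals[-1]
            intervals[-1] = (low, value, majority)
        else:
            intervals.append((value, value, majority))
    return intervals


# Hypothetical toy relation: one ordinal attribute and one class attribute.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (age INTEGER, risk TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [
        (20, "high"), (20, "high"), (25, "high"), (25, "low"), (25, "high"),
        (40, "low"), (45, "low"), (45, "low"), (60, "high"),
    ],
)

counts = class_counts(conn, "customers", "age", "risk")
intervals = merge_by_majority(counts)
print(intervals)
```

In this sketch the database server does the only full scan of the data (the grouped count), and the client then works on a summary whose size depends on the number of distinct attribute values rather than on the number of tuples. That division of labour is the kind of scalability gain the abstract alludes to.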

Copyright information

© Springer-Verlag 1996

Authors and Affiliations

  • Alex Alves Freitas (1)
  • Simon H. Lavington (1)
  1. Dept. of Computer Science, University of Essex, Colchester, UK