Abstract
Industrial databases often contain millions of tuples, but most data mining algorithms are practical only on small sets of examples. In this paper, we propose applying data reduction before data mining to overcome this limitation. Specifically, we present a novel similarity-driven sampling approach that applies two preparation steps, sorting and stratification, and reuses an improved variant of leader clustering. We experimentally compare similarity-driven sampling with statistical sampling techniques in different classification domains, using C4.5 and instance-based learning as data mining algorithms. The results show that similarity-driven sampling often achieves lower error rates than statistical sampling techniques while using smaller samples.
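To make the pipeline named in the abstract concrete, the following is a minimal sketch of sorting, stratification, and a basic one-pass leader clustering used as a sampler. It is not the improved variant proposed in the paper, whose details the abstract does not give. The function names, the Euclidean distance, the fixed `threshold`, and stratifying by class label are all assumptions made for illustration.

```python
from math import dist


def leader_sample(rows, threshold):
    """One pass of basic leader clustering; the leaders form the sample.

    Assumption: rows are numeric tuples compared by Euclidean distance.
    """
    leaders = []
    for row in rows:
        # A row within `threshold` of an existing leader is represented by it;
        # otherwise the row becomes a new leader.
        if not any(dist(row, leader) <= threshold for leader in leaders):
            leaders.append(row)
    return leaders


def similarity_driven_sample(rows, labels, threshold):
    """Sort, stratify, then run leader clustering inside each stratum.

    Assumption: strata are defined by class label (hypothetical choice).
    """
    strata = {}
    # Sorting fixes the visiting order, which determines which rows become leaders.
    for row, label in sorted(zip(rows, labels), key=lambda pair: pair[0]):
        strata.setdefault(label, []).append(row)
    return {label: leader_sample(group, threshold)
            for label, group in strata.items()}


rows = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.05, 0.0)]
labels = ["a", "a", "b", "b", "a"]
sample = similarity_driven_sample(rows, labels, threshold=1.0)
```

In this sketch the sample size is not chosen in advance. It falls out of `threshold`: a larger value merges more rows under each leader and yields a smaller sample.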
Copyright information
© 1998 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Reinartz, T. (1998). Similarity-driven sampling for data mining. In: Żytkow, J.M., Quafafou, M. (eds) Principles of Data Mining and Knowledge Discovery. PKDD 1998. Lecture Notes in Computer Science, vol 1510. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0094846
DOI: https://doi.org/10.1007/BFb0094846
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-65068-3
Online ISBN: 978-3-540-49687-8
eBook Packages: Springer Book Archive