Skip to main content

Combining the Strength of Pattern Frequency and Distance for Classification

  • Conference paper
  • First Online:
Advances in Knowledge Discovery and Data Mining (PAKDD 2001)

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 2035))

Included in the following conference series:

Abstract

Supervised classification involves many heuristics, including the ideas of decision tree, k-nearest neighbour (k-NN), pattern frequency, neural network, and Bayesian rule, to base induction algorithms. In this paper, we propose a new instance-based induction algorithm which combines the strength of pattern frequency and distance. We define a neighbourhood of a test instance. If the neighbourhood contains training data, we use k-NN to make decisions. Otherwise, we examine the support (frequency) of certain types of subsets of the test instance, and calculate support summations for prediction. This scheme is intended to deal with outliers: when no training data is near to a test instance, then the distance measure is not a proper predictor for classification. We present an effective method to choose an “optimal” neighbourhood factor for a given data set by using a guidance from a partial training data. In this work, we find that our algorithm maintains (sometimes exceeds) the outstanding accuracy of k-NN on data sets containing pure continuous attributes, and that our algorithm greatly improves the accuracy of k-NN on data sets containing a mixture of continuous and categorical attributes. In general, our method is much superior to C5.0.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 84.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 109.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM-SIGMOD International Conference on Management of Data, pages 207–216, Washington, D.C., May 1993. ACM Press.

    Google Scholar 

  2. D.W. Aha, D. Kibler, and M.K. Albert. Instance-based learning algorithms. Machine Learning, 6:37–66, 1991.

    Google Scholar 

  3. C.L. Blake and P.M. Murphy. The UCI machine learning repository. [http://www.cs.uci.edu/~mlearn/MLRepository.html]. In Irvine, CA: University of California, Department of Information and Computer Science, 1998.

  4. T.M. Cover and P.E. Hart. Nearest neighbour pattern classification. IEEE Transactions on Information Theory, 13:21–27, 1967.

    Article  MATH  Google Scholar 

  5. Guozhu Dong and Jinyan Li. Efficient mining of emerging patterns: Discovering trends and differences. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 43–52, San Diego, CA, 1999. ACM Press.

    Google Scholar 

  6. Guozhu Dong, Xiuzhen Zhang, Limsoon Wong, and Jinyan Li. CAEP: Classification by aggregating emerging patterns. In Proceedings of the Second International Conference on Discovery Science, Tokyo, Japan, pages 30–42. Springer-Verlag, December 1999.

    Google Scholar 

  7. James Dougherty, Ron Kohavi, and Mehran Sahami. Supervised and unsupervised discretization of continuous features. In Proceedings of the Twelfth International Conference on Machine Learning, pages 94–202. Morgan Kaufmann, 1995.

    Google Scholar 

  8. U.M. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. From data mining to knowledge discovery: An overview. In U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 1–34. AAAI/MIT Press, 1996.

    Google Scholar 

  9. U.M. Fayyad and K.B. Irani. Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of the 13th International Joint Conference on Artificial Intelligence, pages 1022–1029. Morgan Kaufmann, 1993.

    Google Scholar 

  10. E. Fix and J. Hodges. Discriminatory analysis, non-parametric discrimination, consistency properties. Technical Report Technical Report 4, Project Number 21-49-004, USAF School of Aviation Medicine, Randolph Field, TX, 1951.

    Google Scholar 

  11. R. Kohavi, G. John, R. Long, D. Manley, and K. Pfleger. MLC++: A machine learning library in C++. In Tools with artificial intelligence, pages 740–743, 1994.

    Google Scholar 

  12. Pat Langley and Wayne Iba. Average-case analysis of a nearest neighbour algorithm. In Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, pages 889–894, Chambery, France, 1993.

    Google Scholar 

  13. Jinyan Li, Guozhu Dong, and Kotagiri Ramamohanarao. Instance-based classification by emerging patterns. In Proceedings of the Fourth European Conference on Principles and Practice of Knowledge Discovery in Databases, pages 191–200, Lyon, France, September 2000. Springer-Verlag.

    Google Scholar 

  14. Jinyan Li, Guozhu Dong, and Kotagiri Ramamohanarao. Making use of the most expressive jumping emerging patterns for classification. In Knowledge and Information Systems: An International Journal, to appear.

    Google Scholar 

  15. Bing Liu, Wynne Hsu, and Yiming Ma. Integrating classification and association rule mining. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, pages 80–86, New York, USA, August 1998. AAAI Press.

    Google Scholar 

  16. Dimitris Meretakis and Beat Wuthrich. Extending naive bayes classifiers using long itemsets. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 165–174, San Diego, CA, 1999. ACM Press.

    Google Scholar 

  17. J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.

    Google Scholar 

  18. J.R. Quinlan. Improved use of continuous attributes in C4.5. Journal of Artificial Intelligence Research, 4:77–90, 1996.

    Article  MATH  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2001 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Li, J., Ramamohanarao, K., Dong, G. (2001). Combining the Strength of Pattern Frequency and Distance for Classification. In: Cheung, D., Williams, G.J., Li, Q. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2001. Lecture Notes in Computer Science(), vol 2035. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45357-1_48

Download citation

  • DOI: https://doi.org/10.1007/3-540-45357-1_48

  • Published:

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-41910-5

  • Online ISBN: 978-3-540-45357-4

  • eBook Packages: Springer Book Archive

Publish with us

Policies and ethics