Combining the Strength of Pattern Frequency and Distance for Classification

Li, Jinyan; Ramamohanarao, Kotagiri; Dong, Guozhu

doi:10.1007/3-540-45357-1_48

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2035)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

Induction algorithms for supervised classification are built on many heuristics, including decision trees, k-nearest neighbour (k-NN), pattern frequency, neural networks, and the Bayesian rule. In this paper, we propose a new instance-based induction algorithm that combines the strengths of pattern frequency and distance. We define a neighbourhood of a test instance. If the neighbourhood contains training data, we use k-NN to make the decision. Otherwise, we examine the support (frequency) of certain types of subsets of the test instance and sum these supports to make the prediction. This scheme is intended to deal with outliers: when no training data lies near a test instance, distance is not a reliable predictor of its class. We present an effective method for choosing an “optimal” neighbourhood factor for a given data set, guided by a part of the training data. We find that our algorithm maintains (and sometimes exceeds) the outstanding accuracy of k-NN on data sets containing only continuous attributes, and that it greatly improves on the accuracy of k-NN on data sets containing a mixture of continuous and categorical attributes. In general, our method is also clearly superior to C5.0.
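The two-stage decision described in the abstract can be illustrated with a minimal Python sketch. The abstract does not define the neighbourhood, the "certain types of subsets", or the procedure for selecting the neighbourhood factor, so everything below is an assumption made for illustration and not the authors' method: continuous attributes are taken to be scaled to [0, 1], a training instance is treated as a neighbour when every continuous attribute lies within `alpha` of the test value and every categorical attribute matches, the fallback sums class-relative supports over all attribute subsets of size `subset_size`, and `choose_alpha` simply scores candidate factors on a held-out tail of the training data. All function and parameter names are hypothetical.

```python
from collections import Counter
from itertools import combinations
import math


def matches(x, t, attrs, alpha):
    """True if x agrees with test instance t on every attribute index in attrs."""
    for i in attrs:
        if isinstance(t[i], str):          # categorical: exact match (assumption)
            if x[i] != t[i]:
                return False
        elif abs(x[i] - t[i]) > alpha:     # continuous: within alpha (assumption)
            return False
    return True


def distance(x, t):
    """Euclidean distance, counting a categorical mismatch as 1."""
    total = 0.0
    for a, b in zip(x, t):
        total += (0.0 if a == b else 1.0) if isinstance(b, str) else (a - b) ** 2
    return math.sqrt(total)


def classify(train, t, alpha=0.1, k=3, subset_size=2):
    """train is a list of (features, label) pairs; t is a feature tuple."""
    all_attrs = range(len(t))
    near = [(x, y) for x, y in train if matches(x, t, all_attrs, alpha)]
    if near:
        # Neighbourhood is non-empty: majority vote among the k closest neighbours.
        near.sort(key=lambda xy: distance(xy[0], t))
        return Counter(y for _, y in near[:k]).most_common(1)[0][0]
    # Neighbourhood is empty: sum per-class supports of subsets of t.
    class_sizes = Counter(y for _, y in train)
    score = Counter()
    for attrs in combinations(all_attrs, subset_size):
        for x, y in train:
            if matches(x, t, attrs, alpha):
                score[y] += 1.0 / class_sizes[y]   # support relative to class size
    if not score:
        return class_sizes.most_common(1)[0][0]    # no evidence: majority class
    return max(score, key=score.get)


def choose_alpha(train, candidates, holdout=0.3, **kw):
    """Pick the candidate factor with the most correct predictions on a held-out tail."""
    cut = max(1, int(len(train) * (1 - holdout)))
    fit, held = train[:cut], train[cut:]
    return max(candidates,
               key=lambda a: sum(classify(fit, x, alpha=a, **kw) == y for x, y in held))


train = [
    ((0.10, "a"), "pos"),
    ((0.12, "a"), "pos"),
    ((0.90, "b"), "neg"),
    ((0.88, "b"), "neg"),
    ((0.50, "b"), "neg"),
]
print(classify(train, (0.11, "a"), alpha=0.05))                 # k-NN branch
print(classify(train, (0.30, "b"), alpha=0.05, subset_size=1))  # support-sum branch
```

In this toy data the second test instance has no neighbour within `alpha`, so the prediction comes from the support sums instead of from distance, which is the outlier case the abstract describes.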

Author information

Authors and Affiliations

Dept of CSSE, The University of Melbourne, Vic, 3010, Australia
Jinyan Li & Kotagiri Ramamohanarao
Dept of CSE, Wright State University, USA
Guozhu Dong

Editor information

Editors and Affiliations

Dept. of Computer Science and Information Systems, The University of Hong Kong, Pokfulam, Hong Kong China
David Cheung
CSIRO Mathematical and Information Sciences, GPO Box 664, Canberra, ACT 2601, Australia
Graham J. Williams
Department of Computer Science, City University of Hong Kong, 83 Tat Chee Ave., Kowloon, Hong Kong China
Qing Li

About this paper

Cite this paper

Li, J., Ramamohanarao, K., Dong, G. (2001). Combining the Strength of Pattern Frequency and Distance for Classification. In: Cheung, D., Williams, G.J., Li, Q. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2001. Lecture Notes in Computer Science, vol 2035. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45357-1_48

DOI: https://doi.org/10.1007/3-540-45357-1_48
Published: 11 April 2001
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41910-5
Online ISBN: 978-3-540-45357-4
eBook Packages: Springer Book Archive