Scalable Hierarchical Clustering Method for Sequences of Categorical Values
Abstract
Data clustering methods have many applications in data mining. Traditional clustering algorithms deal with quantitative or categorical data points. However, many important databases store sequences of categorical data, where significant knowledge is hidden in the sequential dependencies between the data. In this paper we introduce the problem of clustering categorical data sequences and present an efficient, scalable algorithm to solve it. Our algorithm implements the general idea of agglomerative hierarchical clustering and uses frequently occurring subsequences as features describing data sequences. The algorithm not only discovers a set of high-quality clusters containing similar data sequences but also provides descriptions of the discovered clusters.
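The general idea described in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes contiguous subsequences as features, a Jaccard similarity between pattern sets, and a naive search for the closest pair of clusters, none of which are stated in the abstract. All function and parameter names (cluster_sequences, min_support, max_len, k) are hypothetical.

```python
from itertools import combinations


def contiguous_subsequences(seq, max_len):
    """Return the set of contiguous subsequences of seq with 1..max_len items."""
    return {
        tuple(seq[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(seq) - n + 1)
    }


def frequent_patterns(sequences, min_support, max_len):
    """Keep subsequences that occur in at least min_support sequences."""
    counts = {}
    for seq in sequences:
        for pattern in contiguous_subsequences(seq, max_len):
            counts[pattern] = counts.get(pattern, 0) + 1
    return {p for p, c in counts.items() if c >= min_support}


def jaccard(a, b):
    """Jaccard similarity of two pattern sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_sequences(sequences, k, min_support=2, max_len=2):
    """Agglomeratively merge sequences until k clusters remain.

    Each cluster is a pair (member indices, pattern set); the pattern set
    doubles as a description of the cluster.
    """
    frequent = frequent_patterns(sequences, min_support, max_len)
    # Start with one cluster per sequence, described by its frequent patterns.
    clusters = [
        ([i], contiguous_subsequences(seq, max_len) & frequent)
        for i, seq in enumerate(sequences)
    ]
    while len(clusters) > k:
        # Naive quadratic search for the most similar pair of clusters.
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda pair: jaccard(clusters[pair[0]][1], clusters[pair[1]][1]),
        )
        merged = (clusters[a][0] + clusters[b][0], clusters[a][1] | clusters[b][1])
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return clusters
```

Calling cluster_sequences on a small list of sequences returns, for each cluster, the indices of its member sequences together with the set of frequent subsequences that describes it. The quadratic pair search is kept only for readability and would not scale to large databases.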