Abstract
We provide sub-quadratic clustering algorithms for generic dissimilarity measures. Our algorithms are robust because they use medians rather than means as estimators of location, and the resulting representative of each cluster is an actual data item. We show mathematically that our algorithms converge. The proposed methods generalize approaches that allow a data item to have a degree of membership in a cluster. Because our algorithms are generic to both fuzzy-membership approaches and probabilistic approaches to partial membership, we simply name the approach non-crisp clustering. We illustrate our algorithms by categorizing Web visitation paths. Our algorithms outperform previous clustering methods, all of which have quadratic time complexity because they essentially require computing the dissimilarity between all pairs of paths.
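To make the two central ideas concrete, here is a minimal Python sketch. It is not the algorithm of the paper. It only illustrates (a) a median-style representative that is itself a data item (a medoid) under an arbitrary dissimilarity, and (b) one possible way to assign non-crisp degrees of membership. The names `medoid`, `memberships`, and `jaccard_dissim`, the inverse-dissimilarity weighting, and the sample paths are all assumptions made for illustration. The exhaustive medoid search shown here is itself quadratic in the number of items, which is exactly the cost the paper's algorithms are designed to avoid.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def medoid(items: Sequence[T], dissim: Callable[[T, T], float]) -> T:
    """Return the data item with the smallest total dissimilarity to all items.

    Exhaustive search: O(n^2) dissimilarity evaluations (illustration only).
    """
    return min(items, key=lambda c: sum(dissim(c, x) for x in items))


def memberships(item: T, reps: Sequence[T],
                dissim: Callable[[T, T], float],
                eps: float = 1e-12) -> List[float]:
    """Degrees of membership of `item` in each cluster, summing to 1.

    Assumption: membership is proportional to the inverse dissimilarity
    to each cluster representative; `eps` avoids division by zero.
    """
    inverse = [1.0 / (dissim(item, r) + eps) for r in reps]
    total = sum(inverse)
    return [v / total for v in inverse]


def jaccard_dissim(a: Sequence[str], b: Sequence[str]) -> float:
    """Hypothetical path dissimilarity: 1 - |A & B| / |A | B| over visited pages."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return 1.0 - len(sa & sb) / len(union) if union else 0.0


def abs_diff(x: float, y: float) -> float:
    """Plain absolute difference, used for the numeric robustness example."""
    return abs(x - y)


# Robustness: the outlier 100 drags the mean to 22, but the medoid stays at 3.
values = [1, 2, 3, 4, 100]
mean_value = sum(values) / len(values)
medoid_value = medoid(values, abs_diff)

# Generic dissimilarity: the representative of a set of paths is one of the paths.
paths = [
    ("home", "products", "cart"),
    ("home", "products"),
    ("home", "products", "cart", "checkout"),
    ("home", "about"),
]
representative_path = medoid(paths, jaccard_dissim)

# Non-crisp assignment: value 4 belongs mostly, but not entirely, to the cluster at 3.
degrees = memberships(4, [3, 100], abs_diff)
```

Because the representative is chosen among the data items, the same code works for any item type for which a dissimilarity function can be written, including paths for which no meaningful mean exists.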
© 2001 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Estivill-Castro, V., Yang, J. (2001). Non-crisp Clustering by Fast, Convergent, and Robust Algorithms. In: De Raedt, L., Siebes, A. (eds) Principles of Data Mining and Knowledge Discovery. PKDD 2001. Lecture Notes in Computer Science, vol 2168. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44794-6_9
DOI: https://doi.org/10.1007/3-540-44794-6_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42534-2
Online ISBN: 978-3-540-44794-8