Non-crisp Clustering by Fast, Convergent, and Robust Algorithms

Estivill-Castro, Vladimir; Yang, Jianhua

doi:10.1007/3-540-44794-6_9

Conference paper
First Online: 01 January 2001

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2168)

Abstract

We provide sub-quadratic clustering algorithms for generic dissimilarity measures. Our algorithms are robust because they use medians rather than means as estimators of location, and the resulting representative of each cluster is an actual data item. We prove mathematically that our algorithms converge. The proposed methods generalize approaches that allow a data item to have a degree of membership in a cluster. Because our algorithm is generic to both fuzzy-membership approaches and probabilistic approaches to partial membership, we simply call it non-crisp clustering. We illustrate our algorithms by categorizing Web visitation paths. Our algorithms outperform previous clustering methods, all of which have quadratic time complexity because they essentially require computing the dissimilarity between all pairs of paths.
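
The combination of a median-style representative with partial membership can be illustrated with a small sketch. It is not the authors' algorithm: the function names, the inverse-dissimilarity membership rule, and the toy data are assumptions made here for illustration, and the exhaustive search for each representative is quadratic, which is exactly the cost the paper sets out to avoid.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")
Dissim = Callable[[T, T], float]


def memberships(items: Sequence[T], medoids: Sequence[T],
                dissim: Dissim, eps: float = 1e-9) -> List[List[float]]:
    """Assumed rule: membership is proportional to 1 / dissimilarity."""
    rows = []
    for x in items:
        # eps keeps the division finite when x is itself a medoid.
        inv = [1.0 / (dissim(x, m) + eps) for m in medoids]
        total = sum(inv)
        rows.append([w / total for w in inv])  # each row sums to 1
    return rows


def update_medoid(items: Sequence[T], weights: Sequence[float],
                  dissim: Dissim) -> T:
    """Return the data item with the least weighted total dissimilarity.

    Naive exhaustive search over all candidates (quadratic overall).
    """
    def cost(candidate: T) -> float:
        return sum(w * dissim(candidate, x) for x, w in zip(items, weights))
    return min(items, key=cost)


def non_crisp_medoids(items: Sequence[T], initial: Sequence[T],
                      dissim: Dissim, rounds: int = 20
                      ) -> Tuple[List[T], List[List[float]]]:
    """Alternate membership and medoid updates until the medoids stop moving."""
    medoids = list(initial)
    for _ in range(rounds):
        u = memberships(items, medoids, dissim)
        new = [update_medoid(items, [row[j] for row in u], dissim)
               for j in range(len(medoids))]
        if new == medoids:
            break
        medoids = new
    return medoids, memberships(items, medoids, dissim)


if __name__ == "__main__":
    # Toy one-dimensional data with one outlier (100.0).
    data = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 100.0]
    reps, degrees = non_crisp_medoids(data, [1.0, 12.0],
                                      lambda a, b: abs(a - b))
    print(reps)
    print([round(d, 3) for d in degrees[-1]])  # memberships of the outlier
```

Because each representative is chosen among the data items by a weighted sum of dissimilarities, a single distant item cannot pull it arbitrarily far, which is the intuition behind preferring medians to means. Only a dissimilarity function is needed, so the same sketch would apply to non-numeric items such as visitation paths, given a suitable dissimilarity.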

Author information

Authors and Affiliations

Department of Computer Science & Software Engineering, The University of Newcastle, Callaghan, NSW, 2308, Australia
Vladimir Estivill-Castro & Jianhua Yang

Editor information

Editors and Affiliations

Department of Computer Science, Albert-Ludwigs University Freiburg, Georges Köhler-Allee, Geb. 079, 79110, Freiburg, Germany
Luc De Raedt
Inst. of Information and Computing Sciences, Dept. of Mathematics and Computer Science, University of Utrecht, Padualaan 14, de Uithof, 3508 TB Utrecht, The Netherlands
Arno Siebes

About this paper

Cite this paper

Estivill-Castro, V., Yang, J. (2001). Non-crisp Clustering by Fast, Convergent, and Robust Algorithms. In: De Raedt, L., Siebes, A. (eds) Principles of Data Mining and Knowledge Discovery. PKDD 2001. Lecture Notes in Computer Science, vol 2168. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44794-6_9

DOI: https://doi.org/10.1007/3-540-44794-6_9
Published: 28 August 2001
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42534-2
Online ISBN: 978-3-540-44794-8
eBook Packages: Springer Book Archive
