Non-crisp Clustering by Fast, Convergent, and Robust Algorithms

  • Vladimir Estivill-Castro
  • Jianhua Yang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2168)

Abstract

We provide sub-quadratic clustering algorithms for generic dissimilarity measures. Our algorithms are robust because they use medians rather than means as estimators of location, and the resulting representative of a cluster is an actual data item. We demonstrate mathematically that our algorithms converge. The proposed methods generalize approaches that allow a data item to have a degree of membership in a cluster. Because our algorithm is generic to both fuzzy membership approaches and probabilistic approaches for partial membership, we simply name it non-crisp clustering. We illustrate our algorithms by categorizing Web visitation paths. Our algorithms outperform previous clustering methods for this task, all of which have quadratic time complexity (they essentially require computing the dissimilarity between all pairs of paths).
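
To make the ideas in the abstract concrete, the following is a minimal sketch of a partial-membership clustering loop whose cluster representatives are always data items. It is not the authors' algorithm: the inverse-dissimilarity membership rule, the `fuzzifier` parameter, the naive initialisation, and all function names are assumptions chosen for illustration. The representative update here scans every item against every candidate, so this sketch is quadratic and does not reproduce the sub-quadratic behaviour claimed in the paper.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")
Dissim = Callable[[T, T], float]


def memberships(items: Sequence[T], medoids: Sequence[T],
                dissim: Dissim, fuzzifier: float = 2.0) -> List[List[float]]:
    """Degree of membership of each item in each cluster; every row sums to 1."""
    eps = 1e-12
    rows = []
    for x in items:
        d = [dissim(x, m) for m in medoids]
        if min(d) <= eps:
            # The item coincides with at least one representative:
            # share full membership among those representatives.
            hits = [1.0 if dj <= eps else 0.0 for dj in d]
            total = sum(hits)
            rows.append([h / total for h in hits])
            continue
        # Assumed rule: membership is proportional to an inverse power
        # of the dissimilarity to each representative.
        w = [dj ** (-1.0 / (fuzzifier - 1.0)) for dj in d]
        total = sum(w)
        rows.append([wj / total for wj in w])
    return rows


def update_medoids(items: Sequence[T], u: List[List[float]],
                   dissim: Dissim, fuzzifier: float = 2.0) -> List[T]:
    """For each cluster, pick the data item with the least weighted total dissimilarity."""
    k = len(u[0])
    new = []
    for j in range(k):
        def cost(candidate: T) -> float:
            return sum((u[i][j] ** fuzzifier) * dissim(items[i], candidate)
                       for i in range(len(items)))
        new.append(min(items, key=cost))  # quadratic scan, illustration only
    return new


def non_crisp_medoids(items: Sequence[T], k: int, dissim: Dissim,
                      fuzzifier: float = 2.0,
                      max_iter: int = 50) -> Tuple[List[T], List[List[float]]]:
    """Alternate membership and representative updates until nothing changes."""
    medoids = list(items[:k])  # naive initialisation (assumption)
    for _ in range(max_iter):
        u = memberships(items, medoids, dissim, fuzzifier)
        new = update_medoids(items, u, dissim, fuzzifier)
        if new == medoids:
            break
        medoids = new
    return medoids, memberships(items, medoids, dissim, fuzzifier)
```

Because `dissim` is an arbitrary callable, the same loop could be applied to non-numeric items such as visitation paths, provided a dissimilarity between two paths is supplied. Choosing the representative from the data items themselves, rather than averaging, is what keeps the sketch usable when no mean is defined.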

Keywords

Data Item; Dissimilarity Measure; Robust Algorithm; Dissimilarity Function; Reconstruction Step
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2001

Authors and Affiliations

  • Vladimir Estivill-Castro (1)
  • Jianhua Yang (1)
  1. Department of Computer Science & Software Engineering, The University of Newcastle, Callaghan, Australia