Abstract

Spectral clustering algorithms have recently gained much interest in the research community. This surge in interest is mainly due to their ease of use, their applicability to a variety of data types and domains, and the fact that they very often outperform traditional clustering algorithms. These algorithms consider the pairwise similarity between data objects and construct a similarity matrix to group data into natural subsets, so that objects located in the same cluster share many common characteristics. Objects are allocated to clusters by employing a proximity measure, which is used to compute the similarity or distance between the data objects in the matrix. As such, an early and fundamental step in spectral cluster analysis is the selection of a proximity measure. This choice also has the highest impact on the quality and usability of the end result. However, this crucial aspect is frequently overlooked. For instance, most prior studies use the Euclidean distance measure without explicitly stating the consequences of selecting such a measure. To address this issue, we perform a comparative and exploratory study of the performance of various existing proximity measures when applied to a spectral clustering algorithm. Our results indicate that the commonly used Euclidean distance measure is not always suitable, specifically in domains where the data is highly imbalanced and the correct clustering of boundary objects is critical. Moreover, we observed that for numeric data the relative distance measures outperformed the absolute distance measures and may therefore boost the performance of a clustering algorithm. For datasets with mixed variables, the selection of the distance measure for the numeric variables again has the highest impact on the end result.
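
As a rough illustration of why the choice of proximity measure matters, the sketch below builds a similarity matrix from the same four points under two different distances. It is not the authors' implementation: the Gaussian kernel, the `sigma` parameter, the sample points, and the use of the Canberra distance as a stand-in for a measure that scales differences relative to magnitude are all assumptions made for this example.

```python
import math


def euclidean(a, b):
    """Absolute distance: depends only on coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canberra(a, b):
    """Distance that scales each difference by the magnitude of the values.

    Used here only as an illustrative magnitude-relative measure (assumption).
    """
    total = 0.0
    for x, y in zip(a, b):
        denom = abs(x) + abs(y)
        if denom > 0:  # skip coordinates where both values are zero
            total += abs(x - y) / denom
    return total


def similarity_matrix(points, distance, sigma=1.0):
    """Turn pairwise distances into similarities with a Gaussian kernel.

    The kernel and sigma are assumptions of this sketch, not taken from the paper.
    """
    n = len(points)
    return [
        [
            math.exp(-distance(points[i], points[j]) ** 2 / (2 * sigma ** 2))
            for j in range(n)
        ]
        for i in range(n)
    ]


# Hypothetical data: two pairs with identical offsets but very different scales.
points = [(1.0, 2.0), (2.0, 4.0), (100.0, 200.0), (101.0, 202.0)]

W_euclid = similarity_matrix(points, euclidean)
W_canberra = similarity_matrix(points, canberra)

# Euclidean treats both pairs as equally similar; Canberra does not.
print(W_euclid[0][1], W_euclid[2][3])
print(W_canberra[0][1], W_canberra[2][3])
```

A spectral clustering algorithm would then operate on the eigenvectors of a Laplacian derived from such a matrix, so two measures that disagree about which pairs are close can lead to different partitions of the same data.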

Keywords

Spectral clustering; Proximity measures; Similarity measures; Boundary detection

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Nadia Farhanaz Azam (1)
  • Herna L. Viktor (1)
  1. School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, Canada