Improving Classifications Through Graph Embeddings

  • Anirban Chatterjee
  • Sanjukta Bhowmick
  • Padma Raghavan
Chapter

Abstract

Unsupervised classification identifies groups of similar entities in a dataset and is applied extensively in domains such as spam filtering [5], medical diagnosis [15], and demographic research [13]. Unsupervised classification using K-Means generally clusters data based on either (1) distance-based attributes of the dataset [4, 16, 17, 23] or (2) combinatorial properties of a weighted graph representation of the dataset [8].
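The distinction the abstract draws between attribute-based and graph-based K-Means can be made concrete with a short sketch. The following is a minimal illustration under stated assumptions, not the chapter's own method: it assumes scikit-learn and NetworkX are available, and clusters a standard dataset once on its raw feature vectors and once on the 2-D coordinates of a force-directed embedding (in the spirit of Fruchterman and Reingold [10]) of the dataset's k-nearest-neighbor graph.

```python
# Hypothetical sketch: attribute-based vs. graph-based K-Means.
# Assumes scikit-learn and NetworkX (>= 3.0 for from_scipy_sparse_array).
import numpy as np
import networkx as nx
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.neighbors import kneighbors_graph

X = load_iris().data  # 150 samples, 4 numeric attributes

# (1) Attribute-based: run K-Means directly on the distance-based attributes.
labels_attr = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# (2) Graph-based: build a k-nearest-neighbor graph over the samples,
# embed it in the plane with a force-directed layout, and cluster the
# embedded coordinates instead of the raw attributes.
knn = kneighbors_graph(X, n_neighbors=10, mode="connectivity")
G = nx.from_scipy_sparse_array(knn)
pos = nx.spring_layout(G, seed=0)  # Fruchterman-Reingold-style layout
coords = np.array([pos[i] for i in range(X.shape[0])])
labels_graph = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(coords)

print(np.bincount(labels_attr), np.bincount(labels_graph))
```

NetworkX's spring_layout implements the force-directed placement of [10]; the choice of 10 neighbors and a 2-D embedding here is illustrative, not prescribed by the chapter.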

Acknowledgments

This research was supported in part by the National Science Foundation through grants CNS 0720749, OCI 0821527, and CCF 0830679. Additionally, Dr. Sanjukta Bhowmick would like to acknowledge the support of the College of Information Science and Technology at the University of Nebraska at Omaha.

References

  1. Arthur D, Vassilvitskii S (2006) How slow is the k-means method? In: SCG '06: Proceedings of the twenty-second annual symposium on computational geometry. ACM, New York, pp 144–153. http://doi.acm.org/10.1145/1137856.1137880
  2. Asuncion A, Newman D (2007) UCI machine learning repository. http://www.ics.uci.edu/~mlearn/MLRepository.html
  3. Baker LD, McCallum AK (1998) Distributional clustering of words for text classification. In: SIGIR '98: Proceedings of the 21st annual international ACM SIGIR conference on research and development in information retrieval. ACM, New York, pp 96–103. http://doi.acm.org/10.1145/290941.290970
  4. Ball G, Hall D (1965) ISODATA, a novel method of data analysis and pattern classification. Tech report NTIS AD 699616, Stanford Research Institute, Stanford, CA
  5. Bíró I, Szabó J, Benczúr A (2008) Latent Dirichlet allocation in web spam filtering. In: AIRWeb '08: Proceedings of the 4th international workshop on adversarial information retrieval on the web. ACM, New York, pp 29–32. http://doi.acm.org/10.1145/1451983.1451991
  6. Chatterjee A, Bhowmick S, Raghavan P (2010) Feature subspace transformations for enhancing k-means clustering. In: CIKM '10: Proceedings of the 19th ACM international conference on information and knowledge management. ACM, New York
  7. Chatterjee A, Raghavan P, Bhowmick S (2012) Similarity graph neighborhoods for enhanced supervised classification. Procedia Computer Science, Elsevier, pp 577–586. doi:10.1016/j.procs.2012.04.062
  8. Dhillon I, Guan Y, Kulis B (2007) Weighted graph cuts without eigenvectors: a multilevel approach. IEEE Trans Pattern Anal Mach Intell 29(11):1944–1957. doi:10.1109/TPAMI.2007.1115
  9. Ding C, He X (2004) K-means clustering via principal component analysis. In: Proceedings of the twenty-first international conference on machine learning (ICML '04). ACM, New York, pp 225–232
  10. Fruchterman TMJ, Reingold EM (1991) Graph drawing by force-directed placement. Softw Pract Exper 21(11):1129–1164
  11. Hartigan JA, Wong MA (1979) A k-means clustering algorithm. Appl Stat 28(1):100–108
  12. The MathWorks Inc (2007) MATLAB and Simulink for technical computing. http://www.mathworks.com
  13. Kim K, Ahn H (2008) A recommender system using GA K-means clustering in an online shopping market. Expert Syst Appl 34(2):1200–1209. doi:10.1016/j.eswa.2006.12.025
  14. Lang K (1995) NewsWeeder: learning to filter netnews. In: Proceedings of the twelfth international conference on machine learning, pp 331–339
  15. Li X (2008) A volume segmentation algorithm for medical image based on k-means clustering. In: IIH-MSP '08: Proceedings of the 2008 international conference on intelligent information hiding and multimedia signal processing. IEEE Computer Society, Washington, pp 881–884. http://dx.doi.org/10.1109/IIH-MSP.2008.161
  16. Lloyd S (1982) Least squares quantization in PCM. IEEE Trans Inf Theory 28:129–137
  17. MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability. University of California Press, Berkeley
  18. Michie D, Spiegelhalter DJ, Taylor C (1994) Machine learning, neural and statistical classification, 1st edn. Prentice Hall
  19. Neal R (1998) Assessing relevance determination methods using DELVE. In: Neural networks and machine learning. Springer, Berlin, pp 97–129. http://www.cs.toronto.edu/~delve/
  20. Ng A, Jordan M, Weiss Y (2001) On spectral clustering: analysis and an algorithm. In: Dietterich T, Becker S, Ghahramani Z (eds) Advances in neural information processing systems 14. MIT Press, pp 849–856
  21. Salton G (1971) SMART data set. ftp://ftp.cs.cornell.edu/pub/smart
  22. Savaresi SM, Boley DL, Bittanti S, Gazzaniga G (2002) Cluster selection in divisive clustering algorithms. In: Proceedings of the SIAM international conference on data mining, Arlington, VA
  23. Steinhaus H (1956) Sur la division des corps matériels en parties. Bull Acad Polon Sci IV(Cl III):801–804

Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  • Anirban Chatterjee (1)
  • Sanjukta Bhowmick (2)
  • Padma Raghavan (3)
  1. MathWorks, Natick, USA
  2. University of Nebraska at Omaha, Omaha, USA
  3. Department of Computer Science and Engineering, The Pennsylvania State University, University Park, USA
