Clustering Short-Text Using Non-negative Matrix Factorization of Hadamard Product of Similarities

  • Krutika Verma
  • Mukesh K. Jadon
  • Arun K. Pujari
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8281)

Abstract

Short-text mining has become an important area of research in information retrieval (IR) and data mining. Ncut term weighting was recently proposed for clustering short texts using non-negative matrix factorization (NMF). NMF can be employed with such term weighting when the similarity measure is the inner product of the term-document matrix. We propose a new weighting scheme and devise a new clustering algorithm that uses the Hadamard product of similarity matrices. We demonstrate that our technique yields much better clustering than the ncut weighting scheme. We evaluate clustering quality with three measures: purity, normalized mutual information, and the adjusted Rand index. We use standard benchmark datasets and also compare the performance of our algorithm with the well-known Ng-Jordan-Weiss document clustering technique. Experimental results suggest that weighting by the Hadamard product gives better clustering of short-text documents.
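The abstract does not spell out the weighting scheme or which similarity matrices are combined, so the following Python sketch only illustrates the general idea under stated assumptions. It combines two similarity matrices by their Hadamard (element-wise) product, factorizes the result with a symmetric NMF, and scores the clusters by purity. The choice of cosine and Gaussian-kernel similarities, the damped multiplicative update used for the factorization, the toy data, and all function names are assumptions made for illustration; this is not the authors' algorithm.

```python
import numpy as np

def cosine_similarity(X):
    """Inner product of L2-normalised document vectors (rows of X)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.maximum(norms, 1e-12)
    return Xn @ Xn.T

def gaussian_similarity(X, sigma=1.0):
    """Gaussian kernel on squared Euclidean distances between rows of X."""
    sq = np.sum(X * X, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def symmetric_nmf(S, k, n_iter=500, seed=0):
    """Approximate S by H @ H.T with H >= 0 (assumed stand-in factorization).

    Uses a damped multiplicative update, which keeps H non-negative
    as long as S and the initial H are non-negative.
    """
    rng = np.random.default_rng(seed)
    H = rng.random((S.shape[0], k))
    for _ in range(n_iter):
        H *= 0.5 + 0.5 * (S @ H) / np.maximum(H @ (H.T @ H), 1e-12)
    return H

def purity(labels_true, labels_pred):
    """Fraction of documents that belong to the majority class of their cluster."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    total = 0
    for c in np.unique(labels_pred):
        members = labels_true[labels_pred == c]
        total += np.bincount(members).max()
    return total / labels_true.size

# Hypothetical toy corpus: 6 short documents (rows) over 6 terms (columns),
# with two topics that share no vocabulary.
X = np.array([
    [2, 1, 0, 0, 0, 0],
    [1, 2, 1, 0, 0, 0],
    [1, 1, 2, 0, 0, 0],
    [0, 0, 0, 2, 1, 0],
    [0, 0, 0, 1, 2, 1],
    [0, 0, 0, 1, 1, 2],
], dtype=float)
labels_true = np.array([0, 0, 0, 1, 1, 1])

S_cos = cosine_similarity(X)
S_rbf = gaussian_similarity(X, sigma=2.0)
S = S_cos * S_rbf                  # Hadamard product of the two similarities

H = symmetric_nmf(S, k=2)
labels_pred = H.argmax(axis=1)     # assign each document to its strongest factor
print("purity:", purity(labels_true, labels_pred))
```

One reason the element-wise product is a natural way to combine similarities: by the Schur product theorem, the Hadamard product of two positive semidefinite matrices is again positive semidefinite, so the combined matrix remains a valid kernel. A pair of documents scores high only when both similarities agree.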

Keywords

Short-text clustering; ncut-weighting; non-negative matrix factorization; Hadamard product; kernel distance

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Krutika Verma (1)
  • Mukesh K. Jadon (2)
  • Arun K. Pujari (3)
  1. Institute of Information Technology, Sambalpur University, Sambalpur, India
  2. Department of CSE, The LNMIIT, Jaipur, India
  3. School of Computer and Information Sciences, University of Hyderabad, Hyderabad, India
