Evolving Systems, Volume 8, Issue 4, pp 261–269

Characterizing evolving behavior of context vectors for context based clustering

  • Anagha R Kulkarni
  • Vrinda Tokekar
  • Parag Kulkarni
Original Paper

Abstract

Characterizing the evolving behavior of document vectors helps identify similarity between text documents. Because document vectors contain terms and their importance in documents, discovering association and disassociation between terms is essential. This paper introduces a characterization of the evolving behavior of document vectors to identify similar and dissimilar segments within them. The approach is particularly suitable where document vectors contain similar patterns of term occurrences but those patterns lie far apart in terms of distance. The main objective is to capture the evolving structure of the context vector, that is, the document vector of contextually related terms, in order to discover similarity between such vectors. The context vector reduces the size of the document vector from 6 to 12.57%. Evaluation is done by clustering documents from standard datasets using the Unweighted Pair Group Method with Arithmetic Mean (UPGMA). This results in clusters with better entropy and purity. The Mann–Whitney–Wilcoxon U test demonstrates a statistically significant improvement in quality.
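The abstract evaluates cluster quality by entropy and purity but does not define them here. The sketch below shows one common formulation, given as an illustration only: purity as the fraction of documents that fall in the majority class of their cluster, and entropy as the size-weighted average of per-cluster class entropy in bits. The paper's exact definitions (for example, any normalization or logarithm base) are not stated in this section, so these choices are assumptions, as are the function names and the toy data.

```python
from collections import Counter, defaultdict
from math import log2


def _group_by_cluster(cluster_ids, class_labels):
    """Map each cluster id to a Counter of the true class labels it contains."""
    if len(cluster_ids) != len(class_labels):
        raise ValueError("cluster_ids and class_labels must have equal length")
    groups = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, class_labels):
        groups[cluster][label] += 1
    return groups


def purity(cluster_ids, class_labels):
    """Fraction of documents that belong to the majority class of their cluster."""
    groups = _group_by_cluster(cluster_ids, class_labels)
    total = len(cluster_ids)
    return sum(max(counts.values()) for counts in groups.values()) / total


def entropy(cluster_ids, class_labels):
    """Size-weighted average of per-cluster class entropy (assumed: log base 2)."""
    groups = _group_by_cluster(cluster_ids, class_labels)
    total = len(cluster_ids)
    weighted = 0.0
    for counts in groups.values():
        size = sum(counts.values())
        h = -sum((c / size) * log2(c / size) for c in counts.values())
        weighted += (size / total) * h
    return weighted


# Hypothetical toy data: six documents, two clusters, two true classes.
clusters = [0, 0, 0, 1, 1, 1]
labels = ["sport", "sport", "trade", "trade", "trade", "trade"]
print("purity:", purity(clusters, labels))
print("entropy:", entropy(clusters, labels))
```

Under these definitions, higher purity and lower entropy indicate clusters that agree more closely with the known class labels, so "better entropy" would mean a lower value.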

Keywords

Context based clustering; Text mining; UPGMA; Behavioral patterns

Copyright information

© Springer-Verlag Berlin Heidelberg 2016

Authors and Affiliations

  • Anagha R Kulkarni (1)
  • Vrinda Tokekar (2)
  • Parag Kulkarni (3)
  1. Cummins College of Engineering for Women, Pune, India
  2. IET, DAVV, Indore, India
  3. iKnowlation Research Labs Pvt Ltd, Pune, India
