Evaluating Hard and Soft Flat-Clustering Algorithms for Text Documents

Singh, Vivek Kumar; Siddiqui, Tanveer Jahan; Singh, Manoj Kumar

doi:10.1007/978-3-642-31603-6_6


Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 179)


Abstract

Document clustering is the unsupervised classification (categorization) of documents into groups (clusters) such that documents within a cluster are similar to one another, whereas dissimilar documents are assigned to different clusters. The documents may be web pages, blog posts, news articles, or other text files. Flat clustering is a popular and computationally efficient clustering technique. Unlike hierarchical techniques, flat clustering algorithms partition the document space directly into groups of similar documents. The cluster assignments, however, may be hard (each document belongs to exactly one cluster) or soft (each document has a degree of membership in every cluster). This paper presents our experimental work on evaluating some hard and soft flat-clustering algorithms, namely K-means, heuristic K-means and fuzzy C-means, for categorizing text documents. We experimented with different representations (tf, tf.idf, Boolean) and feature selection schemes (with or without stop word removal and with or without stemming) on some standard datasets. The results indicate that the tf.idf representation and the use of stemming yield better clustering. Moreover, fuzzy clustering obtains better results than K-means on almost all datasets, and is also a more stable method.
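To make the abstract's terms concrete, the following is a minimal sketch, not the authors' implementation. It shows the three document representations named above (tf, tf.idf, Boolean) and contrasts a hard nearest-centroid assignment, as used in K-means, with the standard fuzzy C-means membership formula. The function names, the unsmoothed idf variant log(N/df), the Euclidean distance and the fuzzifier m = 2 are all assumptions chosen for illustration; the paper's exact settings are not given on this page.

```python
import math
from collections import Counter


def term_weights(docs, scheme="tfidf"):
    """Return one {term: weight} dict per tokenised document.

    Assumed schemes: "tf" (raw counts), "boolean" (presence only) and
    "tfidf" (count * log(N / document frequency), no smoothing).
    """
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequency
    vectors = []
    for d in docs:
        tf = Counter(d)
        if scheme == "tf":
            vec = dict(tf)
        elif scheme == "boolean":
            vec = {t: 1 for t in tf}
        elif scheme == "tfidf":
            vec = {t: c * math.log(n / df[t]) for t, c in tf.items()}
        else:
            raise ValueError(f"unknown scheme: {scheme}")
        vectors.append(vec)
    return vectors


def distance(u, v):
    """Euclidean distance between two sparse vectors stored as dicts."""
    keys = set(u) | set(v)
    return math.sqrt(sum((u.get(k, 0.0) - v.get(k, 0.0)) ** 2 for k in keys))


def hard_assignment(x, centroids):
    """K-means style: index of the single nearest centroid."""
    dists = [distance(x, c) for c in centroids]
    return dists.index(min(dists))


def soft_assignment(x, centroids, m=2.0):
    """Fuzzy C-means style: one membership degree per centroid, summing to 1.

    Uses u_i = 1 / sum_k (d_i / d_k) ** (2 / (m - 1)) with fuzzifier m > 1.
    """
    dists = [distance(x, c) for c in centroids]
    if 0.0 in dists:
        # The point coincides with one or more centroids: share membership
        # equally among those and give zero to the rest.
        hits = [d == 0.0 for d in dists]
        return [1.0 / sum(hits) if h else 0.0 for h in hits]
    p = 2.0 / (m - 1.0)
    return [1.0 / sum((di / dk) ** p for dk in dists) for di in dists]


if __name__ == "__main__":
    docs = [["apple", "banana", "apple"], ["banana", "cherry"]]
    for scheme in ("tf", "boolean", "tfidf"):
        print(scheme, term_weights(docs, scheme))
    centroids = [{"a": 0.0}, {"a": 3.0}]
    print(hard_assignment({"a": 1.0}, centroids))
    print(soft_assignment({"a": 1.0}, centroids))
```

In this toy setting the hard assignment returns only the index of the nearest centroid, while the soft assignment keeps graded information about how close the document is to every cluster.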



Author information

Authors and Affiliations

Department of Computer Science, Banaras Hindu University (BHU), Varanasi, India
Vivek Kumar Singh
Institute of Applied Physics and Technology, University of Allahabad, Allahabad, India
Tanveer Jahan Siddiqui
DST-Centre for Interdisciplinary Mathematical Sciences(DST-CIMS), Banaras Hindu University (BHU), Varanasi, India
Manoj Kumar Singh


Corresponding author

Correspondence to Manoj Kumar Singh.

Editor information

Editors and Affiliations

Faculty of Electrical Engineering and, Technical University of Ostrava, 17. listopadu 15, Ostrava-Poruba, 708 33, Czech Republic
Miloš Kudělka
Faculty of Mathematics and Physics, Charles University, Malostranske nam. 25, Praha, 118 00, Czech Republic
Jaroslav Pokorný
Faculty of Electrical Engineering and, Technical University of Ostrava, 17. listopadu 15, Ostrava-Poruba, 708 33, Czech Republic
Václav Snášel
Auburn, 98071, USA
Ajith Abraham


Cite this paper

Singh, V.K., Siddiqui, T.J., Singh, M.K. (2013). Evaluating Hard and Soft Flat-Clustering Algorithms for Text Documents. In: Kudělka, M., Pokorný, J., Snášel, V., Abraham, A. (eds) Proceedings of the Third International Conference on Intelligent Human Computer Interaction (IHCI 2011), Prague, Czech Republic, August, 2011. Advances in Intelligent Systems and Computing, vol 179. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31603-6_6


DOI: https://doi.org/10.1007/978-3-642-31603-6_6
Published: 17 July 2012
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-31602-9
Online ISBN: 978-3-642-31603-6
eBook Packages: Engineering, Engineering (R0)
