A Hierarchical Document Clustering Environment Based on the Induced Bisecting k-Means

  • Conference paper

Part of the Lecture Notes in Computer Science book series (LNAI, volume 4027)

Abstract

The steady increase of information available on the WWW and in digital libraries, portals, databases and local intranets has given rise to several methods that help users with information retrieval, information organization and browsing. Clustering algorithms are of crucial importance when no labels are associated with textual information or documents. In the text mining domain, the aim of clustering algorithms is to group documents concerning the same topic into the same cluster, producing a flat or hierarchical structure of clusters. In this paper we present a Knowledge Discovery System for document processing and clustering. The clustering algorithm implemented in this system, called Induced Bisecting k-Means, outperforms the Standard Bisecting k-Means and is particularly suitable for online applications where computational efficiency is crucial.
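For orientation, the baseline named in the abstract can be sketched as follows. This is a minimal illustration of the standard bisecting k-means procedure, not of the Induced variant, whose details the abstract does not give. The function names (`cosine`, `centroid`, `two_means`, `bisecting_kmeans`), the toy term-frequency vectors, the use of cosine similarity, and the rule of always splitting the largest cluster are assumptions made for this example only.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def two_means(vectors, rng, max_iters=20):
    """Split vectors into two index groups with plain 2-means (cosine similarity)."""
    first, second = rng.sample(range(len(vectors)), 2)  # random initial centroids
    centers = [vectors[first], vectors[second]]
    groups = [[], []]
    for _ in range(max_iters):
        new_groups = [[], []]
        for index, vector in enumerate(vectors):
            nearer = 0 if cosine(vector, centers[0]) >= cosine(vector, centers[1]) else 1
            new_groups[nearer].append(index)
        # Stop when assignments no longer change or one side became empty.
        if new_groups == groups or not all(new_groups):
            groups = new_groups
            break
        groups = new_groups
        centers = [centroid([vectors[i] for i in group]) for group in groups]
    return groups


def bisecting_kmeans(vectors, k, seed=0):
    """Standard bisecting k-means: repeatedly bisect the largest cluster (assumed rule)."""
    rng = random.Random(seed)
    clusters = [list(range(len(vectors)))]  # start with one cluster holding every document
    while len(clusters) < k:
        clusters.sort(key=len)
        largest = clusters.pop()
        if len(largest) < 2:
            clusters.append(largest)
            break
        left, right = two_means([vectors[i] for i in largest], rng)
        if not left or not right:  # the split failed, so keep the cluster whole
            clusters.append(largest)
            break
        clusters.append([largest[i] for i in left])
        clusters.append([largest[i] for i in right])
    return clusters


# Hypothetical toy corpus: rows are documents, columns are term frequencies.
docs = [
    [3, 1, 0, 0], [2, 1, 0, 0], [4, 0, 0, 0],  # documents using terms 0 and 1
    [0, 0, 2, 3], [0, 0, 1, 4], [0, 0, 3, 3],  # documents using terms 2 and 3
]
clusters = bisecting_kmeans(docs, k=2)
print(sorted(sorted(c) for c in clusters))
```

Recording each bisection as a parent node with two children would yield the hierarchical structure the abstract refers to, while the final list of clusters corresponds to the flat one.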

Keywords

  • Document Cluster
  • Feature Selection Technique
  • Freshness Manager
  • Initial Centroid
  • Open Directory Project

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Archetti, F., Campanelli, P., Fersini, E., Messina, E. (2006). A Hierarchical Document Clustering Environment Based on the Induced Bisecting k-Means. In: Larsen, H.L., Pasi, G., Ortiz-Arroyo, D., Andreasen, T., Christiansen, H. (eds) Flexible Query Answering Systems. FQAS 2006. Lecture Notes in Computer Science, vol 4027. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11766254_22

  • DOI: https://doi.org/10.1007/11766254_22

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-34638-8

  • Online ISBN: 978-3-540-34639-5

  • eBook Packages: Computer Science, Computer Science (R0)
