Improving the Efficiency of Document Clustering and Labeling Using Modified FPF Algorithm

Part of the Advances in Intelligent and Soft Computing book series (AINSC, volume 131)

Abstract

Document clustering is an effective tool for managing information overload. By grouping similar documents together, we enable a human observer to browse large document collections quickly and to grasp their distinct topics and subtopics easily. In this paper we survey the most important problems and techniques related to text information retrieval: document pre-processing and filtering, and word sense disambiguation. We then present text clustering using a Modified FPF algorithm and compare our clustering algorithm against FPF, which is the most used algorithm in the text clustering context. Finally, we address the problem of cluster labeling: labels are obtained by combining intra-cluster and inter-cluster term extraction based on a variant of the information gain measure.
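
For orientation, the baseline furthest-point-first (FPF) heuristic named above can be sketched as follows. This is a minimal illustration of the standard FPF scheme only, not the Modified FPF algorithm of this paper. The function names `fpf` and `euclidean`, the use of the first point as the initial center, and the Euclidean distance are assumptions made for the example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def fpf(points, k, dist=euclidean):
    """Standard furthest-point-first clustering (illustrative sketch).

    Returns (centers, assignment): `centers` holds indices into `points`,
    and `assignment[i]` is the position in `centers` of the center
    closest to points[i].
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")

    # Assumption: the first point serves as the arbitrary initial center.
    centers = [0]
    # nearest[i] = distance from points[i] to its closest chosen center.
    nearest = [dist(p, points[0]) for p in points]
    assignment = [0] * len(points)

    while len(centers) < k:
        # The next center is the point furthest from all current centers.
        far = max(range(len(points)), key=nearest.__getitem__)
        centers.append(far)
        # Reassign every point that is closer to the new center.
        for i, p in enumerate(points):
            d = dist(p, points[far])
            if d < nearest[i]:
                nearest[i] = d
                assignment[i] = len(centers) - 1

    return centers, assignment


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (10, 0), (10, 1), (5, 8)]
    print(fpf(pts, 3))
```

For documents, `dist` would typically be replaced by a distance over term vectors, such as a cosine-based distance; that choice is likewise an assumption here and is not taken from the paper.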

Keywords

Clustering; Document clustering; Cluster labeling; Information retrieval



Copyright information

© Springer India Pvt. Ltd. 2012

Authors and Affiliations

  1. Department of Computer Science & Applications, Bangalore University, Bangalore, India
  2. Department of Computer Science, Siddaganga College for Women, Tumkur, India
