Phrase Based Web Document Clustering: An Indexing Approach

  • Amit Prakash Singh
  • Shalini Srivastava
  • Sanjib Kumar Sahu
Conference paper
Part of the Lecture Notes in Networks and Systems book series (LNNS, volume 5)


Most document clustering methods are based on single-term analysis of the document set; the vector space model is one example. To get improved results, more informative features need to be included, such as phrases and the weights they carry in a specific document. The importance of document clustering is evident in the need to categorize documents, group search engine results, build document taxonomies, and more. This paper presents a few such important parts of document clustering. Two well-known algorithms in this field are discussed, along with an analytical study comparing the results returned by a popular search engine with those of a tool based on this algorithm. The model highlights the efficient but less relevant results of a search engine and demonstrates improved document similarity.
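To illustrate the contrast between single-term analysis and phrase-based features, here is a minimal sketch. It is not the paper's method: the function names are hypothetical, and plain word bigrams with Jaccard overlap stand in as a crude proxy for weighted phrase matching.

```python
from collections import Counter
from math import sqrt


def term_cosine(a: str, b: str) -> float:
    """Cosine similarity over single-term counts (a basic vector space model)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def phrases(text: str, n: int = 2) -> set:
    """Set of word n-grams; an assumed, simplified notion of a 'phrase'."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def phrase_overlap(a: str, b: str, n: int = 2) -> float:
    """Jaccard overlap of shared n-gram phrases between two documents."""
    pa, pb = phrases(a, n), phrases(b, n)
    union = pa | pb
    return len(pa & pb) / len(union) if union else 0.0


# Same words, different order: the two measures disagree.
d1 = "river rafting vacation plan"
d2 = "plan vacation rafting river"
print(term_cosine(d1, d2))     # single terms ignore word order
print(phrase_overlap(d1, d2))  # phrases are sensitive to word order
```

Because the two documents contain identical words in reversed order, the single-term measure treats them as the same document, while the phrase measure finds no shared phrase. This is the kind of distinction that phrase-based features are meant to capture.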


Web mining; Document similarity; Phrase-based indexing; Document clustering; Document structure; Document Index Graph; Phrase matching



Copyright information

© Springer Nature Singapore Pte Ltd. 2017

Authors and Affiliations

  • Amit Prakash Singh (1)
  • Shalini Srivastava (2)
  • Sanjib Kumar Sahu (3)
  1. University School of Information and Communication Technology, Guru Gobind Singh Indraprastha University, Dwarka, India
  2. Integrated Institute of Technology, Dwarka, India
  3. Department of Computer Science and Application, Utkal University, Bhubaneshwar, India
