
Document Similarity Using a Phrase Indexing Graph Model

Knowledge and Information Systems

Abstract

Document clustering techniques mostly rely on single-term analysis of text, as in the vector space model. To better capture the structure of documents, the underlying data model should be able to represent the phrases in a document as well as its single terms. We present a novel data model, the Document Index Graph, which indexes Web documents based on phrases rather than on single terms only. The semistructured Web documents help in identifying potential phrases that, when matched with other documents, indicate strong similarity between the documents. The Document Index Graph captures this information, and finding significant matching phrases between documents becomes easy and efficient with such a model. The model is flexible in that it can revert to a compact representation of the vector space model if we choose not to index phrases. However, using phrase indexing yields more accurate document similarity calculations. The similarity between documents is based on both single-term weights and matching-phrase weights. The combined similarities are used with standard document clustering techniques to test their effect on clustering quality. Experimental results show that our phrase-based similarity, combined with single-term similarity measures, gives a more accurate measure of document similarity and thus significantly enhances Web document clustering quality.
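To make the idea of phrase indexing concrete, the following is a minimal sketch and not the Document Index Graph as defined in the article. It assumes a toy word graph in which each edge (a pair of consecutive words) records the document, sentence, and position where it occurs, and it blends a phrase-overlap score with a single-term cosine score. The class name `PhraseIndex`, the phrase-overlap formula, and the blend factor `alpha` are all illustrative assumptions; the abstract does not specify the actual weighting.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Split text into sentences, each a list of lowercase words."""
    for mark in "?!":
        text = text.replace(mark, ".")
    return [s.split() for s in text.lower().split(".") if s.split()]


class PhraseIndex:
    """Toy phrase index (hypothetical, not the article's exact structure)."""

    def __init__(self):
        self.sentences = {}  # doc_id -> list of sentences (lists of words)
        # (word, next_word) -> doc_id -> {(sentence index, word position)}
        self.edges = defaultdict(lambda: defaultdict(set))

    def add(self, doc_id, text):
        self.sentences[doc_id] = tokenize(text)
        for si, words in enumerate(self.sentences[doc_id]):
            for pos in range(len(words) - 1):
                self.edges[(words[pos], words[pos + 1])][doc_id].add((si, pos))

    def shared_phrases(self, a, b):
        """Maximal word sequences (length >= 2) occurring in both documents."""
        sa, sb = self.sentences[a], self.sentences[b]
        phrases = set()
        for docs in self.edges.values():
            if a not in docs or b not in docs:
                continue
            for si, p in docs[a]:
                for sj, q in docs[b]:
                    # Skip matches that merely continue a longer one.
                    if p > 0 and q > 0 and sa[si][p - 1] == sb[sj][q - 1]:
                        continue
                    n = 2  # the edge itself already matches two words
                    while (p + n < len(sa[si]) and q + n < len(sb[sj])
                           and sa[si][p + n] == sb[sj][q + n]):
                        n += 1
                    phrases.add(tuple(sa[si][p:p + n]))
        return phrases

    def phrase_similarity(self, a, b):
        """Assumed score: share of words covered by matching phrases."""
        matched = sum(len(p) for p in self.shared_phrases(a, b))
        total = sum(len(s) for s in self.sentences[a] + self.sentences[b])
        return min(1.0, 2 * matched / total) if total else 0.0

    def term_cosine(self, a, b):
        """Cosine similarity over raw single-term counts."""
        ta = Counter(w for s in self.sentences[a] for w in s)
        tb = Counter(w for s in self.sentences[b] for w in s)
        dot = sum(ta[w] * tb[w] for w in ta)
        norm = (math.sqrt(sum(v * v for v in ta.values()))
                * math.sqrt(sum(v * v for v in tb.values())))
        return dot / norm if norm else 0.0

    def similarity(self, a, b, alpha=0.5):
        """Blend of phrase and single-term similarity; alpha is an assumption."""
        return (alpha * self.phrase_similarity(a, b)
                + (1 - alpha) * self.term_cosine(a, b))


index = PhraseIndex()
index.add("d1", "River rafting is fun. Wild river adventures await.")
index.add("d2", "Mild river rafting is relaxing. River adventures are popular.")
print(sorted(index.shared_phrases("d1", "d2")))
print(round(index.similarity("d1", "d2"), 3))
```

Setting `alpha=0` in this sketch ignores phrases entirely and leaves only the single-term cosine score, which loosely mirrors the abstract's remark that the model can fall back to a vector space representation when phrases are not indexed.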



Author information

Corresponding author

Correspondence to Khaled M. Hammouda.


About this article

Cite this article

Hammouda, K., Kamel, M. Document Similarity Using a Phrase Indexing Graph Model. Knowl. Inf. Syst. 6, 710–727 (2004). https://doi.org/10.1007/s10115-003-0118-5


  • DOI: https://doi.org/10.1007/s10115-003-0118-5
