Document Similarity Using a Phrase Indexing Graph Model

Hammouda, Khaled M.; Kamel, Mohamed S.

doi:10.1007/s10115-003-0118-5

Published: 15 January 2004

Volume 6, pages 710–727, (2004)

Knowledge and Information Systems

Abstract

Most document clustering techniques rely on single-term analysis of text, as in the vector space model. To better capture the structure of documents, the underlying data model should represent the phrases in a document as well as its single terms. We present a novel data model, the Document Index Graph, which indexes Web documents by phrases rather than by single terms only. The semistructured nature of Web documents helps identify potential phrases that, when matched in other documents, indicate strong similarity between the documents. The Document Index Graph captures this information, making it easy and efficient to find significant matching phrases between documents. The model is flexible: it reverts to a compact representation of the vector space model if phrases are not indexed. Phrase indexing, however, yields more accurate document similarity calculations. The similarity between documents is based on both single-term weights and matching-phrase weights. The combined similarities are used with standard document clustering techniques to test their effect on clustering quality. Experimental results show that our phrase-based similarity, combined with single-term similarity measures, gives a more accurate measure of document similarity and thus significantly enhances Web document clustering quality.
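
The idea of indexing phrases alongside single terms can be illustrated with a small sketch. The abstract does not give the structure or the weighting formulas, so everything below is an assumption made for illustration: the class name `PhraseIndexGraph`, the choice of consecutive word pairs as graph edges, the phrase score (twice the total length of shared phrases over the combined document length, capped at 1), and the blend factor `alpha` are hypothetical and are not the authors' definitions. Setting `alpha=0` ignores phrases and falls back to plain term-vector cosine similarity.

```python
import math
from collections import Counter, defaultdict


class PhraseIndexGraph:
    """Toy phrase index: nodes are words, edges are consecutive word pairs."""

    def __init__(self):
        # edge (w1, w2) -> doc_id -> set of (sentence_no, position) occurrences
        self.edges = defaultdict(lambda: defaultdict(set))
        self.docs = {}   # doc_id -> list of sentences, each a list of words
        self.terms = {}  # doc_id -> Counter of single-term frequencies

    def add_document(self, doc_id, sentences):
        tokenized = [s.lower().split() for s in sentences]
        self.docs[doc_id] = tokenized
        self.terms[doc_id] = Counter(w for words in tokenized for w in words)
        for s_no, words in enumerate(tokenized):
            for pos in range(len(words) - 1):
                self.edges[(words[pos], words[pos + 1])][doc_id].add((s_no, pos))

    def matching_phrases(self, doc_a, doc_b):
        """Maximal word runs (length >= 2) of doc_a that also occur in doc_b."""
        phrases = set()
        for words in self.docs[doc_a]:
            live = {}  # (sentence_no, pos) in doc_b -> start index of the run in words
            for pos in range(len(words) - 1):
                edge = (words[pos], words[pos + 1])
                hits = self.edges.get(edge, {}).get(doc_b, ())
                # A hit extends a run if doc_b had the previous edge one step earlier.
                nxt = {(s, p): live.get((s, p - 1), pos) for (s, p) in hits}
                for (s, p), start in live.items():
                    if (s, p + 1) not in nxt:  # run ended at this word
                        phrases.add(tuple(words[start:pos + 1]))
                live = nxt
            for start in live.values():  # runs reaching the end of the sentence
                phrases.add(tuple(words[start:]))
        return phrases

    def term_similarity(self, doc_a, doc_b):
        """Cosine similarity over raw term frequencies (vector space model)."""
        a, b = self.terms[doc_a], self.terms[doc_b]
        dot = sum(a[t] * b[t] for t in a if t in b)
        norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
            sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    def phrase_similarity(self, doc_a, doc_b):
        """Illustrative score: shared phrase length relative to document length."""
        shared = sum(len(p) for p in self.matching_phrases(doc_a, doc_b))
        total = sum(self.terms[doc_a].values()) + sum(self.terms[doc_b].values())
        return min(1.0, 2.0 * shared / total) if total else 0.0

    def similarity(self, doc_a, doc_b, alpha=0.5):
        """Blend of phrase and single-term similarity; alpha is hypothetical."""
        return (alpha * self.phrase_similarity(doc_a, doc_b)
                + (1.0 - alpha) * self.term_similarity(doc_a, doc_b))


index = PhraseIndexGraph()
index.add_document("d1", ["river rafting trips", "wild river adventures"])
index.add_document("d2", ["mild river rafting", "river rafting vacation plan"])
print(index.matching_phrases("d1", "d2"))
print(index.similarity("d1", "d2"))
```

Storing sentence and position on each edge is what lets a shared phrase be found by following consecutive edges, rather than by comparing every pair of sentences directly.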

Author information

Authors and Affiliations

Department of Systems Design Engineering, University of Waterloo, Waterloo, Ontario, N2L 3G1, Canada
Khaled M. Hammouda & Mohamed S. Kamel

Corresponding author

Correspondence to Khaled M. Hammouda.

About this article

Cite this article

Hammouda, K., Kamel, M. Document Similarity Using a Phrase Indexing Graph Model. Know. Inf. Sys. 6, 710–727 (2004). https://doi.org/10.1007/s10115-003-0118-5

Received: 09 December 2002
Revised: 13 February 2003
Accepted: 16 May 2003
Published: 15 January 2004
Issue Date: November 2004
DOI: https://doi.org/10.1007/s10115-003-0118-5
