Abstract
Most document clustering techniques rely on single-term analysis of text, as in the vector space model. To better capture document structure, the underlying data model should represent the phrases in a document as well as its single terms. We present a novel data model, the Document Index Graph, which indexes Web documents by phrases rather than by single terms only. The semistructured nature of Web documents helps identify potential phrases that, when matched across documents, indicate strong similarity between them. The Document Index Graph captures this information, making it easy and efficient to find significant matching phrases between documents. The model is flexible: if phrases are not indexed, it reverts to a compact representation of the vector space model. Phrase indexing, however, yields more accurate document similarity calculations. Similarity between documents is based on both single-term weights and matching-phrase weights. The combined similarities are used with standard document clustering techniques to test their effect on clustering quality. Experimental results show that our phrase-based similarity, combined with single-term similarity measures, gives a more accurate measure of document similarity and thus significantly enhances Web document clustering quality.
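To make the idea of a phrase-indexing graph more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy graph whose nodes are words and whose edges record where consecutive word pairs occur in each document. The class name `PhraseIndexGraph`, the naive tokenization, the greedy phrase matching, the phrase score, and the blend factor `alpha` are all illustrative assumptions; the abstract states only that single-term and phrase weights are combined.

```python
import math
from collections import defaultdict


class PhraseIndexGraph:
    """Toy word graph: an edge (w1, w2) means w2 directly follows w1 somewhere."""

    def __init__(self):
        # edge (w1, w2) -> {doc_id: [(sentence_no, position_of_w1), ...]}
        self.edges = defaultdict(lambda: defaultdict(list))
        self.sentences = {}    # doc_id -> list of token lists
        self.term_counts = {}  # doc_id -> {term: count}

    def add_document(self, doc_id, text):
        # Naive tokenization (an assumption): split on "." and whitespace.
        sents = [s.lower().split() for s in text.split(".") if s.strip()]
        counts = defaultdict(int)
        for s_no, words in enumerate(sents):
            for pos, word in enumerate(words):
                counts[word] += 1
                if pos + 1 < len(words):
                    self.edges[(word, words[pos + 1])][doc_id].append((s_no, pos))
        self.sentences[doc_id] = sents
        self.term_counts[doc_id] = dict(counts)

    def _occurrences(self, w1, w2, doc_id):
        return set(self.edges.get((w1, w2), {}).get(doc_id, ()))

    def shared_phrases(self, a, b):
        """Greedy left-to-right scan of a's sentences for phrases also in b."""
        found = set()
        for words in self.sentences[a]:
            i = 0
            while i < len(words) - 1:
                live = self._occurrences(words[i], words[i + 1], b)
                if not live:
                    i += 1
                    continue
                j = i + 1  # index of the last matched word
                while j < len(words) - 1:
                    nxt = self._occurrences(words[j], words[j + 1], b)
                    # The next edge must start right after the current one in b.
                    step = {(s, p + 1) for (s, p) in live} & nxt
                    if not step:
                        break
                    live, j = step, j + 1
                found.add(tuple(words[i:j + 1]))
                i = j
        return found

    def term_similarity(self, a, b):
        """Cosine similarity over raw term counts (single-term view)."""
        ca, cb = self.term_counts[a], self.term_counts[b]
        dot = sum(n * cb.get(t, 0) for t, n in ca.items())
        na = math.sqrt(sum(n * n for n in ca.values()))
        nb = math.sqrt(sum(n * n for n in cb.values()))
        return dot / (na * nb) if na and nb else 0.0

    def similarity(self, a, b, alpha=0.5):
        """Blend phrase and term similarity; alpha is a hypothetical weight."""
        matched = sum(len(p) for p in self.shared_phrases(a, b))
        total = sum(map(len, self.sentences[a])) + sum(map(len, self.sentences[b]))
        phrase_sim = min(1.0, 2 * matched / total) if total else 0.0
        return alpha * phrase_sim + (1 - alpha) * self.term_similarity(a, b)


graph = PhraseIndexGraph()
graph.add_document("d1", "River rafting trips. Wild river adventures.")
graph.add_document("d2", "Mild river rafting. River fishing trips.")
print(graph.shared_phrases("d1", "d2"))
print(graph.similarity("d1", "d2"))
```

Setting `alpha=0` in this sketch ignores phrases entirely and falls back to plain term-vector cosine similarity, which mirrors the abstract's remark that the model can revert to a vector space representation when phrases are not indexed.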
Cite this article
Hammouda, K., Kamel, M. Document Similarity Using a Phrase Indexing Graph Model. Knowl. Inf. Syst. 6, 710–727 (2004). https://doi.org/10.1007/s10115-003-0118-5
DOI: https://doi.org/10.1007/s10115-003-0118-5