Document Mining Based on Semantic Understanding of Text

  • Khaled Shaban
  • Otman Basir
  • Mohamed Kamel
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4225)


This paper presents a new paradigm for mining documents by exploiting the semantic information in their texts. A formal semantic representation of linguistic inputs is introduced and used to build a semantic representation of documents. The representation is constructed by accumulating the outputs of syntactic and semantic analysis. A new distance measure is developed to determine the similarity between the contents of documents. The measure is based on inexact matching of attributed trees; it involves computing all distinct similarity common sub-trees and can be computed efficiently. The proposed representation, together with the proposed similarity measure, is expected to enable more effective document mining processes. The proposed techniques were implemented as components of a mining system. A case study of semantic document clustering demonstrates the operation and efficacy of the framework. Experimental results are presented and analyzed.
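
To make the idea of a distance based on inexact matching of attributed trees more concrete, the following is a minimal Python sketch. It is not the measure defined in the paper: the `Node` class, the scoring rule in `node_sim`, the greedy pairing of children, and the normalization in `tree_distance` are all illustrative assumptions, and the sketch only compares subtrees rooted at the two roots rather than enumerating all distinct common sub-trees.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical attributed tree node: a label, attributes, and children."""
    label: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def size(t):
    """Number of nodes in the tree rooted at t."""
    return 1 + sum(size(c) for c in t.children)


def node_sim(a, b):
    """Inexact node match in [0, 1] (assumed rule, not from the paper).

    Labels must agree; attributes may differ and lower the score.
    """
    if a.label != b.label:
        return 0.0
    keys = set(a.attrs) | set(b.attrs)
    same = sum(1 for k in keys if a.attrs.get(k) == b.attrs.get(k))
    return (1 + same) / (1 + len(keys))


def rooted_sim(a, b):
    """Weight of a common subtree rooted at a and b.

    Children are paired one-to-one greedily, best-scoring pairs first.
    """
    s = node_sim(a, b)
    if s == 0.0:
        return 0.0
    pairs = sorted(
        ((rooted_sim(x, y), i, j)
         for i, x in enumerate(a.children)
         for j, y in enumerate(b.children)),
        reverse=True,
    )
    used_a, used_b = set(), set()
    total = s
    for w, i, j in pairs:
        if w > 0 and i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            total += w
    return total


def tree_distance(a, b):
    """Distance in [0, 1]: 0 for identical trees, 1 when the roots do not match."""
    return 1.0 - rooted_sim(a, b) / max(size(a), size(b))


# Two toy trees that differ in one child.
t1 = Node("eat", {"tense": "past"},
          [Node("cat", {"role": "agent"}), Node("fish", {"role": "theme"})])
t2 = Node("eat", {"tense": "past"},
          [Node("cat", {"role": "agent"}), Node("mouse", {"role": "theme"})])

print(tree_distance(t1, t1))
print(tree_distance(t1, t2))
```

In this toy setting, a pairwise distance of this kind could be fed to any clustering routine that accepts a distance matrix, which is the role the abstract assigns to the similarity measure in semantic document clustering.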


Keywords: document mining, semantic understanding, text representation, similarity measure, document clustering



Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Khaled Shaban (1)
  • Otman Basir (1)
  • Mohamed Kamel (1)
  1. Electrical and Computer Engineering, University of Waterloo, Waterloo, Canada
