Document Mining Based on Semantic Understanding of Text
Abstract
This paper presents a new paradigm for mining documents that exploits the semantic information in their text. A formal semantic representation of linguistic inputs is introduced and used to build a semantic representation of each document, constructed by accumulating the outputs of syntactic and semantic analysis. A new distance measure is developed to determine the similarity between the contents of documents. The measure is based on inexact matching of attributed trees; it involves computing all distinct similarity common sub-trees and can be computed efficiently. We expect the proposed representation, together with the proposed similarity measure, to enable more effective document mining. The proposed techniques were implemented as components of a mining system. A case study of semantic document clustering demonstrates the operation and efficacy of the framework, and experimental results are presented and analyzed.
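To make the idea of inexact matching of attributed trees concrete, the following is a minimal sketch and not the measure developed in the paper. It assumes a hypothetical `Node` type with a label, an attribute dictionary, and children; it scores only the common sub-tree anchored at the two roots and pairs children greedily, whereas the paper's measure involves all distinct common sub-trees. The names `node_sim`, `common_weight`, `similarity`, and `distance` are illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical attributed tree node: a label, attributes, and children."""
    label: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def size(t):
    """Number of nodes in the tree rooted at t."""
    return 1 + sum(size(c) for c in t.children)


def node_sim(a, b):
    """Inexact node match in [0, 1]: 0 if labels differ, otherwise
    rewarded for each attribute on which the two nodes agree."""
    if a.label != b.label:
        return 0.0
    keys = set(a.attrs) | set(b.attrs)
    agree = sum(1 for k in keys if a.attrs.get(k) == b.attrs.get(k))
    return (1 + agree) / (1 + len(keys))


def common_weight(a, b):
    """Weight of a common sub-tree rooted at a and b (greedy child pairing)."""
    s = node_sim(a, b)
    if s == 0.0:
        return 0.0
    # Score every child pair, then take the best pairs first (assumption:
    # greedy pairing, which is simple but not guaranteed to be optimal).
    pairs = []
    for i, ca in enumerate(a.children):
        for j, cb in enumerate(b.children):
            w = common_weight(ca, cb)
            if w > 0.0:
                pairs.append((w, i, j))
    pairs.sort(reverse=True)
    used_a, used_b, total = set(), set(), s
    for w, i, j in pairs:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            total += w
    return total


def similarity(a, b):
    """Normalized similarity in [0, 1]; equals 1.0 for identical trees."""
    return common_weight(a, b) / max(size(a), size(b))


def distance(a, b):
    """Distance derived from the similarity above."""
    return 1.0 - similarity(a, b)


# Two toy sentence trees that differ only in the agent's head word.
t1 = Node("event", {"verb": "buy"},
          [Node("agent", {"head": "john"}), Node("object", {"head": "car"})])
t2 = Node("event", {"verb": "buy"},
          [Node("agent", {"head": "mary"}), Node("object", {"head": "car"})])
print(round(similarity(t1, t2), 3))
```

In this sketch the two toy trees share the root and the object node exactly and the agent node only partially, so their similarity is high but below 1. A clustering step could then use `distance` in place of a vector-space distance.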
Keywords
Document mining, semantic understanding, text representation, similarity measure, document clustering