Abstract
Text classification has been widely used to help users discover useful information on the Internet. However, current text classification systems are based on the “Bag of Words” (BOW) representation, which accounts only for term frequency in the documents and ignores important semantic relationships between key terms. To overcome this problem, previous work attempted to enrich text representation by means of manual intervention or automatic document expansion. Unfortunately, the improvement achieved is very limited, owing to the poor coverage of the dictionaries used and to the ineffectiveness of term expansion. Fortunately, DBpedia, which has appeared recently, contains rich semantic information. In this paper, we propose a method that compiles DBpedia knowledge into the document representation to improve text classification. It integrates the rich knowledge of DBpedia into text documents by resolving synonyms and introducing more general and associative concepts. To evaluate the proposed method, we performed an empirical evaluation using an SVM classifier on several real data sets. The experimental results show that our proposed framework, which integrates hierarchical relations, synonyms, and associative relations with traditional text similarity measures based on the BOW model, improves text classification performance significantly.
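As a rough illustration of the idea in the abstract, the sketch below extends a bag-of-words count with concept features obtained by resolving synonyms and adding more general concepts. It is a minimal sketch under stated assumptions, not the authors' implementation: the lookup tables `SYNONYMS` and `BROADER`, the function name `bag_of_concepts`, and the `term:`/`concept:` feature prefixes are all hypothetical stand-ins for knowledge that would come from DBpedia, and the abstract does not specify any weighting scheme.

```python
from collections import Counter

# Hypothetical toy lookup tables standing in for DBpedia data; the paper's
# actual mapping and weighting are not described in the abstract.
SYNONYMS = {"car": "automobile", "auto": "automobile"}  # surface form -> canonical concept
BROADER = {"automobile": ["vehicle"], "bicycle": ["vehicle"]}  # concept -> more general concepts


def bag_of_concepts(tokens, synonyms=SYNONYMS, broader=BROADER):
    """Return term counts extended with canonical and broader concept counts."""
    features = Counter()
    for token in tokens:
        term = token.lower()
        features["term:" + term] += 1  # plain bag-of-words feature
        concept = synonyms.get(term, term)  # resolve synonyms to one concept
        if term in synonyms or concept in broader:
            features["concept:" + concept] += 1
        for parent in broader.get(concept, []):  # add more general concepts
            features["concept:" + parent] += 1
    return features


doc_a = bag_of_concepts(["car", "engine"])
doc_b = bag_of_concepts(["auto", "bicycle"])
shared = set(doc_a) & set(doc_b)  # features the two documents have in common
```

In this toy example the two documents share no surface terms, yet they overlap on the concept features for "automobile" and "vehicle", which is the kind of semantic relationship a pure BOW representation would miss.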
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Liao, J., Bai, R. (2009). Building “Bag of Conception” Model Based on DBpedia. In: Kim, Th., Fang, WC., Lee, C., Arnett, K.P. (eds) Advances in Software Engineering. ASEA 2008. Communications in Computer and Information Science, vol 30. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-10242-4_6
DOI: https://doi.org/10.1007/978-3-642-10242-4_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-10241-7
Online ISBN: 978-3-642-10242-4
eBook Packages: Computer Science, Computer Science (R0)