A Framework for Titled Document Categorization with Modified Multinomial Naivebayes Classifier

Guo, Hang; Zhou, Lizhu

doi:10.1007/978-3-540-73871-8_31

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4632)

Included in the following conference series:

International Conference on Advanced Data Mining and Applications

Abstract

Titled Documents (TDs) are short text documents segmented into two parts: a Heading Part and an Excerpt Part. With the development of the Internet, TDs are widely used for papers, news, messages, and similar content. In this paper we discuss the problem of automatic TD categorization. Unlike traditional text documents, TDs have short headings that contain fewer uninformative words than their excerpts. Although headings are usually short, their words are more important than the other words in the document. Based on this observation, we propose a titled document classification framework built on the widely used Multinomial Naive Bayes (MNB) classifier. The framework puts higher weight on heading words at the cost of some excerpt words, so that heading words play a more important role in classification than they do in the traditional method. In our experiments on four datasets covering three types of documents, this approach improves the performance of the classifier.
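
The idea of weighting heading words more heavily inside an MNB classifier can be illustrated with a minimal sketch. This is not the authors' framework, whose details are not given in the abstract: the class name `HeadingWeightedMNB`, the `heading_weight` multiplier, the Laplace smoothing constant `alpha`, and the toy training data are all assumptions made for illustration. The sketch only scales heading term counts; it does not model the trade-off against excerpt words that the abstract mentions.

```python
import math
from collections import Counter, defaultdict


def weighted_counts(heading, excerpt, heading_weight):
    """Bag of words in which each heading token counts heading_weight times."""
    counts = Counter()
    for tok in heading.lower().split():
        counts[tok] += heading_weight
    for tok in excerpt.lower().split():
        counts[tok] += 1.0
    return counts


class HeadingWeightedMNB:
    """Multinomial Naive Bayes over heading-weighted term counts (illustrative)."""

    def __init__(self, heading_weight=3.0, alpha=1.0):
        self.heading_weight = heading_weight  # assumed multiplier for heading tokens
        self.alpha = alpha                    # Laplace smoothing constant

    def fit(self, docs, labels):
        self.class_docs = Counter(labels)
        self.term_counts = defaultdict(Counter)
        self.vocab = set()
        for (heading, excerpt), label in zip(docs, labels):
            counts = weighted_counts(heading, excerpt, self.heading_weight)
            self.term_counts[label].update(counts)
            self.vocab.update(counts)
        self.class_totals = {
            c: sum(tc.values()) for c, tc in self.term_counts.items()
        }
        self.n_docs = len(labels)
        return self

    def predict(self, heading, excerpt):
        counts = weighted_counts(heading, excerpt, self.heading_weight)
        best_label, best_score = None, -math.inf
        for c in self.class_docs:
            # log prior plus weighted sum of smoothed log likelihoods
            score = math.log(self.class_docs[c] / self.n_docs)
            denom = self.class_totals[c] + self.alpha * len(self.vocab)
            for tok, n in counts.items():
                if tok not in self.vocab:
                    continue  # ignore tokens never seen in training
                score += n * math.log((self.term_counts[c][tok] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = c, score
        return best_label


# Toy (heading, excerpt) pairs, invented for this example.
train_docs = [
    ("football final", "the team won the game"),
    ("stock market", "the bank cut rates"),
]
train_labels = ["sports", "finance"]

plain = HeadingWeightedMNB(heading_weight=1.0).fit(train_docs, train_labels)
boosted = HeadingWeightedMNB(heading_weight=5.0).fit(train_docs, train_labels)
print(plain.predict("stock", "team won game"))
print(boosted.predict("stock", "team won game"))
```

With `heading_weight=1.0` the sketch reduces to ordinary MNB over the concatenated text, so comparing the two models on the same document shows how much the heading alone can shift the decision.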

Author information

Authors and Affiliations

Computer Science & Technology Department, Tsinghua University, Beijing 100084, China
Hang Guo & Lizhu Zhou

Editor information

Editors and Affiliations

Computer Science Department, University of Calgary, Calgary, AB, Canada
Reda Alhajj
School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
Hong Gao
School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
Jianzhong Li
School of Information Technology and Electronic Engineering, The University of Queensland, Queensland, Australia
Xue Li
Department of Computing Science, University of Alberta, Edmonton, AB, Canada
Osmar R. Zaïane

About this paper

Cite this paper

Guo, H., Zhou, L. (2007). A Framework for Titled Document Categorization with Modified Multinomial Naivebayes Classifier. In: Alhajj, R., Gao, H., Li, J., Li, X., Zaïane, O.R. (eds) Advanced Data Mining and Applications. ADMA 2007. Lecture Notes in Computer Science, vol 4632. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-73871-8_31

DOI: https://doi.org/10.1007/978-3-540-73871-8_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-73870-1
Online ISBN: 978-3-540-73871-8
eBook Packages: Computer Science, Computer Science (R0)