Segmented Document Classification: Problem and Solution

Guo, Hang; Zhou, Lizhu

doi:10.1007/11827405_53

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4080)

Included in the following conference series:

International Conference on Database and Expert Systems Applications

Abstract

In recent years, structured text documents such as XML files have come to play an important role in Web-based applications. Some of these documents are segmented into sections such as “title” and “body”; we call them “segmented documents”. Segmented documents can be classified by treating them as bags of words and applying well-developed text classification models. However, different sections of a segmented document may influence the classification result differently, so it is better to treat them differently during classification. Following this idea, we design two algorithms, IN_MIX and OUT_MIX, that label segmented documents using a trained classifier. We apply the algorithms with four frequently used models: SVM, Naive Bayes, regression, and instance-based classifiers. In experiments on Reuters-21578, the classification models perform better than they do with the conventional bag-of-words method.
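The abstract does not say how IN_MIX and OUT_MIX work, so the sketch below is not a reconstruction of either algorithm. It only illustrates the general idea of treating sections differently, in two generic ways: weighting term counts per section before classification, and weighting per-section classifier scores afterwards. The function names, the example document, the section weights, and the scores are all hypothetical.

```python
from collections import Counter


def bag_of_words(doc):
    """Baseline: pool the tokens of every section into a single bag."""
    bag = Counter()
    for text in doc.values():
        bag.update(text.lower().split())
    return bag


def weighted_bag(doc, weights):
    """Scale each section's term counts by a per-section weight, then pool.

    Sections without an entry in `weights` default to weight 1.0.
    """
    bag = Counter()
    for section, text in doc.items():
        w = weights.get(section, 1.0)
        for token in text.lower().split():
            bag[token] += w
    return bag


def combine_section_scores(section_scores, weights):
    """Sum per-section class scores with per-section weights; return the top label."""
    total = {}
    for section, scores in section_scores.items():
        w = weights.get(section, 1.0)
        for label, score in scores.items():
            total[label] = total.get(label, 0.0) + w * score
    return max(total, key=total.get)


# Hypothetical segmented document with two sections.
doc = {
    "title": "wheat exports rise",
    "body": "exports of wheat and corn rose as wheat prices fell",
}
section_weights = {"title": 3.0, "body": 1.0}  # assumed, not from the paper

flat = bag_of_words(doc)
weighted = weighted_bag(doc, section_weights)

# Hypothetical class scores, as if a classifier had scored each section alone.
scores = {
    "title": {"grain": 0.8, "trade": 0.2},
    "body": {"grain": 0.1, "trade": 0.9},
}
equal_vote = combine_section_scores(scores, {})
title_heavy = combine_section_scores(scores, section_weights)
```

In this toy setup a title term counts three times as much as a body term, and the same weights change the combined label from the one the body favours to the one the title favours. How such weights would be chosen, and whether the paper's algorithms resemble either variant, is not stated in the abstract.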

Author information

Authors and Affiliations

Computer Science & Technology Department, Tsinghua University, 100084, Beijing, China
Hang Guo & Lizhu Zhou

Editor information

Editors and Affiliations

School of Computing, National University of Singapore
Stéphane Bressan
University of Linz, Altenbergerstraße 69, 4040, Linz, Austria
Josef Küng & Roland Wagner

Cite this paper

Guo, H., Zhou, L. (2006). Segmented Document Classification: Problem and Solution. In: Bressan, S., Küng, J., Wagner, R. (eds) Database and Expert Systems Applications. DEXA 2006. Lecture Notes in Computer Science, vol 4080. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11827405_53

DOI: https://doi.org/10.1007/11827405_53
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-37871-6
Online ISBN: 978-3-540-37872-3
eBook Packages: Computer Science, Computer Science (R0)
