Abstract
In recent years, structured text documents such as XML files have come to play an important role in Web-based applications. Some of these documents are divided into sections such as “title” and “body”; we call them “segmented documents”. Segmented documents can be classified by treating them as bags of words and applying well-developed text classification models. However, different sections of a segmented document may influence the classification result to different degrees, so it is better to treat them differently during classification. Following this idea, we design two algorithms, IN_MIX and OUT_MIX, that label segmented documents using a trained classifier. We apply the algorithms with four frequently used models: SVM, Naive Bayes, regression, and instance-based classifiers. In experiments on Reuters-21578, the performance of the different classification models improves compared with the conventional bag-of-words method.
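The abstract does not specify how IN_MIX and OUT_MIX work, so the following is not an implementation of either algorithm. It is only a minimal sketch of the general idea that sections can be treated differently, shown in two generic ways: weighting tokens by section when building the feature bag, and weighting per-section classifier scores when combining them. The function names, the section weights, and the example scores are all hypothetical and chosen for illustration.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplifying assumption)."""
    return text.lower().split()


def flat_bag(doc):
    """Conventional bag of words: sections are merged, every token counts 1."""
    bag = Counter()
    for text in doc.values():
        bag.update(tokenize(text))
    return bag


def weighted_bag(doc, section_weights, default_weight=1.0):
    """Bag of words in which each token counts as much as its section's weight."""
    bag = Counter()
    for section, text in doc.items():
        weight = section_weights.get(section, default_weight)
        for token in tokenize(text):
            bag[token] += weight
    return bag


def combine_scores(per_section_scores, section_weights, default_weight=1.0):
    """Weighted sum of per-section label scores; returns the top label."""
    totals = Counter()
    for section, scores in per_section_scores.items():
        weight = section_weights.get(section, default_weight)
        for label, score in scores.items():
            totals[label] += weight * score
    return max(totals, key=totals.get)


# Hypothetical segmented document and weights (not taken from the paper).
doc = {
    "title": "wheat exports rise",
    "body": "exports of wheat and corn rose this year",
}
weights = {"title": 3.0, "body": 1.0}

flat = flat_bag(doc)
weighted = weighted_bag(doc, weights)

# Hypothetical per-section scores, as if produced by some trained classifier.
section_scores = {
    "title": {"grain": 0.7, "trade": 0.3},
    "body": {"grain": 0.2, "trade": 0.8},
}
label_equal = combine_scores(section_scores, {})
label_weighted = combine_scores(section_scores, weights)
```

In this toy example, giving the title three times the weight of the body changes the combined label from "trade" to "grain", which illustrates why section-aware treatment can alter a classification outcome. How such weights would actually be chosen is left open here.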
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Guo, H., Zhou, L. (2006). Segmented Document Classification: Problem and Solution. In: Bressan, S., Küng, J., Wagner, R. (eds) Database and Expert Systems Applications. DEXA 2006. Lecture Notes in Computer Science, vol 4080. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11827405_53
DOI: https://doi.org/10.1007/11827405_53
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-37871-6
Online ISBN: 978-3-540-37872-3
eBook Packages: Computer Science (R0)