Segmented Document Classification: Problem and Solution
In recent years, structured text documents such as XML files have come to play an important role in Web-based applications. Some of these documents are segmented into sections such as “title” and “body”; we call them “segmented documents”. Segmented documents can be classified by treating them as bags of words and applying well-developed text classification models. However, different sections of a segmented document may have different impacts on the classification result, so it is better to treat them differently in the classification process. Following this idea, we design two algorithms, IN_MIX and OUT_MIX, that label segmented documents using a trained classifier. We apply the algorithms with four frequently used models: SVM, Naive Bayes, regression, and instance-based classifiers. In experiments on Reuters-21578, the performance of these classification models improves compared with the conventional bag-of-words method.
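The contrast between a flat bag of words and a section-aware representation can be illustrated with a small sketch. The abstract does not define IN_MIX or OUT_MIX, so the code below is not either algorithm; it only shows one generic way to let sections contribute unequally, by scaling term counts with a per-section weight. The function names, the example document, and the weights (title counted three times as heavily as body) are all assumptions made for illustration.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberately naive assumption)."""
    return text.lower().split()


def flat_bag(doc):
    """Conventional bag of words: every section contributes equally."""
    bag = Counter()
    for text in doc.values():
        bag.update(tokenize(text))
    return bag


def weighted_bag(doc, weights, default=1.0):
    """Section-aware bag: term counts are scaled by a per-section weight.

    `weights` maps section names to multipliers; sections that are not
    listed fall back to `default`. The weights are hypothetical here and
    would normally be tuned on training data.
    """
    bag = Counter()
    for section, text in doc.items():
        w = weights.get(section, default)
        for token, count in Counter(tokenize(text)).items():
            bag[token] += w * count
    return bag


# Hypothetical segmented document with two sections.
doc = {
    "title": "wheat exports rise",
    "body": "exports of wheat and corn rise as prices fall",
}
weights = {"title": 3.0, "body": 1.0}

flat = flat_bag(doc)
weighted = weighted_bag(doc, weights)
```

Either bag can then be passed to any standard text classifier as a feature vector. In the weighted version, terms that appear in the title carry more mass than terms that appear only in the body, which is the intuition behind treating sections differently.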
Keywords: Plain Text, Training Document, Semistructured Data, False Rate, Text Categorization Problem