Segmented Document Classification: Problem and Solution

  • Hang Guo
  • Lizhu Zhou
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4080)


In recent years, structured text documents such as XML files have come to play an important role in Web-based applications. Some of these documents are divided into sections such as “title” and “body”; we call them “segmented documents”. Segmented documents can be classified by treating them as bags of words and applying well-developed text classification models. However, different sections of a segmented document may have a different impact on the classification result, so it is better to treat them differently in the classification process. Following this idea, we design two algorithms, IN_MIX and OUT_MIX, that label segmented documents using a trained classifier. We apply the algorithms with four frequently used models: SVM, Naive Bayes, regression, and instance-based classifiers. In experiments on Reuters-21578, the performance of the different classification models improves compared with the conventional bag-of-words method.
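
The contrast between the conventional bag-of-words treatment and a section-aware one can be illustrated with a minimal sketch. This is not IN_MIX or OUT_MIX, whose details are not given in the abstract; it only shows, under assumed names and weights, what treating sections differently might look like. The functions `flat_bag` and `weighted_bag`, the example document, and the section weights are all hypothetical.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def flat_bag(doc):
    """Conventional baseline: merge all sections into one bag of words."""
    bag = Counter()
    for text in doc.values():
        bag.update(tokenize(text))
    return bag


def weighted_bag(doc, weights, default=1.0):
    """Section-aware variant: scale each section's term counts by a weight.

    The weights are an illustrative assumption, not values from the paper.
    """
    bag = Counter()
    for section, text in doc.items():
        w = weights.get(section, default)
        for term, count in Counter(tokenize(text)).items():
            bag[term] += w * count
    return bag


# A hypothetical segmented document with two sections.
doc = {
    "title": "wheat exports rise",
    "body": "exports of wheat and corn rise",
}

flat = flat_bag(doc)
weighted = weighted_bag(doc, {"title": 3.0, "body": 1.0})
```

In the flat bag, a term counts the same wherever it occurs; in the weighted bag, terms that appear in the title contribute more to the feature vector passed to a classifier.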


Keywords (added by machine, not by the authors): Plain Text; Training Document; Semistructured Data; False Rate; Text Categorization Problem





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Hang Guo (1)
  • Lizhu Zhou (1)
  1. Computer Science & Technology Department, Tsinghua University, Beijing, China
