Segmented Document Classification: Problem and Solution

  • Conference paper
Database and Expert Systems Applications (DEXA 2006)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4080)

Abstract

In recent years, structured text documents such as XML files have come to play an important role in Web-based applications. Some of these documents are segmented into sections such as “title” and “body”; we call them “segmented documents”. A segmented document can be classified by treating it as a bag of words and applying a well-developed text classification model. However, different sections may influence the classification result differently, so it is better to treat them differently during classification. Following this idea, we design two algorithms, IN_MIX and OUT_MIX, that label segmented documents using a trained classifier. We evaluate the algorithms with four frequently used models: SVM, Naive Bayes, regression, and instance-based classifiers. In experiments on Reuters-21578, the performance of these classification models improves compared with the conventional bag-of-words method.
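
The abstract does not say how IN_MIX and OUT_MIX work, so the following is only a minimal sketch of the general idea that sections can be weighted differently, not the authors' algorithms. It contrasts a conventional flat bag of words with two generic alternatives: merging sections into one bag with per-section weights, and combining per-section classifier scores. The section weights, function names, sample document, and scores are all hypothetical.

```python
from collections import Counter

# Hypothetical section weights; the abstract gives no actual values.
SECTION_WEIGHTS = {"title": 2.0, "body": 1.0}


def tokenize(text):
    """Lowercase whitespace tokenizer, kept deliberately simple."""
    return text.lower().split()


def flat_bag(doc):
    """Conventional bag of words: every section counts equally."""
    bag = Counter()
    for text in doc.values():
        bag.update(tokenize(text))
    return bag


def weighted_bag(doc, weights=SECTION_WEIGHTS):
    """Merge all sections into one bag, scaling counts by section weight."""
    bag = Counter()
    for section, text in doc.items():
        w = weights.get(section, 1.0)
        for term, count in Counter(tokenize(text)).items():
            bag[term] += w * count
    return bag


def combine_scores(section_scores, weights=SECTION_WEIGHTS):
    """Weighted average of per-section classifier scores for one class."""
    total = sum(weights.get(s, 1.0) for s in section_scores)
    weighted = sum(weights.get(s, 1.0) * score
                   for s, score in section_scores.items())
    return weighted / total


# Hypothetical segmented document with two sections.
doc = {
    "title": "wheat exports rise",
    "body": "exports of wheat and corn rose this year",
}

print(flat_bag(doc)["wheat"])
print(weighted_bag(doc)["wheat"])
# Per-section scores would come from a trained classifier; these are made up.
print(combine_scores({"title": 0.9, "body": 0.6}))
```

In this sketch a title term contributes twice as much as a body term to the merged bag, whereas the flat bag cannot distinguish where a term occurred.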

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Guo, H., Zhou, L. (2006). Segmented Document Classification: Problem and Solution. In: Bressan, S., Küng, J., Wagner, R. (eds) Database and Expert Systems Applications. DEXA 2006. Lecture Notes in Computer Science, vol 4080. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11827405_53

  • DOI: https://doi.org/10.1007/11827405_53

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-37871-6

  • Online ISBN: 978-3-540-37872-3

  • eBook Packages: Computer Science, Computer Science (R0)
