Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 226)

Abstract

This paper deals with one-class automatic document classification. Five feature selection methods and three classifiers are evaluated on a Czech corpus in order to build an efficient Czech document classification system. Lemmatization and POS tagging are used for a precise representation of the Czech documents. We demonstrate that POS tag filtering is very important, while lemmatization plays only a marginal role in classification. We also show that the Maximum Entropy and Support Vector Machine classifiers are very robust to the feature vector size and significantly outperform the Naive Bayes classifier in terms of classification accuracy. The best classification accuracy is about 90%, which is sufficient for an application at the Czech News Agency, our commercial partner.
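As a rough illustration of the pipeline the abstract describes (POS tag filtering over lemmatized text, followed by a classifier), the following is a minimal sketch in plain Python. It is not the authors' implementation: the toy documents, the category labels, the tag letters, the set `KEEP_TAGS`, and the functions `pos_filter`, `train_nb`, and `predict_nb` are all assumptions made for this example, and only a multinomial Naive Bayes classifier with add-one smoothing is shown.

```python
import math
from collections import Counter, defaultdict

# Assumption: documents are already lemmatized and tagged, given as
# (lemma, POS tag) pairs. The tag letters are illustrative only.
KEEP_TAGS = {"N", "V", "A"}  # hypothetical choice: nouns, verbs, adjectives


def pos_filter(doc):
    """Keep only the lemmas whose POS tag is in KEEP_TAGS."""
    return [lemma for lemma, tag in doc if tag in KEEP_TAGS]


def train_nb(docs, labels):
    """Collect the counts needed by a multinomial Naive Bayes classifier."""
    class_counts = Counter(labels)       # number of documents per class
    term_counts = defaultdict(Counter)   # term frequencies per class
    vocab = set()
    for tokens, label in zip(docs, labels):
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, term_counts, vocab


def predict_nb(model, tokens):
    """Return the class with the highest log posterior (add-one smoothing)."""
    class_counts, term_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)
        denom = sum(term_counts[label].values()) + len(vocab)
        for tok in tokens:
            if tok in vocab:  # terms unseen in training are ignored
                score += math.log((term_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy corpus; "R" marks a preposition that the filter drops.
train = [
    ([("ministr", "N"), ("podepsat", "V"), ("smlouva", "N"), ("o", "R")], "politics"),
    ([("parlament", "N"), ("hlasovat", "V"), ("o", "R"), ("smlouva", "N")], "politics"),
    ([("klub", "N"), ("porazit", "V"), ("turnaj", "N"), ("na", "R")], "sport"),
    ([("fotbal", "N"), ("liga", "N"), ("bojovat", "V"), ("na", "R")], "sport"),
]
model = train_nb([pos_filter(d) for d, _ in train], [label for _, label in train])

test_doc = [("ministr", "N"), ("hlasovat", "V"), ("na", "R")]
predicted = predict_nb(model, pos_filter(test_doc))
print(predicted)
```

The filter runs before both training and prediction, so the classifier never sees the discarded word classes. A feature selection step, or a Maximum Entropy or SVM classifier, would slot in at the same point as `train_nb`.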



Corresponding author

Correspondence to Michal Hrala.

Copyright information

© 2013 Springer International Publishing Switzerland

About this paper

Cite this paper

Hrala, M., Král, P. (2013). Evaluation of the Document Classification Approaches. In: Burduk, R., Jackowski, K., Kurzynski, M., Wozniak, M., Zolnierek, A. (eds) Proceedings of the 8th International Conference on Computer Recognition Systems CORES 2013. Advances in Intelligent Systems and Computing, vol 226. Springer, Heidelberg. https://doi.org/10.1007/978-3-319-00969-8_86

  • DOI: https://doi.org/10.1007/978-3-319-00969-8_86

  • Publisher Name: Springer, Heidelberg

  • Print ISBN: 978-3-319-00968-1

  • Online ISBN: 978-3-319-00969-8

  • eBook Packages: Engineering, Engineering (R0)
