Evaluation of the Document Classification Approaches

Hrala, Michal; Král, Pavel

doi:10.1007/978-3-319-00969-8_86


Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 226)


Abstract

This paper deals with one-class automatic document classification. Five feature selection methods and three classifiers are evaluated on a Czech corpus in order to build an efficient Czech document classification system. Lemmatization and POS tagging are used to represent the Czech documents precisely. We demonstrate that POS tag filtering is very important, while lemmatization plays only a marginal role in classification. We also show that the Maximum Entropy and Support Vector Machine classifiers are very robust to the feature vector size and significantly outperform the Naive Bayes classifier in terms of classification accuracy. The best classification accuracy is about 90%, which is sufficient for an application for the Czech News Agency, our commercial partner.
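The kind of pipeline the abstract describes (POS tag filtering, optional lemmatization, feature selection, then a classifier) can be illustrated with a minimal Python sketch. Everything concrete in it is an assumption made for illustration: the toy English documents, the POS tag names, the `KEEP_POS` filter, and the choice of document frequency as the selection criterion (the abstract does not name the five methods evaluated). Naive Bayes is used only because it is the simplest of the three classifiers mentioned; this is not the authors' implementation.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus: (word, lemma, POS) triples plus a class label.
# The data and the tag names are invented for illustration only.
DOCS = [
    ([("goals", "goal", "NOUN"), ("were", "be", "VERB"), ("scored", "score", "VERB"),
      ("in", "in", "ADP"), ("the", "the", "DET"), ("match", "match", "NOUN")], "sport"),
    ([("the", "the", "DET"), ("team", "team", "NOUN"), ("won", "win", "VERB"),
      ("the", "the", "DET"), ("match", "match", "NOUN")], "sport"),
    ([("the", "the", "DET"), ("parliament", "parliament", "NOUN"),
      ("passed", "pass", "VERB"), ("the", "the", "DET"), ("law", "law", "NOUN")], "politics"),
    ([("ministers", "minister", "NOUN"), ("debated", "debate", "VERB"),
      ("the", "the", "DET"), ("law", "law", "NOUN")], "politics"),
]

KEEP_POS = {"NOUN", "VERB", "ADJ"}  # assumed filter, not taken from the paper


def featurize(tokens, keep_pos=KEEP_POS, use_lemma=True):
    """Drop tokens with a filtered-out POS tag; keep the lemma or the word form."""
    return [(lemma if use_lemma else word)
            for word, lemma, pos in tokens if pos in keep_pos]


def select_by_document_frequency(docs, k):
    """Keep the k features found in the most documents (ties broken alphabetically)."""
    df = Counter()
    for feats in docs:
        df.update(set(feats))
    ranked = sorted(df, key=lambda f: (-df[f], f))
    return set(ranked[:k])


def train_naive_bayes(docs, labels, vocab):
    """Multinomial Naive Bayes with add-one smoothing over the selected vocabulary."""
    class_docs = Counter(labels)
    counts = defaultdict(Counter)
    for feats, label in zip(docs, labels):
        counts[label].update(f for f in feats if f in vocab)
    model = {}
    for label in class_docs:
        total = sum(counts[label].values()) + len(vocab)
        log_prior = math.log(class_docs[label] / len(labels))
        log_like = {f: math.log((counts[label][f] + 1) / total) for f in vocab}
        model[label] = (log_prior, log_like)
    return model


def classify(model, feats):
    """Return the label with the highest log-posterior; unseen features are ignored."""
    def score(label):
        log_prior, log_like = model[label]
        return log_prior + sum(log_like[f] for f in feats if f in log_like)
    return max(sorted(model), key=score)


train_feats = [featurize(tokens) for tokens, _ in DOCS]
labels = [label for _, label in DOCS]
vocab = select_by_document_frequency(train_feats, k=6)
model = train_naive_bayes(train_feats, labels, vocab)

new_doc = [("the", "the", "DET"), ("matches", "match", "NOUN")]
print(classify(model, featurize(new_doc)))
```

Varying `k` in `select_by_document_frequency` is one simple way to probe how sensitive a classifier is to the feature vector size, and toggling `use_lemma` or `KEEP_POS` in `featurize` mirrors the lemmatization and POS filtering comparisons the abstract refers to.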



Author information

Authors and Affiliations

Department of Computer Science and Engineering, Faculty of Applied Sciences, University of West Bohemia, Plzeň, Czech Republic
Michal Hrala & Pavel Král


Corresponding author

Correspondence to Michal Hrala.

Editor information

Editors and Affiliations

Robert Burduk, Department of Systems and Computer, Wroclaw University of Technology, Wybrzeze Wyspianskiego St 27, Wroclaw, 50-370, Poland
Konrad Jackowski, Department of Systems and Computer, Wroclaw University of Technology, Wybrzeze Wyspianskiego St 27, Wroclaw, 50-370, Poland
Marek Kurzynski, Department of Systems and Computer, Wroclaw University of Technology, Wybrzeze Wyspianskiego St 27, Wroclaw, 50-370, Poland
Michał Wozniak, Department of Systems and Computer, Wroclaw University of Technology, Wybrzeze Wyspianskiego St 27, Wroclaw, 50-370, Poland
Andrzej Zolnierek, Department of Systems and Computer, Wroclaw University of Technology, Wybrzeze Wyspianskiego St, Wroclaw, 50-370, Poland


About this paper

Cite this paper

Hrala, M., Král, P. (2013). Evaluation of the Document Classification Approaches. In: Burduk, R., Jackowski, K., Kurzynski, M., Wozniak, M., Zolnierek, A. (eds) Proceedings of the 8th International Conference on Computer Recognition Systems CORES 2013. Advances in Intelligent Systems and Computing, vol 226. Springer, Heidelberg. https://doi.org/10.1007/978-3-319-00969-8_86


DOI: https://doi.org/10.1007/978-3-319-00969-8_86
Publisher Name: Springer, Heidelberg
Print ISBN: 978-3-319-00968-1
Online ISBN: 978-3-319-00969-8
eBook Packages: Engineering, Engineering (R0)
