Confidence Measure for Czech Document Classification

  • Pavel Král
  • Ladislav Lenc
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9042)


This paper deals with automatic document classification in the context of a real application for the Czech News Agency (ČTK). The accuracy of our classifier is high; however, it is still important to improve the classification results. The main goal of this paper is thus to propose novel confidence measure approaches in order to detect and remove incorrectly classified samples. Two of the proposed methods are based on the posterior class probability, and the third is a supervised approach which uses another classifier to determine whether the result is correct. The methods are evaluated on a Czech newspaper corpus. We experimentally show that it is beneficial to integrate the novel approaches into the document classification task because they significantly improve the classification accuracy.
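The idea of a confidence measure built on the posterior class probability can be illustrated with a minimal sketch. The abstract does not say how the two posterior-based measures are defined, so the two variants here are assumptions made for illustration only: an absolute variant (the top posterior must reach a threshold) and a relative variant (the margin between the two best posteriors must reach a threshold). The function names, the class labels and the threshold values are all hypothetical.

```python
from typing import Dict, Optional, Tuple


def top_two(posteriors: Dict[str, float]) -> Tuple[str, float, float]:
    """Return the best label, its posterior, and the runner-up posterior."""
    ranked = sorted(posteriors.items(), key=lambda kv: kv[1], reverse=True)
    best_label, best_p = ranked[0]
    # With a single class there is no runner-up; treat its posterior as 0.
    second_p = ranked[1][1] if len(ranked) > 1 else 0.0
    return best_label, best_p, second_p


def accept_absolute(posteriors: Dict[str, float], threshold: float) -> Optional[str]:
    """Assumed variant 1: accept if the top posterior reaches the threshold."""
    label, best_p, _ = top_two(posteriors)
    return label if best_p >= threshold else None


def accept_relative(posteriors: Dict[str, float], threshold: float) -> Optional[str]:
    """Assumed variant 2: accept if the top-two margin reaches the threshold."""
    label, best_p, second_p = top_two(posteriors)
    return label if (best_p - second_p) >= threshold else None


# Hypothetical posterior distribution produced by some document classifier.
example = {"sport": 0.55, "politics": 0.40, "culture": 0.05}

print(accept_absolute(example, 0.5))  # accepted under this threshold
print(accept_absolute(example, 0.6))  # rejected: returns None
print(accept_relative(example, 0.1))  # accepted under this threshold
print(accept_relative(example, 0.2))  # rejected: returns None
```

A rejected result (`None`) corresponds to a sample removed as likely misclassified. Raising the threshold trades coverage for accuracy on the samples that remain.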


Latent Dirichlet Allocation; Acceptance Threshold; Supervise Approach; Conformal Predictor; Automatic Document
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Pavel Král (1, 2)
  • Ladislav Lenc (1, 2)
  1. Dept. of Computer Science & Engineering, Faculty of Applied Sciences, University of West Bohemia, Plzeň, Czech Republic
  2. NTIS - New Technologies for the Information Society, Faculty of Applied Sciences, University of West Bohemia, Plzeň, Czech Republic
