The Class Imbalance Problem in Construction of Training Datasets for Authorship Attribution

Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 391)

Abstract

The paper presents research on class imbalance in the context of constructing training sets for authorship recognition. In the experiments, the sets are first artificially imbalanced and then rebalanced by under-sampling and over-sampling. The prepared sets are used to train two predictors, one connectionist and one rule-based, and their performance is observed. The tests show that for artificial neural networks the predictive accuracy is in several cases not degraded but in fact improved, whereas the rule-based classifier is highly sensitive to class balance: it never performs better than on the original balanced set, and in many cases performs worse.
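
As a concrete illustration of the balancing step described above, the sketch below shows the two basic strategies named in the abstract: random under-sampling (discarding majority-class samples) and random over-sampling (duplicating minority-class samples). The dataset layout, function names, and toy data are assumptions made for this example and do not reproduce the paper's actual implementation.

```python
import random
from collections import defaultdict

# Hypothetical data layout: a list of (feature_vector, author_label) pairs.

def group_by_class(samples):
    """Group (features, label) pairs by their class label."""
    groups = defaultdict(list)
    for features, label in samples:
        groups[label].append((features, label))
    return groups

def under_sample(samples, rng=random):
    """Shrink every class to the size of the smallest one."""
    groups = group_by_class(samples)
    target = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, target))
    return balanced

def over_sample(samples, rng=random):
    """Grow every class to the size of the largest one by duplication."""
    groups = group_by_class(samples)
    target = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced

if __name__ == "__main__":
    # Toy imbalanced authorship dataset: 80 samples of one author, 20 of another.
    data = [([0.1, 0.2], "author_A")] * 80 + [([0.9, 0.8], "author_B")] * 20
    print(len(under_sample(data)))  # 40 samples: 20 per class
    print(len(over_sample(data)))   # 160 samples: 80 per class
```

In the experimental setup described above, such balancing would be applied to the artificially imbalanced training sets before training the connectionist and rule-based predictors.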

Keywords

Class imbalance · Sampling strategy · Authorship attribution


Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

Institute of Informatics, Silesian University of Technology, Gliwice, Poland