The Class Imbalance Problem in Construction of Training Datasets for Authorship Attribution
The paper presents research on class imbalance in the construction of training sets for authorship recognition. In the experiments, the sets are artificially imbalanced and then rebalanced by under-sampling and over-sampling. The prepared sets are used to train two predictors, one connectionist and one rule-based, and their performance is observed. The tests show that for artificial neural networks the predictive accuracy is, in several cases, not degraded but in fact improved, while one rule classifier is highly sensitive to class balance: it never performs better than on the original balanced set and in many cases performs worse.
Keywords: Class imbalance, Sampling strategy, Authorship attribution
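The two rebalancing strategies named in the abstract can be illustrated with a minimal sketch. It assumes plain random sampling, which the abstract does not confirm as the procedure used in the paper. The helper names (`group_by_class`, `undersample`, `oversample`) and the toy author labels are hypothetical and serve only as illustration.

```python
import random
from collections import Counter, defaultdict


def group_by_class(samples):
    """Group (features, label) pairs by their label."""
    groups = defaultdict(list)
    for features, label in samples:
        groups[label].append((features, label))
    return groups


def undersample(samples, rng):
    """Randomly discard samples until every class matches the smallest one."""
    groups = group_by_class(samples)
    target = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, target))  # sampling without replacement
    return balanced


def oversample(samples, rng):
    """Randomly duplicate samples until every class matches the largest one."""
    groups = group_by_class(samples)
    target = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        # Draw the missing samples with replacement from the same class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


# Toy, artificially imbalanced set: 8 text samples for one author, 3 for another.
# The feature vectors are placeholders, not real stylometric data.
imbalanced = [([i, i + 1], "author_A") for i in range(8)]
imbalanced += [([i, i + 2], "author_B") for i in range(3)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
under = undersample(imbalanced, rng)
over = oversample(imbalanced, rng)

print(Counter(label for _, label in under))
print(Counter(label for _, label in over))
```

Under-sampling discards data from the larger class, while over-sampling repeats samples from the smaller one. This difference is one plausible reason why different classifiers may react differently to the two strategies.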