The Class Imbalance Problem in Construction of Training Datasets for Authorship Attribution

Stańczyk, Urszula

doi:10.1007/978-3-319-23437-3_46

Urszula Stańczyk⁶

Part of the book series: Advances in Intelligent Systems and Computing ((AISC,volume 391))

937 Accesses
9 Citations

Abstract

The paper presents research on class imbalance in the context of construction of training sets for authorship recognition. In experiments the sets are artificially imbalanced, then balanced by under-sampling and over-sampling. The prepared sets are used in learning of two predictors: connectionist and rule-based, and their performance observed. The tests show that for artificial neural networks in several cases the predictive accuracy is not degraded but in fact improved, while one rule classifier is highly sensitive to class balance as it never performs better than for the original balanced set and in many cases worse.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 129.00; Price excludes VAT (USA)

Softcover Book: USD 169.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

Alejo, R., Sotoca, J., Valdovinos, R., Casañ, G.: The Multi-Class Imbalance Problem: Cost Functions with Modular and Non-Modular Neural Networks. In: Wang, H., Shen, Y., Huang, T., Zeng, Z. (eds.) The 6th international symposium on neural networks. AISC, vol. 56, pp. 421–431. Springer, Berlin (2009)
Google Scholar
Baron, G.: Influence of data discretization on efficiency of Bayesian classifier for authorship attribution. Procedia Comput. Sci. 35, 1112–1121 (2014)
Article Google Scholar
Chawla, N., Bowyer, K., Hall, L., Kegelmeyer, W.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)
MATH Google Scholar
Grzymała-Busse, J., Stefanowski, J., Wilk, S.: A Comparison of Two Approaches to Data Mining from Imbalanced Data. In: Negoita, M., Howlett, R., Jain, L. (eds.) Knowledge-based intelligent information and engineering systems. LNCS, vol. 3213, pp. 757–763. Springer, Berlin (2004)
Chapter Google Scholar
He, H., Garcia, E.: Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21(9), 1263–1284 (2009)
Article Google Scholar
Holte, R.: Very simple classification rules perform well on most commonly used datasets. Mach. Learn. 11, 63–91 (1993)
Article MATH Google Scholar
Jockers, M., Witten, D.: A comparative study of machine learning methods for authorship attribution. Literary Linguist. Comput. 25(2), 215–223 (2010)
Article Google Scholar
Stamatatos, E.: Author identification: Using text sampling to handle the class imbalance problem. Inf. Process. Manage. 44, 790–799 (2008)
Article Google Scholar
Stamatatos, E.: A survey of modern authorship attribution methods. J. Am. Soc. Inf. Sci. Technol. 60(3), 538–556 (2009)
Article Google Scholar
Stańczyk, U.: Dominance-Based Rough Set Approach Employed in Search of Authorial Invariants. In: Kurzyński, M., Woźniak, M. (eds.) Computer recognition systems 3. AISC, vol. 57, pp. 315–323. Springer, Berlin (2009)
Google Scholar
Stańczyk, U.: Application of DRSA-ANN Classifier in Computational Stylistics. In: Kryszkiewicz, M., Rybiński, H., Skowron, A., Raś, Z. (eds.) Foundations of intelligent systems. LNAI, vol. 6804, pp. 695–704. Springer, Berlin (2011)
Chapter Google Scholar
Stańczyk, U.: Rule-based approach to computational stylistics. In: Bouvry, P., Kłopotek, M., Marciniak, M., Mykowiecka, A., Rybiński, H. (eds.) Security and intelligent information systems. LNCS, vol. 7053, pp. 168–179. Springer, Berlin (2012)
Chapter Google Scholar
Stańczyk, U.: Ranking of characteristic features in combined wrapper approaches to selection. Neural Comput. Appl. 26(2), 329–344 (2015)
Article Google Scholar

Download references

Acknowledgments

The research described was performed within the project BK/RAu2/2015 at the Institute of Informatics, Silesian University of Technology, Gliwice, Poland.

Author information

Authors and Affiliations

Institute of Informatics, Silesian University of Technology, Gliwice, Poland
Urszula Stańczyk

Authors

Urszula Stańczyk
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Urszula Stańczyk .

Editor information

Editors and Affiliations

Institute of Informatics, Silesian University of Technology, Gliwice, Poland
Aleksandra Gruca
Institute of Informatics, Silesian University of Technology, Gliwice, Poland
Agnieszka Brachman
Institute of Informatics, Silesian University of Technology, Gliwice, Poland
Stanisław Kozielski
Institute of Informatics, Silesian University of Technology, Gliwice, Poland
Tadeusz Czachórski

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Stańczyk, U. (2016). The Class Imbalance Problem in Construction of Training Datasets for Authorship Attribution. In: Gruca, A., Brachman, A., Kozielski, S., Czachórski, T. (eds) Man–Machine Interactions 4. Advances in Intelligent Systems and Computing, vol 391. Springer, Cham. https://doi.org/10.1007/978-3-319-23437-3_46

Download citation

DOI: https://doi.org/10.1007/978-3-319-23437-3_46
Published: 09 September 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-23436-6
Online ISBN: 978-3-319-23437-3
eBook Packages: EngineeringEngineering (R0)

Publish with us

Policies and ethics