The Class Imbalance Problem in Construction of Training Datasets for Authorship Attribution

  • Conference paper
Man–Machine Interactions 4

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 391)

Abstract

The paper presents research on class imbalance in the construction of training sets for authorship recognition. In the experiments, the sets are artificially imbalanced and then balanced by under-sampling and over-sampling. The prepared sets are used to train two predictors, one connectionist and one rule-based, and their performance is observed. The tests show that for artificial neural networks the predictive accuracy is in several cases not degraded but in fact improved, while one rule classifier is highly sensitive to class balance: it never performs better than for the original balanced set, and in many cases it performs worse.
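The two balancing strategies named in the abstract can be illustrated with a minimal sketch. It assumes plain random resampling, which may differ from the procedure actually used in the paper; the function names `random_undersample` and `random_oversample` and the toy data are hypothetical and serve only as illustration.

```python
import random
from collections import Counter


def _group_by_class(samples, labels):
    """Collect samples into a dict keyed by class label."""
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    return by_class


def random_undersample(samples, labels, seed=0):
    """Randomly keep only as many samples per class as the smallest class has."""
    rng = random.Random(seed)
    by_class = _group_by_class(samples, labels)
    target = min(len(xs) for xs in by_class.values())
    balanced = []
    for y, xs in by_class.items():
        # rng.sample draws without replacement, so no sample is repeated.
        balanced.extend((x, y) for x in rng.sample(xs, target))
    return balanced


def random_oversample(samples, labels, seed=0):
    """Randomly repeat samples of smaller classes until all match the largest class."""
    rng = random.Random(seed)
    by_class = _group_by_class(samples, labels)
    target = max(len(xs) for xs in by_class.values())
    balanced = []
    for y, xs in by_class.items():
        # Duplicates are drawn with replacement from the same class.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        balanced.extend((x, y) for x in xs + extra)
    return balanced


# Hypothetical imbalanced set: 7 text samples by author "A", 3 by author "B".
samples = list(range(10))
labels = ["A"] * 7 + ["B"] * 3

under = random_undersample(samples, labels)
over = random_oversample(samples, labels)
under_counts = Counter(y for _, y in under)
over_counts = Counter(y for _, y in over)
```

Under-sampling discards data from the larger class, whereas over-sampling repeats data from the smaller one, so the two balanced sets differ in size even though both end up with equal class counts.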

Acknowledgments

The research described was performed within the project BK/RAu2/2015 at the Institute of Informatics, Silesian University of Technology, Gliwice, Poland.

Author information

Corresponding author

Correspondence to Urszula Stańczyk.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Stańczyk, U. (2016). The Class Imbalance Problem in Construction of Training Datasets for Authorship Attribution. In: Gruca, A., Brachman, A., Kozielski, S., Czachórski, T. (eds) Man–Machine Interactions 4. Advances in Intelligent Systems and Computing, vol 391. Springer, Cham. https://doi.org/10.1007/978-3-319-23437-3_46

  • DOI: https://doi.org/10.1007/978-3-319-23437-3_46

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-23436-6

  • Online ISBN: 978-3-319-23437-3

  • eBook Packages: Engineering, Engineering (R0)
