Investigation of Text Attribution Methods Based on Frequency Author Profile

  • Polina Diurdeva
  • Elena Mikhailova
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 838)

Abstract

Determining the author of a text is a problem that has engaged researchers since the last century. With the growth of social networks and platforms for publishing web posts and articles on the Internet, the task of identifying authorship has become even more pressing. Specialists in journalism and law are particularly interested in more accurate approaches for resolving disputes over texts of doubtful authorship. In this article, the authors compare the applicability of eight modern machine learning algorithms (Support Vector Machine, Naive Bayes, Logistic Regression, K-Nearest Neighbors, Decision Tree, Random Forest, Multilayer Perceptron, and Gradient Boosting Classifier) to the classification of a collection of Russian web posts. The best results were achieved with Logistic Regression, Multilayer Perceptron, and Support Vector Machine with a linear kernel, using a combination of part-of-speech and word n-grams as features.
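
The idea of word n-gram features and a per-author frequency profile can be illustrated with a minimal sketch. It uses only the Python standard library and a simple nearest-profile rule based on cosine similarity, which is a stand-in and not one of the eight classifiers compared in the paper. The function names, the whitespace tokenizer, and the toy English texts are hypothetical and chosen only for illustration.

```python
from collections import Counter
import math


def word_ngrams(text, n_max=2):
    """Return all word n-grams of length 1..n_max as space-joined strings."""
    tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def frequency_profile(texts, n_max=2):
    """Relative n-gram frequencies pooled over all texts of one author."""
    counts = Counter()
    for text in texts:
        counts.update(word_ngrams(text, n_max))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def cosine(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(v * q.get(k, 0.0) for k, v in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def attribute(text, profiles, n_max=2):
    """Pick the author whose profile is most similar to the text's profile."""
    query = frequency_profile([text], n_max)
    return max(profiles, key=lambda author: cosine(query, profiles[author]))


# Toy corpus (hypothetical data, not from the paper).
corpus = {
    "author_a": ["the cat sat on the mat", "the cat likes the mat"],
    "author_b": ["stock prices fell sharply today", "prices fell again today"],
}
profiles = {author: frequency_profile(texts) for author, texts in corpus.items()}
predicted = attribute("the cat sat again", profiles)
```

On this toy corpus the query shares several unigrams and bigrams with the first author and only one unigram with the second, so `predicted` is `"author_a"`. A real experiment would replace the whitespace tokenizer with proper tokenization, add part-of-speech n-grams from a morphological tagger, and feed the resulting feature vectors to a trained classifier.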

Keywords

Author attribution, Text classification, Frequency author profile

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Saint Petersburg State University, Saint Petersburg, Russia
  2. ITMO University, Saint Petersburg, Russia
