Authorship Attribution Using Stylometry and Machine Learning Techniques

  • Hoshiladevi Ramnial
  • Shireen Panchoo
  • Sameerchand Pudaruth
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 384)

Abstract

Plagiarism is considered a highly unethical activity in the academic world. Text alignment is currently the preferred technique for estimating the degree of similarity with existing written works. Because it depends on comparison against other documents, it becomes increasingly tedious and time-consuming to scale as the number of online and offline documents grows. This paper therefore studies the use of stylometric features present in a document to verify its authorship. Two machine learning algorithms, k-NN and SMO, were used to predict the authenticity of the writings. A computer program that computes 446 features was implemented. Ten PhD theses, split into segments of 1000, 5000 and 10000 words, formed our corpus of 520 documents. Our results show that stylometry-based authorship attribution achieved an accuracy above 90 %, except for 7-NN with 1000-word segments. We also showed how authorship attribution can be used to identify potential cases of plagiarism in formal writings.
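The pipeline described in the abstract (segment a text, compute style features, classify with k-NN) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the short function-word list, and the handful of features are hypothetical and are not the 446 features used in the paper, and SMO is not shown.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny function-word list chosen for illustration only;
# the paper's actual 446 features are not listed in the abstract.
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "is", "it"]


def split_into_segments(text, words_per_segment):
    """Split a text into consecutive segments of a fixed number of words."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_segment])
        for i in range(0, len(words), words_per_segment)
    ]


def stylometric_features(text):
    """Return a small feature vector that describes style rather than topic."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    total = max(len(words), 1)
    counts = Counter(words)
    features = [
        sum(len(w) for w in words) / total,   # average word length
        len(words) / max(len(sentences), 1),  # average sentence length in words
        len(counts) / total,                  # type-token ratio
    ]
    # Relative frequency of each function word.
    features.extend(counts[w] / total for w in FUNCTION_WORDS)
    return features


def knn_predict(train_vectors, train_labels, query, k=3):
    """Predict an author label by majority vote among the k nearest neighbours."""
    distances = sorted(
        (math.dist(vec, query), label)
        for vec, label in zip(train_vectors, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]
```

In such a setup, each thesis would be passed through `split_into_segments` with 1000, 5000 or 10000 words per segment, each segment turned into a vector by `stylometric_features`, and an unseen segment assigned to an author by `knn_predict`. The features here are left unscaled for brevity; a practical system would normally normalise them so that no single feature dominates the distance.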

Keywords

Plagiarism; Authorship verification and attribution; Stylometry; K-NN; SMO; Content analysis

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Hoshiladevi Ramnial (1)
  • Shireen Panchoo (1)
  • Sameerchand Pudaruth (2)
  1. School of Innovative Technologies and Engineering, University of Technology, Port Louis, Mauritius
  2. Department of Ocean Engineering and ICT, Faculty of Ocean Studies, University of Mauritius, Moka, Mauritius
