A Scalable Framework for Stylometric Analysis of Multi-author Documents

  • Raheem Sarwar
  • Chenyun Yu
  • Sarana Nutanong
  • Norawit Urailertprasert
  • Nattapol Vannaboot
  • Thanawin Rakthanmanon
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10827)

Abstract

Stylometry is a statistical technique used to analyze variations in authors' writing styles and is typically applied to authorship attribution problems. In this investigation, we apply stylometry to the task of authorship identification of multi-author documents (AIMD). We propose an AIMD technique called Co-Authorship Graph (CAG), which can be used to collaboratively attribute different portions of documents to different authors belonging to the same community. Based on CAG, we propose a novel AIMD solution which (i) significantly outperforms the existing state-of-the-art solution; (ii) can effectively handle a larger number of co-authors; and (iii) is capable of handling the case when some of the listed co-authors have not contributed to the document as writers. We conducted an extensive experimental study comparing the proposed solution with the best existing AIMD method on real and synthetic datasets. The results show that the proposed solution significantly outperforms the existing state-of-the-art method.

Keywords

Stylometry; Authorship identification; Co-Authorship Graph; Multi-author documents

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Raheem Sarwar (1)
  • Chenyun Yu (1)
  • Sarana Nutanong (1)
  • Norawit Urailertprasert (2)
  • Nattapol Vannaboot (2)
  • Thanawin Rakthanmanon (2, 3)

  1. Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR, China
  2. Department of Computer Engineering, Kasetsart University, Bangkok, Thailand
  3. Vidyasirimedhi Institute of Science and Technology, Rayong, Thailand
