Breaking the Closed-World Assumption in Stylometric Authorship Attribution

  • Ariel Stolerman
  • Rebekah Overdorf
  • Sadia Afroz
  • Rachel Greenstadt
Part of the IFIP Advances in Information and Communication Technology book series (IFIPAICT, volume 433)

Abstract

Stylometry is a form of authorship attribution that relies on the linguistic information found in a document. While there has been significant work in stylometry, most research focuses on the closed-world problem, where the author of the document is in a known suspect set. For open-world problems, where the author may not be in the suspect set, traditional classification methods are ineffective. This paper proposes the “classify-verify” method, which augments classification with a binary verification step, and evaluates it on stylometric datasets. This method, which can be generalized to any domain, significantly outperforms traditional classifiers in open-world settings and yields an F1-score of 0.87, comparable to traditional classifiers in closed-world settings. Moreover, the method successfully detects adversarial documents in which authors deliberately change their styles, a problem on which closed-world classifiers fail.
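
The two-step idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the features, the classifier, or the verifier, so the character-trigram features, the cosine-distance nearest-profile classifier, the fixed distance threshold, and all names below (`char_trigrams`, `classify_verify`, `UNKNOWN`) are assumptions chosen only to keep the example self-contained.

```python
import math
from collections import Counter

UNKNOWN = "unknown"  # hypothetical label for "author is not in the suspect set"


def char_trigrams(text):
    """Toy stylometric feature (an assumption): relative character-trigram frequencies."""
    text = text.lower()
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    total = sum(counts.values()) or 1
    return {gram: n / total for gram, n in counts.items()}


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse feature dictionaries."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # an empty vector is treated as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def classify_verify(document, profiles, threshold):
    """Classify among known suspects, then verify the chosen suspect.

    profiles maps each suspect's name to a feature dictionary.
    threshold is an assumed cut-off on the distance to the chosen profile.
    """
    features = char_trigrams(document)
    # Step 1 (classify): closed-world decision, always returns some suspect.
    best = min(profiles, key=lambda name: cosine_distance(features, profiles[name]))
    # Step 2 (verify): binary accept/reject of that decision.
    if cosine_distance(features, profiles[best]) <= threshold:
        return best
    return UNKNOWN


# Usage with two hypothetical suspects and made-up training text.
profiles = {
    "alice": char_trigrams("the quick brown fox jumps over the lazy dog"),
    "bob": char_trigrams("colorless green ideas sleep furiously tonight"),
}
print(classify_verify("the quick brown fox jumps over the lazy dog", profiles, 0.5))
print(classify_verify("12345 67890 12345", profiles, 0.5))
```

In this sketch the first call is accepted as "alice" because the document matches her profile exactly, while the second shares no trigrams with either profile and is rejected as unknown. A closed-world classifier alone would have been forced to name one of the two suspects in the second case, which is the failure the verification step is meant to catch.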

Keywords

Forensic stylometry, authorship attribution, authorship verification

Copyright information

© IFIP International Federation for Information Processing 2014

Authors and Affiliations

  • Ariel Stolerman (1)
  • Rebekah Overdorf (1)
  • Sadia Afroz (2)
  • Rachel Greenstadt (1)

  1. Drexel University, Philadelphia, USA
  2. Computer Science Division, University of California at Berkeley, Berkeley, USA
