Who Wrote the Web? Revisiting Influential Author Identification Research Applicable to Information Retrieval

  • Martin Potthast
  • Sarah Braun
  • Tolga Buz
  • Fabian Duffhauss
  • Florian Friedrich
  • Jörg Marvin Gülzow
  • Jakob Köhler
  • Winfried Lötzsch
  • Fabian Müller
  • Maike Elisa Müller
  • Robert Paßmann
  • Bernhard Reinke
  • Lucas Rettenmeier
  • Thomas Rometsch
  • Timo Sommer
  • Michael Träger
  • Sebastian Wilhelm
  • Benno Stein
  • Efstathios Stamatatos
  • Matthias Hagen
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9626)

Abstract

In this paper, we revisit author identification research by conducting a new kind of large-scale reproducibility study: we select 15 of the most influential papers for author identification and recruit a group of students to reimplement them from scratch. Since no open source implementations have been released for the selected papers to date, our public release will have a significant impact on researchers entering the field. This way, we lay the groundwork for integrating author identification with information retrieval to eventually scale the former to the web. Furthermore, we assess the reproducibility of all reimplemented papers in detail, and conduct the first comparative evaluation of all approaches on three well-known corpora.

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Martin Potthast (1)
  • Sarah Braun (2)
  • Tolga Buz (3)
  • Fabian Duffhauss (4)
  • Florian Friedrich (5)
  • Jörg Marvin Gülzow (6)
  • Jakob Köhler (7)
  • Winfried Lötzsch (8)
  • Fabian Müller (9)
  • Maike Elisa Müller (3)
  • Robert Paßmann (10)
  • Bernhard Reinke (10)
  • Lucas Rettenmeier (5)
  • Thomas Rometsch (11)
  • Timo Sommer (12)
  • Michael Träger (13)
  • Sebastian Wilhelm (2)
  • Benno Stein (1)
  • Efstathios Stamatatos (14)
  • Matthias Hagen (1)

  1. Bauhaus-Universität Weimar, Weimar, Germany
  2. Technische Universität München, Munich, Germany
  3. Technical University of Berlin, Berlin, Germany
  4. RWTH Aachen University, Aachen, Germany
  5. Heidelberg University, Heidelberg, Germany
  6. University of Konstanz, Konstanz, Germany
  7. Free University of Berlin, Berlin, Germany
  8. Chemnitz University of Technology, Chemnitz, Germany
  9. Karlsruhe University of Applied Sciences, Karlsruhe, Germany
  10. University of Bonn, Bonn, Germany
  11. University of Michigan, Ann Arbor, USA
  12. Hamburg University of Technology, Hamburg, Germany
  13. University of Bamberg, Bamberg, Germany
  14. University of the Aegean, Mytilene, Greece