Who Wrote the Web? Revisiting Influential Author Identification Research Applicable to Information Retrieval

  • Conference paper
Advances in Information Retrieval (ECIR 2016)

Abstract

In this paper, we revisit author identification research by conducting a new kind of large-scale reproducibility study: we select 15 of the most influential papers for author identification and recruit a group of students to reimplement them from scratch. Since no open source implementations have been released for the selected papers to date, our public release will have a significant impact on researchers entering the field. This way, we lay the groundwork for integrating author identification with information retrieval to eventually scale the former to the web. Furthermore, we assess the reproducibility of all reimplemented papers in detail, and conduct the first comparative evaluation of all approaches on three well-known corpora.

Notes

  1. Interestingly, Collberg et al.’s study itself has been challenged for lack of rigor and has been reproduced more thoroughly: http://cs.brown.edu/~sk/Memos/Examining-Reproducibility/.

  2. Materials and code of this study are available at www.uni-weimar.de/medien/webis/publications, and the latest versions of the code are in its GitHub repositories at www.github.com/pan-webis-de (for a convenient overview, see www.github.com/search?q=ECIR+2016+user:pan-webis-de).

  3. See the repository of the reimplementation of Seroussi et al.’s approach to follow up on this.

References

  1. Argamon, S., Juola, P.: Overview of the international authorship identification competition at PAN-2011. In: CLEF 2011 Notebooks (2011)

  2. Arguello, J., Diaz, F., Lin, J., Trotman, A.: RIGOR @ SIGIR (2015)

  3. Armstrong, T.G., Moffat, A., Webber, W., Zobel, J.: Improvements that don’t add up: ad-hoc retrieval results since 1998. In: CIKM 2009, pp. 601–610 (2009)

  4. Arun, R., Suresh, V., Veni Madhavan, C.E.: Stopword graphs and authorship attribution in text corpora. In: ICSC, pp. 192–196 (2009)

  5. Benedetto, D., Caglioti, E., Loreto, V.: Language trees and zipping. Phys. Rev. Lett. 88, 048702 (2002)

  6. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)

  7. Burrows, J.: Delta: a measure of stylistic difference and a guide to likely authorship. Lit. Ling. Comp. 17(3), 267–287 (2002)

  8. Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines. ACM TIST 2, 27:1–27:27 (2011)

  9. Collberg, C., Proebsting, T., Warren, A.M.: Repeatability and benefaction in computer systems research: a study and a modest proposal. TR 14–04, University of Arizona (2015)

  10. de Vel, O., Anderson, A., Corney, M., Mohay, G.: Mining e-mail content for author identification forensics. SIGMOD Rec. 30(4), 55–64 (2001)

  11. Di Buccio, E., Di Nunzio, G.M., Ferro, N., Harman, D., Maistro, M., Silvello, G.: Unfolding off-the-shelf IR systems for reproducibility. In: RIGOR @ SIGIR (2015)

  12. Escalante, H.J., Solorio, T., Montes-y-Gómez, M.: Local histograms of character n-grams for authorship attribution. In: HLT 2011, pp. 288–298 (2011)

  13. Ferro, N., Silvello, G.: Rank-biased precision reloaded: reproducibility and generalization. In: Hanbury, A., Kazai, G., Rauber, A., Fuhr, N. (eds.) ECIR 2015. LNCS, vol. 9022, pp. 768–780. Springer, Heidelberg (2015)

  14. Gamon, M.: Linguistic correlates of style: authorship classification with deep linguistic analysis features. In: COLING (2004)

  15. Hagen, M., Potthast, M., Büchner, M., Stein, B.: Twitter sentiment detection via ensemble classification using averaged confidence scores. In: Hanbury, A., Kazai, G., Rauber, A., Fuhr, N. (eds.) ECIR 2015. LNCS, vol. 9022, pp. 741–754. Springer, Heidelberg (2015)

  16. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. SIGKDD Explor. 11(1), 10–18 (2009)

  17. Hanbury, A., Kazai, G., Rauber, A., Fuhr, N.: Proceedings of ECIR (2015)

  18. Holmes, D.I.: The evolution of stylometry in humanities scholarship. Lit. Ling. Comp. 13(3), 111–117 (1998)

  19. Hopfgartner, F., Hanbury, A., Müller, H., Kando, N., Mercer, S., Kalpathy-Cramer, J., Potthast, M., Gollub, T., Krithara, A., Lin, J., Balog, K., Eggel, I.: Report on the Evaluation-as-a-Service (EaaS) expert workshop. SIGIR Forum 49(1), 57–65 (2015)

  20. Juola, P.: Authorship attribution. FnTIR 1, 234–334 (2008)

  21. Juola, P.: An overview of the traditional authorship attribution subtask. In: CLEF Notebooks (2012)

  22. Keselj, V., Peng, F., Cercone, N., Thomas, C.: N-gram-based author profiles for authorship attribution. In: PACLING 2003, pp. 255–264 (2003)

  23. Khmelev, D.V., Teahan, W.J.: A repetition based measure for verification of text collections and for text categorization. In: SIGIR 2003, pp. 104–110 (2003)

  24. Koppel, M., Schler, J., Bonchek-Dokow, E.: Measuring differentiability: unmasking pseudonymous authors. J. Mach. Learn. Res. 8, 1261–1276 (2007)

  25. Koppel, M., Schler, J., Argamon, S.: Authorship attribution in the wild. LRE 45(1), 83–94 (2011)

  26. Lin, J.: The open-source information retrieval reproducibility challenge. In: RIGOR @ SIGIR (2015)

  27. Mendenhall, T.C.: The characteristic curves of composition. Science ns–9(214S), 237–246 (1887)

  28. Ounis, I., Amati, G., Plachouras, V., He, B., Macdonald, C., Lioma, C.: Terrier: a high performance and scalable information retrieval platform. In: OCIR @ SIGIR (2006)

  29. Peng, F., Schuurmans, D., Wang, S.: Augmenting naive Bayes classifiers with statistical language models. Inf. Retr. 7(3–4), 317–345 (2004)

  30. Rangel, F., Rosso, P., Celli, F., Potthast, M., Stein, B., Daelemans, W.: Overview of the 3rd author profiling task at PAN 2015. In: CLEF 2015 Notebooks (2015)

  31. Rudman, J.: The state of authorship attribution studies: some problems and solutions. Comput. Humanit. 31(4), 351–365 (1997)

  32. Seroussi, Y., Bohnert, F., Zukerman, I.: Authorship attribution with author-aware topic models. In: ACL 2012, pp. 264–269 (2012)

  33. Sidorov, G., Velasquez, F., Stamatatos, E., Gelbukh, A., Chanona-Hernández, L.: Syntactic n-grams as machine learning features for natural language processing. Expert Syst. Appl. 41(3), 853–860 (2014)

  34. Stamatatos, E.: Authorship attribution based on feature set subspacing ensembles. Int. J. Artif. Intell. Tools 15(5), 823–838 (2006)

  35. Stamatatos, E.: Author identification using imbalanced and limited training texts. In: DEXA 2007, pp. 237–241 (2007)

  36. Stamatatos, E.: A survey of modern authorship attribution methods. JASIST 60, 538–556 (2009)

  37. Stamatatos, E., Fakotakis, N., Kokkinakis, G.: Automatic text categorization in terms of genre and author. Comput. Linguist. 26(4), 471–495 (2000)

  38. Stamatatos, E., Daelemans, W., Verhoeven, B., Stein, B., Potthast, M., Juola, P., Sánchez-Pérez, M.A., Barrón-Cedeño, A.: Overview of the author identification task at PAN 2014. In: CLEF 2014 Notebooks (2014)

  39. Stodden, V.: The scientific method in practice: reproducibility in the computational sciences. MIT Sloan Research Paper No. 4773–10 (2010)

  40. Tax, N., Bockting, S., Hiemstra, D.: A cross-benchmark comparison of 87 learning to rank methods. IPM 51(6), 757–772 (2015)

  41. Teahan, W.J., Harper, D.J.: Using compression-based language models for text categorization. In: Language Modeling for Information Retrieval, pp. 141–165 (2003)

  42. van Halteren, H.: Linguistic profiling for author recognition and verification. In: ACL 2004, pp. 199–206 (2004)

  43. Zheng, R., Li, J., Chen, H., Huang, Z.: A framework for authorship identification of online messages: writing-style features and classification techniques. JASIST 57(3), 378–393 (2006)

Acknowledgements

This study was supported by the German National Academic Foundation (German: Studienstiftung des deutschen Volkes). The foundation helped to recruit students among its scholars and organized our auditing workshop as part of its 2015 summer academy in La Colle-sur-Loup, France. We thank the foundation for their generous support. Our special thanks go to Dorothea Trebesius, Matthias Frenz, and Martina Rothmann-Stang who provided for our every need at the workshop.

Author information

Corresponding author

Correspondence to Martin Potthast.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Potthast, M. et al. (2016). Who Wrote the Web? Revisiting Influential Author Identification Research Applicable to Information Retrieval. In: Ferro, N., et al. Advances in Information Retrieval. ECIR 2016. Lecture Notes in Computer Science, vol 9626. Springer, Cham. https://doi.org/10.1007/978-3-319-30671-1_29

  • DOI: https://doi.org/10.1007/978-3-319-30671-1_29

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-30670-4

  • Online ISBN: 978-3-319-30671-1

  • eBook Packages: Computer Science, Computer Science (R0)
