Recent Trends in Digital Text Forensics and Its Evaluation

Plagiarism Detection, Author Identification, and Author Profiling
  • Tim Gollub
  • Martin Potthast
  • Anna Beyer
  • Matthias Busse
  • Francisco Rangel
  • Paolo Rosso
  • Efstathios Stamatatos
  • Benno Stein
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8138)

Abstract

This paper outlines the concepts and achievements of our evaluation lab on digital text forensics, PAN 13, which called for original research and development on plagiarism detection, author identification, and author profiling. We present a standardized evaluation framework for each of the three tasks and discuss the evaluation results of the altogether 58 submitted contributions. For the first time, instead of accepting the output of software runs, we collected the softwares themselves and run them on a computer cluster at our site. As evaluation and experimentation platform we use TIRA, which is being developed at the Webis Group in Weimar. TIRA can handle large-scale software submissions by means of virtualization, sandboxed execution, tailored unit testing, and staged submission. In addition to the achieved evaluation results, a major achievement of our lab is that we now have the largest collection of state-of-the-art approaches with regard to the mentioned tasks for further analysis at our disposal.

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. 1.
    Aleman, Y., Loya, N., Vilarino Ayala, D., Pinto, D.: Two Methodologies Applied to the Author Profiling Task—Notebook for PAN at CLEF 2013. In: Forner, et al. (eds.) [15]Google Scholar
  2. 2.
    Argamon, S., Juola, P.: Overview of the International Authorship Identification Competition at PAN-2011. In: Proc. of CLEF 2011 (2011)Google Scholar
  3. 3.
    Argamon, S., Koppel, M., Fine, J., Shimoni, A.R.: Gender, Genre, and Writing Style in Formal Written Texts. TEXT 23, 321–346 (2003)Google Scholar
  4. 4.
    Argamon, S., Koppel, M., Pennebaker, J.W., Schler, J.: Automatically Profiling the Author of an Anonymous Text. Commun. ACM 52(2), 119–123 (2009)CrossRefGoogle Scholar
  5. 5.
    Armstrong, T.G., Moffat, A., Webber, W., Zobel, J.: EvaluatIR: An Online Tool for Evaluating and Comparing IR Systems. In: Proc. of SIGIR 2009 (2009)Google Scholar
  6. 6.
    Blockeel, H., Vanschoren, J.: Experiment Databases: Towards an Improved Experimental Methodology in Machine Learning. In: Kok, J.N., Koronacki, J., Lopez de Mantaras, R., Matwin, S., Mladenič, D., Skowron, A. (eds.) PKDD 2007. LNCS (LNAI), vol. 4702, pp. 6–17. Springer, Heidelberg (2007)CrossRefGoogle Scholar
  7. 7.
    Burger, J.D., Henderson, J., Kim, G., Zarrella, G.: Discriminating Gender on Twitter. In: Proc. EMNLP 2011 (2011)Google Scholar
  8. 8.
    Clough, P., Stevenson, M.: Developing a Corpus of Plagiarised Short Answers. Lang. Resour. Eval. 45, 5–24 (2011)CrossRefGoogle Scholar
  9. 9.
    Clough, P., Gaizauskas, R., Piao, S.S.L., Wilks, Y.: METER: MEasuring TExt Reuse. In: Proc. ACL 2002 (2002)Google Scholar
  10. 10.
    De Roure, D., Goble, C., Stevens, R.: The Design and Realisation of the myExperiment Virtual Research Environment for Social Sharing of Workflows. Future Gener. Comp. Sy. 25, 561–567 (2009)CrossRefGoogle Scholar
  11. 11.
    Caurcel Diaz, A.A., Gomez Hidalgo, J.M.: Experiments with SMS Translation and Stochastic Gradient Descent in Spanish Text Author Profiling—Notebook for PAN at CLEF 2013. In: Forner, et al. (eds.) [15]Google Scholar
  12. 12.
    Downie, J.S.: The Music Information Retrieval Evaluation Exchange (2005–2007): A Window into Music Information Retrieval Research. Acoust. Sc. and Tech. 29(4), 247–255 (2008)CrossRefGoogle Scholar
  13. 13.
    Hernandez Farias, D.I., Guzman-Cabrera, R., Reyes, A., Rocha, M.A.: Semantic-based Features for Author Profiling Identification: First Insights—Notebook for PAN at CLEF 2013. In: Forner, et al. (eds.) [15] Google Scholar
  14. 14.
    Flekova, L., Gurevych, I.: Can We Hide in the Web? Large Scale Simultaneous Age and Gender Author Profiling in Social Media–Notebook for PAN at CLEF 2013. In: Forner, et al. (eds.) [15]Google Scholar
  15. 15.
    Forner, P., Navigli, R., Tufis, D. (eds.): CLEF 2013 Evaluation Labs and Workshop – Working Notes Papers (2013)Google Scholar
  16. 16.
    Gillam, L.: Readability for author profiling?—Notebook for PAN at CLEF 2013. In: Forner, et al. (eds.) [15] Google Scholar
  17. 17.
    Gollub, T., Burrows, S., Stein, B.: First Experiences with TIRA for Reproducible Evaluation in Information Retrieval. In: Proc. of OSIR at SIGIR 2012 (August 2012)Google Scholar
  18. 18.
    Gollub, T., Stein, B., Burrows, S.: Ousting Ivory Tower Research: Towards a Web Framework for Providing Experiments as a Service. In: Proc. of SIGIR 2012 (2012)Google Scholar
  19. 19.
    Gollub, T., Stein, B., Burrows, S., Hoppe, D.: TIRA: Configuring, Executing, and Disseminating Information Retrieval Experiments. In: Proc. of TIR at DEXA 2012. IEEE (2012)Google Scholar
  20. 20.
    Goswami, S., Sarkar, S., Rustagi, M.: Stylometric Analysis of Bloggers’ Age and Gender. In: Proc. of ICWSM 2009 (2009)Google Scholar
  21. 21.
    Haggag, O., El-Beltagy, S.: Plagiarism Candidate Retrieval Using Selective Query Formulation and Discriminative Query Scoring—Notebook for PAN at CLEF 2013. In: Forner, et al. (eds.) [15] Google Scholar
  22. 22.
    Holmes, J., Meyerhoff, M.: The Handbook of Language and Gender. Blackwell Handbooks in Linguistics. Wiley (2003)Google Scholar
  23. 23.
    Inches, G., Crestani, F.: Overview of the International Sexual Predator Identification Competition at PAN-2012. In: Proc. of CLEF 2012 (2012)Google Scholar
  24. 24.
    Juola, P.: Authorship Attribution. Found. and Trends in IR 1, 234–334 (2008)Google Scholar
  25. 25.
    Juola, P.: Ad-hoc Authorship Attribution Competition. In: Proc. of ALLC 2004 (2004)Google Scholar
  26. 26.
    Juola, P.: An Overview of the Traditional Authorship Attribution Subtask. In: Proc. of CLEF 2012 (2012)Google Scholar
  27. 27.
    Koppel, M., Winter, Y.: Determining if Two Documents are by the Same Author. Journal of the American Society for Information Science and Technology (to appear)Google Scholar
  28. 28.
    Koppel, M., Argamon, S., Shimoni, A.R.: Automatically Categorizing Written Texts by Author Gender. Literary and Linguistic Computing 17(4), 401–412 (2002)CrossRefGoogle Scholar
  29. 29.
    Koppel, M., Schler, J., Bonchek-Dokow, E.: Measuring Differentiability: Unmasking Pseudonymous Authors. Journal of Machine Learning Research 8, 1261–1276 (2007)MATHGoogle Scholar
  30. 30.
    Koppel, M., Schler, J., Argamon, S.: Authorship Attribution in the Wild. Language Resources and Evaluation 45, 83–94 (2011)CrossRefGoogle Scholar
  31. 31.
    Kong, L., Qi, H., Du, C., Wang, M., Han, Z.: Approaches for Source Retrieval and Text Alignment of Plagiarism Detection—Notebook for PAN at CLEF 2013. In: Forner, et al. (eds.) [15] Google Scholar
  32. 32.
    Lim, W.Y., Goh, J., Thing, V.L.L.: Content-centric age and gender profiling—Notebook for PAN at CLEF 2013. In: Forner, et al. (eds.) [15] Google Scholar
  33. 33.
    Pastor Lopez-Monroy, A., Montes-Y-Gomez, M., Jair Escalante, H., Villasenor-Pineda, L., Villatoro-Tello, E.: INAOE’s participation at PAN’13: Author Profiling task—Notebook for PAN at CLEF 2013. In: Forner, et al. (eds.) [15]Google Scholar
  34. 34.
    Meina, M., Brodzinska, K., Celmer, B., Czokow, M., Patera, M., Pezacki, J., Wilk, M.: Ensemble-based Classification for Author Profiling using Various Features—Notebook for PAN at CLEF 2013. In: Forner, et al. (eds.) [15]Google Scholar
  35. 35.
    Nguyen, D., Gravel, R., Trieschnigg, D., Meder, T.: “How Old Do You Think I Am?”; A Study of Language and Age in Twitter. In: Proc. of ICWSM 2013 (2013)Google Scholar
  36. 36.
    Nguyen, D., Smith, N.A., Rosé, C.P.: Author Age Prediction from Text Using Linear Regression. In: Proc. of LaTeCH at ACL-HLTGoogle Scholar
  37. 37.
    Gopal Patra, B., Banerjee, S., Das, D., Saikh, T., Bandyopadhyay, S.: Automatic Author Profiling Based on Linguistic and Stylistic Features—Notebook for PAN at CLEF 2013. In: Forner, et al. (eds.) [15] Google Scholar
  38. 38.
    Peersman, C., Daelemans, W., Van Vaerenbergh, L.: Predicting Age and Gender in Online Social Networks. In: Proc. of SMUC 2011 (2011)Google Scholar
  39. 39.
    Pennebaker, J.W.: The Secret Life of Pronouns: What Our Words Say About Us. Bloomsbury, USA (2013)Google Scholar
  40. 40.
    Pennebaker, J.W., Mehl, M.R., Niederhoffer, K.G.: Psychological Aspects of Natural Language Use: Our Words, Our Selves. Annual Review of Psychology 54(1), 547–577 (2003)CrossRefGoogle Scholar
  41. 41.
    Potthast, M., Stein, B., Eiselt, A., Barrón-Cedeño, A., Rosso, P.: Overview of the 1st International Competition on Plagiarism Detection. In: Proc. of PAN at SEPLN 2009 (2009)Google Scholar
  42. 42.
    Potthast, M., Barrón-Cedeño, A., Eiselt, A., Stein, B., Rosso, P.: Overview of the 2nd International Competition on Plagiarism Detection. In: Proc. of CLEF 2010 (2010)Google Scholar
  43. 43.
    Potthast, M., Stein, B., Barrón-Cedeño, A., Rosso, P.: An Evaluation Framework for Plagiarism Detection. In: Proc. of COLING 2010 (2010)Google Scholar
  44. 44.
    Potthast, M., Eiselt, A., Barrón-Cedeño, A., Stein, B., Rosso, P.: Overview of the 3rd International Competition on Plagiarism Detection. In: Proc. of CLEF 2011 (2011)Google Scholar
  45. 45.
    Potthast, M., Gollub, T., Hagen, M., Graßegger, J., Kiesel, J., Michel, M., Oberländer, A., Tippmann, M., Barrón-Cedeño, A., Gupta, P., Rosso, P., Stein, B.: Overview of the 4th International Competition on Plagiarism Detection. In: Proc. of CLEF 2012 (2012)Google Scholar
  46. 46.
    Potthast, M., Hagen, M., Stein, B., Graßegger, J., Michel, M., Tippmann, M., Welsch, C.: ChatNoir: A Search Engine for the ClueWeb09 Corpus. In: Proc. of SIGIR 2012 (2012)Google Scholar
  47. 47.
    Potthast, M., Gollub, T., Hagen, M., Tippmann, M., Kiesel, J., Rosso, P., Stamatatos, E., Stein, B.: Overview of the 5th International Competition on Plagiarism Detection. In: Proc. of CLEF 2013 (2013)Google Scholar
  48. 48.
    Potthast, M., Hagen, M., Völske, M., Stein, B.: Crowdsourcing Interaction Logs to Understand Text Reuse from the Web. In: Proc. of ACL 2013. ACM (to appear, August 2013b)Google Scholar
  49. 49.
    Rodíguez Torrejón, D.A., Martín Ramos, J.M.: Text Alignment Module in CoReMo 2.1 Plagiarism Detector—Notebook for PAN at CLEF 2013. In: Forner, et al. (eds.) [15] Google Scholar
  50. 50.
    Santosh, K., Bansal, R., Shekhar, M., Varma, V.: Author Profiling: Predicting Age and Gender from Blogs—Notebook for PAN at CLEF 2013. In: Forner, et al. (eds.) [15] Google Scholar
  51. 51.
    Schler, J., Koppel, M., Argamon, S., Pennebaker, J.W.: Effects of Age and Gender on Blogging. In: Proc. of CAAW 2006 (2006)Google Scholar
  52. 52.
    Stamatatos, E.: A Survey of Modern Authorship Attribution Methods. Journal of the American Society for Information Science and Technology 60, 538–556 (2009)CrossRefGoogle Scholar
  53. 53.
    Stamatatos, E.: Plagiarism Detection Using Stopword N-grams. Journal of the American Society for Information Science and Technology 62(12), 2512–2527 (2011)CrossRefGoogle Scholar
  54. 54.
    Stein, B., Meyer zu Eißen, S., Potthast, M.: Strategies for Retrieving Plagiarized Documents. In: Proc. of SIGIR 2007 (2007)Google Scholar
  55. 55.
    Suchomel, Š., Kasprzak, J., Brandejs, M.: Diverse Queries and Feature Type Selection for Plagiarism Discovery—Notebook for PAN at CLEF 2013. In: Forner, et al. (eds.) [15] Google Scholar
  56. 56.
    Williams, K., Chen, H., Chowdhury, S.R., Giles, C.L.: Unsupervised Ranking for Plagiarism Source Retrieval—Notebook for PAN at CLEF 2013. In: Forner, et al. (eds.) [15] Google Scholar
  57. 57.
    Wojnarski, M., Stawicki, S., Wojnarowski, P.: TunedIT.org: System for Automated Evaluation of Algorithms in Repeatable Experiments. In: Szczuka, M., Kryszkiewicz, M., Ramanna, S., Jensen, R., Hu, Q. (eds.) RSCTC 2010. LNCS, vol. 6086, pp. 20–29. Springer, Heidelberg (2010)CrossRefGoogle Scholar
  58. 58.
    Zhang, C., Zhang, P.: Predicting Gender from Blog Posts. Technical report, University of Massachusetts Amherst, USA (2010)Google Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Tim Gollub
    • 1
  • Martin Potthast
    • 1
  • Anna Beyer
    • 1
  • Matthias Busse
    • 1
  • Francisco Rangel
    • 2
    • 3
  • Paolo Rosso
    • 3
  • Efstathios Stamatatos
    • 4
  • Benno Stein
    • 1
  1. 1.Web Technology and Information SystemsBauhaus-Universität WeimarGermany
  2. 2.Autoritas Consulting, S.A.Spain
  3. 3.Natural Language Engineering Lab, ELiRFUniversitat Politècnica de ValènciaSpain
  4. 4.Dept. of Information and Communication Systems EngineeringUniversity of the AegeanGreece

Personalised recommendations