Evolution of the PAN Lab on Digital Text Forensics

  • Paolo Rosso
  • Martin Potthast
  • Benno Stein
  • Efstathios Stamatatos
  • Francisco Rangel
  • Walter Daelemans
Chapter
Part of The Information Retrieval Series book series (INRE, volume 41)

Abstract

PAN is a networking initiative for digital text forensics, where researchers and practitioners study technologies for text analysis with regard to originality, authorship, and trustworthiness. The practical importance of such technologies is obvious for law enforcement, cyber-security, and marketing, yet the general public also needs to be aware of their capabilities in order to make informed decisions about them. This is particularly true since almost all of these technologies are still in their infancy, and active research is required to push them forward. Hence PAN focuses on the evaluation of selected tasks from digital text forensics in order to develop large-scale, standardized benchmarks and to assess the state of the art. In this chapter we present the evolution of three shared tasks: plagiarism detection, author identification, and author profiling.

Notes

Acknowledgements

The work of Paolo Rosso was partially funded by the Spanish MICINN under the research project MISMIS-FAKEnHATE on Misinformation and Miscommunication in social media: FAKE news and HATE speech (PGC2018-096212-B-C31). The work on the author profiling data in Arabic was made possible by NPRP grant #9-175-1-033 from the Qatar National Research Fund (a member of Qatar Foundation). The statements made herein are solely the responsibility of the authors.

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Paolo Rosso (1), email author
  • Martin Potthast (2)
  • Benno Stein (3)
  • Efstathios Stamatatos (4)
  • Francisco Rangel (5, 6)
  • Walter Daelemans (7)

  1. PRHLT Research Center, Universitat Politècnica de València, Valencia, Spain
  2. Text Mining and Retrieval, Leipzig University, Leipzig, Germany
  3. Web Technology and Information Systems, Bauhaus-Universität Weimar, Weimar, Germany
  4. Dept. of Information and Communication Systems Engineering, University of the Aegean, Samos, Greece
  5. Autoritas Consulting S.A., Valencia, Spain
  6. PRHLT Research Center, Universitat Politècnica de València, Valencia, Spain
  7. CLiPS - Computational Linguistics Group, University of Antwerp, Antwerp, Belgium
