Abstract
PAN is a networking initiative for digital text forensics, where researchers and practitioners study technologies for text analysis with regard to originality, authorship, and trustworthiness. The practical importance of such technologies is obvious for law enforcement, cyber-security, and marketing, yet the general public needs to be aware of their capabilities as well to make informed decisions about them. This is particularly true since almost all of these technologies are still in their infancy, and active research is required to push them forward. Hence PAN focuses on the evaluation of selected tasks from digital text forensics in order to develop large-scale, standardized benchmarks and to assess the state of the art. In this chapter we present the evolution of three shared tasks: plagiarism detection, author identification, and author profiling.
Acknowledgements
The work of Paolo Rosso was partially funded by the Spanish MICINN under the research project MISMIS-FAKEnHATE on Misinformation and Miscommunication in social media: FAKE news and HATE speech (PGC2018-096212-B-C31). The work on the author profiling data in Arabic was made possible by NPRP grant #9-175-1-033 from the Qatar National Research Fund (a member of Qatar Foundation). The statements made herein are solely the responsibility of the authors.
Copyright information
© 2019 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Rosso, P., Potthast, M., Stein, B., Stamatatos, E., Rangel, F., Daelemans, W. (2019). Evolution of the PAN Lab on Digital Text Forensics. In: Ferro, N., Peters, C. (eds) Information Retrieval Evaluation in a Changing World. The Information Retrieval Series, vol 41. Springer, Cham. https://doi.org/10.1007/978-3-030-22948-1_19
DOI: https://doi.org/10.1007/978-3-030-22948-1_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-22947-4
Online ISBN: 978-3-030-22948-1
eBook Packages: Computer Science (R0)