Detecting pages to protect in Wikipedia across multiple languages

  • Francesca Spezzano
  • Kelsey Suyehira
  • Laxmi Amulya Gundala
Original Article

Abstract

Wikipedia is based on the idea that anyone can edit the website to create reliable, crowd-sourced content. Yet under the cover of internet anonymity, some users make changes that do not align with Wikipedia's intended uses. For this reason, Wikipedia allows some pages to become protected, so that only certain users can revise them. This lets administrators protect pages from vandalism, libel, and edit wars. However, with over five million pages on English Wikipedia, it is impossible for active editors to monitor all pages and suggest articles in need of protection. In this paper, we consider the problem of deciding whether a page should be protected or not in a collaborative environment such as Wikipedia. We formulate the problem as a binary classification task and propose a novel set of features to decide which pages to protect based on (1) users' page revision behavior and (2) page categories. We tested our system, called DePP, on four different Wikipedia language versions: English, German, French, and Italian. Experimental results show that DePP reaches at least 0.93 in both AUROC and average precision across the four languages and significantly outperforms baselines. Moreover, DePP works well in a more realistic, unbalanced setting, that is, when protected pages are greatly outnumbered by unprotected pages, achieving a good AUROC, a high recall, and an average precision significantly higher than the baselines in all the settings and languages considered.
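The two metrics named in the abstract, AUROC and average precision, can be illustrated with a minimal Python sketch. This is not DePP's implementation: the function names, labels, and scores below are hypothetical toy values chosen only to show how each metric is computed for a binary "protect / do not protect" classifier when protected pages (label 1) are the minority.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of precision@k over the ranks k where a positive appears.

    Simplification: tied scores keep their input order (no tie grouping).
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    total = 0.0
    for k, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / k
    if hits == 0:
        raise ValueError("need at least one positive example")
    return total / hits


# Hypothetical toy data: 2 protected pages (1) among 4 unprotected pages (0).
toy_labels = [1, 0, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]

toy_auroc = auroc(toy_labels, toy_scores)
toy_ap = average_precision(toy_labels, toy_scores)
print(f"AUROC = {toy_auroc:.3f}, average precision = {toy_ap:.3f}")
```

On unbalanced data the two metrics can diverge: AUROC weighs every positive/negative pair equally, while average precision is driven by how many unprotected pages are ranked above the protected ones, which is one reason to report both.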

Keywords

Page protection, Misinformation, Semi-automated detection, Wikis and open collaboration

Copyright information

© Springer-Verlag GmbH Austria, part of Springer Nature 2019

Authors and Affiliations

  1. Computer Science Department, Boise State University, Boise, USA
