
Detecting pages to protect in Wikipedia across multiple languages

Abstract

Wikipedia is based on the idea that anyone can make edits to the website to create reliable and crowd-sourced content. Yet with the cover of internet anonymity, some users make changes to the website that do not align with Wikipedia’s intended uses. For this reason, Wikipedia allows some pages of the website to become protected, where only certain users can make revisions to the page. This allows administrators to protect pages from vandalism, libel, and edit wars. However, with over five million pages on English Wikipedia, it is impossible for active editors to monitor all pages and suggest articles in need of protection. In this paper, we consider the problem of deciding whether a page should be protected or not in a collaborative environment such as Wikipedia. We formulate the problem as a binary classification task and propose a novel set of features to decide which pages to protect based on (1) users’ page revision behavior and (2) page categories. We tested our system, called DePP, on four different Wikipedia language versions: English, German, French, and Italian. Experimental results show that DePP reaches at least 0.93 in both AUROC and average precision across the four languages and significantly outperforms baselines. Moreover, DePP works well in a more realistic, unbalanced setting, that is, when protected pages are greatly outnumbered by unprotected ones, achieving a good AUROC, a high recall, and an average precision significantly higher than the baselines in all the settings and languages considered.
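For reference, the two evaluation metrics reported above can be computed with scikit-learn as follows (an illustrative sketch on made-up scores, not the paper's evaluation code):

```python
from sklearn.metrics import average_precision_score, roc_auc_score

# Hypothetical classifier scores for six pages (1 = protected, 0 = unprotected).
y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]

auroc = roc_auc_score(y_true, y_score)         # fraction of correctly ranked (pos, neg) pairs
ap = average_precision_score(y_true, y_score)  # area under the precision-recall curve
```

AUROC measures how well the classifier ranks protected pages above unprotected ones, while average precision summarizes the precision-recall trade-off and is more informative under class imbalance.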


Fig. 1
Fig. 2

Notes

  1.

    Autoconfirmed registered users can send these requests through the Twinkle gadget (Twinkle. https://en.wikipedia.org/wiki/Wikipedia:Twinkle).

  2.

    Adler et al. also included the features implemented in WikiTrust in their analysis. However, WikiTrust was discontinued as a tool to detect vandalism in 2012 due to unreliability (Wikitrust. https://en.wikipedia.org/wiki/WikiTrust, Computing wikipedia’s authority. https://acrlog.org/2007/08/15/computing-wikipedias-authority/).

  3.

    STiki uses spatio-temporal features such as edit time-of-day, edit day-of-week, time-since article edited, time-since editor registered, time-since last user-offending edit, revision comment length, registered user properties, and reputation features such as article, category, editor, and country reputation (West et al. 2010).

  4.

    More details on the grid search are provided in the Appendix.

  5.

    The data on whether or not an edit has been reverted by these bots/tools is directly available as metadata of the edits we crawled.

  6.

    In English Wikipedia, the average number of edit wars in protected pages is 1.37 while the same number for unprotected pages is 0.06.

  7.

    Because the dataset annotated with page controversy levels contains more non-controversial pages than controversial ones, we balanced the number of controversial and non-controversial training pages via majority undersampling to avoid bias toward non-controversial Wikipedia articles. The sampling was conducted 10 times and the results averaged.

  8.

    Similarly to what is done in Sect. 4, the set of unprotected pages is sampled uniformly at random from the complete list of unprotected pages.
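The majority undersampling procedure mentioned in note 7 can be sketched as follows (an illustrative Python sketch; the function and variable names are ours, not taken from the paper's implementation):

```python
import random

def undersample_majority(items, labels, seed=0):
    """Balance a binary dataset by randomly discarding majority-class items."""
    rng = random.Random(seed)
    pos = [(x, 1) for x, y in zip(items, labels) if y == 1]
    neg = [(x, 0) for x, y in zip(items, labels) if y == 0]
    if len(pos) > len(neg):
        pos = rng.sample(pos, len(neg))  # downsample the larger class
    else:
        neg = rng.sample(neg, len(pos))
    balanced = pos + neg
    rng.shuffle(balanced)
    return balanced

# Note 7 repeats the sampling 10 times and averages the evaluation results,
# since each draw discards a different subset of majority-class pages.
runs = [undersample_majority(range(10), [1] * 3 + [0] * 7, seed=s) for s in range(10)]
```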

References

  1. Adler BT, De Alfaro L, Pye I (2010) Detecting wikipedia vandalism using wikitrust—lab report for PAN at CLEF. In: CLEF 2010 LABs and Workshops, Notebook Papers, 22–23 September, Padua, Italy

  2. Adler BT, De Alfaro L, Mola-Velasco SM, Rosso P, West AG (2011) Wikipedia vandalism detection: combining natural language, metadata, and reputation features. In: Computational linguistics and intelligent text processing—12th International Conference, CICLing 2011, Tokyo, Japan, February 20–26, 2011. Proceedings, Part II, pp 277–288

  3. Das S, Lavoie A, Magdon-Ismail M (2016) Manipulation among the arbiters of collective intelligence: how wikipedia administrators mold public opinion. ACM Trans Web 10(4):24:1–24:25

  4. Dori-Hacohen S, Allan J (2013) Detecting controversy on the web. In: 22nd ACM international conference on information and knowledge management, CIKM’13, San Francisco, CA, USA, October 27–November 1, 2013, pp 1845–1848

  5. Dori-Hacohen S, Allan J (2015) Automated controversy detection on the web. In: Advances in information retrieval—37th European Conference on IR Research, ECIR 2015, Proceedings, Vienna, Austria, March 29–April 2, pp 423–434

  6. Dori-Hacohen S, Jensen DD, Allan J (2016) Controversy detection in wikipedia using collective classification. In: Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, SIGIR 2016, Pisa, Italy, July 17–21, pp 797–800

  7. Fawcett T (2006) An introduction to roc analysis. Pattern Recogn Lett 27(8):861–874

  8. Green T, Spezzano F (2017) Spam users identification in wikipedia via editing behavior. In: Proceedings of the eleventh international conference on web and social media, ICWSM 2017, Montréal, Québec, Canada, May 15–18, 2017, pp 532–535

  9. Hill BM, Shaw AD (2015) Page protection: another missing dimension of wikipedia research. In: Proceedings of the 11th International Symposium on Open Collaboration, San Francisco, CA, USA, August 19–21, 2015, pp 15:1–15:4

  10. Jang M, Foley J, Dori-Hacohen S, Allan J (2016) Probabilistic approaches to controversy detection. In: Proceedings of the 25th ACM international conference on information and knowledge management, CIKM 2016, Indianapolis, IN, USA, October 24–28, 2016, pp 2069–2072

  11. Johannes K, Potthast M, Hagen M, Stein B (2017) Spatio-temporal analysis of reverted wikipedia edits. In: Proceedings of the eleventh international conference on web and social media, ICWSM 2017, Montréal, Québec, Canada, May 15–18, 2017, pp 122–131

  12. Kittur A, Suh B, Pendleton BA, Chi EH (2007) He says, she says: conflict and coordination in wikipedia. In: Proceedings of the 2007 conference on human factors in computing systems, CHI 2007, San Jose, California, USA, April 28–May 3, 2007, pp 453–462

  13. Kumar S, Spezzano F, Subrahmanian VS (2015) VEWS: a wikipedia vandal early warning system. In: Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, Sydney, NSW, Australia, August 10–13, 2015, pp 607–616

  14. Kumar S, West R, Leskovec J (2016) Disinformation on the web: impact, characteristics, and detection of wikipedia hoaxes. In: Proceedings of the 25th international conference on world wide web, WWW 2016, Montreal, Canada, April 11–15, 2016, pp 591–602

  15. McDonald DW, Javanmardi S, Zachry M (2011) Finding patterns in behavioral observations by automatically labeling forms of wikiwork in barnstars. In: Proceedings of the 7th international symposium on wikis and open collaboration, 2011, Mountain View, CA, USA, October 3–5, 2011, pp 15–24

  16. Potthast M, Stein B, Gerling R (2008) Automatic vandalism detection in wikipedia. In: Proceedings advances in information retrieval, 30th European Conference on IR Research, ECIR 2008, Glasgow, UK, March 30–April 3, 2008, pp 663–668

  17. Potthast M, Stein B, Holfeld T (2010) Overview of the 1st international competition on wikipedia vandalism detection. In: CLEF 2010 LABs and Workshops, Notebook Papers, 22–23 September 2010, Padua, Italy

  18. Rad HS, Barbosa D (2012) Identifying controversial articles in wikipedia: a comparative study. In: Proceedings of the eighth annual international symposium on wikis and open collaboration, WikiSym 2012, Austria, August 27–29, 2012

  19. Roitman H, Hummel S, Rabinovich E, Sznajder B, Slonim N, Aharoni E (2016) On the retrieval of wikipedia articles containing claims on controversial topics. In: Proceedings of the 25th International Conference on World Wide Web, WWW 2016, Montreal, Canada, April 11–15, 2016, Companion Volume, pp 991–996

  20. Simonite T (2013) The decline of wikipedia. https://www.technologyreview.com/s/520446/the-decline-of-wikipedia/. Accessed 1 Oct 2018

  21. Singer P, Lemmerich F, West R, Zia L, Wulczyn E, Strohmaier M, Leskovec J (2017) Why we read wikipedia. In: Proceedings of the 26th International Conference on World Wide Web, WWW 2017, Perth, Australia, April 3–7, 2017, pp 1591–1600

  22. Solorio T, Hasan R, Mizan M (2013) A case study of sockpuppet detection in wikipedia. In: Proceedings of the workshop on language analysis in social media. Association for Computational Linguistics, Atlanta, Georgia, pp 59–68. http://aclweb.org/anthology/W13-1107

  23. Suyehira K, Spezzano F (2016) Depp: a system for detecting pages to protect in wikipedia. In: Proceedings of the 25th ACM international conference on information and knowledge management, CIKM 2016, Indianapolis, IN, USA, October 24–28, 2016, pp 2081–2084

  24. Tran KND (2015) Detecting vandalism on wikipedia across multiple languages

  25. Viégas FB, Wattenberg M, McKeon MM (2007) The hidden order of wikipedia. In: International conference on online communities and social computing, Second international conference, OCSC 2007, held as part of HCI international 2007, 22–27 July 2007. Springer, Beijing, China, pp 445–454

  26. West AG, Kannan S, Lee I (2010) Detecting wikipedia vandalism via spatio-temporal analysis of revision metadata? In: Proceedings of the third European workshop on system security, EUROSEC 2010, Paris, France. ACM, New York, pp 22–28. https://doi.org/10.1145/1752046.1752050

  27. West AG, Agrawal A, Baker P, Exline B, Lee I (2011a) Autonomous link spam detection in purely collaborative environments. In: Proceedings of the 7th international symposium on wikis and open collaboration, 2011, Mountain View, CA, USA, October 3–5, 2011, pp 91–100

  28. West AG, Chang J, Venkatasubramanian KK, Sokolsky O, Lee I (2011b) Link spamming wikipedia for profit. In: The 8th annual collaboration, electronic messaging, anti-abuse and spam conference, CEAS 2011, Perth, Australia, Proceedings, September 1–2, 2011, pp 152–161

  29. Wulczyn E, Taraborelli D (2015) Wikipedia clickstream dataset. https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream

  30. Yasseri T, Sumi R, Rung A, Kornai A, Kertész J (2012) Dynamics of conflicts in wikipedia. PLOS One 7(6):1–12


Author information

Correspondence to Francesca Spezzano.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This paper is an extended version of the conference paper “Kelsey Suyehira and Francesca Spezzano, DePP: A System for Detecting Pages to Protect in Wikipedia. In Proceedings of the 2016 ACM International Conference on Information and Knowledge Management, CIKM 2016” (Suyehira and Spezzano 2016).

Appendix

This appendix reports detailed results on DePP’s performance and its comparison with the baselines in the balanced setting for the non-English versions of Wikipedia considered in the paper: German (Table 10), French (Table 11), and Italian (Table 12).

Table 10 DePP accuracy, precision, recall, and AUROC results and comparison with baselines for German Wikipedia
Table 11 DePP accuracy, precision, recall, and AUROC results and comparison with baselines for French Wikipedia
Table 12 DePP accuracy, precision, recall, and AUROC results and comparison with baselines for Italian Wikipedia

Hyperparameter grid search

We conducted a grid search to choose the hyperparameters of the machine learning models used in our experiments. More specifically,

  • for logistic regression, we tried penalties L1 and L2;

  • for SVM, we tried different kernels ('linear', 'poly', 'rbf', 'sigmoid') and varied the gamma kernel coefficient used by the 'poly', 'rbf', and 'sigmoid' kernels according to the 'scale' and 'auto' options of scikit-learn;

  • for K-nearest neighbor, we varied the number of neighbors in \(\{3,5,7\}\);

  • for random forest, we varied the number of estimators in \(\{100, 200, 250\}\) and the maximum tree depth in \(\{10, 20, 30\}\).

Finally, to determine how many of the most frequent categories to consider in the computation of the category anomaly features, we tried \(k \in \{5,10,20,30,40\}\).
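For illustration, the SVM portion of this grid search could be expressed with scikit-learn's GridSearchCV as follows (a sketch on synthetic data; the paper does not publish its exact code, so the setup below is an assumption):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the page-protection feature matrix.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

param_grid = {
    "kernel": ["linear", "poly", "rbf", "sigmoid"],
    "gamma": ["scale", "auto"],  # ignored by the linear kernel
}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
```

After fitting, `search.best_params_` holds the winning kernel/gamma combination and `search.best_score_` its cross-validated accuracy; the grids for the other models in the list above would be handled analogously.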


About this article


Cite this article

Spezzano, F., Suyehira, K. & Gundala, L.A. Detecting pages to protect in Wikipedia across multiple languages. Soc. Netw. Anal. Min. 9, 10 (2019). https://doi.org/10.1007/s13278-019-0555-0


Keywords

  • Page protection
  • Misinformation
  • Semi-automated detection
  • Wikis and open collaboration