Abstract
Wikipedia is based on the idea that anyone can edit the website to create reliable, crowd-sourced content. Yet under the cover of internet anonymity, some users make changes that do not align with Wikipedia’s intended uses. For this reason, Wikipedia allows some pages to be protected, so that only certain users can revise them. This allows administrators to protect pages from vandalism, libel, and edit wars. However, with over five million pages on English Wikipedia, it is impossible for active editors to monitor all pages and suggest articles in need of protection. In this paper, we consider the problem of deciding whether a page should be protected in a collaborative environment such as Wikipedia. We formulate the problem as a binary classification task and propose a novel set of features to decide which pages to protect based on (1) users’ page revision behavior and (2) page categories. We tested our system, called DePP, on four different Wikipedia language versions: English, German, French, and Italian. Experimental results show that DePP reaches at least 0.93 in both AUROC and average precision across the four languages and significantly outperforms the baselines. Moreover, DePP works well in a more realistic, unbalanced setting, that is, when protected pages are greatly outnumbered by unprotected pages, achieving a good AUROC, a high recall, and an average precision significantly higher than the baselines in all the settings and languages considered.
Notes
Autoconfirmed registered users can send these requests through the Twinkle gadget (Twinkle. https://en.wikipedia.org/wiki/Wikipedia:Twinkle).
Adler et al. also included the features implemented in WikiTrust in their analysis. However, WikiTrust was discontinued as a tool to detect vandalism in 2012 due to unreliability (Wikitrust. https://en.wikipedia.org/wiki/WikiTrust, Computing wikipedia’s authority. https://acrlog.org/2007/08/15/computing-wikipedias-authority/).
STiki uses spatio-temporal features such as edit time-of-day, edit day-of-week, time-since article edited, time-since editor registered, time-since last user-offending edit, revision comment length, registered user properties, and reputation features such as article, category, editor, and country reputation (West et al. 2010).
More details on the grid search are provided in the Appendix.
The data on whether or not an edit has been reverted by these bots/tools is directly available as metadata of the edits we crawled.
In English Wikipedia, the average number of edit wars in protected pages is 1.37 while the same number for unprotected pages is 0.06.
Because the dataset with page controversy level contains more non-controversial pages than controversial ones, we balanced the number of controversial and non-controversial training pages via majority undersampling to avoid bias towards non-controversial Wikipedia articles. The sampling was repeated 10 times and the results were averaged.
As in Sect. 4, the set of unprotected pages is sampled uniformly at random from the complete list of unprotected pages.
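As an illustration of the balancing procedure mentioned in the note on page controversy level, the following is a minimal sketch of majority undersampling repeated several times with averaged results. The function names, the 0/1 label encoding, the seed, and the `evaluate` callback are assumptions made for this example and are not taken from the paper.

```python
import random
from statistics import mean


def undersample_majority(labels, rng):
    """Return sorted indices of a class-balanced subset.

    All minority-class items are kept; the majority class is sampled
    uniformly at random (without replacement) down to the same size.
    Labels are assumed to be 0/1 (hypothetical encoding).
    """
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = rng.sample(majority, len(minority))
    return sorted(minority + kept)


def averaged_score(labels, evaluate, repeats=10, seed=0):
    """Repeat the undersampling `repeats` times and average the scores.

    `evaluate` is a hypothetical callback that trains and scores a model
    on the given indices and returns a single number.
    """
    rng = random.Random(seed)
    scores = [evaluate(undersample_majority(labels, rng)) for _ in range(repeats)]
    return mean(scores)
```

Each call to `undersample_majority` draws a fresh random subset of the majority class, so averaging over the repetitions reduces the dependence of the result on any single draw.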
References
Adler BT, De Alfaro L, Pye I (2010) Detecting wikipedia vandalism using wikitrust—lab report for PAN at CLEF. In: CLEF 2010 LABs and Workshops, Notebook Papers, 22–23 September, Padua, Italy
Adler BT, De Alfaro L, Mola-Velasco SM, Rosso P, West AG (2011) Wikipedia vandalism detection: combining natural language, metadata, and reputation features. In: Computational linguistics and intelligent text processing—12th International Conference, CICLing 2011, Tokyo, Japan, February 20–26, 2011. Proceedings, Part II, pp 277–288
Das S, Lavoie A, Magdon-Ismail M (2016) Manipulation among the arbiters of collective intelligence: how wikipedia administrators mold public opinion. ACM Trans Web 10(4):24:1–24:25
Dori-Hacohen S, Allan J (2013) Detecting controversy on the web. In: 22nd ACM international conference on information and knowledge management, CIKM’13, San Francisco, CA, USA, October 27–November 1, 2013, pp 1845–1848
Dori-Hacohen S, Allan J (2015) Automated controversy detection on the web. In: Advances in information retrieval—37th European Conference on IR Research, ECIR 2015, Proceedings, Vienna, Austria, March 29–April 2, pp 423–434
Dori-Hacohen S, Jensen DD, Allan J (2016) Controversy detection in wikipedia using collective classification. In: Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, SIGIR 2016, Pisa, Italy, July 17–21, pp 797–800
Fawcett T (2006) An introduction to roc analysis. Pattern Recogn Lett 27(8):861–874
Green T, Spezzano F (2017) Spam users identification in wikipedia via editing behavior. In: Proceedings of the eleventh international conference on web and social media, ICWSM 2017, Montréal, Québec, Canada, May 15–18, 2017, pp 532–535
Hill BM, Shaw AD (2015) Page protection: another missing dimension of wikipedia research. In: Proceedings of the 11th International Symposium on Open Collaboration, San Francisco, CA, USA, August 19–21, 2015, pp 15:1–15:4
Jang M, Foley J, Dori-Hacohen S, Allan J (2016) Probabilistic approaches to controversy detection. In: Proceedings of the 25th ACM international conference on information and knowledge management, CIKM 2016, Indianapolis, IN, USA, October 24–28, 2016, pp 2069–2072
Johannes K, Potthast M, Hagen M, Stein B (2017) Spatio-temporal analysis of reverted wikipedia edits. In: Proceedings of the eleventh international conference on web and social media, ICWSM 2017, Montréal, Québec, Canada, May 15–18, 2017, pp 122–131
Kittur A, Suh B, Pendleton BA, Chi EH (2007) He says, she says: conflict and coordination in wikipedia. In: Proceedings of the 2007 conference on human factors in computing systems, CHI 2007, San Jose, California, USA, April 28–May 3, 2007, pp 453–462
Kumar S, Spezzano F, Subrahmanian VS (2015) VEWS: a wikipedia vandal early warning system. In: Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, Sydney, NSW, Australia, August 10–13, 2015, pp 607–616
Kumar S, West R, Leskovec J (2016) Disinformation on the web: impact, characteristics, and detection of wikipedia hoaxes. In: Proceedings of the 25th international conference on world wide web, WWW 2016, Montreal, Canada, April 11–15, 2016, pp 591–602
McDonald DW, Javanmardi S, Zachry M (2011) Finding patterns in behavioral observations by automatically labeling forms of wikiwork in barnstars. In: Proceedings of the 7th international symposium on wikis and open collaboration, 2011, Mountain View, CA, USA, October 3–5, 2011, pp 15–24
Potthast M, Stein B, Gerling R (2008) Automatic vandalism detection in wikipedia. In: Proceedings advances in information retrieval, 30th European Conference on IR Research, ECIR 2008, Glasgow, UK, March 30–April 3, 2008, pp 663–668
Potthast M, Stein B, Holfeld T (2010) Overview of the 1st international competition on wikipedia vandalism detection. In: CLEF 2010 LABs and Workshops, Notebook Papers, 22–23 September 2010, Padua, Italy
Rad HS, Barbosa D (2012) Identifying controversial articles in wikipedia: a comparative study. In: Proceedings of the eighth annual international symposium on wikis and open collaboration, WikiSym 2012, Austria, August 27–29, 2012
Roitman H, Hummel S, Rabinovich E, Sznajder B, Slonim N, Aharoni E (2016) On the retrieval of wikipedia articles containing claims on controversial topics. In: Proceedings of the 25th International Conference on World Wide Web, WWW 2016, Montreal, Canada, April 11–15, 2016, Companion Volume, pp 991–996
Simonite T (2013) The decline of wikipedia. https://www.technologyreview.com/s/520446/the-decline-of-wikipedia/. Accessed 1 Oct 2018
Singer P, Lemmerich F, West R, Zia L, Wulczyn E, Strohmaier M, Leskovec J (2017) Why we read wikipedia. In: Proceedings of the 26th International Conference on World Wide Web, WWW 2017, Perth, Australia, April 3–7, 2017, pp 1591–1600
Solorio T, Hasan R, Mizan M (2013) A case study of sockpuppet detection in wikipedia. In: Proceedings of the workshop on language analysis in social media. Association for Computational Linguistics, Atlanta, Georgia, pp 59–68. http://aclweb.org/anthology/W13-1107
Suyehira K, Spezzano F (2016) Depp: a system for detecting pages to protect in wikipedia. In: Proceedings of the 25th ACM international conference on information and knowledge management, CIKM 2016, Indianapolis, IN, USA, October 24–28, 2016, pp 2081–2084
Tran KND (2015) Detecting vandalism on wikipedia across multiple languages
Viégas FB, Wattenberg M, McKeon MM (2007) The hidden order of wikipedia. In: International conference on online communities and social computing, Second international conference, OCSC 2007, held as part of HCI international 2007, 22–27 July 2007. Springer, Beijing, China, pp 445–454
West AG, Kannan S, Lee I (2010) Detecting wikipedia vandalism via spatio-temporal analysis of revision metadata? In: Proceedings of the third European workshop on system security, EUROSEC 2010, Paris, France. ACM, New York, pp 22–28. https://doi.org/10.1145/1752046.1752050
West AG, Agrawal A, Baker P, Exline B, Lee I (2011a) Autonomous link spam detection in purely collaborative environments. In: Proceedings of the 7th international symposium on wikis and open collaboration, 2011, Mountain View, CA, USA, October 3–5, 2011, pp 91–100
West AG, Chang J, Venkatasubramanian KK, Sokolsky O, Lee I (2011b) Link spamming wikipedia for profit. In: The 8th annual collaboration, electronic messaging, anti-abuse and spam conference, CEAS 2011, Perth, Australia, Proceedings, September 1–2, 2011, pp 152–161
Wulczyn E, Taraborelli D (2015) Wikipedia clickstream dataset. https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream
Yasseri T, Sumi R, Rung A, Kornai A, Kertész J (2012) Dynamics of conflicts in wikipedia. PLOS One 7(6):1–12
Additional information
This paper is an extended version of the conference paper “Kelsey Suyehira and Francesca Spezzano, DePP: A System for Detecting Pages to Protect in Wikipedia. In Proceedings of the 2016 ACM International Conference on Information and Knowledge Management, CIKM 2016” (Suyehira and Spezzano 2016).
Appendix
This appendix reports detailed results on DePP’s performance and its comparison with the baselines in the balanced setting for the non-English versions of Wikipedia considered in the paper: German (Table 10), French (Table 11), and Italian (Table 12).
1.1 Hyperparameter grid search
We conducted a grid search to choose the hyperparameters of the machine learning models used in our experiments. More specifically,
for logistic regression, we tried penalties L1 and L2;
for SVM, we tried different kernels (’linear’, ’poly’, ’rbf’, ’sigmoid’) and varied the gamma kernel coefficient used by ’poly’, ’rbf’, ’sigmoid’ kernels according to the options ’scale’ and ’auto’ of scikit-learn;
for K-nearest neighbor, we varied the number of neighbors in \(\{3,5,7\}\);
for random forest, we varied the number of estimators in \(\{100, 200, 250\}\) and the maximum tree depth in \(\{10, 20, 30\}\).
Finally, to determine how many of the most frequent categories to consider when computing the category anomaly features, we tried \(k \in \{5,10,20,30,40\}\).
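The model grids listed above can be written down in the list-of-dicts form that scikit-learn's `GridSearchCV` accepts. The sketch below is an illustration and not the authors' code: the `liblinear` solver (chosen here because it supports both L1 and L2 penalties), the `roc_auc` scoring, the 5-fold cross-validation, `max_iter`, and all function and variable names are assumptions. Since gamma only affects the 'poly', 'rbf', and 'sigmoid' kernels, the linear kernel is placed in its own sub-grid.

```python
from itertools import product

# Hyperparameter grids from the appendix, keyed by hypothetical model names.
PARAM_GRIDS = {
    # The solver is an assumption; liblinear supports both penalties.
    "logistic_regression": [{"penalty": ["l1", "l2"], "solver": ["liblinear"]}],
    "svm": [
        {"kernel": ["linear"]},  # gamma is not used by the linear kernel
        {"kernel": ["poly", "rbf", "sigmoid"], "gamma": ["scale", "auto"]},
    ],
    "knn": [{"n_neighbors": [3, 5, 7]}],
    "random_forest": [{"n_estimators": [100, 200, 250], "max_depth": [10, 20, 30]}],
}


def expand(grid):
    """Enumerate every hyperparameter combination in a list-of-dicts grid."""
    combos = []
    for block in grid:
        names = sorted(block)
        for values in product(*(block[name] for name in names)):
            combos.append(dict(zip(names, values)))
    return combos


def run_grid_search(name, X, y, scoring="roc_auc", cv=5):
    """Fit GridSearchCV for one model; scoring and cv are assumed values."""
    # Imported here so the grids above can be inspected without scikit-learn.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC

    estimators = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "svm": SVC(),
        "knn": KNeighborsClassifier(),
        "random_forest": RandomForestClassifier(random_state=0),
    }
    search = GridSearchCV(estimators[name], PARAM_GRIDS[name], scoring=scoring, cv=cv)
    search.fit(X, y)
    return search.best_params_, search.best_score_
```

Calling `expand(PARAM_GRIDS["svm"])` lists the candidate settings that would be tried for the SVM, and `run_grid_search("random_forest", X, y)` would return the best setting and its cross-validated score for a feature matrix `X` and binary labels `y`.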
Cite this article
Spezzano, F., Suyehira, K. & Gundala, L.A. Detecting pages to protect in Wikipedia across multiple languages. Soc. Netw. Anal. Min. 9, 10 (2019). https://doi.org/10.1007/s13278-019-0555-0