Abstract
The malicious modification of articles, termed vandalism, is a serious problem for open-access encyclopedias such as Wikipedia. Wikipedia’s counter-vandalism bots and past vandalism detection research have greatly reduced the exposure to, and damage from, common and obvious types of vandalism. However, there remain increasingly sneaky types of vandalism that are clearly out of context within the sentence or article. In this paper, we propose a novel context-aware and cross-language vandalism detection technique that scales to the size of the full Wikipedia and extends the types of vandalism detectable beyond past feature-based approaches. Our technique uses word dependencies to identify vandal words in sentences by combining part-of-speech tagging with a conditional random fields classifier. We evaluate our technique on two collections of Wikipedia data: the PAN data sets with over 62,000 edits, commonly used by related research; and our own vandalism repairs data sets with over 500 million edits of over 9 million articles from five languages. As a comparison, we implement a feature-based classifier to analyse the quality of each classification technique and the trade-offs of each type of classifier. Our results show how context-aware detection techniques can become a new counter-vandalism tool for Wikipedia that complements current feature-based techniques.
Keywords
- Random Forest
- Conditional Random Field
- User Reputation
- Word Label
- Word Dependency
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Tran, KN., Christen, P., Sanner, S., Xie, L. (2015). Context-Aware Detection of Sneaky Vandalism on Wikipedia Across Multiple Languages. In: Cao, T., Lim, EP., Zhou, ZH., Ho, TB., Cheung, D., Motoda, H. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2015. Lecture Notes in Computer Science, vol 9077. Springer, Cham. https://doi.org/10.1007/978-3-319-18038-0_30
DOI: https://doi.org/10.1007/978-3-319-18038-0_30
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-18037-3
Online ISBN: 978-3-319-18038-0
eBook Packages: Computer Science, Computer Science (R0)