Context-Aware Detection of Sneaky Vandalism on Wikipedia Across Multiple Languages

Part of the Lecture Notes in Computer Science book series (LNAI, volume 9077)

Abstract

The malicious modification of articles, termed vandalism, is a serious problem for open-access encyclopedias such as Wikipedia. Wikipedia’s counter-vandalism bots and past vandalism detection research have greatly reduced the exposure to, and damage from, common and obvious types of vandalism. However, there remain increasingly sneaky types of vandalism that are clearly out of context within the sentence or article. In this paper, we propose a novel context-aware and cross-language vandalism detection technique that scales to the size of the full Wikipedia and extends the types of vandalism detectable beyond past feature-based approaches. Our technique uses word dependencies to identify vandal words in sentences by combining part-of-speech tagging with a conditional random fields classifier. We evaluate our technique on two Wikipedia data sets: the PAN data sets with over 62,000 edits, commonly used by related research; and our own vandalism repairs data sets with over 500 million edits of over 9 million articles from five languages. For comparison, we implement a feature-based classifier to analyse the quality of each classification technique and the trade-offs of each type of classifier. Our results show how context-aware detection techniques can become a new counter-vandalism tool for Wikipedia that complements current feature-based techniques.
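
The abstract does not spell out how part-of-speech tags are presented to the conditional random fields classifier, so the following is only an illustrative sketch and not the authors' implementation. It assumes each token arrives as a (word, tag) pair and builds, for every token, a small window of neighbouring tags, which is the kind of context feature a linear-chain CRF toolkit typically consumes. The function names, the window width, the `<PAD>` marker, and the hand-tagged example sentence are all hypothetical.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag); the tag set is an assumption


def window_features(sentence: List[Token], i: int, width: int = 1) -> Dict[str, str]:
    """Describe token i by its own word and tag plus the tags of its neighbours."""
    word, tag = sentence[i]
    feats = {"word": word.lower(), "pos": tag}
    for offset in range(-width, width + 1):
        if offset == 0:
            continue
        j = i + offset
        # Positions outside the sentence get a padding marker.
        neighbour_tag = sentence[j][1] if 0 <= j < len(sentence) else "<PAD>"
        feats[f"pos[{offset:+d}]"] = neighbour_tag
    return feats


def sentence_features(sentence: List[Token], width: int = 1) -> List[Dict[str, str]]:
    """Build one feature dictionary per token, in sentence order."""
    return [window_features(sentence, i, width) for i in range(len(sentence))]


# Hypothetical, hand-tagged sentence containing one out-of-context word.
sentence = [
    ("Canberra", "NNP"),
    ("is", "VBZ"),
    ("the", "DT"),
    ("stinky", "JJ"),
    ("capital", "NN"),
]
features = sentence_features(sentence)
```

A sequence labeller trained on such per-token features would then assign each word a label (for example, vandal or regular); which toolkit, templates, and label set to use are left open here.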

Keywords

  • Random Forest
  • Conditional Random Field
  • User Reputation
  • Word Label
  • Word Dependency

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Author information

Corresponding author

Correspondence to Khoi-Nguyen Tran.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Tran, KN., Christen, P., Sanner, S., Xie, L. (2015). Context-Aware Detection of Sneaky Vandalism on Wikipedia Across Multiple Languages. In: Cao, T., Lim, EP., Zhou, ZH., Ho, TB., Cheung, D., Motoda, H. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2015. Lecture Notes in Computer Science, vol 9077. Springer, Cham. https://doi.org/10.1007/978-3-319-18038-0_30

  • DOI: https://doi.org/10.1007/978-3-319-18038-0_30

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-18037-3

  • Online ISBN: 978-3-319-18038-0

  • eBook Packages: Computer Science, Computer Science (R0)