Learning to Detect Vandalism in Social Content Systems: A Study on Wikipedia

Vandalism Detection in Wikipedia

  • Chapter
Mining Social Networks and Security Informatics

Abstract

A challenge facing user-generated content systems is vandalism, i.e., edits that damage content quality. The high visibility of social networks and the ease of access to them make them popular targets for vandals. Detecting and removing vandalism is critical for these user-generated content systems. Because vandalism can take many forms, many different kinds of features are potentially useful for detecting it. The complex nature of vandalism and the large number of potential features make vandalism detection difficult and time-consuming for human editors. Machine learning techniques hold promise for developing accurate, tunable, and maintainable models that can be incorporated into vandalism detection tools. We describe a method for training classifiers for vandalism detection that yields classifiers more accurate on the PAN 2010 corpus than those previously developed. Because of the high turnaround in social network systems, it is important for vandalism detection tools to run in real time. To this end, we use feature selection to find the minimal set of features consistent with high accuracy. In addition, because some features are more costly to compute than others, we use cost-sensitive feature selection to reduce the total computational cost of executing our models. Beyond the features previously used for spam detection, we introduce new features based on user action histories. The user history features contribute significantly to classifier performance. The approach we use is general and can easily be applied to other user-generated content systems.
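
The abstract does not spell out the objective behind the cost-sensitive feature selection. As a rough sketch only, one common way to make an L1-penalized (lasso) logistic regression cost-aware is to scale each coefficient's penalty by the cost of computing its feature. The per-feature costs c_j, the logistic loss, and all symbols below are assumptions made for illustration, not the authors' stated formulation:

```latex
\hat{\beta}(\lambda) \;=\; \arg\min_{\beta_0,\,\beta}\;
  -\frac{1}{n}\sum_{i=1}^{n}\Bigl[\, y_i\,(\beta_0 + x_i^{\top}\beta)
  - \log\bigl(1 + e^{\beta_0 + x_i^{\top}\beta}\bigr) \Bigr]
  \;+\; \lambda \sum_{j=1}^{p} c_j\,\lvert \beta_j \rvert ,
\qquad y_i \in \{0,1\},\; c_j > 0 .
```

With all c_j = 1 this reduces to the ordinary lasso. A larger c_j means feature j must contribute more to the fit before it enters the selected set, so expensive features tend to be admitted later as λ is decreased.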

Notes

  1. The LASSO method does not necessarily yield a monotonically increasing set of features; as λ is decreased, some features that were in the set for larger values of λ may be removed from it as other features replace them.
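
The behaviour described in this note can be checked by tracing the active set along a regularization path. The sketch below is a minimal illustration under stated assumptions: it uses squared-error lasso solved by plain coordinate descent in pure Python rather than the logistic model and software used in the chapter, and the function names (`lasso_cd`, `active_sets`, `dropped_features`) and the toy data are hypothetical.

```python
def soft_threshold(rho, lam):
    """Soft-thresholding operator S(rho, lam)."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_cd(X, y, lam, beta=None, sweeps=200, tol=1e-10):
    """Coordinate descent for (1/2n)||y - X b||^2 + lam * ||b||_1 (no intercept)."""
    n, p = len(X), len(X[0])
    beta = list(beta) if beta is not None else [0.0] * p
    # Residual r = y - X beta for the starting coefficients.
    resid = [y[i] - sum(X[i][k] * beta[k] for k in range(p)) for i in range(n)]
    for _ in range(sweeps):
        max_change = 0.0
        for j in range(p):
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            if z == 0.0:
                continue  # an all-zero column can never enter the model
            # Correlation of feature j with the partial residual
            # (the residual with feature j's own contribution added back).
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new = soft_threshold(rho, lam) / z
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta


def active_sets(X, y, lambdas):
    """Indices of nonzero coefficients for each lambda, using warm starts."""
    beta, path = None, []
    for lam in lambdas:
        beta = lasso_cd(X, y, lam, beta)
        path.append({j for j, b in enumerate(beta) if abs(b) > 1e-12})
    return path


def dropped_features(path):
    """Return (step, features) pairs where features active at step-1 are gone at step."""
    return [(t, path[t - 1] - path[t])
            for t in range(1, len(path)) if path[t - 1] - path[t]]


if __name__ == "__main__":
    # Toy orthogonal design (hypothetical data): features enter one by one.
    X = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
    y = [2.0, 0.5, -2.0, -0.5]
    path = active_sets(X, y, [1.0, 0.5, 0.1])  # lambda decreasing
    print(path)                    # [set(), {0}, {0, 1}]
    print(dropped_features(path))  # []
```

On this orthogonal toy design the active set only grows. With correlated features, `dropped_features` may return a non-empty list, which is the non-monotone case the note describes; choosing a minimal feature set from the path therefore requires inspecting each λ rather than assuming nested sets.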

Acknowledgements

The authors would like to thank Prof. Robert Tibshirani and Prof. Alexander Ihler for their comments on using lasso for cost-sensitive feature selection, and Martin Potthast for his support in making the PAN data set available to us. The authors would also like to thank Amazon.com for a research grant that allowed us to use their MapReduce cluster. This work has also been partially supported by NSF grant OCI-074806.

Author information

Corresponding author

Correspondence to Sara Javanmardi.

Copyright information

© 2013 Springer Science+Business Media Dordrecht

About this chapter

Cite this chapter

Javanmardi, S., McDonald, D.W., Caruana, R., Forouzan, S., Lopes, C.V. (2013). Learning to Detect Vandalism in Social Content Systems: A Study on Wikipedia. In: Özyer, T., Erdem, Z., Rokne, J., Khoury, S. (eds) Mining Social Networks and Security Informatics. Lecture Notes in Social Networks. Springer, Dordrecht. https://doi.org/10.1007/978-94-007-6359-3_11

  • DOI: https://doi.org/10.1007/978-94-007-6359-3_11

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-94-007-6358-6

  • Online ISBN: 978-94-007-6359-3

  • eBook Packages: Computer Science, Computer Science (R0)
