Learning to Detect Vandalism in Social Content Systems: A Study on Wikipedia

Javanmardi, Sara; McDonald, David W.; Caruana, Rich; Forouzan, Sholeh; Lopes, Cristina V.

doi:10.1007/978-94-007-6359-3_11

Part of the book series: Lecture Notes in Social Networks (LNSN)

Abstract

A challenge facing user generated content systems is vandalism, i.e., edits that damage content quality. The high visibility and easy access of social networks make them popular targets for vandals, so detecting and removing vandalism is critical for these systems. Because vandalism can take many forms, many different kinds of features are potentially useful for detecting it. The complex nature of vandalism and the large number of potential features make vandalism detection difficult and time consuming for human editors. Machine learning techniques hold promise for developing accurate, tunable, and maintainable models that can be incorporated into vandalism detection tools. We describe a method for training vandalism detection classifiers that are more accurate on the PAN 2010 corpus than those previously developed. Because of the high turnaround in social network systems, it is important for vandalism detection tools to run in real time. To this end, we use feature selection to find the minimal set of features consistent with high accuracy. In addition, because some features are more costly to compute than others, we use cost-sensitive feature selection to reduce the total computational cost of executing our models. Beyond the features previously used for spam detection, we introduce new features based on user action histories, which contribute significantly to classifier performance. The approach is general and can easily be applied to other user generated content systems.
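
One way to read the cost-sensitive feature selection mentioned in the abstract is as a lasso whose penalty on each coefficient is scaled by the cost of computing that feature. The sketch below illustrates only that idea and is not the chapter's implementation: it assumes a squared-error loss instead of a classification loss, and the function names, toy data, and cost values are all hypothetical.

```python
# Hypothetical toy data: two uncorrelated, equally informative features.
X = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
y = [4.0, 0.0, 0.0, -4.0]


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the lasso proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def weighted_lasso(X, y, lam, costs, sweeps=50):
    """Coordinate descent for
    (1/2n) * ||y - X b||^2 + lam * sum_j costs[j] * |b_j|.
    costs[j] is an assumed per-feature computation cost."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            # Residual with feature j's own contribution left out.
            partial = [
                y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j)
                for i in range(n)
            ]
            rho = sum(X[i][j] * partial[i] for i in range(n)) / n
            scale = sum(X[i][j] ** 2 for i in range(n)) / n
            # A costlier feature faces a larger threshold before it enters.
            beta[j] = soft_threshold(rho, lam * costs[j]) / scale
    return beta


def selected(beta, tol=1e-9):
    """Indices of features with a non-zero coefficient."""
    return [j for j, b in enumerate(beta) if abs(b) > tol]


uniform = selected(weighted_lasso(X, y, lam=1.0, costs=[1.0, 1.0]))
costly = selected(weighted_lasso(X, y, lam=1.0, costs=[1.0, 4.0]))
print(uniform, costly)
```

With equal costs both features are kept; when the second feature is assumed to be four times as expensive, only the first survives at the same λ, which is the trade-off between accuracy and computational cost that the abstract describes.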

Notes

1. The LASSO method does not necessarily yield a monotonically growing set of features: as λ is decreased, some features that were in the set for larger values of λ may be removed as other features replace them.
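
The behaviour described in the note above can be reproduced on a small example. The following sketch is an illustration under stated assumptions, not the chapter's code: it uses a squared-error lasso solved by plain coordinate descent, unstandardized features, and a hypothetical four-row data set in which feature 2 is a proxy for features 0 and 1. The helper names are invented for this example.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the lasso proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso(X, y, lam, sweeps=200):
    """Coordinate descent for (1/2n) * ||y - X b||^2 + lam * ||b||_1."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            # Residual with feature j's own contribution left out.
            partial = [
                y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j)
                for i in range(n)
            ]
            rho = sum(X[i][j] * partial[i] for i in range(n)) / n
            scale = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, lam) / scale
    return beta


def active_set(beta, tol=1e-9):
    """Indices of features with a non-zero coefficient."""
    return [j for j, b in enumerate(beta) if abs(b) > tol]


def dropped_features(active_sets):
    """Features selected at some larger lambda but missing at a smaller one."""
    seen, dropped = set(), set()
    for current in active_sets:
        dropped |= seen - set(current)
        seen |= set(current)
    return dropped


# Hypothetical data: feature 2 correlates with both feature 0 and feature 1,
# and y = 10 * x0 + 10 * x1 - x2.
X = [[1.0, 1.0, 5.0], [-1.0, 1.0, -1.0], [1.0, -1.0, -1.0], [-1.0, -1.0, -3.0]]
y = [15.0, 1.0, 1.0, -17.0]

lambdas = [20.0, 1.0, 0.25]  # decreasing regularization strength
path = [active_set(lasso(X, y, lam)) for lam in lambdas]
print(path, dropped_features(path))
```

On this data the proxy feature 2 is selected first, stays in the set once features 0 and 1 join it, and is then removed at λ = 0.25, so a feature budget read off one point of the path need not contain the sets found at larger λ.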

Acknowledgements

The authors would like to thank Prof. Robert Tibshirani and Prof. Alexander Ihler for their comments on using lasso for cost-sensitive feature selection, and Martin Potthast for his support in making the PAN data set available to us. The authors would also like to thank Amazon.com for a research grant that allowed us to use their MapReduce cluster. This work has also been partially supported by NSF grant OCI-074806.

Author information

Authors and Affiliations

University of California, Irvine, Donald Bren Hall 5042, Irvine, CA, 92697-3440, USA
Sara Javanmardi
The Information School, University of Washington, Seattle, WA, USA
David W. McDonald
Microsoft Research, Redmond, WA, USA
Rich Caruana
Bren School of Information and Computer Sciences, University of California, Irvine, CA, USA
Sholeh Forouzan & Cristina V. Lopes

Corresponding author

Correspondence to Sara Javanmardi.

Editor information

Editors and Affiliations

Department of Computer Engineering, TOBB University, Sogutozu Cad No. 43, Sogutozu Ankara, Turkey
Tansel Özyer
Information Technologies Institute, TUBITAK BILGEM, Gebze, Kocaeli, 41470, Turkey
Zeki Erdem
Computer Science, University of Calgary, University Dr. NW 2500, Calgary, T2N 1N4, Canada
Jon Rokne
American University of Sharjah, Universities City, Sharjah, United Arab Emirates
Suheil Khoury

About this chapter

Cite this chapter

Javanmardi, S., McDonald, D.W., Caruana, R., Forouzan, S., Lopes, C.V. (2013). Learning to Detect Vandalism in Social Content Systems: A Study on Wikipedia. In: Özyer, T., Erdem, Z., Rokne, J., Khoury, S. (eds) Mining Social Networks and Security Informatics. Lecture Notes in Social Networks. Springer, Dordrecht. https://doi.org/10.1007/978-94-007-6359-3_11

DOI: https://doi.org/10.1007/978-94-007-6359-3_11
Publisher Name: Springer, Dordrecht
Print ISBN: 978-94-007-6358-6
Online ISBN: 978-94-007-6359-3
eBook Packages: Computer Science, Computer Science (R0)
