Wikipedia Vandal Early Detection: From User Behavior to User Embedding

  • Shuhan Yuan
  • Panpan Zheng
  • Xintao Wu
  • Yang Xiang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10534)


Wikipedia is the largest online encyclopedia that allows anyone to edit articles. In this paper, we propose using deep learning to detect vandals based on their edit history. In particular, we develop a multi-source long short-term memory network (M-LSTM) that models user behavior from several aspects of each edit, including the history of edit reversions, the titles of edited pages, and their categories. With M-LSTM, we encode each user into a low-dimensional real-valued vector, called a user embedding. Because M-LSTM is a sequential model, it updates the user embedding each time the user commits a new edit. Thus, we can dynamically predict whether a user is benign or a vandal based on the up-to-date user embedding. Furthermore, these user embeddings are crucial for discovering collaborative vandals. Code and data related to this chapter are available at:

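The per-edit update described in the abstract can be illustrated with a small sketch. This is not the authors' released code or the exact M-LSTM architecture: it is a plain LSTM cell written with the Python standard library, and it assumes the edit aspects (a reversion flag, a title vector, a category vector) are fused by simple concatenation. The names `TinyLSTMCell`, `embed_user`, and `vandal_score` are hypothetical, and all weights are random and untrained, so the scores only show the data flow.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyLSTMCell:
    """A single LSTM cell in plain Python (hypothetical helper, not the paper's M-LSTM)."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = random.Random(seed)
        self.hidden_size = hidden_size
        cols = input_size + hidden_size
        # One weight matrix and bias per gate:
        # input (i), forget (f), output (o), candidate (g).
        self.W = {gate: [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                         for _ in range(hidden_size)] for gate in "ifog"}
        self.b = {gate: [0.0] * hidden_size for gate in "ifog"}

    def _affine(self, gate, z):
        return [sum(w * v for w, v in zip(row, z)) + bias
                for row, bias in zip(self.W[gate], self.b[gate])]

    def step(self, x, h, c):
        z = x + h  # concatenate the current input with the previous hidden state
        i = [sigmoid(v) for v in self._affine("i", z)]
        f = [sigmoid(v) for v in self._affine("f", z)]
        o = [sigmoid(v) for v in self._affine("o", z)]
        g = [math.tanh(v) for v in self._affine("g", z)]
        c_new = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
        h_new = [ov * math.tanh(cv) for ov, cv in zip(o, c_new)]
        return h_new, c_new


def embed_user(cell, edits):
    """Yield the user embedding (hidden state) after each edit in the sequence."""
    h = [0.0] * cell.hidden_size
    c = [0.0] * cell.hidden_size
    for edit in edits:
        # Assumption: the edit aspects are fused by concatenation.
        x = [edit["reverted"]] + edit["title_vec"] + edit["category_vec"]
        h, c = cell.step(x, h, c)
        yield list(h)


def vandal_score(embedding, weights, bias=0.0):
    """Logistic score on top of the current embedding (untrained, illustrative)."""
    return sigmoid(sum(w * e for w, e in zip(weights, embedding)) + bias)


# Usage: two edits by one user, each with made-up feature values.
cell = TinyLSTMCell(input_size=5, hidden_size=4)
edits = [
    {"reverted": 1.0, "title_vec": [0.2, -0.1], "category_vec": [0.0, 0.3]},
    {"reverted": 0.0, "title_vec": [0.5, 0.4], "category_vec": [-0.2, 0.1]},
]
embeddings = list(embed_user(cell, edits))
scores = [vandal_score(e, [0.5, -0.5, 0.5, -0.5]) for e in embeddings]
```

Because the embedding is simply the hidden state after the latest edit, a score is available after every edit rather than only once the full history is known, which is the property the abstract relies on for early detection.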


The authors would like to thank the anonymous reviewers for their valuable comments and suggestions. The authors acknowledge support from the 973 Program of China (2014CB340404), the National Natural Science Foundation of China (71571136), and the Research Projects of the Science and Technology Commission of Shanghai Municipality (16JC1403000, 14511108002) to Shuhan Yuan and Yang Xiang, and from the National Science Foundation (1564250) to Panpan Zheng and Xintao Wu. This research was conducted while Shuhan Yuan was visiting the University of Arkansas.



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Shuhan Yuan (1)
  • Panpan Zheng (2)
  • Xintao Wu (2, corresponding author)
  • Yang Xiang (1)

  1. Tongji University, Shanghai, China
  2. University of Arkansas, Fayetteville, USA
