Language Resources and Evaluation, Volume 47, Issue 4, pp 1163–1190

WHAD: Wikipedia historical attributes data

Historical structured data extraction and vandalism detection from the Wikipedia edit history
  • Enrique Alfonseca
  • Guillermo Garrido
  • Jean-Yves Delort
  • Anselmo Peñas
Original Paper

Abstract

This paper describes the generation of temporally anchored infobox attribute data from the Wikipedia revision history. By mining (attribute, value) pairs from the revision history of the English Wikipedia, we collect a comprehensive knowledge base that records how attribute values change over time. When working with the Wikipedia edit history, vandalism and erroneous edits are a concern for data quality. We present a study of vandalism identification in Wikipedia edits that uses only features from the infoboxes, and show that, on this dataset, it achieves an accuracy comparable to a state-of-the-art vandalism identification method based on the whole article. Finally, we discuss different characteristics of the extracted dataset, which we make available for further study.
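
As a rough illustration of what "temporally anchored" attribute data can look like, the sketch below collapses a sequence of revisions into per-attribute value spans. It is not the authors' pipeline: the input shape (ISO-formatted timestamps paired with an already parsed infobox dictionary), the function name `anchor_attribute_values`, and the span tuple layout are all assumptions made for this example, and no vandalism filtering is applied.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical input shape: one (timestamp, infobox) pair per revision,
# where the infobox is already parsed into an attribute -> value dict.
# Timestamps are assumed to be ISO strings, so they sort lexicographically.
Revision = Tuple[str, Dict[str, str]]
# (attribute, value, first_seen, end), where end is None if still current.
Span = Tuple[str, str, str, Optional[str]]


def anchor_attribute_values(revisions: List[Revision]) -> List[Span]:
    """Collapse consecutive revisions into per-attribute value spans."""
    open_spans: Dict[str, Tuple[str, str]] = {}  # attribute -> (value, start)
    closed: List[Span] = []
    for timestamp, infobox in sorted(revisions, key=lambda r: r[0]):
        # Close spans whose value changed or whose attribute disappeared.
        for attr in list(open_spans):
            value, start = open_spans[attr]
            if infobox.get(attr) != value:
                closed.append((attr, value, start, timestamp))
                del open_spans[attr]
        # Open spans for attributes that are new or have a new value.
        for attr, value in infobox.items():
            if attr not in open_spans:
                open_spans[attr] = (value, timestamp)
    # Spans still open at the last revision have no end timestamp.
    closed.extend((a, v, s, None) for a, (v, s) in open_spans.items())
    return closed


# Toy data, invented for the example.
example_revisions = [
    ("2009-01-01", {"population": "100"}),
    ("2010-01-01", {"population": "100"}),
    ("2011-01-01", {"population": "120"}),
]
example_spans = anchor_attribute_values(example_revisions)
```

In this toy run the unchanged 2010 revision is absorbed into the first span, so the attribute ends up with one closed span and one open span. A real system would also need to decide how to treat reverted or vandalized values, which this sketch ignores.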

Keywords

Wikipedia, Infobox, Attributes, Temporal data

Copyright information

© Springer Science+Business Media Dordrecht 2013

Authors and Affiliations

  • Enrique Alfonseca (1)
  • Guillermo Garrido (2)
  • Jean-Yves Delort (1)
  • Anselmo Peñas (2)
  1. Google Research Zurich, Zurich, Switzerland
  2. NLP & IR Group, UNED, Madrid, Spain
