A Survey of NLP Methods and Resources for Analyzing the Collaborative Writing Process in Wikipedia

  • Oliver Ferschke
  • Johannes Daxenberger
  • Iryna Gurevych
Chapter
Part of the Theory and Applications of Natural Language Processing book series (NLP)

Abstract

With the rise of Web 2.0, participatory and collaborative content production has largely replaced the traditional ways of information sharing and has created the novel genre of collaboratively constructed language resources. A vast untapped potential lies in the dynamic aspects of these resources, which cannot be unleashed with traditional methods designed for static corpora. In this chapter, we focus on Wikipedia as the most prominent instance of collaboratively constructed language resources. In particular, we discuss the significance of Wikipedia’s revision history for applications in Natural Language Processing (NLP) and the unique prospects of the user discussions, a new resource that has just begun to be mined. While the body of research on processing Wikipedia’s revision history is dominated by works that use the revision data as the basis for practical applications such as spelling correction or vandalism detection, most of the work focused on user discussions uses NLP for analyzing and understanding the data itself.

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Oliver Ferschke (1)
  • Johannes Daxenberger (1)
  • Iryna Gurevych (2)
  1. Ubiquitous Knowledge Processing Lab, Technische Universität Darmstadt, Darmstadt, Germany
  2. Ubiquitous Knowledge Processing Lab, Technische Universität Darmstadt, German Institute for Educational Research and Educational Information, Darmstadt, Germany
