A Survey of NLP Methods and Resources for Analyzing the Collaborative Writing Process in Wikipedia

The People’s Web Meets NLP

Abstract

With the rise of Web 2.0, participatory and collaborative content production have largely replaced the traditional ways of information sharing and have created the novel genre of collaboratively constructed language resources. A vast untapped potential lies in the dynamic aspects of these resources, which cannot be unleashed with traditional methods designed for static corpora. In this chapter, we focus on Wikipedia as the most prominent instance of collaboratively constructed language resources. In particular, we discuss the significance of Wikipedia’s revision history for applications in Natural Language Processing (NLP) and the unique prospects of the user discussions, a new resource that has just begun to be mined. While the body of research on processing Wikipedia’s revision history is dominated by works that use the revision data as the basis for practical applications such as spelling correction or vandalism detection, most of the work focused on user discussions uses NLP for analyzing and understanding the data itself.

Notes

  1. http://etherpad.org/

  2. https://docs.google.com

  3. https://writer.zoho.com

  4. http://www.sourcefabric.org/en/booktype/

  5. However, pages can be protected from editing by privileged users, as stated in the Wikipedia Protection Policy, see http://en.wikipedia.org/wiki/WP:Protection_policy.

  6. http://en.wikipedia.org/wiki/WP:UNDO

  7. http://eggcorns.lascribe.net/

  8. Freely accessible at http://code.google.com/p/dkpro-spelling-asl/.

  9. See http://wicopaco.limsi.fr/.

  10. The Simple Wikipedia author Spencerk offers a list of transformation pairs: http://simple.wikipedia.org/w/index.php?title=User:Spencerk/list_of_straight-up_substitutables.

  11. See http://www.cs.cornell.edu/home/llee/data/simple/.

  12. See http://art.uniroma2.it/zanzotto/.

  13. http://en.wikipedia.org/wiki/WP:FA_Criteria

  14. http://en.wikipedia.org/wiki/WP:Manual_of_Style

  15. WikiProject article quality grading scheme: http://en.wikipedia.org/wiki/Wikipedia:Version_1.0_Editorial_Team/Assessment.

  16. From http://en.wikipedia.org/w/index.php?title=Wikipedia:Vandalism&oldid=489137966. The same page also offers a list of frequent types of vandalism.

  17. Cf. a list of anti-vandalism bots compiled by the author Emijrp: http://en.wikipedia.org/w/index.php?title=User:Emijrp/Anti-vandalism_bot_census&oldid=482285684.

  18. See http://www.webis.de/research/corpora/pan-wvc-10 and http://www.uni-weimar.de/cms/medien/webis/research/corpora/pan-wvc-11.html.

  19. http://en.wikipedia.org/wiki/WP:SIGNATURE

  20. http://www.mediawiki.org/wiki/Extension:LiquidThreads

  21. http://www.mediawiki.org/wiki/Visual_editor

  22. http://en.wikipedia.org/wiki/WP:ARCHIVE

  23. http://en.wikipedia.org/wiki/WP:TTALK

  24. According to [35], “[t]he sample was chosen to include a variety of controversial and non-controversial topics and span a spectrum from hard science to pop culture.”

  25. http://hadoop.apache.org/

  26. The corpus was split into a training set (67 %), a development set (17 %) and a test set (16 %).

  27. A troll is a participant in online discussions with the primary goal of posting disruptive, off-topic messages or provoking emotional responses.

  28. A compilation of these can be found under http://en.wikipedia.org/wiki/WP:WikiProject_User_scripts/Scripts.

  29. http://www.mediawiki.org/wiki/API

  30. http://en.wikipedia.org/w/api.php

  31. http://www.mediawiki.org/wiki/API:Client_code

  32. http://toolserver.org/

  33. http://dumps.wikimedia.org/

  34. http://meta.wikimedia.org/wiki/Data_dumps#Tools

References

  1. Adler BT, Alfaro L, Mola-Velasco SM, Rosso P, West AG (2011) Wikipedia vandalism detection: combining natural language, metadata, and reputation features. In: Gelbukh A (ed) Computational linguistics and intelligent text processing. Lecture notes in computer science. Springer, Berlin, pp 277–288

  2. Bender EM, Morgan JT, Oxley M, Zachry M, Hutchinson B, Marin A, Zhang B, Ostendorf M (2011) Annotating social acts: authority claims and alignment moves in Wikipedia talk pages. In: Proceedings of the workshop on language in social media, Portland, OR, USA, pp 48–57

  3. Buriol LS, Castillo C, Donato D, Leonardi S, Millozzi S (2006) Temporal analysis of the Wikigraph. In: Proceedings of the 2006 IEEE/WIC/ACM international conference on web intelligence, Hong Kong, China, pp 45–51

  4. Chin SC, Street WN, Srinivasan P, Eichmann D (2010) Detecting Wikipedia vandalism with active learning and statistical language models. In: Proceedings of the 4th workshop on information credibility, Hyderabad, India

  5. Cusinato A, Della Mea V, Di Salvatore F, Mizzaro S (2009) QuWi: quality control in Wikipedia. In: Proceedings of the 3rd workshop on information credibility on the web. ACM, Madrid, pp 27–34

  6. Dalip DH, Gonçalves MA, Cristo M, Calado P (2009) Automatic quality assessment of content created collaboratively by web communities. In: Proceedings of the joint international conference on digital libraries, Austin, TX, USA, pp 295–304

  7. Emigh W, Herring SC (2005) Collaborative authoring on the web: a genre analysis of online encyclopedias. In: Proceedings of the 38th annual Hawaii international conference on system sciences, Waikoloa, Big Island, HI, USA

  8. Ferschke O, Zesch T, Gurevych I (2011) Wikipedia revision toolkit: efficiently accessing Wikipedia’s edit history. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies. System demonstrations, Portland, OR

  9. Ferschke O, Gurevych I, Chebotar Y (2012) Behind the article: recognizing dialog acts in Wikipedia talk pages. In: Proceedings of the 13th conference of the European chapter of the association for computational linguistics, Avignon, France

  10. Giampiccolo D, Trang Dang H, Magnini B, Dagan I, Cabrio E, Dolan B (2007) The third PASCAL recognizing textual entailment challenge. In: Proceedings of the ACL-PASCAL workshop on textual entailment and paraphrasing, Prague, Czech Republic, pp 1–9

  11. Han J, Wang C, Jiang D (2011) Probabilistic quality assessment based on article’s revision history. In: Proceedings of the 22nd international conference on database and expert systems applications, Toulouse, France, pp 574–588

  12. Javanmardi S, McDonald DW, Lopes CV (2011) Vandalism detection in Wikipedia: a high-performing, feature-rich model and its reduction through lasso. In: Proceedings of the 7th international symposium on Wikis and open collaboration, Mountain View, CA, USA, pp 82–90

  13. Kittur A, Suh B, Pendleton B, Chi EH (2007) He says, she says: conflict and coordination in Wikipedia. In: Proceedings of the SIGCHI conference on human factors in computing systems, San Jose, CA, USA, pp 453–462

  14. Knight K, Marcu D (2000) Statistics-based summarization—step one: sentence compression. In: Proceedings of the seventeenth national conference on artificial intelligence and twelfth conference on innovative applications of artificial intelligence, Austin, TX, USA, pp 703–710

  15. Laniado D, Tasso R, Kaltenbrunner A, Milano P, Volkovich Y (2011) When the Wikipedians talk: network and tree structure of Wikipedia discussion pages. In: Proceedings of the 5th international conference on weblogs and social media, Barcelona, Spain, pp 177–184

  16. Marin A, Zhang B, Ostendorf M (2011) Detecting forum authority claims in online discussions. In: Proceedings of the workshop on languages in social media, Portland, OR, USA, pp 39–47

  17. Massa P (2011) Social networks of Wikipedia. In: Proceedings of the 22nd ACM conference on hypertext and hypermedia, Eindhoven, Netherlands, pp 221–230

  18. Max A, Wisniewski G (2010) Mining naturally-occurring corrections and paraphrases from Wikipedia’s revision history. In: Proceedings of the 7th conference on international language resources and evaluation, Valletta, Malta

  19. Medelyan O, Milne D, Legg C, Witten IH (2009) Mining meaning from Wikipedia. Int J Human Comput Stud 67(9):716–754

  20. Milne D, Witten IH (2009) An open-source toolkit for mining Wikipedia. In: Proceedings of the New Zealand computer science research student conference, Auckland, New Zealand

  21. Mizzaro S (2003) Quality control in scholarly publishing: a new proposal. J Am Soc Inf Sci Technol 54(11):989–1005

  22. Nelken R, Shieber SM (2006) Towards robust context-sensitive sentence alignment for monolingual corpora. In: Proceedings of the 11th conference of the European chapter of the association for computational linguistics, Trento, Italy

  23. Nelken R, Yamangil E (2008) Mining Wikipedia’s article revision history for training computational linguistics algorithms. In: Proceedings of the 1st AAAI workshop on Wikipedia and artificial intelligence, Chicago, IL, USA

  24. Oxley M, Morgan JT, Hutchinson B (2010) “What I Know Is”: establishing credibility on Wikipedia talk pages. In: Proceedings of the 6th international symposium on wikis and open collaboration, Gdańsk, Poland, pp 2–3

  25. Posner IR, Baecker RM (1992) How people write together. In: Proceedings of the 25th Hawaii international conference on system sciences, Wailea, Maui, HI, USA, pp 127–138

  26. Potthast M (2010) Crowdsourcing a Wikipedia vandalism corpus. In: Proceedings of the 33rd annual international ACM SIGIR conference on research and development on information retrieval, Geneva

  27. Potthast M, Holfeld T (2011) Overview of the 2nd international competition on Wikipedia vandalism detection. In: Notebook papers of CLEF 2011 labs and workshops, Amsterdam, Netherlands

  28. Potthast M, Stein B, Gerling R (2008) Automatic vandalism detection in Wikipedia. In: Proceedings of the 30th European conference on advances in information retrieval, Glasgow, Scotland, UK, pp 663–668

  29. Schneider J, Passant A, Breslin JG (2010) A content analysis: how Wikipedia talk pages are used. In: Proceedings of the 2nd international conference of web science, Raleigh, NC, USA, pp 1–7

  30. Schneider J, Passant A, Breslin JG (2011) Understanding and improving Wikipedia article discussion spaces. In: Proceedings of the 2011 ACM symposium on applied computing, Taichung, Taiwan, pp 808–813

  31. Soto J (2009) Wikipedia: a quantitative analysis. Ph.D. thesis, Universidad Rey Juan Carlos, Madrid

  32. Stvilia B, Twidale MB, Smith LC, Gasser L (2008) Information quality work organization in Wikipedia. J Am Soc Inf Sci Technol 59(6):983–1001

  33. Thomas C, Sheth AP (2007) Semantic convergence of Wikipedia articles. In: Proceedings of the IEEE/WIC/ACM international conference on web intelligence, Washington, DC, USA, pp 600–606

  34. Viégas FB, Wattenberg M, Dave K (2004) Studying cooperation and conflict between authors with history flow visualizations. In: Proceedings of the SIGCHI conference on human factors in computing systems, Vienna, Austria, pp 575–582

  35. Viégas FB, Wattenberg M, Kriss J, Ham F (2007) Talk before you type: coordination in Wikipedia. In: Proceedings of the 40th annual Hawaii international conference on system sciences, Big Island, HI, USA, pp 78–78

  36. Wang WY, McKeown KR (2010) Got you!: automatic vandalism detection in Wikipedia with web-based shallow syntactic-semantic modeling. In: Proceedings of the 23rd international conference on computational linguistics, Beijing, China, pp 1146–1154

  37. Wilkinson DM, Huberman BA (2007) Cooperation and quality in Wikipedia. In: Proceedings of the 2007 international symposium on wikis, Montreal, Canada, pp 157–164

  38. Woodsend K, Lapata M (2011) Learning to simplify sentences with quasi-synchronous grammar and integer programming. In: Proceedings of the conference on empirical methods in natural language processing, Edinburgh, Scotland, UK, pp 409–420

  39. Yamangil E, Nelken R (2008) Mining Wikipedia revision histories for improving sentence compression. In: Proceedings of the 46th annual meeting of the association for computational linguistics: human language technologies. Short papers, association for computational linguistics, Columbus, OH, USA, pp 137–140

  40. Yatskar M, Pang B, Danescu-Niculescu-Mizil C, Lee L (2010) For the sake of simplicity: unsupervised extraction of lexical simplifications from Wikipedia. In: Proceedings of the 2010 annual conference of the North American chapter of the association for computational linguistics, Los Angeles, CA, USA, pp 365–368

  41. Zanzotto FM, Pennacchiotti M (2010) Expanding textual entailment corpora from Wikipedia using co-training. In: Proceedings of the 2nd COLING-workshop on the people’s web meets NLP: collaboratively constructed semantic resources, Beijing, China

  42. Zeng H, Alhossaini MA, Ding L, Fikes R, McGuinness DL (2006) Computing trust from revision history. In: Proceedings of the 2006 international conference on privacy, security and trust, Markham, Ontario, Canada, pp 1–10

  43. Zesch T (2012) Measuring contextual fitness using error contexts extracted from the Wikipedia revision history. In: Proceedings of the 13th conference of the European chapter of the association for computational linguistics, Avignon, France

  44. Zesch T, Müller C, Gurevych I (2008) Extracting lexical semantic knowledge from Wikipedia and Wiktionary. In: Proceedings of the 6th international conference on language resources and evaluation, Marrakech, Morocco

  45. Zhu Z, Bernhard D, Gurevych I (2010) A monolingual tree-based translation model for sentence simplification. In: Proceedings of the 23rd international conference on computational linguistics, Beijing, China, pp 1353–1361

  46. Zobel J, Dart P (1996) Phonetic string matching: lessons from information retrieval. In: Proceedings of the 19th annual international ACM SIGIR conference on research and development in information retrieval, Zurich, Switzerland, pp 166–172

Acknowledgements

This work has been supported by the Volkswagen Foundation as part of the Lichtenberg-Professorship Program under grant No. I/82806, and by the Hessian research excellence program “Landes-Offensive zur Entwicklung Wissenschaftlich-ökonomischer Exzellenz” (LOEWE) as part of the research center “Digital Humanities”. We thank the anonymous reviewers for their valuable comments.

Author information

Corresponding author

Correspondence to Oliver Ferschke.

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Ferschke, O., Daxenberger, J., Gurevych, I. (2013). A Survey of NLP Methods and Resources for Analyzing the Collaborative Writing Process in Wikipedia. In: Gurevych, I., Kim, J. (eds) The People’s Web Meets NLP. Theory and Applications of Natural Language Processing. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35085-6_5

  • DOI: https://doi.org/10.1007/978-3-642-35085-6_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-35084-9

  • Online ISBN: 978-3-642-35085-6

  • eBook Packages: Computer Science, Computer Science (R0)
