Abstract
With the rise of Web 2.0, participatory and collaborative content production has largely replaced the traditional ways of information sharing and has created the novel genre of collaboratively constructed language resources. A vast untapped potential lies in the dynamic aspects of these resources, which cannot be unleashed with traditional methods designed for static corpora. In this chapter, we focus on Wikipedia as the most prominent instance of collaboratively constructed language resources. In particular, we discuss the significance of Wikipedia’s revision history for applications in Natural Language Processing (NLP) and the unique prospects of the user discussions, a new resource that has just begun to be mined. While the body of research on processing Wikipedia’s revision history is dominated by works that use the revision data as the basis for practical applications such as spelling correction or vandalism detection, most of the work focused on user discussions uses NLP for analyzing and understanding the data itself.
Notes
- 5.
However, pages can be protected from editing by privileged users, as stated in the Wikipedia Protection Policy; see http://en.wikipedia.org/wiki/WP:Protection_policy.
- 8.
Freely accessible at http://code.google.com/p/dkpro-spelling-asl/.
- 10.
The Simple Wikipedia author Spencerk offers a list of transformation pairs: http://simple.wikipedia.org/w/index.php?title=User:Spencerk/list_of_straight-up_substitutables.
- 15.
WikiProject article quality grading scheme: http://en.wikipedia.org/wiki/Wikipedia:Version_1.0_Editorial_Team/Assessment.
- 16.
From http://en.wikipedia.org/w/index.php?title=Wikipedia:Vandalism&oldid=489137966. The same page also offers a list of frequent types of vandalism.
- 17.
Cf. a list of anti-vandalism bots compiled by the author Emijrp: http://en.wikipedia.org/w/index.php?title=User:Emijrp/Anti-vandalism_bot_census&oldid=482285684.
- 24.
According to [35], “[t]he sample was chosen to include a variety of controversial and non-controversial topics and span a spectrum from hard science to pop culture.”
- 26.
The corpus was split into a training set (67 %), a development set (17 %) and a test set (16 %).
- 27.
A troll is a participant in online discussions with the primary goal of posting disruptive, off-topic messages or provoking emotional responses.
- 28.
A compilation of these can be found at http://en.wikipedia.org/wiki/WP:WikiProject_User_scripts/Scripts.
References
Adler BT, Alfaro L, Mola-Velasco SM, Rosso P, West AG (2011) Wikipedia vandalism detection: combining natural language, metadata, and reputation features. In: Gelbukh A (ed) Computational linguistics and intelligent text processing. Lecture notes in computer science. Springer, Berlin, pp 277–288
Bender EM, Morgan JT, Oxley M, Zachry M, Hutchinson B, Marin A, Zhang B, Ostendorf M (2011) Annotating social acts: authority claims and alignment moves in Wikipedia talk pages. In: Proceedings of the workshop on language in social media, Portland, OR, USA, pp 48–57
Buriol LS, Castillo C, Donato D, Leonardi S, Millozzi S (2006) Temporal analysis of the Wikigraph. In: Proceedings of the 2006 IEEE/WIC/ACM international conference on web intelligence, Hong Kong, China, pp 45–51
Chin SC, Street WN, Srinivasan P, Eichmann D (2010) Detecting Wikipedia vandalism with active learning and statistical language models. In: Proceedings of the 4th workshop on information credibility, Hyderabad, India
Cusinato A, Della Mea V, Di Salvatore F, Mizzaro S (2009) QuWi: quality control in Wikipedia. In: Proceedings of the 3rd workshop on information credibility on the web. ACM, Madrid, pp 27–34
Dalip DH, Gonçalves MA, Cristo M, Calado P (2009) Automatic quality assessment of content created collaboratively by web communities. In: Proceedings of the joint international conference on digital libraries, Austin, TX, USA, pp 295–304
Emigh W, Herring SC (2005) Collaborative authoring on the web: a genre analysis of online encyclopedias. In: Proceedings of the 38th annual Hawaii international conference on system sciences, Waikoloa, Big Island, HI, USA
Ferschke O, Zesch T, Gurevych I (2011) Wikipedia revision toolkit: efficiently accessing Wikipedia’s edit history. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies. System demonstrations, Portland, OR
Ferschke O, Gurevych I, Chebotar Y (2012) Behind the article: recognizing dialog acts in Wikipedia talk pages. In: Proceedings of the 13th conference of the European chapter of the association for computational linguistics, Avignon, France
Giampiccolo D, Trang Dang H, Magnini B, Dagan I, Cabrio E, Dolan B (2007) The third PASCAL recognizing textual entailment challenge. In: Proceedings of the ACL-PASCAL workshop on textual entailment and paraphrasing, Prague, Czech Republic, pp 1–9
Han J, Wang C, Jiang D (2011) Probabilistic quality assessment based on article’s revision history. In: Proceedings of the 22nd international conference on database and expert systems applications, Toulouse, France, pp 574–588
Javanmardi S, McDonald DW, Lopes CV (2011) Vandalism detection in Wikipedia: a high-performing, feature-rich model and its reduction through lasso. In: Proceedings of the 7th international symposium on Wikis and open collaboration, Mountain View, CA, USA, pp 82–90
Kittur A, Suh B, Pendleton B, Chi EH (2007) He says, she says: conflict and coordination in Wikipedia. In: Proceedings of the SIGCHI conference on human factors in computing systems, San Jose, CA, USA, pp 453–462
Knight K, Marcu D (2000) Statistics-based summarization—step one: sentence compression. In: Proceedings of the seventeenth national conference on artificial intelligence and twelfth conference on innovative applications of artificial intelligence, Austin, TX, USA, pp 703–710
Laniado D, Tasso R, Kaltenbrunner A, Milano P, Volkovich Y (2011) When the Wikipedians talk: network and tree structure of Wikipedia discussion pages. In: Proceedings of the 5th international conference on weblogs and social media, Barcelona, Spain, pp 177–184
Marin A, Zhang B, Ostendorf M (2011) Detecting forum authority claims in online discussions. In: Proceedings of the workshop on language in social media, Portland, OR, USA, pp 39–47
Massa P (2011) Social networks of Wikipedia. In: Proceedings of the 22nd ACM conference on hypertext and hypermedia, Eindhoven, Netherlands, pp 221–230
Max A, Wisniewski G (2010) Mining naturally-occurring corrections and paraphrases from Wikipedia’s revision history. In: Proceedings of the 7th conference on international language resources and evaluation, Valletta, Malta
Medelyan O, Milne D, Legg C, Witten IH (2009) Mining meaning from Wikipedia. Int J Human Comput Stud 67(9):716–754
Milne D, Witten IH (2009) An open-source toolkit for mining Wikipedia. In: Proceedings of the New Zealand computer science research student conference, Auckland, New Zealand
Mizzaro S (2003) Quality control in scholarly publishing: a new proposal. J Am Soc Inf Sci Technol 54(11):989–1005
Nelken R, Shieber SM (2006) Towards robust context-sensitive sentence alignment for monolingual corpora. In: Proceedings of the 11th conference of the European chapter of the association for computational linguistics, Trento, Italy
Nelken R, Yamangil E (2008) Mining Wikipedia’s article revision history for training computational linguistics algorithms. In: Proceedings of the 1st AAAI workshop on Wikipedia and artificial intelligence, Chicago, IL, USA
Oxley M, Morgan JT, Hutchinson B (2010) “What I Know Is…”: establishing credibility on Wikipedia talk pages. In: Proceedings of the 6th international symposium on wikis and open collaboration, Gdańsk, Poland, pp 2–3
Posner IR, Baecker RM (1992) How people write together. In: Proceedings of the 25th Hawaii international conference on system sciences, Wailea, Maui, HI, USA, pp 127–138
Potthast M (2010) Crowdsourcing a Wikipedia vandalism corpus. In: Proceedings of the 33rd annual international ACM SIGIR conference on research and development on information retrieval, Geneva
Potthast M, Holfeld T (2011) Overview of the 2nd international competition on Wikipedia vandalism detection. In: Notebook papers of CLEF 2011 labs and workshops, Amsterdam, Netherlands
Potthast M, Stein B, Gerling R (2008) Automatic vandalism detection in Wikipedia. In: Proceedings of the 30th European conference on advances in information retrieval, Glasgow, Scotland, UK, pp 663–668
Schneider J, Passant A, Breslin JG (2010) A content analysis: how Wikipedia talk pages are used. In: Proceedings of the 2nd international conference of web science, Raleigh, NC, USA, pp 1–7
Schneider J, Passant A, Breslin JG (2011) Understanding and improving Wikipedia article discussion spaces. In: Proceedings of the 2011 ACM symposium on applied computing, Taichung, Taiwan, pp 808–813
Soto J (2009) Wikipedia: a quantitative analysis. Ph.D. thesis, Universidad Rey Juan Carlos, Madrid
Stvilia B, Twidale MB, Smith LC, Gasser L (2008) Information quality work organization in Wikipedia. J Am Soc Inf Sci Technol 59(6):983–1001
Thomas C, Sheth AP (2007) Semantic convergence of Wikipedia articles. In: Proceedings of the IEEE/WIC/ACM international conference on web intelligence, Washington, DC, USA, pp 600–606
Viégas FB, Wattenberg M, Dave K (2004) Studying cooperation and conflict between authors with history flow visualizations. In: Proceedings of the SIGCHI conference on human factors in computing systems, Vienna, Austria, pp 575–582
Viégas FB, Wattenberg M, Kriss J, Ham F (2007) Talk before you type: coordination in Wikipedia. In: Proceedings of the 40th annual Hawaii international conference on system sciences, Big Island, HI, USA, pp 78–78
Wang WY, McKeown KR (2010) Got you!: automatic vandalism detection in Wikipedia with web-based shallow syntactic-semantic modeling. In: Proceedings of the 23rd international conference on computational linguistics, Beijing, China, pp 1146–1154
Wilkinson DM, Huberman BA (2007) Cooperation and quality in Wikipedia. In: Proceedings of the 2007 international symposium on wikis, Montreal, Canada, pp 157–164
Woodsend K, Lapata M (2011) Learning to simplify sentences with quasi-synchronous grammar and integer programming. In: Proceedings of the conference on empirical methods in natural language processing, Edinburgh, Scotland, UK, pp 409–420
Yamangil E, Nelken R (2008) Mining Wikipedia revision histories for improving sentence compression. In: Proceedings of the 46th annual meeting of the association for computational linguistics: human language technologies. Short papers, association for computational linguistics, Columbus, OH, USA, pp 137–140
Yatskar M, Pang B, Danescu-Niculescu-Mizil C, Lee L (2010) For the sake of simplicity: unsupervised extraction of lexical simplifications from Wikipedia. In: Proceedings of the 2010 annual conference of the North American chapter of the association for computational linguistics, Los Angeles, CA, USA, pp 365–368
Zanzotto FM, Pennacchiotti M (2010) Expanding textual entailment corpora from Wikipedia using co-training. In: Proceedings of the 2nd COLING-workshop on the people’s web meets NLP: collaboratively constructed semantic resources, Beijing, China
Zeng H, Alhossaini MA, Ding L, Fikes R, McGuinness DL (2006) Computing trust from revision history. In: Proceedings of the 2006 international conference on privacy, security and trust, Markham, Ontario, Canada, pp 1–10
Zesch T (2012) Measuring contextual fitness using error contexts extracted from the Wikipedia revision history. In: Proceedings of the 13th conference of the European chapter of the association for computational linguistics, Avignon, France
Zesch T, Müller C, Gurevych I (2008) Extracting lexical semantic knowledge from Wikipedia and Wiktionary. In: Proceedings of the 6th international conference on language resources and evaluation, Marrakech, Morocco
Zhu Z, Bernhard D, Gurevych I (2010) A monolingual tree-based translation model for sentence simplification. In: Proceedings of the 23rd international conference on computational linguistics, Beijing, China, pp 1353–1361
Zobel J, Dart P (1996) Phonetic string matching: lessons from information retrieval. In: Proceedings of the 19th annual international ACM SIGIR conference on research and development in information retrieval, Zurich, Switzerland, pp 166–172
Acknowledgements
This work has been supported by the Volkswagen Foundation as part of the Lichtenberg-Professorship Program under grant No. I/82806, and by the Hessian research excellence program “Landes-Offensive zur Entwicklung Wissenschaftlich-ökonomischer Exzellenz” (LOEWE) as part of the research center “Digital Humanities”. We thank the anonymous reviewers for their valuable comments.
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Ferschke, O., Daxenberger, J., Gurevych, I. (2013). A Survey of NLP Methods and Resources for Analyzing the Collaborative Writing Process in Wikipedia. In: Gurevych, I., Kim, J. (eds) The People’s Web Meets NLP. Theory and Applications of Natural Language Processing. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35085-6_5
DOI: https://doi.org/10.1007/978-3-642-35085-6_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35084-9
Online ISBN: 978-3-642-35085-6
eBook Packages: Computer Science (R0)