Determining Quality of Articles in Polish Wikipedia Based on Linguistic Features

  • Włodzimierz Lewoniewski
  • Krzysztof Węcel
  • Witold Abramowicz
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 920)

Abstract

Wikipedia is the most popular and the largest user-generated source of knowledge on the Web. The quality of information in this encyclopedia is often questioned. Wikipedians have therefore developed an award system for high-quality articles that follow specific style guidelines. Nevertheless, more than 1.2 million articles in Polish Wikipedia remain unassessed. This paper considers over 100 linguistic features to determine the quality of Wikipedia articles in the Polish language. We evaluate our models on 500 000 articles of Polish Wikipedia. Additionally, we discuss the importance of linguistic features for quality prediction.

Keywords

Wikipedia; NLP; Data Quality; Quality Assessment; Random Forest; Polish; Linguistic Features; Linguistics; Data Mining

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Department of Information Systems, Poznań University of Economics and Business, Poznań, Poland
