Determining Quality of Articles in Polish Wikipedia Based on Linguistic Features

Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 920)


Wikipedia is the most popular and the largest user-generated source of knowledge on the Web. The quality of the information in this encyclopedia is often questioned. Wikipedians have therefore developed an award system for high-quality articles that follow specific style guidelines. Nevertheless, more than 1.2 million articles in Polish Wikipedia remain unassessed. This paper considers over 100 linguistic features to determine the quality of Wikipedia articles in the Polish language. We evaluate our models on 500,000 articles from Polish Wikipedia. Additionally, we discuss the importance of individual linguistic features for quality prediction.


Keywords: Wikipedia, NLP, Data Quality, Quality Assessment, Random Forest, Polish, Linguistic Features, Linguistics, Data Mining



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Department of Information Systems, Poznań University of Economics and Business, Poznań, Poland
