Measures for Quality Assessment of Articles and Infoboxes in Multilingual Wikipedia

  • Włodzimierz Lewoniewski
Conference paper
Part of the Lecture Notes in Business Information Processing book series (LNBIP, volume 339)

Abstract

Wikipedia is one of the most popular collaborative knowledge bases on the Internet. Articles of this free encyclopaedia are created and edited by users from different countries in about 300 languages. Depending on the topic and language version, the quality of the information may vary. This study presents and classifies measures that can be extracted from Wikipedia articles for the purpose of automatic quality assessment in different languages. Based on a state-of-the-art analysis and the author's own experiments, specific measures for various aspects of quality have been defined. Additionally, this work defines measures for assessing the quality of data contained in the structural parts of Wikipedia articles, the infoboxes. The study also describes extraction methods for the various sources of measures that can be used in quality assessment.

Keywords

Wikipedia, Data quality, Quality measures, DBpedia, Wikidata, Quality dimensions, Web 2.0, Encyclopedia

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Department of Information Systems, Poznań University of Economics and Business, Poznań, Poland
